fix(runtime): add subprocess timeout config for claude-code driver #2
Merged
benhoverter merged 2 commits into local-main on Apr 28, 2026
The claude-code driver hardcoded its per-message turn timeout inside
ClaudeCodeDriver and exposed no operator-facing knob, so long-running
CC subprocess turns (large prompt-caches, deep tool chains) hit the
internal default with no escape hatch. Adds a public config surface,
honored today only by the claude-code driver, designed so future
subprocess drivers can opt in without re-shaping the API.
Public surface
- DriverConfig.subprocess_timeout_secs: Option<u64> (llm_driver.rs)
- OPENFANG_SUBPROCESS_TIMEOUT_SECS env var (drivers/mod.rs)
- Precedence in create_driver(): env var > config field > driver default
Naming rationale
- Field/env are scope-flavored, not semantic, on purpose: the name
telegraphs that HTTP providers (default/Anthropic, openai, bedrock,
qwen-code) accept-but-silently-ignore the field today. A semantic
name (message_timeout_secs) would have invited the same silent-no-op
footgun on those providers.
- Driver-internal field in claude_code.rs intentionally kept as
message_timeout_secs — it's not on the public boundary and the
semantic name accurately describes what it stores.
Tests (drivers/mod.rs)
- default_when_unset: no env, no config -> driver default
- config_set: config field flows through
- env_overrides_config: env var wins over config (construction-only
assertion; trait-object opacity prevents reading the value back)
- malformed_env_falls_through: unparseable env silently falls through
to config, matching the .parse::<u64>().ok() chain in production
- All four tests scrub OPENFANG_SUBPROCESS_TIMEOUT_SECS pre/post to
avoid cross-test pollution
Mechanical pass-throughs
- 12 x DriverConfig { .. } test fixtures in drivers/mod.rs gain
subprocess_timeout_secs: None
- routes.rs (1), kernel.rs (6), agent_loop.rs (2): same pass-through
fills in DriverConfig literals; no logic touched
Forward-compat note
- A NOTE block in drivers/mod.rs flags the scope-vs-implementation
gap so the next contributor adding a subprocess driver knows
exactly where to wire the config in.
Validated end-to-end against a live daemon: dry-run + full deploy
(deploy-local.sh, all 7 phases) + post-swap agent_send round-trip
through the claude-code dispatch path.
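
The precedence rule from the commit message (env var > config field > driver default) can be sketched as a pure function. This is a minimal illustration, not the code in `create_driver()`: the function name `resolve_timeout_secs` and the default of 300 used in the usage note are hypothetical, and only the `.parse::<u64>().ok()` chain is taken from the description above.

```rust
/// Resolve the effective subprocess timeout in seconds.
/// Precedence: env var > config field > driver default.
/// Hypothetical helper; the real logic lives inline in `create_driver()`.
fn resolve_timeout_secs(
    env_value: Option<&str>,
    config_value: Option<u64>,
    driver_default: u64,
) -> u64 {
    env_value
        // An unparseable env value becomes None here, so it silently
        // falls through to the config field rather than raising an error.
        .and_then(|raw| raw.parse::<u64>().ok())
        // The config field is consulted only when the env var is unusable.
        .or(config_value)
        // With neither source present, the driver default applies.
        .unwrap_or(driver_default)
}
```

Under these assumptions, `resolve_timeout_secs(Some("600"), Some(120), 300)` yields 600, and a malformed value such as `Some("ten minutes")` yields the config value 120.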
Follow-up to 79aa34c. The previous commit added the public surface
(DriverConfig field + OPENFANG_SUBPROCESS_TIMEOUT_SECS env var) but left
every DriverConfig construction site hardcoded to None, so the struct
field was wired but had no on-disk source feeding it. The env var was
the only operator-facing knob. This commit plumbs the missing layer:
the timeout is now deserializable from config.toml on both the primary
and global-fallback providers.
Public surface
- DefaultModelConfig.subprocess_timeout_secs: Option<u64>
- FallbackProviderConfig.subprocess_timeout_secs: Option<u64>
- Both fields are #[serde(default)]: existing config.toml files without
the field deserialize cleanly to None (no breaking change).
Placement rationale
- Per-provider on each config struct, not a top-level field or a new
[driver] section. This matches the existing per-provider config shape
and lets operators set different timeouts for primary vs. fallback
(e.g. tighter timeout on a fast fallback to fail over sooner). If a
second driver-level setting ever lands, refactoring two struct fields
into a [driver] section is cheap; we don't pre-pay for it now.
Wiring (kernel.rs)
- L663 primary driver: pulls
config.default_model.subprocess_timeout_secs
- L687 auto-detect path: inherits default_model intent (the swap is
replacing the *provider*, not the timeout policy)
- L736 global fallback loop: pulls fb.subprocess_timeout_secs
- L5031 agent primary: inherits effective_default's value when
agent_provider == default_provider; None for cross-provider overrides
- L5108 agent manifest fallback: inherits dm's value when the manifest
fallback resolves to "default" (matching the existing fb.provider
sentinel logic); None for explicit cross-provider entries
- L5139 global fallback (per-agent loop): pulls
fb.subprocess_timeout_secs
Sites kept as None (intentional)
- agent_loop.rs:1146, 1330: ModelNotFound recovery iterates over the
agent manifest's fallback_models (FallbackModel, not the config-toml
type), so no per-provider config is in scope.
- routes.rs:7701: provider connectivity test endpoint; no config source.
- routes.rs:7529: dashboard hot-update path constructs a fresh DM with
defaults (None); operator sets timeout via config.toml, not via the
set-key flow.
Tests
- test_subprocess_timeout_secs_in_toml: round-trips a TOML doc with
default_model.subprocess_timeout_secs = 600 and one fallback at 180,
one fallback omitted; asserts each value (or None) reaches the parsed
config struct.
- test_subprocess_timeout_secs_omitted_defaults_to_none: asserts a
legacy-shaped config.toml (no timeout fields) parses cleanly with both
fields = None (backward-compat guard).
- 4 existing claude_code driver timeout tests still pass.
Mechanical pass-throughs
- 8 test fixtures across openfang-kernel/tests and openfang-api/tests
gain subprocess_timeout_secs: None on their DefaultModelConfig literals.
- 1 production literal in routes.rs gains the same field.
- The existing FallbackProviderConfig serde-roundtrip test gains
subprocess_timeout_secs: None plus an assertion.
Precedence comment in drivers/mod.rs::create_driver updated to reflect
that the config-field path is now real, with explicit pointers to the
kernel.rs wiring sites for future contributors.
Validated: cargo check --workspace --tests is clean; openfang-types
(362), openfang-runtime (933), and openfang-kernel (260) lib tests all
pass.
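
The two inheritance rules in the kernel.rs wiring list (agent primary and agent manifest fallback) reduce to a small conditional. The sketch below is illustrative only: both function names and their signatures are hypothetical, and the `"default"` sentinel is taken from the commit message's description of the existing `fb.provider` logic rather than from the source itself.

```rust
/// Timeout for an agent's primary driver: inherit the default model's
/// value only when the agent uses the same provider as the default.
/// Hypothetical helper, not a function in kernel.rs.
fn agent_primary_timeout(
    agent_provider: &str,
    default_provider: &str,
    default_timeout: Option<u64>,
) -> Option<u64> {
    if agent_provider == default_provider {
        default_timeout
    } else {
        // Cross-provider override: the default model's policy does not apply.
        None
    }
}

/// Timeout for an agent manifest fallback: inherit only when the entry
/// resolves to the "default" sentinel; explicit providers get None.
fn manifest_fallback_timeout(
    fallback_provider: &str,
    default_timeout: Option<u64>,
) -> Option<u64> {
    if fallback_provider == "default" {
        default_timeout
    } else {
        None
    }
}
```

The design choice this illustrates is that a timeout set for one provider never leaks onto a different provider; a `None` result simply lets the driver default apply.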
benhoverter added a commit that referenced this pull request on Apr 30, 2026
A typo in any binding's match_rule no longer drops the entire bindings
table. Each entry is parsed independently; malformed entries log an
ERROR with index, agent name, and the underlying serde error, then are
skipped. A single WARN summarizes total dropped vs. surviving bindings.
Per-entry deny_unknown_fields is preserved so silent typos still fail
loudly — just no longer catastrophically.
Before this change, a single misspelled field anywhere in [[bindings]]
caused the whole table to fail parsing, silently unbinding every
agent — the worst possible failure mode for a routing config.
- New `lenient_extract_bindings` runs after include-merge / [api]
migration, before `try_into::<KernelConfig>()`.
- 7 new config tests cover the reproducer, happy path, all-malformed,
no-bindings, missing-agent, survivor-order preservation, and
top-level field typos:
* test_lenient_bindings_drops_typo_keeps_rest
* test_lenient_bindings_all_valid_unchanged
* test_lenient_bindings_all_malformed_yields_empty_but_keeps_rest_of_config
* test_lenient_bindings_no_bindings_section_is_noop
* test_lenient_bindings_missing_agent_field_dropped
* test_lenient_bindings_preserves_survivor_order — locks in that
first-match-wins routing semantics cannot silently regress when
a middle entry is dropped
* test_lenient_bindings_top_level_field_typo_dropped — locks in
that deny_unknown_fields catches operator typos at the binding
top level (e.g. `agnt = ...`), not just inside match_rule
- TODO marker added on the remaining `warn!` fallback in `load_config`
for the non-binding silent-default path (follow-up work).
Tested live: typo'd `hannel` field on binding #2 logged as expected;
remaining 5 bindings loaded and routed correctly.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
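
The per-entry strategy described in this commit (parse each binding independently, log and skip the malformed ones, keep survivors in order) can be sketched without serde. Everything here is a simplified assumption: the `Binding` struct, its `agent` and `channel` fields, and the function names are hypothetical stand-ins, and the hand-written unknown-key check only imitates what `deny_unknown_fields` does in the real parser.

```rust
/// Hypothetical, simplified binding; the project's real type is parsed by serde.
#[derive(Debug, PartialEq)]
struct Binding {
    agent: String,
    channel: Option<String>,
}

/// Parse one entry strictly: any unknown key is an error, which mirrors
/// the effect of a per-entry `deny_unknown_fields`.
fn parse_binding(fields: &[(&str, &str)]) -> Result<Binding, String> {
    let mut agent = None;
    let mut channel = None;
    for (key, value) in fields {
        match *key {
            "agent" => agent = Some(value.to_string()),
            "channel" => channel = Some(value.to_string()),
            other => return Err(format!("unknown field `{other}`")),
        }
    }
    let agent = agent.ok_or_else(|| "missing field `agent`".to_string())?;
    Ok(Binding { agent, channel })
}

/// Parse every entry independently. Malformed entries are logged and
/// skipped; survivors keep their original relative order.
fn lenient_extract(entries: &[Vec<(&str, &str)>]) -> (Vec<Binding>, usize) {
    let mut kept = Vec::new();
    let mut dropped = 0;
    for (index, fields) in entries.iter().enumerate() {
        match parse_binding(fields) {
            Ok(binding) => kept.push(binding),
            Err(err) => {
                eprintln!("ERROR: binding #{index} skipped: {err}");
                dropped += 1;
            }
        }
    }
    if dropped > 0 {
        eprintln!("WARN: dropped {dropped} binding(s), {} loaded", kept.len());
    }
    (kept, dropped)
}
```

Because survivors are pushed in iteration order, dropping a middle entry cannot reorder the rest, which is the property the survivor-order test is meant to protect for first-match-wins routing.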
Summary
The `claude-code` driver hardcoded its per-message turn timeout inside `ClaudeCodeDriver` and exposed no operator-facing knob, so long-running CC subprocess turns (large prompt-caches, deep tool chains) hit the internal default with no escape hatch. This PR adds a complete config surface (env var + `config.toml` field), honored today only by the claude-code driver and designed so future subprocess drivers can opt in without re-shaping the API.
The work lands as two commits:
- `79aa34c` introduces the public surface: `DriverConfig.subprocess_timeout_secs`, `OPENFANG_SUBPROCESS_TIMEOUT_SECS`, and the precedence chain in `create_driver()`. At this commit, every construction site still passed `None`, so the config field was wired but unfed by on-disk config.
- `b1c4061` plumbs the field through to `config.toml`: deserializable on `DefaultModelConfig` and `FallbackProviderConfig`, with the six `kernel.rs` construction sites pulling the loaded value through (the `routes.rs` and `agent_loop.rs` sites stay `None`, as detailed under Changes). Validated end-to-end against a live daemon.
Together they fully close the gap: operators can set the timeout statically in `config.toml`, override transiently via env var, or fall through to the driver default. There are three sources, and the highest-precedence one that is set wins.
Fixes RightNow-AI#1128
Changes
Commit 1 (`79aa34c`): public surface
Public surface (`crates/openfang-runtime/src/llm_driver.rs`, `drivers/mod.rs`)
- `DriverConfig.subprocess_timeout_secs: Option<u64>` with a six-paragraph doc comment (semantic + scope + provider in/out list + forward-compat note).
- Env var: `OPENFANG_SUBPROCESS_TIMEOUT_SECS`.
- Precedence in `create_driver()`: env var > config field > driver default. Nine-line precedence comment + a NOTE block flagging the scope-vs-implementation gap so the next subprocess-driver contributor knows where to wire in.
Naming rationale (deliberate, documented)
- Field/env are scope-flavored (`subprocess_*`), not semantic (`message_*`), on purpose. It telegraphs that HTTP providers (default/`anthropic`, `openai`, `bedrock`, `qwen-code`) accept-but-silently-ignore the field today. A semantic name would have invited the same silent-no-op footgun on those providers.
- Driver-internal field in `claude_code.rs` intentionally kept as `message_timeout_secs`: it's not on the public boundary and the semantic name accurately describes what the driver stores. Public-boundary = scope-flavored, internals = semantic.
Tests (`drivers/mod.rs`, four new unit tests covering all precedence branches)
- `default_when_unset`: no env, no config → driver default
- `config_set`: config field flows through
- `env_overrides_config`: env var wins over config (construction-only assertion; trait-object opacity prevents reading the value back from `LlmDriver`)
- `malformed_env_falls_through`: unparseable env silently falls through to config, matching the `.parse::<u64>().ok()` chain in production
- All four tests scrub `OPENFANG_SUBPROCESS_TIMEOUT_SECS` pre/post to avoid cross-test pollution.
Mechanical pass-throughs (no logic touched)
- 12 x `DriverConfig { .. }` test fixtures in `drivers/mod.rs` gain `subprocess_timeout_secs: None`.
- `routes.rs` (1), `kernel.rs` (6), `agent_loop.rs` (2): same pass-through fills in `DriverConfig` literals.
Diffstat: 5 files, +135 / -4. The +135 is dominated by the four new tests (~73 lines) and the doc comment (~17 lines); real logic change is the ~13-line precedence block in `create_driver()`.
Commit 2 (`b1c4061`): config.toml plumbing
Deserializable fields (`crates/openfang-types/src/config.rs`)
- `DefaultModelConfig.subprocess_timeout_secs: Option<u64>` with `#[serde(default)]`.
- `FallbackProviderConfig.subprocess_timeout_secs: Option<u64>` with `#[serde(default)]`.
Construction-site wiring (replacing `subprocess_timeout_secs: None` placeholders left behind by commit 1)
- `kernel.rs`: 6 sites (`663`, `687`, `736`, `5031`, `5108`, `5139`).
Sites carried forward as `None` (kept-as-None, not wired)
- `routes.rs`: 2 sites, `~7531` (`set_provider_key` dashboard hot-update path) and `7701` (provider connectivity test endpoint). Both are mechanical `subprocess_timeout_secs: None` pass-throughs, not config-fed wiring.
- `agent_loop.rs`: 2 sites (`1146`, `1330`) intentionally left as `None`: those iterate over the manifest fallback type, not the config-toml type. Wiring them would require widening the manifest path, which is out of scope. (Added in commit 1; carried forward unchanged by commit 2.) Documented inline.
Tests (2 new)
- TOML round-trip: `subprocess_timeout_secs` survives serialize/deserialize.
- Backward compat: legacy `config.toml` files without the field still deserialize cleanly (the `#[serde(default)]` path).
Precedence comment update in `drivers/mod.rs::create_driver`: now reflects that the config-field path is real, not aspirational.
Mechanical pass-throughs: 8 fixture updates across `openfang-api/tests` and `openfang-kernel/tests` add the new field as `None` to existing config literals.
Diffstat: 12 files, +112 / -7.
Testing
- `cargo clippy --workspace --all-targets -- -D warnings` passes (Phase 1 of `deploy-local.sh`)
- `cargo test --workspace` passes (commit body claims 362 + 933 + 260 lib tests green; precedence tests + TOML round-trip tests included)
- Full `deploy-local.sh` run (all 7 phases) against the local daemon: build → backup → binary swap → `launchctl kickstart` → 10s log-tail → post-swap `agent_send` round-trip through the patched dispatch path. Round-trip clean; daemon stable.
- With `subprocess_timeout_secs = 600` set under `[default_model]`, subsequent subprocess turns honor the new ceiling.
Security
- Env var is parsed via `.parse::<u64>().ok()`, so malformed input silently falls through (test coverage for this branch); config field is `Option<u64>` deserialized from already-trusted TOML.
Notes for reviewers
A few decision points worth flagging up front:
- If reviewers prefer the semantic name (`message_timeout_secs`), the rename is mechanical (~12 occurrences across test fixtures + the Debug impl) but reintroduces the silent-no-op footgun on HTTP providers that the current name explicitly avoids.
- The timeout fields are per-provider on `DefaultModelConfig` and `FallbackProviderConfig` (option A). The alternative, a single `[runtime]` or top-level field, is cleaner if you never want different timeouts on primary vs. fallback. Open to flipping if the team prefers; the wiring sites would shrink.
- Kept-as-`None` sites. Three categories of construction sites carry `subprocess_timeout_secs: None` rather than a config-fed value: `agent_loop.rs:1146` and `1330` (manifest fallback type, which doesn't carry config-toml values; widening the manifest path is out of scope), `routes.rs:~7531` (`set_provider_key` hot-update), and `routes.rs:7701` (provider connectivity test). All flagged inline; reviewers can scan in one place here.
- The `env_overrides_config` test asserts construction succeeds, not that 600 wins over 120 numerically. `LlmDriver` is a trait object; reading the timeout back would require either a getter on the trait or a downcast. Current judgment: precedence is two lines of pure data flow (`.or()`), so construction test + code reading is sufficient. Open to adding a trait getter if reviewers prefer.
- The NOTE block in `drivers/mod.rs` explicitly flags the scope-vs-implementation gap (config exists on the public surface; only one driver honors it today). This is intentional documentation, not a TODO: the next subprocess-driver contributor should read it before wiring their driver in.
- A log line could be added (in `create_driver`) if reviewers want explicit observability ("Driver subprocess timeout: 600s [source: config.toml]"); deliberately not added in either commit to keep diffs tight.
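
For reviewers weighing the trait-getter option mentioned in the notes above, here is a rough sketch of what it could look like. None of it is from the PR: the `Driver` trait, both driver structs, `build_driver`, and the 300-second default are hypothetical stand-ins for `LlmDriver`, the real drivers, and `create_driver()`.

```rust
/// Hypothetical stand-in for the project's driver trait.
trait Driver {
    /// Default implementation: drivers that ignore the setting report None.
    fn subprocess_timeout_secs(&self) -> Option<u64> {
        None
    }
}

/// Stand-in for a subprocess driver that stores the resolved timeout.
struct SubprocessDriver {
    message_timeout_secs: u64,
}

/// Stand-in for an HTTP provider that accepts but ignores the field.
struct HttpDriver;

impl Driver for SubprocessDriver {
    fn subprocess_timeout_secs(&self) -> Option<u64> {
        Some(self.message_timeout_secs)
    }
}

impl Driver for HttpDriver {}

/// Build a boxed driver, resolving env > config > default for subprocess drivers.
fn build_driver(
    provider: &str,
    env_value: Option<u64>,
    config_value: Option<u64>,
) -> Box<dyn Driver> {
    const DEFAULT_SECS: u64 = 300; // hypothetical driver default
    match provider {
        "claude-code" => Box::new(SubprocessDriver {
            message_timeout_secs: env_value.or(config_value).unwrap_or(DEFAULT_SECS),
        }),
        // HTTP providers take the same inputs but do not use them.
        _ => Box::new(HttpDriver),
    }
}
```

With a getter like this, a precedence test could assert the numeric outcome through the trait object (for example, that 600 from the env beats 120 from config) at the cost of widening the trait for every driver.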