Skip to content

fix(runtime): add subprocess timeout config for claude-code driver#2

Merged
benhoverter merged 2 commits intolocal-mainfrom
fix/message-timeout-config
Apr 28, 2026
Merged

fix(runtime): add subprocess timeout config for claude-code driver#2
benhoverter merged 2 commits intolocal-mainfrom
fix/message-timeout-config

Conversation

@benhoverter
Copy link
Copy Markdown
Owner

Summary

The claude-code driver hardcoded its per-message turn timeout inside ClaudeCodeDriver and exposed no operator-facing knob, so long-running CC subprocess turns (large prompt-caches, deep tool chains) hit the internal default with no escape hatch. This PR adds a complete config surface — env var + config.toml field — honored today only by the claude-code driver, designed so future subprocess drivers can opt in without re-shaping the API.

The work lands as two commits:

  1. 79aa34c introduces the public surface: DriverConfig.subprocess_timeout_secs, OPENFANG_SUBPROCESS_TIMEOUT_SECS, and the precedence chain in create_driver(). At this commit, every construction site still passed None, so the config field was wired but unfed by on-disk config.
  2. b1c4061 plumbs the field through to config.toml: deserializable on DefaultModelConfig and FallbackProviderConfig, with all relevant kernel.rs / routes.rs construction sites pulling the loaded value through. Validated end-to-end against a live daemon.

Together they fully close the gap: operators can set the timeout statically in config.toml, override transiently via env var, or fall through to the driver default — three sources, highest wins.

Fixes RightNow-AI#1128

Changes

Commit 1 — 79aa34c: public surface

Public surface (crates/openfang-runtime/src/llm_driver.rs, drivers/mod.rs)

  • New field: DriverConfig.subprocess_timeout_secs: Option<u64> with a six-paragraph doc comment (semantic + scope + provider in/out list + forward-compat note).
  • New env var: OPENFANG_SUBPROCESS_TIMEOUT_SECS.
  • Precedence in create_driver(): env var > config field > driver default. Nine-line precedence comment + a NOTE block flagging the scope-vs-implementation gap so the next subprocess-driver contributor knows where to wire in.

Naming rationale (deliberate, documented)

  • Field/env name is scope-flavored (subprocess_*), not semantic (message_*), on purpose. It telegraphs that HTTP providers (default / anthropic, openai, bedrock, qwen-code) accept-but-silently-ignore the field today. A semantic name would have invited the same silent-no-op footgun on those providers.
  • Driver-internal field in claude_code.rs intentionally kept as message_timeout_secs — it's not on the public boundary and the semantic name accurately describes what the driver stores. Public-boundary = scope-flavored, internals = semantic.

Tests (drivers/mod.rs, four new unit tests covering all precedence branches)

  • default_when_unset — no env, no config → driver default
  • config_set — config field flows through
  • env_overrides_config — env var wins over config (construction-only assertion; trait-object opacity prevents reading the value back from LlmDriver)
  • malformed_env_falls_through — unparseable env silently falls through to config, matching the .parse::<u64>().ok() chain in production
  • All four tests scrub OPENFANG_SUBPROCESS_TIMEOUT_SECS pre/post to avoid cross-test pollution.

Mechanical pass-throughs (no logic touched)

  • 12 × DriverConfig { .. } test fixtures in drivers/mod.rs gain subprocess_timeout_secs: None.
  • routes.rs (1), kernel.rs (6), agent_loop.rs (2): same pass-through fills in DriverConfig literals.

Diffstat: 5 files, +135 / -4. The +135 is dominated by the four new tests (~73 lines) and the doc comment (~17 lines); real logic change is the ~13-line precedence block in create_driver().

Commit 2 — b1c4061: config.toml plumbing

Deserializable fields (crates/openfang-types/src/config.rs)

  • DefaultModelConfig.subprocess_timeout_secs: Option<u64> with #[serde(default)].
  • FallbackProviderConfig.subprocess_timeout_secs: Option<u64> with #[serde(default)].
  • Per-provider granularity is deliberate (option A from the design discussion): each configured provider can have its own timeout, so primary can be generous and fallback can be strict. A single global field would have over-collapsed the design space.

Construction-site wiring (replacing subprocess_timeout_secs: None placeholders left behind by commit 1)

  • kernel.rs — 6 sites (663, 687, 736, 5031, 5108, 5139).

Sites carried forward as None (kept-as-None, not wired)

  • routes.rs — 2 sites: ~7531 (set_provider_key dashboard hot-update path) and 7701 (provider connectivity test endpoint). Both are mechanical subprocess_timeout_secs: None pass-throughs, not config-fed wiring.
  • agent_loop.rs — 2 sites (1146, 1330) intentionally left as None: those iterate over the manifest fallback type, not the config-toml type. Wiring them would require widening the manifest path, which is out of scope. (Added in commit 1; carried forward unchanged by commit 2.) Documented inline.

Tests (2 new)

  • TOML round-trip — populated subprocess_timeout_secs survives serialize/deserialize.
  • Legacy-shape parse — config.toml files without the field still deserialize cleanly (the #[serde(default)] path).

Precedence comment update in drivers/mod.rs::create_driver — now reflects that the config-field path is real, not aspirational.

Mechanical pass-throughs — 8 fixture updates across openfang-api/tests and openfang-kernel/tests add the new field as None to existing config literals.

Diffstat: 12 files, +112 / -7.

Testing

  • cargo clippy --workspace --all-targets -- -D warnings passes (Phase 1 of deploy-local.sh)
  • cargo test --workspace passes (commit body claims 362 + 933 + 260 lib tests green; precedence tests + TOML round-trip tests included)
  • Live integration tested — full deploy-local.sh run (all 7 phases) against the local daemon: build → backup → binary swap → launchctl kickstart → 10s log-tail → post-swap agent_send round-trip through the patched dispatch path. Round-trip clean; daemon stable.
  • Live config.toml validation — daemon restarted with subprocess_timeout_secs = 600 under [default_model]; subsequent subprocess turns honor the new ceiling.

Security

  • No new unsafe code
  • No secrets or API keys in diff
  • User input validated at boundaries — env var parsed via .parse::<u64>().ok(), malformed input silently falls through (test coverage for this branch); config field is Option<u64> deserialized from already-trusted TOML.

Notes for reviewers

A few decision points worth flagging up front:

  1. Naming axis (scope vs. semantic). We deliberately picked scope-flavored for the public boundary — see "Naming rationale" above. If you'd prefer semantic (message_timeout_secs), the rename is mechanical (~12 occurrences across test fixtures + the Debug impl) but reintroduces the silent-no-op footgun on HTTP providers that the current name explicitly avoids.
  2. Per-provider vs. global config field. Commit 2 puts the field on DefaultModelConfig and FallbackProviderConfig (option A). The alternative — a single [runtime] or top-level field — is cleaner if you never want different timeouts on primary vs. fallback. Open to flipping if the team prefers; the wiring sites would shrink.
  3. Intentionally-None sites. Three categories of construction sites carry subprocess_timeout_secs: None rather than a config-fed value: agent_loop.rs:1146 and 1330 (manifest fallback type — doesn't carry config-toml values; widening the manifest path is out of scope), and routes.rs:~7531 (set_provider_key hot-update) and routes.rs:7701 (provider connectivity test). All flagged inline; reviewers can scan in one place here.
  4. Trait-object test limitation. The env_overrides_config test asserts construction succeeds, not that 600 wins over 120 numerically. LlmDriver is a trait object; reading the timeout back would require either a getter on the trait or a downcast. Current judgment: precedence is two lines of pure data flow (.or()), so construction test + code reading is sufficient. Open to adding a trait getter if reviewers prefer.
  5. Forward-compat NOTE block. The comment in drivers/mod.rs explicitly flags the scope-vs-implementation gap (config exists on the public surface; only one driver honors it today). This is intentional documentation, not a TODO — the next subprocess-driver contributor should read it before wiring their driver in.
  6. No boot log line. There is currently no log line announcing the effective timeout + source at boot. Cheap follow-up (~3 lines in create_driver) if reviewers want explicit observability ("Driver subprocess timeout: 600s [source: config.toml]"); deliberately not added in either commit to keep diffs tight.

The claude-code driver hardcodes its per-message turn timeout inside
ClaudeCodeDriver and exposed no operator-facing knob, so long-running
CC subprocess turns (large prompt-caches, deep tool chains) hit the
internal default with no escape hatch. Adds a public config surface,
honored today only by the claude-code driver, designed so future
subprocess drivers can opt in without re-shaping the API.

Public surface
- DriverConfig.subprocess_timeout_secs: Option<u64> (llm_driver.rs)
- OPENFANG_SUBPROCESS_TIMEOUT_SECS env var (drivers/mod.rs)
- Precedence in create_driver(): env var > config field > driver default

Naming rationale
- Field/env are scope-flavored, not semantic, on purpose: the name
  telegraphs that HTTP providers (default/Anthropic, openai, bedrock,
  qwen-code) accept-but-silently-ignore the field today. A semantic
  name (message_timeout_secs) would have invited the same silent-no-op
  footgun on those providers.
- Driver-internal field in claude_code.rs intentionally kept as
  message_timeout_secs — it's not on the public boundary and the
  semantic name accurately describes what it stores.

Tests (drivers/mod.rs)
- default_when_unset: no env, no config -> driver default
- config_set: config field flows through
- env_overrides_config: env var wins over config (construction-only
  assertion; trait-object opacity prevents reading the value back)
- malformed_env_falls_through: unparseable env silently falls through
  to config, matching the .parse::<u64>().ok() chain in production
- All four tests scrub OPENFANG_SUBPROCESS_TIMEOUT_SECS pre/post to
  avoid cross-test pollution

Mechanical pass-throughs
- 12 x DriverConfig { .. } test fixtures in drivers/mod.rs gain
  subprocess_timeout_secs: None
- routes.rs (1), kernel.rs (6), agent_loop.rs (2): same pass-through
  fills in DriverConfig literals; no logic touched

Forward-compat note
- A NOTE block in drivers/mod.rs flags the scope-vs-implementation
  gap so the next contributor adding a subprocess driver knows
  exactly where to wire the config in.

Validated end-to-end against a live daemon: dry-run + full deploy
(deploy-local.sh, all 7 phases) + post-swap agent_send round-trip
through the claude-code dispatch path.
Follow-up to 79aa34c. The previous commit added the public surface
(DriverConfig field + OPENFANG_SUBPROCESS_TIMEOUT_SECS env var) but
left every DriverConfig construction site hardcoded to None — so the
struct field was wired but had no on-disk source feeding it. The env
var was the only operator-facing knob.

This commit plumbs the missing layer: the timeout is now deserializable
from config.toml on both the primary and global-fallback providers.

Public surface
- DefaultModelConfig.subprocess_timeout_secs: Option<u64>
- FallbackProviderConfig.subprocess_timeout_secs: Option<u64>
- Both fields are #[serde(default)] — existing config.toml files
  without the field deserialize cleanly to None (no breaking change).

Placement rationale
- Per-provider on each config struct, not a top-level field or a new
  [driver] section. This matches the existing per-provider config shape
  and lets operators set different timeouts for primary vs. fallback
  (e.g. tighter timeout on a fast fallback to fail over sooner). If a
  second driver-level setting ever lands, refactoring two struct fields
  into a [driver] section is cheap; we don't pre-pay for it now.

Wiring (kernel.rs)
- L663  primary driver  ........  pulls config.default_model.subprocess_timeout_secs
- L687  auto-detect path  ......  inherits default_model intent (the swap
                                  is replacing the *provider*, not the
                                  timeout policy)
- L736  global fallback loop  ..  pulls fb.subprocess_timeout_secs
- L5031 agent primary  .........  inherits effective_default's value when
                                  agent_provider == default_provider;
                                  None for cross-provider overrides
- L5108 agent manifest fallback   inherits dm's value when the manifest
                                  fallback resolves to "default" (matching
                                  the existing fb.provider sentinel logic);
                                  None for explicit cross-provider entries
- L5139 global fallback (per-agent loop) — pulls fb.subprocess_timeout_secs

Sites kept as None (intentional)
- agent_loop.rs:1146, 1330: ModelNotFound recovery iterates over the
  agent manifest's fallback_models (FallbackModel, not the config-toml
  type) — no per-provider config in scope.
- routes.rs:7701: provider connectivity test endpoint; no config source.
- routes.rs:7529: dashboard hot-update path constructs a fresh DM with
  defaults (None) — operator sets timeout via config.toml, not via the
  set-key flow.

Tests
- test_subprocess_timeout_secs_in_toml: round-trips a TOML doc with
  default_model.subprocess_timeout_secs = 600 and one fallback at 180,
  one fallback omitted; asserts each value (or None) reaches the parsed
  config struct.
- test_subprocess_timeout_secs_omitted_defaults_to_none: asserts a
  legacy-shaped config.toml (no timeout fields) parses cleanly with
  both fields = None — backward-compat guard.
- 4 existing claude_code driver timeout tests still pass.

Mechanical pass-throughs
- 8 test fixtures across openfang-kernel/tests and openfang-api/tests
  gain subprocess_timeout_secs: None on their DefaultModelConfig
  literals.
- 1 production literal in routes.rs gains the same field.
- The existing FallbackProviderConfig serde-roundtrip test gains
  subprocess_timeout_secs: None plus an assertion.

Precedence comment in drivers/mod.rs::create_driver updated to reflect
that the config-field path is now real, with explicit pointers to the
kernel.rs wiring sites for future contributors.

Validated: cargo check --workspace --tests is clean; openfang-types
(362), openfang-runtime (933), and openfang-kernel (260) lib tests
all pass.
@benhoverter benhoverter merged commit 74daaeb into local-main Apr 28, 2026
benhoverter added a commit that referenced this pull request Apr 30, 2026
A typo in any binding's match_rule no longer drops the entire bindings
table. Each entry is parsed independently; malformed entries log an
ERROR with index, agent name, and the underlying serde error, then are
skipped. A single WARN summarizes total dropped vs. surviving bindings.
Per-entry deny_unknown_fields is preserved so silent typos still fail
loudly — just no longer catastrophically.

Before this change, a single misspelled field anywhere in [[bindings]]
caused the whole table to fail parsing, silently unbinding every
agent — the worst possible failure mode for a routing config.

- New `lenient_extract_bindings` runs after include-merge / [api]
  migration, before `try_into::<KernelConfig>()`.
- 7 new config tests cover the reproducer, happy path, all-malformed,
  no-bindings, missing-agent, survivor-order preservation, and
  top-level field typos:
    * test_lenient_bindings_drops_typo_keeps_rest
    * test_lenient_bindings_all_valid_unchanged
    * test_lenient_bindings_all_malformed_yields_empty_but_keeps_rest_of_config
    * test_lenient_bindings_no_bindings_section_is_noop
    * test_lenient_bindings_missing_agent_field_dropped
    * test_lenient_bindings_preserves_survivor_order — locks in that
      first-match-wins routing semantics cannot silently regress when
      a middle entry is dropped
    * test_lenient_bindings_top_level_field_typo_dropped — locks in
      that deny_unknown_fields catches operator typos at the binding
      top level (e.g. \`agnt = ...\`), not just inside match_rule
- TODO marker added on the remaining \`warn!\` fallback in \`load_config\`
  for the non-binding silent-default path (follow-up work).

Tested live: typo'd \`hannel\` field on binding #2 logged as expected;
remaining 5 bindings loaded and routed correctly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant