fix(orchestration): apply >=3 escalation to PlanVerifier timeout arms#3873

Merged
bug-ops merged 1 commit into main from 3868-verifier-escalation on May 16, 2026

Conversation

@bug-ops bug-ops commented May 16, 2026

Summary

Test plan

  • Added verify_timeout_increments_counter_and_crosses_threshold — drives verify() through 3 consecutive timeouts and asserts consecutive_failures == 3 so the escalation arm fires.
  • Added verify_plan_timeout_increments_counter_and_crosses_threshold — same for verify_plan().
  • cargo +nightly fmt --check
  • cargo clippy -p zeph-orchestration --all-targets -- -D warnings
  • cargo nextest run -p zeph-orchestration -p zeph-config --lib (757 passed, 0 failed)
  • RUSTDOCFLAGS="--deny rustdoc::broken_intra_doc_links" cargo doc --no-deps -p zeph-orchestration -p zeph-config

Closes #3868
Closes #3867

The timeout arms in `PlanVerifier::verify` and `verify_plan` incremented
`consecutive_failures` (fail-open policy) but skipped the `>= 3` escalation
check that the `Ok(Err(_))` arms run. A misconfigured or overloaded
`verify_provider` that always timed out therefore failed open silently
forever — operators saw repeated `warn!` entries and no `error!`.

Both timeout arms now mirror the LLM-error path: emit `error!` once
`consecutive_failures >= 3` advising the operator to inspect
`verify_provider` configuration; emit `warn!` otherwise.
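
The escalation rule described above can be sketched in isolation. This is a minimal, synchronous illustration and not the actual `PlanVerifier` code: `VerifierState`, `Attempt`, `LogLevel`, `record`, and `escalation_level` are hypothetical names, the real arms log through `warn!`/`error!` rather than returning a level, and resetting the counter on success is an assumption inferred from the word "consecutive".

```rust
// Hypothetical stand-ins for illustration only; the real PlanVerifier
// emits warn!/error! log entries instead of returning a level.
#[derive(Debug, PartialEq, Clone, Copy)]
enum LogLevel {
    Warn,
    Error,
}

/// Outcome of one verification attempt: success, an LLM error
/// (the `Ok(Err(_))` arm), or a timeout.
enum Attempt {
    Success,
    LlmError,
    Timeout,
}

/// Threshold taken from the PR description (`>= 3`).
const ESCALATION_THRESHOLD: u32 = 3;

#[derive(Default)]
struct VerifierState {
    consecutive_failures: u32,
}

/// Shared escalation check: error at or above the threshold, warn below it.
fn escalation_level(consecutive_failures: u32) -> LogLevel {
    if consecutive_failures >= ESCALATION_THRESHOLD {
        LogLevel::Error
    } else {
        LogLevel::Warn
    }
}

impl VerifierState {
    /// Records one attempt and returns the level to log, or `None` on success.
    fn record(&mut self, attempt: Attempt) -> Option<LogLevel> {
        match attempt {
            // Assumption: a success resets the consecutive-failure counter.
            Attempt::Success => {
                self.consecutive_failures = 0;
                None
            }
            // After the fix, timeouts take the same escalation path as LLM errors.
            Attempt::LlmError | Attempt::Timeout => {
                self.consecutive_failures += 1;
                Some(escalation_level(self.consecutive_failures))
            }
        }
    }
}

fn main() {
    let mut state = VerifierState::default();
    for attempt_no in 1..=3 {
        let level = state.record(Attempt::Timeout);
        println!("timeout {attempt_no}: {level:?}");
    }
}
```

Routing both failure arms through one helper keeps the threshold in a single place, which is the kind of drift between arms that this fix addresses.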

`replan` and `replan_from_plan` timeout arms are left as-is — their
`Ok(Err(_))` arms do not track `consecutive_failures` either, so adding
the escalation only there would be inconsistent.

Also documents the orchestration timeout fields introduced in #3860 by
adding commented `aggregator_timeout_secs`, `planner_timeout_secs`, and
`verifier_timeout_secs` entries to `[orchestration]` in
`config/default.toml`.

@github-actions github-actions Bot added the bug, size/M, documentation, rust, and config labels and removed the size/M label May 16, 2026
@bug-ops bug-ops enabled auto-merge (squash) May 16, 2026 12:58
@bug-ops bug-ops merged commit fda6e87 into main May 16, 2026
32 checks passed
@bug-ops bug-ops deleted the 3868-verifier-escalation branch May 16, 2026 13:04

Labels

bug, config, documentation, rust
