Finish Line
Agent orchestration retries transient provider failures automatically and reports retry/fallback metadata transparently.
Current Status
State: Completed in PR #230.
Next action: Observe real provider failures before deciding whether to add fallback-model policy.
Blocked by: None.
Waiting for: None.
Last verified: 2026-05-30.
Context
During Launchplane public-ingress scheduler design work, a claude-sonnet-4.6 agent failed immediately with API Error: Overloaded. Code continued with other agents and only retried Sonnet after the user noticed. That is poor ergonomics: transient agent-provider failures should be transparent, visible, and retryable by the harness itself where possible.
This is better as a code change than a skill/prompt reminder. The agent manager already knows the selected model, failure status, and batch context, so it is the right layer to classify retryable failures and decide whether to retry the same model, try a configured fallback, or return a clear exhausted-retries result.
Implemented Scope
- Implements bounded same-model retries around provider execution only.
- Classifies transient overload/rate-limit/timeout/upstream/transport failures as retryable.
- Keeps auth, configuration, missing command, policy, and cancellation failures as fail-fast.
- Adds retry metadata to agent status/result/wait/list tool responses.
- Documents retry reporting in
docs/agents.md.
- Does not implement fallback models; that remains a follow-up policy decision.
Validation
cargo test -p code-core model_facing_agent_queries_are_session_scoped --lib
cargo test -p code-core agent_tool --lib
git diff --check
./build-fast.sh
- GitHub blob-size policy
- Exec-harness smoke using a fake provider that overloaded once before succeeding
Acceptance Criteria
Relationships
Related: Launchplane scheduler-design session where Sonnet overload required manual retry.
Merged PR: #230
Finish Line
Agent orchestration retries transient provider failures automatically and reports retry/fallback metadata transparently.
Current Status
State: Completed in PR #230.
Next action: Observe real provider failures before deciding whether to add fallback-model policy.
Blocked by: None.
Waiting for: None.
Last verified: 2026-05-30.
Context
During Launchplane public-ingress scheduler design work, a
claude-sonnet-4.6agent failed immediately withAPI Error: Overloaded. Code continued with other agents and only retried Sonnet after the user noticed. That is poor ergonomics: transient agent-provider failures should be transparent, visible, and retryable by the harness itself where possible.This is better as a code change than a skill/prompt reminder. The agent manager already knows the selected model, failure status, and batch context, so it is the right layer to classify retryable failures and decide whether to retry the same model, try a configured fallback, or return a clear exhausted-retries result.
Implemented Scope
docs/agents.md.Validation
cargo test -p code-core model_facing_agent_queries_are_session_scoped --libcargo test -p code-core agent_tool --libgit diff --check./build-fast.shAcceptance Criteria
agent.createcall.Relationships
Related: Launchplane scheduler-design session where Sonnet overload required manual retry.
Merged PR: #230