Retry transient agent provider failures automatically

## Finish Line

Agent orchestration retries transient provider failures automatically and reports retry/fallback metadata transparently.

## Current Status

State: Completed in PR #230.
Next action: Observe real provider failures before deciding whether to add fallback-model policy.
Blocked by: None.
Waiting for: None.
Last verified: 2026-05-30.

## Context

During Launchplane public-ingress scheduler design work, a `claude-sonnet-4.6` agent failed immediately with `API Error: Overloaded`. Code continued with other agents and only retried Sonnet after the user noticed. That is poor ergonomics: transient agent-provider failures should be transparent, visible, and retryable by the harness itself where possible.

This is better as a code change than a skill/prompt reminder. The agent manager already knows the selected model, failure status, and batch context, so it is the right layer to classify retryable failures and decide whether to retry the same model, try a configured fallback, or return a clear exhausted-retries result.

## Implemented Scope

- Implements bounded same-model retries around provider execution only.
- Classifies transient overload/rate-limit/timeout/upstream/transport failures as retryable.
- Keeps auth, configuration, missing command, policy, and cancellation failures as fail-fast.
- Adds retry metadata to agent status/result/wait/list tool responses.
- Documents retry reporting in `docs/agents.md`.
- Does not implement fallback models; that remains a follow-up policy decision.

## Validation

- `cargo test -p code-core model_facing_agent_queries_are_session_scoped --lib`
- `cargo test -p code-core agent_tool --lib`
- `git diff --check`
- `./build-fast.sh`
- GitHub blob-size policy
- Exec-harness smoke using a fake provider that overloaded once before succeeding

## Acceptance Criteria

- [x] A transient provider overload from an agent run is retried automatically without Code issuing a second `agent.create` call.
- [x] Agent result metadata includes original model, retry count, final model, final status, and the last retryable error if exhausted.
- [x] Failed non-retryable agent runs still fail fast with clear reason.
- [x] Tests cover retry classification, bounded retry behavior, and no-retry behavior for non-retryable errors.
- [x] Documentation or release notes mention how automatic agent retries are reported.

## Relationships

Related: Launchplane scheduler-design session where Sonnet overload required manual retry.
Merged PR: #230


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Retry transient agent provider failures automatically #206

Finish Line

Current Status

Context

Implemented Scope

Validation

Acceptance Criteria

Relationships

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Retry transient agent provider failures automatically #206

Description

Finish Line

Current Status

Context

Implemented Scope

Validation

Acceptance Criteria

Relationships

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions