Retry transient agent provider failures automatically #206

@shiny-code-bot

Finish Line

Agent orchestration retries transient provider failures automatically and reports retry/fallback metadata transparently.

Current Status

State: Completed in PR #230.
Next action: Observe real provider failures before deciding whether to add fallback-model policy.
Blocked by: None.
Waiting for: None.
Last verified: 2026-05-30.

Context

During Launchplane public-ingress scheduler design work, a claude-sonnet-4.6 agent failed immediately with "API Error: Overloaded". Code continued with the other agents and retried Sonnet only after the user noticed. That is poor ergonomics: the harness itself should detect transient agent-provider failures, report them visibly, and retry them where possible.

This is better as a code change than a skill/prompt reminder. The agent manager already knows the selected model, failure status, and batch context, so it is the right layer to classify retryable failures and decide whether to retry the same model, try a configured fallback, or return a clear exhausted-retries result.
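
The three outcomes named above (retry the same model, try a configured fallback, or report exhausted retries) can be sketched as a small decision function. This is only an illustration: `NextStep`, `next_step`, and all parameter names are hypothetical and not taken from the code-core source, and the fallback branch reflects the follow-up policy idea rather than anything PR #230 ships.

```rust
/// Hypothetical outcome of handling one failed provider attempt.
#[derive(Debug, PartialEq)]
enum NextStep {
    /// Run the same model again.
    RetrySameModel,
    /// Switch to a configured fallback model (not implemented in PR #230).
    TryFallback(String),
    /// Retry budget is spent and no fallback is configured.
    Exhausted,
    /// The failure is not transient, so report it immediately.
    FailFast,
}

/// Decide what to do after a failure.
/// `retries_done` counts retries already performed for this agent run.
fn next_step(
    retryable: bool,
    retries_done: u32,
    max_retries: u32,
    fallback: Option<&str>,
) -> NextStep {
    if !retryable {
        return NextStep::FailFast;
    }
    if retries_done < max_retries {
        return NextStep::RetrySameModel;
    }
    match fallback {
        Some(model) => NextStep::TryFallback(model.to_string()),
        None => NextStep::Exhausted,
    }
}
```

With no fallback configured, which matches the implemented scope, the function only ever returns `RetrySameModel`, `Exhausted`, or `FailFast`.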

Implemented Scope

  • Implements bounded same-model retries around provider execution only.
  • Classifies transient overload/rate-limit/timeout/upstream/transport failures as retryable.
  • Keeps auth, configuration, missing command, policy, and cancellation failures as fail-fast.
  • Adds retry metadata to agent status/result/wait/list tool responses.
  • Documents retry reporting in docs/agents.md.
  • Does not implement fallback models; that remains a follow-up policy decision.
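
A minimal sketch of the retryable versus fail-fast split described in the list above. The `FailureKind` enum and `classify_message` helper are assumptions made for illustration; the real classifier in code-core may use different names, inputs, and matching rules. The string patterns in particular are guesses, apart from "Overloaded", which appears in the error quoted in the Context section.

```rust
/// Hypothetical failure categories, mirroring the scope list.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FailureKind {
    Overloaded,
    RateLimited,
    Timeout,
    Upstream,
    Transport,
    Auth,
    Configuration,
    MissingCommand,
    Policy,
    Cancelled,
}

impl FailureKind {
    /// Transient kinds are retryable; everything else fails fast.
    fn is_retryable(self) -> bool {
        matches!(
            self,
            FailureKind::Overloaded
                | FailureKind::RateLimited
                | FailureKind::Timeout
                | FailureKind::Upstream
                | FailureKind::Transport
        )
    }
}

/// Best-effort mapping from a provider error message to a kind.
/// Returns None when the message matches no known transient pattern,
/// so unknown errors are never retried by accident.
fn classify_message(message: &str) -> Option<FailureKind> {
    let lower = message.to_lowercase();
    if lower.contains("overloaded") {
        Some(FailureKind::Overloaded)
    } else if lower.contains("rate limit") {
        Some(FailureKind::RateLimited)
    } else if lower.contains("timed out") || lower.contains("timeout") {
        Some(FailureKind::Timeout)
    } else {
        None
    }
}
```

Treating unrecognized messages as non-retryable keeps the default conservative: a new error string fails fast until someone decides it is transient.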

Validation

  • cargo test -p code-core model_facing_agent_queries_are_session_scoped --lib
  • cargo test -p code-core agent_tool --lib
  • git diff --check
  • ./build-fast.sh
  • GitHub blob-size policy check
  • Exec-harness smoke test using a fake provider that returned an overload error once before succeeding
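
The fake-provider smoke check can be approximated in isolation as follows. This is a sketch under stated assumptions, not the harness code: `FakeProvider`, `ProviderError`, and `run_with_retries` are hypothetical names, and backoff delays are omitted to keep the example short.

```rust
/// Hypothetical provider error with one transient and one permanent variant.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum ProviderError {
    Overloaded,
    Auth,
}

impl ProviderError {
    fn is_retryable(&self) -> bool {
        matches!(self, ProviderError::Overloaded)
    }
}

/// Fake provider that fails with Overloaded a fixed number of times,
/// then succeeds. `calls` records how often it was invoked.
struct FakeProvider {
    failures_left: u32,
    calls: u32,
}

impl FakeProvider {
    fn run(&mut self) -> Result<String, ProviderError> {
        self.calls += 1;
        if self.failures_left > 0 {
            self.failures_left -= 1;
            Err(ProviderError::Overloaded)
        } else {
            Ok("done".to_string())
        }
    }
}

/// Bounded same-model retry loop around a single provider call.
/// Returns the final result together with the number of retries used.
fn run_with_retries<F>(max_retries: u32, mut attempt: F) -> (Result<String, ProviderError>, u32)
where
    F: FnMut() -> Result<String, ProviderError>,
{
    let mut retries = 0;
    loop {
        match attempt() {
            Ok(output) => return (Ok(output), retries),
            // Retry only transient errors, and only while budget remains.
            Err(err) if err.is_retryable() && retries < max_retries => retries += 1,
            // Non-retryable error or exhausted budget: surface the last error.
            Err(err) => return (Err(err), retries),
        }
    }
}
```

A provider built with `failures_left: 1` and a budget of two retries succeeds on the second call and reports one retry. A provider that keeps failing is called three times (one initial attempt plus two retries) before the last `Overloaded` error is returned.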

Acceptance Criteria

  • A transient provider overload from an agent run is retried automatically without Code issuing a second agent.create call.
  • Agent result metadata includes original model, retry count, final model, final status, and the last retryable error if exhausted.
  • Non-retryable agent failures still fail fast with a clear reason.
  • Tests cover retry classification, bounded retry behavior, and no-retry behavior for non-retryable errors.
  • Documentation or release notes mention how automatic agent retries are reported.
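
The metadata fields listed in the second criterion could be modeled roughly as below. The struct, field names, and the `summary` format are illustrative assumptions; the actual shape of the tool responses is whatever docs/agents.md describes.

```rust
/// Hypothetical final status of an agent run.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FinalStatus {
    Completed,
    Failed,
}

impl FinalStatus {
    fn as_str(self) -> &'static str {
        match self {
            FinalStatus::Completed => "completed",
            FinalStatus::Failed => "failed",
        }
    }
}

/// Hypothetical retry metadata attached to an agent result.
#[derive(Debug, Clone, PartialEq)]
struct RetryMetadata {
    original_model: String,
    retry_count: u32,
    final_model: String,
    final_status: FinalStatus,
    /// Set only when retries were exhausted.
    last_retryable_error: Option<String>,
}

impl RetryMetadata {
    /// One-line rendering suitable for a status or list response.
    fn summary(&self) -> String {
        let mut line = format!(
            "model={} retries={} final_model={} status={}",
            self.original_model,
            self.retry_count,
            self.final_model,
            self.final_status.as_str()
        );
        if let Some(err) = &self.last_retryable_error {
            line.push_str(&format!(" last_error={}", err));
        }
        line
    }
}
```

Because fallback models are out of scope, `final_model` would equal `original_model` in every run today; keeping both fields leaves room for the follow-up policy without changing the response shape.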

Relationships

Related: Launchplane scheduler-design session where Sonnet overload required manual retry.
Merged PR: #230

Metadata

Labels: plan (Durable planning issue); plan:active (Current active plan)
