feat(mcp): reconnect_with_backoff — exponential retry for transient failures#235

Merged: emal-avala merged 1 commit into main from feat/mcp-reconnect-backoff on Apr 23, 2026

@emal-avala (Member)

Summary

Adds McpClient::reconnect_with_backoff(max_attempts) so a transient MCP failure (subprocess died, network blip) can be recovered without tearing down the agent loop.

Backoff schedule

attempt: 1    2    3    4     5     6     7 …
delay:   0   1s   2s   4s   8s   16s   30s (capped)

The delay curve is a pure helper (backoff_delay_ms), so it can be unit-tested without an actual transport. Huge attempt counts are clamped to avoid shift-overflow panics such as 1 << u32::MAX.
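
A minimal sketch of what such a pure helper could look like, not the code merged in this PR. It assumes a zero-based retry index (0 maps to the first 1 s delay) and a u64 return type; the constant names are hypothetical. The shift amount is clamped before shifting, so even u32::MAX lands on the cap.

```rust
/// Delay before the first retry, in milliseconds (hypothetical constant name).
const BASE_DELAY_MS: u64 = 1_000;
/// Upper bound on any single delay: the 30 s cap from the schedule.
const MAX_DELAY_MS: u64 = 30_000;
/// 1_000 << 5 = 32_000 already exceeds the cap, so larger shifts are never needed.
const MAX_SHIFT: u32 = 5;

/// Returns the delay for the zero-based retry index `retry`:
/// 1 s, 2 s, 4 s, 8 s, 16 s, then a flat 30 s.
fn backoff_delay_ms(retry: u32) -> u64 {
    // Clamp the exponent first so the shift can never overflow,
    // then clamp the resulting delay to the cap.
    let shift = retry.min(MAX_SHIFT);
    (BASE_DELAY_MS << shift).min(MAX_DELAY_MS)
}
```

Under these assumptions, backoff_delay_ms(4) is 16 000 ms, and every index from 5 up to u32::MAX returns 30 000 ms.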

Behavior

  • Drops the stale McpTransportConnection first (so connect() installs a fresh subprocess or SSE stream).
  • Between attempts: tokio::time::sleep of backoff_delay_ms(n - 1).
  • On success: status moves back to Connected, tools/resources re-discovered.
  • On exhaustion: status set to Error(last) and the accumulated error is returned with the attempt count so callers can surface a helpful message.
  • max_attempts = 0 is rejected with a clear error rather than silently succeeding.
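
The control flow described above can be sketched as follows. This is a synchronous, transport-free illustration rather than the PR's implementation: the `connect` closure stands in for dropping the stale transport and calling connect(), the `sleep` closure stands in for tokio::time::sleep, and the String error type, the returned attempt number, and the zero-based delay index are all assumptions. Status updates and tool/resource re-discovery are omitted.

```rust
use std::time::Duration;

/// Same curve as the schedule above, assuming a zero-based retry index.
fn backoff_delay_ms(retry: u32) -> u64 {
    (1_000u64 << retry.min(5)).min(30_000)
}

/// Hypothetical stand-in for the retry loop. Returns the 1-based number of
/// the attempt that succeeded, or an error carrying the attempt count and
/// the last failure.
fn reconnect_with_backoff<C, S>(
    max_attempts: u32,
    mut connect: C,
    mut sleep: S,
) -> Result<u32, String>
where
    C: FnMut() -> Result<(), String>,
    S: FnMut(Duration),
{
    // Reject zero attempts instead of silently "succeeding".
    if max_attempts == 0 {
        return Err("max_attempts must be at least 1".to_string());
    }

    let mut last_err = String::new();
    for attempt in 1..=max_attempts {
        match connect() {
            Ok(()) => return Ok(attempt),
            Err(e) => last_err = e,
        }
        // Sleep only between attempts, never after the final failure.
        if attempt < max_attempts {
            sleep(Duration::from_millis(backoff_delay_ms(attempt - 1)));
        }
    }

    Err(format!(
        "reconnect failed after {} attempts: {}",
        max_attempts, last_err
    ))
}
```

Injecting the sleep function keeps the loop testable without real waiting; an async version would await tokio::time::sleep in the same position.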

Tests

  • backoff_schedule_doubles_until_cap — 1/2/4/8/16 s ramp, then flat 30 s; stable at u32::MAX.
  • backoff_is_monotonic_non_decreasing — no accidental regressions in the curve.
  • reconnect_zero_attempts_is_rejected — guard against caller mistakes.

Test plan

  • cargo test -p agent-code-lib --lib services::mcp (3/3 pass)
  • cargo clippy --workspace --tests --no-deps -- -D warnings
  • cargo fmt --all --check

…ailures

Adds `McpClient::reconnect_with_backoff(max_attempts)` which drops the
stale transport and retries `connect()` with exponential backoff:

    1s → 2s → 4s → 8s → 16s → 30s (cap) → 30s …

If every attempt fails, the client status transitions to
`McpConnectionStatus::Error(last)` and the call returns the
accumulated error, so callers can surface the failure cleanly.

The backoff schedule is factored into a pure `backoff_delay_ms`
function so the curve can be unit-tested without a real transport.
Huge attempt counts are clamped to avoid shift overflow.

Use case: an MCP subprocess that died mid-session can now be brought
back without tearing down the agent loop.

Tests:
- schedule doubles until it reaches the 30s cap, stable at u32::MAX attempts
- schedule is monotonic non-decreasing
- `max_attempts = 0` is rejected with a clear error

Full MCP suite: 3 pass. Clippy clean.

@emal-avala merged commit aad7c88 into main on Apr 23, 2026 (14 checks passed)
@emal-avala deleted the feat/mcp-reconnect-backoff branch on April 23, 2026 at 23:46
@emal-avala mentioned this pull request on Apr 24, 2026 (4 tasks)