feat(mcp): reconnect_with_backoff — exponential retry for transient failures by emal-avala · Pull Request #235 · avala-ai/agent-code

emal-avala · 2026-04-23T23:28:27Z

Summary

Adds McpClient::reconnect_with_backoff(max_attempts) so a transient MCP failure (subprocess died, network blip) can be recovered without tearing down the agent loop.

Backoff schedule

attempt: 1    2    3    4     5     6     7 …
delay:   0   1s   2s   4s   8s   16s   30s (capped)

The delay curve is a pure helper (backoff_delay_ms) so it's unit-tested without an actual transport. Huge attempt counts are clamped to avoid 1 << u32::MAX panics.

Behavior

Drops the stale McpTransportConnection first (so connect() installs a fresh subprocess or SSE stream).
Between attempts: tokio::time::sleep of backoff_delay_ms(n - 1).
On success: status moves back to Connected, tools/resources re-discovered.
On exhaustion: status set to Error(last) and the accumulated error is returned with the attempt count so callers can surface a helpful message.
max_attempts = 0 is rejected with a clear error rather than silently succeeding.

Tests

backoff_schedule_doubles_until_cap — 1/2/4/8/16 s ramp, then flat 30 s; stable at u32::MAX.
backoff_is_monotonic_non_decreasing — no accidental regressions in the curve.
reconnect_zero_attempts_is_rejected — guard against caller mistakes.

Test plan

cargo test -p agent-code-lib --lib services::mcp (3/3 pass)
cargo clippy --workspace --tests --no-deps -- -D warnings
cargo fmt --all --check

…ailures Adds `McpClient::reconnect_with_backoff(max_attempts)` which drops the stale transport and retries `connect()` with exponential backoff: 1s → 2s → 4s → 8s → 16s → 30s (cap) → 30s … After every attempt fails, the client status transitions to `McpConnectionStatus::Error(last)` and the call returns the accumulated error, so callers can surface the failure cleanly. The backoff schedule is factored into a pure `backoff_delay_ms` function so the curve can be unit-tested without a real transport. Huge attempt counts are clamped to avoid shift overflow. Use case: an MCP subprocess that died mid-session can now be brought back without tearing down the agent loop. Tests: - schedule doubles to cap at 30s, stable past u32::MAX attempts - schedule is monotonic non-decreasing - `max_attempts = 0` is rejected with a clear error Full MCP suite: 3 pass. Clippy clean.

chatgpt-codex-connector · 2026-04-23T23:28:32Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

emal-avala merged commit aad7c88 into main Apr 23, 2026
14 checks passed

emal-avala deleted the feat/mcp-reconnect-backoff branch April 23, 2026 23:46

emal-avala mentioned this pull request Apr 24, 2026

Release v0.19.0 #244

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(mcp): reconnect_with_backoff — exponential retry for transient failures#235

feat(mcp): reconnect_with_backoff — exponential retry for transient failures#235
emal-avala merged 1 commit intomainfrom
feat/mcp-reconnect-backoff

emal-avala commented Apr 23, 2026

Uh oh!

chatgpt-codex-connector Bot commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

emal-avala commented Apr 23, 2026

Summary

Backoff schedule

Behavior

Tests

Test plan

Uh oh!

chatgpt-codex-connector Bot commented Apr 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant