
fix: recover from ErrSessionMissing when remote MCP server restarts #2212

Merged
dgageot merged 1 commit into docker:main from dgageot:mcp-reconnect
Mar 22, 2026


@dgageot dgageot commented Mar 22, 2026

Hi all. I think we're observing an issue where a long-lived docker agent (run via serve) fails to reconnect to MCP servers after the MCP server restarts: it does not recover from ErrSessionMissing.

Assisted-By: docker-agent

@dgageot dgageot marked this pull request as ready for review March 22, 2026 10:46
@dgageot dgageot requested a review from a team as a code owner March 22, 2026 10:46

@docker-agent docker-agent bot left a comment


Assessment: 🟡 NEEDS ATTENTION

This PR adds reconnection logic to recover from ErrSessionMissing when MCP servers restart. The implementation is mostly sound, but there's one timing issue worth addressing.

    	return nil
    case <-ctx.Done():
    	return ctx.Err()
    case <-time.After(sessionMissingRetryTimeout):

⚠️ MEDIUM: Timeout not coordinated with retry backoff

The sessionMissingRetryTimeout is set to 30 seconds, but tryRestart() uses exponential backoff that can take up to 31 seconds total (1+2+4+8+16 seconds across 5 retry attempts).

This creates a race condition where:

  1. A tool call encounters ErrSessionMissing and calls forceReconnectAndWait()
  2. watchConnection is in the middle of a backoff sleep (e.g., the 16-second sleep on the 5th retry)
  3. The tool call times out at 30 seconds and returns an error to the user
  4. A second later, the reconnect completes successfully
  5. The next identical tool call succeeds

Impact: Non-deterministic failures where tool calls fail with "timed out waiting for MCP server reconnection" even though the server successfully reconnects moments later.

Recommendation: Either increase sessionMissingRetryTimeout to 35-40 seconds to account for the maximum backoff duration, or coordinate the timeout with the actual retry logic (e.g., calculate remaining backoff time).

@dgageot dgageot merged commit 0c2bf5d into docker:main Mar 22, 2026
4 checks passed