Skip to content

fix: retry MCP callTool on any connection error, not just ErrSessionMissing#2215

Merged
dgageot merged 1 commit intodocker:mainfrom
dgageot:fix-flaky-test
Mar 23, 2026
Merged

fix: retry MCP callTool on any connection error, not just ErrSessionMissing#2215
dgageot merged 1 commit intodocker:mainfrom
dgageot:fix-flaky-test

Conversation

@dgageot
Copy link
Member

@dgageot dgageot commented Mar 22, 2026

When an MCP server restarts, callTool could receive transport errors (EOF, connection reset) instead of ErrSessionMissing depending on timing. Broaden the retry condition to trigger forceReconnectAndWait on any non-context-cancellation error, fixing a flaky TestRemoteReconnectAfterServerRestart.

Assisted-By: docker-agent

@dgageot dgageot requested a review from a team as a code owner March 22, 2026 18:11
docker-agent[bot]

This comment was marked as outdated.

@dgageot dgageot marked this pull request as draft March 22, 2026 18:52
…sionMissing

When an MCP server restarts, callTool may receive transport errors
(EOF, connection reset, broken pipe) instead of ErrSessionMissing
depending on timing. Add an isConnectionError helper that selectively
retries on known connection/session error types rather than retrying
on all errors indiscriminately.

The helper checks for:
- mcp.ErrSessionMissing (server lost our session)
- io.EOF (connection dropped)
- net.Error (timeout, connection reset, etc.)
- String-based fallback for transport errors the MCP SDK wraps
  with %v (losing the original error chain)

This fixes flaky TestRemoteReconnectAfterServerRestart without
broadening retry to application-level errors.

Assisted-By: docker-agent
@dgageot dgageot marked this pull request as ready for review March 23, 2026 08:17
Copy link

@docker-agent docker-agent bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟡 NEEDS ATTENTION

Found 3 medium-severity issues in the new isConnectionError function. These are logic concerns that could cause false-positive reconnection attempts or miss actual connection errors in edge cases. The core fix for the flaky test is sound, but the error detection logic could be more robust.

if msg := err.Error(); strings.Contains(msg, "connection reset") ||
strings.Contains(msg, "connection refused") ||
strings.Contains(msg, "broken pipe") ||
strings.Contains(msg, "EOF") {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: Redundant EOF string matching could cause false positives

The isConnectionError function checks for EOF twice:

  • Line 709: errors.Is(err, io.EOF) (proper error chain check)
  • Line 714: strings.Contains(msg, "EOF") (string matching)

If an error is io.EOF, it's already caught by line 709, making the string check redundant. More critically, the substring check could match unrelated errors that mention 'EOF' in their message (e.g., 'invalid EOF character in JSON', 'unexpected EOF in response body'), causing false-positive reconnection attempts.

Recommendation: Remove "EOF" from the string matching on line 714, since actual io.EOF errors are already handled by line 709.

return true
}
var netErr net.Error
if errors.As(err, &netErr) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: All net.Error types treated as retriable connection errors

The code treats any net.Error as a retriable connection error (lines 704-706):

if errors.As(err, &netErr) {
    return true
}

However, some net.Error implementations represent permanent failures that won't be fixed by retrying:

  • DNS lookup failures for non-existent hosts (no such host)
  • Invalid address formats
  • Network unreachable errors (may indicate configuration issues)

Recommendation: Consider checking netErr.Temporary() or filtering for specific retriable error types to avoid unnecessary retry attempts on permanent errors.

// client.Do) with its internal ErrRejected sentinel using %v, which
// drops the original error from the chain. Detect these by checking
// the error message for common transport-failure substrings.
if msg := err.Error(); strings.Contains(msg, "connection reset") ||
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 MEDIUM: String-based error detection is brittle and locale-dependent

The string-matching logic (lines 711-714) checks for 'connection reset', 'connection refused', 'broken pipe', and 'EOF'. While the comment explains this is needed because the MCP SDK wraps errors with %v, this approach has several risks:

  1. Not stable: Error messages are not part of Go's compatibility guarantee and can change across versions or operating systems
  2. Locale-dependent: Non-English systems may have different error messages
  3. Phrasing variations: Different libraries might phrase errors differently (e.g., 'reset by peer' vs 'connection reset')

Recommendation: While this is a documented workaround for upstream SDK limitations, consider:

  • Filing an issue with the MCP SDK to preserve error chains
  • Adding test coverage for the specific error messages you expect
  • Documenting the known limitation in code comments (already partially done)

@dgageot dgageot merged commit 0dbc1be into docker:main Mar 23, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants