Skip to content

fix: reconnect expired MCP backend sessions transparently, extend server session timeout#2597

Merged
lpcox merged 4 commits intomainfrom
copilot/fix-session-expiration-issue
Mar 26, 2026
Merged

fix: reconnect expired MCP backend sessions transparently, extend server session timeout#2597
lpcox merged 4 commits intomainfrom
copilot/fix-session-expiration-issue

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Mar 26, 2026

Long-running agent workflows (~30+ min) fail when the gateway's HTTP session to a backend (e.g. safeoutputs) expires mid-run. The agent receives session not found from the backend and enters a futile retry loop rather than recovering.

Changes

Client-side session reconnect (internal/mcp/connection.go, http_transport.go)

  • Plain JSON-RPC transport: sendHTTPRequest detects HTTP 404 + "session not found" body, calls reconnectPlainJSON() to re-initialize the session, then retries the original request once.
  • SDK transports (streamable / SSE): callSDKMethodWithReconnect() wraps all SDK method calls — on a "session not found" error it calls reconnectSDKTransport() (closes the dead session, dials a new one) and retries once. SendRequestWithServerID now routes through this wrapper for SDK transports.
  • Thread safety: sessionMu sync.RWMutex protects httpSessionID, session, and client. All reads go through getSDKSession() / getHTTPSessionID() (under RLock); reconnect functions hold the full Lock.
// Plain JSON-RPC: sendHTTPRequest now does this automatically
result, err := c.executeHTTPRequest(...)
if isSessionNotFoundHTTPResponse(result.StatusCode, result.ResponseBody) {
    if reconnErr := c.reconnectPlainJSON(); reconnErr == nil {
        result, err = c.executeHTTPRequest(...)  // retry once
    }
}

// SDK transports (streamable / SSE): via callSDKMethodWithReconnect
result, err := c.callSDKMethod(method, params)
if err != nil && isSessionNotFoundError(err) {
    if reconnErr := c.reconnectSDKTransport(); reconnErr == nil {
        result, err = c.callSDKMethod(method, params)
    }
}

Server-side idle timeout (internal/server/transport.go)

SessionTimeout increased from 30m → 2h so the agent→gateway inbound session doesn't expire during extended periods of no MCP activity (e.g. a long lake build).

Tests (internal/mcp/http_transport_test.go)

Added TestSendHTTPRequest_ReconnectsOnSessionNotFound, TestSendHTTPRequest_ReconnectFailure, TestSendHTTPRequest_NoReconnectOnOtherErrors, plus unit tests for isSessionNotFoundError and isSessionNotFoundHTTPResponse.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • example.com
    • Triggering command: /tmp/go-build2012957444/b330/launcher.test /tmp/go-build2012957444/b330/launcher.test -test.testlogfile=/tmp/go-build2012957444/b330/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true g_.a --local es committer.email go (dns block)
    • Triggering command: /tmp/go-build916828748/b330/launcher.test /tmp/go-build916828748/b330/launcher.test -test.testlogfile=/tmp/go-build916828748/b330/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true aw-mcpg/internal/config/rules/rules.go aw-mcpg/internal/config/rules/rules_test.go ache/go/1.25.8/x64/pkg/tool/linux_amd64/vet . contextprotocol/checkout --64 ache/go/1.25.8/x64/pkg/tool/linux_amd64/vet -I Scom/odrQfnn_R7r_as4UScom -I /opt/hostedtoolcache/go/1.25.8/x64/pkg/tool/linu-lang=go1.25 --gdwarf-5 --64 -o 2957444/b291/importcfg (dns block)
    • Triggering command: /tmp/go-build1482963661/b334/launcher.test /tmp/go-build1482963661/b334/launcher.test -test.testlogfile=/tmp/go-build1482963661/b334/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ve�� /home/REDACTED/work/gh-aw-mcpg/gh-aw-mcpg/internal/guard/context.go /home/REDACTED/work/gh-aw-mcpg/gh-aw-mcpg/internal/guard/guard.go ash b128.go s2rJ-u63W ache/go/1.25.8/x/tmp/go-build3142434740/b172/vet.cfg nIQTNFClwl8R --ve�� (dns block)
  • invalid-host-that-does-not-exist-12345.com
    • Triggering command: /tmp/go-build2012957444/b315/config.test /tmp/go-build2012957444/b315/config.test -test.testlogfile=/tmp/go-build2012957444/b315/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true -c=4 -nolocalimports -importcfg /tmp/go-build2012957444/b289/importcfg -pack /home/REDACTED/go/pkg/mod/golang.org/x/oauth2@v0.34.0/deviceauth.go /home/REDACTED/go/pkg/mod/golang.org/x/oauth2@v0.34.0/oauth2.go ortc�� rLdqHevS3 64/src/internal/byteorder/byteor--gdwarf2 x_amd64/vet pull.rebase abis (dns block)
    • Triggering command: /tmp/go-build588503108/b315/config.test /tmp/go-build588503108/b315/config.test -test.testlogfile=/tmp/go-build588503108/b315/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true ache/go/1.25.8/x64/src/runtime/c-I (dns block)
    • Triggering command: /tmp/go-build1482963661/b319/config.test /tmp/go-build1482963661/b319/config.test -test.testlogfile=/tmp/go-build1482963661/b319/testlog.txt -test.paniconexit0 -test.timeout=10m0s -o d -n 10 -importcfg 2957444/b370=> -s -w -buildmode=exe /usr/bin/runc.original --ve�� it tests..." -extld=gcc x_amd64/vet 64/src/runtime/c/opt/hostedtoolcache/go/1.25.8/x64/pkg/tool/linux_amd64/vet credential.helpe/tmp/go-build3142434740/b239/vet.cfg ache/go/1.25.8/x-lang=go1.17 x_amd64/vet (dns block)
  • nonexistent.local
    • Triggering command: /tmp/go-build2012957444/b330/launcher.test /tmp/go-build2012957444/b330/launcher.test -test.testlogfile=/tmp/go-build2012957444/b330/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true g_.a --local es committer.email go (dns block)
    • Triggering command: /tmp/go-build916828748/b330/launcher.test /tmp/go-build916828748/b330/launcher.test -test.testlogfile=/tmp/go-build916828748/b330/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true aw-mcpg/internal/config/rules/rules.go aw-mcpg/internal/config/rules/rules_test.go ache/go/1.25.8/x64/pkg/tool/linux_amd64/vet . contextprotocol/checkout --64 ache/go/1.25.8/x64/pkg/tool/linux_amd64/vet -I Scom/odrQfnn_R7r_as4UScom -I /opt/hostedtoolcache/go/1.25.8/x64/pkg/tool/linu-lang=go1.25 --gdwarf-5 --64 -o 2957444/b291/importcfg (dns block)
    • Triggering command: /tmp/go-build1482963661/b334/launcher.test /tmp/go-build1482963661/b334/launcher.test -test.testlogfile=/tmp/go-build1482963661/b334/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ve�� /home/REDACTED/work/gh-aw-mcpg/gh-aw-mcpg/internal/guard/context.go /home/REDACTED/work/gh-aw-mcpg/gh-aw-mcpg/internal/guard/guard.go ash b128.go s2rJ-u63W ache/go/1.25.8/x/tmp/go-build3142434740/b172/vet.cfg nIQTNFClwl8R --ve�� (dns block)
  • slow.example.com
    • Triggering command: /tmp/go-build2012957444/b330/launcher.test /tmp/go-build2012957444/b330/launcher.test -test.testlogfile=/tmp/go-build2012957444/b330/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true g_.a --local es committer.email go (dns block)
    • Triggering command: /tmp/go-build916828748/b330/launcher.test /tmp/go-build916828748/b330/launcher.test -test.testlogfile=/tmp/go-build916828748/b330/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true aw-mcpg/internal/config/rules/rules.go aw-mcpg/internal/config/rules/rules_test.go ache/go/1.25.8/x64/pkg/tool/linux_amd64/vet . contextprotocol/checkout --64 ache/go/1.25.8/x64/pkg/tool/linux_amd64/vet -I Scom/odrQfnn_R7r_as4UScom -I /opt/hostedtoolcache/go/1.25.8/x64/pkg/tool/linu-lang=go1.25 --gdwarf-5 --64 -o 2957444/b291/importcfg (dns block)
    • Triggering command: /tmp/go-build1482963661/b334/launcher.test /tmp/go-build1482963661/b334/launcher.test -test.testlogfile=/tmp/go-build1482963661/b334/testlog.txt -test.paniconexit0 -test.timeout=10m0s --ve�� /home/REDACTED/work/gh-aw-mcpg/gh-aw-mcpg/internal/guard/context.go /home/REDACTED/work/gh-aw-mcpg/gh-aw-mcpg/internal/guard/guard.go ash b128.go s2rJ-u63W ache/go/1.25.8/x/tmp/go-build3142434740/b172/vet.cfg nIQTNFClwl8R --ve�� (dns block)
  • this-host-does-not-exist-12345.com
    • Triggering command: /tmp/go-build2012957444/b339/mcp.test /tmp/go-build2012957444/b339/mcp.test -test.testlogfile=/tmp/go-build2012957444/b339/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 64/src/runtime/cgo o x_amd64/vet (dns block)
    • Triggering command: /tmp/go-build916828748/b339/mcp.test /tmp/go-build916828748/b339/mcp.test -test.testlogfile=/tmp/go-build916828748/b339/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 2957444/b308/_pkg_.a /tmp/go-build2012957444/b165/ ache/go/1.25.8/x64/pkg/tool/linux_amd64/compile . contextprotocol/-c --64 ache/go/1.25.8/x64/pkg/tool/linux_amd64/compile -I 2957444/b320/_pkg_.a -I docker-buildx --gdwarf-5 b/gh-aw-mcpg/int--norc -o docker-buildx (dns block)
    • Triggering command: /tmp/go-build1482963661/b343/mcp.test /tmp/go-build1482963661/b343/mcp.test -test.testlogfile=/tmp/go-build1482963661/b343/testlog.txt -test.paniconexit0 -test.timeout=10m0s -o ssue -importcfg x_amd64/vet -s -w -buildmode=exe x_amd64/vet --no�� --noprofile -extld=gcc x_amd64/vet ternal/engine/wabash ternal/engine/wa/usr/bin/runc x_amd64/compile x_amd64/vet (dns block)

If you need me to access, download, or install something from one of these locations, you can either:


⚡ Quickly spin up Copilot coding agent tasks from anywhere on your macOS or Windows machine with Raycast.

Copilot AI and others added 3 commits March 26, 2026 19:20
…session timeout

When an MCP backend session expires (e.g. after 30+ minutes of inactivity on the
gateway→safeoutputs connection), the gateway now:

1. Detects the "session not found" HTTP 404 response from the backend.
2. Transparently re-initialises a new session (reconnectPlainJSON).
3. Retries the original request exactly once with the new session.

For SDK-based transports (streamable HTTP / SSE), the same reconnect-and-retry
pattern is applied via callSDKMethodWithReconnect whenever the SDK error message
contains "session not found".

Additionally, the server-side idle SessionTimeout is increased from 30 minutes to
2 hours so that long-running agent workflows (lake build, etc.) do not cause the
agent→gateway session to expire during periods of no MCP activity.

Fixes: MCP session expires mid-run, causing agent retry loops

Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/a43c25fb-e9bc-4fc8-865e-19acf449fa21

Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
…g reconnect failures

- Use sync.RWMutex (sessionMu) instead of sync.Mutex so concurrent readers don't block
- Add getSDKSession() and getHTTPSessionID() helpers that snapshot fields under RLock
- Update all c.session reads in method bodies to use getSDKSession()
- Update buildSessionHeaderModifier to use getHTTPSessionID() (eliminates data race)
- Log reconnect failure in callSDKMethodWithReconnect when reconnErr != nil
- Improve SessionTimeout comment in transport.go

Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/a43c25fb-e9bc-4fc8-865e-19acf449fa21

Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
Copilot AI changed the title [WIP] Fix safe outputs MCP session expiration causing agent retries fix: reconnect expired MCP backend sessions transparently, extend server session timeout Mar 26, 2026
Copilot AI requested a review from lpcox March 26, 2026 19:28
@lpcox lpcox marked this pull request as ready for review March 26, 2026 19:31
Copilot AI review requested due to automatic review settings March 26, 2026 19:32
@lpcox lpcox merged commit f861516 into main Mar 26, 2026
25 checks passed
@lpcox lpcox deleted the copilot/fix-session-expiration-issue branch March 26, 2026 19:32
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes long-running MCP agent workflows failing due to expired backend sessions by adding transparent client-side session reconnection and extending the server’s idle session timeout.

Changes:

  • Add automatic reconnect-and-retry behavior for plain JSON-RPC HTTP requests on 404 + "session not found".
  • Add automatic reconnect-and-retry wrapper for SDK-based transports (streamable HTTP / SSE) when encountering "session not found" errors, with RWMutex protection around mutable session state.
  • Increase gateway inbound session idle timeout from 30m to 2h and add targeted unit tests for the new reconnect detection/behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
internal/mcp/connection.go Adds session locking plus reconnect-and-retry for SDK transports when sessions expire.
internal/mcp/http_transport.go Adds “session not found” detection helpers and reconnect-and-retry for plain JSON-RPC HTTP transport.
internal/mcp/http_transport_test.go Adds unit tests for session-not-found detection and plain JSON-RPC reconnect behavior.
internal/server/transport.go Extends server-side session timeout to reduce idle expiration during long workflows.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

require.Len(t, receivedSessionIDs, 2, "expected two tool calls: initial (failed) + retry (succeeded)")
assert.Equal(t, "session-1", receivedSessionIDs[0], "first attempt should use the initial session")
assert.Equal(t, "session-2", receivedSessionIDs[1], "retry should use the reconnected session")
assert.Equal(t, "session-2", conn.httpSessionID, "connection should store the new session ID")
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test reads conn.httpSessionID directly, but the production code now guards session fields with sessionMu. Accessing the field without the lock can trigger data races under -race. Prefer asserting via conn.getHTTPSessionID() (or another lock-protected accessor) instead of reading the struct field directly.

Suggested change
assert.Equal(t, "session-2", conn.httpSessionID, "connection should store the new session ID")
assert.Equal(t, "session-2", conn.getHTTPSessionID(), "connection should store the new session ID")

Copilot uses AI. Check for mistakes.
Comment on lines +492 to 495
// Check for HTTP errors after parsing.
// If we have a non-OK status but successfully parsed a JSON-RPC response,
// pass it through (it may already contain an error field)
// pass it through (it may already contain an error field).
if result.StatusCode != http.StatusOK {
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment here suggests that for non-200 HTTP statuses we might still "successfully parse a JSON-RPC response" and pass it through. However parseJSONRPCResponseWithSSE currently returns a synthetic JSON-RPC error for any statusCode != 200 even when the body is valid JSON-RPC, so this branch/doc is misleading and may hide server-provided error details. Consider either (a) adjusting parseJSONRPCResponseWithSSE/parseHTTPResult to preserve parsed JSON-RPC error bodies on non-200 responses, or (b) updating the comment/logging and simplifying the non-OK handling accordingly.

Copilot uses AI. Check for mistakes.
Comment on lines +290 to +299
sessionID, err := c.initializeHTTPSession()
if err != nil {
logger.LogError("backend", "Session reconnect failed for %s: %v", c.serverID, err)
return fmt.Errorf("session reconnect failed: %w", err)
}

c.httpSessionID = sessionID
logConn.Printf("Reconnected plain JSON-RPC session for serverID=%s, new sessionID=%s", c.serverID, sessionID)
logger.LogInfo("backend", "Session successfully reconnected for %s", c.serverID)
return nil
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconnectPlainJSON calls initializeHTTPSession(), which uses context.Background() internally. That means a reconnect ignores caller cancellation/deadlines and can hang until the HTTP client timeout, even if the original request context is canceled. Consider plumbing a context into initializeHTTPSession (or adding a context-aware variant) and using a bounded timeout (similar to reconnectSDKTransport) so reconnect attempts are cancelable and predictable.

Suggested change
sessionID, err := c.initializeHTTPSession()
if err != nil {
logger.LogError("backend", "Session reconnect failed for %s: %v", c.serverID, err)
return fmt.Errorf("session reconnect failed: %w", err)
}
c.httpSessionID = sessionID
logConn.Printf("Reconnected plain JSON-RPC session for serverID=%s, new sessionID=%s", c.serverID, sessionID)
logger.LogInfo("backend", "Session successfully reconnected for %s", c.serverID)
return nil
// Use a bounded context so reconnect attempts are cancelable and predictable,
// similar to reconnectSDKTransport.
connectCtx, cancel := context.WithTimeout(c.ctx, 10*time.Second)
defer cancel()
type reconnectResult struct {
sessionID string
err error
}
resultCh := make(chan reconnectResult, 1)
go func() {
sessionID, err := c.initializeHTTPSession()
resultCh <- reconnectResult{sessionID: sessionID, err: err}
}()
select {
case <-connectCtx.Done():
err := connectCtx.Err()
logger.LogError("backend", "Session reconnect canceled or timed out for %s: %v", c.serverID, err)
return fmt.Errorf("session reconnect canceled or timed out: %w", err)
case res := <-resultCh:
if res.err != nil {
logger.LogError("backend", "Session reconnect failed for %s: %v", c.serverID, res.err)
return fmt.Errorf("session reconnect failed: %w", res.err)
}
c.httpSessionID = res.sessionID
logConn.Printf("Reconnected plain JSON-RPC session for serverID=%s, new sessionID=%s", c.serverID, res.sessionID)
logger.LogInfo("backend", "Session successfully reconnected for %s", c.serverID)
return nil
}

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Safe Outputs MCP session expires mid-run, causing agent retry loops

3 participants