…session timeout

When an MCP backend session expires (e.g. after 30+ minutes of inactivity on the gateway→safeoutputs connection), the gateway now:

1. Detects the "session not found" HTTP 404 response from the backend.
2. Transparently re-initialises a new session (reconnectPlainJSON).
3. Retries the original request exactly once with the new session.

For SDK-based transports (streamable HTTP / SSE), the same reconnect-and-retry pattern is applied via callSDKMethodWithReconnect whenever the SDK error message contains "session not found".

Additionally, the server-side idle SessionTimeout is increased from 30 minutes to 2 hours so that long-running agent workflows (lake build, etc.) do not cause the agent→gateway session to expire during periods of no MCP activity.

Fixes: MCP session expires mid-run, causing agent retry loops

Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/a43c25fb-e9bc-4fc8-865e-19acf449fa21
Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
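The detect-reconnect-retry flow for the plain JSON-RPC transport can be sketched as follows. This is a minimal illustration, not the gateway's actual code: `isSessionNotFound` approximates the detection helper, and the function-valued parameters stand in for the real `sendHTTPRequest`/`reconnectPlainJSON` machinery.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// isSessionNotFound reports whether a backend response looks like an expired
// session: HTTP 404 with "session not found" somewhere in the body.
func isSessionNotFound(status int, body string) bool {
	return status == http.StatusNotFound &&
		strings.Contains(strings.ToLower(body), "session not found")
}

// sendWithRetry sends a request; on a session-not-found response it
// re-initialises the session and replays the original request exactly once.
// send and reconnect are stand-ins for the real transport methods.
func sendWithRetry(send func() (int, string), reconnect func() error) (string, error) {
	status, body := send()
	if !isSessionNotFound(status, body) {
		return body, nil
	}
	if err := reconnect(); err != nil {
		return "", fmt.Errorf("session reconnect failed: %w", err)
	}
	_, body = send() // retry exactly once with the fresh session
	return body, nil
}

func main() {
	calls := 0
	send := func() (int, string) {
		calls++
		if calls == 1 {
			return 404, `{"error":"session not found"}`
		}
		return 200, `{"result":"ok"}`
	}
	body, err := sendWithRetry(send, func() error { return nil })
	fmt.Println(body, err, calls)
}
```

Retrying exactly once keeps the failure mode bounded: if the reconnected session also reports "session not found", the error surfaces to the caller instead of looping.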
…g reconnect failures

- Use sync.RWMutex (sessionMu) instead of sync.Mutex so concurrent readers don't block
- Add getSDKSession() and getHTTPSessionID() helpers that snapshot fields under RLock
- Update all c.session reads in method bodies to use getSDKSession()
- Update buildSessionHeaderModifier to use getHTTPSessionID() (eliminates data race)
- Log reconnect failure in callSDKMethodWithReconnect when reconnErr != nil
- Improve SessionTimeout comment in transport.go

Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/a43c25fb-e9bc-4fc8-865e-19acf449fa21
Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
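The RWMutex scheme above can be sketched like this. The struct is illustrative, not the gateway's real connection type; `setHTTPSessionID` is a hypothetical write-side counterpart (in the commit, the exclusive Lock is taken inside the reconnect functions instead).

```go
package main

import (
	"fmt"
	"sync"
)

// connection sketches the locking scheme: a single RWMutex guards the
// mutable session fields so concurrent readers never block each other,
// while reconnects take the exclusive write lock.
type connection struct {
	sessionMu     sync.RWMutex
	httpSessionID string
}

// getHTTPSessionID snapshots the session ID under a read lock, mirroring
// the helper named in the commit message.
func (c *connection) getHTTPSessionID() string {
	c.sessionMu.RLock()
	defer c.sessionMu.RUnlock()
	return c.httpSessionID
}

// setHTTPSessionID (hypothetical) shows the write side: reconnect paths
// hold the full Lock while swapping in the new session.
func (c *connection) setHTTPSessionID(id string) {
	c.sessionMu.Lock()
	defer c.sessionMu.Unlock()
	c.httpSessionID = id
}

func main() {
	c := &connection{}
	c.setHTTPSessionID("session-2")
	fmt.Println(c.getHTTPSessionID())
}
```

The RWMutex matters because session reads happen on every request (e.g. when building the session header), while writes only happen on the rare reconnect path.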
Pull request overview
Fixes long-running MCP agent workflows failing due to expired backend sessions by adding transparent client-side session reconnection and extending the server’s idle session timeout.
Changes:
- Add automatic reconnect-and-retry behavior for plain JSON-RPC HTTP requests on 404 + "session not found".
- Add automatic reconnect-and-retry wrapper for SDK-based transports (streamable HTTP / SSE) when encountering "session not found" errors, with RWMutex protection around mutable session state.
- Increase gateway inbound session idle timeout from 30m to 2h and add targeted unit tests for the new reconnect detection/behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| internal/mcp/connection.go | Adds session locking plus reconnect-and-retry for SDK transports when sessions expire. |
| internal/mcp/http_transport.go | Adds “session not found” detection helpers and reconnect-and-retry for plain JSON-RPC HTTP transport. |
| internal/mcp/http_transport_test.go | Adds unit tests for session-not-found detection and plain JSON-RPC reconnect behavior. |
| internal/server/transport.go | Extends server-side session timeout to reduce idle expiration during long workflows. |
```go
require.Len(t, receivedSessionIDs, 2, "expected two tool calls: initial (failed) + retry (succeeded)")
assert.Equal(t, "session-1", receivedSessionIDs[0], "first attempt should use the initial session")
assert.Equal(t, "session-2", receivedSessionIDs[1], "retry should use the reconnected session")
assert.Equal(t, "session-2", conn.httpSessionID, "connection should store the new session ID")
```
This test reads conn.httpSessionID directly, but the production code now guards session fields with sessionMu. Accessing the field without the lock can trigger data races under -race. Prefer asserting via conn.getHTTPSessionID() (or another lock-protected accessor) instead of reading the struct field directly.
Suggested change:

```diff
-assert.Equal(t, "session-2", conn.httpSessionID, "connection should store the new session ID")
+assert.Equal(t, "session-2", conn.getHTTPSessionID(), "connection should store the new session ID")
```
```go
// Check for HTTP errors after parsing.
// If we have a non-OK status but successfully parsed a JSON-RPC response,
// pass it through (it may already contain an error field).
if result.StatusCode != http.StatusOK {
```
The comment here suggests that for non-200 HTTP statuses we might still "successfully parse a JSON-RPC response" and pass it through. However, parseJSONRPCResponseWithSSE currently returns a synthetic JSON-RPC error for any statusCode != 200, even when the body is valid JSON-RPC, so this branch and its comment are misleading and may hide server-provided error details. Consider either (a) adjusting parseJSONRPCResponseWithSSE/parseHTTPResult to preserve parsed JSON-RPC error bodies on non-200 responses, or (b) updating the comment/logging and simplifying the non-OK handling accordingly.
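Option (a) could look roughly like this sketch: try to parse the non-200 body as a JSON-RPC response first, and only synthesize a generic error when that fails. The type and function names here are illustrative, not the file's actual ones.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type rpcError struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

type rpcResponse struct {
	JSONRPC string          `json:"jsonrpc"`
	ID      json.RawMessage `json:"id,omitempty"`
	Result  json.RawMessage `json:"result,omitempty"`
	Error   *rpcError       `json:"error,omitempty"`
}

// passThroughOrSynthesize preserves a server-provided JSON-RPC error on
// non-200 responses, and falls back to a synthetic error only when the
// body is not a valid JSON-RPC error response.
func passThroughOrSynthesize(statusCode int, body []byte) rpcResponse {
	var resp rpcResponse
	if err := json.Unmarshal(body, &resp); err == nil && resp.JSONRPC == "2.0" && resp.Error != nil {
		return resp // keep the backend's own error details
	}
	return rpcResponse{
		JSONRPC: "2.0",
		Error:   &rpcError{Code: -32000, Message: fmt.Sprintf("HTTP %d from backend", statusCode)},
	}
}

func main() {
	body := []byte(`{"jsonrpc":"2.0","id":1,"error":{"code":-32001,"message":"session not found"}}`)
	fmt.Println(passThroughOrSynthesize(404, body).Error.Message) // prints "session not found"
}
```

Preserving the parsed body matters for the reconnect logic too: the "session not found" detail is exactly what the retry path keys on.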
```go
sessionID, err := c.initializeHTTPSession()
if err != nil {
	logger.LogError("backend", "Session reconnect failed for %s: %v", c.serverID, err)
	return fmt.Errorf("session reconnect failed: %w", err)
}

c.httpSessionID = sessionID
logConn.Printf("Reconnected plain JSON-RPC session for serverID=%s, new sessionID=%s", c.serverID, sessionID)
logger.LogInfo("backend", "Session successfully reconnected for %s", c.serverID)
return nil
```
reconnectPlainJSON calls initializeHTTPSession(), which uses context.Background() internally. That means a reconnect ignores caller cancellation/deadlines and can hang until the HTTP client timeout, even if the original request context is canceled. Consider plumbing a context into initializeHTTPSession (or adding a context-aware variant) and using a bounded timeout (similar to reconnectSDKTransport) so reconnect attempts are cancelable and predictable.
Suggested change:

```diff
-sessionID, err := c.initializeHTTPSession()
-if err != nil {
-	logger.LogError("backend", "Session reconnect failed for %s: %v", c.serverID, err)
-	return fmt.Errorf("session reconnect failed: %w", err)
-}
-c.httpSessionID = sessionID
-logConn.Printf("Reconnected plain JSON-RPC session for serverID=%s, new sessionID=%s", c.serverID, sessionID)
-logger.LogInfo("backend", "Session successfully reconnected for %s", c.serverID)
-return nil
+// Use a bounded context so reconnect attempts are cancelable and predictable,
+// similar to reconnectSDKTransport.
+connectCtx, cancel := context.WithTimeout(c.ctx, 10*time.Second)
+defer cancel()
+type reconnectResult struct {
+	sessionID string
+	err       error
+}
+resultCh := make(chan reconnectResult, 1)
+go func() {
+	sessionID, err := c.initializeHTTPSession()
+	resultCh <- reconnectResult{sessionID: sessionID, err: err}
+}()
+select {
+case <-connectCtx.Done():
+	err := connectCtx.Err()
+	logger.LogError("backend", "Session reconnect canceled or timed out for %s: %v", c.serverID, err)
+	return fmt.Errorf("session reconnect canceled or timed out: %w", err)
+case res := <-resultCh:
+	if res.err != nil {
+		logger.LogError("backend", "Session reconnect failed for %s: %v", c.serverID, res.err)
+		return fmt.Errorf("session reconnect failed: %w", res.err)
+	}
+	c.httpSessionID = res.sessionID
+	logConn.Printf("Reconnected plain JSON-RPC session for serverID=%s, new sessionID=%s", c.serverID, res.sessionID)
+	logger.LogInfo("backend", "Session successfully reconnected for %s", c.serverID)
+	return nil
+}
```
Long-running agent workflows (~30+ min) fail when the gateway's HTTP session to a backend (e.g. safeoutputs) expires mid-run. The agent receives "session not found" from the backend and enters a futile retry loop rather than recovering.

Changes

Client-side session reconnect (internal/mcp/connection.go, http_transport.go)
- sendHTTPRequest detects an HTTP 404 + "session not found" body, calls reconnectPlainJSON() to re-initialize the session, then retries the original request once.
- callSDKMethodWithReconnect() wraps all SDK method calls: on a "session not found" error it calls reconnectSDKTransport() (closes the dead session, dials a new one) and retries once. SendRequestWithServerID now routes through this wrapper for SDK transports.
- sessionMu sync.RWMutex protects httpSessionID, session, and client. All reads go through getSDKSession()/getHTTPSessionID() (under RLock); reconnect functions hold the full Lock.

Server-side idle timeout (internal/server/transport.go)
- SessionTimeout increased from 30m to 2h so the agent→gateway inbound session doesn't expire during extended periods of no MCP activity (e.g. a long lake build).

Tests (internal/mcp/http_transport_test.go)
- Added TestSendHTTPRequest_ReconnectsOnSessionNotFound, TestSendHTTPRequest_ReconnectFailure, and TestSendHTTPRequest_NoReconnectOnOtherErrors, plus unit tests for isSessionNotFoundError and isSessionNotFoundHTTPResponse.
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
- example.com (dns block)
- invalid-host-that-does-not-exist-12345.com (dns block)
- nonexistent.local (dns block)
- slow.example.com (dns block)
- this-host-does-not-exist-12345.com (dns block)

If you need me to access, download, or install something from one of these locations, you can either: