Problem
After a muster pod restart, sessions persisted in Valkey can trigger massive log spam. The OAuth access tokens survive in Valkey, so when a client reconnects with a still-valid token, the onAuthenticated callback fires. Since authStore.Touch(sessionID) returns false (the auth-store key didn't survive), initSSOForSession() is triggered in the background.
initSSOForSession successfully connects to downstream servers via both ID token forwarding (local servers) and RFC 8693 token exchange (remote servers). However, both paths then produce persistent 403 spam:
Phase 1: Local servers (ID token forwarding) -- self-heals
The established StreamableHTTPClient uses a dynamic headerFunc (connection_helper.go:295) that runs on every outgoing MCP request to resolve the latest ID token from the OAuth store. For stale sessions, the OAuth store lookup fails, producing:
msg="Authentication failed: no ID token in OAuth store for session QXaMAqjA... to gazelle-mcp-kubernetes, using original token"
This generates ~2 warnings/second (one per downstream server) and accounted for 76% of all errors/warnings (1,378 out of 1,813 lines) in a 30-minute window on gazelle. The noise self-heals when the access token is eventually refreshed successfully (observed ~15 minutes after startup).
On the downstream mcp-kubernetes side, the stale forwarded ID tokens cause userinfo validation failures against Dex, logged as Token validation failed: userinfo request failed with status 403. On gazelle-mcp-kubernetes, this produced 456 consecutive 403 errors at ~1/sec from 06:55 to 07:02 (from muster startup until the token self-healed). The source IP is the Envoy Gateway proxy (since muster routes through ingress).
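The dynamic-header mechanism behind this warning can be sketched as follows. This is a simplified illustration, not muster's actual code: the store type, lookup method, and wiring in connection_helper.go differ.

```go
package main

import "fmt"

// tokenStore is a stand-in for muster's OAuth store.
type tokenStore struct {
	idTokens map[string]string // sessionID -> ID token
}

func (s *tokenStore) IDTokenFor(sessionID string) (string, bool) {
	tok, ok := s.idTokens[sessionID]
	return tok, ok
}

// headerFunc runs on every outgoing MCP request. When the OAuth store has
// no ID token for the session (the stale-session case after a restart),
// it falls back to the original token and emits the warning seen in the
// logs -- once per request, hence ~1/sec per downstream server.
func headerFunc(store *tokenStore, sessionID, originalToken string) (string, bool) {
	if tok, ok := store.IDTokenFor(sessionID); ok {
		return "Bearer " + tok, true
	}
	fmt.Printf("Authentication failed: no ID token in OAuth store for session %s, using original token\n", sessionID)
	return "Bearer " + originalToken, false
}

func main() {
	store := &tokenStore{idTokens: map[string]string{"live": "id-123"}}

	hdr, ok := headerFunc(store, "live", "access-abc")
	fmt.Println(hdr, ok) // fresh session: ID token resolved

	hdr, ok = headerFunc(store, "stale", "access-abc")
	fmt.Println(hdr, ok) // stale session: warning + fallback to original token
}
```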
Phase 2: Remote servers (RFC 8693 token exchange) -- never self-heals
The same stale session also establishes connections to 16 remote installations via RFC 8693 token exchange. These exchanged tokens have an approximate 30-minute lifetime. While valid, health-check pings succeed every ~5 minutes. Once they expire, the listener retries every second with the expired token indefinitely.
Observed on alligator-mcp-kubernetes (representative of all 16 remote servers):
- Token exchange at 06:47:37 (gazelle-muster side)
- Successful health-checks at: 06:50, 06:55, 07:00, 07:05, 07:10, 07:15
- First 403 at 07:20:08 (~32.5 min after exchange, token expired)
- 403 spam continues at ~1/sec and never stops (819 errors in the log window)
On the gazelle-muster side, Phase 2 produces ~890 errors/min (~15/sec across 16 remote servers), totaling 14,038 "failed to listen to server: Token validation failed" entries in the log window.
This is worse than Phase 1 because:
- It affects every remote mcp-kubernetes instance (not just local ones)
- It never self-heals -- there is no mechanism to refresh the exchanged token for the stale session
- The error volume is ~7.5x higher than Phase 1 (890/min vs 118/min)
Root Cause Trace
- Pod restarts at T+0. Valkey retains OAuth access tokens, but auth-store session keys expire/are lost.
- Client reconnects at T+13s with a still-valid access token from Valkey.
- createAccessTokenInjectorMiddleware validates the token and fires the onAuthenticated callback (server.go:1192).
- authStore.Touch(sessionID) returns false (no auth-store entry) -> triggers initSSOForSession().
- initSSOForSession connects to downstream servers:
  - Local servers (e.g. gazelle-mcp-kubernetes): via ID token forwarding. The headerFunc fails to find an ID token in the OAuth store and falls back to the stale token. Downstream validates via Dex userinfo -> 403.
  - Remote servers (e.g. alligator-mcp-kubernetes): via RFC 8693 token exchange. The exchange succeeds, but the exchanged token expires after ~30 min with no refresh mechanism.
- Phase 1: headerFunc warning + downstream 403 at ~1/sec per local server. Self-heals when the access token is refreshed (~15 min).
- Phase 2: Listener retries with the expired exchanged token at ~1/sec per remote server. Never self-heals.
Implementation Plan
Fix A: Gate SSO initialization on having a valid ID token
In the onAuthenticated callback (server.go:1192), before calling initSSOForSession(), verify that the OAuth store actually has a usable ID token for the session. If not, skip the SSO initialization -- the connection will be established on the next request that has a valid token (e.g., after a successful refresh or new login).
Location: internal/aggregator/server.go around line 1192-1216
Approach:
- After authStore.Touch() returns false, check whether getIDTokenForForwarding(ctx, sessionID, musterIssuer) returns a non-empty token.
- If no ID token is available, log a debug message and skip initSSOForSession(). This prevents establishing downstream connections that would immediately start spamming warnings.
- This fix addresses both Phase 1 and Phase 2, since the stale session never gets downstream connections in the first place.
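A minimal sketch of the gating logic in Fix A, using stub types in place of muster's internals (the real signatures in internal/aggregator/server.go differ; getIDTokenForForwarding is simplified here to take only a session ID):

```go
package main

import "fmt"

// authStore and oauthStore are stand-ins for muster's stores.
type authStore struct{ sessions map[string]bool }

func (a *authStore) Touch(sessionID string) bool { return a.sessions[sessionID] }

type oauthStore struct{ idTokens map[string]string }

func (o *oauthStore) getIDTokenForForwarding(sessionID string) string {
	return o.idTokens[sessionID]
}

// onAuthenticated sketches the gated callback: SSO initialization only
// kicks off when the OAuth store actually holds a usable ID token.
// Returns true if initSSO was invoked.
func onAuthenticated(auth *authStore, oauth *oauthStore, sessionID string, initSSO func(string)) bool {
	if auth.Touch(sessionID) {
		return false // session already known, nothing to initialize
	}
	if oauth.getIDTokenForForwarding(sessionID) == "" {
		// Fix A: no usable ID token -> skip SSO init. The connection
		// will be established on a later request with a valid token.
		fmt.Printf("skipping SSO init for session %s: no ID token in OAuth store\n", sessionID)
		return false
	}
	initSSO(sessionID)
	return true
}

func main() {
	auth := &authStore{sessions: map[string]bool{}}
	oauth := &oauthStore{idTokens: map[string]string{"good": "id-1"}}

	onAuthenticated(auth, oauth, "stale", func(s string) { fmt.Println("initSSOForSession:", s) })
	onAuthenticated(auth, oauth, "good", func(s string) { fmt.Println("initSSOForSession:", s) })
}
```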
Fix B: Rate-limit the headerFunc warning
The headerFunc in connection_helper.go:295 logs "Authentication failed: no ID token in OAuth store..." on every single request. This should be rate-limited to avoid flooding logs even if Fix A doesn't catch all edge cases.
Location: internal/aggregator/connection_helper.go around line 295-300
Approach:
- Use a per-session+server rate limiter (e.g., sync.Once or a timestamp check) so the warning is logged at most once per minute per session+server pair.
- After the first warning, subsequent failures within the rate-limit window should log at DEBUG level instead of WARN.
- When the token lookup succeeds again (self-heal), log the recovery at INFO level (the existing "Token expired, refreshing" message already does this).
- Note: this only reduces muster-side log noise for Phase 1. It does not prevent the downstream 403 spam on mcp-kubernetes instances.
Acceptance Criteria
- After a pod restart with stale sessions in Valkey, the "no ID token in OAuth store" warning appears at most once per session+server pair per minute (not ~1/sec)
- SSO connections are not established for sessions without a valid ID token in the OAuth store
- Sessions that later get a valid token (via successful refresh or re-login) still get their SSO connections established on the next request
- No downstream mcp-kubernetes instances receive 403 spam from stale session listeners
- No existing BDD test scenarios are broken
- Unit tests cover both fix paths