Problem
After a muster pod restart, sessions persisted in Valkey can trigger massive log spam. The OAuth access tokens survive in Valkey, so when a client reconnects with a still-valid token, the onAuthenticated callback fires. Since authStore.Touch(sessionID) returns false (the auth-store key didn't survive), initSSOForSession() is triggered in the background.
initSSOForSession successfully connects to downstream servers via both ID token forwarding (local servers) and RFC 8693 token exchange (remote servers). However, both paths then produce persistent 403 spam:
Phase 1: Local servers (ID token forwarding) -- self-heals
The established StreamableHTTPClient uses a dynamic headerFunc (connection_helper.go:295) that runs on every outgoing MCP request to resolve the latest ID token from the OAuth store. For stale sessions, the OAuth store lookup fails, producing:
msg="Authentication failed: no ID token in OAuth store for session QXaMAqjA... to gazelle-mcp-kubernetes, using original token"
This generates ~2 warnings/second (one per downstream server) and accounted for 76% of all errors/warnings (1,378 out of 1,813 lines) in a 30-minute window on gazelle. The noise self-heals when the access token is eventually refreshed successfully (observed ~15 minutes after startup).
On the downstream mcp-kubernetes side, the stale forwarded ID tokens cause userinfo validation failures against Dex, logged as Token validation failed: userinfo request failed with status 403. On gazelle-mcp-kubernetes, this produced 456 consecutive 403 errors at ~1/sec from 06:55 to 07:02 (from muster startup until the token self-healed). The source IP is the Envoy Gateway proxy (since muster routes through ingress).
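The dynamic-header mechanism behind this warning can be sketched as follows. This is a simplified illustration, not muster's actual code: the store type, lookup method, and wiring in connection_helper.go differ.

```go
package main

import "fmt"

// tokenStore is a stand-in for muster's OAuth store.
type tokenStore struct {
	idTokens map[string]string // sessionID -> ID token
}

func (s *tokenStore) IDTokenFor(sessionID string) (string, bool) {
	tok, ok := s.idTokens[sessionID]
	return tok, ok
}

// headerFunc runs on every outgoing MCP request. When the OAuth store has
// no ID token for the session (the stale-session case after a restart),
// it falls back to the original token and emits the warning seen in the
// logs -- once per request, hence ~1/sec per downstream server.
func headerFunc(store *tokenStore, sessionID, originalToken string) (string, bool) {
	if tok, ok := store.IDTokenFor(sessionID); ok {
		return "Bearer " + tok, true
	}
	fmt.Printf("Authentication failed: no ID token in OAuth store for session %s, using original token\n", sessionID)
	return "Bearer " + originalToken, false
}

func main() {
	store := &tokenStore{idTokens: map[string]string{"live": "id-123"}}

	hdr, ok := headerFunc(store, "live", "access-abc")
	fmt.Println(hdr, ok) // fresh session: ID token resolved

	hdr, ok = headerFunc(store, "stale", "access-abc")
	fmt.Println(hdr, ok) // stale session: warning + fallback to original token
}
```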
Phase 2: Remote servers (RFC 8693 token exchange) -- never self-heals
The same stale session also establishes connections to 16 remote installations via RFC 8693 token exchange. These exchanged tokens have an approximate 30-minute lifetime. While valid, health-check pings succeed every ~5 minutes. Once they expire, the listener retries every second with the expired token indefinitely.
Observed on alligator-mcp-kubernetes (representative of all 16 remote servers):
- Token exchange at 06:47:37 (gazelle-muster side)
- Successful health-checks at: 06:50, 06:55, 07:00, 07:05, 07:10, 07:15
- First 403 at 07:20:08 (~32.5 min after exchange, token expired)
- 403 spam continues at ~1/sec and never stops (819 errors in the log window)
On the gazelle-muster side, Phase 2 produces ~890 errors/min (~15/sec across 16 remote servers), totaling 14,038 "failed to listen to server: Token validation failed" entries in the log window.
This is worse than Phase 1 because:
- It affects every remote mcp-kubernetes instance (not just local ones)
- It never self-heals -- there is no mechanism to refresh the exchanged token for the stale session
- The error volume is ~7.5x higher than Phase 1 (890/min vs 118/min)
Root Cause Trace
- Pod restarts at T+0. Valkey retains OAuth access tokens, but auth-store session keys expire/are lost.
- Client reconnects at T+13s with a still-valid access token from Valkey.
- createAccessTokenInjectorMiddleware validates the token and fires the onAuthenticated callback (server.go:1192).
- authStore.Touch(sessionID) returns false (no auth-store entry) -> triggers initSSOForSession().
- initSSOForSession connects to downstream servers:
  - Local servers (e.g. gazelle-mcp-kubernetes): via ID token forwarding. The headerFunc fails to find an ID token in the OAuth store and falls back to the stale token. Downstream validates via Dex userinfo -> 403.
  - Remote servers (e.g. alligator-mcp-kubernetes): via RFC 8693 token exchange. The exchange succeeds, but the exchanged token expires after ~30 min with no refresh mechanism.
- Phase 1: headerFunc warning + downstream 403 at ~1/sec per local server. Self-heals when the access token is refreshed (~15 min).
- Phase 2: Listener retries with the expired exchanged token at ~1/sec per remote server. Never self-heals.
Implementation Plan
Fix A: Gate SSO initialization on having a valid ID token
In the onAuthenticated callback (server.go:1192), before calling initSSOForSession(), verify that the OAuth store actually has a usable ID token for the session. If not, skip the SSO initialization -- the connection will be established on the next request that has a valid token (e.g., after a successful refresh or new login).
Location: internal/aggregator/server.go around line 1192-1216
Approach:
- After authStore.Touch() returns false, check whether getIDTokenForForwarding(ctx, sessionID, musterIssuer) returns a non-empty token.
- If no ID token is available, log a debug message and skip initSSOForSession(). This prevents establishing downstream connections that would immediately start spamming warnings.
- This fix addresses both Phase 1 and Phase 2, since the stale session never gets downstream connections in the first place.
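A minimal sketch of the gating logic in Fix A, using stub types in place of muster's internals (the real signatures in internal/aggregator/server.go differ; getIDTokenForForwarding is simplified here to take only a session ID):

```go
package main

import "fmt"

// authStore and oauthStore are stand-ins for muster's stores.
type authStore struct{ sessions map[string]bool }

func (a *authStore) Touch(sessionID string) bool { return a.sessions[sessionID] }

type oauthStore struct{ idTokens map[string]string }

func (o *oauthStore) getIDTokenForForwarding(sessionID string) string {
	return o.idTokens[sessionID]
}

// onAuthenticated sketches the gated callback: SSO initialization only
// kicks off when the OAuth store actually holds a usable ID token.
// Returns true if initSSO was invoked.
func onAuthenticated(auth *authStore, oauth *oauthStore, sessionID string, initSSO func(string)) bool {
	if auth.Touch(sessionID) {
		return false // session already known, nothing to initialize
	}
	if oauth.getIDTokenForForwarding(sessionID) == "" {
		// Fix A: no usable ID token -> skip SSO init. The connection
		// will be established on a later request with a valid token.
		fmt.Printf("skipping SSO init for session %s: no ID token in OAuth store\n", sessionID)
		return false
	}
	initSSO(sessionID)
	return true
}

func main() {
	auth := &authStore{sessions: map[string]bool{}}
	oauth := &oauthStore{idTokens: map[string]string{"good": "id-1"}}

	onAuthenticated(auth, oauth, "stale", func(s string) { fmt.Println("initSSOForSession:", s) })
	onAuthenticated(auth, oauth, "good", func(s string) { fmt.Println("initSSOForSession:", s) })
}
```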
Fix B: Rate-limit the headerFunc warning
The headerFunc in connection_helper.go:295 logs "Authentication failed: no ID token in OAuth store..." on every single request. This should be rate-limited to avoid flooding logs even if Fix A doesn't catch all edge cases.
Location: internal/aggregator/connection_helper.go around line 295-300
Approach:
- Use a per-session+server rate limiter (e.g., sync.Once or a timestamp check) so the warning is logged at most once per minute per session+server pair.
- After the first warning, subsequent failures within the rate-limit window should log at DEBUG level instead of WARN.
- When the token lookup succeeds again (self-heal), log the recovery at INFO level (the existing "Token expired, refreshing" message already does this).
- Note: this only reduces muster-side log noise for Phase 1. It does not prevent the downstream 403 spam on mcp-kubernetes instances.
Acceptance Criteria
- After a pod restart with stale sessions in Valkey, the "no ID token in OAuth store" warning appears at most once per session+server pair per minute (not ~1/sec)
- SSO connections are not established for sessions without a valid ID token in the OAuth store
- Sessions that later get a valid token (via successful refresh or re-login) still get their SSO connections established on the next request
- No downstream mcp-kubernetes instances receive 403 spam from stale session listeners
- No existing BDD test scenarios are broken
- Unit tests cover both fix paths