fix(mcp): stop the OAuth Authentication Request loop after the user clicks Cancel#2949
Conversation
The MCP SDK fires several initialize-stage RPCs concurrently. Each one that hits a 401 queues on oauthFlowMu and runs the OAuth flow. When the user clicks Cancel, the goroutine holding the mutex returns the decline and releases the mutex; the next queued goroutine sees no valid token and starts a fresh OAuth flow, re-emitting the elicitation the user just dismissed. The single-flight added in 8bc1fa9 coalesces concurrent attempts but does not latch a decline. Two changes to fix the loop without altering the happy-path flow: 1. New sentinel *OAuthDeclinedError in pkg/tools/mcp/interactive.go (plus IsOAuthDeclined helper), distinct from the existing AuthorizationRequiredError so callers can distinguish a deliberate user dismissal from a non-interactive deferral. 2. Sticky decline state on oauthTransport (lastOAuthDeclined): - authorizeOnce sets it under oauthFlowMu before the deferred Unlock fires, so any queued goroutine observes the decline before it can fire a fresh flow. - authorizeOnce short-circuits to OAuthDeclinedError at the top when the flag is set, avoiding handleOAuthFlow entirely. The MCP SDK wraps transport errors with fmt.Errorf("%w: %v", ...), dropping the original error from the unwrap chain (same reason AuthorizationRequiredError uses the lastAuthRequired flag pattern). enrichConnectError reads the new oauthDeclined() accessor and reconstitutes a clean *OAuthDeclinedError sentinel for callers. Regression tests: - TestUnmanagedOAuthFlow_ElicitationDeclineReturnsSentinel: a declined elicitation surfaces as IsOAuthDeclined, carries the server URL, and does not hit the token endpoint. - TestUnmanagedOAuthFlow_ElicitationDeclineLatchesAgainstConcurrentRoundTrips: 4 concurrent RoundTrips produce exactly ONE elicitation; every queued caller observes the latched decline and short-circuits.
… model After the transport-layer latch lands, the user-declined OAuth surfaces as an *OAuthDeclinedError out of Toolset.Start. Two follow-up issues remained: 1. The catalogs Tools() retried Start() on the same enabled wrapper on every agent loop iteration (mcp.IsAuthorizationRequired was the only short-circuit), so even with the dialog gone the catalog kept reporting "needs auth" forever and the meta-tools never re-collapsed. 2. handleEnable returned an unconditional "enabled X — proceed with the users original request; the host handles any required connection setup", added in 438058b to stop the model from stalling on YAML OAuth servers. With OAuth servers from the catalog the message is a flat promise the runtime cannot keep when the user dismisses the dialog: the model parrots it back ("On my next turn Ill have access to Notion tools") and either hallucinates results or asks for workspace details to call tools that no longer exist. Changes: - Tools(): on mcp.IsOAuthDeclined(err), call disableAfterDecline. disableAfterDecline removes the server from t.enabled, stops the inner toolset (so any partially-initialised session is cleaned up), and fires toolsChangedHandler exactly once — only when WE removed the entry, so a concurrent handleDisable / handleResetAuth / re-enable race does not double-notify. - handleEnable wording is now conditional and instructive instead of promissory: it tells the model to verify <id>_ tools are present on the next turn, and gives explicit guidance for the dismissed-dialog branch (do not claim connected, do not hallucinate tools, tell the user honestly; on re-ask treat it as implicit retry and call enable_remote_mcp_server again). Regression tests: - TestToolsOAuthDeclineRemovesServer: Start() returning OAuthDeclinedError removes the server from t.enabled, stops the inner toolset, fires tools-changed exactly once, and a second Tools() call does NOT call Start() again — this is the property that broke the "dialog re-appears every loop iteration" loop. - TestToolsOAuthDeclineNoNotifyWhenAlreadyDisabled: when the entry has been superseded by a fresh enable between the failing Start() and the cleanup, disableAfterDecline still stops the stale wrapper but does NOT evict the new entry or double-notify. - Existing TestEnableDisableLifecycle / TestEnableAPIKeyEnvPresent updated for the new wording (asserts the self-check + dismissed-dialog branch are surfaced to the model).
7ead54e to
95a2619
Compare
|
Thanks for the review. Summary of dispositions:
No code changes from this review pass. Happy to revisit either point in a focused follow-up if needed. |
|
❌ PR Review Failed — The review agent encountered an error and could not complete the review. View logs. |
aheritier
left a comment
There was a problem hiding this comment.
LGTM. CI is green for head 95a26196279cd8496dd2c6aa4dca84f0558093c2.
I reviewed the OAuth cancel loop fix across the transport and catalog layers. The transport now latches a user decline while still under oauthFlowMu, so queued concurrent initialize requests short-circuit instead of opening a fresh auth dialog (pkg/tools/mcp/oauth.go:327-369). It also reconstitutes a clean OAuthDeclinedError after the MCP SDK wrapping path via enrichConnectError (pkg/tools/mcp/remote.go:129-138).
The catalog side breaks the cross-turn retry loop by treating mcp.IsOAuthDeclined(err) as a deactivate outcome, removing the server from enabled, stopping the wrapped toolset, and notifying only when that wrapper is still the live entry (pkg/tools/builtin/mcpcatalog/mcpcatalog.go:481-512, pkg/tools/builtin/mcpcatalog/mcpcatalog.go:831-855). The model-facing enable message is also safer: it tells the model to verify <id>_ tools are actually present before proceeding, and to be honest if the dialog was dismissed (pkg/tools/builtin/mcpcatalog/mcpcatalog.go:725-729).
Tests cover the important regression paths: declined unmanaged elicitation returns the sentinel without hitting the token endpoint, concurrent RoundTrips emit exactly one elicitation, catalog cleanup removes/stops/notifies once, and stale-wrapper cleanup does not evict a replacement entry.
Non-blocking note: the managed OAuth path still appears to return a generic "user declined OAuth authorization" on decline, while this PR changes the unmanaged path. Since the PR/test scenario targets the catalog/API-host unmanaged flow and runtime API server mode opts into unmanaged OAuth, I don't think this should block, but it may be worth a follow-up if managed clients can hit the same loop.
No blocking findings.
The bug
The first time the agent enables a remote MCP server that requires OAuth (e.g.
mcp.notion.com), the host shows anAuthentication Requestdialog. Cancel used to be effectively useless.There are two distinct loops that conspire to produce this. Each is harmless on its own — together they make the cancel feel broken.
Loop A — clicking Cancel does nothing
What the user sees. They click Cancel on the Authentication Request. The dialog flips to
DENIED. Half a second later a fresh, identical Authentication Request appears. They click Cancel again. Another one appears. Trying to dismiss the prompt feels broken.What is actually happening underneath. A single connection attempt to the remote MCP server fans out into several parallel HTTP requests; every one of them comes back asking for OAuth. They queue up on a single-flight mutex (
oauthFlowMu,8bc1fa90e) so only one OAuth dance runs at a time. The mutex only protects against simultaneous OAuth — it does not remember that the user just said "no". So the moment the first goroutine returns with the cancel and releases the mutex, the next queued goroutine wakes up, sees no token, decides "I'd better start an OAuth flow", and pops the same dialog again.sequenceDiagram participant SDK as MCP SDK participant RT1 as RoundTrip #1 participant RT2 as RoundTrip #2 participant Mu as oauthFlowMu participant User SDK->>RT1: initialize-stage RPC SDK->>RT2: initialize-stage RPC (parallel) RT1->>Mu: Lock (wins) RT2->>Mu: Lock (queued) RT1->>User: elicit OAuth User-->>RT1: Cancel RT1-->>SDK: OAuthDeclinedError RT1->>Mu: Unlock Mu->>RT2: acquired RT2->>RT2: getValidToken() == nil RT2->>User: elicit OAuth (NEW dialog) Note over RT2,User: same dialog the user just dismissedLoop B — the model lies after the user cancels
What the user sees. Eventually they manage to dismiss the prompt. The agent calmly replies "The Notion server is now enabled. On my next turn I'll have access to Notion tools. Let me proceed with your request" — and then either makes up an answer, or asks for workspace details so it can call tools that do not exist. The user knows they cancelled; the agent is pretending nothing happened.
What is actually happening underneath. The catalog tells the model "enabled X — proceed with the user's original request" the moment the model calls
enable_remote_mcp_server, before the OAuth dialog has even been shown. Nothing later in the conversation tells the model "by the way, the user just cancelled". On top of that, the catalog'sTools()only treats one specific error as a soft failure (IsAuthorizationRequired= "defer this"); a user cancel falls through to the generic "transient failure" branch, so the server stays enabled in the catalog state and the next loop iteration would tryStart()again.flowchart LR A[loop iter N] --> B[Tools] B --> C[Start fails<br/>decline] C --> D{IsAuthorizationRequired?} D -->|no| E[server stays in t.enabled<br/>started=false] E --> F[loop iter N+1] F --> BThe fix
Layer 1 —
pkg/tools/mcp— latch the decline under the mutexWhat the user sees. Click Cancel once. The dialog goes away and stays away.
What changes underneath. Give the OAuth transport one new bit of memory: "did the user just cancel?". Set that bit while still holding the mutex, before any queued goroutine can wake up. Check the bit at the top of every authorization attempt; if it's set, return the cancel sentinel immediately instead of running the OAuth flow again.
The MCP SDK wraps transport errors in a way that destroys the unwrap chain (
fmt.Errorf("%w: %v", …)— the original sits in%vas text, not in the chain), soerrors.Ascan't find our sentinel after it bubbles up. That's the exact reason an existing flaglastAuthRequiredalready exists for the deferred-OAuth case; the new bit is a parallel of that, andenrichConnectErrorinremote.goreconstitutes a clean*OAuthDeclinedErrorfor callers.The queued goroutine now sees the latched cancel and never opens a fresh dialog:
sequenceDiagram participant RT1 as RoundTrip #1 participant RT2 as RoundTrip #2 participant Mu as oauthFlowMu participant User RT1->>Mu: Lock RT2->>Mu: Lock (queued) RT1->>User: elicit OAuth User-->>RT1: Cancel RT1->>RT1: lastOAuthDeclined = true RT1->>Mu: Unlock Mu->>RT2: acquired RT2->>RT2: lastOAuthDeclined? YES RT2-->>RT2: return OAuthDeclinedError Note over RT2,User: NO new dialogLayer 2 —
pkg/tools/builtin/mcpcatalog— drop the server and stop lyingWhat the user sees. After the cancel, the agent replies honestly — "the request needs you to authorize the connection — want to retry?" — instead of claiming Notion is connected. If the user re-asks the same thing (or says "yes" / "retry"), a fresh authorization dialog appears.
What changes underneath. Two small things:
Recognise the cancel as a distinct outcome. The catalog's
Tools()already special-casesIsAuthorizationRequired; we add a sibling case forIsOAuthDeclined. When a server'sStart()fails because the user dismissed the dialog, we delete the server fromt.enabled, stop the inner toolset (so any half-initialised session is cleaned up), and firetoolsChangedHandlerexactly once. ConcurrenthandleDisable/handleResetAuth/ re-enable races are handled by only doing the eviction when we still own the wrapper. After this, the next agent loop iteration enumerates a smaller tool list and does NOT retryStart()— the server is simply gone until somebody callsenable_remote_mcp_serveragain.Rewrite the
handleEnabletool result so the model can't lie. The old message was a flat promise. The new message is a conditional self-check the model can ground its next turn against: "if<id>_tools appear, proceed; if they don't, the user dismissed the dialog — be honest, and on user re-ask callenable_remote_mcp_serveragain to surface a fresh dialog".flowchart TD A[Tools] --> B[Start fails] B --> C{IsAuthorizationRequired?} C -->|yes| D[continue silently] C -->|no| E{IsOAuthDeclined?} E -->|yes - NEW| F[disableAfterDecline:<br/>delete from t.enabled<br/>Stop inner toolset<br/>fire toolsChangedHandler once] E -->|no| G[log + continue<br/>retry next iter]No registries, no TTLs, no visibility tweaks. The transport latch handles intra-
Initializeconcurrency; every cross-turn enable starts with a freshmcp.Toolset, so nothing carries over.