Skip to content

fix(mcp): stop the OAuth Authentication Request loop after the user clicks Cancel#2949

Merged
trungutt merged 2 commits into
docker:mainfrom
trungutt:fix/mcp-oauth-cancel-loop
Jun 1, 2026
Merged

fix(mcp): stop the OAuth Authentication Request loop after the user clicks Cancel#2949
trungutt merged 2 commits into
docker:mainfrom
trungutt:fix/mcp-oauth-cancel-loop

Conversation

@trungutt
Copy link
Copy Markdown
Contributor

@trungutt trungutt commented Jun 1, 2026

The bug

The first time the agent enables a remote MCP server that requires OAuth (e.g. mcp.notion.com), the host shows an Authentication Request dialog. Cancel used to be effectively useless.

There are two distinct loops that conspire to produce this. Each is harmless on its own — together they make the cancel feel broken.

Loop A — clicking Cancel does nothing

What the user sees. They click Cancel on the Authentication Request. The dialog flips to DENIED. Half a second later a fresh, identical Authentication Request appears. They click Cancel again. Another one appears. Trying to dismiss the prompt feels broken.

What is actually happening underneath. A single connection attempt to the remote MCP server fans out into several parallel HTTP requests; every one of them comes back asking for OAuth. They queue up on a single-flight mutex (oauthFlowMu, 8bc1fa90e) so only one OAuth dance runs at a time. The mutex only protects against simultaneous OAuth — it does not remember that the user just said "no". So the moment the first goroutine returns with the cancel and releases the mutex, the next queued goroutine wakes up, sees no token, decides "I'd better start an OAuth flow", and pops the same dialog again.

sequenceDiagram
    participant SDK as MCP SDK
    participant RT1 as RoundTrip #1
    participant RT2 as RoundTrip #2
    participant Mu as oauthFlowMu
    participant User

    SDK->>RT1: initialize-stage RPC
    SDK->>RT2: initialize-stage RPC (parallel)
    RT1->>Mu: Lock (wins)
    RT2->>Mu: Lock (queued)
    RT1->>User: elicit OAuth
    User-->>RT1: Cancel
    RT1-->>SDK: OAuthDeclinedError
    RT1->>Mu: Unlock
    Mu->>RT2: acquired
    RT2->>RT2: getValidToken() == nil
    RT2->>User: elicit OAuth (NEW dialog)
    Note over RT2,User: same dialog the user just dismissed
Loading

Loop B — the model lies after the user cancels

What the user sees. Eventually they manage to dismiss the prompt. The agent calmly replies "The Notion server is now enabled. On my next turn I'll have access to Notion tools. Let me proceed with your request" — and then either makes up an answer, or asks for workspace details so it can call tools that do not exist. The user knows they cancelled; the agent is pretending nothing happened.

What is actually happening underneath. The catalog tells the model "enabled X — proceed with the user's original request" the moment the model calls enable_remote_mcp_server, before the OAuth dialog has even been shown. Nothing later in the conversation tells the model "by the way, the user just cancelled". On top of that, the catalog's Tools() only treats one specific error as a soft failure (IsAuthorizationRequired = "defer this"); a user cancel falls through to the generic "transient failure" branch, so the server stays enabled in the catalog state and the next loop iteration would try Start() again.

flowchart LR
    A[loop iter N] --> B[Tools]
    B --> C[Start fails<br/>decline]
    C --> D{IsAuthorizationRequired?}
    D -->|no| E[server stays in t.enabled<br/>started=false]
    E --> F[loop iter N+1]
    F --> B
Loading

The fix

Layer 1 — pkg/tools/mcp — latch the decline under the mutex

What the user sees. Click Cancel once. The dialog goes away and stays away.

What changes underneath. Give the OAuth transport one new bit of memory: "did the user just cancel?". Set that bit while still holding the mutex, before any queued goroutine can wake up. Check the bit at the top of every authorization attempt; if it's set, return the cancel sentinel immediately instead of running the OAuth flow again.

The MCP SDK wraps transport errors in a way that destroys the unwrap chain (fmt.Errorf("%w: %v", …) — the original sits in %v as text, not in the chain), so errors.As can't find our sentinel after it bubbles up. That's the exact reason an existing flag lastAuthRequired already exists for the deferred-OAuth case; the new bit is a parallel of that, and enrichConnectError in remote.go reconstitutes a clean *OAuthDeclinedError for callers.

oauthTransport {
    + lastOAuthDeclined bool       // NEW
}

authorizeOnce(ctx) error {
    Lock(oauthFlowMu); defer Unlock

    if validToken { return nil }

    if lastOAuthDeclined {                     // NEW short-circuit
        return &OAuthDeclinedError{URL}
    }

    err = handleOAuthFlow(ctx)
    if IsOAuthDeclined(err) {                  // NEW latch
        lastOAuthDeclined = true               // BEFORE deferred Unlock
    }
    return err
}

The queued goroutine now sees the latched cancel and never opens a fresh dialog:

sequenceDiagram
    participant RT1 as RoundTrip #1
    participant RT2 as RoundTrip #2
    participant Mu as oauthFlowMu
    participant User

    RT1->>Mu: Lock
    RT2->>Mu: Lock (queued)
    RT1->>User: elicit OAuth
    User-->>RT1: Cancel
    RT1->>RT1: lastOAuthDeclined = true
    RT1->>Mu: Unlock
    Mu->>RT2: acquired
    RT2->>RT2: lastOAuthDeclined? YES
    RT2-->>RT2: return OAuthDeclinedError
    Note over RT2,User: NO new dialog
Loading

Layer 2 — pkg/tools/builtin/mcpcatalog — drop the server and stop lying

What the user sees. After the cancel, the agent replies honestly — "the request needs you to authorize the connection — want to retry?" — instead of claiming Notion is connected. If the user re-asks the same thing (or says "yes" / "retry"), a fresh authorization dialog appears.

What changes underneath. Two small things:

  1. Recognise the cancel as a distinct outcome. The catalog's Tools() already special-cases IsAuthorizationRequired; we add a sibling case for IsOAuthDeclined. When a server's Start() fails because the user dismissed the dialog, we delete the server from t.enabled, stop the inner toolset (so any half-initialised session is cleaned up), and fire toolsChangedHandler exactly once. Concurrent handleDisable / handleResetAuth / re-enable races are handled by only doing the eviction when we still own the wrapper. After this, the next agent loop iteration enumerates a smaller tool list and does NOT retry Start() — the server is simply gone until somebody calls enable_remote_mcp_server again.

  2. Rewrite the handleEnable tool result so the model can't lie. The old message was a flat promise. The new message is a conditional self-check the model can ground its next turn against: "if <id>_ tools appear, proceed; if they don't, the user dismissed the dialog — be honest, and on user re-ask call enable_remote_mcp_server again to surface a fresh dialog".

flowchart TD
    A[Tools] --> B[Start fails]
    B --> C{IsAuthorizationRequired?}
    C -->|yes| D[continue silently]
    C -->|no| E{IsOAuthDeclined?}
    E -->|yes - NEW| F[disableAfterDecline:<br/>delete from t.enabled<br/>Stop inner toolset<br/>fire toolsChangedHandler once]
    E -->|no| G[log + continue<br/>retry next iter]
Loading
old: "enabled <id> — the server's tools will appear on your next
      turn. Proceed with the user's original request; the host
      handles any required connection setup."

new: "enable requested for <id>. The host will surface any required
      authorization dialog. On your next turn:
        - if tools whose names start with <id>_ appear, proceed
          with the user's original request using them.
        - if NO such tools appear, the user dismissed the dialog.
          Tell the user honestly; on user re-ask (or yes/retry),
          call enable_remote_mcp_server again for a fresh dialog."

No registries, no TTLs, no visibility tweaks. The transport latch handles intra-Initialize concurrency; every cross-turn enable starts with a fresh mcp.Toolset, so nothing carries over.

@aheritier aheritier added area/mcp MCP protocol, MCP tool servers, integration kind/fix PR fixes a bug (maps to fix: commit prefix) labels Jun 1, 2026
trungutt added 2 commits June 1, 2026 16:33
The MCP SDK fires several initialize-stage RPCs concurrently. Each one
that hits a 401 queues on oauthFlowMu and runs the OAuth flow. When the
user clicks Cancel, the goroutine holding the mutex returns the decline
and releases the mutex; the next queued goroutine sees no valid token
and starts a fresh OAuth flow, re-emitting the elicitation the user
just dismissed. The single-flight added in 8bc1fa9 coalesces
concurrent attempts but does not latch a decline.

Two changes to fix the loop without altering the happy-path flow:

1. New sentinel *OAuthDeclinedError in pkg/tools/mcp/interactive.go
   (plus IsOAuthDeclined helper), distinct from the existing
   AuthorizationRequiredError so callers can distinguish a deliberate
   user dismissal from a non-interactive deferral.

2. Sticky decline state on oauthTransport (lastOAuthDeclined):
   - authorizeOnce sets it under oauthFlowMu before the deferred
     Unlock fires, so any queued goroutine observes the decline
     before it can fire a fresh flow.
   - authorizeOnce short-circuits to OAuthDeclinedError at the top
     when the flag is set, avoiding handleOAuthFlow entirely.

The MCP SDK wraps transport errors with fmt.Errorf("%w: %v", ...),
dropping the original error from the unwrap chain (same reason
AuthorizationRequiredError uses the lastAuthRequired flag pattern).
enrichConnectError reads the new oauthDeclined() accessor and
reconstitutes a clean *OAuthDeclinedError sentinel for callers.

Regression tests:

- TestUnmanagedOAuthFlow_ElicitationDeclineReturnsSentinel: a
  declined elicitation surfaces as IsOAuthDeclined, carries the
  server URL, and does not hit the token endpoint.
- TestUnmanagedOAuthFlow_ElicitationDeclineLatchesAgainstConcurrentRoundTrips:
  4 concurrent RoundTrips produce exactly ONE elicitation; every
  queued caller observes the latched decline and short-circuits.
… model

After the transport-layer latch lands, the user-declined OAuth surfaces
as an *OAuthDeclinedError out of Toolset.Start. Two follow-up issues
remained:

1. The catalogs Tools() retried Start() on the same enabled wrapper on
   every agent loop iteration (mcp.IsAuthorizationRequired was the only
   short-circuit), so even with the dialog gone the catalog kept
   reporting "needs auth" forever and the meta-tools never re-collapsed.

2. handleEnable returned an unconditional "enabled X — proceed with the
   users original request; the host handles any required connection
   setup", added in 438058b to stop the model from stalling on YAML
   OAuth servers. With OAuth servers from the catalog the message is a
   flat promise the runtime cannot keep when the user dismisses the
   dialog: the model parrots it back ("On my next turn Ill have access
   to Notion tools") and either hallucinates results or asks for
   workspace details to call tools that no longer exist.

Changes:

- Tools(): on mcp.IsOAuthDeclined(err), call disableAfterDecline.
  disableAfterDecline removes the server from t.enabled, stops the
  inner toolset (so any partially-initialised session is cleaned up),
  and fires toolsChangedHandler exactly once — only when WE removed
  the entry, so a concurrent handleDisable / handleResetAuth /
  re-enable race does not double-notify.

- handleEnable wording is now conditional and instructive instead of
  promissory: it tells the model to verify <id>_ tools are present on
  the next turn, and gives explicit guidance for the dismissed-dialog
  branch (do not claim connected, do not hallucinate tools, tell the
  user honestly; on re-ask treat it as implicit retry and call
  enable_remote_mcp_server again).

Regression tests:

- TestToolsOAuthDeclineRemovesServer: Start() returning
  OAuthDeclinedError removes the server from t.enabled, stops the
  inner toolset, fires tools-changed exactly once, and a second
  Tools() call does NOT call Start() again — this is the property
  that broke the "dialog re-appears every loop iteration" loop.
- TestToolsOAuthDeclineNoNotifyWhenAlreadyDisabled: when the entry
  has been superseded by a fresh enable between the failing Start()
  and the cleanup, disableAfterDecline still stops the stale wrapper
  but does NOT evict the new entry or double-notify.
- Existing TestEnableDisableLifecycle / TestEnableAPIKeyEnvPresent
  updated for the new wording (asserts the self-check + dismissed-dialog
  branch are surfaced to the model).
@trungutt trungutt force-pushed the fix/mcp-oauth-cancel-loop branch from 7ead54e to 95a2619 Compare June 1, 2026 14:33
docker-agent

This comment was marked as off-topic.

docker-agent

This comment was marked as off-topic.

@trungutt
Copy link
Copy Markdown
Contributor Author

trungutt commented Jun 1, 2026

Thanks for the review. Summary of dispositions:

Finding Disposition
🔴 [HIGH] handleManagedOAuthFlow missing sentinel Intentionally out of scope — see inline reply. This PR targets the unmanaged path exclusively (the deployment we ship does not use managed OAuth).
🟡 [LOW] disableAfterDecline does not swallow DeadlineExceeded Intentionally left alone for consistency — see inline reply. The pattern matches the two sibling callers (handleDisable / handleResetAuth); changing one in isolation would be worse than leaving them aligned.

No code changes from this review pass. Happy to revisit either point in a focused follow-up if needed.

@trungutt trungutt marked this pull request as ready for review June 1, 2026 14:42
@trungutt trungutt requested a review from a team as a code owner June 1, 2026 14:42
@trungutt trungutt merged commit 7882b41 into docker:main Jun 1, 2026
20 checks passed
@docker-agent
Copy link
Copy Markdown

PR Review Failed — The review agent encountered an error and could not complete the review. View logs.

Copy link
Copy Markdown
Contributor

@aheritier aheritier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. CI is green for head 95a26196279cd8496dd2c6aa4dca84f0558093c2.

I reviewed the OAuth cancel loop fix across the transport and catalog layers. The transport now latches a user decline while still under oauthFlowMu, so queued concurrent initialize requests short-circuit instead of opening a fresh auth dialog (pkg/tools/mcp/oauth.go:327-369). It also reconstitutes a clean OAuthDeclinedError after the MCP SDK wrapping path via enrichConnectError (pkg/tools/mcp/remote.go:129-138).

The catalog side breaks the cross-turn retry loop by treating mcp.IsOAuthDeclined(err) as a deactivate outcome, removing the server from enabled, stopping the wrapped toolset, and notifying only when that wrapper is still the live entry (pkg/tools/builtin/mcpcatalog/mcpcatalog.go:481-512, pkg/tools/builtin/mcpcatalog/mcpcatalog.go:831-855). The model-facing enable message is also safer: it tells the model to verify <id>_ tools are actually present before proceeding, and to be honest if the dialog was dismissed (pkg/tools/builtin/mcpcatalog/mcpcatalog.go:725-729).

Tests cover the important regression paths: declined unmanaged elicitation returns the sentinel without hitting the token endpoint, concurrent RoundTrips emit exactly one elicitation, catalog cleanup removes/stops/notifies once, and stale-wrapper cleanup does not evict a replacement entry.

Non-blocking note: the managed OAuth path still appears to return a generic "user declined OAuth authorization" on decline, while this PR changes the unmanaged path. Since the PR/test scenario targets the catalog/API-host unmanaged flow and runtime API server mode opts into unmanaged OAuth, I don't think this should block, but it may be worth a follow-up if managed clients can hit the same loop.

No blocking findings.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/mcp MCP protocol, MCP tool servers, integration kind/fix PR fixes a bug (maps to fix: commit prefix)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants