fix(mcp): stop the OAuth Authentication Request loop after the user clicks Cancel by trungutt · Pull Request #2949 · docker/docker-agent

trungutt · 2026-06-01T14:22:14Z

The bug

The first time the agent enables a remote MCP server that requires OAuth (e.g. mcp.notion.com), the host shows an Authentication Request dialog. Cancel used to be effectively useless.

There are two distinct loops that conspire to produce this. Each is harmless on its own — together they make the cancel feel broken.

Loop A — clicking Cancel does nothing

What the user sees. They click Cancel on the Authentication Request. The dialog flips to DENIED. Half a second later a fresh, identical Authentication Request appears. They click Cancel again. Another one appears. Trying to dismiss the prompt feels broken.

What is actually happening underneath. A single connection attempt to the remote MCP server fans out into several parallel HTTP requests; every one of them comes back asking for OAuth. They queue up on a single-flight mutex (oauthFlowMu, 8bc1fa90e) so only one OAuth dance runs at a time. The mutex only protects against simultaneous OAuth — it does not remember that the user just said "no". So the moment the first goroutine returns with the cancel and releases the mutex, the next queued goroutine wakes up, sees no token, decides "I'd better start an OAuth flow", and pops the same dialog again.

sequenceDiagram
    participant SDK as MCP SDK
    participant RT1 as RoundTrip #1
    participant RT2 as RoundTrip #2
    participant Mu as oauthFlowMu
    participant User

    SDK->>RT1: initialize-stage RPC
    SDK->>RT2: initialize-stage RPC (parallel)
    RT1->>Mu: Lock (wins)
    RT2->>Mu: Lock (queued)
    RT1->>User: elicit OAuth
    User-->>RT1: Cancel
    RT1-->>SDK: OAuthDeclinedError
    RT1->>Mu: Unlock
    Mu->>RT2: acquired
    RT2->>RT2: getValidToken() == nil
    RT2->>User: elicit OAuth (NEW dialog)
    Note over RT2,User: same dialog the user just dismissed

Loop B — the model lies after the user cancels

What the user sees. Eventually they manage to dismiss the prompt. The agent calmly replies "The Notion server is now enabled. On my next turn I'll have access to Notion tools. Let me proceed with your request" — and then either makes up an answer, or asks for workspace details so it can call tools that do not exist. The user knows they cancelled; the agent is pretending nothing happened.

What is actually happening underneath. The catalog tells the model "enabled X — proceed with the user's original request" the moment the model calls enable_remote_mcp_server, before the OAuth dialog has even been shown. Nothing later in the conversation tells the model "by the way, the user just cancelled". On top of that, the catalog's Tools() only treats one specific error as a soft failure (IsAuthorizationRequired = "defer this"); a user cancel falls through to the generic "transient failure" branch, so the server stays enabled in the catalog state and the next loop iteration would try Start() again.

flowchart LR
    A[loop iter N] --> B[Tools]
    B --> C[Start fails<br/>decline]
    C --> D{IsAuthorizationRequired?}
    D -->|no| E[server stays in t.enabled<br/>started=false]
    E --> F[loop iter N+1]
    F --> B

The fix

Layer 1 — `pkg/tools/mcp` — latch the decline under the mutex

What the user sees. Click Cancel once. The dialog goes away and stays away.

What changes underneath. Give the OAuth transport one new bit of memory: "did the user just cancel?". Set that bit while still holding the mutex, before any queued goroutine can wake up. Check the bit at the top of every authorization attempt; if it's set, return the cancel sentinel immediately instead of running the OAuth flow again.

The MCP SDK wraps transport errors in a way that destroys the unwrap chain (fmt.Errorf("%w: %v", …) — the original sits in %v as text, not in the chain), so errors.As can't find our sentinel after it bubbles up. That's the exact reason an existing flag lastAuthRequired already exists for the deferred-OAuth case; the new bit is a parallel of that, and enrichConnectError in remote.go reconstitutes a clean *OAuthDeclinedError for callers.

oauthTransport {
    + lastOAuthDeclined bool       // NEW
}

authorizeOnce(ctx) error {
    Lock(oauthFlowMu); defer Unlock

    if validToken { return nil }

    if lastOAuthDeclined {                     // NEW short-circuit
        return &OAuthDeclinedError{URL}
    }

    err = handleOAuthFlow(ctx)
    if IsOAuthDeclined(err) {                  // NEW latch
        lastOAuthDeclined = true               // BEFORE deferred Unlock
    }
    return err
}

The queued goroutine now sees the latched cancel and never opens a fresh dialog:

sequenceDiagram
    participant RT1 as RoundTrip #1
    participant RT2 as RoundTrip #2
    participant Mu as oauthFlowMu
    participant User

    RT1->>Mu: Lock
    RT2->>Mu: Lock (queued)
    RT1->>User: elicit OAuth
    User-->>RT1: Cancel
    RT1->>RT1: lastOAuthDeclined = true
    RT1->>Mu: Unlock
    Mu->>RT2: acquired
    RT2->>RT2: lastOAuthDeclined? YES
    RT2-->>RT2: return OAuthDeclinedError
    Note over RT2,User: NO new dialog

Layer 2 — `pkg/tools/builtin/mcpcatalog` — drop the server and stop lying

What the user sees. After the cancel, the agent replies honestly — "the request needs you to authorize the connection — want to retry?" — instead of claiming Notion is connected. If the user re-asks the same thing (or says "yes" / "retry"), a fresh authorization dialog appears.

What changes underneath. Two small things:

Recognise the cancel as a distinct outcome. The catalog's Tools() already special-cases IsAuthorizationRequired; we add a sibling case for IsOAuthDeclined. When a server's Start() fails because the user dismissed the dialog, we delete the server from t.enabled, stop the inner toolset (so any half-initialised session is cleaned up), and fire toolsChangedHandler exactly once. Concurrent handleDisable / handleResetAuth / re-enable races are handled by only doing the eviction when we still own the wrapper. After this, the next agent loop iteration enumerates a smaller tool list and does NOT retry Start() — the server is simply gone until somebody calls enable_remote_mcp_server again.
Rewrite the handleEnable tool result so the model can't lie. The old message was a flat promise. The new message is a conditional self-check the model can ground its next turn against: "if <id>_ tools appear, proceed; if they don't, the user dismissed the dialog — be honest, and on user re-ask call enable_remote_mcp_server again to surface a fresh dialog".

flowchart TD
    A[Tools] --> B[Start fails]
    B --> C{IsAuthorizationRequired?}
    C -->|yes| D[continue silently]
    C -->|no| E{IsOAuthDeclined?}
    E -->|yes - NEW| F[disableAfterDecline:<br/>delete from t.enabled<br/>Stop inner toolset<br/>fire toolsChangedHandler once]
    E -->|no| G[log + continue<br/>retry next iter]

old: "enabled <id> — the server's tools will appear on your next
      turn. Proceed with the user's original request; the host
      handles any required connection setup."

new: "enable requested for <id>. The host will surface any required
      authorization dialog. On your next turn:
        - if tools whose names start with <id>_ appear, proceed
          with the user's original request using them.
        - if NO such tools appear, the user dismissed the dialog.
          Tell the user honestly; on user re-ask (or yes/retry),
          call enable_remote_mcp_server again for a fresh dialog."

No registries, no TTLs, no visibility tweaks. The transport latch handles intra-Initialize concurrency; every cross-turn enable starts with a fresh mcp.Toolset, so nothing carries over.

The MCP SDK fires several initialize-stage RPCs concurrently. Each one that hits a 401 queues on oauthFlowMu and runs the OAuth flow. When the user clicks Cancel, the goroutine holding the mutex returns the decline and releases the mutex; the next queued goroutine sees no valid token and starts a fresh OAuth flow, re-emitting the elicitation the user just dismissed. The single-flight added in 8bc1fa9 coalesces concurrent attempts but does not latch a decline. Two changes to fix the loop without altering the happy-path flow: 1. New sentinel *OAuthDeclinedError in pkg/tools/mcp/interactive.go (plus IsOAuthDeclined helper), distinct from the existing AuthorizationRequiredError so callers can distinguish a deliberate user dismissal from a non-interactive deferral. 2. Sticky decline state on oauthTransport (lastOAuthDeclined): - authorizeOnce sets it under oauthFlowMu before the deferred Unlock fires, so any queued goroutine observes the decline before it can fire a fresh flow. - authorizeOnce short-circuits to OAuthDeclinedError at the top when the flag is set, avoiding handleOAuthFlow entirely. The MCP SDK wraps transport errors with fmt.Errorf("%w: %v", ...), dropping the original error from the unwrap chain (same reason AuthorizationRequiredError uses the lastAuthRequired flag pattern). enrichConnectError reads the new oauthDeclined() accessor and reconstitutes a clean *OAuthDeclinedError sentinel for callers. Regression tests: - TestUnmanagedOAuthFlow_ElicitationDeclineReturnsSentinel: a declined elicitation surfaces as IsOAuthDeclined, carries the server URL, and does not hit the token endpoint. - TestUnmanagedOAuthFlow_ElicitationDeclineLatchesAgainstConcurrentRoundTrips: 4 concurrent RoundTrips produce exactly ONE elicitation; every queued caller observes the latched decline and short-circuits.

… model After the transport-layer latch lands, the user-declined OAuth surfaces as an *OAuthDeclinedError out of Toolset.Start. Two follow-up issues remained: 1. The catalogs Tools() retried Start() on the same enabled wrapper on every agent loop iteration (mcp.IsAuthorizationRequired was the only short-circuit), so even with the dialog gone the catalog kept reporting "needs auth" forever and the meta-tools never re-collapsed. 2. handleEnable returned an unconditional "enabled X — proceed with the users original request; the host handles any required connection setup", added in 438058b to stop the model from stalling on YAML OAuth servers. With OAuth servers from the catalog the message is a flat promise the runtime cannot keep when the user dismisses the dialog: the model parrots it back ("On my next turn Ill have access to Notion tools") and either hallucinates results or asks for workspace details to call tools that no longer exist. Changes: - Tools(): on mcp.IsOAuthDeclined(err), call disableAfterDecline. disableAfterDecline removes the server from t.enabled, stops the inner toolset (so any partially-initialised session is cleaned up), and fires toolsChangedHandler exactly once — only when WE removed the entry, so a concurrent handleDisable / handleResetAuth / re-enable race does not double-notify. - handleEnable wording is now conditional and instructive instead of promissory: it tells the model to verify <id>_ tools are present on the next turn, and gives explicit guidance for the dismissed-dialog branch (do not claim connected, do not hallucinate tools, tell the user honestly; on re-ask treat it as implicit retry and call enable_remote_mcp_server again). Regression tests: - TestToolsOAuthDeclineRemovesServer: Start() returning OAuthDeclinedError removes the server from t.enabled, stops the inner toolset, fires tools-changed exactly once, and a second Tools() call does NOT call Start() again — this is the property that broke the "dialog re-appears every loop iteration" loop. - TestToolsOAuthDeclineNoNotifyWhenAlreadyDisabled: when the entry has been superseded by a fresh enable between the failing Start() and the cleanup, disableAfterDecline still stops the stale wrapper but does NOT evict the new entry or double-notify. - Existing TestEnableDisableLifecycle / TestEnableAPIKeyEnvPresent updated for the new wording (asserts the self-check + dismissed-dialog branch are surfaced to the model).

trungutt · 2026-06-01T14:39:52Z

Thanks for the review. Summary of dispositions:

Finding	Disposition
🔴 [HIGH] `handleManagedOAuthFlow` missing sentinel	Intentionally out of scope — see inline reply. This PR targets the unmanaged path exclusively (the deployment we ship does not use managed OAuth).
🟡 [LOW] `disableAfterDecline` does not swallow `DeadlineExceeded`	Intentionally left alone for consistency — see inline reply. The pattern matches the two sibling callers (`handleDisable` / `handleResetAuth`); changing one in isolation would be worse than leaving them aligned.

No code changes from this review pass. Happy to revisit either point in a focused follow-up if needed.

docker-agent · 2026-06-01T15:14:10Z

❌ PR Review Failed — The review agent encountered an error and could not complete the review. View logs.

aheritier

LGTM. CI is green for head 95a26196279cd8496dd2c6aa4dca84f0558093c2.

I reviewed the OAuth cancel loop fix across the transport and catalog layers. The transport now latches a user decline while still under oauthFlowMu, so queued concurrent initialize requests short-circuit instead of opening a fresh auth dialog (pkg/tools/mcp/oauth.go:327-369). It also reconstitutes a clean OAuthDeclinedError after the MCP SDK wrapping path via enrichConnectError (pkg/tools/mcp/remote.go:129-138).

The catalog side breaks the cross-turn retry loop by treating mcp.IsOAuthDeclined(err) as a deactivate outcome, removing the server from enabled, stopping the wrapped toolset, and notifying only when that wrapper is still the live entry (pkg/tools/builtin/mcpcatalog/mcpcatalog.go:481-512, pkg/tools/builtin/mcpcatalog/mcpcatalog.go:831-855). The model-facing enable message is also safer: it tells the model to verify <id>_ tools are actually present before proceeding, and to be honest if the dialog was dismissed (pkg/tools/builtin/mcpcatalog/mcpcatalog.go:725-729).

Tests cover the important regression paths: declined unmanaged elicitation returns the sentinel without hitting the token endpoint, concurrent RoundTrips emit exactly one elicitation, catalog cleanup removes/stops/notifies once, and stale-wrapper cleanup does not evict a replacement entry.

Non-blocking note: the managed OAuth path still appears to return a generic "user declined OAuth authorization" on decline, while this PR changes the unmanaged path. Since the PR/test scenario targets the catalog/API-host unmanaged flow and runtime API server mode opts into unmanaged OAuth, I don't think this should block, but it may be worth a follow-up if managed clients can hit the same loop.

No blocking findings.

aheritier added area/mcp MCP protocol, MCP tool servers, integration kind/fix PR fixes a bug (maps to fix: commit prefix) labels Jun 1, 2026

trungutt added 2 commits June 1, 2026 16:33

trungutt force-pushed the fix/mcp-oauth-cancel-loop branch from 7ead54e to 95a2619 Compare June 1, 2026 14:33

This comment was marked as off-topic.

Sign in to view

trungutt marked this pull request as ready for review June 1, 2026 14:42

trungutt requested a review from a team as a code owner June 1, 2026 14:42

dgageot approved these changes Jun 1, 2026

View reviewed changes

trungutt merged commit 7882b41 into docker:main Jun 1, 2026
20 checks passed

aheritier approved these changes Jun 1, 2026

View reviewed changes

BrewTestBot mentioned this pull request Jun 1, 2026

docker-agent 1.70.2 Homebrew/homebrew-core#285692

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(mcp): stop the OAuth Authentication Request loop after the user clicks Cancel#2949

fix(mcp): stop the OAuth Authentication Request loop after the user clicks Cancel#2949
trungutt merged 2 commits into
docker:mainfrom
trungutt:fix/mcp-oauth-cancel-loop

trungutt commented Jun 1, 2026 •

edited

Loading

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

trungutt commented Jun 1, 2026

Uh oh!

Uh oh!

docker-agent commented Jun 1, 2026

Uh oh!

aheritier left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

trungutt commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The bug

Loop A — clicking Cancel does nothing

Loop B — the model lies after the user cancels

The fix

Layer 1 — pkg/tools/mcp — latch the decline under the mutex

Layer 2 — pkg/tools/builtin/mcpcatalog — drop the server and stop lying

Uh oh!

This comment was marked as off-topic.

Uh oh!

This comment was marked as off-topic.

Uh oh!

trungutt commented Jun 1, 2026

Uh oh!

Uh oh!

docker-agent commented Jun 1, 2026

Uh oh!

aheritier left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

trungutt commented Jun 1, 2026 •

edited

Loading

Layer 1 — `pkg/tools/mcp` — latch the decline under the mutex

Layer 2 — `pkg/tools/builtin/mcpcatalog` — drop the server and stop lying