fix(reliability): bail synchronously on AUTH_RETRY_LIMIT — controller.abort alone doesn't stop SDK retries#710

Merged
kelsonpw merged 1 commit into main from fix/auth-retry-break-stream
May 13, 2026

@kelsonpw kelsonpw commented May 10, 2026

Summary

User report (run `fe1fead2` / 2026-05-10): OAuth token expired mid-run, the Claude SDK started retrying with 401s, and the wizard's auth-retry circuit-breaker fired its abort — but the SDK kept retrying anyway:

```
[api_retry attempt:2 error_status:401]
Auth retry observed (2/2)
Auth retries exceeded threshold — aborting agent query
[api_retry attempt:3 error_status:401]
Auth retry observed (3/2)
Auth retries exceeded threshold — aborting agent query
[api_retry attempt:4 error_status:401]
Auth retry observed (4/2)
Auth retries exceeded threshold — aborting agent query
... (continues to attempt 6+ before SDK gives up at max_retries:10)
```

`controller.abort('auth_failed')` does not stop the SDK's internal retry loop. The for-await keeps draining the `api_retry` message stream; the wizard keeps incrementing `authRetryCount`; the user sits through ~30s of dead spinner time before `max_retries:10` finally fires.
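The failure mode can be reproduced in isolation. The sketch below is not the wizard's code: `fakeSdkStream` and `drainWithAbortOnly` are hypothetical stand-ins, and the fake stream is assumed to ignore its abort signal, which matches the behavior in the log above. It shows that calling `abort()` inside a `for await` body only flips the signal; the loop keeps pulling until the producer finishes on its own.

```typescript
// Hypothetical stand-in for the SDK's message stream. It emits one api_retry
// message per attempt and never checks the abort signal (assumption).
type RetryMessage = { type: "api_retry"; attempt: number; errorStatus: number };

async function* fakeSdkStream(
  maxRetries: number,
  _signal: AbortSignal,
): AsyncGenerator<RetryMessage> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    yield { type: "api_retry", attempt, errorStatus: 401 };
  }
}

// Mirrors the "(2/2)" threshold in the log; the real constant lives in the wizard.
const AUTH_RETRY_LIMIT = 2;

// Pre-fix shape: abort at the threshold, but never leave the loop.
async function drainWithAbortOnly(maxRetries: number): Promise<number> {
  const controller = new AbortController();
  let authRetryCount = 0;
  for await (const message of fakeSdkStream(maxRetries, controller.signal)) {
    if (message.errorStatus === 401) {
      authRetryCount++;
      if (authRetryCount >= AUTH_RETRY_LIMIT) {
        // Sets signal.aborted, but nothing here stops the iteration.
        controller.abort("auth_failed");
      }
    }
  }
  // Equals maxRetries, not AUTH_RETRY_LIMIT: every message was drained.
  return authRetryCount;
}
```

With this fake stream, `await drainWithAbortOnly(10)` resolves to 10, because the consumer sees every retry message even though the abort fired at the second one.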

Fix

Throw `AbortError` synchronously the FIRST time the threshold is crossed. The for-await unwinds, lands in the existing `catch (innerError)` branch, sees `authErrorDetected=true`, and exits the outer retry loop via the "Agent loop exiting: auth error detected" break — without letting another N doomed retries through.

This complements (does not replace) the existing `controller.abort()` call: the abort propagates eventually for the SDK's own bookkeeping, but the throw guarantees we exit immediately regardless of the SDK's response time.
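A minimal sketch of that control flow, under stated assumptions: the local `AbortError` class, `stubRetryStream`, and `runWithSynchronousBail` are illustrative names rather than the SDK's or the wizard's actual exports, and the returned `outcome` strings only stand in for the real AUTH_ERROR routing. Throwing inside the `for await` body closes the iterator and hands control to the surrounding `catch`.

```typescript
// Local stand-in for the SDK's AbortError (assumption: the real class differs).
class AbortError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "AbortError";
  }
}

// Mirrors the "(2/2)" threshold in the log; illustrative value only.
const RETRY_THRESHOLD = 2;

// Hypothetical stream of auth-flavored retry messages.
async function* stubRetryStream(maxRetries: number): AsyncGenerator<{ errorStatus: number }> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    yield { errorStatus: 401 };
  }
}

async function runWithSynchronousBail(
  maxRetries: number,
): Promise<{ outcome: "AUTH_ERROR" | "COMPLETED"; authRetryCount: number }> {
  const controller = new AbortController();
  let authRetryCount = 0;
  let authErrorDetected = false;
  try {
    for await (const message of stubRetryStream(maxRetries)) {
      if (message.errorStatus === 401) {
        authRetryCount++;
        if (authRetryCount >= RETRY_THRESHOLD) {
          authErrorDetected = true;
          // Keep the abort for the SDK's own bookkeeping...
          controller.abort("auth_failed");
          // ...and throw so the loop unwinds immediately.
          throw new AbortError("auth retry limit reached");
        }
      }
    }
  } catch (innerError) {
    // Only the auth bail is handled here; anything else is rethrown.
    if (authErrorDetected) {
      return { outcome: "AUTH_ERROR", authRetryCount };
    }
    throw innerError;
  }
  return { outcome: "COMPLETED", authRetryCount };
}
```

In this sketch, `await runWithSynchronousBail(10)` resolves with `outcome: "AUTH_ERROR"` and `authRetryCount: 2`: the stream is never pulled past the threshold.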

Test

The existing race-condition test asserted that the generator's post-threshold trailing code ran (i.e. the wizard kept iterating). With this fix, the wizard MUST stop iterating the moment the threshold is hit. Replaced the `aborted` flag assertion with a strict `messagesDelivered === AUTH_RETRY_LIMIT` bound — pre-fix the count would have been 10 (the SDK's real `max_retries`). The new assertion is a regression test for the exact production behavior.
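The shape of that assertion can be sketched without the test framework. Everything here is hypothetical scaffolding (`mockRetryStream`, `consume`, `measureDeliveries`, and the constants), not the contents of `agent-interface.test.ts`. The mock stream counts each message at the moment it is handed to the consumer, so the counter records how far the stream was actually pulled.

```typescript
const LIMIT_UNDER_TEST = 2; // mirrors the "(2/2)" threshold in the log
const MOCK_MAX_RETRIES = 10; // mirrors max_retries:10 from the report

// Hypothetical mock stream: increments the counter just before each yield.
async function* mockRetryStream(
  counter: { messagesDelivered: number },
): AsyncGenerator<{ errorStatus: number }> {
  for (let attempt = 1; attempt <= MOCK_MAX_RETRIES; attempt++) {
    counter.messagesDelivered++;
    yield { errorStatus: 401 };
  }
}

// Minimal consumer standing in for the streaming loop. The flag switches
// between the pre-fix behavior (keep draining) and the post-fix bail.
async function consume(
  counter: { messagesDelivered: number },
  bailSynchronously: boolean,
): Promise<void> {
  let authRetryCount = 0;
  try {
    for await (const message of mockRetryStream(counter)) {
      if (message.errorStatus === 401) {
        authRetryCount++;
        if (authRetryCount >= LIMIT_UNDER_TEST && bailSynchronously) {
          throw new Error("auth retry limit reached");
        }
      }
    }
  } catch {
    // The bail is expected; the sketch only cares about the delivery count.
  }
}

async function measureDeliveries(bailSynchronously: boolean): Promise<number> {
  const counter = { messagesDelivered: 0 };
  await consume(counter, bailSynchronously);
  return counter.messagesDelivered;
}
```

Here `measureDeliveries(true)` resolves to `LIMIT_UNDER_TEST` and `measureDeliveries(false)` resolves to `MOCK_MAX_RETRIES`, which is the difference the strict bound is meant to catch.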

Test plan

  • `pnpm exec tsc --noEmit` — clean
  • `pnpm exec eslint src/lib/agent-interface.ts --max-warnings=0` — clean
  • `pnpm exec vitest run --pool=forks --maxWorkers=1 src/lib/tests/agent-interface.test.ts` — 292 passed
  • Manual: simulate token expiry mid-run, verify the agent transitions to AUTH_ERROR within a few seconds (not 30s+)

🤖 Generated with Claude Code


Note

Medium Risk
Changes runAgent control flow to throw an AbortError inside the streaming loop when auth retries exceed AUTH_RETRY_LIMIT, which could affect how other SDK abort/error paths unwind. Scope is limited to auth-related api_retry handling and is covered by an updated regression test.

Overview
Prevents prolonged “dead spinner” time on repeated 401s by exiting the SDK message stream immediately once AUTH_RETRY_LIMIT auth-flavored api_retry messages are observed.

Instead of only calling controller.abort('auth_failed'), runAgent now also throws an AbortError synchronously at the threshold so the for-await loop unwinds and the run routes directly to the AUTH_ERROR path.

Updates the corresponding test to assert the stream stops being pulled exactly at AUTH_RETRY_LIMIT by counting delivered retry messages (catching regressions where the loop drains to the SDK’s full retry budget).

Reviewed by Cursor Bugbot for commit 233a451.

….abort alone doesn't stop SDK retries

Production observation (run fe1fead2 / 2026-05-10): when the
user's OAuth token expires mid-run, the Claude SDK starts
retrying with 401s and the wizard's auth-retry circuit-breaker
fires its abort:

  [api_retry attempt:2 error_status:401]
  Auth retry observed (2/2)
  Auth retries exceeded threshold — aborting agent query
  [api_retry attempt:3 error_status:401]
  Auth retry observed (3/2)
  Auth retries exceeded threshold — aborting agent query
  [api_retry attempt:4 error_status:401]
  Auth retry observed (4/2)
  Auth retries exceeded threshold — aborting agent query
  ... (continues to attempt 6+ before SDK gives up at max_retries:10)

Calling `controller.abort('auth_failed')` does not stop the
SDK's internal retry loop. The for-await keeps draining the
api_retry message stream, the wizard keeps incrementing
authRetryCount, and the user sits through ~30s of dead spinner
time before the SDK finally errors out at max_retries:10. The
abort signal propagates eventually (the SDK throws AbortError
once it gives up), but not in time to spare the user.

Throw an AbortError synchronously the FIRST time the threshold
is crossed. The for-await unwinds, lands in the existing
`catch (innerError)` branch, sees `authErrorDetected=true`, and
exits the outer retry loop via the
"Agent loop exiting: auth error detected" break — without
letting another N doomed retries through.

Test update: the existing race-condition test asserted that the
generator's post-threshold trailing code ran (i.e. the wizard
kept iterating). With this fix, the wizard MUST stop iterating
the moment the threshold is hit. Replace the `aborted` flag
assertion with a strict `messagesDelivered === AUTH_RETRY_LIMIT`
bound — pre-fix the count would have been 10 (the SDK's real
max_retries). The new assertion is a regression test for the
exact production behavior.

292 agent-interface tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@kelsonpw kelsonpw requested a review from a team as a code owner May 10, 2026 17:03
@kelsonpw kelsonpw merged commit 41f7a52 into main May 13, 2026
14 checks passed