fix(reliability): bail synchronously on AUTH_RETRY_LIMIT — controller.abort alone doesn't stop SDK retries #710
Merged
….abort alone doesn't stop SDK retries
Production observation (run fe1fead2 / 2026-05-10): when the
user's OAuth token expires mid-run, the Claude SDK starts
retrying with 401s and the wizard's auth-retry circuit-breaker
fires its abort:
[api_retry attempt:2 error_status:401]
Auth retry observed (2/2)
Auth retries exceeded threshold — aborting agent query
[api_retry attempt:3 error_status:401]
Auth retry observed (3/2)
Auth retries exceeded threshold — aborting agent query
[api_retry attempt:4 error_status:401]
Auth retry observed (4/2)
Auth retries exceeded threshold — aborting agent query
... (continues to attempt 6+ before SDK gives up at max_retries:10)
Calling `controller.abort('auth_failed')` does not stop the
SDK's internal retry loop. The for-await keeps draining the
api_retry message stream, the wizard keeps incrementing
authRetryCount, and the user sits through ~30s of dead spinner
time before the SDK finally errors out at max_retries:10. The
abort signal propagates eventually (the SDK throws AbortError
once it gives up), but not in time to spare the user.
Throw an AbortError synchronously the FIRST time the threshold
is crossed. The for-await unwinds, lands in the existing
`catch (innerError)` branch, sees `authErrorDetected=true`, and
exits the outer retry loop via the
"Agent loop exiting: auth error detected" break — without
letting another N doomed retries through.
Test update: the existing race-condition test asserted that the
generator's post-threshold trailing code ran (i.e. the wizard
kept iterating). With this fix, the wizard MUST stop iterating
the moment the threshold is hit. Replace the `aborted` flag
assertion with a strict `messagesDelivered === AUTH_RETRY_LIMIT`
bound — pre-fix the count would have been 10 (the SDK's real
max_retries). The new assertion is a regression test for the
exact production behavior.
292 agent-interface tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
User report (run `fe1fead2` / 2026-05-10): OAuth token expired mid-run, the Claude SDK started retrying with 401s, and the wizard's auth-retry circuit-breaker fired its abort — but the SDK kept retrying anyway:
```
[api_retry attempt:2 error_status:401]
Auth retry observed (2/2)
Auth retries exceeded threshold — aborting agent query
[api_retry attempt:3 error_status:401]
Auth retry observed (3/2)
Auth retries exceeded threshold — aborting agent query
[api_retry attempt:4 error_status:401]
Auth retry observed (4/2)
Auth retries exceeded threshold — aborting agent query
... (continues to attempt 6+ before SDK gives up at max_retries:10)
```
`controller.abort('auth_failed')` does not stop the SDK's internal retry loop. The for-await keeps draining the `api_retry` message stream; the wizard keeps incrementing `authRetryCount`; the user sits through ~30s of dead spinner time before `max_retries:10` finally fires.
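The failure mode can be reproduced with a toy model. This is a sketch, not the SDK's actual implementation: `retryStream` and `drainWithAbortOnly` are hypothetical names, and the stand-in stream simply never consults the abort signal, which approximates an SDK retry loop that honors the abort too late to matter.

```typescript
// Hypothetical stand-in for the SDK message stream. It never checks an abort
// signal, modelling a retry loop that keeps emitting api_retry messages.
type RetryMessage = { type: 'api_retry'; attempt: number; error_status: number };

async function* retryStream(maxRetries: number): AsyncGenerator<RetryMessage> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    yield { type: 'api_retry', attempt, error_status: 401 };
  }
}

// Pre-fix behavior (assumed shape): abort the controller at the threshold,
// but keep iterating. Returns how many auth retries the consumer observed.
async function drainWithAbortOnly(limit: number, maxRetries: number): Promise<number> {
  const controller = new AbortController();
  let authRetryCount = 0;
  for await (const message of retryStream(maxRetries)) {
    if (message.error_status === 401) {
      authRetryCount++;
      if (authRetryCount >= limit) {
        // Flips controller.signal.aborted; it does not interrupt this loop.
        controller.abort('auth_failed');
      }
    }
  }
  return authRetryCount;
}
```

With a limit of 2 and 10 retries, the consumer in this model still observes all 10 messages, matching the `(3/2)`, `(4/2)`, ... lines in the log above.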
Fix
Throw `AbortError` synchronously the FIRST time the threshold is crossed. The for-await unwinds, lands in the existing `catch (innerError)` branch, sees `authErrorDetected=true`, and exits the outer retry loop via the "Agent loop exiting: auth error detected" break — without letting another N doomed retries through.
This complements (does not replace) the existing `controller.abort()` call: the abort propagates eventually for the SDK's own bookkeeping, but the throw guarantees we exit immediately regardless of the SDK's response time.
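A minimal sketch of the control flow described here, under stated assumptions: `consumeAgentStream`, `stubSdkQuery`, and the local `AbortError` class are hypothetical, and `AUTH_RETRY_LIMIT = 2` is inferred from the `(2/2)` log line. The real `runAgent` code is not shown in this PR text and may differ.

```typescript
// Sketch only. AUTH_RETRY_LIMIT, authRetryCount and authErrorDetected are
// names from the PR text; everything else is a hypothetical stand-in.
const AUTH_RETRY_LIMIT = 2; // assumed from "(2/2)" in the logs

class AbortError extends Error {
  constructor(message: string) {
    super(message);
    this.name = 'AbortError';
  }
}

type ApiRetry = { type: 'api_retry'; attempt: number; error_status: number };

// Stand-in for the SDK stream: emits 401 retries up to maxRetries.
async function* stubSdkQuery(maxRetries: number): AsyncGenerator<ApiRetry> {
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    yield { type: 'api_retry', attempt, error_status: 401 };
  }
}

async function consumeAgentStream(
  stream: AsyncIterable<ApiRetry>,
  controller: AbortController,
): Promise<{ authRetryCount: number; authErrorDetected: boolean }> {
  let authRetryCount = 0;
  let authErrorDetected = false;
  try {
    for await (const message of stream) {
      if (message.type === 'api_retry' && message.error_status === 401) {
        authRetryCount++;
        if (authRetryCount >= AUTH_RETRY_LIMIT) {
          authErrorDetected = true;
          controller.abort('auth_failed'); // kept for the SDK's bookkeeping
          throw new AbortError('auth_failed'); // unwinds the for-await now
        }
      }
    }
  } catch (innerError) {
    // Only swallow the error on the auth path; rethrow anything else.
    if (!authErrorDetected) throw innerError;
  }
  return { authRetryCount, authErrorDetected };
}
```

Throwing inside `for await` also closes the iterator, so the stand-in stream is never pulled again after the threshold. The counter stops at `AUTH_RETRY_LIMIT` rather than running to the retry budget.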
Test
The existing race-condition test asserted that the generator's post-threshold trailing code ran (i.e. the wizard kept iterating). With this fix, the wizard MUST stop iterating the moment the threshold is hit. Replaced the `aborted` flag assertion with a strict `messagesDelivered === AUTH_RETRY_LIMIT` bound — pre-fix the count would have been 10 (the SDK's real `max_retries`). The new assertion is a regression test for the exact production behavior.
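The shape of that assertion might look roughly like the following. It is written framework-free because the test runner is not named in this text. `runRegressionCheck` and `mockQuery` are hypothetical, `AUTH_RETRY_LIMIT = 2` is assumed from the logs, and the consumer loop is a simplified stand-in for the wizard's real one.

```typescript
const AUTH_RETRY_LIMIT = 2; // assumed from "(2/2)" in the logs
const SDK_MAX_RETRIES = 10; // the SDK retry budget quoted in the PR

// Returns how many messages the mock stream delivered before the consumer stopped.
async function runRegressionCheck(): Promise<number> {
  let messagesDelivered = 0;

  // Hypothetical mock of the SDK query: counts every message it hands out.
  async function* mockQuery() {
    for (let attempt = 1; attempt <= SDK_MAX_RETRIES; attempt++) {
      messagesDelivered++;
      yield { type: 'api_retry' as const, attempt, error_status: 401 };
    }
  }

  let authRetryCount = 0;
  try {
    for await (const message of mockQuery()) {
      if (message.error_status === 401 && ++authRetryCount >= AUTH_RETRY_LIMIT) {
        const error = new Error('auth_failed');
        error.name = 'AbortError';
        throw error; // synchronous bail at the threshold
      }
    }
  } catch (error) {
    // Anything other than the deliberate abort is a genuine failure.
    if (!(error instanceof Error) || error.name !== 'AbortError') throw error;
  }
  return messagesDelivered;
}
```

A test would then assert that the returned count equals `AUTH_RETRY_LIMIT`. If the synchronous throw were removed, the same mock would deliver all `SDK_MAX_RETRIES` messages, which is the regression the strict bound is meant to catch.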
Test plan
🤖 Generated with Claude Code
Note
Medium Risk
Changes `runAgent` control flow to throw an `AbortError` inside the streaming loop when auth retries exceed `AUTH_RETRY_LIMIT`, which could affect how other SDK abort/error paths unwind. Scope is limited to auth-related `api_retry` handling and is covered by an updated regression test.
Overview
Prevents prolonged “dead spinner” time on repeated 401s by exiting the SDK message stream immediately once `AUTH_RETRY_LIMIT` auth-flavored `api_retry` messages are observed.
Instead of only calling `controller.abort('auth_failed')`, `runAgent` now also throws an `AbortError` synchronously at the threshold so the `for-await` loop unwinds and the run routes directly to the `AUTH_ERROR` path.
Updates the corresponding test to assert the stream stops being pulled exactly at `AUTH_RETRY_LIMIT` by counting delivered retry messages (catching regressions where the loop drains to the SDK’s full retry budget).
Reviewed by Cursor Bugbot for commit 233a451.