HTTP/2 GOAWAY race condition causes cascading retry failures and silent premium request waste (consolidates #1743, #1754, #2050, #2101, #2189) #2421
Description
Describe the bug
The CLI's undici HTTP/2 connection pool has a race condition when handling server-sent GOAWAY frames. When a GOAWAY arrives while requests are in-flight, the pool's internal state invariant (pendingCount === 0) is violated, causing an AssertionError. The CLI's retry logic then makes 5 retry attempts against the same corrupted pool, all of which fail identically. Each retry consumes a premium request, and the entire sequence takes ~88 seconds before surfacing a terminal error.
The v1.0.6 changelog states "Resolve session crashes caused by HTTP/2 connection pool race conditions when sub-agents are active" — but the fix is incomplete. Issues #2101 (v1.0.6) and #2189 (v1.0.9) report the identical failure pattern post-fix.
This is the single most impactful reliability issue for Enterprise teams using the CLI with Claude models.
Error output
✗ AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:
aB(t[TNe]===0)
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
✗ Execution failed: Error: Failed to get response from the AI model; retried 5 times
(total retry wait time: 88.20631920058639 seconds)
Last error: CAPIError: 503 {"error":{"message":"HTTP/2 GOAWAY connection terminated","type":"connection_error"}}

The minified assertion trace reveals undici Pool/PoolBase internals: TNe → kPending, SNe → kRunning, plus kClients, kNeedDrain, kAddClient, kGetDispatcher, kRemoveClient symbols and a 2048-element ring buffer request queue (visible in #2050's full stack trace).
Root cause analysis
The failure chain is:
- Server sends HTTP/2 GOAWAY frame (normal lifecycle — connection TTL, load rebalancing, deploy)
- undici pool receives GOAWAY on active multiplexed connection — the handler in client-h2.js sets client[kHTTP2Session] = null, but in-flight streams haven't drained yet
- Pool state invariant violated — the pendingCount === 0 assertion fails because requests were queued between GOAWAY receipt and session close
- AssertionError thrown from minified pool code
- getCompletionWithTools catches the error, classifies it as "transient", and retries on the SAME corrupted pool
- 5 retries × same corrupted pool state = 5 wasted premium requests + guaranteed failure
This traces to a known, still-open undici bug: nodejs/undici#4059 — "Uncaught AssertionError thrown due to a possible race condition." The undici source at api-request.js:141-150 even contains a TODO: Does this need queueMicrotask? comment acknowledging the timing issue. A related but distinct bug (nodejs/undici#3140) was fixed in undici v6.14.0, but #4059 remains open and is the likely root cause of post-v1.0.6 occurrences.
Why the v1.0.6 fix is incomplete
The v1.0.6 fix targeted the most common trigger vector: sub-agent pool contention. When multiple sub-agents share a single undici Pool with allowH2: true, concurrent requests multiply the GOAWAY collision window. The fix appears to have addressed the sub-agent coordination layer but did not fix:
- Long-lived single-agent sessions (stale connections accumulate; #1754 was a 19h session)
- The transition between sub-agent exploration and parent-agent output generation (#2189)
- The retry loop itself — retries still target the same corrupted pool instead of creating a fresh connection
Contributing factors (from community reports)

| Factor | Evidence | Issue |
| --- | --- | --- |
| Claude models hit this far more than OpenAI/Google | "occurs much more frequently with opus-4.6 models. I've almost never seen it when using the sonnet models" | #1743 |
| Long sessions (4+ hours) | 19h27m session with 2,812 lines changed | #1754 |
| Sub-agent → parent transitions | "When it explores the codebase using a subagent, everything works fine, but when it tries to write the plan..." | #2189 |
| Large output generation | Failures cluster when model transitions from reading to writing | #2050, #2189 |
| Post-v1.0.6 persistence | Identical failures on v1.0.6 and v1.0.9 | #2101, #2189 |

Claude models likely trigger this more because Anthropic's API gateway uses shorter HTTP/2 connection TTLs or more aggressive GOAWAY behavior, and Opus generates longer responses that keep streams open longer.
Proposed fix (3 layers)
Layer 1: Proactive connection recycling
The Pool constructor (visible in #2050's trace) accepts clientTtl. Setting clientTtl: 60000 (60s) would proactively cycle connections before servers send GOAWAY frames, eliminating the stale-connection trigger entirely. For Anthropic endpoints specifically, a shorter TTL (30-45s) may be appropriate.
Additionally, listen for the disconnect event on the pool and recreate the dispatcher when GOAWAY is detected, rather than retrying on the corrupted pool.
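A minimal sketch of the recycling logic. In the real CLI, `makePool` would be something like `() => new undici.Pool(origin, { allowH2: true, clientTtl: 60_000 })`; it is injected here so the sketch stays dependency-free, and the names `createRecyclingPool` / `recycleIfGoaway` are illustrative, not existing CLI code.

```javascript
// Sketch: a self-recycling pool wrapper. On a GOAWAY-flavored disconnect,
// the (possibly corrupted) pool is dropped and a fresh one is created,
// so no retry ever reuses the broken pool state.
function createRecyclingPool(makePool) {
  let pool = makePool();
  let generation = 0; // how many times the pool has been replaced

  return {
    get current() { return pool; },
    get generation() { return generation; },
    // Wire this to the pool's 'disconnect' event: if the error looks like
    // a GOAWAY, close the old pool and build a replacement.
    recycleIfGoaway(err) {
      if (err && /GOAWAY/i.test(String(err.message || err))) {
        if (typeof pool.close === 'function') pool.close();
        pool = makePool();
        generation += 1;
        return true;
      }
      return false;
    },
  };
}
```

With undici the wiring would be `pool.on('disconnect', (origin, targets, err) => manager.recycleIfGoaway(err))`, combined with `clientTtl: 60_000` so connections are cycled before the server's GOAWAY ever arrives.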
Layer 2: Error-class-aware retry logic
Currently all 5xx errors and connection errors are retried identically. A GOAWAY (connection-level) failure requires a fundamentally different retry strategy than a capacity 503 (server-level):
- GOAWAY / connection_error: Reset the connection pool, then retry. 2-3 retries max, short backoff (1-2s). The pool is broken, not the server.
- 503 without GOAWAY: Server is overloaded. Exponential backoff, 3-5 retries. Pool is fine, server needs time.
- 429 / rate limit: Respect Retry-After. 1 retry max. Don't burn quota.
- 400 / client error: Don't retry at all. Surface immediately.
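The four classes above could be encoded in a small classifier, sketched here; the error shape ({ status, code, message }) and the function name are assumptions, not the CLI's actual error type.

```javascript
// Sketch: map an API failure to a retry strategy instead of treating
// every error as uniformly "transient".
function classifyFailure(err) {
  const msg = String(err.message || '');
  // Connection-level GOAWAY: the pool is broken, not the server.
  if (err.code === 'connection_error' || /GOAWAY/i.test(msg)) {
    return { action: 'reset-pool-then-retry', maxRetries: 3, backoffMs: 1500 };
  }
  // Capacity 503: the pool is fine, the server needs time.
  if (err.status === 503) {
    return { action: 'retry', maxRetries: 5, backoffMs: 2000, exponential: true };
  }
  // Rate limit: honor Retry-After and don't burn quota.
  if (err.status === 429) {
    return { action: 'retry', maxRetries: 1, respectRetryAfter: true };
  }
  // Client errors: surface immediately.
  if (err.status >= 400 && err.status < 500) {
    return { action: 'fail-fast', maxRetries: 0 };
  }
  // Anything else: conservative default.
  return { action: 'retry', maxRetries: 3, backoffMs: 1000, exponential: true };
}
```

Note the ordering: a 503 whose body mentions GOAWAY (as in the error output above) is classified as a connection-level failure first, not a capacity 503.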
The key change: on GOAWAY detection, destroy and recreate the connection pool before the first retry attempt.
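A sketch of that key change, with the pool factory and dispatch function injected as stand-ins for the real dispatcher (`sendWithPoolReset`, `makePool`, and `send` are hypothetical names):

```javascript
// Sketch: retry driver that replaces the pool BEFORE the next attempt
// whenever the failure is connection-level, so no retry reuses a pool
// corrupted by a GOAWAY.
async function sendWithPoolReset(makePool, send, maxRetries = 3) {
  let pool = makePool();
  let lastErr;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await send(pool);
    } catch (err) {
      lastErr = err;
      // Connection-level failure: fresh pool for the retry.
      if (/GOAWAY|connection_error/i.test(String(err.message || err.code))) {
        pool = makePool();
      }
    }
  }
  throw lastErr;
}
```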
Layer 3: Premium request protection
Each failed retry currently consumes a premium request. For connection-level failures (where the request never reached the model):
- At minimum: show users a running count of premium requests consumed by retries, with an opt-out after 2-3 wasted requests
- Ideally: send an X-Retry-Of: <original-request-id> header so the billing system can deduplicate connection-level retry charges
- Consider: per-sub-agent connection pools instead of shared pools, to prevent one sub-agent's GOAWAY from poisoning the entire session
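The "at minimum" option could be a small budget tracker, sketched below; all names are illustrative, not the CLI's actual billing API.

```javascript
// Sketch: count premium requests burned by retries and flag when the
// user should be asked before any further request is spent.
function createRetryBudget(maxWasted = 2) {
  let wasted = 0;
  return {
    recordWastedRetry() { wasted += 1; },
    get wasted() { return wasted; },
    // true once retries have exhausted the budget: surface the running
    // count to the user and offer an opt-out before retrying again
    get exhausted() { return wasted >= maxWasted; },
  };
}
```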
Affected version
v1.0.6
Steps to reproduce the behavior
Most reliable (works ~80% of the time):
1. copilot
2. /model → select Claude Opus 4.6 (High)
3. /plan on a medium-to-large repo
4. Let sub-agent exploration complete
5. When the agent transitions to writing output → GOAWAY hits

Also reliable:
- Run a session for 4+ hours with periodic prompts, then make a complex request
- Use /fleet with Claude models to create parallel sub-agent connections
- /resume a dormant session (4+ hours old) and immediately prompt a complex task
Expected behavior
- GOAWAY frames are handled as normal HTTP/2 lifecycle events, not assertion-triggering errors
- Connection pool is recycled before retrying after a GOAWAY
- Retries on connection-level errors don't silently burn premium requests
- Users see distinct, informative messages for connection errors vs. rate limits vs. server errors
- Long-running sessions proactively recycle connections before GOAWAY is received
Additional context
Environment
CLI version: v1.0.12 (also reproducible on v1.0.6, v1.0.9, v1.0.10)
Plan: Enterprise (also affects Business, Pro, Pro+)
OS: Linux, macOS, Windows/WSL (cross-platform)
Models primarily affected: Claude Opus 4.6, Claude Sonnet 4.6
Network: Both direct connections and behind corporate proxies (Zscaler, Netskope)
Related issues
#1743 — Autopilot mode AssertionError (v0.0.420, established model correlation with Opus)
#1754 — AssertionError during retrospective after 19h session (v0.0.420, best root cause analysis)
#2050 — Claude Sonnet 4.6 GOAWAY failure with full stack trace (v1.0.5, reveals undici internals)
#2101 — Transient API error → rate limit cascade (v1.0.6, post-fix)
#2189 — Claude Opus 4.6 GOAWAY during plan generation (v1.0.9, post-fix, 4 consecutive reproductions)
#1627 — Retry loop burns premium requests, switching models doesn't help (v0.0.420)
#2073 — Frequent transient errors leading to rate limits (v1.0.5)
nodejs/undici#4059 — Root cause: AssertionError race condition (OPEN)
nodejs/undici#3140 — Related: GOAWAY request hang (fixed in v6.14.0)
nodejs/undici#3011 — Related: null session ref after GOAWAY