HTTP/2 GOAWAY race condition causes cascading retry failures and silent premium request waste (consolidates #1743, #1754, #2050, #2101, #2189) #2421

@sjanoe123

Description

Describe the bug

The CLI's undici HTTP/2 connection pool has a race condition when handling server-sent GOAWAY frames. When a GOAWAY arrives while requests are in-flight, the pool's internal state invariant (pendingCount === 0) is violated, causing an AssertionError. The CLI's retry logic then makes 5 retry attempts against the same corrupted pool, all of which fail identically. Each retry consumes a premium request, and the entire sequence takes ~88 seconds before surfacing a terminal error.

The v1.0.6 changelog states "Resolve session crashes caused by HTTP/2 connection pool race conditions when sub-agents are active" — but the fix is incomplete. Issues #2101 (v1.0.6) and #2189 (v1.0.9) report the identical failure pattern post-fix.

This is the single most impactful reliability issue for Enterprise teams using the CLI with Claude models.

Error output

✗ AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:
    aB(t[TNe]===0)

● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
✗ Execution failed: Error: Failed to get response from the AI model; retried 5 times
  (total retry wait time: 88.20631920058639 seconds)
  Last error: CAPIError: 503 {"error":{"message":"HTTP/2 GOAWAY connection terminated","type":"connection_error"}}

The minified assertion trace reveals undici Pool/PoolBase internals: TNe → kPending, SNe → kRunning, plus kClients, kNeedDrain, kAddClient, kGetDispatcher, kRemoveClient symbols and a 2048-element ring buffer request queue (visible in #2050's full stack trace).

Root cause analysis

The failure chain is:

  1. Server sends HTTP/2 GOAWAY frame (normal lifecycle — connection TTL, load rebalancing, deploy)
  2. undici pool receives GOAWAY on active multiplexed connection — the handler in client-h2.js sets client[kHTTP2Session] = null, but in-flight streams haven't drained yet
  3. Pool state invariant violated — the pendingCount === 0 assertion fails because requests were queued between GOAWAY receipt and session close
  4. AssertionError thrown from minified pool code
  5. getCompletionWithTools catches error, classifies as "transient", retries on the SAME corrupted pool
  6. 5 retries × same corrupted pool state = 5 wasted premium requests + guaranteed failure

This traces to a known, still-open undici bug: nodejs/undici#4059 — "Uncaught AssertionError thrown due to a possible race condition." The undici source at api-request.js:141-150 even contains a TODO: Does this need queueMicrotask? comment acknowledging the timing issue. A related but distinct bug (nodejs/undici#3140) was fixed in undici v6.14.0, but #4059 remains open and is the likely root cause of post-v1.0.6 occurrences.

Why the v1.0.6 fix is incomplete

The v1.0.6 fix targeted the most common trigger vector: sub-agent pool contention. When multiple sub-agents share a single undici Pool with allowH2: true, concurrent requests multiply the GOAWAY collision window. The fix appears to have addressed the sub-agent coordination layer but did not fix:

  • Long-lived single-agent sessions (stale connections accumulate; #1754 was a 19h session)
  • The transition between sub-agent exploration and parent-agent output generation (#2189)
  • The retry loop itself — retries still target the same corrupted pool instead of creating a fresh connection

Contributing factors (from community reports)

| Factor | Evidence | Issue |
| -- | -- | -- |
| Claude models hit this far more than OpenAI/Google | "occurs much more frequently with opus-4.6 models. I've almost never seen it when using the sonnet models" | #1743 |
| Long sessions (4+ hours) | 19h27m session with 2,812 lines changed | #1754 |
| Sub-agent → parent transitions | "When it explores the codebase using a subagent, everything works fine, but when it tries to write the plan..." | #2189 |
| Large output generation | Failures cluster when model transitions from reading to writing | #2050, #2189 |
| Post-v1.0.6 persistence | Identical failures on v1.0.6 and v1.0.9 | #2101, #2189 |

Claude models likely trigger this more because Anthropic's API gateway uses shorter HTTP/2 connection TTLs or more aggressive GOAWAY behavior, and Opus generates longer responses that keep streams open longer.

Proposed fix (3 layers)

Layer 1: Proactive connection recycling

The Pool constructor (visible in #2050's trace) accepts clientTtl. Setting clientTtl: 60000 (60s) would proactively cycle connections before servers send GOAWAY frames, eliminating the stale-connection trigger entirely. For Anthropic endpoints specifically, a shorter TTL (30-45s) may be appropriate.
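A minimal sketch of the pool options this layer proposes. clientTtl is taken from the trace referenced above; the 60s/45s values and the per-endpoint override are this proposal's suggestions, not documented defaults:

```javascript
// Sketch: proposed undici Pool options for proactive connection recycling.
// The specific TTL values are illustrative suggestions from this issue.
const defaultPoolOptions = {
  allowH2: true,      // keep HTTP/2 multiplexing
  clientTtl: 60_000,  // recycle each client after 60s, before the server sends GOAWAY
};

// Hypothetical per-endpoint override: shorter TTL for Anthropic gateways,
// which appear to send GOAWAY more aggressively.
function poolOptionsFor(origin) {
  if (origin.includes('anthropic')) {
    return { ...defaultPoolOptions, clientTtl: 45_000 };
  }
  return defaultPoolOptions;
}
```

These options would be passed to the Pool constructor at dispatcher creation time (e.g. new Pool(origin, poolOptionsFor(origin))).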

Additionally, listen for the disconnect event on the pool and recreate the dispatcher when GOAWAY is detected, rather than retrying on the corrupted pool.
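The recreate-on-disconnect idea can be sketched as a small wrapper. The pool factory is injected here so the sketch is self-contained; with undici you would pass () => new Pool(origin, options) and call recycleIfGoaway from the pool's disconnect handler. The class name and GOAWAY-detection heuristic are illustrative, not the CLI's actual internals:

```javascript
// Sketch of "recreate the dispatcher when GOAWAY is detected".
// createPool is any zero-arg factory returning a pool-like object.
class RecyclingDispatcher {
  constructor(createPool) {
    this.createPool = createPool;
    this.pool = createPool();
  }

  // Called when a request error may indicate a GOAWAY-terminated connection
  // (or the AssertionError the corrupted pool throws). Returns true if the
  // pool was replaced.
  recycleIfGoaway(err) {
    const msg = String((err && err.message) || '');
    if (msg.includes('GOAWAY') || (err && err.code === 'ERR_ASSERTION')) {
      const old = this.pool;
      this.pool = this.createPool();                              // fresh pool for future requests
      if (old && typeof old.close === 'function') old.close();    // let the old one drain
      return true;
    }
    return false;
  }
}
```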

Layer 2: Error-class-aware retry logic

Currently all 5xx errors and connection errors are retried identically. A GOAWAY (connection-level) failure requires a fundamentally different retry strategy than a capacity 503 (server-level):

  • GOAWAY / connection_error: Reset the connection pool, then retry. 2-3 retries max, short backoff (1-2s). The pool is broken, not the server.
  • 503 without GOAWAY: Server is overloaded. Exponential backoff, 3-5 retries. Pool is fine, server needs time.
  • 429 / rate limit: Respect Retry-After. 1 retry max. Don't burn quota.
  • 400 / client error: Don't retry at all. Surface immediately.
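The four-way policy above can be sketched as a single classification function. All names, fields, and thresholds are illustrative, not the CLI's actual retry code:

```javascript
// Sketch of error-class-aware retry policy selection.
function retryPolicyFor(err) {
  const msg = String(err.message || '');
  if (msg.includes('GOAWAY') || err.type === 'connection_error') {
    // The pool is broken, not the server: reset pool, short backoff.
    return { maxRetries: 3, backoffMs: 1500, resetPool: true };
  }
  if (err.status === 503) {
    // Server overloaded: pool is fine, back off exponentially.
    return { maxRetries: 5, backoffMs: 4000, exponential: true, resetPool: false };
  }
  if (err.status === 429) {
    // Rate limited: respect Retry-After, one attempt, don't burn quota.
    return { maxRetries: 1, backoffMs: err.retryAfterMs ?? 10_000, resetPool: false };
  }
  if (err.status >= 400 && err.status < 500) {
    return { maxRetries: 0 }; // client error: surface immediately
  }
  return { maxRetries: 3, backoffMs: 2000, resetPool: false };
}
```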

The key change: on GOAWAY detection, destroy and recreate the connection pool before the first retry attempt.
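A synchronous sketch of that key change (the real loop would be async with backoff between attempts; doRequest, resetPool, and policyFor are injected placeholders, not CLI functions):

```javascript
// Sketch: reset the pool BEFORE retrying when the policy says it is corrupted.
function runWithRetries(doRequest, resetPool, policyFor) {
  let lastErr;
  for (let attempt = 0; ; attempt++) {
    try {
      return doRequest();
    } catch (e) {
      lastErr = e;
      const policy = policyFor(e);
      if (attempt >= policy.maxRetries) break;   // budget exhausted: surface the error
      if (policy.resetPool) resetPool();         // never retry on a corrupted pool
      // (real code: await backoff(policy) here)
    }
  }
  throw lastErr;
}
```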

Layer 3: Premium request protection

Each failed retry currently consumes a premium request. For connection-level failures (where the request never reached the model):

  • At minimum: show users a running count of premium requests consumed by retries, with an opt-out after 2-3 wasted requests
  • Ideally: send an X-Retry-Of: <original-request-id> header so the billing system can deduplicate connection-level retry charges
  • Consider: per-sub-agent connection pools instead of shared pools, to prevent one sub-agent's GOAWAY from poisoning the entire session
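The "running count with an opt-out" idea could be as small as a retry budget object. The class, cap, and message wording are hypothetical illustrations of the minimum-bar proposal above:

```javascript
// Sketch: track premium requests consumed by failed retries and stop
// after a user-visible cap (2 by default, per the proposal above).
class PremiumRetryBudget {
  constructor(maxWasted = 2) {
    this.maxWasted = maxWasted;
    this.wasted = 0;
  }

  // Record one retry that consumed a premium request without reaching the model.
  // Returns a user-facing status line.
  recordWastedRetry() {
    this.wasted++;
    return `Retries have consumed ${this.wasted} premium request(s)`;
  }

  // When true, the CLI should stop retrying and surface the error instead.
  shouldStop() {
    return this.wasted >= this.maxWasted;
  }
}
```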


Affected version

v1.0.6

Steps to reproduce the behavior

Most reliable (works ~80% of the time):

  1. Run copilot
  2. /model → select Claude Opus 4.6 (High)
  3. /plan on a medium-to-large repo
  4. Let sub-agent exploration complete
  5. When the agent transitions to writing output → GOAWAY hits

Also reliable:

  • Run a session for 4+ hours with periodic prompts, then make a complex request
  • Use /fleet with Claude models to create parallel sub-agent connections
  • /resume a dormant session (4+ hours old) and immediately prompt a complex task

Expected behavior

  • GOAWAY frames are handled as normal HTTP/2 lifecycle events, not assertion-triggering errors
  • Connection pool is recycled before retrying after a GOAWAY
  • Retries on connection-level errors don't silently burn premium requests
  • Users see distinct, informative messages for connection errors vs. rate limits vs. server errors
  • Long-running sessions proactively recycle connections before GOAWAY is received

Additional context

Environment

  • CLI version: v1.0.12 (also reproducible on v1.0.6, v1.0.9, v1.0.10)
  • Plan: Enterprise (also affects Business, Pro, Pro+)
  • OS: Linux, macOS, Windows/WSL (cross-platform)
  • Models primarily affected: Claude Opus 4.6, Claude Sonnet 4.6
  • Network: Both direct connections and behind corporate proxies (Zscaler, Netskope)

Related issues

  • #1743 — Autopilot mode AssertionError (v0.0.420, established model correlation with Opus)
  • #1754 — AssertionError during retrospective after 19h session (v0.0.420, best root cause analysis)
  • #2050 — Claude Sonnet 4.6 GOAWAY failure with full stack trace (v1.0.5, reveals undici internals)
  • #2101 — Transient API error → rate limit cascade (v1.0.6, post-fix)
  • #2189 — Claude Opus 4.6 GOAWAY during plan generation (v1.0.9, post-fix, 4 consecutive reproductions)
  • #1627 — Retry loop burns premium requests, switching models doesn't help (v0.0.420)
  • #2073 — Frequent transient errors leading to rate limits (v1.0.5)
  • nodejs/undici#4059 — Root cause: AssertionError race condition (OPEN)
  • nodejs/undici#3140 — Related: GOAWAY request hang (fixed in v6.14.0)
  • nodejs/undici#3011 — Related: null session ref after GOAWAY
