HTTP/2 GOAWAY race condition causes cascading retry failures and silent premium request waste (consolidates #1743, #1754, #2050, #2101, #2189) #2421
Description
Describe the bug
The CLI's undici HTTP/2 connection pool has a race condition when handling server-sent GOAWAY frames. When a GOAWAY arrives while requests are in-flight, the pool's internal state invariant (pendingCount === 0) is violated, causing an AssertionError. The CLI's retry logic then makes 5 retry attempts against the same corrupted pool, all of which fail identically. Each retry consumes a premium request, and the entire sequence takes ~88 seconds before surfacing a terminal error.
The v1.0.6 changelog states "Resolve session crashes caused by HTTP/2 connection pool race conditions when sub-agents are active" — but the fix is incomplete. Issues #2101 (v1.0.6) and #2189 (v1.0.9) report the identical failure pattern post-fix.
This is the single most impactful reliability issue for Enterprise teams using the CLI with Claude models.
Error output
✗ AssertionError [ERR_ASSERTION]: The expression evaluated to a falsy value:
aB(t[TNe]===0)
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
● Request failed due to a transient API error. Retrying...
✗ Execution failed: Error: Failed to get response from the AI model; retried 5 times
(total retry wait time: 88.20631920058639 seconds)
Last error: CAPIError: 503 {"error":{"message":"HTTP/2 GOAWAY connection terminated","type":"connection_error"}}

The minified assertion trace reveals undici Pool/PoolBase internals: TNe → kPending, SNe → kRunning, plus kClients, kNeedDrain, kAddClient, kGetDispatcher, kRemoveClient symbols and a 2048-element ring buffer request queue (visible in #2050's full stack trace).
Root cause analysis
The failure chain is:
- Server sends HTTP/2 GOAWAY frame (normal lifecycle — connection TTL, load rebalancing, deploy)
- undici pool receives GOAWAY on active multiplexed connection — the handler in client-h2.js sets client[kHTTP2Session] = null, but in-flight streams haven't drained yet
- Pool state invariant violated — the pendingCount === 0 assertion fails because requests were queued between GOAWAY receipt and session close
- AssertionError thrown from minified pool code
- getCompletionWithTools catches the error, classifies it as "transient", and retries on the SAME corrupted pool
- 5 retries × same corrupted pool state = 5 wasted premium requests + guaranteed failure
This traces to a known, still-open undici bug: nodejs/undici#4059 — "Uncaught AssertionError thrown due to a possible race condition." The undici source at api-request.js:141-150 even contains a TODO: Does this need queueMicrotask? comment acknowledging the timing issue. A related but distinct bug (nodejs/undici#3140) was fixed in undici v6.14.0, but #4059 remains open and is the likely root cause of post-v1.0.6 occurrences.
Why the v1.0.6 fix is incomplete
The v1.0.6 fix targeted the most common trigger vector: sub-agent pool contention. When multiple sub-agents share a single undici Pool with allowH2: true, concurrent requests multiply the GOAWAY collision window. The fix appears to have addressed the sub-agent coordination layer but did not fix:
- Long-lived single-agent sessions (stale connections accumulate; #1754 was a 19h session)
- The transition between sub-agent exploration and parent-agent output generation (#2189)
- The retry loop itself — retries still target the same corrupted pool instead of creating a fresh connection
Contributing factors (from community reports)

| Factor | Evidence | Issue |
| --- | --- | --- |
| Claude models hit this far more than OpenAI/Google | "occurs much more frequently with opus-4.6 models. I've almost never seen it when using the sonnet models" | #1743 |
| Long sessions (4+ hours) | 19h27m session with 2,812 lines changed | #1754 |
| Sub-agent → parent transitions | "When it explores the codebase using a subagent, everything works fine, but when it tries to write the plan..." | #2189 |
| Large output generation | Failures cluster when model transitions from reading to writing | #2050, #2189 |
| Post-v1.0.6 persistence | Identical failures on v1.0.6 and v1.0.9 | #2101, #2189 |

Claude models likely trigger this more because Anthropic's API gateway uses shorter HTTP/2 connection TTLs or more aggressive GOAWAY behavior, and Opus generates longer responses that keep streams open longer.
Proposed fix (3 layers)
Layer 1: Proactive connection recycling
The Pool constructor (visible in #2050's trace) accepts clientTtl. Setting clientTtl: 60000 (60s) would proactively cycle connections before servers send GOAWAY frames, eliminating the stale-connection trigger entirely. For Anthropic endpoints specifically, a shorter TTL (30-45s) may be appropriate.
Additionally, listen for the disconnect event on the pool and recreate the dispatcher when GOAWAY is detected, rather than retrying on the corrupted pool.
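A minimal sketch of the recycling logic. In the real CLI, `makePool` would be something like `() => new undici.Pool(origin, { allowH2: true, clientTtl: 60_000 })`; it is injected here so the sketch stays dependency-free, and the names `createRecyclingPool` / `recycleIfGoaway` are illustrative, not existing CLI code.

```javascript
// Sketch: a self-recycling pool wrapper. On a GOAWAY-flavored disconnect,
// the (possibly corrupted) pool is dropped and a fresh one is created,
// so no retry ever reuses the broken pool state.
function createRecyclingPool(makePool) {
  let pool = makePool();
  let generation = 0; // how many times the pool has been replaced

  return {
    get current() { return pool; },
    get generation() { return generation; },
    // Wire this to the pool's 'disconnect' event: if the error looks like
    // a GOAWAY, close the old pool and build a replacement.
    recycleIfGoaway(err) {
      if (err && /GOAWAY/i.test(String(err.message || err))) {
        if (typeof pool.close === 'function') pool.close();
        pool = makePool();
        generation += 1;
        return true;
      }
      return false;
    },
  };
}
```

With undici the wiring would be `pool.on('disconnect', (origin, targets, err) => manager.recycleIfGoaway(err))`, combined with `clientTtl: 60_000` so connections are cycled before the server's GOAWAY ever arrives.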
Layer 2: Error-class-aware retry logic
Currently all 5xx errors and connection errors are retried identically. A GOAWAY (connection-level) failure requires a fundamentally different retry strategy than a capacity 503 (server-level):
- GOAWAY / connection_error: Reset the connection pool, then retry. 2-3 retries max, short backoff (1-2s). The pool is broken, not the server.
- 503 without GOAWAY: Server is overloaded. Exponential backoff, 3-5 retries. Pool is fine, server needs time.
- 429 / rate limit: Respect Retry-After. 1 retry max. Don't burn quota.
- 400 / client error: Don't retry at all. Surface immediately.
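The four classes above could be encoded in a small classifier, sketched here; the error shape ({ status, code, message }) and the function name are assumptions, not the CLI's actual error type.

```javascript
// Sketch: map an API failure to a retry strategy instead of treating
// every error as uniformly "transient".
function classifyFailure(err) {
  const msg = String(err.message || '');
  // Connection-level GOAWAY: the pool is broken, not the server.
  if (err.code === 'connection_error' || /GOAWAY/i.test(msg)) {
    return { action: 'reset-pool-then-retry', maxRetries: 3, backoffMs: 1500 };
  }
  // Capacity 503: the pool is fine, the server needs time.
  if (err.status === 503) {
    return { action: 'retry', maxRetries: 5, backoffMs: 2000, exponential: true };
  }
  // Rate limit: honor Retry-After and don't burn quota.
  if (err.status === 429) {
    return { action: 'retry', maxRetries: 1, respectRetryAfter: true };
  }
  // Client errors: surface immediately.
  if (err.status >= 400 && err.status < 500) {
    return { action: 'fail-fast', maxRetries: 0 };
  }
  // Anything else: conservative default.
  return { action: 'retry', maxRetries: 3, backoffMs: 1000, exponential: true };
}
```

Note the ordering: a 503 whose body mentions GOAWAY (as in the error output above) is classified as a connection-level failure first, not a capacity 503.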
The key change: on GOAWAY detection, destroy and recreate the connection pool before the first retry attempt.
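A sketch of that key change, with the pool factory and dispatch function injected as stand-ins for the real dispatcher (`sendWithPoolReset`, `makePool`, and `send` are hypothetical names):

```javascript
// Sketch: retry driver that replaces the pool BEFORE the next attempt
// whenever the failure is connection-level, so no retry reuses a pool
// corrupted by a GOAWAY.
async function sendWithPoolReset(makePool, send, maxRetries = 3) {
  let pool = makePool();
  let lastErr;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await send(pool);
    } catch (err) {
      lastErr = err;
      // Connection-level failure: fresh pool for the retry.
      if (/GOAWAY|connection_error/i.test(String(err.message || err.code))) {
        pool = makePool();
      }
    }
  }
  throw lastErr;
}
```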
Layer 3: Premium request protection
Each failed retry currently consumes a premium request. For connection-level failures (where the request never reached the model):
- At minimum: show users a running count of premium requests consumed by retries, with an opt-out after 2-3 wasted requests
- Ideally: send an X-Retry-Of: <original-request-id> header so the billing system can deduplicate connection-level retry charges
- Consider: per-sub-agent connection pools instead of shared pools, to prevent one sub-agent's GOAWAY from poisoning the entire session
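The "at minimum" option could be a small budget tracker, sketched below; all names are illustrative, not the CLI's actual billing API.

```javascript
// Sketch: count premium requests burned by retries and flag when the
// user should be asked before any further request is spent.
function createRetryBudget(maxWasted = 2) {
  let wasted = 0;
  return {
    recordWastedRetry() { wasted += 1; },
    get wasted() { return wasted; },
    // true once retries have exhausted the budget: surface the running
    // count to the user and offer an opt-out before retrying again
    get exhausted() { return wasted >= maxWasted; },
  };
}
```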
Affected version
v1.0.6
Steps to reproduce the behavior
Most reliable (works ~80% of the time):
1. copilot
2. /model → select Claude Opus 4.6 (High)
3. /plan on a medium-to-large repo
4. Let sub-agent exploration complete
5. When the agent transitions to writing output → GOAWAY hits

Also reliable:
- Run a session for 4+ hours with periodic prompts, then make a complex request
- Use /fleet with Claude models to create parallel sub-agent connections
- /resume a dormant session (4+ hours old) and immediately prompt a complex task
Expected behavior
- GOAWAY frames are handled as normal HTTP/2 lifecycle events, not assertion-triggering errors
- Connection pool is recycled before retrying after a GOAWAY
- Retries on connection-level errors don't silently burn premium requests
- Users see distinct, informative messages for connection errors vs. rate limits vs. server errors
- Long-running sessions proactively recycle connections before GOAWAY is received
Additional context
Environment
CLI version: v1.0.12 (also reproducible on v1.0.6, v1.0.9, v1.0.10)
Plan: Enterprise (also affects Business, Pro, Pro+)
OS: Linux, macOS, Windows/WSL (cross-platform)
Models primarily affected: Claude Opus 4.6, Claude Sonnet 4.6
Network: Both direct connections and behind corporate proxies (Zscaler, Netskope)
Related issues
#1743 — Autopilot mode AssertionError (v0.0.420, established model correlation with Opus)
#1754 — AssertionError during retrospective after 19h session (v0.0.420, best root cause analysis)
#2050 — Claude Sonnet 4.6 GOAWAY failure with full stack trace (v1.0.5, reveals undici internals)
#2101 — Transient API error → rate limit cascade (v1.0.6, post-fix)
#2189 — Claude Opus 4.6 GOAWAY during plan generation (v1.0.9, post-fix, 4 consecutive reproductions)
#1627 — Retry loop burns premium requests, switching models doesn't help (v0.0.420)
#2073 — Frequent transient errors leading to rate limits (v1.0.5)
nodejs/undici#4059 — Root cause: AssertionError race condition (OPEN)
nodejs/undici#3140 — Related: GOAWAY request hang (fixed in v6.14.0)
nodejs/undici#3011 — Related: null session ref after GOAWAY