fix(opencode): chatgpt subscription stream hangs due to watchdog timeout #29420

@liuhaoyang

Problem

ChatGPT subscription streams can hang indefinitely without timing out, leaving users with frozen connections and no feedback.

Symptoms

  • LLM response streams from ChatGPT subscription endpoints stall permanently with no data
  • No timeout mechanism detects or recovers from hung connections
  • Users experience completely frozen sessions with no error message or recovery path

Root Cause

  1. Broken provider request timeout: The existing timeout mechanism used AbortSignal.timeout(), which does not work correctly in Bun's runtime. As a result, provider HTTP requests had no effective timeout and connections could hang indefinitely.

  2. No stream-level watchdog: The session processor lacked any idle/first-byte timeout on LLM streams. Once a stream stopped producing chunks (or never started), there was no detection or recovery.

  3. Incomplete retry classification: Transient server errors (429, 502, 503, 504) and common ChatGPT error shapes were not always classified as retryable, causing permanent failures on recoverable conditions.

  4. Missing abort signal propagation: Token refresh calls in the Codex auth plugin did not propagate the caller's abort signal, so cancellation during auth refresh leaked hanging requests.
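
Items 1 and 4 share one remedy: a timeout signal that is driven by a plain timer and also follows the caller's signal. The following is a minimal sketch of that idea, not the code from this change. `timeoutSignal`, `TimeoutHandle`, and `postWithTimeout` are hypothetical names, and the 30-second value in the usage function is an arbitrary placeholder.

```typescript
// Hypothetical helper; names and defaults are illustrative only.
interface TimeoutHandle {
  signal: AbortSignal
  clear: () => void
}

function timeoutSignal(ms: number, parent?: AbortSignal): TimeoutHandle {
  const controller = new AbortController()

  // Manual timer instead of AbortSignal.timeout(); aborts with a TimeoutError.
  const timer = setTimeout(() => {
    controller.abort(new DOMException(`timed out after ${ms}ms`, "TimeoutError"))
  }, ms)

  // Forward the caller's cancellation so the request does not outlive it.
  const onParentAbort = () => {
    clearTimeout(timer)
    controller.abort(parent?.reason)
  }
  if (parent?.aborted) onParentAbort()
  else parent?.addEventListener("abort", onParentAbort, { once: true })

  return {
    signal: controller.signal,
    // Call once the guarded operation settles, so no timer or listener leaks.
    clear: () => {
      clearTimeout(timer)
      parent?.removeEventListener("abort", onParentAbort)
    },
  }
}

// Hypothetical usage: a POST bounded by both a timeout and the caller's signal.
// Note the timer is cleared once headers arrive; reading the body is not covered.
async function postWithTimeout(url: string, body: URLSearchParams, caller?: AbortSignal) {
  const t = timeoutSignal(30_000, caller)
  try {
    return await fetch(url, { method: "POST", body, signal: t.signal })
  } finally {
    t.clear()
  }
}
```

Returning an explicit `clear` function keeps the timer from firing after the request has already finished, which is the main bookkeeping cost of not using `AbortSignal.timeout()`.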

Impact

  • Users on ChatGPT subscription plans experience frozen/hung sessions with no feedback
  • No automatic recovery from transient server-side stalls
  • Subagent tasks can run indefinitely without timeout bounds

Fix Approach

Repair and extend timeout handling across multiple layers:

  • Replace broken AbortSignal.timeout() with manual setTimeout + AbortController for provider request timeouts (5 min default)
  • Add a stream-level watchdog with separate first-byte (30s) and idle (120s) timeouts that wraps LLM streams
  • Extend retry classification to recognize transient server errors (429/502/503/504) and deeply-nested ChatGPT error patterns
  • Convert DOMException(TimeoutError) into retryable APIError for proper recovery
  • Propagate abort signals through Codex OAuth token exchange and refresh paths
  • Add configurable subagent timeout with input validation
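
The stream-level watchdog in the second bullet can be pictured as a wrapper around an async iterable that races each chunk against a timer. This is a minimal sketch under stated assumptions, not the implementation in this change: `withWatchdog` and `StreamStallError` are hypothetical names, and only the 30s and 120s defaults come from the list above.

```typescript
// Hypothetical names; the 30s / 120s defaults mirror the values listed above.
class StreamStallError extends Error {
  constructor(readonly phase: "first-byte" | "idle", ms: number) {
    super(`stream ${phase} timeout after ${ms}ms`)
    this.name = "StreamStallError"
  }
}

async function* withWatchdog<T>(
  source: AsyncIterable<T>,
  opts: { firstByteMs?: number; idleMs?: number } = {},
): AsyncGenerator<T> {
  const { firstByteMs = 30_000, idleMs = 120_000 } = opts
  const iterator = source[Symbol.asyncIterator]()
  let phase: "first-byte" | "idle" = "first-byte"
  try {
    while (true) {
      const limit = phase === "first-byte" ? firstByteMs : idleMs
      let timer: ReturnType<typeof setTimeout> | undefined
      const stalled = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new StreamStallError(phase, limit)), limit)
      })
      let result: IteratorResult<T>
      try {
        // Whichever settles first wins: the next chunk or the stall timer.
        result = await Promise.race([iterator.next(), stalled])
      } finally {
        // Clear before yielding so consumer processing time is not counted as idle.
        clearTimeout(timer)
      }
      if (result.done) return
      phase = "idle"
      yield result.value
    }
  } finally {
    // Best-effort cleanup; not awaited, because a hung source may never settle.
    void iterator.return?.()?.catch(() => {})
  }
}
```

A caller would iterate `withWatchdog(stream)` in place of `stream` and treat `StreamStallError` as a retryable failure, in the same spirit as the `DOMException(TimeoutError)` conversion listed above. The timer is restarted per chunk, so a slow but steady stream is never cut off.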
