
fix: add timeout and retry recovery for subagent LLM streams #13846

Open
timvw wants to merge 1 commit into anomalyco:dev from timvw:fix/subagent-timeout-recovery

timvw commented Feb 16, 2026

Summary

Fixes #13841 — Explore subagent hangs indefinitely with Anthropic Claude Opus 4.6 (and other providers) due to four compounding gaps in timeout/retry handling.

Changes

1. Default 300s fetch timeout (provider.ts)

The config schema documented a 300s default, but the field was .optional() with no enforced default. When unconfigured, options["timeout"] was undefined and the AbortSignal.timeout() guard was skipped entirely. Combined with Bun's socket timeout being explicitly disabled (timeout: false), there was no fallback timeout of any kind.

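Now: const timeout = options["timeout"] ?? 300_000. Because ?? only replaces null or undefined, the documented default applies whenever the option is unset, while an explicit false is passed through unchanged and still disables the timeout.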
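
For illustration, the fallback can be sketched in isolation. This is a minimal sketch, not the actual provider.ts code: `resolveTimeout`, `timeoutSignal`, and `DEFAULT_TIMEOUT_MS` are hypothetical names, and only the `??` expression comes from the change itself.

```typescript
// Illustrative sketch only; every name here is hypothetical except the `??` fallback.
type TimeoutOption = number | false | null | undefined

const DEFAULT_TIMEOUT_MS = 300_000 // the documented 300s default

// `??` replaces only null/undefined, so an explicit `false` survives.
function resolveTimeout(configured: TimeoutOption): number | false {
  return configured ?? DEFAULT_TIMEOUT_MS
}

// Build the signal that would be handed to fetch, or none when the timeout is disabled.
function timeoutSignal(configured: TimeoutOption): AbortSignal | undefined {
  const timeout = resolveTimeout(configured)
  return timeout === false ? undefined : AbortSignal.timeout(timeout)
}
```

With this shape, an unconfigured provider gets a 300s signal, a configured number is used as given, and `false` yields no signal at all.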

2. Stream-idle watchdog (processor.ts)

A connection that stays open but stops delivering SSE chunks was never detected. The LLM.stream() call only passed the session's AbortSignal with no idle detection.

Now: withIdleTimeout() async generator wraps stream.fullStream with a 2-minute idle timer that resets on each received chunk. On timeout, throws a StreamIdleTimeoutError (marked retryable via message-v2.ts).
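
One way such a watchdog could look is sketched below. It assumes the wrapper races each `next()` call against a timer; the PR's real `withIdleTimeout()` and `StreamIdleTimeoutError` may be structured differently, and the signature shown here is a guess.

```typescript
// Sketch of an idle watchdog for an async stream; the signature is an assumption.
class StreamIdleTimeoutError extends Error {
  constructor(idleMs: number) {
    super(`stream delivered no chunk for ${idleMs}ms`)
    this.name = "StreamIdleTimeoutError"
  }
}

async function* withIdleTimeout<T>(source: AsyncIterable<T>, idleMs: number): AsyncGenerator<T> {
  const iterator = source[Symbol.asyncIterator]()
  try {
    while (true) {
      // A fresh timer per chunk: receiving a chunk effectively resets the idle clock.
      let timer: ReturnType<typeof setTimeout> | undefined
      const idle = new Promise<never>((_, reject) => {
        timer = setTimeout(() => reject(new StreamIdleTimeoutError(idleMs)), idleMs)
      })
      // Clear the timer before yielding so a slow consumer is not counted as idle time.
      const result = await Promise.race([iterator.next(), idle]).finally(() => clearTimeout(timer))
      if (result.done) return
      yield result.value
    }
  } finally {
    // Best-effort cleanup; a stalled source may never settle, so this is not awaited.
    void iterator.return?.().catch(() => {})
  }
}
```

A caller would iterate `withIdleTimeout(stream.fullStream, 2 * 60 * 1000)` instead of the raw stream and treat `StreamIdleTimeoutError` as a retryable failure.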

3. Session-level retry cap (processor.ts)

The retry loop had no maximum attempt count — attempt++ with no ceiling. If errors stayed retryable (e.g. rate limits, overloaded), it looped forever.

Now: MAX_RETRY_ATTEMPTS = 10 — after exceeding the cap, the error is surfaced to the user instead of retrying.
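
The capped loop can be sketched roughly as follows. Only `MAX_RETRY_ATTEMPTS = 10` and the `attempt++` counter come from the description; `runWithRetryCap`, its parameters, and the backoff schedule are hypothetical, and the sketch assumes "exceeding the cap" means the error surfaces once the retry count goes past 10.

```typescript
// Sketch of a bounded retry loop; helper name, parameters, and backoff are assumptions.
const MAX_RETRY_ATTEMPTS = 10

async function runWithRetryCap<T>(
  operation: () => Promise<T>,
  isRetryable: (error: unknown) => boolean,
  delayMs: (attempt: number) => number = (attempt) => Math.min(1000 * 2 ** (attempt - 1), 30_000),
): Promise<T> {
  let attempt = 0
  while (true) {
    try {
      return await operation()
    } catch (error) {
      attempt++
      // Surface the error when it is not retryable or the retry budget is spent.
      if (!isRetryable(error) || attempt > MAX_RETRY_ATTEMPTS) throw error
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs(attempt)))
    }
  }
}
```

Under this reading, a persistently failing operation runs at most 11 times: the initial call plus 10 retries.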

4. Subagent timeout (task.ts)

The Task tool awaited SessionPrompt.prompt() with no timeout wrapper. Cancellation only happened when the parent session was manually aborted.

Now: wraps the prompt call with abortAfterAny(10 * 60 * 1000, ctx.abort) — 10-minute hard ceiling that also cascades to SessionPrompt.cancel().
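
The deadline-plus-cascade pattern can be sketched as below. `abortAfterAnySketch` is a hypothetical stand-in written for this example and is not the project's `abortAfterAny()`, whose return shape is not shown here; `promptWithDeadline`, `prompt`, and `cancel` are likewise placeholders for the Task tool's call to `SessionPrompt.prompt()` and `SessionPrompt.cancel()`.

```typescript
// Hypothetical stand-in: aborts after `ms` or when any given signal aborts.
function abortAfterAnySketch(ms: number, ...signals: AbortSignal[]) {
  const controller = new AbortController()
  const abort = () => controller.abort()
  const timer = setTimeout(abort, ms)
  for (const signal of signals) {
    if (signal.aborted) abort()
    else signal.addEventListener("abort", abort, { once: true })
  }
  const dispose = () => {
    clearTimeout(timer)
    for (const signal of signals) signal.removeEventListener("abort", abort)
  }
  return { signal: controller.signal, dispose }
}

// Runs `prompt` under a hard ceiling; on deadline or parent abort, calls `cancel` and rejects.
async function promptWithDeadline<T>(
  prompt: (signal: AbortSignal) => Promise<T>,
  cancel: () => void,
  parent: AbortSignal,
  ms: number = 10 * 60 * 1000,
): Promise<T> {
  const { signal, dispose } = abortAfterAnySketch(ms, parent)
  const aborted = new Promise<never>((_, reject) => {
    const onAbort = () => {
      cancel() // cascade the abort to the child session
      reject(new Error("subagent aborted: deadline reached or parent cancelled"))
    }
    if (signal.aborted) onAbort()
    else signal.addEventListener("abort", onAbort, { once: true })
  })
  try {
    return await Promise.race([prompt(signal), aborted])
  } finally {
    dispose() // stop the timer and detach from the parent once the call settles
  }
}
```

Disposing in `finally` keeps a finished subagent from being cancelled later when the parent session eventually aborts.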

Testing

  • All 980 existing tests pass (0 failures)
  • Changes are minimal and surgical — 59 lines added, 6 removed across 4 files

Related Issues

Addresses anomalyco#13841 - Explore subagent hangs indefinitely with no timeout.

Four compounding gaps allowed a stalled LLM stream to hang forever:

1. Apply default 300s fetch timeout when provider timeout is unconfigured
   (provider.ts). The documented default was never enforced.

2. Add 2-minute stream-idle watchdog (processor.ts). Detects when an open
   connection stops delivering SSE chunks and throws a retryable error.

3. Cap session-level retries at 10 attempts (processor.ts). Previously
   unbounded, retries could loop forever on persistent errors.

4. Add 10-minute subagent timeout (task.ts). Wraps SessionPrompt.prompt()
   with abortAfterAny() so a hung subagent cannot block the parent session
   indefinitely.

@github-actions (Contributor)

The following comment was made by an LLM; it may be inaccurate:

Based on my searches, one result stands out as potentially related:

Potential Related PR:

However, all other searches returned only PR #13846 (the current PR), indicating there are no other open PRs with substantial overlap on subagent hangs, stream idle detection, or the specific timeout/retry recovery mechanisms being implemented in this PR.

The related PR #13502 appears to address timeout retries but may not be the exact same fix. You should verify if #13502 is already merged or if it addresses the same root causes as #13846.

