fix: add timeout and retry recovery for subagent LLM streams #13846
Open
timvw wants to merge 1 commit into anomalyco:dev from
Addresses anomalyco#13841 - Explore subagent hangs indefinitely with no timeout.

Four compounding gaps allowed a stalled LLM stream to hang forever:

1. Apply default 300s fetch timeout when provider timeout is unconfigured (provider.ts). The documented default was never enforced.
2. Add 2-minute stream-idle watchdog (processor.ts). Detects when an open connection stops delivering SSE chunks and throws a retryable error.
3. Cap session-level retries at 10 attempts (processor.ts). Previously unbounded, retries could loop forever on persistent errors.
4. Add 10-minute subagent timeout (task.ts). Wraps SessionPrompt.prompt() with abortAfterAny() so a hung subagent cannot block the parent session indefinitely.
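To make the first three fixes concrete, here is a minimal TypeScript sketch of a default-timeout fallback, a stream-idle watchdog, and a capped retry loop. It is an illustration under stated assumptions, not the code in this PR: `resolveTimeout`, `rejectIfIdle`, and `runWithRetryCap` are hypothetical helpers, the shapes of `StreamIdleTimeoutError` and `withIdleTimeout` are guessed from their names, and the real retry loop presumably also applies backoff and a broader retryable-error check.

```typescript
// Illustrative sketch only; helper names and shapes are assumptions, not the PR's code.

/** Fix 1: fall back to 300s when unset, but keep an explicit `false` (timeout disabled). */
function resolveTimeout(configured: number | false | undefined): number | false {
  return configured ?? 300_000; // `??` only replaces null/undefined, so `false` survives
}

/** Thrown when the stream delivers no chunk within the idle window. */
class StreamIdleTimeoutError extends Error {
  readonly retryable = true;
  constructor(idleMs: number) {
    super(`stream idle for more than ${idleMs}ms`);
    this.name = "StreamIdleTimeoutError";
  }
}

/** Settles with `pending`, or rejects if `pending` takes longer than `idleMs`. */
function rejectIfIdle<T>(pending: Promise<T>, idleMs: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new StreamIdleTimeoutError(idleMs)), idleMs);
    pending.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

/** Fix 2: re-yield chunks from `source`; the idle timer restarts for every chunk. */
async function* withIdleTimeout<T>(source: AsyncIterable<T>, idleMs: number): AsyncGenerator<T> {
  const iterator = source[Symbol.asyncIterator]();
  try {
    while (true) {
      const result = await rejectIfIdle(iterator.next(), idleMs);
      if (result.done) return;
      yield result.value;
    }
  } finally {
    // Best-effort cleanup; not awaited because a stalled source may never settle.
    void iterator.return?.()?.catch(() => {});
  }
}

/** Fix 3: retry retryable failures, but surface the error after `maxAttempts` tries. */
async function runWithRetryCap<T>(
  run: (attempt: number) => Promise<T>,
  maxAttempts = 10,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await run(attempt);
    } catch (error) {
      // Stand-in classification: only the idle timeout counts as retryable here.
      const retryable = error instanceof StreamIdleTimeoutError;
      if (!retryable || attempt >= maxAttempts) throw error;
    }
  }
}
```

In this sketch a stalled stream surfaces as a `StreamIdleTimeoutError` from the `for await` loop that consumes `withIdleTimeout(...)`, and wrapping that loop in `runWithRetryCap` bounds how many times it is re-attempted before the error reaches the caller.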
Contributor
The following comment was made by an LLM, it may be inaccurate: Based on my searches, one result stands out as potentially related: Potential Related PR:
However, all other searches returned only PR #13846 (the current PR), indicating there are no other open PRs with substantial overlap on subagent hangs, stream idle detection, or the specific timeout/retry recovery mechanisms being implemented in this PR. The related PR #13502 appears to address timeout retries but may not be the exact same fix. You should verify if #13502 is already merged or if it addresses the same root causes as #13846.
Summary
Fixes #13841 — Explore subagent hangs indefinitely with Anthropic Claude Opus 4.6 (and other providers) due to four compounding gaps in timeout/retry handling.
Changes
1. Default 300s fetch timeout (`provider.ts`)

The config schema documented a 300s default, but the field was `.optional()` with no enforced default. When unconfigured, `options["timeout"]` was `undefined` and the `AbortSignal.timeout()` guard was skipped entirely. Combined with Bun's socket timeout being explicitly disabled (`timeout: false`), there was no fallback timeout of any kind.

Now: `const timeout = options["timeout"] ?? 300_000`, which always applies the documented default unless explicitly set to `false`.

2. Stream-idle watchdog (`processor.ts`)

A connection that stays open but stops delivering SSE chunks was never detected. The `LLM.stream()` call only passed the session's `AbortSignal` with no idle detection.

Now: a `withIdleTimeout()` async generator wraps `stream.fullStream` with a 2-minute idle timer that resets on each received chunk. On timeout, it throws a `StreamIdleTimeoutError` (marked retryable via `message-v2.ts`).

3. Session-level retry cap (`processor.ts`)

The retry loop had no maximum attempt count: `attempt++` with no ceiling. If errors stayed retryable (e.g. rate limits, overloaded), it looped forever.

Now: `MAX_RETRY_ATTEMPTS = 10`. After exceeding the cap, the error is surfaced to the user instead of retrying.

4. Subagent timeout (`task.ts`)

The Task tool awaited `SessionPrompt.prompt()` with no timeout wrapper. Cancellation only happened when the parent session was manually aborted.

Now: wraps the prompt call with `abortAfterAny(10 * 60 * 1000, ctx.abort)`, a 10-minute hard ceiling that also cascades to `SessionPrompt.cancel()`.

Testing
Related Issues