fix(agents,think,ai-chat): re-attach to still-running sub-agent runs on parent recovery (#1630)#1640
Merged
Merged
Conversation
…on parent recovery (#1630) When a parent agent was evicted (deploy / DO reset) while a child agentTool() run was in flight, recovery sealed the run `interrupted` within ~5s and the parent re-issued the task — re-running the child's already-completed work ("the agent went all the way back and lost the files it wrote"). - Stable child runId: agentTool() now derives `agent-tool:<toolCallId>` from the recovery-preserved tool call id instead of a fresh nanoid, so a turn re-run by recovery resolves to the SAME idempotent child facet rather than spawning a new one (the primary amplification fix). - Bounded re-attach: a duplicate non-terminal runId (runAgentTool) and a still-running child during startup reconciliation now tail the live child to its real terminal result, bounded by DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS (120s). A hung child still seals `interrupted` after the budget so recovery can never block forever. - think + ai-chat child tails are now read-only on consumer detach: a parent's re-attach budget expiring cancels only the read view, never the still-running child (so it keeps advancing toward its own terminal for a later collect). Adds an internal `agent_tool:recovery:reattach` observability event; no new public config. Reworked the reconcile unit tests (running child -> completed; new bounded-tail-able-stuck -> interrupted after budget), a stable-runId test, and a natural-agentTool() task-amplification e2e variant. Co-authored-by: Cursor <cursoragent@cursor.com>
… eviction (#1630) A child facet self-heals its interrupted agent-tool turn via its own chatRecovery, but that path never wrote the child's run row — so after a real eviction the row stranded `running` (think) / was force-errored (ai-chat) and the parent could only ever collect `interrupted`/`error`, never the recovered result. Both packages now reconcile a stale child-run row (running, no live abort controller = original isolate gone) from the durable transcript on inspectAgentToolRun, gated on recovery state via _classifyAgentToolChildRecovery (lists chat-recovery incidents on the 1:1 child facet): in-progress -> keep `running` so the parent's bounded re-attach keeps waiting; settled with a completed assistant response -> `completed`; failed/empty -> `error`. This keeps the child's own (working) saveMessages recovery path untouched. Validated by a real-eviction e2e: task-amplification now drives the natural agentTool() path through kill/restart churn and asserts the child reaches all 30 steps AND the parent's run row settles `completed`. Note: routing the child through the durable submission registry (submitMessages) was tried first and reverted — the e2e showed it regresses multi-restart, multi-step tool-loop recovery (_recoverSubmissionsOnStart errors a running submission whose messages were applied, fighting the chat-recovery fiber). The transcript reconcile avoids touching the recovery path entirely. Co-authored-by: Cursor <cursoragent@cursor.com>
…t-less recovered runs (#1630) Review hardening for the sub-agent recovery work: - Reconcile no longer starves siblings. _reconcileAgentToolRuns is now a two-pass sweep: a deadline-bounded inspect/classify, then re-attach of still-running children IN PARALLEL, each bounded by its own re-attach budget. Previously the shared total-recovery deadline was consumed by the first child's (up to 120s) re-attach, so a slow/hung child caused every later sibling to be abandoned `interrupted` without an attempt. - A settled recovery that produced an assistant turn is now collected as `completed` even when the turn ended on a tool result with no final text — keying off text alone (think _getAgentToolFinalText / ai-chat _extractLatestAssistantText) mis-sealed a legitimately-finished but text-less run as `error`. getAgentToolSummary still falls back to "". New reconcileParallelThinkChildrenForTest asserts a hung child (started first) no longer starves a fast sibling. Full think (457) + ai-chat (480) suites green; both real-eviction e2e scenarios pass. Co-authored-by: Cursor <cursoragent@cursor.com>
…#1630) Addresses the three deferred review items: (a) Re-attach now tails from the child's CURRENT max chunk sequence instead of afterSequence: -1. The client reducer appends chunks by arrival order (ignoring sequence/replay), and a reconnected client already has the stored chunks via _replayAgentToolRuns — so replaying them on re-attach duplicated the run's parts on a connected client under repeated re-attach. Forwarding only genuinely-new chunks keeps the live stream correct without dupes. (b) Documented that the tail reader is deliberately abandoned (not cancelled) on budget-abort: cancelling a remote child-facet RPC stream surfaces an unswallowable "Stream was cancelled" pump rejection (verified). The hold is already bounded — the child reaches terminal within its own chat-recovery ceiling, firing the tail's registered closer which releases the reader. (c) Extracted the inspect reconcile-persist into a named private helper (_reconcileStaleAgentToolChildRun) in both packages so read vs reconcile are separated. The persist is retained intentionally (enables prompt tail-close + cheap subsequent inspects) and documented as lazy materialization of the run's true terminal. think 457 / ai-chat 480 suites green; both real-eviction e2e scenarios pass. Co-authored-by: Cursor <cursoragent@cursor.com>
Deep-review finding: once a run was sealed `interrupted` (e.g. a reconcile exhausted its re-attach budget before a slow child finished), the parent could never recover it — `_updateAgentToolTerminal` excluded `interrupted` from its overwrite guard, and `runAgentTool` returned the cached row for any terminal status. So a re-issue (stable runId) got stale `interrupted`, the model saw a retryable failure, and retried with a NEW toolCallId → fresh child → re-ran the already-completed work, re-introducing amplification for slow children. #1630 explicitly called for `_updateAgentToolTerminal` to let a later child completion repair the parent row. - `_updateAgentToolTerminal` now overwrites `interrupted` (soft terminal); only completed/error/aborted are hard/immutable. - `runAgentTool` routes an existing `interrupted` run through the re-attach path (like a non-terminal run) instead of returning the cached interrupted, so a re-issue re-attaches and repairs the row to the child's real result. Also: corrected a stale comment in the reconcile defer branch (re-attach now tails from the child's max chunk sequence, not -1). New reissueInterruptedThinkChildForTest locks in the repair. think 458 / ai-chat 480 green; both e2e scenarios pass. Co-authored-by: Cursor <cursoragent@cursor.com>
🦋 Changeset detectedLatest commit: 965f60a The changes in this PR will be included in the next version bump. This PR includes changesets to release 3 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
Contributor
Author
|
/bonk do a deep review, look for edge cases, is the ux/dx good |
agents
@cloudflare/ai-chat
@cloudflare/codemode
hono-agents
@cloudflare/shell
@cloudflare/think
@cloudflare/voice
@cloudflare/worker-bundler
commit: |
…oys (#1630) Extends the deploy-churn reliability harness (real `wrangler deploy`s mid-turn, higher fidelity than the SIGKILL e2e) to cover sub-agent re-attach: - `DeployChurnSubAgentChild` — a long, recoverable `recordStep` ledger child (mirrors the proven packages/think `ThinkToolRollbackE2EAgent`). - A `"subagent"` harness mode on `DeployChurnAgent` that drives the child via `agentTool()` (natural stable-runId path), plus `configureSubAgentRun` / `getSubAgentStatus` RPCs. - `churn.ts --mode subagent` with a RE-ATTACHED-vs-AMPLIFIED verdict. Run against the deployed worker (2 real mid-loop deploys) it confirms the #1630 fix holds under real deploys — the child completed all 30 steps with 1 re-run (no amplification, no data loss) — and surfaced a deeper follow-up: an orchestrating parent that only `await`s a sub-agent makes no forward progress of its own, so its chat recovery can exhaust before the child finishes (parent collected `interrupted` despite the child completing). That's the plan's N1 "apply recovery to the sub-agent path" item, beyond this PR's re-attach scope. Co-authored-by: Cursor <cursoragent@cursor.com>
Contributor
|
@threepointone Bonk workflow was cancelled. View workflow run · To retry, trigger Bonk again. |
4 tasks
threepointone
added a commit
that referenced
this pull request
Jun 1, 2026
…trating parent's recovery progress (N9) (#1641) A parent turn whose work is "run a sub-agent and await its result" produced no recoverable content of its own, so under deploy churn the PARENT's own chat-recovery no-progress window could exhaust while the child was still healthily streaming — abandoning the turn as `interrupted` and collecting an interrupted result even though the child went on to complete. Reproduced by the `examples/deploy-churn --mode subagent` harness: the parent exhausted at `attempt 6/6` with `progress: 1` while the child self-healed all 30 steps. Root cause: `_forwardAgentToolStream` (base Agent) only relayed child chunks to the parent's connections; it never touched the parent's durable recovery progress marker (which lives in the chat layer, not base Agent). Fix: add a base-Agent seam `_onAgentToolStreamProgress()` (no-op by default), invoked by `_forwardAgentToolStream` once per read iteration that actually forwarded >=1 child chunk. Think and AIChatAgent override it to bump their recovery-progress marker (throttled in-memory to 5s, reset per isolate so the first forwarded chunk after any restart always credits). Covers both the live `runAgentTool` path and the recovery `_reconcileAgentToolRuns` re-attach path. The bump is best-effort (wrapped so a failure never breaks the child stream). Design guard: credit is granted only on a genuinely-forwarded chunk — a silent/hung child forwards nothing, so the parent still exhausts on its own no-progress timer and a stuck sub-agent can never pin recovery open forever. This completes the sub-agent recovery story started by #1630/#1640: the child self-heals and the parent both re-attaches to it AND keeps its own recovery alive while doing so. Validation: - Unit (think + ai-chat): forwarding a child stream advances the marker and resets the parent's attempt budget like in-band content; a silent child does not credit, so the cap still binds. - SIGKILL e2e: the natural-agentTool() task-amplification scenario stays BOUNDED and the parent collects `completed` (no regression to #1640). - Real-deploy `deploy-churn --mode subagent` (3 mid-loop deploys): verdict flipped NOT COLLECTED -> RE-ATTACHED; the parent's incident shows progress: 8, attempt 5/6 (did not exhaust). Also fixed a harness gap: reset() now clears the parent transcript so a reused session doesn't short-circuit the mock model. Co-authored-by: Cursor <cursoragent@cursor.com>
Merged
This was referenced Jun 1, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #1630.
Problem
When a parent agent is interrupted (deploy / Durable Object eviction) while a child
agentTool()run is still in flight, parent recovery sealed the runinterruptedwithin a ~5s window and the parent re-issued the task — re-running the child's already-completed work. For long-running children under continuous deploys this surfaced to users as "the agent went all the way back and lost the files it already wrote."Root causes (all in
packages/agents/src/index.tsunless noted):agentTool()minted a freshnanoidper call, so a turn re-run by chat recovery spawned a brand-new child instead of re-using the in-flight one.runAgentTool) and a still-running child during startup reconciliation were sealedinterruptedat ~5s instead of being tailed to their real terminal result.chatRecovery, but that path never wrote the child's agent-tool run row, so the row strandedrunning(think) / was force-errored (ai-chat) — the parent could never collect the recovered result.interruptedwas an immutable terminal. Once sealedinterrupted, a later child completion could never repair the parent row, so a re-issue returned staleinterruptedand the model retried with a new toolCallId → fresh child → re-ran work.Change
1. Stable child runId —
agentTool()(agent-tools.ts)Derive the child
runIdfrom the (recovery-preserved) tool call id:agent-tool:<toolCallId>instead of a freshnanoid. A turn re-run by recovery resolves to the same idempotent child facet, so completed child work is never re-run. Falls back to a fresh id only when there's no tool call id.2. Bounded re-attach (
runAgentTool+_reconcileAgentToolRuns)A duplicate non-terminal
runIdand a still-running child during startup reconciliation now tail the live child to its real terminal result and collect it, instead of immediately sealinginterrupted. Bounded by a generous internal wall-clock budgetDEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS(120s), consistent with the wall-clock-dominant chat-recovery budget (#1638): a child that keeps advancing toward terminal within the window is collected; a genuinely hung child still sealsinterruptedso recovery can never block forever._reconcileAgentToolRunsis a two-pass sweep: a deadline-bounded inspect/classify, then parallel re-attach of still-running children (each with its own budget) — so one slow/hung child can't starve every later sibling against the shared deadline.3. Durable child-run reconcile — think + ai-chat (transcript-based, symmetric)
Both packages reconcile a stale child-run row (
running, no live abort controller = the original isolate is gone) from the durable transcript oninspectAgentToolRun, gated on recovery state via_classifyAgentToolChildRecovery()(lists chat-recovery incidents on the 1:1 child facet):in-progress→ keeprunningso the parent's bounded re-attach keeps waiting;completed(the parent collects the real result);failed/empty →error.This keeps the child's own (working)
saveMessagesrecovery path untouched.4.
interruptedis now a soft, repairable terminal (_updateAgentToolTerminal)_updateAgentToolTerminalno longer excludesinterruptedfrom its overwrite guard (onlycompleted/error/abortedare hard/immutable), andrunAgentToolroutes an existinginterruptedrun through the re-attach path. So a re-issue (or later child completion) repairs the row to the child's real result instead of returning staleinterrupted— implements #1630's "fix_updateAgentToolTerminalso a later child completion can repair the parent row."Edge cases handled (from review)
completedwhen the recovered turn produced any assistant message (text or tool-only), not only when there's final text.-1, so it forwards only new chunks (a reconnected client already replays stored chunks via_replayAgentToolRuns; the client reducer appends by arrival order)._checkRunFibers()(creates the recovery incident) isawaited inonStartbefore the facet serves RPC, so a stale-row inspect always sees the in-progress incident.API / surface
maxAttemptsstays the one knob).agent_tool:recovery:reattachobservability event.agentTool()runId derivation and the re-attach behavior are internal.Tests
agent-tools.test.ts): reworked still-running →completedvia re-attach; new bounded-tail-able-stuck →interruptedafter budget; new parallel reconcile (hung child can't starve a sibling); new interrupted-repair (re-issue collectscompleted). Plus theagentstest thatagentTool()derives a stable runId.completed/error/in-progresspaths.task-amplification.test.ts): drives both the hand-picked stable-runId parent and the naturalagentTool()parent through real kill/restart churn, asserting the child reaches all 30 steps and the parent's run row settlescompleted(parentChildStatus). Both scenarios pass.examples/deploy-churn, operator-run): a new--mode subagentdrives a child viaagentTool()and fires realwrangler deploys mid-child-loop. Run with 2 mid-loop deploys, it confirms the fix holds under real deploys — the child completed all 30 steps with 1 re-run (no amplification, no data loss). It also surfaced the follow-up below.Verification
tscclean (agents + think + ai-chat);oxlint+oxfmtclean.agentsagent-tool tests green.Known follow-ups (tracked, not in this PR)
awaits a sub-agent makes no forward progress of its own, so under heavy deploy churn the parent's own chat recovery can exhaust (stable_timeout,progress: 1) before the child finishes — the parent then collectsinterruptedeven though the child completed (its work preserved, no amplification). Fix (= plan's N1 "apply recovery to the sub-agent path"): count the child's forwarded progress as parent progress, and/or re-attach a still-recoverableinterruptedrow on a later restart. Beyond this PR's re-attach scope → separate PR.streamText-compatible mock child that completes after recovery (an AI-SDK harness problem orthogonal to this change; ai-chat's reconcile is unit-validated).chatRecoveryturn under churn (the latent base-API bug behind the reverted submission experiment).Test plan
agent-tool:<toolCallId>runId derivationexamples/deploy-churn+ the customer chaos harnessMade with Cursor