fix(agents,think,ai-chat): re-attach to still-running sub-agent runs on parent recovery (#1630) by threepointone · Pull Request #1640 · cloudflare/agents

threepointone · 2026-06-01T11:48:42Z

Closes #1630.

Problem

When a parent agent is interrupted (deploy / Durable Object eviction) while a child agentTool() run is still in flight, parent recovery sealed the run interrupted within a ~5s window and the parent re-issued the task — re-running the child's already-completed work. For long-running children under continuous deploys this surfaced to users as "the agent went all the way back and lost the files it already wrote."

Root causes (all in packages/agents/src/index.ts unless noted):

Fresh runId per call. agentTool() minted a fresh nanoid per call, so a turn re-run by chat recovery spawned a brand-new child instead of re-using the in-flight one.
Abandon, don't re-attach. A duplicate non-terminal run (runAgentTool) and a still-running child during startup reconciliation were sealed interrupted at ~5s instead of being tailed to their real terminal result.
No durable child-run reconcile. A child facet self-heals its interrupted turn via its own chatRecovery, but that path never wrote the child's agent-tool run row, so the row stranded running (think) / was force-errored (ai-chat) — the parent could never collect the recovered result.
interrupted was an immutable terminal. Once sealed interrupted, a later child completion could never repair the parent row, so a re-issue returned stale interrupted and the model retried with a new toolCallId → fresh child → re-ran work.

Change

1. Stable child runId — `agentTool()` (`agent-tools.ts`)

Derive the child runId from the (recovery-preserved) tool call id: agent-tool:<toolCallId> instead of a fresh nanoid. A turn re-run by recovery resolves to the same idempotent child facet, so completed child work is never re-run. Falls back to a fresh id only when there's no tool call id.

2. Bounded re-attach (`runAgentTool` + `_reconcileAgentToolRuns`)

A duplicate non-terminal runId and a still-running child during startup reconciliation now tail the live child to its real terminal result and collect it, instead of immediately sealing interrupted. Bounded by a generous internal wall-clock budget DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS (120s), consistent with the wall-clock-dominant chat-recovery budget (#1638): a child that keeps advancing toward terminal within the window is collected; a genuinely hung child still seals interrupted so recovery can never block forever.

_reconcileAgentToolRuns is a two-pass sweep: a deadline-bounded inspect/classify, then parallel re-attach of still-running children (each with its own budget) — so one slow/hung child can't starve every later sibling against the shared deadline.

3. Durable child-run reconcile — think + ai-chat (transcript-based, symmetric)

Both packages reconcile a stale child-run row (running, no live abort controller = the original isolate is gone) from the durable transcript on inspectAgentToolRun, gated on recovery state via _classifyAgentToolChildRecovery() (lists chat-recovery incidents on the 1:1 child facet):

recovery in-progress → keep running so the parent's bounded re-attach keeps waiting;
settled with an assistant turn (text or tool-only) → completed (the parent collects the real result);
failed/empty → error.

This keeps the child's own (working) saveMessages recovery path untouched.

Why not the durable submission registry? We first tried running the child via submitMessages({ submissionId: runId }) so the durable submission row would be the recovery-reconciled terminal marker. The unit suite stayed green, but the real-eviction e2e caught a regression: _recoverSubmissionsOnStart's "error a running submission whose messages were applied" policy fights the chat-recovery fiber resuming a multi-restart, multi-step tool loop — the child stalled at 10/30 steps and the parent collected error. Submissions are a clean single-turn durable primitive but the wrong execution path for resumable agent-tool loops. Reverted in favor of the transcript reconcile, which never touches the recovery path. (Tracked as base-API follow-ups N7/N8 in the working doc.)

4. `interrupted` is now a soft, repairable terminal (`_updateAgentToolTerminal`)

_updateAgentToolTerminal no longer excludes interrupted from its overwrite guard (only completed/error/aborted are hard/immutable), and runAgentTool routes an existing interrupted run through the re-attach path. So a re-issue (or later child completion) repairs the row to the child's real result instead of returning stale interrupted — implements #1630's "fix _updateAgentToolTerminal so a later child completion can repair the parent row."

Edge cases handled (from review)

Multi-child starvation — parallel re-attach (above).
Text-less completed runs — collected as completed when the recovered turn produced any assistant message (text or tool-only), not only when there's final text.
Chunk re-replay — re-attach tails from the child's current max chunk sequence, not -1, so it forwards only new chunks (a reconnected client already replays stored chunks via _replayAgentToolRuns; the client reducer appends by arrival order).
Hung-child reader — on budget-abort the tail reader is abandoned (a pre-caught read), not cancelled: cancelling a remote child-facet RPC stream surfaces an unswallowable "Stream was cancelled" pump rejection (verified). The hold is bounded — the child reaches terminal within its own chat-recovery ceiling, firing the tail's closer which releases the reader.
No premature terminalize — _checkRunFibers() (creates the recovery incident) is awaited in onStart before the facet serves RPC, so a stale-row inspect always sees the in-progress incident.

API / surface

No new public configuration (budget is an internal constant; the existing maxAttempts stays the one knob).
Adds an internal agent_tool:recovery:reattach observability event.
agentTool() runId derivation and the re-attach behavior are internal.

Tests

think unit (agent-tools.test.ts): reworked still-running → completed via re-attach; new bounded-tail-able-stuck → interrupted after budget; new parallel reconcile (hung child can't starve a sibling); new interrupted-repair (re-issue collects completed). Plus the agents test that agentTool() derives a stable runId.
ai-chat unit: the reconcile classify/completed/error/in-progress paths.
Real-eviction e2e (task-amplification.test.ts): drives both the hand-picked stable-runId parent and the natural agentTool() parent through real kill/restart churn, asserting the child reaches all 30 steps and the parent's run row settles completed (parentChildStatus). Both scenarios pass.
Real-deploy harness (examples/deploy-churn, operator-run): a new --mode subagent drives a child via agentTool() and fires real wrangler deploys mid-child-loop. Run with 2 mid-loop deploys, it confirms the fix holds under real deploys — the child completed all 30 steps with 1 re-run (no amplification, no data loss). It also surfaced the follow-up below.

Verification

tsc clean (agents + think + ai-chat); oxlint + oxfmt clean.
Full suites green: think 458, ai-chat 480; agents agent-tool tests green.
Both real-eviction e2e scenarios pass (~224s).

Known follow-ups (tracked, not in this PR)

Orchestrating-parent recovery progress (surfaced by the real-deploy harness). A parent turn that merely awaits a sub-agent makes no forward progress of its own, so under heavy deploy churn the parent's own chat recovery can exhaust (stable_timeout, progress: 1) before the child finishes — the parent then collects interrupted even though the child completed (its work preserved, no amplification). Fix (= plan's N1 "apply recovery to the sub-agent path"): count the child's forwarded progress as parent progress, and/or re-attach a still-recoverable interrupted row on a later restart. Beyond this PR's re-attach scope → separate PR.
ai-chat real-eviction e2e — blocked on building a deterministic, LLM-free, streamText-compatible mock child that completes after recovery (an AI-SDK harness problem orthogonal to this change; ai-chat's reconcile is unit-validated).
N7 — durable submissions can double-recover / prematurely error a chatRecovery turn under churn (the latent base-API bug behind the reverted submission experiment).
N8 — unify the durable "turn status/result" record (transcript vs deleted-on-success incident vs submissions).

Test plan

CI green
Review the re-attach budget (120s) and the agent-tool:<toolCallId> runId derivation
Validate against examples/deploy-churn + the customer chaos harness

Made with Cursor

…on parent recovery (#1630) When a parent agent was evicted (deploy / DO reset) while a child agentTool() run was in flight, recovery sealed the run `interrupted` within ~5s and the parent re-issued the task — re-running the child's already-completed work ("the agent went all the way back and lost the files it wrote"). - Stable child runId: agentTool() now derives `agent-tool:<toolCallId>` from the recovery-preserved tool call id instead of a fresh nanoid, so a turn re-run by recovery resolves to the SAME idempotent child facet rather than spawning a new one (the primary amplification fix). - Bounded re-attach: a duplicate non-terminal runId (runAgentTool) and a still-running child during startup reconciliation now tail the live child to its real terminal result, bounded by DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS (120s). A hung child still seals `interrupted` after the budget so recovery can never block forever. - think + ai-chat child tails are now read-only on consumer detach: a parent's re-attach budget expiring cancels only the read view, never the still-running child (so it keeps advancing toward its own terminal for a later collect). Adds an internal `agent_tool:recovery:reattach` observability event; no new public config. Reworked the reconcile unit tests (running child -> completed; new bounded-tail-able-stuck -> interrupted after budget), a stable-runId test, and a natural-agentTool() task-amplification e2e variant. Co-authored-by: Cursor <cursoragent@cursor.com>

… eviction (#1630) A child facet self-heals its interrupted agent-tool turn via its own chatRecovery, but that path never wrote the child's run row — so after a real eviction the row stranded `running` (think) / was force-errored (ai-chat) and the parent could only ever collect `interrupted`/`error`, never the recovered result. Both packages now reconcile a stale child-run row (running, no live abort controller = original isolate gone) from the durable transcript on inspectAgentToolRun, gated on recovery state via _classifyAgentToolChildRecovery (lists chat-recovery incidents on the 1:1 child facet): in-progress -> keep `running` so the parent's bounded re-attach keeps waiting; settled with a completed assistant response -> `completed`; failed/empty -> `error`. This keeps the child's own (working) saveMessages recovery path untouched. Validated by a real-eviction e2e: task-amplification now drives the natural agentTool() path through kill/restart churn and asserts the child reaches all 30 steps AND the parent's run row settles `completed`. Note: routing the child through the durable submission registry (submitMessages) was tried first and reverted — the e2e showed it regresses multi-restart, multi-step tool-loop recovery (_recoverSubmissionsOnStart errors a running submission whose messages were applied, fighting the chat-recovery fiber). The transcript reconcile avoids touching the recovery path entirely. Co-authored-by: Cursor <cursoragent@cursor.com>

…t-less recovered runs (#1630) Review hardening for the sub-agent recovery work: - Reconcile no longer starves siblings. _reconcileAgentToolRuns is now a two-pass sweep: a deadline-bounded inspect/classify, then re-attach of still-running children IN PARALLEL, each bounded by its own re-attach budget. Previously the shared total-recovery deadline was consumed by the first child's (up to 120s) re-attach, so a slow/hung child caused every later sibling to be abandoned `interrupted` without an attempt. - A settled recovery that produced an assistant turn is now collected as `completed` even when the turn ended on a tool result with no final text — keying off text alone (think _getAgentToolFinalText / ai-chat _extractLatestAssistantText) mis-sealed a legitimately-finished but text-less run as `error`. getAgentToolSummary still falls back to "". New reconcileParallelThinkChildrenForTest asserts a hung child (started first) no longer starves a fast sibling. Full think (457) + ai-chat (480) suites green; both real-eviction e2e scenarios pass. Co-authored-by: Cursor <cursoragent@cursor.com>

…#1630) Addresses the three deferred review items: (a) Re-attach now tails from the child's CURRENT max chunk sequence instead of afterSequence: -1. The client reducer appends chunks by arrival order (ignoring sequence/replay), and a reconnected client already has the stored chunks via _replayAgentToolRuns — so replaying them on re-attach duplicated the run's parts on a connected client under repeated re-attach. Forwarding only genuinely-new chunks keeps the live stream correct without dupes. (b) Documented that the tail reader is deliberately abandoned (not cancelled) on budget-abort: cancelling a remote child-facet RPC stream surfaces an unswallowable "Stream was cancelled" pump rejection (verified). The hold is already bounded — the child reaches terminal within its own chat-recovery ceiling, firing the tail's registered closer which releases the reader. (c) Extracted the inspect reconcile-persist into a named private helper (_reconcileStaleAgentToolChildRun) in both packages so read vs reconcile are separated. The persist is retained intentionally (enables prompt tail-close + cheap subsequent inspects) and documented as lazy materialization of the run's true terminal. think 457 / ai-chat 480 suites green; both real-eviction e2e scenarios pass. Co-authored-by: Cursor <cursoragent@cursor.com>

Deep-review finding: once a run was sealed `interrupted` (e.g. a reconcile exhausted its re-attach budget before a slow child finished), the parent could never recover it — `_updateAgentToolTerminal` excluded `interrupted` from its overwrite guard, and `runAgentTool` returned the cached row for any terminal status. So a re-issue (stable runId) got stale `interrupted`, the model saw a retryable failure, and retried with a NEW toolCallId → fresh child → re-ran the already-completed work, re-introducing amplification for slow children. #1630 explicitly called for `_updateAgentToolTerminal` to let a later child completion repair the parent row. - `_updateAgentToolTerminal` now overwrites `interrupted` (soft terminal); only completed/error/aborted are hard/immutable. - `runAgentTool` routes an existing `interrupted` run through the re-attach path (like a non-terminal run) instead of returning the cached interrupted, so a re-issue re-attaches and repairs the row to the child's real result. Also: corrected a stale comment in the reconcile defer branch (re-attach now tails from the child's max chunk sequence, not -1). New reissueInterruptedThinkChildForTest locks in the repair. think 458 / ai-chat 480 green; both e2e scenarios pass. Co-authored-by: Cursor <cursoragent@cursor.com>

changeset-bot · 2026-06-01T11:48:46Z

🦋 Changeset detected

Latest commit: 965f60a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages

Name	Type
agents	Minor
@cloudflare/think	Patch
@cloudflare/ai-chat	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

threepointone · 2026-06-01T11:51:50Z

/bonk do a deep review, look for edge cases, is the ux/dx good

pkg-pr-new · 2026-06-01T11:55:48Z

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1640

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1640

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1640

hono-agents

npm i https://pkg.pr.new/hono-agents@1640

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1640

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1640

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1640

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1640

commit: 965f60a

…oys (#1630) Extends the deploy-churn reliability harness (real `wrangler deploy`s mid-turn, higher fidelity than the SIGKILL e2e) to cover sub-agent re-attach: - `DeployChurnSubAgentChild` — a long, recoverable `recordStep` ledger child (mirrors the proven packages/think `ThinkToolRollbackE2EAgent`). - A `"subagent"` harness mode on `DeployChurnAgent` that drives the child via `agentTool()` (natural stable-runId path), plus `configureSubAgentRun` / `getSubAgentStatus` RPCs. - `churn.ts --mode subagent` with a RE-ATTACHED-vs-AMPLIFIED verdict. Run against the deployed worker (2 real mid-loop deploys) it confirms the #1630 fix holds under real deploys — the child completed all 30 steps with 1 re-run (no amplification, no data loss) — and surfaced a deeper follow-up: an orchestrating parent that only `await`s a sub-agent makes no forward progress of its own, so its chat recovery can exhaust before the child finishes (parent collected `interrupted` despite the child completing). That's the plan's N1 "apply recovery to the sub-agent path" item, beyond this PR's re-attach scope. Co-authored-by: Cursor <cursoragent@cursor.com>

ask-bonk · 2026-06-01T12:12:11Z

@threepointone Bonk workflow was cancelled.

View workflow run · To retry, trigger Bonk again.

…trating parent's recovery progress (N9) (#1641) A parent turn whose work is "run a sub-agent and await its result" produced no recoverable content of its own, so under deploy churn the PARENT's own chat-recovery no-progress window could exhaust while the child was still healthily streaming — abandoning the turn as `interrupted` and collecting an interrupted result even though the child went on to complete. Reproduced by the `examples/deploy-churn --mode subagent` harness: the parent exhausted at `attempt 6/6` with `progress: 1` while the child self-healed all 30 steps. Root cause: `_forwardAgentToolStream` (base Agent) only relayed child chunks to the parent's connections; it never touched the parent's durable recovery progress marker (which lives in the chat layer, not base Agent). Fix: add a base-Agent seam `_onAgentToolStreamProgress()` (no-op by default), invoked by `_forwardAgentToolStream` once per read iteration that actually forwarded >=1 child chunk. Think and AIChatAgent override it to bump their recovery-progress marker (throttled in-memory to 5s, reset per isolate so the first forwarded chunk after any restart always credits). Covers both the live `runAgentTool` path and the recovery `_reconcileAgentToolRuns` re-attach path. The bump is best-effort (wrapped so a failure never breaks the child stream). Design guard: credit is granted only on a genuinely-forwarded chunk — a silent/hung child forwards nothing, so the parent still exhausts on its own no-progress timer and a stuck sub-agent can never pin recovery open forever. This completes the sub-agent recovery story started by #1630/#1640: the child self-heals and the parent both re-attaches to it AND keeps its own recovery alive while doing so. Validation: - Unit (think + ai-chat): forwarding a child stream advances the marker and resets the parent's attempt budget like in-band content; a silent child does not credit, so the cap still binds. - SIGKILL e2e: the natural-agentTool() task-amplification scenario stays BOUNDED and the parent collects `completed` (no regression to #1640). - Real-deploy `deploy-churn --mode subagent` (3 mid-loop deploys): verdict flipped NOT COLLECTED -> RE-ATTACHED; the parent's incident shows progress: 8, attempt 5/6 (did not exhaust). Also fixed a harness gap: reset() now clears the parent transcript so a reused session doesn't short-circuit the mock model. Co-authored-by: Cursor <cursoragent@cursor.com>

threepointone and others added 5 commits June 1, 2026 04:33

threepointone merged commit edb126a into main Jun 1, 2026
4 checks passed

threepointone deleted the fix/sub-agent-reattach-1630 branch June 1, 2026 12:13

threepointone mentioned this pull request Jun 1, 2026

fix(agents,think,ai-chat): count a sub-agent's progress as the orchestrating parent's recovery progress (N9) #1641

Merged

4 tasks

github-actions Bot mentioned this pull request Jun 1, 2026

Version Packages #1597

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(agents,think,ai-chat): re-attach to still-running sub-agent runs on parent recovery (#1630)#1640

fix(agents,think,ai-chat): re-attach to still-running sub-agent runs on parent recovery (#1630)#1640
threepointone merged 6 commits into
mainfrom
fix/sub-agent-reattach-1630

threepointone commented Jun 1, 2026 •

edited

Loading

Uh oh!

changeset-bot Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

threepointone commented Jun 1, 2026

Uh oh!

pkg-pr-new Bot commented Jun 1, 2026 •

edited

Loading

Uh oh!

ask-bonk Bot commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

threepointone commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Change

1. Stable child runId — agentTool() (agent-tools.ts)

2. Bounded re-attach (runAgentTool + _reconcileAgentToolRuns)

3. Durable child-run reconcile — think + ai-chat (transcript-based, symmetric)

4. interrupted is now a soft, repairable terminal (_updateAgentToolTerminal)

Edge cases handled (from review)

API / surface

Tests

Verification

Known follow-ups (tracked, not in this PR)

Test plan

Uh oh!

changeset-bot Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

threepointone commented Jun 1, 2026

Uh oh!

pkg-pr-new Bot commented Jun 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ask-bonk Bot commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

threepointone commented Jun 1, 2026 •

edited

Loading

1. Stable child runId — `agentTool()` (`agent-tools.ts`)

2. Bounded re-attach (`runAgentTool` + `_reconcileAgentToolRuns`)

4. `interrupted` is now a soft, repairable terminal (`_updateAgentToolTerminal`)

changeset-bot Bot commented Jun 1, 2026 •

edited

Loading

pkg-pr-new Bot commented Jun 1, 2026 •

edited

Loading