Skip to content

fix(agents,think,ai-chat): re-attach to still-running sub-agent runs on parent recovery (#1630)#1640

Merged
threepointone merged 6 commits into
mainfrom
fix/sub-agent-reattach-1630
Jun 1, 2026
Merged

fix(agents,think,ai-chat): re-attach to still-running sub-agent runs on parent recovery (#1630)#1640
threepointone merged 6 commits into
mainfrom
fix/sub-agent-reattach-1630

Conversation

@threepointone
Copy link
Copy Markdown
Contributor

@threepointone threepointone commented Jun 1, 2026

Closes #1630.

Problem

When a parent agent is interrupted (deploy / Durable Object eviction) while a child agentTool() run is still in flight, parent recovery sealed the run interrupted within a ~5s window and the parent re-issued the task — re-running the child's already-completed work. For long-running children under continuous deploys this surfaced to users as "the agent went all the way back and lost the files it already wrote."

Root causes (all in packages/agents/src/index.ts unless noted):

  1. Fresh runId per call. agentTool() minted a fresh nanoid per call, so a turn re-run by chat recovery spawned a brand-new child instead of re-using the in-flight one.
  2. Abandon, don't re-attach. A duplicate non-terminal run (runAgentTool) and a still-running child during startup reconciliation were sealed interrupted at ~5s instead of being tailed to their real terminal result.
  3. No durable child-run reconcile. A child facet self-heals its interrupted turn via its own chatRecovery, but that path never wrote the child's agent-tool run row, so the row stranded running (think) / was force-errored (ai-chat) — the parent could never collect the recovered result.
  4. interrupted was an immutable terminal. Once sealed interrupted, a later child completion could never repair the parent row, so a re-issue returned stale interrupted and the model retried with a new toolCallId → fresh child → re-ran work.

Change

1. Stable child runId — agentTool() (agent-tools.ts)

Derive the child runId from the (recovery-preserved) tool call id: agent-tool:<toolCallId> instead of a fresh nanoid. A turn re-run by recovery resolves to the same idempotent child facet, so completed child work is never re-run. Falls back to a fresh id only when there's no tool call id.

2. Bounded re-attach (runAgentTool + _reconcileAgentToolRuns)

A duplicate non-terminal runId and a still-running child during startup reconciliation now tail the live child to its real terminal result and collect it, instead of immediately sealing interrupted. Bounded by a generous internal wall-clock budget DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS (120s), consistent with the wall-clock-dominant chat-recovery budget (#1638): a child that keeps advancing toward terminal within the window is collected; a genuinely hung child still seals interrupted so recovery can never block forever.

_reconcileAgentToolRuns is a two-pass sweep: a deadline-bounded inspect/classify, then parallel re-attach of still-running children (each with its own budget) — so one slow/hung child can't starve every later sibling against the shared deadline.

3. Durable child-run reconcile — think + ai-chat (transcript-based, symmetric)

Both packages reconcile a stale child-run row (running, no live abort controller = the original isolate is gone) from the durable transcript on inspectAgentToolRun, gated on recovery state via _classifyAgentToolChildRecovery() (lists chat-recovery incidents on the 1:1 child facet):

  • recovery in-progress → keep running so the parent's bounded re-attach keeps waiting;
  • settled with an assistant turn (text or tool-only) → completed (the parent collects the real result);
  • failed/empty → error.

This keeps the child's own (working) saveMessages recovery path untouched.

Why not the durable submission registry? We first tried running the child via submitMessages({ submissionId: runId }) so the durable submission row would be the recovery-reconciled terminal marker. The unit suite stayed green, but the real-eviction e2e caught a regression: _recoverSubmissionsOnStart's "error a running submission whose messages were applied" policy fights the chat-recovery fiber resuming a multi-restart, multi-step tool loop — the child stalled at 10/30 steps and the parent collected error. Submissions are a clean single-turn durable primitive but the wrong execution path for resumable agent-tool loops. Reverted in favor of the transcript reconcile, which never touches the recovery path. (Tracked as base-API follow-ups N7/N8 in the working doc.)

4. interrupted is now a soft, repairable terminal (_updateAgentToolTerminal)

_updateAgentToolTerminal no longer excludes interrupted from its overwrite guard (only completed/error/aborted are hard/immutable), and runAgentTool routes an existing interrupted run through the re-attach path. So a re-issue (or later child completion) repairs the row to the child's real result instead of returning stale interrupted — implements #1630's "fix _updateAgentToolTerminal so a later child completion can repair the parent row."

Edge cases handled (from review)

  • Multi-child starvation — parallel re-attach (above).
  • Text-less completed runs — collected as completed when the recovered turn produced any assistant message (text or tool-only), not only when there's final text.
  • Chunk re-replay — re-attach tails from the child's current max chunk sequence, not -1, so it forwards only new chunks (a reconnected client already replays stored chunks via _replayAgentToolRuns; the client reducer appends by arrival order).
  • Hung-child reader — on budget-abort the tail reader is abandoned (a pre-caught read), not cancelled: cancelling a remote child-facet RPC stream surfaces an unswallowable "Stream was cancelled" pump rejection (verified). The hold is bounded — the child reaches terminal within its own chat-recovery ceiling, firing the tail's closer which releases the reader.
  • No premature terminalize_checkRunFibers() (creates the recovery incident) is awaited in onStart before the facet serves RPC, so a stale-row inspect always sees the in-progress incident.

API / surface

  • No new public configuration (budget is an internal constant; the existing maxAttempts stays the one knob).
  • Adds an internal agent_tool:recovery:reattach observability event.
  • agentTool() runId derivation and the re-attach behavior are internal.

Tests

  • think unit (agent-tools.test.ts): reworked still-running → completed via re-attach; new bounded-tail-able-stuck → interrupted after budget; new parallel reconcile (hung child can't starve a sibling); new interrupted-repair (re-issue collects completed). Plus the agents test that agentTool() derives a stable runId.
  • ai-chat unit: the reconcile classify/completed/error/in-progress paths.
  • Real-eviction e2e (task-amplification.test.ts): drives both the hand-picked stable-runId parent and the natural agentTool() parent through real kill/restart churn, asserting the child reaches all 30 steps and the parent's run row settles completed (parentChildStatus). Both scenarios pass.
  • Real-deploy harness (examples/deploy-churn, operator-run): a new --mode subagent drives a child via agentTool() and fires real wrangler deploys mid-child-loop. Run with 2 mid-loop deploys, it confirms the fix holds under real deploys — the child completed all 30 steps with 1 re-run (no amplification, no data loss). It also surfaced the follow-up below.

Verification

  • tsc clean (agents + think + ai-chat); oxlint + oxfmt clean.
  • Full suites green: think 458, ai-chat 480; agents agent-tool tests green.
  • Both real-eviction e2e scenarios pass (~224s).

Known follow-ups (tracked, not in this PR)

  • Orchestrating-parent recovery progress (surfaced by the real-deploy harness). A parent turn that merely awaits a sub-agent makes no forward progress of its own, so under heavy deploy churn the parent's own chat recovery can exhaust (stable_timeout, progress: 1) before the child finishes — the parent then collects interrupted even though the child completed (its work preserved, no amplification). Fix (= plan's N1 "apply recovery to the sub-agent path"): count the child's forwarded progress as parent progress, and/or re-attach a still-recoverable interrupted row on a later restart. Beyond this PR's re-attach scope → separate PR.
  • ai-chat real-eviction e2e — blocked on building a deterministic, LLM-free, streamText-compatible mock child that completes after recovery (an AI-SDK harness problem orthogonal to this change; ai-chat's reconcile is unit-validated).
  • N7 — durable submissions can double-recover / prematurely error a chatRecovery turn under churn (the latent base-API bug behind the reverted submission experiment).
  • N8 — unify the durable "turn status/result" record (transcript vs deleted-on-success incident vs submissions).

Test plan

  • CI green
  • Review the re-attach budget (120s) and the agent-tool:<toolCallId> runId derivation
  • Validate against examples/deploy-churn + the customer chaos harness

Made with Cursor

threepointone and others added 5 commits June 1, 2026 04:33
…on parent recovery (#1630)

When a parent agent was evicted (deploy / DO reset) while a child agentTool()
run was in flight, recovery sealed the run `interrupted` within ~5s and the
parent re-issued the task — re-running the child's already-completed work
("the agent went all the way back and lost the files it wrote").

- Stable child runId: agentTool() now derives `agent-tool:<toolCallId>` from
  the recovery-preserved tool call id instead of a fresh nanoid, so a turn
  re-run by recovery resolves to the SAME idempotent child facet rather than
  spawning a new one (the primary amplification fix).
- Bounded re-attach: a duplicate non-terminal runId (runAgentTool) and a
  still-running child during startup reconciliation now tail the live child to
  its real terminal result, bounded by DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS
  (120s). A hung child still seals `interrupted` after the budget so recovery
  can never block forever.
- think + ai-chat child tails are now read-only on consumer detach: a parent's
  re-attach budget expiring cancels only the read view, never the still-running
  child (so it keeps advancing toward its own terminal for a later collect).

Adds an internal `agent_tool:recovery:reattach` observability event; no new
public config. Reworked the reconcile unit tests (running child -> completed;
new bounded-tail-able-stuck -> interrupted after budget), a stable-runId test,
and a natural-agentTool() task-amplification e2e variant.

Co-authored-by: Cursor <cursoragent@cursor.com>
… eviction (#1630)

A child facet self-heals its interrupted agent-tool turn via its own
chatRecovery, but that path never wrote the child's run row — so after a real
eviction the row stranded `running` (think) / was force-errored (ai-chat) and
the parent could only ever collect `interrupted`/`error`, never the recovered
result.

Both packages now reconcile a stale child-run row (running, no live abort
controller = original isolate gone) from the durable transcript on
inspectAgentToolRun, gated on recovery state via _classifyAgentToolChildRecovery
(lists chat-recovery incidents on the 1:1 child facet): in-progress -> keep
`running` so the parent's bounded re-attach keeps waiting; settled with a
completed assistant response -> `completed`; failed/empty -> `error`. This keeps
the child's own (working) saveMessages recovery path untouched.

Validated by a real-eviction e2e: task-amplification now drives the natural
agentTool() path through kill/restart churn and asserts the child reaches all
30 steps AND the parent's run row settles `completed`.

Note: routing the child through the durable submission registry (submitMessages)
was tried first and reverted — the e2e showed it regresses multi-restart,
multi-step tool-loop recovery (_recoverSubmissionsOnStart errors a running
submission whose messages were applied, fighting the chat-recovery fiber). The
transcript reconcile avoids touching the recovery path entirely.

Co-authored-by: Cursor <cursoragent@cursor.com>
…t-less recovered runs (#1630)

Review hardening for the sub-agent recovery work:

- Reconcile no longer starves siblings. _reconcileAgentToolRuns is now a
  two-pass sweep: a deadline-bounded inspect/classify, then re-attach of
  still-running children IN PARALLEL, each bounded by its own re-attach budget.
  Previously the shared total-recovery deadline was consumed by the first
  child's (up to 120s) re-attach, so a slow/hung child caused every later
  sibling to be abandoned `interrupted` without an attempt.

- A settled recovery that produced an assistant turn is now collected as
  `completed` even when the turn ended on a tool result with no final text —
  keying off text alone (think _getAgentToolFinalText / ai-chat
  _extractLatestAssistantText) mis-sealed a legitimately-finished but text-less
  run as `error`. getAgentToolSummary still falls back to "".

New reconcileParallelThinkChildrenForTest asserts a hung child (started first)
no longer starves a fast sibling. Full think (457) + ai-chat (480) suites green;
both real-eviction e2e scenarios pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
…#1630)

Addresses the three deferred review items:

(a) Re-attach now tails from the child's CURRENT max chunk sequence instead of
    afterSequence: -1. The client reducer appends chunks by arrival order
    (ignoring sequence/replay), and a reconnected client already has the stored
    chunks via _replayAgentToolRuns — so replaying them on re-attach duplicated
    the run's parts on a connected client under repeated re-attach. Forwarding
    only genuinely-new chunks keeps the live stream correct without dupes.

(b) Documented that the tail reader is deliberately abandoned (not cancelled) on
    budget-abort: cancelling a remote child-facet RPC stream surfaces an
    unswallowable "Stream was cancelled" pump rejection (verified). The hold is
    already bounded — the child reaches terminal within its own chat-recovery
    ceiling, firing the tail's registered closer which releases the reader.

(c) Extracted the inspect reconcile-persist into a named private helper
    (_reconcileStaleAgentToolChildRun) in both packages so read vs reconcile are
    separated. The persist is retained intentionally (enables prompt tail-close +
    cheap subsequent inspects) and documented as lazy materialization of the
    run's true terminal.

think 457 / ai-chat 480 suites green; both real-eviction e2e scenarios pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
Deep-review finding: once a run was sealed `interrupted` (e.g. a reconcile
exhausted its re-attach budget before a slow child finished), the parent could
never recover it — `_updateAgentToolTerminal` excluded `interrupted` from its
overwrite guard, and `runAgentTool` returned the cached row for any terminal
status. So a re-issue (stable runId) got stale `interrupted`, the model saw a
retryable failure, and retried with a NEW toolCallId → fresh child → re-ran the
already-completed work, re-introducing amplification for slow children. #1630
explicitly called for `_updateAgentToolTerminal` to let a later child completion
repair the parent row.

- `_updateAgentToolTerminal` now overwrites `interrupted` (soft terminal); only
  completed/error/aborted are hard/immutable.
- `runAgentTool` routes an existing `interrupted` run through the re-attach path
  (like a non-terminal run) instead of returning the cached interrupted, so a
  re-issue re-attaches and repairs the row to the child's real result.

Also: corrected a stale comment in the reconcile defer branch (re-attach now
tails from the child's max chunk sequence, not -1). New
reissueInterruptedThinkChildForTest locks in the repair. think 458 / ai-chat 480
green; both e2e scenarios pass.

Co-authored-by: Cursor <cursoragent@cursor.com>
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented Jun 1, 2026

🦋 Changeset detected

Latest commit: 965f60a

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
agents Minor
@cloudflare/think Patch
@cloudflare/ai-chat Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@threepointone
Copy link
Copy Markdown
Contributor Author

/bonk do a deep review, look for edge cases, is the ux/dx good

@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented Jun 1, 2026

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1640

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1640

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1640

hono-agents

npm i https://pkg.pr.new/hono-agents@1640

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1640

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1640

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1640

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1640

commit: 965f60a

…oys (#1630)

Extends the deploy-churn reliability harness (real `wrangler deploy`s mid-turn,
higher fidelity than the SIGKILL e2e) to cover sub-agent re-attach:

- `DeployChurnSubAgentChild` — a long, recoverable `recordStep` ledger child
  (mirrors the proven packages/think `ThinkToolRollbackE2EAgent`).
- A `"subagent"` harness mode on `DeployChurnAgent` that drives the child via
  `agentTool()` (natural stable-runId path), plus `configureSubAgentRun` /
  `getSubAgentStatus` RPCs.
- `churn.ts --mode subagent` with a RE-ATTACHED-vs-AMPLIFIED verdict.

Run against the deployed worker (2 real mid-loop deploys) it confirms the #1630
fix holds under real deploys — the child completed all 30 steps with 1 re-run
(no amplification, no data loss) — and surfaced a deeper follow-up: an
orchestrating parent that only `await`s a sub-agent makes no forward progress of
its own, so its chat recovery can exhaust before the child finishes (parent
collected `interrupted` despite the child completing). That's the plan's N1
"apply recovery to the sub-agent path" item, beyond this PR's re-attach scope.

Co-authored-by: Cursor <cursoragent@cursor.com>
@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Jun 1, 2026

@threepointone Bonk workflow was cancelled.

View workflow run · To retry, trigger Bonk again.

@threepointone threepointone merged commit edb126a into main Jun 1, 2026
4 checks passed
@threepointone threepointone deleted the fix/sub-agent-reattach-1630 branch June 1, 2026 12:13
threepointone added a commit that referenced this pull request Jun 1, 2026
…trating parent's recovery progress (N9) (#1641)

A parent turn whose work is "run a sub-agent and await its result" produced
no recoverable content of its own, so under deploy churn the PARENT's own
chat-recovery no-progress window could exhaust while the child was still
healthily streaming — abandoning the turn as `interrupted` and collecting an
interrupted result even though the child went on to complete. Reproduced by
the `examples/deploy-churn --mode subagent` harness: the parent exhausted at
`attempt 6/6` with `progress: 1` while the child self-healed all 30 steps.

Root cause: `_forwardAgentToolStream` (base Agent) only relayed child chunks
to the parent's connections; it never touched the parent's durable recovery
progress marker (which lives in the chat layer, not base Agent).

Fix: add a base-Agent seam `_onAgentToolStreamProgress()` (no-op by default),
invoked by `_forwardAgentToolStream` once per read iteration that actually
forwarded >=1 child chunk. Think and AIChatAgent override it to bump their
recovery-progress marker (throttled in-memory to 5s, reset per isolate so the
first forwarded chunk after any restart always credits). Covers both the live
`runAgentTool` path and the recovery `_reconcileAgentToolRuns` re-attach path.
The bump is best-effort (wrapped so a failure never breaks the child stream).

Design guard: credit is granted only on a genuinely-forwarded chunk — a
silent/hung child forwards nothing, so the parent still exhausts on its own
no-progress timer and a stuck sub-agent can never pin recovery open forever.

This completes the sub-agent recovery story started by #1630/#1640: the child
self-heals and the parent both re-attaches to it AND keeps its own recovery
alive while doing so.

Validation:
- Unit (think + ai-chat): forwarding a child stream advances the marker and
  resets the parent's attempt budget like in-band content; a silent child does
  not credit, so the cap still binds.
- SIGKILL e2e: the natural-agentTool() task-amplification scenario stays
  BOUNDED and the parent collects `completed` (no regression to #1640).
- Real-deploy `deploy-churn --mode subagent` (3 mid-loop deploys): verdict
  flipped NOT COLLECTED -> RE-ATTACHED; the parent's incident shows progress: 8,
  attempt 5/6 (did not exhaust). Also fixed a harness gap: reset() now clears
  the parent transcript so a reused session doesn't short-circuit the mock model.

Co-authored-by: Cursor <cursoragent@cursor.com>
@github-actions github-actions Bot mentioned this pull request Jun 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: sub-agent (agent-tool) runs are abandoned and re-run on parent recovery instead of re-attaching

1 participant