When a parent agent is interrupted (deploy / Durable Object eviction) while a child agent-tool run is still in flight, parent recovery marks the run interrupted within a ~5s budget and the parent re-issues the task — re-running the child's already-completed work. For long-running children (50–70s child turns observed in production) this is the common case under continuous deployment, and surfaces to users as "the agent went all the way back and lost the files it already wrote."
Surfaced while investigating a customer deploy-churn rollback report. (That customer's app turned out not to use sub-agents, so this is not their specific regression — but it is a genuine stock-Think bug, and it matches the production field-notes gap "a poison/long sub-agent blocks or is abandoned by its parent.")
Root cause (all in packages/agents/src/index.ts unless noted)
Single, short reconciliation pass. On onStart, _scheduleAgentToolRunRecovery → _reconcileAgentToolRuns inspects each non-terminal child once within DEFAULT_AGENT_TOOL_RECOVERY_TIMEOUT_MS = 2000 / ..._TOTAL_TIMEOUT_MS = 5000 (~903). A child still running/starting after that window is finalized interrupted (~8005):
    if (
      !inspection ||
      inspection.status === "running" ||
      inspection.status === "starting"
    ) {
      result = {
        runId: row.run_id,
        agentType: row.agent_type,
        status: "interrupted",
        error:
          "Agent tool run was still running, but live-tail reattachment is not supported in this runtime."
      };
    }
The interrupted terminal write blocks later repair. _updateAgentToolTerminal (~7536) only updates rows whose status is NOT already terminal, so once a row is interrupted, a child that later completes (via its own chatRecovery) never repairs the parent row.
Re-issue can't re-attach. agentTool() does not pass a runId (it only passes parentToolCallId: executeOptions?.toolCallId), so runAgentTool generates a fresh nanoid(12) → a brand-new child. Even with a stable runId (the documented "correct pattern"), runAgentTool on a duplicate non-terminal runId calls _replayAndInterruptAgentToolRun (~7840), which replays stored chunks and then returns interrupted, rather than re-attaching to the live child.
Lost in-memory promise. The original runAgentTool await is gone after a restart; nothing bridges it (recovery runs in ctx.waitUntil, not by resuming the old promise). The child, however, is a separate facet with its own chatRecovery, so it self-completes its interrupted turn — the parent just never collects that result.
Net: the child's work is durable in the child facet (cf_agent_tool_child_runs in Think / cf_ai_chat_agent_tool_runs in ai-chat), but the parent abandons the run and the model re-runs it.
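The sticky-terminal behaviour described above can be illustrated with a minimal sketch. It assumes an in-memory map in place of the real run table; `updateTerminal`, `rows`, and the status set are hypothetical stand-ins, not the actual `_updateAgentToolTerminal` implementation.

```typescript
// Sketch only: an in-memory stand-in for the parent's run rows (hypothetical).
type RunStatus = "starting" | "running" | "completed" | "interrupted";

// Assumption: these two statuses are treated as terminal.
const TERMINAL: ReadonlySet<RunStatus> = new Set<RunStatus>([
  "completed",
  "interrupted"
]);

const rows = new Map<string, RunStatus>();

// Mirrors the guard described above: only non-terminal rows are updated.
function updateTerminal(runId: string, next: RunStatus): boolean {
  const current = rows.get(runId);
  if (current === undefined || TERMINAL.has(current)) {
    return false; // unknown or already-terminal row: the write is dropped
  }
  rows.set(runId, next);
  return true;
}

rows.set("run-1", "running");
// Parent recovery gives up after its short window.
const first = updateTerminal("run-1", "interrupted");
// The child finishes later, but the row is already terminal.
const second = updateTerminal("run-1", "completed");
```

Under these assumptions the first write succeeds and the second is dropped, so the row stays interrupted even though the child finished its work.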
Why the existing e2e looks "BOUNDED"
packages/think/src/e2e-tests/task-amplification.test.ts stays BOUNDED only because its runTask hard-codes a stable CHILD_TASK_RUN_ID ("the correct pattern") AND the child finishes inside the recovery window. Either a natural agentTool() re-issue (fresh runId) or any child longer than ~5s removes one of those conditions, and the run amplifies.
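A small sketch of why the run id matters for amplification. `ChildRegistry` and `startOrFind` are hypothetical names, and a counter stands in for nanoid(12); this only models run identity, not the replay-and-interrupt behaviour of the real runAgentTool.

```typescript
// Sketch only. The counter is a stand-in for nanoid(12).
let counter = 0;
const freshRunId = (): string => `run-${++counter}`;

class ChildRegistry {
  private readonly children = new Map<string, { task: string }>();

  // Returns the run id plus whether a new child had to be created.
  startOrFind(
    task: string,
    runId: string = freshRunId()
  ): { runId: string; created: boolean } {
    if (this.children.has(runId)) {
      return { runId, created: false };
    }
    this.children.set(runId, { task });
    return { runId, created: true };
  }

  get size(): number {
    return this.children.size;
  }
}

// Natural re-issue: no runId supplied, so each call mints a new child.
const natural = new ChildRegistry();
natural.startOrFind("write files");
natural.startOrFind("write files");

// Stable runId: the re-issue resolves to the child that already exists.
const stable = new ChildRegistry();
stable.startOrFind("write files", "CHILD_TASK_RUN_ID");
const reissue = stable.startOrFind("write files", "CHILD_TASK_RUN_ID");
```

In this model the natural path ends with two children for one task, while the stable path ends with one. Whether the parent then collects that child's result is a separate question, which is what the proposed fix addresses.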
Proposed fix (bounded re-attach)
Make a still-running recoverable child re-attach to its real terminal result instead of being abandoned. Both halves are needed — (b) keeps the row non-terminal so (a)'s re-issue can re-attach:
(a) runAgentTool duplicate non-terminal runId → re-attach. Replace _replayAndInterruptAgentToolRun for the live-duplicate case with: resolve the child adapter, tail the live child to terminal (reuse _forwardAgentToolStream + inspectAgentToolRun + _terminalResultFromInspection + _finishAgentToolRun), return the real result. Fall back to interrupted only if there's no tailAgentToolRun adapter or it never reaches terminal. Makes the stable-runId pattern robust.
(b) Reconciliation re-attach. For a still-running child whose adapter supports tailAgentToolRun, tail-to-terminal within a generous bounded budget (new DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS, e.g. 120_000) — or a this.schedule() defer-and-poll loop mirroring _rescheduleRecoveryAfterStableTimeout from fix(think,ai-chat,agents): harden recovery, transcript integrity & compaction under deploy churn #1623 — instead of marking interrupted at 5s. Only interrupted once the budget is exhausted.
Open design question: generous-bounded re-attach (e.g. 120s then interrupted) vs. a stricter cap; and whether the deferred poll should consume the same wall-clock bound as chat recovery.
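One way to picture the defer-and-poll variant of (b) is as a pure decision function evaluated on each poll. This is a sketch under stated assumptions, not the proposed implementation: `nextReattachStep`, `Inspection`, `RunResult`, `ReattachStep`, and `POLL_INTERVAL_MS` are hypothetical, and 120_000 is only the example budget floated above. A missing inspection is finalized immediately here, mirroring the existing `!inspection` branch.

```typescript
// Sketch only. Apart from the status strings and the constant name taken from
// the proposal, every name here is hypothetical.
type ChildStatus = "starting" | "running" | "completed" | "interrupted";

interface Inspection {
  status: ChildStatus;
  output: string | null; // hypothetical payload field
}

interface RunResult {
  status: "completed" | "interrupted";
  output: string | null;
  error: string | null;
}

type ReattachStep =
  | { kind: "poll-again"; delayMs: number }
  | { kind: "finalize"; result: RunResult };

const DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS = 120_000; // example value from the proposal
const POLL_INTERVAL_MS = 1_000; // assumption, not from the issue

function nextReattachStep(
  inspection: Inspection | null,
  elapsedMs: number,
  budgetMs: number = DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS
): ReattachStep {
  const interrupted = (error: string): ReattachStep => ({
    kind: "finalize",
    result: { status: "interrupted", output: null, error }
  });

  if (!inspection) return interrupted("Child could not be inspected.");
  if (inspection.status === "completed") {
    return {
      kind: "finalize",
      result: { status: "completed", output: inspection.output, error: null }
    };
  }
  if (inspection.status === "interrupted") {
    return interrupted("Child reported interrupted.");
  }
  // Still starting or running: keep waiting, but only inside the budget.
  if (elapsedMs >= budgetMs) return interrupted("Re-attach budget exhausted.");
  return {
    kind: "poll-again",
    delayMs: Math.min(POLL_INTERVAL_MS, budgetMs - elapsedMs)
  };
}

// A child still running past the old 5s window is polled again, not abandoned.
const early = nextReattachStep({ status: "running", output: null }, 6_000);
// A tiny budget (as a test seam might pass) ends a stuck child quickly.
const stuck = nextReattachStep({ status: "running", output: null }, 50, 50);
```

Keeping the decision separate from the timer makes the budget easy to thread through tests: a stuck child still ends interrupted, but the test controls how long that takes.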
Scope / risk
Changes agent-tool recovery timing semantics — re-attach can run for the child's lifetime inside waitUntil, so it MUST stay bounded so a genuinely hung child can't block forever.
Reworks the tests that lock the current behavior:
reconcileRunningThinkChildForTest (in packages/think/src/tests/agent-tools.test.ts and packages/ai-chat/src/tests/agent-tools.test.ts) — currently assert status: "interrupted" with the "live-tail reattachment is not supported" message. New expectation: a child that completes during reattach → completed.
reconcileStuckThinkChildWithTimeoutForTest — asserts elapsedMs < 1000; a genuinely stuck child must still end interrupted but after the bounded budget (thread a small timeout through the test seam so the test stays fast).
Think and ai-chat both implement tailAgentToolRun and child chatRecovery, so both can self-complete and be tailed.
TDD plan
e2e repro (the gate): a variant of task-amplification where runTask uses a natural fresh runId (the agentTool() path) and/or a child longer than the recovery window → assert it currently AMPLIFIES, BOUNDED after the fix.
Implement (a), then (b); rework the two reconcile unit tests; add a "child completes during reattach → parent row completed" unit test.
Confirm the existing task-amplification (stable-runId) test still BOUNDED.
References
packages/agents/src/index.ts: runAgentTool (~7102), _replayAndInterruptAgentToolRun (~7840), _reconcileAgentToolRuns (~7946) + still-running branch (~8005), _inspectAgentToolRunForRecovery (~8068), _scheduleAgentToolRunRecovery (~8094), _finishAgentToolRun (~7491), _updateAgentToolTerminal (~7536), _forwardAgentToolStream (~7671), constants (~903).
packages/think/src/think.ts (cf_agent_tool_child_runs, tailAgentToolRun, child chatRecovery), packages/ai-chat/src/index.ts (cf_ai_chat_agent_tool_runs, force-terminal-on-inspect when no abort controller).
packages/think/src/tests/agent-tools.test.ts, packages/ai-chat/src/tests/agent-tools.test.ts, packages/think/src/e2e-tests/task-amplification.test.ts.
AgentToolFailure.retryable, i.e. interrupted → retryable: true, is the parent-facing half of this), and the field-notes gap on synchronous child agents.