bug: sub-agent (agent-tool) runs are abandoned and re-run on parent recovery instead of re-attaching #1630

@threepointone

Summary

When a parent agent is interrupted (deploy / Durable Object eviction) while a child agent-tool run is still in flight, parent recovery marks the run interrupted within a ~5s budget and the parent re-issues the task — re-running the child's already-completed work. For long-running children (50–70s child turns observed in production) this is the common case under continuous deployment, and surfaces to users as "the agent went all the way back and lost the files it already wrote."

Surfaced while investigating a customer deploy-churn rollback report. (That customer's app turned out not to use sub-agents, so this is not their specific regression — but it is a genuine stock-Think bug, and it matches the production field-notes gap "a poison/long sub-agent blocks or is abandoned by its parent.")

Root cause (all in packages/agents/src/index.ts unless noted)

  1. Single, short reconciliation pass. On onStart, _scheduleAgentToolRunRecovery → _reconcileAgentToolRuns inspects each non-terminal child once within DEFAULT_AGENT_TOOL_RECOVERY_TIMEOUT_MS = 2000 / ..._TOTAL_TIMEOUT_MS = 5000 (~903). A child that is still running or starting after that window is finalized as interrupted (~8005):

    if (!inspection || inspection.status === "running" || inspection.status === "starting") {
      result = { runId: row.run_id, agentType: row.agent_type, status: "interrupted",
        error: "Agent tool run was still running, but live-tail reattachment is not supported in this runtime." };
    }

  2. The interrupted terminal write blocks later repair. _updateAgentToolTerminal (~7536) only updates rows whose status is NOT already terminal, so once a row is interrupted, a child that later completes (via its own chatRecovery) never repairs the parent row.

  3. Re-issue can't re-attach. agentTool() does not pass a runId (it only passes parentToolCallId: executeOptions?.toolCallId), so runAgentTool generates a fresh nanoid(12) → a brand-new child. Even with a stable runId (the documented "correct pattern"), runAgentTool on a duplicate non-terminal runId calls _replayAndInterruptAgentToolRun (~7840) — replays stored chunks then returns interrupted — rather than re-attaching to the live child.

  4. Lost in-memory promise. The original runAgentTool await is gone after a restart; nothing bridges it (recovery runs in ctx.waitUntil, not by resuming the old promise). The child, however, is a separate facet with its own chatRecovery, so it self-completes its interrupted turn — the parent just never collects that result.
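
To make item 2 concrete, here is a minimal in-memory sketch of a terminal-write guard. The RunRow shape, the TERMINAL set, and updateTerminal are illustrative stand-ins, not the actual _updateAgentToolTerminal implementation or its storage schema.

```typescript
// Hypothetical in-memory model of the parent's run table (not the real schema).
type RunStatus = "starting" | "running" | "completed" | "interrupted";
type TerminalStatus = "completed" | "interrupted";

interface RunRow {
  runId: string;
  status: RunStatus;
  output?: string;
}

const TERMINAL: ReadonlySet<RunStatus> = new Set<RunStatus>(["completed", "interrupted"]);

// Mirrors the guard described in item 2: only non-terminal rows are updated.
function updateTerminal(
  rows: Map<string, RunRow>,
  runId: string,
  status: TerminalStatus,
  output?: string
): boolean {
  const row = rows.get(runId);
  if (!row || TERMINAL.has(row.status)) return false; // write is dropped
  row.status = status;
  row.output = output;
  return true;
}

const rows = new Map<string, RunRow>([["run-1", { runId: "run-1", status: "running" }]]);

// Parent recovery gives up after its short budget...
const firstWrite = updateTerminal(rows, "run-1", "interrupted");
// ...then the child finishes on its own, but the repair is rejected.
const repairWrite = updateTerminal(rows, "run-1", "completed", "files written");

console.log(firstWrite, repairWrite, rows.get("run-1")?.status); // true false interrupted
```

In this model the child's real result is lost to the parent purely because of write ordering, which is why the proposed fix keeps the row non-terminal while re-attach is still possible.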

Net: the child's work is durable in the child facet (cf_agent_tool_child_runs in Think / cf_ai_chat_agent_tool_runs in ai-chat), but the parent abandons the run and the model re-runs it.

Why the existing e2e looks "BOUNDED"

packages/think/src/e2e-tests/task-amplification.test.ts stays BOUNDED only because its runTask hard-codes a stable CHILD_TASK_RUN_ID ("the correct pattern") AND the child finishes inside the recovery window. A natural agentTool() re-issue (fresh runId) — or any child longer than ~5s — defeats both conditions and amplifies.
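
A toy model of the run-ID half of that argument, assuming a child registry keyed by runId. IdStrategy, childrenAfterReissue, freshRunId, and stableRunId are hypothetical names for illustration, and the counter stands in for nanoid(12).

```typescript
// Hypothetical model; none of these names exist in the SDK.
type IdStrategy = () => string;

// Issue the same task twice (original run, then the re-issue after parent
// recovery) and count how many distinct children exist afterwards.
function childrenAfterReissue(nextRunId: IdStrategy): number {
  const children = new Map<string, string>(); // runId -> task
  for (let attempt = 0; attempt < 2; attempt++) {
    const runId = nextRunId();
    if (!children.has(runId)) children.set(runId, "write files");
  }
  return children.size;
}

let counter = 0;
const freshRunId: IdStrategy = () => `run-${++counter}`; // stand-in for nanoid(12)
const stableRunId: IdStrategy = () => "child-task-run"; // caller-supplied constant

const freshCount = childrenAfterReissue(freshRunId);
const stableCount = childrenAfterReissue(stableRunId);
console.log(freshCount, stableCount); // 2 1
```

A stable ID only prevents the duplicate child. As item 3 notes, a duplicate non-terminal runId currently still returns interrupted rather than the child's result, so the existing test also depends on the child finishing inside the recovery window.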

Proposed fix (bounded re-attach)

Make a still-running recoverable child re-attach to its real terminal result instead of being abandoned. Both halves are needed — (b) keeps the row non-terminal so (a)'s re-issue can re-attach:

  • (a) runAgentTool duplicate non-terminal runId → re-attach. Replace _replayAndInterruptAgentToolRun for the live-duplicate case with: resolve the child adapter, tail the live child to terminal (reuse _forwardAgentToolStream + inspectAgentToolRun + _terminalResultFromInspection + _finishAgentToolRun), return the real result. Fall back to interrupted only if there's no tailAgentToolRun adapter or it never reaches terminal. Makes the stable-runId pattern robust.
  • (b) Reconciliation re-attach. For a still-running child whose adapter supports tailAgentToolRun, tail to terminal within a generous bounded budget (new DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS, e.g. 120_000), or use a this.schedule() defer-and-poll loop mirroring _rescheduleRecoveryAfterStableTimeout from #1623 (fix(think,ai-chat,agents): harden recovery, transcript integrity & compaction under deploy churn), instead of marking the run interrupted at 5s. Mark it interrupted only once the budget is exhausted.
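
A rough sketch of the bounded re-attach idea, simplified to polling an inspection function rather than tailing a stream. reattachWithinBudget, ReattachOptions, and fakeClock are hypothetical; the inspect callback stands in for whatever the child adapter exposes, and the clock is injected so the example runs instantly.

```typescript
// All names here are hypothetical; `inspect` stands in for a child adapter call.
type ChildStatus = "starting" | "running" | "completed" | "interrupted";

interface Inspection {
  status: ChildStatus;
  output?: string;
}

interface ReattachOptions {
  budgetMs: number; // e.g. 120_000 in production, tiny in tests
  pollMs: number;
  now: () => number; // injectable clock
  sleep: (ms: number) => Promise<void>;
}

async function reattachWithinBudget(
  inspect: () => Promise<Inspection | null>,
  opts: ReattachOptions
): Promise<Inspection> {
  const deadline = opts.now() + opts.budgetMs;
  while (true) {
    const inspection = await inspect();
    if (inspection && inspection.status !== "running" && inspection.status !== "starting") {
      return inspection; // the child's real terminal result
    }
    if (opts.now() >= deadline) {
      return { status: "interrupted" }; // budget exhausted: give up, but late
    }
    await opts.sleep(opts.pollMs);
  }
}

// Fake clock: sleeping just advances time, so no real waiting happens.
function fakeClock() {
  let t = 0;
  return { now: () => t, sleep: async (ms: number) => { t += ms; } };
}

async function demo(): Promise<[Inspection, Inspection]> {
  let calls = 0;
  const finishing = async (): Promise<Inspection> =>
    ++calls < 3 ? { status: "running" } : { status: "completed", output: "done" };
  const hung = async (): Promise<Inspection> => ({ status: "running" });
  const a = await reattachWithinBudget(finishing, { budgetMs: 1000, pollMs: 100, ...fakeClock() });
  const b = await reattachWithinBudget(hung, { budgetMs: 1000, pollMs: 100, ...fakeClock() });
  return [a, b];
}
```

Here demo() resolves to a completed result for the child that finishes on its third inspection and to interrupted for the hung child, which is the behavior split the two reconcile unit tests would need to cover.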

Open design question: generous-bounded re-attach (e.g. 120s then interrupted) vs. a stricter cap; and whether the deferred poll should consume the same wall-clock bound as chat recovery.
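
One way to frame the wall-clock question is as a pure decision function evaluated on each deferred poll. This is only a sketch: DeferredReattach, PollDecision, and decideNextPoll are hypothetical, and it assumes the start time and budget are persisted with the run so the bound survives further restarts.

```typescript
// Hypothetical state persisted between scheduled polls (not the SDK's schema).
interface DeferredReattach {
  runId: string;
  startedAtMs: number; // when recovery first saw the still-running child
  budgetMs: number; // total wall-clock bound for the whole poll loop
}

type PollDecision =
  | { action: "finish" }
  | { action: "poll-again"; delayMs: number }
  | { action: "interrupt" };

function decideNextPoll(
  state: DeferredReattach,
  childStatus: "starting" | "running" | "completed",
  nowMs: number,
  pollMs: number
): PollDecision {
  // A completed child is always collected, even if the deadline has passed.
  if (childStatus === "completed") return { action: "finish" };
  const remainingMs = state.startedAtMs + state.budgetMs - nowMs;
  if (remainingMs <= 0) return { action: "interrupt" };
  // Never schedule the next poll past the deadline.
  return { action: "poll-again", delayMs: Math.min(pollMs, remainingMs) };
}

const state: DeferredReattach = { runId: "run-1", startedAtMs: 0, budgetMs: 120_000 };
const midway = decideNextPoll(state, "running", 60_000, 5_000);
const nearEnd = decideNextPoll(state, "running", 118_000, 5_000);
const expired = decideNextPoll(state, "running", 120_000, 5_000);
const lateFinish = decideNextPoll(state, "completed", 130_000, 5_000);
```

Sharing the bound with chat recovery would amount to both loops reading the same startedAtMs and budgetMs; a separate bound would give each its own state.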

Scope / risk

  • Changes agent-tool recovery timing semantics — re-attach can run for the child's lifetime inside waitUntil, so it MUST stay bounded so a genuinely hung child can't block forever.
  • Reworks the tests that lock the current behavior:
    • reconcileRunningThinkChildForTest (in packages/think/src/tests/agent-tools.test.ts and packages/ai-chat/src/tests/agent-tools.test.ts) currently asserts status: "interrupted" with the "live-tail reattachment is not supported" message. New expectation: a child that completes during reattach → completed.
    • reconcileStuckThinkChildWithTimeoutForTest — asserts elapsedMs < 1000; a genuinely stuck child must still end interrupted but after the bounded budget (thread a small timeout through the test seam so the test stays fast).
  • Think and ai-chat both implement tailAgentToolRun and child chatRecovery, so both can self-complete and be tailed.
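
A sketch of how a small timeout could be threaded through a test seam so the stuck-child test stays fast. reconcileWithBudget and stuckChildTest are hypothetical helpers, not the existing reconcileStuckThinkChildWithTimeoutForTest seam; the work promise is raced against a timer that is cleared once the work settles.

```typescript
type Result = { status: "completed" | "interrupted" };

// Hypothetical seam: production would pass a large budget, tests a tiny one.
function reconcileWithBudget<T>(
  work: Promise<T>,
  budgetMs: number,
  onTimeout: () => T
): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    // If the timer fires first, the run is settled with the timeout result.
    const timer = setTimeout(() => resolve(onTimeout()), budgetMs);
    work.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (err) => { clearTimeout(timer); reject(err); }
    );
  });
}

async function stuckChildTest(): Promise<{ status: string; elapsedMs: number }> {
  const stuck = new Promise<Result>(() => {}); // never settles, like a hung child
  const started = Date.now();
  const result = await reconcileWithBudget<Result>(stuck, 20, () => ({ status: "interrupted" }));
  return { status: result.status, elapsedMs: Date.now() - started };
}

async function completingChildTest(): Promise<string> {
  const done = Promise.resolve<Result>({ status: "completed" });
  const result = await reconcileWithBudget<Result>(done, 20, () => ({ status: "interrupted" }));
  return result.status;
}
```

With a 20 ms budget the stuck case still ends interrupted well under the existing elapsedMs < 1000 assertion, while a child that settles first reports its real status.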

TDD plan

  1. e2e repro (the gate): a variant of task-amplification where runTask uses a natural fresh runId (the agentTool() path) and/or a child longer than the recovery window → assert it currently AMPLIFIES, BOUNDED after the fix.
  2. Implement (a), then (b); rework the two reconcile unit tests; add a "child completes during reattach → parent row completed" unit test.
  3. Confirm the existing task-amplification (stable-runId) test is still BOUNDED.

References

  • packages/agents/src/index.ts: runAgentTool (~7102), _replayAndInterruptAgentToolRun (~7840), _reconcileAgentToolRuns (~7946) + still-running branch (~8005), _inspectAgentToolRunForRecovery (~8068), _scheduleAgentToolRunRecovery (~8094), _finishAgentToolRun (~7491), _updateAgentToolTerminal (~7536), _forwardAgentToolStream (~7671), constants (~903).
  • Child adapters: packages/think/src/think.ts (cf_agent_tool_child_runs, tailAgentToolRun, child chatRecovery), packages/ai-chat/src/index.ts (cf_ai_chat_agent_tool_runs, force-terminal-on-inspect when no abort controller).
  • Tests: packages/think/src/tests/agent-tools.test.ts, packages/ai-chat/src/tests/agent-tools.test.ts, packages/think/src/e2e-tests/task-amplification.test.ts.
  • Related: fix(think,ai-chat,agents): harden recovery, transcript integrity & compaction under deploy churn #1623 (recovery hardening; its AgentToolFailure.retryable change, interrupted → retryable: true, is the parent-facing half of this), and the field-notes gap on synchronous child agents.
