bug: sub-agent (agent-tool) runs are abandoned and re-run on parent recovery instead of re-attaching

## Summary

When a parent agent is interrupted (deploy / Durable Object eviction) **while a child agent-tool run is still in flight**, parent recovery marks the run `interrupted` within a ~5s budget and the parent re-issues the task — **re-running the child's already-completed work**. For long-running children (50–70s child turns observed in production) this is the common case under continuous deployment, and surfaces to users as "the agent went all the way back and lost the files it already wrote."

Surfaced while investigating a customer deploy-churn rollback report. (That customer's app turned out not to use sub-agents, so this is not their specific regression — but it is a genuine stock-Think bug, and it matches the production field-notes gap "a poison/long sub-agent blocks or is abandoned by its parent.")

## Root cause (all in `packages/agents/src/index.ts` unless noted)

1. **Single, short reconciliation pass.** On `onStart`, `_scheduleAgentToolRunRecovery` → `_reconcileAgentToolRuns` inspects each non-terminal child **once** within `DEFAULT_AGENT_TOOL_RECOVERY_TIMEOUT_MS = 2000` / `..._TOTAL_TIMEOUT_MS = 5000` (`~903`). A child still `running`/`starting` after that window is finalized `interrupted` (`~8005`):

   ```ts
   if (!inspection || inspection.status === "running" || inspection.status === "starting") {
     result = { runId: row.run_id, agentType: row.agent_type, status: "interrupted",
       error: "Agent tool run was still running, but live-tail reattachment is not supported in this runtime." };
   }
   ```

2. **The interrupted terminal write blocks later repair.** `_updateAgentToolTerminal` (`~7536`) only updates rows whose status is NOT already terminal, so once a row is `interrupted`, a child that *later* completes (via its own `chatRecovery`) never repairs the parent row.

3. **Re-issue can't re-attach.** `agentTool()` does not pass a `runId` (it only passes `parentToolCallId: executeOptions?.toolCallId`), so `runAgentTool` generates a fresh `nanoid(12)` → a brand-new child. Even with a *stable* runId (the documented "correct pattern"), `runAgentTool` on a duplicate **non-terminal** runId calls `_replayAndInterruptAgentToolRun` (`~7840`) — replays stored chunks then returns `interrupted` — rather than re-attaching to the live child.

4. **Lost in-memory promise.** The original `runAgentTool` await is gone after a restart; nothing bridges it (recovery runs in `ctx.waitUntil`, not by resuming the old promise). The child, however, is a **separate facet with its own `chatRecovery`**, so it self-completes its interrupted turn — the parent just never collects that result.

Net: the child's work is durable in the child facet (`cf_agent_tool_child_runs` in Think / `cf_ai_chat_agent_tool_runs` in ai-chat), but the parent abandons the run and the model re-runs it.
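To make root cause 2 concrete, here is a minimal in-memory sketch of the terminal-write guard. `ParentRunTable` and its methods are hypothetical stand-ins, not the real `_updateAgentToolTerminal` or its schema; the sketch only assumes the behavior described above (terminal rows reject further terminal writes).

```typescript
// Hypothetical in-memory model of the parent's run rows; not the real schema.
type RunStatus = "starting" | "running" | "completed" | "interrupted";

const TERMINAL: ReadonlySet<RunStatus> = new Set<RunStatus>(["completed", "interrupted"]);

class ParentRunTable {
  private rows = new Map<string, RunStatus>();

  start(runId: string): void {
    this.rows.set(runId, "running");
  }

  // Mirrors the guard described above: only non-terminal rows accept a terminal write.
  updateTerminal(runId: string, status: RunStatus): boolean {
    const current = this.rows.get(runId);
    if (current === undefined || TERMINAL.has(current)) return false;
    this.rows.set(runId, status);
    return true;
  }

  status(runId: string): RunStatus | undefined {
    return this.rows.get(runId);
  }
}

const table = new ParentRunTable();
table.start("run-1");

// Parent recovery gives up inside its short budget and finalizes the row.
const reconciled = table.updateTerminal("run-1", "interrupted");

// The child later self-completes, but the late write is rejected by the guard.
const repaired = table.updateTerminal("run-1", "completed");
```

In this model `reconciled` is `true`, `repaired` is `false`, and the row stays `interrupted`, which is the "never repairs the parent row" outcome.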

## Why the existing e2e looks "BOUNDED"

`packages/think/src/e2e-tests/task-amplification.test.ts` stays BOUNDED only because its `runTask` hard-codes a **stable** `CHILD_TASK_RUN_ID` ("the correct pattern") AND the child finishes inside the recovery window. A natural `agentTool()` re-issue (fresh runId) — or any child longer than ~5s — defeats both conditions and amplifies.
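A rough model of the runId half of that condition, under loud assumptions: `ChildRegistry`, `freshRunId`, and `issueTwice` are invented for illustration, the counter stands in for `nanoid(12)`, and the child is treated as finishing instantly (so the recovery-window half is not modeled).

```typescript
// Hypothetical stand-in for "has this runId already produced a child run?".
class ChildRegistry {
  private runs = new Map<string, string>();
  executions = 0;

  run(runId: string, task: string): string {
    const existing = this.runs.get(runId);
    if (existing !== undefined) return existing; // known runId: no new child work
    this.executions += 1; // unknown runId: a brand-new child does the work
    const result = `done:${task}`;
    this.runs.set(runId, result);
    return result;
  }
}

let counter = 0;
// Stands in for nanoid(12): every call yields a never-seen-before id.
const freshRunId = (): string => `run-${++counter}`;

function issueTwice(makeRunId: () => string): number {
  const registry = new ChildRegistry();
  registry.run(makeRunId(), "write files"); // original issue
  registry.run(makeRunId(), "write files"); // re-issue after parent recovery
  return registry.executions;
}

const amplified = issueTwice(freshRunId); // fresh id per issue
const bounded = issueTwice(() => "child-task-stable"); // hard-coded stable id
```

With fresh ids the task executes twice (`amplified === 2`); with the stable id it executes once (`bounded === 1`).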

## Proposed fix (bounded re-attach)

Make a still-running **recoverable** child re-attach to its real terminal result instead of being abandoned. Both halves are needed — (b) keeps the row non-terminal so (a)'s re-issue can re-attach:

- **(a) `runAgentTool` duplicate non-terminal runId → re-attach.** Replace `_replayAndInterruptAgentToolRun` for the live-duplicate case with: resolve the child adapter, tail the live child to terminal (reuse `_forwardAgentToolStream` + `inspectAgentToolRun` + `_terminalResultFromInspection` + `_finishAgentToolRun`), return the real result. Fall back to `interrupted` only if there's no `tailAgentToolRun` adapter or it never reaches terminal. Makes the stable-runId pattern robust.
- **(b) Reconciliation re-attach.** For a still-running child whose adapter supports `tailAgentToolRun`, tail to terminal within a **generous bounded** budget (new `DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS`, e.g. 120_000), or use a `this.schedule()` defer-and-poll loop mirroring `_rescheduleRecoveryAfterStableTimeout` from #1623, instead of marking `interrupted` at 5s. Mark the run `interrupted` only once the budget is exhausted.
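One possible shape for the bounded tail-to-terminal in (a)/(b), as a sketch only. `reattachWithinBudget`, the `AgentToolResult` shape, and the error message are hypothetical; `tailToTerminal` stands in for whatever ends up wrapping `tailAgentToolRun` + `inspectAgentToolRun`. Cancelling the underlying tail once the budget expires is deliberately left out.

```typescript
// Hypothetical result shape; the real result type carries more fields.
type AgentToolResult =
  | { status: "completed"; output: string }
  | { status: "interrupted"; error: string };

// Resolves with the child's real terminal result if it arrives within budgetMs,
// otherwise falls back to "interrupted" so a hung child cannot block forever.
function reattachWithinBudget(
  tailToTerminal: () => Promise<AgentToolResult>,
  budgetMs: number
): Promise<AgentToolResult> {
  return new Promise<AgentToolResult>((resolve, reject) => {
    const timer = setTimeout(() => {
      resolve({
        status: "interrupted",
        error: `Re-attach budget of ${budgetMs}ms exhausted`
      });
    }, budgetMs);

    // Whichever settles first wins; later resolve() calls are no-ops.
    tailToTerminal()
      .then(resolve, reject)
      .finally(() => clearTimeout(timer));
  });
}
```

A caller would pass the proposed `DEFAULT_AGENT_TOOL_REATTACH_TIMEOUT_MS` as `budgetMs` in production and a few milliseconds in unit tests, which is one way to keep the stuck-child test fast.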

Open design question: generous-bounded re-attach (e.g. 120s then `interrupted`) vs. a stricter cap; and whether the deferred poll should consume the same wall-clock bound as chat recovery.
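For the defer-and-poll variant, the per-wakeup decision can be written as a pure function of the inspected status and a wall-clock deadline, which also makes the "same bound as chat recovery" question a matter of which deadline is passed in. Everything here (`decideNextPoll`, `PollDecision`, the parameter names) is a hypothetical sketch, not existing API.

```typescript
type ChildStatus = "starting" | "running" | "completed";

type PollDecision =
  | { action: "finalize"; status: "completed" | "interrupted" }
  | { action: "reschedule"; delayMs: number };

// Decide what one scheduled wakeup should do, given the child's inspected status.
function decideNextPoll(
  status: ChildStatus,
  nowMs: number,
  deadlineMs: number,
  pollIntervalMs: number
): PollDecision {
  if (status === "completed") {
    return { action: "finalize", status: "completed" };
  }
  const remainingMs = deadlineMs - nowMs;
  if (remainingMs <= 0) {
    // Budget exhausted: only now give up on the child.
    return { action: "finalize", status: "interrupted" };
  }
  // Never sleep past the deadline, so the final check lands on it.
  return { action: "reschedule", delayMs: Math.min(pollIntervalMs, remainingMs) };
}
```

For example, with a 120_000 ms deadline and a 10_000 ms interval, a child still `running` at 115_000 ms is rescheduled for 5_000 ms, and at 120_000 ms it is finalized as `interrupted`.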

## Scope / risk

- Changes agent-tool recovery **timing semantics**: re-attach can run for the child's lifetime inside `waitUntil`, so it MUST stay bounded; otherwise a genuinely hung child could block recovery forever.
- Reworks the tests that lock the current behavior:
  - `reconcileRunningThinkChildForTest` (in `packages/think/src/tests/agent-tools.test.ts` and `packages/ai-chat/src/tests/agent-tools.test.ts`) currently asserts `status: "interrupted"` with the "live-tail reattachment is not supported" message. New expectation: a child that completes during re-attach ends `completed`.
  - `reconcileStuckThinkChildWithTimeoutForTest` — asserts `elapsedMs < 1000`; a genuinely stuck child must still end `interrupted` but after the bounded budget (thread a small timeout through the test seam so the test stays fast).
- Think and ai-chat both implement `tailAgentToolRun` and child `chatRecovery`, so both can self-complete and be tailed.

## TDD plan

1. **e2e repro (the gate):** a variant of `task-amplification` where `runTask` uses a **natural fresh runId** (the `agentTool()` path) and/or a child that runs longer than the recovery window. Assert that it currently **AMPLIFIES** and is BOUNDED after the fix.
2. Implement (a), then (b); rework the two reconcile unit tests; add a "child completes during reattach → parent row completed" unit test.
3. Confirm the existing `task-amplification` (stable-runId) test is still BOUNDED.

## References

- `packages/agents/src/index.ts`: `runAgentTool` (~7102), `_replayAndInterruptAgentToolRun` (~7840), `_reconcileAgentToolRuns` (~7946) + still-running branch (~8005), `_inspectAgentToolRunForRecovery` (~8068), `_scheduleAgentToolRunRecovery` (~8094), `_finishAgentToolRun` (~7491), `_updateAgentToolTerminal` (~7536), `_forwardAgentToolStream` (~7671), constants (~903).
- Child adapters: `packages/think/src/think.ts` (`cf_agent_tool_child_runs`, `tailAgentToolRun`, child `chatRecovery`), `packages/ai-chat/src/index.ts` (`cf_ai_chat_agent_tool_runs`, force-terminal-on-inspect when no abort controller).
- Tests: `packages/think/src/tests/agent-tools.test.ts`, `packages/ai-chat/src/tests/agent-tools.test.ts`, `packages/think/src/e2e-tests/task-amplification.test.ts`.
- Related: #1623 (recovery hardening; `AgentToolFailure.retryable` — `interrupted → retryable: true` — is the parent-facing half of this), and the field-notes gap on synchronous child agents.
