refactor: unify the durable turn-status / result record (N8) — fixes the submission double-recover bug (N7), simplifies agent-tool reconcile (N6)

## Summary

Introduce a single durable, recovery-reconciled per-turn status/result record (`cf_chat_turns`) in `@cloudflare/think` and `@cloudflare/ai-chat` that becomes the authoritative answer to **"did this turn finish, and what was its result?"** Submissions, agent-tool reconciliation, and chat recovery all defer to it — fixing the **N7** submission double-recover bug by construction and replacing the heuristic terminal-derivation behind **N6**.

> This is the design captured during planning. It is the largest/highest-risk item in the deploy-churn recovery effort and is intended as a multi-PR, phased change. It does **not** gate the customer DoD (a fresh release for g3 to validate bare) — that remains higher business priority and can be done first or interleaved.

## Why

"Did this turn finish, and what was its result" lives in **three uncoordinated places** today, so submissions and agent-tools each re-derive it with heuristics that can disagree with recovery's real state:

- **Submission row** (`cf_think_submissions`, think-only) — `_recoverSubmissionsOnStart` (`packages/think/src/think.ts` ~5265) uses a fragile fiber-freshness / scheduled-continuation heuristic. When it loses the race it seals a still-recovering turn `error`, and the recovery continuation then aborts itself (`submission_not_running`). **This is the confirmed N7 bug** — the #1640 submission experiment hit it (child stalled at 10/30, parent collected `error`; only the e2e caught it). Blast radius: `submitMessages` + `chatRecovery` turns under deploy churn (a pure `chat()` turn has no submission row and is unaffected).
- **Recovery incident** (`cf:chat-recovery:incident:*` KV, both packages) — **deleted on `completed`** (`think.ts` ~7458), so it cannot answer "this turn completed" after the fact.
- **Agent-tool run rows** (`cf_agent_tool_runs` parent + `cf_agent_tool_child_runs` think + `cf_ai_chat_agent_tool_runs` ai-chat) — terminal state is **heuristically derived** from the transcript + incident scan in `_classifyAgentToolChildRecovery` / `_reconcileStaleAgentToolChildRun` (`think.ts` ~3696/3754; `ai-chat` ~2599/2634).

## The record

New table in **both** think and ai-chat (duplicated for now, matching current reality; the shared-layer unification refactor #1642 later hoists it to `agents/chat`):

\`\`\`sql
CREATE TABLE IF NOT EXISTS cf_chat_turns (
  request_id    TEXT PRIMARY KEY,   -- recoveryRootRequestId (stable across continuations)
  status        TEXT NOT NULL,      -- running | completed | interrupted | error | aborted | skipped
  stream_id     TEXT,
  result_text   TEXT,               -- final assistant text / summary (nullable)
  error_message TEXT,
  reason        TEXT,               -- recovery reason on terminal-via-recovery (e.g. stable_timeout)
  created_at    INTEGER NOT NULL,
  updated_at    INTEGER NOT NULL,
  completed_at  INTEGER
)
\`\`\`

Keyed by `recoveryRootRequestId` — the stable identity that submissions (`request_id` = submissionId), WS chat turns, and agent-tool child turns all already share. One row per turn; continuations update the same row.

Single writer helper `_recordTurnStatus(requestId, status, { streamId?, resultText?, error?, reason? })` (idempotent: terminal statuses are never overwritten by a later non-terminal, mirroring the submission `AND status='running'` invariant).

## Design decisions (baked in)

- **Status column on submission / agent-tool rows stays** as the API-facing projection/cache. "Become projections" means their *recovery/terminal decision defers to* `cf_chat_turns`, not that we delete their status columns (that would churn the public submission API for no gain).
- **Safe rollout via read-with-fallback.** Readers treat a present `cf_chat_turns` row as authoritative; **absent → fall back to today's heuristic.** This makes every phase additive and upgrade-safe (in-flight turns from before the upgrade have no row), and lets us delete the heuristics only once the record is proven universal.
- **`running` covers in-flight + recovering** (no separate `recovering` — consistent with the durable-submissions design and #1620's separate broadcast / `useAgentChat.isRecovering`).
- **Retention:** terminal rows retained, swept by a TTL sweep (reuse the incident-sweep cadence).

## Writers (where status is set)

- Turn start → `running`: WS/submission/agent-tool turn entry (`_runInferenceLoop` / `_streamResult` start; `_runSubmission` claim `think.ts` ~5120; `startAgentToolRun` `think.ts` ~3575 / `ai-chat` ~2430).
- Normal terminal: `_recordTerminalChatStatus` (think) and the ai-chat completion path — extend the existing terminal choke to also write the per-turn record.
- Recovery terminal: `_exhaustChatRecovery` → `interrupted`; `_updateChatRecoveryIncident` completed/skipped/failed; `_completeRecoveredSubmission` / `_markRecoveredSubmissionInterrupted`.

## Readers (deferral points)

- **N7 fix** — `_recoverSubmissionsOnStart` (`think.ts` ~5325): read `cf_chat_turns[request_id]` → `completed`→complete the submission, `running`→defer, `interrupted`/`error`/`aborted`→that terminal; absent→current heuristic.
- **N6 simplify** — `_classifyAgentToolChildRecovery` / `_reconcileStaleAgentToolChildRun` (`think.ts` ~3696/3754; `ai-chat` ~2599/2634): read the child's turn record instead of scanning incidents + deriving from transcript; keep transcript-derive as fallback.

## Phases (each independently shippable + e2e-gated)

- [ ] **Phase 1 (think) — record + writer (additive, zero reader change).** Table + `_recordTurnStatus` + `getChatTurn(requestId)` + all turn-start and terminal write sites (`_recordTerminalChatStatus`, `_exhaustChatRecovery`, `_updateChatRecoveryIncident`, `_runSubmission`, `startAgentToolRun`); unit tests for every terminal path.
- [ ] **Phase 1 (ai-chat)** — mirror the table + helper + write sites (completion path, `_exhaustChatRecovery`, `_updateChatRecoveryIncident`, `startAgentToolRun`). Lockstep with think.
- [ ] **Phase 1 verify** — full think + ai-chat suites green; assert the record matches real outcomes in `deploy-churn` (incl. `--mode subagent`) + SIGKILL e2e. Changeset. Ship as its own PR. (No behavior change → near-zero risk.)
- [ ] **Phase 2 (N7 fix)** — convert `_recoverSubmissionsOnStart` to defer to `cf_chat_turns` (fallback to heuristic when absent). Add the N7 repro test: a chatRecovery `submitMessages` turn interrupted mid-loop must NOT be prematurely errored. Gate on submission + deploy-churn e2e. Own PR.
- [ ] **Phase 3 (N6 simplify)** — convert `_classifyAgentToolChildRecovery` / `_reconcileStaleAgentToolChildRun` (think + ai-chat) to read the turn record (fallback to transcript-derive). Gate on SIGKILL + `deploy-churn --mode subagent` e2e. Own PR.
- [ ] **Phase 4 (later / with #1642)** — once the record is proven universal, remove the dead heuristics + fallbacks and hoist `cf_chat_turns` + `_recordTurnStatus` into the shared `agents/chat` layer.

## Testing / gates (non-negotiable)

This touches the exact submission-recovery + agent-tool-reconcile code where **only e2e caught the #1640 regression**. Every phase must pass: full think + ai-chat unit suites, the think + ai-chat e2e (`packages/*/src/e2e-tests`), and the `examples/deploy-churn` churn run (incl. `--mode subagent`). Changesets: `@cloudflare/think` + `@cloudflare/ai-chat` (+ `agents` if any shared types change → remember `nx run agents:build` before typecheck).

## Risks

- Largest, highest-risk item in the recovery effort; multi-PR. The read-with-fallback strategy is the main mitigation (additive, reversible per phase).
- Duplicated think↔ai-chat (the layer isn't unified yet, #1642) — edit in lockstep.
- Missing a write site → a turn record stuck `running`; the fallback keeps behavior correct, but Phase 1 tests must cover every terminal path.

## Related

- Built on the deploy-churn recovery stack: #1633 / #1638 / #1640 / #1641 / #1643 / #1646 / #1647 / #1648.
- #1642 — unify the duplicated chat-recovery/repair machinery into `agents/chat` (N3). Phase 4 folds this record into that refactor.
- Subsumes the previously-considered standalone "N7" targeted fix (defer-to-recovery in `_recoverSubmissionsOnStart`) — N7 is fixed here by construction.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor: unify the durable turn-status / result record (N8) — fixes the submission double-recover bug (N7), simplifies agent-tool reconcile (N6) #1652

Summary

Why

The record

Design decisions (baked in)

Writers (where status is set)

Readers (deferral points)

Phases (each independently shippable + e2e-gated)

Testing / gates (non-negotiable)

Risks

Related

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

refactor: unify the durable turn-status / result record (N8) — fixes the submission double-recover bug (N7), simplifies agent-tool reconcile (N6) #1652

Description

Summary

Why

The record

Design decisions (baked in)

Writers (where status is set)

Readers (deferral points)

Phases (each independently shippable + e2e-gated)

Testing / gates (non-negotiable)

Risks

Related

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions