You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Introduce a single durable, recovery-reconciled per-turn status/result record (cf_chat_turns) in @cloudflare/think and @cloudflare/ai-chat that becomes the authoritative answer to "did this turn finish, and what was its result?" Submissions, agent-tool reconciliation, and chat recovery all defer to it — fixing the N7 submission double-recover bug by construction and replacing the heuristic terminal-derivation behind N6.
This is the design captured during planning. It is the largest/highest-risk item in the deploy-churn recovery effort and is intended as a multi-PR, phased change. It does not gate the customer DoD (a fresh release for g3 to validate bare) — that remains higher business priority and can be done first or interleaved.
Why
"Did this turn finish, and what was its result" lives in three uncoordinated places today, so submissions and agent-tools each re-derive it with heuristics that can disagree with recovery's real state:
Submission row (cf_think_submissions, think-only) — _recoverSubmissionsOnStart (packages/think/src/think.ts ~5265) uses a fragile fiber-freshness / scheduled-continuation heuristic. When it loses the race it seals a still-recovering turn error, and the recovery continuation then aborts itself (submission_not_running). This is the confirmed N7 bug — the fix(agents,think,ai-chat): re-attach to still-running sub-agent runs on parent recovery (#1630) #1640 submission experiment hit it (child stalled at 10/30, parent collected error; only the e2e caught it). Blast radius: submitMessages + chatRecovery turns under deploy churn (a pure chat() turn has no submission row and is unaffected).
Recovery incident (cf:chat-recovery:incident:* KV, both packages) — deleted on completed (think.ts ~7458), so it cannot answer "this turn completed" after the fact.
Agent-tool run rows (cf_agent_tool_runs parent + cf_agent_tool_child_runs think + cf_ai_chat_agent_tool_runs ai-chat) — terminal state is heuristically derived from the transcript + incident scan in _classifyAgentToolChildRecovery / _reconcileStaleAgentToolChildRun (think.ts ~3696/3754; ai-chat ~2599/2634).
The record
New table in both think and ai-chat (duplicated for now, matching current reality; the shared-layer unification refactor #1642 later hoists it to agents/chat):
```sql
CREATE TABLE IF NOT EXISTS cf_chat_turns (
request_id TEXT PRIMARY KEY, -- recoveryRootRequestId (stable across continuations)
status TEXT NOT NULL, -- running | completed | interrupted | error | aborted | skipped
stream_id TEXT,
result_text TEXT, -- final assistant text / summary (nullable)
error_message TEXT,
reason TEXT, -- recovery reason on terminal-via-recovery (e.g. stable_timeout)
created_at INTEGER NOT NULL,
updated_at INTEGER NOT NULL,
completed_at INTEGER
)
```
Keyed by recoveryRootRequestId — the stable identity that submissions (request_id = submissionId), WS chat turns, and agent-tool child turns all already share. One row per turn; continuations update the same row.
Single writer helper _recordTurnStatus(requestId, status, { streamId?, resultText?, error?, reason? }) (idempotent: terminal statuses are never overwritten by a later non-terminal, mirroring the submission AND status='running' invariant).
Design decisions (baked in)
Status column on submission / agent-tool rows stays as the API-facing projection/cache. "Become projections" means their recovery/terminal decision defers tocf_chat_turns, not that we delete their status columns (that would churn the public submission API for no gain).
Safe rollout via read-with-fallback. Readers treat a present cf_chat_turns row as authoritative; absent → fall back to today's heuristic. This makes every phase additive and upgrade-safe (in-flight turns from before the upgrade have no row), and lets us delete the heuristics only once the record is proven universal.
Normal terminal: _recordTerminalChatStatus (think) and the ai-chat completion path — extend the existing terminal choke to also write the per-turn record.
N6 simplify — _classifyAgentToolChildRecovery / _reconcileStaleAgentToolChildRun (think.ts ~3696/3754; ai-chat ~2599/2634): read the child's turn record instead of scanning incidents + deriving from transcript; keep transcript-derive as fallback.
Phases (each independently shippable + e2e-gated)
Phase 1 (think) — record + writer (additive, zero reader change). Table + _recordTurnStatus + getChatTurn(requestId) + all turn-start and terminal write sites (_recordTerminalChatStatus, _exhaustChatRecovery, _updateChatRecoveryIncident, _runSubmission, startAgentToolRun); unit tests for every terminal path.
Phase 1 (ai-chat) — mirror the table + helper + write sites (completion path, _exhaustChatRecovery, _updateChatRecoveryIncident, startAgentToolRun). Lockstep with think.
Phase 1 verify — full think + ai-chat suites green; assert the record matches real outcomes in deploy-churn (incl. --mode subagent) + SIGKILL e2e. Changeset. Ship as its own PR. (No behavior change → near-zero risk.)
Phase 2 (N7 fix) — convert _recoverSubmissionsOnStart to defer to cf_chat_turns (fallback to heuristic when absent). Add the N7 repro test: a chatRecovery submitMessages turn interrupted mid-loop must NOT be prematurely errored. Gate on submission + deploy-churn e2e. Own PR.
Phase 3 (N6 simplify) — convert _classifyAgentToolChildRecovery / _reconcileStaleAgentToolChildRun (think + ai-chat) to read the turn record (fallback to transcript-derive). Gate on SIGKILL + deploy-churn --mode subagent e2e. Own PR.
This touches the exact submission-recovery + agent-tool-reconcile code where only e2e caught the #1640 regression. Every phase must pass: full think + ai-chat unit suites, the think + ai-chat e2e (packages/*/src/e2e-tests), and the examples/deploy-churn churn run (incl. --mode subagent). Changesets: @cloudflare/think + @cloudflare/ai-chat (+ agents if any shared types change → remember nx run agents:build before typecheck).
Risks
Largest, highest-risk item in the recovery effort; multi-PR. The read-with-fallback strategy is the main mitigation (additive, reversible per phase).
Summary
Introduce a single durable, recovery-reconciled per-turn status/result record (
cf_chat_turns) in@cloudflare/thinkand@cloudflare/ai-chatthat becomes the authoritative answer to "did this turn finish, and what was its result?" Submissions, agent-tool reconciliation, and chat recovery all defer to it — fixing the N7 submission double-recover bug by construction and replacing the heuristic terminal-derivation behind N6.Why
"Did this turn finish, and what was its result" lives in three uncoordinated places today, so submissions and agent-tools each re-derive it with heuristics that can disagree with recovery's real state:
cf_think_submissions, think-only) —_recoverSubmissionsOnStart(packages/think/src/think.ts~5265) uses a fragile fiber-freshness / scheduled-continuation heuristic. When it loses the race it seals a still-recovering turnerror, and the recovery continuation then aborts itself (submission_not_running). This is the confirmed N7 bug — the fix(agents,think,ai-chat): re-attach to still-running sub-agent runs on parent recovery (#1630) #1640 submission experiment hit it (child stalled at 10/30, parent collectederror; only the e2e caught it). Blast radius:submitMessages+chatRecoveryturns under deploy churn (a purechat()turn has no submission row and is unaffected).cf:chat-recovery:incident:*KV, both packages) — deleted oncompleted(think.ts~7458), so it cannot answer "this turn completed" after the fact.cf_agent_tool_runsparent +cf_agent_tool_child_runsthink +cf_ai_chat_agent_tool_runsai-chat) — terminal state is heuristically derived from the transcript + incident scan in_classifyAgentToolChildRecovery/_reconcileStaleAgentToolChildRun(think.ts~3696/3754;ai-chat~2599/2634).The record
New table in both think and ai-chat (duplicated for now, matching current reality; the shared-layer unification refactor #1642 later hoists it to
agents/chat):```sql
CREATE TABLE IF NOT EXISTS cf_chat_turns (
request_id TEXT PRIMARY KEY, -- recoveryRootRequestId (stable across continuations)
status TEXT NOT NULL, -- running | completed | interrupted | error | aborted | skipped
stream_id TEXT,
result_text TEXT, -- final assistant text / summary (nullable)
error_message TEXT,
reason TEXT, -- recovery reason on terminal-via-recovery (e.g. stable_timeout)
created_at INTEGER NOT NULL,
updated_at INTEGER NOT NULL,
completed_at INTEGER
)
```
Keyed by
recoveryRootRequestId— the stable identity that submissions (request_id= submissionId), WS chat turns, and agent-tool child turns all already share. One row per turn; continuations update the same row.Single writer helper
_recordTurnStatus(requestId, status, { streamId?, resultText?, error?, reason? })(idempotent: terminal statuses are never overwritten by a later non-terminal, mirroring the submissionAND status='running'invariant).Design decisions (baked in)
cf_chat_turns, not that we delete their status columns (that would churn the public submission API for no gain).cf_chat_turnsrow as authoritative; absent → fall back to today's heuristic. This makes every phase additive and upgrade-safe (in-flight turns from before the upgrade have no row), and lets us delete the heuristics only once the record is proven universal.runningcovers in-flight + recovering (no separaterecovering— consistent with the durable-submissions design and Surface a live "recovering…" status to chat clients during durable recovery #1620's separate broadcast /useAgentChat.isRecovering).Writers (where status is set)
running: WS/submission/agent-tool turn entry (_runInferenceLoop/_streamResultstart;_runSubmissionclaimthink.ts~5120;startAgentToolRunthink.ts~3575 /ai-chat~2430)._recordTerminalChatStatus(think) and the ai-chat completion path — extend the existing terminal choke to also write the per-turn record._exhaustChatRecovery→interrupted;_updateChatRecoveryIncidentcompleted/skipped/failed;_completeRecoveredSubmission/_markRecoveredSubmissionInterrupted.Readers (deferral points)
_recoverSubmissionsOnStart(think.ts~5325): readcf_chat_turns[request_id]→completed→complete the submission,running→defer,interrupted/error/aborted→that terminal; absent→current heuristic._classifyAgentToolChildRecovery/_reconcileStaleAgentToolChildRun(think.ts~3696/3754;ai-chat~2599/2634): read the child's turn record instead of scanning incidents + deriving from transcript; keep transcript-derive as fallback.Phases (each independently shippable + e2e-gated)
_recordTurnStatus+getChatTurn(requestId)+ all turn-start and terminal write sites (_recordTerminalChatStatus,_exhaustChatRecovery,_updateChatRecoveryIncident,_runSubmission,startAgentToolRun); unit tests for every terminal path._exhaustChatRecovery,_updateChatRecoveryIncident,startAgentToolRun). Lockstep with think.deploy-churn(incl.--mode subagent) + SIGKILL e2e. Changeset. Ship as its own PR. (No behavior change → near-zero risk.)_recoverSubmissionsOnStartto defer tocf_chat_turns(fallback to heuristic when absent). Add the N7 repro test: a chatRecoverysubmitMessagesturn interrupted mid-loop must NOT be prematurely errored. Gate on submission + deploy-churn e2e. Own PR._classifyAgentToolChildRecovery/_reconcileStaleAgentToolChildRun(think + ai-chat) to read the turn record (fallback to transcript-derive). Gate on SIGKILL +deploy-churn --mode subagente2e. Own PR.cf_chat_turns+_recordTurnStatusinto the sharedagents/chatlayer.Testing / gates (non-negotiable)
This touches the exact submission-recovery + agent-tool-reconcile code where only e2e caught the #1640 regression. Every phase must pass: full think + ai-chat unit suites, the think + ai-chat e2e (
packages/*/src/e2e-tests), and theexamples/deploy-churnchurn run (incl.--mode subagent). Changesets:@cloudflare/think+@cloudflare/ai-chat(+agentsif any shared types change → remembernx run agents:buildbefore typecheck).Risks
running; the fallback keeps behavior correct, but Phase 1 tests must cover every terminal path.Related
agents/chat(N3). Phase 4 folds this record into that refactor._recoverSubmissionsOnStart) — N7 is fixed here by construction.