Skip to content

refactor: unify the durable turn-status / result record (N8) — fixes the submission double-recover bug (N7), simplifies agent-tool reconcile (N6) #1652

@threepointone

Description

@threepointone

Summary

Introduce a single durable, recovery-reconciled per-turn status/result record (cf_chat_turns) in @cloudflare/think and @cloudflare/ai-chat that becomes the authoritative answer to "did this turn finish, and what was its result?" Submissions, agent-tool reconciliation, and chat recovery all defer to it — fixing the N7 submission double-recover bug by construction and replacing the heuristic terminal-derivation behind N6.

This is the design captured during planning. It is the largest/highest-risk item in the deploy-churn recovery effort and is intended as a multi-PR, phased change. It does not gate the customer DoD (a fresh release for g3 to validate bare) — that remains higher business priority and can be done first or interleaved.

Why

"Did this turn finish, and what was its result" lives in three uncoordinated places today, so submissions and agent-tools each re-derive it with heuristics that can disagree with recovery's real state:

  • Submission row (cf_think_submissions, think-only) — _recoverSubmissionsOnStart (packages/think/src/think.ts ~5265) uses a fragile fiber-freshness / scheduled-continuation heuristic. When it loses the race it seals a still-recovering turn error, and the recovery continuation then aborts itself (submission_not_running). This is the confirmed N7 bug — the fix(agents,think,ai-chat): re-attach to still-running sub-agent runs on parent recovery (#1630) #1640 submission experiment hit it (child stalled at 10/30, parent collected error; only the e2e caught it). Blast radius: submitMessages + chatRecovery turns under deploy churn (a pure chat() turn has no submission row and is unaffected).
  • Recovery incident (cf:chat-recovery:incident:* KV, both packages) — deleted on completed (think.ts ~7458), so it cannot answer "this turn completed" after the fact.
  • Agent-tool run rows (cf_agent_tool_runs parent + cf_agent_tool_child_runs think + cf_ai_chat_agent_tool_runs ai-chat) — terminal state is heuristically derived from the transcript + incident scan in _classifyAgentToolChildRecovery / _reconcileStaleAgentToolChildRun (think.ts ~3696/3754; ai-chat ~2599/2634).

The record

New table in both think and ai-chat (duplicated for now, matching current reality; the shared-layer unification refactor #1642 later hoists it to agents/chat):

```sql
CREATE TABLE IF NOT EXISTS cf_chat_turns (
request_id TEXT PRIMARY KEY, -- recoveryRootRequestId (stable across continuations)
status TEXT NOT NULL, -- running | completed | interrupted | error | aborted | skipped
stream_id TEXT,
result_text TEXT, -- final assistant text / summary (nullable)
error_message TEXT,
reason TEXT, -- recovery reason on terminal-via-recovery (e.g. stable_timeout)
created_at INTEGER NOT NULL,
updated_at INTEGER NOT NULL,
completed_at INTEGER
)
```

Keyed by recoveryRootRequestId — the stable identity that submissions (request_id = submissionId), WS chat turns, and agent-tool child turns all already share. One row per turn; continuations update the same row.

Single writer helper _recordTurnStatus(requestId, status, { streamId?, resultText?, error?, reason? }) (idempotent: terminal statuses are never overwritten by a later non-terminal, mirroring the submission AND status='running' invariant).

Design decisions (baked in)

  • Status column on submission / agent-tool rows stays as the API-facing projection/cache. "Become projections" means their recovery/terminal decision defers to cf_chat_turns, not that we delete their status columns (that would churn the public submission API for no gain).
  • Safe rollout via read-with-fallback. Readers treat a present cf_chat_turns row as authoritative; absent → fall back to today's heuristic. This makes every phase additive and upgrade-safe (in-flight turns from before the upgrade have no row), and lets us delete the heuristics only once the record is proven universal.
  • running covers in-flight + recovering (no separate recovering — consistent with the durable-submissions design and Surface a live "recovering…" status to chat clients during durable recovery #1620's separate broadcast / useAgentChat.isRecovering).
  • Retention: terminal rows retained, swept by a TTL sweep (reuse the incident-sweep cadence).

Writers (where status is set)

  • Turn start → running: WS/submission/agent-tool turn entry (_runInferenceLoop / _streamResult start; _runSubmission claim think.ts ~5120; startAgentToolRun think.ts ~3575 / ai-chat ~2430).
  • Normal terminal: _recordTerminalChatStatus (think) and the ai-chat completion path — extend the existing terminal choke to also write the per-turn record.
  • Recovery terminal: _exhaustChatRecoveryinterrupted; _updateChatRecoveryIncident completed/skipped/failed; _completeRecoveredSubmission / _markRecoveredSubmissionInterrupted.

Readers (deferral points)

  • N7 fix_recoverSubmissionsOnStart (think.ts ~5325): read cf_chat_turns[request_id]completed→complete the submission, running→defer, interrupted/error/aborted→that terminal; absent→current heuristic.
  • N6 simplify_classifyAgentToolChildRecovery / _reconcileStaleAgentToolChildRun (think.ts ~3696/3754; ai-chat ~2599/2634): read the child's turn record instead of scanning incidents + deriving from transcript; keep transcript-derive as fallback.

Phases (each independently shippable + e2e-gated)

  • Phase 1 (think) — record + writer (additive, zero reader change). Table + _recordTurnStatus + getChatTurn(requestId) + all turn-start and terminal write sites (_recordTerminalChatStatus, _exhaustChatRecovery, _updateChatRecoveryIncident, _runSubmission, startAgentToolRun); unit tests for every terminal path.
  • Phase 1 (ai-chat) — mirror the table + helper + write sites (completion path, _exhaustChatRecovery, _updateChatRecoveryIncident, startAgentToolRun). Lockstep with think.
  • Phase 1 verify — full think + ai-chat suites green; assert the record matches real outcomes in deploy-churn (incl. --mode subagent) + SIGKILL e2e. Changeset. Ship as its own PR. (No behavior change → near-zero risk.)
  • Phase 2 (N7 fix) — convert _recoverSubmissionsOnStart to defer to cf_chat_turns (fallback to heuristic when absent). Add the N7 repro test: a chatRecovery submitMessages turn interrupted mid-loop must NOT be prematurely errored. Gate on submission + deploy-churn e2e. Own PR.
  • Phase 3 (N6 simplify) — convert _classifyAgentToolChildRecovery / _reconcileStaleAgentToolChildRun (think + ai-chat) to read the turn record (fallback to transcript-derive). Gate on SIGKILL + deploy-churn --mode subagent e2e. Own PR.
  • Phase 4 (later / with refactor: unify duplicated chat-recovery/repair machinery into the shared agents/chat layer (N3) #1642) — once the record is proven universal, remove the dead heuristics + fallbacks and hoist cf_chat_turns + _recordTurnStatus into the shared agents/chat layer.

Testing / gates (non-negotiable)

This touches the exact submission-recovery + agent-tool-reconcile code where only e2e caught the #1640 regression. Every phase must pass: full think + ai-chat unit suites, the think + ai-chat e2e (packages/*/src/e2e-tests), and the examples/deploy-churn churn run (incl. --mode subagent). Changesets: @cloudflare/think + @cloudflare/ai-chat (+ agents if any shared types change → remember nx run agents:build before typecheck).

Risks

Related

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type
No fields configured for issues without a type.

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions