fix(think,ai-chat): preserve settled work when a recovery turn is given up (#1631) by threepointone · Pull Request #1634 · cloudflare/agents

threepointone · 2026-05-31T21:51:25Z

Summary

When a chat-recovery turn is given up, the framework could throw away a partial assistant message holding completed, often non-idempotent tool results — the most valuable, least-reproducible state in a turn. This PR makes sure that never happens.

Updated: rebased onto main (it now includes #1633, #1638/N1, #1640, and #1641/N9) and the persist: false footgun is now fixed with the stronger default (R1) instead of a warning — see below.

Two paths fixed

1. Framework budget exhaustion dropped the settled partial

The budget check returns before onChatRecovery is consulted, and _exhaustChatRecovery sealed the turn (terminal status + banner) without persisting the orphaned stream. So when the framework's own budget tripped, every settled tool result was discarded and the model re-ran those tool calls on the next message.

Fix: the exhaustion branch now persists the settled partial first, reusing the normal path's gating (_shouldPersistOrphanedPartial) so it can never duplicate a partial an earlier attempt already saved. Because this sits in the same if (exhausted) branch that N1/#1638 rewrote, it now covers every exhaustion path (no-progress window + 15-min ceiling + attempt cap), not just the raw attempt cap.
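The ordering change can be illustrated with a minimal sketch. It is not the framework's code: `OrphanedPartial`, `RecoveryStore`, and both functions below are hypothetical stand-ins that only show "persist the settled partial, gated against duplicates, then seal the turn".

```typescript
// Hypothetical shapes for illustration; the real framework types are not shown in this PR.
interface OrphanedPartial {
  id: string;
  parts: unknown[];
}

interface RecoveryStore {
  savedPartialIds: Set<string>; // partials an earlier attempt already persisted
  transcript: OrphanedPartial[]; // stand-in for the durable transcript
  status: "recovering" | "exhausted";
}

// Stand-in for the gating idea: never persist nothing, never persist twice.
function shouldPersistOrphanedPartial(
  store: RecoveryStore,
  partial: OrphanedPartial | null
): partial is OrphanedPartial {
  return (
    partial !== null &&
    partial.parts.length > 0 &&
    !store.savedPartialIds.has(partial.id)
  );
}

function exhaustChatRecovery(
  store: RecoveryStore,
  partial: OrphanedPartial | null
): void {
  // 1. Persist first, so sealing the turn cannot orphan settled work.
  if (shouldPersistOrphanedPartial(store, partial)) {
    store.transcript.push(partial);
    store.savedPartialIds.add(partial.id);
  }
  // 2. Only then seal the turn with its terminal status.
  store.status = "exhausted";
}
```

Calling `exhaustChatRecovery` twice with the same partial leaves a single transcript entry, which is the "can never duplicate" property the gating is meant to provide.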

2. `onChatRecovery` returning `{ persist: false }` dropped settled work (R1 — stronger default)

{ persist: false } reads like "don't bother continuing," but it actually deleted the settled partial — losing completed tool calls with no signal.

Fix (R1): settled work is now NEVER dropped. { persist: false } only suppresses persistence of a partial that has nothing settled to lose; a partial carrying settled tool results is persisted regardless. An app can no longer accidentally discard completed work — and never needs { persist: true } just to stay safe. (A safe default beats a warning about an unsafe one — chosen over the earlier "warn once" approach so there is no footgun and no app decision.)

New helpers: _shouldPersistOrphanedPartial, _partialHasSettledToolResults (recognizes output-available / output-error / output-denied and output/result shapes). Applied identically to @cloudflare/think and @cloudflare/ai-chat.
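As a rough sketch of the R1 rule, assuming a simplified part shape: `MessagePartLike`, `partialHasSettledToolResults`, and `shouldPersistPartial` are illustrative names, not the framework's private helpers, and the real detection logic may differ. The states and the output/result shapes are the ones listed above.

```typescript
// Assumed part shape, modeled on the states named in this PR; not the framework's actual type.
interface MessagePartLike {
  type: string;
  state?: string;
  output?: unknown;
  result?: unknown;
}

const SETTLED_STATES = new Set([
  "output-available",
  "output-error",
  "output-denied",
]);

// A part counts as settled if it is in a terminal tool state or carries an output/result payload.
function partialHasSettledToolResults(parts: MessagePartLike[]): boolean {
  return parts.some(
    (p) =>
      (p.state !== undefined && SETTLED_STATES.has(p.state)) ||
      p.output !== undefined ||
      p.result !== undefined
  );
}

// R1: settled work is persisted regardless; persist: false only applies when nothing is settled.
function shouldPersistPartial(
  parts: MessagePartLike[],
  decision: { persist?: boolean }
): boolean {
  if (parts.length === 0) return false;
  if (partialHasSettledToolResults(parts)) return true;
  return decision.persist !== false; // persisting is the default
}
```

Under this sketch a text-only partial with `{ persist: false }` is dropped, while a partial containing a settled tool part is kept whatever the app returns.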

g3 impact

Lets g3 delete its { persist: true } recovery override (persisting is already the default, and settled work is now never lost even on an explicit persist: false).

Tests

Exhaustion preserves the settled partial: seed an incident at the cap + a terminal stream with a partial, trigger recovery, assert the partial is persisted and the incident is exhausted. (Adapted to N1/#1638's alarm-debounce: the seeded incident's lastAttemptAt is aged past the debounce window so the wake counts as a genuine attempt and actually exhausts.)
{ persist: false } never drops settled tool results: the settled partial IS persisted, with no warning.
{ persist: false } honored for a text-only partial: nothing settled to lose → nothing persisted, no warning.
Full suites green: think 463, ai-chat 485; npm run check clean (91 projects).

Notes for reviewers

Rebased onto current main (the original branch carried the now-merged #1633 commit; dropped it). The diff is just this PR's change.
The earlier revision of this PR shipped a one-time console.warn on persist: false data loss; this revision replaces it with the stronger no-loss default (R1) per the defaults-over-APIs review.

Test plan

CI green
think + ai-chat suites green (463 / 485)
npm run check clean

changeset-bot · 2026-05-31T21:51:29Z

🦋 Changeset detected

Latest commit: f4bfb14

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages

| Name | Type |
| --- | --- |
| @cloudflare/think | Patch |
| @cloudflare/ai-chat | Patch |

pkg-pr-new · 2026-05-31T21:56:53Z

agents

npm i https://pkg.pr.new/agents@1634

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1634

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1634

hono-agents

npm i https://pkg.pr.new/hono-agents@1634

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1634

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1634

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1634

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1634

commit: f4bfb14

…en up (#1631)

Two paths could throw away a partial assistant message holding completed, often non-idempotent tool results (the most valuable, least-reproducible state in a turn):

1. Framework budget exhaustion sealed the turn (terminal status + banner) BEFORE the orphaned stream was persisted, so settled tool results were discarded and re-run on the next message. Exhaustion now persists the settled partial first, reusing the normal path's gating so it can't duplicate an already-saved partial. (This now also covers N1/#1638's wall-clock and no-progress exhaustion paths, not just the attempt cap.)
2. A subclass onChatRecovery returning { persist: false } silently dropped the settled partial. Settled work is now NEVER dropped: { persist: false } only suppresses persistence of a partial that has nothing settled to lose; a partial carrying settled tool results is persisted regardless. An app can no longer accidentally discard completed work, and never needs { persist: true } just to stay safe. A safe default beats a warning about an unsafe one (R1).

Adds _shouldPersistOrphanedPartial / _partialHasSettledToolResults helpers. Applied identically to @cloudflare/think and @cloudflare/ai-chat.

Tests:
- Unit (think + ai-chat): exhaustion preserves a settled TOOL RESULT (not just text); { persist: false } never drops settled tool results, and is honored for a text-only partial. (Exhaustion test aged past N1/#1638's alarm-debounce so the wake counts as a genuine attempt and actually exhausts.)
- E2E (think, real SIGKILL): a recordStep loop whose onChatRecovery returns { persist: false, continue: false } is killed mid-turn; after recovery the settled tool results produced before the kill are still in the durable transcript (R1) and the turn does not continue. (ThinkPersistFalseE2EAgent.)

NOTE on coverage: the EXHAUSTION-with-prior-settled-work path stays unit-tested (not e2e) on purpose. Under N1's budget, settling a tool result IS forward progress that resets the budget and prevents exhaustion, so that scenario can't be forced deterministically under real churn.

Rebased onto main (dropping the already-merged #1633 commit).

Co-authored-by: Cursor <cursoragent@cursor.com>

threepointone · 2026-06-01T14:15:11Z

Review-driven coverage hardening (pushed)

Following a confidence review of the test coverage, added:

Strengthened the exhaustion unit tests (think + ai-chat): they now seed a settled tool result (not just text) and assert it survives budget exhaustion — directly proving the headline claim ("completed, non-idempotent tool results aren't dropped on exhaustion").
New real-SIGKILL e2e (ThinkPersistFalseE2EAgent + persist-false-preserves.test.ts): a recordStep loop whose onChatRecovery returns { persist: false, continue: false } is killed mid-turn; after recovery the settled tool results produced before the kill are still in the durable transcript (R1) and the turn does not continue. This validates the R1 no-loss default under a real process kill (it would fail on the pre-R1 "persist:false drops the partial" behavior).

Two findings from the review (both resolved/understood, no code change needed):

ai-chat's streamStillActive-only persist gate vs think's _shouldPersistOrphanedPartial is an intentional asymmetry, not a data-loss gap: ai-chat restores interrupted streams as active (insertInterruptedStream → status:'streaming' + _resumableStream.restore()), so recovery always sees an active orphan that the gate covers; terminal orphans are persisted by the client-reconnect ACK path. (think tracks an explicit terminal stream status and needs the extra branch.)
The exhaustion-with-prior-settled-work path stays unit-tested, not e2e, on purpose. Under N1/#1638's budget, settling a tool result is forward progress that resets the recovery budget, which (correctly) prevents exhaustion. So that exact scenario can't be forced deterministically under real churn; the unit test (which seeds the at-cap incident directly) is the right level. The real-kill e2e instead covers the deterministic R1 persist:false path.
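To see why the second finding holds, here is a toy model of a progress-keyed budget. It is only an assumption-laden sketch: the `RecoveryIncident` shape, the constants, and the reset semantics are invented for illustration and are not taken from #1638.

```typescript
// Toy model only: field names, limits, and reset behavior are assumptions.
interface RecoveryIncident {
  attempts: number;
  lastProgressAt: number; // epoch ms of the last forward progress
}

const MAX_ATTEMPTS = 3; // assumed attempt cap
const NO_PROGRESS_WINDOW_MS = 5 * 60 * 1000; // assumed no-progress window

function recordAttempt(incident: RecoveryIncident): void {
  incident.attempts += 1;
}

// Settling a tool result counts as forward progress and resets the budget.
function recordProgress(incident: RecoveryIncident, now: number): void {
  incident.attempts = 0;
  incident.lastProgressAt = now;
}

function isExhausted(incident: RecoveryIncident, now: number): boolean {
  return (
    incident.attempts >= MAX_ATTEMPTS ||
    now - incident.lastProgressAt >= NO_PROGRESS_WINDOW_MS
  );
}
```

In this model a live turn that keeps settling tool results keeps calling `recordProgress`, so `isExhausted` stays false; only a test that seeds `attempts` at the cap directly can reach the exhausted state alongside settled work.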

Suites green: think 463, ai-chat 485; new e2e green (2 deterministic runs, ~20s); npm run check clean.

…sted payload (#1631) (#1636)

Lets products build a terminal-state policy without re-deriving anything:

- ChatRecoveryContext (onChatRecovery) gains recoveryRootRequestId: the stable request id for the whole continuation chain, the right key for per-incident budget tracking / fresh-incident detection (no re-deriving from message IDs).
- ChatRecoveryExhaustedContext (onExhausted) gains recoveryRootRequestId, terminalMessage (exact user-facing text), partialText/partialParts (what the turn produced before it was given up on), and streamId/createdAt, enough to render/persist a terminal banner AND emit correlated terminal telemetry (msSinceTurnStart, stream correlation) directly.

streamId + createdAt were added after verifying the payload against the actual consumer (g3's _emitExhaustedRecovery): it reads both from the recovery context for telemetry, and they already exist on ChatRecoveryContext (the Pick source), so adding them to the exhausted context is additive and unblocks re-homing the exhaustion handler onto onExhausted with zero re-derivation (D4).

Shared types in `agents`; wired through think + ai-chat (_exhaustChatRecovery now receives streamId + createdAt). Test agents capture the exhausted context; tests assert both contexts (incl. streamId + createdAt) in both packages.

Rebased onto main (dropping the merged #1633/#1634/#1635 commits); adapted the exhausted-ctx test to N1/#1638's alarm-debounce and gave the think harness an explicit return shape (the context's MessagePart[] over-instantiates the RPC stub type).

Co-authored-by: Cursor <cursoragent@cursor.com>
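A consumer of the exhausted payload might look roughly like the sketch below. The field names come from the commit message above, but their types (for example `createdAt` as epoch milliseconds) and the `ExhaustedContextSketch` / `buildExhaustedTelemetry` names are assumptions made for illustration.

```typescript
// Field names follow the commit message; the types are assumed, not the published ones.
interface ExhaustedContextSketch {
  recoveryRootRequestId: string;
  terminalMessage: string;
  partialText: string;
  partialParts: unknown[];
  streamId: string;
  createdAt: number; // assumed: epoch ms when the turn started
}

interface ExhaustedTelemetry {
  incidentKey: string;
  streamId: string;
  msSinceTurnStart: number;
  banner: string;
  partialPartCount: number;
}

// Builds a telemetry record directly from the context, with no re-derivation from message IDs.
function buildExhaustedTelemetry(
  ctx: ExhaustedContextSketch,
  now: number
): ExhaustedTelemetry {
  return {
    incidentKey: ctx.recoveryRootRequestId,
    streamId: ctx.streamId,
    msSinceTurnStart: now - ctx.createdAt,
    banner: ctx.terminalMessage,
    partialPartCount: ctx.partialParts.length,
  };
}
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the sketch deterministic and easy to test.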

Base automatically changed from fix/chat-recovery-progress-monotonic to main June 1, 2026 01:54

threepointone force-pushed the fix/chat-recovery-preserve-settled-work branch from 1893c2a to a4f8698 Compare June 1, 2026 13:53

threepointone force-pushed the fix/chat-recovery-preserve-settled-work branch from a4f8698 to f4bfb14 Compare June 1, 2026 14:14

threepointone merged commit a4225fd into main Jun 1, 2026
4 of 5 checks passed

threepointone deleted the fix/chat-recovery-preserve-settled-work branch June 1, 2026 14:28

threepointone mentioned this pull request Jun 1, 2026

refactor: unify duplicated chat-recovery/repair machinery into the shared agents/chat layer (N3) #1642

Open

threepointone merged 1 commit into main from fix/chat-recovery-preserve-settled-work
