fix(think,ai-chat): make chat-recovery progress signal compaction-immune (#1628) by threepointone · Pull Request #1633 · cloudflare/agents

threepointone · 2026-05-31T21:50:54Z

Stack

This is PR 1 of 4 addressing #1631 (recovery/repair ergonomics) and #1628. The series is stacked; review/merge bottom-up:

➡️ bug: chat-recovery progress marker is count-based, so compaction can cause premature budget exhaustion #1628 — progress signal compaction-immune (this PR) · patch
preserve settled work when a turn is given up · patch
repairInterruptedToolPart hook · minor
incident identity + onExhausted payload · minor

Each PR is independently reviewable; this one has no dependency on the others (it's the base of the stack). Targets main.

Problem (#1628)

Bounded chat recovery (#1623) resets the per-incident retry budget when an interrupted turn is making forward progress, so that a turn repeatedly interrupted by deploy churn isn't falsely treated as a poison turn and sealed at maxAttempts. That progress signal was recomputed from the live transcript by counting assistant messages:

// packages/think/src/think.ts (and the verbatim mirror in packages/ai-chat/src/index.ts)
private _chatRecoveryProgressMarker(): number {
  return this.messages.reduce(
    (count, message) => (message.role === "assistant" ? count + 1 : count),
    0
  );
}
// ...in _beginChatRecoveryIncident:
const prevProgress = existing?.progress ?? 0;
const currentProgress = this._chatRecoveryProgressMarker();
const madeProgress = existing != null && currentProgress > prevProgress;

this.messages is the live, mutable transcript. Compaction collapses older assistant messages into a summary, lowering the count. So a turn that genuinely advanced between two recovery attempts can read as currentProgress < prevProgress → madeProgress = false → the budget is not reset → the turn exhausts at maxAttempts and is sealed, even though it was healthy. The stored progress: Math.max(prevProgress, currentProgress) only protects the persisted high-water; the comparison still uses the live (compacted) recompute, so it doesn't help.

This is the data-loss path a tool-heavy agent under continuous deploys actually hits.

Fix

Track progress with a durable, monotonic counter that compaction can never lower, instead of recomputing from the transcript:

New storage key cf:chat-recovery:progress.
Bumped exactly once when _persistOrphanedStream materializes a non-empty partial assistant message — which is precisely the "forward progress" event the old assistant-message count was a proxy for.
_chatRecoveryProgressMarker() now reads that counter; _beginChatRecoveryIncident awaits it.

Because it only ever increments, compaction is irrelevant. Semantics are otherwise preserved:

Healthy turn under churn: each interruption persists a non-empty partial → counter grows → madeProgress true → budget resets. ✅
Poison turn (no chunks): _persistOrphanedStream early-returns on empty chunks → counter flat → budget counts toward the cap → exhausts. ✅
Genuinely-stuck-but-producing turn: still ultimately bounded by the unchanged 15-minute CHAT_RECOVERY_MAX_WINDOW_MS wall-clock ceiling.

Timing (why the counter is read at incident-begin and bumped at persist)

Per interruption: _beginChatRecoveryIncident reads the counter and compares against the stored high-water → then (if not exhausted) _persistOrphanedStream bumps it. So attempt N's bump is observed at attempt N+1's comparison, exactly mirroring how the old assistant-message count grew after each persist.

Back-compat / migration

A session interrupted mid-incident before this counter existed has no cf:chat-recovery:progress key → it reads 0. For a new incident that's fine (no comparison on first detection). For an in-flight incident at upgrade time the comparison is conservative (treats it as no-progress for that one alarm), which is bounded and harmless. No persisted-data migration required.

Bonus

@cloudflare/ai-chat's _persistOrphanedStream merges a continuation partial into the existing assistant message in place, so its old marker barely grew across continuations (only the first persist added a message) — it was already under-counting progress. The monotonic counter fixes that too and brings ai-chat in line with think.

Files

packages/think/src/think.ts — counter constant, _chatRecoveryProgressMarker (now async, reads storage), new _bumpChatRecoveryProgress, bump in _persistOrphanedStream, await in _beginChatRecoveryIncident, updated ChatRecoveryIncident.progress doc.
packages/ai-chat/src/index.ts — same changes; _persistOrphanedStream made async (callers updated to await).
Tests + harness helpers (bumpRecoveryProgressForTest, dropAssistantMessagesForTest).
.changeset/chat-recovery-progress-monotonic.md (patch · think + ai-chat).

Tests

New regression test in both packages — "detects forward progress even after compaction collapses the transcript (#1628)": open an incident, make progress (bump the counter), then drop every assistant message from the live cache (simulating compaction), and assert the next alarm still resets the budget. The old message-count marker fails this; the durable counter passes. Existing progress-reset and wall-clock-ceiling tests were re-pointed at the counter and still pass.

Verification

tsc --noEmit clean for @cloudflare/think and @cloudflare/ai-chat.
oxlint clean; oxfmt applied.
Full suites green: think 460 passed (15 files), ai-chat 481 passed (42 files).

Test plan

CI green (npm run build && npm run check && nx affected -t test)
Confirm the compaction regression test fails on main and passes here
Sanity-check the back-compat note for in-flight incidents at deploy time

Made with Cursor

…une (#1628) The deploy-churn forward-progress signal that resets the recovery budget was recomputed from the live transcript by counting assistant messages. Compaction collapses older assistant messages into a summary, lowering that count, so a turn that had genuinely advanced could read as "no progress" between recovery attempts and exhaust at maxAttempts — sealing a healthy turn. Track progress with a durable, monotonic counter (cf:chat-recovery:progress) incremented when _persistOrphanedStream materializes a non-empty partial — the exact event the message count was proxying for — so compaction can never lower it. A turn that genuinely fails to advance still exhausts at the cap, and the 15-minute wall-clock ceiling is unchanged. Applied identically to think and ai-chat (the recovery code is duplicated verbatim). Co-authored-by: Cursor <cursoragent@cursor.com>

changeset-bot · 2026-05-31T21:50:59Z

🦋 Changeset detected

Latest commit: 5cde588

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages

Name	Type
@cloudflare/think	Patch
@cloudflare/ai-chat	Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

threepointone · 2026-06-01T00:42:47Z

/bonk do a deep review on this pr, look for edge cases, how's the ux on this

pkg-pr-new · 2026-06-01T00:46:52Z

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1633

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1633

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1633

hono-agents

npm i https://pkg.pr.new/hono-agents@1633

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1633

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1633

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1633

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1633

commit: ce215d1

ask-bonk · 2026-06-01T00:48:14Z

UnknownError: ProviderInitError

github run

ask-bonk · 2026-06-01T00:48:16Z

@threepointone Bonk workflow failed. Check the logs for details.

View workflow run · To retry, trigger Bonk again.

The forward-progress bump in _persistOrphanedStream awaited _bumpChatRecoveryProgress but fired persistMessages without awaiting it. persistMessages is async; today its body is synchronous so ordering holds, but the bump should only advance after persistence settles. This aligns ai-chat with think's _persistOrphanedStream, which awaits its persist before bumping (#1628). Co-authored-by: Cursor <cursoragent@cursor.com>

…en up (#1631) Two paths could throw away a partial assistant message holding completed, often non-idempotent tool results — the most valuable, least-reproducible state in a turn: 1. Framework budget exhaustion sealed the turn (terminal status + banner) BEFORE the orphaned stream was persisted, so settled tool results were discarded and re-run on the next message. Exhaustion now persists the settled partial first, reusing the normal path's gating so it can't duplicate an already-saved partial. (This now also covers N1/#1638's wall-clock and no-progress exhaustion paths, not just the attempt cap.) 2. A subclass onChatRecovery returning { persist: false } silently dropped the settled partial. Settled work is now NEVER dropped: { persist: false } only suppresses persistence of a partial that has nothing settled to lose; a partial carrying settled tool results is persisted regardless. An app can no longer accidentally discard completed work, and never needs { persist: true } just to stay safe. A safe default beats a warning about an unsafe one (R1). Adds _shouldPersistOrphanedPartial / _partialHasSettledToolResults helpers. Applied identically to @cloudflare/think and @cloudflare/ai-chat. Rebased onto main (dropping the already-merged #1633 commit) and adapted the exhaustion test to N1/#1638's alarm-debounce (age the seeded incident past the debounce window so the wake counts as a genuine attempt and exhausts). Co-authored-by: Cursor <cursoragent@cursor.com>

…en up (#1631) Two paths could throw away a partial assistant message holding completed, often non-idempotent tool results — the most valuable, least-reproducible state in a turn: 1. Framework budget exhaustion sealed the turn (terminal status + banner) BEFORE the orphaned stream was persisted, so settled tool results were discarded and re-run on the next message. Exhaustion now persists the settled partial first, reusing the normal path's gating so it can't duplicate an already-saved partial. (This now also covers N1/#1638's wall-clock and no-progress exhaustion paths, not just the attempt cap.) 2. A subclass onChatRecovery returning { persist: false } silently dropped the settled partial. Settled work is now NEVER dropped: { persist: false } only suppresses persistence of a partial that has nothing settled to lose; a partial carrying settled tool results is persisted regardless. An app can no longer accidentally discard completed work, and never needs { persist: true } just to stay safe. A safe default beats a warning about an unsafe one (R1). Adds _shouldPersistOrphanedPartial / _partialHasSettledToolResults helpers. Applied identically to @cloudflare/think and @cloudflare/ai-chat. Tests: - Unit (think + ai-chat): exhaustion preserves a settled TOOL RESULT (not just text); { persist: false } never drops settled tool results, and is honored for a text-only partial. (Exhaustion test aged past N1/#1638's alarm-debounce so the wake counts as a genuine attempt and actually exhausts.) - E2E (think, real SIGKILL): a recordStep loop whose onChatRecovery returns { persist: false, continue: false } is killed mid-turn; after recovery the settled tool results produced before the kill are still in the durable transcript (R1) and the turn does not continue. (ThinkPersistFalseE2EAgent.) NOTE on coverage: the EXHAUSTION-with-prior-settled-work path stays unit-tested (not e2e) on purpose — under N1's budget, settling a tool result IS forward progress that resets the budget and prevents exhaustion, so that scenario can't be forced deterministically under real churn. Rebased onto main (dropping the already-merged #1633 commit). Co-authored-by: Cursor <cursoragent@cursor.com>

…en up (#1631) (#1634) Two paths could throw away a partial assistant message holding completed, often non-idempotent tool results — the most valuable, least-reproducible state in a turn: 1. Framework budget exhaustion sealed the turn (terminal status + banner) BEFORE the orphaned stream was persisted, so settled tool results were discarded and re-run on the next message. Exhaustion now persists the settled partial first, reusing the normal path's gating so it can't duplicate an already-saved partial. (This now also covers N1/#1638's wall-clock and no-progress exhaustion paths, not just the attempt cap.) 2. A subclass onChatRecovery returning { persist: false } silently dropped the settled partial. Settled work is now NEVER dropped: { persist: false } only suppresses persistence of a partial that has nothing settled to lose; a partial carrying settled tool results is persisted regardless. An app can no longer accidentally discard completed work, and never needs { persist: true } just to stay safe. A safe default beats a warning about an unsafe one (R1). Adds _shouldPersistOrphanedPartial / _partialHasSettledToolResults helpers. Applied identically to @cloudflare/think and @cloudflare/ai-chat. Tests: - Unit (think + ai-chat): exhaustion preserves a settled TOOL RESULT (not just text); { persist: false } never drops settled tool results, and is honored for a text-only partial. (Exhaustion test aged past N1/#1638's alarm-debounce so the wake counts as a genuine attempt and actually exhausts.) - E2E (think, real SIGKILL): a recordStep loop whose onChatRecovery returns { persist: false, continue: false } is killed mid-turn; after recovery the settled tool results produced before the kill are still in the durable transcript (R1) and the turn does not continue. (ThinkPersistFalseE2EAgent.) NOTE on coverage: the EXHAUSTION-with-prior-settled-work path stays unit-tested (not e2e) on purpose — under N1's budget, settling a tool result IS forward progress that resets the budget and prevents exhaustion, so that scenario can't be forced deterministically under real churn. Rebased onto main (dropping the already-merged #1633 commit). Co-authored-by: Cursor <cursoragent@cursor.com>

…sted payload (#1631) Lets products build a terminal-state policy without re-deriving anything: - ChatRecoveryContext (onChatRecovery) gains recoveryRootRequestId — the stable request id for the whole continuation chain, the right key for per-incident budget tracking / fresh-incident detection (no re-deriving from message IDs). - ChatRecoveryExhaustedContext (onExhausted) gains recoveryRootRequestId, terminalMessage (exact user-facing text), partialText/partialParts (what the turn produced before it was given up on), and streamId/createdAt — enough to render/persist a terminal banner AND emit correlated terminal telemetry (msSinceTurnStart, stream correlation) directly. streamId + createdAt were added after verifying the payload against the actual consumer (g3's _emitExhaustedRecovery): it reads both from the recovery context for telemetry, and they already exist on ChatRecoveryContext (the Pick source), so adding them to the exhausted context is additive and unblocks re-homing the exhaustion handler onto onExhausted with zero re-derivation (D4). Shared types in `agents`; wired through think + ai-chat (_exhaustChatRecovery now receives streamId + createdAt). Test agents capture the exhausted context; tests assert both contexts (incl. streamId + createdAt) in both packages. Rebased onto main (dropping the merged #1633/#1634/#1635 commits); adapted the exhausted-ctx test to N1/#1638's alarm-debounce and gave the think harness an explicit return shape (the context's MessagePart[] over-instantiates the RPC stub type). Co-authored-by: Cursor <cursoragent@cursor.com>

…sted payload (#1631) (#1636) Lets products build a terminal-state policy without re-deriving anything: - ChatRecoveryContext (onChatRecovery) gains recoveryRootRequestId — the stable request id for the whole continuation chain, the right key for per-incident budget tracking / fresh-incident detection (no re-deriving from message IDs). - ChatRecoveryExhaustedContext (onExhausted) gains recoveryRootRequestId, terminalMessage (exact user-facing text), partialText/partialParts (what the turn produced before it was given up on), and streamId/createdAt — enough to render/persist a terminal banner AND emit correlated terminal telemetry (msSinceTurnStart, stream correlation) directly. streamId + createdAt were added after verifying the payload against the actual consumer (g3's _emitExhaustedRecovery): it reads both from the recovery context for telemetry, and they already exist on ChatRecoveryContext (the Pick source), so adding them to the exhausted context is additive and unblocks re-homing the exhaustion handler onto onExhausted with zero re-derivation (D4). Shared types in `agents`; wired through think + ai-chat (_exhaustChatRecovery now receives streamId + createdAt). Test agents capture the exhausted context; tests assert both contexts (incl. streamId + createdAt) in both packages. Rebased onto main (dropping the merged #1633/#1634/#1635 commits); adapted the exhausted-ctx test to N1/#1638's alarm-debounce and gave the think harness an explicit return shape (the context's MessagePart[] over-instantiates the RPC stub type). Co-authored-by: Cursor <cursoragent@cursor.com>

This was referenced Jun 1, 2026

recovery budget should be wall-clock-keyed-to-progress + alarm-debounced, not raw attempt count (deploy churn) #1637

Closed

fix(think,ai-chat): wall-clock-keyed-to-progress recovery budget + alarm debounce (#1637) #1638

Merged

threepointone merged commit 1aca578 into main Jun 1, 2026
3 checks passed

threepointone deleted the fix/chat-recovery-progress-monotonic branch June 1, 2026 01:54

github-actions Bot mentioned this pull request Jun 1, 2026

Version Packages #1597

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(think,ai-chat): make chat-recovery progress signal compaction-immune (#1628)#1633

fix(think,ai-chat): make chat-recovery progress signal compaction-immune (#1628)#1633
threepointone merged 2 commits into
mainfrom
fix/chat-recovery-progress-monotonic

threepointone commented May 31, 2026

Uh oh!

changeset-bot Bot commented May 31, 2026 •

edited

Loading

Uh oh!

threepointone commented Jun 1, 2026

Uh oh!

pkg-pr-new Bot commented Jun 1, 2026

Uh oh!

ask-bonk Bot commented Jun 1, 2026

Uh oh!

ask-bonk Bot commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

threepointone commented May 31, 2026

Stack

Problem (#1628)

Fix

Timing (why the counter is read at incident-begin and bumped at persist)

Back-compat / migration

Bonus

Files

Tests

Verification

Test plan

Uh oh!

changeset-bot Bot commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🦋 Changeset detected

Uh oh!

threepointone commented Jun 1, 2026

Uh oh!

pkg-pr-new Bot commented Jun 1, 2026

Uh oh!

ask-bonk Bot commented Jun 1, 2026

Uh oh!

ask-bonk Bot commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

changeset-bot Bot commented May 31, 2026 •

edited

Loading