Skip to content

fix(think,ai-chat): make chat-recovery progress signal compaction-immune (#1628)#1633

Merged
threepointone merged 2 commits into
mainfrom
fix/chat-recovery-progress-monotonic
Jun 1, 2026
Merged

fix(think,ai-chat): make chat-recovery progress signal compaction-immune (#1628)#1633
threepointone merged 2 commits into
mainfrom
fix/chat-recovery-progress-monotonic

Conversation

@threepointone
Copy link
Copy Markdown
Contributor

Stack

This is PR 1 of 4 addressing #1631 (recovery/repair ergonomics) and #1628. The series is stacked; review/merge bottom-up:

  1. ➡️ bug: chat-recovery progress marker is count-based, so compaction can cause premature budget exhaustion #1628 — progress signal compaction-immune (this PR) · patch
  2. preserve settled work when a turn is given up · patch
  3. repairInterruptedToolPart hook · minor
  4. incident identity + onExhausted payload · minor

Each PR is independently reviewable; this one has no dependency on the others (it's the base of the stack). Targets main.


Problem (#1628)

Bounded chat recovery (#1623) resets the per-incident retry budget when an interrupted turn is making forward progress, so that a turn repeatedly interrupted by deploy churn isn't falsely treated as a poison turn and sealed at maxAttempts. That progress signal was recomputed from the live transcript by counting assistant messages:

// packages/think/src/think.ts (and the verbatim mirror in packages/ai-chat/src/index.ts)
private _chatRecoveryProgressMarker(): number {
  return this.messages.reduce(
    (count, message) => (message.role === "assistant" ? count + 1 : count),
    0
  );
}
// ...in _beginChatRecoveryIncident:
const prevProgress = existing?.progress ?? 0;
const currentProgress = this._chatRecoveryProgressMarker();
const madeProgress = existing != null && currentProgress > prevProgress;

this.messages is the live, mutable transcript. Compaction collapses older assistant messages into a summary, lowering the count. So a turn that genuinely advanced between two recovery attempts can read as currentProgress < prevProgressmadeProgress = false → the budget is not reset → the turn exhausts at maxAttempts and is sealed, even though it was healthy. The stored progress: Math.max(prevProgress, currentProgress) only protects the persisted high-water; the comparison still uses the live (compacted) recompute, so it doesn't help.

This is the data-loss path a tool-heavy agent under continuous deploys actually hits.

Fix

Track progress with a durable, monotonic counter that compaction can never lower, instead of recomputing from the transcript:

  • New storage key cf:chat-recovery:progress.
  • Bumped exactly once when _persistOrphanedStream materializes a non-empty partial assistant message — which is precisely the "forward progress" event the old assistant-message count was a proxy for.
  • _chatRecoveryProgressMarker() now reads that counter; _beginChatRecoveryIncident awaits it.

Because it only ever increments, compaction is irrelevant. Semantics are otherwise preserved:

  • Healthy turn under churn: each interruption persists a non-empty partial → counter grows → madeProgress true → budget resets. ✅
  • Poison turn (no chunks): _persistOrphanedStream early-returns on empty chunks → counter flat → budget counts toward the cap → exhausts. ✅
  • Genuinely-stuck-but-producing turn: still ultimately bounded by the unchanged 15-minute CHAT_RECOVERY_MAX_WINDOW_MS wall-clock ceiling.

Timing (why the counter is read at incident-begin and bumped at persist)

Per interruption: _beginChatRecoveryIncident reads the counter and compares against the stored high-water → then (if not exhausted) _persistOrphanedStream bumps it. So attempt N's bump is observed at attempt N+1's comparison, exactly mirroring how the old assistant-message count grew after each persist.

Back-compat / migration

A session interrupted mid-incident before this counter existed has no cf:chat-recovery:progress key → it reads 0. For a new incident that's fine (no comparison on first detection). For an in-flight incident at upgrade time the comparison is conservative (treats it as no-progress for that one alarm), which is bounded and harmless. No persisted-data migration required.

Bonus

@cloudflare/ai-chat's _persistOrphanedStream merges a continuation partial into the existing assistant message in place, so its old marker barely grew across continuations (only the first persist added a message) — it was already under-counting progress. The monotonic counter fixes that too and brings ai-chat in line with think.

Files

  • packages/think/src/think.ts — counter constant, _chatRecoveryProgressMarker (now async, reads storage), new _bumpChatRecoveryProgress, bump in _persistOrphanedStream, await in _beginChatRecoveryIncident, updated ChatRecoveryIncident.progress doc.
  • packages/ai-chat/src/index.ts — same changes; _persistOrphanedStream made async (callers updated to await).
  • Tests + harness helpers (bumpRecoveryProgressForTest, dropAssistantMessagesForTest).
  • .changeset/chat-recovery-progress-monotonic.md (patch · think + ai-chat).

Tests

New regression test in both packages — "detects forward progress even after compaction collapses the transcript (#1628)": open an incident, make progress (bump the counter), then drop every assistant message from the live cache (simulating compaction), and assert the next alarm still resets the budget. The old message-count marker fails this; the durable counter passes. Existing progress-reset and wall-clock-ceiling tests were re-pointed at the counter and still pass.

Verification

  • tsc --noEmit clean for @cloudflare/think and @cloudflare/ai-chat.
  • oxlint clean; oxfmt applied.
  • Full suites green: think 460 passed (15 files), ai-chat 481 passed (42 files).

Test plan

  • CI green (npm run build && npm run check && nx affected -t test)
  • Confirm the compaction regression test fails on main and passes here
  • Sanity-check the back-compat note for in-flight incidents at deploy time

Made with Cursor

…une (#1628)

The deploy-churn forward-progress signal that resets the recovery budget was
recomputed from the live transcript by counting assistant messages. Compaction
collapses older assistant messages into a summary, lowering that count, so a
turn that had genuinely advanced could read as "no progress" between recovery
attempts and exhaust at maxAttempts — sealing a healthy turn.

Track progress with a durable, monotonic counter (cf:chat-recovery:progress)
incremented when _persistOrphanedStream materializes a non-empty partial — the
exact event the message count was proxying for — so compaction can never lower
it. A turn that genuinely fails to advance still exhausts at the cap, and the
15-minute wall-clock ceiling is unchanged. Applied identically to think and
ai-chat (the recovery code is duplicated verbatim).

Co-authored-by: Cursor <cursoragent@cursor.com>
@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 31, 2026

🦋 Changeset detected

Latest commit: 5cde588

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 2 packages
Name Type
@cloudflare/think Patch
@cloudflare/ai-chat Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@threepointone
Copy link
Copy Markdown
Contributor Author

/bonk do a deep review on this pr, look for edge cases, how's the ux on this

@pkg-pr-new
Copy link
Copy Markdown

pkg-pr-new Bot commented Jun 1, 2026

Open in StackBlitz

agents

npm i https://pkg.pr.new/agents@1633

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1633

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1633

hono-agents

npm i https://pkg.pr.new/hono-agents@1633

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1633

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1633

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1633

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1633

commit: ce215d1

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Jun 1, 2026

UnknownError: ProviderInitError

github run

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented Jun 1, 2026

@threepointone Bonk workflow failed. Check the logs for details.

View workflow run · To retry, trigger Bonk again.

The forward-progress bump in _persistOrphanedStream awaited
_bumpChatRecoveryProgress but fired persistMessages without awaiting it.
persistMessages is async; today its body is synchronous so ordering holds,
but the bump should only advance after persistence settles. This aligns
ai-chat with think's _persistOrphanedStream, which awaits its persist
before bumping (#1628).

Co-authored-by: Cursor <cursoragent@cursor.com>
@threepointone threepointone merged commit 1aca578 into main Jun 1, 2026
3 checks passed
@threepointone threepointone deleted the fix/chat-recovery-progress-monotonic branch June 1, 2026 01:54
@github-actions github-actions Bot mentioned this pull request Jun 1, 2026
threepointone added a commit that referenced this pull request Jun 1, 2026
…en up (#1631)

Two paths could throw away a partial assistant message holding completed,
often non-idempotent tool results — the most valuable, least-reproducible
state in a turn:

1. Framework budget exhaustion sealed the turn (terminal status + banner)
   BEFORE the orphaned stream was persisted, so settled tool results were
   discarded and re-run on the next message. Exhaustion now persists the
   settled partial first, reusing the normal path's gating so it can't
   duplicate an already-saved partial. (This now also covers N1/#1638's
   wall-clock and no-progress exhaustion paths, not just the attempt cap.)

2. A subclass onChatRecovery returning { persist: false } silently dropped
   the settled partial. Settled work is now NEVER dropped: { persist: false }
   only suppresses persistence of a partial that has nothing settled to lose;
   a partial carrying settled tool results is persisted regardless. An app can
   no longer accidentally discard completed work, and never needs
   { persist: true } just to stay safe. A safe default beats a warning about
   an unsafe one (R1).

Adds _shouldPersistOrphanedPartial / _partialHasSettledToolResults helpers.
Applied identically to @cloudflare/think and @cloudflare/ai-chat.

Rebased onto main (dropping the already-merged #1633 commit) and adapted the
exhaustion test to N1/#1638's alarm-debounce (age the seeded incident past the
debounce window so the wake counts as a genuine attempt and exhausts).

Co-authored-by: Cursor <cursoragent@cursor.com>
threepointone added a commit that referenced this pull request Jun 1, 2026
…en up (#1631)

Two paths could throw away a partial assistant message holding completed,
often non-idempotent tool results — the most valuable, least-reproducible
state in a turn:

1. Framework budget exhaustion sealed the turn (terminal status + banner)
   BEFORE the orphaned stream was persisted, so settled tool results were
   discarded and re-run on the next message. Exhaustion now persists the
   settled partial first, reusing the normal path's gating so it can't
   duplicate an already-saved partial. (This now also covers N1/#1638's
   wall-clock and no-progress exhaustion paths, not just the attempt cap.)

2. A subclass onChatRecovery returning { persist: false } silently dropped
   the settled partial. Settled work is now NEVER dropped: { persist: false }
   only suppresses persistence of a partial that has nothing settled to lose;
   a partial carrying settled tool results is persisted regardless. An app can
   no longer accidentally discard completed work, and never needs
   { persist: true } just to stay safe. A safe default beats a warning about
   an unsafe one (R1).

Adds _shouldPersistOrphanedPartial / _partialHasSettledToolResults helpers.
Applied identically to @cloudflare/think and @cloudflare/ai-chat.

Tests:
- Unit (think + ai-chat): exhaustion preserves a settled TOOL RESULT (not just
  text); { persist: false } never drops settled tool results, and is honored
  for a text-only partial. (Exhaustion test aged past N1/#1638's alarm-debounce
  so the wake counts as a genuine attempt and actually exhausts.)
- E2E (think, real SIGKILL): a recordStep loop whose onChatRecovery returns
  { persist: false, continue: false } is killed mid-turn; after recovery the
  settled tool results produced before the kill are still in the durable
  transcript (R1) and the turn does not continue. (ThinkPersistFalseE2EAgent.)

NOTE on coverage: the EXHAUSTION-with-prior-settled-work path stays unit-tested
(not e2e) on purpose — under N1's budget, settling a tool result IS forward
progress that resets the budget and prevents exhaustion, so that scenario can't
be forced deterministically under real churn.

Rebased onto main (dropping the already-merged #1633 commit).

Co-authored-by: Cursor <cursoragent@cursor.com>
threepointone added a commit that referenced this pull request Jun 1, 2026
…en up (#1631) (#1634)

Two paths could throw away a partial assistant message holding completed,
often non-idempotent tool results — the most valuable, least-reproducible
state in a turn:

1. Framework budget exhaustion sealed the turn (terminal status + banner)
   BEFORE the orphaned stream was persisted, so settled tool results were
   discarded and re-run on the next message. Exhaustion now persists the
   settled partial first, reusing the normal path's gating so it can't
   duplicate an already-saved partial. (This now also covers N1/#1638's
   wall-clock and no-progress exhaustion paths, not just the attempt cap.)

2. A subclass onChatRecovery returning { persist: false } silently dropped
   the settled partial. Settled work is now NEVER dropped: { persist: false }
   only suppresses persistence of a partial that has nothing settled to lose;
   a partial carrying settled tool results is persisted regardless. An app can
   no longer accidentally discard completed work, and never needs
   { persist: true } just to stay safe. A safe default beats a warning about
   an unsafe one (R1).

Adds _shouldPersistOrphanedPartial / _partialHasSettledToolResults helpers.
Applied identically to @cloudflare/think and @cloudflare/ai-chat.

Tests:
- Unit (think + ai-chat): exhaustion preserves a settled TOOL RESULT (not just
  text); { persist: false } never drops settled tool results, and is honored
  for a text-only partial. (Exhaustion test aged past N1/#1638's alarm-debounce
  so the wake counts as a genuine attempt and actually exhausts.)
- E2E (think, real SIGKILL): a recordStep loop whose onChatRecovery returns
  { persist: false, continue: false } is killed mid-turn; after recovery the
  settled tool results produced before the kill are still in the durable
  transcript (R1) and the turn does not continue. (ThinkPersistFalseE2EAgent.)

NOTE on coverage: the EXHAUSTION-with-prior-settled-work path stays unit-tested
(not e2e) on purpose — under N1's budget, settling a tool result IS forward
progress that resets the budget and prevents exhaustion, so that scenario can't
be forced deterministically under real churn.

Rebased onto main (dropping the already-merged #1633 commit).

Co-authored-by: Cursor <cursoragent@cursor.com>
threepointone added a commit that referenced this pull request Jun 1, 2026
…sted payload (#1631)

Lets products build a terminal-state policy without re-deriving anything:

- ChatRecoveryContext (onChatRecovery) gains recoveryRootRequestId — the stable
  request id for the whole continuation chain, the right key for per-incident
  budget tracking / fresh-incident detection (no re-deriving from message IDs).
- ChatRecoveryExhaustedContext (onExhausted) gains recoveryRootRequestId,
  terminalMessage (exact user-facing text), partialText/partialParts (what the
  turn produced before it was given up on), and streamId/createdAt — enough to
  render/persist a terminal banner AND emit correlated terminal telemetry
  (msSinceTurnStart, stream correlation) directly.

streamId + createdAt were added after verifying the payload against the actual
consumer (g3's _emitExhaustedRecovery): it reads both from the recovery context
for telemetry, and they already exist on ChatRecoveryContext (the Pick source),
so adding them to the exhausted context is additive and unblocks re-homing the
exhaustion handler onto onExhausted with zero re-derivation (D4).

Shared types in `agents`; wired through think + ai-chat (_exhaustChatRecovery
now receives streamId + createdAt). Test agents capture the exhausted context;
tests assert both contexts (incl. streamId + createdAt) in both packages.

Rebased onto main (dropping the merged #1633/#1634/#1635 commits); adapted the
exhausted-ctx test to N1/#1638's alarm-debounce and gave the think harness an
explicit return shape (the context's MessagePart[] over-instantiates the RPC
stub type).

Co-authored-by: Cursor <cursoragent@cursor.com>
threepointone added a commit that referenced this pull request Jun 1, 2026
…sted payload (#1631) (#1636)

Lets products build a terminal-state policy without re-deriving anything:

- ChatRecoveryContext (onChatRecovery) gains recoveryRootRequestId — the stable
  request id for the whole continuation chain, the right key for per-incident
  budget tracking / fresh-incident detection (no re-deriving from message IDs).
- ChatRecoveryExhaustedContext (onExhausted) gains recoveryRootRequestId,
  terminalMessage (exact user-facing text), partialText/partialParts (what the
  turn produced before it was given up on), and streamId/createdAt — enough to
  render/persist a terminal banner AND emit correlated terminal telemetry
  (msSinceTurnStart, stream correlation) directly.

streamId + createdAt were added after verifying the payload against the actual
consumer (g3's _emitExhaustedRecovery): it reads both from the recovery context
for telemetry, and they already exist on ChatRecoveryContext (the Pick source),
so adding them to the exhausted context is additive and unblocks re-homing the
exhaustion handler onto onExhausted with zero re-derivation (D4).

Shared types in `agents`; wired through think + ai-chat (_exhaustChatRecovery
now receives streamId + createdAt). Test agents capture the exhausted context;
tests assert both contexts (incl. streamId + createdAt) in both packages.

Rebased onto main (dropping the merged #1633/#1634/#1635 commits); adapted the
exhausted-ctx test to N1/#1638's alarm-debounce and gave the think harness an
explicit return shape (the context's MessagePart[] over-instantiates the RPC
stub type).

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant