Fix second deploy-churn recovery abandonment: defer one-shot callbacks on a code-update DO reset by threepointone · Pull Request #1617 · cloudflare/agents

threepointone · 2026-05-29T21:41:17Z

Summary

Follow-up to #1615. There are two distinct deploy-churn abandonment paths for durable chat recovery, at two layers. #1615 closed the first (the progress-blind attempt budget in _beginChatRecoveryIncident). This PR closes the second, which lives one layer down in the agents base class and #1615 cannot reach.

When the scheduled recovery continuation alarm fires on an isolate a deploy has just superseded, the first ctx.storage op throws the catchable `Durable Object reset because its code was updated.` error, and that error persists for the entire invocation (code never reloads mid-invocation). Agent._executeScheduleCallback then:

1. burns its whole in-process tryN retry budget, although every attempt is doomed for the same reason; then
2. swallows the error and returns normally, so alarm() marks the one-shot row executed and deletes it.

The orphaned cf_agents_runs fiber row was already deleted the moment recovery was "handled" (it scheduled the continuation), so the turn rode solely on that one-shot schedule row. Deleting it leaves nothing for the boot-time fiber scan to re-detect — _beginChatRecoveryIncident (and #1615's progress logic) never runs again, and the turn is permanently abandoned, even though the next fresh invocation would succeed. A customer observed this firing ~35× over ~19 mid-turn redeploys.
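The abandonment sequence described above can be traced with a small in-memory model. This is only a sketch under stated assumptions: the class, its fields, and the retry count of 3 are hypothetical stand-ins, not the real Agent internals, and the arrays merely represent the cf_agents_runs and schedule tables.

```typescript
// Simplified model of the PRE-FIX behavior. All names here are hypothetical;
// only the error message and the order of events come from the PR description.
type Row = { id: string; callback: string };

const RESET_MESSAGE = "Durable Object reset because its code was updated.";

class PreFixAgentModel {
  fiberRows: Row[] = []; // stands in for cf_agents_runs
  scheduleRows: Row[] = []; // stands in for one-shot schedule rows
  supersededIsolate = false;
  attempts = 0;

  // Recovery is "handled": schedule the continuation, delete the fiber row.
  handleRecovery(id: string): void {
    this.scheduleRows.push({ id, callback: "_chatRecoveryContinue" });
    this.fiberRows = this.fiberRows.filter((r) => r.id !== id);
  }

  // On a superseded isolate every attempt fails the same way.
  private runCallback(): void {
    this.attempts++;
    if (this.supersededIsolate) throw new Error(RESET_MESSAGE);
  }

  // Pre-fix: exhaust the retry budget (assumed to be 3), then swallow.
  private executeScheduleCallback(maxAttempts = 3): void {
    for (let i = 0; i < maxAttempts; i++) {
      try {
        this.runCallback();
        return;
      } catch {
        // doomed retry: the code cannot reload mid-invocation
      }
    }
    // error swallowed: the caller sees a normal return
  }

  // A normal return means the one-shot row is treated as executed and deleted.
  alarm(): void {
    for (const row of [...this.scheduleRows]) {
      this.executeScheduleCallback();
      this.scheduleRows = this.scheduleRows.filter((r) => r.id !== row.id);
    }
  }

  recoveryVehicles(): number {
    return this.fiberRows.length + this.scheduleRows.length;
  }
}

const agent = new PreFixAgentModel();
agent.fiberRows.push({ id: "turn-1", callback: "chat" });
agent.handleRecovery("turn-1"); // fiber row gone, schedule row present
agent.supersededIsolate = true; // deploy lands before the alarm fires
agent.alarm();
console.log(agent.recoveryVehicles()); // 0
```

In this model the count of surviving recovery vehicles drops to zero after a single alarm on the superseded isolate, which is the state the reproduction test asserts against.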

The fix

In Agent._executeScheduleCallback, for a one-shot row (delayed/scheduled) failing with the code-update reset transient:

- skip the doomed in-process retries (shouldRetry), and
- re-throw instead of swallowing, so alarm() rejects → the one-shot row is not deleted → Cloudflare re-runs the alarm on a fresh isolate (= new code) under the at-least-once alarm guarantee. The interrupted turn auto-resumes once the deploy storm settles, with no new user message.

Every other callback and error class keeps the existing swallow-and-exhaust behavior. Interval/cron schedules are untouched (they are not deleted on execution, so the bug does not apply).
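A minimal sketch of the decision logic described above, assuming a synchronous simplification. The function names, the `tryN` signature, and the default of 3 attempts are hypothetical and chosen for illustration; the real `_executeScheduleCallback` is asynchronous and its exact shape is not shown in this PR description.

```typescript
// Schedule types named in the PR: delayed/scheduled are one-shot,
// interval/cron are not deleted on execution.
type ScheduleType = "delayed" | "scheduled" | "interval" | "cron";

// Regex quoted in the PR description.
const CODE_UPDATE_RESET = /reset because its code was updated/i;

function isCodeUpdateReset(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return CODE_UPDATE_RESET.test(message);
}

// Hypothetical stand-in for the retry helper: retries until success,
// until attempts run out, or until shouldRetry says to stop.
function tryN<T>(
  attempts: number,
  fn: () => T,
  shouldRetry: (error: unknown) => boolean
): T {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return fn();
    } catch (error) {
      lastError = error;
      if (!shouldRetry(error)) break;
    }
  }
  throw lastError;
}

function executeScheduleCallback(
  type: ScheduleType,
  callback: () => void,
  maxAttempts = 3 // assumed budget, for illustration only
): void {
  const isOneShot = type === "delayed" || type === "scheduled";
  const mustDefer = (error: unknown) => isOneShot && isCodeUpdateReset(error);
  try {
    // Skip retries that cannot succeed on a superseded isolate.
    tryN(maxAttempts, callback, (error) => !mustDefer(error));
  } catch (error) {
    // Re-throw so the caller rejects and the one-shot row is kept.
    if (mustDefer(error)) throw error;
    // Every other case keeps the swallow-and-exhaust behavior
    // (real code would log the error here).
  }
}
```

With this shape, a one-shot callback hitting the reset runs exactly once and propagates the error, while interval/cron callbacks and unrelated errors still exhaust the budget and return normally.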

The detector is a message match (/reset because its code was updated/i) — the only signal workerd surfaces for this reset class.
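As an illustration of what such a message match might look like, here is a small sketch. The helper name is hypothetical; only the regular expression is taken from the description above. Because the match is on message text, it is written to tolerate non-Error throwables and differences in casing.

```typescript
// Regex quoted in the PR description; no global flag, so test() is stateless.
const CODE_UPDATE_RESET = /reset because its code was updated/i;

// Hypothetical helper: true when the thrown value carries the reset message.
function isCodeUpdateReset(error: unknown): boolean {
  const message = error instanceof Error ? error.message : String(error);
  return CODE_UPDATE_RESET.test(message);
}

console.log(
  isCodeUpdateReset(
    new Error("Durable Object reset because its code was updated.")
  )
); // true
console.log(isCodeUpdateReset(new Error("Network connection lost."))); // false
```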

Why this is real (not a contrived test)

The reproduction drives the real dispatch path — alarm() → _executeScheduleCallback → tryN → swallow → one-shot DELETE — in the Workers runtime. The only injected element is the production-observed reset error itself, thrown from _chatRecoveryContinue via a test-only flag (the handler can't distinguish where the reset is thrown). The test setup confirms the structural precondition live: after recovery is handled, the fiber row is gone and the turn rides solely on the schedule row.

Verified both directions:

- Without the fix: `expected 0 to be greater than or equal to 1`, meaning both recovery vehicles are destroyed (permanent abandonment).
- With the fix: green; the one-shot row survives for a fresh-code re-run.

(First commit adds the red reproduction; second commit is the fix that turns it green.)

Trade-off worth a look in review

Re-throwing propagates out of alarm(), so that one alarm tick's remaining work (other due schedules, fiber housekeeping, _scheduleNextAlarm) is skipped. This is safe because the platform's at-least-once alarm retry re-runs the tick, and the keepAlive heartbeat re-arms _scheduleNextAlarm as a backstop — but it is a behavioral nuance in the core alarm path.
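The trade-off can be illustrated with a toy model of one alarm tick. Everything here is a hypothetical simplification: the real alarm() is asynchronous and does more, and the platform retry is simulated by simply calling the tick again with the isolate no longer superseded.

```typescript
// Toy model of an alarm tick; names and structure are illustrative only.
type DueRow = { id: string; run: () => void };

function alarmTick(due: DueRow[], log: string[]): void {
  for (const row of [...due]) {
    row.run(); // a re-throw from here aborts the rest of the tick
    log.push(`ran ${row.id}`);
    due.splice(due.indexOf(row), 1); // one-shot row deleted after a normal return
  }
  log.push("housekeeping + _scheduleNextAlarm");
}

let superseded = true;
const due: DueRow[] = [
  {
    id: "a",
    run: () => {
      if (superseded) {
        throw new Error("Durable Object reset because its code was updated.");
      }
    },
  },
  { id: "b", run: () => {} },
];
const log: string[] = [];

// First tick on the superseded isolate: row "a" re-throws, so row "b"
// and the housekeeping step are skipped for this tick.
try {
  alarmTick(due, log);
} catch {
  log.push("tick rejected");
}

// Simulated platform retry on a fresh isolate: the whole tick re-runs.
superseded = false;
alarmTick(due, log);

console.log(log);
// [ 'tick rejected', 'ran a', 'ran b', 'housekeeping + _scheduleNextAlarm' ]
```

In the model nothing is lost, only delayed: the skipped row and the housekeeping step run on the retried tick, which mirrors the argument that the at-least-once retry makes the early exit safe.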

Test plan

- `npx nx run-many -t test --projects=agents`: 1662 pass
- `npm run test -w @cloudflare/ai-chat`: 544 pass (incl. new reproduction)
- `npm run test -w @cloudflare/think`: 431 pass
- `npm run test:e2e -w @cloudflare/think`: 9 pass
- `npm run test:e2e -w @cloudflare/ai-chat`: 2 pass
- Confirmed the reproduction fails without the fix and passes with it
- CI green

… alarm

Adds a deterministic reproduction (currently RED) for a second deploy-churn abandonment path that #1615 cannot reach.

After an interrupted turn is detected, recovery schedules `_chatRecoveryContinue` and deletes the orphaned `cf_agents_runs` row, so the turn rides solely on the one-shot schedule row. If that alarm then fires on a SUPERSEDED isolate, the first storage op throws the catchable `Durable Object reset because its code was updated.` for the whole invocation; `_executeScheduleCallback` burns its in-process retries, swallows the error, and `alarm()` deletes the one-shot row. Both recovery vehicles are now gone and `_beginChatRecoveryIncident` never runs again: the turn is permanently abandoned.

The test drives the REAL dispatch path (`alarm()` -> `_executeScheduleCallback` -> swallow -> one-shot DELETE); the only injected element is the (production-observed) reset error itself, thrown from the recovery callback via a test-only `setSimulateSupersededIsolateForTest` flag. It asserts the desired post-fix behavior (a recovery vehicle survives), so it fails on current main. The fix follows in the next commit.

Co-authored-by: Cursor <cursoragent@cursor.com>

When a scheduled one-shot callback fires on an isolate a deploy has just superseded, the first `ctx.storage` op throws `Durable Object reset because its code was updated.` for the whole invocation. `_executeScheduleCallback` previously burned its in-process retries (all doomed, since code never reloads mid-invocation), swallowed the error, and let `alarm()` delete the one-shot row, permanently abandoning the work.

For chat recovery this is a second deploy-churn abandonment path, upstream of and unreachable by the progress-aware budget added in #1615: the orphaned fiber row is deleted as soon as recovery is "handled", so the turn rides solely on the `_chatRecoveryContinue` / `_chatRecoveryRetry` schedule row, and deleting it leaves nothing for the boot scan to re-detect.

Fix: for a one-shot row failing with this transient, skip the doomed in-process retries (`shouldRetry`) and re-throw instead of swallowing, so `alarm()` rejects, the one-shot row survives, and the platform re-runs the alarm on a fresh isolate (= new code) under the at-least-once guarantee. The work auto-resumes once the deploy settles. Every other callback and error class keeps the existing swallow-and-exhaust behavior.

Turns the reproduction test added in the previous commit green; verified it fails without this change. All suites pass (agents 1662, ai-chat 544, think 431, e2e think 9 + ai-chat 2).

Co-authored-by: Cursor <cursoragent@cursor.com>

changeset-bot · 2026-05-29T21:41:25Z

🦋 Changeset detected

Latest commit: f4f72cb

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package

Name	Type
agents	Patch


pkg-pr-new · 2026-05-29T21:48:38Z

agents

npm i https://pkg.pr.new/agents@1617

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1617

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1617

hono-agents

npm i https://pkg.pr.new/hono-agents@1617

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1617

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1617

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1617

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1617

commit: f4f72cb

threepointone and others added 2 commits May 29, 2026 22:21

threepointone merged commit 5e60034 into main May 29, 2026
4 checks passed

threepointone deleted the fix/recovery-defer-superseded-isolate-alarm branch May 29, 2026 22:43

github-actions Bot mentioned this pull request May 29, 2026

Version Packages #1597

Open
