Fix second deploy-churn recovery abandonment: defer one-shot callbacks on a code-update DO reset #1617

Merged
threepointone merged 2 commits into main from fix/recovery-defer-superseded-isolate-alarm
May 29, 2026

@threepointone
Contributor

Summary

Follow-up to #1615. There are two distinct deploy-churn abandonment paths for durable chat recovery, at two layers. #1615 closed the first (the progress-blind attempt budget in _beginChatRecoveryIncident). This PR closes the second, which lives one layer down in the agents base class and #1615 cannot reach.

When the scheduled recovery continuation alarm fires on an isolate that a deploy has just superseded, the first ctx.storage op throws the catchable `Durable Object reset because its code was updated.` error, and that failure persists for the entire invocation (code never reloads mid-invocation). Agent._executeScheduleCallback then:

  1. burns its whole in-process tryN retry budget — every attempt is doomed for the same reason; then
  2. swallows the error and returns normally, so alarm() marks the one-shot row executed and deletes it.

The orphaned cf_agents_runs fiber row was already deleted the moment recovery was "handled" (it scheduled the continuation), so the turn rode solely on that one-shot schedule row. Deleting it leaves nothing for the boot-time fiber scan to re-detect — _beginChatRecoveryIncident (and #1615's progress logic) never runs again, and the turn is permanently abandoned, even though the next fresh invocation would succeed. A customer observed this firing ~35× over ~19 mid-turn redeploys.
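
The failure sequence described above can be modeled in a few lines. This is a minimal in-memory sketch, not the agents implementation: `Row`, `executeSwallowing`, and `alarmTick` are hypothetical names, the retry budget of 3 is an assumed value, and the code is synchronous for brevity where the real path is async.

```typescript
// Hypothetical in-memory model of the pre-fix behavior. None of these names
// are the real agents API; the retry budget of 3 is an assumed value.
type Row = { id: string; type: "delayed" | "scheduled" | "cron" | "interval" };

const RESET_MESSAGE = "Durable Object reset because its code was updated.";

// On a superseded isolate the callback fails the same way on every call.
function supersededCallback(): void {
  throw new Error(RESET_MESSAGE);
}

// Pre-fix executor: retry in-process, then swallow the final error.
function executeSwallowing(callback: () => void, maxAttempts: number): number {
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts++;
    try {
      callback();
      break; // success: stop retrying
    } catch {
      // Doomed retry: the isolate cannot load new code mid-invocation.
    }
  }
  return attempts; // the caller sees a normal return either way
}

// Pre-fix alarm tick: a one-shot row is deleted once its callback "returned".
function alarmTick(rows: Row[], callback: () => void): { rows: Row[]; attempts: number } {
  let attempts = 0;
  const surviving: Row[] = [];
  for (const row of rows) {
    attempts += executeSwallowing(callback, 3);
    const oneShot = row.type === "delayed" || row.type === "scheduled";
    if (!oneShot) surviving.push(row); // only recurring rows are kept
  }
  return { rows: surviving, attempts };
}

const initialRows: Row[] = [{ id: "recovery-continue", type: "delayed" }];
const tickResult = alarmTick(initialRows, supersededCallback);
console.log(tickResult.attempts, tickResult.rows.length); // prints "3 0"
```

In this model the whole budget is spent and the only row carrying the turn is gone, which is the state the boot-time fiber scan can no longer recover from.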

The fix

In Agent._executeScheduleCallback, for a one-shot row (delayed/scheduled) failing with the code-update reset transient:

  • skip the doomed in-process retries (shouldRetry), and
  • re-throw instead of swallowing, so alarm() rejects → the one-shot row is not deleted → Cloudflare re-runs the alarm on a fresh isolate (= new code) under the at-least-once alarm guarantee. The interrupted turn auto-resumes once the deploy storm settles, with no new user message.

Every other callback and error class keeps the existing swallow-and-exhaust behavior. Interval/cron schedules are untouched (they are not deleted on execution, so the bug does not apply).
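
The policy above could look roughly like the following. This is a sketch under stated assumptions rather than the actual change: `executeScheduled`, `isOneShot`, and the injected `isCodeUpdateReset` predicate are hypothetical names, the budget of 3 is assumed, and the real executor is async.

```typescript
// Hypothetical model of the post-fix policy; names and the retry budget of 3
// are assumptions, and the real code path is async.
type ScheduleType = "delayed" | "scheduled" | "cron" | "interval";

function isOneShot(type: ScheduleType): boolean {
  return type === "delayed" || type === "scheduled";
}

// Runs the callback under a retry budget. Returns the attempts used when the
// callback succeeds or its error is swallowed; throws when the failure must
// reach alarm() so that the one-shot row is kept.
function executeScheduled(
  type: ScheduleType,
  callback: () => void,
  isCodeUpdateReset: (error: unknown) => boolean,
  maxAttempts = 3
): number {
  let attempts = 0;
  while (attempts < maxAttempts) {
    attempts++;
    try {
      callback();
      return attempts;
    } catch (error) {
      if (isOneShot(type) && isCodeUpdateReset(error)) {
        throw error; // skip the doomed retries and do not swallow
      }
      // Any other callback or error class: keep retrying, then swallow.
    }
  }
  return attempts;
}

const matchesReset = (error: unknown): boolean =>
  error instanceof Error && /reset because its code was updated/i.test(error.message);

let calls = 0;
const failWith = (message: string) => (): void => {
  calls++;
  throw new Error(message);
};
const resetMessage = "Durable Object reset because its code was updated.";

// One-shot row + code-update reset: re-thrown after a single attempt.
let rethrown = false;
try {
  executeScheduled("delayed", failWith(resetMessage), matchesReset);
} catch {
  rethrown = true;
}
const oneShotResetCalls = calls;

// Recurring row, or a different error: existing swallow-and-exhaust behavior.
const cronAttempts = executeScheduled("cron", failWith(resetMessage), matchesReset);
const otherErrorAttempts = executeScheduled("delayed", failWith("boom"), matchesReset);
console.log(rethrown, oneShotResetCalls, cronAttempts, otherErrorAttempts);
```

Passing the detector in as a predicate keeps the retry policy separate from how the reset is recognized; that split is a choice made for this sketch only.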

The detector is a message match (/reset because its code was updated/i) — the only signal workerd surfaces for this reset class.
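
A message-match detector of this kind might be written as below. Only the regular expression comes from this PR; the function name `isCodeUpdateReset` and the handling of non-Error thrown values are assumptions made for illustration.

```typescript
// Assumed shape of the detector; only the regular expression is from the PR.
const CODE_UPDATE_RESET = /reset because its code was updated/i;

function isCodeUpdateReset(error: unknown): boolean {
  // Thrown values are not guaranteed to be Error instances.
  const message = error instanceof Error ? error.message : String(error);
  return CODE_UPDATE_RESET.test(message);
}

const samples: Array<[unknown, boolean]> = [
  [new Error("Durable Object reset because its code was updated."), true],
  [new Error("durable object RESET BECAUSE ITS CODE WAS UPDATED"), true],
  ["Durable Object reset because its code was updated.", true],
  [new Error("some other failure"), false],
  [undefined, false],
];

const detected = samples.map(([value]) => isCodeUpdateReset(value));
console.log(detected); // prints [ true, true, true, false, false ]
```

Matching on message text depends on the wording staying stable, so a detector like this is worth covering with a test that pins the exact production-observed string.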

Why this is real (not a contrived test)

The reproduction drives the real dispatch path (alarm() → _executeScheduleCallback → tryN → swallow → one-shot DELETE) in the Workers runtime. The only injected element is the production-observed reset error itself, thrown from _chatRecoveryContinue via a test-only flag (the handler can't distinguish where the reset is thrown). The test setup confirms the structural precondition live: after recovery is handled, the fiber row is gone and the turn rides solely on the schedule row.

Verified both directions:

  • Without the fix: expected 0 to be greater than or equal to 1 — both recovery vehicles destroyed (permanent abandonment).
  • With the fix: green — the one-shot row survives for a fresh-code re-run.

(First commit adds the red reproduction; second commit is the fix that turns it green.)

Trade-off worth a look in review

Re-throwing propagates out of alarm(), so that one alarm tick's remaining work (other due schedules, fiber housekeeping, _scheduleNextAlarm) is skipped. This is safe because the platform's at-least-once alarm retry re-runs the tick, and the keepAlive heartbeat re-arms _scheduleNextAlarm as a backstop — but it is a behavioral nuance in the core alarm path.
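
The trade-off can be illustrated with a toy model of a tick that has more than one due row. Everything here is hypothetical (`runAlarmTick`, `deliverAtLeastOnce`, `TickState`), and the platform's actual retry timing and limits are not modeled; the sketch only shows that a rejected tick skips its remaining work and that a later delivery on a fresh isolate completes it.

```typescript
// Hypothetical model of one alarm tick and at-least-once redelivery.
// Names are illustrative; backoff and retry limits are not modeled.
type DueRow = { id: string; oneShot: boolean };

interface TickState {
  rows: DueRow[];
  ran: string[];
  nextAlarmArmed: boolean; // stands in for _scheduleNextAlarm having run
}

function runAlarmTick(state: TickState, isolateSuperseded: boolean): void {
  for (const row of state.rows.slice()) {
    if (isolateSuperseded) {
      if (row.oneShot) {
        // Post-fix: the reset propagates out of the tick for one-shot rows.
        throw new Error("Durable Object reset because its code was updated.");
      }
      continue; // recurring rows keep the swallow behavior and stay scheduled
    }
    state.ran.push(row.id);
    if (row.oneShot) {
      state.rows = state.rows.filter((r) => r.id !== row.id);
    }
  }
  state.nextAlarmArmed = true;
}

// Redelivers the tick until it returns normally. The first `supersededTicks`
// deliveries land on a superseded isolate.
function deliverAtLeastOnce(state: TickState, supersededTicks: number): number {
  let deliveries = 0;
  while (deliveries <= supersededTicks) {
    deliveries++;
    try {
      runAlarmTick(state, deliveries <= supersededTicks);
      break;
    } catch {
      // The tick rejected: remaining work was skipped and no row was deleted.
    }
  }
  return deliveries;
}

const tickState: TickState = {
  rows: [
    { id: "recovery-continue", oneShot: true },
    { id: "heartbeat", oneShot: false },
  ],
  ran: [],
  nextAlarmArmed: false,
};
const deliveryCount = deliverAtLeastOnce(tickState, 2);
console.log(deliveryCount, tickState.ran, tickState.nextAlarmArmed);
```

In this run the first two deliveries reject before the heartbeat row or the re-arm step is reached, and the third delivery runs both rows and re-arms the alarm.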

Test plan

  • npx nx run-many -t test --projects=agents — 1662 pass
  • npm run test -w @cloudflare/ai-chat — 544 pass (incl. new reproduction)
  • npm run test -w @cloudflare/think — 431 pass
  • npm run test:e2e -w @cloudflare/think — 9 pass
  • npm run test:e2e -w @cloudflare/ai-chat — 2 pass
  • Confirmed the reproduction fails without the fix and passes with it
  • CI green

Related

Made with Cursor

threepointone and others added 2 commits May 29, 2026 22:21
… alarm

Adds a deterministic reproduction (currently RED) for a second deploy-churn
abandonment path that #1615 cannot reach. After an interrupted turn is
detected, recovery schedules `_chatRecoveryContinue` and deletes the orphaned
`cf_agents_runs` row — so the turn rides solely on the one-shot schedule row.
If that alarm then fires on a SUPERSEDED isolate, the first storage op throws
the catchable `Durable Object reset because its code was updated.` for the
whole invocation; `_executeScheduleCallback` burns its in-process retries,
swallows the error, and `alarm()` deletes the one-shot row. Both recovery
vehicles are now gone and `_beginChatRecoveryIncident` never runs again — the
turn is permanently abandoned.

The test drives the REAL dispatch path (`alarm()` ->
`_executeScheduleCallback` -> swallow -> one-shot DELETE); the only injected
element is the (production-observed) reset error itself, thrown from the
recovery callback via a test-only `setSimulateSupersededIsolateForTest` flag.
It asserts the desired post-fix behavior (a recovery vehicle survives), so it
fails on current main. The fix follows in the next commit.

Co-authored-by: Cursor <cursoragent@cursor.com>

When a scheduled one-shot callback fires on an isolate a deploy has just
superseded, the first `ctx.storage` op throws
`Durable Object reset because its code was updated.` for the whole invocation.
`_executeScheduleCallback` previously burned its in-process retries (all
doomed — code never reloads mid-invocation), swallowed the error, and let
`alarm()` delete the one-shot row — permanently abandoning the work. For chat
recovery this is a second deploy-churn abandonment path, upstream of and
unreachable by the progress-aware budget added in #1615: the orphaned fiber
row is deleted as soon as recovery is "handled", so the turn rides solely on
the `_chatRecoveryContinue` / `_chatRecoveryRetry` schedule row, and deleting
it leaves nothing for the boot scan to re-detect.

Fix: for a one-shot row failing with this transient, skip the doomed
in-process retries (`shouldRetry`) and re-throw instead of swallowing, so
`alarm()` rejects, the one-shot row survives, and the platform re-runs the
alarm on a fresh isolate (= new code) under the at-least-once guarantee. The
work auto-resumes once the deploy settles. Every other callback and error
class keeps the existing swallow-and-exhaust behavior.

Turns the reproduction test added in the previous commit green; verified it
fails without this change. All suites pass (agents 1662, ai-chat 544, think
431, e2e think 9 + ai-chat 2).

Co-authored-by: Cursor <cursoragent@cursor.com>

changeset-bot Bot commented May 29, 2026

🦋 Changeset detected

Latest commit: f4f72cb

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
agents Patch

pkg-pr-new Bot commented May 29, 2026

agents

npm i https://pkg.pr.new/agents@1617

@cloudflare/ai-chat

npm i https://pkg.pr.new/@cloudflare/ai-chat@1617

@cloudflare/codemode

npm i https://pkg.pr.new/@cloudflare/codemode@1617

hono-agents

npm i https://pkg.pr.new/hono-agents@1617

@cloudflare/shell

npm i https://pkg.pr.new/@cloudflare/shell@1617

@cloudflare/think

npm i https://pkg.pr.new/@cloudflare/think@1617

@cloudflare/voice

npm i https://pkg.pr.new/@cloudflare/voice@1617

@cloudflare/worker-bundler

npm i https://pkg.pr.new/@cloudflare/worker-bundler@1617

commit: f4f72cb
