Fix second deploy-churn recovery abandonment: defer one-shot callbacks on a code-update DO reset #1617
Merged
… alarm Adds a deterministic reproduction (currently RED) for a second deploy-churn abandonment path that #1615 cannot reach. After an interrupted turn is detected, recovery schedules `_chatRecoveryContinue` and deletes the orphaned `cf_agents_runs` row — so the turn rides solely on the one-shot schedule row. If that alarm then fires on a SUPERSEDED isolate, the first storage op throws the catchable `Durable Object reset because its code was updated.` for the whole invocation; `_executeScheduleCallback` burns its in-process retries, swallows the error, and `alarm()` deletes the one-shot row. Both recovery vehicles are now gone and `_beginChatRecoveryIncident` never runs again — the turn is permanently abandoned. The test drives the REAL dispatch path (`alarm()` -> `_executeScheduleCallback` -> swallow -> one-shot DELETE); the only injected element is the (production-observed) reset error itself, thrown from the recovery callback via a test-only `setSimulateSupersededIsolateForTest` flag. It asserts the desired post-fix behavior (a recovery vehicle survives), so it fails on current main. The fix follows in the next commit. Co-authored-by: Cursor <cursoragent@cursor.com>
When a scheduled one-shot callback fires on an isolate a deploy has just superseded, the first `ctx.storage` op throws `Durable Object reset because its code was updated.` for the whole invocation. `_executeScheduleCallback` previously burned its in-process retries (all doomed — code never reloads mid-invocation), swallowed the error, and let `alarm()` delete the one-shot row — permanently abandoning the work. For chat recovery this is a second deploy-churn abandonment path, upstream of and unreachable by the progress-aware budget added in #1615: the orphaned fiber row is deleted as soon as recovery is "handled", so the turn rides solely on the `_chatRecoveryContinue` / `_chatRecoveryRetry` schedule row, and deleting it leaves nothing for the boot scan to re-detect. Fix: for a one-shot row failing with this transient, skip the doomed in-process retries (`shouldRetry`) and re-throw instead of swallowing, so `alarm()` rejects, the one-shot row survives, and the platform re-runs the alarm on a fresh isolate (= new code) under the at-least-once guarantee. The work auto-resumes once the deploy settles. Every other callback and error class keeps the existing swallow-and-exhaust behavior. Turns the reproduction test added in the previous commit green; verified it fails without this change. All suites pass (agents 1662, ai-chat 544, think 431, e2e think 9 + ai-chat 2). Co-authored-by: Cursor <cursoragent@cursor.com>
🦋 Changeset detected

Latest commit: f4f72cb

The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
This was referenced May 29, 2026
Summary
Follow-up to #1615. There are two distinct deploy-churn abandonment paths for durable chat recovery, at two layers. #1615 closed the first (the progress-blind attempt budget in `_beginChatRecoveryIncident`). This PR closes the second, which lives one layer down in the `agents` base class, where #1615 cannot reach.

When the scheduled recovery continuation alarm fires on an isolate a deploy has just superseded, the first `ctx.storage` op throws the catchable `Durable Object reset because its code was updated.` for the entire invocation (code never reloads mid-invocation). `Agent._executeScheduleCallback` then:

1. burns its `tryN` retry budget (every attempt is doomed for the same reason); then
2. swallows the error; then
3. `alarm()` marks the one-shot row executed and deletes it.

The orphaned `cf_agents_runs` fiber row was already deleted the moment recovery was "handled" (it scheduled the continuation), so the turn rode solely on that one-shot schedule row. Deleting it leaves nothing for the boot-time fiber scan to re-detect: `_beginChatRecoveryIncident` (and #1615's progress logic) never runs again, and the turn is permanently abandoned, even though the next fresh invocation would succeed. A customer observed this firing ~35× over ~19 mid-turn redeploys.

The fix
In `Agent._executeScheduleCallback`, for a one-shot row (`delayed`/`scheduled`) failing with the code-update reset transient:

- skip the doomed in-process retries (`shouldRetry`), and
- re-throw instead of swallowing, so `alarm()` rejects → the one-shot row is not deleted → Cloudflare re-runs the alarm on a fresh isolate (= new code) under the at-least-once alarm guarantee.

The interrupted turn auto-resumes once the deploy storm settles, with no new user message.

Every other callback and error class keeps the existing swallow-and-exhaust behavior. Interval/cron schedules are untouched (they are not deleted on execution, so the bug does not apply).
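For reviewers who want the control flow at a glance, here is a minimal sketch of the behavior described above. It is not the code in this PR: `isCodeUpdateReset`, `runOneShot`, and `Outcome` are hypothetical names, the callback is synchronous for brevity (the real one is async), and the message pattern is the one this description gives for the detector.

```typescript
// Hypothetical sketch; none of these names come from the agents package.
const CODE_UPDATE_RESET = /reset because its code was updated/i;

// Message match: treats any thrown value whose text contains the reset phrase
// as the code-update transient.
function isCodeUpdateReset(err: unknown): boolean {
  const message = err instanceof Error ? err.message : String(err);
  return CODE_UPDATE_RESET.test(message);
}

type Outcome = { attempts: number; swallowed: boolean };

// Runs a one-shot callback with up to maxAttempts tries (maxAttempts >= 1).
// - Code-update reset: no further retries; the error is re-thrown so the
//   caller (alarm() in this PR) rejects and the schedule row is kept.
// - Any other error: retried, then swallowed once attempts are exhausted.
function runOneShot(callback: () => void, maxAttempts: number): Outcome {
  let attempts = 0;
  while (true) {
    attempts++;
    try {
      callback();
      return { attempts, swallowed: false };
    } catch (err) {
      if (isCodeUpdateReset(err)) throw err;
      if (attempts >= maxAttempts) return { attempts, swallowed: true };
    }
  }
}
```

In this sketch a callback that throws the reset message is invoked exactly once and the error escapes, while a callback that throws anything else is invoked `maxAttempts` times and the failure is swallowed, mirroring the split between the new path and the unchanged swallow-and-exhaust path.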
The detector is a message match (`/reset because its code was updated/i`), the only signal workerd surfaces for this reset class.

Why this is real (not a contrived test)
The reproduction drives the real dispatch path (`alarm()` → `_executeScheduleCallback` → `tryN` → swallow → one-shot `DELETE`) in the Workers runtime. The only injected element is the production-observed reset error itself, thrown from `_chatRecoveryContinue` via a test-only flag (the handler can't distinguish where the reset is thrown). The test setup confirms the structural precondition live: after recovery is handled, the fiber row is gone and the turn rides solely on the schedule row.

Verified both directions:

- Without the fix: the test fails with `expected 0 to be greater than or equal to 1`, meaning both recovery vehicles are destroyed (permanent abandonment).
- With the fix: the test passes and a recovery vehicle survives.

(First commit adds the red reproduction; second commit is the fix that turns it green.)
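The two directions can be illustrated with a toy in-memory model. This is a sketch under stated assumptions, not the reproduction test from this PR: `Row`, `alarmSwallowing`, `alarmRethrowingOnReset`, and `survivingRows` are hypothetical names, and a `Map` stands in for the schedule table.

```typescript
// Toy model; names are hypothetical and a Map stands in for the schedule table.
type Row = { id: string; callback: () => void };

const RESET_MESSAGE = "Durable Object reset because its code was updated.";

// Pre-fix shape: any callback error is swallowed, then the row is deleted.
function alarmSwallowing(rows: Map<string, Row>): void {
  for (const row of [...rows.values()]) {
    try {
      row.callback();
    } catch {
      // Swallowed: the failure is invisible to the caller.
    }
    rows.delete(row.id);
  }
}

// Post-fix shape: the reset error propagates, so the delete never runs.
function alarmRethrowingOnReset(rows: Map<string, Row>): void {
  for (const row of [...rows.values()]) {
    try {
      row.callback();
    } catch (err) {
      const isReset =
        err instanceof Error &&
        /reset because its code was updated/i.test(err.message);
      if (isReset) throw err; // The tick fails and is left for a later retry.
      // Other errors are still swallowed.
    }
    rows.delete(row.id);
  }
}

// Builds a table holding one one-shot row whose callback always hits the reset.
function makeRows(): Map<string, Row> {
  const recovery: Row = {
    id: "recovery",
    callback: () => {
      throw new Error(RESET_MESSAGE);
    },
  };
  return new Map<string, Row>([[recovery.id, recovery]]);
}

// Runs one alarm tick and reports how many rows are left afterwards.
function survivingRows(alarm: (rows: Map<string, Row>) => void): number {
  const rows = makeRows();
  try {
    alarm(rows);
  } catch {
    // The tick rejected; the row table is left as-is.
  }
  return rows.size;
}
```

Here `survivingRows(alarmSwallowing)` is 0 and `survivingRows(alarmRethrowingOnReset)` is 1, which is the same "at least one recovery vehicle survives" condition the reproduction asserts.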
Trade-off worth a look in review
Re-throwing propagates out of `alarm()`, so that one alarm tick's remaining work (other due schedules, fiber housekeeping, `_scheduleNextAlarm`) is skipped. This is safe because the platform's at-least-once alarm retry re-runs the tick, and the keepAlive heartbeat re-arms `_scheduleNextAlarm` as a backstop, but it is a behavioral nuance in the core alarm path.

Test plan
- `npx nx run-many -t test --projects=agents`: 1662 pass
- `npm run test -w @cloudflare/ai-chat`: 544 pass (incl. new reproduction)
- `npm run test -w @cloudflare/think`: 431 pass
- `npm run test:e2e -w @cloudflare/think`: 9 pass
- `npm run test:e2e -w @cloudflare/ai-chat`: 2 pass

Related
Made with Cursor