Prevent unbounded alarm scheduling promise chain in ActorSqlite. by vaibhavSharma-cf · Pull Request #6383 · cloudflare/workerd

vaibhavSharma-cf · 2026-03-23T13:06:42Z

When user code repeatedly moves a Durable Object alarm to a later time and scheduleRun calls to the alarm manager take longer than local SQLite commits (common in production), the alarmLaterChain promise chain grows without bound. This causes alarms to fire at times far in the past and can block SRS commits for arbitrarily long when a subsequent setAlarm moves the alarm earlier.
This change replaces the unbounded chain with a coalescing mechanism: at most one scheduleRun is in flight at a time, and rapid intermediate alarm times are collapsed into a single pending value.
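
The coalescing behaviour described above can be illustrated with a minimal, standard-library-only sketch. It is not the workerd implementation: FakeAlarmManager, CoalescingScheduler and runCoalescingDemo are hypothetical names, and plain callbacks stand in for KJ promises.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the alarm manager: records each requested time and
// holds the completion callback until the caller "delivers" the response.
struct FakeAlarmManager {
  std::vector<int64_t> requested;              // every time passed to scheduleRun
  std::vector<std::function<void()>> waiting;  // completions not yet delivered

  void scheduleRun(int64_t time, std::function<void()> onDone) {
    requested.push_back(time);
    waiting.push_back(std::move(onDone));
  }

  // Simulates the slow request finishing for the oldest outstanding call.
  void completeOne() {
    assert(!waiting.empty());
    auto done = std::move(waiting.front());
    waiting.erase(waiting.begin());
    done();
  }
};

// Assumed shape of the coalescing idea: at most one request in flight,
// and at most one pending value that later calls overwrite.
class CoalescingScheduler {
public:
  explicit CoalescingScheduler(FakeAlarmManager& m): manager(m) {}

  void scheduleLater(int64_t time) {
    if (inFlight) {
      pending = time;  // intermediate times are dropped, only the latest is kept
      return;
    }
    start(time);
  }

private:
  void start(int64_t time) {
    inFlight = true;
    manager.scheduleRun(time, [this] {
      inFlight = false;
      if (pending) {  // drain the single pending value, if any
        int64_t next = *pending;
        pending.reset();
        start(next);
      }
    });
  }

  FakeAlarmManager& manager;
  bool inFlight = false;
  std::optional<int64_t> pending;
};

// Moves the alarm later five times before any response arrives, then drains.
inline std::vector<int64_t> runCoalescingDemo() {
  FakeAlarmManager manager;
  CoalescingScheduler scheduler(manager);
  for (int64_t t = 100; t <= 500; t += 100) scheduler.scheduleLater(t);
  while (!manager.waiting.empty()) manager.completeOne();
  return manager.requested;
}
```

In this model, five move-later calls issued before the first response arrives produce only two requests, for times 100 and 500. The intermediate times 200, 300 and 400 are never sent, so the amount of queued work stays bounded regardless of how many calls arrive.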

github-actions · 2026-03-23T13:06:54Z

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

vaibhavSharma-cf · 2026-03-23T13:07:55Z

I have read the CLA Document and I hereby sign the CLA

ask-bonk

I'm Bonk, and I've done a quick review of your PR.

This PR replaces an unbounded alarmLaterChain promise chain with a coalescing mechanism (scheduleLaterAlarm) that caps move-later alarm scheduling to at most one in-flight request plus one pending value. Good fix for a real production issue.

Issues

alarmLaterIsInFlight desync when armAlarmHandler bypasses scheduleLaterAlarm (medium severity, correctness)

armAlarmHandler directly overwrites alarmLaterInFlight without updating alarmLaterIsInFlight or pendingLaterAlarmTime. If scheduleLaterAlarm had previously started a request (setting alarmLaterIsInFlight = true and adding a completion handler at line 408), the old completion handler still holds a branch of the old forked promise. When that old branch resolves, its .then callback sets alarmLaterIsInFlight = false -- but by then, alarmLaterInFlight points to the new promise from armAlarmHandler, which may still be pending. This means alarmLaterIsInFlight becomes false prematurely, and a subsequent scheduleLaterAlarm call could start a new request that races with the armAlarmHandler-initiated one.

See inline suggestion for a minimal fix.

src/workerd/io/actor-sqlite.c++

vaibhavSharma-cf · 2026-03-23T13:09:50Z

@ask-bonk

ask-bonk · 2026-03-23T13:09:57Z

Review posted successfully on PR #6383. Here's a summary:

One correctness issue found: In armAlarmHandler, the code directly overwrites alarmLaterInFlight without resetting alarmLaterIsInFlight or pendingLaterAlarmTime. This can cause the alarmLaterIsInFlight boolean to become desynchronized from the actual state of the in-flight promise, potentially allowing racing move-later operations. I posted an inline suggestion to reset the coalescing state when armAlarmHandler bypasses scheduleLaterAlarm.

The overall approach (replacing unbounded chaining with a coalescing mechanism) is sound and well-tested.

ask-bonk · 2026-03-23T13:13:18Z

APIError: This model does not support assistant message prefill. The conversation must end with a user message.

ask-bonk · 2026-03-23T13:13:20Z

@vaibhavSharma-cf Bonk workflow failed. Check the logs for details.

View workflow run · To retry, trigger Bonk again.

vaibhavSharma-cf · 2026-03-24T17:54:02Z

@ask-bonk

ask-bonk

Retraction of my previous review comment. After more careful analysis, the issue I flagged about alarmLaterIsInFlight desynchronization in armAlarmHandler is incorrect. The current code is correct as written. Here is why:

alarmLaterIsInFlight tracks whether the completion handler (added to commitTasks by scheduleLaterAlarm() at line 405) is still pending — not whether alarmLaterInFlight has an unresolved promise. When armAlarmHandler overwrites alarmLaterInFlight without touching alarmLaterIsInFlight, this is intentional:

The old completion handler (from the original scheduleLaterAlarm call) is still pending in commitTasks.
Leaving alarmLaterIsInFlight = true means subsequent scheduleLaterAlarm calls correctly coalesce into pendingLaterAlarmTime rather than starting a racing request.
When the old completion handler eventually fires, it sets alarmLaterIsInFlight = false and drains any accumulated pendingLaterAlarmTime, chaining off the current alarmLaterInFlight (which now points to armAlarmHandler's fork).

My previous suggestion to reset alarmLaterIsInFlight = false and pendingLaterAlarmTime = kj::none would have been harmful — it would allow the next scheduleLaterAlarm call to start a new in-flight request immediately, racing with both the old completion handler and the armAlarmHandler-initiated fork.

The second test (armAlarmHandler with coalesced pending alarms schedules reschedule exactly once) validates this exact interaction and confirms the code is correct.

No issues found in this updated review. The coalescing mechanism is sound, lifetime management is correct (forked promise branches keep shared state alive via refcounting; commitTasks destruction cancels pending tasks safely), and the .catch_() on alarmLaterInFlight before .fork() ensures the completion handler always fires on the success path.
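
The argument in this retraction can be checked against a toy model. The sketch below is an assumption-laden simplification, not the workerd code: ToyAlarmState, armHandler and maxOutstandingRequests are hypothetical names, handlerPending plays the role of alarmLaterIsInFlight, and a generation counter stands in for overwriting alarmLaterInFlight.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <optional>
#include <utility>

// Toy model only; none of these names are workerd APIs.
class ToyAlarmState {
public:
  void scheduleLater(int64_t time) {
    if (handlerPending) {
      pending = time;  // coalesce behind the outstanding completion handler
      return;
    }
    start();
  }

  // Models armAlarmHandler replacing the in-flight promise. When
  // resetCoalescingState is true it also applies the retracted suggestion.
  void armHandler(bool resetCoalescingState) {
    ++promiseGeneration;
    if (resetCoalescingState) {
      handlerPending = false;
      pending.reset();
    }
  }

  // Delivers the oldest outstanding completion; returns false when none remain.
  bool completeOne() {
    if (completions.empty()) return false;
    auto done = std::move(completions.front());
    completions.pop_front();
    done();
    return true;
  }

  int maxOutstanding = 0;  // peak number of move-later requests in progress

private:
  void start() {
    handlerPending = true;
    ++outstanding;
    if (outstanding > maxOutstanding) maxOutstanding = outstanding;
    completions.push_back([this] {
      --outstanding;
      handlerPending = false;
      if (pending) {
        pending.reset();
        start();
      }
    });
  }

  bool handlerPending = false;
  std::optional<int64_t> pending;
  int outstanding = 0;
  int promiseGeneration = 0;
  std::deque<std::function<void()>> completions;
};

// Runs: move later, arm the handler, move later again, then drain everything.
inline int maxOutstandingRequests(bool resetCoalescingState) {
  ToyAlarmState state;
  state.scheduleLater(100);
  state.armHandler(resetCoalescingState);
  state.scheduleLater(200);
  while (state.completeOne()) {}
  return state.maxOutstanding;
}
```

Under these assumptions, leaving the flag untouched keeps the peak at one outstanding move-later request, while resetting it in armHandler lets the second call start immediately and raises the peak to two.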

This review was generated by an AI assistant (ask-bonk) and may contain inaccuracies.

ask-bonk · 2026-03-24T17:56:37Z

Review posted on PR #6383. Here's a summary:

Retracted the previous finding. After careful re-analysis, the alarmLaterIsInFlight desync issue I flagged in the first review was incorrect. The code is correct as written:

alarmLaterIsInFlight tracks the lifecycle of the completion handler in commitTasks, not the alarmLaterInFlight promise itself
When armAlarmHandler overwrites alarmLaterInFlight without clearing alarmLaterIsInFlight, this intentionally prevents racing scheduleLaterAlarm calls
The previous suggestion to reset both flags would have been harmful, enabling a race between the old completion handler and new requests
The second test (armAlarmHandler with coalesced pending alarms schedules reschedule exactly once) validates exactly this interaction

No new issues found. The coalescing mechanism is sound, lifetime management is correct, and error handling is properly structured.

codecov-commenter · 2026-03-24T18:28:22Z

Codecov Report

❌ Patch coverage is 85.03401% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.90%. Comparing base (e6407c3) to head (2ad6f83).
⚠️ Report is 1 commit behind head on main.

Files with missing lines               Patch %   Lines
src/workerd/io/actor-sqlite-test.c++   84.03%    2 Missing and 17 partials ⚠️
src/workerd/io/actor-sqlite.c++        89.28%    2 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6383      +/-   ##
==========================================
+ Coverage   70.88%   70.90%   +0.01%     
==========================================
  Files         438      438              
  Lines      123484   123622     +138     
  Branches    19426    19449      +23     
==========================================
+ Hits        87534    87652     +118     
- Misses      24442    24445       +3     
- Partials    11508    11525      +17

a-robinson

Sorry for taking so long to get to this! I know I mentioned last week that I wouldn't be able to get to it quickly, but I should have suggested finding someone else to review it.

The nits are ignorable, but the new race condition seems worth fixing IIUC.

src/workerd/io/actor-sqlite.c++

src/workerd/io/actor-sqlite.h

src/workerd/io/actor-sqlite.c++

vaibhavSharma-cf · 2026-04-13T19:10:43Z

/bonk review

ask-bonk

Reviewing the current state of this PR in light of a-robinson's feedback.

Race condition status (a-robinson's main concern)

a-robinson's concern at scheduleLaterAlarm line 404 was about pendingLaterAlarmTime being drained concurrently with a move-earlier scheduleRun. Tracing through the code: this race is already handled by the existing pendingLaterAlarmTime = kj::none in startPrecommitAlarmScheduling() (line 432). When a move-earlier begins, the pending later time is cleared so the completion handler finds nothing to drain. The test at line 1276 ("coalesced move-later followed by move-earlier does not race") validates this exact scenario with a priorTask-respecting handler that would detect concurrent RPCs.
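
The effect of clearing the pending later time when a move-earlier begins can be sketched with a small model. This is an illustration under stated assumptions, not the code at the lines cited above: ToyScheduler, moveLater, moveEarlier and sentTimes are hypothetical names, and the move-earlier request is modeled as being sent immediately.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Toy model only; names are illustrative, not workerd APIs.
class ToyScheduler {
public:
  explicit ToyScheduler(bool clear): clearPendingOnEarlier(clear) {}

  void moveLater(int64_t time) {
    if (laterInFlight) {
      pendingLater = time;  // coalesce behind the in-flight request
      return;
    }
    startLater(time);
  }

  // A move-earlier request is sent right away. Optionally drop the pending
  // later time first, so the completion handler has nothing left to drain.
  void moveEarlier(int64_t time) {
    if (clearPendingOnEarlier) pendingLater.reset();
    sent.push_back(time);
  }

  // Delivers the oldest outstanding completion; returns false when none remain.
  bool completeOne() {
    if (completions.empty()) return false;
    auto done = std::move(completions.front());
    completions.pop_front();
    done();
    return true;
  }

  std::vector<int64_t> sent;  // every time sent to the (imaginary) alarm manager

private:
  void startLater(int64_t time) {
    laterInFlight = true;
    sent.push_back(time);
    completions.push_back([this] {
      laterInFlight = false;
      if (pendingLater) {
        int64_t next = *pendingLater;
        pendingLater.reset();
        startLater(next);
      }
    });
  }

  bool clearPendingOnEarlier;
  bool laterInFlight = false;
  std::optional<int64_t> pendingLater;
  std::deque<std::function<void()>> completions;
};

// Runs: move later twice (second one coalesces), move earlier, then drain.
inline std::vector<int64_t> sentTimes(bool clearPendingOnEarlier) {
  ToyScheduler s(clearPendingOnEarlier);
  s.moveLater(100);
  s.moveLater(200);
  s.moveEarlier(50);
  while (s.completeOne()) {}
  return s.sent;
}
```

With clearing enabled the model sends only 100 and 50. Without it, the completion handler drains the stale value and sends 200 after the move-earlier request, which is the kind of interleaving the review comment was concerned about.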

The author's reply ("Setting pendingLaterAlarmTime to none fix this") appears to be describing the existing code behavior, not proposing a new change. However, the reply was ambiguous and no new commit has been pushed since a-robinson's review, so a-robinson may still be waiting for a follow-up. It would be worth replying explicitly to a-robinson's comment to clarify that the fix is already present in the current code and point to the specific lines (432 and 983) where pendingLaterAlarmTime is cleared.

Style nits from a-robinson (still unaddressed)

The three style nits (unnecessary SpanParent() wrapper, unclear "tracking" comment, line length) were acknowledged in replies but no updated commit has been pushed. These should be addressed in a follow-up push.

No new issues found

The coalescing mechanism is correct and well-tested. Lifetime management is safe (commitTasks is a member of ActorSqlite, so its destructor cancels pending tasks before member destruction). The .catch_() on both the forked promise and the completion handler correctly prevents non-fatal move-later failures from breaking the output gate.

This review was generated by an AI assistant (ask-bonk) and may contain inaccuracies.

ask-bonk · 2026-04-13T19:14:48Z

Review posted on PR #6383. Summary:

Race condition (a-robinson's main concern): Already handled by pendingLaterAlarmTime = kj::none in startPrecommitAlarmScheduling() (line 432) and armAlarmHandler (line 983). The test at line 1276 validates this. The author's reply was ambiguous — recommended explicitly clarifying to a-robinson that the fix is already present.
Style nits: Three minor nits (unnecessary SpanParent() wrapper, unclear "tracking" comment, line length) acknowledged but not yet pushed as a new commit.
No new issues found: The coalescing mechanism is correct, lifetime management is safe, and error handling is properly structured.

codspeed-hq · 2026-04-13T19:29:06Z

Merging this PR will improve performance by 39.47%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments, which may affect the accuracy of the results.

⚡ 2 improved benchmarks
✅ 68 untouched benchmarks
⏩ 129 skipped benchmarks¹

Performance Changes

    Benchmark                              BASE      HEAD      Efficiency
⚡  simpleStringBody[Response]             26.9 µs   19.3 µs   +39.47%
⚡  Encode_ASCII_32[TextEncoder][0/0/32]   3.1 ms    2.7 ms    +14.52%

Comparing vaibhav/alarm-later-chain-fix (2ad6f83) with main (e6407c3)

¹ 129 benchmarks were skipped, so the baseline results were used instead.

a-robinson

LGTM!

vaibhavSharma-cf requested review from a team as code owners March 23, 2026 13:06

vaibhavSharma-cf requested review from a-robinson, anna-cf and jqmmes March 23, 2026 13:07

github-actions bot added a commit that referenced this pull request Mar 23, 2026

@vaibhavSharma-cf has signed the CLA in #6383

cd4a764

ask-bonk bot reviewed Mar 23, 2026

vaibhavSharma-cf force-pushed the vaibhav/alarm-later-chain-fix branch 2 times, most recently from dbef72c to 898149e Compare March 24, 2026 17:53

ask-bonk bot reviewed Mar 24, 2026

vaibhavSharma-cf force-pushed the vaibhav/alarm-later-chain-fix branch from 898149e to ff1be18 Compare March 24, 2026 20:42

a-robinson reviewed Apr 1, 2026

vaibhavSharma-cf force-pushed the vaibhav/alarm-later-chain-fix branch from ff1be18 to 31caef9 Compare April 13, 2026 19:09

vaibhavSharma-cf force-pushed the vaibhav/alarm-later-chain-fix branch from 31caef9 to b345115 Compare April 13, 2026 19:10

ask-bonk bot reviewed Apr 13, 2026

vaibhavSharma-cf requested a review from a-robinson April 13, 2026 19:23

a-robinson approved these changes Apr 13, 2026

vaibhavSharma-cf enabled auto-merge April 13, 2026 21:06

Merge branch 'main' into vaibhav/alarm-later-chain-fix

2ad6f83

vaibhavSharma-cf merged commit a58c44b into main Apr 13, 2026
22 checks passed

vaibhavSharma-cf deleted the vaibhav/alarm-later-chain-fix branch April 13, 2026 21:35
