
Prevent unbounded alarm scheduling promise chain in ActorSqlite. #6383

Merged
vaibhavSharma-cf merged 2 commits into main from vaibhav/alarm-later-chain-fix
Apr 13, 2026

Conversation

@vaibhavSharma-cf
Contributor

When user code repeatedly moves a Durable Object alarm to a later time and scheduleRun calls to the alarm manager take longer than local SQLite commits (common in production), the alarmLaterChain promise chain grew without bound. This caused alarms to fire at times far in the past and could block SRS commits for arbitrarily long when a subsequent setAlarm moved the alarm earlier.
Replace the unbounded chain with a coalescing mechanism: at most one scheduleRun is in-flight at a time, and rapid intermediate alarm times are collapsed into a single pending value.
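
A minimal sketch of the coalescing idea described above, assuming a single-threaded model in which callbacks stand in for KJ promises. `FakeAlarmManager` and `CoalescingScheduler` are hypothetical names used for illustration only; they are not the workerd implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the alarm manager. It records every scheduleRun
// and holds the completion callback until the caller delivers it, which
// simulates an RPC that is slower than a local commit.
struct FakeAlarmManager {
  std::vector<int64_t> requested;                  // every time sent to scheduleRun
  std::deque<std::function<void()>> outstanding;   // completions not yet delivered

  void scheduleRun(int64_t time, std::function<void()> onDone) {
    requested.push_back(time);
    outstanding.push_back(std::move(onDone));
  }

  // Deliver the oldest outstanding completion. Returns false if there is none.
  bool completeOne() {
    if (outstanding.empty()) return false;
    auto done = std::move(outstanding.front());
    outstanding.pop_front();
    done();
    return true;
  }
};

// At most one scheduleRun is in flight; later requests overwrite one pending slot.
class CoalescingScheduler {
public:
  explicit CoalescingScheduler(FakeAlarmManager& m): manager(m) {}

  void scheduleLater(int64_t time) {
    if (inFlight) {
      pending = time;  // only the most recent time matters
      return;
    }
    start(time);
  }

  bool isInFlight() const { return inFlight; }

private:
  void start(int64_t time) {
    assert(!inFlight);
    inFlight = true;
    manager.scheduleRun(time, [this]() {
      inFlight = false;
      if (pending) {
        int64_t next = *pending;
        pending.reset();
        start(next);  // drain the single coalesced value
      }
    });
  }

  FakeAlarmManager& manager;
  bool inFlight = false;
  std::optional<int64_t> pending;
};
```

In this model, a hundred back-to-back `scheduleLater` calls produce one request immediately and one more when the first completes, rather than a chain of a hundred.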

@vaibhavSharma-cf vaibhavSharma-cf requested review from a team as code owners March 23, 2026 13:06
@github-actions

github-actions bot commented Mar 23, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@vaibhavSharma-cf
Contributor Author

I have read the CLA Document and I hereby sign the CLA

github-actions bot added a commit that referenced this pull request Mar 23, 2026
Contributor

@ask-bonk ask-bonk bot left a comment


I'm Bonk, and I've done a quick review of your PR.

This PR replaces an unbounded alarmLaterChain promise chain with a coalescing mechanism (scheduleLaterAlarm) that caps move-later alarm scheduling to at most one in-flight request plus one pending value. Good fix for a real production issue.

Issues

  1. alarmLaterIsInFlight desync when armAlarmHandler bypasses scheduleLaterAlarm (medium severity, correctness)

    armAlarmHandler directly overwrites alarmLaterInFlight without updating alarmLaterIsInFlight or pendingLaterAlarmTime. If scheduleLaterAlarm had previously started a request (setting alarmLaterIsInFlight = true and adding a completion handler at line 408), the old completion handler still holds a branch of the old forked promise. When that old branch resolves, its .then callback sets alarmLaterIsInFlight = false -- but by then, alarmLaterInFlight points to the new promise from armAlarmHandler, which may still be pending. This means alarmLaterIsInFlight becomes false prematurely, and a subsequent scheduleLaterAlarm call could start a new request that races with the armAlarmHandler-initiated one.

    See inline suggestion for a minimal fix.

@vaibhavSharma-cf
Contributor Author

@ask-bonk

@ask-bonk
Contributor

ask-bonk bot commented Mar 23, 2026

Review posted successfully on PR #6383. Here's a summary:

One correctness issue found: In armAlarmHandler, the code directly overwrites alarmLaterInFlight without resetting alarmLaterIsInFlight or pendingLaterAlarmTime. This can cause the alarmLaterIsInFlight boolean to become desynchronized from the actual state of the in-flight promise, potentially allowing racing move-later operations. I posted an inline suggestion to reset the coalescing state when armAlarmHandler bypasses scheduleLaterAlarm.

The overall approach (replacing unbounded chaining with a coalescing mechanism) is sound and well-tested.


@ask-bonk
Contributor

ask-bonk bot commented Mar 23, 2026

APIError: This model does not support assistant message prefill. The conversation must end with a user message.


@ask-bonk
Contributor

ask-bonk bot commented Mar 23, 2026

@vaibhavSharma-cf Bonk workflow failed. Check the logs for details.

View workflow run · To retry, trigger Bonk again.

@vaibhavSharma-cf vaibhavSharma-cf force-pushed the vaibhav/alarm-later-chain-fix branch 2 times, most recently from dbef72c to 898149e Compare March 24, 2026 17:53
@vaibhavSharma-cf
Contributor Author

@ask-bonk

Contributor

@ask-bonk ask-bonk bot left a comment


Retraction of my previous review comment. After more careful analysis, the issue I flagged about alarmLaterIsInFlight desynchronization in armAlarmHandler is incorrect. The current code is correct as written. Here is why:

alarmLaterIsInFlight tracks whether the completion handler (added to commitTasks by scheduleLaterAlarm() at line 405) is still pending — not whether alarmLaterInFlight has an unresolved promise. When armAlarmHandler overwrites alarmLaterInFlight without touching alarmLaterIsInFlight, this is intentional:

  1. The old completion handler (from the original scheduleLaterAlarm call) is still pending in commitTasks.
  2. Leaving alarmLaterIsInFlight = true means subsequent scheduleLaterAlarm calls correctly coalesce into pendingLaterAlarmTime rather than starting a racing request.
  3. When the old completion handler eventually fires, it sets alarmLaterIsInFlight = false and drains any accumulated pendingLaterAlarmTime, chaining off the current alarmLaterInFlight (which now points to armAlarmHandler's fork).

My previous suggestion to reset alarmLaterIsInFlight = false and pendingLaterAlarmTime = kj::none would have been harmful — it would allow the next scheduleLaterAlarm call to start a new in-flight request immediately, racing with both the old completion handler and the armAlarmHandler-initiated fork.

The second test (armAlarmHandler with coalesced pending alarms schedules reschedule exactly once) validates this exact interaction and confirms the code is correct.

No issues found in this updated review. The coalescing mechanism is sound, lifetime management is correct (forked promise branches keep shared state alive via refcounting; commitTasks destruction cancels pending tasks safely), and the .catch_() on alarmLaterInFlight before .fork() ensures the completion handler always fires on the success path.
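
The interaction described in this review can be illustrated with a toy model. This is a sketch under stated assumptions: it tracks only the flag, the pending slot, and a count of outstanding requests, and it does not model forked KJ promises. `BypassModel` and `outstandingAfterBypass` are hypothetical names, and `resetState` selects the retracted suggestion so the two behaviours can be compared.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Toy model (an assumption, not the workerd code). `outstanding` counts
// move-later requests that have been sent but have not yet completed.
struct BypassModel {
  bool laterIsInFlight = false;
  std::optional<int64_t> pendingLater;
  int outstanding = 0;

  // Coalescing path: start a request only when none is tracked as in flight.
  void scheduleLater(int64_t time) {
    if (laterIsInFlight) {
      pendingLater = time;
      return;
    }
    laterIsInFlight = true;
    ++outstanding;
  }

  // Bypass path: issues its own request without going through scheduleLater.
  // resetState == true models the retracted suggestion to clear the state.
  void armHandler(bool resetState) {
    ++outstanding;
    if (resetState) {
      laterIsInFlight = false;
      pendingLater.reset();
    }
  }

  // Completion handler for a request that scheduleLater started.
  void completeLater() {
    --outstanding;
    laterIsInFlight = false;
    if (pendingLater) {
      int64_t next = *pendingLater;
      pendingLater.reset();
      scheduleLater(next);
    }
  }
};

// Returns how many requests are outstanding after: one scheduleLater, a bypass
// by armHandler, then another scheduleLater before anything has completed.
int outstandingAfterBypass(bool resetState) {
  BypassModel m;
  m.scheduleLater(10);
  m.armHandler(resetState);
  m.scheduleLater(20);
  return m.outstanding;
}
```

With the flag left alone, the second `scheduleLater` coalesces into the pending slot and two requests are outstanding. With the state reset, the same call starts a third request while the first two are still pending.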


This review was generated by an AI assistant (ask-bonk) and may contain inaccuracies.

@ask-bonk
Contributor

ask-bonk bot commented Mar 24, 2026

Review posted on PR #6383. Here's a summary:

Retracted the previous finding. After careful re-analysis, the alarmLaterIsInFlight desync issue I flagged in the first review was incorrect. The code is correct as written:

  • alarmLaterIsInFlight tracks the lifecycle of the completion handler in commitTasks, not the alarmLaterInFlight promise itself
  • When armAlarmHandler overwrites alarmLaterInFlight without clearing alarmLaterIsInFlight, this intentionally prevents racing scheduleLaterAlarm calls
  • The previous suggestion to reset both flags would have been harmful, enabling a race between the old completion handler and new requests
  • The second test (armAlarmHandler with coalesced pending alarms schedules reschedule exactly once) validates exactly this interaction

No new issues found. The coalescing mechanism is sound, lifetime management is correct, and error handling is properly structured.


@codecov-commenter

codecov-commenter commented Mar 24, 2026

Codecov Report

❌ Patch coverage is 85.03401% with 22 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.90%. Comparing base (e6407c3) to head (2ad6f83).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/workerd/io/actor-sqlite-test.c++ | 84.03% | 2 Missing and 17 partials ⚠️ |
| src/workerd/io/actor-sqlite.c++ | 89.28% | 2 Missing and 1 partial ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6383      +/-   ##
==========================================
+ Coverage   70.88%   70.90%   +0.01%     
==========================================
  Files         438      438              
  Lines      123484   123622     +138     
  Branches    19426    19449      +23     
==========================================
+ Hits        87534    87652     +118     
- Misses      24442    24445       +3     
- Partials    11508    11525      +17     


@vaibhavSharma-cf vaibhavSharma-cf force-pushed the vaibhav/alarm-later-chain-fix branch from 898149e to ff1be18 Compare March 24, 2026 20:42
Member

@a-robinson a-robinson left a comment


Sorry for taking so long to get to this! I know I mentioned last week that I wouldn't be able to get to it quickly, but I should have suggested finding someone else to review it.

The nits are ignorable, but the new race condition seems worth fixing IIUC.

@vaibhavSharma-cf vaibhavSharma-cf force-pushed the vaibhav/alarm-later-chain-fix branch from ff1be18 to 31caef9 Compare April 13, 2026 19:09
@vaibhavSharma-cf vaibhavSharma-cf force-pushed the vaibhav/alarm-later-chain-fix branch from 31caef9 to b345115 Compare April 13, 2026 19:10
@vaibhavSharma-cf
Contributor Author

/bonk review

Contributor

@ask-bonk ask-bonk bot left a comment


Reviewing the current state of this PR in light of a-robinson's feedback.

Race condition status (a-robinson's main concern)

a-robinson's concern at scheduleLaterAlarm line 404 was about pendingLaterAlarmTime being drained concurrently with a move-earlier scheduleRun. Tracing through the code: this race is already handled by the existing pendingLaterAlarmTime = kj::none in startPrecommitAlarmScheduling() (line 432). When a move-earlier begins, the pending later time is cleared so the completion handler finds nothing to drain. The test at line 1276 ("coalesced move-later followed by move-earlier does not race") validates this exact scenario with a priorTask-respecting handler that would detect concurrent RPCs.
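
A sketch of why clearing the pending value on a move-earlier avoids the race, assuming a simplified model that logs RPCs instead of issuing them. `AlarmScheduling` and its members are hypothetical names for illustration and do not mirror the actual `ActorSqlite` fields or line numbers.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Simplified model (an assumption, not the workerd code). Each RPC that would
// be sent to the alarm manager is appended to rpcLog.
struct AlarmScheduling {
  bool laterIsInFlight = false;
  std::optional<int64_t> pendingLaterTime;
  std::vector<std::string> rpcLog;

  // Move-later: coalesce while a request is in flight.
  void moveLater(int64_t time) {
    if (laterIsInFlight) {
      pendingLaterTime = time;
      return;
    }
    laterIsInFlight = true;
    rpcLog.push_back("later:" + std::to_string(time));
  }

  // Move-earlier: drop any coalesced later time before issuing the request,
  // so the completion handler below has nothing left to drain.
  void moveEarlier(int64_t time) {
    pendingLaterTime.reset();
    rpcLog.push_back("earlier:" + std::to_string(time));
  }

  // Completion handler for the in-flight move-later request.
  void onLaterComplete() {
    assert(laterIsInFlight);
    laterIsInFlight = false;
    if (pendingLaterTime) {
      int64_t next = *pendingLaterTime;
      pendingLaterTime.reset();
      moveLater(next);
    }
  }
};
```

In this model, calling `moveLater` three times, then `moveEarlier`, then `onLaterComplete` logs only the first move-later and the move-earlier. No stale later time is sent after the earlier one.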

The author's reply ("Setting pendingLaterAlarmTime to none fix this") appears to be describing the existing code behavior, not proposing a new change. However, the reply was ambiguous and no new commit has been pushed since a-robinson's review, so a-robinson may still be waiting for a follow-up. It would be worth replying explicitly to a-robinson's comment to clarify that the fix is already present in the current code and point to the specific lines (432 and 983) where pendingLaterAlarmTime is cleared.

Style nits from a-robinson (still unaddressed)

The three style nits (unnecessary SpanParent() wrapper, unclear "tracking" comment, line length) were acknowledged in replies but no updated commit has been pushed. These should be addressed in a follow-up push.

No new issues found

The coalescing mechanism is correct and well-tested. Lifetime management is safe (commitTasks is a member of ActorSqlite, so its destructor cancels pending tasks before member destruction). The .catch_() on both the forked promise and the completion handler correctly prevents non-fatal move-later failures from breaking the output gate.


This review was generated by an AI assistant (ask-bonk) and may contain inaccuracies.

@ask-bonk
Contributor

ask-bonk bot commented Apr 13, 2026

Review posted on PR #6383. Summary:

  • Race condition (a-robinson's main concern): Already handled by pendingLaterAlarmTime = kj::none in startPrecommitAlarmScheduling() (line 432) and armAlarmHandler (line 983). The test at line 1276 validates this. The author's reply was ambiguous — recommended explicitly clarifying to a-robinson that the fix is already present.
  • Style nits: Three minor nits (unnecessary SpanParent() wrapper, unclear "tracking" comment, line length) acknowledged but not yet pushed as a new commit.
  • No new issues found: The coalescing mechanism is correct, lifetime management is safe, and error handling is properly structured.


@codspeed-hq

codspeed-hq bot commented Apr 13, 2026

Merging this PR will improve performance by 39.47%

⚠️ Different runtime environments detected

Some benchmarks with significant performance changes were compared across different runtime environments,
which may affect the accuracy of the results.


⚡ 2 improved benchmarks
✅ 68 untouched benchmarks
⏩ 129 skipped benchmarks (see footnote 1)

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|
| simpleStringBody[Response] | 26.9 µs | 19.3 µs | +39.47% |
| Encode_ASCII_32[TextEncoder][0/0/32] | 3.1 ms | 2.7 ms | +14.52% |

Comparing vaibhav/alarm-later-chain-fix (2ad6f83) with main (e6407c3)


Footnotes

  1. 129 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Member

@a-robinson a-robinson left a comment


LGTM!

@vaibhavSharma-cf vaibhavSharma-cf merged commit a58c44b into main Apr 13, 2026
22 checks passed
@vaibhavSharma-cf vaibhavSharma-cf deleted the vaibhav/alarm-later-chain-fix branch April 13, 2026 21:35