Prevent unbounded alarm scheduling promise chain in ActorSqlite.#6383
Prevent unbounded alarm scheduling promise chain in ActorSqlite.#6383vaibhavSharma-cf merged 2 commits intomainfrom
Conversation
|
All contributors have signed the CLA ✍️ ✅ |
|
I have read the CLA Document and I hereby sign the CLA |
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
This PR replaces an unbounded alarmLaterChain promise chain with a coalescing mechanism (scheduleLaterAlarm) that caps move-later alarm scheduling to at most one in-flight request plus one pending value. Good fix for a real production issue.
Issues
-
alarmLaterIsInFlightdesync whenarmAlarmHandlerbypassesscheduleLaterAlarm(medium severity, correctness)armAlarmHandlerdirectly overwritesalarmLaterInFlightwithout updatingalarmLaterIsInFlightorpendingLaterAlarmTime. IfscheduleLaterAlarmhad previously started a request (settingalarmLaterIsInFlight = trueand adding a completion handler at line 408), the old completion handler still holds a branch of the old forked promise. When that old branch resolves, its.thencallback setsalarmLaterIsInFlight = false-- but by then,alarmLaterInFlightpoints to the new promise fromarmAlarmHandler, which may still be pending. This meansalarmLaterIsInFlightbecomesfalseprematurely, and a subsequentscheduleLaterAlarmcall could start a new request that races with thearmAlarmHandler-initiated one.See inline suggestion for a minimal fix.
|
Review posted successfully on PR #6383. Here's a summary: One correctness issue found: In The overall approach (replacing unbounded chaining with a coalescing mechanism) is sound and well-tested. |
|
APIError: This model does not support assistant message prefill. The conversation must end with a user message. |
|
@vaibhavSharma-cf Bonk workflow failed. Check the logs for details. View workflow run · To retry, trigger Bonk again. |
dbef72c to
898149e
Compare
There was a problem hiding this comment.
Retraction of my previous review comment. After more careful analysis, the issue I flagged about alarmLaterIsInFlight desynchronization in armAlarmHandler is incorrect. The current code is correct as written. Here is why:
alarmLaterIsInFlight tracks whether the completion handler (added to commitTasks by scheduleLaterAlarm() at line 405) is still pending — not whether alarmLaterInFlight has an unresolved promise. When armAlarmHandler overwrites alarmLaterInFlight without touching alarmLaterIsInFlight, this is intentional:
- The old completion handler (from the original
scheduleLaterAlarmcall) is still pending incommitTasks. - Leaving
alarmLaterIsInFlight = truemeans subsequentscheduleLaterAlarmcalls correctly coalesce intopendingLaterAlarmTimerather than starting a racing request. - When the old completion handler eventually fires, it sets
alarmLaterIsInFlight = falseand drains any accumulatedpendingLaterAlarmTime, chaining off the currentalarmLaterInFlight(which now points toarmAlarmHandler's fork).
My previous suggestion to reset alarmLaterIsInFlight = false and pendingLaterAlarmTime = kj::none would have been harmful — it would allow the next scheduleLaterAlarm call to start a new in-flight request immediately, racing with both the old completion handler and the armAlarmHandler-initiated fork.
The second test (armAlarmHandler with coalesced pending alarms schedules reschedule exactly once) validates this exact interaction and confirms the code is correct.
No issues found in this updated review. The coalescing mechanism is sound, lifetime management is correct (forked promise branches keep shared state alive via refcounting; commitTasks destruction cancels pending tasks safely), and the .catch_() on alarmLaterInFlight before .fork() ensures the completion handler always fires on the success path.
This review was generated by an AI assistant (ask-bonk) and may contain inaccuracies.
|
Review posted on PR #6383. Here's a summary: Retracted the previous finding. After careful re-analysis, the
No new issues found. The coalescing mechanism is sound, lifetime management is correct, and error handling is properly structured. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #6383 +/- ##
==========================================
+ Coverage 70.88% 70.90% +0.01%
==========================================
Files 438 438
Lines 123484 123622 +138
Branches 19426 19449 +23
==========================================
+ Hits 87534 87652 +118
- Misses 24442 24445 +3
- Partials 11508 11525 +17 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
898149e to
ff1be18
Compare
a-robinson
left a comment
There was a problem hiding this comment.
Sorry for taking so long to get to this! I know I mentioned last week that I wouldn't be able to get to it quickly, but I should have suggested finding someone else to review it.
The nits are ignorable, but the new race condition seems worth fixing IIUC.
ff1be18 to
31caef9
Compare
When user code repeatedly moves a Durable Object alarm to a later time and scheduleRun calls to the alarm manager take longer than local SQLite commits (common in production), the alarmLaterChain promise chain grew without bound. This caused alarms to fire at times far in the past and could block SRS commits for arbitrarily long when a subsequent setAlarm moved the alarm earlier. Replace the unbounded chain with a coalescing mechanism: at most one scheduleRun is in-flight at a time, and rapid intermediate alarm times are collapsed into a single pending value.
31caef9 to
b345115
Compare
|
/bonk review |
There was a problem hiding this comment.
Reviewing the current state of this PR in light of a-robinson's feedback.
Race condition status (a-robinson's main concern)
a-robinson's concern at scheduleLaterAlarm line 404 was about pendingLaterAlarmTime being drained concurrently with a move-earlier scheduleRun. Tracing through the code: this race is already handled by the existing pendingLaterAlarmTime = kj::none in startPrecommitAlarmScheduling() (line 432). When a move-earlier begins, the pending later time is cleared so the completion handler finds nothing to drain. The test at line 1276 ("coalesced move-later followed by move-earlier does not race") validates this exact scenario with a priorTask-respecting handler that would detect concurrent RPCs.
The author's reply ("Setting pendingLaterAlarmTime to none fix this") appears to be describing the existing code behavior, not proposing a new change. However, the reply was ambiguous and no new commit has been pushed since a-robinson's review, so a-robinson may still be waiting for a follow-up. It would be worth replying explicitly to a-robinson's comment to clarify that the fix is already present in the current code and point to the specific lines (432 and 983) where pendingLaterAlarmTime is cleared.
Style nits from a-robinson (still unaddressed)
The three style nits (unnecessary SpanParent() wrapper, unclear "tracking" comment, line length) were acknowledged in replies but no updated commit has been pushed. These should be addressed in a follow-up push.
No new issues found
The coalescing mechanism is correct and well-tested. Lifetime management is safe (commitTasks is a member of ActorSqlite, so its destructor cancels pending tasks before member destruction). The .catch_() on both the forked promise and the completion handler correctly prevents non-fatal move-later failures from breaking the output gate.
This review was generated by an AI assistant (ask-bonk) and may contain inaccuracies.
|
Review posted on PR #6383. Summary:
|
Merging this PR will improve performance by 39.47%
|
| Benchmark | BASE |
HEAD |
Efficiency | |
|---|---|---|---|---|
| ⚡ | simpleStringBody[Response] |
26.9 µs | 19.3 µs | +39.47% |
| ⚡ | Encode_ASCII_32[TextEncoder][0/0/32] |
3.1 ms | 2.7 ms | +14.52% |
Comparing vaibhav/alarm-later-chain-fix (2ad6f83) with main (e6407c3)
Footnotes
-
129 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩
When user code repeatedly moves a Durable Object alarm to a later time and scheduleRun calls to the alarm manager take longer than local SQLite commits (common in production), the alarmLaterChain promise chain grew without bound. This caused alarms to fire at times far in the past and could block SRS commits for arbitrarily long when a subsequent setAlarm moved the alarm earlier.
Replace the unbounded chain with a coalescing mechanism: at most one scheduleRun is in-flight at a time, and rapid intermediate alarm times are collapsed into a single pending value.