Skip to content

containers: Fix alarm scheduler spinloop under concurrent callers#191

Open
quantizor wants to merge 2 commits intocloudflare:mainfrom
quantizor:fix/alarm-scheduler-spinloop
Open

containers: Fix alarm scheduler spinloop under concurrent callers#191
quantizor wants to merge 2 commits intocloudflare:mainfrom
quantizor:fix/alarm-scheduler-spinloop

Conversation

@quantizor
Copy link
Copy Markdown
Contributor

@quantizor quantizor commented Apr 22, 2026

Fixes #189.

Symptom

Under load, the alarm() handler on Container fires at sub-second cadence instead of the designed 3-minute cadence. The Durable Object event loop saturates, and WebSocket upgrades queued behind the alarm get canceled at 0ms wallclock while HTTP traffic to the same DO continues to succeed.

Observed on a single production DO over 30 minutes: 50 eventType=alarm events with inter-arrival times ranging 7ms to 589ms (mean ~280ms, matching INSTANCE_POLL_INTERVAL_MS almost exactly). 100% of 1,000 fetch events in the window had outcome=canceled, of which 975/988 /ws events had wallTime=0ms. /api/health through the same DO's fetch() method during the storm succeeded in ~225ms. No deploy churn (single scriptVersion.id across the window).

Root cause

The alarm() handler paired an in-memory setTimeout sleep (up to 3 minutes) with an unconditional setAlarm(Date.now()) on exit:

// before — container.ts 1968–1980
await new Promise<void>(resolve => {
  this.resolve = resolve;
  if (!this.container.running) { resolve(); return; }
  this.timeout = setTimeout(() => { resolve(); }, timeout);
});

await this.ctx.storage.setAlarm(Date.now()); // ← unconditional

Any external call to scheduleNextAlarm() during the sleep resolved the handler's internal Promise via this.resolve() and set the storage alarm to Date.now() + 1000:

// before — scheduleNextAlarm 2012–2023
if (this.timeout) {
  if (this.resolve) this.resolve(); // ← short-circuits alarm()'s sleep
  clearTimeout(this.timeout);
}
await this.ctx.storage.setAlarm(nextTime); // Date.now() + 1000

The handler then resumed past its Promise, ran setAlarm(Date.now()) at the exit path, and overwrote the 1-second future alarm with "fire now". The runtime refired the alarm immediately, the new handler entered another 3-minute sleep, and the next external caller short-circuited that one too — a self-sustaining cascade.

The most reliable external trigger is startContainerIfNotRunning's retry loop, which calls scheduleNextAlarm() on every iteration (27 retries × INSTANCE_POLL_INTERVAL_MS=300ms). One stuck containerFetch against an unhealthy container produces ~27 scheduleNextAlarm() calls at 300ms intervals, which matches the observed 280ms mean cadence.

Fix

Replace the Promise/setTimeout pattern with direct storage-backed re-arming:

  • alarm() completes its work and sets the storage alarm to min(next due schedule, sleepAfterMs, Date.now() + MAX_ALARM_REARM_MS), floored at Date.now() + MIN_ALARM_REARM_MS (100ms). No in-memory setTimeout sleep. No this.resolve / this.timeout coordination.
  • scheduleNextAlarm() is idempotent: reads getAlarm(), no-ops if an existing alarm is already sooner than the request. Concurrent callers converge on the earliest requested time instead of clobbering each other.
// after — alarm() exit path
const requested = Math.min(
  minTimeFromSchedules,
  this.sleepAfterMs,
  Date.now() + MAX_ALARM_REARM_MS,
);
const nextAlarm = Math.max(requested, Date.now() + MIN_ALARM_REARM_MS);
await this.ctx.storage.setAlarm(nextAlarm);
// after — scheduleNextAlarm
const requested = Date.now() + Math.max(ms, MIN_ALARM_REARM_MS);
const existing = await this.ctx.storage.getAlarm();
if (existing !== null && existing <= requested) return;
await this.ctx.storage.setAlarm(requested);

Tests

Adds 13 unit tests in src/tests/alarm-scheduler.test.ts. 7 fail on main (including the direct spinloop regression — scheduleNextAlarm during an in-progress alarm does not cause immediate re-fire on exit — and a cadence-invariant test that times out at Jest's 5-second limit because the handler spins). All 13 pass with the fix.

The existing src/tests/container.test.ts was broken on main (couldn't resolve cloudflare:workers) and never ran in CI — the npm test script only iterates examples/*/test/. Jest config now maps cloudflare:workers to a local test stub (src/tests/__mocks__/cloudflare-workers.ts) and uses clearMocks: true for between-test spy hygiene. The stale container.test.ts is untouched in this PR.

Repro

The bug is deterministic. To reproduce without production traffic:

  1. Subclass Container, set sleepAfter = '1h', defaultPort set to a port the container won't listen on.
  2. From a Worker, fire await container.fetch(...) in a loop.
  3. Observe alarm cadence via ctx.storage.getAlarm() snapshots or CF Workers Observability on eventType=alarm.

Expected on main: ~300ms cadence. Expected after the fix: no more frequent than MIN_ALARM_REARM_MS (100ms), in practice much slower since schedule and sleepAfterMs don't change that frequently.

Follow-ups not included

  1. Drop MAX_ALARM_REARM_MS heartbeat in favor of explicit scheduleNextAlarm() calls from setupMonitorCallbacks after container state transitions. Clean but changes when onStop fires after a crash, so it deserves its own review.
  2. Subclass sleepAfter override applies after the base class's constructor calls renewActivityTimeout(), so the initial sleepAfterMs is always computed from DEFAULT_SLEEP_AFTER = '10m'. Self-corrects on the first containerFetch, but the 10-minute initial window can surprise users with longer sleepAfter. Orthogonal to the spinloop; worth a separate fix.
  3. inflightRequests leak hardening. If a WebSocket's close never fires (client disconnects abruptly without a close frame), the counter is permanently elevated and isActivityExpired() returns false forever. Not the storm bug, but worth capping or sweeping in a follow-up.

@quantizor quantizor requested a review from a team as a code owner April 22, 2026 16:18
@quantizor quantizor force-pushed the fix/alarm-scheduler-spinloop branch 2 times, most recently from 04dc96e to d096ae6 Compare April 22, 2026 16:29
The `alarm()` handler paired an in-memory `setTimeout` sleep (up to 3
minutes) with an unconditional `setAlarm(Date.now())` on exit. Any
external call to `scheduleNextAlarm()` during the sleep resolved the
handler's internal Promise via `this.resolve()`, letting the handler
proceed to the exit path and overwrite the caller's future alarm
timestamp with one scheduled for "now". The runtime would then refire
the alarm immediately, the new handler would enter another 3-minute
sleep, and the next external caller would short-circuit that one too —
producing a self-sustaining cascade of sub-second alarm fires.

In production this manifested under `startAndWaitForPorts` retry loops
(27 retries at INSTANCE_POLL_INTERVAL_MS=300ms per unhealthy
`containerFetch`) and partysocket reconnect storms: alarm cadence fell
to ~280ms mean with 7ms minima, matching the 300ms poll interval almost
exactly. The saturated DO event loop canceled WebSocket upgrades at 0ms
wallclock while HTTP traffic on the same DO continued to succeed.

Replace the Promise/setTimeout pattern with direct re-arming on storage:
`alarm()` completes its work and sets the storage alarm to the earliest
of the next due schedule, `sleepAfter` expiration, or a 3-minute
heartbeat, floored at MIN_ALARM_REARM_MS (100ms) so concurrent callers
cannot drive cadence below the floor. `scheduleNextAlarm()` is
idempotent: if an existing alarm is already set to fire sooner, the
call is a no-op, which converges concurrent callers on the earliest
requested time instead of having them clobber each other.

Adds unit tests in `src/tests/alarm-scheduler.test.ts` covering the
regression, durability invariants, and concurrent-caller settlement.
Jest config now maps `cloudflare:workers` to a local test stub and uses
`clearMocks: true` so the suite exercises the full alarm state machine
without a workerd runtime.

No behavior change for activity renewal, connection handling, or
`onStart`/`onStop` lifecycle hooks.
@quantizor quantizor force-pushed the fix/alarm-scheduler-spinloop branch from d096ae6 to 0b63489 Compare April 22, 2026 16:42
…racking

Pins the invariant from cloudflare#147 now that it's fixed
on main by commits 45ca64a / c88e29f / 828e8c8: an open WebSocket (or
any in-flight proxied request) must keep the container alive via the
`inflightRequests` counter even after the `sleepAfter` window has
elapsed. The reverse path is covered too — once the counter drops to
zero and `sleepAfterMs` is past, `onActivityExpired` fires.

Red-checked by reverting the `inflightRequests > 0` branch in
`isActivityExpired`: the positive test fails, the negative test
passes — exactly what you'd want from a regression net for this class
of bug.

No library code changes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Alarm scheduler fires at sub-second cadence while WS upgrades cancel at 0ms — DO becomes functionally unreachable for WebSocket traffic

1 participant