
Alarm scheduler fires at sub-second cadence while WS upgrades cancel at 0ms — DO becomes functionally unreachable for WebSocket traffic #189

Description

Library version: @cloudflare/containers@0.3.3
Container runtime: standard-1, single DO instance via idFromName("default")
Worker runtime: compatibility_date: 2026-04-01, nodejs_compat
Observed: recurring in production, ~20-30 min per incident, self-resolves
Frequency: multiple times per day, correlates loosely with long-lived WS sessions but triggers without a deploy

Symptoms

My Worker proxies WebSocket upgrades from browser clients through the Container DO to two container ports (one on defaultPort 5555, another on 5556 via switchPort). Periodically the DO enters a state where every WebSocket upgrade is canceled at the DO entrypoint within 0ms while HTTP fetches through the same DO continue to succeed.
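For context, the wiring is essentially the sketch below: a reduced version, not my full Worker. `MyContainer` and the `CONTAINER` binding are placeholder names, and I'm assuming the library's documented `Container`/`getContainer`/`switchPort` exports behave as in the docs:

```ts
import { Container, getContainer, switchPort } from "@cloudflare/containers";

// Reduced sketch of the wiring; MyContainer and CONTAINER are placeholder names.
export class MyContainer extends Container {
  defaultPort = 5555; // primary WS endpoint
  sleepAfter = "1h";  // storms fire well inside this window
}

export default {
  async fetch(request: Request, env: { CONTAINER: DurableObjectNamespace }): Promise<Response> {
    const container = getContainer(env.CONTAINER, "default"); // idFromName("default") under the hood
    const url = new URL(request.url);

    if (url.pathname === "/api/mock/ws") {
      // second container port, selected via switchPort
      return container.fetch(switchPort(request, 5556));
    }
    // /ws upgrades and /api/health ride the defaultPort path
    return container.fetch(request);
  },
};
```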

From the Workers Observability Query API over a representative 30-minute window:

| Signal | Value |
| --- | --- |
| Total fetch events | ~1000 (query cap hit; likely more) |
| `outcome=canceled` | 1000 / 1000 |
| `/ws` events (DO entrypoint) | 988, of which 975 at wallTime=0ms, 22 at 1-10s, 3 at >10s |
| `/api/mock/ws` events (Worker entrypoint, `switchPort(_, 5556)`) | 12, all with wallTime 2.5-4.8s (partysocket client connectionTimeout expiration) |
| `/api/health` via the same DO's `fetch()` during the storm | 200 in ~225ms; container is healthy, DO handles HTTP fine |
| `scriptVersion.id` | single value across the window; no deploy churn |
| `outcome=ok` events | 50 / 1000, all `eventType=alarm` |

The headline is the combination: zero ok outcomes on fetch events while the container's HTTP path works concurrently. The DO is selectively canceling WebSocket upgrades while continuing to serve HTTP.

Alarm cadence is abnormal

The alarm() handler in container.js is designed to sleep ~3 minutes inside the handler via `await new Promise(resolve => setTimeout(resolve, timeout))`, where `timeout = min(3min, sleepAfterMs)` (around line 1487 in the 0.3.3 build). Expected alarm cadence when sleepAfter="1h": roughly one firing per 3 minutes.
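For readers without the bundle open, the designed flow as I read it is roughly the sketch below. This is a paraphrase under my own names (AlarmSketch, MAX_SLEEP_MS), not the library's verbatim code:

```ts
// Paraphrase of the designed alarm() flow as I read the 0.3.3 bundle.
// Illustrative only: the class and constant names here are mine.
class AlarmSketch {
  constructor(private ctx: DurableObjectState, private sleepAfterMs: number) {}

  async alarm(): Promise<void> {
    const MAX_SLEEP_MS = 3 * 60 * 1000;
    const timeout = Math.min(MAX_SLEEP_MS, this.sleepAfterMs);

    // Designed path: park inside the invocation for up to 3 minutes...
    await new Promise<void>((resolve) => setTimeout(resolve, timeout));

    // ...then re-arm, yielding roughly one firing per 3 minutes.
    await this.ctx.storage.setAlarm(Date.now());
  }
}
```

A sub-second cadence is exactly what this shape degenerates into if the sleep is skipped while the re-arm still runs.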

Observed cadence during the incident (50 alarm events captured, sorted asc, consecutive inter-arrival intervals in ms):

192, 372, 47, 589, 69, 514, 7, 473, 263, ...

Mean ~280ms, min 7ms, max ~590ms. That is ~600× faster than designed. The handler is not taking the 3-minute sleep path.
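For transparency, the stats above are just a fold over the sorted event timestamps (illustrative helper, not library code):

```ts
// Inter-arrival stats over the sorted alarm event timestamps (epoch ms).
function intervalStats(timestampsAsc: number[]) {
  const deltas = timestampsAsc.slice(1).map((t, i) => t - timestampsAsc[i]);
  const mean = deltas.reduce((a, b) => a + b, 0) / deltas.length;
  return { mean, min: Math.min(...deltas), max: Math.max(...deltas) };
}
// Across the 50 captured events: mean ~280ms vs the designed ~180,000ms.
```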

Each alarm event reports outcome=ok, so the handler isn't throwing. It's completing successfully but scheduling itself to fire again almost immediately.

Client-side view

Browser client uses partysocket/ws (1.1.16) with minReconnectionDelay=1000ms, maxReconnectionDelay=5000ms. During the incident the console accumulates many "WebSocket is closed before the connection is established." warnings: partysocket's 4s connectionTimeout firing on successive CONNECTING-state sockets that never reach OPEN.
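Reduced to the options that matter here (assuming partysocket/ws keeps reconnecting-websocket's `(url, protocols, options)` constructor shape; the URL is a placeholder):

```ts
import WebSocket from "partysocket/ws";

// Placeholder URL; only the options relevant to the incident are shown.
const ws = new WebSocket("wss://example.com/ws", [], {
  minReconnectionDelay: 1000, // ms
  maxReconnectionDelay: 5000, // ms
  connectionTimeout: 4000,    // the 4s timeout behind the console warnings
});

ws.addEventListener("open", () => console.log("OPEN"));   // rare during a storm
ws.addEventListener("close", () => console.log("CLOSE")); // frequent during a storm
```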

A protocol-level ringbuffer I maintain on the client shows that only 3 of these many attempts actually transition through OPEN during a 29-minute captured window:

```
total events: 19
  transport:reconnected:  3
  transport:disconnected: 3
  send:                   6
  response:               7
window: 29 min
```

Three brief OPENs across 29 minutes, averaging two sends per cycle (handshake + compile) before the disconnect. The DO does have windows of reachability, but they are rare and short.
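The ringbuffer itself is my own debugging aid, nothing from the library; its shape is roughly:

```ts
// Client-side protocol ringbuffer (my own debugging aid, not library code).
type WireEvent = {
  t: number; // epoch ms
  kind: "transport:reconnected" | "transport:disconnected" | "send" | "response";
};

class ProtocolRing {
  private buf: WireEvent[] = [];
  constructor(private capacity = 256) {}

  push(e: WireEvent): void {
    this.buf.push(e);
    if (this.buf.length > this.capacity) this.buf.shift(); // drop oldest
  }

  // Counts by kind, i.e. the summary printed above.
  summary(): Record<string, number> {
    return this.buf.reduce<Record<string, number>>((acc, e) => {
      acc[e.kind] = (acc[e.kind] ?? 0) + 1;
      return acc;
    }, {});
  }
}
```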

Hypothesis

scheduleNextAlarm() (or something that calls setAlarm(Date.now())) is being invoked on every alarm cycle without taking the 3-minute sleep path in alarm(). The tight alarm loop starves the DO's event loop; incoming WS upgrade fetches are canceled at 0ms because they never get an uninterrupted slice of handler time.

HTTP fetches succeed because they complete within a single event-loop pass through containerFetch. WS upgrades have to complete several async steps (accept the server end of a WebSocketPair, attach forwarding handlers, return the 101) and can't squeeze through between alarm firings.
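For concreteness, a generic DO-side upgrade path looks like this (standard Workers API, not the library's internal proxy; containerWs stands in for an already-upgraded socket to the container):

```ts
// Generic DO-side upgrade path (standard Workers API, not the library's
// internal proxy). Every step below must run before the client sees OPEN.
function proxyUpgrade(containerWs: WebSocket): Response {
  const pair = new WebSocketPair();
  const [client, server] = Object.values(pair);

  server.accept(); // 1. accept the DO-side end

  // 2. attach forwarding in both directions
  server.addEventListener("message", (e) => containerWs.send(e.data));
  containerWs.addEventListener("message", (e) => server.send(e.data));

  // 3. hand the other end back to the client with a 101
  return new Response(null, { status: 101, webSocket: client });
}
```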

I haven't identified the specific code path that produces the loop from reading container.js alone.

What would help most

  1. A way to log the alarm scheduler's branch decisions (which of Path A/B/C/D in the alarm() handler is being taken during the storm) so I can identify the loop's provenance.
  2. Confirmation that this matches a known failure mode, perhaps a flavor of #147 ("WebSocket connections don't renew activity timeout - container sleeps despite active connections"), #162 ("sleepAfter alarm fires during long-running containerFetch() operations"), or #123 ("Documentation / functionality of start method incorrect?"). #147's renewal-failure hypothesis doesn't fully fit here: my sleepAfter is 1h and the storms fire well inside that window.
  3. If I reproduce this in a stripped-down project, is there a preferred shape the repro should take for triage?

Happy to share more detail from my Observability queries. I have structured debug logs throughout my Worker/DO surface; if there's an additional log or query that would isolate the library's state transitions during the storm, I can add it.

Thanks for the library, and thanks in advance for taking a look.
