Alarm scheduler fires at sub-second cadence while WS upgrades cancel at 0ms — DO becomes functionally unreachable for WebSocket traffic
Library version: @cloudflare/containers@0.3.3
Container runtime: standard-1, single DO instance via idFromName("default")
Worker runtime: compatibility_date: 2026-04-01, nodejs_compat
Observed: recurring in production, ~20-30 min per incident, self-resolves
Frequency: multiple times per day; correlates loosely with long-lived WS sessions but triggers without a deploy
Symptoms
My Worker proxies WebSocket upgrades from browser clients through the Container DO to two container ports (one on defaultPort 5555, another on 5556 via switchPort). Periodically the DO enters a state where every WebSocket upgrade is canceled at the DO entrypoint within 0ms while HTTP fetches through the same DO continue to succeed.
From the Workers Observability Query API over a representative 30-minute window:
| signal | value |
| --- | --- |
| Total fetch events | ~1000 (query cap hit, likely more) |
| outcome=canceled | 1000 / 1000 |
| /ws events (DO entrypoint) | 988, of which 975 wallTime=0ms, 22 at 1-10s, 3 at >10s |
| /api/mock/ws events (Worker entrypoint, switchPort(_, 5556)) | 12, all with wallTime 2.5-4.8s (partysocket client connectionTimeout expiration) |
| /api/health via the same DO's fetch() method during the storm | 200 in ~225ms; container is healthy, DO handles HTTP fine |
| scriptVersion.id | single value across the window; no deploy churn |
| outcome=ok events | 50 / 1000, all eventType=alarm |
The zero ok outcomes on fetch events combined with the container's HTTP path working concurrently is the headline. The DO is selectively canceling WebSocket upgrades while processing HTTP.
Alarm cadence is abnormal
The alarm() handler in container.js is designed to sleep ~3 minutes inside the handler via await new Promise(resolve => setTimeout(resolve, timeout)) where timeout = min(3min, sleepAfterMs) (line ~1487 in 0.3.3 build). Expected alarm cadence when sleepAfter="1h": roughly one firing per 3 minutes.
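As I read it, the intended pacing reduces to something like the sketch below. This is my reconstruction from reading container.js, not the library's actual code; the function name is hypothetical.

```javascript
// Hypothetical reconstruction of the intended alarm pacing in
// @cloudflare/containers 0.3.3; names/structure are my assumptions.
const THREE_MIN_MS = 3 * 60 * 1000;

function intendedAlarmSleepMs(sleepAfterMs) {
  // The handler awaits setTimeout(timeout) before re-arming, where
  // timeout = min(3 min, sleepAfter), so cadence should be ~1 per 3 min.
  return Math.min(THREE_MIN_MS, sleepAfterMs);
}

// With sleepAfter = "1h" the designed inter-alarm gap is 180000ms.
console.log(intendedAlarmSleepMs(60 * 60 * 1000)); // 180000
```

If that sleep branch is skipped and the handler re-arms with something equivalent to setAlarm(Date.now()), the gap collapses to ~0, which is what the cadence below looks like.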
Observed cadence during the incident (50 alarm events captured, sorted asc, consecutive inter-arrival intervals in ms):
192, 372, 47, 589, 69, 514, 7, 473, 263, ...
Mean ~280ms, min 7ms, max ~590ms. That is ~600× faster than designed. The handler is not taking the 3-minute sleep path.
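For what it's worth, the quoted stats reproduce from the interval prefix shown above:

```javascript
// Recompute mean/min/max from the quoted interval prefix (ms).
// The full captured series has 49 intervals; these are just the values shown.
const intervals = [192, 372, 47, 589, 69, 514, 7, 473, 263];
const mean = intervals.reduce((a, b) => a + b, 0) / intervals.length;
console.log(Math.round(mean), Math.min(...intervals), Math.max(...intervals));
// → 281 7 589
```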
Each alarm event reports outcome=ok, so the handler isn't throwing. It's completing successfully but scheduling itself to fire again almost immediately.
Client-side view
Browser client uses partysocket/ws (1.1.16) with minReconnectionDelay=1000ms, maxReconnectionDelay=5000ms. During the incident the console accumulates many "WebSocket is closed before the connection is established" warnings: partysocket's 4s connectionTimeout firing on successive CONNECTING-state sockets that never reach OPEN.
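Assuming partysocket inherits reconnecting-websocket's backoff formula with the default grow factor of 1.3 (an assumption; I haven't verified partysocket's internals), the retry delays under my settings would look like:

```javascript
// Sketch of the client's reconnection backoff, assuming the
// reconnecting-websocket formula delay = min(minDelay * grow^retry, maxDelay)
// with grow = 1.3 (an assumption about partysocket's defaults).
const minDelay = 1000, maxDelay = 5000, grow = 1.3;
const delayForRetry = (retry) => Math.min(minDelay * Math.pow(grow, retry), maxDelay);
const delays = [0, 1, 2, 3, 4, 5, 6, 7].map(delayForRetry).map(Math.round);
console.log(delays); // → 1000, 1300, 1690, 2197, 2856, 3713, 4827, 5000
```

That cadence (attempts every 1-5s, each dying at the 4s connectionTimeout) matches the warning spam above.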
A protocol-level ringbuffer I maintain on the client shows that only 3 of these many attempts actually transition through OPEN during a 29-minute captured window:
total events: 19
transport:reconnected: 3
transport:disconnected: 3
send: 6
response: 7
window: 29 min
3 brief OPENs in 29 minutes, averaging 2 sends per cycle (handshake + compile) before disconnect. The DO does have windows where traffic gets through, but they're rare and short.
Hypothesis
scheduleNextAlarm() (or something that calls setAlarm(Date.now())) is being invoked on every alarm cycle without taking the 3-minute sleep path in alarm(). The tight alarm loop preempts the DO event loop; incoming WS upgrade fetches are canceled at 0ms because they can never acquire uninterrupted handler time.
HTTP fetches succeed because they complete within a single event-loop tick through containerFetch. WS upgrades need multiple async steps (accept both sides of the WebSocketPair, attach forwarding handlers) and can't squeeze through between alarm firings.
I haven't identified the specific code path that produces the loop from reading container.js alone.
What would help most
A way to log the alarm scheduler's branch decisions (which of Path A/B/C/D in the alarm() handler is being taken during the storm) so I can identify the loop's provenance.
If I reproduce this in a stripped-down project, is there a preferred shape the repro should take for triage?
Related issues I reviewed: "sleepAfter alarm fires during long-running containerFetch() operations" (#162) and "Documentation / functionality of start method incorrect?" (#123). I see "WebSocket connections don't renew activity timeout - container sleeps despite active connections" (#147, "WS activity doesn't renew sleepAfter") discussed, but my sleepAfter is 1h and the storms fire well inside that window, so the renewal-failure hypothesis doesn't fully fit.
Happy to share more detail from my Observability queries. I have structured debug logs throughout my Worker/DO surface; if there's an additional log or query that would isolate the library's state transitions during the storm, I can add it.
Thanks for the library, and thanks in advance for taking a look.