
Alarm scheduler fires at sub-second cadence while WS upgrades cancel at 0ms — DO becomes functionally unreachable for WebSocket traffic #189

Description

Library version: @cloudflare/containers@0.3.3
Container runtime: standard-1, single DO instance via idFromName("default")
Worker runtime: compatibility_date: 2026-04-01, nodejs_compat
Observed: recurring in production, ~20-30 min per incident, self-resolves
Frequency: multiple times per day, correlates loosely with long-lived WS sessions but triggers without a deploy

Symptoms

My Worker proxies WebSocket upgrades from browser clients through the Container DO to two container ports (one on defaultPort 5555, another on 5556 via switchPort). Periodically the DO enters a state where every WebSocket upgrade is canceled at the DO entrypoint within 0ms while HTTP fetches through the same DO continue to succeed.
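For context, the wiring is essentially the sketch below: a reduced version, not my full Worker. `MyContainer` and the `CONTAINER` binding are placeholder names, and I'm assuming the library's documented `Container`/`getContainer`/`switchPort` exports behave as in the docs:

```ts
import { Container, getContainer, switchPort } from "@cloudflare/containers";

// Reduced sketch of the wiring; MyContainer and CONTAINER are placeholder names.
export class MyContainer extends Container {
  defaultPort = 5555; // primary WS endpoint
  sleepAfter = "1h";  // storms fire well inside this window
}

export default {
  async fetch(request: Request, env: { CONTAINER: DurableObjectNamespace }): Promise<Response> {
    const container = getContainer(env.CONTAINER, "default"); // idFromName("default") under the hood
    const url = new URL(request.url);

    if (url.pathname === "/api/mock/ws") {
      // second container port, selected via switchPort
      return container.fetch(switchPort(request, 5556));
    }
    // /ws upgrades and /api/health ride the defaultPort path
    return container.fetch(request);
  },
};
```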

From the Workers Observability Query API over a representative 30-minute window:

| Signal | Value |
| --- | --- |
| Total fetch events | ~1000 (query cap hit; likely more) |
| `outcome=canceled` | 1000 / 1000 |
| `/ws` events (DO entrypoint) | 988, of which 975 at wallTime=0ms, 22 at 1-10s, 3 at >10s |
| `/api/mock/ws` events (Worker entrypoint, `switchPort(_, 5556)`) | 12, all with wallTime 2.5-4.8s (partysocket client connectionTimeout expiration) |
| `/api/health` via the same DO's `fetch()` during the storm | 200 in ~225ms; container is healthy, DO handles HTTP fine |
| `scriptVersion.id` | single value across the window; no deploy churn |
| `outcome=ok` events | 50 / 1000, all `eventType=alarm` |

The headline is the combination: zero ok outcomes on fetch events while the container's HTTP path works concurrently. The DO is selectively canceling WebSocket upgrades while continuing to serve HTTP.

Alarm cadence is abnormal

The alarm() handler in container.js is designed to sleep ~3 minutes inside the handler via `await new Promise(resolve => setTimeout(resolve, timeout))`, where `timeout = min(3min, sleepAfterMs)` (around line 1487 in the 0.3.3 build). Expected alarm cadence when sleepAfter="1h": roughly one firing per 3 minutes.
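For readers without the bundle open, the designed flow as I read it is roughly the sketch below. This is a paraphrase under my own names (AlarmSketch, MAX_SLEEP_MS), not the library's verbatim code:

```ts
// Paraphrase of the designed alarm() flow as I read the 0.3.3 bundle.
// Illustrative only: the class and constant names here are mine.
class AlarmSketch {
  constructor(private ctx: DurableObjectState, private sleepAfterMs: number) {}

  async alarm(): Promise<void> {
    const MAX_SLEEP_MS = 3 * 60 * 1000;
    const timeout = Math.min(MAX_SLEEP_MS, this.sleepAfterMs);

    // Designed path: park inside the invocation for up to 3 minutes...
    await new Promise<void>((resolve) => setTimeout(resolve, timeout));

    // ...then re-arm, yielding roughly one firing per 3 minutes.
    await this.ctx.storage.setAlarm(Date.now());
  }
}
```

A sub-second cadence is exactly what this shape degenerates into if the sleep is skipped while the re-arm still runs.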

Observed cadence during the incident (50 alarm events captured, sorted asc, consecutive inter-arrival intervals in ms):

192, 372, 47, 589, 69, 514, 7, 473, 263, ...

Mean ~280ms, min 7ms, max ~590ms. That is ~600× faster than designed. The handler is not taking the 3-minute sleep path.
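For transparency, the stats above are just a fold over the sorted event timestamps (illustrative helper, not library code):

```ts
// Inter-arrival stats over the sorted alarm event timestamps (epoch ms).
function intervalStats(timestampsAsc: number[]) {
  const deltas = timestampsAsc.slice(1).map((t, i) => t - timestampsAsc[i]);
  const mean = deltas.reduce((a, b) => a + b, 0) / deltas.length;
  return { mean, min: Math.min(...deltas), max: Math.max(...deltas) };
}
// Across the 50 captured events: mean ~280ms vs the designed ~180,000ms.
```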

Each alarm event reports outcome=ok, so the handler isn't throwing. It's completing successfully but scheduling itself to fire again almost immediately.

Client-side view

Browser client uses partysocket/ws (1.1.16) with minReconnectionDelay=1000ms, maxReconnectionDelay=5000ms. During the incident the console accumulates many "WebSocket is closed before the connection is established." warnings: partysocket's 4s connectionTimeout firing on successive CONNECTING-state sockets that never reach OPEN.
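Reduced to the options that matter here (assuming partysocket/ws keeps reconnecting-websocket's `(url, protocols, options)` constructor shape; the URL is a placeholder):

```ts
import WebSocket from "partysocket/ws";

// Placeholder URL; only the options relevant to the incident are shown.
const ws = new WebSocket("wss://example.com/ws", [], {
  minReconnectionDelay: 1000, // ms
  maxReconnectionDelay: 5000, // ms
  connectionTimeout: 4000,    // the 4s timeout behind the console warnings
});

ws.addEventListener("open", () => console.log("OPEN"));   // rare during a storm
ws.addEventListener("close", () => console.log("CLOSE")); // frequent during a storm
```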

A protocol-level ringbuffer I maintain on the client shows that only 3 of these many attempts actually transition through OPEN during a 29-minute captured window:

```
total events: 19
  transport:reconnected:  3
  transport:disconnected: 3
  send:                   6
  response:               7
window: 29 min
```

Three brief OPENs across 29 minutes, averaging two sends per cycle (handshake + compile) before the disconnect. The DO does have windows of reachability, but they are rare and short.
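The ringbuffer itself is my own debugging aid, nothing from the library; its shape is roughly:

```ts
// Client-side protocol ringbuffer (my own debugging aid, not library code).
type WireEvent = {
  t: number; // epoch ms
  kind: "transport:reconnected" | "transport:disconnected" | "send" | "response";
};

class ProtocolRing {
  private buf: WireEvent[] = [];
  constructor(private capacity = 256) {}

  push(e: WireEvent): void {
    this.buf.push(e);
    if (this.buf.length > this.capacity) this.buf.shift(); // drop oldest
  }

  // Counts by kind, i.e. the summary printed above.
  summary(): Record<string, number> {
    return this.buf.reduce<Record<string, number>>((acc, e) => {
      acc[e.kind] = (acc[e.kind] ?? 0) + 1;
      return acc;
    }, {});
  }
}
```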

Hypothesis

scheduleNextAlarm() (or something that calls setAlarm(Date.now())) is being invoked on every alarm cycle without taking the 3-minute sleep path in alarm(). The tight alarm loop starves the DO's event loop; incoming WS upgrade fetches are canceled at 0ms because they never get an uninterrupted slice of handler time.

HTTP fetches succeed because they complete within a single event-loop pass through containerFetch. WS upgrades have to complete several async steps (accept the server end of a WebSocketPair, attach forwarding handlers, return the 101) and can't squeeze through between alarm firings.
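For concreteness, a generic DO-side upgrade path looks like this (standard Workers API, not the library's internal proxy; containerWs stands in for an already-upgraded socket to the container):

```ts
// Generic DO-side upgrade path (standard Workers API, not the library's
// internal proxy). Every step below must run before the client sees OPEN.
function proxyUpgrade(containerWs: WebSocket): Response {
  const pair = new WebSocketPair();
  const [client, server] = Object.values(pair);

  server.accept(); // 1. accept the DO-side end

  // 2. attach forwarding in both directions
  server.addEventListener("message", (e) => containerWs.send(e.data));
  containerWs.addEventListener("message", (e) => server.send(e.data));

  // 3. hand the other end back to the client with a 101
  return new Response(null, { status: 101, webSocket: client });
}
```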

I haven't identified the specific code path that produces the loop from reading container.js alone.

What would help most

  1. A way to log the alarm scheduler's branch decisions (which of Path A/B/C/D in the alarm() handler is being taken during the storm) so I can identify the loop's provenance.
  2. Confirmation that this matches a known failure mode, perhaps a flavor of #147 ("WebSocket connections don't renew activity timeout - container sleeps despite active connections"), #162 ("sleepAfter alarm fires during long-running containerFetch() operations"), or #123 ("Documentation / functionality of start method incorrect?"). #147's renewal-failure hypothesis doesn't fully fit here: my sleepAfter is 1h and the storms fire well inside that window.
  3. If I reproduce this in a stripped-down project, is there a preferred shape the repro should take for triage?

Happy to share more detail from my Observability queries. I have structured debug logs throughout my Worker/DO surface; if there's an additional log or query that would isolate the library's state transitions during the storm, I can add it.

Thanks for the library, and thanks in advance for taking a look.
