
Investigate why dropping socket.io polling fallback degrades p95 under load (#7756 follow-up) #7767

@JohnMcLear

Description

From the 2026-05 scaling dive, setting `socketTransportProtocols: ["websocket"]` (i.e. dropping the polling fallback) consistently worsens client p95 latency under high concurrency:
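For context, the setting under test lives in Etherpad's `settings.json`. A minimal fragment of the websocket-only variant (surrounding keys elided; the baseline keeps the polling fallback, i.e. `["websocket", "polling"]` — check `settings.json.template` for the exact default on your ref):

```json
{
  "socketTransportProtocols": ["websocket"]
}
```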

| Authors | baseline p95 | websocket-only p95 | apply_mean (baseline) | apply_mean (ws-only) |
| ------- | ------------ | ------------------ | --------------------- | -------------------- |
| 100     | 11 ms        | 18 ms              | 4.16 ms               | 5.13 ms              |
| 140     | 14 ms        | 25 ms              | 4.02 ms               | 6.09 ms              |
| 180     | 16 ms        | 68 ms              | 4.48 ms               | 9.81 ms              |
| 200     | 22 ms        | 82 ms              | 4.95 ms               | 13.33 ms             |

Same harness, same runner, same sweep; the only difference is the transport setting.

Hypotheses worth checking

  1. WS-only forces clients that can't establish WebSocket within socket.io's handshake timeout to retry-loop rather than fall back to polling, producing reconnect storms that drive up server CPU.
  2. The WS-only path changes socket.io's handshake protocol routing in a way that interacts badly with load balancers / proxies.
  3. Per-message WebSocket framing overhead becomes significant when emits/sec is high (66k emits per dwell at 200 authors).
  4. The polling fallback acts as a natural coalescer (multiple events per HTTP poll) that we lose when forcing pure-WS.

Hypothesis 4 is particularly interesting because it would mean polling-as-batching is doing real work for us today.
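If hypothesis 4 holds, the property we'd want to preserve is roughly this: every event queued since the last poll ships in one HTTP response, so N emits inside a window cost one transport write. A minimal sketch of an explicit coalescer that reproduces that behavior over a single-frame transport — names here are illustrative, not existing Etherpad or socket.io APIs:

```javascript
// Sketch of hypothesis 4: the polling transport implicitly batches events
// (everything queued since the last poll ships in one response). An explicit
// coalescer would preserve that property over pure WebSocket.
class Coalescer {
  constructor(flushMs, send) {
    this.flushMs = flushMs; // batching window, analogous to the poll interval
    this.send = send;       // transport write, e.g. one WS frame per batch
    this.queue = [];
    this.timer = null;
  }

  emit(event) {
    this.queue.push(event);
    if (this.timer === null) {
      // First event in a window arms the flush timer; later events piggyback.
      this.timer = setTimeout(() => this.flush(), this.flushMs);
    }
  }

  flush() {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.queue.length > 0) {
      this.send(this.queue); // one frame carrying N events
      this.queue = [];
    }
  }
}

// Three emits inside one window collapse into a single frame.
const frames = [];
const c = new Coalescer(25, (batch) => frames.push(batch));
c.emit("a"); c.emit("b"); c.emit("c");
c.flush(); // flush immediately for the example; frames = [["a","b","c"]]
```

The trade-off is added tail latency (up to `flushMs`) on the first event of each window, which is exactly the latency-vs-throughput knob polling was tuning implicitly.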

Reproducing

```
gh workflow run "Scaling dive" --repo ether/etherpad-load-test --ref main \
-f core_ref=develop \
-f sweep='authors=20..200:step=20:dwell=10s:warmup=2s'
```

Compare `scaling-dive-baseline` vs `scaling-dive-websocket-only` artifacts. Run 25940112728 is the reference.
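Artifacts can be pulled locally with `gh run download 25940112728 --repo ether/etherpad-load-test`. When comparing, note that percentile definitions differ between tools (nearest-rank vs. interpolated), so it's worth recomputing p95 the same way for both runs. A small helper, assuming the artifacts include raw latency samples in milliseconds (that format is an assumption, not confirmed here):

```javascript
// Recompute p95 with one explicit percentile definition (nearest-rank) so
// baseline and websocket-only artifacts are compared like-for-like.
function p95(samples) {
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil(0.95 * sorted.length) - 1; // nearest-rank index
  return sorted[idx];
}

// Example: for samples 1..100 ms, nearest-rank p95 is 95 ms.
const demo = Array.from({ length: 100 }, (_, i) => i + 1);
console.log(p95(demo)); // 95
```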

Why this matters

If we know why the polling fallback helps so much, we can either preserve that property explicitly (via batching) or stop carrying socket.io's polling-fallback code path.
