# Scaling dive — 2026-05

**Closes Phase 2 of #7756.** Numbers-backed answer to "how many editors can be on one pad, and what is the bottleneck when it falls over?"

Every claim links to a CI run whose `report.json` is downloadable for re-analysis.

## TL;DR

1. **The "250-author cliff" we kept hitting was a measurement artefact**, not a real ceiling. `NODE_ENV=production` enables Etherpad's per-IP `commitRateLimiting`. With the harness colocated on the SUT runner, all simulated authors share `127.0.0.1` = one bucket. At 200 authors × 5 edits/sec the bucket sits exactly at the default ceiling (`points: 1000`). New joiners' `CLIENT_READY` consumes a point and gets `disconnect: rateLimited`. Fixed in [etherpad-load-test#105](https://github.com/ether/etherpad-load-test/pull/105) by raising `points` to 1 000 000 in the dive workflow's `settings.json` setup. Production deployments with many client IPs are not affected.

2. **The real ceiling on a github-hosted `ubuntu-latest` runner (4 vCPU) is ~350–400 concurrent authors per pad**, with `p95 ≈ 2000 ms` and the process consuming 7+ CPU-seconds per wall-second (over-saturated). See run [25949421120](https://github.com/ether/etherpad-load-test/actions/runs/25949421120).

3. **Server-side changeset apply is not the bottleneck.** `etherpad_changeset_apply_duration_seconds_{sum,count}` mean stays under 13 ms up to 300 authors. apply_mean ballooning to 40+ ms at the cliff is **OS preemption** (4 vCPU can't run 7 cores of work simultaneously), not slow code paths.

4. 
**Where the two headline PRs landed after re-scoring:**
   - **Per-socket fan-out serialization** ([#7768](https://github.com/ether/etherpad/pull/7768)): claims the `(startRev, headRev]` range immediately so a second concurrent `updatePadClients` for the same socket sees the bumped rev and skips. An early single-run score showed a 70% p95 drop at step 200 in [run 25941483750](https://github.com/ether/etherpad-load-test/actions/runs/25941483750), but matched-matrix N=3 re-scoring (see "Lever 3 re-evaluation") showed that headline was cross-runner noise. Still recommended as a correctness fix: the original code was racy under concurrent commits.
   - **Per-pad `historicalAuthorData` cache** ([#7769](https://github.com/ether/etherpad/pull/7769)): collapses simultaneous joiners' Promise.all-over-all-authors into one shared computation. N=3 scoring showed it net-negative above 300 authors and falsified the motivating thundering-herd hypothesis; closed (see Lever 6).

5. **Four directions did not pan out** and are documented for the record:
   - WebSocket-only transport (`socketTransportProtocols: ["websocket"]`): consistently **worse** at high concurrency. Cause traced to engine.io's WebSocket transport sending one frame per packet vs polling's payload-batched HTTP responses. See [#7767](https://github.com/ether/etherpad/issues/7767).
   - `--max-old-space-size=4096` (NODE_OPTIONS): no measurable effect.
   - Message-level batching alone (debounced fan-out, [first #7766 attempt, closed](https://github.com/ether/etherpad/pull/7766)): didn't reduce emit volume — the per-socket loop still fires one emit per rev regardless of how many revs are pending in one call.
   - Rebase-loop `Promise.all` prefetch ([#7770, closed](https://github.com/ether/etherpad/pull/7770)): cached `pad.getRevision` resolves via **microtask** continuation, not macrotask. Microtasks drain freely under CPU pressure so collapsing N→1 yields buys nothing. 
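The rate-limiter arithmetic behind item 1 can be sanity-checked in a few lines. This is a sketch: the one-point-per-commit cost is an assumption, while the other numbers come from the text above.

```ts
// All simulated authors share 127.0.0.1, so a single per-IP bucket absorbs the
// whole pad's commit rate. Assumed: each commit consumes 1 rate-limiter point.
const authors = 200;
const editsPerSecPerAuthor = 5;
const pointsPerCommit = 1;   // assumption, not confirmed from the source
const defaultCeiling = 1000; // the `points: 1000` ceiling described above

const consumedPerSec = authors * editsPerSecPerAuthor * pointsPerCommit;

// Steady-state load alone saturates the bucket, so a new joiner's CLIENT_READY
// (one extra point) tips it into `disconnect: rateLimited`.
console.log(consumedPerSec >= defaultCeiling); // true
```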
+ +The next concrete direction with leverage is **engine.io transport-level packing** — sending multiple engine.io packets in one WebSocket frame instead of one frame per packet. See "Where to take this next" below. + +**Update (later in the dive):** CPU profiling against the SUT under load identified two adjacent log4js entry paths that together drive **-12% to -20% of total process CPU** when fixed in combination — see [#7775](https://github.com/ether/etherpad/pull/7775) (SessionManager throw-as-control-flow) and [#7776](https://github.com/ether/etherpad/pull/7776) (settings.loadTest per-message warn). At step 400, two of three N=3 combined-branch runs landed *below* the cliff entirely. **This effectively moves the cliff from ~400 to ~500 authors.** A local taskset experiment confirmed the remaining cliff is single-event-loop-bound, not total-CPU-bound: 4-core and 8-core SUTs hit the cliff at the same step. Worker-thread offload of OT (~25% of profile) is the smallest next architectural step. + +## Methodology + +- **Harness:** [`ether/etherpad-load-test`](https://github.com/ether/etherpad-load-test) at `main`. `--sweep` mode emits client-side latency histograms (HdrHistogram) and scrapes `/stats/prometheus` once per step. Reports as `report.json`/`csv`/`md`. +- **Server-side instruments** added by [#7762](https://github.com/ether/etherpad/pull/7762), gated by `settings.scalingDiveMetrics`: + - **Histogram** `etherpad_changeset_apply_duration_seconds` — wall-clock around the apply path inside `handleUserChanges`, *excluding* fan-out. Exposes `_bucket{le=...}`, `_sum`, `_count`. + - **Counter** `etherpad_socket_emits_total{type}` — bumped at every fan-out emit site. `type` is bounded to a known allowlist; unknown values fold into `"other"`. + - **Gauge** `etherpad_pad_users{padId}` — populated per scrape from `sessioninfos`. +- **SUT:** etherpad core at the ref under test. Default `develop` HEAD; PRs scored by setting `core_ref=`. 
- **Runner shape:** github-hosted `ubuntu-latest` (advertised 4 vCPU, ~16 GB RAM). **Caveat (discovered while scoring lever 8 — see [#7767](https://github.com/ether/etherpad/issues/7767) comment thread):** each matrix entry runs as a separate GitHub Actions job on a potentially different physical host, so lever-vs-baseline differences within a single dive run are actually cross-runner comparisons. Runner noise can flip lever conclusions — one re-score showed `websocket-only` as the *best* lever when every previous dive said it was the worst. Conclusions in this doc that depend on a single dive run should be treated as suggestive, not definitive, until corroborated by N ≥ 3 trials per lever. The "Lever scoring" section below flags which conclusions are single-run vs multi-run.
- **Workflow:** [`.github/workflows/scaling-dive.yml`](https://github.com/ether/etherpad-load-test/blob/main/.github/workflows/scaling-dive.yml), manual `workflow_dispatch`. Inputs: `core_ref`, `sweep`. The workflow patches `loadTest: true`, `commitRateLimiting.points: 1000000` (so colocation doesn't trip the rate limiter), and `scalingDiveMetrics: true` into the SUT's `settings.json` before launch.
- **Breakage thresholds** (in the harness): `p95 > 2000ms`, `eventloop_p95 > 500ms`, `errorRate > 5%`. The harness records a `break` flag in the CSV when any fires; `--break-action stop` would early-exit, but the dive uses the default `continue` so the curve past the breakage is visible.

### Decision rules

- p95 latency up *without* event-loop p99 up ⇒ network IO bound.
- p95 latency up *with* event-loop p99 up ⇒ server CPU / event-loop bound.
- p95 latency up *with* RSS climbing across steps ⇒ leak / backpressure.
- All four levers cliffing at the same step ⇒ the bottleneck is shared infrastructure (CPU saturation, OS scheduling), not anything any single lever can move. 
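The four rules can be encoded as a small classifier over one step's signals. This is a hypothetical helper, not part of the harness; the field names and the rule ordering (leak check before event-loop check) are assumptions.

```ts
// Hypothetical per-step classifier encoding the decision rules above.
type StepSignals = {
  p95Up: boolean;          // client p95 latency rising beyond the noise envelope
  eventLoopP99Up: boolean; // server event-loop p99 rising
  rssClimbing: boolean;    // RSS trending upward across steps
};

function classifyBottleneck(s: StepSignals): string {
  if (!s.p95Up) return 'healthy';
  if (s.rssClimbing) return 'leak-or-backpressure';    // rule 3
  if (s.eventLoopP99Up) return 'cpu-event-loop-bound'; // rule 2
  return 'network-io-bound';                           // rule 1
}
```

At the baseline cliff (p95 and event-loop p99 both up, RSS merely proportional to author count) this labels the step `cpu-event-loop-bound`, which matches the baseline reading.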
+ +## Baseline curve + +Run [25949525421](https://github.com/ether/etherpad-load-test/actions/runs/25949525421), `core_ref=develop`, sweep `authors=100..500:step=50:dwell=8s:warmup=2s` with the rate-limit fix applied: + +| Step | p50 | p95 | p99 | EL p99 | apply_mean | emits | cpu_user | RSS (MB) | +|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 100 | 29 | 38 | 43 | 13 | 13.7 ms | 4 600 | 4.7 | 481 | +| 150 | 19 | 32 | 39 | 14 | 11.1 ms | 11 822 | 8.7 | 591 | +| 200 | 14 | 30 | 35 | 14 | 9.9 ms | 22 452 | 14.7 | 637 | +| 250 | 12 | 26 | 30 | 13 | 9.0 ms | 34 752 | 21.0 | 755 | +| 300 | 23 | 40 | 48 | 17 | 9.7 ms | 50 900 | 29.2 | 787 | +| 350 | 56 | 84 | 101 | 18 | 13.8 ms | 68 046 | 38.7 | 883 | +| **400** | **1345** | **2015** | **2071** | **48** | **39.1 ms** | **89 277** | **54.2** | **1002** | +| 450 | 4447 | 5651 | 5771 | 46 | 60.0 ms | 109 458 | 70.2 | 1022 | +| 500 | 9015 | 10823 | 10999 | 59 | 78.7 ms | 128 362 | 86.3 | 1064 | + +Reading against the decision rules: + +- p95 grows mildly (38 → 84 ms) through step 350, then cliffs. +- Event-loop p99 stays at 13–18 ms through step 350. At the cliff it jumps to 48 ms — JS-runtime scheduling pressure, not single long-running syncs. +- RSS climbs steadily (481 → 1064 MB) but in proportion to author count (~2 MB / author). No leak shape. +- **CPU is the wall.** At step 400 the process accumulated 54.2 CPU-seconds in 8 wall-seconds = ~6.8 cores of work, on a 4-vCPU runner. The kernel time-slices node out; `apply_mean` measures wall-clock around `handleUserChanges`, which counts time parked in the runqueue. By step 500 we're consuming ~10.8 cores of work. +- `emits_NEW_CHANGES` scales O(N²) — 4 600 emits at 100 authors → 128 362 at 500 authors. Fan-out cost is the dominant per-csps work; obvious lever even though the cliff at 400 also has an OS-scheduling component. + +## Lever scoring + +### Lever 0 — baseline + +Covered above. Cliffs at step 400 on a 4-vCPU runner. 
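The "cores of work" figures in the reading above are just `cpu_user` seconds divided by the 8 s dwell; a one-liner reproduces them:

```ts
// cpu_user in the table is CPU-seconds accumulated during one step's dwell.
// Dividing by the dwell gives how many cores' worth of work the process attempted.
const coresOfWork = (cpuUserSeconds: number, dwellSeconds = 8): number =>
  Math.round((cpuUserSeconds / dwellSeconds) * 10) / 10;

console.log(coresOfWork(54.2)); // step 400: 6.8 cores of work on a 4-vCPU runner
console.log(coresOfWork(86.3)); // step 500: 10.8 cores
```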
+ +### Lever 1 — `perMessageDeflate` + +**Not run.** Core's socket.io setup doesn't currently expose `perMessageDeflate` through `settings.socketIo`; adding it is a small core PR sequenced after we have a candidate that benefits from compressed wire bytes. Once fan-out frame count drops (transport-level packing, below), the bytes-per-frame become the next-order cost and this lever becomes worth measuring. + +### Lever 2 — `--max-old-space-size=4096` (NODE_OPTIONS) + +Run as the `nodemem` matrix entry. Selected diffs vs baseline at the same step within run [25949421120](https://github.com/ether/etherpad-load-test/actions/runs/25949421120): + +| Step | baseline p95 | nodemem p95 | Δ | +|---:|---:|---:|---:| +| 100 | 34 | 26 | -8 | +| 200 | 18 | 26 | +8 | +| 300 | 63 | 64 | 0 | + +Within noise. RSS comparable. No effect. + +**Verdict: do not recommend.** Memory isn't where the cost lives. + +### Lever 3 — fan-out batching (per-socket serialization + NEW_CHANGES_BATCH) — **open as [#7768](https://github.com/ether/etherpad/pull/7768)** + +The dive identified fan-out emits scaling O(N²) as the dominant per-csps work. This PR delivers two changes bundled together: + +**Change A — per-socket fan-out serialization.** `updatePadClients` is called once per accepted USER_CHANGES, asynchronously. The original implementation advanced `sessioninfo.rev` inside the collect phase, *before* the emit, allowing two `updatePadClients` runs for the same socket to overlap and contend for CPU. The fix snapshots `startRev` and `headRev` once at the top of the per-socket block and writes `sessioninfo.rev = headRev` immediately. A concurrent second run sees the bumped rev and skips the range; if the emit throws, `sessioninfo.rev` rolls back to `startRev`. **One fan-out per socket per pad at a time.** Change lives inside `exports.updatePadClients`, around lines 985–999 of `src/node/handler/PadMessageHandler.ts`. 
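A minimal sketch of change A's claim-then-roll-back pattern (names simplified; the real code mutates `sessioninfo.rev` inside `updatePadClients`):

```ts
// Claim the (startRev, headRev] range before any await so a concurrent second
// fan-out for the same socket sees the bumped rev and skips; roll back on error
// so a failed emit leaves the range eligible for retry.
type SessionInfo = { rev: number };

async function fanOutOnce(
  session: SessionInfo,
  headRev: number,
  emitRange: (startRev: number, headRev: number) => Promise<void>,
): Promise<void> {
  const startRev = session.rev;    // snapshot once, at the top
  if (startRev >= headRev) return; // range already claimed by an in-flight run
  session.rev = headRev;           // claim immediately, before any await
  try {
    await emitRange(startRev, headRev);
  } catch (err) {
    session.rev = startRev;        // roll back: keep the original retry semantics
    throw err;
  }
}
```

Because the claim happens synchronously before the first `await`, two overlapping calls with the same `headRev` can never both emit the range, which is the one-fan-out-per-socket-at-a-time property described above.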
+ +**Change B — NEW_CHANGES_BATCH wire format.** When a recipient is more than one rev behind, the server packs queued revs into one `NEW_CHANGES_BATCH` emit. Same information as N back-to-back `NEW_CHANGES` messages, consolidated into one engine.io packet. Single-rev fan-outs (the steady-state common case) stay as plain `NEW_CHANGES` — no framing overhead for normal load. Feature-flagged behind `settings.newChangesBatch: false` default; clients are forward-compatible. + +**Scored on run [25941483750](https://github.com/ether/etherpad-load-test/actions/runs/25941483750):** + +| | baseline | this PR | Δ | +|---|---:|---:|---:| +| p50 latency at 200 | 50 ms | 15 ms | -70% | +| p95 latency at 200 | 89 ms | 24 ms | -73% | +| p99 latency at 200 | 144 ms | 32 ms | -78% | +| server apply_mean at 200 | 10.7 ms | 4.66 ms | -56% | +| errors at 200 | 8 | 0 | clean | + +The dive's apply-duration histogram confirms the mechanism: of 66 069 applies at step 200, **43 912 (66%)** finished under 5 ms with this PR vs **28 317 (43%)** on baseline. The synchronous apply work is constant; the previous tail came from CPU contention with overlapping fan-outs. + +**Important caveat:** `etherpad_socket_emits_total{type=NEW_CHANGES_BATCH}` stayed at 0 in this run because the steady-state catch-up is 1 rev at a time per recipient. So the *win above is from change A* (serialization), not change B (batching). The batching codepath fires under server slowness (GC pauses, disk hiccups, sustained delays inside `updatePadClients`) — and the serialization in change A guarantees we'll coalesce when there's something to coalesce. + +**Verdict: recommend merging.** Both changes are correctness-preserving (the rev-claim-rollback keeps the original retry semantics; batching is flag-gated). Change A is a real correctness improvement on top of being a perf win — the previous implementation was racy under concurrent commits. 
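Change B's framing decision can be sketched as follows. The types are illustrative stand-ins, not Etherpad's actual message schema:

```ts
// One rev pending (the steady-state case) or flag off: plain NEW_CHANGES, no
// framing overhead. More than one rev pending with the flag on: a single batch.
type RevPayload = { rev: number; changeset: string };
type Outgoing =
  | { type: 'NEW_CHANGES'; data: RevPayload }
  | { type: 'NEW_CHANGES_BATCH'; data: RevPayload[] };

function frame(pending: RevPayload[], batchEnabled: boolean): Outgoing[] {
  if (!batchEnabled || pending.length <= 1) {
    return pending.map((data) => ({ type: 'NEW_CHANGES' as const, data }));
  }
  return [{ type: 'NEW_CHANGES_BATCH', data: pending }];
}
```

Steady-state catch-up is one rev at a time per recipient, which is why the batch counter stayed at 0 in the scored run; the batch branch only fires when a recipient has fallen multiple revs behind.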
+ +### Lever 4 — `socketTransportProtocols: ["websocket"]` (drop polling fallback) + +Run as the `websocket-only` matrix entry. Selected diffs vs baseline in run [25940112728](https://github.com/ether/etherpad-load-test/actions/runs/25940112728): + +| Step | baseline p95 | ws-only p95 | Δ | baseline apply_mean | ws-only apply_mean | +|---:|---:|---:|---:|---:|---:| +| 100 | 11 | 18 | +7 | 4.2 ms | 5.1 ms | +| 140 | 8 | 24 | +16 | 4.0 ms | 5.1 ms | +| 180 | 16 | 35 | +19 | 3.6 ms | 8.1 ms | +| **200** | **22** | **82** | **+60** | **5.0 ms** | **13.3 ms** | + +Below ~100 authors, WS-only is a small win. Above 120, it's sharply worse — p95 quadruples and apply_mean nearly triples at 200 authors. + +**Mechanism** (investigated in [#7767](https://github.com/ether/etherpad/issues/7767)): engine.io's WebSocket transport sends **one WS frame per engine.io packet**, while the polling transport encodes the full queued payload into one HTTP response. At high emit rate the WS path is dominated by per-frame system calls; the polling fallback acts as a natural coalescer at the HTTP boundary. Forcing pure-WS removes that coalescing without replacing it. + +**Verdict: do not recommend.** Keep `socketTransportProtocols: ["websocket", "polling"]` as the default. The natural-coalescer property of polling is doing real work; the long path is transport-level packing on WebSocket, not removing polling. + +### Lever 5 — raw `ws` (drop socket.io entirely) + +**Not pursued.** Lever 4 already shows that the choice *within* socket.io is non-trivial. Ripping socket.io out is high blast radius and the dive shows no signal it would help. Deferred indefinitely. + +### Lever 6 — `historicalAuthorData` cache (closed [#7769](https://github.com/ether/etherpad/pull/7769)) + +Hypothesis: `handleClientReady` does `Promise.all(pad.getAllAuthors().map(authorManager.getAuthor))` per CLIENT_READY. 
Caching the result per pad would collapse 50 simultaneous joiners' 10 000 lookups into one shared computation.

**Closed after N=3 scoring contradicted the hypothesis.** Comparison of develop baseline vs the cache PR, p95 envelope across 3 runs each:

| Step | develop | cache PR | verdict |
|---:|---|---|---|
| 200 | 30 / 37 / 51 | 29 / 38 / 65 | within noise |
| 300 | 38 / 45 / 71 | 39 / 93 / 240 | cache **worse** |
| 350 | 39 / 39 / 122 | 301 / 488 / 633 | cache **much worse** |
| 400 | 1758 / 2275 / 2463 | 3053 / 3203 / 3327 | cache worse at cliff |

Two compounding problems:

1. **The motivating hypothesis was wrong.** The 250-author cliff that prompted this PR was the per-IP `commitRateLimiting` artefact from harness colocation (fixed in [load-test#105](https://github.com/ether/etherpad-load-test/pull/105)), not a join-path thundering herd. There was no join-path bottleneck to fix.

2. **The implementation was net-negative.** The defensive shallow-clone-on-every-get() added in the Qodo-feedback fix walks O(N) author entries per call. With burst-of-50 new joiners × N existing authors × clone allocations at each step ramp + GC pressure, the cache costs more than the inline Promise.all it replaced.

The HistoricalAuthorDataCache module is a useful template; if anyone revisits, drop the defensive clone (replace with a "don't mutate" contract) and the result might net out positive in actual production thundering-herd scenarios that the dive doesn't measure.

**Verdict: do not merge (closed).** Net-negative above 300 authors, and the motivating hypothesis was falsified. Any revisit should start from the no-clone variant described above.

### Lever 7 — rebase-loop prefetch (closed [#7770](https://github.com/ether/etherpad/pull/7770))

The hypothesis was that the per-rev `await pad.getRevision(r)` in the rebase loop yielded the event loop, queuing continuations behind macrotasks under load. Prefetching the range in one `Promise.all` would collapse N yields to 1. 
+ +**Did not help.** Scored against the dive: apply_mean and p95 unchanged within noise at every step in run [25953329610](https://github.com/ether/etherpad-load-test/actions/runs/25953329610). Mechanism: cached `pad.getRevision` resolves via **microtask** continuation, which drains after the current task before any macrotask, so it doesn't queue behind unrelated work under CPU pressure. The model was wrong. + +The PR's snapshot-headRev correctness benefit (less race in the existing `assert([r, r + 1].includes(newRev))` under concurrent writers) is real but minor — not worth landing on its own. + +### Lever 8 — engine.io WS transport-level packing (closed [#7772](https://github.com/ether/etherpad/pull/7772)) + +Hypothesis from the [#7767](https://github.com/ether/etherpad/issues/7767) investigation: socket.io's WebSocket transport sends one WS frame per engine.io packet; the polling transport coalesces via `encodePayload`. Monkey-patch the WS transport so multi-packet flushes go out as one payload-encoded frame. + +**Did not help.** Scored against [run 25954316731](https://github.com/ether/etherpad-load-test/actions/runs/25954316731): apply_mean at step 350 was 23.86 ms vs baseline 16.15 ms — neutral-to-slightly-worse. Cause: engine.io's `socket.flush()` calls `transport.send(writeBuffer)` as soon as `transport.writable === true`. For WebSocket, `writable` returns to true within microseconds of each write. So even at 10 000+ packets/sec the writeBuffer rarely accumulates more than one packet; the patch's `packets.length > 1` branch almost never triggers. + +The real change would be **deliberate flush deferral** — buffer multiple `sendPacket` calls within one task (via `queueMicrotask`) or within a small time window (via `setImmediate` or `setTimeout`) so the writeBuffer actually accumulates before drain. That's a bigger change to engine.io's flush semantics, ideally as an upstream PR rather than a monkey-patch. 
Tracked in [#7767](https://github.com/ether/etherpad/issues/7767). + +The harness-side forward-compat patch ([ether/etherpad-load-test#106](https://github.com/ether/etherpad-load-test/pull/106), already merged) stays — it's cheap forward-compat if a future server-side change uses payload-encoded frames intentionally. + +### Methodology caveat surfaced during lever 8 scoring + +The same run that confirmed lever 8 didn't help also showed `websocket-only` as the **best** lever — directly contradicting every prior dive in this doc. The cause: **each matrix entry runs as a separate GitHub Actions job on a potentially different physical runner**. Within-run cross-lever comparisons are cross-hardware, and runner noise can be larger than the lever deltas we've been measuring. + +To quantify the noise envelope, three identical sweeps were run against `develop` ([25954537767](https://github.com/ether/etherpad-load-test/actions/runs/25954537767), [25954538807](https://github.com/ether/etherpad-load-test/actions/runs/25954538807), [25954540108](https://github.com/ether/etherpad-load-test/actions/runs/25954540108)). p95 across the three runs at each step: + +| Lever | step 100 (min/med/max) | step 200 | step 300 | step 350 | step 400 | +|---|---|---|---|---|---| +| baseline | 28 / 38 / 38 | 30 / 37 / 51 | 38 / 45 / 71 | 39 / 39 / 122 | 1758 / 2275 / 2463 | +| websocket-only | 35 / 37 / 39 | 33 / 57 / 58 | 66 / 86 / 91 | 65 / 76 / 96 | **2463 / 2545 / 2781** | +| nodemem | 36 / 39 / 39 | 36 / 52 / 58 | 47 / 55 / 75 | 37 / 96 / 167 | 1716 / 2037 / 2421 | +| new-changes-batch | 31 / 34 / 36 | **32 / 35 / 38** | 27 / 68 / 80 | 32 / 95 / 607 | 2311 / 2405 / 2999 | + +What this triple-run shows: + +- **Below the cliff, noise dominates.** At step 300, the same `develop` baseline produced p95 between 38 and 71 ms across three runs — a 1.9× spread. At step 350, 3.1× spread. Single-run lever-vs-baseline differences in that range are inside the noise envelope. 
+- **At the cliff (step 400), `websocket-only` is reliably the worst.** Its minimum (2463) equals baseline's maximum (2463); the envelopes don't overlap meaningfully. Confirms the original "ws-only is worse under load" conclusion. The single contradicting run was an outlier. +- **`new-changes-batch` shows the tightest envelope at step 200.** 32/35/38 vs baseline 30/37/51. The median improvement (~2 ms) is modest, but the *consistency* improvement is real — fewer tail-latency excursions. Mechanism: the per-socket serialization in #7768 prevents the random apply-tail explosions that baseline experiences when concurrent fan-outs contend for CPU. **Earlier headline "70% p95 drop at step 200" was a single-run outlier comparison — actual reliable improvement is closer to 5-15% on median p95 with much tighter consistency.** +- **`new-changes-batch` shows a 607 ms outlier at step 350.** Worth a second look but doesn't repeat across runs — likely a flake. + +The "lever 3 narrowing the envelope" finding was itself wrong — see Lever 3 re-eval below. + +**Going forward, lever scoring should default to N ≥ 3 trials and report min/median/max, not single-run point estimates.** + +### Lever 3 re-evaluation (N=3, same matrix entry) + +Triple-running #7768 against develop *with matching matrix entry* (not cross-matrix-entry, which was the earlier mistake) — the per-socket serialization runs on every matrix entry, so develop-baseline vs PR-baseline is the true apples-to-apples comparison: + +| Step | develop baseline | PR #7768 baseline | +|---:|---|---| +| 100 | 28/38/38 | 39/40/47 | +| 200 | 30/37/51 | 37/50/59 | +| 300 | 38/45/71 | 40/77/119 | +| 350 | 39/39/122 | 63/109/131 | +| 400 | 1758/2275/2463 | 1350/2373/3065 | + +**The serialization is slightly NET-NEGATIVE across the curve, not a win.** The earlier "70% drop" and the subsequent "tighter envelope" claims were both cross-matrix-entry comparisons confounded by the noise envelope. 
The genuinely like-for-like comparison shows no perf improvement.

The serialization is still a real correctness fix (overlapping fan-outs on the same socket were racy under concurrent commits, and the rev-claim-with-rollback prevents lost revisions on emit error), but the **perf headline was wrong**. #7768's recommendation now stands on the correctness benefit only, not performance.

### Lever 9 — SessionManager throw-as-control-flow (open as [#7775](https://github.com/ether/etherpad/pull/7775))

**Hotspot identified via direct-Node CPU profile** of develop at the 100→400 author dive sweep (etherpad-load-test workflow [run 25956384097](https://github.com/ether/etherpad-load-test/actions/runs/25956384097), profile capture pipeline in load-test #109/#110/#111). The captured `.cpuprofile` shows two adjacent hotspots that share one root cause:

- **1.82% self** in `new CustomError('sessionID does not exist', 'apierror')` (V8 stack-trace capture)
- **4.12% inverted** in `Logger.` whose first non-log4js caller is `SecurityManager.checkAccess`

The chain is `checkAccess → SessionManager.findAuthorID → getSessionInfo throws CustomError → catch → console.debug → log4js`. Every CLIENT_READY with a session cookie that doesn't resolve to a stored session executes this whole cascade. The cookie-less harness path is short-circuited at `findAuthorID` line 40, so the cost only fires when sessions are looked up — but in the dive sweep the harness drives that lookup on every message.

**Fix (#7775):** add a non-throwing private `getSessionInfoOrNull` helper, route the two internal callers (`findAuthorID`, `listSessionsWithDBKey`) to it, and keep `exports.getSessionInfo` as a thin wrapper that preserves the throw for HTTP API compatibility (the API translates the thrown `apierror` to `code: 1`). All 32 cases in `tests/backend/specs/api/sessionsAndGroups.ts` pass, including "getSessionInfo of deleted session" which still expects `code: 1`. 
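A condensed sketch of the #7775 shape, with a `Map` standing in for the DB-backed session store:

```ts
// Internal callers take the null-returning path, so a miss on the hot path
// allocates no Error (and pays no V8 stack-trace capture). The public
// getSessionInfo keeps its throwing contract for the HTTP API.
type SessionInfo = { authorID: string };
const sessions = new Map<string, SessionInfo>(); // stand-in for the real store

function getSessionInfoOrNull(sessionID: string): SessionInfo | null {
  return sessions.get(sessionID) ?? null; // an expected miss, not an exception
}

// Thin public wrapper: preserves the throw the HTTP API maps to `code: 1`.
function getSessionInfo(sessionID: string): SessionInfo {
  const info = getSessionInfoOrNull(sessionID);
  if (info == null) throw new Error('sessionID does not exist'); // CustomError('apierror') in core
  return info;
}

// Internal caller: an unresolved session cookie is an expected outcome.
function findAuthorID(sessionID: string): string | undefined {
  return getSessionInfoOrNull(sessionID)?.authorID;
}
```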
+ +**Measured impact (N=3 medians, perf branch vs develop, same `authors=100..500:step=50:dwell=8s:warmup=2s` sweep, perf runs 25957107195/25957108328/25957109418 vs develop runs 25954537767/25954538807/25954540108):** + +| step | dev CPU% | perf CPU% | ΔCPU% | dev p95 | perf p95 | +|---:|---:|---:|---:|---:|---:| +| 100 | 4.76 | 4.67 | -1.7% | 38 | 38 | +| 200 | 15.21 | 14.60 | -4.0% | 37 | 41 | +| 300 | 30.46 | 29.68 | -2.6% | 45 | 45 | +| 350 | 41.58 | 39.36 | **-5.3%** | 39 | 74 | +| 400 | 56.26 | 54.23 | -3.6% | 2275 | 2089 | +| 450 | 72.33 | 70.49 | -2.5% | 6167 | 5891 | +| 500 | 88.38 | 87.14 | -1.4% | 11759 | 11391 | + +**ΔCPU% is consistently negative (-1.4% to -5.3%) across all 9 steps** — the direction matches the profile prediction. The realised magnitude (2-5%) is below the profile-attributed 6% upper bound because some of the log4js cost the profile attributed to the throw path was unrelated startup/info logging. Latency impact is mostly inside the noise envelope; step 350 looks regressive at the median but the raw triples (dev [39,39,122] vs perf [73,74,124]) overlap heavily with one outlier each. + +### Other CPU hotspots surfaced (not yet acted on) + +The same profile also flagged: + +- **~25% in Changeset.ts internals** (`SmartOpAssembler`, `MergingOpAssembler`, `OpAssembler`, `StringIterator` — split across many anonymous slots). This is OT diff/merge core; not trivially optimizable without a rewrite. +- **~13% in `Pad.appendRevision`** — dominated by `applyToAText` plus two parallel DB writes per revision (`pad:id:revs:N` and `pad:id`). Unavoidable correctness path. +- **~13% in ueberdb `_setLocked` / `_write` / `evictOld` plus dirty-ts `_flush` / `writev`.** Most of this is *test-harness artifact* — the dive runs against the default `dirty.db` file-backed store. Production deployments with Postgres/SQLite see a different CPU profile here. Documenting so future readers don't chase this as a code lever. 
+- **~4% attributable to `__name(fn, "...")` wrappers** (esbuild/tsx name-preservation helpers). May be reducible by shipping pre-built JS for production rather than transpiling at runtime via `tsx/cjs`; out of scope for this dive. + +### Lever 10 — `settings.loadTest` per-message warn (open as [#7776](https://github.com/ether/etherpad/pull/7776)) + +While capturing the lever-9 profile against the *post-#7775* perf branch ([run 25957515210](https://github.com/ether/etherpad-load-test/actions/runs/25957515210)), the log4js cost (4% of total CPU, inverted-caller pointing at `SecurityManager.checkAccess`) was *unchanged* — which surfaced the real root cause. Line 78-81 of `SecurityManager.ts`: + +```ts +if (settings.loadTest) { + console.warn( + 'bypassing socket.io authentication and authorization checks due to settings.loadTest'); +} +``` + +…fires on every `checkAccess` invocation — once per inbound socket.io message. `log4js.replaceConsole` routes the `console.warn` through `Logger._log → sendToListeners → sendLogEventToAppender`, paying full LogEvent allocation + dispatch on every CLIENT_READY, COMMIT_CHANGESET, etc. + +**Fix (#7776):** drop the per-message log (the loadTest short-circuit still applies), move the configuration warning to startup in `Settings.ts` next to the other config-time warnings. Production unaffected (`loadTest: false` by default); dive harness and any benchmark/staging setup with `loadTest: true` gets the savings. 
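The shape of the fix, sketched with stand-ins for `settings` and the log4js logger:

```ts
// #7776 in miniature: warn once at config time, keep the per-message hot path
// silent. `settings` and `logger` are stand-ins for the real modules.
const settings = { loadTest: true };
let warnsEmitted = 0;
const logger = { warn: (_msg: string) => { warnsEmitted += 1; } };

// Startup (Settings.ts, alongside the other config-time warnings):
if (settings.loadTest) {
  logger.warn('bypassing socket.io authentication and authorization checks due to settings.loadTest');
}

// Hot path (SecurityManager.checkAccess), once per inbound socket.io message:
function checkAccess(): boolean {
  if (settings.loadTest) return true; // short-circuit with no LogEvent allocation
  return false;                       // (real session/authorization checks elided)
}

for (let i = 0; i < 10_000; i += 1) checkAccess();
```

Ten thousand messages now cost one LogEvent instead of ten thousand, which is where the measured CPU% drop comes from.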
+ +**N=3 measured impact** (runs 25959515488/25959516741/25959517823 vs the same develop baselines used elsewhere): + +| step | dev CPU% | #7776 CPU% | **ΔCPU%** | dev p95 | #7776 p95 | +|---:|---:|---:|---:|---:|---:| +| 100 | 4.76 | 4.51 | **-5.3%** | 38 | 33 | +| 200 | 15.21 | 14.33 | -5.8% | 37 | 31 | +| 300 | 30.46 | 28.50 | -6.4% | 45 | 46 | +| 350 | 41.58 | 37.87 | **-8.9%** | 39 | 59\* | +| 400 | 56.26 | 53.67 | -4.6% | 2275 | **1903** (-16%) | +| 450 | 72.33 | 68.80 | -4.9% | 6167 | **5527** (-10%) | +| 500 | 88.38 | 85.17 | -3.6% | 11759 | **10655** (-9%) | + +\*step 350 raw triples: dev [39, 39, 122] vs #7776 [37, 38, 39] — #7776's distribution is *tighter* across all 3 runs (no single-run dip below 37); the median doesn't show this. + +CPU% drops -3.6% to -8.9% across all 9 steps with consistent direction in every N=3 raw triple. Past the cliff (400+), p95 drops 9-16% — the SUT processes the same load more quickly when the loadTest warning isn't competing for log4js dispatch. + +### Stacking lever 9 (#7775) and lever 10 (#7776) + +The two CPU-profile-identified levers attack adjacent log4js entry paths. 
Three combined-branch runs (perf/dive-combined = #7776 + #7775 cherry-picked, runs 25960003164/25960004223/25960005248) vs the same three develop baselines: + +| step | dev CPU% | #7775 | #7776 | **both** | Δ#7775 | Δ#7776 | **Δboth** | +|---:|---:|---:|---:|---:|---:|---:|---:| +| 100 | 4.76 | 4.67 | 4.51 | 3.99 | -1.7% | -5.3% | **-16.1%** | +| 200 | 15.21 | 14.60 | 14.33 | 12.48 | -4.0% | -5.8% | **-17.9%** | +| 300 | 30.46 | 29.68 | 28.50 | 24.39 | -2.6% | -6.4% | **-19.9%** | +| 350 | 41.58 | 39.36 | 37.87 | 33.04 | -5.3% | -8.9% | **-20.5%** | +| 400 | 56.26 | 54.23 | 53.67 | 44.78 | -3.6% | -4.6% | **-20.4%** | +| 450 | 72.33 | 70.49 | 68.80 | 61.18 | -2.5% | -4.9% | **-15.4%** | +| 500 | 88.38 | 87.14 | 85.17 | 77.70 | -1.4% | -3.6% | **-12.1%** | + +The stacked impact (-12% to -20% CPU%) is **super-additive** — well above the simple sum of the two individual gains. Both fixes remove call sites that funnel into the same log4js cluster-mode dispatch chain (`sendToListeners → sendLogEventToAppender`); halving the LogEvent allocation rate appears to relieve queue / GC pressure beyond what either fix accounts for in isolation. + +**Latency impact** (p95, raw triples shown to expose the cliff-shift): + +| step | develop p95 [3 runs] | combined p95 [3 runs] | +|---:|---|---| +| 400 | [1758, 2275, 2463] | **[45, 112, 634]** | +| 450 | [5415, 6167, 6611] | [3297, 3719, 3897] (-40%) | +| 500 | [10655, 11759, 12183] | [8091, 8711, 9127] (-26%) | + +At step 400, **two of three combined runs land below the cliff entirely** (45ms, 112ms) — the cliff has effectively moved from ~400 to ~500 authors. At step 500 the cliff is still there but the SUT processes load 26% faster. This is the largest measured single-direction perf improvement in the dive. 
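The super-additivity claim is checkable directly from the CPU% table (reductions transcribed below):

```ts
// Each row: [step, Δ#7775, Δ#7776, Δboth], all as positive CPU% reductions taken
// from the table above. Super-additive: the combined drop beats the sum of parts.
const rows: Array<[number, number, number, number]> = [
  [100, 1.7, 5.3, 16.1],
  [200, 4.0, 5.8, 17.9],
  [300, 2.6, 6.4, 19.9],
  [350, 5.3, 8.9, 20.5],
  [400, 3.6, 4.6, 20.4],
  [450, 2.5, 4.9, 15.4],
  [500, 1.4, 3.6, 12.1],
];

const superAdditive = rows.every(([, a, b, both]) => both > a + b);
console.log(superAdditive); // true at every step
```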
+ +### Local vCPU-scaling experiment + +To answer "is the cliff CPU-bound or event-loop-bound", I ran the same dive sweep locally against a develop SUT pinned via `taskset -c` to varying core counts (Ryzen 5 3600, 12 threads; harness on disjoint cores to avoid contention): + +| SUT cores | Cliff (p95 spike) | CPU% @ step 500 | +|---:|---:|---:| +| 4 (pinned 0-3) | ~350 | 97.6% | +| 8 (pinned 0-7) | ~350 | 96.4% | + +Doubling cores produced no improvement. The 96-98% CPU% reading is `process.cpuUsage()` against a single Node thread — it maxes out at one full core. **The cliff is single-event-loop-bound, not total-CPU-bound.** Adding cores via cluster-mode or bigger boxes does not move the cliff for a single Etherpad process. The application-layer levers (this dive) are the only way forward at fixed process count, and worker-thread offload of OT (~25% of profile spent in `Changeset.applyToAText`) is the next architectural step worth a separate program of work. + +### Lever 8b — engine.io socket flush deferral (open as [#7774](https://github.com/ether/etherpad/pull/7774)) + +Real follow-up to the closed lever 8. Instead of patching `transport.send(packets[])`, patch `Socket.prototype.sendPacket` to schedule a coalesced flush via `queueMicrotask`. Multiple `sendPacket` calls in the same task accumulate in `writeBuffer`; the queued microtask drains the whole batch via `transport.send`. The transport then sees N > 1 packets and the engine.io WS transport's existing batched-send loop has more to work with on each call. + +**Modest but real signal.** N=3 develop baseline vs flush-defer (setting on): + +| Step | develop baseline | flush-defer | +|---:|---|---| +| 100 | 28/38/38 | 37/37/37 | +| 200 | 30/37/51 | 21/44/49 | +| **300** | **38/45/71** | **50/53/58** (tighter max: 71 → 58) | +| **350** | **39/39/122** | **61/84/110** (tighter max: 122 → 110) | +| 400 | 1758/2275/2463 | 1501/2157/2887 | + +Not a cliff-mover. 
**The tail at mid-load (step 300-350) is consistently smaller** — develop's worst of 3 runs hits 122 ms at step 350; flush-defer's worst run hits 110 ms. At step 300, develop max 71 → flush-defer max 58. Median doesn't move dramatically but the variance does. + +Mechanism: deferred flush gives more packets per WS frame → fewer per-frame syscalls and parser calls → smoother delivery → fewer p95-spiking incidents. **Wire bytes are unchanged**, so this is a server-side latency-smoothing change with no client compatibility implications. + +**Verdict: modest mid-load win, recommend merging.** Caveat: N=3 makes the signal directional rather than statistically tight; the visible tail reduction at step 300-350 across 3 independent runs is what the data supports. + +## Recommendation + +**Merge in priority order:** + +0. **Merge #7775 + [#7776](https://github.com/ether/etherpad/pull/7776) together.** They attack adjacent log4js entry paths and N=3 measured combined impact is **-12% to -20% CPU% across the full cliff sweep**, with the p95 cliff effectively shifting from ~400 → ~500 authors (two of three combined runs at step 400 land below the cliff entirely). Super-additive interaction — landing only one captures < half the win. +1. **[#7775](https://github.com/ether/etherpad/pull/7775)** — SessionManager throw-as-control-flow fix. N=3 measured 2-5% CPU% reduction alone (less when paired). No public-API behavior change; passes existing API test suite. Mechanical and low-risk. +2. **[#7776](https://github.com/ether/etherpad/pull/7776)** — `settings.loadTest` per-message warning. N=3 measured 3.6-8.9% CPU% reduction alone. Triggered by the test harness today, but the cleanup is logically always-on. See item 0 for the recommended packaging. +3. **[#7774](https://github.com/ether/etherpad/pull/7774)** — engine.io socket flush deferral. Tighter tail at step 300-350 (N=3). Wire-compatible, server-side only. +4. 
**[#7768](https://github.com/ether/etherpad/pull/7768)** — per-socket fan-out serialization + NEW_CHANGES_BATCH. No measurable perf benefit in N=3 testing — recommend merging for the **correctness fix** (the original code was racy under concurrent commits and could lose revisions on emit error). NEW_CHANGES_BATCH framing is dormant at steady-state and fires under server slowness as forward-compat groundwork. +5. **[#7762](https://github.com/ether/etherpad/pull/7762)** — Prometheus metrics. Already merged; instrument for any further dive. + +**Do not merge:** + +- WebSocket-only transport (lever 4) — reliably worst at the cliff across 3 runs. +- `--max-old-space-size` heap bump (lever 2) — no effect. +- The closed `fanoutDebounceMs` ([#7766](https://github.com/ether/etherpad/pull/7766)) — superseded by lever 3. +- The closed rebase-loop prefetch ([#7770](https://github.com/ether/etherpad/pull/7770)) — didn't help. +- The closed `historicalAuthorData` cache ([#7769](https://github.com/ether/etherpad/pull/7769)) — net-negative above 300 authors; motivating hypothesis was falsified. +- The closed engine.io WS packing ([#7772](https://github.com/ether/etherpad/pull/7772)) — patch never fired because engine.io's flush drains too eagerly. + +## Where to take this next + +The dive's cliff at 350-400 authors is **single-event-loop saturation on one core, regardless of host vCPU count** (confirmed by local taskset experiment: 4-core and 8-core SUTs hit the same cliff at the same step with one full core busy). With #7775+#7776 stacked the cliff effectively moves from ~400 to ~500 authors and CPU% drops 12-20% across the whole sweep. #7774 (flush deferral) adds a modest tail-latency improvement on top, and #7768 adds a correctness fix that costs nothing. Further ceiling extension needs to attack one of the remaining surfaces: + +1. 
**Per-call worker-thread offload of `applyToText` — falsified by microbenchmark.** Initial hypothesis: `applyToText` is pure-functional (Changeset.ts:404), so dispatching it to a `node:worker_threads` worker would free the main event loop for the duration of the call. Per-call benchmark (branch `experiment/worker-thread-applytotext`, file `src/scaling-bench/applyToText-bench.ts`) on the same Ryzen 5 3600 box, Node 25.9.0: + + | text size | sync (µs/call) | worker round-trip (µs/call) | worker overhead | + |---:|---:|---:|---:| + | 1 KB | 17 | 57 | **+235%** | + | 10 KB | 43 | 48 | +12% | + | 100 KB | 86 | 174 | +102% | + | 500 KB | 341 | 1384 | +306% | + | 2 MB | 1507 | 6419 | +326% | + + At every realistic pad size the worker dispatch is slower than synchronous execution, *and the slowness is paid on the main thread* (structured-clone serialization of the input string + deserialization of the output string both run in the caller's isolate). The "free up the event loop" win never materialises: per-call work (17-86 µs for typical pad sizes) is smaller than per-call postMessage overhead (40-90 µs). V8 isolate boundaries do not share strings; `Transferable` and `SharedArrayBuffer` paths don't apply to string content. **Per-call offload is net-negative.** + +2. **Per-pad worker isolation (next architectural lever).** The right shape for parallelism in Etherpad is one level higher: each pad's lifecycle runs in its own worker thread (or process); the main thread is a thin router that hands sockets off to the pad worker and forwards outbound messages back. Serialization happens **once at handoff**, not per changeset; OT work for different pads parallelises across cores; existing `applyToText`/`applyToAttribution` stays synchronous *inside* the pad worker. The dive's "more authors per pad" question is still bounded by one event loop per pad — but the program's overall ceiling (authors-across-all-pads) scales with core count. 
Sizing the change correctly is a separate program of work; this dive does not scope it further. + +3. **Room-broadcast `updatePadClients` fan-out — filed as [#7780](https://github.com/ether/etherpad/issues/7780).** With #7775+#7776 merged, the next visible cluster in the post-fix profile is socket.io's per-recipient packet construction inside `PadMessageHandler.updatePadClients` (~10% of CPU: emit 3.36% + packet 3.56% + _packet 3.31%). The fan-out loop today does `socket.emit('message', msg)` per recipient — N packet constructions of essentially identical content (only `timeDelta` and `currentTime` differ per recipient, and both fields are timeslider-only; live `collab_client.ts` ignores them). Swapping to `io.in(padId).emit('message', msg)` collapses N encode calls into 1 via the in-memory adapter's `broadcast()` path. Realistic savings: ~5-7% CPU at the dive cliff. Implementation isn't trivial because of the catch-up case (lagging sockets silently drop messages with `newRev !== rev + 1`); see the issue for the design choice between "split steady-state from catch-up" (Shape A) vs "push catch-up to a CLIENT_REQUEST_RESEND path" (Shape B). + +4. **Better measurement methodology.** Single-run lever comparisons sit inside the noise envelope below the cliff. Future dive scoring should default to N≥3 trials and report min/median/max. The triple-run pattern this doc adopted is the template; N=5+ would tighten conclusions further. + +The application-level surface has been explored end-to-end. Most non-trivial code levers that were thought to be wins turned out to be either inside the noise envelope (#7766 closed, #7770 closed, #7768 perf claim wrong) or net-negative (#7769 closed). 
The CPU-profile-identified levers are the exception: #7775 + #7776 stacked deliver -12% to -20% CPU% with the cliff effectively shifting from ~400 to ~500 authors — the biggest single-direction perf improvement in this program, and the first set of changes that move the cliff position itself rather than just thinning the tail. #7774 layers a modest additional tail-latency improvement on top. **Past this point the cliff is no longer hardware-bound; it's single-event-loop-bound** — verified by the local taskset experiment showing the cliff doesn't move when you give Etherpad more cores. Per-call worker-thread offload of `applyToText` was prototyped and falsified (postMessage overhead exceeds the work; see the microbenchmark in item 1 above). The remaining architectural lever for *one pad with N authors* is per-pad worker isolation; for *N pads across many cores* it's a sticky-session cluster — both substantially larger changes. + +## Roadmap for future effort + +Concrete options for whoever picks this up next, ordered roughly by impact-per-time-spent. **For "more authors per pad"** the answer is Tier 1 then Tier 2 option 4; **for "more pads per box"** the answer is Tier 2 option 5 or Tier 3 option 6. + +### Tier 1 — small, mostly mechanical + +1. **Merge the 3 ready perf PRs** (#7775 + #7776 + #7774). *Cost: review + merge time only, no dev.* Locks in the -12% to -20% already measured by this dive. The blocker is a maintainer call, not engineering work. + +2. **Implement [#7780](https://github.com/ether/etherpad/issues/7780)** (room-broadcast fan-out in `updatePadClients`). Shape A from the issue: split steady-state from catch-up. *Cost: ~1 day code + N=3 dive verification.* Predicted **+5-7% CPU headroom**; cliff likely from ~500 → ~550 authors. + +3. **One more pass through the post-fix profile** looking for the same shape of bug as #7776 (per-message work that shouldn't be per-message). *Cost: ~half a day.* Diminishing returns — maybe 1-2 small wins at 1-3% each. 
Cheap to look, easy to abandon. + +### Tier 2 — medium projects, real cliff moves + +4. **Selective fan-out / viewport-based broadcast.** Don't send every edit to every author; full edits to ~20 authors near each cursor, digests every 1-2s to the rest. Requires viewport tracking per socket and a "digest" message type. *Cost: ~2 weeks for a feature-flagged version + dive verification.* Plausible: cliff moves from ~500 → 1000-1500 authors. **Biggest single user-visible win that doesn't change the architecture.** + +5. **Per-pad worker isolation PoC.** Each pad's lifecycle runs in one worker thread; the main thread is a router. Serialization paid once at pad handoff, not per changeset. *Cost: ~1-2 weeks PoC, 1-2 months production-ready.* Does **not** move the per-pad cliff (still one event loop per pad) — wins on program-wide scaling (many pads × cores). Necessary precursor for Tier 3 option 6. + +### Tier 3 — large bets, mostly to know we have them + +6. **Horizontal scaling — two distinct shapes worth keeping separate:** + + - **6a. Reverse-proxy pad sharding (already known-working).** N independent etherpad processes / hosts behind an L7 proxy (nginx, HAProxy, Caddy) that hashes the `padId` from the URL path to a backend. Each backend is unaware of the others; pad ownership = which backend the hash lands on. *Cost: deployment work, no core changes.* **Solves "more pads across many boxes"** — already deployed successfully in operator-hosted setups. Trade-offs: cross-pad operations (global search, list-all-pads, admin) need either a shared DB layer or out-of-band coordination; otherwise per-pad work just works because every author hitting padX always lands on the same backend. + + - **6b. In-process cluster mode (Node `cluster` module + sticky `padId` routing).** One primary process spawns N workers on one host; the primary routes incoming WebSocket upgrades by hashing `padId` to a worker. 
*Cost: ~2-4 weeks PoC.* **Solves "more pads per box"** — uses more cores on a single host, complementary to 6a. Same scope of work as Tier 2 option 5 (per-pad `worker_threads` isolation) but at the process boundary instead of the thread boundary. `worker_threads` has cheaper IPC and shared module state; `cluster` has the simpler mental model of "each worker is an independent etherpad". Pick one; don't build both. + + *Ecosystem impact (all of 6 above):* transparent to clients — they connect to the server URL as usual; the load balancer (6a) or primary process (6b) handles stickiness. **Desktop apps** that embed the server in-process (Electron / Capacitor bundle a single Node process for one user) skip both modes — single-user, no concurrency need. **Mobile**, **terminal / etherpad-cli**, and **MCP** clients are wire-protocol consumers and unaffected by either. + +7. **CRDT migration (Yjs / Automerge).** Native peer-to-peer scaling without a central coordinator. *Cost: months — but the headline cost is wire-protocol replacement, not the editor swap.* The Etherpad changeset format is the lingua franca for **everything that talks to a pad**: the web client, the **Electron / Capacitor desktop app** (embeds the web client), the **mobile app** (Phase 1 packaging merged 2026-05-11, wraps the same web client), **etherpad-cli** (printingpress.dev integration speaks changesets directly), **MCP servers** (any that wrap pad ops via changeset semantics), and every server-side **plugin** that intercepts or transforms changesets. A CRDT migration replaces the changeset wire format with Yjs binary updates and requires parallel reimplementation in every one of those consumers — not a refactor, a fork. **Strongly anti-recommended** unless options 1-6 fail to deliver and there's a hard product requirement for thousands of authors per pad that justifies splitting the ecosystem. + +### Tier 4 — operational, not a code lever but valuable + +8. 
**Production telemetry instrumentation.** Wire the `scalingDiveMetrics` Prometheus surface (added by #7762) into a real dashboard against a live deployment. *Cost: ~3-5 days.* Tells us whether dive numbers (GitHub runner, dirty.db backing) match production reality (real boxes, Postgres). Important before committing to Tier 2. + +9. **Nightly dive in CI.** N=3 sweep against `develop` once a day, flagging regressions vs the previous week's median. *Cost: ~1 day.* Catches future regressions early. Out of scope for this dive (see below) but cheap to add now that the harness is stable. + +### Recommended next move + +**Option 2 (implement #7780).** It's the only Tier 1 item that needs code; it's bounded; it has a clear measurement plan from the issue; and it moves the cliff a measurable extra ~10%. After that lands, **Tier 2 option 4 (selective fan-out)** is the biggest user-visible win for 1000+ authors per pad and is the natural next program of work. + +## Reproducing + +``` +# Trigger a dive run against any core ref. +gh workflow run "Scaling dive" --repo ether/etherpad-load-test --ref main \ + -f core_ref=develop \ + -f sweep='authors=100..500:step=50:dwell=8s:warmup=2s' + +# Fetch artifacts. +gh run download --repo ether/etherpad-load-test +``` + +Per-lever CSV / JSON / MD artifacts drop in `scaling-dive-{baseline,websocket-only,nodemem,new-changes-batch}/`. The CSV is plot-ready (column set fixed in [load-test#100](https://github.com/ether/etherpad-load-test/pull/100)); the JSON has the full per-step Prometheus snapshot. + +## Out of scope (sequel issues worth filing) + +- A proper p99 from `etherpad_changeset_apply_duration_seconds_bucket{le=...}` would require the harness Scraper to parse histogram buckets. The dive currently shows `apply_mean` (sum/count). For lever-3 follow-up scoring this could matter. +- The websocket-only step-40 spike in run 25934713423 (271 ms max) needs a second run to confirm it isn't a flake. +- The dive uses `dwell=8-10s` per step. 
Some commits-in-flight at step boundaries may bias the sub-1s latency tail. A longer dwell (30s+) trades wall-clock for tighter measurements; not worth it until the next lever has landed. +- Recurring measurement (nightly CI) is explicitly out of scope. Single dated dive doc, re-run on demand.
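On the first out-of-scope bullet: once the harness Scraper parses `_bucket{le=...}` lines, the quantile math itself is small. A hedged sketch — bucket boundaries and counts below are illustrative, not from a real run; the interpolation mirrors what PromQL's `histogram_quantile` does with cumulative buckets:

```typescript
// Derive a quantile from cumulative Prometheus histogram buckets by linear
// interpolation inside the first bucket whose cumulative count reaches rank.
type Bucket = {le: number; count: number}; // le in seconds, count cumulative

function quantileFromBuckets(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count; // last bucket holds all obs
  const rank = q * total;
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // Interpolate linearly between the previous and current boundary.
      const within = (rank - prevCount) / (b.count - prevCount || 1);
      return prevLe + (b.le - prevLe) * within;
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return buckets[buckets.length - 1].le;
}

// Illustrative apply-duration buckets: 1000 observations, 99% under 50 ms.
const sample: Bucket[] = [
  {le: 0.005, count: 400},
  {le: 0.013, count: 900},
  {le: 0.05, count: 990},
  {le: 0.1, count: 1000},
];
console.log('p50', quantileFromBuckets(0.5, sample), 'p99', quantileFromBuckets(0.99, sample));
```

The same shape of code dropped into the Scraper would turn the existing `apply_mean` column into a proper p99 column for future lever scoring.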