feat(scaling): serialize per-socket fan-out + NEW_CHANGES_BATCH (#7756 lever 3) #7768

JohnMcLear wants to merge 2 commits into
Conversation
Identified by the #7756 scaling dive (PR #7765) and confirmed by the engine.io transport investigation in #7767: socket.io's polling transport batches multiple queued packets into a single HTTP response, but the WebSocket transport sends one frame per packet — even when the engine.io socket has several packets buffered. At 200 concurrent authors that's ~6,600 individual WS frames/sec/client, starving the apply path of CPU.

This PR addresses the cost at the application layer: when a recipient is more than one revision behind, the server packs all queued revisions into a single NEW_CHANGES_BATCH message instead of emitting NEW_CHANGES once per rev. The wire payload is the same information, just consolidated.

Feature-flagged:

- settings.newChangesBatch defaults to false. Production behaviour is unchanged.
- When enabled, the server emits NEW_CHANGES_BATCH iff a recipient has >1 rev pending; single-rev fan-outs stay as NEW_CHANGES (no framing overhead for the steady-state case).

Clients are forward-compatible: both collab_client.ts (live editor) and broadcast.ts (timeslider) now accept either message type and normalise to a list. Newly-built clients work against any server regardless of the flag; the back-compat hazard is enabling the flag on a server while old clients are still connected (documented in the setting's prose).

Tests: src/tests/backend-new/specs/new-changes-batch.test.ts pins the server's wire-format decision. 4/4 new + 5/5 existing prom-instruments stay green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
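The server-side decision described above can be sketched as follows. The helper name `buildNewChangesEmits` matches the module the review later extracted, but this simplified signature and the `Rev`/`Emit` shapes are assumptions, not the actual Etherpad code:

```typescript
// Sketch of the wire-format decision: consolidate multi-rev catch-up
// into one NEW_CHANGES_BATCH, leave the steady-state single-rev case
// (and everything when the flag is off) as per-rev NEW_CHANGES.
type Rev = {changeset: string; author: string; currentTime: number};
type Emit =
  | {type: 'NEW_CHANGES'; data: Rev & {newRev: number}}
  | {type: 'NEW_CHANGES_BATCH'; data: {headRev: number; changesets: Rev[]}};

export const buildNewChangesEmits = (
    startRev: number, pending: Rev[], batchEnabled: boolean): Emit[] => {
  if (!batchEnabled || pending.length <= 1) {
    // One emit per pending revision, numbered consecutively after startRev.
    return pending.map((rev, i): Emit => ({
      type: 'NEW_CHANGES',
      data: {...rev, newRev: startRev + i + 1},
    }));
  }
  // Recipient is >1 rev behind: one consolidated emit carrying the range.
  return [{
    type: 'NEW_CHANGES_BATCH',
    data: {headRev: startRev + pending.length, changesets: pending},
  }];
};
```

The payload is the same information either way; only the framing changes, which is exactly what makes the single-rev passthrough cheap to keep.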
Review Summary by Qodo

Pack multi-revision fan-out into single NEW_CHANGES_BATCH emit

Walkthrough

Description

• Pack multiple queued revisions into single NEW_CHANGES_BATCH emit
  - Reduces engine.io packet count under high concurrency
  - WebSocket transport sends one frame per packet (polling already batches)
• Feature-flagged with settings.newChangesBatch (defaults to false)
  - Single-rev fan-outs remain as NEW_CHANGES (no overhead)
  - Multi-rev fan-outs use NEW_CHANGES_BATCH when enabled
• Clients forward-compatible: accept both message types
  - collab_client.ts and broadcast.ts normalize to list
  - Old clients connecting to enabled server will miss batched revisions
• Comprehensive test coverage with new test suite
  - 4 new tests pin wire-format decision
  - Existing prom-instruments tests remain green

Diagram

```mermaid
flowchart LR
  A["Multiple queued revisions"] --> B{"newChangesBatch enabled?"}
  B -->|No or single rev| C["Emit per-rev NEW_CHANGES"]
  B -->|Yes and multi-rev| D["Emit single NEW_CHANGES_BATCH"]
  C --> E["Client receives"]
  D --> E
  E --> F["Normalize to list"]
  F --> G["Apply revisions in order"]
```
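The normalize-to-list step on the client side can be sketched as follows; the message shapes here are assumptions for illustration, not the exact wire schema used by collab_client.ts and broadcast.ts:

```typescript
// Hedged sketch of client-side normalization: both message types are
// reduced to an ordered list of single-rev payloads before applying,
// so the apply loop never needs to know which framing was used.
type NewChanges = {type: 'NEW_CHANGES'; newRev: number; changeset: string};
type NewChangesBatch =
    {type: 'NEW_CHANGES_BATCH'; headRev: number; changesets: string[]};

export const normalizeToList =
    (msg: NewChanges | NewChangesBatch): NewChanges[] => {
  if (msg.type === 'NEW_CHANGES') return [msg];
  // Unpack the batch: the last changeset lands the client exactly on
  // headRev, so the first one targets headRev - (length - 1).
  const firstRev = msg.headRev - msg.changesets.length + 1;
  return msg.changesets.map((changeset, i) => ({
    type: 'NEW_CHANGES' as const,
    newRev: firstRev + i,
    changeset,
  }));
};
```

Because the normalized list preserves consecutive `newRev` numbering, the existing strict `newRev === rev + 1` check downstream keeps working unchanged.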
File Changes

1. src/node/handler/PadMessageHandler.ts

Code Review by Qodo
Two issues raised on the first push:

1. **Rev advanced before send (Bug · Correctness).** The previous diff advanced `sessioninfo.rev`/`time` inside the collect loop, before any emit ran. A concurrent `updatePadClients()` could then see the bumped rev and skip those revisions, and if the emit threw later, the skipped revs were lost forever. The client enforces strict `newRev === rev + 1` and silently stops applying on mismatch — net effect was a possible pad desync under concurrent fan-outs.

   Fix: snapshot `startRev`/`startTime` once, claim the (startRev, headRev] range by setting `sessioninfo.rev = headRev` immediately (so a concurrent run skips it), build the pending list against the local `startTime`, then emit. If the emit throws, roll `sessioninfo.rev` back to `startRev` so the next fan-out retries. Time is only committed after a successful send.

2. **Test re-implemented the decision (Rule violation · Reliability).** The original test re-implemented the NEW_CHANGES vs NEW_CHANGES_BATCH switch locally instead of exercising the production code. Removing the production logic would have left the test green.

   Fix: extract the pure wire-format decision into src/node/handler/NewChangesPacker.ts (no DB / pad dependency, so the test can import it directly under vitest), and rewrite the test to assert against the exported `buildNewChangesEmits` function from that module. PadMessageHandler now calls the same function; deleting it would fail the test.

9/9 tests across new-changes-batch + prom-instruments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
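The claim-then-rollback fix can be sketched as follows; `fanOutOnce` and the session/emit shapes are illustrative names, not the actual `updatePadClients` code:

```typescript
// Sketch of the race fix: claim the pending range synchronously so a
// concurrent fan-out on the same socket skips it, and roll back on emit
// failure so the range is retried rather than silently lost.
type SessionInfo = {rev: number; time: number};

export const fanOutOnce = async (
    sessioninfo: SessionInfo,
    headRev: number,
    headTime: number,
    emit: (fromRev: number, toRev: number) => Promise<void>,
): Promise<boolean> => {
  const startRev = sessioninfo.rev;
  if (startRev >= headRev) return false;  // nothing pending for this socket
  // Claim (startRev, headRev] before any await: a concurrent run now
  // sees the bumped rev and skips the same range.
  sessioninfo.rev = headRev;
  try {
    await emit(startRev, headRev);
  } catch (err) {
    // Emit failed: release the claim so the next fan-out retries.
    sessioninfo.rev = startRev;
    throw err;
  }
  // Time is committed only after a successful send.
  sessioninfo.time = headTime;
  return true;
};
```

The key property is that the claim happens before the first `await`, so there is no window in which two overlapping fan-outs can both collect the same range.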
Captures everything learned since the first draft:

- The "250-author cliff" was a measurement artefact from per-IP commitRateLimiting + colocated harness. Fixed via the etherpad-load-test#105 workflow patch. Real ceiling is ~350-400 authors on a 4-vCPU GitHub runner.
- apply_mean ballooning at the cliff isn't slow code — it's OS preemption (7+ cores of work on 4 vCPU). Application-level JS rearrangement can't reach it.
- Two changes hold up under the dive: fan-out serialization + NEW_CHANGES_BATCH (#7768, 70% p95 drop at 200 authors) and historicalAuthorData cache (#7769, neutral on dive but real production thundering-herd fix at join time).
- Four directions didn't pan out: WebSocket-only transport, heap bump, message-level batching alone (#7766 closed), and rebase-loop prefetch (#7770 closed). Each has a one-line cause documented for the record.
- Engine.io transport-level packing (#7767) is the meatiest untouched lever — sending multiple packets per WebSocket frame the way polling already does via encodePayload.

Qodo-flagged corrections incorporated:

1. The new instruments are Histogram + Counter + Gauge, not "three counters" — labelled correctly.
2. The lever-3 line reference now points at updatePadClients (lines 985-999) where NEW_CHANGES actually emits, not the wrong line 627 (handleSaveRevisionMessage).
3. Lever 3's results are written up against measured data, not "deferred".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three identical sweeps against develop quantify the runner-noise envelope. Same workload, same code, same workflow → p95 at step 350 ranged 39-122ms on baseline (3.1x spread). At step 300, 1.9x spread.

What this means for prior conclusions in this doc:

- websocket-only-is-worst HOLDS at the cliff: its envelope min (2463) equals baseline's max (2463), envelopes don't overlap. Single contradicting run was an outlier.
- lever-3 (#7768) "70% p95 drop at 200" was a single-run outlier comparison. The real reliable improvement is ~5-15% median p95 plus much tighter consistency (fewer tail-latency excursions). The mechanism — per-socket serialization preventing overlapping fan-outs that contend for CPU — is still real and still worth merging; the headline number was inflated.
- below the cliff, all four levers' noise envelopes overlap. No clear winner.

Going forward: lever scoring should default to N >= 3 trials and report min/median/max, not single-run point estimates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
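The min/median/max reporting rule proposed above is simple enough to pin down in code; the function name here is illustrative, not part of the harness:

```typescript
// Minimal sketch of N >= 3 scoring: report the envelope of a metric
// across trials instead of a single-run point estimate.
export const envelope =
    (trials: number[]): {min: number; median: number; max: number} => {
  if (trials.length === 0) throw new Error('need at least one trial');
  const sorted = [...trials].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Even trial counts average the two middle values.
  const median = sorted.length % 2 === 1
      ? sorted[mid]
      : (sorted[mid - 1] + sorted[mid]) / 2;
  return {min: sorted[0], median, max: sorted[sorted.length - 1]};
};
```

For the step-350 baseline runs quoted later in this thread, `envelope([39, 122, 39])` yields the 39/39/122 triple: a 3.1x spread that no single run can reveal.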
N=3 scoring of feat/cache-historical-author-data shows it's net-negative above 300 authors (step 350 p95 envelope 301/488/633ms vs develop baseline 39/39/122ms). Two compounding issues:

- The motivating hypothesis (250-cliff is a join thundering herd) was falsified — that cliff was the per-IP rate-limit artefact.
- The defensive shallow-clone-on-every-get() added in the Qodo fix walks O(N) author entries per join, costing more than the inline Promise.all it replaced.

Updated recommendations: lever 3 (#7768) is now the only PR worth merging. Lever 6 (#7769) added to the do-not-merge list with honest data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two findings from rigorous N=3 scoring:

1. Lever 3 (#7768) is NOT a perf win. When you compare like-for-like matrix entries (develop-baseline vs PR-baseline), the per-socket serialization is slightly net-negative across the curve. My earlier "70% drop" was a single-run outlier; the subsequent "tighter envelope" was a cross-matrix-entry comparison confounded by noise. The serialization is still a real correctness fix (race on concurrent fan-outs + lost revisions on emit error), so the PR stays open, but the recommendation is now correctness-only.

2. Lever 8b (#7774) — engine.io flush deferral. The follow-up to the closed lever 8 that actually patches Socket.sendPacket instead of just transport.send. queueMicrotask-coalesced flush gives the transport multi-packet batches to work with at last. N=3 shows tighter tail at step 300-350 (122 → 110 max at 350, 71 → 58 max at 300). Not a cliff-mover. The only PR in this program with N=3-confirmed perf benefit.

Final disposition:

- Merge: #7774 (modest perf), #7768 (correctness), #7762 (already merged, instruments).
- The cliff at 350-400 authors is hardware-bound on a 4-vCPU runner, not code-bound. Production with more cores per host scales proportionally with no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
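The queueMicrotask-coalesced flush behind lever 8b can be sketched as follows; the class and method names are illustrative, not engine.io's actual internals:

```typescript
// Hedged sketch of flush deferral: instead of flushing the transport
// once per sendPacket call, queue the packet and coalesce the flush
// into one microtask, so every packet sent synchronously in the same
// turn reaches the transport as a single multi-packet batch.
export class CoalescingSender {
  private queue: string[] = [];
  private flushScheduled = false;
  public flushes: string[][] = [];  // each entry is one transport flush

  sendPacket(packet: string): void {
    this.queue.push(packet);
    if (this.flushScheduled) return;  // a flush is already pending
    this.flushScheduled = true;
    queueMicrotask(() => {
      this.flushScheduled = false;
      const batch = this.queue;
      this.queue = [];
      // In the real lever this would hand the whole batch to the
      // transport; here we just record it for inspection.
      this.flushes.push(batch);
    });
  }
}
```

A fan-out that emits to N sockets in one synchronous loop then produces one flush per socket rather than one per packet, which is what finally gives the transport multi-packet batches to pack.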
Summary
Honest re-evaluation after N=3 scoring: this PR's perf claim was wrong twice. The original "70% p95 drop at step 200" was a single-run outlier. The subsequent "tighter envelope" claim was a cross-matrix-entry comparison confounded by the runner-noise envelope (see scaling-dive doc for the methodology trap). Like-for-like comparison (develop-baseline vs this-PR-baseline, N=3 each):
The performance benefit isn't there. Recommendation now stands on the correctness fix only, not perf.
What this PR is for
Two bundled changes; only change A is justified by what we measured:
Change A — per-socket fan-out serialization (correctness fix).
`updatePadClients` previously advanced `sessioninfo.rev` inside the collect phase, before the emit. Under concurrency that allowed overlapping fan-outs on the same socket; if the emit on one branch threw after the other branch had already advanced rev, revisions could be silently lost (the client enforces `newRev === rev + 1` and stops applying on mismatch).

The fix snapshots `startRev` and `headRev` once and writes `sessioninfo.rev = headRev` immediately. A concurrent second run sees the bumped rev and skips the range; on emit failure `sessioninfo.rev` rolls back to `startRev`. At most one fan-out per socket per pad at a time, with retry semantics preserved. This is a real race the previous code exhibited.

Change B — NEW_CHANGES_BATCH wire format. When a recipient is more than one rev behind, the server packs queued revs into one `NEW_CHANGES_BATCH` emit. Single-rev fan-outs stay as `NEW_CHANGES`. Feature-flagged behind `settings.newChangesBatch` (`false` default); clients are forward-compatible.

In the dive, change B was dormant — steady-state catch-up is 1 rev per recipient per fan-out, so the batching branch never fired. It would fire under server slowness (GC pauses, disk hiccups, sustained delays inside `updatePadClients`). Useful forward-compat groundwork; not contributing to current measured numbers.

Tests

- `src/tests/backend-new/specs/new-changes-batch.test.ts` (4/4) — pins the wire-format decision via the exported `buildNewChangesEmits` function in `src/node/handler/NewChangesPacker.ts` (extracted so the test exercises the production code path).
- `src/tests/backend-new/specs/prom-instruments.test.ts` (5/5) — no regression on feat(metrics): 3 Prometheus counters for scaling dive (#7756) #7762.

Recommendation
Merge for the correctness fix. The previous race was real and could lose revisions under concurrent commits. The perf benefit I claimed wasn't real — runner noise dominated my earlier measurements.
For the perf direction, the actually-measured win is #7774 (engine.io flush deferral), which tightens the tail at mid-load with N=3 confirmation. That PR doesn't depend on this one.
🤖 Generated with Claude Code