Skip to content

feat(scaling): serialize per-socket fan-out + NEW_CHANGES_BATCH (#7756 lever 3)#7768

Draft
JohnMcLear wants to merge 2 commits into
developfrom
feat/new-changes-batch
Draft

feat(scaling): serialize per-socket fan-out + NEW_CHANGES_BATCH (#7756 lever 3)#7768
JohnMcLear wants to merge 2 commits into
developfrom
feat/new-changes-batch

Conversation

@JohnMcLear
Copy link
Copy Markdown
Member

@JohnMcLear JohnMcLear commented May 15, 2026

Summary

Honest re-evaluation after N=3 scoring: this PR's perf claim was wrong twice. The original "70% p95 drop at step 200" was a single-run outlier. The subsequent "tighter envelope" claim was a cross-matrix-entry comparison confounded by the runner-noise envelope (see scaling-dive doc for the methodology trap). Like-for-like comparison (develop-baseline vs this-PR-baseline, N=3 each):

Step develop this PR takeaway
100 28/38/38 39/40/47 within noise (PR slightly worse)
200 30/37/51 37/50/59 within noise (PR slightly worse)
300 38/45/71 40/77/119 PR worse
350 39/39/122 63/109/131 PR worse
400 1758/2275/2463 1350/2373/3065 overlapping

The performance benefit isn't there. Recommendation now stands on the correctness fix only, not perf.

What this PR is for

Two bundled changes; only change A is justified by what we measured:

Change A — per-socket fan-out serialization (correctness fix). updatePadClients previously advanced sessioninfo.rev inside the collect phase, before the emit. Under concurrency that allowed overlapping fan-outs on the same socket; if the emit on one branch threw after the other branch had already advanced rev, revisions could be silently lost (the client enforces newRev === rev + 1 and stops applying on mismatch).

The fix snapshots startRev and headRev once and writes sessioninfo.rev = headRev immediately. A concurrent second run sees the bumped rev and skips the range; on emit failure sessioninfo.rev rolls back to startRev. At most one fan-out per socket per pad at a time, with retry semantics preserved. This is a real race the previous code exhibited.

Change B — NEW_CHANGES_BATCH wire format. When a recipient is more than one rev behind, the server packs queued revs into one NEW_CHANGES_BATCH emit. Single-rev fan-outs stay as NEW_CHANGES. Feature-flagged behind settings.newChangesBatch: false default; clients are forward-compatible.

In the dive, change B was dormant — steady-state catch-up is 1 rev per recipient per fan-out, so the batching branch never fired. It would fire under server slowness (GC pauses, disk hiccups, sustained delays inside updatePadClients). Useful forward-compat groundwork; not contributing to current measured numbers.

Tests

  • src/tests/backend-new/specs/new-changes-batch.test.ts (4/4) — pins the wire-format decision via the exported buildNewChangesEmits function in src/node/handler/NewChangesPacker.ts (extracted so the test exercises the production code path).
  • src/tests/backend-new/specs/prom-instruments.test.ts (5/5) — no regression on feat(metrics): 3 Prometheus counters for scaling dive (#7756) #7762.
  • All 33 CI checks pass.

Recommendation

Merge for the correctness fix. The previous race was real and could lose revisions under concurrent commits. The perf benefit I claimed wasn't real — runner noise dominated my earlier measurements.

For the perf direction, the actually-measured win is #7774 (engine.io flush deferral), which tightens the tail at mid-load with N=3 confirmation. That PR doesn't depend on this one.

🤖 Generated with Claude Code

Identified by the #7756 scaling dive (PR #7765) and confirmed by
the engine.io transport investigation in #7767: socket.io's
polling transport batches multiple queued packets into a single
HTTP response, but the WebSocket transport sends one frame per
packet — even when the engine.io socket has several packets
buffered. At 200 concurrent authors that's ~6,600 individual WS
frames/sec/client, starving the apply path of CPU.

This PR addresses the cost at the application layer: when a
recipient is more than one revision behind, the server packs all
queued revisions into a single NEW_CHANGES_BATCH message instead
of emitting NEW_CHANGES once per rev. The wire payload is the same
information, just consolidated.

Feature-flagged:

- settings.newChangesBatch defaults to false. Production behaviour
  is unchanged.
- When enabled, server emits NEW_CHANGES_BATCH iff a recipient has
  >1 rev pending; single-rev fan-outs stay as NEW_CHANGES (no
  framing overhead for the steady-state case).

Clients are forward-compatible: both collab_client.ts (live editor)
and broadcast.ts (timeslider) now accept either message type and
normalise to a list. Newly-built clients work against any server
regardless of the flag; the back-compat hazard is enabling the flag
on a server while old clients are still connected (documented in
the setting's prose).

Tests: src/tests/backend-new/specs/new-changes-batch.test.ts pins
the server's wire-format decision. 4/4 new + 5/5 existing
prom-instruments stay green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@qodo-code-review
Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

Review Summary by Qodo

Pack multi-revision fan-out into single NEW_CHANGES_BATCH emit

✨ Enhancement

Grey Divider

Walkthroughs

Description
• Pack multiple queued revisions into single NEW_CHANGES_BATCH emit
  - Reduces engine.io packet count under high concurrency
  - WebSocket transport sends one frame per packet (polling already batches)
• Feature-flagged with settings.newChangesBatch (defaults to false)
  - Single-rev fan-outs remain as NEW_CHANGES (no overhead)
  - Multi-rev fan-outs use NEW_CHANGES_BATCH when enabled
• Clients forward-compatible: accept both message types
  - collab_client.ts and broadcast.ts normalize to list
  - Old clients connecting to enabled server will miss batched revisions
• Comprehensive test coverage with new test suite
  - 4 new tests pin wire-format decision
  - Existing prom-instruments tests remain green
Diagram
flowchart LR
  A["Multiple queued revisions"] --> B{"newChangesBatch enabled?"}
  B -->|No or single rev| C["Emit per-rev NEW_CHANGES"]
  B -->|Yes and multi-rev| D["Emit single NEW_CHANGES_BATCH"]
  C --> E["Client receives"]
  D --> E
  E --> F["Normalize to list"]
  F --> G["Apply revisions in order"]
Loading

Grey Divider

File Changes

1. src/node/handler/PadMessageHandler.ts ✨ Enhancement +49/-20

Implement server-side revision batching logic

• Collect all pending revisions into array before emitting
• Emit NEW_CHANGES_BATCH when flag enabled and multiple revisions queued
• Fall back to per-revision NEW_CHANGES for single revisions or when flag disabled
• Moved error handling outside loop to wrap entire emit operation

src/node/handler/PadMessageHandler.ts


2. src/node/prom-instruments.ts ⚙️ Configuration changes +1/-0

Add NEW_CHANGES_BATCH to metrics allowlist

• Add NEW_CHANGES_BATCH to KNOWN_TYPES allowlist
• Enables Prometheus metric tracking for batched message type

src/node/prom-instruments.ts


3. src/node/utils/Settings.ts ⚙️ Configuration changes +16/-0

Add newChangesBatch feature flag setting

• Add newChangesBatch boolean field to SettingsType
• Set default to false to preserve legacy behavior
• Document feature purpose, client compatibility hazard, and rollout coordination requirement

src/node/utils/Settings.ts


View more (5)
4. src/static/js/broadcast.ts ✨ Enhancement +17/-11

Support NEW_CHANGES_BATCH in timeslider broadcast

• Accept both NEW_CHANGES and NEW_CHANGES_BATCH message types
• Normalize batched messages to array for uniform processing
• Apply each revision in order using existing changeset logic

src/static/js/broadcast.ts


5. src/static/js/collab_client.ts ✨ Enhancement +15/-8

Support NEW_CHANGES_BATCH in live editor client

• Accept both NEW_CHANGES and NEW_CHANGES_BATCH message types
• Normalize single-rev messages to array for uniform iteration
• Apply each revision in order within shared composition-safety await
• Update warning message to reference actual message type

src/static/js/collab_client.ts


6. src/static/js/types/SocketIOMessage.ts ✨ Enhancement +14/-0

Add TypeScript types for batched message format

• Define NewChangesItem type for individual revision in batch
• Define ClientNewChangesBatch type for batched message format
• Maintain backward compatibility with existing ClientNewChanges type

src/static/js/types/SocketIOMessage.ts


7. src/tests/backend-new/specs/new-changes-batch.test.ts 🧪 Tests +73/-0

Add unit tests for NEW_CHANGES_BATCH emit decision

• New test suite with 4 test cases covering emit decision logic
• Test flag off behavior (per-rev emissions)
• Test flag on with single revision (no batch overhead)
• Test flag on with multiple revisions (single batch emit)
• Test empty pending list edge case

src/tests/backend-new/specs/new-changes-batch.test.ts


8. settings.json.template 📝 Documentation +12/-0

Document newChangesBatch setting in template

• Add newChangesBatch configuration option with default false
• Document feature purpose and performance benefit
• Warn about client compatibility requirement during rollout

settings.json.template


Grey Divider

Qodo Logo

@qodo-free-for-open-source-projects
Copy link
Copy Markdown

qodo-free-for-open-source-projects Bot commented May 15, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)

Grey Divider


Action required

1. Test reimplements emit decision ✓ Resolved 📘 Rule violation ☼ Reliability
Description
The added test re-implements the NEW_CHANGES vs NEW_CHANGES_BATCH decision locally instead of
exercising the real server code path, so it would still pass even if the production batching logic
were removed or changed. This does not meet the requirement for a regression test that fails when
the fix is reverted.
Code

src/tests/backend-new/specs/new-changes-batch.test.ts[R16-36]

+// The decision the new code makes is small and pure: given a `pending`
+// array of N >= 1 revisions and the feature flag, emit one
+// NEW_CHANGES_BATCH (if N > 1 and flag on) or N NEW_CHANGES messages.
+// Re-implement the decision here so the test doesn't have to stand up
+// the full pad/DB stack — and pin it against the actual implementation
+// via a comment in PadMessageHandler.
+
+type Pending = {newRev: number; changeset: string; apool: unknown;
+                author: string; currentTime: number; timeDelta: number};
+type Emit = {type: 'COLLABROOM'; data: any};
+
+const decideEmits = (pending: Pending[], batchEnabled: boolean): Emit[] => {
+  if (pending.length === 0) return [];
+  if (batchEnabled && pending.length > 1) {
+    return [{type: 'COLLABROOM', data: {type: 'NEW_CHANGES_BATCH', changes: pending}}];
+  }
+  return pending.map((change) => ({
+    type: 'COLLABROOM',
+    data: {type: 'NEW_CHANGES', ...change},
+  }));
+};
Evidence
PR Compliance ID 5 requires an automated regression test that fails if the fix is reverted. The test
file implements and tests a local decideEmits() function rather than invoking the actual
socket.emit() decision code in PadMessageHandler, so it is not a true regression test for the
production change.

src/tests/backend-new/specs/new-changes-batch.test.ts[16-36]
src/node/handler/PadMessageHandler.ts[1015-1032]
Best Practice: Repository guidelines

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The new regression test does not validate the actual implementation: it defines `decideEmits()` inside the test and asserts on that, so the test can pass even if the real batching logic in `PadMessageHandler.updatePadClients()` is broken/reverted.
## Issue Context
Compliance requires a regression test that would fail without the fix. The production decision currently lives inside `src/node/handler/PadMessageHandler.ts`.
## Fix Focus Areas
- src/tests/backend-new/specs/new-changes-batch.test.ts[16-36]
- src/node/handler/PadMessageHandler.ts[1015-1032]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


2. Rev advanced before send ✓ Resolved 🐞 Bug ≡ Correctness
Description
PadMessageHandler.updatePadClients() advances sessioninfo.rev/time while still collecting revisions,
before any socket.emit() happens; if another updatePadClients() runs in that window (possible due to
un-awaited call sites) it can emit later revisions first, and if emit throws the skipped revisions
will never be retried. The client enforces strict newRev===rev+1 and returns early on mismatch, so
this can silently stop applying updates (pad desync).
Code

src/node/handler/PadMessageHandler.ts[R980-992]

+    // Collect all queued revisions for this socket.
+    const pending: Array<{
+      newRev: number;
+      changeset: string;
+      apool: unknown;
+      author: string;
+      currentTime: number;
+      timeDelta: number;
+    }> = [];
+
   while (sessioninfo.rev < pad.getHeadRevisionNumber()) {
     const r = sessioninfo.rev + 1;
     let revision = revCache[r];
Evidence
The server advances session state before emitting, and there are un-awaited call sites that can
overlap fan-out; the client rejects non-sequential revisions, so missed/out-of-order delivery stops
applying updates.

src/node/handler/PadMessageHandler.ts[950-1036]
src/node/db/API.ts[310-328]
src/static/js/collab_client.ts[191-215]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`updatePadClients()` mutates `sessioninfo.rev`/`sessioninfo.time` while building `pending`, before the corresponding `socket.emit()` is actually performed. This makes `sessioninfo` claim revisions have been delivered when they are only computed locally, which can cause:
- later `updatePadClients()` invocations (that see the advanced `sessioninfo.rev`) to emit subsequent revisions first,
- permanent loss of revisions if `socket.emit()` throws (no retry because `sessioninfo.rev` is already advanced).
## Issue Context
Some call sites invoke `updatePadClients(pad)` without `await`, so overlapping invocations are plausible under load. The client rejects out-of-sequence revisions (`newRev !== rev + 1`) and stops applying.
## Fix Focus Areas
- src/node/handler/PadMessageHandler.ts[950-1036]
- src/node/db/API.ts[310-328]
- src/static/js/collab_client.ts[184-216]
### Suggested fix approach
1. Add per-socket (or per-pad) serialization so only one fan-out per socket runs at a time (e.g., store a `sessioninfo._fanoutPromise` chain and `await` it).
2. During collection, track `nextRev`/`nextTime` in local variables; only commit `sessioninfo.rev/time` **after** a successful emit of the batch (or after each successful per-rev emit in the non-batch path).
3. On emit failure, do not advance `sessioninfo.rev/time` (so the next fan-out can retry), or explicitly force a reconnect/disconnect path if retry is not desired.

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools


Grey Divider

Qodo Logo

Comment thread src/tests/backend-new/specs/new-changes-batch.test.ts Outdated
Comment thread src/node/handler/PadMessageHandler.ts Outdated
Two issues raised on the first push:

1. **Rev advanced before send (Bug · Correctness).** The previous
   diff advanced sessioninfo.rev/time inside the collect loop,
   before any emit ran. A concurrent updatePadClients() could then
   see the bumped rev and skip those revisions, and if the emit
   threw later, the skipped revs were lost forever. The client
   enforces strict newRev===rev+1 and silently stops applying on
   mismatch — net effect was a possible pad desync under
   concurrent fan-outs.

   Fix: snapshot startRev/startTime once, claim the
   (startRev, headRev] range by setting sessioninfo.rev = headRev
   immediately (so a concurrent run skips it), build the pending
   list against the local startTime, then emit. If the emit
   throws, roll sessioninfo.rev back to startRev so the next
   fan-out retries. Time is only committed after a successful
   send.

2. **Test re-implemented the decision (Rule violation ·
   Reliability).** The original test re-implemented the
   NEW_CHANGES vs NEW_CHANGES_BATCH switch locally instead of
   exercising the production code. Removing the production logic
   would have left the test green.

   Fix: extract the pure wire-format decision into
   src/node/handler/NewChangesPacker.ts (no DB / pad dependency,
   so the test can import it directly under vitest), and rewrite
   the test to assert against the exported `buildNewChangesEmits`
   function from that module. PadMessageHandler now calls the
   same function; deleting it would fail the test.

9/9 tests across new-changes-batch + prom-instruments.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@JohnMcLear JohnMcLear changed the title feat(scaling): NEW_CHANGES_BATCH — pack multi-rev fan-out into one emit feat(scaling): serialize per-socket fan-out + NEW_CHANGES_BATCH (#7756 lever 3) May 15, 2026
@JohnMcLear JohnMcLear requested a review from SamTV12345 May 16, 2026 04:51
@JohnMcLear JohnMcLear marked this pull request as draft May 16, 2026 04:51
JohnMcLear added a commit that referenced this pull request May 16, 2026
Captures everything learned since the first draft:

- The "250-author cliff" was a measurement artefact from per-IP
  commitRateLimiting + colocated harness. Fixed via the
  etherpad-load-test#105 workflow patch. Real ceiling is ~350-400
  authors on a 4-vCPU GitHub runner.

- apply_mean ballooning at the cliff isn't slow code — it's OS
  preemption (7+ cores of work on 4 vCPU). Application-level JS
  rearrangement can't reach it.

- Two changes hold up under the dive: fan-out serialization
  + NEW_CHANGES_BATCH (#7768, 70% p95 drop at 200 authors) and
  historicalAuthorData cache (#7769, neutral on dive but real
  production thundering-herd fix at join time).

- Four directions didn't pan out: WebSocket-only transport, heap
  bump, message-level batching alone (#7766 closed), and
  rebase-loop prefetch (#7770 closed). Each has a one-line cause
  documented for the record.

- Engine.io transport-level packing (#7767) is the meatiest
  untouched lever — sending multiple packets per WebSocket frame
  the way polling already does via encodePayload.

Qodo-flagged corrections incorporated:
1. The new instruments are Histogram + Counter + Gauge, not
   "three counters" — labelled correctly.
2. The lever-3 line reference now points at updatePadClients
   (lines 985-999) where NEW_CHANGES actually emits, not the
   wrong line 627 (handleSaveRevisionMessage).
3. Lever 3's results are written up against measured data, not
   "deferred".

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JohnMcLear added a commit that referenced this pull request May 16, 2026
Three identical sweeps against develop quantify the runner-noise
envelope. Same workload, same code, same workflow → p95 at step 350
ranged 39-122ms on baseline (3.1x spread). At step 300, 1.9x spread.

What this means for prior conclusions in this doc:

- websocket-only-is-worst HOLDS at the cliff: its envelope min (2463)
  equals baseline's max (2463), envelopes don't overlap. Single
  contradicting run was an outlier.

- lever-3 (#7768) "70% p95 drop at 200" was a single-run outlier
  comparison. The real reliable improvement is ~5-15% median p95
  plus much tighter consistency (fewer tail-latency excursions).
  The mechanism — per-socket serialization preventing overlapping
  fan-outs that contend for CPU — is still real and still worth
  merging; the headline number was inflated.

- below the cliff, all four levers' noise envelopes overlap. No
  clear winner.

Going forward: lever scoring should default to N >= 3 trials and
report min/median/max, not single-run point estimates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JohnMcLear added a commit that referenced this pull request May 16, 2026
N=3 scoring of feat/cache-historical-author-data shows it's
net-negative above 300 authors (step 350 p95 envelope
301/488/633ms vs develop baseline 39/39/122ms). Two compounding
issues:
- The motivating hypothesis (250-cliff is a join thundering herd)
  was falsified — that cliff was the per-IP rate-limit artefact.
- The defensive shallow-clone-on-every-get() added in the Qodo
  fix walks O(N) author entries per join, costing more than the
  inline Promise.all it replaced.

Updated recommendations: lever 3 (#7768) is now the only PR worth
merging. lever 6 (#7769) added to the do-not-merge list with
honest data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
JohnMcLear added a commit that referenced this pull request May 16, 2026
Two findings from rigorous N=3 scoring:

1. Lever 3 (#7768) is NOT a perf win. When you compare like-for-
   like matrix entries (develop-baseline vs PR-baseline), the
   per-socket serialization is slightly net-negative across the
   curve. My earlier "70% drop" was a single-run outlier; the
   subsequent "tighter envelope" was a cross-matrix-entry
   comparison confounded by noise. The serialization is still a
   real correctness fix (race on concurrent fan-outs + lost
   revisions on emit error) so the PR stays open, but the
   recommendation is now correctness-only.

2. Lever 8b (#7774) — engine.io flush deferral. The follow-up to
   the closed lever 8 that actually patches Socket.sendPacket
   instead of just transport.send. queueMicrotask-coalesced flush
   gives the transport multi-packet batches to work with at last.
   N=3 shows tighter tail at step 300-350 (122 → 110 max at 350,
   71 → 58 max at 300). Not a cliff-mover. The only PR in this
   program with N=3-confirmed perf benefit.

Final disposition:
- Merge: #7774 (modest perf), #7768 (correctness), #7762 (already
  merged, instruments).
- The cliff at 350-400 authors is hardware-bound on a 4-vCPU
  runner, not code-bound. Production with more cores per host
  scales proportionally with no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant