feat(metrics): 3 Prometheus counters for scaling dive (#7756) #7762
Conversation
Per section 6 of the spec in #7756: enables the load-test harness to attribute *where* time goes on the server, not just the gauge headline (CPU / event-loop / memory) the dive doc starts from.

New /stats/prometheus rows:

- etherpad_pad_users{padId} — gauge, derived from sessioninfos on each scrape. Lets the harness confirm the pad it points at actually has the expected concurrency.
- etherpad_changeset_apply_duration_seconds — histogram observed inside handleUserChanges. Separates "apply path is slow" from "fan-out is slow" when latency rises.
- etherpad_socket_emits_total{type} — counter at the broadcast emit sites (handleCustomObjectMessage, handleCustomMessage, sendChatMessageToPadClients) and inside the NEW_CHANGES per-socket loop in updatePadClients. Bucketed by message type so the harness can measure the amplification factor of each lever (especially the fan-out batching lever).

Metric handles live in a new prom-instruments.ts module rather than in prometheus.ts itself, so PadMessageHandler can import the recording helpers without creating a circular dependency (prometheus.ts already requires PadMessageHandler); see the sketch after this message.

Tests: smoke test verifies recordSocketEmit + recordChangesetApply move the underlying counters/histogram.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
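The commit doesn't show the module itself; here is a minimal sketch of the shape it describes, assuming prom-client. The helper names (recordSocketEmit, recordChangesetApply) and metric names come from the commit message; bucket boundaries, help strings, and the collect() body are illustrative.

```typescript
// src/node/prom-instruments.ts -- illustrative sketch, not the shipped file.
import {Counter, Gauge, Histogram, register} from 'prom-client';

const socketEmitsTotal = new Counter({
  name: 'etherpad_socket_emits_total',
  help: 'Broadcast socket emits, bucketed by message type',
  labelNames: ['type'],
  registers: [register],
});

const changesetApplyDuration = new Histogram({
  name: 'etherpad_changeset_apply_duration_seconds',
  help: 'Time spent applying a changeset inside handleUserChanges',
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1], // illustrative boundaries
  registers: [register],
});

// Gauge derived at scrape time; prom-client calls collect() on every scrape.
new Gauge({
  name: 'etherpad_pad_users',
  help: 'Concurrent users per pad, derived from sessioninfos',
  labelNames: ['padId'],
  registers: [register],
  collect() {
    // In the real module this would walk sessioninfos (via prometheus.ts's
    // getPadUsersMap, per the diagram) and call this.set({padId}, count).
  },
});

// Helpers PadMessageHandler imports; keeping them here rather than in
// prometheus.ts breaks the circular require described above.
export const recordSocketEmit = (type: string): void => {
  socketEmitsTotal.inc({type});
};

export const recordChangesetApply = (seconds: number): void => {
  changesetApplyDuration.observe(seconds);
};
```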
Review Summary by Qodo
Add Prometheus metrics for scaling dive load testing
Walkthrough
Description
• Add three Prometheus metrics for scaling dive load testing
- etherpad_pad_users{padId} gauge tracks concurrent users per pad
- etherpad_changeset_apply_duration_seconds histogram measures edit processing time
- etherpad_socket_emits_total{type} counter tracks broadcast messages by type
• Create new prom-instruments.ts module to avoid circular imports
• Instrument hot paths in PadMessageHandler to record metrics
• Add smoke tests verifying metric recording functionality
Diagram

```mermaid
flowchart LR
  PMH["PadMessageHandler<br/>hot path"] -->|import| PI["prom-instruments.ts<br/>metric handles"]
  PI -->|register| PROM["prometheus.ts<br/>central registry"]
  PMH -->|recordSocketEmit| SE["socketEmitsTotal<br/>counter"]
  PMH -->|recordChangesetApply| CAD["changesetApplyDuration<br/>histogram"]
  PROM -->|getPadUsersMap| PUG["padUsersGauge<br/>gauge"]
  PROM -->|expose| STATS["/stats/prometheus<br/>endpoint"]
```
File Changes
1. src/node/prom-instruments.ts
…lly emits (#99)

Confirmed from the first real dive run (25934713423): core's /stats/prometheus uses prom-client's collectDefaultMetrics output (process_cpu_user_seconds_total, nodejs_eventloop_lag_p95_seconds, process_resident_memory_bytes, ...) — not the nodejs_cpu_gauge / nodejs_eventloop_latency_gauge names that src/node/metrics.ts defines but never registers. The Scraper's default allowlist was filtering EVERYTHING out, so all dive reports had empty cpu_user / evloop_p95_ms / rss_mb columns.

Two changes (sketched below):

1. Update DEFAULT_KEEP prefixes to match real prom-client names. Includes 'etherpad_' as a single prefix that covers all current and future etherpad_ rows (including the three added in ether/etherpad#7762).
2. Update Reporter CSV column mapping to read process_cpu_user_seconds_total, nodejs_eventloop_lag_p95_seconds, and process_resident_memory_bytes (converting seconds -> ms and bytes -> MB as before). CSV column names stay stable; only the underlying lookup keys change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
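The commit's diff isn't shown here; a minimal sketch of both changes, assuming the harness is TypeScript. The DEFAULT_KEEP entries are the exact names the commit lists; `keepLine` and the converter helpers are illustrative names, not the Scraper/Reporter's real API.

```typescript
// Illustrative sketch of the Scraper/Reporter changes; helper names assumed.
const DEFAULT_KEEP = [
  'etherpad_', // one prefix covers all current and future etherpad_ rows
  'process_cpu_user_seconds_total',
  'nodejs_eventloop_lag_p95_seconds',
  'process_resident_memory_bytes',
];

// Scraper: keep only the /stats/prometheus lines matching the allowlist.
const keepLine = (line: string): boolean =>
    DEFAULT_KEEP.some((prefix) => line.startsWith(prefix));

// Reporter: same CSV columns, new lookup keys, same unit conversions.
const toEvloopP95Ms = (seconds: number): number => seconds * 1000;
const toRssMb = (bytes: number): number => bytes / (1024 * 1024);
```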
… label cardinality

Three issues raised on the initial PR (a sketch of the fixes follows this list):

1. **Feature flag.** Per project compliance rule, new features must be behind a flag and disabled by default. Adds `settings.scalingDiveMetrics` (default `false`). When off, recordSocketEmit() / recordChangesetApply() short-circuit to no-ops and the metrics are never even registered with the Prometheus register. Enable only when running the ether/etherpad-load-test scaling-dive harness.
2. **Histogram scope.** Previously the etherpad_changeset_apply_duration_seconds timer wrapped the whole handleUserChanges() body — including `await exports.updatePadClients(pad)` — so the histogram measured apply+fan-out, defeating its stated purpose. Now stopped immediately after the apply work (`assert.equal(...rev, r)`), before the ACCEPT_COMMIT socket emit and the updatePadClients call. Failed applies deliberately don't observe, so the success-path distribution stays clean.
3. **Label cardinality.** handleCustomMessage was passing the user-supplied msgString (an HTTP-API param) directly as the `type` label value. A misbehaving API caller could grow prom-client's internal label map until OOM. Now bucketed against a known-types allowlist; anything outside it lands in `other`.

Tests updated: 5/5 — covers the happy path, "other" bucketing of unknown/unsafe labels, and that the flag-disabled state is a true no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
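A hedged sketch of all three fixes. `settings.scalingDiveMetrics` is the flag the commit names; the KNOWN_TYPES contents, the `declare` shims, and the `applyWithTiming` wrapper are illustrative, not the PR's actual code.

```typescript
import {Counter, Histogram} from 'prom-client';

// Illustrative stand-ins for the real settings object and metric handles.
declare const settings: {scalingDiveMetrics: boolean};
declare const socketEmitsTotal: Counter;
declare const changesetApplyDuration: Histogram;

// Fix 3: allowlist the label values; the set's contents are illustrative.
const KNOWN_TYPES = new Set(['NEW_CHANGES', 'ACCEPT_COMMIT', 'CHAT_MESSAGE', 'CUSTOM']);

export const recordSocketEmit = (type: string): void => {
  if (!settings.scalingDiveMetrics) return; // fix 1: flag off => true no-op
  // A user-supplied msgString can no longer mint unbounded label values.
  socketEmitsTotal.inc({type: KNOWN_TYPES.has(type) ? type : 'other'});
};

// Fix 2: stop the timer right after the apply work, before the ACCEPT_COMMIT
// emit and updatePadClients, so the histogram measures apply only.
export const applyWithTiming = async (applyFn: () => Promise<void>): Promise<void> => {
  if (!settings.scalingDiveMetrics) { await applyFn(); return; }
  const endTimer = changesetApplyDuration.startTimer();
  await applyFn(); // if this throws we never observe, keeping the
  endTimer();      // success-path distribution clean
};
```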
Two findings from rigorous N=3 scoring:

1. Lever 3 (#7768) is NOT a perf win. When you compare like-for-like matrix entries (develop-baseline vs PR-baseline), the per-socket serialization is slightly net-negative across the curve. My earlier "70% drop" was a single-run outlier; the subsequent "tighter envelope" was a cross-matrix-entry comparison confounded by noise. The serialization is still a real correctness fix (a race on concurrent fan-outs plus lost revisions on emit error), so the PR stays open, but the recommendation is now correctness-only.
2. Lever 8b (#7774) — engine.io flush deferral. The follow-up to the closed lever 8 that actually patches Socket.sendPacket instead of just transport.send. A queueMicrotask-coalesced flush (see the generic sketch below) gives the transport multi-packet batches to work with at last. N=3 shows a tighter tail at steps 300-350 (122 → 110 max at 350, 71 → 58 max at 300). Not a cliff-mover, but the only PR in this program with an N=3-confirmed perf benefit.

Final disposition:

- Merge: #7774 (modest perf), #7768 (correctness), #7762 (already merged, instruments).
- The cliff at 350-400 authors is hardware-bound on a 4-vCPU runner, not code-bound. Production with more cores per host scales proportionally with no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
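For reference, a generic sketch of the queueMicrotask-coalesced flush pattern lever 8b relies on; the real patch lands inside engine.io's Socket.sendPacket and is not reproduced here, so the class and its names are purely illustrative.

```typescript
// All packets enqueued in the same synchronous turn are handed to the
// transport as one batch on the next microtask, instead of one write each.
class CoalescingSender {
  private queue: string[] = [];
  private flushScheduled = false;

  constructor(private readonly transportSend: (packets: string[]) => void) {}

  send(packet: string): void {
    this.queue.push(packet);
    if (this.flushScheduled) return; // a flush is already pending
    this.flushScheduled = true;
    queueMicrotask(() => {
      this.flushScheduled = false;
      const batch = this.queue;
      this.queue = [];
      this.transportSend(batch); // multi-packet batch for the transport
    });
  }
}
```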
Summary
Adds three Prometheus rows to `/stats/prometheus` so the load-test harness for #7756 can attribute where time goes on the server, not just the gauge headline (CPU / event-loop / memory):

- `etherpad_pad_users{padId}` — gauge of concurrent users per pad, derived from `sessioninfos` on each scrape.
- `etherpad_changeset_apply_duration_seconds` — histogram observed inside `handleUserChanges`.
- `etherpad_socket_emits_total{type}` — counter at the broadcast emit sites, bucketed by message type.
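For orientation, rows of the shape these metrics produce in the Prometheus exposition format (all values and the padId label below are made up, not from a real run):

```text
etherpad_pad_users{padId="load-test-pad"} 120
etherpad_changeset_apply_duration_seconds_bucket{le="0.01"} 4021
etherpad_changeset_apply_duration_seconds_sum 18.7
etherpad_changeset_apply_duration_seconds_count 4100
etherpad_socket_emits_total{type="NEW_CHANGES"} 51234
etherpad_socket_emits_total{type="other"} 3
```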
Metric handles live in a new `src/node/prom-instruments.ts` rather than `prometheus.ts` itself, to avoid a circular import — `prometheus.ts` already `require`s `PadMessageHandler` to read `sessioninfos`.
This is the follow-up PR planned in section 6 of `docs/superpowers/specs/2026-05-15-scaling-dive-design.md` (in `ether/etherpad-load-test`). The load-test harness already tolerates these metrics' absence, so it ships unblocked; this PR just makes the next dive run more informative.
Test Plan
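A minimal sketch of the kind of smoke test the first commit describes ("recordSocketEmit + recordChangesetApply move the underlying counters/histogram"), assuming mocha, prom-client >= v13 (async metric.get()), and a hypothetical import path; this is not the PR's actual test file.

```typescript
// test/prom-instruments-smoke.ts -- illustrative sketch only.
import assert from 'assert';
import {register} from 'prom-client';
import {recordSocketEmit, recordChangesetApply} from '../node/prom-instruments';

describe('prom-instruments smoke', () => {
  it('moves the counter and the histogram', async () => {
    recordSocketEmit('NEW_CHANGES');
    recordChangesetApply(0.005); // seconds

    const counter = await register.getSingleMetric('etherpad_socket_emits_total')!.get();
    assert.ok(counter.values.some(
        (v) => v.labels.type === 'NEW_CHANGES' && v.value >= 1));

    const hist = await register.getSingleMetric('etherpad_changeset_apply_duration_seconds')!.get();
    assert.ok(hist.values.some(
        (v) => v.metricName === 'etherpad_changeset_apply_duration_seconds_count' && v.value >= 1));
  });
});
```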
🤖 Generated with Claude Code