
feat(metrics): 3 Prometheus counters for scaling dive (#7756) #7762

Merged

JohnMcLear merged 2 commits into develop from feat/scaling-dive-counters on May 15, 2026
Conversation

@JohnMcLear (Member)

Summary

Adds three Prometheus rows to `/stats/prometheus` so the load-test harness for #7756 can attribute where time goes on the server, not just the gauge headline (CPU / event-loop / memory):

  • `etherpad_pad_users{padId}` — gauge derived from `sessioninfos` on scrape. Lets the harness confirm a target pad actually has the expected concurrency.
  • `etherpad_changeset_apply_duration_seconds` — histogram observed inside `handleUserChanges`. Separates "apply path is slow" from "fan-out is slow" when latency rises.
  • `etherpad_socket_emits_total{type}` — counter at the broadcast emit sites (`handleCustomObjectMessage`, `handleCustomMessage`, `sendChatMessageToPadClients`) and the per-socket `NEW_CHANGES` loop in `updatePadClients`. Bucketed by message type so the harness can measure the amplification factor of each lever (especially the fan-out batching lever).

Metric handles live in a new `src/node/prom-instruments.ts` rather than `prometheus.ts` itself, to avoid a circular import — `prometheus.ts` already `require`s `PadMessageHandler` to read `sessioninfos`.
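
For reference, a minimal sketch of what `prom-instruments.ts` could look like — assuming prom-client's API and the helper names used in this PR; the bucket values and the `registers: []` detail are illustrative, not copied from the diff:

```typescript
import client from 'prom-client';

// Handles live here (not in prometheus.ts) so PadMessageHandler can import
// the recording helpers without a circular dependency. Passing registers: []
// leaves registration to prometheus.ts's register.registerMetric(...) calls.
export const changesetApplyDuration = new client.Histogram({
  name: 'etherpad_changeset_apply_duration_seconds',
  help: 'Time spent applying a client changeset in handleUserChanges',
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1], // illustrative buckets
  registers: [],
});

export const socketEmitsTotal = new client.Counter({
  name: 'etherpad_socket_emits_total',
  help: 'Broadcast socket emits, bucketed by message type',
  labelNames: ['type'],
  registers: [],
});

export const padUsersGauge = new client.Gauge({
  name: 'etherpad_pad_users',
  help: 'Connected users per pad, derived from sessioninfos on scrape',
  labelNames: ['padId'],
  registers: [],
});

// Hot-path helpers: synchronous and cheap.
export const recordChangesetApply = () => changesetApplyDuration.startTimer();
export const recordSocketEmit = (type?: string): void => {
  socketEmitsTotal.labels(type ?? 'unknown').inc();
};
```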

This is the follow-up PR planned in section 6 of `docs/superpowers/specs/2026-05-15-scaling-dive-design.md` (in `ether/etherpad-load-test`). The load-test harness already tolerates these metrics' absence, so it ships unblocked; this PR just makes the next dive run more informative.

Test Plan

  • Backend test `src/tests/backend-new/specs/prom-instruments.test.ts` passes (3/3 locally) — verifies the recording helpers move the underlying counter/histogram.
  • Existing backend test suite remains green.
  • After merge, `curl http://your-etherpad/stats/prometheus | grep etherpad_` shows the three new rows.
  • Trigger the Scaling dive workflow with `core_ref` pointing at this branch (or at develop, post-merge); verify the harness reports include `etherpad_socket_emits_total{type=NEW_CHANGES}` and `etherpad_changeset_apply_duration_seconds_count` in the JSON snapshot.

🤖 Generated with Claude Code

Per section 6 of the #7756 spec: enables the load-test harness to
attribute *where* time goes on the server, not just the gauge headline
(CPU / event-loop / memory) the dive doc starts from.

New /stats/prometheus rows:

- etherpad_pad_users{padId} — gauge, derived from sessioninfos on
  each scrape. Lets the harness confirm the pad it points at actually
  has the expected concurrency.

- etherpad_changeset_apply_duration_seconds — histogram observed
  inside handleUserChanges. Separates "apply path is slow" from
  "fan-out is slow" when latency rises.

- etherpad_socket_emits_total{type} — counter at the broadcast
  emit sites (handleCustomObjectMessage, handleCustomMessage,
  sendChatMessageToPadClients) and inside the NEW_CHANGES per-socket
  loop in updatePadClients. Bucketed by message type so the harness
  can measure the amplification factor of each lever (especially the
  fan-out batching lever).

Metric handles live in a new prom-instruments.ts module rather than
in prometheus.ts itself, so PadMessageHandler can import the
recording helpers without creating a circular dependency
(prometheus.ts already requires PadMessageHandler).

Tests: smoke test verifies recordSocketEmit + recordChangesetApply
move the underlying counters/histogram.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@qodo-free-for-open-source-projects

Review Summary by Qodo

Add Prometheus metrics for scaling dive load testing

✨ Enhancement


Walkthroughs

Description
• Add three Prometheus metrics for scaling dive load testing
  - etherpad_pad_users{padId} gauge tracks concurrent users per pad
  - etherpad_changeset_apply_duration_seconds histogram measures edit processing time
  - etherpad_socket_emits_total{type} counter tracks broadcast messages by type
• Create new prom-instruments.ts module to avoid circular imports
• Instrument hot paths in PadMessageHandler to record metrics
• Add smoke tests verifying metric recording functionality
Diagram

```mermaid
flowchart LR
  PMH["PadMessageHandler<br/>hot path"] -->|import| PI["prom-instruments.ts<br/>metric handles"]
  PI -->|register| PROM["prometheus.ts<br/>central registry"]
  PMH -->|recordSocketEmit| SE["socketEmitsTotal<br/>counter"]
  PMH -->|recordChangesetApply| CAD["changesetApplyDuration<br/>histogram"]
  PROM -->|getPadUsersMap| PUG["padUsersGauge<br/>gauge"]
  PROM -->|expose| STATS["/stats/prometheus<br/>endpoint"]
```


File Changes

1. src/node/prom-instruments.ts ✨ Enhancement +36/-0

New Prometheus instruments module for hot path

• New module defining three Prometheus instruments for the scaling dive
• Exports changesetApplyDuration histogram, socketEmitsTotal counter, and padUsersGauge gauge
• Provides recordChangesetApply() and recordSocketEmit() helper functions for hot-path recording
• Separated from prometheus.ts to avoid circular dependency with PadMessageHandler

2. src/node/handler/PadMessageHandler.ts ✨ Enhancement +21/-0

Instrument message handlers with metric recording

• Import recordChangesetApply and recordSocketEmit from new prom-instruments module
• Add getPadUsersMap() function to derive per-pad user counts from sessioninfos (a sketch follows this list)
• Instrument handleCustomObjectMessage() to record broadcast emits
• Instrument handleCustomMessage() to record broadcast emits
• Instrument sendChatMessageToPadClients() to record CHAT_MESSAGE emits
• Instrument handleUserChanges() to measure changeset apply duration
• Instrument updatePadClients() to record NEW_CHANGES emits per socket
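
A rough sketch of the `getPadUsersMap()` derivation flagged above — the `sessioninfos` field names are assumed, not copied from the diff:

```typescript
// Hedged sketch inside PadMessageHandler; assumes each sessioninfos entry
// gains a padId once its CLIENT_READY handshake has completed.
export const getPadUsersMap = (): Map<string, number> => {
  const counts = new Map<string, number>();
  for (const session of Object.values(sessioninfos)) {
    const padId = (session as {padId?: string}).padId;
    if (padId == null) continue; // connected socket not yet attached to a pad
    counts.set(padId, (counts.get(padId) ?? 0) + 1);
  }
  return counts;
};
```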

3. src/node/prometheus.ts ✨ Enhancement +14/-0

Register new metrics and populate pad users gauge

• Import and register three new metrics from prom-instruments.ts with central registry
• Update monitor() function to populate padUsersGauge by calling getPadUsersMap()
• Reset padUsersGauge on each scrape to avoid stale labels for drained pads (sketched below)
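
A minimal sketch of that scrape-time population, assuming prom-client's Gauge API and the getPadUsersMap() helper above (the padMessageHandler import name is an assumption):

```typescript
// Hedged sketch of the monitor() addition: rebuild the gauge on every scrape.
padUsersGauge.reset(); // drop labels for pads whose users have all left
for (const [padId, users] of padMessageHandler.getPadUsersMap()) {
  padUsersGauge.labels(padId).set(users);
}
```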

4. src/tests/backend-new/specs/prom-instruments.test.ts 🧪 Tests +47/-0

Smoke tests for metric recording helpers

• New smoke test file verifying recordSocketEmit() increments counter by message type (sketched below)
• Test verifies fallback to "unknown" label when type is undefined
• Test verifies recordChangesetApply() observes duration in histogram
• Tests reset metrics before each case to ensure isolation
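
A smoke test of that shape might look like the following — a sketch; the import path, mocha globals, and prom-client's async get() are assumptions, not copied from the PR:

```typescript
import assert from 'assert';
import {socketEmitsTotal, recordSocketEmit} from '../../../node/prom-instruments';

describe('prom-instruments', () => {
  beforeEach(() => socketEmitsTotal.reset()); // isolate each case

  it('increments the counter by message type', async () => {
    recordSocketEmit('NEW_CHANGES');
    const {values} = await socketEmitsTotal.get();
    const row = values.find((v) => v.labels.type === 'NEW_CHANGES');
    assert.equal(row?.value, 1);
  });

  it('falls back to "unknown" when type is undefined', async () => {
    recordSocketEmit(undefined);
    const {values} = await socketEmitsTotal.get();
    assert.ok(values.some((v) => v.labels.type === 'unknown' && v.value === 1));
  });
});
```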




qodo-free-for-open-source-projects Bot commented May 15, 2026

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (0) 📎 Requirement gaps (0)



Action required

1. New Prometheus metrics lack flag ✓ Resolved 📘 Rule violation ⚙ Maintainability
Description
The PR introduces new Prometheus metrics (etherpad_pad_users,
etherpad_changeset_apply_duration_seconds, etherpad_socket_emits_total) that are registered and
emitted unconditionally. This violates the requirement that new features be gated behind a feature
flag and disabled by default.
Code

src/node/prometheus.ts[R26-37]

```diff
+// Added for the #7756 scaling dive: lets the load-test harness attribute
+// where time goes (apply path vs. fan-out) and confirm per-pad concurrency.
+// The metric handles live in prom-instruments.ts to avoid a circular import
+// with PadMessageHandler (which records into them on the hot path).
+import {padUsersGauge, changesetApplyDuration, socketEmitsTotal} from './prom-instruments';
+register.registerMetric(padUsersGauge);
+register.registerMetric(changesetApplyDuration);
+register.registerMetric(socketEmitsTotal);
+
 client.collectDefaultMetrics({register});
 const monitor = async function () {
```
Evidence
PR Compliance ID 9 requires new features to be behind a feature flag and disabled by default. The
diff shows the new metrics are registered in src/node/prometheus.ts and emitted/recorded from
PadMessageHandler with no flag checks or configuration gate, meaning they are enabled by default.

src/node/prometheus.ts[26-48]
src/node/handler/PadMessageHandler.ts[606-610]
Best Practice: Repository guidelines

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
New Prometheus metrics are always enabled, but compliance requires new features to be behind a feature flag and disabled by default.
## Issue Context
This PR registers three new metrics and emits them on the hot path and in the Prometheus monitor loop without any enable/disable mechanism.
## Fix Focus Areas
- src/node/prometheus.ts[26-48]
- src/node/handler/PadMessageHandler.ts[50-50]
- src/node/handler/PadMessageHandler.ts[609-609]
- src/node/handler/PadMessageHandler.ts[630-630]
- src/node/handler/PadMessageHandler.ts[671-671]
- src/node/handler/PadMessageHandler.ts[791-791]
- src/node/handler/PadMessageHandler.ts[904-904]
- src/node/handler/PadMessageHandler.ts[957-957]



2. Histogram includes fan-out ✓ Resolved 🐞 Bug ≡ Correctness
Description
handleUserChanges() starts etherpad_changeset_apply_duration_seconds before any processing and only
stops it in finally, after updatePadClients() completes, so the histogram includes broadcast/fan-out
time rather than isolating the apply path. This prevents the scaling-dive analysis from
distinguishing “apply is slow” vs “fan-out is slow”.
Code

src/node/handler/PadMessageHandler.ts[791]

```diff
+  const stopHistogram = recordChangesetApply();
```
Evidence
The timer is started at the top of handleUserChanges(), then updatePadClients() is awaited, and only
afterwards is stopHistogram() called in finally, so the histogram necessarily includes fan-out time.

src/node/handler/PadMessageHandler.ts[789-905]
src/node/prom-instruments.ts[11-15]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`etherpad_changeset_apply_duration_seconds` is intended to time the *apply path*, but the timer currently spans the entire `handleUserChanges()` including `await exports.updatePadClients(pad)` (fan-out). This makes the metric misleading and defeats the PR’s stated purpose.
## Issue Context
- The timer is started near the beginning of `handleUserChanges()` and stopped in `finally`.
- `updatePadClients()` is awaited before the timer is stopped, so its work is included in the duration.
## Fix Focus Areas
- src/node/handler/PadMessageHandler.ts[788-905]
## Implementation notes
- Move `recordChangesetApply()` start to immediately before the “apply” work you want to measure (e.g., just before `pad.appendRevision(...)`).
- Call the returned `stopHistogram()` immediately after the apply work completes and **before** any socket emits / `updatePadClients()` fan-out (sketched below).
- If you also want total end-to-end latency, keep using the existing `stats.timer('edits')` (or add a second histogram explicitly for fan-out/total).
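
Concretely, the scoping these notes describe might look like this — a sketch with illustrative apply-path statements, not the literal handleUserChanges() body:

```typescript
// Time only the apply path; stop before any fan-out.
const stopHistogram = recordChangesetApply();
await pad.appendRevision(rebasedChangeset, thisSession.author); // apply work (illustrative)
stopHistogram(); // only reached on success; excludes ACCEPT_COMMIT + updatePadClients

// Fan-out happens outside the histogram window.
await exports.updatePadClients(pad);
```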




Remediation recommended

3. Unbounded metrics label ✓ Resolved 🐞 Bug ☼ Reliability
Description
handleCustomMessage() increments etherpad_socket_emits_total with msgString (an HTTP API parameter)
as the label value, allowing an API caller to generate unbounded distinct label values and
potentially exhaust memory/CPU via high-cardinality time series. Although the endpoint is API-key
protected, this still creates a sharp footgun if the key is leaked or a client misbehaves.
Code

src/node/handler/PadMessageHandler.ts[630]

```diff
+  recordSocketEmit(msg.data.type);
```
Evidence
The HTTP API passes an arbitrary msg string into handleCustomMessage(), which assigns it to
data.type and uses it as the type label for the Prometheus counter; prom-client counters create
distinct time series per label value.

src/node/db/API.ts[863-866]
src/node/handler/PadMessageHandler.ts[614-631]
src/node/prom-instruments.ts[17-21]
src/node/prom-instruments.ts[34-36]
src/node/handler/APIHandler.ts[180-185]

Agent prompt
The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
`etherpad_socket_emits_total{type=...}` uses the message type as a Prometheus label. In `handleCustomMessage()`, that type comes directly from the HTTP API `msg` parameter, which means label cardinality is unbounded.
## Issue Context
- `sendClientsMessage(padID, msg)` passes `msg` directly into `PadMessageHandler.handleCustomMessage(padID, msg)`.
- `handleCustomMessage()` sets `data.type = msgString` and then calls `recordSocketEmit(msg.data.type)`.
- `recordSocketEmit()` uses `socketEmitsTotal.labels(type).inc()`, creating a new time series per distinct `type`.
## Fix Focus Areas
- src/node/handler/PadMessageHandler.ts[620-631]
- src/node/db/API.ts[863-866]
- src/node/prom-instruments.ts[17-21]
- src/node/prom-instruments.ts[34-36]
## Implementation notes
Pick one of:
1) **Normalize/bucket**: Map any unknown/untrusted type to a small bounded set (e.g., `API_CUSTOM_MESSAGE`, `CUSTOM`, `other`, `unknown`).
2) **Allowlist**: Only allow a fixed set of label values (e.g., `NEW_CHANGES`, `CHAT_MESSAGE`, etc.); everything else becomes `other`.
3) **Remove label at this site**: For `handleCustomMessage()`, increment with a constant label regardless of `msgString`.
Also consider enforcing a max length/character set if you keep any user-provided label value. (Option 2 is sketched below.)
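
For instance, option 2 (allowlist) could be sketched like this — the allowlist contents are assumptions; the merged follow-up commit below describes the actual bucketing:

```typescript
// Bound label cardinality: unknown or user-supplied types land in 'other'.
const KNOWN_TYPES = new Set([
  'NEW_CHANGES', 'ACCEPT_COMMIT', 'CHAT_MESSAGE', 'CUSTOM', 'COLLABROOM', // assumed set
]);

export const recordSocketEmit = (type?: string): void => {
  if (type == null) {
    socketEmitsTotal.labels('unknown').inc();
    return;
  }
  socketEmitsTotal.labels(KNOWN_TYPES.has(type) ? type : 'other').inc();
};
```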





JohnMcLear added a commit to ether/etherpad-load-test that referenced this pull request May 15, 2026
…lly emits (#99)

Confirmed from the first real dive run (25934713423): core's
/stats/prometheus uses prom-client's collectDefaultMetrics output
(process_cpu_user_seconds_total, nodejs_eventloop_lag_p95_seconds,
process_resident_memory_bytes, ...) — not the nodejs_cpu_gauge /
nodejs_eventloop_latency_gauge names that src/node/metrics.ts
defines but never registers. The Scraper's default allowlist was
filtering EVERYTHING out, so all dive reports had empty cpu_user /
evloop_p95_ms / rss_mb columns.

Two changes:

1. Update DEFAULT_KEEP prefixes to match real prom-client names.
   Includes 'etherpad_' as a single prefix that covers all current
   and future etherpad_ rows (including the three added in
   ether/etherpad#7762).

2. Update Reporter CSV column mapping to read process_cpu_user_seconds_total,
   nodejs_eventloop_lag_p95_seconds, and process_resident_memory_bytes
   (converting seconds -> ms and bytes -> MB as before).

CSV column names stay stable; only the underlying lookup keys change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… label cardinality

Three issues raised on the initial PR:

1. **Feature flag.** Per project compliance rule, new features must
   be behind a flag and disabled by default. Adds
   `settings.scalingDiveMetrics` (default `false`). When off,
   recordSocketEmit() / recordChangesetApply() short-circuit to
   no-ops and the metrics are never even registered with the
   Prometheus register. Enable only when running the
   ether/etherpad-load-test scaling-dive harness.

2. **Histogram scope.** Previously the
   etherpad_changeset_apply_duration_seconds timer wrapped the
   whole handleUserChanges() body — including
   `await exports.updatePadClients(pad)` — so the histogram
   measured apply+fan-out, defeating its stated purpose. Now
   stopped immediately after the apply work (`assert.equal(...rev,
   r)`), before the ACCEPT_COMMIT socket emit and the
   updatePadClients call. Failed applies deliberately don't observe
   so the success-path distribution stays clean.

3. **Label cardinality.** handleCustomMessage was passing the
   user-supplied msgString (an HTTP-API param) directly as the
   `type` label value. A misbehaving API caller could grow
   prom-client's internal label map until OOM. Now bucketed against
   a known-types allowlist; anything outside it lands in `other`.

Tests updated: 5/5 — covers happy path, "other" bucketing of
unknown/unsafe labels, and that the flag-disabled state is a true
no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
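
A minimal sketch of the flag gate described in point 1 of that commit, reusing the names from the sketches above and assuming a boolean `settings.scalingDiveMetrics` exposed by Etherpad's Settings module (import path assumed):

```typescript
import settings from './utils/Settings'; // assumed path from src/node/prom-instruments.ts

export const recordSocketEmit = (type?: string): void => {
  if (!settings.scalingDiveMetrics) return; // flag off (default): true no-op
  const label = type == null ? 'unknown' : (KNOWN_TYPES.has(type) ? type : 'other');
  socketEmitsTotal.labels(label).inc();
};

export const recordChangesetApply = (): (() => void) => {
  if (!settings.scalingDiveMetrics) return () => {}; // no-op stopper when disabled
  return changesetApplyDuration.startTimer();
};
```
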
JohnMcLear merged commit 986f139 into develop on May 15, 2026
31 checks passed
JohnMcLear deleted the feat/scaling-dive-counters branch on May 15, 2026 at 18:54
JohnMcLear added a commit that referenced this pull request May 16, 2026
Two findings from rigorous N=3 scoring:

1. Lever 3 (#7768) is NOT a perf win. When you compare like-for-like
   matrix entries (develop-baseline vs PR-baseline), the
   per-socket serialization is slightly net-negative across the
   curve. My earlier "70% drop" was a single-run outlier; the
   subsequent "tighter envelope" was a cross-matrix-entry
   comparison confounded by noise. The serialization is still a
   real correctness fix (race on concurrent fan-outs + lost
   revisions on emit error) so the PR stays open, but the
   recommendation is now correctness-only.

2. Lever 8b (#7774) — engine.io flush deferral. The follow-up to
   the closed lever 8 that actually patches Socket.sendPacket
   instead of just transport.send. queueMicrotask-coalesced flush
   gives the transport multi-packet batches to work with at last.
   N=3 shows tighter tail at step 300-350 (122 → 110 max at 350,
   71 → 58 max at 300). Not a cliff-mover. The only PR in this
   program with N=3-confirmed perf benefit.

Final disposition:
- Merge: #7774 (modest perf), #7768 (correctness), #7762 (already
  merged, instruments).
- The cliff at 350-400 authors is hardware-bound on a 4-vCPU
  runner, not code-bound. Production with more cores per host
  scales proportionally with no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>