feat(metrics): 3 Prometheus counters for scaling dive (#7756) #7762
Conversation
Per section 6 of the spec in #7756: enables the load-test harness to attribute *where* time goes on the server, not just the gauge headline (CPU / event-loop / memory) the dive doc starts from.

New /stats/prometheus rows:

- etherpad_pad_users{padId} — gauge, derived from sessioninfos on each scrape. Lets the harness confirm the pad it points at actually has the expected concurrency.
- etherpad_changeset_apply_duration_seconds — histogram observed inside handleUserChanges. Separates "apply path is slow" from "fan-out is slow" when latency rises.
- etherpad_socket_emits_total{type} — counter at the broadcast emit sites (handleCustomObjectMessage, handleCustomMessage, sendChatMessageToPadClients) and inside the NEW_CHANGES per-socket loop in updatePadClients. Bucketed by message type so the harness can measure the amplification factor of each lever (especially the fan-out batching lever).

Metric handles live in a new prom-instruments.ts module rather than in prometheus.ts itself, so PadMessageHandler can import the recording helpers without creating a circular dependency (prometheus.ts already requires PadMessageHandler); see the sketch after this message.

Tests: smoke test verifies recordSocketEmit + recordChangesetApply move the underlying counters/histogram.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
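The commit doesn't show the module itself; here is a minimal sketch of the shape it describes, assuming prom-client. The helper names (recordSocketEmit, recordChangesetApply) and metric names come from the commit message; bucket boundaries, help strings, and the collect() body are illustrative.

```typescript
// src/node/prom-instruments.ts -- illustrative sketch, not the shipped file.
import {Counter, Gauge, Histogram, register} from 'prom-client';

const socketEmitsTotal = new Counter({
  name: 'etherpad_socket_emits_total',
  help: 'Broadcast socket emits, bucketed by message type',
  labelNames: ['type'],
  registers: [register],
});

const changesetApplyDuration = new Histogram({
  name: 'etherpad_changeset_apply_duration_seconds',
  help: 'Time spent applying a changeset inside handleUserChanges',
  buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1], // illustrative boundaries
  registers: [register],
});

// Gauge derived at scrape time; prom-client calls collect() on every scrape.
new Gauge({
  name: 'etherpad_pad_users',
  help: 'Concurrent users per pad, derived from sessioninfos',
  labelNames: ['padId'],
  registers: [register],
  collect() {
    // In the real module this would walk sessioninfos (via prometheus.ts's
    // getPadUsersMap, per the diagram) and call this.set({padId}, count).
  },
});

// Helpers PadMessageHandler imports; keeping them here rather than in
// prometheus.ts breaks the circular require described above.
export const recordSocketEmit = (type: string): void => {
  socketEmitsTotal.inc({type});
};

export const recordChangesetApply = (seconds: number): void => {
  changesetApplyDuration.observe(seconds);
};
```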
Review Summary by Qodo
Add Prometheus metrics for scaling dive load testing
Walkthrough
Description
• Add three Prometheus metrics for scaling dive load testing
- etherpad_pad_users{padId} gauge tracks concurrent users per pad
- etherpad_changeset_apply_duration_seconds histogram measures edit processing time
- etherpad_socket_emits_total{type} counter tracks broadcast messages by type
• Create new prom-instruments.ts module to avoid circular imports
• Instrument hot paths in PadMessageHandler to record metrics
• Add smoke tests verifying metric recording functionality
Diagram

```mermaid
flowchart LR
  PMH["PadMessageHandler<br/>hot path"] -->|import| PI["prom-instruments.ts<br/>metric handles"]
  PI -->|register| PROM["prometheus.ts<br/>central registry"]
  PMH -->|recordSocketEmit| SE["socketEmitsTotal<br/>counter"]
  PMH -->|recordChangesetApply| CAD["changesetApplyDuration<br/>histogram"]
  PROM -->|getPadUsersMap| PUG["padUsersGauge<br/>gauge"]
  PROM -->|expose| STATS["/stats/prometheus<br/>endpoint"]
```
File Changes
1. src/node/prom-instruments.ts
…lly emits (#99)

Confirmed from the first real dive run (25934713423): core's /stats/prometheus uses prom-client's collectDefaultMetrics output (process_cpu_user_seconds_total, nodejs_eventloop_lag_p95_seconds, process_resident_memory_bytes, ...) — not the nodejs_cpu_gauge / nodejs_eventloop_latency_gauge names that src/node/metrics.ts defines but never registers. The Scraper's default allowlist was filtering EVERYTHING out, so all dive reports had empty cpu_user / evloop_p95_ms / rss_mb columns.

Two changes (sketched below):

1. Update DEFAULT_KEEP prefixes to match real prom-client names. Includes 'etherpad_' as a single prefix that covers all current and future etherpad_ rows (including the three added in ether/etherpad#7762).
2. Update Reporter CSV column mapping to read process_cpu_user_seconds_total, nodejs_eventloop_lag_p95_seconds, and process_resident_memory_bytes (converting seconds -> ms and bytes -> MB as before). CSV column names stay stable; only the underlying lookup keys change.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
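The commit's diff isn't shown here; a minimal sketch of both changes, assuming the harness is TypeScript. The DEFAULT_KEEP entries are the exact names the commit lists; `keepLine` and the converter helpers are illustrative names, not the Scraper/Reporter's real API.

```typescript
// Illustrative sketch of the Scraper/Reporter changes; helper names assumed.
const DEFAULT_KEEP = [
  'etherpad_', // one prefix covers all current and future etherpad_ rows
  'process_cpu_user_seconds_total',
  'nodejs_eventloop_lag_p95_seconds',
  'process_resident_memory_bytes',
];

// Scraper: keep only the /stats/prometheus lines matching the allowlist.
const keepLine = (line: string): boolean =>
    DEFAULT_KEEP.some((prefix) => line.startsWith(prefix));

// Reporter: same CSV columns, new lookup keys, same unit conversions.
const toEvloopP95Ms = (seconds: number): number => seconds * 1000;
const toRssMb = (bytes: number): number => bytes / (1024 * 1024);
```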
… label cardinality

Three issues raised on the initial PR (a sketch of the fixes follows this list):

1. **Feature flag.** Per project compliance rule, new features must be behind a flag and disabled by default. Adds `settings.scalingDiveMetrics` (default `false`). When off, recordSocketEmit() / recordChangesetApply() short-circuit to no-ops and the metrics are never even registered with the Prometheus register. Enable only when running the ether/etherpad-load-test scaling-dive harness.
2. **Histogram scope.** Previously the etherpad_changeset_apply_duration_seconds timer wrapped the whole handleUserChanges() body — including `await exports.updatePadClients(pad)` — so the histogram measured apply+fan-out, defeating its stated purpose. Now stopped immediately after the apply work (`assert.equal(...rev, r)`), before the ACCEPT_COMMIT socket emit and the updatePadClients call. Failed applies deliberately don't observe, so the success-path distribution stays clean.
3. **Label cardinality.** handleCustomMessage was passing the user-supplied msgString (an HTTP-API param) directly as the `type` label value. A misbehaving API caller could grow prom-client's internal label map until OOM. Now bucketed against a known-types allowlist; anything outside it lands in `other`.

Tests updated: 5/5 — covers the happy path, "other" bucketing of unknown/unsafe labels, and that the flag-disabled state is a true no-op.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
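A hedged sketch of all three fixes. `settings.scalingDiveMetrics` is the flag the commit names; the KNOWN_TYPES contents, the `declare` shims, and the `applyWithTiming` wrapper are illustrative, not the PR's actual code.

```typescript
import {Counter, Histogram} from 'prom-client';

// Illustrative stand-ins for the real settings object and metric handles.
declare const settings: {scalingDiveMetrics: boolean};
declare const socketEmitsTotal: Counter;
declare const changesetApplyDuration: Histogram;

// Fix 3: allowlist the label values; the set's contents are illustrative.
const KNOWN_TYPES = new Set(['NEW_CHANGES', 'ACCEPT_COMMIT', 'CHAT_MESSAGE', 'CUSTOM']);

export const recordSocketEmit = (type: string): void => {
  if (!settings.scalingDiveMetrics) return; // fix 1: flag off => true no-op
  // A user-supplied msgString can no longer mint unbounded label values.
  socketEmitsTotal.inc({type: KNOWN_TYPES.has(type) ? type : 'other'});
};

// Fix 2: stop the timer right after the apply work, before the ACCEPT_COMMIT
// emit and updatePadClients, so the histogram measures apply only.
export const applyWithTiming = async (applyFn: () => Promise<void>): Promise<void> => {
  if (!settings.scalingDiveMetrics) { await applyFn(); return; }
  const endTimer = changesetApplyDuration.startTimer();
  await applyFn(); // if this throws we never observe, keeping the
  endTimer();      // success-path distribution clean
};
```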
Two findings from rigorous N=3 scoring:

1. Lever 3 (#7768) is NOT a perf win. When you compare like-for-like matrix entries (develop-baseline vs PR-baseline), the per-socket serialization is slightly net-negative across the curve. My earlier "70% drop" was a single-run outlier; the subsequent "tighter envelope" was a cross-matrix-entry comparison confounded by noise. The serialization is still a real correctness fix (a race on concurrent fan-outs plus lost revisions on emit error), so the PR stays open, but the recommendation is now correctness-only.
2. Lever 8b (#7774) — engine.io flush deferral. The follow-up to the closed lever 8 that actually patches Socket.sendPacket instead of just transport.send. A queueMicrotask-coalesced flush (see the generic sketch below) gives the transport multi-packet batches to work with at last. N=3 shows a tighter tail at steps 300-350 (122 → 110 max at 350, 71 → 58 max at 300). Not a cliff-mover, but the only PR in this program with an N=3-confirmed perf benefit.

Final disposition:

- Merge: #7774 (modest perf), #7768 (correctness), #7762 (already merged, instruments).
- The cliff at 350-400 authors is hardware-bound on a 4-vCPU runner, not code-bound. Production with more cores per host scales proportionally with no code changes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
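For reference, a generic sketch of the queueMicrotask-coalesced flush pattern lever 8b relies on; the real patch lands inside engine.io's Socket.sendPacket and is not reproduced here, so the class and its names are purely illustrative.

```typescript
// All packets enqueued in the same synchronous turn are handed to the
// transport as one batch on the next microtask, instead of one write each.
class CoalescingSender {
  private queue: string[] = [];
  private flushScheduled = false;

  constructor(private readonly transportSend: (packets: string[]) => void) {}

  send(packet: string): void {
    this.queue.push(packet);
    if (this.flushScheduled) return; // a flush is already pending
    this.flushScheduled = true;
    queueMicrotask(() => {
      this.flushScheduled = false;
      const batch = this.queue;
      this.queue = [];
      this.transportSend(batch); // multi-packet batch for the transport
    });
  }
}
```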
Summary
Adds three Prometheus rows to `/stats/prometheus` so the load-test harness for #7756 can attribute where time goes on the server, not just the gauge headline (CPU / event-loop / memory):

- `etherpad_pad_users{padId}` — gauge of concurrent users per pad, derived from `sessioninfos` on each scrape.
- `etherpad_changeset_apply_duration_seconds` — histogram observed inside `handleUserChanges`.
- `etherpad_socket_emits_total{type}` — counter at the broadcast emit sites, bucketed by message type.
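For orientation, rows of the shape these metrics produce in the Prometheus exposition format (all values and the padId label below are made up, not from a real run):

```text
etherpad_pad_users{padId="load-test-pad"} 120
etherpad_changeset_apply_duration_seconds_bucket{le="0.01"} 4021
etherpad_changeset_apply_duration_seconds_sum 18.7
etherpad_changeset_apply_duration_seconds_count 4100
etherpad_socket_emits_total{type="NEW_CHANGES"} 51234
etherpad_socket_emits_total{type="other"} 3
```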
Metric handles live in a new `src/node/prom-instruments.ts` rather than `prometheus.ts` itself, to avoid a circular import — `prometheus.ts` already `require`s `PadMessageHandler` to read `sessioninfos`.
This is the follow-up PR planned in section 6 of `docs/superpowers/specs/2026-05-15-scaling-dive-design.md` (in `ether/etherpad-load-test`). The load-test harness already tolerates these metrics' absence, so it ships unblocked; this PR just makes the next dive run more informative.
Test Plan
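A minimal sketch of the kind of smoke test the first commit describes ("recordSocketEmit + recordChangesetApply move the underlying counters/histogram"), assuming mocha, prom-client >= v13 (async metric.get()), and a hypothetical import path; this is not the PR's actual test file.

```typescript
// test/prom-instruments-smoke.ts -- illustrative sketch only.
import assert from 'assert';
import {register} from 'prom-client';
import {recordSocketEmit, recordChangesetApply} from '../node/prom-instruments';

describe('prom-instruments smoke', () => {
  it('moves the counter and the histogram', async () => {
    recordSocketEmit('NEW_CHANGES');
    recordChangesetApply(0.005); // seconds

    const counter = await register.getSingleMetric('etherpad_socket_emits_total')!.get();
    assert.ok(counter.values.some(
        (v) => v.labels.type === 'NEW_CHANGES' && v.value >= 1));

    const hist = await register.getSingleMetric('etherpad_changeset_apply_duration_seconds')!.get();
    assert.ok(hist.values.some(
        (v) => v.metricName === 'etherpad_changeset_apply_duration_seconds_count' && v.value >= 1));
  });
});
```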
🤖 Generated with Claude Code