test(ci): heartbeat + running-test pointer to debug the silent backend-test ELIFECYCLE by JohnMcLear · Pull Request #7838 · ether/etherpad

JohnMcLear · 2026-05-25T11:05:18Z

Summary

Track the currently-running test (beforeEach), not just the last-finished one (afterEach), so a kill mid-test or in setup attributes correctly.
Add a 1Hz heartbeat with running-test name, RSS / heap, and active-handle / request counts. unref()'d so it never holds the event loop open.

Why

Backend tests silently die with exit code 255 mid-suite ~22% of the time on develop — most often Windows with Plugins (24), occasionally Linux too. Each kill lands ~300 ms after the previous test's clean ✔ teardown line and emits no failing-test marker, no error, no Mocha summary. The handlers in diagnostics.ts (PR #7663) and common.ts don't fire either: only [diag +0ms] diagnostics loaded lands before the silent ELIFECYCLE.

Recent reproduction: run 26311025244. Both attempts crashed at completely different "last test" locations (messages.ts > identity changeset is accepted in attempt 1; clientvar_rev_consistency.ts > CLIENT_VARS stays consistent under concurrent edits during handshake (delay race) in attempt 2), so the dying test itself isn't to blame.

The current lastSeenTest pointer is only updated in afterEach, so if the kill lands during the NEXT test's setup or body — which is exactly the ~300 ms gap we observe — the pointer points at the previous passing test, not the one being killed. The new running pointer + heartbeat narrow the kill window to ≤1 s and expose handle/memory behavior leading up to the kill.

What this changes

Pure diagnostic. No behavior change on successful runs.

mochaHooks.beforeEach now sets currentTest.
mochaHooks.afterEach continues to update lastFinishedTest and clears currentTest.
Every diag line carries both pointers.
New 1Hz heartbeat via writeSync(2, ...) + process.memoryUsage() + _getActiveHandles() / _getActiveRequests().

Cost: ~60-120 extra log lines per CI run.

Locally verified

Ran mocha against a 3.5 s probe spec:

[diag +0ms] diagnostics loaded
[diag +1002ms] hb running="heartbeat probe runs for 3.5s" lastFinished="<no test finished yet>" rss=109M heap=12M handles=2 requests=0
[diag +2002ms] hb running="heartbeat probe runs for 3.5s" lastFinished="<no test finished yet>" rss=109M heap=12M handles=2 requests=0
[diag +3003ms] hb running="heartbeat probe runs for 3.5s" lastFinished="<no test finished yet>" rss=109M heap=12M handles=2 requests=0
[diag +3565ms] beforeExit code=0 ... lastFinished="heartbeat probe runs for 3.5s"
[diag +3565ms] exit code=0 ... lastFinished="heartbeat probe runs for 3.5s"

Test plan

Existing backend-test matrix passes (Linux ± plugins, Windows without plugins).
Windows-with-plugins run produces visible heartbeat lines.
On the next reproduction of the silent flake, the log contains a final heartbeat naming the running test + handle count immediately before ELIFECYCLE.

🤖 Generated with Claude Code

Backend tests silently die with code 255 mid-suite ~22% of the time on develop (most often Windows-with-plugins, Node 24). Each kill lands 300±50 ms after the previous test's clean ✔ teardown line and produces no failing-test marker, no error, no Mocha summary, and — despite the unconditional handlers in `diagnostics.ts` — none of the JS-level death events fire either. Recent example: run 26311025244 (`Windows with Plugins (24)`); both attempts crashed at completely different "last test" locations, so the dying test itself isn't to blame. The existing diagnostics only set lastSeenTest in afterEach, so if the kill lands during the NEXT test's setup or body — which is exactly the ~300ms gap we observe — the pointer reads as the previous (passing) test. That hides whether we're between tests or inside one, and which one. Two changes: 1. Track currentTest in beforeEach as well as lastFinishedTest in afterEach. Every diag line now carries both, so the death point is bracketable regardless of which lifecycle phase the kill interrupts. 2. Add a 1Hz heartbeat that writeSyncs the running-test name plus `process.memoryUsage()` (rss, heap) and the active-handle and active-request counts. The interval is unref'd so it never holds the event loop open by itself. Cost is roughly one extra log line per second of mocha runtime (~60-120 lines per CI run). When the next failure fires, the last heartbeat narrows the kill window to ≤1s, the running pointer names the test on the rails at that moment, and the handle/memory trace gives a sparkline that exposes sudden spikes — a leaked socket, an unref'd timer, a runaway map — that would otherwise be invisible at the runner-log level. No behavior change on successful runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qodo-code-review · 2026-05-25T11:05:22Z

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

qodo-free-for-open-source-projects · 2026-05-25T11:05:37Z

Review Summary by Qodo

Add heartbeat and running-test tracking to backend test diagnostics

🧪 Tests ✨ Enhancement

Walkthroughs

Description

• Track currently-running test in beforeEach, not just last-finished in afterEach
• Add 1Hz heartbeat logging memory usage and active handles/requests
• Narrow silent backend-test kill window from ~300ms to ≤1s
• Expose handle/memory leaks invisible at runner-log level

Diagram

flowchart LR
  A["Test Lifecycle"] -->|beforeEach| B["Set currentTest"]
  B -->|test runs| C["1Hz Heartbeat"]
  C -->|logs| D["Memory + Handles + Requests"]
  A -->|afterEach| E["Set lastFinishedTest"]
  E -->|clear| F["currentTest"]
  G["Process Death"] -->|kill signal| H["Last Heartbeat"]
  H -->|narrows window| I["≤1s bracket"]

File Changes

1. src/tests/backend/diagnostics.ts ✨ Enhancement +51/-12

Add heartbeat and dual test-pointer tracking

• Split lastSeenTest into currentTest (set in beforeEach) and lastFinishedTest (set in
 afterEach)
• Added 1Hz heartbeat via setInterval that logs running test name, RSS/heap memory usage, and
 active handle/request counts
• Updated all diagnostic log lines to include both running and lastFinished pointers instead of
 single lastTest
• Heartbeat interval is unref()'d to prevent holding event loop open
• Enhanced comments explaining the silent flake diagnosis and new tracking strategy

src/tests/backend/diagnostics.ts

qodo-free-for-open-source-projects · 2026-05-25T11:05:38Z

Code Review by Qodo

🐞 Bugs (0) 📘 Rule violations (1) 📎 Requirement gaps (0)

1. Heartbeat logging lacks feature flag 📘 Rule violation ⚙ Maintainability

Description

The new 1Hz heartbeat logging runs unconditionally whenever diagnostics.ts is loaded, adding
repeated stderr output and potential overhead with no enable/disable mechanism. This introduces new
behavior without a feature flag and is not disabled by default.

Code

src/tests/backend/diagnostics.ts[R68-81]

Evidence

PR Compliance ID 5 requires new functionality to be behind a feature flag and disabled by default.
The code introduces a heartbeat via setInterval that logs every second, and it is started
unconditionally without any flag/guard.

src/tests/backend/diagnostics.ts[68-81]
Best Practice: Repository guidelines

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
A new heartbeat feature (`setInterval` that logs memory/handles every second) is always enabled when `src/tests/backend/diagnostics.ts` is required. The compliance checklist requires new features to be guarded behind a feature flag and disabled by default.
## Issue Context
The heartbeat currently starts unconditionally and writes periodic diagnostic lines to stderr. Add an explicit opt-in flag (for example an environment variable) so default behavior remains unchanged unless enabled.
## Fix Focus Areas
- src/tests/backend/diagnostics.ts[68-81]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

2. ~~Heartbeat optional-chain TypeError~~ ✓ Resolved 🐞 Bug ☼ Reliability

Description

The heartbeat computes handle/request counts via (process as any)._getActiveHandles?.().length and
_getActiveRequests?.().length, but ?.() only guards the call — if the method is missing it
returns undefined and the subsequent .length access throws a TypeError. Because diagnostics.ts
is loaded via mocha --require in the main backend test script, this would take down the whole test
run on Node builds where those private APIs are absent/changed.

Code

src/tests/backend/diagnostics.ts[R74-75]

Evidence

The heartbeat interval reads _getActiveHandles/_getActiveRequests with ?.() but immediately
dereferences .length on the result, which will throw if the function is undefined; the backend
test script always --requires this file, so such a throw would affect all test runs.

src/tests/backend/diagnostics.ts[68-81]
src/package.json[150-153]

Agent prompt

The issue below was found during a code review. Follow the provided context and guidance below and implement a solution

## Issue description
The heartbeat uses optional chaining only on the private API call (`?.()`), but not on the returned value before accessing `.length`, so the fallback `?? -1` is never reached if the API is missing.
## Issue Context
This file is always loaded for backend tests via `--require ./tests/backend/diagnostics.ts`, so an exception inside the heartbeat interval can crash the test process.
## Fix
Change both expressions to optionally chain the returned value as well (or otherwise guard the length access).
Example:

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

qodo-free-for-open-source-projects · 2026-05-25T11:07:40Z

+// Heartbeat. unref()'d so it never holds the event loop open by itself —
+// it only fires if mocha is otherwise alive. The interval cadence (1Hz) is
+// the trade-off between log noise (~60-120 extra lines per run) and how
+// tightly we can bracket the kill timestamp.
+const heartbeat = setInterval(() => {
+  const mem = process.memoryUsage();
+  const handles = (process as any)._getActiveHandles?.().length ?? -1;
+  const requests = (process as any)._getActiveRequests?.().length ?? -1;
+  diag(`hb running="${currentTest}" lastFinished="${lastFinishedTest}" ` +
+    `rss=${Math.round(mem.rss / 1024 / 1024)}M ` +
+    `heap=${Math.round(mem.heapUsed / 1024 / 1024)}M ` +
+    `handles=${handles} requests=${requests}`);
+}, 1000);
+heartbeat.unref();


1. Heartbeat logging lacks feature flag 📘 Rule violation ⚙ Maintainability

The new 1Hz heartbeat logging runs unconditionally whenever diagnostics.ts is loaded, adding repeated stderr output and potential overhead with no enable/disable mechanism. This introduces new behavior without a feature flag and is not disabled by default.

Agent Prompt

## Issue description A new heartbeat feature (`setInterval` that logs memory/handles every second) is always enabled when `src/tests/backend/diagnostics.ts` is required. The compliance checklist requires new features to be guarded behind a feature flag and disabled by default. ## Issue Context The heartbeat currently starts unconditionally and writes periodic diagnostic lines to stderr. Add an explicit opt-in flag (for example an environment variable) so default behavior remains unchanged unless enabled. ## Fix Focus Areas - src/tests/backend/diagnostics.ts[68-81]

ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools

Concern #2 (the optional-chain TypeError) is fixed in 9b79a76 — capture the array first, read .length only when it exists, keep -1 as the "API missing" sentinel. Good catch.

Concern #1 (feature flag): respectfully disagreeing. diagnostics.ts is a temporary debug-only bootstrap (top-of-file comment: "Drop this file once the flake's root cause is identified and fixed.") wired in via mocha --require in the backend test script. The whole file is already opt-in at the test-runner layer — adding an internal flag would just multiply the existing toggle, and defaulting it off would silence the diagnostic in CI which is the one environment where the silent flake reliably reproduces. The repo's pattern for diagnostic instrumentation (see PR #7663, which added the unhandled-rejection logger with no flag) is to land it unguarded and remove the file once the kill is rooted. Keeping that pattern here.

Qodo correctly flagged `_getActiveHandles?.().length` as a latent TypeError: `?.()` guards the call but the call's `undefined` return on a missing method still hits `.length`, which throws. Since the heartbeat fires on a setInterval inside the mocha bootstrap, a Node build without the underscore-prefixed internals would take down the whole backend test run. Capture the array first, then read `.length` only when it actually exists. -1 stays as the "API missing" sentinel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follow-up to the heartbeat PR after run 26397693748 confirmed the diagnostic works (the kill landed at importexportGetPost.ts 'Import authorization checks > authn anonymous !exist -> fail', ~300 ms after the previous test's ✔). Two cleanups so the next failure pinpoints faster and reads cleaner: 1. diagnostics.ts: emit a `test start: <name>` diag line in the mocha beforeEach hook, after setting the currentTest pointer. The 1Hz heartbeat misses tests that take less than a second, and the silent kills land ~300 ms after a test boundary — precisely the gap where heartbeat resolution fails. A start line per test gives sub-millisecond resolution on which test was on the rails when the process died. 2. specs/api/importexportGetPost.ts: drop a stray `console.log(importedPads)` debug leftover (and the duplicate `await importEtherpad(records)` only present to feed it) in the `malformed .etherpad files are rejected` block. The leftover dumped a ~600-line reflection of a supertest Response object to the CI log on every successful run, drowning the surrounding test output and making the silent-kill window much harder to read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Run 26398054688 narrowed the kill to a specific test (pad.ts > Gets text on a pad Id and doesn't have an excess newline) but the test body is a trivial supertest GET — the kill bypasses all JS handlers, so we can't capture stack state at death. Two failures across two runs share the shape: an agent.{get,post} + common.generateJWTToken() call dies ~300-600 ms after test start, with no JS-visible cause. The next step is V8 + native stack. Hook into the existing 1Hz heartbeat to call process.report.writeReport(path) whenever a report directory is set. The Windows backend-tests workflow already wires up `--report-directory=${{ github.workspace }}/node-report` via NODE_OPTIONS and uploads that directory as an artifact on failure, so the rolling snapshots ride for free on the existing upload step. Each report (~50 KB) contains: - V8 + native call stacks for all threads - libuv active handles (open TCP, timers, file handles) - JS heap statistics - resourceUsage + system info - shared-object list On the next reproduction the latest report before ELIFECYCLE will sit ~0-1 s before the kill — enough to see whether the V8 stack is inside jose's WebCrypto sign path, inside supertest's TCP roundtrip, or somewhere unexpected entirely. NODE_REPORT_DIR is also honored as an explicit override for local repro / non-workflow runs. Cost: ~6 files (~300 KB) per Windows backend-test failure, plus ~50 ms event-loop pause per heartbeat. No-op when neither env var is set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Run 26398830249 exposed the path-separator bug in the previous commit: every heartbeat tick on the Windows runner logged Failed to open Node.js report file: D:\a\etherpad\etherpad/node-report/hb-NNNN-...json directory: D:\a\etherpad\etherpad/node-report (errno: 22) — EINVAL. The workflow sets --report-directory with forward-slash separators on Windows, then this code concatenated another `/` plus the filename, producing a path Node's report writer rejects. writeReport(fileName) takes a BARE filename and resolves it against the configured report directory using the platform-correct separator internally. Switch to that. For local repro overrides via NODE_REPORT_DIR, push the path into process.report.directory (the documented config knob) instead of joining it into the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Run 26398985832 proved the heartbeat-only report cadence isn't tight enough: the last report before the kill was hb-0013 at +16201ms, ~1.5 s before ELIFECYCLE at +17701ms — during which ~30 tests fired, including the dying one (`authn anonymous !exist -> fail`). The captured V8 stack is just our heartbeat code, not the dying test. Move the writeReport call to a shared tryWriteReport() helper and invoke it from BOTH the heartbeat AND mocha's beforeEach hook, throttled to one report per 250 ms. That gives ≤250 ms resolution on the kill window — close enough that the latest report captures state from inside the dying test rather than from the test ~30 slots earlier. The heartbeat always writes (so we don't lose the no-test-running ticks during setup); beforeEach only writes when the throttle window has elapsed. Cost ceiling: ~4 reports/sec × ~12 s test phase ≈ 48 reports (~2.5 MB) per failing run. Each writeReport adds ~50 ms of event-loop pause — at 4Hz that's 20% of wall time spent in diagnostics, which is acceptable for a temporary debug-only bootstrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Run 26399285213's rerun captured a sixth death point on the new 4Hz cadence (`socketio.ts > Duplicate-author handling > cookie identity: same-author second socket kicks the first`, kill at +45953ms, 271ms after test start). The throttle suppressed the dying test's own beforeEach: previous boundary write landed 128 ms earlier and the next 31 ms after that, both inside the 250 ms window. Last captured report (be-0100) is from the previous test. 100 ms is still well above the inter-test cadence in fast burst suites (tests fire 2-5 ms apart, so 20-50 of them get throttled to a single write, ceiling ~10 writes/sec). But it's tight enough that any death-window neighbour ≥100 ms after the previous report — the shape we keep observing — gets its own boundary snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

qodo-code-review Bot deleted a comment from qodo-free-for-open-source-projects Bot May 25, 2026

qodo-free-for-open-source-projects Bot reviewed May 25, 2026

View reviewed changes

qodo-code-review Bot deleted a comment from qodo-free-for-open-source-projects Bot May 25, 2026

JohnMcLear merged commit 98dbba4 into develop May 25, 2026
40 checks passed

JohnMcLear deleted the test-ci-diagnostics-heartbeat-currenttest branch May 25, 2026 12:50

This was referenced May 25, 2026

feat(settings): default settings.enablePluginPadOptions to true #7841

Merged

test(ci): schedule a mid-test snapshot 150ms after every beforeEach #7842

Merged

test(ci): drain event loop with setImmediate after every test (mitigation hypothesis) #7844

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test(ci): heartbeat + running-test pointer to debug the silent backend-test ELIFECYCLE#7838

test(ci): heartbeat + running-test pointer to debug the silent backend-test ELIFECYCLE#7838
JohnMcLear merged 7 commits into
developfrom
test-ci-diagnostics-heartbeat-currenttest

JohnMcLear commented May 25, 2026

Uh oh!

qodo-code-review Bot commented May 25, 2026

Uh oh!

qodo-free-for-open-source-projects Bot commented May 25, 2026

Uh oh!

qodo-free-for-open-source-projects Bot commented May 25, 2026 •

edited

Loading

Uh oh!

qodo-free-for-open-source-projects Bot May 25, 2026

Uh oh!

JohnMcLear May 25, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

JohnMcLear commented May 25, 2026

Summary

Why

What this changes

Locally verified

Test plan

Uh oh!

qodo-code-review Bot commented May 25, 2026

Qodo reviews are paused for this user.

Uh oh!

qodo-free-for-open-source-projects Bot commented May 25, 2026

Review Summary by Qodo

Walkthroughs

File Changes

Uh oh!

qodo-free-for-open-source-projects Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review by Qodo

Uh oh!

qodo-free-for-open-source-projects Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

JohnMcLear May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

qodo-free-for-open-source-projects Bot commented May 25, 2026 •

edited

Loading