test(ci): heartbeat + running-test pointer to debug the silent backend-test ELIFECYCLE#7838
Conversation
Backend tests silently die with code 255 mid-suite ~22% of the time on develop (most often Windows-with-plugins, Node 24). Each kill lands 300±50 ms after the previous test's clean ✔ teardown line and produces no failing-test marker, no error, no Mocha summary, and — despite the unconditional handlers in `diagnostics.ts` — none of the JS-level death events fire either. Recent example: run 26311025244 (`Windows with Plugins (24)`); both attempts crashed at completely different "last test" locations, so the dying test itself isn't to blame. The existing diagnostics only set lastSeenTest in afterEach, so if the kill lands during the NEXT test's setup or body — which is exactly the ~300ms gap we observe — the pointer reads as the previous (passing) test. That hides whether we're between tests or inside one, and which one. Two changes: 1. Track currentTest in beforeEach as well as lastFinishedTest in afterEach. Every diag line now carries both, so the death point is bracketable regardless of which lifecycle phase the kill interrupts. 2. Add a 1Hz heartbeat that writeSyncs the running-test name plus `process.memoryUsage()` (rss, heap) and the active-handle and active-request counts. The interval is unref'd so it never holds the event loop open by itself. Cost is roughly one extra log line per second of mocha runtime (~60-120 lines per CI run). When the next failure fires, the last heartbeat narrows the kill window to ≤1s, the running pointer names the test on the rails at that moment, and the handle/memory trace gives a sparkline that exposes sudden spikes — a leaked socket, an unref'd timer, a runaway map — that would otherwise be invisible at the runner-log level. No behavior change on successful runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
Review Summary by QodoAdd heartbeat and running-test tracking to backend test diagnostics
WalkthroughsDescription• Track currently-running test in beforeEach, not just last-finished in afterEach • Add 1Hz heartbeat logging memory usage and active handles/requests • Narrow silent backend-test kill window from ~300ms to ≤1s • Expose handle/memory leaks invisible at runner-log level Diagramflowchart LR
A["Test Lifecycle"] -->|beforeEach| B["Set currentTest"]
B -->|test runs| C["1Hz Heartbeat"]
C -->|logs| D["Memory + Handles + Requests"]
A -->|afterEach| E["Set lastFinishedTest"]
E -->|clear| F["currentTest"]
G["Process Death"] -->|kill signal| H["Last Heartbeat"]
H -->|narrows window| I["≤1s bracket"]
File Changes1. src/tests/backend/diagnostics.ts
|
Code Review by Qodo
1. Heartbeat logging lacks feature flag
|
| // Heartbeat. unref()'d so it never holds the event loop open by itself — | ||
| // it only fires if mocha is otherwise alive. The interval cadence (1Hz) is | ||
| // the trade-off between log noise (~60-120 extra lines per run) and how | ||
| // tightly we can bracket the kill timestamp. | ||
| const heartbeat = setInterval(() => { | ||
| const mem = process.memoryUsage(); | ||
| const handles = (process as any)._getActiveHandles?.().length ?? -1; | ||
| const requests = (process as any)._getActiveRequests?.().length ?? -1; | ||
| diag(`hb running="${currentTest}" lastFinished="${lastFinishedTest}" ` + | ||
| `rss=${Math.round(mem.rss / 1024 / 1024)}M ` + | ||
| `heap=${Math.round(mem.heapUsed / 1024 / 1024)}M ` + | ||
| `handles=${handles} requests=${requests}`); | ||
| }, 1000); | ||
| heartbeat.unref(); |
There was a problem hiding this comment.
1. Heartbeat logging lacks feature flag 📘 Rule violation ⚙ Maintainability
The new 1Hz heartbeat logging runs unconditionally whenever diagnostics.ts is loaded, adding repeated stderr output and potential overhead with no enable/disable mechanism. This introduces new behavior without a feature flag and is not disabled by default.
Agent Prompt
## Issue description
A new heartbeat feature (`setInterval` that logs memory/handles every second) is always enabled when `src/tests/backend/diagnostics.ts` is required. The compliance checklist requires new features to be guarded behind a feature flag and disabled by default.
## Issue Context
The heartbeat currently starts unconditionally and writes periodic diagnostic lines to stderr. Add an explicit opt-in flag (for example an environment variable) so default behavior remains unchanged unless enabled.
## Fix Focus Areas
- src/tests/backend/diagnostics.ts[68-81]
ⓘ Copy this prompt and use it to remediate the issue with your preferred AI generation tools
There was a problem hiding this comment.
Concern #2 (the optional-chain TypeError) is fixed in 9b79a76 — capture the array first, read .length only when it exists, keep -1 as the "API missing" sentinel. Good catch.
Concern #1 (feature flag): respectfully disagreeing. diagnostics.ts is a temporary debug-only bootstrap (top-of-file comment: "Drop this file once the flake's root cause is identified and fixed.") wired in via mocha --require in the backend test script. The whole file is already opt-in at the test-runner layer — adding an internal flag would just multiply the existing toggle, and defaulting it off would silence the diagnostic in CI which is the one environment where the silent flake reliably reproduces. The repo's pattern for diagnostic instrumentation (see PR #7663, which added the unhandled-rejection logger with no flag) is to land it unguarded and remove the file once the kill is rooted. Keeping that pattern here.
Qodo correctly flagged `_getActiveHandles?.().length` as a latent TypeError: `?.()` guards the call but the call's `undefined` return on a missing method still hits `.length`, which throws. Since the heartbeat fires on a setInterval inside the mocha bootstrap, a Node build without the underscore-prefixed internals would take down the whole backend test run. Capture the array first, then read `.length` only when it actually exists. -1 stays as the "API missing" sentinel. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the heartbeat PR after run 26397693748 confirmed the diagnostic works (the kill landed at importexportGetPost.ts 'Import authorization checks > authn anonymous !exist -> fail', ~300 ms after the previous test's ✔). Two cleanups so the next failure pinpoints faster and reads cleaner: 1. diagnostics.ts: emit a `test start: <name>` diag line in the mocha beforeEach hook, after setting the currentTest pointer. The 1Hz heartbeat misses tests that take less than a second, and the silent kills land ~300 ms after a test boundary — precisely the gap where heartbeat resolution fails. A start line per test gives sub-millisecond resolution on which test was on the rails when the process died. 2. specs/api/importexportGetPost.ts: drop a stray `console.log(importedPads)` debug leftover (and the duplicate `await importEtherpad(records)` only present to feed it) in the `malformed .etherpad files are rejected` block. The leftover dumped a ~600-line reflection of a supertest Response object to the CI log on every successful run, drowning the surrounding test output and making the silent-kill window much harder to read. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 26398054688 narrowed the kill to a specific test
(pad.ts > Gets text on a pad Id and doesn't have an excess newline)
but the test body is a trivial supertest GET — the kill bypasses
all JS handlers, so we can't capture stack state at death.
Two failures across two runs share the shape: an agent.{get,post}
+ common.generateJWTToken() call dies ~300-600 ms after test start,
with no JS-visible cause. The next step is V8 + native stack.
Hook into the existing 1Hz heartbeat to call
process.report.writeReport(path) whenever a report directory is set.
The Windows backend-tests workflow already wires up
`--report-directory=${{ github.workspace }}/node-report` via
NODE_OPTIONS and uploads that directory as an artifact on failure,
so the rolling snapshots ride for free on the existing upload step.
Each report (~50 KB) contains:
- V8 + native call stacks for all threads
- libuv active handles (open TCP, timers, file handles)
- JS heap statistics
- resourceUsage + system info
- shared-object list
On the next reproduction the latest report before ELIFECYCLE will
sit ~0-1 s before the kill — enough to see whether the V8 stack
is inside jose's WebCrypto sign path, inside supertest's TCP
roundtrip, or somewhere unexpected entirely.
NODE_REPORT_DIR is also honored as an explicit override for local
repro / non-workflow runs.
Cost: ~6 files (~300 KB) per Windows backend-test failure, plus
~50 ms event-loop pause per heartbeat. No-op when neither env var
is set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 26398830249 exposed the path-separator bug in the previous commit: every heartbeat tick on the Windows runner logged Failed to open Node.js report file: D:\a\etherpad\etherpad/node-report/hb-NNNN-...json directory: D:\a\etherpad\etherpad/node-report (errno: 22) — EINVAL. The workflow sets --report-directory with forward-slash separators on Windows, then this code concatenated another `/` plus the filename, producing a path Node's report writer rejects. writeReport(fileName) takes a BARE filename and resolves it against the configured report directory using the platform-correct separator internally. Switch to that. For local repro overrides via NODE_REPORT_DIR, push the path into process.report.directory (the documented config knob) instead of joining it into the call site. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 26398985832 proved the heartbeat-only report cadence isn't tight enough: the last report before the kill was hb-0013 at +16201ms, ~1.5 s before ELIFECYCLE at +17701ms — during which ~30 tests fired, including the dying one (`authn anonymous !exist -> fail`). The captured V8 stack is just our heartbeat code, not the dying test. Move the writeReport call to a shared tryWriteReport() helper and invoke it from BOTH the heartbeat AND mocha's beforeEach hook, throttled to one report per 250 ms. That gives ≤250 ms resolution on the kill window — close enough that the latest report captures state from inside the dying test rather than from the test ~30 slots earlier. The heartbeat always writes (so we don't lose the no-test-running ticks during setup); beforeEach only writes when the throttle window has elapsed. Cost ceiling: ~4 reports/sec × ~12 s test phase ≈ 48 reports (~2.5 MB) per failing run. Each writeReport adds ~50 ms of event-loop pause — at 4Hz that's 20% of wall time spent in diagnostics, which is acceptable for a temporary debug-only bootstrap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Run 26399285213's rerun captured a sixth death point on the new 4Hz cadence (`socketio.ts > Duplicate-author handling > cookie identity: same-author second socket kicks the first`, kill at +45953ms, 271ms after test start). The throttle suppressed the dying test's own beforeEach: previous boundary write landed 128 ms earlier and the next 31 ms after that, both inside the 250 ms window. Last captured report (be-0100) is from the previous test. 100 ms is still well above the inter-test cadence in fast burst suites (tests fire 2-5 ms apart, so 20-50 of them get throttled to a single write, ceiling ~10 writes/sec). But it's tight enough that any death-window neighbour ≥100 ms after the previous report — the shape we keep observing — gets its own boundary snapshot. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Why
Backend tests silently die with exit code 255 mid-suite ~22% of the time on develop — most often
Windows with Plugins (24), occasionally Linux too. Each kill lands ~300 ms after the previous test's clean ✔ teardown line and emits no failing-test marker, no error, no Mocha summary. The handlers indiagnostics.ts(PR #7663) andcommon.tsdon't fire either: only[diag +0ms] diagnostics loadedlands before the silentELIFECYCLE.Recent reproduction: run 26311025244. Both attempts crashed at completely different "last test" locations (
messages.ts > identity changeset is acceptedin attempt 1;clientvar_rev_consistency.ts > CLIENT_VARS stays consistent under concurrent edits during handshake (delay race)in attempt 2), so the dying test itself isn't to blame.The current
lastSeenTestpointer is only updated inafterEach, so if the kill lands during the NEXT test's setup or body — which is exactly the ~300 ms gap we observe — the pointer points at the previous passing test, not the one being killed. The new running pointer + heartbeat narrow the kill window to ≤1 s and expose handle/memory behavior leading up to the kill.What this changes
Pure diagnostic. No behavior change on successful runs.
Cost: ~60-120 extra log lines per CI run.
Locally verified
Ran mocha against a 3.5 s probe spec:
Test plan
🤖 Generated with Claude Code