Bind E2E server lifetime to vitest via kernel pipe-EOF #56

Merged
antoninbas merged 3 commits into main from e2e-cleanup-leak
Apr 20, 2026

@antoninbas antoninbas commented Apr 20, 2026

Summary

Every prior e2e run was leaving orphaned tsx src/main.ts server --port <random> processes on the host. I cleared 8 of them while deploying v0.12.0.

Two layers of fix:

  1. Spawn the server in its own process group (detached: true) so we can reliably tear the whole subtree down with process.kill(-pgid, sig) on clean teardown. Previously SIGTERM went to the npx wrapper and didn't propagate to the inner node child.

  2. Bind the server's lifetime to vitest via stdin-EOF, gated on KNOTES_E2E_WATCH_STDIN=1. The harness wires its own pipe to the server's stdin; the kernel closes that pipe's write end the moment the harness process dies (clean exit, SIGKILL, OOM), and the server reads EOF and exits.

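    This is the portable UNIX counterpart to Linux's prctl(PR_SET_PDEATHSIG): the binding is enforced by the kernel, not by a polling loop, and the watchdog lives in the leaf process itself, so there is no intermediary that could be killed and break the chain. It holds as long as no other process inherits the pipe's write end, since EOF is delivered only once every copy of that end is closed.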

    The watchdog is a ~10 line opt-in hook in src/cli/commands/server.ts; the production server never activates it.
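
The server-side hook described in item 2 could look roughly like the sketch below. The actual code in src/cli/commands/server.ts is not shown on this page, so the function name `watchStdinForParentDeath` and its injectable parameters are hypothetical; only the KNOTES_E2E_WATCH_STDIN=1 gate and the exit-on-EOF behaviour come from the description above.

```typescript
import { Readable } from "node:stream";

// Hypothetical helper: the name and signature are assumptions, not the PR's code.
// Returns true when the watchdog was armed, false when it stayed inert.
export function watchStdinForParentDeath(
  env: NodeJS.ProcessEnv = process.env,
  stdin: Readable = process.stdin,
  onParentGone: () => void = () => process.exit(0),
): boolean {
  // Opt-in only: without the flag the server never touches stdin.
  if (env.KNOTES_E2E_WATCH_STDIN !== "1") return false;

  // 'end' is emitted when the read side sees EOF, i.e. once every
  // holder of the write end has closed it (including by dying).
  stdin.once("end", onParentGone);

  // Put the stream into flowing mode so EOF is actually observed;
  // any bytes that arrive are discarded.
  stdin.resume();
  return true;
}
```

Called once at server startup with no arguments, this would be a no-op in production and would exit the process on stdin EOF under the e2e harness.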

Also fixes the "close timed out after 10000ms" warning vitest printed at the end of each run.

Why not a separate supervisor process

An earlier draft of this PR introduced a small supervisor between vitest and the server that polled the anchor PID. It worked, but moved the problem rather than solving it: if the supervisor itself was SIGKILLed, the server was orphaned again. The pipe-EOF approach removes that intermediary entirely: no Node-side process sits between vitest's death and the server's exit, just a kernel-managed FD.

Test plan

  • Full e2e suite locally — 65/65 pass in 163s, no leaks after teardown
  • Watchdog standalone test — server with KNOTES_E2E_WATCH_STDIN=1 and < /dev/null exits immediately on EOF
  • Manual SIGKILL of the vitest worker mid-run — server exits within ~5s, no orphans
  • CI green

antoninbas and others added 2 commits April 19, 2026 18:05
Spawn the server with detached: true so it leads its own process group,
then signal -pgid on teardown. Previously SIGTERM went to the npx
wrapper, which didn't propagate to the inner node child — leaving
orphaned tsx server processes on random ports after every e2e run.

Also fixes the "close timed out after 10000ms" warning vitest printed at
the end of each run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
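
The process-group teardown from this commit can be sketched as follows. This is a minimal illustration assuming a POSIX host; the helper names `spawnInOwnGroup` and `killGroup` are hypothetical and the harness's real spawn options are not shown in this PR page.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Hypothetical helper: start the command as leader of a new process group.
export function spawnInOwnGroup(command: string, args: string[]): ChildProcess {
  // On POSIX, detached: true makes the child a session and group leader,
  // so its process group id equals its pid.
  return spawn(command, args, { detached: true, stdio: "ignore" });
}

// Hypothetical helper: signal every process in the child's group.
// Returns false if the child never started or the group is already gone.
export function killGroup(
  child: ChildProcess,
  signal: NodeJS.Signals = "SIGTERM",
): boolean {
  if (child.pid === undefined) return false; // spawn failed
  try {
    // A negative pid targets the whole process group, so wrappers such as
    // npx and the inner node child receive the signal together.
    process.kill(-child.pid, signal);
    return true;
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ESRCH") return false;
    throw err;
  }
}
```

Signalling the group rather than the direct child is what avoids relying on the wrapper to forward SIGTERM.
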
The first attempt (process-group SIGTERM on teardown) handled clean
shutdown but still leaked when vitest itself died hard (SIGKILL, OOM,
runner yanked) — the detached child kept running, reparented to PID 1.

Insert a tiny supervisor between vitest and the server. The supervisor:
  - Spawns the server in its own process group
  - Polls KNOTES_E2E_ANCHOR_PID (vitest's PID) once a second via
    process.kill(pid, 0); on ESRCH, tears the server down
  - Forwards SIGTERM/SIGINT/SIGHUP to the server group on clean teardown

Anchor pid is passed via env because intermediate wrappers (npx, the
tsx CLI) exit between vitest and the supervisor, making process.ppid
unreliable.

Verified manually: after SIGKILL'ing the vitest process, no orphans
remain within ~5s.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
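
For reference, the polling check this intermediate commit describes can be sketched like this. It illustrates the approach that the final version of the PR dropped; `isAlive` and `watchAnchor` are hypothetical names, and treating EPERM as "alive" is an assumption the commit message does not mention.

```typescript
// Probe a pid with signal 0: nothing is delivered, the kernel only
// reports whether the target exists and may be signalled.
export function isAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    const code = (err as NodeJS.ErrnoException).code;
    if (code === "ESRCH") return false; // no such process
    if (code === "EPERM") return true; // exists, owned by another user
    throw err;
  }
}

// Poll the anchor pid and invoke onGone once it disappears.
export function watchAnchor(
  anchorPid: number,
  onGone: () => void,
  intervalMs = 1000,
): NodeJS.Timeout {
  const timer = setInterval(() => {
    if (!isAlive(anchorPid)) {
      clearInterval(timer);
      onGone();
    }
  }, intervalMs);
  return timer;
}
```

Besides the weakness noted in the PR description (the poller itself can be killed), a pid probe can also be fooled if the operating system reuses the anchor's pid for an unrelated process.
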
@antoninbas antoninbas changed the title from "Kill E2E server's process group on teardown" to "Bind E2E server lifetime to vitest via watchdog" Apr 20, 2026
Drop the supervisor process and put the parent-death detection in the
server itself, gated on KNOTES_E2E_WATCH_STDIN=1. The harness wires its
own pipe to the server's stdin and holds the write end for the server's
lifetime; the kernel closes that write end the moment the harness
process dies (clean exit, SIGKILL, OOM), and the server reads EOF and
exits.

This is the canonical UNIX equivalent of prctl(PR_SET_PDEATHSIG): the
binding is enforced by the kernel, not by a polling loop, and the leaf
process detects parent death directly with no intermediary that could
itself be killed and break the chain.

Verified manually: SIGKILL'ing the vitest worker tears the e2e server
down within ~5s, with no orphans left behind.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
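
On the harness side, the wiring this commit describes might look like the sketch below. The helper name `spawnBoundToParent` and the choice to inherit stdout and stderr are assumptions; the env flag, the stdin pipe, and the detached process group come from the PR text.

```typescript
import { spawn, ChildProcess } from "node:child_process";

// Hypothetical helper: spawn the server with its stdin connected to a
// channel whose write end only this process holds.
export function spawnBoundToParent(command: string, args: string[]): ChildProcess {
  const child = spawn(command, args, {
    detached: true, // own process group, for clean teardown via -pgid
    stdio: ["pipe", "inherit", "inherit"],
    env: { ...process.env, KNOTES_E2E_WATCH_STDIN: "1" },
  });
  // Deliberately never call child.stdin.end(): the write end stays open
  // for as long as this process lives. When this process exits for any
  // reason, the kernel closes it and the child reads EOF.
  return child;
}
```

Nothing has to run in the harness at the moment of death for this to work, which is why it also covers SIGKILL and OOM kills.
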
@antoninbas antoninbas changed the title from "Bind E2E server lifetime to vitest via watchdog" to "Bind E2E server lifetime to vitest via kernel pipe-EOF" Apr 20, 2026
@antoninbas antoninbas merged commit b75e2d3 into main Apr 20, 2026
7 checks passed
@antoninbas antoninbas deleted the e2e-cleanup-leak branch April 20, 2026 01:45