
fix(windows): reap orphaned MCP processes when their parent exits #711

Merged
colbymchenry merged 1 commit into main from fix/win-ppid-watchdog-leak
Jun 6, 2026

@colbymchenry
Owner

Problem

On Windows, CodeGraph's background processes accumulate without bound over a long session and eventually saturate the CPU. Closing the editor or agent that launched CodeGraph does not terminate the associated processes, and the shared daemon's 5-minute idle timeout never fires. (#692, with #576 and #680 as the same symptom.)

Root cause

All three PPID watchdogs (proxy socket, proxy local-handshake, direct mode) detected parent death via only:

  • ppidChanged: a POSIX-only indicator (not a signal in the kill sense). The OS reparents an orphan to init, so process.ppid diverges from the value captured at startup. Windows never reparents, so process.ppid stays constant after the parent dies and this check can never fire.
  • hostGone: requires CODEGRAPH_HOST_PPID, which is set by the wasm relaunch. The standalone bundle pre-bakes --liftoff-only, so the relaunch is skipped and CODEGRAPH_HOST_PPID is never set.

On a Windows standalone bundle (exactly #692's environment) neither condition can fire → the orphaned proxy/server runs forever → its socket never closes → the shared daemon keeps a phantom client → clients never reaches 0 → the idle timer never arms → processes accumulate.
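
The chain above hinges on the daemon's client refcount. A minimal model of that bookkeeping, assuming a hypothetical `DaemonRefcount` class (the real daemon's structure is not shown in this PR), illustrates why one phantom client is enough to keep the idle timer from ever arming:

```typescript
// Hypothetical model of the daemon's refcount; names are illustrative only.
class DaemonRefcount {
  private clients = new Set<string>();
  // True once the last client has disconnected and the idle countdown may start.
  idleTimerArmed = false;

  connect(id: string): void {
    this.clients.add(id);
    this.idleTimerArmed = false; // any attached client disarms the idle timer
  }

  disconnect(id: string): void {
    this.clients.delete(id);
    if (this.clients.size === 0) {
      this.idleTimerArmed = true; // only reached when every socket has closed
    }
  }
}

const daemon = new DaemonRefcount();
daemon.connect("live-proxy");
daemon.connect("orphaned-proxy"); // its socket never closes on Windows
daemon.disconnect("live-proxy");
console.log(daemon.idleTimerArmed); // still false: the orphan pins the daemon
```

In this model the only way out is for the orphaned proxy to exit so that its disconnect is observed, which is what the watchdog is meant to guarantee.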

Confirmed empirically on a Windows 11 VM: a child's process.ppid stays constant across parent death (ppid_changed=false), while process.kill(originalPpid, 0) starts throwing the moment the parent exits.
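
The liveness probe mentioned above can be sketched as follows. `isProcessAlive` is a hypothetical name; the PR only states that `process.kill(originalPpid, 0)` is the call used. Treating `EPERM` as "alive" is an assumption of this sketch, based on that error meaning the target exists but cannot be signalled.

```typescript
// Hypothetical helper; not necessarily how the PR wraps the call.
function isProcessAlive(pid: number): boolean {
  try {
    // Signal 0 delivers nothing; it only tests whether the target exists.
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we lack permission to signal it.
    // Any other error (typically ESRCH) is treated as "gone".
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

// Capture the parent once at startup; on Windows process.ppid will not change later.
const originalPpid = process.ppid;
console.log(`parent ${originalPpid} alive: ${isProcessAlive(originalPpid)}`);
```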

Fix

Add a win32-only signal: poll the original parent's liveness directly (process.kill(originalPpid, 0)), since ppid is stable on Windows. Gated to win32 on purpose — on POSIX a double-forked grandparent can legitimately outlive the reparent, so a dead originalPpid is not proof of orphaning there; the ppid-change signal remains correct and sufficient. The decision is extracted into a pure helper (src/mcp/ppid-watchdog.ts) shared by all three sites, with a cross-platform unit matrix so the Windows branch is covered on any OS.

POSIX behavior is unchanged.
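
A rough sketch of what such a pure decision helper might look like. The names `WatchdogInput` and `parentIsGone` are hypothetical, and the actual contents of src/mcp/ppid-watchdog.ts may differ. Passing the platform and the liveness probe as inputs is what would let the win32 branch be exercised on any OS:

```typescript
// Hypothetical shape; the real helper in src/mcp/ppid-watchdog.ts may differ.
interface WatchdogInput {
  platform: string; // e.g. process.platform
  originalPpid: number; // parent pid captured at startup
  currentPpid: number; // process.ppid at poll time
  hostPpid?: number; // parsed from CODEGRAPH_HOST_PPID when it is set
  isAlive: (pid: number) => boolean; // injected so tests need no real processes
}

function parentIsGone(input: WatchdogInput): boolean {
  // POSIX: an orphan is reparented, so the ppid diverges from the original.
  const ppidChanged = input.currentPpid !== input.originalPpid;
  // Only meaningful when the relaunch recorded a host pid.
  const hostGone = input.hostPpid !== undefined && !input.isAlive(input.hostPpid);
  // win32 only: ppid is stable there, so probe the original parent directly.
  const originalParentDead =
    input.platform === "win32" && !input.isAlive(input.originalPpid);
  return ppidChanged || hostGone || originalParentDead;
}

// The #692 scenario: Windows, no host pid, ppid unchanged, parent dead.
console.log(
  parentIsGone({ platform: "win32", originalPpid: 100, currentPpid: 100, isAlive: () => false })
); // true
```

The same inputs with `platform: "linux"` yield `false` in this sketch, matching the stated intent that a dead original parent alone is not treated as proof of orphaning on POSIX.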

Test plan

Fixes #692.
Related: #576 (same Windows orphan-reaping mechanism) and #680 (the same symptom; should be resolved for Windows hosts).

🤖 Generated with Claude Code

…, #576, #680)

On Windows the PPID watchdog could never fire: orphans aren't reparented, so
`process.ppid` stays constant after the parent dies (defeating the ppid-change
check), and the standalone bundle pre-bakes `--liftoff-only`, skipping the
relaunch that sets `CODEGRAPH_HOST_PPID` (defeating the host-liveness check).
With neither signal available, an orphaned proxy / direct server ran forever,
the shared daemon never saw the client disconnect, and its idle timer never
armed — node processes accumulated until CPU saturated.

Add a win32-only signal: poll the original parent's liveness directly, since
ppid is stable there. Gated to Windows so POSIX double-fork cases keep relying
on the ppid-change signal (a dead original parent is not proof of orphaning on
POSIX). The decision is extracted into a pure, unit-tested helper shared by all
three watchdog sites (proxy socket, proxy local-handshake, direct mode).

Validated on a real Windows 11 VM: in the exact bundle scenario (direct mode,
no HOST_PPID) an orphaned server now exits within one watchdog poll via the new
path; the POSIX reparent path is unchanged and its integration test still passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
colbymchenry merged commit 565eb20 into main on Jun 6, 2026
colbymchenry deleted the fix/win-ppid-watchdog-leak branch on June 6, 2026 19:05
colbymchenry added a commit that referenced this pull request Jun 6, 2026
… can't leak (#692) (#712)

Layer-2 defense-in-depth follow-up to the Windows PPID watchdog fix (#711).
That fix makes an orphaned proxy exit so its socket closes and the daemon
reaps via the refcount + idle timer. This adds two daemon-side safety nets for
the residual case where a socket close is never delivered (a Windows named-pipe
hazard) and a phantom client would otherwise pin the daemon forever:

  - Liveness sweep: a proxy now sends an optional client-hello carrying its pid
    (+ host pid) right after verifying the daemon hello; the daemon periodically
    drops any client whose peer process is dead, re-arming the idle timer.
    Fail-safe and version-pinned — a connection that never sends the hello just
    falls back to the socket-close lifecycle, and the daemon reads it before the
    transport so a non-hello first line is handed through untouched.
  - Inactivity backstop: the daemon exits after a generous no-traffic window
    (CODEGRAPH_DAEMON_MAX_IDLE_MS, default 30 min) even with clients attached, so
    a phantom client that sends nothing can't keep it alive.
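
A sketch of how the liveness sweep described above could fit together. The helper names `parseClientHelloLine` and `peerIsDead` come from this commit message, but their signatures, the JSON wire format, the `sweep` function, and the rule for combining pid and host pid are all assumptions made for illustration:

```typescript
// Hypothetical wire format: one JSON line such as
// {"type":"client-hello","pid":1234,"hostPid":5678}. The real format is not shown here.
interface ClientHello {
  pid: number;
  hostPid?: number;
}

// Returns the hello if the first line is one, otherwise null so the caller
// can hand the line through to the transport untouched.
function parseClientHelloLine(line: string): ClientHello | null {
  let msg: unknown;
  try {
    msg = JSON.parse(line);
  } catch {
    return null;
  }
  if (typeof msg !== "object" || msg === null) return null;
  const rec = msg as Record<string, unknown>;
  const pid = rec["pid"];
  const hostPid = rec["hostPid"];
  if (rec["type"] !== "client-hello" || typeof pid !== "number") return null;
  const hello: ClientHello = { pid };
  if (typeof hostPid === "number") hello.hostPid = hostPid;
  return hello;
}

// Assumption: a peer counts as dead when its own pid is gone, or when it
// reported a host pid and that host is gone.
function peerIsDead(hello: ClientHello, isAlive: (pid: number) => boolean): boolean {
  if (!isAlive(hello.pid)) return true;
  return hello.hostPid !== undefined && !isAlive(hello.hostPid);
}

// One sweep pass: drop dead peers, then report whether the idle timer may re-arm.
function sweep(clients: Map<string, ClientHello>, isAlive: (pid: number) => boolean): boolean {
  clients.forEach((hello, id) => {
    if (peerIsDead(hello, isAlive)) clients.delete(id);
  });
  return clients.size === 0;
}
```

Clients that never sent a hello would simply not appear in the map in this sketch, so they stay on the socket-close lifecycle as the text describes.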

Pure helpers (parseClientHelloLine, peerIsDead) are unit-tested; the full
handshake + sweep and the backstop are covered end-to-end in mcp-daemon.test.ts.
Validated on a real Windows 11 VM: the sweep reaps a dead-pid client over a
named pipe and the backstop fires with a client still connected.
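
The inactivity backstop could be modeled roughly as below. Only the variable name CODEGRAPH_DAEMON_MAX_IDLE_MS and the 30 min default come from this message; the function names and the fallback rule for invalid values are assumptions of the sketch:

```typescript
const DEFAULT_MAX_IDLE_MS = 30 * 60 * 1000; // 30 min default stated above

// Assumption: unset, non-numeric, or non-positive values fall back to the default.
function resolveMaxIdleMs(env: Record<string, string | undefined>): number {
  const parsed = Number(env["CODEGRAPH_DAEMON_MAX_IDLE_MS"]);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_IDLE_MS;
}

// The client count is deliberately not an input: the backstop is meant to
// fire even while (possibly phantom) clients are still attached.
function shouldExitForInactivity(lastTrafficAt: number, now: number, maxIdleMs: number): boolean {
  return now - lastTrafficAt >= maxIdleMs;
}

const maxIdle = resolveMaxIdleMs({ CODEGRAPH_DAEMON_MAX_IDLE_MS: "5000" });
console.log(shouldExitForInactivity(0, 6000, maxIdle)); // true
```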

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Daemon processes leak indefinitely on Windows - idle timeout never fires after proxy disconnects via EPIPE
