fix(windows): reap orphaned MCP processes when their parent exits #711
Merged
…, #576, #680)

On Windows the PPID watchdog could never fire: orphans aren't reparented, so `process.ppid` stays constant after the parent dies (defeating the ppid-change check), and the standalone bundle pre-bakes `--liftoff-only`, skipping the relaunch that sets `CODEGRAPH_HOST_PPID` (defeating the host-liveness check). With neither signal available, an orphaned proxy / direct server ran forever, the shared daemon never saw the client disconnect, and its idle timer never armed, so node processes accumulated until CPU saturated.

Add a win32-only signal: poll the original parent's liveness directly, since ppid is stable there. Gated to Windows so POSIX double-fork cases keep relying on the ppid-change signal (a dead original parent is not proof of orphaning on POSIX). The decision is extracted into a pure, unit-tested helper shared by all three watchdog sites (proxy socket, proxy local-handshake, direct mode).

Validated on a real Windows 11 VM: in the exact bundle scenario (direct mode, no HOST_PPID) an orphaned server now exits within one watchdog poll via the new path; the POSIX reparent path is unchanged and its integration test still passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 6, 2026
Closed
colbymchenry added a commit that referenced this pull request on Jun 6, 2026
… can't leak (#692) (#712)

Layer-2 defense-in-depth follow-up to the Windows PPID watchdog fix (#711). That fix makes an orphaned proxy exit so its socket closes and the daemon reaps via the refcount + idle timer. This adds two daemon-side safety nets for the residual case where a socket close is never delivered (a Windows named-pipe hazard) and a phantom client would otherwise pin the daemon forever:

- Liveness sweep: a proxy now sends an optional client-hello carrying its pid (+ host pid) right after verifying the daemon hello; the daemon periodically drops any client whose peer process is dead, re-arming the idle timer. Fail-safe and version-pinned: a connection that never sends the hello just falls back to the socket-close lifecycle, and the daemon reads it before the transport so a non-hello first line is handed through untouched.
- Inactivity backstop: the daemon exits after a generous no-traffic window (CODEGRAPH_DAEMON_MAX_IDLE_MS, default 30 min) even with clients attached, so a phantom client that sends nothing can't keep it alive.

Pure helpers (parseClientHelloLine, peerIsDead) are unit-tested; the full handshake + sweep and the backstop are covered end-to-end in mcp-daemon.test.ts.

Validated on a real Windows 11 VM: the sweep reaps a dead-pid client over a named pipe and the backstop fires with a client still connected.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
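The two daemon-side safety nets described in that commit can be illustrated with a minimal TypeScript sketch. This is not the code from #712: the `TrackedClient` shape and the names `sweepDeadClients`, `resolveMaxIdleMs`, and `shouldExitForInactivity` are hypothetical, the client-hello wire format is not modelled, and the sketch assumes a client counts as dead when the pid it reported is no longer alive (host-pid handling is left out). Only the `CODEGRAPH_DAEMON_MAX_IDLE_MS` variable and its 30-minute default come from the commit message.

```typescript
// Hypothetical bookkeeping shape; the daemon's real client records are not
// shown in the commit message.
interface TrackedClient {
  id: string;
  pid?: number; // present only if the client sent the optional client-hello
}

// 30 minutes, the default stated in the commit message.
const DEFAULT_MAX_IDLE_MS = 30 * 60 * 1000;

// Read the backstop window from the environment, falling back to the default
// when the variable is unset, empty, non-numeric, or not positive.
function resolveMaxIdleMs(env: { [key: string]: string | undefined }): number {
  const raw = Number(env["CODEGRAPH_DAEMON_MAX_IDLE_MS"]);
  return isFinite(raw) && raw > 0 ? raw : DEFAULT_MAX_IDLE_MS;
}

// Liveness sweep: split clients into those to keep and those to drop.
// Clients that never sent a hello (no pid) are always kept, so they fall
// back to the ordinary socket-close lifecycle.
function sweepDeadClients(
  clients: TrackedClient[],
  isAlive: (pid: number) => boolean,
): { kept: TrackedClient[]; dropped: TrackedClient[] } {
  const kept: TrackedClient[] = [];
  const dropped: TrackedClient[] = [];
  for (let i = 0; i < clients.length; i++) {
    const client = clients[i];
    if (client.pid !== undefined && !isAlive(client.pid)) {
      dropped.push(client);
    } else {
      kept.push(client);
    }
  }
  return { kept, dropped };
}

// Inactivity backstop: true once no traffic has been seen for maxIdleMs,
// regardless of how many clients are still attached.
function shouldExitForInactivity(
  lastTrafficAtMs: number,
  nowMs: number,
  maxIdleMs: number,
): boolean {
  return nowMs - lastTrafficAtMs >= maxIdleMs;
}
```

In a daemon, both checks would presumably run from a periodic timer; keeping them as pure functions with an injected `isAlive` makes them testable without spawning real processes.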
Problem
On Windows, `codegraph`'s background processes pile up without bound over a long session and eventually saturate CPU: closing the editor/agent that launched CodeGraph does not terminate the associated processes, and the shared daemon's 5-minute idle timeout never fires. (#692, with #576 and #680 as the same symptom.)

Root cause
All three PPID watchdogs (proxy socket, proxy local-handshake, direct mode) detected parent death via only:

- `ppidChanged`: a POSIX signal. The OS reparents an orphan to init, so `process.ppid` diverges. Windows never reparents, so `process.ppid` stays constant after the parent dies and this can never fire.
- `hostGone`: needs `CODEGRAPH_HOST_PPID`, which is set by the wasm relaunch. The standalone bundle pre-bakes `--liftoff-only`, so the relaunch is skipped and `HOST_PPID` is never set.

On a Windows standalone bundle (exactly #692's environment) neither condition can fire → the orphaned proxy/server runs forever → its socket never closes → the shared daemon keeps a phantom client → `clients` never reaches 0 → the idle timer never arms → processes accumulate.

Confirmed empirically on a Windows 11 VM: a child's `process.ppid` stays constant across parent death (`ppid_changed=false`), while `process.kill(originalPpid, 0)` starts throwing the moment the parent exits.

Fix
Add a win32-only signal: poll the original parent's liveness directly (`process.kill(originalPpid, 0)`), since ppid is stable on Windows. Gated to win32 on purpose: on POSIX a double-forked grandparent can legitimately outlive the reparent, so a dead `originalPpid` is not proof of orphaning there; the ppid-change signal remains correct and sufficient. The decision is extracted into a pure helper (`src/mcp/ppid-watchdog.ts`) shared by all three sites, with a cross-platform unit matrix so the Windows branch is covered on any OS.

POSIX behavior is unchanged.
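As a rough illustration of the decision described above, here is a minimal TypeScript sketch. It is not the contents of `src/mcp/ppid-watchdog.ts`: the names `WatchdogInput`, `parentDeathReason`, and `isProcessAlive`, the reason strings, and the order in which the three signals are checked are all assumptions. The sketch also assumes that an `EPERM` error from `process.kill(pid, 0)` means the process exists but cannot be signalled, and so counts as alive.

```typescript
// Hypothetical names throughout; the real helper's signature is not shown
// in this PR description.
type ExitReason = "ppid-changed" | "host-gone" | "parent-gone" | null;

interface WatchdogInput {
  platform: string; // e.g. process.platform
  originalPpid: number; // ppid captured at startup
  currentPpid: number; // process.ppid at poll time
  hostPid?: number; // parsed from CODEGRAPH_HOST_PPID, if it was set
  isAlive: (pid: number) => boolean; // injected so the decision stays pure
}

// Pure decision: returns why the process should exit, or null to keep running.
function parentDeathReason(input: WatchdogInput): ExitReason {
  const { platform, originalPpid, currentPpid, hostPid, isAlive } = input;

  // POSIX signal: an orphan is reparented, so its ppid diverges.
  if (currentPpid !== originalPpid) return "ppid-changed";

  // Host-liveness signal: only available when the relaunch set the host pid.
  if (hostPid !== undefined && !isAlive(hostPid)) return "host-gone";

  // win32-only signal: ppid is stable there, so probe the original parent.
  // Not applied on POSIX, where a dead original parent does not prove orphaning.
  if (platform === "win32" && !isAlive(originalPpid)) return "parent-gone";

  return null;
}

// Liveness probe: signal 0 performs the existence check without sending
// a signal, and throws if the target process does not exist.
function isProcessAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we lack permission to signal it.
    return (err as { code?: string }).code === "EPERM";
  }
}
```

Because `platform` and `isAlive` are passed in rather than read from globals, a unit matrix can exercise the win32 branch on any OS, for example by calling `parentDeathReason({ platform: "win32", originalPpid: 100, currentPpid: 100, isAlive: () => false })` and expecting `"parent-gone"`.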
Test plan
- … `mcp-ppid-watchdog.test.ts`) and the daemon lifecycle suite (#411 / #277 / idle-timeout) all pass.
- … `HOST_PPID`): the orphaned server exited within one watchdog poll, with stderr confirming the new path. (`parent pid … exited` is produced only by the new win32 liveness branch, not by the POSIX `ppid →` path or `host pid … exited`.)

Fixes #692.
Related: #576 (same Windows orphan-reaping mechanism) and #680 (the same symptom; should be resolved for Windows hosts).
🤖 Generated with Claude Code