fix(daemon): reap dead-peer clients + inactivity backstop (Layer 2, #692) #712

Merged: colbymchenry merged 1 commit into main from fix/daemon-client-liveness on Jun 6, 2026

@colbymchenry
Owner

Context

Defense-in-depth follow-up to #711 (the Windows PPID watchdog fix for #692). #711 makes an orphaned proxy exit so that its socket closes and the daemon reaps the client via the refcount + idle timer. This PR adds two daemon-side safety nets for the residual case the watchdog can't cover: a socket close that is never delivered (a Windows named-pipe hazard). In that case a phantom client would otherwise pin the daemon forever, because the idle timer only arms at zero clients and therefore never fires.

What changed

  • Liveness sweep. A proxy now sends an optional client-hello carrying its own pid (and the host pid) right after it verifies the daemon hello and before piping any traffic. The daemon records it per-client and periodically (CODEGRAPH_DAEMON_CLIENT_SWEEP_MS, default 30s) drops any client whose peer process is dead, re-arming the idle timer. A dead client is now reaped within one sweep interval; previously it was never reaped.
  • Inactivity backstop. The daemon exits after a generous no-traffic window (CODEGRAPH_DAEMON_MAX_IDLE_MS, default 30 min) even with clients still attached — a phantom client sends nothing, so it can't keep the daemon alive. A lightweight socket observer feeds the activity clock; no protocol parsing on this path.
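The liveness sweep can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `DaemonClient`, `pidIsAlive`, and `sweepClients` are hypothetical names, and the real `peerIsDead` helper may behave differently (for example in how it weighs the host pid). The sketch assumes Node's `process.kill(pid, 0)` existence check and keeps the probe injectable so the sweep can be exercised without real processes.

```typescript
// Hypothetical client record; the daemon's real per-client state is not shown in this PR.
interface DaemonClient {
  id: string;
  peerPid?: number; // present only if the client sent a client-hello
}

type AliveProbe = (pid: number) => boolean;

// Signal 0 delivers nothing; it only checks whether the pid can be signalled.
function pidIsAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but we lack permission, so treat it as alive.
    return (err as { code?: string }).code === "EPERM";
  }
}

// One sweep pass. Clients that never announced a pid are always kept, which
// mirrors the fallback to the plain socket-close lifecycle.
function sweepClients(
  clients: DaemonClient[],
  isAlive: AliveProbe = pidIsAlive
): { kept: DaemonClient[]; reaped: DaemonClient[] } {
  const kept: DaemonClient[] = [];
  const reaped: DaemonClient[] = [];
  for (const client of clients) {
    if (client.peerPid !== undefined && !isAlive(client.peerPid)) {
      reaped.push(client);
    } else {
      kept.push(client);
    }
  }
  return { kept, reaped };
}
```

In a daemon, a pass like this would presumably run on a timer at the `CODEGRAPH_DAEMON_CLIENT_SWEEP_MS` interval, destroy the sockets of the reaped clients, and re-arm the idle timer once no clients remain.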

Safety / risk

  • Fail-safe & version-pinned. Proxy and daemon are always the exact same version. The client-hello is optional: the daemon reads the first line, and a non-hello first line (legacy/direct client, or a timeout) is handed through to the transport untouched, falling back to today's socket-close lifecycle. Buffers are split on the newline byte so a UTF-8 sequence straddling a chunk boundary in the unshifted tail is never corrupted.
  • The backstop defers to the existing idle timer for the zero-client case; it only acts while clients are (nominally) attached.
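The fail-safe first-line handling described above can be sketched as below. The wire format is an assumption (one JSON object per line with `type: "client-hello"`), and `parseHelloLine` and `readFirstLine` are hypothetical stand-ins rather than the PR's `parseClientHelloLine`. The point illustrated is that the buffer is split on the raw newline byte (`0x0a`, which never occurs inside a multi-byte UTF-8 sequence), so the bytes after the hello are passed on undecoded.

```typescript
interface ClientHello {
  pid: number;
  hostPid?: number;
}

// Assumed format: {"type":"client-hello","pid":1234,"hostPid":99}
function parseHelloLine(line: string): ClientHello | null {
  let msg: any;
  try {
    msg = JSON.parse(line);
  } catch (_err) {
    return null; // not JSON, so not a hello
  }
  if (!msg || msg.type !== "client-hello" || typeof msg.pid !== "number") return null;
  const hello: ClientHello = { pid: msg.pid };
  if (typeof msg.hostPid === "number") hello.hostPid = msg.hostPid;
  return hello;
}

interface FirstLineResult {
  hello: ClientHello | null;
  passthrough: Buffer; // bytes to hand to the transport, never re-encoded
}

// Returns null until a full first line has been buffered.
function readFirstLine(buffered: Buffer): FirstLineResult | null {
  const nl = buffered.indexOf(0x0a);
  if (nl === -1) return null;
  // Only the complete first line is decoded; the tail stays as raw bytes.
  const hello = parseHelloLine(buffered.subarray(0, nl).toString("utf8"));
  if (hello === null) {
    // Legacy or direct client: hand everything through untouched.
    return { hello: null, passthrough: buffered };
  }
  return { hello, passthrough: buffered.subarray(nl + 1) };
}
```

Decoding the whole chunk to a string before splitting would turn a partial multi-byte character at the end of the chunk into a replacement character; keeping the tail as a `Buffer` avoids that.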

Test plan

  • macOS: full suite green (1240 passed, 2 skipped). New pure-unit tests for parseClientHelloLine + peerIsDead; two end-to-end tests in mcp-daemon.test.ts — a raw client announcing a dead pid gets reaped, and the daemon exits on the inactivity backstop with a client still connected.
  • Real Windows 11 VM: clean build; unit tests pass; the daemon suite passes repeatedly (3× stable after accounting for a VM-load flake, since hardened with more generous timeouts). A standalone probe confirmed that the sweep reaps a dead-pid client over a named pipe, logging "Reaping client with dead peer (pid 999999)", and that the backstop fires with a client connected.

Follow-up to #711 and #692 (#692 was already closed by #711); no issue to re-close.

🤖 Generated with Claude Code

… can't leak (#692)

Layer-2 defense-in-depth follow-up to the Windows PPID watchdog fix (#711).
That fix makes an orphaned proxy exit so its socket closes and the daemon
reaps via the refcount + idle timer. This adds two daemon-side safety nets for
the residual case where a socket close is never delivered (a Windows named-pipe
hazard) and a phantom client would otherwise pin the daemon forever:

  - Liveness sweep: a proxy now sends an optional client-hello carrying its pid
    (+ host pid) right after verifying the daemon hello; the daemon periodically
    drops any client whose peer process is dead, re-arming the idle timer.
    Fail-safe and version-pinned — a connection that never sends the hello just
    falls back to the socket-close lifecycle, and the daemon reads it before the
    transport so a non-hello first line is handed through untouched.
  - Inactivity backstop: the daemon exits after a generous no-traffic window
    (CODEGRAPH_DAEMON_MAX_IDLE_MS, default 30 min) even with clients attached, so
    a phantom client that sends nothing can't keep it alive.
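As a rough sketch of the inactivity backstop described above: an activity clock is touched on any socket traffic, and the daemon exits once the no-traffic window elapses while clients are still attached. `InactivityBackstop` and `resolveMaxIdleMs` are hypothetical names, and the way the real daemon parses `CODEGRAPH_DAEMON_MAX_IDLE_MS` is an assumption. Time is passed in explicitly so the logic can be checked without waiting.

```typescript
const DEFAULT_MAX_IDLE_MS = 30 * 60 * 1000; // 30 min, the default stated in the PR

// Assumed parsing: any missing, non-numeric, or non-positive value falls back to the default.
function resolveMaxIdleMs(env: { [key: string]: string | undefined }): number {
  const parsed = Number(env["CODEGRAPH_DAEMON_MAX_IDLE_MS"]);
  return isFinite(parsed) && parsed > 0 ? parsed : DEFAULT_MAX_IDLE_MS;
}

class InactivityBackstop {
  private lastActivity: number;

  constructor(private readonly maxIdleMs: number, now: number) {
    this.lastActivity = now;
  }

  // Called by a socket observer on any inbound or outbound bytes; no protocol parsing.
  touch(now: number): void {
    this.lastActivity = now;
  }

  shouldExit(now: number, clientCount: number): boolean {
    // With zero clients the existing idle timer owns the decision.
    if (clientCount === 0) return false;
    return now - this.lastActivity >= this.maxIdleMs;
  }
}
```

A daemon would typically call `touch(Date.now())` from its socket `data` handlers and poll `shouldExit(Date.now(), clients.length)` on an interval.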

Pure helpers (parseClientHelloLine, peerIsDead) are unit-tested; the full
handshake + sweep and the backstop are covered end-to-end in mcp-daemon.test.ts.
Validated on a real Windows 11 VM: the sweep reaps a dead-pid client over a
named pipe and the backstop fires with a client still connected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@colbymchenry colbymchenry merged commit 80358a8 into main Jun 6, 2026
@colbymchenry colbymchenry deleted the fix/daemon-client-liveness branch June 6, 2026 20:23
colbymchenry added a commit that referenced this pull request Jun 6, 2026
…it (#662) (#713)

When an MCP host (opencode and others) SIGTERMs the shared daemon as a new
session starts, the existing session's proxy used to exit on the dropped socket
— silently losing CodeGraph for that session, and hanging any request in flight
at the drop. The SIGTERM originates in the host's process-tree teardown, not in
CodeGraph (nothing here signals another process), so the fix is proxy
resilience, not chasing the signal.

The local-handshake proxy now treats a daemon disconnect as recoverable rather
than terminal: it falls back to its in-process engine for the rest of the
session (the same path used when no daemon is reachable at startup, and what
CODEGRAPH_NO_DAEMON does) and re-serves any requests that were in flight to the
dead daemon, so the host never hangs. The proxy still exits when the HOST goes
away (stdin close / PPID watchdog) — only daemon loss is now non-fatal.
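The recovery behavior described above can be illustrated with a small sketch. All names here (`ResilientProxy`, `RpcRequest`, `RpcResponse`, `LocalEngine`) are hypothetical, and the real proxy speaks MCP over sockets rather than calling synchronous functions. The sketch only shows the bookkeeping: track requests forwarded to the daemon, and on disconnect re-serve the unanswered ones from an in-process engine.

```typescript
interface RpcRequest { id: number; method: string; }
interface RpcResponse { id: number; result: string; }
type LocalEngine = (req: RpcRequest) => RpcResponse;

class ResilientProxy {
  private inFlight: RpcRequest[] = [];
  private daemonAlive = true;

  constructor(
    private readonly sendToDaemon: (req: RpcRequest) => void,
    private readonly localEngine: LocalEngine,
    private readonly reply: (res: RpcResponse) => void
  ) {}

  handle(req: RpcRequest): void {
    if (!this.daemonAlive) {
      // After a daemon loss, serve in-process for the rest of the session.
      this.reply(this.localEngine(req));
      return;
    }
    this.inFlight.push(req);
    this.sendToDaemon(req);
  }

  onDaemonResponse(res: RpcResponse): void {
    this.inFlight = this.inFlight.filter((r) => r.id !== res.id);
    this.reply(res);
  }

  // Daemon loss is recoverable: re-serve whatever was still unanswered.
  onDaemonDisconnect(): void {
    this.daemonAlive = false;
    const pending = this.inFlight;
    this.inFlight = [];
    for (const req of pending) {
      this.reply(this.localEngine(req));
    }
  }
}
```

Host loss (stdin close or the PPID watchdog) is deliberately absent from this sketch; per the commit message, that path still terminates the proxy.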

Also replaces the over-the-wire liveness-sweep test added in #712 — which was
flaky under heavy parallel load (a raced raw-socket connect) — with a
deterministic Daemon.reapDeadClients unit test. The client-hello round-trip is
still exercised by every daemon test (the real proxy now sends it).

Validated with a reproduction (proxy stays alive, in-flight request answered,
post-drop request recovers) and a regression test in mcp-daemon.test.ts.
Confirmed on macOS (full suite green) and a Windows 11 VM.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>