feat(6.31): boot-time orphan reap + `cfcf server reap` interactive verb #36
Merged
fstamatelopoulos merged 1 commit into `main` on May 8, 2026
Closes the hard-crash hole left by v0.20.0's signal-handler-based cleanup: when the cfcf server dies via SIGKILL or an OS panic, its agent children get reparented to PID 1 and keep running, tying up ollama's model runner for up to 10 minutes per orphan.

New `packages/core/src/orphan-reaper.ts` module:
- `findOrphanAgentProcesses()` scans `ps -eo pid,ppid,user,etime,command` with three conjoined filters: PPID==1 (orphan signature) + same effective user + cfcf-spawn command shape. The shape matchers are tight enough that a hand-typed `claude -p` from another shell would not match (cfcf always pairs `-p` with `--dangerously-skip-permissions`).
- `reapOrphans()` mirrors `process-manager.ts`'s `killProcessTree`: group SIGTERM, 1.5 s grace, group SIGKILL, with a direct-PID fallback when the group target throws ESRCH.
- `classifyCommand`, `parsePsOutput`, `filterOrphans`, and `formatOrphanLine` are exported as pure helpers for unit testing.

Wired into:
- `packages/server/src/start.ts`: boot-time auto-reap after the stale-history-event cleanup. Best-effort: a scan failure logs and continues, and never blocks server boot.
- `packages/cli/src/commands/server.ts`: new `cfcf server reap` verb that combines list + interactive y/N kill in a single mental model. The empty state prints "No zombie agent processes detected." and exits. Supports `-y / --yes` for non-interactive use. Calls core directly, so it does NOT require the cfcf server to be running.

Tests (25 in `orphan-reaper.test.ts`): every cfcf-spawn pattern plus each negative case (interactive claude, ollama serve/pull, unrelated commands), parser robustness on malformed input, each filter in isolation, the full SIGTERM-then-SIGKILL flow with mocked `process.kill`, and the group-then-direct fallback.

`docs/plan.md`: 6.31 marked ✅ shipped post-v0.20.0; iter-6 active-set callout updated.

The pre-existing `app.test.ts` config-merge failure on main is NOT caused by this work (verified via `git stash && test`) and is tracked as a separate concern.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
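The detection half described above (parse `ps -eo pid,ppid,user,etime,command`, then keep only rows that pass all three filters) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the module's actual code: the helper names echo the PR's exported `parsePsOutput`, `classifyCommand`, and `filterOrphans`, but the `PsRow` shape, the signatures, the `currentUser` parameter, and the matching rules are hypothetical, and only the `claude -p --dangerously-skip-permissions` shape is modelled.

```typescript
// Hypothetical sketch of the detection pipeline; names echo the PR's exported
// helpers, but signatures and matching rules here are assumptions.

export interface PsRow {
  pid: number;
  ppid: number;
  user: string;
  etime: string;
  command: string;
}

// Parse `ps -eo pid,ppid,user,etime,command` output. The first four columns
// never contain spaces; everything after them is the (possibly spaced)
// command line. The header row and malformed lines are skipped.
export function parsePsOutput(text: string): PsRow[] {
  const rows: PsRow[] = [];
  for (const line of text.split("\n")) {
    const m = line.trim().match(/^(\d+)\s+(\d+)\s+(\S+)\s+(\S+)\s+(.+)$/);
    if (!m) continue;
    rows.push({ pid: Number(m[1]), ppid: Number(m[2]), user: m[3], etime: m[4], command: m[5] });
  }
  return rows;
}

// Assumed matcher: models only the claude shape named in the PR
// (`-p` always paired with `--dangerously-skip-permissions`), so a bare
// interactive `claude -p` from another shell does not match.
export function classifyCommand(command: string): "claude" | null {
  const argv = command.trim().split(/\s+/);
  const bin = argv[0]?.split("/").pop();
  const isCfcfClaude =
    bin === "claude" && argv.includes("-p") && argv.includes("--dangerously-skip-permissions");
  return isCfcfClaude ? "claude" : null;
}

// The three conjoined filters: reparented to PID 1, owned by the current
// user, and shaped like a cfcf spawn.
export function filterOrphans(rows: PsRow[], currentUser: string): PsRow[] {
  return rows.filter(
    (r) => r.ppid === 1 && r.user === currentUser && classifyCommand(r.command) !== null,
  );
}
```

Keeping the parser and filters pure (string in, rows out) is what makes them unit-testable without spawning processes; only the thin `ps` invocation needs a real system.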
Summary
Closes the hard-crash hole left by v0.20.0's signal-handler-based cleanup. When the cfcf server dies via SIGKILL or an OS panic, bypassing the SIGINT/SIGTERM handlers in `start.ts`, its agent children get reparented to PID 1 and keep running, tying up ollama's model runner for up to 10 minutes per orphan. v0.20.0 fixed clean-stop reaping; this PR closes the post-crash and manual-recovery cases. Marks item 6.31 as ✅ shipped in `docs/plan.md`.
- `startServer()` scans for orphans on every start and kills any it finds (best-effort; never blocks boot).
- `cfcf server reap`: new interactive verb. Scans, prints candidates, and asks `Kill these N process(es)? [y/N]`. The empty case prints `No zombie agent processes detected.` and exits. Supports `-y` for non-interactive use. Runs without the cfcf server.
- New `packages/core/src/orphan-reaper.ts` module with three conjoined filters (PPID==1 + same effective user + cfcf-spawn command shape). The matchers are tight enough that hand-typed agent commands from another shell would not match (cfcf always pairs `claude -p` with `--dangerously-skip-permissions`, etc.).
- Tests: 25 in `orphan-reaper.test.ts`, covering every cfcf-spawn pattern and negative case, malformed `ps` output, each filter in isolation, and the SIGTERM→SIGKILL flow with mocked `process.kill`.

Test plan
- `bun run typecheck`: clean
- `bun test packages/core`: 700/700 pass (`orphan-reaper.test.ts`: 25/25)
- `cfcf server reap` on a clean machine → `No zombie agent processes detected.`, exits 0
- `cfcf server reap --help` shows correct flags
- `cfcf server reap` interactive y/N flow against real orphans

Notes
The pre-existing `packages/server/src/app.test.ts` config-merge failure on `main` is not caused by this work (verified via `git stash && bun test`); it is tracked as a separate concern.
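The reap step described in the commit message and Summary (group SIGTERM, 1.5 s grace, group SIGKILL, direct-PID fallback when the group target throws ESRCH) might look roughly like the following. It is a hedged sketch, not the PR's `reapOrphans()`: the `ReapDeps` injection points, the return value, and the assumption that each agent was spawned as its own process-group leader (so `kill(-pid)` also reaches its children) are all hypothetical. Injecting `kill` and `sleep` is one way to exercise the flow with a mocked `process.kill`, as the test plan mentions.

```typescript
// Hypothetical sketch of an orphan reap flow: group SIGTERM, grace period,
// group SIGKILL, with a direct-PID fallback when the group signal fails
// with ESRCH. Names and the dependency-injection shape are assumptions.

export type Sig = "SIGTERM" | "SIGKILL";
export type KillFn = (pid: number, sig: Sig) => void;

export interface ReapDeps {
  kill: KillFn;                         // injectable so tests can mock it
  sleep: (ms: number) => Promise<void>; // injectable grace-period timer
  graceMs: number;
}

const defaultDeps: ReapDeps = {
  kill: (pid, sig) => {
    process.kill(pid, sig); // a negative pid signals the whole process group
  },
  sleep: (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  graceMs: 1500,
};

function hasCode(err: unknown, code: string): boolean {
  return typeof err === "object" && err !== null && (err as { code?: unknown }).code === code;
}

// Try the process group first; if no such group exists (ESRCH), fall back to
// the bare PID. Returns false when the process is already gone.
function signalTree(pid: number, sig: Sig, kill: KillFn): boolean {
  try {
    kill(-pid, sig);
    return true;
  } catch (err) {
    if (!hasCode(err, "ESRCH")) throw err;
  }
  try {
    kill(pid, sig);
    return true;
  } catch (err) {
    if (hasCode(err, "ESRCH")) return false;
    throw err;
  }
}

// Returns the PIDs that were still alive and received SIGTERM.
export async function reapOrphans(pids: number[], deps: ReapDeps = defaultDeps): Promise<number[]> {
  const signalled = pids.filter((pid) => signalTree(pid, "SIGTERM", deps.kill));
  if (signalled.length > 0) {
    await deps.sleep(deps.graceMs); // give agents a chance to exit cleanly
    // ESRCH during SIGKILL just means the process already exited.
    for (const pid of signalled) signalTree(pid, "SIGKILL", deps.kill);
  }
  return signalled;
}
```

A negative PID addresses the whole process group in POSIX `kill(2)`, which is why a group leader's children die with it; for a PID that is not a group leader, the group call typically fails with ESRCH and the sketch drops to the direct signal.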