fix: prevent orphan VMM via socket tiebreaker + status one-shot default #50
Pull request overview
This PR addresses two orphan-process failure modes by (1) adding a pidfile-independent Unix-socket liveness probe as a last-line guard before deleting VM records/dirs, and (2) changing cocoon vm status to default to a one-shot snapshot unless explicitly put into watch/event streaming modes.
Changes:
- Add an API-socket connectivity “tiebreaker” check in VM delete to avoid removing DB entries while a VMM is still alive.
- Change `vm status` default behavior to one-shot output; introduce `--watch` to opt into the refresh loop (with `--event` preserved for streaming diffs).
- Refactor VM list rendering/filtering so list + status can share the same JSON/table output paths.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| hypervisor/stop.go | Adds an API socket responsiveness check before deleting VM dirs/DB records. |
| hypervisor/state.go | Introduces Backend.IsAPISocketLive helper backed by utils.CheckSocket. |
| cmd/vm/status.go | Implements one-shot status default, adds shared renderVMList, and factors filtering into applyFilters. |
| cmd/vm/commands.go | Updates vm status CLI help and adds --watch flag/interval wording. |
Closes #48 and #49. Both were triggers for orphan cocoon/CH processes observed on cocoonset-gke-private (8 alive CH/FC without DB entries, plus 4-day-old cocoon-CLI watchers from prior vk-cocoon restarts).

#49: vm rm --force could clean the DB but leave CH running when the pidfile/cmdline check returned a false negative (stale pidfile, recycled PID, manual rundir tampering). Add Backend.IsAPISocketLive as a pidfile-independent tiebreaker (utils.CheckSocket) and call it in DeleteAll after the stop step. It fires unconditionally: AF_UNIX has no TIME_WAIT and TerminateProcess only returns nil after full pid reap, so a successful stop guarantees the listener is gone. Firing also catches pidfd_linux.go's own VerifyProcessCmdline false-negative path, which otherwise returns (true, nil) without sending a signal.

#48: cocoon vm status defaulted to a 5s polling loop. When invoked from a shell that disconnected without SIGHUP propagation (sudo wrapper, bash -c pipeline, tmux killed without huponexit), the loop survived indefinitely; we observed 21-day-old orphans on prod. Add --watch flag; default is one-shot snapshot via statusOnce. --event still implies streaming events. statusOnce + List share a renderVMList helper so the JSON/table/empty-list branch lives in one place.
…ved (#51)

* fix: kill orphan VMM via /proc cmdline fallback when pidfile pre-removed

PR #50's socket-probe tiebreaker only catches orphans whose api.sock is still listening. If pidfile and api.sock are both pre-removed before the VMM exits (observed on GKE prod after vk-cocoon rapid restart + CH InvalidStateTransition), DeleteAll's pidfile-based stop returns ErrNotRunning, the probe returns ENOENT, and the VMM survives as a PPID=1 orphan with no rundir.

Add utils.FindVMMByCmdline as a /proc scan fallback keyed on the api-socket path (already unique per VM). Wire it into:
- WithRunningVM: recover the live pid when pidfile/socket are gone
- DeleteAll: second-pass after socket probe to catch sibling/worker pids

Repro: sleep + rm pidfile + rm api.sock + cocoon vm rm --force leaves a CH orphan. With the fix, the cmdline scan recovers and SIGKILLs it.

* chore: senior-review fixes (public-above-private, expectArg naming)

- Reorder utils/process_*.go so FindVMMByCmdline sits above verifyProcessCmdline (matches sparse_linux.go / reflink_linux.go).
- Rename the FindVMMByCmdline marker param to expectArg for consistency with VerifyProcessCmdline / TerminateProcess / pidfd_linux.go.

* fix: address Copilot round-1 findings on orphan VMM PR

- utils/process_linux.go: slices.Sort the returned pids so callers get a deterministic smallest-pid choice (Copilot caught the /proc lexicographic ordering trap, e.g. "100" < "11").
- hypervisor/state.go: fail-closed when /proc scan errors after pidfile-based check fails; previously returned ErrNotRunning on inconclusive state, which could let start/delete proceed against a still-running VM.
- hypervisor/stop.go: fail-closed in DeleteAll second-pass when /proc scan errors; previously dropped scanErr and risked re-introducing the orphan leak the PR is trying to fix.
- utils/process_test.go: replace flaky "sleep marker 60" (sleep rejects non-numeric arg and exits immediately) with "sh -c 'sleep 60 && :' marker" (compound prevents sh tail-exec into sleep). Gate on runtime.GOOS == "linux".

* chore: align fail-closed error strings with sibling refuse-delete wording

state.go + stop.go: add "(resolve the host issue and retry)" actionable-hint clause so the new scan-error wraps match the existing socket-probe error format.

* fix: surface non-ENOENT cmdline read errors so FindVMMByCmdline fails closed

Copilot round-3 finding: verifyProcessCmdline returned (false, available=false) on permission/IO errors reading /proc/<pid>/cmdline (e.g. hidepid/EPERM), and FindVMMByCmdline silently dropped that signal, so a hidepid environment could mask the real VMM and reintroduce the orphan leak. Refactor verifyProcessCmdline to return (bool, error); FindVMMByCmdline now distinguishes ENOENT (transient race, safe to skip) from any other read error (fail-closed, return wrapped first error). VerifyProcessCmdline wrapper preserves the "fall back to IsProcessAlive on error" semantic.
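The scan behaviour described in these commits (numeric pid ordering, skip on ENOENT, fail closed on any other read error) can be sketched roughly as follows. This is an illustrative sketch only, not the code from the PR: `findByCmdline`, `cmdlineHasArg`, and the `procRoot` parameter are hypothetical names, and exact-argument matching is an assumption about how the marker is compared.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"slices"
	"strconv"
)

// cmdlineHasArg reports whether the NUL-separated argv in raw contains
// expectArg as an exact argument (assumption: exact match, not substring).
func cmdlineHasArg(raw []byte, expectArg string) bool {
	for _, arg := range bytes.Split(raw, []byte{0}) {
		if string(arg) == expectArg {
			return true
		}
	}
	return false
}

// findByCmdline scans procRoot (normally "/proc") for processes whose argv
// contains expectArg and returns their pids in ascending numeric order.
// Hypothetical stand-in for the FindVMMByCmdline described above.
func findByCmdline(procRoot, expectArg string) ([]int, error) {
	entries, err := os.ReadDir(procRoot)
	if err != nil {
		return nil, fmt.Errorf("scan %s: %w", procRoot, err)
	}
	var pids []int
	for _, e := range entries {
		pid, err := strconv.Atoi(e.Name())
		if err != nil {
			continue // not a pid directory
		}
		raw, err := os.ReadFile(filepath.Join(procRoot, e.Name(), "cmdline"))
		if errors.Is(err, os.ErrNotExist) {
			continue // process exited between ReadDir and ReadFile
		}
		if err != nil {
			// EPERM, EIO, ...: the scan is inconclusive, so fail closed.
			return nil, fmt.Errorf("read cmdline of pid %d: %w", pid, err)
		}
		if cmdlineHasArg(raw, expectArg) {
			pids = append(pids, pid)
		}
	}
	// os.ReadDir sorts by name, so "100" comes before "11"; sort numerically.
	slices.Sort(pids)
	return pids, nil
}

func main() {
	// Build a fake /proc tree so the sketch runs on any OS.
	root, err := os.MkdirTemp("", "fakeproc")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	sock := "/run/vm1/api.sock"
	procs := map[string]string{
		"100": "ch\x00--api-socket\x00" + sock + "\x00",
		"11":  "ch\x00--api-socket\x00" + sock + "\x00",
		"7":   "sleep\x0060\x00",
	}
	for pid, cmdline := range procs {
		if err := os.MkdirAll(filepath.Join(root, pid), 0o755); err != nil {
			panic(err)
		}
		if err := os.WriteFile(filepath.Join(root, pid, "cmdline"), []byte(cmdline), 0o644); err != nil {
			panic(err)
		}
	}

	pids, err := findByCmdline(root, sock)
	fmt.Println(pids, err) // prints "[11 100] <nil>"
}
```

Taking the root directory as a parameter is only there to make the sketch testable without a real `/proc`; a caller that gets an error back should refuse to delete rather than treat the VM as gone.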
Summary
Closes #48 and #49. Both were triggers for the orphan cocoon/CH processes we found on `cocoon-pool-2` (8 alive CH/FC without DB entries; multiple cocoon-CLI watchers stranded by prior vk-cocoon restarts).
#49: `vm rm --force` could clean DB while CH stayed alive
`hypervisor/stop.go:DeleteAll` treats `WithRunningVM ⇒ ErrNotRunning` as "VM already gone, just clean DB". That's correct for the legitimate "VM crashed / cleanly stopped" cases. The leak path is when the pidfile/cmdline check returns a false negative (stale pidfile, recycled PID, manual rundir tampering): cocoon believes the VM is dead while CH is still listening on its api socket.
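Fix: add `Backend.IsAPISocketLive` (backed by `utils.CheckSocket`) as a pidfile-independent tiebreaker and call it in `DeleteAll` after the stop step.
The tiebreaker fires unconditionally (not gated on ErrNotRunning):
- AF_UNIX has no TIME_WAIT and `TerminateProcess` only returns nil after full pid reap, so a successful stop guarantees the listener is gone.
- Firing also catches `pidfd_linux.go`'s own `VerifyProcessCmdline` false-negative path, which otherwise returns `(true, nil)` without sending a signal.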
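A minimal sketch of what a pidfile-independent socket probe of this kind might look like, for readers unfamiliar with the technique. The names `socketLive` and `guardDelete` and the 500ms timeout are hypothetical; this is not the body of `utils.CheckSocket` or `Backend.IsAPISocketLive`, whose actual implementations are not shown in this PR description.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"path/filepath"
	"syscall"
	"time"
)

// socketLive reports whether something is accepting connections on the
// Unix socket at path. Hypothetical helper for illustration only.
func socketLive(path string, timeout time.Duration) (bool, error) {
	conn, err := net.DialTimeout("unix", path, timeout)
	if err == nil {
		conn.Close()
		return true, nil
	}
	// Socket file missing, or file present with no listener behind it:
	// nothing is serving on this path.
	if errors.Is(err, os.ErrNotExist) || errors.Is(err, syscall.ECONNREFUSED) {
		return false, nil
	}
	// Anything else (EACCES, timeout, ...) is inconclusive.
	return false, err
}

// guardDelete refuses cleanup while the socket is live or the probe is
// inconclusive, i.e. it fails closed.
func guardDelete(sockPath string) error {
	live, err := socketLive(sockPath, 500*time.Millisecond)
	if err != nil {
		return fmt.Errorf("probe %s: %w", sockPath, err)
	}
	if live {
		return fmt.Errorf("refusing to delete: %s is still accepting connections", sockPath)
	}
	return nil
}

func main() {
	dir, err := os.MkdirTemp("", "probe")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	sock := filepath.Join(dir, "api.sock")

	ln, err := net.Listen("unix", sock)
	if err != nil {
		panic(err)
	}
	fmt.Println("listener up:  ", guardDelete(sock)) // non-nil: delete refused
	ln.Close()                                        // also unlinks the socket file
	fmt.Println("listener gone:", guardDelete(sock)) // nil: safe to clean up
}
```

Because a Unix socket has no TIME_WAIT state, a refused or missing socket right after a successful stop is a reliable "gone" signal, which is why running the probe unconditionally costs little.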
#48 — `vm status` polling default leaked processes for 21 days
`cmd/vm/status.go` always entered a 5s polling loop. From a shell that disconnected without SIGHUP propagation (sudo wrapper, bash -c pipeline, tmux without huponexit), the loop survived indefinitely. We observed 21-day-old orphans of `sudo cocoon vm status --format json | python3 -c 'json.load(sys.stdin)'`.
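Fix: add a `--watch` flag; the default is now a one-shot snapshot via `statusOnce`. `--event` still implies streaming events. `statusOnce` and `List` share a `renderVMList` helper so the JSON/table/empty-list branch lives in one place.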
Breaking change: `cocoon vm status` callers who relied on auto-polling must add `--watch`. Consciously chosen — the orphan risk far outweighs the convenience.
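The one-shot default with an opt-in refresh loop can be sketched as below. This is an assumption-laden illustration rather than the code in `cmd/vm/status.go`: `runStatus`, its parameters, the hard-coded output line, and the signal wiring in `main` are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// runStatus renders once and keeps refreshing only when watch is true.
// It returns as soon as stop is closed or render reports an error.
func runStatus(watch bool, interval time.Duration, stop <-chan struct{}, render func() error) error {
	if err := render(); err != nil {
		return err
	}
	if !watch {
		return nil // default: one-shot snapshot, nothing left running
	}
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return nil
		case <-ticker.C:
			if err := render(); err != nil {
				return err
			}
		}
	}
}

func main() {
	// Hypothetical flag handling; a real CLI would use its flag library.
	watch := len(os.Args) > 1 && os.Args[1] == "--watch"

	// Close stop on the usual termination signals.
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, os.Interrupt, syscall.SIGTERM, syscall.SIGHUP)
	stop := make(chan struct{})
	go func() { <-sigs; close(stop) }()

	err := runStatus(watch, 5*time.Second, stop, func() error {
		_, err := fmt.Println("vm-1  running") // placeholder for the real table/JSON output
		return err
	})
	if err != nil {
		os.Exit(1)
	}
}
```

Signal handling alone does not help when SIGHUP never arrives, which is the failure mode described above; making the loop opt-in is what removes the long-lived process in the default case.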
Test plan