fix: prevent orphan VMM via socket tiebreaker + status one-shot default by CMGS · Pull Request #50 · cocoonstack/cocoon

CMGS · 2026-05-15T04:49:52Z

Summary

Closes #48 and #49. Both were triggers for the orphan cocoon/CH processes we found on cocoon-pool-2 (8 alive CH/FC without DB entries; multiple cocoon-CLI watchers stranded by prior vk-cocoon restarts).

#49 — `vm rm --force` could clean DB while CH stayed alive

`hypervisor/stop.go:DeleteAll` treats `WithRunningVM ⇒ ErrNotRunning` as "VM already gone, just clean DB". That's correct for the legitimate "VM crashed / cleanly stopped" cases. The leak path is when the pidfile/cmdline check returns a false negative (stale pidfile, recycled PID, manual rundir tampering) — cocoon believes the VM is dead while CH is still listening on its api socket.

Fix:

Add `Backend.IsAPISocketLive(rec)` — a pidfile-independent probe over `utils.CheckSocket`.
Call it as a tiebreaker in `DeleteAll` right before `RemoveVMDirs + DB.Update`. If the socket answers, refuse the delete.

The tiebreaker fires unconditionally (not gated on ErrNotRunning):

AF_UNIX has no TIME_WAIT and `TerminateProcess` only returns nil after full pid reap, so a successful stop guarantees the listener has been torn down — no false positives.
Firing on the happy path also catches `utils/pidfd_linux.go:16-17`'s own `VerifyProcessCmdline` false-negative shortcut that otherwise returns `(true, nil)` without sending a signal.

#48 — `vm status` polling default leaked processes for 21 days

`cmd/vm/status.go` always entered a 5s polling loop. From a shell that disconnected without SIGHUP propagation (sudo wrapper, bash -c pipeline, tmux without huponexit), the loop survived indefinitely. We observed 21-day-old orphans of `sudo cocoon vm status --format json | python3 -c 'json.load(sys.stdin)'`.

Fix:

Add `--watch` flag (default false). Without it, run `statusOnce` — single snapshot, exit.
`--event` still implies streaming events (unchanged so vk-cocoon's WatchEvents keeps working).
Extract `renderVMList(vms, format)` so `Handler.List` and `statusOnce` share the JSON/table/empty-list branch instead of forking.

Breaking change: `cocoon vm status` callers who relied on auto-polling must add `--watch`. Consciously chosen — the orphan risk far outweighs the convenience.

Test plan

`go build ./...` and `go test ./...` green
`make lint` dual-platform 0 issues
/simplify pass: senior walk + 3 parallel review agents, all findings addressed
Smoke on cocoonset-gke-private after merge:
- Trigger a rundir / pidfile drift, confirm `vm rm --force` rejects with the new tiebreaker error
- `cocoon vm status` exits immediately; `cocoon vm status --watch` enters the refresh loop as before

Copilot

Pull request overview

This PR addresses two orphan-process failure modes by (1) adding a pidfile-independent Unix-socket liveness probe as a last-line guard before deleting VM records/dirs, and (2) changing cocoon vm status to default to a one-shot snapshot unless explicitly put into watch/event streaming modes.

Changes:

Add an API-socket connectivity “tiebreaker” check in VM delete to avoid removing DB entries while a VMM is still alive.
Change vm status default behavior to one-shot output; introduce --watch to opt into the refresh loop (with --event preserved for streaming diffs).
Refactor VM list rendering/filtering so list + status can share the same JSON/table output paths.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File	Description
hypervisor/stop.go	Adds an API socket responsiveness check before deleting VM dirs/DB records.
hypervisor/state.go	Introduces `Backend.IsAPISocketLive` helper backed by `utils.CheckSocket`.
cmd/vm/status.go	Implements one-shot status default, adds shared `renderVMList`, and factors filtering into `applyFilters`.
cmd/vm/commands.go	Updates `vm status` CLI help and adds `--watch` flag/interval wording.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Copilot · 2026-05-15T05:20:28Z

@CMGS Unfortunately I hit an unexpected error while processing your comment. I've automatically reported this to GitHub.

You can ask me to try again later by mentioning me in a new comment.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 61c3610e-fbde-4a21-9d5c-ab1aa4950228

Sorry for the inconvenience!

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Copilot

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Closes #48 and #49 — both were triggers for orphan cocoon/CH processes observed on cocoonset-gke-private (8 alive CH/FC without DB entries, plus 4-day-old cocoon-CLI watchers from prior vk-cocoon restarts). #49: vm rm --force could clean the DB but leave CH running when the pidfile/cmdline check returned a false negative (stale pidfile, recycled PID, manual rundir tampering). Add Backend.IsAPISocketLive as a pidfile-independent tiebreaker (utils.CheckSocket) and call it in DeleteAll after the stop step. Fires unconditionally — AF_UNIX has no TIME_WAIT and TerminateProcess only returns nil after full pid reap, so a successful stop guarantees the listener is gone; firing also catches the pidfd_linux.go's own VerifyProcessCmdline false-negative path that otherwise returns (true, nil) without sending a signal. #48: cocoon vm status defaulted to a 5s polling loop. When invoked from a shell that disconnected without SIGHUP propagation (sudo wrapper, bash -c pipeline, tmux killed without huponexit), the loop survived indefinitely — observed 21-day-old orphans on prod. Add --watch flag; default is one-shot snapshot via statusOnce. --event still implies streaming events. statusOnce + List share a renderVMList helper so the JSON/table/empty-list branch lives in one place.

…ved (#51) * fix: kill orphan VMM via /proc cmdline fallback when pidfile pre-removed PR #50's socket-probe tiebreaker only catches orphans whose api.sock is still listening. If pidfile and api.sock are both pre-removed before the VMM exits (observed on GKE prod after vk-cocoon rapid restart + CH InvalidStateTransition), DeleteAll's pidfile-based stop returns ErrNotRunning, the probe returns ENOENT, and the VMM survives as a PPID=1 orphan with no rundir. Add utils.FindVMMByCmdline as a /proc scan fallback keyed on the api-socket path (already unique per VM). Wire it into: - WithRunningVM: recover the live pid when pidfile/socket are gone - DeleteAll: second-pass after socket probe to catch sibling/worker pids Repro: sleep + rm pidfile + rm api.sock + cocoon vm rm --force leaves a CH orphan. With the fix, the cmdline scan recovers and SIGKILLs it. * chore: senior-review fixes — public-above-private, expectArg naming - Reorder utils/process_*.go so FindVMMByCmdline sits above verifyProcessCmdline (matches sparse_linux.go / reflink_linux.go). - Rename the FindVMMByCmdline marker param to expectArg for consistency with VerifyProcessCmdline / TerminateProcess / pidfd_linux.go. * fix: address Copilot round-1 findings on orphan VMM PR - utils/process_linux.go: slices.Sort the returned pids so callers get a deterministic smallest-pid choice (Copilot caught the /proc lexicographic ordering trap, e.g. "100" < "11"). - hypervisor/state.go: fail-closed when /proc scan errors after pidfile-based check fails; previously returned ErrNotRunning on inconclusive state, which could let start/delete proceed against a still-running VM. - hypervisor/stop.go: fail-closed in DeleteAll second-pass when /proc scan errors; previously dropped scanErr and risked re-introducing the orphan leak the PR is trying to fix. - utils/process_test.go: replace flaky "sleep marker 60" (sleep rejects non-numeric arg and exits immediately) with "sh -c 'sleep 60 && :' marker" (compound prevents sh tail-exec into sleep). Gate on runtime.GOOS == "linux". * chore: align fail-closed error strings with sibling refuse-delete wording state.go + stop.go: add "(resolve the host issue and retry)" actionable-hint clause so the new scan-error wraps match the existing socket-probe error format. * fix: surface non-ENOENT cmdline read errors so FindVMMByCmdline fails closed Copilot round-3 finding: verifyProcessCmdline returned (false, available=false) on permission/IO errors reading /proc/<pid>/cmdline (e.g. hidepid/EPERM), and FindVMMByCmdline silently dropped that signal — a hidepid environment could mask the real VMM and reintroduce the orphan leak. Refactor verifyProcessCmdline to return (bool, error); FindVMMByCmdline now distinguishes ENOENT (transient race, safe to skip) from any other read error (fail-closed, return wrapped first error). VerifyProcessCmdline wrapper preserves the "fall back to IsProcessAlive on error" semantic.

CMGS force-pushed the fix/orphan-leak branch from 36a28bd to 702c56b Compare May 15, 2026 04:56

CMGS requested a review from Copilot May 15, 2026 05:00

Copilot started reviewing on behalf of CMGS May 15, 2026 05:00 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread hypervisor/state.go Outdated

Comment thread hypervisor/stop.go Outdated

Comment thread hypervisor/stop.go Outdated

Comment thread cmd/vm/commands.go Outdated

CMGS force-pushed the fix/orphan-leak branch from 702c56b to 967e90d Compare May 15, 2026 05:08

CMGS requested a review from Copilot May 15, 2026 05:08

Copilot started work on behalf of CMGS May 15, 2026 05:08 View session

Copilot started reviewing on behalf of CMGS May 15, 2026 05:09 View session

Copilot finished work on behalf of CMGS May 15, 2026 05:09

Copilot finished work on behalf of CMGS May 15, 2026 05:10

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread hypervisor/state.go Outdated

Comment thread cmd/vm/status.go

CMGS force-pushed the fix/orphan-leak branch from 967e90d to 039b6be Compare May 15, 2026 05:19

CMGS requested a review from Copilot May 15, 2026 05:20

Copilot started work on behalf of CMGS May 15, 2026 05:20 View session

Copilot started reviewing on behalf of CMGS May 15, 2026 05:21 View session

Copilot finished work on behalf of CMGS May 15, 2026 05:21

Copilot started work on behalf of CMGS May 15, 2026 05:22 View session

Copilot finished work on behalf of CMGS May 15, 2026 05:22

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread cmd/vm/status.go

CMGS force-pushed the fix/orphan-leak branch from 039b6be to 88dfc87 Compare May 15, 2026 05:26

CMGS requested a review from Copilot May 15, 2026 05:26

Copilot started work on behalf of CMGS May 15, 2026 05:27 View session

Copilot started reviewing on behalf of CMGS May 15, 2026 05:27 View session

Copilot finished work on behalf of CMGS May 15, 2026 05:28

Copilot started reviewing on behalf of CMGS May 15, 2026 05:34 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread cmd/vm/status_test.go Outdated

CMGS force-pushed the fix/orphan-leak branch from a8f13f2 to 7ad8e51 Compare May 15, 2026 05:39

CMGS requested a review from Copilot May 15, 2026 05:39

Copilot started reviewing on behalf of CMGS May 15, 2026 05:40 View session

Copilot finished work on behalf of CMGS May 15, 2026 05:43

Copilot started work on behalf of CMGS May 15, 2026 05:43 View session

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread cmd/vm/status.go

Copilot finished work on behalf of CMGS May 15, 2026 05:44

CMGS force-pushed the fix/orphan-leak branch from 7ad8e51 to 5e2ddab Compare May 15, 2026 05:44

CMGS requested a review from Copilot May 15, 2026 05:45

Copilot started work on behalf of CMGS May 15, 2026 05:45 View session

Copilot started reviewing on behalf of CMGS May 15, 2026 05:45 View session

Copilot finished work on behalf of CMGS May 15, 2026 05:46

Copilot AI reviewed May 15, 2026

View reviewed changes

Comment thread hypervisor/state.go Outdated

Comment thread hypervisor/stop.go

CMGS force-pushed the fix/orphan-leak branch from 5e2ddab to 6d38bfe Compare May 15, 2026 05:50

CMGS requested a review from Copilot May 15, 2026 05:50

Copilot started work on behalf of CMGS May 15, 2026 05:50 View session

Copilot started reviewing on behalf of CMGS May 15, 2026 05:50 View session

Copilot finished work on behalf of CMGS May 15, 2026 05:51

Copilot AI reviewed May 15, 2026

View reviewed changes

CMGS force-pushed the fix/orphan-leak branch from 6d38bfe to d623c63 Compare May 15, 2026 06:03

CMGS merged commit a8e79cb into master May 15, 2026
4 checks passed

CMGS deleted the fix/orphan-leak branch May 15, 2026 06:12

CMGS mentioned this pull request May 15, 2026

fix: kill orphan VMM via /proc cmdline fallback when pidfile pre-removed #51

Merged

4 tasks

CMGS mentioned this pull request May 15, 2026

vm rm --force: DB delete proceeds when WithRunningVM reports ErrNotRunning, but CH may actually be alive #49

Closed

Conversation

CMGS commented May 15, 2026

Summary

#49 — vm rm --force could clean DB while CH stayed alive

#48 — `vm status` polling default leaked processes for 21 days

Test plan

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI commented May 15, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

#49 — `vm rm --force` could clean DB while CH stayed alive