Skip to content

fix: prevent orphan VMM via socket tiebreaker + status one-shot default#50

Merged
CMGS merged 1 commit into
masterfrom
fix/orphan-leak
May 15, 2026
Merged

fix: prevent orphan VMM via socket tiebreaker + status one-shot default#50
CMGS merged 1 commit into
masterfrom
fix/orphan-leak

Conversation

@CMGS
Copy link
Copy Markdown
Contributor

@CMGS CMGS commented May 15, 2026

Summary

Closes #48 and #49. Both were triggers for the orphan cocoon/CH processes we found on cocoon-pool-2 (8 alive CH/FC without DB entries; multiple cocoon-CLI watchers stranded by prior vk-cocoon restarts).

#49vm rm --force could clean DB while CH stayed alive

`hypervisor/stop.go:DeleteAll` treats `WithRunningVM ⇒ ErrNotRunning` as "VM already gone, just clean DB". That's correct for the legitimate "VM crashed / cleanly stopped" cases. The leak path is when the pidfile/cmdline check returns a false negative (stale pidfile, recycled PID, manual rundir tampering) — cocoon believes the VM is dead while CH is still listening on its api socket.

Fix:

  • Add `Backend.IsAPISocketLive(rec)` — a pidfile-independent probe over `utils.CheckSocket`.
  • Call it as a tiebreaker in `DeleteAll` right before `RemoveVMDirs + DB.Update`. If the socket answers, refuse the delete.

The tiebreaker fires unconditionally (not gated on ErrNotRunning):

  • AF_UNIX has no TIME_WAIT and `TerminateProcess` only returns nil after full pid reap, so a successful stop guarantees the listener has been torn down — no false positives.
  • Firing on the happy path also catches `utils/pidfd_linux.go:16-17`'s own `VerifyProcessCmdline` false-negative shortcut that otherwise returns `(true, nil)` without sending a signal.

#48 — `vm status` polling default leaked processes for 21 days

`cmd/vm/status.go` always entered a 5s polling loop. From a shell that disconnected without SIGHUP propagation (sudo wrapper, bash -c pipeline, tmux without huponexit), the loop survived indefinitely. We observed 21-day-old orphans of `sudo cocoon vm status --format json | python3 -c 'json.load(sys.stdin)'`.

Fix:

  • Add `--watch` flag (default false). Without it, run `statusOnce` — single snapshot, exit.
  • `--event` still implies streaming events (unchanged so vk-cocoon's WatchEvents keeps working).
  • Extract `renderVMList(vms, format)` so `Handler.List` and `statusOnce` share the JSON/table/empty-list branch instead of forking.

Breaking change: `cocoon vm status` callers who relied on auto-polling must add `--watch`. Consciously chosen — the orphan risk far outweighs the convenience.

Test plan

  • `go build ./...` and `go test ./...` green
  • `make lint` dual-platform 0 issues
  • /simplify pass: senior walk + 3 parallel review agents, all findings addressed
  • Smoke on cocoonset-gke-private after merge:
    • Trigger a rundir / pidfile drift, confirm `vm rm --force` rejects with the new tiebreaker error
    • `cocoon vm status` exits immediately; `cocoon vm status --watch` enters the refresh loop as before

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses two orphan-process failure modes by (1) adding a pidfile-independent Unix-socket liveness probe as a last-line guard before deleting VM records/dirs, and (2) changing cocoon vm status to default to a one-shot snapshot unless explicitly put into watch/event streaming modes.

Changes:

  • Add an API-socket connectivity “tiebreaker” check in VM delete to avoid removing DB entries while a VMM is still alive.
  • Change vm status default behavior to one-shot output; introduce --watch to opt into the refresh loop (with --event preserved for streaming diffs).
  • Refactor VM list rendering/filtering so list + status can share the same JSON/table output paths.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
hypervisor/stop.go Adds an API socket responsiveness check before deleting VM dirs/DB records.
hypervisor/state.go Introduces Backend.IsAPISocketLive helper backed by utils.CheckSocket.
cmd/vm/status.go Implements one-shot status default, adds shared renderVMList, and factors filtering into applyFilters.
cmd/vm/commands.go Updates vm status CLI help and adds --watch flag/interval wording.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread hypervisor/state.go Outdated
Comment thread hypervisor/stop.go Outdated
Comment thread hypervisor/stop.go Outdated
Comment thread cmd/vm/commands.go Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

Comment thread hypervisor/state.go Outdated
Comment thread cmd/vm/status.go
@CMGS CMGS force-pushed the fix/orphan-leak branch from 967e90d to 039b6be Compare May 15, 2026 05:19
@CMGS CMGS requested a review from Copilot May 15, 2026 05:20
Copy link
Copy Markdown

Copilot AI commented May 15, 2026

@CMGS Unfortunately I hit an unexpected error while processing your comment. I've automatically reported this to GitHub.

You can ask me to try again later by mentioning me in a new comment.

If you want to contact GitHub about this error, please mention the following identifier so they can better serve you: 61c3610e-fbde-4a21-9d5c-ab1aa4950228

Sorry for the inconvenience!

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

Comment thread cmd/vm/status.go
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread cmd/vm/status_test.go Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread cmd/vm/status.go
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread hypervisor/state.go Outdated
Comment thread hypervisor/stop.go
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Closes #48 and #49 — both were triggers for orphan cocoon/CH processes
observed on cocoonset-gke-private (8 alive CH/FC without DB entries,
plus 4-day-old cocoon-CLI watchers from prior vk-cocoon restarts).

#49: vm rm --force could clean the DB but leave CH running when the
pidfile/cmdline check returned a false negative (stale pidfile, recycled
PID, manual rundir tampering). Add Backend.IsAPISocketLive as a
pidfile-independent tiebreaker (utils.CheckSocket) and call it in
DeleteAll after the stop step. Fires unconditionally — AF_UNIX has no
TIME_WAIT and TerminateProcess only returns nil after full pid reap, so
a successful stop guarantees the listener is gone; firing also catches
the pidfd_linux.go's own VerifyProcessCmdline false-negative path that
otherwise returns (true, nil) without sending a signal.

#48: cocoon vm status defaulted to a 5s polling loop. When invoked from
a shell that disconnected without SIGHUP propagation (sudo wrapper,
bash -c pipeline, tmux killed without huponexit), the loop survived
indefinitely — observed 21-day-old orphans on prod. Add --watch flag;
default is one-shot snapshot via statusOnce. --event still implies
streaming events. statusOnce + List share a renderVMList helper so the
JSON/table/empty-list branch lives in one place.
@CMGS CMGS force-pushed the fix/orphan-leak branch from 6d38bfe to d623c63 Compare May 15, 2026 06:03
@CMGS CMGS merged commit a8e79cb into master May 15, 2026
4 checks passed
@CMGS CMGS deleted the fix/orphan-leak branch May 15, 2026 06:12
CMGS added a commit that referenced this pull request May 15, 2026
…ved (#51)

* fix: kill orphan VMM via /proc cmdline fallback when pidfile pre-removed

PR #50's socket-probe tiebreaker only catches orphans whose api.sock is still
listening. If pidfile and api.sock are both pre-removed before the VMM exits
(observed on GKE prod after vk-cocoon rapid restart + CH InvalidStateTransition),
DeleteAll's pidfile-based stop returns ErrNotRunning, the probe returns
ENOENT, and the VMM survives as a PPID=1 orphan with no rundir.

Add utils.FindVMMByCmdline as a /proc scan fallback keyed on the api-socket
path (already unique per VM). Wire it into:
- WithRunningVM: recover the live pid when pidfile/socket are gone
- DeleteAll: second-pass after socket probe to catch sibling/worker pids

Repro: sleep + rm pidfile + rm api.sock + cocoon vm rm --force leaves a CH
orphan. With the fix, the cmdline scan recovers and SIGKILLs it.

* chore: senior-review fixes — public-above-private, expectArg naming

- Reorder utils/process_*.go so FindVMMByCmdline sits above verifyProcessCmdline (matches sparse_linux.go / reflink_linux.go).
- Rename the FindVMMByCmdline marker param to expectArg for consistency with VerifyProcessCmdline / TerminateProcess / pidfd_linux.go.

* fix: address Copilot round-1 findings on orphan VMM PR

- utils/process_linux.go: slices.Sort the returned pids so callers get a deterministic smallest-pid choice (Copilot caught the /proc lexicographic ordering trap, e.g. "100" < "11").
- hypervisor/state.go: fail-closed when /proc scan errors after pidfile-based check fails; previously returned ErrNotRunning on inconclusive state, which could let start/delete proceed against a still-running VM.
- hypervisor/stop.go: fail-closed in DeleteAll second-pass when /proc scan errors; previously dropped scanErr and risked re-introducing the orphan leak the PR is trying to fix.
- utils/process_test.go: replace flaky "sleep marker 60" (sleep rejects non-numeric arg and exits immediately) with "sh -c 'sleep 60 && :' marker" (compound prevents sh tail-exec into sleep). Gate on runtime.GOOS == "linux".

* chore: align fail-closed error strings with sibling refuse-delete wording

state.go + stop.go: add "(resolve the host issue and retry)" actionable-hint clause so the new scan-error wraps match the existing socket-probe error format.

* fix: surface non-ENOENT cmdline read errors so FindVMMByCmdline fails closed

Copilot round-3 finding: verifyProcessCmdline returned (false, available=false) on permission/IO errors reading /proc/<pid>/cmdline (e.g. hidepid/EPERM), and FindVMMByCmdline silently dropped that signal — a hidepid environment could mask the real VMM and reintroduce the orphan leak.

Refactor verifyProcessCmdline to return (bool, error); FindVMMByCmdline now distinguishes ENOENT (transient race, safe to skip) from any other read error (fail-closed, return wrapped first error). VerifyProcessCmdline wrapper preserves the "fall back to IsProcessAlive on error" semantic.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

cocoon vm status defaults to indefinite polling; orphans easily on shell disconnect

3 participants