Problem
When cocoon runs under sudo (the common deployment), and the caller of sudo
SIGKILLs sudo to cancel the operation (exec.CommandContext does this on
ctx cancellation), sudo dies but the cocoon grandchild does not get a
signal — the kernel reparents it to PID 1 and it keeps running until its
in-progress work finishes naturally.
Concrete observation in vk-cocoon's runPostCloneSetup:
- vk-cocoon spawns sudo cocoon vm exec <vmid> -- powershell ... via
exec.CommandContext(ctx, "sudo", ...).
- cocoon dials cocoon-agent over hybrid-vsock and waits for the agent to
complete the PowerShell PnP-rebind (sometimes 60s+ on Windows clones).
- vk-cocoon's loopCtx (180s budget) fires; cmd.Cancel() → SIGKILL to sudo.
- sudo dies. cocoon survives, holds vsock UDS open, keeps streaming bytes
to/from the agent.
- vk-cocoon's cmd.Wait() would block on the orphan's stdout pipe; we work
around that with a select { case <-done: case <-loopCtx.Done(): }, but the
orphan cocoon process (and its FDs) leaks until the agent finally answers.
With many stuck clones, orphan processes accumulate.
cocoon vm exec is in-process today (no subprocess from cocoon's side), so
the leak is bounded to one orphan per stuck call. Still ugly under load.
Why this isn't fully a caller-side fix
The caller (vk-cocoon, or any sudo-wrapped invocation) can use
SysProcAttr{Setpgid: true} and kill the whole pgid — that does work and we
plan to do it on vk-cocoon's side regardless. But:
- It's a per-caller mitigation; every cocoon CLI consumer has to remember to
do it.
- SIGKILL gives cocoon no chance to flush state (close vsock connection
cleanly, release agent-side resources, write final logs).
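- Setting Pdeathsig on the immediate child covers at most sudo, not cocoon:
the setting applies to a single process and is not passed on to that
process's children. The kernel also clears it when a set-user-ID binary
such as sudo is executed.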
Proposed fix in cocoon
Add prctl(PR_SET_PDEATHSIG, SIGTERM) in main() (Linux only, build-tagged).
When cocoon's parent process dies — sudo crashing, killed by ctx, or a
caller force-quitting — the kernel signals cocoon directly. cocoon already has
signal.NotifyContext(ctx, SIGINT, SIGTERM) at cmd/root.go:86, so the
existing ctx-cancellation paths (including f1f641a's vsock CONNECT honor-ctx
fix and any future cancellable IO) take over and shut down gracefully.
Sketch:

    // main_linux.go
    //go:build linux

    package main

    import "golang.org/x/sys/unix"

    func init() {
        // Ask the kernel to send SIGTERM if our parent dies. The setting is
        // cleared in children created by fork and kept across exec of
        // non-setuid binaries. The kernel treats the thread that created
        // this process as the "parent". init runs on the main goroutine,
        // which is locked to the main OS thread during initialization.
        _ = unix.Prctl(unix.PR_SET_PDEATHSIG, uintptr(unix.SIGTERM), 0, 0, 0)
    }

    // main_other.go
    //go:build !linux

    package main

    func init() {} // no-op
Setting PR_SET_PDEATHSIG in cocoon:
- Makes cocoon robust under any sudo / supervisor / docker-style parent
without each caller having to engineer pgid handling.
- Plays well with the SIGINT/SIGTERM handler already at cmd/root.go:86:
ctx is canceled, in-flight vm exec / vm clone / snapshot save paths
unwind through their existing ctx-aware code, vsock connections close
cleanly, and the agent sees EOF and reaps its child.
Out of scope for this issue
Long-running internal subprocesses (cloud-hypervisor, firecracker) already
use Setpgid: true and survive cocoon's death intentionally — that's a
separate design decision and not affected by adding PR_SET_PDEATHSIG to
cocoon's own main.
Test plan
- Add a test that runs cocoon vm exec against a stub agent under an
intermediate parent process, kills that parent from the test harness
(cocoon's grandparent), and confirms the cocoon process exits with a
non-zero status within ~1s.
- Manual: run sudo cocoon vm exec <stuck-vm> -- some-hung-cmd, kill -9 the
sudo, and verify cocoon exits (currently: stays alive until the cmd finishes).
Related
- Fix from caller side (vk-cocoon feat/post-clone-auto-exec branch):
Setpgid + pgid-kill. Will land separately; complements this fix.
- cocoonv2 f1f641a made vsock CONNECT honor ctx. Same theme of graceful
cancellation; this issue extends that to caller-driven exits.