fix(vmm): scoped ShimReaper for zombie shim PIDs (Issue #523) by G4614 · Pull Request #613 · boxlite-ai/boxlite

G4614 · 2026-05-28T08:15:29Z

Add a scoped ShimReaper so shims killed without a Child::wait() get reaped instead of leaking zombies (Issue #523).

Test plan

Run with --features krun,gvproxy, nextest --profile vm.

New tests (src/boxlite/tests/zombie_reaper.rs)

test	asserts
`n_init_failures_leave_zero_zombies`	N register-then-forget PIDs are drained; no `State: Z` left in `/proc`
`register_is_idempotent_and_survives_prior_cleanup`	double-register is a no-op; re-register after a sweep is clean
`unregistered_child_exit_code_is_not_stolen`	a child the reaper never registered is left for its owner to `wait()`

Two-side verification

test	pre-fix (caught)	post-fix
`unregistered_child_exit_code_is_not_stolen`	inject global `waitpid(-1)` → FAIL: owner's `wait()` returns `ECHILD`, exit code 42 stolen	scoped sweep → PASS: owner reads exit code 42
`health_check_becomes_unhealthy_when_shim_killed` (original test, no test-side cleanup)	no reaper (main) → FAIL: `PerTestBoxHome dropped with 1 live shim(s): [<pid>]` (SIGKILL'd shim is the zombie)	reaper → PASS: leak check finds 0 live shims

Regression (branch rebased onto current main, incl. #590 where recovery now sees the killed shim as reaped, not zombie)

suite	result
`lifecycle` (incl. `remove_active_with_force_stops_and_removes` — the `remove_box` SIGKILL-by-PID reap path, via the #604 `PerTestBoxHome` leak check)	24 pass
`recovery::recovery_with_dead_process`	1 pass
`pid_file`	10 pass
`detach`	7 pass

Adds `ShimReaper`, a per-runtime worker that periodically calls `waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design that matches Issue boxlite-ai#523's acceptance criteria. Problem ======= `boxlite-shim` children that get abandoned mid-init (panic past `CleanupGuard`, retry loop spawning a fresh shim, reattach paths where the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies piled up over time, and the contributing cause sat on top of the CleanupGuard fix (which itself is independent and stays). PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these. That call races every other `Child::wait()` site in the workspace: if the reaper wins, the owner's `wait()` returns ECHILD and the exit code is lost. The reaper was reverted on the same PR pending the scoped design this commit implements. Design ====== `src/boxlite/src/util/reaper.rs` introduces: - `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>` registry every 250 ms and reaps any registered PID whose `waitpid` reports exit or ECHILD. Std thread (not tokio task) so it works regardless of whether `RuntimeImpl::new` was called inside a tokio runtime context. - `ReaperHandle` — RAII registration handle. Drop removes the PID from the registry; a panic in any caller path that owns one can't leak. - `shutdown()` — Condvar-driven, completes in well under one tick so `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't block the runtime exit on a stuck reaper. Why scoped, not daemon-wide --------------------------- Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site. This reaper only touches PIDs that were explicitly registered. Today's only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();` sites in `shim.rs:112` / `shim.rs:121` discard their results, so ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` / `Child::try_wait()` call site in `src/boxlite/`: | Site | Process | Race-safe? | |-------------------------------------|------------|------------------| | shim.rs:102 (stop's try_wait loop) | shim | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded | | shim.rs:112, 121 (`let _ = wait()`) | shim | Yes — result discarded | | watchdog.rs:256 | test only | N/A | | net/libslirp.rs:135 | gvproxy | Not registered, reaper doesn't touch | | box_impl.rs:1169 (ChildGuard) | test only | N/A | | ProcessMonitor::try_wait | any | Already maps ECHILD to `Unknown` | Why polling, not SIGCHLD ------------------------ SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is process-global and adds a race surface against the runtime's other signal handlers. A 250 ms HashSet scan is cheaper than that complexity buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in < 2 s by polling, never sleeping. Wiring ====== - `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and `shutdown_sync()` (atexit / Drop path). - `ShimController::new` takes `Arc<ShimReaper>`; threaded through from `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`. - `ShimHandler::from_spawned` registers the PID and stores the `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT register — `waitpid` on a non-child returns ECHILD anyway; the reattached shim's eventual zombie belongs to init, not us. Tests ===== Acceptance criteria from Issue boxlite-ai#523, restated and pinned: - [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see `probe_pid` in reaper.rs. - [x] Audit of `Child::wait()` call sites — recorded in the table above. - [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and sync paths call `shim_reaper.shutdown()`. Drop on the reaper is idempotent. - [x] Verification test runs in < 2 s (poll-based) — three unit tests in `reaper.rs::tests` and two integration tests in `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`, `registry_accepts_re_registration_after_drop`. Wallclock for each is sub-second on the dev host. - [x] Integration test demonstrates "N init failures leave 0 zombies" — `n_init_failures_leave_zero_zombies` spawns N=8 (matching the `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned- mid-init path, and asserts both registry drain and `/proc/<pid>/status` absence-or-not-Z. Test counts: - Unit (reaper.rs): 3 tests, all pass in ~0.3 s total - Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total - Regression: 754/754 boxlite lib tests pass with --features rest (excluding the pre-existing `ws_watchdog` flake unrelated to this change); 118/118 boxlite-cli unit tests pass. Out of scope ============ - Reattach orphans: a reattached `ShimHandler` doesn't register with the reaper. If the original spawning process is gone, init will reap the eventual zombie. Tracking via reconciler (separate epic) covers the DB/runtime drift case. - PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix for a different leak mechanism (`is_same_process` recovery misidentification). Both are needed for the full dind test reliability story; landing one doesn't subsume the other. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The prior commit (4c1bf2fd) introduced a `ReaperHandle` RAII type whose Drop auto-unregistered a PID from the reaper. That design missed the load-bearing production zombie source it was supposed to catch. The bug ======= `ShimHandler` field-order-drops in this sequence: 1. process: Option<Child> — Child::Drop is a no-op 2. keepalive: Option<Keepalive> — closes the watchdog pipe → shim sees POLLHUP and BEGINS graceful exit (takes ~100 ms+) 3. reaper_handle: Option<...> — unregisters PID immediately Between steps 2 and 3 the shim is still alive; step 3 fires while the shim is mid-shutdown. ~100 ms later the shim actually exits — and now nobody is watching. The reaper had been told "stop watching" exactly microseconds before the very event it existed to catch. In production this is hit by every `rt_impl::remove_box` of an active box (which SIGKILLs the shim by PID and never calls `handler.stop()`, so `Child::wait()` is never invoked) and by every runtime drop that skips `shutdown()`. These are exactly the paths Issue boxlite-ai#523's "7+ accumulated zombies on dev" came from. The unit tests in 4c1bf2fd accidentally papered over this by holding `_handle` explicitly across the assertion — production code can't do that because there's no orderly point at which to release. The fix ======= Registration is now a one-way door: - `ShimReaper::register(pid)` returns `()`. No handle. - `ShimHandler` no longer holds a `reaper_handle` field. Dropping a `ShimHandler` does not affect reaper state. - A PID stays in the registry until the reaper's `waitpid(WNOHANG)` sweep observes `Reaped` (we just collected it) or `Vanished` (someone else collected it, or the PID isn't our child). Both outcomes already drop the PID — no new code path needed in the sweep logic. Worst-case registry membership for a successfully-stopped shim is one REAPER_TICK (250 ms) — the sweep cost is a HashSet scan, so this is trivial. Tests ===== - `reaps_exited_pid_within_one_second` — unchanged in shape, but now exercises the only registration path (no handle to retain). - `dropping_handle_unregisters_pid` is replaced by `live_pid_stays_registered_until_actual_exit`. This pins the new invariant directly: register a long-running PID, give the worker multiple ticks to sweep, assert the PID is still registered (the old design would have purged it on owner Drop); then kill+wait the child and assert the next sweep drains via ECHILD. - Integration test renamed from `registry_accepts_re_registration_after_drop` to `register_is_idempotent_and_survives_prior_cleanup`, asserting HashSet-insert idempotence (a double register results in a single entry) and post-cleanup re-registration semantics. The N=8 zombie-accumulation test is otherwise unchanged. Regression sweep: 754/754 boxlite lib (with --features rest, excluding the pre-existing ws_watchdog flake), 3/3 reaper unit, 2/2 reaper integration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes the resource-hazard-detection gap from the prior commits: the reaper now exposes counters, recovers from worker death, and surfaces its health through `RuntimeMetrics` and the REST `/v1/metrics` endpoint so operators can spot "the reaper isn't running" before zombies pile up. What's added ============ 1. **Reaper internal stats + `stats()` API.** `Inner` gains atomic counters for `reaped_total`, `vanished_total`, `sweeps_completed`, `last_sweep_at_ms`. `ShimReaper::stats()` returns a `ReaperStats` snapshot including the registry size and worker liveness. Dashboards alert on `sweeps_completed` going flat while `registered > 0` — that's the load-bearing "reaper died, zombies accumulating" signature. 2. **Self-healing.** `ShimReaper::worker_alive()` checks `JoinHandle::is_finished()`. If false and shutdown wasn't called, `register()` transparently respawns the worker and emits a `tracing::error!` — the loud alert that something pathological is going on inside the reaper. New unit test `worker_respawns_after_death` uses a test-only `test_panic_next_iteration` flag to deterministically trigger worker death (mutex poisoning can't be used anymore because `lock_or_recover` now absorbs poison silently — see below). 3. **Mutex poison recovery.** All `registry.lock().unwrap()` sites are replaced with `lock_or_recover` which calls `PoisonError::into_inner` to extract the data anyway. For our `HashSet<u32>` registry, insert/remove are atomic at the std-lib level so torn state is impossible. Without this, a single panic anywhere holding the registry mutex would cascade-poison every subsequent reaper call and defeat the self-healing path. 4. **High-pressure unit test.** `registry_under_high_register_pressure` burst-registers N=1024 PIDs (all in a high u32 range that returns ECHILD on waitpid). Asserts the sweep drains the whole batch within 3 s and that `vanished_total` reflects the full batch — verifies the worker handles register bursts without deadlocking on the registry mutex. 5. **`MetricsSink` cross-module bridge.** The reaper's `spawn_with_sink` accepts an optional `MetricsSink { reaped_total, vanished_total, registered_now }` of `Arc<AtomicU64>`s. `RuntimeImpl::new` constructs one of these by cloning the relevant fields out of its `RuntimeMetricsStorage`, so reaper activity is mirrored into the metrics storage without making `util::reaper` import the metrics module. Keeps `util` a leaf module. 6. **`RuntimeMetricsStorage` + `RuntimeMetrics` getters.** Three new fields (`shim_reaped`, `shim_reaper_vanished`, `shim_reaper_registered`) and the corresponding accessor methods on `RuntimeMetrics`. The first two are monotonic counters; the third is a gauge. 7. **REST `/v1/metrics` exposure.** `RuntimeMetricsResponse` (server and client) gains the same three fields, marked `#[serde(default)]` on the client mirror so older servers that pre-date the reaper still deserialize cleanly. Two new wire-shape tests: - `test_runtime_metrics_back_compat_default_zero_for_missing_shim_fields` - `test_runtime_metrics_shim_fields_round_trip` Tests ===== New unit tests in `util::reaper::tests`: - `registry_under_high_register_pressure` — N=1024 burst drains in <3s - `worker_respawns_after_death` — kill worker, register, observe respawn + continued sweep activity - `stats_track_sweep_progress` — aggregate Stats API snapshot Updated existing tests to assert against the new counters (`reaped_total >= 1` after a reap, `vanished_total >= 1` after a vanish). Total 6 reaper unit tests, all pass. REST mirror gets 2 new wire-shape tests (3 total `test_runtime_metrics_*` tests in `rest::types::tests`). Regression sweep: 759/759 boxlite lib (with --features rest, excluding pre-existing ws_watchdog flake), 118/118 boxlite-cli, 2/2 zombie_reaper integration. Zero regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ives Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered must be left for its owner to wait(). Two-side verified — injecting a global waitpid(-1) into the sweep makes the owner's wait() return ECHILD ("No child processes") and the test fail; the scoped sweep preserves exit code 42. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eak a running shim (#615) health_check_transitions_to_healthy_after_startup ends with the box still Running. RuntimeImpl::Drop's safety-net shutdown_sync only sends SIGTERM and returns without waiting, so the shim is still mid-shutdown when PerTestBoxHome::Drop (#604) runs its leak assertion — `1 live shim(s)`. Adding t.bx.stop() (which stops and waits) clears it. The reaper (#613) does not help here: it reaps dead shims, not a still-running one. Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-149.ap-southeast-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Ubuntu and others added 4 commits May 28, 2026 07:38

style(vmm): rustfmt ShimReaper construction and worker loop

9859dd3

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

G4614 marked this pull request as draft May 28, 2026 08:28

G4614 marked this pull request as ready for review May 28, 2026 09:33

G4614 mentioned this pull request May 28, 2026

test(health_check): explicit stop() for the healthy-box test #615

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(vmm): scoped ShimReaper for zombie shim PIDs (Issue #523)#613

fix(vmm): scoped ShimReaper for zombie shim PIDs (Issue #523)#613
G4614 wants to merge 5 commits into
boxlite-ai:mainfrom
G4614:fix/scoped-zombie-shim-reaper

G4614 commented May 28, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

G4614 commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

G4614 commented May 28, 2026 •

edited

Loading