fix(vmm): scoped ShimReaper for zombie shim PIDs (Issue #523)#613
Open
G4614 wants to merge 5 commits into
Open
fix(vmm): scoped ShimReaper for zombie shim PIDs (Issue #523)#613G4614 wants to merge 5 commits into
G4614 wants to merge 5 commits into
Conversation
Adds `ShimReaper`, a per-runtime worker that periodically calls `waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design that matches Issue boxlite-ai#523's acceptance criteria. Problem ======= `boxlite-shim` children that get abandoned mid-init (panic past `CleanupGuard`, retry loop spawning a fresh shim, reattach paths where the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies piled up over time, and the contributing cause sat on top of the CleanupGuard fix (which itself is independent and stays). PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these. That call races every other `Child::wait()` site in the workspace: if the reaper wins, the owner's `wait()` returns ECHILD and the exit code is lost. The reaper was reverted on the same PR pending the scoped design this commit implements. Design ====== `src/boxlite/src/util/reaper.rs` introduces: - `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>` registry every 250 ms and reaps any registered PID whose `waitpid` reports exit or ECHILD. Std thread (not tokio task) so it works regardless of whether `RuntimeImpl::new` was called inside a tokio runtime context. - `ReaperHandle` — RAII registration handle. Drop removes the PID from the registry; a panic in any caller path that owns one can't leak. - `shutdown()` — Condvar-driven, completes in well under one tick so `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't block the runtime exit on a stuck reaper. Why scoped, not daemon-wide --------------------------- Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site. This reaper only touches PIDs that were explicitly registered. Today's only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();` sites in `shim.rs:112` / `shim.rs:121` discard their results, so ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` / `Child::try_wait()` call site in `src/boxlite/`: | Site | Process | Race-safe? | |-------------------------------------|------------|------------------| | shim.rs:102 (stop's try_wait loop) | shim | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded | | shim.rs:112, 121 (`let _ = wait()`) | shim | Yes — result discarded | | watchdog.rs:256 | test only | N/A | | net/libslirp.rs:135 | gvproxy | Not registered, reaper doesn't touch | | box_impl.rs:1169 (ChildGuard) | test only | N/A | | ProcessMonitor::try_wait | any | Already maps ECHILD to `Unknown` | Why polling, not SIGCHLD ------------------------ SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is process-global and adds a race surface against the runtime's other signal handlers. A 250 ms HashSet scan is cheaper than that complexity buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in < 2 s by polling, never sleeping. Wiring ====== - `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and `shutdown_sync()` (atexit / Drop path). - `ShimController::new` takes `Arc<ShimReaper>`; threaded through from `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`. - `ShimHandler::from_spawned` registers the PID and stores the `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT register — `waitpid` on a non-child returns ECHILD anyway; the reattached shim's eventual zombie belongs to init, not us. Tests ===== Acceptance criteria from Issue boxlite-ai#523, restated and pinned: - [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see `probe_pid` in reaper.rs. - [x] Audit of `Child::wait()` call sites — recorded in the table above. - [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and sync paths call `shim_reaper.shutdown()`. Drop on the reaper is idempotent. - [x] Verification test runs in < 2 s (poll-based) — three unit tests in `reaper.rs::tests` and two integration tests in `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`, `registry_accepts_re_registration_after_drop`. Wallclock for each is sub-second on the dev host. - [x] Integration test demonstrates "N init failures leave 0 zombies" — `n_init_failures_leave_zero_zombies` spawns N=8 (matching the `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned- mid-init path, and asserts both registry drain and `/proc/<pid>/status` absence-or-not-Z. Test counts: - Unit (reaper.rs): 3 tests, all pass in ~0.3 s total - Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total - Regression: 754/754 boxlite lib tests pass with --features rest (excluding the pre-existing `ws_watchdog` flake unrelated to this change); 118/118 boxlite-cli unit tests pass. Out of scope ============ - Reattach orphans: a reattached `ShimHandler` doesn't register with the reaper. If the original spawning process is gone, init will reap the eventual zombie. Tracking via reconciler (separate epic) covers the DB/runtime drift case. - PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix for a different leak mechanism (`is_same_process` recovery misidentification). Both are needed for the full dind test reliability story; landing one doesn't subsume the other. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior commit (4c1bf2fd) introduced a `ReaperHandle` RAII type whose
Drop auto-unregistered a PID from the reaper. That design missed the
load-bearing production zombie source it was supposed to catch.
The bug
=======
`ShimHandler` field-order-drops in this sequence:
1. process: Option<Child> — Child::Drop is a no-op
2. keepalive: Option<Keepalive> — closes the watchdog pipe →
shim sees POLLHUP and BEGINS
graceful exit (takes ~100 ms+)
3. reaper_handle: Option<...> — unregisters PID immediately
Between steps 2 and 3 the shim is still alive; step 3 fires while the
shim is mid-shutdown. ~100 ms later the shim actually exits — and now
nobody is watching. The reaper had been told "stop watching" exactly
microseconds before the very event it existed to catch.
In production this is hit by every `rt_impl::remove_box` of an active
box (which SIGKILLs the shim by PID and never calls `handler.stop()`,
so `Child::wait()` is never invoked) and by every runtime drop that
skips `shutdown()`. These are exactly the paths Issue boxlite-ai#523's
"7+ accumulated zombies on dev" came from.
The unit tests in 4c1bf2fd accidentally papered over this by holding
`_handle` explicitly across the assertion — production code can't do
that because there's no orderly point at which to release.
The fix
=======
Registration is now a one-way door:
- `ShimReaper::register(pid)` returns `()`. No handle.
- `ShimHandler` no longer holds a `reaper_handle` field. Dropping a
`ShimHandler` does not affect reaper state.
- A PID stays in the registry until the reaper's `waitpid(WNOHANG)`
sweep observes `Reaped` (we just collected it) or `Vanished`
(someone else collected it, or the PID isn't our child). Both
outcomes already drop the PID — no new code path needed in the
sweep logic.
Worst-case registry membership for a successfully-stopped shim is one
REAPER_TICK (250 ms) — the sweep cost is a HashSet scan, so this is
trivial.
Tests
=====
- `reaps_exited_pid_within_one_second` — unchanged in shape, but now
exercises the only registration path (no handle to retain).
- `dropping_handle_unregisters_pid` is replaced by
`live_pid_stays_registered_until_actual_exit`. This pins the new
invariant directly: register a long-running PID, give the worker
multiple ticks to sweep, assert the PID is still registered (the
old design would have purged it on owner Drop); then kill+wait
the child and assert the next sweep drains via ECHILD.
- Integration test renamed from
`registry_accepts_re_registration_after_drop` to
`register_is_idempotent_and_survives_prior_cleanup`, asserting
HashSet-insert idempotence (a double register results in a single
entry) and post-cleanup re-registration semantics. The N=8
zombie-accumulation test is otherwise unchanged.
Regression sweep: 754/754 boxlite lib (with --features rest, excluding
the pre-existing ws_watchdog flake), 3/3 reaper unit, 2/2 reaper
integration.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the resource-hazard-detection gap from the prior commits: the
reaper now exposes counters, recovers from worker death, and surfaces
its health through `RuntimeMetrics` and the REST `/v1/metrics`
endpoint so operators can spot "the reaper isn't running" before
zombies pile up.
What's added
============
1. **Reaper internal stats + `stats()` API.** `Inner` gains atomic
counters for `reaped_total`, `vanished_total`, `sweeps_completed`,
`last_sweep_at_ms`. `ShimReaper::stats()` returns a `ReaperStats`
snapshot including the registry size and worker liveness. Dashboards
alert on `sweeps_completed` going flat while `registered > 0` —
that's the load-bearing "reaper died, zombies accumulating"
signature.
2. **Self-healing.** `ShimReaper::worker_alive()` checks
`JoinHandle::is_finished()`. If false and shutdown wasn't called,
`register()` transparently respawns the worker and emits a
`tracing::error!` — the loud alert that something pathological is
going on inside the reaper. New unit test
`worker_respawns_after_death` uses a test-only
`test_panic_next_iteration` flag to deterministically trigger
worker death (mutex poisoning can't be used anymore because
`lock_or_recover` now absorbs poison silently — see below).
3. **Mutex poison recovery.** All `registry.lock().unwrap()` sites are
replaced with `lock_or_recover` which calls
`PoisonError::into_inner` to extract the data anyway. For our
`HashSet<u32>` registry, insert/remove are atomic at the std-lib
level so torn state is impossible. Without this, a single panic
anywhere holding the registry mutex would cascade-poison every
subsequent reaper call and defeat the self-healing path.
4. **High-pressure unit test.**
`registry_under_high_register_pressure` burst-registers N=1024
PIDs (all in a high u32 range that returns ECHILD on waitpid).
Asserts the sweep drains the whole batch within 3 s and that
`vanished_total` reflects the full batch — verifies the worker
handles register bursts without deadlocking on the registry mutex.
5. **`MetricsSink` cross-module bridge.** The reaper's `spawn_with_sink`
accepts an optional `MetricsSink { reaped_total, vanished_total,
registered_now }` of `Arc<AtomicU64>`s. `RuntimeImpl::new`
constructs one of these by cloning the relevant fields out of its
`RuntimeMetricsStorage`, so reaper activity is mirrored into the
metrics storage without making `util::reaper` import the metrics
module. Keeps `util` a leaf module.
6. **`RuntimeMetricsStorage` + `RuntimeMetrics` getters.** Three new
fields (`shim_reaped`, `shim_reaper_vanished`,
`shim_reaper_registered`) and the corresponding accessor methods
on `RuntimeMetrics`. The first two are monotonic counters; the
third is a gauge.
7. **REST `/v1/metrics` exposure.** `RuntimeMetricsResponse` (server
and client) gains the same three fields, marked
`#[serde(default)]` on the client mirror so older servers that
pre-date the reaper still deserialize cleanly. Two new wire-shape
tests:
- `test_runtime_metrics_back_compat_default_zero_for_missing_shim_fields`
- `test_runtime_metrics_shim_fields_round_trip`
Tests
=====
New unit tests in `util::reaper::tests`:
- `registry_under_high_register_pressure` — N=1024 burst drains in <3s
- `worker_respawns_after_death` — kill worker, register, observe
respawn + continued sweep activity
- `stats_track_sweep_progress` — aggregate Stats API snapshot
Updated existing tests to assert against the new counters
(`reaped_total >= 1` after a reap, `vanished_total >= 1` after a
vanish). Total 6 reaper unit tests, all pass.
REST mirror gets 2 new wire-shape tests (3 total
`test_runtime_metrics_*` tests in `rest::types::tests`).
Regression sweep: 759/759 boxlite lib (with --features rest,
excluding pre-existing ws_watchdog flake), 118/118 boxlite-cli,
2/2 zombie_reaper integration. Zero regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ives Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered must be left for its owner to wait(). Two-side verified — injecting a global waitpid(-1) into the sweep makes the owner's wait() return ECHILD ("No child processes") and the test fail; the scoped sweep preserves exit code 42. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DorianZheng
pushed a commit
that referenced
this pull request
May 28, 2026
…eak a running shim (#615) health_check_transitions_to_healthy_after_startup ends with the box still Running. RuntimeImpl::Drop's safety-net shutdown_sync only sends SIGTERM and returns without waiting, so the shim is still mid-shutdown when PerTestBoxHome::Drop (#604) runs its leak assertion — `1 live shim(s)`. Adding t.bx.stop() (which stops and waits) clears it. The reaper (#613) does not help here: it reaps dead shims, not a still-running one. Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-149.ap-southeast-2.compute.internal> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Add a scoped
ShimReaperso shims killed without aChild::wait()get reaped instead of leaking zombies (Issue #523).Test plan
Run with
--features krun,gvproxy, nextest--profile vm.New tests (
src/boxlite/tests/zombie_reaper.rs)n_init_failures_leave_zero_zombiesState: Zleft in/procregister_is_idempotent_and_survives_prior_cleanupunregistered_child_exit_code_is_not_stolenwait()Two-side verification
unregistered_child_exit_code_is_not_stolenwaitpid(-1)→ FAIL: owner'swait()returnsECHILD, exit code 42 stolenhealth_check_becomes_unhealthy_when_shim_killed(original test, no test-side cleanup)PerTestBoxHome dropped with 1 live shim(s): [<pid>](SIGKILL'd shim is the zombie)Regression (branch rebased onto current main, incl. #590 where recovery now sees the killed shim as reaped, not zombie)
lifecycle(incl.remove_active_with_force_stops_and_removes— theremove_boxSIGKILL-by-PID reap path, via the #604PerTestBoxHomeleak check)recovery::recovery_with_dead_processpid_filedetach