Skip to content

fix(vmm): scoped ShimReaper for zombie shim PIDs (Issue #523)#613

Open
G4614 wants to merge 5 commits into
boxlite-ai:mainfrom
G4614:fix/scoped-zombie-shim-reaper
Open

fix(vmm): scoped ShimReaper for zombie shim PIDs (Issue #523)#613
G4614 wants to merge 5 commits into
boxlite-ai:mainfrom
G4614:fix/scoped-zombie-shim-reaper

Conversation

@G4614
Copy link
Copy Markdown
Contributor

@G4614 G4614 commented May 28, 2026

Add a scoped ShimReaper so shims killed without a Child::wait() get reaped instead of leaking zombies (Issue #523).

Test plan

Run with --features krun,gvproxy, nextest --profile vm.

New tests (src/boxlite/tests/zombie_reaper.rs)

test asserts
n_init_failures_leave_zero_zombies N register-then-forget PIDs are drained; no State: Z left in /proc
register_is_idempotent_and_survives_prior_cleanup double-register is a no-op; re-register after a sweep is clean
unregistered_child_exit_code_is_not_stolen a child the reaper never registered is left for its owner to wait()

Two-side verification

test pre-fix (caught) post-fix
unregistered_child_exit_code_is_not_stolen inject global waitpid(-1)FAIL: owner's wait() returns ECHILD, exit code 42 stolen scoped sweep → PASS: owner reads exit code 42
health_check_becomes_unhealthy_when_shim_killed (original test, no test-side cleanup) no reaper (main) → FAIL: PerTestBoxHome dropped with 1 live shim(s): [<pid>] (SIGKILL'd shim is the zombie) reaper → PASS: leak check finds 0 live shims

Regression (branch rebased onto current main, incl. #590 where recovery now sees the killed shim as reaped, not zombie)

suite result
lifecycle (incl. remove_active_with_force_stops_and_removes — the remove_box SIGKILL-by-PID reap path, via the #604 PerTestBoxHome leak check) 24 pass
recovery::recovery_with_dead_process 1 pass
pid_file 10 pass
detach 7 pass

Ubuntu and others added 4 commits May 28, 2026 07:38
Adds `ShimReaper`, a per-runtime worker that periodically calls
`waitpid(pid, WNOHANG)` on a small registered PID set. Replaces the
reverted daemon-wide reaper attempt from PR boxlite-ai#520 with a scoped design
that matches Issue boxlite-ai#523's acceptance criteria.

Problem
=======

`boxlite-shim` children that get abandoned mid-init (panic past
`CleanupGuard`, retry loop spawning a fresh shim, reattach paths where
the Rust-side `Child` was never wait()'d) accumulate as `<defunct>` PIDs
under the daemon. The `CL84LvGx7RBE` incident on dev showed 7+ zombies
piled up over time, and the contributing cause sat on top of the
CleanupGuard fix (which itself is independent and stays).

PR boxlite-ai#520 originally landed a `waitpid(-1, WNOHANG)` reaper to drain these.
That call races every other `Child::wait()` site in the workspace: if
the reaper wins, the owner's `wait()` returns ECHILD and the exit code
is lost. The reaper was reverted on the same PR pending the scoped
design this commit implements.

Design
======

`src/boxlite/src/util/reaper.rs` introduces:

- `ShimReaper` — owns a `std::thread` worker that scans an `Arc<Mutex<HashSet<u32>>>`
  registry every 250 ms and reaps any registered PID whose `waitpid`
  reports exit or ECHILD. Std thread (not tokio task) so it works
  regardless of whether `RuntimeImpl::new` was called inside a tokio
  runtime context.
- `ReaperHandle` — RAII registration handle. Drop removes the PID from
  the registry; a panic in any caller path that owns one can't leak.
- `shutdown()` — Condvar-driven, completes in well under one tick so
  `RuntimeImpl::shutdown` (sync and async paths both call it) doesn't
  block the runtime exit on a stuck reaper.

Why scoped, not daemon-wide
---------------------------

Per Issue boxlite-ai#523, an unscoped reaper races every `Child::wait()` call site.
This reaper only touches PIDs that were explicitly registered. Today's
only registrar is `ShimHandler::from_spawned`. The `let _ = process.wait();`
sites in `shim.rs:112` / `shim.rs:121` discard their results, so
ECHILD-from-reaper-win is safe. Audit of every `Child::wait()` /
`Child::try_wait()` call site in `src/boxlite/`:

| Site                                | Process    | Race-safe?       |
|-------------------------------------|------------|------------------|
| shim.rs:102 (stop's try_wait loop)  | shim       | Yes — `Err(_)` arm falls through to kill+wait, both ECHILD-discarded |
| shim.rs:112, 121 (`let _ = wait()`) | shim       | Yes — result discarded |
| watchdog.rs:256                     | test only  | N/A              |
| net/libslirp.rs:135                 | gvproxy    | Not registered, reaper doesn't touch |
| box_impl.rs:1169 (ChildGuard)       | test only  | N/A              |
| ProcessMonitor::try_wait            | any        | Already maps ECHILD to `Unknown` |

Why polling, not SIGCHLD
------------------------

SIGCHLD plumbing in async Rust (signal-hook / tokio::signal::unix) is
process-global and adds a race surface against the runtime's other
signal handlers. A 250 ms HashSet scan is cheaper than that complexity
buys back. Worst-case zombie lifetime is 250 ms; tests verify drain in
< 2 s by polling, never sleeping.

Wiring
======

- `RuntimeImpl` gains `shim_reaper: Arc<ShimReaper>`, spawned in
  `RuntimeImpl::new`, shut down in both `shutdown()` (async path) and
  `shutdown_sync()` (atexit / Drop path).
- `ShimController::new` takes `Arc<ShimReaper>`; threaded through from
  `vmm_spawn::spawn_vm` via `ctx.runtime.shim_reaper`.
- `ShimHandler::from_spawned` registers the PID and stores the
  `ReaperHandle`. `ShimHandler::from_pid` (reattach mode) does NOT
  register — `waitpid` on a non-child returns ECHILD anyway; the
  reattached shim's eventual zombie belongs to init, not us.

Tests
=====

Acceptance criteria from Issue boxlite-ai#523, restated and pinned:

- [x] Scopes to known shim PIDs, never calls `waitpid(-1, ...)` — see
      `probe_pid` in reaper.rs.
- [x] Audit of `Child::wait()` call sites — recorded in the table above.
- [x] Shutdown path tied to `RuntimeImpl::shutdown` — both async and
      sync paths call `shim_reaper.shutdown()`. Drop on the reaper is
      idempotent.
- [x] Verification test runs in < 2 s (poll-based) — three unit tests
      in `reaper.rs::tests` and two integration tests in
      `src/boxlite/tests/zombie_reaper.rs::n_init_failures_leave_zero_zombies`,
      `registry_accepts_re_registration_after_drop`. Wallclock for each
      is sub-second on the dev host.
- [x] Integration test demonstrates "N init failures leave 0 zombies" —
      `n_init_failures_leave_zero_zombies` spawns N=8 (matching the
      `CL84LvGx7RBE` count), forgets each Child to mirror the abandoned-
      mid-init path, and asserts both registry drain and `/proc/<pid>/status`
      absence-or-not-Z.

Test counts:
- Unit (reaper.rs): 3 tests, all pass in ~0.3 s total
- Integration (tests/zombie_reaper.rs): 2 tests, all pass in ~0.5 s total
- Regression: 754/754 boxlite lib tests pass with --features rest
  (excluding the pre-existing `ws_watchdog` flake unrelated to this
  change); 118/118 boxlite-cli unit tests pass.

Out of scope
============

- Reattach orphans: a reattached `ShimHandler` doesn't register with
  the reaper. If the original spawning process is gone, init will reap
  the eventual zombie. Tracking via reconciler (separate epic) covers
  the DB/runtime drift case.
- PR boxlite-ai#573 (orphan shim port retention, Issue boxlite-ai#565): independent fix
  for a different leak mechanism (`is_same_process` recovery
  misidentification). Both are needed for the full dind test reliability
  story; landing one doesn't subsume the other.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The prior commit (4c1bf2fd) introduced a `ReaperHandle` RAII type whose
Drop auto-unregistered a PID from the reaper. That design missed the
load-bearing production zombie source it was supposed to catch.

The bug
=======

`ShimHandler` field-order-drops in this sequence:

  1. process: Option<Child>       — Child::Drop is a no-op
  2. keepalive: Option<Keepalive> — closes the watchdog pipe →
                                     shim sees POLLHUP and BEGINS
                                     graceful exit (takes ~100 ms+)
  3. reaper_handle: Option<...>   — unregisters PID immediately

Between steps 2 and 3 the shim is still alive; step 3 fires while the
shim is mid-shutdown. ~100 ms later the shim actually exits — and now
nobody is watching. The reaper had been told "stop watching" exactly
microseconds before the very event it existed to catch.

In production this is hit by every `rt_impl::remove_box` of an active
box (which SIGKILLs the shim by PID and never calls `handler.stop()`,
so `Child::wait()` is never invoked) and by every runtime drop that
skips `shutdown()`. These are exactly the paths Issue boxlite-ai#523's
"7+ accumulated zombies on dev" came from.

The unit tests in 4c1bf2fd accidentally papered over this by holding
`_handle` explicitly across the assertion — production code can't do
that because there's no orderly point at which to release.

The fix
=======

Registration is now a one-way door:

- `ShimReaper::register(pid)` returns `()`. No handle.
- `ShimHandler` no longer holds a `reaper_handle` field. Dropping a
  `ShimHandler` does not affect reaper state.
- A PID stays in the registry until the reaper's `waitpid(WNOHANG)`
  sweep observes `Reaped` (we just collected it) or `Vanished`
  (someone else collected it, or the PID isn't our child). Both
  outcomes already drop the PID — no new code path needed in the
  sweep logic.

Worst-case registry membership for a successfully-stopped shim is one
REAPER_TICK (250 ms) — the sweep cost is a HashSet scan, so this is
trivial.

Tests
=====

- `reaps_exited_pid_within_one_second` — unchanged in shape, but now
  exercises the only registration path (no handle to retain).
- `dropping_handle_unregisters_pid` is replaced by
  `live_pid_stays_registered_until_actual_exit`. This pins the new
  invariant directly: register a long-running PID, give the worker
  multiple ticks to sweep, assert the PID is still registered (the
  old design would have purged it on owner Drop); then kill+wait
  the child and assert the next sweep drains via ECHILD.
- Integration test renamed from
  `registry_accepts_re_registration_after_drop` to
  `register_is_idempotent_and_survives_prior_cleanup`, asserting
  HashSet-insert idempotence (a double register results in a single
  entry) and post-cleanup re-registration semantics. The N=8
  zombie-accumulation test is otherwise unchanged.

Regression sweep: 754/754 boxlite lib (with --features rest, excluding
the pre-existing ws_watchdog flake), 3/3 reaper unit, 2/2 reaper
integration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the resource-hazard-detection gap from the prior commits: the
reaper now exposes counters, recovers from worker death, and surfaces
its health through `RuntimeMetrics` and the REST `/v1/metrics`
endpoint so operators can spot "the reaper isn't running" before
zombies pile up.

What's added
============

1. **Reaper internal stats + `stats()` API.** `Inner` gains atomic
   counters for `reaped_total`, `vanished_total`, `sweeps_completed`,
   `last_sweep_at_ms`. `ShimReaper::stats()` returns a `ReaperStats`
   snapshot including the registry size and worker liveness. Dashboards
   alert on `sweeps_completed` going flat while `registered > 0` —
   that's the load-bearing "reaper died, zombies accumulating"
   signature.

2. **Self-healing.** `ShimReaper::worker_alive()` checks
   `JoinHandle::is_finished()`. If false and shutdown wasn't called,
   `register()` transparently respawns the worker and emits a
   `tracing::error!` — the loud alert that something pathological is
   going on inside the reaper. New unit test
   `worker_respawns_after_death` uses a test-only
   `test_panic_next_iteration` flag to deterministically trigger
   worker death (mutex poisoning can't be used anymore because
   `lock_or_recover` now absorbs poison silently — see below).

3. **Mutex poison recovery.** All `registry.lock().unwrap()` sites are
   replaced with `lock_or_recover` which calls
   `PoisonError::into_inner` to extract the data anyway. For our
   `HashSet<u32>` registry, insert/remove are atomic at the std-lib
   level so torn state is impossible. Without this, a single panic
   anywhere holding the registry mutex would cascade-poison every
   subsequent reaper call and defeat the self-healing path.

4. **High-pressure unit test.**
   `registry_under_high_register_pressure` burst-registers N=1024
   PIDs (all in a high u32 range that returns ECHILD on waitpid).
   Asserts the sweep drains the whole batch within 3 s and that
   `vanished_total` reflects the full batch — verifies the worker
   handles register bursts without deadlocking on the registry mutex.

5. **`MetricsSink` cross-module bridge.** The reaper's `spawn_with_sink`
   accepts an optional `MetricsSink { reaped_total, vanished_total,
   registered_now }` of `Arc<AtomicU64>`s. `RuntimeImpl::new`
   constructs one of these by cloning the relevant fields out of its
   `RuntimeMetricsStorage`, so reaper activity is mirrored into the
   metrics storage without making `util::reaper` import the metrics
   module. Keeps `util` a leaf module.

6. **`RuntimeMetricsStorage` + `RuntimeMetrics` getters.** Three new
   fields (`shim_reaped`, `shim_reaper_vanished`,
   `shim_reaper_registered`) and the corresponding accessor methods
   on `RuntimeMetrics`. The first two are monotonic counters; the
   third is a gauge.

7. **REST `/v1/metrics` exposure.** `RuntimeMetricsResponse` (server
   and client) gains the same three fields, marked
   `#[serde(default)]` on the client mirror so older servers that
   pre-date the reaper still deserialize cleanly. Two new wire-shape
   tests:
   - `test_runtime_metrics_back_compat_default_zero_for_missing_shim_fields`
   - `test_runtime_metrics_shim_fields_round_trip`

Tests
=====

New unit tests in `util::reaper::tests`:
- `registry_under_high_register_pressure` — N=1024 burst drains in <3s
- `worker_respawns_after_death` — kill worker, register, observe
  respawn + continued sweep activity
- `stats_track_sweep_progress` — aggregate Stats API snapshot

Updated existing tests to assert against the new counters
(`reaped_total >= 1` after a reap, `vanished_total >= 1` after a
vanish). Total 6 reaper unit tests, all pass.

REST mirror gets 2 new wire-shape tests (3 total
`test_runtime_metrics_*` tests in `rest::types::tests`).

Regression sweep: 759/759 boxlite lib (with --features rest,
excluding pre-existing ws_watchdog flake), 118/118 boxlite-cli,
2/2 zombie_reaper integration. Zero regressions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as draft May 28, 2026 08:28
…ives

Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper
reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered
must be left for its owner to wait(). Two-side verified — injecting a
global waitpid(-1) into the sweep makes the owner's wait() return ECHILD
("No child processes") and the test fail; the scoped sweep preserves
exit code 42.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@G4614 G4614 marked this pull request as ready for review May 28, 2026 09:33
DorianZheng pushed a commit that referenced this pull request May 28, 2026
…eak a running shim (#615)

health_check_transitions_to_healthy_after_startup ends with the box still
Running. RuntimeImpl::Drop's safety-net shutdown_sync only sends SIGTERM and
returns without waiting, so the shim is still mid-shutdown when
PerTestBoxHome::Drop (#604) runs its leak assertion — `1 live shim(s)`.
Adding t.bx.stop() (which stops and waits) clears it. The reaper (#613) does
not help here: it reaps dead shims, not a still-running one.

Co-authored-by: Ubuntu <ubuntu@ip-172-31-20-149.ap-southeast-2.compute.internal>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant