fix(runtime): preserve box record on init failure as Failed state #520

Merged: DorianZheng merged 3 commits into main from worktree-jolly-noodling-nygaard on May 14, 2026

@DorianZheng
Member

Summary

  • Replace CleanupGuard::drop's remove_box() with a Failed-state transition (preserves the DB row + error_reason). Matches the Daytona/Kata/containerd/Docker canonical pattern. Fixes the root cause of the CL84LvGx7RBE orphan on dev.boxlite.ai.
  • Add daemon-wide install_zombie_reaper() so failed shim children don't accumulate as <defunct> PIDs.
  • Wire Go runner SIGTERM -> boxliteClient.Shutdown(25s) -> apiServer.Stop() so running VMs get a graceful SIGTERM; bump systemd TimeoutStopSec to 60.
  • Enrich the wait_for_guest_ready timeout error (shim_alive, console_bytes, ready_socket_exists, likely_cause, console tail) and make the timeout injectable for testability.
  • CLAUDE.md gains the test-meaningfulness rule.

Test plan

  • cargo test -p boxlite --lib litebox::state::tests (43 pass; +9 new for the Failed matrix)
  • cargo test -p boxlite --lib litebox::init::types::tests (4 pass; +1 new CleanupGuard drop preservation test)
  • cargo test -p boxlite --lib util::process::tests (21 pass; +1 new zombie-reaper test)
  • cargo test -p boxlite --lib litebox::init::tasks::guest_connect::tests (12 pass; includes the real enriched-timeout test from a prior turn)
  • Rip-out verification: reverting CleanupGuard::drop body and install_zombie_reaper body individually flips the new tests red; restoring brings them back to green.
  • Post-deploy: clean up the orphaned CL84LvGx7RBE directory on the dev runner (one-time operational step; runbook lives in the plan file).

When an init-pipeline task fails (e.g. the 30s guest-connect timeout
that hit CL84LvGx7RBE on dev.boxlite.ai), CleanupGuard::drop used to
call remove_box() unconditionally. That deleted the DB row while the
box's persistent disks stayed on the host, orphaning them and leaving
the user with an unrecoverable sandbox. Match the canonical pattern
(Daytona ERROR, Kata startVM defer, containerd status.ExitCode, Docker
SetError+CheckpointTo): preserve the record, transition to Failed with
error_reason, and let DESTROY_SANDBOX be the only deletion path.
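
A minimal sketch of the guard behaviour described above, assuming an in-memory map as a stand-in for the DB. `BoxRecord`, `Store`, `init_box`, `set_reason` and `disarm` are hypothetical names for illustration, not the boxlite API; only `CleanupGuard`, `Failed` and `error_reason` come from the description.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical status enum; only the variants the sketch needs.
#[derive(Debug, Clone, PartialEq)]
enum BoxStatus {
    Starting,
    Running,
    Failed,
}

/// Hypothetical stand-in for the persisted DB row.
#[derive(Debug, Clone)]
struct BoxRecord {
    status: BoxStatus,
    error_reason: Option<String>,
}

type Store = Arc<Mutex<HashMap<String, BoxRecord>>>;

/// Armed while init is in flight; disarmed once init succeeds.
struct CleanupGuard {
    store: Store,
    box_id: String,
    reason: String,
    armed: bool,
}

impl CleanupGuard {
    fn new(store: Store, box_id: &str) -> Self {
        CleanupGuard {
            store,
            box_id: box_id.to_string(),
            reason: "init failed".to_string(),
            armed: true,
        }
    }
    fn set_reason(&mut self, reason: &str) {
        self.reason = reason.to_string();
    }
    fn disarm(&mut self) {
        self.armed = false;
    }
}

impl Drop for CleanupGuard {
    fn drop(&mut self) {
        if !self.armed {
            return;
        }
        // Keep the row; record the failure instead of deleting it.
        if let Ok(mut rows) = self.store.lock() {
            if let Some(row) = rows.get_mut(&self.box_id) {
                row.status = BoxStatus::Failed;
                row.error_reason = Some(self.reason.clone());
            }
        }
    }
}

/// Hypothetical init pipeline: `guest_ready = false` simulates the timeout.
fn init_box(store: &Store, box_id: &str, guest_ready: bool) -> Result<(), String> {
    store.lock().unwrap().insert(
        box_id.to_string(),
        BoxRecord { status: BoxStatus::Starting, error_reason: None },
    );
    let mut guard = CleanupGuard::new(store.clone(), box_id);
    if !guest_ready {
        guard.set_reason("guest-connect timeout");
        return Err("guest-connect timeout".to_string()); // guard drops here
    }
    store.lock().unwrap().get_mut(box_id).unwrap().status = BoxStatus::Running;
    guard.disarm();
    Ok(())
}

fn main() {
    let store: Store = Arc::new(Mutex::new(HashMap::new()));
    let result = init_box(&store, "box-a", false);
    let rows = store.lock().unwrap();
    println!("{:?} -> {:?}", result, rows.get("box-a"));
}
```

In this sketch the row survives the failed init with its reason attached, so an explicit destroy request remains the only thing that deletes it.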

Bundled companion fixes for contributing bugs found in the same
investigation:

- install_zombie_reaper: daemon-wide SIGCHLD reaper so repeated shim
  failures don't accumulate <defunct> children (7+ observed in prod).
- Go runner SIGTERM handler now calls boxliteClient.Shutdown() before
  apiServer.Stop(), so VMs get a graceful SIGTERM instead of being
  killed mid-write by the parent exit. TimeoutStopSec=60 on the systemd
  unit leaves headroom.
- wait_for_guest_ready accepts an injectable Duration (production keeps
  the 30s constant) and the timeout error body now includes shim_alive,
  console_bytes, ready_socket_exists, likely_cause heuristic, and a
  console tail -- turning hours of forensics into a one-look diagnostic.
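
The injectable timeout and enriched error from the last bullet could look roughly like the sketch below. The field names mirror the ones listed above. The `Probe` struct, the polling loop, the 10ms poll interval, the three-line tail and the `likely_cause` rules are assumptions made for illustration, not the production heuristic.

```rust
use std::fmt;
use std::thread;
use std::time::{Duration, Instant};

/// Production default named in the description; tests pass something shorter.
const GUEST_READY_TIMEOUT: Duration = Duration::from_secs(30);

/// Hypothetical snapshot of what the waiter can observe on each poll.
struct Probe {
    ready: bool,
    shim_alive: bool,
    ready_socket_exists: bool,
    console: String,
}

#[derive(Debug)]
struct GuestReadyTimeout {
    waited: Duration,
    shim_alive: bool,
    console_bytes: usize,
    ready_socket_exists: bool,
    likely_cause: &'static str,
    console_tail: String,
}

impl fmt::Display for GuestReadyTimeout {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "guest not ready after {:?}: shim_alive={} console_bytes={} \
             ready_socket_exists={} likely_cause={:?}\nconsole tail:\n{}",
            self.waited, self.shim_alive, self.console_bytes,
            self.ready_socket_exists, self.likely_cause, self.console_tail
        )
    }
}

/// Assumed heuristic; the real rules are not given in the description.
fn likely_cause(shim_alive: bool, console_bytes: usize, socket: bool) -> &'static str {
    match (shim_alive, console_bytes, socket) {
        (false, _, _) => "shim exited before the guest became ready",
        (true, 0, _) => "guest produced no console output",
        (true, _, false) => "guest booted but the ready socket was never created",
        (true, _, true) => "ready socket exists but readiness was never signalled",
    }
}

fn wait_for_guest_ready(
    timeout: Duration,
    mut probe: impl FnMut() -> Probe,
) -> Result<(), GuestReadyTimeout> {
    let start = Instant::now();
    loop {
        let p = probe();
        if p.ready {
            return Ok(());
        }
        if start.elapsed() >= timeout {
            // Keep only the last few console lines for the error body.
            let mut tail: Vec<&str> = p.console.lines().rev().take(3).collect();
            tail.reverse();
            return Err(GuestReadyTimeout {
                waited: start.elapsed(),
                shim_alive: p.shim_alive,
                console_bytes: p.console.len(),
                ready_socket_exists: p.ready_socket_exists,
                likely_cause: likely_cause(p.shim_alive, p.console.len(), p.ready_socket_exists),
                console_tail: tail.join("\n"),
            });
        }
        thread::sleep(Duration::from_millis(10));
    }
}

fn main() {
    println!("production default: {:?}", GUEST_READY_TIMEOUT);
    let stuck = || Probe {
        ready: false,
        shim_alive: true,
        ready_socket_exists: false,
        console: "booting\nmounting rootfs".to_string(),
    };
    if let Err(e) = wait_for_guest_ready(Duration::from_millis(100), stuck) {
        println!("{e}");
    }
}
```

Passing the `Duration` as a parameter lets a test exercise the timeout branch in milliseconds while production callers keep the 30s constant.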

Tests:
- BoxStatus::Failed serde + transitions + can_remove/can_start matrix
  (state.rs, 9 tests).
- mark_failed sets status/reason/pid and preserves health (state.rs).
- CleanupGuard::drop persists Failed and keeps the row (init/types.rs).
  Reverting Drop to remove_box flips this test red.
- install_zombie_reaper consumes an unwaited child within 8s
  (util/process.rs). Reverting the reaper flips this test red.
- wait_for_guest_ready timeout branch returns the enriched error body
  via a 100ms test-side timeout (guest_connect.rs).
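
As a rough illustration of what the mark_failed test checks, the sketch below sets status and reason and leaves health untouched. Two points are assumptions: that pid is cleared to `None`, and that Failed boxes are removable (inferred from DESTROY_SANDBOX being the only deletion path). The `BoxState` and `Health` types are hypothetical, and `can_start` is left out because the description does not state its policy.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BoxStatus {
    Running,
    Stopped,
    Failed,
}

impl BoxStatus {
    /// Assumed policy: a Failed box stays removable so an explicit
    /// destroy request can still clean it up.
    fn can_remove(self) -> bool {
        matches!(self, BoxStatus::Stopped | BoxStatus::Failed)
    }
}

/// Hypothetical health marker; the real type is not shown in the PR text.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Health {
    Healthy,
    Unhealthy,
}

/// Hypothetical in-memory view of a box's state.
#[derive(Debug, Clone)]
struct BoxState {
    status: BoxStatus,
    pid: Option<u32>,
    error_reason: Option<String>,
    health: Health,
}

impl BoxState {
    /// Sets status and reason, clears pid (assumption), keeps health as-is.
    fn mark_failed(&mut self, reason: &str) {
        self.status = BoxStatus::Failed;
        self.error_reason = Some(reason.to_string());
        self.pid = None;
    }
}

fn main() {
    let mut state = BoxState {
        status: BoxStatus::Running,
        pid: Some(1234),
        error_reason: None,
        health: Health::Unhealthy,
    };
    state.mark_failed("guest-connect timeout");
    println!("{:?} removable={}", state, state.status.can_remove());
}
```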

CLAUDE.md gains the test-meaningfulness rule: assertions must be on
data routed through production code, not on values the test body
invented.

The daemon-wide SIGCHLD reaper from the previous commit needs more
design work before it's safe to land. Concerns surfaced after merging:

- Global side effect: waitpid(-1, WNOHANG) races with any code in the
  same process that owns a Child handle and expects to call .wait().
  If the reaper consumes the child first, the owner gets ECHILD and
  loses the exit code. ProcessMonitor::try_wait already returns
  ProcessExit::Unknown for this case, but other callers of
  std::process::Child::wait() across the daemon do not.
- 5s sleep cycle is coarse -- long enough that the test had to wait
  up to 8s for verification, slowing CI.
- The reaper thread is install-once and runs for the lifetime of the
  process; there is no shutdown path or per-test isolation.
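
To make the first concern concrete, here is a sketch of how an owner can tolerate a child that something else already reaped: map ECHILD to an "unknown" exit instead of propagating the error. The `ProcessExit` enum and `classify_wait` helper are hypothetical and only echo the `ProcessExit::Unknown` behaviour mentioned above. The `ECHILD` value of 10 is the Linux and macOS errno, hard-coded here as an assumption so the example needs no external crate.

```rust
use std::io;
use std::process::{Command, ExitStatus};

/// errno for "No child processes" on Linux and macOS (assumed platform).
const ECHILD: i32 = 10;

/// Hypothetical result type for the sketch.
#[derive(Debug, PartialEq)]
enum ProcessExit {
    Code(i32),
    /// The child is gone but its status was consumed elsewhere
    /// (or it was killed by a signal and has no exit code).
    Unknown,
}

/// Turns the result of `Child::wait()` into a `ProcessExit`, treating
/// ECHILD as "already reaped" rather than as a hard error.
fn classify_wait(result: io::Result<ExitStatus>) -> io::Result<ProcessExit> {
    match result {
        Ok(status) => Ok(match status.code() {
            Some(code) => ProcessExit::Code(code),
            None => ProcessExit::Unknown,
        }),
        Err(e) if e.raw_os_error() == Some(ECHILD) => Ok(ProcessExit::Unknown),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    let mut child = Command::new("sh").args(["-c", "exit 42"]).spawn()?;
    println!("{:?}", classify_wait(child.wait())?);
    Ok(())
}
```

A caller that lacks this mapping sees the raw ECHILD error and loses the exit code, which is the failure mode listed above.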

The CleanupGuard preservation fix (the root cause of the
CL84LvGx7RBE incident) is independent of the reaper and stays.
Investigation tracked in a follow-up issue.

E0004 in CI on `sdks/node/src/info.rs:72` (and the same shape in
sdks/python/src/info.rs and sdks/c/src/info.rs): the three SDK match
arms on `BoxStatus` weren't updated when `Failed` was added to the
enum in the previous commit, so the SDK lib targets failed to compile.

Add `Failed => "failed"` to each, matching the canonical string from
`BoxStatus::as_str()` so REST/CLI/SDK consumers see one consistent name.
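
A small sketch of why the compiler caught this, assuming a cut-down `BoxStatus` with hypothetical variants other than `Failed`. A `match` with no wildcard arm must list every variant, so adding `Failed` to the enum breaks each SDK mapping until it gains its own arm. `sdk_status_label` is an illustrative name, not the SDK function.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BoxStatus {
    Starting,
    Running,
    Stopped,
    Failed,
}

impl BoxStatus {
    /// Canonical string, as the core crate would define it.
    fn as_str(self) -> &'static str {
        match self {
            BoxStatus::Starting => "starting",
            BoxStatus::Running => "running",
            BoxStatus::Stopped => "stopped",
            BoxStatus::Failed => "failed",
        }
    }
}

/// Hypothetical SDK-side mapping that duplicates the match.
fn sdk_status_label(status: BoxStatus) -> &'static str {
    match status {
        BoxStatus::Starting => "starting",
        BoxStatus::Running => "running",
        BoxStatus::Stopped => "stopped",
        // Deleting this arm reproduces error[E0004] (non-exhaustive patterns).
        BoxStatus::Failed => "failed",
    }
}

fn main() {
    let all = [
        BoxStatus::Starting,
        BoxStatus::Running,
        BoxStatus::Stopped,
        BoxStatus::Failed,
    ];
    for status in all {
        println!("{:?}: core={} sdk={}", status, status.as_str(), sdk_status_label(status));
    }
}
```

Avoiding a `_ =>` fallback in such mappings is what turns a forgotten variant into a compile error rather than a silently wrong label.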
@DorianZheng DorianZheng merged commit e4767e7 into main May 14, 2026
53 of 54 checks passed
@DorianZheng DorianZheng deleted the worktree-jolly-noodling-nygaard branch May 14, 2026 13:56
G4614 pushed a commit to G4614/boxlite that referenced this pull request May 28, 2026
…ives

Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper
reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered
must be left for its owner to wait(). Two-side verified — injecting a
global waitpid(-1) into the sweep makes the owner's wait() return ECHILD
("No child processes") and the test fail; the scoped sweep preserves
exit code 42.
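
The property this commit pins can be sketched with the standard library alone: a reaper that polls only the children registered with it, so an unregistered child keeps its exit status for its owner. `ScopedReaper`, `register` and `sweep` are hypothetical names, and the actual implementation in the referenced commit may differ.

```rust
use std::collections::HashMap;
use std::process::{Child, Command, ExitStatus};
use std::thread;
use std::time::Duration;

/// Hypothetical scoped reaper: it never calls a process-wide wait.
struct ScopedReaper {
    children: HashMap<u32, Child>,
}

impl ScopedReaper {
    fn new() -> Self {
        ScopedReaper { children: HashMap::new() }
    }

    /// Hands ownership of a child to the reaper and returns its pid.
    fn register(&mut self, child: Child) -> u32 {
        let pid = child.id();
        self.children.insert(pid, child);
        pid
    }

    /// One sweep: non-blocking poll of each registered child.
    /// Returns the (pid, status) pairs reaped in this pass.
    fn sweep(&mut self) -> Vec<(u32, ExitStatus)> {
        let mut reaped = Vec::new();
        self.children.retain(|pid, child| match child.try_wait() {
            Ok(Some(status)) => {
                reaped.push((*pid, status));
                false // exited and collected: stop tracking
            }
            Ok(None) => true, // still running
            Err(_) => false,  // cannot be polled any more: stop tracking
        });
        reaped
    }
}

fn main() -> std::io::Result<()> {
    let mut reaper = ScopedReaper::new();
    reaper.register(Command::new("sh").args(["-c", "exit 0"]).spawn()?);

    // This child is NOT registered, so the reaper must leave it alone.
    let mut owned = Command::new("sh").args(["-c", "exit 42"]).spawn()?;

    thread::sleep(Duration::from_millis(200));
    println!("reaped by sweep: {:?}", reaper.sweep());
    println!("owner sees: {:?}", owned.wait()?.code());
    Ok(())
}
```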

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>