fix(runtime): preserve box record on init failure as Failed state#520
Merged
Conversation
When init-pipeline tasks fail (e.g. the 30s guest-connect timeout that hit CL84LvGx7RBE on dev.boxlite.ai), CleanupGuard::drop used to call remove_box() unconditionally -- orphaning the box's persistent disks on host while the DB row disappeared, leaving the user with an unrecoverable sandbox. Match the canonical pattern (Daytona ERROR, Kata startVM defer, containerd status.ExitCode, Docker SetError+CheckpointTo): preserve the record, transition to Failed with error_reason, let DESTROY_SANDBOX be the only deletion path. Bundled companion fixes for contributing bugs found in the same investigation: - install_zombie_reaper: daemon-wide SIGCHLD reaper so repeated shim failures don't accumulate <defunct> children (7+ observed in prod). - Go runner SIGTERM handler now calls boxliteClient.Shutdown() before apiServer.Stop(), so VMs get a graceful SIGTERM instead of being killed mid-write by the parent exit. TimeoutStopSec=60 on the systemd unit leaves headroom. - wait_for_guest_ready accepts an injectable Duration (production keeps the 30s constant) and the timeout error body now includes shim_alive, console_bytes, ready_socket_exists, likely_cause heuristic, and a console tail -- turning hours of forensics into a one-look diagnostic. Tests: - BoxStatus::Failed serde + transitions + can_remove/can_start matrix (state.rs, 9 tests). - mark_failed sets status/reason/pid and preserves health (state.rs). - CleanupGuard::drop persists Failed and keeps the row (init/types.rs). Reverting Drop to remove_box flips this test red. - install_zombie_reaper consumes an unwaited child within 8s (util/process.rs). Reverting the reaper flips this test red. - wait_for_guest_ready timeout branch returns the enriched error body via a 100ms test-side timeout (guest_connect.rs). CLAUDE.md gains the test-meaningfulness rule: assertions must be on data routed through production code, not on values the test body invented.
The daemon-wide SIGCHLD reaper from the previous commit needs more design work before it's safe to land. Concerns surfaced after merging: - Global side effect: waitpid(-1, WNOHANG) races with any code in the same process that owns a Child handle and expects to call .wait(). If the reaper consumes the child first, the owner gets ECHILD and loses the exit code. ProcessMonitor::try_wait already returns ProcessExit::Unknown for this case, but other callers of std::process::Child::wait() across the daemon do not. - 5s sleep cycle is coarse -- long enough that the test had to wait up to 8s for verification, slowing CI. - The reaper thread is install-once and runs for the lifetime of the process; there is no shutdown path or per-test isolation. The CleanupGuard preservation fix (the root cause of the CL84LvGx7RBE incident) is independent of the reaper and stays. Investigation tracked in a follow-up issue.
5 tasks
E0004 in CI on \`sdks/node/src/info.rs:72\` (and the same shape in sdks/python/src/info.rs and sdks/c/src/info.rs): the three SDK match arms on \`BoxStatus\` weren't updated when \`Failed\` was added to the enum in the previous commit, so the SDK lib targets failed to compile. Add \`Failed => "failed"\` to each, matching the canonical string from \`BoxStatus::as_str()\` so REST/CLI/SDK consumers see one consistent name.
G4614
pushed a commit
to G4614/boxlite
that referenced
this pull request
May 28, 2026
…ives Pins the property whose absence got PR boxlite-ai#520's global waitpid(-1) reaper reverted (Issue boxlite-ai#523, criterion #2): a child the reaper never registered must be left for its owner to wait(). Two-side verified — injecting a global waitpid(-1) into the sweep makes the owner's wait() return ECHILD ("No child processes") and the test fail; the scoped sweep preserves exit code 42. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
CleanupGuard::drop'sremove_box()with aFailed-state transition (preserves the DB row +error_reason). Matches the Daytona/Kata/containerd/Docker canonical pattern. Fixes the root cause of theCL84LvGx7RBEorphan ondev.boxlite.ai.install_zombie_reaper()so failed shim children don't accumulate as<defunct>PIDs.boxliteClient.Shutdown(25s)->apiServer.Stop()so running VMs get graceful SIGTERM; bump systemdTimeoutStopSec=60.wait_for_guest_readytimeout error (shim_alive, console_bytes, ready_socket_exists, likely_cause, console tail) + injectable timeout for testability.Test plan
cargo test -p boxlite --lib litebox::state::tests(43 pass; +9 new for the Failed matrix)cargo test -p boxlite --lib litebox::init::types::tests(4 pass; +1 new CleanupGuard drop preservation test)cargo test -p boxlite --lib util::process::tests(21 pass; +1 new zombie-reaper test)cargo test -p boxlite --lib litebox::init::tasks::guest_connect::tests(12 pass; includes the real enriched-timeout test from a prior turn)CleanupGuard::dropbody andinstall_zombie_reaperbody individually flips the new tests red; restoring brings them back to green.CL84LvGx7RBEdirectory on the dev runner (one-time operational step; runbook lives in the plan file).