
fix(hypervisor): reconcile Error→Running when VMM process is alive #57

Merged: CMGS merged 1 commit into master from fix/start-reconcile-error-state (May 20, 2026)

CMGS commented May 20, 2026

Summary

  • A VM stuck in State=Error while its VMM process was still alive (e.g. after MarkError following a transient stop-API failure) had no recovery path: PrepareStart's already-running branch returned without flipping the DB state, so vm start was a silent no-op and inspect kept showing a misleading Error. The only escape was vm rm --force.
  • PrepareStart now calls reconcileToRunning when WithRunningVM confirms the process is alive but the record drifted off Running.
  • The ledger is intentionally not touched: the compute interval is already open and matches the live process, so emitting a second start would double-count.
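A minimal sketch of the new already-running branch, assuming a simplified model: State, VMRecord, Ledger, prepareStart and the processAlive flag are hypothetical stand-ins (processAlive plays the role of what WithRunningVM reports), not the hypervisor package's actual API. It shows the two properties the summary describes: a drifted record is flipped back to Running, and the ledger is left alone because the original compute interval is still open.

```go
package main

import "fmt"

// State is a hypothetical VM lifecycle state; the project's real type may differ.
type State string

const (
	StateRunning State = "Running"
	StateError   State = "Error"
)

// VMRecord is a hypothetical stand-in for the persisted VM record.
type VMRecord struct {
	ID    string
	State State
}

// Ledger is a hypothetical billing ledger holding compute events.
type Ledger struct {
	Events []string
}

// reconcileToRunning flips a drifted record back to Running and reports
// whether it changed anything. It never writes to the ledger.
func reconcileToRunning(rec *VMRecord) bool {
	if rec.State == StateRunning {
		return false // healthy record: no-op
	}
	rec.State = StateRunning
	return true
}

// prepareStart sketches the start path. processAlive stands in for what
// WithRunningVM reports about the VMM process.
func prepareStart(rec *VMRecord, processAlive bool, ledger *Ledger) error {
	if processAlive {
		// Before the fix this branch returned without touching rec,
		// so an Error record stayed Error forever.
		reconcileToRunning(rec)
		return nil
	}
	// Cold-start path (VMM launch elided): mark Running and open a new
	// compute interval.
	rec.State = StateRunning
	ledger.Events = append(ledger.Events, "compute.start")
	return nil
}

func main() {
	ledger := &Ledger{Events: []string{"compute.start"}} // interval already open
	rec := &VMRecord{ID: "vm1", State: StateError}      // drifted record
	if err := prepareStart(rec, true, ledger); err != nil {
		fmt.Println("start failed:", err)
		return
	}
	fmt.Println(rec.State, len(ledger.Events)) // Running 1
}
```

Keeping reconcileToRunning separate from the cold-start path is what makes it safe to call on every start: on a healthy Running record it does nothing, and it has no way to emit a ledger event.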

Trigger sequence

  1. VM Running, ledger has compute.start, process alive
  2. vm stop vm1 (graceful) → vm.shutdown HTTP call times out or ctx is cancelled → HandleStopResult calls MarkError → State=Error
  3. VMM process is still alive (shutdown was just an ACPI hint)
  4. vm start vm1 → PrepareStart → WithRunningVM sees the process alive → previously returned nil, nil; now reconciles State→Running first
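The trigger sequence can be replayed end to end in a small in-memory model. Everything here is hypothetical: vm, shutdownAPI, handleStopResult, gracefulStop and prepareStart are illustrative names, and shutdownAPI simulates a guest that ignores the ACPI hint by blocking until its context deadline. The sketch shows how a timed-out graceful stop leaves State=Error with a live process, and how the fixed start path recovers without a second compute.start.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

type State string

const (
	StateRunning State = "Running"
	StateStopped State = "Stopped"
	StateError   State = "Error"
)

// vm is a hypothetical model: State mirrors the DB record, alive mirrors
// whether the VMM process exists, ledger holds compute events.
type vm struct {
	State  State
	alive  bool
	ledger []string
}

// shutdownAPI models the vm.shutdown call. The guest ignores the ACPI
// hint here, so the call only returns once ctx expires.
func shutdownAPI(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// handleStopResult marks the record Error on any stop failure, even
// though the VMM process may still be alive.
func handleStopResult(v *vm, err error) {
	if err != nil {
		v.State = StateError // MarkError
		return
	}
	v.State = StateStopped
	v.alive = false
}

// gracefulStop issues the shutdown with a deadline (step 2).
func gracefulStop(v *vm, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	err := shutdownAPI(ctx)
	handleStopResult(v, err)
	return err
}

// prepareStart with the fix: alive process and drifted record means
// reconcile the state only; the ledger is untouched.
func prepareStart(v *vm) {
	if v.alive {
		if v.State != StateRunning {
			v.State = StateRunning // reconcileToRunning
		}
		return
	}
	v.State = StateRunning
	v.alive = true
	v.ledger = append(v.ledger, "compute.start")
}

func main() {
	// Step 1: Running, ledger has compute.start, process alive.
	v := &vm{State: StateRunning, alive: true, ledger: []string{"compute.start"}}

	// Steps 2-3: the graceful stop times out; the process stays alive.
	err := gracefulStop(v, 10*time.Millisecond)
	fmt.Println(v.State, v.alive, errors.Is(err, context.DeadlineExceeded)) // Error true true

	// Step 4: vm start reconciles instead of silently returning.
	prepareStart(v)
	fmt.Println(v.State, len(v.ledger)) // Running 1
}
```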

Test plan

  • TestReconcileToRunningFromError: verifies that Running→MarkError→reconcile flips State back to Running without emitting a ledger event
  • TestReconcileToRunningIdempotent — verifies reconcile on a healthy Running VM is a no-op
  • make fmt-check && make lint && go test -race ./... — 21 packages green, lint 0, fmt 0
  • AST layout audit — 0 violations
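The two unit tests boil down to the assertions below. This is a sketch only: the store type, its Start/MarkError/reconcileToRunning methods and the emits counter are hypothetical, and the checks are written as plain functions run from main rather than as testing.T tests, so they can be executed standalone.

```go
package main

import (
	"errors"
	"fmt"
)

type State string

const (
	StateRunning State = "Running"
	StateError   State = "Error"
)

// store is a hypothetical in-memory record store; emits counts ledger
// events written, so double-counting shows up as emits > 1.
type store struct {
	state State
	emits int
}

func (s *store) Start() {
	s.state = StateRunning
	s.emits++ // opens a compute interval
}

func (s *store) MarkError() {
	s.state = StateError
}

// reconcileToRunning flips state only; it never emits.
func (s *store) reconcileToRunning() {
	if s.state != StateRunning {
		s.state = StateRunning
	}
}

// checkReconcileFromError mirrors TestReconcileToRunningFromError.
func checkReconcileFromError() error {
	s := &store{}
	s.Start()
	s.MarkError()
	s.reconcileToRunning()
	if s.state != StateRunning {
		return fmt.Errorf("state = %s, want %s", s.state, StateRunning)
	}
	if s.emits != 1 {
		return fmt.Errorf("emits = %d, want 1", s.emits)
	}
	return nil
}

// checkReconcileIdempotent mirrors TestReconcileToRunningIdempotent.
func checkReconcileIdempotent() error {
	s := &store{}
	s.Start()
	before := *s
	s.reconcileToRunning()
	if *s != before {
		return errors.New("reconcile changed a healthy Running record")
	}
	return nil
}

func main() {
	checks := []struct {
		name string
		fn   func() error
	}{
		{"TestReconcileToRunningFromError", checkReconcileFromError},
		{"TestReconcileToRunningIdempotent", checkReconcileIdempotent},
	}
	for _, c := range checks {
		if err := c.fn(); err != nil {
			fmt.Println("FAIL", c.name, err)
			continue
		}
		fmt.Println("ok", c.name)
	}
}
```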

CMGS force-pushed the fix/start-reconcile-error-state branch from b1fa89f to 0223c0b (May 20, 2026 07:39)
A VM stuck in State=Error with the VMM process still running (e.g. after
MarkError following a transient stop-API failure) had no recovery path:
PrepareStart's already-running branch returned without flipping the DB
state, so vm start was a silent no-op and inspect showed misleading
Error forever. The only escape was vm rm --force.

PrepareStart now calls reconcileToRunning when WithRunningVM confirms
the process is alive but the record drifted off Running. The ledger is
intentionally not touched — the compute interval is already open and
matches the live process; emitting a second start would double-count.
CMGS force-pushed the fix/start-reconcile-error-state branch from 0223c0b to 6b153a7 (May 20, 2026 07:46)
CMGS merged commit a42ab77 into master (May 20, 2026)
4 checks passed
CMGS deleted the fix/start-reconcile-error-state branch (May 20, 2026 07:51)