fix(hypervisor): reconcile Error→Running when VMM process is alive #57
Merged
A VM stuck in State=Error while its VMM process was still running (e.g. after MarkError following a transient stop-API failure) had no recovery path: PrepareStart's already-running branch returned without flipping the DB state, so vm start was a silent no-op and inspect kept showing a misleading Error state. The only escape was vm rm --force. PrepareStart now calls reconcileToRunning when WithRunningVM confirms the process is alive but the record has drifted off Running. The ledger is intentionally not touched: the compute interval is already open and matches the live process, so emitting a second start would double-count.
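The control flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `Hypervisor` and `VM` types, the `alive` probe (standing in for WithRunningVM), the `needsLaunch` return value, and the string-slice ledger are all assumptions made so the example compiles on its own. Only the names PrepareStart and reconcileToRunning come from the description.

```go
package main

import "fmt"

// State is a hypothetical VM lifecycle state; the real type is not shown in this PR.
type State string

const (
	StateRunning State = "Running"
	StateError   State = "Error"
)

// VM is a hypothetical stand-in for the persisted VM record.
type VM struct {
	Name  string
	State State
}

// Hypervisor is a hypothetical stand-in holding the record store,
// the ledger, and a liveness probe.
type Hypervisor struct {
	records map[string]*VM
	ledger  []string               // emitted ledger events, e.g. "compute.start"
	alive   func(name string) bool // assumption: reports whether the VMM process is alive
}

// reconcileToRunning flips the stored state to Running and reports whether
// anything changed. It deliberately never appends to the ledger.
func (h *Hypervisor) reconcileToRunning(vm *VM) bool {
	if vm.State == StateRunning {
		return false // already consistent: no-op
	}
	vm.State = StateRunning
	return true
}

// PrepareStart reports whether a new VMM process must be launched.
// When the process is already alive it reconciles a drifted record first.
func (h *Hypervisor) PrepareStart(name string) (needsLaunch bool, err error) {
	vm, ok := h.records[name]
	if !ok {
		return false, fmt.Errorf("vm %q not found", name)
	}
	if h.alive(name) {
		h.reconcileToRunning(vm)
		return false, nil
	}
	return true, nil
}

func main() {
	h := &Hypervisor{
		records: map[string]*VM{"vm1": {Name: "vm1", State: StateError}},
		ledger:  []string{"compute.start"}, // interval opened by the original start
		alive:   func(string) bool { return true },
	}
	needsLaunch, err := h.PrepareStart("vm1")
	fmt.Println(needsLaunch, err, h.records["vm1"].State, len(h.ledger))
}
```

In this sketch the record moves from Error back to Running while the ledger still holds the single compute.start from the original launch.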
Summary
State=Error with the VMM process still alive (e.g. after MarkError following a transient stop-API failure) had no recovery path: PrepareStart's already-running branch returned without flipping DB state, so vm start was a silent no-op and inspect showed misleading Error forever. The only escape was vm rm --force. PrepareStart now calls reconcileToRunning when WithRunningVM confirms the process is alive but the record drifted off Running.

Trigger sequence
1. compute.start, process alive
2. vm stop vm1 graceful → vm.shutdown HTTP times out / ctx-cancel → HandleStopResult calls MarkError → State=Error
3. vm start vm1 → PrepareStart → WithRunningVM sees process alive → previously returned nil, nil; now reconciles State→Running first

Test plan
- TestReconcileToRunningFromError: verifies Running→MarkError→reconcile flips State without emitting
- TestReconcileToRunningIdempotent: verifies reconcile on a healthy Running VM is a no-op
- make fmt-check && make lint && go test -race ./...: 21 packages green, lint 0, fmt 0