Skip to content

fix: make CH pause/resume idempotent via error matching (no extra vm.info)#52

Merged
CMGS merged 1 commit into
masterfrom
fix/pause-resume-idempotent
May 15, 2026
Merged

fix: make CH pause/resume idempotent via error matching (no extra vm.info)#52
CMGS merged 1 commit into
masterfrom
fix/pause-resume-idempotent

Conversation

@CMGS
Copy link
Copy Markdown
Contributor

@CMGS CMGS commented May 15, 2026

Summary

  • CH's Vm::pause / Vm::resume reject same-state transitions (vmm/src/vm.rs:412-440) — a cocoon process SIGKILLed mid-snapshot leaves the VM stuck Paused, after which every subsequent snapshot save fails with 500 InvalidStateTransition(Paused, Paused). This was the root cause of the journal noise that helped trigger fix: kill orphan VMM via /proc cmdline fallback when pidfile pre-removed #51's orphan chain on GKE prod.
  • Cocoon-side fix: pauseVM / resumeVM swallow CH's exact Invalid transition: InvalidStateTransition(<state>, <state>) 500 body via APIError.Code + strings.Contains on the message. No extra vm.info round-trip.
  • FC research confirms FC's pause/resume API is inherently idempotent at the vCPU event-loop level (src/vmm/src/vstate/vcpu.rs ack Pause from paused() and Resume from running() without error). The "Non-idempotent" comment in firecracker/api.go was wrong — comment-only cleanup.

Repro (testbed)

cocoon vm run --nics 0 --memory 512M ghcr.io/cocoonstack/cocoon/ubuntu:24.04
curl --unix-socket /var/lib/cocoon/run/cloudhypervisor/$VMID/api.sock -X PUT http://localhost/api/v1/vm.pause  # simulates stuck-paused
cocoon snapshot save --name test $VMID
# Before: pause: PUT vm.pause → 500: ["...InvalidStateTransition(Paused, Paused)..."], snapshot save fails
# After:  snapshot saved: <id>, VM resumes to Running

Test plan

  • make lint 0 issues, both GOOS=linux and GOOS=darwin
  • new TestIsAlreadyInStateError (9 cases: paused/running match, wrong-state, non-500, different-transition, non-APIError, nil, wrapped via fmt.Errorf %w, empty state)
  • testbed e2e: manual vm.pause via curl → cocoon snapshot save succeeds (was failing pre-fix); cleanup leaves no orphan
  • deploy GKE prod, watch journal for InvalidStateTransition(Paused, Paused) recurrence after next e2e batch

CH's Vm::pause / Vm::resume at vmm/src/vm.rs:412-440 reject same-state transitions (Paused→Paused, Running→Running) as InvalidStateTransition, returning 500. A cocoon process SIGKILLed between pause and resume during snapshot save leaves the VM stuck Paused, after which every subsequent snapshot save fails — observed on GKE prod and journal-confirmed.

pauseVM / resumeVM now swallow CH's exact "Invalid transition: InvalidStateTransition(<state>, <state>)" 500 body, treating it as "already in the desired state". Matched via APIError code + substring; transport errors, other 500 bodies, and different transitions (Created→Paused etc.) still propagate.

FC research (upstream src/vmm/src/vstate/vcpu.rs): FC's vCPU event loop acks Pause from paused state and Resume from running state without error — FC API is inherently idempotent. Updated the misleading "Non-idempotent" comment on FC's pauseVM/resumeVM; no behavior change.
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Makes Cloud Hypervisor pauseVM / resumeVM idempotent by detecting and swallowing CH's InvalidStateTransition(<state>, <state>) 500 response, so a stuck-Paused VM (e.g., after a SIGKILL mid-snapshot) can recover without an extra vm.info round trip. Also corrects an inaccurate "Non-idempotent" comment on the Firecracker pause/resume helpers, since FC's vCPU event loop already acks redundant transitions.

Changes:

  • Add isAlreadyInStateError helper matching CH's exact 500 body via utils.APIError.Code + strings.Contains, and use it in pauseVM / resumeVM.
  • New TestIsAlreadyInStateError covering 9 cases (matches, wrong state, non-500, non-APIError, nil, wrapped error, empty state, etc.).
  • Update Firecracker pauseVM / resumeVM doc comments to reflect that FC's pause/resume is inherently idempotent.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
hypervisor/cloudhypervisor/helper.go Adds error-matching helper and wires it into pauseVM/resumeVM to swallow same-state transitions.
hypervisor/cloudhypervisor/helper_test.go New unit tests for isAlreadyInStateError.
hypervisor/firecracker/api.go Corrects misleading "Non-idempotent" doc comments on FC pause/resume.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@CMGS CMGS merged commit 5b6964e into master May 15, 2026
8 checks passed
@CMGS CMGS deleted the fix/pause-resume-idempotent branch May 15, 2026 09:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants