Summary
hibernation/wake.go:vmClonedAndRunning gates the wake fast-path on IsContainerRunning(pod) && ParseVMRuntime(pod).VMID != "". The correctness of this gate depends on an external invariant: vk-cocoon must clear the vm.cocoonstack.io/id annotation during hibernate, and re-write it only after a successful snapshot clone on wake.
// hibernation/wake.go
// during hibernate, vk-cocoon clears the VMID annotation; on wake it writes
// a new VMID only after the snapshot clone succeeds.
func vmClonedAndRunning(pod *corev1.Pod) bool {
	return meta.IsContainerRunning(pod) && meta.ParseVMRuntime(pod).VMID != ""
}
Problem
This contract is not expressed or enforced in any repo we can verify:
- cocoon-common/meta/vmruntime.go: VMRuntime.Apply uses setIfNotEmpty, so it can only write, never clear.
- cocoon-common/k8s/utils.go: PatchHibernateState only touches AnnotationHibernate.
- cocoon-operator: no code path clears VMID.
So the "clear on hibernate" step must live entirely in vk-cocoon (likely via a direct client-go patch, or via pod recreation). If vk-cocoon ever regresses — crashes mid-hibernate, partial hibernate, bug in the clear path — the operator's wake gate silently degrades back to the pre-82a9bc3 race: IsContainerRunning alone, which can flap during the pod-recreate → wake window.
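The failure mode can be reproduced with a reduced model of the two pieces involved. The sketch below is not the real code: it assumes setIfNotEmpty skips empty values as described above, replaces *corev1.Pod with a plain annotation map and a boolean, and uses a hypothetical constant name (annotationVMID) for the vm.cocoonstack.io/id key.

```go
package main

// annotationVMID is a hypothetical constant name for the VMID annotation key;
// the real constant lives in cocoon-common and may be named differently.
const annotationVMID = "vm.cocoonstack.io/id"

// setIfNotEmpty models the write-only behavior described in the issue:
// an empty value is skipped, so an existing annotation is never removed.
func setIfNotEmpty(annotations map[string]string, key, value string) {
	if value != "" {
		annotations[key] = value
	}
}

// vmClonedAndRunning is a reduced form of the operator's gate. It takes the
// container state and annotations directly instead of a *corev1.Pod.
func vmClonedAndRunning(containerRunning bool, annotations map[string]string) bool {
	return containerRunning && annotations[annotationVMID] != ""
}
```

Applying an empty VMID through setIfNotEmpty leaves the old value in place, so vmClonedAndRunning(true, annotations) still returns true for a pod whose VMID was never cleared. Only an explicit delete of the key makes the gate block, which is the step that currently has no owner outside vk-cocoon.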
Why this is worth tracking
- The assumption is written in a comment but not enforced by a test, schema, or contract.
- A regression in vk-cocoon would not produce any failing test in cocoon-operator — the gate just silently stops gating.
Options
1. Strong signal: introduce a vm.cocoonstack.io/hibernate-epoch annotation (monotonically incremented each hibernate/wake pair by vk-cocoon). The operator gates on epoch advancing, not on VMID presence. Survives transient VMID residue.
2. Contract test: add an integration / contract test that exercises vk-cocoon's hibernate path and asserts the VMID annotation is absent post-hibernate. Catches vk-cocoon regressions here, not months later in prod.
3. Defensive clear in operator: have the operator clear VMID itself during reconcileHibernate. Rejected for now: it makes VMID a shared-writer annotation and introduces its own races.
Option 1 is the cleanest long-term fix but requires a coordinated change across vk-cocoon, cocoon-common (annotation constant), and cocoon-operator.
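A minimal sketch of what the epoch gate from Option 1 could look like on the operator side. Everything here is an assumption for illustration: the annotation does not exist yet, parseEpoch and wakeCompleted are hypothetical helpers, and the sketch presumes the operator records the epoch it observed when it requested hibernation (for example in its own status) and passes it in as epochAtHibernate.

```go
package main

import "strconv"

// annotationHibernateEpoch is the key proposed in Option 1. It is not defined
// in cocoon-common today; the constant name is hypothetical.
const annotationHibernateEpoch = "vm.cocoonstack.io/hibernate-epoch"

// parseEpoch reads the epoch annotation. A missing or malformed value is
// treated as epoch 0 so that the gate fails closed.
func parseEpoch(annotations map[string]string) uint64 {
	epoch, err := strconv.ParseUint(annotations[annotationHibernateEpoch], 10, 64)
	if err != nil {
		return 0
	}
	return epoch
}

// wakeCompleted reports whether the wake fast-path may be taken: the container
// must be running and the epoch must have advanced past the value recorded
// when hibernation was requested. VMID presence is deliberately not consulted.
func wakeCompleted(containerRunning bool, annotations map[string]string, epochAtHibernate uint64) bool {
	return containerRunning && parseEpoch(annotations) > epochAtHibernate
}
```

With this shape, a leftover vm.cocoonstack.io/id value has no effect on the result. A pod whose epoch is still equal to epochAtHibernate is treated as not yet woken, even if its container reports running.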
Notes
- Surfaced during a /code review of HEAD~3..HEAD; deferred out of scope because the fix is cross-repo.