feat: cross-node migration (nodeName affinity + migration state machine) #11
2edf006 to 3b78805
Rebased onto
One thing left for you — needs vk-cocoon context I can't verify locally:
Otherwise #11's migration flow is now correct (CRD + loop fixed, rebase verified intact).
Rewrote the branch on current main (
Follow-up (not in this PR): a migration timeout + Warning events like the hibernation controller's, and sub-agent migration (currently scoped out, one-VM-per-set model).
8ba5d6a to e71b762
Live E2E on the GKE cluster (operator
Two pre-existing gaps surfaced by the final drop-snapshot step (not this PR's logic):
Once the IAM grant lands, the full loop (drop + settle to Running) can be re-validated; everything up to that point is verified.
270057a to 461ba90
IAM granted (
Also folded the observability follow-up into this PR (
Final branch passed two full review rounds (3-lens on the port + a 2-agent verification pass on the final form), 22 tests, lint 0 issues on both GOOS.
Operator reverted to merged main (
461ba90 to a8f6147
…state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14): the control plane patches CocoonSet.spec.nodeName and the operator hibernates the main agent, waits for the :hibernate snapshot in the OCI registry, deletes the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate, and drops the snapshot once the restored VM runs with a fresh VMID. Decisions are pure functions of durable state, so every step is idempotent and crash-recoverable.

Hardening over the original branch:
- a registry probe error owns the reconcile: falling through would let applyUnsuspend unwind the migration or fresh-boot over the only snapshot
- a :hibernate tag on a never-quiesced pod is a leftover and is dropped, not restored
- re-targeting nodeName back mid-migration wakes the pod in place instead of deadlocking; CR-owned hibernation short-circuits before the registry probe
- clearing nodeName in the deleted-pod window finishes the restore instead of stranding the snapshot
- steady-state pinned sets skip the registry probe; a CR wake mid-flight is not repainted as a migration

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.
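The step ordering described in this commit message can be sketched as a single pure decision function. This is a minimal illustration only, not the operator's reconcileMigration: the Observed struct, its field names, the Step constants, and nextStep are all hypothetical, and the hardening cases (probe errors, leftover tags, re-targeting back, CR-owned hibernation) are deliberately left out.

```go
package main

import "fmt"

// Observed is a hypothetical view of the durable state read on each reconcile.
// Field names are illustrative; they are not the operator's own types.
type Observed struct {
	TargetNode     string // spec.nodeName ("" means unpinned)
	PodExists      bool   // the main agent pod (slot 0) exists
	PodNode        string // node the existing pod is on
	Hibernating    bool   // internal hibernate annotation is set on the pod
	Restoring      bool   // pod was created with restore-from-hibernate
	SnapshotExists bool   // :hibernate snapshot is present in the registry
	FreshVMID      bool   // restored VM reports a fresh VMID
}

// Step names the single action a reconcile would take next.
type Step string

const (
	StepNone         Step = "none"
	StepHibernate    Step = "set-hibernate-annotation"
	StepWaitSnapshot Step = "wait-for-snapshot"
	StepDeletePod    Step = "delete-old-pod"
	StepRecreate     Step = "recreate-on-target"
	StepWaitRestore  Step = "wait-for-restored-vmid"
	StepDropSnapshot Step = "drop-snapshot"
)

// nextStep is a pure function: the same observation always yields the same
// step, so re-running it after a crash resumes where the last run stopped.
func nextStep(o Observed) Step {
	if !o.PodExists {
		// Deleted-pod window: finish the restore if a snapshot is waiting.
		if o.SnapshotExists {
			return StepRecreate
		}
		return StepNone
	}
	if o.Restoring && !o.Hibernating {
		// Gate 2: the snapshot is dropped only after a fresh VMID is seen.
		if !o.FreshVMID {
			return StepWaitRestore
		}
		if o.SnapshotExists {
			return StepDropSnapshot
		}
	}
	if o.TargetNode == "" || o.PodNode == o.TargetNode {
		return StepNone // steady state, nothing to migrate
	}
	if !o.Hibernating {
		return StepHibernate
	}
	// Gate 1: the old pod is deleted only after the snapshot lands.
	if !o.SnapshotExists {
		return StepWaitSnapshot
	}
	return StepDeletePod
}

func main() {
	// Walk one migration from node-a to node-b, one observation per reconcile.
	walk := []Observed{
		{TargetNode: "node-b", PodExists: true, PodNode: "node-a"},
		{TargetNode: "node-b", PodExists: true, PodNode: "node-a", Hibernating: true},
		{TargetNode: "node-b", PodExists: true, PodNode: "node-a", Hibernating: true, SnapshotExists: true},
		{TargetNode: "node-b", SnapshotExists: true},
		{TargetNode: "node-b", PodExists: true, PodNode: "node-b", Restoring: true, SnapshotExists: true},
		{TargetNode: "node-b", PodExists: true, PodNode: "node-b", Restoring: true, SnapshotExists: true, FreshVMID: true},
		{TargetNode: "node-b", PodExists: true, PodNode: "node-b", Restoring: true, FreshVMID: true},
	}
	for _, o := range walk {
		fmt.Println(nextStep(o))
	}
}
```

Because the function keeps no state of its own, the two ordering gates reduce to plain conditions on what is observed: a delete is only ever returned when SnapshotExists is true, and a drop only when FreshVMID is true.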
crlog was set to logr.Discard(), so every reconcile error that controller-runtime retried (errors returned by reconcilers rather than logged at call sites) vanished; the migration E2E's registry 403s were invisible. Bridge logr to core/log: errors are always forwarded (nil-err anomaly reports downgrade to Warn, since core/log drops nil-err Error lines), V(0) info is kept, and V(1)+ internal chatter is dropped.
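The routing policy of that bridge can be sketched without any logging library. This is an illustration under assumptions, not the bridge itself: the Severity type and the routeError and routeInfo helpers are hypothetical names, and in a real sink the two decisions would sit behind the Error and Info methods of a logr.LogSink.

```go
package main

import (
	"errors"
	"fmt"
)

// Severity is a hypothetical stand-in for the levels the target logger exposes.
type Severity string

const (
	Error Severity = "error"
	Warn  Severity = "warn"
	Info  Severity = "info"
	Drop  Severity = "drop" // not forwarded at all
)

// errSample is an illustrative error, shaped like the registry failures
// mentioned in the commit message.
var errSample = errors.New("registry: 403 Forbidden")

// routeError decides what to do with a call to the sink's Error method.
// A nil err is downgraded to Warn so the line is not lost downstream.
func routeError(err error) Severity {
	if err == nil {
		return Warn
	}
	return Error
}

// routeInfo decides what to do with a call to the sink's Info method.
// Only verbosity 0 is kept; V(1) and above is treated as internal chatter.
func routeInfo(level int) Severity {
	if level > 0 {
		return Drop
	}
	return Info
}

func main() {
	fmt.Println(routeError(errSample))
	fmt.Println(routeError(nil))
	fmt.Println(routeInfo(0))
	fmt.Println(routeInfo(1))
}
```

Keeping the policy in two small functions makes it easy to unit-test separately from whatever formatting the real sink does.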
a8f6147 to e12f600
Operator side of cross-node migrate(vmname, node): the control plane patches CocoonSet.spec.nodeName, the operator does the rest.

What
- buildAgentPod: the main agent (slot 0) gets a required hostname nodeAffinity from spec.nodeName instead of a hard NodeName bind. It lands on the target only if it fits and the node is schedulable, else stays Pending (respects capacity/cordon, no OOM). Sub-agents keep their hard-bind to the main's node.
- reconcileMigration: a pure observation function over durable state (spec.nodeName, the pod, the epoch :hibernate snapshot): set internal hibernate annotation → wait for snapshot → delete old pod → recreate on target with restore-from-hibernate → wait for the restored VMID → drop the snapshot. Idempotent and crash-recoverable; runs before applyUnsuspend so its hibernate annotation isn't cleared mid-flight. Ordering gates: old pod deleted only after the snapshot lands; snapshot dropped only after the new VM has a fresh VMID. Surfaces CocoonSetPhaseMigrating. Scoped to the main agent (one VM per CocoonSet).

Dependency
Depends on cocoonstack/cocoon-common#3 (spec.nodeName + Migrating phase). go.mod pins the branch commit via pseudo-version; bump to the cocoon-common release tag after #3 merges.

Tests
migrate_test.go (7 transitions incl. both ordering gates), pods_test.go (3 affinity cases); full suite + make lint clean on linux + darwin.

Not in scope
Control-plane migrate API + IP backfill + involuntary-eviction reconcile (simular-pro-vm-service); end-to-end + crash-injection tests (need a cluster).
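For reference, the placement rule described for buildAgentPod (required hostname affinity for the main agent, hard bind for sub-agents) could look roughly like the sketch below. It is not the PR's code: placeAgent, Placement, and the mainNode parameter are hypothetical, and the struct types are local stand-ins that mirror the JSON shape of the Kubernetes core/v1 affinity types so the example runs without extra dependencies. It also assumes the node's kubernetes.io/hostname label value equals the name in spec.nodeName.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Local stand-ins mirroring the JSON shape of the core/v1 affinity types;
// a real operator would use k8s.io/api/core/v1 instead.
type NodeSelectorRequirement struct {
	Key      string   `json:"key"`
	Operator string   `json:"operator"`
	Values   []string `json:"values"`
}

type NodeSelectorTerm struct {
	MatchExpressions []NodeSelectorRequirement `json:"matchExpressions"`
}

type NodeSelector struct {
	NodeSelectorTerms []NodeSelectorTerm `json:"nodeSelectorTerms"`
}

type NodeAffinity struct {
	Required *NodeSelector `json:"requiredDuringSchedulingIgnoredDuringExecution,omitempty"`
}

type Affinity struct {
	NodeAffinity *NodeAffinity `json:"nodeAffinity,omitempty"`
}

// Placement is the part of a pod spec that decides where the pod may run.
type Placement struct {
	NodeName string    `json:"nodeName,omitempty"` // hard bind, skips the scheduler
	Affinity *Affinity `json:"affinity,omitempty"` // constraint the scheduler evaluates
}

// placeAgent is a hypothetical helper. Slot 0 is the main agent; mainNode is
// the node the main agent currently runs on (an assumed input).
func placeAgent(slot int, specNodeName, mainNode string) Placement {
	if slot != 0 {
		// Sub-agents stay hard-bound to the main agent's node.
		return Placement{NodeName: mainNode}
	}
	if specNodeName == "" {
		// Unpinned set: leave placement entirely to the scheduler.
		return Placement{}
	}
	return Placement{Affinity: &Affinity{NodeAffinity: &NodeAffinity{
		Required: &NodeSelector{NodeSelectorTerms: []NodeSelectorTerm{{
			MatchExpressions: []NodeSelectorRequirement{{
				Key:      "kubernetes.io/hostname",
				Operator: "In",
				Values:   []string{specNodeName},
			}},
		}}},
	}}}
}

func main() {
	out, err := json.MarshalIndent(placeAgent(0, "node-b", ""), "", "  ")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The difference matters because a pod with nodeName set is never evaluated by the scheduler, whereas a required affinity term is just another scheduling constraint, so a pod that cannot be placed on the target stays Pending.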