Skip to content

feat: cross-node migration (nodeName affinity + migration state machine)#11

Merged
CMGS merged 2 commits into
mainfrom
feat/cross-node-migration
Jul 2, 2026
Merged

feat: cross-node migration (nodeName affinity + migration state machine)#11
CMGS merged 2 commits into
mainfrom
feat/cross-node-migration

Conversation

@tonicmuroq

Copy link
Copy Markdown
Contributor

Operator side of cross-node migrate(vmname, node): the control plane patches CocoonSet.spec.nodeName, the operator does the rest.

What

  • Placement (buildAgentPod): the main agent (slot 0) gets a required hostname nodeAffinity from spec.nodeName instead of a hard NodeName bind — it lands on the target only if it fits and the node is schedulable, else stays Pending (respects capacity/cordon, no OOM). Sub-agents keep their hard-bind to the main's node.
  • Migration state machine (reconcileMigration): a pure observation function over durable state (spec.nodeName, the pod, the epoch :hibernate snapshot) — set internal hibernate annotation → wait for snapshot → delete old pod → recreate on target with restore-from-hibernate → wait for the restored VMID → drop the snapshot. Idempotent and crash-recoverable; runs before applyUnsuspend so its hibernate annotation isn't cleared mid-flight. Ordering gates: old pod deleted only after the snapshot lands; snapshot dropped only after the new VM has a fresh VMID. Surfaces CocoonSetPhaseMigrating. Scoped to the main agent (one VM per CocoonSet).
  • Imports the regenerated CocoonSet CRD.

Dependency

Depends on cocoonstack/cocoon-common#3 (spec.nodeName + Migrating phase). go.mod pins the branch commit via pseudo-version; bump to the cocoon-common release tag after #3 merges.

Tests

migrate_test.go (7 transitions incl. both ordering gates), pods_test.go (3 affinity cases); full suite + make lint clean on linux + darwin.

Not in scope

Control-plane migrate API + IP backfill + involuntary-eviction reconcile (simular-pro-vm-service); end-to-end + crash-injection tests (need a cluster).

@CMGS CMGS force-pushed the feat/cross-node-migration branch from 2edf006 to 3b78805 Compare June 29, 2026 08:25
@CMGS

CMGS commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Rebased onto main (code-style cleanup + dep bumps to common v0.2.2 / epoch v0.2.4 — the two embedded deps commits collapsed into main's single bump) and pushed a fix commit (8ba5d6a) from a strict review:

  • CRD was missing the Migrating phase enum. The operator's config/crd/bases is the deployed artifact (via config/default kustomize; common's CRD only reaches the cluster through make import-crds), and it was stale — the PR hand-added nodeName but the phase enum was [Pending…Suspended, Failed] with no Migrating. So apiserver rejects every phase=Migrating status PATCH → reconcileMigration error-loops and migration never progresses (the unit tests use a fake client that skips enum validation, so they stayed green and hid it). Ran make import-crds to regenerate from v0.2.2 — re-adds Migrating (+ the macos OS enum that had also drifted), keeps nodeName.
  • case-1 delete/recreate loop. The snapshot branch's main.Spec.NodeName != desired deletes the just-recreated restore pod while it's still unscheduled (NodeName == "" — the main==nil branch builds it with affinity, not a hard bind), looping forever. Gated on NodeName != "" (mirrors the !snap branch's existing :48 fallback) + added TestMigrationDoesNotDeleteRecreatedRestorePod (the existing cases all preset a NodeName, so the bug slipped through).
  • nits: migratingmarkMigrating, error-wrap, requeueMigratePoll const.

One thing left for you — needs vk-cocoon context I can't verify locally:

reconcileMigration trusts a pre-existing :hibernate tag (HasManifest) to go straight to the destructive delete+restore. But the operator's own applyUnsuspend only does PatchHibernateState(false) — it does not drop the :hibernate tag. So the sequence suspend (creates tag) → unsuspend (tag survives) → set spec.nodeName → migration sees snap=trueskips taking a fresh snapshot → deletes the live pod → restores the stale snapshot = data loss since the last suspend. It's only safe if vk-cocoon drops the :hibernate tag when it wakes a VM via annotation removal. Could you confirm that — or have migration unconditionally reset the hibernate annotation and wait for a fresh snapshot before the destructive step?

Otherwise #11's migration flow is now correct (CRD + loop fixed, rebase verified intact).

@CMGS

CMGS commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Rewrote the branch on current main (5485764), single commit, design and flow preserved:

  • epoch → OCI registry: probes via the shared hasHibernateSnapshot (same helper as feat(cocoonset): restore hibernated agents from :hibernate on (re)create #14's producer); drops via Registry.DeleteManifest. No more r.Epoch (removed in the OCI cutover).
  • Reuses feat(cocoonset): restore hibernated agents from :hibernate on (re)create #14's infra: meta.MarkRestoreFromHibernate typed setter on the recreate; suspend/CR interplay leans on the merged producer semantics.
  • Hardening found in review (all with regression tests, 18 tests total):
    1. registry probe error now owns the reconcile — before, it fell through and applyUnsuspend could unwind the migration or fresh-boot over the only snapshot;
    2. a :hibernate tag on a never-quiesced pod is treated as a leftover (suspend→unsuspend never deletes the tag) and dropped, instead of deleting a live pod and restoring stale state;
    3. re-targeting nodeName back mid-migration wakes the pod in place instead of deadlocking in Migrating (CR-owned hibernation excluded);
    4. clearing nodeName in the deleted-pod window finishes the restore instead of stranding the snapshot;
    5. steady-state pinned sets skip the per-reconcile registry probe (Migrating phase is persisted before the first side effect).

Follow-up (not in this PR): a migration timeout + Warning events like the hibernation controller's, and sub-agent migration (currently scoped out, one-VM-per-set model).

@CMGS CMGS force-pushed the feat/cross-node-migration branch from 8ba5d6a to e71b762 Compare July 2, 2026 05:42
@CMGS

CMGS commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Live E2E on the GKE cluster (operator oci-e71b762, CRD updated):

create e2e-mig on node-1 -> write marker in guest -> patch spec.nodeName=node-2
[3s]  phase=Scaling   node=node-1  (generation bump)
[6s]  phase=Migrating node=node-1  hibernate=true          <- startMigration
[12s] phase=Migrating node=node-2  restore-from-hibernate=true  <- teardown + recreate
[15s] phase=Migrating node=node-2  ready=true              <- restored
marker read back on node-2: MATCH  ==> cross-node migration with state preserved, 15s e2e

Two pre-existing gaps surfaced by the final drop-snapshot step (not this PR's logic):

  1. IAM: the operator runs as cocoon-ar-writer@ which lacks artifactregistry.repositories.deleteArtifacts — every DeleteManifest 403s, so the migration correctly refuses to settle (phase stays Migrating, snapshot kept). Same known gap that already breaks hibernation-wake tag cleanup + CocoonSet teardown GC. Needs roles/artifactregistry.repoAdmin on the repo.
  2. Observability: main.go sets crlog.SetLogger(logr.Discard()), so controller-runtime swallows every reconcile error — the 403s were invisible in the logs. Worth a small follow-up PR wiring controller-runtime's logger to core/log.

Once the IAM grant lands the full loop (drop + settle to Running) can be re-validated; everything up to that point is verified.

@CMGS CMGS force-pushed the feat/cross-node-migration branch 2 times, most recently from 270057a to 461ba90 Compare July 2, 2026 07:44
@CMGS

CMGS commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

IAM granted (cocoon-ar-writer@repoAdmin) — full E2E now settles completely (operator oci-270057a):

[6s]  Migrating  hibernate=true         [12s] pod on node-2, restore=true
[15s] Running    SETTLED                AR :hibernate tag = 0 (dropped)
marker written pre-migration read back on node-2: MATCH — 15s end-to-end

Also folded the observability follow-up into this PR (461ba90): crlog was logr.Discard(), hiding every reconcile error (the 403s above were invisible) — now bridged to core/log (errors always surface, nil-err anomaly reports → Warn, V(1)+ chatter dropped). Verified live: controller-runtime startup/EventSource lines now appear in the operator log.

Final branch passed two full review rounds (3-lens on the port + a 2-agent verification pass on the final form), 22 tests, lint 0 issues on both GOOS. Operator reverted to merged main (oci-5485764) pending review of this PR; the CRD in-cluster already carries spec.nodeName (optional, inert).

@CMGS CMGS force-pushed the feat/cross-node-migration branch from 461ba90 to a8f6147 Compare July 2, 2026 07:51
tonicmuroq and others added 2 commits July 2, 2026 15:55
…state machine)

Rewritten on current main atop the merged restore-from-hibernate producer (#14):
the control plane patches CocoonSet.spec.nodeName and the operator hibernates
the main agent, waits for the :hibernate snapshot in the OCI registry, deletes
the old pod, recreates it with hostname nodeAffinity + restore-from-hibernate,
and drops the snapshot once the restored VM runs with a fresh VMID. Decisions
are pure functions of durable state, so every step is idempotent and
crash-recoverable.

Hardening over the original branch:
- a registry probe error owns the reconcile — falling through would let
  applyUnsuspend unwind the migration or fresh-boot over the only snapshot
- a :hibernate tag on a never-quiesced pod is a leftover and is dropped,
  not restored
- re-targeting nodeName back mid-migration wakes the pod in place instead of
  deadlocking; CR-owned hibernation short-circuits before the registry probe
- clearing nodeName in the deleted-pod window finishes the restore instead of
  stranding the snapshot
- steady-state pinned sets skip the registry probe; a CR wake mid-flight is
  not repainted as a migration

Scoped to the main agent (slot 0); sub-agents follow via their hard bind.
crlog was set to logr.Discard(), so every reconcile error controller-runtime
retried (returned by reconcilers, not logged at call sites) vanished — the
migration E2E's registry 403s were invisible. Bridge logr to core/log: errors
always forwarded (nil-err anomaly reports downgrade to Warn since core/log
drops nil-err Error lines), V(0) info kept, V(1)+ internals chatter dropped.
@CMGS CMGS force-pushed the feat/cross-node-migration branch from a8f6147 to e12f600 Compare July 2, 2026 07:56
@CMGS CMGS merged commit 59e3a72 into main Jul 2, 2026
1 of 2 checks passed
@CMGS CMGS deleted the feat/cross-node-migration branch July 2, 2026 07:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants