Skip to content

docs(known-issues): clarify clone/restore --cpu/--memory cannot grow#26

Merged
CMGS merged 1 commit into
masterfrom
docs/clone-resource-immutable
May 5, 2026
Merged

docs(known-issues): clarify clone/restore --cpu/--memory cannot grow#26
CMGS merged 1 commit into
masterfrom
docs/clone-resource-immutable

Conversation

@CMGS
Copy link
Copy Markdown
Contributor

@CMGS CMGS commented May 5, 2026

Summary

Empirical verification during the overnight E2E regression showed that the existing "Clone/restore disk queue count is immutable" entry was misleading: the implication was that `--cpu` does change vCPU count and only disk `num_queues` lags. Actual behavior — cocoon patches `config.json` correctly (`boot_vcpus: 4`, `memory.size: 2147483648`) but the guest still sees the snapshot's original `nproc` and `MemTotal`. CH's `vm.restore` reconstructs the guest from binary memory file and saved CPU state, both sized at snapshot time.

Update the entry to:

  • Rename to "Clone/restore CPU and memory cannot grow"
  • Add a side-by-side table showing the divergence (config.json vs guest view)
  • Clarify the disk `num_queues` lag as a downstream effect of vCPU count
  • Keep the "no hot-add at restore" workaround note

No code change.

Test plan

  • Verified empirically: `vm run --cpu 2 --memory 1G` → snapshot → `vm restore --cpu 4 --memory 2G`. config.json shows `boot_vcpus: 4`, `memory.size: 2147483648`. Guest sees `nproc=2`, `MemTotal=982240 kB`. (E2E SNAP_04, SNAP_06)
  • No code change; `make lint` not impacted

Previous entry implied --cpu changes vCPU count but disk num_queues
follows. Empirical verification (E2E run) shows neither vCPU count
nor memory size actually grow on the guest side: cocoon patches
config.json correctly but CH's vm.restore reconstructs from binary
state at snapshot's original size. Update the entry with a clear
table and the corrected consequence.
@CMGS CMGS merged commit 7609f43 into master May 5, 2026
4 checks passed
@CMGS CMGS deleted the docs/clone-resource-immutable branch May 5, 2026 18:17
CMGS added a commit that referenced this pull request May 6, 2026
* refactor: drop --cpu/--memory from clone/restore

Both hypervisors restore guest CPU and memory from the snapshot's binary
memory image; values written into config.json or rejected at the FC API
do not change what the guest sees. Removing the flags eliminates the
illusion of overridable resources and the dead validation paths that
went with them.

CLI: removes --cpu/--memory from clone and restore. mergeResourceFlags
collapses to mergeStorageFlag.

CH: patchOptions drops cpu/memory/windows fields; patchCHConfig drops
its rawData parameter (no callers passed bytes); patchMemoryAndBalloon
and patchBalloonRaw deleted.

FC: validateRestoreOverrides and the cloneAfterExtract CPU/memory mismatch
checks deleted; RestoreSpec/DirectRestoreSpec.OverrideCheck removed.

Docs: README clone/restore tables and snapshot section updated;
KNOWN_ISSUES drops the entry from PR #26 and the stale FC bullet.

* refactor: drop --storage and --nics from clone/restore too

All clone/restore resources now inherit verbatim from the snapshot. Both
hypervisors restore guest state from the snapshot's binary device state,
so storage and NIC count are equally fixed at snapshot time — same root
cause as CPU/memory.

The hot-swap NIC logic in clone stays: clone always rebuilds NICs at the
snapshot's count to fix the MAC binding issue (binary device state holds
the snapshot-time MACs).

CLI: drops --storage from restore and --storage/--nics from clone.
mergeStorageFlag and validateFCCloneOverrides deleted (fully dead).
RestoreVMConfigFromFlags moves the NIC-count guard inside, eliminating
two duplicate checks in run.go.

Backends: removes the four storage-resize call sites in clone/restore
(CH and FC). qemuExpandImage / ExpandRawImage stay — still used by vm
run for fresh COW sizing. CH restoreAfterExtract loses an unused
cowPath parameter.

Docs: README clone/restore tables and KNOWN_ISSUES updated to state
that all resources inherit from the snapshot.

* docs: add Restore Flags table to README

* fix: restore record reflects snapshot resources, not stale VM record

RestoreVMConfigFromFlags previously copied vm.Config (current record) and
passed it to FinalizeRestore, which writes it back as the persisted
config. With a same-lineage snapshot the values match, but two paths
break the invariant:

  - vm restore --from-dir --force on a foreign lineage with the same
    NIC count silently overlays snapshot resources onto the guest while
    the record retains the old VM's CPU/memory/storage.
  - Records carrying drift from older builds (when --cpu / --memory on
    restore was a CLI knob that never actually grew the guest) keep
    propagating the wrong numbers through every subsequent restore.

Fix: build VMConfig from snapCfg.Config so the persisted record matches
what the hypervisor reconstructs in the guest. VM identity (Name) stays;
--on-demand still comes from the flag.

* docs: clarify restore resources come from snapshot, NICs must match VM

* fix: preserve VM Network on restore

Restoring from a snapshot reuses the VM's existing CNI namespace, TAPs,
and IP allocation — those were bound to vm.Config.Network at create
time. The previous fix copied snapCfg.Config wholesale into the
persisted record, which would overwrite Network too. With a foreign
--from-dir --force snapshot whose source VM ran on a different CNI
conflist, the record would advertise a CNI selection that the live
namespace doesn't actually use.

Other Config fields (CPU/Memory/Storage, Image*, QueueSize, Windows,
SharedMemory, NoDirectIO) genuinely follow the snapshot — the guest is
literally reconstructed from snapshot state — so they keep coming from
snapCfg.

* docs: clarify snapshot metadata records resource topology

* fix: PR review fallout — clone flag table, restore wording, snap validation

Three review fixes:

1. Clone Flags table missed --on-demand and --from-dir even though
   addCloneFlags registers both — added rows so the README matches the
   CLI surface.

2. KNOWN_ISSUES'\'' resource section claimed restore inherits NIC count
   from the snapshot. Restore actually requires the snapshot NIC count
   to match the target VM (mismatch is rejected); reworded.

3. RestoreVMConfigFromFlags now calls VMConfig.Validate() on the
   assembled config so a tampered or malformed --from-dir --force
   snapshot.json (negative CPU, zero memory, etc.) errors out before
   FinalizeRestore overwrites the VM record.

* fix: defer Validate to after clone name default

CloneVMConfigFromFlags called Validate() before prepareClone applied the
'cocoon-clone-<id>' default name, so 'cocoon vm clone <snap>' without
--name failed with 'vm name cannot be empty' instead of taking the
default. Move Validate to prepareClone, right after the default is
filled in. Other Config fields are still pre-validated transitively
(snapshot save validates them at write time).

* docs: trim comments in clone/restore helpers
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant