Primitives for qemu data disk reuse#28
Merged
Merged
Conversation
added 3 commits
May 13, 2026 08:05
Adds an operator-facing primitive that exposes the labeled
data-volume mechanic the appliance flow uses in cloud, but
through the local qemu provisioner. The maintainer no longer
has to round-trip through prepare-export + GCP / Hetzner
import to verify a disk-reuse scenario.
Surface:
- QEMUConfig.DataDisk: path to a qcow2 attached as a
LABEL=y-cluster-data ext4 volume. Empty (default) keeps
today's behaviour. Relative paths resolve against the
config-file directory (configfile DirAware convention).
- QEMUConfig.DataDiskSize: size for a freshly-created
DataDisk (default 10G; ignored when the file exists or
when DataDisk is empty).
Behaviour:
- Provision creates the file at the configured path with
qemu-img create + virt-format if it doesn't exist;
reuses it as-is when it does (the whole point of the
primitive).
- The cloud-init seed gains a `mounts:` entry pointing
LABEL=y-cluster-data at /data/yolean with `nofail`, so
the kernel mounts the labeled volume on every boot
without needing prepare-export to have pre-baked the
fstab line.
- Teardown leaves the DataDisk file alone (it lives
outside CacheDir, isn't in perVMArtefacts, and we log a
"data disk preserved" line so the operator sees the
path on the next teardown / re-provision pair).
- virt-format is a conditional prereq: only required when
the DataDisk file doesn't exist yet (existing-disk
attachment uses qemu's built-in qcow2 support).
Coverage:
- Unit: TestFromConfig_DataDiskDefaults pins
absolute / relative / empty path semantics and the
"no surprise default" rule for DataDiskSize when
DataDisk is empty. TestRenderCloudInitUserData_*Mount*
pins the cloud-init mount block. TestEnsureDataDisk_*
pins the preserve-existing-bytes contract and the
empty-path guard. Tag-driven schema regen included.
- e2e (e2e && kvm): TestQemu_DataDisk_ReuseAcrossProvisions
exercises the full round-trip: provision -> write
sentinel under /data/yolean -> teardown (NO keep-disk)
-> assert DataDisk survives + boot disk gone ->
re-provision same DataDisk -> read sentinel back.
Closes the coverage gap where the only existing
appliance-volume e2e went through prepare-export +
seed-tarball (a different code path).
Two of the three FRs from
specs/y-cluster/FEATURE_REQUEST_DISK_REUSE_E2E_PRIMITIVES.md
the checkit maintainer filed against the local-qemu disk-reuse
flow. FR 2 (`--first-boot-only`) is deferred -- the spec marks
it as needing a design pass on marker storage / mount semantics /
verb shape, and that conversation hasn't happened yet.
FR 1 -- `manifests replace` + `manifests rm`
============================================
Strict-in-both-directions verb set. No --force.
- add : name must NOT be staged. As a re-run convenience,
an `add` with byte-identical content to an
already-staged file succeeds silently ("my script
ran twice, file unchanged" case) -- covers the
checkit appliance-init.sh re-run pattern without
weakening the strict-new contract.
- replace : name MUST be staged, overwrites. Bails loud when
the name isn't staged; the verb at the call site
documents the operator's intent ("I know this is
in use and I'm overwriting"). The same byte-equal
short-circuit applies here for "no change".
- rm : name MUST be staged, removes. Bails loud when the
name isn't staged.
The new prelude (readManifestInput) consolidates name-regex +
input-read + cluster-lookup + target-path so add/replace can't
drift on the validation steps. readStagedManifest /
writeStagedManifest / removeStagedManifest hold the cluster-
side shell commands in one place per verb.
Tests (cmd/y-cluster/manifests_test.go):
- TestManifestsAddCmd_RejectsInvalidName (existing, kept)
- TestManifestsReplaceCmd_RejectsInvalidName
- TestManifestsRmCmd_RejectsInvalidName
- TestStagedManifestPath (pins the single touchpoint between
the build-side verbs and prepare-export)
- TestManifestsCmd_Subcommands (pins the three-verb surface so
a future refactor can't accidentally drop one and silently
turn a verb into a --force equivalent)
FR 3 -- `import` accepts qcow2
==============================
`y-cluster import` and the underlying `qemu.Import` now sniff
the input extension and dispatch:
.vmdk -> qemu-img convert -f vmdk -O qcow2 (existing path)
.qcow2 -> qemu-img convert -f qcow2 -O qcow2 (new; usually a
copy + compaction, no foreign format conversion)
Lets a local-qemu-only e2e loop chain `y-cluster export
--format=qcow2` straight into `y-cluster import` without a
manual `qemu-img convert` step or `--format=vmdk` workaround.
The format-table lives in one switch (importFormatFromExt) so
adding raw / vdi / etc. later is a one-line change.
Other formats (raw, vdi, OVA, gcp-tar) intentionally not added:
they aren't on the import path today and the PoC doesn't need
them. The proposal sticks to formats `y-cluster export`
natively produces.
ImportVMDK kept as a deprecated thin wrapper around Import for
any out-of-tree caller pinned to the old name.
Tests (pkg/provision/qemu/qemu_test.go):
- TestImportVMDK_MissingVMDK (existing, kept)
- TestImport_MissingInput (both extensions)
- TestImportFormatFromExt (case-insensitive
happy paths; explicit reject of unsupported / no-ext)
e2e (e2e/qemu_test.go::TestQemu_ExportImport_Qcow2):
Full provision -> teardown(keep) -> export --format=qcow2 ->
delete original boot disk -> Import qcow2 -> re-provision and
SSH. Mirrors the existing VMDK-flavoured TestQemu_ExportImport,
but exercises the no-foreign-format path.
FR 2 -- `--first-boot-only`
===========================
NOT included. The spec calls out three open design questions
the checkit maintainer wants to settle with the y-cluster
maintainer before any implementation lands: where the marker
file lives (data disk vs y-cluster-managed boot dir), whether
the wrapper Job auto-supplies a ServiceAccount + hostPath for
/etc/machine-id, and whether the flag lives on `add` (with a
flag-mismatch error from `replace`) or a separate verb
`add-first-boot`. Worth a design pass before code.
Refs:
- specs/y-cluster/FEATURE_REQUEST_DISK_REUSE_E2E_PRIMITIVES.md
Pre-FR 4 the import -> boot path was a deadlock:
- `y-cluster import <disk>` writes <CacheDir>/<Name>.qcow2
- `y-cluster provision` errors: "disk already exists; run
teardown or start"
- `y-cluster start` errors: "kubeconfig context X has no
associated cluster; nothing to start" -- the disk has
never been booted so no context was ever merged in
The customer-side appliance-upgrade case ("new image, keep data
disk") cannot complete with this deadlock; checkit's
appliance-disk-reuse-test.sh Phase 3 bails here.
Fix: `qemu.Provision` now detects an already-staged
<CacheDir>/<Name>.qcow2 (the only state `import` leaves) and
branches into a STAGED-DISK path that:
- Skips `ensureCloudImage` + `ensureDisk` (the qcow2 IS the
boot disk; nothing to fetch or create).
- Still emits a fresh SSH keypair + cloud-init seed (so the
customer's pubkey lands in the imported image's
authorized_keys via cloud-init at first boot).
- Boots the VM and waits for SSH.
- Calls `waitForK3sReady` + `extractKubeconfig` + merges into
the host kubeconfig. The pre-baked k3s on the appliance
image is what we're attaching to.
- SKIPS `writeRegistries`, `installK3s`, `localstorage.Install`,
`envoygateway.Install`. Those were all baked in by the
SOURCE appliance build. Re-running them would clobber the
appliance's pre-baked state (overwrite GatewayClass /
parametersRef, re-apply install.yaml at a different version
than the appliance shipped, etc.).
The fresh-provision branch (no staged disk) is unchanged.
Coverage:
TestQemu_ExportImport_Qcow2 (already e2e && kvm-gated from
the FR 3 commit) exercises the full round-trip:
Provision -> Teardown(keep) -> Export qcow2 -> rm disk ->
Import qcow2 -> PROVISION (this is the FR 4 case) -> SSH +
kubectl get nodes. Pre-FR 4 the second Provision hit the
"disk already exists" error; with the staged-disk branch it
reaches the in-cluster surface and returns a kubeconfig
context that `kubectl get nodes` can talk to.
TestQemu_ExportImport (the VMDK-flavoured counterpart) takes
the same code path; same FR 4 fix unblocks it.
Refs:
- specs/y-cluster/FEATURE_REQUEST_DISK_REUSE_E2E_PRIMITIVES.md
(FR 4)
solsson
pushed a commit
to Yolean/ystack
that referenced
this pull request
May 13, 2026
Bumps host bin (bin/y-bin.runner.yaml) and the in-cluster y-kustomize Deployment image. v0.4.6 adds primitives for qemu data-disk reuse (Yolean/y-cluster#28). Image digest verified via `crane digest ghcr.io/yolean/y-cluster:v0.4.6`. sha256 sums copied from the release's checksums.txt.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.