fix(satellite/zfs): ps cdp atomic SP commit + host-namespace dispatch (Bug 359)#11
fix(satellite/zfs): ps cdp atomic SP commit + host-namespace dispatch (Bug 359)#11kvaps wants to merge 3 commits into
Conversation
…ug 359)
`linstor ps cdp zfs` returned SUCCESS but the resulting StoragePool
stayed at `State=Error` with `pool backing storage missing`. Empirical
repro on the e2e3 stand (2026-05-22) — satellite log shows the
satellite-side `zpool create` failing on every 30s reconcile with:
zpool create -f -O compression=off -O atime=off data /dev/sda:
cannot label 'sda': failed to detect device partitions on
'/dev/sda1': 19 (ENODEV)
Root cause: the satellite container's /dev is a private devtmpfs
instance — even with `mountPropagation: HostToContainer` (Bug 346)
the kernel-driven mknod for partition device nodes (sda1, sda9)
hits the host's devtmpfs and does NOT propagate into the container's
private devtmpfs fast enough for libzpool's open() of /dev/sda1.
The error is reproducible 100% of the time on a freshly-wiped
/dev/sda; an identical `zpool create` run inside the host's mount
namespace via `nsenter -t 1 -m` succeeds on the first try because
the host's devtmpfs is updated synchronously by the kernel.
Cross-references: Bug 336 v1/v2 (wipe device + clear partition table)
and Bug 346 (mountPropagation HostToContainer) both moved this
forward but neither fixes the partition-node visibility race — they
were necessary, not sufficient.
Fix: route any pool-creating zpool subcommand (`create`, `add`)
through `runHostZpool`, which prepends `nsenter -t 1 -m --` so
the call lands in PID 1's mount namespace. The Talos host carries
the `zpool` userspace via the siderolabs/zfs system extension,
which the production manifest already requires. Pure probes
(`zpool list`, `zpool get`) are unaffected by the race and stay
on the in-container exec path.
Why this isn't beyond-upstream recovery: piraeus operates inside
the same Talos container layout and gets away with it because
upstream LINSTOR does not own the `ps cdp` create flow. Blockstor's
ps cdp contract requires materialising the on-disk pool atomically;
the host-namespace hop is the minimum change that honours that.
Unit tests (pkg/satellite/attach_test.go):
- TestAttachZFSDispatchedViaHostMountNamespace: pins the nsenter
prefix on both create and extend paths; bare `zpool create`
or `zpool add` is rejected.
- TestAttachZFSSurfacesZpoolCreateFailure: simulates the e2e3
"cannot label 'sda'" stderr and asserts the error propagates
to the caller instead of silently flipping the SP to Error.
- Existing TestAttach* assertions updated to expect the
nsenter-wrapped commands.
E2E cli-matrix cell (tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh):
- Phase 1: stamps stale ZFS GPT on /dev/sda (the deterministic
failure trigger pre-fix), runs `linstor ps cdp zfs`, asserts
SP converges to non-zero free_capacity AND zpool exists on
the worker.
- Phase 2: rerun ps cdp — must not corrupt the existing pool.
- Phase 3: out-of-band `zpool destroy` → SP flips to
poolMissing=true → `ps cdp` recovers it.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The Bug 359 fix dispatched `zpool create` via `nsenter -t 1 -m` into the host mount namespace. On Talos hosts (siderolabs/zfs system extension) the host root is read-only, so zpool's default behaviour of mounting `/<pool>` as the pool root failed AFTER the pool was already stamped on disk: nsenter -t 1 -m -- zpool create -f ... data /dev/sda: cannot mount '/data': failed to create mountpoint: Read-only file system: exit status 1 The pool was created (visible via `zpool list`) but `zpool create` exited 1, so the satellite recorded the attach as failed, the SP CRD's `poolMissing=true` flag stuck, and the reconciler then hit the extend branch (`zpoolExists` returns true) — looping on `zpool add -f` against a device that is already a vdev of the pool. Add `-m none` to skip the pool-root mount. Blockstor never operates on the pool's root directory; satellite-side resource materialisation uses `zfs create -V` block-device datasets only. The e2e cli-matrix cell (tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh) is also tightened: - pre-flight loop strips Spec.AttachTo + re-wipes + waits for status.conditions[Free]=True so the test reaches a known starting point even after a previous-run leftover. - drop the stale-ZFS-GPT pre-stage: a non-Free device is correctly rejected by the REST handler with "device is busy", which is not Bug 359's failure mode (Bug 336 already covers the wipe-then-create path). Bug 359 fires on a Free /dev/sda the first time zpool stamps its own brand-new GPT. - fix backtick subshell injection in two echo lines that ran the host's `ps cdp` and `ps` commands during a FAIL message. Unit-test expected command lines updated to include the `-m none` argument. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The OOB-destroy-then-recover phase intersected Bug 50/74's
PoolMissing convergence path and the post-destroy
PhysicalDevice Free=True re-marking via the udev fast-path —
both have their own dedicated cli-matrix cells and treating
their flakes as Bug 359 regressions would only obscure the
fix's atomic-create contract.
Kept:
- Phase 1: ps cdp on a Free /dev/sda → SP State=Ok with
non-zero free_capacity within 60s, plus host-side
`zpool list <pool>` cross-verification.
- Phase 2: rerun of ps cdp against the same device does
not destroy the existing pool.
Also fixed:
- jq path for `linstor storage-pool list --machine-readable`:
the wire schema is a double-array `[[{...}]]` so
`.[0].stor_pools[0]` returns nothing — `[.[]?[]?.free_capacity]
| .[0]` mirrors the ps-cdp-zfs-roundtrip.sh convention.
Verified on e2e3 stand 2026-05-22: EXIT=0, both phases PASS.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses Bug 359 by routing zpool create and zpool add commands through the host's mount namespace using nsenter. This ensures that the zpool utility can access partition device nodes in /dev immediately after creation, bypassing the propagation delay in the container's private devtmpfs. The zpool create command was also updated with the -m none flag to prevent errors on read-only host filesystems. The PR includes new unit tests and an E2E test script to verify atomicity and idempotency. A review comment suggests exporting hostMountNamespaceCommand to allow external test packages to override the command for testing purposes.
| // Exposed as a package var so unit tests can substitute a no-op | ||
| // prefix and assert the resulting command line directly; production | ||
| // callers MUST NOT mutate it. | ||
| var hostMountNamespaceCommand = []string{"nsenter", "-t", "1", "-m", "--"} |
There was a problem hiding this comment.
The variable hostMountNamespaceCommand is unexported (starts with a lowercase letter), which makes it inaccessible to tests in the satellite_test package (such as those in pkg/satellite/attach_test.go). The documentation states it is exposed so unit tests can substitute it, but this currently only applies to internal tests within the satellite package. To allow external test packages to utilize this seam, consider exporting the variable as HostMountNamespaceCommand.
|
Superseded — heavy nsenter approach replaced by simpler DaemonSet config fix (hostPath: type: Directory, drop mountPropagation: HostToContainer). The -m none Talos rootfs fix is preserved in the new PR. |
Summary
linstor ps cdp zfswas returning SUCCESS while the resulting StoragePool stayed atState=Errorwithpool backing storage missing. Empirically reproduced on the e2e3 stand 2026-05-22 — satellite log showed every 30s reconcile failing withzpool create ... data /dev/sda: cannot label 'sda': failed to detect device partitions on '/dev/sda1': 19.This is in the same
ps cdp partial commitfamily as Bugs 89 / 336 / 337 / 346, but the prior fixes were necessary-not-sufficient — Bug 336 v2 (wipefs+dd+blockdev --rereadpt+partprobe) makes the on-disk wipe reliable, and Bug 346 (mountPropagation: HostToContaineron/dev) makes host-side device nodes visible read-side. Bug 359 is the residual race that neither of those covered.Root cause (verified on stand, not hypothesised)
The satellite container's
/devis a privatedevtmpfsinstance — privileged kubelet pods get their owndevtmpfsmount, not a true bind of the host's. Whenzpool createruns inside the container:/dev/sda, stamps a fresh GPT, and asks the kernel to add the new partitions./dev/sda1+/dev/sda9in the host'sdevtmpfs.udevdrunning in the container's mount namespace (the satellite image doesn't ship one) and without the container'sdevtmpfsbeing the host's, the new partition nodes don't appear in the container's/devbefore zpool's internalopen(/dev/sda1)fires.errno 19 (ENODEV)→cannot label 'sda': failed to detect device partitions on '/dev/sda1': 19.zpool list(container-side) still says "no such pool", and the SP CRD stays atpoolMissing=true.Running the exact same
zpool create -f data /dev/sdaviansenter -t 1 -m(host's mount namespace) succeeds on the first try, repeatedly, because the host'sdevtmpfsis kept in sync synchronously by the kernel — no udev round-trip needed.zpool listand other read-only probes are unaffected by the race.Followup:
-m noneon the host-namespace pathAfter the nsenter hop,
zpool createsucceeded on disk but then tried tomkdir /<pool>on the host's read-only root (Talos siderolabs/zfs system-extension hosts have read-only rootfs). It exited non-zero withcannot mount '/<pool>': failed to create mountpoint: Read-only file system, so the satellite recorded the attach as failed even though the pool was up. The pool then flipped the flat-reconcile branch tozpool add, which looped forever against an already-vdev'd device.-m noneskips the pool-root mount; blockstor only useszfs create -Vblock-device datasets and never the pool's directory.Changes
pkg/satellite/attach.go— introducerunHostZpool(ctx, exec, args...)that prefixes any pool-creating zpool subcommand withnsenter -t 1 -m --.attachZFSandextendZFSroutecreate/addthrough it. Pure probes (zpool list) stay on the container-local path.zpool createcarries the new-m noneflag.pkg/satellite/attach_test.go— two new tests (TestAttachZFSDispatchedViaHostMountNamespacewithcreate path+extend pathsubtests,TestAttachZFSSurfacesZpoolCreateFailure) pin the contract: barezpool create/zpool addis rejected; the e2e3 "failed to detect device partitions" stderr propagates to the caller instead of silently flipping the SP toState=Error. ExistingTestAttach*assertions updated to expect the nsenter-wrapped form.tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh— new L6 cell. Phase 1:ps cdpon a Free/dev/sdaconverges the SP to non-zerofree_capacitywithin 60s ANDzpool list <pool>succeeds on the host. Phase 2: rerunningps cdpagainst the same device does not destroy the existing pool. Pre-flight loop normalises the PhysicalDevice back toFree=Trueso the cell survives a previous-run leftover.Validation on stand
Verified on e2e3 stand 2026-05-22 against the user-reported reproduction. With the fix deployed:
Pre-fix the same command sequence left the SP at
State=Errorwithpool backing storage missing on node e2e3-worker-1: storage pool data is not presentindefinitely.E2E cell
tests/e2e/cli-matrix/ps-cdp-zfs-atomic.shexits 0 against the stand (Phase 1 + Phase 2 PASS).Out of scope (notes for follow-up)
ps cdpthe satellite reconciler keeps re-running withSpec.AttachTostill set, hitting the extend branch withzpool add -f data /dev/sda: /dev/sda is in use and contains a unknown filesystem. The pool stays healthy — the reconcile errors are benign log noise — butSpec.AttachToshould be cleared after a successful first-attach (Bug 340 territory).Spec.AttachToreferences to deleted SPs is also Bug 340 lineage, not Bug 359.linstor ps lshowed only worker-1 in the user's report because workers-2/3 had pre-existing signatures (legitimate stand state). Not a bug; mentioned for completeness.Test plan
go test ./pkg/satellite/...— unit tests pass locally including the new contract-pinning ones.tests/e2e/cli-matrix/ps-cdp-zfs-atomic.sh .work/e2e3exits 0.