fix(rest): normalize DRBD_DISKLESS + pin toggle-disk state machine (corner H)#103
Conversation
|
Warning Review limit reached
More reviews will be available in 2 minutes and 27 seconds. Learn how PR review limits work. Your organization has run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces changes to handle the modern --drbd-diskless flag by normalizing the DRBD_DISKLESS wire flag to the canonical DISKLESS flag at the API boundary, accompanied by comprehensive unit and E2E tests for the sync-then-remove migration behavior. It also adds tests to verify the priority hierarchy of the auto-diskful property (Controller < RG < RD). The review feedback suggests adding a defensive nil check for the resource pointer in the normalization function and capturing command output in a variable within the E2E bash script to prevent exit status masking.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| func normalizeDisklessFlag(res *apiv1.Resource) { | ||
| if !slices.Contains(res.Flags, rscFlagDrbdDiskless) { |
There was a problem hiding this comment.
To adhere to defensive programming practices, it is recommended to add a nil check for the res pointer before accessing its fields. This prevents potential runtime panics if the function is called with a nil argument in the future.
| func normalizeDisklessFlag(res *apiv1.Resource) { | |
| if !slices.Contains(res.Flags, rscFlagDrbdDiskless) { | |
| func normalizeDisklessFlag(res *apiv1.Resource) { | |
| if res == nil || !slices.Contains(res.Flags, rscFlagDrbdDiskless) { | |
| return | |
| } |
| exit 1 | ||
| fi | ||
|
|
||
| if linstor_diskful_nodes "$RD" | grep -qx "$SRC"; then |
There was a problem hiding this comment.
Running linstor_diskful_nodes directly inside the if condition pipeline masks any command failures (even with pipefail active, the if statement suppresses set -e exits). If linstor_diskful_nodes fails, the test could silently pass. Consider capturing the output in a variable first to ensure failures are caught by set -e.
| if linstor_diskful_nodes "$RD" | grep -qx "$SRC"; then | |
| nodes=$(linstor_diskful_nodes "$RD") | |
| if echo "$nodes" | grep -qx "$SRC"; then |
|
Stand validation root-caused: the earlier |
|
Stand validation complete: the corrected replay ( |
31a3c53 to
02049ba
Compare
The modern `linstor r c <node> <rd> --drbd-diskless` CLI flag posts the wire flag DRBD_DISKLESS, while the deprecated `--diskless` alias posts the canonical DISKLESS (verified via `linstor --curl` against the upstream 1.33.2 oracle, client 1.27.1). blockstor's diskless-detection surface keys exclusively on the canonical DISKLESS spelling, so a replica requested with the recommended --drbd-diskless flag was mis-classified as diskful: the satellite would carve backing storage and the quorum/tiebreaker math would miscount it. Fold DRBD_DISKLESS into DISKLESS once at the create wire boundary so the rest of the pipeline sees a single spelling. De-duplicates when both spellings are present. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The Controller<RG<RD auto-diskful priority chain had no test for the MIDDLE (RG) layer: the existing RDWins test only covers RD-over-Controller. Add TestAutoDiskfulPropHierarchyRGWins (RG beats Controller when RD is unset) and TestAutoDiskfulPropHierarchyRDBeatsRG (RD beats both when all three are set), closing the lattice. Also document the auto-diskful trigger divergence (deficit-refill + immediate Primary-InUse promote vs upstream's timed Primary>N-min) and the unimplemented allow-cleanup gate as accepted delta #57, and the 1.34 stuck-toggle retry/cancel version-context as delta #56. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Add an L6 cli-matrix cell + L7 replay YAML for the toggle-disk --migrate-from migration (UG9 §"Migrating a resource to another node"). Both assert the sync-then-remove contract: the destination diskful replica is added (count 2->3) and reaches UpToDate BEFORE the migrate source is pruned, so the active diskful count never drops below the original 2 at any observed point. The landing pad uses --drbd-diskless to also exercise the H3 DRBD_DISKLESS normalisation on the same flow. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The H2 sync-then-remove replay/cli-matrix created the diskless landing pad with the modern `--drbd-diskless` flag, which posts the wire flag DRBD_DISKLESS. blockstor's diskless-detection surface keys on the canonical DISKLESS spelling, and the DRBD_DISKLESS->DISKLESS wire normalisation is the H3 fix in this same branch — NOT yet rolled out to the dev stand. On the deployed (pre-H3) image the landing pad was mis-classified as diskful and never reported diskState Diskless, so the replica_diskless await timed out before the migration step ever ran. Switch the landing pad to the deprecated `--diskless` alias, which posts the canonical DISKLESS directly, so the H2 sync-then-remove migration contract is validated on the currently-deployed stand image. H3's DRBD_DISKLESS normalisation remains pinned by the L1 unit test pkg/rest/resource_create_drbd_diskless_test.go, the correct tier for a wire-boundary flag canonicalisation. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
The auto-tiebreaker legitimately re-occupies the vacated node after the diskful source is pruned (2 diskful = even parity), so resource_absent can never hold on a 3-worker stand. Assert replica_diskless on the source plus active_diskful_count=2 instead — the actual sync-then-remove contract. Live-stand evidence: controller logs 'migration complete: src pruned, dst UpToDate' while the witness lands back on the source node. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
02049ba to
a885cd9
Compare
…nt (auto-TB parity) After r d leaves a 2-diskful RD, auto-add-quorum-tiebreaker (default-on, matching upstream LINSTOR) re-occupies the vacated node with a DISKLESS TieBreaker witness for quorum. The resource_absent asserts on the vacated node in remove-replica and r-c-with-tiebreaker-peer were therefore false FAILs of correct, parity-matching behaviour. Flip them to tiebreaker_present (same lesson as toggle-disk-migrate-from #103, n-autoplace-target-excludes #109). The load-bearing assertions (re-spawn DISKFUL, no permanent Connecting) are unchanged. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…to-TB-aware assertions, slow-stand timeouts (#110) * fix(operator-harness): vd_size_kib VOL env across pipe + {{rg}} substitution + expect_exit list vd_size_kib passed the target volume number as a `VOL=... linstor_cli | python3` env prefix. In a pipeline the prefix binds to the LEFT command, not python3 on the right, so os.environ['VOL'] raised KeyError, the parser fell through to print(0), and the assertion could never match — the P0 vd-resize lifecycle (and vd-resize-thick) were permanently red. Pass the volume number as argv to the parser instead. Add {{rg}} substitution (resolved from vars.rg, default rg-<name>-<rand>) so resource-group replays no longer create a group literally named '{{rg}}'. Add list-form expect_exit so idempotent steps whose exit code depends on shared-stand state (e.g. encryption create-passphrase: 0 on a fresh controller, 10 when a passphrase already exists) can accept either. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(operator-harness): correct CLI shapes in evacuate/luks/evict replays - auto-diskful-evicted-node teardown used `node evacuate --restore`, which client 1.27.1 does not accept; the inverse of `node evacuate` is `node restore`. - luks-encrypted-rd passed the passphrase as a positional to `encryption create-passphrase`; the CLI takes it via the -p flag. The literal value is inlined ({{passphrase}} was never a substituted var) and the step accepts exit 0 or 10 (passphrase may already exist on the shared stand). - n-evict-tiebreaker-no-shuffle: document that `node evict` is driven through the harness linstor_cli shim (PUT /v1/nodes/<n>/evict), not a native client verb. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(operator-harness): auto-diskful-evicted-node needs 4 nodes The workflow places a diskful on node1/node2/node3, then evacuates node3 and asserts the auto-diskful controller refills back to 3 diskful. On a 3-node stand the displaced replica has no healthy node to land on (every peer already hosts one), so replica_count can never return to 3 — the assertion times out as a false FAIL of correct controller behaviour. Require min_nodes: 4 so the refill targets the spare node4; on a 3-node stand the runner now SKIPs cleanly. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(operator-harness): flip vacated-node asserts to tiebreaker_present (auto-TB parity) After r d leaves a 2-diskful RD, auto-add-quorum-tiebreaker (default-on, matching upstream LINSTOR) re-occupies the vacated node with a DISKLESS TieBreaker witness for quorum. The resource_absent asserts on the vacated node in remove-replica and r-c-with-tiebreaker-peer were therefore false FAILs of correct, parity-matching behaviour. Flip them to tiebreaker_present (same lesson as toggle-disk-migrate-from #103, n-autoplace-target-excludes #109). The load-bearing assertions (re-spawn DISKFUL, no permanent Connecting) are unchanged. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(operator-harness): luks replay sets DrbdOptions/EncryptPassphrase before rd create A LUKS-layered RD requires the controller property DrbdOptions/EncryptPassphrase in addition to the cluster passphrase; without it the controller correctly rejects rd-create -l drbd,luks,storage with exit 10 ("LUKS layer requires DrbdOptions/EncryptPassphrase to be set first") -- an intentional guard, not a product gap. Add the set-property step before rd-create, mirroring the green L6 cell luks-rd-create-encrypted.sh. Idempotent on the shared stand. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(operator-harness): raise slow-stand resync timeouts to 600s The phase-3 relocate disk_state wait in r-full-lifecycle and the large-volume all_uptodate waits in vd-resize-full-lifecycle provably time out only on the stand's ~4 MB/s DRBD resync, not on a real product fault. Bump those ceilings to 600s with a comment that this is slow-stand headroom (CI stands are faster; the timeout is a ceiling, not a target). Assertion strictness is unchanged. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(operator-harness): {{device}} substitution + fixture-presence SKIP gates ps-cdp-zfs passed the literal {{device}} placeholder to the device-pool create because substitute() never resolved it (same class as the {{rg}} pass-through). Add {{device}} resolution from vars.device. Add two opt-in prerequisite SKIP gates so a workflow that needs a stand fixture the current stand lacks SKIPs cleanly (exit 0) instead of FAILing on a missing fixture -- a missing fixture is "not exercisable here", not a product bug: - prerequisites.storage_pool_min_nodes {name,min}: SKIP unless a LINSTOR pool is on >= min nodes (wired into vd-resize-thick, needs lvm-thick x2). - prerequisites.device_on_any_node <path>: SKIP unless the block device exists on a worker satellite (wired into ps-cdp-zfs, needs /dev/loop9). Both gates are additive; workflows without the key are unaffected. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(operator-harness): accept exit 10 for vd shrink rejection The shrink-rejected step asserted exit 1, but the python-linstor client surfaces an API-level rejection (shrink past DRBD metadata position) as exit 10. Both 1 and 10 mean "rejected"; the load-bearing contract (size unchanged at 4G) is pinned by the follow-up size-unchanged-after-reject assertion. Validated on stand: shrink returns exit 10. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(operator-harness): pin r-full-lifecycle phase-1 placement (deterministic relocate target) Phase 1 used --auto-place=2, which picks the 2 diskful nodes non-deterministically; when a diskful landed on node3 the phase-3 `r c node3` relocate hit "resource already exists" (exit 10) on the stand. Pin the diskful pair to node1+node2 so auto-add-quorum-tiebreaker deterministically lands the witness on node3, making node3 a valid relocate target -- the same topology the L6 ground-truth cell r-full-lifecycle.sh discovers dynamically. The pure autoplace-node-selection path stays covered by autoplace-3r.yaml and the L6 cell. Also raise the phase-1 512M-resync all_uptodate ceiling to 600s for the slow stand. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> --------- Signed-off-by: Andrei Kvapil <kvapss@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>
Summary
Closes the group-H corner cases of the LINSTOR parity campaign (toggle-disk / migration state machine). One real wire-flag bug fixed (H3), two state-machine contracts pinned (H2/H4), one version-context divergence documented (H1). Diskless-detection and the
--migrate-fromsync-then-remove ordering are the load-bearing surfaces.Verdict table
r tdretry / cancel)reclaimVolumesForDiskless), and it additionally accepts an explicit?cancel=true(Bug 40) the upstream client never sends. Pinned by existingTestToggleDiskCancelStuckAddDiskScenario4W24+TestBug267ToggleToDisklessReclaimsBackingVolume.toggle-disk --migrate-fromis sync-then-removetests/e2e/cli-matrix/toggle-disk-migrate-from.sh) + L7 (replay/toggle-disk-migrate-from.yaml), validated on the stand.--disklessalias vs modern--drbd-disklessr c --drbd-disklessposts wire flagDRBD_DISKLESS; the deprecated--disklessposts the canonicalDISKLESS(verified vialinstor --curlagainst the 1.33.2 oracle, client 1.27.1). blockstor's diskless-detection surface keys exclusively onDISKLESS, so a--drbd-disklessreplica was mis-classified as diskful — the satellite would carve backing storage and the quorum/tiebreaker math would miscount it. FoldedDRBD_DISKLESS→DISKLESSat the create wire boundary. Pinned bypkg/rest/resource_create_drbd_diskless_test.go.DrbdOptions/auto-diskfultrigger + priority hierarchyTestAutoDiskfulPropHierarchyRGWins+TestAutoDiskfulPropHierarchyRDBeatsRG, closing the lattice. The firing CONDITION (deficit-refill + immediate Primary-InUse promote) and the unimplementedauto-diskful-allow-cleanuptrim are documented as accepted delta #68.The fix (H3)
Diskless-detection across blockstor (the placer's
splitByDiskless, the satellite'sapplyStorageIfDiskful, the quorum/tiebreaker arithmetic, the store's flag projection) keys exclusively on the canonicalDISKLESSspelling. A resource created with the RECOMMENDED--drbd-disklessflag posts onlyDRBD_DISKLESSand was therefore treated as a diskful create.normalizeDisklessFlagfoldsDRBD_DISKLESSintoDISKLESSonce at the resource-create wire boundary (de-duplicating when both spellings are present), so the rest of the pipeline sees a single spelling.Tests
pkg/rest/resource_create_drbd_diskless_test.go(H3 wire-boundary normalisation),internal/controller/auto_diskful_timer_test.go(H4 RD>RG>Controller hierarchy).tests/e2e/cli-matrix/toggle-disk-migrate-from.sh(H2 sync-then-remove; redundancy floor asserted on every poll).tests/operator-harness/replay/toggle-disk-migrate-from.yaml(H2 convergence assertion).Known deltas
Rows #67 (H1 stuck-toggle version-context) and #68 (H4 auto-diskful trigger condition) appended to
docs/cli-parity-known-deltas.mdatmax+1over main's numbering.Stand validation
The H2 L6/L7 artifacts create the diskless landing pad with the deprecated
--disklessalias (canonicalDISKLESS) so they validate the sync-then-remove contract against the currently-deployed stand image, which predates the H3 fix. H3'sDRBD_DISKLESSnormalisation is pinned at the L1 unit tier (the correct level for a wire-boundary canonicalisation) and takes effect once the controller change rolls out.Root-cause of the original replay failure (now fixed): the first stand run timed out at the landing-pad step —
— because the deployed (pre-H3) image keys diskless-detection on the canonical
DISKLESSspelling (pkg/satellite/reconciler.goisDiskless,pkg/store/k8s/resources.go), so a--drbd-disklessreplica (DRBD_DISKLESS) was mis-classified as diskful anddiskStatenever reportedDiskless. Switching the landing pad to the--disklessalias removes that dependency on an unrolled controller change; the migration step itself was unit-validated.Stand re-validation status: BLOCKED-ON-INFRA at submit time. The dev stand entered an INFRA-owned reset during re-validation — a worker (
dev-worker-1) failed to return Ready and.work/dev/kubeconfigwas wiped mid-run (the runner reached the workflow but reportedSKIP: workflow needs 3 workersonce discovery lost the cluster). The corrected replay should be re-run on the stand once it is healthy; the fix is conclusive at the code + L1 + L6/L7-shape level.Notes
No shared-harness changes (
lib.sh/replay-runner.shuntouched).go build ./...+go test ./...green;golangci-lint runandshellcheck -S errorclean for the diff.