Skip to content

fix(rest): reject invalid RD names on s r rst and rg spawn before partial state lands#56

Merged
Andrei Kvapil (kvaps) merged 4 commits into
mainfrom
fix/rd-name-validation-bulk
Jun 2, 2026
Merged

fix(rest): reject invalid RD names on s r rst and rg spawn before partial state lands#56
Andrei Kvapil (kvaps) merged 4 commits into
mainfrom
fix/rd-name-validation-bulk

Conversation

@kvaps
Copy link
Copy Markdown
Member

@kvaps Andrei Kvapil (kvaps) commented Jun 1, 2026

Bug

Bug-hunt v3 finding C.4. Two REST handlers that mint a fresh ResourceDefinition from a user-supplied name bypass the lowercase-RD-name validator the direct POST /v1/resource-definitions path enforces (Bug 97):

  • linstor s r rst SRC --fs SNAP --tr <Bad>POST /v1/resource-definitions/{rd}/snapshot-restore-resource/{snap}
  • linstor rg spawn <RG> <Bad>POST /v1/resource-groups/{rg}/spawn

Reproducer:

linstor s r rst dev-worker-1 dev-worker-2 --fr <good-rd> --fs snap1 --tr hunt3-Bad
linstor rg spawn hunt-zfs hunt3-A1 100M --partial

Root cause

Both handlers passed the raw to_resource / resource_definition_name value straight into Store.ResourceDefinitions().Create(). The k8s CRD store lowercased the metadata.name, and the result no longer matched the CRD admission webhook's metadata.name == <spec.resourceDefinitionName>.<spec.nodeName> invariant. By the time the rejection fired, a partial RD entry had already landed in the linstor view, and the operator saw a raw metadata.name must equal … leak — the same Bug-97 class the direct rd c path already gates against.

Fix

Call the existing validateLinstorName("resource definition", …) helper at the wire boundary of both handlers, BEFORE any Store.Create call, so the failure mode is one consistent LINSTOR envelope and no orphan state is left behind. The sibling snapshot-restore-volume-definition endpoint (Bug 225), which shares the ToResource field, gets the same gate.

The shape mirrors validateRDCreateBody on the direct rd c path: a small validateSnapshotRestoreRequest helper keeps both restore handlers under funlen and gives any future pre-Store gate one canonical home.

Tests

  • pkg/rest/rd_name_validation_bulk_test.go

    • TestSnapshotRestoreRejectsInvalidRdName — table-driven across 7 invalid shapes (mixed-case, uppercase, underscore, leading/trailing hyphen, embedded dot, empty), asserts 400 + envelope + no orphan RD in the in-memory store.
    • TestSnapshotRestoreVolumeDefinitionRejectsInvalidRdName — same gate on the sibling VD-only restore endpoint.
    • TestResourceGroupSpawnRejectsInvalidRdName — same table for the rg-spawn handler.
    • TestSnapshotRestoreAcceptsValidRdName / TestResourceGroupSpawnAcceptsValidRdName — happy-path guards so the gate isn't over-strict on production pvc-<uuid> csi clone names.
  • tests/e2e/rd-name-validation-bulk.sh — hits all 4 RD-minting REST routes through a live in-cluster apiserver via kubectl port-forward. For each invalid name, asserts (a) 4xx, (b) no metadata.name leak in the envelope, (c) no orphan RD CRD lingering after the request returns. Closes the gap that recurring class bugs in this area (94, 97, 98, 99, 100) keep falling into.

Test plan

  • go test ./pkg/rest/... — full suite green, 0.35 s for the new bulk tests, ~60 s for full pkg/rest.
  • golangci-lint run ./pkg/rest/ — 0 issues.
  • go build ./... — clean.
  • e2e on QEMU stand — stand was unreachable at fix-time; the tests/e2e/rd-name-validation-bulk.sh scenario auto-picks-up via ls tests/e2e/*.sh and will run on the next QEMU lane.

Summary by CodeRabbit

  • Bug Fixes

    • Added validation to reject invalid resource definition names with HTTP 400 across REST endpoints, preventing invalid state creation and internal error detail leakage.
  • Tests

    • Added unit tests for resource definition name validation across snapshot-restore and resource-group spawn endpoints.
    • Added end-to-end test validating rejection of invalid names and acceptance of valid RFC-1123 style names.

…tial state lands

Bug-hunt v3 finding C.4: the lowercase-RD-name validator that
`POST /v1/resource-definitions` runs (Bug 97) was bypassed on two
sibling REST handlers that ALSO mint a fresh RD entry from a user-
supplied name:

  - `linstor s r rst SRC --fs SNAP --tr <Bad>`
    (POST /v1/resource-definitions/{rd}/snapshot-restore-resource/{snap})
  - `linstor rg spawn <RG> <Bad>`
    (POST /v1/resource-groups/{rg}/spawn)

Both flowed the raw `to_resource` / `resource_definition_name` value
straight into `Store.ResourceDefinitions().Create()`. The k8s CRD
store lowercased the metadata.name and the result no longer matched
the CRD admission webhook's `metadata.name == <spec.resourceDefinitionName>
.<spec.nodeName>` invariant; the rejection then fired AFTER a partial
RD entry had landed in the linstor view, and the operator saw a raw
"metadata.name must equal …" leak.

Fix mirrors the existing Bug-97 gate on `rd c`: call
`validateLinstorName("resource definition", …)` at the wire boundary,
BEFORE any Store mutation, so the failure mode is one consistent
LINSTOR envelope and no orphan state is left behind. Same gate is
applied to the sibling `snapshot-restore-volume-definition` endpoint
(Bug 225) which shares the `ToResource` field.

Tests:
  - pkg/rest/rd_name_validation_bulk_test.go covers all 3 entry
    points with the same invalid-name table the Bug-97 unit tests
    use, plus happy-path round-trip guards so the gate isn't
    over-strict on production `pvc-<uuid>` names.
  - tests/e2e/rd-name-validation-bulk.sh hits all 4 RD-minting REST
    routes through the in-cluster apiserver and asserts (a) 4xx
    envelope, (b) no `metadata.name` leak, (c) no orphan RD CRD.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7df541ea-72ee-43c0-89e2-e9a489ced12f

📥 Commits

Reviewing files that changed from the base of the PR and between 010e1bb and 0cd54fa.

📒 Files selected for processing (4)
  • pkg/rest/rd_name_validation_bulk_test.go
  • pkg/rest/snapshot_restore.go
  • pkg/rest/spawn.go
  • tests/e2e/rd-name-validation-bulk.sh

📝 Walkthrough

Walkthrough

This PR adds centralized RFC-1123 name validation to four REST endpoints that create or restore resource definitions. Validation functions reject invalid names with HTTP 400 before any store mutation. Unit tests verify rejection and acceptance across all endpoints. E2E tests confirm validation behavior and absence of state leakage.

Changes

Resource Definition Name Validation

Layer / File(s) Summary
Snapshot-restore validation gate
pkg/rest/snapshot_restore.go
Adds validateSnapshotRestoreRequest to centralize validation of the to_resource field (non-empty and RFC-1123-compliant) across both snapshot-restore endpoints. Both handleSnapshotRestore and handleSnapshotRestoreVolumeDefinition call this gate after JSON decoding, rejecting invalid requests early with HTTP 400.
Spawn endpoint validation
pkg/rest/spawn.go
Validates resource_definition_name in handleSpawn before calling Store.ResourceDefinitions().Create(), returning HTTP 400 on invalid names to prevent partial ResourceDefinition state from persisting.
Unit test coverage
pkg/rest/rd_name_validation_bulk_test.go
Five test functions verify snapshot-restore (resource and volume paths) and spawn endpoints: three negative-path tests assert HTTP 400 rejection and no persisted ResourceDefinition for invalid names; two positive-path tests assert HTTP 201 acceptance and persistence for valid names.
E2E integration tests
tests/e2e/rd-name-validation-bulk.sh
Bash script validates all four RD-minting REST endpoints with invalid name matrices. Helpers assert 4xx rejection without metadata.name leakage and absence of orphan Kubernetes ResourceDefinition CRs. Happy-path guard confirms valid names are accepted and persisted.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A fuzzy test suite hops through the REST,
Validating names with RFC-1123 zest!
No orphans left behind, no leaks in the light,
Four endpoints now guard their gates good and tight.
whiskers twitch — the names are validated right!

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/rd-name-validation-bulk

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a bug where the lowercase resource definition (RD) name validator was bypassed on snapshot-restore and resource group spawn REST endpoints. It introduces validation gates at the wire boundary to reject invalid RD names before state mutation, preventing orphan states and internal Kubernetes metadata leaks. Additionally, comprehensive unit and end-to-end tests are added to verify the behavior. The review feedback points out a potential background process leak in the end-to-end test script if the port-forward wait fails, suggesting registering the trap immediately after starting the process.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +110 to +115
PF_PID=$!
_wait_port_forward "$LPORT" "$PF_PID" || {
echo "FAIL: could not port-forward apiserver"
exit 1
}
trap 'kill "$PF_PID" 2>/dev/null || true; cleanup' EXIT
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If _wait_port_forward fails, the script exits immediately. Since the EXIT trap that kills the background kubectl port-forward process (PF_PID) is registered after _wait_port_forward, the background process will be leaked and keep the port bound. Registering the trap immediately after starting the background process prevents this leak.

Suggested change
PF_PID=$!
_wait_port_forward "$LPORT" "$PF_PID" || {
echo "FAIL: could not port-forward apiserver"
exit 1
}
trap 'kill "$PF_PID" 2>/dev/null || true; cleanup' EXIT
PF_PID=$!
trap 'kill "$PF_PID" 2>/dev/null || true; cleanup' EXIT
_wait_port_forward "$LPORT" "$PF_PID" || {
echo "FAIL: could not port-forward apiserver"
exit 1
}

Andrei Kvapil (kvaps) and others added 3 commits June 2, 2026 02:48
The seed Snapshot used spec.resourceName which is not a field of the
Snapshot CRD — admission rejected on strict decoding so the scenario
exited 2 before reaching any name-validation assertion. Replace with
the correct resourceDefinitionName + snapshotName and align
metadata.name with the <rd>.<snap> CEL contract; also extend cleanup
to drop the snapshot so the scenario is idempotent across reruns.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Removing the seed Snapshot while a restore-clone RD still holds the
ZFS dataset busy stalled the snapshot finalizer for 120s+, leaving
an orphan that contaminated the next scenario on the lane (CI
showed `dataset is busy` on the following zfs-clone-source-delete
run after lane reshuffling moved it next to rd-name-validation-bulk).

Tear down the restore RD first, then the snapshot's parent RD,
then the Snapshot itself with --wait + an explicit wait-for-delete
so cleanup blocks until the finalizer is actually gone before the
next scenario starts.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…-name-validation-bulk

Seeding a real Snapshot (and exercising the snapshot-restore-resource
happy path) made this scenario stamp a ZFS dependent-clone graph
that the satellite finalizer could not unwind before cleanup
finished — the orphan Snapshot then poisoned the next lane scenario
(zfs-clone-source-delete saw the stand "not idle", dataset busy).

The reject-path matrix only needs the RD-name gate, which fires
before any snapshot existence check, so we keep the assert_reject
cases against the snapshot-restore endpoints with no Snapshot
in store. The snapshot-restore happy path is already covered by
TestSnapshotRestoreRejectsInvalidRdName / its valid-name sibling
in pkg/rest unit tests, without touching ZFS.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) marked this pull request as ready for review June 2, 2026 02:39
@kvaps Andrei Kvapil (kvaps) merged commit 6f69c56 into main Jun 2, 2026
15 of 16 checks passed
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 2, 2026
Bug 359: linstor r d <last-extra-diskful> immediately followed by
linstor r c <ex-tiebreaker-node> races the RD reconcilers
Bug-338 carve-out. When the r d drops the diskful count to one,
removeWitnesses Deletes the TIE_BREAKER CRD on the
ex-witness-node. The kubectl Delete finishes synchronously from
the reconcilers POV but the apiserver still serves the CRD as
exists DeletionTimestamp set finalizer pending for tens of
ms until the satellite strips its finalizer. During that window
REST createOrPromoteResource sees AlreadyExists on Create and
NotFound on Get (or NotFound from promotes PatchResourceSpec)
pre-fix it surfaced that as 404 not found - the same envelope
shape as a real missing-RD or missing-pool class which confused
operators because they never asked for a promote.

Fix wraps createOrPromoteResource in a 5-attempt retry loop with
a 200ms cadence. Both race surfaces (AlreadyExists+NotFound on
the flags probe and NotFound from promote) flag the attempt as
retryable; the next attempt converges as a fresh Create once GC
finishes. Exhausted retries surface a 503 envelope so CSI
or operator tooling can distinguish a transient race from a true
404.

The companion e2e (tiebreaker-r-d-r-c-other-node) stabilises
against the parallel ensureTiebreaker thrashing window: it waits
up to 20s for the post-r-d controller topology to settle (witness
count stable for at least 4s) before issuing the relocate and retries
the r c CLI call up to 5 times on a 503 envelope. CI lane 4
(stand-caught PR #56 PR #59) is green across 6 consecutive
fresh-stand runs on dev with this fix.

Signed-off-by: Andrei Kvapil <andrei.kvapil@aenix.io>
Co-authored-by: Claude <noreply@anthropic.com>
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 4, 2026
Corner-cases C4 (connection-scoped resource drbd-peer-options),
C5 (node-connection drbd-peer-options) and C6 (per-object-class option
validation) are unimplemented / permissive on main. Document them as
accepted deltas (#56/#57/#58) and pin current behavior:

- C4: PUT to the bare resource-connections drbd-peer-options path is not
  handled (no silent persist) — canary flips if the surface gets wired.
- C6: a net{}-class option on a volume-definition is accepted and stored
  permissively (upstream rejects with FAIL_INVLD_PROP) — canary flips
  when the option-class validator lands.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 4, 2026
Corner-cases C4 (connection-scoped resource drbd-peer-options),
C5 (node-connection drbd-peer-options) and C6 (per-object-class option
validation) are unimplemented / permissive on main. Document them as
accepted deltas (#56/#57/#58) and pin current behavior:

- C4: PUT to the bare resource-connections drbd-peer-options path is not
  handled (no silent persist) — canary flips if the surface gets wired.
- C6: a net{}-class option on a volume-definition is accepted and stored
  permissively (upstream rejects with FAIL_INVLD_PROP) — canary flips
  when the option-class validator lands.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 4, 2026
… scopes (corner C) (#98)

* fix(effectiveprops): controller-tier DRBD options must lose to closer scopes

controller drbd-options persists into ControllerConfig.Spec.ExtraProps
(raw, untyped), while rd/rg drbd-options are transcoded into the typed
Spec.DRBDOptions. The effective-props resolver copied the controller
ExtraProps AFTER the RG/RD/Resource typed+raw merge, so a cluster-wide
controller default wrongly overrode a closer scope's explicit override
— violating LINSTOR's 'closer to the resource wins' rule.

Flatten each scope (Controller/RG/RD/Resource) into a single DRBD-props
map (typed-emitted ∪ ExtraProps) and run them all through one
precedence walk (drbd.ResolveOptions), so closer scopes win regardless
of whether a value was stored typed or as an ExtraProp. Controller-tier
inheritance (value reaches a resource that sets nothing) is preserved.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* test(store): pin --unset-<drbd-opt> reverts typed field to inherited (C3)

linstor rd/rg/r drbd-options --unset-<opt> deletes the flattened
DrbdOptions/... key; the store transcode round-trip must clear the
matching typed slot so the value reverts to inherited/default, while
sibling overrides in the same section survive untouched.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* test(e2e): L6 cell + L7 replay for controller DRBD-options hierarchy (C1/C2)

Add an operator-CLI convergence gate for the effectiveprops controller-
tier precedence fix:

- L6 cli-matrix/drbd-options-hierarchy-controller.sh: set max-buffers at
  the controller, autoplace an RD, assert drbdsetup show inherits it (C1),
  then set a closer RD-level max-buffers and assert it wins (C2).
- L7 replay/drbd-options-hierarchy-controller.yaml: same sequence under
  the replay harness.
- New additive await kind 'drbd_option' in operator-harness/lib.sh that
  probes drbdsetup show on a node's satellite pod, documented in
  replay-runner.sh. SHARED-HARNESS CHANGE (additive only).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* test(rest): pin C4/C6 corner-case deltas + whitelist C4/C5/C6

Corner-cases C4 (connection-scoped resource drbd-peer-options),
C5 (node-connection drbd-peer-options) and C6 (per-object-class option
validation) are unimplemented / permissive on main. Document them as
accepted deltas (#56/#57/#58) and pin current behavior:

- C4: PUT to the bare resource-connections drbd-peer-options path is not
  handled (no silent persist) — canary flips if the surface gets wired.
- C6: a net{}-class option on a volume-definition is accepted and stored
  permissively (upstream rejects with FAIL_INVLD_PROP) — canary flips
  when the option-class validator lands.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* fix(effectiveprops): preserve RG/RD non-DRBD ExtraProps in effective render

The hierarchy-funnel refactor routed every scope's props through
drbd.ResolveOptions, which by design only threads through NON-DRBD
props from the most-specific (resource) scope. That silently dropped
load-bearing non-DRBD ExtraProps carried on the RG/RD scopes — most
importantly `FileSystem/Type` (and `FileSystem/MkfsParams`), stamped
on the RD by `linstor rd set-property FileSystem/Type ext4` and by the
linstor-csi CreateVolume path.

The satellite gates its mkfs / `primary --force` seed path on
`dr.GetProps()["FileSystem/Type"]` (hasFileSystemConfigured). With the
key dropped, a fresh replica never mkfs-writes, the DRBD current-UUID
never rotates past the deterministic day0 GI, and the controller's
append-only RD.Spec.Initialized latch (ensureSkipInitSyncDecision)
never flips — the `respawn-standalone-wedge` e2e fails in SETUP with
"RD.Spec.Initialized never latched true after 3 UpToDate replicas".

Re-overlay the non-DRBD ExtraProps from the RG then RD scope after the
precedence walk (RD wins over RG; resource already won via
ResolveOptions). DRBD keys are skipped — already resolved by the walk.
The controller scope is deliberately not re-overlaid, preserving the
C1 contract that cluster-wide non-DRBD knobs (Aux/zone) do not leak
onto every resource.

Adds TestResolve_RDNonDRBDExtraPropSurvives pinning the exact
production RD shape (on-no-quorum typed, auto-quorum + FileSystem/Type
in ExtraProps) that the prior analytic reconstruction missed.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

---------

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 4, 2026
The Controller<RG<RD auto-diskful priority chain had no test for the
MIDDLE (RG) layer: the existing RDWins test only covers RD-over-Controller.
Add TestAutoDiskfulPropHierarchyRGWins (RG beats Controller when RD is
unset) and TestAutoDiskfulPropHierarchyRDBeatsRG (RD beats both when all
three are set), closing the lattice.

Also document the auto-diskful trigger divergence (deficit-refill +
immediate Primary-InUse promote vs upstream's timed Primary>N-min) and
the unimplemented allow-cleanup gate as accepted delta #57, and the
1.34 stuck-toggle retry/cancel version-context as delta #56.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 4, 2026
The Controller<RG<RD auto-diskful priority chain had no test for the
MIDDLE (RG) layer: the existing RDWins test only covers RD-over-Controller.
Add TestAutoDiskfulPropHierarchyRGWins (RG beats Controller when RD is
unset) and TestAutoDiskfulPropHierarchyRDBeatsRG (RD beats both when all
three are set), closing the lattice.

Also document the auto-diskful trigger divergence (deficit-refill +
immediate Primary-InUse promote vs upstream's timed Primary>N-min) and
the unimplemented allow-cleanup gate as accepted delta #57, and the
1.34 stuck-toggle retry/cancel version-context as delta #56.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 5, 2026
The Controller<RG<RD auto-diskful priority chain had no test for the
MIDDLE (RG) layer: the existing RDWins test only covers RD-over-Controller.
Add TestAutoDiskfulPropHierarchyRGWins (RG beats Controller when RD is
unset) and TestAutoDiskfulPropHierarchyRDBeatsRG (RD beats both when all
three are set), closing the lattice.

Also document the auto-diskful trigger divergence (deficit-refill +
immediate Primary-InUse promote vs upstream's timed Primary>N-min) and
the unimplemented allow-cleanup gate as accepted delta #57, and the
1.34 stuck-toggle retry/cancel version-context as delta #56.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 5, 2026
…orner H) (#103)

* fix(rest): normalize DRBD_DISKLESS to DISKLESS on resource create (H3)

The modern `linstor r c <node> <rd> --drbd-diskless` CLI flag posts the
wire flag DRBD_DISKLESS, while the deprecated `--diskless` alias posts
the canonical DISKLESS (verified via `linstor --curl` against the
upstream 1.33.2 oracle, client 1.27.1). blockstor's diskless-detection
surface keys exclusively on the canonical DISKLESS spelling, so a
replica requested with the recommended --drbd-diskless flag was
mis-classified as diskful: the satellite would carve backing storage
and the quorum/tiebreaker math would miscount it.

Fold DRBD_DISKLESS into DISKLESS once at the create wire boundary so the
rest of the pipeline sees a single spelling. De-duplicates when both
spellings are present.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* test(auto-diskful): pin RG-scope priority in the property hierarchy (H4)

The Controller<RG<RD auto-diskful priority chain had no test for the
MIDDLE (RG) layer: the existing RDWins test only covers RD-over-Controller.
Add TestAutoDiskfulPropHierarchyRGWins (RG beats Controller when RD is
unset) and TestAutoDiskfulPropHierarchyRDBeatsRG (RD beats both when all
three are set), closing the lattice.

Also document the auto-diskful trigger divergence (deficit-refill +
immediate Primary-InUse promote vs upstream's timed Primary>N-min) and
the unimplemented allow-cleanup gate as accepted delta #57, and the
1.34 stuck-toggle retry/cancel version-context as delta #56.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* test(toggle-disk): pin --migrate-from sync-then-remove (H2)

Add an L6 cli-matrix cell + L7 replay YAML for the toggle-disk
--migrate-from migration (UG9 §"Migrating a resource to another node").
Both assert the sync-then-remove contract: the destination diskful
replica is added (count 2->3) and reaches UpToDate BEFORE the migrate
source is pruned, so the active diskful count never drops below the
original 2 at any observed point. The landing pad uses --drbd-diskless
to also exercise the H3 DRBD_DISKLESS normalisation on the same flow.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* test(toggle-disk): use --diskless alias for the H2 migrate landing pad

The H2 sync-then-remove replay/cli-matrix created the diskless landing
pad with the modern `--drbd-diskless` flag, which posts the wire flag
DRBD_DISKLESS. blockstor's diskless-detection surface keys on the
canonical DISKLESS spelling, and the DRBD_DISKLESS->DISKLESS wire
normalisation is the H3 fix in this same branch — NOT yet rolled out to
the dev stand. On the deployed (pre-H3) image the landing pad was
mis-classified as diskful and never reported diskState Diskless, so the
replica_diskless await timed out before the migration step ever ran.

Switch the landing pad to the deprecated `--diskless` alias, which posts
the canonical DISKLESS directly, so the H2 sync-then-remove migration
contract is validated on the currently-deployed stand image. H3's
DRBD_DISKLESS normalisation remains pinned by the L1 unit test
pkg/rest/resource_create_drbd_diskless_test.go, the correct tier for a
wire-boundary flag canonicalisation.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* test(replay): migrate-from asserts source no longer diskful, not absent

The auto-tiebreaker legitimately re-occupies the vacated node after the
diskful source is pruned (2 diskful = even parity), so resource_absent
can never hold on a 3-worker stand. Assert replica_diskless on the
source plus active_diskful_count=2 instead — the actual sync-then-remove
contract. Live-stand evidence: controller logs 'migration complete: src
pruned, dst UpToDate' while the witness lands back on the source node.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

---------

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant