Skip to content

fix(effectiveprops): controller-tier DRBD options must lose to closer scopes (corner C)#98

Merged
Andrei Kvapil (kvaps) merged 5 commits into
mainfrom
corner/c-drbd-options-hierarchy
Jun 4, 2026
Merged

fix(effectiveprops): controller-tier DRBD options must lose to closer scopes (corner C)#98
Andrei Kvapil (kvaps) merged 5 commits into
mainfrom
corner/c-drbd-options-hierarchy

Conversation

@kvaps
Copy link
Copy Markdown
Member

@kvaps Andrei Kvapil (kvaps) commented Jun 4, 2026

Summary

Closes the group-C corner cases of the LINSTOR parity campaign (DRBD options hierarchy). One real precedence bug fixed, --unset revert pinned, three deliberate divergences whitelisted. Validated against an upstream LINSTOR 1.33.2 oracle on the dev stand.

The bug (C1/C2)

Controller-tier DRBD options did not lose to closer scopes in the effective-properties resolver: a linstor controller drbd-options --max-buffers=36864 would keep winning over a later RD-level override (rd drbd-options --max-buffers=8192 <rd>). Upstream semantics ("closer to the resource wins") give the RD value precedence. Fixed in pkg/effectiveprops so the chain resolves controller < RG < RD.

Oracle evidence (upstream 1.33.2, captured on the stand): controller-level value inherited by a fresh resource (36864 in the rendered net{} section), then RD override flips it to 8192; net options land in net {}, peer-device c-* options land in the per-peer connection disk {} section — both controllers route sections identically after the fix.

Pinned (C3)

--unset-<drbd-opt> deletes the key and the effective value reverts to the inherited/default tier (pkg/store/k8s/drbd_unset_revert_c3_test.go).

Deliberate divergences (known-deltas rows 56-58)

  • C4: resource drbd-peer-options / resource-connection drbd-peer-options alias parity — wire-shape pinned, divergence in connection-section routing recorded.
  • C5: node-connection drbd-peer-options surface (deciseconds scope) — not implemented; rejection pinned.
  • C6: per-object option-class validation — upstream rejects wrong-level options (e.g. --protocol on a VD); blockstor's prop bag is permissive at the wire, render-level rejection. Pinned.

Tests

  • L1: pkg/effectiveprops/effective_test.go (precedence matrix), pkg/rest/drbd_options_corner_cases_c4_c6_test.go, pkg/store/k8s/drbd_unset_revert_c3_test.go.
  • L6: tests/e2e/cli-matrix/drbd-options-hierarchy-controller.sh (controller set → inherit → RD override → kernel-visible via drbdsetup show).
  • L7: tests/operator-harness/replay/drbd-options-hierarchy-controller.yaml.
  • Harness (additive, flagged): tests/operator-harness/lib.sh + replay-runner.sh — new assertion support for the options check.

Stand validation

The L6 cell reproduces the bug on the stand's pinned v0.1.9 image (controller value 36864 survives RD override — pre-fix repro) while the oracle shows the target 8192; the cell passes only with this PR's image. The launcher will deploy this branch build to the dev stand and verify L6+L7 PASS before merge.

go test ./...        # green in the branch worktree
golangci-lint run    # 0 issues on touched packages

Summary by CodeRabbit

  • New Features

    • DRBD options inheritance/override now obeys correct precedence across controller → RG → RD → resource.
  • Bug Fixes

    • Unsetting DRBD options correctly reverts to inherited/default values.
  • Tests

    • Added unit, REST canary, e2e, and harness replay tests covering hierarchy, precedence, unset-revert, and corner-case behaviors; added an assertion kind to verify kernel-exposed DRBD option values.
  • Documentation

    • Whitelisted three additional DRBD-related CLI parity deltas in known-deltas docs.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f7dd92ab-e9ac-493b-a7c9-79ed91b808cc

📥 Commits

Reviewing files that changed from the base of the PR and between af9afbf and 04e2396.

📒 Files selected for processing (2)
  • pkg/effectiveprops/effective.go
  • pkg/effectiveprops/effective_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/effectiveprops/effective.go

📝 Walkthrough

Walkthrough

This PR refactors effective property resolution in blockstor to enforce correct DRBD options precedence across scopes (controller → resource group → resource definition → resource), adds comprehensive unit and integration tests validating the precedence logic, and documents three corner-case parity deltas in the CLI surface.

Changes

DRBD Options Precedence Implementation and Validation

Layer / File(s) Summary
Refactor Resolve for precedence-aware DRBD options merging
pkg/effectiveprops/effective.go
Resolve builds per-scope flattened DRBD wire-key maps combining typed DRBDOptions and ExtraProps, then merges them with upstream precedence so closer scopes override higher ones. New emitScopeProps and mergeScopeProps helpers centralize flattening and precedence-aware merging.
Unit tests for effective props precedence and inheritance
pkg/effectiveprops/effective_test.go
Tests validate that RD overrides RG and controller, RG overrides controller, controller ExtraProps are inherited when closer scopes provide no override, and controller non-DRBD props do not leak into resource effective props. Four test cases cover precedence chains and inheritance.
REST layer corner-case tests
pkg/rest/drbd_options_corner_cases_c4_c6_test.go
Tests verify that resource-connection drbd-peer-options endpoint returns non-2xx when not wired (ensuring no silent persistence), and that volume-definition drbd-options permissively accepts net-class options without validation and persists them.
K8s store unset/revert regression test
pkg/store/k8s/drbd_unset_revert_c3_test.go
Regression test validates that unsetting a flattened DRBD wire-key (e.g., DrbdOptions/Net/protocol) correctly reverts the corresponding typed CRD field to empty string, while sibling settings (e.g., MaxBuffers) remain unchanged.
E2E CLI-matrix test for controller-to-RD precedence
tests/e2e/cli-matrix/drbd-options-hierarchy-controller.sh
Bash e2e test verifies DRBD option inheritance from controller scope and override by closer resource-definition scope. Includes pre-flight pool checks, helpers to query kernel DRBD config via drbdsetup show, scenario C1 (inheritance), and scenario C2 (closer-wins override).
Operator-harness E2E test infrastructure and replay
tests/operator-harness/lib.sh, tests/operator-harness/replay-runner.sh, tests/operator-harness/replay/drbd-options-hierarchy-controller.yaml
Extends operator-harness assertion framework with drbd_option kind to query kernel DRBD config on target satellite pods, documents the new assertion in replay-runner, and provides a complete YAML replay test exercising Controller → RG → RD → Resource precedence for max-buffers with prerequisites and no_orphans invariant.
Document accepted DRBD option parity deltas
docs/cli-parity-known-deltas.md
Whitelists three additional known CLI parity deltas: resource/node-connection drbd-peer-options missing implementation (accepted until 2026-12-31), and volume-definition drbd-options permissive validation without class/object-level enforcement (accepted until 2026-12-31).

🎯 3 (Moderate) | ⏱️ ~25 minutes

🐰 Precedence flows like rabbit hops—each scope's closer call
Lands clearer than distant whispers from the controller's hall.
We test the hierarchy, both broad and deep,
So DRBD options in their rightful layers keep!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 73.68% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main change: fixing controller-tier DRBD options precedence so closer scopes win, which is the core fix across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch corner/c-drbd-options-hierarchy

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the DRBD options resolution logic to correctly enforce the precedence hierarchy (Controller -> ResourceGroup -> ResourceDefinition -> Resource), ensuring that closer scopes override higher scopes. It also adds extensive unit, integration, and E2E tests to cover various corner cases (C1-C6) and updates the known deltas documentation. The review feedback suggests a performance optimization in pkg/effectiveprops/effective.go to avoid redundant map allocations and copies by passing the properties directly to drbd.ResolveOptions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +107 to +112
ctrlMerged := mergeScopeProps(ctrlProps, nil)
rgMerged := mergeScopeProps(rgProps, rgInfo.Props)
rdMerged := mergeScopeProps(rdProps, rdInfo.Props)
resMerged := mergeScopeProps(resProps, target.Spec.Props)

out := drbd.ResolveOptions(ctrlMerged, rgMerged, rdMerged, resMerged)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Performance Optimization: Avoid Redundant Map Allocations and Copies

In the current implementation, ctrlMerged, rgMerged, and rdMerged are created by copying the respective properties into new maps via mergeScopeProps:

  1. ctrlMerged: ctrlProps is already a fresh map returned by emitScopeProps. Creating ctrlMerged via mergeScopeProps(ctrlProps, nil) simply duplicates this map unnecessarily.
  2. rgMerged and rdMerged: rgInfo.Props and rdInfo.Props contain only non-DRBD properties. However, drbd.ResolveOptions explicitly filters out and ignores any keys that do not start with PropPrefix ("DrbdOptions/") for the first three scopes (controller, resourceGroup, and resourceDef). Therefore, merging rgInfo.Props and rdInfo.Props has no effect on the final output, making these merge operations redundant.

By passing ctrlProps, rgProps, and rdProps directly to drbd.ResolveOptions, we can avoid 3 unnecessary map allocations and multiple maps.Copy operations on every reconciliation pass.

Suggested change
ctrlMerged := mergeScopeProps(ctrlProps, nil)
rgMerged := mergeScopeProps(rgProps, rgInfo.Props)
rdMerged := mergeScopeProps(rdProps, rdInfo.Props)
resMerged := mergeScopeProps(resProps, target.Spec.Props)
out := drbd.ResolveOptions(ctrlMerged, rgMerged, rdMerged, resMerged)
resMerged := mergeScopeProps(resProps, target.Spec.Props)
out := drbd.ResolveOptions(ctrlProps, rgProps, rdProps, resMerged)

@kvaps Andrei Kvapil (kvaps) marked this pull request as ready for review June 4, 2026 02:05
@kvaps Andrei Kvapil (kvaps) force-pushed the corner/c-drbd-options-hierarchy branch from 63617b1 to 6209de8 Compare June 4, 2026 02:11
@kvaps
Copy link
Copy Markdown
Member Author

Confirmatory oracle evidence for C3-C6 (captured on the stand against upstream 1.33.2 after the PR was opened):

  • C3 --unset revert: both controllers revert max-buffers to the inherited 36864 after --unset-max-buffers — MATCHES.
  • C4 resource drbd-peer-options: blockstor returns endpoint not implemented (route not registered) — confirms delta row (resource-connection alias).
  • C5 node-connection drbd-peer-options --ping-timeout 299: blockstor persists the property (list-properties shows it) but does NOT render it into the .res; upstream renders ping-timeout 299 into the per-connection net{} — confirms the MISSING_FEATURE delta.
  • C6 net-option on a volume-definition: upstream rejects with FAIL_INVLD_PROP (Invalid property key ... not whitelisted); blockstor accepts permissively and stores — confirms the option-class delta row.

Branch was rebased onto main (known-deltas row renumbering only, no code changes).

Andrei Kvapil (kvaps) and others added 4 commits June 4, 2026 05:03
… scopes

controller drbd-options persists into ControllerConfig.Spec.ExtraProps
(raw, untyped), while rd/rg drbd-options are transcoded into the typed
Spec.DRBDOptions. The effective-props resolver copied the controller
ExtraProps AFTER the RG/RD/Resource typed+raw merge, so a cluster-wide
controller default wrongly overrode a closer scope's explicit override
— violating LINSTOR's 'closer to the resource wins' rule.

Flatten each scope (Controller/RG/RD/Resource) into a single DRBD-props
map (typed-emitted ∪ ExtraProps) and run them all through one
precedence walk (drbd.ResolveOptions), so closer scopes win regardless
of whether a value was stored typed or as an ExtraProp. Controller-tier
inheritance (value reaches a resource that sets nothing) is preserved.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…(C3)

linstor rd/rg/r drbd-options --unset-<opt> deletes the flattened
DrbdOptions/... key; the store transcode round-trip must clear the
matching typed slot so the value reverts to inherited/default, while
sibling overrides in the same section survive untouched.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…(C1/C2)

Add an operator-CLI convergence gate for the effectiveprops controller-
tier precedence fix:

- L6 cli-matrix/drbd-options-hierarchy-controller.sh: set max-buffers at
  the controller, autoplace an RD, assert drbdsetup show inherits it (C1),
  then set a closer RD-level max-buffers and assert it wins (C2).
- L7 replay/drbd-options-hierarchy-controller.yaml: same sequence under
  the replay harness.
- New additive await kind 'drbd_option' in operator-harness/lib.sh that
  probes drbdsetup show on a node's satellite pod, documented in
  replay-runner.sh. SHARED-HARNESS CHANGE (additive only).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Corner-cases C4 (connection-scoped resource drbd-peer-options),
C5 (node-connection drbd-peer-options) and C6 (per-object-class option
validation) are unimplemented / permissive on main. Document them as
accepted deltas (#56/#57/#58) and pin current behavior:

- C4: PUT to the bare resource-connections drbd-peer-options path is not
  handled (no silent persist) — canary flips if the surface gets wired.
- C6: a net{}-class option on a volume-definition is accepted and stored
  permissively (upstream rejects with FAIL_INVLD_PROP) — canary flips
  when the option-class validator lands.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) force-pushed the corner/c-drbd-options-hierarchy branch from 6209de8 to af9afbf Compare June 4, 2026 03:05
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Nitpick comments (2)
pkg/effectiveprops/effective_test.go (1)

143-206: ⚡ Quick win

Add a resource-scope override assertion to this precedence regression.

This refactor also changes resProps := emitScopeProps(target.Spec.DRBDOptions, target.Spec.ExtraProps), but the L1 coverage stops at RD. A break where Resource no longer overrides RD/RG/controller would still pass this suite.

➕ Suggested test extension
 func TestResolve_RGOverridesController_RDOverridesRG(t *testing.T) {
 	scheme := newScheme(t)
@@
 	if got["DrbdOptions/Net/max-buffers"] != "2000" {
 		t.Fatalf("RG must win over controller when RD is unset: got %q, want 2000",
 			got["DrbdOptions/Net/max-buffers"])
 	}
+
+	// Resource must win over every parent scope as the closest level.
+	rd.Spec.DRBDOptions = &blockstoriov1alpha1.DRBDOptions{
+		Net: &blockstoriov1alpha1.DRBDNetOptions{MaxBuffers: int32Ptr(3000)},
+	}
+	res.Spec.DRBDOptions = &blockstoriov1alpha1.DRBDOptions{
+		Net: &blockstoriov1alpha1.DRBDNetOptions{MaxBuffers: int32Ptr(4000)},
+	}
+
+	got, err = effectiveprops.Resolve(context.Background(), client.Reader(cli), res, rd)
+	if err != nil {
+		t.Fatalf("Resolve (resource override): %v", err)
+	}
+
+	if got["DrbdOptions/Net/max-buffers"] != "4000" {
+		t.Fatalf("resource must win over RD/RG/controller: got %q, want 4000",
+			got["DrbdOptions/Net/max-buffers"])
+	}
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/effectiveprops/effective_test.go` around lines 143 - 206, The test is
missing a check that a Resource-scoped override beats RD/RG/controller; add a
Resource override and assert it wins: set res.Spec.DRBDOptions (or
res.Spec.ExtraProps) to a value (e.g. MaxBuffers=4000 or ExtraProps
"DrbdOptions/Net/max-buffers":"4000"), call effectiveprops.Resolve(...) again
and assert got["DrbdOptions/Net/max-buffers"] == "4000"; this verifies the
refactor around emitScopeProps/target.Spec.* hasn't broken resource-level
precedence.
tests/e2e/cli-matrix/drbd-options-hierarchy-controller.sh (1)

48-56: ⚡ Quick win

Use register_strict_cleanup here instead of a bare EXIT trap.

This cell bypasses the shared e2e cleanup path, so it misses the suite’s standard resource cleanup and cluster-state verification behavior.

Based on learnings: "Applies to tests/e2e/*.sh : Register cleanup for e2e test scenarios via register_strict_cleanup from tests/e2e/lib.sh instead of bare trap directives to ensure proper resource cleanup and cluster state verification"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/cli-matrix/drbd-options-hierarchy-controller.sh` around lines 48 -
56, Replace the bare EXIT trap with the shared e2e cleanup registrar: keep the
cleanup() function (which unsets controller-level knob via "${LCTL[@]}"
controller drbd-options --unset-max-buffers, calls delete_rd "$RD",
assert_no_orphans "$RD" and linstor_cli_teardown) but remove the trap cleanup
EXIT and instead call register_strict_cleanup cleanup so the test uses the
suite’s standard strict cleanup and cluster-state verification logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/cli-parity-known-deltas.md`:
- Line 34: The delta row for "node-connection drbd-peer-options" (Corner-case
C5) references a stale pointer "`#56`"; update that reference in
docs/cli-parity-known-deltas.md to the current row id or an explicit issue/PR
link so traceability is restored—locate the table row containing the command
`node-connection drbd-peer-options` and replace "deferred with `#56`" with the
correct identifier (e.g., the new row number or the canonical issue/PR URL) and
ensure the surrounding text still reads correctly.

In `@pkg/rest/drbd_options_corner_cases_c4_c6_test.go`:
- Around line 52-59: The test currently treats any non-2xx as success, which
allows unrelated 5xx responses to pass; update the assertion in the test that
inspects resp.StatusCode (the block that currently checks "if resp.StatusCode >=
200 && resp.StatusCode < 300") to explicitly require the mux fall-through codes
by asserting resp.StatusCode == 404 || resp.StatusCode == 405, and adjust the
t.Fatalf message in that failure branch to mention the expected 404/405 rather
than any non-2xx so the test fails for wired/broken handlers.

In `@tests/e2e/cli-matrix/drbd-options-hierarchy-controller.sh`:
- Around line 101-115: Summary: the test probes kernel DRBD state before
confirming controller/RD Status has converged, causing race failures; fix by
inserting explicit Status-convergence waits. After wait_replica_count "$RD" 2
and before the first wait_max_buffers call, call the existing e2e helper that
waits for LINSTOR resource status convergence (e.g., wait_resource_status or
wait_resource_converged) for "$RD" on the chosen node so the controller settings
are applied; likewise after running "${LCTL[@]} resource-definition drbd-options
--max-buffers ..." add another status-convergence wait for "$RD" before calling
wait_max_buffers again so the RD override has reconciled into the
satellite/kernel; use the same helper that other tests use for asserting
REST→satellite→kernel convergence.

In `@tests/operator-harness/replay/drbd-options-hierarchy-controller.yaml`:
- Around line 23-25: Add an explicit default for the substitution key {{rd}}
inside the top-level vars map (alongside sp: stand) so the replay YAML does not
rely on runner-side fallbacks; ensure the file's top-level metadata also
includes name (kebab-case matching the filename), description, and
prerequisites.min_nodes, and that steps[] include cmd[], expect_exit (and
optional await), with optional teardown[] and invariants[] to comply with the
replay YAML contract.
- Around line 52-58: The test added an unsupported await kind "drbd_option"
(await.kind: drbd_option) that violates the replay YAML contract; either add
"drbd_option" to the replay validation/allowlist or change the test to use an
allowed kind. Locate the replay schema/validator that enforces await.kind (the
validation rule that currently permits only replica_count, disk_state,
all_uptodate, replica_diskless, no_tiebreaker, sync_clean, resource_absent,
rd_absent, vd_size_kib, pvc_capacity, pod_md5_invariant) and update it to
include "drbd_option", and ensure any related error messages/tests are adjusted,
or revert the two occurrences in drbd-options-hierarchy-controller.yaml (lines
with await.kind: drbd_option) to one of the permitted kinds.

---

Nitpick comments:
In `@pkg/effectiveprops/effective_test.go`:
- Around line 143-206: The test is missing a check that a Resource-scoped
override beats RD/RG/controller; add a Resource override and assert it wins: set
res.Spec.DRBDOptions (or res.Spec.ExtraProps) to a value (e.g. MaxBuffers=4000
or ExtraProps "DrbdOptions/Net/max-buffers":"4000"), call
effectiveprops.Resolve(...) again and assert got["DrbdOptions/Net/max-buffers"]
== "4000"; this verifies the refactor around emitScopeProps/target.Spec.* hasn't
broken resource-level precedence.

In `@tests/e2e/cli-matrix/drbd-options-hierarchy-controller.sh`:
- Around line 48-56: Replace the bare EXIT trap with the shared e2e cleanup
registrar: keep the cleanup() function (which unsets controller-level knob via
"${LCTL[@]}" controller drbd-options --unset-max-buffers, calls delete_rd "$RD",
assert_no_orphans "$RD" and linstor_cli_teardown) but remove the trap cleanup
EXIT and instead call register_strict_cleanup cleanup so the test uses the
suite’s standard strict cleanup and cluster-state verification logic.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b53af3b8-37dd-43e2-adb8-e89768d4c3a6

📥 Commits

Reviewing files that changed from the base of the PR and between 29b961a and af9afbf.

📒 Files selected for processing (9)
  • docs/cli-parity-known-deltas.md
  • pkg/effectiveprops/effective.go
  • pkg/effectiveprops/effective_test.go
  • pkg/rest/drbd_options_corner_cases_c4_c6_test.go
  • pkg/store/k8s/drbd_unset_revert_c3_test.go
  • tests/e2e/cli-matrix/drbd-options-hierarchy-controller.sh
  • tests/operator-harness/lib.sh
  • tests/operator-harness/replay-runner.sh
  • tests/operator-harness/replay/drbd-options-hierarchy-controller.yaml

| 62 | `rd sp DrbdOptions/Resource/quorum-minimum-redundancy <bad>` | BEHAVIOR | permanent | Corner-case B8. Upstream VALIDATES generic DRBD option values against the DRBD schema and rejects e.g. `quorum-minimum-redundancy bogus` with FAIL_INVLD_PROP (oracle: `value 'bogus' is not valid … has to match (1-32) or (all\|majority\|off)`, rc=10). BS's property bag is permissive — it accepts and stores any DrbdOptions value verbatim (no per-key value validator); an invalid value is rejected downstream by drbdadm at `.res` render time. Valid values pass through and propagate to the `options{}` block correctly. Pinned by `TestCornerB_QuorumMinimumRedundancyPassthrough`. |
| 63 | `r d` quorum-property auto-removal INFO + prop strip | BEHAVIOR | permanent | Corner-case B3b. Upstream LINSTOR, on an `r d` that drops a 3-node resource below quorum feasibility, REMOVES `DrbdOptions/Resource/quorum` + `on-no-quorum` from the RD and prints two INFO lines ("…was removed as there are not enough resources for quorum", UG9 ~1380-1401). BS instead STAMPS `DrbdOptions/Resource/quorum=off` (does not remove it) and prints no INFO — Bug 309: the satellite renders an explicit quorum value into the `.res`; an absent prop would fall back to a DRBD built-in that differs from the intended `off`. Resulting DRBD behaviour (`quorum off`) is equivalent; the divergence is the visible prop shape (`off` present vs absent) and the missing INFO envelope. Stamp-off pinned by `TestEnsureTiebreakerOffOnSingleReplica`. |
| 64 | `resource-connection drbd-peer-options` / `resource drbd-peer-options` | MISSING_FEATURE | 2026-12-31 | Corner-case C4: per-(rd,nodeA,nodeB) connection-scoped DRBD peer-options are NOT wired on `main` — the resource-connections REST surface only implements `/paths` (multi-path addresses). The early scenario-5.W04 in-memory registry did not survive the K8s-store migration. Controller/RG/RD/Resource-scope DRBD options (the common path) work; per-pair tuning is deferred to a dedicated feature PR. CSI does not depend on per-pair peer-options. |
| 65 | `node-connection drbd-peer-options` | MISSING_FEATURE | 2026-12-31 | Corner-case C5: `node-connection set-property` / `drbd-peer-options` props are persisted on `ControllerConfig.Spec.NodeConnections` and round-trip via list/get, but are NOT consumed by the `.res` render path — so they do not apply to the resources of the node pair. Persist-only today; render wiring deferred with #56. |
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Stale delta-row reference (#56) should be updated.

Line 34 references “deferred with #56”, but #56 is not present in the current accept-list, so this pointer is now ambiguous after renumbering. Please update it to the current row id (or explicit issue/PR reference) to keep traceability intact.

Based on learnings: "When adding a new CLI verb or wire-shape change, update docs/cli-parity-known-deltas.md with intentional new divergence (row id, command, delta_kind, accepted_until, why)".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/cli-parity-known-deltas.md` at line 34, The delta row for
"node-connection drbd-peer-options" (Corner-case C5) references a stale pointer
"`#56`"; update that reference in docs/cli-parity-known-deltas.md to the current
row id or an explicit issue/PR link so traceability is restored—locate the table
row containing the command `node-connection drbd-peer-options` and replace
"deferred with `#56`" with the correct identifier (e.g., the new row number or the
canonical issue/PR URL) and ensure the surrounding text still reads correctly.

Comment on lines +52 to +59
// No handler is registered for the bare peer-options path, so the
// mux falls through. We accept any non-2xx (404 / 405) — the point
// is that the option is NOT silently persisted as if it took effect.
if resp.StatusCode >= 200 && resp.StatusCode < 300 {
t.Fatalf("C4 delta drift: resource-connection drbd-peer-options PUT returned %d "+
"(connection-scoped peer-options now appear wired — update known-deltas #56 "+
"and add a real round-trip/render test)", resp.StatusCode)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Tighten the route-absence assertion to 404/405.

This canary currently passes on any non-2xx, including unrelated 5xx failures. Since the intended contract here is mux fall-through for an unwired endpoint, assert 404/405 explicitly so a broken-but-wired handler does not silently keep C4 pinned.

Suggested change
-	// No handler is registered for the bare peer-options path, so the
-	// mux falls through. We accept any non-2xx (404 / 405) — the point
-	// is that the option is NOT silently persisted as if it took effect.
-	if resp.StatusCode >= 200 && resp.StatusCode < 300 {
+	// No handler is registered for the bare peer-options path, so the
+	// mux should fall through with a routing-level failure.
+	if resp.StatusCode != http.StatusNotFound && resp.StatusCode != http.StatusMethodNotAllowed {
 		t.Fatalf("C4 delta drift: resource-connection drbd-peer-options PUT returned %d "+
 			"(connection-scoped peer-options now appear wired — update known-deltas `#56` "+
 			"and add a real round-trip/render test)", resp.StatusCode)
 	}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// No handler is registered for the bare peer-options path, so the
// mux falls through. We accept any non-2xx (404 / 405) — the point
// is that the option is NOT silently persisted as if it took effect.
if resp.StatusCode >= 200 && resp.StatusCode < 300 {
t.Fatalf("C4 delta drift: resource-connection drbd-peer-options PUT returned %d "+
"(connection-scoped peer-options now appear wired — update known-deltas #56 "+
"and add a real round-trip/render test)", resp.StatusCode)
}
// No handler is registered for the bare peer-options path, so the
// mux should fall through with a routing-level failure.
if resp.StatusCode != http.StatusNotFound && resp.StatusCode != http.StatusMethodNotAllowed {
t.Fatalf("C4 delta drift: resource-connection drbd-peer-options PUT returned %d "+
"(connection-scoped peer-options now appear wired — update known-deltas `#56` "+
"and add a real round-trip/render test)", resp.StatusCode)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/rest/drbd_options_corner_cases_c4_c6_test.go` around lines 52 - 59, The
test currently treats any non-2xx as success, which allows unrelated 5xx
responses to pass; update the assertion in the test that inspects
resp.StatusCode (the block that currently checks "if resp.StatusCode >= 200 &&
resp.StatusCode < 300") to explicitly require the mux fall-through codes by
asserting resp.StatusCode == 404 || resp.StatusCode == 405, and adjust the
t.Fatalf message in that failure branch to mention the expected 404/405 rather
than any non-2xx so the test fails for wired/broken handlers.

Comment on lines +101 to +115
echo ">> wait for 2 diskful replicas"
wait_replica_count "$RD" 2

# Pick a diskful node to probe the kernel config on.
node=$(linstor_diskful_nodes "$RD" | head -1)
[[ -n "$node" ]] || die "no diskful node for $RD"

echo ">> [C1] assert controller max-buffers ($CTRL_MB) inherited into $RD on $node"
wait_max_buffers "$node" "$CTRL_MB"

echo ">> [C2] rd drbd-options --max-buffers $RD_MB $RD (closer scope override)"
"${LCTL[@]}" resource-definition drbd-options --max-buffers "$RD_MB" "$RD" >/dev/null

echo ">> [C2] assert RD max-buffers ($RD_MB) now wins over controller ($CTRL_MB) on $node"
wait_max_buffers "$node" "$RD_MB"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert Status convergence before probing drbdsetup show.

This L6 cell waits for replica count only, then reads kernel DRBD state for both C1 and C2. That leaves the assertion racing a not-yet-converged reconcile/adjust cycle, which is outside the cli-matrix contract for these tests.

As per coding guidelines, "L1 unit tests must run on every commit using go test ./..." is not applicable here, but "L6 operator-CLI e2e tests must test real linstor CLI → REST → satellite → DRBD kernel interactions and assert Status convergence under tests/e2e/cli-matrix/"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/e2e/cli-matrix/drbd-options-hierarchy-controller.sh` around lines 101 -
115, Summary: the test probes kernel DRBD state before confirming controller/RD
Status has converged, causing race failures; fix by inserting explicit
Status-convergence waits. After wait_replica_count "$RD" 2 and before the first
wait_max_buffers call, call the existing e2e helper that waits for LINSTOR
resource status convergence (e.g., wait_resource_status or
wait_resource_converged) for "$RD" on the chosen node so the controller settings
are applied; likewise after running "${LCTL[@]} resource-definition drbd-options
--max-buffers ..." add another status-convergence wait for "$RD" before calling
wait_max_buffers again so the RD override has reconciled into the
satellite/kernel; use the same helper that other tests use for asserting
REST→satellite→kernel convergence.

Comment on lines +23 to +25
vars:
sp: stand

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Add an explicit default for {{rd}} under vars.

The runner has a fallback today, but this replay file contract requires defaults for the substitutions it uses instead of relying on runner-side implicit behavior.

As per coding guidelines, "Replay YAML files must include: name (kebab-case, matches filename), description, prerequisites.min_nodes, vars with defaults for {{rd}}, {{sp}}, etc., steps[] with cmd[], expect_exit, and optional await assertion, teardown[], and invariants[]"

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/operator-harness/replay/drbd-options-hierarchy-controller.yaml` around
lines 23 - 25, Add an explicit default for the substitution key {{rd}} inside
the top-level vars map (alongside sp: stand) so the replay YAML does not rely on
runner-side fallbacks; ensure the file's top-level metadata also includes name
(kebab-case matching the filename), description, and prerequisites.min_nodes,
and that steps[] include cmd[], expect_exit (and optional await), with optional
teardown[] and invariants[] to comply with the replay YAML contract.

Comment on lines +52 to +58
await:
kind: drbd_option
rd: "{{rd}}"
node: "{{node1}}"
key: max-buffers
expected: "36864"
timeout_s: 120
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

await.kind: drbd_option is outside the replay YAML contract.

This workflow introduces a new await kind, but the authoritative replay schema still restricts await.kind to the existing allowlist. Please update that contract/validation in the same PR, or keep this workflow on the supported set.

As per coding guidelines, "Replay YAML await.kind must use one of: replica_count, disk_state, all_uptodate, replica_diskless, no_tiebreaker, sync_clean, resource_absent, rd_absent, vd_size_kib, pvc_capacity, pod_md5_invariant"

Also applies to: 62-68

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/operator-harness/replay/drbd-options-hierarchy-controller.yaml` around
lines 52 - 58, The test added an unsupported await kind "drbd_option"
(await.kind: drbd_option) that violates the replay YAML contract; either add
"drbd_option" to the replay validation/allowlist or change the test to use an
allowed kind. Locate the replay schema/validator that enforces await.kind (the
validation rule that currently permits only replica_count, disk_state,
all_uptodate, replica_diskless, no_tiebreaker, sync_clean, resource_absent,
rd_absent, vd_size_kib, pvc_capacity, pod_md5_invariant) and update it to
include "drbd_option", and ensure any related error messages/tests are adjusted,
or revert the two occurrences in drbd-options-hierarchy-controller.yaml (lines
with await.kind: drbd_option) to one of the permitted kinds.

…render

The hierarchy-funnel refactor routed every scope's props through
drbd.ResolveOptions, which by design only threads through NON-DRBD
props from the most-specific (resource) scope. That silently dropped
load-bearing non-DRBD ExtraProps carried on the RG/RD scopes — most
importantly `FileSystem/Type` (and `FileSystem/MkfsParams`), stamped
on the RD by `linstor rd set-property FileSystem/Type ext4` and by the
linstor-csi CreateVolume path.

The satellite gates its mkfs / `primary --force` seed path on
`dr.GetProps()["FileSystem/Type"]` (hasFileSystemConfigured). With the
key dropped, a fresh replica never mkfs-writes, the DRBD current-UUID
never rotates past the deterministic day0 GI, and the controller's
append-only RD.Spec.Initialized latch (ensureSkipInitSyncDecision)
never flips — the `respawn-standalone-wedge` e2e fails in SETUP with
"RD.Spec.Initialized never latched true after 3 UpToDate replicas".

Re-overlay the non-DRBD ExtraProps from the RG then RD scope after the
precedence walk (RD wins over RG; resource already won via
ResolveOptions). DRBD keys are skipped — already resolved by the walk.
The controller scope is deliberately not re-overlaid, preserving the
C1 contract that cluster-wide non-DRBD knobs (Aux/zone) do not leak
onto every resource.

Adds TestResolve_RDNonDRBDExtraPropSurvives pinning the exact
production RD shape (on-no-quorum typed, auto-quorum + FileSystem/Type
in ExtraProps) that the prior analytic reconstruction missed.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps
Copy link
Copy Markdown
Member Author

Root cause of the 3× respawn-standalone-wedge CI failures found via on-stand A/B (isolated repro stand, 3 runs per image): the resolver rewrite dropped non-DRBD upper-scope ExtraProps — FileSystem/Type from the RD never reached the dispatched resource, so the satellite skipped the mkfs / primary --force seed path, the DRBD current-UUID never rotated past day0, and RD.Spec.Initialized never latched.

Fix: re-overlay non-DrbdOptions/ ExtraProps from RG→RD scopes after the precedence walk (controller scope deliberately excluded — C1 contract preserved; all C1/C2 tests green). Regression unit test pins the production RD shape. Validated on the repro stand: main PASS×3, #98-before FAIL×3, #98+fix PASS×3.

@kvaps Andrei Kvapil (kvaps) merged commit 2e54b4d into main Jun 4, 2026
15 checks passed
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 4, 2026
…ine (#101)

* fix(stand): neutralize backtick command substitution in config-patch heredoc

The unquoted <<YAML heredoc carried comment lines with backtick pairs
(lvextend/drbdmeta/open(...)). On hosts where those binaries exist on
PATH the backticks execute as command substitution and corrupt the
generated config-patch.yaml, so talosctl rejects it and 'make up'
fails. CI runners lack the binaries, which is why this only bit dev
hosts. The heredoc must stay unquoted for $INSTALL_IMG expansion, so
drop the backticks from the comments instead.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* test(e2e): widen respawn-wedge Initialized-latch wait to 180s

The RD.Spec.Initialized latch rides DRBD current-UUID rotation through
the satellite observer and informer-cache trail; under loaded CI
runners the 60s window lapses while the latch is merely late. Seen
during PR #98 triage where the late latch was initially mistaken for
a flake before the real regression was isolated.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* test(cli-matrix): count only diskful replicas in drbd-options hierarchy cell

wait_replica_count counts ALL Resource rows including the
auto-tiebreaker witness, so a healthy 2-diskful + 1-TB layout (3 rows)
never satisfied count=2. Count diskful nodes instead; the L7 replay
twin already passes on the stand.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

---------

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant