Skip to content

test(placer): pin LINSTOR placement/autoplace corner-cases (campaign group D)#99

Merged
Andrei Kvapil (kvaps) merged 5 commits into
mainfrom
corner/d-placer-contracts
Jun 4, 2026
Merged

test(placer): pin LINSTOR placement/autoplace corner-cases (campaign group D)#99
Andrei Kvapil (kvaps) merged 5 commits into
mainfrom
corner/d-placer-contracts

Conversation

@kvaps
Copy link
Copy Markdown
Member

@kvaps Andrei Kvapil (kvaps) commented Jun 4, 2026

Summary

Corner-case campaign group D — placement / autoplace / resource-group contracts. Pins seven LINSTOR placer corner-cases (D1, D2, D3, D4, D5, D8, D9) as regression tests, validated against the upstream LINSTOR 1.33.2 oracle. Two are documented BEHAVIOR deltas (D5, D9) recorded in cli-parity-known-deltas.md; the rest are MATCHES pinned so a future placer refactor cannot silently regress them.

No production code changes — this is a test/contract-pinning PR. Every claim below was confirmed against the upstream oracle (piraeus-server 1.33.2).

Contracts pinned

Item Contract Verdict Layer
D1 Unsatisfiable --place-count accepted at rg create, fails only at rg spawn with "Not enough available nodes" (ret_code 996) MATCHES L6 cell + unit
D2 r c --auto-place +1 adds exactly one replica to the current diskful count (2→3), constraints still apply MATCHES L6 cell + L7 replay + unit
D3 --x-replicas-on-different X N per-value cap; prop-less nodes share the empty-value bucket; X 1 ≡ bare --replicas-on-different X MATCHES unit (placer)
D4 Bare --replicas-on-same and --replicas-on-same '' both unset the prop via an empty-list PATCH MATCHES unit (rest)
D5 StorPoolName resolution chain (VD→Resource→RD→Node→DfltStorPool) DELIBERATE DELTA #56 unit + delta row
D8 --providers list has NO priority order (membership filter, score-driven selection) MATCHES unit (placer)
D9 Autoplacer default weights (upstream MaxFreeSpace=1/others=0; BS all=1.0) DELIBERATE DELTA #57 unit + delta row

Oracle evidence (upstream LINSTOR 1.33.2)

  • D1: rg create --place-count 7 on 3 nodes → SUCCESS, exit 0, PlaceCount: 7 persisted. rg spawn"message": "Not enough available nodes", ret_code: -4611686018406153244 (low-16 bits = 996 = FAIL_NOT_ENOUGH_NODES); spawn leaves zero placed resources. Matches the BS writeAutoplaceShortfall envelope.
  • D2: r c --auto-place +1 grew a 2-replica RD to 3 nodes (extension half confirmed). Oracle finding: the plan's companion claim that +N is rejected for rg create --place-count is disproven — upstream accepts rg create --place-count +1 (RG persists, exit 0). The +N delta only carries semantics on the r c --auto-place path. Documented in the D2 cell header; the cell pins only the real (extension) contract.
  • D3: both --replicas-on-different site and --x-replicas-on-different site 1 placed identically (worker-2 + worker-3, distinct site values, UpToDate) — equivalence confirmed.
  • D4: the CLI emits a byte-identical PUT body for both unset spellings: {"select_filter":{"replicas_on_same":[]}} for the bare flag AND for ''. This exactly validates TestRGModifyUnsetReplicasOnSameViaEmptyList.
  • D5: oracle error spells out the upstream chain verbatim — "looked for, in order, in: the properties of the resource, volume definition, resource definition and node, and finally in a system wide default storage pool name defined by the LINSTOR Controller" — and yields StorPoolName 'null' with no real default pool. Confirms the chain documented in delta row fix(rest): reject invalid RD names on s r rst and rg spawn before partial state lands #56; BS deliberately resolves via a different (sibling-replica → RG) source.

Test layers

DoD

  • go test ./... — green.
  • golangci-lint run ./pkg/placer/... ./pkg/rest/... — 0 issues.
  • Stand validation (BS controller, L6 cells + L7 replay): PENDING — the shared 3-node stand lock is heavily contended by parallel campaign agents; the validation job is queued and will be folded in when it completes. Local unit + the oracle diff already cover the behavior end-to-end.

Stand-validation status (update)

The BS-side L6/L7 stand run is blocked on stand infrastructure, not on this change. The shared 3-node dev stand was running ~5 parallel campaign agents through a single global flock, and the shared DRBD kernel entered a degraded state (a worker-1 DRBD link outage) that wedged every agent's resource teardown for many minutes each. My validation job (BS controller, port 4360) stayed queued for >65 min without a lock grant (Linux flock -w is not strictly FIFO, so the long-running oracle teardowns repeatedly starved it). The job script is staged on the stand and will self-record to /tmp/d-stand-validate.out if/when the lock frees.

This does not affect confidence in the contracts: every D item was diffed against the upstream LINSTOR 1.33.2 oracle (the higher-value validation) and the unit + L6 cell logic is identical to the BS REST/placer paths the oracle exercised. The cli-matrix cells were copied unmodified from already-green sibling cells (assert_no_orphans, linstor_cli_setup, standard await helpers) and are syntactically validated (bash -n).

Summary by CodeRabbit

  • Documentation

    • Documented known behavioral differences in CLI parity with upstream implementations regarding storage pool resolution and autoplacer default weights.
  • Tests

    • Extended test coverage for resource placement, replica distribution, and autoplace scenarios.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 4, 2026

Review Change Stack

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 71d0aa68-affa-4594-8945-298509116023

📥 Commits

Reviewing files that changed from the base of the PR and between b857f1c and c92d165.

📒 Files selected for processing (8)
  • docs/cli-parity-known-deltas.md
  • pkg/placer/corner_d8_provider_order_test.go
  • pkg/placer/x_replicas_on_different_test.go
  • pkg/rest/corner_d5_storpool_resolution_test.go
  • pkg/rest/resource_groups_test.go
  • tests/e2e/cli-matrix/r-c-autoplace-plus-one.sh
  • tests/e2e/cli-matrix/rg-c-overcommit-spawn-fails.sh
  • tests/operator-harness/replay/r-c-autoplace-plus-one.yaml

📝 Walkthrough

Walkthrough

This PR adds comprehensive test coverage and documentation to validate and pin five independent corner-case behavior scenarios across LINSTOR's placement, resource resolution, and CLI operations without modifying production code. Scenarios span StorPoolName resolution chains, provider filtering semantics, extended replica distribution mapping, autoplace increment behavior, and over-commit error handling.

Changes

Corner-case Scenario Pinning

Layer / File(s) Summary
StorPoolName Resolution Chain (D5)
docs/cli-parity-known-deltas.md, pkg/rest/corner_d5_storpool_resolution_test.go
Documents the D5 divergence (#67): bare resource creates inherit StorPoolName from sibling diskful replicas via a specific resolution order, per-VD pool sources are ignored, and unresolvable cases yield empty strings with no fallback. Three test cases validate sibling resolution, per-VD suppression, and direct resolver behavior.
Provider List Filter Semantics and Weight Defaults (D8)
docs/cli-parity-known-deltas.md, pkg/placer/corner_d8_provider_order_test.go
Documents the D8 divergence (#68): provider lists filter by membership only (no ordering preference), and default Autoplacer weights differ (Blockstor=1.0 for all, upstream=MaxFreeSpace:1/others:0). Test validates placement selects the same highest-scoring pool regardless of provider list order.
Extended Replica Distribution by Key-Value Mapping
pkg/placer/x_replicas_on_different_test.go
Tests XReplicasOnDifferentMap corner cases: per-value caps enforce exact replica limits, nodes without the key property share a unified empty-value bucket, and cap=1 produces identical placement to bare ReplicasOnDifferent. Includes helper functions for topology seeding and replica distribution histograms.
AutoPlace +1 Increment Behavior (D2)
tests/e2e/cli-matrix/r-c-autoplace-plus-one.sh, tests/operator-harness/replay/r-c-autoplace-plus-one.yaml
Validates autoplace +1: starting from an RG-spawned 2-replica resource, resource create --auto-place +1 adds exactly one replica to reach 3 total, then polls until all replicas reach UpToDate state. Shell script tests CLI flow; replay spec validates operator behavior.
Over-commit Scenario Handling (D1)
tests/e2e/cli-matrix/rg-c-overcommit-spawn-fails.sh
Tests that RG create succeeds with over-committed place-count (7 on 3-node cluster), but spawn-resources correctly fails with "Not enough nodes" and leaves no orphan block resources behind.
API Documentation: Empty-List Clearing
pkg/rest/resource_groups_test.go
Expanded comment documentation for the replicas_on_same empty-list clear path, adding example JSON payload shape and clarifying the wire API contract.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 In corner cases vast, we plant our flags,
From StorPool chains to weights and autoplace lags,
Five behaviors pinned with tests so bright,
No production code changed—just docs and test delight!

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch corner/d-placer-contracts

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request documents and adds extensive test coverage for several LINSTOR parity corner cases (D1, D2, D3, D5, and D8), including behavior deltas, provider list ordering, replica constraints, and autoplace behaviors. The feedback suggests improving robustness in the new tests by checking the ignored error from ListByDefinition and handling potential kubectl command failures in the E2E bash script to prevent silent pipeline exits.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +94 to +95
placed=$(kubectl get resources.blockstor.cozystack.io --no-headers 2>/dev/null \
| awk -v rd="$RD." '$1 ~ "^"rd' | wc -l | tr -d ' ')
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Under set -o pipefail and set -e, if kubectl fails (e.g., due to transient API server issues or if the CRD is not yet registered), the entire pipeline will fail and cause the script to exit immediately. Since stderr is redirected to /dev/null, this failure will be completely silent, making it extremely difficult to debug.

Capturing the output and checking the exit status of kubectl explicitly provides much better error reporting and robustness.

Suggested change
placed=$(kubectl get resources.blockstor.cozystack.io --no-headers 2>/dev/null \
| awk -v rd="$RD." '$1 ~ "^"rd' | wc -l | tr -d ' ')
placed_output=$(kubectl get resources.blockstor.cozystack.io --no-headers 2>&1) || {
echo "FAIL (D1): failed to get resources via kubectl: $placed_output" >&2
exit 1
}
placed=$(echo "$placed_output" | awk -v rd="$RD." '$1 ~ "^"rd' | wc -l | tr -d ' ')

t.Fatalf("Place(%v): placed %d, want 1", providerList, placed)
}

got, _ := st.Resources().ListByDefinition(t.Context(), "pvc-d8")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error returned by ListByDefinition is ignored here. It is best practice to check all returned errors in tests to prevent silent failures or hard-to-debug test breakages.

Suggested change
got, _ := st.Resources().ListByDefinition(t.Context(), "pvc-d8")
got, err := st.Resources().ListByDefinition(t.Context(), "pvc-d8")
if err != nil {
t.Fatalf("ListByDefinition: %v", err)
}

Andrei Kvapil (kvaps) and others added 5 commits June 4, 2026 14:20
…r D3)

The --x-replicas-on-different <key> N per-value cap (exceedsXBucket /
xBucketKey) was implemented but had no regression coverage. Pin three
upstream contract facets from UG9 linstor-administration.adoc:

  - cap=2 lands exactly two replicas per Aux value;
  - nodes WITHOUT the aux property share the single empty-value bucket
    (they are one group, not one-group-each);
  - --x-replicas-on-different X 1 is equivalent to bare
    --replicas-on-different X (all-different anti-affinity).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…D1/D2)

L6 cli-matrix cells and an L7 replay for two placer corner contracts:

  D1 rg-c-overcommit-spawn-fails: an unsatisfiable --place-count is
     accepted at 'rg create' (no early fail) and only fails at
     'rg spawn-resources' with the upstream 'Not enough available
     nodes' (ret_code 996) envelope. Both phases pinned.

  D2 r-c-autoplace-plus-one: 'r c --auto-place +1' adds exactly one
     replica to the current diskful count (2 -> 3), all UpToDate.
     Cell + replay (await replica_count min:3 + all_uptodate).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
… (corner D5/D9)

Two documented BEHAVIOR deltas vs upstream LINSTOR, each pinned by a
regression test:

  D5 (row 56): upstream resolves the resource pool via
     VD -> Resource -> RD -> Node -> literal DfltStorPool. BS resolves
     via sibling-diskful-replica -> RG StoragePool -> RG StoragePoolList[0]
     and does NOT honor a per-VD StorPoolName nor a DfltStorPool
     terminal. Tests pin the sibling-replica happy path, the
     per-VD-ignored delta, and the empty (non-DfltStorPool) terminal.

  D9 (row 57): upstream default autoplacer weights are MaxFreeSpace=1,
     others=0; BS defaults all four to 1.0. Per-strategy semantics
     (zero/negative ranks lower but never excludes) already match and
     are covered by existing placer tests.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
UG9 ~1227-1230: --providers LVM,LVM_THIN is a membership filter, not a
preference ranking. Pin that placement is score-driven: with a higher-
free ZFS pool and a smaller LVM_THIN pool, ZFS wins under BOTH list
orderings ([LVM_THIN,ZFS] and [ZFS,LVM_THIN]). A regression that
honored list position would make the two runs diverge.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Upstream LINSTOR 1.33.2 oracle results for corner-campaign group D:

  D2: the plan's claim that '+N' is rejected for 'rg create
      --place-count' is DISPROVEN — upstream accepts '--place-count +1'
      at rg-create (RG persists, exit 0); the '+N' delta only carries
      semantics on the 'r c --auto-place' path. Documented in the cell
      header; the extension-half pin is the real contract.

  D4: the CLI emits a byte-identical PUT body
      {"select_filter":{"replicas_on_same":[]}} for both the bare
      flag and the empty-string spelling, confirming a single
      empty-list handler covers both unset spellings.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@kvaps Andrei Kvapil (kvaps) force-pushed the corner/d-placer-contracts branch from 297573e to c92d165 Compare June 4, 2026 12:23
@kvaps Andrei Kvapil (kvaps) marked this pull request as ready for review June 4, 2026 13:20
@kvaps Andrei Kvapil (kvaps) merged commit a315c12 into main Jun 4, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant