fix(rest): node evict must not demote healthy diskful to tiebreaker (Bug 385) by kvaps · Pull Request #76 · cozystack/blockstor

Andrei Kvapil (kvaps) · 2026-06-02T21:32:45Z

What

linstor n e <node> (node evict) misbehaved when the cluster had a free TieBreaker replica: the evict did not take effect. The TieBreaker stayed pinned to the just-evicted node, and a healthy diskful replica drifted into the tiebreaker role.

Operator-reported repro (3-node dev stand), before evict: worker-1 UpToDate, worker-2 UpToDate, worker-3 TieBreaker. After linstor n e worker-3: worker-2 was wrongly demoted UpToDate → TieBreaker and the evicted worker-3 still held a diskful resource — the eviction "did not happen".

Why

The RD reconciler's tiebreaker pass (ensureTiebreaker) counted every replica of an RD, including a TIE_BREAKER witness sitting on a just-EVICTED node. The witness invariant therefore read as already-satisfied (2 diskful + 1 witness), so the reconciler never relocated the witness off the drained node. The stranded witness then blocked a fresh witness from landing on a healthy spare, which is how the operator observed a healthy diskful drift into the tiebreaker role.

The placer already documents and enforces the correct semantic — replicas on EVICTED/LOST nodes are not counted — but the RD-level witness/quorum decision did not mirror it.

Fix

Treat replicas on EVICTED/LOST nodes as draining placements, not live ones:

Exclude evicted/lost-node replicas from the witness and quorum decision (mirroring the placer's disabledNodes semantic).
Reap a TIE_BREAKER stranded on a disabled node so a fresh witness can relocate to a healthy spare, or quorum falls back to off when no spare remains.
A healthy diskful is never demoted to TieBreaker. Diskful replicas on disabled nodes are left to the NodeReconciler's existing migration path (placer gap-fill for EVICTED, source-delete for LOST).

The change is confined to the RD reconciler; the shared witness-counting helpers (splitByDiskless, shouldTieBreakerExist) are untouched.

Tests (CLI-bug-fix protocol)

L1 regression — internal/controller/ensure_tiebreaker_evict_bug_385_test.go: evicting the tiebreaker node reaps the stranded witness without demoting a healthy diskful; with a spare node the witness relocates to it; evicting a diskful node does not let its stranded replica keep the witness alive.
L6 cli-matrix — tests/e2e/cli-matrix/n-evict-tiebreaker-no-shuffle.sh: real linstor n e on the 2r+tb shape, asserts the witness leaves the evicted node and neither healthy diskful is demoted.
L7 replay — tests/operator-harness/replay/n-evict-tiebreaker-no-shuffle.yaml: codifies the operator sequence and convergence assertions (witness reaped off the evicted node, both diskful stay UpToDate).

go build ./..., go test ./..., and golangci-lint run are green locally. Live-stand replay validation is left to the launcher.

… (Bug 385) ensureTiebreaker counted every replica of an RD, including a TIE_BREAKER witness sitting on a just-EVICTED node. The witness invariant then read as satisfied (diskful=2 + witness=1) and the reconciler never relocated the witness off the drained node, so `linstor n e <tiebreaker-node>` did not actually take effect; downstream the stranded witness blocked a fresh one from landing on a healthy spare, which surfaced as a healthy diskful drifting into the tiebreaker role. Treat replicas on EVICTED / LOST nodes as draining placements (mirroring the placer's disabledNodes semantic): exclude them from the witness / quorum decision and reap a TIE_BREAKER stranded on a disabled node so a fresh witness can relocate to a healthy spare, or quorum falls back to off when none remains. A healthy diskful is never demoted. Diskful replicas on disabled nodes are left to the NodeReconciler's migration path. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…Bug 385) L6 cli-matrix/n-evict-tiebreaker-no-shuffle.sh drives the real `linstor n e` CLI on the 2r+tb shape and asserts the witness leaves the evicted node while neither healthy diskful is demoted to TIE_BREAKER. L7 replay/n-evict-tiebreaker-no-shuffle.yaml codifies the operator sequence (rd c → vd c → 2x r c → tiebreaker on node3 → n e node3) and the convergence assertions: witness reaped off the evicted node (resource_absent), both diskful replicas stay UpToDate. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

coderabbitai · 2026-06-02T21:32:52Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 14a0d5e0-cbc9-4edf-9c24-d9e9e94e3394

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/bug-385-evict-tb-shuffle

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request addresses Bug 385, where evicting a node hosting a TieBreaker replica misbehaves because the reconciler counts replicas on evicted or lost nodes as live. The changes filter out replicas on disabled nodes from the live count and explicitly remove stranded witnesses. Additionally, unit and E2E tests are introduced to verify this behavior. A critical review comment points out a safety issue where evicting a stranded witness when no healthy spare nodes are available can result in a 2-replica resource definition running under majority quorum without a witness, presenting a data-availability hazard. A code suggestion is provided to fallback quorum to off if no candidate tiebreaker node is available.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-02T21:34:37Z

+	diskful, diskless := splitByDiskless(live)
 	witness := filterTieBreaker(diskless)

 	wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness)


There is a critical correctness and safety issue when a stranded witness is evicted and there are no healthy spare nodes available to host a new witness.

The Issue

removeStrandedWitnesses deletes the stranded witness from the store.

live replicas will have len(witness) == 0.

wantWitness is computed as true (since we have 2 live diskful replicas and no live witness).

applyWitnessDecision is called and enters the wantWitness && len(witness) == 0 case, calling createWitness.

Since there are no healthy spare nodes, createWitness finds no candidate node and returns nil (no error).

However, applyWitnessDecision blindly appends a dummy witness to diskless and returns it anyway.

quorumPolicy receives disklessAfter with 1 element and returns QuorumPolicyMajority.

setQuorum sets quorum: majority on the RD.

This results in a 2-replica RD with no witness running under quorum: majority, which is a severe data-availability hazard (the volume will freeze if either of the two remaining diskful nodes fails).

The Fix

We should check if a candidate tiebreaker node is actually available before deciding we want a witness. If no node is available, we set wantWitness to false so that quorum correctly falls back to off.

Suggested change

wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness)

wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness)

if wantWitness && len(witness) == 0 {

hostingReplica := make(map[string]bool)

for i := range replicas {

hostingReplica[replicas[i].NodeName] = true

}

node, err := r.pickTiebreakerNodeForRD(ctx, rd.Name, hostingReplica)

if err != nil {

return err

}

if node == "" {

wantWitness = false

}

}

Andrei Kvapil (kvaps) · 2026-06-03T00:40:04Z

Superseded by #83 (merged): the four operational-lifecycle fixes (Bugs 384–387) plus the L7 harness hardening (Bug 388) were validated as a unit on the live stand and merged together in #83. The commits from this branch are included there.

…s (FILE_THIN loop wedge) Rendering rs-discard-granularity into a loop-backed FILE_THIN volume's disk{} block regressed fresh-create convergence. When the elected day0 winner force-primaries to run mkfs (the FileSystem/Type path), mkfs issues a full-device discard; on the loop backing that discard storm, with rs-discard-granularity active, dirties the bitmap relative to the day0-seeded peers and forces a FULL initial SyncTarget that wedges (PausedSyncT dependency between the two targets). Proven on stand + CI: an identical 3-replica create + mkfs converges instantly on LVM_THIN (real block device, same disk block) but full-resyncs and wedges on FILE_THIN (loop) — the e2e respawn-standalone-wedge failure. Gate rs-discard-granularity on the discard-zero-safe block-device provider set (LVM_THIN / ZFS / ZFS_THIN), the same set as discard-zeroes-if-aligned. FILE_THIN now renders only the inert discard-zeroes-if-aligned no (matching pre-feature behaviour) and omits the granularity. The thin-aware-resync win is retained where it is both safe and effective: real block-device thin/ZFS pools. L1 pins (discardgran_test.go): FILE_THIN + thick LVM omit the granularity; LVM_THIN/ZFS keep it; end-to-end render pins TestBuildResFile_FileThinDay0SkipRender (no rs-discard-granularity) and TestBuildResFile_LvmThinDay0SkipRender (keeps it). L6/L7 repurposed to guard the FILE_THIN fresh-create + mkfs day0-skip convergence and assert rs-discard-granularity is absent. Known-delta #76 updated. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

…ating (#112) * fix(satellite): decouple rs-discard-granularity from discard-zeroes gating A partially-written FILE_THIN volume resynced ~2x the bytes upstream LINSTOR did because the rendered .res lacked rs-discard-granularity. autoDiskOptions() early-returned nil for any provider that was not discard-zeroes-safe, coupling rs-discard-granularity (which upstream gates ONLY on the backing device's reported discard granularity) to discard-zeroes-if-aligned (correctly provider-gated). Decouple the two gates to mirror upstream: - discard-zeroes-if-aligned: provider-gated as before, but now rendered explicitly as `no` for non-safe kinds (FILE_THIN, thick LVM, FILE) instead of omitting the whole block — matching upstream 1.33.2's render. - rs-discard-granularity: emitted whenever the backing device reports a non-zero discard granularity (lsblk DISC-GRAN > 0), independent of provider kind, so an aligned all-zero region is UNMAPped during resync instead of transferred. Multi-volume collapse stays conservative: discard-zeroes-if-aligned is `yes` only when every volume is safe (one `no` pins the resource), and rs-discard-granularity is the smallest across volumes. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(corner-q3): pin FILE_THIN rs-discard-granularity at L6 + L7 L6 cli-matrix cell file-thin-rs-discard-granularity.sh: 512M FILE_THIN single replica, write 320M, add 2nd diskful replica, assert the new replica's resync `received` byte counter is close to the written bytes (not the full device) and the rendered .res carries the discard disk block. L7 replay file-thin-rs-discard-granularity.yaml: operator-CLI sequence asserting the rendered DRBD config carries rs-discard-granularity=4096 and discard-zeroes-if-aligned=no for a FILE_THIN resource, plus a clean 3rd-replica sync (drbd_option + sync_clean awaits). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * docs(cli-parity): record FILE_THIN disk-block render deltas (Q3 #76) After the rs-discard-granularity decoupling fix, BS renders the same thin-aware-resync disk options as upstream for a discard-capable FILE_THIN volume. Two render-shape divergences remain and are accepted: upstream additionally emits `block-size 512;` (BS omits it), and upstream renders the disk block at volume scope vs BS's resource scope — functionally identical for single-volume resources. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(satellite): gate rs-discard-granularity to block-device thin pools (FILE_THIN loop wedge) Rendering rs-discard-granularity into a loop-backed FILE_THIN volume's disk{} block regressed fresh-create convergence. When the elected day0 winner force-primaries to run mkfs (the FileSystem/Type path), mkfs issues a full-device discard; on the loop backing that discard storm, with rs-discard-granularity active, dirties the bitmap relative to the day0-seeded peers and forces a FULL initial SyncTarget that wedges (PausedSyncT dependency between the two targets). Proven on stand + CI: an identical 3-replica create + mkfs converges instantly on LVM_THIN (real block device, same disk block) but full-resyncs and wedges on FILE_THIN (loop) — the e2e respawn-standalone-wedge failure. Gate rs-discard-granularity on the discard-zero-safe block-device provider set (LVM_THIN / ZFS / ZFS_THIN), the same set as discard-zeroes-if-aligned. FILE_THIN now renders only the inert discard-zeroes-if-aligned no (matching pre-feature behaviour) and omits the granularity. The thin-aware-resync win is retained where it is both safe and effective: real block-device thin/ZFS pools. L1 pins (discardgran_test.go): FILE_THIN + thick LVM omit the granularity; LVM_THIN/ZFS keep it; end-to-end render pins TestBuildResFile_FileThinDay0SkipRender (no rs-discard-granularity) and TestBuildResFile_LvmThinDay0SkipRender (keeps it). L6/L7 repurposed to guard the FILE_THIN fresh-create + mkfs day0-skip convergence and assert rs-discard-granularity is absent. Known-delta #76 updated. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * docs(satellite): correct autoDiskOptionsForResource godoc for FILE_THIN gate Comment-only: the rs-discard-granularity gate is no longer INDEPENDENT of provider kind — it follows the discard-zero-safe block-device set and excludes loop-backed FILE_THIN. Align the godoc with the implementation. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> --------- Signed-off-by: Andrei Kvapil <kvapss@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>

Andrei Kvapil (kvaps) and others added 2 commits June 2, 2026 23:29

gemini-code-assist Bot reviewed Jun 2, 2026

View reviewed changes

Andrei Kvapil (kvaps) closed this Jun 3, 2026

Andrei Kvapil (kvaps) mentioned this pull request Jun 5, 2026

fix(satellite): decouple rs-discard-granularity from discard-zeroes gating #112

Merged

Andrei Kvapil (kvaps) mentioned this pull request Jun 6, 2026

test(placer): pin LINSTOR placement-family corner-cases (upstream-issues U6) #117

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(rest): node evict must not demote healthy diskful to tiebreaker (Bug 385)#76

fix(rest): node evict must not demote healthy diskful to tiebreaker (Bug 385)#76
Andrei Kvapil (kvaps) wants to merge 2 commits into
mainfrom
fix/bug-385-evict-tb-shuffle

Andrei Kvapil (kvaps) commented Jun 2, 2026

Uh oh!

coderabbitai Bot commented Jun 2, 2026

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Uh oh!

Andrei Kvapil (kvaps) commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

-	wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness)
+	wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness)
+	if wantWitness && len(witness) == 0 {
+		hostingReplica := make(map[string]bool)
+		for i := range replicas {
+			hostingReplica[replicas[i].NodeName] = true
+		}
+		node, err := r.pickTiebreakerNodeForRD(ctx, rd.Name, hostingReplica)
+		if err != nil {
+			return err
+		}
+		if node == "" {
+			wantWitness = false
+		}
+	}

Conversation

Andrei Kvapil (kvaps) commented Jun 2, 2026

What

Why

Fix

Tests (CLI-bug-fix protocol)

Uh oh!

coderabbitai Bot commented Jun 2, 2026

Review skipped

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 2, 2026

Choose a reason for hiding this comment

The Issue

The Fix

Uh oh!

Andrei Kvapil (kvaps) commented Jun 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant