fix(rest): node evict must not demote healthy diskful to tiebreaker (Bug 385)#76
fix(rest): node evict must not demote healthy diskful to tiebreaker (Bug 385)#76Andrei Kvapil (kvaps) wants to merge 2 commits into
Conversation
… (Bug 385) ensureTiebreaker counted every replica of an RD, including a TIE_BREAKER witness sitting on a just-EVICTED node. The witness invariant then read as satisfied (diskful=2 + witness=1) and the reconciler never relocated the witness off the drained node, so `linstor n e <tiebreaker-node>` did not actually take effect; downstream the stranded witness blocked a fresh one from landing on a healthy spare, which surfaced as a healthy diskful drifting into the tiebreaker role. Treat replicas on EVICTED / LOST nodes as draining placements (mirroring the placer's disabledNodes semantic): exclude them from the witness / quorum decision and reap a TIE_BREAKER stranded on a disabled node so a fresh witness can relocate to a healthy spare, or quorum falls back to off when none remains. A healthy diskful is never demoted. Diskful replicas on disabled nodes are left to the NodeReconciler's migration path. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…Bug 385) L6 cli-matrix/n-evict-tiebreaker-no-shuffle.sh drives the real `linstor n e` CLI on the 2r+tb shape and asserts the witness leaves the evicted node while neither healthy diskful is demoted to TIE_BREAKER. L7 replay/n-evict-tiebreaker-no-shuffle.yaml codifies the operator sequence (rd c → vd c → 2x r c → tiebreaker on node3 → n e node3) and the convergence assertions: witness reaped off the evicted node (resource_absent), both diskful replicas stay UpToDate. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses Bug 385, where evicting a node hosting a TieBreaker replica misbehaves because the reconciler counts replicas on evicted or lost nodes as live. The changes filter out replicas on disabled nodes from the live count and explicitly remove stranded witnesses. Additionally, unit and E2E tests are introduced to verify this behavior. A critical review comment points out a safety issue where evicting a stranded witness when no healthy spare nodes are available can result in a 2-replica resource definition running under majority quorum without a witness, presenting a data-availability hazard. A code suggestion is provided to fallback quorum to off if no candidate tiebreaker node is available.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| diskful, diskless := splitByDiskless(live) | ||
| witness := filterTieBreaker(diskless) | ||
|
|
||
| wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness) |
There was a problem hiding this comment.
There is a critical correctness and safety issue when a stranded witness is evicted and there are no healthy spare nodes available to host a new witness.
The Issue
removeStrandedWitnessesdeletes the stranded witness from the store.livereplicas will havelen(witness) == 0.wantWitnessis computed astrue(since we have 2 live diskful replicas and no live witness).applyWitnessDecisionis called and enters thewantWitness && len(witness) == 0case, callingcreateWitness.- Since there are no healthy spare nodes,
createWitnessfinds no candidate node and returnsnil(no error). - However,
applyWitnessDecisionblindly appends a dummy witness todisklessand returns it anyway. quorumPolicyreceivesdisklessAfterwith 1 element and returnsQuorumPolicyMajority.setQuorumsetsquorum: majorityon the RD.
This results in a 2-replica RD with no witness running under quorum: majority, which is a severe data-availability hazard (the volume will freeze if either of the two remaining diskful nodes fails).
The Fix
We should check if a candidate tiebreaker node is actually available before deciding we want a witness. If no node is available, we set wantWitness to false so that quorum correctly falls back to off.
| wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness) | |
| wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness) | |
| if wantWitness && len(witness) == 0 { | |
| hostingReplica := make(map[string]bool) | |
| for i := range replicas { | |
| hostingReplica[replicas[i].NodeName] = true | |
| } | |
| node, err := r.pickTiebreakerNodeForRD(ctx, rd.Name, hostingReplica) | |
| if err != nil { | |
| return err | |
| } | |
| if node == "" { | |
| wantWitness = false | |
| } | |
| } |
…s (FILE_THIN loop wedge)
Rendering rs-discard-granularity into a loop-backed FILE_THIN volume's
disk{} block regressed fresh-create convergence. When the elected day0
winner force-primaries to run mkfs (the FileSystem/Type path), mkfs
issues a full-device discard; on the loop backing that discard storm,
with rs-discard-granularity active, dirties the bitmap relative to the
day0-seeded peers and forces a FULL initial SyncTarget that wedges
(PausedSyncT dependency between the two targets). Proven on stand + CI:
an identical 3-replica create + mkfs converges instantly on LVM_THIN
(real block device, same disk block) but full-resyncs and wedges on
FILE_THIN (loop) — the e2e respawn-standalone-wedge failure.
Gate rs-discard-granularity on the discard-zero-safe block-device
provider set (LVM_THIN / ZFS / ZFS_THIN), the same set as
discard-zeroes-if-aligned. FILE_THIN now renders only the inert
discard-zeroes-if-aligned no (matching pre-feature behaviour) and omits
the granularity. The thin-aware-resync win is retained where it is both
safe and effective: real block-device thin/ZFS pools.
L1 pins (discardgran_test.go): FILE_THIN + thick LVM omit the
granularity; LVM_THIN/ZFS keep it; end-to-end render pins
TestBuildResFile_FileThinDay0SkipRender (no rs-discard-granularity) and
TestBuildResFile_LvmThinDay0SkipRender (keeps it). L6/L7 repurposed to
guard the FILE_THIN fresh-create + mkfs day0-skip convergence and assert
rs-discard-granularity is absent. Known-delta #76 updated.
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…ating (#112) * fix(satellite): decouple rs-discard-granularity from discard-zeroes gating A partially-written FILE_THIN volume resynced ~2x the bytes upstream LINSTOR did because the rendered .res lacked rs-discard-granularity. autoDiskOptions() early-returned nil for any provider that was not discard-zeroes-safe, coupling rs-discard-granularity (which upstream gates ONLY on the backing device's reported discard granularity) to discard-zeroes-if-aligned (correctly provider-gated). Decouple the two gates to mirror upstream: - discard-zeroes-if-aligned: provider-gated as before, but now rendered explicitly as `no` for non-safe kinds (FILE_THIN, thick LVM, FILE) instead of omitting the whole block — matching upstream 1.33.2's render. - rs-discard-granularity: emitted whenever the backing device reports a non-zero discard granularity (lsblk DISC-GRAN > 0), independent of provider kind, so an aligned all-zero region is UNMAPped during resync instead of transferred. Multi-volume collapse stays conservative: discard-zeroes-if-aligned is `yes` only when every volume is safe (one `no` pins the resource), and rs-discard-granularity is the smallest across volumes. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * test(corner-q3): pin FILE_THIN rs-discard-granularity at L6 + L7 L6 cli-matrix cell file-thin-rs-discard-granularity.sh: 512M FILE_THIN single replica, write 320M, add 2nd diskful replica, assert the new replica's resync `received` byte counter is close to the written bytes (not the full device) and the rendered .res carries the discard disk block. L7 replay file-thin-rs-discard-granularity.yaml: operator-CLI sequence asserting the rendered DRBD config carries rs-discard-granularity=4096 and discard-zeroes-if-aligned=no for a FILE_THIN resource, plus a clean 3rd-replica sync (drbd_option + sync_clean awaits). Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * docs(cli-parity): record FILE_THIN disk-block render deltas (Q3 #76) After the rs-discard-granularity decoupling fix, BS renders the same thin-aware-resync disk options as upstream for a discard-capable FILE_THIN volume. Two render-shape divergences remain and are accepted: upstream additionally emits `block-size 512;` (BS omits it), and upstream renders the disk block at volume scope vs BS's resource scope — functionally identical for single-volume resources. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * fix(satellite): gate rs-discard-granularity to block-device thin pools (FILE_THIN loop wedge) Rendering rs-discard-granularity into a loop-backed FILE_THIN volume's disk{} block regressed fresh-create convergence. When the elected day0 winner force-primaries to run mkfs (the FileSystem/Type path), mkfs issues a full-device discard; on the loop backing that discard storm, with rs-discard-granularity active, dirties the bitmap relative to the day0-seeded peers and forces a FULL initial SyncTarget that wedges (PausedSyncT dependency between the two targets). Proven on stand + CI: an identical 3-replica create + mkfs converges instantly on LVM_THIN (real block device, same disk block) but full-resyncs and wedges on FILE_THIN (loop) — the e2e respawn-standalone-wedge failure. Gate rs-discard-granularity on the discard-zero-safe block-device provider set (LVM_THIN / ZFS / ZFS_THIN), the same set as discard-zeroes-if-aligned. FILE_THIN now renders only the inert discard-zeroes-if-aligned no (matching pre-feature behaviour) and omits the granularity. The thin-aware-resync win is retained where it is both safe and effective: real block-device thin/ZFS pools. L1 pins (discardgran_test.go): FILE_THIN + thick LVM omit the granularity; LVM_THIN/ZFS keep it; end-to-end render pins TestBuildResFile_FileThinDay0SkipRender (no rs-discard-granularity) and TestBuildResFile_LvmThinDay0SkipRender (keeps it). L6/L7 repurposed to guard the FILE_THIN fresh-create + mkfs day0-skip convergence and assert rs-discard-granularity is absent. Known-delta #76 updated. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> * docs(satellite): correct autoDiskOptionsForResource godoc for FILE_THIN gate Comment-only: the rs-discard-granularity gate is no longer INDEPENDENT of provider kind — it follows the discard-zero-safe block-device set and excludes loop-backed FILE_THIN. Align the godoc with the implementation. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com> --------- Signed-off-by: Andrei Kvapil <kvapss@gmail.com> Co-authored-by: Claude <noreply@anthropic.com>
What
linstor n e <node>(node evict) misbehaved when the cluster had a free TieBreaker replica: the evict did not take effect. The TieBreaker stayed pinned to the just-evicted node, and a healthy diskful replica drifted into the tiebreaker role.Operator-reported repro (3-node dev stand), before evict:
worker-1UpToDate,worker-2UpToDate,worker-3TieBreaker. Afterlinstor n e worker-3:worker-2was wrongly demoted UpToDate → TieBreaker and the evictedworker-3still held a diskful resource — the eviction "did not happen".Why
The RD reconciler's tiebreaker pass (
ensureTiebreaker) counted every replica of an RD, including a TIE_BREAKER witness sitting on a just-EVICTED node. The witness invariant therefore read as already-satisfied (2 diskful + 1 witness), so the reconciler never relocated the witness off the drained node. The stranded witness then blocked a fresh witness from landing on a healthy spare, which is how the operator observed a healthy diskful drift into the tiebreaker role.The placer already documents and enforces the correct semantic — replicas on EVICTED/LOST nodes are not counted — but the RD-level witness/quorum decision did not mirror it.
Fix
Treat replicas on EVICTED/LOST nodes as draining placements, not live ones:
disabledNodessemantic).offwhen no spare remains.The change is confined to the RD reconciler; the shared witness-counting helpers (
splitByDiskless,shouldTieBreakerExist) are untouched.Tests (CLI-bug-fix protocol)
internal/controller/ensure_tiebreaker_evict_bug_385_test.go: evicting the tiebreaker node reaps the stranded witness without demoting a healthy diskful; with a spare node the witness relocates to it; evicting a diskful node does not let its stranded replica keep the witness alive.tests/e2e/cli-matrix/n-evict-tiebreaker-no-shuffle.sh: reallinstor n eon the 2r+tb shape, asserts the witness leaves the evicted node and neither healthy diskful is demoted.tests/operator-harness/replay/n-evict-tiebreaker-no-shuffle.yaml: codifies the operator sequence and convergence assertions (witness reaped off the evicted node, both diskful stay UpToDate).go build ./...,go test ./..., andgolangci-lint runare green locally. Live-stand replay validation is left to the launcher.