Skip to content

fix(rest): node evict must not demote healthy diskful to tiebreaker (Bug 385)#76

Closed
Andrei Kvapil (kvaps) wants to merge 2 commits into
mainfrom
fix/bug-385-evict-tb-shuffle
Closed

fix(rest): node evict must not demote healthy diskful to tiebreaker (Bug 385)#76
Andrei Kvapil (kvaps) wants to merge 2 commits into
mainfrom
fix/bug-385-evict-tb-shuffle

Conversation

@kvaps
Copy link
Copy Markdown
Member

What

linstor n e <node> (node evict) misbehaved when the cluster had a free TieBreaker replica: the evict did not take effect. The TieBreaker stayed pinned to the just-evicted node, and a healthy diskful replica drifted into the tiebreaker role.

Operator-reported repro (3-node dev stand), before evict: worker-1 UpToDate, worker-2 UpToDate, worker-3 TieBreaker. After linstor n e worker-3: worker-2 was wrongly demoted UpToDate → TieBreaker and the evicted worker-3 still held a diskful resource — the eviction "did not happen".

Why

The RD reconciler's tiebreaker pass (ensureTiebreaker) counted every replica of an RD, including a TIE_BREAKER witness sitting on a just-EVICTED node. The witness invariant therefore read as already-satisfied (2 diskful + 1 witness), so the reconciler never relocated the witness off the drained node. The stranded witness then blocked a fresh witness from landing on a healthy spare, which is how the operator observed a healthy diskful drift into the tiebreaker role.

The placer already documents and enforces the correct semantic — replicas on EVICTED/LOST nodes are not counted — but the RD-level witness/quorum decision did not mirror it.

Fix

Treat replicas on EVICTED/LOST nodes as draining placements, not live ones:

  • Exclude evicted/lost-node replicas from the witness and quorum decision (mirroring the placer's disabledNodes semantic).
  • Reap a TIE_BREAKER stranded on a disabled node so a fresh witness can relocate to a healthy spare, or quorum falls back to off when no spare remains.
  • A healthy diskful is never demoted to TieBreaker. Diskful replicas on disabled nodes are left to the NodeReconciler's existing migration path (placer gap-fill for EVICTED, source-delete for LOST).

The change is confined to the RD reconciler; the shared witness-counting helpers (splitByDiskless, shouldTieBreakerExist) are untouched.

Tests (CLI-bug-fix protocol)

  • L1 regressioninternal/controller/ensure_tiebreaker_evict_bug_385_test.go: evicting the tiebreaker node reaps the stranded witness without demoting a healthy diskful; with a spare node the witness relocates to it; evicting a diskful node does not let its stranded replica keep the witness alive.
  • L6 cli-matrixtests/e2e/cli-matrix/n-evict-tiebreaker-no-shuffle.sh: real linstor n e on the 2r+tb shape, asserts the witness leaves the evicted node and neither healthy diskful is demoted.
  • L7 replaytests/operator-harness/replay/n-evict-tiebreaker-no-shuffle.yaml: codifies the operator sequence and convergence assertions (witness reaped off the evicted node, both diskful stay UpToDate).

go build ./..., go test ./..., and golangci-lint run are green locally. Live-stand replay validation is left to the launcher.

Andrei Kvapil (kvaps) and others added 2 commits June 2, 2026 23:29
… (Bug 385)

ensureTiebreaker counted every replica of an RD, including a TIE_BREAKER
witness sitting on a just-EVICTED node. The witness invariant then read
as satisfied (diskful=2 + witness=1) and the reconciler never relocated
the witness off the drained node, so `linstor n e <tiebreaker-node>` did
not actually take effect; downstream the stranded witness blocked a
fresh one from landing on a healthy spare, which surfaced as a healthy
diskful drifting into the tiebreaker role.

Treat replicas on EVICTED / LOST nodes as draining placements (mirroring
the placer's disabledNodes semantic): exclude them from the witness /
quorum decision and reap a TIE_BREAKER stranded on a disabled node so a
fresh witness can relocate to a healthy spare, or quorum falls back to
off when none remains. A healthy diskful is never demoted. Diskful
replicas on disabled nodes are left to the NodeReconciler's migration
path.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…Bug 385)

L6 cli-matrix/n-evict-tiebreaker-no-shuffle.sh drives the real
`linstor n e` CLI on the 2r+tb shape and asserts the witness leaves the
evicted node while neither healthy diskful is demoted to TIE_BREAKER.

L7 replay/n-evict-tiebreaker-no-shuffle.yaml codifies the operator
sequence (rd c → vd c → 2x r c → tiebreaker on node3 → n e node3) and
the convergence assertions: witness reaped off the evicted node
(resource_absent), both diskful replicas stay UpToDate.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 2, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 14a0d5e0-cbc9-4edf-9c24-d9e9e94e3394

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/bug-385-evict-tb-shuffle

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses Bug 385, where evicting a node hosting a TieBreaker replica misbehaves because the reconciler counts replicas on evicted or lost nodes as live. The changes filter out replicas on disabled nodes from the live count and explicitly remove stranded witnesses. Additionally, unit and E2E tests are introduced to verify this behavior. A critical review comment points out a safety issue where evicting a stranded witness when no healthy spare nodes are available can result in a 2-replica resource definition running under majority quorum without a witness, presenting a data-availability hazard. A code suggestion is provided to fallback quorum to off if no candidate tiebreaker node is available.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

diskful, diskless := splitByDiskless(live)
witness := filterTieBreaker(diskless)

wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

There is a critical correctness and safety issue when a stranded witness is evicted and there are no healthy spare nodes available to host a new witness.

The Issue

  1. removeStrandedWitnesses deletes the stranded witness from the store.
  2. live replicas will have len(witness) == 0.
  3. wantWitness is computed as true (since we have 2 live diskful replicas and no live witness).
  4. applyWitnessDecision is called and enters the wantWitness && len(witness) == 0 case, calling createWitness.
  5. Since there are no healthy spare nodes, createWitness finds no candidate node and returns nil (no error).
  6. However, applyWitnessDecision blindly appends a dummy witness to diskless and returns it anyway.
  7. quorumPolicy receives disklessAfter with 1 element and returns QuorumPolicyMajority.
  8. setQuorum sets quorum: majority on the RD.

This results in a 2-replica RD with no witness running under quorum: majority, which is a severe data-availability hazard (the volume will freeze if either of the two remaining diskful nodes fails).

The Fix

We should check if a candidate tiebreaker node is actually available before deciding we want a witness. If no node is available, we set wantWitness to false so that quorum correctly falls back to off.

Suggested change
wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness)
wantWitness := shouldTieBreakerExist(rd, diskful, diskless, witness)
if wantWitness && len(witness) == 0 {
hostingReplica := make(map[string]bool)
for i := range replicas {
hostingReplica[replicas[i].NodeName] = true
}
node, err := r.pickTiebreakerNodeForRD(ctx, rd.Name, hostingReplica)
if err != nil {
return err
}
if node == "" {
wantWitness = false
}
}

@kvaps
Copy link
Copy Markdown
Member Author

Superseded by #83 (merged): the four operational-lifecycle fixes (Bugs 384–387) plus the L7 harness hardening (Bug 388) were validated as a unit on the live stand and merged together in #83. The commits from this branch are included there.

Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 6, 2026
…s (FILE_THIN loop wedge)

Rendering rs-discard-granularity into a loop-backed FILE_THIN volume's
disk{} block regressed fresh-create convergence. When the elected day0
winner force-primaries to run mkfs (the FileSystem/Type path), mkfs
issues a full-device discard; on the loop backing that discard storm,
with rs-discard-granularity active, dirties the bitmap relative to the
day0-seeded peers and forces a FULL initial SyncTarget that wedges
(PausedSyncT dependency between the two targets). Proven on stand + CI:
an identical 3-replica create + mkfs converges instantly on LVM_THIN
(real block device, same disk block) but full-resyncs and wedges on
FILE_THIN (loop) — the e2e respawn-standalone-wedge failure.

Gate rs-discard-granularity on the discard-zero-safe block-device
provider set (LVM_THIN / ZFS / ZFS_THIN), the same set as
discard-zeroes-if-aligned. FILE_THIN now renders only the inert
discard-zeroes-if-aligned no (matching pre-feature behaviour) and omits
the granularity. The thin-aware-resync win is retained where it is both
safe and effective: real block-device thin/ZFS pools.

L1 pins (discardgran_test.go): FILE_THIN + thick LVM omit the
granularity; LVM_THIN/ZFS keep it; end-to-end render pins
TestBuildResFile_FileThinDay0SkipRender (no rs-discard-granularity) and
TestBuildResFile_LvmThinDay0SkipRender (keeps it). L6/L7 repurposed to
guard the FILE_THIN fresh-create + mkfs day0-skip convergence and assert
rs-discard-granularity is absent. Known-delta #76 updated.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Andrei Kvapil (kvaps) added a commit that referenced this pull request Jun 6, 2026
…ating (#112)

* fix(satellite): decouple rs-discard-granularity from discard-zeroes gating

A partially-written FILE_THIN volume resynced ~2x the bytes upstream
LINSTOR did because the rendered .res lacked rs-discard-granularity.
autoDiskOptions() early-returned nil for any provider that was not
discard-zeroes-safe, coupling rs-discard-granularity (which upstream
gates ONLY on the backing device's reported discard granularity) to
discard-zeroes-if-aligned (correctly provider-gated).

Decouple the two gates to mirror upstream:

  - discard-zeroes-if-aligned: provider-gated as before, but now
    rendered explicitly as `no` for non-safe kinds (FILE_THIN, thick
    LVM, FILE) instead of omitting the whole block — matching upstream
    1.33.2's render.
  - rs-discard-granularity: emitted whenever the backing device reports
    a non-zero discard granularity (lsblk DISC-GRAN > 0), independent of
    provider kind, so an aligned all-zero region is UNMAPped during
    resync instead of transferred.

Multi-volume collapse stays conservative: discard-zeroes-if-aligned is
`yes` only when every volume is safe (one `no` pins the resource), and
rs-discard-granularity is the smallest across volumes.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* test(corner-q3): pin FILE_THIN rs-discard-granularity at L6 + L7

L6 cli-matrix cell file-thin-rs-discard-granularity.sh: 512M FILE_THIN
single replica, write 320M, add 2nd diskful replica, assert the new
replica's resync `received` byte counter is close to the written bytes
(not the full device) and the rendered .res carries the discard disk
block.

L7 replay file-thin-rs-discard-granularity.yaml: operator-CLI sequence
asserting the rendered DRBD config carries rs-discard-granularity=4096
and discard-zeroes-if-aligned=no for a FILE_THIN resource, plus a clean
3rd-replica sync (drbd_option + sync_clean awaits).

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* docs(cli-parity): record FILE_THIN disk-block render deltas (Q3 #76)

After the rs-discard-granularity decoupling fix, BS renders the same
thin-aware-resync disk options as upstream for a discard-capable
FILE_THIN volume. Two render-shape divergences remain and are accepted:
upstream additionally emits `block-size 512;` (BS omits it), and upstream
renders the disk block at volume scope vs BS's resource scope —
functionally identical for single-volume resources.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* fix(satellite): gate rs-discard-granularity to block-device thin pools (FILE_THIN loop wedge)

Rendering rs-discard-granularity into a loop-backed FILE_THIN volume's
disk{} block regressed fresh-create convergence. When the elected day0
winner force-primaries to run mkfs (the FileSystem/Type path), mkfs
issues a full-device discard; on the loop backing that discard storm,
with rs-discard-granularity active, dirties the bitmap relative to the
day0-seeded peers and forces a FULL initial SyncTarget that wedges
(PausedSyncT dependency between the two targets). Proven on stand + CI:
an identical 3-replica create + mkfs converges instantly on LVM_THIN
(real block device, same disk block) but full-resyncs and wedges on
FILE_THIN (loop) — the e2e respawn-standalone-wedge failure.

Gate rs-discard-granularity on the discard-zero-safe block-device
provider set (LVM_THIN / ZFS / ZFS_THIN), the same set as
discard-zeroes-if-aligned. FILE_THIN now renders only the inert
discard-zeroes-if-aligned no (matching pre-feature behaviour) and omits
the granularity. The thin-aware-resync win is retained where it is both
safe and effective: real block-device thin/ZFS pools.

L1 pins (discardgran_test.go): FILE_THIN + thick LVM omit the
granularity; LVM_THIN/ZFS keep it; end-to-end render pins
TestBuildResFile_FileThinDay0SkipRender (no rs-discard-granularity) and
TestBuildResFile_LvmThinDay0SkipRender (keeps it). L6/L7 repurposed to
guard the FILE_THIN fresh-create + mkfs day0-skip convergence and assert
rs-discard-granularity is absent. Known-delta #76 updated.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

* docs(satellite): correct autoDiskOptionsForResource godoc for FILE_THIN gate

Comment-only: the rs-discard-granularity gate is no longer INDEPENDENT
of provider kind — it follows the discard-zero-safe block-device set and
excludes loop-backed FILE_THIN. Align the godoc with the implementation.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>

---------

Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant