fix(controller): r d of diskful must not convert to useless tiebreaker on inactive-replica resource (Bug 387)#77
Conversation
…g count (Bug 387) An INACTIVE replica is `drbdadm down` (operator deactivation) — its DRBD device is not up, so it casts no vote in the quorum:majority decision the auto-tiebreaker invariant defends. The RD reconciler's ensureTiebreaker counted it as a full diskful, so deleting one active diskful on a 2-active + 1-INACTIVE resource looked like "2 diskful, even parity, no user-diskless" and spuriously grew a TIE_BREAKER witness (1 active diskful + 1 witness = a 2-voter quorum with no majority protection), diverging from upstream LINSTOR. Drop INACTIVE replicas from the voting set before the diskful/diskless split so they influence neither the diskful count nor the diskless/witness count. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…wns no tiebreaker L1 regression. TestBug387InactiveReplicaNotCountedAsVotingDiskful reproduces the operator repro (1 active diskful + 1 INACTIVE after `r d`): EnsureTiebreaker must create no witness and write quorum=off. TestBug387TwoActiveDiskfulStillGetWitness is the positive control — 2 genuine active diskful + 1 ignored INACTIVE still grows a witness with quorum=majority, so the INACTIVE-exclusion does not regress the canonical auto-witness invariant. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
L6 r-d-inactive-no-tiebreaker.sh: 3 diskful → deactivate one (INACTIVE) → r d one active diskful, asserts no TIE_BREAKER is spawned and the deleted node does not re-appear as a witness. L7 r-d-inactive-no-tiebreaker.yaml: codifies the same operator sequence with no_tiebreaker + resource_absent convergence assertions. Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
|
Important Review skippedDraft detected. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request addresses Bug 387 by ensuring that inactive replicas are excluded from the auto-tiebreaker calculations, preventing the controller from spuriously spawning a tiebreaker witness when an active replica is deleted. The changes include a new filtering function, unit tests, and integration tests. The reviewer pointed out that overwriting the replicas slice directly could lead to inaccurate logging and potential witness placement bugs, and suggested using a separate activeReplicas slice instead.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| replicas = filterActiveReplicas(replicas) | ||
|
|
||
| diskful, diskless := splitByDiskless(replicas) |
There was a problem hiding this comment.
Overwriting the replicas slice with only active replicas has two unintended side effects:\n\n1. Inaccurate Logging: The log statement on line 269 will output the count of active replicas as the total replicas count, which can be confusing for operators.\n2. Potential Witness Placement Bug: Passing the filtered replicas slice to applyWitnessDecision (and subsequently createWitness) means hostingReplica will not contain nodes hosting inactive replicas. Although pickTiebreakerNodeForRD attempts to re-probe the store as a defense-in-depth measure, it only excludes diskful replicas. If there is an inactive diskless replica, or if the store query is stale, the controller could attempt to place a tiebreaker witness on a node that already hosts an inactive replica.\n\nInstead, we should keep replicas as the full list of replicas, and define a new activeReplicas slice for the quorum and tiebreaker calculations.
\tactiveReplicas := filterActiveReplicas(replicas)\n\n\tdiskful, diskless := splitByDiskless(activeReplicas)
Summary
Deleting one active diskful replica with
linstor r don a resource that has 2 active diskful + 1 INACTIVE replica wrongly converted the deleted replica into a TieBreaker witness, producing a useless 2-voter quorum (1 active diskful + 1 TB, no majority protection) and diverging from upstream LINSTOR, which simply deletes the replica with no witness conversion.Operator repro on resource
test1(worker-1 INACTIVE, worker-2/worker-3 UpToDate):Root cause
An INACTIVE replica is
drbdadm down(operator deactivation) — its DRBD device is not up, so it casts no vote in thequorum: majoritydecision the auto-tiebreaker invariant defends. The RD reconciler'sensureTiebreakercounted it as a full diskful, so after ther dof one active diskful the topology looked like "2 diskful, even parity, no user-diskless" and it spuriously grew a TIE_BREAKER witness.Fix
Drop INACTIVE replicas from the voting set before the diskful/diskless split in
ensureTiebreaker, so they influence neither the diskful count nor the diskless/witness count. Aligns with upstream LINSTOR (delete-only, no witness conversion).Test coverage (CLI-bug-fix protocol)
internal/controller/ensure_tiebreaker_inactive_bug_387_test.go— reproduces the operator repro (no witness, quorum=off) plus a positive control proving 2 genuine active diskful + 1 ignored INACTIVE still grows a witness (quorum=majority), so the canonical auto-witness invariant does not regress.tests/e2e/cli-matrix/r-d-inactive-no-tiebreaker.sh— stand cell: 3 diskful → deactivate one →r done active diskful, asserts no TieBreaker and the deleted node never re-appears.tests/operator-harness/replay/r-d-inactive-no-tiebreaker.yaml— codifies the operator sequence withno_tiebreaker+resource_absentconvergence assertions.go build ./...,go test ./internal/controller/... ./pkg/rest/..., andgolangci-lint run ./internal/controller/...are green locally. L6/L7 stand validation to be run by the launcher.