fix(controller): r d of diskful must not convert to useless tiebreaker on inactive-replica resource (Bug 387) #77

Closed
Andrei Kvapil (kvaps) wants to merge 3 commits into main from fix/bug-387-rd-inactive-no-tb

@kvaps
Member

Summary

Deleting one active diskful replica with linstor r d on a resource that has 2 active diskful + 1 INACTIVE replica wrongly converted the deleted replica into a TieBreaker witness. The result was a useless 2-voter quorum (1 active diskful + 1 TB, no majority protection) that diverged from upstream LINSTOR, which simply deletes the replica with no witness conversion.

Operator repro on resource test1 (worker-1 INACTIVE, worker-2/worker-3 UpToDate):

$ linstor r d worker-2 test1
SUCCESS: resource deleted: test1 on worker-2
$ linstor r l
 test1  worker-1  DRBD,STORAGE  INACTIVE
 test1  worker-2  DRBD,STORAGE  Connecting(worker-3)  Created   <-- WRONG: came back as a TieBreaker
 test1  worker-3  DRBD,STORAGE  Unknown

Root cause

An INACTIVE replica has been taken down with drbdadm down (operator deactivation): its DRBD device is not up, so it casts no vote in the quorum:majority decision that the auto-tiebreaker invariant defends. The RD reconciler's ensureTiebreaker nevertheless counted it as a full diskful replica, so after the r d of one active diskful replica the topology looked like "2 diskful, even parity, no user-diskless" and the reconciler spuriously grew a TIE_BREAKER witness.
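The "no majority protection" point can be checked with a little arithmetic. The sketch below assumes a plain strict-majority rule (more than half of the voters); DRBD's actual quorum logic has more cases, and the function names `majority` and `toleratedFailures` are illustrative, not taken from this repository.

```go
package main

import "fmt"

// majority returns the smallest number of voters that is strictly more
// than half of the total (assumed strict-majority rule).
func majority(voters int) int {
	return voters/2 + 1
}

// toleratedFailures returns how many voters can be lost while the
// remaining ones still form a majority.
func toleratedFailures(voters int) int {
	return voters - majority(voters)
}

func main() {
	// 2 voters: 1 active diskful + 1 tiebreaker (the buggy outcome).
	// 3 voters: 2 active diskful + 1 tiebreaker (the canonical invariant).
	for _, voters := range []int{2, 3} {
		fmt.Printf("voters=%d majority=%d tolerated=%d\n",
			voters, majority(voters), toleratedFailures(voters))
	}
}
```

Under this assumption a 2-voter set needs both voters to keep quorum, so the witness adds a voter without adding any tolerance for failure, whereas a 3-voter set survives the loss of one.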

Fix

Drop INACTIVE replicas from the voting set before the diskful/diskless split in ensureTiebreaker, so they influence neither the diskful count nor the diskless/witness count. Aligns with upstream LINSTOR (delete-only, no witness conversion).
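A minimal sketch of the filtering step, for illustration only. The names filterActiveReplicas and splitByDiskless appear in this PR, but the `Replica` type, its fields, and the simplified `needsWitness` rule ("at least 2 diskful, even count, no diskless") are assumptions made here and are not the controller's real code.

```go
package main

import "fmt"

// Replica is a hypothetical, simplified stand-in for the controller's
// replica type; the field names are assumptions.
type Replica struct {
	Node     string
	Diskless bool
	Inactive bool // set after operator deactivation (drbdadm down)
}

// filterActiveReplicas keeps only replicas whose DRBD device is up,
// i.e. the ones that actually vote.
func filterActiveReplicas(replicas []Replica) []Replica {
	active := make([]Replica, 0, len(replicas))
	for _, r := range replicas {
		if !r.Inactive {
			active = append(active, r)
		}
	}
	return active
}

// splitByDiskless partitions replicas into diskful and diskless sets.
func splitByDiskless(replicas []Replica) (diskful, diskless []Replica) {
	for _, r := range replicas {
		if r.Diskless {
			diskless = append(diskless, r)
		} else {
			diskful = append(diskful, r)
		}
	}
	return diskful, diskless
}

// needsWitness applies an assumed, simplified auto-tiebreaker rule to the
// voting set only: an even number (>= 2) of diskful voters and no diskless.
func needsWitness(replicas []Replica) bool {
	diskful, diskless := splitByDiskless(filterActiveReplicas(replicas))
	return len(diskful) >= 2 && len(diskful)%2 == 0 && len(diskless) == 0
}

func main() {
	// State after `r d worker-2`: one INACTIVE replica, one active diskful.
	afterDelete := []Replica{
		{Node: "worker-1", Inactive: true},
		{Node: "worker-3"},
	}
	// Positive control: two active diskful plus one ignored INACTIVE.
	control := []Replica{
		{Node: "worker-1", Inactive: true},
		{Node: "worker-2"},
		{Node: "worker-3"},
	}
	fmt.Println(needsWitness(afterDelete)) // false
	fmt.Println(needsWitness(control))     // true
}
```

Filtering before the split is what keeps an INACTIVE replica out of both counts; filtering only the diskful side would still let an inactive diskless replica suppress a witness.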

Test coverage (CLI-bug-fix protocol)

  • L1 internal/controller/ensure_tiebreaker_inactive_bug_387_test.go — reproduces the operator repro (no witness, quorum=off) plus a positive control proving 2 genuine active diskful + 1 ignored INACTIVE still grows a witness (quorum=majority), so the canonical auto-witness invariant does not regress.
  • L6 tests/e2e/cli-matrix/r-d-inactive-no-tiebreaker.sh — stand cell: 3 diskful → deactivate one → r d one active diskful, asserts no TieBreaker and the deleted node never re-appears.
  • L7 tests/operator-harness/replay/r-d-inactive-no-tiebreaker.yaml — codifies the operator sequence with no_tiebreaker + resource_absent convergence assertions.

go build ./..., go test ./internal/controller/... ./pkg/rest/..., and golangci-lint run ./internal/controller/... are green locally. L6/L7 stand validation to be run by the launcher.

Andrei Kvapil (kvaps) and others added 3 commits June 2, 2026 23:32
…g count (Bug 387)

An INACTIVE replica is `drbdadm down` (operator deactivation) — its
DRBD device is not up, so it casts no vote in the quorum:majority
decision the auto-tiebreaker invariant defends. The RD reconciler's
ensureTiebreaker counted it as a full diskful, so deleting one active
diskful on a 2-active + 1-INACTIVE resource looked like "2 diskful,
even parity, no user-diskless" and spuriously grew a TIE_BREAKER
witness (1 active diskful + 1 witness = a 2-voter quorum with no
majority protection), diverging from upstream LINSTOR.

Drop INACTIVE replicas from the voting set before the diskful/diskless
split so they influence neither the diskful count nor the
diskless/witness count.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
…wns no tiebreaker

L1 regression. TestBug387InactiveReplicaNotCountedAsVotingDiskful
reproduces the operator repro (1 active diskful + 1 INACTIVE after
`r d`): EnsureTiebreaker must create no witness and write quorum=off.
TestBug387TwoActiveDiskfulStillGetWitness is the positive control —
2 genuine active diskful + 1 ignored INACTIVE still grows a witness
with quorum=majority, so the INACTIVE-exclusion does not regress the
canonical auto-witness invariant.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
L6 r-d-inactive-no-tiebreaker.sh: 3 diskful → deactivate one (INACTIVE)
→ r d one active diskful, asserts no TIE_BREAKER is spawned and the
deleted node does not re-appear as a witness.

L7 r-d-inactive-no-tiebreaker.yaml: codifies the same operator sequence
with no_tiebreaker + resource_absent convergence assertions.

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Andrei Kvapil <kvapss@gmail.com>
@coderabbitai

coderabbitai Bot commented Jun 2, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request addresses Bug 387 by ensuring that inactive replicas are excluded from the auto-tiebreaker calculations, preventing the controller from spuriously spawning a tiebreaker witness when an active replica is deleted. The changes include a new filtering function, unit tests, and integration tests. The reviewer pointed out that overwriting the replicas slice directly could lead to inaccurate logging and potential witness placement bugs, and suggested using a separate activeReplicas slice instead.

Comment on lines +236 to 238
replicas = filterActiveReplicas(replicas)

diskful, diskless := splitByDiskless(replicas)

medium

Overwriting the replicas slice with only active replicas has two unintended side effects:

1. Inaccurate Logging: The log statement on line 269 will output the count of active replicas as the total replicas count, which can be confusing for operators.
2. Potential Witness Placement Bug: Passing the filtered replicas slice to applyWitnessDecision (and subsequently createWitness) means hostingReplica will not contain nodes hosting inactive replicas. Although pickTiebreakerNodeForRD attempts to re-probe the store as a defense-in-depth measure, it only excludes diskful replicas. If there is an inactive diskless replica, or if the store query is stale, the controller could attempt to place a tiebreaker witness on a node that already hosts an inactive replica.

Instead, we should keep replicas as the full list of replicas, and define a new activeReplicas slice for the quorum and tiebreaker calculations.

	activeReplicas := filterActiveReplicas(replicas)

	diskful, diskless := splitByDiskless(activeReplicas)
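The placement concern can be illustrated with a small sketch. Everything here except the name filterActiveReplicas is hypothetical: the `Replica` type, `hostingNodes`, `pickWitnessNode`, and the spare node `worker-4` are assumptions for illustration and do not reflect the controller's real applyWitnessDecision or pickTiebreakerNodeForRD.

```go
package main

import "fmt"

// Replica is a hypothetical, simplified replica record.
type Replica struct {
	Node     string
	Inactive bool
}

// filterActiveReplicas keeps only replicas that are not INACTIVE.
func filterActiveReplicas(replicas []Replica) []Replica {
	active := make([]Replica, 0, len(replicas))
	for _, r := range replicas {
		if !r.Inactive {
			active = append(active, r)
		}
	}
	return active
}

// hostingNodes builds the set of nodes that already host a replica.
func hostingNodes(replicas []Replica) map[string]bool {
	nodes := make(map[string]bool, len(replicas))
	for _, r := range replicas {
		nodes[r.Node] = true
	}
	return nodes
}

// pickWitnessNode returns the first candidate that hosts no replica in the
// given list, or "" if every candidate is occupied (hypothetical helper).
func pickWitnessNode(candidates []string, replicas []Replica) string {
	occupied := hostingNodes(replicas)
	for _, node := range candidates {
		if !occupied[node] {
			return node
		}
	}
	return ""
}

func main() {
	replicas := []Replica{
		{Node: "worker-1", Inactive: true},
		{Node: "worker-2"},
		{Node: "worker-3"},
	}
	candidates := []string{"worker-1", "worker-4"}

	// Exclusion built from the filtered slice forgets worker-1.
	fmt.Println(pickWitnessNode(candidates, filterActiveReplicas(replicas))) // worker-1
	// Exclusion built from the full slice skips worker-1.
	fmt.Println(pickWitnessNode(candidates, replicas)) // worker-4
}
```

In this toy model the filtered slice is the right input for vote counting, while the full slice is the right input for deciding which nodes are already occupied, which is the separation the suggestion asks for.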

@kvaps
Member Author

Superseded by #83 (merged): the four operational-lifecycle fixes (Bugs 384–387) plus the L7 harness hardening (Bug 388) were validated as a unit on the live stand and merged together in #83. The commits from this branch are included there.
