Skip to content

fix(controllers): replace a persistently crash-looping member#336

Merged
Andrey Kolkov (androndo) merged 2 commits into
mainfrom
fix/fail-pvc-on-member
Jun 25, 2026
Merged

fix(controllers): replace a persistently crash-looping member#336
Andrey Kolkov (androndo) merged 2 commits into
mainfrom
fix/fail-pvc-on-member

Conversation

@androndo

@androndo Andrey Kolkov (androndo) commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

A non-bootstrap PVC member whose etcd cannot start — classically because its data dir was lost while the cluster membership moved on, leaving its frozen --initial-cluster stale (etcd refuses to boot: "error validating peerURLs ... member count is unequal") — crash-loops forever with no recovery path. The existing self-heal covers only memory-medium members (pod lost => data lost); for PVC members the PVC survives pod restarts, so "pod lost" != "data lost" and the stale --initial-cluster is never escaped.

Detect such a member (etcd container not ready and restarted past a threshold, excluding OOMKilled) and delete it for replacement — but only when the rest of the cluster still has quorum, so a cluster-wide outage never cascades into mass deletion (the finalizer's MemberRemove is quorum-gated too). The cluster controller then gap-fills a fresh member with a current --initial-cluster.

Also extend the Kamaji e2e to wait for readyMembers==3, then churn all three members one at a time and re-verify the tenant API still roundtrips through the fully replaced (hash-named) member set — guarding that member naming/replacement stays transparent to a Kamaji DataStore (stable -client Service + wildcard server-cert SAN).

Summary by CodeRabbit

  • Bug Fixes
    • Added self-healing for persistently crash-looping PVC-backed etcd members, with quorum gating to prevent cascading replacements.
    • Excludes bootstrap members from automatic replacement and avoids triggering recovery for OOMKilled cases.
  • Tests
    • Expanded unit and end-to-end coverage for crash-loop self-heal behavior and member churn.
    • Increased e2e test timeout to improve stability.
  • Documentation
    • Updated operational runbooks and concepts to clarify PVC crash-loop replacement behavior, timing expectations, and quorum safeguards.

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Adds self-healing for persistently crash-looping PVC-backed etcd members. The controller now detects stuck containers, checks quorum without the member, and deletes eligible EtcdMember CRs for replacement. Tests, e2e coverage, docs, and the e2e timeout are updated.

Changes

Crash-loop self-heal for EtcdMember

Layer / File(s) Summary
Self-heal helpers and updateStatus
controllers/etcdmember_controller.go
Introduces dataLossRestartThreshold, etcdContainerStuck, and clusterHasQuorumWithout. updateStatus now deletes a non-bootstrap PVC-backed EtcdMember when the pod is persistently stuck and quorum remains.
Controller self-heal unit coverage
controllers/etcdmember_controller_test.go
Adds TestEtcdContainerStuck, crash-loop and cluster-status test helpers, and updateStatus cases for quorum, bootstrap members, and stale ready-member counts.
PVC crash-loop self-heal e2e
test/e2e/member_selfheal_test.go
Adds an end-to-end PVC member recovery test that corrupts a member data dir, triggers replacement, waits for PVC/member cleanup, and verifies etcd traffic after recovery.
Member churn datastore e2e
test/e2e/kamaji_datastore_test.go
Extends the datastore e2e test with sequential member churn, waits for cluster recovery after each deletion, and verifies the original data survives across replacements.
Crash-loop replacement documentation
Makefile, README.md, docs/concepts.md, docs/operations.md
Increases the e2e test timeout and updates the README and docs to describe PVC crash-loop replacement behavior, quorum gating, replacement sequencing, and BrokenMembers status handling.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • kvaps
  • Kirill-Garbar
  • BROngineer

Poem

🐇 I hop through loops of crash and restart,
Quorum holds steady, that’s the smart part.
Fresh members bloom where old ones fell,
And etcd sings, “I’m doing well!” 🌿

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely summarizes the main controller behavior change: replacing a persistently crash-looping member.
Docstring Coverage ✅ Passed Docstring coverage is 95.45% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/fail-pvc-on-member

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces self-healing capabilities for unrecoverable etcd members by automatically deleting and replacing persistently crash-looping members, provided the rest of the cluster maintains quorum. It includes helper functions to detect stuck containers and verify quorum, along with comprehensive unit and E2E tests. Feedback on the changes suggests checking the Pod's deletion timestamp and current container state to avoid false-positive self-healing triggers during graceful termination or active OOMKilled events. Additionally, it is recommended to account for stale status by decrementing the ready member count if the member under evaluation is still marked as ready, preventing potential race conditions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread controllers/etcdmember_controller.go
Comment thread controllers/etcdmember_controller.go

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/kamaji_datastore_test.go`:
- Around line 181-183: The memberNames helper function calls t.Fatalf internally
when encountering List errors, which causes the test to fail immediately when
invoked inside the waitFor callback, bypassing the retry logic. To fix this,
create a variant of memberNames that returns an error value instead of calling
t.Fatalf, or inline the list call directly within the waitFor callback body to
handle errors gracefully and allow waitFor to retry on transient API failures.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 76cd05b0-e161-44f0-8a10-4521832db9dd

📥 Commits

Reviewing files that changed from the base of the PR and between 4248ade and b54a52f.

📒 Files selected for processing (3)
  • controllers/etcdmember_controller.go
  • controllers/etcdmember_controller_test.go
  • test/e2e/kamaji_datastore_test.go

Comment thread test/e2e/kamaji_datastore_test.go Outdated

@lllamnyp Timofei Larkin (lllamnyp) left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this change does

Adds self-heal for PVC-backed members: in updateStatus, when a non-bootstrap, non-memory member's etcd container is "stuck" (not ready, RestartCount >= 5, last termination not OOMKilled) AND the rest of the cluster has quorum, the operator deletes the EtcdMember CR. The finalizer then runs MemberRemove, the member-owned PVC is GC'd, and the cluster controller gap-fills a fresh GenerateName member with an empty data dir. The mechanism mirrors the existing memory-backed pod-loss self-heal and is sound: PVC owner-ref GC discards the corrupt data dir, and the quorum gate plus write-requires-quorum on MemberRemove prevent a cascade. Build, go vet, and the full controllers unit suite pass; the e2e file compiles under -tags e2e.

The code logic is correct. The blockers below are about documentation that now contradicts the code, and about the absence of end-to-end coverage for the new path.

Blocking issues

1. Documentation directly contradicts the new behavior

The repo documents — repeatedly and emphatically — that PVC-backed members do NOT auto-replace. This PR introduces exactly that auto-replacement and updates none of it. Since the PR changes this behavior area, it owns these docs.

  • README.md:35 — states verbatim: "No automatic broken-member replacement for PVC-backed clusters (memory-backed members do auto-replace on Pod loss; status.brokenMembers reads 0 in practice ...)". Now false. Rewrite to describe the new crash-loop self-heal (non-bootstrap PVC member, RestartCount >= dataLossRestartThreshold, OOMKilled excluded, quorum-gated) and drop/qualify the "No automatic broken-member replacement" claim.

  • docs/concepts.md:130 — the storage table's PVC "Pod loss →" cell reads "Same Pod / new Pod re-uses existing data dir; etcd rejoins with the same member ID and ClusterID." It omits the new terminal case: a PVC member whose data dir is lost/corrupt and that crash-loops past the threshold is now deleted and replaced with a fresh member ID (not a same-ID rejoin). Add the crash-loop replacement case to the PVC row.

  • docs/concepts.md:156 — states "For PVC-backed members today, isBroken stays a stub; richer detection is a future concern." and frames PVC corruption replacement as "a future hook ... (e.g. PVC corruption with a grace period)." That "future concern" is now implemented (via etcdContainerStuck, not via isBroken/BrokenMembers). Update to reflect that PVC crash-loop self-heal now exists, and clarify it does not flow through BrokenMembers.

  • docs/operations.md:462-469 ("Pod loss and auto-replacement") describes auto-replacement as a memory-only / Pod-loss mechanism keyed on Status.PodUID. There is now a second, PVC-specific replacement trigger (persistent crash-loop) with different detection (RestartCount, container readiness) and different latency (≥5 restarts at the up-to-5m CrashLoopBackOff cap ⇒ tens of minutes, not ~5s). Add a subsection documenting the PVC crash-loop replacement path — threshold, OOMKilled exclusion, quorum gate, and expected detection latency — so operators don't mistake a deliberately-deleted member for operator misbehavior.

2. The new e2e test does not exercise the new code path

test/e2e/kamaji_datastore_test.go adds a "member churn" block that deletes each member via kube.Delete (the normal scale-down / MemberRemove path) and asserts Kamaji keeps working through GenerateName-hashed replacements. This is a good churn test, but it never triggers etcdContainerStuck / the crash-loop self-heal — no member is made to crash-loop, no RestartCount is driven past the threshold. The production logic added by this PR therefore has zero e2e coverage; only the fake-client unit tests (TestEtcdContainerStuck, TestUpdateStatus_ReplacesStuckMember, TestUpdateStatus_KeepsStuckMemberWithoutQuorum, TestUpdateStatus_KeepsStuckBootstrapMember) cover it.

This is also the test that satisfies issue 1's contract: the docs claim PVC members auto-replace on persistent crash-loop, and that contract is between the running operator and a running cluster, so it belongs in e2e, not in a doc-grepping test. Add an e2e case that induces a real persistent crash-loop on a PVC-backed member (e.g. corrupt/empty the data dir so etcd hits "member count is unequal", or otherwise force RestartCount past dataLossRestartThreshold while the other two members stay ready) and assert: (a) the stuck member's CR is deleted, (b) a fresh GenerateName member with a new member ID appears, (c) the old data-<name> PVC is GC'd, and (d) the cluster returns to 3 ready members and still serves reads/writes.

Analyzed, not a blocker

Quorum threshold (controllers/etcdmember_controller.go:954). clusterHasQuorumWithout returns ReadyMembers >= desired/2+1. In Go integer arithmetic desired/2+1 == floor(desired/2)+1, which is exactly etcd quorum for every N (odd and even). The gate counts ready OTHERS (the stuck member is !podReady, so excluded from ReadyMembers), which is the correct set for the MemberRemove that runs while the stuck member is still registered. No change required. Optional: extend the unit table to cover desired ∈ {1,2,4,5}; today only desired=3 at ready ∈ {1,2} is covered.

Non-blocking observations

  • clusterHasQuorumWithout (controllers/etcdmember_controller.go:937-954) reads cluster.Status.ReadyMembers, set by the cluster controller, which can lag. Worst case in a multi-member outage, a stale-high count lets one member be deleted before the count catches up; the next member's reconcile sees the decremented count and is kept, and the finalizer's MemberRemove fails closed without quorum. Acceptable degradation (matches the documented memory behavior), but worth a line in issue 1's new docs subsection.

  • dataLossRestartThreshold = 5 is reasonable, but a slow restore or slow learner join on a replacement member could trip it, producing repeated replacement. Quorum-gated so it can't break the cluster, and a genuinely-stuck replacement should be replaced — acceptable, but note it in the new docs so operators expect it.

A non-bootstrap PVC member whose etcd cannot start — classically because its
data dir was lost while the cluster membership moved on, leaving its frozen
--initial-cluster stale (etcd refuses to boot: "error validating peerURLs ...
member count is unequal") — crash-loops forever with no recovery path. The
existing self-heal covers only memory-medium members (pod lost => data lost);
for PVC members the PVC survives pod restarts, so "pod lost" != "data lost"
and the stale --initial-cluster is never escaped.

Detect such a member (etcd container not ready and restarted past a threshold,
excluding OOMKilled) and delete it for replacement — but only when the rest of
the cluster still has quorum, so a cluster-wide outage never cascades into mass
deletion (the finalizer's MemberRemove is quorum-gated too). The cluster
controller then gap-fills a fresh member with a current --initial-cluster.

Also extend the Kamaji e2e to wait for readyMembers==3, then churn all three
members one at a time and re-verify the tenant API still roundtrips through the
fully replaced (hash-named) member set — guarding that member naming/replacement
stays transparent to a Kamaji DataStore (stable <cluster>-client Service +
wildcard server-cert SAN).

Signed-off-by: Andrey Kolkov <androndo@gmail.com>
@github-actions github-actions Bot added the documentation Improvements or additions to documentation label Jun 25, 2026
Signed-off-by: Andrey Kolkov <androndo@gmail.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/e2e/member_selfheal_test.go`:
- Around line 126-132: The wait in the self-heal test is swallowing member-list
failures because selfHealMembersErr can return nil even when listing fails, so
the victim check may pass without actually verifying removal. Update
selfHealMembersErr to return the list error instead of converting it to nil, and
make the waitFor callback in the self-heal test keep retrying on that error
until the victim is confirmed absent. Apply the same change anywhere else the
same wait pattern is used, especially the other self-heal verification block.
- Line 200: The ephemeral corruption command in member_selfheal_test.go is
masking failures because the final echo makes it always succeed. Update the
Command used for the corruption step so it fails fast when no files are modified
and only reports success after at least one file has actually been corrupted,
using the same find/dd-based logic in the test harness.
- Around line 73-79: The self-heal test is choosing a victim from the full
member list without excluding the bootstrap seed, which can make the replacement
wait fail. Update the selection in self-heal member setup to explicitly choose a
non-bootstrap member by checking each member’s Spec.Bootstrap field, and keep
the rest of the victim/PVC corruption flow in selfHealMembers and the related
test assertions unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ab449dbb-60ba-4d5b-8f92-b91c12cb3793

📥 Commits

Reviewing files that changed from the base of the PR and between b54a52f and 4d3f703.

📒 Files selected for processing (8)
  • Makefile
  • README.md
  • controllers/etcdmember_controller.go
  • controllers/etcdmember_controller_test.go
  • docs/concepts.md
  • docs/operations.md
  • test/e2e/kamaji_datastore_test.go
  • test/e2e/member_selfheal_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • controllers/etcdmember_controller.go
  • test/e2e/kamaji_datastore_test.go

Comment thread test/e2e/member_selfheal_test.go
Comment thread test/e2e/member_selfheal_test.go
Comment thread test/e2e/member_selfheal_test.go

@lllamnyp Timofei Larkin (lllamnyp) left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All previously raised concerns are addressed. Approving.

Verification performed

  • go build ./..., go vet ./controllers/, go vet -tags e2e ./test/e2e/ — all clean.
  • go test ./controllers/ -count=1 — pass, including the four new self-heal unit tests and TestEtcdContainerStuck.

How the earlier blockers were resolved

Documentation. The stale "no automatic broken-member replacement for PVC-backed clusters" claim is gone. README.md:35, the docs/concepts.md storage table, and the new Crash-loop self-heal (PVC members) section now describe the actual behavior — trigger (RestartCount >= 5), OOMKilled exclusion, quorum gate, tens-of-minutes latency, and self-limiting replacement-of-replacement. The new section anchor matches the GitHub slug and the storage-table cross-link resolves. docs/operations.md gains a matching subsection. The status.brokenMembers "reads 0" claim is preserved and correctly explained (self-heal triggers off restart count, not isBroken).

End-to-end coverage. test/e2e/member_selfheal_test.go (TestPVCMemberCrashLoopSelfHeal) induces a real crash-loop by corrupting the victim's data dir, then asserts the full chain: stuck member CR deleted, data-<member> PVC GC'd, a fresh GenerateName member gap-fills, cluster back to 3 ready, and etcdctl put/get still serves. Makefile e2e timeout raised 30m→45m for the CrashLoopBackOff ramp.

Correctness trace (sound)

  • Quorum math (controllers/etcdmember_controller.go:979): readyOthers >= desired/2+1; integer desired/2+1 == floor(desired/2)+1 == etcd quorum for all N. Conservative.
  • Stale-count compensation (:972-978): if this member's own stored MemberReady condition is still True, it is subtracted from ReadyMembers, so a lagging stale-high count cannot green-light a delete that drops below quorum. Covered by TestUpdateStatus_KeepsStuckMemberWhenStaleReadyCountIncludesIt.
  • No stale write after delete (:1044-1054): the self-heal branch returns immediately after r.Delete, never reaching r.Status().Update.
  • Replacement serialized: the cluster controller blocks further scaling while a member deletion is in flight, then gap-fills via scaleUp. The finalizer's MemberRemove is independently quorum-gated (belt-and-braces).
  • etcdContainerStuck (:924-944): excludes DeletionTimestamp != nil, Ready, RestartCount < 5, and OOMKilled in both current and last termination state; bootstrap/memory members excluded at the call site. Exhaustively covered by TestEtcdContainerStuck.
  • PVC GC: the PVC carries a controller owner-ref to the member, so deletion discards the corrupt data dir.

Residual note (not blocking)

The stale-count compensation handles "condition still True but no longer ready." The opposite narrow transient — condition already False while ReadyMembers is still stale-high by one — is not compensated by the subtraction, so the gate could briefly read one too high. This is already defended by the independent quorum gate on the finalizer's MemberRemove, and the e2e test exercises real (non-fake-client) timing. Flagged for awareness only; no change required.

@androndo Andrey Kolkov (androndo) merged commit e74a580 into main Jun 25, 2026
10 checks passed
@androndo Andrey Kolkov (androndo) deleted the fix/fail-pvc-on-member branch June 25, 2026 10:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix controllers documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants