
[Feature] Handle permanent quorum loss scenario for ETCD multinode cluster #437

Closed
Tracked by #107
abdasgupta opened this issue Sep 21, 2022 · 1 comment
Labels
kind/enhancement (Enhancement, improvement, extension), release/ga (Planned for GA (General Availability) release of the Feature), status/closed (Issue is closed (either delivered or triaged))

Comments

@abdasgupta
Contributor

Feature (What you would like to be added):

An ETCD multi-node cluster should be able to recover gracefully from a permanent quorum loss after human intervention.

Motivation (Why is this needed?):
Permanent quorum loss happens when a majority (at least ⌊n/2⌋+1) of the ETCD member nodes are permanently down, e.g. due to complete disk failure. In such a quorum loss, the cluster does not recover automatically because the ETCD data on disk is unavailable for the majority of the ETCD members. The remaining nodes also can't add new ETCD members, as the quorum is already lost, so the cluster can't process any new requests thereafter. A human operator must detect such a state of the cluster and recover the cluster gracefully, ideally without losing any data.

Approach/Hint to implement the solution (optional):
As this situation is highly unlikely to happen in a production cluster, we have not yet decided on a concrete design that would serve the purpose. One thing we have decided so far is that in such a scenario a human operator has to intervene and decide what to do. We are already preparing a playbook for the human operator to follow in this scenario. The playbook mainly asks the human operator to execute the following steps (a sketch of how the scaling steps could be automated follows the list):

  1. Scale down the sts replicas to 0
  2. Delete all 3 PVCs
  3. Scale the sts back up to 1 replica
  4. Wait for the single-member etcd to come up, download the snapshots and start running normally
  5. Scale the sts up to the desired number of replicas
  6. Verify that the cluster is formed correctly
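
Purely as an illustration (not the actual playbook tooling), here is a minimal client-go sketch of how the scaling steps 1 and 3 could be automated. The namespace `shoot--etcd` and StatefulSet name `etcd-main` are hypothetical, PVC deletion (step 2) is deliberately left as a manual operator step, and the waiting/verification between steps is omitted for brevity:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// scaleEtcdStatefulSet sets the replica count of the etcd StatefulSet.
// It covers playbook steps 1 (replicas=0), 3 (replicas=1) and 5 (replicas=desired).
func scaleEtcdStatefulSet(ctx context.Context, cs kubernetes.Interface, namespace, name string, replicas int32) error {
	scale, err := cs.AppsV1().StatefulSets(namespace).GetScale(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	scale.Spec.Replicas = replicas
	_, err = cs.AppsV1().StatefulSets(namespace).UpdateScale(ctx, name, scale, metav1.UpdateOptions{})
	return err
}

func main() {
	// Hypothetical names, used purely for illustration.
	const namespace = "shoot--etcd"
	const stsName = "etcd-main"

	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Step 1: scale the StatefulSet down to 0 replicas.
	if err := scaleEtcdStatefulSet(ctx, cs, namespace, stsName, 0); err != nil {
		panic(err)
	}
	// Step 2 (manual): delete the PVCs, or rather preserve the old data (see below).
	// Step 3: bring up a single member that restores from the backup bucket.
	if err := scaleEtcdStatefulSet(ctx, cs, namespace, stsName, 1); err != nil {
		panic(err)
	}
	fmt.Println("scaled the etcd StatefulSet down and back up to a single replica")
}
```

In the manual playbook the operator would typically achieve the same with `kubectl scale statefulset` and `kubectl delete pvc`.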

We may need to add some intermediary steps to the above playbook based on our findings from testing. We may also automate some of the steps in the playbook to ease the work of the human operator. But even if we automate, we should not delete the PVs automatically; instead, we should preserve the old data of the remaining ETCD members in a separate folder (this will require some parts of #382).
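
To make the "preserve instead of delete" idea concrete, here is a small, self-contained sketch that moves a member's data directory into a timestamped sibling folder instead of deleting it. The helper name and the data-directory path are made up for this example; the real layout depends on how the member pods mount their volumes:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// preserveMemberData renames an etcd member's data directory into a
// timestamped sibling folder so the old data survives the recovery instead
// of being deleted.
func preserveMemberData(dataDir string) (string, error) {
	backupDir := fmt.Sprintf("%s.bak-%s", dataDir, time.Now().Format("20060102-150405"))
	if err := os.Rename(dataDir, backupDir); err != nil {
		return "", fmt.Errorf("preserving old etcd data failed: %w", err)
	}
	return backupDir, nil
}

func main() {
	// Hypothetical data-directory path, for illustration only.
	backupDir, err := preserveMemberData("/var/etcd/data/new.etcd")
	if err != nil {
		panic(err)
	}
	fmt.Println("old member data preserved at", backupDir)
}
```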

Additionally, we are also thinking of the following implementations:

  1. Let the leader always take a final incremental snapshot after it loses leadership (independent, but relevant nonetheless: we should avoid double-application of the same revisions, e.g. 500-2000 should trump 500-1000). If we always have the latest snapshots in our backup bucket, even in the quorum loss case, we don't need to worry about the PVCs. We can safely discard all the PVCs and restore the cluster from the backup bucket.
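
To illustrate the "500-2000 should trump 500-1000" rule, here is a minimal sketch of how a restorer could decide which of two overlapping delta snapshots to apply. The type and field names are invented for this example and are not the actual etcd-backup-restore types:

```go
package main

import "fmt"

// deltaSnapshot describes the revision range covered by an incremental snapshot.
type deltaSnapshot struct {
	StartRevision int64
	LastRevision  int64
}

// supersedes reports whether snapshot a makes snapshot b redundant: when both
// start at the same revision, the one covering more revisions wins, so the
// restorer never applies the same revisions twice.
func supersedes(a, b deltaSnapshot) bool {
	return a.StartRevision == b.StartRevision && a.LastRevision >= b.LastRevision
}

func main() {
	final := deltaSnapshot{StartRevision: 500, LastRevision: 2000}
	stale := deltaSnapshot{StartRevision: 500, LastRevision: 1000}
	fmt.Println(supersedes(final, stale)) // true: apply only the 500-2000 snapshot
}
```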
@abdasgupta
Contributor Author

We are not handling permanent quorum loss automatically, as of now. If the ETCD cluster faces a permanent quorum loss, a human operator needs to intervene. We wrote a playbook to guide the human operator in bringing up the cluster from permanent quorum loss.

I am closing this issue, as of now. If we need to add any additional flow for handling permanent quorum loss in a more automated way, we will raise another issue and track our progress there.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Nov 14, 2022