
[Feature] Recover ETCD multinode cluster from transient quorum loss #436

Closed
Tracked by #107
abdasgupta opened this issue Sep 19, 2022 · 4 comments
Labels
kind/enhancement (Enhancement, improvement, extension), status/closed (Issue is closed, either delivered or triaged)

Comments

@abdasgupta
Contributor

Feature (What you would like to be added):
Transient quorum loss in a multinode ETCD cluster happens when a majority (>= n/2 + 1) of the ETCD pods cannot join the cluster due to network errors, pod scheduling errors, high CPU/memory usage, etc. (for a 3-member cluster, losing 2 members loses quorum). Transient quorum loss generally lasts for a short period of time, and the ETCD cluster remains unavailable during that time. But when the failed pods restart properly, they should rejoin the cluster and make the ETCD cluster available as before.

Motivation (Why is this needed?):
Recovery from a transient quorum loss is natively supported by an ETCD multinode cluster, but it currently does not work with our implementation of ETCD backup-restore. At present, Backup-Restore takes extra actions to restore a single-node cluster and to scale up the ETCD cluster, and these actions block the normal path of recovering from a transient quorum loss. So we need this feature so that transient quorum loss recovery, single-node restoration, and scale-up all work together.

Approach/Hint to the implement solution (optional):
If an ETCD pod restarts and it has a valid ETCD data directory, let it join the cluster as per its ETCD config (see the sketch below).
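
For illustration, here is a minimal Go sketch of that decision, under stated assumptions: `decideStartupAction`, the data-dir path, and the validity flag are hypothetical placeholders, and the real validation/restoration logic in etcd-backup-restore is more involved.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// decideStartupAction illustrates the proposed behaviour: a restarting pod
// with an existing, valid data directory should simply rejoin the cluster
// using its existing etcd configuration, instead of being funneled through
// the single-node restoration / scale-up path.
func decideStartupAction(dataDir string, dataDirValid bool) string {
	if _, err := os.Stat(filepath.Join(dataDir, "member")); err == nil && dataDirValid {
		return "rejoin cluster with existing etcd config"
	}
	// Restoration or learner addition is only needed when the data
	// directory is missing or fails validation.
	return "fall back to restoration / scale-up handling"
}

func main() {
	// The data-dir path and the validity flag are placeholders; in
	// etcd-backup-restore the validity would come from its own checks.
	fmt.Println(decideStartupAction("/var/etcd/data/new.etcd", true))
}
```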

@abdasgupta abdasgupta added the kind/enhancement Enhancement, improvement, extension label Sep 19, 2022
@abdasgupta
Contributor Author

This feature is addressed by gardener/etcd-backup-restore#528 on the ETCDBR side.

@timuthy
Member

timuthy commented Sep 20, 2022

I tested a transient quorum loss scenario with Etcd-Druid release v0.13.1 (the latest release at the time of writing this comment). A Go client sketch of the checks from steps 6, 11, and 14 follows the list.

  1. Create a KinD cluster.
  2. Apply Druid CRDs.
  3. Run Etcd-Druid against KinD cluster.
  4. Create a three member cluster (w/ or w/o backup) via etcd resource.
  5. Wait until all pods are ready.
  6. Test cluster with etcdctl by putting and reading data.
  7. Shut down Etcd-Druid to prevent any further reversions.
  8. Make an "invalid" change in the statefulset, e.g. change the etcd .image to a non-existing one.
  9. Observe that etcd-test-2 pod will be unavailable and stuck in ImagePullBackOff.
  10. Delete another pod etcd-test-1 (or also the third one) to provoke quorum loss.
  11. Use etcdctl to verify the cluster lost its quorum.
  12. Revert the change from step 8 and delete the pods stuck in ImagePullBackOff.
  13. Wait until all pods are ready.
  14. Use etcdctl to verify the quorum has been recovered w/o manual intervention.
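
For reference, a minimal Go sketch of the checks behind steps 6, 11, and 14, using the etcd clientv3 package in place of etcdctl; the endpoint, key, and timeouts are placeholders and assume the client port is reachable from the test machine (e.g. via a port-forward).

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// hasQuorum performs a linearizable read with a short timeout. Linearizable
// reads go through raft, so they only succeed while the cluster has quorum.
func hasQuorum(cli *clientv3.Client) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	_, err := cli.Get(ctx, "quorum-probe")
	return err == nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder, e.g. a port-forwarded client service
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Step 6: put and read back test data while the cluster is healthy.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if _, err := cli.Put(ctx, "foo", "bar"); err != nil {
		fmt.Println("put failed:", err)
	}

	// Steps 11 and 14: the same probe reports quorum loss and, later, recovery.
	fmt.Println("cluster has quorum:", hasQuorum(cli))
}
```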

Given the steps above, it's unclear to me why a transient quorum loss is currently not covered and under which circumstances it causes problems. Can you please explain your test path(s)?

@ishan16696
Member

Yes, as long as the PVCs of a majority of the etcd cluster members are intact and no data-dir corruption has happened to those members, the etcd cluster can recover from transient quorum loss without any intervention.
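
As an illustration of what "no data-dir corruption" amounts to, here is a rough sketch assuming go.etcd.io/bbolt and the standard member/wal and member/snap/db layout; `dataDirLooksHealthy` and the data-dir path are hypothetical, and etcd-backup-restore's actual validation is considerably more thorough.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"

	bolt "go.etcd.io/bbolt"
)

// dataDirLooksHealthy checks that the WAL directory exists and that the
// backend bbolt db file can be opened read-only without errors. This is
// only a stand-in for the real validation done by etcd-backup-restore.
func dataDirLooksHealthy(dataDir string) error {
	if _, err := os.Stat(filepath.Join(dataDir, "member", "wal")); err != nil {
		return fmt.Errorf("WAL directory missing: %w", err)
	}
	dbPath := filepath.Join(dataDir, "member", "snap", "db")
	db, err := bolt.Open(dbPath, 0600, &bolt.Options{ReadOnly: true, Timeout: 5 * time.Second})
	if err != nil {
		return fmt.Errorf("backend db cannot be opened: %w", err)
	}
	return db.Close()
}

func main() {
	// The path is a placeholder for the member's data directory.
	if err := dataDirLooksHealthy("/var/etcd/data/new.etcd"); err != nil {
		fmt.Println("data dir invalid:", err)
		return
	}
	fmt.Println("data dir looks healthy")
}
```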

@ishan16696
Member

ishan16696 commented Sep 21, 2022

Closing this issue as transient quorum loss is handled (refer: #436 (comment))
Please feel free to re-open if any edge case is not handled.
/close
