
[Feature] Recover ETCD multinode cluster from transient quorum loss #436

Closed
Tracked by #107
abdasgupta opened this issue Sep 19, 2022 · 4 comments
Labels
kind/enhancement (Enhancement, improvement, extension), status/closed (Issue is closed, either delivered or triaged)

Comments

@abdasgupta
Contributor

Feature (What you would like to be added):
Transient quorum loss in a multinode ETCD cluster happens when a majority (>= n/2 + 1) of the ETCD pods cannot join the cluster due to network errors, pod scheduling errors, high CPU/memory usage, etc. (for a 3-member cluster, losing 2 members loses quorum). Transient quorum loss generally lasts for a short period of time, and the ETCD cluster remains unavailable during that time. But when the failed pods restart properly, they should rejoin the cluster and make the ETCD cluster available as before.

Motivation (Why is this needed?):
Recovery from a transient quorum loss is natively supported by an ETCD multinode cluster, but it currently does not work with our implementation of ETCD backup-restore. At present, Backup-Restore takes extra actions to restore a single-node cluster and to scale up the ETCD cluster, and these actions block the normal path of recovering from a transient quorum loss. So we need this feature so that transient quorum loss recovery, single-node restoration, and scale-up all work together.

Approach/Hint to the implement solution (optional):
If an ETCD pod restarts and it has a valid ETCD data directory, let it join the cluster as per its ETCD config (see the sketch below).
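
For illustration, here is a minimal Go sketch of that decision, under stated assumptions: `decideStartupAction`, the data-dir path, and the validity flag are hypothetical placeholders, and the real validation/restoration logic in etcd-backup-restore is more involved.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// decideStartupAction illustrates the proposed behaviour: a restarting pod
// with an existing, valid data directory should simply rejoin the cluster
// using its existing etcd configuration, instead of being funneled through
// the single-node restoration / scale-up path.
func decideStartupAction(dataDir string, dataDirValid bool) string {
	if _, err := os.Stat(filepath.Join(dataDir, "member")); err == nil && dataDirValid {
		return "rejoin cluster with existing etcd config"
	}
	// Restoration or learner addition is only needed when the data
	// directory is missing or fails validation.
	return "fall back to restoration / scale-up handling"
}

func main() {
	// The data-dir path and the validity flag are placeholders; in
	// etcd-backup-restore the validity would come from its own checks.
	fmt.Println(decideStartupAction("/var/etcd/data/new.etcd", true))
}
```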

@abdasgupta abdasgupta added the kind/enhancement Enhancement, improvement, extension label Sep 19, 2022
@abdasgupta
Contributor Author

This feature is addressed by gardener/etcd-backup-restore#528 on the ETCDBR side.

@timuthy
Member

timuthy commented Sep 20, 2022

I tested a transient quorum loss scenario with Etcd-Druid release v0.13.1 (the latest release at the time of writing this comment). A Go client sketch of the checks from steps 6, 11, and 14 follows the list.

  1. Create a KinD cluster.
  2. Apply Druid CRDs.
  3. Run Etcd-Druid against KinD cluster.
  4. Create a three member cluster (w/ or w/o backup) via etcd resource.
  5. Wait until all pods are ready.
  6. Test cluster with etcdctl by putting and reading data.
  7. Shut down Etcd-Druid to prevent any further reversions.
  8. Make an "invalid" change in the statefulset, e.g. change the etcd .image to a non-existing one.
  9. Observe that etcd-test-2 pod will be unavailable and stuck in ImagePullBackOff.
  10. Delete another pod etcd-test-1 (or also the third one) to provoke quorum loss.
  11. Use etcdctl to verify the cluster lost its quorum.
  12. Revert the change from step 8 and delete the pods stuck in ImagePullBackOff.
  13. Wait until all pods are ready.
  14. Use etcdctl to verify the quorum has been recovered w/o manual intervention.
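
For reference, a minimal Go sketch of the checks behind steps 6, 11, and 14, using the etcd clientv3 package in place of etcdctl; the endpoint, key, and timeouts are placeholders and assume the client port is reachable from the test machine (e.g. via a port-forward).

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// hasQuorum performs a linearizable read with a short timeout. Linearizable
// reads go through raft, so they only succeed while the cluster has quorum.
func hasQuorum(cli *clientv3.Client) bool {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	_, err := cli.Get(ctx, "quorum-probe")
	return err == nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder, e.g. a port-forwarded client service
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Step 6: put and read back test data while the cluster is healthy.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	if _, err := cli.Put(ctx, "foo", "bar"); err != nil {
		fmt.Println("put failed:", err)
	}

	// Steps 11 and 14: the same probe reports quorum loss and, later, recovery.
	fmt.Println("cluster has quorum:", hasQuorum(cli))
}
```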

Given the steps above, it's unclear to me why a transient quorum loss is currently not covered and under which circumstances it causes problems. Can you please explain your test path(s)?

@ishan16696
Member

Yes, as long as the PVCs of a majority of the etcd cluster members are intact and no data-dir corruption has happened to those members, the etcd cluster can recover from transient quorum loss without any intervention.
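
As an illustration of what "no data-dir corruption" amounts to, here is a rough sketch assuming go.etcd.io/bbolt and the standard member/wal and member/snap/db layout; `dataDirLooksHealthy` and the data-dir path are hypothetical, and etcd-backup-restore's actual validation is considerably more thorough.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"

	bolt "go.etcd.io/bbolt"
)

// dataDirLooksHealthy checks that the WAL directory exists and that the
// backend bbolt db file can be opened read-only without errors. This is
// only a stand-in for the real validation done by etcd-backup-restore.
func dataDirLooksHealthy(dataDir string) error {
	if _, err := os.Stat(filepath.Join(dataDir, "member", "wal")); err != nil {
		return fmt.Errorf("WAL directory missing: %w", err)
	}
	dbPath := filepath.Join(dataDir, "member", "snap", "db")
	db, err := bolt.Open(dbPath, 0600, &bolt.Options{ReadOnly: true, Timeout: 5 * time.Second})
	if err != nil {
		return fmt.Errorf("backend db cannot be opened: %w", err)
	}
	return db.Close()
}

func main() {
	// The path is a placeholder for the member's data directory.
	if err := dataDirLooksHealthy("/var/etcd/data/new.etcd"); err != nil {
		fmt.Println("data dir invalid:", err)
		return
	}
	fmt.Println("data dir looks healthy")
}
```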

@ishan16696
Member

ishan16696 commented Sep 21, 2022

Closing this issue as transient quorum loss is handled (refer: #436 (comment))
Please feel free to re-open if any edge case is not handled.
/close
