[Feature] Validate and analysis restoration handling in etcd-backup-restore according to the multi-node ETCD proposal #323

Closed
Tracked by #107
amshuman-kr opened this issue Apr 12, 2021 · 7 comments
Labels
kind/enhancement Enhancement, improvement, extension lifecycle/stale Nobody worked on this for 6 months (will further age) release/ga Planned for GA(General Availability) release of the Feature status/closed Issue is closed (either delivered or triaged)

@amshuman-kr
Collaborator

amshuman-kr commented Apr 12, 2021

Feature (What you would like to be added):

Restoration handling for a single node is already implemented and released. However, we still need to evaluate and analyse restoration handling in etcd-backup-restore against the enhancements to the initialisation sequence described in the multi-node ETCD proposal.

The enhancements should cover the following cases.

  1. Restart of an existing member of a quorate cluster with valid metadata but without valid data
  2. Restart of an existing member of a quorate cluster without valid metadata
  3. Restart of the first member of a non-quorate cluster without valid data
  4. Restart of a following member of a non-quorate cluster without valid data
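
To make the four cases above concrete, here is a minimal Go sketch of how a restarting member could be classified. It is only an illustration under stated assumptions: `MemberState`, `RestartCase` and `Classify` are hypothetical names and do not come from etcd-backup-restore, and the sketch deliberately says nothing about which recovery action each case should trigger, since that is defined by the proposal rather than by this issue.

```go
package main

import "fmt"

// RestartCase enumerates the four restart scenarios listed in this issue.
// All names here are hypothetical and not taken from etcd-backup-restore.
type RestartCase int

const (
	CaseNone                        RestartCase = iota // none of the four cases
	QuorateValidMetadataNoData                         // case 1
	QuorateNoMetadata                                  // case 2
	NonQuorateFirstMemberNoData                        // case 3
	NonQuorateFollowingMemberNoData                    // case 4
)

// MemberState is an assumed summary of what a restarting member observes.
type MemberState struct {
	ClusterQuorate bool // the rest of the cluster currently has quorum
	MetadataValid  bool // member metadata is present and valid
	DataValid      bool // the data directory is present and valid
	FirstMember    bool // this is the first member to restart
}

// Classify maps an observed state to one of the four cases in the issue.
func Classify(s MemberState) RestartCase {
	if s.ClusterQuorate {
		if !s.MetadataValid {
			return QuorateNoMetadata
		}
		if !s.DataValid {
			return QuorateValidMetadataNoData
		}
		return CaseNone
	}
	// Non-quorate cluster: the issue only lists the "without valid data" cases.
	if s.DataValid {
		return CaseNone
	}
	if s.FirstMember {
		return NonQuorateFirstMemberNoData
	}
	return NonQuorateFollowingMemberNoData
}

// String returns a short label for each case.
func (c RestartCase) String() string {
	switch c {
	case QuorateValidMetadataNoData:
		return "case 1: quorate, valid metadata, no valid data"
	case QuorateNoMetadata:
		return "case 2: quorate, no valid metadata"
	case NonQuorateFirstMemberNoData:
		return "case 3: non-quorate, first member, no valid data"
	case NonQuorateFollowingMemberNoData:
		return "case 4: non-quorate, following member, no valid data"
	default:
		return "not one of the four listed cases"
	}
}

func main() {
	s := MemberState{ClusterQuorate: true, MetadataValid: true, DataValid: false}
	fmt.Println(Classify(s))
}
```

Keeping the classification separate from the recovery action means each case can be validated on its own, which fits the idea of picking individually executable pieces of the proposal.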

Motivation (Why is this needed?):

Pick individually executable pieces of the multi-node proposal.

Approach/Hint to implement the solution (optional):

@amshuman-kr amshuman-kr added the kind/enhancement Enhancement, improvement, extension label Apr 12, 2021
@amshuman-kr amshuman-kr added this to the 2021-Q2 milestone Apr 12, 2021
@amshuman-kr amshuman-kr modified the milestones: 2021-Q2, v0.15.0 Apr 15, 2021
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Oct 27, 2021
@brumhard

Hi @amshuman-kr,

@breuerfelix and I would like to tackle this issue. If we understand the proposal correctly, this is the last part of the series of cases explained in the decision table.
Because of that we wanted to ask whether it makes sense to start this issue or if we could help to bring the HA feature forward in any other way (possibly by contributing to other ongoing feature PRs).

@amshuman-kr
Collaborator Author

Thanks @brumhard and @breuerfelix for showing interest and going through the proposal. And also for offering to help with implementing parts of it.

Unfortunately, there are some sequencing issues due to dependencies between the sub-tasks of the multi-node/HA story. For example, this task depends on #321 and #322.

This is the reason this task hasn't been picked up yet. I think @ishan16696 is working on #321, but #322 is probably not taken up yet because a couple of other topics were picked up before the HA topic (such as backup compaction).

#322 overlaps a bit with #321, but not too much, though I expect there will be some merge conflicts if they are done in parallel.

@timuthy, @stoyanr, @abdasgupta do you have any thoughts about which issues @brumhard and @breuerfelix can contribute to?

@timuthy
Member

timuthy commented Nov 29, 2021

I'd say that contributions to gardener/etcd-druid#221 would be much appreciated. It's a topic that doesn't require much coordination because it has few or no dependent items, and so far no one has picked up work in this area.

@brumhard

@amshuman-kr @timuthy ok sounds a bit tough. I just wanted to stress that the HA proposal is pretty high priority for us and if we can help with anything we will put time and effort into it.

Tbh gardener/etcd-druid#22 doesn't seem to be the most crucial thing to do for the HA proposal or is it (apart from the BackupReady condition alert which is not implemented yet if I understand correctly)? If we can help with anything else for the actual implementation we'd be glad to do so.

Our main goal is to get this feature up and running asap.

@timuthy
Member

timuthy commented Nov 30, 2021

> @amshuman-kr @timuthy ok sounds a bit tough. I just wanted to stress that the HA proposal is pretty high priority for us and if we can help with anything we will put time and effort into it.

This is definitely true for us as well 👍 Due to dependencies on backup compaction and CP migration (GEP-17), plenty of use cases were considered so that we can arrive at a well-functioning multi-node etcd feature.
Today we discussed that it would be an additional challenge if we needed to plan extra ramp-up and coordination time so that you can take over items like this one, as it has plenty of dependencies as well (as mentioned by @amshuman-kr). Thus, we absolutely appreciate your willingness to help push the multi-node feature, but we think that a contribution to a more isolated topic would do more to bring it forward.

> Tbh gardener/etcd-druid#22 doesn't seem to be the most crucial thing to do for the HA proposal or is it (apart from the BackupReady condition alert which is not implemented yet if I understand correctly)?

For the reasons mentioned above, #221 was suggested. For us, it's an important topic as it will help us get crucial insights when we eventually roll out or transition to multi-node. As of today, we plan to ship proper observability together with the multi-node features and don't consider it a nice-to-have for the future.

> Our main goal is to get this feature up and running asap.

+1 here 🙂

@ishan16696 ishan16696 removed this from the v0.15.0 milestone Feb 25, 2022
@abdasgupta abdasgupta changed the title [Feature] Enhance restoration handling in etcd-backup-restore according to the multi-node ETCD proposal [Feature] Validate and analysis restoration handling in etcd-backup-restore according to the multi-node ETCD proposal Jun 8, 2022
@ashwani2k ashwani2k added the release/ga Planned for GA(General Availability) release of the Feature label Jul 6, 2022
@abdasgupta abdasgupta added this to the v0.19.0 milestone Jul 11, 2022
@abdasgupta abdasgupta modified the milestones: v0.19.0, v0.20.0 Jul 25, 2022
@ishan16696 ishan16696 self-assigned this Aug 11, 2022
@ishan16696 ishan16696 removed this from the v0.20.0 milestone Aug 16, 2022
@ishan16696
Member

ishan16696 commented Aug 19, 2022

Update on this issue:

  1. Restart of an existing member of a quorate cluster with valid metadata but without valid data
  2. Restart of an existing member of a quorate cluster without valid metadata
  3. Restart of the first member of a non-quorate cluster without valid data
    • It will be taken care of by the quorum loss handling feature.
  4. Restart of a following member of a non-quorate cluster without valid data
    • It will be taken care of by the quorum loss handling feature.

@ishan16696 ishan16696 removed their assignment Aug 19, 2022
@ishan16696 ishan16696 added this to the v0.20.0 milestone Aug 22, 2022
@abdasgupta abdasgupta modified the milestones: v0.20.0, ---, v0.21.0 Sep 19, 2022
@abdasgupta
Contributor

We have already validated recovery from transient quorum loss (see gardener/etcd-druid#436). A non-quorate cluster will need human intervention: the human operator will decide how to recover it, and we will provide a playbook for their guidance. Please follow gardener/etcd-druid#437 for more details. As the scope of this issue is complete, I am closing it.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Sep 21, 2022

7 participants