[GEP-7] Control Plane Migration: Etcd Backup Migration / Leader Election PoC #3875

Closed
stoyanr opened this issue Apr 13, 2021 · 4 comments
Labels
area/control-plane-migration: Control plane migration related
kind/enhancement: Enhancement, improvement, extension
lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.
priority/3: Priority (lower number equals higher priority)

Comments

Contributor

stoyanr commented Apr 13, 2021

How to categorize this issue?

/area control-plane-migration
/kind enhancement
/priority 3

What would you like to be added:

We would like to implement a PoC for the etcd backup migration / leader election mechanism currently proposed in #3066, in order to evaluate the feasibility of the proposed approach and try to address the remaining open issues. We are not yet aiming for production-quality code and won't be opening PRs to gardener components while still at the PoC stage.

Below is a rough outline of what we would like to implement as part of the PoC. Based on what we learn or on ongoing discussions, we might change direction and implement things differently. In such cases, we will update this issue accordingly.

  1. gardenlet restoration flow
  • Creates one additional BackupEntry (BE) for the source seed and changes the seed name in the existing BE to the destination seed
  • Annotates the etcd resource with operation=restore
  • Waits for the status of the etcd resource to indicate that the migration is finished
  • Annotates the etcd resource with operation=reconcile
  • Deletes the source seed BE
  2. etcd-druid
  • Reconciles an etcd resource with operation=restore and creates an etcd-backup-restore migrate job
  • Waits for the etcd-backup-restore migrate job to complete and updates the status of the etcd resource
  3. BE controller
  • Handles 2 BEs correctly to avoid name collisions in secrets
  4. etcd-backup-restore support for migrate
  • Writes header files in the source and destination BackupBuckets (BBs) for the migrate operation
  • Waits for a snapshot newer than the header timestamp to be created (final snapshot) with a timeout (e.g. 10 min)
  • Migrates all snapshots from the source to the destination BB (final snapshot first, rest after that)
  • Updates the header files to record that the migration was performed
  5. etcd-backup-restore normal operation
  • Checks header files before taking a snapshot
  • If source detected
    • If final snapshot not taken (how to detect?), cuts access to main etcd (how?), takes final snapshot, and exits
    • Otherwise, just exits
  • If destination detected
    • If a snapshot exists, restores it
    • Otherwise, waits with a timeout, then exits
  • Otherwise, takes a normal snapshot as before
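
The etcd-backup-restore migrate and normal-operation steps outlined above amount to a snapshot ordering rule and a small decision procedure. The following Python sketch is one possible reading of the outline, not the etcd-backup-restore implementation. The `Role` and `Action` enums, the `decide` and `migration_order` functions, and their parameters are all hypothetical names. The open questions from the outline (how to detect the final snapshot, how to cut access to the main etcd) are left as inputs rather than answered, and the choice of the newest post-header snapshot as the final one is an assumption.

```python
from enum import Enum
from typing import List, Optional


class Role(Enum):
    """Role recorded in a bucket's header file (hypothetical encoding)."""
    SOURCE = "source"
    DESTINATION = "destination"


class Action(Enum):
    TAKE_FINAL_SNAPSHOT_AND_EXIT = "take-final-snapshot-and-exit"
    EXIT = "exit"
    RESTORE_SNAPSHOT = "restore-snapshot"
    WAIT_THEN_EXIT = "wait-then-exit"
    TAKE_NORMAL_SNAPSHOT = "take-normal-snapshot"


def decide(role: Optional[Role], final_snapshot_taken: bool = False,
           snapshot_exists: bool = False) -> Action:
    """Map the header-file state to the action from the outline.

    role is None when no migration header is present in the bucket.
    """
    if role is Role.SOURCE:
        # Source side: take the final snapshot once, then stop.
        if not final_snapshot_taken:
            return Action.TAKE_FINAL_SNAPSHOT_AND_EXIT
        return Action.EXIT
    if role is Role.DESTINATION:
        # Destination side: restore once a migrated snapshot exists.
        if snapshot_exists:
            return Action.RESTORE_SNAPSHOT
        return Action.WAIT_THEN_EXIT
    # No header found: regular backup behaviour.
    return Action.TAKE_NORMAL_SNAPSHOT


def migration_order(snapshot_times: List[float], header_time: float) -> List[float]:
    """Order snapshots for copying: final snapshot first, the rest after it.

    Assumption: the final snapshot is the newest one created after the
    header timestamp, and the remaining snapshots are copied oldest first.
    Raises ValueError if no snapshot is newer than the header yet; the
    outline would keep waiting until its timeout in that case.
    """
    newer = [t for t in snapshot_times if t > header_time]
    if not newer:
        raise ValueError("no snapshot newer than the header timestamp")
    final = max(newer)
    rest = sorted(t for t in snapshot_times if t != final)
    return [final] + rest
```

Keeping the decision a pure function of the header state makes the unresolved parts explicit: whatever mechanism ends up detecting the final snapshot only has to supply the `final_snapshot_taken` flag.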
stoyanr added the kind/enhancement label Apr 13, 2021
gardener-robot added the area/control-plane-migration and priority/3 labels Apr 13, 2021
Contributor Author

stoyanr commented Apr 13, 2021

/assign @stoyanr @plkokanov @kris94
/cc @rfranzke @mandelsoft @amshuman-kr This is a PoC / experimentation; if you would rather have us try things differently, please comment. In any case, we will share and discuss the results with you in order to agree on a final approach.

Member

vlerenc commented Apr 13, 2021

@stoyanr I think I understand 2, 4, and 5, but can you elaborate a bit more on 1 and 3?

Contributor Author

stoyanr commented Apr 13, 2021

> I think I understand 2, 4, and 5, but can you elaborate a bit more on 1 and 3?

1 is simply the changes to the overall restoration flow (executed by gardenlet) to bind everything together and actually perform a successful "restore" reconciliation, provided that all other changes are in place.
3 is needed to ensure that if the BE controller has to reconcile 2 different BEs and therefore create 2 etcd-backup secrets for the same shoot, they are created with different names; currently the name is hardcoded to etcd-backup.
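
One way to avoid the secret name collision described above is to derive the secret name from the BE instead of using a fixed string. The sketch below is only an illustration: the `backup_secret_name` helper, the hash-suffix scheme, and the example BE names are assumptions, not what the BE controller does.

```python
import hashlib

BASE_NAME = "etcd-backup"  # the currently hardcoded secret name


def backup_secret_name(backup_entry_name: str) -> str:
    """Return a secret name that is stable per BE (hypothetical scheme)."""
    digest = hashlib.sha256(backup_entry_name.encode("utf-8")).hexdigest()
    # A short hash suffix keeps the name compact and deterministic.
    return f"{BASE_NAME}-{digest[:8]}"


# Hypothetical BE names for the source and destination seeds of one shoot.
source_secret = backup_secret_name("shoot--foo--bar--source")
dest_secret = backup_secret_name("shoot--foo--bar--dest")
print(source_secret, dest_secret)
```

Because the suffix depends only on the BE name, repeated reconciliations of the same BE produce the same secret name, while the two BEs of a migrating shoot get different ones.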

I apologise for the lack of in-depth technical explanations, but this issue is only intended to be a starting point for us to work on the PoC. In the above list, we are likely still missing some things, and perhaps also have a few that are not really needed. We'll be figuring this out as we go.

Contributor Author

stoyanr commented Oct 16, 2021

Closing this as the PoC is considered finished and GEP-17 is now approved.

stoyanr closed this as completed Oct 16, 2021
5 participants