[GEP-17] Add GEP-17 with an updated Control Plane Migration "Bad Case" Scenario description #4107

Merged (5 commits) on Sep 9, 2021
1 change: 1 addition & 0 deletions docs/README.md
@@ -72,6 +72,7 @@
* [GEP-14: Reversed Cluster VPN](proposals/14-reversed-cluster-vpn.md)
* [GEP-15: Manage Bastions and SSH Key Pair Rotation](proposals/15-manage-bastions-and-ssh-key-pair-rotation.md)
* [GEP-16: Dynamic kubeconfig generation for Shoot clusters](proposals/16-adminkubeconfig-subresource.md)
* [GEP-17: Shoot Control Plane Migration "Bad Case" Scenario](proposals/17-shoot-control-plane-migration-bad-case.md)

## Development

42 changes: 0 additions & 42 deletions docs/proposals/07-shoot-control-plane-migration.md
@@ -200,45 +200,3 @@ The ShootState synchronization controller will become part of the gardenlet. It
1. The gardenlet in the __Destination Seed__ fetches the state of extension resources from the `ShootState` resource in the garden cluster.
1. Normal reconciliation flow is resumed in the __Destination Seed__. Extension resources are annotated with `gardener.cloud/operation=restore` to instruct the extension controllers to reconstruct their state.
1. The Shoot's namespace in __Source Seed__ is deleted.


### Leader Election and Control Plane Termination

During migration, a "split brain" scenario must be avoided. This means that a Shoot's control plane in the __Source Seed__ must be scaled down before it is scaled up in the __Destination Seed__.

**Note**: This section is still under discussion. The plan is to first implement the `ShootState` and to modify the reconciliation flow and the extensions accordingly. Additionally, multiple scenarios need to be considered, depending on the reachability of the Garden cluster from the __Source Seed__ components, of the __Source Seed's__ API server from the Garden cluster, and of the __Source Seed's__ API server from the controllers running on the seed. **The initial implementation will only cover the case where everything is running**.

Extension controllers do not need leader election functionality because they only reconcile extension resources when the reconcile operation annotation is present on the resource. Since this annotation is set only during a reconciliation triggered by the Gardener Controller Manager, it cannot happen during migration.
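
This gate could look roughly like the following Go sketch (the annotation key and the `restore` value are taken from this proposal; the helper itself is illustrative, not the actual Gardener extensions library):

```go
package extension

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// shouldReconcile implements the annotation gate: an extension controller
// stays passive unless an operation was explicitly requested, so a stale
// controller in another seed cannot act during migration.
func shouldReconcile(obj metav1.Object) bool {
	op := obj.GetAnnotations()["gardener.cloud/operation"]
	return op == "reconcile" || op == "restore"
}
```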

For other controllers in the control plane (e.g. MCM, etcd-backup-restore, kube-apiserver), leader election has to be implemented. We plan to introduce the gardenlet soon, so the garden cluster is expected to be reachable from all seeds.
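
A minimal sketch of what such leader election could look like with client-go, assuming a `Lease` lock in the garden cluster (the lock name, namespace, and timings below are illustrative):

```go
package leader

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection runs the given controller loop only while this
// instance holds a Lease in the garden cluster, so at most one seed's
// copy of the component is active at any time.
func runWithLeaderElection(ctx context.Context, gardenClient kubernetes.Interface, identity string, run func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "shoot-controlplane-leader", Namespace: "garden"},
		Client:     gardenClient.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   60 * time.Second,
		RenewDeadline:   45 * time.Second,
		RetryPeriod:     15 * time.Second,
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				// Leadership moved to another seed; the component must stop.
			},
		},
	})
}
```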

Another flag might be needed to tell the gardenlet in the __Destination Seed__ that the control plane in the __Source Seed__ has been scaled down and the Shoot reconciliation can begin.

Horizontal Pod Autoscalers also need to be considered and removed, so that they do not scale the control plane back up after it has been scaled down.

#### Garden cluster and __Source Seed__ are healthy and there are no network problems

If both the Garden cluster and the __Source Seed__ cluster are healthy, the Gardener Controller Manager (or the gardenlet, after checking `spec.seedName`) can directly scale down the Shoot's control plane as part of the migration flow.
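
Sketched with client-go, the scale-down step could look like this (the function and the decision to scale every deployment to zero are assumptions for illustration):

```go
package migration

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// scaleDownControlPlane scales all deployments in the Shoot's control plane
// namespace on the source seed to zero replicas before the destination seed
// takes over.
func scaleDownControlPlane(ctx context.Context, seedClient kubernetes.Interface, namespace string) error {
	deployments, err := seedClient.AppsV1().Deployments(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	zero := int32(0)
	for i := range deployments.Items {
		d := &deployments.Items[i]
		d.Spec.Replicas = &zero
		if _, err := seedClient.AppsV1().Deployments(namespace).Update(ctx, d, metav1.UpdateOptions{}); err != nil {
			return err
		}
	}
	// StatefulSets (e.g. etcd) would be scaled down analogously.
	return nil
}
```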

#### If components in the __Source Seed__ cannot reliably read who the leader is from the Garden cluster

So far, we have come up with two ideas to handle this case:

**DNS leader election:** A DNS TXT record with TTL=60s and value seed='__Source Seed__' is used. The record is created and maintained by the Gardener Controller Manager (by using the DNS Controller Manager and its DNSEntry resource). When a control plane migration is detected, the Gardener Controller Manager changes the value of the DNS record to seed='__Destination Seed__' and waits for 2*TTL + 1 = 121 seconds to ensure that the change has propagated to all controllers in the old seed. We rely on the fact that DNS is highly available (100% SLA for AWS Route53) and that the control plane components in the __Source Seed__ can see the changes.
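
On the Gardener Controller Manager side, the handover and propagation wait boil down to something like the following sketch (`updateDNSEntry` is a hypothetical helper standing in for an update of the DNSEntry resource):

```go
package migration

import (
	"context"
	"time"
)

const dnsTTL = 60 * time.Second // TTL of the leader TXT record

// updateDNSEntry is hypothetical: in practice this would update the DNSEntry
// custom resource maintained via the DNS Controller Manager.
func updateDNSEntry(ctx context.Context, value string) error { return nil }

// handOverLeadership points the leader record at the destination seed and
// waits 2*TTL + 1 = 121s so that every resolver cache in the source seed
// has expired the old value before the destination control plane scales up.
func handOverLeadership(ctx context.Context, destinationSeed string) error {
	if err := updateDNSEntry(ctx, "seed="+destinationSeed); err != nil {
		return err
	}
	select {
	case <-time.After(2*dnsTTL + time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}
```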

Control plane components have to be shut down even when there is no access to the __Source Seed's__ API server. To be able to do that, a daemonset is deployed in each Seed cluster. When the daemonset in the __Source Seed__ sees that it is no longer the leader (by checking the DNS record) and there is no connection to the __Source Seed's__ API server, the daemonset kills the Shoot's control plane pods by talking directly to the Kubelet's API. If the __Source Seed's__ API server comes back up, the gardenlet should take care of scaling down the deployments and statefulsets in the Shoot's control plane. This could be problematic if the gardenlet is in a crashloop backoff or takes too long to do the scaling.

As an alternative to the daemonset, a sidecar container can be added to each control plane component. The sidecar checks the DNS record to see if it is still the leader. If it is not, it shuts down the entire pod. This way we do not run the risk of deployments and statefulsets recreating the control plane pods after the seed's API server comes back up.
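
A minimal sketch of such a sidecar, assuming a hypothetical record name and that terminating the sidecar takes the whole pod down (e.g. via a shared process namespace or an appropriate restart policy):

```go
package main

import (
	"net"
	"os"
	"time"
)

const (
	leaderRecord = "leader.example.internal.gardener.cloud" // hypothetical record name
	ownSeed      = "seed=source-seed"                       // injected at deployment time
)

func main() {
	for range time.Tick(30 * time.Second) {
		txts, err := net.LookupTXT(leaderRecord)
		if err != nil || len(txts) == 0 {
			continue // transient DNS failure: keep running rather than flap
		}
		if txts[0] != ownSeed {
			// Leadership moved to another seed: terminate the pod.
			os.Exit(1)
		}
	}
}
```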

The main problem with using DNS for leader election is caching. Additionally, not all DNS servers respect TTL settings.

**Using timestamps in the ETCD backup entries:** Once a Shoot is successfully created on the __Source Seed__, a timestamp is saved in the Cluster resource and/or in the ETCD backup-restore sidecar (either as an environment variable or as additional configuration). The timestamp must not be modified afterwards and is used whenever the backup-restore container writes data to the Shoot's backup entry, in the following way (see the sketch after this list):
- If there is no timestamp in the backup entry, the current timestamp is uploaded.
- If there is a timestamp in the backup entry, it is compared to the backup-restore container's timestamp:
  1. If it is the same, nothing is done.
  2. If it is older, it is replaced with the timestamp of the current backup-restore container.
  3. If it is newer, the current backup-restore container does not have ownership of the backup entry.
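
In Go, the comparison could be expressed roughly as follows (names are illustrative, not the actual etcd-backup-restore code):

```go
package backup

import "time"

// ownsBackupEntry decides, from the local timestamp and the one stored in
// the backup entry, whether this backup-restore container still owns the
// entry and whether it should (re)upload its own timestamp.
func ownsBackupEntry(local, remote time.Time) (owns, uploadLocal bool) {
	switch {
	case remote.IsZero(): // no timestamp in the backup entry yet
		return true, true
	case remote.Equal(local): // case 1: same timestamp, nothing to do
		return true, false
	case remote.Before(local): // case 2: older, replace with ours
		return true, true
	default: // case 3: newer, another seed took over the entry
		return false, false
	}
}
```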

When case 3 happens, it means that the Shoot has been migrated and the backup-restore container in the __Destination Seed__ has started using the Shoot's backup entry. The backup-restore container on the __Source Seed__ should be configured to shut itself and the etcd container down once it sees the newer timestamp. Shutting down etcd will cause the Kubernetes control plane components to go into crashloop backoff, and the MCM will not be able to do anything, as it will not be able to list nodes (this has to be verified with MCM).

For this approach to work, backups must be enabled for the Shoot that is being migrated. Additionally, synchronization based on the timestamps in the backup entry depends on the frequency of backups made by the backup-restore container.