
☂️ [GEP-20] Highly Available Seed and Shoot Clusters #6529

Closed
56 tasks done
shreyas-s-rao opened this issue Aug 18, 2022 · 14 comments
Assignees
Labels
area/high-availability High availability related kind/enhancement Enhancement, improvement, extension

Comments


shreyas-s-rao commented Aug 18, 2022

How to categorize this issue?

/area high-availability
/kind enhancement

What would you like to be added:

This is an umbrella issue to track the implementation of GEP-20 Highly Available Shoot Control Planes.

Tasks

@shreyas-s-rao (Contributor, Author)

/assign
/assign @timuthy


ashwani2k commented Aug 22, 2022

  • Introduce shoot spec field for enabling HA control planes
  • Add validations for updating the shoot.spec.controlPlanes field
    • Allow non-HA shoot -> HA shoot
    • Only allow non-HA -> multi-zone if assigned seed is multi-zonal
    • Single-zone HA shoot <-> multi-zone HA shoot must not be allowed
    • HA shoot -> non-HA shoot must not be allowed (until etcd scale-down is implemented)

This needs some modifications, along with a change required for the Seed.
The change needs to go into the GEP via a GEP enhancement PR; the implementation then follows and goes through review.

We reprioritised in discussion with @timuthy

  • Support zone-pinning for single-zone HA control planes (via GRM mutating webhook)
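As a rough illustration of the update-validation rules listed earlier in this comment, the transitions could be sketched as follows. This is not the actual Gardener code; the type and function names are simplified stand-ins:

```go
// Illustrative sketch of the shoot.spec.controlPlane update validation rules;
// types and names are assumptions, not the actual Gardener API.
package main

import "fmt"

type FailureToleranceType string

const (
	NonHA FailureToleranceType = ""     // no failureTolerance defined
	Node  FailureToleranceType = "node" // single-zone HA
	Zone  FailureToleranceType = "zone" // multi-zone HA
)

// validateToleranceUpdate checks whether changing a shoot's failure tolerance
// from oldTol to newTol is allowed, given whether the assigned seed is multi-zonal.
func validateToleranceUpdate(oldTol, newTol FailureToleranceType, seedMultiZonal bool) error {
	switch {
	case oldTol == newTol:
		return nil // no change
	case oldTol == NonHA && newTol == Zone && !seedMultiZonal:
		return fmt.Errorf("non-HA -> multi-zone requires a multi-zonal seed")
	case oldTol == NonHA:
		return nil // non-HA shoot -> HA shoot is allowed
	case newTol == NonHA:
		return fmt.Errorf("HA -> non-HA is not allowed until etcd scale-down is implemented")
	default:
		return fmt.Errorf("single-zone HA <-> multi-zone HA must not be allowed")
	}
}

func main() {
	fmt.Println(validateToleranceUpdate(NonHA, Node, false)) // allowed (nil error)
	fmt.Println(validateToleranceUpdate(Node, Zone, true))   // forbidden transition
}
```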


ashwani2k commented Sep 8, 2022

The bullet points below outline the API contract for introducing HA control planes via:

  controlPlane:
    highAvailability:
      failureTolerance:
        type:  <node|zone>
  1. non-HA shoots can be scheduled on non-HA or HA (multi-zone) seeds.
  2. single-zone shoots can be scheduled on non-HA or HA (multi-zone) seeds.
  3. multi-zone shoots can be scheduled only on HA (multi-zone) seeds.
  4. non-HA shoots can be upgraded to single-zone on non-HA or HA seeds. **
  5. non-HA shoots can be upgraded to multi-zone only on HA seeds. **
  6. single-zone shoots shall not be allowed to upgrade to multi-zone and shall be rejected by admission plugins.

** this can lead to a short disruption/downtime when the etcd StatefulSet is rolled


Legend:
  • non-HA shoot: any shoot with no failureTolerance defined.
  • single-zone shoot: any shoot with failureTolerance of type node.
  • multi-zone shoot: any shoot with failureTolerance of type zone.
  • non-HA seed: any seed whose worker pools for etcd/control-plane components run in only a single availability zone.
  • HA seed: any seed whose worker pools for etcd/control-plane components are spread across 3 availability zones and which has the label seed.gardener.cloud/multi-zonal: "true".
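Scheduling rules 1-3 above boil down to a single condition: only multi-zone shoots require a multi-zonal seed. A minimal sketch (not the actual gardener-scheduler code; names are illustrative):

```go
// Illustrative sketch of scheduling rules 1-3: non-HA and single-zone shoots
// fit on any seed, multi-zone shoots only on HA (multi-zone) seeds.
package main

import "fmt"

// canSchedule reports whether a shoot with the given failure tolerance
// ("", "node" or "zone") may be scheduled on a seed.
func canSchedule(failureTolerance string, seedMultiZonal bool) bool {
	if failureTolerance == "zone" {
		return seedMultiZonal // rule 3: multi-zone shoots only on HA (multi-zone) seeds
	}
	return true // rules 1 and 2: non-HA and single-zone shoots fit on any seed
}

func main() {
	fmt.Println(canSchedule("zone", false)) // false
	fmt.Println(canSchedule("node", false)) // true
}
```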


ialidzhikov commented Sep 21, 2022

Regarding Enhance Pod eviction in case of zone outage (delete Pods in Terminating state): for Deployments, the kube-controller-manager behaviour is to create new Pods right away while the old Pods are Terminating.

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: nginx
        image: centos
        command: ["/bin/sh"]
        args: ["-c", "sleep 3600"]
        ports:
        - containerPort: 80

The Deployment above runs a container that does not handle SIGTERM, so it hangs in Terminating until it is force-killed after terminationGracePeriodSeconds.

$ k get po
NAME                                READY   STATUS    RESTARTS   AGE
nginx-deployment-746759f465-z95lj   1/1     Running   0          11m

$ k delete po nginx-deployment-746759f465-z95lj
pod "nginx-deployment-746759f465-z95lj" deleted

$ k get po
NAME                                READY   STATUS              RESTARTS   AGE
nginx-deployment-746759f465-z95lj   1/1     Terminating         0          11m
nginx-deployment-746759f465-ztz65   0/1     ContainerCreating   0          2s

$ k get po
NAME                                READY   STATUS        RESTARTS   AGE
nginx-deployment-746759f465-z95lj   1/1     Terminating   0          12m
nginx-deployment-746759f465-ztz65   1/1     Running       0          54s

As shown above, when the old replica is deleted, the new one is created right away.


I suspect that in the experiments of @unmarshall (ref #6287 (comment)) either a webhook was preventing creation of new Pods (for some unknown reason) or kube-controller-manager was down (for some unknown reason). These are the two potential things that could explain #6287 (comment).

Anyway, I will try to simulate a zone outage and check why KCM does not create the new Pods while the old ones are terminating.


ialidzhikov commented Sep 22, 2022

We had a sync with @unmarshall and we are able to confirm that in a simulation of zone outage (simulated via network acl that denies all ingress and egress traffic for a zone) the recovery for a (multi-zone) control plane worked well as outlined in #6529 (comment):

  • For Deployments kube-controller-manager creates new replicas right away when the old replicas are terminating. The new replicas start successfully on a healthy zone.
  • I think during @unmarshall's simulations kube-controller-manager was down for some reason. I also reviewed the webhooks we deploy to check whether a deadlock situation could block new Pod creation, but I didn't see anything abnormal.

PS: We also found that the existing garbage-collector (shoot-care-controller of gardenlet) already deletes Terminating pods in the Shoot's control plane after 5min.

// performGarbageCollectionSeed performs garbage collection in the Shoot namespace in the Seed cluster.
func (g *GarbageCollection) performGarbageCollectionSeed(ctx context.Context) error {
	podList := &corev1.PodList{}
	if err := g.seedClient.List(ctx, podList, client.InNamespace(g.shoot.SeedNamespace)); err != nil {
		return err
	}
	return g.deleteStalePods(ctx, g.seedClient, podList)
}

But this is not a recovery mechanism and does not by itself lead to recovery. For Deployments, kube-controller-manager already creates the new replicas. For StatefulSets, even when the old Terminating replicas are forcefully deleted, this does not lead to recovery: the new StatefulSet Pods fail to be scheduled because they have scheduling requirements that cannot be satisfied during the zone outage (the etcd Pod must run in the outage zone, and the loki/prometheus Pods must run there because their volumes are already provisioned in that zone).
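Why force-deleting the Terminating StatefulSet Pods does not help can be seen from the volume's topology constraint: a zonal PersistentVolume carries a node-affinity term that pins any consuming Pod to the volume's zone. A minimal, illustrative manifest (name, driver, and zone are assumptions, not taken from a real cluster):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-etcd-main-0            # illustrative name
spec:
  capacity:
    storage: 10Gi
  accessModes: ["ReadWriteOnce"]
  csi:
    driver: example.csi.driver    # illustrative driver
    volumeHandle: vol-0123456789  # illustrative handle
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - europe-west1-a        # the outage zone; the scheduler cannot place the Pod elsewhere
```

As long as the volume stays pinned to the unavailable zone, a replacement Pod that mounts it remains Pending regardless of how the old Pod was removed.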

TL;DR: We will resolve the corresponding item as completed as nothing has to be done. Let us know if you have additional comments on this topic. We have to update GEP-20 with the new learnings.


vlerenc commented Sep 23, 2022

Great to hear! Thank you!


timuthy commented Sep 30, 2022

I added another item Support control-plane migration for HA shoots since this doesn't seem to work out of the box. We should create a separate issue once we have more certainty and details and find a proper way to support this use-case.

cc @plkokanov @vlerenc

@plkokanov (Contributor)

I added another item Support control-plane migration for HA shoots since this doesn't seem to work out of the box. We should create a separate issue once we have more certainty and details and find a proper way to support this use-case.

cc @plkokanov @vlerenc

Should we (for now) add validation that forbids migration for HA shoots?


timuthy commented Nov 22, 2022

/assign @plkokanov @ishan16696
for tasks related to

Support control-plane migration for HA shoots

@ishan16696 (Member)

/assign @plkokanov @ishan16696
for tasks related to

Please see the approaches possible to achieve CPM in multi-node etcd: gardener/etcd-druid#479 (comment)

@rfranzke (Member)

All tasks have been completed.
/close

@gardener-prow gardener-prow bot closed this as completed May 16, 2023

gardener-prow bot commented May 16, 2023

@rfranzke: Closing this issue.

In response to this:

All tasks have been completed.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
