☂️ [GEP-20] Highly Available Seed and Shoot Clusters #6529
Comments
/assign
This needs some modifications along with a change required for Seed. We reprioritised in discussion with @timuthy
Below bullet points highlight the api contract for introducing HA control planes via -
** this can lead to a short disruption|downtime when Legend: |
Example: the Deployment used for the experiment.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  labels:
    app: nginx
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: nginx
        image: centos
        command: ["/bin/sh"]
        args: ["-c", "sleep 3600"]
        ports:
        - containerPort: 80
Above we have a Deployment. Its container does not handle SIGTERM and it will hang in Terminating until it is force killed after
Above you can see that when the old replica is deleted, the new one is created right away. I suspect that in the experiments of @unmarshall (ref #6287 (comment)) there was a webhook preventing creation of new Pods (for some unknown reason) or
Anyway, I will try to simulate a zone outage and check why KCM does not create the new Pods when the old ones are terminating.
We had a sync with @unmarshall and can confirm that in a simulated zone outage (simulated via a network ACL that denies all ingress and egress traffic for a zone) the recovery of a (multi-zone) control plane worked well, as outlined in #6529 (comment):
PS: We also found that the existing garbage-collector (shoot-care-controller of gardenlet) already deletes Terminating pods in the Shoot's control plane after
gardener/pkg/operation/care/garbage_collection.go, lines 85 to 93 in 24b667c
But this is not a recovery mechanism and does not by itself lead to recovery. For Deployments, kube-controller-manager already creates the new replicas. For StatefulSets, even when the old Terminating replicas are forcefully deleted, this does not lead to a recovery, because the new StatefulSet Pods fail to be scheduled: they have scheduling requirements that cannot be satisfied during the zone outage (the etcd Pod has to run in the outage zone, or the loki/prometheus Pods have to run in the outage zone because their volume is already provisioned in this zone). TL;DR: We will resolve the corresponding item as completed, as nothing has to be done. Let us know if you have additional comments on this topic. We have to update GEP-20 with the new learnings.
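To illustrate why a StatefulSet Pod can stay unschedulable during the outage, here is a minimal sketch of a zone-pinned PersistentVolume. The volume name, CSI driver, volume handle and zone name are hypothetical placeholders, not taken from the experiment; only the `nodeAffinity` structure and the `topology.kubernetes.io/zone` key are standard Kubernetes.

```yaml
# Hypothetical volume backing a StatefulSet replica (all names are illustrative).
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-prometheus-0
spec:
  capacity:
    storage: 10Gi
  accessModes:
  - ReadWriteOnce
  csi:
    driver: csi.example.com        # placeholder driver name
    volumeHandle: vol-0123456789   # placeholder volume ID
  # The volume can only be attached to nodes in zone-a. A Pod that claims it
  # cannot be scheduled elsewhere while zone-a is unavailable.
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: topology.kubernetes.io/zone
          operator: In
          values:
          - zone-a
```

With such a volume, force-deleting the Terminating Pod only produces a replacement Pod that stays Pending until nodes in the affected zone are available again.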
Great to hear! Thank you!
I added another item
Should we (for now) add validation that forbids migration for HA shoots?
/assign @plkokanov @ishan16696
Please see the possible approaches to achieve CPM in multi-node etcd: gardener/etcd-druid#479 (comment)
All tasks have been completed.
@rfranzke: Closing this issue.
How to categorize this issue?
/area high-availability
/kind enhancement
What would you like to be added:
This is an umbrella issue to track the implementation of GEP-20 Highly Available Shoot Control Planes.
Tasks
shoot.spec.controlPlanes field
.spec.provider.zones in Seed, deprecate .spec.highAvailability and drop seed.gardener.cloud/multi-zonal label #6914
.spec.highAvailability from Seed API #6960
Enhance Pod eviction in case of zone outage (delete Pods in Terminating state), see [GEP-20] Make shoot control plane components HA #6646 (comment) (@ialidzhikov) -> see ☂️ [GEP-20] Highly Available Seed and Shoot Clusters #6529 (comment)
gardener-resource-manager #6665
gardenlet #6750
gardener-resource-manager #6685
high-availability-config webhook in gardener-resource-manager #6967
vpn-seed-server and vpn-shoot, [GEP-20] High Availability for reversed VPN connection #6890
istio-ingressgateway
HighAvailabilityConfig webhook to handle HPA and HVPA objects #7105
v1.31.0
released with v? (no need to wait for it)
v1.23.0
v1.41.0
v?
v1.33.0
v1.43.0
v1.20.0
v1.28.0
v0.9.0
v0.9.0
v1.20.0
v0.16.0
v1.20.0
v1.16.0
v1.28.0
v1.27.0
v0.15.0
v0.7.0
v0.1.0
v0.5.0
auditlog-extension: kubernetes/auditlog-extension#388, released with v?
alpha.control-plane.shoot.gardener.cloud/high-availability annotation
--node-monitor-grace-period to 40s #7688
NotReady/Unreachable tolerations from 300s to something much smaller, e.g. 60s #7689
minDomains to number of zones #7690
etcd-main for HA shoots #7626
Shoots #7742
HAControlPlanes feature gate to beta #7867
HAControlPlanes feature gate to GA: Maintain feature gates #8008
HAControlPlanes and FullNetworkPoliciesInRuntimeCluster feature gates #8083
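As a rough illustration of two of the tasks above (shorter NotReady/Unreachable tolerations and minDomains set to the number of zones), a Pod template could look like the sketch below. The Deployment name, labels, image, the replica count and the assumption of three zones are hypothetical; the actual values applied by Gardener are defined in the linked PRs, not here.

```yaml
# Hypothetical control plane component, assuming a seed with three zones.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-component
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-component
  template:
    metadata:
      labels:
        app: example-component
    spec:
      # Evict from NotReady/Unreachable nodes after 60s instead of the default 300s.
      tolerations:
      - key: node.kubernetes.io/not-ready
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
      - key: node.kubernetes.io/unreachable
        operator: Exists
        effect: NoExecute
        tolerationSeconds: 60
      # Spread replicas across zones; minDomains matches the assumed zone count
      # and is only honored together with whenUnsatisfiable: DoNotSchedule.
      topologySpreadConstraints:
      - maxSkew: 1
        minDomains: 3
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels:
            app: example-component
      containers:
      - name: example
        image: registry.k8s.io/pause:3.9
```

Shorter tolerations let Pods on nodes in a failed zone be evicted sooner, while the spread constraint keeps replicas distributed so that one zone outage does not take down all of them.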