Change pod NotReady/Unreachable tolerations from 300s to something much smaller, e.g. 60s #7689
Comments
- For seeds (KAPI setting)
Pods created afterwards will then have the tolerations added automatically.
- For seeds (selected components) and shoot add-ons (pods in
@vlerenc WDYT?
@Sallyan Yes, I was thinking about the webhook we already have for HA: https://github.com/gardener/gardener/blob/master/docs/development/high-availability.md#convenient-application-of-these-rules So, maybe we want to make both toleration settings configurable in the
Agreed. We should keep customer autonomy in mind: customers should decide the pod toleration settings for their own workloads.
Personally, I would prefer one webhook on seeds that updates the tolerations of all pods, and another webhook on shoots that only updates the tolerations of pods in the kube-system namespace.
Maybe we can extend the
For seeds, we would probably write a new webhook that updates the tolerations of all pods.
Why would we write a new webhook? @vlerenc already suggested using the existing HA webhook to specify these settings, and I agree that this makes sense. It is active in both seeds and shoots, so it seems a good fit.
/assign
What would you like to be added:
The KAPI's --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds options define how fast to evict pods from nodes whose Ready status condition is either Unknown (node status unknown, a.k.a. unreachable) or False (kubelet not ready). This can also be overridden individually per pod.

We do not make use of it, so the default of 300s applies, but we probably should set it to something closer to 0s:
- For seeds (KAPI setting)
- For seeds (selected components) and shoot add-ons (pods in kube-system)

Why is this needed:
We saw during "zone outage simulations" that recovery happens very slowly. It takes 5m for KCM to evict pods that are sitting on dead nodes.
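To illustrate the per-pod override discussed in this issue, here is a minimal Go sketch of how a mutating step could add NoExecute tolerations for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints while leaving any tolerations the pod owner already set untouched. The Toleration struct is a simplified stand-in for the Kubernetes core/v1 type, and the function names and the 60s value are assumptions for illustration only; this is not the Gardener HA webhook implementation.

```go
package main

import "fmt"

// Toleration is a simplified, hypothetical stand-in for the Kubernetes
// core/v1 Toleration type, so that this sketch has no external dependencies.
type Toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64
}

// Taint keys that the node lifecycle controller puts on nodes whose Ready
// condition is False (not-ready) or Unknown (unreachable).
const (
	taintNotReady    = "node.kubernetes.io/not-ready"
	taintUnreachable = "node.kubernetes.io/unreachable"
)

// hasNoExecuteToleration reports whether the list already tolerates the
// given taint key for the NoExecute effect. An empty effect matches all
// effects. Simplification: wildcard tolerations (empty key) are ignored.
func hasNoExecuteToleration(tolerations []Toleration, key string) bool {
	for _, t := range tolerations {
		if t.Key == key && (t.Effect == "NoExecute" || t.Effect == "") {
			return true
		}
	}
	return false
}

// addEvictionTolerations returns a copy of the given tolerations with
// not-ready and unreachable tolerations of the given duration appended,
// unless the pod already defines its own toleration for that key.
func addEvictionTolerations(tolerations []Toleration, seconds int64) []Toleration {
	out := append([]Toleration(nil), tolerations...)
	for _, key := range []string{taintNotReady, taintUnreachable} {
		if hasNoExecuteToleration(out, key) {
			continue // respect the setting chosen by the pod owner
		}
		s := seconds
		out = append(out, Toleration{
			Key:               key,
			Operator:          "Exists",
			Effect:            "NoExecute",
			TolerationSeconds: &s,
		})
	}
	return out
}

func main() {
	// A pod that already overrides the not-ready toleration with 10s.
	custom := int64(10)
	existing := []Toleration{{
		Key:               taintNotReady,
		Operator:          "Exists",
		Effect:            "NoExecute",
		TolerationSeconds: &custom,
	}}

	// 60 is the example value from the issue title, not a recommendation.
	for _, t := range addEvictionTolerations(existing, 60) {
		fmt.Printf("%s: %ds\n", t.Key, *t.TolerationSeconds)
	}
}
```

Skipping keys that are already tolerated keeps the override in the hands of the workload owner, which matches the customer autonomy concern raised in the comments.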