Change pod NotReady/Unreachable tolerations from 300s to something much smaller, e.g. 60s #7689

Closed
Tracked by #6529
vlerenc opened this issue Mar 22, 2023 · 5 comments · Fixed by #7861

@vlerenc
Member

vlerenc commented Mar 22, 2023

What would you like to be added:

The KAPI's --default-not-ready-toleration-seconds and --default-unreachable-toleration-seconds options define how long pods are tolerated on nodes whose Ready status condition is either False (kubelet not ready) or Unknown (node status unknown, a.k.a. unreachable) before they are evicted. These defaults can also be overridden individually per pod.

We do not make use of these options, so the default of 300s applies, but we should probably set them to something closer to 0s:

  • Generally for seeds (KAPI setting or for all pods) and also...
  • For our shoot add-ons (for all our pods in kube-system)

Why is this needed:

We saw during "zone outage simulations" that recovery happens very slowly. It takes 5m for KCM to evict pods that are sitting on dead nodes.

@vlerenc vlerenc changed the title Change pod NotReady/Unreachable tolerations from 300s to 0s Change pod NotReady/Unreachable tolerations from 300s to something closer to 0s Mar 23, 2023
@vlerenc vlerenc changed the title Change pod NotReady/Unreachable tolerations from 300s to something closer to 0s Change pod NotReady/Unreachable tolerations from 300s to something much smaller, e.g. 60s Apr 3, 2023
@Sallyan
Contributor

Sallyan commented Apr 17, 2023

- For seeds (KAPI setting)
For example, create the seed cluster with the following configuration:

  kubernetes:
    kubeAPIServer:
      defaultNotReadyTolerationSeconds: 60
      defaultUnreachableTolerationSeconds: 60

Pods created afterwards will then automatically get the tolerations node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=60.
This takes effect for the extensions, the garden components (pods in the garden namespace), and the shoot namespaces.

- For seeds (selected components) and shoot add-ons (pods in kube-system)
We could probably create a mutating admission webhook that adds the tolerations to the pods.
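
To make the defaulting from the first item concrete, here is a minimal sketch in Go. It is not the kube-apiserver's or Gardener's actual code: the Toleration type is a simplified stand-in for corev1.Toleration, the helper names withDefaultTolerations and hasKey are hypothetical, and the matching is reduced to a plain key comparison. It only illustrates the idea that a toleration is added for each of the two keys when the pod does not already carry one.

```go
package main

import "fmt"

// Toleration is a simplified stand-in for corev1.Toleration (assumption:
// only the fields relevant to this sketch are modelled).
type Toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64
}

const (
	notReadyKey    = "node.kubernetes.io/not-ready"
	unreachableKey = "node.kubernetes.io/unreachable"
)

// hasKey reports whether any toleration in ts uses the given key.
func hasKey(ts []Toleration, key string) bool {
	for _, t := range ts {
		if t.Key == key {
			return true
		}
	}
	return false
}

// withDefaultTolerations returns a copy of in with a NoExecute toleration
// appended for each of the two keys that is not tolerated yet.
// Tolerations that already exist for those keys are left untouched.
func withDefaultTolerations(in []Toleration, seconds int64) []Toleration {
	out := append([]Toleration(nil), in...)
	for _, key := range []string{notReadyKey, unreachableKey} {
		if hasKey(out, key) {
			continue
		}
		s := seconds // fresh variable so each toleration gets its own pointer
		out = append(out, Toleration{
			Key:               key,
			Operator:          "Exists",
			Effect:            "NoExecute",
			TolerationSeconds: &s,
		})
	}
	return out
}

func main() {
	// A pod without any tolerations gets both defaults with 60s.
	for _, t := range withDefaultTolerations(nil, 60) {
		fmt.Printf("%s tolerationSeconds=%d\n", t.Key, *t.TolerationSeconds)
	}
}
```

A pod that already sets its own toleration for one of the keys keeps that value in this sketch, which matches the "can be overridden per pod" behaviour described in the issue text.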

@vlerenc WDYT?

@vlerenc
Member Author

vlerenc commented Apr 17, 2023

@Sallyan Yes, I was thinking about the web hook we already have for HA: https://github.com/gardener/gardener/blob/master/docs/development/high-availability.md#convenient-application-of-these-rules

So, maybe we want to make both toleration settings configurable in the ManagedSeed for seeds (or hardcode it to 60s at KAPI level), but we need a specific solution anyway for the shoot add-ons (as we cannot and should not force our end users to different KAPI toleration settings). If that's the case, maybe we want the web hook to do it in all cases? 🤷‍♂️

@Sallyan
Contributor

Sallyan commented Apr 19, 2023

Agreed. We should keep customer autonomy in mind: customers should decide the pod toleration settings for their own workloads.
It is also easy to configure the KAPI default toleration time in the shoot manifest, just by adding the lines below to the shoot YAML.

  kubernetes:
    kubeAPIServer:
      defaultNotReadyTolerationSeconds: 60
      defaultUnreachableTolerationSeconds: 60

Personally, I would prefer one webhook on seeds that updates the tolerations of all pods and another webhook on shoots that only updates the tolerations of pods in the kube-system namespace.
We already have many nice webhooks in GRM (Gardener Resource Manager) [link]
Two interesting ones:

  • systemcomponentsconfig: sets spec.nodeSelector and spec.tolerations on system component pods.
    It adds the following field:
    "worker.gardener.cloud/system-components": "true"
  • highavailabilityconfig: sets the fields .spec.replicas, .spec.template.spec.affinity, and .spec.template.spec.topologySpreadConstraints for HA, based on the failure tolerance type and the component type.

Maybe we can extend the systemcomponentsconfig webhook to update tolerationSeconds for those entries in the pod's .spec.tolerations whose key is node.kubernetes.io/not-ready or node.kubernetes.io/unreachable:

[
  {
    "effect": "NoExecute",
    "key": "node.kubernetes.io/not-ready",
    "operator": "Exists",
    "tolerationSeconds": 300
  },
  {
    "effect": "NoExecute",
    "key": "node.kubernetes.io/unreachable",
    "operator": "Exists",
    "tolerationSeconds": 300
  }
]
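
The update to the two entries above could look roughly like the following sketch. It is only an illustration under assumptions: Toleration is a simplified stand-in for corev1.Toleration, setTolerationSeconds is a hypothetical helper and not part of the existing systemcomponentsconfig webhook, and 60 is just the value discussed in this issue.

```go
package main

import "fmt"

// Toleration is a simplified stand-in for corev1.Toleration (assumption:
// only the fields relevant to this sketch are modelled).
type Toleration struct {
	Key               string
	Operator          string
	Effect            string
	TolerationSeconds *int64
}

// setTolerationSeconds overwrites tolerationSeconds in place on every
// NoExecute toleration for the not-ready and unreachable keys and returns
// how many entries were changed. All other tolerations are left as they are.
func setTolerationSeconds(ts []Toleration, seconds int64) int {
	changed := 0
	for i := range ts {
		if ts[i].Effect != "NoExecute" {
			continue
		}
		switch ts[i].Key {
		case "node.kubernetes.io/not-ready", "node.kubernetes.io/unreachable":
			s := seconds // fresh variable so each entry gets its own pointer
			ts[i].TolerationSeconds = &s
			changed++
		}
	}
	return changed
}

func main() {
	old := int64(300)
	tolerations := []Toleration{
		{Key: "node.kubernetes.io/not-ready", Operator: "Exists", Effect: "NoExecute", TolerationSeconds: &old},
		{Key: "node.kubernetes.io/unreachable", Operator: "Exists", Effect: "NoExecute", TolerationSeconds: &old},
	}
	n := setTolerationSeconds(tolerations, 60)
	fmt.Println("updated entries:", n)
	for _, t := range tolerations {
		fmt.Printf("%s tolerationSeconds=%d\n", t.Key, *t.TolerationSeconds)
	}
}
```

Note that this sketch overwrites whatever value is present, so it cannot tell an API server default from a value a pod author set on purpose. Whether that is acceptable for pods in kube-system is a design decision for the webhook.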

For seeds, we could probably write a new webhook that updates the tolerations of all pods.

func (h *Handler) Default(_ context.Context, obj runtime.Object) error {
	pod, ok := obj.(*corev1.Pod)
	if !ok {
		return fmt.Errorf("expected *corev1.Pod but got %T", obj)
	}

	// Add a new default toleration to the pod.
	// Note: new(int64) points to 0, i.e. a toleration of 0 seconds;
	// point to the desired value (e.g. 60) instead if that is not intended.
	pod.Spec.Tolerations = append(pod.Spec.Tolerations, corev1.Toleration{
		Key:               "node.kubernetes.io/not-ready",
		Operator:          corev1.TolerationOpExists,
		Effect:            corev1.TaintEffectNoExecute,
		TolerationSeconds: new(int64),
	})

	return nil
}
....

@rfranzke
Member

Why would we write a new webhook? @vlerenc already suggested to use the existing HA webhook to specify these settings, and I agree that this makes sense. It is active in both seeds and shoots, so it seems a good fit.

@timuthy
Contributor

timuthy commented Apr 27, 2023

/assign
