
support drain timeout #743

Closed
grosser opened this issue Sep 8, 2022 · 34 comments · Fixed by #916
Labels
deprovisioning (Issues related to node deprovisioning) · kind/feature (Categorizes issue or PR as related to a new feature) · v1.x (Issues prioritized for post-1.0)

Comments

@grosser

grosser commented Sep 8, 2022

Tell us about your request
Sometimes nodes get stuck when a pod cannot be drained, so we have to alert on it and then manually kill the node, which is not ideal.

It would be nice to set an optional drainTimeout so we can ensure nodes always die after x hours/days.

Are you currently working around this issue?
Helper pod that kills pods on stuck nodes so they finish draining.
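For illustration, a minimal sketch of what such a helper can look like with client-go; the node name, kubeconfig handling, and zero grace period here are assumptions, not the actual implementation:

```go
// Hypothetical helper: force-delete any pods still running on a node that is
// stuck draining, so the drain can finish. Illustrative only.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig; a real helper pod would use in-cluster config.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	nodeName := "stuck-node" // hypothetical node that refuses to finish draining
	force := int64(0)        // grace period 0 = force delete, skipping graceful termination

	// List all pods still scheduled on the stuck node.
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Force-delete every remaining pod so the drain can complete.
		err := client.CoreV1().Pods(p.Namespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &force,
		})
		fmt.Printf("force-deleted %s/%s: %v\n", p.Namespace, p.Name, err)
	}
}
```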

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@grosser added the kind/feature label Sep 8, 2022
@spring1843
Contributor

What should happen after the drainTimeout has elapsed? A pod not terminating may be by design, e.g. it's waiting for an IO operation with a long timeout.

@grosser
Author

grosser commented Sep 12, 2022

It should force-delete the pod.
The flag will definitely need a "this has xyz downsides" warning,
but dealing with empty nodes hanging around is also an overhead, and especially in staging clusters we want to make sure cost stays under control when things go wrong.

@jonathan-innis
Member

Related to aws/karpenter-provider-aws#2391

@ellistarn
Contributor

@grosser, can you elaborate a bit on what can cause the node to get stuck? Do you have do-not-evict pods, or broken PDBs? We're currently scoping some work in this area and looking to see what makes sense here.

@grosser
Author

grosser commented Jun 4, 2023

Broken PDBs.
We as cluster admins need to roll the clusters, but some other teams left their apps in a bad state.
In production that would mean we have to reach out to them or at least check that their app is not critical, but in staging we'd like to just ignore it.

@ellistarn
Contributor

I wonder if we should just hardcode something like 10 minutes if eviction is returning a 500. Do you need to configure this value?

@grosser
Author

grosser commented Jun 4, 2023

I don't think it's good to hardcode any timeout since by default the system should be safe for the workloads and fail the cluster rollout instead.
The opt-in could be either a timeout or true/false ... but I think timeout makes the most sense.

@ellistarn
Contributor

What would you set it to?

@grosser
Author

grosser commented Jun 5, 2023

nothing in production and 10-15m in staging

@ellistarn
Contributor

The other wrinkle here is that we won't attempt (iirc, or perhaps it's just deprioritized) to deprovision a node that has a blocked PDB, so we'll need handling for both disruption and termination flows.

@ellistarn
Contributor

Would you want to treat a 500 (misconfigured) differently than a 429 (pdb unfulfilled)?

@grosser
Author

grosser commented Jun 5, 2023

I'd treat them the same, either way:
production: the owning team needs to check it out before we can proceed
staging: go fix your app 🤷

@ellistarn
Contributor

Great feedback, thank you!

@sftim

sftim commented Jun 5, 2023

Related / extra credit: manage machine drains through a custom resource, similar to how the node maintenance operator does it. This could include reporting progress via .status

@njtran
Contributor

njtran commented Aug 14, 2023

How does https://kubernetes.io/blog/2023/01/06/unhealthy-pod-eviction-policy-for-pdbs/ fit into your use-case? Does it solve it? It's behind a Beta feature gate in 1.27.

@njtran added the v1.x label Aug 14, 2023
@grosser
Author

grosser commented Aug 15, 2023

It solves it for completely unhealthy applications and should not have a big impact on regular operations.
It does not, however, solve the situation where only part of a PDB is unhealthy (for example, new pods are unhealthy and old pods are healthy).
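For reference, the policy linked above is set via spec.unhealthyPodEvictionPolicy on the PDB. A minimal sketch using the k8s.io/api policy/v1 Go types (the name and selector are made up); as noted, AlwaysAllow only helps when the blocking pods are themselves not Ready:

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	minAvailable := intstr.FromInt(1)
	// AlwaysAllow lets the eviction API evict pods that are Running but not Ready,
	// even when the budget is otherwise violated.
	policy := policyv1.AlwaysAllow
	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable:               &minAvailable,
			Selector:                   &metav1.LabelSelector{MatchLabels: map[string]string{"app": "example"}},
			UnhealthyPodEvictionPolicy: &policy,
		},
	}
	fmt.Println(pdb.Name, *pdb.Spec.UnhealthyPodEvictionPolicy)
}
```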

@mallow111

@jonathan-innis has anyone started working on this issue yet?

@garvinp-stripe
Contributor

Is it possible to expose and implement a timeout in the finalizer? If nothing is set (the default), it never times out. If a timeout is set, it should be easy enough for the finalizer to annotate the machine/node with the time the drain started, and force-kill the node/machine once now - annotation > timeout.
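A minimal sketch of that check; the drain-started annotation key and the timeout plumbing are hypothetical, not an existing Karpenter API:

```go
package main

import (
	"fmt"
	"time"
)

const drainStartedAnnotation = "karpenter.sh/drain-started" // hypothetical annotation key

// drainDeadlineExceeded reports whether a drain that started at the annotated
// time has run longer than the configured timeout. A missing annotation or a
// zero timeout means "never time out", matching the proposed default.
func drainDeadlineExceeded(annotations map[string]string, timeout time.Duration, now time.Time) bool {
	if timeout == 0 {
		return false
	}
	started, ok := annotations[drainStartedAnnotation]
	if !ok {
		return false
	}
	t, err := time.Parse(time.RFC3339, started)
	if err != nil {
		return false
	}
	return now.Sub(t) > timeout
}

func main() {
	annotations := map[string]string{
		drainStartedAnnotation: time.Now().Add(-2 * time.Hour).Format(time.RFC3339),
	}
	// With a 1h timeout, the finalizer would switch from graceful eviction to force deletion.
	fmt.Println(drainDeadlineExceeded(annotations, time.Hour, time.Now()))
}
```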

@sftim

sftim commented Sep 25, 2023

I found out recently that graceful deletion for custom resources (as in CRDs) doesn't really seem to be a thing.
If that's right, then we'd need an improvement to Kubernetes itself, and then we could indeed gracefully delete the Machine, using the future-dated removal to signal the drain timeout.

@garvinp-stripe
Contributor

Can you provide some details? I think you are correct, but shouldn't a finalizer on the CRD (Machine) work equally well? Trying to understand what specific API you need from K8s.

@sftim

sftim commented Sep 26, 2023

Graceful deletion has an associated time; finalizers don't have any time stored in the API.

@garvinp-stripe
Contributor

Ah ya that's unfortunate

@garvinp-stripe
Contributor

I think upstreaming that change would take a while. I wonder if there is a short- to mid-term solution? We might just fork Karpenter core and add the ability to the finalizer: pick up the timeout from the config map, plus some kind of annotation on the machine marking the time of the deletion attempt.

@njtran
Contributor

njtran commented Oct 6, 2023

We had talked about including a timeout here that would be inferred from the NodePool or node itself to control the drain/eviction logic and forcefully evict once a drain/termination has reached the timeout. We'd essentially need to mock the graceful deletion ourselves, as I don't think there's any graceful deletion for nodes. We had a PR #466 that explored this, but haven't had the time to get to it.

@njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023
This was referenced Nov 16, 2023
@wmgroot
Contributor

wmgroot commented Nov 22, 2023

Just for reference, this is the field that supports this behavior in ClusterAPI.

$ kubectl explain machinedeployment.spec.template.spec.nodeDrainTimeout
KIND:     MachineDeployment
VERSION:  cluster.x-k8s.io/v1beta1

FIELD:    nodeDrainTimeout <string>

DESCRIPTION:
     NodeDrainTimeout is the total amount of time that the controller will spend
     on draining a node. The default value is 0, meaning that the node can be
     drained without any time limitations. NOTE: NodeDrainTimeout is different
     from `kubectl drain --timeout`

In our case, we set this value to 24h in all clusters, production or otherwise. We make it clear to our users that they must tolerate graceful disruption of their pods, and to alert themselves if their pods are in an unhealthy state for a long period of time in production. This allows us to ensure we can keep up to date with security patches across all of our clusters.

We've discussed lowering the value from 24h to something much shorter once we have a better story for automated traffic failover between our clusters.
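For comparison, a rough sketch of setting that 24h drain timeout with the Cluster API Go types (sigs.k8s.io/cluster-api/api/v1beta1); the object names are illustrative and this is not a Karpenter API:

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func main() {
	md := clusterv1.MachineDeployment{
		ObjectMeta: metav1.ObjectMeta{Name: "example-workers", Namespace: "default"},
		Spec: clusterv1.MachineDeploymentSpec{
			ClusterName: "example",
			Template: clusterv1.MachineTemplateSpec{
				Spec: clusterv1.MachineSpec{
					ClusterName: "example",
					// After 24h the Cluster API controller stops waiting for the drain
					// and proceeds with deleting the machine.
					NodeDrainTimeout: &metav1.Duration{Duration: 24 * time.Hour},
				},
			},
		},
	}
	fmt.Println(md.Name, md.Spec.Template.Spec.NodeDrainTimeout.Duration)
}
```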

@wmgroot
Contributor

wmgroot commented Nov 22, 2023

Should support for volume detach timeouts be opened as another issue? We use EBS CSI and find that some volumes refuse to detach due to timeouts that aren't handled well by the EBS CSI controller. We've opened several issues there, but these problems still occur after years of using EBS CSI. This blocks removal of nodes and can greatly delay the process of updating nodes in a cluster if left to manual intervention.

$ kubectl explain machinedeployment.spec.template.spec.nodeVolumeDetachTimeout --context mgmt-use2
KIND:     MachineDeployment
VERSION:  cluster.x-k8s.io/v1beta1

FIELD:    nodeVolumeDetachTimeout <string>

DESCRIPTION:
     NodeVolumeDetachTimeout is the total amount of time that the controller
     will spend on waiting for all volumes to be detached. The default value is
     0, meaning that the volumes can be detached without any time limitations.

@garvinp-stripe
Contributor

I think in general cluster operators need/want more control over how node shutdown happens, from timeouts to how the drain itself happens.

Wondering if we should prioritize #740 to expand how node termination happens and expose more control for users?

@ellistarn
Contributor

Supportive of a node drain timeout proposal. Anyone want to take this on?

@wmgroot
Contributor

wmgroot commented Dec 1, 2023

I'm interested; I opened a PR for a design on this above.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 29, 2024
@garvinp-stripe
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Feb 29, 2024
@jonathan-innis
Member

cc: @akestner

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 3, 2024
@jonathan-innis
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jun 3, 2024