
support drain timeout #743

Closed
grosser opened this issue Sep 8, 2022 · 34 comments · Fixed by #916
Labels
deprovisioning (Issues related to node deprovisioning) · kind/feature (Categorizes issue or PR as related to a new feature) · v1.x (Issues prioritized for post-1.0)

Comments

@grosser

grosser commented Sep 8, 2022

Tell us about your request
Sometimes nodes get stuck when a pod cannot be drained, so we have to alert on it and then manually kill the node, which is not ideal.

It would be nice to set an optional drainTimeout so we can ensure nodes always die after x hours/days.

Are you currently working around this issue?
Helper pod that kills pods on stuck nodes so they finish draining.
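For illustration, a minimal sketch of what such a helper can look like with client-go; the node name, kubeconfig handling, and zero grace period here are assumptions, not the actual implementation:

```go
// Hypothetical helper: force-delete any pods still running on a node that is
// stuck draining, so the drain can finish. Illustrative only.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig; a real helper pod would use in-cluster config.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	nodeName := "stuck-node" // hypothetical node that refuses to finish draining
	force := int64(0)        // grace period 0 = force delete, skipping graceful termination

	// List all pods still scheduled on the stuck node.
	pods, err := client.CoreV1().Pods("").List(context.TODO(), metav1.ListOptions{
		FieldSelector: "spec.nodeName=" + nodeName,
	})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Force-delete every remaining pod so the drain can complete.
		err := client.CoreV1().Pods(p.Namespace).Delete(context.TODO(), p.Name, metav1.DeleteOptions{
			GracePeriodSeconds: &force,
		})
		fmt.Printf("force-deleted %s/%s: %v\n", p.Namespace, p.Name, err)
	}
}
```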

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@grosser added the kind/feature label Sep 8, 2022
@spring1843
Contributor

What should happen after the drainTimeout has elapsed? A pod not terminating may be by design, e.g. it's waiting for an IO operation with a long timeout.

@grosser
Author

grosser commented Sep 12, 2022

It should force-delete the pod.
The flag will definitely need a "this has xyz downsides" warning,
but dealing with empty nodes hanging around is also an overhead, and especially in staging clusters we want to make sure cost stays under control when things go wrong.

@jonathan-innis
Member

Related to aws/karpenter-provider-aws#2391

@ellistarn
Contributor

@grosser, can you elaborate a bit on what can cause the node to get stuck? Do you have do-not-evict pods, or broken PDBs? We're currently scoping some work in this area and looking to see what makes sense here.

@grosser
Author

grosser commented Jun 4, 2023

Broken PDBs.
We as cluster admins need to roll the clusters, but some other teams left their apps in a bad state.
In production that would mean we have to reach out to them or at least check that their app is not critical, but in staging we'd like to just ignore it.

@ellistarn
Contributor

I wonder if we should just hardcode something like 10 minutes if eviction is returning a 500. Do you need to configure this value?

@grosser
Author

grosser commented Jun 4, 2023

I don't think it's good to hardcode any timeout since by default the system should be safe for the workloads and fail the cluster rollout instead.
The opt-in could be either a timeout or true/false ... but I think timeout makes the most sense.

@ellistarn
Contributor

What would you set it to?

@grosser
Author

grosser commented Jun 5, 2023

nothing in production and 10-15m in staging

@ellistarn
Contributor

The other wrinkle here is that we won't attempt (iirc, or perhaps it's just deprioritized) to deprovision a node that has a blocked PDB, so we'll need handling for both disruption and termination flows.

@ellistarn
Contributor

Would you want to treat a 500 (misconfigured) differently than a 429 (pdb unfulfilled)?

@grosser
Author

grosser commented Jun 5, 2023

I'd treat them the same, either way:
production: the owning team needs to check it out before we can proceed
staging: go fix your app 🤷

@ellistarn
Contributor

Great feedback, thank you!

@sftim

sftim commented Jun 5, 2023

Related / extra credit: manage machine drains through a custom resource, similar to how the node maintenance operator does it. This could include reporting progress via .status

@njtran
Contributor

njtran commented Aug 14, 2023

How does https://kubernetes.io/blog/2023/01/06/unhealthy-pod-eviction-policy-for-pdbs/ fit into your use-case? Does it solve it? It's behind a Beta feature gate in 1.27.

@njtran added the v1.x label Aug 14, 2023
@grosser
Author

grosser commented Aug 15, 2023

It solves it for completely unhealthy applications and should not have a big impact on regular operations.
It does not, however, solve the situation where only part of a PDB is unhealthy (for example, new pods are unhealthy and old pods are healthy).
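For reference, the policy linked above is set via spec.unhealthyPodEvictionPolicy on the PDB. A minimal sketch using the k8s.io/api policy/v1 Go types (the name and selector are made up); as noted, AlwaysAllow only helps when the blocking pods are themselves not Ready:

```go
package main

import (
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

func main() {
	minAvailable := intstr.FromInt(1)
	// AlwaysAllow lets the eviction API evict pods that are Running but not Ready,
	// even when the budget is otherwise violated.
	policy := policyv1.AlwaysAllow
	pdb := policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "example-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable:               &minAvailable,
			Selector:                   &metav1.LabelSelector{MatchLabels: map[string]string{"app": "example"}},
			UnhealthyPodEvictionPolicy: &policy,
		},
	}
	fmt.Println(pdb.Name, *pdb.Spec.UnhealthyPodEvictionPolicy)
}
```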

@mallow111

@jonathan-innis has anyone started working on this issue yet?

@garvinp-stripe
Contributor

Is it possible to expose and implement a timeout in the finalizer? If nothing is set (the default), it never times out. If a timeout is set, it should be easy enough for the finalizer to annotate the machine/node with the time the drain started, and force-kill the node/machine once now - annotation > timeout.
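A minimal sketch of that check; the drain-started annotation key and the timeout plumbing are hypothetical, not an existing Karpenter API:

```go
package main

import (
	"fmt"
	"time"
)

const drainStartedAnnotation = "karpenter.sh/drain-started" // hypothetical annotation key

// drainDeadlineExceeded reports whether a drain that started at the annotated
// time has run longer than the configured timeout. A missing annotation or a
// zero timeout means "never time out", matching the proposed default.
func drainDeadlineExceeded(annotations map[string]string, timeout time.Duration, now time.Time) bool {
	if timeout == 0 {
		return false
	}
	started, ok := annotations[drainStartedAnnotation]
	if !ok {
		return false
	}
	t, err := time.Parse(time.RFC3339, started)
	if err != nil {
		return false
	}
	return now.Sub(t) > timeout
}

func main() {
	annotations := map[string]string{
		drainStartedAnnotation: time.Now().Add(-2 * time.Hour).Format(time.RFC3339),
	}
	// With a 1h timeout, the finalizer would switch from graceful eviction to force deletion.
	fmt.Println(drainDeadlineExceeded(annotations, time.Hour, time.Now()))
}
```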

@sftim

sftim commented Sep 25, 2023

I found out recently that graceful deletion for custom resources (as in CRDs) doesn't really seem to be a thing.
If that's right, then we'd need an improvement to Kubernetes itself, and then we could indeed gracefully delete the Machine, using the future-dated removal to signal the drain timeout.

@garvinp-stripe
Contributor

Can you provide some details? I think you are correct, but shouldn't a finalizer on the CRD (Machine) work equally well? Trying to understand what specific API you need from K8s.

@sftim

sftim commented Sep 26, 2023

Graceful deletion has an associated time; finalizers don't have any time stored in the API.

@garvinp-stripe
Contributor

Ah ya that's unfortunate

@garvinp-stripe
Contributor

I think upstreaming that change would take a while. I wonder if there is a short- to mid-term solution? We might just fork Karpenter core and add the ability to the finalizer: pick up the timeout from the config map, plus some kind of annotation on the machine marking the time of the deletion attempt.

@njtran
Contributor

njtran commented Oct 6, 2023

We had talked about including a timeout here that would be inferred from the NodePool or node itself to control the drain/eviction logic and forcefully evict once a drain/termination has reached the timeout. We'd essentially need to mock the graceful deletion ourselves, as I don't think there's any graceful deletion for nodes. We had a PR #466 that explored this, but haven't had the time to get to it.

@njtran transferred this issue from aws/karpenter-provider-aws Nov 2, 2023
This was referenced Nov 16, 2023
@wmgroot
Contributor

wmgroot commented Nov 22, 2023

Just for reference, this is the field that supports this behavior in ClusterAPI.

$ kubectl explain machinedeployment.spec.template.spec.nodeDrainTimeout
KIND:     MachineDeployment
VERSION:  cluster.x-k8s.io/v1beta1

FIELD:    nodeDrainTimeout <string>

DESCRIPTION:
     NodeDrainTimeout is the total amount of time that the controller will spend
     on draining a node. The default value is 0, meaning that the node can be
     drained without any time limitations. NOTE: NodeDrainTimeout is different
     from `kubectl drain --timeout`

In our case, we set this value to 24h in all clusters, production or otherwise. We make it clear to our users that they must tolerate graceful disruption of their pods, and to alert themselves if their pods are in an unhealthy state for a long period of time in production. This allows us to ensure we can keep up to date with security patches across all of our clusters.

We've discussed lowering the value from 24h to something much shorter once we have a better story for automated traffic failover between our clusters.
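For comparison, a rough sketch of setting that 24h drain timeout with the Cluster API Go types (sigs.k8s.io/cluster-api/api/v1beta1); the object names are illustrative and this is not a Karpenter API:

```go
package main

import (
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	clusterv1 "sigs.k8s.io/cluster-api/api/v1beta1"
)

func main() {
	md := clusterv1.MachineDeployment{
		ObjectMeta: metav1.ObjectMeta{Name: "example-workers", Namespace: "default"},
		Spec: clusterv1.MachineDeploymentSpec{
			ClusterName: "example",
			Template: clusterv1.MachineTemplateSpec{
				Spec: clusterv1.MachineSpec{
					ClusterName: "example",
					// After 24h the Cluster API controller stops waiting for the drain
					// and proceeds with deleting the machine.
					NodeDrainTimeout: &metav1.Duration{Duration: 24 * time.Hour},
				},
			},
		},
	}
	fmt.Println(md.Name, md.Spec.Template.Spec.NodeDrainTimeout.Duration)
}
```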

@wmgroot
Contributor

wmgroot commented Nov 22, 2023

Should support for volume detach timeouts be opened as another issue? We use EBS CSI and find that some volumes refuse to detach due to timeouts that aren't handled well by the EBS CSI controller. We've opened several issues there, but these problems still occur after years of using EBS CSI. This blocks removal of nodes and can greatly delay the process of updating nodes in a cluster if left to manual intervention.

$ kubectl explain machinedeployment.spec.template.spec.nodeVolumeDetachTimeout --context mgmt-use2
KIND:     MachineDeployment
VERSION:  cluster.x-k8s.io/v1beta1

FIELD:    nodeVolumeDetachTimeout <string>

DESCRIPTION:
     NodeVolumeDetachTimeout is the total amount of time that the controller
     will spend on waiting for all volumes to be detached. The default value is
     0, meaning that the volumes can be detached without any time limitations.

@garvinp-stripe
Contributor

I think in general cluster operators need/want more control over how node shutdown happens, from timeouts to how the drain itself happens.

Wondering if we should prioritize #740 to expand how node termination happens and expose more control for users?

@ellistarn
Contributor

Supportive of a node drain timeout proposal. Anyone want to take this on?

@wmgroot
Contributor

wmgroot commented Dec 1, 2023

I'm interested; I opened a PR for a design on this above.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Feb 29, 2024
@garvinp-stripe
Contributor

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Feb 29, 2024
@jonathan-innis
Member

cc: @akestner

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 3, 2024
@jonathan-innis
Member

/remove-lifecycle stale

@k8s-ci-robot removed the lifecycle/stale label Jun 3, 2024