
Node Interruption should be aware of deprovisioning #2917

Open
ellistarn opened this issue Nov 22, 2022 · 11 comments
Labels
feature New feature or request v1.x Issues prioritized for post-1.0

Comments

@ellistarn
Contributor

ellistarn commented Nov 22, 2022

Tell us about your request

Right now, spot interruption triggers cordon/drain on a node.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

When interruption happens, deprovisioning logic should be aware so as to not voluntarily terminate additional capacity while the cluster is in flux.
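
For illustration, a minimal sketch (in Go, with made-up types rather than Karpenter's actual internals) of the kind of check a deprovisioner could make before acting voluntarily while interruption-driven deletions are in flight:

```go
// Hypothetical sketch (not Karpenter's actual API): before computing a voluntary
// deprovisioning action, check whether any node is already draining because of a
// spot interruption and, if so, hold off so we don't remove extra capacity while
// the cluster is in flux.
package main

import (
	"fmt"
	"time"
)

// node is a stand-in for Karpenter's internal node representation.
type node struct {
	Name           string
	Interrupted    bool      // set when an interruption notice was received
	DeletionIssued time.Time // zero if no delete has been triggered yet
}

// safeToDeprovision returns false while interruption-driven deletions are in flight.
func safeToDeprovision(nodes []node) bool {
	for _, n := range nodes {
		if n.Interrupted && !n.DeletionIssued.IsZero() {
			return false
		}
	}
	return true
}

func main() {
	nodes := []node{
		{Name: "node-a"},
		{Name: "node-b", Interrupted: true, DeletionIssued: time.Now()},
	}
	fmt.Println("safe to deprovision:", safeToDeprovision(nodes)) // false
}
```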

Are you currently working around this issue?

Eating dirt.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@ellistarn ellistarn added the feature New feature or request label Nov 22, 2022
@jonathan-innis
Contributor

Talked with @bwagner5. I'm not really sure what the difference is here for pre-spin. I see the value of integrating this into the deprovisioning flow, but we will always have to start by deleting the node and kicking off graceful pod termination.

Once we start a delete, we consider all pods on that node to be in a pseudo-pending state, meaning that replacement capacity will be scheduled for them. The main benefits of pre-spinning were:

  1. We would properly schedule for all pods (this is solved already with the current deprovisioning logic)
  2. We would reduce pod churn/disruption of workloads (we have no alternative for this since the point of interruption notification is to make sure deletion is kicked off so pods have enough time to gracefully terminate).

Let me know if I am missing some point that you are making here though.
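
As an aside, a rough sketch of the pseudo-pending idea described above, using illustrative types rather than Karpenter's actual scheduling code:

```go
// Illustrative sketch (made-up types, not Karpenter's scheduler): pods on a node
// that has started deleting are treated as pseudo-pending, so a scheduling
// simulation can provision replacement capacity for them. DaemonSet pods are
// skipped because the replacement node recreates them automatically.
package main

import "fmt"

type pod struct {
	Name             string
	OwnedByDaemonSet bool
}

type simNode struct {
	Name     string
	Deleting bool
	Pods     []pod
}

// pseudoPendingPods returns the reschedulable pods from nodes that have begun
// graceful termination.
func pseudoPendingPods(nodes []simNode) []pod {
	var out []pod
	for _, n := range nodes {
		if !n.Deleting {
			continue
		}
		for _, p := range n.Pods {
			if !p.OwnedByDaemonSet {
				out = append(out, p)
			}
		}
	}
	return out
}

func main() {
	nodes := []simNode{{
		Name:     "spot-node",
		Deleting: true,
		Pods:     []pod{{Name: "web-0"}, {Name: "kube-proxy-x", OwnedByDaemonSet: true}},
	}}
	for _, p := range pseudoPendingPods(nodes) {
		fmt.Println("schedule for:", p.Name) // only web-0
	}
}
```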

@ellistarn
Contributor Author

This SGTM. This issue was intended to track the work to integrate node interruption into the deprovisioning flow. I'll rename.

@ellistarn ellistarn changed the title from "Node Interruption Prespin Replacements" to "Node Interruption should be aware of deprovisioning" Nov 23, 2022
@ellistarn ellistarn added the v1 Issues requiring resolution by the v1 milestone label Apr 18, 2023
@billrayburn billrayburn added v1.x Issues prioritized for post-1.0 and removed v1 Issues requiring resolution by the v1 milestone labels Apr 26, 2023
@ellistarn
Contributor Author

We would reduce pod churn/disruption of workloads (we have no alternative for this since the point of interruption notification is to make sure deletion is kicked off so pods have enough time to gracefully terminate).

I'd like to revisit this tradeoff. Do we want to maximize the graceful time a pod has to drain, or minimize the amount of time that a drained pod will sit pending?

We've heard recently about customers facing situations where pods are able to drain very rapidly, but their applications become unavailable during the replacement node launch duration.

@jonathan-innis
Contributor

but their applications become unavailable during the replacement node launch duration

This seems strange to me. We could perhaps do a bit better here by timing the delete so that we reduce application unavailability while still allowing pods their full termination grace period. I would expect that if a user sets up PDBs for their applications, that should be sufficient to ensure application stability while new nodes are coming up, unless I'm missing something in my understanding.
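
For reference, a minimal sketch of the PDB arithmetic being relied on here (illustrative only, not the eviction API):

```go
// Illustrative only, not the eviction API: an eviction is allowed only while the
// number of healthy replicas stays at or above the PDB's minAvailable, which is
// what keeps an application available while replacement nodes come up.
package main

import "fmt"

// evictionAllowed mirrors the basic PDB arithmetic: evicting one pod must not
// drop healthy replicas below minAvailable.
func evictionAllowed(healthyReplicas, minAvailable int) bool {
	return healthyReplicas-1 >= minAvailable
}

func main() {
	fmt.Println(evictionAllowed(3, 2)) // true: one disruption is allowed
	fmt.Println(evictionAllowed(2, 2)) // false: eviction would violate the PDB
}
```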

@ellistarn
Contributor Author

IIRC this was to help cases of single-pod deployments, which can't use PDBs.

@ellistarn
Contributor Author

Could we somehow look at the grace period of all pods, and make sure we leave at least that amount of time left before triggering the deletion?
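
A rough sketch of this idea (illustrative only; the 2-minute window below is the typical spot interruption notice):

```go
// Illustrative only: compute the latest moment a node delete can start so that
// every pod still gets its full terminationGracePeriodSeconds before the
// instance is reclaimed (spot gives roughly a 2-minute notice).
package main

import (
	"fmt"
	"time"
)

// latestDeleteTime returns the reclaim deadline minus the longest pod grace period.
func latestDeleteTime(reclaimAt time.Time, podGraceSeconds []int64) time.Time {
	var maxGrace time.Duration
	for _, s := range podGraceSeconds {
		if d := time.Duration(s) * time.Second; d > maxGrace {
			maxGrace = d
		}
	}
	return reclaimAt.Add(-maxGrace)
}

func main() {
	reclaimAt := time.Now().Add(2 * time.Minute) // spot interruption notice window
	deleteBy := latestDeleteTime(reclaimAt, []int64{30, 90})
	fmt.Println("must start delete by:", deleteBy.Format(time.RFC3339)) // ~30s from now
}
```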

@jonathan-innis
Contributor

Could we somehow look at the grace period of all pods, and make sure we leave at least that amount of time left before triggering the deletion

This still seems tenuous to me, given the ordered deletion that we are executing right now. Even if all of the workload pods get a chance to drain, this might not leave enough time for the DaemonSet pods to run through their termination grace period. To get this right, we would basically have to reason about our ordered deletion and propagate that back into how we time the delete relative to when we got the interruption.
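
A small sketch of why the ordered deletion matters for the timing (illustrative assumptions: workload pods drain first, then DaemonSet pods, each phase bounded by its worst-case grace period):

```go
// Illustrative only, not Karpenter's drain code: if workload pods drain before
// DaemonSet pods, the lead time needed before the instance is reclaimed is
// roughly the sum of the two phases' worst-case grace periods, not just the max.
package main

import (
	"fmt"
	"time"
)

func maxSeconds(graceSeconds []int64) time.Duration {
	var max int64
	for _, s := range graceSeconds {
		if s > max {
			max = s
		}
	}
	return time.Duration(max) * time.Second
}

// requiredLeadTime assumes two sequential drain phases: workload pods, then DaemonSets.
func requiredLeadTime(workloadGrace, daemonSetGrace []int64) time.Duration {
	return maxSeconds(workloadGrace) + maxSeconds(daemonSetGrace)
}

func main() {
	lead := requiredLeadTime([]int64{60, 30}, []int64{30})
	fmt.Println("lead time needed:", lead) // 1m30s, already close to the 2m spot notice
}
```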

I'd be more interested to hear how many users are constrained by the inability to use PDBs here. This seems like an edge case to me, and I'm skeptical that doing this is worth the complexity of trying to reason about when we actually terminate the node on the interruption side.

@ellistarn
Contributor Author

🤔 Do we have to evict some pods before others? I know that we evict critical pods after non-critical ones, but I can't remember why we have this sequencing requirement.

It's a good point, though.

However, if we could make this simple, I could see benefits for reducing downtime on singletons that run on spot. E.g., for a Jupyter notebook running on spot, the developer might not even notice the disruption if it only lasted a few seconds.

@jonathan-innis
Contributor

An interesting idea that was floated around among some of the maintainers offline: if terminationGracePeriod ends up being set directly on the NodeClaim, as it shakes out of the Force Drain RFC in kubernetes-sigs/karpenter, we can actually just set the terminationGracePeriod of the NodeClaim to 2m on delete and trigger a deletion of the NodeClaim.

The rub here is that the terminationGracePeriod feature being proposed is aware of the terminationGracePeriodSeconds of all pods on the node. This means that it will only start evicting a pod once the pod's terminationGracePeriodSeconds is less than or equal to the amount of time remaining in the NodeClaim's terminationGracePeriod. This basically solves the problem and kills two birds with one stone.

On top of that, it's extremely elegant.
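
A hypothetical sketch of the eviction-timing behavior described above (names and semantics are illustrative and may differ from the final Force Drain design):

```go
// Hypothetical sketch of the behavior summarized above (names and semantics may
// differ from the final Force Drain design): a pod starts being evicted only once
// its terminationGracePeriodSeconds fits inside the time remaining in the
// NodeClaim's terminationGracePeriod.
package main

import (
	"fmt"
	"time"
)

// shouldEvictNow reports whether a pod with the given grace period should be
// evicted now, given when the NodeClaim's terminationGracePeriod expires.
func shouldEvictNow(now, nodeClaimExpiry time.Time, podGraceSeconds int64) bool {
	remaining := nodeClaimExpiry.Sub(now)
	return time.Duration(podGraceSeconds)*time.Second <= remaining
}

func main() {
	now := time.Now()
	expiry := now.Add(2 * time.Minute) // terminationGracePeriod: 2m set on delete
	fmt.Println(shouldEvictNow(now, expiry, 30))  // true: 30s fits in the 2m window
	fmt.Println(shouldEvictNow(now, expiry, 300)) // false: 300s no longer fits
}
```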

@jonathan-innis
Contributor

Thinking on this more, I'm realizing that this idea isn't perfect. The terminationGracePeriod only ensures a maximum amount of time that the Node should be draining, but doesn't ensure a minimum. In our current iteration of this logic, we will still attempt to drain pods as quickly as possible while processing the drain request. Practically, this means we still won't maximize pod uptime for a pod with a short terminationGracePeriodSeconds, since these pods probably don't have blocking PDBs and are going to be eligible for draining immediately.

I'm also not convinced that we want to leverage terminationGracePeriod in this way. This effectively makes it a minimum drain time (or a place where minDrainTime == maxDrainTime), and I don't think we want to treat a user setting terminationGracePeriod as meaning that they want the drain period to be exactly this long.

@jonathan-innis
Contributor

It's worth noting that setting terminationGracePeriod would help with the case where you have a fully blocking PDB that doesn't even let the drain request on the pod get initiated: terminationGracePeriod would push through the blocking PDB to try to maximize drain time.
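
A hypothetical sketch of that interaction (illustrative only; not Karpenter's actual termination code):

```go
// Hypothetical sketch, not Karpenter's actual termination code: with a fully
// blocking PDB a graceful eviction never succeeds, so the NodeClaim's
// terminationGracePeriod acts as the backstop that eventually removes the pod,
// bounding total drain time.
package main

import (
	"fmt"
	"time"
)

// nextAction decides what to do with a pod whose eviction is blocked by a PDB.
func nextAction(now, gracePeriodExpiry time.Time, evictionBlocked bool) string {
	switch {
	case !evictionBlocked:
		return "evict gracefully"
	case now.Before(gracePeriodExpiry):
		return "wait: PDB is blocking and the grace period has not expired"
	default:
		return "delete pod: terminationGracePeriod expired, push through the PDB"
	}
}

func main() {
	expiry := time.Now().Add(2 * time.Minute)
	fmt.Println(nextAction(time.Now(), expiry, true))              // wait
	fmt.Println(nextAction(expiry.Add(time.Second), expiry, true)) // delete pod
}
```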
