
Node Interruption should be aware of deprovisioning #2917

Open
ellistarn opened this issue Nov 22, 2022 · 11 comments
Labels
feature New feature or request v1.x Issues prioritized for post-1.0

Comments

@ellistarn
Contributor

ellistarn commented Nov 22, 2022

Tell us about your request

Right now, spot interruption triggers cordon/drain on a node.

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

When interruption happens, deprovisioning logic should be aware so as to not voluntarily terminate additional capacity while the cluster is in flux.
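
For illustration, a minimal sketch (in Go, with made-up types rather than Karpenter's actual internals) of the kind of check a deprovisioner could make before acting voluntarily while interruption-driven deletions are in flight:

```go
// Hypothetical sketch (not Karpenter's actual API): before computing a voluntary
// deprovisioning action, check whether any node is already draining because of a
// spot interruption and, if so, hold off so we don't remove extra capacity while
// the cluster is in flux.
package main

import (
	"fmt"
	"time"
)

// node is a stand-in for Karpenter's internal node representation.
type node struct {
	Name           string
	Interrupted    bool      // set when an interruption notice was received
	DeletionIssued time.Time // zero if no delete has been triggered yet
}

// safeToDeprovision returns false while interruption-driven deletions are in flight.
func safeToDeprovision(nodes []node) bool {
	for _, n := range nodes {
		if n.Interrupted && !n.DeletionIssued.IsZero() {
			return false
		}
	}
	return true
}

func main() {
	nodes := []node{
		{Name: "node-a"},
		{Name: "node-b", Interrupted: true, DeletionIssued: time.Now()},
	}
	fmt.Println("safe to deprovision:", safeToDeprovision(nodes)) // false
}
```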

Are you currently working around this issue?

Eating dirt.

Additional Context

No response

Attachments

No response

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
@ellistarn ellistarn added the feature New feature or request label Nov 22, 2022
@jonathan-innis
Contributor

Talked with @bwagner5. I'm not really sure what the difference is here for pre-spin. I see the value of integrating this into the deprovisioning flow, but we will always have to start by deleting the node and kicking off graceful pod termination.

Once we start a delete, we consider all pods on that node to be in a pseudo-pending state, meaning that replacement capacity will be scheduled for them. The main benefits of pre-spinning were:

  1. We would properly schedule for all pods (this is solved already with the current deprovisioning logic)
  2. We would reduce pod churn/disruption of workloads (we have no alternative for this since the point of interruption notification is to make sure deletion is kicked off so pods have enough time to gracefully terminate).

Let me know if I am missing some point that you are making here though.
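
As an aside, a rough sketch of the pseudo-pending idea described above, using illustrative types rather than Karpenter's actual scheduling code:

```go
// Illustrative sketch (made-up types, not Karpenter's scheduler): pods on a node
// that has started deleting are treated as pseudo-pending, so a scheduling
// simulation can provision replacement capacity for them. DaemonSet pods are
// skipped because the replacement node recreates them automatically.
package main

import "fmt"

type pod struct {
	Name             string
	OwnedByDaemonSet bool
}

type simNode struct {
	Name     string
	Deleting bool
	Pods     []pod
}

// pseudoPendingPods returns the reschedulable pods from nodes that have begun
// graceful termination.
func pseudoPendingPods(nodes []simNode) []pod {
	var out []pod
	for _, n := range nodes {
		if !n.Deleting {
			continue
		}
		for _, p := range n.Pods {
			if !p.OwnedByDaemonSet {
				out = append(out, p)
			}
		}
	}
	return out
}

func main() {
	nodes := []simNode{{
		Name:     "spot-node",
		Deleting: true,
		Pods:     []pod{{Name: "web-0"}, {Name: "kube-proxy-x", OwnedByDaemonSet: true}},
	}}
	for _, p := range pseudoPendingPods(nodes) {
		fmt.Println("schedule for:", p.Name) // only web-0
	}
}
```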

@ellistarn
Contributor Author

This SGTM. This issue was intended to track the work to integrate node interruption into the deprovisioning flow. I'll rename.

@ellistarn ellistarn changed the title from "Node Interruption Prespin Replacements" to "Node Interruption should be aware of deprovisioning" Nov 23, 2022
@ellistarn ellistarn added the v1 Issues requiring resolution by the v1 milestone label Apr 18, 2023
@billrayburn billrayburn added v1.x Issues prioritized for post-1.0 and removed v1 Issues requiring resolution by the v1 milestone labels Apr 26, 2023
@ellistarn
Contributor Author

We would reduce pod churn/disruption of workloads (we have no alternative for this since the point of interruption notification is to make sure deletion is kicked off so pods have enough time to gracefully terminate).

I'd like to revisit this tradeoff. Do we want to maximize the graceful time a pod has to drain, or minimize the amount of time that a drained pod will sit pending?

We've heard recently about customers facing situations where pods are able to drain very rapidly, but their applications become unavailable during the replacement node launch duration.

@jonathan-innis
Contributor

but their applications become unavailable during the replacement node launch duration

This seems strange to me. We could perhaps do a bit better here by timing the delete so that we reduce application unavailability while still allowing pods their full termination grace period. I would expect that if a user sets up PDBs for their applications, that should be sufficient to ensure application stability while new nodes are coming up, unless I'm missing something in my understanding.
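
For reference, a minimal sketch of the PDB arithmetic being relied on here (illustrative only, not the eviction API):

```go
// Illustrative only, not the eviction API: an eviction is allowed only while the
// number of healthy replicas stays at or above the PDB's minAvailable, which is
// what keeps an application available while replacement nodes come up.
package main

import "fmt"

// evictionAllowed mirrors the basic PDB arithmetic: evicting one pod must not
// drop healthy replicas below minAvailable.
func evictionAllowed(healthyReplicas, minAvailable int) bool {
	return healthyReplicas-1 >= minAvailable
}

func main() {
	fmt.Println(evictionAllowed(3, 2)) // true: one disruption is allowed
	fmt.Println(evictionAllowed(2, 2)) // false: eviction would violate the PDB
}
```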

@ellistarn
Contributor Author

IIRC this was to help cases of single-pod deployments, which can't use PDBs.

@ellistarn
Contributor Author

Could we somehow look at the grace period of all pods, and make sure we leave at least that amount of time left before triggering the deletion?
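
A rough sketch of this idea (illustrative only; the 2-minute window below is the typical spot interruption notice):

```go
// Illustrative only: compute the latest moment a node delete can start so that
// every pod still gets its full terminationGracePeriodSeconds before the
// instance is reclaimed (spot gives roughly a 2-minute notice).
package main

import (
	"fmt"
	"time"
)

// latestDeleteTime returns the reclaim deadline minus the longest pod grace period.
func latestDeleteTime(reclaimAt time.Time, podGraceSeconds []int64) time.Time {
	var maxGrace time.Duration
	for _, s := range podGraceSeconds {
		if d := time.Duration(s) * time.Second; d > maxGrace {
			maxGrace = d
		}
	}
	return reclaimAt.Add(-maxGrace)
}

func main() {
	reclaimAt := time.Now().Add(2 * time.Minute) // spot interruption notice window
	deleteBy := latestDeleteTime(reclaimAt, []int64{30, 90})
	fmt.Println("must start delete by:", deleteBy.Format(time.RFC3339)) // ~30s from now
}
```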

@jonathan-innis
Contributor

Could we somehow look at the grace period of all pods, and make sure we leave at least that amount of time left before triggering the deletion

This still seems tenuous to me, given the ordered deletion that we are executing right now. Even if all of the workload pods get a chance to drain, this might not leave enough time for the DaemonSet pods to run through their termination grace period. To get this right, we would basically have to reason about our ordered deletion and propagate that back into how we time the delete relative to when we got the interruption.
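
A small sketch of why the ordered deletion matters for the timing (illustrative assumptions: workload pods drain first, then DaemonSet pods, each phase bounded by its worst-case grace period):

```go
// Illustrative only, not Karpenter's drain code: if workload pods drain before
// DaemonSet pods, the lead time needed before the instance is reclaimed is
// roughly the sum of the two phases' worst-case grace periods, not just the max.
package main

import (
	"fmt"
	"time"
)

func maxSeconds(graceSeconds []int64) time.Duration {
	var max int64
	for _, s := range graceSeconds {
		if s > max {
			max = s
		}
	}
	return time.Duration(max) * time.Second
}

// requiredLeadTime assumes two sequential drain phases: workload pods, then DaemonSets.
func requiredLeadTime(workloadGrace, daemonSetGrace []int64) time.Duration {
	return maxSeconds(workloadGrace) + maxSeconds(daemonSetGrace)
}

func main() {
	lead := requiredLeadTime([]int64{60, 30}, []int64{30})
	fmt.Println("lead time needed:", lead) // 1m30s, already close to the 2m spot notice
}
```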

I'd be more interested to hear how many users are constrained by the inability to use PDBs here. This seems like an edge case to me, and I'm skeptical that doing this is worth the complexity of trying to reason about when we actually terminate the node on the interruption side.

@ellistarn
Contributor Author

🤔 Do we have to evict some pods before others? I know that we evict critical pods after non-critical ones, but I can't remember why we have this sequencing requirement.

It's a good point, though.

However, if we could make this simple, I could see benefits for reducing downtime on singletons that run on spot. E.g., for a Jupyter notebook running on spot, the developer might not even notice the disruption if it only lasted a few seconds.

@jonathan-innis
Contributor

An interesting idea that was floated around among some of the maintainers offline: if terminationGracePeriod ends up being set directly on the NodeClaim, as it shakes out of the Force Drain RFC in kubernetes-sigs/karpenter, we can actually just set the terminationGracePeriod of the NodeClaim to 2m on delete and trigger a deletion of the NodeClaim.

The rub here is that the terminationGracePeriod feature being proposed is aware of the terminationGracePeriodSeconds of all pods on the node. This means that it will only start evicting a pod once the pod's terminationGracePeriodSeconds is less than or equal to the amount of time remaining in the NodeClaim's terminationGracePeriod. This basically solves the problem and kills two birds with one stone.

On top of that, it's extremely elegant.
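
A hypothetical sketch of the eviction-timing behavior described above (names and semantics are illustrative and may differ from the final Force Drain design):

```go
// Hypothetical sketch of the behavior summarized above (names and semantics may
// differ from the final Force Drain design): a pod starts being evicted only once
// its terminationGracePeriodSeconds fits inside the time remaining in the
// NodeClaim's terminationGracePeriod.
package main

import (
	"fmt"
	"time"
)

// shouldEvictNow reports whether a pod with the given grace period should be
// evicted now, given when the NodeClaim's terminationGracePeriod expires.
func shouldEvictNow(now, nodeClaimExpiry time.Time, podGraceSeconds int64) bool {
	remaining := nodeClaimExpiry.Sub(now)
	return time.Duration(podGraceSeconds)*time.Second <= remaining
}

func main() {
	now := time.Now()
	expiry := now.Add(2 * time.Minute) // terminationGracePeriod: 2m set on delete
	fmt.Println(shouldEvictNow(now, expiry, 30))  // true: 30s fits in the 2m window
	fmt.Println(shouldEvictNow(now, expiry, 300)) // false: 300s no longer fits
}
```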

@jonathan-innis
Contributor

Thinking on this more, I'm realizing that this idea isn't perfect. The terminationGracePeriod only ensures a maximum amount of time that the Node should be draining, but doesn't ensure a minimum. In our current iteration of this logic, we will still attempt to drain pods as quickly as possible while processing the drain request. Practically, this means we still won't maximize pod uptime for a pod with a short terminationGracePeriodSeconds, since these pods probably don't have blocking PDBs and are going to be eligible for draining immediately.

I'm also not convinced that we want to leverage terminationGracePeriod in this way. This effectively makes it a minimum drain time (or a place where minDrainTime == maxDrainTime), and I don't think we want to treat a user setting terminationGracePeriod as meaning that they want the drain period to be exactly this long.

@jonathan-innis
Contributor

It's worth noting that setting terminationGracePeriod would help with the case where you have a fully blocking PDB that doesn't even let the drain request on the pod get initiated: terminationGracePeriod would push through the blocking PDB to try to maximize drain time.
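
A hypothetical sketch of that interaction (illustrative only; not Karpenter's actual termination code):

```go
// Hypothetical sketch, not Karpenter's actual termination code: with a fully
// blocking PDB a graceful eviction never succeeds, so the NodeClaim's
// terminationGracePeriod acts as the backstop that eventually removes the pod,
// bounding total drain time.
package main

import (
	"fmt"
	"time"
)

// nextAction decides what to do with a pod whose eviction is blocked by a PDB.
func nextAction(now, gracePeriodExpiry time.Time, evictionBlocked bool) string {
	switch {
	case !evictionBlocked:
		return "evict gracefully"
	case now.Before(gracePeriodExpiry):
		return "wait: PDB is blocking and the grace period has not expired"
	default:
		return "delete pod: terminationGracePeriod expired, push through the PDB"
	}
}

func main() {
	expiry := time.Now().Add(2 * time.Minute)
	fmt.Println(nextAction(time.Now(), expiry, true))              // wait
	fmt.Println(nextAction(expiry.Add(time.Second), expiry, true)) // delete pod
}
```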
