
How to reduce unnecessary preemptions in target clusters??? #97

Open
everpeace opened this issue Dec 17, 2020 · 5 comments
Labels: enhancement (New feature or request)

@everpeace commented Dec 17, 2020

As described in the official documentation, the Filter phase in the proxy scheduler waits until the candidate pods reach the Reserve phase in the candidate schedulers. This can trigger pod preemptions in the candidate schedulers. However, at most one candidate pod becomes the delegate pod, so unnecessary preemptions could happen in many target clusters.

For example, if I have 10 targets and all of them are very full, and I create a source pod with high priority, then preemptions might happen in all 10 target clusters. As a result, the preemptions in 9 of those clusters will have been unnecessary.

I don't have a specific solution in mind, but how could we reduce this? Any ideas?

@adrienjt (Contributor)

Excellent question. I think there may be a way to coordinate preemptions by waiting in the PostFilter plugin in the candidate schedulers, and, in the proxy scheduler, requesting one of the candidate schedulers to proceed with preemption if and when all of them are waiting (in the Filter plugin or a different channel, not sure yet). The waiting-to-preempt status and preemption request could use annotations, like the rest of the algorithm.
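
For illustration, a rough sketch of that proxy-side coordination in Go; the annotation keys, values, and helper below are hypothetical placeholders, not part of the current codebase:

```go
package proxy

import corev1 "k8s.io/api/core/v1"

// Hypothetical annotation keys: the idea above only says annotations would be
// used, so these names are placeholders.
const (
	annotationWaitingToPreempt = "multicluster.admiralty.io/waiting-to-preempt"
	annotationAllowPreemption  = "multicluster.admiralty.io/allow-preemption"
)

// maybeGrantPreemption sketches the proxy-side coordination: once every
// candidate pod for a given proxy pod reports that it is waiting to preempt,
// pick one candidate and mark it as allowed to preempt. Persisting the patch
// to the target cluster is left out for brevity.
func maybeGrantPreemption(candidates []*corev1.Pod) *corev1.Pod {
	if len(candidates) == 0 {
		return nil
	}
	for _, c := range candidates {
		if c.Annotations[annotationWaitingToPreempt] != "true" {
			// At least one candidate might still fit without preempting,
			// so don't trigger preemption anywhere yet.
			return nil
		}
	}
	// Arbitrary deterministic choice; a real implementation could cycle
	// through candidates or pick the best-scoring target.
	chosen := candidates[0]
	if chosen.Annotations == nil {
		chosen.Annotations = map[string]string{}
	}
	chosen.Annotations[annotationAllowPreemption] = "true"
	return chosen
}
```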

Would you like to prototype this solution?

@everpeace (Author) commented Dec 21, 2020

> The waiting-to-preempt status and preemption request could use annotations, like the rest of the algorithm.

Ah, nice idea. But I can see one thing we need to discuss in this algorithm. As you may know, preemption is performed in the scheduling cycle, and the scheduling cycle is single-threaded. So, waiting inside the scheduling cycle would significantly reduce scheduling throughput (consider the case where cluster X is targeted by many source clusters and many high-priority pods are created in those source clusters).

@adrienjt (Contributor)

You're right. Here's another suggestion. Instead of waiting, the candidate scheduler PostFilter plugin could "succeed" to bypass preemption, unless the candidate pod is annotated to allow preemption; either way, the pod would be requeued for scheduling. The proxy scheduler would ensure that only one candidate pod for a given proxy pod is allowed to preempt (and would somehow cycle through them).
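
A rough sketch of that candidate-side PostFilter behavior, assuming the plugin is registered ahead of the default preemption plugin; the plugin name and annotation key are hypothetical, and the exact framework import path and signature depend on the Kubernetes version:

```go
package candidatepostfilter

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// Placeholder annotation key set by the proxy scheduler.
const annotationAllowPreemption = "multicluster.admiralty.io/allow-preemption"

// SkipPreemptionUnlessAllowed "succeeds" in PostFilter to short-circuit the
// default preemption plugin, unless the proxy scheduler has annotated this
// candidate pod to allow preemption. Either way, the pod is requeued.
type SkipPreemptionUnlessAllowed struct{}

var _ framework.PostFilterPlugin = &SkipPreemptionUnlessAllowed{}

func (p *SkipPreemptionUnlessAllowed) Name() string { return "SkipPreemptionUnlessAllowed" }

func (p *SkipPreemptionUnlessAllowed) PostFilter(ctx context.Context, state *framework.CycleState, pod *v1.Pod, _ framework.NodeToStatusMap) (*framework.PostFilterResult, *framework.Status) {
	if pod.Annotations[annotationAllowPreemption] == "true" {
		// Let the next PostFilter plugin (default preemption) run.
		return nil, framework.NewStatus(framework.Unschedulable, "preemption allowed by proxy scheduler")
	}
	// Returning Success skips the remaining PostFilter plugins, so no
	// preemption happens in this cluster for now.
	return nil, framework.NewStatus(framework.Success, "waiting for proxy scheduler to allow preemption")
}
```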

Another option, readily available but poorly documented, is to use the alternate scheduling algorithm, enabled by the multicluster.admiralty.io/no-reservation pod annotation. It bypasses the candidate scheduler altogether and selects a target cluster over multiple proxy pod scheduling cycles: if the first target cluster cannot schedule a candidate, another target is tried, and so on.

See also: https://github.com/admiraltyio/admiralty/blob/master/pkg/scheduler_plugins/proxy/plugin.go

I'd actually have to think more to see if my suggestion above is any better than the existing alternate algorithm, which was originally designed to work with custom third-party schedulers like Fargate, cf. "caution" box in documentation: https://admiralty.io/docs/concepts/scheduling
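
For reference, a minimal sketch of opting a pod into that alternate algorithm; only the multicluster.admiralty.io/no-reservation key comes from this thread, and the elect annotation and the empty annotation values are assumptions:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// newNoReservationPod returns a pod annotated to use the alternate scheduling
// algorithm. The elect annotation and the empty annotation values are
// assumptions; only the no-reservation key is named in this thread.
func newNoReservationPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: "my-workload",
			Annotations: map[string]string{
				"multicluster.admiralty.io/elect":          "",
				"multicluster.admiralty.io/no-reservation": "",
			},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{Name: "main", Image: "busybox"}},
		},
	}
}
```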

@everpeace (Author)

Thank you for the further suggestions and nice ideas. I think both of your plans would work to some extent without reducing scheduling throughput. I'm a little uneasy about scheduling latency, though, because both plans try to schedule candidate pods one by one.

Re: bypassing preemption in candidate schedulers and letting the proxy scheduler control which candidate pod is allowed to preempt

This idea avoids blocking the scheduling cycle and uses the binding cycle instead. It sounds nice to me overall. I would propose introducing a concurrency option in the proxy scheduler for preemption in target clusters, so as to improve the scheduling latency of proxy pods. With concurrency=N, the proxy scheduler would allow preemption for up to N candidate pods at the same time. That means concurrency and the amount of unnecessary preemption form a trade-off.
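
For illustration, a minimal sketch of such a limiter; the type and method names are made up for this example:

```go
package proxy

// preemptionLimiter caps how many candidate pods may be granted preemption at
// the same time (the proposed concurrency=N). N=1 minimizes unnecessary
// preemptions; a larger N reduces proxy-pod scheduling latency at the cost of
// more potentially wasted preemptions.
type preemptionLimiter struct {
	slots chan struct{}
}

func newPreemptionLimiter(n int) *preemptionLimiter {
	return &preemptionLimiter{slots: make(chan struct{}, n)}
}

// tryAcquire reports whether a preemption slot is free; if so, the caller
// would annotate one more candidate pod to allow preemption in its target.
func (l *preemptionLimiter) tryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot once the corresponding candidate is bound or gives up.
func (l *preemptionLimiter) release() {
	<-l.slots
}
```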

Re: bypassing the candidate scheduler

I didn't know this was already supported. This idea also uses the binding cycle: it selects target clusters one by one and similarly waits in the PreBind step. But I can see a difficult situation with this algorithm. If I understood correctly, it is the case where candidate schedulers and third-party schedulers are mixed, right? For example, source cluster X, target cluster Y where a candidate scheduler can run, and target cluster Z (e.g. Fargate) where a candidate scheduler cannot run. In this case, users must select the no-reservation algorithm even though a candidate scheduler could be used in target Y, right?

> I'd actually have to think more to see if my suggestion above is any better than the existing alternate algorithm

Thank you very much. I will also try to think of other ideas.

@adrienjt (Contributor)

> Re: bypassing preemption in candidate schedulers and letting the proxy scheduler control which candidate pod is allowed to preempt

> I would propose introducing a concurrency option in the proxy scheduler for preemption in target clusters, so as to improve the scheduling latency of proxy pods. With concurrency=N, the proxy scheduler would allow preemption for up to N candidate pods at the same time. That means concurrency and the amount of unnecessary preemption form a trade-off.

Great idea!

> Re: bypassing the candidate scheduler

> But I can see a difficult situation with this algorithm. If I understood correctly, it is the case where candidate schedulers and third-party schedulers are mixed, right? For example, source cluster X, target cluster Y where a candidate scheduler can run, and target cluster Z (e.g. Fargate) where a candidate scheduler cannot run. In this case, users must select the no-reservation algorithm even though a candidate scheduler could be used in target Y, right?

Right. You bring up a very good point. I wonder if no-reservation should actually be a target spec parameter instead of a pod annotation. Then, a hybrid algorithm would send candidates in the Filter step for targets with candidate schedulers; other targets would just pass Filter; and if a target without a candidate scheduler scored highest, the candidate would be created in the Reserve step.

Actually, no-reservation per target rather than per pod could be inferred via the ClusterSummaries (given by targets), rather than specified on Targets, and the feature would be transparent for users on the source side.
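
A rough sketch of how that hybrid branching could look; whether a target runs a candidate scheduler would be inferred from its ClusterSummary, and every name below is hypothetical:

```go
package proxy

// filterTarget sketches the hybrid Filter step: targets that run a candidate
// scheduler go through the normal reservation handshake, while targets
// without one (e.g. Fargate) simply pass Filter.
func filterTarget(hasCandidateScheduler bool, createCandidateAndWait func() bool) bool {
	if hasCandidateScheduler {
		return createCandidateAndWait() // default algorithm: reserve before scoring
	}
	return true // no-reservation path: decide later, in Reserve
}

// reserveTarget sketches the hybrid Reserve step: if the winning target has no
// candidate scheduler, the candidate pod is only created now.
func reserveTarget(hasCandidateScheduler bool, createCandidate func() error) error {
	if hasCandidateScheduler {
		return nil // candidate was already created and reserved during Filter
	}
	return createCandidate()
}
```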

@adrienjt added the enhancement (New feature or request) label on Sep 1, 2021