Implement Gardener Scheduler #356
Comments
Nice!
This may become more and more interesting for us, as we are now beyond 100 nodes for one of the seeds in a tightly packed infrastructure/region. So far we have not implemented proper shoot cluster control plane auto-scaling beyond this number of nodes, so we either do the control plane auto-scaling, stick with large seed defaults, or have a scheduler and multiple seeds. The static defaults are easiest to achieve, but expensive and not in line with our strategy. Having multiple seeds (also for active-active DR) and control plane auto-scaling are things that we want anyway.
Issue description updated according to latest discussions.
Thanks @rfranzke. Just a minor implementation idea on the topic of "distance": a useful indicator of what is close or distant is latency, right? So maybe we can at first statically maintain an IaaS+region matrix (every IaaS+region to every other) and their corresponding latencies among themselves (on the diagonal, the same IaaS+region has a latency of 0 and will thereby always be preferred). Maybe the data can later be dynamically probed/averaged/extracted, so that nobody has to maintain that ugly matrix anymore.
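A minimal sketch of what such a statically maintained latency matrix could look like; all region names, latency values, and the function name `nearestSeedRegion` are hypothetical, purely for illustration:

```go
package main

import "fmt"

// latencyMS is a hypothetical, statically maintained matrix of average
// round-trip latencies (in milliseconds) between IaaS+region pairs.
// The diagonal is 0, so a seed in the same IaaS+region always wins.
var latencyMS = map[string]map[string]int{
	"aws-eu-west-1":    {"aws-eu-west-1": 0, "aws-eu-central-1": 25, "gcp-europe-west3": 30},
	"aws-eu-central-1": {"aws-eu-west-1": 25, "aws-eu-central-1": 0, "gcp-europe-west3": 8},
	"gcp-europe-west3": {"aws-eu-west-1": 30, "aws-eu-central-1": 8, "gcp-europe-west3": 0},
}

// nearestSeedRegion returns the candidate seed region with the lowest
// latency to the shoot's region, or "" if none is known.
func nearestSeedRegion(shootRegion string, seedRegions []string) string {
	best, bestLatency := "", -1
	for _, r := range seedRegions {
		l, ok := latencyMS[shootRegion][r]
		if !ok {
			continue
		}
		if bestLatency < 0 || l < bestLatency {
			best, bestLatency = r, l
		}
	}
	return best
}

func main() {
	// The nearest seed region to aws-eu-central-1 among the candidates:
	fmt.Println(nearestSeedRegion("aws-eu-central-1", []string{"aws-eu-west-1", "gcp-europe-west3"}))
}
```

If the matrix is later probed dynamically, only the contents of `latencyMS` would change; the lookup stays the same.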
Thank you for adding the simplified strategy (lexicographically compare regions). @grolu also contacted me, and the Dashboard can most likely change the logic easily from regions-with-seeds to regions-in-cloud-profiles, which may be even easier/faster. So, looking forward to having the simplified strategy. That will resolve a problem for all the users that either turn away because of limited seed region support in a landscape, or that can't determine the seed name themselves/programmatically (and if they do/guess it, they can't do load balancing and tend to pick the first one, e.g. always aws-eu1, which is likewise not in the landscape's interest). Thank you very much in advance.
As @rfranzke already pointed out, Node/Pod and Seed/Shoot are semantically similar: the kube-scheduler tries to find a Node for a Pod, and in our case the gardener-scheduler tries to find a Seed for a Shoot. The kube-scheduler defines predicates and priority functions to determine the "best node", defined by the highest score. Adapted to our context, we would also define:

Predicates
Priority functions (`// HostPriority represents the priority of scheduling to a particular host, higher priority is better.`)

Example: if we want to achieve the equivalent of the "SameRegion" strategy, we loop over all seed clusters and execute the sameCloudProvider predicate, then the sameRegion predicate, to filter out the infeasible seeds. Then we would execute the ResourceLimit and ShootsDeployedAlready priority functions. In the end, the seed with the highest sum of scores wins. Or, if the configuration should be what you call today the "MinimalDistance" strategy: execute the sameCloudProvider predicate and then the minimalDistance and shootsDeployedAlready priority functions.

Possible advantages
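The predicate/priority flow described above could be sketched roughly as follows; the type and function names (`Predicate`, `PriorityFunc`, `shootsDeployedAlready`, the seed fields) are illustrative assumptions, not the actual Gardener API:

```go
package main

import "fmt"

// Seed and Shoot carry just enough fields for the sketch.
type Seed struct {
	Name, Provider, Region string
	Shoots, Capacity       int
}

type Shoot struct{ Provider, Region string }

// Predicates filter seeds; priority functions score the survivors
// (higher is better), mirroring the kube-scheduler design.
type Predicate func(Seed, Shoot) bool
type PriorityFunc func(Seed, Shoot) int

func sameCloudProvider(s Seed, sh Shoot) bool { return s.Provider == sh.Provider }
func sameRegion(s Seed, sh Shoot) bool        { return s.Region == sh.Region }

// shootsDeployedAlready favours seeds with more spare capacity.
func shootsDeployedAlready(s Seed, _ Shoot) int { return s.Capacity - s.Shoots }

// schedule runs all predicates, sums the priority scores of the feasible
// seeds, and returns the seed with the highest total score.
func schedule(seeds []Seed, sh Shoot, preds []Predicate, prios []PriorityFunc) (Seed, bool) {
	var best Seed
	bestScore, found := -1, false
candidates:
	for _, s := range seeds {
		for _, p := range preds {
			if !p(s, sh) {
				continue candidates // seed filtered out by a predicate
			}
		}
		score := 0
		for _, f := range prios {
			score += f(s, sh)
		}
		if score > bestScore {
			best, bestScore, found = s, score, true
		}
	}
	return best, found
}

func main() {
	seeds := []Seed{
		{"aws-eu1", "aws", "eu-west-1", 90, 100},
		{"aws-eu2", "aws", "eu-west-1", 40, 100},
		{"gcp-eu1", "gcp", "europe-west3", 10, 100},
	}
	winner, _ := schedule(seeds, Shoot{"aws", "eu-west-1"},
		[]Predicate{sameCloudProvider, sameRegion},
		[]PriorityFunc{shootsDeployedAlready})
	fmt.Println(winner.Name) // aws-eu2: feasible and with the most spare capacity
}
```

Swapping `sameRegion` for a hypothetical `minimalDistance` priority function would give the "MinimalDistance"-style configuration without touching the scheduling loop.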
What do you think about the general concept? @rfranzke @vlerenc
Actually, I like the idea. It would even be nice to add weights to the priority functions, to add a flavour of a favourable seed to our scheduling. For example, assuming three seeds have already become eligible for the scoring election, we can do the following:
The seed with the highest weighted score gets the shoot.
Thanks @danielfoehrKn, sounds good. The overall concept (building the gardener-scheduler based on the principles of the kube-scheduler) definitely makes sense, as we can probably benefit a lot from the experience/knowledge built into the kube-scheduler. However, that's a lot that needs to be done, so let's try to do it in small steps. The kube-scheduler was also built gradually, so let's first keep our logic as it is today and move everything into a scheduler binary. After that we can define the next step towards your suggestion. It's good to have this plan/goal, but we need to split it into small parts and get it in step by step.
If I may, I'd like to add a proposal into the mix of priority functions: seed anti-affinity (like node or hypervisor anti-affinity). People will start to have multiple clusters, maybe because of load considerations, maybe for HA reasons. If they are in the same region, it makes sense to have them spread across the seeds. Thereby the clusters are also not affected at the same time when the seed is rolled. I don't think we should already expose a new API for that, but maybe we can start small and softly prefer placement of shoots from the same project across as many seeds as possible. E.g. if project foo has 6 clusters while the landscape in that region has 3 seeds, let's prefer placing 2 shoots in each of the 3 seeds.
In addition to #356 (comment), people also already have multiple clusters in multiple regions (more ambitious HA) and want to make sure that their control planes are not colocated in the same region, should there be a regional outage. This, too, cannot be expressed today, so people with shoots e.g. in the Sydney region and in the Melbourne region that act as fail-overs for each other may end up having their control planes in one or more seeds in either the Sydney or the Melbourne region, if that's where the seeds are. Possibly we should take some time to think that through, also with #2874 in mind, which is similarly about expressing scheduling hints/affinities/tolerations. WDYT?
Every `Shoot` belongs to a `Seed`. When creating a new `Shoot` resource it is possible to specify onto which `Seed` cluster its control plane shall be deployed (`.spec.cloud.seed`). This field is similar to the `.spec.nodeName` field in the specification of Kubernetes-native `Pod` resources. In case the user does not specify the field (the normal case; the user does not/should not care about where the control plane is hosted and leaves it to Gardener), an appropriate `Seed` resource is determined by Gardener itself: it tries to match the `.spec.cloud.region` and `.spec.cloud.profile` of the `Shoot` to a `Seed`, so that the control plane gets hosted at the same cloud provider and in the same region as the `Shoot` cluster's worker nodes, in order to keep the network gap as small as possible.

Currently, this logic is executed in the SeedManager admission plugin inside the Gardener API server. In order to achieve a cleaner architecture we should remove this admission plugin and move its logic into a dedicated `gardener-scheduler` component/binary. The idea is that it will function similarly to the `kube-scheduler`: it watches for `Shoot` resources and fills their `.spec.cloud.seed` field so that the `gardener-controller-manager` can start provisioning them.

Compared to the Kubernetes-native `Node` and `Pod` objects, Gardener's `Seed` and `Shoot` objects are very similar, and we should leverage the advantages regarding scalability and better separation of concerns provided by the Kubernetes architecture. Generally, Gardener heavily relies on and implements the same concepts as Kubernetes itself; hence, most Kubernetes approaches also apply to Gardener.

Currently, the `SeedManager` admission controller first identifies all possible seed candidates and then determines the best one out of this list; the "best" is the seed with the minimal number of managed shoots.

Let's keep the latter behaviour (determining the best out of the candidates) for now, but let's add two strategies for determining the candidates:
- Same region strategy (should be the default): the one used today; only consider seeds of the same cloud provider in the same region.
- Minimal distance strategy: only consider seeds of the same cloud provider, preferring those in the same region. If no seed in the specified region is available, find one in another region that is nearest to the shoot region.

For the "minimal distance" strategy we will probably need to maintain proper configuration for the scheduler so that it is able to understand the distances between regions.
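The two candidate-determination strategies above could be sketched as follows; the strategy names, the `candidates` function, and especially the dummy `distance` metric are illustrative assumptions (in practice the distance would come from the scheduler's region-distance configuration):

```go
package main

import "fmt"

type Seed struct{ Name, Provider, Region string }

// distance is a placeholder for the scheduler's region-distance
// configuration; real distances would be maintained per region pair.
func distance(a, b string) int {
	if a == b {
		return 0
	}
	return len(a) + len(b) // dummy metric for illustration only
}

// candidates determines the schedulable seeds for a shoot. "SameRegion"
// keeps only seeds of the same provider and region; "MinimalDistance"
// additionally falls back to the nearest region of the same provider
// when the shoot's region has no seed.
func candidates(seeds []Seed, provider, region, strategy string) []Seed {
	var same []Seed
	for _, s := range seeds {
		if s.Provider == provider && s.Region == region {
			same = append(same, s)
		}
	}
	if len(same) > 0 || strategy == "SameRegion" {
		return same
	}
	// MinimalDistance fallback: seeds in the nearest region of the provider.
	bestDist := -1
	var nearest []Seed
	for _, s := range seeds {
		if s.Provider != provider {
			continue
		}
		d := distance(region, s.Region)
		switch {
		case bestDist < 0 || d < bestDist:
			bestDist, nearest = d, []Seed{s}
		case d == bestDist:
			nearest = append(nearest, s)
		}
	}
	return nearest
}

func main() {
	seeds := []Seed{{"s1", "aws", "eu-west-1"}, {"s2", "aws", "eu-central-1"}}
	// A shoot in a region without seeds: "SameRegion" yields no candidates,
	// while "MinimalDistance" falls back to the nearest same-provider region.
	fmt.Println(len(candidates(seeds, "aws", "ap-southeast-2", "SameRegion")))
	fmt.Println(len(candidates(seeds, "aws", "ap-southeast-2", "MinimalDistance")))
}
```

Because the "best of the candidates" selection stays unchanged, only this candidate step differs between the two strategies.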
@vlerenc proposed to first implement an even more simplified strategy in the current `SeedManager` admission plugin before starting to extract these parts into the `gardener-scheduler` binary. This would enable end-users sooner to also use shoots in other regions without needing to specify `.spec.cloud.seed` in the `Shoot` manifest themselves. The proposal is:

Let's start with this approach before focusing on the `gardener-scheduler` binary itself. The to-be-used strategy should be configurable via admission configuration and default to the "same region" strategy.

Acceptance criteria
- New component in `pkg/scheduler`.
- Own configuration API in `pkg/scheduler/apis/config` (similar to the one for the controller-manager), together with code generation.
- Example configuration in the `example` directory (similar to the example configuration of the controller-manager).
- Integration into the `gardener` Helm chart.
- Integration into the CI (`refs/meta/ci` branch) and the `Dockerfile`.
- `start-scheduler` rule for starting the scheduler in the `Makefile`.
- Documentation in `docs/concepts/scheduler.md`.