
Implement Gardener Scheduler #356

Closed · 10 tasks
rfranzke opened this issue Sep 6, 2018 · 10 comments · Fixed by #981

Assignees
Labels
area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/enhancement Enhancement, improvement, extension platform/all topology/garden Affects Garden clusters

Comments

rfranzke (Member) commented Sep 6, 2018

Every Shoot belongs to a Seed. When creating a new Shoot resource it is possible to specify onto which Seed cluster its control plane shall be deployed (.spec.cloud.seed). This field is similar to the .spec.nodeName field in the specification of Kubernetes native Pod resources.

In case the user does not specify the field (the normal case; the user does not/should not care about where the control plane is hosted and leaves it to Gardener), Gardener determines an appropriate Seed resource itself: it tries to match the .spec.cloud.region and .spec.cloud.profile of the Shoot to a Seed so that the control plane is hosted at the same cloud provider and in the same region as the Shoot cluster's worker nodes, in order to keep the network gap as small as possible.

Currently, this logic is executed in the SeedManager admission plugin inside the Gardener API server. In order to achieve a cleaner architecture we should remove this admission plugin and move its logic into a dedicated gardener-scheduler component/binary. The idea is that it will function similarly to the kube-scheduler (watching for Shoot resources and filling their .spec.cloud.seed field so that the gardener-controller-manager can start provisioning them).

Gardener's Seed and Shoot objects are very similar to the Kubernetes native Node and Pod objects, and we should leverage the advantages regarding scalability and better separation of concerns provided by the Kubernetes architecture. Generally, Gardener heavily relies on and implements the same concepts as Kubernetes itself, hence most Kubernetes approaches also apply to Gardener.

Currently, the SeedManager admission controller first identifies all possible seed candidates and then determines the best one out of this list; today, the "best" seed is the one with the minimal number of managed shoots.
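
For illustration, a minimal sketch of this candidate-then-best flow in Go (the types and names are simplified stand-ins, not the actual Gardener API types or the plugin code):

```go
// Illustrative, simplified types; the real Gardener API types differ.
type Seed struct{ Name, Profile, Region string }
type Shoot struct{ Profile, Region string }

// determineSeed sketches the current behaviour: filter seeds by cloud profile
// and region, then pick the candidate that manages the fewest shoots.
func determineSeed(shoot Shoot, seeds []Seed, shootsPerSeed map[string]int) *Seed {
	var candidates []Seed
	for _, s := range seeds {
		if s.Profile == shoot.Profile && s.Region == shoot.Region {
			candidates = append(candidates, s)
		}
	}
	var best *Seed
	for i := range candidates {
		if best == nil || shootsPerSeed[candidates[i].Name] < shootsPerSeed[best.Name] {
			best = &candidates[i]
		}
	}
	return best // nil means no matching seed was found
}
```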

Let's keep the latter behaviour (determining the best out of the candidates) for now, but let's add two strategies for determining the candidates:

  1. Same region strategy: The one used today: only consider seeds of the same cloud provider in the same region (this should be the default).

  2. Minimal distance strategy: Prefer seeds of the same cloud provider in the same region; if no seed in the specified region is available, find one in another region that is nearest to the shoot region.

For the "minimal distance" strategy we will probably need to maintain proper configuration for the scheduler so that it is able to understand the distances between regions.

@vlerenc proposed to first implement an even more simplified strategy in the current SeedManager admission plugin before starting to extract these parts into the gardener-scheduler binary. This is to enable end-users more quickly to also create shoots in other regions without having to specify .spec.cloud.seed in the Shoot manifest themselves. The proposal is:

If no seed is specified, compare the region names of all available seeds of the desired infrastructure and pick the one that matches lexicographically best (from left to right). E.g. if someone wants a cluster in AWS eu-north-1, we pick AWS eu-central-1, because at least the continent "eu-" matches (it works even better with region instances like AWS ap-southeast-1 and AWS ap-southeast-2). This works for Aliyun, AWS, and GCP. Azure has unfortunate (geographically non-hierarchical) region naming that doesn't start with the continent; we can't help this case, but at least the other three work.
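
A minimal sketch of such a prefix-based comparison (function names are illustrative; tie-breaking among equally good matches is left out):

```go
// commonPrefixLen returns the length of the common prefix of two region names.
func commonPrefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// bestLexicographicMatch picks the seed region that shares the longest prefix
// with the desired shoot region, e.g. "eu-north-1" -> "eu-central-1" when no
// seed exists in "eu-north-1" itself.
func bestLexicographicMatch(shootRegion string, seedRegions []string) string {
	best, bestLen := "", -1
	for _, r := range seedRegions {
		if l := commonPrefixLen(shootRegion, r); l > bestLen {
			best, bestLen = r, l
		}
	}
	return best
}
```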

Let's start with this approach before focusing on the gardener-scheduler binary itself. The to-be-used strategy should be configurable via admission configuration and default to the "same region" strategy.
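
A sketch of what such a configuration could look like (type, field, and constant names are assumptions, not a final API):

```go
// CandidateDeterminationStrategy selects how seed candidates are determined.
type CandidateDeterminationStrategy string

const (
	SameRegion      CandidateDeterminationStrategy = "SameRegion" // default
	MinimalDistance CandidateDeterminationStrategy = "MinimalDistance"
)

// SchedulerConfiguration is an illustrative admission/scheduler configuration.
type SchedulerConfiguration struct {
	// Strategy selects the candidate determination strategy for shoots.
	Strategy CandidateDeterminationStrategy
}
```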

Acceptance criteria

@rfranzke rfranzke added kind/enhancement Enhancement, improvement, extension topology/garden Affects Garden clusters status/accepted platform/all area/quality Output qualification (tests, checks, scans, automation in general, etc.) related labels Sep 6, 2018
vlerenc (Member) commented Sep 9, 2018

Nice!

vlerenc (Member) commented Oct 19, 2018

This may become more and more interesting for us as we are now beyond 100 nodes for one of the seeds in a tightly packed infrastructure/region. We have not yet implemented proper shoot cluster control plane auto-scaling beyond this number of nodes, so we either do the control plane auto-scaling, stick with large seed defaults, or have a scheduler and multiple seeds. The static defaults are easiest to achieve, but expensive and not in line with our strategy. Having multiple seeds (also for active-active DR) and control plane auto-scaling are things we want to have anyway.

@rfranzke rfranzke changed the title Move Seed determination logic into a dedicated Gardener Scheduler binary Implement Gardener Scheduler Mar 2, 2019
rfranzke (Member, Author) commented Mar 2, 2019

Issue description updated according to latest discussions
/cc @vlerenc (please modify if I missed something)

vlerenc (Member) commented Mar 2, 2019

Thanks @rfranzke. Just a minor implementation idea on the topic of "distance": a useful indicator of what is close or distant is latency, right? So maybe we can at first statically maintain an IaaS+region matrix (every IaaS+region to every other) with the corresponding latencies (on the diagonal, the same IaaS+region has a latency of 0 and will thereby always be preferred). Maybe the data can later be dynamically probed/averaged/extracted, so that nobody has to maintain that ugly matrix anymore.
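
A sketch of how such a statically maintained matrix could be looked up (the map layout and latency values are made up for illustration):

```go
// latencyMS is an illustrative latency matrix between IaaS+region pairs, in
// milliseconds. The diagonal is 0, so the same IaaS+region always wins.
var latencyMS = map[string]map[string]int{
	"aws/eu-north-1": {
		"aws/eu-north-1":     0,
		"aws/eu-central-1":   25,
		"aws/ap-southeast-1": 180,
	},
	// ... one row per IaaS+region
}

// closestRegion returns the candidate with the lowest known latency to the
// shoot's IaaS+region, or "" if no latency data is available.
func closestRegion(shootRegion string, candidates []string) string {
	best, bestLatency := "", int(^uint(0)>>1) // max int as sentinel
	for _, c := range candidates {
		if l, ok := latencyMS[shootRegion][c]; ok && l < bestLatency {
			best, bestLatency = c, l
		}
	}
	return best
}
```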

vlerenc (Member) commented Apr 3, 2019

Thank you for adding the simplified strategy (lexicographically compare regions). @grolu also contacted me and the Dashboard can most likely change the logic easily from regions-with-seeds to regions-in-cloud-profiles, which may be even easier/faster.

So, looking forward to having the simplified strategy. It will resolve a problem for all the users that either turn away because of limited seed region support in a landscape, or that can't determine the seed name themselves/programmatically (and if they do guess it, they can't do load balancing and tend to pick the first one, e.g. always aws-eu1, which is likewise not in the landscape's interest).

Thank you very much in advance.

danielfoehrKn (Contributor) commented Apr 11, 2019

As @rfranzke already pointed out, Node/Pod and Seed/Shoot are semantically similar: the kube-scheduler tries to find a Node for a Pod, and in our case the gardener-scheduler tries to find a Seed for a Shoot.
Hence I think it would be worthwhile to take a look at the way Kubernetes builds the kube-scheduler (and possibly adapt it for the gardener-scheduler).

The kube-scheduler defines predicates and priority functions to determine the "best node", i.e. the one with the highest score.
For each node the predicates are executed to check whether that node is feasible at all (the candidates).
If yes, the priority functions are executed and return a score.
In the end, the node with the highest sum of scores wins.

Predicates for kube-scheduler:
Priority functions for kube-scheduler

Adapted to our context:

  • we would make all the predicates and priority functions configurable in a config file (some mutually exclusive)

Predicates

  • SameRegionWithinCloudProviderPredicate(seed,shoot) bool

  • SameCloudProvider(seed,shoot) bool

PriorityFunctions
Priority functions return a HostPriority, where a host corresponds to a node:

```go
// HostPriority represents the priority of scheduling to a particular host, higher priority is better.
type HostPriority struct {
	// Name of the host
	Host string
	// Score associated with the host
	Score int
}
```

  • MinimalDistance(seed, shoot) hostPriority
  • ResourceLimits(seed, shoot) hostPriority --> gives a good score if the sum of the shoot cluster control plane's resource limits would "fit" on the seed cluster (similar to here, but for the whole seed cluster and not only for a single node)
  • ResourceAllocation(seed, shoot) hostPriority --> based on the actual allocation (similar to here, but for the whole seed cluster and not only for a single node)
  • ShootsDeployedAlready(allSeeds, allShoots, hostPriorities) hostPriorities --> gives a score to each seed based on how it compares in terms of how many shoots are already deployed on it (see the sketch right after this list)
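
A sketch of what a ShootsDeployedAlready-style priority function could look like, reusing the HostPriority type above (the 0-10 score range and the exact signature are assumptions):

```go
// shootsDeployedAlready gives seeds that host fewer shoots a higher score.
func shootsDeployedAlready(seedNames []string, shootsPerSeed map[string]int) []HostPriority {
	maxCount := 0
	for _, name := range seedNames {
		if c := shootsPerSeed[name]; c > maxCount {
			maxCount = c
		}
	}
	priorities := make([]HostPriority, 0, len(seedNames))
	for _, name := range seedNames {
		score := 10 // best score when no seed hosts any shoot yet
		if maxCount > 0 {
			score = 10 - (10*shootsPerSeed[name])/maxCount
		}
		priorities = append(priorities, HostPriority{Host: name, Score: score})
	}
	return priorities
}
```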

Example

So, for instance, if we want to achieve the equivalent of today's "SameRegion" strategy, we loop over all seed clusters and execute the sameCloudProvider predicate and then the sameRegion predicate to filter down to the feasible seeds. Then we execute the ResourceLimits and ShootsDeployedAlready priority functions. In the end, the seed with the highest sum of scores wins.

Or, if the configuration should match what is today called the "MinimalDistance" strategy, we execute the sameCloudProvider predicate and then the minimalDistance and shootsDeployedAlready priority functions.
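
A minimal sketch of this predicate-then-priority flow, reusing the illustrative Seed/Shoot/HostPriority types from the sketches above (the Predicate/PriorityFunc signatures are assumptions, not the kube-scheduler's actual interfaces):

```go
// Illustrative function signatures for predicates and priority functions.
type Predicate func(seed Seed, shoot Shoot) bool
type PriorityFunc func(seed Seed, shoot Shoot) HostPriority

// schedule filters seeds with all predicates, scores the remaining candidates
// with all priority functions, and returns the seed with the highest total score.
func schedule(shoot Shoot, seeds []Seed, predicates []Predicate, priorities []PriorityFunc) *Seed {
	var best *Seed
	bestScore := -1
	for i := range seeds {
		feasible := true
		for _, p := range predicates {
			if !p(seeds[i], shoot) {
				feasible = false
				break
			}
		}
		if !feasible {
			continue
		}
		total := 0
		for _, prio := range priorities {
			total += prio(seeds[i], shoot).Score
		}
		if total > bestScore {
			best, bestScore = &seeds[i], total
		}
	}
	return best // nil if no seed passed all predicates
}
```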

Possible advantages

  • Cleaner separation/decoupling of the scheduler logic -> a "strategy" as of today, e.g. MinimalDistance, does a bunch of things and is not atomic: it first checks if there is a seed with the same cloud provider, if not it takes the seeds with minimal distance, and then it finds the seed with the least shoots.
  • The scheduler logic would be configurable with finer granularity (for instance, even defining the score of a certain priority function to be more important).
  • Having a score per priority function vs. currently only one way to find the "best seed" -> what do we do in case we want an additional way of choosing the "best seed"? What if we want to compose them?
  • Possible extension points where users could write their own gardener-scheduler logic and register a predicate/priority function -> I have to check whether and how the kube-scheduler is doing that.

What do you think about the general concept? @rfranzke @vlerenc

zanetworker (Contributor) commented Apr 12, 2019

Actually, I like the idea. It would even be nice to add weights to the priority functions, to add a flavour of a favourable seed to our scheduling. For example, assuming three seeds have already become eligible for scoring, we could do the following:

```
SeedOneScore   = MinimalDistance * w1 + ResourceLimits * w2 + ShootsDeployedAlready * w3
SeedTwoScore   = MinimalDistance * w1 + ResourceLimits * w2 + ShootsDeployedAlready * w3
SeedThreeScore = ...

ElectedSeedScore = Max(SeedOneScore, SeedTwoScore, SeedThreeScore, ...)

Schedule(ElectedSeed)
```

The seed with the highest weighted score gets the shoot.

rfranzke (Member, Author) commented

Thanks @danielfoehrKn, sounds good. The overall concept (building the gardener-scheduler based on the principles of the kube-scheduler) definitely makes sense, as we can probably benefit a lot from the experience/knowledge built into the kube-scheduler. However, that's a lot that needs to be done, so let's try to do it in small steps. The kube-scheduler was also built gradually, so let's first keep our logic as it is today and move everything into a scheduler binary. After that we can define the next step towards your suggestion. It's good to have this plan/goal, but we need to split it into small parts and get it in step by step.

vlerenc (Member) commented Apr 13, 2019

If I may, I'd like to add a proposal to the mix of priority functions: seed anti-affinity (like node or hypervisor anti-affinity). People will start to have multiple clusters, maybe because of load considerations, maybe for HA reasons. If those are in the same region, it makes sense to have them spread across the seeds; that way the clusters are also not affected at the same time when a seed is rolled. I don't think we should already expose a new API for that, but maybe we can start small and softly prefer placement of shoots from the same project across as many seeds as possible. E.g. if project foo has 6 clusters while the landscape in that region has 3 seeds, let's prefer placing 2 shoots in each of the 3 seeds.
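
A sketch of how such a soft project-spread preference could plug in as one more priority function, reusing the HostPriority type above (signature, the projectShoots data structure, and the 0-10 range are assumptions):

```go
// projectSpreadScore implements a soft seed anti-affinity: the more shoots of
// the same project a seed already hosts, the lower its score.
func projectSpreadScore(seedName, project string, projectShoots map[string]map[string]int) HostPriority {
	count := projectShoots[project][seedName] // shoots of this project already on this seed
	score := 10 - count
	if score < 0 {
		score = 0
	}
	return HostPriority{Host: seedName, Score: score}
}
```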

vlerenc (Member) commented Mar 7, 2023

In addition to #356 (comment), people also already have multiple clusters in multiple regions (more ambitious HA) and want to make sure that their control planes are not colocated in the same region should there be a regional outage. This, too, cannot be expressed today, so people with shoots e.g. in the Sydney region and in the Melbourne region that act as fail-overs for each other may end up having their control planes in one or many seeds in either the Sydney or the Melbourne region, if that's where the seeds are.

Possibly we should take some time to think that through also with #2874 in mind, which is similarly about expressing scheduling hints/affinities/tolerations. WDYT?

ialidzhikov pushed a commit to ialidzhikov/gardener that referenced this issue Aug 1, 2024