
Better build scheduling / load distribution #3695

Open
ddadlani opened this issue Apr 5, 2019 · 31 comments

@ddadlani (Contributor) commented Apr 5, 2019

There have been multiple issues (#2577, #2928, etc.) opened regarding Concourse's build/container scheduling, and the container placement strategies currently available (random, volume-locality and fewest-build-containers) are not able to distribute the workload equitably because

  1. they are not aware of the actual load present on the workers (number of volumes and/or build containers are loose approximations)
  2. there is no queueing logic, meaning that even if all the workers are busy, any new work will still be scheduled immediately, which can cause workers to fall over.

This is a meta-issue to discuss all the use cases and possible implementations for better work scheduling in Concourse.

@qoomon commented Apr 8, 2019

A queue on concourse master would be awesome.

  • Workers could then pull as many jobs as configured.
  • Scaling would be much easier to implement: just monitor queue size and scale accordingly.
    • Off-topic: it would be great to have an option to shut down workers after a specific number of job runs.
@ddadlani (Contributor, Author) commented May 2, 2019

While investigating a btrfs issue on one of our workers, we came across this thread, which suggests that scaling to >1000 volumes/subvolumes at a time is known to cause various issues. We might be able to mitigate these issues with a queue, since we could avoid scheduling more than ~1000 volumes on a worker.

@kcmannem (Member) commented May 3, 2019

Thoughts on the scheduler (not addressing the queue problem with this):

We wrote placement strategies based on singular requirements, whether it's to pick a worker with the most of the required resources already local or one that has the fewest builds on it. We've known that no single strategy is best suited to every situation, as tasks come in many shapes and sizes. In most situations we want to make placement decisions based on many requirements, and these requirements vary based on the tasks users run and environmental state. Examples:

  • Don't pick workers at max containers/volumes
  • Don't pick workers in an error state for retries
  • Don't pick workers with high load
  • Don't pick a recently chosen worker

Some of these requirements have been turned into strategies; some are not possible in our current flow. The main problem is that we can't merge them. We shouldn't look at worker placement as definitive, but rather as providing options. Each strategy would then return a list of workers that meet the strategy's criteria (filtering out the ones that don't, so they are not picked). This gives us the ability to stack and reorder these requirements in any way we want.

If our strategy interface looked like this:
[]worker -> Strategy() -> []worker
then this would be possible with our current strategies (let's ignore their correctness for now):
[]worker -> fewestBuilds -> volumeLocality -> rand -> []worker
This would first pick the collection of workers with build containers under a certain threshold, then pick the workers with the most of the task's resources available locally, then pick a random worker out of that set.

This interface allows us and the community to build out strategies needed for far more scenarios. From the examples listed above, we could layer filters such that we pick a worker based on all the requirements.
[]worker -> ContainersUnder(200) -> LoadAverageUnder(30) -> rand -> []worker

This also allows our scheduler to be dynamic, removing requirements when they're not needed. This is beneficial because we can insert a filter to exclude a worker in an error state when something goes wrong and then remove that filter when the worker comes back healthy. (This wouldn't even have to be done by runtime; it could be something done by healthcheck probes #3025.) The previous example becomes the following if worker_1 isn't healthy or shouldn't be used for a retry:
[]worker -> ExcludeWorker(worker_1) -> ContainersUnder(200) -> LoadAverageUnder(30) -> rand -> []worker

cc @ddadlani @topherbullock
thoughts? @marco-m @jchesterpivotal

Update:

I realized this can also help unify the Satisfies() step we run for each worker selection, which is used for picking a worker with the right team, tag, and base resource type available.

func (worker *gardenWorker) Satisfies(logger lager.Logger, spec WorkerSpec) bool {

Converting this to return []Worker means it can be stacked at the top of the selection process.
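
To make the shape of this concrete, here is a minimal sketch in Go of what stackable filters could look like; all names and types below are illustrative stand-ins rather than Concourse's actual interfaces, with Satisfying() playing the role of a pool-level Satisfies():

package placement

import "math/rand"

// Worker and WorkerSpec are illustrative stand-ins, not Concourse's real types.
type Worker struct {
	Name             string
	Team             string
	ActiveContainers int
	LoadAverage      float64
}

type WorkerSpec struct {
	TeamName string
}

// Filter narrows a pool of candidates: []worker -> Filter -> []worker.
type Filter func([]Worker) []Worker

// Chain applies filters left to right, mirroring
// []worker -> ContainersUnder(200) -> LoadAverageUnder(30) -> rand -> []worker.
func Chain(filters ...Filter) Filter {
	return func(pool []Worker) []Worker {
		for _, f := range filters {
			pool = f(pool)
		}
		return pool
	}
}

// keep returns the subset of the pool satisfying the predicate.
func keep(pool []Worker, pred func(Worker) bool) []Worker {
	var out []Worker
	for _, w := range pool {
		if pred(w) {
			out = append(out, w)
		}
	}
	return out
}

func ContainersUnder(limit int) Filter {
	return func(pool []Worker) []Worker {
		return keep(pool, func(w Worker) bool { return w.ActiveContainers < limit })
	}
}

func LoadAverageUnder(limit float64) Filter {
	return func(pool []Worker) []Worker {
		return keep(pool, func(w Worker) bool { return w.LoadAverage < limit })
	}
}

func ExcludeWorker(name string) Filter {
	return func(pool []Worker) []Worker {
		return keep(pool, func(w Worker) bool { return w.Name != name })
	}
}

// Satisfying is Satisfies() reshaped as a pool-level filter (team check only,
// for brevity; tags and base resource types would filter the same way).
func Satisfying(spec WorkerSpec) Filter {
	return func(pool []Worker) []Worker {
		return keep(pool, func(w Worker) bool { return w.Team == "" || w.Team == spec.TeamName })
	}
}

// Rand reduces the surviving candidates to a single pick.
func Rand(pool []Worker) []Worker {
	if len(pool) == 0 {
		return nil
	}
	return []Worker{pool[rand.Intn(len(pool))]}
}

// Usage: pick := Chain(Satisfying(spec), ExcludeWorker("worker_1"), ContainersUnder(200), LoadAverageUnder(30), Rand)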

@topherbullock (Member) commented May 6, 2019

@kcmannem I like this idea

I think for this redesign it would make sense to describe each individual selection mechanism as "filters", "criteria" or some similar name, as this starts to look less like a strategy pattern and more like a filter/criteria pattern. The 'strategy' part then just becomes the cascading result of a set of these filters applied to the available worker pool to eventually choose a worker. As you pointed out, this would also clarify some other worker selection logic which currently sits outside of the "strategy"; e.g. team-scoped workers, tagged workers, etc.

It would be interesting to see the impact of being able to configure the precedence of each filter globally or per job, and it would probably be much cleaner in the code to compose these sets of filters together into a "strategy" for selecting a single worker.

@jchesterpivotal (Contributor) commented May 6, 2019

Well there's my usual reservation about continuing down the path of building a general-purpose scheduler.

In terms of the filtering/criteria concept, I'd mostly be concerned with ensuring that you push the actual execution down into the database, meaning this would effectively be a skin over the query builder library. If each of the criteria is applied in the ATC, you will see a drastic reduction in performance compared to building a single query that is sent to the database. Databases are really, really good at working out how to efficiently execute set-theoretic projection problems.

In particular, letting the database do it makes the question of efficient ordering moot. The query planner can typically guess what ordering of criteria will be most efficient for the current data, whereas a human-configured ordering will inevitably degrade over time as the data's cardinality and statistical profile evolves.
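
As a rough sketch of that point (the table and column names below are invented, not Concourse's actual schema), the stacked criteria could collapse into a single query so the planner, not the ATC, decides the evaluation order:

package placement

import "database/sql"

// pickWorker pushes all the stacked criteria into one query; the database's
// planner chooses how to evaluate them, and even the final random pick
// happens in the database rather than in the ATC.
func pickWorker(db *sql.DB, team string, maxContainers int, maxLoad float64) (string, error) {
	var name string
	err := db.QueryRow(`
		SELECT name
		FROM workers
		WHERE (team IS NULL OR team = $1)
		  AND active_containers < $2
		  AND load_average < $3
		ORDER BY random()
		LIMIT 1`,
		team, maxContainers, maxLoad).Scan(&name)
	return name, err
}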

@topherbullock (Member) commented May 6, 2019

@jchesterpivotal in the current absence of the ability to use an existing scheduler (concourse/rfcs#22), and the need to improve and maintain support for our existing Garden + Baggageclaim worker pools, I consider it a worthwhile investment.

Good point on ensuring we push this down to the database. Regarding the ordering of each filter: my point about allowing users to define the precedence is for the scheduler to select a more suitable candidate for workloads which would benefit more from one criterion than another; e.g. if my single-task job pulls in a lot of inputs, volume locality should take precedence.

@vito moved this from Icebox to Backlog in Runtime on May 6, 2019

@kcmannem (Member) commented May 6, 2019

@topherbullock that's what I was hoping this would allow. Concourse use cases are so wide-ranging that it's almost impossible to optimize for all scenarios, but if a user's task needs fine-tuning, they can do it themselves (i.e. making sure builds end up on separate workers). I didn't mention it as it would distract from what I wanted to convey.

@marco-m (Collaborator) commented May 7, 2019

@kcmannem (and all the others who chimed in :-) your proposed approach makes sense to me.

As a Concourse operator who really needs to control how much gets dropped on each worker :-), I would also suggest sketching how Concourse would be configured / customized by an operator to take advantage of this proposal.

@kcmannem (Member) commented May 7, 2019

@pivotal-jwinters brought up a good point: we should be careful about which filters we expose to the user. It makes no sense for them to specify filters tied to max containers or load average (fluctuations in cluster state), but we might allow filters that are "characteristic" of a task. For example, an anti-affinity filter should apply regardless of system state, similar to tagging a task.

@qoomon commented May 7, 2019

I still think the best approach would be to have the workers pull their work, just like GitLab CI workers do. Concourse just needs to provide a queue.

@jchesterpivotal (Contributor) commented May 9, 2019

Worker-pull would definitely be easier in a lot of ways. If that's an option I'd suggest taking it.

@marco-m (Collaborator) commented May 9, 2019

The problem I see with worker pull is that in Concourse not all workers are created equal. For example, a worker could have a tag, or a task could have a tag. The ATC cannot just give a task to the first worker asking for it; it still has to go through all the ready-to-run tasks and find one that matches the worker. I suspect this (the tag) is not the only case where the ATC would have to walk the pseudo-queue of ready-to-run tasks. Also, it is not clear to me how worker pull would solve the problem of controlling the number of tasks landing on a given worker.

@jchesterpivotal (Contributor) commented May 9, 2019

Essentially it would be a "reactive" architecture: the worker says "I can accept 3 builds". If it's tagged, it says "I can accept 3 jobs, including tag-foo jobs". The key is that the worker itself has the best possible knowledge of whether it is able to accept more work.

In terms of matching, you can think of there being multiple queues, defined in various ways. When the ATC receives a "request 3" or "bid 3", it decides whether that worker is the best available option from other bids it has in hand.
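
A hedged sketch of what that exchange could look like (every type and field below is invented for illustration): workers advertise capacity "bids", and the ATC matches queued builds against the bids it currently holds:

package scheduling

// Bid is a worker's advertised capacity: "I can accept 3 builds, including tag-foo jobs".
type Bid struct {
	Worker string
	Slots  int
	Tags   []string
}

// Build is a queued piece of work; an empty Tag means untagged.
type Build struct {
	ID  int
	Tag string
}

// match hands each queued build to the first bid that still has a free slot
// and (if the build is tagged) advertises that tag.
func match(queue []Build, bids []Bid) map[int]string {
	placed := map[int]string{}
	for _, b := range queue {
		for i := range bids {
			if bids[i].Slots == 0 {
				continue
			}
			if b.Tag != "" && !contains(bids[i].Tags, b.Tag) {
				continue
			}
			bids[i].Slots--
			placed[b.ID] = bids[i].Worker
			break
		}
	}
	return placed
}

func contains(xs []string, x string) bool {
	for _, v := range xs {
		if v == x {
			return true
		}
	}
	return false
}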

@marco-m (Collaborator) commented May 9, 2019

@jchesterpivotal I agree with your sketch of the implementation; this is actually what I meant: the complexity becomes very close to the current one, where the ATC knows everything.

The key is that the worker itself has the best possible knowledge of whether it is able to accept more work.

Well, yes and no. To have the best possible knowledge, it must either be configured directly or receive configuration from the ATC, because in the end it is the operator who decides. At that point, the worker could just as well report its load status to the ATC and let the ATC choose.

Again, the point I want to make is that worker pull is simpler on the surface, while to me it is as complicated as ATC push, unless I am missing something :-D

@pn-santos commented May 10, 2019

Just brainstorming here, but I like the idea of workers pulling/accepting workloads. Putting aside whether this would be technically feasible, the best system I can think of would be an event-driven one, something like:

  1. ATC "produces" workload events/requests
  2. Workers are listening for these events
  3. A worker uses a decision engine (based on its current state and a particular policy/config) to determine whether it can pick up a workload or not
  4. Once a worker does pick up the workload it signals that to the ATC

The ATC would be the one "pushing" policies/configs to workers and just recording in the db whenever a particular worker picks up an available workload.

Maybe this is unfeasible, but it feels like a much more natural way of assigning work than having an all-knowing ATC.
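
Purely as a sketch of that flow, and with every name below invented (nothing like this exists in Concourse today), the worker side of steps 2-4 might look roughly like:

package worker

// WorkloadEvent is what the ATC would "produce" in step 1.
type WorkloadEvent struct {
	BuildID             int
	Tags                []string
	EstimatedContainers int // rough resource hint the decision engine could weigh
}

// Policy is the config the ATC would push to workers.
type Policy struct {
	MaxContainers int
	MaxLoad       float64
}

// WorkerState is the worker's view of its own load.
type WorkerState struct {
	Containers int
	Load       float64
}

// canAccept is the "decision engine" of step 3: local state plus pushed policy.
func canAccept(ev WorkloadEvent, st WorkerState, p Policy) bool {
	return st.Containers+ev.EstimatedContainers <= p.MaxContainers && st.Load < p.MaxLoad
}

// listen covers steps 2 and 4: watch for events and claim the ones we can run.
// claim signals the ATC and returns false if another worker got there first.
func listen(events <-chan WorkloadEvent, state func() WorkerState, p Policy, claim func(int) bool) {
	for ev := range events {
		if canAccept(ev, state(), p) && claim(ev.BuildID) {
			go run(ev)
		}
	}
}

func run(ev WorkloadEvent) { /* execute the build steps */ }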

@marco-m (Collaborator) commented May 11, 2019

Another approach is to recognize that scheduling is not something that a CI/CD system should do itself, and embrace Nomad (https://www.nomadproject.io/). In the past there was already an attempt to integrate Nomad, see #2037 (please go through the thread, it is full of gems). Why Nomad? Because it handles containers and non-containers, so it would match perfectly with Concourse, which runs not only on Linux but also on Windows and Mac, and because it is way simpler than k8s and friends.

This presentation, by the CTO of CircleCI (similar to Travis, they do run at scale!), has a title that says it all: https://www.hashicorp.com/resources/nomad-vault-circleci-security-scheduling "Nomad and Vault at CircleCI: Security and Scheduling are Not Your Core Competencies".

But, as a Concourse operator with workloads that regularly bring any worker to its knees and make all my developers unhappy, I would suggest something incremental. This is why, before this ticket, we had for example #2928, which is way simpler and not enough on its own, but incremental and, I think, would have allowed any worker to survive :-)

@smgoller commented May 15, 2019

We run a number of pipelines that are the same but run against different branches of the codebase. There are a couple of jobs that are extremely resource-intensive, to the point that we only want to run one of these types of jobs at a time on a given worker, but since the jobs are in different pipelines, scheduling is weird. It would be nice to have some way to tag jobs and then say "run no more than X jobs with this tag on a given worker".

@gerhard commented May 16, 2019

Team RabbitMQ has been struggling with Concourse for over a year now, primarily because of the build scheduling.

We are pinning pipelines to workers, effectively constraining jobs & resources to run on specific workers. We must do this because when too many builds are triggered, across many pipelines, builds fail in ways that are only possible when CPU is contended. The only thing that worked so far was over-provisioning CPU and manually defining which tasks & resources should run on which workers. We are choosing to waste hundreds of vCPUs so that we don't waste time troubleshooting problems that only exist because of poor build scheduling.

We would love to be able to define CPU & memory requirements on a per-job & per-resource basis. No CPU sharing and no memory overcommit please, just a simple "I need 4 vCPUs and 4GB, please dedicate these resources to me". Dedicated volumes come in as a close second; BTRFS just isn't working.

Everyone on the Concourse team should be familiar with the RabbitMQ CI as well as the CI metrics; please talk to @cirocosta if not.

cc @dumbell @michaelklishin

@marco-m (Collaborator) commented May 16, 2019

@gerhard

just simple I need 4 vCPUs and 4GB, please dedicate these resources to me

This can be obtained with #2928: you would set the max build containers per worker to 1 and deploy workers with 4 vCPUs and 4GB.

@cirocosta (Member) commented May 16, 2019

No CPU sharing and no memory overcommit please, just simple I need 4 vCPUs and 4GB, please dedicate these resources to me

A nice side effect of this is that one would then be able to safely mix the worker types that compose the whole pool of workers: no job that is "too big" (whatever "big" means) would end up being scheduled onto a "small" worker, while the small worker would still be left for work that fits it (small workloads).

@ddadlani (Contributor, Author) commented May 16, 2019

I agree that the worker would need to advertise its current "free capacity" to the ATCs, perhaps through the heartbeat or through another API call.
I don't see an advantage to giving workers a decision engine, especially since by design they do not currently hold any state.

Here are some things that we would like to take into account when scheduling jobs:

  1. Worker system OS (Linux, Darwin, Windows etc)
  2. Teams and other worker/job tags
  3. Number of containers and volumes on a given worker
  4. CPU, memory and disk usage per worker

The workers have all of this information and already send 1, 2 and 3 to the ATC. But giving the workers the ability to use this information to pull work would be counterintuitive. For example, it would be odd if a team worker could see all the available work to be scheduled, including tasks not tagged to a team (it might potentially be a security flaw).
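
For illustration only, advertising that capacity could mean extending what the worker already reports; the field names below are invented for this sketch, not Concourse's actual heartbeat payload:

package worker

// WorkerHeartbeat is a hypothetical extended heartbeat: items 1-3 above are
// already reported today, item 4 (resource usage) would be the new addition.
type WorkerHeartbeat struct {
	Name             string   `json:"name"`
	Platform         string   `json:"platform"` // (1) linux, darwin, windows
	Team             string   `json:"team,omitempty"`
	Tags             []string `json:"tags,omitempty"`     // (2)
	ActiveContainers int      `json:"active_containers"`  // (3)
	ActiveVolumes    int      `json:"active_volumes"`     // (3)
	CPUPercent       float64  `json:"cpu_percent"`        // (4) new
	MemoryUsedBytes  uint64   `json:"memory_used_bytes"`  // (4) new
	DiskUsedBytes    uint64   `json:"disk_used_bytes"`    // (4) new
}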

Another concern would be equitable distribution of work, which is hard to do because workers don't have knowledge of other workers' capacities.
Let's say we have 2 workers, both of which can run 10 tasks in parallel. Worker 1 is currently running 8 moderately heavy tasks. Worker 2 is running 3 moderately heavy tasks. Ideally, worker 2 would run the new task, but in this scenario, if worker 1 "pulled" from the queue first, it would get the new task. Normally this would not be a problem because we would eventually achieve an even distribution of tasks. However, if this new task is extremely resource-heavy (which it may be), it is less likely to cause problems if it is scheduled on worker 2.

So, for the above 2 reasons, I think it makes more sense to let the ATC assign work to each worker.

I do like @marco-m's suggestion of offloading scheduling to Nomad; it would be great to not have to worry about it ourselves. We have not investigated how it would work with Concourse, but that might be worth fleshing out too. I also don't know how it would work with Garden + Baggageclaim.

@ddadlani (Contributor, Author) commented May 16, 2019

@gerhard Having a queue would probably help with your use case as well, and max-build-containers would be a decent workaround until then. Knowing how many resources a task needs is interesting; it would simplify scheduling greatly. It's unfortunate that we cannot make that assumption for all use cases. I'd be interested to see how much of an improvement that would provide in a workload where some tasks specify their required resources but some don't.

@gerhard commented May 17, 2019

@marco-m yes, max-build-containers would work. Your comment describes our use-case well. 1 job per worker would be a great first step.

If there were a story to implement max-build-containers, how many points of complexity are we looking at?

@ddadlani the ideal solution would imply taking into account the resources used by all previous builds and working out the optimal resource allocation based on past metrics. The simplest thing would be to allow users to manually define resource requirements on a per-job basis, similar to Docker resource constraints, K8s resource requests or Nomad resources. It really doesn't matter how it's implemented in Concourse as long as it exists.

@marco-m (Collaborator) commented May 17, 2019

@gerhard

yes, max-build-containers would work. Your comment describes our use-case well. 1 job per worker would be a great first step.

it is because we have exactly the same problem :-D

@freelock (Contributor) commented May 17, 2019

We had challenges like these at first, but were able to work around them by adding an external system to unpause a particular pipeline and trigger a job, and then pause the pipeline after the last job in the pipeline completes. This was working great for us until v5.1.0, which now seems to check every resource in the pipeline before letting a triggered job begin... #3845 and #3759 have more details...

The problems we were having were not related to too many concurrent jobs, though; they stemmed from having way too many check containers active in pipelines. Even if no jobs were running, just having 5 or 6 active pipelines (each with ~19 resources) crushed our workers. And we could only get that many active by setting "check_every" to slow down the frequency of checks.

@ddadlani (Contributor, Author) commented May 17, 2019

Hi @gerhard, the reason we are not implementing max-build-containers (essentially max-tasks-per-worker as outlined in #2928) right now is not its complexity. It is the other refactoring work that is ongoing, as well as other issues that need to be fixed (e.g. #3301). In the interim we are trying to gather feedback from users about their problem use cases, so that we can ensure our solution covers as many of them as possible.

Being able to "predict" resource usage of a task based on previous runs is something we have discussed, but might be overkill for a first approach at tackling scheduling. It might be something worked on in later stages.

@ddadlani (Contributor, Author) commented May 17, 2019

@freelock having too many check containers is related but separate, and we are working to reduce that in other ways (#3079 for example).

Looking through your comment history on #3759, however, it looks like you have modified your workflow to rely on manually triggering builds because otherwise your Concourse workers come under very high load. I'm more interested in what exactly was happening in the automatic build trigger scenario. Do you have statistics on how many containers and volumes were on each worker? How did the workers fail?

Some context, which I neglected to provide when creating the issue
For anyone coming here from #3759 or similar issues, I'd like to clarify an important but obscure difference: there are two kinds of "scheduling" that happen in Concourse.
First, the resource scanner will run checks for new versions (on random workers), then the scheduler will schedule a pending build in the DB based on those versions.
Second, when the pending build is started, the steps of that build (get, put, task etc) will each be scheduled (maybe a better word is "placed"?) on available workers.

#3759 is talking about the first kind, whereas this issue is more focused on the second kind, i.e. the worker selection and container placement. This is because checks are generally much more lightweight and do not usually cause workers to fail (except in the case where too many check containers are on the worker). It is usually the get, put and task steps that cause a worker to fail because of high CPU load, memory usage, etc. It would be nice not to place new work on already-busy workers.

@freelock (Contributor) commented May 17, 2019

@ddadlani figured the other issue might be a better place to respond...

@ukabu commented Jun 5, 2019

Workers pulling jobs would make it possible to scale the number of workers up and down as load requires (I don't need 40 workers at night) and would make it simpler to configure workers for certain types of loads and behavior: think volume locality for compilation workers, random for deploy jobs. The worker would ask for tasks that match some criteria, and the ATC would preferably give it tasks that match those criteria.
Also, giving workers the responsibility to manage their own load would remove the need to coordinate load management with the ATC.

@qoomon commented Jun 5, 2019

Totally agree

@ddadlani (Contributor, Author) commented Jun 5, 2019

@ukabu

The worker would ask for tasks that match some criteria, and the ATC would preferably give it tasks that match those criteria

You can already configure workers to do certain types of loads using tags. For example, you could tag some workers as compilation and then tag your compilation tasks as compilation to ensure they run only on that pool of workers.

If you did that, you wouldn't even need volume-locality for compilation tasks since they would always run on the same small pool of workers, which means they all should have the same volumes eventually.

Having a scheduler would likely mean getting rid of individual container placement strategies and instead including all the available strategies, alongside worker load, in the placement decision.

Our workers are currently pretty stateless, and if we were to push decisions down to them it would mean a lot of changes to the way we do things. For example, if worker A and worker B both had a compilation tag, and a compilation task was available, wouldn't it make sense for the ATC to look at the workers and pick the one with less load on it, rather than whichever worker pulled from the queue first?
