
Feature Request: slightly better scheduling (don't kill my workers under load) #2577

Open
marco-m opened this Issue Sep 11, 2018 · 26 comments

Comments

@marco-m

marco-m commented Sep 11, 2018

What challenge are you facing?

In my understanding, Concourse doesn't have a build queue and doesn't control the load of the workers: any time a task is ready to run, it gets dispatched to a worker.

In our deployment, we use Concourse not only to build the master branch of a project; we also instantiate one pipeline per feature branch (so max_in_flight or serial are ineffective, since their scope is limited to a single pipeline).

Our worst workloads are C++ builds on Concourse Windows workers. While Linux degrades gracefully with load, Windows just melts down. We are using pretty beefy instances on AWS (t2.2xlarge, 8 vCPU, 32 GB RAM).

We cannot simply keep adding workers; it becomes too expensive. For this reason, we would like to add slightly better scheduling to Concourse.

We will be happy to do the work; I am opening this ticket to see what would be the best way to approach this collaboration.

What I am sketching below is, in my understanding, the absolute minimum that could help us. We are not tied to this particular idea; what really matters for us is to get an incremental improvement soon, because this is currently holding back the migration of our bigger project from Jenkins to Concourse :-(

See also

  • #675 Better distribution of jobs across workers
  • #1741 Fix "worker gravity": Allow more random scheduling of jobs/tasks across workers
  • #676 Feature request: ability to limit number of tasks per worker

@vito what do you think?

What would make this better?

The idea is that there is no need to measure the actual load on the workers; it is enough to use the number of tasks currently executing on a worker as a rough load measure (which is better than nothing and could be extended to other load definitions later).

Phase 1 (all workers support the same load)

Introduce following options for the ATC:

  • less-tasks for --container-placement-strategy (currently supports [volume-locality|random]).
  • --max-tasks-per-worker, default unlimited (as is the case now).

Option less-tasks can be specified independently of max-tasks-per-worker; on the other hand, if max-tasks-per-worker is specified, it requires less-tasks, and the ATC should fail to start if a different placement strategy is configured.

The following description assumes that less-tasks is set.

The scheduler should be modified as follows:

The scheduler keeps the workers in a priority queue (real or conceptual), sorted by the number of tasks currently running on each worker. Workers with fewer tasks are at the top of the queue. Implementation detail: one queue per worker type, or one global queue?

Add a new event to wake up the scheduler: a task on a worker finished executing (meaning a slot became available). This should also update the priority queue.

When a task becomes runnable, the scheduler should look at the first worker in the priority queue (if there is one queue per worker type), or traverse the queue to find a matching worker type (if there is one global queue).

If max-tasks-per-worker is set:

  • If the current worker has fewer than max-tasks-per-worker tasks running, dispatch the new task to it (and directly or indirectly update the priority queue).
  • If, on the other hand, the current worker has at least max-tasks-per-worker tasks running, then, since the list is a priority queue, every other worker is at least as loaded, so there is no need to keep traversing the list. Do not dispatch the runnable task; wait for the next event to wake up the scheduler.

If max-tasks-per-worker is not set:

  • Take the current (least-loaded) worker from the priority queue and dispatch the runnable task to it. This is still better than random placement because it dispatches to the matching worker with the fewest running tasks (see the sketch below).
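To make the above concrete, here is a minimal Go sketch of the Phase 1 selection logic, assuming a single global queue modelled as a sorted slice. The Worker type, its fields, and the function name are hypothetical and not taken from the ATC codebase:

```go
// Hypothetical sketch of the Phase 1 "less-tasks" selection. Not ATC code.
package placement

import "sort"

type Worker struct {
	Name         string
	Platform     string
	RunningTasks int
}

// PickWorker returns the matching worker with the fewest running tasks, or
// false if every matching worker is at the max-tasks-per-worker cap
// (0 means unlimited). The caller re-runs selection when a "task finished"
// event frees a slot.
func PickWorker(workers []Worker, platform string, maxTasksPerWorker int) (Worker, bool) {
	// A sorted slice stands in for the (real or conceptual) priority queue.
	sort.Slice(workers, func(i, j int) bool {
		return workers[i].RunningTasks < workers[j].RunningTasks
	})
	for _, w := range workers {
		if w.Platform != platform {
			continue // one global queue: skip non-matching worker types
		}
		if maxTasksPerWorker > 0 && w.RunningTasks >= maxTasksPerWorker {
			// The queue is sorted, so every remaining matching worker is at
			// least as loaded; leave the task pending.
			return Worker{}, false
		}
		return w, true
	}
	return Worker{}, false
}
```

A per-worker limit (Phase 2 below) could be modelled by replacing the single maxTasksPerWorker argument with a per-Worker field that caps each worker individually.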

Phase 2 (each worker can support different load)

Add an option max-tasks-for-worker to concourse worker and modify the scheduling algorithm on the ATC to take differences between workers into account, while max-tasks-per-worker on the ATC would still act as a global cap.

Distinction between resource tasks and pipeline tasks?

It remains to be decided what we want to do for tasks associated with Concourse resources, like git, S3, .... Should they be subject to the same less-tasks and max-tasks-for-worker scheduling rules, or should they be dispatched as usual?

Are you interested in implementing this yourself?

Yes.

@topherbullock

Member

topherbullock commented Sep 13, 2018

To track how much 'work' a worker can perform, we could have workers report their Garden configuration for max_containers on heartbeat and persist that to the database for use in worker selection.

We might want to stick to more of a weighted round robin approach (at least per-ATC), to avoid a case where a newly registered worker (with 0 containers) will suddenly be the target for every build step. This will be dependent on load, the time between worker heartbeats, and the number of scheduled builds. A simplistic approach would be to pick, say, 3 of the least used workers (matching the criteria for resources, tags, platform, etc.) and randomly choose from that subset.
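A rough Go sketch of that "randomly choose among the k least-used workers" idea; the types, names, and subset size here are illustrative only:

```go
// Illustrative sketch: choose randomly among the k least-used workers that
// match the step's criteria, to avoid dogpiling a freshly registered worker.
package placement

import (
	"math/rand"
	"sort"
)

type Candidate struct {
	Name             string
	ActiveContainers int
}

// ChooseAmongLeastUsed assumes candidates is non-empty, k >= 1, and the list
// is already filtered by platform, tags, and resource-type criteria.
func ChooseAmongLeastUsed(candidates []Candidate, k int) Candidate {
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].ActiveContainers < candidates[j].ActiveContainers
	})
	if k > len(candidates) {
		k = len(candidates)
	}
	// Picking randomly within the k least-loaded candidates spreads a sudden
	// burst of steps instead of sending them all to the single emptiest worker.
	return candidates[rand.Intn(k)]
}
```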

Another consideration is what we do in the case where all of the workers we have are at capacity. Handling this would be useful on its own: in the case of a single-worker deployment, every build would still complete eventually, but steps would stay pending until a worker is available to perform them.

This might be a good time to move towards a work queue, onto which exec throws steps to be picked up later. We'll need to worry about the 'context' of a build for each step, as steps' outputs need to be associated with inputs in later steps, but maybe this is as simple as adding a field to volumes associated with a build.

@topherbullock topherbullock added this to Icebox in Runtime via automation Sep 13, 2018

@marco-m

Author

marco-m commented Sep 13, 2018

@topherbullock thanks for following up!

Is max_containers also available for non-Linux workers, or would non-Linux workers need to be handled in a different way?

I see the reasoning about the weighted round robin; actually, RR is what I thought of at the beginning, but then it looked like a priority queue was enough. I don't have a strong opinion here. Consider, though, your example of a new worker with 0 containers: with a priority queue it would get "filled up" until max-tasks-per-worker is reached only if all the other workers were already at capacity, since the priority queue would be adjusted on each task dispatch. To give an example:

Say we have 3 workers w1, w2, w3 with "load" (number of running tasks) 2, 2, 3, and max-tasks-per-worker is 4. The priority queue is (w1 2), (w2 2), (w3 3). We add the new worker, w4. The priority queue becomes (w4 0), (w1 2), (w2 2), (w3 3). A task becomes runnable and is dispatched to w4; the queue becomes (w4 1), (w1 2), (w2 2), (w3 3). Another task becomes runnable and is again dispatched to w4; the queue becomes (w4 2), (w1 2), (w2 2), (w3 3). A third task becomes runnable and is dispatched to w4 one last time; the queue becomes (w1 2), (w2 2), (w4 3), (w3 3). See? w4 changed place in the queue.
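The same walk-through as a tiny runnable Go demo of the re-ordering (purely illustrative, not ATC code):

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// load = number of running tasks per worker; max-tasks-per-worker is 4.
	load := map[string]int{"w1": 2, "w2": 2, "w3": 3, "w4": 0}
	for i := 0; i < 3; i++ {
		names := []string{"w1", "w2", "w3", "w4"}
		// Re-sort on every dispatch, as the priority queue would be.
		sort.Slice(names, func(a, b int) bool { return load[names[a]] < load[names[b]] })
		target := names[0] // least-loaded worker gets the runnable task
		load[target]++
		fmt.Printf("dispatch %d -> %s, load now %v\n", i+1, target, load)
	}
	// After three dispatches w4 sits at 3 tasks, level with w3, so the next
	// runnable task would go to w1 or w2 (2 tasks each).
}
```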

Another consideration is what we do in the case where all of the workers we have are at capacity.

Yes, I agree. I mentioned that case ("Do not dispatch the runnable task and wait for the next event to wake up the scheduler"), which is, in my understanding, the same as what you are suggesting.

@marco-m

Author

marco-m commented Sep 17, 2018

@topherbullock Do you or @vito have an idea of the timeline for this ticket? Would you accept external contributions? As I wrote in the description, this problem is a showstopper for us. Thanks!

@jchesterpivotal

Contributor

jchesterpivotal commented Sep 22, 2018

I'd call out a distinction here: the difference between job scheduling and job placement.

The former is the business of deciding what job to run next. The latter is where to run it. From my hazy understanding of queueing theory, if you choose a work-stealing arrangement, you can treat a pool of multiple workers as a "single" worker, which leaves job scheduling as the remaining problem.

There are lots of scheduling schemes. From what I've read "shortest remaining processing time" is the best all-round performer, though that scheme requires the ability to preempt other jobs currently running. Of the non-preemptible scheduling arrangements it appears "shortest job first" is the best performer, though it becomes poorly behaved in cases with high job-length variability.

In practical terms, this would mean:

  1. There's a single pool of things waiting to be run. As a special case, checks would always get first dibs, as they are time-sensitive.
  2. Jobs are examined for the number of steps they contain; steps are used as a very rough proxy for expected running time.
  3. Take a step from the Job with the fewest unplaced steps and place it on the worker with the fewest running containers. If multiple Jobs are tied on unplaced steps, pick one randomly. If multiple workers are tied on running containers, place randomly.
  4. Repeat this process until all steps have been assigned, or until there are no workers with free container slots (a rough sketch of this loop follows below).
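As a rough Go sketch of steps 3 and 4 (all type and function names are hypothetical, step names are assumed unique, and the random tie-breaking is omitted for brevity):

```go
// Illustrative sketch of the "shortest job first"-flavoured loop described
// above: repeatedly take a step from the Job with the fewest unplaced steps
// and place it on the worker with the fewest running containers.
package scheduling

import "sort"

type Job struct {
	Name          string
	UnplacedSteps []string
}

type Worker struct {
	Name       string
	Containers int
	Capacity   int // a worker is free while Containers < Capacity
}

// PlaceAll returns a step -> worker assignment, stopping when everything is
// placed or no worker has a free container slot.
func PlaceAll(jobs []Job, workers []Worker) map[string]string {
	placements := map[string]string{}
	for {
		sort.Slice(jobs, func(i, j int) bool {
			return len(jobs[i].UnplacedSteps) < len(jobs[j].UnplacedSteps)
		})
		sort.Slice(workers, func(i, j int) bool {
			return workers[i].Containers < workers[j].Containers
		})
		ji := firstJobWithSteps(jobs)
		if ji < 0 || len(workers) == 0 || workers[0].Containers >= workers[0].Capacity {
			return placements // nothing left to place, or no free worker
		}
		step := jobs[ji].UnplacedSteps[0]
		jobs[ji].UnplacedSteps = jobs[ji].UnplacedSteps[1:]
		placements[step] = workers[0].Name
		workers[0].Containers++
	}
}

// firstJobWithSteps skips finished jobs, which sort first after sorting by
// fewest unplaced steps.
func firstJobWithSteps(jobs []Job) int {
	for i, j := range jobs {
		if len(j.UnplacedSteps) > 0 {
			return i
		}
	}
	return -1
}
```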

The business of chopping up Jobs is intended to vaguely resemble Shortest Job First, since it continuously selects the Job with the fewest unplaced steps. Shortest Remaining Processing Time would require the ability to preempt running steps, which is not really a plausible option in a general-purpose system like Concourse.

This approach ignores cache-locality, which is a huge variable and one of the blessings and curses of Concourse's default placement logic. If we consider multiple variables in placement, we're starting to sail back towards building a fully-featured container orchestrator, which is probably not worth the hassle. I think a relatively dumb queueing scheme will buy time until it's possible to plug in some other orchestrator.

@topherbullock

Member

topherbullock commented Sep 26, 2018

@marco-m Sorry for the delay in getting back; lots of travel the last few days.

I think in the short term a PR for a new container placement strategy which looks at container counts would be greatly appreciated. We haven't prioritized this in the runtime backlog yet.

There's a lot of other underpinnings which @jchesterpivotal summarized quite well, so in the long term there's a lot to consider, but in the meantime "select the least used worker" might be the best first pass at making this better. As far as the caching issue is concerned, I see this as a better version of the random strategy which already exists.

We're going to need the placement part anyhow to get to a good place for scheduling, so it makes sense to start there!

@marco-m

Author

marco-m commented Sep 26, 2018

Thanks @topherbullock! At this point I will wait for #2534 to settle while getting more familiar with the Concourse sources, and will chime back either here or on Discord in the contributors channel.

jchesterpivotal added a commit to jchesterpivotal/concourse that referenced this issue Oct 21, 2018

Introduce a global max-in-flight limit
Currently, Concourse scopes max-in-flight build throttling logic
to jobs and pipelines. But it does not account for global
workloads across all pipelines in light of available workers. In
cases of sudden overload, Concourse will dispatch builds to workers
even if those workers have exceeded their container limit. Those
builds error, but the versions that triggered them are still
considered to have been consumed.

This is undesirable behaviour. To avoid losing any builds, many
teams choose to deploy enough capacity to deal with load surges.
As a tradeoff this means that many workers spend almost all of
their time at very low utilisation, which can be very expensive.

This commit introduces a deliberately simplistic scheme for throttling
container workloads globally across a whole Concourse. The basis
of the scheme is that the number of workers defines available capacity
and the number of active containers represents current utilisation.
By setting this limit, Concourse will not try to schedule builds
if it thinks workers are too busy. This means it will be possible
to shrink worker deployments in exchange for accepting slowdowns
under heavy load.

250 containers per worker is set as the upper limit, reflecting the
default setting for Garden. 90% is set as the utilisation limit,
reflecting that delays in system behaviour will cause occasional
overshoots. So long as utilisation is below 90%, builds are not
prevented from proceeding by the global rule. They may still be
prevented by a standard max-in-flight rule.

The change is inspired by concourse#2577 but represents
only a small part of that discussion.

Signed-off-by: Jacques Chester <jacques@chester.id.au>

@topherbullock topherbullock moved this from Icebox to Backlog in Runtime Nov 13, 2018

@kcmannem kcmannem self-assigned this Nov 13, 2018

@ddadlani ddadlani moved this from Backlog to In Flight in Runtime Nov 15, 2018

@ddadlani ddadlani self-assigned this Nov 15, 2018

@kcmannem

Member

kcmannem commented Nov 19, 2018

Some things to consider:

  • How do we want to measure load on a worker? Number of containers? Number of volumes? Disk usage? CPU load? A combination of these? The number of volumes outnumbers the number of containers, so we want to avoid that bias. The number of containers (roughly) correlates with the number of tasks.
  • We should have a metric measuring this on Wings and/or prod to determine if our solution balances the load effectively.
  • What are the network implications of changing scheduling? Depending on our load metric, we may be streaming volumes across workers a lot more.
  • We could assign weights to the different metrics that we use to measure load, and these weights could be configurable by the user, perhaps even per-pipeline (see the sketch below).
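As a purely illustrative Go sketch of the configurable-weights idea; the signals, field names, and weighting scheme are made up:

```go
// Illustrative sketch of a weighted load score combining several signals
// (container count, volume count, disk, CPU). The blend could be tuned by
// the operator; lower score means a more attractive worker.
package load

type Weights struct {
	Containers float64
	Volumes    float64
	DiskUsed   float64 // applied to a 0..1 fraction
	CPU        float64 // applied to a 0..1 fraction
}

type WorkerSample struct {
	Containers int
	Volumes    int
	DiskUsed   float64
	CPU        float64
}

// Score returns a comparable load value; the worker with the lowest score
// would be preferred. Volumes typically outnumber containers, so their
// weight would likely be kept small to avoid bias.
func Score(s WorkerSample, w Weights) float64 {
	return w.Containers*float64(s.Containers) +
		w.Volumes*float64(s.Volumes) +
		w.DiskUsed*s.DiskUsed +
		w.CPU*s.CPU
}
```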

Our current spike is a naive implementation that schedules new jobs on the worker with the lowest total number of containers and volumes.

This story may require multiple iterations to get the perfect scheduling balance.

@kcmannem

Member

kcmannem commented Nov 19, 2018

Here is how Diego's auctioning algorithm works, for reference:
https://github.com/cloudfoundry/rep/blob/HEAD/resources.go#L115

@jama22

Member

jama22 commented Nov 19, 2018

One thing to consider would be to push the logic to the worker, where a worker could declare whether it can take more load. However, this would also require us to implement some form of queuing on the ATC so that, if all workers are blocked, we don't end up dropping work.

To determine the busy-ness (business?) of a worker, you could investigate using the Garden API to report on compute cycles used? That could just be a lie, though, and it might not exist.

@marco-m

Author

marco-m commented Nov 19, 2018

Super happy to see this moving, thanks! :-)

@ddadlani

Contributor

ddadlani commented Nov 19, 2018

We are going to experiment with the Garden API to get container usage info (CPU, memory, disk etc) for all worker platforms (Linux, Darwin, Windows). If this information is available, we can use that to estimate future resource usage for each task and schedule them on workers that way.
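Purely as an illustration of how such per-container usage could be rolled up into a per-worker estimate; the MetricsSource interface and ContainerStats fields below are invented for this sketch and are not the actual Garden client API:

```go
// Sketch of aggregating per-container usage into a per-worker estimate.
// MetricsSource and ContainerStats are hypothetical stand-ins for whatever
// stats endpoint the worker's container runtime actually exposes.
package usage

type ContainerStats struct {
	CPUNanos    uint64
	MemoryBytes uint64
	DiskBytes   uint64
}

type MetricsSource interface {
	// Stats returns usage keyed by container handle.
	Stats() (map[string]ContainerStats, error)
}

type WorkerUsage struct {
	CPUNanos    uint64
	MemoryBytes uint64
	DiskBytes   uint64
}

// Aggregate sums the per-container stats into one figure per worker, which a
// placement strategy could then compare across workers.
func Aggregate(src MetricsSource) (WorkerUsage, error) {
	stats, err := src.Stats()
	if err != nil {
		return WorkerUsage{}, err
	}
	var total WorkerUsage
	for _, s := range stats {
		total.CPUNanos += s.CPUNanos
		total.MemoryBytes += s.MemoryBytes
		total.DiskBytes += s.DiskBytes
	}
	return total, nil
}
```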

@ddadlani

Contributor

ddadlani commented Nov 29, 2018

Quick update: as a first-pass solution, we have decided to use the existing number of active containers on the workers to determine container placement. This means that we are adding a placement strategy that places the new task on the worker with the fewest active containers.

We are currently working on a topgun test for this placement strategy.
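A minimal Go sketch of that first-pass strategy, assuming the ATC can count active containers per compatible worker; the types and field names are hypothetical, not the actual implementation:

```go
// Illustrative sketch of "place on the worker with the fewest active
// containers": among compatible workers, pick the one with the lowest count.
package placement

type WorkerInfo struct {
	Name             string
	ActiveContainers int
}

// FewestActiveContainers returns false only if no compatible worker exists.
func FewestActiveContainers(compatible []WorkerInfo) (WorkerInfo, bool) {
	if len(compatible) == 0 {
		return WorkerInfo{}, false
	}
	best := compatible[0]
	for _, w := range compatible[1:] {
		if w.ActiveContainers < best.ActiveContainers {
			best = w
		}
	}
	return best, true
}
```

Restricting the count to build containers (get, task, and put, excluding checks), as discussed further down in this thread, would only change which containers feed into ActiveContainers.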

@JohannesRudolph

Contributor

JohannesRudolph commented Dec 1, 2018

Thanks to everyone who has been chiming in here so far and pushing this. We also feel the pain of Concourse's lack of an intelligent scheduling strategy. Fixing this will provide a major boost in Concourse's maturity for real workloads.

In terms of "active containers", I'd very much favor having this exclude resource containers and only consider task containers, if such a distinction is available to the ATC during container placement...

The reasoning is simple: resource containers are typically just I/O bound and very light on load, whereas task containers do the heavy lifting and are usually I/O, CPU and memory bound. I agree this seems like it would provide a great short-term solution with reasonable implementation effort.

As far as the "proper" long-term solution is concerned, I'd argue against trying to get fancy with Garden metrics or even recording historic task performance. A scheduling system with K8s-style limits and requests that users can specify on their tasks gives them all the control they need to influence placement. And at that point it makes sense to have Concourse just hand tasks over to a real K8s cluster for scheduling anyway.

@JohannesRudolph

Contributor

JohannesRudolph commented Dec 1, 2018

@marco-m my experience with Windows and Linux workers has been exactly the opposite: we've had tons of issues with our Gradle builds on Linux because the OOM killer would just randomly kill daemons (and in the JVM world even the Kotlin compiler has its own daemon, so there are plenty of attractive processes for the OOM killer to act on and break our builds). It took us a while to configure all the different parts to reduce the chance of OOM kills.

@marco-m

Author

marco-m commented Dec 2, 2018

In terms of "active containers", I'd very much favor having this exclude resource containers and only consider task containers, if such a distinction is available to the ATC during container placement...

Very much +1 !

@ddadlani

Contributor

ddadlani commented Dec 6, 2018

Thanks for your input! We are now going down the route of using containers associated with a build as our metric to decide container placement.

This essentially excludes check containers from the metric. Get and put containers are included in the metric because we cannot estimate their resource usage beforehand.

@mhuangpivotal

Contributor

mhuangpivotal commented Dec 10, 2018

We pushed our change that added the least-build-containers placement strategy. This strategy chooses the worker with the fewest build containers (counting get, task, and put containers).

For the second part of this feature that limits the max number of build containers on the workers, we created a separate issue #2928.

@marco-m

Author

marco-m commented Dec 10, 2018

@ddadlani

Contributor

ddadlani commented Jan 8, 2019

Update: least-build-containers is now the placement strategy of choice for https://ci.concourse-ci.org/ for the next week.
Assuming no issues arise, it should be done.

@cirocosta

Member

cirocosta commented Jan 24, 2019

Having been operating a Concourse environment myself for a little while, I'd like to +1 this one:

As far as the "proper" long-term solution is concerned, I'd argue against trying to get fancy with Garden metrics or even recording historic task performance. A scheduling system with K8s-style limits and requests that users can specify on their tasks gives them all the control they need to influence placement. And at that point it makes sense to have Concourse just hand tasks over to a real K8s cluster for scheduling anyway.

As I see it, the feedback for the operator could get pretty good with something like that: just like in K8s you can see that your workload is pending because there's no room in the cluster for it (it just doesn't fit), we could have similar feedback.

One could even have metrics around those requests and have a nice way of deciding when to automatically scale the number of workers, etc.

@marco-m

Author

marco-m commented Jan 24, 2019

@cirocosta regarding k8s, that is fine as long as one uses only Linux workers. If, on the other hand, a poor soul has to build on Windows, Mac, or any other non-container OS (as I do), then containers are unfortunately not an option. I am fine with keeping things as simple as possible, but we have to remember that there are situations where Linux is not enough.

@vito

Member

vito commented Jan 24, 2019

@marco-m Even in the Linux case, K8s is not going to replace the current backend entirely - it would be added alongside it, and we would have to support both equally. I don't want users to have to learn the K8s ecosystem or maintain a cluster (even if they can just use a cloud provider) just to run CI. It's often overkill.

@cirocosta

Member

cirocosta commented Jan 24, 2019

Oh yeah, I was thinking of the reservation system as independent of the platform: more of an extra constraint that the scheduler could keep track of when deciding where to place tasks. E.g., at registration time the worker would advertise its capacity (5 units of something: CPU / RAM / etc.); then, when scheduling a task to it, the scheduler would take that into account to see whether the worker can fit the given workload (given the amount already consumed by the tasks currently assigned to that specific worker).

(k8s was an example).

Sorry for the confusion.
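A small Go sketch of the advertised-capacity idea described above; the abstract unit, field names, and function are all invented for illustration:

```go
// Illustrative sketch of capacity-based admission: a worker advertises an
// abstract capacity at registration time, and the scheduler only places a
// task there if the task's declared demand still fits.
package capacity

type Worker struct {
	Name     string
	Capacity int // advertised at registration, in abstract units
	Reserved int // units consumed by tasks currently assigned to this worker
}

type Task struct {
	Name  string
	Units int // declared demand, e.g. from a request/limit on the task
}

// Fit reserves space on the first worker that can still hold the task, or
// returns false if none can. A task that does not fit anywhere would stay
// pending, giving the operator clear "doesn't fit in the cluster" feedback.
func Fit(workers []Worker, t Task) (*Worker, bool) {
	for i := range workers {
		if workers[i].Reserved+t.Units <= workers[i].Capacity {
			workers[i].Reserved += t.Units
			return &workers[i], true
		}
	}
	return nil, false
}
```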

@marco-m

Author

marco-m commented Jan 24, 2019

Thanks for the clarification @vito and @cirocosta. And talking about wild ideas, a potential and simpler alternative to k8s is Nomad by HashiCorp (of Terraform fame), which could maybe be used as a generic backend (it handles both containers and non-containers): https://www.nomadproject.io/ :-)

@vito

Member

vito commented Jan 24, 2019

@marco-m Yeah, I'd be super interested in a backend for Nomad too. @evanphx actually started on one a long while ago and had a working proof-of-concept here: https://github.com/nomad-ci/nomad-atc

TBH Nomad seems a much more natural fit in terms of the interfaces it exposes, and feels a lot more intuitive to use. I wish it would take off.

@evanphx

Contributor

evanphx commented Jan 25, 2019

Hi! Yeah, I was working on it for work but my priorities took me elsewhere, sadly.
