Better distribution of jobs across workers #675

Closed
oppegard opened this issue Sep 29, 2016 · 27 comments

@oppegard
Contributor

Concourse's present (v2.2.1) task scheduler doesn't appear to distribute builds across workers based on CPU load, disk I/O, or simple round-robin.

When we first moved from CircleCI to Concourse we were using a single, smaller worker, which was scaled up to handle more jobs over time (c3.2xlarge -> c3.8xlarge). We've reached the limit of scaling vertically (x1.32xlarge notwithstanding). As a result, developers on the team might wait 30 minutes for builds to finish that would normally take 10 minutes.

Scaling horizontally requires zero effort on our part (as we've done by BOSH-deploying 4 workers), but it's not helping much. Output from bosh instances --vitals:

+--------------------------------------------------+---------+----+------------+------------+-----------------------+--------------------+--------------+------------+------------+------------+------------+
| Instance                                         | State   | AZ | VM Type    | IPs        |         Load          |       CPU %        | Memory Usage | Swap Usage | System     | Ephemeral  | Persistent |
|                                                  |         |    |            |            | (avg01, avg05, avg15) | (User, Sys, Wait)  |              |            | Disk Usage | Disk Usage | Disk Usage |
+--------------------------------------------------+---------+----+------------+------------+-----------------------+--------------------+--------------+------------+------------+------------+------------+
| web/0 (79c07c59-6548-4b01-8856-495ba1c2b327)*    | running | z1 | t2.medium  | XX.X.16.11 | 0.24, 0.18, 0.21      | 3.7%, 0.4%, 0.1%   | 11% (426.1M) | 0% (0B)    | 44%        | 6%         | n/a        |
+--------------------------------------------------+---------+----+------------+------------+-----------------------+--------------------+--------------+------------+------------+------------+------------+
| worker/0 (7d2353fe-1610-4c9f-94f4-174f1400e0ed)* | running | z1 | c3.8xlarge | XX.X.16.12 | 84.96, 115.32, 115.82 | 65.2%, 30.5%, 0.1% | 69% (40.6G)  | 6% (3.4G)  | 44%        | 27%        | n/a        |
| worker/1 (46b34a6b-292e-4552-97a8-d5747f62aadd)  | running | z1 | c3.8xlarge | XX.X.16.13 | 0.77, 6.18, 9.43      | 3.3%, 0.9%, 0.2%   | 13% (7.5G)   | 0% (4.8M)  | 44%        | 9%         | n/a        |
| worker/2 (863020a1-555a-4c47-8371-6c797d7d1672)  | running | z1 | c3.8xlarge | XX.X.16.16 | 0.45, 0.24, 0.31      | 0.1%, 0.1%, 0.0%   | 8% (4.5G)    | 0% (24.0K) | 44%        | 8%         | n/a        |
| worker/3 (7ac184a4-7162-4fc9-be58-4a7a30adc6c1)  | running | z1 | c3.8xlarge | XX.X.16.15 | 0.17, 0.34, 0.83      | 0.3%, 0.1%, 0.0%   | 10% (5.6G)   | 0% (0B)    | 44%        | 6%         | n/a        |
+--------------------------------------------------+---------+----+------------+------------+-----------------------+--------------------+--------------+------------+------------+------------+------------+

There are four c3.8xlarge workers (32 vCPU / 60GB memory). One worker is running the bulk of the jobs and is overutilized, while the other three are essentially idle. While tagging workers and tasks is a blunt instrument, we'd prefer not to go down that route.

@concourse-bot
Collaborator

concourse-bot commented Sep 29, 2016

Hi there!

We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created.

The current status is as follows:

  • #139732069 I can see the load estimate for each worker via the API
  • #139735927 Step placement should be determined by worker load estimate
  • #140235033 Better distribution of jobs across workers
  • #139734393 I can see the CPU usage and timing information of each step via the API

This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.

@jchesterpivotal
Contributor

jchesterpivotal commented Oct 5, 2016

We (buildpacks) see this behaviour too, though we also see a slow equalisation over time. My guess is that this is because containers aren't relocatable, so adding new workers won't redistribute your historical containers.

A hacky workaround we stumbled on today was to pause and then unpause all our pipelines. We have something like 300 check containers. Pausing and unpausing causes these to be destroyed and recreated. The load is still lumpy around containers retained for other purposes (failed builds, mostly).

@vito
Member

vito commented Oct 5, 2016

There may be some clever things we can do here to have things auto-balance based on metrics collected by a container after its workload ran. Garden can tell us how much time was spent in CPU, and there may be more metrics we can request from them in the future (disk i/o, network, etc).

The gist of my idea is that, after running a task, we'll record the "cost" (total time spent in CPU for starters) for the task in the build. We would then keep track of the "expected cost" of tasks running on a worker, and try to place tasks on workers with the lowest expected cost of their current workload.

Something like that. This could also apply to all containers (get and put and check).
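
As a purely illustrative sketch of the first half of that idea, here is one way to record a per-step cost from the CPU time observed after each run (the type names, the moving-average smoothing, and the default cost are assumptions for the example, not Garden's or the ATC's actual API):

```go
package placement

// stepCosts remembers an estimated CPU cost per step, keyed by some stable
// identifier (e.g. pipeline/job/step name), and is refined every time the
// step runs.
type stepCosts map[string]float64

// Record folds a newly observed CPU time (the kind of number Garden can
// report for a container) into the step's estimate. An exponential moving
// average keeps one unusually slow build from skewing future placement.
func (c stepCosts) Record(stepID string, observedCPUSeconds float64) {
	const alpha = 0.5
	if prev, ok := c[stepID]; ok {
		c[stepID] = alpha*observedCPUSeconds + (1-alpha)*prev
		return
	}
	c[stepID] = observedCPUSeconds
}

// Estimate returns the expected cost of a step, falling back to an arbitrary
// default for steps that have never run before.
func (c stepCosts) Estimate(stepID string) float64 {
	if cost, ok := c[stepID]; ok {
		return cost
	}
	return 1.0
}
```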

@jchesterpivotal
Contributor

You could write an auction mechanism! And then you'll have a healthchecker and etcd to back it. Somewhere is a system for structured descriptions of tasks to underlie it all.

I think that a name like Ciego would cover this.

@vito
Member

vito commented Oct 5, 2016

@jchesterpivotal Or we could keep things simple and keep the project approachable. :)

@jchesterpivotal
Contributor

Damn you and your troll-detection!

@cjcjameson
Contributor

Comment from @d regarding the hacky workaround you mention -- he observes that you can still run fly check-resource on a paused pipeline. So he's initially skeptical that pausing pipelines actually stops the check containers, though he acknowledges that the implementation of fly check-resource might be separate somehow.

Jesse -- comment more if you like

@jchesterpivotal
Contributor

All I know is that pausing all our pipelines reduces our total live container count by about 300. Without having read the code closely, the obvious candidate is check containers.

As I understand it, fly check-resource creates a new container, independent of any check containers arising from normal pipeline operations. So that'd be consistent with both our observations.

@RochesterinNYC

We observed this behavior again when one of our test suites booted up 18 tasks running rspec and all 18 of them landed on the same worker. The worker's at 100% CPU usage and we're unsure if it's thrashing or if the tasks are still running properly (but slowly).

@blizzo521

Would love to see some progress on this. Here on the Tracker team, we also have lots of issues with builds taking forever and then getting reaped when they hit the timeout; then we have to force them to run again and hope things are better next time. The primary job that runs to check our daily work has a history of being about 60% red, and the majority of those failures are due to this issue.

@metatype

metatype commented Dec 8, 2016

Yep, we're continuing to hit this issue as well. We have multiple pipelines (master + maintenance) with lots of parallel jobs. When we do a maintenance fix that spans multiple branches, all the pipelines get activated and the jobs--which are CPU-intensive--slam onto a single worker and crush it. That can lead to timing-related failures even in well-written tests.

@ajmurmann

@vito Accounting for past job cost would of course be nice. However, I think most of the issues that @metatype and I are seeing could probably be solved by something simple like round-robin or even random assignment.

@schubert

schubert commented Dec 8, 2016

+1. Even the option of round-robin or random placement would be preferred where the benefit of resource affinity for job location is vastly outweighed by certain workloads that would otherwise crush even the biggest worker instance types we can throw at them (we've tried).

Sometimes less smart is more better.

@jchesterpivotal
Contributor

Buildpacks is still observing this sporadically.

It is particularly bad if, say, for some reason you need to tear down a whole infrastructure in a hurry and then restart it. With more than 30 pipelines, workers become available over a short span upon relaunching, and the first worker or two to heartbeat the ATC get essentially murdered for their charitable efforts.

It takes a while for this to stabilise. Redeployments usually get us there, as well as pausing and unpausing pipelines to kill check containers.

Randomness and round-robin wouldn't completely solve this, but they would help.

@ajmurmann

@jchesterpivotal Maybe a simple setting for max allowed jobs per worker would help?
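
For what it's worth, a cap like that could sit in front of whatever placement strategy is in use. A minimal sketch, assuming a hypothetical per-worker container count (this is not an existing Concourse setting):

```go
package placement

// underCapacity filters candidate workers down to those running fewer than
// maxContainers containers; the placement strategy (affinity, random, etc.)
// would then choose among the survivors. If every worker is at the cap, the
// result is empty and the step would have to queue until a slot frees up.
func underCapacity(workers []string, containerCounts map[string]int, maxContainers int) []string {
	var eligible []string
	for _, name := range workers {
		if containerCounts[name] < maxContainers {
			eligible = append(eligible, name)
		}
	}
	return eligible
}
```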

@jchesterpivotal
Contributor

I think it would, so long as I'm confident containers get queued up.

In conversation here we realise that some of the lumpiness is due to having get-intensive jobs. If a job pulls in 10 resources, that means at least 20 containers (10x check, 10x get), which all need to be colocated with relevant task containers.

(Is this where I ha-ha-only-serious suggest Diego again, or should I wait a bit?)

@vito
Member

vito commented Dec 8, 2016

@jchesterpivotal Gets don't create a container if the cache is already warm on the chosen worker. Also, check containers are reused, and that would only be relevant for manually-triggered jobs, so those shouldn't be additional containers either.

@jchesterpivotal
Contributor

OK, so 10 in the steady state, 20 at startup time. But startup is still where we see the wedging.

Putting it another way: it's clear that I don't understand the exact cause and mechanism.

But I am able to observe this behaviour.

@larham

larham commented Dec 29, 2016

We are feeling the pain of slow builds in the greenplum pipeline due to the lack of Concourse load balancing via round-robin or another algorithm. Affinity is not distributing the load among workers. Manually tagging tasks can help, but it doesn't distribute all the work, and manual tagging seems silly.

@metatype

Any updates on where this falls on your backlog?

@vito
Member

vito commented Feb 13, 2017

Prioritized pretty highly, hopefully we can start it in parallel with #291 soon.

Current plan is to have Concourse store metrics about the "cost" of a step after it has run (basically just avg CPU% for now). Then, each worker will have an "estimated current cost", which is the sum of the estimated costs of all steps running on it, and when placing a new step, we'll pick the worker with the lowest current cost.
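
To make that bookkeeping concrete, here is a small illustrative sketch of the placement half (the names are invented for the example; the cost units could be avg CPU% as described above or anything comparable):

```go
package placement

import "sync"

// costTracker keeps the "estimated current cost" per worker: the sum of the
// estimated costs of the steps currently placed on it.
type costTracker struct {
	mu   sync.Mutex
	cost map[string]float64 // worker name -> summed estimated cost
}

func newCostTracker(workers []string) *costTracker {
	t := &costTracker{cost: make(map[string]float64)}
	for _, w := range workers {
		t.cost[w] = 0
	}
	return t
}

// Place picks the worker with the lowest estimated current cost and charges
// the new step's estimated cost to it. Returns "" if there are no workers.
func (t *costTracker) Place(stepCost float64) string {
	t.mu.Lock()
	defer t.mu.Unlock()

	chosen := ""
	for worker, cost := range t.cost {
		if chosen == "" || cost < t.cost[chosen] {
			chosen = worker
		}
	}
	if chosen != "" {
		t.cost[chosen] += stepCost
	}
	return chosen
}

// Finished releases a step's estimated cost once it completes.
func (t *costTracker) Finished(worker string, stepCost float64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.cost[worker] -= stepCost
}
```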

@cjcjameson
Contributor

Just checking -- is this expected to be better in 3.0+? Haven't tried yet. If not, do the 3.0 and 3.1 rewrites make it easier to address now? Thanks!

@andrewedstrom
Contributor

@cjcjameson As far as I know, the changes in 3.0 did not impact the distribution of jobs across workers. I'm not sure whether those changes make it easier to address this, but this issue is still very much on our radar!

@JohannesRudolph
Contributor

I think adding smarter scheduling (or, in general, more round-robin-like behavior instead of the current pinning of tasks to workers) would significantly increase Concourse's maturity for teams with many pipelines and builds.

We currently have ~15 pipelines and two 16GB workers, and bad distribution of jobs is hurting our build performance very much (e.g. workers alternating between full load and idle, and jobs and tasks not scaling out over the cluster).

@jdeppe-pivotal

Is any work happening on this issue? It's been 2 years since this was first raised and it is still a problem for us.

@marco-m
Contributor

marco-m commented Aug 10, 2018

FYI We found that random container placement (see #1741) is better than nothing.

@vito
Member

vito commented Feb 2, 2019

Per @jdeppe-pivotal's point about this being open for 2 years, I think it's time to finally close it, as we've implemented least-build-containers (...should that be fewest-?) for #2577. So it's "better" now! This will be released in v5.0.0 in the coming month, though the default will remain volume-locality for now.

The random container placement strategy is also an option.

There may of course be further improvements to make but I think those are best discussed in newer, more specific issues so we don't leave overarching ones like this open for too long.

Thanks all for participating!

@vito vito closed this as completed Feb 2, 2019
Runtime automation moved this from Research 🤔 to Done Feb 2, 2019
@topherbullock topherbullock moved this from Done to Accepted in Runtime Mar 12, 2019