Better distribution of jobs across workers #675

Closed
oppegard opened this issue Sep 29, 2016 · 27 comments

@oppegard
Contributor

Concourse's present (v2.2.1) task scheduler doesn't appear to distribute builds across workers based on CPU load, disk I/O, or simple round-robin.

When we first moved from CircleCI to Concourse we were using a single, smaller worker, which was scaled up to handle more jobs over time (c3.2xlarge -> c3.8xlarge). We've reached the limit of scaling vertically (x1.32xlarge notwithstanding). As a result, developers on the team might wait 30 minutes for builds to finish that would normally take 10 minutes.

Scaling horizontally requires zero effort on our part (as we've done by BOSH-deploying 4 workers), but it's not helping much. Output from bosh instances --vitals:

+--------------------------------------------------+---------+----+------------+------------+-----------------------+--------------------+--------------+------------+------------+------------+------------+
| Instance                                         | State   | AZ | VM Type    | IPs        |         Load          |       CPU %        | Memory Usage | Swap Usage | System     | Ephemeral  | Persistent |
|                                                  |         |    |            |            | (avg01, avg05, avg15) | (User, Sys, Wait)  |              |            | Disk Usage | Disk Usage | Disk Usage |
+--------------------------------------------------+---------+----+------------+------------+-----------------------+--------------------+--------------+------------+------------+------------+------------+
| web/0 (79c07c59-6548-4b01-8856-495ba1c2b327)*    | running | z1 | t2.medium  | XX.X.16.11 | 0.24, 0.18, 0.21      | 3.7%, 0.4%, 0.1%   | 11% (426.1M) | 0% (0B)    | 44%        | 6%         | n/a        |
+--------------------------------------------------+---------+----+------------+------------+-----------------------+--------------------+--------------+------------+------------+------------+------------+
| worker/0 (7d2353fe-1610-4c9f-94f4-174f1400e0ed)* | running | z1 | c3.8xlarge | XX.X.16.12 | 84.96, 115.32, 115.82 | 65.2%, 30.5%, 0.1% | 69% (40.6G)  | 6% (3.4G)  | 44%        | 27%        | n/a        |
| worker/1 (46b34a6b-292e-4552-97a8-d5747f62aadd)  | running | z1 | c3.8xlarge | XX.X.16.13 | 0.77, 6.18, 9.43      | 3.3%, 0.9%, 0.2%   | 13% (7.5G)   | 0% (4.8M)  | 44%        | 9%         | n/a        |
| worker/2 (863020a1-555a-4c47-8371-6c797d7d1672)  | running | z1 | c3.8xlarge | XX.X.16.16 | 0.45, 0.24, 0.31      | 0.1%, 0.1%, 0.0%   | 8% (4.5G)    | 0% (24.0K) | 44%        | 8%         | n/a        |
| worker/3 (7ac184a4-7162-4fc9-be58-4a7a30adc6c1)  | running | z1 | c3.8xlarge | XX.X.16.15 | 0.17, 0.34, 0.83      | 0.3%, 0.1%, 0.0%   | 10% (5.6G)   | 0% (0B)    | 44%        | 6%         | n/a        |
+--------------------------------------------------+---------+----+------------+------------+-----------------------+--------------------+--------------+------------+------------+------------+------------+

There are four c3.8xlarge workers (32 vCPU / 60GB memory). One worker is running the bulk of the jobs and is overutilized, while the other three are essentially idle. While tagging workers and tasks is a blunt instrument, we'd prefer not to go down that route.

@concourse-bot
Collaborator

concourse-bot commented Sep 29, 2016

Hi there!

We use Pivotal Tracker to provide visibility into what our team is working on. A story for this issue has been automatically created.

The current status is as follows:

  • #139732069 I can see the load estimate for each worker via the API
  • #139735927 Step placement should be determined by worker load estimate
  • #140235033 Better distribution of jobs across workers
  • #139734393 I can see the CPU usage and timing information of each step via the API

This comment, as well as the labels on the issue, will be automatically updated as the status in Tracker changes.

@jchesterpivotal
Contributor

jchesterpivotal commented Oct 5, 2016

We (buildpacks) see this behaviour too, though we also see a slow equalisation over time. My guess is that this is because containers aren't relocatable, so adding new workers won't redistribute your historical containers.

A hacky workaround we stumbled on today was to pause and then unpause all our pipelines. We have something like 300 check containers. Pausing and unpausing causes these to be destroyed and recreated. The load is still lumpy around containers retained for other purposes (failed builds, mostly).

@vito
Member

vito commented Oct 5, 2016

There may be some clever things we can do here to have things auto-balance based on metrics collected by a container after its workload ran. Garden can tell us how much time was spent in CPU, and there may be more metrics we can request from them in the future (disk i/o, network, etc).

The gist of my idea is that, after running a task, we'll record the "cost" (total time spent in CPU for starters) for the task in the build. We would then keep track of the "expected cost" of tasks running on a worker, and try to place tasks on workers with the lowest expected cost of their current workload.

Something like that. This could also apply to all containers (get and put and check).
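
As a purely illustrative sketch of the first half of that idea, here is one way to record a per-step cost from the CPU time observed after each run (the type names, the moving-average smoothing, and the default cost are assumptions for the example, not Garden's or the ATC's actual API):

```go
package placement

// stepCosts remembers an estimated CPU cost per step, keyed by some stable
// identifier (e.g. pipeline/job/step name), and is refined every time the
// step runs.
type stepCosts map[string]float64

// Record folds a newly observed CPU time (the kind of number Garden can
// report for a container) into the step's estimate. An exponential moving
// average keeps one unusually slow build from skewing future placement.
func (c stepCosts) Record(stepID string, observedCPUSeconds float64) {
	const alpha = 0.5
	if prev, ok := c[stepID]; ok {
		c[stepID] = alpha*observedCPUSeconds + (1-alpha)*prev
		return
	}
	c[stepID] = observedCPUSeconds
}

// Estimate returns the expected cost of a step, falling back to an arbitrary
// default for steps that have never run before.
func (c stepCosts) Estimate(stepID string) float64 {
	if cost, ok := c[stepID]; ok {
		return cost
	}
	return 1.0
}
```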

@jchesterpivotal
Contributor

You could write an auction mechanism! And then you'll have a healthchecker and etcd to back it. Somewhere is a system for structured descriptions of tasks to underlie it all.

I think that a name like Ciego would cover this.

@vito
Member

vito commented Oct 5, 2016

@jchesterpivotal Or we could keep things simple and keep the project approachable. :)

@jchesterpivotal
Contributor

Damn you and your troll-detection!

@cjcjameson
Contributor

Comment from @d regarding the hacky workaround you mention -- he observes that you can still run fly check-resource on a paused pipeline. So he's initially skeptical that pausing pipelines actually stops the check containers, though he acknowledges that the implementation of fly check-resource might be separate somehow.

Jesse -- comment more if you like

@jchesterpivotal
Contributor

All I know is that pausing all our pipelines reduces our total live container count by about 300. Without having read the code closely, the obvious candidate is check containers.

As I understand it, fly check-resource creates a new container, independent of any check containers arising from normal pipeline operations. So that'd be consistent with both our observations.

@RochesterinNYC

We observed this behavior again when one of our test suites booted up 18 tasks running rspec and all 18 of them landed on the same worker. The worker's at 100% CPU usage and we're unsure if it's thrashing or if the tasks are still running properly (but slowly).

@blizzo521

Would love to see some progress on this. Here on the Tracker team, we also have lots of issues with builds taking forever and then getting reaped when they hit the timeout; then we have to force them to run again and hope things are better next time. The primary job that runs to check our daily work has a history of being about 60% red, and the majority of those failures are due to this issue.

@metatype

metatype commented Dec 8, 2016

Yep, we're continuing to hit this issue as well. We have multiple pipelines (master + maintenance) with lots of parallel jobs. When we do a maintenance fix that spans multiple branches, all the pipelines get activated and the jobs--which are CPU-intensive--slam onto a single worker and crush it. That can lead to timing-related failures even in well-written tests.

@ajmurmann

@vito Accounting for past job cost would of course be nice. However, I think most of the issues that @metatype and I are seeing could probably be solved by something simple like round-robin or even random assignment.

@schubert

schubert commented Dec 8, 2016

+1. Even the option of round-robin or random placement would be preferred where the benefit of resource affinity for job location is vastly outweighed by certain workloads that would otherwise crush even the biggest worker instance types we can throw at them (we've tried).

Sometimes less smart is more better.

@jchesterpivotal
Contributor

Buildpacks is still observing this sporadically.

It is particularly bad if, say, for some reason you need to tear down a whole infrastructure in a hurry and then restart it. With more than 30 pipelines, workers become available over a short span upon relaunching, and the first worker or two to heartbeat the ATC get essentially murdered for their charitable efforts.

It takes a while for this to stabilise. Redeployments usually get us there, as well as pausing and unpausing pipelines to kill check containers.

Randomness and round-robin wouldn't completely solve this, but they would help.

@ajmurmann

@jchesterpivotal Maybe a simple setting for max allowed jobs per worker would help?
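
For what it's worth, a cap like that could sit in front of whatever placement strategy is in use. A minimal sketch, assuming a hypothetical per-worker container count (this is not an existing Concourse setting):

```go
package placement

// underCapacity filters candidate workers down to those running fewer than
// maxContainers containers; the placement strategy (affinity, random, etc.)
// would then choose among the survivors. If every worker is at the cap, the
// result is empty and the step would have to queue until a slot frees up.
func underCapacity(workers []string, containerCounts map[string]int, maxContainers int) []string {
	var eligible []string
	for _, name := range workers {
		if containerCounts[name] < maxContainers {
			eligible = append(eligible, name)
		}
	}
	return eligible
}
```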

@jchesterpivotal
Contributor

I think it would, so long as I'm confident containers get queued up.

In conversation here we realise that some of the lumpiness is due to having get-intensive jobs. If a job pulls in 10 resources, that means at least 20 containers (10x check, 10x get), which all need to be colocated with relevant task containers.

(Is this where I ha-ha-only-serious suggest Diego again, or should I wait a bit?)

@vito
Member

vito commented Dec 8, 2016

@jchesterpivotal Gets don't create a container if the cache is already warm on the chosen worker. Also, check containers are reused, and that would only be relevant for manually-triggered jobs, so those shouldn't be additional containers either.

@jchesterpivotal
Contributor

OK, so 10 in the steady state, 20 at startup time. But startup is still where we see the wedging.

Putting it another way: it's clear that I don't understand the exact cause and mechanism.

But I am able to observe this behaviour.

@larham

larham commented Dec 29, 2016

We are feeling the pain of slow builds in the greenplum pipeline due to the lack of Concourse load balancing via round-robin or another algorithm. Affinity is not distributing the load among workers. Manually tagging tasks can help, but it doesn't distribute all the work, and manual tagging seems silly.

@metatype

Any updates on where this falls on your backlog?

@vito
Member

vito commented Feb 13, 2017

Prioritized pretty highly, hopefully we can start it in parallel with #291 soon.

Current plan is to have Concourse store metrics about the "cost" of a step after it has run (basically just avg CPU% for now). Then, each worker will have an "estimated current cost", which is the sum of the estimated costs of all steps running on it, and when placing a new step, we'll pick the worker with the lowest current cost.
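
To make that bookkeeping concrete, here is a small illustrative sketch of the placement half (the names are invented for the example; the cost units could be avg CPU% as described above or anything comparable):

```go
package placement

import "sync"

// costTracker keeps the "estimated current cost" per worker: the sum of the
// estimated costs of the steps currently placed on it.
type costTracker struct {
	mu   sync.Mutex
	cost map[string]float64 // worker name -> summed estimated cost
}

func newCostTracker(workers []string) *costTracker {
	t := &costTracker{cost: make(map[string]float64)}
	for _, w := range workers {
		t.cost[w] = 0
	}
	return t
}

// Place picks the worker with the lowest estimated current cost and charges
// the new step's estimated cost to it. Returns "" if there are no workers.
func (t *costTracker) Place(stepCost float64) string {
	t.mu.Lock()
	defer t.mu.Unlock()

	chosen := ""
	for worker, cost := range t.cost {
		if chosen == "" || cost < t.cost[chosen] {
			chosen = worker
		}
	}
	if chosen != "" {
		t.cost[chosen] += stepCost
	}
	return chosen
}

// Finished releases a step's estimated cost once it completes.
func (t *costTracker) Finished(worker string, stepCost float64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.cost[worker] -= stepCost
}
```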

@cjcjameson
Contributor

Just checking -- is this expected to be better in 3.0+? Haven't tried yet. If not, do the 3.0 and 3.1 rewrites make it easier to address now? Thanks!

@andrewedstrom
Contributor

@cjcjameson As far as I know, the changes in 3.0 did not impact the distribution of jobs across workers. I'm not sure whether those changes make it easier to address this, but this issue is still very much on our radar!

@JohannesRudolph
Contributor

I think adding smarter scheduling (or, in general, more round-robin-like behavior instead of the current pinning of tasks to workers) would significantly increase Concourse's maturity for teams with many pipelines and builds.

We currently have ~15 pipelines and two 16GB workers, and bad distribution of jobs is hurting our build performance very much (e.g. workers alternating between full load and idle, and jobs and tasks not scaling out over the cluster).

@jdeppe-pivotal

Is any work happening on this issue? It's been 2 years since this was first raised and it is still a problem for us.

@marco-m
Contributor

marco-m commented Aug 10, 2018

FYI We found that random container placement (see #1741) is better than nothing.

@vito
Member

vito commented Feb 2, 2019

Per @jdeppe-pivotal's point about this being open for 2 years, I think it's time to finally close it, as we've implemented least-build-containers (...should that be fewest-?) for #2577. So it's "better" now! This will be released in v5.0.0 in the coming month, though the default will remain volume-locality for now.

The random container placement strategy is also an option.

There may of course be further improvements to make but I think those are best discussed in newer, more specific issues so we don't leave overarching ones like this open for too long.

Thanks all for participating!

@vito vito closed this as completed Feb 2, 2019
Runtime automation moved this from Research 🤔 to Done Feb 2, 2019
@topherbullock topherbullock moved this from Done to Accepted in Runtime Mar 12, 2019