Add ATC/worker flags to limit max build containers for workers #2928
Note: we want to refactor runtime code in #2926 before implementing this.
The proposal in #2577 wants to add
This option requires
We may want to change the flag names to say
Hey @mhuangpivotal ,
Do you think this is something that could be advertised by the worker at registration time?
E.g., you could have a worker "here" that sets the default max-containers for garden (250), but another "there" that sets "10"; with a per-worker setting, the scheduler could respect each worker's own advertised limit.
Should I close PR #2707 in favour of this? I prefer this approach overall, especially since I don't have to write it.
I would still set a default value below the default Garden limit (250), since there will be a lagging response to detecting hitting a capacity limit.
@ddadlani here is my understanding:
The fix in #3251, along with
On the other hand, I think that this ticket, which stems from #2577, is about controlling the load. In more detail: it would allow controlling the max number of task containers on a given worker. For example: as an operator, if I know that more than, say, 2 task containers kill my workers, then I can set max-tasks-per-worker to 2. If the total number of runnable tasks exceeds number_of_workers * max-tasks-per-worker, then the Concourse scheduler will not dispatch any task and will wait for the next scheduler run / next event. This would provide a rough queue.
If it is possible to obtain the equivalent behavior with Garden (after the fix for #3251), even better!
Does it make sense?
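The admission check described above can be sketched roughly as follows. This is a hypothetical illustration, not Concourse code: the `scheduler` type, its fields, and the `maxTasksPerWorker` name are assumptions standing in for the proposed flag.

```go
package main

import "fmt"

// scheduler is a hypothetical stand-in for the Concourse scheduler,
// carrying the proposed per-worker task limit.
type scheduler struct {
	numWorkers        int
	maxTasksPerWorker int
}

// canDispatch reports whether a new task may start, given how many task
// containers are already running across the worker pool. When the pool is
// full, the scheduler would simply wait for the next run / next event,
// which yields the "rough queue" behaviour.
func (s scheduler) canDispatch(runningTasks int) bool {
	capacity := s.numWorkers * s.maxTasksPerWorker
	return runningTasks < capacity
}

func main() {
	s := scheduler{numWorkers: 3, maxTasksPerWorker: 2}
	fmt.Println(s.canDispatch(5)) // one slot left: true
	fmt.Println(s.canDispatch(6)) // pool full: false, wait for next tick
}
```

The check is deliberately global (workers × limit) rather than per-worker; actual placement would still have to pick a worker that is below its own limit.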
I submitted a PR for the same idea, expressed with the inverse: a global max-in-flight. It was deliberately simple to enable quick adoption.
I strongly agree. Because Concourse does not maintain a safe work buffer, it becomes necessary as a safety measure to retain a capacity buffer. The CF buildpacks team, for example, retains enough workers to handle the possibility of around 40 pipelines operating simultaneously. But this is not the common case, so average utilisation is very low.
Similarly, this behaves badly in disaster-recovery scenarios. It's not uncommon for many pipelines to fire up simultaneously when a Concourse is restored or rebuilt from scratch. This is doubly problematic because the ATC will begin loading workers as soon as they begin to report, leading to a flapping restart when workers are added progressively (as BOSH does). In DR scenarios I have found that it becomes necessary to manually disable all pipelines and then unpause them one by one, waiting for each added load to stabilise before proceeding.
I shouldn't have to do this.
It sounds a lot like #2577, with the idea of an extra scheduling constraint: a task would "reserve" a container from the number of "available containers" in the "pool" that the worker has (like #2577 (comment), but with containers instead of CPU / RAM).
With a task "reserving"
By keeping track of how much work is reserved and how much capacity we have, one could also have a better metric for autoscaling, without needing to keep a large buffer of resources as @jchesterpivotal described.
As mentioned in the comment in #2577, this could extend to other resources too: not only the number of containers, but CPU & memory as well.
Does that make sense?
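The reservation idea above can be sketched as a simple per-worker pool. Again, this is an illustrative assumption rather than anything in the Concourse codebase: the `workerPool` type and its methods are made up for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// workerPool is a hypothetical per-worker container pool: the scheduler
// reserves a slot before placing a task, so it counts committed work
// instead of discovering the Garden limit after the fact.
type workerPool struct {
	maxContainers int
	reserved      int
}

var errPoolFull = errors.New("no container capacity left on worker")

// reserve claims one container slot ahead of placement.
func (p *workerPool) reserve() error {
	if p.reserved >= p.maxContainers {
		return errPoolFull
	}
	p.reserved++
	return nil
}

// release returns a slot once the task's container is gone.
func (p *workerPool) release() {
	if p.reserved > 0 {
		p.reserved--
	}
}

func main() {
	p := &workerPool{maxContainers: 2}
	fmt.Println(p.reserve()) // slot 1 claimed
	fmt.Println(p.reserve()) // slot 2 claimed
	fmt.Println(p.reserve()) // pool full: error
	p.release()
	fmt.Println(p.reserve()) // a slot freed up, claim succeeds again
}
```

The `reserved` counter is also exactly the "how much work is committed" number that could feed an autoscaling metric, as suggested above.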
hello @cirocosta, yes, I agree 100%.
This task is the same idea as #2577, as you mention. My understanding is that this ticket, #2928, was created by @mhuangpivotal to track a specific activity, while #2577 is considered more of a "discussion" ticket.
As you mention, then