
Add grace period to scale-down #78

Merged

Conversation

gabriel-samfira
Member

Add a 5 minute grace period for idle runners. A new idle runner will not be considered for scale-down unless it is older than 5 minutes. This should prevent situations where the scaleDown() routine, which runs every minute, evaluates candidates for reaping and erroneously counts the new runner as well. The in_progress hook that transitions an idle runner to "active" may arrive long after the "queued" hook has spun up a runner.

Signed-off-by: Gabriel Adrian Samfira gsamfira@cloudbasesolutions.com
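
A minimal sketch of the grace-period check described above, assuming a hypothetical `Runner` type with `Status` and `CreatedAt` fields; the names are illustrative and do not mirror garm's actual internals:

```go
package pool

import "time"

// scaleDownGracePeriod is the 5 minute window during which a freshly created
// idle runner is ignored by the scale-down pass (assumed constant name).
const scaleDownGracePeriod = 5 * time.Minute

// Runner is a simplified stand-in for a runner record (hypothetical).
type Runner struct {
	Status    string    // e.g. "idle" or "active"
	CreatedAt time.Time // when the runner was spun up
}

// scaleDownCandidates filters a pool's runners down to idle runners older
// than the grace period, so the once-a-minute scaleDown() pass never reaps a
// runner whose "in_progress" hook simply has not arrived yet.
func scaleDownCandidates(runners []Runner, now time.Time) []Runner {
	var candidates []Runner
	for _, r := range runners {
		if r.Status != "idle" {
			continue
		}
		if now.Sub(r.CreatedAt) < scaleDownGracePeriod {
			// Still inside the grace period; leave it alone.
			continue
		}
		candidates = append(candidates, r)
	}
	return candidates
}
```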

@gabriel-samfira
Member Author

@maigl Could you see if this solves some of the churn you were seeing previously? This should have the same effect and allow for zero idle runners on pools.

@maigl
Contributor

maigl commented Feb 7, 2023

I'll check this out..

@maigl
Contributor

maigl commented Feb 7, 2023

I'm skeptical that this will help us. The problem was that we saw scaling up on queue events and, at the same time, scaling down due to too many idling runners.
So the fix to not spin up any more runners when we already have enough idling ones seems to work pretty well in our setup. If we now remove this, we will again spin up way too many runners ..

@gabriel-samfira
Member Author

I think I misunderstood the initial problem. I think you're dealing with situations where non-unique tag sets are used in workflows, and garm selects the wrong pool when it reacts to the queued event. A runner that matches the required labels from Pool A picks up the job, but when garm looks up pools by labels it selects Pool B to spin up the runner in.

The problem with using non-unique labels when targeting runners in workflows is that the wrong type of runner may pick up the job automatically (garm is not involved in this).

For example, if you create a pool with GPUs enabled and one with large amounts of storage for the same repo, runners in both pools will react to the self-hosted label. If your workflow requires GPUs, the job may end up on a server with large amounts of storage instead. The best way to avoid this scenario is to make sure all your workflows avoid using the default tags (self-hosted, x64/arm64, linux/windows/macos).

I will change this PR. Will leave in the checks for enough idle runners, but will add a check for pools with MinIdleRunners set to 0. What do you think?

Add a 5 minute grace period for idle runners. A new idle runner will
not be considered for scale-down unless it is older than 5 minutes.
This should prevent situations where the scaleDown() routine, which
runs every minute, evaluates candidates for reaping and erroneously
counts the new runner as well. The in_progress hook that transitions
an idle runner to "active" may arrive long after the "queued" hook has
spun up a runner.

Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
@gabriel-samfira
Member Author

gabriel-samfira commented Feb 7, 2023

I left in the grace period for scale-down and also added an extra check when deciding to skip adding an idle runner. We check if there is at least one idle runner available in the pool. This should work even for pools that have MinIdleRunners set to 0.

The idea is that we want the pool to still react to the queued event, even if it has MinIdleRunners set to 0.

What do you think?
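
A rough sketch of that extra check, with a hypothetical function name that does not reflect garm's actual code: on a queued event the pool only skips creating a runner when at least one idle runner is already available, regardless of the MinIdleRunners setting.

```go
package pool

// shouldAddRunnerOnQueued decides whether a "queued" webhook should spin up a
// new runner for a pool. Even with MinIdleRunners set to 0 the pool still
// reacts to the event; it only skips creating a runner when at least one idle
// runner is already waiting to pick up the job.
func shouldAddRunnerOnQueued(idleRunners int) bool {
	return idleRunners == 0
}
```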

@gabriel-samfira
Member Author

The problem with using non-unique labels when targeting runners in workflows is that the wrong type of runner may pick up the job automatically (garm is not involved in this).

Once we re-work the way garm reacts to queued events, I think we can prevent spinning up runners in the wrong pool, and simply create a new runner in the pool that lost one when it picked up a job. But that won't fix the potential problem of non-unique label sets handing a job to the wrong type of runner (garm is not involved in that): the job may still end up on a runner from the wrong pool.

@maigl
Contributor

maigl commented Feb 7, 2023

.. this looks fine to me ..

.. but I think we have another problem here:
for redundancy reasons we have multiple pools with identical label sets. Currently garm only looks for the First() matching pool it finds in the db, i.e. every scale event goes to the very same pool. I think we need to fix this.
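
For illustration only, a sketch of what round-robin selection across matching pools could look like; the names are hypothetical and this is not garm's current behavior, which always takes the first match:

```go
package pool

import "sync/atomic"

// Pool is a placeholder for a pool record (hypothetical).
type Pool struct {
	ID string
}

// rrCounter keeps a process-wide position in the rotation.
var rrCounter uint64

// pickPoolRoundRobin spreads queued events across all pools whose labels match
// the job, instead of always returning the first match from the database.
func pickPoolRoundRobin(matching []Pool) *Pool {
	if len(matching) == 0 {
		return nil
	}
	idx := (atomic.AddUint64(&rrCounter, 1) - 1) % uint64(len(matching))
	return &matching[idx]
}
```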

@gabriel-samfira
Member Author

for redundancy reasons we have multiple pools with identical label sets

This is interesting.

Currently garm only looks for the First() matching pool it finds in the db

The thought process here was that each pool should have a unique set of tags, mostly because there is no way to influence which runners react to a workflow from GitHub if more than one pool has the same tags. So it becomes impossible to scale automatically in an efficient way.

Would you mind writing up a user story of how you use this? Do you have multiple AZs/regions? How do you envision distributing the runners? Things of this nature. The more detailed the better, as it gives us more data points to take into account when we eventually expand how pools are managed.

@maigl
Contributor

maigl commented Feb 8, 2023

yes .. we do have multiple AZs and we must run garm in an HA setup to reach a few nines of availability. Currently we have only one garm, but it has pools in multiple AZs. I think it would be sufficient if queue events were distributed to a pool randomly or round-robin; I don't see the need for an elaborate strategy.
But currently all queue events go to only one pool.

In a future setup we will probably have multiple garm instances. Currently we think that they would not need to know about each other or share any data.

@gabriel-samfira
Member Author

Thanks for the input!

@gabriel-samfira gabriel-samfira merged commit 24f61ce into cloudbase:main Feb 8, 2023
@gabriel-samfira
Member Author

There is an issue with skipping the creation of a new runner when we already have idling runners.

In some situations, we may have many queued events sent in rapid succession. This can happen if we start a workflow with multiple jobs, maybe in a matrix. There may be a significant delay between that rapid succession of "queued" events and the time a runner picks up one of the jobs and an "in_progress" event is received. So if we get 10 "queued" events and we only have 1 runner, we essentially ignore 9 events.

I will have to remove that check for now, as scaling down idling runners is less time-consuming than waiting for that 1 idle runner to finish in order to pick up one of the other 9 jobs.

I will allocate time next week to implement job tracking so we can fix multiple problems.

@maigl
Contributor

maigl commented Apr 8, 2023

Currently, we don't seem to have much trouble with that. But job tracking sounds promising.

@gabriel-samfira gabriel-samfira mentioned this pull request Apr 17, 2023
@gabriel-samfira
Member Author

Some good news. An option to skip the default labels in the self-hosted GitHub runners has been added via this PR:

actions/runner#2443

As soon as a stable release is cut with this option, I will add it to garm. We will finally be able to define just custom labels without the usual self-hosted, x64, linux labels, making it easier to properly target runners.
