issue with small pool sizes #47

MoritzKeppler · 2022-10-21T15:41:20Z

A minor issue we found during testing with small pool sizes (in our case: min=0, max=2):
If max pool size is reached garm will ignore additional jobs queued by GH.
As soon as the currently running jobs are completed, the pool is scaled down to min=0 again and the jobs waiting in GH will never be scheduled until we manually resend the webhook event / manually trigger the creation of a new runner.

_{Moritz Keppler moritz.keppler@mercedes-benz.com, Mercedes-Benz Tech Innovation GmbH, legal info/Impressum}

gabriel-samfira · 2022-10-21T15:58:58Z

Indeed. That is a bit of a pickle. Garm creates new runners based on 2 things:

an event from github
ensuring min_idle_runners

And will obey the max runner limit ignoring "queued" events if that limit is reached. This means that one decision factor is removed in that circumstance. And min_idle_runner=0 will eliminate the other.

An immediate solution would be to have at least 1 set on min_idle_runners. A good solution would be to queue the new runner in the database instead of ignoring the event, and not spin it up until we can accommodate it (number of runners falls under max). Unfortunately, this one will take some time to implement due to backlog. But it will be fixed.

I for the time being would suggest to define pools at the org/enterprise level (which should reduce the number of pools you will need), and have at least 1 runner per pool.

MoritzKeppler · 2022-10-22T09:14:34Z

thanks! definitely not a critical issue as we have a workaround.

Queued runners in the database could be a solution to at least guarantee that there will be an sufficient amount of runners. But what would happen if the job gets cancelled in the meanwhile or executed on another runner or ...? Maybe easier to have an algorithm that just reacts on the current load, e.g. "always start some extra runners that get automatically shutdown again if not used within a timeframe"... but only a vague idea, not sure if it will work.

gabriel-samfira · 2022-10-23T14:14:13Z

The main issue I see now, is that we ignore events sent via web hooks if max_runners is reached. That is not a good thing to do, and the issue you're seeing is proof of that.

I think a good approach would be to record web hook events as they come in, and use that as a source of truth to make decisions. It also gives us greater transparency into which event triggered the creation of a runner.

We currently receive web hooks for 3 events:

A workflow is queued. This means a workflow has been requested, but no runner picked it up yet.
A workflow transitions to in_progress. This means that one of our runners or some other runner has picked up a workflow in our repo. If it's one of ours, we currently mark the runner as active.
A workflow transitions to completed. This means that a workflow has been canceled or has finished running. The outcome of the workflow is not relevant (error or success). If a workflow is canceled before a runner picks it up, the runner_name property is not set.

What we could do is to create a new entity in our store that tracks web hook events and their state. Then we can use this to scale our runners. As long as we receive web hooks, we can keep this updated. The challenge is to make this robust enough to still work during periods of github outages, prolonged downtime periods for garm, or even restarts. We also need to keep in mind that we may have pools and web hooks configured on all 3 levels (repo, org, enterprise), so a workflow will trigger an event on all 3 for the same workflow.

This is a non-trivial change that will require some thought and time, but will ultimately improve the way garm makes decisions.

gabriel-samfira mentioned this issue Apr 17, 2023

Add job tracking #86

Merged

gabriel-samfira closed this as completed in #86 Jul 4, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

issue with small pool sizes #47

issue with small pool sizes #47

MoritzKeppler commented Oct 21, 2022

gabriel-samfira commented Oct 21, 2022 •

edited

Loading

MoritzKeppler commented Oct 22, 2022

gabriel-samfira commented Oct 23, 2022

issue with small pool sizes #47

issue with small pool sizes #47

Comments

MoritzKeppler commented Oct 21, 2022

gabriel-samfira commented Oct 21, 2022 • edited Loading

MoritzKeppler commented Oct 22, 2022

gabriel-samfira commented Oct 23, 2022

gabriel-samfira commented Oct 21, 2022 •

edited

Loading