
Add job tracking #86

Merged 30 commits on Jul 4, 2023

Conversation

gabriel-samfira (Member)

This branch changes the way garm decides when to spawn a new runner by implementing some rudimentary job tracking. In short:

  • A new model that tracks jobs has been added
  • When github sends a webhook as a result of a job event, we record that job and its state in the database
  • We no longer create a new runner when a "queued" job shows up, we just record the job
  • When a job transitions to "in_progress", we call ensureIdleRunnersForOnePool() for the pool that owns the runner which picked up the job. Doing so replaces the idle runner if the minimum idle runner count is set to something other than zero, while still obeying the max runners limit.
  • A new function called consumeQueuedJobs() was added. It attempts to create new runners in response to jobs that are in the "queued" state, and runs before the consolidate() function. Its purpose is to consume all queued jobs by creating any needed runners until the queue is drained. There is a back-off period of 15 seconds before it will attempt to create a new runner for a job. The consolidate loop runs every 5 seconds, so an idle runner has at least ~10 seconds to pick up the job; after that, a new runner is added.

Some notes:

We cannot control which runners pick up which jobs. We may have pools of runners defined on multiple hierarchy levels that are suitable to run a particular job. For example, if a job requests a runner with only the self-hosted label, pools on all levels (repo, org and enterprise) will be able to handle that job.

We may create a runner in pool A as a response to a queued job, but a runner from pool B may pick it up. The runner we spawned may pick up a completely different job than the one that prompted us to create the runner. This does not present an issue as long as all runners were spawned to handle jobs with the same label set. Even if we do end up with some dangling runners, the scale-down logic we have will clean those up after a short period of being idle. While there is no way to guarantee there will be no runner churn, it should be minimal.

For best results, make sure to use unique sets of labels to target runners in a particular pool. Something like this should work:

  • self-hosted
  • x64
  • linux
  • ubuntu-latest
  • gpu

Or if you want to keep it brief, simply use:

  • ubuntu-latest-gpu

If you want to target a particular pool, add a label that narrows the match down to just that pool:

  • ubuntu-latest-gpu
  • ml-team
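
The matching rule implied above — a pool can serve a job only if its label set contains every label the job requests — can be illustrated with a small Go sketch. The poolCanHandle helper is hypothetical, invented for this example; only the label values come from the lists above.

```go
package main

import "fmt"

// poolCanHandle reports whether a pool's label set covers every label a
// job requests. GitHub routes a job to any runner carrying all of the
// requested labels, which is why non-unique label sets can match
// multiple pools. Illustrative helper, not part of garm.
func poolCanHandle(poolLabels, jobLabels []string) bool {
	have := make(map[string]bool, len(poolLabels))
	for _, l := range poolLabels {
		have[l] = true
	}
	for _, l := range jobLabels {
		if !have[l] {
			return false
		}
	}
	return true
}

func main() {
	gpuPool := []string{"self-hosted", "x64", "linux", "ubuntu-latest", "gpu"}
	mlPool := []string{"ubuntu-latest-gpu", "ml-team"}

	// a job asking only for "self-hosted" matches any pool carrying that label
	fmt.Println(poolCanHandle(gpuPool, []string{"self-hosted"})) // true
	// adding "ml-team" narrows the job down to the ml pool only
	fmt.Println(poolCanHandle(gpuPool, []string{"ubuntu-latest-gpu", "ml-team"})) // false
	fmt.Println(poolCanHandle(mlPool, []string{"ubuntu-latest-gpu", "ml-team"}))  // true
}
```
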

Fixes: #47
Related: #78 #73

@gabriel-samfira (Member, Author)

gabriel-samfira commented Apr 17, 2023

CC @maigl @MoritzKeppler

If you're feeling brave, feel free to test this out. This branch should, in theory, properly handle zero min idle runner pools with a small max runner value, without losing jobs. Once slots become available, new runners will be added to handle the remaining queued jobs.

@maigl (Contributor)

maigl commented Apr 18, 2023

sounds good .. we'll check it out when we find some time .. (maybe not brave enough :))
But the new provider currently has higher priority.

@gabriel-samfira (Member, Author)

But the new provider currently has higher priority.

There is no rush at all. Whenever you get a chance. No issues popped up during testing for me, but your use case is more varied than mine. When you do get a chance to test this, ping back.

@MoritzKeppler

first of all: thanks one more time for the PR!

some thoughts:

  • wonder if we could run some component tests that simulate various scenarios.
    Should be possible to mock GitHub and an external provider; little bit tricky with all the async activities...
  • how will the algorithm work if garm is scaled horizontally?
  • what will happen if we miss some of the events?

@gabriel-samfira (Member, Author)

first of all: thanks one more time for the PR!

some thoughts:

* wonder if we could run some component tests that simulate various scenarios.
  Should be possible to mock GitHub and an external provider; little bit tricky with all the async activities...

Yup. Should be doable. I need to write proper tests for the whole pool manager.

* how will the algorithm work if garm is scaled horizontally?

Sadly, not well. There is currently no global lock. There is a plan to try out etcd as a store, which gives us some nice primitives to work with like watchers and global locks. But I have no ETA on that.

* what will happen if we miss some of the events?

They will linger on github until they expire. Sadly, that can't be helped. See this unnecessarily long comment for more details:

https://github.com/cloudbase/garm/pull/86/files#diff-0928304a4f9afb0bf84875a5a19d4d42692d7ed54843569166a233b61cbb7603R1190

Coincidentally, github itself does the same thing. When there is an issue with the gh runners, jobs started while the issue manifested will never be picked up. My guess is that internally they use the same hooks to queue the jobs for their own runners.

@maigl (Contributor)

maigl commented Jun 22, 2023

thx .. will try this out today :)

@gabriel-samfira (Member, Author)

thx .. will try this out today :)

Curious how it works out compared to the old workflow.

@gabriel-samfira gabriel-samfira force-pushed the add-job-tracking branch 5 times, most recently from 175319f to e6fd812 Compare June 29, 2023 06:37
@gabriel-samfira gabriel-samfira force-pushed the add-job-tracking branch 2 times, most recently from a4bd85d to 5d7cf5b Compare July 1, 2023 21:22
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
  * Removes completed jobs from the db
  * Skip ensure min idle runners for pools with min idle runners set to 0

Break the lock on a job if it's still queued and the runner that it
triggered was assigned to another job. This may cause leftover runners
to be created, but we scale those down in ~3 minutes.

  * enable foreign key constraints on sqlite
  * on delete cascade for addresses and status messages
  * add debug server config option
  * fix rr allocation

For now, the additional labels would only contain the job ID that triggered
the creation of the runner. It does not make sense to add this label to the
actual runner that registers against github. We can simply use it internally
by fetching it from the DB.

@gabriel-samfira gabriel-samfira merged commit 0889f6c into cloudbase:main Jul 4, 2023
4 checks passed
Merging this pull request may close these issues:

  • issue with small pool sizes