
Add job tracking #86

Merged 30 commits on Jul 4, 2023

Conversation

gabriel-samfira (Member)

This branch changes the way garm decides when to spawn a new runner by implementing some rudimentary job tracking. In short:

  • A new model that tracks jobs has been added
  • When github sends a webhook as a result of a job event, we record that job and its state in the database
  • We no longer create a new runner when a "queued" job shows up, we just record the job
  • When a job transitions to "in_progress", we call ensureIdleRunnersForOnePool() for the pool that owns the runner which picked up the job. Doing so replaces the idle runner if the minimum idle runner count is set to something other than zero, while still obeying the max runners limit.
  • A new function called consumeQueuedJobs() was added. It attempts to create new runners in response to jobs that are in the "queued" state, and runs before the consolidate() function. Its purpose is to consume all queued jobs by creating any needed runners until the queue is drained. There is a back-off period of 15 seconds before it will attempt to create a new runner for a job. The consolidate loop runs every 5 seconds, so an idle runner has at least ~10 seconds to pick up the job; after that, a new runner is added.

Some notes:

We cannot control which runners pick up which jobs. We may have pools of runners defined on multiple hierarchy levels that are suitable to run a particular job. For example, if a job requests a runner with only the self-hosted label, pools on all levels (repo, org and enterprise) will be able to handle that job.

We may create a runner in pool A as a response to a queued job, but a runner from pool B may pick it up. The runner we spawned may pick up a completely different job than the one that prompted us to create the runner. This does not present an issue as long as all runners were spawned to handle jobs with the same label set. Even if we do end up with some dangling runners, the scale-down logic we have will clean those up after a short period of being idle. While there is no way to guarantee there will be no runner churn, it should be minimal.

For best results, make sure to use unique sets of labels to target runners in a particular pool. Something like this should work:

  • self-hosted
  • x64
  • linux
  • ubuntu-latest
  • gpu

Or if you want to keep it brief, simply use:

  • ubuntu-latest-gpu

If you want to target a particular pool, add a label that narrows the match down to just that pool:

  • ubuntu-latest-gpu
  • ml-team
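
The matching rule implied above — a pool can serve a job only if its label set contains every label the job requests — can be illustrated with a small Go sketch. The poolCanHandle helper is hypothetical, invented for this example; only the label values come from the lists above.

```go
package main

import "fmt"

// poolCanHandle reports whether a pool's label set covers every label a
// job requests. GitHub routes a job to any runner carrying all of the
// requested labels, which is why non-unique label sets can match
// multiple pools. Illustrative helper, not part of garm.
func poolCanHandle(poolLabels, jobLabels []string) bool {
	have := make(map[string]bool, len(poolLabels))
	for _, l := range poolLabels {
		have[l] = true
	}
	for _, l := range jobLabels {
		if !have[l] {
			return false
		}
	}
	return true
}

func main() {
	gpuPool := []string{"self-hosted", "x64", "linux", "ubuntu-latest", "gpu"}
	mlPool := []string{"ubuntu-latest-gpu", "ml-team"}

	// a job asking only for "self-hosted" matches any pool carrying that label
	fmt.Println(poolCanHandle(gpuPool, []string{"self-hosted"})) // true
	// adding "ml-team" narrows the job down to the ml pool only
	fmt.Println(poolCanHandle(gpuPool, []string{"ubuntu-latest-gpu", "ml-team"})) // false
	fmt.Println(poolCanHandle(mlPool, []string{"ubuntu-latest-gpu", "ml-team"}))  // true
}
```
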

Fixes: #47
Related: #78 #73

@gabriel-samfira (Member, Author)

gabriel-samfira commented Apr 17, 2023

CC @maigl @MoritzKeppler

If you're feeling brave, feel free to test this out. This branch should, in theory, properly handle zero min idle runner pools with a small max runner value, without losing jobs. Once slots become available, new runners will be added to handle the remaining queued jobs.

@maigl (Contributor)

maigl commented Apr 18, 2023

sounds good .. we'll check it out when we find some time .. (maybe not brave enough :))
But the new provider currently has higher priority.

@gabriel-samfira (Member, Author)

But the new provider currently has higher priority.

There is no rush at all. Whenever you get a chance. No issues popped up during testing for me, but your use case is more varied than mine. When you do get a chance to test this, ping back.

@MoritzKeppler

first of all: thanks one more time for the PR!

some thoughts:

  • wonder if we could run some component tests that simulate various scenarios.
    Should be possible to mock GitHub and an external provider; little bit tricky with all the async activities...
  • how will the algorithm work if garm is scaled horizontally?
  • what will happen if we miss some of the events?

@gabriel-samfira (Member, Author)

first of all: thanks one more time for the PR!

some thoughts:

* wonder if we could run some component tests that simulate various scenarios.
  Should be possible to mock GitHub and an external provider; little bit tricky with all the async activities...

Yup. Should be doable. I need to write proper tests for the whole pool manager.

* how will the algorithm work if garm is scaled horizontally?

Sadly, not well. There is currently no global lock. There is a plan to try out etcd as a store, which gives us some nice primitives to work with like watchers and global locks. But I have no ETA on that.

* what will happen if we miss some of the events?

They will linger on github until they expire. Sadly, that can't be helped. See this unnecessarily long comment for more details:

https://github.com/cloudbase/garm/pull/86/files#diff-0928304a4f9afb0bf84875a5a19d4d42692d7ed54843569166a233b61cbb7603R1190

Coincidentally, github itself does the same thing. When there is an issue with the gh runners, jobs started while the issue manifested will never be picked up. My guess is that internally they use the same hooks to queue the jobs for their own runners.

@maigl (Contributor)

maigl commented Jun 22, 2023

thx .. will try this out today :)

@gabriel-samfira (Member, Author)

thx .. will try this out today :)

Curious how it works out compared to the old workflow.

@gabriel-samfira gabriel-samfira force-pushed the add-job-tracking branch 5 times, most recently from 175319f to e6fd812 Compare June 29, 2023 06:37
@gabriel-samfira gabriel-samfira force-pushed the add-job-tracking branch 2 times, most recently from a4bd85d to 5d7cf5b Compare July 1, 2023 21:22
Signed-off-by: Gabriel Adrian Samfira <gsamfira@cloudbasesolutions.com>
  * Removes completed jobs from the db
  * Skip ensure min idle runners for pools with min idle runners set to 0

Break the lock on a job if it's still queued and the runner that it
triggered was assigned to another job. This may cause leftover runners
to be created, but we scale those down in ~3 minutes.

  * enable foreign key constraints on sqlite
  * on delete cascade for addresses and status messages
  * add debug server config option
  * fix rr allocation

For now, the additional labels would only contain the job ID that triggered
the creation of the runner. It does not make sense to add this label to the
actual runner that registers against github. We can simply use it internally
by fetching it from the DB.

@gabriel-samfira gabriel-samfira merged commit 0889f6c into cloudbase:main Jul 4, 2023
4 checks passed
Merging this pull request may close these issues:

  • issue with small pool sizes