feat: Support for scaling from/to zero #465
Conversation
Related to #447
What happens in environments where you have multiple runner groups and tags? Is the scope for this to allow scaling from 0 only for workflows targeting no custom tags, i.e. just the default `self-hosted` label?
@callum-tait-pbx Hey, thanks for the question! This patch tries to create only one registration-only runner per RunnerDeployment. You usually differentiate groups/labels per RunnerDeployment, so it should work great with groups and labels.
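For example, you'd typically run one RunnerDeployment per label/group combination, so each of them gets its own registration-only runner. The names, group, and labels below are just illustrative:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gpu-runners              # illustrative name
spec:
  template:
    spec:
      organization: example-org  # illustrative org
      group: gpu                 # runner group for this deployment
      labels:
        - gpu
        - linux
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: general-runners
spec:
  template:
    spec:
      organization: example-org
      labels:
        - linux
```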
Cool! Does it try to make them in the controller namespace?
@callum-tait-pbx Nope! Currently, the registration-only pod is designed to reside in the same namespace as the runners managed by the RunnerDeployment/RunnerReplicaSet, so we can leverage the K8s GC/finalizer to automatically remove the registration-only runner/pod when the RunnerDeployment is deleted. (I don't think K8s GC works across namespaces.)
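For context, the reason this can't live in the controller namespace is that K8s garbage collection is driven by `metadata.ownerReferences`, and an ownerReference cannot point across namespaces. A sketch of what the owned registration-only Runner looks like conceptually (names are illustrative, and a real ownerReference also carries the owner's `uid`):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: Runner
metadata:
  name: example-runner-deployment-registration-only  # illustrative name
  namespace: default                                 # same namespace as the owner
  ownerReferences:
  - apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerReplicaSet
    name: example-runner-deployment-abc123           # illustrative owner name
    controller: true
    blockOwnerDeletion: true
```

Because the owner and the registration-only Runner share a namespace, deleting the RunnerDeployment cascades down and the offline runner is cleaned up automatically.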
I'm a bit confused about how this feature works then. I assumed the driver for @cbrand wanting to be able to scale from 0 is to reduce costs, as you pay by the minute/hour for machines in cloud environments. Being able to scale from 0 means you don't need a node running idle, costing money, until a job is actually assigned. If you are creating your registration runner in the runner namespace, doesn't a node need to come up to do that? Is the idea that a node comes up just for a brief window to register a runner and then dies? I'm just a bit confused as to what the underlying driver is for wanting to scale from 0 and how the implementation details of this feature line up with those drivers. Can you comment on how you expect the node lifecycle to work in conjunction with this feature? Will we potentially get flapping issues with nodes, or do runners that go offline stay registered long enough that this isn't a concern?
@callum-tait-pbx Hey! That's absolutely correct, and...
Mostly yes, but I think I'm trying to do that much earlier than you might have imagined.
No. The idea is to create a registration-only runner before creating any "normal" runners. The registration-only runner creates the pod only for registration and exits immediately after that, consuming no cluster or node resources, while leaving the GitHub Actions runner that was registered by the runner pod in an "offline" state.
Updated the PR description to clarify how GitHub Actions and the controller are supposed to work in concert to achieve scale from/to zero. |
Okay, so I've manually verified that it's working as expected. You will always get the offline runner whose name is suffixed with `registration-only`, without the pod hanging around, as the controller deletes the registration-only runner pod after it gets registered.
Any docs from GitHub saying how long an offline runner will stick around? I think this is worth documenting / linking to a GitHub page as part of the PR. Do they eventually get pruned, or do they stay forever until manually removed by an administrator?

This is going to be a great feature once it's done. What would be really, really great as a phase 2 would be some day/time config around when to scale to 0 (cron style maybe) as part of the `HorizontalRunnerAutoscaler` spec. Maybe if a schedule was set then scale to 0 based on that schedule (the example below uses a cron-style schedule), otherwise respect the min / max settings?

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    name: example-runner-deployment
  scaleReplicasToZeroSchedule: "0 0 * * 6,0"  # proposed field, cron-style schedule
  minReplicas: 5
  maxReplicas: 10
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpAdjustment: '2'
    scaleDownAdjustment: '1'
```

I can see how scaling from 0 would be super great where you have a single runner per node due to their size and thus cost. Changing the zero-based scaling from simply setting `minReplicas: 0` to something schedule-based would be a bigger change, though.

cc @mumoshu
@callum-tait-pbx Hey! Unfortunately, I don't know of any. But my assumption is that it is retained forever until someone deletes it. The oldest registration-only runner in my local cluster is about 5 days old, and I'm seeing that the offline runner is still there from GitHub's view.
I hear you! I once considered any kind of scheduled scaling to be out of scope for this project, as it can theoretically be done by another controller that manipulates the HRA on a schedule. But it turns out such a solution wouldn't play nicely with GitOps, because you would then need to ignore changes to minReplicas/maxReplicas on the HRA to avoid a misdetected "skew" between the desired and current state. That seems to work, but not perfectly, because doing so effectively prevents you from overriding the scheduled scale settings manually via a git commit when necessary. So let's revisit it once we've settled on this and #447. cc/ @cbrand
Perhaps it's worth including in the docs that this is an experimental feature for the moment? I can raise a question with our enterprise support to try and find out more.
Yup, makes sense. It's probably a new issue as an enhancement to this feature anyway.
@callum-tait-pbx Yes, that sounds like a good idea!
I'd appreciate it if you could 🙏
https://docs.github.com/en/actions/hosting-your-own-runners/removing-self-hosted-runners

> "A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 30 days."

Logic is needed then for re-provisioning the registration-only runner at least every 29 days? @mumoshu Without this change I don't think this feature is mergeable?
@callum-tait-pbx Thanks for finding and sharing! That's good to know 👍
Maybe. I can imagine that a RunnerReplicaSet/RunnerDeployment can easily be months or years old in a long-running cluster, and scale-from-zero consistently degrading after 30 days does seem impractical. Another way would be to create the registration-only runner only on scaling to zero, and delete it on scaling to 1 or greater. That way, it should work as long as you switch between zero and non-zero runners at least once per 30 days. Actually, I think this is a legitimate assumption and easier to implement.
I intend for my enterprise support question to cover this as well.
This sounds much simpler to implement, and I agree it's probably a realistic assumption.
@callum-tait-pbx Thanks for confirming! Alright, I've made it create a registration-only runner on scaling runners to zero and keep it only while the RS is scaled to zero.
@mumoshu excuse my stepping into this conversation. I have just started testing this awesome solution to evaluate GitHub Actions and stumbled upon this PR as I was looking for a scale to/from zero configuration. I gave it a try and I'm seeing some interesting things. When I apply the configuration, both a registration-only and a normal pod are created, then the normal pod is kept running. Please see the configuration below. If I remove the min/maxReplicas, the controller complains that there's no minReplicas set, but even then one normal pod is kept running. May I ask for some guidance? I am happy to keep testing.
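A minimal pair of manifests along the lines of what's described above (names and repository are illustrative) would be:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runner-deployment
spec:
  template:
    spec:
      repository: example-org/example-repo  # illustrative repository
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    name: example-runner-deployment
  minReplicas: 0    # expecting scale to zero while the queue is empty
  maxReplicas: 3
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
```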
@ScOut3R what guidance are you looking for exactly? You have to provide the `minReplicas` and `maxReplicas` settings. The README.md should have everything you need in it to get started; once this PR's functionality is ready, docs will be included in the README as part of this PR.
@callum-tait-pbx as you can see, minReplicas is set to 0 but the runners are not scaled down. One runner is kept running while the queue is empty. And yes, I am using the dev image of the controller.
Thanks @mumoshu, that explains it. So in order to scale to/from 0 and use a GitHub webhook as the trigger for scaling, #447 is also required.
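If I understand it right, a webhook-driven setup would then look roughly like the sketch below. This assumes the `scaleUpTriggers`/`githubEvent` shape being discussed in #447, so the exact field names may change before it lands:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    name: example-runner-deployment
  minReplicas: 0    # scale to zero when idle
  maxReplicas: 10
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]  # scale up when a job is queued
        status: "queued"
    amount: 1
    duration: "5m"          # how long the scale-up stays in effect
```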
This is an attempt to support scaling from/to zero.
The basic idea is that we create a one-off "registration-only" runner pod when a RunnerReplicaSet is scaled to zero, so that there is one "offline" runner, which enables GitHub Actions to queue jobs instead of discarding them.
GitHub Actions seems to immediately throw away a new job when there are no runners at all. Generally, having runners of any status, `busy`, `idle`, or `offline`, prevents GitHub Actions from failing jobs. But retaining `busy` or `idle` runners means we need to keep runner pods running, which conflicts with our desire to scale to/from zero, hence we retain `offline` runners.

In this change, I enhanced the runnerreplicaset controller to create a registration-only runner at the very beginning of its reconciliation logic, only when a runnerreplicaset is scaled to zero. The runner controller creates the registration-only runner pod, waits for it to become "offline", and then removes the runner pod. The runner on GitHub stays `offline` until the runner resource on K8s is deleted. As we remove the registration-only runner pod as soon as it registers, this doesn't block cluster-autoscaler.

Related to #447