
feat: Support for scaling from/to zero #465

Merged
merged 11 commits into master from scale-from-to-zero on May 2, 2021

Conversation

mumoshu
Collaborator

mumoshu commented Apr 18, 2021

This is an attempt to support scaling from/to zero.

The basic idea is that we create a one-off "registration-only" runner pod when a RunnerReplicaSet is scaled to zero, so that there is one "offline" runner, which lets GitHub Actions queue jobs instead of discarding them.

GitHub Actions seems to immediately throw away a new job when there are no runners at all. Generally, having runners in any status (busy, idle, or offline) prevents GitHub Actions from failing jobs. But retaining busy or idle runners means we need to keep runner pods running, which conflicts with our desire to scale to/from zero, hence we retain offline runners.

In this change, I enhanced the runnerreplicaset controller to create a registration-only runner at the very beginning of its reconciliation logic, only when a RunnerReplicaSet is scaled to zero. The runner controller creates the registration-only runner pod, waits for it to become "offline", and then removes the runner pod. The runner on GitHub stays offline until the runner resource on K8s is deleted. As we remove the registration-only runner pod as soon as it registers, this doesn't block cluster-autoscaler.
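For illustration, a trimmed-down sketch of the registration-only Runner resource the controller creates (field values are illustrative; a full dump of a real one appears later in this thread):

apiVersion: actions.summerwind.dev/v1alpha1
kind: Runner
metadata:
  name: example-runnerdeploy-h25vk-registration-only
  annotations:
    # marks this Runner as registration-only; its pod is deleted as soon as
    # the runner shows up as "offline" on GitHub
    actions-runner-controller/registration-only: "true"
  ownerReferences:
  # owned by the RunnerReplicaSet so K8s GC removes it on deletion
  - apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerReplicaSet
    name: example-runnerdeploy-h25vk
    controller: true
    blockOwnerDeletion: true
spec:
  repository: mumoshu/actions-test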

Related to #447

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 19, 2021

What happens in environments where you have multiple runner groups and tags? Is the scope for this to allow scaling from 0 only for workflows targeting no custom tags, i.e. ['self-hosted', 'Linux', 'x64'] only? If not, how will you provide a registration-only runner for a range of possible groups/labels?

@mumoshu
Collaborator Author

mumoshu commented Apr 19, 2021

If not, how will you provide a registration-only runner for a range of possible groups/labels?

@callum-tait-pbx Hey, thanks for the question! This patch creates one registration-only runner per RunnerDeployment. You usually differentiate groups/labels per RunnerDeployment, so it should work well with groups and labels.
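For example (hypothetical names, assuming the usual group/labels fields on the runner template), two RunnerDeployments that differ in group and labels would each get their own registration-only runner when scaled to zero:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: linux-x64-runners
spec:
  template:
    spec:
      repository: example/repo
      labels:
      # jobs targeting this label can still queue while scaled to zero
      - linux-x64
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gpu-runners
spec:
  template:
    spec:
      repository: example/repo
      group: gpu
      labels:
      - gpu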

@callum-tait-pbx
Contributor

Cool! Does it try to make them in the controller namespace?

@mumoshu
Collaborator Author

mumoshu commented Apr 19, 2021

@callum-tait-pbx Nope! Currently, the registration-only pod is designed to reside within the same namespace as the runners managed by the RunnerDeployment/RunnerReplicaSet, to take advantage of the K8s GC/finalizer to automatically remove the registration-only runner/pod on RunnerDeployment deletion. (I thought K8s GC doesn't work across namespaces.)

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 19, 2021

I'm a bit confused about how this feature works then. I assumed the driver for @cbrand wanting to be able to scale from 0 is to reduce costs, as you pay by the minute/hour for machines in cloud environments. Being able to scale from 0 means you don't need a node running, doing nothing and costing money, until a job is actually assigned. If you are creating your registration runner in the runner namespace, does that mean a node needs to come up to do that? Is the idea that a node comes up just for a brief window to register a runner and then dies?

I'm just a bit confused as to what the underlying driver is for wanting to scale from 0 and how the implementation details of this feature line up with those drivers. Can you comment on how you are expecting the node lifecycle to work in conjunction with this feature? Will we potentially get flapping issues with nodes, or do runners that go offline stay registered for long enough that this isn't a concern?

@mumoshu
Collaborator Author

mumoshu commented Apr 20, 2021

Being able to scale from 0 means you don't need a node running, doing nothing and costing money, until a job is actually assigned

@callum-tait-pbx Hey! That's absolutely correct, and...

If you are creating your registration runner in the runner namespace, does that mean a node needs to come up to do that?

Mostly yes, but I'm trying to do that much earlier than you might have imagined.

Is the idea that a node comes up just for a brief window to register a runner and then dies?

No. The idea is to create a registration-only runner before creating any "normal" runners. The registration-only runner creates its pod only for registration and exits immediately after that, consuming no cluster or node resources, while the GitHub Actions runner registered by that pod is left in an "offline" state.

@mumoshu
Collaborator Author

mumoshu commented Apr 20, 2021

Updated the PR description to clarify how GitHub Actions and the controller are supposed to work in concert to achieve scale from/to zero.

@mumoshu
Collaborator Author

mumoshu commented Apr 20, 2021

Okay, so I've manually verified that it's working as expected.

You will always get the offline runner whose name is suffixed with -registration-only:

[screenshot: the offline registration-only runner listed in the repository's runner settings]

Without the pod hanging around, as the controller deletes the registration-only runner pod after it gets registered:

$ k get po
NAME                                           READY   STATUS              RESTARTS   AGE
example-runnerdeploy-h25vk-4dkdm               0/2     ContainerCreating   0          1s
example-runnerdeploy-h25vk-registration-only   0/2     ContainerCreating   0          1s
$ k get po
NAME                               READY   STATUS    RESTARTS   AGE
example-runnerdeploy-h25vk-4dkdm   2/2     Running   0          2m31s

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 21, 2021

Any docs from GitHub saying how long an offline runner will stick around? I think this is worth documenting / linking to a GitHub page as part of the PR? Do they eventually get pruned, or do they stay forever until manually removed by an administrator?

This is going to be a great feature once it's done. What would be really, really, really great as a phase 2 would be if we could have some day/time config around when to scale to 0 (cron style maybe) as part of the HorizontalRunnerAutoscaler spec. I would love to be able to scale down to 0 on weekends and/or late evening until early morning. Being able to do this would save us quite a bit of ££££ over the course of a year. I wouldn't want to scale from 0 within normal business hours in my environment, so as to ensure builds get allocated promptly and we don't run into some sort of node flapping problem.

Maybe if a schedule is set, scale to 0 based on that schedule (the example below uses a cron-style schedule), otherwise respect the min/max settings?

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    name: example-runner-deployment
  scaleReplicasToZeroSchedule: 0 0 * * 6,0
  minReplicas: 5
  maxReplicas: 10
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpAdjustment: '2'
    scaleDownAdjustment: '1'

I can see how scaling from 0 would be super great where you have a single runner per node due to their size and thus cost. Changing zero-based scaling from simply setting minReplicas: to 0 to driving it from a schedule instead would greatly improve the general usefulness of this feature, making it relevant for those running a larger fleet of smaller runners. IMO this is well worth considering! 🙏

cc @mumoshu

@mumoshu
Collaborator Author

mumoshu commented Apr 25, 2021

Any docs from GitHub saying how long an offline runner will stick around? I think this is worth documenting / linking to a GitHub page as part of the PR? Do they eventually get pruned, or do they stay forever until manually removed by an administrator?

@callum-tait-pbx Hey! Unfortunately, I don't know any.

But my assumption is that it is retained forever, until anyone deletes it.

Here's the dump of the oldest registration-only runner in my local cluster. It's like 5 days old.

apiVersion: actions.summerwind.dev/v1alpha1
kind: Runner
metadata:
  annotations:
    actions-runner-controller/registration-only: "true"
  creationTimestamp: "2021-04-20T10:51:19Z"
  finalizers:
  - runner.actions.summerwind.dev
  generation: 1
  managedFields:
  - apiVersion: actions.summerwind.dev/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:actions-runner-controller/registration-only: {}
        f:finalizers: {}
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:dockerdContainerResources: {}
        f:image: {}
        f:labels: {}
        f:repository: {}
        f:resources: {}
      f:status:
        .: {}
        f:lastRegistrationCheckTime: {}
        f:phase: {}
        f:registration:
          .: {}
          f:expiresAt: {}
          f:labels: {}
          f:repository: {}
          f:token: {}
    manager: manager
    operation: Update
    time: "2021-04-20T10:52:18Z"
  name: example-runnerdeploy-h25vk-registration-only
  namespace: default
  ownerReferences:
  - apiVersion: actions.summerwind.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RunnerReplicaSet
    name: example-runnerdeploy-h25vk
    uid: 7ca18ca3-1fb0-42ea-9d4b-29325c95d640
  resourceVersion: "5743352"
  uid: 604f6e8f-cbdf-4c88-8b98-03c9a1b03227
spec:
  dockerdContainerResources: {}
  image: mumoshu/actions-runner:latest
  labels:
  - mylabel 1
  - mylabel 2
  repository: mumoshu/actions-test
  resources: {}
status:
  lastRegistrationCheckTime: "2021-04-20T10:51:19Z"
  phase: Running
  registration:
    expiresAt: "2021-04-20T11:48:32Z"
    labels:
    - mylabel 1
    - mylabel 2

And I'm seeing the offline runner is still there from GitHub's view.

[screenshot: the registration-only runner still shown as offline in GitHub's runner list]

@mumoshu
Collaborator Author

mumoshu commented Apr 25, 2021

This is going to be a great feature once it's done. What would be really really really great as a phase 2 would be if we could have some day/time config around when to scale to 0 somehow (cron style maybe) as part of the HorizontalRunnerAutoscaler spec?

I hear you!

I once considered that any kind of scheduled scaling is out of the scope of this project, as it can theoretically be done by another controller that manipulates HRA on schedule.

But it turns out such a solution wouldn't work very nicely with GitOps, because then you need to ignore changes to minReplicas/maxReplicas on the HRA to avoid a misdetected "skew" between the desired and current state. That seems to work, but not perfectly, because doing so effectively prevents you from "overriding" the scheduled scale settings manually via a git commit when necessary.
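To illustrate (a hypothetical sketch using Argo CD as the GitOps tool; resource names and repo URL are made up), the workaround would be an ignore rule like the one below, and it is exactly that rule which then swallows any manual override committed to git:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: runner-autoscalers
spec:
  project: default
  source:
    repoURL: https://example.com/org/infra.git
    path: runners
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  ignoreDifferences:
  # stop the GitOps tool from reporting drift on the fields a scheduling
  # controller rewrites; the downside is that committed changes to these
  # fields are effectively ignored as well
  - group: actions.summerwind.dev
    kind: HorizontalRunnerAutoscaler
    jsonPointers:
    - /spec/minReplicas
    - /spec/maxReplicas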

So let's try that once we've settled on this and #447.

cc/ @cbrand

@callum-tait-pbx
Contributor

@callum-tait-pbx Hey! Unfortunately, I don't know any.

Perhaps it's worth including in the docs that this is an experimental feature for the moment? I can raise a question with our enterprise support to try and find out more.

So let's try that once we've settled on this and #447.

Yup makes sense, it's probably a new issue as an enhancement to this feature anyway.

@mumoshu
Collaborator Author

mumoshu commented Apr 26, 2021

Perhaps it's worth including in the docs that this is an experimental feature for the moment?

@callum-tait-pbx Yes, that sounds like a good idea!

I can raise a question with our enterprise support to try and find out more.

I'd appreciate it if you could 🙏

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 27, 2021

Perhaps it's worth including in the docs that this is an experimental feature for the moment?

@callum-tait-pbx Yes, that sounds like a good idea!

I can raise a question with our enterprise support to try and find out more.

I'd appreciate it if you could 🙏

https://docs.github.com/en/actions/hosting-your-own-runners/removing-self-hosted-runners "A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 30 days."

Logic is needed then for reprovisioning the registration-only runner at least every 29 days? @mumoshu Without this change I don't think this feature is mergeable?

@mumoshu
Collaborator Author

mumoshu commented Apr 27, 2021

@callum-tait-pbx Thanks for finding and sharing! That's good to know 👍

Logic is needed then for reprovisioning the registration-only runner at least every 29 days? @mumoshu Without this change I don't think this feature is mergeable?

Maybe. I can imagine that a RunnerReplicaSet/RunnerDeployment can easily be months or years old in a long-running cluster. Scale-from-zero consistently degrading after 30 days does seem impractical.

Another way would be to create the registration-only runner only on scaling to zero, and delete it on scaling to 1 or greater. That way, it should work as long as you switch between zero and non-zero runners at least once per 30 days. Actually I think this is a legit assumption and easier to implement.

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 28, 2021

Maybe. I can imagine that a RunnerReplicaSet/RunnerDeployment can easily be months or years old in a long-running cluster. Scale-from-zero consistently degrading after 30 days does seem impractical.

I intend for my RunnerDeployments to be deployed for years, with the main change being updating the runner container image to a newer runner version, which may happen as infrequently as every few months or so. I know that for me, scaling from zero breaking every 30 days would be a showstopper for using the feature.

Another way would be to create the registration-only runner only on scaling to zero, and delete it on scaling to 1 or greater. That way, it should work as long as you switch between zero and non-zero runners at least once per 30 days. Actually I think this is a legit assumption and easier to implement.

This sounds much simpler to implement, and I agree it's probably a realistic assumption.

@mumoshu
Collaborator Author

mumoshu commented Apr 30, 2021

@callum-tait-pbx Thanks for confirming! Alright, I've changed it to create a registration-only runner when scaling runners to zero and to keep it only while the RunnerReplicaSet is scaled to zero.

@ScOut3R

ScOut3R commented Apr 30, 2021

@mumoshu excuse my stepping into this conversation. I have just started testing this awesome solution to evaluate GitHub Actions and stumbled upon this PR as I was looking for a scale to/from zero configuration.

I gave it a try and I'm seeing some interesting things. When I apply the configuration, both a registration-only and a normal pod are created, and then the normal pod is kept running.

Please see the configuration below. If I remove min/maxReplicas, the controller complains that there's no minReplicas set, but even then one normal pod is kept running.

May I ask for some guidance? I am happy to keep testing.

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: generic-runner
spec:
  template:
    spec:
      organization: ""
      labels:
        - generic
      nodeSelector:
        node-role.kubernetes.io/spot-worker: "true"
      resources:
        requests:
          memory: 4Gi

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: generic-runner-deployment-autoscaler
spec:
  minReplicas: 0
  maxReplicas: 40
  scaleTargetRef:
    name: generic-runner
  scaleUpTriggers:
    - githubEvent:
        checkRun:
          types:
            - created
          status: queued
      amount: 1
      duration: 5m

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 30, 2021

@ScOut3R what guidance are you looking for exactly? You have to provide the minReplicas and maxReplicas attributes if you want your runner deployments to scale, which you do if you want to "scale to/from zero". This applies to scaling by any amount; that's the purpose of the HorizontalRunnerAutoscaler kind.

The README.md should have everything you need to get started. Once this PR's functionality is ready, docs will be included in the README as part of this PR.

@ScOut3R

ScOut3R commented Apr 30, 2021

@ScOut3R what guidance are you looking for exactly? You have to provide the minReplicas and maxReplicas attributes if you want your runner deployments to scale, which you do if you want to "scale to/from zero". This applies to scaling by any amount; that's the purpose of the HorizontalRunnerAutoscaler kind.

The README.md should have everything you need to get started. Once this PR's functionality is ready, docs will be included in the README as part of this PR.

@callum-tait-pbx as you can see minReplicas is set to 0 but the runners are not scaled down. One runner is kept running while the queue is empty. And yes, I am using the dev image of the controller.

@mumoshu
Collaborator Author

mumoshu commented May 1, 2021

@ScOut3R Hey! You very likely need to create the RunnerDeployment alone, without an HRA.

HRA forces minReplicas to be 1 at minimum. The patch that allows HRA to have 0 minReplicas is not included in this PR, but in #447.
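For reference, a minimal sketch of trying that (assuming the replicas field on the RunnerDeployment spec; names are taken from the config above):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: generic-runner
spec:
  # flip this between 0 and a positive number to exercise scale to/from zero;
  # at 0, only the offline *-registration-only runner remains on GitHub
  replicas: 0
  template:
    spec:
      organization: ""
      labels:
      - generic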

mumoshu merged commit dbd7b48 into master on May 2, 2021
mumoshu deleted the scale-from-to-zero branch on May 2, 2021 07:11
@ScOut3R

ScOut3R commented May 3, 2021

@ScOut3R Hey! You very likely need to create the RunnerDeployment alone, without an HRA.

HRA forces minReplicas to be 1 at minimum. The patch that allows HRA to have 0 minReplicas is not included in this PR, but in #447.

Thanks @mumoshu, that explains it. So in order to scale to/from 0 and use GitHub webhooks as the trigger for scaling, #447 is also required.
