
feat: Support for scaling from/to zero #465

Merged
merged 11 commits into master from scale-from-to-zero on May 2, 2021

Conversation

mumoshu
Collaborator

mumoshu commented Apr 18, 2021

This is an attempt to support scaling from/to zero.

The basic idea is that we create a one-off "registration-only" runner pod when a RunnerReplicaSet is scaled to zero, so that there is one "offline" runner, which lets GitHub Actions queue jobs instead of discarding them.

GitHub Actions seems to immediately throw away a new job when there are no runners at all. Generally, having runners in any status (busy, idle, or offline) prevents GitHub Actions from failing jobs. But retaining busy or idle runners means we need to keep runner pods running, which conflicts with our desire to scale to/from zero, hence we retain offline runners.

In this change, I enhanced the runnerreplicaset controller to create a registration-only runner at the very beginning of its reconciliation logic, only when a RunnerReplicaSet is scaled to zero. The runner controller creates the registration-only runner pod, waits for it to become "offline", and then removes the runner pod. The runner on GitHub stays offline until the runner resource on K8s is deleted. As we remove the registration-only runner pod as soon as it registers, this doesn't block cluster-autoscaler.
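For illustration, a trimmed-down sketch of the registration-only Runner resource the controller creates (field values are illustrative; a full dump of a real one appears later in this thread):

apiVersion: actions.summerwind.dev/v1alpha1
kind: Runner
metadata:
  name: example-runnerdeploy-h25vk-registration-only
  annotations:
    # marks this Runner as registration-only; its pod is deleted as soon as
    # the runner shows up as "offline" on GitHub
    actions-runner-controller/registration-only: "true"
  ownerReferences:
  # owned by the RunnerReplicaSet so K8s GC removes it on deletion
  - apiVersion: actions.summerwind.dev/v1alpha1
    kind: RunnerReplicaSet
    name: example-runnerdeploy-h25vk
    controller: true
    blockOwnerDeletion: true
spec:
  repository: mumoshu/actions-test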

Related to #447

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 19, 2021

What happens in environments where you have multiple runner groups and tags? Is the scope for this to allow scaling from 0 only for workflows targeting no custom tags, i.e. ['self-hosted', 'Linux', 'x64'] only? If not, how will you provide a registration-only runner for a range of possible groups/labels?

@mumoshu
Collaborator Author

mumoshu commented Apr 19, 2021

If not, how will you provide a registration-only runner for a range of possible groups/labels?

@callum-tait-pbx Hey, thanks for the question! This patch creates one registration-only runner per RunnerDeployment. You usually differentiate groups/labels per RunnerDeployment, so it should work well with groups and labels.
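For example (hypothetical names, assuming the usual group/labels fields on the runner template), two RunnerDeployments that differ in group and labels would each get their own registration-only runner when scaled to zero:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: linux-x64-runners
spec:
  template:
    spec:
      repository: example/repo
      labels:
      # jobs targeting this label can still queue while scaled to zero
      - linux-x64
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: gpu-runners
spec:
  template:
    spec:
      repository: example/repo
      group: gpu
      labels:
      - gpu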

@callum-tait-pbx
Contributor

Cool! Does it try to make them in the controller namespace?

@mumoshu
Collaborator Author

mumoshu commented Apr 19, 2021

@callum-tait-pbx Nope! Currently, the registration-only pod is designed to reside within the same namespace as the runners managed by the RunnerDeployment/RunnerReplicaSet, to take advantage of the K8s GC/finalizer to automatically remove the registration-only runner/pod on RunnerDeployment deletion. (I thought K8s GC doesn't work across namespaces.)

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 19, 2021

I'm a bit confused about how this feature works then. I assumed the driver for @cbrand wanting to be able to scale from 0 is to reduce costs, as you pay by the minute/hour for machines in cloud environments. Being able to scale from 0 means you don't need a node running, doing nothing and costing money, until a job is actually assigned. If you are creating your registration runner in the runner namespace, does that mean a node needs to come up to do that? Is the idea that a node comes up just for a brief window to register a runner and then dies?

I'm just a bit confused as to what the underlying driver is for wanting to scale from 0 and how the implementation details of this feature line up with those drivers. Can you comment on how you are expecting the node lifecycle to work in conjunction with this feature? Will we potentially get flapping issues with nodes, or do runners that go offline stay registered for long enough that this isn't a concern?

@mumoshu
Collaborator Author

mumoshu commented Apr 20, 2021

Being able to scale from 0 means you don't need a node running, doing nothing and costing money, until a job is actually assigned

@callum-tait-pbx Hey! That's absolutely correct, and...

If you are creating your registration runner in the runner namespace, does that mean a node needs to come up to do that?

Mostly yes, but I'm trying to do that much earlier than you might have imagined.

Is the idea that a node comes up just for a brief window to register a runner and then dies?

No. The idea is to create a registration-only runner before creating any "normal" runners. The registration-only runner creates its pod only for registration and exits immediately after that, consuming no cluster or node resources, while the GitHub Actions runner registered by that pod is left in an "offline" state.

@mumoshu
Collaborator Author

mumoshu commented Apr 20, 2021

Updated the PR description to clarify how GitHub Actions and the controller are supposed to work in concert to achieve scale from/to zero.

@mumoshu
Collaborator Author

mumoshu commented Apr 20, 2021

Okay, so I've manually verified that it's working as expected.

You will always get the offline runner whose name is suffixed with -registration-only:

[screenshot: the offline registration-only runner listed in the repository's runner settings]

Without the pod hanging around, as the controller deletes the registration-only runner pod after it gets registered:

$ k get po
NAME                                           READY   STATUS              RESTARTS   AGE
example-runnerdeploy-h25vk-4dkdm               0/2     ContainerCreating   0          1s
example-runnerdeploy-h25vk-registration-only   0/2     ContainerCreating   0          1s
$ k get po
NAME                               READY   STATUS    RESTARTS   AGE
example-runnerdeploy-h25vk-4dkdm   2/2     Running   0          2m31s

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 21, 2021

Any docs from GitHub saying how long an offline runner will stick around? I think this is worth documenting / linking to a GitHub page as part of the PR? Do they eventually get pruned, or do they stay forever until manually removed by an administrator?

This is going to be a great feature once it's done. What would be really, really, really great as a phase 2 would be if we could have some day/time config around when to scale to 0 (cron style maybe) as part of the HorizontalRunnerAutoscaler spec. I would love to be able to scale down to 0 on weekends and/or late evening until early morning. Being able to do this would save us quite a bit of ££££ over the course of a year. I wouldn't want to scale from 0 within normal business hours in my environment, so as to ensure builds get allocated promptly and we don't run into some sort of node flapping problem.

Maybe if a schedule is set, scale to 0 based on that schedule (the example below uses a cron-style schedule), otherwise respect the min/max settings?

apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runner-deployment-autoscaler
spec:
  scaleTargetRef:
    name: example-runner-deployment
  scaleReplicasToZeroSchedule: 0 0 * * 6,0
  minReplicas: 5
  maxReplicas: 10
  metrics:
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.3'
    scaleUpAdjustment: '2'
    scaleDownAdjustment: '1'

I can see how scaling from 0 would be super great where you have a single runner per node due to their size and thus cost. Changing zero-based scaling from simply setting minReplicas: to 0 to driving it from a schedule instead would greatly improve the general usefulness of this feature, making it relevant for those running a larger fleet of smaller runners. IMO this is well worth considering! 🙏

cc @mumoshu

@mumoshu
Collaborator Author

mumoshu commented Apr 25, 2021

Any docs from GitHub saying how long an offline runner will stick around? I think this is worth documenting / linking to a GitHub page as part of the PR? Do they eventually get pruned, or do they stay forever until manually removed by an administrator?

@callum-tait-pbx Hey! Unfortunately, I don't know any.

But my assumption is that it is retained forever, until anyone deletes it.

Here's the dump of the oldest registration-only runner in my local cluster. It's like 5 days old.

apiVersion: actions.summerwind.dev/v1alpha1
kind: Runner
metadata:
  annotations:
    actions-runner-controller/registration-only: "true"
  creationTimestamp: "2021-04-20T10:51:19Z"
  finalizers:
  - runner.actions.summerwind.dev
  generation: 1
  managedFields:
  - apiVersion: actions.summerwind.dev/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:actions-runner-controller/registration-only: {}
        f:finalizers: {}
        f:ownerReferences: {}
      f:spec:
        .: {}
        f:dockerdContainerResources: {}
        f:image: {}
        f:labels: {}
        f:repository: {}
        f:resources: {}
      f:status:
        .: {}
        f:lastRegistrationCheckTime: {}
        f:phase: {}
        f:registration:
          .: {}
          f:expiresAt: {}
          f:labels: {}
          f:repository: {}
          f:token: {}
    manager: manager
    operation: Update
    time: "2021-04-20T10:52:18Z"
  name: example-runnerdeploy-h25vk-registration-only
  namespace: default
  ownerReferences:
  - apiVersion: actions.summerwind.dev/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: RunnerReplicaSet
    name: example-runnerdeploy-h25vk
    uid: 7ca18ca3-1fb0-42ea-9d4b-29325c95d640
  resourceVersion: "5743352"
  uid: 604f6e8f-cbdf-4c88-8b98-03c9a1b03227
spec:
  dockerdContainerResources: {}
  image: mumoshu/actions-runner:latest
  labels:
  - mylabel 1
  - mylabel 2
  repository: mumoshu/actions-test
  resources: {}
status:
  lastRegistrationCheckTime: "2021-04-20T10:51:19Z"
  phase: Running
  registration:
    expiresAt: "2021-04-20T11:48:32Z"
    labels:
    - mylabel 1
    - mylabel 2

And I'm seeing the offline runner is still there from GitHub's view.

[screenshot: the registration-only runner still shown as offline in GitHub's runner list]

@mumoshu
Collaborator Author

mumoshu commented Apr 25, 2021

This is going to be a great feature once it's done. What would be really really really great as a phase 2 would be if we could have some day/time config around when to scale to 0 somehow (cron style maybe) as part of the HorizontalRunnerAutoscaler spec?

I hear you!

I once considered that any kind of scheduled scaling is out of the scope of this project, as it can theoretically be done by another controller that manipulates HRA on schedule.

But it turns out such a solution wouldn't work very nicely with GitOps, because then you need to ignore changes to minReplicas/maxReplicas on the HRA to avoid a misdetected "skew" between the desired and current state. That seems to work, but not perfectly, because doing so effectively prevents you from "overriding" the scheduled scale settings manually via a git commit when necessary.
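To illustrate (a hypothetical sketch using Argo CD as the GitOps tool; resource names and repo URL are made up), the workaround would be an ignore rule like the one below, and it is exactly that rule which then swallows any manual override committed to git:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: runner-autoscalers
spec:
  project: default
  source:
    repoURL: https://example.com/org/infra.git
    path: runners
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  ignoreDifferences:
  # stop the GitOps tool from reporting drift on the fields a scheduling
  # controller rewrites; the downside is that committed changes to these
  # fields are effectively ignored as well
  - group: actions.summerwind.dev
    kind: HorizontalRunnerAutoscaler
    jsonPointers:
    - /spec/minReplicas
    - /spec/maxReplicas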

So let's try that once we've settled on this and #447.

cc/ @cbrand

@callum-tait-pbx
Contributor

@callum-tait-pbx Hey! Unfortunately, I don't know any.

Perhaps it's worth including in the docs that this is an experimental feature for the moment? I can raise a question with our enterprise support to try and find out more.

So let's try that once we've settled on this and #447.

Yup makes sense, it's probably a new issue as an enhancement to this feature anyway.

@mumoshu
Collaborator Author

mumoshu commented Apr 26, 2021

Perhaps it's worth including in the docs that this is an experimental feature for the moment?

@callum-tait-pbx Yes, that sounds like a good idea!

I can raise a question with our enterprise support to try and find out more.

I'd appreciate it if you could 🙏

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 27, 2021

Perhaps it's worth including in the docs that this is an experimental feature for the moment?

@callum-tait-pbx Yes, that sounds like a good idea!

I can raise a question with our enterprise support to try and find out more.

I'd appreciate it if you could 🙏

https://docs.github.com/en/actions/hosting-your-own-runners/removing-self-hosted-runners "A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 30 days."

Logic is needed then for reprovisioning the registration-only runner at least every 29 days? @mumoshu Without this change I don't think this feature is mergeable?

@mumoshu
Collaborator Author

mumoshu commented Apr 27, 2021

@callum-tait-pbx Thanks for finding and sharing! That's good to know 👍

Logic is needed then for reprovisioning the registration-only runner at least every 29 days? @mumoshu Without this change I don't think this feature is mergeable?

Maybe. I can imagine that a RunnerReplicaSet/RunnerDeployment can easily be months or years old in a long-running cluster. Scale-from-zero consistently degrading after 30 days does seem impractical.

Another way would be to create the registration-only runner only on scaling to zero, and delete it on scaling to 1 or greater. That way, it should work as long as you switch between zero and non-zero runners at least once per 30 days. Actually I think this is a legit assumption and easier to implement.

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 28, 2021

Maybe. I can imagine that a RunnerReplicaSet/RunnerDeployment can easily be months or years old in a long-running cluster. Scale-from-zero consistently degrading after 30 days does seem impractical.

I intend for my RunnerDeployments to be deployed for years, with the main change being updating the runner container image to a newer runner version, which may happen as infrequently as every few months or so. I know that for me, scaling from zero breaking every 30 days would be a showstopper for using the feature.

Another way would be to create the registration-only runner only on scaling to zero, and delete it on scaling to 1 or greater. That way, it should work as long as you switch between zero and non-zero runners at least once per 30 days. Actually I think this is a legit assumption and easier to implement.

This sounds much simpler to implement, and I agree it's probably a realistic assumption.

@mumoshu
Collaborator Author

mumoshu commented Apr 30, 2021

@callum-tait-pbx Thanks for confirming! Alright, I've changed it to create a registration-only runner when scaling runners to zero and to keep it only while the RunnerReplicaSet is scaled to zero.

@ScOut3R

ScOut3R commented Apr 30, 2021

@mumoshu excuse my stepping into this conversation. I have just started testing this awesome solution to evaluate GitHub Actions and stumbled upon this PR as I was looking for a scale to/from zero configuration.

I gave it a try and I'm seeing some interesting things. When I apply the configuration, both a registration-only and a normal pod are created, and then the normal pod is kept running.

Please see the configuration below. If I remove min/maxReplicas, the controller complains that there's no minReplicas set, but even then one normal pod is kept running.

May I ask for some guidance? I am happy to keep testing.

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: generic-runner
spec:
  template:
    spec:
      organization: ""
      labels:
        - generic
      nodeSelector:
        node-role.kubernetes.io/spot-worker: "true"
      resources:
        requests:
          memory: 4Gi

---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: generic-runner-deployment-autoscaler
spec:
  minReplicas: 0
  maxReplicas: 40
  scaleTargetRef:
    name: generic-runner
  scaleUpTriggers:
    - githubEvent:
        checkRun:
          types:
            - created
          status: queued
      amount: 1
      duration: 5m

@callum-tait-pbx
Contributor

callum-tait-pbx commented Apr 30, 2021

@ScOut3R what guidance are you looking for exactly? You have to provide the minReplicas and maxReplicas attributes if you want your runner deployments to scale, which you do if you want to "scale to/from zero". This applies to scaling by any amount; that's the purpose of the HorizontalRunnerAutoscaler kind.

The README.md should have everything you need to get started. Once this PR's functionality is ready, docs will be included in the README as part of this PR.

@ScOut3R

ScOut3R commented Apr 30, 2021

@ScOut3R what guidance are you looking for exactly? You have to provide the minReplicas and maxReplicas attributes if you want your runner deployments to scale, which you do if you want to "scale to/from zero". This applies to scaling by any amount; that's the purpose of the HorizontalRunnerAutoscaler kind.

The README.md should have everything you need to get started. Once this PR's functionality is ready, docs will be included in the README as part of this PR.

@callum-tait-pbx as you can see minReplicas is set to 0 but the runners are not scaled down. One runner is kept running while the queue is empty. And yes, I am using the dev image of the controller.

@mumoshu
Collaborator Author

mumoshu commented May 1, 2021

@ScOut3R Hey! You very likely need to create the RunnerDeployment alone, without an HRA.

HRA forces minReplicas to be 1 at minimum. The patch that allows HRA to have 0 minReplicas is not included in this PR, but in #447.
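For reference, a minimal sketch of trying that (assuming the replicas field on the RunnerDeployment spec; names are taken from the config above):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: generic-runner
spec:
  # flip this between 0 and a positive number to exercise scale to/from zero;
  # at 0, only the offline *-registration-only runner remains on GitHub
  replicas: 0
  template:
    spec:
      organization: ""
      labels:
      - generic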

mumoshu merged commit dbd7b48 into master on May 2, 2021
mumoshu deleted the scale-from-to-zero branch on May 2, 2021 07:11
@ScOut3R

ScOut3R commented May 3, 2021

@ScOut3R Hey! You very likely need to create the RunnerDeployment alone, without an HRA.

HRA forces minReplicas to be 1 at minimum. The patch that allows HRA to have 0 minReplicas is not included in this PR, but in #447.

Thanks @mumoshu, that explains it. So in order to scale to/from 0 and use GitHub webhooks as the trigger for scaling, #447 is also required.
