feat: HorizontalRunnerAutoscaler Webhook server #282
Conversation
826ecbf to ba53f9a
This introduces a Webhook server that responds to GitHub `check_run`, `pull_request`, and `push` events by scaling up the matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting for the next sync period when there aren't enough runners.

This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, whereas actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.

On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match `HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest]` against the event, and if it finds only one HRA, that is the scale target. If none, or two or more, targets are found for repository-wide runners, it does the same on organizational runners.
for _, ent := range updated.Status.CacheEntries {
	if ent.ExpirationTime.Before(&metav1.Time{Time: now}) {
		cacheEntries = append(cacheEntries, ent)
	}
}
@mumoshu: I am having trouble understanding the intent of this loop. It seems to go through all expired entries and add them to an array, but this array is not used anywhere afterwards.
updated.Status.CacheEntries = append(updated.Status.CacheEntries, v1alpha1.CacheEntry{
	Key:            v1alpha1.CacheEntryKeyDesiredReplicas,
	Value:          *replicas,
	ExpirationTime: metav1.Time{Time: time.Now().Add(cacheDuration)},
})
If I understand this code correctly, it keeps adding to the status of the HorizontalRunnerAutoscaler until it finally exceeds the limits of the etcd storage backend, see #346. Was your intent to remove the expired cache entries ☝️?
@jonico Ah! Good catch. Definitely, my intent was to remove the expired entries.
(#406) As part of #282, I have introduced a caching mechanism to avoid excessive GitHub API calls, because the autoscaling calculation, which involves GitHub API calls, is executed on each webhook event. Apparently, it was saving the wrong value in the cache: the value was the one after applying `HRA.Spec.{Max,Min}Replicas`, so manual changes to {Max,Min}Replicas didn't affect RunnerDeployment.Spec.Replicas until the cache expired. This isn't what I had wanted. This patch fixes that by changing the cached value to the one before applying {Min,Max}Replicas.

Additionally, I've also updated logging so that you can observe which number was fetched from the cache, which number was suggested by either TotalNumberOfQueuedAndInProgressWorkflowRuns or PercentageRunnersBusy, and which number was finally used as the desired replicas (after applying {Min,Max}Replicas).

Follow-up for #282
@mumoshu Please add some documentation on how to set up the webhook on the GitHub side:
Hi, your doc https://github.com/actions-runner-controller/actions-runner-controller#webhook-driven-scaling

This also fails with "workflow_job" or "pullRequest", but it succeeds with "checkRun" and "push".
This introduces a Webhook server that responds to GitHub `check_run`, `pull_request`, and `push` events by scaling up the matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting for the next sync period when there aren't enough runners.

HorizontalRunnerAutoscaler.Spec has been extended to make this behavior configurable.
Example 1: Scale up on each `check_run` event

To scale up replicas of the runners for `example/myrepo` by 1 for 5 minutes on each `check_run`, you write manifests like the below:

Example 2: Scale on each `pull_request` event against `develop` or `main` branches

See "activity types" for the list of valid values for `scaleUpTriggers[].githubEvent.pullRequest.types`.

Implementation
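For reference, a minimal manifest along the lines of Example 1 above might look like the following sketch; the metadata and RunnerDeployment names are illustrative, and the field layout follows the `scaleUpTriggers` API this PR introduces.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: myrepo-autoscaler
spec:
  scaleTargetRef:
    # Hypothetical RunnerDeployment for the example/myrepo repository
    name: myrepo-runnerdeployment
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1      # add 1 replica per matching event
    duration: "5m" # keep the extra replica for 5 minutes
```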
First of all, this feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. I appreciate the authors and maintainers of terraform-aws-github-runner for sharing their awesome ideas and work as open source.
The biggest difference would be that terraform-aws-github-runner can manage one set of runners per deployment, whereas actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.
On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up.
Then the webhook server tries to match `HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest]` against the event, and if it finds only one HRA, it considers that the scale target.

If none, or two or more, targets are found for repository-wide runners, it does the same on organizational runners.
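The selection rule described above can be sketched as follows; the `HRA` type and `pickTarget` function are hypothetical names for illustration, not the actual controller code, and the inputs are assumed to be the HRAs whose triggers already matched the incoming event.

```go
package main

import "fmt"

// HRA is a minimal stand-in for a HorizontalRunnerAutoscaler whose
// scale-up triggers matched the incoming GitHub event.
type HRA struct {
	Name string
}

// pickTarget returns the single scale target, or false when none is
// unambiguous: it first looks at repository-wide HRAs, and only when
// zero or multiple of those match does it fall back to organizational
// HRAs under the same rule.
func pickTarget(repoWide, organizational []HRA) (HRA, bool) {
	if len(repoWide) == 1 {
		return repoWide[0], true
	}
	if len(organizational) == 1 {
		return organizational[0], true
	}
	return HRA{}, false
}

func main() {
	repo := []HRA{{Name: "myrepo-autoscaler"}}
	org := []HRA{{Name: "org-autoscaler"}}
	if target, ok := pickTarget(repo, org); ok {
		fmt.Println(target.Name) // myrepo-autoscaler
	}
}
```

The point of the two-step rule is that scaling only happens when exactly one candidate exists at a given scope, so an ambiguous event never scales an arbitrary HRA.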
The webhook server scales up the HorizontalRunnerAutoscaler by adding an array entry to `HRA.Spec.CapacityReservations[]`, which is also added in this PR. The HRA controller watches the change and reconciles the HRA resource by updating the scale target (RunnerDeployment)'s desired replicas.

The `CapacityReservations` item (we call it a "capacity reservation") is composed of `Name`, `ExpirationTime`, and `Replicas` fields. Each item is uniquely identified by `Name`. Each capacity reservation denotes that you want to "temporarily" add the number of `Replicas` "in addition" to the desired number of replicas computed by querying the GitHub API. "Temporarily" because it is valid only until `ExpirationTime` is met.
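For illustration, a single capacity reservation entry as described above might serialize like this sketch; the name and timestamp values are hypothetical.

```yaml
spec:
  capacityReservations:
  - name: "check-run-scale-up-1234"        # unique identifier of this reservation
    expirationTime: "2021-04-01T12:05:00Z" # the extra replicas are valid only until this time
    replicas: 1                            # added on top of the computed desired replicas
```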