
feat: HorizontalRunnerAutoscaler Webhook server #282

Merged: 17 commits merged into master from hra-webhook on Feb 7, 2021

Conversation

@mumoshu (Collaborator) commented Jan 31, 2021

This introduces a webhook server that responds to GitHub check_run, pull_request, and push events by scaling up the matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting for the next sync period to add the missing runners.

HorizontalRunnerAutoscaler.Spec has been extended to make this behavior configurable.

Example 1: Scale up on each check_run event

Note: This should work almost like https://github.com/philips-labs/terraform-aws-github-runner

To scale up the replicas of the runners for example/myrepo by 1 for 5 minutes on each check_run event, write manifests like the following:

kind: RunnerDeployment
metadata:
  name: myrunners
spec:
  repository: example/myrepo
---
kind: HorizontalRunnerAutoscaler
spec:
  scaleTargetRef:
    name: myrunners
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1
    duration: "5m"

Example 2: Scale up on each pull_request event against the develop or main branches

kind: RunnerDeployment
metadata:
  name: myrunners
spec:
  repository: example/myrepo
---
kind: HorizontalRunnerAutoscaler
spec:
  scaleTargetRef:
    name: myrunners
  scaleUpTriggers:
  - githubEvent:
      pullRequest:
        types: ["synchronize"]
        branches: ["main", "develop"]
    amount: 1
    duration: "5m"

See "activity types" for the list of valid values for scaleUpTriggers[].githubEvent.pullRequest.types.

Implementation

First of all, this feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. I appreciate the authors and maintainers of terraform-aws-github-runner for sharing their awesome ideas and work as open source.

The biggest difference would be that terraform-aws-github-runner can manage one set of runners per deployment, whereas actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.

On each GitHub event received, the webhook server looks up repository-wide and organizational runners in the cluster and searches for a single target to scale up.

Note that these lookups use controller-runtime's internal search index, so they don't end up DoS-attacking your K8s API server.

Then the webhook server tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event, and if it finds exactly one matching HRA, it considers that the scale target.

If none, or two or more, targets are found among repository-wide runners, it repeats the same search on organizational runners.

The webhook server scales up the HorizontalRunnerAutoscaler by adding an entry to HRA.Spec.CapacityReservations[], which is also added in this PR. The HRA controller watches for the change and reconciles the HRA resource by updating the scale target (RunnerDeployment)'s desired replicas.

The CapacityReservations item (we call it a "capacity reservation") is composed of Name, ExpirationTime, and Replicas fields. Each item is uniquely identified by Name. Each capacity reservation denotes that you want to "temporarily" add the number of Replicas "in addition" to the desired number of replicas computed by querying the GitHub API. "Temporarily", because the reservation is valid only until its ExpirationTime passes.
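For illustration, an HRA carrying one such reservation might look like the sketch below. This is a hedged example: the field names follow the Go fields just described, but the lower-camel-case YAML casing and the name value are assumptions.

```yaml
kind: HorizontalRunnerAutoscaler
spec:
  scaleTargetRef:
    name: myrunners
  capacityReservations:
  - name: "c0ffee"                          # hypothetical unique name
    expirationTime: "2021-02-07T09:00:00Z"  # reservation is dropped after this time
    replicas: 1                             # added on top of the computed desired replicas
```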

@mumoshu force-pushed the hra-webhook branch 3 times, most recently from 826ecbf to ba53f9a on January 31, 2021
@mumoshu mumoshu merged commit ab1c39d into master Feb 7, 2021
@mumoshu mumoshu deleted the hra-webhook branch February 7, 2021 08:37
mumoshu added a commit that referenced this pull request Feb 22, 2021
* Fix panic on webhook for user-owned repository

Follow-up for #282 and #334
@jonico (Contributor) left a comment

@mumoshu: After all my autoscalers started to show the accumulation / extended etcd storage problem (#346), I looked into this code and think you may have forgotten to delete updated entries from the cache, so it just keeps adding up.

Comment on lines +153 to +157

    for _, ent := range updated.Status.CacheEntries {
        if ent.ExpirationTime.Before(&metav1.Time{Time: now}) {
            cacheEntries = append(cacheEntries, ent)
        }
    }

@mumoshu: I am having trouble understanding the intent of this loop. It seems to go through all the expired entries and appends them to an array, but that array is never used afterwards.

Comment on lines +167 to +171

    updated.Status.CacheEntries = append(updated.Status.CacheEntries, v1alpha1.CacheEntry{
        Key:            v1alpha1.CacheEntryKeyDesiredReplicas,
        Value:          *replicas,
        ExpirationTime: metav1.Time{Time: time.Now().Add(cacheDuration)},
    })

If I understand this code correctly, it keeps adding to the status of the HorizontalRunnerAutoscaler until it finally exceeds the limits of the etcd storage backend, see #346. Was your intent to remove the expired cache entries ☝️ ?

@mumoshu (Collaborator, Author) replied:

@jonico Ah! Good catch. Definitely, my intent was to remove the expired entries

mumoshu added a commit that referenced this pull request Mar 19, 2021
As part of #282, I introduced a caching mechanism to avoid excessive GitHub API calls, because the autoscaling calculation, which involves GitHub API calls, is executed on each webhook event.

Apparently, it was saving the wrong value in the cache: the cached value was the one after applying `HRA.Spec.{Max,Min}Replicas`, so manual changes to {Max,Min}Replicas didn't affect RunnerDeployment.Spec.Replicas until the cache expired. This isn't what I had wanted.

This patch fixes that by changing the cached value to the one before applying {Min,Max}Replicas.

Follow-up for #282
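The difference can be illustrated with a small sketch under stated assumptions: clampReplicas is a hypothetical stand-in for applying {Min,Max}Replicas, and the numbers are made up.

```go
package main

import "fmt"

// clampReplicas applies HRA.Spec.{Min,Max}Replicas to a suggested count.
func clampReplicas(suggested, min, max int) int {
	if suggested < min {
		return min
	}
	if suggested > max {
		return max
	}
	return suggested
}

func main() {
	// Before the fix: the already-clamped value was cached, so raising
	// maxReplicas from 5 to 10 had no effect until the cache expired.
	stale := clampReplicas(12, 1, 5) // cached as 5
	fmt.Println(clampReplicas(stale, 1, 10)) // still 5

	// After the fix: the raw suggestion is cached and clamping happens
	// on every read, so the new maxReplicas applies immediately.
	raw := 12 // cached as 12
	fmt.Println(clampReplicas(raw, 1, 10)) // 10
}
```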
mumoshu added a commit that referenced this pull request Mar 19, 2021 (#406)

This commit contains the same caching fix described above. Additionally, I've updated logging so that you can observe which number was fetched from the cache, which number was suggested by either TotalNumberOfQueuedAndInProgressWorkflowRuns or PercentageRunnersBusy, and which number was finally used as the desired replicas (after applying {Min,Max}Replicas).

Follow-up for #282
@Nuru (Contributor) commented Apr 23, 2022

@mumoshu Please add some documentation on how to set up the webhook on the GitHub side:

  • org vs repo webhook
  • which events to configure
  • setting "Content type" to application/json
  • Configuring a Secret (if that is supported)

@racker7 commented Jun 9, 2022

Hi, your doc https://github.com/actions-runner-controller/actions-runner-controller#webhook-driven-scaling
mentions "workflow_job" and points here.
However, it's not recognized, and it's not mentioned on this page.
Has the "workflow_job" and/or "workflowJob" webhook event type been implemented, and if not, can it be added please?

$ kubectl -n myns apply -f test/deploy-hra-webhook.yaml                                                                      
error: error validating "test/deploy-hra-webhook.yaml": error validating data: ValidationError(HorizontalRunnerAutoscaler.spec.scaleUpTriggers[0].githubEvent): unknown field "workflowJob" in dev.summerwind.actions.v1alpha1.HorizontalRunnerAutoscaler.spec.scaleUpTriggers.githubEvent;

This also fails with "workflow_job" or "pullRequest", but it succeeds with "checkRun" and "push".
And it works with an empty githubEvent: {}

$ cat test/deploy-hra-webhook.yaml                                                                                                   
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: myrun
spec:
  maxReplicas: 15
  minReplicas: 0
  scaleTargetRef:
    name: myrd
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        status: "queued"

EDIT:
I see it was added here: https://github.com/actions-runner-controller/actions-runner-controller/releases/tag/v0.20.0
FWIW, we are on AWS EKS with ARC release v0.22.3, and the CRDs appear to be updated correctly.
It works fine with workflow_job events from GHE using the generic githubEvent: {}; it just won't let me specify workflowJob in the HRA definition. This means I can have only one kind of GitHub event, which has not yet presented any problems.
UPDATE:
Convinced the team to delete the CRDs and redeploy via Argo to the latest v0.24.1. Now workflowJob appears in the HRA CRD description and works in the HRA with GHE.
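With up-to-date CRDs, a trigger along these lines should then validate; this is a sketch that mirrors the earlier manifest in this comment and assumes workflowJob accepts the same empty-object form as the generic githubEvent: {} shown above.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: myrun
spec:
  maxReplicas: 15
  minReplicas: 0
  scaleTargetRef:
    name: myrd
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
```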
