feat: HorizontalRunnerAutoscaler Webhook server #282
Conversation
826ecbf to ba53f9a
This introduces a Webhook server that responds to GitHub `check_run`, `pull_request`, and `push` events by scaling up the matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting for the next sync period when there aren't enough runners.

This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, whereas actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.

On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match `HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest]` against the event, and if it finds only one HRA, that is the scale target. If none, or two or more, targets are found for repository-wide runners, it does the same on organizational runners.
for _, ent := range updated.Status.CacheEntries {
	if ent.ExpirationTime.Before(&metav1.Time{Time: now}) {
		cacheEntries = append(cacheEntries, ent)
	}
}
@mumoshu: I am having trouble understanding the intent of this loop. It seems to go through all expired entries and add them to an array, but this array is not used anywhere afterwards.
updated.Status.CacheEntries = append(updated.Status.CacheEntries, v1alpha1.CacheEntry{
	Key:            v1alpha1.CacheEntryKeyDesiredReplicas,
	Value:          *replicas,
	ExpirationTime: metav1.Time{Time: time.Now().Add(cacheDuration)},
})
If I understand this code correctly, it keeps adding to the status of the HorizontalRunnerAutoscaler until it finally exceeds the limits of the etcd storage backend, see #346. Was your intent to remove the expired cache entries ☝️?
@jonico Ah! Good catch. Definitely, my intent was to remove the expired entries.
(#406) As part of #282, I have introduced a caching mechanism to avoid excessive GitHub API calls, because the autoscaling calculation, which involves GitHub API calls, is executed on each webhook event. Apparently, it was saving the wrong value in the cache: the value was the one after applying `HRA.Spec.{Max,Min}Replicas`, so manual changes to {Max,Min}Replicas didn't affect RunnerDeployment.Spec.Replicas until the cache expired. This isn't what I had wanted. This patch fixes that by changing the cached value to the one before applying {Min,Max}Replicas.

Additionally, I've also updated logging so that you can observe which number was fetched from the cache, which number was suggested by either TotalNumberOfQueuedAndInProgressWorkflowRuns or PercentageRunnersBusy, and which number was finally used as the desired replicas (after applying {Min,Max}Replicas).

Follow-up for #282
@mumoshu Please add some documentation on how to set up the webhook on the GitHub side:
Hi, your doc https://github.com/actions-runner-controller/actions-runner-controller#webhook-driven-scaling

This also fails with "workflow_job" or "pullRequest", but it succeeds with "checkRun" and "push".
This introduces a Webhook server that responds to GitHub `check_run`, `pull_request`, and `push` events by scaling up the matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting for the next sync period when there aren't enough runners.

HorizontalRunnerAutoscaler.Spec has been extended to make this behavior configurable.
Example 1: Scale up on each `check_run` event

To scale up replicas of the runners for `example/myrepo` by 1 for 5 minutes on each `check_run`, you write manifests like the below:

Example 2: Scale on each `pull_request` event against `develop` or `main` branches

See "activity types" for the list of valid values for `scaleUpTriggers[].githubEvent.pullRequest.types`.

Implementation
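For reference, a minimal manifest along the lines of Example 1 above might look like the following sketch; the metadata and RunnerDeployment names are illustrative, and the field layout follows the `scaleUpTriggers` API this PR introduces.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: myrepo-autoscaler
spec:
  scaleTargetRef:
    # Hypothetical RunnerDeployment for the example/myrepo repository
    name: myrepo-runnerdeployment
  scaleUpTriggers:
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
    amount: 1      # add 1 replica per matching event
    duration: "5m" # keep the extra replica for 5 minutes
```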
First of all, this feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. I appreciate the authors and maintainers of terraform-aws-github-runner for sharing their awesome ideas and work as open source.
The biggest difference would be that terraform-aws-github-runner can manage one set of runners per deployment, whereas actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.
On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up.
Then the webhook server tries to match `HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest]` against the event, and if it finds only one HRA, it considers that the scale target.

If none, or two or more, targets are found for repository-wide runners, it does the same on organizational runners.
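The selection rule described above can be sketched as follows; the `HRA` type and `pickTarget` function are hypothetical names for illustration, not the actual controller code, and the inputs are assumed to be the HRAs whose triggers already matched the incoming event.

```go
package main

import "fmt"

// HRA is a minimal stand-in for a HorizontalRunnerAutoscaler whose
// scale-up triggers matched the incoming GitHub event.
type HRA struct {
	Name string
}

// pickTarget returns the single scale target, or false when none is
// unambiguous: it first looks at repository-wide HRAs, and only when
// zero or multiple of those match does it fall back to organizational
// HRAs under the same rule.
func pickTarget(repoWide, organizational []HRA) (HRA, bool) {
	if len(repoWide) == 1 {
		return repoWide[0], true
	}
	if len(organizational) == 1 {
		return organizational[0], true
	}
	return HRA{}, false
}

func main() {
	repo := []HRA{{Name: "myrepo-autoscaler"}}
	org := []HRA{{Name: "org-autoscaler"}}
	if target, ok := pickTarget(repo, org); ok {
		fmt.Println(target.Name) // myrepo-autoscaler
	}
}
```

The point of the two-step rule is that scaling only happens when exactly one candidate exists at a given scope, so an ambiguous event never scales an arbitrary HRA.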
The webhook server scales up the HorizontalRunnerAutoscaler by adding an array entry to `HRA.Spec.CapacityReservations[]`, which is also added in this PR. The HRA controller watches the change and reconciles the HRA resource by updating the scale target (RunnerDeployment)'s desired replicas.

The `CapacityReservations` item (we call it a "capacity reservation") is composed of `Name`, `ExpirationTime`, and `Replicas` fields. Each item is uniquely identified by `Name`. Each capacity reservation denotes that you want to "temporarily" add the number of `Replicas` "in addition" to the desired number of replicas computed by querying the GitHub API. "Temporarily" because it is valid only until `ExpirationTime` is met.
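For illustration, a single capacity reservation entry as described above might serialize like this sketch; the name and timestamp values are hypothetical.

```yaml
spec:
  capacityReservations:
  - name: "check-run-scale-up-1234"        # unique identifier of this reservation
    expirationTime: "2021-04-01T12:05:00Z" # the extra replicas are valid only until this time
    replicas: 1                            # added on top of the computed desired replicas
```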