
Support runner groups with selected visibility in webhooks autoscaler #1012

Merged (3 commits, Dec 19, 2021)

Conversation

fgalind1 (Contributor)

The current implementation doesn't yet support runner groups with custom visibility (e.g. selected repositories only). If there are multiple runner groups with selected visibility, not all runner groups may be a potential target to be scaled up. Thus this PR introduces support for runner groups with selected visibility. This requires querying the GitHub API to find which runner groups are linked to a specific repository (whether using visibility "all" or "selected").

This also improves resolving the scaleTargetKey values that are used to match an HRA based on the inputs of the RunnerSet/RunnerDeployment spec, to better support runner groups.

This requires configuring GitHub auth in the webhook server. To keep backwards compatibility, if GitHub auth is not provided to the webhook server, it will assume no runner groups have selected visibility and will target any available runner group, as before.
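As a rough illustration of that lookup, here is a minimal sketch using go-github (the import path and method names are my assumptions based on go-github v41, pagination is ignored, and this is not the PR's actual code):

```go
package sketch

import (
	"context"

	"github.com/google/go-github/v41/github"
)

// visibleRunnerGroups returns the names of organization runner groups a given
// repository can use: groups with "all" visibility, plus "selected" groups
// that explicitly list the repository.
func visibleRunnerGroups(ctx context.Context, gh *github.Client, org, repo string) ([]string, error) {
	groups, _, err := gh.Actions.ListOrganizationRunnerGroups(ctx, org, nil)
	if err != nil {
		return nil, err
	}
	var visible []string
	for _, g := range groups.RunnerGroups {
		// Groups without "selected" visibility are visible to every repo.
		if g.GetVisibility() != "selected" {
			visible = append(visible, g.GetName())
			continue
		}
		// For "selected" visibility, check the group's repository allow list.
		repos, _, err := gh.Actions.ListRepositoryAccessRunnerGroup(ctx, org, g.GetID(), nil)
		if err != nil {
			return nil, err
		}
		for _, r := range repos.Repositories {
			if r.GetName() == repo {
				visible = append(visible, g.GetName())
				break
			}
		}
	}
	return visible, nil
}
```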

@@ -586,6 +658,9 @@ HRA:

```go
// Ensure that the RunnerSet-managed runners have all the labels requested by the workflow_job.
for _, l := range labels {
	if l == "self-hosted" {
```
mumoshu (Collaborator)

I recently learned that this isn't quite true. self-hosted is included in a workflow_job event because you included it in runs-on of your workflow definition YAML. So to make it clearer, you'd better do:

Suggested change:

```diff
-if l == "self-hosted" {
+if len(rs.Spec.Labels) == 0 && l == "self-hosted" {
```

So that it's clearer that actions-runner-controller just assumes the default RunnerSet.Spec.Labels and RunnerDeployment.Spec.Labels as ["self-hosted"] if empty.

fgalind1 (Contributor, Author) commented Dec 18, 2021

Thanks for your review @mumoshu

yes - that's right, self-hosted is included in workflow_job. However the proposed approach doesn't cover other cases, e.g. if I have a workflow with runs-on: ["self-hosted", "group-a"] and I have RunnerDeployment.Spec.Labels = ["group-a"], then the job won't be scheduled. The only way would be to define RunnerDeployment.Spec.Labels = ["self-hosted", "group-a"], although that seems weird for two reasons:

  1. all runners in this controller are self-hosted - if you register a runner without self-hosted in Labels it will be useless, as it will never match a self-hosted runner.
  2. self-hosted is a default label added by actions/runner which is not specified during registration - I believe in Spec.Labels we should set the custom labels that are going to be passed via config.sh, which is not the case for self-hosted.

If it's easier I can remove this part from this PR and make a separate PR for it, just to keep them separate 😃

Any thoughts?
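To illustrate the rule being argued for here, a minimal, hypothetical matcher (not ARC's code) that treats self-hosted as implicit, so Spec.Labels = ["group-a"] still satisfies runs-on: ["self-hosted", "group-a"]:

```go
package sketch

import "strings"

// runnerMatchesJob treats "self-hosted" as an implicit runner label, since
// actions/runner adds it by default and it is not passed to config.sh
// during registration.
func runnerMatchesJob(runnerLabels, jobLabels []string) bool {
	have := map[string]bool{"self-hosted": true}
	for _, l := range runnerLabels {
		have[strings.ToLower(l)] = true
	}
	for _, l := range jobLabels {
		if !have[strings.ToLower(l)] {
			return false // the job asks for a label this runner doesn't have
		}
	}
	return true
}
```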

mumoshu (Collaborator)

@fgalind1 Thanks for helping me shape this!

I believe this discussion relates to #953 which stems from the same thought as yours.

self-hosted is a default label added by actions/runner which is not specified during registration

So the biggest thing I missed was this one!

I was blindly assuming that self-hosted is just a standard label that everyone uses just by convention. If this is the hard-coded default used by the actions/runner agents as well, actions-runner-controller should treat it as such too.

I'm going to merge #953 first as it's submitted earlier and after that, I'm fine with this change being included in this PR!

```go
if len(c.Token) > 0 || (c.AppID > 0 && c.AppInstallationID > 0 && c.AppPrivateKey != "") {
	ghClient, err = c.NewClient()
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error: Client creation failed.", err)
```
mumoshu (Collaborator)

Probably we can make this even nicer by adding more info on what was provided and what wasn't, like Error: Client creation failed. AppID is provided but Token, AppInstallationID, AppPrivateKey were missing. You need to either set Token, or all of AppID, AppInstallationID and AppPrivateKey. But let's do that in another pr :)
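For illustration, the friendlier message suggested above might be built with a helper along these lines (a hypothetical sketch; the field names mirror the snippet above, this is not this PR's code):

```go
package sketch

import (
	"fmt"
	"strings"
)

// authError spells out which GitHub auth fields are missing, instead of a
// bare "Client creation failed.". Assumed to be called only after the auth
// check above has already failed.
func authError(appID, appInstallationID int64, appPrivateKey string) error {
	var missing []string
	if appID == 0 {
		missing = append(missing, "AppID")
	}
	if appInstallationID == 0 {
		missing = append(missing, "AppInstallationID")
	}
	if appPrivateKey == "" {
		missing = append(missing, "AppPrivateKey")
	}
	return fmt.Errorf(
		"client creation failed: Token is empty and %s missing; set either Token, or all of AppID, AppInstallationID and AppPrivateKey",
		strings.Join(missing, ", "),
	)
}
```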

mumoshu (Collaborator)

Nevermind! I just misunderstood when this error occurs.

mumoshu (Collaborator) left a comment

LGTM. Thank you for your contribution! I'm merging this.

We still lack a comprehensive E2E test suite, so an awesome but relatively large change like this one may break things in unexpected ways.
So I'd appreciate it a lot if you could keep testing this for a while, perhaps by building and deploying actions-runner-controller from the latest main branch after merging!
Thanks in advance for your contribution and support ☺️

@mumoshu mumoshu merged commit 4ebec38 into actions:master Dec 19, 2021
mumoshu pushed a commit that referenced this pull request Dec 21, 2021
Duplicate condition caused after merge of #953 and #1012
toast-gear (Collaborator)

@fgalind1 could I ask you what your driver was for introducing this change?

mumoshu (Collaborator) commented Jan 17, 2022

@fgalind1 I've discussed with @toast-gear and we now consider this to be not the right way to do it.

AFAIK, a runner group works as an additional visibility toggle on top of labels: GitHub filters target runners by both labels and groups. What we've done in this PR doesn't align with that GitHub Actions behavior.

You scaled "all" the runner groups associated with a repo/org/enterprise regardless of labels. But GitHub still uses labels to select a runner to schedule the job on. This results in ARC scaling up unnecessary runners that are associated with a visible runner group BUT don't match the workflow job labels.

Better support for runner group visibility would be to make ARC skip a RunnerDeployment/RunnerSet that is found via the scaleTarget func IF it is associated with a runner group and the group is invisible to the target repo/org/enterprise.
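In code, the proposed skip might look roughly like this (hypothetical types and helper, a sketch of the idea rather than ARC's implementation):

```go
package sketch

// ScaleTarget is a hypothetical stand-in for an HRA's RunnerDeployment/RunnerSet.
type ScaleTarget struct {
	Name        string
	RunnerGroup string // empty means the default runner group
}

// filterVisibleTargets drops candidates whose runner group is not visible to
// the repo/org/enterprise that sent the webhook.
func filterVisibleTargets(candidates []ScaleTarget, visible map[string]bool, defaultGroupVisible bool) []ScaleTarget {
	var out []ScaleTarget
	for _, t := range candidates {
		if t.RunnerGroup == "" {
			if defaultGroupVisible {
				out = append(out, t)
			}
			continue
		}
		if visible[t.RunnerGroup] {
			out = append(out, t)
		}
	}
	return out
}
```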

mumoshu (Collaborator) commented Jan 17, 2022

Linking #1055

log.Info("finding organizational runner", "organization", owner)
// Search for organization runner HRAs in default runner group
if target, err := scaleTarget(owner); err != nil {
log.Error(err, "finding organizational runner", "organization", owner)
return nil, err
} else if target != nil {
mumoshu (Collaborator) commented Jan 17, 2022

I believe what we had to do instead was to skip this target (= a RunnerDeployment that is associated with an organization) if it was associated with a runner group and that runner group was invisible to this repository.

```go
	log.Error(err, "finding enterprise runner", "enterprise", enterprise)
	return nil, err
} else if target != nil {
	log.Info("scale up target is default enterprise runners", "enterprise", enterprise)
```
mumoshu (Collaborator)

Similar to https://github.com/actions-runner-controller/actions-runner-controller/pull/1012/files#r785624849, I believe what we'd have to do instead would be to skip this target (= a RunnerDeployment that is associated with an enterprise) if it was associated with a runner group defined at the enterprise level and the group was configured to be invisible to this organization.

```go
if enterprise := rd.Spec.Template.Spec.Enterprise; enterprise != "" {
	keys = append(keys, enterpriseKey(enterprise))
	if group := rd.Spec.Template.Spec.Group; group != "" {
		keys = append(keys, enterpriseRunnerGroupKey(enterprise, rd.Spec.Template.Spec.Group)) // Enterprise runner groups
```
mumoshu (Collaborator)

It turns out we don't need to find HRAs by runner groups, as GitHub finds the runner to schedule a job on by labels anyway.
What we need instead would be to list all the HRAs whose RunnerDeployment matches the workflow_job labels, and filter out those that are associated with runner groups (both organization and enterprise groups) that are unavailable to the org / the repo.

fgalind1 (Contributor, Author) commented Jan 18, 2022

@mumoshu but you can use groups without labels - you could set up runner groups with custom visibility and no custom labels, right? (That's a common pattern in our company.) If we don't take the group into consideration, we could scale up the wrong runner group.

toast-gear (Collaborator) commented Jan 18, 2022

That's not really the design philosophy behind runner groups, though. If GitHub had intended them as a means for defining exclusive access they would have:

  1. included the repository runner group membership information in the workflow_jobs event whilst queued
  2. enabled you to remove repositories from being members of the Default runner group.

Runner groups are a means for extending what runners are available to a repository via labels. They aren't designed to be a way of defining placement of runners beyond opening up access to different classes of runners via labels unique to those runner groups, which in turn implicitly defines placement.

fgalind1 (Contributor, Author) commented Jan 18, 2022

  1. included the repository runner group membership information in the workflow_jobs event whilst queued

This is one of the points that we're discussing with GitHub (still pending): to potentially include that information in the webhook payload and get full support, as they do support custom visibility. That could simplify some of the logic here by avoiding querying the GitHub API to find the matching runner group.

  2. enabled you to remove repositories from being members of the Default runner group.

You can just not add runners to the default runner group (this is what we're doing); that way we can preserve the intent of having only repository-specific runner groups.

Runner groups are a means for extending what runners are available to a repository via labels. They aren't designed to be a way of defining placement of runners beyond opening up access to different classes of runners via labels, which in turn implicitly defines placement.

Uhm, that's why GitHub added custom visibility, isn't it? So that you could scope some sort of protected runners to a specific set of repositories. Labels won't do anything w.r.t. protection or scoping some runners to only some repositories.

I'm happy to discuss more details on this use case if needed 😃

mumoshu (Collaborator) commented Jan 19, 2022

@mumoshu but you can use groups without labels - you could set up runner groups with custom visibility and no custom labels, right? (That's a common pattern in our company.) If we don't take the group into consideration, we could scale up the wrong runner group.

@fgalind1 Correct. But this seems like looking at it the wrong way.

What's implemented in this PR only works when all of your runner groups use the same set of labels.

If you configure a different set of labels per runner group, what's done in this PR scales the wrong RunnerDeployments.

mumoshu (Collaborator)

Instead you'd need to do what I've explained in #1012 (comment)

fgalind1 (Contributor, Author)

I believe what we had to do instead was to skip this target (= a RunnerDeployment that is associated with an organization) if it was associated with a runner group and that runner group was invisible to this repository.

@mumoshu just to clarify your previous comment - right now we find the potential HRA with the following precedence:

  1. matching runner in the repo from the webhook
  2. matching default runner in an org from the webhook
  3. matching default runner in the enterprise from the webhook
  4. matching runner group in an org that matches the repo from the webhook and the runner group's custom visibility
  5. matching runner group in an enterprise that matches the repo from the webhook and the runner group's custom visibility

with your comment - are you suggesting removing/skipping 2 and 3?

it was associated with a runner group

This is the part that wasn't clear to me: which object were you referring to as being associated with the runner group?

mumoshu (Collaborator) commented Jan 19, 2022

I'd say it should be like the one below instead:

  1. matching repository-wide runner from the repository, the default org runner group, and the non-default org runner groups that are visible to the repository, from the webhook
  2. matching organizational runner from the organization, the default enterprise runner group, and the non-default enterprise runner groups that are visible to the organization, from webhook
  3. matching default runner in enterprise from webhook

You may already know, but there are two variants of runner groups: organizational runner groups and enterprise runner groups. Visibility for the former can be controlled per repository, not per organization. Similarly, visibility for the latter can be controlled per organization under the enterprise. That's why (I suppose) the logic for runner groups can be merged into steps 1, 2, and 3, rather than being independent steps after 3.
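As a sketch of how that merged precedence could translate into HRA index keys (the enterpriseKey/enterpriseRunnerGroupKey names come from this PR's diff; the repo/org-side helpers and the exact ORG/group/GROUP key format, modeled on the keys visible in the logs below, are my assumptions):

```go
package sketch

import "fmt"

// Hypothetical key builders modeled on the helpers in this PR's diff.
func repoKey(repo string) string                 { return repo }
func orgKey(org string) string                   { return org }
func orgRunnerGroupKey(org, g string) string     { return fmt.Sprintf("%s/group/%s", org, g) }
func enterpriseKey(ent string) string            { return ent }
func enterpriseRunnerGroupKey(ent, g string) string { return fmt.Sprintf("%s/group/%s", ent, g) }

// lookupKeys lists HRA index keys in the precedence described above: the
// repository first, then the org (default group plus visible org groups),
// then the enterprise (default group plus visible enterprise groups).
func lookupKeys(repo, org, enterprise string, visibleOrgGroups, visibleEntGroups []string) []string {
	keys := []string{repoKey(repo), orgKey(org)}
	for _, g := range visibleOrgGroups {
		keys = append(keys, orgRunnerGroupKey(org, g))
	}
	keys = append(keys, enterpriseKey(enterprise))
	for _, g := range visibleEntGroups {
		keys = append(keys, enterpriseRunnerGroupKey(enterprise, g))
	}
	return keys
}
```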

fgalind1 (Contributor, Author) commented Jan 18, 2022

@fgalind1 could I ask you what your driver was for introducing this change?

In our company we use enterprise and organization runner groups with selected visibility. You can think of a runner group as mapped to a certain team who owns a few repos in my company; only that team/those repos should use that runner group. Prior to this change, if a webhook came from repo X, it could potentially scale up a runner group with selected visibility that doesn't include that repo.

GitHub by design doesn't necessarily need labels to be defined if you configure your runner groups/mapped repos correctly. If you have runner group X mapped to repos A and B, and runner group Y mapped to repo C, whenever you receive a webhook from repo C you want to scale up only RG Y (additional filtering could be done with labels, but that shouldn't be a requirement).

mumoshu (Collaborator) commented Jan 19, 2022

@fgalind1 Hey. I believe I understand your use case but I also see two problems:

  1. This change seems to work as intended only when you don't differentiate labels per runner group. We'd need to make it work as intended regardless of whether you use custom labels with runner groups or not, as GitHub allows it and I can imagine users who use runner groups with custom labels.

#1012 (comment)

  2. This turns out to amplify the number of API calls per webhook

2 can be addressed later with a big caution/enough documentation, but 1 needs to be addressed before the next release of ARC.

fgalind1 (Contributor, Author) commented Jan 19, 2022

This change seems to work as intended only when you don't differentiate labels per runner group. We'd need to make it work as intended regardless of whether you use custom labels with runner groups or not, as GitHub allows it and I can imagine users who use runner groups with custom labels.

@mumoshu I'll take a look to see if this could be a simple fix/update. So just to double-check: the example you're describing is that I could have 2 runner groups with different labels, and with the current implementation it may be possible that we could scale up (target) the runner group that doesn't match the webhook labels?

mumoshu (Collaborator) commented Jan 19, 2022

with the current implementation it may be possible that we could scale up (target) the runner group that doesn't match the webhook labels?

@fgalind1 Yes. That's exactly what I had in my mind!

mumoshu (Collaborator) commented Jan 19, 2022

@fgalind1 Here's my note that may help with fixing it: #1058

It isn't a working implementation of the fix, but it contains several TODO comments in places where I thought we'd add some code.

fgalind1 (Contributor, Author)

@mumoshu I did some real-life testing with the current controller as-is, and I'm not able to reproduce the incorrect behavior yet. Could you share the steps you have in mind to reproduce the issue?

e.g. I have:

  • HRA and RunnerDeployment "foo" with enterprise=XXX, group: k8s-dev, labels: [fgalind1]
  • Enterprise Runner group in GitHub with visibility to "fgalind1-test" repo

if I trigger a workflow_dispatch for the fgalind1-test repo with a job containing only the self-hosted label, I get:
Scale target not found. If this is unexpected, ensure that there is exactly one repository-wide or organizational

if I trigger a workflow_dispatch for the fgalind1-test repo with a job containing the self-hosted, fgalind1 labels, I properly scale the runner group:
scaled foo by 1

Right now if we have N groups with custom visibility and custom labels, first we find all the potential candidates via autoscaler.getPotentialGroupsFromHRAs and, based on that, we filter out the HRAs whose runner group doesn't match the labels.

mumoshu (Collaborator) commented Jan 19, 2022

Let's say you have:

  • HRA and RunnerDeployment "foo" with enterprise=XXX, group: k8s-dev, labels: [fgalind1]
  • HRA and RunnerDeployment "bar" with enterprise=XXX, labels: [fgalind1]
  • Set up two repos, fgalind-foo-test and fgalind-bar-test
  • Enterprise Runner group k8s-dev in GitHub with visibility to "fgalind-foo-test" repo only

What I had in my mind was that with the above setup, triggering a workflow run on fgalind-bar-test with runs-on: [fgalind1] may result in scaling deployment foo, which doesn't seem correct.

fgalind1 (Contributor, Author)

@mumoshu I tested the above example using the same input (just used an org runner group instead of an enterprise runner group) and got the following:

initial setup

foo runners (1 org runner + 1 runner group runner) [screenshot]

bar runners (1 org runner) [screenshot]

the HRA keys for the webhooks controller

github-webhook-server 1.642610484236478e+09    DEBUG    controllers.Runner    HRA keys indexed for HRA bar: [XXX]
github-webhook-server 1.6426104842365727e+09    DEBUG    controllers.Runner    HRA keys indexed for HRA foo: [XXX/group/foo-group]   

when I trigger a job in foo

github-webhook-server 1.6426105445647357e+09    DEBUG    controllers.Runner    Found 0 HRAs by key    {"key": "XXX/personal.fgalind1.foo-test"}
github-webhook-server 1.6426105445647907e+09    DEBUG    controllers.Runner    Found 1 HRAs by key    {"key": "XXX"}
github-webhook-server 1.6426105445648148e+09    INFO    controllers.Runner    job scale up target is organizational runners    {"event": "workflow_job", "hookID": "", "delivery": "c71f9490-79
github-webhook-server 1.642610544564828e+09    INFO    controllers.Runner    Patching hra for capacityReservations update    {"before": [{"expirationTime":"2022-01-19T17:08:58Z","replicas":1}
github-webhook-server 1.6426105445775921e+09    INFO    controllers.Runner    scaled bar by 1         

foo should scale up either of the deployments - it scaled up "bar"

when I trigger a job in bar

github-webhook-server 1.6426106025543697e+09    DEBUG    controllers.Runner    Found 0 HRAs by key    {"key": "XXX/personal.fgalind1.bar-test"}
github-webhook-server 1.6426106025544243e+09    DEBUG    controllers.Runner    Found 1 HRAs by key    {"key": "XXX"}
github-webhook-server 1.6426106025544486e+09    INFO    controllers.Runner    job scale up target is organizational runners    {"event": "workflow_job", "hookID": "", "delivery": "e9aeee70-79
github-webhook-server 1.6426106025554268e+09    INFO    controllers.Runner    Patching hra for capacityReservations update    {"before": [{"expirationTime":"2022-01-19T17:12:24Z","replicas":1
github-webhook-server 1.6426106025682225e+09    INFO    controllers.Runner    scaled bar by 1     

bar should scale up RunnerDeployment "bar" with org=XXX, labels: [fgalind1] and it scaled up "bar"

AFAIK the logic is working as expected, as we're honoring labels just fine.

mumoshu (Collaborator) commented Jan 19, 2022

bar should scale up RunnerDeployment "bar" with org=XXX, labels: [fgalind1] and it scaled up "bar"

Isn't that just a coincidence? ARC isn't really looking into the runner group when it's matching based on labels, so it MAY still scale up foo (which is incorrect) from time to time.

mumoshu (Collaborator) commented Jan 19, 2022

See this line:

https://github.com/actions-runner-controller/actions-runner-controller/blob/5e86881c30614aef7e5edbfdeb7854ada28101db/controllers/horizontal_runner_autoscaler_webhook.go#L440

ARC returns the first target when two or more scale targets are found. For deployment "bar", ARC would find both foo and bar as scale targets, and if the K8s API returned the RunnerDeployment foo earlier, it would return foo. Otherwise bar is returned. It's not deterministic.

What we'd need to do there would be to return the scale target with matching labels AND from a visible runner group, instead of blindly returning the first scale target with matching labels (which may or may not be coming from a visible runner group).
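Sketched in code (hypothetical types; the real fix landed later in #1062), the selection would require both a label match and a visible runner group instead of returning the first label match:

```go
package sketch

// Target is a hypothetical scale-target candidate.
type Target struct {
	Name        string
	Labels      []string
	RunnerGroup string // empty means the default runner group
}

// pickScaleTarget returns the first candidate whose labels satisfy the job
// AND whose runner group (if any) is visible to the webhook's repository,
// instead of blindly returning the first label match.
func pickScaleTarget(candidates []Target, jobLabels []string, groupVisible func(string) bool) *Target {
	for i := range candidates {
		t := &candidates[i]
		if !labelsSatisfied(t.Labels, jobLabels) {
			continue
		}
		if t.RunnerGroup != "" && !groupVisible(t.RunnerGroup) {
			continue
		}
		return t
	}
	return nil
}

func labelsSatisfied(runnerLabels, jobLabels []string) bool {
	have := map[string]bool{"self-hosted": true} // implicit default label
	for _, l := range runnerLabels {
		have[l] = true
	}
	for _, l := range jobLabels {
		if !have[l] {
			return false
		}
	}
	return true
}
```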

fgalind1 (Contributor, Author)

okay, I think I see the point of the discussion - this issue will happen if the "default runner group" also uses "selected repositories only". In the example I was testing, bar was using "all repositories", but if I switch it to "selected repositories = bar" then yes, the issue will occur. Working on the fix now.

mumoshu (Collaborator) commented Jan 20, 2022

I appreciate your efforts!

this issue will happen if the "default runner group" uses also "selected repositories only"

Yeah, if you switch the group bar to "selected repositories = bar" the issue will reproduce consistently (like 100%).
My point was, though, that it MAY still reproduce from time to time even with "all repositories".

ARC should guard against both cases, and I guess, implementation-wise, somehow enhancing the getScaleTarget function could address both cases altogether.

fgalind1 (Contributor, Author) commented Jan 20, 2022

I appreciate your efforts!

this issue will happen if the "default runner group" uses also "selected repositories only"

Yeah, if you switch the group bar to "selected repositories = bar" the issue will reproduce consistently (like 100%). My point was, though, that it MAY still reproduce from time to time even with "all repositories".

ARC should guard against both cases, and I guess, implementation-wise, somehow enhancing the getScaleTarget function could address both cases altogether.

@mumoshu let me know what you think 😃 #1062. Tested with the above flow using bar (selected repos and all repos) and everything looks good.

fgalind1 (Contributor, Author) commented Feb 2, 2022


@mumoshu just wondering if you had a chance to look at the improvements from #1062

mumoshu (Collaborator) commented Feb 2, 2022

@fgalind1 I'll try my best to fully review your change this weekend. Thank you for your patience!

mumoshu added a commit that referenced this pull request Feb 16, 2022
…1062)

This will work on GHES but not on GitHub Enterprise Cloud, due to the excessive GitHub API calls required.
More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud.

Fixes additional cases from #1012

If GitHub auth is provided to the webhooks controller, then runner groups with custom visibility are supported. Otherwise, all runner groups are assumed to be visible to all repositories.

`getScaleUpTargetWithFunction()` will check if there is an HRA available with the following flow:

1. Search for **repository** HRAs - if one is found, it ends here
2. Get available HRAs in k8s
3. Compute visible runner groups
  a. If GitHub auth is provided - get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls.  
  b. If GitHub auth is not provided - assume all runner groups are visible to all repositories
4. Search for **default organization** runners (a.k.a. runners from the organization's visible default runner group) with matching labels
5. Search for **default enterprise** runners (a.k.a. runners from the enterprise's visible default runner group) with matching labels
6. Search for **custom organization runner groups** with matching labels
7. Search for **custom enterprise runner groups** with matching labels

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
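Rendered as a rough Go sketch (hypothetical names throughout; see #1062 for the actual change), the flow above might read like this:

```go
package sketch

// findScaleUpTarget is a hypothetical rendering of the 7-step flow in the
// commit message above. match is assumed to look up HRAs by index key and to
// have already filtered on labels; steps 2-3 (listing HRAs and computing the
// visible runner groups, or assuming all groups are visible when GitHub auth
// is absent) are assumed to have produced the two visible-group slices.
func findScaleUpTarget(
	repo, org, enterprise string,
	visibleOrgGroups, visibleEntGroups []string,
	match func(key string) (hraName string, ok bool),
) (string, bool) {
	// 1. Repository HRAs win outright.
	if name, ok := match(repo); ok {
		return name, true
	}
	// 4. Default organization runner group, with matching labels.
	if name, ok := match(org); ok {
		return name, true
	}
	// 5. Default enterprise runner group, with matching labels.
	if name, ok := match(enterprise); ok {
		return name, true
	}
	// 6. Custom organization runner groups visible to the repository.
	for _, g := range visibleOrgGroups {
		if name, ok := match(org + "/group/" + g); ok {
			return name, true
		}
	}
	// 7. Custom enterprise runner groups visible to the repository.
	for _, g := range visibleEntGroups {
		if name, ok := match(enterprise + "/group/" + g); ok {
			return name, true
		}
	}
	return "", false
}
```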