Support runner groups with selected visibility in webhooks autoscaler #1012
Conversation
```diff
@@ -586,6 +658,9 @@ HRA:
// Ensure that the RunnerSet-managed runners have all the labels requested by the workflow_job.
for _, l := range labels {
	if l == "self-hosted" {
```
I recently learned that this isn't quite true. `self-hosted` is included in a `workflow_job` event because you had included it in `runs-on` of your workflow definition YAML. So if we were to make it clear, you'd better do:

```diff
- if l == "self-hosted" {
+ if len(rs.Spec.Labels) == 0 && l == "self-hosted" {
```

so that it's clearer that actions-runner-controller just assumes the default `RunnerSet.Spec.Labels` and `RunnerDeployment.Spec.Labels` to be `["self-hosted"]` when empty.
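A minimal sketch of the semantics this suggestion encodes (helper names are hypothetical, not ARC's actual code): an empty `Spec.Labels` is assumed to mean `["self-hosted"]`, and a job matches only if every requested label is covered.

```go
package main

import "fmt"

// effectiveLabels sketches the assumption above: an empty RunnerSet.Spec.Labels /
// RunnerDeployment.Spec.Labels is treated as the default ["self-hosted"].
// (Hypothetical helper, not actual ARC code.)
func effectiveLabels(specLabels []string) []string {
	if len(specLabels) == 0 {
		return []string{"self-hosted"}
	}
	return specLabels
}

// hasAllLabels reports whether every label from the workflow_job's runs-on
// is present in the runner's effective labels.
func hasAllLabels(jobLabels, specLabels []string) bool {
	have := map[string]bool{}
	for _, l := range effectiveLabels(specLabels) {
		have[l] = true
	}
	for _, l := range jobLabels {
		if !have[l] {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(hasAllLabels([]string{"self-hosted"}, nil))           // true: empty spec defaults to ["self-hosted"]
	fmt.Println(hasAllLabels([]string{"self-hosted"}, []string{"x"})) // false under this suggestion
}
```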
Thanks for your review @mumoshu. Yes, that's right: `self-hosted` is included in `workflow_job`. However the proposed approach doesn't cover other cases, e.g. if I have a workflow with `runs-on: ["self-hosted", "group-a"]` and I have `RunnerDeployment.Spec.Labels = ["group-a"]`, then the job won't be scheduled. The only way would be to define `RunnerDeployment.Spec.Labels = ["self-hosted", "group-a"]`, although that seems weird for two reasons:

- all runners in this controller are self-hosted; if you register a runner without `self-hosted` in `Labels` it will be useless, as it will never match a job that requests `self-hosted`
- `self-hosted` is a default label added by actions/runner which is not specified during registration. I believe in `Spec.Labels` we should set the custom labels that are going to be passed via `config.sh`, which is not the case for `self-hosted`

If it's easier I can remove this part from this PR and make a separate PR for it, just to keep them separate 😃
Any thoughts?
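The alternative behavior argued for here could be sketched as treating `self-hosted` as implicit on every runner, regardless of `Spec.Labels` (hypothetical helper shown for illustration only, not ARC's actual matching code):

```go
package main

import "fmt"

// implicitLabels are labels actions/runner adds by itself at registration time,
// so they are never listed in Spec.Labels. ("self-hosted" is the one discussed
// here; the agent also adds OS/architecture labels, omitted for brevity.)
var implicitLabels = map[string]bool{"self-hosted": true}

// matches reports whether a runner with the given Spec.Labels can serve a job,
// counting implicit labels as always present. (Sketch, not actual ARC code.)
func matches(jobLabels, specLabels []string) bool {
	have := map[string]bool{}
	for _, l := range specLabels {
		have[l] = true
	}
	for _, l := range jobLabels {
		if !have[l] && !implicitLabels[l] {
			return false
		}
	}
	return true
}

func main() {
	// runs-on: ["self-hosted", "group-a"] matches Spec.Labels = ["group-a"]
	fmt.Println(matches([]string{"self-hosted", "group-a"}, []string{"group-a"})) // true
}
```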
@fgalind1 Thanks for helping me shape this!
I believe this discussion relates to #953, which stems from the same thought as yours.

> self-hosted is a default label added by actions/runner which is not specified during registration

So the biggest thing I missed was this one! I was blindly assuming that `self-hosted` is just a standard label that everyone uses by convention. If it is the hard-coded default used by the actions/runner agent as well, actions-runner-controller should treat it as such too.
I'm going to merge #953 first as it was submitted earlier, and after that I'm fine with this change being included in this PR!
```go
if len(c.Token) > 0 || (c.AppID > 0 && c.AppInstallationID > 0 && c.AppPrivateKey != "") {
	ghClient, err = c.NewClient()
	if err != nil {
		fmt.Fprintln(os.Stderr, "Error: Client creation failed.", err)
```
Probably we can make this even nicer by adding more info on what was provided and what wasn't, like `Error: Client creation failed. AppID is provided but Token, AppInstallationID and AppPrivateKey were missing. You need to either set Token, or all of AppID, AppInstallationID and AppPrivateKey.` But let's do that in another PR :)
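A sketch of the kind of diagnostic suggested above (the struct and field names are illustrative stand-ins, not ARC's actual validation code):

```go
package main

import (
	"fmt"
	"strings"
)

// config mirrors the auth fields discussed above (illustrative only).
type config struct {
	Token             string
	AppID             int64
	AppInstallationID int64
	AppPrivateKey     string
}

// validate explains exactly which auth fields are missing instead of emitting
// a bare "client creation failed".
func (c config) validate() error {
	if c.Token != "" || (c.AppID > 0 && c.AppInstallationID > 0 && c.AppPrivateKey != "") {
		return nil
	}
	var missing []string
	if c.AppID == 0 {
		missing = append(missing, "AppID")
	}
	if c.AppInstallationID == 0 {
		missing = append(missing, "AppInstallationID")
	}
	if c.AppPrivateKey == "" {
		missing = append(missing, "AppPrivateKey")
	}
	return fmt.Errorf("client creation failed: missing %s; set Token, or all of AppID, AppInstallationID and AppPrivateKey",
		strings.Join(missing, ", "))
}

func main() {
	fmt.Println(config{AppID: 123}.validate())
}
```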
Nevermind! I just misunderstood when this error occurs.
LGTM. Thank you for your contribution! I'm merging this.
We still lack a comprehensive E2E test suite, so an awesome but relatively large change like this one may break things in unexpected ways. I'd appreciate it a lot if you could keep testing this for a while, perhaps by building and deploying actions-runner-controller from the latest main branch after merging!
Thanks in advance for your contribution and support.
@fgalind1 could I ask what your motivation was for introducing this change?
@fgalind1 I've discussed this with @toast-gear and we now consider this to be not the right way to do it. AFAIK, a runner group works as an additional visibility toggle in addition to labels, so GitHub filters target runners by both labels and groups. What we've done in this PR doesn't align with that GitHub Actions behavior: it scales "all" the runner groups associated with a repo/org/enterprise regardless of labels, but GitHub still uses labels to select a runner to schedule the job on. This results in ARC scaling up unnecessary runners that are associated with a visible runner group but don't match the workflow job labels. Better support for runner group visibilities would be to make ARC skip a RunnerDeployment/RunnerSet that is found via …
Linking #1055
```go
log.Info("finding organizational runner", "organization", owner)
// Search for organization runner HRAs in default runner group
if target, err := scaleTarget(owner); err != nil {
	log.Error(err, "finding organizational runner", "organization", owner)
	return nil, err
} else if target != nil {
```
I believe what we had to do instead was to skip this `target` (= a RunnerDeployment that is associated with an organization) if it was associated with a runner group and that runner group was invisible to this repository.
```go
	log.Error(err, "finding enterprise runner", "enterprise", enterprise)
	return nil, err
} else if target != nil {
	log.Info("scale up target is default enterprise runners", "enterprise", enterprise)
```
Similar to https://github.com/actions-runner-controller/actions-runner-controller/pull/1012/files#r785624849, I believe what we should do instead is to skip this `target` (= a RunnerDeployment that is associated with an enterprise) if it was associated with a runner group defined at the enterprise level and the group was configured to be invisible to this organization.
```go
if enterprise := rd.Spec.Template.Spec.Enterprise; enterprise != "" {
	keys = append(keys, enterpriseKey(enterprise))
	if group := rd.Spec.Template.Spec.Group; group != "" {
		keys = append(keys, enterpriseRunnerGroupKey(enterprise, rd.Spec.Template.Spec.Group)) // Enterprise runner groups
```
It turns out we don't need to find HRAs by runner group, as GitHub finds the runner to schedule a job on by labels anyway.
What we need instead is to list all the HRAs whose RunnerDeployment matches the workflow_job labels and filter out those that are associated with runner groups (both organization and enterprise groups) that are unavailable to the org / the repo.
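The filtering described above could be sketched like this (the `hra` type and helper names are hypothetical stand-ins, not ARC's actual types): match by labels first, then drop candidates whose runner group isn't visible to the repository.

```go
package main

import "fmt"

// hra is a pared-down stand-in for a HorizontalRunnerAutoscaler target
// (field names are illustrative, not ARC's actual types).
type hra struct {
	Name        string
	Labels      []string
	RunnerGroup string // "" means the default group
}

// hasAll reports whether runnerLabels cover every workflow_job label.
func hasAll(jobLabels, runnerLabels []string) bool {
	have := map[string]bool{}
	for _, l := range runnerLabels {
		have[l] = true
	}
	for _, l := range jobLabels {
		if !have[l] {
			return false
		}
	}
	return true
}

// selectTargets keeps HRAs whose labels cover the workflow_job labels and whose
// runner group (if any) is visible to the repository, per the comment above.
func selectTargets(all []hra, jobLabels []string, visibleGroups map[string]bool) []hra {
	var out []hra
	for _, h := range all {
		if !hasAll(jobLabels, h.Labels) {
			continue
		}
		if h.RunnerGroup != "" && !visibleGroups[h.RunnerGroup] {
			continue
		}
		out = append(out, h)
	}
	return out
}

func main() {
	all := []hra{
		{Name: "a", Labels: []string{"linux"}, RunnerGroup: "team-a"},
		{Name: "b", Labels: []string{"linux"}, RunnerGroup: "team-b"},
	}
	// Only team-a's group is visible to the repo that sent the webhook.
	got := selectTargets(all, []string{"linux"}, map[string]bool{"team-a": true})
	fmt.Println(len(got), got[0].Name) // 1 a
}
```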
@mumoshu but you can use groups without labels; you could set runner groups with custom visibility and no custom labels, right? (That's a common pattern in our company.) If we don't take the group into consideration, we could scale up the wrong runner group.
That's not really the design philosophy behind runner groups though. If GitHub had intended them as a means for defining exclusive access they would have:

- included the repository runner group membership information in the workflow_job event whilst queued
- enabled you to remove repositories from being members of the `Default` runner group

Runner groups are a means for extending what runners are available to a repository via labels. They aren't designed as a way of defining placement of runners beyond opening up access to different classes of runners via labels unique to those runner groups, which in turn implicitly defines placement.
> included the repository runner group membership information in the workflow_job event whilst queued

This is one of the points we're discussing with GitHub (still pending): potentially including that information in the webhook payload, to have full support since they do support custom visibility. That could simplify some of the logic here and avoid querying the GitHub API to find the matching runner group.

> enabled you to remove repositories from being members of the `Default` runner group

You can not add runners to the default runner group (this is what we're doing); that way we can preserve the intent of having only repository-specific runner groups.

> Runner groups are a means for extending what runners are available to a repository via labels. They aren't designed as a way of defining placement of runners beyond opening up access to different classes of runners via labels, which in turn implicitly defines placement.

Uhm, that's why GitHub added the custom visibility, isn't it? So that you could define some sort of protected runners for a specific set of repositories. Labels won't do anything w.r.t. protection or scoping some runners to some repositories only.
I'm happy to discuss more details on this use case if needed 😃
> @mumoshu but you can use groups without labels; you could set runner groups with custom visibility and no custom labels, right? (That's a common pattern in our company.) If we don't take the group into consideration, we could scale up the wrong runner group.

@fgalind1 Correct. But this seems like considering it in the wrong way.
What's implemented in this PR only works when you use many runner groups with the same set of labels.
If you configure a different set of labels per runner group, what's done in this PR scales the wrong RunnerDeployments.
Instead you'd need to do what I've explained in #1012 (comment)
> I believe what we had to do instead was to skip this target (= a RunnerDeployment that is associated with an organization) if it was associated with a runner group and that runner group was invisible to this repository.

@mumoshu just to clarify your previous comment: right now we find the potential HRA with the following precedence:

1. matching runner in repo from webhook
2. matching default runner in an org from webhook
3. matching default runner in enterprise from webhook
4. matching runner group in an org that matches the repo from the webhook and the custom visibility of the runner group
5. matching runner group in an enterprise that matches the repo from the webhook and the custom visibility of the runner group

With your comment, are you suggesting to remove/skip 2 and 3?

> it was associated to a runner group

This is the part that wasn't clear to me: which object were you referring to as being associated with the runner group?
I'd say it should be like the below instead:

1. matching repository-wide runner from the repository, the default org runner group, and the non-default org runner groups that are visible to the repository, from webhook
2. matching organizational runner from the organization, the default enterprise runner group, and the non-default enterprise runner groups that are visible to the organization, from webhook
3. matching default runner in enterprise from webhook

You may already know, but there are two variants of runner groups: organizational runner groups and enterprise runner groups. The visibility of the former can be controlled per repository, not organization. Similarly, the visibility of the latter can be controlled per organization under the enterprise. That's why (I suppose) the logic for runner groups can be merged into steps 1, 2, and 3, rather than being independent steps after 3.
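The two visibility variants described above can be sketched as a single check (illustrative types only; this is not ARC's or GitHub's actual data model): an org group's selection is keyed by repository, an enterprise group's by organization.

```go
package main

import "fmt"

// runnerGroup sketches the two variants described above: organization runner
// groups whose visibility is controlled per repository, and enterprise runner
// groups whose visibility is controlled per organization. (Illustrative types.)
type runnerGroup struct {
	Enterprise   bool
	AllEntities  bool            // "all repositories" / "all organizations"
	SelectedOnly map[string]bool // selected repos (org group) or orgs (enterprise group)
}

// visibleTo reports whether the group is usable for a job from the given
// repository in the given organization.
func (g runnerGroup) visibleTo(org, repo string) bool {
	if g.AllEntities {
		return true
	}
	if g.Enterprise {
		return g.SelectedOnly[org]
	}
	return g.SelectedOnly[repo]
}

func main() {
	orgGroup := runnerGroup{SelectedOnly: map[string]bool{"bar": true}}
	entGroup := runnerGroup{Enterprise: true, SelectedOnly: map[string]bool{"my-org": true}}
	fmt.Println(orgGroup.visibleTo("my-org", "bar")) // true: repo is selected
	fmt.Println(orgGroup.visibleTo("my-org", "foo")) // false: repo not selected
	fmt.Println(entGroup.visibleTo("my-org", "foo")) // true: org is selected
}
```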
In our company we use enterprise and organization runner groups with selected visibility. You can think of a runner group as mapped to a certain team that owns a few repos in my company, and only that team/those repos should use that runner group. Prior to this change, if a webhook came from repo X, it could potentially scale up a runner group with selected visibility that doesn't include that repo. GitHub by design doesn't necessarily need labels to be defined if you configure your runner groups/mapped repos correctly. If you have runner group X mapped to repos A and B, and runner group Y mapped to repo C, whenever you receive a webhook from repo C you want to scale up only RG Y (additional filtering could be done with labels, but that shouldn't be a requirement).
@fgalind1 Hey. I believe I understand your use case, but I also see two problems:
2 can be addressed later with a big caution/enough documentation, but 1 needs to be addressed before the next release of ARC.
@mumoshu I'll take a look to see if this could be a simple fix/update. So just to double-check, the example you're describing is that I could have 2 runner groups with different labels, and with the current implementation it may be possible that we scale up (target) the runner group that doesn't match the webhook labels?
@fgalind1 Yes. That's exactly what I had in my mind!
@mumoshu I did some real-life testing with the current controller as-is and I'm not able to reproduce the incorrect behavior yet. Could you share the steps you have in mind to reproduce the issue? E.g. I have: …

If I trigger a … If I trigger a …

Right now, if we have N groups with custom visibility and custom labels, first we find all the potential candidates …
Let's say you have: …

What I had in my mind was that with the above setup, triggering a workflow run on …
@mumoshu I tested the above example using the same input (just using an org runner group instead of an enterprise runner group) and got the following:

Initial setup: foo runners (1 org runner + 1 runner group runner); the HRA keys for the webhooks controller: …

When I trigger a job in foo: foo should scale up either of the deployments; it scaled up "bar".

When I trigger a job in bar: bar should scale up RunnerDeployment "bar" with org=XXX, labels: [fgalind1]; and it scaled up "bar".

AFAIK the logic is working fine as expected, as we're honoring labels just fine.
Isn't that just a coincidence? ARC isn't really looking into the runner group when it's matching based on labels, so it MAY still scale up …
See this line: … ARC returns the first target when there are two or more scale targets found. For deployment "bar" ARC would find both …

What we'd need to do there is return the scale target with matching labels AND from a visible runner group, instead of blindly returning the first scale target with matching labels (which may or may not be coming from a visible runner group).
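The fix described above amounts to replacing "return the first label match" with "return the first label match from a visible group". A minimal sketch (hypothetical `target` type, not ARC's actual code):

```go
package main

import "fmt"

// target is a minimal stand-in for a scale target that already matched by
// labels (illustrative fields, not ARC's actual types).
type target struct {
	Name         string
	GroupVisible bool // whether the target's runner group is visible to the repo
}

// pickTarget returns the first target whose runner group is visible, instead
// of blindly returning the first match, as described in the comment above.
func pickTarget(candidates []target) *target {
	for i := range candidates {
		if candidates[i].GroupVisible {
			return &candidates[i]
		}
	}
	return nil
}

func main() {
	// "bar" matches by labels in two groups, but only one is visible to the repo.
	got := pickTarget([]target{
		{Name: "bar-in-hidden-group", GroupVisible: false},
		{Name: "bar-in-visible-group", GroupVisible: true},
	})
	fmt.Println(got.Name) // bar-in-visible-group
}
```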
Okay, I think I see the case under discussion: this issue will happen if the "default runner group" also uses "selected repositories only". In the example I was testing, bar was using "all repositories", but if I switch it to "selected repositories = bar" then yes, the issue will occur. Working on the fix now.
I appreciate your efforts!
Yeah, if it switched to the group, ARC should guard against both cases, and I guess, implementation-wise, somehow enhancing the …
@mumoshu let me know what you think 😃 #1062. Tested with the above flow using bar (selected repos and all repos) and everything looks good.
@mumoshu just wondering if you had a chance to look at the improvements from #1062
@fgalind1 I'll try my best to fully review your change this weekend. Thank you for your patience!
…1062)
This will work on GHES but not on GitHub Enterprise Cloud, due to the excessive GitHub API calls required. More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud.

Fixes additional cases from #1012

If GitHub auth is provided to the webhooks controller then runner groups with custom visibility are supported. Otherwise, all runner groups are assumed to be visible to all repositories.

`getScaleUpTargetWithFunction()` will check if there is an HRA available with the following flow:

1. Search for **repository** HRAs; if found, it ends here
2. Get available HRAs in k8s
3. Compute visible runner groups
   a. If GitHub auth is provided, get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls
   b. If GitHub auth is not provided, assume all runner groups are visible to all repositories
4. Search for **default organization** runners (a.k.a. runners from the organization's visible default runner group) with matching labels
5. Search for **default enterprise** runners (a.k.a. runners from the enterprise's visible default runner group) with matching labels
6. Search for **custom organization runner groups** with matching labels
7. Search for **custom enterprise runner groups** with matching labels

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
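The search flow above is a first-hit-wins cascade, which can be sketched generically (hypothetical `searchFn` type; the real search steps take the webhook payload and query k8s/GitHub):

```go
package main

import "fmt"

// searchFn stands for one lookup step in the flow above; it returns a target
// name or "" if that step found nothing. (Illustrative only.)
type searchFn func() string

// scaleUpTarget tries each step in order and returns the first hit, mirroring
// the precedence described in the merge message above.
func scaleUpTarget(steps []searchFn) string {
	for _, step := range steps {
		if t := step(); t != "" {
			return t
		}
	}
	return ""
}

func main() {
	found := scaleUpTarget([]searchFn{
		func() string { return "" },          // repository HRAs
		func() string { return "" },          // default org runners, visible group
		func() string { return "" },          // default enterprise runners, visible group
		func() string { return "org-group" }, // custom org runner groups
		func() string { return "" },          // custom enterprise runner groups
	})
	fmt.Println(found) // org-group
}
```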
The current implementation doesn't yet support runner groups with custom visibility (e.g. selected repositories only). If there are multiple runner groups with selected visibility, not all runner groups may be a potential target to be scaled up. Thus this PR introduces support for runner groups with selected visibility. This requires querying the GitHub API to find which runner groups are linked to a specific repository (whether using visibility "all" or "selected").

This also improves resolving the `scaleTargetKey` that is used to match an HRA based on the inputs of the `RunnerSet`/`RunnerDeployment` spec, to better support runner groups.

This requires configuring GitHub auth in the webhook server. To keep backwards compatibility, if GitHub auth is not provided to the webhook server, it will assume all runner groups have no selected visibility and will target any available runner group as before.