
Runner created but no pods linked on GHES on Canary from 13/03/2022 #1223

Closed
Eriwyr opened this issue Mar 15, 2022 · 13 comments · Fixed by #1237

Eriwyr commented Mar 15, 2022

Describe the bug
We are trying to use the latest canary image of the actions-runner-controller in our GHES cluster to take advantage of the latest fixes. (We are really eager to see the 0.22 release :) )

We notice the following problem:

When launching jobs, we can see objects of kind Runner appearing in the cluster, but no pods are launched.

There are errors logged on the runner-controller.

2022-03-15T13:05:40Z	INFO	runner-resource	validate resource to be created	{"name": "linter-zvk5w-sxvhw"}
2022-03-15T13:05:40Z	INFO	runner-resource	validate resource to be updated	{"name": "linter-zvk5w-sxvhw"}
2022-03-15T13:05:40Z	DEBUG	actions-runner-controller.runnerreplicaset	Skipped reconcilation because owner is not synced yet	{"runnerreplicaset": "devops/linter-zvk5w", "owner": "devops/linter-zvk5w-sxvhw", "pods": null}
2022-03-15T13:05:40Z	INFO	actions-runner-controller.runner	Updated registration token	{"runner": "linter-zvk5w-sxvhw", "repository": ""}
2022-03-15T13:05:40Z	DEBUG	events	Normal	{"object": {"kind":"Runner","namespace":"devops","name":"linter-zvk5w-sxvhw","uid":"d04e1355-bd5a-48b1-9062-9a475c5bdace","apiVersion":"actions.summerwind.dev/v1alpha1","resourceVersion":"108664723"}, "reason": "RegistrationTokenUpdated", "message": "Successfully update registration token"}
2022-03-15T13:05:40Z	ERROR	actions-runner-controller.runner	Failed to create pod resource	{"runner": "devops/linter-zvk5w-sxvhw", "error": "admission webhook \"mutate-runner-pod.webhook.actions.summerwind.dev\" denied the request: failed to create registration token: POST https://github.m6web.fr/api/v3/orgs/devops/actions/runners/registration-token: 403 Resource not accessible by integration []"}
2022-03-15T13:05:40Z	ERROR	controller.runner-controller	Reconciler error	{"reconciler group": "actions.summerwind.dev", "reconciler kind": "Runner", "name": "linter-zvk5w-sxvhw", "namespace": "devops", "error": "admission webhook \"mutate-runner-pod.webhook.actions.summerwind.dev\" denied the request: failed to create registration token: POST https://github.m6web.fr/api/v3/orgs/devops/actions/runners/registration-token: 403 Resource not accessible by integration []"}
2022-03-15T13:05:41Z	ERROR	actions-runner-controller.runner	Failed to create pod resource	{"runner": "devops/linter-zvk5w-sxvhw", "error": "admission webhook \"mutate-runner-pod.webhook.actions.summerwind.dev\" denied the request: failed to create registration token: POST https://github.m6web.fr/api/v3/orgs/devops/actions/runners/registration-token: 403 Resource not accessible by integration []"}

github.com/actions-runner-controller/actions-runner-controller/controllers.(*RunnerReconciler).Reconcile
	/workspace/controllers/runner_controller.go:127
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:114
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:311
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.11.1/pkg/internal/controller/controller.go:227

To Reproduce
Use image action-runner-controller:canary on GHES

Environment:

  • Controller Version : Latest canary
  • Deployment Method : Helm
  • Helm Chart Version : 0.16.1

mumoshu (Collaborator) commented Mar 16, 2022

@Eriwyr Wow! Thanks a lot for testing canary ☺️ That really helps accelerate development, makes us more confident, and encourages us to cut the next release earlier ☺️

Re your issue, I see failed to create registration token: POST https://github.m6web.fr/api/v3/orgs/devops/actions/runners/registration-token: 403 Resource not accessible by integration [] in your error log.

This is usually emitted when your PAT or GitHub App does not have admin permission on the GitHub organization devops. If you're using a GitHub App for auth, you'd need to recheck the App settings, see whether it requests the admin permission at the org scope, and confirm the updated permissions on your installation settings page.
If you're using a PAT, you'd need to make the GitHub user an admin (or maybe owner) of the organization and give the PAT access to admin resources in the organization.

We've made no GitHub API related changes other than adding a transparent cache layer and I did verify that the layer does not affect authentication. So, I hope this is just a simple permission misconfiguration on either your PAT or App.
If it isn't, I'd appreciate more information on your environment.

Thanks in advance for your cooperation!

Eriwyr (Author) commented Mar 16, 2022

Thank you very much for your answer @mumoshu

We use a GitHub App for auth and we have checked its permissions. Everything works correctly if we switch back to a canary from 10 days ago, for example, or to the last stable release.

The permissions of the GitHub App are those specified in the documentation:
[Screenshot: GitHub App permission settings, 2022-03-16]

You can see from the logs that, within the same second, the log reports success and then a 403.
[Screenshot: controller logs showing the successful token update followed by the 403 error]

To confirm that it was not a permissions problem, we also ran a manual curl with the GitHub App's credentials, and that works correctly.

Thank you in advance for your help!

mumoshu (Collaborator) commented Mar 16, 2022

@Eriwyr Thank you! All I can say is there may be an edge case where our new cache layer breaks some authenticated HTTP requests 🤔

Would you mind setting the log level to "-4" and collecting all the HTTP requests and responses for the failing API endpoint, so that we can hopefully isolate the condition under which the 403 errors occur?

Here's how you can set the log level to "-4" when you use the helm chart:

https://github.com/actions-runner-controller/actions-runner-controller/blob/a40793bb60dfa1fea45a80e85c12e4d4e377e67a/acceptance/values.yaml#L2
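
A minimal values override might look like this (a sketch, assuming the chart exposes the logLevel key shown in the linked values file):

```yaml
# values.yaml override for the actions-runner-controller chart.
# The quotes matter: an unquoted -4 would be parsed as a YAML integer
# rather than the string the chart expects for the --log-level flag.
logLevel: "-4"
```

Then apply it with something like helm upgrade --install actions-runner-controller actions-runner-controller/actions-runner-controller -f values.yaml (repo and release names are illustrative).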

And here's the implementation of API req/res logging. You might find it helpful for seeing which messages are emitted at which log level:

https://github.com/actions-runner-controller/actions-runner-controller/blob/a40793bb60dfa1fea45a80e85c12e4d4e377e67a/logging/transport.go#L52-L63

mumoshu (Collaborator) commented Mar 16, 2022

Note that I was able to successfully use the canary version of ARC with my GitHub App and organizational runners. So at least it isn't totally broken.

mumoshu (Collaborator) commented Mar 16, 2022

Beware that it might print sensitive information, like tokens contained in the request headers for GitHub API access!

Meroje (Contributor) commented Mar 16, 2022

Hi, colleague here. Using log level -4 does get us stack traces, but no additional messages. It looks like the PodRunnerTokenInjector may have a different code path when logging?

Meroje (Contributor) commented Mar 16, 2022

Some more details on our environment

  • there are multiple organizations on our GHES instance
  • each organization has an ARC assigned along with a K8S namespace

Seeing as there were multiple admission webhook \"mutate-runner-pod.webhook.actions.summerwind.dev\" denied the request logs, I suspected something may have changed with their config or handling.
Indeed changing every webhook config to have a namespaceSelector allowed for runners to be created.
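
The change we applied looks roughly like this (a sketch; the webhook name is from our logs, and the name=<namespace> label key is an assumption, e.g. set via kubectl label namespace devops name=devops):

```yaml
# Fragment of the MutatingWebhookConfiguration for one ARC instance.
# Assumption: each namespace carries a name=<namespace> label.
webhooks:
- name: mutate-runner-pod.webhook.actions.summerwind.dev
  # Restrict this ARC instance's webhook to its own namespace, so one
  # org's webhook never intercepts another org's runner pods.
  namespaceSelector:
    matchLabels:
      name: devops   # this instance only mutates pods in "devops"
```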

mumoshu (Collaborator) commented Mar 16, 2022

@Meroje Which stack traces did you get? 🤔 How did you enable the log level of -4? If you're using the helm chart, it needs to be "-4", not -4. I can't say for sure without a concrete, exact example!

mumoshu (Collaborator) commented Mar 16, 2022

each organization has an ARC assigned along with a K8S namespace

How did it ever work?!

Our helm chart doesn't have a namespace selector in the webhook config, which would be required to deploy one ARC per namespace.

Without that, it may be nondeterministic which mutating webhook (including PodTokenInjector) comes into play for a runner pod. Suppose you have ARC instance 1 and instance 2; instance 2's webhook might come into play for a runner pod that is managed by instance 1.

admission webhook \"mutate-runner-pod.webhook.actions.summerwind.dev\" denied the request

To me, this seems to indicate that you are actually using the same mutatingwebhookconfig name for all the ARC instances across namespaces, right? 🤔

Meroje (Contributor) commented Mar 16, 2022

@Meroje Which stack traces did you get? 🤔 How did you enable the log level of -4? If you're using the helm chart, it needs to be "-4", not -4. I can't say for sure without a concrete, exact example!

I copied your exact snippet, which gets --log-level=-4 on the actual deployment

To me, this seems to indicate that you are actually using the same mutatingwebhookconfig name for all the ARC instances across namespaces, right? 🤔

The event would rather be sent to all matching webhooks; I guess ARC v0.21 did not care about some failures as long as one webhook was successful.

Shall I send you a PR to add the namespace selector? It would look like this:

--- a/charts/actions-runner-controller/templates/webhook_configs.yaml
+++ b/charts/actions-runner-controller/templates/webhook_configs.yaml
@@ -12,6 +12,11 @@ metadata:
 webhooks:
 - admissionReviewVersions:
   - v1beta1
+  {{- if .Values.scope.singleNamespace }}
+  namespaceSelector:
+    matchLabels:
+      name: {{ .Values.scope.watchNamespace }}
+  {{- end }}
   clientConfig:
     {{- if .Values.admissionWebHooks.caBundle }}
     caBundle: {{ quote .Values.admissionWebHooks.caBundle }}
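
Per-instance values that would drive this conditional might look like the following sketch (key names taken from the .Values.scope.* references in the diff above):

```yaml
# Per-instance Helm values matching the keys referenced in the diff.
scope:
  singleNamespace: true
  watchNamespace: devops   # the namespace this ARC instance manages
```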

Meroje (Contributor) commented Mar 16, 2022

Also, I guess the lack of a namespace selector explains why we didn't see logs from the mutating webhook

mumoshu (Collaborator) commented Mar 17, 2022

I copied your exact snippet, which gets --log-level=-4 on the actual deployment

Thank you for confirming!
The configuration looks good.
Would you mind rechecking once you fixed the namespaceSelector in the webhook configs?

I guess ARC v0.21 did not care about some failures as long as one webhook was successful.

I had a lightbulb moment when I was about to sleep: it might be that your duplicated mutating webhooks across namespaces have never worked.

Which custom resource are you using to deploy runners, RunnerSet or RunnerDeployment?
As I see "runnerreplicaset": "devops/linter-zvk5w" in the log you shared at the very beginning of this issue, I suppose you're using RunnerDeployment only. If so, that explains your issue.

Pre ARC 0.22.0, only RunnerSet-managed runner pods were subject to the mutating webhook. So regardless of whether the mutating webhook was actually working, RunnerDeployment just worked.

Since 0.22.0, it doesn't. Every runner pod, regardless of whether it's managed by a RunnerSet or a RunnerDeployment, is subject to the mutating webhook, so if your webhook config/server is broken or misconfigured somehow, it may result in nasty errors.

In your case, your mutating webhook wasn't correctly namespaced via namespaceSelector, as you pointed out. That must be the problem.

That said:

Shall I send you a PR to add the namespace selector? It would look like this:

Please! Your change definitely looks good and aligns well with my analysis above.

Also, I guess the lack of a namespace selector explains why we didn't see logs from the mutating webhook

Yes, that makes sense to me too.

Meroje (Contributor) commented Mar 17, 2022

I copied your exact snippet, which gets --log-level=-4 on the actual deployment

Thank you for confirming! The configuration looks good. Would you mind rechecking once you fixed the namespaceSelector in the webhook configs?

Still not seeing roundtrips from the mutatingwebhook to api/v3/orgs/{org}/actions/runners/registration-token

Which custom resource are you using to deploy runners, RunnerSet or RunnerDeployment? As I see "runnerreplicaset": "devops/linter-zvk5w" in the log you shared at the very beginning of this issue, I suppose you're using RunnerDeployment only. If so, that explains your issue.

Yes, we only ever used RunnerDeployments

Pre ARC 0.22.0, only RunnerSet-managed runner pods were subject to the mutating webhook. So regardless of whether the mutating webhook was actually working, RunnerDeployment just worked.

Since 0.22.0, it doesn't. Every runner pod, regardless of whether it's managed by a RunnerSet or a RunnerDeployment, is subject to the mutating webhook, so if your webhook config/server is broken or misconfigured somehow, it may result in nasty errors.

In your case, your mutating webhook wasn't correctly namespaced via namespaceSelector, as you pointed out. That must be the problem.

Thanks for confirming the change in behaviour.

That said:

Shall I send you a PR to add the namespace selector? It would look like this:

Please! Your change definitely looks good and aligns well with my analysis above.

Will do :)
