Initialize Containers - HttpError: HTTP request failed - EKS - containerMode kubernetes #128

Open
carl-reverb opened this issue Jan 9, 2024 · 14 comments
Labels: question (Further information is requested)

@carl-reverb

carl-reverb commented Jan 9, 2024

When I attempt to run a workflow against a self-hosted runner deployed using the gha-runner-scale-set-controller and gha-runner-scale-set charts, my job fails on the 'Initialize Containers' step.

Runner Scale Set values.yaml:

minRunners: 1
maxRunners: 16

containerMode:
  type: kubernetes
  kubernetesModeWorkVolumeClaim:
    accessModes: ["ReadWriteOnce"]
    storageClassName: "ebs-gp3-ephemeral"
    resources:
      requests:
        storage: 10Gi

template:
  spec:
    securityContext:
      fsGroup: 123
    containers:
    - name: runner
      image: ghcr.io/actions/actions-runner:latest
      command: ["/home/runner/run.sh"]

In the GitHub UI, after the job is picked up, the following error messages appear in the log:

Error: HttpError: HTTP request failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

Full error context:

##[debug]Evaluating condition for step: 'Initialize containers'
##[debug]Evaluating: success()
##[debug]Evaluating success:
##[debug]=> true
##[debug]Result: true
##[debug]Starting: Initialize containers
##[debug]Register post job cleanup for stopping/deleting containers.
Run '/home/runner/k8s/index.js'
##[debug]/home/runner/externals/node16/bin/node /home/runner/k8s/index.js
Error: HttpError: HTTP request failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[debug]System.Exception: Executing the custom container implementation failed. Please contact your self hosted runner administrator.
##[debug] ---> System.Exception: The hook script at '/home/runner/k8s/index.js' running command 'PrepareJob' did not execute successfully
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
##[debug]   --- End of inner exception stack trace ---
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.ExecuteHookScript[T](IExecutionContext context, HookInput input, ActionRunStage stage, String prependPath)
##[debug]   at GitHub.Runner.Worker.Container.ContainerHooks.ContainerHookManager.PrepareJobAsync(IExecutionContext context, List`1 containers)
##[debug]   at GitHub.Runner.Worker.ContainerOperationProvider.StartContainersAsync(IExecutionContext executionContext, Object data)
##[debug]   at GitHub.Runner.Worker.JobExtensionRunner.RunAsync()
##[debug]   at GitHub.Runner.Worker.StepsRunner.RunStepAsync(IStep step, CancellationToken jobCancellationToken)
##[debug]Finishing: Initialize containers

My workflow:

name: git hooks
on: push

jobs:
  pre-commit:
    name: pre-commit
    runs-on: reverbdotcom-general-purpose
    container: summerwind/actions-runner:latest
    steps:
      - run: echo "hello actions"

I have tried a lot of different things to try to understand what is not working here, but the chain of dependencies and effects is not easy to comprehend. There are a lot of red herrings and other noise in the logs, which led me on several chases around the web, and I spent a while trying security contexts, various container images, etc. At this point I think I have run out of time to figure this out and will have to fall back to the previous actions-runner-controller, and advise my team that the next generation of actions runners is a risk and we should evaluate alternative CI pipelines.

@carl-reverb
Author

Well, having found some more time to dig, I went into the source code and started tracing out the execution path, since the stack trace doesn't give many clues as to where this HTTP request failed. The first thing that probably makes a request is:
https://github.com/actions/runner-container-hooks/blob/main/packages/k8s/src/k8s/index.ts#L455

I shell into my pod and install node and then attempt this direct basic implementation:

const k8s = require('@kubernetes/client-node');

// loadFromDefault() falls back to the in-cluster configuration (the projected
// service-account token and CA certificate) when no kubeconfig file is present.
const kc = new k8s.KubeConfig();
kc.loadFromDefault();

const k8sApi = kc.makeApiClient(k8s.CoreV1Api);

let main = async () => {
    try {
        const podsRes = await k8sApi.listNamespacedPod('actions-runners');
        console.log(podsRes.body);
    } catch (err) {
        console.error(err);
    }
};

main();

It fails like so:

{
   // ...
 body: {
    kind: 'Status',
    apiVersion: 'v1',
    metadata: {},
    status: 'Failure',
    message: 'Unauthorized',
    reason: 'Unauthorized',
    code: 401
  },
  statusCode: 401
}

So I can presume that the problem is not the fault of the hooks library, but something is wrong with either the service account or the cluster configuration in EKS. There's not a lot of easily-findable documentation on how to perform in-cluster authentication via service account because most users want to authenticate to their cluster from outside, using eksctl or similar.
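
For what it's worth, the same check can be made from inside the runner pod without Node, using the standard in-cluster token and CA paths (a minimal sketch, assuming curl is available in the image; the namespace is the one from this thread):

TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt
# list pods in the runner namespace directly against the API server
curl --cacert "$CACERT" -H "Authorization: Bearer $TOKEN" \
  "https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/api/v1/namespaces/actions-runners/pods"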

The Role is as configured by the helm chart:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: reverbdotcom-general-purpose-gha-rs-kube-mode
  namespace: actions-runners
rules:
- apiGroups:
  - ""
  resources:
  - pods
  verbs:
  - get
  - list
  - create
  - delete
- apiGroups:
  - ""
  resources:
  - pods/exec
  verbs:
  - get
  - create
- apiGroups:
  - ""
  resources:
  - pods/log
  verbs:
  - get
  - list
  - watch
- apiGroups:
  - batch
  resources:
  - jobs
  verbs:
  - get
  - list
  - create
  - delete
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
  - list
  - create
  - delete

Apparently if there is some RBAC issue I should receive a 403. A 401 indicates that the token was rejected completely. I also checked to see if the token in the client configuration matched the one mounted in the pod, and it does.

I'm out of ideas for now... until I can learn more about debugging 401 with an in-cluster service account token.
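
One way to debug the 401 from the cluster side is to feed the pod's token to a TokenReview and see whether the API server accepts it at all; a minimal sketch, assuming cluster-admin credentials and a token copied out of the runner pod:

# TokenReview is create-only; status.authenticated in the response shows whether
# the API server accepts the token, independently of any RBAC rules.
kubectl create -o yaml -f - <<'EOF'
apiVersion: authentication.k8s.io/v1
kind: TokenReview
metadata:
  name: runner-token-check
spec:
  token: "<token copied from /var/run/secrets/kubernetes.io/serviceaccount/token in the runner pod>"
EOF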

@carl-reverb
Author

carl-reverb commented Jan 11, 2024

I created a test pod on the cluster in the actions-runners namespace using the latest node image and attached to it, then ran the short js script to test. The result is a 403:

{
 body: {
    kind: 'Status',
    apiVersion: 'v1',
    metadata: {},
    status: 'Failure',
    message: 'pods is forbidden: User "system:serviceaccount:actions-runners:default" cannot list resource "pods" in API group "" in the namespace "actions-runners"',
    reason: 'Forbidden',
    details: { kind: 'pods' },
    code: 403
  },
  statusCode: 403
}

This is expected because I didn't specify a service account, so I got the default service account which has no role bound to it. Next I attempted the same, but specified the reverbdotcom-general-purpose-gha-rs-kube-mode service account.

~ $ kubectl run -it -n actions-runners carl-test --image=node --overrides='{ "spec": { "serviceAccount": "reverbdotcom-general-purpose-gha-rs-kube-mode" } }' -- bash

With this service account, I again get a 401. Since this happens on an unrelated pod, which works fine with the default service account, there must be something wrong with the service account itself.

The default service account has a "mountable secret" but the "reverbdotcom-general-purpose-gha-rs-kube-mode" does not.
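
(For reference, the output below is from kubectl describe serviceaccount; a hedged sketch of the command, using the names from this thread:)

kubectl -n actions-runners describe serviceaccount reverbdotcom-general-purpose-gha-rs-kube-mode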

Name:                reverbdotcom-general-purpose-gha-rs-kube-mode
Namespace:           actions-runners
Labels:              <redacted>
Annotations:         <redacted>
Image pull secrets:  <none>
Mountable secrets:   <none>
Tokens:              <none>
Events:              <none>

We're on Kubernetes 1.24, whose documentation is no longer published, so I can't be sure, but other documentation indicates that it shouldn't be necessary to manually create tokens, and that the token for the projected volume should be obtained (and refreshed) automatically via the TokenRequest API when a pod is scheduled... It definitely obtains a token, but the token is unauthorized.
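
One thing worth checking is whether the runner pod actually has the projected token volume with an expiry; a hedged sketch (the pod name is a placeholder):

# shows the serviceAccountToken volume source, including expirationSeconds
kubectl -n actions-runners get pod <runner-pod> -o yaml | grep -B2 -A6 serviceAccountToken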

Off to spend some time digging around in EKS docs and attempting to figure out if there's some configuration setting I need to flip.

@carl-reverb
Author

After more experimentation I accidentally deleted the service account, and then had to recreate it by forcing a new helm install-upgrade.

Following that, new pods which used the kube-mode service account were able to communicate with the apiserver, but old pods were not. I destroyed the old runner pod and waited for the controller to create a new one, whereupon it was able to make apiserver requests again.

It's unknown why replacing the service account made it start working; I'm monitoring to see if it breaks again after some interval of time. If so, the theory is that the projected token is not being refreshed.

@nikola-jokic
Member

Hey @carl-reverb,

Sorry for the late response. Is there any news regarding this issue? Does it work now?

@carl-reverb
Author

Yes, it's been working now, thank you.

@carl-reverb
Author

carl-reverb commented Feb 23, 2024

Ok, I reproduced this as I'm rolling out a new set of runners. My first runner, an arm64 kubernetes-mode runner, again presented this issue. The chart version is v0.8.2. The workaround was the same (a command sketch follows the list):

  1. Remove the finalizer blocking deletion of the serviceaccount
  2. Delete the serviceaccount
  3. Force-update the helm release
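
Roughly, as commands (a hedged sketch; the service account name follows the *-gha-rs-kube-mode pattern and the namespace is the one used above, so adjust for the actual scale set):

# remove the finalizer blocking deletion, then delete the service account
kubectl -n actions-runners patch serviceaccount <scale-set>-gha-rs-kube-mode \
  --type=merge -p '{"metadata":{"finalizers":null}}'
kubectl -n actions-runners delete serviceaccount <scale-set>-gha-rs-kube-mode
# force the release to recreate the chart-managed resources, including the service account
helm upgrade --install --force <scale-set> \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  -n actions-runners -f values.yaml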

I also happened to deploy an amd64 kubernetes-mode runner at the same time. My job contained a JavaScript action and ran in an Alpine container, so I had to switch it to the amd64 runner. Once again, HTTP denied. Repeated my workaround ... and it works again.

@carl-reverb reopened this Feb 23, 2024
@nikola-jokic
Member

Could you please write the exact steps you are taking to land in this spot? I can't seem to reproduce the issue. It looks like there is some kind of permission issue where the service account is not mounted to the runner container.

Could you please write an example values.yaml file, with anything you would like to hide redacted, and the exact commands you are using to deploy this scale set? I just can't reproduce this issue.

@nikola-jokic added the question (Further information is requested) label Mar 21, 2024
@carl-reverb
Author

This is AWS EKS version 1.25. I'm sorry I don't have the bandwidth to work on reproduction. I simply install the runner scale set helm chart oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set with a values file such as: https://gist.github.com/carl-reverb/05bb00856a7e5da70e1020fba65bc1ee
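
The equivalent direct install is roughly the following (a hedged sketch; the release name and namespace are assumptions, and in practice the deployment is driven by Flux as described below):

helm upgrade --install reverbdotcom-general-purpose \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --namespace actions-runners --create-namespace \
  -f values.yaml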

My hook extension is:

{
   "apiVersion": "v1",
   "data": {
      "extension.yaml": "\"spec\":\n  \"serviceAccount\": \"gha-job-container\""
   },
   "kind": "ConfigMap",
   "metadata": {
      "name": "gha-runner-scale-set-hook-extension",
      "namespace": "actions-runners"
   }
}

Now, all this is being installed with Flux, so the order of application is up to those controllers. Because of the nature of the resources, I presume that the kustomize controller runs first and creates my service account, config maps, and the HelmRelease resource, upon which the helm controller evaluates the HelmRelease and executes the helm install.

Sorry I can't remember much more detail than this.

@nikola-jokic
Member

Oh, please do not apologize, I'm the one being late on this issue. I think the problem is that we are creating the service account on demand and mounting it on the runner pod. The hook extension does not need a service account; the extension is only scoped to the workflow pod. The service account needs to be mounted on the runner. It is likely that something in the tooling is not mounting the service account properly.

@carl-reverb
Author

Oh, to explain the hook extension on the service account: you're right, that's a red herring, but I do need it because I'm using docker buildx with the native Kubernetes driver, and that driver requires a service account with some role bindings in order to create buildx pods.

Yes, you're absolutely right: the problematic service account is the one for the runner pod, which consumes the workflow and then fails to spawn job pods because it does not have a valid service account token.

@nikola-jokic
Member

Right, but the role you pasted is not the one the runner needs. This is the actual role that is created for the runner: https://github.com/actions/actions-runner-controller/blob/master/charts/gha-runner-scale-set/templates/kube_mode_role.yaml

Is it possible that an incorrect role binding was made, so the runner did not have enough permissions?
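
One way to check this from the cluster side (a hedged sketch; the namespace and service account name are the ones from earlier in this thread, and impersonation only exercises RBAC, so it would surface a missing binding as a 403 rather than explain a 401):

kubectl -n actions-runners get role,rolebinding | grep gha-rs-kube-mode
kubectl -n actions-runners auth can-i list pods \
  --as=system:serviceaccount:actions-runners:reverbdotcom-general-purpose-gha-rs-kube-mode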

@timmjd

timmjd commented Apr 28, 2024

Had the same issue. It seems to happen due to a namespace rename OR a helm chart upgrade - our IT did both simultaneously.

For me, the following caused the issue:

  • Install of helm chart 0.6.0 in a namespace foo
  • Deploy some ARC runners
  • Upgrade the runner controller to a newer version, in my case 0.9.0
  • Also rename the namespace of the runner controller from foo to something else like bar
  • The ARC runners are untouched. The existing ARC runner no longer works - it's still there but crashes without any further details except an HTTP error

Somehow you run into the 401 error; for me it was the runner hook doing a GET against the K8s API to check "am I allowed to access a secret". The fix was to delete the Helm release of the ARC runner and re-deploy it. I guess the service account did not have the required permissions, either due to the rename or the version upgrade.

I was debugging this for 2 days and only found the origin after implementing #158 / #159. With the trace being available, finding the root cause was an easy job. Maybe @nikola-jokic could have a look at this PR?

@carl-reverb
Author

Reproduced the issue again, this time on 0.9.1

  • Upgrade runners and controller charts to 0.9.1
  • Many issues; ended up following the upgrade guide to uninstall everything, delete the CRDs, then reinstall with helm.
  • Jobs going to runners fail with HTTP error.
  • Delete all the *-gha-rs-kube-mode service accounts, patching the annoying finalizer.
  • Force-reinstall helm chart to get the service accounts recreated.
  • Workaround success, jobs start working again.

@sofiegonzalez

Hi @carl-reverb, previously I was unable to run a job in a container in containerMode: kubernetes because of the HttpError and the runner pod being unable to initialize, but your solution in this comment, where you added the serviceAccount field to the runner spec, solved my issue. BUT I don't understand why.
When I look at the runner pod, it now contains two serviceAccount definitions. If I remove the serviceAccount one, it is unable to spin up the workflow pod. Do you know why this is, or why this fix allows the runner pod to start and create the -workflow pod to run the container?

  serviceAccount: gha-runner-scale-set-kube-mode
  serviceAccountName: gha-runner-scale-set-kube-mode
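
(Note: in the pod spec, serviceAccount is just the deprecated alias of serviceAccountName, so both lines refer to the same account. A quick way to check which account a runner pod is actually using - a hedged sketch, with the namespace and pod name as placeholders:)

kubectl -n <namespace> get pod <runner-pod> -o jsonpath='{.spec.serviceAccountName}{"\n"}'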
