Controller keeps reconciling with non-existent runners - leads to infinite runners #512

Closed
gstravinskaite opened this issue May 3, 2021 · 11 comments

gstravinskaite commented May 3, 2021

Hi,

We are using Helm charts to provision the controller. Until now we were on controller version 0.16.1 with chart version 2, and we have now tried to upgrade to 0.18.2 with chart 10.4. We immediately ran into the infinite pod issue described in #427. I then reverted the version, but the infinite pod scheduling persisted. I should note that I upgraded the controller, left it until the next morning, and only then created the runners, which is when I saw the infinite pod scheduling issue. I then tried to wipe the state clean: destroyed the Helm release, manually deleted the CRDs, and upgraded to 0.18.2 again, but the issue is still there.

We started suspecting that the controller issued so many requests that GitHub is still spewing them back at us even now. Even though the API rate limit was exceeded (over 5,000 requests), I only saw around 200 runners scheduled (people in other issues mentioned as many as 2,000). We deleted the runners we saw, but the controller keeps reconciling against non-existent runners. What is more interesting is that the runner pod IDs do not seem to have changed since the time a lot of them were spun up, leading to errors such as:

2021-05-03T10:01:47.300Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runner-controller", "request": "runner-infra/runner-infra-xm9vc-f8h5z"}

runner-infra-xm9vc is the ID of the old pod, which we deleted; we also cleared the existing runners from GitHub.

Today, we tried again to set up the runners; now no pods/runners are being scheduled, but the controller is very busy and keeps spewing logs such as:

2021-05-03T10:01:47.299Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runnerreplicaset-controller", "request": "runner-infra/runner-infra-xm9vc"}
2021-05-03T10:01:47.300Z	INFO	actions-runner-controller.runner	Removed runner from GitHub	{"runner": "runner-infra/runner-infra-xm9vc-f8h5z", "repository": "elsevierPTG/rap-terraformcontrol-aws-rt-fca-funderconsole", "organization": ""}
2021-05-03T10:01:47.300Z	DEBUG	controller-runtime.controller	Successfully Reconciled	{"controller": "runner-controller", "request": "runner-infra/runner-infra-xm9vc-f8h5z"}

2021-05-03T10:43:31.297Z	DEBUG	actions-runner-controller.runner	Runner was never registered on GitHub	{"runner": "runner-infra/runner-infra-xm9vc-ww9f2"}
2021-05-03T10:43:31.303Z	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "1c8fce3a-452e-4c6a-98db-28507a8c173a", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2021-05-03T10:43:31.303Z	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "1c8fce3a-452e-4c6a-98db-28507a8c173a", "allowed": true, "result": {}, "resultError": "got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
2021-05-03T10:43:31.309Z	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/validate-actions-summerwind-dev-v1alpha1-runner", "UID": "dc295911-3fd4-4547-8ccd-e69612ff895f", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}
2021-05-03T10:43:31.309Z	INFO	runner-resource	validate resource to be updated	{"name": "runner-infra-xm9vc-ww9f2"}
2021-05-03T10:43:31.309Z	DEBUG	controller-runtime.webhook.webhooks	wrote response	{"webhook": "/validate-actions-summerwind-dev-v1alpha1-runner", "UID": "dc295911-3fd4-4547-8ccd-e69612ff895f", "allowed": true, "result": {}, "resultError": "got runtime.Object without object metadata: &Status{ListMeta:ListMeta{SelfLink:,ResourceVersion:,Continue:,RemainingItemCount:nil,},Status:,Message:,Reason:,Details:nil,Code:200,}"}
2021-05-03T10:43:31.317Z	DEBUG	controller-runtime.webhook.webhooks	received request	{"webhook": "/mutate-actions-summerwind-dev-v1alpha1-runner", "UID": "312c3fee-52df-4ac3-9fec-92a089d213cb", "kind": "actions.summerwind.dev/v1alpha1, Kind=Runner", "resource": {"group":"actions.summerwind.dev","version":"v1alpha1","resource":"runners"}}

Any help would be appreciated. This is the controller setup:

replicaCount: 1

syncPeriod: 10m

authSecret:
  enabled: true
  github_token: "ps::${ssm_github_pat}"

image:
  repository: summerwind/actions-runner-controller
  tag: "v0.18.2"
  dindSidecarRepositoryAndTag: "docker:dind"
  pullPolicy: IfNotPresent

kube_rbac_proxy:
  image:
    repository: gcr.io/kubebuilder/kube-rbac-proxy
    tag: "v0.4.1"

serviceAccount:
  create: true
  name: actions-runner-controller-sa
  annotations:
    eks.amazonaws.com/role-arn: ${iam_role_arn}

podSecurityContext:
  fsGroup: 65534

podAnnotations:
  kube-secret-inject: "true"

nodeSelector:
  dedicated: runner

tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "runner"
    effect: "NoSchedule"

and this is for the runner:

apiVersion: v1
kind: Namespace
metadata:
  labels:
    kube-secret-inject: "enabled"
  name: runner-${name}
---
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: ${iam_role_arn}
  name: runner-${name}-sa
  namespace: runner-${name}
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: runner-${name}-pvc
  namespace: runner-${name}
spec:
  accessModes:
    - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 100Gi
  storageClassName: efs-sc
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: runner-${name}
  namespace: runner-${name}
spec:
  template:
    metadata:
      annotations:
        kube-secret-inject: "true"
    spec:
      repository: ${repository}
      image: ${image_url}:${image_tag}
      imagePullPolicy: IfNotPresent
      dockerdWithinRunnerContainer: true
      serviceAccountName: runner-${name}-sa
      tolerations:
        - key: "dedicated"
          operator: "Equal"
          value: "runner"
          effect: "NoSchedule"
      labels:
        - runner-${name}
        - "${aws_account_id}"
        - ${cluster_name}
        - ${environment_name}
      nodeSelector:
        dedicated: runner
      initContainers:
        - name: ssh-config
          image: ${image_url}:${image_tag}
          imagePullPolicy: IfNotPresent
          command: ["/bin/sh", "-c"]
          args:
            - echo "$GIT_SSH_KEY" > /home/runner/.ssh/id_rsa;
              chmod 400 /home/runner/.ssh/id_rsa;
              ssh-keyscan -t rsa github.com >> /home/runner/.ssh/known_hosts;
              chmod 700 /home/runner/.ssh/known_hosts;
              chown -R 1000 /home/runner/.ssh/*
          env:
            - name: AWS_REGION
              value: ${aws_region}
            - name: GIT_SSH_KEY
              value: ps::${ssm_github_private_key}
          securityContext:
            fsGroup: 27
          volumeMounts:
            - mountPath: /home/runner/.ssh
              name: ssh
      env:
        - name: AWS_REGION
          value: ${aws_region}
      securityContext:
        fsGroup: 27
      volumeMounts:
        - mountPath: /home/runner/.ssh
          name: ssh
        - mountPath: /home/runner/cache/terragrunt
          name: runner-${name}-pvc
      volumes:
        - name: ssh
          emptyDir: {}
        - name: runner-${name}-pvc
          persistentVolumeClaim:
            claimName: runner-${name}-pvc
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: runner-${name}
  namespace: runner-${name}
spec:
  scaleTargetRef:
    name: runner-${name}
  minReplicas: 1
  maxReplicas: 40
  scaleDownDelaySecondsAfterScaleOut: 60
  metrics:
    - type: PercentageRunnersBusy
      scaleUpThreshold: "0.75"
      scaleDownThreshold: "0.3"
      scaleUpFactor: "1.4"
      scaleDownFactor: "0.7"

mumoshu commented May 3, 2021

@gstravinskaite Hey! You've probably missed updating CRDs. Please see #427, #467, and #468
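For context, a minimal sketch of what "updating CRDs" means here, assuming you manage the controller with Helm: Helm 3 only installs CRDs on first install and does not upgrade them afterwards, so the newer CRD manifests generally have to be applied separately before upgrading the controller. The path and release/chart names below are illustrative, not the exact ones from this setup.

# apply the newer CRDs first (path is illustrative; use the CRD manifests
# shipped with the target release)
kubectl apply -f ./charts/actions-runner-controller/crds/
# then upgrade the controller itself via your usual Helm workflow
helm upgrade actions-runner-controller <chart> --version <chart-version> -n actions-runner-system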

mumoshu commented May 3, 2021

TL;DR: There's no easy way to roll back if you broke your deployment by only upgrading the controller, and then broke it even more by downgrading the controller while leaving behind runners created by the newer controller.

If that's the case, all you can do is stop the controller entirely, manually force-delete the runners on K8s, and then go to the Actions page of your repository or organization and delete the registered runners manually.
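For reference, a hedged sketch of that recovery path (the controller deployment name is illustrative; the runner namespace matches the one in the logs above):

# stop the controller so it cannot keep reconciling
kubectl -n actions-runner-system scale deployment actions-runner-controller --replicas=0
# force-delete the leftover Runner resources in the runner namespace
kubectl -n runner-infra delete runners --all --force --grace-period=0
# finally, remove any still-registered runners from the repository or
# organization Settings > Actions > Runners page on GitHub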

mumoshu commented May 3, 2021

We deleted the runners we saw, but the controller keeps reconciling against non-existent runners

I don't get this. Technically this is very unlikely to happen, as the controller consults the latest controller-runtime cache to decide what to reconcile. If you really did remove all the runners (how did you do that? I thought you had to kubectl delete runner and then kubectl patch to remove the finalizer after you broke the setup by not upgrading CRDs), have you confirmed that you've deleted all the runners with kubectl get runner?
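A quick way to confirm what the controller still sees, as a sketch:

# list every Runner resource still present, across all namespaces
kubectl get runners --all-namespaces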

@gstravinskaite (Author)

@gstravinskaite Hey! You've probably missed updating CRDs. Please see #427, #467, and #468

I deleted the CRDs and recreated them again with Helm. Is this what you mean by "upgrade"?

TL;DR: There's no easy way to roll back if you broke your deployment by only upgrading the controller, and then broke it even more by downgrading the controller while leaving behind runners created by the newer controller.

If that's the case, all you can do is stop the controller entirely, manually force-delete the runners on K8s, and then go to the Actions page of your repository or organization and delete the registered runners manually.

The problem is that I did delete all the runners manually, both in the cluster AND under the Actions tab, but the issue persists.

We deleted the runners we saw, but the controller keeps reconciling against non-existent runners

I don't get this. Technically this is very unlikely to happen, as the controller consults the latest controller-runtime cache to decide what to reconcile. If you really did remove all the runners (how did you do that? I thought you had to kubectl delete runner and then kubectl patch to remove the finalizer after you broke the setup by not upgrading CRDs), have you confirmed that you've deleted all the runners with kubectl get runner?

How did I remove the runners? Well, I removed the namespace and then, yes, deleted the finalizer via a patch. But I now see that deleting the namespace did not delete the runners. My bad. I will try to clear them. Thanks!

mumoshu commented May 3, 2021

I deleted the CRDs and recreated them again with Helm. Is this what you mean by "upgrade"?

Sounds good 👍

Well, I removed the namespace and then, yes, deleted the finalizer via a patch. But I now see that deleting the namespace did not delete the runners. My bad. I will try to clear them

Alright! To be extra clear, what you needed to run kubectl delete --force and kubectl patch against were the runners themselves, not the namespaces containing them.
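As a sketch of what that looks like for one stuck runner (the name below is just an example taken from this thread):

# force-delete the Runner resource itself
kubectl -n runner-infra delete runner runner-infra-xm9vc-f8h5z --force --grace-period=0
# if it hangs on the finalizer, clear the finalizers with a merge patch
kubectl -n runner-infra patch runner runner-infra-xm9vc-f8h5z --type merge -p '{"metadata":{"finalizers":null}}'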

grggls commented May 3, 2021

We're having the same issue. The finalizer keeps us from deleting the runners; patching the runners generates the following error:

error: runners.actions.summerwind.dev "mobimeo-actions-runner-nmwk6-7p695" could not be patched: Internal error occurred: failed calling webhook "mutate.runner.actions.summerwind.dev": Post https://webhook-service.actions-runner-system.svc:443/mutate-actions-summerwind-dev-v1alpha1-runner?timeout=30s: service "webhook-service" not found

grggls commented May 3, 2021

Rolling back to 0.10.5 of the Helm chart also generates this error.

@gstravinskaite (Author)

I think I saw the same error; we patched the runners while the controller was still running. We noticed that the controller was actually deleting the runners by itself. I suppose we had a couple of hundred thousand of them before. So I think that if you do a complete wipe, delete the CRDs, and then deploy the old/new version of the controller again, the controller should keep deleting runners.

From our side, we just let the controller run for a couple of hours and it deleted all the runners; things are more or less back to normal.

I think upgrade documentation would be a useful addition here.

@callum-tait-pbx (Contributor)

#519: I'm going to work on moving away from the does-it-all action so we can better provide upgrade docs, etc., as part of the release.

mumoshu commented May 5, 2021

@grggls You seem to have broken your admission webhook service somehow. Reading your logs, perhaps you've completely removed the K8s Service named webhook-service?
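A couple of quick checks, as a hedged sketch (the namespace and Service name are taken from the error message above):

# does the webhook Service still exist in the controller namespace?
kubectl -n actions-runner-system get service webhook-service
# which webhook configurations still point at it?
kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations | grep -i runner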

mumoshu commented May 9, 2021

Closing as the original issue seems to have been resolved 👍 Thanks for reporting and all your support everyone!
