
runner controller failing in 0.18.2 #468

Closed
zetaab opened this issue Apr 21, 2021 · 14 comments

zetaab (Contributor) commented Apr 21, 2021

We updated the runner controller from 0.17.0 to 0.18.2 and the whole thing no longer works. Runners are not coming online, and in the logs we can see:

2021-04-21T11:17:59.542Z	ERROR	actions-runner-controller.runner	Failed to update runner status for Registration	{"runner": "ghe-runner-deployment-znpqv-gpthz", "error": "Runner.actions.summerwind.dev \"ghe-runner-deployment-znpqv-gpthz\" is invalid: [status.message: Required value, status.phase: Required value, status.reason: Required value]"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReconciler).updateRegistrationToken
	/workspace/controllers/runner_controller.go:493
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReconciler).Reconcile
	/workspace/controllers/runner_controller.go:153
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-04-21T11:17:59.542Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "runner-controller", "request": "gha/ghe-runner-deployment-znpqv-gpthz", "error": "Runner.actions.summerwind.dev \"ghe-runner-deployment-znpqv-gpthz\" is invalid: [status.message: Required value, status.phase: Required value, status.reason: Required value]"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88

I assume this behaviour may have been introduced in #398.

cc @mumoshu

Anyway, we downgraded back to 0.17.0 and everything seems fine now.

callum-tait-pbx (Contributor) commented:
Did you update your CRDs as part of your upgrade to 0.18.2? How are you deploying the solution?

zetaab (Author) commented Apr 21, 2021

I deleted everything: ran helm delete and then deleted all the CRDs as well. Then I installed everything again with 0.18.2, but the problem still existed. After that I downgraded the Helm release back to 0.17.0:

helm upgrade -n actions-runner-system ghe-runner actions-runner-controller/actions-runner-controller -f manifests/ghe/values.yaml --install

zetaab (Author) commented Apr 21, 2021

Oh god, I did not execute helm repo update before running helm. Before:

% helm search repo actions-runner-controller
NAME                                              	CHART VERSION	APP VERSION	DESCRIPTION
actions-runner-controller/actions-runner-contro...	0.4.0        	           	A Kubernetes controller that operates self-host...

and after:

% helm search repo actions-runner-controller
NAME                                              	CHART VERSION	APP VERSION	DESCRIPTION
actions-runner-controller/actions-runner-contro...	0.11.0       	0.18.2     	A Kubernetes controller that operates self-host...

Let's try again now.

zetaab (Author) commented Apr 21, 2021

Hmm, it looks like this helm command does not update CRDs: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/

There is no support at this time for upgrading or deleting CRDs using Helm.

Great.
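
Something like this would presumably be needed before the helm upgrade, as a rough sketch only (the crds/ path assumes the chart source is checked out at the target release tag; verify the layout first):

# apply the CRDs for the target version manually, since helm upgrade won't touch them
kubectl apply -f charts/actions-runner-controller/crds/
# then upgrade the release itself
helm repo update
helm upgrade -n actions-runner-system ghe-runner actions-runner-controller/actions-runner-controller -f manifests/ghe/values.yaml --install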

zetaab (Author) commented Apr 21, 2021

I can confirm that this problem also happens with the new CRDs.

zetaab (Author) commented Apr 21, 2021

Now it looks like the controller has gone crazy (I tried 0.18.2, 0.18.1, and then went back to 0.17.0). I have 2 replicas in both RunnerDeployments:

% kubectl get pods -n gha
NAME                                    READY   STATUS              RESTARTS   AGE
sre-group-runnerdeploy-2mnds-2sdpf      0/1     Pending             0          3s
sre-group-runnerdeploy-2mnds-5rm5g      0/1     ContainerCreating   0          7s
sre-group-runnerdeploy-2mnds-755tf      1/1     Running             0          8s
sre-group-runnerdeploy-2mnds-7p9ld      1/1     Running             0          107s
sre-group-runnerdeploy-2mnds-85sh8      0/1     ContainerCreating   0          6s
sre-group-runnerdeploy-2mnds-8b2lb      0/1     Pending             0          2s
sre-group-runnerdeploy-2mnds-8vkw6      1/1     Running             0          10s
sre-group-runnerdeploy-2mnds-9fkg8      1/1     Running             0          13s
sre-group-runnerdeploy-2mnds-9nnm2      1/1     Running             0          8s
sre-group-runnerdeploy-2mnds-gxbpd      1/1     Running             0          12s
sre-group-runnerdeploy-2mnds-k6sw5      0/1     Pending             0          3s
sre-group-runnerdeploy-2mnds-mqnvm      0/1     Pending             0          1s
sre-group-runnerdeploy-2mnds-qjr6f      0/1     Pending             0          1s
sre-group-runnerdeploy-2mnds-rhn26      0/1     Pending             0          4s
sre-group-runnerdeploy-2mnds-rjkrc      1/1     Running             0          101s
sre-group-runnerdeploy-2mnds-rjr9v      0/1     Pending             0          6s
sre-group-runnerdeploy-2mnds-v7m6w      0/1     Pending             0          3s
sre-group-runnerdeploy-2mnds-v8lhh      0/1     ContainerCreating   0          8s
sre-group-runnerdeploy-2mnds-v94rv      0/1     ContainerCreating   0          8s
sre-group-runnerdeploy-2mnds-wbwbb      0/1     Pending             0          1s
foobar-group-runnerdeploy-6kctz-22mwj   0/1     ContainerCreating   0          7s
foobar-group-runnerdeploy-6kctz-5bdxk   1/1     Running             0          10s
foobar-group-runnerdeploy-6kctz-672ds   1/1     Running             0          10s
foobar-group-runnerdeploy-6kctz-6btvh   1/1     Running             0          12s
foobar-group-runnerdeploy-6kctz-6hp59   0/1     ContainerCreating   0          7s
foobar-group-runnerdeploy-6kctz-86xtr   0/1     ContainerCreating   0          8s
foobar-group-runnerdeploy-6kctz-8bfms   1/1     Running             0          10s
foobar-group-runnerdeploy-6kctz-9df7w   0/1     Pending             0          5s
foobar-group-runnerdeploy-6kctz-9s6mb   1/1     Running             0          8s
foobar-group-runnerdeploy-6kctz-bdvmx   1/1     Running             0          10s
foobar-group-runnerdeploy-6kctz-bvzv8   0/1     Pending             0          1s
foobar-group-runnerdeploy-6kctz-f2ptd   1/1     Running             0          103s
foobar-group-runnerdeploy-6kctz-fdjbc   1/1     Running             0          101s
foobar-group-runnerdeploy-6kctz-gbmvh   0/1     ContainerCreating   0          8s
foobar-group-runnerdeploy-6kctz-kvgfx   1/1     Running             0          11s
foobar-group-runnerdeploy-6kctz-pldh6   1/1     Running             0          100s
foobar-group-runnerdeploy-6kctz-rb656   0/1     Pending             0          5s
foobar-group-runnerdeploy-6kctz-s8nzj   1/1     Running             0          10s
foobar-group-runnerdeploy-6kctz-v8hk4   0/1     Pending             0          4s
foobar-group-runnerdeploy-6kctz-ww8k2   1/1     Running             0          104s
foobar-group-runnerdeploy-6kctz-xqn29   1/1     Running             0          12s
foobar-group-runnerdeploy-6kctz-z6nln   0/1     Pending             0          5s
foobar-group-runnerdeploy-6kctz-zkp6x   1/1     Running             0          12s
foobar-group-runnerdeploy-sw8qb-gzs75   1/1     Running             0          103s

one minute later:

% kubectl get pods -n gha|wc -l
192

After 5 minutes: over 4000 runners.

It's some kind of loop adding new runners :) Quite a critical bug, I would say.

callum-tait-pbx (Contributor) commented Apr 21, 2021

Hmm, it looks like this helm command does not update CRDs: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/

Yeah, Helm will only install CRDs; it will not delete or update them. When doing an upgrade that involves CRD changes you need to uninstall everything and delete the CRDs.

It's some kind of loop adding new runners :) Quite a critical bug, I would say.

Any interesting controller logs? Can you post your RunnerDeployments?

@mumoshu this has happened before in a similar scenario; here it is: #427

Is it worth adding some sort of totalMaxRunnerCount config to the controller that any scaling activity is always checked against, to ensure we don't scale to infinity in any scenario? Obviously it doesn't fix the bug, but it would at least prevent the system from killing itself in this sort of edge case. We could set it to something fairly high in the values.yaml, like 100, so that in general you don't need to touch it.

zetaab (Author) commented Apr 21, 2021

Our RunnerDeployments were:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: sre-group-runnerdeploy
  namespace: gha
spec:
  replicas: 2
  template:
    spec:
      organization: sensured
      group: SRE
      image: sensured/sre/actions-runner-dind
      dockerdWithinRunnerContainer: true
      resources:
        limits:
          cpu: "4000m"
          memory: "2Gi"
        requests:
          cpu: "200m"
          memory: "200Mi"
      volumeMounts:
      - mountPath: /runner
        name: runner
      volumes:
      - name: runner
        emptyDir: {}
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: sensured-group-runnerdeploy
  namespace: gha
spec:
  replicas: 2
  template:
    spec:
      organization: sensured
      group: sensured
      image: sensured/sre/actions-runner-dind
      dockerdWithinRunnerContainer: true
      resources:
        limits:
          cpu: "4000m"
          memory: "2Gi"
        requests:
          cpu: "200m"
          memory: "200Mi"
      volumeMounts:
      - mountPath: /runner
        name: runner
      volumes:
      - name: runner
        emptyDir: {}

Well, there were like 100k lines of logs... impossible to investigate.

callum-tait-pbx (Contributor) commented Apr 21, 2021

Are you able to grep for things like desired replicas? Most of the log entries from the controller will probably be Successfully Reconciled lines; desired replicas should be a much smaller subset and might provide some useful debug information.
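
For example, something along these lines (the deployment and container names are just guesses based on a default chart install, so adjust them to your release):

# pull the controller logs and filter for replica-count decisions
kubectl logs -n actions-runner-system deploy/ghe-runner-actions-runner-controller -c manager | grep -i 'desired replicas'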

Do you have any other runners deployed? Do you have any HorizontalRunnerAutoscaler kinds deployed as part of this controller deployment?

callum-tait-pbx (Contributor) commented Apr 21, 2021

#467 is another relevant issue. There is a bit of a process to updating CRDs with Helm which, when done wrong (although it's not obvious whether the problem comes from Helm's lack of CRD management support or from actions-runner-controller itself), causes a massive spike in runners. It's well worth producing a runbook internally for Helm upgrades of this project, as CRDs annoyingly require a bit of extra care. I personally follow the steps below (a rough kubectl sketch of the CRD step follows the list):

  1. Uninstall all my runners (ours are wrapped in a Helm chart)
  2. Ensure they are all deleted and there aren't any left orphaned, i.e.
kubectl get pods -n %ACTIONS_NAMESPACE%
kubectl get runners -n %ACTIONS_NAMESPACE%
  3. Uninstall my controller chart
  4. Delete the CRDs with kubectl
  5. Install the new version of the controller chart
  6. Install my runners
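
Roughly, the kubectl side of the CRD step looks like this (the CRD names below are what I'd expect for a v0.18-era install; confirm them with the grep first):

# confirm which CRDs the project installed
kubectl get crds | grep actions.summerwind.dev
# delete them before installing the new controller chart
kubectl delete crd runners.actions.summerwind.dev runnerreplicasets.actions.summerwind.dev runnerdeployments.actions.summerwind.dev horizontalrunnerautoscalers.actions.summerwind.dev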

If possible, some log lines with desired replicas might help towards seeing if there is something that can be done in the controller though! :D

mumoshu (Collaborator) commented Apr 22, 2021

Sorry to see you getting caught by this.

Well, the short answer is that you seem to need to delete the RunnerReplicaSets that were created by v0.18.

The RunnerDeployment was created by you, so just leave it there if it keeps working after you manually remove the RunnerReplicaSets created by v0.18. The v0.17 controller won't handle v0.18 RunnerReplicaSets nicely.
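
For example (namespace taken from the output above; the actual RunnerReplicaSet name is whatever kubectl reports, the one below is just a placeholder):

# list the RunnerReplicaSets the v0.18 controller left behind
kubectl get runnerreplicasets -n gha
# delete the leftover ones by name
kubectl delete runnerreplicaset <leftover-name> -n gha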

actions-runner-controller is not tested for forward compatibility, so expecting the v0.17 controller to gracefully handle runners and runner deployments created by the v0.18 controller isn't really justified.

Also, I'm not sure there's any general way to make a K8s controller handle "broken" CRDs gracefully. If we were to cover that in our implementation and tests, it would be a large effort and take a lot of my time.

That said, a sustainable way to further prevent this kind of issue would be to add upgrade instructions (noting that helm upgrade doesn't upgrade CRDs and that you need separate kubectl runs to upgrade the CRDs first) to the README or the GitHub release description.

zetaab (Author) commented Apr 22, 2021

It's enough if there is just some kind of release note saying "read this before updating".

mumoshu (Collaborator) commented Apr 25, 2021

@zetaab I've just done my best writing it :)

[screenshot of the release note omitted]

Closing as answered and resolved, but please feel free to submit other issues for anything else. I can think of potential new issues about adding docs, enhancing the release notes, ideas to improve the controller so it doesn't break in terrible ways with invalid CRDs, etc.

mumoshu closed this as completed Apr 25, 2021
mumoshu (Collaborator) commented Apr 25, 2021

If you have suggestions about how release notes should be written, check #481 and contribute changes to the new RELEASE_NOTE_TEMPLATE file.
