
runner controller failing in 0.18.2 #468

Closed
zetaab opened this issue Apr 21, 2021 · 14 comments

zetaab (Contributor) commented Apr 21, 2021

We updated the runner controller from 0.17.0 to 0.18.2 and the whole thing no longer works. Runners are not coming online, and in the logs we can see:

2021-04-21T11:17:59.542Z	ERROR	actions-runner-controller.runner	Failed to update runner status for Registration	{"runner": "ghe-runner-deployment-znpqv-gpthz", "error": "Runner.actions.summerwind.dev \"ghe-runner-deployment-znpqv-gpthz\" is invalid: [status.message: Required value, status.phase: Required value, status.reason: Required value]"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReconciler).updateRegistrationToken
	/workspace/controllers/runner_controller.go:493
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReconciler).Reconcile
	/workspace/controllers/runner_controller.go:153
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-04-21T11:17:59.542Z	ERROR	controller-runtime.controller	Reconciler error	{"controller": "runner-controller", "request": "gha/ghe-runner-deployment-znpqv-gpthz", "error": "Runner.actions.summerwind.dev \"ghe-runner-deployment-znpqv-gpthz\" is invalid: [status.message: Required value, status.phase: Required value, status.reason: Required value]"}
github.com/go-logr/zapr.(*zapLogger).Error
	/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
	/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
	/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88

I assume this behaviour may have been introduced in #398.

cc @mumoshu

Anyway, we downgraded back to 0.17.0 and everything seems fine now.

callum-tait-pbx (Contributor) commented:
Did you update your CRDs as part of your upgrade to 0.18.2? How are you deploying the solution?

zetaab (Author) commented Apr 21, 2021

I deleted everything: ran helm delete and then deleted all the CRDs as well. Then I installed everything again with 0.18.2, but the problem still existed. After that I downgraded the Helm release back to 0.17.0:

helm upgrade -n actions-runner-system ghe-runner actions-runner-controller/actions-runner-controller -f manifests/ghe/values.yaml --install

zetaab (Author) commented Apr 21, 2021

Oh god, I did not execute helm repo update before running helm. Before:

% helm search repo actions-runner-controller
NAME                                              	CHART VERSION	APP VERSION	DESCRIPTION
actions-runner-controller/actions-runner-contro...	0.4.0        	           	A Kubernetes controller that operates self-host...

and after:

% helm search repo actions-runner-controller
NAME                                              	CHART VERSION	APP VERSION	DESCRIPTION
actions-runner-controller/actions-runner-contro...	0.11.0       	0.18.2     	A Kubernetes controller that operates self-host...

Let's try again now.

zetaab (Author) commented Apr 21, 2021

Hmm, it looks like this helm command does not update CRDs: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/

There is no support at this time for upgrading or deleting CRDs using Helm.

Great.
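
Something like this would presumably be needed before the helm upgrade, as a rough sketch only (the crds/ path assumes the chart source is checked out at the target release tag; verify the layout first):

# apply the CRDs for the target version manually, since helm upgrade won't touch them
kubectl apply -f charts/actions-runner-controller/crds/
# then upgrade the release itself
helm repo update
helm upgrade -n actions-runner-system ghe-runner actions-runner-controller/actions-runner-controller -f manifests/ghe/values.yaml --install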

zetaab (Author) commented Apr 21, 2021

I can confirm that this problem also happens with the new CRDs.

zetaab (Author) commented Apr 21, 2021

Now it looks like the controller has gone crazy (I tried 0.18.2, 0.18.1, and then went back to 0.17.0). I have 2 replicas in both RunnerDeployments:

% kubectl get pods -n gha
NAME                                    READY   STATUS              RESTARTS   AGE
sre-group-runnerdeploy-2mnds-2sdpf      0/1     Pending             0          3s
sre-group-runnerdeploy-2mnds-5rm5g      0/1     ContainerCreating   0          7s
sre-group-runnerdeploy-2mnds-755tf      1/1     Running             0          8s
sre-group-runnerdeploy-2mnds-7p9ld      1/1     Running             0          107s
sre-group-runnerdeploy-2mnds-85sh8      0/1     ContainerCreating   0          6s
sre-group-runnerdeploy-2mnds-8b2lb      0/1     Pending             0          2s
sre-group-runnerdeploy-2mnds-8vkw6      1/1     Running             0          10s
sre-group-runnerdeploy-2mnds-9fkg8      1/1     Running             0          13s
sre-group-runnerdeploy-2mnds-9nnm2      1/1     Running             0          8s
sre-group-runnerdeploy-2mnds-gxbpd      1/1     Running             0          12s
sre-group-runnerdeploy-2mnds-k6sw5      0/1     Pending             0          3s
sre-group-runnerdeploy-2mnds-mqnvm      0/1     Pending             0          1s
sre-group-runnerdeploy-2mnds-qjr6f      0/1     Pending             0          1s
sre-group-runnerdeploy-2mnds-rhn26      0/1     Pending             0          4s
sre-group-runnerdeploy-2mnds-rjkrc      1/1     Running             0          101s
sre-group-runnerdeploy-2mnds-rjr9v      0/1     Pending             0          6s
sre-group-runnerdeploy-2mnds-v7m6w      0/1     Pending             0          3s
sre-group-runnerdeploy-2mnds-v8lhh      0/1     ContainerCreating   0          8s
sre-group-runnerdeploy-2mnds-v94rv      0/1     ContainerCreating   0          8s
sre-group-runnerdeploy-2mnds-wbwbb      0/1     Pending             0          1s
foobar-group-runnerdeploy-6kctz-22mwj   0/1     ContainerCreating   0          7s
foobar-group-runnerdeploy-6kctz-5bdxk   1/1     Running             0          10s
foobar-group-runnerdeploy-6kctz-672ds   1/1     Running             0          10s
foobar-group-runnerdeploy-6kctz-6btvh   1/1     Running             0          12s
foobar-group-runnerdeploy-6kctz-6hp59   0/1     ContainerCreating   0          7s
foobar-group-runnerdeploy-6kctz-86xtr   0/1     ContainerCreating   0          8s
foobar-group-runnerdeploy-6kctz-8bfms   1/1     Running             0          10s
foobar-group-runnerdeploy-6kctz-9df7w   0/1     Pending             0          5s
foobar-group-runnerdeploy-6kctz-9s6mb   1/1     Running             0          8s
foobar-group-runnerdeploy-6kctz-bdvmx   1/1     Running             0          10s
foobar-group-runnerdeploy-6kctz-bvzv8   0/1     Pending             0          1s
foobar-group-runnerdeploy-6kctz-f2ptd   1/1     Running             0          103s
foobar-group-runnerdeploy-6kctz-fdjbc   1/1     Running             0          101s
foobar-group-runnerdeploy-6kctz-gbmvh   0/1     ContainerCreating   0          8s
foobar-group-runnerdeploy-6kctz-kvgfx   1/1     Running             0          11s
foobar-group-runnerdeploy-6kctz-pldh6   1/1     Running             0          100s
foobar-group-runnerdeploy-6kctz-rb656   0/1     Pending             0          5s
foobar-group-runnerdeploy-6kctz-s8nzj   1/1     Running             0          10s
foobar-group-runnerdeploy-6kctz-v8hk4   0/1     Pending             0          4s
foobar-group-runnerdeploy-6kctz-ww8k2   1/1     Running             0          104s
foobar-group-runnerdeploy-6kctz-xqn29   1/1     Running             0          12s
foobar-group-runnerdeploy-6kctz-z6nln   0/1     Pending             0          5s
foobar-group-runnerdeploy-6kctz-zkp6x   1/1     Running             0          12s
foobar-group-runnerdeploy-sw8qb-gzs75   1/1     Running             0          103s

one minute later:

% kubectl get pods -n gha|wc -l
192

After 5 minutes: over 4000 runners.

It's some kind of loop adding new runners :) Quite a critical bug, I would say.

callum-tait-pbx (Contributor) commented Apr 21, 2021

Hmm, it looks like this helm command does not update CRDs: https://helm.sh/docs/chart_best_practices/custom_resource_definitions/

Yeah, Helm will only install CRDs; it will not delete or update them. When doing an upgrade that involves CRD changes you need to uninstall everything and delete the CRDs.

It's some kind of loop adding new runners :) Quite a critical bug, I would say.

Any interesting controller logs? Can you post your RunnerDeployments?

@mumoshu this has happened before in a similar scenario; here it is: #427

Is it worth adding some sort of totalMaxRunnerCount config to the controller that any scaling activity is always checked against, to ensure we don't scale to infinity in any scenario? Obviously it doesn't fix the bug, but it would at least prevent the system from killing itself in this sort of edge case. We could set it to something fairly high in the values.yaml, like 100, so that in general you don't need to touch it.

zetaab (Author) commented Apr 21, 2021

Our RunnerDeployments were:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: sre-group-runnerdeploy
  namespace: gha
spec:
  replicas: 2
  template:
    spec:
      organization: sensured
      group: SRE
      image: sensured/sre/actions-runner-dind
      dockerdWithinRunnerContainer: true
      resources:
        limits:
          cpu: "4000m"
          memory: "2Gi"
        requests:
          cpu: "200m"
          memory: "200Mi"
      volumeMounts:
      - mountPath: /runner
        name: runner
      volumes:
      - name: runner
        emptyDir: {}
---
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: sensured-group-runnerdeploy
  namespace: gha
spec:
  replicas: 2
  template:
    spec:
      organization: sensured
      group: sensured
      image: sensured/sre/actions-runner-dind
      dockerdWithinRunnerContainer: true
      resources:
        limits:
          cpu: "4000m"
          memory: "2Gi"
        requests:
          cpu: "200m"
          memory: "200Mi"
      volumeMounts:
      - mountPath: /runner
        name: runner
      volumes:
      - name: runner
        emptyDir: {}

Well, there were like 100k lines of logs... impossible to investigate.

callum-tait-pbx (Contributor) commented Apr 21, 2021

Are you able to grep for things like desired replicas? Most of the log entries from the controller will probably be Successfully Reconciled lines; desired replicas should be a much smaller subset and might provide some useful debug information.
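
For example, something along these lines (the deployment and container names are just guesses based on a default chart install, so adjust them to your release):

# pull the controller logs and filter for replica-count decisions
kubectl logs -n actions-runner-system deploy/ghe-runner-actions-runner-controller -c manager | grep -i 'desired replicas'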

Do you have any other runners deployed? Do you have any HorizontalRunnerAutoscaler kinds deployed as part of this controller deployment?

callum-tait-pbx (Contributor) commented Apr 21, 2021

#467 is another relevant issue. There is a bit of a process to updating CRDs with Helm which, when done wrong (although it's not obvious whether the problem comes from Helm's lack of CRD management support or from actions-runner-controller itself), causes a massive spike in runners. It's well worth producing a runbook internally for Helm upgrades of this project, as CRDs annoyingly require a bit of extra care. I personally follow the steps below (a rough kubectl sketch of the CRD step follows the list):

  1. Uninstall all my runners (ours are wrapped in a Helm chart)
  2. Ensure they are all deleted and there aren't any left orphaned, i.e.
kubectl get pods -n %ACTIONS_NAMESPACE%
kubectl get runners -n %ACTIONS_NAMESPACE%
  3. Uninstall my controller chart
  4. Delete the CRDs with kubectl
  5. Install the new version of the controller chart
  6. Install my runners
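
Roughly, the kubectl side of the CRD step looks like this (the CRD names below are what I'd expect for a v0.18-era install; confirm them with the grep first):

# confirm which CRDs the project installed
kubectl get crds | grep actions.summerwind.dev
# delete them before installing the new controller chart
kubectl delete crd runners.actions.summerwind.dev runnerreplicasets.actions.summerwind.dev runnerdeployments.actions.summerwind.dev horizontalrunnerautoscalers.actions.summerwind.dev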

If possible, some log lines with desired replicas might help towards seeing if there is something that can be done in the controller though! :D

mumoshu (Collaborator) commented Apr 22, 2021

Sorry to see you getting caught by this.

Well, the short answer is that you seem to need to delete the RunnerReplicaSets that were created by v0.18.

The RunnerDeployment was created by you, so just leave it there if it keeps working after you manually remove the RunnerReplicaSets created by v0.18. The v0.17 controller won't handle v0.18 RunnerReplicaSets nicely.
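
For example (namespace taken from the output above; the actual RunnerReplicaSet name is whatever kubectl reports, the one below is just a placeholder):

# list the RunnerReplicaSets the v0.18 controller left behind
kubectl get runnerreplicasets -n gha
# delete the leftover ones by name
kubectl delete runnerreplicaset <leftover-name> -n gha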

actions-runner-controller is not tested for forward compatibility, so expecting the v0.17 controller to gracefully handle runners and runner deployments created by the v0.18 controller isn't really justified.

Also, I'm not sure there's any general way to make a K8s controller handle "broken" CRDs gracefully. If we were to cover that in our implementation and tests, it would be a large effort and take a lot of my time.

That said, a sustainable way to further prevent this kind of issue would be to add upgrade instructions (noting that helm upgrade doesn't upgrade CRDs and that you need separate kubectl runs to upgrade the CRDs first) to the README or the GitHub release description.

zetaab (Author) commented Apr 22, 2021

It's enough if there is just some kind of release note saying "read this before updating".

mumoshu (Collaborator) commented Apr 25, 2021

@zetaab I've just done my best writing it :)

[screenshot of the release note omitted]

Closing as answered and resolved, but please feel free to submit other issues for anything else. I can think of potential new issues about adding docs, enhancing the release notes, ideas to improve the controller so it doesn't break in terrible ways with invalid CRDs, etc.

mumoshu closed this as completed Apr 25, 2021
mumoshu (Collaborator) commented Apr 25, 2021

If you have suggestions about how release notes should be written, check #481 and contribute changes to the new RELEASE_NOTE_TEMPLATE file.
