
v2.12: leadership election panics and crashes controller #4761

Closed
alexec opened this issue Dec 16, 2020 · 14 comments

@alexec
Contributor

alexec commented Dec 16, 2020

 controller | time="2020-12-16T10:35:34.758Z" level=info msg="Deleting TTL expired workflow argo/calendar-workflow-x28p4"
 controller | E1216 11:15:18.671907   78349 leaderelection.go:307] Failed to release lock: Lease.coordination.k8s.io "workflow-controller" is invalid: spec.leaseDurationSeconds: Invalid value: 0: must be greater than 0

I'm pretty sure we panic here, and then we see other things shutting down.

 controller | time="2020-12-16T11:15:18.671Z" level=info msg="stopped leading" id=local
 controller | time="2020-12-16T11:15:18.672Z" level=info msg="Shutting workflow TTL worker"
 controller | panic: http: Server closed
 controller | goroutine 279 [running]:
 controller | github.com/argoproj/argo/workflow/metrics.runServer.func1(0x1, 0xc00074df68, 0x8, 0x2382, 0x0, 0x0, 0xc000836000)
 controller | 	/Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:53 +0x117
 controller | created by github.com/argoproj/argo/workflow/metrics.runServer
 controller | 	/Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:50 +0x246
 controller | Terminating controller
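For context, the panic above appears to come from the metrics server goroutine, which treats any error returned by ListenAndServe, including the http.ErrServerClosed produced by a normal shutdown, as fatal. Below is a minimal sketch of a goroutine that tolerates a clean shutdown; the function name and signature are assumptions for illustration, not the actual Argo code.

package metrics

import (
	"errors"
	"log"
	"net/http"
)

// runServer starts the metrics HTTP server in a background goroutine.
// ListenAndServe returns http.ErrServerClosed after Shutdown/Close, so that
// error is treated as a normal stop rather than a reason to panic.
func runServer(addr string, handler http.Handler) *http.Server {
	srv := &http.Server{Addr: addr, Handler: handler}
	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("metrics server failed: %v", err) // genuinely unexpected failure
		}
	}()
	return srv
}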

kubernetes/client-go#754

@sarabala1979
Member

@alexec is this happening in your test environment? Are you configuring LeaseDuration: 0 * time.Second in code?
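For reference, the API server requires leaseDurationSeconds to be greater than zero, so LeaseDuration must be a positive duration. A minimal client-go leader-election setup looks roughly like the sketch below; the function name and the duration values are illustrative assumptions, not Argo's actual configuration (the lease name and namespace match the logs above).

package controller

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, id string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Namespace: "argo", Name: "workflow-controller"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // must be > 0 or the Lease update is rejected
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start controller loops */ },
			OnStoppedLeading: func() { /* shut down cleanly; do not panic */ },
		},
	})
}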

@alexec
Contributor Author

alexec commented Dec 16, 2020

@sarabala1979 no, it just seems to happen after running for (maybe) 20m+.

@sarabala1979
Member

The issue is fixed in Kubernetes v1.20:
kubernetes/kubernetes#80954

@sarabala1979
Member

This issue is very hard to reproduce; it is not consistent. I was able to reproduce it twice in my local env with k3d. I also got one different error, and after that I couldn't reproduce it again.

controller | I1217 09:33:17.702074   80579 leaderelection.go:288] failed to renew lease argo/workflow-controller: failed to tryAcquireOrRenew context deadline exceeded
controller | E1217 09:33:20.362137   80579 leaderelection.go:307] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "workflow-controller": the object has been modified; please apply your changes to the latest version and try again

@max-sixty
Contributor

FYI, it looks like the fix was cherry-picked back to 1.18: kubernetes/kubernetes#80954 (comment)

@max-sixty
Contributor

We've hit a similar issue, around the error "LEADER_ELECTION_IDENTITY must be set so that the workflow controllers can elect a leader", which caused our workflow-controller to go into CrashLoopBackOff and jobs not to be scheduled.

Here are the logs, in case they're helpful. If there's a way we could have prevented this, it would be great to know. And if anyone has a monitoring approach that would have alerted on something like this, I'd be very interested to learn. Thanks in advance:

time="2020-12-25T20:44:53.729Z" level=info msg="config map" name=workflow-controller-configmap
time="2020-12-25T20:44:53.745Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: true\n  gcs:\n    bucket: {}\n    serviceAccountKeySecret:\n      key: \"\"\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\nparallelism: 20\npodSpecLogStrategy: {}\ntelemetryConfig: {}\nworkflowDefaults:\n  metadata:\n    creationTimestamp: null\n  spec:\n    arguments: {}\n    retryStrategy:\n      backoff:\n        duration: 1m\n        factor: 2\n      limit: 10\n      retryPolicy: Always\n    serviceAccountName: {}\n    ttlStrategy:\n      secondsAfterFailure: 604800\n      secondsAfterSuccess: 604800\n  status:\n    finishedAt: null\n    startedAt: null\n"
time="2020-12-25T20:44:53.745Z" level=info msg="Persistence configuration disabled"
time="2020-12-25T20:44:53.746Z" level=info msg="Starting Workflow Controller" version=v2.11.8+b7412aa.dirty
time="2020-12-25T20:44:53.746Z" level=info msg="Workers: workflow: 32, pod: 32, pod cleanup: 4"
time="2020-12-25T20:44:53.867Z" level=info msg="Manager initialized successfully"
time="2020-12-25T20:44:53.867Z" level=fatal msg="LEADER_ELECTION_IDENTITY must be set so that the workflow controllers can elect a leader"

This sequence repeats every five minutes, each time the workflow-controller container restarts.

@alexec
Contributor Author

alexec commented Dec 29, 2020

You need to set the LEADER_ELECTION_IDENTITY environment variable in your manifest. This is typically (always?) set to metadata.name.

@sarabala1979
Member

Are you using install.sh from the release, or just updating the release version? If you are just updating the release version, you need to add this env entry to your existing deployment spec:

        env:
        - name: LEADER_ELECTION_IDENTITY
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name

@max-sixty
Contributor

I was using 2.11.8, tagged with kustomize: https://github.com/argoproj/argo/manifests/cluster-install?ref=v2.11.8. It had worked for a few weeks before this issue, and I'm fairly confident the manifests didn't change.

I don't see any LEADER_ELECTION_IDENTITY in the manifests until after 2.12.2. I checked the manifests that were deployed and the images were all correctly tagged to 2.11.8 — it doesn't seem to be an issue with an untagged image upgrading accidentally.

So either I made a mistake (very possible) or the error came from a 2.11.8 image.

It resolved after upgrading to 2.12.2, though I'm not sure whether the fix came from the new version itself or simply from changing versions. Wiping the whole argo namespace and reapplying the 2.11.8 manifests didn't help.

Thank you as ever for engaging and for the phenomenal library.

@alexec
Contributor Author

alexec commented Dec 29, 2020

LEADER_ELECTION_IDENTITY is a v2.12 feature, not a v2.11 feature, so you should not see anything in the logs if you're running v2.11.8.

docker run argoproj/workflow-controller:v2.11.8 version
Unable to find image 'argoproj/workflow-controller:v2.11.8' locally
v2.11.8: Pulling from argoproj/workflow-controller
e54ef591d839: Already exists 
d69ba838510e: Already exists 
Digest: sha256:44d87c9f555fc14ef2433eeda4f29d70eab37b6bda7e019192659c95e5ed0161
Status: Downloaded newer image for argoproj/workflow-controller:v2.11.8
workflow-controller: v2.11.8+b7412aa.dirty
  BuildDate: 2020-12-24T10:16:20Z
  GitCommit: b7412aa1bcff2df20bbe5d515abddb8f33cf4c9e
  GitTreeState: dirty
  GitTag: v2.10.0-rc1
  GoVersion: go1.13.15
  Compiler: gc
  Platform: linux/amd64

It looks like someone (me?) has overwritten the v2.11.8 controller with a test version.

@alexec
Contributor Author

alexec commented Dec 29, 2020

Why oh why does Docker Hub allow you to overwrite images like this? It is impossible to prevent this from ever happening; even updating the build scripts to check that a version does not already exist could not prevent it.

@alexec
Contributor Author

alexec commented Dec 29, 2020

@max-sixty Fixed.

docker run argoproj/workflow-controller:v2.11.8 version
workflow-controller: v2.11.8
  BuildDate: 2020-12-29T20:43:36Z
  GitCommit: 310e099f82520030246a7c9d66f3efaadac9ade2
  GitTreeState: clean
  GitTag: v2.11.8
  GoVersion: go1.13.4
  Compiler: gc
  Platform: linux/amd64

@max-sixty
Contributor

Great — thanks a lot for tracking it down!

@sarabala1979
Member

The K8s 1.19 client has a fix for this issue.

This issue was closed.