
v2.12: leadership election panics and crashes controller #4761

Closed
alexec opened this issue Dec 16, 2020 · 14 comments

@alexec
Contributor

alexec commented Dec 16, 2020

 controller | time="2020-12-16T10:35:34.758Z" level=info msg="Deleting TTL expired workflow argo/calendar-workflow-x28p4"
 controller | E1216 11:15:18.671907   78349 leaderelection.go:307] Failed to release lock: Lease.coordination.k8s.io "workflow-controller" is invalid: spec.leaseDurationSeconds: Invalid value: 0: must be greater than 0

I'm pretty sure we panic here, and then we see other things shutting down.

 controller | time="2020-12-16T11:15:18.671Z" level=info msg="stopped leading" id=local
 controller | time="2020-12-16T11:15:18.672Z" level=info msg="Shutting workflow TTL worker"
 controller | panic: http: Server closed
 controller | goroutine 279 [running]:
 controller | github.com/argoproj/argo/workflow/metrics.runServer.func1(0x1, 0xc00074df68, 0x8, 0x2382, 0x0, 0x0, 0xc000836000)
 controller | 	/Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:53 +0x117
 controller | created by github.com/argoproj/argo/workflow/metrics.runServer
 controller | 	/Users/acollins8/go/src/github.com/argoproj/argo/workflow/metrics/server.go:50 +0x246
 controller | Terminating controller
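For context, the panic above appears to come from the metrics server goroutine, which treats any error returned by ListenAndServe, including the http.ErrServerClosed produced by a normal shutdown, as fatal. Below is a minimal sketch of a goroutine that tolerates a clean shutdown; the function name and signature are assumptions for illustration, not the actual Argo code.

package metrics

import (
	"errors"
	"log"
	"net/http"
)

// runServer starts the metrics HTTP server in a background goroutine.
// ListenAndServe returns http.ErrServerClosed after Shutdown/Close, so that
// error is treated as a normal stop rather than a reason to panic.
func runServer(addr string, handler http.Handler) *http.Server {
	srv := &http.Server{Addr: addr, Handler: handler}
	go func() {
		if err := srv.ListenAndServe(); err != nil && !errors.Is(err, http.ErrServerClosed) {
			log.Fatalf("metrics server failed: %v", err) // genuinely unexpected failure
		}
	}()
	return srv
}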

kubernetes/client-go#754

@sarabala1979
Member

@alexec is this happening in your test environment? Are you configuring LeaseDuration: 0 * time.Second in code?
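For reference, the API server requires leaseDurationSeconds to be greater than zero, so LeaseDuration must be a positive duration. A minimal client-go leader-election setup looks roughly like the sketch below; the function name and the duration values are illustrative assumptions, not Argo's actual configuration (the lease name and namespace match the logs above).

package controller

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, id string) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Namespace: "argo", Name: "workflow-controller"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: id},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second, // must be > 0 or the Lease update is rejected
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* start controller loops */ },
			OnStoppedLeading: func() { /* shut down cleanly; do not panic */ },
		},
	})
}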

@alexec
Contributor Author

alexec commented Dec 16, 2020

@sarabala1979 no, it just seems to happen after running for (maybe) 20m+.

@sarabala1979
Member

The issue is fixed in Kubernetes v1.20:
kubernetes/kubernetes#80954

@sarabala1979
Member

This issue is very hard to reproduce; it is not consistent. I was able to reproduce it twice in my local env with k3d. I also got one different error, and after that I couldn't reproduce it again.

controller | I1217 09:33:17.702074   80579 leaderelection.go:288] failed to renew lease argo/workflow-controller: failed to tryAcquireOrRenew context deadline exceeded
controller | E1217 09:33:20.362137   80579 leaderelection.go:307] Failed to release lock: Operation cannot be fulfilled on leases.coordination.k8s.io "workflow-controller": the object has been modified; please apply your changes to the latest version and try again

@max-sixty
Contributor

FYI, it looks like the fix was cherry-picked back to 1.18: kubernetes/kubernetes#80954 (comment)

@max-sixty
Contributor

We've hit a similar issue, around the error "LEADER_ELECTION_IDENTITY must be set so that the workflow controllers can elect a leader", which caused our workflow-controller to go into CrashLoopBackOff and jobs not to be scheduled.

Here are the logs, in case they're helpful. If there's a way we could have prevented this, it would be great to know. And if anyone has a monitoring approach that would have alerted on something like this, I'd be very interested to learn. Thanks in advance:

time="2020-12-25T20:44:53.729Z" level=info msg="config map" name=workflow-controller-configmap
time="2020-12-25T20:44:53.745Z" level=info msg="Configuration:\nartifactRepository:\n  archiveLogs: true\n  gcs:\n    bucket: {}\n    serviceAccountKeySecret:\n      key: \"\"\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\nparallelism: 20\npodSpecLogStrategy: {}\ntelemetryConfig: {}\nworkflowDefaults:\n  metadata:\n    creationTimestamp: null\n  spec:\n    arguments: {}\n    retryStrategy:\n      backoff:\n        duration: 1m\n        factor: 2\n      limit: 10\n      retryPolicy: Always\n    serviceAccountName: {}\n    ttlStrategy:\n      secondsAfterFailure: 604800\n      secondsAfterSuccess: 604800\n  status:\n    finishedAt: null\n    startedAt: null\n"
time="2020-12-25T20:44:53.745Z" level=info msg="Persistence configuration disabled"
time="2020-12-25T20:44:53.746Z" level=info msg="Starting Workflow Controller" version=v2.11.8+b7412aa.dirty
time="2020-12-25T20:44:53.746Z" level=info msg="Workers: workflow: 32, pod: 32, pod cleanup: 4"
time="2020-12-25T20:44:53.867Z" level=info msg="Manager initialized successfully"
time="2020-12-25T20:44:53.867Z" level=fatal msg="LEADER_ELECTION_IDENTITY must be set so that the workflow controllers can elect a leader"

This sequence repeats every five minutes, each time the workflow-controller container restarts.

@alexec
Contributor Author

alexec commented Dec 29, 2020

You need to set the LEADER_ELECTION_IDENTITY environment variable in your manifest. This is typically (always?) set to metadata.name.

@sarabala1979
Member

Are you using install.sh from the release, or just updating the release version? If you are just updating the release version, you need to add this env entry to your existing deployment spec:

        env:
        - name: LEADER_ELECTION_IDENTITY
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name

@max-sixty
Contributor

I was using 2.11.8, tagged with kustomize: https://github.com/argoproj/argo/manifests/cluster-install?ref=v2.11.8. It had worked for a few weeks before this issue, and I'm fairly confident the manifests didn't change.

I don't see any LEADER_ELECTION_IDENTITY in the manifests until after 2.12.2. I checked the manifests that were deployed and the images were all correctly tagged to 2.11.8 — it doesn't seem to be an issue with an untagged image upgrading accidentally.

So either I made a mistake (very possible) or the error came from a 2.11.8 image.

It resolved after upgrading to 2.12.2, though I'm not sure whether the fix came from the new version itself or simply from changing versions. Wiping the whole argo namespace and reapplying the 2.11.8 manifests didn't help.

Thank you as ever for engaging and for the phenomenal library.

@alexec
Contributor Author

alexec commented Dec 29, 2020

LEADER_ELECTION_IDENTITY is a v2.12 feature, not a v2.11 feature, so you should not see anything in the logs if you're running v2.11.8.

docker run argoproj/workflow-controller:v2.11.8 version
Unable to find image 'argoproj/workflow-controller:v2.11.8' locally
v2.11.8: Pulling from argoproj/workflow-controller
e54ef591d839: Already exists 
d69ba838510e: Already exists 
Digest: sha256:44d87c9f555fc14ef2433eeda4f29d70eab37b6bda7e019192659c95e5ed0161
Status: Downloaded newer image for argoproj/workflow-controller:v2.11.8
workflow-controller: v2.11.8+b7412aa.dirty
  BuildDate: 2020-12-24T10:16:20Z
  GitCommit: b7412aa1bcff2df20bbe5d515abddb8f33cf4c9e
  GitTreeState: dirty
  GitTag: v2.10.0-rc1
  GoVersion: go1.13.15
  Compiler: gc
  Platform: linux/amd64

It looks like someone (me?) has overwritten the v2.11.8 controller with a test version.

@alexec
Contributor Author

alexec commented Dec 29, 2020

Why oh why does Docker Hub allow you to overwrite images like this? It is impossible to prevent this from ever happening; even updating the build scripts to check that a version does not already exist could not prevent it.

@alexec
Contributor Author

alexec commented Dec 29, 2020

@max-sixty Fixed.

docker run argoproj/workflow-controller:v2.11.8 version
workflow-controller: v2.11.8
  BuildDate: 2020-12-29T20:43:36Z
  GitCommit: 310e099f82520030246a7c9d66f3efaadac9ade2
  GitTreeState: clean
  GitTag: v2.11.8
  GoVersion: go1.13.4
  Compiler: gc
  Platform: linux/amd64

@max-sixty
Contributor

Great — thanks a lot for tracking it down!

@sarabala1979
Member

The K8s 1.19 client has a fix for this issue.

This issue was closed.