v2.10/v2.11/latest(21st Sep): Too many warn & error messages in Argo Workflow Controller (msg="error in entry template execution" error="Deadline exceeded") #4048

Closed
BlackRider97 opened this issue Sep 16, 2020 · 61 comments

@BlackRider97 commented Sep 16, 2020

Summary

Too many warning and error messages inside Argo workflow controllers

Argo workflow controller logs

$ kubectl logs --tail=20 workflow-controller-cb99d68cf-znssr

time="2020-09-16T13:46:45Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionczpht
time="2020-09-16T13:46:45Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestioncbmt6
time="2020-09-16T13:46:45Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestioncbmt6
time="2020-09-16T13:46:46Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionvz4km
time="2020-09-16T13:46:46Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionvz4km
time="2020-09-16T13:46:46Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionhvnhs
time="2020-09-16T13:46:46Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionhvnhs
time="2020-09-16T13:46:46Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionnnsbb
time="2020-09-16T13:46:46Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionnnsbb
time="2020-09-16T13:46:46Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionkc5sb
time="2020-09-16T13:46:46Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionkc5sb
time="2020-09-16T13:46:46Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionc9fcz
time="2020-09-16T13:46:46Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionc9fcz
time="2020-09-16T13:46:46Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionpjczx
time="2020-09-16T13:46:46Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionpjczx
time="2020-09-16T13:46:46Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionftmdh
time="2020-09-16T13:46:46Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionftmdh
time="2020-09-16T13:46:46Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionbfrc5
time="2020-09-16T13:46:46Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionbfrc5

Workflows are getting stuck after some time and not completing even after 12+ hours, while normal execution takes around 1 minute.

I am creating almost 1000 workflows, each containing 4 Pods, in a short span of time. There are enough worker nodes to do the processing, so there are no issues as such on the K8s cluster side.

internal-data-ingestiontj79x error in entry template execution: Deadline exceeded
github.com/argoproj/argo/errors.New
    /go/src/github.com/argoproj/argo/errors/errors.go:49
github.com/argoproj/argo/workflow/controller.init
    /go/src/github.com/argoproj/argo/workflow/controller/operator.go:102
runtime.doInit
    /usr/local/go/src/runtime/proc.go:5222
runtime.doInit
    /usr/local/go/src/runtime/proc.go:5217
runtime.main
    /usr/local/go/src/runtime/proc.go:190
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1357

Diagnostics

What version of Argo Workflows are you running?

Argo v2.10.1


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@sarabala1979 (Member)

The "Deadline exceeded" message is expected for any workflow whose single reconciliation pass takes more than 10 seconds.
The message is printed for each such workflow.
How big is your workflow?
Do you have withParam or valueFrom in your workflow?
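
For context, withParam expands a step into one node per entry of a JSON list produced by an earlier step, so a single workflow can fan out into many nodes that the controller must reconcile. A minimal sketch of that pattern; template and parameter names are illustrative, not taken from the reporter's workflow:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: withparam-fanout-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: generate                # emits a JSON list on stdout
        template: gen-items
    - - name: process                 # expands into one node per list entry
        template: process-item
        arguments:
          parameters:
          - name: item
            value: "{{item}}"
        withParam: "{{steps.generate.outputs.result}}"

  - name: gen-items
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import json
        print(json.dumps([str(i) for i in range(100)]))

  - name: process-item
    inputs:
      parameters:
      - name: item
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo processing {{inputs.parameters.item}}"]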

@alexec (Contributor) commented Sep 16, 2020

Maybe fixed by #3905.

Does the workflow have a DAG with 1000s of items?

@BlackRider97 (Author) commented Sep 17, 2020

@sarabala1979 Yes, we have withParam in our workflows.
I didn't understand the meaning of "10 seconds for one-time processing".
Does that refer to overall workflow execution time, an individual step/pod execution, or something else?

If it is expected, should I ignore these warning & error messages?

@alexec No, the workflow does not have a DAG with 1000s of items; it has a DAG with about 8 items.
And there are 500+ workflows running concurrently.
Do you still suspect it is related to the fix in #3905?

@alexec (Contributor) commented Sep 17, 2020

When the deadline is exceeded, it may be possible for the workflow to get corrupted, so the reconciliation aborts. Reconciliation is allowed 10s and can take longer if, for example:

  • Kubernetes API requests are taking a long time, e.g. due to a lot of work happening.
  • Many Kubernetes API calls are being made, e.g. because the DAG is large.

As you are running many workflows, they are probably flooding the Kubernetes API. Have you tried limiting work in progress, e.g. using parallelism or semaphores?

We are working on various improvements in v2.11 and v2.12 to support large scale.
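
For illustration, a minimal sketch of the throttling options mentioned above: spec.parallelism caps concurrent pods within one workflow, a semaphore (assumes the synchronization feature of newer releases, v3.0+) caps how many such workflows run at once, and a controller-level parallelism setting in the workflow-controller ConfigMap caps concurrently running workflows overall. Names, keys and numbers are placeholders, not recommendations:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: throttled-ingestion-
spec:
  entrypoint: main
  parallelism: 10                     # at most 10 pods of this workflow at once
  synchronization:                    # assumes v3.0+ synchronization support
    semaphore:
      configMapKeyRef:
        name: workflow-semaphores     # placeholder ConfigMap
        key: ingestion                # its value is the limit, e.g. "20"
  templates:
  - name: main
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo hello"]
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  parallelism: "50"                   # max workflows allowed to run at the same time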

@BlackRider97 (Author) commented Sep 18, 2020

@alexec Yes, I can see that Kubernetes API requests are taking a long time during my large workload execution.
Using parallelism solves the problem, but we lose scaling & concurrent execution of workflows even though sufficient worker nodes are available to do the processing.
Should I ignore these warning & error log messages?

I can see that my workflows are getting stuck in the Running state while no Pod is in the Running state.

$ argo list -n argo --running | wc -l
307
$ kubectl get pods -n argo
NAME                                                 READY   STATUS      RESTARTS   AGE
argo-server-57b8549fdb-hptg2                         1/1     Running     1          2d21h
workflow-controller-cb99d68cf-mc4j8                  1/1     Running     0          16h

No workflow Pods
$ kubectl describe workflow internal-data-ingestioncnmdb  

  Phase:               Running
  Started At:          2020-09-18T05:39:14Z
Events:
  Type     Reason            Age   From                 Message
  ----     ------            ----  ----                 -------
  Normal   WorkflowRunning   37m   workflow-controller  Workflow Running
  Warning  WorkflowTimedOut  36m   workflow-controller  Workflow timed out

What kind of improvements are you trying to make to support large scale?

@alexec (Contributor) commented Sep 18, 2020

I would upgrade to v2.11.0 and try with that version. I'm not sure we will have fully fixed this problem until v2.12, mind you.

@BlackRider97 (Author)

@alexec Even after upgrading to v2.11.0, I can still see these warnings and errors:

time="2020-09-21T07:09:52Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestion5z894
time="2020-09-21T07:10:03Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestion9p7qr
time="2020-09-21T07:10:03Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionbrpb8
time="2020-09-21T07:10:03Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionbrpb8
time="2020-09-21T07:10:03Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionckccs
time="2020-09-21T07:10:03Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionckccs
time="2020-09-21T07:10:04Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestiongcv4c
time="2020-09-21T07:10:04Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestiongcv4c
time="2020-09-21T07:10:04Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestionwcx7n
time="2020-09-21T07:10:04Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestionwcx7n
time="2020-09-21T07:10:14Z" level=warning msg="Deadline exceeded" namespace=argo workflow=internal-data-ingestiongcv4c
time="2020-09-21T07:10:14Z" level=error msg="error in entry template execution" error="Deadline exceeded" namespace=argo workflow=internal-data-ingestiongcv4c

This is when I started 100 concurrent workflows with each workflow having about 6-7 Pods.

@BlackRider97 (Author) commented Sep 21, 2020

This workflow is interesting. I got 3 such failed workflows in a setup of 200 concurrently running workflows.

Interesting failed workflow_2

get-data-and-further is a DAG with 3 children and one pod per DAG branch.
All 3 Pods succeeded, but the overall workflow is Failed.

Attached all logs related to this workflow for further debugging.
failed-ingestionrkc5f.log

Highlights of logs

time="2020-09-21T09:26:18Z" level=info msg="Step group node internal-data-ingestionrkc5f-2608704937 deemed failed: child 'internal-data-ingestionrkc5f-3985201543' failed" namespace=argo workflow=internal-data-ingestionrkc5f
time="2020-09-21T09:26:18Z" level=info msg="node internal-data-ingestionrkc5f-2608704937 phase Running -> Failed" namespace=argo workflow=internal-data-ingestionrkc5f
time="2020-09-21T09:26:18Z" level=info msg="node internal-data-ingestionrkc5f-2608704937 message: child 'internal-data-ingestionrkc5f-3985201543' failed" namespace=argo workflow=internal-data-ingestionrkc5f
time="2020-09-21T09:26:18Z" level=info msg="node internal-data-ingestionrkc5f-2608704937 finished: 2020-09-21 09:26:18.805091638 +0000 UTC" namespace=argo workflow=internal-data-ingestionrkc5f

@alexec Can you tell me the meaning of "Step group node internal-data-ingestionrkc5f-2608704937 deemed failed: child 'internal-data-ingestionrkc5f-3985201543' failed"? The Pod succeeded, but the workflow was marked Failed due to some other error in the Argo workflow execution.

Looking at the code at https://github.com/argoproj/argo/blob/5be254425e3bb98850b31a2ae59f66953468d890/workflow/controller/steps.go#L279
it seems like Argo was checking all children of a DAG for completion and got an error for one of the children.

@BlackRider97 (Author) commented Sep 21, 2020

What I understood from my observations is this:
Due to a lot of concurrent workflows, there is a sudden increase in the number of Pods. That results in K8s API server timeouts (or API failures), and because of that the Argo workflow controller gets into an inconsistent state.

Some workflows are stuck forever in the Pending state because the controller is not picking them up at all. Some are failing randomly because the Argo workflow controller is not able to commit workflow state, even though the DAG task or sequential step succeeded. The workflow controller is taking more time to process workflows due to slowness in the K8s APIs.

Slowness is expected and that's fine, but the inconsistent state of workflows worries me a lot.
Argo should handle such network failures better and keep a healthy state machine for each workflow.

@sarabala1979 Any help here?

@alexec alexec added this to the v2.10 milestone Sep 21, 2020
@alexec (Contributor) commented Sep 21, 2020

Can anyone please provide some executable reproduction steps? I.e. can you provide a workflow that reproduces this behaviour? It is extremely time-consuming (and often impractical) to fix bugs when this has not been provided.

@alexec alexec added P1 labels Sep 21, 2020
@alexec (Contributor) commented Sep 21, 2020

@AbhishekMallick commented Sep 21, 2020

@alexec: Workflow
coin_flip.yml

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: dag-diamon-coinflip-
spec:
  podGC:
    strategy: OnWorkflowSuccess
  entrypoint: diamond
  templates:
  - name: diamond
    dag:
      tasks:
        - name: A
          template: coinflip
        - name: B
          dependencies: [A]
          template: coinflip
        - name: C
          dependencies: [A]
          template: coinflip
        - name: D
          dependencies: [B, C]
          template: coinflip

  - name: coinflip
    steps:
      - - name: flip-coin
          template: flip-coin
      - - name: heads
          template: found-heads
          when: "{{steps.flip-coin.outputs.result}} == heads"
        - name: tails
          template: coinflip
          when: "{{steps.flip-coin.outputs.result}} == tails"

  - name: flip-coin
    script:
      image: python:alpine3.6
      command: [python]
      source: |
        import random
        import time
        time.sleep(60)
        result = "heads" if random.randint(0,1) == 0 else "tails"
        print(result)

  - name: found-heads
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo \"it was heads\""]

Submit 1000 of them at once using load_argo.sh:

./load_argo.sh coin_flip.yml 1000

load_argo.sh

#!/bin/bash
# Usage: ./load_argo.sh <workflow.yml> <count>
# Submits the given workflow <count> times to the argo namespace.
for ((i=1; i<=$2; i++));
do
        argo submit -n argo "$1"
done

Using an Azure Kubernetes cluster.

@alexec (Contributor) commented Sep 21, 2020

Do you have time for a Zoom?

@alexec alexec self-assigned this Sep 21, 2020
@AbhishekMallick

Hey @alexec - sure. Let me know a time that works for you.

@AbhishekMallick

Hi @alexec - I am in the IST time zone. I was also wondering how we will connect. You most likely can't see my email address, and I can't see yours at least. Is there some way I can message you?

@alexec alexec added P2 and removed P1 labels Sep 22, 2020
@alexec (Contributor) commented Sep 22, 2020

I'm in PST, so we do not have overlapping working hours with IST AFAIK - so no Zoom.

@BlackRider97 (Author)

We'll adjust the timing at our end. Please let us know your availability or email address for a Zoom call, and I'll send you an invite. It will be very helpful if we are able to nail down the exact issue.

@alexec (Contributor) commented Sep 23, 2020

@BlackRider97 which TZ are you in, please? Could you do between 4:30pm and 6pm PST today, or between 9am and 10am tomorrow (Thursday)?

@BlackRider97 (Author)

@alexec @AbhishekMallick I am in the IST (GMT+5:30) time zone.
Let's do it on Thursday, 24 September 2020, 9 AM PST (Thursday, 24 September 2020, 9:30 PM IST).

@alexec Please let me know how I can send you a Zoom call invite.

@alexec (Contributor) commented Sep 24, 2020

I'm busy at 9am, but free at 9:30am. Find me here: https://intuit.zoom.us/j/9471219298?pwd=L3dwWklkclZNUk1VYkJBOTE0SGREdz09

@alexec alexec changed the title Too many warn & error messages in Argo Workflow Controller (msg="error in entry template execution" error="Deadline exceeded") v2.10: Too many warn & error messages in Argo Workflow Controller (msg="error in entry template execution" error="Deadline exceeded") Sep 24, 2020
@jayz28 commented Oct 29, 2020

I think you'll find if you restart the controller, your zombie workflows will come alive again.

Looks like the :no-sig image has resolved the problem, we are no longer seeing the error and all the workflows are now progressing. Curious what was specific issue? Hopefully the fix will be merged soon.

@no-response bot commented Oct 29, 2020

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.

@alexec (Contributor) commented Oct 29, 2020

> Looks like the :no-sig image has resolved the problem, we are no longer seeing the error and all the workflows are now progressing. Curious what was specific issue? Hopefully the fix will be merged soon.

Can you confirm whether this resolved straight away, or only after 30m?

@jayz28 commented Oct 29, 2020

> Looks like the :no-sig image has resolved the problem, we are no longer seeing the error and all the workflows are now progressing. Curious what was specific issue? Hopefully the fix will be merged soon.

> Can you confirm whether this resolved straight away, or only after 30m?

It appears to have resolved right away.

@alexec (Contributor) commented Oct 29, 2020

Can you please try :latest?

@jayz28 commented Oct 30, 2020

So far so good using the :latest image.

@alexec (Contributor) commented Oct 31, 2020

Is it resolved straight away?

@jayz28 commented Nov 1, 2020

> Is it resolved straight away?

Yes, it started to resolve very quickly after applying the changes. Curious what the cause of the issue was?

@alexec (Contributor) commented Nov 1, 2020

Can I confirm that they were resolved using :latest or using :no-sig?

@jayz28 commented Nov 1, 2020

> Can I confirm that they were resolved using :latest or using :no-sig?

Initially trying :no-sig resolved our stuck workflows. We have since been load testing using :latest and have not encountered the issue, running similar loads.

@alexec (Contributor) commented Nov 1, 2020

Ok. So we can shelve :no-sig.

@alexec alexec linked a pull request Nov 1, 2020 that will close this issue
@ebr commented Nov 18, 2020

The issue description and comments sound exactly like what we're experiencing on 2.11.6: "stuck" workflow steps without running pods. @alexec, we briefly chatted in Slack and you suggested upgrading to 2.11.7 - would :latest be a better idea instead?

@alexec (Contributor) commented Nov 18, 2020

@ebr try them both.

@ebr commented Nov 18, 2020

Still an issue with 2.11.7 (I can share some screenshots privately or over Zoom). I also raised --qps, --burst and --pod-workers to 10x-20x above the defaults.

:latest is 2.12.x - will trying it prevent me from reverting to 2.11.7? (We're offloading to Postgres.)
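
For reference, these controller flags are passed as container args on the workflow-controller Deployment; a minimal sketch of one way to set them (the image tag, values, and the additional --workflow-workers flag are illustrative assumptions, not tuning advice):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: workflow-controller
  namespace: argo
spec:
  selector:
    matchLabels:
      app: workflow-controller
  template:
    metadata:
      labels:
        app: workflow-controller
    spec:
      serviceAccountName: argo
      containers:
      - name: workflow-controller
        image: argoproj/workflow-controller:v2.11.7
        command: [workflow-controller]
        args:
        - --configmap=workflow-controller-configmap
        - --qps=100                   # client-side K8s API rate limit
        - --burst=150                 # short-term burst above --qps
        - --pod-workers=64            # concurrent pod reconciliation workers
        - --workflow-workers=64       # concurrent workflow reconciliation workers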

@alexec (Contributor) commented Nov 19, 2020

You should be able to revert, but I'd recommend you try in a test env.

@ebr commented Nov 19, 2020

We are not lucky enough to have a test env :) We're in the middle of a massive production run right now, so we're doing it live.
I went for 2.12.0-rc2, and as soon as I upgraded the controller, a lot of the step nodes got "unstuck" and shot for the moon, creating almost all of the pods, but some kind of upper bound still exists. E.g., deploying a workflow with a withItems step that is supposed to create ~8000 iterations of the item, it only gets to item ~(3600:3600) and stalls, like before. In 2.11.7, this upper bound was around 2100 items, but it also seems to depend on the total number of pods the workflow controller is currently handling. I can't find a concrete pattern to it, but the symptoms are reproducible. So 2.12 definitely helps, but not entirely. We are currently working around this by breaking our workflows into chunks that will not create more than 2000 iterations of any step.
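
For context, a withItems step expands into one node (and one pod) per listed item, so a single step can account for thousands of nodes in one workflow. A minimal sketch of the shape described above, with a placeholder list standing in for the ~8000-entry one:

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: withitems-fanout-
spec:
  entrypoint: main
  templates:
  - name: main
    steps:
    - - name: process
        template: process-item
        arguments:
          parameters:
          - name: item
            value: "{{item}}"
        withItems: ["0", "1", "2"]    # the real workflow lists ~8000 items here
  - name: process-item
    inputs:
      parameters:
      - name: item
    container:
      image: alpine:3.6
      command: [sh, -c]
      args: ["echo {{inputs.parameters.item}}"]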

@alexec (Contributor) commented Nov 19, 2020

Does it come unstuck after 20m (at which time there is a resync)?

@ebr commented Nov 19, 2020

Unfortunately, no. E.g. we have a couple of wfs right now that have been "stuck" this way with 0 pods (because they have been GC'd), but no new pods are being created for those wfs. Other wfs (like the chunked-up ones I described above) are happily creating pods, though.

@alexec (Contributor) commented Nov 19, 2020

Can we Zoom? I'd like to investigate online.

@ebr commented Nov 19, 2020

That would be fantastic. Could we organize it for tomorrow, as it's getting late here (EST)? I'll make sure to keep the stuck workflows around (though it's reproducible, so that's not an issue). I can message you on Slack to organize, if that's ok.

@alexec (Contributor) commented Nov 19, 2020

I'm in PST.

@jkgeyti commented Nov 19, 2020

We're seeing the exact same issues in 2.11 as well when submitting hundreds of workflows. If possible, I'd be delighted to join the call.

@alexec (Contributor) commented Nov 19, 2020

Ping me on the Slack channel.

@alexec (Contributor) commented Nov 19, 2020

Closing as we have #4560 now.

@alexec alexec closed this as completed Nov 23, 2020