Some branches in DAG stop making progress temporarily even though previous steps succeeded #3197

Closed
saitoy1 opened this issue Jun 8, 2020 · 14 comments

@saitoy1

saitoy1 commented Jun 8, 2020

Checklist:

  • I've included the version.
  • I've included reproduction steps.
  • I've included the workflow YAML.
  • I've included the logs.

What happened:
I have a DAG workflow whose topology is a balanced tree of height 6 with a branching factor of 300 at the root node. Once I kick off the workflow, I see in the Argo Server UI that some steps stop making progress to their next steps even though they are marked successful. When that happens, k8s does not seem to schedule the pods for those next steps onto cluster nodes.

(Example screenshot: completed green steps in the Argo Server UI whose follow-up steps have not been scheduled.)

Once other branches have run through to their terminal steps, the blocked steps in question become unblocked and the workflow eventually completes successfully. So I would categorize this as a performance issue rather than a functional one; blocked intermediate steps add to the total execution time of a workflow, and I would like to minimize that as much as possible.

This issue is more pronounced in Amazon EKS, where I ultimately need to run the workflow. The EKS cluster consists of five EC2 instances of type r5.8xlarge.

What you expected to happen:
I expect completed intermediate steps (green check marks) always to be followed by their next steps, e.g. PodInitializing (yellow circles in the Argo Server UI) or Running (blue circles). Intermediate green circles should not be left hanging, making no progress, for extended periods.

I further expect reasonable parallelism across all branches of the DAG, i.e. the DAG should progress in a breadth-first manner rather than proceeding depth-first for only some branches.

How to reproduce it (as minimally and precisely as possible):

A workflow template YAML

You can run this workflow in an on-premise k8s cluster. I borrowed it from #3176 and modified it into a branching workflow; it is a simplified reproduction of what I need to accomplish. You may need to increase number-of-branches, depending on the spec of each worker node in the cluster, to reproduce the issue consistently.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: branching-test-
spec:
  entrypoint: start
  templates:
    - name: start
      inputs:
        parameters:
          - name: number-of-branches
            value: 300
      dag:
        tasks:
          - name: generate
            template: gen-number-list
            arguments:
              parameters:
                - name: number-of-branches
                  value: "{{inputs.parameters.number-of-branches}}"
          - name: start-step
            template: step
            dependencies: ["generate"]
            arguments:
              parameters:
                - name: counter
                  value: "{{item}}"
            withParam: "{{tasks.generate.outputs.result}}"

    - name: gen-number-list
      inputs:
        parameters:
          - name: number-of-branches
      script:
        image: python:alpine3.6
        command: [python]
        source: |
          import json
          import sys
          json.dump([i for i in range(0, {{inputs.parameters.number-of-branches}})], sys.stdout)

    - name: step
      inputs:
        parameters:
          - name: counter
      steps:
        - - name: iterate-counter
            template: incrementer
            arguments:
              parameters:
                - name: counter
                  value: "{{inputs.parameters.counter}}"
        - - name: execute-alpha
            template: exec
            arguments:
              parameters:
                - name: current
                  value: "{{steps.iterate-counter.outputs.result}}"
        - - name: execute-beta
            template: exec
            arguments:
              parameters:
                - name: current
                  value: "{{steps.iterate-counter.outputs.result}}"
        - - name: execute-gamma
            template: exec
            arguments:
              parameters:
                - name: current
                  value: "{{steps.iterate-counter.outputs.result}}"
    - name: incrementer
      inputs:
        parameters:
          - name: counter
      script:
        image: python:alpine3.6
        command: [python]
        source: |
          print({{inputs.parameters.counter}} + 1)
    - name: exec
      inputs:
        parameters:
          - name: current
      retryStrategy:
        limit: 10
        retryPolicy: "Always"
        backoff:
          duration: "1m"
          factor: 2
          maxDuration: "1h"
      script:
        image: python:alpine3.6
        command: [python]
        source: |
          print(int("{{inputs.parameters.current}}"))

Anything else we need to know?:
My on-premise k8s cluster consists of five worker nodes and the spec of each worker looks like

Capacity:
  cpu:                36
  ephemeral-storage:  921210636Ki
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             262558980Ki
  pods:               110
Allocatable:
  cpu:                36
  ephemeral-storage:  848987720732
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             262456580Ki
  pods:               110

Environment:

  • Argo version:
$ argo version
argo: v2.8.0
  BuildDate: 2020-05-11T22:55:16Z
  GitCommit: 8f696174746ed01b9bf1941ad03da62d312df641
  GitTreeState: clean
  GitTag: v2.8.0
  GoVersion: go1.13.4
  Compiler: gc
  Platform: linux/amd64
  • Kubernetes version :
$ kubectl version -o yaml
clientVersion:
  buildDate: "2020-02-11T18:14:22Z"
  compiler: gc
  gitCommit: 06ad960bfd03b39c8310aaf92d1e7c12ce618213
  gitTreeState: clean
  gitVersion: v1.17.3
  goVersion: go1.13.6
  major: "1"
  minor: "17"
  platform: linux/amd64
serverVersion:
  buildDate: "2020-01-18T23:22:30Z"
  compiler: gc
  gitCommit: 59603c6e503c87169aea6106f57b9f242f64df89
  gitTreeState: clean
  gitVersion: v1.17.2
  goVersion: go1.13.5
  major: "1"
  minor: "17"
  platform: linux/amd64

Other debugging information (if applicable):

  • workflow result:
$ argo --loglevel DEBUG get branching-test-qtpw7
DEBU[0000] CLI version                                   version="{v2.8.0 2020-05-11T22:55:16Z 8f696174746ed01b9bf1941ad03da62d312df641 v2.8.0 clean go1.13.4 gc linux/amd64}"
DEBU[0000] Client options                                opts="{{ false false} 0x15746d0 0xc0004a5540}"
Name:                branching-test-qtpw7
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Mon Jun 08 16:51:57 +0000 (16 minutes ago)
Started:             Mon Jun 08 16:51:57 +0000 (16 minutes ago)
Finished:            Mon Jun 08 17:01:00 +0000 (7 minutes ago)
Duration:            9 minutes 3 seconds
  • executor logs:
No executor logs, as the pods eventually ran successfully.

  • workflow-controller logs:
    I noticed that the volume of the following warning messages was high when the issue occurred.

$ kubectl logs -n argo $(kubectl get pods -l app=workflow-controller -n argo -o name)
time="2020-04-01T18:29:06Z" level=warning msg="Deadline exceeded" namespace=default workflow=hf-regression-tests-qtpw7
time="2020-04-01T18:29:06Z" level=warning msg="Deadline exceeded" namespace=default workflow=hf-regression-tests-qtpw7
time="2020-04-01T18:29:06Z" level=warning msg="Deadline exceeded" namespace=default workflow=hf-regression-tests-qtpw7
time="2020-04-01T18:29:06Z" level=warning msg="Deadline exceeded" namespace=default workflow=hf-regression-tests-qtpw7
time="2020-04-01T18:29:06Z" level=warning msg="Deadline exceeded" namespace=default workflow=hf-regression-tests-qtpw7
time="2020-04-01T18:29:06Z" level=warning msg="Deadline exceeded" namespace=default workflow=hf-regression-tests-qtpw7
time="2020-04-01T18:29:06Z" level=warning msg="Deadline exceeded" namespace=default workflow=hf-regression-tests-qtpw7
time="2020-04-01T18:29:06Z" level=warning msg="Deadline exceeded" namespace=default workflow=hf-regression-tests-qtpw7
time="2020-04-01T18:29:06Z" level=warning msg="Deadline exceeded" namespace=default workflow=hf-regression-tests-qtpw7
...

Message from the maintainers:

If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.

@simster7 simster7 self-assigned this Jun 8, 2020
@saitoy1
Author

saitoy1 commented Jun 21, 2020

@simster7 Thanks for looking into this. Have you been able to reproduce the issue on your end?

@simster7
Member

Sorry, haven't had a chance to look into this yet. However, v2.9.0-rc2 refactored this logic to avoid issues similar to this one. Could you please test this with that version and let me know if the issue persists? Or if you run into new issues?

Please note that v2.9.0-rc2 is a pre-release and is not certified for use in production yet. Test this on a pre-prod environment.

@saitoy1
Author

saitoy1 commented Jun 22, 2020

Thanks for the suggestion! I've tried v2.9.0-rc2 and confirmed that the issue persists. It'd be great to identify the root cause of this behavior as the execution time could be cut down by half if I did not run into this issue.

@simster7
Member

Okay, thanks. I'll look into this soon.

@simster7
Member

@saitoy1 I see that you linked #3176 in your issue, and it seems like this is the same problem. Were you able to read my response here: #3176 (comment)? In particular my advice about creating new Workflows for nesting and the example I linked? If so, why does that approach not work?

@saitoy1
Author

saitoy1 commented Jun 24, 2020

Let me see if I can clarify. I linked #3176 because it was a convenient starting point to tailor into simplified reproduction steps for my production workflow. I read your response back then but have not tried your advice yet, as I am unsure that the underlying issue is the same.

For one, I have an on-prem k8s cluster that runs my workflow (not the simplified one above but the one used in production) without the issue described here. That workflow also has a branching factor of 300 from the root step, and its successful intermediate steps are always followed by pending or running steps; the workflow only gets "stuck" occasionally when the maximum number of pods in the entire cluster has been reached, and execution resumes once a node is freed up to schedule the next pending pod. In my mind, getting stuck because the pod limit has been reached is expected behavior.
For business reasons, I needed to move my workflow to AWS, so I set up an EKS cluster with the same number of nodes as the on-prem k8s cluster. I ensured that the EC2 instance types roughly matched the on-prem machine specs and that there were enough IP addresses available for pod assignment. I then ran into this issue: successful intermediate steps got stuck and the total execution time roughly doubled.

For two, the workflow for #3176 gets completely stuck, if I read correctly. My workflow (as well as the above simplified one) does not get stuck for good and eventually finishes running. There seems to be a subtle difference.

For three, I would like to better understand why the issue occurs in some k8s clusters (in my case, an EKS cluster in AWS) but not in others. There are a lot of factors involved, but I feel it is critical to understand what makes Argo behave the way it does when it runs into this issue in certain clusters.

FYI, I already have a resource construct in my production workflow creating a k8s Service within each DAG branch like so

  - name: service
    resource:
      action: create
      manifest: |
        apiVersion: v1
        kind: Service
        ...

If I were to replace each DAG branch with a nested workflow created via a resource construct, then I'd end up with nested resource constructs like this:

  - name: create-wf
    resource:
      action: create
      manifest: |
        apiVersion: argoproj.io/v1alpha1
        kind: Workflow
        metadata:
          generateName: service
        spec:
          entrypoint: create-service
          templates:
          - name: create-service
            resource:
              action: create
              manifest: |
                apiVersion: v1
                kind: Service
                ...

which seems to make the workflow YAML more complicated than it should be.

@alexec
Contributor

alexec commented Jul 10, 2020

@simster7 to follow up on whether or not v2.9.2 may fix this.

@simster7
Member

Hi @saitoy1, sorry for the delay. I won't be able to speak much about running the Workflow on AWS/EKS/EC2 as I don't have much experience running them in those environments.

For two, the workflow for #3176 gets completely stuck, if I read correctly. My workflow (as well as the above simplified one) does not get stuck for good and eventually finishes running. There seems to be a subtle difference.

Correct, this is an important distinction.

We recently optimized some of the DAG logic in 2.9.2: #3418. Could you give it a try to see if it alleviates your issue?

@saitoy1
Author

saitoy1 commented Jul 15, 2020

Thank you @simster7 @alexec for further looking into this! I have tried v2.9.2 but still ran into the same issue.

One thing I discovered, though, was that the workflow ran faster when I specified the parallelism parameter. Specifically, with parallelism set to 200 the workflow ran in 16 minutes 19 seconds, whereas without an explicit parallelism value it normally takes around 20 minutes. I am wondering whether it works better when I tell Argo up front the maximum degree of parallelism it should allow, rather than letting Argo hit the limit of parallelism by itself.
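(For reference, not part of the original comment: the parallelism value discussed here is the Workflow-level spec.parallelism field, which caps how many pods of the workflow run concurrently. A minimal sketch of where it is set, using a hypothetical fan-out rather than the production workflow:)

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-demo-
spec:
  entrypoint: main
  parallelism: 200          # cap on concurrently running pods for this workflow
  templates:
    - name: main
      dag:
        tasks:
          - name: fan-out
            template: echo
            arguments:
              parameters:
                - name: i
                  value: "{{item}}"
            withSequence:
              count: "300"  # fan-out width comparable to the repro above
    - name: echo
      inputs:
        parameters:
          - name: i
      script:
        image: python:alpine3.6
        command: [python]
        source: |
          print({{inputs.parameters.i}})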

I won't be able to speak much about running the Workflow on AWS/EKS/EC2 as I don't have much experience running them in those environments.

No worries, which is why I wanted to come up with simplified repro steps that seemingly exhibit the same issue without the cloud. Appreciate your time & effort for looking into this rather mysterious issue.

@simster7
Member

One thing I discovered, though, was that the workflow ran faster when I specified the parallelism parameter. Specifically, with parallelism set to 200 the workflow ran in 16 minutes 19 seconds, whereas without an explicit parallelism value it normally takes around 20 minutes. I am wondering whether it works better when I tell Argo up front the maximum degree of parallelism it should allow, rather than letting Argo hit the limit of parallelism by itself.

Interesting, this could be helpful information. How does the runtime change with other values of parallelism? I would imagine a "U" shape: longer run times at very low and very high values, with a "sweet spot" at some middle value where the runtime is lowest. Can you verify this? I'll investigate and see if anything stands out to me regarding parallelism.

@simster7
Member

After more investigation, I'm hard pressed to find why setting a parallelism limit would actually decrease execution time from the sole perspective of Argo. The same operations that happen without a limit also happen with a limit, just more drawn out, so, all other things being equal, setting a limit should increase the execution time.

Since this is not what's happening, and since this seems to be environment dependent, I'm thinking that this might be related to an issue with K8s scheduling pods in your environment. It seems that the environment doesn't handle spikes in pod creation well.

@saitoy1
Author

saitoy1 commented Jul 18, 2020

Thank you for sharing your insights. The reasoning that setting a limit should increase the execution time makes sense to me, so it's curious to observe that setting a limit decreases the execution time in my case. I've tried a few values for parallelism and compared the total execution time of my production workflow (not the simplified repro steps in this issue). I was not sure whether there was any correlation between parallelism and the total execution time, but decided to pick 200 as the current best based on some heuristics.

Parallelism    Execution time (mm:ss)
100            19:35
150            21:05
200            16:19
250            17:32

Since this is not what's happening, and since this seems to be environment dependent, I'm thinking that this might be related to an issue with K8s scheduling pods in your environment

This could be useful. One thing I could try on my end is to explicitly specify the pods' resources, e.g. limits and requests. Other than that, can you think of anything that would help K8s schedule pods better, especially in the context of Argo?
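(For reference, not part of the original comment: requests and limits would go on the script container of each template. A minimal sketch applied to a trimmed version of the exec template from the repro above; the values themselves are illustrative assumptions, not tested recommendations:)

    - name: exec
      inputs:
        parameters:
          - name: current
      script:
        image: python:alpine3.6
        command: [python]
        resources:            # per-pod scheduling hints for the kube-scheduler
          requests:
            cpu: 100m         # illustrative values; tune for the actual cluster
            memory: 64Mi
          limits:
            cpu: 500m
            memory: 128Mi
        source: |
          print(int("{{inputs.parameters.current}}"))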

@simster7
Member

Other than that, can you think of anything that helps K8s better perform pods scheduling, especially in the context of Argo?

It would be the parallelism flag. That's its main purpose :)

@simster7
Member

Will close as this does not seem to be an issue with Argo, but feel free to continue the discussion.
