Some branches in DAG stop making progress temporarily even though previous steps succeeded #3197
Comments
@simster7 Thanks for looking into this. Have you been able to reproduce the issue on your end?
Sorry, haven't had a chance to look into this yet. However, please note that
Thanks for the suggestion! I've tried
Okay, thanks. I'll look into this soon.
@saitoy1 I see that you linked #3176 in your issue, and it seems like this is the same problem. Were you able to read my response here: #3176 (comment)? In particular, my advice about creating new Workflows for nesting and the example I linked? If so, why does that approach not work?
Let me see if I can clarify. I linked this issue to #3176 because it was convenient to tailor it into simplified repro steps for my production workflow. I read your response back then but have not tried your advice yet, as I am unsure that the underlying issue is the same.

For one, I actually have an on-prem k8s cluster that runs my workflow (not the simplified one above, but the one used in production) without the issue described here. That workflow also has a branching factor of 300 from the root step, and its successful intermediate steps are always followed by pending or running steps; the workflow only gets "stuck" sometimes when the maximum number of pods in the entire cluster has been reached, and execution resumes once a node is freed up to schedule the next pending pod. In my mind, getting stuck because the maximum number of pods has been reached is expected.

For two, the workflow in #3176 gets completely stuck, if I read correctly. My workflow (as well as the simplified one above) does not get stuck for good and eventually finishes running. There seems to be a subtle difference.

For three, I would like to better understand why the issue occurs in some k8s clusters (in my case, an EKS cluster in AWS) but not in others. There are a lot of factors involved, but I feel it is critical to understand what makes Argo behave this way when it runs into the issue in certain k8s clusters.

FYI, I already have a resource construct in my production workflow creating a k8s Service within each DAG branch like so
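(The original snippet is not shown above; for illustration only, a minimal sketch of a resource template that creates a k8s Service inside a DAG branch. The template name, Service name, selector, and ports are placeholders, not taken from the production workflow.)

```yaml
# Illustrative sketch only: a resource template that creates a k8s Service
# inside a DAG branch. Template and Service names are hypothetical.
- name: create-branch-service
  resource:
    action: create
    manifest: |
      apiVersion: v1
      kind: Service
      metadata:
        name: branch-svc-{{workflow.uid}}
      spec:
        selector:
          app: branch-worker
        ports:
          - port: 80
            targetPort: 8080
```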
If I were to replace each DAG branch with a nested workflow using a resource construct, then I'd have nested resource constructs, which seems to make the workflow YAML more complicated than it should be.
@simster7 Following up on whether or not v2.9.2 may fix this.
Hi @saitoy1, sorry for the delay. I won't be able to speak much about running the Workflow on AWS/EKS/EC2 as I don't have much experience running them in those environments.
Correct, this is an important distinction. We recently optimized some of the DAG logic in 2.9.2: #3418. Could you give it a try to see if it alleviates your issue?
Thank you @simster7 @alexec for looking further into this! I have tried it. One thing I discovered, though, was that when I specified the
No worries. That is why I wanted to come up with simplified repro steps that seemingly exhibit the same issue without the cloud. I appreciate your time and effort in looking into this rather mysterious issue.
Interesting, this could be helpful information. How does the runtime change with other values of
After more investigation, I'm hard pressed to find why setting a limit would decrease your execution time; if anything, it should increase it. Since this is not what's happening, and since this seems to be environment dependent, I'm thinking that this might be related to an issue with K8s scheduling pods in your environment. It seems that the environment doesn't handle spikes in pod creation well.
Thank you for sharing your insights. The fact that setting a limit should increase the execution time makes sense to me. It's curious to observe that setting a limit decreases the execution time in my case. I've tried a couple of values for
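(The exact field being tuned is not shown above; for illustration, assuming it is the workflow-level `parallelism` limit, which is one way to cap concurrent pods in Argo, such a limit would be set roughly like this. The value and fan-out below are placeholders, not values from this thread.)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-limit-
spec:
  entrypoint: main
  # Workflow-level cap on the number of concurrently running pods.
  # The value 50 is arbitrary; the values actually tried are not shown in this thread.
  parallelism: 50
  templates:
    - name: main
      dag:
        tasks:
          - name: fan-out
            template: sleep
            withSequence:
              count: "300"   # 300 parallel tasks, throttled by the limit above
    - name: sleep
      container:
        image: alpine:3.12
        command: [sh, -c, "sleep 5"]
```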
This could be useful. One thing I could try on my end is to explicitly specify pods' resources, e.g.
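(For illustration, a sketch of what explicitly setting container resources on a workflow template might look like; the template name and request/limit values are placeholders, not values from this thread.)

```yaml
# Hypothetical template with explicit resource requests/limits,
# so the scheduler can place pods more predictably.
- name: my-step
  container:
    image: alpine:3.12
    command: [sh, -c, "sleep 5"]
    resources:
      requests:
        cpu: "500m"
        memory: 256Mi
      limits:
        cpu: "1"
        memory: 512Mi
```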
It would be the
Will close as this does not seem to be an issue with Argo, but feel free to continue the discussion.
Checklist:
What happened:
I have a DAG workflow whose topology is a balanced tree of height 6, with a branching factor of 300 from the root node. Once I have kicked off a workflow, I see in the Argo Server UI that some steps stop progressing to their next steps even though they were marked successful. When that happens, k8s does not seem to schedule pods for the next steps onto cluster nodes.
Once other branches have finished running until their terminal steps, the blocked steps in question become unblocked and the workflow eventually runs successfully. So I would categorize this as a performance issue as opposed to a functional one; blocked intermediate steps add to the total execution time of a workflow, and I would like to minimize that as much as possible.
This issue is more pronounced in Amazon EKS, where I ultimately need to run the workflow. The EKS cluster consists of five EC2 instances whose instance type is `r5.8xlarge`.

What you expected to happen:
I expect intermediate completed steps (checked in green) to always be followed by next steps, e.g. PodInitializing (yellow circles in the Argo Server UI) or Running (blue circles). Intermediate green circles should not be left hanging without making progress for some time.
I further expect reasonable parallelism across all branches in the DAG, i.e., the DAG should progress in a breadth-first manner rather than in a depth-first manner for only some branches.
How to reproduce it (as minimally and precisely as possible):
A workflow template YAML
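(The original manifest is not shown here; as an illustrative sketch only, a branching workflow of this shape might look like the following. Only the `number-of-branches` parameter name comes from the description below; template names, images, and sleep durations are placeholders.)

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: branching-dag-
spec:
  entrypoint: main
  arguments:
    parameters:
      - name: number-of-branches
        value: "300"
  templates:
    - name: main
      dag:
        tasks:
          # Fan out into N independent branches from the root step.
          - name: branch
            template: chain
            withSequence:
              count: "{{workflow.parameters.number-of-branches}}"
    # Each branch is a short linear chain; a successful step here should
    # immediately be followed by the next step in the chain.
    - name: chain
      dag:
        tasks:
          - name: step-1
            template: sleep
          - name: step-2
            template: sleep
            dependencies: [step-1]
          - name: step-3
            template: sleep
            dependencies: [step-2]
    - name: sleep
      container:
        image: alpine:3.12
        command: [sh, -c, "sleep 5"]
```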
You can run this workflow in an on-premise k8s cluster. I borrowed this from #3176 and modified it to be a branching workflow. It happens to be a simplified reproduction of what I need to accomplish. You may need to increase `number-of-branches` depending on the spec of each worker node in the cluster to consistently reproduce the issue.

Anything else we need to know?:
My on-premise k8s cluster consists of five worker nodes and the spec of each worker looks like
Environment:
Other debugging information (if applicable):
executor logs:
No executor logs, as pods eventually ran successfully.
workflow-controller logs:
I noticed that the volume of the following warning messages was high when the issue occurred.
Message from the maintainers:
If you are impacted by this bug please add a 👍 reaction to this issue! We often sort issues this way to know what to prioritize.