3.5 Server OOM when too many executions of CronWorkflows #12645

Closed
3 of 4 tasks
panicboat opened this issue Feb 9, 2024 · 5 comments · Fixed by #12681

Labels
area/cron-workflows, area/ui, good first issue, P2, solution/suggested, type/bug, type/regression

Comments

@panicboat
Contributor

panicboat commented Feb 9, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

This happens when opening the details view of a CronWorkflow or WorkflowTemplate in the UI when it has a long execution history.
The argo-server logs show nothing relevant.
image

Version

v3.5.4

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any workflow is acceptable.

Logs from the workflow controller

workflow-controller does not seem to be involved.

Found 2 pods, using pod/workflow-controller-7bbd96c558-qss82
time="2024-02-09T09:40:00.290Z" level=info msg="Processing workflow" Phase= ResourceVersion=646930323 namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:00.691Z" level=info msg="Task-result reconciliation" namespace=develop-workflow numObjs=0 workflow=**********-1707471600
time="2024-02-09T09:40:00.790Z" level=info msg="Updated phase  -> Running" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:01.690Z" level=info msg="Created PDB resource for workflow." namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:01.690Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:01.691Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:02.290Z" level=info msg="Pod node **********-1707471600 initialized Pending" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:02.392Z" level=info msg="Created pod: **********-1707471600 (**********-1707471600)" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:02.392Z" level=info msg="TaskSet Reconciliation" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:02.392Z" level=info msg=reconcileAgentPod namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:02.691Z" level=info msg="Workflow update successful" namespace=develop-workflow phase=Running resourceVersion=646930400 workflow=**********-1707471600
time="2024-02-09T09:40:12.691Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=646930400 namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.693Z" level=info msg="Task-result reconciliation" namespace=develop-workflow numObjs=0 workflow=**********-1707471600
time="2024-02-09T09:40:12.693Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.693Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.694Z" level=info msg="node changed" namespace=develop-workflow new.message= new.phase=Succeeded new.progress=0/1 nodeID=**********-1707471600 old.message= old.phase=Pending old.progress=0/1 workflow=**********-1707471600
time="2024-02-09T09:40:12.694Z" level=info msg="TaskSet Reconciliation" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.695Z" level=info msg=reconcileAgentPod namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.695Z" level=info msg="Running OnExit handler: exit-handler" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.695Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.700Z" level=info msg="was unable to obtain node for , letting display name to be nodeName" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.700Z" level=info msg="Steps node **********-1707471600-3586666638 initialized Running" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.700Z" level=info msg="StepGroup node **********-1707471600-4104114448 initialized Running" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.700Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.700Z" level=info msg="Steps node **********-1707471600-626938768 initialized Running" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.700Z" level=info msg="StepGroup node **********-1707471600-1958917994 initialized Running" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.701Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.701Z" level=info msg="Pod node **********-1707471600-3566325063 initialized Pending" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.809Z" level=info msg="Created pod: **********-1707471600.onExit[0].datadog-metrics[0].exit-status-count (**********-1707471600-datadog-3566325063)" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.809Z" level=warning msg="Node was nil, will be initialized as type Skipped" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.890Z" level=info msg="Pod node **********-1707471600-1538370245 initialized Pending" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.991Z" level=info msg="Created pod: **********-1707471600.onExit[0].datadog-metrics[0].exit-duration-gauge (**********-1707471600-datadog-1538370245)" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.991Z" level=info msg="Workflow step group node **********-1707471600-1958917994 not yet completed" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:12.991Z" level=info msg="Skipping **********-1707471600.onExit[0].slack-notification: when 'false == true' evaluated false" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:13.189Z" level=info msg="Skipped node **********-1707471600-103546606 initialized Skipped (message: when 'false == true' evaluated false)" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:13.190Z" level=info msg="Workflow step group node **********-1707471600-4104114448 not yet completed" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:13.290Z" level=info msg="Workflow update successful" namespace=develop-workflow phase=Running resourceVersion=646930801 workflow=**********-1707471600
time="2024-02-09T09:40:18.490Z" level=info msg="cleaning up pod" action=deletePod key=develop-workflow/**********-1707471600/deletePod
time="2024-02-09T09:40:22.893Z" level=info msg="Processing workflow" Phase=Running ResourceVersion=646930801 namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.895Z" level=info msg="Task-result reconciliation" namespace=develop-workflow numObjs=0 workflow=**********-1707471600
time="2024-02-09T09:40:22.895Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.895Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.895Z" level=info msg="node changed" namespace=develop-workflow new.message= new.phase=Succeeded new.progress=0/1 nodeID=**********-1707471600-1538370245 old.message= old.phase=Pending old.progress=0/1 workflow=**********-1707471600
time="2024-02-09T09:40:22.895Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.895Z" level=warning msg="workflow uses legacy/insecure pod patch, see https://argo-workflows.readthedocs.io/en/release-3.5/workflow-rbac/" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.895Z" level=info msg="node changed" namespace=develop-workflow new.message= new.phase=Succeeded new.progress=0/1 nodeID=**********-1707471600-3566325063 old.message= old.phase=Pending old.progress=0/1 workflow=**********-1707471600
time="2024-02-09T09:40:22.895Z" level=info msg="TaskSet Reconciliation" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.895Z" level=info msg=reconcileAgentPod namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.895Z" level=info msg="Running OnExit handler: exit-handler" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="Step group node **********-1707471600-1958917994 successful" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="node **********-1707471600-1958917994 phase Running -> Succeeded" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="node **********-1707471600-1958917994 finished: 2024-02-09 09:40:22.990379721 +0000 UTC" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="Outbound nodes of **********-1707471600-3566325063 is [**********-1707471600-3566325063]" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="Outbound nodes of **********-1707471600-1538370245 is [**********-1707471600-1538370245]" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="Outbound nodes of **********-1707471600-626938768 is [**********-1707471600-3566325063 **********-1707471600-1538370245]" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="node **********-1707471600-626938768 phase Running -> Succeeded" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="node **********-1707471600-626938768 finished: 2024-02-09 09:40:22.990488334 +0000 UTC" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="Step group node **********-1707471600-4104114448 successful" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="node **********-1707471600-4104114448 phase Running -> Succeeded" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="node **********-1707471600-4104114448 finished: 2024-02-09 09:40:22.990680607 +0000 UTC" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="Outbound nodes of **********-1707471600-626938768 is [**********-1707471600-3566325063 **********-1707471600-1538370245]" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="Outbound nodes of **********-1707471600-103546606 is [**********-1707471600-103546606]" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="Outbound nodes of **********-1707471600-3586666638 is [**********-1707471600-3566325063 **********-1707471600-1538370245 **********-1707471600-103546606]" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="node **********-1707471600-3586666638 phase Running -> Succeeded" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="node **********-1707471600-3586666638 finished: 2024-02-09 09:40:22.99083864 +0000 UTC" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.990Z" level=info msg="Updated phase Running -> Succeeded" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:22.991Z" level=info msg="Marking workflow completed" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:23.002Z" level=info msg="Deleted PDB resource for workflow." namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:23.002Z" level=info msg="Marking workflow as pending archiving" namespace=develop-workflow workflow=**********-1707471600
time="2024-02-09T09:40:23.008Z" level=info msg="cleaning up pod" action=deletePod key=develop-workflow/**********-1707471600-1340600742-agent/deletePod
time="2024-02-09T09:40:23.092Z" level=info msg="Workflow update successful" namespace=develop-workflow phase=Succeeded resourceVersion=646931206 workflow=**********-1707471600
time="2024-02-09T09:40:23.190Z" level=info msg="archiving workflow" namespace=develop-workflow uid=809ee2c5-797d-42d9-a287-5a65ac412d3f workflow=**********-1707471600
time="2024-02-09T09:40:23.325Z" level=info msg="Queueing Succeeded workflow develop-workflow/**********-1707471600 for delete in 2m59s due to TTL"
time="2024-02-09T09:40:28.190Z" level=info msg="cleaning up pod" action=deletePod key=develop-workflow/**********-1707471600-datadog-3566325063/deletePod
time="2024-02-09T09:40:28.190Z" level=info msg="cleaning up pod" action=deletePod key=develop-workflow/**********-1707471600-datadog-1538370245/deletePod
time="2024-02-09T09:43:23.000Z" level=info msg="Deleting garbage collected workflow 'develop-workflow/**********-1707471600'"
time="2024-02-09T09:43:23.009Z" level=info msg="Successfully request 'develop-workflow/**********-1707471600' to be deleted"

Logs from your workflow's wait container

workflow-controller does not seem to be involved.

No resources found in argo namespace.
@agilgur5 agilgur5 changed the title Getting a lot of history makes argo-server OOM. Server OOM when too many executions of CronWorkflows or WorkflowTemplates Feb 9, 2024
@agilgur5 agilgur5 added area/api Argo Server API area/ui P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important area/cron-workflows area/workflow-templates labels Feb 9, 2024
@agilgur5
Member

> view the details of each cron-workflows

You mean the history view in the CronWorkflow Details page that was added in #11811, right?
I think for that we can just set a hard limit of say, the most recent 20 executions of the CronWorkflow.

For pagination for the CronWorkflow List page, there is #10846

> or workflow-template UI

The WorkflowTemplate Details page does not currently have a history of executions -- can you elaborate on what you were referring to here? Pagination for the WorkflowTemplate List page?

@agilgur5 agilgur5 added the solution/suggested A solution to the bug has been suggested. Someone needs to implement it. label Feb 16, 2024
@terrytangyuan terrytangyuan added the problem/more information needed Not enough information has been provide to diagnose this issue. label Feb 18, 2024
@panicboat
Contributor Author

panicboat commented Feb 18, 2024

Thanks for confirming.

> I think for that we can just set a hard limit of say, the most recent 20 executions of the CronWorkflow.

How can this be accomplished?
Can archived workflows be included?

> The WorkflowTemplate Details page does not currently have a history of executions -- can you elaborate on what you were referring to here? Pagination for the WorkflowTemplate List page?

My understanding was wrong.
The WorkflowTemplate Details page doesn't have an execution history, so please disregard that part.

@agilgur5 agilgur5 removed the problem/more information needed Not enough information has been provide to diagnose this issue. label Feb 18, 2024
@agilgur5
Member

> How can this be accomplished?

We'd just need to add a limit in the pagination args of the request. As in, this line would need modification:

const workflowList = await services.workflows.list(namespace, null, [`${models.labels.cronWorkflow}=${name}`], null);

Would you like to work on that?
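
As a rough sketch of the change described above: the trailing `null` in the quoted call would be replaced with a pagination object carrying a limit. Everything below is illustrative rather than the actual UI code. The `Pagination` shape, the stubbed `listWorkflows` function, the value of the label constant, and the cap of 20 (taken from the suggestion earlier in this thread) are all assumptions.

```typescript
// Hypothetical stand-ins for the UI's types; the real ones live in the Argo Workflows UI code.
interface Pagination {
    offset?: string;
    limit?: number;
}

interface WorkflowSummary {
    name: string;
}

// Assumed value of models.labels.cronWorkflow.
const CRON_WORKFLOW_LABEL = 'workflows.argoproj.io/cron-workflow';

// Illustrative cap, taken from the "most recent 20 executions" suggestion in this thread.
const HISTORY_LIMIT = 20;

// Stub in place of services.workflows.list: pretends the server holds 1000 matching
// workflows and returns at most pagination.limit of them.
async function listWorkflows(
    namespace: string,
    phases: string[] | null,
    labels: string[],
    pagination: Pagination | null
): Promise<WorkflowSummary[]> {
    const total = 1000;
    const limit = pagination && pagination.limit !== undefined ? pagination.limit : total;
    const items: WorkflowSummary[] = [];
    for (let i = 0; i < Math.min(limit, total); i++) {
        items.push({name: `${namespace}-run-${i}`});
    }
    return items;
}

// Same call shape as the line quoted above, with a limit in place of the trailing null.
async function loadCronWorkflowHistory(namespace: string, name: string): Promise<WorkflowSummary[]> {
    return listWorkflows(namespace, null, [`${CRON_WORKFLOW_LABEL}=${name}`], {limit: HISTORY_LIMIT});
}
```

Whether the first page corresponds to the most recent executions depends on the server's ordering, which this stub does not model.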

> Can archived workflows be included?

In 3.5.x I believe they should be included by default since it has the unified API/UI.
3.5.0 had a regression that retrieved all archived workflows during deduplication (see #12025), but that was reverted in 3.5.1. I think the current issue is that the deduplication can be off (see #11715) and it can overfetch, but only to a max of limit * 2 (the same limit applies to both live and archived workflows).
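
To illustrate the `limit * 2` bound mentioned above, here is a toy model. It assumes the server fetches up to `limit` live workflows and up to `limit` archived workflows, then merges them by UID. This is not the server's implementation; `mergeByUid` and `makePage` are hypothetical helpers for illustration only.

```typescript
interface WorkflowRef {
    uid: string;
}

// Build a page of `limit` entries with consecutive UIDs starting at `start`.
function makePage(start: number, limit: number): WorkflowRef[] {
    const page: WorkflowRef[] = [];
    for (let i = 0; i < limit; i++) {
        page.push({uid: `uid-${start + i}`});
    }
    return page;
}

// Merge two pages, keeping the first occurrence of each UID (live entries win).
function mergeByUid(live: WorkflowRef[], archived: WorkflowRef[]): WorkflowRef[] {
    const seen: {[uid: string]: boolean} = {};
    const merged: WorkflowRef[] = [];
    for (const wf of live.concat(archived)) {
        if (!seen[wf.uid]) {
            seen[wf.uid] = true;
            merged.push(wf);
        }
    }
    return merged;
}

const limit = 20;
// Fully overlapping pages collapse to `limit` entries.
const sameSet = mergeByUid(makePage(0, limit), makePage(0, limit));
// Partially overlapping pages land somewhere in between.
const partial = mergeByUid(makePage(0, limit), makePage(10, limit));
// Disjoint pages (or a deduplication miss) reach the upper bound of limit * 2.
const disjoint = mergeByUid(makePage(0, limit), makePage(100, limit));
```

In this model the merged result always has between `limit` and `limit * 2` entries, so memory use stays bounded once the request carries a limit.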

@agilgur5 agilgur5 added the good first issue Good for newcomers label Feb 18, 2024
@panicboat
Contributor Author

Thank you, I would love to work on it. Can you assign me to it?

@agilgur5
Member

You don't need to be assigned to work on something; you can just submit a PR.

@agilgur5 agilgur5 added the type/regression Regression from previous behavior (a specific type of bug) label Apr 3, 2024
@agilgur5 agilgur5 added this to the v3.5.x patches milestone Apr 3, 2024
@agilgur5 agilgur5 changed the title Server OOM when too many executions of CronWorkflows or WorkflowTemplates 3.5 Server OOM when too many executions of CronWorkflows or WorkflowTemplates Jun 9, 2024
@argoproj argoproj locked as resolved and limited conversation to collaborators Jun 30, 2024
@agilgur5 agilgur5 changed the title 3.5 Server OOM when too many executions of CronWorkflows or WorkflowTemplates 3.5 Server OOM when too many executions of CronWorkflows Jun 30, 2024