
Unable to accelerate cleanup of succeeded workflows with TTL workers #12206

Open
2 of 3 tasks
n1klasD opened this issue Nov 15, 2023 · 12 comments
Labels
area/controller Controller issues, panics area/docs Incorrect, missing, or mistakes in docs area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more area/metrics P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority solution/suggested A solution to the bug has been suggested. Someone needs to implement it. type/bug

Comments

@n1klasD

n1klasD commented Nov 15, 2023

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what you expected to happen?

We are experiencing difficulties increasing the number of TTL (time-to-live) workers in order to clean up succeeded workflows faster. We have set the --workflow-ttl-workers flag of the controller to higher values, such as 64 or 96, but it appears that the workflow controller does not recognize or utilize these values effectively. Despite configuring a higher number of TTL workers, the Prometheus metric argo_workflows_workers_busy_count consistently shows only one busy TTL worker, not the expected 64 or 96.

This issue is causing a build-up of succeeded workflows over time, currently peaking at around 10,000, because we are creating workflows faster than they can be cleaned up. Our intention was to leverage multiple TTL workers to accelerate the clean-up process and reduce the backlog of succeeded workflows. It's worth noting that the controller only utilizes a fraction of our machine's CPU resources (approximately 5% of a 16-core machine).

Additionally, we are observing frequent logs indicating that workflows are being cleaned up late, as seen below.

Version

3.4.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

-

Logs from the workflow controller

time="2023-11-15T12:07:59.850Z" level=info msg="Queueing Succeeded workflow XXXXXXXXXX for delete in -3h42m20s due to TTL"

Logs from your workflow's wait container

-
@n1klasD n1klasD changed the title Unable to Increase the Number of TTL Workers for Cleanup in Argo Workflows Unable to increase the number of TTL Workers to accelerate cleanup of succeeded workflow. Nov 15, 2023
@n1klasD n1klasD changed the title Unable to increase the number of TTL Workers to accelerate cleanup of succeeded workflow. Unable to increase the number of TTL Workers to accelerate cleanup of succeeded workflows Nov 15, 2023
@sgutwein

Here are some further screenshots from our monitoring system:

You can see that the number of busy TTL workers never surpasses one, and often no worker is busy at all. We expect this metric (argo_workflows_workers_busy_count) to approach the number set by --workflow-ttl-workers when there are many workflows to be cleaned up:

[screenshot: busy TTL workers metric, never exceeding one]

Here you can see the number of succeeded workflows, which are not cleaned up fast enough:

[screenshot: count of succeeded workflows accumulating]

Strangely, the metric for the depth of the workflows queue is always at zero:

[screenshot: workflow TTL queue depth metric, flat at zero]

But we see a high frequency of the logs time="2023-11-15T12:07:59.850Z" level=info msg="Queueing Succeeded workflow XXXXXXXXXX for delete in -3h42m20s due to TTL":

[screenshot: frequency of the "Queueing Succeeded workflow ..." logs]
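
A note on the negative duration in that log line: the delay is essentially (finish time + TTL) minus now, so a negative value means the workflow was already overdue at the moment it was queued. A minimal sketch of that arithmetic (made-up names, not the actual Argo code):

```go
package main

import (
	"fmt"
	"time"
)

// remainingTTL sketches the arithmetic behind the log line: the delay
// until deletion is (finish time + TTL) - now. A negative result means
// the workflow is already overdue.
func remainingTTL(finishedAt time.Time, ttl time.Duration, now time.Time) time.Duration {
	return finishedAt.Add(ttl).Sub(now)
}

func main() {
	now := time.Now()
	finishedAt := now.Add(-4 * time.Hour) // workflow finished 4h ago
	ttl := 30 * time.Minute               // e.g. ttlStrategy.secondsAfterSuccess: 1800

	delay := remainingTTL(finishedAt, ttl, now)
	fmt.Printf("Queueing Succeeded workflow for delete in %v due to TTL\n", delay)
	// Prints a negative duration (about -3h30m0s). A delaying workqueue
	// treats a non-positive delay as "enqueue immediately".
}
```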

@agilgur5 agilgur5 added area/controller Controller issues, panics area/metrics P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important P1 High priority. All bugs with >=5 thumbs up that aren’t P0, plus: Any other bugs deemed high priority and removed P2 Important. All bugs with >=3 thumbs up that aren’t P0 or P1, plus: Any other bugs deemed important labels Nov 15, 2023
@n1klasD
Author

n1klasD commented Nov 17, 2023

I'd like to add a point to that. It seems like the workflow garbage collection sometimes just stops for an hour or so, until I delete the workflow-controller pod. Then it works for a while, cleaning up a lot of succeeded workflows, before stopping again. We currently use this as a workaround when too many succeeded workflows have accumulated.

Apart from setting --workflow-ttl-workers to 64 or 128, we also tried setting the environment variable WORKFLOW_GC_PERIOD in the controller from the default value of 5m to 2m, to increase the periodicity for GC of workflows. We did not see any improvements so far though.

I read through some other issues, and it seems to be related to this (or at least the comment under this issue): #4634 (comment)

@agilgur5
Member

agilgur5 commented Nov 20, 2023

We have set the --workflow-ttl-workers flag of the controller to higher values, such as 64 or 96, but it appears that the workflow controller does not recognize or utilize these values effectively

So this corresponds directly to the number of goroutines launched for the GC controller -- quite literally this for loop.

Whether it utilizes it effectively, can't quite say on that one. Sounds like it may be a no based on your report.
There is a queue, which could limit concurrency to an extent.
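
In simplified form, that launch looks roughly like the following; this is an illustrative sketch of the standard client-go worker pattern, not the actual Argo source:

```go
// Minimal sketch of the worker pattern that --workflow-ttl-workers feeds
// into. Names are illustrative, not the actual Argo source.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/workqueue"
)

type gcController struct {
	queue workqueue.DelayingInterface
}

// Run launches one goroutine per configured TTL worker. All workers pull
// from the same shared queue, so how (and when) items get enqueued bounds
// the effective concurrency regardless of the worker count.
func (c *gcController) Run(stopCh <-chan struct{}, workflowTTLWorkers int) {
	defer c.queue.ShutDown()
	for i := 0; i < workflowTTLWorkers; i++ {
		// wait.Until restarts runWorker a second after it returns,
		// until stopCh is closed.
		go wait.Until(c.runWorker, time.Second, stopCh)
	}
	<-stopCh
}

func (c *gcController) runWorker() {
	for {
		key, shutdown := c.queue.Get()
		if shutdown {
			return
		}
		// A real worker would resolve the key and delete the expired
		// Workflow here; this sketch just prints it.
		fmt.Println("would delete workflow:", key)
		c.queue.Done(key)
	}
}

func main() {
	c := &gcController{queue: workqueue.NewDelayingQueue()}
	stopCh := make(chan struct{})
	go c.Run(stopCh, 4) // 4 stands in for --workflow-ttl-workers
	c.queue.Add("argo/my-succeeded-workflow")
	time.Sleep(100 * time.Millisecond)
	close(stopCh)
}
```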

Strangely, the metric for the depth of the workflows queue is always at zero:

I was gonna say to check the workflow_ttl_queue metric, huh 🤔 😕

Does the queue depth of the other queues make sense? i.e. cron_wf_queue, pod_cleanup_queue, workflow_queue -- is workflow_ttl_queue the only odd one out that's not correct?

You can see that the number of busy TTL workers never surpasses one, and often no worker is busy at all. We expect this metric (argo_workflows_workers_busy_count) to approach the number set by --workflow-ttl-workers when there are many workflows to be cleaned up:

Yea I would expect this too... The logic looks correct to me too and is near identical to other parts of the codebase 🤔 😕

Same as above, does the busy_count look correct for the other queues? Or is it just workflow_ttl_queue that seems off?

It's worth noting that the controller only utilizes a fraction of our machine's CPU resources (approximately 5% of a 16-core machine).

That's a good note since fully utilized CPU would certainly limit concurrency.
I'm assuming you have your k8s resource requests and limits set appropriately for the Controller Deployment too?

WORKFLOW_GC_PERIOD in the controller from the default value of 5m to 2m, to increase the periodicity for GC of workflows. We did not see any improvements so far though.

This appears to be a slight misnomer; in the code this refers to the periodicity specifically of the GC for node status offloads.

The period for GC for Workflow TTL / retention is located in the same for loop as above: this line sets it to every second and it is not configurable.
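
To make the two periods concrete, here is a rough sketch with assumed names (the real controller reads the env var through its own helpers, so treat this purely as an illustration):

```go
package main

import (
	"os"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// gcPeriodFromEnv mimics how a WORKFLOW_GC_PERIOD-style setting would be
// read: parse a duration from the environment, fall back to 5m. This is
// an illustration, not the controller's actual env helper.
func gcPeriodFromEnv() time.Duration {
	if v := os.Getenv("WORKFLOW_GC_PERIOD"); v != "" {
		if d, err := time.ParseDuration(v); err == nil {
			return d
		}
	}
	return 5 * time.Minute
}

func main() {
	stopCh := make(chan struct{})

	// Node status offload GC: runs every WORKFLOW_GC_PERIOD (default 5m).
	// This is the knob the docs describe; it does not govern TTL-based
	// deletion of whole workflows.
	go wait.Until(func() { /* delete orphaned offloaded node status rows */ }, gcPeriodFromEnv(), stopCh)

	// TTL / retention GC: each worker loop is restarted on a hard-coded
	// one-second period, independent of WORKFLOW_GC_PERIOD.
	for i := 0; i < 4; i++ { // 4 stands in for --workflow-ttl-workers
		go wait.Until(func() { /* drain the TTL workqueue */ }, time.Second, stopCh)
	}

	time.Sleep(100 * time.Millisecond)
	close(stopCh)
}
```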

I read through some other issues, and it seems to be related to this (or at least the comment under this issue): #4634 (comment)

It looks like the user there did not necessarily confirm the fix from #4736, which added a small 1s delay to GC as well as the configurable number of TTL workers.

@agilgur5
Member

But we see a high frequency of the logs time="2023-11-15T12:07:59.850Z" level=info msg="Queueing Succeeded workflow XXXXXXXXXX for delete in -3h42m20s due to TTL":

So this graph does look a little suspicious... this seems to log / run every 15-20 minutes? Any chance you can confirm the exact time period?

20 minutes corresponds to the Workflow Informer's resync period, which is when Informers replay their cached objects to their event handlers.
So that would suggest there may be a bug with the Informer, and not the TTL queue or TTL GC necessarily... 🤔
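
For reference, a resync does not re-list from the API server; the shared informer simply replays everything in its cache to its event handlers on that interval. A minimal client-go sketch (illustrative only; Argo wires up its own Workflow informer, Pods are used here just to show the mechanism):

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig; illustrative only.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// A 20-minute resync period: every 20 minutes the informer replays its
	// entire cache as update events, which is what would re-enqueue
	// already-expired workflows in ~20-minute batches when nothing else
	// triggers updates.
	factory := informers.NewSharedInformerFactory(clientset, 20*time.Minute)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			// On a resync, old and new are the same cached object; a
			// TTL-style controller would re-check expiry here and
			// AddAfter() the key to its queue.
		},
	})

	stopCh := make(chan struct{})
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)
	<-stopCh
}
```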

This may be the same root cause as #11948 and therefore fixed by #12133.

Can you try the tag published there, dev-fix-informer-3.4.13?

@agilgur5 agilgur5 added the area/docs Incorrect, missing, or mistakes in docs label Nov 20, 2023
@n1klasD
Author

n1klasD commented Nov 20, 2023

Thanks for your extensive reply!

Whether it utilizes it effectively, can't quite say on that one. Sounds like it may be a no based on your report.

Either the controller does not utilize them effectively, or the metric reports incorrect values.

Does the queue depth of the other queues make sense? i.e. cron_wf_queue, pod_cleanup_queue, workflow_queue -- is workflow_ttl_queue the only odd one out that's not correct?

When I look at the argo_workflows_queue_depth_count metric, there are reasonable spikes in the cron_wf_queue and workflow_queue metrics; that looks good. Apart from workflow_ttl_queue always being at zero, pod_cleanup_queue is also always at zero. There are queue adds greater than zero (argo_workflows_queue_adds_count) for all four queues, which suggests an error in the queue depth for workflow_ttl_queue and pod_cleanup_queue.
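
One caveat when comparing those two metrics: in a client-go workqueue, depth is an instantaneous gauge of items currently sitting in the queue, while adds is a cumulative counter. Items still waiting out an AddAfter delay, or items a worker has already picked up, don't count towards depth, so a quickly drained queue can legitimately scrape as zero. A small sketch of that behaviour (not the Argo source):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// The TTL controller builds on this kind of delaying queue
	// (illustrative; not the actual Argo wiring).
	q := workqueue.NewDelayingQueue()
	defer q.ShutDown()

	// An item scheduled for the future is held outside the queue until
	// its delay expires, so it does not show up in Len() (i.e. depth).
	q.AddAfter("future-workflow", time.Hour)
	fmt.Println("depth after AddAfter(1h):", q.Len()) // 0

	// An item whose delay is already negative (overdue TTL) is added
	// immediately and does count towards depth...
	q.AddAfter("overdue-workflow", -3*time.Hour)
	fmt.Println("depth with overdue item:", q.Len()) // 1

	// ...but as soon as a worker Get()s it, depth drops back to zero,
	// while a cumulative adds counter keeps growing.
	item, _ := q.Get()
	fmt.Println("processing:", item, "depth now:", q.Len()) // 0
	q.Done(item)
}
```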

Same as above, does the busy_count look correct for the other queues? Or is it just workflow_ttl_queue that seems off?

When looking at the argo_workflows_workers_busy_count metric, I only see this problem with the number of busy workers for the workflow_ttl_queue, where the value never spikes higher than one worker (https://user-images.githubusercontent.com/46707060/283146674-e9d5e274-620d-4ead-907a-36f66bb4c682.png). For workflow_queue, this correctly spikes to our configured number of 64 workers under high load. I also don't see this problem for pod_cleanup_queue. We do not set --cron-workflow-workers, but the number of busy cron workflow workers consistently spikes to 8; I am assuming that this is the default.

I'm assuming you have your k8s resource requests and limits set appropriately for the Controller Deployment too?

Yes, we have set appropriate requests and limits for the controller Deployment. These are actually a lot higher than required, both in terms of memory and the number of CPU cores.

@n1klasD
Author

n1klasD commented Nov 20, 2023

This appears to be a slight misnomer; in the code this refers to the periodicity specifically of the GC for node status offloads.

Thanks for the clarification. In the docs for the environment variables, this is described as "The periodicity for GC of workflows.", hence the confusion.

@n1klasD
Author

n1klasD commented Nov 20, 2023

But we see a high frequency of the logs time="2023-11-15T12:07:59.850Z" level=info msg="Queueing Succeeded workflow XXXXXXXXXX for delete in -3h42m20s due to TTL":

So this graph does look a little suspicious... this seems to log / run every 15-20 minutes? Any chance you can confirm the exact time period?

Yes, this is very consistently logged in batches every 18-20 minutes. That seems to align with the Workflow Informer's resync period of 20 minutes.

@sgutwein

The periodic frequency of the "Queueing Succeeded workflow ..." logs can be seen here:

[screenshot: "Queueing Succeeded workflow ..." logs appearing in periodic batches]

@n1klasD
Author

n1klasD commented Nov 20, 2023

Can you try the tag published there, dev-fix-informer-3.4.13?

We currently only have this problem in our production environment because of the increased load, so we are naturally very hesitant to test this there. I'll get back to you on that, though. Do you have any idea when this fix will be released?

@agilgur5
Member

agilgur5 commented Nov 29, 2023

Do you have any idea when this fix will be released?

The informer fix has been released as part of 3.4.14 now (and 3.5.2).

Are you able to try the new version and see if it is fixed?

We currently only have this problem in our production environment because of the increased load

So you could check in a staging environment whether the periodic frequency of the "Queueing Succeeded workflow [...] for delete [...] due to TTL" logs changes from roughly every 20 minutes to something more frequent.
If that's the root cause (which sounds pretty likely right now), then a change in the frequency in staging would indicate it would fix the production issue as well.

@n1klasD
Author

n1klasD commented Dec 4, 2023

I tested 3.4.14 in a staging environment and the frequency of the "Queueing Succeeded workflow [...] for delete [...] due to TTL" logs definitely increased. I don't see it happening in periodic batches any more, that's great!

However, the workflow queue depth metric argo_workflows_queue_depth_count still seems to be incorrect and shows the same behaviour as described above. I know that this was not part of this fix, though!

@agilgur5 agilgur5 added the solution/suggested A solution to the bug has been suggested. Someone needs to implement it. label Feb 4, 2024
@agilgur5 agilgur5 changed the title Unable to increase the number of TTL Workers to accelerate cleanup of succeeded workflows Unable to accelerate cleanup of succeeded workflows with TTL workers Feb 5, 2024
@tooptoop4
Contributor

#12659 means that twice as many pods could be going into the cleanup queue as they should.

@agilgur5 agilgur5 added the area/gc Garbage collection, such as TTLs, retentionPolicy, delays, and more label Feb 18, 2024