
Etcd gets full with ~500 workflows. #12802

Open · 2 of 4 tasks
leryn1122 opened this issue Mar 14, 2024 · 5 comments
Labels
area/gc (Garbage collection, such as TTLs, retentionPolicy, delays, and more) · type/support (User support issue - likely not a bug)

Comments


leryn1122 commented Mar 14, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

We run ~500 workflows and ~500 pods concurrently as offline tasks in our production environment. Etcd filled up rapidly, reaching 8 GB.
As a result, etcd and the apiserver became unavailable, and the Argo workflow controller kept restarting.
Based on monitoring and metrics, our team concluded that etcd and the apiserver can become unavailable when running and pending Workflows flood into etcd.

For now, the team’s solutions are:

  • Limiting Workflow quotas (a rough quota sketch follows this list)
  • Reducing the size of the workflow templates rendered by the business side
  • Writing a scheduled script to check etcd and compact/defragment it when it is nearly full
  • Migrating the business Argo workloads into a dedicated cluster
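
For illustration, the quota approach looks roughly like the ResourceQuota below. The object name and the numbers are placeholders rather than our exact production values (the 600-pod cap just mirrors the limit visible in the controller logs):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: argo-quota                       # placeholder name
  namespace: argo
spec:
  hard:
    count/pods: "600"                    # cap on concurrently existing pods in the namespace
    count/workflows.argoproj.io: "500"   # cap on live Workflow objects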

We expect Argo Workflows not to flood etcd or affect the stability of the whole cluster.

Version

v3.4.10

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Limited by NDA.

Logs from the workflow controller

time="2024-03-06T01:59:48.872Z" level=info msg="Mark node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(20:raw/587/2023/12/6/1734100703048318977/ros2/20231206104000_20231206104046_5m/raw_587_20231206104000_20231206104046.db3)[2].xxxxxxx[0].xxxxxxx[0].xxxxxxx(0) as Pending, due to: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.875Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2759181252 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3992814895]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="Transient error: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600"
time="2024-03-06T01:59:48.187Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(21:raw/587/2024/1/26/1751125289026650113/ros2/20240126140500_20240126141000_5m/raw_587_20240126140500_20240126141000.db3)[2].xxxxxxx[0].xxxxxxx[1].vision2d-lidar-fusion-match(0)" nodePhase=Pending recentlyStarted=false workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.185Z" level=info msg="Processing workflow" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.195Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-814885821 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2761991884]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.875Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-765220508 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3024955863]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.206Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-239402416 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-193874387]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.873Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3590128111 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.187Z" level=info msg="Workflow pod is missing" namespace=argo nodeName="slice-587-20231201141424-20240128125944-20240201143506-fen7hx62[0].xxxxxxx[1].subtask(20:raw/587/2023/12/6/1734100703048318977/ros2/20231206104000_20231206104046_5m/raw_587_20231206104000_20231206104046.db3)[2].xxxxxxx[0].xxxxxxx[0].xxxxxxx(0)" nodePhase=Pending recentlyStarted=false workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.186Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.206Z" level=info msg="SG Outbound nodes of slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1845491224 are [slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1265119851]" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944 message: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3557973944\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=822, limited: count/pods=600" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.872Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-285032592 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.873Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-3624891064 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:48.187Z" level=info msg="node unchanged" nodeID=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2773247126
time="2024-03-06T01:59:49.647Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.624Z" level=info msg="Workflow step group node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1031877465 not yet completed" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.623Z" level=info msg="Transient error: admission webhook \"resourcesquotas.quota.kubesphere.io\" denied the request: pods \"slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-1351688902\" is forbidden: exceeded quota: argo, requested: count/pods=1, used: count/pods=821, limited: count/pods=600"
time="2024-03-06T01:59:49.648Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.632Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62
time="2024-03-06T01:59:49.633Z" level=info msg="template (node slice-587-20231201141424-20240128125944-20240201143506-fen7hx62-2485359944) active children parallelism exceeded 3" namespace=argo workflow=slice-587-20231201141424-20240128125944-20240201143506-fen7hx62

Logs from your workflow's wait container

N/A
agilgur5 added the type/support label and removed the type/bug label Mar 15, 2024

agilgur5 (Member) commented Mar 15, 2024

We run ~500 workflows and ~500 pods concurrently

So ~2500 concurrent Pods total?

We expect Argo Workflows not to flood etcd or affect the stability of the whole cluster.

For many Workflows and large Workflows, it may indeed stress the k8s API and etcd. That's not really an Argo limitation, that's how k8s works with shared control plane resources.

There are a few well-documented features you may want to use here, such as node status offloading, workflow archiving with a TTL, and the pod GC / TTL strategies.
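
As a rough illustration (the values below are examples, not recommendations for your cluster), most of these knobs live in the workflow-controller-configmap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  parallelism: "200"            # limit how many Workflows the controller runs at once
  namespaceParallelism: "50"    # per-namespace limit
  persistence: |
    nodeStatusOffLoad: true     # keep large node statuses out of the Workflow object in etcd
    archive: true               # archive completed Workflows to SQL
    archiveTTL: 7d              # garbage-collect archived rows after 7 days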

agilgur5 added the area/gc and problem/more information needed labels Mar 15, 2024

leryn1122 (Author) commented

Status:

  • Currently there are ~4000 pending Workflows, ~1000 running Workflows, and ~1000 pods for the largest Argo installation. There are also 4 smaller-scale Argo installations, which I ignore here.
  • The former cluster has >10 nodes dedicated to Argo out of ~180 nodes in total. We are now building a standalone cluster for Argo.
  • Workflows behave differently depending on the business use case. Some of them finish in a few seconds, while others run for a few hours.

Recent Efforts:

  • Archiving to a standalone MySQL instance has been enabled for months.
  • Another problem we solved: when the database got stuck under growing data during archiving, the workflow controller stopped handling any requests. Archiving should be asynchronous, I think. Archiving became slow, or even stuck, after argo_archived_workflows reached ~250 GB. We wrote a cronjob that daily deletes rows from argo_archived_workflows for workflows that finished weeks ago, to prevent archiving from getting stuck (a rough sketch of this cronjob follows the list).
  • --workflow-ttl-workers and --pod-cleanup-workers: we tried increasing them. It works but does not save etcd from stress.
  • Tuning parallelism and the pod limits has been the primary approach in the past weeks. A lower limit does not satisfy the business requirements. A rapid/step change to these limits makes etcd unstable in my experience.
  • We've urged developers to reduce the size of the workflow templates from 200 KB to something smaller.
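
The cleanup cronjob mentioned above is roughly shaped like the sketch below. The image, schedule, database name, credentials secret, and the 14-day cutoff are placeholders, and it assumes the finishedat column of argo_archived_workflows:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: archived-workflows-cleanup       # placeholder name
  namespace: argo
spec:
  schedule: "0 2 * * *"                  # daily, off-peak
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: mysql-client
              image: mysql:8.0                     # placeholder image
              envFrom:
                - secretRef:
                    name: argo-mysql-credentials   # placeholder secret providing MYSQL_HOST/USER/PASSWORD
              command:
                - sh
                - -c
                - >
                  mysql -h "$MYSQL_HOST" -u "$MYSQL_USER" -p"$MYSQL_PASSWORD" argo -e
                  "DELETE FROM argo_archived_workflows
                  WHERE finishedat < NOW() - INTERVAL 14 DAY
                  LIMIT 10000;"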

agilgur5 (Member) commented Mar 19, 2024

  • Workflows behave differently depending on the business use case. Some of them finish in a few seconds, while others run for a few hours.

Yea Workflows in general have a lot of diverse use-cases, so capacity planning can be challenging. Configurations that are ideal for short Workflows are not necessarily ideal for long Workflows, etc.

Archiving should be asynchronous, I think

Archiving is asynchronous. The entire Controller is async, it's all goroutines.

Archiving became slow, or even stuck, after argo_archived_workflows reached ~250 GB.

This sounds like it might be getting CPU starved? Without detailed metrics etc it's pretty hard to dive into details.

It also sounds a bit like #11948, which was fixed in 3.4.14 and later. Not entirely the same though from the description (you have an etcd OOM vs a Controller OOM and your archive is growing vs your live Workflows).

  • I can confirm the issue exists when I tested with :latest

v3.4.10

You also checked this box, but are not on latest. Please fill out the issue template accurately, those questions are asked for very good reasons.

We wrote a cronjob that daily deletes rows from argo_archived_workflows for workflows that finished weeks ago, to prevent archiving from getting stuck.

You can use archiveTTL for this as a built-in option.

It works but does not save etcd from stress.

If you're creating as many Workflows as you're deleting, that sounds possible. Again, you didn't provide metrics, but those would be ideal to track when doing any sort of performance tuning.

A rapid/step change to these limits makes etcd unstable in my experience.

Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).
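
For reference, on a self-managed control plane the usual knobs are etcd's backend quota and auto-compaction flags. A sketch of what that might look like in a kubeadm-style etcd static-pod manifest (paths and values are examples, not recommendations):

# excerpt of /etc/kubernetes/manifests/etcd.yaml
spec:
  containers:
    - name: etcd
      command:
        - etcd
        - --quota-backend-bytes=8589934592     # raise the default 2 GiB DB quota to 8 GiB
        - --auto-compaction-mode=periodic
        - --auto-compaction-retention=1h       # compact old revisions every hour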

We've urged developers to reduce the size of the workflow templates from 200 KB to something smaller.

I listed this in my previous comment -- nodeStatusOffload can help with this.

leryn1122 (Author) commented

Sorry, I was limited by the NDA; now I can share more details.
Configuration and thresholds have varied over the past months.

You can use archiveTTL for this as a built-in option.

Current archiveTTL is 7d.

Standalone MySQL instance quota: 60-80 GB memory and a local NVMe disk.

When the size of the archived workflows from the last 30-45 days reached ~250 GB, queries and writes on the argo_archived_workflows table became slow. A single SQL statement deleting workflows took 2-3 minutes. We tried adjusting the table indexes and MySQL hints, but with no evident effect. So I rebuilt MySQL and added the hacky cronjob mentioned before, and now it runs stably.

Are you on a managed k8s control plane provider? E.g. EKS, AKS, GKE, etc? Those try to auto-scale and have pre-set limits, so that can certainly happen. If you're using a self-managed control plane (e.g. kOps), you can vertically scale etcd and the rest of the k8s control plane (as well as horizontally scale to an extent, as etcd is fully consistent and so will eventually hit an upper bound).

Self-managed cluster:

  • Kubernetes v1.21.9
  • Kubesphere v3.2.1

We now run etcd compaction and defragmentation regularly, triggered by the DB size metrics. That is a hack.
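
The shape of that hack, roughly: a CronJob that defragments the local etcd member. The image, schedule, node selector, and cert paths below are placeholders; compaction of old revisions is assumed to happen elsewhere (the apiserver compacts etcd periodically by default), so this job only reclaims the space that compaction frees:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-defrag                  # placeholder name
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"            # in practice we trigger it from DB-size metrics
  jobTemplate:
    spec:
      template:
        spec:
          hostNetwork: true                        # talk to the etcd member on this node
          nodeSelector:
            node-role.kubernetes.io/master: ""     # placeholder; schedule onto an etcd node
          tolerations:
            - operator: Exists
          restartPolicy: Never
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd     # placeholder cert location
          containers:
            - name: etcdctl
              image: quay.io/coreos/etcd:v3.5.9    # placeholder image
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
              env:
                - name: ETCDCTL_API
                  value: "3"
                - name: ETCDCTL_ENDPOINTS
                  value: https://127.0.0.1:2379
                - name: ETCDCTL_CACERT
                  value: /etc/kubernetes/pki/etcd/ca.crt
                - name: ETCDCTL_CERT
                  value: /etc/kubernetes/pki/etcd/server.crt
                - name: ETCDCTL_KEY
                  value: /etc/kubernetes/pki/etcd/server.key
              command: ["etcdctl", "defrag"]       # reclaim space freed by compaction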

I listed this in my previous comment -- nodeStatusOffload can help with this.

It is enabled. Here is the related config I can expose:

Persistence:

connectionPool:
  maxIdleConns: 100
  maxOpenConns: 0
  connMaxLifetime: 0s
nodeStatusOffLoad: true
archive: true
archiveTTL: 7d

Workflow defaults:

spec:
  ttlStrategy:
    secondsAfterCompletion: 0   # delete the Workflow object immediately after it finishes
    secondsAfterSuccess: 0
    secondsAfterFailure: 0
  podGC:
    strategy: OnPodCompletion   # delete pods as soon as they complete
  parallelism: 3                # max concurrently running pods per Workflow

Workflow controller args

args:
  - '--configmap'
  - workflow-controller-configmap
  - '--executor-image'
  - 'xxxxx/argoexec:v3.4.10'
  - '--namespaced'
  - '--workflow-ttl-workers=8'      # 4->8
  - '--pod-cleanup-workers=32'  # 4->32
  - '--workflow-workers=64'        # 32->64
  - '--qps=50'
  - '--kube-api-burst=90'  # 60->90
  - '--kube-api-qps=60'    # 40->60

Executor config

imagePullPolicy: IfNotPresent
resources:
  requests:
    cpu: 10m
    memory: 64Mi
  limits:
    cpu: 1000m
    memory: 512Mi

Below are some desensitized etcd and Argo metrics screenshots: the first shows the etcd DB size varying rapidly, and the second shows the count of Workflows and pods in the argo namespace over the same period.

Screenshot from 2024-03-22 09-37-36
Screenshot from 2024-03-22 09-37-39

github-actions bot (Contributor) commented Apr 8, 2024

This issue has been automatically marked as stale because it has not had recent activity and needs more information. It will be closed if no further activity occurs.

github-actions bot added the problem/stale label and removed the problem/more information needed label Apr 8, 2024