Tasks are in queued state for a longer time and executor slots are exhausted often #38968

Open
paramjeet01 opened this issue Apr 12, 2024 · 19 comments
Labels: area:core, area:performance, kind:bug, provider:cncf-kubernetes

Comments

@paramjeet01 commented Apr 12, 2024

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.3

What happened?

Tasks remain in the queued state for longer than expected. This worked perfectly in 2.3.3.

What you think should happen instead?

The tasks should move to the running state instead of remaining queued.

How to reproduce

Spin up more than 150 DAG runs in parallel; in Airflow 2.8.3 the tasks stay queued even though there is capacity to execute them.
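
For context, a minimal sketch of the kind of setup assumed here (the DAG id, task, and sleep duration are illustrative, not the actual DAGs in use):

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A trivial DAG; triggering ~150 runs of it in parallel reproduces long queued times.
with DAG(
    dag_id="queue_repro",
    start_date=datetime(2024, 1, 1),
    schedule=None,
    max_active_runs=200,  # allow many concurrent runs for the reproduction
) as dag:
    BashOperator(task_id="sleep_task", bash_command="sleep 60")

Triggering it repeatedly, for example with for i in $(seq 150); do airflow dags trigger queue_repro; done, is one way to create the parallel runs.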

Operating System

Amazon Linux 2

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

paramjeet01 added the area:core, kind:bug, and needs-triage labels on Apr 12, 2024
@jscheffl (Contributor)

Without any logs, errors, metrics or details it is impossible to (1) understand your problem and (2) fix anything.

Can you please provide more details?

@paramjeet01 (Author)

Apologies, I'm relatively new to Airflow. We've checked the scheduler logs thoroughly, and everything seems to be functioning correctly without any errors. The scheduler pods are also operating within normal CPU and memory limits, and our RDS database doesn't indicate any breaches either. Currently we're running a DAG with 150 parallel DAG runs, but a significant portion of tasks remain in the queued state for an extended period: about 140 tasks are queued while only 39 are actively running. I've already reviewed the configurations for max_active_tasks_per_dag and max_active_runs_per_dag, and they appear to be properly set. We did not face this issue in 2.3.3.

@ephraimbuddy (Contributor)


Can you try increasing [scheduler] max_tis_per_query to 512? In one round of performance debugging we found that increasing it helped, though the effect may depend on the environment.
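
For reference, a minimal sketch of how that could be set, either in airflow.cfg or via the equivalent environment variable on the scheduler (512 is the value suggested above, not a universal recommendation):

[scheduler]
max_tis_per_query = 512

# equivalently, following the AIRFLOW__<SECTION>__<KEY> convention:
AIRFLOW__SCHEDULER__MAX_TIS_PER_QUERY=512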

@paramjeet01 (Author)

I have updated the ConfigMap with max_tis_per_query = 512 and redeployed the scheduler. I'll monitor it for some time and let you know. Thanks for the quick response.

@paramjeet01 (Author)

@ephraimbuddy, the above config has improved task scheduling performance, and the Gantt view shows that queue times are shorter than before. Could you also share the performance-tuning documentation? That would be really helpful.

@tirkarthi (Contributor)

@paramjeet01 This might be helpful:

https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/scheduler.html#fine-tuning-your-scheduler-performance
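
For quick reference, a rough sketch of some of the knobs that page covers, grouped by config section (values are deliberately left blank, since sensible numbers depend on the deployment and Airflow version):

[core]
parallelism = ...                        # max task instances running concurrently
max_active_tasks_per_dag = ...
max_active_runs_per_dag = ...

[scheduler]
max_tis_per_query = ...
max_dagruns_to_create_per_loop = ...
max_dagruns_per_loop_to_schedule = ...
scheduler_heartbeat_sec = ...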

@paramjeet01 (Author) commented Apr 13, 2024

@ephraimbuddy, I also saw that the DAGs were stuck in the scheduled state; after restarting the scheduler, everything works fine now. I found that the executor was reporting no open slots available; attaching a screenshot of the metrics.
[Screenshot: executor metrics, 2024-04-13]

@paramjeet01 (Author)

This is a similar issue to #36998 and #36478.

paramjeet01 changed the title from "Tasks are in queued state for a longer time" to "Tasks are in queued state for a longer time and executor slots are exhausted often" on Apr 14, 2024
@changqian9

We hit the same issue twice, with the same observation: it happened when executor open slots dropped below 0.
[Screenshot: executor open slots metric, 2024-04-14]
cc @paramjeet01

@paramjeet01 (Author)

@jscheffl, can you remove the pending-response label?

jscheffl removed the pending-response and needs-triage labels on Apr 15, 2024
@paramjeet01 (Author) commented Apr 16, 2024

After reviewing various GitHub and Stack Overflow discussions, I've updated the following configuration and migrated to Airflow 2.7.2 with apache-airflow-providers-cncf-kubernetes 8.0.0:

[scheduler]
task_queued_timeout : 90
max_dagruns_per_loop_to_schedule : 128
max_dagruns_to_create_per_loop : 128
max_tis_per_query : 1024

Disabled git-sync.
Additionally, scaling the scheduler to 8 replicas has notably improved performance. The executor-slot exhaustion was resolved by raising max_tis_per_query to a high value. Sorry, I couldn't find the root cause of the issue, but I hope this helps.
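
For anyone replicating this, a rough sketch of how the above might look in the official Helm chart's values.yaml (the key paths assume the chart's config, scheduler, and dags.gitSync sections; the numbers are just the ones reported above):

config:
  scheduler:
    task_queued_timeout: 90
    max_dagruns_per_loop_to_schedule: 128
    max_dagruns_to_create_per_loop: 128
    max_tis_per_query: 1024
scheduler:
  replicas: 8
dags:
  gitSync:
    enabled: false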

@paramjeet01 (Author) commented Apr 17, 2024

After observing for some time, we encountered instances where the executor open slots were approaching negative values, leading to tasks becoming stuck in the scheduled state. Restarting all the scheduler pods resolved this on Airflow v2.8.3 with apache-airflow-providers-cncf-kubernetes v8.0.0.
[Screenshot: executor open slots approaching negative values]

@paramjeet01 (Author)

We have also observed that pods are not cleaned up after task completion; all the pods are stuck in the SUCCEEDED state.

@paramjeet01 (Author)

Sorry, the above comment is a false positive. We are customizing our KubernetesPodOperator (KPO) and missed adding on_finish_action, so the pods were stuck in the SUCCEEDED state. After adding it, all the pods are removed properly.
We were also able to mitigate the executor-slot leak by adding a cronjob that restarts our schedulers once in a while.
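
For anyone customizing the KubernetesPodOperator in the same way, a minimal sketch of passing on_finish_action so completed pods get cleaned up (DAG id, task id, and image are illustrative; on_finish_action is available in recent cncf-kubernetes provider versions):

from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(dag_id="example_kpo_cleanup", start_date=datetime(2024, 1, 1), schedule=None) as dag:
    run_job = KubernetesPodOperator(
        task_id="run_job",
        name="run-job",
        image="busybox",
        cmds=["sh", "-c", "echo done"],
        # Delete the pod once the task finishes so completed pods don't pile up.
        on_finish_action="delete_pod",
    )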

dirrao added the provider:cncf-kubernetes label on May 10, 2024
@dirrao (Collaborator) commented May 11, 2024

@paramjeet01
You can use the Airflow num_runs configuration parameter to restart the scheduler container based on your needs.
https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html

@dirrao (Collaborator) commented May 11, 2024

This issue is related to the watcher not being able to scale and process events on time, which leads to many completed pods piling up over time.
Related: #22612
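
As a stop-gap while the watcher lags behind, a hedged sketch of how the leftover completed pods could be inspected and removed manually (the airflow namespace is an assumption about the deployment):

# list worker pods that completed but were never deleted
kubectl get pods -n airflow --field-selector=status.phase=Succeeded

# remove them in bulk
kubectl delete pods -n airflow --field-selector=status.phase=Succeeded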

@paramjeet01 (Author)

@dirrao, AFAIK the purpose of the Airflow num_runs configuration parameter changed a while ago, and it can no longer be used to restart the scheduler. run_duration, which was previously used to restart the scheduler, has also been removed.
https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html#num-runs
https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html#remove-run-duration

@sunank200 (Collaborator)

If I understood this correctly, the performance issues with tasks in the queued state were mitigated by adjusting max_tis_per_query, scaling scheduler replicas, and implementing periodic scheduler restarts. @paramjeet01 tried periodic restarts of all scheduler pods to temporarily resolve the issue.

Related Issues: #36998, #22612

@ephraimbuddy (Contributor)

Can anyone try this patch #40183 for the scheduler restarting issue?
