After upgrading to 2.5.3, Dag Processing time increased dramatically. #30593
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so - no need to wait for approval.
Thanks. Looking at it, it seems like a regression, but it could also be the "metrics" improvement. Is it possible that you share some excerpts of logs from the dag file processor showing parsing the files?
We definitely need the log. I looked at the change and I can think of only one reason why the metrics could have increased - namely that there is a recurring problem with some database operation related to callbacks, and retries simply increase the time it takes to run the operations several times until they fail. This might indicate a consistent problem - either a bug in Airflow or some deadlock/operational issue with your database - but we need more data to take a closer look and diagnose it fully. Can you tell us how many dag file processors you have @wookiist? So I have two kind requests @wookiist:
Longer explanation (and the theory I have): the processing time you see is not only for parsing, it also includes running callbacks related to the DAG being processed. When you have callbacks related to a DAG file that cannot be run in the worker, they are executed by the DAG file processor after the file is parsed. How it works is that the main parsing loop will start a processor for each file and feed it the callbacks to execute. The change you mentioned does not directly contribute to the time, but it produces callbacks that the file processor executes. And those callbacks actually contribute to the time you see - so the increase might not come from "parsing" but from callback processing, and it could be that the db_retry mechanism introduced in the PR you mentioned has a problem where it (accidentally) sends multiple repeated callbacks to the processor - and that increases the processing time. Having the logs, we will be able to see if that is happening, and likely we should be able to see the reason (possibly database logs might give more explanation of what's going on). cc: @mhenc I think this one is closely related to the standalone file processor case and might need your insights once we get more information.
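For illustration, here is a very rough sketch of the flow described above, using simplified, hypothetical names rather than Airflow's actual classes: the manager loop starts a processor per DAG file and hands it the pending callbacks, so the duration reported per file covers both parsing and callback execution.

```python
from dataclasses import dataclass, field


@dataclass
class CallbackRequest:
    """A callback queued for a DAG file (e.g. on_failure or sla_miss)."""
    dag_file: str
    kind: str


@dataclass
class FileProcessor:
    """Processes one DAG file: parsing first, then any queued callbacks."""
    dag_file: str
    callbacks: list = field(default_factory=list)

    def run(self) -> float:
        parse_seconds = self._parse()
        callback_seconds = self._run_callbacks()
        # The duration reported per file covers both phases, so repeated or
        # slow callbacks inflate it even when parsing itself stays fast.
        return parse_seconds + callback_seconds

    def _parse(self) -> float:
        return 0.0  # placeholder for actually importing/parsing the DAG file

    def _run_callbacks(self) -> float:
        return 0.0  # placeholder for executing on_failure/on_success/sla_miss


def manager_loop(dag_files, pending_callbacks):
    """Main loop: start a processor per file and feed it its callbacks."""
    durations = {}
    for dag_file in dag_files:
        callbacks = [c for c in pending_callbacks if c.dag_file == dag_file]
        durations[dag_file] = FileProcessor(dag_file, callbacks).run()
    return durations
```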
Thanks for looking, @potiuk!
0. "Can you tell us how many dag file processors you have @wookiist?" Only one dag processor is being used.
1. "Could you please increase log level for dag file processor to debug?" Yes, I tested it on a dev airflow cluster, and here is the log file: log file
2. "Can you please take a look at the setting of the scheduler 'max_callbacks_per_loop' configuration (default is 20)." I think you're probably referring to the log file
graph
Unfortunately, I don't see any major changes that I can think of. Immediately after deployment the value changed dramatically, but after an hour or so it was back to what it was before. I'm of the same opinion. It's unfortunate that I can't drill down into the metrics, but I was suspecting that something in the retry logic was adding repeated behaviour. Unfortunately I couldn't get the database logs, and based on the dag processor debug logs alone I'm also wondering if there's something wrong with the retry logic. Please correct me if I'm misinterpreting the log files. By any chance, could you explain what callbacks are involved in the DAG? I also see that the @provide_session and @retry_db_transaction decorators are used on separate functions, but it's still possible to call a function with the @retry_db_transaction decorator from within a function with the @provide_session decorator.
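To make the decorator point above concrete, here is a minimal, self-contained sketch using simplified stand-ins (assumed to behave roughly like airflow.utils.session.provide_session and airflow.utils.retries.retry_db_transaction, but not the real implementations): a retry-decorated helper can indeed be called from inside a session-decorated function, so each retry re-runs the query within the same provided session.

```python
import functools
import time


def provide_session(func):
    """Inject a stand-in session if the caller did not pass one."""
    @functools.wraps(func)
    def wrapper(*args, session=None, **kwargs):
        if session is None:
            session = object()  # stand-in for opening a real SQLAlchemy session
        return func(*args, session=session, **kwargs)
    return wrapper


def retry_db_transaction(retries=3):
    """Retry the wrapped call when a (stand-in) database error is raised."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except RuntimeError:  # stand-in for sqlalchemy's OperationalError
                    if attempt == retries:
                        raise
                    time.sleep(attempt)  # simple backoff before retrying
        return wrapper
    return decorate


@retry_db_transaction(retries=3)
def _fetch_callbacks(session):
    """Stand-in for a DB query that may be retried on transient errors."""
    return []


@provide_session
def process_file(path, session=None):
    # A retry-decorated helper called from a session-decorated function:
    # every retry re-runs the query inside the same provided session.
    return _fetch_callbacks(session=session)


if __name__ == "__main__":
    print(process_file("dags/example.py"))
```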
Edit: As you were.
Thanks, @argibbs!
Sorry for letting this slip a bit -> the callbacks are mainly on_failure/on_success and sla_miss callbacks. I still cannot see, though, how the change you mentioned could impact it. The retry logic does not seem to be triggered in your case. Are you absolutely sure that just cherry-picking that one change is causing those changes?
@potiuk Yes, I can guarantee that we only cherry-picked that commit.
Thanks. You can also test 2.6.0rc1, which has just been released, and see if you see the same problem there.
We are seeing a lot of parsing processes exiting with -1:
And messages like:
This version is borderline unusable due to this regression...
@george-zubrienko -> I think your issue is rather unrelated. And it lacks a lot of details - for example, I have no idea whether "This version" means 2.5.3 or a 2.6.0rc (and which rc -> we are about to release rc3 in a few hours, as some errors were found). Also, your message is not actionable as it lacks a lot of details (like OS/configuration, executor, deployment details, etc.). And the SIGKILL you see is likely an indication of a problem with resources (maybe you have too little memory, for example). Could you please open a NEW issue and provide all the details you are asked for - including versions, OS and some more logs - and take a look at the resource usage and the reasons for killing the processes in your deployment before creating it. Without doing all that, your message is simply not actionable.
@potiuk I commented here because we experienced the same problem with Dag processing stats from 2.4.3:
Re memory, the dag processor was running with 4Gi and using around 1.5Gi of that. And again, downgrading to 2.4.3 resolves the problem altogether.
But your problem is likely different.
Please open a new issue with more details.
Could you explain why you think it is different? I have a feeling such an issue will be marked as a duplicate. We have exactly the same problem, with dag processing time rising from 1s to 60s, and exactly the same logs, minus the SIGKILL part.
I do not know everything you know, and you have not provided any information about it. What database do you have? What version? What deployment? What other observations do you have? I have far too little information to assess whether it is the same issue. And as I mentioned above, your logs do not add any information that could help to solve the problem. If you want to help solve the problem - please provide more information explaining your circumstances, otherwise there is no way anyone can do anything with what you provided. On the other hand, if you spend maybe 15 minutes describing all the details, gathering more evidence and describing your (completely unknown to anyone looking here) configuration and circumstances, there is a great chance that you will actually help in diagnosing the issue - and help the people who try to help people like you to diagnose it (mostly in their free time, to solve problems in the software you got for free). Even if such an issue is later diagnosed as the same and closed as a duplicate, it might actually be helpful if it has some information. And when it turns out to be a different reason (which is likely, because your problem is 60s vs. 1s, not 10s vs. 2s), then this can turn into two different streams of conversation that will be far easier to treat separately. By assuming yourself that you have the same problem (even if you do not know it) and then questioning my kind request to create a new issue, you also add a lot of noise to the conversation - without adding value. So let me repeat for the third time - can you please open a new issue and add many more details describing your circumstances @george-zubrienko. Pretty please.
Linking the issue here in case any relation is found #30884
@wookiist -> related to the conversation in #30884 - are you also using Variables in the top-level code of your DAGs (or other DB access)? Would it be possible to run an experiment and see if removing it (hardcoding temporarily) would mitigate the problem? (Not suggesting that as a solution, but more as an investigation aid.)
I believe #30899 should fix the problem. If you could apply it to 2.5.3 (it will be a slightly different code change - adding "only_if_necessary=True" to the heartbeat method in here) to test it: But I am pretty sure that's it.
The standalone file processor as of apache#30278 accidentally introduced an artificial delay between dag processing runs by adding a heartbeat but failing to set the "only_if_necessary" flag to True. If your dag file processing was fast (faster than the scheduler job_heartbeat_sec), this introduced an unnecessary pause before the next dag file processor loop (until that time had passed); it also inflated the dag_processing_last_duration metric (it would always show at least job_heartbeat_sec). Adding the "only_if_necessary" flag fixes the problem. Fixes: apache#30593 Fixes: apache#30884
(cherry picked from commit 00ab45f)
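As a rough illustration of the mechanism the commit message describes (hypothetical names, not Airflow's actual job or processor classes): an unconditional heartbeat sleeps out the remainder of job_heartbeat_sec when called too early, padding every parse loop, while an only_if_necessary-style guard returns immediately when the heartbeat is not yet due.

```python
import time


class StandInJob:
    """Stand-in job with a heartbeat interval (think scheduler job_heartbeat_sec)."""

    def __init__(self, heartrate: float):
        self.heartrate = heartrate
        self.latest_heartbeat = time.monotonic()

    def heartbeat(self, only_if_necessary: bool = False) -> None:
        since_last = time.monotonic() - self.latest_heartbeat
        if only_if_necessary and since_last < self.heartrate:
            return  # not due yet: skip instead of waiting
        if since_last < self.heartrate:
            # Without the guard, an early heartbeat waits out the rest of the
            # interval, which is the artificial pause described above.
            time.sleep(self.heartrate - since_last)
        self.latest_heartbeat = time.monotonic()


def process_one_file(job: StandInJob, parse) -> float:
    """One iteration of the processing loop, returning the measured duration."""
    start = time.monotonic()
    parse()
    # The regression is equivalent to calling job.heartbeat() unconditionally
    # here; the fix is equivalent to passing only_if_necessary=True.
    job.heartbeat(only_if_necessary=True)
    return time.monotonic() - start


if __name__ == "__main__":
    job = StandInJob(heartrate=5.0)
    # With the guard, this prints roughly the parse time (~0.1s); without it,
    # the reported duration would always be at least the heartbeat interval.
    print(f"{process_one_file(job, lambda: time.sleep(0.1)):.1f}s")
```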
Amazing, it's 5 sec! Maybe your commit will resolve the problem.
@potiuk Thanks, Potiuk!
Fantastic. 2.6.0 is now out with the fix included, BTW.
Apache Airflow version
2.5.3
What happened
I upgraded my airflow cluster from 2.5.2 to 2.5.3, after which strange things started happening.
I'm currently using a standalone dagProcessor, and the parsing time that used to be about 2 seconds has suddenly increased to about 10 seconds.
It seems odd because I haven't made any changes other than the version upgrade - is there something I can look into? Thanks in advance! 🙇🏼
What you think should happen instead
I believe that the time it takes to parse a DAG should be roughly constant, or at least vary only a little; it shouldn't take as long as it does now.
How to reproduce
If you cherry-pick this commit into the 2.5.2 stable code, the issue will recur.
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Anything else
This issue still persists, and restarting the Dag Processor has not resolved the issue.
Are you willing to submit PR?
Code of Conduct