After upgrading to 2.5.3, Dag Processing time increased dramatically. #30593

Closed
wookiist opened this issue Apr 12, 2023 · 24 comments · Fixed by #30899
Labels
affected_version:2.5 Issues Reported for 2.5 area:core kind:bug This is a clearly a bug

Comments

@wookiist
Contributor

Apache Airflow version

2.5.3

What happened

I upgraded my airflow cluster from 2.5.2 to 2.5.3, after which strange things started happening.
I'm currently using a standalone dag processor, and the parsing time that used to take about 2 seconds has suddenly increased to about 10 seconds.
It seems strange because I haven't made any changes other than the version upgrade, but is there something I can look into? Thanks in advance! 🙇🏼
[screenshot: dag processing time graph]

What you think should happen instead

I believe the time it takes to parse a DAG should be roughly constant, or at least vary only slightly; it shouldn't take as long as it does now.

How to reproduce

If you cherry-pick #30079 into the 2.5.2 stable code, the issue will recur.

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

  • Kubernetes 1.21 Cluster
  • 1.7.0 helm chart
  • standalone dag processor
  • using kubernetes executor
  • using mysql database

Anything else

This issue still persists, and restarting the Dag Processor has not resolved the issue.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

wookiist added the area:core, kind:bug, and needs-triage labels on Apr 12, 2023
@boring-cyborg

boring-cyborg bot commented Apr 12, 2023

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@potiuk
Member

potiuk commented Apr 12, 2023

Thanks. Looking at it, this seems like a regression, but it could also be the "metrics" improvement. Is it possible for you to share some excerpts of logs from the dag file processor showing the parsing of the files?

potiuk removed the needs-triage label on Apr 12, 2023
@potiuk
Member

potiuk commented Apr 12, 2023

If you cherrypick #30079 into 2.5.2 stable code, the issue will recur.

Ah. Fantastic. That will help us a LOT to see what's going on. Thanks for narrowing it down, @wookiist!

@potiuk
Member

potiuk commented Apr 12, 2023

We definitely need the log. I looked at the change, and I can think of only one reason why the metrics could have increased: there is a recurring problem with some database operation related to callbacks, and the retries simply increase the time it takes to run the operations several times until they fail. This might indicate a consistent problem, either a bug in Airflow or some deadlock/operational issue with your database, but we need more data to take a closer look and diagnose it fully.

Can you tell us how many dag file processors you have, @wookiist?

So I have two kind requests @wookiist:

  1. Could you please increase the log level for the dag file processor to debug (it can be set only for the processor so it does not impact other components): https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#logging-level (this can be done by adding the AIRFLOW__LOGGING__LOGGING_LEVEL env variable and restarting the processor) and share a gist with the logs of the dag file processor. Also, if you could check whether any error messages are produced in your database logs around the times the dag file processor runs, signalling that some queries are failing, that would help a lot as well.

  2. Can you please take a look at the setting of the scheduler max_callbacks_for_loop configuration (default is 20)? If you could lower it to, say, 5 (if you have 20), and see if that helps. I have a theory that this parameter might make your queries either run long or run into deadlocks, and that might cause the queries to be retried a few times. (A quick way to check the effective values is sketched below.)
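For reference, a quick way to verify the effective values of those settings from inside the deployed image is a short Python check like this (a minimal sketch; it only assumes the standard [logging]/[scheduler] sections from the Airflow configuration reference):

```python
# Minimal sketch: print the configuration the dag file processor will actually
# use. Run `python` inside the dag-processor container/pod.
from airflow.configuration import conf

# Can be overridden per component via the AIRFLOW__LOGGING__LOGGING_LEVEL env variable.
print("logging_level:", conf.get("logging", "logging_level"))

# Number of callbacks fetched per dag file processor loop (default 20).
print("max_callbacks_per_loop:", conf.getint("scheduler", "max_callbacks_per_loop"))
```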

Longer explanation (and theory I have):

The processing time you see is not only for parsing; it also includes running callbacks related to the DAG being processed. When you have callbacks related to a DAG file that cannot be run in the worker, they are executed by the DAG file processor after the file is parsed. How it works is that the main parsing loop starts a processor for each file and feeds it with the callbacks to execute. The change you mentioned does not directly contribute to the time, but it produces callbacks that the file processor executes, and those callbacks actually contribute to the time you see. So the increase you observe might not come from "parsing" but from callback processing, and it could be that the db_retry mechanism introduced in the PR you mentioned has a problem causing it to (accidentally) send multiple repeated callbacks to the processor, which increases the processing time.

Having the logs, we will be able to see whether that is happening, and we should likely be able to see the reason (the database logs might give more explanation of what's going on).
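To make the callback part concrete, here is a hypothetical minimal DAG (not taken from this issue) with a failure callback; with a standalone dag processor, callbacks that cannot run on the worker are handed to the dag file processor and executed after the file is parsed, so anything slow in them shows up as extra "parsing" time:

```python
# Hypothetical example, not from the reporter's DAGs.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator


def notify_failure(context):
    # Anything slow here (HTTP calls, DB queries, ...) is added to the
    # dag file processor's per-file processing time when it runs the callback.
    print(f"Task {context['task_instance'].task_id} failed")


with DAG(
    dag_id="callback_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
    on_failure_callback=notify_failure,
):
    EmptyOperator(task_id="noop")
```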

cc: @mhenc I think that one is closely related to standalone file processor case and might need your insights once we get more information

@wookiist
Contributor Author

wookiist commented Apr 13, 2023

Thanks for looking, @potiuk !

0. Can you tell us how many dag file processors you have @wookiist ?

Only one dag processor is being used

1. could you please increase log level for dag file processor to debug

Yes, I tested it on the dev airflow cluster, and here is the log file.

log file

[2023-04-13T04:52:47.287+0000] {retries.py:80} DEBUG - Running DagFileProcessorManager._fetch_callbacks_with_retries with retries. Try 1 of 3
[2023-04-13T04:52:47.287+0000] {manager.py:680} DEBUG - Fetching callbacks from the database.
[2023-04-13T04:52:47.294+0000] {retries.py:80} DEBUG - Running DagWarning._purge_inactive_dag_warnings_with_retry with retries. Try 1 of 3
[2023-04-13T04:52:47.306+0000] {manager.py:1055} DEBUG - Started a process (PID: 37) to generate tasks for /opt/airflow/dags/data/
[2023-04-13T04:52:47.307+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/data/ finished
[2023-04-13T04:52:47.309+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor7-Process' pid=36 parent=9 stopped exitcode=0>
[2023-04-13T04:52:47.310+0000] {manager.py:1019} DEBUG - 1/2 DAG parsing processes running
[2023-04-13T04:52:47.310+0000] {manager.py:1021} DEBUG - 35 file paths queued for processing
[2023-04-13T04:52:47.324+0000] {manager.py:864} INFO - 
================================================================================
DAG File Processing Stats

File Path                                                                                PID  Runtime      # DAGs    # Errors  Last Runtime    Last Run
-------------------------------------------------------------------------------------  -----  ---------  --------  ----------  --------------  -------------------
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      37  0.01s             0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      0           0
/opt/airflow/dags/data/                                                                                      1           0          5.01s           2023-04-13T04:52:37
/opt/airflow/dags/data/                                                                                      1           0          5.02s           2023-04-13T04:52:42
/opt/airflow/dags/data/                                                                                      1           0          5.02s           2023-04-13T04:52:22
/opt/airflow/dags/data/                                                                                      1           0          5.01s           2023-04-13T04:52:27
/opt/airflow/dags/data/                                                                                      1           0          5.12s           2023-04-13T04:52:17
/opt/airflow/dags/data/                                                                                      1           0          5.03s           2023-04-13T04:52:32
/opt/airflow/dags/data/                                                                                      1           0          5.04s           2023-04-13T04:52:47
/opt/airflow/dags/data/                                                                                      1           0          5.18s           2023-04-13T04:52:17
================================================================================
[2023-04-13T04:52:47.810+0000] {settings.py:408} DEBUG - Disposing DB connection pool (PID 37)
[2023-04-13T04:52:52.300+0000] {base_job.py:240} DEBUG - [heartbeat]
[2023-04-13T04:52:52.301+0000] {retries.py:80} DEBUG - Running DagFileProcessorManager._fetch_callbacks_with_retries with retries. Try 1 of 3
[2023-04-13T04:52:52.301+0000] {manager.py:680} DEBUG - Fetching callbacks from the database.
[2023-04-13T04:52:52.308+0000] {retries.py:80} DEBUG - Running DagWarning._purge_inactive_dag_warnings_with_retry with retries. Try 1 of 3
[2023-04-13T04:52:52.343+0000] {manager.py:1055} DEBUG - Started a process (PID: 38) to generate tasks for /opt/airflow/dags/data/B
[2023-04-13T04:52:52.344+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/data/C finished
[2023-04-13T04:52:52.345+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor8-Process' pid=37 parent=9 stopped exitcode=0>
[2023-04-13T04:52:52.346+0000] {manager.py:1019} DEBUG - 1/2 DAG parsing processes running
[2023-04-13T04:52:52.346+0000] {manager.py:1021} DEBUG - 34 file paths queued for processing
[2023-04-13T04:52:52.513+0000] {settings.py:408} DEBUG - Disposing DB connection pool (PID 38)
[2023-04-13T04:52:57.299+0000] {base_job.py:240} DEBUG - [heartbeat]
[2023-04-13T04:52:57.300+0000] {retries.py:80} DEBUG - Running DagFileProcessorManager._fetch_callbacks_with_retries with retries. Try 1 of 3
[2023-04-13T04:52:57.300+0000] {manager.py:680} DEBUG - Fetching callbacks from the database.
[2023-04-13T04:52:57.305+0000] {retries.py:80} DEBUG - Running DagWarning._purge_inactive_dag_warnings_with_retry with retries. Try 1 of 3
[2023-04-13T04:52:57.316+0000] {manager.py:1055} DEBUG - Started a process (PID: 39) to generate tasks for /opt/airflow/dags/data/A.py
[2023-04-13T04:52:57.319+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/data/B.py finished
[2023-04-13T04:52:57.320+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor9-Process' pid=38 parent=9 stopped exitcode=0>
[2023-04-13T04:52:57.321+0000] {manager.py:1019} DEBUG - 1/2 DAG parsing processes running
[2023-04-13T04:52:57.321+0000] {manager.py:1021} DEBUG - 33 file paths queued for processing
[2023-04-13T04:52:57.516+0000] {settings.py:408} DEBUG - Disposing DB connection pool (PID 39)
[2023-04-13T04:53:02.314+0000] {base_job.py:240} DEBUG - [heartbeat]
[2023-04-13T04:53:02.314+0000] {retries.py:80} DEBUG - Running DagFileProcessorManager._fetch_callbacks_with_retries with retries. Try 1 of 3
[2023-04-13T04:53:02.315+0000] {manager.py:680} DEBUG - Fetching callbacks from the database.
[2023-04-13T04:53:02.319+0000] {retries.py:80} DEBUG - Running DagWarning._purge_inactive_dag_warnings_with_retry with retries. Try 1 of 3
[2023-04-13T04:53:02.328+0000] {manager.py:1055} DEBUG - Started a process (PID: 40) to generate tasks for /opt/airflow/dags/data/D.py
[2023-04-13T04:53:02.329+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/data/A.py finished
[2023-04-13T04:53:02.329+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor10-Process' pid=39 parent=9 stopped exitcode=0>
[2023-04-13T04:53:02.330+0000] {manager.py:1019} DEBUG - 1/2 DAG parsing processes running
[2023-04-13T04:53:02.330+0000] {manager.py:1021} DEBUG - 32 file paths queued for processing
[2023-04-13T04:53:02.471+0000] {settings.py:408} DEBUG - Disposing DB connection pool (PID 40)
[2023-04-13T04:53:07.321+0000] {base_job.py:240} DEBUG - [heartbeat]
[2023-04-13T04:53:07.322+0000] {retries.py:80} DEBUG - Running DagFileProcessorManager._fetch_callbacks_with_retries with retries. Try 1 of 3
[2023-04-13T04:53:07.322+0000] {manager.py:680} DEBUG - Fetching callbacks from the database.
[2023-04-13T04:53:07.326+0000] {retries.py:80} DEBUG - Running DagWarning._purge_inactive_dag_warnings_with_retry with retries. Try 1 of 3

2. Can you please take a look at the setting of scheduler 'max_callbacks_for_loop configuration (default is 20) .

I think you're probably referring to the max_callbacks_per_loop value, and I changed that to 5.
This is the log file after I changed the value, and this is the graph change.

log file

[2023-04-13T05:15:43.600+0000] {base_job.py:240} DEBUG - [heartbeat]
[2023-04-13T05:15:43.601+0000] {retries.py:80} DEBUG - Running DagFileProcessorManager._fetch_callbacks_with_retries with retries. Try 1 of 3
[2023-04-13T05:15:43.601+0000] {manager.py:680} DEBUG - Fetching callbacks from the database.
[2023-04-13T05:15:43.611+0000] {retries.py:80} DEBUG - Running DagWarning._purge_inactive_dag_warnings_with_retry with retries. Try 1 of 3
[2023-04-13T05:15:43.630+0000] {manager.py:1055} DEBUG - Started a process (PID: 185) to generate tasks for /opt/airflow/dags/data/A.py
[2023-04-13T05:15:43.632+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/data/0.py finished
[2023-04-13T05:15:43.637+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor91-Process' pid=184 parent=6 stopped exitcode=0>
[2023-04-13T05:15:43.637+0000] {manager.py:1019} DEBUG - 1/2 DAG parsing processes running
[2023-04-13T05:15:43.637+0000] {manager.py:1021} DEBUG - 29 file paths queued for processing
[2023-04-13T05:15:43.684+0000] {manager.py:864} INFO - 
================================================================================
DAG File Processing Stats

File Path                                                                                PID  Runtime      # DAGs    # Errors  Last Runtime    Last Run
-------------------------------------------------------------------------------------  -----  ---------  --------  ----------  --------------  -------------------
/opt/airflow/dags/data/                                                                                         0           1  5.04s           2023-04-13T05:14:43
/opt/airflow/dags/data/                                                                                         0           0  5.21s           2023-04-13T05:10:53
/opt/airflow/dags/data/                                                                                         0           0  4.89s           2023-04-13T05:10:58
/opt/airflow/dags/data/                                                                                         0           0  5.15s           2023-04-13T05:14:53
/opt/airflow/dags/data/                                                                                         0           0  5.01s           2023-04-13T05:11:08
/opt/airflow/dags/data/                                                                                         0           0  5.03s           2023-04-13T05:11:03
/opt/airflow/dags/data/                                                                                         0           0  5.08s           2023-04-13T05:11:13
/opt/airflow/dags/data/                                                                                         0           0  5.02s           2023-04-13T05:11:18
/opt/airflow/dags/data/                                                                                         0           1  5.01s           2023-04-13T05:13:13
/opt/airflow/dags/data/                                                                                         0           0  5.61s           2023-04-13T05:14:38
/opt/airflow/dags/data/                                                                                         0           1  5.07s           2023-04-13T05:13:18
/opt/airflow/dags/data/                                                                                         0           1  5.05s           2023-04-13T05:13:38
/opt/airflow/dags/data/                                                                                         0           1  4.99s           2023-04-13T05:13:43
/opt/airflow/dags/data/                                                                                         0           1  8.10s           2023-04-13T05:13:51
/opt/airflow/dags/data/                                                                                         0           1  4.99s           2023-04-13T05:13:23
/opt/airflow/dags/data/                                                                                         1           0  18.92s          2023-04-13T05:14:43
/opt/airflow/dags/data/                                                                                         1           0  5.09s           2023-04-13T05:14:17
/opt/airflow/dags/data/                                                                                         1           0  5.08s           2023-04-13T05:14:17
/opt/airflow/dags/data/                                                                                         1           0  5.06s           2023-04-13T05:13:33
/opt/airflow/dags/data/                                                                                         1           0  17.68s          2023-04-13T05:14:06
/opt/airflow/dags/data/                                                                                         1           0  5.14s           2023-04-13T05:12:43
/opt/airflow/dags/data/                                                                                         1           0  4.92s           2023-04-13T05:12:48
/opt/airflow/dags/data/                                                                                         1           0  5.02s           2023-04-13T05:13:08
/opt/airflow/dags/data/                                                                                         1           0  5.09s           2023-04-13T05:12:53
/opt/airflow/dags/data/                                                                                         1           0  5.02s           2023-04-13T05:12:38
/opt/airflow/dags/data/                                                                                         1           0  5.01s           2023-04-13T05:12:33
/opt/airflow/dags/data/                                                                                         1           0  5.02s           2023-04-13T05:13:03
/opt/airflow/dags/data/                                                                                         1           0  5.01s           2023-04-13T05:12:58
/opt/airflow/dags/data/                                                                                         1           0  5.05s           2023-04-13T05:15:23
/opt/airflow/dags/data/                                                                                         1           0  4.99s           2023-04-13T05:15:28
/opt/airflow/dags/data/                                                                  185  0.02s             1           0  5.03s           2023-04-13T05:12:13
/opt/airflow/dags/data/                                                                                         1           0  5.09s           2023-04-13T05:15:08
/opt/airflow/dags/data/                                                                                         1           0  5.04s           2023-04-13T05:12:18
/opt/airflow/dags/data/                                                                                         1           0  4.99s           2023-04-13T05:15:38
/opt/airflow/dags/data/                                                                                         1           0  5.06s           2023-04-13T05:15:13
/opt/airflow/dags/data/                                                                                         1           0  5.01s           2023-04-13T05:12:23
/opt/airflow/dags/data/                                                                                         1           0  5.03s           2023-04-13T05:12:28
/opt/airflow/dags/data/                                                                                         1           0  5.12s           2023-04-13T05:15:03
/opt/airflow/dags/data/                                                                                         1           0  5.05s           2023-04-13T05:15:43
/opt/airflow/dags/data/                                                                                         1           0  4.99s           2023-04-13T05:15:18
/opt/airflow/dags/data/                                                                                         1           0  5.05s           2023-04-13T05:15:33
/opt/airflow/dags/data/                                                                                         1           0  4.91s           2023-04-13T05:14:58
/opt/airflow/dags/data/                                                                                         1           0  9.75s           2023-04-13T05:14:06
/opt/airflow/dags/data/                                                                                         1           0  5.02s           2023-04-13T05:13:28
================================================================================
[2023-04-13T05:15:43.853+0000] {settings.py:408} DEBUG - Disposing DB connection pool (PID 185)
[2023-04-13T05:15:48.616+0000] {base_job.py:240} DEBUG - [heartbeat]
[2023-04-13T05:15:48.617+0000] {retries.py:80} DEBUG - Running DagFileProcessorManager._fetch_callbacks_with_retries with retries. Try 1 of 3
[2023-04-13T05:15:48.618+0000] {manager.py:680} DEBUG - Fetching callbacks from the database.
[2023-04-13T05:15:48.645+0000] {retries.py:80} DEBUG - Running DagWarning._purge_inactive_dag_warnings_with_retry with retries. Try 1 of 3
[2023-04-13T05:15:48.657+0000] {manager.py:1055} DEBUG - Started a process (PID: 186) to generate tasks for /opt/airflow/dags/data/B.py
[2023-04-13T05:15:48.658+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/data/A.py finished
[2023-04-13T05:15:48.660+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor92-Process' pid=185 parent=6 stopped exitcode=0>
[2023-04-13T05:15:48.661+0000] {manager.py:1019} DEBUG - 1/2 DAG parsing processes running
[2023-04-13T05:15:48.661+0000] {manager.py:1021} DEBUG - 28 file paths queued for processing
[2023-04-13T05:15:48.974+0000] {settings.py:408} DEBUG - Disposing DB connection pool (PID 186)
[2023-04-13T05:15:53.671+0000] {base_job.py:240} DEBUG - [heartbeat]
[2023-04-13T05:15:53.684+0000] {retries.py:80} DEBUG - Running DagFileProcessorManager._fetch_callbacks_with_retries with retries. Try 1 of 3
[2023-04-13T05:15:53.686+0000] {manager.py:680} DEBUG - Fetching callbacks from the database.
[2023-04-13T05:15:53.708+0000] {retries.py:80} DEBUG - Running DagWarning._purge_inactive_dag_warnings_with_retry with retries. Try 1 of 3
[2023-04-13T05:15:53.742+0000] {manager.py:1055} DEBUG - Started a process (PID: 187) to generate tasks for /opt/airflow/dags/data/C.py
[2023-04-13T05:15:53.757+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/data/B.py finished
[2023-04-13T05:15:53.758+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor93-Process' pid=186 parent=6 stopped exitcode=0>
[2023-04-13T05:15:53.762+0000] {manager.py:1019} DEBUG - 1/2 DAG parsing processes running
[2023-04-13T05:15:53.762+0000] {manager.py:1021} DEBUG - 27 file paths queued for processing
[2023-04-13T05:15:54.682+0000] {settings.py:408} DEBUG - Disposing DB connection pool (PID 187)
[2023-04-13T05:15:58.656+0000] {base_job.py:240} DEBUG - [heartbeat]
[2023-04-13T05:15:58.657+0000] {retries.py:80} DEBUG - Running DagFileProcessorManager._fetch_callbacks_with_retries with retries. Try 1 of 3
[2023-04-13T05:15:58.658+0000] {manager.py:680} DEBUG - Fetching callbacks from the database.
[2023-04-13T05:15:58.664+0000] {retries.py:80} DEBUG - Running DagWarning._purge_inactive_dag_warnings_with_retry with retries. Try 1 of 3

graph

[screenshots: dag processing time graphs]

Unfortunately, I don't see any major changes. Immediately after deployment, the value changed dramatically, but after an hour or so it was back to what it was before.

I'm of the same opinion. It's unfortunate that I can't drill down into the metrics, but I was suspecting that something in the retry logic was adding repeated operations.

Unfortunately, I couldn't get the database logs, and based on the dag processor debug logs alone I'm also wondering if there's something wrong with the retry logic. Please correct me if I'm misinterpreting the log files.

By any chance, could you explain what callbacks are involved in the DAG?

And here I see that the @provide_session and @retry_db_transaction decorators are used on separate functions, but it's still possible to call a function decorated with @retry_db_transaction from within a function decorated with @provide_session.
Is there anything going on here that could be causing the problem?
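For readers following along, here is a rough sketch of the pattern being asked about; the decorator names are the real Airflow utilities, but the function bodies are made up for illustration:

```python
# Rough sketch, for illustration only; the function bodies are hypothetical.
from sqlalchemy import text
from sqlalchemy.orm import Session

from airflow.utils.retries import retry_db_transaction
from airflow.utils.session import NEW_SESSION, provide_session


@retry_db_transaction  # retries the wrapped DB transaction on OperationalError
def _fetch_callbacks(session: Session):
    # stand-in for the real callback-fetching query
    return session.execute(text("SELECT 1")).scalar()


@provide_session  # injects a session when the caller does not pass one
def process_file(session: Session = NEW_SESSION):
    # The retried function is called with the session provided here, so a retry
    # re-runs the query against the same session the outer function already holds.
    return _fetch_callbacks(session=session)
```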

@argibbs
Contributor

argibbs commented Apr 17, 2023

@wookiist Do you use SLAs?

Edit:
nvm - I was looking at your logs and thinking it reminded me of problems I've seen in the past with callback overload (in my case SLAs, but not necessarily yours). But re-reading the bug some more, I realise that's a side effect of the problem.

As you were.

@wookiist
Contributor Author

Thanks, @argibbs!
If a side effect of the problem is an increase in callback time, what would you define as the problem?
I didn't see any SLAs in the helm template; could you share what they are? 🙇🏼
Also, if you have solved it, it would be great if you could share how you solved it.

@potiuk
Member

potiuk commented Apr 23, 2023

By any chance, could you explain what callbacks are involved in the DAG?

Sorry for letting this slip a bit -> the callbacks are on_failure/on_success and sla_miss callbacks mainly.

I still cannot see though how the change you mentioned could impact it. Retry logic does not seem to be triggered in your case. Are you absolutely sure that just cherry-picking that one change is causing those changes?

@wookiist
Contributor Author

@potiuk Yes, I can guarantee that we only cherry-picked that commit.
I'll test with 2.5.3 without that commit, and then we can see what goes on!
I'll share the result.

@potiuk
Member

potiuk commented Apr 26, 2023

Thanks. You can also test 2.6.0rc1, which has just been released, and see if you get the same problem there.

@george-zubrienko

george-zubrienko commented Apr 26, 2023

We are seeing a lot of parsing processes exiting with -1:

/opt/airflow/dags/ecco_airflow/dags/image_processing/product_image_load.py                                                0          -1  56.68s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/known_consumers/known_consumers.py                                                    0          -1  56.64s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/monitoring/row_counts.py                                                              0          -1  56.67s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/omnichannel/base.py                                                                   0          -1  56.66s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/omnichannel/oc_data.py                                                                0          -1  56.67s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/omnichannel/oc_stream.py                                                              0          -1  56.52s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/reporting/reporting_data_foundation.py                                                0          -1  56.63s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/retail_analysis/retail_analysis_dbt.py                                                0          -1  56.66s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/rfm_segments/rfm_segments.py                                                          0          -1  56.02s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/utils/airflow.py                                                                           0          -1  56.65s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/bronze/aad_users_listing.py                                                           1           0  55.51s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/bronze/funnel_io.py                                                                   1           0  56.13s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/bronze/iar_param.py                                                                   1           0  56.50s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/bronze/sfmc_copy.py                                                                   1           0  56.59s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/bronze/us_legacy_datawarehouse.py                                                     1           0  55.15s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/cdp/ecco_cdp_auditing.py                                                              1           0  56.54s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/cdp/ecco_cdp_budget_daily_phasing.py                                                  1           0  56.63s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/cdp/ecco_cdp_gold_rm_tests.py                                                         1           0  55.00s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/consumer_entity_matching/graph_entity_matching.py                                     1           0  56.67s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/data_backup/data_backup.py                                                            1           0  56.69s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/hive/adhoc_entity_publish.py                                                          1           0  55.33s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/image_regression/train.py                                                             1           0  56.63s          2023-04-26T12:56:15
/opt/airflow/dags/ecco_airflow/dags/maintenance/db_maintenance.py                                                         1           0  56.58s          2023-04-26T12:56:15

And messages like:

[2023-04-26T12:56:15.322+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/ecco_airflow/dags/bronze/us_legacy_datawarehouse.py finished
[2023-04-26T12:56:15.323+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor68-Process' pid=116 parent=7 stopped exitcode=0>
[2023-04-26T12:56:15.323+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/ecco_airflow/dags/cdp/ecco_cdp_gold_rm_tests.py finished
[2023-04-26T12:56:15.323+0000] {processor.py:296} DEBUG - Waiting for <ForkProcess name='DagFileProcessor69-Process' pid=122 parent=7 stopped exitcode=0>
[2023-04-26T12:56:15.324+0000] {manager.py:979} DEBUG - Processor for /opt/airflow/dags/ecco_airflow/dags/bronze/streaming/sap_inventory_feed.py finished
[2023-04-26T12:56:15.324+0000] {processor.py:314} DEBUG - Waiting for <ForkProcess name='DagFileProcessor70-Process' pid=128 parent=7 stopped exitcode=-SIGKILL>
[2023-04-26T12:56:15.324+0000] {manager.py:986} ERROR - Processor for /opt/airflow/dags/ecco_airflow/dags/bronze/streaming/sap_inventory_feed.py exited with return code -9.

This version is borderline unusable due to this regression...

@potiuk
Member

potiuk commented Apr 26, 2023

@george-zubrienko -> I think your issue is rather unrelated.

And it lacks a lot of details - for example, I have no idea whether "This version" means 2.5.3 or a 2.6.0 rc (and which rc -> we are about to release rc3 in a few hours as some errors were found). Also, your message is not actionable as it lacks a lot of details (like OS, configuration, executor, deployment details, etc.). And the SIGKILL you see is likely an indication of a problem with resources (maybe you have too little memory, for example).

Could you please open a NEW issue and provide all the details you are asked for - including versions, OS, and some more logs - and take a look at the resource usage and the reasons for the killed processes in your deployment before creating it. Without doing all that, your message is simply not actionable.

@george-zubrienko

george-zubrienko commented Apr 26, 2023

@potiuk I commented here because we experienced the same problem with this version = 2.5.3, upgraded from 2.4.3.

Dag processing stats from 2.4.3:

/opt/airflow/dags/ecco_airflow/dags/image_regression/train.py                                                             1           0  1.34s           2023-04-26T14:19:08
/opt/airflow/dags/ecco_airflow/dags/known_consumers/known_consumers.py                                                    1           0  1.12s           2023-04-26T14:19:00
/opt/airflow/dags/ecco_airflow/dags/maintenance/db_maintenance.py                                                         1           0  0.63s           2023-04-26T14:18:27
/opt/airflow/dags/ecco_airflow/dags/monitoring/row_counts.py                                                              1           0  3.74s           2023-04-26T14:18:45
/opt/airflow/dags/ecco_airflow/dags/omnichannel/oc_data.py                                                                1           0  1.21s           2023-04-26T14:18:47
/opt/airflow/dags/ecco_airflow/dags/omnichannel/oc_stream.py                                                              1           0  1.22s           2023-04-26T14:18:30
/opt/airflow/dags/ecco_airflow/dags/reporting/reporting_data_foundation.py                                                1           0  1.39s           2023-04-26T14:19:08
/opt/airflow/dags/ecco_airflow/dags/retail_analysis/retail_analysis_dbt.py                                                1           0  1.32s           2023-04-26T14:18:51
/opt/airflow/dags/ecco_airflow/dags/rfm_segments/rfm_segments.py                                                          1           0  1.20s           2023-04-26T14:18:34

Re memory, the dag processor was running with 4Gi and using around 1.5Gi of that. And again, downgrading to 2.4.3 resolves the problem altogether.

@potiuk
Member

potiuk commented Apr 26, 2023

@potiuk I commented here because we experienced the same problem with this version = 2.5.3, upgraded from 2.4.3.

But your problem is likely different.

@potiuk
Member

potiuk commented Apr 26, 2023

Please open a new issue with more details.

@george-zubrienko

@potiuk I commented here because we experienced the same problem with this version = 2.5.3, upgraded from 2.4.3.

But your problem is likely different.

Could you explain why you think it is different? I have a feeling such an issue will be marked as a duplicate. We have exactly the same problem, with dag processing time rising from 1s to 60s, and exactly the same logs, minus the SIGKILL part.

@potiuk
Member

potiuk commented Apr 26, 2023

Could you explain why you think it is different? I have a feeling such an issue will be marked as a duplicate. We have exactly the same problem, with dag processing time rising from 1s to 60s, and exactly the same logs, minus the SIGKILL part.

I do not know everything you know and have not provided any information about: what database you have, what version, what deployment, what other observations you have made. I have far too little information to assess whether it is the same issue. And as I mentioned above, your logs do not add any information that could help solve the problem. If you want to help solve the problem, please provide more information explaining your circumstances; otherwise there is no way anyone can do anything with what you provided.

On the other hand, if you spend maybe 15 minutes describing all the details, gathering more evidence, and describing your (completely unknown to anyone looking here) configuration and circumstances, there is a great chance that you will actually help in diagnosing the issue - and help the people who try to help people like you diagnose it (mostly in their free time, to solve problems in software you got for free). Even if such an issue is later diagnosed as the same and closed as a duplicate, it might actually be helpful if it contains some information. And if it turns out to be a different cause (which is likely, because your problem is 60s vs. 1s, not 10s vs. 2s), then this can turn into two different streams of conversation that will be far easier to treat separately.

By assuming yourself that you have the same problem (even though you do not know it) and then questioning my kind request to create a new issue, you also add a lot of noise to the conversation without adding value.

So let me repeat for the third time - can you please open a new issue and add many more details describing your circumstances, @george-zubrienko. Pretty please.

@george-zubrienko

Linking the issue here in case any relation is found #30884

@potiuk
Member

potiuk commented Apr 27, 2023

@wookiist -> related to the conversation in #30884 - are you also using Variables in the top-level code of your DAGs (or other DB access)? Would it be possible to run an experiment and see if removing it (hardcoding temporarily) mitigates the problem? (Not suggesting that as a solution, more as an investigation aid.)
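For anyone wanting to run that experiment, the difference looks roughly like this (hypothetical DAG snippet, not taken from this issue; the Variable name is made up):

```python
# Hypothetical snippet; "data_bucket" is a made-up Variable name.
from airflow.models import Variable

# Top-level access: executed on every parse of the file by the dag processor,
# so it adds a DB round-trip (plus any retry/wait) to the parsing time.
BUCKET = Variable.get("data_bucket", default_var="my-bucket")

# The experiment suggested above - hardcode temporarily to take the DB out of parsing:
# BUCKET = "my-bucket"

# Or defer the lookup to task runtime with a template, e.g. in a BashOperator:
# bash_command="echo {{ var.value.data_bucket }}"
```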

@potiuk
Member

potiuk commented Apr 27, 2023

@wookiist -> what's your https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#job-heartbeat-sec ?
@potiuk
Member

potiuk commented Apr 27, 2023

I believe #30899 should fix the problem. If you could apply it to 2.5.3 (it will be a slightly different code change - adding "only_if_necessary=True" to the heartbeat call here) to test it:

https://github.com/apache/airflow/pull/30278/files#diff-827b7a469438ffd1b172ae295134a84c8a914c9ea4c9ea2a7d3de1f1d5aa6bb6R580

But I am pretty sure that's it.
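For context, what that change amounts to is roughly the following (an illustrative fragment, not the exact diff from #30899):

```python
# Illustrative fragment, not the exact diff. The standalone dag processor
# heartbeats its job on every iteration of the parsing loop.

# 2.5.3 behaviour: an unconditional heartbeat waits until job_heartbeat_sec
# (default 5s) has passed since the previous heartbeat, so every fast parse is
# padded to ~5s and the dag_processing_last_duration metric is inflated.
job.heartbeat()

# With the fix: skip the heartbeat (and its wait) when it is not yet due.
job.heartbeat(only_if_necessary=True)
```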

potiuk added a commit to potiuk/airflow that referenced this issue Apr 27, 2023
The standalone file processor as of apache#30278 accidentally introduced
an artificial delay between dag processing by adding a heartbeat
but missing to set the "only_if_necessary" flag to True.

If your dag file processing has been fast (faster than the
scheduler job_heartbeat_sec), this introduced an unnecessary pause
before the next dag file processor loop (up until that time
has passed); it also inflated the
dag_processing_last_duration metric (it would always show at least
job_heartbeat_sec).

Adding the "only_if_necessary" flag fixes the problem.

Fixes: apache#30593
Fixes: apache#30884
eladkal added the affected_version:2.5 label and removed the pending-response label on Apr 27, 2023
potiuk added a commit that referenced this issue Apr 27, 2023
ephraimbuddy pushed a commit that referenced this issue Apr 27, 2023
(cherry picked from commit 00ab45f)
@wookiist
Contributor Author

@wookiist -> what's your https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#job-heartbeat-sec ?

Amazing, it's 5 sec! Maybe your commit would resolve the problem.
I'm now building a new image; I'll share the result after testing 🙏🏼
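For anyone checking their own deployment, the value can be read the same way as the other settings (a minimal sketch):

```python
from airflow.configuration import conf

# Default is 5 seconds; with the unconditional heartbeat in 2.5.3 the
# standalone processor pads every fast DAG parse to roughly this value.
print("job_heartbeat_sec:", conf.getint("scheduler", "job_heartbeat_sec"))
```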

@wookiist
Contributor Author

wookiist commented May 2, 2023

@potiuk Thanks Potiuk!
That was the root cause! I modified the code and everything went ok!

[screenshot: dag processing time graph after applying the fix]

@potiuk
Member

potiuk commented May 2, 2023

Fantastic. 2.6.0 is now out with the fix included BTW.
