After upgrading to 2.5.3, Dag Processing time increased dramatically. #30593
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so - no need to wait for approval.
Thanks. Looking at it, it seems like a regression, but it could also be the "metrics" improvement. Is it possible that you share some excerpts of logs from the dag file processor showing parsing the files?
We definitely need the log. I looked at the change and I can think of only one reason why the metrics could have increased - namely that there is a recurring problem with some database operation related to callbacks, and retries simply increase the time it takes to run the operations several times until they fail. This might indicate a consistent problem - either a bug in Airflow or some deadlock/operational issue with your database - but we need more data to take a closer look and diagnose it fully. Can you tell us how many dag file processors you have @wookiist? So I have two kind requests @wookiist:
Longer explanation (and the theory I have): the processing time you see is not only for parsing, it also includes running callbacks related to the DAG being processed. When you have callbacks related to a DAG file that cannot be run in the worker, they are executed by the DAG file processor after the file is parsed. How it works is that the main parsing loop will start a processor for each file and feed it the callbacks to execute. The change you mentioned does not directly contribute to the time, but it produces callbacks that the file processor executes. And those callbacks actually contribute to the time you see - so the increase might not come from "parsing" but from callback processing, and it could be that the db_retry mechanism introduced in the PR you mentioned has a problem where it (accidentally) sends multiple repeated callbacks to the processor - and that increases the processing time. Having the logs, we will be able to see if that is happening, and likely we should be able to see the reason (possibly database logs might give more explanation of what's going on). cc: @mhenc I think this one is closely related to the standalone file processor case and might need your insights once we get more information.
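For illustration, here is a very rough sketch of the flow described above, using simplified, hypothetical names rather than Airflow's actual classes: the manager loop starts a processor per DAG file and hands it the pending callbacks, so the duration reported per file covers both parsing and callback execution.

```python
from dataclasses import dataclass, field


@dataclass
class CallbackRequest:
    """A callback queued for a DAG file (e.g. on_failure or sla_miss)."""
    dag_file: str
    kind: str


@dataclass
class FileProcessor:
    """Processes one DAG file: parsing first, then any queued callbacks."""
    dag_file: str
    callbacks: list = field(default_factory=list)

    def run(self) -> float:
        parse_seconds = self._parse()
        callback_seconds = self._run_callbacks()
        # The duration reported per file covers both phases, so repeated or
        # slow callbacks inflate it even when parsing itself stays fast.
        return parse_seconds + callback_seconds

    def _parse(self) -> float:
        return 0.0  # placeholder for actually importing/parsing the DAG file

    def _run_callbacks(self) -> float:
        return 0.0  # placeholder for executing on_failure/on_success/sla_miss


def manager_loop(dag_files, pending_callbacks):
    """Main loop: start a processor per file and feed it its callbacks."""
    durations = {}
    for dag_file in dag_files:
        callbacks = [c for c in pending_callbacks if c.dag_file == dag_file]
        durations[dag_file] = FileProcessor(dag_file, callbacks).run()
    return durations
```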
Thanks for looking, @potiuk!
0. "Can you tell us how many dag file processors you have @wookiist?" Only one dag processor is being used.
1. "Could you please increase log level for dag file processor to debug?" Yes, I tested it on a dev airflow cluster, and here is the log file: log file
2. "Can you please take a look at the setting of the scheduler 'max_callbacks_per_loop' configuration (default is 20)." I think you're probably referring to the log file
graph
Unfortunately, I don't see any major changes that I can think of. Immediately after deployment the value changed dramatically, but after an hour or so it was back to what it was before. I'm of the same opinion. It's unfortunate that I can't drill down into the metrics, but I was suspecting that something in the retry logic was adding repeated behaviour. Unfortunately I couldn't get the database logs, and based on the dag processor debug logs alone I'm also wondering if there's something wrong with the retry logic. Please correct me if I'm misinterpreting the log files. By any chance, could you explain what callbacks are involved in the DAG? I also see that the @provide_session and @retry_db_transaction decorators are used on separate functions, but it's still possible to call a function with the @retry_db_transaction decorator from within a function with the @provide_session decorator.
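To make the decorator point above concrete, here is a minimal, self-contained sketch using simplified stand-ins (assumed to behave roughly like airflow.utils.session.provide_session and airflow.utils.retries.retry_db_transaction, but not the real implementations): a retry-decorated helper can indeed be called from inside a session-decorated function, so each retry re-runs the query within the same provided session.

```python
import functools
import time


def provide_session(func):
    """Inject a stand-in session if the caller did not pass one."""
    @functools.wraps(func)
    def wrapper(*args, session=None, **kwargs):
        if session is None:
            session = object()  # stand-in for opening a real SQLAlchemy session
        return func(*args, session=session, **kwargs)
    return wrapper


def retry_db_transaction(retries=3):
    """Retry the wrapped call when a (stand-in) database error is raised."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except RuntimeError:  # stand-in for sqlalchemy's OperationalError
                    if attempt == retries:
                        raise
                    time.sleep(attempt)  # simple backoff before retrying
        return wrapper
    return decorate


@retry_db_transaction(retries=3)
def _fetch_callbacks(session):
    """Stand-in for a DB query that may be retried on transient errors."""
    return []


@provide_session
def process_file(path, session=None):
    # A retry-decorated helper called from a session-decorated function:
    # every retry re-runs the query inside the same provided session.
    return _fetch_callbacks(session=session)


if __name__ == "__main__":
    print(process_file("dags/example.py"))
```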
Edit: As you were.
Thanks, @argibbs!
Sorry for letting this slip a bit -> the callbacks are mainly on_failure/on_success and sla_miss callbacks. I still cannot see, though, how the change you mentioned could impact it. The retry logic does not seem to be triggered in your case. Are you absolutely sure that just cherry-picking that one change is causing those changes?
@potiuk Yes, I can guarantee that we only cherry-picked that commit.
Thanks. You can also test 2.6.0rc1, which has just been released, and see if you see the same problem there.
We are seeing a lot of parsing processes exiting with -1:
And messages like:
This version is borderline unusable due to this regression...
@george-zubrienko -> I think your issue is rather unrelated. And it lacks a lot of details - for example, I have no idea whether "This version" means 2.5.3 or a 2.6.0rc (and which rc -> we are about to release rc3 in a few hours, as some errors were found). Also, your message is not actionable as it lacks a lot of details (like OS/configuration, executor, deployment details, etc.). And the SIGKILL you see is likely an indication of a problem with resources (maybe you have too little memory, for example). Could you please open a NEW issue and provide all the details you are asked for - including versions, OS and some more logs - and take a look at the resource usage and the reasons for killing the processes in your deployment before creating it. Without doing all that, your message is simply not actionable.
@potiuk I commented here because we experienced the same problem with Dag processing stats from 2.4.3:
Re memory, the dag processor was running with 4Gi and using around 1.5Gi of that. And again, downgrading to 2.4.3 resolves the problem altogether.
But your problem is likely different.
Please open a new issue with more details.
Could you explain why you think it is different? I have a feeling such an issue will be marked as a duplicate. We have exactly the same problem, with dag processing time rising from 1s to 60s, and exactly the same logs, minus the SIGKILL part.
I do not know everything you know, and you have not provided any information about it. What database do you have? What version? What deployment? What other observations do you have? I have far too little information to assess whether it is the same issue. And as I mentioned above, your logs do not add any information that could help to solve the problem. If you want to help solve the problem - please provide more information explaining your circumstances, otherwise there is no way anyone can do anything with what you provided. On the other hand, if you spend maybe 15 minutes describing all the details, gathering more evidence and describing your (completely unknown to anyone looking here) configuration and circumstances, there is a great chance that you will actually help in diagnosing the issue - and help the people who try to help people like you to diagnose it (mostly in their free time, to solve problems in the software you got for free). Even if such an issue is later diagnosed as the same and closed as a duplicate, it might actually be helpful if it has some information. And when it turns out to be a different reason (which is likely, because your problem is 60s vs. 1s, not 10s vs. 2s), then this can turn into two different streams of conversation that will be far easier to treat separately. By assuming yourself that you have the same problem (even if you do not know it) and then questioning my kind request to create a new issue, you also add a lot of noise to the conversation - without adding value. So let me repeat for the third time - can you please open a new issue and add many more details describing your circumstances @george-zubrienko. Pretty please.
Linking the issue here in case any relation is found #30884
@wookiist -> related to the conversation in #30884 - are you also using Variables in the top-level code of your DAGs (or other DB access)? Would it be possible to run an experiment and see if removing it (hardcoding temporarily) would mitigate the problem? (Not suggesting that as a solution, but more as an investigation aid.)
I believe #30899 should fix the problem. If you could apply it to 2.5.3 (it will be a slightly different code change - adding "only_if_necessary=True" to the heartbeat method in here) to test it: But I am pretty sure that's it.
The standalone file processor as of apache#30278 accidentally introduced an artificial delay between dag processing runs by adding a heartbeat but failing to set the "only_if_necessary" flag to True. If your dag file processing was fast (faster than the scheduler job_heartbeat_sec), this introduced an unnecessary pause before the next dag file processor loop (until that time had passed); it also inflated the dag_processing_last_duration metric (it would always show at least job_heartbeat_sec). Adding the "only_if_necessary" flag fixes the problem. Fixes: apache#30593 Fixes: apache#30884
(cherry picked from commit 00ab45f)
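As a rough illustration of the mechanism the commit message describes (hypothetical names, not Airflow's actual job or processor classes): an unconditional heartbeat sleeps out the remainder of job_heartbeat_sec when called too early, padding every parse loop, while an only_if_necessary-style guard returns immediately when the heartbeat is not yet due.

```python
import time


class StandInJob:
    """Stand-in job with a heartbeat interval (think scheduler job_heartbeat_sec)."""

    def __init__(self, heartrate: float):
        self.heartrate = heartrate
        self.latest_heartbeat = time.monotonic()

    def heartbeat(self, only_if_necessary: bool = False) -> None:
        since_last = time.monotonic() - self.latest_heartbeat
        if only_if_necessary and since_last < self.heartrate:
            return  # not due yet: skip instead of waiting
        if since_last < self.heartrate:
            # Without the guard, an early heartbeat waits out the rest of the
            # interval, which is the artificial pause described above.
            time.sleep(self.heartrate - since_last)
        self.latest_heartbeat = time.monotonic()


def process_one_file(job: StandInJob, parse) -> float:
    """One iteration of the processing loop, returning the measured duration."""
    start = time.monotonic()
    parse()
    # The regression is equivalent to calling job.heartbeat() unconditionally
    # here; the fix is equivalent to passing only_if_necessary=True.
    job.heartbeat(only_if_necessary=True)
    return time.monotonic() - start


if __name__ == "__main__":
    job = StandInJob(heartrate=5.0)
    # With the guard, this prints roughly the parse time (~0.1s); without it,
    # the reported duration would always be at least the heartbeat interval.
    print(f"{process_one_file(job, lambda: time.sleep(0.1)):.1f}s")
```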
Amazing, it's 5 sec! Maybe your commit will resolve the problem.
@potiuk Thanks, Potiuk!
Fantastic. 2.6.0 is now out with the fix included, BTW.
Apache Airflow version
2.5.3
What happened
I upgraded my airflow cluster from 2.5.2 to 2.5.3, after which strange things started happening.
I'm currently using a standalone dagProcessor, and the parsing time that used to be about 2 seconds has suddenly increased to about 10 seconds.
It seems odd because I haven't made any changes other than the version upgrade - is there something I can look into? Thanks in advance! 🙇🏼
What you think should happen instead
I believe that the time it takes to parse a DAG should be roughly constant, or at least vary only a little; it shouldn't take as long as it does now.
How to reproduce
If you cherry-pick this commit into the 2.5.2 stable code, the issue will recur.
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Anything else
This issue still persists, and restarting the Dag Processor has not resolved the issue.
Are you willing to submit PR?
Code of Conduct