Reloading of Logging and ORM in Dag Processing can be disruptive in spawn mode #19934
As described in detail in #19860, there was a race condition in starting and terminating DagProcessorAgent that caused us a lot of headaches with flaky test_scheduler_job failures on our CI; after a long investigation, it turned out to be a race condition. It is not very likely, but possible, to happen in production. The race condition involved starting DagProcessorAgent via multiprocessing, where the first action of the agent was changing the process GID to be the same as its PID. If the DagProcessorAgent was terminated quickly (on a busy system) before the process could change the GID, the `reap_process_group` call that was supposed to kill the whole group failed, and the DagProcessorAgent remained running. This problem revealed wrong behaviour of Airflow in some edge conditions when 'spawn' mode was used for starting the DAG processor. Details are described in #19934, but this problem will have to be solved differently (by avoiding ORM reinitialization during DAG processor start). This change also moves the tests for the `spawn` method out of test_scheduler_job.py (they were a remnant of old Airflow and did not really test what they were supposed to test). Instead, tests were added for different spawn modes and for killing the processor agent in both spawn and "default" mode.
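The group-kill mechanics behind this race can be sketched with a minimal POSIX example. Note that the function names below are illustrative stand-ins, not Airflow's actual implementation: the parent can only kill the child's process *group* after the child has made itself a group leader; if the kill arrives before that, the group does not exist yet and the child survives.

```python
import os
import signal
import time

def start_child():
    pid = os.fork()
    if pid == 0:
        # Child: make itself a process-group leader, mimicking what the
        # DagProcessorAgent does as one of its first actions after starting.
        os.setpgid(0, 0)
        time.sleep(30)       # simulate a long-running agent
        os._exit(0)
    return pid

def reap_process_group(pid):
    # Killing by *group* only works once the child's pgid equals its pid;
    # if this runs before the child's os.setpgid call, os.killpg raises
    # ProcessLookupError (no such group yet) and the child keeps running -
    # which is exactly the race described above.
    os.killpg(pid, signal.SIGKILL)

def demo():
    pid = start_child()
    time.sleep(0.5)          # give the child time to become a group leader
    reap_process_group(pid)
    os.waitpid(pid, 0)       # reap the killed child
    return "reaped"

if __name__ == "__main__":
    print(demo())
```

With the half-second delay the kill succeeds; drop the delay on a loaded machine and `os.killpg` can fire before `os.setpgid`, reproducing the failure mode.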
This issue has been automatically marked as stale because it has been open for 365 days without any activity. There have been several Airflow releases since the last activity on this issue. Kindly recheck the report against the latest Airflow version and let us know if the issue is reproducible. The issue will be closed in the next 30 days if no further activity occurs from the issue author.
This issue has been closed because it has not received a response from the issue author.
Apache Airflow version
main (development)
Operating System
Unix with 'spawn' method, MacOS, Windows (spawn is default there)
Versions of Apache Airflow Providers
Not relevant
Deployment
Other
Deployment details
Any deployment that uses the 'spawn' multiprocessing start method is susceptible to this problem. On Linux this is configured via the (undocumented) CORE__MP_START_METHOD="spawn" setting, but 'spawn' is also the default method in Windows and MacOS environments.
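The key difference between 'spawn' and 'fork' can be demonstrated with a small self-contained script (this is illustrative stand-alone code, not Airflow code): a spawned child re-imports the parent module in a fresh interpreter, so any module-level state the parent mutated after import (an ORM engine, logging configuration, ...) is rebuilt from scratch in the child.

```python
import os
import subprocess
import sys
import tempfile
import textwrap

# A throwaway script, written to a temp file so the spawned child has a
# real module file to re-import (required by the "spawn" start method).
SCRIPT = textwrap.dedent("""
    import multiprocessing

    STATE = {"where": "set at import"}   # stands in for a module-level engine

    def child(queue):
        # Under "spawn" this module was re-imported in a fresh interpreter,
        # so STATE was rebuilt and the parent's mutation is invisible here.
        queue.put(STATE["where"])

    if __name__ == "__main__":
        STATE["where"] = "mutated in parent"
        ctx = multiprocessing.get_context("spawn")
        queue = ctx.Queue()
        proc = ctx.Process(target=child, args=(queue,))
        proc.start()
        print("parent sees:", STATE["where"])
        print("spawned child sees:", queue.get())
        proc.join()
""")

def demo() -> str:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(SCRIPT)
        path = f.name
    try:
        return subprocess.run(
            [sys.executable, path], capture_output=True, text=True, check=True
        ).stdout
    finally:
        os.unlink(path)

if __name__ == "__main__":
    print(demo())
```

Under 'fork' the child would inherit the parent's memory and see the mutated value; under 'spawn' it sees the pristine import-time value, which is why re-running initialization code in the spawned child behaves so differently.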
What happened
Whenever the DagFileProcessor performs a reload of Airflow settings (always when it starts), it also re-initializes the SQLAlchemy/ORM engine. In the spawn mode of multiprocessing, this reinitialization has an unintended side effect: it corrupts the SQLAlchemy session data in rather unpredictable ways. In our CI tests it resulted in objects created after the reload being unavailable to the session that queried for them. This is particularly disruptive because the "parent" process of the DagProcessor manager currently is the Scheduler, which relies heavily on ORM queries, and any corruption of the session data might lead to problems that are extremely difficult to diagnose and debug, especially since such a reload might happen asynchronously.
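The "objects disappear after the reload" symptom can be loosely illustrated with plain sqlite3 (a deliberately simplified analogy, not Airflow or SQLAlchemy code): code holding the old "engine" and code that re-initialized it end up looking at two different databases, so rows written around the reload vanish from one side's point of view.

```python
import sqlite3

def make_engine():
    # Stand-in for ORM engine initialization: each call builds a brand-new,
    # empty in-memory database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dag_run (id INTEGER)")
    return conn

old_engine = make_engine()
old_engine.execute("INSERT INTO dag_run VALUES (1)")   # written before the "reload"

# The settings "reload" re-creates the engine from scratch...
new_engine = make_engine()

# ...and the row inserted before the reload is not visible through it.
rows_old = old_engine.execute("SELECT COUNT(*) FROM dag_run").fetchone()[0]
rows_new = new_engine.execute("SELECT COUNT(*) FROM dag_run").fetchone()[0]
print(rows_old, rows_new)  # 1 0
```

The real failure is subtler (session state, connection pools, and spawn-mode re-imports interact), but the shape is the same: whichever component queries through the re-initialized engine cannot see what was written through the old one.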
Note - since the only "production" deployment of Airflow is Linux, and this feature is undocumented, it is highly unlikely that this problem will surface in production. However, in our CI environment and in environments where people use Airflow for development, it is quite likely to happen. It might already cause stability issues for MacOS users (and, in the future, Windows users) who run Airflow locally for tests rather than use Airflow in production. While this is lower priority, this group of users is big and important enough that the problem should be fixed.
What you expected to happen
We expect that the "reload" does not happen - specifically, that the ORM engine is not reinitialized when the spawned DagProcessorManager starts. The reload is currently only needed because the logging configuration has to be recreated, specifically for the DAG processor.
Note: In the future, when dag processor is separated to a separate process/command line (planned for the multi-tenant work), this will not be a problem. But the mode where DAG processor remains child process of Airflow is likely to stay, so we should fix it.
How to reproduce
- Add a `sleep(2)` command to the initialization of the `_run_processor_manager` in `airflow/dag_processing/manager.py`
- Add a `sleep(10)` command in the `test_scheduler_keeps_scheduling_pool_full` of `tests/jobs/test_scheduler_job.py`
- Run pytest tests in the following way:
Result:
In the logs you will also see the output below, which is the manifestation of the session being broken. The objects were added a few lines before, but since they were added after the reload, they were not visible to the session:
What happens here: `test_scheduler_multiprocessing_with_spawn_method` finished, but due to the race condition the DagProcessor had not been terminated when the test finished.
Anything else
Very long investigation, but I am happy we found it. The fix for the "race condition" is coming, and the "spawn" tests were moved to where they belong (dag_processor), with some more test cases added: #19935. However, further changes are needed to solve the root cause - the reload of the ORM should not happen in the spawned process.
Are you willing to submit PR?
Code of Conduct