Dispose connections when running tasks with os.fork & CeleryExecutor #13265
Conversation
Without this fix, when using CeleryExecutor and the default config (i.e. `AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=False`), tasks are run via `os.fork` and the pooled connections are shared with the forked process. This causes Celery tasks to hang indefinitely (tasks will stay in the queued state) with the following error:

```
[2020-12-22 18:49:39,085: WARNING/ForkPoolWorker-2] Failed to log action with (psycopg2.DatabaseError) error with status PGRES_TUPLES_OK and no message from the libpq
```

> It’s critical that when using a connection pool, and by extension when using an Engine created via create_engine(), that the pooled connections are not shared to a forked process.

SQLAlchemy docs: https://docs.sqlalchemy.org/en/14/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork
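To illustrate the pattern the SQLAlchemy docs recommend (this is a minimal, hypothetical sketch, not Airflow's actual code), the forked child disposes the inherited pool before touching the database, so it checks out fresh connections instead of reusing sockets shared with the parent. SQLite stands in here for the Postgres metadata database:

```python
import os

from sqlalchemy import create_engine, text
from sqlalchemy.pool import QueuePool

# Sketch only: sqlite stands in for Airflow's metadata DB engine.
engine = create_engine("sqlite://", poolclass=QueuePool)

# Warm the pool in the parent, as a long-running scheduler/worker would.
with engine.connect() as conn:
    conn.execute(text("SELECT 1"))

pid = os.fork()
if pid == 0:
    # Child process: drop the inherited pooled connections, then
    # check out brand-new ones instead of sharing the parent's sockets.
    try:
        engine.dispose()
        with engine.connect() as conn:
            conn.execute(text("SELECT 1"))
        os._exit(0)
    except Exception:
        os._exit(1)

# Parent: wait for the child and confirm it exited cleanly.
_, status = os.waitpid(pid, 0)
child_ok = os.WEXITSTATUS(status) == 0
```

With a real Postgres connection, skipping the `dispose()` call would leave parent and child writing to the same socket, which is how the `PGRES_TUPLES_OK` error above arises.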
The PR most likely needs to run the full matrix of tests because it modifies parts of the core of Airflow. However, committers might decide to merge it quickly and take the risk. If they don't merge it quickly, please rebase it to the latest master at your convenience, or amend the last commit of the PR and push it with --force-with-lease.
(cherry picked from commit 7f8be97)
Hi @kaxil, I'm running an Airflow cluster with v2.5.0, CeleryExecutor and SQLAlchemy 1.4.4, and I actually ran into the same error noted on this PR.
It seems to have happened when making the call at `airflow/jobs/scheduler_job.py`, line 889, in `_run_scheduler_loop`. The link you noted in the PR description recommends that we call the dispose function with … Was there a reason why we opted to leave the …
Actually, after some more debugging, it looks like this issue isn't specific to processing the executor events. This is an error traceback that occurred when the scheduler was fetching `active_runs_of_dags`, which put the scheduler into a bad state.
This failure happened as soon as the application came up, and I'm now wondering if this is related to the connection pool unintentionally being shared across forked processes as well. To continue my investigation, I'm testing out disabling connection pooling on my Airflow cluster. If setting that flag does resolve this issue, I wonder if that means there are still some edge cases that make the use of pooling unsafe with CeleryExecutor (unless we do the converse and launch all tasks with a new Python interpreter).
Just following up on my own bug report on this PR, for anyone facing a similar issue: I wasn't able to find the root cause, but I was able to resolve the problem by disabling connection pooling, i.e. setting `sql_alchemy_pool_enabled` to `False`. I'm not enough of an expert in database connections to put much weight on this, but there seem to be folks recommending the use of `NullPool` when DB access is already managed through PgBouncer, and this is anecdotally consistent with our experience debugging our issue. Using `NullPool` resolved all of the connection issues we were having with our deployment, where some of the schedulers would go into a bad state with the PostgreSQL DB connection; the number of connections was still managed by PgBouncer, with no visible performance degradation from disabling connection pools.
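For reference, here is a sketch of what that workaround amounts to at the SQLAlchemy level (assuming, as above, that an external pooler such as PgBouncer manages connections): with `NullPool`, every checkout opens a fresh DBAPI connection and nothing is cached, so there is no pooled connection to leak across a fork. SQLite stands in for the Postgres metadata DB:

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# Disabling SQLAlchemy-side pooling, as Airflow's
# sql_alchemy_pool_enabled = False does: NullPool opens a fresh
# connection per checkout and closes it on release.
engine = create_engine("sqlite://", poolclass=NullPool)

with engine.connect() as conn:
    value = conn.execute(text("SELECT 1")).scalar()

pool_name = type(engine.pool).__name__  # no connections are retained
```

The trade-off is one connect/disconnect per checkout, which is cheap when PgBouncer sits in front of the database and keeps the real server connections warm.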
Without this fix, when using CeleryExecutor and the default config (i.e. `AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=False`), tasks are run with `os.fork` and the pooled connections are shared with the forked process. This causes Celery tasks to hang indefinitely (tasks will stay in the queued state) with the above error if there are not enough DB connections. SQLAlchemy docs: https://docs.sqlalchemy.org/en/14/core/pooling.html#using-connection-pools-with-multiprocessing-or-os-fork
This is also consistent with what we do in LocalExecutor:
`airflow/executors/local_executor.py`, lines 65 to 69 at `93e4787`
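The SQLAlchemy pooling docs also describe a defensive alternative to disposing the engine at fork time, which may be useful context here: stamp each pooled connection with the pid that created it and refuse to hand it out in a different process. This is a sketch adapted from those docs, not Airflow code:

```python
import os

from sqlalchemy import create_engine, event, exc, text

engine = create_engine("sqlite://")  # stand-in for the metadata DB engine

@event.listens_for(engine, "connect")
def on_connect(dbapi_connection, connection_record):
    # Remember which process opened this connection.
    connection_record.info["pid"] = os.getpid()

@event.listens_for(engine, "checkout")
def on_checkout(dbapi_connection, connection_record, connection_proxy):
    if connection_record.info["pid"] != os.getpid():
        # De-associate the stale connection; DisconnectionError makes
        # the pool retry the checkout with a freshly opened connection.
        connection_record.dbapi_connection = connection_proxy.dbapi_connection = None
        raise exc.DisconnectionError(
            "Connection record belongs to pid %s, attempting to check out in pid %s"
            % (connection_record.info["pid"], os.getpid())
        )

# In the process that created the pool, checkouts succeed normally.
with engine.connect() as conn:
    same_pid_ok = conn.execute(text("SELECT 1")).scalar() == 1
```

Disposing before `os.fork`, as this PR and LocalExecutor do, is simpler when the fork point is known; the pid-check guard covers forks the pool owner cannot anticipate.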
Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards-incompatible changes, please leave a note in UPDATING.md.