Fix fab deserialize user session leak#68100
Open
pcorliss wants to merge 6 commits into
Open
Conversation
5a971c6 to
81cf36a
Compare
Contributor
Author
|
Test failures appear to be environmental, pulling golang from remote URL. I don't see a retry option so will wait until there are commits on main to rebase from. |
Address self-review findings: import airflow.settings at module top instead of inside the test method, and constrain the create_session context-manager mock to the protocol it actually needs. AI-Assisted-By: Claude Opus 4.7
The compat-3.0.6 test environment configures the engine with NullPool, which closes connections on release and exposes no checkedout() method. Skip the leak assertion there — by construction NullPool cannot leak the connection this test guards against. AI-Assisted-By: Claude Opus 4.7
0198985 to
49e90c6
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fix periodic 500 due to idle_in_transaction timeout in FAB auth manager
When we upgraded to Airflow 3.1.8 (apache-airflow-providers-fab 3.4.0) we noticed periodic 500s being returned when connecting to certain API endpoints or browsing the GUI. We traced this back to a SQL error on the api-server.
(Subsequent requests reusing the same
scoped_sessionthen cascade intosqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled backuntil the session is removed.)On investigation we found that SQL queries were being left open in an
idle_in_transactionstate and would time out due to our PostgreSQL settings. In our environment we have our PostgreSQL idle in transaction timeout set to 5 minutes (idle_in_transaction_session_timeout = '5min'inpostgresql.conf). The PostgreSQL default is0("never time out"), so deployments using the default will not see this bug; you can confirm the live value withSHOW idle_in_transaction_session_timeout;.We traced this back to this change:
We think the removal of create_session() is what removed the commit/rollback handler.
We also noticed several follow-up fixes and tried upgrading to Airflow 3.2.2 (apache-airflow-providers-fab 3.6.4), but the errors continued because none of them fix the root cause (the open transaction left behind on a worker thread):
Repro Steps
Deployed Instance
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10s'; SELECT pg_reload_conf();/auth/login./api/v2/version— expected 200 response. Anidle_in_transactionquery is opened in the database (visible viaSELECT pid, state, now()-state_change AS idle_for, LEFT(query, 80) FROM pg_stat_activity WHERE state='idle in transaction';).idle_in_transaction_session_timeout./api/v2/versionagain — expected 500 response withOperationalError/PendingRollbackErrorin the api-server logs.Breeze Script
Local breeze configuration in
files/airflow-breeze-config/environment_variables.env:AIRFLOW__FAB__CACHE_TTL=0disables the TTL cache so every request exercises the cache-miss path.AIRFLOW__API__WORKERS=1keeps a single gunicorn worker so the same anyio worker thread holds the leakedconnection across requests.
Start breeze with PostgreSQL and shorten the idle-in-transaction timeout so the bug is observable in seconds:
Then run the minimal repro shell script
Pre-fix the script exits non-zero with HTTP 500 on step 5. Post-fix it exits 0 with HTTP 200 and no
idle in transactionrow visible at step 3.Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.7) following the guidelines
{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.