Skip to content

Fix fab deserialize user session leak#68100

Open
pcorliss wants to merge 6 commits into
apache:mainfrom
pcorliss:fix-fab-deserialize-user-session-leak
Open

Fix fab deserialize user session leak#68100
pcorliss wants to merge 6 commits into
apache:mainfrom
pcorliss:fix-fab-deserialize-user-session-leak

Conversation

@pcorliss
Copy link
Copy Markdown
Contributor

@pcorliss pcorliss commented Jun 5, 2026

Fix periodic 500 due to idle_in_transaction timeout in FAB auth manager

When we upgraded to Airflow 3.1.8 (apache-airflow-providers-fab 3.4.0) we noticed periodic 500s being returned when connecting to certain API endpoints or browsing the GUI. We traced this back to a SQL error on the api-server.

sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) server closed the connection unexpectedly
    This probably means the server terminated abnormally
    before or while processing the request.

(Subsequent requests reusing the same scoped_session then cascade into sqlalchemy.exc.PendingRollbackError: Can't reconnect until invalid transaction is rolled back until the session is removed.)

On investigation we found that SQL queries were being left open in an idle_in_transaction state and would time out due to our PostgreSQL settings. In our environment we have our PostgreSQL idle in transaction timeout set to 5 minutes (idle_in_transaction_session_timeout = '5min' in postgresql.conf). The PostgreSQL default is 0 ("never time out"), so deployments using the default will not see this bug; you can confirm the live value with SHOW idle_in_transaction_session_timeout;.

We traced this back to this change:

We think the removal of create_session() is what removed the commit/rollback handler.

We also noticed several follow-up fixes and tried upgrading to Airflow 3.2.2 (apache-airflow-providers-fab 3.6.4), but the errors continued because none of them fix the root cause (the open transaction left behind on a worker thread):

Repro Steps

Deployed Instance

  1. Set PostgreSQL idle_in_transaction_session_timeout - ALTER SYSTEM SET idle_in_transaction_session_timeout = '10s'; SELECT pg_reload_conf();
  2. Clear all existing cookies or use an incognito window.
  3. Login via /auth/login.
  4. Navigate to /api/v2/version — expected 200 response. An idle_in_transaction query is opened in the database (visible via SELECT pid, state, now()-state_change AS idle_for, LEFT(query, 80) FROM pg_stat_activity WHERE state='idle in transaction';).
  5. Take no action for at least the duration of PostgreSQL's idle_in_transaction_session_timeout.
  6. Navigate to /api/v2/version again — expected 500 response with OperationalError / PendingRollbackError in the api-server logs.

Breeze Script

Local breeze configuration in
files/airflow-breeze-config/environment_variables.env:

AIRFLOW__FAB__CACHE_TTL=0
AIRFLOW__API__WORKERS=1

AIRFLOW__FAB__CACHE_TTL=0 disables the TTL cache so every request exercises the cache-miss path. AIRFLOW__API__WORKERS=1 keeps a single gunicorn worker so the same anyio worker thread holds the leaked
connection across requests.

Start breeze with PostgreSQL and shorten the idle-in-transaction timeout so the bug is observable in seconds:

breeze start-airflow --backend postgres \
  --postgres-version 16 \
  --integration none \
  --load-default-connections \
  --load-example-dags

# In a separate shell, on the host:
docker exec -u postgres breeze-postgres-1 \
  psql -d airflow -c \
  "ALTER SYSTEM SET idle_in_transaction_session_timeout = '10s'; \
   SELECT pg_reload_conf();"

Then run the minimal repro shell script

#!/usr/bin/env bash
# Reproduce the FAB deserialize_user idle-in-transaction leak.
set -euo pipefail

API="${API:-http://localhost:28080}"
USERNAME="${USERNAME:-admin}"
PASSWORD="${PASSWORD:-admin}"
WAIT_SECONDS="${WAIT_SECONDS:-15}"

JAR=$(mktemp)
trap 'rm -f "$JAR"' EXIT

step() { printf "\n\033[1;34m== %s ==\033[0m\n" "$*"; }

step "1. Authenticate via /auth/login/"
LOGIN_HTML=$(curl -fsS -c "$JAR" -b "$JAR" "$API/auth/login/")
CSRF=$(printf '%s' "$LOGIN_HTML" \
  | grep -oE 'name="csrf_token"[^>]*value="[^"]*"' \
  | head -1 | sed -E 's/.*value="([^"]*)".*/\1/')
curl -fsS -o /dev/null -c "$JAR" -b "$JAR" -X POST "$API/auth/login/" \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  --data-urlencode "username=$USERNAME" \
  --data-urlencode "password=$PASSWORD" \
  --data-urlencode "csrf_token=$CSRF"

step "2. Warm GET /api/v2/version (leaks idle-in-tx scoped_session on worker thread)"
curl -fsS -o /dev/null -w 'HTTP %{http_code}\n' -b "$JAR" "$API/api/v2/version"

step "3. Confirm idle-in-transaction connection holding the FAB User JOIN"
docker exec breeze-postgres-1 psql -U postgres -d airflow -c \
  "SELECT pid, state, now()-state_change AS idle_for, LEFT(query, 80) \
   FROM pg_stat_activity \
   WHERE datname='airflow' AND state='idle in transaction';"

step "4. Sleep ${WAIT_SECONDS}s (> idle_in_transaction_session_timeout)"
sleep "$WAIT_SECONDS"

step "5. Probe GET /api/v2/version — pre-fix: 500; post-fix: 200"
HTTP=$(curl -s -o /tmp/repro-resp.txt -w '%{http_code}' -b "$JAR" "$API/api/v2/version")
echo "HTTP $HTTP"

if [[ "$HTTP" == "500" ]]; then
  echo -e "\n\033[1;31mLEAK REPRODUCED\033[0m"
  exit 1
else
  echo -e "\n\033[1;32mNO LEAK\033[0m"
  exit 0
fi

Pre-fix the script exits non-zero with HTTP 500 on step 5. Post-fix it exits 0 with HTTP 200 and no idle in transaction row visible at step 3.


Was generative AI tooling used to co-author this PR?
  • Yes — Claude Code (Opus 4.7)

Generated-by: Claude Code (Opus 4.7) following the guidelines


  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.

@pcorliss pcorliss requested a review from vincbeck as a code owner June 5, 2026 20:16
@pcorliss pcorliss force-pushed the fix-fab-deserialize-user-session-leak branch from 5a971c6 to 81cf36a Compare June 5, 2026 20:16
@pcorliss
Copy link
Copy Markdown
Contributor Author

pcorliss commented Jun 5, 2026

Test failures appear to be environmental, pulling golang from remote URL. I don't see a retry option so will wait until there are commits on main to rebase from.

pcorliss added 5 commits June 5, 2026 21:33
Address self-review findings: import airflow.settings at module top
instead of inside the test method, and constrain the create_session
context-manager mock to the protocol it actually needs.

AI-Assisted-By: Claude Opus 4.7
The compat-3.0.6 test environment configures the engine with NullPool,
which closes connections on release and exposes no checkedout() method.
Skip the leak assertion there — by construction NullPool cannot leak
the connection this test guards against.

AI-Assisted-By: Claude Opus 4.7
@pcorliss pcorliss force-pushed the fix-fab-deserialize-user-session-leak branch from 0198985 to 49e90c6 Compare June 6, 2026 02:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PendingRollbackError: Can't reconnect until invalid transaction is rolled back. Please rollback() fully before proceeding (Background on this error)

1 participant