Use async DB session for Execution API task-instance heartbeat #67800

Open
Dev-iL wants to merge 1 commit into apache:main from Dev-iL:2605/ti_heartbeat_async

Dev-iL (Collaborator) commented May 31, 2026

Related:

What & why

Converts the Execution API ti_heartbeat route from the synchronous SessionDep to the async AsyncSessionDep, adopting the async metadata engine that already ships in Airflow 3.x (create_async_engine, async_sessionmaker, AsyncSessionDep, create_session_async).

Heartbeat is the highest-QPS write path in the system (one call per running task per heartbeat interval, per worker), so it is a meaningful first production route to run on the async engine. The goal of this PR is behavioral parity with zero regression — not a throughput change. It follows the same low-blast-radius conversion pattern already merged for GET /execution/variables/keys.

What changed

  • Route (execution_api/routes/task_instances.py): ti_heartbeat is now async def and takes AsyncSessionDep; the fast-path UPDATE, the slow-path SELECT … FOR UPDATE (.one()), the Task-Instance-History existence scalar, and the final UPDATE are awaited. No explicit commit() is added — the async dependency commits on success and rolls back (releasing the FOR UPDATE row lock) on exception, mirroring the sync dependency exactly. synchronize_session=False and with_for_update() are unchanged.
  • Tests (versions/head/test_task_instances.py):
    • The two tests that intercept the fast-path UPDATE now patch sqlalchemy.ext.asyncio.AsyncSession with an async interceptor (the route now awaits AsyncSession.execute); assertions are unchanged.
    • Added a reconfigure_async_db_engine autouse fixture to TestTIHealthEndpoint. The async engine binds its connection pool to the event loop that created it, while the test harness builds a fresh app and event loop per test; without this, a pooled connection from a prior test's closed loop is reused and fails (attached to a different loop). This is the same workaround already used by TestWaitDagRun.
  • Docs: documented the asyncpg + transaction-mode PgBouncer prepared-statement caveat on the sql_alchemy_connect_args_async config reference and in the PgBouncer section of the database setup guide.
  • Newsfragment (improvement): notes that the heartbeat endpoint now uses the async engine and that the API server may hold both the sync and async connection pools concurrently, so operators should budget DB max_connections for both.
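
The control flow described in the Route bullet can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the actual Airflow route: `FakeSession`, `session_scope`, `HTTPError`, `heartbeat`, and `handle` are hypothetical names, an in-memory dict stands in for the task-instance table, and a plain lookup stands in for SELECT … FOR UPDATE. It only shows the fast-path/slow-path ordering and the commit-on-success, rollback-on-exception behavior that the description attributes to the session dependency.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass, field


class HTTPError(Exception):
    """Hypothetical stand-in for the framework's HTTP exception."""

    def __init__(self, status: int, reason: str) -> None:
        super().__init__(reason)
        self.status = status
        self.reason = reason


@dataclass
class FakeSession:
    """Hypothetical in-memory stand-in for an async DB session."""

    rows: dict  # ti_id -> {"state": str, "hostname": str, "pid": int, "beats": int}
    archived: set = field(default_factory=set)  # ids that exist only in history
    log: list = field(default_factory=list)  # records commit/rollback calls

    async def fast_update(self, ti_id: str, hostname: str, pid: int) -> int:
        # Emulates a conditional UPDATE that matches only a running row
        # owned by this hostname/pid, and returns the affected rowcount.
        row = self.rows.get(ti_id)
        owner = (hostname, pid)
        if row and row["state"] == "running" and (row["hostname"], row["pid"]) == owner:
            row["beats"] += 1
            return 1
        return 0

    async def commit(self) -> None:
        self.log.append("commit")

    async def rollback(self) -> None:
        self.log.append("rollback")


@asynccontextmanager
async def session_scope(session: FakeSession):
    """Commit if the body succeeds; roll back (releasing any locks) if it raises."""
    try:
        yield session
        await session.commit()
    except Exception:
        await session.rollback()
        raise


async def heartbeat(session: FakeSession, ti_id: str, hostname: str, pid: int) -> int:
    # Fast path: one conditional UPDATE; a single affected row means we are done.
    if await session.fast_update(ti_id, hostname, pid) == 1:
        return 204
    # Slow path: fetch the row (stand-in for SELECT ... FOR UPDATE) to explain the miss.
    row = session.rows.get(ti_id)
    if row is None:
        gone = ti_id in session.archived
        raise HTTPError(410 if gone else 404, "cleared" if gone else "not_found")
    if (row["hostname"], row["pid"]) != (hostname, pid):
        raise HTTPError(409, "running_elsewhere")
    if row["state"] != "running":
        raise HTTPError(409, "not_running")
    row["beats"] += 1  # final UPDATE, reached when the fast-path rowcount was inconclusive
    return 204


async def handle(session: FakeSession, ti_id: str, hostname: str, pid: int) -> int:
    """Run one request inside the session scope and map errors to a status code."""
    try:
        async with session_scope(session) as s:
            return await heartbeat(s, ti_id, hostname, pid)
    except HTTPError as exc:
        return exc.status
```

In this sketch the route body never calls `commit()` itself; the surrounding scope decides between commit and rollback, which is the property the description relies on for releasing the row lock on error.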

Behavioral parity

No change to the endpoint contract: same 204 success, same 404 / 409 (running-elsewhere / not-running) / 410 (cleared-and-archived) responses with identical detail payloads, same request signature. No Execution API version bump. The pre-existing heartbeat tests pass with their assertions unchanged (only the two fast-path interception tests changed what they patch); that unchanged pass is the evidence of parity.

Testing

Heartbeat suite (-k heartbeat, 12 tests covering success, fast path, slow-path fallback, 404, 409, 410, and the rowcount == 0 / unknown-rowcount fall-through branches) passes on all three supported backends:

| Backend  | Async driver | Result   |
|----------|--------------|----------|
| SQLite   | aiosqlite    | 12/12 ✅ |
| Postgres | asyncpg      | 12/12 ✅ |
| MySQL    | aiomysql     | 12/12 ✅ |

ruff and mypy (airflow-core) are clean.

Operational notes

  • Connection budget: heartbeat moving to the async engine means a busy API server can hold both the sync pool (for the remaining sync routes) and the async pool concurrently. Size DB max_connections accordingly (called out in the newsfragment).
  • PgBouncer (transaction pooling): asyncpg uses server-side prepared statements, which break under transaction-mode PgBouncer. This is documented as a deployment-configuration requirement here; shipping pgbouncer-safe async-engine defaults is deferred and tracked in #<tracking-issue>.
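
As a rough illustration of the PgBouncer caveat above, the connect arguments below are the ones commonly suggested for running asyncpg through SQLAlchemy behind a transaction-mode pooler. This is a sketch under assumptions: `PGBOUNCER_SAFE_CONNECT_ARGS` and `unique_statement_name` are hypothetical names, the accepted keys depend on the installed SQLAlchemy and asyncpg versions, and how such a dict is supplied through `sql_alchemy_connect_args_async` should be checked against the config reference rather than taken from here.

```python
from uuid import uuid4


def unique_statement_name() -> str:
    """Return a fresh prepared-statement name for every statement.

    Unique names avoid collisions when a pooler hands the same server
    connection to different clients between transactions.
    """
    return f"__asyncpg_{uuid4()}__"


# Hypothetical dict of connect arguments (an assumption, not Airflow's defaults).
PGBOUNCER_SAFE_CONNECT_ARGS = {
    # asyncpg option: disable the driver's own prepared-statement cache.
    "statement_cache_size": 0,
    # SQLAlchemy asyncpg-dialect option: disable the dialect-level cache.
    "prepared_statement_cache_size": 0,
    # SQLAlchemy asyncpg-dialect option: callable that names prepared statements.
    "prepared_statement_name_func": unique_statement_name,
}
```

Whether all three settings are needed depends on the PgBouncer version and pooling mode in use, so treat the dict as a starting point for testing rather than a recommended default.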

Note to reviewers

A cosmetic Event loop is closed message is logged during session teardown when the async engine is disposed (dispose_orm closing the async engine synchronously). It appears on all three async drivers, is pre-existing engine disposal behavior unrelated to this change, and every run exits 0. Tidying async-engine disposal is out of scope for this PR.

Engine-default hardening for asyncpg/PgBouncer will be tracked in a dedicated issue to be opened soon.


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

Generated-by: Claude Opus 4.8 following the guidelines



Dev-iL requested review from amoghrajesh, ashb and kaxil as code owners May 31, 2026 12:53
Dev-iL requested a review from hussein-awala May 31, 2026 13:04
Dev-iL added the "full tests needed" label May 31, 2026
Convert ti_heartbeat from the synchronous SessionDep to the async AsyncSessionDep, adopting the async metadata engine that already ships in Airflow 3.x. The route's behavior is unchanged: same 204/404/409/410 responses, same fast-path UPDATE / SELECT ... FOR UPDATE fallback, same last_heartbeat_at semantics, no version bump.

Heartbeat's async writes exposed a test-harness issue on Postgres: the async engine binds its connection pool to the event loop that created it, while the test harness builds a fresh app and loop per test, so a pooled connection from a prior test's closed loop was reused. Add a reconfigure_async_db_engine autouse fixture to the heartbeat tests that rebuilds the async session per test, mirroring the existing workaround in TestWaitDagRun.
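
The loop-binding failure described above can be reproduced in miniature with the standard library alone. This is a toy sketch, not the Airflow fixture or the real engine: `ToyPool`, `build_pool`, `checkout_or_error`, and `rebuild_and_checkout` are hypothetical names, and a bare future stands in for a pooled connection's I/O wait.

```python
import asyncio


class ToyPool:
    """Hypothetical pool that remembers the event loop it was created on."""

    def __init__(self) -> None:
        self.loop = asyncio.get_running_loop()

    async def checkout(self) -> str:
        # The wait object belongs to the creating loop, as a driver's I/O waits do.
        fut = self.loop.create_future()
        # Only the owning loop can ever complete this wait.
        if self.loop is asyncio.get_running_loop():
            fut.set_result("conn")
        return await fut


async def build_pool() -> ToyPool:
    return ToyPool()


async def checkout_or_error(pool: ToyPool) -> str:
    try:
        return await pool.checkout()
    except RuntimeError as exc:
        return f"error: {exc}"


async def rebuild_and_checkout() -> str:
    # Roughly what a per-test "reconfigure" step achieves: a pool made on the current loop.
    return await ToyPool().checkout()


if __name__ == "__main__":
    stale_pool = asyncio.run(build_pool())  # created on loop 1, which is then closed
    print(asyncio.run(checkout_or_error(stale_pool)))  # reused on loop 2: RuntimeError text
    print(asyncio.run(rebuild_and_checkout()))  # rebuilt on the current loop: works
```

Each `asyncio.run` call creates and closes its own loop, which mirrors a harness that builds a fresh loop per test; rebuilding the pool inside the current loop is the analogue of the autouse fixture.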

Document the asyncpg + transaction-mode pgbouncer prepared-statement caveat in the sql_alchemy_connect_args_async config reference and the PGBouncer setup guide; engine-default hardening is deferred.
Dev-iL force-pushed the 2605/ti_heartbeat_async branch from 455e42b to 8d811ea June 3, 2026 09:39

See also :ref:`Helm Chart production guide <production-guide:pgbouncer>`

Some Airflow database routes use an async engine (the Execution API, for example). The asyncpg
Member

I think that instead of just noting this (which many users will ignore or miss), we should change the default async driver to psycopg3. The only reason we had it set up as asyncpg was that we were stuck on SQLA 1.4, but you've already fixed that :)

Collaborator Author

Don't we want to keep asyncpg for its better performance? I ran a benchmark recently, and it is still the fastest driver.


# ti_heartbeat runs on the async engine. The async engine binds its pool to
# the event loop that created it (once per process), but the test harness
# builds a fresh FastAPI app and event loop per test, so a pooled connection
Member

I wonder what impact re-creating the connection many times will have on test durations? (Not much for this case, but as we move more things to async?)

Should we maybe look at changing the default event-loop scoping? That also has drawbacks, for sure.

ashb (Member) left a comment

Nice one.

ashb (Member) commented Jun 3, 2026

  • Connection budget: heartbeat moving to the async engine means a busy API server can hold both the sync pool (for the remaining sync routes) and the async pool concurrently. Size DB max_connections accordingly (called out in the newsfragment).
  • PgBouncer (transaction pooling): asyncpg uses server-side prepared statements, which break under transaction-mode PgBouncer. This is documented as a deployment-configuration requirement here; shipping pgbouncer-safe async-engine defaults is deferred and tracked in #.

I think the agent made these up :)

Labels: area:API, area:ConfigTemplates, area:task-sdk, full tests needed, kind:documentation

2 participants