Fix API server OOMKill: release row lock before asset event emission under high concurrency #66932
Open
Bishesh-Shahi wants to merge 1 commit into
Closes #66853.
Problem
Under high concurrency (80+ simultaneous task completions emitting asset events), the API server dies with OOMKill. The root cause is a DB lock contention chain:
- `ti_update_state()` acquires `SELECT task_instance ... FOR UPDATE`, holding a PostgreSQL row lock.
- `register_asset_changes_in_db()` runs multiple slow queries, including `asset_alias_model.asset_events.append(asset_event)`. This ORM `.append()` lazy-loads the entire `asset_events` collection for the alias.
- The connection sits `idle in transaction` while Python processes the results. New workers needing `SELECT task_instance FOR UPDATE` on the same row queue up, each holding a FastAPI threadpool thread.

Fix
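A minimal sketch of why the collection append is so costly here, using stdlib `sqlite3` and a simplified, hypothetical schema in place of the real tables (not the actual Airflow code): an ORM-style append must first load the whole existing collection into memory, while a direct INSERT touches only the new row.

```python
# Contrast the two write patterns against the association table.
# Stdlib sqlite3 stands in for PostgreSQL; schema is simplified.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asset_alias_asset_event (alias_id INTEGER, event_id INTEGER)")
# Simulate an alias that already has thousands of historical events.
conn.executemany(
    "INSERT INTO asset_alias_asset_event VALUES (1, ?)",
    ((i,) for i in range(5000)),
)

# ORM-append-like pattern: the relationship is lazy-loaded in full,
# pulling every existing row into memory while the row lock is held.
existing = conn.execute(
    "SELECT event_id FROM asset_alias_asset_event WHERE alias_id = 1"
).fetchall()  # 5000 rows fetched just to add one
conn.execute("INSERT INTO asset_alias_asset_event VALUES (1, ?)", (len(existing),))

# Direct-INSERT pattern (the fix): no read of the existing collection.
conn.execute("INSERT INTO asset_alias_asset_event VALUES (1, 5001)")

total = conn.execute("SELECT COUNT(*) FROM asset_alias_asset_event").fetchone()[0]
print(total)  # 5002
```

Both patterns end with the same rows in the table; the difference is that the first reads (and holds in memory) the entire existing collection on the write path.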
Two changes:

1. `AssetManager.register_asset_change()` (`assets/manager.py`): Replace `asset_alias_model.asset_events.append(asset_event)` + `session.add(asset_alias_model)` with a direct `INSERT INTO asset_alias_asset_event (alias_id, event_id)`. This eliminates the lazy load of the existing events collection (which can be thousands of rows) while the `task_instance` row lock is held.

2. `ti_update_state()` (`execution_api/routes/task_instances.py`): Add `session.commit()` after the TI state UPDATE and Log writes to release the `task_instance` row lock before running asset registration. Asset registration then runs in a fresh implicit transaction. Registration failures are logged and swallowed -- the task state is already durable at that point.

Testing
- `test_register_asset_change_with_alias_no_lazy_load` -- confirms no SELECT on the `asset_alias_asset_event` collection during registration when pre-existing rows exist
- `test_ti_update_state_to_success_asset_registration_failure_returns_204` -- confirms 204 + TI SUCCESS when asset registration raises after commit
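The commit-then-register ordering that the second test exercises can be sketched with a tiny stand-in session (hypothetical helper names, not the actual Airflow code): the TI state commit happens first, so a later registration failure only rolls back the registration work.

```python
# Sketch of the reordered ti_update_state flow, with a recording
# fake session. Names are illustrative, not the real Airflow API.
import logging

log = logging.getLogger(__name__)

def update_state_then_register(session, register_asset_changes):
    """Commit the TI state first, then attempt asset registration."""
    session.execute("UPDATE task_instance SET state = 'success' WHERE ...")
    session.commit()  # releases the task_instance row lock; state is durable
    try:
        register_asset_changes(session)  # runs in a fresh implicit transaction
        session.commit()
    except Exception:
        # Swallowed: the task state was already committed above.
        log.exception("asset registration failed after TI state commit")
        session.rollback()  # only the registration work is discarded


class RecordingSession:
    """Tiny stand-in for a DB session that records the call order."""
    def __init__(self):
        self.calls = []
    def execute(self, sql):
        self.calls.append("execute")
    def commit(self):
        self.calls.append("commit")
    def rollback(self):
        self.calls.append("rollback")


def failing_registration(session):
    raise RuntimeError("slow asset query blew up")

s = RecordingSession()
update_state_then_register(s, failing_registration)
print(s.calls)  # ['execute', 'commit', 'rollback'] -- state commit survived
```

Because the first `commit()` runs before registration, a registration failure can no longer roll back the state update or prolong the row lock, which is why the endpoint can still return 204.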