Fix API server OOMKill: release row lock before asset event emission under high concurrency #66932

Open
Bishesh-Shahi wants to merge 1 commit into apache:main from Bishesh-Shahi:fix-oom-api-server

Bishesh-Shahi commented May 14, 2026

Closes #66853.

Problem

Under high concurrency (80+ simultaneous task completions emitting asset events), the API server dies with OOMKill. The root cause is a DB lock contention chain:

  1. ti_update_state() runs SELECT ... FROM task_instance ... FOR UPDATE, acquiring a PostgreSQL row lock that is held until the transaction ends.
  2. While holding that lock, register_asset_changes_in_db() runs multiple slow queries including asset_alias_model.asset_events.append(asset_event). This ORM .append() lazy-loads the entire asset_events collection for the alias.
  3. Each slow query leaves the connection idle in transaction while Python processes the results. New workers that need SELECT ... FOR UPDATE on the same task_instance row queue up behind the lock, each occupying a FastAPI threadpool thread while it waits.
  4. With 80+ concurrent completions, thread count grows unbounded until OOMKill.
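
The effect of the chain above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: a `threading.Lock` stands in for the PostgreSQL row lock, `time.sleep` stands in for the slow asset registration queries, and `run_completions` is a hypothetical helper, not Airflow code. It shows that when the slow work runs while the lock is held, completions contending for the same lock are serialized, whereas releasing the lock first lets the slow work overlap.

```python
import threading
import time


def run_completions(n_workers: int, slow_work_under_lock: bool,
                    slow_seconds: float = 0.02) -> float:
    """Return wall-clock seconds for n_workers threads to each finish one completion."""
    # Hypothetical stand-in for the task_instance row lock.
    row_lock = threading.Lock()

    def complete_task() -> None:
        with row_lock:
            # The fast state UPDATE is modelled as a no-op here.
            if slow_work_under_lock:
                # Slow asset registration while the lock is still held.
                time.sleep(slow_seconds)
        if not slow_work_under_lock:
            # Same slow work, but only after the lock has been released.
            time.sleep(slow_seconds)

    threads = [threading.Thread(target=complete_task) for _ in range(n_workers)]
    started = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - started


if __name__ == "__main__":
    print("slow work under lock:", run_completions(10, slow_work_under_lock=True))
    print("slow work after lock:", run_completions(10, slow_work_under_lock=False))
```

In the first case every waiting thread stays blocked for the sum of the slow work ahead of it, which is the queueing behaviour described in steps 3 and 4.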

Fix

Two changes:

1. AssetManager.register_asset_change() (assets/manager.py): Replace asset_alias_model.asset_events.append(asset_event) + session.add(asset_alias_model) with a direct INSERT INTO asset_alias_asset_event (alias_id, event_id). This eliminates the lazy-load of the existing events collection (which can be thousands of rows) while the task_instance row lock is held.

2. ti_update_state() (execution_api/routes/task_instances.py): Add session.commit() after the TI state UPDATE and Log writes to release the task_instance row lock before running asset registration. Asset registration then runs in a fresh implicit transaction. Registration failures are logged and swallowed -- the task state is already durable at that point.
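
The ordering in the second change (commit the state first, then attempt asset registration and swallow its failures) can be sketched as follows. This is a minimal illustration rather than the actual patch: it uses the standard-library `sqlite3` module instead of Airflow's SQLAlchemy session, a simplified `task_instance` table with only `id` and `state` columns, and the hypothetical names `update_state_then_register` and `register_assets`.

```python
import logging
import sqlite3
from typing import Callable

log = logging.getLogger(__name__)


def update_state_then_register(
    conn: sqlite3.Connection,
    ti_id: int,
    register_assets: Callable[[sqlite3.Connection], None],
) -> int:
    """Persist the task state first, then attempt asset registration."""
    conn.execute("UPDATE task_instance SET state = 'success' WHERE id = ?", (ti_id,))
    # Committing here ends the transaction. In PostgreSQL this is the point
    # at which a row lock taken with FOR UPDATE would be released.
    conn.commit()

    try:
        # Registration runs in a new transaction, after the state is durable.
        register_assets(conn)
        conn.commit()
    except Exception:
        # The task state is already committed, so log and carry on.
        conn.rollback()
        log.exception("Asset registration failed for task instance %s", ti_id)
    return 204


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
    db.execute("INSERT INTO task_instance VALUES (1, 'running')")
    db.commit()

    def failing_registration(_conn: sqlite3.Connection) -> None:
        raise RuntimeError("simulated registration failure")

    print(update_state_then_register(db, 1, failing_registration))
```

The trade-off of this design is that a registration failure no longer rolls back the state change, so the failure has to be visible in the logs rather than in the response.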

Testing

  • New: test_register_asset_change_with_alias_no_lazy_load -- confirms that no SELECT is issued against the asset_alias_asset_event collection during registration, even when pre-existing rows exist
  • New: test_ti_update_state_to_success_asset_registration_failure_returns_204 -- confirms 204 + TI SUCCESS when asset registration raises after commit
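
The idea behind the first test (checking which statements are emitted while an event is linked to an alias) can be sketched with the standard library alone. This is an illustration, not the test in the PR: it uses `sqlite3` and its `set_trace_callback` hook rather than SQLAlchemy, the table holds only the `alias_id` and `event_id` columns named above, and `link_via_collection`, `link_direct`, and `traced_statements` are hypothetical helpers. `link_via_collection` emulates the lazy load by reading every existing link before writing.

```python
import sqlite3

INSERT_LINK = "INSERT INTO asset_alias_asset_event (alias_id, event_id) VALUES (?, ?)"


def make_db(existing_links: int = 1000) -> sqlite3.Connection:
    """Create an in-memory association table with pre-existing rows for alias 1."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE asset_alias_asset_event ("
        "alias_id INTEGER NOT NULL, event_id INTEGER NOT NULL, "
        "PRIMARY KEY (alias_id, event_id))"
    )
    conn.executemany(INSERT_LINK, [(1, i) for i in range(existing_links)])
    conn.commit()
    return conn


def link_via_collection(conn: sqlite3.Connection, alias_id: int, event_id: int) -> None:
    """Emulate a lazy-loaded collection: read all existing links, then insert."""
    conn.execute(
        "SELECT event_id FROM asset_alias_asset_event WHERE alias_id = ?",
        (alias_id,),
    ).fetchall()
    conn.execute(INSERT_LINK, (alias_id, event_id))


def link_direct(conn: sqlite3.Connection, alias_id: int, event_id: int) -> None:
    """Insert the association row without reading the existing ones."""
    conn.execute(INSERT_LINK, (alias_id, event_id))


def traced_statements(conn: sqlite3.Connection, fn, *args) -> list:
    """Run fn(conn, *args) and return the SQL statements SQLite executed."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        fn(conn, *args)
    finally:
        conn.set_trace_callback(None)
    return statements


def has_select(statements: list) -> bool:
    return any(s.lstrip().upper().startswith("SELECT") for s in statements)


if __name__ == "__main__":
    db = make_db()
    print(has_select(traced_statements(db, link_via_collection, 1, 10_000)))
    print(has_select(traced_statements(db, link_direct, 1, 10_001)))
```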


Labels

area:API (Airflow's REST/HTTP API), area:task-sdk

Development

Successfully merging this pull request may close these issues.

API server OOMKill: task_instance row lock held during asset event emission under high concurrency