Fix API server OOMKill: release row lock before asset event emission under high concurrency #66932
Open
Bishesh-Shahi wants to merge 1 commit into
Closes #66853.
Problem
Under high concurrency (80+ simultaneous task completions emitting asset events), the API server dies with OOMKill. The root cause is a DB lock contention chain:
- `ti_update_state()` acquires `SELECT task_instance ... FOR UPDATE`, holding a PostgreSQL row lock.
- `register_asset_changes_in_db()` runs multiple slow queries, including `asset_alias_model.asset_events.append(asset_event)`. This ORM `.append()` lazy-loads the entire `asset_events` collection for the alias.
- The connection sits `idle in transaction` while Python processes the results. New workers needing `SELECT task_instance FOR UPDATE` on the same row queue up, each holding a FastAPI threadpool thread.

Fix
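A minimal sketch of why the collection append is so costly here, using stdlib `sqlite3` and a simplified, hypothetical schema in place of the real tables (not the actual Airflow code): an ORM-style append must first load the whole existing collection into memory, while a direct INSERT touches only the new row.

```python
# Contrast the two write patterns against the association table.
# Stdlib sqlite3 stands in for PostgreSQL; schema is simplified.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE asset_alias_asset_event (alias_id INTEGER, event_id INTEGER)")
# Simulate an alias that already has thousands of historical events.
conn.executemany(
    "INSERT INTO asset_alias_asset_event VALUES (1, ?)",
    ((i,) for i in range(5000)),
)

# ORM-append-like pattern: the relationship is lazy-loaded in full,
# pulling every existing row into memory while the row lock is held.
existing = conn.execute(
    "SELECT event_id FROM asset_alias_asset_event WHERE alias_id = 1"
).fetchall()  # 5000 rows fetched just to add one
conn.execute("INSERT INTO asset_alias_asset_event VALUES (1, ?)", (len(existing),))

# Direct-INSERT pattern (the fix): no read of the existing collection.
conn.execute("INSERT INTO asset_alias_asset_event VALUES (1, 5001)")

total = conn.execute("SELECT COUNT(*) FROM asset_alias_asset_event").fetchone()[0]
print(total)  # 5002
```

Both patterns end with the same rows in the table; the difference is that the first reads (and holds in memory) the entire existing collection on the write path.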
Two changes:

1. `AssetManager.register_asset_change()` (`assets/manager.py`): Replace `asset_alias_model.asset_events.append(asset_event)` + `session.add(asset_alias_model)` with a direct `INSERT INTO asset_alias_asset_event (alias_id, event_id)`. This eliminates the lazy load of the existing events collection (which can be thousands of rows) while the `task_instance` row lock is held.

2. `ti_update_state()` (`execution_api/routes/task_instances.py`): Add `session.commit()` after the TI state UPDATE and Log writes to release the `task_instance` row lock before running asset registration. Asset registration then runs in a fresh implicit transaction. Registration failures are logged and swallowed -- the task state is already durable at that point.

Testing
- `test_register_asset_change_with_alias_no_lazy_load` -- confirms no SELECT on the `asset_alias_asset_event` collection during registration when pre-existing rows exist
- `test_ti_update_state_to_success_asset_registration_failure_returns_204` -- confirms 204 + TI SUCCESS when asset registration raises after commit
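The commit-then-register ordering that the second test exercises can be sketched with a tiny stand-in session (hypothetical helper names, not the actual Airflow code): the TI state commit happens first, so a later registration failure only rolls back the registration work.

```python
# Sketch of the reordered ti_update_state flow, with a recording
# fake session. Names are illustrative, not the real Airflow API.
import logging

log = logging.getLogger(__name__)

def update_state_then_register(session, register_asset_changes):
    """Commit the TI state first, then attempt asset registration."""
    session.execute("UPDATE task_instance SET state = 'success' WHERE ...")
    session.commit()  # releases the task_instance row lock; state is durable
    try:
        register_asset_changes(session)  # runs in a fresh implicit transaction
        session.commit()
    except Exception:
        # Swallowed: the task state was already committed above.
        log.exception("asset registration failed after TI state commit")
        session.rollback()  # only the registration work is discarded


class RecordingSession:
    """Tiny stand-in for a DB session that records the call order."""
    def __init__(self):
        self.calls = []
    def execute(self, sql):
        self.calls.append("execute")
    def commit(self):
        self.calls.append("commit")
    def rollback(self):
        self.calls.append("rollback")


def failing_registration(session):
    raise RuntimeError("slow asset query blew up")

s = RecordingSession()
update_state_then_register(s, failing_registration)
print(s.calls)  # ['execute', 'commit', 'rollback'] -- state commit survived
```

Because the first `commit()` runs before registration, a registration failure can no longer roll back the state update or prolong the row lock, which is why the endpoint can still return 204.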