Fix deadlock in ti_update_state: FOR UPDATE OF task_instance only #67246
Merged
Conversation
session.get(TI, id, with_for_update=True) emits a SELECT that joins
dag_run (via the lazy="joined" relationship) and applies FOR UPDATE to
both tables. Under concurrent task completions this serialises all
workers on the same dag_run row, producing deadlock cycles with the
scheduler's trigger-rule dependency checks.
Three other callsites in this file already use with_for_update={"of": TI}
for exactly this reason. Apply the same fix to the two remaining callsites
in _create_ti_state_update_query_and_update_state and its error-recovery
path.
kaxil
approved these changes
May 21, 2026
Contributor
Backport successfully created: v3-2-test
Note: As of Merging PRs targeted for Airflow 3.X. In matter of doubt please ask in #release-management Slack channel.
vatsrahul1001
pushed a commit
that referenced
this pull request
May 21, 2026
…ing dag_run (#67246) (#67264)

session.get(TI, id, with_for_update=True) emits a SELECT that joins dag_run (via the lazy="joined" relationship) and applies FOR UPDATE to both tables. Under concurrent task completions this serialises all workers on the same dag_run row, producing deadlock cycles with the scheduler's trigger-rule dependency checks.

Three other callsites in this file already use with_for_update={"of": TI} for exactly this reason. Apply the same fix to the two remaining callsites in _create_ti_state_update_query_and_update_state and its error-recovery path.

(cherry picked from commit 315d159)
Co-authored-by: Arthur <arthur.volant@datadoghq.com>
kaxil
added a commit
that referenced
this pull request
May 22, 2026
…mit (#67353)

PR #59686 dropped the _handle_fail_fast_for_dag call in the MySQL-TIMESTAMP-limit branch of the reschedule path based on an incorrect SQLA2 deadlock concern. As a result, DAGs with fail_fast=True silently fail to stop sibling tasks when a reschedule date exceeds 2038-01-19 on MySQL.

The actual deadlock that motivated #59686 came from a different path (FOR UPDATE expanding to the lazy-joined dag_run row), fixed in #67246 by scoping the lock with with_for_update={"of": TI}. With that scope in place, the fail-fast call is safe and matches the file's two existing fail-fast sites.

Also drops a second misleading comment in the same function claiming session.get was avoided to "avoid SQLA2 lock contention issues". The code itself is fine; the rationale was wrong.
Problem

PATCH /execution/task-instances/{id}/state calls session.get(TI, id, with_for_update=True).

The TI mapper has dag_run = relationship(..., lazy="joined"), so this emits a SELECT that joins dag_run and ends in a bare FOR UPDATE. FOR UPDATE with no OF clause locks every relation in the FROM list, both task_instance and dag_run. Under concurrent task completions in the same DAG run, all workers serialise on a single dag_run row, deadlocking with the scheduler's TriggerRuleDep queries (which run in transactions that also touch task_instance).

Observed in production: ~2,000 psycopg2.errors.DeadlockDetected plus widespread statement_timeout (5 s) errors per hour on the PATCH .../state endpoint, with a mean execution time of 4,297 ms (pure lock-wait; the query does zero disk I/O).

Three other call sites in this file already use with_for_update(of=TI) for exactly this reason (lines 152, 339, 697). The two remaining bare with_for_update=True calls are the ones hit on every normal task completion.

Fix

Scope the lock to task_instance only by passing with_for_update={"of": TI} to session.get. This produces FOR UPDATE OF task_instance, leaving dag_run unlocked and breaking the deadlock cycle.

Testing

Existing unit tests cover the state-update path. No behaviour change: the function still reads dag_run through the joined load; it just no longer holds a row lock on it.