Restore fail_fast handling when reschedule exceeds MySQL TIMESTAMP limit #67353

Merged
kaxil merged 1 commit into apache:main from astronomer:restore-fail-fast-on-mysql-reschedule-limit
May 22, 2026

@kaxil (Member) commented May 22, 2026

Restore _handle_fail_fast_for_dag in the MySQL-TIMESTAMP-limit branch of the
reschedule path. Without this, DAGs with fail_fast=True silently fail to stop
sibling tasks when a reschedule date exceeds 2038-01-19 on MySQL.
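
For context on the 2038-01-19 date: MySQL's TIMESTAMP type tops out at the last second a signed 32-bit Unix timestamp can represent. The sketch below derives that ceiling; `exceeds_mysql_timestamp_limit` is a hypothetical helper for illustration and is not claimed to be the check Airflow performs.

```python
from datetime import datetime, timezone

# Last instant representable by a signed 32-bit Unix timestamp,
# which is also the upper bound of MySQL's TIMESTAMP type.
MYSQL_TIMESTAMP_MAX = datetime.fromtimestamp(2**31 - 1, tz=timezone.utc)


def exceeds_mysql_timestamp_limit(reschedule_date: datetime) -> bool:
    """Hypothetical helper: True if the date cannot fit in a MySQL TIMESTAMP."""
    if reschedule_date.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    return reschedule_date.astimezone(timezone.utc) > MYSQL_TIMESTAMP_MAX


print(MYSQL_TIMESTAMP_MAX.isoformat())  # 2038-01-19T03:14:07+00:00
```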

What was wrong

PR #59686 removed the fail-fast call here with this rationale:

We skip fail_fast handling in this error case to avoid fetching the TI object
while the row is still locked from the earlier with_for_update() query, which
might cause deadlock issues in SQLA2. The task is marked as FAILED regardless.

That rationale was incorrect on both counts:

  • A transaction cannot deadlock with itself. A plain session.get(TI, id) on a
    row already locked by the same transaction acquires no new lock and reads
    freely (Postgres, MySQL 8.0+, SQLite all permit this).
  • "The task is marked as FAILED regardless" is true for the failing TI, but
    silently drops the contract for the rest of the DAG. With fail_fast=True,
    sibling non-teardown tasks should be stopped -- the skip turned that into a no-op.
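
The first point can be illustrated with the standard-library `sqlite3` module. This is only an analogy: SQLite locks the whole database rather than individual rows, so `BEGIN IMMEDIATE` stands in for `SELECT ... FOR UPDATE`, and the table and function names are made up for the demo. The connection that holds the lock re-reads its own row without blocking, while a second connection is refused.

```python
import os
import sqlite3
import tempfile


def same_transaction_read_demo():
    """Return (state re-read by the lock owner, whether another connection was blocked)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.db")
        # isolation_level=None: transactions are controlled explicitly below.
        # timeout=0: a blocked connection fails immediately instead of waiting.
        owner = sqlite3.connect(path, isolation_level=None, timeout=0)
        other = sqlite3.connect(path, isolation_level=None, timeout=0)
        try:
            owner.execute("CREATE TABLE task_instance (id INTEGER PRIMARY KEY, state TEXT)")
            owner.execute("INSERT INTO task_instance VALUES (1, 'running')")

            owner.execute("BEGIN IMMEDIATE")  # take the write lock
            owner.execute("UPDATE task_instance SET state = 'failed' WHERE id = 1")
            # Same transaction, same row: the read needs no new lock.
            state = owner.execute(
                "SELECT state FROM task_instance WHERE id = 1"
            ).fetchone()[0]

            try:
                other.execute("BEGIN IMMEDIATE")  # a different transaction must wait
                other.execute("ROLLBACK")
                other_blocked = False
            except sqlite3.OperationalError:
                other_blocked = True

            owner.execute("COMMIT")
            return state, other_blocked
        finally:
            owner.close()
            other.close()


print(same_transaction_read_demo())  # ('failed', True)
```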

The deadlock that motivated #59686 came from a different code path (FOR UPDATE
expanding to the lazy-joined dag_run row), fixed in #67246 by scoping the
lock with with_for_update={"of": TI}. With that scope in place, the fail-fast
call is safe and matches the file's two existing fail-fast sites.
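
To make the lock-scoping point concrete, here is a minimal SQLAlchemy sketch. The two models are stripped-down stand-ins, not Airflow's real `TaskInstance` and `DagRun`, and it uses the `select(...).with_for_update(of=...)` form rendered for PostgreSQL rather than the `with_for_update={"of": TI}` dictionary form quoted above. It shows how `of=` narrows the rendered `FOR UPDATE` clause to one table when a join is present.

```python
from sqlalchemy import Column, ForeignKey, Integer, String, select
from sqlalchemy.dialects import postgresql
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class DagRun(Base):  # stand-in model, not Airflow's DagRun
    __tablename__ = "dag_run"
    id = Column(Integer, primary_key=True)


class TI(Base):  # stand-in model, not Airflow's TaskInstance
    __tablename__ = "task_instance"
    id = Column(Integer, primary_key=True)
    state = Column(String)
    dag_run_id = Column(Integer, ForeignKey("dag_run.id"))


def render(stmt) -> str:
    """Compile a statement to PostgreSQL-flavoured SQL text."""
    return str(stmt.compile(dialect=postgresql.dialect()))


joined = select(TI).join(DagRun).where(TI.id == 1)

unscoped_sql = render(joined.with_for_update())     # lock clause names no table
scoped_sql = render(joined.with_for_update(of=TI))  # lock clause limited to task_instance

print(unscoped_sql.splitlines()[-1])
print(scoped_sql.splitlines()[-1])
```

In the unscoped form the database is free to lock rows from every table in the query, which is how a joined dag_run row can end up locked as a side effect.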

Behavior change

  • Before: `fail_fast=True` DAG that reschedules past 2038-01-19 on MySQL ->
    failing TI is marked FAILED, siblings keep running (or stay queued).
  • After: failing TI is marked FAILED and sibling non-teardown tasks are stopped.
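
The before/after contrast can be modelled with a toy sketch. Everything here is hypothetical: `ToyTI`, `fail_task`, and the `"stopped"` state are placeholders for illustration and do not mirror Airflow's actual models, state names, or the body of `_handle_fail_fast_for_dag`.

```python
from dataclasses import dataclass

UNFINISHED = {"scheduled", "queued", "running"}


@dataclass
class ToyTI:
    task_id: str
    state: str
    is_teardown: bool = False


def fail_task(failing, siblings, fail_fast, call_fail_fast_handler):
    """Mark one task failed; optionally stop its unfinished non-teardown siblings."""
    failing.state = "failed"
    if not (fail_fast and call_fail_fast_handler):
        return  # models the "before" behavior: siblings are left alone
    for ti in siblings:
        if ti.is_teardown or ti.state not in UNFINISHED:
            continue  # teardown and already-finished tasks are untouched
        ti.state = "stopped"  # placeholder state, not an Airflow state name


def run(call_fail_fast_handler):
    failing = ToyTI("sensor", "running")
    siblings = [
        ToyTI("extract", "running"),
        ToyTI("load", "queued"),
        ToyTI("cleanup", "scheduled", is_teardown=True),
    ]
    fail_task(failing, siblings, fail_fast=True,
              call_fail_fast_handler=call_fail_fast_handler)
    return failing.state, [ti.state for ti in siblings]


print(run(call_fail_fast_handler=False))  # ('failed', ['running', 'queued', 'scheduled'])
print(run(call_fail_fast_handler=True))   # ('failed', ['stopped', 'stopped', 'scheduled'])
```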

Silent functional bugfix; MySQL-only code path. The regression test mocks the
dialect gate so it runs on every backend in CI.

Also drops a second misleading comment in the same function claiming `session.get`
was avoided to "avoid SQLA2 lock contention issues" -- the code itself is fine;
the rationale was wrong.

kaxil requested review from amoghrajesh and ashb as code owners May 22, 2026 17:59
boring-cyborg bot added the area:API (Airflow's REST/HTTP API) and area:task-sdk labels May 22, 2026
kaxil merged commit eca91dc into apache:main May 22, 2026
143 checks passed
kaxil deleted the restore-fail-fast-on-mysql-reschedule-limit branch May 22, 2026 19:20