Conversation

@msumit (Contributor) commented Jan 6, 2026

The DAG processor is getting killed repeatedly, erroring out with deadlock errors. The fix is rather simple: move away from a sub-query to a pre-query in assets.py. It also saves unnecessary SQL queries when there are no stale DAGs in the system.

closes: #60178
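
Roughly, the change replaces the nested sub-query with a plain pre-query. A minimal sketch of the idea (not the exact merged code; session, DagModel and models_to_check are the names used elsewhere in this thread):

from sqlalchemy import delete, select

# Pre-query: fetch the stale dag_ids once, up front.
stale_dag_ids = session.scalars(select(DagModel.dag_id).where(DagModel.is_stale)).all()

# Only issue the DELETEs when there is actually something stale to clean up.
if stale_dag_ids:
    for model in models_to_check:
        session.execute(delete(model).where(model.dag_id.in_(stale_dag_ids)))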


@msumit msumit requested review from XD-DENG and ashb as code owners January 6, 2026 15:14
@msumit msumit requested review from kaxil and uranusjr and removed request for XD-DENG and ashb January 6, 2026 15:14
@potiuk (Member) commented Jan 6, 2026

I guess there are not many people who use MySQL with Airflow 3 and have stale assets. I think it would be great to see the exact deadlock you experienced (including the server-side information), either here or in a related issue; that would allow us to reason about it and see if there are other related issues.

MySQL in general is much more prone to deadlocks, often depending on the plan it uses for a query, on the key collisions the query generates, and likely on some other magic. In short: yes, MySQL might sometimes take unnecessary locks that can lead to deadlocks. We saw that happening all the time in various versions of MySQL, pretty randomly and occasionally, and the solution has always been to try to guess which way of writing the query will not make the MySQL design flaws show up.

This is basically a flaw in the way the InnoDB engine is designed, and there are multiple issues in the MySQL issue tracker around it, most of them closed by the MySQL team as "not a bug", claiming that the application should write "better queries", i.e. echoing Apple's famous "you are holding your phone the wrong way". Examples: https://bugs.mysql.com/bug.php?id=68021 and, more recently, https://bugs.mysql.com/bug.php?id=116815. So yes, it's quite likely there is a combination of things that makes it manifest in your case while others might not experience it, or experience it far less frequently.

And you likely found the right way of "holding the phone" for MySQL.

BTW, Postgres is much better at optimising those out, and we heartily recommend Postgres instead of MySQL.
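
For anyone who wants to grab that server-side information: InnoDB keeps the last detected deadlock in its status output. A minimal sketch of pulling it through SQLAlchemy (the session here is just a placeholder for any connection to the MySQL metadata DB):

from sqlalchemy import text

# "SHOW ENGINE INNODB STATUS" returns a single row with Type, Name and Status columns;
# the Status text contains a "LATEST DETECTED DEADLOCK" section showing both transactions.
row = session.execute(text("SHOW ENGINE INNODB STATUS")).one()
print(row.Status)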

@msumit (Contributor, Author) commented Jan 6, 2026

@potiuk yeah, PostgreSQL works way better than MySQL with Airflow, but we don't have the option to use it in our org. Will create an issue and see if others are impacted by this as well.

@wjddn279 (Contributor) commented Jan 7, 2026

The fix is rather simple: move away from a sub-query to a pre-query in assets.py

Could you explain why this change prevents the deadlock from occurring? It would be more intuitive if you could include something like a query plan to explain it.

@wjddn279 (Contributor) commented Jan 7, 2026

I've also encountered and fixed a bug that occurred with MySQL under very specific conditions (#55589). Without fixing that bug, we wouldn't have been able to use Airflow in our situation. Even if it occurs only in limited cases, for certain users it can become a reason not to use Airflow (or not to upgrade), so I believe there's no reason not to apply the fix once the exact cause is identified and the fix is justified for the version it targets.

@msumit (Contributor, Author) commented Jan 7, 2026

@wjddn279 I think the main reason is that we are running multiple dag-processors, which is also supported by the following LLM analysis of the problematic code block.

The deadlock you're experiencing in MySQL is caused by the use of a subquery within a DELETE statement:
delete(model).where(model.dag_id.in_(select(DagModel.dag_id).where(DagModel.is_stale)))
In MySQL, DELETE ... WHERE ... IN (SELECT ...) can lead to deadlocks because the database may acquire shared locks on the subquery table (dag) while attempting to get exclusive locks on the target table (dag_schedule_asset_alias_reference, etc.). When multiple DAG processor instances or other concurrent transactions are performing similar operations, they can acquire these locks in different orders, resulting in a deadlock.

The deadlock is happening because this DELETE acquires a shared lock on the matching dag rows (because of the subquery) while also taking row locks on each dag_schedule_* table; meanwhile the DAG processor is updating the same DAG row (is_stale → 0) and inserting/updating the alias reference rows. The lock order ends up reversed across the two transactions: the cleanup code holds the dag row lock and waits on alias rows, while the processor holds the alias row lock and waits on the same dag row, producing the 1213 deadlock.
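
A rough, illustrative interleaving of the two transactions described above (lock names simplified; not an actual trace):

  Txn A (cleanup DELETE with subquery)        Txn B (dag-processor update)
  ------------------------------------        ----------------------------
  S lock on the matching `dag` rows
  (taken for the IN (SELECT ...) part)
                                              X lock on dag_schedule_asset_alias_reference rows
                                              (insert/update of alias references)
  wants X lock on those alias rows
    -> blocked by Txn B
                                              wants X lock on the same `dag` row (is_stale -> 0)
                                                -> blocked by Txn A's S lock  => deadlock (1213)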

@potiuk (Member) commented Jan 7, 2026

That's a very plausible explanation. Actually, all deadlocks are caused by two resources being locked in reverse order, and it seems that's exactly what might happen here. The issue is that you have two locks (shared for the subquery and exclusive for the delete), and both are acquired within a single query. So indeed, if another transaction acquires an exclusive lock while the shared lock is already held by the query, and then attempts to acquire a lock that conflicts with that shared lock, those two queries will deadlock.

Separating it into two queries means that only one lock is acquired per query, and it is released right after the query completes, so there is basically no chance of a deadlock.

I don't even think we need tests for this: it is not an easy one to test and a test would have very little value once we add the comment.

@potiuk potiuk added this to the Airflow 3.1.6 milestone Jan 7, 2026
@potiuk potiuk added the backport-to-v3-1-test Mark PR with this label to backport to v3-1-test branch label Jan 7, 2026
@potiuk (Member) commented Jan 7, 2026

Nice find @msumit !

@potiuk (Member) commented Jan 7, 2026

I will speed up merging it, with the comment added, so that it can make it into 3.1.6.

@uranusjr (Member) commented Jan 7, 2026

I'd want this to have a specific code path for MySQL, instead of changing it for all databases. The additional fetch call can be a bit expensive when it's not needed.

@msumit (Contributor, Author) commented Jan 7, 2026

I'd want this to have a specific code path for MySQL, instead of changing it for all databases. The additional fetch call can be a bit expensive when it's not needed.

@uranusjr, which additional fetch call? This change will reduce fetch calls for all database types instead.

@potiuk (Member) commented Jan 7, 2026

I also do not think it's worth implementing different logic here; it always complicates things, and in this case I think the performance loss is negligible.

Logically speaking, it's the same operation, and while yes, it is one more fetch, I am not sure it has much impact on performance.

The only potential issue is that the list of stale dag_ids could be really long and take a lot of memory (and maybe even blow up the size of the SQL query generated in the second step), but I doubt you would have a big number of those, as this query is run in a loop directly after the step that deactivates DAGs.

Do you really think, @uranusjr, that it's worth splitting the logic here? I am not even sure that in this particular case the problem is purely MySQL-induced (it's likely, and it has happened in the past, but we have not proven it yet), so "protectively" it would be better to split the query for all databases, I think.

@msumit (Contributor, Author) commented Jan 7, 2026

IMO one simple fetch call is way better than 5 sub-fetch calls happening in those delete calls. Also, if there are no stale DAGs, then those 5 DB calls are saved as well, aren't they?

@potiuk (Member) commented Jan 7, 2026

IMO one simple fetch call is way better than 5 sub-fetch calls happening in those delete calls. Also, if there are no stale DAGs, then those 5 DB calls are saved as well, aren't they?

I think SQLAlchemy turns it into a single fetch with a nested query (you can take a look by enabling SQLAlchemy query logs), so @uranusjr is, I think, right that it saves fetches. It does not save the queries done by that single fetch, though, and it does not change the optimisation when there are no stale DAGs (I guess the subquery inside the DELETE will be run first internally anyway, and the delete will simply not happen if it returns nothing); you can likely see it by running an explain plan on those queries.
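
A minimal sketch of turning those logs on (plain SQLAlchemy logging, not an Airflow-specific switch):

import logging

logging.basicConfig()
# The "sqlalchemy.engine" logger prints every emitted SQL statement at INFO level,
# so you can see whether the DELETE goes out as a single statement with a nested SELECT.
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)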

But I think in this case it's just premature optimisation, and I would opt for cleaner, unified code.

@wjddn279 (Contributor) commented Jan 8, 2026

very plausible! thanks for explaining! @msumit @potiuk

@uranusjr (Member) commented Jan 8, 2026

Thinking about it, the deadlock isn’t really caused by the subquery, right? Would this be good enough to resolve the issue on MySQL?

if session.scalar(select(func.count()).where(DagModel.is_stale).limit(1)):
    for model in models_to_check:
        session.execute(
            delete(model)
            .where(model.dag_id.in_(select(DagModel.dag_id).where(DagModel.is_stale)))
            .execution_options(synchronize_session="fetch")
        )

This would avoid fetching a list of unchecked size into memory and should be safe everywhere.

@msumit (Contributor, Author) commented Jan 8, 2026

@uranusjr no, the subquery is what is causing the deadlock errors when multiple dag-processors are running. Also, I see that in many places in the code we fetch DAGs into a list, so I'm not sure what is so unsafe here, especially when it's only stale DAGs.

@potiuk (Member) commented Jan 8, 2026

Thinking about it, the deadlock isn’t really caused by the subquery, right?

Yep, as @msumit wrote, it's exactly the subquery. MySQL in this case simply chooses to first run the subquery (with a shared lock), and after the subquery is run it executes the DELETE (with an exclusive lock, while still keeping the shared lock). So the sequence of operations is:

  • SHARED lock for the SELECT subquery (a)
  • get the list of things to delete  # <- at this point a parallel UPDATE in another process might try to acquire the EXCLUSIVE lock (b)
  • EXCLUSIVE lock for the DELETE (b), without releasing lock (a)

Classic deadlock. One process keeps lock (a); another process wants to get (b) but can't because of (a); then the first process tries to get (b) itself and fails because the other process is already waiting for (b) and was there first.

This is the design flaw I've been talking about, and the exact reason why we keep fighting deadlocks in MySQL: rather than acquiring all the locks a single query needs atomically, the InnoDB engine acquires them separately, in a non-atomic sequence of operations.

Postgres is able to optimise this away and acquires all the locks it needs for such a query in a single, atomic operation, which is why we see fewer of those deadlocks there.

@potiuk potiuk modified the milestones: Airflow 3.1.6, Airflow 3.1.7 Jan 8, 2026
@potiuk (Member) commented Jan 8, 2026

@uranusjr, OK to merge? It did not make it into 3.1.6rc1, but in case there is an rc2, we can still add it.

@msumit msumit merged commit dfcd049 into apache:main Jan 12, 2026
70 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 12, 2026
…60166)

MySQL may throw deadlock errors when multiple DAG-Processor instances are
running. The issue is a fetch sub-query being used within a delete query, which
is sometimes causing a deadlock in MySQL.

---------
(cherry picked from commit dfcd049)

Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com>
Co-authored-by: Sumit Maheshwari <sumitm@uber.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
@github-actions

Backport successfully created: v3-1-test

Status Branch Result
v3-1-test PR Link

github-actions bot pushed a commit to aws-mwaa/upstream-to-airflow that referenced this pull request Jan 12, 2026
…pache#60166)

MySQL may throw deadlock errors when multiple DAG-Processor instances are
running. The issue is a fetch sub-query being used within a delete query, which
is sometimes causing a deadlock in MySQL.

---------
(cherry picked from commit dfcd049)

Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com>
Co-authored-by: Sumit Maheshwari <sumitm@uber.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
@msumit msumit deleted the sumit/fix_dag_processor branch January 13, 2026 03:19
iharsh02 pushed a commit to iharsh02/airflow that referenced this pull request Jan 13, 2026
MySQL may throw deadlock errors when multiple DAG-Processor instances are 
running. The issue is a fetch sub-query being used within a delete query, which 
is sometimes causing a deadlock in MySQL.

---------

Co-authored-by: Sumit Maheshwari <sumitm@uber.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
ephraimbuddy pushed a commit that referenced this pull request Jan 14, 2026
…60166)

MySQL may throw deadlock errors when multiple DAG-Processor instances are
running. The issue is a fetch sub-query being used within a delete query, which
is sometimes causing a deadlock in MySQL.

---------
(cherry picked from commit dfcd049)

Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com>
Co-authored-by: Sumit Maheshwari <sumitm@uber.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
ephraimbuddy pushed a commit that referenced this pull request Jan 14, 2026
…60166) (#60418)

MySQL may throw deadlock errors when multiple DAG-Processor instances are
running. The issue is a fetch sub-query being used within a delete query, which
is sometimes causing a deadlock in MySQL.

---------
(cherry picked from commit dfcd049)

Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com>
Co-authored-by: Sumit Maheshwari <sumitm@uber.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
ephraimbuddy pushed a commit that referenced this pull request Jan 16, 2026
…60166) (#60418)

MySQL may throw deadlock errors when multiple DAG-Processor instances are
running. The issue is a fetch sub-query being used within a delete query, which
is sometimes causing a deadlock in MySQL.

---------
(cherry picked from commit dfcd049)

Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com>
Co-authored-by: Sumit Maheshwari <sumitm@uber.com>
Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
