-
Notifications
You must be signed in to change notification settings - Fork 16.3k
Fix dag-processor crashing due to MySql deadlock errors #60166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
I guess there is not a big number of people who use MySQL on Airflow 3 and have stale assets. I think it would be great to see the exact deadlock you experienced (including the server side information) - here or creating a related issue, that would allow to reason about it, and see if there are other related issue. MySQL in general is much more prone to deadlocks - often depending on the plan it will use for a query, depending on the key collisions generated by the query and likely some other magic. In short - yes MySQLmight choose to open unnecessary locks sometimes that can lead to deadlocks. We saw that all the time happening in various versions of MySQL, pretty randomly and occasionally and the solution has always been "try to guess which way will not cause the MySQL design flaws to show up"). This is basically a flaw in the way how InnoDB engine is designed and there are multiple issues in MySQL issue tracking around that - most of them closed by MySQL team as "not a bug" - claiming that the application should do "better queries" - i.e copying the famous Apple's "you are holding your phone in a wrong way" - examples https://bugs.mysql.com/bug.php?id=68021 and more recently https://bugs.mysql.com/bug.php?id=116815 - so yes it's quite likely there is a combination of things that make it manifest in your case where others might not experience it, or experience it far less frequently. And you likely found the right way of "holding the phone" for MySQL. BTW. Postgres is much better in optimising those out - and we heartily recommend Postgres instead of MySQL. |
|
@potiuk yeah, PostgreSQL works way better than MySQL with Airflow, but we don't have an option to use that in our org. Will create an issue & see if others are impacted by this as well. |
Could you explain why this change prevents the deadlock from occurring? It would be more intuitive if you could include something like a query plan to explain it. |
|
I've also encountered and fixed a bug that occurred with MySQL under very specific conditions. (#55589) Without fixing that bug, we wouldn't have been able to use Airflow in our situation. Even if it occurs in limited cases, for certain users it can become a reason not to use Airflow (or not to upgrade versions), so I believe there's no reason not to apply the fix once the exact cause is identified and the justification for the fix version is proven. |
|
@wjddn279 I think the main reason is that we are running multiple dag-processors, which is also proved by the following LLM analysis on the problematic code block.
|
|
That's a very plausible explanation. Actually all deadlocks are caused by two resources that attempt to be locked in reverse order and it seems that's exactly what might happen here. The issue here is that you have two locks (shared for query and exclusive for delete) - and they are both attempted to be acquired in a single query - - so indeed if there is another query that attempts to acquire an exclusive lock while the shared lock is already acquired by the query - and then attempts to acquire the shared lock - those two queries will deadlock. Separating it into two queries has the effect that there is only one lock acquired per query - and immediately released after the query completes, so there is basically no chance for deadlock. I don't even think we need some tests for it - this is not an easy one to test and has very little value, once we add the comment. |
|
Nice find @msumit ! |
|
I will speed up with merging it with the added comment so that it can make it for 3.1.6 |
|
I’d want this to have a specific code path for MySQL, instead of change for all databases. The additional fetch call can be a bit expensive when it’s not needed. |
@uranusjr, which additional fetch call? This change will reduce fetch calls for all types of DB instead. |
|
I also do not thik it's worth to implement different logic here - it always complicates things, and in this case I think performance loss is negligible. Logically speaking it's the same operation and while yes it is one more fetch, I am not sure it has a lot of impact on performance. The only potential issue that might happen is that the list of stale dag_ids is really long and takes a lot of memory (and maybe even exceeds the size of sql query generated in the second step) - but I doubt you might have big number of those - as this query is run in a loop directly after step that deactivates dags. Do you really tink @uranusjr it's worth to split the logic here? I am not even sure if in this particular case this is purely MySQL induced (it's likely - and happened in the past but we have not proven it yet) - so "protectively" it would be better to split it for all databases I think. |
|
IMO one simple fetch call is way better than 5 sub fetch calls happening in those delete calls. Also, if there are no stale dags, then those 5 DB calls are being saved as well, isn't it? |
I think sqlalchemy makes into a single fetch with nested query (you can take a look by enabling sqlalchemy query logs) - so @uranusjr is I think right it saves fetches, It does not save queries done by the single fetch and it does not change optimisations when there are no stale dags - (I guess the DELETE query internally will be run anyway first and the delete will also not happen if delete is not returning anything) - you can likely see it by running explain plan on those queries. But I think in this case it's just premature optimisation and I would opt for cleaner and unified code in this case. |
|
Thinking about it, the deadlock isn’t really caused by the subquery, right? Would this be good enough to resolve the issue on MySQL? if session.scalar(select(func.count()).where(DagModel.is_stale).limit(1)):
for model in models_to_check:
session.execute(
delete(model)
.where(model.dag_id.in_(select(DagModel.dag_id).where(DagModel.is_stale)))
.execution_options(synchronize_session="fetch")
)This would avoid fetching in a list with unchecked size and should be safe everywhere. |
|
@uranusjr no, the subquery is the one that is causing the deadlock errors when running multiple dag-processor. Also, I see in the code that at many places we are fetching DAGs in a list, so not sure what's so unsafe here, that too with only stale DAGs. |
Yep as @msumit wrote. It's exactly the subquery. Simply MySQL in this case chooses to first run subquery (with shared lock) and after subquery is run - it runs DELETE (with exclusive lock - while still keeping the shared lock). So the sequence of operations is:
Classic deadlock. One process keeps lock (a), another process want to get (b) - but can't because of (a) - and the first process tries to get (b) and failes because the other process is waiting for (b) and was first. This is the design flaw I've been talking about and the exact reason why we fight with deadlocks in MySQL, because the way how InnoDB engine works has this deep design flaw, that rather than acquiring all locks it needs for a single query, it attempts to acquire those locks separately in a non-atomic sequence of operations. Postgres is able to optimize it away and acquires all locks it needs in a single, atomic operation for such query - and that's the reason we see less of those deadlocks. |
|
@uranusjr ? OK to merge? It did not make to 3.1.6.rc1, but in case there is an rc2, we can still add it to it |
…60166) MySQL may throw deadlock errors when multiple DAG-Processor instances are running. The issue is a fetch sub-query being used within a delete query, which is sometimes causing a deadlock in MySQL. --------- (cherry picked from commit dfcd049) Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com> Co-authored-by: Sumit Maheshwari <sumitm@uber.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
…pache#60166) MySQL may throw deadlock errors when multiple DAG-Processor instances are running. The issue is a fetch sub-query being used within a delete query, which is sometimes causing a deadlock in MySQL. --------- (cherry picked from commit dfcd049) Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com> Co-authored-by: Sumit Maheshwari <sumitm@uber.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
MySQL may throw deadlock errors when multiple DAG-Processor instances are running. The issue is a fetch sub-query being used within a delete query, which is sometimes causing a deadlock in MySQL. --------- Co-authored-by: Sumit Maheshwari <sumitm@uber.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
…60166) MySQL may throw deadlock errors when multiple DAG-Processor instances are running. The issue is a fetch sub-query being used within a delete query, which is sometimes causing a deadlock in MySQL. --------- (cherry picked from commit dfcd049) Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com> Co-authored-by: Sumit Maheshwari <sumitm@uber.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
…60166) (#60418) MySQL may throw deadlock errors when multiple DAG-Processor instances are running. The issue is a fetch sub-query being used within a delete query, which is sometimes causing a deadlock in MySQL. --------- (cherry picked from commit dfcd049) Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com> Co-authored-by: Sumit Maheshwari <sumitm@uber.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
…60166) (#60418) MySQL may throw deadlock errors when multiple DAG-Processor instances are running. The issue is a fetch sub-query being used within a delete query, which is sometimes causing a deadlock in MySQL. --------- (cherry picked from commit dfcd049) Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com> Co-authored-by: Sumit Maheshwari <sumitm@uber.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
…60166) (#60418) MySQL may throw deadlock errors when multiple DAG-Processor instances are running. The issue is a fetch sub-query being used within a delete query, which is sometimes causing a deadlock in MySQL. --------- (cherry picked from commit dfcd049) Co-authored-by: Sumit Maheshwari <msumit@users.noreply.github.com> Co-authored-by: Sumit Maheshwari <sumitm@uber.com> Co-authored-by: Jarek Potiuk <jarek@potiuk.com>
Dag-processor is getting killed repeatedly, erroring out with Deadlock errors. The fix is rather simple, which is to move away from a sub-query to a pre-query in assets.py. It'll also save unnecessary SQL queries if there are no stale DAGs in the system.
closes: #60178
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rstor{issue_number}.significant.rst, in airflow-core/newsfragments.