Scheduler fails to schedule DagRuns due to persistent DAG record lock #36920
Comments
Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise a PR to address this issue, please do so - no need to wait for approval.
@doiken FYA
I think what could help is to show more complete logs from around the time the deadlock occurs, with logging level set to debug - that would help anyone analysing the problem. Can you please upload such a log somewhere and link it here @nookcreed ?
scheduler.txt |
@potiuk We are also facing a similar issue. I can help provide logs or any other details required.
Please. The more details, the more likely we will be able to find the cause.
@potiuk I have the same issue. It started appearing after I upgraded Airflow from 2.5.1 to 2.7.3. All blocked DAGs start running after I recreate the scheduler pods.
Then send more info - someone might want to take a look.
I am trying to recreate or capture logs; here is what I've been noticing thus far on 2.8.1:
The only logs I have captured so far are info-level ones (from the task instance logs), which are probably no help. I will try to get some debug logs:
This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in the next 7 days if no further activity occurs from the issue author.
@potiuk can you please help remove the "pending-response" label?
We are also seeing the same issue. We started getting this after switching deployments to the official Helm chart on 2.7.3.
Sure. Removed. There is still not nearly enough evidence for anyone to be able to replicate, diagnose, and form a hypothesis about the issue, but I can remove the label. I am afraid, though, that for anyone trying to diagnose it, more evidence (rather than people saying they have the same issue) is what gives a bigger chance that someone will actually find and fix the issue.
I met the same issue.
Just to add some information on our experience of this issue. We deploy Airflow with Astronomer. We started seeing something like this issue in version 9.4.0, which includes Airflow 2.7.2. When I say 'this issue' I mean the following:
These are the provider versions we have installed, and we're using Python 3.11:
We went over our DAGs in detail to see if we had introduced something terrible, and we tried turning some DAGs off so the scheduler was only running a couple of DAGs. We tried upgrading to Airflow 2.8.0. That initially seemed to fix the issue, but we eventually learned that any time we restart the scheduler the issue resolves temporarily but always comes back. We also tried upgrading to 2.8.1. We completely removed the Airflow deployment, recreating it from scratch, but the issue came back. Finally, we downgraded to Airflow 2.7.0 and this seems to have solved the issue. We still have no idea what the actual cause was, only that it does not happen in 2.7.0 but does in all versions from 2.7.2 up.
Can you raise it to Astronomer's support? I believe they provide a paid service including support, and since they have direct insight into what you see, they can investigate things much more thoroughly. Here people help when they have free time, and maybe they will find the problem, maybe not - but with paid support it's quite a bit more likely you can expect investigation and diagnosis, especially as they can have a peek (with your permission) at your installation (and since they are providing the service, it's pretty much on them to make sure it's configured properly). Did you try it @ruarfff ? What do they say?
@potiuk thank you. We did try Astronomer support but no luck there :) They couldn't figure it out. I just wanted to add some extra information to this issue in case it might help, but I am not expecting help solving the issue here. For now, we know we can stick to version 2.7.0 and things will work.
I'd suggest trying again. If they could not figure it out with access to the system, then I am afraid it's not gonna be any easier here, as people here cannot do any more diagnosis on your system, and everyone is trying to help in their free time, when they feel like it, so there is little chance someone will easily reproduce it just by following the description. There at least you have an easily reproducible environment over which Astronomer has full control - the ideal situation to run a diagnosis. I think you should insist there. They have a lot of expertise, and if they get a strong signal that people don't upgrade because of this issue - AND the fact that it happens in the controlled environment of Astronomer AND is easily reproducible there - it becomes far more feasible to diagnose the problem. I might ping a few people in Astronomer to take a closer look if you will help with the reproducibility case there @ruarfff
@ruarfff Can you let me know your Astro support ticket number and we'll dig into this a bit deeper. Edit: found the ticket now. Okay, saying "They couldn't figure it out." isn't really fair, given that two things were going on in parallel on that support ticket, and the original problem was solved and then the ticket closed by a colleague of yours :) Anyway, I've got our Airflow dev team looking at this now.
@ashb sorry, my bad. From my perspective it wasn't figured out, but you're right, it was in fact someone in our internal Astronomer support team who closed the ticket. Sorry about that.
Yeah I get that! S'alright. We might be in touch if we need help reproducing this.
Hi, we are heavily affected by this. We are on 2.7.2. Switching off the SQLAlchemy pool fixed this problem.
Hmmm - that is an interesting note and might lead to some hypothesis as to why it happens @ephraimbuddy @kaxil, and might help with reproduction.
Did you have any special pool configuration before, when it happened @lihan ? Can you please share it here?
Hi, the only config I changed was
This is a total misunderstanding of how the SQLAlchemy pool works. The SQLAlchemy pool is only used in the scheduler to do scheduling, and setting a huge value for the pool like that makes no sense at all, because connections are not reused between different processes (and cannot be).
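For anyone trying the same experiment: the pool is controlled from the `[database]` section of `airflow.cfg` (a sketch using the Airflow 2.7+ option names; values are illustrative, not recommendations):

```ini
[database]
# Disable SQLAlchemy connection pooling entirely (NullPool), as the
# commenter above did:
sql_alchemy_pool_enabled = False

# Or, if pooling stays on, keep the pool modest - scheduler subprocesses
# cannot share pooled connections anyway:
# sql_alchemy_pool_size = 5
# sql_alchemy_max_overflow = 10
```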
Thanks for the explanation, this makes sense - the workers cannot reuse this, so it makes even less sense to use the pool when there is only one scheduler running.
@gr8web can you also share what you did to resolve this? Was it this pool setting? What was the value before?
Hello people. Sorry, it looks like I was wrong. I just saw it again: connections getting stuck in `idle in transaction`. This is the query I used to look up the connections in the db:
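The query itself did not survive the formatting; a minimal sketch of the kind of lookup described, assuming PostgreSQL's `pg_stat_activity` view:

```sql
-- List connections stuck idle inside an open transaction, oldest first.
SELECT pid, usename, state, state_change, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY state_change;
```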
We did not see the issue for some time, but now it's just back; my assumption before was that we had a misconfigured pgbouncer. We currently have
Almost everything is at the defaults from the Helm chart. The only thing that happened to the jobs is that the tasks were cleared for a few days in the past. I don't really have experience debugging things like that, but maybe I can still try to help somehow.
Can you elaborate more on this? Like, explain how it was cleared, and your DAG args.
We have an
We currently have 60 jobs, with 2-3 tasks each.
Can you try having the start date be a static date in the past?
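For reference, the suggestion above in DAG code - a minimal sketch with hypothetical names, not the reporter's actual DAG:

```python
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# A static start_date in the past; dynamic values such as datetime.now()
# change on every parse and can confuse scheduling.
with DAG(
    dag_id="example_static_start",  # hypothetical
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule="@hourly",
    catchup=False,
):
    EmptyOperator(task_id="noop")
```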
I faced the same issue on Airflow 2.8.1. Is there any solution for this issue?
As some people reported above (I recommend you read the whole thread), you can try Airflow 2.8.2 - it solved the problem for some users. Then check that you do not have a similar (though different) problem with a hugely increased database pool size, which apparently caused issues for another user.

Then, ideally, @renanxx1, you should report your findings here - saying whether any of those things worked for you. That would give us more confidence that similar problems might be solved by others applying similar solutions, and it might give you @renanxx1 a chance to contribute back in diagnosing and fixing the issue. And eventually, if none of it works for you, providing more details, explaining your specific configuration and the circumstances, is the second-best thing you can do - not only to contribute back, but also possibly to speed up diagnosis and a solution.

Note that this software is developed by volunteers who spend their own time so that you can use it for absolutely free, and helping to diagnose problems by providing your findings is the least you can do to give back and thank those people. Note as well that the software comes with no warranties and no support promise, so any diagnosis and analysis people do here is done because they voluntarily decided to help and spend their personal, free time (even if they could spend it with their families or for pleasure). So providing your findings and trying out the different things above is absolutely the least you can do to thank them. Can we count on your help @renanxx1, rather than just demanding a solution? That would be a very useful and great thing for the community if you do.
I think the "idle in transaction" might give a pointer. Idle transactions are bad news, and I would like to know more about the query / transaction that is happening at that moment. Having the full output (including active transactions!), and not truncated output, from @gr8web's query could be helpful here. As a workaround (it might have side effects) you could try to set
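The exact setting suggested was lost in formatting; one server-side guard in that spirit (an assumption on my part, not necessarily what was meant) is PostgreSQL's `idle_in_transaction_session_timeout`:

```sql
-- Assumption: terminate sessions that sit idle inside a transaction for too
-- long, so a leaked row lock is eventually released. May have side effects.
ALTER SYSTEM SET idle_in_transaction_session_timeout = '10min';
SELECT pg_reload_conf();
```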
Using Airflow 2.8.2 and Postgres on Kubernetes. The issue went away for some time, but now it has emerged again.
I'm having the same type of issue running Airflow 2.8.3 on Kubernetes. We use the Kubernetes executor and an external Postgres database. We get the same error saying the DAG record was locked. The weird thing for us is that the DAGs are running, but only one task at a time is allowed. So for example, say there are 3 DAGs scheduled at the same time: DAG 1 task 1 executes, then DAG 2 task 1, then DAG 3 task 1, and so on until all the tasks are completed. So it creates a bottleneck for DAGs to run. But the thing that locks everything up is when a DAG calls the TriggerDagRunOperator with wait_for_completion set to true. The task runs but gets stuck, because the DAG it triggers will never start while a task from the DAG that launched it is still running. That completely stops everything from running. Scaling the scheduler up and down fixes it for the most part. It only happens to us about once every two weeks, so it's hard to replicate.
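A sketch of the blocking pattern described (hypothetical IDs, not the reporter's code):

```python
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

# The waiting task holds its slot until the child DAG run finishes; if the
# scheduler is only releasing one task at a time, the child never starts
# and everything deadlocks.
trigger = TriggerDagRunOperator(
    task_id="trigger_child",
    trigger_dag_id="child_dag",   # hypothetical downstream DAG id
    wait_for_completion=True,     # block until the triggered run completes
    poke_interval=60,             # seconds between completion checks
)
```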
Our setup is similar: Kubernetes executor + Postgres (I reported the issue originally). I attached debug logs from around the time of failure to this ticket some time ago, but it looks like they weren't helpful. And like @Kenny1217 mentioned, it is hard to replicate. We are unable to upgrade to versions 2.7 or beyond. Happy to help out in any way possible.
Speaking of locks (un-locks), is there any chance this issue and discussion #38728 could be inversely related?
I have also seen DAG runs stuck, with the dag run in a 'running' state but without any tasks getting run. As a work-around I used
I have been on Airflow 2.8.4 for a couple of weeks; my environment is the Kubernetes executor and MySQL 8, and I have not faced this issue yet.
We are facing the same issue, and nothing in this thread helped to solve it. What's the resolution here?
We were having the issue with Airflow 2.8.3, but ever since we upgraded to Airflow 2.9.0 we haven't seen the issue again. This could just be luck, since it only happens randomly, but it's been close to a month now since we did the upgrade and we haven't had any DAG record lock errors.
We also faced the same issue with v2.7.3, although for us clearing and rerunning the task worked. Rerunning the task wasn't costly for us, as our tasks are idempotent, but it could be for others. Is upgrading to 2.9.0 the only long-term solution?
After doing the following, all these errors went away for us.
Which of the actions above cured the problem is not known. Maybe only one of them is needed.
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.7.3
What happened?
We are encountering an issue in our Apache Airflow setup where, after a few successful DagRuns, the scheduler stops scheduling new runs. The scheduler logs indicate:
{scheduler_job_runner.py:1426} INFO - DAG dag-test scheduling was skipped, probably because the DAG record was locked.
This problem persists despite running a single scheduler pod. Notably, reverting the changes from PR #31414 resolves this issue. A similar issue has been discussed on Stack Overflow: Airflow Kubernetes Executor Scheduling Skipped Because Dag Record Was Locked.
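For context on that message: the scheduler takes a row-level lock on each DAG's record and skips DAGs whose row is already locked, roughly like this simplified SQLAlchemy sketch (not Airflow's exact code):

```python
from sqlalchemy import select
from airflow.models.dag import DagModel

# Rows locked by another transaction are skipped rather than waited on, so a
# lock that is never released makes its DAG permanently "skipped".
stmt = (
    select(DagModel)
    .where(DagModel.is_paused.is_(False))
    .with_for_update(skip_locked=True)
)
```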
What you think should happen instead?
The scheduler should consistently schedule new DagRuns as per DAG configurations, without interruption due to DAG record locks.
How to reproduce
Run Airflow v2.7.3 on Kubernetes. HA is not required.
Trigger multiple DagRuns (we have about 10 DAGs that run every minute; see the sketch below).
Observe scheduler behavior and logs after a few successful runs. The error shows up after a few minutes.
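A minimal sketch of the kind of DAGs described in step 2 (hypothetical bodies; only the load pattern matches the report):

```python
import pendulum
from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Roughly the reported load: ~10 DAGs, each scheduled every minute.
for i in range(10):
    dag = DAG(
        dag_id=f"dag-test-{i}",  # hypothetical ids; the log above shows "dag-test"
        start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
        schedule="* * * * *",
        catchup=False,
    )
    with dag:
        EmptyOperator(task_id="noop")
    globals()[f"dag_test_{i}"] = dag  # register each DAG at module level
```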
Operating System
centos7
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.10.0
apache-airflow-providers-apache-hive==6.2.0
apache-airflow-providers-apache-livy==3.6.0
apache-airflow-providers-cncf-kubernetes==7.8.0
apache-airflow-providers-common-sql==1.8.0
apache-airflow-providers-ftp==3.6.0
apache-airflow-providers-google==10.11.0
apache-airflow-providers-http==4.6.0
apache-airflow-providers-imap==3.4.0
apache-airflow-providers-papermill==3.4.0
apache-airflow-providers-postgres==5.7.1
apache-airflow-providers-presto==5.2.1
apache-airflow-providers-salesforce==5.5.0
apache-airflow-providers-snowflake==5.1.0
apache-airflow-providers-sqlite==3.5.0
apache-airflow-providers-trino==5.4.0
Deployment
Other
Deployment details
We have wrappers around the official Airflow Helm chart and Docker images.
Environment:
Anything else?
Actual Behavior:
The scheduler stops scheduling new runs after a few DagRuns, with log messages about the DAG record being locked.
Workaround:
Restarting the scheduler pod releases the lock and allows normal scheduling to resume, but this is not viable in production. Reverting the changes in PR #31414 also resolves the issue.
Questions/Request for Information:
Are you willing to submit PR?
Code of Conduct