
Scheduler fails to schedule DagRuns due to persistent DAG record lock #36920

Open

nookcreed opened this issue Jan 19, 2024 · 60 comments
Labels
affected_version:2.7 (Issues Reported for 2.7) · area:core · area:Scheduler (Scheduler or DAG parsing issues) · kind:bug

Comments
@nookcreed

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.7.3

What happened?

We are encountering an issue in our Apache Airflow setup where, after a few successful DagRuns, the scheduler stops scheduling new runs. The scheduler logs indicate:

{scheduler_job_runner.py:1426} INFO - DAG dag-test scheduling was skipped, probably because the DAG record was locked.

This problem persists despite running a single scheduler pod. Notably, reverting the changes from PR #31414 resolves this issue. A similar issue has been discussed on Stack Overflow: Airflow Kubernetes Executor Scheduling Skipped Because Dag Record Was Locked.
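
For context on what that log line means, here is a minimal sketch (not Airflow's actual code) of the row-locking pattern the scheduler relies on, assuming a reachable Postgres metadata database at a hypothetical DSN: a DAG row held by another open transaction is silently skipped by a SELECT ... FOR UPDATE SKIP LOCKED query, which is what surfaces as the "DAG record was locked" message.

from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class DagModel(Base):
    # Maps just the dag_id column of the existing "dag" table, for illustration only.
    __tablename__ = "dag"
    dag_id = Column(String, primary_key=True)

# Hypothetical DSN for the Airflow metadata database.
engine = create_engine("postgresql+psycopg2://airflow:airflow@localhost/airflow")

with Session(engine) as session_a, Session(engine) as session_b:
    # Transaction A takes the row lock and keeps its transaction open.
    session_a.query(DagModel).filter_by(dag_id="dag-test").with_for_update().first()

    # Transaction B uses SKIP LOCKED, so the locked row is filtered out of the result;
    # an empty result here is the situation the scheduler reports as "record was locked".
    rows = (
        session_b.query(DagModel)
        .filter_by(dag_id="dag-test")
        .with_for_update(skip_locked=True)
        .all()
    )
    print(rows)  # [] for as long as A's transaction stays open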

What you think should happen instead?

The scheduler should consistently schedule new DagRuns as per DAG configurations, without interruption due to DAG record locks.

How to reproduce

Run Airflow v2.7.3 on Kubernetes. HA is not required.
Trigger multiple DagRuns (we have about 10 DAGs that run every minute).
Observe scheduler behavior and logs after a few successful runs. The error shows up after a few minutes.

Operating System

centos7

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.10.0
apache-airflow-providers-apache-hive==6.2.0
apache-airflow-providers-apache-livy==3.6.0
apache-airflow-providers-cncf-kubernetes==7.8.0
apache-airflow-providers-common-sql==1.8.0
apache-airflow-providers-ftp==3.6.0
apache-airflow-providers-google==10.11.0
apache-airflow-providers-http==4.6.0
apache-airflow-providers-imap==3.4.0
apache-airflow-providers-papermill==3.4.0
apache-airflow-providers-postgres==5.7.1
apache-airflow-providers-presto==5.2.1
apache-airflow-providers-salesforce==5.5.0
apache-airflow-providers-snowflake==5.1.0
apache-airflow-providers-sqlite==3.5.0
apache-airflow-providers-trino==5.4.0

Deployment

Other

Deployment details

We have wrappers around the official airflow helm chart and docker images.

Environment:

Airflow Version: 2.7.3
Kubernetes Version: 1.24
Executor: KubernetesExecutor
Database: PostgreSQL (metadata database)
Environment/Infrastructure: Kubernetes cluster running Airflow in Docker containers

Anything else?

Actual Behavior:
The scheduler stops scheduling new runs after a few DagRuns, with log messages about the DAG record being locked.

Workaround:
Restarting the scheduler pod releases the lock and allows normal scheduling to resume, but this is not viable in production. Reverting the changes in PR #31414 also resolves the issue.

Questions/Request for Information:

  1. Under what scenarios is the lock on a DAG record typically not released?
  2. Are there known issues in Airflow 2.7.3, or specific configurations, that might cause the DAG record to remain locked, thereby preventing new run scheduling?
  3. Could the changes made in PR #31414 be related to this issue?

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@nookcreed nookcreed added area:core kind:bug This is a clearly a bug needs-triage label for new issues that we didn't triage yet labels Jan 19, 2024

boring-cyborg bot commented Jan 19, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@nookcreed
Author

@doiken FYA

@nookcreed nookcreed changed the title Scheduler Fails to Schedule Subsequent DagRuns Due to Persistent DAG Record Lock Scheduler fails to schedule DagRuns due to persistent DAG record lock Jan 19, 2024
@potiuk potiuk added this to the Airflow 2.8.2 milestone Jan 20, 2024
@potiuk
Member

potiuk commented Jan 20, 2024

I think what could help is to show more complete logs around the time when the deadlock occurs with logging level set to debug - that would help anyone who would analyse the problem. Can you please upload such log somewhere and link here @nookcreed ?

@potiuk potiuk added pending-response and removed needs-triage label for new issues that we didn't triage yet labels Jan 20, 2024
@nookcreed
Author

scheduler.txt
@potiuk I have attached debug logs. It is harder to recreate the scenario with debug logging enabled. With info, it almost always happens within a few minutes.

@arvind1705

@potiuk We are also facing a similar issue. I can help provide logs or any other details required.

@potiuk
Member

potiuk commented Jan 31, 2024

Please. The more details, the more likely we will be able to find the cause

@selimchergui
Contributor

@potiuk I have the same issue. It started appearing after I upgraded Airflow from 2.5.1 to 2.7.3. All blocked DAGs start running again after I recreate the scheduler pods.

@potiuk
Member

potiuk commented Jan 31, 2024

Then send more info - someone might want to take a look

@plmnguyen

plmnguyen commented Feb 2, 2024

I am trying to recreate or capture logs, but what I've been noticing thus far on 2.8.1:

  • For DAGs showing that "DAG record was locked", there is a task that is stuck in a running state and never times out
    - Happens only on Kubernetes spark-submit operator tasks

The only INFO logs I have captured so far (from the task instance logs) are below, which is probably no help. I will try to get some debug logs:

[2024-02-02, 06:01:45 UTC] {taskinstance.py:1946} INFO - Dependencies not met for <TaskInstance: l3-main-abc123.rlav.rlav manual__2024-02-01T05:00:00+00:00 [running]>, dependency 'Task Instance Not Running' FAILED: Task is in the running state
[2024-02-02, 06:01:45 UTC] {taskinstance.py:1946} INFO - Dependencies not met for <TaskInstance: l3-main-abc123.rlav.rlav manual__2024-02-01T05:00:00+00:00 [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' state.
[2024-02-02, 06:01:45 UTC] {local_task_job_runner.py:160} INFO - Task is not able to be run


This issue has been automatically marked as stale because it has been open for 14 days with no response from the author. It will be closed in next 7 days if no further activity occurs from the issue author.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Feb 20, 2024
@nookcreed
Author

@potiuk can you please help remove the "pending-response" label?

@rahulnbhandari

We are also seeing the same issue. We started getting this after switching our deployment to the official Helm chart on 2.7.3.

@potiuk
Member

potiuk commented Feb 20, 2024

Sure. Removed. There is still not nearly enough evidence for anyone to be able to replicate, diagnose, and form a hypothesis about the issue, but I can remove the label. I am afraid, though, that for anyone trying to diagnose it, the more evidence there is (rather than people saying they have the same issue), the bigger the chance that someone will actually find and fix the issue.

@mis98zb
Contributor

mis98zb commented Feb 20, 2024

I hit the same issue.
But after I updated the Helm values file to use a standalone DAG processor, I no longer see this error in the scheduler log.

@ruarfff

ruarfff commented Feb 20, 2024

Just to add some information on our experience of this issue.

We deploy Airflow with Astronomer. We started seeing something like this issue in version 9.4.0, which includes Airflow 2.7.2.

When I say 'this issue' I mean the following:

  • DAGs start going into a skipped state and are not run.
  • Initially this is sporadic and for a while it's possible to trigger DAGs manually.
  • Eventually things degrade to a point where all DAGs skip and you can no longer get them to run by manually triggering them.
  • We see logs like the following: {scheduler_job_runner.py:1444} INFO - DAG our_dag scheduling was skipped, probably because the DAG record was locked.
  • We see some other suspicious logs but searching for those individually seems to indicate they are not part of the issue.
  • We are running everything in Kubernetes, version 1.26.6.
  • We have a single scheduler instance running in the namespace.

These are the provider versions we have installed and we're using Python 3.11:

apache-airflow-providers-cncf-kubernetes==7.14.0
apache-airflow-providers-databricks==4.6.0
apache-airflow-providers-microsoft-azure==8.0.0
apache-airflow-providers-microsoft-mssql==3.5.0
apache-airflow-providers-mongo==3.3.0
apache-airflow-providers-postgres==5.7.1
apache-airflow-providers-snowflake==5.3.0

We went over our DAGs in detail to see if we had introduced something terrible, and we tried turning some DAGs off so that the scheduler was only running a couple of DAGs.

We tried upgrading to Airflow 2.8.0. That initially seemed to fix the issue but we eventually learned that any time we restart the scheduler, the issue resolves temporarily but always comes back. We also tried upgrading to 2.8.1.

We completely removed the Airflow deployment, recreating it from scratch, but the issue came back.

Finally, we downgraded to Airflow 2.7.0 and this seems to have solved the issue. We still have no idea what the actual cause of the issue was, only that it does not happen in 2.7.0 but does in all versions from 2.7.2 and above.

@potiuk
Member

potiuk commented Feb 20, 2024

Can you raise it with Astronomer's support? I believe they provide a paid service including support, and they have direct insight into what you see, so they are able to investigate things much more thoroughly. Here people help when they have free time and maybe they will find the problem, maybe not - but with paid support it's quite a bit more likely you can expect investigation and diagnosis, especially since they can take a peek (with your permission) at your installation (and since they are providing the service, it's pretty much on them to make sure it's configured properly).

Did you try it @ruarfff ? What do they say?

@ruarfff

ruarfff commented Feb 20, 2024

@potiuk thank you. We did try Astronomer support but no luck there :) They couldn't figure it out.

I just wanted to add some extra information to this issue in case it might help but I am not expecting help solving the issue here. For now, we know we can stick to version 2.7.0 and things will work.

@potiuk
Member

potiuk commented Feb 20, 2024

I'd suggest trying again. If they could not figure it out with access to the system, then I am afraid it's not going to be any easier here, as people here cannot do any more diagnosis on your system, and everyone here is trying to help in their free time, when they feel like it, so there is little chance someone will easily reproduce it from the description alone. There, at least, you have an easily reproducible environment that Astronomer has full control over - an ideal situation for running a diagnosis.

I think you should insist there. They have a lot of expertise, and if they get a strong signal that people don't upgrade because of this issue - AND the fact that it happens in Astronomer's controlled environment AND is easily reproducible there - it becomes far more feasible to diagnose the problem. I might ping a few people at Astronomer to take a closer look if you will help with a reproducible case there @ruarfff

@ashb
Member

ashb commented Feb 20, 2024

@ruarfff Can you let me know your Astro support ticket number and we'll dig in to this a bit deeper. Edit: found the ticket now.

Okay, saying "They couldn't figure it out." isn't really fair, given that two things were going on in parallel on that support ticket, and the original problem was solved and then the ticket closed by a colleague of yours :)

Anyway, I've got our Airflow dev team looking at this now.

@ruarfff

ruarfff commented Feb 20, 2024

@ashb sorry, my bad. From my perspective it wasn't figured out but you're right it was in fact someone in our internal Astronomer support team who closed the ticket. Sorry about that.

@ashb
Member

ashb commented Feb 20, 2024

Yeah I get that! S'alright. We might be in touch if we need help reproducing this.

@lihan
Contributor

lihan commented Mar 5, 2024

Hi, we are heavily affected by this. We are on 2.7.2.

Switching off SQL Alchemy Pool fixed this problem

@potiuk
Member

potiuk commented Mar 5, 2024

Switching off SQL Alchemy Pool fixed this problem

Hmmm - that is an interesting note and might lead to a hypothesis about why it happens @ephraimbuddy @kaxil, and it might help with reproduction

@potiuk
Member

potiuk commented Mar 5, 2024

Did you have any special pool configuration when it happened @lihan? Can you please share it here?

@lihan
Contributor

lihan commented Mar 5, 2024

Hi, the only config I changed was AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE, which was set to 180. My Airflow has 1200 active running dags so this number was set at a rather high value.

@potiuk
Member

potiuk commented Mar 5, 2024

Hi, the only config I changed was AIRFLOW__DATABASE__SQL_ALCHEMY_POOL_SIZE, which was set to 180. My Airflow has 1200 active running dags so this number was set at a rather high value.

This is a total misunderstanding of how the SQLAlchemy pool works. The SQLAlchemy pool has exactly zero impact on running DAGs. Each task runs in its own process (every task is a new, forked process and establishes its own DB connections), and an SQLAlchemy pool is only really useful when connections are shared within the same process.

The SQLAlchemy pool is only used in the scheduler to do scheduling, and setting a huge value for the pool like that makes no sense at all, because connections are not reused between different processes (and cannot be).
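
A minimal sketch of that point, assuming a local Postgres at a hypothetical DSN: the pool configured on an engine only serves the process that created it, so a forked, task-like child ends up on its own server connection regardless of the parent's pool_size.

import os
from sqlalchemy import create_engine, text

# Hypothetical DSN; pool_size only governs connections kept and reused inside one process.
engine = create_engine(
    "postgresql+psycopg2://airflow:airflow@localhost/airflow",
    pool_size=5,
    max_overflow=10,
)

pid = os.fork()  # Airflow runs each task in a freshly forked process, roughly like this
with engine.connect() as conn:
    # Each process lazily opens its own connection; nothing is shared across the fork.
    backend = conn.execute(text("SELECT pg_backend_pid()")).scalar()
    role = "forked child (task-like)" if pid == 0 else "parent (scheduler-like)"
    print(role, "- backend pid =", backend)

Run on Linux against a reachable Postgres, this prints two different backend PIDs, one per process.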

@lihan
Contributor

lihan commented Mar 5, 2024

Thanks for the explanation, this makes sense. The workers cannot reuse the pool, so it makes even less sense to use the pool when there is only one scheduler running.

@ephraimbuddy
Contributor

ephraimbuddy commented Mar 6, 2024

sql alchemy pool is disabled.

@gr8web can you also share what you did to resolve this? Was it this pool setting? What was the value before?

@gr8web

gr8web commented Mar 6, 2024

Hello people.

Sorry, it looks like I was wrong. I just saw it again: connections getting stuck in the "idle in transaction" state in the database and jobs not progressing.

This is the query I used to look up the connections in the DB:


select pid, state, usename, query, query_start, client_addr, client_port, wait_event, wait_event_type
from pg_stat_activity
where pid in (
  select pid from pg_locks l
  join pg_class t on l.relation = t.oid
  and t.relkind = 'r'
  where t.relname = 'task_instance'
  --and state = 'idle in transaction'
);

[screenshot attachment: idle_conn]

We did not see the issue for some time but now it's just back, so my earlier assumption was that we had a misconfigured pgbouncer.
Since it's back now I don't have an idea what could cause the problem; it doesn't seem to be related to the db/pgbouncer settings at all.
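
As a companion to the query above - a hedged sketch, assuming psycopg2 and the same metadata database at a hypothetical DSN - this lists any backends that are actually waiting on a lock and which sessions are blocking them (the scheduler's SKIP LOCKED query will not appear here, since it skips rather than waits):

import psycopg2

conn = psycopg2.connect("dbname=airflow user=airflow host=localhost")  # hypothetical DSN
with conn, conn.cursor() as cur:
    # pg_blocking_pids() returns the set of backend PIDs holding locks that this pid waits on.
    cur.execute(
        """
        SELECT pid,
               pg_blocking_pids(pid) AS blocked_by,
               state,
               wait_event_type,
               left(query, 80) AS query
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0
        """
    )
    for row in cur.fetchall():
        print(row)
conn.close()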

we currently have
3 scheduler instances
32 workers
1 webserver

worker_concurrency = 16
parallelism = 64
max_active_runs_per_dag = 16

pgbouncer:

[databases]
* = host=dbhost port=5432 auth_user=user

[pgbouncer]
pool_mode = transaction
listen_port = 5432
listen_addr = *
auth_type = md5
auth_file = /etc/pgbouncer/auth_file.txt
admin_users = airflow
stats_users_prefix = robot_
auth_query = SELECT * FROM pooler.user_lookup($1)
logfile = /var/log/pgbouncer/pgbouncer.log
pidfile = /var/run/pgbouncer/pgbouncer.pid

server_tls_sslmode = require
server_tls_ca_file = /etc/ssl/certs/pgbouncer.crt
server_tls_protocols = secure
client_tls_sslmode = require
client_tls_key_file = /etc/ssl/certs/pgbouncer.key
client_tls_cert_file = /etc/ssl/certs/pgbouncer.crt

log_connections = 0
log_disconnections = 0

# How many server connections to allow per user/database pair.
default_pool_size = 15

# How many additional connections to allow to a pool
reserve_pool_size = 7

# Maximum number of client connections allowed.
max_client_conn = 10000

# Do not allow more than this many connections per database (regardless of
# pool, i.e. user)
max_db_connections = 30

# If a client has been in "idle in transaction" state longer, it will be
# disconnected. [seconds]
idle_transaction_timeout = 600

# If login failed, because of failure from connect() or authentication that
# pooler waits this much before retrying to connect. Default is 15. [seconds]
server_login_retry = 5

# To ignore extra parameter in startup packet. By default only 'database' and
# 'user' are allowed, all others raise error.  This is needed to tolerate
# overenthusiastic JDBC wanting to unconditionally set 'extra_float_digits=2'
# in startup packet.
ignore_startup_parameters = extra_float_digits,options

Almost everything is at the defaults from the Helm chart.
Workers and the scheduler connect to pgbouncer.

The only thing that happened to the jobs is that the tasks were cleared for a few days in the past.

I don't really have experience debugging things like this, but maybe I can still try to help somehow.
Let me know if I missed something or you need to know anything else.

@ephraimbuddy
Contributor

The only thing that happened to the jobs is that the tasks were cleared for a few days in the past.

Can you elaborate more on this? For example, explain how they were cleared and share your DAG args.

@gr8web

gr8web commented Mar 6, 2024

The only thing that happened to the jobs is that the tasks were cleared for a few days in the past.

Can you elaborate more on this? For example, explain how they were cleared and share your DAG args.

We have an on_retry_callback which would clear the tasks.
It looked like this:

airflow tasks clear -s {cleanup_date} -t {task_regex} -d -y {dag_id}

with DAG(
        dag_id=DAG_ID,
        schedule_interval=timedelta(minutes=10),
        start_date=pendulum.datetime(2024, 3, 6, 10, 0, 0, tz="UTC"),
        catchup=True,
        default_args=get_config_value(KEY_DEFAULT_ARGS),
        max_active_runs=1,
):
    ...

DEFAULT_ARGS = {
    "depends_on_past": True,
    "retries": 10,
}

cleanup_date is execution_date.
Some time later the start date for the jobs was changed to the present day.

We currently have 60 Jobs, with 2-3 tasks each.

@ephraimbuddy
Contributor

ephraimbuddy commented Mar 7, 2024

cleanup_date is execution_date. Some time later the start date for the jobs was changed to the present day.

Can you try having the start date be a static date in the past?
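
For illustration - a minimal sketch with a hypothetical DAG id and example settings, not the actual DAG above - this is what a pinned, static start_date looks like, as opposed to moving the start date forward to the present day:

import pendulum
from datetime import timedelta
from airflow import DAG

with DAG(
    dag_id="example_static_start_date",          # hypothetical DAG id
    schedule_interval=timedelta(minutes=10),
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),  # static date in the past, never moved
    catchup=False,                                # example choice to avoid a large backfill
    max_active_runs=1,
):
    pass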

@renanxx1

renanxx1 commented Mar 8, 2024

I faced the same issue on Airflow 2.8.1.
The scheduler was skipping DAG execution with the message: "DAG scheduling was skipped, probably because the DAG record was locked"

Is there any solution for this issue?

@potiuk
Member

potiuk commented Mar 8, 2024

Is there any solution for this issue?

As some people reported above (I recommend you read the whole thread) - you can try Airflow 2.8.2 - it solved the problem for some users. Then you can check whether you have a similar (but different) problem with a hugely increased database pool size - which apparently caused issues for another user. Then, ideally @renanxx1, you should report your findings here - saying whether any of those things worked for you. That would give us more confidence that similar problems might be solved by others applying similar solutions. And it would give you @renanxx1 a chance to contribute back in diagnosing and fixing the issue.

And if none of that works for you, providing more details - explaining your specific configuration and the circumstances - is the second best thing you can do, not only to contribute back but also possibly to speed up diagnosis and a solution.

Note that this software is developed by volunteers who spend their own time so that you can use it for absolutely free, and helping to diagnose problems by providing your findings is the least you can do to give back and thank the people who decided to spend their time so that people like you and companies like yours can use the software for free.

Note as well that the software you get is free and comes with no warranties and no support promise, so any diagnosis and analysis people do here to help users with their problems is done because they voluntarily decided to help and spend their personal, free time (even if they could spend it with their families or for pleasure). So providing your findings and trying out the different things above is the absolute least you can do to thank them for that.

Can we count on your help @renanxx1, rather than just demanding a solution? That would be very useful and a great thing for the community.

@bolkedebruin
Contributor

bolkedebruin commented Mar 27, 2024

I think the "idle in transaction" state might give a pointer. Idle transactions are bad news, and I would like to know more about the query/transaction that is happening at that moment. Having the full (not truncated) output, including active transactions, from @gr8web's query would be helpful here.

As a workaround (it might have side effects) you could try to set idle_in_transaction_session_timeout in Postgres to something positive, say around a minute (= 60000 ms), maybe even smaller.
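
A hedged sketch of that workaround, assuming psycopg2 and a hypothetical DSN for the Airflow metadata database; the exact timeout value, and whether to apply it at all, are judgment calls:

import psycopg2

conn = psycopg2.connect("dbname=airflow user=airflow host=localhost")  # hypothetical DSN
conn.autocommit = True
with conn.cursor() as cur:
    # New sessions on this database that sit idle inside an open transaction for more
    # than 60s get disconnected, which also releases any row locks they still hold.
    cur.execute("ALTER DATABASE airflow SET idle_in_transaction_session_timeout = '60s'")
    # Check whether any sessions are currently stuck idle in transaction right now.
    cur.execute(
        "SELECT pid, state, now() - xact_start AS xact_age, left(query, 80) "
        "FROM pg_stat_activity WHERE state = 'idle in transaction'"
    )
    for row in cur.fetchall():
        print(row)
conn.close()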

@dominiquedemunck

Using Airflow 2.8.2 and Postgres in Kubernetes. The issue went away for some time, but has now emerged again.
It occurred when I cleared a handful of tasks in quick succession.
Restarting the scheduler pod helped.

@eladkal eladkal added area:Scheduler Scheduler or dag parsing Issues affected_version:2.7 Issues Reported for 2.7 labels Apr 5, 2024
@Kenny1217

I'm having the same type of issue running Airflow 2.8.3 on Kubernetes. We use the KubernetesExecutor and an external Postgres database. We get the same error saying the DAG record was locked. The weird thing for us is that the DAGs are running, but only one task at a time is allowed. So, for example, say there are 3 DAGs scheduled at the same time: DAG 1 task 1 executes, then DAG 2 task 1, then DAG 3 task 1, and so on until all the tasks are completed. So it creates a bottleneck for DAGs to run. But the thing that locks everything up is when a DAG calls the TriggerDagRunOperator with wait_for_completion set to true. The task runs but gets stuck, because the DAG it triggers will never start while the task from the DAG that launched it is still running. That completely stops everything from running. Scaling the scheduler down and up fixes it for the most part. It only happens to us about once every two weeks, so it's hard to replicate.
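
For reference, a minimal sketch of the pattern described above, with hypothetical DAG and task ids - a task that triggers another DAG and holds its own task slot until that run finishes:

import pendulum
from airflow import DAG
from airflow.operators.trigger_dagrun import TriggerDagRunOperator

with DAG(
    dag_id="upstream_example",                   # hypothetical DAG id
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    schedule_interval=None,
):
    TriggerDagRunOperator(
        task_id="trigger_downstream",
        trigger_dag_id="downstream_example",     # hypothetical triggered DAG id
        wait_for_completion=True,                # keeps this task running while it waits
        poke_interval=30,                        # seconds between checks on the triggered run
    )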

@nookcreed
Author

Our setup is similar: KubernetesExecutor + Postgres (I reported the issue originally). I attached debug logs from around the time of failure to this ticket some time ago, but it looks like they weren't helpful. And as @Kenny1217 mentioned, it is hard to replicate. We are unable to upgrade to versions 2.7 or beyond. Happy to help out in any way possible.

@garywhiteford

Speaking of locks (un-locks), is there any chance this issue and discussion #38728 could be inversely related?

@paul-storey

I have also seen DAG runs stuck, with the DAG run in a 'running' state but without any tasks getting run. As a workaround I used airflow db clean. A disadvantage of this method is the loss of historical DAG run data.

@tanvn
Contributor

tanvn commented Apr 19, 2024

I upgraded to Airflow 2.8.4 a couple of weeks ago; my environment is the Kubernetes executor and MySQL 8, and I have not faced this issue yet.
Maybe this only happens with Postgres?

@deepaksood619

We are facing the same issue and nothing in this thread helped to solve the issue. What's the resolution here?

@Kenny1217

We were having the issue with Airflow 2.8.3, but ever since we upgraded to Airflow 2.9.0 we haven't seen the issue again. This could just be luck since it only happens randomly, but it's been close to a month now since we did the upgrade and we haven't had any DAG record lock errors.

@vaishali-cm

We also faced the same issue with v2.7.3, although for us clearing and rerunning the task worked. Rerunning the task wasn't costly for us as our tasks are idempotent, but it could be for others. Is upgrading to 2.9.0 the only long-term solution?

@hterik
Contributor

hterik commented May 16, 2024

After doing the following, all of these errors went away for us.

  1. Delete all DAGs from the database that no longer exist in the dagbag, using the airflow dags delete command

  2. Delete task instances in odd states, according to this comment from @Schlyterr above.

    1. Going into Browse -> Task Instances and deleting all task instances that didn't have one of the following states: upstream_failed, failed, success, removed (I think it was only task instances with state = scheduled and skipped that we deleted, if I remember correctly; not sure how relevant it is to delete skipped tasks)

    2. Going into Browse -> Dag Runs and deleting all DAG runs that had a state other than failed or success (I think we removed state = running and queued, if I remember correctly).

    I didn't delete all that much; I kept running and queued, and only deleted states that I didn't recognize at all. I can't remember exactly now, but I think there was some oddball task in a state called "started" or "scheduled" that had been stuck for many weeks, which doesn't match the usual "queued"/"running" you normally see.

  3. Delete any task instance older than 6 months, using airflow db clean --clean-before-timestamp xxx

  4. Update Airflow from 2.7.1 to 2.9.1

Which one of the actions above cured the problem is not known. Maybe only one of them is needed.
