-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Description
Apache Airflow version
2.2.1
What happened
We have an airflow instance for approximately 6k DAGs.
- If we delete a DAG from UI, the UI times out
- If we delete a DAG from CLI, it completes but sometimes takes up to a half-hour to finish.
Most of the execution time appears to be consumed in database queries. I know I can just throw more CPU and memory to the db instance and hope it works but I think we can do better during delete operation. Correct me if I am wrong but I think this is the code that gets executed when deleting a DAG from UI or CLI via delete_dag.py
for model in models.base.Base._decl_class_registry.values():
if hasattr(model, "dag_id"):
if keep_records_in_log and model.__name__ == 'Log':
continue
cond = or_(model.dag_id == dag_id, model.dag_id.like(dag_id + ".%"))
count += session.query(model).filter(cond).delete(synchronize_session='fetch')
if dag.is_subdag:
parent_dag_id, task_id = dag_id.rsplit(".", 1)
for model in TaskFail, models.TaskInstance:
count += (
session.query(model).filter(model.dag_id == parent_dag_id, model.task_id == task_id).delete()
)I see we are iterating over all the models and doing a dag_id match. Some of the tables don't have an index over dag_id column like job which is making this operation really slow. This could be one easy fix for this issue.
For example, the following query took 20 mins to finish in 16cpu 32gb Postgres instance:
SELECT job.id AS job_id FROM job WHERE job.dag_id = $1 OR job.dag_id LIKE $2and explain is as follows
EXPLAIN SELECT job.id AS job_id FROM job WHERE job.dag_id = '';
QUERY PLAN
---------------------------------------------------------------------------
Gather (cost=1000.00..1799110.10 rows=6351 width=8)
Workers Planned: 2
-> Parallel Seq Scan on job (cost=0.00..1797475.00 rows=2646 width=8)
Filter: ((dag_id)::text = ''::text)
(4 rows)This is just one of the many queries that are being executed during the delete operation.
What you expected to happen
Deletion of DAG should not take this much time.
How to reproduce
No response
Operating System
nix
Versions of Apache Airflow Providers
No response
Deployment
Other Docker-based deployment
Deployment details
No response
Anything else
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct