Skip to content

DAG deletion is slow due to lack of database indexes on dag_id #20249

@kushsharma

Description

@kushsharma

Apache Airflow version

2.2.1

What happened

We have an airflow instance for approximately 6k DAGs.

  • If we delete a DAG from UI, the UI times out
  • If we delete a DAG from CLI, it completes but sometimes takes up to a half-hour to finish.

Most of the execution time appears to be consumed in database queries. I know I can just throw more CPU and memory to the db instance and hope it works but I think we can do better during delete operation. Correct me if I am wrong but I think this is the code that gets executed when deleting a DAG from UI or CLI via delete_dag.py

    for model in models.base.Base._decl_class_registry.values():
        if hasattr(model, "dag_id"):
            if keep_records_in_log and model.__name__ == 'Log':
                continue
            cond = or_(model.dag_id == dag_id, model.dag_id.like(dag_id + ".%"))
            count += session.query(model).filter(cond).delete(synchronize_session='fetch')
    if dag.is_subdag:
        parent_dag_id, task_id = dag_id.rsplit(".", 1)
        for model in TaskFail, models.TaskInstance:
            count += (
                session.query(model).filter(model.dag_id == parent_dag_id, model.task_id == task_id).delete()
            )

I see we are iterating over all the models and doing a dag_id match. Some of the tables don't have an index over dag_id column like job which is making this operation really slow. This could be one easy fix for this issue.

For example, the following query took 20 mins to finish in 16cpu 32gb Postgres instance:

SELECT job.id AS job_id FROM job WHERE job.dag_id = $1 OR job.dag_id LIKE $2

and explain is as follows

EXPLAIN SELECT job.id AS job_id FROM job WHERE job.dag_id = '';
                                QUERY PLAN
---------------------------------------------------------------------------
 Gather  (cost=1000.00..1799110.10 rows=6351 width=8)
   Workers Planned: 2
   ->  Parallel Seq Scan on job  (cost=0.00..1797475.00 rows=2646 width=8)
         Filter: ((dag_id)::text = ''::text)
(4 rows)

This is just one of the many queries that are being executed during the delete operation.

What you expected to happen

Deletion of DAG should not take this much time.

How to reproduce

No response

Operating System

nix

Versions of Apache Airflow Providers

No response

Deployment

Other Docker-based deployment

Deployment details

No response

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions