Skip to content

Dag Processor performance issue querying task_instance table by dag_version_id #61894

@jvstein

Description

@jvstein

Apache Airflow version

3.1.7

If "Other Airflow 3 version" selected, which one?

No response

What happened?

I'm testing the upgrade of an Airflow 2.11.0 instance to 3.1.7 in Kubernetes and was observing repeated restarts of the dag-processor during initial start up due to failed liveness checks. After removing the liveness probe, the processor was able to show the actual errors. The dag-processor was successfully starting parses of the individual DAG files (up to the process limit), but then each DAG parse would stall. The dag-processor would sit there consuming 0% CPU resources and then eventually each DAG parse step would time out with a message like this and then a new batch would start and the process would repeat:

Processor for DagFileInfo(rel_path=PosixPath('my/dag/path.py'), bundle_name='dags-folder', bundle_path=PosixPath('/opt/airflow/dags'), bundle_version=None) with PID 19 started 807 ago killing it.

During this, I observed a very slow query in the upgraded Airflow database.

SELECT EXISTS (SELECT *
FROM task_instance
WHERE task_instance.dag_version_id = '019c453a-63af-77d4-8a7f-edd1832cae1e'::UUID) AS anon_1

Our task_instance table has over 27M rows and the dag_version_id column is not indexed, so this was taking tens of minutes to do a full table scan.

After patching our instance with this change, the dag-processor was able to proceed.
3.1.7...jvstein:airflow:fix_slow_task_instance_query

What you think should happen instead?

The query in question does a full table scan on a very large table. There appears to be an available dag_id value in the function, which should be used for filtering in addition to the dag_version_id value.

How to reproduce

Start with a very large number of records in the task_instance table. Run the dag-processor. The initial parsing of DAGs should be very slow to proceed.

Operating System

Debian bookworm

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==9.21.0
apache-airflow-providers-apache-iceberg==1.4.1
apache-airflow-providers-celery==3.15.2
apache-airflow-providers-cncf-kubernetes==10.12.3
apache-airflow-providers-common-compat==1.13.0
apache-airflow-providers-common-io==1.7.1
apache-airflow-providers-common-sql==1.30.4
apache-airflow-providers-fab==3.2.0
apache-airflow-providers-google==19.5.0
apache-airflow-providers-hashicorp==4.4.3
apache-airflow-providers-http==5.6.4
apache-airflow-providers-mysql==6.4.2
apache-airflow-providers-postgres==6.5.3
apache-airflow-providers-sendgrid==4.2.1
apache-airflow-providers-slack==9.6.2
apache-airflow-providers-smtp==2.4.2
apache-airflow-providers-standard==1.11.0
apache-airflow-providers-trino==6.4.2

Deployment

Official Apache Airflow Helm Chart

Deployment details

Deployed via helm to k8s. Not heavily customized.

Anything else?

In our upgraded database with lots of history, this happened every time.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions