-
Notifications
You must be signed in to change notification settings - Fork 16.5k
Description
Apache Airflow version
3.1.7
If "Other Airflow 3 version" selected, which one?
No response
What happened?
I'm testing the upgrade of an Airflow 2.11.0 instance to 3.1.7 in Kubernetes and was observing repeated restarts of the dag-processor during initial start up due to failed liveness checks. After removing the liveness probe, the processor was able to show the actual errors. The dag-processor was successfully starting parses of the individual DAG files (up to the process limit), but then each DAG parse would stall. The dag-processor would sit there consuming 0% CPU resources and then eventually each DAG parse step would time out with a message like this and then a new batch would start and the process would repeat:
Processor for DagFileInfo(rel_path=PosixPath('my/dag/path.py'), bundle_name='dags-folder', bundle_path=PosixPath('/opt/airflow/dags'), bundle_version=None) with PID 19 started 807 ago killing it.
During this, I observed a very slow query in the upgraded Airflow database.
SELECT EXISTS (SELECT *
FROM task_instance
WHERE task_instance.dag_version_id = '019c453a-63af-77d4-8a7f-edd1832cae1e'::UUID) AS anon_1
Our task_instance table has over 27M rows and the dag_version_id column is not indexed, so this was taking tens of minutes to do a full table scan.
After patching our instance with this change, the dag-processor was able to proceed.
3.1.7...jvstein:airflow:fix_slow_task_instance_query
What you think should happen instead?
The query in question does a full table scan on a very large table. There appears to be an available dag_id value in the function, which should be used for filtering in addition to the dag_version_id value.
How to reproduce
Start with a very large number of records in the task_instance table. Run the dag-processor. The initial parsing of DAGs should be very slow to proceed.
Operating System
Debian bookworm
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==9.21.0
apache-airflow-providers-apache-iceberg==1.4.1
apache-airflow-providers-celery==3.15.2
apache-airflow-providers-cncf-kubernetes==10.12.3
apache-airflow-providers-common-compat==1.13.0
apache-airflow-providers-common-io==1.7.1
apache-airflow-providers-common-sql==1.30.4
apache-airflow-providers-fab==3.2.0
apache-airflow-providers-google==19.5.0
apache-airflow-providers-hashicorp==4.4.3
apache-airflow-providers-http==5.6.4
apache-airflow-providers-mysql==6.4.2
apache-airflow-providers-postgres==6.5.3
apache-airflow-providers-sendgrid==4.2.1
apache-airflow-providers-slack==9.6.2
apache-airflow-providers-smtp==2.4.2
apache-airflow-providers-standard==1.11.0
apache-airflow-providers-trino==6.4.2
Deployment
Official Apache Airflow Helm Chart
Deployment details
Deployed via helm to k8s. Not heavily customized.
Anything else?
In our upgraded database with lots of history, this happened every time.
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct