Add new composite index to DagRun for dag_id, logical_date, run_type #62139

Open
ronaldorcampos wants to merge 1 commit into apache:main from ronaldorcampos:fix/dag_run
ronaldorcampos commented Feb 18, 2026

While debugging queries on my MySQL server, I noticed that many of the slow queries originate from Airflow:

# Time: 2026-02-18T07:25:10.856389Z
# User@Host: airflow[airflow] @  [192.168.100.3]  Id: 177446
# Query_time: 17.806845  Lock_time: 0.000006 Rows_sent: 1  Rows_examined: 2
SET timestamp=1771399493;
SELECT dag_run.id, dag_run.dag_id, dag_run.logical_date, dag_run.data_interval_start, dag_run.data_interval_end 
FROM dag_run, (SELECT dag_run.dag_id AS dag_id, max(dag_run.logical_date) AS max_logical_date 
FROM dag_run 
WHERE dag_run.dag_id IN ('vmware_monitor_ingestion', 'vmware_ingest_monitor_data') AND dag_run.run_type IN ('backfill', 'scheduled') GROUP BY dag_run.dag_id) AS anon_1 
WHERE dag_run.dag_id = anon_1.dag_id AND dag_run.logical_date = anon_1.max_logical_date;

For instance, running EXPLAIN on the query above shows:

{
	"data":
	[
		{
			"id": 1,
			"select_type": "PRIMARY",
			"table": "<derived2>",
			"partitions": null,
			"type": "ALL",
			"possible_keys": null,
			"key": null,
			"key_len": null,
			"ref": null,
			"rows": 68862,
			"filtered": 100,
			"Extra": "Using where"
		},
		{
			"id": 1,
			"select_type": "PRIMARY",
			"table": "dag_run",
			"partitions": null,
			"type": "eq_ref",
			"possible_keys": "dag_run_dag_id_run_id_key,dag_run_dag_id_logical_date_key,idx_dag_run_dag_id,dag_id_state",
			"key": "dag_run_dag_id_logical_date_key",
			"key_len": "760",
			"ref": "anon_1.dag_id,anon_1.max_logical_date",
			"rows": 1,
			"filtered": 100,
			"Extra": null
		},
		{
			"id": 2,
			"select_type": "DERIVED",
			"table": "dag_run",
			"partitions": null,
			"type": "index",
			"possible_keys": "dag_run_dag_id_run_id_key,dag_run_dag_id_logical_date_key,idx_dag_run_dag_id,idx_dag_run_queued_dags,dag_id_state,idx_dag_run_running_dags",
			"key": "dag_run_dag_id_run_id_key",
			"key_len": "1504",
			"ref": null,
			"rows": 685368,
			"filtered": 10.05,
			"Extra": "Using where"
		}
	]
}

The derived table (the GROUP BY subquery) walks the whole dag_run_dag_id_run_id_key index, an estimated 685k+ rows, which is terrible performance-wise and results in 15-20 s queries on my database.

Manually adding the new composite index, dag_id_run_type_logical_date_key, to my database changes the EXPLAIN output to:

{
	"data":
	[
		{
			"id": 1,
			"select_type": "PRIMARY",
			"table": "<derived2>",
			"partitions": null,
			"type": "ALL",
			"possible_keys": null,
			"key": null,
			"key_len": null,
			"ref": null,
			"rows": 2,
			"filtered": 100,
			"Extra": "Using where"
		},
		{
			"id": 1,
			"select_type": "PRIMARY",
			"table": "dag_run",
			"partitions": null,
			"type": "eq_ref",
			"possible_keys": "dag_run_dag_id_run_id_key,dag_run_dag_id_logical_date_key,idx_dag_run_dag_id,dag_id_state,dag_id_run_type_logical_date_key",
			"key": "dag_run_dag_id_logical_date_key",
			"key_len": "760",
			"ref": "anon_1.dag_id,anon_1.max_logical_date",
			"rows": 1,
			"filtered": 100,
			"Extra": null
		},
		{
			"id": 2,
			"select_type": "DERIVED",
			"table": "dag_run",
			"partitions": null,
			"type": "range",
			"possible_keys": "dag_run_dag_id_run_id_key,dag_run_dag_id_logical_date_key,idx_dag_run_dag_id,idx_dag_run_queued_dags,dag_id_state,idx_dag_run_running_dags,dag_id_run_type_logical_date_key",
			"key": "dag_id_run_type_logical_date_key",
			"key_len": "904",
			"ref": null,
			"rows": 2,
			"filtered": 100,
			"Extra": "Using where; Using index for group-by"
		}
	]
}

The same query now runs in 150-200ms.
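
For anyone who wants to poke at the query shape locally, here is a minimal sketch using Python's built-in sqlite3 as a stand-in for MySQL. It is not the migration in this PR. The column order (dag_id, run_type, logical_date) is an assumption inferred from the index name, the table is a cut-down dag_run with made-up rows, and the helper names (build_demo_db, latest_runs, query_plan) are hypothetical. SQLite's planner differs from MySQL's, so the plan it prints only illustrates the idea and will not match the EXPLAIN output above.

```python
import sqlite3

# Index name taken from this PR; the column order is an ASSUMPTION inferred
# from that name, not confirmed by the PR diff.
INDEX_NAME = "dag_id_run_type_logical_date_key"
INDEX_COLUMNS = ("dag_id", "run_type", "logical_date")

# Same shape as the slow query: latest scheduled/backfill run per DAG.
# Takes exactly two dag_id parameters, like the example in the slow log.
LATEST_RUN_SQL = """
SELECT dag_run.id, dag_run.dag_id, dag_run.logical_date
FROM dag_run,
     (SELECT dag_id, max(logical_date) AS max_logical_date
      FROM dag_run
      WHERE dag_id IN (?, ?) AND run_type IN ('backfill', 'scheduled')
      GROUP BY dag_id) AS anon_1
WHERE dag_run.dag_id = anon_1.dag_id
  AND dag_run.logical_date = anon_1.max_logical_date
ORDER BY dag_run.dag_id
"""


def build_demo_db():
    """Create an in-memory, cut-down dag_run table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dag_run ("
        "id INTEGER PRIMARY KEY, dag_id TEXT NOT NULL, "
        "run_type TEXT NOT NULL, logical_date TEXT NOT NULL)"
    )
    rows = []
    for dag_id in ("dag_a", "dag_b", "dag_c"):
        for day in range(1, 29):
            # Every 7th day is a manual run, so day 28 is excluded by the filter.
            run_type = "manual" if day % 7 == 0 else "scheduled"
            rows.append((dag_id, run_type, f"2026-01-{day:02d}"))
    conn.executemany(
        "INSERT INTO dag_run (dag_id, run_type, logical_date) VALUES (?, ?, ?)",
        rows,
    )
    conn.execute(
        f"CREATE INDEX {INDEX_NAME} ON dag_run ({', '.join(INDEX_COLUMNS)})"
    )
    return conn


def latest_runs(conn, dag_ids):
    """Return (id, dag_id, logical_date) of the latest matching run per DAG."""
    return conn.execute(LATEST_RUN_SQL, dag_ids).fetchall()


def query_plan(conn, dag_ids):
    """Return SQLite's plan detail strings for the query (SQLite-specific)."""
    cursor = conn.execute("EXPLAIN QUERY PLAN " + LATEST_RUN_SQL, dag_ids)
    return [row[-1] for row in cursor]


if __name__ == "__main__":
    conn = build_demo_db()
    print(latest_runs(conn, ("dag_a", "dag_b")))
    for step in query_plan(conn, ("dag_a", "dag_b")):
        print(step)
```

Running the script prints the latest scheduled run for the two DAGs and then SQLite's plan, which makes it easy to compare against a variant with the CREATE INDEX line removed. Real numbers should still come from EXPLAIN on the actual MySQL database.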


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

  • Read the Pull Request Guidelines for more information. Note: commit author/co-author name and email in commits become permanently public when merged.
  • For fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
  • When adding dependency, check compliance with the ASF 3rd Party License Policy.
  • For significant user-facing changes create newsfragment: {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.
