Add database index on dag_run.start_date for query optimization #57828

SavingFrame · 2025-11-04T19:51:21Z

DagRuns API page was extremely slow with large datasets due to unindexed ORDER BY on start_date column. The nullslast CASE statement workaround prevented index usage, forcing full table scans.

This change adds idx_dag_run_start_date index and removes the nullsfirst logic. While NULLS LAST would be ideal, it cannot leverage the index. Since there's no way to achieve both indexed performance and consistent NULL ordering across PostgreSQL and MySQL, we accept the database defaults: PostgreSQL orders NULLs first in DESC, MySQL orders NULLs last in DESC.

Fixes #57344
In the issue you can read full debugging description and tests results.

Previous time to load 6872051 records was 25-30 seconds.
New time is 1.9s(which is quite big IMO but exactly rundag query takes 50ms)

^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

uranusjr · 2025-11-05T03:43:35Z

airflow-core/src/airflow/api_fastapi/common/parameters.py

-            # MySQL does not support `nullslast`, and True/False ordering depends on the
-            # database implementation.
-            nullscheck = case((column.isnot(None), 0), else_=1)
-
-            columns.append(nullscheck)


Why is this removed?

This prevents the database from using indexes, which caused the dagruns page to load in about 30-40 seconds on large databases
Also after these changes dag page and main page works better

This feels like it will break MySQL, and instead of removing this it should be placed behind and "if ismysql:" style check.

And for non-mysql, we want to set nullsfirst/nullslast correctly.

Okay, i think then we need to discuss solution for this problem.

We want to show null always(both ASC and DESC) in the end, right?
In the postgres it won't work because index can't have null records in the start index and in the end at the same time.
In the postgres we are able to can specify where to store nulls in the index. If we want to store nulls in the end, then when we use ASC order it will be in the end, but if we use DESC order postgres should reverse index and nulls will be in the end. If we want to store nulls in the start, then vise versa.

So for example we can create index in the postgres

CREATE INDEX idx_dag_run_start_date ON dag_run (start_date NULLS FIRST);

so when we want to order by start_date ASC, then we get null, 2020-01-01, 2020-01-02, but on odering by start_date DESC we gonna get 2020-01-01, 2020-01-02, null

If we do not want to specify nulls order in the index, then in postgres we gonna get them in the end of index, but in the mysql in the start of index.

Anyway we can't use index and have nulls in the end always.

I assume we shouldn't use index anywhere, for example when we have dagruns filtered by dags we can afford scan table, but when you need to get all dagruns by all dags we can't scan table.
one of the solutions may be make this nullscheck optional and disable it in "big" queries such as get all dagruns

This feels like it will break MySQL, and instead of removing this it should be placed behind and "if ismysql:" style check.

Basically current solution will use default database behavior for each database.

Postgres:

ASC - nulls in the end

DESC - nulls in the start

Mysql:

ASC - nulls in the start

DESC - nulls in the end
Instead database will always prefer index(if it possible) instead of scan table.

I tend to think that we should remove this piece of code, and default to the database default sorting for nulls. This was a 'nice to have' improvement because users usually don't care about nulls, but even that can be argued with. So I think it's a lot of trouble for a very little added value.

At least this will make the endpoint 'usable' again from a performance point of view. Then we can later on add this back, more carefully, depending on the db used etc... there are multiple things to consider, and I don't think it's worth the effort at this point.

Maybe a completely dumb question, but if we really want this, would this work with two indexes?

CREATE INDEX idx_dag_run_start_date_nf ON dag_run (start_date NULLS FIRST); CREATE INDEX idx_dag_run_start_date_nl ON dag_run (start_date NULLS LAST);

So the planner can always find an index with 'nulls last' for both asc and desc

Also this makes me think that anyway this piece of code need to go. Because it will always prevent index usage of any column used for sorting. Basically the 'order_by' column is configurable from the query, so we would have the same problem with other available sorting columns.

Then we might need to add some other indexes to support performant other sorting.

This is pain to have such many indexes for ordering and will negatively affect performance during writing. Honestly I agree with you, this feature is good to have, but it has a huge impact on performance. I'd remove it at all and then slowly add where we really need it and where we have small amount of elements

uranusjr · 2025-11-05T03:44:30Z

airflow-core/src/airflow/migrations/versions/0092_3_2_0_add_start_date_index_to_dagrun.py

+
+def upgrade():
+    """Apply add start_date index to dagrun."""
+    # ### commands auto generated by Alembic - please adjust! ###


Please remove these comments generated by Alembic after you review the migration logic.

SavingFrame · 2025-11-05T08:20:14Z

My bad with tests, im working on it

DagRuns API page was extremely slow with large datasets due to unindexed ORDER BY on start_date column. The nullsfirst CASE statement workaround prevented index usage, forcing full table scans. This change adds idx_dag_run_start_date index and removes the nullsfirst logic. While NULLS FIRST DESC would be ideal, it cannot leverage the index. Since there's no way to achieve both indexed performance and consistent NULL ordering across PostgreSQL and MySQL, we accept the database defaults: PostgreSQL orders NULLs first in DESC, MySQL orders NULLs last in DESC. Fixes apache#57344

eladkal · 2025-11-06T16:25:07Z

airflow-core/src/airflow/migrations/versions/0092_3_2_0_add_start_date_index_to_dagrun.py

+down_revision = "b87d2135fa50"
+branch_labels = None
+depends_on = None
+airflow_version = "3.2.0"


Lets make sure if we want this bug fix to be in 3.2 or the next 3.1.x

ashb · 2025-11-06T16:34:58Z

airflow-core/src/airflow/migrations/versions/0092_3_2_0_add_start_date_index_to_dagrun.py

We discussed this idea (generally, not this PR specifically) and I think we don't want to add all indexes just to support filtering/ordering on the UI, but instead to give a general means to create index on fields that each user cares about

The reason we can't just add all indices for every column you can sort or order by is that the # of combinations is too large and it will kill DB write performance and the disk size needed.

I guess we have some columns set as the default ordering on the frontend. Should we create indexes for these columns, or at least change them to columns that already have indexes? My main point is that it's not ideal when we have slow performance on default sorting and users have to wait just to change it

pierrejeambrun · 2025-11-27T09:00:26Z

I am working on Ash suggestion which is to add configurable metadata database indexes. I'll create a new PR when it's ready and we might be able to close this one.

pierrejeambrun · 2025-11-28T14:12:55Z

Closing in favor of:
#57828

SavingFrame requested review from XD-DENG, ashb, bugraoz93, ephraimbuddy, jason810496, pierrejeambrun, rawwar and shubhamraj-git as code owners November 4, 2025 19:51

boring-cyborg bot added area:API Airflow's REST/HTTP API area:db-migrations PRs with DB migration kind:documentation labels Nov 4, 2025

uranusjr reviewed Nov 5, 2025

View reviewed changes

SavingFrame force-pushed the dagrun-improve-performance branch from 2732fcf to fc116eb Compare November 5, 2025 08:24

eladkal reviewed Nov 6, 2025

View reviewed changes

ashb reviewed Nov 6, 2025

View reviewed changes

pierrejeambrun closed this Nov 28, 2025

Add database index on dag_run.start_date for query optimization #57828

Add database index on dag_run.start_date for query optimization #57828

Uh oh!

Conversation

SavingFrame commented Nov 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pierrejeambrun Nov 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pierrejeambrun Nov 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

uranusjr Nov 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SavingFrame commented Nov 5, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pierrejeambrun commented Nov 27, 2025

Uh oh!

pierrejeambrun commented Nov 28, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

SavingFrame commented Nov 4, 2025 •

edited

Loading

pierrejeambrun Nov 5, 2025 •

edited

Loading

pierrejeambrun Nov 6, 2025 •

edited

Loading

uranusjr Nov 5, 2025 •

edited

Loading