Skip to content

Conversation

@SavingFrame
Copy link
Contributor

@SavingFrame SavingFrame commented Nov 4, 2025

DagRuns API page was extremely slow with large datasets due to unindexed ORDER BY on start_date column. The nullslast CASE statement workaround prevented index usage, forcing full table scans.

This change adds idx_dag_run_start_date index and removes the nullsfirst logic. While NULLS LAST would be ideal, it cannot leverage the index. Since there's no way to achieve both indexed performance and consistent NULL ordering across PostgreSQL and MySQL, we accept the database defaults: PostgreSQL orders NULLs first in DESC, MySQL orders NULLs last in DESC.

Fixes #57344
In the issue you can read full debugging description and tests results.

Previous time to load 6872051 records was 25-30 seconds.
New time is 1.9s(which is quite big IMO but exactly rundag query takes 50ms)


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in airflow-core/newsfragments.

Comment on lines -254 to -258
# MySQL does not support `nullslast`, and True/False ordering depends on the
# database implementation.
nullscheck = case((column.isnot(None), 0), else_=1)

columns.append(nullscheck)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This prevents the database from using indexes, which caused the dagruns page to load in about 30-40 seconds on large databases
Also after these changes dag page and main page works better

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it will break MySQL, and instead of removing this it should be placed behind and "if ismysql:" style check.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And for non-mysql, we want to set nullsfirst/nullslast correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, i think then we need to discuss solution for this problem.

We want to show null always(both ASC and DESC) in the end, right?
In the postgres it won't work because index can't have null records in the start index and in the end at the same time.
In the postgres we are able to can specify where to store nulls in the index. If we want to store nulls in the end, then when we use ASC order it will be in the end, but if we use DESC order postgres should reverse index and nulls will be in the end. If we want to store nulls in the start, then vise versa.

So for example we can create index in the postgres

CREATE INDEX idx_dag_run_start_date ON dag_run (start_date NULLS FIRST);

so when we want to order by start_date ASC, then we get null, 2020-01-01, 2020-01-02, but on odering by start_date DESC we gonna get 2020-01-01, 2020-01-02, null

If we do not want to specify nulls order in the index, then in postgres we gonna get them in the end of index, but in the mysql in the start of index.

Anyway we can't use index and have nulls in the end always.

I assume we shouldn't use index anywhere, for example when we have dagruns filtered by dags we can afford scan table, but when you need to get all dagruns by all dags we can't scan table.
one of the solutions may be make this nullscheck optional and disable it in "big" queries such as get all dagruns

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like it will break MySQL, and instead of removing this it should be placed behind and "if ismysql:" style check.

Basically current solution will use default database behavior for each database.

  • Postgres:
    • ASC - nulls in the end
    • DESC - nulls in the start
  • Mysql:
    • ASC - nulls in the start
    • DESC - nulls in the end
      Instead database will always prefer index(if it possible) instead of scan table.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to think that we should remove this piece of code, and default to the database default sorting for nulls. This was a 'nice to have' improvement because users usually don't care about nulls, but even that can be argued with. So I think it's a lot of trouble for a very little added value.

At least this will make the endpoint 'usable' again from a performance point of view. Then we can later on add this back, more carefully, depending on the db used etc... there are multiple things to consider, and I don't think it's worth the effort at this point.

Copy link
Member

@pierrejeambrun pierrejeambrun Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a completely dumb question, but if we really want this, would this work with two indexes?

CREATE INDEX idx_dag_run_start_date_nf ON dag_run (start_date NULLS FIRST);
CREATE INDEX idx_dag_run_start_date_nl ON dag_run (start_date NULLS LAST);

So the planner can always find an index with 'nulls last' for both asc and desc

Copy link
Member

@pierrejeambrun pierrejeambrun Nov 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also this makes me think that anyway this piece of code need to go. Because it will always prevent index usage of any column used for sorting. Basically the 'order_by' column is configurable from the query, so we would have the same problem with other available sorting columns.

Then we might need to add some other indexes to support performant other sorting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pain to have such many indexes for ordering and will negatively affect performance during writing. Honestly I agree with you, this feature is good to have, but it has a huge impact on performance. I'd remove it at all and then slowly add where we really need it and where we have small amount of elements


def upgrade():
"""Apply add start_date index to dagrun."""
# ### commands auto generated by Alembic - please adjust! ###
Copy link
Member

@uranusjr uranusjr Nov 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove these comments generated by Alembic after you review the migration logic.

@SavingFrame
Copy link
Contributor Author

My bad with tests, im working on it

DagRuns API page was extremely slow with large datasets due to
unindexed ORDER BY on start_date column. The nullsfirst CASE
statement workaround prevented index usage, forcing full table scans.

This change adds idx_dag_run_start_date index and removes the
nullsfirst logic. While NULLS FIRST DESC would be ideal, it cannot
leverage the index. Since there's no way to achieve both indexed
performance and consistent NULL ordering across PostgreSQL and MySQL,
we accept the database defaults: PostgreSQL orders NULLs first in DESC,
MySQL orders NULLs last in DESC.

Fixes apache#57344
@SavingFrame SavingFrame force-pushed the dagrun-improve-performance branch from 2732fcf to fc116eb Compare November 5, 2025 08:24
down_revision = "b87d2135fa50"
branch_labels = None
depends_on = None
airflow_version = "3.2.0"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets make sure if we want this bug fix to be in 3.2 or the next 3.1.x

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed this idea (generally, not this PR specifically) and I think we don't want to add all indexes just to support filtering/ordering on the UI, but instead to give a general means to create index on fields that each user cares about

The reason we can't just add all indices for every column you can sort or order by is that the # of combinations is too large and it will kill DB write performance and the disk size needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we have some columns set as the default ordering on the frontend. Should we create indexes for these columns, or at least change them to columns that already have indexes? My main point is that it's not ideal when we have slow performance on default sorting and users have to wait just to change it

@pierrejeambrun
Copy link
Member

I am working on Ash suggestion which is to add configurable metadata database indexes. I'll create a new PR when it's ready and we might be able to close this one.

@pierrejeambrun
Copy link
Member

Closing in favor of:
#57828

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:API Airflow's REST/HTTP API area:db-migrations PRs with DB migration kind:documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

API: DagRuns page extremely slow with large number of records due to unindexed ORDER BY

5 participants