-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Add database index on dag_run.start_date for query optimization #57828
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| # MySQL does not support `nullslast`, and True/False ordering depends on the | ||
| # database implementation. | ||
| nullscheck = case((column.isnot(None), 0), else_=1) | ||
|
|
||
| columns.append(nullscheck) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This prevents the database from using indexes, which caused the dagruns page to load in about 30-40 seconds on large databases
Also after these changes dag page and main page works better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it will break MySQL, and instead of removing this it should be placed behind and "if ismysql:" style check.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And for non-mysql, we want to set nullsfirst/nullslast correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, i think then we need to discuss solution for this problem.
We want to show null always(both ASC and DESC) in the end, right?
In the postgres it won't work because index can't have null records in the start index and in the end at the same time.
In the postgres we are able to can specify where to store nulls in the index. If we want to store nulls in the end, then when we use ASC order it will be in the end, but if we use DESC order postgres should reverse index and nulls will be in the end. If we want to store nulls in the start, then vise versa.
So for example we can create index in the postgres
CREATE INDEX idx_dag_run_start_date ON dag_run (start_date NULLS FIRST);so when we want to order by start_date ASC, then we get null, 2020-01-01, 2020-01-02, but on odering by start_date DESC we gonna get 2020-01-01, 2020-01-02, null
If we do not want to specify nulls order in the index, then in postgres we gonna get them in the end of index, but in the mysql in the start of index.
Anyway we can't use index and have nulls in the end always.
I assume we shouldn't use index anywhere, for example when we have dagruns filtered by dags we can afford scan table, but when you need to get all dagruns by all dags we can't scan table.
one of the solutions may be make this nullscheck optional and disable it in "big" queries such as get all dagruns
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels like it will break MySQL, and instead of removing this it should be placed behind and "if ismysql:" style check.
Basically current solution will use default database behavior for each database.
- Postgres:
- ASC - nulls in the end
- DESC - nulls in the start
- Mysql:
- ASC - nulls in the start
- DESC - nulls in the end
Instead database will always prefer index(if it possible) instead of scan table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tend to think that we should remove this piece of code, and default to the database default sorting for nulls. This was a 'nice to have' improvement because users usually don't care about nulls, but even that can be argued with. So I think it's a lot of trouble for a very little added value.
At least this will make the endpoint 'usable' again from a performance point of view. Then we can later on add this back, more carefully, depending on the db used etc... there are multiple things to consider, and I don't think it's worth the effort at this point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a completely dumb question, but if we really want this, would this work with two indexes?
CREATE INDEX idx_dag_run_start_date_nf ON dag_run (start_date NULLS FIRST);
CREATE INDEX idx_dag_run_start_date_nl ON dag_run (start_date NULLS LAST);
So the planner can always find an index with 'nulls last' for both asc and desc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also this makes me think that anyway this piece of code need to go. Because it will always prevent index usage of any column used for sorting. Basically the 'order_by' column is configurable from the query, so we would have the same problem with other available sorting columns.
Then we might need to add some other indexes to support performant other sorting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pain to have such many indexes for ordering and will negatively affect performance during writing. Honestly I agree with you, this feature is good to have, but it has a huge impact on performance. I'd remove it at all and then slowly add where we really need it and where we have small amount of elements
|
|
||
| def upgrade(): | ||
| """Apply add start_date index to dagrun.""" | ||
| # ### commands auto generated by Alembic - please adjust! ### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please remove these comments generated by Alembic after you review the migration logic.
|
My bad with tests, im working on it |
DagRuns API page was extremely slow with large datasets due to unindexed ORDER BY on start_date column. The nullsfirst CASE statement workaround prevented index usage, forcing full table scans. This change adds idx_dag_run_start_date index and removes the nullsfirst logic. While NULLS FIRST DESC would be ideal, it cannot leverage the index. Since there's no way to achieve both indexed performance and consistent NULL ordering across PostgreSQL and MySQL, we accept the database defaults: PostgreSQL orders NULLs first in DESC, MySQL orders NULLs last in DESC. Fixes apache#57344
2732fcf to
fc116eb
Compare
| down_revision = "b87d2135fa50" | ||
| branch_labels = None | ||
| depends_on = None | ||
| airflow_version = "3.2.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets make sure if we want this bug fix to be in 3.2 or the next 3.1.x
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this idea (generally, not this PR specifically) and I think we don't want to add all indexes just to support filtering/ordering on the UI, but instead to give a general means to create index on fields that each user cares about
The reason we can't just add all indices for every column you can sort or order by is that the # of combinations is too large and it will kill DB write performance and the disk size needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we have some columns set as the default ordering on the frontend. Should we create indexes for these columns, or at least change them to columns that already have indexes? My main point is that it's not ideal when we have slow performance on default sorting and users have to wait just to change it
|
I am working on Ash suggestion which is to add configurable metadata database indexes. I'll create a new PR when it's ready and we might be able to close this one. |
|
Closing in favor of: |
DagRuns API page was extremely slow with large datasets due to unindexed ORDER BY on start_date column. The nullslast CASE statement workaround prevented index usage, forcing full table scans.
This change adds idx_dag_run_start_date index and removes the nullsfirst logic. While NULLS LAST would be ideal, it cannot leverage the index. Since there's no way to achieve both indexed performance and consistent NULL ordering across PostgreSQL and MySQL, we accept the database defaults: PostgreSQL orders NULLs first in DESC, MySQL orders NULLs last in DESC.
Fixes #57344
In the issue you can read full debugging description and tests results.
Previous time to load
6872051records was 25-30 seconds.New time is 1.9s(which is quite big IMO but exactly rundag query takes 50ms)
^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named
{pr_number}.significant.rstor{issue_number}.significant.rst, in airflow-core/newsfragments.