Add (updated_at, dag_id) index to task_instance for updated_at-range API queries #65602
seanmuth wants to merge 9 commits
…nup queries

Adds two new migration files (0112, 0113) targeting high-volume query patterns on the task_instance table:

- idx_CRE_ti_state_updated_at: partial index on (state, updated_at) filtered to terminal states (success, failed). Speeds up queries that scan completed tasks by updated_at without touching in-flight rows.
- idx_CRE_ti_span_status: partial index on (span_status) filtered to 'should_end'. Narrows the index to only the small fraction of rows the OTel span-closing logic needs to find, avoiding full-table scans.

Both indexes use postgresql_where/sqlite_where for partial semantics on supported backends. MySQL receives non-partial fallback indexes (same pattern as existing idx_dag_run_running_dags). CONCURRENTLY is not used in the migration itself since Alembic batch mode runs inside a transaction; operators needing lock-free builds on large tables can apply the indexes manually with CONCURRENTLY before upgrading.

Migration chain: 0111 (9fabad868fdb) → 0112 (c4f5e6d7a8b9) → 0113 (b0c1d2e3f4a5)
Adds migration 0114 and the corresponding __table_args__ entry for a composite index on (updated_at, dag_id). Without this index, the GET /dags/~/dagRuns/~/taskInstances endpoint with an updated_at range filter (plus the dag_id IN (...) clause from PermittedTIFilter) causes a full sequential scan on task_instance, observed at ~39s avg latency in RDS Performance Insights. Putting updated_at first lets Postgres bound the scan to the time window; dag_id as the second column avoids heap fetches for the predicate evaluation within that range. A partial index (as in 0112) would not help here because this query carries no state filter and must cover task instances in all states. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Renames idx_CRE_ti_state_updated_at → ti_state_updated_at and idx_CRE_ti_span_status → ti_span_status to match the ti_* convention used by all other task_instance indexes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3.2.0 is already released. Rename files and update airflow_version to 3.3.0 to match the current development version and the convention established by 0110/0111 on this branch. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Should be able to get some real-world query plan before/after's today |
This comment was marked as outdated.
0114 (dag_id, updated_at). Note: the actual query produced here ends with a … Before: … After: … |
|
@seanmuth Converting to draft — this PR doesn't yet meet our Pull Request quality criteria.
See the linked criteria for how to fix each item, then mark the PR "Ready for review". This is not a rejection — just an invitation to bring the PR up to standard. No rush. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. |
|
oops didn't mean to convert back from draft. will work on getting those tests to pass today/tmr |
|
@seanmuth Is span_status still used on 3.2.0+? I know Standish did some work on overhauling OTel traces. |
will dig into that codepath |
|
ah yep, removed as of 3.2 in #63452 , will pull that index/migration out of the PR shortly |
Mirror the ti_state_updated_at partial index in __table_args__ so the SQLAlchemy model matches the migrated schema. Drop the unused ti_span_status migration (the OTEL span_status code was removed in 3.2) and reorder the remaining migrations.
|
@seanmuth, there is 1 unresolved review thread on this PR from @ashb. Could you either push a fix or reply in the thread explaining why the feedback doesn't apply? Once you believe the feedback is addressed, mark the thread as resolved so the reviewer isn't re-pinged needlessly. Thanks! Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer (a real person) will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. |
Two production deployments observed over a multi-week window show
idx_cre_ti_state_updated_at receiving only 2 and 28 index scans
respectively, against tables where ti_dag_run / ti_state see tens of
millions of scans. The terminal-state polling query that motivated the
index was an external service that has since been refactored away — no
upstream Airflow callsite issues a query matching the partial predicate
(state IN ('success','failed') AND updated_at <op> ...).
Carrying the index costs write amplification on every task transition
into a terminal state plus tens of MB per deployment, with no
corresponding read benefit. Drop the migration and the model index.
The composite (updated_at, dag_id) index is retained — production stats
show 2.4M and 2.5M scans on the same deployments, confirming it serves
the public taskInstances updated_at-range endpoint as intended.
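The usage figures quoted in this commit message come from PostgreSQL's pg_stat_user_indexes view. A minimal sketch of how such a check might be scripted is below; the helper name `rarely_scanned` and the `min_scans` threshold are hypothetical, and the two sample rows pair the first deployment's counts from the message above (the 2.4M figure is approximate).

```python
# Query against PostgreSQL's pg_stat_user_indexes statistics view.
# It is shown for reference only; this sketch does not execute it.
INDEX_USAGE_SQL = """
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'task_instance'
ORDER BY idx_scan
"""


def rarely_scanned(rows, min_scans):
    """Return the names of indexes whose idx_scan count is below min_scans.

    rows is an iterable of (indexrelname, idx_scan) tuples, as the query
    above would return them. The threshold is a judgment call, not a rule.
    """
    return [name for name, scans in rows if scans < min_scans]


# Sample rows based on the counts quoted in the commit message.
observed = [
    ("idx_cre_ti_state_updated_at", 2),
    ("ti_updated_at_dag_id", 2_400_000),
]

candidates = rarely_scanned(observed, min_scans=1_000)
print(candidates)
```

Scan counts only mean something relative to how long the statistics have been accumulating, so a check like this is best read alongside the observation window.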
|
@seanmuth — There is 1 unresolved review thread on this PR from @ashb, and you have engaged with each one (post-review commits and/or in-thread replies). Could you confirm whether you believe the feedback is fully addressed and the PR is ready for maintainer review confirmation? If yes, reply here (a short "yes / ready" is fine) and an Apache Airflow maintainer will pick the PR up from the review queue on the next sweep. If you are still working on a thread, please reply with what is outstanding so the threads stay unresolved on purpose. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you. |
|
Sorry @potiuk forgot to mark that convo as resolved. Should be good to go now with just one index addition here. |
|
I think this is a migration that should be very carefully benchmarked. How long does it take to run on a huge database? Will it increase the size of the database? Will it increase the time to create entries? I am not a specialist in this part (there are other people), but at the very least, evidence of migration checks on real data should be shown. It is also unclear what should drive the decision to run it, and how users should be warned about whether they have to update the index with or without Airflow running concurrently. I think any kind of migration of our DB should have evidence of heavy testing on various types of deployments. Also cc: @vatsrahul1001 and @ephraimbuddy, who had to deal with a lot of aftermath from poorly tested migrations. |
Summary
Adds one migration on task_instance: a composite index on (updated_at, dag_id) to support the public REST API endpoint GET /dags/~/dagRuns/~/taskInstances when called with an updated_at range filter.
Without this index, the query (which adds a dag_id IN (...) clause via PermittedTIFilter) performs a full sequential scan on task_instance, observed at ~39s avg latency in RDS Performance Insights on a real deployment. updated_at is the leading column so Postgres can bound the scan to the time window; dag_id is the trailing column to narrow within that range without a heap fetch.
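The effect of the column order can be illustrated on a toy table. The sketch below uses SQLite from the Python standard library rather than Postgres, so the planner and its output differ from what RDS would show; the table is a hypothetical, heavily simplified stand-in for task_instance, and the data is synthetic. It only demonstrates the general idea: a range predicate on the leading column lets the planner seek into the index instead of scanning the whole table.

```python
import sqlite3

# Hypothetical, simplified stand-in for the real task_instance table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance ("
    " id INTEGER PRIMARY KEY, dag_id TEXT, task_id TEXT,"
    " state TEXT, updated_at TEXT)"
)
# Synthetic rows spread over 5 DAGs and 28 days.
conn.executemany(
    "INSERT INTO task_instance (dag_id, task_id, state, updated_at)"
    " VALUES (?, ?, ?, ?)",
    [
        (f"dag_{i % 5}", f"task_{i}", "success", f"2024-01-{i % 28 + 1:02d}T00:00:00")
        for i in range(1000)
    ],
)

# Same shape as the endpoint's query: updated_at range plus dag_id IN (...).
QUERY = (
    "SELECT task_id, state FROM task_instance "
    "WHERE updated_at >= ? AND updated_at < ? AND dag_id IN (?, ?)"
)
PARAMS = ("2024-01-10T00:00:00", "2024-01-12T00:00:00", "dag_1", "dag_2")


def plan(connection):
    """Return SQLite's query plan for QUERY as one string."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, PARAMS).fetchall()
    # The human-readable plan text is the fourth column of each row.
    return " | ".join(row[3] for row in rows)


plan_before = plan(conn)
conn.execute(
    "CREATE INDEX ti_updated_at_dag_id ON task_instance (updated_at, dag_id)"
)
plan_after = plan(conn)

print("before:", plan_before)
print("after: ", plan_after)
```

Before the index exists the plan is a full table scan; afterwards it names ti_updated_at_dag_id. On Postgres the equivalent check is EXPLAIN ANALYZE against a loaded database, as listed in the test plan.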
Production evidence
Live pg_stat_user_indexes over a multi-week window across two production deployments running this index out-of-band shows it heavily used:
[table: idx_scan on ti_updated_at_dag_id per deployment]
Notes
- CONCURRENTLY is intentionally omitted: Alembic batch mode runs inside a transaction. Operators with large existing tables who want a lock-free build can apply the index manually with CREATE INDEX CONCURRENTLY before running airflow db migrate.
- Earlier revisions of this PR included two further indexes (the ti_state_updated_at partial index for terminal states, and the ti_span_status partial index). Both were dropped after production index-usage stats showed them receiving ~0 scans on real workloads: the queries that motivated them either no longer exist upstream or were issued by an external service that has since been refactored.
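For operators taking the manual route from the first note, the statement might look like the one built below. This is a sketch, not part of the migration: the helper `concurrent_index_ddl` is hypothetical, it does no identifier quoting (so it should only be fed trusted constants), and IF NOT EXISTS is an optional addition rather than something the PR specifies.

```python
def concurrent_index_ddl(index_name, table, columns):
    """Build a PostgreSQL CREATE INDEX CONCURRENTLY statement.

    Identifiers are interpolated as-is, so pass trusted constants only.
    """
    column_list = ", ".join(columns)
    return (
        f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {index_name} "
        f"ON {table} ({column_list})"
    )


# Index name and columns as used in this PR.
ddl = concurrent_index_ddl(
    "ti_updated_at_dag_id", "task_instance", ["updated_at", "dag_id"]
)

# PostgreSQL rejects CREATE INDEX CONCURRENTLY inside a transaction block,
# so the statement has to be sent on an autocommit connection
# (psql does this by default when no explicit BEGIN is issued).
print(ddl)
```

If a concurrent build fails partway, Postgres leaves an invalid index behind; it has to be dropped before retrying, because IF NOT EXISTS would otherwise skip the rebuild.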
Test plan
- airflow db migrate runs cleanly from a current main schema
- airflow db downgrade reverses the migration correctly
- EXPLAIN ANALYZE on a loaded DB shows an index scan on ti_updated_at_dag_id for the updated_at-range taskInstances endpoint after migration
Was generative AI tooling used to co-author this PR?
Generated-by: Claude Code (Opus 4.7) following the guidelines