Improve performance of rendered templates cleanup #60951
Merged
8be4834 to
ba544bf
Compare
jedcunningham
approved these changes
Jan 22, 2026
pierrejeambrun
approved these changes
Jan 22, 2026
Member
Author
Of course it has to be MySQL!
dstandish
previously requested changes
Jan 22, 2026
Contributor
dstandish
left a comment
i never like to request changes, but i think you need to rename the config here so just wanted to signal that
airflow-core/docs/administration-and-deployment/dag-serialization.rst
Outdated
The delete_old_records query was scanning the RTIF table with a complex NOT EXISTS subquery containing a join, causing statement timeouts and scheduler crashes for DAGs with 3k+ mapped task instances (~100k RTIF records).

Simplified the query to only use the dag_run table for finding recent run_ids, avoiding the expensive RTIF table scan entirely. Benchmarks show ~38x improvement (330ms -> 6ms for 5000 records).

Changes:
- Replaced NOT EXISTS with simple NOT IN on dag_run table
- Uses run_after instead of logical_date (avoids NULL for manual runs)
- Renamed config to max_num_rendered_ti_fields_per_dag_run with backward-compatible deprecation of old name
- Added test for sparse task behavior to document the semantic change

Note: Retention is now based on N most recent dag runs rather than N most recent task executions. For sparse/conditional tasks, this may result in fewer historical records being retained.
20195b1 to
68ddfdf
Compare
dstandish
approved these changes
Jan 29, 2026
amoghrajesh
approved these changes
Jan 29, 2026
sanchalitorpe-source
pushed a commit
to sanchalitorpe-source/airflow
that referenced
this pull request
Jan 30, 2026
Improve performance of rendered templates cleanup
Labels
area:ConfigTemplates
area:serialization
full tests needed
We need to run full set of tests for this PR to merge
kind:documentation
Optimizes `RenderedTaskInstanceFields.delete_old_records()` to avoid scanning the RTIF table when determining which records to keep. This fixes scheduler crashes caused by slow RTIF cleanup queries on DAGs with many dynamically mapped tasks.

Problem: For DAGs with 3k+ mapped task instances, the delete query was scanning ~100k RTIF records with a complex `NOT EXISTS` subquery containing a join, causing statement timeouts and high CPU utilization that prevented the scheduler from heartbeating.

This PR simplifies the query to only use the `dag_run` table for finding recent run_ids, avoiding the expensive RTIF table scan entirely.

Before (slow - scans RTIF table with join)
After (fast - only queries dag_run table)
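As a rough illustration of the "After" approach, here is a minimal sketch that uses SQLite and simplified toy tables. The table layout, the helper name `delete_old_rtif`, and the sample data are assumptions made for illustration; this is not Airflow's actual schema or implementation. It only shows the general shape: pick the N most recent run_ids from `dag_run` ordered by `run_after`, then delete RTIF rows whose `run_id` is not in that set.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two simplified, hypothetical tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dag_run (dag_id TEXT, run_id TEXT, run_after TEXT);
        CREATE TABLE rtif (dag_id TEXT, task_id TEXT, run_id TEXT, map_index INTEGER);
        """
    )
    return conn


def delete_old_rtif(conn, dag_id, task_id, num_to_keep):
    """Delete RTIF rows for runs outside the num_to_keep most recent dag runs.

    The subquery reads only dag_run; the rtif table is touched by the DELETE alone.
    """
    cur = conn.execute(
        """
        DELETE FROM rtif
        WHERE dag_id = ? AND task_id = ?
          AND run_id NOT IN (
              SELECT run_id FROM dag_run
              WHERE dag_id = ?
              ORDER BY run_after DESC
              LIMIT ?
          )
        """,
        (dag_id, task_id, dag_id, num_to_keep),
    )
    conn.commit()
    return cur.rowcount


conn = make_db()
for i in range(1, 6):
    run_id = f"run_{i}"
    conn.execute("INSERT INTO dag_run VALUES (?, ?, ?)", ("d", run_id, f"2026-01-0{i}"))
    # Two mapped task instances per run for the same task.
    for map_index in range(2):
        conn.execute("INSERT INTO rtif VALUES (?, ?, ?, ?)", ("d", "t", run_id, map_index))

deleted = delete_old_rtif(conn, "d", "t", num_to_keep=3)
remaining = [r[0] for r in conn.execute("SELECT DISTINCT run_id FROM rtif ORDER BY run_id")]
print(deleted, remaining)
```

A `LIMIT` inside an `IN` subquery works in SQLite but is not accepted by every database, so a portable version may need a different formulation.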
Benchmark Results (PostgreSQL)
Test Setup: 100 DAG runs x 50 mapped TIs = 5,000 RTIF records, keeping 30 most recent runs
Query Plan Comparison
Old Query - Nested loop with expensive heap fetches:
New Query - Simple index scan with hashed subplan:
Behavioral Change
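The semantics change slightly: retention is now based on the N most recent dag runs rather than the N most recent task executions, so sparse or conditional tasks may retain fewer historical records.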
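To make the difference concrete for a sparse task, here is a small sketch in plain Python. The function names and the sample run ids are hypothetical and only model the two retention rules as described in the commit message (N most recent task executions versus N most recent dag runs); they do not mirror Airflow's code.

```python
def keep_by_task_executions(task_run_ids, n):
    """Old rule (modelled): keep the n most recent runs in which the task has records."""
    return set(task_run_ids[-n:])


def keep_by_dag_runs(task_run_ids, all_run_ids, n):
    """New rule (modelled): keep task records only for the n most recent dag runs."""
    recent = set(all_run_ids[-n:])
    return {run_id for run_id in task_run_ids if run_id in recent}


# Runs are listed oldest to newest.
all_runs = ["r1", "r2", "r3", "r4", "r5"]
# A conditional task that only executed in three of the five runs.
sparse_task_runs = ["r1", "r2", "r5"]

old_kept = keep_by_task_executions(sparse_task_runs, n=3)
new_kept = keep_by_dag_runs(sparse_task_runs, all_runs, n=3)
print(sorted(old_kept), sorted(new_kept))
```

Under the old rule all three executions survive, while under the new rule only the execution that falls within the three most recent dag runs is retained.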
This imo is acceptable because of the following, but I'm looking forward to hearing what others think too:
Additional Changes
Uses `run_after` instead of `logical_date` for ordering (since `logical_date` can be NULL for manual runs)
How to Reproduce
To reproduce the benchmark, save this script and run inside breeze with PostgreSQL:
Benchmark Script