Fix scheduler UniqueViolation crash on downgrade from 3.2.0 to 3.1.x#65688
Conversation
There was a problem hiding this comment.
Pull request overview
Adds a downgrade-time data migration to prevent scheduler crash loops when rolling back from Airflow 3.2.0 to 3.1.x due to dag_run run_id collisions stemming from changed scheduled run_id semantics.
Changes:
- Extend the
0107_3_2_0_add_partition_fields_to_dagmigration’sdowngrade()to NULL outDagModelnext-run fields (next_dagrun*) so 3.1.x recomputes them using its own semantics.
Testing Report: Airflow 3.1.8 <-> 3.2.0 Upgrade/Downgrade MigrationObjectiveVerify that the Airflow survives a full upgrade (3.1.8 -> 3.2.0) and downgrade Environment
Test Steps & ResultsStep 1 — Baseline on 3.1.8
Verification query: SELECT run_id, logical_date, run_after, state
FROM dag_run
WHERE dag_id = 'my_dag'
ORDER BY logical_date DESC;
Step 2 — Upgrade to 3.2.0
Step 3 — Downgrade back to 3.1.8 (with the downgrade migration fix)From the 3.2.0 breeze shell: airflow db downgrade -n 3.1.8Downgrade completed cleanly through all 3.2.0 migrations. Verification query — SELECT dag_id, next_dagrun, next_dagrun_create_after,
next_dagrun_data_interval_start, next_dagrun_data_interval_end
FROM dag
WHERE dag_id = 'my_dag';The fix intentionally NULLs scheduling fields during downgrade so 3.1.8 recomputes them
Step 4 — Restart on 3.1.8 with downgraded DB
DAG processor log confirmed recomputation of NULL fields and it was handled correctly. Verification queries: -- All fields recomputed
SELECT dag_id, next_dagrun, next_dagrun_create_after,
next_dagrun_data_interval_start, next_dagrun_data_interval_end
FROM dag
WHERE dag_id = 'my_dag';
-- All historical runs intact
SELECT run_id, logical_date, run_after, state
FROM dag_run
WHERE dag_id = 'my_dag'
ORDER BY logical_date DESC; |
|
cc @vatsrahul1001 / @kaxil would love to get your reviews here too |
|
Test status on Astro
|
|
Thanks for testing it @vatsrahul1001! Merging this one in. |
Was generative AI tooling used to co-author this PR?
What?
When downgrading from Airflow 3.2.0 to 3.1.x, the scheduler enters a crash loop due to
UniqueViolationerrors on thedag_runtable, causing all DAG scheduling to stop.Current behaviour
Airflow 3.2.0 (related PR: #59115) changed how run_id is generated for scheduled DAG runs:
run_id = scheduled__<logical_date>run_id = scheduled__<run_after>For a daily DAG, these timestamps differ by one interval. After downgrading to 3.1.x,
DagModel.next_dagrunstill holds a value set by 3.2.0. The 3.1.x scheduler uses that value to generate a run_id that already exists in dag_run (created by 3.2.0 with the new format) leading to DB insertion error.This is made worse by the known session handling issue in _create_dag_runs_timetable (issue: #59120): one UniqueViolation messed the entire SQLAlchemy session, causing errors to cascade to every other DAG in the same scheduler batch. The result is a full scheduler crash loop affecting all DAGs, not just the one with the collision.
Proposed change
Add a data migration to the downgrade path of existing migration: 0107, so that on downgrade, we null out
next_dagrun,next_dagrun_create_after,next_dagrun_data_interval_start, andnext_dagrun_data_interval_endfor all DAGs.Nulling is intentional so that the scheduler as it already handles NULL can start recalculating these fields from the last completed run on its next cycle. After downgrade, the 3.1.x scheduler recalculates using 3.1.x semantics, generating run_ids that will not collide with existing runs.
What does this mean for 3.2 -> 3.2.1? (assuming we make this into 3.2.2)
For 3.2 and 3.2.1:
Anyone on these versions who needs to roll back to 3.1.x should manually run the sql workaround (null out the 4 fields) before starting the 3.1.x scheduler.
{pr_number}.significant.rst, in airflow-core/newsfragments. You can add this file in a follow-up commit after the PR is created so you know the PR number.