Fix scheduler UniqueViolation crash on downgrade from 3.2.0 to 3.1.x #65688

Merged: amoghrajesh merged 1 commit into apache:main from astronomer:db-insert-violation0fix on Apr 28, 2026

@amoghrajesh
Contributor


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)

What?

When downgrading from Airflow 3.2.0 to 3.1.x, the scheduler enters a crash loop due to UniqueViolation errors on the dag_run table, causing all DAG scheduling to stop.

Current behaviour

Airflow 3.2.0 (related PR: #59115) changed how run_id is generated for scheduled DAG runs:

  • 3.1.x: run_id = scheduled__<logical_date>
  • 3.2.0: run_id = scheduled__<run_after>

For a daily DAG, these timestamps differ by one interval. After downgrading to 3.1.x, DagModel.next_dagrun still holds a value set by 3.2.0. The 3.1.x scheduler uses that value to generate a run_id that already exists in dag_run (created by 3.2.0 with the new format), so the insert fails with a UniqueViolation.
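The collision can be sketched in a few lines. This is a simplified model, not Airflow's code: the helper names are hypothetical, the run_id strings only follow the two formats listed above, and the sketch assumes the stale next_dagrun equals the run_after of a run that 3.2.0 already created.

```python
from datetime import datetime, timedelta, timezone

INTERVAL = timedelta(days=1)  # daily DAG, as in the description above


def run_id_3_1(logical_date):
    # 3.1.x format: scheduled__<logical_date> (hypothetical helper)
    return f"scheduled__{logical_date.isoformat()}"


def run_id_3_2(run_after):
    # 3.2.0 format: scheduled__<run_after> (hypothetical helper)
    return f"scheduled__{run_after.isoformat()}"


# A run created by 3.2.0: for a daily DAG, run_after is one interval
# later than logical_date.
logical_date = datetime(2026, 4, 1, tzinfo=timezone.utc)
run_after = logical_date + INTERVAL
existing_run_ids = {run_id_3_2(run_after)}

# Assumption: the value left in DagModel.next_dagrun matches the
# run_after of that existing run.
stale_next_dagrun = run_after

# 3.1.x builds its run_id from the stale value and hits the existing row.
candidate = run_id_3_1(stale_next_dagrun)
collides = candidate in existing_run_ids
```

Here `collides` is true because both versions render the same timestamp into the same `scheduled__` string, even though they mean different things by it.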

This is made worse by the known session handling issue in _create_dag_runs_timetable (issue: #59120): a single UniqueViolation leaves the entire SQLAlchemy session in a failed state, so errors cascade to every other DAG in the same scheduler batch. The result is a full scheduler crash loop affecting all DAGs, not just the one with the collision.
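The cascade can be illustrated in miniature. The sketch below uses the standard-library sqlite3 module rather than SQLAlchemy and PostgreSQL, so it only models the general pattern: when a batch of inserts shares one transaction, one unique-constraint failure discards the whole batch unless each insert is isolated in a savepoint. The table layout and function names are hypothetical, and this is not the scheduler's code nor the fix tracked in #59120.

```python
import sqlite3


def create_runs(conn, run_ids, use_savepoints):
    """Insert run_ids in one transaction; return the ids that were committed."""
    created = []
    conn.execute("BEGIN")
    try:
        for run_id in run_ids:
            if use_savepoints:
                conn.execute("SAVEPOINT one_run")
            try:
                conn.execute("INSERT INTO dag_run (run_id) VALUES (?)", (run_id,))
                created.append(run_id)
            except sqlite3.IntegrityError:
                if not use_savepoints:
                    raise  # the shared transaction is abandoned below
                # Undo only the failed insert and keep the rest of the batch.
                conn.execute("ROLLBACK TO SAVEPOINT one_run")
            if use_savepoints:
                conn.execute("RELEASE SAVEPOINT one_run")
        conn.execute("COMMIT")
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")
        return []
    return created


def fresh_db():
    # isolation_level=None lets the sketch issue BEGIN/COMMIT explicitly.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE dag_run (run_id TEXT PRIMARY KEY)")
    conn.execute("INSERT INTO dag_run VALUES ('scheduled__dup')")
    return conn


batch = ["scheduled__a", "scheduled__dup", "scheduled__b"]
without_isolation = create_runs(fresh_db(), batch, use_savepoints=False)
with_isolation = create_runs(fresh_db(), batch, use_savepoints=True)
```

Without isolation the one duplicate costs the whole batch; with per-insert savepoints only the colliding run is skipped.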

Proposed change

Add a data migration to the downgrade path of the existing migration 0107, so that on downgrade next_dagrun, next_dagrun_create_after, next_dagrun_data_interval_start, and next_dagrun_data_interval_end are set to NULL for all DAGs.

Nulling is intentional: the scheduler already handles NULL in these fields by recalculating them from the last completed run on its next cycle. After the downgrade, the 3.1.x scheduler recalculates them using 3.1.x semantics, generating run_ids that will not collide with existing runs.

What does this mean for 3.2.0 and 3.2.1? (assuming this fix ships in 3.2.2)

For 3.2.0 and 3.2.1:

  • This fix is not included
  • The downgrade path is broken
  • Anyone who upgraded to one of these versions and rolls back to 3.1.x will hit the crash loop

Anyone on these versions who needs to roll back to 3.1.x should manually run the SQL workaround (null out the four fields) before starting the 3.1.x scheduler.
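The four-field reset described above, both as the downgrade step and as the manual workaround, can be sketched as follows. This is an illustration under assumptions, not the PR's migration code: the in-memory SQLite table is a trimmed-down, hypothetical stand-in for the dag table, and `reset_next_dagrun_fields` is an invented helper name. In an Alembic migration, a statement like this would typically be issued through `op.execute`.

```python
import sqlite3

# The four scheduling columns named in the PR description.
NEXT_DAGRUN_COLUMNS = (
    "next_dagrun",
    "next_dagrun_create_after",
    "next_dagrun_data_interval_start",
    "next_dagrun_data_interval_end",
)

# One UPDATE that nulls all four columns for every DAG.
RESET_SQL = "UPDATE dag SET " + ", ".join(f"{c} = NULL" for c in NEXT_DAGRUN_COLUMNS)


def reset_next_dagrun_fields(conn):
    """Null the next-run columns for all DAGs; return the number of rows touched."""
    cur = conn.execute(RESET_SQL)
    conn.commit()
    return cur.rowcount


# Hypothetical stand-in for the metadata database.
conn = sqlite3.connect(":memory:")
columns_ddl = ", ".join(f"{c} TEXT" for c in NEXT_DAGRUN_COLUMNS)
conn.execute(f"CREATE TABLE dag (dag_id TEXT PRIMARY KEY, {columns_ddl})")
stale = "2026-04-23T02:42:00+00:00"  # made-up value left behind by 3.2.0
conn.execute("INSERT INTO dag VALUES (?, ?, ?, ?, ?)", ("my_dag", stale, stale, stale, stale))

touched = reset_next_dagrun_fields(conn)
select_sql = "SELECT " + ", ".join(NEXT_DAGRUN_COLUMNS) + " FROM dag WHERE dag_id = ?"
row = conn.execute(select_sql, ("my_dag",)).fetchone()
```

Printing `RESET_SQL` shows the single statement involved; against a real deployment it would be run on the metadata database while the scheduler is stopped.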



@boring-cyborg boring-cyborg Bot added the area:db-migrations PRs with DB migration label Apr 22, 2026
@amoghrajesh amoghrajesh changed the title Fix scheduler UniqueViolation crash on downgrade from 3.2.0 to 3.1.x [dont merge] Fix scheduler UniqueViolation crash on downgrade from 3.2.0 to 3.1.x Apr 22, 2026
Contributor

Copilot AI left a comment


Pull request overview

Adds a downgrade-time data migration to prevent scheduler crash loops when rolling back from Airflow 3.2.0 to 3.1.x due to dag_run run_id collisions stemming from changed scheduled run_id semantics.

Changes:

  • Extend the 0107_3_2_0_add_partition_fields_to_dag migration’s downgrade() to NULL out DagModel next-run fields (next_dagrun*) so 3.1.x recomputes them using its own semantics.

@kaxil kaxil marked this pull request as draft April 22, 2026 20:44
@amoghrajesh
Contributor Author

Testing Report: Airflow 3.1.8 <-> 3.2.0 Upgrade/Downgrade Migration

Objective

Verify that Airflow survives a full upgrade (3.1.8 -> 3.2.0) and downgrade
(3.2.0 -> 3.1.8) cycle without data loss, and that the fix on the db-insert-violation0fix
branch correctly resets the dag table's scheduling state during downgrade.


Environment

| Property | Value |
| --- | --- |
| Branches | v3-1-test (3.1.8), v3-2-test (3.2.0) |
| Backend | PostgreSQL 14 |
| Project name | airflow-repro |
| Docker volume | airflow-repro-postgres14-db-volume |
| DAG | my_dag, schedule */2 * * * *, catchup=False, start_date=2026-04-23 |

Test Steps & Results

Step 1 — Baseline on 3.1.8

  • Started breeze on v3-1-test, unpaused my_dag
  • Let ~70 scheduled runs complete (every 2 min, 00:00 to 02:40)
  • Stopped airflow

Verification query:

SELECT run_id, logical_date, run_after, state
FROM dag_run
WHERE dag_id = 'my_dag'
ORDER BY logical_date DESC;

| Check | Result |
| --- | --- |
| run_id suffix == logical_date == run_after | Pass |
| All runs success | Pass |
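The first check in the table can be automated over the query result. The sketch below is an assumption about how such a check might look, not the script used for this report: the helper name is invented, and the two rows are hypothetical values shaped like (run_id, logical_date, run_after).

```python
from datetime import datetime, timezone

PREFIX = "scheduled__"


def run_is_consistent(run_id, logical_date, run_after):
    """True when the run_id suffix, logical_date and run_after are the same instant."""
    if not run_id.startswith(PREFIX):
        return False
    suffix = datetime.fromisoformat(run_id[len(PREFIX):])
    return suffix == logical_date == run_after


# Hypothetical rows, two minutes apart like the test DAG's schedule.
t0 = datetime(2026, 4, 23, 0, 0, tzinfo=timezone.utc)
t1 = datetime(2026, 4, 23, 0, 2, tzinfo=timezone.utc)
rows = [
    (f"{PREFIX}{t0.isoformat()}", t0, t0),
    (f"{PREFIX}{t1.isoformat()}", t1, t1),
]
all_consistent = all(run_is_consistent(*row) for row in rows)
```

Comparing parsed datetimes rather than raw strings avoids false mismatches when the database driver formats timestamps differently from the run_id.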

Step 2 — Upgrade to 3.2.0

  • Switched to v3-2-test, started breeze
  • db migrate ran automatically, upgrading schema to 1d6611b6ab7c (3.2.0 head)
  • Verified all 70+ runs survived the upgrade intact

| Check | Result |
| --- | --- |
| Row count unchanged after upgrade | Pass |
| run_id format unchanged for existing rows | Pass |

Step 3 — Downgrade back to 3.1.8 (with the downgrade migration fix)

From the 3.2.0 breeze shell:

airflow db downgrade -n 3.1.8

Downgrade completed cleanly through all 3.2.0 migrations.

Verification query — dag table after downgrade:

SELECT dag_id, next_dagrun, next_dagrun_create_after,
       next_dagrun_data_interval_start, next_dagrun_data_interval_end
FROM dag
WHERE dag_id = 'my_dag';

The fix intentionally NULLs scheduling fields during downgrade so 3.1.8 recomputes them
fresh rather than acting on stale 3.2.0 values.

| Field | Expected | Result |
| --- | --- | --- |
| next_dagrun | NULL | Pass |
| next_dagrun_create_after | NULL | Pass |
| next_dagrun_data_interval_start | NULL | Pass |
| next_dagrun_data_interval_end | NULL | Pass |

Step 4 — Restart on 3.1.8 with downgraded DB

  • Switched back to v3-1-test, started breeze (no --db-reset)
  • db migrate was a no-op — DB already at 509b94a1042d

The DAG processor log confirmed that the NULL fields were recomputed and handled correctly:

Setting next_dagrun for my_dag to 2026-04-23 02:10:00+00:00, run_after=2026-04-23 02:10:00+00:00

Verification queries:

-- All fields recomputed
SELECT dag_id, next_dagrun, next_dagrun_create_after,
       next_dagrun_data_interval_start, next_dagrun_data_interval_end
FROM dag
WHERE dag_id = 'my_dag';

-- All historical runs intact
SELECT run_id, logical_date, run_after, state
FROM dag_run
WHERE dag_id = 'my_dag'
ORDER BY logical_date DESC;

@amoghrajesh amoghrajesh self-assigned this Apr 23, 2026
@amoghrajesh amoghrajesh marked this pull request as ready for review April 23, 2026 11:08
@amoghrajesh amoghrajesh changed the title [dont merge] Fix scheduler UniqueViolation crash on downgrade from 3.2.0 to 3.1.x Fix scheduler UniqueViolation crash on downgrade from 3.2.0 to 3.1.x Apr 23, 2026
@amoghrajesh
Contributor Author

cc @vatsrahul1001 / @kaxil would love to get your reviews here too

@amoghrajesh amoghrajesh requested a review from kaxil April 24, 2026 12:05
@vatsrahul1001
Contributor

Test status on Astro

  1. Created a deployment with 3.1-14 and ran my_dag
  2. Applied the migration patch from this PR and upgraded to the 3.2 runtime (migrations were all fine)
  3. Rolled back to 3.1-14; no issues with the downgrade migration

cc: @amoghrajesh

@amoghrajesh
Contributor Author

Thanks for testing it @vatsrahul1001! Merging this one in.

@amoghrajesh amoghrajesh merged commit 6358968 into apache:main Apr 28, 2026
152 checks passed
@amoghrajesh amoghrajesh deleted the db-insert-violation0fix branch April 28, 2026 07:46
@amoghrajesh amoghrajesh added this to the Airflow 3.2.2 milestone Apr 28, 2026
amoghrajesh added a commit that referenced this pull request Apr 28, 2026
vatsrahul1001 pushed a commit that referenced this pull request May 20, 2026