Scheduler: max_consecutive_failed_dag_runs broken by NULL logical_date #65125

@Subham-KRLX

Description

Under which category would you file this issue?

Airflow Core

Apache Airflow version

3.2.0.dev0

What happened and how to reproduce it?

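In Airflow 3, logical_date is nullable for many run types. The scheduler's auto-pausing logic in dagrun.py relies on .order_by(DagRun.logical_date.desc()). That ordering is not portable once the column contains NULLs: PostgreSQL places NULLs first in a descending sort by default, MySQL places them last, and rows that share a NULL value have no defined order relative to each other. The query also does not separate manual runs from scheduled runs.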

Steps to Reproduce:

1. Set max_consecutive_failed_dag_runs=3.
2. Let a healthy scheduled run succeed (Run A).
3. Trigger three manual test runs that fail with logical_date=None (Runs B, C, D).

The result of order_by(logical_date.desc()) then depends on where the backend places NULLs. Where NULLs sort last in a descending order (the MySQL default), the old success (Run A) is returned in the "top 3", preventing the auto-pause.
Where NULLs sort first (the PostgreSQL default), the manual test failures "pollute" the count and pause the production schedule unnecessarily.
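
The two outcomes above can be illustrated without a database. The following is a simplified model only, not Airflow's code: the Run dataclass, last_n, and all_failed are hypothetical names, and the NULL placement is emulated in plain Python rather than taken from a real backend.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Run:
    """Hypothetical stand-in for a DagRun row (not the Airflow model)."""
    id: int
    run_type: str
    state: str
    logical_date: Optional[datetime]


runs = [
    Run(1, "scheduled", "success", datetime(2025, 1, 1)),  # Run A
    Run(2, "manual", "failed", None),                      # Run B
    Run(3, "manual", "failed", None),                      # Run C
    Run(4, "manual", "failed", None),                      # Run D
]


def last_n(runs, n, nulls_first):
    """Emulate ORDER BY logical_date DESC LIMIT n for a given NULL placement."""
    dated = sorted(
        (r for r in runs if r.logical_date is not None),
        key=lambda r: r.logical_date,
        reverse=True,
    )
    # SQL leaves the relative order of NULL rows unspecified; insertion
    # order is kept here only to make the example repeatable.
    undated = [r for r in runs if r.logical_date is None]
    ordered = undated + dated if nulls_first else dated + undated
    return ordered[:n]


def all_failed(selected, n):
    """True when at least n runs were selected and every one of them failed."""
    return len(selected) >= n and all(r.state == "failed" for r in selected)


nulls_first_result = all_failed(last_n(runs, 3, nulls_first=True), 3)
nulls_last_result = all_failed(last_n(runs, 3, nulls_first=False), 3)
print(nulls_first_result, nulls_last_result)  # True False
```

With NULLs first, Runs B, C, and D fill the window and the check reports three consecutive failures. With NULLs last, Run A occupies the first slot and the check never fires.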

What you think should happen instead?

Deterministic Ordering: The scheduler should use a stable ordering mechanism that accounts for nullable logical_date by using order_by(DagRun.logical_date.desc().nulls_last(), DagRun.id.desc()) or prioritizing the run_after column.
RunType Isolation: Evaluation of consecutive failures should be isolated by run_type (e.g., only scheduled runs should trigger a production auto-pause) to prevent manual test failures from impacting automated schedules.
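
A rough sketch of both proposals combined, using the standard-library sqlite3 module as a stand-in for the metadata database. The table layout, the last_n_failed helper, and the sample rows are assumptions made for illustration and do not mirror the real dag_run schema. The sketch takes the run_after variant of the ordering and assumes run_after is always populated, with id as a tie-breaker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, cut-down dag_run table; the real schema has many more columns.
conn.execute(
    """
    CREATE TABLE dag_run (
        id INTEGER PRIMARY KEY,
        dag_id TEXT NOT NULL,
        run_type TEXT NOT NULL,
        state TEXT NOT NULL,
        logical_date TEXT,          -- nullable, as in Airflow 3
        run_after TEXT NOT NULL     -- assumed to be always set
    )
    """
)
insert = (
    "INSERT INTO dag_run (dag_id, run_type, state, logical_date, run_after) "
    "VALUES (?, ?, ?, ?, ?)"
)
conn.executemany(
    insert,
    [
        ("example_dag", "scheduled", "success", "2025-01-01", "2025-01-01"),  # Run A
        ("example_dag", "manual", "failed", None, "2025-01-02"),              # Run B
        ("example_dag", "manual", "failed", None, "2025-01-03"),              # Run C
        ("example_dag", "manual", "failed", None, "2025-01-04"),              # Run D
    ],
)


def last_n_failed(conn, dag_id, n, run_type="scheduled"):
    """Check whether the n most recent runs of one run_type all failed."""
    rows = conn.execute(
        """
        SELECT state FROM dag_run
        WHERE dag_id = ? AND run_type = ?
        ORDER BY run_after DESC, id DESC   -- non-null key plus stable tie-breaker
        LIMIT ?
        """,
        (dag_id, run_type, n),
    ).fetchall()
    return len(rows) >= n and all(state == "failed" for (state,) in rows)


# Only one scheduled run exists so far, so manual failures cannot trigger a pause.
before = last_n_failed(conn, "example_dag", 3)

# Three genuine scheduled failures should trigger it.
conn.executemany(
    insert,
    [
        ("example_dag", "scheduled", "failed", "2025-01-05", "2025-01-05"),
        ("example_dag", "scheduled", "failed", "2025-01-06", "2025-01-06"),
        ("example_dag", "scheduled", "failed", "2025-01-07", "2025-01-07"),
    ],
)
after = last_n_failed(conn, "example_dag", 3)
print(before, after)  # False True
```

Filtering on run_type before applying the limit is what keeps Runs B, C, and D out of the window. Whether manual runs should be ignored entirely or counted separately is a design decision for the actual fix.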

The DagRun model in Airflow 3 was updated to make logical_date optional, but the logic in _check_last_n_dagruns_failed in airflow/models/dagrun.py was not updated to handle this change. Specifically, the query at line 868: .order_by(DagRun.logical_date.desc()) is a regression that was missed when similar fixes were applied in PR #47301. It incorrectly treats all run types as a single chronological sequence and uses an unstable sort on a nullable column.

Operating System

macOS

Deployment

Virtualenv installation

Apache Airflow Provider(s)

No response

Versions of Apache Airflow Providers

N/A

Official Helm Chart version

Not Applicable

Kubernetes Version

N/A

Helm Chart configuration

N/A

Docker Image customizations

N/A

Anything else?

This issue was identified while researching the impact of nullable logical_date on core scheduler stability in Airflow 3. It appears to be a regression that was missed when similar fixes for nullable dates were applied elsewhere (such as in PR #47301 for get_previous_scheduled_dagrun). This problem occurs every time manual and scheduled runs are mixed in the history of a DAG with auto-pausing enabled.

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Assignees

No one assigned

    Labels

    • area:Scheduler - including HA (high availability) scheduler
    • area:core
    • kind:bug - This is a clearly a bug
    • needs-triage - label for new issues that we didn't triage yet
    • priority:high - High priority bug that should be patched quickly but does not require immediate new release
