Feature/add max new dagruns to schedule #64294

Open
Nataneljpwd wants to merge 12 commits into apache:main from Nataneljpwd:feature/add-max-new-dagruns-to-schedule
@Nataneljpwd

When new dagruns are created in bulk (e.g. with TriggerDagRunOperator), the scheduler can struggle with the volume and starve other dagruns.

This is due to the sort order in get_running_dag_runs_to_examine, which orders by last scheduling decision with nulls first. If a lot of new dagruns are created, the scheduler examines them first. When the dags have a lot of tasks (hundreds to tens of thousands), this can stall the scheduler, because it has to both examine a lot of dagruns and create new tasks for them.

When we tried tuning max_dagruns_per_loop_to_schedule, we got either starvation of other dagruns or the scheduler being reset, because it did not return a heartbeat for a long time and failed the readiness probe.

To fix this, this PR adds a new configuration option, max_new_dagruns_per_loop_to_schedule. It helps when a lot of new dagruns are created in large batches at the same time: the scheduler can keep examining existing dagruns (so they are not starved and do not time out with no running or scheduled tasks) while still creating and managing the new dagruns.
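
To make the starvation scenario concrete, here is a minimal in-memory sketch of the two selection strategies. It is not the PR's SQL query: `Run`, `select_runs`, and the rule that the cap on new runs is carved out of the total per-loop budget are all assumptions made for illustration. The only behavior taken from the PR is that a value of 0 keeps the existing nulls-first ordering.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass(frozen=True)
class Run:
    """Hypothetical stand-in for a running DagRun row."""

    run_id: str
    last_scheduling_decision: Optional[datetime]  # None means never examined


def select_runs(runs: List[Run], total_limit: int, new_limit: int) -> List[Run]:
    """Pick the runs to examine in one loop (illustrative only)."""
    new_runs = [r for r in runs if r.last_scheduling_decision is None]
    old_runs = sorted(
        (r for r in runs if r.last_scheduling_decision is not None),
        key=lambda r: r.last_scheduling_decision,
    )
    if new_limit <= 0:
        # Cap disabled: nulls first, then the least recently examined runs.
        return (new_runs + old_runs)[:total_limit]
    # Assumption: the cap on new runs is taken out of the total budget,
    # and the remainder goes to previously examined runs.
    picked_new = new_runs[: min(new_limit, total_limit)]
    picked_old = old_runs[: total_limit - len(picked_new)]
    return picked_old + picked_new


base = datetime(2024, 1, 1)
burst = [Run(f"new_{i}", None) for i in range(50)]
existing = [Run(f"old_{i}", base + timedelta(minutes=i)) for i in range(20)]

legacy = select_runs(burst + existing, total_limit=20, new_limit=0)
capped = select_runs(burst + existing, total_limit=20, new_limit=5)
```

With a burst of 50 new runs and a budget of 20, `legacy` contains only new runs, so the 20 existing runs are never examined in that loop. With a cap of 5, `capped` contains 5 new runs and 15 existing ones.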

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
  • No

@Nataneljpwd marked this pull request as draft March 27, 2026 12:28
@Nataneljpwd marked this pull request as ready for review March 27, 2026 17:24

Copilot AI left a comment

Pull request overview

This PR introduces a scheduler tuning knob to limit how many new (never-before-examined) running DagRuns are considered per scheduling loop, to reduce starvation/slowdown when large batches of DagRuns are created at once.

Changes:

  • Add scheduler.max_new_dagruns_per_loop_to_schedule config (default 0) and plumb it into DagRun selection.
  • Update DagRun.get_running_dag_runs_to_examine() to optionally split selection into “previously examined” vs “new” DagRuns.
  • Add/adjust unit tests to cover the new selection behavior.
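
Airflow options can generally be overridden through environment variables named `AIRFLOW__{SECTION}__{KEY}`. As a small sketch, assuming the new option lands under the `scheduler` section as listed above, the variable name can be derived as follows. The helper `airflow_env_var` and the value `"10"` are hypothetical and only for illustration.

```python
def airflow_env_var(section: str, key: str) -> str:
    """Build the AIRFLOW__{SECTION}__{KEY} environment variable name."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


name = airflow_env_var("scheduler", "max_new_dagruns_per_loop_to_schedule")

# Hypothetical override that a deployment might export for the scheduler.
overrides = {name: "10"}
```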

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| airflow-core/src/airflow/models/dagrun.py | Adds config-backed limit and changes running DagRun selection logic to optionally fetch “old” and “new” runs separately. |
| airflow-core/src/airflow/config_templates/config.yml | Documents the new scheduler configuration option. |
| airflow-core/tests/unit/models/test_dagrun.py | Adds tests for the new DagRun selection behavior and updates an existing test to handle the new return type. |

Comment on lines +997 to +1051
        self, session, dag_maker
    ):

        DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE = 0

        def create_dagruns(
            last_scheduling_decision: datetime.datetime | None = None,
            count: int = 20,
        ):
            dagrun = dag_maker.create_dagrun(
                run_type=DagRunType.SCHEDULED,
                state=State.RUNNING,
                run_after=datetime.datetime(2024, 1, 1),
            )
            dagrun.last_scheduling_decision = last_scheduling_decision
            session.merge(dagrun)
            for _ in range(count - 1):
                dagrun = dag_maker.create_dagrun_after(
                    dagrun,
                    run_type=DagRunType.SCHEDULED,
                    state=State.RUNNING,
                    run_after=datetime.datetime(2024, 1, 1),
                )

                dagrun.last_scheduling_decision = last_scheduling_decision
                session.merge(dagrun)

        with dag_maker(
            dag_id="dummy_dag",
            schedule=datetime.timedelta(days=1),
            start_date=datetime.datetime(2024, 1, 1),
            session=session,
        ):
            EmptyOperator(task_id="dummy_task")

        create_dagruns(None, 10)

        with dag_maker(
            dag_id="dummy_dag2",
            schedule=datetime.timedelta(days=1),
            start_date=datetime.datetime(2024, 1, 1),
            session=session,
        ):
            EmptyOperator(task_id="dummy_task2")

        create_dagruns(func.now(), 20)

        session.flush()

        dagruns = list(DagRun.get_running_dag_runs_to_examine(session=session))

        assert len([dagrun for dagrun in dagruns if dagrun.last_scheduling_decision is None]) == 10

        assert len([dagrun for dagrun in dagruns if dagrun.last_scheduling_decision is not None]) == 10

Copilot AI Apr 2, 2026

This test name implies it covers the "< 0" configuration path, but it sets DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE = 0, so the warning/clamping branch is never exercised. Set a negative value here (e.g. -1) and assert the expected warning (via caplog) to actually cover the behavior.

Suggested change
-        self, session, dag_maker
-    ):
-        DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE = 0
-        def create_dagruns(
-            last_scheduling_decision: datetime.datetime | None = None,
-            count: int = 20,
-        ):
-            dagrun = dag_maker.create_dagrun(
-                run_type=DagRunType.SCHEDULED,
-                state=State.RUNNING,
-                run_after=datetime.datetime(2024, 1, 1),
-            )
-            dagrun.last_scheduling_decision = last_scheduling_decision
-            session.merge(dagrun)
-            for _ in range(count - 1):
-                dagrun = dag_maker.create_dagrun_after(
-                    dagrun,
-                    run_type=DagRunType.SCHEDULED,
-                    state=State.RUNNING,
-                    run_after=datetime.datetime(2024, 1, 1),
-                )
-                dagrun.last_scheduling_decision = last_scheduling_decision
-                session.merge(dagrun)
-        with dag_maker(
-            dag_id="dummy_dag",
-            schedule=datetime.timedelta(days=1),
-            start_date=datetime.datetime(2024, 1, 1),
-            session=session,
-        ):
-            EmptyOperator(task_id="dummy_task")
-        create_dagruns(None, 10)
-        with dag_maker(
-            dag_id="dummy_dag2",
-            schedule=datetime.timedelta(days=1),
-            start_date=datetime.datetime(2024, 1, 1),
-            session=session,
-        ):
-            EmptyOperator(task_id="dummy_task2")
-        create_dagruns(func.now(), 20)
-        session.flush()
-        dagruns = list(DagRun.get_running_dag_runs_to_examine(session=session))
-        assert len([dagrun for dagrun in dagruns if dagrun.last_scheduling_decision is None]) == 10
-        assert len([dagrun for dagrun in dagruns if dagrun.last_scheduling_decision is not None]) == 10
+        self, session, dag_maker, caplog
+    ):
+        original_value = DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE
+        try:
+            # Set a negative value to exercise the "< 0" clamping and warning path.
+            DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE = -1
+            # Capture warnings emitted when handling the negative configuration value.
+            caplog.set_level("WARNING", logger="airflow.models.dagrun")
+            def create_dagruns(
+                last_scheduling_decision: datetime.datetime | None = None,
+                count: int = 20,
+            ):
+                dagrun = dag_maker.create_dagrun(
+                    run_type=DagRunType.SCHEDULED,
+                    state=State.RUNNING,
+                    run_after=datetime.datetime(2024, 1, 1),
+                )
+                dagrun.last_scheduling_decision = last_scheduling_decision
+                session.merge(dagrun)
+                for _ in range(count - 1):
+                    dagrun = dag_maker.create_dagrun_after(
+                        dagrun,
+                        run_type=DagRunType.SCHEDULED,
+                        state=State.RUNNING,
+                        run_after=datetime.datetime(2024, 1, 1),
+                    )
+                    dagrun.last_scheduling_decision = last_scheduling_decision
+                    session.merge(dagrun)
+            with dag_maker(
+                dag_id="dummy_dag",
+                schedule=datetime.timedelta(days=1),
+                start_date=datetime.datetime(2024, 1, 1),
+                session=session,
+            ):
+                EmptyOperator(task_id="dummy_task")
+            create_dagruns(None, 10)
+            with dag_maker(
+                dag_id="dummy_dag2",
+                schedule=datetime.timedelta(days=1),
+                start_date=datetime.datetime(2024, 1, 1),
+                session=session,
+            ):
+                EmptyOperator(task_id="dummy_task2")
+            create_dagruns(func.now(), 20)
+            session.flush()
+            dagruns = list(DagRun.get_running_dag_runs_to_examine(session=session))
+            # Verify that the negative value was ignored/clamped by checking for the warning.
+            assert any(
+                "DEFAULT_NEW_DAGRUNS_TO_EXAMINE" in record.getMessage()
+                and ("negative" in record.getMessage() or "< 0" in record.getMessage())
+                for record in caplog.records
+            )
+            assert len([dagrun for dagrun in dagruns if dagrun.last_scheduling_decision is None]) == 10
+            assert len([dagrun for dagrun in dagruns if dagrun.last_scheduling_decision is not None]) == 10
+        finally:
+            DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE = original_value

Copilot uses AI. Check for mistakes.
Comment on lines +997 to +1000
        self, session, dag_maker
    ):

        DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE = 0

Copilot AI Apr 2, 2026

These tests mutate the class-level DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE but never restore it, which can make later tests order-dependent. Please use monkeypatch.setattr(...) (or save/restore the original value) so the change is scoped to the test.

Suggested change
-        self, session, dag_maker
-    ):
-        DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE = 0
+        self, session, dag_maker, monkeypatch
+    ):
+        monkeypatch.setattr(DagRun, "DEFAULT_NEW_DAGRUNS_TO_EXAMINE", 0)
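
The leak described in this comment can be reproduced without Airflow or pytest. The sketch below uses `unittest.mock.patch.object` from the standard library as an analogue of `monkeypatch.setattr`; `FakeDagRun`, `leaky_test`, and `scoped_test` are hypothetical names for illustration, not code from the PR.

```python
from unittest import mock


class FakeDagRun:
    """Hypothetical stand-in for a model class with a class-level default."""

    DEFAULT_NEW_DAGRUNS_TO_EXAMINE = 0


def leaky_test() -> None:
    # Plain assignment: the new value outlives the test and leaks into
    # whichever test happens to run next.
    FakeDagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE = 10


def scoped_test() -> int:
    # patch.object restores the original value when the block exits,
    # even if the body raises.
    with mock.patch.object(FakeDagRun, "DEFAULT_NEW_DAGRUNS_TO_EXAMINE", 10):
        return FakeDagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE
```

After `scoped_test()` returns, the class attribute is back to its original value, whereas after `leaky_test()` it stays at 10 for the rest of the test session.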

Comment on lines +1052 to +1054
    def test_get_running_dag_runs_with_max_new_dagruns_to_examine(self, session, dag_maker):

        DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE = 10

Copilot AI Apr 2, 2026

Same issue here: DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE is modified without being restored, which can leak state across tests. Please scope this via monkeypatch or restore the previous value in a finally block.

Suggested change
-    def test_get_running_dag_runs_with_max_new_dagruns_to_examine(self, session, dag_maker):
-        DagRun.DEFAULT_NEW_DAGRUNS_TO_EXAMINE = 10
+    def test_get_running_dag_runs_with_max_new_dagruns_to_examine(self, session, dag_maker, monkeypatch):
+        monkeypatch.setattr(DagRun, "DEFAULT_NEW_DAGRUNS_TO_EXAMINE", 10)

Comment on lines 34 to 44
from sqlalchemy import (
    JSON,
    Enum,
    ForeignKey,
    ForeignKeyConstraint,
    Index,
    Integer,
    PrimaryKeyConstraint,
    SQLColumnExpression,
    String,
    Text,

Copilot AI Apr 2, 2026

SQLColumnExpression is only used for typing in _get_dagrun_query, and this file already keeps most SQLAlchemy typing-only imports under TYPE_CHECKING. Consider moving this import under TYPE_CHECKING (or using an already-imported typing like ColumnElement[Any]) to avoid adding an extra runtime dependency/import surface.
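
A minimal sketch of the typing-only import pattern suggested here. The function `describe_filter` is hypothetical and does not reflect the real signature of `_get_dagrun_query`. Because the import sits under `TYPE_CHECKING` and the annotation is a string, the module runs without importing the name at runtime.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Evaluated only by static type checkers, never at runtime.
    from sqlalchemy import SQLColumnExpression


def describe_filter(condition: "SQLColumnExpression[Any]") -> str:
    """Hypothetical helper whose annotation needs the typing-only import."""
    # The string annotation is not evaluated when the function is defined,
    # so SQLColumnExpression does not need to exist at runtime.
    return f"WHERE {condition}"
```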

Comment on lines +671 to +676
        new_dagruns_to_examine = cls.DEFAULT_NEW_DAGRUNS_TO_EXAMINE
        dagruns_to_examine = cls.DEFAULT_DAGRUNS_TO_EXAMINE

        if new_dagruns_to_examine < 0:
            log.warning("'max_new_dagruns_per_loop_to_schedule' is smaller than 0, ignoring configuration")
            new_dagruns_to_examine = 0

Copilot AI Apr 2, 2026

If max_new_dagruns_per_loop_to_schedule is configured as a negative value, this warning will be emitted on every scheduler loop, potentially spamming logs. Consider clamping/validating the config once when DEFAULT_NEW_DAGRUNS_TO_EXAMINE is initialized (and logging once), instead of warning on every call.
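
One way to validate once rather than on every call is to clamp the value where the class attribute is initialized. The sketch below assumes a hypothetical helper `clamp_non_negative` and a stand-in class `FakeDagRun`, and passes a literal `-1` in place of the real config lookup; the warning text mirrors the one in the diff above.

```python
import logging

log = logging.getLogger("example.dagrun_config")  # hypothetical logger name


def clamp_non_negative(option_name: str, raw_value: int) -> int:
    """Return raw_value, or 0 with a single warning if it is negative."""
    if raw_value < 0:
        log.warning("'%s' is smaller than 0, ignoring configuration", option_name)
        return 0
    return raw_value


class FakeDagRun:
    """Hypothetical stand-in for the model class."""

    # Evaluated once, when the class body runs, so a misconfigured value
    # produces one warning instead of one per scheduler loop.
    DEFAULT_NEW_DAGRUNS_TO_EXAMINE = clamp_non_negative(
        "max_new_dagruns_per_loop_to_schedule", -1
    )

    @classmethod
    def new_dagruns_budget(cls) -> int:
        # The per-loop path reads the already validated value and never logs.
        return cls.DEFAULT_NEW_DAGRUNS_TO_EXAMINE
```

A trade-off of this design is that tests which override the class attribute directly bypass the validation, so the negative-value test would need to call the helper (or reload the config) rather than assign `-1` to the attribute.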

@potiuk added the "ready for maintainer review" label (Set after triaging when all criteria pass.) on Apr 2, 2026
@eladkal added this to the Airflow 3.2.1 milestone on Apr 9, 2026
@eladkal added the "backport-to-v3-2-test" label (Mark PR with this label to backport to v3-2-test branch) on Apr 9, 2026

Labels

  • area:ConfigTemplates
  • backport-to-v3-2-test: Mark PR with this label to backport to v3-2-test branch
  • ready for maintainer review: Set after triaging when all criteria pass.

4 participants