
feat(metrics): emit dagrun.deadlocked counter on task-deadlock detection #66819

Open
1fanwang wants to merge 4 commits into apache:main from 1fanwang:dagrun-deadlocked-metric

1fanwang commented May 12, 2026

Fix for the missing metric called out in #66818. On our LinkedIn DI cluster we have no signal today when DagRun.update_state detects an all-tasks-deadlocked Dag run — the case is logged but not metered, so we don't alert until a user reports it. This PR emits a dagrun.deadlocked counter on the existing detection path; no behaviour change otherwise.

Problem

DagRun.update_state() already detects the task-deadlock case — when every unfinished task is unrunnable — logs an error, and notifies the state-changed listeners with msg="all_tasks_deadlocked". It does not emit a Stats counter, so operators who want to alert on deadlock-induced failures end up grepping scheduler logs or scraping state-change notifications. There's no first-class signal alongside zombies.zombie_unfinished_run_failure_count or the executor-event failure counters.

Fix

Add stats.incr("dagrun.deadlocked", tags={"dag_id": self.dag_id, "run_type": self.run_type}) at the existing log + notify call site in DagRun.update_state, and register the new counter in the observability metrics template.
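A minimal sketch of the emission described above, using stand-in classes rather than Airflow's real DagRun and stats objects. `RecordingStats`, `SketchDagRun`, and the `unfinished` / `runnable` parameters are hypothetical names introduced for illustration; only the metric name and the tag keys come from this PR.

```python
class RecordingStats:
    """Hypothetical stand-in for a stats client; it only records incr() calls."""

    def __init__(self):
        self.calls = []

    def incr(self, name, tags=None):
        self.calls.append((name, dict(tags or {})))


class SketchDagRun:
    """Hypothetical, heavily simplified model of the deadlock branch."""

    def __init__(self, dag_id, run_type, stats):
        self.dag_id = dag_id
        self.run_type = run_type
        self.stats = stats
        self.state = "running"

    def update_state(self, unfinished, runnable):
        # Deadlock: work remains, but none of it can be scheduled.
        if unfinished and not runnable:
            self.state = "failed"
            # The one new line: meter the case at the existing detection site.
            self.stats.incr(
                "dagrun.deadlocked",
                tags={"dag_id": self.dag_id, "run_type": self.run_type},
            )
        return self.state


stats = RecordingStats()
run = SketchDagRun("test_dag", "manual", stats)
run.update_state(unfinished=["downstream"], runnable=[])
```

Keeping the counter inside the same branch that logs and notifies means the metric cannot drift from the detection logic: it fires exactly when the run is marked failed for deadlock.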

Tests

Added test_dagrun_deadlock_emits_stats_counter in airflow-core/tests/unit/models/test_dagrun.py. Mirrors the existing test_dagrun_deadlock fixture (invalid trigger_rule to force the deadlock branch), mocks stats.incr, and asserts the call with the expected name and tags.

Reproducer

The new test airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_deadlock_emits_stats_counter builds a two-task DagRun whose downstream task has an invalid trigger_rule, drives update_state into the deadlock branch, and asserts stats.incr was called with "dagrun.deadlocked" plus the dag_id and run_type tags.

Without the dagrun.py change (reverted to upstream/main), the test fails on the mock assertion:

FAILED airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_deadlock_emits_stats_counter
AssertionError: incr('dagrun.deadlocked', tags={'dag_id': 'test_dag', 'run_type': <DagRunType.MANUAL: 'manual'>}) call not found

With the stats.incr call restored, the test passes:

PASSED airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_deadlock_emits_stats_counter
======================== 1 passed, 1 warning in 15.88s =========================

The deadlock branch is reached either way: the log line `Task deadlock (no runnable tasks); marking run ... failed` appears in both runs. This confirms that the FAILED→PASSED transition is driven by the new stats.incr line, not by a difference in which code path executes.
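The FAILED→PASSED behaviour can be illustrated with a small self-contained sketch built on `unittest.mock`. It does not import Airflow; `deadlock_branch` and its `emit` flag are hypothetical stand-ins for the code path with and without the new line, and the use of `assert_any_call` is an assumption inferred from the "call not found" wording in the failure message.

```python
from unittest import mock


def deadlock_branch(stats, dag_id, run_type, emit):
    """Hypothetical stand-in for the deadlock branch.

    `emit=False` models upstream/main (no counter); `emit=True` models this PR.
    The branch runs and fails the run in both cases.
    """
    if emit:
        stats.incr("dagrun.deadlocked", tags={"dag_id": dag_id, "run_type": run_type})
    return "failed"


def counter_was_emitted(emit):
    """Run the branch against a mocked stats client and check the expected call."""
    stats = mock.MagicMock()
    state = deadlock_branch(stats, "test_dag", "manual", emit)
    try:
        stats.incr.assert_any_call(
            "dagrun.deadlocked",
            tags={"dag_id": "test_dag", "run_type": "manual"},
        )
    except AssertionError:
        # Same kind of failure as the reverted run: the call was never made.
        return state, False
    return state, True
```

In both cases the returned state is the same, so only the presence of the `stats.incr` call decides whether the mock assertion holds.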

Closes #66818

`DagRun.update_state` already detects the all-tasks-unfinished-but-none-schedulable
case, logs an error, and notifies state-changed. Add a `dagrun.deadlocked` Stats
counter at the same call site, tagged with `dag_id` and `run_type`, so existing
statsd / OTel pipelines can chart deadlock rate without scraping logs.

Register the metric in the observability template and add a focused unit test
that mocks `stats.incr` and asserts emission when the deadlock branch fires.

Closes apache#66818

Signed-off-by: 1fanwang <1fannnw@gmail.com>
Signed-off-by: 1fanwang <1fannnw@gmail.com>
…heck

`unrunnable` is not in the English dictionary the docs spell-check uses,
which failed the `Build documentation (--spellcheck-only)` job. Reword to
"all unfinished tasks were hung", matching the reviewer's suggestion.

Signed-off-by: 1fanwang <1fannnw@gmail.com>
1fanwang requested a review from Srabasti on May 13, 2026 07:42
potiuk added the "ready for maintainer review" label (set after triaging when all criteria pass) on May 19, 2026

Development

Successfully merging this pull request may close these issues.

Emit Stats counter when DagRun.update_state detects task deadlock
