feat(metrics): emit dagrun.deadlocked counter on task-deadlock detection #66819
Open
1fanwang wants to merge 4 commits into
`DagRun.update_state` already detects the all-tasks-unfinished-but-none-schedulable case, logs an error, and notifies state-changed.

Add a `dagrun.deadlocked` Stats counter at the same call site, tagged with `dag_id` and `run_type`, so existing statsd / OTel pipelines can chart deadlock rate without scraping logs.

Register the metric in the observability template and add a focused unit test that mocks `stats.incr` and asserts emission when the deadlock branch fires.

Closes apache#66818

Signed-off-by: 1fanwang <1fannnw@gmail.com>
Srabasti
suggested changes
May 13, 2026
…heck `unrunnable` is not in the English dictionary the docs spell-check uses, which failed the `Build documentation (--spellcheck-only)` job. Reword to "all unfinished tasks were hung", matching the reviewer's suggestion. Signed-off-by: 1fanwang <1fannnw@gmail.com>
Fix for the missing metric called out in #66818. On our LinkedIn DI cluster we have no signal today when `DagRun.update_state` detects an all-tasks-deadlocked Dag run: the case is logged but not metered, so we don't alert until a user reports it. This PR emits a `dagrun.deadlocked` counter on the existing detection path; no behaviour change otherwise.

Problem
`DagRun.update_state()` already detects the task-deadlock case (every unfinished task is unrunnable), logs an error, and notifies the state-changed listeners with `msg="all_tasks_deadlocked"`. It does not emit a Stats counter, so operators who want to alert on deadlock-induced failures end up grepping scheduler logs or scraping state-change notifications. There's no first-class signal alongside `zombies.zombie_unfinished_run_failure_count` or the executor-event failure counters.

Fix
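As a rough, self-contained sketch of where the counter sits on the detection path described above: `RecordingStats` and `FakeDagRun` are hypothetical stand-ins, not Airflow's classes, and the state handling is reduced to the deadlock check alone. Only the metric name and the `dag_id` / `run_type` tags come from this PR.

```python
from dataclasses import dataclass, field


@dataclass
class RecordingStats:
    """Hypothetical stand-in for a statsd-style client; records calls instead of sending them."""

    calls: list = field(default_factory=list)

    def incr(self, name, count=1, *, tags=None):
        self.calls.append((name, count, dict(tags or {})))


@dataclass
class FakeDagRun:
    """Hypothetical, heavily simplified model of a Dag run (not Airflow's DagRun)."""

    dag_id: str
    run_type: str
    unfinished: list   # task ids that have not finished yet
    schedulable: list  # subset of `unfinished` that can still be scheduled
    state: str = "running"

    def update_state(self, stats):
        # Deadlock: work remains, but none of it can ever be scheduled.
        if self.unfinished and not self.schedulable:
            self.state = "failed"
            # The new signal: one counter increment, tagged for per-Dag charts.
            stats.incr(
                "dagrun.deadlocked",
                tags={"dag_id": self.dag_id, "run_type": self.run_type},
            )
        elif not self.unfinished:
            self.state = "success"
        return self.state


stats = RecordingStats()
run = FakeDagRun("example_dag", "manual", unfinished=["b"], schedulable=[])
run.update_state(stats)
print(run.state, stats.calls)
```

Keeping the increment inside the same branch that marks the run failed means the counter can only move when the deadlock path actually executes.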
Add `stats.incr("dagrun.deadlocked", tags={"dag_id": self.dag_id, "run_type": self.run_type})` at the existing log + notify call site in `DagRun.update_state`, and register the new counter in the observability metrics template.

Tests
Added `test_dagrun_deadlock_emits_stats_counter` in `airflow-core/tests/unit/models/test_dagrun.py`. It mirrors the existing `test_dagrun_deadlock` fixture (invalid `trigger_rule` to force the deadlock branch), mocks `stats.incr`, and asserts the call with the expected name and tags.

Reproducer
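The mocking pattern described under Tests can be sketched in isolation with `unittest.mock`. This is not the test from the PR: `StatsClient`, `fail_deadlocked_run`, and the `emit` flag are hypothetical stand-ins, and the exact assertion helper the real test uses is an assumption.

```python
from unittest import mock


class StatsClient:
    """Hypothetical stand-in for the metrics client; a real one would send to statsd/OTel."""

    def incr(self, name, count=1, *, tags=None):
        pass


stats = StatsClient()


def fail_deadlocked_run(dag_id, run_type, emit=True):
    """Hypothetical stand-in for the deadlock branch; emit=False mimics code without the counter."""
    if emit:
        stats.incr("dagrun.deadlocked", tags={"dag_id": dag_id, "run_type": run_type})
    return "failed"


def check_counter_emitted(emit):
    # Patch incr so nothing is sent and the call can be inspected afterwards.
    with mock.patch.object(stats, "incr") as mock_incr:
        state = fail_deadlocked_run("test_dag", "manual", emit=emit)
    # The run fails either way; only the metric emission differs.
    assert state == "failed"
    mock_incr.assert_called_once_with(
        "dagrun.deadlocked", tags={"dag_id": "test_dag", "run_type": "manual"}
    )


check_counter_emitted(emit=True)  # passes: one call with the expected name and tags
```

Calling `check_counter_emitted(emit=False)` raises `AssertionError` from the mock while the returned state stays `"failed"`, so a failure of this check points at the missing counter rather than at a different code path.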
The new test `airflow-core/tests/unit/models/test_dagrun.py::TestDagRun::test_dagrun_deadlock_emits_stats_counter` builds a two-task DagRun whose downstream task has an invalid `trigger_rule`, drives `update_state` into the deadlock branch, and asserts `stats.incr` was called with `"dagrun.deadlocked"` plus the `dag_id` and `run_type` tags.

Without the `dagrun.py` change (reverted to `upstream/main`), the test fails on the mock assertion. With the `stats.incr` call restored, the test passes.

The deadlock branch is reached either way: `Task deadlock (no runnable tasks); marking run ... failed` appears in both runs, confirming the FAILED→PASSED transition is driven by the new `stats.incr` line, not by a difference in which code path executes.

Closes #66818