feat: add scheduler observability metrics #68068
Open
safaehar wants to merge 2 commits into
Emit three new metrics from SchedulerJobRunner to improve visibility
into scheduler health:
- scheduler.loop_exceptions{exception_class}: counter incremented when
  the scheduler loop exits with an unhandled exception, tagged with the
  exception class to aid triaging crash loops.
- scheduler.executor_events.batch_size (gauge) and
  scheduler.executor_events.processed (counter): emitted on each
  successful call to process_executor_events with the event count.
  scheduler.executor_events.failed{reason} is incremented instead when
  the call raises, tagged with the exception class.
  The original body is extracted into _process_executor_events_core so
  process_executor_events can act as a thin metrics wrapper.
- scheduler.zombies.detected{reason}: counter incremented when zombie
  task instances are detected, tagged with the detection path:
  - heartbeat_timeout: task exceeded the heartbeat threshold
  - adopt_failure: orphaned task could not be re-adopted and was reset
These metrics are already running in production via local monkey-patches
at Datadog; this commit contributes them natively upstream so the
patches can eventually be removed.
… metric

Three existing tests used mock_stats.incr.assert_not_called() to verify that no error/mismatch metrics were emitted in requeued-TI and stale-success scenarios. The new executor_events.processed counter now fires unconditionally for every _process_executor_events call, so those blanket assertions became too broad. Replace each with an assertion that permits the expected counter while still verifying that no scheduler.tasks.killed_externally metric fired.
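The narrowed assertion described above could look roughly like the following. This is a minimal sketch, not the code from the PR: the helper `emitted_counter_names` is hypothetical, and the single `incr` call only simulates what the scheduler is assumed to emit in a requeued-TI scenario.

```python
from unittest import mock

# Stand-in for the patched Stats object used in the scheduler tests.
mock_stats = mock.MagicMock()

# Simulated emission: the new counter now fires on every call (assumed arguments).
mock_stats.incr("scheduler.executor_events.processed", 1)


def emitted_counter_names(stats_mock):
    """Hypothetical helper: names passed positionally to stats_mock.incr."""
    return [call.args[0] for call in stats_mock.incr.call_args_list]


names = emitted_counter_names(mock_stats)

# Too broad now: mock_stats.incr.assert_not_called() would fail here.
# Narrower check: the expected counter is allowed, the error metric is not.
assert "scheduler.tasks.killed_externally" not in names
```

Checking the recorded call names keeps the test focused on the metric it cares about, so adding further unconditional counters later does not break it.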
Emit three new metrics from `SchedulerJobRunner` to improve visibility into scheduler health. These metrics have been running in production at Datadog via local monkey-patches and are contributed upstream so the patches can eventually be removed.
New metrics
- `scheduler.loop_exceptions{exception_class}`: counter incremented when the scheduler loop exits with an unhandled exception, tagged with the exception class. Emitted from the existing `except Exception` block in `_execute`.
- `scheduler.executor_events.batch_size` (gauge) and `scheduler.executor_events.processed` (counter): emitted on every successful call to `process_executor_events` with the size of the event batch. `scheduler.executor_events.failed{reason}` is incremented instead when the call raises, tagged with the exception class.
- `scheduler.zombies.detected{reason}`: counter incremented when zombie task instances are detected, tagged by detection path:
  - `heartbeat_timeout`: task exceeded the heartbeat threshold
  - `adopt_failure`: orphaned task could not be re-adopted and was reset
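The thin-wrapper pattern behind the executor-event metrics can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `FakeStats` is a hypothetical in-memory stand-in for a StatsD-style client, the core function is a placeholder, and incrementing `processed` by the batch size is one reading of "with the size of the event batch".

```python
from collections import Counter


class FakeStats:
    """Hypothetical in-memory stand-in for a StatsD-style client."""

    def __init__(self):
        self.counters = Counter()
        self.gauges = {}

    def incr(self, stat, count=1, tags=None):
        # Key on the metric name plus its sorted tags so tagged series stay separate.
        key = (stat, tuple(sorted((tags or {}).items())))
        self.counters[key] += count

    def gauge(self, stat, value, tags=None):
        self.gauges[stat] = value


stats = FakeStats()


def _process_executor_events_core(events):
    """Placeholder for the extracted original body; returns the event count."""
    if any(event is None for event in events):
        raise ValueError("malformed executor event")
    return len(events)


def process_executor_events(events):
    """Thin metrics wrapper: delegate to the core, then record the outcome."""
    try:
        count = _process_executor_events_core(events)
    except Exception as exc:
        # Failure path: tag with the exception class and re-raise unchanged.
        stats.incr(
            "scheduler.executor_events.failed",
            tags={"reason": type(exc).__name__},
        )
        raise
    # Success path only: the gauge holds the latest batch size and the
    # counter accumulates processed events (assumed increment semantics).
    stats.gauge("scheduler.executor_events.batch_size", count)
    stats.incr("scheduler.executor_events.processed", count=count)
    return count
```

Keeping the metrics in the wrapper leaves the core function's behavior untouched, and the bare `raise` means callers still see the original exception.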
Tests
New test class `TestSchedulerObservabilityMetrics` in `airflow-core/tests/unit/jobs/test_scheduler_job.py` covers all three metrics (success and failure paths).
Was generative AI tooling used to co-author this PR?