
feat: add scheduler observability metrics #68068

Open
safaehar wants to merge 2 commits into apache:main from safaehar:feat/scheduler-observability-metrics


@safaehar safaehar commented Jun 5, 2026

Emit three new metrics from `SchedulerJobRunner` to improve visibility into scheduler health. These metrics have been running in production at Datadog via local monkey-patches and are contributed upstream so the patches can eventually be removed.

New metrics

  • `scheduler.loop_exceptions{exception_class}` — counter incremented when the scheduler loop exits with an unhandled exception, tagged with the exception class. Emitted from the existing `except Exception` block in `_execute`.

  • `scheduler.executor_events.batch_size` (gauge) and `scheduler.executor_events.processed` (counter) — emitted on every successful call to `process_executor_events` with the size of the event batch. `scheduler.executor_events.failed{reason}` is incremented when the call raises, tagged with the exception class.

  • `scheduler.zombies.detected{reason}` — counter incremented when zombie task instances are detected, tagged by detection path:

    • `heartbeat_timeout`: task exceeded the heartbeat threshold
    • `adopt_failure`: orphaned task could not be re-adopted and was reset
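For reviewers who want the tagging pattern at a glance, here is a minimal sketch of how the loop-exception and zombie counters could be emitted. It is not the code in this PR: `FakeStats`, `run_loop`, and `report_zombies` are hypothetical names, the stats client is a stand-in rather than Airflow's, and incrementing the zombie counter by the number of detected task instances is an assumption.

```python
from collections import Counter


class FakeStats:
    """Hypothetical stand-in for a StatsD-style client (not Airflow's Stats)."""

    def __init__(self):
        self.counters = Counter()

    def incr(self, name, count=1, tags=None):
        # Key each counter by metric name plus its sorted tag pairs.
        key = (name, tuple(sorted((tags or {}).items())))
        self.counters[key] += count


def run_loop(stats, loop_body):
    """Run loop_body once; count any unhandled exception, then re-raise it."""
    try:
        loop_body()
    except Exception as exc:
        stats.incr(
            "scheduler.loop_exceptions",
            tags={"exception_class": type(exc).__name__},
        )
        raise


def report_zombies(stats, zombie_count, reason):
    """Record detected zombies, tagged by detection path.

    Assumption: the counter grows by the number of zombies found and
    nothing is emitted when none were found.
    """
    if zombie_count:
        stats.incr(
            "scheduler.zombies.detected",
            count=zombie_count,
            tags={"reason": reason},
        )
```

Tagging with the exception class name rather than the message keeps tag cardinality bounded, which matters for most metrics backends.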

Tests

New test class `TestSchedulerObservabilityMetrics` in `airflow-core/tests/unit/jobs/test_scheduler_job.py` covers all three metrics (success and failure paths):

pytest airflow-core/tests/unit/jobs/test_scheduler_job.py::TestSchedulerObservabilityMetrics -v

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below): Claude

@safaehar safaehar requested review from XD-DENG and ashb as code owners June 5, 2026 09:10
@boring-cyborg boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label Jun 5, 2026
@safaehar safaehar force-pushed the feat/scheduler-observability-metrics branch 3 times, most recently from 9f3e477 to d018424 Compare June 5, 2026 09:58
safaehar added 2 commits June 5, 2026 12:05
Emit three new metrics from SchedulerJobRunner to improve visibility
into scheduler health:

- scheduler.loop_exceptions{exception_class}: counter incremented when
  the scheduler loop exits with an unhandled exception, tagged with the
  exception class to aid triaging crash loops.

- scheduler.executor_events.batch_size (gauge) and
  scheduler.executor_events.processed (counter): emitted on each
  successful call to process_executor_events with the event count.
  scheduler.executor_events.failed{reason} is incremented instead when
  the call raises, tagged with the exception class.
  The original body is extracted into _process_executor_events_core so
  process_executor_events can act as a thin metrics wrapper.

- scheduler.zombies.detected{reason}: counter incremented when zombie
  task instances are detected, tagged with the detection path:
    - heartbeat_timeout: task exceeded the heartbeat threshold
    - adopt_failure: orphaned task could not be re-adopted and was reset

These metrics are already running in production via local monkey-patches
at Datadog; this commit contributes them natively upstream so the
patches can eventually be removed.
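
The thin-wrapper arrangement described above could look roughly like the following sketch. It is illustrative only, not the diff in this commit: `RecordingStats` is a hypothetical stand-in client, `core` stands for `_process_executor_events_core`, and the choices to increment `processed` by the event count and to re-raise after counting a failure are assumptions.

```python
class RecordingStats:
    """Hypothetical stats client that records every call for inspection."""

    def __init__(self):
        self.calls = []

    def incr(self, name, count=1, tags=None):
        self.calls.append(("incr", name, count, tags))

    def gauge(self, name, value, tags=None):
        self.calls.append(("gauge", name, value, tags))


def process_executor_events(stats, core):
    """Metrics wrapper around `core`, which returns the event count."""
    try:
        event_count = core()
    except Exception as exc:
        # Failure path: count the failure by exception class, then re-raise
        # so the caller's error handling is unchanged (assumed behavior).
        stats.incr(
            "scheduler.executor_events.failed",
            tags={"reason": type(exc).__name__},
        )
        raise
    # Success path: emitted on every call, including empty batches.
    stats.gauge("scheduler.executor_events.batch_size", event_count)
    stats.incr("scheduler.executor_events.processed", count=event_count)
    return event_count
```

Keeping the original body in a separate function means the metrics code never touches the event-handling logic, so the wrapper can be reviewed on its own.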
… metric

Three existing tests used mock_stats.incr.assert_not_called() to verify that
no error/mismatch metrics were emitted in requeued-TI and stale-success
scenarios. The new executor_events.processed counter now fires unconditionally
for every _process_executor_events call, so those blanket assertions became
too broad. Replace each with an assertion that permits the expected counter
while still verifying that no scheduler.tasks.killed_externally metric fired.
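
One way to express the narrower assertion is sketched below. It is not the exact code in this commit: `assert_metric_not_emitted` is a hypothetical helper, and it assumes the metric name is always passed to `incr` as the first positional argument.

```python
from unittest import mock


def assert_metric_not_emitted(mock_stats, metric_name):
    """Fail only if `metric_name` was passed to mock_stats.incr."""
    emitted = [
        call.args[0]
        for call in mock_stats.incr.call_args_list
        if call.args  # skip calls that passed no positional arguments
    ]
    assert metric_name not in emitted, f"unexpected metric: {metric_name}"


# Usage: the unconditional counter is allowed, the error metric is not.
mock_stats = mock.MagicMock()
mock_stats.incr("scheduler.executor_events.processed", 0)
assert_metric_not_emitted(mock_stats, "scheduler.tasks.killed_externally")
```

Unlike `assert_not_called()`, this check keeps passing when other counters are added to the same code path later.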
@safaehar safaehar force-pushed the feat/scheduler-observability-metrics branch from d018424 to 27131ce Compare June 5, 2026 10:19
