Skip to content

observability: add structured diagnostics to scheduler loop and heartbeat-timeout detection#67077

Open
prince8273 wants to merge 5 commits into
apache:mainfrom
prince8273:observability/scheduler-loop-diagnostics
Open

observability: add structured diagnostics to scheduler loop and heartbeat-timeout detection#67077
prince8273 wants to merge 5 commits into
apache:mainfrom
prince8273:observability/scheduler-loop-diagnostics

Conversation

@prince8273
Copy link
Copy Markdown

Currently the scheduler loop and heartbeat-timeout detection emit minimal
context, making production diagnosis of stalls, slot contention, and worker
crashes difficult.

Changes

_do_scheduling()

  • Captures dag_runs_examined after fetching active runs
  • Emits scheduler.dag_runs.examined and scheduler.executor.open_slots gauges
  • Adds a structured summary log line before return

_purge_task_instances_without_heartbeats()

  • Emits scheduler.tasks.heartbeat_timeout gauge
  • Enriches the existing error log with heartbeat_age_seconds, hostname,
    pid, and task_running_seconds

Notes

  • Additive only — no behavior change, no state transitions affected
  • Follows existing stats.gauge naming and import conventions

@prince8273 prince8273 requested review from XD-DENG and ashb as code owners May 17, 2026 23:06
@boring-cyborg boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label May 17, 2026
@boring-cyborg
Copy link
Copy Markdown

boring-cyborg Bot commented May 17, 2026

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contributors' Guide
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our prek-hooks will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example Dag that shows how users should use it.
  • Consider using Breeze environment for testing locally, it's a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

…beat-timeout detection

- Capture dag_runs_examined count in _do_scheduling()

- Emit scheduler.dag_runs.examined and scheduler.executor.open_slots gauges

- Add scheduling loop summary log with dag_runs, queued_tis, open_slots

- Emit scheduler.tasks.heartbeat_timeout gauge in _purge_task_instances_without_heartbeats()

- Enrich heartbeat-timeout error log with heartbeat_age_seconds, hostname,
  pid, task_running_seconds

Signed-off-by: Prince Kumar <princesingh29757@gmail.com>
- scheduler.dag_runs.examined
- scheduler.executor.open_slots
- scheduler.tasks.heartbeat_timeout

Signed-off-by: Prince Kumar <princesingh29757@gmail.com>
@prince8273 prince8273 force-pushed the observability/scheduler-loop-diagnostics branch from 3118003 to 1ee5a6a Compare May 18, 2026 13:02
@potiuk
Copy link
Copy Markdown
Member

potiuk commented May 19, 2026

@prince8273 Converting to draft — this PR doesn't yet meet our Pull Request quality criteria.

  • Other failing CI checks — Failing: Postgres tests: core / DB-core:Postgres:14:3.10:Core...Serialization, MySQL tests: core / DB-core:MySQL:8.0:3.10:Core...Serialization, Sqlite tests: core / DB-core:Sqlite:3.10:Core...Serialization, Low dep tests:core / All-core:LowestDeps:14:3.10:Core...Serialization, Integration and System Tests / Integration core otel. See docs.
  • Pre-commit / static checks — Failing: CI image checks / Static checks. See docs.

See the linked criteria for how to fix each item, then mark the PR "Ready for review". This is not a rejection — just an invitation to bring the PR up to standard. No rush.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.


Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting

@potiuk potiuk marked this pull request as draft May 19, 2026 01:26
dag_runs is a ScalarResult which has no len(). Calling .all() materializes
it into a list, matching the existing test mock contract at test_scheduler_job.py:9215.

Signed-off-by: Prince Kumar <princesingh29757@gmail.com>
@prince8273 prince8273 marked this pull request as ready for review May 19, 2026 02:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants