observability: add structured diagnostics to scheduler loop and heartbeat-timeout detection #67077
prince8273 wants to merge 5 commits
Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide.
…beat-timeout detection

- Capture dag_runs_examined count in _do_scheduling()
- Emit scheduler.dag_runs.examined and scheduler.executor.open_slots gauges
- Add scheduling loop summary log with dag_runs, queued_tis, open_slots
- Emit scheduler.tasks.heartbeat_timeout gauge in _purge_task_instances_without_heartbeats()
- Enrich heartbeat-timeout error log with heartbeat_age_seconds, hostname, pid, task_running_seconds

Signed-off-by: Prince Kumar <princesingh29757@gmail.com>
- scheduler.dag_runs.examined
- scheduler.executor.open_slots
- scheduler.tasks.heartbeat_timeout

Signed-off-by: Prince Kumar <princesingh29757@gmail.com>
@prince8273 Converting to draft: this PR doesn't yet meet our Pull Request quality criteria.

See the linked criteria for how to fix each item, then mark the PR "Ready for review". This is not a rejection, just an invitation to bring the PR up to standard. No rush.

Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer (a real person) will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

Drafted-by: Claude Code (Opus 4.7); reviewed by @potiuk before posting
dag_runs is a ScalarResult, which has no len(). Calling .all() materializes it into a list, matching the existing test mock contract at test_scheduler_job.py:9215.

Signed-off-by: Prince Kumar <princesingh29757@gmail.com>
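
The point about len() can be illustrated with a small stand-in. This is only a sketch: `FakeScalarResult` and `count_and_keep` are hypothetical names invented for illustration, not SQLAlchemy or Airflow code. The stand-in mimics a lazy, iterable result that has no `__len__`, and shows why materializing it with `.all()` first makes counting possible.

```python
from __future__ import annotations

from typing import Iterable, Iterator


class FakeScalarResult:
    """Hypothetical stand-in for a lazy result: iterable, but defines no __len__."""

    def __init__(self, rows: Iterable[str]) -> None:
        self._rows: Iterator[str] = iter(rows)

    def __iter__(self) -> Iterator[str]:
        return self._rows

    def all(self) -> list[str]:
        # Drain the remaining rows into a list, which does support len().
        return list(self._rows)


def count_and_keep(result: FakeScalarResult) -> tuple[int, list[str]]:
    """Materialize once, then both count the rows and keep them for later use."""
    dag_runs = result.all()
    return len(dag_runs), dag_runs


# Calling len() directly on the lazy object raises TypeError.
lazy = FakeScalarResult(["run_a", "run_b", "run_c"])
try:
    len(lazy)
    len_supported = True
except TypeError:
    len_supported = False

# Materializing first gives a count and a reusable list.
dag_runs_examined, dag_runs = count_and_keep(
    FakeScalarResult(["run_a", "run_b", "run_c"])
)
```

Materializing has a cost: all rows are held in memory at once, which is presumably acceptable here because the scheduler iterates over every fetched run anyway.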
Currently the scheduler loop and heartbeat-timeout detection emit minimal
context, making production diagnosis of stalls, slot contention, and worker
crashes difficult.
Changes
- `_do_scheduling()`
  - Capture `dag_runs_examined` after fetching active runs
  - Emit `scheduler.dag_runs.examined` and `scheduler.executor.open_slots` gauges
- `_purge_task_instances_without_heartbeats()`
  - Emit `scheduler.tasks.heartbeat_timeout` gauge
  - Enrich the heartbeat-timeout error log with `heartbeat_age_seconds`, `hostname`, `pid`, and `task_running_seconds`

Notes
- `stats.gauge` naming and import conventions
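
A rough sketch of the diagnostics described above, written against stand-ins rather than Airflow itself. `InMemoryStats`, `TaskInstanceInfo`, and the three helper functions are hypothetical names for illustration; only the metric names and log field names come from this PR. The sketch also assumes that the heartbeat-timeout gauge carries the number of timed-out task instances found in one pass, which the description does not state.

```python
from __future__ import annotations

import logging
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

log = logging.getLogger("scheduler_diagnostics_sketch")


class InMemoryStats:
    """Hypothetical stand-in for a metrics client exposing gauge(name, value)."""

    def __init__(self) -> None:
        self.gauges: dict[str, float] = {}

    def gauge(self, name: str, value: float) -> None:
        self.gauges[name] = value


@dataclass
class TaskInstanceInfo:
    """Hypothetical subset of task-instance fields needed for the enriched log."""

    task_id: str
    hostname: str
    pid: int
    start_date: datetime
    last_heartbeat_at: datetime


def emit_loop_diagnostics(
    stats: InMemoryStats, dag_runs_examined: int, queued_tis: int, open_slots: int
) -> None:
    """Emit the two scheduling-loop gauges and a one-line loop summary."""
    stats.gauge("scheduler.dag_runs.examined", dag_runs_examined)
    stats.gauge("scheduler.executor.open_slots", open_slots)
    log.info(
        "Scheduling loop summary: dag_runs=%d queued_tis=%d open_slots=%d",
        dag_runs_examined, queued_tis, open_slots,
    )


def heartbeat_timeout_context(ti: TaskInstanceInfo, now: datetime) -> dict:
    """Build the extra fields attached to the heartbeat-timeout error log."""
    return {
        "heartbeat_age_seconds": (now - ti.last_heartbeat_at).total_seconds(),
        "hostname": ti.hostname,
        "pid": ti.pid,
        "task_running_seconds": (now - ti.start_date).total_seconds(),
    }


def report_heartbeat_timeouts(
    stats: InMemoryStats, timed_out: list[TaskInstanceInfo], now: datetime
) -> list[dict]:
    """Emit the timeout gauge (assumed: a count) and log each task with context."""
    stats.gauge("scheduler.tasks.heartbeat_timeout", len(timed_out))
    contexts = []
    for ti in timed_out:
        ctx = heartbeat_timeout_context(ti, now)
        log.error("Task %s has no recent heartbeat: %s", ti.task_id, ctx)
        contexts.append(ctx)
    return contexts


# Usage with made-up values: a task started 10 minutes ago, last heartbeat 330 s ago.
now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
ti = TaskInstanceInfo(
    task_id="extract",
    hostname="worker-1",
    pid=4242,
    start_date=now - timedelta(minutes=10),
    last_heartbeat_at=now - timedelta(seconds=330),
)
stats = InMemoryStats()
emit_loop_diagnostics(stats, dag_runs_examined=3, queued_tis=5, open_slots=12)
contexts = report_heartbeat_timeouts(stats, [ti], now)
```

Keeping the context builder separate from the logging call makes the derived fields easy to unit test without capturing log output.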