Skip to content

feat(metrics): wrap executor.heartbeat() in a timer to localize loop slowdowns#66808

Merged
potiuk merged 1 commit into
apache:mainfrom
1fanwang:metrics-scheduler-heartbeat-timer
May 12, 2026
Merged

feat(metrics): wrap executor.heartbeat() in a timer to localize loop slowdowns#66808
potiuk merged 1 commit into
apache:mainfrom
1fanwang:metrics-scheduler-heartbeat-timer

Conversation

@1fanwang
Copy link
Copy Markdown
Contributor

Problem

dag_processing.loop_duration is aggregate. When the scheduler loop slows down, there's no signal pointing at the executor as the cause — executor.heartbeat() runs every iteration and can spike from Kubernetes API throttling, Celery broker lag, etc., but its wall time is folded into the loop total.

Fix

Wrap executor.heartbeat() in stats.timer("scheduler.executor_heartbeat_duration"), tagged with executor: <ClassName> so each configured executor reports independently in multi-executor deployments.

Tests

test_executor_heartbeat_emits_timer patches stats.timer, runs one scheduler loop iteration with the two-executor mock_executors fixture, and asserts the timer was opened once per executor with the right metric name and tags={"executor": <class>}.

Note

I scoped this to executor.heartbeat() only, not _process_executor_events. The two calls are structurally separate — they live in different loops with a fresh create_session() block between them — and their cost characteristics differ (heartbeat is executor I/O; _process_executor_events is DB writes plus event-buffer drain). They look like two independent signals rather than a tight pair, so a separate timer for _process_executor_events makes sense as a follow-up if the same need surfaces there. Happy to roll it into this PR if reviewers would prefer.

Closes #66803

…slowdowns

Emit scheduler.executor_heartbeat_duration as a per-executor timer so
operators can see whether executor.heartbeat() is the bottleneck of a
slow scheduler loop, instead of inferring from the aggregate
scheduler.scheduler_loop_duration.

Tagged by type(executor).__name__ so multi-executor deployments
attribute the cost to each configured executor separately.

Closes apache#66803

Signed-off-by: 1fanwang <1fannnw@gmail.com>
@boring-cyborg boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label May 12, 2026
@potiuk potiuk merged commit 2956c98 into apache:main May 12, 2026
79 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:Scheduler including HA (high availability) scheduler

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Emit a timer around scheduler executor.heartbeat() to localize loop slowdowns

2 participants