Emit histograms alongside pool slot gauges to capture distribution between scrapes #66800

@1fanwang

Description

scheduler_job_runner.py emits gauges for pool slot states (pool.open_slots, pool.queued_slots, pool.running_slots, pool.starving_tasks). On most backends, gauges are last-write-wins: only the most recent value emitted before a scrape is retained. A spike in pool pressure that occurs between two scrapes is therefore either reported as a single value or missed entirely, and the distribution between scrapes is lost.
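To make the loss concrete, here is a minimal illustration in plain Python. The sample values are invented for the example and do not come from a real scheduler. It compares what a last-write-wins gauge would report against what a histogram over the same observations could expose.

```python
from statistics import quantiles

# Hypothetical pool.queued_slots values emitted by successive scheduler
# loop iterations between two scrapes (made-up numbers for illustration).
samples = [2, 3, 2, 48, 50, 3, 2]

# A last-write-wins gauge retains only the final value before the scrape.
gauge_value = samples[-1]

# A histogram keeps every observation, so percentiles can be derived.
cut_points = quantiles(samples, n=100, method="inclusive")
p50, p95 = cut_points[49], cut_points[94]
peak = max(samples)

print(f"gauge={gauge_value} p50={p50} p95={p95} max={peak}")
```

In this example the gauge reports the quiet final value, while the upper percentiles reflect the burst in the middle of the interval.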

Use case / motivation

Backend operators sizing pools want p50/p95/p99 of pool utilization, not just point-in-time gauge samples. Today there's no way to see the spread.

Proposal

Alongside each existing pool slot gauge emission, also emit a histogram with the same value. This amounts to four Stats.histogram(...) additions in scheduler_job_runner.py, at the same call sites as the existing gauges. Nothing is removed: the gauges stay for backwards-compatible scrapers.
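A rough sketch of the paired emission might look like the following. It is self-contained and deliberately does not import Airflow: `RecordingStats` and `emit_pool_metrics` are hypothetical stand-ins, and the per-pool name suffix and the reuse of the gauge's metric name for the histogram are assumptions, not something this issue specifies.

```python
from collections import defaultdict


class RecordingStats:
    """Hypothetical in-memory stats client; NOT Airflow's Stats class."""

    def __init__(self):
        self.gauges = {}
        self.histograms = defaultdict(list)

    def gauge(self, name, value):
        # Last write wins: earlier values are overwritten.
        self.gauges[name] = value

    def histogram(self, name, value):
        # Every observation is kept so a backend can compute percentiles.
        self.histograms[name].append(value)


# The four pool metrics named in this issue.
POOL_METRICS = ("open_slots", "queued_slots", "running_slots", "starving_tasks")


def emit_pool_metrics(stats, pool_name, values):
    """Emit each pool metric as a gauge and, with the same value, a histogram.

    `values` maps each name in POOL_METRICS to its current value.
    The "pool.<metric>.<pool_name>" naming is an assumption for this sketch.
    """
    for metric in POOL_METRICS:
        name = f"pool.{metric}.{pool_name}"
        stats.gauge(name, values[metric])      # existing behaviour, unchanged
        stats.histogram(name, values[metric])  # proposed addition


# Usage: two scheduler loop iterations for a hypothetical pool.
stats = RecordingStats()
emit_pool_metrics(stats, "default_pool",
                  {"open_slots": 0, "queued_slots": 40, "running_slots": 128, "starving_tasks": 7})
emit_pool_metrics(stats, "default_pool",
                  {"open_slots": 100, "queued_slots": 0, "running_slots": 28, "starving_tasks": 0})
```

After both iterations the gauge for `pool.queued_slots.default_pool` holds only the second value, while the histogram holds both observations. Whether a real backend accepts a gauge and a histogram under one name varies, so an actual change may need a distinct histogram name.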

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
