Skip to content

feat(planner): resolve sub-30s queue time and surface median#185

Merged
cbartz merged 7 commits intomainfrom
feat/planner-waiting-time-buckets
Apr 23, 2026
Merged

feat(planner): resolve sub-30s queue time and surface median#185
cbartz merged 7 commits intomainfrom
feat/planner-waiting-time-buckets

Conversation

@cbartz
Copy link
Copy Markdown
Collaborator

@cbartz cbartz commented Apr 22, 2026

Overview

Two related improvements to the planner's "Job queue time" panel and the metric behind it:

  1. Bucket layout (internal/planner/telemetry.go) — the waiting-time histogram no longer reuses the running-time bucket boundaries. Queue times are usually lower than job running times, so the new layout resolves the low end more finely (1, 2.5, 5, 10, 15, 30s) and caps the tail earlier (at 4h). Sub-30s queue times now land in distinct buckets instead of collapsing into a single one.

  2. Dashboard (charms/planner-operator/cos_custom/grafana_dashboards/go.json) — the single "Job queue time" panel is split into two side-by-side panels: "Job queue time (median)" and "Job queue time (p95)". They live in the same dashboard slot the original panel occupied. The split is needed because the two series differ by orders of magnitude (median in seconds–minutes, p95 sometimes in hours), so a shared Y axis squashed the median to a flat line near zero. With separate panels each axis auto-scales independently. The panel descriptions also flag a caveat: started_at − created_at may overstate runner pickup time because the job's effective queued time can be later than created_at.

image

Rationale

The previous p95-only view on coarse buckets was misleading in two ways: sub-30s waits showed up as ~28s due to linear interpolation inside the too-wide first bucket, and a single upper percentile without a central-tendency signal gave no sense of what a normal job experienced. Median + p95 on buckets tuned for queue times — and on separate auto-scaled panels — give a much more honest picture of the system's behaviour.

Checklist

  • Changes comply with the project's coding standards and guidelines (see CONTRIBUTING.md and STYLE.md)
  • CONTRIBUTING.md has been updated upon changes to the contribution/development process (e.g. changes to the way tests are run) — N/A
  • Technical author has been assigned to review the PR in case of documentation changes (usually *.md files) — N/A

Split the waiting-time histogram from the running-time bucket layout so
sub-30s queue times no longer collapse into a single bucket. New
boundaries include 1, 2.5, 5, 10, and 15 seconds and cap the tail at 4h,
better matching the typical shape of queue-time data.

Surface p50 alongside p95 on the "Job queue time" Grafana panel with
explicit legend formats. Median is now a meaningful signal rather than
a flat line pinned to ~15s by the old coarse bucketing.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the planner’s “Job queue time” observability by tuning the waiting-time histogram buckets for sub-30s resolution and updating the Grafana panel to surface both median and p95.

Changes:

  • Update webhookJobWaitingTime histogram bucket boundaries to a queue-time-appropriate layout (finer sub-30s buckets, tail capped at 4h).
  • Add a unit test ensuring the waiting-time histogram exposes bucket boundaries below 30s.
  • Update the Grafana “Job queue time” panel to plot p50 and p95 and enable the legend with explicit labels.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
internal/planner/telemetry.go Changes queue-time histogram buckets; adds rationale comment.
internal/planner/telemetry_test.go Adds a test to ensure waiting-time histogram has sub-30s bucket boundaries.
charms/planner-operator/cos_custom/grafana_dashboards/go.json Updates the “Job queue time” panel to show p50 + p95 with legend enabled.
Comments suppressed due to low confidence (1)

internal/planner/telemetry.go:143

  • The waiting-time histogram description says it measures “from webhook reception to job start”, but the value recorded is StartedAt - CreatedAt (job creation to job start). Please update the description to match the actual timestamps being used so the metric is not misleading to operators.
			webhookJobWaitingTimeMetricName,
			metric.WithDescription("Histogram of job waiting times from webhook reception to job start"),
			metric.WithUnit("s"),

Comment thread internal/planner/telemetry_test.go
Comment thread internal/planner/telemetry_test.go Outdated
cbartz added 3 commits April 22, 2026 20:37
Update the metric description to reflect what is actually measured
(workflow_job creation to job start on a runner) instead of the
inaccurate "webhook reception to job start" wording.

Drop the stale p50/p80 vocabulary from the bucket-resolution test
comments and use a percentile-agnostic phrasing that matches the
dashboard's p50/p95 targets.
The combined median/p95 panel made the median illegible because the
p95 tail (hours) compressed the median (seconds to minutes) into a
flat line near zero. Split into two side-by-side panels so each gets
its own auto-scaled Y axis, and use descriptive legend labels.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

cbartz added 2 commits April 23, 2026 07:24
The waiting metric is `started_at - created_at`, which for workflows
with `needs:` or `concurrency:` includes dependency/blocking wait
time on top of pure runner-queue wait. Call this out in both panel
descriptions so an operator looking at high p95 doesn't immediately
conclude the runner pool is undersized.
We don't yet know empirically whether `created_at` reflects the
moment the job became eligible for a runner or an earlier point
(e.g. workflow run creation). Reword the caveat to flag the
uncertainty around the effective queued time rather than asserting
that dep/concurrency wait is always included.
Remove the `needs:` / `concurrency:` mention from the queue-time
caveats — both are speculation about how GitHub stamps `created_at`
and not empirically confirmed. Keep the general note that the
effective queued time may be later than `created_at`.

Hide the legend on both queue-time panels since each panel now shows
a single series; the panel title carries the percentile.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@cbartz cbartz marked this pull request as ready for review April 23, 2026 08:36
Copy link
Copy Markdown
Collaborator

@yhaliaw yhaliaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Look good.
Adding the median would be a improvement.

@cbartz cbartz merged commit f2a4e83 into main Apr 23, 2026
50 checks passed
@cbartz cbartz deleted the feat/planner-waiting-time-buckets branch April 23, 2026 09:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants