feat(planner): resolve sub-30s queue time and surface median by cbartz · Pull Request #185 · canonical/github-runner-operators

cbartz · 2026-04-22T14:00:35Z

Overview

Two related improvements to the planner's "Job queue time" panel and the metric behind it:

Bucket layout (internal/planner/telemetry.go) — the waiting-time histogram no longer reuses the running-time bucket boundaries. Queue times are usually lower than job running times, so the new layout resolves the low end more finely (1, 2.5, 5, 10, 15, 30s) and caps the tail earlier (at 4h). Sub-30s queue times now land in distinct buckets instead of collapsing into a single one.
Dashboard (charms/planner-operator/cos_custom/grafana_dashboards/go.json) — the single "Job queue time" panel is split into two side-by-side panels: "Job queue time (median)" and "Job queue time (p95)". They live in the same dashboard slot the original panel occupied. The split is needed because the two series differ by orders of magnitude (median in seconds–minutes, p95 sometimes in hours), so a shared Y axis squashed the median to a flat line near zero. With separate panels each axis auto-scales independently. The panel descriptions also flag a caveat: started_at − created_at may overstate runner pickup time because the job's effective queued time can be later than created_at.

Rationale

The previous p95-only view on coarse buckets was misleading in two ways: sub-30s waits showed up as ~28s due to linear interpolation inside the too-wide first bucket, and a single upper percentile without a central-tendency signal gave no sense of what a normal job experienced. Median + p95 on buckets tuned for queue times — and on separate auto-scaled panels — give a much more honest picture of the system's behaviour.

Checklist

Changes comply with the project's coding standards and guidelines (see CONTRIBUTING.md and STYLE.md)
CONTRIBUTING.md has been updated upon changes to the contribution/development process (e.g. changes to the way tests are run) — N/A
Technical author has been assigned to review the PR in case of documentation changes (usually *.md files) — N/A

Split the waiting-time histogram from the running-time bucket layout so sub-30s queue times no longer collapse into a single bucket. New boundaries include 1, 2.5, 5, 10, and 15 seconds and cap the tail at 4h, better matching the typical shape of queue-time data. Surface p50 alongside p95 on the "Job queue time" Grafana panel with explicit legend formats. Median is now a meaningful signal rather than a flat line pinned to ~15s by the old coarse bucketing.

Copilot

Pull request overview

This PR improves the planner’s “Job queue time” observability by tuning the waiting-time histogram buckets for sub-30s resolution and updating the Grafana panel to surface both median and p95.

Changes:

Update webhookJobWaitingTime histogram bucket boundaries to a queue-time-appropriate layout (finer sub-30s buckets, tail capped at 4h).
Add a unit test ensuring the waiting-time histogram exposes bucket boundaries below 30s.
Update the Grafana “Job queue time” panel to plot p50 and p95 and enable the legend with explicit labels.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
internal/planner/telemetry.go	Changes queue-time histogram buckets; adds rationale comment.
internal/planner/telemetry_test.go	Adds a test to ensure waiting-time histogram has sub-30s bucket boundaries.
charms/planner-operator/cos_custom/grafana_dashboards/go.json	Updates the “Job queue time” panel to show p50 + p95 with legend enabled.

Comments suppressed due to low confidence (1)

internal/planner/telemetry.go:143

The waiting-time histogram description says it measures “from webhook reception to job start”, but the value recorded is StartedAt - CreatedAt (job creation to job start). Please update the description to match the actual timestamps being used so the metric is not misleading to operators.

			webhookJobWaitingTimeMetricName,
			metric.WithDescription("Histogram of job waiting times from webhook reception to job start"),
			metric.WithUnit("s"),

Update the metric description to reflect what is actually measured (workflow_job creation to job start on a runner) instead of the inaccurate "webhook reception to job start" wording. Drop the stale p50/p80 vocabulary from the bucket-resolution test comments and use a percentile-agnostic phrasing that matches the dashboard's p50/p95 targets.

The combined median/p95 panel made the median illegible because the p95 tail (hours) compressed the median (seconds to minutes) into a flat line near zero. Split into two side-by-side panels so each gets its own auto-scaled Y axis, and use descriptive legend labels.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

The waiting metric is `started_at - created_at`, which for workflows with `needs:` or `concurrency:` includes dependency/blocking wait time on top of pure runner-queue wait. Call this out in both panel descriptions so an operator looking at high p95 doesn't immediately conclude the runner pool is undersized.

We don't yet know empirically whether `created_at` reflects the moment the job became eligible for a runner or an earlier point (e.g. workflow run creation). Reword the caveat to flag the uncertainty around the effective queued time rather than asserting that dep/concurrency wait is always included.

Remove the `needs:` / `concurrency:` mention from the queue-time caveats — both are speculation about how GitHub stamps `created_at` and not empirically confirmed. Keep the general note that the effective queued time may be later than `created_at`. Hide the legend on both queue-time panels since each panel now shows a single series; the panel title carries the percentile.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

yhaliaw

Look good.
Adding the median would be a improvement.

cbartz requested a review from Copilot April 22, 2026 14:01

Copilot started reviewing on behalf of cbartz April 22, 2026 14:02 View session

Copilot AI reviewed Apr 22, 2026

View reviewed changes

Comment thread internal/planner/telemetry_test.go

Comment thread internal/planner/telemetry_test.go Outdated

cbartz added 3 commits April 22, 2026 20:37

Merge branch 'main' into feat/planner-waiting-time-buckets

2a6f189

cbartz requested a review from Copilot April 23, 2026 05:06

Copilot started reviewing on behalf of cbartz April 23, 2026 05:07 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

cbartz added 2 commits April 23, 2026 07:24

cbartz requested a review from Copilot April 23, 2026 08:01

Copilot started reviewing on behalf of cbartz April 23, 2026 08:01 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

cbartz requested a review from Copilot April 23, 2026 08:32

Copilot started reviewing on behalf of cbartz April 23, 2026 08:32 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

cbartz marked this pull request as ready for review April 23, 2026 08:36

cbartz requested review from florentianayuwono, javierdelapuente, weiiwang01, yanksyoon and yhaliaw as code owners April 23, 2026 08:36

yhaliaw approved these changes Apr 23, 2026

View reviewed changes

florentianayuwono approved these changes Apr 23, 2026

View reviewed changes

cbartz merged commit f2a4e83 into main Apr 23, 2026
50 checks passed

cbartz deleted the feat/planner-waiting-time-buckets branch April 23, 2026 09:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(planner): resolve sub-30s queue time and surface median#185

feat(planner): resolve sub-30s queue time and surface median#185
cbartz merged 7 commits intomainfrom
feat/planner-waiting-time-buckets

cbartz commented Apr 22, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

yhaliaw left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

cbartz commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview

Rationale

Checklist

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

yhaliaw left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

cbartz commented Apr 22, 2026 •

edited

Loading