feat(planner): resolve sub-30s queue time and surface median#185
Merged
feat(planner): resolve sub-30s queue time and surface median#185
Conversation
Split the waiting-time histogram from the running-time bucket layout so sub-30s queue times no longer collapse into a single bucket. New boundaries include 1, 2.5, 5, 10, and 15 seconds and cap the tail at 4h, better matching the typical shape of queue-time data. Surface p50 alongside p95 on the "Job queue time" Grafana panel with explicit legend formats. Median is now a meaningful signal rather than a flat line pinned to ~15s by the old coarse bucketing.
Contributor
There was a problem hiding this comment.
Pull request overview
This PR improves the planner’s “Job queue time” observability by tuning the waiting-time histogram buckets for sub-30s resolution and updating the Grafana panel to surface both median and p95.
Changes:
- Update
webhookJobWaitingTimehistogram bucket boundaries to a queue-time-appropriate layout (finer sub-30s buckets, tail capped at 4h). - Add a unit test ensuring the waiting-time histogram exposes bucket boundaries below 30s.
- Update the Grafana “Job queue time” panel to plot p50 and p95 and enable the legend with explicit labels.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/planner/telemetry.go | Changes queue-time histogram buckets; adds rationale comment. |
| internal/planner/telemetry_test.go | Adds a test to ensure waiting-time histogram has sub-30s bucket boundaries. |
| charms/planner-operator/cos_custom/grafana_dashboards/go.json | Updates the “Job queue time” panel to show p50 + p95 with legend enabled. |
Comments suppressed due to low confidence (1)
internal/planner/telemetry.go:143
- The waiting-time histogram description says it measures “from webhook reception to job start”, but the value recorded is
StartedAt - CreatedAt(job creation to job start). Please update the description to match the actual timestamps being used so the metric is not misleading to operators.
webhookJobWaitingTimeMetricName,
metric.WithDescription("Histogram of job waiting times from webhook reception to job start"),
metric.WithUnit("s"),
Update the metric description to reflect what is actually measured (workflow_job creation to job start on a runner) instead of the inaccurate "webhook reception to job start" wording. Drop the stale p50/p80 vocabulary from the bucket-resolution test comments and use a percentile-agnostic phrasing that matches the dashboard's p50/p95 targets.
The combined median/p95 panel made the median illegible because the p95 tail (hours) compressed the median (seconds to minutes) into a flat line near zero. Split into two side-by-side panels so each gets its own auto-scaled Y axis, and use descriptive legend labels.
The waiting metric is `started_at - created_at`, which for workflows with `needs:` or `concurrency:` includes dependency/blocking wait time on top of pure runner-queue wait. Call this out in both panel descriptions so an operator looking at high p95 doesn't immediately conclude the runner pool is undersized.
We don't yet know empirically whether `created_at` reflects the moment the job became eligible for a runner or an earlier point (e.g. workflow run creation). Reword the caveat to flag the uncertainty around the effective queued time rather than asserting that dep/concurrency wait is always included.
Remove the `needs:` / `concurrency:` mention from the queue-time caveats — both are speculation about how GitHub stamps `created_at` and not empirically confirmed. Keep the general note that the effective queued time may be later than `created_at`. Hide the legend on both queue-time panels since each panel now shows a single series; the panel title carries the percentile.
yhaliaw
approved these changes
Apr 23, 2026
Collaborator
yhaliaw
left a comment
There was a problem hiding this comment.
Look good.
Adding the median would be a improvement.
florentianayuwono
approved these changes
Apr 23, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Overview
Two related improvements to the planner's "Job queue time" panel and the metric behind it:
Bucket layout (
internal/planner/telemetry.go) — the waiting-time histogram no longer reuses the running-time bucket boundaries. Queue times are usually lower than job running times, so the new layout resolves the low end more finely (1, 2.5, 5, 10, 15, 30s) and caps the tail earlier (at 4h). Sub-30s queue times now land in distinct buckets instead of collapsing into a single one.Dashboard (
charms/planner-operator/cos_custom/grafana_dashboards/go.json) — the single "Job queue time" panel is split into two side-by-side panels: "Job queue time (median)" and "Job queue time (p95)". They live in the same dashboard slot the original panel occupied. The split is needed because the two series differ by orders of magnitude (median in seconds–minutes, p95 sometimes in hours), so a shared Y axis squashed the median to a flat line near zero. With separate panels each axis auto-scales independently. The panel descriptions also flag a caveat:started_at − created_atmay overstate runner pickup time because the job's effective queued time can be later thancreated_at.Rationale
The previous p95-only view on coarse buckets was misleading in two ways: sub-30s waits showed up as ~28s due to linear interpolation inside the too-wide first bucket, and a single upper percentile without a central-tendency signal gave no sense of what a normal job experienced. Median + p95 on buckets tuned for queue times — and on separate auto-scaled panels — give a much more honest picture of the system's behaviour.
Checklist
CONTRIBUTING.mdhas been updated upon changes to the contribution/development process (e.g. changes to the way tests are run) — N/A