Describe the bug
After a GitHub Actions platform incident, the TotalAssignedJobs value in RunnerScaleSetStatistic gets stuck at an inflated count that never recovers. The listener passes this value directly to the scaling calculation, causing the controller to permanently over-provision runners for phantom demand. The only fix is deleting the AutoscalingRunnerSet CR — listener pod restarts don't clear the state because it comes from the GitHub Actions service, not the listener.
We're running ARC 0.13.1 on GKE with an org-level runner group serving ~100 repos.
How we tracked it down
Started with Prometheus. The listener metrics were showing a persistent gap between gha_assigned_jobs and gha_running_jobs:
| Metric | Value | Source |
|---|---|---|
| gha_assigned_jobs | 34 | Listener (TotalAssignedJobs) |
| gha_running_jobs | 19 | Listener (TotalRunningJobs) |
| gha_desired_runners | 39 | Listener (calculated) |
| gha_busy_runners | 19 | Listener (TotalBusyRunners) |
| gha_idle_runners | 0 | Listener (TotalIdleRunners) |
| gha_min_runners | 5 | Config |
| gha_max_runners | 80 | Config |
The gap — assigned(34) - running(19) = 15 stale jobs — was the first clue. The listener was inflating desired_runners to 39 instead of the expected ~24 (19 busy + 5 minRunners).
This wasn't a transient burst. Looking at the desired_runners - busy_runners gap over time:
- Pre-incident (72-48h prior): gap was 5-8 during work hours, dropping to 5 (minRunners) overnight. Normal.
- Post-incident (last 24h): gap jumped to 16-25 and stayed there permanently, never dropping below ~16 even overnight when real job volume was near zero.
Grafana dashboard showing the 48h timeline across runners (busy/idle/desired/max), jobs (assigned/running), and job results (cancelled/failed/succeeded). The persistent divergence between assigned and running in the Jobs panel is the stale state, and the elevated desired in the Runners panel is the resulting over-provisioning.
The controller was creating runner pods to satisfy the inflated demand:
gha_controller_running_ephemeral_runners: 23
gha_controller_pending_ephemeral_runners: 6
These extra runners had no real work. Meanwhile, real job startup latency degraded:
| Phase | p95 Startup Latency |
|---|---|
| Pre-incident | 20-55s |
| Incident onset | 582-822s |
| Incident peak | 1,727s (~29 min) |
| Sustained (24h later) | 242-1,540s |
What we tried
- Deleted the listener pod — it restarted with a new IP and reconnected to the message session, but TotalAssignedJobs came back at the same inflated value. The stale state is on GitHub's side, not in the pod.
- Deleted the AutoscalingRunnerSet CR — this forced a complete deregistration and re-registration with the GitHub Actions service. The recreated listener immediately showed assigned_jobs == running_jobs with no stale gap.
Post-fix metrics:
| Metric | Before (broken) | After (fixed) |
|---|---|---|
| assigned_jobs | 34 | 20 |
| running_jobs | 19 | 20 |
| desired_runners | 39 | 26 |
| busy_runners | 19 | 20 |
| idle_runners | 0 | 1 |
| Stale gap | 15 | 0 |
Root cause analysis
We traced the code path in 0.13.1. The listener has no local job state — TotalAssignedJobs flows directly from the GitHub Actions service through to the scaling decision with no validation:
```
RunnerScaleSetMessage.Statistics.TotalAssignedJobs        (types.go:130)
  → listener.handleMessage() publishes to metrics         (listener.go:189)
  → handler.HandleDesiredRunnerCount(TotalAssignedJobs)   (listener.go:217)
  → worker.setDesiredWorkerState(count)                   (worker.go:225)
  → targetRunnerCount = min(MinRunners + count, MaxRunners)
```
The listener trusts TotalAssignedJobs completely. There's no reconciliation against TotalRunningJobs, no TTL on stale assignments, and no mechanism to detect that the gap between assigned and running has become permanent.
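A condensed sketch of that trust-the-input path. Only the names come from the trace above; the struct and function bodies are simplified illustrations, not the actual ARC implementation:

```go
package main

import "fmt"

// Statistics mirrors the fields the listener receives from the
// GitHub Actions service (subset; names per the trace above).
type Statistics struct {
	TotalAssignedJobs int
	TotalRunningJobs  int
}

// setDesiredWorkerState sketches the scaling decision: the assigned-job
// count is trusted as-is, with no reconciliation against running jobs.
func setDesiredWorkerState(minRunners, maxRunners, assignedJobs int) int {
	return min(minRunners+assignedJobs, maxRunners)
}

func main() {
	// The stale scenario from this report: 15 of the 34 assignments
	// are phantom, but the formula has no way to tell.
	stats := Statistics{TotalAssignedJobs: 34, TotalRunningJobs: 19}
	fmt.Println(setDesiredWorkerState(5, 80, stats.TotalAssignedJobs)) // 39
	fmt.Println(setDesiredWorkerState(5, 80, stats.TotalRunningJobs)) // 24
}
```

Plugging in our config (minRunners=5, maxRunners=80) reproduces the observed desired_runners of 39, versus the ~24 we'd expect from real load.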
When the GitHub Actions service fails to clean up cancelled/timed-out job assignments after a platform incident, the inflated TotalAssignedJobs persists in the message session. The listener faithfully reports it, the worker faithfully scales for it, and the controller creates runners that have no work to do.
To Reproduce
- Run ARC with a scale set registered to an org with active CI traffic
- A GitHub Actions platform incident assigns jobs to the scale set but prevents them from executing
- After the incident resolves, observe that gha_assigned_jobs remains higher than gha_running_jobs
- Restart the listener pod — the gap persists (state is on GitHub's side)
- Delete the AutoscalingRunnerSet CR and let it be recreated — the gap clears
Proposed mitigation
The underlying bug is that the GitHub Actions service reports stale TotalAssignedJobs after incidents. That's a server-side issue. But the listener could add defense-in-depth to limit the blast radius:
- Staleness detection with session reconnect — If TotalAssignedJobs - TotalRunningJobs remains constant (same gap) across N consecutive message cycles, force a session delete/recreate. This would prompt the service to recompute statistics from scratch. This seems like the most targeted fix since it directly addresses the stuck state.
- Sanity cap on desired runners — If TotalAssignedJobs significantly exceeds TotalRunningJobs + TotalBusyRunners for a sustained period, cap the scaling input at TotalRunningJobs rather than TotalAssignedJobs. This prevents over-provisioning but doesn't clear the stale state.
- Periodic session rotation — Delete and recreate the message session on a configurable interval (e.g., every 6 hours). Blunter than option 1 but simpler to implement and protects against any form of session state drift.
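To make option 1 concrete, here's a rough sketch of the detection logic, assuming the listener tracks the gap between message cycles. The type, method, and threshold names are hypothetical, not from the ARC codebase:

```go
package main

import "fmt"

// stalenessDetector tracks the assigned-vs-running gap across message
// cycles. Hypothetical illustration of mitigation option 1, not ARC code.
type stalenessDetector struct {
	lastGap     int
	sameGapRuns int
	threshold   int // N consecutive cycles with an identical nonzero gap
}

// observe records one message cycle's statistics and reports whether the
// gap has been stuck long enough to justify a session delete/recreate.
func (d *stalenessDetector) observe(assigned, running int) (recreateSession bool) {
	gap := assigned - running
	if gap > 0 && gap == d.lastGap {
		d.sameGapRuns++
	} else {
		// Gap changed (or cleared): jobs are draining normally, reset.
		d.sameGapRuns = 0
	}
	d.lastGap = gap
	return d.sameGapRuns >= d.threshold
}

func main() {
	d := &stalenessDetector{threshold: 3}
	// The stale scenario from this report: 34 assigned, 19 running,
	// gap frozen at 15 on every cycle.
	for cycle := 1; cycle <= 5; cycle++ {
		if d.observe(34, 19) {
			fmt.Printf("cycle %d: gap stuck, recreating message session\n", cycle)
			break
		}
	}
}
```

A healthy burst (where the gap shrinks as runners pick up jobs) never trips the detector, because any change in the gap resets the counter; only a gap frozen at the same nonzero value fires the reconnect.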
Happy to submit a PR for whichever approach the maintainers prefer.
Environment:
- Infrastructure: GKE (Google Kubernetes Engine)
- ARC version: 0.13.1 (gha-runner-scale-set and gha-runner-scale-set-controller)
- Deployment: Helm, org-level runner group, ephemeral runners with dind
- Scale: ~100 repos, minRunners=5, maxRunners=80
Workaround
Delete the AutoscalingRunnerSet CR. The controller's finalizer will gracefully drain running jobs before completing deletion. Once deleted, your GitOps tool (or Helm/kubectl) recreates the CR with a fresh registration, clearing the stale state.
Restarting the listener pod does not work.