listener: stale TotalAssignedJobs from GitHub Actions service causes permanent over-provisioning after platform incidents #4397

@jameshounshell

Description

Describe the bug

After a GitHub Actions platform incident, the TotalAssignedJobs value in RunnerScaleSetStatistic gets stuck at an inflated count that never recovers. The listener passes this value directly to the scaling calculation, causing the controller to permanently over-provision runners for phantom demand. The only fix is deleting the AutoscalingRunnerSet CR — listener pod restarts don't clear the state because it comes from the GitHub Actions service, not the listener.

We're running ARC 0.13.1 on GKE with an org-level runner group serving ~100 repos.

How we tracked it down

Started with Prometheus. The listener metrics were showing a persistent gap between gha_assigned_jobs and gha_running_jobs:

| Metric | Value | Source |
|---|---|---|
| `gha_assigned_jobs` | 34 | Listener (`TotalAssignedJobs`) |
| `gha_running_jobs` | 19 | Listener (`TotalRunningJobs`) |
| `gha_desired_runners` | 39 | Listener (calculated) |
| `gha_busy_runners` | 19 | Listener (`TotalBusyRunners`) |
| `gha_idle_runners` | 0 | Listener (`TotalIdleRunners`) |
| `gha_min_runners` | 5 | Config |
| `gha_max_runners` | 80 | Config |

The gap — assigned(34) - running(19) = 15 stale jobs — was the first clue. The listener was inflating desired_runners to 39 instead of the expected ~24 (19 busy + 5 minRunners).

This wasn't a transient burst. Looking at the desired_runners - busy_runners gap over time:

  • Pre-incident (72-48h prior): gap was 5-8 during work hours, dropping to 5 (minRunners) overnight. Normal.
  • Post-incident (last 24h): gap jumped to 16-25 and stayed there permanently, never dropping below ~16 even overnight when real job volume was near zero.
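For anyone monitoring for this pattern, the persistent gap can be alerted on directly. A sketch of a Prometheus alert rule, assuming the listener metric names above (the `for` window and gap threshold are arbitrary choices you'd tune to your traffic):

```yaml
groups:
  - name: arc-stale-assignments
    rules:
      - alert: ARCStaleAssignedJobs
        # Fires when assigned jobs exceed running jobs by a fixed margin
        # for a sustained period. A transient queue burst clears quickly;
        # the stale post-incident state does not.
        expr: (gha_assigned_jobs - gha_running_jobs) > 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ARC listener shows a persistent assigned/running job gap"
          description: "gha_assigned_jobs has exceeded gha_running_jobs by >5 for 30m; possible stale TotalAssignedJobs from the GitHub Actions service."
```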

Grafana dashboard (screenshot) showing the 48h timeline across runners (busy/idle/desired/max), jobs (assigned/running), and job results (cancelled/failed/succeeded). The persistent divergence between assigned and running in the Jobs panel is the stale state, and the elevated desired in the Runners panel is the resulting over-provisioning.

The controller was creating runner pods to satisfy the inflated demand:

```
gha_controller_running_ephemeral_runners: 23
gha_controller_pending_ephemeral_runners: 6
```

These extra runners had no real work. Meanwhile, real job startup latency degraded:

| Phase | p95 startup latency |
|---|---|
| Pre-incident | 20-55s |
| Incident onset | 582-822s |
| Incident peak | 1,727s (~29 min) |
| Sustained (24h later) | 242-1,540s |

What we tried

  1. Deleted the listener pod — it restarted with a new IP and reconnected to the message session, but TotalAssignedJobs came back at the same inflated value. The stale state is on GitHub's side, not in the pod.

  2. Deleted the AutoscalingRunnerSet CR — this forced a complete deregistration and re-registration with the GitHub Actions service. The recreated listener immediately showed assigned_jobs == running_jobs with no stale gap.

Post-fix metrics:

| Metric | Before (broken) | After (fixed) |
|---|---|---|
| `assigned_jobs` | 34 | 20 |
| `running_jobs` | 19 | 20 |
| `desired_runners` | 39 | 26 |
| `busy_runners` | 19 | 20 |
| `idle_runners` | 0 | 1 |
| Stale gap | 15 | 0 |

Root cause analysis

We traced the code path in 0.13.1. The listener has no local job state — TotalAssignedJobs flows directly from the GitHub Actions service through to the scaling decision with no validation:

```
RunnerScaleSetMessage.Statistics.TotalAssignedJobs      (types.go:130)
  → listener.handleMessage() publishes to metrics       (listener.go:189)
  → handler.HandleDesiredRunnerCount(TotalAssignedJobs) (listener.go:217)
    → worker.setDesiredWorkerState(count)               (worker.go:225)
      → targetRunnerCount = min(MinRunners + count, MaxRunners)
```

The listener trusts TotalAssignedJobs completely. There's no reconciliation against TotalRunningJobs, no TTL on stale assignments, and no mechanism to detect that the gap between assigned and running has become permanent.

When the GitHub Actions service fails to clean up cancelled/timed-out job assignments after a platform incident, the inflated TotalAssignedJobs persists in the message session. The listener faithfully reports it, the worker faithfully scales for it, and the controller creates runners that have no work to do.

To Reproduce

  1. Run ARC with a scale set registered to an org with active CI traffic
  2. A GitHub Actions platform incident assigns jobs to the scale set but prevents them from executing
  3. After the incident resolves, observe that gha_assigned_jobs remains higher than gha_running_jobs
  4. Restart the listener pod — the gap persists (state is on GitHub's side)
  5. Delete the AutoscalingRunnerSet CR and let it be recreated — the gap clears

Proposed mitigation

The underlying bug is that the GitHub Actions service reports stale TotalAssignedJobs after incidents. That's a server-side issue. But the listener could add defense-in-depth to limit the blast radius:

  1. Staleness detection with session reconnect — If TotalAssignedJobs - TotalRunningJobs remains constant (same gap) across N consecutive message cycles, force a session delete/recreate. This would prompt the service to recompute statistics from scratch. This seems like the most targeted fix since it directly addresses the stuck state.

  2. Sanity cap on desired runners — If TotalAssignedJobs significantly exceeds TotalRunningJobs + TotalBusyRunners for a sustained period, cap the scaling input at TotalRunningJobs rather than TotalAssignedJobs. This prevents over-provisioning but doesn't clear the stale state.

  3. Periodic session rotation — Delete and recreate the message session on a configurable interval (e.g., every 6 hours). Blunter than option 1 but simpler to implement and protects against any form of session state drift.

Happy to submit a PR for whichever approach the maintainers prefer.

Environment:

  • Infrastructure: GKE (Google Kubernetes Engine)
  • ARC version: 0.13.1 (gha-runner-scale-set and gha-runner-scale-set-controller)
  • Deployment: Helm, org-level runner group, ephemeral runners with dind
  • Scale: ~100 repos, minRunners=5, maxRunners=80

Workaround

Delete the AutoscalingRunnerSet CR. The controller's finalizer will gracefully drain running jobs before completing deletion. Once deleted, your GitOps tool (or Helm/kubectl) recreates the CR with a fresh registration, clearing the stale state.
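Concretely, the workaround looks like this (a sketch; `arc-runners` and `my-scale-set` are placeholder names, substitute your own namespace, release name, and values file):

```shell
# Inspect the current scale set and its reported state
kubectl get autoscalingrunnersets -n arc-runners

# Delete the CR; the finalizer drains running jobs before removal completes
kubectl delete autoscalingrunnerset my-scale-set -n arc-runners

# If managed by Helm rather than GitOps, reinstall to recreate it
helm upgrade --install my-scale-set \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  -n arc-runners -f values.yaml
```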

Restarting the listener pod does not work.
