listener: stale TotalAssignedJobs from GitHub Actions service causes permanent over-provisioning after platform incidents #4397

@jameshounshell

Description

Describe the bug

After a GitHub Actions platform incident, the TotalAssignedJobs value in RunnerScaleSetStatistic gets stuck at an inflated count that never recovers. The listener passes this value directly to the scaling calculation, causing the controller to permanently over-provision runners for phantom demand. The only fix is deleting the AutoscalingRunnerSet CR — listener pod restarts don't clear the state because it comes from the GitHub Actions service, not the listener.

We're running ARC 0.13.1 on GKE with an org-level runner group serving ~100 repos.

How we tracked it down

Started with Prometheus. The listener metrics were showing a persistent gap between gha_assigned_jobs and gha_running_jobs:

| Metric | Value | Source |
|---|---|---|
| `gha_assigned_jobs` | 34 | Listener (`TotalAssignedJobs`) |
| `gha_running_jobs` | 19 | Listener (`TotalRunningJobs`) |
| `gha_desired_runners` | 39 | Listener (calculated) |
| `gha_busy_runners` | 19 | Listener (`TotalBusyRunners`) |
| `gha_idle_runners` | 0 | Listener (`TotalIdleRunners`) |
| `gha_min_runners` | 5 | Config |
| `gha_max_runners` | 80 | Config |

The gap — assigned(34) - running(19) = 15 stale jobs — was the first clue. The listener was inflating desired_runners to 39 instead of the expected ~24 (19 busy + 5 minRunners).

This wasn't a transient burst. Looking at the desired_runners - busy_runners gap over time:

  • Pre-incident (72-48h prior): gap was 5-8 during work hours, dropping to 5 (minRunners) overnight. Normal.
  • Post-incident (last 24h): gap jumped to 16-25 and stayed there permanently, never dropping below ~16 even overnight when real job volume was near zero.
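For anyone monitoring for this pattern, the persistent gap can be alerted on directly. A sketch of a Prometheus alert rule, assuming the listener metric names above (the `for` window and gap threshold are arbitrary choices you'd tune to your traffic):

```yaml
groups:
  - name: arc-stale-assignments
    rules:
      - alert: ARCStaleAssignedJobs
        # Fires when assigned jobs exceed running jobs by a fixed margin
        # for a sustained period. A transient queue burst clears quickly;
        # the stale post-incident state does not.
        expr: (gha_assigned_jobs - gha_running_jobs) > 5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "ARC listener shows a persistent assigned/running job gap"
          description: "gha_assigned_jobs has exceeded gha_running_jobs by >5 for 30m; possible stale TotalAssignedJobs from the GitHub Actions service."
```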

Grafana dashboard (screenshot) showing the 48h timeline across runners (busy/idle/desired/max), jobs (assigned/running), and job results (cancelled/failed/succeeded). The persistent divergence between assigned and running in the Jobs panel is the stale state, and the elevated desired in the Runners panel is the resulting over-provisioning.

The controller was creating runner pods to satisfy the inflated demand:

```
gha_controller_running_ephemeral_runners: 23
gha_controller_pending_ephemeral_runners: 6
```

These extra runners had no real work. Meanwhile, real job startup latency degraded:

| Phase | p95 startup latency |
|---|---|
| Pre-incident | 20-55s |
| Incident onset | 582-822s |
| Incident peak | 1,727s (~29 min) |
| Sustained (24h later) | 242-1,540s |

What we tried

  1. Deleted the listener pod — it restarted with a new IP and reconnected to the message session, but TotalAssignedJobs came back at the same inflated value. The stale state is on GitHub's side, not in the pod.

  2. Deleted the AutoscalingRunnerSet CR — this forced a complete deregistration and re-registration with the GitHub Actions service. The recreated listener immediately showed assigned_jobs == running_jobs with no stale gap.

Post-fix metrics:

| Metric | Before (broken) | After (fixed) |
|---|---|---|
| `assigned_jobs` | 34 | 20 |
| `running_jobs` | 19 | 20 |
| `desired_runners` | 39 | 26 |
| `busy_runners` | 19 | 20 |
| `idle_runners` | 0 | 1 |
| Stale gap | 15 | 0 |

Root cause analysis

We traced the code path in 0.13.1. The listener has no local job state — TotalAssignedJobs flows directly from the GitHub Actions service through to the scaling decision with no validation:

```
RunnerScaleSetMessage.Statistics.TotalAssignedJobs      (types.go:130)
  → listener.handleMessage() publishes to metrics       (listener.go:189)
  → handler.HandleDesiredRunnerCount(TotalAssignedJobs) (listener.go:217)
    → worker.setDesiredWorkerState(count)               (worker.go:225)
      → targetRunnerCount = min(MinRunners + count, MaxRunners)
```

The listener trusts TotalAssignedJobs completely. There's no reconciliation against TotalRunningJobs, no TTL on stale assignments, and no mechanism to detect that the gap between assigned and running has become permanent.

When the GitHub Actions service fails to clean up cancelled/timed-out job assignments after a platform incident, the inflated TotalAssignedJobs persists in the message session. The listener faithfully reports it, the worker faithfully scales for it, and the controller creates runners that have no work to do.

To Reproduce

  1. Run ARC with a scale set registered to an org with active CI traffic
  2. A GitHub Actions platform incident assigns jobs to the scale set but prevents them from executing
  3. After the incident resolves, observe that gha_assigned_jobs remains higher than gha_running_jobs
  4. Restart the listener pod — the gap persists (state is on GitHub's side)
  5. Delete the AutoscalingRunnerSet CR and let it be recreated — the gap clears

Proposed mitigation

The underlying bug is that the GitHub Actions service reports stale TotalAssignedJobs after incidents. That's a server-side issue. But the listener could add defense-in-depth to limit the blast radius:

  1. Staleness detection with session reconnect — If TotalAssignedJobs - TotalRunningJobs remains constant (same gap) across N consecutive message cycles, force a session delete/recreate. This would prompt the service to recompute statistics from scratch. This seems like the most targeted fix since it directly addresses the stuck state.

  2. Sanity cap on desired runners — If TotalAssignedJobs significantly exceeds TotalRunningJobs + TotalBusyRunners for a sustained period, cap the scaling input at TotalRunningJobs rather than TotalAssignedJobs. This prevents over-provisioning but doesn't clear the stale state.

  3. Periodic session rotation — Delete and recreate the message session on a configurable interval (e.g., every 6 hours). Blunter than option 1 but simpler to implement and protects against any form of session state drift.

Happy to submit a PR for whichever approach the maintainers prefer.

Environment:

  • Infrastructure: GKE (Google Kubernetes Engine)
  • ARC version: 0.13.1 (gha-runner-scale-set and gha-runner-scale-set-controller)
  • Deployment: Helm, org-level runner group, ephemeral runners with dind
  • Scale: ~100 repos, minRunners=5, maxRunners=80

Workaround

Delete the AutoscalingRunnerSet CR. The controller's finalizer will gracefully drain running jobs before completing deletion. Once deleted, your GitOps tool (or Helm/kubectl) recreates the CR with a fresh registration, clearing the stale state.
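Concretely, the workaround looks like this (a sketch; `arc-runners` and `my-scale-set` are placeholder names, substitute your own namespace, release name, and values file):

```shell
# Inspect the current scale set and its reported state
kubectl get autoscalingrunnersets -n arc-runners

# Delete the CR; the finalizer drains running jobs before removal completes
kubectl delete autoscalingrunnerset my-scale-set -n arc-runners

# If managed by Helm rather than GitOps, reinstall to recreate it
helm upgrade --install my-scale-set \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  -n arc-runners -f values.yaml
```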

Restarting the listener pod does not work.
