Summary
On non-ephemeral self-hosted runners managed by the github-aws-runners/terraform-aws-github-runner module, we observed two incidents where:
- A runner successfully completed Job A.
- The same runner picked up Job B and began executing it.
- The runner reported JobState: Busy to broker.actions.githubusercontent.com for Job B and successfully renewed the job lease every 60s.
- Despite this, GET /repos/{owner}/{repo}/actions/runners/{runner_id} returned "busy": false while Job B was running.
- Our auto-scaling Lambda queried the REST API, saw busy: false, and terminated the runner instance, killing Job B mid-execution.
This appears to be a state desynchronization between the broker (broker.actions.githubusercontent.com) and the REST API (api.github.com/repos/.../actions/runners/{id}). The broker knows the runner is busy; the REST API does not.
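To make the failure mode concrete, here is a minimal sketch of the kind of check an external poller might run. It is not the module's actual Lambda code: fetch_runner and looks_idle are hypothetical names, the token handling is assumed, and the example payload is reduced to the busy field for illustration. The point is only that a decision based on the REST busy field alone inherits any staleness in that field.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_runner(owner, repo, runner_id, token):
    """Call GET /repos/{owner}/{repo}/actions/runners/{runner_id}.

    Hypothetical helper; returns the parsed JSON body as a dict.
    """
    req = urllib.request.Request(
        f"{API_ROOT}/repos/{owner}/{repo}/actions/runners/{runner_id}",
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def looks_idle(runner):
    """Hypothetical scale-down predicate that trusts the REST 'busy' field alone.

    A missing field is treated as "not idle" so that an unexpected
    response never triggers a termination by itself.
    """
    return runner.get("busy") is False


# Illustrative payload shaped like the response we received mid-job.
stale_response = {"busy": False}
print(looks_idle(stale_response))  # True, so the instance would be terminated
```

With a predicate like this, a single stale "busy": false response is enough to terminate a runner that is still holding a valid job lease.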
Environment
- Self-hosted runners on AWS EC2, Ubuntu 24.04.4 LTS
- Runner version: 2.334.0
- Managed by github-aws-runners/terraform-aws-github-runner v6.6.0
- v2 flow (useV2Flow: True, BrokerMessageListener)
- JIT runner config (enable_jit_config: true)
- Non-ephemeral runners (enable_ephemeral_runners: false), i.e., a single runner instance can execute multiple jobs sequentially
Reproduction (observed twice in production, ~10 minutes apart)
The shared shape is:
- A non-ephemeral self-hosted runner instance boots.
- It executes Job A successfully.
- After a short idle period (seconds to minutes), it picks up Job B.
- Several minutes into Job B, an external poller queries GET /repos/.../actions/runners/{runner_id} and receives "busy": false.
We have not yet identified the precise triggering event in the runner's lifecycle, but the pattern recurred twice in our CI within 10 minutes on independent runners.
Incident 1
Runner instance i-06cbe04821ccba677, AgentId 18700.
01:47:02 Picks up Job A ("Test")
01:55:50 Job A completes SUCCESS
~02:00 Picks up Job B ("Package & Publish")
02:00:12 External REST poll: "busy": true ✓
02:05:09 External REST poll: "busy": true ✓
02:06:30 ERR BrokerServer TaskCanceledException on GET broker.actions.githubusercontent.com/message?...
02:06:30 WARN BrokerServer Back off 6.743 seconds before next retry. 4 attempt left.
02:07:48 INFO BrokerMessageListener Acknowledging runner request '697d350f-...' (recovered)
02:07:49 INFO Terminal WRITE LINE: Running job: Package & Publish
02:07:49 INFO JobDispatcher Successfully renew job, valid till 02:17:49
02:08:49 INFO JobDispatcher Successfully renew job, valid till 02:18:49
02:09:49 INFO JobDispatcher Successfully renew job, valid till 02:19:49
02:10:08 External REST poll: "busy": false ← desync
02:10:09 Instance terminated by auto-scaler
02:10:24 Job marked failure
Note that the REST API correctly tracked busy: true for ~10 minutes after Job B started, then spontaneously flipped to busy: false while the runner was healthy with a valid lease.
Incident 2
Runner instance i-0f3edd56994b7e40b, AgentId 18701.
01:47:06 INFO Runner Refresh message received, kick-off selfupdate background process.
01:47:25 Runner v2.332.0 -> v2.334.0 after self-update; new BrokerMessageListener session
01:47:47 Picks up Job A (Atlas run 25836817925)
01:55:04 External REST poll: "busy": true ✓
01:55:37 Job A completes SUCCESS
01:58:00 Picks up Job B (Atlas run 25837142256)
01:58:00 INFO JobDispatcher Successfully renew job, valid till 02:08:00
01:59:00 INFO JobDispatcher Successfully renew job, valid till 02:09:00
02:00:00 INFO JobDispatcher Successfully renew job, valid till 02:10:00
02:00:11 External REST poll: "busy": false ← never updated for Job B
02:00:12 Instance terminated by auto-scaler
02:00:12 ##[error]The runner has received a shutdown signal.
02:00:26 Job marked failure
In Incident 2, the REST API apparently never picked up the busy: true transition for Job B at all, despite the runner having been successfully assigned the job and reporting JobState: Busy to the broker.
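As a quick check on the two timelines, the sketch below computes how much lease time remained at the moment each "busy": false poll came back. The timestamps are copied from the logs above; the helper names (seconds_between, lease_margin) and the INCIDENTS table are ours, and the calculation assumes both timestamps fall on the same day.

```python
from datetime import datetime

TIME_FORMAT = "%H:%M:%S"


def seconds_between(earlier, later):
    """Return whole seconds from `earlier` to `later` (same-day HH:MM:SS strings)."""
    delta = datetime.strptime(later, TIME_FORMAT) - datetime.strptime(earlier, TIME_FORMAT)
    return int(delta.total_seconds())


# Last successful lease renewal before the poll, and the poll itself,
# taken from the incident timelines.
INCIDENTS = {
    "Incident 1": {"poll": "02:10:08", "lease_valid_till": "02:19:49"},
    "Incident 2": {"poll": "02:00:11", "lease_valid_till": "02:10:00"},
}


def lease_margin(incident):
    """Seconds of lease validity left when the REST API reported busy: false."""
    return seconds_between(incident["poll"], incident["lease_valid_till"])


for name, incident in INCIDENTS.items():
    print(f"{name}: {lease_margin(incident)}s of lease remaining at poll time")
# Incident 1: 581s of lease remaining at poll time
# Incident 2: 589s of lease remaining at poll time
```

In both cases the runner held a lease with more than nine minutes left when the REST API reported it as not busy.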
Why we believe this is a bug
- The runner reports its busy state by calling the broker. The broker successfully receives, acknowledges, and renews the job lease.
- The REST /runners endpoint is documented as returning the runner's current busy state. We expect this to be consistent with what the broker knows.
- In both incidents, no runner-side error or disconnection persisted beyond a couple of seconds. The runner was healthy and lease-renewing right up to the moment it was killed.
- This breaks any auto-scaling system (including the recommended terraform-aws-github-runner module) that uses the /runners REST API to determine which runners are idle.
Workaround
Setting enable_ephemeral_runners = true in the module makes each runner instance handle a single job and then terminate, so there is no second job for the desync to affect. We have applied this on our infrastructure as a workaround.
Asks
- Acknowledgement of the desync between the broker and the /runners REST API.
- Either a fix to keep the REST API consistent with the broker's busy state, or guidance to consumers (auto-scaling tools, monitoring) on how to determine "this runner is actually busy" reliably.
- If there's a separate API endpoint that always reflects the broker's view, please point to it.
We can provide additional runner _diag/Runner_*.log excerpts on request.