`/runners` REST API reports `busy: false` for a self-hosted runner that is actively executing a job (broker says busy, REST says idle)

## Summary

On non-ephemeral self-hosted runners managed by the [`github-aws-runners/terraform-aws-github-runner`](https://github.com/github-aws-runners/terraform-aws-github-runner) module, we observed two incidents where:

1. A runner successfully completed Job A.
2. The same runner picked up Job B and began executing it.
3. The runner reported `JobState: Busy` to `broker.actions.githubusercontent.com` for Job B and successfully renewed the job lease every 60s.
4. Despite (3), `GET /repos/{owner}/{repo}/actions/runners/{runner_id}` returned `"busy": false` while Job B was running.
5. Our auto-scaling Lambda queried the REST API, saw `busy: false`, and terminated the runner instance — killing Job B mid-execution.

This appears to be a state desynchronization between the broker (`broker.actions.githubusercontent.com`) and the REST API (`api.github.com/repos/.../actions/runners/{id}`). The broker knows the runner is busy; the REST API does not.
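Steps 4 and 5 amount to a decision of roughly the following shape. This is a minimal sketch, not the module's actual Lambda code: the function name `should_terminate`, the runner `id`, and the sample payloads are hypothetical. Only the `busy` field is taken from the REST response described above.

```python
import json

def should_terminate(runner_payload: str) -> bool:
    """Hypothetical scale-down check: a runner reported as not busy is treated as idle."""
    runner = json.loads(runner_payload)
    # The decision trusts the REST API's `busy` flag alone; it has no view of broker state.
    return runner["busy"] is False

# Illustrative payloads shaped like the response of
# GET /repos/{owner}/{repo}/actions/runners/{runner_id} (id is a placeholder).
during_job_expected = '{"id": 1, "busy": true}'
during_job_observed = '{"id": 1, "busy": false}'

print(should_terminate(during_job_expected))  # False
print(should_terminate(during_job_observed))  # True
```

With the second payload, a check like this terminates the instance even though the runner is mid-job, which is the failure we saw.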

## Environment

- Self-hosted runners on AWS EC2, Ubuntu 24.04.4 LTS
- Runner version: 2.334.0
- Managed by [`github-aws-runners/terraform-aws-github-runner`](https://github.com/github-aws-runners/terraform-aws-github-runner) v6.6.0
- v2 flow (`useV2Flow: True`, `BrokerMessageListener`)
- JIT runner config (`enable_jit_config: true`)
- Non-ephemeral runners (`enable_ephemeral_runners: false`) — i.e., a single runner instance can execute multiple jobs sequentially

## Reproduction (observed twice in production, ~10 minutes apart)

The shared shape is:

1. A non-ephemeral self-hosted runner instance boots.
2. It executes Job A successfully.
3. After a short idle period (seconds to minutes), it picks up Job B.
4. Between roughly two and ten minutes into Job B, an external poller queries `GET /repos/.../actions/runners/{runner_id}` and receives `"busy": false`.

We have not yet identified the precise triggering event in the runner's lifecycle, but the pattern occurred twice in our CI within 10 minutes, on independent runners.

## Incident 1

Runner instance `i-06cbe04821ccba677`, AgentId `18700`.

```
01:47:02  Picks up Job A ("Test")
01:55:50  Job A completes SUCCESS
~02:00    Picks up Job B ("Package & Publish")
02:00:12  External REST poll: "busy": true   ✓
02:05:09  External REST poll: "busy": true   ✓
02:06:30  ERR  BrokerServer  TaskCanceledException on
          GET broker.actions.githubusercontent.com/message?...
02:06:30  WARN BrokerServer  Back off 6.743 seconds before next retry. 4 attempt left.
02:07:48  INFO BrokerMessageListener  Acknowledging runner request '697d350f-...'  (recovered)
02:07:49  INFO Terminal      WRITE LINE: Running job: Package & Publish
02:07:49  INFO JobDispatcher Successfully renew job, valid till 02:17:49
02:08:49  INFO JobDispatcher Successfully renew job, valid till 02:18:49
02:09:49  INFO JobDispatcher Successfully renew job, valid till 02:19:49
02:10:08  External REST poll: "busy": false  ← desync
02:10:09  Instance terminated by auto-scaler
02:10:24  Job marked failure
```

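Note that the REST API correctly reported `busy: true` for at least 5 minutes after Job B started (polls at 02:00:12 and 02:05:09), then flipped to `busy: false` at some point before the 02:10:08 poll, while the runner held a valid lease. The only runner-side anomaly in the excerpt above within that window is the broker poll error at 02:06:30, which had recovered by 02:07:48 at the latest.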
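The intervals in the Incident 1 timeline can be checked with a few lines of arithmetic. This is a sketch that uses only timestamps from the log excerpt above; the variable names and the helper `t` are ours, and all timestamps are assumed to fall on the same day.

```python
from datetime import datetime

def t(hms: str) -> datetime:
    """Parse an HH:MM:SS timestamp from the timeline (same day assumed)."""
    return datetime.strptime(hms, "%H:%M:%S")

last_busy_true = t("02:05:09")    # last REST poll that returned busy: true
first_busy_false = t("02:10:08")  # first REST poll that returned busy: false
broker_error = t("02:06:30")      # TaskCanceledException on the broker poll
broker_recovered = t("02:07:48")  # runner request acknowledged again
terminated = t("02:10:09")        # instance terminated by the auto-scaler
lease_valid_till = t("02:19:49")  # from the 02:09:49 lease renewal

flip_window = first_busy_false - last_busy_true
broker_gap = broker_recovered - broker_error
lease_remaining = lease_valid_till - terminated

print(flip_window)      # 0:04:59
print(broker_gap)       # 0:01:18
print(lease_remaining)  # 0:09:40
# The broker error and recovery both fall inside the window in which REST flipped.
print(last_busy_true < broker_error < broker_recovered < first_busy_false)  # True
```

The REST flag therefore changed somewhere in a window of just under five minutes that also contains the broker error, and the instance was terminated with 9 minutes 40 seconds of lease remaining. The overlap is an observation, not evidence of cause.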

## Incident 2

Runner instance `i-0f3edd56994b7e40b`, AgentId `18701`.

```
01:47:06  INFO Runner       Refresh message received, kick-off selfupdate background process.
01:47:25  Runner v2.332.0 -> v2.334.0 after self-update; new BrokerMessageListener session
01:47:47  Picks up Job A (Atlas run 25836817925)
01:55:04  External REST poll: "busy": true   ✓
01:55:37  Job A completes SUCCESS
01:58:00  Picks up Job B (Atlas run 25837142256)
01:58:00  INFO JobDispatcher Successfully renew job, valid till 02:08:00
01:59:00  INFO JobDispatcher Successfully renew job, valid till 02:09:00
02:00:00  INFO JobDispatcher Successfully renew job, valid till 02:10:00
02:00:11  External REST poll: "busy": false  ← never updated for Job B
02:00:12  Instance terminated by auto-scaler
02:00:12  ##[error]The runner has received a shutdown signal.
02:00:26  Job marked failure
```

In Incident 2, the only REST poll during Job B (02:00:11, about two minutes after pick-up) returned `busy: false`. The REST API apparently never picked up the `busy: true` transition for Job B, despite the runner having been successfully assigned the job and reporting `JobState: Busy` to the broker.

## Why we believe this is a bug

- The runner reports its busy state by calling the broker. The broker successfully receives, acknowledges, and renews the job lease.
- The REST `/runners` endpoint is documented as returning the runner's current busy state. We expect this to be consistent with what the broker knows.
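- The only runner-side error in either log excerpt is the broker poll timeout in Incident 1 (02:06:30), which had recovered by 02:07:48 at the latest, more than two minutes before termination. In both incidents the runner was healthy and renewing its lease right up to the moment it was killed.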
- This breaks any auto-scaling system (including the [recommended `terraform-aws-github-runner` module](https://github.com/github-aws-runners/terraform-aws-github-runner)) that uses the `/runners` REST API to determine which runners are idle.

## Workaround

Setting `enable_ephemeral_runners = true` in the module makes each runner instance handle a single job and terminate. There is no second job for the desync to affect. We have applied this on our infrastructure as a workaround.
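For setups that must stay non-ephemeral, one possible consumer-side guard is to require several consecutive idle readings before scaling down. The sketch below is hypothetical: the class `IdleDebouncer`, its threshold, and the runner id are illustrative and are not part of the module. We have not verified that it avoids this desync. It only narrows exposure to a transient stale reading and does not help if the REST API stays at `busy: false` for the whole job, as may have happened in Incident 2.

```python
from dataclasses import dataclass, field

@dataclass
class IdleDebouncer:
    """Hypothetical guard: require several consecutive idle polls before scale-down."""
    required_idle_polls: int = 3  # assumed threshold; tune to the polling interval
    _streaks: dict = field(default_factory=dict)

    def observe(self, runner_id: int, busy: bool) -> bool:
        """Record one REST poll; return True only once the idle streak is long enough."""
        if busy:
            self._streaks[runner_id] = 0  # any busy reading resets the streak
            return False
        self._streaks[runner_id] = self._streaks.get(runner_id, 0) + 1
        return self._streaks[runner_id] >= self.required_idle_polls

guard = IdleDebouncer(required_idle_polls=3)
polls = [True, True, False, False, False]  # illustrative sequence of `busy` readings
print([guard.observe(1, busy) for busy in polls])
# [False, False, False, False, True]
```

The trade-off is slower scale-down: with a five-minute polling interval, a threshold of three delays termination of a genuinely idle runner by roughly ten extra minutes.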

## Asks

1. Acknowledgement of the desync between the broker and the `/runners` REST API.
2. Either a fix to keep the REST API consistent with the broker's busy state, or guidance to consumers (auto-scaling tools, monitoring) on how to determine "this runner is actually busy" reliably.
3. If there's a separate API endpoint that always reflects the broker's view, please point to it.

We can provide additional runner `_diag/Runner_*.log` excerpts on request.


`/runners` REST API reports `busy: false` for a self-hosted runner that is actively executing a job (broker says busy, REST says idle) #4422
