/runners REST API reports busy: false for a self-hosted runner that is actively executing a job (broker says busy, REST says idle) #4422

@fabgo

Summary

On non-ephemeral self-hosted runners managed by the github-aws-runners/terraform-aws-github-runner module, we observed two incidents where:

  1. A runner successfully completed Job A.
  2. The same runner picked up Job B and began executing it.
  3. The runner reported JobState: Busy to broker.actions.githubusercontent.com for Job B and successfully renewed the job lease every 60s.
  4. Despite (3), GET /repos/{owner}/{repo}/actions/runners/{runner_id} returned "busy": false while Job B was running.
  5. Our auto-scaling Lambda queried the REST API, saw busy: false, and terminated the runner instance — killing Job B mid-execution.

This appears to be a state desynchronization between the broker (broker.actions.githubusercontent.com) and the REST API (api.github.com/repos/.../actions/runners/{id}). The broker knows the runner is busy; the REST API does not.
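For reference, the REST check described in step 4 can be sketched as follows. This is a minimal illustration and not the module's actual Lambda code: the function names are hypothetical, the token and repository values are placeholders, and only the endpoint path and the "busy" field come from the report above.

```python
import json
import urllib.request

API_ROOT = "https://api.github.com"


def build_runner_request(owner: str, repo: str, runner_id: int, token: str) -> urllib.request.Request:
    """Build the GET request for a single self-hosted runner (hypothetical helper)."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/actions/runners/{runner_id}"
    return urllib.request.Request(
        url,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
        method="GET",
    )


def parse_busy(body: bytes) -> bool:
    """Extract the "busy" flag from the runner JSON payload."""
    return bool(json.loads(body)["busy"])


def runner_is_busy(owner: str, repo: str, runner_id: int, token: str) -> bool:
    """Perform the call; requires network access and a valid token."""
    request = build_runner_request(owner, repo, runner_id, token)
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_busy(response.read())
```

A poller that acts on a single `False` result from `runner_is_busy` is exposed to exactly the failure described in step 5.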

Environment

  • Self-hosted runners on AWS EC2, Ubuntu 24.04.4 LTS
  • Runner version: 2.334.0
  • Managed by github-aws-runners/terraform-aws-github-runner v6.6.0
  • v2 flow (useV2Flow: True, BrokerMessageListener)
  • JIT runner config (enable_jit_config: true)
  • Non-ephemeral runners (enable_ephemeral_runners: false) — i.e., a single runner instance can execute multiple jobs sequentially

Reproduction (observed twice in production, ~10 minutes apart)

The shared shape is:

  1. A non-ephemeral self-hosted runner instance boots.
  2. It executes Job A successfully.
  3. After a short idle period (seconds to minutes), it picks up Job B.
  4. Several minutes into Job B, an external poller queries GET /repos/.../actions/runners/{runner_id} and receives "busy": false.

We have not yet identified the precise triggering event in the runner's lifecycle, but the pattern occurred twice in our CI within 10 minutes, on independent runners.

Incident 1

Runner instance i-06cbe04821ccba677, AgentId 18700.

01:47:02  Picks up Job A ("Test")
01:55:50  Job A completes SUCCESS
~02:00    Picks up Job B ("Package & Publish")
02:00:12  External REST poll: "busy": true   ✓
02:05:09  External REST poll: "busy": true   ✓
02:06:30  ERR  BrokerServer  TaskCanceledException on
          GET broker.actions.githubusercontent.com/message?...
02:06:30  WARN BrokerServer  Back off 6.743 seconds before next retry. 4 attempt left.
02:07:48  INFO BrokerMessageListener  Acknowledging runner request '697d350f-...'  (recovered)
02:07:49  INFO Terminal      WRITE LINE: Running job: Package & Publish
02:07:49  INFO JobDispatcher Successfully renew job, valid till 02:17:49
02:08:49  INFO JobDispatcher Successfully renew job, valid till 02:18:49
02:09:49  INFO JobDispatcher Successfully renew job, valid till 02:19:49
02:10:08  External REST poll: "busy": false  ← desync
02:10:09  Instance terminated by auto-scaler
02:10:24  Job marked failure

Note that the REST API reported busy: true on the polls at 02:00:12 and 02:05:09, then reported busy: false at 02:10:08, even though the runner was healthy and had renewed its lease 19 seconds earlier (valid till 02:19:49).

Incident 2

Runner instance i-0f3edd56994b7e40b, AgentId 18701.

01:47:06  INFO Runner       Refresh message received, kick-off selfupdate background process.
01:47:25  Runner v2.332.0 -> v2.334.0 after self-update; new BrokerMessageListener session
01:47:47  Picks up Job A (Atlas run 25836817925)
01:55:04  External REST poll: "busy": true   ✓
01:55:37  Job A completes SUCCESS
01:58:00  Picks up Job B (Atlas run 25837142256)
01:58:00  INFO JobDispatcher Successfully renew job, valid till 02:08:00
01:59:00  INFO JobDispatcher Successfully renew job, valid till 02:09:00
02:00:00  INFO JobDispatcher Successfully renew job, valid till 02:10:00
02:00:11  External REST poll: "busy": false  ← never updated for Job B
02:00:12  Instance terminated by auto-scaler
02:00:12  ##[error]The runner has received a shutdown signal.
02:00:26  Job marked failure

In Incident 2, the REST API apparently never picked up the busy: true transition for Job B at all, despite the runner having been successfully assigned the job and reporting JobState: Busy to the broker.
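The claim that the lease was valid at the moment of each desynced poll can be checked mechanically against the timelines above. The sketch below assumes the condensed "HH:MM:SS  INFO JobDispatcher Successfully renew job, valid till HH:MM:SS" format used in this report (not the raw _diag log format), and does not handle timelines that cross midnight. The function name is hypothetical.

```python
import re
from datetime import datetime

# Matches the condensed renewal lines shown in the incident timelines.
RENEW_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\s+INFO\s+JobDispatcher\s+"
    r"Successfully renew job, valid till (\d{2}:\d{2}:\d{2})"
)


def _parse_clock(value: str) -> datetime:
    """Parse HH:MM:SS into a datetime (same-day comparison only)."""
    return datetime.strptime(value, "%H:%M:%S")


def lease_valid_at(timeline_lines, poll_time: str) -> bool:
    """Return True if the latest renewal at or before poll_time is still valid."""
    poll = _parse_clock(poll_time)
    latest = None  # (renewed_at, valid_till) of the most recent renewal so far
    for line in timeline_lines:
        match = RENEW_RE.match(line.strip())
        if not match:
            continue
        renewed_at = _parse_clock(match.group(1))
        valid_till = _parse_clock(match.group(2))
        if renewed_at <= poll and (latest is None or renewed_at > latest[0]):
            latest = (renewed_at, valid_till)
    return latest is not None and poll < latest[1]
```

Feeding it the Incident 2 renewal lines with the poll time 02:00:11 checks that poll against the lease renewed at 02:00:00.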

Why we believe this is a bug

  • The runner reports its busy state by calling the broker. The broker successfully receives, acknowledges, and renews the job lease.
  • The REST /runners endpoint is documented as returning the runner's current busy state. We expect this to be consistent with what the broker knows.
  • In both incidents, no runner-side error or disconnection persisted beyond a couple of seconds. The runner was healthy and lease-renewing right up to the moment it was killed.
  • This breaks any auto-scaling system (including the recommended terraform-aws-github-runner module) that uses the /runners REST API to determine which runners are idle.
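To illustrate the exposure, here is a hypothetical consumer-side guard that only treats a runner as idle after several consecutive busy: false polls. This is an assumption-laden sketch, not behavior of terraform-aws-github-runner, and the class name and threshold are made up for illustration. It would only narrow the window: if the REST API stays desynced for the whole job, as Incident 2 suggests it can, the runner would still be terminated eventually.

```python
from collections import defaultdict


class IdleDebouncer:
    """Require N consecutive idle polls before allowing scale-down (hypothetical)."""

    def __init__(self, required_idle_polls: int = 3):
        if required_idle_polls < 1:
            raise ValueError("required_idle_polls must be at least 1")
        self.required = required_idle_polls
        self._idle_streak = defaultdict(int)  # runner_id -> consecutive idle polls

    def observe(self, runner_id: int, busy: bool) -> bool:
        """Record one REST poll; return True once the runner may be terminated."""
        if busy:
            # Any busy observation resets the streak.
            self._idle_streak[runner_id] = 0
            return False
        self._idle_streak[runner_id] += 1
        return self._idle_streak[runner_id] >= self.required
```

With the roughly five-minute poll interval seen in the timelines, a threshold of 3 would delay termination by about ten minutes after the first idle reading.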

Workaround

Setting enable_ephemeral_runners = true in the module makes each runner instance handle a single job and then terminate, so there is no second job for the desync to affect. We have applied this on our infrastructure as a workaround.

Asks

  1. Acknowledgement of the desync between the broker and the /runners REST API.
  2. Either a fix to keep the REST API consistent with the broker's busy state, or guidance to consumers (auto-scaling tools, monitoring) on how to determine "this runner is actually busy" reliably.
  3. If there's a separate API endpoint that always reflects the broker's view, please point to it.

We can provide additional runner _diag/Runner_*.log excerpts on request.
