Listener silently exits broker-reconnect loop after AAD credential-refresh (ghost-busy) #4446

@OrangeBoatPencil

Describe the bug

On self-hosted macOS runners (v2.334.0), after a long-running session encounters a broker disconnect followed by a credential refresh, the Runner.Listener process can silently exit its broker-reconnect loop. The OS process stays alive (parked on a pthread_cond_wait) but holds no ESTABLISHED TCP socket to the broker, writes no further diag log entries, and accepts no new jobs.

Critically: the broker side of the agent state still shows the runner as busy from its last in-flight job, so the queue stalls behind the phantom. Only launchctl bootout + bootstrap of the LaunchAgent (or equivalent hard process restart) clears the state. The runner does not self-recover.

We observed this simultaneously on 4 of 4 macOS-arm64 runners on the same host within a 32-minute window, freezing our queue for ~4 hours.

Runner

  • Version: v2.334.0
  • OS: macOS 26.3.1 (Apple Silicon, M3 Ultra)
  • Service supervision: launchd LaunchAgent (actions.runner.<owner-repo>.<name>)
  • Repo-scoped (not org-scoped) registration
  • 4 ephemeral=false runners on one host, ~10 other runners for a different repo on the same host

Expected behavior

On any unrecoverable broker session error, the listener should either:

  1. Surface a fatal error and exit (so the supervisor restarts it), or
  2. Keep retrying the reconnect indefinitely with visible diag log entries.
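The two acceptable outcomes above can be sketched as a small reconnect loop. This is not the runner's code (the runner is C#); it is a minimal Python illustration of the behavior we are asking for. `FatalSessionError`, `run_reconnect_loop`, `main`, and the backoff numbers are all hypothetical names and values chosen for the sketch.

```python
import logging
import time

log = logging.getLogger("listener-sketch")


class FatalSessionError(Exception):
    """Hypothetical marker for errors that retrying cannot fix."""


def run_reconnect_loop(connect, max_backoff=60.0, sleep=time.sleep):
    """Call connect() until it returns a session.

    Every failed attempt is logged, so the loop never goes quiet.
    A FatalSessionError is re-raised to the caller instead of being swallowed.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            session = connect()
            log.info("broker session established on attempt %d", attempt)
            return session
        except FatalSessionError:
            log.critical("unrecoverable session error on attempt %d", attempt)
            raise
        except Exception as exc:  # transient: log it, back off, retry
            delay = min(max_backoff, 2.0 ** min(attempt, 10))
            log.warning("attempt %d failed (%r); retrying in %.0fs",
                        attempt, exc, delay)
            sleep(delay)


def main(connect):
    """Map the two outcomes onto exit codes a supervisor can act on."""
    try:
        run_reconnect_loop(connect)
    except FatalSessionError:
        return 1  # non-zero, so the supervisor restarts the process
    return 0
```

The point of the sketch is that there is no third path: the loop either returns a session, keeps logging retries, or raises. The wedge described in this report looks like a path that does none of the three.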

Actual behavior

The listener handled dozens of routine SocketException (89): Operation canceled events successfully across an ~11-hour run (these are the normal long-poll cancel/retry pattern). Then at a credential-refresh boundary it logged this sequence and went completely silent:

[2026-05-22 08:36:41Z INFO RSAFileKeyManager] Loading RSA key parameters from file <path>/.credentials_rsaparams
[2026-05-22 08:36:42Z INFO GitHubActionsService] AAD Correlation ID for this token request: Unknown

No further log entries appeared for more than 30 minutes, and the listener held no TCP socket to the broker (a TCP socket scan against the listener PID returned empty). The process was alive but parked: per sample <pid> 3, the main thread sat in ObjectNative::WaitTimeout -> SyncBlock::Wait -> _pthread_cond_wait.

All four listeners on the host hit the same final-log signature within a 32-minute window (08:11-08:43 UTC), suggesting that an upstream broker-session event (TLS keepalive teardown, broker rolling restart, or scheduled session expiry policy) triggered the wedge. Whatever the trigger, the runner client failed to re-establish its broker session.

Reproduction

Not deterministically reproducible from a fresh runner, but reliably observed in the steady-state of long-lived self-hosted runners. The trigger appears to require:

  • Listener lifetime > a few hours (so credential refresh happens)
  • A broker disconnect concurrent with or immediately following an AAD credential refresh
  • macOS Apple Silicon (we have not observed it on Linux x64)

Workaround

An out-of-band watchdog that probes for an ESTABLISHED TCP socket on the listener PID and for the freshness of _diag/Runner_*.log, and runs launchctl bootout/bootstrap if both probes fail. Reference implementation: https://github.com/EVCA-Org/evca-web-app/pull/16130
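For readers who cannot open the linked PR, the watchdog logic can be sketched roughly as follows. This is not the reference implementation, only a minimal Python outline of the two probes and the restart. The function names, the 1800-second staleness threshold (taken from the ">30 min" of silence we observed), and the gui/<uid> launchd domain are assumptions for illustration.

```python
import glob
import os
import subprocess
import time


def has_established_tcp(pid):
    """True if lsof reports at least one ESTABLISHED TCP socket for pid."""
    # -a ANDs the filters, -p selects the process, -n/-P skip name lookups.
    result = subprocess.run(
        ["lsof", "-a", "-p", str(pid), "-iTCP", "-sTCP:ESTABLISHED", "-n", "-P"],
        capture_output=True, text=True, check=False,
    )
    return "ESTABLISHED" in result.stdout


def newest_log_age(diag_dir, now=None):
    """Seconds since the newest Runner_*.log was modified, or None if absent."""
    paths = glob.glob(os.path.join(diag_dir, "Runner_*.log"))
    if not paths:
        return None
    now = time.time() if now is None else now
    return now - max(os.path.getmtime(p) for p in paths)


def is_wedged(tcp_ok, log_age, max_log_age=1800.0):
    """Wedged only when BOTH probes fail: no socket and a stale or missing log."""
    log_stale = log_age is None or log_age > max_log_age
    return (not tcp_ok) and log_stale


def restart_launch_agent(label, plist_path):
    """Hard-restart the LaunchAgent (assumes the current user's GUI domain)."""
    domain = f"gui/{os.getuid()}"
    subprocess.run(["launchctl", "bootout", f"{domain}/{label}"], check=False)
    subprocess.run(["launchctl", "bootstrap", domain, plist_path], check=True)
```

A caller would run `is_wedged(has_established_tcp(pid), newest_log_age(diag_dir))` on a timer and call `restart_launch_agent` only when it returns True. Requiring both probes to fail keeps the watchdog from restarting a listener that is merely between long-polls. A production version would likely also want a short delay or retry between bootout and bootstrap.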

Asks

  1. Detect the silent-exit path and either exit non-zero (so a supervisor restarts) or keep retrying loudly.
  2. After regaining the broker session, re-sync agent state so any in-flight job that was mid-acknowledgement at the moment of disconnect is correctly transitioned out of "busy" on the broker side.
  3. If there is already an internal channel for the "Correlation ID: Unknown" AAD response, surface it in the diag log instead of swallowing it.

Full diag logs for all four affected listeners are available; happy to attach them if helpful for diagnosis.
