Description
Ephemeral (one-time-use) JIT runners that successfully complete a job are frequently reported by GitHub as "The self-hosted runner lost communication with the server," even though:
- The worker process exits with code 100 (success)
- CompleteJobAsync succeeds — the listener logs "finish job request for job {id} with result: Succeeded"
- The broker confirms the state change — "Received job status event. JobState: Online"
- The runner cleanly deletes its session and exits with return code 0
The GitHub UI then either:
- Shows the job stuck as "queued" or "in_progress" for 10-30+ minutes before eventually updating
- Shows "The self-hosted runner lost communication with the server" despite the job having completed successfully
- In some cases, never delivers the workflow_job.completed webhook
Root Cause
In Runner.cs line 576, after the one-time-use job completes, the message queue is cancelled immediately with zero grace period:
// Lines 570-576
Task completeTask = await Task.WhenAny(getNextMessage, jobDispatcher.RunOnceJobCompleted.Task);
if (completeTask == jobDispatcher.RunOnceJobCompleted.Task)
{
    runOnceJobCompleted = true;
    Trace.Info("Job has finished at backend, the runner will exit since it is running under onetime use mode.");
    Trace.Info("Stop message queue looping.");
    messageQueueLoopTokenSource.Cancel(); // <-- immediate teardown, no grace period
This cancels the in-flight broker long-poll (GET broker.actions.githubusercontent.com/message), which severs the TCP connection. GitHub's broker health monitoring detects the disconnect and flags "runner lost communication" — before GitHub's internal pipeline service has propagated the job completion to the webhook/UI systems.
The race is between two independent GitHub backend systems:
- Pipeline service — received CompleteJobAsync, knows the job succeeded
- Broker health monitor — sees TCP disconnect, flags the runner as lost
When the broker health monitor wins the race, the job is marked as failed/lost despite having completed.
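The client-side half of this race can be illustrated in isolation. The following is a minimal sketch, not the runner's code: `TeardownSketch`, `FakeLongPollAsync`, and `MeasureTeardownAsync` are hypothetical names, and the broker long-poll is replaced by a `Task.Delay` that stays pending until its token is cancelled. It only shows how quickly the pending request is aborted once the completion signal fires; it says nothing about how GitHub's backend reacts.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

public static class TeardownSketch
{
    // Hypothetical stand-in for the broker long-poll: pending until the token is cancelled.
    private static Task FakeLongPollAsync(CancellationToken token) =>
        Task.Delay(Timeout.InfiniteTimeSpan, token);

    // Returns how long the simulated long-poll stayed open after the
    // job-completed signal was observed.
    public static async Task<TimeSpan> MeasureTeardownAsync(TimeSpan gracePeriod)
    {
        using var loopTokenSource = new CancellationTokenSource();
        var runOnceJobCompleted = new TaskCompletionSource<bool>();

        Task getNextMessage = FakeLongPollAsync(loopTokenSource.Token);
        runOnceJobCompleted.SetResult(true); // simulate "job has finished at backend"

        Task completeTask = await Task.WhenAny(getNextMessage, runOnceJobCompleted.Task);
        var stopwatch = Stopwatch.StartNew();
        if (completeTask == runOnceJobCompleted.Task)
        {
            // TimeSpan.Zero models the immediate cancellation described above.
            await Task.Delay(gracePeriod);
            loopTokenSource.Cancel();
        }

        try
        {
            await getNextMessage;
        }
        catch (OperationCanceledException)
        {
            // The pending poll was aborted, matching the TaskCanceledException in the logs.
        }
        return stopwatch.Elapsed;
    }
}
```

Calling `MeasureTeardownAsync(TimeSpan.Zero)` models the current behavior, where the poll is torn down as soon as the completion signal is seen; passing a non-zero value models keeping the connection open for that long first.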
Evidence
Runner diagnostic logs showing the complete successful flow immediately followed by the forced disconnect:
[08:03:33Z INFO] Worker finished for job f19fed06-... Code: 100
[08:03:33Z INFO] finish job request for job f19fed06-... with result: Succeeded
[08:03:33Z INFO] Job X Build completed with result: Succeeded
[08:03:33Z INFO] JobCompleted Notification
[08:03:33Z INFO] Received job status event. JobState: Online ← GitHub acknowledged completion
[08:03:33Z INFO] Fire signal for one time used runner.
[08:03:33Z INFO] Job has finished at backend...
[08:03:33Z INFO] Stop message queue looping.
[08:03:33Z WARN] GET request to broker.actions.githubusercontent.com/message... has been cancelled.
[08:03:33Z ERR ] TaskCanceledException: The operation was canceled. ← Broker sees disconnect
[08:03:33Z INFO] Job request f1918d06-... processed succeed.
[08:03:33Z INFO] Deleting Runner Session...
[08:03:33Z INFO] Runner execution has finished with return code 0
All timestamps are identical (08:03:33Z), so at the log's one-second resolution there is no measurable delay between completion acknowledgment and broker teardown.
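To check many runs for the same pattern, the gap between the acknowledgment line and the cancellation line can be computed from the logs. This is a rough sketch that assumes the abbreviated `[HH:mm:ssZ LEVEL] message` layout shown in the excerpt above (real diagnostic logs may carry a date and component name, in which case the pattern would need adjusting). `LogGapSketch` and its methods are hypothetical helper names.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;

public static class LogGapSketch
{
    // Assumed line layout: "[08:03:33Z INFO] message" (also tolerates "ERR ]").
    private static readonly Regex LinePattern =
        new Regex(@"^\[(\d{2}:\d{2}:\d{2})Z\s+\w+\s*\]\s*(.*)$");

    // Time of day of the first line whose message contains the marker, or null.
    private static TimeSpan? FirstTimestamp(IReadOnlyList<string> lines, string marker)
    {
        foreach (string line in lines)
        {
            Match m = LinePattern.Match(line);
            if (m.Success && m.Groups[2].Value.Contains(marker))
            {
                return TimeSpan.ParseExact(
                    m.Groups[1].Value, @"hh\:mm\:ss", CultureInfo.InvariantCulture);
            }
        }
        return null;
    }

    // Gap between the broker acknowledgment and the cancelled long-poll,
    // or null if either line is missing.
    public static TimeSpan? AckToDisconnectGap(IReadOnlyList<string> lines)
    {
        TimeSpan? ack = FirstTimestamp(lines, "Received job status event");
        TimeSpan? cancel = FirstTimestamp(lines, "has been cancelled");
        if (ack == null || cancel == null)
        {
            return null;
        }
        return cancel.Value - ack.Value;
    }
}
```

Because the timestamps carry only whole seconds, a result of zero means "under one second", not literally no delay. The sketch also ignores runs that cross midnight.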
GitHub UI result: "The self-hosted runner lost communication with the server" — despite the job completing successfully.
Reproduction
- Use ephemeral JIT runners (Ephemeral: true, UseV2Flow: true)
- Run any short job (< 60s makes the race more likely)
- Observe that the runner logs show successful completion followed by an immediate broker disconnect
- Observe that the GitHub UI shows "lost communication" or stays stuck in queued/in_progress
Tested on runner versions 2.331.0 and 2.333.0 — same behavior.
Proposed Fix
Add a brief grace period (e.g., 5 seconds) before cancelling the message queue, allowing GitHub's backend systems to propagate the completion:
runOnceJobCompleted = true;
Trace.Info("Job has finished at backend, the runner will exit since it is running under onetime use mode.");
// Grace period: keep broker connection alive so GitHub's backend can
// propagate job completion before seeing the runner disconnect.
await Task.Delay(TimeSpan.FromSeconds(5));
Trace.Info("Stop message queue looping.");
messageQueueLoopTokenSource.Cancel();
This ensures the broker connection remains healthy while GitHub's pipeline service syncs the completion status to its webhook and UI systems.
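One possible refinement, offered as a sketch and not as part of the proposed patch: the grace period could be cut short if the host is asked to shut down while it is waiting. `GracePeriodSketch`, `WaitGracePeriodAsync`, and the `shutdownToken` parameter are hypothetical; which token in Runner.cs, if any, would play that role is an assumption left open here.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class GracePeriodSketch
{
    // Waits up to gracePeriod before the caller cancels the message queue.
    // Returns true if the full period elapsed, false if shutdownToken was
    // cancelled first. It never throws on cancellation, so teardown can
    // proceed either way.
    public static async Task<bool> WaitGracePeriodAsync(
        TimeSpan gracePeriod, CancellationToken shutdownToken)
    {
        try
        {
            await Task.Delay(gracePeriod, shutdownToken);
            return true;
        }
        catch (OperationCanceledException)
        {
            return false;
        }
    }
}
```

In the snippet above, a call such as `await GracePeriodSketch.WaitGracePeriodAsync(TimeSpan.FromSeconds(5), shutdownToken);` would take the place of the plain `Task.Delay`, with `messageQueueLoopTokenSource.Cancel()` still running afterwards in both cases.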
Environment
- Runner versions: 2.331.0, 2.333.0
- OS: Ubuntu 24.04
- Mode: Ephemeral JIT runner (--jitconfig, org-level, V2 broker flow)
- Scale: hundreds of jobs per day; the issue affects roughly 5-10% of runs
Related Issues
- The self-hosted runner: xxx lost communication with the server #3539 (same error message; different root cause: resource starvation)
- The self-hosted runner: xxx lost communication with the server #3981 (same error message; references #2624)
- Workflow failure due to runner shutdown/stoppage #2040 (runner shutdown/stoppage; open since 2022)