Self-hosted runner gets stuck in active state, blocking queued jobs across multiple repositories #4312

@OcheOps

### Summary

We experienced an issue where a self-hosted GitHub Actions runner remained in an active (busy) state indefinitely, preventing new jobs from being picked up. This impacted multiple repositories using the same runner infrastructure.

### Impact

- Affected 3 repositories sharing the same self-hosted runner host
- Jobs remained in a queued state for several hours (3+)
- CI/CD pipelines (including SonarQube analysis) were blocked
- Manual intervention (a runner restart) was required to restore functionality
### Observed Behavior

- The runner appears as **Active** in the GitHub UI
- New jobs show:

  ```
  Waiting for a runner to pick up this job...
  Requested labels: self-hosted
  ```

- No new jobs are picked up despite the runner being online and connected
- Runner logs show it is listening for jobs, but an existing worker process remains attached
- System inspection shows a hanging process:

  ```
  npm run test:co
  └─ vitest run --coverage
  ```

- The runner does not automatically recover or release the job
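For reproduction purposes, the stuck worker can be confirmed from the host. This is a sketch using standard procps tools; `Runner.Worker` is the process the runner attaches to each job, and the process names to grep for are assumptions based on the output above:

```shell
# List runner worker processes with their elapsed time and state.
ps -eo pid,ppid,etimes,stat,cmd --forest \
  | grep -B1 -A2 '[R]unner.Worker' \
  || echo "no Runner.Worker process found"

# A hung test process typically shows a long elapsed time with ~0% CPU.
ps -o pid,etime,pcpu,cmd -C node 2>/dev/null \
  || echo "no node processes found"
```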
### Expected Behavior

The runner should:

- Detect and terminate stalled/hung jobs
- Return to the Idle state after a job timeout or failure
- Continue picking up queued jobs without a manual restart
### Temporary Resolution

Restarting the runner service resolves the issue:

```shell
sudo systemctl restart actions.runner...service
```

After the restart:

- The runner returns to Idle
- Queued jobs are picked up immediately
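Since several runners are installed on the same host, the restart can be scripted rather than done per service. A sketch, assuming the `actions.runner.*` unit-name pattern that the runner's `config.sh`/`svc.sh` installation creates:

```shell
#!/usr/bin/env sh
# Restart every installed Actions runner service on this host.
# Unit names are discovered from systemd rather than hard-coded.
units=$(systemctl list-units 'actions.runner.*' --plain --no-legend | awk '{print $1}')
for unit in $units; do
  echo "restarting $unit"
  sudo systemctl restart "$unit"
done
```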

### Environment

- Self-hosted runners on Linux (Ubuntu)
- Multiple runners installed on the same host (different repositories)
- Runner version: 2.333.0
- Workloads include:
  - Node.js (Vitest tests with coverage)
  - SonarQube analysis
  - CI/CD pipelines

### Suspected Cause

- A long-running or hanging test process (`vitest run --coverage`) does not exit cleanly
- The runner does not enforce a timeout or detect job inactivity
- The worker process (`Runner.Worker`) remains attached indefinitely
- There is no automatic cleanup or job termination
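Until the runner enforces its own timeout, a host-level watchdog can approximate the missing cleanup. This is a sketch, not a supported runner feature; `MAX_SECONDS` is an assumed per-job budget and should be tuned to the longest legitimate job:

```shell
#!/usr/bin/env sh
# Watchdog sketch: run from cron every few minutes.
# Terminates Runner.Worker processes that exceed a per-job time budget,
# letting the runner service return to Idle and pick up queued jobs.
MAX_SECONDS=7200  # assumed budget: 2 hours

for pid in $(pgrep -f 'Runner.Worker'); do
  # etimes = seconds elapsed since the process started
  age=$(ps -o etimes= -p "$pid" | tr -d ' ')
  if [ -n "$age" ] && [ "$age" -gt "$MAX_SECONDS" ]; then
    echo "Runner.Worker pid=$pid alive for ${age}s (> ${MAX_SECONDS}s); terminating"
    kill -TERM "$pid"
  fi
done
```

Sending `SIGTERM` to the worker fails the job rather than leaving it queued, which is preferable to an indefinite hang.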

### Suggested Improvements

- Add automatic detection of stalled jobs (e.g., a no-output or no-CPU-activity threshold)
- Enforce configurable job timeouts at the runner level
- Improve visibility in the GitHub UI for:
  - the duration of the currently running job
  - stuck/hung jobs
- Allow the runner to recover gracefully without requiring a full service restart
### Additional Notes

- This issue affected multiple repositories sharing the same infrastructure, indicating it is not isolated to a single workflow
- Adding a workflow-level `timeout-minutes` mitigates the issue but does not fully solve runner-level hangs
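For reference, the workflow-level mitigation mentioned above is the `timeout-minutes` key on a job; the job name and value here are illustrative:

```yaml
jobs:
  test:
    runs-on: self-hosted
    timeout-minutes: 30  # GitHub cancels the job after 30 minutes
    steps:
      - run: npm run test:co
```

This cancellation is issued server-side; as observed above, it does not always release a `Runner.Worker` that is stuck on the host.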
