Description
The scale-down lambda fails to terminate orphaned AWS self-hosted runner instances and instead throws errors.
Background:
The latest versions of start-runner.sh
and start-runner.ps1
are both intended to terminate the EC2 instance after the job has finished; you can confirm this by searching for "terminate" in either file. If that termination succeeds, everything is fine and there are no orphans. But what happens if the termination doesn't occur?
Termination will not occur if...
- you are using an older version of start-runner.ps1 that did not yet call "terminate"; in that case the problem always occurs.
- currently I am redesigning start-runner.ps1 (to set the user), and whenever the script fails to reach the "terminate" call, the problem occurs.
- most importantly, the whole Terraform system should be robust to unexpected failures. If start-runner.sh or start-runner.ps1 glitches, or GitHub itself has problems, you may end up with an orphaned instance. Whenever an "orphan" exists, scale-down should handle the situation gracefully and terminate it. But that isn't happening now, it seems.
A stack trace from CloudWatch is shown below. Consider this line:
GET /repos/myorg/myrepo/actions/runners/653 - 404
This API call was added recently in https://github.com/github-aws-runners/terraform-aws-github-runner/pull/4595/files . It's a new feature, right?
type OrgRunnerList = Endpoints['GET /orgs/{org}/actions/runners']['response']['data']['runners'];
type RepoRunnerList = Endpoints['GET /repos/{owner}/{repo}/actions/runners']['response']['data']['runners'];
type RunnerState = OrgRunnerList[number] | RepoRunnerList[number];
By basic code inspection, ask yourself: how will this code react if the runner no longer exists? Once the job has finished, GitHub Actions forgets the ephemeral runner, so it no longer appears under "/orgs/{org}/actions/runners", and fetching the runner directly returns 404 because GitHub believes the runner is gone. Is that the source of the 404 above? And when scale-down receives that 404, does it handle it and terminate the orphan?
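To make the expected behavior concrete, here is a minimal sketch (not the actual scale-down code; `classifyRunner` and its callback are hypothetical names) of how a 404 from the per-runner endpoint could be interpreted as "orphan, safe to terminate" rather than as a failure to propagate:

```typescript
// Hypothetical sketch: for an ephemeral runner, a 404 from GitHub's
// runner-status endpoint means GitHub no longer knows the runner, which
// for scale-down is the signal "orphan: terminate the instance", not an
// error that should abort orphan processing.

type RunnerCheckResult = 'online' | 'offline' | 'orphan';

interface GitHubError {
  status?: number;
}

async function classifyRunner(
  // Stand-in for the real API call, e.g. octokit's "get a self-hosted
  // runner for a repository", which throws an HttpError with status 404.
  fetchRunner: () => Promise<{ status: string }>,
): Promise<RunnerCheckResult> {
  try {
    const runner = await fetchRunner();
    return runner.status === 'online' ? 'online' : 'offline';
  } catch (e) {
    // A 404 is the expected outcome for a finished ephemeral runner.
    if ((e as GitHubError).status === 404) {
      return 'orphan';
    }
    // Genuine failures (auth, rate limit, network) still surface.
    throw e;
  }
}
```

Under this sketch, the WARN in the log above would instead become a normal terminate path whenever the caught error carries status 404.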
CloudWatch log message 1:
2025-09-22T20:31:03.167Z e1066279-d200-4500-bc75-3f7a6274b4e9 ERROR GET /repos/myorg/myrepo/actions/runners/653 - 404 with id B82E:10BDA9:D22A39:D6CC52:68D1B206 in 246ms
CloudWatch log message 2:
{
  "level": "WARN",
  "message": "Failure during orphan termination processing.",
  "timestamp": "2025-09-22T20:32:03.223Z",
  "service": "runners-scale-down",
  "sampling_rate": 0,
  "xray_trace_id": "1-68d1b242-4965478d1f42877e3fcd62a6",
  "region": "us-west-2",
  "environment": "gha-ubuntu-noble",
  "module": "scale-down",
  "aws-request-id": "7212ccf6-8d5c-4ce8-98af-493f59b0eda3",
  "function-name": "gha-ubuntu-noble-scale-down",
  "error": {
    "name": "HttpError",
    "location": "file:///var/task/index.js:158569",
    "message": "Not Found - https://docs.github.com/rest/actions/self-hosted-runners#get-a-self-hosted-runner-for-a-repository",
    "stack": "HttpError: Not Found - https://docs.github.com/rest/actions/self-hosted-runners#get-a-self-hosted-runner-for-a-repository\n at fetchWrapper (file:///var/task/index.js:158569:11)\n at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n at async Job.doExecute (file:///var/task/index.js:126562:18)",
    "status": 404,
    "request": {
      "method": "GET",
      "url": "https://api.github.com/repos/myorg/myrepo/actions/runners/653",
      "headers": {
        "accept": "application/vnd.github.v3+json",
        "user-agent": "github-aws-runners octokit-rest.js/22.0.0 octokit-core.js/7.0.2 Node.js/22",
        "authorization": "token [REDACTED]"
      },
      "request": {}
    },
    "response": {
      "url": "https://api.github.com/repos/myorg/myrepo/actions/runners/653",
      "status": 404,
      "headers": {
        "access-control-allow-origin": "*",
        "access-control-expose-headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset",
        "content-encoding": "gzip",
        "content-security-policy": "default-src 'none'",
        "content-type": "application/json; charset=utf-8",
        "date": "Mon, 22 Sep 2025 20:32:03 GMT",
        "referrer-policy": "origin-when-cross-origin, strict-origin-when-cross-origin",
        "server": "github.com",
        "strict-transport-security": "max-age=31536000; includeSubdomains; preload",
        "transfer-encoding": "chunked",
        "vary": "Accept-Encoding, Accept, X-Requested-With",
        "x-accepted-github-permissions": "administration=read",
        "x-content-type-options": "nosniff",
        "x-frame-options": "deny",
        "x-github-api-version-selected": "2022-11-28",
        "x-github-media-type": "github.v3; format=json",
        "x-github-request-id": "833A:2C77BC:D42741:D8CCE8:68D1B242",
        "x-ratelimit-limit": "5000",
        "x-ratelimit-remaining": "4840",
        "x-ratelimit-reset": "1758574803",
        "x-ratelimit-resource": "core",
        "x-ratelimit-used": "160",
        "x-xss-protection": "0"
      },
      "data": {
        "message": "Not Found",
        "documentation_url": "https://docs.github.com/rest/actions/self-hosted-runners#get-a-self-hosted-runner-for-a-repository",
        "status": "404"
      }
    }
  }
}
How to replicate the issue:
Modify /modules/runners/templates/start-runner.sh: comment out trap 'cleanup $? $LINENO $BASH_LINENO' EXIT
so that cleanup doesn't happen.
Modify /modules/runners/templates/start-runner.ps1: comment out aws ec2 terminate-instances --instance-ids "$InstanceId" --region "$Region"
so that cleanup doesn't happen.
Run GitHub Actions jobs. The instances will become "orphans", and scale-down will then fail to terminate them.
What should happen:
In the past, orphans were scaled down without errors; that was the correct behavior and should be restored.