Description
The scale-down lambda fails to terminate orphaned AWS self-hosted runner instances and instead throws errors.
Background:
The latest versions of start-runner.sh
and start-runner.ps1
are both intended to terminate the EC2 instance after the job has finished; you can confirm this by searching for "terminate" in either file. If that termination succeeds, everything is fine and there are no orphans. But what happens if the termination doesn't occur?
Termination will not occur if...
- you are using an older version of start-runner.ps1 that did not yet call "terminate"; in that case the problem always occurs.
- currently I am redesigning start-runner.ps1 (to set the user), and whenever the script fails to reach the "terminate" call, the problem occurs.
- most importantly, the whole Terraform system should be robust to unexpected failures. If start-runner.sh or start-runner.ps1 glitches, or GitHub itself has problems, you may end up with an orphaned instance. Whenever an "orphan" exists, scale-down should handle the situation gracefully and terminate it. But that isn't happening now, it seems.
A stack trace from CloudWatch is shown below. Consider this line:
GET /repos/myorg/myrepo/actions/runners/653 - 404
This API call was added recently in https://github.com/github-aws-runners/terraform-aws-github-runner/pull/4595/files . It's a new feature, right?
type OrgRunnerList = Endpoints['GET /orgs/{org}/actions/runners']['response']['data']['runners'];
type RepoRunnerList = Endpoints['GET /repos/{owner}/{repo}/actions/runners']['response']['data']['runners'];
type RunnerState = OrgRunnerList[number] | RepoRunnerList[number];
By basic code inspection, ask yourself: how will this code react if the runner no longer exists? Once the job has finished, GitHub Actions forgets the ephemeral runner, so it no longer appears under "/orgs/{org}/actions/runners", and fetching the runner directly returns 404 because GitHub believes the runner is gone. Is that the source of the 404 above? And when scale-down receives that 404, does it handle it and terminate the orphan?
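To make the expected behavior concrete, here is a minimal sketch (not the actual scale-down code; `classifyRunner` and its callback are hypothetical names) of how a 404 from the per-runner endpoint could be interpreted as "orphan, safe to terminate" rather than as a failure to propagate:

```typescript
// Hypothetical sketch: for an ephemeral runner, a 404 from GitHub's
// runner-status endpoint means GitHub no longer knows the runner, which
// for scale-down is the signal "orphan: terminate the instance", not an
// error that should abort orphan processing.

type RunnerCheckResult = 'online' | 'offline' | 'orphan';

interface GitHubError {
  status?: number;
}

async function classifyRunner(
  // Stand-in for the real API call, e.g. octokit's "get a self-hosted
  // runner for a repository", which throws an HttpError with status 404.
  fetchRunner: () => Promise<{ status: string }>,
): Promise<RunnerCheckResult> {
  try {
    const runner = await fetchRunner();
    return runner.status === 'online' ? 'online' : 'offline';
  } catch (e) {
    // A 404 is the expected outcome for a finished ephemeral runner.
    if ((e as GitHubError).status === 404) {
      return 'orphan';
    }
    // Genuine failures (auth, rate limit, network) still surface.
    throw e;
  }
}
```

Under this sketch, the WARN in the log above would instead become a normal terminate path whenever the caught error carries status 404.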
CloudWatch log message 1:
2025-09-22T20:31:03.167Z e1066279-d200-4500-bc75-3f7a6274b4e9 ERROR GET /repos/myorg/myrepo/actions/runners/653 - 404 with id B82E:10BDA9:D22A39:D6CC52:68D1B206 in 246ms
CloudWatch log message 2:
{
  "level": "WARN",
  "message": "Failure during orphan termination processing.",
  "timestamp": "2025-09-22T20:32:03.223Z",
  "service": "runners-scale-down",
  "sampling_rate": 0,
  "xray_trace_id": "1-68d1b242-4965478d1f42877e3fcd62a6",
  "region": "us-west-2",
  "environment": "gha-ubuntu-noble",
  "module": "scale-down",
  "aws-request-id": "7212ccf6-8d5c-4ce8-98af-493f59b0eda3",
  "function-name": "gha-ubuntu-noble-scale-down",
  "error": {
    "name": "HttpError",
    "location": "file:///var/task/index.js:158569",
    "message": "Not Found - https://docs.github.com/rest/actions/self-hosted-runners#get-a-self-hosted-runner-for-a-repository",
    "stack": "HttpError: Not Found - https://docs.github.com/rest/actions/self-hosted-runners#get-a-self-hosted-runner-for-a-repository\n at fetchWrapper (file:///var/task/index.js:158569:11)\n at process.processTicksAndRejections (node:internal/process/task_queues:105:5)\n at async Job.doExecute (file:///var/task/index.js:126562:18)",
    "status": 404,
    "request": {
      "method": "GET",
      "url": "https://api.github.com/repos/myorg/myrepo/actions/runners/653",
      "headers": {
        "accept": "application/vnd.github.v3+json",
        "user-agent": "github-aws-runners octokit-rest.js/22.0.0 octokit-core.js/7.0.2 Node.js/22",
        "authorization": "token [REDACTED]"
      },
      "request": {}
    },
    "response": {
      "url": "https://api.github.com/repos/myorg/myrepo/actions/runners/653",
      "status": 404,
      "headers": {
        "access-control-allow-origin": "*",
        "access-control-expose-headers": "ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Used, X-RateLimit-Resource, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, X-GitHub-SSO, X-GitHub-Request-Id, Deprecation, Sunset",
        "content-encoding": "gzip",
        "content-security-policy": "default-src 'none'",
        "content-type": "application/json; charset=utf-8",
        "date": "Mon, 22 Sep 2025 20:32:03 GMT",
        "referrer-policy": "origin-when-cross-origin, strict-origin-when-cross-origin",
        "server": "github.com",
        "strict-transport-security": "max-age=31536000; includeSubdomains; preload",
        "transfer-encoding": "chunked",
        "vary": "Accept-Encoding, Accept, X-Requested-With",
        "x-accepted-github-permissions": "administration=read",
        "x-content-type-options": "nosniff",
        "x-frame-options": "deny",
        "x-github-api-version-selected": "2022-11-28",
        "x-github-media-type": "github.v3; format=json",
        "x-github-request-id": "833A:2C77BC:D42741:D8CCE8:68D1B242",
        "x-ratelimit-limit": "5000",
        "x-ratelimit-remaining": "4840",
        "x-ratelimit-reset": "1758574803",
        "x-ratelimit-resource": "core",
        "x-ratelimit-used": "160",
        "x-xss-protection": "0"
      },
      "data": {
        "message": "Not Found",
        "documentation_url": "https://docs.github.com/rest/actions/self-hosted-runners#get-a-self-hosted-runner-for-a-repository",
        "status": "404"
      }
    }
  }
}
How to replicate the issue:
Modify /modules/runners/templates/start-runner.sh: comment out trap 'cleanup $? $LINENO $BASH_LINENO' EXIT
so that cleanup doesn't happen.
Modify /modules/runners/templates/start-runner.ps1: comment out aws ec2 terminate-instances --instance-ids "$InstanceId" --region "$Region"
so that cleanup doesn't happen.
Run GitHub Actions jobs. The instances will become "orphans", and scale-down will then fail to terminate them.
What should happen:
In the past, orphans were scaled down without errors; that was the correct behavior and should be restored.