Self-hosted runners disappeared #756

Closed
BrightRan opened this issue Oct 15, 2020 · 45 comments
Labels: awaiting-customer-response, bug (Something isn't working)

Comments

@BrightRan

Associated GitHub Community topic: https://github.community/t/disappearing-self-hosted-runners/137669

The customer has added some self-hosted runners for his repository, but the runners would completely disappear as if he never added any.
When he refreshes, the runners would come back. Some would be Offline but would go back to being Idle after another refresh. Other times when he refreshes the runners disappear again.
When the customer logs into the runner machines to check their status, he can see a lot of connection retries.

2020-10-13 21:12:44Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2020-10-13 21:14:42Z: Runner reconnected.
2020-10-13 21:15:42Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
@chingc

chingc commented Oct 15, 2020

@BrightRan Thank you for creating this ticket. I'd like to add that I'm using v2.273.5 of the runner on a plain Amazon Linux 2 EC2 Instance. I haven't experienced any issues yesterday so perhaps it was an intermittent issue on GitHub or Amazon's end.

@brandan-schmitz

I have seen this issue as well, using the same version as @chingc. Today I received an error from my build actions for a project; it seems the runner vanished from the project and left me with no runners registered. On my server, the runner itself showed that it was still registered but was getting an unknown disconnect error from GitHub. All it would do was loop: the runner service would start, report an unknown disconnect error, and start again.

I have since wiped the old runner, re-downloaded it, and registered it again, as I needed the build pipeline running, but I am not sure when the problem first occurred on my system.

@jonnikim

This still seems to be an issue.

We had 5 runners: 4 of them were offline and 1 was idle. I disabled Actions to fix some syntax, left it alone for 2+ weeks, and came back to see that only the idle one remained. The other 4 look to have been deleted. Re-enabling Actions didn't bring them back either.

I still have the directories for the other 4 runners, but trying to start them throws this error: Failed to create a session. The runner registration has been deleted from the server, please re-configure.

@ruvceskistefan ruvceskistefan added the bug Something isn't working label Mar 17, 2022
@nikola-jokic
Contributor

nikola-jokic commented Apr 4, 2022

Hi @jonnikim,

If the runner does not get any tasks for 30 days, it is cleaned up on the service side. That might be the reason why you needed to re-configure your runner.

@brandan-schmitz, @chingc, does this help?

@mhl-itm-bhg

I am experiencing a similar issue. When attempting to run the actions-runner (run.cmd) on my machine, I get the following error:
Failed to create a session. The runner registration has been deleted from the server, please re-configure.
When attempting to reconfigure the runner (config.cmd), I get the following error:
Cannot configure the runner because it is already configured. To reconfigure the runner, run 'config.cmd remove' or './config.sh remove' first.
When I run config.cmd remove, I'm asked to enter a runner removal token.

I have no idea where to get this token. Is there any way to reconfigure without being dependent on tokens that disappeared from the repo?

@nikola-jokic
Contributor

Hi @mhl-itm-bhg,

You can just remove the file named .runner in the runner's root directory (the directory from which you run config.sh).
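
A minimal sketch of that step, assuming the runner directory is passed as an argument. The function name `reset_runner_config` and the `.runner.bak` backup name are illustrative only and are not part of the runner; the sketch renames the file rather than deleting it so a copy is kept.

```shell
# Hypothetical helper: move the stale .runner file aside so that
# config.sh no longer treats the runner as "already configured".
# Renaming instead of deleting keeps a copy in case it is needed.
reset_runner_config() (
  runner_dir="$1"
  if [ ! -f "$runner_dir/.runner" ]; then
    echo "no .runner file in $runner_dir" >&2
    exit 1   # exits only this function's subshell
  fi
  mv "$runner_dir/.runner" "$runner_dir/.runner.bak"
)
```

For example, `reset_runner_config "$HOME/actions-runner"` (a hypothetical path), followed by running config.sh with a fresh registration token. If the runner is installed as a service, it may need to be stopped first.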

@nikola-jokic
Contributor

nikola-jokic commented Apr 14, 2022

Hi everyone,

Since this seems to be resolved, I am going to close this issue. If you experience this issue again, you can create a new issue or write a comment here, and we will re-open it 😄

@shishodiyas

How can we make it so that the runner doesn't get deleted?

@whutchinson98

I just experienced this issue. Is there any update on how to prevent this?

@nikola-jokic
Contributor

The docs now state:

A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 14 days.

@shishodiyas, @whutchinson98 you can't. One way to automate recovery is to use the API to fetch a registration token and register your runner again from a shell script.

@sxtyxmm
Copy link

sxtyxmm commented Nov 1, 2022

I have a shell script for the same, but can you elaborate on the API use?

@nikola-jokic
Contributor

Of course. These docs describe how to use the API to fetch a registration token, for example: https://docs.github.com/en/rest/actions/self-hosted-runners#create-a-registration-token-for-a-repository. You can create a small script that fetches the registration token; then, when you configure your runner, you may want to add flags like --unattended and --replace.
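
A rough sketch of such a script, under stated assumptions: `GITHUB_TOKEN` is an environment variable holding a credential that is allowed to manage the repository's runners, and the function names are made up for this example. The token is extracted with sed only to keep the sketch dependency-free; a JSON tool such as jq would be more robust. If config.sh reports that the runner is already configured, the stale local configuration has to be removed first, as described earlier in the thread.

```shell
# Build the REST endpoint that issues a repository registration token.
registration_token_url() {
  echo "https://api.github.com/repos/$1/$2/actions/runners/registration-token"
}

# Pull the "token" field out of the JSON response read from stdin.
extract_token() {
  sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Hypothetical wrapper: fetch a fresh token, then re-run config.sh.
# Arguments: repository owner, repository name, runner directory.
reregister_runner() (
  owner="$1"; repo="$2"; runner_dir="$3"
  token="$(curl -fsS -X POST \
    -H "Accept: application/vnd.github+json" \
    -H "Authorization: Bearer $GITHUB_TOKEN" \
    "$(registration_token_url "$owner" "$repo")" | extract_token)"
  [ -n "$token" ] || { echo "no registration token returned" >&2; exit 1; }
  cd "$runner_dir" || exit 1
  ./config.sh --url "https://github.com/$owner/$repo" \
    --token "$token" --unattended --replace
)
```

It could be called as `reregister_runner my-org my-repo "$HOME/actions-runner"`, where all three values are placeholders.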

@shukriadams

@nikola-jokic From an automation point of view, this is some pretty anti-user design. Why would you auto-terminate an integration that has been down for 14 days? It's not costing GitHub anything that a runner one of us is hosting has gone idle. Some of us do projects as hobbies; we take breaks from them, we have lives. Is it really that much to ask that an automated build works again after a Raspberry Pi got accidentally unplugged for two weeks? I actually spend more time maintaining self-hosted runners than I spend building with them.

@nikola-jokic
Contributor

Hi @shukriadams,

For most enterprises, this is expected and wanted. We understand you’re not most enterprises. If you want to discuss it more:

  • Bring it to the forums to start a product feedback discussion.
  • Once again, this is not something you can change from the runner code, so bring it to the feedback page ☺️

@newfunda

newfunda commented Feb 1, 2024

A self-hosted runner is automatically removed from GitHub Enterprise Cloud if it has not connected to GitHub Actions for more than 14 days. An ephemeral self-hosted runner is automatically removed from GitHub Enterprise Cloud if it has not connected to GitHub Actions for more than 1 day.

@sxtyxmm

sxtyxmm commented Feb 1, 2024 via email

@pitoniak32

It seems the original error was not fully addressed in this issue 😓

2020-10-13 21:12:44Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
2020-10-13 21:14:42Z: Runner reconnected.
2020-10-13 21:15:42Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.

I am seeing this same thing on our self-hosted runners, and the only fixes I have found are related to disabling IPv6. Is there another solution for this, or at least a workaround?
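
To gauge how often a runner is flapping before trying a fix such as disabling IPv6, the messages above can be tallied. This is only a sketch: it assumes the runner's output has been saved to a file (the name `runner.log` below is hypothetical), and the function name is made up for this example.

```shell
# Count connect errors and reconnects in runner output read from stdin.
summarize_connectivity() (
  log="$(cat)"
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  errors="$(printf '%s\n' "$log" | grep -c 'Runner connect error' || true)"
  reconnects="$(printf '%s\n' "$log" | grep -c 'Runner reconnected' || true)"
  echo "errors=$errors reconnects=$reconnects"
)
```

Running `summarize_connectivity < runner.log` on the three lines quoted above prints `errors=2 reconnects=1`.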

@ttolbol

ttolbol commented May 6, 2024

Just give us a setting to disable the automatic removal! It's completely ridiculous that I have to manually add a self hosted runner whenever I have to deploy an update (usually once per month). It takes me more time to go through the whole process of adding the runner again, than the time it takes to actually run the process. It didn't use to be this way. An automation tool that requires manual labour to use is not much of an automation tool.

@dgiambo

dgiambo commented May 6, 2024

This just bit me too. Can we please have a setting for this, or at least a warning of some kind? This is not a good user experience. Why is the deletion not recorded in the audit logs?

@sxtyxmm

sxtyxmm commented May 6, 2024

> It seems the original error was not fully addressed in this issue 😓
>
> 2020-10-13 21:12:44Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
> 2020-10-13 21:14:42Z: Runner reconnected.
> 2020-10-13 21:15:42Z: Runner connect error: The HTTP request timed out after 00:01:00.. Retrying until reconnected.
>
> I am seeing this same thing on our self-hosted runners. And the only fixes I have found are related to disabling ipv6. Is there another solution for this? or is there at least a workaround?

The best I could come up with was to write an automation that adds the runner again every 14 days.

@tedgarb

tedgarb commented May 9, 2024

Adding my voice to the dissatisfaction here. There are absolutely no docs on how to reset a runner once GitHub has unilaterally purged it. If GitHub insists on this design paradigm for what are supposed to be persistent self-hosted runners, I would like to request that:

  1. The documentation at https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners?learn=hosting_your_own_runners&learnProduct=actions#adding-a-self-hosted-runner-to-an-organization actually explain that runners will be unilaterally purged
  2. Documentation be added on how to reset a runner once it has been registered, set up as a service, and then broken by GitHub

@oschwartz10612

Adding support for this issue here as well! We need a setting; runners can't just be deleted because they are turned off. We don't pay for our EC2 runners to be on all the time if we are only using them once a month, and manually adding them back every time is ridiculous!

slimsag added a commit to hexops/mach that referenced this issue Aug 25, 2024
@github fucked me over and deleted the aarch64-macos runner's configuration
after it was down for a brief period of time[0] so I will have to set it up
from scratch again. For now, we remove aarch64-macos so our CI at least
passes once again.

[0] actions/runner#756

Signed-off-by: Stephen Gutekanst <stephen@hexops.com>
@BrendenWalker

Thought I'd add that we just had multiple self-hosted runners disappear from GitHub organization configuration. Nobody else has access to GH config or the runners and I know for sure that I didn't remove them.

2 of the missing runners were running jobs 3 days ago. Strangely, one runner is still present. No clue why just this one.

I submitted a ticket, hopefully they can pull from a backup. Access to some of these runners can be difficult, so just adding again would be a hassle.

@cullenwren-volair

Very confused why this happened to my self-hosted runners. Ours are used multiple times a day, yet I've twice now had them removed for seemingly no reason. Our runners are set up as services, and checking sudo ./svc.sh status shows they are still connected to GitHub despite having been removed. It would be nice if restarting the service, or uninstalling and reinstalling it, allowed the runners to be re-added instead of having to reconfigure them.

@BrendenWalker

I'm not sure if you can get to this or comment on it, but my ticket: https://support.github.com/ticket/enterprise/122857/2997363

Seems like it's not just me. I have created scripts to automate installation of runners and I'm now keeping the configuration stored in version control (except secrets of course) to make it easy to reinstall.

@nextjsdude

It is 2024 and I face the same issue.

> Best i could come up with was to write an automation to add the runner again. after every 14 days.

Can you share your automation configuration?

@BrendenWalker

I'm using a workflow I call 'doorstop'. I have to manually update it with new runners but that's so far not been an issue.

Example:

name: doorstop

on:
  schedule:
    # times in UTC, standard cron format
    - cron:  '0 05 01,10,20 * *' # 5am 1st/10th/20th day of month

  workflow_dispatch:

jobs: 
  this_runner:
    runs-on: [self-hosted,thisrunner]
    steps:
      - name: Hello
        shell: powershell
        run: Write-Host "Hello World"

  that_runner:
    runs-on: [self-hosted,thatrunner]
    steps:
      - name: Hello
        shell: powershell
        run: Write-Host "Hello World"
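
Whether a schedule like this one stays inside the 14-day window quoted from the docs can be checked with a little arithmetic. The helper below is a sketch with a made-up name; it only looks at the calendar gaps and assumes the runner is online and the scheduled run actually starts on each of those days.

```shell
# Largest gap, in days, between consecutive runs of a monthly schedule.
# $1 is the month length; the remaining arguments are the scheduled
# days of the month in ascending order, written without leading zeros.
max_gap_days() (
  month_len="$1"; shift
  first="$1"; prev="$1"; max=0; shift
  for day in "$@"; do
    gap=$(( day - prev ))
    if [ "$gap" -gt "$max" ]; then max="$gap"; fi
    prev="$day"
  done
  # Wrap-around gap: last run of this month to first run of the next.
  gap=$(( month_len - prev + first ))
  if [ "$gap" -gt "$max" ]; then max="$gap"; fi
  echo "$max"
)
```

For the 1st/10th/20th schedule, `max_gap_days 31 1 10 20` prints 12 (the stretch from the 20th to the 1st of the next month), and `max_gap_days 28 1 10 20` prints 10, so the longest idle stretch stays under 14 days.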

@nextjsdude

It would be good if the GitHub owner could receive an email alert when a runner is approaching 14 days, rather than having it deleted outright.
How I fixed it in my case:

  1. I decided to remove the runner's configuration with config.sh, but I was asked to enter the runner token. Of course I didn't have or know it, and besides, the runner had already been deleted from GitHub's servers, so it just returned a 404 error no matter what token I entered.
  2. I finally uninstalled svc and removed that runner. I created a new runner and installed svc in that runner. Long, but it worked.

@nextjsdude

@BrendenWalker Absolutely brilliant. Thanks, man.

@gojimmypi

I am also encountering a problem where my self-hosted runner gets removed, and long before a 2 week unused expiration.

I have test scripts on WSL in a Windows 11 VM. I had another odd WSL error that caused my runner to crash. A few days later, when I noticed the action was not working and restarted it, I found that the runner object no longer existed on GitHub.

image

As this issue is closed; is there an open one on this topic, or any known reliable workarounds?

@BrendenWalker

I just recently had a runner disappear after being offline for 6 days.... maybe if enough people chime in here it'll be reopened.

@gojimmypi

@BrendenWalker would you happen to be on a VPN? I was asked that question & yes: I was.

I'm trying again on a non-VPN segment to see if that helps.

@alvieridev

> I am also encountering a problem where my self-hosted runner gets removed, and long before a 2 week unused expiration.
>
> I have test scripts on WSL in a Windows 11 VM. I had another odd WSL error that caused my runner to crash, Only a few days later when I noticed the action was not working, upon restarting it, I noticed that the runner object no longer existed on GitHub.
>
> image
>
> As this issue is closed; is there an open one on this topic, or any known reliable workarounds?

I realised the runner gets removed even before the two-week mark if the runner is not idle or active. Since your runner crashed, I'd say it was removed before 14 days because it wasn't active. A workaround that worked for me (as suggested by @BrendenWalker) is to set up a workflow that runs every two days or every week, depending on your needs. It should perform a very minimal task like echo or whoami. This gives GitHub the impression that the runner is still active. This has worked for me.

@BrendenWalker

Sadly, my workaround didn't save me the last time: the runner was removed after 6 days, so my workflow didn't have a chance to start up the VM and run an action.

I've also taken to semi-automating installation on Windows via PowerShell. If this keeps happening I'll probably deploy Ansible or some other fully automated means.

@BrendenWalker

> @BrendenWalker would you happen to be on a VPN? I was asked that question & yes: I was.
>
> I'm trying again on a non-VPN segment to see if that helps.

This latest failure is an Azure VM. No VPN that I know of; however, it does not have a public IP address and ICMP traffic to the internet doesn't work, so no ping.

The config.cmd --check functionality reports failure when it can't ping some servers even though it has already verified HTTPS access to the same servers. AFAIK HTTPS access is all that runners require, which would explain why my action runners work fine, as long as they're not booted out of the GitHub configuration.

@gojimmypi

@BrendenWalker and @alvieridev there's certainly a possibility that I have an unstable network, even without the VPN.

With your experience with self-hosted runners, what do you think of this (admittedly hacky) idea:

while true; do
    timeout 6h ./run.sh  # Run for 6 hours
    echo "Restarting run.sh after 6 hours..."
done
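
A bounded variant of the loop above, as a sketch only: it pauses between attempts and gives up after a fixed number of runs, so a runner whose registration is gone does not spin forever. The function name and its arguments are made up for this example.

```shell
# Hypothetical supervisor: run a command repeatedly, pausing between
# attempts, and stop after a fixed number of runs instead of forever.
# Usage: run_with_restarts MAX_RUNS PAUSE_SECONDS COMMAND [ARGS...]
run_with_restarts() (
  max_runs="$1"; pause="$2"; shift 2
  runs=0
  while [ "$runs" -lt "$max_runs" ]; do
    runs=$(( runs + 1 ))
    status=0
    "$@" || status=$?
    echo "run $runs exited with status $status" >&2
    sleep "$pause"
  done
  # Report how many times the command was started.
  echo "$runs"
)
```

For instance, `run_with_restarts 4 10 timeout 6h ./run.sh` would cover roughly a day with four six-hour runs.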

@BrendenWalker

A bit blunt, but sometimes that's necessary. I haven't had that particular issue (yet). In my case I'm running as a Windows service (so far; we have *nix runners in GitLab but haven't migrated those projects yet), and services can be set up to restart automatically. That is IF they stop cleanly and notify the Windows SCM that they stopped ;-)

@gojimmypi

fwiw, on the list of "possible solutions, but won't work for me".... is this scheduled keep-alive task.

TIL scheduled tasks only work on the main branch, which is undesired when contributing upstream via a fork. :/

name: Keep Alive

on:
  schedule:
    # Runs every hour
    - cron: "0 * * * *"

jobs:
  keep-alive:
    runs-on: self-hosted  # Ensure this runs on your self-hosted runner

    steps:
      - name: Run keep-alive task
        run: |
          echo "Running periodic keep-alive task."

Perhaps this might help someone that's ok with main branch workflow edits.

@BrendenWalker

I just had GH support give me this gem:

GitHub does not remove runners.

Had to refer them to the GitHub documentation which contradicts that:

A self-hosted runner is automatically removed from GitHub Enterprise Cloud if it has not connected to GitHub Actions for more than 14 days. An ephemeral self-hosted runner is automatically removed from GitHub Enterprise Cloud if it has not connected to GitHub Actions for more than 1 day.

@cullenwren-volair

cullenwren-volair commented Oct 9, 2024

> A self-hosted runner is automatically removed from GitHub Enterprise Cloud if it has not connected to GitHub Actions for more than 14 days

I wonder whether, if the runner is registered with a particular IP address that is never reused when reconnecting (as can happen with VMs), GitHub removes the runner after 14 days despite it being used within that window.

@BrendenWalker

Interesting theory. However, I would expect it to show offline whenever the IP address changed. That's not been the case so far in my experience. Last one was removed 6 days after running a job successfully.

I think there is a bug in the 'cleanup' logic and it's simply not functioning like it should. They just need to open source all of GitHub and I'll fix the dang thing ;-)

@gojimmypi

> I just had GH support give me this gem:
>
> > GitHub does not remove runners.

Well, that's false. I've had self-hosted runners go missing long before 14 days of inactivity. See the screen snip above: the runner last processed a job triggered by a commit on 10/2, and when I tried to restart it on 10/8, it was gone from my GitHub account and I had to set up a new one.

> I wonder if the runner is registered with a particular IP address that is then never re-used when connecting (in the instances of VMs) Github will remove the runner after the 14 days despite the runner being used within that window

Now that's an interesting hypothesis.

> ... GitHub Enterprise Cloud ...

I'm not an enterprise customer.

fautore pushed a commit to fautore/mach that referenced this issue Oct 11, 2024
@BrendenWalker

Hey everyone! This just in on my ticket:

I continued working on this, and opened an internal issue to track down this unexpected behaviour. Update from engineers on this unexpected deregistering of runners does identify this as a bug that was inadvertently introduced with recent updates, just as we discovered in the logs

The team identified a bug that could cause runners that were created over 14 days ago and offline for more than 1 hour could be incorrectly removed as part of our dormant runner cleanup job. It looks like your affected runner fits this criteria.

We've rolled out a mitigation to prevent this from happening any further, and are in the process of rolling out a long term fix to make sure self-hosted runner are only dormant if offline for 14 days.

that might.. just might confirm that we're not imagining things ;-)
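
To make the difference between the two rules in that reply concrete, here they are restated as predicates. This is an illustration of the support message only, not GitHub's implementation; the function names are made up, and "offline for 14 days" is taken as more than 336 hours.

```shell
# Illustration only: these predicates restate the support reply in
# code; they are not GitHub's implementation.
# Arguments: runner age in days, time offline in hours.

# Rule as the bug was described: created over 14 days ago AND
# offline for more than 1 hour.
removed_by_buggy_rule() {
  [ "$1" -gt 14 ] && [ "$2" -gt 1 ]
}

# Rule as intended: offline for more than 14 days (336 hours),
# regardless of when the runner was created.
removed_by_intended_rule() {
  [ "$2" -gt 336 ]
}
```

A 60-day-old runner that has been offline for 6 days (144 hours) satisfies the first predicate but not the second, which is consistent with the removals after 6 days reported earlier in the thread.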

@IronSean

If this is true, and it was an error that caused them to be deleted after 14 days of being Idle when they were meant to be deleted only after 14 days Offline, this is starting to approach a sane policy. Deleting after 14 days of inactivity (or 14 days of existence and 1 hour of inactivity) is baffling.

@sakhisoufiane

We've had all our self-hosted runners deleted. For anyone encountering the same issue, it looks like it was a bug on GitHub's end that deleted runners it thought were dormant when in fact they were active but in an idle state.

They couldn't restore them, so we had to reconfigure these runners from scratch.

The reply we've got from support if it can help anyone:

We've found that a cleanup job for dormant runners removed this self-hosted runner on 6th Oct. A self-hosted runner is automatically removed from GitHub if it has not connected to GitHub Actions for more than 14 days.

However, even though it would be the expected behaviour that a runner is removed if it has remained offline for 14 days, I do appreciate that the workflows in [redacted-repo-name] are run more frequently than 14 days, and that it should not have been the case that the runners were removed.

This indicated that there was a bug that caused runners to incorrectly be marked as dormant and then removed. Please rest assured that we are already working on a fix right now, and have also already rolled out a temporary mitigation plan to prevent self-hosted runners from being prematurely removed while we work on it.
