Workflow failure due to runner shutdown/stoppage #2040
Comments
@hamidgg would it be possible to share the RunnerListener and RunnerWorker logs too?
I got the same error when running jobs on self-hosted runners.
Other people also have the same error message. Hi @AvaStancu, could you help check on this? This issue is kind of blocking our progress...
I am getting this as well. Unfortunately the other issue referenced is closed, but it has many reports (even after closure) about the same behavior.
@AvaStancu Sorry for my delayed reply. I've been waiting for another failure to get the RunnerListener and RunnerWorker logs, as the previous logs were cleaned up. I'll get back to you with the logs once a similar failure happens (hopefully not :D).
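For anyone else trying to grab these logs before they rotate, a rough sketch of where to look (assuming a default install layout; the install path below is a placeholder): the runner writes diagnostics to the _diag folder inside its install directory, with Listener activity in Runner_*.log and per-job activity in Worker_*.log.

cd /opt/actions-runner        # placeholder install path
ls -lt _diag/Runner_*.log     # RunnerListener logs, newest first
ls -lt _diag/Worker_*.log     # RunnerWorker logs, one per job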
Same situation. I use AWS EC2 and, after the first run, get this error
Still waiting for a solution :(
@zaknafein83 did you resolve it?
Not yet, I must restart my instance every time.
[taking my comment out, I think I got some logs mixed up]
Hello all, I would like to point out the big issue here. We have enough RAM and a good internet connection, but we think the runners do not receive enough CPU time while we build applications, even though we would expect the connection to stabilize sooner or later. The run.sh script neither fails nor succeeds; it just stops without any clear error. Whatever the reason: we expect runners to be as stable as Jenkins nodes. It should not matter if the system is overloaded while building. The connection is dying randomly. Sometimes our builds even succeed, as expected, but sometimes they just die. Can this issue be emphasized more, please? Sorry for being harsh, but this has literally been an issue for months now. Stay healthy! BR Edit:
I did some more investigation, and apparently it was a problem on our side while instantiating the runner in a VM managed through systemd. The problem was a mixture of how our VM solution interacts with systemd. I am not sure about the others' cases... It seems to be stable now, after fixing our services. For the interested people: Best regards!
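For anyone else running the runner under systemd, a minimal sketch using the helper script that ships in the runner's install directory (run from that directory, after ./config.sh has registered the runner):

sudo ./svc.sh install    # generates and enables a systemd unit for this runner
sudo ./svc.sh start      # starts the service
sudo ./svc.sh status     # verifies the unit is active and not flapping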
We started having the same issues ~1 month ago; we use custom-sized GitHub-hosted runners, though. Not sure, but it feels like it happens more often with 16-core runners. The message is:
If I try to get raw logs, they are almost empty, although a few job steps succeeded:
Not sure whether this is related, but initially they were defined as Ubuntu 20.04 runners; after December 15 they started using 22.04, and the warning was
UPD: I 'fixed' this by just re-creating the runners group in the GitHub UI (i.e. we had ...). And this makes me wonder: why? I thought that a runners group is just some stateless abstraction to limit usage, but it appears to be something stateful, i.e. it binds to some infra (?), so if it has problems, you will have them too, and re-creating the group may help.
I am now having this experience with self-hosted runners in AWS with no apparent cause. Disk is fine, memory is fine, CPU is fine, but just randomly a GitHub runner decides it can no longer talk to the GitHub web sockets and fails to reconnect. That being said, I see that a number (5-10%) of the web socket connections during a workflow run are erroring out and causing the web socket process to reconnect. Not sure if this is related, or a red herring.
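In case it helps anyone narrow this down: recent runner versions include a network self-check that exercises the endpoints the runner depends on (a hedged sketch per the runner's docs/checks; the repository URL and token below are placeholders). Results are written to the _diag folder.

./run.sh --check --url https://github.com/your-org/your-repo --pat <personal-access-token>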
+1 to this thread - and I'm even using
+1, same issue.
Faced the same issue when using AWS EC2 instances as self-hosted runners.
We're seeing this in hosted runners also; here's the stack trace from our worker logs...
Update to my case: I was able to resolve the issue by using larger EC2 instances, so yeah, CPU/Memory starvation was the cause of this problem. |
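For anyone trying to confirm (or rule out) starvation before resizing, a quick sketch using standard Linux tools on the runner host around the time of a failure:

dmesg -T | grep -iE 'oom|killed process'   # OOM-killer activity
free -h                                    # memory headroom
uptime                                     # load averages vs. core count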
We are also experiencing similar problems. Since we are running 4 builds at a time, it nearly always fails in one of them. Three are Ubuntu-based and one is Windows, but it does not seem to affect just one type of OS. The GitHub runner version is always the newest, as the runners are created via a script that fetches the newest release. Our current "solution" is just to "rebuild failed jobs" until it works. However, long-term this is unacceptable. Similar to #2624 (comment)
We have the same issue, which happens from time to time with our runners |
For my particular scenario, the web socket errors are a red herring and aren't necessarily associated with the random loss of a runner. If you're in AWS and running on Spot instances, then depending on the Spot instance settings and autoscaling, it's very possible that the Spot instances are heavily associated with the random loss of a GitHub runner instance. In our particular case, we went from SPOT to ON_DEMAND node groups and went from 30-40% failure rates to 0.01%.
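For anyone else on AWS who suspects the same thing, a quick sketch to check from inside a suspect runner whether it landed on a Spot instance (IMDSv1 call shown for brevity; IMDSv2 requires fetching a session token first):

curl -s http://169.254.169.254/latest/meta-data/instance-life-cycle
# prints "spot" or "on-demand"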
I downgraded Ubuntu from 22.04 LTS to 20.04 LTS and the workflow is no longer exhausting anything. |
Description
Since 30 July 2022, our workflow has been failing with the following message:
"The self-hosted runner: ***** lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error."
We run our workflow on an AWS EC2 instance which is always connected and has enough resources (CPU/memory). The above failure happens even for the parts of the workflow that don't require high utilization of CPU/memory.
It seems that the runner loses communication with GitHub and does not continue running the job.
Log
[2022-08-05 09:05:47Z INFO JobServer] Caught exception during append web console line to websocket, let's fallback to sending via non-websocket call (total calls: 48, failed calls: 2, websocket state: Open).
[2022-08-05 09:05:47Z ERR JobServer] System.Net.WebSockets.WebSocketException (0x80004005): The remote party closed the WebSocket connection without completing the close handshake.
---> System.IO.IOException: Unable to write data to the transport connection: An existing connection was forcibly closed by the remote host..
---> System.Net.Sockets.SocketException (10054): An existing connection was forcibly closed by the remote host.
Runner Version and Platform
Version 2.294.0, running on Windows Server 2016.