
Dealing with jobs failing with "lost communication with the server" errors #466

Open
mumoshu opened this issue Apr 19, 2021 · 10 comments
Labels
documentation (Improvements or additions to documentation) · help wanted (Extra attention is needed)

Comments

@mumoshu
Collaborator

mumoshu commented Apr 19, 2021

I don't think I have encountered this myself yet, but I believe any job on self-hosted GitHub runners is subject to this error due to a race condition between the runner agent and GitHub.

This isn't specific to actions-runner-controller and I believe it's an upstream issue, but I'd still like to gather voices and knowledge around it and hopefully find a workaround.

Please see the related issues for more information.

This issue is mainly to gather experiences from whoever has been affected by the error. I'd appreciate it if you could share your stories, workarounds, fixes, etc. around the issue so that it can ideally be fixed upstream or in actions-runner-controller.

Verifying if you're affected by this problem

Note that the error can also happen when:

  • The runner container got OOM-killed because your runner pod has insufficient resources. Set higher resource requests/limits (see the sketch after this list).
  • The runner container got OOM-killed because your node has insufficient resources and your runner pod had low priority. Use a larger machine type for your nodes.
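For the pod-resources case, a minimal sketch of what higher requests/limits can look like on a RunnerDeployment (the name, repository, and resource values below are placeholders, not recommendations):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy              # placeholder name
spec:
  replicas: 2
  template:
    spec:
      repository: example-org/example-repo   # placeholder repository
      # Requests/limits for the runner container; size these from your own
      # job's observed CPU/memory usage so the runner isn't OOM-killed.
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
```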

If you encounter the error even after tweaking your pod and node resources, it is likely due to the race between the runner agent and GitHub.

Information

  • Even GitHub support seems to say that stopping the runner and using --once are the go-to solutions. But I believe both are subject to this race condition.

Possible workarounds

  • Disabling ephemeral runners (Ephemeral Runner: Can we make this optional? #457), i.e. removing the --once flag from run.sh, may "alleviate" this issue, but not completely (see the sketch after this list).
  • Don't use ephemeral runners, and stop runners only within a maintenance window you've defined, while telling your colleagues not to run jobs during that window. (The downside of this approach is that you can't rolling-update runners outside of the maintenance window.)
  • Restart the whole workflow run whenever any job in it fails (note that we can't retry an individual job on GitHub Actions today).
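A minimal sketch of the first workaround, assuming the `ephemeral` field of the RunnerDeployment spec is what toggles `--once` (the name and repository below are placeholders):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy              # placeholder name
spec:
  template:
    spec:
      repository: example-org/example-repo   # placeholder repository
      # Persistent (non-ephemeral) runner: run.sh is started without --once,
      # trading per-job isolation for fewer registration races.
      ephemeral: false
```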
@jwalters-gpsw

This comment has been minimized.

@mumoshu

This comment has been minimized.

@jwalters-gpsw

This comment has been minimized.

@mumoshu mumoshu added documentation Improvements or additions to documentation help wanted Extra attention is needed labels May 16, 2021
mumoshu added a commit that referenced this issue Aug 11, 2021
This adds support for two upcoming enhancements on the GitHub side of self-hosted runners: ephemeral runners and `workflow_job` events. You can't use these yet.

**These features are not yet generally available to all GitHub users**. Please take this pull request as preparation to make them available to actions-runner-controller users as soon as possible after GitHub releases the necessary features on their end.

**Ephemeral runners**:

The former, ephemeral runners, is basically a reliable alternative to `--once`, which we've been using when `ephemeral: true` is enabled (the default in actions-runner-controller).

`--once` has been suffering from a race issue #466. `--ephemeral` fixes that.

To enable ephemeral runners with `actions/runner`, you pass `--ephemeral` to `config.sh`. This updated version of `actions-runner-controller` does it for you by using `--ephemeral` instead of `--once` when you set `RUNNER_FEATURE_FLAG_EPHEMERAL=true`.

Please read the section `Ephemeral Runners` in the updated version of our README for more information.

Note that ephemeral runners are not released on GitHub yet, and `RUNNER_FEATURE_FLAG_EPHEMERAL=true` won't work at all until the feature is released on GitHub. Stay tuned for an announcement from GitHub!
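For reference, a sketch of what opting in could look like on a RunnerDeployment once the feature ships, assuming the flag is set as a plain environment variable on the runner container (names are placeholders; check the `Ephemeral Runners` README section for the authoritative setup):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy              # placeholder name
spec:
  template:
    spec:
      repository: example-org/example-repo   # placeholder repository
      ephemeral: true                      # keep the default
      env:
        # Ask the entrypoint to register with --ephemeral instead of --once.
        # This has no effect until GitHub releases the feature.
        - name: RUNNER_FEATURE_FLAG_EPHEMERAL
          value: "true"
```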

**`workflow_job` events**:

`workflow_job` is the additional webhook event that corresponds to each GitHub Actions workflow job run. It provides `actions-runner-controller` with a solid foundation to improve our webhook-based autoscaling.

Formerly, we've been exploiting webhook events like `check_run` for autoscaling. However, as none of our supported events included `labels`, you had to configure an HRA to match only the relevant `check_run` events. It wasn't trivial.

In contrast, a `workflow_job` event payload contains the `labels` of the requested runners. `actions-runner-controller` is able to automatically decide which HRA to scale by filtering the corresponding RunnerDeployment by the `labels` included in the webhook payload. So all you need to do to use webhook-based autoscaling is enable `workflow_job` webhooks on GitHub and expose actions-runner-controller's webhook server to the internet.

Note that the current implementation of `workflow_job` support works in two ways: increment and decrement. An increment happens when the webhook server receives a `workflow_job` event with `queued` status. A decrement happens when it receives a `workflow_job` event with `completed` status. The latter is used to make scaling down faster so that you waste less money than before. You still don't suffer from flapping, as a scale-down is still subject to `scaleDownDelaySecondsAfterScaleOut`.
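A sketch of a `HorizontalRunnerAutoscaler` wired to `workflow_job` events, based on my reading of the updated README (the target name, replica counts, and durations are placeholders):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-runnerdeploy-autoscaler   # placeholder name
spec:
  scaleTargetRef:
    name: example-runnerdeploy            # the RunnerDeployment to scale
  minReplicas: 0
  maxReplicas: 10
  # Don't scale down immediately after a scale-up, to avoid flapping.
  scaleDownDelaySecondsAfterScaleOut: 300
  scaleUpTriggers:
    # +1 replica per queued workflow_job; the capacity is released when the
    # corresponding completed event arrives or the duration expires.
    - githubEvent:
        workflowJob: {}
      amount: 1
      duration: "5m"
```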

Please read the section "Example 3: Scale on each `workflow_job` event" in the updated version of our README for more information on its usage.
@shreyasGit

Is this issue resolved now? We are still getting this for self-hosted runners on a Linux box.

@mumoshu
Collaborator Author

mumoshu commented Jun 20, 2022

@shreyasGit Hey. First of all, this can't be fixed 100% by ARC alone. For example, if you use EC2 spot instances for hosting self-hosted runners, it's unavoidable (as we can't block spot termination). ARC has addressed all the issues related to this on its side, so please check your deployment first and consider whether it's supposed to work without the lost-communication error.

Also, fundamentally this can be considered an issue in GitHub Actions itself, as it doesn't have any facility to auto-restart jobs that disappeared prematurely. Would you mind submitting a feature request to GitHub too?

@DerekTBrown

> Note that the error can also happen when: The runner container got OOM-killed because your runner pod has insufficient resources. Set higher resource requests/limits.

Are there ways that actions-runner-controller or actions-runner could handle the OOM-killed case more gracefully? Could we somehow report OOM kills to the GitHub UI? Could we run a separate OOM killer inside the workers to kill the workflow before it exceeds memory limits?

@mumoshu
Collaborator Author

mumoshu commented Sep 15, 2022

@DerekTBrown Hey. I think that's fundamentally impossible. Even if we were able to hook into the host's system log to know which process got OOM-killed, we have no easy way to correlate it with a process within a container in a pod, and even worse, GitHub Actions doesn't provide an API to "externally" set a workflow job status.

However, I thought you could still see the job time out after 10 minutes or so: the job that was running when the runner disappeared (for whatever reason, like OOM) is eventually marked as failed (although without an explicit error message at all). Would that be enough?

@alyssaruth

> This issue is mainly to gather experiences from whoever has been affected by the error. I'd appreciate it if you could share your stories, workarounds, fixes, etc. around the issue so that it can ideally be fixed upstream or in actions-runner-controller.

Really appreciate this! ❤️

We currently see this occasionally - from our stats it's ~2.5% of builds of our main CI/CD pipeline. Our setup is that we're using actions-runner-controller within GKE, and the nodes that the runners get spun up on are Spot VMs. We do this because it's a massive cost saving - however, it inevitably means sometimes we bump into this error when a node that's hosting a runner gets taken away from us by Google.

Even outside of Spot VMs, there are all sorts of other imaginable reasons that are hard or impossible to mitigate - e.g. node upgrades, OOM kills and suchlike. The dream for us would be for jobs affected by this to automatically restart from the beginning, provisioning a fresh runner and going again.

@nsheaps
Contributor

nsheaps commented Aug 28, 2023

@alyssa-glean For re-running, I'd recommend a workflow that runs on completion of the specific workflows you want (you can use a cloud-hosted or self-hosted runner for this) and re-triggers the job; see the sketch below for one way to wire this up. We've been doing that at my place of work and it works great.
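A sketch of that approach (not our exact setup): a small workflow that fires on completion of a watched workflow and re-runs only the failed jobs of the first attempt. The watched workflow name "CI" and the file name are placeholders.

```yaml
# .github/workflows/retry-lost-runners.yml (hypothetical file name)
name: Retry failed runs
on:
  workflow_run:
    workflows: ["CI"]            # placeholder: the workflow(s) to watch
    types: [completed]

jobs:
  rerun:
    # Only react to failures, and only retry the first attempt to avoid loops.
    if: >-
      github.event.workflow_run.conclusion == 'failure' &&
      github.event.workflow_run.run_attempt == 1
    runs-on: ubuntu-latest
    permissions:
      actions: write
    steps:
      - name: Re-run failed jobs
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: gh run rerun ${{ github.event.workflow_run.id }} --failed --repo ${{ github.repository }}
```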

As for lost-communication errors, for us this was mainly caused by custom timeout logic within our GitHub Actions workflows, which attempted to disable job control and then use the process group ID to determine which process to kill when the timeout expired. We've since changed this to use the Linux-native `timeout` command with job control, and the problem has mostly resolved itself, except for some cases when GitHub itself is having issues.
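A sketch of the replacement, with a plain coreutils `timeout` wrapping the test entrypoint inside the job step (the script path and durations are placeholders):

```yaml
name: Tests with native timeout
on: [push]

jobs:
  tests:
    runs-on: [self-hosted, linux]
    steps:
      - uses: actions/checkout@v4
      - name: Run tests with a hard timeout
        # coreutils timeout sends SIGTERM after 30 minutes and, with -k 60,
        # a SIGKILL 60 seconds later if the process still hasn't exited.
        run: timeout -k 60 30m ./scripts/run-tests.sh   # placeholder script
```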

As for runner shutdown errors, this has been fully mitigated for us by the graceful termination suggestions here. This has been confirmed both by the logs on the runner (which we export to make sure we still have them after the pod dies, via a custom image with fluentd in it that pipes the runner logs, runner worker logs, and the actions runner daemon logs to stdout) and by Prometheus metrics, which we send to Datadog.
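For readers landing here without the link: as I understand the graceful-termination guidance, it boils down to giving the runner pod a long enough grace period and telling the entrypoint to wait for the in-flight job. A sketch only; the field and variable names are assumptions to verify against the ARC README:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy              # placeholder name
spec:
  template:
    spec:
      repository: example-org/example-repo   # placeholder repository
      # Kubernetes waits this long before force-killing the pod.
      terminationGracePeriodSeconds: 3600
      env:
        # Assumed variable: how long the entrypoint waits for the running job
        # to finish before letting the runner process stop.
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "3500"
```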

What didn't work was setting the pod annotation `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"`; it seemed to have no effect (at least on EKS).

As for other general termination behavior, we've noticed that our longer-running jobs that use docker-in-docker (dind) as a sidecar do not terminate gracefully. The system logs and metrics do note that a termination signal is sent and that the runner pod successfully waits. However, the default behavior of the container runtime/Kubernetes is to send the termination signal to the main process of every container in a pod, including docker-in-docker. This kills the docker daemon and causes the tests to fail when the autoscaler decides that the pod should be terminated (and we haven't really figured out why that's happening in the first place). For us this equates to failures on ~13% of our job runs for these tests. It'd be nice for the sidecar to only get terminated once the runner itself has also exited, similar to the idea here. For now, my best idea (without adding more custom stuff to the image) would be to include dind in the runner container rather than as a sidecar.

EDIT: The dind container DOES have that termination mechanism, but the docs don't suggest how to set it properly. I peeked in the CRD and found the right way to set it (`dockerEnv`); see the sketch below.
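For anyone else looking: `dockerEnv` is the RunnerDeployment field that injects environment variables into the dind sidecar. A sketch only; I'm assuming the same graceful-stop variable applies to the sidecar, so treat the variable name as hypothetical and check the CRD/docs:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy              # placeholder name
spec:
  template:
    spec:
      repository: example-org/example-repo   # placeholder repository
      dockerEnv:
        # Hypothetical: keep the dockerd sidecar alive while the runner
        # finishes its in-flight job, mirroring the runner container setting.
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "3500"
```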

@signor-mike

I downgraded Ubuntu from 22.04 LTS to 20.04 LTS and the workflow is no longer exhausting anything.
