
Ephemeral (single use) runner registrations #510

Closed
MatisseHack opened this issue May 30, 2020 · 33 comments

@MatisseHack
Contributor

MatisseHack commented May 30, 2020

Describe the bug
When starting a self-hosted runner with ./run.cmd --once, the runner sometimes accepts a second job before shutting down, which causes that second job to fail with the message:

The runner: [runner-name] lost communication with the server. Verify the machine is running and has a healthy network connection.

This looks like the same issue recently fixed here: microsoft/azure-pipelines-agent#2728

To Reproduce
Steps to reproduce the behavior:

  1. Create a repo, enable GitHub Actions, and add a new workflow

  2. Configure a new runner on your machine

  3. Run the runner with ./run.cmd --once

  4. Queue two runs of your workflow

  5. The first job will run and the runner will go offline

  6. (Optionally) configure and start a second runner

  7. The second job will time out after several minutes with the message:

    The runner: [runner-name] lost communication with the server. Verify the machine is running and has a healthy network connection.
    

    (where [runner-name] is the name of the first runner)

  8. Also: trying to remove the first runner with the command ./config.cmd remove --token [token] will result in the following error until the second job times out:

    Failed: Removing runner from the server
    Runner "[runner-name]" is running a job for pool "Default"
    

Expected behavior
The second job should wait for, and run on, any new runner that comes online, rather than being assigned as a second job to the now-offline original runner.

Runner Version and Platform

2.262.1 on Windows

Runner and Worker's Diagnostic Logs

_diag.zip

@MatisseHack MatisseHack added the bug Something isn't working label May 30, 2020
@TingluoHuang
Member

@MatisseHack
How did you find the flag? 😆
The run-once flag --once is not in the official Actions docs and is better not used in production; there is a known issue that we haven't fixed yet. 😄

@bryanmacfarlane
Member

bryanmacfarlane commented May 30, 2020

Yeah, this is why it's not part of the documentation or command-line help: the partially implemented code doesn't work reliably and has races, so we didn't ship it.

We have work on the backlog to register a runner with the service as ephemeral. It has to be a service-side feature so the service cooperates and doesn't assign another job in the window where the runner (or outside automation) decides to tear it down.

@bryanmacfarlane bryanmacfarlane added enhancement New feature or request and removed bug Something isn't working labels May 30, 2020
@MatisseHack
Contributor Author

MatisseHack commented Jun 1, 2020

I've used the --once flag extensively with Azure Pipelines to create an automatically scaling pool of single-use agents. I'm trying to convert that system to work with GitHub Actions. I was pleased to find that the flag still worked (at least partially)!

Makes sense why it wasn't shipped 😄. I look forward to using it when it's supported, though!

@rclmenezes

rclmenezes commented Jun 7, 2020

We solved this issue by adding a timeout to our ephemeral workers. The timeout will deregister and kill the worker.

#!/bin/bash
set -eux

# Register this machine as a runner
/usr/bin/actions/register

# Schedule automatic de-registration and shutdown in 30 minutes,
# in case the runner hangs or errors out before the cleanup below runs
echo "sudo /usr/bin/actions/deregister && sudo shutdown -h now" | at now + 30 minutes

# Run exactly one job, then de-register and power off
sudo -i -u actions /home/actions/run.sh --once
/usr/bin/actions/deregister
shutdown -h now

Actions team: please support --once when you have a chance!

We currently have ~16x parallelization on our CI to keep builds under 20 minutes. When moving to self-hosted, we had the following options:

  1. Have an autoscaling group that keeps at least 16 instances around and deploys more at peak times (this wastes a lot of money, is hard to scale out, and we have no good way to scale in without affecting running jobs).

  2. Make a GitHub-hosted job that deploys spot instances to perform the other jobs! This is substantially cheaper, avoids queues at peak times and allows us to parallelize even more when we need to!

Unfortunately, the second strategy only works well with --once...

So please consider supporting --once officially! If --once stops working, we'll probably have to consider an alternative CI provider.

@pietroalbini

We also use the --once flag as we need a fresh build environment for each build (on AArch64, so we can't use GitHub hosted runners), and while we haven't encountered this issue yet it'd be really nice to get proper support!

@rclmenezes

Also @bryanmacfarlane - two more requests that would make ephemeral workers easier:

  1. Have --once workers automatically de-register themselves on completion or timeout. Currently, if a worker errors out before its cleanup code runs, you'll end up with a ton of zombie "offline" registered workers. I think this needs to be solved on the backend.

  2. If no workers are registered with the proper labels, GitHub will error out the job immediately. This is a problem if you're spinning up ephemeral workers, because the job will fail before they boot up.

Our hackish solution is to keep a registered worker that does no work:

(screenshot: the idle placeholder runner shown as registered in the runners list)

It simply re-registers itself once a day to avoid the 30 day cleanup 😂

Thanks!
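A daily re-registration like the one described could be scheduled with a cron entry along these lines; the /usr/bin/actions/* helper paths mirror the ones assumed in the earlier script and are not part of the runner itself:

```shell
# /etc/cron.d/actions-keepalive (hypothetical file and helper paths)
# Re-register once a day so the placeholder runner never hits the 30-day cleanup.
0 3 * * * root /usr/bin/actions/deregister && /usr/bin/actions/register
```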

@nabilcode

👋 there,

Do we have a date for when --once will be supported?

@zacker150

zacker150 commented Jul 6, 2020

Our hackish solution is to keep a registered worker that does no work:


It looks like GitHub changed the runner service to error out jobs when there are no live runners. This hackish solution no longer works ☹️.

@rclmenezes

rclmenezes commented Jul 9, 2020

This really sucks. It's the last nail in the coffin for ephemeral self-hosted workers.

We're going to switch to permanent self-hosted workers and then probably switch to another CI provider :(

@bryanmacfarlane
Member

@rclmenezes ack on #1 and #2; we're currently designing and working on it. The plan is exactly what you laid out: register the runner as ephemeral with the backend service so the service automatically cleans it up after the job completes and the runner / container exits.

@bryanmacfarlane bryanmacfarlane changed the title Run once runner accepts a second job Ephemeral (single use) runner registrations Jul 10, 2020
@rclmenezes

rclmenezes commented Jul 10, 2020

That's amazing - thanks Bryan! 😄

We can definitely use permanent workers for the next few weeks until that's pushed. Thanks again!

@Dids

Dids commented Jul 15, 2020

@bryanmacfarlane Just to verify, does this mean that single use runners won't accept more than one job?

Right now I'm using a custom orchestrator that keeps a certain amount of single use runners running, removing them from the GitHub API when the containers exit.
While this works well enough as is, there's always a chance of an additional job being queued on an active runner, which means the queued job will eventually time out and fail.
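For reference, the cleanup half of an orchestrator like this can be sketched against the GitHub REST API (GET and DELETE on /repos/{owner}/{repo}/actions/runners); the OWNER/REPO argument and the GITHUB_PAT variable below are illustrative assumptions:

```shell
#!/usr/bin/env bash
# Sketch: remove self-hosted runners that are offline and idle.
# Assumptions: $GITHUB_PAT holds a PAT with repo admin scope, and the
# repo is passed as "OWNER/REPO". Not a hardened implementation.
set -euo pipefail

# Filter the JSON from GET /repos/{owner}/{repo}/actions/runners down to
# the ids of runners that are offline and not busy (safe to delete).
select_removable() {
  python3 -c '
import json, sys
for r in json.load(sys.stdin).get("runners", []):
    if r["status"] == "offline" and not r["busy"]:
        print(r["id"])
'
}

cleanup_runners() {
  local api="https://api.github.com/repos/$1/actions/runners"
  curl -s -H "Authorization: token $GITHUB_PAT" "$api" |
    select_removable |
    while read -r id; do
      curl -s -X DELETE -H "Authorization: token $GITHUB_PAT" "$api/$id"
    done
}

# Usage: GITHUB_PAT=... then
#   cleanup_runners "OWNER/REPO"
```

Note that select_removable deliberately skips runners still marked busy, which is exactly the state the bug in this issue leaves the first runner in until the second job times out.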

@simonbyrne

simonbyrne commented Jul 28, 2020

We're looking at using self-hosted runners with a cluster scheduler (Slurm in our case): it would also be useful for us if there were an additional option to exit immediately when no jobs are queued.

This could be generalized into a timeout feature (i.e., quit if no job is received within X minutes).

@bitfactory-henno-schooljan

Looking forward to this! I take it that when this is implemented, the backend won't error out anymore when zero suitable runners are present, but will instead wait until one has registered itself (e.g. in response to a webhook)? It would be quite difficult to start the first runner otherwise...
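That webhook-driven flow can be sketched as: request a short-lived registration token (POST /repos/{owner}/{repo}/actions/runners/registration-token), then configure and start a single-use runner with it. The $GITHUB_PAT variable, the "OWNER/REPO" argument, and the /opt/runner install dir below are assumptions for illustration; config.sh, run.sh, and the --ephemeral and --unattended flags are the real runner entry points:

```shell
#!/usr/bin/env bash
# Sketch of provisioning one ephemeral runner in response to a webhook.
# Assumptions: $GITHUB_PAT is a PAT with repo admin scope; the runner
# software is unpacked in /opt/runner; repo passed as "OWNER/REPO".
set -euo pipefail

# Pull the short-lived token out of the JSON returned by
# POST /repos/{owner}/{repo}/actions/runners/registration-token
extract_token() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["token"])'
}

start_runner() {
  local repo="$1" token
  token=$(curl -s -X POST -H "Authorization: token $GITHUB_PAT" \
    "https://api.github.com/repos/$repo/actions/runners/registration-token" |
    extract_token)
  cd /opt/runner
  ./config.sh --url "https://github.com/$repo" --token "$token" \
    --ephemeral --unattended
  ./run.sh   # exits after one job because of --ephemeral
}
```

Nothing runs until start_runner is called; the function is only a sketch of the sequence, not a hardened provisioner.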

@andreabenfatto

andreabenfatto commented Oct 27, 2020

Any updates on this? Perhaps it will be solved by #660

@j3parker

We (https://github.com/Brightspace) are eagerly awaiting this feature 😄

@pekhota

pekhota commented Oct 30, 2020

We (https://github.com/airslateinc) too

@shwuhk

shwuhk commented Dec 2, 2020

I am waiting for this too..

@earlephilhower

https://github.com/esp8266/Arduino/ would love this, too.

Single-use self-hosted support would let us safely run PRs on actual embedded hardware, giving better coverage than our simulated host-based environment.

@stgraber

We'd love to make use of those for https://github.com/lxc, including providing tooling so someone can easily run a fixed number of instances (containers or VMs) on a LXD cluster, with each instance handling a single job before being destroyed and replaced by a new one. This would allow for a safe, low-cost, self-hosted set of runners covering a wide variety of Linux distributions.

@stgraber

Forgot to mention, but this would also make it easy to run on non-x86 architectures. LXD was similarly used to build up the foreign-architecture support of Travis CI, where it's used to run arm64, ppc64el and s390x instances. Something similar could easily be done with GitHub Actions if it weren't for this limitation :)

@xriu

xriu commented Feb 9, 2021

Any updates on this?

@kylecarbs

@bryanmacfarlane any updates on this?

@0x2b3bfa0

We would like to use this feature as part of CML for running ephemeral continuous integration jobs.

@david-behnke

Not having a deterministic build environment is a show stopper for us 👎

@rclmenezes

rclmenezes commented Mar 26, 2021

Help us @bryanmacfarlane, you're our only hope! 🙏


@korotovsky

korotovsky commented Aug 20, 2021

We are experiencing this issue on a bare-metal server with 10+ runners; any ideas how to fix it? No --once flag was used.

@aidan-mundy

Is this resolved with the merge of #660?

@joeyparrish

I believe so.

@pje
Member

pje commented Sep 20, 2021

Closing this as #660 has been merged, --ephemeral support has been fully deployed, and the docs have been updated. 🎉

@pje pje closed this as completed Sep 20, 2021
jonans added a commit to jonans/lxdrunner that referenced this issue Oct 2, 2021
…r#510

- Change setup script to use --ephemeral which replaces --once

Support multiple configurations mapped by workflow labels
- Consume workflow_job instead of check_run events.

- Give RunnerConf.profiles a default.
- Update README
- Update sample config
jonans added a commit to jonans/lxdrunner that referenced this issue Oct 27, 2021
Implement changes for ephemeral fix of actions runner. actions/runner#510
- Change setup script to use --ephemeral which replaces --once

Support multiple configurations mapped by workflow labels
- Consume workflow_job instead of check_run events.

- Update README
- Update sample config

- Support remote images and LXD server
- Fix max_workers limit
- Add systemd user service unit
- Add dev dependencies to setup.cfg/py
- Update Makefile to install via setup and pip
- Show version via help
- Set loglevel via cli argument
- Add Alpine image build script and OpenRC startup

Changes to AppConf:
  - RunnerConf: Add default values
  - RunnerConf: Add max_workers
  - RunnerConf: Add semaphore for limiting
  - Add Remote object
  - AppConf: Add dirs object with common paths

RunnerEvent:
  - Generate instname on object creation

LXDRunner:
  - Remove _thread_launch. Use ThreadPoolExecutor
  - Name threads with instname: lxdrunner-xxxxxx
  - Add LXD event listener to watch for completion events

RunManager:
  - Use semaphore for runner conf limits
  - Submit to threadpool
  - Switch to multiple queues: One per config
  - Fix cleanup schedule

Update Workflow:
   - build wheel
   - build lxd image
   - push to releases
@spangaer

It seems like the --once flag is still around, and I'd like to ask that it be kept.

Registering an --ephemeral runner as part of a webhook certainly has its own right to exist.

--once seems to cater for a different use case:

  1. Have the actions user home dir and the actions working dir in a tmpfs.
  2. Reinitialize that tmpfs with every run (matching the clean slate the cloud-hosted runners produce).
    This includes a non-ephemeral registration.
  3. Run the above in a loop.

That doesn't work with the ephemeral system because the registration token expires, whereas the non-ephemeral case doesn't need a new registration token since the registration stays valid.

We're using this to run the same workflows across both cloud-hosted and self-hosted runners, because some builds are tied in to the enterprise network (for the time being).

@fkorotkov

@pje I'd like to confirm that there is no way to guarantee which workflow an ephemeral runner will pick up when several runners are starting simultaneously for several workflow_job events. I've created a separate feature request, #2106, to allow specifying a workflow id for a runner to pick up.
