
Ephemeral (single use) runner registrations #510

Closed
MatisseHack opened this issue May 30, 2020 · 33 comments

@MatisseHack
Contributor

MatisseHack commented May 30, 2020

Describe the bug
When starting a self-hosted runner with ./run.cmd --once, the runner sometimes accepts a second job before shutting down, which causes that second job to fail with the message:

The runner: [runner-name] lost communication with the server. Verify the machine is running and has a healthy network connection.

This looks like the same issue recently fixed here: microsoft/azure-pipelines-agent#2728

To Reproduce
Steps to reproduce the behavior:

  1. Create a repo, enable GitHub Actions, and add a new workflow

  2. Configure a new runner on your machine

  3. Run the runner with ./run.cmd --once

  4. Queue two runs of your workflow

  5. The first job will run and the runner will go offline

  6. (Optionally) configure and start a second runner

  7. The second job will time out after several minutes with the message:

    The runner: [runner-name] lost communication with the server. Verify the machine is running and has a healthy network connection.
    

    (where [runner-name] is the name of the first runner)

  8. Also: trying to remove the first runner with the command ./config.cmd remove --token [token] will result in the following error until the second job times out:

    Failed: Removing runner from the server
    Runner "[runner-name]" is running a job for pool "Default"
    

Expected behavior
The second job should wait for, and run on, any new runner that comes online, rather than being assigned as a second job to the now-offline original runner.

Runner Version and Platform

2.262.1 on Windows

Runner and Worker's Diagnostic Logs

_diag.zip

@MatisseHack MatisseHack added the bug Something isn't working label May 30, 2020
@TingluoHuang
Member

@MatisseHack
How did you find the flag? 😆
The run-once flag --once is not in the official Actions docs and is better not used in production; there is a known issue that we haven't fixed yet. 😄

@bryanmacfarlane
Member

bryanmacfarlane commented May 30, 2020

Yeah, this is why it's not part of the documentation or command-line help: the partially implemented code doesn't work reliably and has races, so we didn't ship it.

We have work on the backlog to register a runner with the service as ephemeral. It has to be a service-side feature so the service cooperates and doesn't assign another job in the window where the runner (or outside automation) decides to tear it down.

@bryanmacfarlane bryanmacfarlane added enhancement New feature or request and removed bug Something isn't working labels May 30, 2020
@MatisseHack
Contributor Author

MatisseHack commented Jun 1, 2020

I've used the --once flag extensively with Azure Pipelines to create an automatically scaling pool of single-use agents. I'm trying to convert that system to work with GitHub Actions. I was pleased to find that the flag still worked (at least partially)!

Makes sense why it wasn't shipped 😄. I look forward to using it when it's supported, though!

@rclmenezes

rclmenezes commented Jun 7, 2020

We solved this issue by adding a timeout to our ephemeral workers. The timeout will deregister and kill the worker.

#!/bin/bash
set -eux

# Register this machine as a runner
/usr/bin/actions/register

# Schedule automatic de-registration and shutdown in 30 minutes,
# in case the runner hangs or errors out before the cleanup below runs
echo "sudo /usr/bin/actions/deregister && sudo shutdown -h now" | at now + 30 minutes

# Run exactly one job, then de-register and power off
sudo -i -u actions /home/actions/run.sh --once
/usr/bin/actions/deregister
shutdown -h now

Actions team: please support --once when you have a chance!

We currently have ~16x parallelization on our CI to keep builds under 20 minutes. When moving to self-hosted, we had the following options:

  1. Have an autoscaling group that keeps at least 16 instances around and deploys more at peak times (this wastes a lot of money, is hard to scale out, and we have no good way to scale in without affecting running jobs).

  2. Make a GitHub-hosted job that deploys spot instances to perform the other jobs! This is substantially cheaper, avoids queues at peak times and allows us to parallelize even more when we need to!

Unfortunately, the second strategy only works well with --once...

So please consider supporting --once officially! If --once stops working, we'll probably have to consider an alternative CI provider.

@pietroalbini

We also use the --once flag as we need a fresh build environment for each build (on AArch64, so we can't use GitHub hosted runners), and while we haven't encountered this issue yet it'd be really nice to get proper support!

@rclmenezes

Also @bryanmacfarlane - two more requests that would make ephemeral workers easier:

  1. Have --once workers automatically de-register themselves on completion or timeout. Currently, if a worker errors out before its cleanup code runs, you'll end up with a ton of zombie "offline" registered workers. I think this needs to be solved on the backend.

  2. If no workers are registered with the proper labels, GitHub will error out the job immediately. This is a problem if you're spinning up ephemeral workers, because the job will fail before they boot up.

Our hackish solution is to keep a registered worker that does no work:

(screenshot: the idle placeholder runner shown as registered in the runners list)

It simply re-registers itself once a day to avoid the 30 day cleanup 😂

Thanks!
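A daily re-registration like the one described could be scheduled with a cron entry along these lines; the /usr/bin/actions/* helper paths mirror the ones assumed in the earlier script and are not part of the runner itself:

```shell
# /etc/cron.d/actions-keepalive (hypothetical file and helper paths)
# Re-register once a day so the placeholder runner never hits the 30-day cleanup.
0 3 * * * root /usr/bin/actions/deregister && /usr/bin/actions/register
```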

@nabilcode

👋 there,

Do we have a date for when --once will be supported?

@zacker150

zacker150 commented Jul 6, 2020

Our hackish solution is to keep a registered worker that does no work:


It looks like GitHub changed the runner service to error out jobs when there are no live runners. This hackish solution no longer works ☹️.

@rclmenezes

rclmenezes commented Jul 9, 2020

This really sucks. It's the last nail in the coffin for ephemeral self-hosted workers.

We're going to switch to permanent self-hosted workers and then probably switch to another CI provider :(

@bryanmacfarlane
Member

@rclmenezes ack on #1 and #2; we're currently designing and working on it. The plan is exactly what you laid out: register the runner as ephemeral with the backend service so the service automatically cleans it up after the job completes and the runner / container exits.

@bryanmacfarlane bryanmacfarlane changed the title Run once runner accepts a second job Ephemeral (single use) runner registrations Jul 10, 2020
@rclmenezes

rclmenezes commented Jul 10, 2020

That's amazing - thanks Bryan! 😄

We can definitely use permanent workers for the next few weeks until that's pushed. Thanks again!

@Dids

Dids commented Jul 15, 2020

@bryanmacfarlane Just to verify, does this mean that single use runners won't accept more than one job?

Right now I'm using a custom orchestrator that keeps a certain amount of single use runners running, removing them from the GitHub API when the containers exit.
While this works well enough as is, there's always a chance of an additional job being queued on an active runner, which means the queued job will eventually time out and fail.
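For reference, the cleanup half of an orchestrator like this can be sketched against the GitHub REST API (GET and DELETE on /repos/{owner}/{repo}/actions/runners); the OWNER/REPO argument and the GITHUB_PAT variable below are illustrative assumptions:

```shell
#!/usr/bin/env bash
# Sketch: remove self-hosted runners that are offline and idle.
# Assumptions: $GITHUB_PAT holds a PAT with repo admin scope, and the
# repo is passed as "OWNER/REPO". Not a hardened implementation.
set -euo pipefail

# Filter the JSON from GET /repos/{owner}/{repo}/actions/runners down to
# the ids of runners that are offline and not busy (safe to delete).
select_removable() {
  python3 -c '
import json, sys
for r in json.load(sys.stdin).get("runners", []):
    if r["status"] == "offline" and not r["busy"]:
        print(r["id"])
'
}

cleanup_runners() {
  local api="https://api.github.com/repos/$1/actions/runners"
  curl -s -H "Authorization: token $GITHUB_PAT" "$api" |
    select_removable |
    while read -r id; do
      curl -s -X DELETE -H "Authorization: token $GITHUB_PAT" "$api/$id"
    done
}

# Usage: GITHUB_PAT=... then
#   cleanup_runners "OWNER/REPO"
```

Note that select_removable deliberately skips runners still marked busy, which is exactly the state the bug in this issue leaves the first runner in until the second job times out.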

@simonbyrne

simonbyrne commented Jul 28, 2020

We're looking at using self-hosted runners with a cluster scheduler (Slurm in our case): it would also be useful for us if there were an additional option to exit immediately when no jobs are queued.

This could be generalized into a timeout feature (i.e., quit if no job is received within X minutes).

@bitfactory-henno-schooljan

Looking forward to this! I take it that when this is implemented, the backend won't error out anymore when zero suitable runners are present, but will instead wait until one has registered itself (e.g. in response to a webhook)? It would be quite difficult to start the first runner otherwise...
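That webhook-driven flow can be sketched as: request a short-lived registration token (POST /repos/{owner}/{repo}/actions/runners/registration-token), then configure and start a single-use runner with it. The $GITHUB_PAT variable, the "OWNER/REPO" argument, and the /opt/runner install dir below are assumptions for illustration; config.sh, run.sh, and the --ephemeral and --unattended flags are the real runner entry points:

```shell
#!/usr/bin/env bash
# Sketch of provisioning one ephemeral runner in response to a webhook.
# Assumptions: $GITHUB_PAT is a PAT with repo admin scope; the runner
# software is unpacked in /opt/runner; repo passed as "OWNER/REPO".
set -euo pipefail

# Pull the short-lived token out of the JSON returned by
# POST /repos/{owner}/{repo}/actions/runners/registration-token
extract_token() {
  python3 -c 'import json, sys; print(json.load(sys.stdin)["token"])'
}

start_runner() {
  local repo="$1" token
  token=$(curl -s -X POST -H "Authorization: token $GITHUB_PAT" \
    "https://api.github.com/repos/$repo/actions/runners/registration-token" |
    extract_token)
  cd /opt/runner
  ./config.sh --url "https://github.com/$repo" --token "$token" \
    --ephemeral --unattended
  ./run.sh   # exits after one job because of --ephemeral
}
```

Nothing runs until start_runner is called; the function is only a sketch of the sequence, not a hardened provisioner.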

@andreabenfatto

andreabenfatto commented Oct 27, 2020

Any updates on this? Perhaps it will be solved by #660

@j3parker

We (https://github.com/Brightspace) are eagerly awaiting this feature 😄

@pekhota

pekhota commented Oct 30, 2020

We (https://github.com/airslateinc) too

@shwuhk

shwuhk commented Dec 2, 2020

I am waiting for this too..

@earlephilhower

https://github.com/esp8266/Arduino/ would love this, too.

Single-use self-hosted support would let us safely run PRs on actual embedded hardware, giving better coverage than our simulated host-based environment.

@stgraber

We'd love to make use of those for https://github.com/lxc, including providing tooling so someone can easily run a fixed number of instances (containers or VMs) on a LXD cluster, with each instance handling a single job before being destroyed and replaced by a new one. This would allow for a safe, low-cost, self-hosted set of runners covering a wide variety of Linux distributions.

@stgraber

Forgot to mention, but this would also make it easy to run on non-x86 architectures. LXD was similarly used to build up the foreign-architecture support of Travis CI, where it's used to run arm64, ppc64el and s390x instances. Something similar could easily be done with GitHub Actions if it weren't for this limitation :)

@xriu

xriu commented Feb 9, 2021

Any updates on this?

@kylecarbs

@bryanmacfarlane any updates on this?

@0x2b3bfa0

We would like to use this feature as part of CML for running ephemeral continuous integration jobs.

@david-behnke

Not having a deterministic build environment is a show stopper for us 👎

@rclmenezes

rclmenezes commented Mar 26, 2021

Help us @bryanmacfarlane, you're our only hope! 🙏


@korotovsky

korotovsky commented Aug 20, 2021

We are experiencing this issue on a bare-metal server with 10+ runners; any ideas how to fix it? No --once flag was used.

@aidan-mundy

Is this resolved with the merge of #660?

@joeyparrish

I believe so.

@pje
Member

pje commented Sep 20, 2021

Closing this as #660 has been merged, --ephemeral support has been fully deployed, and the docs have been updated. 🎉

@pje pje closed this as completed Sep 20, 2021
jonans added a commit to jonans/lxdrunner that referenced this issue Oct 2, 2021
…r#510

- Change setup script to use --ephemeral which replaces --once

Support multiple configurations mapped by workflow labels
- Consume workflow_job instead of check_run events.

- Give RunnerConf.profiles a default.
- Update README
- Update sample config
jonans added a commit to jonans/lxdrunner that referenced this issue Oct 27, 2021
Implement changes for ephemeral fix of actions runner. actions/runner#510
- Change setup script to use --ephemeral which replaces --once

Support multiple configurations mapped by workflow labels
- Consume workflow_job instead of check_run events.

- Update README
- Update sample config

- Support remote images and LXD server
- Fix max_workers limit
- Add systemd user service unit
- Add dev dependencies to setup.cfg/py
- Update Makefile to install via setup and pip
- Show version via help
- Set loglevel via cli argument
- Add Alpine image build script and OpenRC startup

Changes to AppConf:
  - RunnerConf: Add default values
  - RunnerConf: Add max_workers
  - RunnerConf: Add semaphore for limiting
  - Add Remote object
  - AppConf: Add dirs object with common paths

RunnerEvent:
  - Generate instname on object creation

LXDRunner:
  - Remove _thread_launch. Use ThreadPoolExecutor
  - Name threads with instname: lxdrunner-xxxxxx
  - Add LXD event listener to watch for completion events

RunManager:
  - Use semaphore for runner conf limits
  - Submit to threadpool
  - Switch to multiple queues: One per config
  - Fix cleanup schedule

Update Workflow:
   - build wheel
   - build lxd image
   - push to releases
@spangaer

It seems like the --once flag is still around, and I'd like to ask that it be kept.

Registering an --ephemeral runner as part of a webhook certainly has its own right to exist.

--once seems to cater for a different use case:

  1. Have the actions user home dir and the actions working dir in a tmpfs.
  2. Reinitialize that tmpfs with every run (matching the clean slate the cloud-hosted runners produce).
    This includes a non-ephemeral registration.
  3. Run the above in a loop.

That doesn't work with the ephemeral system because the registration token expires, whereas the non-ephemeral case doesn't need a new registration token since the registration stays valid.

We're using this to run the same workflows across both cloud-hosted and self-hosted runners, because some builds are tied in to the enterprise network (for the time being).

@fkorotkov

@pje I'd like to confirm that there is no way to guarantee which workflow an ephemeral runner will pick up when several runners are starting simultaneously for several workflow_job events. I've created a separate feature request, #2106, to allow specifying a workflow id for a runner to pick up.
