Re-use EC2 runner instances #191

Closed
moltar opened this issue Dec 15, 2022 · 11 comments · Fixed by #273

Comments

@moltar

moltar commented Dec 15, 2022

Does it make sense to launch a separate EC2 instance for each workflow?

This really slows the process down. Ideally it should launch one instance, re-use it for all runs, and shut it down when idle.

@kichik
Member

kichik commented Dec 15, 2022

We currently focus on ephemeral runners where one job doesn't affect the state of the next one. Leaving a runner behind means every package you installed in job 1 will be available in job 2. That can cause unexpected and unpredictable results.

In the future we may add an API for a pool of pre-warmed runners or even runners that stay around. But right now the focus is on-demand runners.

@moltar
Author

moltar commented Dec 15, 2022

Agreed on stateless.

Would it be possible to run an EC2 instance with a Docker daemon that starts ephemeral containers for jobs?

@kichik
Member

kichik commented Dec 15, 2022

How would that be different than CodeBuild/Fargate/Lambda providers?

@moltar
Author

moltar commented Dec 15, 2022

The ability to run Docker in Docker (dind).

@moltar
Author

moltar commented Dec 15, 2022

Faster start times. Lambda is fast, but CodeBuild is very slow to provision, and the same goes for Fargate.

@kichik
Member

kichik commented Dec 16, 2022

I think solving this with pre-warmed instances will be simpler. That way we don't have to create another system to manage capacity and running containers on EC2.

@moltar
Author

moltar commented Dec 17, 2022

Wouldn't pre-warmed instances eat up all of the savings then?

E.g. in our case, for every PR, we run 12 GitHub Workflows. This means 12 instances would need to be on stand-by at all times.
Otherwise, having to wait for even one of them would delay the entire PR readiness (all tests need to pass).

@kichik
Member

kichik commented Dec 18, 2022

Yes, pre-warmed instances, or an always-on instance, would eat up the savings of on-demand. AWS prices scale linearly with instance size, so both solutions would eat into the savings in a similar manner, unless you wish to overprovision on the shared instance. Either way you don't get the cost savings of on-demand.

This is beginning to sound like ECS. Do you know if ECS with an already active instance provisions faster than Fargate? Adding ECS support should be relatively easy and I think it will give us a shared instance without much work or orchestration code to maintain in the future.

@kichik
Member

kichik commented Dec 21, 2022

I gave ECS a try and found that it starts the container in about a second. When it also has to pull the image, it takes about 20 seconds to start the container. We can probably pre-pull the image. This assumes a running instance and no need to scale up and create more instances.

Both cases take an extra 10 seconds or so for the runner to connect after the container starts. That's just how long the runner code takes, and it might be optimized by GitHub in the future.

Some code for the future (doesn't handle auto-scaling by container number yet):

import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';

// `stack`, `vpc` and `fargateX64Builder` are defined elsewhere in the app.
const cluster = new ecs.Cluster(stack, 'ecs cluster', {
  enableFargateCapacityProviders: false,
  // EC2 capacity backing the cluster -- one m5.large kept warm, up to two.
  capacity: {
    instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.LARGE),
    minCapacity: 1,
    maxCapacity: 2,
    desiredCapacity: 1,
    vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
    spotInstanceDraining: false, // waste of money to restart jobs as the restarted job won't have a token
  },
  vpc: vpc,
  // TODO ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
});

// Task definition running the runner image built by the Fargate image builder.
const task = new ecs.Ec2TaskDefinition(stack, 'task');
task.addContainer('runner', {
  image: ecs.ContainerImage.fromEcrRepository(fargateX64Builder.bind().imageRepository),
  user: 'runner',
  memoryLimitMiB: 2048,
  command: ['./config.sh', '--unattended'],
  logging: ecs.LogDriver.awsLogs({
    streamPrefix: 'runner',
  }),
});
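
Regarding the ECS_IMAGE_PULL_BEHAVIOR TODO: the ECS agent reads that setting from /etc/ecs/ecs.config on the instance. A minimal sketch of one way to wire it up, assuming the capacity is added with addCapacity() instead of the capacity prop so the Auto Scaling Group handle is available for user data:

const asg = cluster.addCapacity('runner capacity', {
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.LARGE),
  minCapacity: 1,
  maxCapacity: 2,
  desiredCapacity: 1,
  vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
});

// Tell the ECS agent to prefer locally cached images over pulling from the
// registry, so repeated runner starts can skip the ~20 second image pull.
asg.addUserData('echo ECS_IMAGE_PULL_BEHAVIOR=prefer-cached >> /etc/ecs/ecs.config');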

@RichiCoder1

Maybe ECS auto-scaling + a simple daemon to pre-cache builder images?

You'd still get extra-long cold starts for scale-up + image pull, though you could potentially mitigate that with Warm Pools, which let you balance the performance/price tradeoff.

@automartin5000

I was surprised to see the on-demand behavior of the EC2 Runner Provider. My understanding was that GitHub Actions runs in Docker containers, so I was expecting that Actions run on EC2 Runners would similarly use Docker and inherently be ephemeral. I would second the notion of using ECS to manage runners on EC2.

mergify bot closed this as completed in #273 Apr 6, 2023
mergify bot pushed a commit that referenced this issue Apr 6, 2023
The EC2-backed ECS runner provider gives the user more control over the underlying resources. The number of EC2 instances that can run runner containers, and their auto-scaling, can be adjusted. By default it scales down to zero, but `minInstances` can be set to a higher number to get faster runner start times.

Default auto-scaling behavior leaves instances around for 15 minutes after the last job finishes.

Fixes #191
Related #260
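
For reference, the new provider could be wired into the GitHubRunners construct roughly like this. A minimal sketch, assuming the provider class is exposed as EcsRunnerProvider and accepts a maxInstances property alongside minInstances (only `minInstances` is confirmed by this thread):

import { App, Stack } from 'aws-cdk-lib';
import { EcsRunnerProvider, GitHubRunners } from '@cloudsnorkel/cdk-github-runners';

const app = new App();
const stack = new Stack(app, 'github-runners');

// Keep one instance warm so runner containers start in seconds instead of
// waiting for a fresh instance to boot; scale out under load.
const ecsProvider = new EcsRunnerProvider(stack, 'ecs runners', {
  minInstances: 1, // confirmed above; the default of 0 scales to zero
  maxInstances: 5, // assumed property name
});

new GitHubRunners(stack, 'runners', {
  providers: [ecsProvider],
});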