Re-use EC2 runner instances #191

Closed
moltar opened this issue Dec 15, 2022 · 11 comments · Fixed by #273

Comments

@moltar

moltar commented Dec 15, 2022

Does it make sense to launch a separate EC2 instance for each workflow?

This really slows the process down. Ideally it should launch one instance, re-use it for all runs, and shut it down when idle.

@kichik
Member

kichik commented Dec 15, 2022

We currently focus on ephemeral runners where one job doesn't affect the state of the next one. Leaving a runner behind means every package you installed in job 1 will be available in job 2. That can cause unexpected and unpredictable results.

In the future we may add an API for a pool of pre-warmed runners or even runners that stay around. But right now the focus is on-demand runners.

@moltar
Author

moltar commented Dec 15, 2022

Agreed on stateless.

Would it be possible to run an EC2 instance with a Docker daemon that starts ephemeral containers for jobs?

@kichik
Member

kichik commented Dec 15, 2022

How would that be different than CodeBuild/Fargate/Lambda providers?

@moltar
Author

moltar commented Dec 15, 2022

The ability to run Docker in Docker (dind).

@moltar
Author

moltar commented Dec 15, 2022

Faster start times. Lambda is fast, but CodeBuild is very slow to provision, and the same goes for Fargate.

@kichik
Member

kichik commented Dec 16, 2022

I think solving this with pre-warmed instances will be simpler. That way we don't have to create another system to manage capacity and running containers on EC2.

@moltar
Author

moltar commented Dec 17, 2022

Wouldn't pre-warmed instances eat up all of the savings then?

E.g. in our case, for every PR, we run 12 GitHub Workflows. This means 12 instances would need to be on stand-by at all times.
Otherwise, having to wait for even one of them would delay the entire PR readiness (all tests need to pass).

@kichik
Member

kichik commented Dec 18, 2022

Yes, pre-warmed instances, or an always-on instance, would eat up the savings of on-demand. AWS prices scale linearly with instance size, so both solutions would eat into the savings in a similar manner, unless you wish to overprovision on the shared instance. Either way you don't get the cost savings of on-demand.

This is beginning to sound like ECS. Do you know if ECS with an already active instance provisions faster than Fargate? Adding ECS support should be relatively easy and I think it will give us a shared instance without much work or orchestration code to maintain in the future.

@kichik
Member

kichik commented Dec 21, 2022

I gave ECS a try and found that it starts the container in about a second. When it also has to pull the image, it takes about 20 seconds to start the container. We can probably pre-pull the image. This assumes a running instance and no need to scale up and create more instances.

Both cases take an extra 10 seconds or so for the runner to connect after the container starts. That's just how long the runner code takes, and it might be optimized by GitHub in the future.

Some code for the future (doesn't handle auto-scaling by container number yet):

import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';

// `stack`, `vpc` and `fargateX64Builder` are defined elsewhere in the app.
const cluster = new ecs.Cluster(stack, 'ecs cluster', {
  enableFargateCapacityProviders: false,
  // EC2 capacity backing the cluster -- one m5.large kept warm, up to two.
  capacity: {
    instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.LARGE),
    minCapacity: 1,
    maxCapacity: 2,
    desiredCapacity: 1,
    vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
    spotInstanceDraining: false, // waste of money to restart jobs as the restarted job won't have a token
  },
  vpc: vpc,
  // TODO ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
});

// Task definition running the runner image built by the Fargate image builder.
const task = new ecs.Ec2TaskDefinition(stack, 'task');
task.addContainer('runner', {
  image: ecs.ContainerImage.fromEcrRepository(fargateX64Builder.bind().imageRepository),
  user: 'runner',
  memoryLimitMiB: 2048,
  command: ['./config.sh', '--unattended'],
  logging: ecs.LogDriver.awsLogs({
    streamPrefix: 'runner',
  }),
});
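
Regarding the ECS_IMAGE_PULL_BEHAVIOR TODO: the ECS agent reads that setting from /etc/ecs/ecs.config on the instance. A minimal sketch of one way to wire it up, assuming the capacity is added with addCapacity() instead of the capacity prop so the Auto Scaling Group handle is available for user data:

const asg = cluster.addCapacity('runner capacity', {
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.LARGE),
  minCapacity: 1,
  maxCapacity: 2,
  desiredCapacity: 1,
  vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
});

// Tell the ECS agent to prefer locally cached images over pulling from the
// registry, so repeated runner starts can skip the ~20 second image pull.
asg.addUserData('echo ECS_IMAGE_PULL_BEHAVIOR=prefer-cached >> /etc/ecs/ecs.config');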

@RichiCoder1

Maybe ECS auto-scaling + a simple daemon to pre-cache builder images?

You'd still get extra-long cold starts for scale-up + image pull, though you could potentially mitigate that with Warm Pools, which let you balance the performance/price tradeoff.

@automartin5000

I was surprised to see the on-demand behavior of the EC2 Runner Provider. My understanding was that GitHub Actions runs in Docker containers, so I was expecting that Actions run on EC2 Runners would similarly use Docker and inherently be ephemeral. I would second the notion of using ECS to manage runners on EC2.

mergify bot closed this as completed in #273 Apr 6, 2023
mergify bot pushed a commit that referenced this issue Apr 6, 2023
The EC2-backed ECS runner provider gives the user more control over the underlying resources. The number of EC2 instances that can run runner containers, and their auto-scaling, can be adjusted. By default it scales down to zero, but `minInstances` can be set to a higher number to get faster runner start times.

Default auto-scaling behavior leaves instances around for 15 minutes after the last job finishes.

Fixes #191
Related #260
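
For reference, the new provider could be wired into the GitHubRunners construct roughly like this. A minimal sketch, assuming the provider class is exposed as EcsRunnerProvider and accepts a maxInstances property alongside minInstances (only `minInstances` is confirmed by this thread):

import { App, Stack } from 'aws-cdk-lib';
import { EcsRunnerProvider, GitHubRunners } from '@cloudsnorkel/cdk-github-runners';

const app = new App();
const stack = new Stack(app, 'github-runners');

// Keep one instance warm so runner containers start in seconds instead of
// waiting for a fresh instance to boot; scale out under load.
const ecsProvider = new EcsRunnerProvider(stack, 'ecs runners', {
  minInstances: 1, // confirmed above; the default of 0 scales to zero
  maxInstances: 5, // assumed property name
});

new GitHubRunners(stack, 'runners', {
  providers: [ecsProvider],
});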