Re-use EC2 runner instances #191
Comments
We currently focus on ephemeral runners, where one job doesn't affect the state of the next one. Leaving a runner behind means every package you installed in job 1 will be available in job 2. That can cause unexpected and unpredictable results. In the future we may add an API for a pool of prewarmed runners, or even runners that stay around, but right now the focus is on-demand runners.
Agreed on stateless. Would it be possible to run an EC2 instance with a Docker daemon that starts ephemeral containers for jobs?
How would that be different from the CodeBuild/Fargate/Lambda providers?
The ability to run Docker-in-Docker (dind).
Faster start times. Lambda is fast, but CodeBuild is very slow to provision, and the same goes for Fargate.
I think solving this with pre-warmed instances will be simpler. This way we don't have to create another system to manage capacity and running containers on EC2.
Wouldn't pre-warmed instances eat up all of the savings then? E.g. in our case, for every PR, we run 12 GitHub workflows. This means 12 instances would need to always be on stand-by.
Yes, using pre-warmed instances, or an always-on instance, would eat up the savings of on-demand. AWS prices scale linearly with instance size, so both solutions would eat up savings in a similar manner, unless you wish to overprovision on the shared instance. But either way you don't get the cost savings of on-demand. This is beginning to sound like ECS. Do you know if ECS with an already active instance provisions faster than Fargate? Adding ECS support should be relatively easy, and I think it will give us a shared instance without much work or orchestration code to maintain in the future.
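(For a rough sense of the linear-pricing point, using illustrative us-east-1 on-demand numbers: 12 standby m5.large instances cost about 12 × $0.096/hr ≈ $1.15/hr, roughly the same as one m5.6xlarge (~$1.152/hr, 24 vCPU / 96 GiB) sized to run those 12 jobs concurrently, so consolidating onto a shared instance doesn't by itself reduce the bill.)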
I gave ECS a try and found that it starts the container in about a second. When it also has to pull the image, it takes about 20 seconds to start the container. We can probably pre-pull the image. This assumes a running instance and no need to scale up and create more instances. Both cases take an extra 10 seconds or so for the runner to connect after the container starts. That's just how long the runner code takes, and this might be optimized in the future by GitHub. Some code for the future (doesn't handle auto-scaling by container count yet):

import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as ecs from 'aws-cdk-lib/aws-ecs';

// `stack`, `vpc` and `fargateX64Builder` are defined elsewhere in the stack.
new ecs.Cluster(
  stack,
  'ecs cluster',
  {
    enableFargateCapacityProviders: false,
    capacity: {
      instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.LARGE),
      minCapacity: 1,
      maxCapacity: 2,
      desiredCapacity: 1,
      vpcSubnets: { subnetType: ec2.SubnetType.PUBLIC },
      spotInstanceDraining: false, // waste of money to restart jobs as the restarted job won't have a token
    },
    vpc: vpc,
  },
  // TODO ECS_IMAGE_PULL_BEHAVIOR=prefer-cached
);

const task = new ecs.Ec2TaskDefinition(
  stack,
  'task',
);
task.addContainer('runner', {
  image: ecs.ContainerImage.fromEcrRepository(fargateX64Builder.bind().imageRepository),
  user: 'runner',
  memoryLimitMiB: 2048,
  command: ['./config.sh', '--unattended'],
  logging: ecs.LogDriver.awsLogs({
    streamPrefix: 'runner',
  }),
});
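On the pre-pull idea and the ECS_IMAGE_PULL_BEHAVIOR TODO above, a minimal sketch of one way it could be wired up, assuming the cluster above is assigned to a `cluster` variable and its capacity is added through `cluster.addCapacity()` (which returns the Auto Scaling group) instead of the `capacity` prop:

// Sketch: addCapacity() exposes the Auto Scaling group, so user data can
// configure the ECS agent to prefer locally cached images.
const asg = cluster.addCapacity('runner capacity', {
  instanceType: ec2.InstanceType.of(ec2.InstanceClass.M5, ec2.InstanceSize.LARGE),
  minCapacity: 1,
  maxCapacity: 2,
});
// /etc/ecs/ecs.config is read by the ECS agent at instance start-up.
asg.addUserData('echo ECS_IMAGE_PULL_BEHAVIOR=prefer-cached >> /etc/ecs/ecs.config');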
Maybe ECS autoscaling + a simple daemon to pre-cache builder images? You'd still get extra-long cold starts for scale-up + image pull, though you could potentially mitigate that with Warm Pools, which let you balance the performance/price tradeoff.
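For reference, a warm pool can be attached to the capacity's Auto Scaling group directly in CDK; a rough sketch, assuming an `asg` handle to that group:

import * as autoscaling from 'aws-cdk-lib/aws-autoscaling';

// Keep one pre-initialized instance in a stopped state; stopped warm-pool
// instances mostly incur only EBS costs, trading a little cost for faster scale-out.
asg.addWarmPool({
  minSize: 1,
  poolState: autoscaling.PoolState.STOPPED,
  reuseOnScaleIn: true,
});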
I was surprised to see the on-demand behavior of the EC2 Runner Provider. My understanding was that GitHub Actions runs in Docker containers, so I was expecting that Actions run on EC2 Runners would similarly use Docker and inherently be ephemeral. I would second the notion of using ECS to manage runners on EC2. |
The EC2-backed ECS runner provider gives the user more control over the underlying resources. The number of EC2 instances that can run runner containers, and their auto-scaling, can be adjusted. By default it scales down to zero, but `minInstances` can be set to a higher number to get faster runner start times. The default auto-scaling behavior leaves instances around for 15 minutes after the last job finishes. Fixes #191. Related: #260
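As a rough sketch of how that could look from the user's side (the `EcsRunnerProvider` class name, the package name, and the exact property names here are assumptions based on the description above):

import { App, Stack } from 'aws-cdk-lib';
import { EcsRunnerProvider, GitHubRunners } from '@cloudsnorkel/cdk-github-runners';

const app = new App();
const stack = new Stack(app, 'github-runners');

// Keep one EC2 instance warm for faster runner start times; idle capacity
// still scales back down after the default 15-minute cooldown.
const ecsProvider = new EcsRunnerProvider(stack, 'ecs runners', {
  minInstances: 1,
  maxInstances: 5,
});

new GitHubRunners(stack, 'runners', {
  providers: [ecsProvider],
});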
Does it make sense to launch multiple EC2 instances for each workflow?
This really slows the process down. I think ideally it should launch one instance, re-use it for all runs, and then shut it down when idle.