Add support for one-shot capacity using AWS Fleet #153

Merged: 1 commit merged into atlassian:master on May 5, 2019

Conversation

@mcrute (Contributor) commented Apr 1, 2019

This PR adds support for acquiring all desired scale-up capacity at one time using the EC2 Fleet API.

To use this, create an instance launch template and add its ID to your Escalator configuration. The presence of this setting causes Escalator to use the Fleet API to acquire the capacity in one shot and attach it to the autoscaling group, rather than changing the autoscaling group's desired instance count and waiting for instances to be delivered. If there are not sufficient instances available to satisfy the scaling request, the request fails and is retried the next time Escalator checks whether to scale up the cluster.
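For readers who haven't used the Fleet API, the following is a minimal sketch of the kind of one-shot request described above, using aws-sdk-go. It is not the PR's actual code; the launch template ID, version, and capacity values are placeholders.

```go
// Sketch only: a one-shot ("instant") EC2 Fleet request built from a launch
// template, roughly as described in this PR. All IDs and counts are placeholders.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func requestOneShotCapacity(svc *ec2.EC2, launchTemplateID string, addCount int64) (*ec2.CreateFleetOutput, error) {
	input := &ec2.CreateFleetInput{
		// "instant" makes the API launch (or fail) synchronously instead of
		// keeping an open fleet request, which is what one-shot scale-up needs.
		Type: aws.String(ec2.FleetTypeInstant),
		LaunchTemplateConfigs: []*ec2.FleetLaunchTemplateConfigRequest{
			{
				LaunchTemplateSpecification: &ec2.FleetLaunchTemplateSpecificationRequest{
					LaunchTemplateId: aws.String(launchTemplateID),
					Version:          aws.String("$Latest"),
				},
			},
		},
		TargetCapacitySpecification: &ec2.TargetCapacitySpecificationRequest{
			TotalTargetCapacity:       aws.Int64(addCount),
			OnDemandTargetCapacity:    aws.Int64(addCount),
			DefaultTargetCapacityType: aws.String(ec2.DefaultTargetCapacityTypeOnDemand),
		},
	}
	return svc.CreateFleet(input)
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	out, err := requestOneShotCapacity(svc, "lt-0123456789abcdef0", 10)
	if err != nil {
		log.Fatalf("fleet request failed: %v", err)
	}
	// out.Instances groups the launched instance IDs; per the PR description,
	// these would then be attached to the autoscaling group rather than
	// raising the group's desired capacity.
	for _, group := range out.Instances {
		fmt.Println(aws.StringValueSlice(group.InstanceIds))
	}
}
```

If the request can't be fully satisfied, an instant fleet reports the failure in the response, and as described above Escalator simply retries on its next scaling pass.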

@coreyjohnston

Hi Mike,

Thanks very much for contributing Fleet support! We'll take a look and be in touch shortly.

@awprice (Member) left a comment

Thanks for adding this functionality to Escalator! Overall great PR, I've highlighted some areas that need work. My main concern is around AZ balancing with fleet.

(Several review threads on pkg/cloudprovider/aws/aws.go and pkg/controller/node_group.go were marked as resolved.)

pkg/cloudprovider/aws/aws.go:
TerminateInstancesWithExpiration: awsapi.Bool(false),
OnDemandOptions: &ec2.OnDemandOptionsRequest{
MinTargetCapacity: awsapi.Int64(addCount),
SingleAvailabilityZone: awsapi.Bool(true),
Member:

I had a bit of trouble testing the SingleAvailabilityZone option. My testing setup used an ASG that spanned multiple AZs and had the AZRebalance process enabled (not suspended). Whenever I requested new instances with fleet, they kept being created in us-west-2a, and shortly after they were attached to the ASG, the ASG would terminate the instances (with jobs running on them) and move them to another AZ. When I changed the desired size of the same ASG manually, the instances were automatically rebalanced.

Is there a way to use fleet on-demand and have the instances balanced across multiple AZs?

Member:

I've just tested adding override options with each of the three AZs and subnets I want the instances to be balanced across, and it still seems like they get created in the one AZ - us-west-2.
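(For context, the sketch below shows what per-subnet overrides on a fleet request look like with aws-sdk-go; the launch template and subnet IDs are placeholders, not the configuration tested here.)

```go
// Sketch only: one launch-template config with an override per subnet, so the
// fleet is allowed to place instances in any of the listed subnets/AZs.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func launchTemplateConfigs(launchTemplateID string, subnetIDs []string) []*ec2.FleetLaunchTemplateConfigRequest {
	overrides := make([]*ec2.FleetLaunchTemplateOverridesRequest, 0, len(subnetIDs))
	for _, id := range subnetIDs {
		overrides = append(overrides, &ec2.FleetLaunchTemplateOverridesRequest{
			SubnetId: aws.String(id),
		})
	}
	return []*ec2.FleetLaunchTemplateConfigRequest{
		{
			LaunchTemplateSpecification: &ec2.FleetLaunchTemplateSpecificationRequest{
				LaunchTemplateId: aws.String(launchTemplateID),
				Version:          aws.String("$Latest"),
			},
			Overrides: overrides,
		},
	}
}

func main() {
	// Placeholder subnets, one per AZ the instances should be able to land in.
	configs := launchTemplateConfigs("lt-0123456789abcdef0",
		[]string{"subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"})
	fmt.Println("overrides:", len(configs[0].Overrides))
}
```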

mcrute (Contributor, Author):

I removed SingleAvailabilityZone; it was left over from some early testing and I forgot to take it out. Without that flag, and with multiple AZs configured, it should be possible to balance across AZs. I'll do some more testing and update with more details.

mcrute (Contributor, Author):

I dug into this a little more and it looks like AZ balancing isn't yet supported with the CreateFleet API. The use case I'm solving for by integrating CreateFleet support into Escalator is workloads that benefit from having capacity available immediately, without waiting for an ASG to scale up, and that are not concerned with AZ balance (for example, an ML training workload that needs p3dn capacity and may be using placement groups to manage network locality).

I think there are two paths forward here. The first would be to merge this as-is and update the docs to state that this is designed to support workloads that don't need AZ balance. I can also follow up with the fleet team to see if AZ balance is on their roadmap and advocate for it. This would give us a path towards updating this code in the future when/if that becomes supported.

The second option is to break the fleet request into multiple parts, one per AZ, and handle balancing within the Escalator code. I think this is risky because there are a lot of edge cases around acquiring the right amount of capacity at the right time and managing over/under-scaling manually in Escalator, so I would strongly prefer option one if you're open to it.

WDYT? Do you see different options?

Member:

Thanks for the detailed explanation, I'm happy with option 1. I definitely agree the second option isn't ideal and could potentially be a bit of a hack.

Could you update the docs in this PR to include the AZ balancing caveat for fleet? It would be awesome if you could advocate for AZ balancing for fleet as it's something that we could utilise too.

I assume most users that are doing ML training workloads aren't concerned with all of the workloads running in a single AZ?

mcrute (Contributor, Author):

I've updated the docs and will start to push forward AZ balancing with my partner team that owns this API.

As far as the ML training workloads go, it's highly desirable that instances be near each other from a networking perspective to cut inter-instance latencies. This patch also allows users to configure placement groups within their launch template to further restrict placement for highly latency-sensitive workloads.
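(To illustrate that last point, here is a sketch, assuming aws-sdk-go, of a launch template tied to a cluster placement group; the group name, template name, AMI, and instance type are placeholders rather than values from this PR.)

```go
// Sketch only: create a cluster placement group and a launch template that
// pins instances to it. All names and IDs below are placeholders.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// A "cluster" placement group packs instances close together on the
	// network, which is what latency-sensitive ML training workloads want.
	if _, err := svc.CreatePlacementGroup(&ec2.CreatePlacementGroupInput{
		GroupName: aws.String("ml-training-pg"),
		Strategy:  aws.String(ec2.PlacementStrategyCluster),
	}); err != nil {
		log.Fatalf("create placement group: %v", err)
	}

	// The launch template referenced by the fleet request names that placement
	// group, so every instance launched from it lands in the group.
	if _, err := svc.CreateLaunchTemplate(&ec2.CreateLaunchTemplateInput{
		LaunchTemplateName: aws.String("ml-training-template"),
		LaunchTemplateData: &ec2.RequestLaunchTemplateData{
			InstanceType: aws.String("p3dn.24xlarge"),
			ImageId:      aws.String("ami-0123456789abcdef0"),
			Placement: &ec2.LaunchTemplatePlacementRequest{
				GroupName: aws.String("ml-training-pg"),
			},
		},
	}); err != nil {
		log.Fatalf("create launch template: %v", err)
	}
}
```

Because the fleet request launches instances from this template, capacity acquired in one shot would all land in the same placement group.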

(Additional review threads on pkg/cloudprovider/aws/aws.go were marked as resolved.)
@awprice removed the "architecture" and "enhancement (New feature or request)" labels on Apr 8, 2019
@mcrute (Contributor, Author) commented Apr 30, 2019

Thanks @awprice! Sorry for the slow follow-up. I've posted updates to most of your review notes. I'll follow up with the configuration re-arrangement as well as validation of the AZ balancing shortly.

@awprice (Member) commented Apr 30, 2019

> Thanks @awprice! Sorry for the slow follow-up. I've posted updates to most of your review notes. I'll follow up with the configuration re-arrangement as well as validation of the AZ balancing shortly.

Awesome! I'll keep an eye out for it.

@mcrute (Contributor, Author) commented May 3, 2019

Thanks for the approval @awprice! Is there anything else that needs to be done before this can be merged?

@awprice (Member) commented May 3, 2019

> Thanks for the approval @awprice! Is there anything else that needs to be done before this can be merged?

Apologies, I was going to get one of my coworkers to check over it; then we are good to merge and release.

cc @patrickshan @Jacobious52

@awprice merged commit cbe8c67 into atlassian:master on May 5, 2019