Add support for one-shot capacity using AWS Fleet #153

Merged: 1 commit merged into atlassian:master on May 5, 2019

Conversation

@mcrute (Contributor) commented Apr 1, 2019

This PR adds support for acquiring all desired scale-up capacity at one time using the EC2 Fleet API.

To use this, create an instance launch template and add its ID to your Escalator configuration. The presence of this setting causes Escalator to use the Fleet API to acquire the capacity in one shot and attach it to the autoscaling group, rather than changing the autoscaling group's desired instance count and waiting for instances to be delivered. If there are not sufficient instances available to satisfy the scaling request, the request fails and is retried the next time Escalator checks whether to scale up the cluster.
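For readers who haven't used the Fleet API, the following is a minimal sketch of the kind of one-shot request described above, using aws-sdk-go. It is not the PR's actual code; the launch template ID, version, and capacity values are placeholders.

```go
// Sketch only: a one-shot ("instant") EC2 Fleet request built from a launch
// template, roughly as described in this PR. All IDs and counts are placeholders.
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func requestOneShotCapacity(svc *ec2.EC2, launchTemplateID string, addCount int64) (*ec2.CreateFleetOutput, error) {
	input := &ec2.CreateFleetInput{
		// "instant" makes the API launch (or fail) synchronously instead of
		// keeping an open fleet request, which is what one-shot scale-up needs.
		Type: aws.String(ec2.FleetTypeInstant),
		LaunchTemplateConfigs: []*ec2.FleetLaunchTemplateConfigRequest{
			{
				LaunchTemplateSpecification: &ec2.FleetLaunchTemplateSpecificationRequest{
					LaunchTemplateId: aws.String(launchTemplateID),
					Version:          aws.String("$Latest"),
				},
			},
		},
		TargetCapacitySpecification: &ec2.TargetCapacitySpecificationRequest{
			TotalTargetCapacity:       aws.Int64(addCount),
			OnDemandTargetCapacity:    aws.Int64(addCount),
			DefaultTargetCapacityType: aws.String(ec2.DefaultTargetCapacityTypeOnDemand),
		},
	}
	return svc.CreateFleet(input)
}

func main() {
	svc := ec2.New(session.Must(session.NewSession()))
	out, err := requestOneShotCapacity(svc, "lt-0123456789abcdef0", 10)
	if err != nil {
		log.Fatalf("fleet request failed: %v", err)
	}
	// out.Instances groups the launched instance IDs; per the PR description,
	// these would then be attached to the autoscaling group rather than
	// raising the group's desired capacity.
	for _, group := range out.Instances {
		fmt.Println(aws.StringValueSlice(group.InstanceIds))
	}
}
```

If the request can't be fully satisfied, an instant fleet reports the failure in the response, and as described above Escalator simply retries on its next scaling pass.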

@coreyjohnston

Hi Mike,

Thanks very much for contributing Fleet support! We'll take a look and be in touch shortly.

@awprice (Member) left a comment

Thanks for adding this functionality to Escalator! Overall great PR, I've highlighted some areas that need work. My main concern is around AZ balancing with fleet.

(Several review threads on pkg/cloudprovider/aws/aws.go and pkg/controller/node_group.go were marked as resolved.)

pkg/cloudprovider/aws/aws.go:
TerminateInstancesWithExpiration: awsapi.Bool(false),
OnDemandOptions: &ec2.OnDemandOptionsRequest{
MinTargetCapacity: awsapi.Int64(addCount),
SingleAvailabilityZone: awsapi.Bool(true),
Member:

I had a bit of trouble testing the SingleAvailabilityZone option. My testing setup used an ASG that spanned multiple AZs and had the AZRebalance process enabled (not suspended). Whenever I requested new instances with fleet, they kept being created in us-west-2a, and shortly after they were attached to the ASG, the ASG would terminate the instances (with jobs running on them) and move them to another AZ. When I changed the desired size of the same ASG manually, the instances were automatically rebalanced.

Is there a way to use fleet on-demand and have the instances balanced across multiple AZs?

Member:

I've just tested adding override options with each of the three AZs and subnets I want the instances to be balanced across, and it still seems like they get created in the one AZ - us-west-2.
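(For context, the sketch below shows what per-subnet overrides on a fleet request look like with aws-sdk-go; the launch template and subnet IDs are placeholders, not the configuration tested here.)

```go
// Sketch only: one launch-template config with an override per subnet, so the
// fleet is allowed to place instances in any of the listed subnets/AZs.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func launchTemplateConfigs(launchTemplateID string, subnetIDs []string) []*ec2.FleetLaunchTemplateConfigRequest {
	overrides := make([]*ec2.FleetLaunchTemplateOverridesRequest, 0, len(subnetIDs))
	for _, id := range subnetIDs {
		overrides = append(overrides, &ec2.FleetLaunchTemplateOverridesRequest{
			SubnetId: aws.String(id),
		})
	}
	return []*ec2.FleetLaunchTemplateConfigRequest{
		{
			LaunchTemplateSpecification: &ec2.FleetLaunchTemplateSpecificationRequest{
				LaunchTemplateId: aws.String(launchTemplateID),
				Version:          aws.String("$Latest"),
			},
			Overrides: overrides,
		},
	}
}

func main() {
	// Placeholder subnets, one per AZ the instances should be able to land in.
	configs := launchTemplateConfigs("lt-0123456789abcdef0",
		[]string{"subnet-aaaa1111", "subnet-bbbb2222", "subnet-cccc3333"})
	fmt.Println("overrides:", len(configs[0].Overrides))
}
```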

mcrute (Contributor, Author):

I removed SingleAvailabilityZone; it was left over from some early testing and I forgot to take it out. Without that flag, and with multiple AZs configured, it should be possible to balance across AZs. I'll do some more testing and update with more details.

mcrute (Contributor, Author):

I dug into this a little more and it looks like AZ balancing isn't yet supported with the CreateFleet API. The use case I'm solving for by integrating CreateFleet support into Escalator is workloads that benefit from having capacity available immediately, without waiting for an ASG to scale up, and that are not concerned with AZ balance (for example, an ML training workload that needs p3dn capacity and may be using placement groups to manage network locality).

I think there are two paths forward here. The first would be to merge this as-is and update the docs to state that this is designed to support workloads that don't need AZ balance. I can also follow up with the fleet team to see if AZ balance is on their roadmap and advocate for it. This would give us a path towards updating this code in the future when/if that becomes supported.

The second option is to break the fleet request into multiple parts, one per AZ, and handle balancing within the Escalator code. I think this is risky because there are a lot of edge cases around acquiring the right amount of capacity at the right time and managing over/under-scaling manually in Escalator, so I would strongly prefer option one if you're open to it.

WDYT? Do you see different options?

Member:

Thanks for the detailed explanation, I'm happy with option 1. I definitely agree the second option isn't ideal and could potentially be a bit of a hack.

Could you update the docs in this PR to include the AZ balancing caveat for fleet? It would be awesome if you could advocate for AZ balancing for fleet as it's something that we could utilise too.

I assume most users that are doing ML training workloads aren't concerned with all of the workloads running in a single AZ?

mcrute (Contributor, Author):

I've updated the docs and will start to push forward AZ balancing with my partner team that owns this API.

As far as the ML training workloads go, it's highly desirable that instances be near each other from a networking perspective to cut inter-instance latencies. This patch also allows users to configure placement groups within their launch template to further restrict placement for highly latency-sensitive workloads.
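(To illustrate that last point, here is a sketch, assuming aws-sdk-go, of a launch template tied to a cluster placement group; the group name, template name, AMI, and instance type are placeholders rather than values from this PR.)

```go
// Sketch only: create a cluster placement group and a launch template that
// pins instances to it. All names and IDs below are placeholders.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ec2"
)

func main() {
	svc := ec2.New(session.Must(session.NewSession()))

	// A "cluster" placement group packs instances close together on the
	// network, which is what latency-sensitive ML training workloads want.
	if _, err := svc.CreatePlacementGroup(&ec2.CreatePlacementGroupInput{
		GroupName: aws.String("ml-training-pg"),
		Strategy:  aws.String(ec2.PlacementStrategyCluster),
	}); err != nil {
		log.Fatalf("create placement group: %v", err)
	}

	// The launch template referenced by the fleet request names that placement
	// group, so every instance launched from it lands in the group.
	if _, err := svc.CreateLaunchTemplate(&ec2.CreateLaunchTemplateInput{
		LaunchTemplateName: aws.String("ml-training-template"),
		LaunchTemplateData: &ec2.RequestLaunchTemplateData{
			InstanceType: aws.String("p3dn.24xlarge"),
			ImageId:      aws.String("ami-0123456789abcdef0"),
			Placement: &ec2.LaunchTemplatePlacementRequest{
				GroupName: aws.String("ml-training-pg"),
			},
		},
	}); err != nil {
		log.Fatalf("create launch template: %v", err)
	}
}
```

Because the fleet request launches instances from this template, capacity acquired in one shot would all land in the same placement group.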

(Additional review threads on pkg/cloudprovider/aws/aws.go were marked as resolved.)
@awprice removed the "architecture" and "enhancement (New feature or request)" labels on Apr 8, 2019
@mcrute (Contributor, Author) commented Apr 30, 2019

Thanks @awprice! Sorry for the slow follow-up. I've posted updates to most of your review notes. I'll follow up with the configuration re-arrangement as well as validation of the AZ balancing shortly.

@awprice (Member) commented Apr 30, 2019

> Thanks @awprice! Sorry for the slow follow-up. I've posted updates to most of your review notes. I'll follow up with the configuration re-arrangement as well as validation of the AZ balancing shortly.

Awesome! I'll keep an eye out for it.

@mcrute (Contributor, Author) commented May 3, 2019

Thanks for the approval @awprice! Is there anything else that needs to be done before this can be merged?

@awprice (Member) commented May 3, 2019

> Thanks for the approval @awprice! Is there anything else that needs to be done before this can be merged?

Apologies, I was going to get one of my coworkers to check over it; then we are good to merge and release.

cc @patrickshan @Jacobious52

@awprice merged commit cbe8c67 into atlassian:master on May 5, 2019