
feat(ecs): Autoscaling Group Capacity Provider #9192

Closed
wants to merge 32 commits into from
Conversation

pahud
Contributor

@pahud pahud commented Jul 21, 2020

feat(ecs): Autoscaling Group Capacity Provider

This PR adds the Autoscaling Group Capacity Provider support for Amazon ECS

Closes: #5471


Considerations

  1. Use addCapacityProvider to create the CapacityProvider resource. Behind the scenes, it calls addCapacity for this cluster and uses the returned AutoscalingGroup to provision the CapacityProvider, so we can leverage the existing CapacityOptions.
  2. When managedTerminationProtection is enabled, which is the default behavior, the CapacityProvider construct creates an EnforcedInstanceProtection custom resource behind the scenes, which performs the following required steps on create:
    a. set NewInstancesProtectedFromScaleIn=True for this ASG
    b. set ProtectedFromScaleIn=True for all existing instances in this ASG
  3. When managedTerminationProtection is enabled, on CapacityProvider resource deletion the EnforcedInstanceProtection custom resource does the following:
    a. set NewInstancesProtectedFromScaleIn=False for this ASG
    b. set ProtectedFromScaleIn=False for all existing instances in this ASG
    In this case, if the ASG is going to terminate along with the CapacityProvider, the instances can be terminated successfully; otherwise the whole stack will hang.
  4. A CapacityProviderConfiguration custom resource breaks the circular dependency and configures the capacity providers, as well as the strategies, for the cluster.
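The protection toggling in items 2 and 3 is symmetric, so it can be sketched as one helper invoked with opposite flags on create and delete. The sketch below models the ASG as a plain dict so the logic is self-contained; a real custom resource handler would call the Auto Scaling API (UpdateAutoScalingGroup for step a, SetInstanceProtection for step b) instead, and all names here are illustrative.

```python
def set_scale_in_protection(asg, protected):
    """Toggle scale-in protection on an ASG model and all its instances.

    `asg` is a plain dict standing in for Auto Scaling API state; a real
    handler would call UpdateAutoScalingGroup (step a) and
    SetInstanceProtection (step b) instead.
    """
    # step a: instances launched later inherit this setting
    asg["NewInstancesProtectedFromScaleIn"] = protected
    # step b: existing instances must be toggled explicitly
    for instance in asg["Instances"]:
        instance["ProtectedFromScaleIn"] = protected

def on_create(asg):
    # managedTerminationProtection enabled: protect everything
    set_scale_in_protection(asg, True)

def on_delete(asg):
    # lift protection so the ASG's instances can terminate with the stack
    set_scale_in_protection(asg, False)
```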

Known Issues

  1. Updating the properties of AWS::ECS::CapacityProvider while reusing the same AutoscalingGroup fails with the error:

Invalid request provided: CreateCapacityProvider error: The specified Auto Scaling group ARN is already being used by another capacity provider. Specify a unique Auto Scaling group ARN and try again. (Service: AmazonECS; Status Code: 400; Error Code: ClientException; Request ID: 46641521-15db-40e1-83b8-90bd3dc75a3e; Proxy: null)

It looks like when updating the AWS::ECS::CapacityProvider resource, a new AWS::ECS::CapacityProvider with the same ASG is created first for the replacement, and it immediately fails because the same ASG is not allowed for two CapacityProviders. No idea how to work around it at the moment.

Sample

const cluster = new ecs.Cluster(stack, 'Cluster', { vpc });

// create the 1st capacity provider with on-demand t3.large instances
cluster.addCapacityProvider('CP', {
  capacityOptions: {
    instanceType: new ec2.InstanceType('t3.large'),
    minCapacity: 2,
  },
  managedScaling: true,
  managedTerminationProtection: true,
  defaultStrategy: { base: 1, weight: 1 },
});

// create the 2nd capacity provider with ec2 spot t3.large instances
cluster.addCapacityProvider('CPSpot', {
  capacityOptions: {
    instanceType: new ec2.InstanceType('t3.large'),
    minCapacity: 1,
    spotPrice: '0.1',
  },
  managedScaling: true,
  managedTerminationProtection: true,
  defaultStrategy: { weight: 3 },
});
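As a rough illustration of how the two defaultStrategy values in the sample interact: ECS places a provider's base tasks first, then splits the remainder in proportion to weight. The function below is a simplified model of those documented semantics, not the exact ECS placement algorithm.

```python
def distribute_tasks(total, strategies):
    """Approximate how ECS splits `total` tasks across capacity providers.

    Each strategy is (name, base, weight): `base` tasks go to that
    provider first, and the remainder is split in proportion to weight.
    Simplified model only, not the exact ECS placement algorithm.
    """
    placement = {name: base for name, base, _ in strategies}
    remaining = total - sum(placement.values())
    total_weight = sum(weight for _, _, weight in strategies)
    for name, _, weight in strategies:
        placement[name] += remaining * weight // total_weight
    # hand any integer-division leftover to the heaviest provider
    leftover = total - sum(placement.values())
    heaviest = max(strategies, key=lambda s: s[2])[0]
    placement[heaviest] += leftover
    return placement
```

With the sample's strategies (CP: base 1, weight 1; CPSpot: weight 3), this model places 9 tasks as 3 on CP and 6 on CPSpot.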

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@pahud pahud changed the title feat(ecs): Capacity Provider for Auto Scaling Group feat(ecs): Autoscaling Group Capacity Provider Jul 21, 2020
@SomayaB SomayaB added the @aws-cdk/aws-ecs Related to Amazon Elastic Container label Jul 21, 2020
@pahud pahud marked this pull request as draft July 22, 2020 02:47
@pahud pahud marked this pull request as ready for review July 22, 2020 05:05
@pahud
Contributor Author

pahud commented Jul 22, 2020

Hi @uttarasridhar

I am ready for the first round. Please take a look.

thanks.

@pahud
Contributor Author

pahud commented Jul 24, 2020

converted to draft. Working on the circular dependency issue addressed in aws/containers-roadmap#631 (comment)

@jgaudet

jgaudet commented Jul 24, 2020

Question: In order for #5471 to be considered closed, does CloudFormation need to support setting a custom capacity provider on a service?

@pahud
Contributor Author

pahud commented Jul 25, 2020

Question: In order for #5471 to be considered closed, does CloudFormation need to support setting a custom capacity provider on a service?

It would be great if AWS::ECS::Service supported capacityProviderStrategy. But it's still possible to build something like

service.addCapacityProvider(CapacityProviderStrategy)

And behind the scenes an AwsCustomResource would update the service instead. I will give it a try. Any comments are welcome here.

@pahud
Contributor Author

pahud commented Jul 25, 2020

converted to draft. Working on the circular dependency issue addressed in aws/containers-roadmap#631 (comment)

To break the circular dependency, maybe we should read the cluster name from Parameter Store rather than this.clusterName, which looks like this:

// Tie instances to cluster
autoScalingGroup.addUserData(`echo ECS_CLUSTER=${this.clusterName} >> /etc/ecs/ecs.config`);
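For illustration, the Parameter Store idea would have the user data resolve the cluster name at boot time instead of through a CloudFormation reference (which is what creates the cycle). Note the author later reports this approach "doesn't seem to work"; the parameter name, CLI flags, and script below are hypothetical.

```python
def ecs_user_data_via_ssm(parameter_name, region):
    """Build user-data lines that fetch the cluster name from SSM at
    boot rather than baking in a CloudFormation reference.

    The parameter name and script are hypothetical; this approach was
    ultimately abandoned in the PR.
    """
    return "\n".join([
        "#!/bin/bash",
        f"CLUSTER_NAME=$(aws ssm get-parameter --name {parameter_name} "
        f"--region {region} --query Parameter.Value --output text)",
        'echo "ECS_CLUSTER=${CLUSTER_NAME}" >> /etc/ecs/ecs.config',
    ])
```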

@pahud
Contributor Author

pahud commented Jul 31, 2020

parameter store doesn't seem to work. Trying to use the CapacityProviderConfiguration custom resource to break the circular dependency instead.

Contributor

@iamhopaul123 iamhopaul123 left a comment


Thank you so much for contributing❤️ I believe this will be very good news for our ECS customers!

const cluster = new ecs.Cluster(stack, 'Cluster', { vpc });

// create the 1st capacity provider with on-demand t3.large instances
const cp = cluster.addCapacityProvider('CP', {
Contributor


A more general question: is it possible to merge addCapacityProviderConfiguration into addCapacityProvider so that users don't need to create a capacity provider and then register it to a cluster? I would expect cluster.addCapacityProvider to already create and register a capacity provider to that cluster and return the created capacity provider (similar to what users would expect from cluster.addCapacity).

Contributor Author

@pahud pahud Aug 17, 2020


A more general question: is it possible to merge addCapacityProviderConfiguration into addCapacityProvider so that users don't need to create a capacity provider then register it to a cluster? Because I would expect cluster.addCapacityProvider already create and register a capacity provider to that cluster and return the created capacity provider (which is similar to what users would expect for cluster.addCapacity)

Hi @iamhopaul123

I have been exploring this but haven't made it work yet. Let me explain my current status; I am looking for your advice.

addCapacityProvider() is easy as we just need to create the AWS::ECS::CapacityProvider resource, but registering multiple capacity providers to the cluster is still challenging. We probably have two options here:

  1. Simply associate the CP with the cluster via the CapacityProviders and DefaultCapacityProviderStrategy properties of AWS::ECS::Cluster:
    https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-ecs-cluster.html#cfn-ecs-cluster-capacityproviders

However, we'll have a circular dependency (aws/containers-roadmap#631 (comment)). One option is something like aws/containers-roadmap#631 (comment), but it might cause breaking changes.

Another solution could be to get the cluster name from Parameter Store, with a parameter name like ${stackName}-${id}-clusterName, and run aws ssm get-parameters in the UserData of the instance node. I haven't tried this yet. It's still a breaking change, though it might be a better solution than the one above.

  2. The second option is what I've been working on now. I am creating a new CapacityProviderConfiguration custom resource. When we addCapacityProvider(), the CP is added to capacityProviders: CapacityProvider[] on the cluster instance, and on cluster creation we synthesize all the added capacity providers and run the new CapacityProviderConfiguration inside the Cluster class. In this case, though, the CapacityProviderConfiguration custom resource can hardly addDependency on all the capacity providers, which leads to the error below:

Failed to delete resource. Error: An error occurred (UpdateInProgressException) when calling the PutClusterCapac
ityProviders operation: The specified cluster is in a busy state. Cluster attachments must be in UPDATE_COMPLETE
or UPDATE_FAILED state before they can be updated. Wait and try again.

I believe the 2nd option is the way to go, but I am still struggling with how to addDependency on the capacity providers for the CapacityProviderConfiguration resource to ensure all CPs are ready before we configure them to the cluster.

I am not sure if I am on the right track and would appreciate it if you could share your feedback. What do you think?
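The ordering problem in option 2 is essentially a dependency-graph question: the configuration custom resource must sort after every capacity provider. A toy model of the desired ordering (resource ids are illustrative; in CDK terms this is what calling node.addDependency on the configuration resource for each CP would express):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def deployment_order(capacity_provider_ids):
    """Model the ordering the configuration resource needs: it depends
    on every capacity provider, so it must come last. Ids here are
    illustrative, not CDK API."""
    graph = {"CapacityProviderConfiguration": set(capacity_provider_ids)}
    for cp in capacity_provider_ids:
        graph.setdefault(cp, set())
    return list(TopologicalSorter(graph).static_order())
```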

Contributor Author

@pahud pahud Aug 17, 2020


RE the circular dependency.

I believe we need to break the following two circular dependencies before we can simply register the CP to the cluster from native CloudFormation's perspective.

cluster -> cp -> addCapacity() -> addAutoScalingGroup() -> cluster

// Tie instances to cluster
autoScalingGroup.addUserData(`echo ECS_CLUSTER=${this.clusterName} >> /etc/ecs/ecs.config`);

cluster -> cp -> addCapacity() -> addAutoScalingGroup() -> InstanceDrainHook -> cluster

if (!options.taskDrainTime || options.taskDrainTime.toSeconds() !== 0) {
  new InstanceDrainHook(autoScalingGroup, 'DrainECSHook', {
    autoScalingGroup,
    cluster: this,
    drainTime: options.taskDrainTime,
    topicEncryptionKey: options.topicEncryptionKey,
  });
}
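The two chains above can be checked mechanically. Below is a toy model of the reference graph the comment describes (an edge means "refers to"): both the user-data line and the InstanceDrainHook point back at the cluster, closing the cycles. The node names are illustrative.

```python
def find_cycle(graph, start):
    """Depth-first search that returns a cycle reachable from `start`
    in a reference graph, or None if there is none."""
    path, seen = [], set()

    def dfs(node):
        if node in path:
            return path[path.index(node):] + [node]
        if node in seen:
            return None
        seen.add(node)
        path.append(node)
        for nxt in graph.get(node, ()):
            cycle = dfs(nxt)
            if cycle:
                return cycle
        path.pop()
        return None

    return dfs(start)

# cluster -> cp -> asg, then the ASG's user data and drain hook
# both refer back to the cluster, creating the two cycles.
references = {
    "cluster": ["cp"],
    "cp": ["asg"],
    "asg": ["userData", "drainHook"],
    "userData": ["cluster"],
    "drainHook": ["cluster"],
}
```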

Contributor Author


OK. Now I am able to resolve the two circular dependencies.

Will add comments inline for discussion.


/**
* Whether to enable managed scaling. This value will be overridden to true
* if `managedTerminationProtection` is enabled.
Contributor


I wonder if, instead of overriding it to true, we should throw an error saying that when managedTerminationProtection is enabled, managedScaling cannot be disabled, or something like that. And we should also have a test case for this error.

The reason is that the default values for both managedTerminationProtection and managedScaling are true, and when a user sets managedScaling to false it means they don't want it enabled. If we override managedScaling to true, it might result in something users don't expect (or that contradicts their intention).

@iamhopaul123
Contributor

Update the properties from AWS::ECS::CapacityProvider with the same AutoscalingGroup will encounter the error:

Invalid request provided: CreateCapacityProvider error: The specified Auto Scaling group ARN is already being used by another capacity provider. Specify a unique Auto Scaling group ARN and try again. (Service: AmazonECS; Status Code: 400; Error Code: ClientException; Request ID: 46641521-15db-40e1-83b8-90bd3dc75a3e; Proxy: null)

Looks like when updating the AWS::ECS::CapacityProvider resource, a new AWS::ECS::CapacityProvider with the same ASG provider will be created first for replacement and will immediately fail because the same ASG is not allowed for two CapacityProvider. No idea how to work it around at this moment.

@MrArnoldPalmer do you have any idea on how to resolve the resource updating issue?

@pahud pahud marked this pull request as draft August 12, 2020 01:50
Comment on lines 17 to 40
const cluster = new ecs.Cluster(stack, 'Cluster', { vpc });

// create the 1st capacity provider with on-demand t3.large instances
cluster.addCapacityProvider('CP', {
  capacityOptions: {
    instanceType: new ec2.InstanceType('t3.large'),
    minCapacity: 2,
  },
  managedScaling: true,
  managedTerminationProtection: true,
  defaultStrategy: { base: 1, weight: 1 },
});

// create the 2nd capacity provider with ec2 spot t3.large instances
cluster.addCapacityProvider('CPSpot', {
  capacityOptions: {
    instanceType: new ec2.InstanceType('t3.large'),
    minCapacity: 1,
    spotPrice: '0.1',
  },
  managedScaling: true,
  managedTerminationProtection: true,
  defaultStrategy: { weight: 3 },
});
Contributor Author


The integ test deploys fine now, but the stack fails to destroy with these errors:

(screenshot of the deletion errors omitted)

Looks like there are some constraints:

  1. When deleting the AWS::ECS::Cluster, all container instances should not be active or draining.
  2. Cannot delete a capacity provider while it is associated with a cluster.

It's interesting, as the two constraints appear to conflict with each other.

Contributor


Can we dissociate the CP from the cluster, then try to delete the CP, then delete the cluster?
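That suggested teardown order can be sketched as a plain sequence of API-level steps. The step names echo real ECS API operations (PutClusterCapacityProviders, DeleteCapacityProvider, DeleteCluster), but this is only an illustration of the ordering, not working deletion code.

```python
def teardown_steps(cluster, capacity_providers):
    """Return one deletion order that satisfies both constraints:
    dissociate the CPs from the cluster first (so they can be deleted),
    then delete each CP, then deregister container instances, then
    delete the cluster. Step names mirror ECS API calls, but nothing
    here talks to AWS."""
    steps = [("PutClusterCapacityProviders", cluster, [])]  # dissociate all CPs
    steps += [("DeleteCapacityProvider", cp, None) for cp in capacity_providers]
    steps.append(("DeregisterContainerInstance", cluster, None))
    steps.append(("DeleteCluster", cluster, None))
    return steps
```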

Contributor


When deleting the AWS::ECS::Cluster, all container instances should not be active or draining.

Do you think this has something to do with removing the InstanceDrainHook's dependency on the ECS cluster?

@pahud
Contributor Author

pahud commented Aug 18, 2020

Encountered a circular dependency because an IAM policy refers to this.clusterArn, which is formatted from cluster.physicalName. Will try to fix and mitigate it.

this.clusterArn = this.getResourceArnAttribute(cluster.attrArn, {
  service: 'ecs',
  resource: 'cluster',
  resourceName: this.physicalName,
});

(screenshot of the circular dependency error omitted)

@pahud
Contributor Author

pahud commented Aug 18, 2020

I am holding this PR because of this and will not continue until further notice from the service team.

Summary for current status

  • cluster and CP can be deployed with no error
  • can't destroy the stack due to the dependency described here

@iamhopaul123 iamhopaul123 added the pr/blocked This PR cannot be merged or reviewed, because it is blocked for some reason. label Aug 18, 2020

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildProject6AEA49D1-qxepHUsryhcu
  • Commit ID: be38d99
  • Result: FAILED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@ghost

ghost commented Dec 2, 2020

Hi! When would this be available?

@SoManyHs
Contributor

SoManyHs commented Jun 9, 2021

This was addressed by #14386. Closing.

@SoManyHs SoManyHs closed this Jun 9, 2021
Labels
@aws-cdk/aws-ecs Related to Amazon Elastic Container pr/blocked This PR cannot be merged or reviewed, because it is blocked for some reason.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

aws-ecs: support capacity provider
8 participants