
aws_ecs: "Resource timed out waiting for completion" error during stack deletion #25969

Open
IllarionovDimitri opened this issue Jun 14, 2023 · 3 comments

Describe the bug

I am running an ECS cluster with an Auto Scaling Group (ASG) as the capacity provider (due to GPU workloads) on a single EC2 instance.

To avoid app downtime during ECS task updates, I have set enable_managed_scaling=True in ecs.AsgCapacityProvider(), so that ECS first spins up a new instance and places the task on it, and only then deregisters and terminates the previous instance.

Enabling managed scaling adds two CloudWatch alarms behind the scenes.

(Screenshot: the two CloudWatch alarms added by managed scaling – Bildschirmfoto 2023-06-13 um 16 13 12)

The problem is that instance termination now only happens after 15 minutes, per the alarm settings.
During stack deletion I get a "Resource timed out waiting for completion" error, which crashes the CI/CD pipeline that manages the stacks.
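
For reference, the settings behind that 15-minute delay can be inspected with boto3. This is only a sketch; I am assuming the managed-scaling alarms follow the usual TargetTracking- naming prefix, which may differ:

import boto3

cloudwatch = boto3.client("cloudwatch")

# Assumption: the alarms created by ECS managed scaling use the
# "TargetTracking-" name prefix; adjust if yours are named differently.
alarms = cloudwatch.describe_alarms(AlarmNamePrefix="TargetTracking-")["MetricAlarms"]
for alarm in alarms:
    # EvaluationPeriods * Period is how long the alarm waits before firing;
    # the scale-in alarm here evaluates 15 one-minute periods = 15 minutes.
    print(alarm["AlarmName"], alarm["EvaluationPeriods"], alarm["Period"])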

I have not found a way to override the 15-minute setting in the template; this is what the capacity provider looks like there:

"felgandev7ecsclusterstackfelgandev7capacityprovider2150902F": {
   "Type": "AWS::ECS::CapacityProvider",
   "Properties": {
    "AutoScalingGroupProvider": {
     "AutoScalingGroupArn": {
      "Ref": "felgandev7asgstackfelgandev7asgASG4A2CB50E"
     },
     "ManagedScaling": {
      "Status": "ENABLED",
      "TargetCapacity": 100
     },
     "ManagedTerminationProtection": "DISABLED"
    },
    "Name": "felgan-dev-7-capacity-provider",
    "Tags": [
     {
      "Key": "project",
      "Value": "felgan"
     },
     {
      "Key": "stack",
      "Value": "storage-stack"
     }
    ]
   },
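
As far as I can tell, the CDK escape hatch only reaches properties that AWS::ECS::CapacityProvider itself exposes, and none of them map to that 15-minute scale-in alarm. A sketch of what I mean (capacity_provider is my ecs.AsgCapacityProvider construct; InstanceWarmupPeriod is just an example property, it does not change the alarm):

# Escape hatch: grab the underlying CfnCapacityProvider and override a
# property that CloudFormation exposes. Nothing here touches the scale-in alarm.
cfn_capacity_provider = capacity_provider.node.default_child
cfn_capacity_provider.add_property_override(
    "AutoScalingGroupProvider.ManagedScaling.InstanceWarmupPeriod", 60
)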

Expected Behavior

Enabling managed scaling for an ECS ASG capacity provider should either not collide with stack deletion timeouts, or there should be a way to alter the CloudWatch alarms (e.g. lower the 15-minute threshold) via CDK.

Current Behavior

During stack deletion with enable_managed_scaling=True in ecs.AsgCapacityProvider(), a "Resource timed out waiting for completion" error is raised and the stack deletion fails.

Reproduction Steps

Reproducing the issue requires deploying quite a few components. I can assist with further information if needed, since the stack is up and running.
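
That said, a rough sketch of the relevant pieces looks like this (instance type, AMI and construct IDs are placeholders, not my exact stack):

from aws_cdk import Stack
from aws_cdk import aws_autoscaling as autoscaling
from aws_cdk import aws_ec2 as ec2
from aws_cdk import aws_ecs as ecs
from constructs import Construct


class ReproStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "Vpc", max_azs=2)

        # Single GPU instance, allowed to scale to 2 during task updates
        asg = autoscaling.AutoScalingGroup(
            self,
            "Asg",
            vpc=vpc,
            instance_type=ec2.InstanceType("g4dn.xlarge"),
            machine_image=ecs.EcsOptimizedImage.amazon_linux2(ecs.AmiHardwareType.GPU),
            min_capacity=1,
            max_capacity=2,
        )

        cluster = ecs.Cluster(self, "Cluster", vpc=vpc)

        capacity_provider = ecs.AsgCapacityProvider(
            self,
            "CapacityProvider",
            auto_scaling_group=asg,
            enable_managed_scaling=True,
            enable_managed_termination_protection=False,
        )
        cluster.add_asg_capacity_provider(capacity_provider)

Deleting a stack along these lines is where the timeout shows up for me.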

Possible Solution

No response

Additional Information/Context

No response

CDK CLI Version

2.83.0

Framework Version

No response

Node.js Version

18

OS

Ubuntu 20.04 LTS

Language

Python

Language Version

3.10.6

Other information

No response

@IllarionovDimitri IllarionovDimitri added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels Jun 14, 2023
@github-actions github-actions bot added the @aws-cdk/aws-ecs Related to Amazon Elastic Container label Jun 14, 2023
@pahud
Contributor

pahud commented Jun 14, 2023

Sounds like it happens when you delete the stack. Where did you see the Resource timed out waiting for completion error? Is it from CloudFormation? I am wondering which resource timed out waiting for completion. Any more screenshots would be helpful.

@pahud pahud added p2 effort/medium Medium work item – several days of effort investigating This issue is being investigated and/or work is in progress to resolve the issue. and removed needs-triage This issue or PR still needs to be triaged. labels Jun 14, 2023
@pahud pahud self-assigned this Jun 14, 2023
@IllarionovDimitri
Author

IllarionovDimitri commented Jun 15, 2023

Yes, as mentioned in the title, the timeout occurs during stack deletion.

Here is the very first failure during stack deletion:

(Screenshot: the CloudFormation stack deletion failure – Bildschirmfoto 2023-06-15 um 10 01 01)

Here is how I define the capacity provider:

ecs.AsgCapacityProvider(
    self,
    f"{config.ID}-capacity-provider",
    capacity_provider_name=f"{config.ID}-capacity-provider",
    enable_managed_scaling=True,
    enable_managed_termination_protection=False,
    auto_scaling_group=asg,
)

The issue appears after I set enable_managed_scaling=True. This setting adds two CloudWatch alarms; one of them delays instance termination by 15 minutes, and that cannot be overridden in the template or via CDK.

(Screenshot: the CloudWatch alarm that delays instance termination – Bildschirmfoto 2023-06-15 um 09 12 13)

@IllarionovDimitri
Author

IllarionovDimitri commented Jun 16, 2023

OK, since nothing else worked, I had to implement a workaround based on a custom resource:

# Force-delete the ASG during stack deletion so the capacity provider
# does not sit behind the managed-scaling scale-in alarm (15 min).
asg_parameters = {
    "AutoScalingGroupName": asg.auto_scaling_group_name,
    "ForceDelete": True,
}

asg_sdk_call_params = {
    "action": "deleteAutoScalingGroup",
    "service": "AutoScaling",
    "parameters": asg_parameters,
    "physical_resource_id": cr.PhysicalResourceId.of(asg.node.id),
}

asg_force_delete = cr.AwsCustomResource(
    self,
    f"{config.ID}-cr-delete-asg",
    install_latest_aws_sdk=False,
    on_delete=cr.AwsSdkCall(**asg_sdk_call_params),
    policy=cr.AwsCustomResourcePolicy.from_sdk_calls(
        resources=cr.AwsCustomResourcePolicy.ANY_RESOURCE
    ),
)

# The custom resource depends on the ASG and the cluster, so on stack
# deletion it is removed first and the force-delete call runs before them.
asg_force_delete.node.add_dependency(asg)
asg_force_delete.node.add_dependency(ecs_cluster)
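
Side note: ANY_RESOURCE is broader than strictly needed; the policy could presumably be narrowed to this ASG with from_statements, something along these lines (untested sketch):

from aws_cdk import aws_iam as iam
from aws_cdk import custom_resources as cr

# Untested sketch: restrict the custom resource's permissions to this ASG only.
# ForceDelete might need additional permissions beyond DeleteAutoScalingGroup.
scoped_policy = cr.AwsCustomResourcePolicy.from_statements(
    [
        iam.PolicyStatement(
            actions=["autoscaling:DeleteAutoScalingGroup"],
            resources=[asg.auto_scaling_group_arn],
        )
    ]
)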
