
@agarakan agarakan commented Jul 30, 2025

Description of the issue

This allows the ECS cleanup script to delete clusters whose services have failed tasks, or have tasks that have been running for more than a week.

See old output of buggy ECS Resource Cleanup run (clean-ecs-clusters): https://github.com/aws/amazon-cloudwatch-agent/actions/runs/16610452973/job/46992332357

Description of changes

  1. Add logic to scale down ECS cluster services before deleting them, to avoid getting a 400 on deletion of active services
    Ex:
2025/07/30 00:29:15 Error operation error ECS: DeleteCluster, https response error StatusCode: 400, RequestID: 3162fff1-ae25-4a19-8723-efd1610f8702, ClusterContainsServicesException: The Cluster cannot be deleted while Services are active. terminating cluster arn:aws:ecs:us-west-2:506463145083:cluster/cwagent-integ-test-cluster-04e4c49e62995d6e
  2. Fix describeTasks invocation to include task list
  3. Fix buggy expiry time logic when checking for tasks to delete
  4. Error handling

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Ran locally with developer account

cd ./tool/clean && go run ./clean_ecs/clean_ecs.go --tags clean

See fix in kicked-off resource cleanup in github runner (see clean-ecs-clusters): https://github.com/aws/amazon-cloudwatch-agent/actions/runs/16631933178/job/47063290921

Requirements

Before committing your code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

Integration Tests

To run integration tests against this PR, add the ready for testing label.

@agarakan agarakan requested a review from a team as a code owner July 30, 2025 19:26
** What **
1. Add logic to scale down ecs cluster services before deleting them to
   avoid getting a 400 on deletion of active services
2. Fix describeTasks invocation to include task list
3. Fix buggy expiry time logic when checking for tasks to delete

** Why **

This allows the ecs cleanup script to handle deletion of clusters whose
services have failed tasks, or have tasks that have been open for more
than a week.
@agarakan agarakan force-pushed the cleanup_ecs_active_services branch from 268f45f to 3060eaa Compare July 30, 2025 19:29
}
}

func isClusterTasksExpired(ctx context.Context, client *ecs.Client, clusterArn *string) bool {
Contributor Author

Previously this logic failed with an invalid request because describeTaskInput was missing the Tasks parameter. It now retrieves the cluster's task list first and passes it to the describeTasks call.

continue
}

for _, service := range services.ServiceArns {
Contributor Author

@agarakan agarakan Jul 30, 2025

Now handles deleting clusters with active services by scaling each service down and then deleting it. Validated that the original 400 is no longer encountered (see the 400 in the PR description).

// Clean ECS clusters if they have been running longer than 7 days

- var expirationTimeOneWeek = time.Now().UTC().Add(clean.KeepDurationOneWeek)
+ var expirationTimeOneWeek = time.Now().UTC().Add(-clean.KeepDurationOneWeek)
Contributor Author

Fixed bug. Expiration time used to be set 1 week in the future.

Contributor

Good catch!

if !strings.HasPrefix(*cluster.ClusterName, "cwagent-integ-test-cluster-") {
continue
}
if cluster.ActiveServicesCount > 0 {
Contributor Author

This check is no longer needed, since active services are now handled during deletion.

TravisStark previously approved these changes Jul 30, 2025
@agarakan agarakan merged commit a2fc23a into main Jul 30, 2025
102 of 105 checks passed
@agarakan agarakan deleted the cleanup_ecs_active_services branch July 30, 2025 21:16