fix several issues with the autodraining lambda #23

masneyb · 2018-04-16T14:23:53Z

This patch makes the following changes, which address several issues
and clean up the existing Lambda function:

- Previously, the Lambda function was shipped as a separate 8.1 MB
  ZIP file that needed to be stored in S3 and managed separately from
  the rest of your ECS cluster. Python code in AWS Lambda no longer
  needs to bundle all of its dependencies. With the refactoring
  described in this patch, the new Python code is small enough that it
  is embedded directly in the CloudFormation template to reduce
  external dependencies. This makes it easy to change the code on a
  branch and test it against a single ECS cluster.
- The AWS code discovered the ECS cluster name by parsing the user
  data. This causes issues if the EC2 instances need to be
  bootstrapped with other items through the user data script.
- The AWS code could post messages to the wrong SNS topic when
  retrying. It looked for the first SNS topic in the account that had
  a Lambda function subscribed to it and posted the retry message to
  that topic.
- The AWS code did not paginate the ECS API results when reading the
  list of EC2 instances. If the instance ID that was about to be
  terminated was not on the first page, the instance was never set to
  DRAINING, and end users would see 50X errors when the operation
  timed out and autoscaling killed the instance.
- The retry logic did not put any delay in place when retrying. The
  Lambda function would be invoked about 5-10 times a second, and
  each invocation would probably make close to a dozen AWS API calls.
  A 5 second delay between retries was introduced.
- There was a large amount of unused code and variables in the AWS
  implementation.
- Converted the code from Python 2 to 3.
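
To illustrate the pagination fix, here is a minimal sketch of how the lookup could follow `nextToken` across pages. It is not the code from this patch: the function name `find_container_instance_arn` and its parameters are hypothetical, and `ecs_client` is assumed to be a boto3 ECS client (or anything exposing the same `list_container_instances` and `describe_container_instances` calls).

```python
def find_container_instance_arn(ecs_client, cluster, ec2_instance_id):
    """Return the ECS container instance ARN for an EC2 instance, or None.

    Hypothetical helper: walks every page of list_container_instances
    instead of stopping after the first one.
    """
    kwargs = {"cluster": cluster}
    while True:
        page = ecs_client.list_container_instances(**kwargs)
        arns = page.get("containerInstanceArns", [])
        if arns:
            # Resolve the ARNs on this page to their EC2 instance IDs.
            described = ecs_client.describe_container_instances(
                cluster=cluster, containerInstances=arns
            )
            for instance in described["containerInstances"]:
                if instance["ec2InstanceId"] == ec2_instance_id:
                    return instance["containerInstanceArn"]
        token = page.get("nextToken")
        if not token:
            # No more pages: the instance is not registered in this cluster.
            return None
        kwargs["nextToken"] = token
```

With a real client this would be called as `find_container_instance_arn(boto3.client("ecs"), cluster_name, instance_id)`. boto3 also offers `get_paginator("list_container_instances")`, which hides the token handling; the explicit loop is used here only to make the missing step visible.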


clorichel · 2018-05-03T23:36:39Z

Totally legit, great work, thanks @masneyb 👍

masneyb · 2018-05-03T23:48:08Z

Thanks! Here is a blog post that describes how we are using ECS at my employer: https://techblog.realtor.com/a-better-ecs/. It includes a CloudFormation template with the corrected autodraining Lambda.

Brian

Leonidimus · 2018-07-11T18:28:30Z

Calling time.sleep(5) does not stop the execution environment. You will still pay for the duration of the execution environment, from the time the function is invoked until the time the function exits.

masneyb · 2018-07-12T00:35:04Z

That is correct. The sleep is there to avoid the API throttling errors that we were encountering in our accounts (even beyond the autodraining Lambdas). According to the AWS pricing page, Lambdas with 128 MB of RAM get 3,200,000 free seconds per month. Beyond that, each 5 second pause will cost $0.000010400. If I recall correctly, the EC2 scale down timeout is set to 15 minutes in the template.

Brian
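
For anyone who wants to check the arithmetic, here is a rough sketch that reproduces the figures above. The constants are assumptions taken from the AWS Lambda pricing page as it read in 2018 ($0.000000208 per 100 ms at 128 MB, 400,000 free GB-seconds per month), and the function names are hypothetical. The worst-case figure assumes a retry every 5 seconds for the full 15 minute timeout and counts only the sleep time.

```python
from decimal import Decimal

# Assumed 2018 pricing-page values; check current pricing before relying on them.
PRICE_PER_100MS_128MB = Decimal("0.000000208")
FREE_GB_SECONDS = 400_000
MEMORY_GB = Decimal(128) / Decimal(1024)


def sleep_cost(seconds):
    """Cost in USD of keeping a 128 MB Lambda alive for `seconds`."""
    return PRICE_PER_100MS_128MB * (Decimal(seconds) * 10)


def free_seconds():
    """Free-tier seconds per month for a 128 MB Lambda."""
    return Decimal(FREE_GB_SECONDS) / MEMORY_GB


def max_sleep_cost(timeout_seconds=900, delay_seconds=5):
    """Upper bound on sleep cost for one instance termination."""
    retries = timeout_seconds // delay_seconds
    return retries * sleep_cost(delay_seconds)
```

Under these assumptions, `sleep_cost(5)` gives $0.0000104, `free_seconds()` gives 3,200,000, and `max_sleep_cost()` comes to well under a cent per terminated instance.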


nathanpeck · 2018-07-24T16:25:22Z

These changes look really great. I don't have write access right now, but will try to track down someone who can merge this.

Edit: @mperi is looking at it and testing before merging.

mperi

@masneyb : Great simplification and enhancements. Line 425 in ecs.yaml needs to reference the CFN cluster name instead of the stack name. Everything else looks good in testing. Thank you!

masneyb · 2018-07-31T13:21:38Z

@mperi : I corrected the ECS cluster name. Let me know if you need any other changes.

Brian

mperi

Thanks Brian. Tested and merging

masneyb · 2018-08-01T13:23:07Z

OK, thanks. Let me know if I need to do anything else to get this merged.

Brian

mperi · 2018-08-01T23:58:35Z

Merging post testing. Thank you for the changes in this PR.

masneyb added 2 commits April 16, 2018 10:20

simplify installation directions now that the python code is embedded within the CloudFormation template.

198dad4

sstarcher approved these changes Jul 5, 2018


mperi requested changes Jul 30, 2018


corrected ECS cluster name

ca07943

mperi approved these changes Jul 31, 2018


mperi closed this Aug 1, 2018

mperi merged commit 0d60dd5 into aws-samples:master Aug 1, 2018
