Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix several issues with the autodraining lambda #23

Merged
merged 3 commits into from Aug 1, 2018
Merged

fix several issues with the autodraining lambda #23

merged 3 commits into from Aug 1, 2018

Conversation

masneyb
Copy link
Contributor

@masneyb masneyb commented Apr 16, 2018

This patch makes the following changes that address several issues
and cleans up the existing Lambda function:

  • Previously, the old Lambda function was included as a separate 8.1 MB
    ZIP file that needed to be stored at S3 and managed separately from
    the rest of your ECS cluster. Python code in AWS Lambda no longer
    needs to bundle all of its dependencies . With all of the refactoring
    above, the new Python code is small enough that it is embedded
    directly in the CloudFormation template to reduce external
    dependencies. This will make it easy to make changes to this code on
    a branch and test it against a single ECS cluster.
  • The AWS code discovers the ECS cluster name by parsing the user data.
    This will cause issues if the EC2 instances need to be bootstrapped
    with other items through the user data script.
  • The AWS code can post messages to the wrong SNS topic when retrying.
    It looks for the first SNS topic in the account that has a lambda
    function subscribed to it and posts the retry message to that topic.
  • The AWS code does not do any kind of pagination against the ECS API
    when reading the list of EC2 instances. So if it couldn't find the
    instance ID that was about to be terminated on the first page, then
    the instance was not set to DRAINING and the end users would see 50X
    messages when the operation timed out and autoscaling killed the
    instance.
  • The retry logic did not put in any kind of delay in place when
    retrying. The Lambda function would be invoked about 5-10 times a
    second, and each Lambda function invocation would probably
    make close to a dozen AWS API calls. A 5 second delay between each
    retry was introduced.
  • There was a large amount of unused code and variables in the in the
    AWS implementation.
  • Converted the code from Python 2 to 3.

This patch makes the following changes that address several issues
and cleans up the existing Lambda function:

- Previously, the old Lambda function was included as a separate 8.1 MB
  ZIP file that needed to be stored at S3 and managed separately from
  the rest of your ECS cluster. Python code in AWS Lambda no longer
  needs to bundle all of its dependencies . With all of the refactoring
  above, the new Python code is small enough that it is embedded
  directly in the CloudFormation template to reduce external
  dependencies. This will make it easy to make changes to this code on
  a branch and test it against a single ECS cluster.
- The AWS code discovers the ECS cluster name by parsing the user data.
  This will cause issues if the EC2 instances need to be bootstrapped
  with other items through the user data script.
- The AWS code can post messages to the wrong SNS topic when retrying.
  It looks for the first SNS topic in the account that has a lambda
  function subscribed to it and posts the retry message to that topic.
- The AWS code does not do any kind of pagination against the ECS API
  when reading the list of EC2 instances. So if it couldn't find the
  instance ID that was about to be terminated on the first page, then
  the instance was not set to DRAINING and the end users would see 50X
  messages when the operation timed out and autoscaling killed the
  instance.
- The retry logic did not put in any kind of delay in place when
  retrying. The Lambda function would be invoked about 5-10 times a
  second, and each Lambda function invocation would probably
  make close to a dozen AWS API calls. A 5 second delay between each
  retry was introduced.
- There was a large amount of unused code and variables in the in the
  AWS implementation.
- Converted the code from Python 2 to 3.
@clorichel
Copy link

Totally legit, great work, thanks @masneyb 👍

@masneyb
Copy link
Contributor Author

masneyb commented May 3, 2018

Thanks! Here is a blog post that describes how we are using ECS at my employer: https://techblog.realtor.com/a-better-ecs/. It includes a CloudFormation template with the corrected autodraining Lambda.

Brian

@Leonidimus
Copy link

Calling time.sleep(5) does not stop the execution environment. You will still pay for the duration of the execution environment, from the time the function is invoked until the time the function exits.

@masneyb
Copy link
Contributor Author

masneyb commented Jul 12, 2018 via email

@nathanpeck
Copy link

nathanpeck commented Jul 24, 2018

These changes look really great. I don't have write access right now, but will try to track down someone who can merge this.

Edit: @mperi is looking at it and testing before merging.

Copy link
Contributor

@mperi mperi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@masneyb : Great simplication and enhancements. Line 425 in ecs.yaml needs to reference CFN clustername instead of stack name. Rest all looks good testing wise. Thank you!

@masneyb
Copy link
Contributor Author

masneyb commented Jul 31, 2018

@mperi : I corrected the ECS cluster name. Let me know if you need any other changes.

Brian

Copy link
Contributor

@mperi mperi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Brian. Tested and merging

@masneyb
Copy link
Contributor Author

masneyb commented Aug 1, 2018

OK, thanks. Let me know if I need to do anything else to get this merged.

Brian

@mperi
Copy link
Contributor

mperi commented Aug 1, 2018

Merging post testing. Thank you for the changes in PR.

@mperi mperi closed this Aug 1, 2018
@mperi mperi merged commit 0d60dd5 into aws-samples:master Aug 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants