This repository has been archived by the owner. It is now read-only.

Race in ec2_asg when health checking ELB instances #383

Closed
bwhaley opened this Issue Nov 24, 2014 · 18 comments

Comments

@bwhaley
Contributor

bwhaley commented Nov 24, 2014

Issue Type:

Bug Report

Ansible Version:

ansible 1.8
configured module search path = None

Environment:

Ansible running from Ubuntu 14.04 against AWS API (no target host)

Summary:

Per the discussion on the mailing list, the ec2_asg module may need to wait for the grace period to expire before determining the health of an instance.

I've worked around the issue temporarily by checking the instances in an ELB with an external script. https://gist.github.com/bwhaley/eee6a0f61636862515aa
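A minimal sketch of that kind of external check (not the gist's actual code; `all_in_service` and `wait_for_elb` are hypothetical helpers, and the state strings assume boto-style ELB instance health, i.e. 'InService' / 'OutOfService'):

```python
import time

def all_in_service(instance_states):
    """Return True when every registered instance reports InService."""
    return bool(instance_states) and all(
        state == 'InService' for state in instance_states)

def wait_for_elb(get_states, timeout=300, interval=10):
    """Poll until every instance is InService or the timeout expires.

    get_states is a callable returning a list of state strings, e.g.
    one wrapping boto's elb_conn.describe_instance_health(lb_name).
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all_in_service(get_states()):
            return True
        time.sleep(interval)
    return False
```

Keeping the AWS call behind a callable makes the wait loop testable without touching the API.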

Steps To Reproduce:
  1. Create an AWS ELB
  2. Create an autoscale group and launch configuration to use that ELB
  3. Configure "immutablish deploys" with ASG replacement, as per https://github.com/ansible/immutablish-deploys/tree/master/roles/dual_asg
Expected Results:

For a new release, Ansible should wait for new instances to show as InService in the ELB before considering them healthy and moving to the next play.

Actual Results:

During the HealthCheckGracePeriod, the autoscale group reports instances as healthy even though they are not yet InService in the ELB.

@jsmartin

Contributor

jsmartin commented Nov 24, 2014

I'm still not quite sure what the correct behavior should be -- should we wait for the grace period to expire, and then allow the autoscale group to properly determine the instances' health status? Or should we interrogate the ELB as you have done in your example and ignore the grace period?

@bwhaley

Contributor

bwhaley commented Nov 24, 2014

The ELB is ultimately the source of truth for whether instances are healthy or not from a network perspective. Wrapping an ELB check in an ansible loop ensures that new instances being added to the ASG via scaling policies are also included.

The grace period feature of the ASG only clouds the issue IMO. The ELB already has timing built in through the healthy/unhealthy thresholds and the check interval. Would you have to wait the grace period for each instance? How long after the grace period does autoscale then wait before terminating and launching another? The timing becomes trickier if an existing ASG is being updated and has scaling policies attached.
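As a rough illustration of the timing already built into the ELB (the function name here is hypothetical, not an Ansible or boto API): an instance's state only flips after the check interval times the consecutive-check threshold.

```python
def elb_detection_window(interval_seconds, threshold):
    # Approximate time for an ELB to change an instance's state:
    # the health-check interval times the consecutive-check threshold.
    return interval_seconds * threshold

# A 30s interval with an unhealthy threshold of 2 marks an instance
# OutOfService after roughly 60 seconds.
print(elb_detection_window(30, 2))  # -> 60
```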

Pinging @pas256 for his ideas. How does Asgard handle this?

@ansibot


ansibot commented Nov 24, 2014

Can You Help Us Out?

Thanks for filing a ticket! I am the friendly GitHub Ansibot.

It looks like you might not have filled out the issue description based on our standard issue template. You might not have known about that, and that's ok too, we'll tell you how to do it.

We have a standard template because Ansible is a really busy project and it helps to have some standard information in each ticket, and GitHub doesn't yet provide a standard facility to do this like some other bug trackers. We hope you understand as this is really valuable to us!

Solving this is simple: please copy the contents of this template and paste it into the description of your ticket. That's it!

If You Had A Question To Ask Instead

If you happened to have a "how do I do this in Ansible" type of question, that's probably more of a user-list question than a bug report, and you should probably ask this question on the project mailing list instead.

However, if you think you have a bug, the report is the way to go! We definitely want all the bugs filed :) Just trying to help!

About Priority Tags

Since you're here, we'll also share some useful information at this time.

In general tickets will be assigned a priority between P1 (highest) and P5, and then worked in priority order. We may also have some follow up questions along the way, so keeping up with follow up comments via GitHub notifications is a good idea.

Due to large interest in Ansible, humans may not comment on your ticket immediately.

Mailing Lists

If you have concerns or questions, you're welcome to stop by the ansible-project or ansible-development mailing lists, as appropriate. Here are the links:

Thanks again for the interest in Ansible!

@jsmartin

Contributor

jsmartin commented Nov 25, 2014

I'm thinking at this point, I will make a pull request that modifies the get_properties function in the ec2_asg module. Pseudo code:

    if as_group.load_balancers and as_group.health_check_type == 'ELB':
        for lb_name in as_group.load_balancers:
            for instance_state in elb_conn.describe_instance_health(lb_name):
                if instance_state.state != 'InService':
                    properties['unhealthy_instances'] += 1
@jsmartin

Contributor

jsmartin commented Nov 25, 2014

This also seems very closely tied to the LB health check itself. The health check timeout, interval, healthy/unhealthy threshold all would come to play here I think.

This could get a little bit tricky because: "If there are multiple load balancers associated with your Auto Scaling group, Auto Scaling checks the health state of your EC2 instances by making health check calls to each load balancer."

http://docs.aws.amazon.com/AutoScaling/latest/DeveloperGuide/as-add-elb-healthcheck.html

This makes the code more complicated, having to check these things myself rather than just waiting out the grace period.

@bwhaley

Contributor

bwhaley commented Nov 25, 2014

Yes. Your plan looks good to me, avoids the need to change any other module. Users should configure the ELB values appropriately and rely on those exclusively for health checking.


@jsmartin

Contributor

jsmartin commented Nov 25, 2014

It appears that AWS solves this problem differently, and I think I like their approach: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html#cfn-attributes-updatepolicy-rollingupdate-pausetime

and from a forum article:

https://forums.aws.amazon.com/message.jspa?messageID=426945

"PauseTime specifies the duration at which CloudFormation will wait to allow replacement instances to initialize itself before terminating more instances"

If Amazon's not interrogating the LBs, and is just doing a hard wait, I'm not sure we should do any differently. If we create a pause_time module option and have it set to the ELB grace period or greater, that should do the trick. The pause_time would be inserted right around line 528, right before the:

    # now we make sure that we have enough instances in a viable state
    wait_timeout = time.time() + wait_timeout

Thoughts?
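A sketch of where that hard wait would slot in, assuming a hypothetical pause_time option (the function and callable here are illustrative, not the module's actual code):

```python
import time

def wait_for_viable_instances(check_viable, pause_time=0,
                              wait_timeout=300, interval=10):
    # Hypothetical pause_time option: a hard wait, like CloudFormation's
    # PauseTime, before any polling begins.
    if pause_time:
        time.sleep(pause_time)
    # now we make sure that we have enough instances in a viable state
    wait_timeout = time.time() + wait_timeout
    while time.time() < wait_timeout:
        if check_viable():
            return True
        time.sleep(interval)
    return False
```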

@bwhaley

Contributor

bwhaley commented Nov 25, 2014

Thinking through edge cases.

  • Autoscale is set to a desired capacity of 3, launches 3 instances
  • grace check period + pause_time starts
  • 2 of 3 instances come up healthy. The 3rd instance does not, so autoscale
    terminates it and launches another. The grace check resets for the
    replacement instance
  • grace check period + pause_time has elapsed
  • ec2_asg interrogates autoscale. Autoscale reports 3/3 Healthy, InService
    instances
  • The ELB has only 2 of the 3 instances InService

Doesn't seem like adding another timer helps here.


@bwhaley

Contributor

bwhaley commented Nov 25, 2014

I could engage AWS support about this if you think that would be helpful

@jsmartin

Contributor

jsmartin commented Nov 25, 2014

I think you have valid points there -- given your approach, how does it handle an ASG that has multiple ELBs assigned to it? Or should it even?

@bwhaley

Contributor

bwhaley commented Nov 25, 2014

I get the list of ELBs from the ASG and check each instance in each ELB to be completely thorough. You want to do this since each ELB may have different health checks.
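That cross-ELB check can be sketched as a pure function (the name and the input shape are assumptions for illustration; the real data would come from describe_instance_health calls against each ELB):

```python
def in_service_everywhere(elb_states):
    """elb_states maps ELB name -> {instance_id: state string}.

    An instance counts as healthy only if every ELB it is registered
    with reports it InService, since each ELB may have different
    health checks.
    """
    all_ids = set()
    for states in elb_states.values():
        all_ids.update(states)
    return {
        iid for iid in all_ids
        if all(states.get(iid) == 'InService'
               for states in elb_states.values() if iid in states)
    }
```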

@jsmartin

Contributor

jsmartin commented Nov 25, 2014

Are you on freenode by chance? Might be better to hash this out in real-time?

@jsmartin

Contributor

jsmartin commented Nov 26, 2014

@bwhaley would you mind testing out this re-worked module? https://github.com/jsmartin/ansible-modules-core/blob/asg_check_lb/cloud/amazon/ec2_asg.py It seems to work on my end. It's full of logging stuff that would be removed in the pull req, but other than that, I think I'm happy.

@bwhaley

Contributor

bwhaley commented Nov 26, 2014

Took a look at the updated code. The logic looks right, but currently it only waits when replacing all instances. The logic should also apply to new ASGs. I suggest calling wait_for_elb() from within get_properties().


@jsmartin

Contributor

jsmartin commented Dec 4, 2014

I'm out-of-office to the end of the year due to a stork's special delivery, so will get this tweaked and tested when I get back.

@lostmimic


lostmimic commented Mar 4, 2015

Congrats on your special delivery :) Do we know if this will make it back for a specific version release?

Thank you,

EDIT: N/M I see #601

@abadger

Member

abadger commented Mar 20, 2015

#601 has been merged to both devel and the stable-1.9 branch (although I rebased it onto the current tip of devel before merging, so GitHub doesn't know that it was actually merged). Please test any of devel, stable-1.9, or the upcoming (hopefully tomorrow) 1.9rc2 to see if this is now fixed.

I'll close in a few days if no one reports it's still broken.

@abadger

Member

abadger commented Mar 26, 2015

1.9.0.1 is released, which should contain the code from #601. That should fix this issue for you!

Closing This Ticket

Hi!

We believe recent commits (likely detailed above) should resolve this question or problem for you.

This will also be included in the next major release.

If you continue seeing any problems related to this issue, or if you have any further questions, please let us know by stopping by one of the two mailing lists, as appropriate:

Because this project is very active, we're unlikely to see comments made on closed tickets, but the mailing list is a great way to ask questions, or post if you don't think this particular
issue is resolved.

Thank you!

@abadger abadger closed this Mar 26, 2015
