Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.Sign up
Race in ec2_asg when health checking ELB instances #383
Ansible running from Ubuntu 14.04 against AWS API (no target host)
Per the discussion on the mailing list, the ec2_asg module may need to wait for the grace period to expire before determining the health of an instance.
I've worked around the issue temporarily by checking the instances in an ELB with an external script. https://gist.github.com/bwhaley/eee6a0f61636862515aa
Steps To Reproduce:
For a new release, ansible should wait for new instances to show as
The autoscale group considers instances as healthy when they are not during the HealthCheckGracePeriod, even though instances are not yet InService in the ELB.
I'm still not quite sure what the correct behavior should be -- should we wait for the grace period to expire, and then allow the autoscale group to properly determine the instances' health status. Or should we interrogate the ELB as you have done in your example and ignore the grace period?
The ELB is ultimately the source of truth for whether instances are healthy or not from a network perspective. Wrapping an ELB check in an ansible loop ensures that new instances being added to the ASG via scaling policies are also included.
The grace period feature of the ASG only clouds the issue IMO. The ELB already has timing built in through the healthy/unhealthy thresholds and the check interval. Would you have to wait the grace period for each instance? How long after the grace period does autoscale then wait before terminating and launching another? The timing becomes trickier if an existing ASG is being updated and has scaling policies attached.
Pinging @pas256 for his ideas. How does Asgard handle this?
Can You Help Us Out?
Thanks for filing a ticket! I am the friendly GitHub Ansibot.
It looks like you might not have filled out the issue description based on our standard issue template. You might not have known about that, and that's ok too, we'll tell you how to do it.
We have a standard template because Ansible is a really busy project and it helps to have some standard information in each ticket, and GitHub doesn't yet provide a standard facility to do this like some other bug trackers. We hope you understand as this is really valuable to us!.
Solving this is simple: please copy the contents of this template and paste it into the description of your ticket. That's it!
If You Had A Question To Ask Instead
If you happened to have a "how do I do this in Ansible" type of question, that's probably more of a user-list question than a bug report, and you should probably ask this question on the project mailing list instead.
However, if you think you have a bug, the report is the way to go! We definitely want all the bugs filed :) Just trying to help!
About Priority Tags
Since you're here, we'll also share some useful information at this time.
In general tickets will be assigned a priority between P1 (highest) and P5, and then worked in priority order. We may also have some follow up questions along the way, so keeping up with follow up comments via GitHub notifications is a good idea.
Due to large interest in Ansible, humans may not comment on your ticket immediately.
If you have concerns or questions, you're welcome to stop by the ansible-project or ansible-development mailing lists, as appropriate. Here are the links:
Thanks again for the interest in Ansible!
I'm thinking at this point, I will make a pull request that modifies the get_properties function in the ec2_asg module. Pseudo code:
This also seems very closely tied to the LB health check itself. The health check timeout, interval, healthy/unhealthy threshold all would come to play here I think.
This could get a little bit tricky because: "f there are multiple load balancers associated with your Auto Scaling group, Auto Scaling checks the health state of your EC2 instances by making health check calls to each load balancer."
This makes the code more complicated, having to check these things myself rather than just waiting out the grace period.
Yes. Your plan looks good to me, avoids the need to change any other
On Mon, Nov 24, 2014 at 4:41 PM, James S. Martin firstname.lastname@example.org
It appears that AWS solves this problem differently, and I think I like their approach: http://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html#cfn-attributes-updatepolicy-rollingupdate-pausetime
and from a forum article:
"PauseTime specifies the duration at which CloudFormation will wait to allow replacement instances to initialize itself before terminating more instances"
If Amazon's not interrogating the LBs, and is just doing a hard wait, I'm not sure we should do any differently. If we create a pause_time module option, have that set to elb grace period or greater, that should do the trick. The pause_time would be inserted right around line 528, right before the:
Thinking through edge cases.
Doesn't seem like adding another timer helps here.
On Mon, Nov 24, 2014 at 6:22 PM, James S. Martin email@example.com
@bwhaley would you mind testing out this re-worked module? https://github.com/jsmartin/ansible-modules-core/blob/asg_check_lb/cloud/amazon/ec2_asg.py .. It seems to work on my end. It's full of logging stuff that would be removed in the pull req, but other than that, I think I'm happy.
Took at look at the updated code. The logic looks right, but currently it
On Tue, Nov 25, 2014 at 10:08 PM, James S. Martin firstname.lastname@example.org
#601 has been merged to both devel and the stable-1.9 branch (Although I rebased it onto the current tip of devel before merging so github doesn't know that it was actually merged.) Please test any of devel, stable-1.9, or the upcoming (hopefully tomorrow) 1.9rc2 to see if this is now fixed.
I'll close in a few days if no one reports it's still broken.
18.104.22.168 is released which should contain the code from #601. that should fix this issue for you!
Closing This Ticket
We believe recent commits (likely detailed above) should resolve this question or problem for you.
This will also be included in the next major release.
If you continue seeing any problems related to this issue, or if you have any further questions, please let us know by stopping by one of the two mailing lists, as appropriate:
Because this project is very active, we're unlikely to see comments made on closed tickets, but the mailing list is a great way to ask questions, or post if you don't think this particular