asg spin doctor #88

kapilt · 2016-05-08T14:51:46Z

resize/suspend/email/etc

more specifically, several accounts have autoscale groups that are spinning, i.e. repeatedly trying and failing to launch an instance because of an invalid ami, subnets, elbs, etc.

jeffastorey · 2016-05-13T14:27:15Z

this can also occur when instances with non-complying tags are terminated. in our policy, for example, we terminate instances with non-complying tags once an hour. if the missing tag is the one that says who owns the instance, we have no way to email the owner, and the asg keeps spinning. it would be nice to be able to trace an hourly instance scan back to the creator of the asg, but at the very least we should suspend the ASGs in these cases as well

not just applicable to this issue, but maybe custodian could hook into CloudTrail on all instance creation events and auto-tag the instances with who launched them? that person could then always be included in event notifications
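A rough sketch of the auto-tagging idea, as plain Python rather than anything custodian-specific. It assumes the event is a CloudTrail `RunInstances` record delivered through CloudWatch Events, with the caller under `detail.userIdentity` and the new instances under `detail.responseElements.instancesSet.items`. The function name `creator_tags` and the tag key `CreatorName` are made up for illustration; actually applying the tags (e.g. via the EC2 API) is left out.

```python
def creator_tags(event, tag_key="CreatorName"):
    """Map each launched instance id to a tag naming who launched it.

    `tag_key` is a hypothetical tag name, not an existing convention.
    """
    detail = event.get("detail", {})
    identity = detail.get("userIdentity", {})
    # Prefer the caller's ARN; fall back to principalId when no ARN is present.
    creator = identity.get("arn") or identity.get("principalId")
    if not creator:
        return {}
    # responseElements can be null on failed API calls, hence the `or {}`.
    response = detail.get("responseElements") or {}
    items = response.get("instancesSet", {}).get("items", [])
    return {
        item["instanceId"]: {"Key": tag_key, "Value": creator}
        for item in items
    }
```

The returned tag value could then double as the notification recipient when the owner tag is missing.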

kapilt · 2016-05-22T14:27:10Z

i added better support for asg cwe rules, including state notifications, but i'm still a little unclear what we should do as an action when we detect these. In some of the larger accounts, there would be thousands of event fires a day. We could try to batch and aggregate for notification. We could also resize down, but i'm hesitant to do that unless it's a structural issue with the launch config; an elb health check outage, for example, could be transient. sounds like we need a filter on the structural issue, with resize down and notify actions.
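One way the batch-and-aggregate idea could look, sketched in plain Python. It assumes each event is an Auto Scaling launch-failure event from CloudWatch Events carrying `AutoScalingGroupName` and `StatusMessage` under `detail`; the helper names `aggregate_failures` and `render_digest` are hypothetical, and how the digest is delivered is out of scope.

```python
from collections import defaultdict


def aggregate_failures(events):
    """Group launch-failure events by ASG so one digest replaces many alerts."""
    summary = defaultdict(lambda: {"count": 0, "reasons": set()})
    for event in events:
        detail = event.get("detail", {})
        name = detail.get("AutoScalingGroupName")
        if not name:
            continue  # not an ASG event we can attribute
        entry = summary[name]
        entry["count"] += 1
        if detail.get("StatusMessage"):
            entry["reasons"].add(detail["StatusMessage"])
    return dict(summary)


def render_digest(summary):
    """Render one line per ASG, noisiest group first (ties broken by name)."""
    ordered = sorted(summary, key=lambda n: (-summary[n]["count"], n))
    return "\n".join(
        "%s: %d failed launches (%s)" % (
            name,
            summary[name]["count"],
            "; ".join(sorted(summary[name]["reasons"])) or "no reason given",
        )
        for name in ordered
    )
```

Thousands of event fires a day would then collapse to one line per spinning group per digest window.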

jeffastorey · 2016-06-06T01:08:21Z

to add some specifics of things I think would be useful...

ASGs usually spin for a few different reasons. I'm sure there are others, but these come to mind:

- invalid configs - invalid ami, subnet ids, etc
- continually failing health checks
- spinning due to instances being killed by other custodian rules
- no space left for launching instances

It would be nice to be able to split them out to perform different actions based on the category of spin. For example, in the case of invalid configs, we can suspend and/or delete. In the case of continually failing health checks, I'd rather just notify since we've had cases where spinning was due to a network change, and we may not want to stop ASGs in that case.

For the spinning due to instances being killed by other rules, I think we're solving that by putting those rules on the asg configs. There may be some outliers, but that seems like less of an issue right now.
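The per-category split above could be sketched as a simple lookup. The category keys and action names here are invented for illustration and are not custodian identifiers. Only two rows come from the discussion (suspend for invalid configs, notify-only for failing health checks); the other two rows and the fallback are assumptions.

```python
# Hypothetical category and action names, for illustration only.
ACTIONS_BY_CATEGORY = {
    "invalid-config": ["suspend", "notify"],   # from the thread: suspend and/or delete
    "failing-health-check": ["notify"],        # from the thread: notify only, may be transient
    "killed-by-policy": ["notify"],            # assumption: no action was specified
    "no-capacity": ["notify"],                 # assumption: no action was specified
}


def actions_for(category):
    """Return the actions for a spin category, defaulting to notify-only."""
    # Unknown categories get the least destructive response.
    return list(ACTIONS_BY_CATEGORY.get(category, ["notify"]))
```

Defaulting to notify keeps anything unclassified from being suspended by accident.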

kapilt · 2016-06-17T20:24:13Z

addressed in #220: basically a filter for asgs to detect invalid configs or missing elbs
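For readers wondering what such a check might involve, here is a minimal sketch in plain Python. It is not the code from #220. It assumes the ASG and launch configuration are dicts shaped like the Auto Scaling describe responses (`LaunchConfigurationName`, `VPCZoneIdentifier` as a comma-separated string, `LoadBalancerNames`, `ImageId`), and that the sets of currently valid AMIs, subnets and ELBs were collected beforehand. The function name `invalid_references` is hypothetical.

```python
def invalid_references(asg, launch_config, valid_amis, valid_subnets, valid_elbs):
    """Return (kind, identifier) pairs for references that no longer resolve."""
    problems = []
    if launch_config is None:
        # The ASG points at a launch configuration that could not be found.
        problems.append(("launch-config", asg.get("LaunchConfigurationName")))
    elif launch_config.get("ImageId") not in valid_amis:
        problems.append(("ami", launch_config.get("ImageId")))
    # VPCZoneIdentifier is a comma-separated list of subnet ids.
    subnet_ids = [
        s.strip() for s in (asg.get("VPCZoneIdentifier") or "").split(",") if s.strip()
    ]
    for subnet_id in subnet_ids:
        if subnet_id not in valid_subnets:
            problems.append(("subnet", subnet_id))
    for elb_name in asg.get("LoadBalancerNames", []):
        if elb_name not in valid_elbs:
            problems.append(("elb", elb_name))
    return problems
```

An empty result means the group's references all resolve, so any spinning is more likely transient (health checks, capacity) than structural.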

kapilt added this to the 2016-05-28 milestone May 22, 2016

kapilt modified the milestones: 2016-06-11, 2016-05-28 May 31, 2016

kapilt modified the milestones: 2016-06-25, 2016-06-11 Jun 6, 2016

kapilt added resource/asg priority/P2 scope/medium labels Jun 10, 2016

kapilt closed this as completed Jun 17, 2016
