asg spin doctor #88

Closed
kapilt opened this issue May 8, 2016 · 4 comments

Comments

@kapilt
Collaborator

kapilt commented May 8, 2016

Possible actions: resize, suspend, email, etc.

More specifically, several accounts have autoscale groups that are spinning, i.e. repeatedly trying and failing to launch an instance due to an invalid AMI, subnets, ELBs, etc.
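
One rough way to spot a spinning group is to look at its scaling activity history. The sketch below assumes a list of activity records shaped like the `Activities` entries returned by boto3's `describe_scaling_activities` (`AutoScalingGroupName`, `StatusCode`, `Description`). The `find_spinning_groups` name, the `min_failures` threshold, and the rule "failed launches and no successful ones" are illustrative assumptions, not something custodian does.

```python
from collections import Counter


def find_spinning_groups(activities, min_failures=3):
    """Return sorted names of groups whose launches keep failing.

    Assumption: a group counts as spinning when it has at least
    `min_failures` failed launch activities and no successful launch
    in the same window of history.
    """
    failed = Counter()
    succeeded = set()
    for activity in activities:
        # Only launch attempts matter here; terminations are ignored.
        if not activity.get("Description", "").startswith("Launching"):
            continue
        name = activity["AutoScalingGroupName"]
        if activity["StatusCode"] == "Failed":
            failed[name] += 1
        elif activity["StatusCode"] == "Successful":
            succeeded.add(name)
    return sorted(
        name
        for name, count in failed.items()
        if count >= min_failures and name not in succeeded
    )
```

In practice the activity list would be fetched per account and region, and the threshold tuned to how noisy the account is.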

@jeffastorey
Contributor

jeffastorey commented May 13, 2016

This can also occur when instances with non-complying tags are terminated. In our policy, for example, we terminate instances with non-complying tags once an hour. If the missing tag is the one identifying the instance's owner, we have no way to email the owner, and the ASG keeps spinning. It would be nice to be able to trace an hourly instance scan back to the creator of the ASG, but at the very least we should suspend the ASGs in these cases as well.

Not just applicable to this issue, but maybe custodian could hook into CloudTrail on all instance creation events and auto-tag the instances with who launched them? That person could then always be included in event notifications.
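
As a sketch of the auto-tagging idea, the function below pulls the launching identity out of a CloudTrail `RunInstances` event as delivered through CloudWatch Events (the record sits under `detail`, with `userIdentity` and `responseElements.instancesSet.items`). The `creator_tags` name and the `CreatedBy` tag key are hypothetical, and actually applying the tags is left out.

```python
def creator_tags(event, tag_key="CreatedBy"):
    """Map each launched instance id to a tag naming who launched it.

    Returns an empty dict for anything that is not a successful
    RunInstances call with an identifiable caller.
    """
    detail = event.get("detail", {})
    if detail.get("eventName") != "RunInstances":
        return {}
    creator = detail.get("userIdentity", {}).get("arn")
    if not creator:
        return {}
    # responseElements is null when the API call itself failed.
    response = detail.get("responseElements") or {}
    items = response.get("instancesSet", {}).get("items", [])
    return {item["instanceId"]: {tag_key: creator} for item in items}
```

Instances launched by an ASG are created by the autoscaling service, so for those the group's own creation event would be the more useful thing to trace.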

@kapilt kapilt added this to the 2016-05-28 milestone May 22, 2016
@kapilt
Collaborator Author

kapilt commented May 22, 2016

I added better support for ASG CWE rules, including state notifications, but I'm still a little unclear on what we should do as an action when we detect these. In some of the larger accounts, there would be thousands of event fires a day. We could try to batch and aggregate for notification. We could also resize down, but I'm hesitant to do that unless it's a structural issue with the launch config, since something like an ELB health check outage could be transient. Sounds like we need a filter on the structural issue, with resize-down and notify actions.
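
To illustrate the batch-and-aggregate idea, here is a minimal sketch that collapses a stream of failure events into one summary entry per group, so a notification can go out once per batch instead of once per event. The event shape (`group` and `cause` keys) and the `summarize_failures` name are assumptions for the example only.

```python
from collections import defaultdict


def summarize_failures(events):
    """Collapse many failure events into one summary per group."""
    summary = defaultdict(lambda: {"count": 0, "causes": set()})
    for event in events:
        entry = summary[event["group"]]
        entry["count"] += 1
        entry["causes"].add(event["cause"])
    # Sort groups and causes so the notification body is stable.
    return {
        group: {"count": data["count"], "causes": sorted(data["causes"])}
        for group, data in sorted(summary.items())
    }
```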

@kapilt kapilt modified the milestones: 2016-06-11, 2016-05-28 May 31, 2016
@jeffastorey
Contributor

to add some specifics of things I think would be useful...

ASGs usually spin for a few different reasons. I'm sure there are others, but these come to mind:

  • invalid configs - invalid ami, subnet ids, etc
  • continually failing health checks
  • spinning due to instances being killed by other custodian rules
  • no space left for launching instances

It would be nice to be able to split them out to perform different actions based on the category of spin. For example, in the case of invalid configs, we can suspend and/or delete. In the case of continually failing health checks, I'd rather just notify since we've had cases where spinning was due to a network change, and we may not want to stop ASGs in that case.

For the spinning due to instances being killed by other rules, I think we're solving that by putting those rules on the asg configs. There may be some outliers, but that seems like less of an issue right now.
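
The category-to-action split described above could be sketched as a simple lookup. The `SpinCause` enum and the action names are hypothetical. Only the first two mappings come from the comment itself; defaulting the remaining categories to notify-only is an assumption.

```python
from enum import Enum


class SpinCause(Enum):
    INVALID_CONFIG = "invalid-config"
    FAILING_HEALTH_CHECKS = "failing-health-checks"
    KILLED_BY_POLICY = "killed-by-policy"
    NO_CAPACITY = "no-capacity"


# Invalid configs are structural, so suspending is safe. Failing health
# checks may be transient (e.g. a network change), so only notify.
ACTIONS = {
    SpinCause.INVALID_CONFIG: ("suspend", "notify"),
    SpinCause.FAILING_HEALTH_CHECKS: ("notify",),
}


def actions_for(cause):
    """Return the actions for a spin cause, notify-only by default."""
    return ACTIONS.get(cause, ("notify",))
```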

@kapilt
Collaborator Author

kapilt commented Jun 17, 2016

Addressed in #220: basically a filter for ASGs to detect invalid configs or missing ELBs.
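
For readers who want a feel for what such a check involves, here is a rough sketch. It is not the code from #220. It assumes an ASG record and a launch configuration record shaped like boto3's describe output (`VPCZoneIdentifier` as a comma-separated string, `LoadBalancerNames`, `ImageId`), plus sets of ids known to exist in the account. The `invalid_references` name and the problem labels are hypothetical.

```python
def invalid_references(asg, launch_config, known_amis, known_subnets, known_elbs):
    """List (problem, id) pairs for resources the group points at
    that no longer exist in the account."""
    problems = []
    image_id = launch_config.get("ImageId")
    if image_id not in known_amis:
        problems.append(("invalid-ami", image_id))
    # VPCZoneIdentifier holds subnet ids as one comma-separated string.
    subnets = [s.strip() for s in asg.get("VPCZoneIdentifier", "").split(",")]
    for subnet in subnets:
        if subnet and subnet not in known_subnets:
            problems.append(("invalid-subnet", subnet))
    for elb in asg.get("LoadBalancerNames", []):
        if elb not in known_elbs:
            problems.append(("missing-elb", elb))
    return problems
```

An empty result means every referenced resource was found, which says nothing about transient causes such as failing health checks.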

@kapilt kapilt closed this as completed Jun 17, 2016