AWS EC2 instance never comes up - ASG reports 1 instance, but stuck in pending #572

Closed
dhalperi opened this issue Apr 30, 2019 · 8 comments · Fixed by #638

Comments

@dhalperi

This sure looks like an AWS bug, not a BK bug.

We have a new autoscaler-based deployment in us-west-1. Our ASG scales from 0 instances up to 15. Usually things work great; in the last few days we've had VMs that don't come up for 20+ minutes.

When there's only 1 step queued for an instance, this can be frustrating - that blocked step hangs seemingly "forever". BK does not try to scale up, and EC2 does not actually have the instance.

The group is big enough that new VMs are spun up in both AZs multiple times per day, which rules out basic config errors like one subnet not being networked correctly. It really seems like an AWS issue.

@dhalperi
Author

[Image: Screen Shot 2019-04-29 at 10 05 51 PM]

@dhalperi
Author

ASG: ~20 minutes after the attempt to upscale.

[Image: Screen Shot 2019-04-29 at 9 49 59 PM]

@dhalperi
Author

EC2 for that instance:

[Image: Screen Shot 2019-04-29 at 9 49 18 PM]

@dhalperi
Author

Manually terminating the instance works fine as a way to unblock.

@dhalperi changed the title from "AWS EC2 instance never comes up - ASG reports 1 instance, but stick in pending" to "AWS EC2 instance never comes up - ASG reports 1 instance, but stuck in pending" on May 1, 2019
@lox
Contributor

lox commented May 6, 2019

Thanks for the details @dhalperi, seems like a really frustrating issue. If it keeps happening we'll add a check for it in the scaling lambda.
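A check of the kind mentioned above could be sketched roughly as follows. This is an illustration only and is not taken from the scaling lambda: the function name, the 10-minute threshold, and the input shape are assumptions. The input mirrors the per-instance entries of an EC2 DescribeInstances response (`InstanceId`, `State.Name`, `LaunchTime`), passed in as plain dicts so the logic can be read without any AWS calls.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold. The thread reports instances stuck for 20+ minutes,
# so anything well beyond a normal boot time would do.
PENDING_TIMEOUT = timedelta(minutes=10)


def find_stuck_pending(instances, now, timeout=PENDING_TIMEOUT):
    """Return the IDs of instances that have been 'pending' longer than timeout.

    Each item is assumed to look like an instance entry from DescribeInstances:
    {"InstanceId": str, "State": {"Name": str}, "LaunchTime": datetime}.
    """
    stuck = []
    for inst in instances:
        # Only instances still in the 'pending' state are candidates.
        if inst["State"]["Name"] != "pending":
            continue
        # Flag the instance if it was launched more than `timeout` ago.
        if now - inst["LaunchTime"] > timeout:
            stuck.append(inst["InstanceId"])
    return stuck
```

The returned IDs could then be terminated, which would automate the manual workaround described earlier in the thread. Whether that is safe to do automatically depends on how long a healthy instance normally stays in pending.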

@dbaggerman
Contributor

This sounds like it might be related to something we've been seeing.

We traced the issue back to cloud-init failing to bootstrap the instance, seemingly because of problems talking to the EC2 metadata endpoint.

The bootstrap failure doesn't appear to result in the instances getting flagged as unhealthy. On the other hand, it means the buildkite-agent service never gets started. Since the agent isn't started, the stop timeout and ExecStopPost shutdown trigger never occur. All this results in the problem described above where the instance is neither killed by the ASG, nor able to run BK jobs.

I'll raise a PR with a change that seems to have more or less solved this problem for us. It just adds an OnFailure to the cloud-final systemd unit to power off the instance. It doesn't prevent the failures from occurring, but it allows the ASG to clean up and replace the instances when the problem occurs.
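For illustration, a change of this kind could look like the systemd drop-in below. This is only a sketch and not the contents of the PR: the drop-in path and file name are assumptions, and it relies on the ASG treating an instance that is no longer running as unhealthy.

```ini
# Hypothetical location:
# /etc/systemd/system/cloud-final.service.d/10-poweroff-on-failure.conf
[Unit]
# If cloud-final.service enters the failed state, systemd starts
# poweroff.target, which shuts the instance down.
OnFailure=poweroff.target
```

A drop-in extends the existing unit without replacing the file that cloud-init ships, so the rest of the cloud-final definition stays untouched.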

@lox
Contributor

lox commented Oct 1, 2019

@dbaggerman in this instance the status was stuck at pending despite the failure of the instance. Is that what you are seeing?

@dbaggerman
Contributor

@lox, yes, that matches what we saw. The Instance State was pending with the Status Checks as Initializing, as can be seen in the screenshot above.
