Ignoring the capacity of errored servers #33

Closed

andy-trimble (Contributor) commented May 8, 2019

In the case where a server cannot be created (due to an AWS issue, for example), the non-existent server's capacity is still used to calculate total capacity. This seems like undesirable behavior.

bradrydzewski (Member) commented May 8, 2019

Hi Andy, this is actually by design. In an error state the server may still exist in AWS, and we do not want Drone to keep adding servers in a loop and running up bills.

Hypothetical example at #16 (comment)
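
To make the behavior under discussion concrete, here is a minimal sketch of a capacity calculation that counts errored servers. It is not the autoscaler's actual code: the `Server` type, the state names, and the `totalCapacity` and `serversToAdd` functions are all hypothetical, and comparing capacity against pending builds is an assumption made for illustration.

```go
package main

// ServerState is a hypothetical lifecycle state; the names are illustrative only.
type ServerState string

const (
	StateRunning ServerState = "running"
	StateErrored ServerState = "error"
)

// Server is a hypothetical record of an instance the autoscaler tried to create.
type Server struct {
	Name     string
	State    ServerState
	Capacity int // number of builds the server can run concurrently
}

// totalCapacity counts every tracked server, including errored ones, because
// an errored server may still exist (and be billed) at the provider.
func totalCapacity(servers []Server) int {
	total := 0
	for _, s := range servers {
		total += s.Capacity
	}
	return total
}

// serversToAdd returns how many new servers would be needed to cover the
// pending builds, given the capacity of a single new server.
func serversToAdd(servers []Server, pendingBuilds, perServer int) int {
	missing := pendingBuilds - totalCapacity(servers)
	if missing <= 0 || perServer <= 0 {
		return 0
	}
	return (missing + perServer - 1) / perServer // round up
}
```

With one running and one errored server of capacity 2 each, `serversToAdd` returns 0 for four pending builds. That is the situation described above: the errored server still occupies capacity, so nothing new is created.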

andy-trimble (Contributor, Author)

Ah, that makes sense. I ran into an issue where the EC2 instance could not be created (due to my misconfiguring the security group), but the autoscaler would not allocate additional resources since it believed the capacity had been filled. I wonder if there is a way to detect when the instance has not been created and ignore its capacity in that case.

tboerger (Contributor) commented May 8, 2019

In some cases I wish the autoscaler would exclude errored nodes. For at least some providers it would be possible to check whether a server really exists.
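
A rough sketch of that idea, assuming a provider can report whether an instance exists. The `Provider` interface, its `Exists` method, and `effectiveCapacity` are hypothetical names, not part of the autoscaler. The sketch stays conservative: an errored server is skipped only when the provider confirms it is absent, and it is still counted if the lookup itself fails.

```go
package main

import "errors"

// Provider is a hypothetical interface; a real provider would query its cloud API.
type Provider interface {
	// Exists reports whether the instance is present at the provider.
	Exists(id string) (bool, error)
}

// Server is a hypothetical record of a tracked instance.
type Server struct {
	ID       string
	Errored  bool
	Capacity int
}

// effectiveCapacity skips an errored server only when the provider confirms
// that it does not exist. If the lookup fails, the server is still counted.
func effectiveCapacity(p Provider, servers []Server) int {
	total := 0
	for _, s := range servers {
		if s.Errored {
			exists, err := p.Exists(s.ID)
			if err == nil && !exists {
				continue // confirmed absent: safe to ignore its capacity
			}
		}
		total += s.Capacity
	}
	return total
}

// staticProvider is an in-memory stand-in used only for demonstration.
type staticProvider map[string]bool

// Exists returns the recorded answer, or an error for unknown IDs.
func (p staticProvider) Exists(id string) (bool, error) {
	exists, known := p[id]
	if !known {
		return false, errors.New("lookup failed for " + id)
	}
	return exists, nil
}
```

Counting the server whenever the lookup fails keeps the safeguard against repeatedly creating servers that may still be billed.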

bradrydzewski (Member) commented May 8, 2019

We have a process that tries to automatically remove errored instances:
https://github.com/drone/autoscaler/blob/master/engine/reaper.go

We should not ignore the error state; doing so feels very risky. Instead, we should continue to improve the reaper routine so that it tries to understand why an instance is errored and resolves the issue by repairing or removing the instance.
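
As a minimal sketch of the reaping approach described here (not the contents of `reaper.go`), the function below makes one pass over the tracked servers and stops tracking an errored server only after it has been destroyed successfully. The `Destroyer` interface, `reapOnce`, and the demonstration type `failingDestroyer` are hypothetical.

```go
package main

import "errors"

// Destroyer is a hypothetical interface for removing an instance at the provider.
type Destroyer interface {
	Destroy(id string) error
}

// Server is a hypothetical record of a tracked instance.
type Server struct {
	ID      string
	Errored bool
}

// reapOnce tries to destroy each errored server and returns the servers that
// are still tracked afterwards. A server whose destruction fails stays in the
// list, so its capacity keeps being counted until it is truly gone.
func reapOnce(d Destroyer, servers []Server) []Server {
	kept := make([]Server, 0, len(servers))
	for _, s := range servers {
		if s.Errored && d.Destroy(s.ID) == nil {
			continue // destroyed: stop tracking it, which frees its capacity
		}
		kept = append(kept, s)
	}
	return kept
}

// failingDestroyer is a demonstration stand-in: IDs set to true fail to destroy.
type failingDestroyer map[string]bool

// Destroy returns an error for IDs marked as failing and nil otherwise.
func (d failingDestroyer) Destroy(id string) error {
	if d[id] {
		return errors.New("destroy failed: " + id)
	}
	return nil
}
```

Unlike ignoring errored servers outright, this only releases capacity once the instance is known to be removed.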

tboerger (Contributor) commented May 8, 2019

> We have a process that tries to automatically remove errored instances:
> https://github.com/drone/autoscaler/blob/master/engine/reaper.go

Good to know that it's currently not working for Hetzner, got to fix that.
