aws-ha-release - servers are considered healthy before they actually are #32

Open
anujbiyani opened this issue May 29, 2013 · 10 comments

@anujbiyani
Contributor

This is actually a bug with AWS but I wanted to open an issue here so that anyone using aws-ha-release.sh might see this.

It seems that when an instance is being spun up, the ELB health checks first fail, then pass (a false positive; they can't possibly pass because Passenger (in my case) is still starting up), then fail, and then pass (a true positive; web requests are actually being processed).

aws-ha-release sees the first pass, concludes the instance is healthy, and moves on before the instance actually is.

I'm checking to see if there's already a bug report with AWS; if not, I'll file one.

We could add some sort of tolerance to aws-ha-release where it requires an instance to be in service for some amount of time before moving forward.
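
Roughly what I have in mind, as Ruby-ish pseudocode (just a sketch; all_instances_inservice? stands in for whatever health-check call we end up using, and the 30/10 second values are arbitrary):

def wait_until_stable(min_inservice_time = 30, poll_interval = 10)
  time_inservice = 0
  while time_inservice < min_inservice_time
    sleep poll_interval
    if all_instances_inservice?
      time_inservice += poll_interval  # healthy poll: credit the elapsed interval
    else
      time_inservice = 0               # any blip resets the clock
    end
  end
end

That way a single flapping InService response wouldn't be enough to start terminating old instances.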

@dfevre
Contributor

dfevre commented Jun 2, 2013

I've noticed this too. elb-describe-instance-health returns the status of the EC2 instance at first instead of the ELB status for that instance.

@anujbiyani
Contributor Author

@dfevre It'd be great if you could add your comments to an Amazon forum post I created here: https://forums.aws.amazon.com/thread.jspa?threadID=125859&tstart=50

In the meantime I'm working on a fix that basically requires an instance to be InService for some amount of time before considering it healthy. I started working on it in bash but then found it too annoying to implement, so I'm rewriting the whole script in Ruby and making aws-missing-tools a gem.

Still in progress, but I'm close to done: https://github.com/Lytro/aws-missing-tools/tree/aws_ha_release_ruby

@dfevre
Contributor

dfevre commented Jun 3, 2013

Done. I'm running Windows instances, so I'm thinking of testing with PowerShell. I'd say this is a problem with the API, though, so it's probably not worth rewriting anything.

@colinbjohnson
Collaborator

@anujbiyani - just looked at https://github.com/Lytro/aws-missing-tools/tree/aws_ha_release_ruby - that is awesome.

@anujbiyani
Contributor Author

@dfevre Thanks for commenting; hopefully they'll give it more attention now. Normally I wouldn't write code to work around a bug in a third party's API, but in this case manually cycling servers is annoying my team, so I figured I'd work on a solution anyway in case either 1) the API problem is actually something I screwed up, or 2) Amazon doesn't fix it. Plus I've been wanting to rewrite some of this stuff in Ruby ever since I had to brush up on my Bash when I first worked on aws-ha-release.

@colinbjohnson Thanks! Hopefully I'll have it done within a day or two and have a pull request open.

@anujbiyani
Contributor Author

I addressed this issue in #34

@dfevre
Contributor

dfevre commented Jun 5, 2013

I just reproduced this problem with PowerShell. Definitely an API issue. I had a thought that might fix it: an ELB requires multiple health checks to pass before it considers an instance to be in service, and I think aws-ha-release should do the same. After seeing the instance as InService, it should poll again 10 seconds later and require 2 consecutive healthy responses before terminating the other instance. I might have a look at implementing this soon.

@anujbiyani
Contributor Author

@dfevre I've implemented pretty much what you suggested in a Ruby version of aws-ha-release in the pull request linked just above your post. Basically I require instances to be InService for some time period before terminating an old instance.

https://github.com/colinbjohnson/aws-missing-tools/pull/34/files#L3R25
-m, --min-inservice-time TIME - Minimum time an instance must be in service before it is considered healthy (seconds). Defaults to 30

https://github.com/colinbjohnson/aws-missing-tools/pull/34/files#L9R101

def all_instances_inservice_for_time_period?(load_balancers, change_in_time)
  if all_instances_inservice?(load_balancers)
    if @time_spent_inservice >= @opts[:min_inservice_time]
      return true
    else
      puts "\nAll instances have been InService for #{@time_spent_inservice} seconds."

      @time_spent_inservice += change_in_time
      return false
    end
  else
    @time_spent_inservice = 0
    return false
  end
end
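
With the defaults that means every instance has to stay InService for a full 30 seconds, and the counter resets to zero the moment any instance drops out, so the window is continuous rather than cumulative. Usage will look roughly like this (the exact executable name and flags other than -m/--min-inservice-time may still change, so check the PR for the final interface):

aws-ha-release.rb -a my-as-group --min-inservice-time 60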

@dfevre
Contributor

dfevre commented Jul 3, 2013

I just tried to reproduce this problem with the bash script numerous times on our test and production environments, and it seems that AWS has fixed the bug. The bash script now works well.

@anujbiyani
Contributor Author

I'm still able to reproduce the AWS bug D: @dfevre, have you changed any settings on your end that might have helped, or did things just randomly start working?
