Using _status page for Chef Server health check takes Chef Server offline if Automate is not available #62

Closed
jeffj254 opened this issue Oct 6, 2019 · 2 comments

Comments

@jeffj254
Contributor

jeffj254 commented Oct 6, 2019

Currently the templates use the /_status endpoint for the ELB health check against the Chef Server instances. The /_status endpoint returns an HTTP 500 if the Automate DataCollectorURL is not reachable from the Chef Server instance, even if the rest of Chef Server is totally healthy. The ELB then treats the node as unhealthy and takes it out of rotation.

This means that if Automate is offline or unreachable for any reason, all of Chef Server will be taken offline and the non-bootstrap autoscaling group will start continually tearing down and rebuilding Chef Server instances.

It seems like the Chef Server health check should use an endpoint that reports an error code only if Chef Server itself is unhealthy. I'm not sure if there are any other quick-loading endpoints like /_status on Chef Server, but as a preliminary fix it would probably work to use the /login endpoint.
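To make the idea concrete, here is a rough sketch of a health decision that ignores the Automate data collector unless the operator asks for "fail closed" behavior. It assumes the /_status body is JSON with an overall `status` field and a per-dependency `upstreams` map whose values are `"pong"` or `"fail"`, and that the Automate entry is named `data_collector`; neither the payload shape nor that name is confirmed in this thread. The function name and the `fail_closed` flag are hypothetical.

```python
import json

# Assumption: the Automate data collector appears under this key in "upstreams".
OPTIONAL_UPSTREAMS = frozenset({"data_collector"})


def chef_server_healthy(status_body: str, fail_closed: bool = False) -> bool:
    """Decide instance health from a /_status response body.

    Assumed payload shape (hypothetical, not confirmed here):
    {"status": "pong" | "fail", "upstreams": {"<name>": "pong" | "fail"}}
    """
    try:
        payload = json.loads(status_body)
    except ValueError:
        # Body is not valid JSON: treat the instance as unhealthy.
        return False
    if not isinstance(payload, dict):
        return False

    upstreams = payload.get("upstreams")
    if not isinstance(upstreams, dict) or not upstreams:
        # No per-upstream detail available: fall back to the overall status.
        return payload.get("status") == "pong"

    for name, state in upstreams.items():
        # Skip optional upstreams unless the caller wants to fail closed.
        if name in OPTIONAL_UPSTREAMS and not fail_closed:
            continue
        if state != "pong":
            return False
    return True
```

An ELB health check only looks at the HTTP status code, so logic like this would have to sit behind a small endpoint on each instance that the ELB polls instead of /_status; that wiring is not shown here.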

Thoughts?

@jeffj254 jeffj254 changed the title Using _status page be used for Chef Server health check takes Chef offline if Automate is not available Using _status page for Chef Server health check takes Chef Server offline if Automate is not available Oct 6, 2019
@irvingpop

@jeffj254 We mitigated this issue somewhat in a previous release with better alarming and cycling of the Chef Server frontend pool.

However, it's a tough position: many customers see the reporting from Chef Server -> Automate as a critical audit control and want the system to "fail closed".

We recently made this configurable in Chef Server, and that shipped in 13.0.47. However, we haven't integrated those changes back here because versions newer than 12.18.14 have a nasty LDAP SSL bug that the developers are still working on.

@jeffj254
Contributor Author

Makes sense. Given that this is addressed in newer versions, I don't think there's any reason to make a change to the templates. Closing this.
