Using _status page for Chef Server health check takes Chef Server offline if Automate is not available #62

jeffj254 · 2019-10-06T23:30:52Z

Currently the templates use the /_status endpoint for the ELB health check against the Chef Server instances. The /_status endpoint returns an HTTP 500 if the Automate DataCollectorURL is not reachable from the Chef Server instance, even if the rest of Chef Server is totally healthy. The ELB then treats the node as unhealthy and takes it out of rotation.

This means that if Automate is offline or unreachable for any reason, all of Chef Server will be taken offline, and the non-bootstrap autoscaling group will start continually tearing down and rebuilding Chef Server instances.

It seems like the Chef Server health check should use an endpoint that only reports an error code if Chef Server itself is unhealthy. I'm not sure if there are any other quick-loading endpoints like /_status on Chef Server, but as a preliminary fix it would probably work to use the /login endpoint.
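To make the idea concrete, here is a rough sketch of the decision I have in mind: treat the node as healthy as long as everything except the Automate data collector reports OK. It assumes the /_status body is JSON with an overall `status` field and a per-dependency `upstreams` map whose values are `"pong"` when healthy; that shape, the `data_collector` key, and the function name are assumptions for illustration, not something the templates define.

```python
from typing import Any, Mapping

# Assumption: upstream names whose failure should NOT fail the health check.
IGNORED_UPSTREAMS = frozenset({"data_collector"})


def chef_server_is_healthy(
    status_body: Mapping[str, Any],
    ignored: frozenset = IGNORED_UPSTREAMS,
) -> bool:
    """Decide health from a parsed /_status body (assumed shape, see above)."""
    upstreams = status_body.get("upstreams")
    if not isinstance(upstreams, Mapping) or not upstreams:
        # No per-upstream detail available: fall back to the overall status.
        return status_body.get("status") == "pong"
    # Healthy if every upstream we care about answered "pong".
    return all(
        state == "pong"
        for name, state in upstreams.items()
        if name not in ignored
    )
```

An ELB health check only looks at the HTTP status code, so logic like this would have to live in something that sits in front of /_status rather than in the ELB configuration itself.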

Thoughts?

irvingpop · 2019-10-07T23:02:43Z

@jeffj254 we mitigated this issue somewhat in a previous release with better alarming and cycling of the Chef Server frontend pool.

However, it's a tough position - many customers see the reporting from Chef Server -> Automate as a critical audit control and want the system to "fail closed".

We've made this configurable recently in Chef Server, and that shipped in 13.0.47. However, we haven't integrated those changes back here because versions after 12.18.14 have a nasty LDAP SSL bug that the developers are still working on.

jeffj254 · 2019-10-11T21:13:37Z

Makes sense. Given that this is addressed in newer versions, I don't think there's any reason to make a change to the templates. Closing this.

jeffj254 changed the title ~~Using _status page be used for Chef Server health check takes Chef offline if Automate is not available~~ Using _status page for Chef Server health check takes Chef Server offline if Automate is not available Oct 6, 2019

jeffj254 closed this as completed Oct 11, 2019
