Currently the templates use the `/_status` endpoint for the ELB health check against the Chef Server instances. The `/_status` endpoint returns an HTTP `500` if the Automate DataCollectorURL is not reachable from the Chef Server instance, even if the rest of Chef Server is totally healthy. This causes the ELB to treat the node as unhealthy and take it out of rotation.
This means that if Automate is offline or unreachable for any reason, all of Chef Server will be taken offline, and the non-bootstrap autoscaling group will start continually tearing down and rebuilding Chef Server instances.
It seems like the Chef Server health check should use an endpoint that only reports an error code if Chef Server itself is unhealthy. I'm not sure whether Chef Server has any other quick-loading endpoints like `/_status`, but as a preliminary fix it would probably work to use the `/login` endpoint.
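To compare the two candidate endpoints from a host that can reach a Chef Server frontend, something like the following rough sketch could help. It is not taken from the templates: the function names and the base URL are made up for illustration, and it assumes the health check passes only on an HTTP `200`.

```python
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code of a GET request, or None if the request fails.

    Note: a certificate that fails verification (e.g. self-signed) also
    ends up as None here, because urlopen raises URLError in that case.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError; the code is still useful.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def elb_would_pass(status):
    """Assumption: only a 200 response counts as healthy."""
    return status == 200


def compare_endpoints(base_url, fetch=http_status):
    """Map each candidate health check path to whether it would pass."""
    return {
        path: elb_would_pass(fetch(base_url + path))
        for path in ("/_status", "/login")
    }


if __name__ == "__main__":
    # Hypothetical hostname; replace with a real Chef Server frontend.
    print(compare_endpoints("https://chef.example.com"))
```

Running this once with Automate reachable and once with it unreachable should show whether `/login` stays healthy in the case where `/_status` starts returning `500`.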
Thoughts?
jeffj254 changed the title from "Using _status page be used for Chef Server health check takes Chef offline if Automate is not available" to "Using _status page for Chef Server health check takes Chef Server offline if Automate is not available" on Oct 6, 2019
@jeffj254 we mitigated this issue somewhat in a previous release with better alarming and cycling of the chef server frontend pool.
However, it's a tough position: many customers see the reporting from Chef Server -> Automate as a critical audit control and want the system to "fail closed".
We've made this configurable recently in Chef Server, and that shipped in 13.0.47. However, we haven't integrated those changes back here because versions later than 12.18.14 have a nasty LDAP SSL bug that the developers are still working on.