"Degraded" health check #5063
@snowp this request has come up in different forms multiple times. The "devil is in the details" on this one, as it can get quite complicated, scary, and error-prone. If you feel like doing a design proposal, that would be great.
I have a few high-level ideas on how to accomplish this:

The first point is how we're implementing this at the control-plane level, and it's been working decently well. The fact that it doesn't complicate the routing logic by introducing new concepts in Envoy is a nice property, but I'm not sure whether frequent priority moves will be problematic. Happy to consider other options too; these are just some approaches.
Here's a more concrete suggestion based on the first high-level approach in the previous comment: add a …. With this, we can update ….

In order to maintain locality routing, a separate locality weights value can be kept that adjusts the weights based on (degraded vs. total hosts) in each locality. This would be used in ….

When in panic mode (be it per priority or global), degraded values should be ignored (route to all hosts) and degraded hosts should not contribute to the panic mode threshold. That is, if a priority is fully degraded, it should not be in panic mode. This is because each host is still routable, so triggering panic mode isn't really useful.
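A minimal sketch of the locality-weight scaling proposed here, assuming a simple per-locality host count; the struct and function names are hypothetical, not Envoy's actual API:

```cpp
#include <cstdint>

// Hypothetical per-locality bookkeeping for the (degraded vs. total hosts)
// weight adjustment described above.
struct LocalityStats {
  uint32_t configured_weight;
  uint32_t degraded_hosts;
  uint32_t total_hosts;
};

// Scale the configured locality weight by the fraction of hosts in the
// locality that are degraded, so localities with more degraded capacity
// receive proportionally more of the degraded traffic.
uint32_t degradedLocalityWeight(const LocalityStats& locality) {
  if (locality.total_hosts == 0) {
    return 0;
  }
  return locality.configured_weight * locality.degraded_hosts /
         locality.total_hosts;
}
```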
This implements load balancing to degraded hosts by treating them as "healthy" hosts in a lower priority than the healthy hosts. Degraded hosts in P(N) are treated as healthy hosts in P(N+M), where M is the total number of priorities. This ensures that degraded hosts are only routed to when priority spillover, due to a lack of healthy hosts, causes a degraded priority to be selected.

Locality weights for degraded localities are tracked similarly to how locality weights are tracked for healthy hosts, i.e. scaled by the number of eligible hosts over the total number of hosts in each locality. This ensures that when degraded hosts are selected, load balancing between localities will behave consistently with the behavior for healthy hosts.

Risk Level: Medium/High, touches core load balancing code quite a bit, but mostly refactors
Testing: UTs
Docs Changes: inline
Release Notes: n/a
Part of #5063

Signed-off-by: Snow Pettersen <snowp@squareup.com>
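A sketch of the priority-shift scheme this describes; the function name is hypothetical, not Envoy's actual API:

```cpp
#include <cstddef>

// With M configured priorities, healthy hosts in priority P are load balanced
// at P, while degraded hosts in priority P are treated as healthy hosts at
// P + M. Spillover therefore exhausts every healthy priority before reaching
// any degraded one.
size_t effectiveLoadBalancingPriority(size_t priority, bool degraded,
                                      size_t num_priorities) {
  return degraded ? priority + num_priorities : priority;
}
```

For example, with M = 2 configured priorities, healthy hosts land at effective priorities 0 and 1 while degraded hosts land at 2 and 3, so a degraded host is only selected once both healthy priorities are exhausted.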
As part of this I'll have to update the health check filter to cache the response headers, or at least the degraded header. Currently the caching layer is causing degraded health checks to flap, making them incompatible.
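A rough sketch of what the cached entry would need to carry, assuming a simple TTL-based cache; the struct is illustrative, not the actual filter code:

```cpp
#include <chrono>

// The cached result must retain the degraded bit alongside pass/fail;
// otherwise cached responses omit the degraded header and the host's
// reported state flaps between degraded and healthy.
struct CachedHealthCheckResult {
  bool healthy;
  bool degraded; // cached from the degraded response header
  std::chrono::steady_clock::time_point valid_until;
};
```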
Updates the way panic mode is calculated to treat degraded hosts as available. This ensures that panic mode is entered when there are insufficient available hosts. Also updates the panic mode documentation to use more generic language.

Risk Level: Medium
Testing: UTs for mixed healthy/degraded hosts
Docs Changes: Updated panic mode docs
Release Notes: n/a
#5063

Signed-off-by: Snow Pettersen <snowp@squareup.com>
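A minimal sketch of the availability calculation this describes, assuming a single panic threshold (Envoy's default is 50%); the function is illustrative, not the actual implementation:

```cpp
#include <cstdint>

// Degraded hosts count toward availability, so a priority whose hosts are
// all degraded is still fully available and does not trigger panic mode.
bool inPanicMode(uint32_t healthy_hosts, uint32_t degraded_hosts,
                 uint32_t total_hosts, double panic_threshold) {
  if (total_hosts == 0) {
    return false;
  }
  const double availability =
      static_cast<double>(healthy_hosts + degraded_hosts) / total_hosts;
  return availability < panic_threshold;
}
```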
Adds a DEGRADED HealthStatus value that can be set on a host through LoadAssignment, allowing a host to be marked degraded without the need for active health checking.

Moves the mapping of the EDS flag to the health flag inside `registerHostForPriority`, which means that we're now consistently setting the EDS health flag for EDS/STATIC/STRICT_DNS/LOGICAL_DNS. Simplifies the check for whether the health flag value of a host has changed during EDS updates.

Adds tests for the EDS mapping as well as tests to verify that we're honoring the EDS flag for non-EDS cluster types.

Risk Level: High, substantial refactoring of how we determine whether the health flag has changed.
Testing: UT coverage for new health flag values.
Docs Changes: n/a
Release Notes: n/a
Fixes #5637 #5063

Signed-off-by: Snow Pettersen <snowp@squareup.com>
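A hedged sketch of an EDS-status-to-health-flag mapping along these lines. The HealthStatus values follow the ones defined in envoy.api.v2.core.HealthStatus; the flag names and the mapping function are illustrative, not Envoy's actual code:

```cpp
#include <optional>

enum class HealthStatus { UNKNOWN, HEALTHY, UNHEALTHY, DRAINING, TIMEOUT, DEGRADED };
enum class HealthFlag { FAILED_EDS_HEALTH, DEGRADED_EDS_HEALTH };

std::optional<HealthFlag> edsHealthFlag(HealthStatus status) {
  switch (status) {
  case HealthStatus::UNHEALTHY:
  case HealthStatus::DRAINING:
  case HealthStatus::TIMEOUT:
    return HealthFlag::FAILED_EDS_HEALTH; // host should receive no traffic
  case HealthStatus::DEGRADED:
    return HealthFlag::DEGRADED_EDS_HEALTH; // routable, but deprioritized
  default:
    return std::nullopt; // UNKNOWN/HEALTHY: no flag, host treated as healthy
  }
}
```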
This is now fully implemented.
Right now Envoy health checks are a binary yes/no that determines whether a host should receive traffic. A possible extension would be to provide additional information in the HTTP health check response that allows the downstream to reprioritize the host.

For instance, you can imagine an upstream responding with `x-envoy-health-degraded` when the host is able to serve most traffic but, say, a Redis shard is down. This would make the downstream prefer other hosts that aren't declaring themselves degraded, presumably because they have a healthy Redis shard.

Floating this idea since it's something we have in our legacy RPC system and internal users seem to like it.
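A minimal sketch of how a downstream might interpret such a response, assuming plain string headers; the three-state enum and the function are hypothetical, while the header name is the one suggested above:

```cpp
#include <map>
#include <string>

enum class HostHealth { Unhealthy, Degraded, Healthy };

HostHealth interpretHealthCheckResponse(
    int status_code, const std::map<std::string, std::string>& headers) {
  if (status_code < 200 || status_code >= 300) {
    return HostHealth::Unhealthy; // failed health check: remove from rotation
  }
  if (headers.count("x-envoy-health-degraded") != 0) {
    return HostHealth::Degraded; // routable, but prefer non-degraded hosts
  }
  return HostHealth::Healthy;
}
```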