
"Degraded" health check #5063

Closed
snowp opened this issue Nov 16, 2018 · 7 comments
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue
Milestone

Comments

@snowp
Contributor

snowp commented Nov 16, 2018

Right now Envoy health checks are binary yes/no that determine whether a host should receive traffic. A possible extension to this would be to provide additional information in the http health check response that allows the downstream to reprioritize the host.

For instance, you can imagine an upstream responding with x-envoy-health-degraded when the host is able to serve most traffic but say a Redis shard is down. This would make the downstream prefer other hosts that aren't declaring themselves degraded, presumably because they have a healthy Redis shard.
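As a concrete illustration, a degraded-but-serving host's health check response might look like this (the header name comes from this comment; the exact response shape is a sketch, not a documented Envoy format):

```
HTTP/1.1 200 OK
content-type: text/plain
x-envoy-health-degraded: true

OK
```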

Floating this idea since it's something we have in our legacy RPC system and internal users seem to like it.

@mattklein123 mattklein123 added the design proposal Needs design doc/proposal before implementation label Nov 16, 2018
@mattklein123
Member

@snowp this request has come up in different forms multiple times. The "devil is in the details" on this one as it can get quite complicated, scary, and error prone. If you feel like doing a design proposal that would be great.

@snowp
Contributor Author

snowp commented Nov 16, 2018

I have a few high level ideas on how to accomplish this:

  • Reuse priorities: allow health checks to move hosts between priorities. You would shift a degraded host's priority by N, where N is the number of priorities, so P0 becomes PN and P1 becomes P(N+1). This maintains the relative priority between degraded hosts and reuses all of the existing LB mechanisms like spillover.
  • Multiple LBs: we could maintain several LBs for different tiers of hosts (healthy, degraded, maybe several levels of degraded?) and attempt to select from them in order. This is a much larger change, so I'm not sure it's worth it. It also gets tricky because the existing LB already has fallbacks for when there are no healthy hosts.
  • Support degraded hosts in the LB: filter out degraded hosts during host updates. Similar to how unhealthy hosts are partitioned from healthy hosts, we could partition out the degraded hosts, maintain that state within the LB, and add new rules for how degraded/unhealthy hosts interact (would degraded affect spillover? locality weight?).

The first point is how we're implementing this at the control plane level, and it's been working decently well. The fact that it doesn't complicate the routing logic by introducing new concepts in Envoy is a nice property, but I'm not sure whether frequent priority moves will be problematic.
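The first option can be sketched as a simple mapping (a hypothetical helper, not Envoy's actual code): with N configured priorities, a degraded host's priority is shifted by N, so every healthy level sorts ahead of every degraded level while degraded hosts keep their relative order.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper illustrating the "reuse priorities" option: with
// num_priorities configured levels, degraded hosts are shifted past all
// healthy levels, so P0 -> PN, P1 -> P(N+1), and so on.
uint32_t effectivePriority(uint32_t configured_priority, bool degraded,
                           uint32_t num_priorities) {
  return degraded ? configured_priority + num_priorities : configured_priority;
}
```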

Happy to consider other options too, these are just some approaches.

@snowp
Contributor Author

snowp commented Nov 18, 2018

Here's a more concrete suggestion based on the first high level approach in the previous comment:

Add a degraded_priority_load_ field to the LB which holds the load given to each priority based on how many degraded hosts are in that priority. This should be computed the same way as priority_load_, but looking at degraded hosts instead of healthy ones. The two should sum to 100 the same way we normalize today, with healthy hosts being allocated load first. If everything is unhealthy, we still route to P0 healthy.
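A rough sketch of how the two load vectors could be filled (names from this comment; the arithmetic is illustrative, not Envoy's exact normalization): healthy capacity absorbs load first in priority order, degraded capacity takes what remains, and any shortfall falls back to P0 healthy so the two always sum to 100.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative split of load between priority_load and degraded_priority_load.
// healthy_pct / degraded_pct are per-priority availability percentages.
void computePriorityLoads(const std::vector<uint32_t>& healthy_pct,
                          const std::vector<uint32_t>& degraded_pct,
                          std::vector<uint32_t>& priority_load,
                          std::vector<uint32_t>& degraded_priority_load) {
  const size_t n = healthy_pct.size();
  priority_load.assign(n, 0);
  degraded_priority_load.assign(n, 0);
  uint32_t remaining = 100;
  // Healthy hosts are allocated load first, in priority order.
  for (size_t i = 0; i < n && remaining > 0; ++i) {
    priority_load[i] = std::min(remaining, healthy_pct[i]);
    remaining -= priority_load[i];
  }
  // Degraded hosts only absorb load that healthy hosts could not.
  for (size_t i = 0; i < n && remaining > 0; ++i) {
    degraded_priority_load[i] = std::min(remaining, degraded_pct[i]);
    remaining -= degraded_priority_load[i];
  }
  // Any shortfall (including "everything unavailable") falls back to P0.
  priority_load[0] += remaining;
}
```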

With this, we can update chooseHostSet to iterate over both loads and return both the desired host set and whether we picked it from the degraded load. hostSourceToUse can then use this information to specify that we want the degraded hosts from the subset.

In order to maintain locality routing, a separate set of locality weights can be kept, adjusting each weight by the ratio of degraded to total hosts in the locality. This would be used in hostSourceToUse to select a locality when locality weighting is enabled.
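The adjustment described here can be sketched as follows (a hypothetical helper, mirroring how healthy locality weights are scaled):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative locality weight scaling: the configured weight is reduced by
// the fraction of eligible (here, degraded) hosts in the locality.
uint32_t scaledLocalityWeight(uint32_t configured_weight,
                              uint32_t eligible_hosts,
                              uint32_t total_hosts) {
  if (total_hosts == 0) {
    return 0;  // An empty locality gets no weight.
  }
  return configured_weight * eligible_hosts / total_hosts;
}
```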

When in panic mode (be it per priority or global), degraded values should be ignored (route to all hosts), and degraded hosts should not contribute to the panic mode threshold. That is, if a priority is fully degraded it should not be in panic mode: each host is still routable, so triggering panic mode isn't really useful.
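This panic rule amounts to counting degraded hosts as available (a sketch with an assumed percentage threshold, not Envoy's actual code):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative panic check: degraded hosts count toward availability, so a
// fully degraded priority never trips the threshold.
bool inPanicMode(uint32_t healthy_hosts, uint32_t degraded_hosts,
                 uint32_t total_hosts, uint32_t panic_threshold_pct) {
  if (total_hosts == 0) {
    return true;
  }
  const uint32_t available_pct =
      (healthy_hosts + degraded_hosts) * 100 / total_hosts;
  return available_pct < panic_threshold_pct;
}
```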

@mattklein123
Member

@snowp this makes sense to me. I like the idea of reusing priorities for this as I think it makes the implementation a lot simpler and more intuitive given what we already have. Perhaps you could also tackle #5081 as part of this. 😉

@stale

stale bot commented Dec 18, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 18, 2018
@mattklein123 mattklein123 added this to the 1.10.0 milestone Dec 18, 2018
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Dec 18, 2018
@mattklein123 mattklein123 added enhancement Feature requests. Not bugs or questions. stale stalebot believes this issue/PR has not been touched recently no stalebot Disables stalebot from closing an issue and removed design proposal Needs design doc/proposal before implementation labels Dec 18, 2018
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Dec 18, 2018
alyssawilk pushed a commit that referenced this issue Jan 16, 2019
This implements load balancing to degraded hosts by treating them as
"healthy" hosts in a lower priority than the healthy hosts.

Degraded hosts in PN are treated as healthy hosts in P(N+M), where M is the
total number of priorities. This ensures that degraded hosts are only routed
to when priority spillover, due to a lack of healthy hosts, causes a degraded
priority to be selected.

Locality weights for degraded localities are tracked similarly to how
locality weights are tracked for healthy hosts, i.e. scaled by the number
of eligible hosts over the total number of hosts in each locality. This
ensures that when degraded hosts are selected, load balancing between
localities behaves consistently with the behavior for healthy hosts.

Signed-off-by: Snow Pettersen snowp@squareup.com

Risk Level: Medium/High, touches core load balancing code quite a bit, but mostly refactors
Testing: UTs
Docs Changes: inline
Release Notes: n/a
Part of #5063

Signed-off-by: Snow Pettersen <snowp@squareup.com>
@snowp
Contributor Author

snowp commented Jan 18, 2019

As part of this I'll have to update the health check filter to cache the response headers, or at least the degraded header. Currently the caching layer causes degraded health checks to flap, making degraded checks incompatible with it.

alyssawilk pushed a commit that referenced this issue Jan 23, 2019
Updates the way panic mode is calculated to treat degraded hosts as
available. This ensures that panic mode is entered when there are
insufficient available hosts.

Also updates the panic mode documentation to use more generic language.

Signed-off-by: Snow Pettersen snowp@squareup.com

Risk Level: Medium
Testing: UTs for mixed healthy/degraded hosts
Docs Changes: Updated panic mode docs
Release Notes: n/a
#5063

Signed-off-by: Snow Pettersen <snowp@squareup.com>
danzh2010 pushed a commit to danzh2010/envoy that referenced this issue Jan 24, 2019
htuch pushed a commit that referenced this issue Feb 8, 2019

Adds a DEGRADED HealthStatus value that can be set on a host through
LoadAssignment, allowing for a host to be marked degraded without
the need for active health checking.

Moves the mapping of EDS flag to health flag to inside
`registerHostForPriority`, which means that we're now consistently setting
the EDS health flag for EDS/STATIC/STRICT_DNS/LOGICAL_DNS.

Simplifies the check for whether the health flag value of a host has
changed during EDS updates.

Adds tests for the EDS mapping as well as tests to verify that we're
honoring the EDS flag for non-EDS cluster types.

Risk Level: High, substantial refactoring of how we determine whether the health flag has changed.
Testing: UT coverage for new health flag values.
Docs Changes: n/a
Release Notes: n/a

Fixes #5637
#5063

Signed-off-by: Snow Pettersen <snowp@squareup.com>
@snowp
Contributor Author

snowp commented Feb 9, 2019

This is now fully implemented.

@snowp snowp closed this as completed Feb 9, 2019
fredlas pushed a commit to fredlas/envoy that referenced this issue Mar 5, 2019
fredlas pushed a commit to fredlas/envoy that referenced this issue Mar 5, 2019
fredlas pushed a commit to fredlas/envoy that referenced this issue Mar 5, 2019