
"Degraded" health check #5063

Closed
snowp opened this issue Nov 16, 2018 · 7 comments
Labels
enhancement Feature requests. Not bugs or questions. no stalebot Disables stalebot from closing an issue
Milestone

Comments

@snowp
Contributor

snowp commented Nov 16, 2018

Right now Envoy health checks are binary yes/no that determine whether a host should receive traffic. A possible extension to this would be to provide additional information in the http health check response that allows the downstream to reprioritize the host.

For instance, you can imagine an upstream responding with x-envoy-health-degraded when the host is able to serve most traffic but say a Redis shard is down. This would make the downstream prefer other hosts that aren't declaring themselves degraded, presumably because they have a healthy Redis shard.
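As a concrete illustration, a degraded-but-serving host's health check response might look like this (the header name comes from this comment; the exact response shape is a sketch, not a documented Envoy format):

```
HTTP/1.1 200 OK
content-type: text/plain
x-envoy-health-degraded: true

OK
```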

Floating this idea since it's something we have in our legacy RPC system and internal users seem to like it.

@mattklein123 mattklein123 added the design proposal Needs design doc/proposal before implementation label Nov 16, 2018
@mattklein123
Member

@snowp this request has come up in different forms multiple times. The "devil is in the details" on this one as it can get quite complicated, scary, and error prone. If you feel like doing a design proposal that would be great.

@snowp
Contributor Author

snowp commented Nov 16, 2018

I have a few high level ideas on how to accomplish this:

  • Reuse priorities: allow health checks to move hosts between priorities. You would shift a degraded host's priority by N, where N is the number of priorities, so P0 becomes PN and P1 becomes P(N+1). This maintains the relative priority between degraded hosts and reuses all of the existing LB mechanisms like spillover.
  • Multiple LBs: we could maintain several LBs for different tiers of hosts (healthy, degraded, maybe several levels of degraded?) and attempt to select from them in order. This is a much larger change, so I'm not sure it's worth it. It also gets tricky because the existing LB already has fallbacks for when there are no healthy hosts.
  • Support degraded hosts in the LB: filter out degraded hosts during host updates. Similar to how unhealthy hosts are partitioned from healthy hosts, we could partition out the degraded hosts, maintain that state within the LB, and add new rules for how degraded/unhealthy hosts interact (would degraded affect spillover? locality weight?).

The first point is how we're implementing this at the control plane level, and it's been working decently well. The fact that it doesn't complicate the routing logic by introducing new concepts in Envoy is a nice property, but I'm not sure whether frequent priority moves will be problematic.
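The first option can be sketched as a simple mapping (a hypothetical helper, not Envoy's actual code): with N configured priorities, a degraded host's priority is shifted by N, so every healthy level sorts ahead of every degraded level while degraded hosts keep their relative order.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper illustrating the "reuse priorities" option: with
// num_priorities configured levels, degraded hosts are shifted past all
// healthy levels, so P0 -> PN, P1 -> P(N+1), and so on.
uint32_t effectivePriority(uint32_t configured_priority, bool degraded,
                           uint32_t num_priorities) {
  return degraded ? configured_priority + num_priorities : configured_priority;
}
```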

Happy to consider other options too, these are just some approaches.

@snowp
Contributor Author

snowp commented Nov 18, 2018

Here's a more concrete suggestion based on the first high level approach in the previous comment:

Add a degraded_priority_load_ field to the LB which holds the load given to each priority based on how many degraded hosts are in that priority. This should be computed the same way as priority_load_, but looking at degraded hosts instead of healthy ones. The two should sum to 100 the same way we normalize today, with healthy hosts being allocated load first. If everything is unhealthy, we still route to P0 healthy.
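A rough sketch of how the two load vectors could be filled (names from this comment; the arithmetic is illustrative, not Envoy's exact normalization): healthy capacity absorbs load first in priority order, degraded capacity takes what remains, and any shortfall falls back to P0 healthy so the two always sum to 100.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative split of load between priority_load and degraded_priority_load.
// healthy_pct / degraded_pct are per-priority availability percentages.
void computePriorityLoads(const std::vector<uint32_t>& healthy_pct,
                          const std::vector<uint32_t>& degraded_pct,
                          std::vector<uint32_t>& priority_load,
                          std::vector<uint32_t>& degraded_priority_load) {
  const size_t n = healthy_pct.size();
  priority_load.assign(n, 0);
  degraded_priority_load.assign(n, 0);
  uint32_t remaining = 100;
  // Healthy hosts are allocated load first, in priority order.
  for (size_t i = 0; i < n && remaining > 0; ++i) {
    priority_load[i] = std::min(remaining, healthy_pct[i]);
    remaining -= priority_load[i];
  }
  // Degraded hosts only absorb load that healthy hosts could not.
  for (size_t i = 0; i < n && remaining > 0; ++i) {
    degraded_priority_load[i] = std::min(remaining, degraded_pct[i]);
    remaining -= degraded_priority_load[i];
  }
  // Any shortfall (including "everything unavailable") falls back to P0.
  priority_load[0] += remaining;
}
```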

With this, we can update chooseHostSet to iterate over both loads and return both the desired host set and whether we picked it from the degraded load. hostSourceToUse can then use this information to specify that we want the degraded hosts from the subset.

In order to maintain locality routing, a separate set of locality weights can be kept, adjusting each weight by the ratio of degraded to total hosts in the locality. This would be used in hostSourceToUse to select a locality when locality weighting is enabled.
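The adjustment described here can be sketched as follows (a hypothetical helper, mirroring how healthy locality weights are scaled):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative locality weight scaling: the configured weight is reduced by
// the fraction of eligible (here, degraded) hosts in the locality.
uint32_t scaledLocalityWeight(uint32_t configured_weight,
                              uint32_t eligible_hosts,
                              uint32_t total_hosts) {
  if (total_hosts == 0) {
    return 0;  // An empty locality gets no weight.
  }
  return configured_weight * eligible_hosts / total_hosts;
}
```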

When in panic mode (be it per priority or global), degraded values should be ignored (route to all hosts), and degraded hosts should not contribute to the panic mode threshold. That is, if a priority is fully degraded it should not be in panic mode: each host is still routable, so triggering panic mode isn't really useful.
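This panic rule amounts to counting degraded hosts as available (a sketch with an assumed percentage threshold, not Envoy's actual code):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative panic check: degraded hosts count toward availability, so a
// fully degraded priority never trips the threshold.
bool inPanicMode(uint32_t healthy_hosts, uint32_t degraded_hosts,
                 uint32_t total_hosts, uint32_t panic_threshold_pct) {
  if (total_hosts == 0) {
    return true;
  }
  const uint32_t available_pct =
      (healthy_hosts + degraded_hosts) * 100 / total_hosts;
  return available_pct < panic_threshold_pct;
}
```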

@mattklein123
Member

@snowp this makes sense to me. I like the idea of reusing priorities for this as I think it makes the implementation a lot simpler and more intuitive given what we already have. Perhaps you could also tackle #5081 as part of this. 😉

@stale

stale bot commented Dec 18, 2018

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Dec 18, 2018
@mattklein123 mattklein123 added this to the 1.10.0 milestone Dec 18, 2018
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Dec 18, 2018
@mattklein123 mattklein123 added enhancement Feature requests. Not bugs or questions. stale stalebot believes this issue/PR has not been touched recently no stalebot Disables stalebot from closing an issue and removed design proposal Needs design doc/proposal before implementation labels Dec 18, 2018
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Dec 18, 2018
alyssawilk pushed a commit that referenced this issue Jan 16, 2019
This implements load balancing to degraded hosts by treating them as
"healthy" hosts in a lower priority than the healthy hosts.

Degraded hosts in PN are treated as healthy hosts in P(N+M), where M is the
total number of priorities. This ensures that degraded hosts are only routed
to when priority spillover, due to a lack of healthy hosts, causes a degraded
priority to be selected.

Locality weights for degraded localities are tracked similarly to how
locality weights are tracked for healthy hosts, i.e. scaled by the number
of eligible hosts over the total number of hosts in each locality. This
ensures that when degraded hosts are selected, load balancing between
localities behaves consistently with the behavior for healthy hosts.

Signed-off-by: Snow Pettersen snowp@squareup.com

Risk Level: Medium/High, touches core load balancing code quite a bit, but mostly refactors
Testing: UTs
Docs Changes: inline
Release Notes: n/a
Part of #5063

Signed-off-by: Snow Pettersen <snowp@squareup.com>
@snowp
Contributor Author

snowp commented Jan 18, 2019

As part of this I'll have to update the health check filter to cache the response headers, or at least the degraded header. Currently the caching layer causes degraded health checks to flap, making degraded checks incompatible with it.

alyssawilk pushed a commit that referenced this issue Jan 23, 2019
Updates the way panic mode is calculated to treat degraded hosts as
available. This ensures that panic mode is entered when there are
insufficient available hosts.

Also updates the panic mode documentation to use more generic language.

Signed-off-by: Snow Pettersen snowp@squareup.com

Risk Level: Medium
Testing: UTs for mixed healthy/degraded hosts
Docs Changes: Updated panic mode docs
Release Notes: n/a
#5063

Signed-off-by: Snow Pettersen <snowp@squareup.com>
danzh2010 pushed a commit to danzh2010/envoy that referenced this issue Jan 24, 2019
htuch pushed a commit that referenced this issue Feb 8, 2019

Adds a DEGRADED HealthStatus value that can be set on a host through
LoadAssignment, allowing for a host to be marked degraded without
the need for active health checking.

Moves the mapping of EDS flag to health flag to inside
`registerHostForPriority`, which means that we're now consistently setting
the EDS health flag for EDS/STATIC/STRICT_DNS/LOGICAL_DNS.

Simplifies the check for whether the health flag value of a host has
changed during EDS updates.

Adds tests for the EDS mapping as well as tests to verify that we're
honoring the EDS flag for non-EDS cluster types.

Risk Level: High, substantial refactoring of how we determine whether the health flag has changed.
Testing: UT coverage for new health flag values.
Docs Changes: n/a
Release Notes: n/a

Fixes #5637
#5063

Signed-off-by: Snow Pettersen <snowp@squareup.com>
@snowp
Contributor Author

snowp commented Feb 9, 2019

This is now fully implemented.

@snowp snowp closed this as completed Feb 9, 2019
fredlas pushed a commit to fredlas/envoy that referenced this issue Mar 5, 2019
fredlas pushed a commit to fredlas/envoy that referenced this issue Mar 5, 2019
fredlas pushed a commit to fredlas/envoy that referenced this issue Mar 5, 2019