Smarter retry behavior/cross priority retries #3958
A variant of the third option would be to use
@snowp just catching up here. IMO we should probably do the "fully configurable retry routing" option with some built-in options such as "don't route to the same host," "don't route to same priority," etc. I think this could be built very cleanly into the existing retry/router implementation and would be very flexible. WDYT? Basically, I would like to solve #2259 while also allowing folks to do more complex stuff if they want?
Yeah that sounds reasonable to me. It would definitely satisfy our needs and seems generic enough that I'm sure others will find it useful. Happy to work on this. Would it be natural to use the subset lb to implement this behavior? e.g. generate one subset per priority, and when a request fails going to P0, we set metadata_match to point at a different priority's subset.

One thing that would be nice with this being tied to the subset lb is that the work for host/priority would naturally extend to arbitrary subsets. The main concern I see with using the subset lb is performance, especially with the recent subset lb perf concerns. Does maintaining all these extra subsets in order to do smarter retries make sense? Can this be better implemented in a different way?
Yeah I'm a little concerned about using the subset LB for this, at least in all cases. Or do you mean just for your plugin? The way I originally envisioned the "don't use the same host again for retry" behavior was to just pass the previously tried hosts as part of the LB context, and let the LB figure it out. So, potentially, should this plugin mechanism actually be a plugin that operates from within the LB context and could be somewhat generic depending on the LB actually used?
Delegating the work to the LB makes sense. The subset lb looked attractive because it wouldn't require implementing this in every LB, but I imagine the perf benefits make it worth it. Looking over the code it doesn't seem all that bad to put this into the LB. Not sure how easy this would be to support generically; it seems like we need special handling depending on what we're filtering on. For hosts we can choose a host set like normal, and if it contains the undesired host we can remove it. For priorities we might scale the remaining priority weights s.t. they sum to 100 in order to use the existing priority selection logic. Maybe there's a generic "ignore hosts matching X" approach that's eluding me?
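For what it's worth, the rescaling idea could look like this minimal sketch (plain C++, not Envoy code; the function name and table representation are made up):

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Zero out one priority's share of a per-priority load table (entries
// sum to 100) and rescale the rest so they sum to 100 again. The
// rounding remainder is handed to the first non-zero entry.
std::vector<uint32_t> excludePriority(std::vector<uint32_t> load, size_t excluded) {
  load[excluded] = 0;
  const uint32_t remaining = std::accumulate(load.begin(), load.end(), 0u);
  if (remaining == 0) {
    return load; // Every other priority is at zero; the caller must fall back.
  }
  uint32_t assigned = 0;
  for (uint32_t& weight : load) {
    weight = weight * 100 / remaining;
    assigned += weight;
  }
  for (uint32_t& weight : load) {
    if (weight > 0) {
      weight += 100 - assigned; // Hand the rounding remainder to one entry.
      break;
    }
  }
  return load;
}
```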
TBH I'm not sure of the best solution here without putting some dedicated thought time into it. There is going to be a tradeoff between flexibility and performance. For example, I could imagine a plugin that does something like:
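For instance, a plugin interface along these lines (a purely illustrative sketch; every name here is invented, not Envoy's actual API):

```cpp
#include <cstdint>
#include <vector>

namespace Envoy {
namespace Upstream {

class Host; // Envoy's upstream host abstraction.

// Hypothetical retry-aware hook the load balancer would consult.
class RetrySelectionPlugin {
public:
  virtual ~RetrySelectionPlugin() = default;

  // Runs before priority selection; may rewrite the per-priority load
  // table (entries sum to 100) to steer this attempt away from
  // priorities that were already tried.
  virtual void adjustPriorityLoad(std::vector<uint32_t>& per_priority_load) = 0;

  // Runs after a candidate host is picked; returning false asks the LB
  // to re-run selection, up to some bounded number of attempts.
  virtual bool acceptHost(const Host& candidate) = 0;
};

} // namespace Upstream
} // namespace Envoy
```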
This is not fleshed out, but I wonder if given something like this we could build it into the various LBs pretty easily? For most of the LBs this would be built into the base classes anyway and "just work." Anyway maybe it's worth doing a bit of experimentation here and coming back with a firmer proposal?
That sounds reasonable. At least for priorities I think it would have to be a pre-filter simply due to how priorities work: if you're trying to avoid P0 but P0 is healthy you'll always hit P0, so trying again won't help. I'll spend some time playing around with it and report back when I've got a better sense of how this would work. Thanks for all the ideas!
After looking at the code for a while it's clear that 1) priorities need to be filtered before we attempt to select a host, and that 2) filtering individual hosts before selection is hard because they're buried inside the load balancers' internal data structures. We add three functions to the LoadBalancerContext:
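In sketch form (the function names match the later commits on this issue; the exact signatures are guesses):

```cpp
#include <cstdint>
#include <vector>

namespace Envoy {
namespace Upstream {

class Host;

// The three proposed additions to LoadBalancerContext.
class LoadBalancerContext {
public:
  virtual ~LoadBalancerContext() = default;

  // Optionally rewrites the per_priority_load table before a priority is
  // picked; returning the input unchanged preserves today's behavior.
  virtual std::vector<uint32_t>
  prePrioritySelectionFilter(const std::vector<uint32_t>& per_priority_load) = 0;

  // Called with the host the LB selected; returning false rejects it and
  // triggers another round of host selection.
  virtual bool postHostSelectionFilter(const Host& host) = 0;

  // Upper bound on how many times host selection is re-run before the LB
  // gives up and keeps the last candidate.
  virtual uint32_t hostSelectionRetryCount() const = 0;
};

} // namespace Upstream
} // namespace Envoy
```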
The first function allows one to optionally specify a priority filter. The presence of this would mean that we generate a temporary per_priority_load table, which is then used to pick the priority to route to. The second filter will be used after a host has been selected, giving us a chance to reject it and re-run host selection.

We'll probably want to handle the case where applying the filter makes it impossible to find a new host. For instance, when routing to a cluster with only one host and a postFilter that excludes that host, we'll be unable to find a new candidate host. In that case we'll probably want to bail out of filtering hosts, falling back to using no filters. To make this behavior configurable we have the hostSelectionRetryCount, which caps how many times selection is re-run.
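As an illustration of that fallback, a self-contained sketch of the bounded selection loop (all types here are simplified stand-ins for Envoy's; chooseHostOnce represents the LB's existing pick logic):

```cpp
#include <cstdint>
#include <memory>

struct Host {};
using HostConstSharedPtr = std::shared_ptr<const Host>;

struct Context {
  virtual ~Context() = default;
  virtual bool postHostSelectionFilter(const Host& host) = 0;
  virtual uint32_t hostSelectionRetryCount() const = 0;
};

HostConstSharedPtr chooseHostOnce(Context&); // The normal selection path.

HostConstSharedPtr chooseHostWithFilter(Context& context) {
  HostConstSharedPtr candidate;
  for (uint32_t i = 0; i <= context.hostSelectionRetryCount(); ++i) {
    candidate = chooseHostOnce(context);
    if (context.postHostSelectionFilter(*candidate)) {
      return candidate; // The filter accepted this host.
    }
  }
  // Every attempt was rejected: fall back to the last candidate rather
  // than failing the request outright.
  return candidate;
}
```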
To configure this for retries, we can introduce a retry-specific policy on the route configuration. The router would be initialized with default filters for the initial request (empty optional for the preFilter, always true for the postFilter, and only one attempt), updating them from the retry state as attempts fail. To make this extensible we'd use the same kind of pattern that listeners/access logs/etc. use, and allow specifying a HostFilter by name. Something along the lines of:
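A sketch only; the field and filter names below are placeholders, not real Envoy fields:

```yaml
route:
  cluster: some_cluster
  retry_policy:
    retry_on: 5xx
    num_retries: 3
    # Hypothetical name-based registration, mirroring how listener
    # filters and access logs are configured.
    host_filter:
      name: envoy.retry_filters.ignore_previous_hosts
      config: {}
```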
which would live on the RouteAction.RetryPolicy. Rough order of implementation:
Happy to move this to a Google doc if it makes commenting easier.
@snowp in general I think this sounds great to me. For the retry policy, we might want to also allow specifying it at the virtual host level? Also, I'm thinking that we can include one or more built-in policies to solve the "don't retry same host" open issues. @envoyproxy/maintainers any thoughts on this? @alyssawilk specifically would love your feedback.
Is this in a good enough state that I can start working on the implementation? Or should I wait for more feedback?
Ugh, sorry, I was out last week and am swamped by maintainer rotation this week :-( Chiming in rather belatedly, I totally agree with Matt that an extensible/customization mechanism is the way to go. I suspect other folks will want to preserve legacy behaviors so making it easier to program custom host selection seems ideal. I guess my only thought is rather than have a callback before and after host selection where you mutate state such as per_priority_load, would it make more sense to make the priority and host selection actually the pluggable bit, where the default action is to call the current code, but if a plug-in is configured that's called instead? Changing the internals before and after feels more fragile than allowing custom code to inspect and make its own decisions. No strong feelings so if you want to go with the Matt-approved design over my hand-waving that's fine. :-)
@snowp I'm fine for you to move forward. If you want to investigate @alyssawilk's thoughts first that's fine too. I agree with @alyssawilk that not having before/after would be better, though my concern would be that the plugin would then have to do a lot more than what we are proposing that it does with the above design.
Seems like having the pre/post filters would be useful even if we allow overriding the entire host selection method: it'd let people move concerns such as "avoid this host/priority" out of the host selection algorithm, making it easier to reuse between implementations. If a bunch of awesome host selection algorithms got upstreamed, I could add "retry other priority" to any of them without having them explicitly support that. Additionally, I'd hate to have to somehow get ahold of the existing priority selection logic so that I can run through the same code with slight modifications. I think I'll go with the currently approved design for now but open to pivoting if it ends up being too heavily coupled to the lb internals.
As an initial step towards envoyproxy#3958, this implements two mechanisms for affecting the host selected by the load balancer: 1) prePrioritySelectionFilter, which is used to generate a different per_priority_load table before determining what priority to route to 2) postHostSelectionFilter, which can be used to reject a selected host and retry host selection (up to hostSelectionRetryCount times) Both of these are defined on the LoadBalancerContext, and will eventually be used by the router to affect the host selection strategy based on retry state. Signed-off-by: Snow Pettersen <snowp@squareup.com>
As an initial step towards #3958, this implements two mechanisms for affecting the host selected by the load balancer during chooseHost: 1. prePrioritySelectionFilter, which is used to generate a different per_priority_load table before determining what priority to route to 2. postHostSelectionFilter, which can be used to reject a selected host and retry host selection (up to hostSelectionRetryCount times) Both of these are defined on LoadBalancerContext, and will eventually be used by the router to affect the host selection strategy based on retry state. The first mechanism is implemented in the LoadBalancerBase, while the second one is only implemented for EdfLoadBalancerBase and RandomLoadBalancer. Signed-off-by: Snow Pettersen snowp@squareup.com Risk Level: Medium, new feature that's not enabled. Default behavior should match existing behavior. Testing: Unit tests Docs Changes: N/A (feature not configurable) Release Notes: N/A Signed-off-by: Snow Pettersen <snowp@squareup.com>
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
not stale, some work done in #4097
…points (#4212) This adds the necessary configuration and interfaces to register implementations of RetryPriority and RetryHostPredicate, which will allow configuring smarter host selection during retries. Part of #3958 Risk Level: low, api changes Testing: n/a Doc Changes: inline Release Notes: n/a Signed-off-by: Snow Pettersen <snowp@squareup.com>
Wires up route configuration to allow specifying what hosts should be reattempted during retry host selection. Risk Level: Medium, some changes made to the router. Otherwise new optional feature Testing: unit and integration test Docs Changes: n/a Release Notes: n/a Part of #3958 Signed-off-by: Snow Pettersen <snowp@squareup.com>
Plumbs through the max_host_selection_count parameter from retry policy config -> router Risk Level: Low Testing: UT Docs Changes: n/a Release Notes: n/a Part of #3958 Signed-off-by: Snow Pettersen <snowp@squareup.com>
This wires up the necessary logic to allow registering a RetryPriorityFactory that can be used to impact which priority is selected during host selection for retry attempts. Signed-off-by: Snow Pettersen snowp@squareup.com Description: Risk Level: Low, new optional feature Testing: Integration test Docs Changes: n/a Release Notes: n/a Part of #3958 Signed-off-by: Snow Pettersen <snowp@squareup.com>
Adds a simple RetryHostPredicate that keeps track of hosts that have already been attempted, triggering a new host to be selected if an already attempted host is selected. Signed-off-by: Snow Pettersen snowp@squareup.com Risk Level: Low Testing: Unit test Docs Changes: n/a Release Notes: n/a Part of #3958 Signed-off-by: Snow Pettersen <snowp@squareup.com>
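Conceptually, such a predicate might look like this simplified sketch (Envoy's real implementation operates on host objects rather than address strings):

```cpp
#include <string>
#include <unordered_set>

// Tracks attempted hosts so retry host selection can avoid them.
class PreviousHostsPredicate {
public:
  // True if this host was already attempted and another should be picked.
  bool shouldSelectAnotherHost(const std::string& host_address) const {
    return attempted_hosts_.count(host_address) > 0;
  }

  // Record every attempted host so later selections can skip it.
  void onHostAttempted(const std::string& host_address) {
    attempted_hosts_.insert(host_address);
  }

private:
  std::unordered_set<std::string> attempted_hosts_;
};
```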
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.
Both host selection and retry priority modifications have been implemented, including example implementations.
We're interested in having Envoy perform smarter retry behavior, where subsequent requests are sent to different hosts than the original request. More specifically, we'd be interested in having requests that route to priority P attempt to send retries to priority P' != P even in the case where the first priority is completely healthy, but fall back to the original priority if no healthy alternatives exist. The use case is mainly to maintain compatibility with our existing legacy systems: it's an attempt to minimize the change in behavior for our internal users when switching over to routing through Envoy.
Got a few ideas off the top of my head that might help fuel a discussion:
Configurable subsets on retry
One somewhat suboptimal option would be to use the subset lb and allow adjusting the metadata match on subsequent retries. I'm imagining something like:
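A hypothetical sketch; the `retry_metadata_match` field below is invented and does not exist in Envoy:

```yaml
route:
  cluster: my_service
  metadata_match: { group: primary }
  retry_policy:
    num_retries: 3
    # Each retry takes the next entry, wrapping around at the end.
    retry_metadata_match:
      - { group: secondary }
      - { group: tertiary }
```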
which would issue the first request to the `primary` subset, the first retry to `secondary`, the second retry to `tertiary`, and then loop back around to `secondary` (i.e. the retry list wraps around). With this we could add metadata to the hosts in each priority to allow us to exclude certain priorities (e.g. hosts not in P0 get labeled P0=false).

The problem I can see with this is that there are no guards against the subsets being empty, which would cause an immediate 503 and quickly exhaust the permitted number of retries. Presumably this could be solved by being able to query the load balancer to determine whether a subset is empty before trying to route to it, but I'm not sure how easy that would be to achieve. The other issue here is that we don't have knowledge about which priority was actually hit; we have to statically guess what priority to hit on the subsequent requests. For instance, if we have P0, P1, P2 and everything is healthy, requests with retries could hit P0, P1, P2, while if P0 is unhealthy, we'd hit P1, P1, P2.
Cross-priority failover
Another option I can think of would be more tightly coupled to priorities: simply keep track in the retry state of which priority was last tried, and attempt to select a host while ignoring the previous priority. Again this seems like it needs some way of checking whether there are available hosts when retrying (falling back to hitting the same priority), but it seems like a simpler API and would at least achieve what we're looking for. This approach handles the edge case mentioned in the previous paragraph, because we'd know that we hit P1 on the first request and be able to take that into account when selecting the next priority to route to.
Fully configurable retry routing
A last approach is less fleshed out but would involve being able to register custom retry implementations (similar to how custom filters can be registered). It could expose an API like:
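For example, something along these lines (an invented API, purely for illustration):

```cpp
#include <cstdint>
#include <vector>

namespace Envoy {
namespace Router {

// What the policy learns about each prior attempt.
struct RetryAttempt {
  uint32_t priority; // Priority level the attempt was routed to.
  bool success;      // Whether the attempt got a usable response.
};

// A registerable policy deciding where the next retry should go.
class RetryRoutingPolicy {
public:
  virtual ~RetryRoutingPolicy() = default;

  // Given the history of attempts so far, pick the priority (or subset)
  // the next retry should target.
  virtual uint32_t selectPriority(const std::vector<RetryAttempt>& attempts) = 0;
};

} // namespace Router
} // namespace Envoy
```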
This would allow Envoy users to implement an arbitrary retry policy based on the previous attempts, and it could be configurable at the route level, something like:
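For instance (sketch only; `retry_routing_policy` is not a real Envoy field):

```yaml
route:
  cluster: my_service
  retry_policy:
    num_retries: 2
    # Selects a registered RetryRoutingPolicy implementation by name,
    # the same way custom filters are referenced.
    retry_routing_policy:
      name: com.example.prefer_other_priority
      config: {}
```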
This way any Envoy user could implement their own bespoke retry behavior. It would still be useful to be able to query the health of subsets/clusters etc. for the aforementioned reasons, but you could presumably add something like `incrementRetryCounter()` to the API to let the implementation decide whether a given retry counts towards the retry counters.

Happy to consider other ideas, these were just top of mind.