New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
router: support for rate limit retry policy #4946
router: support for rate limit retry policy #4946
Conversation
Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@mattklein123 can you PTAL when you get time? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Small question/comment.
@@ -99,6 +99,9 @@ retriable-status-codes | |||
in either :ref:`the retry policy <envoy_api_field_route.RouteAction.RetryPolicy.retriable_status_codes>` | |||
or in the :ref:`config_http_filters_router_x-envoy-retriable-status-codes` header. | |||
|
|||
rate-limited | |||
Envoy will attempt a retry if a request is rate limited (with response code 429) by the :ref:`rate limit service <arch_overview_rate_limit>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would remove the "by the rate limit service" clause? Since really you are just adding a 429 retry policy?
Do we want to add a parallel grpc response code policy for resource exhausted or something like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I would remove the "by the rate limit service" clause? Since really you are just adding a 429 retry policy?
Yes. I will remove that.
Do we want to add a parallel grpc response code policy for resource exhausted or something like that?
I think parallel grpc response code policy is not required because rate limiting filter returns 429 status (along with gRPC status UNAVAILABLE
). So the newly added policy should be sufficient. Am I missing some thing here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also realized the condition is not doing what I am intending to do (and the corresponding test also). Now I have corrected. It should be clear now. PTAL.
Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali sorry I'm confused. Per your comment here: #4855 (comment) do you plan on actually using this feature? I would rather not add it unless there is a specific use case. |
@mattklein123 There are two things that this PR does
|
Where does this PR do this? I'm totally lost. |
@@ -186,6 +189,11 @@ RetryStatus RetryStateImpl::shouldRetry(const Http::HeaderMap* response_headers, | |||
} | |||
|
|||
bool RetryStateImpl::wouldRetryFromHeaders(const Http::HeaderMap& response_headers) { | |||
// We should retry rate limited request only if the policy is set. | |||
if (Http::CodeUtility::isRateLimited(Http::Utility::getResponseStatus(response_headers))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattklein123 This is the check I am talking about. It checks if the request is rate limited and the retry policy for rate_limited
is set then only retries. Otherwise just returns, does not retry and check for other gRPC conditions. Am I understanding this incorrectly? or missing some thing here that is causing the confusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you think opt-in behaviour is not needed, we could some thing similar to https://github.com/envoyproxy/envoy/blob/master/source/common/router/retry_state_impl.cc#L266??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this help with gRPC though? In a normal gRPC response the status code is going to be 200, so this wouldn't apply?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was not intending this for normal gRPC. I was intending this for a use case where the rate limiting filter identifies the request as rate limited and that returns status code of 429. Do you think this configuration is generic and people get confused? That is why in the initial PR I added the clause that returned by rate limiting service
in the doc explanation. Do you see any other way of achieving the same?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, stepping back, can you explain what you are trying to achieve? You originally said you don't want to retry rate limited requests, but this PR will allow you to do that? I don't see any functional change in this PR other than an addition of a new rate limit policy which I think you said you don't want to use?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattklein123 Our caller is gRPC. I am following that PR of mapping change. However, we see some cases where RESOURCES_EXHAUSTED
is returned by the service (other than rate limiting). And retrying those requests to other nodes might help in some cases. So ideally we would want the RESOURCE_EXHAUSTED
to be in the retry policy but exclude rate limiting calls.
I think the confusion here is that for a gRPC caller, HTTP status is converted to 200 (OK), so what you are doing here is not going to help for gRPC. If you have an HTTP caller, 429 is not retried by default, so you wouldn't get a retry either.
Thanks for this. This is creating the confusion for me. I did not know that for gRPC HTTP status is converted to 200(OK). I was missing that point. Agree this PR is not going to help for that case. Sorry for all the confusion.
So basically if we want to have RESOURCE_EXHAUSTED
in the retry codes list (after #4879 is merged) and do not want to retry rate limited requests - how should we proceed?
I have couple of solutions.
- Add a new header called
x-envoy-ratelimited
to distinguish that it is a rate limited request and do not retry it if we see that header some thing similar tox-envoy-overloaded
? or use the headerx-envoy-ratelimited
to drive therate-limited
(similar to what I have here in this PR except that it would be driven by the new header) policy instead of Http Status? - Should we add a new policy as proposed here (retry of rate limited requests #4855 (comment) - option 2) - which also might need the above header?
or any other better solutions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess if no code changes are possible, that would be best. For example, do you really need to retry in the case of RESOURCE_EXHAUSTED? That seems generally bad.
If you do, I guess setting a header much like we do for overloaded makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mattklein123 just to confirm, if we implement this new header approach, do you think it is better to configure the behaviour of enforcing retries based on some policy or just do not retry (and document this behaviour) if we see that header exactly as how x-envoy-overloaded
works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rather not add a retry policy until someone explicitly asks for it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense..I agree with it. I have pushed the PR #4972 with above changes. PTAL.
@mattklein123 I am going to close this PR, but based on what you think about other point about |
Description: Adds a new retry policy for rate limited requests.
Risk Level: Low
Testing: Added automated tests
Docs Changes: Changed
Release Notes: Added
Fixes #4855