Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

router: support for rate limit retry policy #4946

Closed

Conversation

ramaraochavali
Copy link
Contributor

Description: Adds a new retry policy for rate limited requests.
Risk Level: Low
Testing: Added automated tests
Docs Changes: Changed
Release Notes: Added
Fixes #4855

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@ramaraochavali
Copy link
Contributor Author

@mattklein123 can you PTAL when you get time?

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Small question/comment.

@@ -99,6 +99,9 @@ retriable-status-codes
in either :ref:`the retry policy <envoy_api_field_route.RouteAction.RetryPolicy.retriable_status_codes>`
or in the :ref:`config_http_filters_router_x-envoy-retriable-status-codes` header.

rate-limited
Envoy will attempt a retry if a request is rate limited (with response code 429) by the :ref:`rate limit service <arch_overview_rate_limit>`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would remove the "by the rate limit service" clause? Since really you are just adding a 429 retry policy?

Do we want to add a parallel grpc response code policy for resource exhausted or something like that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I would remove the "by the rate limit service" clause? Since really you are just adding a 429 retry policy?

Yes. I will remove that.

Do we want to add a parallel grpc response code policy for resource exhausted or something like that?

I think parallel grpc response code policy is not required because rate limiting filter returns 429 status (along with gRPC status UNAVAILABLE). So the newly added policy should be sufficient. Am I missing some thing here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also realized the condition is not doing what I am intending to do (and the corresponding test also). Now I have corrected. It should be clear now. PTAL.

Signed-off-by: Rama <rama.rao@salesforce.com>
Signed-off-by: Rama <rama.rao@salesforce.com>
@mattklein123
Copy link
Member

@ramaraochavali sorry I'm confused. Per your comment here: #4855 (comment) do you plan on actually using this feature? I would rather not add it unless there is a specific use case.

@ramaraochavali
Copy link
Contributor Author

@mattklein123 There are two things that this PR does

  1. Adds the behaviour that a rate limited request is not retried irrespective of the gRPC status. With out this it retries if we have retry-grpc-on has UNAVAILABLE. Our case is the case added in the test case PolicyUnavailableWhenRateLimited
  2. It gives opt-in config for retrying rate limited requests.
    We do not use the opt-in behaviour. But we need 1 definitely. So if you think, the opt-in config is not need (and that is what I was asking in the issue), I will remove it. But we definitely need the code not retry rate limited requests irrespective of what is specificied in retry-grpc-on because those conditions (UNAVAILABLE or RESOURCE_EXHAUSTED can come for other reasons and hence will be retried now).

@mattklein123
Copy link
Member

Adds the behaviour that a rate limited request is not retried irrespective of the gRPC status. With out this it retries if we have retry-grpc-on has UNAVAILABLE. Our case is the case added in the test case PolicyUnavailableWhenRateLimited

Where does this PR do this? I'm totally lost.

@mattklein123 mattklein123 self-assigned this Nov 4, 2018
@@ -186,6 +189,11 @@ RetryStatus RetryStateImpl::shouldRetry(const Http::HeaderMap* response_headers,
}

bool RetryStateImpl::wouldRetryFromHeaders(const Http::HeaderMap& response_headers) {
// We should retry rate limited request only if the policy is set.
if (Http::CodeUtility::isRateLimited(Http::Utility::getResponseStatus(response_headers))) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123 This is the check I am talking about. It checks if the request is rate limited and the retry policy for rate_limited is set then only retries. Otherwise just returns, does not retry and check for other gRPC conditions. Am I understanding this incorrectly? or missing some thing here that is causing the confusion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you think opt-in behaviour is not needed, we could some thing similar to https://github.com/envoyproxy/envoy/blob/master/source/common/router/retry_state_impl.cc#L266??

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this help with gRPC though? In a normal gRPC response the status code is going to be 200, so this wouldn't apply?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not intending this for normal gRPC. I was intending this for a use case where the rate limiting filter identifies the request as rate limited and that returns status code of 429. Do you think this configuration is generic and people get confused? That is why in the initial PR I added the clause that returned by rate limiting service in the doc explanation. Do you see any other way of achieving the same?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, stepping back, can you explain what you are trying to achieve? You originally said you don't want to retry rate limited requests, but this PR will allow you to do that? I don't see any functional change in this PR other than an addition of a new rate limit policy which I think you said you don't want to use?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123 Our caller is gRPC. I am following that PR of mapping change. However, we see some cases where RESOURCES_EXHAUSTED is returned by the service (other than rate limiting). And retrying those requests to other nodes might help in some cases. So ideally we would want the RESOURCE_EXHAUSTED to be in the retry policy but exclude rate limiting calls.

I think the confusion here is that for a gRPC caller, HTTP status is converted to 200 (OK), so what you are doing here is not going to help for gRPC. If you have an HTTP caller, 429 is not retried by default, so you wouldn't get a retry either.

Thanks for this. This is creating the confusion for me. I did not know that for gRPC HTTP status is converted to 200(OK). I was missing that point. Agree this PR is not going to help for that case. Sorry for all the confusion.

So basically if we want to have RESOURCE_EXHAUSTED in the retry codes list (after #4879 is merged) and do not want to retry rate limited requests - how should we proceed?
I have couple of solutions.

  1. Add a new header called x-envoy-ratelimited to distinguish that it is a rate limited request and do not retry it if we see that header some thing similar to x-envoy-overloaded ? or use the header x-envoy-ratelimited to drive the rate-limited (similar to what I have here in this PR except that it would be driven by the new header) policy instead of Http Status?
  2. Should we add a new policy as proposed here (retry of rate limited requests #4855 (comment) - option 2) - which also might need the above header?

or any other better solutions?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess if no code changes are possible, that would be best. For example, do you really need to retry in the case of RESOURCE_EXHAUSTED? That seems generally bad.

If you do, I guess setting a header much like we do for overloaded makes sense to me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattklein123 just to confirm, if we implement this new header approach, do you think it is better to configure the behaviour of enforcing retries based on some policy or just do not retry (and document this behaviour) if we see that header exactly as how x-envoy-overloaded works?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would rather not add a retry policy until someone explicitly asks for it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense..I agree with it. I have pushed the PR #4972 with above changes. PTAL.

@ramaraochavali
Copy link
Contributor Author

@mattklein123 I am going to close this PR, but based on what you think about other point about RESOURCE_EXHAUSTED, will open another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants