Retry not firing as expected, even with retriable_status_codes #6726

Closed
jaygorrell opened this issue Apr 26, 2019 · 2 comments · Fixed by #7505
Labels: enhancement (Feature requests. Not bugs or questions.), help wanted (Needs help!)

Comments

@jaygorrell

Issue Template

Title: Some 503 conditions not being retried, even with 503 as a retry code

Description:
In an Istio environment, every service now has a default retry policy. It looks like this:

          "route": {
           "cluster": "outbound|80||my-service.default.svc.cluster.local",
           "timeout": "0s",
           "retry_policy": {
            "retry_on": "connect-failure,refused-stream,unavailable,cancelled,resource-exhausted,retriable-status-codes",
            "num_retries": 2,
            "retry_host_predicate": [
             {
              "name": "envoy.retry_host_predicates.previous_hosts"
             }
            ],
            "host_selection_retry_max_attempts": "3",
            "retriable_status_codes": [
             503
            ]
           },
           "max_grpc_timeout": "0s"
          }

During scale-down events, I get logs like the following on the client side, and the requests do not appear to be retried:

[2019-04-25 00:56:30.950][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:502] [C11540] remote close
[2019-04-25 00:56:30.950][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:183] [C11540] closing socket: 0
[2019-04-25 00:56:30.950][31][debug][client] [external/envoy/source/common/http/codec_client.cc:82] [C11540] disconnect. resetting 1 pending requests
[2019-04-25 00:56:30.950][31][debug][client] [external/envoy/source/common/http/codec_client.cc:105] [C11540] request reset
[2019-04-25 00:56:30.950][31][debug][router] [external/envoy/source/common/router/router.cc:644] [C11555][S6290924653342831959] upstream reset: reset reason connection termination
[2019-04-25 00:56:30.951][31][debug][filter] [src/envoy/http/mixer/filter.cc:133] Called Mixer::Filter : encodeHeaders 2
[2019-04-25 00:56:30.951][31][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1243] [C11555][S6290924653342831959] closing connection due to connection close header
[2019-04-25 00:56:30.951][31][debug][http] [external/envoy/source/common/http/conn_manager_impl.cc:1305] [C11555][S6290924653342831959] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '95'
'content-type', 'text/plain'
'date', 'Thu, 25 Apr 2019 00:56:30 GMT'
'server', 'envoy'
'connection', 'close'

[2019-04-25 00:56:30.951][31][debug][filter] [src/envoy/http/mixer/filter.cc:205] Called Mixer::Filter : onDestroy state: 2
[2019-04-25 00:56:30.951][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:101] [C11555] closing data_to_write=248 type=2
[2019-04-25 00:56:30.951][31][debug][connection] [external/envoy/source/common/network/connection_impl.cc:153] [C11555] setting delayed close timer with timeout 1000 ms
[2019-04-25 00:56:30.951][31][debug][pool] [external/envoy/source/common/http/http1/conn_pool.cc:129] [C11540] client disconnected, failure reason: 
[2019-04-25 00:56:30.951][31][debug][filter] [src/envoy/http/mixer/filter.cc:219] Called Mixer::Filter : log
[2019-04-25 00:56:30.951][31][debug][filter] [./src/envoy/http/mixer/report_data.h:132] No dynamic_metadata found for filter envoy.filters.http.rbac

If I override the default retry with my own that uses retryOn: gateway-error, it completely addresses the issue. I would have expected connect-failure to work on its own, but if not that, then certainly retriable-status-codes with 503.
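
For context, that kind of override looks roughly like the following VirtualService (a rough sketch, not the exact config in use; the names, port, and timeout values are illustrative):

apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: my-service
  namespace: default
spec:
  hosts:
  - my-service.default.svc.cluster.local
  http:
  - route:
    - destination:
        host: my-service.default.svc.cluster.local
        port:
          number: 80
    # Route-level retry override: retry only on gateway errors (502/503/504).
    retries:
      attempts: 2
      perTryTimeout: 2s
      retryOn: gateway-error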

I'm new to debugging these systems, but I'm happy to check anything else or provide more information if guided a little.

Relevant Links:
Istio issue that led to opening one here as well: istio/istio#13616
This also seems similar to #5876, but I'm not sure if it's really the same issue or not.

@mattklein123 added the enhancement (Feature requests. Not bugs or questions.) and help wanted (Needs help!) labels on Apr 27, 2019
@mattklein123
Member

Yeah this is the same issue as #5876, but I'm going to keep this one open and mark it help wanted as it has a lot less text. For someone coming along, the relevant fix needed is simple and is here: #5876 (comment)

@jaygorrell changed the title from "Retry not applying as expected, even with retriable_status_codes" to "Retry not firing as expected, even with retriable_status_codes" on Apr 28, 2019
@rahulraykor

I am also facing the same issue, on Envoy v1.9.
Has this issue been reproduced? If not, I can provide steps to reproduce it.
