Rework request retry logic to be based on retry count limit #48758

geoffkizer · 2021-02-25T12:19:13Z

We currently disallow request retries on connection failure when the failure occurs on a "new" connection (one that hasn't been used for previous requests). We do this mainly to ensure that the retry logic doesn't end up in an infinite loop -- as connections fail, sooner or later the request will be retried on a "new" connection, and we will break out of the retry loop.

This logic is suboptimal for a few reasons:
(1) There's nothing particularly unique about the first request on a connection. Servers can die or choose to close connections at any time.
(2) We currently do a bad job of deciding which request is the first request for an HTTP2 connection. It is timing dependent. This means that in certain scenarios, when the server sends valid REFUSED_STREAM errors, we end up not retrying requests that should be retried.
(3) We can in theory end up retrying a request many, many times until we actually use a new connection and break out of the retry loop. This is particularly problematic for HTTP2, for several reasons: (a) a single connection can handle many requests, yet connection failure only causes one to stop retrying; (b) we treat REFUSED_STREAM errors as retryable in all cases except for the initial request, even though the connection is still valid, which means that we may never actually choose a new connection for the request; (c) we handle GOAWAY to determine which requests are allowed to be retried, but these same requests could just end up being subject to REFUSED_STREAM or GOAWAY on a different connection, etc.

This PR changes the request retry logic to be based on a fixed retry count limit. The limit is 5 retries; we could adjust this as appropriate or make it configurable if desired.

Note that failure to successfully establish a connection will still cause a request to fail immediately. Requests are only retryable when an established connection causes a failure.
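
As a rough illustration of the count-based approach described above, here is a minimal sketch. It is not the code in this PR. `RetrySketch`, `SendWithRetryAsync`, and `RetryableConnectionFailureException` are hypothetical names; only `MaxConnectionFailureRetries` appears in this PR's commit messages, and the value 5 is taken from the description (the merged value may differ). Any other exception type, such as a failure to establish a connection, is not caught and fails the request immediately.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical exception type: signals that an established connection failed
// in a way that makes the request safe to retry.
public sealed class RetryableConnectionFailureException : Exception
{
    public RetryableConnectionFailureException(string message) : base(message) { }
}

public static class RetrySketch
{
    // Maximum number of retries, not counting the initial attempt.
    // The value 5 comes from the PR description and is an assumption here.
    public const int MaxConnectionFailureRetries = 5;

    // sendOnce is a hypothetical delegate that performs a single attempt;
    // it receives the number of retries performed so far.
    public static async Task<T> SendWithRetryAsync<T>(Func<int, Task<T>> sendOnce)
    {
        int retryCount = 0;
        while (true)
        {
            try
            {
                return await sendOnce(retryCount).ConfigureAwait(false);
            }
            catch (RetryableConnectionFailureException) when (retryCount < MaxConnectionFailureRetries)
            {
                // Retry only while under the limit. Once the limit is reached the
                // filter is false and the exception propagates to the caller.
                retryCount++;
            }
        }
    }
}
```

With this shape, a request that keeps hitting retryable failures is attempted at most `MaxConnectionFailureRetries + 1` times, regardless of whether any of those attempts landed on a "new" connection.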

Also:

- Add relevant tests
- Rename some of the RequestRetryType enum values for clarity
- Rename HttpRetryProtocolTests file/class to HttpClientHandlerTest.RequestRetry, and move from common code to System.Net.Http since none of these tests were actually running for WinHttpHandler
- Fix/simplify some test code around HttpAgnosticLoopbackServer that caused failures with these changes

Fixes #44669

UPDATE:

From exploring certain test failures that were caused by this PR, it's clear to me that we are too lenient about allowing retries in many cases. For example, we are retrying on IO timeouts in one of the HttpWebRequest tests, even though the user is explicitly setting this timeout and presumably wants to fail (not retry) when the timeout is exceeded.

I believe many of these weird cases of lenient retry policy were masked by the way the old retry logic worked, which was that it never allowed retry on the first request on a connection. This means most unit tests never induced retry, even in failure cases that would have caused retry if the request were not the first on the connection -- which is extremely common in practice, but not common in our tests, unfortunately.

So it's actually good that this new retry policy has exposed these issues -- they already existed but were largely hidden.

To address these issues, I am changing the retry logic to be more conservative. We will no longer retry on arbitrary IO errors. We will only retry in cases where we believe that the server is attempting to gracefully tear down a connection -- that is, receiving EOF before any other response data for HTTP/1.1, or receiving GOAWAY for HTTP2.
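
The more conservative policy can be pictured as a small classification function. This is a sketch under stated assumptions, not the PR's implementation. `HttpProtocol`, `ConnectionFailure`, and `RetryPolicySketch` are hypothetical names, and the model is simplified: it treats GOAWAY as a single case and does not model which streams a GOAWAY actually covers.

```csharp
using System;

// Hypothetical types for illustration only.
public enum HttpProtocol { Http11, Http2 }

public enum ConnectionFailure
{
    EofBeforeAnyResponseData, // server closed the connection before sending any response bytes
    EofMidResponse,           // connection closed after part of the response arrived
    IOError,                  // arbitrary IO error, including read/write timeouts
    GoAway,                   // HTTP2 GOAWAY received from the server
}

public static class RetryPolicySketch
{
    // Retry only when the failure looks like a graceful connection teardown by the server.
    public static bool IsRetryable(HttpProtocol protocol, ConnectionFailure failure) =>
        (protocol, failure) switch
        {
            (HttpProtocol.Http11, ConnectionFailure.EofBeforeAnyResponseData) => true,
            (HttpProtocol.Http2, ConnectionFailure.GoAway) => true,
            _ => false, // arbitrary IO errors and timeouts are no longer retried
        };
}
```

Under this sketch, the HttpWebRequest timeout case mentioned above maps to `IOError` and fails instead of being retried.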

Please take a look and comment. @stephentoub @scalablecory

ghost · 2021-02-25T12:19:19Z

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Author:	geoffkizer
Assignees:	-
Labels:	`area-System.Net.Http`
Milestone:	-

geoffkizer · 2021-02-25T12:19:47Z

/azp run runtime-libraries-coreclr outerloop

azure-pipelines · 2021-02-25T12:20:01Z

Azure Pipelines successfully started running 1 pipeline(s).

wfurt · 2021-02-26T00:13:29Z

PlatformHandlerTest.cs(338,73): error CS0246: (NETCORE_ENGINEERING_TELEMETRY=Build) The type or namespace name 'HttpRetryProtocolTests'

WinHttp may also need a project file change.

wfurt · 2021-02-26T00:15:08Z

hmm. and this one is interesting

   System.Net.Tests.FtpWebRequestTest.Ftp_MakeAndRemoveDir_Success [SKIP]
      Condition(s) not met: "LocalServerAvailable"
Process terminated. Assertion failed.
   at System.Net.Sockets.SocketAsyncContext.Register() in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs:line 1229
   at System.Net.Sockets.SocketAsyncContext.OperationQueue`1.StartAsyncOperation(SocketAsyncContext context, TOperation operation, Int32 observedSequenceNumber, CancellationToken cancellationToken) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs:line 792
   at System.Net.Sockets.SocketAsyncContext.PerformSyncOperation[TOperation](OperationQueue`1& queue, TOperation operation, Int32 timeout, Int32 observedSequenceNumber) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs:line 1301
   at System.Net.Sockets.SocketAsyncContext.Connect(Byte[] socketAddress, Int32 socketAddressLen) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs:line 1456
   at System.Net.Sockets.SocketPal.Connect(SafeSocketHandle handle, Byte[] socketAddress, Int32 socketAddressLen) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs:line 1031
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs:line 3220
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs:line 866
   at System.Net.Sockets.Socket.Connect(IPAddress address, Int32 port) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs:line 896
   at System.Net.Sockets.Socket.Connect(String host, Int32 port) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs:line 922
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs:line 852

geoffkizer · 2021-02-28T05:46:46Z

/azp run runtime-libraries-coreclr outerloop

azure-pipelines · 2021-02-28T05:46:59Z

Azure Pipelines successfully started running 1 pipeline(s).

geoffkizer · 2021-03-03T22:03:18Z

There's something wacky going on deep in the sockets code that is causing an assert in SocketAsyncContext.

I suspect this is a pre-existing issue that has been exposed by the combination of #47648 and this PR. Something involving sync socket IO and timeouts.

cc @wfurt @antonfirsov @scalablecory @stephentoub... any ideas?

geoffkizer · 2021-04-06T04:40:56Z

/azp run runtime-libraries-coreclr outerloop

azure-pipelines · 2021-04-06T04:41:09Z

Azure Pipelines successfully started running 1 pipeline(s).

geoffkizer · 2021-04-07T05:07:05Z

/azp run runtime-libraries-coreclr outerloop

azure-pipelines · 2021-04-07T05:07:18Z

Azure Pipelines successfully started running 1 pipeline(s).

geoffkizer · 2021-04-08T21:58:01Z

/azp run runtime-libraries-coreclr outerloop

azure-pipelines · 2021-04-08T21:58:15Z

Azure Pipelines successfully started running 1 pipeline(s).

geoffkizer · 2021-04-09T01:39:14Z

This is now ready for review.

The socket assert got fixed in #50788. I also reworked a questionable WebSocket test that was failing due to a change in timing.

stephentoub · 2021-04-09T14:55:23Z

src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnection.cs

@@ -569,10 +569,13 @@ public async Task<HttpResponseMessage> SendAsyncCore(HttpRequestMessage request,
                    _readOffset = 0;
                    _readLength = bytesRead;
                }
+                else
+                {
+                    await FillAsync(async).ConfigureAwait(false);


To make sure I understand, this was already done later as part of ReadNextResponseHeaderLineAsync, but we're preemptively doing it here so that we can definitively say after this that data was received?

geoffkizer · 2021-04-18T01:48:11Z

/azp run runtime-libraries-coreclr outerloop

azure-pipelines · 2021-04-18T01:48:25Z

Azure Pipelines successfully started running 1 pipeline(s).

geoffkizer · 2021-04-18T02:22:45Z

/azp run runtime-libraries-coreclr outerloop

azure-pipelines · 2021-04-18T02:22:57Z

Azure Pipelines successfully started running 1 pipeline(s).

geoffkizer · 2021-04-18T02:45:53Z

/azp run runtime-libraries-coreclr outerloop

azure-pipelines · 2021-04-18T02:46:06Z

Azure Pipelines successfully started running 1 pipeline(s).

dotnet-issue-labeler bot added the area-System.Net.Http label Feb 25, 2021

runfoapp bot mentioned this pull request Feb 25, 2021

HttpWebRequestTest_Sync.ReadWriteTimeout_CancelsResponse failed in CI mono Linux x64 #47728

Closed

Base automatically changed from master to main March 1, 2021 09:08

GrabYourPitchforks mentioned this pull request Mar 11, 2021

Undo the Const < (uint)span.Length hacks in the BCL #49450

Merged

geoffkizer mentioned this pull request Mar 29, 2021

Failed assert in SocketAsyncContext.TryRegister in CI #50380

Closed

geoffkizer force-pushed the requestretry branch from 91ae4d6 to 7c2b0fb Compare April 6, 2021 04:40

geoffkizer marked this pull request as draft April 6, 2021 16:43

geoffkizer force-pushed the requestretry branch from 7c2b0fb to c13abe9 Compare April 7, 2021 05:06

geoffkizer force-pushed the requestretry branch from 49a3b68 to 4a979d8 Compare April 9, 2021 01:37

geoffkizer marked this pull request as ready for review April 9, 2021 01:37

stephentoub reviewed Apr 9, 2021

src/libraries/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/HttpConnectionPool.cs (resolved)

stephentoub reviewed Apr 9, 2021

src/libraries/System.Net.WebSockets.Client/tests/ClientWebSocketOptionsTests.cs (outdated, resolved)

stephentoub approved these changes Apr 9, 2021

geoffkizer force-pushed the requestretry branch from 233d69f to f6381cb Compare April 11, 2021 06:09

Geoffrey Kizer added 15 commits April 17, 2021 18:45

rework request retry logic to be based off a fixed retry limit (MaxConnectionFailureRetries) instead of isNewConnection logic

a2e0c47

fix bogus websocket test

d6ee478

better fix for client web socket test

ec48cc8

adjust logic so that MaxConnectionFailureRetries is actually the max retries, not including initial attempt

9900e91

fix up HttpWebRequest test

cfa877c

try to make Connect_ViaProxy_ProxyTunnelRequestIssued test more robust against timing issues

4617e70

try to make ReadWriteTimeout_CancelsResponse more robust to timing issues

ed57e2f

another test fix

42cd29d

try to make Http2_ServerSendsInvalidSettingsValue_Error test more robust

3e6d934

don't retry IO failures in HTTP2, in order to provide more consistent behavior

8de45a7

try to make HttpClientTest.SendAsync_CorrectVersionSelected_LoopbackServer test more robust against timing issues

5b800b6

push temp change to HttpWebRequest to try to understand test failure better

2f23a7f

be more conservative about request retries -- only allow on receiving EOF from the server

1fcd4d9

undo test hack

2ae7aa8

revert test change

608a96f

geoffkizer force-pushed the requestretry branch from 83f9e88 to 608a96f Compare April 18, 2021 01:47

fix merge issue

10ab1ab

geoffkizer merged commit 288bfa0 into dotnet:main Apr 18, 2021

runfoapp bot mentioned this pull request Apr 19, 2021

Symlinks left on disk by Installer runs breaks job log upload on Linux #51502

Closed

ghost locked as resolved and limited conversation to collaborators May 18, 2021

karelz added this to the 6.0.0 milestone May 20, 2021
