Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework request retry logic to be based on retry count limit #48758

Merged
merged 16 commits into from
Apr 18, 2021

Conversation

geoffkizer
Copy link
Contributor

@geoffkizer geoffkizer commented Feb 25, 2021

We currently disallow request retries on connection failure when the failure occurs on a "new" connection (one that hasn't been used for previous requests). We do this mainly to ensure that the retry logic doesn't end up in an infinite loop -- as connections fail, sooner or later the request will be retried on a "new" connection, and we will break out of the retry loop.

This logic is suboptimal for a couple reasons:
(1) There's nothing particularly unique about the first request on a connection. Servers can die or choose to close connections at any time.
(2) We currently do a bad job of deciding which request is the first request for an HTTP2 connection. It is timing dependent. This means that in certain scenarios, when the server sends valid REFUSED_STREAM errors, we end up not retrying requests that should be retried.
(3) We can in theory end up retrying a request many, many times until we actually use a new connection and break out of the retry loop. This is particularly problematic for HTTP2, for several reasons: (a) a single connection can handle many requests, yet connection failure only causes one to stop retrying; (b) we treat REFUSED_STREAM errors as retryable in all cases except for the initial request, even though the connection is still valid, which means that we may never actually choose a new connection for the request; (c) we handle GOAWAY to determine which requests are allowed to be retried, but these same requests could just end up being subject to REFUSED_STREAM or GOAWAY on a different connection, etc.

This PR changes the request retry logic to be based on a fixed retry count limit. The limit is 5 retries; we could adjust this as appropriate or make it configurable if desired.

Note that failure to successfully establish a connection will still cause a request to fail immediately. Requests are only retryable when an established connection causes a failure.

Also:

  • Add relevant tests
  • Rename some of the RequestRetryType enum values for clarity
  • Rename HttpRetryProtocolTests file/class to HttpClientHandlerTest.RequestRetry, and move from common code to System.Net.Http since none of these tests were actually running for WinHttpHandler
  • Fix/simplify some test code around HttpAgnosticLoopbackServer that caused failures with these changes

Fixes #44669

UPDATE:

From exploring certain test failures that were caused by this PR, it's clear to me that we are too lenient about allowing retries in many cases. For example, we are retrying on IO timeouts in one of the HttpWebRequest tests, even though the user is explicitly setting this timeout and presumably wants to fail (not retry) when the timeout is exceeded.

I believe many of these weird cases of lenient retry policy were masked by the way the old retry logic worked, which was that it never allowed retry on the first request on a connection. This means most unit tests never induced retry, even in failure cases that would have caused retry if the request were not the first on the connection -- which is extremely common in practice, but not common in our tests, unfortunately.

So it's actually good that this new retry policy has exposed these issues -- they already existed but were largely hidden.

To address these issues, I am changing the retry logic to be more conservative. We no longer will retry on arbitrary IO errors. We will only retry in cases where we believe that the server is attempting to gracefully tear down a connection -- that is, receiving EOF before any other response data for HTTP/1.1, or receiving GOAWAY for HTTP2.

Please take a look and comment. @stephentoub @scalablecory

@ghost
Copy link

ghost commented Feb 25, 2021

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

We currently disallow request retries on connection failure when the failure occurs on a "new" connection (one that hasn't been used for previous requests). We do this mainly to ensure that the retry logic doesn't end up in an infinite loop -- as connections fail, sooner or later the request will be retried on a "new" connection, and we will break out of the retry loop.

This logic is suboptimal for a couple reasons:
(1) There's nothing particularly unique about the first request on a connection. Servers can die or choose to close connections at any time.
(2) We currently do a bad job of deciding which request is the first request for an HTTP2 connection. It is timing dependent. This means that in certain scenarios, when the server sends valid REFUSED_STREAM errors, we end up not retrying requests that should be retried.
(3) We can in theory end up retrying a request many, many times until we actually use a new connection and break out of the retry loop. This is particularly problematic for HTTP2, for several reasons: (a) a single connection can handle many requests, yet connection failure only causes one to stop retrying; (b) we treat REFUSED_STREAM errors as retryable in all cases except for the initial request, even though the connection is still valid, which means that we may never actually choose a new connection for the request; (c) we handle GOAWAY to determine which requests are allowed to be retried, but these same requests could just end up being subject to REFUSED_STREAM or GOAWAY on a different connection, etc.

This PR changes the request retry logic to be based on a fixed retry count limit. The limit is 5 retries; we could adjust this as appropriate or make it configurable if desired.

Note that failure to successfully establish a connection will still cause a request to fail immediately. Requests are only retryable when an established connection causes a failure.

Also:

  • Add relevant tests
  • Rename some of the RequestRetryType enum values for clarity
  • Rename HttpRetryProtocolTests file/class to HttpClientHandlerTest.RequestRetry, and move from common code to System.Net.Http since none of these tests were actually running for WinHttpHandler
  • Fix/simplify some test code around HttpAgnosticLoopbackServer that caused failures with these changes

Fixes #44669

Author: geoffkizer
Assignees: -
Labels:

area-System.Net.Http

Milestone: -

@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@wfurt
Copy link
Member

wfurt commented Feb 26, 2021

PlatformHandlerTest.cs(338,73): error CS0246: (NETCORE_ENGINEERING_TELEMETRY=Build) The type or namespace name 'HttpRetryProtocolTests'

WinHttp may need also project file change.

@wfurt
Copy link
Member

wfurt commented Feb 26, 2021

hmm. and this one is interesting

   System.Net.Tests.FtpWebRequestTest.Ftp_MakeAndRemoveDir_Success [SKIP]
      Condition(s) not met: "LocalServerAvailable"
Process terminated. Assertion failed.
   at System.Net.Sockets.SocketAsyncContext.Register() in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs:line 1229
   at System.Net.Sockets.SocketAsyncContext.OperationQueue`1.StartAsyncOperation(SocketAsyncContext context, TOperation operation, Int32 observedSequenceNumber, CancellationToken cancellationToken) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs:line 792
   at System.Net.Sockets.SocketAsyncContext.PerformSyncOperation[TOperation](OperationQueue`1& queue, TOperation operation, Int32 timeout, Int32 observedSequenceNumber) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs:line 1301
   at System.Net.Sockets.SocketAsyncContext.Connect(Byte[] socketAddress, Int32 socketAddressLen) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs:line 1456
   at System.Net.Sockets.SocketPal.Connect(SafeSocketHandle handle, Byte[] socketAddress, Int32 socketAddressLen) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs:line 1031
   at System.Net.Sockets.Socket.DoConnect(EndPoint endPointSnapshot, SocketAddress socketAddress) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs:line 3220
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs:line 866
   at System.Net.Sockets.Socket.Connect(IPAddress address, Int32 port) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs:line 896
   at System.Net.Sockets.Socket.Connect(String host, Int32 port) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs:line 922
   at System.Net.Sockets.Socket.Connect(EndPoint remoteEP) in /_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.cs:line 852

@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Base automatically changed from master to main March 1, 2021 09:08
@geoffkizer
Copy link
Contributor Author

There's something wacky going on deep in the sockets code that is causing an assert in SocketAsyncContext.

I suspect this is a pre-existing issue that has been exposed by the combination of #47648 and this PR. Something involving sync socket IO and timeouts.

cc @wfurt @antonfirsov @scalablecory @stephentoub... any ideas?

@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@geoffkizer geoffkizer marked this pull request as ready for review April 9, 2021 01:37
@geoffkizer
Copy link
Contributor Author

This is now ready for review.

The socket assert got fixed in #50788. I also reworked a questionable WebSocket test that was failing due to change in timing.

@@ -569,10 +569,13 @@ public async Task<HttpResponseMessage> SendAsyncCore(HttpRequestMessage request,
_readOffset = 0;
_readLength = bytesRead;
}
else
{
await FillAsync(async).ConfigureAwait(false);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure I understand, this was already done later as part of ReadNextREsponseHeaderLineAsync, but we're preemptively doing it here so that we can definitively say after this that data was received?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@geoffkizer
Copy link
Contributor Author

/azp run runtime-libraries-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HTTP2 doesn't retry the first stream on REFUSED_STREAM
4 participants