
HTTP2: Flakiness when server doesn't completely read request #40051

Closed · JamesNK opened this issue Aug 6, 2019 · 37 comments

@JamesNK (Member) commented Aug 6, 2019

I have created a repro for an error that sometimes shows up on the grpc-dotnet CI server.

Repro:

  1. git clone https://github.com/JamesNK/grpc-dotnet.git
  2. git checkout jamesnk/stress-maxsize
  3. dotnet test test\FunctionalTests --filter Name~ReceivedMessageExceedsSize_ThrowError

The logic in the test is as follows (a rough client-side sketch follows the list):

  1. Client makes a non-streaming call to the server and sends a large request body
  2. Server ends response without reading complete request body
  3. Client asserts response
  4. Go to 1.
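
A rough, hypothetical sketch of that loop using a raw HttpClient, rather than the actual test code in MaxMessageSizeTests.cs (the port and worker counts below are invented for illustration):

// Hypothetical repro sketch, not the actual test code. Assumes a local
// HTTP/2 server at https://localhost:50050 that responds (and resets the
// stream) without reading the full request body.
using System;
using System.Net.Http;
using System.Threading.Tasks;

class ReproSketch
{
    static async Task Main()
    {
        using var client = new HttpClient();
        var payload = new byte[1024 * 1024]; // large request body

        // Several workers share one HttpClient, as in the failing test run.
        var workers = new Task[8];
        for (int i = 0; i < workers.Length; i++)
        {
            workers[i] = Task.Run(async () =>
            {
                for (int j = 0; j < 100; j++)
                {
                    var request = new HttpRequestMessage(HttpMethod.Post, "https://localhost:50050/")
                    {
                        Version = new Version(2, 0),
                        Content = new ByteArrayContent(payload)
                    };
                    using var response = await client.SendAsync(request);
                    Console.WriteLine(response.StatusCode); // step 3: client asserts the response
                }
            });
        }

        await Task.WhenAll(workers);
    }
}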

When one thread is executing, this does not error.

When multiple threads share one HttpClient, it consistently throws this error:

System.Threading.Tasks.TaskCanceledException : A task was canceled.
  Stack Trace: 
    at TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
    at Http2Connection.SendStreamDataAsync(Int32 streamId, ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
    at Http2Stream.SendDataAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
    at HttpContent.CopyToAsyncCore(ValueTask copyTask)
    at Http2Stream.SendRequestBodyAsync(CancellationToken cancellationToken)
    at Http2Connection.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    at HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
    at RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
    at HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
    at TaskExtensions.TimeoutAfter[T](Task`1 task, TimeSpan timeout, String filePath, Int32 lineNumber) in TaskExtensions.cs line: 58
    at <<ReceivedMessageExceedsSize_ThrowError>b__1>d.MoveNext() in MaxMessageSizeTests.cs line: 77
    at --- End of stack trace from previous location where exception was thrown ---
    at MaxMessageSizeTests.ReceivedMessageExceedsSize_ThrowError() in MaxMessageSizeTests.cs line: 87
    at GenericAdapter`1.BlockUntilCompleted()
    at NoMessagePumpStrategy.WaitForCompletion(AwaitAdapter awaitable)
    at AsyncToSyncAdapter.Await(Func`1 invoke)
    at TestMethodCommand.RunTestMethod(TestExecutionContext context)
    at TestMethodCommand.Execute(TestExecutionContext context)
    at <>c__DisplayClass1_0.<Execute>b__0()
    at BeforeAndAfterTestCommand.RunTestMethodInThreadAbortSafeZone(TestExecutionContext context, Action action)
.NET Core SDK (reflecting any global.json):
 Version:   3.0.100-preview9-013697
 Commit:    baac5f4d5c

// @geoffkizer

@geoffkizer (Member) commented Aug 6, 2019

I can't repro this using master bits.

I'm guessing this is the same issue as #39586. That fix has not made it into the 3.0 branch yet.

@eiriktsarpalis (Member) commented Aug 9, 2019

I'm able to reproduce consistently using latest master bits.

@karelz (Member) commented Aug 9, 2019

@geoffkizer can you please help take a look?

@geoffkizer (Member) commented Aug 9, 2019

@eiriktsarpalis How frequently does this repro for you?

I ran the test like 20 times and didn't see a failure.

@geoffkizer (Member) commented Aug 9, 2019

@eiriktsarpalis

It would be helpful to know:

Why are we cancelling? Is this because the client has requested cancellation (e.g. due to timeout or whatever)? If so, then it appears we think we have no connection window credit here. In which case we need to understand whether the server is not giving us credit, or it is but we're not using it for some reason.

Or have we initiated cancellation internally, e.g. we've received RST_STREAM or something like that? If so, we need to understand what caused cancellation here and why.

@geoffkizer (Member) commented Aug 9, 2019

@eiriktsarpalis A trace of HttpClient events would be helpful here.

@eiriktsarpalis (Member) commented Aug 9, 2019

@geoffkizer (Member) commented Aug 9, 2019

@eiriktsarpalis What version of ASP.NET are you using? An install link would be helpful...

@geoffkizer (Member) commented Aug 9, 2019

Also, are you running debug or release bits?

@JamesNK (Member, Author) commented Aug 9, 2019

I'll see if I can recreate it on the latest SDK. To double check I'm getting the right SDK first, is this the link I want? - https://dotnetcli.blob.core.windows.net/dotnet/Sdk/release/3.0.1xx/dotnet-sdk-latest-win-x64.exe

That's from the Release/3.0.1XX (3.0.x Runtime) column at https://github.com/dotnet/core-sdk/blob/master/README.md#installers-and-binaries

@eiriktsarpalis (Member) commented Aug 9, 2019

Also, are you running debug or release bits?

Release.

I'll see if I can recreate it on the latest SDK. To double check I'm getting the right SDK first, is this the link I want? - https://dotnetcli.blob.core.windows.net/dotnet/Sdk/release/3.0.1xx/dotnet-sdk-latest-win-x64.exe

I used the same source, AspNetCore v3.0.0-preview9.19408.9. Shared NetCoreApp bits were copied from a corefx master build, from within corefx/artifacts/bin/testhost/netcoreapp-Windows_NT-Release-x64.

@eiriktsarpalis (Member) commented Aug 9, 2019

To be more precise, I use the nightly zipfile.

@geoffkizer (Member) commented Aug 9, 2019

I honestly don't know what fixes that contains. Does it contain #40112 which was merged yesterday?

@eiriktsarpalis (Member) commented Aug 9, 2019

I honestly don't know what fixes that contains. Does it contain #40112 which was merged yesterday?

I don't know either, which is why I copy corefx bits to its shared folder post installation.

EDIT: This is why I'd like this to happen: #40110

@JamesNK (Member, Author) commented Aug 9, 2019

Just confirmed that I still get this error with 3.0.100-preview9-013744.

  X ReceivedMessageExceedsSize_ThrowError [235ms]
  Error Message:
   System.Threading.Tasks.TaskCanceledException : A task was canceled.
  Stack Trace:
     at System.Net.Http.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
   at System.Net.Http.Http2Connection.SendStreamDataAsync(Int32 streamId, ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Http.Http2Connection.Http2Stream.SendDataAsync(ReadOnlyMemory`1 buffer, CancellationToken cancellationToken)
   at System.Net.Http.HttpContent.CopyToAsyncCore(ValueTask copyTask)
   at System.Net.Http.Http2Connection.Http2Stream.SendRequestBodyAsync(CancellationToken cancellationToken)
   at System.Net.Http.Http2Connection.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
   at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Grpc.Tests.Shared.TaskExtensions.TimeoutAfter[T](Task`1 task, TimeSpan timeout, String filePath, Int32 lineNumber) in C:\Development\Source\grpc-dotnet\test\Shared\TaskExtensions.cs:line 58
   at Grpc.AspNetCore.FunctionalTests.Server.MaxMessageSizeTests.<>c__DisplayClass0_0.<<ReceivedMessageExceedsSize_ThrowError>b__1>d.MoveNext() in C:\Development\Source\grpc-dotnet\test\FunctionalTests\Server\MaxMessageSizeTests.cs:line 77
--- End of stack trace from previous location where exception was thrown ---
   at Grpc.AspNetCore.FunctionalTests.Server.MaxMessageSizeTests.ReceivedMessageExceedsSize_ThrowError() in C:\Development\Source\grpc-dotnet\test\FunctionalTests\Server\MaxMessageSizeTests.cs:line 87
   at NUnit.Framework.Internal.TaskAwaitAdapter.GenericAdapter`1.BlockUntilCompleted()
   at NUnit.Framework.Internal.MessagePumpStrategy.NoMessagePumpStrategy.WaitForCompletion(AwaitAdapter awaitable)
   at NUnit.Framework.Internal.AsyncToSyncAdapter.Await(Func`1 invoke)
   at NUnit.Framework.Internal.Commands.TestMethodCommand.RunTestMethod(TestExecutionContext context)
   at NUnit.Framework.Internal.Commands.TestMethodCommand.Execute(TestExecutionContext context)
   at NUnit.Framework.Internal.Commands.BeforeAndAfterTestCommand.<>c__DisplayClass1_0.<Execute>b__0()
   at NUnit.Framework.Internal.Commands.BeforeAndAfterTestCommand.RunTestMethodInThreadAbortSafeZone(TestExecutionContext context, Action action)

@JamesNK (Member, Author) commented Aug 9, 2019

I don't believe that SDK has #40112 in it yet. I looked at the assembly and there is no _requestBodyAbandoned field on Http2Stream.

It has been over 24 hours since that PR was merged. Is there something going on with the build? Are we using the right SDK? Might be worth checking out.

@geoffkizer (Member) commented Aug 9, 2019

I nuked a bunch of SDK related stuff to try to ensure it's running the expected bits, and now I can repro this as well on current release/3.0 bits.

Also, apparently this SDK contains a System.Net.Http.dll from the future. Pretty cool!

08/09/2019 07:19 PM 1,475,448 System.Net.Http.dll

@JamesNK (Member, Author) commented Aug 9, 2019

The latest 3.0 SDK has #40112. With it I still get this error on my machine.

@davidsh (Member) commented Aug 12, 2019

@stephentoub / @geoffkizer Will one of you be able to look at this bug?

@wfurt wfurt self-assigned this Aug 12, 2019

@wfurt (Member) commented Aug 12, 2019

I'm also able to reproduce it using the latest master bits.
The client gets a RST while waiting for credit. That triggers the cancellation token, and a cancellation exception surfaces to the caller.

Predictably, I get RST EnhanceYourCalm for stream 401 (default server concurrency of 100). When I bump the limit to 500, it predictably fails at stream 2001.

So the formula seems to be 4x the stream concurrency. When I use InProcessTestServer.cs:

.UseKestrel(options =>
{
    options.ListenLocalhost(50050, listenOptions =>
    {
        listenOptions.Protocols = HttpProtocols.Http2;
    });
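    // Effectively removes the per-connection stream concurrency limit for this experiment.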
    options.Limits.Http2.MaxStreamsPerConnection = int.MaxValue;
})

the test runs without any issues.

To make it easier to count, I set the server limit to 5, and here is what I see (in the attached pcap):

Frame 22: Settings -> Max streams = 5
Frame 25: Client sends new requests 1, 3, 5, 7, 9
Frame 44: Response headers for 1, 3, 5, 7, 9
Frame 45: RST 5
Frame 47: RST 3
Frame 48: RST 7
Frame 49: RST 1
Frame 50: RST 9

Count should be zero now.

Frame 53: Requests 11, 13
Frame 57: Request 15
Frame 60: Response headers 11, 13; RST 11, 13

Count should be zero now.

Frame 61: Requests 17, 19
Frame 67: Headers 15, 17, 19; RST 15, 17, 19

Count should be zero now.

Frame 69: Request 21
Frame 71: Request 23
Frame 74: Request 25

Frame 77: RST EnhanceYourCalm on 21

So this seems to be an enforcement issue in the Kestrel server. It almost feels like the streams the server resets are still being counted.
It would be great if anybody could look at the capture and confirm.

cc: @Tratcher

h2-test.pcapng.zip

@JamesNK (Member, Author) commented Aug 12, 2019

// @halter73 @anurse @Tratcher #40051 (comment)

This may be a Kestrel HTTP/2 issue. Could a Kestrel expert investigate? 😄

@Tratcher (Contributor) commented Aug 13, 2019

When the response finishes before the request body, a reset is sent and the request is put into a graceful timeout queue. Frames received during the graceful drain timeout will be ignored; frames received after that will be considered a protocol error. If the client ends or resets the stream, then it's removed from the queue.

After recent changes (@jkotalik), Kestrel allows 2x MaxStreams to be waiting in the graceful timeout queue and sends EnhanceYourCalm for additional streams. This limits the amount of memory the server needs to hold onto to track stream state. Not sure how you're hitting a 4x multiplier instead.
aspnet/AspNetCore@aab75e8#diff-259f4a7dbb73dd85ca5799a61ac2754fR913

Summary: It would be cleaner for the client to respond to the server's EndStream/reset with its own reset so both sides could safely consider the stream closed, then the stream can be discarded.

@wfurt (Member) commented Aug 13, 2019

The client only uses odd stream numbers, so it looks like double, but the count is not. If there is a 2x limit in the Kestrel code, then the math does make sense.

@JamesNK (Member, Author) commented Aug 13, 2019

Does this correctly describe what is going on:

  1. Client makes request to the server
  2. Client sends DATA payload
  3. Server reads headers
  4. Server looks at DATA payload but decides not to read it
  5. Server sends HEADERS and RST_STREAM (NO_ERROR)
  6. Client reads HEADERS and RST_STREAM
  7. ???

What is the state of the world at 7?

The server still considers the stream semi-active. Looking at the spec - https://http2.github.io/http2-spec/#StreamStates - is it in a half-closed state from the server's perspective?

The client considers the stream done because it has the server's RST_STREAM but it isn't sending an acknowledgement. Is that normal behavior for HTTP/2 clients?

What is the action here?

  1. The server shouldn't consider streams in the graceful timeout queue as active. The current 2x limit workaround isn't a good enough fix. And/or...
  2. The client should respond to the server's RST_STREAM (NO_ERROR) with a RST_STREAM (NO_ERROR) of its own. That will explicitly tell the server that the stream is done

@Tratcher (Contributor) commented Aug 13, 2019

Yes, that's the scenario as I understand it, except that some of the request data hasn't been sent yet. This wouldn't be an issue if the client had already sent END_STREAM (or RST_STREAM) as it would already be considered half closed.

https://http2.github.io/http2-spec/#RST_STREAM
A)

The RST_STREAM frame fully terminates the referenced stream and causes it to enter the "closed" state. After receiving a RST_STREAM on a stream, the receiver MUST NOT send additional frames for that stream, with the exception of PRIORITY.

B)

However, after sending the RST_STREAM, the sending endpoint MUST be prepared to receive and process additional frames sent on the stream that might have been sent by the peer prior to the arrival of the RST_STREAM.

At stage 7 the server still needs to gracefully handle incoming frames for an arbitrary period. It can't track these streams indefinitely. To prevent too much memory build up on the server we capped the number of these streams at 2x MaxStreams before the server is considered overloaded and responds with EnhanceYourCalm.

The server shouldn't consider streams in the graceful timeout queue as active. The current 2x limit workaround isn't a good enough fix. And/or...

That could cause the server to consume large amounts of memory tracking streams. Alternatively, we'd have to find a lighter weight way to track these.

The client should respond to the server's RST_STREAM (NO_ERROR) with a RST_STREAM (NO_ERROR) of its own. That will explicitly tell the server that the stream is done

The client isn't obligated to respond with a RST_STREAM, but it would help acknowledge that the stream is fully closed and can be discarded.

@halter73 (Member) commented Aug 13, 2019

I agree with @Tratcher that the client should send an RST back upon receiving an RST if it hasn't sent an RST/EOS already. I'm pretty sure that's what the server does.

Given a certain amount of time after sending the RST (5 seconds I think), Kestrel will stop tracking the stream and stop counting it against the 2x limit. If the client wants to quickly open new streams (it must be doing so at least 20/second to hit this limit if I'm right about the timeout being 5s), it needs to explicitly tell the server that it's done with the old ones. I don't think it's reasonable to expect Kestrel to track over 100 half-closed streams at a time when the open stream limit is only 100 to begin with.
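
(Roughly, and assuming MaxStreams = 100: the 2x cap leaves room for about 100 server-reset streams awaiting drain beyond the 100 active ones, and 100 streams / 5 s = 20 reset streams per second before the cap is hit.)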

@Tratcher (Contributor) commented Aug 13, 2019

@halter73 the streams are legally closed by a reset from either end. The only trick is how to implement "However, after sending the RST_STREAM, the sending endpoint MUST be prepared to receive and process additional frames sent on the stream that might have been sent by the peer prior to the arrival of the RST_STREAM."

We do this today by holding the entire stream structure even after the app is finished with it. We could do something much lighter by keeping a separate collection with just a stream id and a grace period timestamp. We could have 10x MaxStream count in that collection without it being a burden to the server.
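
A rough illustration of that lighter-weight tracking (sketch only, with invented names; not Kestrel's actual implementation):

// Illustrative sketch only; not Kestrel code.
// Tracks just the stream id and when its grace period ends.
using System;
using System.Collections.Generic;

public sealed class DrainedStreamTracker
{
    private readonly Queue<(int StreamId, DateTime ExpiresAt)> _drained = new Queue<(int StreamId, DateTime ExpiresAt)>();
    private readonly HashSet<int> _drainedIds = new HashSet<int>();
    private readonly TimeSpan _gracePeriod;

    public DrainedStreamTracker(TimeSpan gracePeriod) => _gracePeriod = gracePeriod;

    // Called when the server resets a stream before the client has finished sending on it.
    public void Add(int streamId)
    {
        Prune();
        _drained.Enqueue((streamId, DateTime.UtcNow + _gracePeriod));
        _drainedIds.Add(streamId);
    }

    // Frames for a tracked stream are ignored during the grace period;
    // afterwards they would be treated as a protocol error.
    public bool ShouldIgnoreFrame(int streamId)
    {
        Prune();
        return _drainedIds.Contains(streamId);
    }

    private void Prune()
    {
        while (_drained.Count > 0 && _drained.Peek().ExpiresAt <= DateTime.UtcNow)
        {
            _drainedIds.Remove(_drained.Dequeue().StreamId);
        }
    }
}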

@halter73 (Member) commented Aug 13, 2019

We definitely could make the half-closed stream tracking more efficient, but I don't think that gets us much. Even if we made the limit 10x the max stream count instead of 2x, the client would still only be able to send ~200 requests per second. That may be better than 20 RPS, but not by much. Of course, if the client were to respond to the server RST with an RST/EOS of its own before opening new streams, there wouldn't be this seemingly arbitrary RPS limit.

The only way to get a reasonable RPS without tracking thousands or even millions of half-closed streams (given the 5s grace period) is to have the client finish closing the streams before opening new ones.

@Tratcher (Contributor) commented Aug 13, 2019

The only way to get a reasonable RPS without tracking thousands or even millions of half-closed streams (given the 5s grace period) is to have the client finish closing the streams before opening new ones.

Except the client is not required nor encouraged to do so in the spec, and we can't influence how any other client will behave here. Focusing on the server behavior will be more generally applicable.

@JamesNK (Member, Author) commented Aug 13, 2019

We definitely could make the half-closed stream tracking more efficient, but I don't think that gets us much. Even if we made the limit 10x the max stream count instead of 2x, the client would still only be able to send ~200 requests per second.

If it is more efficient to track half-closed streams then it makes sense that they wouldn't be included in the open max stream limit. Half-closed streams would have no limit, or a different/higher limit.

@wfurt (Member) commented Aug 13, 2019

It seems strange to respond to a RST with a RST. For the sake of this debate, I made a quick change to do that, and the test now passes for me.

When I look at the RFC (https://http2.github.io/http2-spec/#StreamStatesFigure), it seems that sending/receiving END_STREAM moves the stream to a half-closed state, but sending/receiving RST moves it to closed (as with TCP).

Would it be possible to test this against a different server implementation, @JamesNK?

This also seems like a corner case: we fail every single request, and that happens faster than the server expires half-closed streams.

I'm not sure if this is something we need to address for 3.0. Any thoughts on this @davidsh @geoffkizer @stephentoub?
Perhaps we could also surface the error code from the RST instead of a generic cancellation.

@halter73 (Member) commented Aug 13, 2019

I wasn't being strict with my definition of half-closed in my last comment. I was merely referring to server-reset streams that Kestrel is still tracking. I agree this is a corner case. Supporting 20 RPS of server-reset streams seems like enough even if exceeding that kills the connection.

@wfurt (Member) commented Aug 13, 2019

What if we do not count dying streams toward the limit, but once their count grows beyond a threshold the server sends GOAWAY to prevent more streams? That would allow the client to do a graceful transition. Right now the server is enforcing a limit the client is not aware of.
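
Something like this sketch (hypothetical names, not actual Kestrel code):

// Hypothetical sketch of the proposal; names and types are invented for
// illustration and are not Kestrel APIs.
using System;

public sealed class ConnectionDrainPolicy
{
    private readonly int _drainedStreamThreshold;
    private int _drainedStreamCount;
    private bool _goAwaySent;

    public ConnectionDrainPolicy(int drainedStreamThreshold) =>
        _drainedStreamThreshold = drainedStreamThreshold;

    // Called each time the server resets a stream it has finished with but the
    // client may still be sending on. These streams would no longer count
    // toward MaxStreamsPerConnection.
    public void OnStreamDrained(Action<int> sendGoAway, int highestProcessedStreamId)
    {
        _drainedStreamCount++;

        // Once too many drained streams pile up, announce GOAWAY so the client
        // can move to a new connection instead of being surprised by EnhanceYourCalm.
        if (_drainedStreamCount > _drainedStreamThreshold && !_goAwaySent)
        {
            sendGoAway(highestProcessedStreamId);
            _goAwaySent = true;
        }
    }

    // Called when a drained stream's grace period ends and it is forgotten.
    public void OnStreamForgotten() => _drainedStreamCount--;
}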

@halter73 (Member) commented Aug 13, 2019

That sounds reasonable @wfurt. I created an issue in the AspNetCore repo at aspnet/AspNetCore#13097.

I'm not sure it's worth giving this issue a super high priority. Feel free to comment on the issue if you think this is worth prioritizing highly.

@wfurt (Member) commented Aug 13, 2019

I don't view it as urgent unless this blocks some gRPC scenario. I'm going to move this to 5.0 for now unless somebody objects.

@wfurt wfurt modified the milestones: 3.0, 5.0 Aug 13, 2019

@geoffkizer (Member) commented Aug 16, 2019

It would be cleaner for the client to respond to the server's EndStream/reset with its own reset so both sides could safely consider the stream closed, then the stream can be discarded.

The RFC specifically prohibits this. https://tools.ietf.org/html/rfc7540#section-5.4.2:

   To avoid looping, an endpoint MUST NOT send a RST_STREAM in response
   to a RST_STREAM frame.

@geoffkizer (Member) commented Aug 16, 2019

@wfurt Nice work tracking this down. There doesn't appear to be a client issue here, so I'm closing.

@geoffkizer geoffkizer closed this Aug 16, 2019
