SocketException - An existing connection was forcibly closed by the remote host. HttpClientFactory HttpClient #52267
Tagging subscribers to this area: @dotnet/ncl. Issue details are quoted below. |
@punitsinghi any chance you can share those packet captures here or privately? |
Without IP addresses, here is how it looks. Will this help? To include IP addresses, I will have to check with our Infra team. |
@karelz - Can you please let me know what additional details are needed? |
@punitsinghi I am not sure the above screenshot is sufficient to confirm your suspicion. I'll let @antonfirsov check it out and say more. |
For me this screenshot is not very helpful; it would be great to see the whole communication sequence with timestamps, to help detailed diagnosis.
If we can confirm that this race condition exists, and we want to support concurrent HTTP requests to a server that may (for any reason) terminate its connections, we may try to fix this by detecting such closures. @punitsinghi if I were you, I would try to get some insight about the server, if possible, to understand why it is closing down the connections. |
Triage: Well-behaved servers should send "Connection: close" in the response headers. The server should never reset the connection (unless the client takes too long to disconnect). |
@karelz @antonfirsov Thanks for the input. Let me see if I can pass on the screenshot privately with all the details. We have the idle timeout on both client and server set to 120 seconds. My understanding is that HttpClientFactory will use PooledConnectionIdleTimeout, which is 2 minutes, while on our IIS it is also 120 seconds. Do you think in this scenario the client can take too long to disconnect? https://docs.microsoft.com/en-us/dotnet/api/system.net.http.socketshttphandler.pooledconnectionidletimeout?view=netcore-3.1 Also, is it normal for Azure to On Premise API calls to get such connection closes due to network delay? |
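For reference, when using HttpClientFactory the pooled idle timeout can be set explicitly by supplying a SocketsHttpHandler. This is a minimal sketch assuming .NET Core 3.1 and Microsoft.Extensions.Http; the client name, URL, and 60-second value are placeholders, not recommendations from this thread:

```csharp
// In Startup.ConfigureServices. Keeping the client-side idle timeout
// well below the server's (IIS default: 120 s) means the client closes
// idle connections before the server can reset them.
services.AddHttpClient("onprem", client =>
{
    client.BaseAddress = new Uri("https://onprem.example.com/");
})
.ConfigurePrimaryHttpMessageHandler(() => new SocketsHttpHandler
{
    PooledConnectionIdleTimeout = TimeSpan.FromSeconds(60)
});
```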
@karelz: Having had to write a custom HTTP client, I'd say there's something you can do. If you get the "reset" error while uploading a request (before you start waiting for the response), connect anew and try again. Don't loop, though; only use this logic on a connection you got back from the pool. |
By taking too long to disconnect I meant the reaction to "Connection: close". If both server and client have a 2-minute timeout for closing connections, then it is possible that the server will send the rude closure. The question is how likely it is that exactly at that time the client is trying to reuse the connection. That should be pretty rare IMO.
In that case users should get used to reacting to failures like these with retry policies, etc. @jhudsoncedaron do you mean to close the connection if we didn't finish sending the request yet? It also means we would have to buffer the request body, which we don't do today AFAIK. |
We already have a mechanism to know whether we can retry or not. Also note that whether the request can be automatically retried depends on whether the method is idempotent or not: https://tools.ietf.org/html/rfc7230#section-6.3.1. I'd need more details on the exception and the request to say whether we could even retry it or not, because your case might not be automatically retryable. Edit: My thoughts are that this is probably not worth it. AFAIK only PUT would be a retryable request with a body (I know other methods can have a body as well, but they usually don't, and I'm generalizing here). We would need to introduce request body buffering and keep it in memory until we get the response. If we entertain this idea it would definitely have to be opt-in, since it can have a detrimental effect on performance. IMHO this is a lot of work for a small gain. If retrying the request in all exceptional cases is your goal, you can always do that in your own code on top of HttpClient. |
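To illustrate the idempotency rule from RFC 7230 that the comment cites, a client-side retry gate might look like this. A sketch only, not what SocketsHttpHandler does internally:

```csharp
using System.Linq;
using System.Net.Http;

static class RetryGate
{
    // Idempotent methods per RFC 7231 §4.2.2; these are safe to retry
    // automatically even when we can't tell whether the server already
    // processed the request.
    private static readonly string[] Idempotent =
        { "GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE" };

    public static bool CanAutoRetry(HttpRequestMessage request) =>
        Idempotent.Contains(request.Method.Method.ToUpperInvariant());
}
```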
I actually mean (pseudo-code):

try {
    await connection.WriteAsync(first packet data);
} catch (SocketException|IOException) when (_pooledConnection) {
    connection = GetConnection();
    _pooledConnection = false;
    await connection.WriteAsync(first packet data);
}

I ran some experiments in .NET Core 3.0 to try to convince another vendor that network instability is a real thing and they can't assume that I got the HttpResponse just because they sent it, and found the recovery logic wasn't quite this good. |
@jhudsoncedaron Can you please share your full custom HttpClient implementation? |
We already retry requests when we can verify that they haven't been processed, including when a reset happens on an idle connection between requests. So this is likely a reset mid-request, where we can't verify the server hasn't processed the request. In this case the only safe thing to do is to throw; this isn't something we'll be able to change. A mid-request connection reset is generally not correct behavior. A server would do this if it sees a protocol error, a malicious client, or maybe if server-side code threw an exception. Consider checking server/firewall logs to understand the root cause here. Since you are already using a retry policy, that is the right client-side mitigation.
Can you grab a wireshark capture of identical requests on both .NET Framework and .NET Core so we can see what the difference is between the two platforms? |
Thanks guys for the input. @karelz / @scalablecory / @antonfirsov - I can share the Wireshark capture privately; can you please let us know how I can do that? @scalablecory - Thanks for the details. So on a reset due to idle timeout, it should have retried, looking at the code? Yes, we have implemented Polly and are currently retrying on HttpRequestException, as some of our legacy APIs are idempotent and some are not. Does HttpRequestException guarantee that the server has not processed the request, especially on .NET Core, since on timeout it throws TaskCanceledException for .NET Core apps? We are trying to see whether it will be safe to retry in such cases. |
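For reference, a Polly registration along these lines retries transient transport failures with backoff. This is a sketch assuming the Microsoft.Extensions.Http.Polly package and a placeholder client name; note, per the discussion above, that a retried request may still have been processed by the server, so this belongs on idempotent calls only:

```csharp
// In Startup.ConfigureServices, alongside the AddHttpClient registration.
services.AddHttpClient("onprem")
    .AddTransientHttpErrorPolicy(policy =>
        // Handles HttpRequestException, 5xx, and 408 responses with
        // exponential backoff (2 s, 4 s, 8 s).
        policy.WaitAndRetryAsync(3, attempt =>
            TimeSpan.FromSeconds(Math.Pow(2, attempt))));
```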
@punitsinghi: No. I was able to determine whether or not it was resumable from the stack trace, but refused to write that code. |
@punitsinghi our emails are linked from our GH profiles. If you want a more official way, you would have to go through official Microsoft support.
No, we cannot provide any guarantees. As soon as we send any part of the request out, we can't tell if the server received it, processed it, or took any action on it ... that seems to be your scenario per your description. |
Thanks @karelz, we have started a discussion through MS support as well. One of the engineers from MS support recommended increasing the server idle timeout, as the IIS idle timeout is 120 seconds (the default) and PooledConnectionIdleTimeout is also 120 seconds, so a network delay between Azure and On Premise could trigger this issue. But based on @scalablecory's reply, I think this may not help, since HttpClient would have retried on a connection reset due to idle timeout and we still see the issue happening. If you don't mind reviewing the Wireshark logs once and letting us know your feedback, we can go back and provide feedback to MS support. We are trying to find the root cause, which we haven't found yet, and your help will be greatly appreciated. @scalablecory - I talked to our network engineer, and according to him a reset mid-request can happen on idle timeout as well. |
Please work with your MS support contact to upload the Wireshark traces to Microsoft. They can take a first look and they can contact us internally if needed. Thanks! |
Ok Thanks @karelz. |
BTW, the server should try to avoid sending RST if possible in these sorts of cases. From the RFC:
|
I think it's worth trying to adjust both of these values (the server and client idle connection timeouts). In particular, try making the client connection timeout significantly less than the server timeout, e.g. 120 seconds on the server and 60 on the client. It's always better for the client to close an idle connection, because when the server does it you get the timing issue you are seeing here. |
There is a 408 status code that could be used in scenarios like these, see here: https://www.rfc-editor.org/rfc/rfc7231.html#section-6.5.7 However, servers don't seem to implement this for idle connection close (as seen here in particular), and we don't support it in HttpClient. Perhaps we should consider that, though it would only help if the server participates as well. |
This is from my email about what I'm assuming is the same issue, but it seems like this would be better discussed in public.
Kestrel used to try to do this, but it was more trouble than it was worth. Many clients would wait entirely too long to close their half of the connection, and I cannot think of a single scenario where Kestrel would close the connection while the client was still trying to send data that wasn't really supposed to be abortive. HTTP clients don't typically pipeline, and Kestrel sets the appropriate Connection: close header. In this scenario, it wouldn't help anyway. Even if the connection was only half closed, the client still would not get a response to the new request. @dotnet/ncl @dotnet/http I really think we need to consider changing server and/or client idle connection timeouts. With HTTP/1.1 it can cause problems like this when they're exactly the same. The server timeout should be larger by default. |
I think the only interesting case is the race between the server deciding a connection is idle and closing it, and the client at the same time trying to send a new request on that connection. If it's a GET request, or a POST with 100-continue, etc., then a FIN will trigger us to retry the request, whereas a RST generally will not. But if it's a POST without 100-continue, we will start sending the request body anyway and won't retry because of that. What might help, as I mention above, is using the 408 response code, as that would allow us to retry in all cases. |
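One way to opt a POST into the 100-continue behavior described above is via the request headers. A sketch with a hypothetical URL; whether this actually avoids the lost-body race depends on the server honoring Expect:

```csharp
using System.Net.Http;

var request = new HttpRequestMessage(HttpMethod.Post, "https://onprem.example.com/api/orders");
request.Content = new StringContent("{}");
// Ask the server for "100 Continue" before sending the body; if the
// connection is reset first, the body was never sent and the request
// is safer to retry.
request.Headers.ExpectContinue = true;
// Then: HttpResponseMessage response = await client.SendAsync(request);
```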
The server cannot respond with a 408 in this case, it's already closed its half of the connection. It cannot send more data. Theoretically, it could continue acking the request to avoid the RST, but what's the point when the last data it flushed happened at least 2 minutes ago? |
Yes, I completely agree. What is the Kestrel default here? SocketsHttpHandler is 120 seconds. |
The idea would be to send a 408 response instead of closing an idle connection. It's not in response to a request; rather it's sent proactively in case the client is racing to use the connection at the same time. That way, if the client receives a 408 response, it knows the server is shutting down the connection and didn't process the request at all, and the request can be retried. |
It's basically GOAWAY for HTTP/1.1. |
If you send an unsolicited 408 on an idle connection, you might block. |
Block what? |
The send() system call that sends the 408 might never finish. |
They're all 120 seconds. KestrelServerLimits.KeepAliveTimeout. I'll update my earlier comment to include the other links from the email.
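For reference, Kestrel's keep-alive timeout is configurable through KestrelServerLimits. A sketch of raising it above the client-side default, assuming an ASP.NET Core host; the `webBuilder` variable and the 240-second value are illustrative:

```csharp
// In host setup (Program.cs). Making the server-side keep-alive timeout
// larger than the client's PooledConnectionIdleTimeout (120 s by default)
// means the client closes idle connections first, avoiding the reset race.
webBuilder.ConfigureKestrel(options =>
{
    options.Limits.KeepAliveTimeout = TimeSpan.FromSeconds(240);
});
```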
So the suggestion is to do this every time we close an HTTP/1.1 connection due to a keep-alive timeout? Theoretically, I guess the client should ignore this unsolicited response. Does any server actually do this though? Seems risky. |
I don't know. I'm not aware of any. |
FWIW, SocketsHttpHandler is definitely guilty of this in some cases. But we could make this better if there was a good reason to. |
Thanks. We are now increasing our IIS idle timeout to 150 seconds and will see if that works in PROD. |
We set the IIS Connection Time-Out to 240 seconds but we still see a similar error. Looks like the issue could be due to something else. |
@punitsinghi do you have any update here? |
@karelz - Thanks for the follow-up. The issue was due to an F5 server; we applied a patch from F5 last week and don't see the error now. Thanks everyone for the input and feedback. |
Closing as external based on the last update. |
We have an Azure Function / API which calls an On Premise API, and we are intermittently seeing a SocketException with the below error message -
An error occurred while sending the request. Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.. An existing connection was forcibly closed by the remote host.
Our Azure Function / API is on .NET Core 3.1 and uses HttpClientFactory to create the HttpClient we use to call our On Premise API. We earlier had a using block when we called CreateClient, which we have since removed, but we don't think this alone will avoid the SocketException issue.

Based on a Wireshark capture, we see reset packets sent by the server when this error happens, which means the server is closing the connection while HttpClient is reusing that same connection for new requests. Basically, if the new HTTP request and the TCP reset are in flight across the network at the same time, I think it manifests as a socket error; it's a kind of race condition. We are implementing retry logic, hoping that the next retry will succeed given that a new connection is created.

I would like to know whether this is a known issue with HttpClientFactory, as we don't see it on the .NET Framework side, where we use a static HttpClient. We have on premise APIs which use a static HttpClient to call other on premise APIs, and we don't see this issue there. I'd also like to know whether there is anything we can change on our Azure infrastructure side to reduce such issues.
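On the using block mentioned above: clients returned by IHttpClientFactory should not be disposed per call, since the factory manages the underlying handler lifetimes. A minimal sketch; the class and client name are hypothetical:

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public class OnPremCaller
{
    private readonly IHttpClientFactory _factory;
    public OnPremCaller(IHttpClientFactory factory) => _factory = factory;

    public async Task<string> CallAsync()
    {
        // No using block: the factory pools and recycles handlers itself;
        // disposing the HttpClient per call is unnecessary (though not the
        // cause of the reset race discussed in this thread).
        HttpClient client = _factory.CreateClient("onprem");
        return await client.GetStringAsync("api/values");
    }
}
```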
AppInsights also shows the ResultCode as Faulted, which means it didn't receive the HTTP response from the server.
Stack trace-