Consider using Happy Eyeballs or similar in SocketsHttpHandler #26177

Open
JustArchi opened this issue May 15, 2018 · 56 comments
Labels: area-System.Net.Http, enhancement (product code improvement that does NOT require public API changes/additions)

@JustArchi
Contributor

Repro: HttpClientBug.zip

using System;
using System.Net.Http;
using System.Threading.Tasks;

internal static class Program
{
    private static async Task Main()
    {
        AppDomain.CurrentDomain.UnhandledException += OnUnhandledException;
        TaskScheduler.UnobservedTaskException += OnUnobservedTaskException;
        using (var httpClient = new HttpClient())
        {
            try
            {
                await httpClient.GetAsync("https://translate.google.com").ConfigureAwait(false);
                Console.WriteLine("OK");
            }
            catch (Exception e)
            {
                Console.WriteLine(e);
            }
        }
    }

    private static void OnUnobservedTaskException(object sender, UnobservedTaskExceptionEventArgs e)
    {
        Console.WriteLine(e.Exception);
    }

    private static void OnUnhandledException(object sender, UnhandledExceptionEventArgs e)
    {
        Console.WriteLine(e.ExceptionObject);
    }
}

I reproduced this on Linux; I had no luck reproducing it on Windows.

Run repro with dotnet run. After a default timeout of around 60 seconds, you'll get:

$ dotnet run
System.OperationCanceledException: The operation was canceled.
   at System.Net.Http.HttpClient.HandleFinishSendAsyncError(Exception e, CancellationTokenSource cts)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at HttpClientBug.Program.Main() in /tmp/httpclientbug/Program.cs:line 17

Doing the same by forcing older curl handler:

$ DOTNET_SYSTEM_NET_HTTP_USESOCKETSHTTPHANDLER=0 dotnet run
OK

Please note that this issue is not reproducible with just any HTTPS server; the majority of them work fine. I encountered it when accessing https://translate.google.com, which is what I used in my repro above.

I reproduced this bug on latest master as well as .NET Core 2.1 rc1.

This could be a regression, because my app previously worked fine with this URL on the master SDK, including with SocketsHttpHandler, which I had been using for a while. More likely, a configuration change on Google's servers triggered a bug that has existed in the code for some time. Of course, this is not reproducible on .NET Core 2.0, since there is no SocketsHttpHandler there.

Thank you in advance for looking into this.

.NET Core SDK (reflecting any global.json):
 Version:   2.2.100-preview1-008636
 Commit:    6c9942bae6

Runtime Environment:
 OS Name:     debian
 OS Version:
 OS Platform: Linux
 RID:         debian-x64
 Base Path:   /opt/dotnet/sdk/2.2.100-preview1-008636/

Host (useful for support):
  Version: 2.1.0-preview3-26411-06
  Commit:  8faa8fcfcf

.NET Core SDKs installed:
  2.2.100-preview1-008636 [/opt/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.1.0-preview2-30475 [/opt/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.1.0-preview2-30475 [/opt/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.1.0-preview3-26411-06 [/opt/dotnet/shared/Microsoft.NETCore.App]

[EDIT] Inline C# source code by @karelz

@JustArchi JustArchi changed the title HttpClient doesn't work with certain https servers with SocketHttpHandler (.NET Core 2.1+) HttpClient doesn't work with certain https servers via SocketHttpHandler (.NET Core 2.1+) May 15, 2018
@davidsh
Contributor

davidsh commented May 15, 2018

cc: @stephentoub

@karelz
Member

karelz commented May 15, 2018

Thanks @JustArchi for your report!
Can you please clarify repro status on Windows? Did it repro or not? Or did you not try at all? (that's ok, just clarifying your comment above)

@karelz
Member

karelz commented May 15, 2018

@wfurt can you please take a look?
cc @geoffkizer

@JustArchi
Contributor Author

@karelz I checked .NET Core 2.1 rc1 on Windows and couldn't reproduce the issue, got OK from my repro.

image

@wfurt
Member

wfurt commented May 15, 2018

I did quick try on Ubuntu 16.04 and I also get OK. What base OS do you use @JustArchi ?

@JustArchi
Contributor Author

It's Debian Testing (currently 10/Buster) with all available updates, kernel 4.16.0-1-amd64 (Debian 4.16.5-1).

I also thought the OS could have something to do with it, but then curl handler wouldn't work either, so it's possible that it's some OS-layer incompatibility.

@JustArchi
Contributor Author

JustArchi commented May 15, 2018

It's also good to note that I couldn't reproduce it with other https servers, only Google's one gave me issues, so it's not entirely broken either. I'm sure that my SSL certs are OK, otherwise curl wouldn't work either.

I checked my other dev machine running Debian sid and couldn't reproduce the issue there either. I'll set up another one on testing in the meantime to make sure it's not on Debian's end, but then again, if it were some IP-related block or the like, curl shouldn't work either.

@JustArchi
Contributor Author

JustArchi commented May 15, 2018

I tested another machine on Debian Testing and couldn't reproduce the issue either, so it has to be something with one of my machines. I wonder how I can help narrow down this issue and what the root cause is in the first place. Do you have any idea what I could provide or do to help?

image

In the meantime I'll keep looking, maybe I find some factor that could possibly be causing this on OS side.

@wfurt
Member

wfurt commented May 15, 2018

You can always try packet capture. You can also check if the name resolves to same IP address.
Also is the hardware on both machines similar? number of cores would for example impact size of thread pool.

@karelz
Member

karelz commented May 15, 2018

@JustArchi so it seems it is one machine problem (with a specific URL) on Debian development branch. Is that correct?
If that's the case, I think the best thing is to debug it through all layers and identify which one behaves incorrectly.

@JustArchi
Contributor Author

JustArchi commented May 15, 2018

Yes, one machine for now, and I found out the first clue.

My dev machines that I tried to reproduce this issue on (and had no luck) don't have IPv6, they used IPv4 address of the translate.google.com and that worked. On the machine with the bug reproduced, I forced usage of IPv4 by manually hardcoding 216.58.215.46 translate.google.com to /etc/hosts, and that one did the trick, I couldn't run into the bug anymore. So this issue is IPv6 related, now the question is whether this is still my machine-specific IPv6 related issue, or something more global.

Since I don't have more IPv6 machines around, could you try to reproduce this bug for me on IPv6-enabled Linux machine? I don't believe Debian has anything to do with it, so it could be any Linux machine.

I wonder if the curl handler also uses IPv6 by default; that could be the reason why I see a behaviour change between the two. I'm going to keep digging and will let you know if I find out something more. It could also be some IPv6-related bug in the socket handler, although in that case I'm pretty sure somebody else would have found it much sooner than me. Maybe it's some specific combination; I'll try to find out.

@wfurt
Member

wfurt commented May 15, 2018

can you do packet capture and trace the attempt? I'm wondering if IPv6 is really used or if it falls back to IPv4 after some time.
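
One way to check this without a packet capture is to probe every resolved address directly. The following is a minimal sketch, assuming .NET 5 or later (for the cancellable `Socket.ConnectAsync` overload); `AddressProbe`, `CanConnectAsync`, and `ProbeAsync` are hypothetical names used only for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class AddressProbe
{
    // Tries a plain TCP connect to one address. Returns true only if the
    // connection is established within the timeout; errors and timeouts
    // both return false.
    public static async Task<bool> CanConnectAsync(IPAddress address, int port, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);
        try
        {
            using var socket = new Socket(address.AddressFamily, SocketType.Stream, ProtocolType.Tcp);
            await socket.ConnectAsync(new IPEndPoint(address, port), cts.Token).ConfigureAwait(false);
            return true;
        }
        catch (Exception e) when (e is SocketException || e is OperationCanceledException)
        {
            return false;
        }
    }

    // Resolves the host and probes every returned address in turn, so the
    // result shows which address families are actually reachable.
    public static async Task<List<(IPAddress Address, bool Reachable)>> ProbeAsync(
        string host, int port, TimeSpan timeout)
    {
        var results = new List<(IPAddress Address, bool Reachable)>();
        foreach (IPAddress address in await Dns.GetHostAddressesAsync(host).ConfigureAwait(false))
        {
            bool reachable = await CanConnectAsync(address, port, timeout).ConfigureAwait(false);
            results.Add((address, reachable));
        }
        return results;
    }
}
```

Calling `AddressProbe.ProbeAsync("translate.google.com", 443, TimeSpan.FromSeconds(5))` and printing each pair lists the resolved addresses together with whether a TCP connection to each one succeeded.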

@JustArchi
Contributor Author

JustArchi commented May 15, 2018

I did the test and I indeed confirmed my initial thought, this is not a bug in socket http handler itself, but rather different behaviour compared to curl handler. Could be also called a regression if we're comparing default behaviour between .NET Core 2.0 and 2.1.

In general, my machine can't access translate.google.com through IPv6, why exactly - I'm going to find out myself, this is not important for now.

What is important is the fact that it seems that curl handler tries to fallback to IPv4 when IPv6 fails, this is why curl handler works.

If I force curl to use IPv6 exclusively, this also fails:

$ curl -I -m 5 -6 "https://translate.google.com"
curl: (28) Connection timed out after 5001 milliseconds

While -4 works:

$ curl -I -m 5 -4 "https://translate.google.com"
HTTP/2 403

(We're getting 403 but it's irrelevant, the connection being established matters)

So what is curl doing by default? Let's find out:

$ curl -v -I -m 5 "https://translate.google.com"
* Rebuilt URL to: https://translate.google.com/
*   Trying 2a00:1450:400c:c0a::8a...
* TCP_NODELAY set
*   Trying 216.58.215.46...
* TCP_NODELAY set
* Connected to translate.google.com (216.58.215.46) port 443 (#0)

As you can see, it tries IPv6 first, fails, then tries IPv4 next. SocketsHttpHandler doesn't do that: once it gets an IPv6 address from DNS resolution, it tries to connect through that address only, even if it doesn't work while the IPv4 address does.

The question is whether this is intended behaviour (then we can close the issue), or whether we could improve it (e.g. make it work like the curl handler). I strongly believe HttpClient would benefit greatly from doing exactly what curl does: trying the next (different) IP address for the same domain, if possible.

I'm pretty sure that you can reproduce this reliably with any IPv6-enabled machine requesting any domain with AAAA and A DNS entries where IPv6 address simply times out, while IPv4 works just fine. It happened to be google server in my case, why exactly - probably a coincidence, temporary issue, or some machine-specific problem I'm going to find out myself, since it's not important for this issue.
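
The curl-like behaviour described above (try each resolved address in order, giving up on one that does not answer in time) can be sketched as follows. This is only an illustration under stated assumptions: it relies on `SocketsHttpHandler.ConnectCallback`, which exists in .NET 5 and later but not in the .NET Core 2.1 builds discussed here; `SequentialFallback` and its members are hypothetical names, and the two-second per-address budget is an arbitrary value, not something taken from curl or .NET.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Http;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class SequentialFallback
{
    // Hypothetical per-address budget, chosen for illustration only.
    public static readonly TimeSpan PerAddressTimeout = TimeSpan.FromSeconds(2);

    // Tries each address in order and returns the first socket that connects.
    public static async Task<Socket> ConnectAsync(
        IReadOnlyList<IPAddress> addresses, int port, CancellationToken cancellationToken)
    {
        var errors = new List<Exception>();
        foreach (IPAddress address in addresses)
        {
            cancellationToken.ThrowIfCancellationRequested();
            using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
            attemptCts.CancelAfter(PerAddressTimeout);
            try
            {
                return await ConnectOneAsync(address, port, attemptCts.Token).ConfigureAwait(false);
            }
            catch (Exception e) when (e is SocketException || e is OperationCanceledException)
            {
                // This address failed or timed out: remember why and try the next one.
                errors.Add(e);
            }
        }
        cancellationToken.ThrowIfCancellationRequested();
        throw new AggregateException("Could not connect to any resolved address.", errors);
    }

    // Connects a new socket to a single address, disposing it if the attempt fails.
    private static async Task<Socket> ConnectOneAsync(IPAddress address, int port, CancellationToken token)
    {
        var socket = new Socket(address.AddressFamily, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            await socket.ConnectAsync(new IPEndPoint(address, port), token).ConfigureAwait(false);
            return socket;
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }

    // Wires the fallback into an HttpClient through ConnectCallback.
    public static HttpClient CreateClient()
    {
        var handler = new SocketsHttpHandler
        {
            ConnectCallback = async (context, cancellationToken) =>
            {
                IPAddress[] addresses =
                    await Dns.GetHostAddressesAsync(context.DnsEndPoint.Host).ConfigureAwait(false);
                Socket socket = await ConnectAsync(addresses, context.DnsEndPoint.Port, cancellationToken)
                    .ConfigureAwait(false);
                return new NetworkStream(socket, ownsSocket: true);
            }
        };
        return new HttpClient(handler);
    }
}
```

With this sketch, a dead IPv6 address costs at most `PerAddressTimeout` before the IPv4 address is tried, so the request no longer depends on the overall HttpClient timeout to give up on the first address.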

@JustArchi JustArchi changed the title HttpClient doesn't work with certain https servers via SocketHttpHandler (.NET Core 2.1+) Behaviour change between CURLHandler and SocketHttpHandler regarding IPv6 fallback (.NET Core 2.1+) May 15, 2018
@davidsh
Contributor

davidsh commented May 15, 2018

As you can see, it tries IPv6 first, fails, then tries IPv4 next. SocketsHttpHandler doesn't do that: once it gets an IPv6 address from DNS resolution, it tries to connect through that address only, even if it doesn't work while the IPv4 address does.

SocketsHttpHandler should be trying all the IP addresses (IPv6 and IPv4) that are returned from the DNS resolver API calls. On Windows, there is an API to connect to the DNS name which will automatically try IPv6 and IPv4 in parallel to speed up connection result so that it won't have to waste time failing on the IPv6 address before it gets connected on the IPv4.
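
The idea of attempting IPv6 and IPv4 concurrently, so that a dead address does not delay the connection, can be sketched in managed code as a staggered race in the spirit of Happy Eyeballs (RFC 8305). This is a simplified illustration, not how the Windows API or SocketsHttpHandler is implemented: `ConnectionRace` and its members are hypothetical names, the 250 ms stagger is an assumed value, and the cancellable `Socket.ConnectAsync` overload requires .NET 5 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

public static class ConnectionRace
{
    // Hypothetical head start each attempt gets before the next one begins.
    public static readonly TimeSpan Stagger = TimeSpan.FromMilliseconds(250);

    // Starts attempts one after another, Stagger apart (or immediately after a
    // failure), and returns the first socket that connects.
    public static async Task<Socket> ConnectAsync(
        IReadOnlyList<IPAddress> addresses, int port, CancellationToken cancellationToken)
    {
        using var raceCts = CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
        var pending = new List<Task<Socket>>();
        var errors = new List<Exception>();
        int next = 0;
        try
        {
            while (true)
            {
                cancellationToken.ThrowIfCancellationRequested();
                if (next < addresses.Count)
                    pending.Add(AttemptAsync(addresses[next++], port, raceCts.Token));
                else if (pending.Count == 0)
                    break; // every address was tried and every attempt failed

                // Wake up when any attempt finishes, or when it is time to start the next one.
                var waitFor = new List<Task>(pending);
                if (next < addresses.Count)
                    waitFor.Add(Task.Delay(Stagger, raceCts.Token));
                Task finished = await Task.WhenAny(waitFor).ConfigureAwait(false);

                if (finished is Task<Socket> attempt)
                {
                    pending.Remove(attempt);
                    if (attempt.Status == TaskStatus.RanToCompletion)
                        return attempt.Result; // first successful connection wins
                    errors.Add(attempt.Exception?.GetBaseException() ?? new OperationCanceledException());
                }
            }
            throw new AggregateException("Could not connect to any resolved address.", errors);
        }
        finally
        {
            // Abandon the slower attempts and dispose any socket they still produce.
            raceCts.Cancel();
            foreach (Task<Socket> loser in pending)
                _ = DisposeWhenDoneAsync(loser);
        }
    }

    // Connects a new socket to a single address, disposing it if the attempt fails.
    private static async Task<Socket> AttemptAsync(IPAddress address, int port, CancellationToken token)
    {
        var socket = new Socket(address.AddressFamily, SocketType.Stream, ProtocolType.Tcp) { NoDelay = true };
        try
        {
            await socket.ConnectAsync(new IPEndPoint(address, port), token).ConfigureAwait(false);
            return socket;
        }
        catch
        {
            socket.Dispose();
            throw;
        }
    }

    private static async Task DisposeWhenDoneAsync(Task<Socket> loser)
    {
        try { (await loser.ConfigureAwait(false)).Dispose(); }
        catch { /* the attempt failed or was cancelled; nothing to clean up */ }
    }
}
```

Passing the addresses in DNS order keeps IPv6 preferred when it works, while a stalled IPv6 attempt only delays the IPv4 attempt by `Stagger` instead of by the operating system's full connect timeout.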

@JustArchi
Contributor Author

I wonder how I can confirm this then or debug the issue further, since this issue doesn't happen if I manually hardcode translate.google.com to 216.58.215.46 via /etc/hosts. DNS query is then omitted entirely.

I took a quick look to see whether the data returned by the DNS resolver could have something to do with it, but it looks fine:

$ host translate.google.com
translate.google.com is an alias for www3.l.google.com.
www3.l.google.com has address 216.58.215.46
www3.l.google.com has IPv6 address 2a00:1450:4007:808::200e
$ ping4 translate.google.com
PING www3.l.google.com (216.58.215.46) 56(84) bytes of data.
$ ping6 translate.google.com
PING translate.google.com(par21s17-in-x0e.1e100.net (2a00:1450:4007:808::200e)) 56 data bytes

@JustArchi
Contributor Author

JustArchi commented May 15, 2018

It might have something to do with the fact that IPv4 query is not even being tried since we time out with IPv6, and we don't have enough of time to try IPv4 next. I mean, it's not intended to send 2 requests through 2 different IP addresses right away, so there has to be some kind of timeout to move forward.

If CURL got "stuck" on that IPv6, then it'd time out as well, eventually. But it's somehow smart and detects in a fraction of second that IPv6 connection fails, then moves out to IPv4 immediately, making it in time regarding supplied timeout. Socket handler probably times out on IPv6, and doesn't even have enough of time to try out IPv4 next. This is just my theory though, I really don't know socket handler internals, you're the expert here.

There is definitely a bit different behaviour regarding handling this though, since switching to curl handler solves the initial issue for me, as well as hardcoding IPv4 address manually in /etc/hosts. On top of the fact that if DNS query was faulty, then curl couldn't possibly work either.

@wfurt
Member

wfurt commented May 15, 2018

Just for the record, I verified that it can work via IPv6, if IPv6 works.
I think this has been a general problem, as a global site may advertise both IPv4 and IPv6 addresses while the local network or ISP does not support IPv6.
Some systems allow setting an address-family preference:
http://sf-alpha.bjgang.org/wordpress/2012/08/linux-prefer-ipv4-over-ipv6-in-dual-stack-environment-and-prevent-problems-when-only-ipv4-exists/
https://wiki.vpsget.com/index.php/Prefer_IPv4_over_ipv6_._How_to_set_ipv4_precedence
That may be short term solution.

@JustArchi
Contributor Author

JustArchi commented May 15, 2018

Yeah the fact that I can't connect through IPv6 is definitely my machine issue and I'll solve it in one way or another, this is a technical difficulty that is on me.

I'm just wondering now if there is anything to improve here in this case or we should keep it like that, since technically it's not broken, but it could be improved in a way to work like curl handler did. If people have such "broken" setups (or rather lack of IPv6 without even knowing about it) then they might see regressions on linux with .NET Core 2.1 where curl handler worked just fine with it on .NET Core 2.0, while new socket handler no longer does and times out.

@karelz
Member

karelz commented May 15, 2018

It would be nice to have SocketsHttpHandler more resilient, however, I don't think we should treat it as compatibility problem between curl handler and SocketsHttpHandler - rather as enhancement. (unless we find out lots of people hitting it)

I would strongly recommend to not use the curl handler as workaround, rather use the IPv4 address. Otherwise you will be stuck in the past. Our plan is to eventually get rid of the curl handler entirely (hopefully in next major release).
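
For readers on .NET 5 or later, one way to pin a client to IPv4 without editing /etc/hosts is to supply an IPv4-only socket through `SocketsHttpHandler.ConnectCallback`. This is a minimal sketch under that assumption (the API did not exist in the .NET Core 2.1 builds discussed in this thread); `Ipv4OnlyClient` is a hypothetical name.

```csharp
using System.Net.Http;
using System.Net.Sockets;

public static class Ipv4OnlyClient
{
    public static HttpClient Create()
    {
        var handler = new SocketsHttpHandler
        {
            ConnectCallback = async (context, cancellationToken) =>
            {
                // An InterNetwork socket can only connect to IPv4 addresses,
                // so IPv6 results for the host name are never attempted.
                var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
                {
                    NoDelay = true
                };
                try
                {
                    await socket.ConnectAsync(context.DnsEndPoint, cancellationToken).ConfigureAwait(false);
                    return new NetworkStream(socket, ownsSocket: true);
                }
                catch
                {
                    socket.Dispose();
                    throw;
                }
            }
        };
        return new HttpClient(handler);
    }
}
```

The trade-off is that such a client never uses IPv6, even on networks where it works, so this is a workaround rather than a substitute for proper fallback.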

@JustArchi
Contributor Author

JustArchi commented May 15, 2018

It's alright, thank you a lot for your help, I'll leave this issue open as an enhancement, but feel free to close it if you decide that it's not worth it to improve socket handler in this regard.

In the meantime I'll see how I can fix IPv6 on my machine 🙂

Have a nice day!

EDIT: In case somebody would have similar issue on linux, I added precedence ::ffff:0:0/96 100 to /etc/gai.conf - this tells getaddrinfo() to prefer IPv4 addresses over IPv6 when both are available.

@karelz
Member

karelz commented May 16, 2018

From offline discussion: There is a concern that SocketsHttpHandler does not try all entries returned by DNS. We should check it. That would qualify as something we might need to fix (potentially in servicing).

@karelz
Member

karelz commented May 16, 2018

@rmkerr can you please take a look at SocketsHttpHandler and how it deals with multiple DNS entries?

@rmkerr
Contributor

rmkerr commented May 16, 2018

Yep, I'll take a look!

@rmkerr
Contributor

rmkerr commented May 16, 2018

Based purely on code inspection this looks correct on the SocketsHttpHandler side. We're using DnsEndpoint with Socket.ConnectAsync:
https://github.com/dotnet/corefx/blob/d5f6e6a10b9b1a9cb42066ebceea21e929adb487/src/System.Net.Http/src/System/Net/Http/SocketsHttpHandler/ConnectHelper.cs#L56-L61
The documentation for DnsEndpoint makes it clear that ConnectAsync will attempt to connect to each address until a connection succeeds, regardless of the address family:

The AddressFamily property of any Socket that is created by calls to the ConnectAsync method
will be the address family of the first address to which a connection can be successfully
established (not necessarily the first address to be resolved).

There could still be a bug in Socket or in DnsEndpoint (or the docs), so this warrants more investigation. I'm going to work on reproducing this issue locally, and will update this thread when I have results.

@karelz
Member

karelz commented May 16, 2018

I wonder if we have some logging in Sockets/SocketsHttpHandler that could help us confirm what is happening on @JustArchi's machine. Maybe in combo with Wireshark ...

@wfurt
Member

wfurt commented May 16, 2018

How long does it take to try new address @rmkerr ? There was overall 60s timeout on HTTP.

Also from diagnostic @karelz

System.OperationCanceledException: The operation was canceled.
at System.Net.Http.HttpClient.HandleFinishSendAsyncError(Exception e,CancellationTokenSource cts)
at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)

It would be really nice if this clearly stated that we were not able to connect. According to @JustArchi, IPv6 never worked on his machine, so the TCP connection would never have been established. A cryptic message like the one above is not that helpful for troubleshooting.

@rmkerr
Contributor

rmkerr commented May 16, 2018

@karelz it might be useful to have logs if I can't reproduce the issue. I have not been able to repro the issue on Windows, but I will try on Linux before taking that approach.

@wfurt I'm not sure of the exact time, but it is far less than 60 seconds. When running the app in a console on windows there is no visible delay. I think this is likely a Linux specific issue though, so I will try it there next.

@karelz
Member

karelz commented Jul 28, 2020

Dialers are checked in #1793 ... however, we need a few DNS APIs to make it easy. Workaround: PInvoke into the DNS APIs.
Moving to 6.0

@karelz karelz modified the milestones: 5.0.0, 6.0.0 Jul 28, 2020
@stephentoub
Member

however, we need a few DNS APIs to make it easy. Workaround: PInvoke into the DNS APIs

What APIs?

@scalablecory
Contributor

This has been resolved via the API added here in .NET 5: #41949

@NPCDW

NPCDW commented Jun 22, 2023

On .NET 6.0.18, this problem has not been resolved.

@PJB3005
Contributor

PJB3005 commented Jul 31, 2023

I know commenting to remind people "why isn't this done yet" isn't the most productive, but I just want to make the weight of this issue clear:

As long as this isn't implemented, HttpClient is by-default broken for any software distributed to user machines. Sooner or later somebody is going to have a broken IPv6 configuration, and your app will just break. Implementing the workaround is basically required for everybody using HttpClient. (just look at the amount of issues referencing this one, soon I'm gonna have two myself.)

@PJB3005
Contributor

PJB3005 commented Jun 16, 2024

Sorry to shill my own blog, but for the people stumbling upon this, I made a decently robust implementation you can use: https://slugcat.systems/post/24-06-16-ipv6-is-hard-happy-eyeballs-dotnet-httpclient/#the-implementation

@wfurt
Member

wfurt commented Jun 17, 2024

do you want to contribute to runtime @PJB3005? It is on my radar for 9.0 but I did not get to it yet.
#87932 has approved API.
