Restore Socket.ConnectAsync via hostname on *nix platforms. #17374

NickCraver · 2016-05-22T16:25:58Z

There's a bit of history to this, but due to https://github.com/dotnet/corefx/issues/5829 which was resolved by dotnet/corefx@30bd4b7, we are no longer able to use hostnames in socket connections for some platforms. Currently, it's a runtime PlatformNotSupportedException.

This breaks StackExchange.Redis on at least OS X and Linux, you can see the issue filed here: StackExchange/StackExchange.Redis#410

The best notes for this are in the dotnet/corefx@30bd4b7 commit, copied here for ease of consumption:

On unix, once connect fails on a socket, that socket becomes unusable for further operations, including additional connect attempts. This is at direct odds with Socket's instance Connect methods, some of which allow for multiple connect attempts, either due to multiple IPAddresses being provided or due to a string hostname / DnsEndPoint being provided that could then result in multiple IPAddresses to be tried.

We've explored multiple workarounds for this, all of which have problems. The workaround still in the code involves creating a temporary socket for each address, connecting to it, and if that's successful then immediately disconnecting and connecting with the actual socket. But that causes mayhem for a server not expecting that pattern, e.g. that fails if the client disconnects and attempts a reconnect.

This leaves us with a few choices, none of which are great:

Remove the offending Connect instance methods. Ideally they'd be replaced with static methods, which can be implemented with POSIX-compliant behavior. But these methods are heavily used and work on Windows.

Always throw from the instance Connect methods when there's a possibility that multiple addresses will be tried, e.g. from Connect(IPAddress[], ...) but also from Connect(EndPoint) if a DnsEndPoint is specified. This will break a lot of existing code, but it's also predictable, and developers will know quickly when using a problematic API and move away from it to supported patterns.

Throw from the Connect methods if multiple addresses are actually supplied, e.g. calling Connect(IPAddress[]) with an array of length 1 would work but an array of length 2 will fail. This will allow a lot more existing code to work, but it's also very unpredictable, e.g. if a string host is provided and gets mapped by DNS to a single IPAddress in the test environment but then multiple IPAddresses in the production environment, everything will work fine locally but then fail in production.

When presented with multiple addresses, try the first one, and if it fails, then throw a PlatformNotSupportedException. This may be slightly more predictable than (3), but is still unpredictable, e.g. if a DNS server returns addresses in a different quantity or order from another server.

Currently option 2 is implemented, which breaks this codepath:
SocketTaskExtensions.ConnectAsync(this Socket socket, string host, int port)
Socket.BeginConnect(string host, int port, AsyncCallback requestCallback, object state)

This hits line 2334: ThrowIfNotSupportsMultipleConnectAttempts();

The exact codepath doesn't matter, because several more feed into this one. In fact even if you call BeginConnect(EndPoint remoteEP, AsyncCallback callback, object state), it's still calling BeginConnect(dnsEP.Host, dnsEP.Port, callback, state); in the host case anyway. So even using the explicit overload of a single endpoint, it still breaks.

I would like us to change to option 3 in the comments above, only throwing for the case that breaks. The current state needlessly breaks existing code for no real reason. The comments in option 3 about deployments being a surprise break is perfectly valid, but it's no worse than option 2 and we're seeing that break in actual runtime usage already out in the wild.

If we can't prevent the issue in general (it looks like all practical options have been exhausted there), I would argue for 2 changes:

Only break when multiple endpoints are actually passed in.
Provide a much clearer exception message on why we're throwing, here's a link to the current one.

The text was updated successfully, but these errors were encountered:

NickCraver · 2016-05-22T16:26:44Z

cc @stephentoub @ericeil

NickCraver · 2016-05-23T00:56:27Z

Another library affected by this behavior: shayhatsor/zookeeper#7

stephentoub · 2016-05-24T07:35:15Z

I would like us to change to option 2 in the comments above, only throwing for the case that breaks

Thanks, @NickCraver. Just to clarify, you meant change to option 3, right?

NickCraver · 2016-05-24T10:09:11Z

@stephentoub ...I guess getting the number right would be important there huh? Fixed the text, thanks!

tmds · 2016-05-24T13:21:38Z

Option 3 and 4 seem equally bad to me, so I'd opt for option 4 as it will work in more cases than 3.

The better solution is to provide static Connect methods as noted in the PR.

In the meantime, we should also seriously explore adding static methods for v1 for the same functionality

Methods accepting multiple addresses on Linux/OSX should throw as it isn't supported. Moving away from those methods is part of the cross platform porting. Option 3 and 4 will mask these problems. When the DNS/IP config changes, applications will suddenly fail.

ericeil · 2016-05-24T17:26:33Z

How about a slightly different option, let's call it option 5:

(5) When presented with multiple addresses, treat the first address as the only address. If the connection fails, throw the usual exception for connection failure, rather than PlatformNotSupportedException.

The advantage vs. option 4 would be that DNS changes would not suddenly cause strange exceptions to be thrown. The disadvantage is that adding a DNS entry would silently have no effect. Thoughts?

stephentoub · 2016-05-24T17:43:06Z

I'd be fine with (5). It's at least explainable. It would be nice if the resulting exception message could be expanded a bit, though, to highlight that only the first address was tried, why, and what the dev's options are instead.

tmds · 2016-05-24T19:44:07Z

Option 5 is the same thing as option 4 but throwing a different exception. As to the type of exception, my preference is PlatformNoSupportedException.

No one in favor of adding static Connect methods as the xplat way of handling this?

@NickCraver how would you feel if you'd need to change your code to a static Connect method to make it xplat?

stephentoub · 2016-05-24T20:22:54Z

No one in favor of adding static Connect methods as the xplat way of handling this?

I'm in favor of it long term, but that's not going to happen for 1.0 or in general in the near future. Such methods do not exist in the full framework, which then has significant impact on portability.

tmds · 2016-05-25T06:22:56Z

I'm in favor of it long term, but that's not going to happen for 1.0 or in general in the near future. Such methods do not exist in the full framework, which then has significant impact on portability.

I forgot about that. Have you thought of how you want to tackle this problem in general?

For the (long) time being.
I prefer option 4. If you pass multiple addresses and it throws an unabletoconnect exception, I'd assume it couldn't connect to any of them. If it can't connect to the first and the platform is unable to connect to another, platformnotsupported exception seems more fitting.

NickCraver · 2016-05-25T10:33:06Z

I'd still opt for option 3 here after this discussion. If we're not actually going to execute the input that was handed to the method, it should fail and indicate how to fix it on the platforms that don't support it. This results in immediate and actionable failure.

Option 4 is bad to me, because it's silently not doing what I told it to. When people go a look up the methods on MSDN, they won't indicate this behavior and it'll result in a lot of developer frustration. We know what's happening inside the method though. We should simply detect and throw, and save a ton of developer hours and help migration along at the same time. This is happening on new platforms, where people are finding what they're hitting, it wouldn't happen anywhere existing. Hiding the error behind very unclear behavior draws that pain far out instead of just eating it near-term, while developers are actively porting.

As far as the signatures differing in Core or not to make this work - I'm indifferent there as it's already an #if def. That being said, we're passing a single address anyway.

TL;DR: I/O operations differ across platforms, this isn't really any kind of surprise. We can responsibly help developers quickly fix the issue while porting and not cause hair loss debugging it - so IMO, we should go that route: option 3.

How does Mono handle this? I'll try and look later today.

tmds · 2016-05-25T10:56:20Z

When I resolve "localhost" on my machine, it already gives me 2 addresses: 1 for IPv4 and 1 for IPv6. So option3 would fail to connect to localhost and option 4 would succeed.

NickCraver · 2016-05-25T11:28:23Z

@tmds Awesome - that's a good thing. It means we're failing early and changing calls early, not finding them later in deployment (which was one of the primary concerns). I'm not sure how common v6 will be on test VMs, though.

mgravell · 2016-05-25T11:37:47Z

I have to second Nick's thoughts with a preference to 3, but again with an improved message telling them what they should be doing instead, i.e. which "supported pattern" you want them to change their code to. This has the advantage of not biting anyone unnecessarily, while also never hiding actual runtime problems (i.e. by ignoring all-but-one address). Edit: and it retains API/signature compatibility with regular .NET, which is a big plus for compat.

tmds · 2016-05-25T11:55:32Z

@NickCraver @mgravell if you like catching problems early. The current implementation tells you your code isn't xplat and you should use the single IPAddress Connect method.

Hostname to IPAddress resolution returns addresses in the order they should be tried. So trying the first will succeed in most cases.

It is not uncommon to resolve names to multiple addresses (IPv6 being a major reason).

Option 3 doesn't fail early, it allows you to pass in certain cases which depend on your network configuration. If you know you'll only have 1 address, then call the Connect-overload which accepts 1 address. It works fine on all platforms.

ericeil · 2016-05-25T18:30:49Z

How does Mono handle this?

It looks to me like Mono tries connecting to all addresses. If an attempt fails, they destroy the underlying native socket and create a new one. When an attempt succeeds, they keep the socket that was connected.

A disadvantage of this would be that any configuration that had been done on the original socket would be lost; Mono doesn't appear to make any attempt to, for example, remember socket options and restore them on the new socket, and doing so would require extra bookkeeping in Socket, which would likely make every Socket object larger.

I don't have a good feel for what the real-world impact of this approach might be.

tmds · 2016-05-25T20:25:56Z

@ericeil I'd like to keep the options and so would StackExchange.Redis: https://github.com/StackExchange/StackExchange.Redis/blob/master/StackExchange.Redis/StackExchange/Redis/SocketManager.cs#L178-L179

As for the discussion on whether to connect when there is exactly 1 address (option 3) or at least 1 address (option 4): in my real-world home network when I resolve "localhost", my pc's hostname or another pc's hostname I get 2, 9 and 2 addresses respectively.

stephentoub · 2016-05-26T18:12:00Z

I don't have a good feel for what the real-world impact of this approach might be.

Seems like it could end up being super expensive, especially when needing to support methods like public void SetSocketOption(SocketOptionLevel optionLevel, SocketOptionName optionName, byte[] optionValue), as we'd need to copy the byte array just in case it was modified after this call and just in case multiple addresses might be used.

As for the discussion

How about a compromise between options 3 and 6:
7.We create a new underlying socket fd for each connection attempt and swap that in to the Socket on success... but only if the socket hasn't been configured beyond the defaults (either at all, or for some limited set of properties we deem critical and that we'll track and manually copy over). If it has been configured (we can track for all properties/methods that something was configured even if we don't track what was configured), then attempting to connect with multiple addresses throws a PlatformNotSupportedException. We can optionally turn the dial from no properties tracked to all properties tracked over time, as needed, reducing the cases that result in PNSE.

Too complicated?

tmds · 2016-05-26T19:10:55Z

Option 7 is the best solution given the constraints of the problem.

In order of increasing probability of connect succeeding:
Option 2 > Option 3 > Option 4&5 > Option 6&7

We should aim for 4&5 as a minimum as occurrence of multiple addresses is high.

Note if you want some more info on the order of addresses returned when resolving a name, see man getaddrinfo.

ericeil · 2016-05-26T21:22:32Z

With Option 7, wouldn't the StackExchange code still get PlatformNotSupportedException (since it configures the socket prior to calling Connect)?

stephentoub · 2016-05-26T21:55:20Z

With Option 7, wouldn't the StackExchange code still get PlatformNotSupportedException

Yes, unless the properties it configures are ones we choose to special-case.

ericeil · 2016-05-27T21:12:01Z

It sounds like option 7 is not tenable for v1.0. Without adding some state to track at least some of the options set on a socket, it's not going to solve the problem. If we add state to track options, we are going to have to live with that extra overhead forever, even after we add static Connect methods to handle this properly. And, it's not clear to me that we could pick a suitable subset of all possible socket state that would make this "just work" for everyone, in the short timeframe available before 1.0 ships.

Mimicking Mono's behavior (is that option "6?") similarly does not solve the problem for even the uses that lead to this discussion.

Options 3-5 behave unpredictably, and it's not clear to me that we're going to be able to pick the right heuristic that will work often enough to not cause headaches for everyone. I certainly don't feel like we've settled on a consensus choice in this discussion so far.

That leaves me thinking that the existing behavior (option 2) is the best choice, for 1.0, because it gives us predictable, early, failures for all usages, and leaves open all other options for possible future implementation. For code that needs this functionality on 1.0, there is a relatively simple workaround.

So, I am moving this out of the RTM milestone, but will leave the issue open for further discussion.

vlesierse · 2016-05-30T12:46:28Z

Currently experienced the same problem with an attempt to port the .NET Driver for MongoDB to .NET Core. How to solve this in RTM if the code suddenly starts to throw errors when you try to connect to localhost. Do you have to write around the problem yourself to get a single connection endpoint using DNS?

tmds · 2016-05-31T22:58:36Z

@vlesierse yes, that is the xplat way to handle this. Resolve to IPAdresses[]. Try them one by one until one works or all failed.

As outlined in https://github.com/dotnet/corefx/issues/8768, on Unix we don't currently support using the instance Connect/ConnectAsync methods that take a string host or a DnsEndPoint, because they could map to multiple addresses, which means we might need to try reconnecting on the same socket after a failed attempt, and that's not supported with BSD sockets. They are potential workarounds we can explore as outlined in that issue, but they're non-trivial and/or have undesirable ramifications. However, one simple thing we can do is allow a string/DnsEndPoint version of an IPAddress, e.g. just as someone can provide an IPAddress, they can provide a string version of that IPAddress, such as "127.0.0.1". This is a common thing to do, and we can make it work just by attempting to parse the address.

…forms. Having the socket do the DNS lookup is apparently not supported by corefx. See issue: https://github.com/dotnet/corefx/issues/5829 and https://github.com/dotnet/corefx/issues/8768.

As outlined in https://github.com/dotnet/corefx/issues/8768, on Unix we don't currently support using the instance Connect/ConnectAsync methods that take a string host or a DnsEndPoint, because they could map to multiple addresses, which means we might need to try reconnecting on the same socket after a failed attempt, and that's not supported with BSD sockets. They are potential workarounds we can explore as outlined in that issue, but they're non-trivial and/or have undesirable ramifications. However, one simple thing we can do is allow a string/DnsEndPoint version of an IPAddress, e.g. just as someone can provide an IPAddress, they can provide a string version of that IPAddress, such as "127.0.0.1". This is a common thing to do, and we can make it work just by attempting to parse the address.

Due to https://github.com/dotnet/corefx/issues/8768, on non-Windows platforms Socket.Connect* methods throw PaltformNotSupportedException when connecting via host names. This change resolves the IP endpoints first when the endpoint is a Dns endpoint, tries to connect to each IP endpoint, then picks the first that works.

ericeil · 2016-09-09T19:04:09Z

I couldn't find an existing issue covering the need for static Connect methods, so I created dotnet/corefx#11564. The existing instance methods are probably as good as they're going to get without either heroic effort or introcuding hard-to-grok behavior. Either way, it doesn't seem worth the cost, unless we get a lot of feedback that a) the current situation is unacceptable, and b) adding more functional static methods won't be enough.

stephentoub · 2017-02-16T18:48:10Z

I'm going to reopen this and explore the option 7 I suggested.

I'm thinking of something like this:

Have a fixed "allowed" list of options, including at least the options configured via named properties on Socket, e.g. Socket.NoDelay, but potentially extended to include other high-value settings.
Track whether, prior to a socket.Connect{Async}(multipleAddresses), Socket.Handle is accessed or SetSocketOptions is used with something not on the allowed list.
When socket.Connect{Async}(multipleAddresses) is called, rather than always throwing a PlatformNotSupportedException, we check the flag set by that tracking and only throw if one of these invalidating operations happened.
If we don't throw, then we can safely fall back to creating a different underlying native socket under the covers for each IPAddress, getsocketopt'ing on the original socket and setsocketopt'ing on the new socket for each option in the allowed list, prior to each connect attempt. This will add cost to all but the first attempt, but that additional cost should be the minority case (assuming that most of the time the first address will work), and if it doesn't, that additional cost should be better than outright failing, and at least some of that cost would be incurred anyway by an optimal implementation that was configuring only the options needed.

Reasonable? If so, are there a particular set of socket options folks believe would address the 99% case?

cc: @NickCraver, @tmds, @glennc, @geoffkizer, @danroth27, @richlander, @pgavlin, @halter73

geoffkizer · 2017-02-16T18:59:59Z

Is it really that bad to just add static methods that do the right thing? As it stands, there's no "right" way for users to write their code on Unix platforms. Seems like this is a functional gap and we need to address it.

I realize that's a compat pain between .NET Core and .NET Framework, but how do we value that vs actually implementing the right functionality here?

Note we already have static ConnectAsync in netstandard.

NickCraver · 2017-02-16T19:18:14Z

+1 to option 7.

@geoffkizer Yes, it's bad. A major goal (the primary goal?) of netstandard2.0 is to make cross-platform things just work. When we have 1 method that works on Windows (and may even test to work on other platforms), that's bad, unexpected behavior for every user or library author to encounter and fix. Currently we have code that may not blow up until you get to production (and we've seen issues filed for exactly that), it's the worst possible situation of usability. This is a compat pain we can greatly lessen with option 7.

I think we obviously disagree on the "right" functionality here. I see 7 as far more desirable by most libraries I cna think of. Obviously the overhead is a weight we should evaluate, but I'd much rather have a slightly heavier socket that works over a user blowing up production. If the overhead is an issue for something really socket intensive, they'd be free to go the static route, as they are today.

tmds · 2017-02-16T19:51:07Z

@stephentoub Personally I am fine with the PNSE informing me to adjust my code to Dns.Resolve or call a static method.

Some observations from the references made to this issue:

mongo-csharp-driver, slamby-api, CoreFTP implemented option 4/5
CoreFTP is doing some funky DNS caching
StackExchange.Redis Closed(=Umerged) PR does a Dns resolve and then tries the addresses. Once it finds an address that works, it closes the socket and creates a new socket to connect to the address.
StackExchange.Redis is using the SIO_LOOPBACK_FAST_PATH which probably won't be covered by option 7

stephentoub · 2017-02-17T01:33:10Z

Thanks, @NickCraver and @tmds. @tmds, isn't SIO_LOOPBACK_FAST_PATH a Windows-only option, anyway?

NickCraver · 2017-02-17T02:31:17Z

Yep - Windows only, so that'll need to be behind a platform check either way (as it is already for anyone using it).

Putting platform-specific optimizations behind platform-specific checks isn't a problem, IMO. They are optimizations, and completely optional. Basic functionality like "connect to this hostname" obviously my opinion sways sharply the other way on ;)

tmds · 2017-02-17T08:34:51Z

isn't SIO_LOOPBACK_FAST_PATH a Windows-only option, anyway?

Yes, this is exactly what makes it interesting. It won't be covered by the 99% of option 7.

Putting platform-specific optimizations behind platform-specific checks isn't a problem, IMO

When option 7 is implemented most 1% options could simply be placed after the connect call. That isn't the case here. SIO_LOOPBACK_FAST_PATH must be set before the Connect call.

@stephentoub I see TcpClient already handles the 95% case. Offering yet another alternative to the developer to solve the PNSE.

stephentoub · 2017-02-17T15:04:57Z

It won't be covered by the 99% of option 7.

I'm not understanding. It doesn't apply to Unix, and on Windows this issue doesn't exist. So why is it interesting?

Offering yet another alternative to the developer to solve the PNSE.

This is about a) making 15 years of existing code and already compiled binaries work better, and b) making the obvious/simple code a dev new to the platform would write just work.

tmds · 2017-02-17T15:38:36Z

I'm not understanding. It doesn't apply to Unix, and on Windows this issue doesn't exist. So why is it interesting?

You're right. Not relevant.

This is about a) making 15 years of existing code and already compiled binaries work better, and b) making the obvious/simple code a dev new to the platform would write just work.

It's great when things just work.

NickCraver · 2017-02-24T00:03:41Z

For anyone following this, PR with huge improvements is active here: dotnet/corefx#16373

NickCraver changed the title ~~Restore Socket.ConnectAsync bis hostname on *nix platforms.~~ Restore Socket.ConnectAsync via hostname on *nix platforms. May 22, 2016

joshfree assigned ericeil May 22, 2016

ericeil closed this as completed Sep 9, 2016

stephentoub unassigned ericeil Feb 16, 2017

stephentoub reopened this Feb 16, 2017

stephentoub self-assigned this Feb 16, 2017

stephentoub closed this as completed in dotnet/corefx#16373 Feb 25, 2017

msftgits transferred this issue from dotnet/corefx Jan 31, 2020

msftgits added this to the 2.0.0 milestone Jan 31, 2020

ericeil mentioned this issue Jan 31, 2020

Consider adding static Socket.Connect methods to support DNS endpoints #18483

Closed

cocowalla mentioned this issue Jan 31, 2020

PlatformNotSupportedException with TcpClient.ConnectAsync #24917

Closed

ghost locked as resolved and limited conversation to collaborators Dec 31, 2020

Restore Socket.ConnectAsync via hostname on *nix platforms. #17374

Restore Socket.ConnectAsync via hostname on *nix platforms. #17374

Comments

NickCraver commented May 22, 2016

NickCraver commented May 22, 2016

NickCraver commented May 23, 2016

stephentoub commented May 24, 2016

NickCraver commented May 24, 2016

tmds commented May 24, 2016

ericeil commented May 24, 2016 • edited Loading

stephentoub commented May 24, 2016 • edited Loading

tmds commented May 24, 2016

stephentoub commented May 24, 2016

tmds commented May 25, 2016

NickCraver commented May 25, 2016

tmds commented May 25, 2016 • edited Loading

NickCraver commented May 25, 2016

mgravell commented May 25, 2016 • edited Loading

tmds commented May 25, 2016

ericeil commented May 25, 2016

tmds commented May 25, 2016

stephentoub commented May 26, 2016 • edited Loading

tmds commented May 26, 2016

ericeil commented May 26, 2016

stephentoub commented May 26, 2016

ericeil commented May 27, 2016

vlesierse commented May 30, 2016

tmds commented May 31, 2016

ericeil commented Sep 9, 2016

stephentoub commented Feb 16, 2017

geoffkizer commented Feb 16, 2017

NickCraver commented Feb 16, 2017

tmds commented Feb 16, 2017

stephentoub commented Feb 17, 2017

NickCraver commented Feb 17, 2017

tmds commented Feb 17, 2017

stephentoub commented Feb 17, 2017

tmds commented Feb 17, 2017

NickCraver commented Feb 24, 2017

ericeil commented May 24, 2016 •

edited

Loading

stephentoub commented May 24, 2016 •

edited

Loading

tmds commented May 25, 2016 •

edited

Loading

mgravell commented May 25, 2016 •

edited

Loading

stephentoub commented May 26, 2016 •

edited

Loading