Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore Socket.ConnectAsync via hostname on *nix platforms. #17374

Closed
NickCraver opened this issue May 22, 2016 · 35 comments · Fixed by dotnet/corefx#16373
Closed

Restore Socket.ConnectAsync via hostname on *nix platforms. #17374

NickCraver opened this issue May 22, 2016 · 35 comments · Fixed by dotnet/corefx#16373
Assignees
Labels
area-System.Net.Sockets os-linux Linux OS (any supported distro)
Milestone

Comments

@NickCraver
Copy link
Member

There's a bit of history to this, but due to https://github.com/dotnet/corefx/issues/5829 which was resolved by dotnet/corefx@30bd4b7, we are no longer able to use hostnames in socket connections for some platforms. Currently, it's a runtime PlatformNotSupportedException.

This breaks StackExchange.Redis on at least OS X and Linux, you can see the issue filed here: StackExchange/StackExchange.Redis#410

The best notes for this are in the dotnet/corefx@30bd4b7 commit, copied here for ease of consumption:

On unix, once connect fails on a socket, that socket becomes unusable for further operations, including additional connect attempts. This is at direct odds with Socket's instance Connect methods, some of which allow for multiple connect attempts, either due to multiple IPAddresses being provided or due to a string hostname / DnsEndPoint being provided that could then result in multiple IPAddresses to be tried.

We've explored multiple workarounds for this, all of which have problems. The workaround still in the code involves creating a temporary socket for each address, connecting to it, and if that's successful then immediately disconnecting and connecting with the actual socket. But that causes mayhem for a server not expecting that pattern, e.g. that fails if the client disconnects and attempts a reconnect.

This leaves us with a few choices, none of which are great:

  1. Remove the offending Connect instance methods. Ideally they'd be replaced with static methods, which can be implemented with POSIX-compliant behavior. But these methods are heavily used and work on Windows.
  2. Always throw from the instance Connect methods when there's a possibility that multiple addresses will be tried, e.g. from Connect(IPAddress[], ...) but also from Connect(EndPoint) if a DnsEndPoint is specified. This will break a lot of existing code, but it's also predictable, and developers will know quickly when using a problematic API and move away from it to supported patterns.
  3. Throw from the Connect methods if multiple addresses are actually supplied, e.g. calling Connect(IPAddress[]) with an array of length 1 would work but an array of length 2 will fail. This will allow a lot more existing code to work, but it's also very unpredictable, e.g. if a string host is provided and gets mapped by DNS to a single IPAddress in the test environment but then multiple IPAddresses in the production environment, everything will work fine locally but then fail in production.
  4. When presented with multiple addresses, try the first one, and if it fails, then throw a PlatformNotSupportedException. This may be slightly more predictable than (3), but is still unpredictable, e.g. if a DNS server returns addresses in a different quantity or order from another server.

Currently option 2 is implemented, which breaks this codepath:
SocketTaskExtensions.ConnectAsync(this Socket socket, string host, int port)
Socket.BeginConnect(string host, int port, AsyncCallback requestCallback, object state)

This hits line 2334: ThrowIfNotSupportsMultipleConnectAttempts();

The exact codepath doesn't matter, because several more feed into this one. In fact even if you call BeginConnect(EndPoint remoteEP, AsyncCallback callback, object state), it's still calling BeginConnect(dnsEP.Host, dnsEP.Port, callback, state); in the host case anyway. So even using the explicit overload of a single endpoint, it still breaks.

I would like us to change to option 3 in the comments above, only throwing for the case that breaks. The current state needlessly breaks existing code for no real reason. The comments in option 3 about deployments being a surprise break is perfectly valid, but it's no worse than option 2 and we're seeing that break in actual runtime usage already out in the wild.

If we can't prevent the issue in general (it looks like all practical options have been exhausted there), I would argue for 2 changes:

  1. Only break when multiple endpoints are actually passed in.
  2. Provide a much clearer exception message on why we're throwing, here's a link to the current one.
@NickCraver
Copy link
Member Author

cc @stephentoub @ericeil

@NickCraver NickCraver changed the title Restore Socket.ConnectAsync bis hostname on *nix platforms. Restore Socket.ConnectAsync via hostname on *nix platforms. May 22, 2016
@NickCraver
Copy link
Member Author

Another library affected by this behavior: shayhatsor/zookeeper#7

@stephentoub
Copy link
Member

I would like us to change to option 2 in the comments above, only throwing for the case that breaks

Thanks, @NickCraver. Just to clarify, you meant change to option 3, right?

@NickCraver
Copy link
Member Author

@stephentoub ...I guess getting the number right would be important there huh? Fixed the text, thanks!

@tmds
Copy link
Member

tmds commented May 24, 2016

Option 3 and 4 seem equally bad to me, so I'd opt for option 4 as it will work in more cases than 3.

The better solution is to provide static Connect methods as noted in the PR.

In the meantime, we should also seriously explore adding static methods for v1 for the same functionality

Methods accepting multiple addresses on Linux/OSX should throw as it isn't supported. Moving away from those methods is part of the cross platform porting. Option 3 and 4 will mask these problems. When the DNS/IP config changes, applications will suddenly fail.

@ericeil
Copy link
Contributor

ericeil commented May 24, 2016

How about a slightly different option, let's call it option 5:

(5) When presented with multiple addresses, treat the first address as the only address. If the connection fails, throw the usual exception for connection failure, rather than PlatformNotSupportedException.

The advantage vs. option 4 would be that DNS changes would not suddenly cause strange exceptions to be thrown. The disadvantage is that adding a DNS entry would silently have no effect. Thoughts?

@stephentoub
Copy link
Member

stephentoub commented May 24, 2016

I'd be fine with (5). It's at least explainable. It would be nice if the resulting exception message could be expanded a bit, though, to highlight that only the first address was tried, why, and what the dev's options are instead.

@tmds
Copy link
Member

tmds commented May 24, 2016

Option 5 is the same thing as option 4 but throwing a different exception. As to the type of exception, my preference is PlatformNoSupportedException.

No one in favor of adding static Connect methods as the xplat way of handling this?

@NickCraver how would you feel if you'd need to change your code to a static Connect method to make it xplat?

@stephentoub
Copy link
Member

No one in favor of adding static Connect methods as the xplat way of handling this?

I'm in favor of it long term, but that's not going to happen for 1.0 or in general in the near future. Such methods do not exist in the full framework, which then has significant impact on portability.

@tmds
Copy link
Member

tmds commented May 25, 2016

I'm in favor of it long term, but that's not going to happen for 1.0 or in general in the near future. Such methods do not exist in the full framework, which then has significant impact on portability.

I forgot about that. Have you thought of how you want to tackle this problem in general?

For the (long) time being.
I prefer option 4. If you pass multiple addresses and it throws an unabletoconnect exception, I'd assume it couldn't connect to any of them. If it can't connect to the first and the platform is unable to connect to another, platformnotsupported exception seems more fitting.

@NickCraver
Copy link
Member Author

I'd still opt for option 3 here after this discussion. If we're not actually going to execute the input that was handed to the method, it should fail and indicate how to fix it on the platforms that don't support it. This results in immediate and actionable failure.

Option 4 is bad to me, because it's silently not doing what I told it to. When people go a look up the methods on MSDN, they won't indicate this behavior and it'll result in a lot of developer frustration. We know what's happening inside the method though. We should simply detect and throw, and save a ton of developer hours and help migration along at the same time. This is happening on new platforms, where people are finding what they're hitting, it wouldn't happen anywhere existing. Hiding the error behind very unclear behavior draws that pain far out instead of just eating it near-term, while developers are actively porting.

As far as the signatures differing in Core or not to make this work - I'm indifferent there as it's already an #if def. That being said, we're passing a single address anyway.

TL;DR: I/O operations differ across platforms, this isn't really any kind of surprise. We can responsibly help developers quickly fix the issue while porting and not cause hair loss debugging it - so IMO, we should go that route: option 3.

How does Mono handle this? I'll try and look later today.

@tmds
Copy link
Member

tmds commented May 25, 2016

When I resolve "localhost" on my machine, it already gives me 2 addresses: 1 for IPv4 and 1 for IPv6. So option3 would fail to connect to localhost and option 4 would succeed.

@NickCraver
Copy link
Member Author

@tmds Awesome - that's a good thing. It means we're failing early and changing calls early, not finding them later in deployment (which was one of the primary concerns). I'm not sure how common v6 will be on test VMs, though.

@mgravell
Copy link
Member

mgravell commented May 25, 2016

I have to second Nick's thoughts with a preference to 3, but again with an improved message telling them what they should be doing instead, i.e. which "supported pattern" you want them to change their code to. This has the advantage of not biting anyone unnecessarily, while also never hiding actual runtime problems (i.e. by ignoring all-but-one address). Edit: and it retains API/signature compatibility with regular .NET, which is a big plus for compat.

@tmds
Copy link
Member

tmds commented May 25, 2016

@NickCraver @mgravell if you like catching problems early. The current implementation tells you your code isn't xplat and you should use the single IPAddress Connect method.

Hostname to IPAddress resolution returns addresses in the order they should be tried. So trying the first will succeed in most cases.

It is not uncommon to resolve names to multiple addresses (IPv6 being a major reason).

Option 3 doesn't fail early, it allows you to pass in certain cases which depend on your network configuration. If you know you'll only have 1 address, then call the Connect-overload which accepts 1 address. It works fine on all platforms.

@ericeil
Copy link
Contributor

ericeil commented May 25, 2016

How does Mono handle this?

It looks to me like Mono tries connecting to all addresses. If an attempt fails, they destroy the underlying native socket and create a new one. When an attempt succeeds, they keep the socket that was connected.

A disadvantage of this would be that any configuration that had been done on the original socket would be lost; Mono doesn't appear to make any attempt to, for example, remember socket options and restore them on the new socket, and doing so would require extra bookkeeping in Socket, which would likely make every Socket object larger.

I don't have a good feel for what the real-world impact of this approach might be.

@tmds
Copy link
Member

tmds commented May 25, 2016

@ericeil I'd like to keep the options and so would StackExchange.Redis: https://github.com/StackExchange/StackExchange.Redis/blob/master/StackExchange.Redis/StackExchange/Redis/SocketManager.cs#L178-L179

As for the discussion on whether to connect when there is exactly 1 address (option 3) or at least 1 address (option 4): in my real-world home network when I resolve "localhost", my pc's hostname or another pc's hostname I get 2, 9 and 2 addresses respectively.

@stephentoub
Copy link
Member

stephentoub commented May 26, 2016

I don't have a good feel for what the real-world impact of this approach might be.

Seems like it could end up being super expensive, especially when needing to support methods like public void SetSocketOption(SocketOptionLevel optionLevel, SocketOptionName optionName, byte[] optionValue), as we'd need to copy the byte array just in case it was modified after this call and just in case multiple addresses might be used.

As for the discussion

How about a compromise between options 3 and 6:
7.We create a new underlying socket fd for each connection attempt and swap that in to the Socket on success... but only if the socket hasn't been configured beyond the defaults (either at all, or for some limited set of properties we deem critical and that we'll track and manually copy over). If it has been configured (we can track for all properties/methods that something was configured even if we don't track what was configured), then attempting to connect with multiple addresses throws a PlatformNotSupportedException. We can optionally turn the dial from no properties tracked to all properties tracked over time, as needed, reducing the cases that result in PNSE.

Too complicated?

@tmds
Copy link
Member

tmds commented May 26, 2016

Option 7 is the best solution given the constraints of the problem.

In order of increasing probability of connect succeeding:
Option 2 > Option 3 > Option 4&5 > Option 6&7

We should aim for 4&5 as a minimum as occurrence of multiple addresses is high.

Note if you want some more info on the order of addresses returned when resolving a name, see man getaddrinfo.

@ericeil
Copy link
Contributor

ericeil commented May 26, 2016

With Option 7, wouldn't the StackExchange code still get PlatformNotSupportedException (since it configures the socket prior to calling Connect)?

@stephentoub
Copy link
Member

With Option 7, wouldn't the StackExchange code still get PlatformNotSupportedException

Yes, unless the properties it configures are ones we choose to special-case.

@ericeil
Copy link
Contributor

ericeil commented May 27, 2016

It sounds like option 7 is not tenable for v1.0. Without adding some state to track at least some of the options set on a socket, it's not going to solve the problem. If we add state to track options, we are going to have to live with that extra overhead forever, even after we add static Connect methods to handle this properly. And, it's not clear to me that we could pick a suitable subset of all possible socket state that would make this "just work" for everyone, in the short timeframe available before 1.0 ships.

Mimicking Mono's behavior (is that option "6?") similarly does not solve the problem for even the uses that lead to this discussion.

Options 3-5 behave unpredictably, and it's not clear to me that we're going to be able to pick the right heuristic that will work often enough to not cause headaches for everyone. I certainly don't feel like we've settled on a consensus choice in this discussion so far.

That leaves me thinking that the existing behavior (option 2) is the best choice, for 1.0, because it gives us predictable, early, failures for all usages, and leaves open all other options for possible future implementation. For code that needs this functionality on 1.0, there is a relatively simple workaround.

So, I am moving this out of the RTM milestone, but will leave the issue open for further discussion.

@vlesierse
Copy link

Currently experienced the same problem with an attempt to port the .NET Driver for MongoDB to .NET Core. How to solve this in RTM if the code suddenly starts to throw errors when you try to connect to localhost. Do you have to write around the problem yourself to get a single connection endpoint using DNS?

@tmds
Copy link
Member

tmds commented May 31, 2016

@vlesierse yes, that is the xplat way to handle this. Resolve to IPAdresses[]. Try them one by one until one works or all failed.

stephentoub referenced this issue in stephentoub/corefx Jun 7, 2016
As outlined in https://github.com/dotnet/corefx/issues/8768, on Unix we don't currently support using the instance Connect/ConnectAsync methods that take a string host or a DnsEndPoint, because they could map to multiple addresses, which means we might need to try reconnecting on the same socket after a failed attempt, and that's not supported with BSD sockets.  They are potential workarounds we can explore as outlined in that issue, but they're non-trivial and/or have undesirable ramifications.

However, one simple thing we can do is allow a string/DnsEndPoint version of an IPAddress, e.g. just as someone can provide an IPAddress, they can provide a string version of that IPAddress, such as "127.0.0.1".  This is a common thing to do, and we can make it work just by attempting to parse the address.
stephentoub referenced this issue in stephentoub/corefx Jun 7, 2016
As outlined in https://github.com/dotnet/corefx/issues/8768, on Unix we don't currently support using the instance Connect/ConnectAsync methods that take a string host or a DnsEndPoint, because they could map to multiple addresses, which means we might need to try reconnecting on the same socket after a failed attempt, and that's not supported with BSD sockets.  They are potential workarounds we can explore as outlined in that issue, but they're non-trivial and/or have undesirable ramifications.

However, one simple thing we can do is allow a string/DnsEndPoint version of an IPAddress, e.g. just as someone can provide an IPAddress, they can provide a string version of that IPAddress, such as "127.0.0.1".  This is a common thing to do, and we can make it work just by attempting to parse the address.
stephentoub referenced this issue in stephentoub/corefx Jun 9, 2016
As outlined in https://github.com/dotnet/corefx/issues/8768, on Unix we don't currently support using the instance Connect/ConnectAsync methods that take a string host or a DnsEndPoint, because they could map to multiple addresses, which means we might need to try reconnecting on the same socket after a failed attempt, and that's not supported with BSD sockets.  They are potential workarounds we can explore as outlined in that issue, but they're non-trivial and/or have undesirable ramifications.

However, one simple thing we can do is allow a string/DnsEndPoint version of an IPAddress, e.g. just as someone can provide an IPAddress, they can provide a string version of that IPAddress, such as "127.0.0.1".  This is a common thing to do, and we can make it work just by attempting to parse the address.
DarkKilauea referenced this issue in DarkKilauea/IrcDotNet Jun 25, 2016
stephentoub referenced this issue in stephentoub/corefx Jun 29, 2016
As outlined in https://github.com/dotnet/corefx/issues/8768, on Unix we don't currently support using the instance Connect/ConnectAsync methods that take a string host or a DnsEndPoint, because they could map to multiple addresses, which means we might need to try reconnecting on the same socket after a failed attempt, and that's not supported with BSD sockets.  They are potential workarounds we can explore as outlined in that issue, but they're non-trivial and/or have undesirable ramifications.

However, one simple thing we can do is allow a string/DnsEndPoint version of an IPAddress, e.g. just as someone can provide an IPAddress, they can provide a string version of that IPAddress, such as "127.0.0.1".  This is a common thing to do, and we can make it work just by attempting to parse the address.
jeremymeng referenced this issue in jeremymeng/StackExchange.Redis Sep 2, 2016
Due to https://github.com/dotnet/corefx/issues/8768, on non-Windows platforms
Socket.Connect* methods throw PaltformNotSupportedException when connecting via
host names. This change resolves the IP endpoints first when the endpoint is a
Dns endpoint, tries to connect to each IP endpoint, then picks the first that
works.
jeremymeng referenced this issue in jeremymeng/StackExchange.Redis Sep 6, 2016
Due to https://github.com/dotnet/corefx/issues/8768, on non-Windows platforms
Socket.Connect* methods throw PaltformNotSupportedException when connecting via
host names. This change resolves the IP endpoints first when the endpoint is a
Dns endpoint, tries to connect to each IP endpoint, then picks the first that
works.
@ericeil
Copy link
Contributor

ericeil commented Sep 9, 2016

I couldn't find an existing issue covering the need for static Connect methods, so I created dotnet/corefx#11564. The existing instance methods are probably as good as they're going to get without either heroic effort or introcuding hard-to-grok behavior. Either way, it doesn't seem worth the cost, unless we get a lot of feedback that a) the current situation is unacceptable, and b) adding more functional static methods won't be enough.

@ericeil ericeil closed this as completed Sep 9, 2016
@stephentoub
Copy link
Member

I'm going to reopen this and explore the option 7 I suggested.

I'm thinking of something like this:

  • Have a fixed "allowed" list of options, including at least the options configured via named properties on Socket, e.g. Socket.NoDelay, but potentially extended to include other high-value settings.
  • Track whether, prior to a socket.Connect{Async}(multipleAddresses), Socket.Handle is accessed or SetSocketOptions is used with something not on the allowed list.
  • When socket.Connect{Async}(multipleAddresses) is called, rather than always throwing a PlatformNotSupportedException, we check the flag set by that tracking and only throw if one of these invalidating operations happened.
  • If we don't throw, then we can safely fall back to creating a different underlying native socket under the covers for each IPAddress, getsocketopt'ing on the original socket and setsocketopt'ing on the new socket for each option in the allowed list, prior to each connect attempt. This will add cost to all but the first attempt, but that additional cost should be the minority case (assuming that most of the time the first address will work), and if it doesn't, that additional cost should be better than outright failing, and at least some of that cost would be incurred anyway by an optimal implementation that was configuring only the options needed.

Reasonable? If so, are there a particular set of socket options folks believe would address the 99% case?

cc: @NickCraver, @tmds, @glennc, @geoffkizer, @danroth27, @richlander, @pgavlin, @halter73

@stephentoub stephentoub reopened this Feb 16, 2017
@stephentoub stephentoub self-assigned this Feb 16, 2017
@geoffkizer
Copy link
Contributor

Is it really that bad to just add static methods that do the right thing? As it stands, there's no "right" way for users to write their code on Unix platforms. Seems like this is a functional gap and we need to address it.

I realize that's a compat pain between .NET Core and .NET Framework, but how do we value that vs actually implementing the right functionality here?

Note we already have static ConnectAsync in netstandard.

@NickCraver
Copy link
Member Author

+1 to option 7.

@geoffkizer Yes, it's bad. A major goal (the primary goal?) of netstandard2.0 is to make cross-platform things just work. When we have 1 method that works on Windows (and may even test to work on other platforms), that's bad, unexpected behavior for every user or library author to encounter and fix. Currently we have code that may not blow up until you get to production (and we've seen issues filed for exactly that), it's the worst possible situation of usability. This is a compat pain we can greatly lessen with option 7.

I think we obviously disagree on the "right" functionality here. I see 7 as far more desirable by most libraries I cna think of. Obviously the overhead is a weight we should evaluate, but I'd much rather have a slightly heavier socket that works over a user blowing up production. If the overhead is an issue for something really socket intensive, they'd be free to go the static route, as they are today.

@tmds
Copy link
Member

tmds commented Feb 16, 2017

@stephentoub Personally I am fine with the PNSE informing me to adjust my code to Dns.Resolve or call a static method.

Some observations from the references made to this issue:

  • mongo-csharp-driver, slamby-api, CoreFTP implemented option 4/5
  • CoreFTP is doing some funky DNS caching
  • StackExchange.Redis Closed(=Umerged) PR does a Dns resolve and then tries the addresses. Once it finds an address that works, it closes the socket and creates a new socket to connect to the address.
  • StackExchange.Redis is using the SIO_LOOPBACK_FAST_PATH which probably won't be covered by option 7

@stephentoub
Copy link
Member

Thanks, @NickCraver and @tmds. @tmds, isn't SIO_LOOPBACK_FAST_PATH a Windows-only option, anyway?

@NickCraver
Copy link
Member Author

Yep - Windows only, so that'll need to be behind a platform check either way (as it is already for anyone using it).

Putting platform-specific optimizations behind platform-specific checks isn't a problem, IMO. They are optimizations, and completely optional. Basic functionality like "connect to this hostname" obviously my opinion sways sharply the other way on ;)

@tmds
Copy link
Member

tmds commented Feb 17, 2017

isn't SIO_LOOPBACK_FAST_PATH a Windows-only option, anyway?

Yes, this is exactly what makes it interesting. It won't be covered by the 99% of option 7.

Putting platform-specific optimizations behind platform-specific checks isn't a problem, IMO

When option 7 is implemented most 1% options could simply be placed after the connect call. That isn't the case here. SIO_LOOPBACK_FAST_PATH must be set before the Connect call.

@stephentoub I see TcpClient already handles the 95% case. Offering yet another alternative to the developer to solve the PNSE.

@stephentoub
Copy link
Member

It won't be covered by the 99% of option 7.

I'm not understanding. It doesn't apply to Unix, and on Windows this issue doesn't exist. So why is it interesting?

Offering yet another alternative to the developer to solve the PNSE.

This is about a) making 15 years of existing code and already compiled binaries work better, and b) making the obvious/simple code a dev new to the platform would write just work.

@tmds
Copy link
Member

tmds commented Feb 17, 2017

I'm not understanding. It doesn't apply to Unix, and on Windows this issue doesn't exist. So why is it interesting?

You're right. Not relevant.

This is about a) making 15 years of existing code and already compiled binaries work better, and b) making the obvious/simple code a dev new to the platform would write just work.

It's great when things just work.

@NickCraver
Copy link
Member Author

For anyone following this, PR with huge improvements is active here: dotnet/corefx#16373

@msftgits msftgits transferred this issue from dotnet/corefx Jan 31, 2020
@msftgits msftgits added this to the 2.0.0 milestone Jan 31, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Dec 31, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Net.Sockets os-linux Linux OS (any supported distro)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants