Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New feature: Connect to a specific host and port #614

Closed
wants to merge 3 commits into from

Conversation

@mkauf
Copy link
Contributor

commented Jan 25, 2016

This pull request adds the option CURLOPT_CONNECT_TO_HOST. It is used to connect to a specific host and port. The main use case is to direct a request at a specific cluster node in a cluster of uniform servers. Only the "network layer" is affected (connection establishment and proxy code). This option does not change the data that is sent to a server.

This new feature is a solution for a problem that has been discussed here: http://curl.haxx.se/mail/lib-2015-05/0156.html . CURLOPT_RESOLVE is not a good solution for this problem, because it may affect other transfers.

CURLOPT_CONNECT_TO_HOST does not affect other transfers and it supports hostnames as well as IP addresses. With this option, it is not necessary to "override" the SNI hostname anymore (see #607 ), because the request can be directed at the correct server.

It is also present in the "curl" tool:
curl --connect-to-host node1.example.com:8443 https://www.example.com

I know that the feature window is currently closed, but I hope to get some feedback, and maybe this will be ready for curl 7.48.0 :-)

@mkauf mkauf force-pushed the mkauf:connect_to_host_pr branch to 86b66db Feb 2, 2016
@bagder

This comment has been minimized.

Copy link
Member

commented Feb 4, 2016

Can you please clarify and exemplify exactly when and how CURLOPT_RESOLVE or Host: header manipulations aren't good enough for you? We prefer to not add feature for theoretical use cases or for unspecified hypothetical users.

@bagder bagder self-assigned this Feb 4, 2016
@mkauf

This comment has been minimized.

Copy link
Contributor Author

commented Feb 6, 2016

It's not a theoretical use case :) I work for Ergon Informatik AG, and we plan to use this feature in the next release of Airlock WAF.

Host header manipulations don't work with TLS/SSL, because some web servers (e.g. Apache) check whether the host name in the "Host" header matches the host name that was used for SNI (server name indication).

CURLOPT_RESOLVE has several issues:

  • It may affect other transfers. See the example program multi-resolv.c.txt. It should create three connections to https://www.high5.nl/ , the first to www1.high5.nl, the second to www2.high5.nl, the third to www3.high5.nl. But all three connections are made to www1.high5.nl. (note: I don't own that domain. I have searched a domain that has www1/www2/www3 subdomains on different IP addresses)
  • Connections are reused that (arguably) should not be reused. Example: curl -v https://www.michael-kaufmann.ch/ --resolve www.michael-kaufmann.ch:443:80.74.148.210 --next https://www.michael-kaufmann.ch --resolve www.michael-kaufmann.ch:443:87.239.214.151
    curl creates a connection and reuses it for the second request. curl should send the requests to different IP addresses.
  • It cannot map host names to other host names (only IP addresses are supported)
  • It does not work with HTTP proxies

Fixing CURLOPT_RESOLVE would of course also be an option, but that may break some programs that depend on these oddities.

@bagder

This comment has been minimized.

Copy link
Member

commented Feb 8, 2016

I can see how it can be useful to a redirect to another name instead of a fixed IP. Your version also has a "single shot" nature that the *_RESOLVE approach doesn't - but *RESOLVE does allow you to remove a name if you really wanted to.

But then your approach doesn't work with HTTP redirects (right?) as it seems to redirect all connect attempts for that handle to the specified host name, which seems like an unfortunate downside.

My ideal situation would be to expand/fix *RESOLVE so that it can provide the features you want as well rather than to introduce a new separate option as it'll be even harder for users to grasp the differences and when to use which.

@mkauf

This comment has been minimized.

Copy link
Contributor Author

commented Feb 9, 2016

The approach also works with HTTP redirects: The code checks whether the hostname/port is the same as in the first request. If the hostname/port differs, then the "connect to host" is not used.

I agree that improving CURLOPT_RESOLVE would be better for the users of libcurl. The interface is quite weird though: Entries can be added/removed, and libcurl keeps track of the "resolve map". It's not even possible to clear all existing entries without knowing them. Setting CURLOPT_RESOLVE to NULL has no effect. A backwards-compatible way to support this would be a special string like "-" that removes all entries from the map.

I would prefer an interface similar to CURLOPT_HTTPHEADER that replaces all entries at once. But that would probably break existing code.

What do you think?

@bagder

This comment has been minimized.

Copy link
Member

commented Mar 14, 2016

I think discussing APIs and designs is much better done on the mailing list...

lib/url.c Outdated
char *endp = NULL;
*host_portno = 0x0; /* cut off number from host name */
host_portno ++;
port = strtol(host_portno, &endp, 10);

This comment has been minimized.

Copy link
@bagder

bagder Mar 23, 2016

Member

strtoul() perhaps or otherwise check for <0 values?

lib/url.c Outdated
}

if(!*hostptr)
infof(data, "Host name is missing in connect to host string\n");

This comment has been minimized.

Copy link
@bagder

bagder Mar 23, 2016

Member

shouldn't illegal syntax cause this function to return an error?

This comment has been minimized.

Copy link
@mkauf

mkauf Mar 26, 2016

Author Contributor

shouldn't illegal syntax cause this function to return an error?

Currently, this function is very similar to parse_proxy(). It emits error messages but does not cancel the request.

transfers of other easy handles that have been added to the same multi handle.

If an HTTP proxy is used for a request having a \fIconnect to host\fP, the HTTP
proxy is automatically switched to tunnel mode for this specific request.

This comment has been minimized.

Copy link
@bagder

bagder Mar 23, 2016

Member

Why is that?

lib/url.c Outdated

/*************************************************************
* Do not use the "connect to host" if this is a redirect,
* except if the host name and port are the same as the first time

This comment has been minimized.

Copy link
@bagder

bagder Mar 23, 2016

Member

That little piece of "magic" needs to be in the docs too!

This comment has been minimized.

Copy link
@mkauf

mkauf Mar 26, 2016

Author Contributor

Agreed, I have added it to the docs.

@bagder

This comment has been minimized.

Copy link
Member

commented Mar 23, 2016

This pull request doesn't merge cleanly anymore, can you please rebase?

@mkauf mkauf force-pushed the mkauf:connect_to_host_pr branch from 86b66db to 30bab0f Mar 26, 2016
@mkauf

This comment has been minimized.

Copy link
Contributor Author

commented Mar 26, 2016

I have rebased the pull request and improved the documentation.

@mkauf

This comment has been minimized.

Copy link
Contributor Author

commented Mar 30, 2016

My ideal situation would be to expand/fix *RESOLVE so that it can provide the features you want as well rather than to introduce a new separate option as it'll be even harder for users to grasp the differences and when to use which.

I have rethought this... CURLOPT_RESOLVE and CURLOPT_CONNECT_TO_HOST are quite complementary features. One of them cannot replace the other.

With CURLOPT_RESOLVE, it's possible to manipulate curl's DNS cache. It's possible to simulate a custom "DNS server" with a special resolving strategy by changing curl's DNS cache before each request. It's possible to remove entries, even entries that have not been added by CURLOPT_RESOLVE. Probably people use CURLOPT_RESOLVE in other odd ways and it's just impossible to change the behavior of this feature without breaking something.

So CURLOPT_RESOLVE is a "low level" feature. The connection cache cannot take it into account because that would break some valid use cases.

On the other hand, CURLOPT_CONNECT_TO_HOST is a "high level" feature. It has a single purpose and therefore the connection cache can take it into account.

Having both options may lead to some confusion, but most people won't care because both features will work just fine for simple use cases, e.g. downloading a single file with the curl utility.

@bagder

This comment has been minimized.

Copy link
Member

commented Apr 2, 2016

I've basically arrived at the same conclusion as you state here @mkauf, and I plan to merge this PR...

@mkauf mkauf force-pushed the mkauf:connect_to_host_pr branch from 30bab0f to 3701947 Apr 2, 2016
@bagder

This comment has been minimized.

Copy link
Member

commented Apr 9, 2016

Shouldn't this option provide a list of mappings host A => host B for more than one host? An HTTP request with redirects can connect to virtually any number of hosts along the chain. Also, when you invoke a single command line you can provide a bunch of URLs/requests that you then may want to connect to different hosts etc.

If we agree that it should be possible to provide more than one such "redirected connect", maybe we can use/extend the existing --resolve command line option for this! But instead of the replacement IP address we use another host name. Internally we can of course split it up and store the info in two different lists as they're handled somewhat differently internally, but I think it would simplify the API and command line interface.

Or am I wrong?

@bagder

This comment has been minimized.

Copy link
Member

commented Apr 9, 2016

Oh, right. The --resolve list doesn't allow a redirected connect port, but we could probably support that without too much problem for this case.

@mkauf mkauf force-pushed the mkauf:connect_to_host_pr branch from 3701947 to 2106381 Apr 10, 2016
@mkauf mkauf force-pushed the mkauf:connect_to_host_pr branch from 2106381 to 21984f0 Apr 10, 2016
@mkauf

This comment has been minimized.

Copy link
Contributor Author

commented Apr 10, 2016

I have updated the pull request, it's now possible to specify multiple --connect-to parameters, similar to --resolve. I think that we shouldn't mix these two settings.

I have also renamed CURLOPT_CONNECT_TO_HOST to CURLOPT_CONNECT_TO, because it's now possible to only override the host or only override the port. I think that it's very flexible and powerful.

What do you think?

@bagder

This comment has been minimized.

Copy link
Member

commented Apr 17, 2016

That sounds like a really good approach!

@bagder

This comment has been minimized.

Copy link
Member

commented Apr 17, 2016

I'm in the progress of merging this. Found a few flaws that I'm cleaning up:

  • mixed code/declaration isn't C90 compatible
  • unchecked memory allocation caused torture test to fail
  • minor funny code indentations
  • fixed picky compiler warnings on int/long uses
  • made the host name comparison case insensitive as host names are

I also squashed all commits into a single one before I merged it in commit cd8d236

The test cases you added are very useful. Thanks!

Feel free to check out this code and verify that I didn't mess it up!

@mkauf

This comment has been minimized.

Copy link
Contributor Author

commented Apr 18, 2016

Great! :-) Thank you for your time and effort! The code changes are fine.

I'm very glad that this feature made it into libcurl.

@mkauf mkauf closed this Apr 18, 2016
@mkauf mkauf deleted the mkauf:connect_to_host_pr branch Apr 18, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jan 19, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
2 participants
You can’t perform that action at this time.