-
-
Notifications
You must be signed in to change notification settings - Fork 6.7k
Draft: Add libproxy support #9068
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Updated MR initially made by dwmw2
| option(ENABLE_LIBPROXY "Define if you want to enable libproxy support" ON) | ||
| if(ENABLE_LIBPROXY) | ||
| find_package(libproxy REQUIRED) | ||
| set(ENABLE_LIBPROXY ON) | ||
| list(APPEND CURL_LIBS ${LIBPROXY_LIBRARIES}) | ||
| endif() | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As it is written, it now requires libproxy by default in new builds. I don't think, it should.
Also line 232 makes no sense.
Maybe:
set(CURL_ENABLE_LIBPROXY "AUTO" CACHE STRING "Enable libproxy support")
set_property(CACHE CURL_ENABLE_LIBPROXY PROPERTY STRINGS "ON" "OFF" "AUTO")
string(TOUPPER "${CURL_ENABLE_LIBPROXY}" CURL_ENABLE_LIBPROXY)
if(CURL_ENABLE_LIBPROXY EQUAL "ON")
find_package(libproxy REQUIRED)
elseif(CURL_ENABLE_LIBPROXY EQUAL "AUTO")
find_package(libproxy)
endif()
if(libproxy_FOUND)
set(ENABLE_LIBPROXY ON)
list(APPEND CURL_LIBS ${LIBPROXY_LIBRARIES})
endif()
This is untested - I've written it here as a proposal.
Also note, that you don't add include paths anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your feedback, i will updated it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually the default value has been set on purpose. I want to ensure that newer builds of curl are using it by default. I could change the naming then to disable libproxy. Not sure how to deal with it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default if available - this is what all other options are doing (unless they're OFF by default).
I don't want my scripts suddenly fail because of a new optional component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default if available - this is what all other options are doing (unless they're OFF by default).
I don't want my scripts suddenly fail because of a new optional component.
If you do "by default if available" then it's going to be enabled at random depending on what build dependencies are specified and how eagle-eyed the human packager is to check all the various build options. Here's the standard "don't do that" wiki page. Humans are just not smart enough to avoid traps like this. Please pick a default and force people building curl to deviate intentionally.
libproxy should definitely be enabled by default on Linux platforms. Not sure about others.
P.S. If you use meson instead of CMake, it is possible to safely use auto dependencies. But doesn't matter, because you use CMake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I'll note I disagree with the wiki page here:
In the case of binary-based distributions, like RPM or DEB based ones, automagic dependencies does not change anything: if the user has something installed and is building by hand, it's usually what he wants to enable, while if it's the maintainer, he'll just have to add a dependency over the packages required to run the binaries he has created.
Basically the wiki page suggests this is only bad for Gentoo and not normal Linux distributions. It's not true. If you use automagic dependencies, Fedora/Debian/Ubuntu/yocto/buildroot/whatever packagers probably won't notice, and whether the feature gets enabled or disabled is likely to depend on what BuildRequires are specified. In short, not a good idea. Failing the build unless you pass -DENABLE_FOO=OFF is a cheap way to avoid broken downstream packages and hardly difficult to update your build scripts to handle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally in the "OFF by default" camp for everything, but this is an established pattern in curl and the only thing worse than this pattern, IMHO, is inconsistency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the general curl pattern is that we go with features OFF to start with, then over time if we deem the "thing" almost a defacto thing that most users want and use, we can possibly switch the default to ON if present and working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bagder Yeah, something like that, but it's also not really done correctly in at least few cases, where "ON" somehow means "AUTO". Just look at the libssh2/libssh, gssapi... I wonder how many builds would break if we changed that to "ON" means "REQUIRED". Would there even be a desire to tidy this up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You partially inspired me to write a blog post. See rule three.
|
What I don't understand is why libcurl has to do that itself. If this is actually for using libcurl from another application, the application could just pass the proxy found by libproxy to libcurl. If the use case is for the curl tool, a wrapper script passing the proxy to curl might be another solution, but that might of course be quite some effort if that's a common use case. |
|
Speaking of Windows, proxy detection can be done with reasonably small amount of code and with minor, if any, build-level work. An example here. [ This is "works for me" code and hasn't been vetted with a wide spread of users and systems, though the idea/code is similar to what Chromium did at the time. ] |
The thing is that then each app that makes use of curl would have to implement proxy detection logic. This would be n vs 1. I don't think that every app developer is aware of proxy and even iff then most likely misses pac proxy which are on another level. It would make much more sense to do this here in curl. At the moment we do have several application in use which are using curl as a backend but they do not offer any configuration logic. Thus we cannot use them at all. Hardcoding a single proxy wouldn't work either as for e.g. login and real data flow different proxies are used. |
I'm totally open to change the implementation and use something else. We just have the need for a working solution in curl. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I miss documentation for the new option, and no tests verifying it.
I miss documentation for libproxy. There also seems to be no tests in the libproxy project nor does it run any CI jobs. It does not give an impression of something to lean on in production.
Any thoughts on how user of curl the command line tool would like to use it?
(I've also set this as 'needs-votes' as I'm not personally convinced this is actually a PR we want merged - I welcome more feedback and opinions.)
|
|
||
| dnl We don't know the version number "statically" so we use a dash here | ||
| AC_INIT([curl], [-], [a suitable curl mailing list: https://curl.se/mail/]) | ||
| AC_INIT([curl],[-],[a suitable curl mailing list: https://curl.se/mail/]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These two first changes seem unrelated. Surely you don't need 2.71?
| @@ -182,6 +182,9 @@ Proxy port to use. See \fICURLOPT_PROXYPORT(3)\fP | |||
| Proxy type. See \fICURLOPT_PROXYTYPE(3)\fP | |||
| .IP CURLOPT_NOPROXY | |||
| Filter out hosts from proxy use. \fICURLOPT_NOPROXY(3)\fP | |||
| .IP CURLOPT_LIBPROXY | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The option name should describe what it does, what it's for. Not what library it uses. It might end up not using that library at some point.
Also, since this is about setting a proxy, wouldn't it be better if this used the existing proxy options somehow to also avoid the problem with more mutexed options?
| #include "curl_memory.h" | ||
| #include "memdebug.h" | ||
|
|
||
| static pxProxyFactory *factory; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why this needs to use a global variable. I think it shouldn't.
| } | ||
| } | ||
|
|
||
| return result; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- When is this pointer free'd again?
- libproxy doesn't return the proxy URL using the same schemes as curl supports does it? I mean for SOCKS4/4a/5/5h does it?
- Can a libproxy proxy be used over unix domain sockets?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll let jbrummer answer 1. Edit: actually, looks like it is leaked. ;)
Regarding 2, libproxy may return socks:// (which means: try socks5://, then socks4a://, then socks4://, in that order), or it may return direct:// (which means "connect directly"). These need to be translated into curl proxy syntax. Then I don't know about socks5h:// because GLib doesn't know about socks5h://... but I guess that should that be attempted first, before socks5://?
Regarding 3. libproxy just returns URIs. It would not generally make sense to connect to an HTTP proxy via anything other than TCP, but it's theoretically possible, if you are running a proxy server locally and have taught it to respond via UNIX domain sockets rather than TCP. Same for the other common proxy types, HTTPS and FTP and SOCKS (at least I think SOCKS goes over TCP?). But in theory, if you pass a different URI scheme, it could return any sort of proxy URI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
libproxy may return socks:// (which means: try socks5://, then socks4a://, then socks4://, in that order)
BTW this does not mean libproxy might not return more specific URI schemes too. libproxy has basically no documentation, so better assume it will return whatever it wants. ;) But specifically socks:// and direct:// are going to require special handling here. In particular, if you do not see direct:// then you must not connect directly.
| @@ -182,6 +182,9 @@ Proxy port to use. See \fICURLOPT_PROXYPORT(3)\fP | |||
| Proxy type. See \fICURLOPT_PROXYTYPE(3)\fP | |||
| .IP CURLOPT_NOPROXY | |||
| Filter out hosts from proxy use. \fICURLOPT_NOPROXY(3)\fP | |||
| .IP CURLOPT_LIBPROXY | |||
| Use the libproxy API to determine which proxy, if any, to | |||
| use. \fICURLOPT_LIBPROXY(3)\fP | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This man page is missing.
| * | (__| |_| | _ <| |___ | ||
| * \___|\___/|_| \_\_____| | ||
| * | ||
| * Copyright (C) 2011 - 2016, Daniel Stenberg, <daniel@haxx.se>, et al. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We want the copyright ranges to end the year of the most recent change: 2022.
| @@ -194,6 +194,10 @@ static CURLcode global_init(long flags, bool memoryfuncs) | |||
| } | |||
| #endif | |||
|
|
|||
| #ifdef ENABLE_LIBPROXY | |||
| Curl_libproxy_global_init(); | |||
| #endif | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Provide the functions as empty macros if the feature is not built-in, then you don't need #ifdefs in the code like this.
| @@ -310,6 +314,10 @@ void curl_global_cleanup(void) | |||
| free(leakpointer); | |||
| #endif | |||
|
|
|||
| #ifdef ENABLE_LIBPROXY | |||
| Curl_libproxy_global_cleanup(); | |||
| #endif | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another macro opportunity.
libproxy is barely maintained and objectively pretty terrible -- no docs or tests or CI are far from the worst problems here -- but it's also standard on Linux. If your software does not use it, it's probably broken. Otherwise, I'm not sure how you'd know to check GNOME proxy settings, KDE proxy settings, desktop xyz proxy settings, proxy environment variables, how to read and execute web proxy autoconfig JS, etc. Even if you don't do HTTP at all, you still need to check desktop proxy settings because a SOCKS proxy could be configured. Currently some software has been migrating from libsoup to libcurl, and I've noticed incorrect proxy handling is the most common issue with such migrations. libsoup handles proxies transparently, but libcurl does not. Applications using libcurl currently need to manually proxy proxy settings (bad pun intended) from libproxy to curl proxy options, which is easy to get wrong and annoying. This may be by design, but it's questionable. There is one better option: depend on GLib and use GProxyResolver, which is smart enough to use libproxy only if it really needs to. (I maintain this code and am slowly reducing its use of libproxy.) But that does you no good unless you're willing to depend on GLib. |
| result = libproxy_results[0]; | ||
| for(i=1; libproxy_results[i]; i++) | ||
| free(libproxy_results[i]); | ||
| free(libproxy_results); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will crash on Windows, see this bug. Example solution. Just make sure to require 0.4.16.
| if(proxy) | ||
| infof(data, "Uses proxy env variable %s == '%s'", envp, proxy); | ||
| #ifdef ENABLE_LIBPROXY | ||
| if(/*data->set.libproxy && */!proxy) { | ||
| char *libproxy_proxy = Curl_libproxy_detect_proxy(data->state.url); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like libproxy_proxy is leaked, right?
Also, stray debug comment above.
|
To add to the problems mentioned already, libproxy also contains and uses its own
I also find it curious that there's no expire time set for a proxy. With PAC files etc supported, proxies can very well change depending on things like time of day or date but with this API the application doesn't get any clues on when to recheck - a browser rechecks with the PAC for every URL. |
|
I think the sensible thing to do before continuing on this PR is to improve libproxy before we build more code to use it.
libproxy is not even a lot of code, it could even make sense to just plainly write a new library based on how you would want a library like this to work. |
|
@bagder What about @mcatanzaro proposal to make use of glib? This is an established library and works very well. Any thoughts? Maybe this is the better way. |
Has to do what for itself? Work out which proxy to use for a given URL? The point here is that libcurl doesn't have to do that for itself. It just asks libproxy. In the ideal case, libproxy ends up just doing a D-Bus call to a standalone tool like pacrunner, which has already started up a JS interpreter with the PAC file (that in turn was provided from DHCP or VPN config) and can serve a response. Should the application do that for itself? I don't really think so; no more than the application should have to do DNS for itself, or TLS for itself. Unless it wants to override such things, we shouldn't make it. |
That's all true. Libproxy itself is a vile C++ monstrosity which has long since jumped the shark. The libproxy API, on the other hand, is sane and simple enough and other implementations exist. There's one bundled with pacrunner, which basically just D-Bus calls as I mentioned before. And pacrunner gets its configuration prodded into it by NetworkManager/ConnMan etc., from the underlying DHCP/VPN services or manual network configuration etc. I'm all for using something other than the original libproxy; this PR is about the API. I am kind of ambivalent about doing a GLib/GProxyResolver thing that reloads the PAC file into a JS interpreter in every process instead of the pacrunner approach, but if you wrap that into an implementation of the libproxy API, that's just fine. It's just another implementation. |
|
@dwmw2 Hi, it's been a while :) With the pacrunner approach you mean something like https://docs.01.org/clearlinux/latest/guides/clear/autoproxy.html ? |
Indeed :)
Yes. PacRunner is a separate dæmon which loads a PAC file into a JS interpreter like duktape (it supports others including mozjs, v8,etc.). It can have its configuration poked into it by systemd, NetworkManager, ConnMan etc. according to the underlying network that we're connected to. Then clients just query it — instead of all of the clients having to obtain the information about the current network from whatever happens to be managing it, and having to fetch the PAC file for themselves and run it in their own JS interpreter. |
glib seems like such a huge beast to "suck in" for such a limited functionality, but I don't know anything about it so I can't even tell if that would make it easier to make use of curl's own URL parser and HTTP transfer abilities. |
CVE-2020-25219 is a humorous example of the badness here. Anyway, my recommendation is to use libproxy really needs async public APIs. (But using sync APIs on a thread is better than not supporting proxies properly.)
You have to recheck with every URL indeed, so the concept of expire time does not match how it's expected to be used. The whole point of PAC is to be able to vary results in arbitrary ways on a per-URL basis. Recheck for every URL is how both libproxy and GProxyResolver are expected to be used, and it's what GSocketClient does.
We've previously discussed ConnMan's pacrunner here. TL;DR I think it runs its JS interpreter as root? Hence I'm not impressed that ConnMan's pacrunner is actually better than upstream libproxy. But as with upstream libproxy problems, this would be a fixable problem if somebody was sufficiently motivated to work on it (not me!).
Note GLib ensures the JS interpreter is always used in a subprocess.
Using curl's own URL parser via GLib would not be possible. And its HTTP transfer ability... well, currently that's libproxy's hand-rolled fun with std::string, but I have a draft merge request with it changed it to use libsoup instead, so... probably not better as far as libcurl is concerned. |
|
@bagder FTR: I'm currently in contact with all participant of this bug to prepare the best solution for curl. Furthermore I've contacted libproxy maintainer and it looks like I (and my company) are going to support the development and fix the missing issues. How many thumbs up do you need to remove the needs-votes label? Looks like 19 is already the highest count in your ticket system. |
|
I blogged about this case a few days ago. I need to see an improved libproxy before I will accept its use by libcurl. I outlined 5 different areas in my blogpost in which I would like to see improvements, and while I don't demand that they all are top notch, I also don't think you should expect me to accept this PR in the near term since there is work to do first. So, I'm glad there area thumbs-ups and that people want to use this. Let's make sure it becomes good enough for us to use it. |
|
Forgive me for jumping in on the discussion. After looking at the proposed changes, it seems like the PR basically wants to do 2 things:
The rest is build system and other necessary evil stuff. I propose there's a way for this and similar extensions to be achieved without necessitating changes to libcurl. For this specific case one could, instead of calling
Alternatively, one would set the URL through a helper function that does (2) to (4), and also takes care of Similarly, instead of calling With very little effort, the proposed wrappers can be implemented as a stand-alone project, completely external to libcurl, reusable by multiple applications. Looking at the documentation for If all of the above holds true, then instead of adding libproxy support one would add That said, I don't have all the context of this discussion. In case my comments are irrelevant, please disregard. |
|
I've stated my opinion on this effort and as there are no indications of these issues getting addressed in the short term, I close this PR. A fixed/improvied libproxy would need to have a different API anyway (to avoid sync resolves and sync HTTP fetches) so it would have to be integrated slightly differently when or if it ever happens. Thanks a lot anyway for the work put into this! |
| char *result = NULL; | ||
|
|
||
| if(factory) { | ||
| char **libproxy_results = px_proxy_factory_get_proxies(factory, url); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just so we don't forget: this should use an asynchronous API. That means changing libproxy to expose an async API, and also changing curl to use it in an asynchronous manner. Network requests should block, but the application thread should not block.
| if(libproxy_results) { | ||
| int i; | ||
|
|
||
| /* We only cope with one; can't fall back on failure */ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is unfortunate too. Ideally if curl cannot connect to the first result, it should try the other results sequentially instead of ignoring them. i.e. curl should understand that proxy configuration is a list of possible proxy addresses, not just a single address.
Updated MR initially made by dwmw2 (#977)
As our company uses a pac proxy with dozens of entries such a support is necessary within curl. Let's revive this old (but updated) PR and get back to talking.
@dwmw2