-
Notifications
You must be signed in to change notification settings - Fork 6.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add global proxy setting #51749
Add global proxy setting #51749
Conversation
This is an automated comment for commit 2367434 with description of existing statuses. It's updated for the latest CI running
|
The downside of this approach is that everytime a HTTP request is created, one has to remember to grab & pass proxy config. I feel like this should be hardcoded within the HTTP request "creator". Maybe it can be done for the generic case (i.e the non-S3 specific configuration we are adding with this PR), but it will not work for the existing storage conf one. One idea that could work is to indeed hardcode the proxy logic within the HTTP request creator which is applied if the proxy config argument hasn't been specified. Not sure if these are feasible, will read the code and come back. |
@tavplubix kind ping |
@arthurpassos, I'm reviewing it right now. |
struct ProxyConfiguration | ||
{ | ||
enum class Protocol | ||
{ | ||
HTTP, | ||
HTTPS, | ||
ANY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the only difference from ClientConfigurationPerRequest
is support for ANY
, right? But does it really make sense if we handle ANY
as HTTP
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not the only difference. The main and most important difference is that ClientConfigurationPerRequest
is a S3 specific structure and relies on Aws::Http::Scheme
(a third party module enum).
ProxyConfiguration
is a ClickHouse domain enum, meant to be used across ClickHouse source code. It removes this extra dependency and tight coupling.
For instance, URL functions also need proxy, but should not have a dependency on AWS stuff. There might be other parts of CH that depend on proxy stuff and will adhere to this implementation later, no dependency on AWS stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not only that, but ProxyConfiguration::Protocol::ANY
is not handled as HTTP
. Both RemoteResolver
and ListResolver
will literally use ANY (HTTP or HTTPs) if ANY is specified. You can check that in ProxyConfigurationResolverProvider.cpp
.
You might have been confused by the EnvironmentResolver
, which is not currently handling ANY as ANY. I'll address that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClientConfigurationPerRequest is a S3 specific structure and relies on Aws::Http::Scheme (a third party module enum)
I don't mind using Aws::Http::Scheme
in ClickHouse code (since we already do), but okay
You might have been confused by the EnvironmentResolver, which is not currently handling ANY as ANY
Yes, and also by
ClickHouse/src/IO/S3/ProxyConfigurationResolverAdapter.cpp
Lines 18 to 20 in 63c7ba5
case DB::ProxyConfiguration::Protocol::ANY: | |
// default to HTTP since there is no ANY in AWS::Scheme and we don't want an exception | |
return Aws::Http::Scheme::HTTP; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, if I'm not mistaken, ANY is used only as an argument of a factory (to get any proxy) and is never used as a part of an actual proxy configuration (each proxy is either HTTP or HTTPS). In other words, can ProxyConfiguration::protocol
be ANY
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ANY
is only used to get any resolver
, but ProxyConfiguration::protocol
will never be ANY
.
The only reason ANY
exists is to keep backwards compatibility with the "old" behavior, where the user did not have the option to specify if the proxy was meant for HTTP or for HTTPs requests. ClickHouse would pick any.
So ANY
is only used in the context of the S3 storage conf proxy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, ANY is only used to get any resolver, but ProxyConfiguration::protocol will never be ANY.
So there's no real difference between ClientConfigurationPerRequest
and ProxyConfiguration
(except for using a third party module enum)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are correct
class ProxyConfigurationResolverAdapter : public ProxyConfiguration | ||
{ | ||
public: | ||
explicit ProxyConfigurationResolverAdapter(std::shared_ptr<ProxyConfigurationResolver> resolver_) | ||
: resolver(resolver_) {} | ||
/// Returns proxy configuration on each HTTP request. | ||
ClientConfigurationPerRequest getConfiguration(const Aws::Http::HttpRequest & request) override; | ||
void errorReport(const ClientConfigurationPerRequest & config) override; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either I'm missing something or it should have failed to compile. The base class, ProxyConfiguration
is a trivial struct with no virtual methods, and there are no other base classes. So there's nothing to override, these methods are not virtual
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can we get rid of ProxyConfigurationResolverAdapter
at all? As far as I understood, ClientConfigurationPerRequest
to ProxyConfiguration
and vice versa (and the only difference between them is ANY
, but there's another comment about this).
There are only two usages of ProxyConfigurationProvider
with ProxyConfigurationResolverAdapter
that look like
auto proxy_configuration_resolver = S3::ProxyConfigurationProvider::get(protocol);
auto per_request_configuration = [=] (const Aws::Http::HttpRequest & req) { return proxy_configuration_resolver->getConfiguration(req); };
auto error_report = [=] (const ClientConfigurationPerRequest & req) { proxy_configuration_resolver->errorReport(req); };
Let's just add two simple methods like ProxyConfiguration::toClientConfiguration()
and ProxyConfiguration::fromClientConfiguration(...)
. AFAIU, it will allow to replace the adapter with simple code like:
auto proxy_configuration_resolver = S3::ProxyConfigurationResolverProvider::get(protocol);
auto per_request_configuration = [=] (const Aws::Http::HttpRequest & req) { return proxy_configuration_resolver->resolve().toClientConfiguration(); };
auto error_report = [=] (const ClientConfigurationPerRequest & req) { proxy_configuration_resolver->errorReport(ProxyConfiguration::fromClientConfiguration(req)); };
Or we can even override the implicit conversion operator...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either I'm missing something or it should have failed to compile. The base class,
ProxyConfiguration
is a trivial struct with no virtual methods, and there are no other base classes. So there's nothing to override, these methods are not virtual
This is confusing because of backwards compatibility. Previously, proxy support was added only to S3 storage. At that moment, the DB::S3::ProxyConfiguration
class was introduced. See https://github.com/ClickHouse/ClickHouse/blob/master/src/Disks/ObjectStorages/S3/ProxyConfiguration.h. This is the base class that contains those methods.
For me, it's poorly named. It's named ProxyConfiguration
, but it's not a configuration, it's a class that will provide / resolve / generate the configuration.
On the other hand, the DB::ProxyConfiguration
, struct you are seeing on this PR, is an actual configuration class.
So, to summarize, DB::S3::ProxyConfiguration
is a poorly named class and serves as a resolver for S3 proxy configuration. DB::ProxyConfiguration
is an actual configuration class mean to be used across ClickHouse without any dependency (i.e, can be used for URL, S3, and whatever else).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, can we get rid of
ProxyConfigurationResolverAdapter
at all? As far as I understood,ClientConfigurationPerRequest
toProxyConfiguration
and vice versa (and the only difference between them isANY
, but there's another comment about this).There are only two usages of
ProxyConfigurationProvider
withProxyConfigurationResolverAdapter
that look likeauto proxy_configuration_resolver = S3::ProxyConfigurationProvider::get(protocol); auto per_request_configuration = [=] (const Aws::Http::HttpRequest & req) { return proxy_configuration_resolver->getConfiguration(req); }; auto error_report = [=] (const ClientConfigurationPerRequest & req) { proxy_configuration_resolver->errorReport(req); };
Let's just add two simple methods like
ProxyConfiguration::toClientConfiguration()
andProxyConfiguration::fromClientConfiguration(...)
. AFAIU, it will allow to replace the adapter with simple code like:auto proxy_configuration_resolver = S3::ProxyConfigurationResolverProvider::get(protocol); auto per_request_configuration = [=] (const Aws::Http::HttpRequest & req) { return proxy_configuration_resolver->resolve().toClientConfiguration(); }; auto error_report = [=] (const ClientConfigurationPerRequest & req) { proxy_configuration_resolver->errorReport(ProxyConfiguration::fromClientConfiguration(req)); };
Or we can even override the implicit conversion operator...
I can do something along these lines for the sake of the PR, but I honestly prefer to avoid throwing the "burden" of converting stuff and "knowing implementation details" to the client. As of now, for the client, it's seamless. It does not know there are indirection layers, conversions and where the proxy configuration is coming from. If that changes, client code will not have to change.
I also don't like the idea of having a DB::ProxyConfiguration
constructor that takes ClientConfiguration
or the other way around. This adds extra dependencies.
I usually code with the notion of "the less a class knows, the better". If URL functions need to have access to DB::ProxyConfiguration
, it should only see that. With those auxiliary methods or that constructor, it'll "see" DB::S3::ClientConfiguration
, a useless dependency for URL functions (or anything outside the S3 scope).
I do understand though it's hard to keep track of a big number of classes and it might be confusing because of the similar names. Sometimes namespaces are not enough for us to keep track of that.
Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"knowing implementation details" to the client. As of now, for the client, it's seamless
The client is basically 6 lines of code in 2 places (3 lines in each). Updating these 6 lines is much simpler than introducing some abstract adapters (about 100 lines of code).
If that changes, client code will not have to change.
If that changes, we can handle changing 6 lines of code.
This adds extra dependencies.
It's not a big problem (we have a monorepo and a static binary). Also, it's quite natural to have an easy way to convert "new" ProxyConfiguration
to "old" ClientConfiguration
and vice versa
I usually code with the notion of "the less a class knows, the better"
I understand this, and it's a really good principle sometimes. But I also like "less code is better" and "simpler is better". And in this particular case, all these abstractions look to me a bit overengineered (or maybe like "premature optimization", but in terms of abstractions)
With those auxiliary methods or that constructor, it'll "see" DB::S3::ClientConfiguration, a useless dependency for URL functions (or anything outside the S3 scope).
I don't think it's a big problem, especially if URL function will not explicitly use this stuff and will only include some headers containing it. But if you want to do it the right way, then the right way is probably different.
S3 is basically the same as URL with something extra on top of basic HTTP(S). S3 and URL should depend on some common http stuff. And proxies must be a part of that common http stuff, it's not something S3-specific. So the right way is to refactor the code in the S3 scope, and replace DB::S3::ClientConfiguration
with something that can be used in both URL and S3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, fair enough. I'll see what I can do about this refactor, worst case scenario I'll add the auxiliary functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, it's not that simple. There is extra logic implemented in DB::S3::ProxyConfigurationProvider
, this extra logic is required to keep backwards compatibility.
To integrate the old behavior / settings with the new behavior / settings, here's what's implemented:
- Try to fetch from scoped settings (storage conf proxy specific).
- If it can't find storage conf specific proxy settings, try to fetch from general settings.
- If it can't find there, grab from env.
This is the job of DB::S3::ProxyConfigurationProvider
.
If I manage to remove ClientConfigurationPerRequest
and use DB::ProxyConfiguration
only, then maybe it's ok to eliminate DB::S3::ProxyConfigurationResolverAdapter
and DB::ProxyConfigurationProvider
by adding a method a backwards compatible method in DB::ProxyConfigurationResolverProvider
. Will look into it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed ClientConfigurationPerRequest
, DB::S3::ProxyConfigurationProvider
, DB::S3::ProxyConfiguration
and DB::S3::ProxyConfigurationResolverAdapter
.
Can you double check it's ok now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, will review it tomorrow
struct ProxyConfiguration | ||
{ | ||
enum class Protocol | ||
{ | ||
HTTP, | ||
HTTPS, | ||
ANY |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ClientConfigurationPerRequest is a S3 specific structure and relies on Aws::Http::Scheme (a third party module enum)
I don't mind using Aws::Http::Scheme
in ClickHouse code (since we already do), but okay
You might have been confused by the EnvironmentResolver, which is not currently handling ANY as ANY
Yes, and also by
ClickHouse/src/IO/S3/ProxyConfigurationResolverAdapter.cpp
Lines 18 to 20 in 63c7ba5
case DB::ProxyConfiguration::Protocol::ANY: | |
// default to HTTP since there is no ANY in AWS::Scheme and we don't want an exception | |
return Aws::Http::Scheme::HTTP; |
if (resolver_configs > 2) | ||
{ | ||
throw Exception(ErrorCodes::BAD_ARGUMENTS, "Only two remote proxy resolvers are allowed, one for HTTP and one for HTTPs"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it allows having two http
(or two https
) proxies configured, but doesn't allow, for example, 2 http
+ 1 https
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm.. indeed. Correct behavior should be: 1 HTTP and 1 HTTPs. Will fix that tomorrow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tavplubix what do you think about allowing the user to specify N HTTP remote resolvers and N HTTPs remote resolvers, but we only pick the first one of each and simply ignore the rest. It'll make code a lot simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another thing that might be worth discussing is that these exceptions are not thrown at startup time, rather upon request.
In this case, the request will be aborted. Should we try the other proxy methods first (list and environment)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
allowing the user to specify N HTTP remote resolvers and N HTTPs remote resolvers, but we only pick the first one of each and simply ignore the rest.
It's okay
Another thing that might be worth discussing is that these exceptions are not thrown at startup time, rather upon request.
It's okay that requests will fail if the server is misconfigured (it would be better to check on startup, but it's not necessary)
Should we try the other proxy methods first (list and environment)?
No, we should not try to continue working if the server is misconfigured
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing the user to specify N HTTP remote resolvers and N HTTPs remote resolvers, but we only pick the first one of each and simply ignore the rest.
It's okay
Ok, removed the hard check for only two resolvers. It now simply picks the first configuration that matches the protocol.
Should we try the other proxy methods first (list and environment)?
No, we should not try to continue working if the server is misconfigured
If it is misconfigured, it'll throw an exception. No fallthrough. The only caveat is "missing protocol". For instance, if the user configured a remote resolver (has precedence over other methods) for HTTP, but not for HTTPs. And there is a list resolver configured for HTTPs, it'll grab the list resolver. I believe that's acceptable, right? It's more permissive and less checks are present in the code
@tavplubix I believe all comments have been addressed. I am considering to add support for |
@tavplubix CI is green and you have approved it, can we merge it? |
@tavplubix thanks for looking into this, really appreciate it |
This reverts commit 2bade7d.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Attempt to create a generic proxy resolver for CH while keeping backwards compatibility with existing S3 storage conf proxy resolver.
Adds a fallback proxy resolver which is based on the
http_proxy
andhttps_proxy
environment variables.Make S3 table functions respect it. URL functions as well.
Closes #14097
Documentation entry for user-facing changes
Adds new global proxy setting with new syntax. Adds fallback proxy resolver based on
http_proxy
andhttps_proxy
environment variables.