Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Router management Proxy routing should have a way to route to any node in the cluster #8686

Closed
vogievetsky opened this issue Oct 16, 2019 · 9 comments

Comments

@vogievetsky
Copy link
Contributor

Description

It would be amazing if the router's management proxy could proxy to any node in the cluster

Right now the router can proxy to the overlord and coordinator leaders (see https://druid.apache.org/docs/latest/design/router.html#management-proxy-routing ) via:

/proxy/coordinator/* and /proxy/overlord/*

I would like tp propose adding a route like:

/proxy/node/<node-host>/<node-port>/* that would proxy * to the node at <node-host>:<node-port> where <node-host> and <node-port> would correspond to the host and port columns as reported by the sys.servers table.

Motivation

This would enable a host of cool status and diagnostic checks from the router web console. In particular it would turbo charge what can be done in the Druid Doctor ( #8672 ) allowing the creation of a check that would make sure all runtime props are set correctly and that all nodes are reachable.

@gianm
Copy link
Contributor

gianm commented Oct 16, 2019

This sounds like an open proxy and therefore I am scared of it.

In the context of the Druid Doctor I think it would be better to put the checks into the Router itself, and have it contact other nodes on the caller's behalf if necessary, rather than run an open proxy.

But even then, the issue implies you are thinking about contacting all servers in a cluster as part of a health check. For large clusters this has potential scalability problems. It might be better for the servers to all self-check themselves, and report the results of the self-check to a central location. Maybe they can report a hash of the important properties, and if the hashes differ we know there is a proxy.

@gianm
Copy link
Contributor

gianm commented Oct 16, 2019

It might be better for the servers to all self-check themselves, and report the results of the self-check to a central location.

Of course this has scalability problems too 🙂

I thought it might be worth considering, though, because I am worried that since the Doctor is user-driven, it could cause stampedes of broadcast blasts to all servers in the cluster if too many users use it at once.

@vogievetsky
Copy link
Contributor Author

The proxy could only forward to Druid nodes themselves so it is not open to everything. Also there is already an open proxy in the sampler (http firehose). I tried to implement it that way and it works for non TLS clusters, (for TLS clusters we would need to address: #8668

From a security perspective my model is that if you have admin access to the management proxy enabled router you can pretty much do anything anyway at that point.

From a scalability perspective I do not think it is a problem if this is a user initiated action (with a warning in the UI). Also remember that it would be OK for this to be slow, taking a 100ms pause between every request. People would be fine with it as long as there is visual feedback with a ticking Contacting node 12 out of 343 that shows that progress is being made.

@gianm
Copy link
Contributor

gianm commented Oct 17, 2019

Also there is already an open proxy in the sampler (http firehose).

I think it's different, because the http firehose at least can't be used unless you are an authenticated user and have write access to some datasource. (Not ideal but there is at least something that protects it.) The proxy route IIRC would be open to anyone anywhere. I believe it relies on the target server to do authentication.

Limiting it to just Druid servers could lower the scariness levels enough.

From a scalability perspective I do not think it is a problem if this is a user initiated action (with a warning in the UI). Also remember that it would be OK for this to be slow, taking a 100ms pause between every request.

I think with the pause, the scalability issue is less scary, so that sounds good. Or, we could also limit it to a certain number of servers. Probably if someone has, like, 100 servers, then they have figured out how to set their properties properly by then.

@vogievetsky
Copy link
Contributor Author

Yeah just to be clear I never pictured it to be a proxy to an arbitrary destination only to other Druid servers. If there was some server uuid we could use that instead of a host port combo but right now from the sys table the best server id I can see is host:port

image

The issue with having this check be encapsulated by a single backend API is that it might give someone the idea to hit it as part of a routine health monitoring which limits how heavy these tests might should be. I am very much picturing a 'deep' test suite (that can take its sweet time) and might be intensive on the cluster. Something you run a handful of times in the clusters lifetime (in leu of posting to the user-group). Something to catch issues like: https://groups.google.com/forum/#!topic/druid-user/w3Uy-dXZSII

@jnaous
Copy link
Contributor

jnaous commented Oct 17, 2019 via email

@himanshug
Copy link
Contributor

Motivation ... allowing the creation of a check that would make sure all runtime props are set correctly and that all nodes are reachable

The proxy could only forward to Druid nodes themselves so it is not open to everything.

to achieve above you would have to check if "node-host,port" received in request is indeed a druid process which requires that particular process successfully registered itself in zookeeper and discovered by router. so the two requirements above depend on each other.

I would agree with @jnaous on the async philosophy in general for this kind of use case.

#8672 (comment) applies too :)

OTOH I think you want the proxy because it is doable quickly and would give you a lot of freedom as to try things in web console, gather feed back, iterate more and when we know the concrete useful use cases, they could be implemented in router for efficiency (and/or proactive checkups ) . If intention is to allow quick experimentation really then I think it is ok to setup the proxy mechanism at router but with a configuration that enables such proxy and set to false by default. Only ask would be that web console should continue to work for all other things even if generic proxy is disabled at router.

@stale
Copy link

stale bot commented Jul 25, 2020

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jul 25, 2020
@stale
Copy link

stale bot commented Aug 23, 2020

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Aug 23, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants