
No protections for select query #5006

Closed
niketh opened this issue Oct 25, 2017 · 14 comments

@niketh
Contributor

niketh commented Oct 25, 2017

Select query (unlike groupBy) has no protection against bad queries. I propose adding a limit of 50K like in the case of groupBy so that historicals/brokers don't get overwhelmed. An unchecked select query causes historicals to OOM while finalizing results to be sent to the broker.
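For reference, the groupBy engine of that era enforced its guardrails through runtime properties; a rough sketch of the relevant settings (names and defaults are from my recollection of the Druid configuration docs at the time, so treat them as approximate):

```properties
# Cap on intermediate rows buffered per groupBy query (default on the order of 50K)
druid.query.groupBy.maxIntermediateRows=50000
# Cap on total result rows a groupBy query may return (default on the order of 500K)
druid.query.groupBy.maxResults=500000
```

The proposal is to give Select an analogous cap so an unbounded query fails fast instead of exhausting historical heap.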

@b-slim
Contributor

b-slim commented Oct 25, 2017

My 2 cents: unlike groupBy, a select query is not supposed to produce an unexpectedly large number of rows due to grouping on high-cardinality dimensions. In fact, the select query has pagination that limits the number of rows pulled by a given query; I think that is sort of enough to limit the number of rows pulled in a single query.
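To illustrate the pagination being described, here is a minimal Select query with a `pagingSpec` (the datasource name and interval are placeholders invented for the example):

```json
{
  "queryType": "select",
  "dataSource": "wikipedia",
  "intervals": ["2017-01-01/2017-01-02"],
  "granularity": "all",
  "pagingSpec": {
    "pagingIdentifiers": {},
    "threshold": 100
  }
}
```

Here `threshold` caps the number of rows returned per page, and subsequent pages are fetched by passing back the `pagingIdentifiers` from the previous response.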

@RestfulBlue

Any updates? Druid can still be killed by OOM using a simple select query. Of course I can use restrictions, but not all users understand what they are doing, and the fact that users can so easily take down the entire cluster does not please me. Is it really hard to implement a restriction for the select query in configuration?

@gianm
Contributor

gianm commented Aug 1, 2018

@RestfulBlue Have you considered the Scan query (http://druid.io/docs/latest/querying/scan-query.html)? It is meant to be a more memory-friendly replacement for the Select query, and it won't run your historicals out of memory. The only issue it really has is #4933, and that's a broker thing, not a historical one, and it can only happen with very large or unlimited scan limits.
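For comparison with the Select example, a minimal Scan query with an explicit limit looks like this (datasource and interval are placeholders):

```json
{
  "queryType": "scan",
  "dataSource": "wikipedia",
  "intervals": ["2017-01-01/2017-01-02"],
  "resultFormat": "compactedList",
  "limit": 1000
}
```

Scan streams results in batches rather than materializing a threshold's worth of rows per segment up front, which is why it behaves better on historicals.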

You could also consider adding a property for maxResults to SelectQueryConfig, and having that be a cap on the number of results a Select query is allowed to return. With the way Select is written, though, it might be challenging to implement. Select's real problem is that it selects up to the paging threshold of rows from each segment in parallel and only applies the limit afterward. This makes it use far more memory than you would expect, and also makes a maxResults limit tough to implement.
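A sketch of what the proposed cap might look like; note this is hypothetical and was not an actual Druid property at the time of this thread:

```properties
# HYPOTHETICAL property, sketching the SelectQueryConfig cap proposed above.
# A hard limit on rows a Select query may return, analogous to groupBy's limits.
druid.query.select.maxResults=50000
```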

@RestfulBlue

RestfulBlue commented Aug 7, 2018

@gianm Yes, I have considered it, but currently the Druid API can only be used safely through certain UI apps like Grafana/Metabase/etc. If we gave every analyst/data scientist in our company access to the API, the entire Druid cluster would crash really soon :( The only way to protect it now is a timeout with a low value (about 10s). In 12.2, Druid ingestion via Kafka indexing tasks looks stable. We have a separate cluster for stress testing, with throughput of about 500,000 rows/sec on hardware with 8 cores, 64 GB memory, and 6 servers, and now there are no failed tasks (in 12.1 a third of the tasks failed), but currently the query API looks really unstable and too many things can go wrong.

@gianm
Contributor

gianm commented Aug 7, 2018

@RestfulBlue I'd really try to move to Scan queries if you can… ultimately I believe we are going to be deprecating and removing Select queries. They are broken beyond repair, imo, and Scan is a good replacement (especially after #6088).

@KenjiTakahashi
Contributor

I'm with @gianm on this. I examined select in quite some detail in the past (before scan came to be; it can be found somewhere in the issues/mailing list) and, sorry to say, IMHO the whole design is just wrong; there's no point in trying to improve it, especially now that we have scan.

FYI: we have since moved all but one of our queries to scan (and that one, which occurs very rarely, actually needs #6088) and never looked back.

@asdf2014
Member

asdf2014 commented Aug 8, 2018

> some ui apps, like grafana/metabase/etc. if we let every analytic/data scientist in our company access to api , then entire druid cluster will crash really soon

Hi, @RestfulBlue. Maybe you can try Apache Superset, which supports permission control to prevent ordinary users from sending queries that consume a lot of server resources. In addition, Superset can also predefine dashboards instead of directly exposing the query API.

Tip: there is an article about Superset on my personal blog that might help you.

@RestfulBlue

@asdf2014 We use Grafana as our main GUI; we have a self-developed Druid datasource for it with good restrictions, so through Grafana users simply can't build a bad query, and that is not a problem. Problems come when analysts/devs/others start using the direct Druid API, for example via JDBC.

@asdf2014
Member

Hi, @RestfulBlue. Indeed, we should analyze the query SQL based on cost; then we can add some limitations on expensive queries to protect the Druid cluster.

@RestfulBlue

@asdf2014 JDBC is just an example; in some places native queries are preferred, as SQL has some problems. GroupBy looks like a really good query: disk spilling, and the limit applies to the total row count (not per granularity), while timeseries and topN don't have these restrictions. A timeseries query over a large interval with "none" granularity causes an OOM, and the same goes for topN. I think every query type should have limitations like groupBy.
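To illustrate the timeseries case described above, a query like the following (datasource and interval invented for the example) produces one result row per distinct timestamp when granularity is "none", so over a long interval the result set can grow large enough to exhaust memory:

```json
{
  "queryType": "timeseries",
  "dataSource": "wikipedia",
  "granularity": "none",
  "intervals": ["2016-01-01/2018-01-01"],
  "aggregations": [{ "type": "count", "name": "rows" }]
}
```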

@asdf2014
Member

@RestfulBlue Good point. I agree with you.

@stale

stale bot commented Jun 20, 2019

This issue has been marked as stale due to 280 days of inactivity. It will be closed in 2 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jun 20, 2019
@stale

stale bot commented Jul 4, 2019

This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.

@stale stale bot closed this as completed Jul 4, 2019
@gianm
Contributor

gianm commented Jul 4, 2019

Since #7133 is done, even more workloads can be moved to Scan now. That is what I'd recommend.
