Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix display limit in sql lab #5392

Merged
merged 5 commits into from Jul 16, 2018
Merged

Fix display limit in sql lab #5392

merged 5 commits into from Jul 16, 2018

Conversation

villebro
Copy link
Member

Limits query by DISPLAY_SQL_MAX_ROW (default value 1,000) instead of SQL_MAX_ROW (default value 1,000,000). Fixes #5390. This could also be fixed by changing the default values of SQL_MAX_ROW to something more manageable, say 1,000. Raises the question of the need for having a separate SQL_MAX_ROW as opposed to DISPLAY_SQL_MAX_ROW in config.py, as there doesn't seem to be any other reference to SQL_MAX_ROW in the code base.

@timifasubaa
Copy link
Contributor

timifasubaa commented Jul 13, 2018

@villebro
Copy link
Member Author

Also worth pointing out, the documentation/comments in config.py are unclear on the difference between the two variables. If there is a need for having separate variables, their differences need to be clearly stated in config.py.

@villebro
Copy link
Member Author

@timifasubaa I think so, but I'm not familiar with the history of the variables, especially why SQL_MAX_ROW was originally split in two (see #3139).

@timifasubaa
Copy link
Contributor

timifasubaa commented Jul 13, 2018

I see. There have been 2 attempts at limiting query results. One at the query level and the other at the display level.
At the query level, we have tried to wrap the user query with another query so it will look like `select * from (user_query) limit sql_lab_row_limit; Now we parse the sql and change the limit in place or add a limit if none was specified.

The second attempt is to only get a subset of the results from the results backend to send to the frontend. That's the display limit. As seen here (https://github.com/apache/incubator-superset/blame/a17f7141b75f4aca2458d3190ad92f96840ccbcd/superset/views/core.py#L2480)

Now most of the work is done on the backend by modifying the query so it should be fine using the same variable.

I recommend we use SQL_MAX_ROW across the board and set it to 1000.

@timifasubaa timifasubaa self-requested a review July 13, 2018 20:48
@villebro
Copy link
Member Author

Agreed @timifasubaa, this looks much better.

@timifasubaa
Copy link
Contributor

There is a small js issue breaking the build rn. A team mate is working on resolving it. Once that's done, I will merge.
Thanks for working on this.

@timifasubaa
Copy link
Contributor

@villebro you'll need to rebase

@villebro
Copy link
Member Author

Thanks for the help @timifasubaa , should now be good to go.

@codecov-io
Copy link

Codecov Report

Merging #5392 into master will decrease coverage by <.01%.
The diff coverage is 50%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5392      +/-   ##
==========================================
- Coverage   77.15%   77.15%   -0.01%     
==========================================
  Files          44       44              
  Lines        8892     8891       -1     
==========================================
- Hits         6861     6860       -1     
  Misses       2031     2031
Impacted Files Coverage Δ
superset/views/core.py 72.86% <0%> (ø) ⬆️
superset/config.py 92.53% <100%> (-0.06%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 22b7c2d...4c6d4a3. Read the comment docs.

@timifasubaa timifasubaa merged commit bd47587 into apache:master Jul 16, 2018
timifasubaa pushed a commit to airbnb/superset-fork that referenced this pull request Jul 25, 2018
* Fix display limit in sql lab

* Add redundant variable from config.py

* Fix test config

* Change to SQL_MAX_ROW

* Remove last remaining occurence of DISPLAY_SQL_MAX_ROW
@mistercrunch
Copy link
Member

I think this isn't right. I got a user saying that their CSV export today was just 1k rows and I was confused as I thought the limit was much higher like 1M for CSVs. I got confused because I read the code and thought "mmh I thought there was 2 different settings for this..." but couldn't find evidence of that since this PR had changed that.

So when is async mode we should apply the SQL_MAX_ROW limit (1M by default) to send to the results backend, so that the csv/ endpoint can hit that in async mode. The results/ endpoint in async mode (reading from the results backend) should read only the first DISPLAY_SQL_MAX_ROW rows only and return that to prevent the UI from crashing. I think this was all working prior to this PR.

In non-async mode, we should apply DISPLAY_SQL_MAX_ROW on the sql_json/ endpoint, and apply no limit o the csv/ endpoint. I think this was not working before, we would just return SQL_MAX_ROW and crash the browser. This didn't happen much at Lyft and Airbnb since we run most big databases in async mode.

Sorry this is all a bit confusing and clearly needs to be documented better. We also need to allow the user to know what is happening here and maybe allow them to change their limits, though we shouldn't let them crash their browser too easily.

@timifasubaa @villebro ^^^

@villebro
Copy link
Member Author

@mistercrunch Ok that makes sense. Sounds fairly trivial, if needed I can make this fix.

@mistercrunch
Copy link
Member

@villebro let's see what @timifasubaa and @john-bodley were going with this. The main question is around allowing users to export large CSVs when in async mode.

There could be solutions around allowing CSV exports of entire tables instead of queries (capped at say 1M), and forcing users to run the CREATE TABLE AS feature first and CSV export that table.

@villebro villebro deleted the sqllab_max_row branch August 30, 2018 18:29
@timifasubaa
Copy link
Contributor

Taking a look at this

@mistercrunch
Copy link
Member

@timifasubaa ping

@timifasubaa
Copy link
Contributor

Sorry for the delay. Currently working on this.

wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
* Fix display limit in sql lab

* Add redundant variable from config.py

* Fix test config

* Change to SQL_MAX_ROW

* Remove last remaining occurence of DISPLAY_SQL_MAX_ROW
@ofajardo
Copy link

ofajardo commented Feb 2, 2019

Just a comment from a confused user: I was trying queries with ORDER BY and they either were failing (Order by is not allowed in sub-queries message) or just getting things out of order. It took me some time to understand there is this select * from (yourquery) limit SQL_MAX_ROW thing, as I think it is not documented anywhere. Now I set SQL_MAX_ROW to 0 to turn that thing off and I am happy again.
Suggestion: please leave SQL_MAX_ROW so that that subquerying can be turned off, and document that more clearly.
I think the display limit is good.
( I have used several other SQL Editors and there was always a display limit, but I have never seen before that sub querying, that makes it really confusing for newbies).

@mistercrunch
Copy link
Member

mistercrunch commented Feb 2, 2019

@ofajardo which database engine were you using? The routine that "squeezes in" the limit clause is different from a database engine to the next.

@ofajardo
Copy link

ofajardo commented Feb 2, 2019

Teradata. Maybe the version for teradata needs adjustement?

This is how the query after transformation:

"SELECT TOP 100000 * FROM (select DB_SHORT_NAME, count(*) as db_count from RWD_META_MDH.batch group by db_short_name order by db_count desc) AS inner_qry"

two not so good things here:
1- The order by gets inside the subquery, which is not good. It either gets rejected or the results you get back are not ordered.
2- It refuses that you do something as simple as select count() from, it wants that you set an alias for the count() which is not difficult but very puzzling if you don't know your query is getting inside a subquery.

@saurabh1-singh
Copy link

Can I get the answer here because I want that only 500 rows are displayed in the table visualization and I could download the original amount of rows returned from the DB. What config values should I use to help me achieve this. @villebro @mistercrunch

@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SQL Lab query limit too large to handle default settings
6 participants