Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sqllab] force limit queries only when there is no existing limit #5023

Merged
merged 3 commits into from
May 31, 2018

Conversation

timifasubaa
Copy link
Contributor

@timifasubaa timifasubaa commented May 17, 2018

This PR (#4947) implemented a more efficient way to limit queries by attaching the limit statement to the statement. However, it always attaches the default limit to all queries even when a fifferent (possibly smaller) limit is specified in the query. This is because query.limit is always set to the default value, never the specified limit value.
In this PR, I do some query parsing to make sure query.limit is correctly specified upstream. Then the query augnmenting logic is only called when there no limit specified.

@mistercrunch @john-bodley @michellethomas

@timifasubaa timifasubaa force-pushed the fix_sqllab_commit branch 6 times, most recently from 2894e8a to 71ca5cc Compare May 17, 2018 03:46
@codecov-io
Copy link

codecov-io commented May 17, 2018

Codecov Report

Merging #5023 into master will increase coverage by 0.02%.
The diff coverage is 92.3%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5023      +/-   ##
==========================================
+ Coverage    77.5%   77.52%   +0.02%     
==========================================
  Files          44       44              
  Lines        8720     8728       +8     
==========================================
+ Hits         6758     6766       +8     
  Misses       1962     1962
Impacted Files Coverage Δ
superset/views/core.py 74.66% <ø> (ø) ⬆️
superset/sql_lab.py 75.4% <100%> (+0.13%) ⬆️
superset/db_engine_specs.py 53.98% <90%> (+0.54%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f611797...a9d7faf. Read the comment docs.

@@ -171,6 +171,7 @@ def handle_error(msg):
# Limit enforced only for retrieving the data, not for the CTA queries.
superset_query = SupersetQuery(rendered_query)
executed_sql = superset_query.stripped()
SQL_MAX_ROWS = int(app.config.get('SQL_MAX_ROW', None))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is implicit on the get method and will make int raise

In [1]: int(None)
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll send out the fir for this.

@@ -185,7 +186,8 @@ def handle_error(msg):
query.user_id, start_dttm.strftime('%Y_%m_%d_%H_%M_%S'))
executed_sql = superset_query.as_create_table(query.tmp_table_name)
query.select_as_cta_used = True
elif (query.limit and superset_query.is_select()):
elif (not query.limit and superset_query.is_select() and SQL_MAX_ROWS):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait I think you remove all handling of query.limit where it's defined

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mistercrunch I tested it and it works correctly.

I changed the place where query.limit is defined upstream and it is None only when it is not specified. It is only in those cases that we append the default limit to the end.

tokens = sql.split()
try:
if 'limit' in tokens:
limit_pos = tokens.index('limit') + 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that looks brittle here, for example limit can exist within a subquery, or there can be line breaks or tabs as separators instead of spaces. We should use the same method / regex approach as the one I defined. Maybe that method in db_engine_spec can receive max_limit and only apply alter if the max
_limit is lower

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, true. I hadn't thought of the subquery case. But split can handle the tabs and newlines just fine. I'll go ahead to reuse your regex approach as that is more robust.
And your suggestion of adding the logic within the checking function makes sense.

@timifasubaa timifasubaa force-pushed the fix_sqllab_commit branch 2 times, most recently from 8c86fa1 to 04725b0 Compare May 19, 2018 01:05
@timifasubaa
Copy link
Contributor Author

@mistercrunch That makes sense. I took out your regex and made it a helper method in utils. Now I apply the limit upstream if it exists then we modify the query with the max value if the specified value is more than the max value or there is no specified value. Lmk what you think.

@timifasubaa timifasubaa force-pushed the fix_sqllab_commit branch 3 times, most recently from ee75b9f to d7eb638 Compare May 19, 2018 01:45
@timifasubaa
Copy link
Contributor Author

timifasubaa commented May 22, 2018

PING

Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be simpler to just make apply_limit_to_sql extract the LIMIT number, check if <, and replace conditionally, all using pretty much the same one regex.

Then add unit tests similar to the existing ones confirming that it works as expected.

"limit specified was not an integer: '{}'".format(specified_limit))
specified_limit = None
return specified_limit
return None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return None is implicit in python

sql_without_limit = get_query_without_limit(sql)
if sql_without_limit != sql:
limit_clause = sql.partition(sql_without_limit)[2]
limit_match = re.search('limit\s(\d+)', limit_clause, re.IGNORECASE)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the same regex as above to extract the limit number.

""", '', sql)


def get_limit_from_sql(sql):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unit tests

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only works for databases that uses the LIMIT syntax. Say if you take MS SQL Server that goes for the SELECT TOP 100 * FROM ... this piece of code doesn't apply.

This calls for moving this logic within db_engine_spec. I'm thinking apply_limit_to_sql should be be expected to apply a limit only if needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mistercrunch As is, we will append limit <default_limit> to MS SQL Server queries.
I moved the get_limit_from_query to db_engine_spec and that should allow us override the method for dbs with different ways of specifying the limit.

@@ -171,6 +171,8 @@ def handle_error(msg):
# Limit enforced only for retrieving the data, not for the CTA queries.
superset_query = SupersetQuery(rendered_query)
executed_sql = superset_query.stripped()
SQL_MAX_ROWS = app.config.get('SQL_MAX_ROW', None)
SQL_MAX_ROWS = int(SQL_MAX_ROWS) if SQL_MAX_ROWS else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we not assume that SQL_MAX_ROWS is an int, why casting?

Copy link
Member

@john-bodley john-bodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I agree that this PR resolves a regression introduced in #4947 but I think (per my comment there) that adding a LIMIT to a query without the user's knowledge isn't ideal and thus it seems more sense to revert #4947. Additionally the work in #4834 with pre-fetching results future dilutes the need to define an explicit limit.

@@ -171,6 +171,8 @@ def handle_error(msg):
# Limit enforced only for retrieving the data, not for the CTA queries.
superset_query = SupersetQuery(rendered_query)
executed_sql = superset_query.stripped()
SQL_MAX_ROWS = app.config.get('SQL_MAX_ROW', None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also app.config.get('SQL_MAX_ROW', None) is equivalent to app.config.get('SQL_MAX_ROW') as the second slot is only used to return a non-None value if the key doesn't exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

limit_match = re.search('limit\s(\d+)', limit_clause, re.IGNORECASE)
if limit_match:
specified_limit = limit_match.group(1)
try:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This try/except is unnecessary as the regex explicitly matched a an integer. If the value was a non-integer limit_match would be None.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. Taken out completely.

if limit_match:
specified_limit = limit_match.group(1)
try:
specified_limit = int(specified_limit) if specified_limit else None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not that it's applicable because the try/except is unnecessary but there's no need to do the if/else logic as the specified_limit is assigned to None on line 870.

@@ -196,13 +196,11 @@ def test_run_async_query(self):
self.assertEqual([{'name': 'Admin'}], df.to_dict(orient='records'))
self.assertEqual(QueryStatus.SUCCESS, query.status)
self.assertTrue('FROM tmp_async_1' in query.select_sql)
self.assertTrue('LIMIT 666' in query.select_sql)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why were these checks removed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed

@timifasubaa timifasubaa force-pushed the fix_sqllab_commit branch 2 times, most recently from 34313e0 to 3ffe373 Compare May 24, 2018 04:01
@timifasubaa timifasubaa force-pushed the fix_sqllab_commit branch 7 times, most recently from bab9b23 to 1c5dc79 Compare May 29, 2018 22:51

@classmethod
def get_limit_from_sql(cls, sql):
limit_pattern = re.compile(r"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realize that this logic existed previously but shouldn't we use something like sqlparse to obtain the limit rather than using a regex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I explored the sqlparse option but there were no nice way to just get the limit without recursively parsing through the query.

@timifasubaa
Copy link
Contributor Author

@mistercrunch I saw the notification of your comment over email but I don't see it here.
The reason why it's duplicates is that I need to put () around the \d+ to select for it. The previous regex does not have the selector.

Copy link
Member

@mistercrunch mistercrunch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have a unit test that makes sure that the lower limit won't be replaced (the original issue we're tackling in this PR)?

@timifasubaa
Copy link
Contributor Author

On it!

@timifasubaa timifasubaa force-pushed the fix_sqllab_commit branch 5 times, most recently from 9ddcb28 to e264f12 Compare May 30, 2018 19:49
@timifasubaa
Copy link
Contributor Author

@john-bodley @mistercrunch Ready for stamp

@timifasubaa timifasubaa merged commit cefc206 into apache:master May 31, 2018
@@ -104,15 +104,32 @@ def apply_limit_to_sql(cls, sql, limit, database):
)
return database.compile_sqla_query(qry)
elif LimitMethod.FORCE_LIMIT:
no_limit = re.sub(r"""
sql_without_limit = cls.get_query_without_limit(sql)
return '{sql_without_limit} LIMIT {limit}'.format(**locals())
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic can generate sql like

SELECT id,
       name
FROM main.ab_permission
LIMIT 100
OFFSET 0 LIMIT 5000

which is wrong

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which engine is this on?

wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
[sqllab] force limit queries only when there is no existing limit
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants