Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sqllab] Fix sqllab limit regex issue with sqlparse #5295

Merged
merged 2 commits into from
Jul 16, 2018

Conversation

timifasubaa
Copy link
Contributor

@timifasubaa timifasubaa commented Jun 27, 2018

This PR fixes the sqllab regex issues with Offset and tricky limits that regex doesn't handle correctly. It replaces the regex approach with sqlparse.

The queries below failed before but now they are successful. Fixes #5272

select * from table where a=True limit 1, 2
select * from table where a=True limit 1 offset 10
select 'limit 10'

@conglei @john-bodley @mistercrunch

@timifasubaa timifasubaa force-pushed the fix_sqllab_limit_regex branch 2 times, most recently from 2bb3290 to e596ff2 Compare June 27, 2018 00:20

@classmethod
def get_query_without_limit(cls, sql):
return re.sub(r"""
before_limit = re.sub(r"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably should rename the function given the change in return values. Also can we make these staticmethods.

sql_without_limit = cls.get_query_without_limit(sql)
return '{sql_without_limit} LIMIT {limit}'.format(**locals())
sql_before_limit, sql_after_limit = cls.get_query_without_limit(sql)
return '{sql_before_limit} LIMIT {limit}{sql_after_limit}'.format(**locals())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if one explicitly specifies a limit in the SQL which is less than the override?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not override

LIMIT\s+\d+ # LIMIT $ROWS
;? # optional semi-colon
(\s|;)*$ # remove trailing spaces tabs or semicolons
LIMIT\s+(\d+) # LIMIT $ROWS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we align the comments?

""", '', sql)

after_limit_pattern = re.compile(r"""
(?ix) # case insensitive, verbose
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly can we align the comments?


@classmethod
def get_query_without_limit(cls, sql):
return re.sub(r"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this a staticmethod?

@@ -114,23 +114,31 @@ def get_limit_from_sql(cls, sql):
(?ix) # case insensitive, verbose
\s+ # whitespace
LIMIT\s+(\d+) # LIMIT $ROWS
;? # optional semi-colon
(\s|;)*$ # remove trailing spaces tabs or semicolons
.*$ # everything else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this regex will fail for queries like (albeit being somewhat unusual):

SELECT ‘LIMIT 10’

Copy link
Contributor

@conglei conglei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @timifasubaa for continuing working on this.

In general, there are two issues to be fix.

  1. LIMIT inside Quote. This can be done using something similar to https://robrich.org/archive/2007/11/29/regex-match-content-unless-it-is-inside-quotes.aspx

  2. The new logic will break the tests

@john-bodley
Copy link
Member

@conglei @timifasubaa sqlparse could probably be leveraged to deal with the word LIMIT outside of the LIMIT clause.

@timifasubaa timifasubaa force-pushed the fix_sqllab_limit_regex branch 6 times, most recently from 77d17d0 to 9675b17 Compare June 29, 2018 21:30
@timifasubaa timifasubaa changed the title [WIP] Fix sqllab limit regex [sqllab] Fix sqllab limit regex issue with sqlparse Jun 29, 2018
@timifasubaa
Copy link
Contributor Author

@john-bodley @conglei I tried out sqlparse and it worked fine for our concerns. Feel free to suggest any extra cases you want me to cover with tests.

@timifasubaa timifasubaa force-pushed the fix_sqllab_limit_regex branch 5 times, most recently from 0ec90f5 to f753df7 Compare June 30, 2018 00:32
cls.get_substrings_before_and_after_limit(sql)
)
sql = '{sql_before_limit} LIMIT {limit} {sql_after_limit}'.format(**locals())
re.sub('\s+', ' ', sql).strip()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you want to be stripping additional whitespace in free-from SQL as the author may have included this on purpose, i.e.,

SELECT * FROM my_table WHERE paragraph LIKE '%.  The%'

'LIMIT 777' AS a
, b
FROM
table LIMIT 1000, 999999"""),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note I believe the form here is LIMIT [<offset>,] <row-count> per this and it seems this is flipped.

limit_pos = None

# Add all items to before_str until there is a limit
for pos, item in enumerate(self._parsed[0].tokens):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given you're now using sqlparse rather than trying to extract the before and after portions of the query, why don't you simply replace the relevant token in-place, i.e.,

>>> sqlparse.parse('SELECT * FROM foo LIMIT 1000')[0].tokens
[..., <Keyword 'LIMIT' at 0x10DE741F0>, <Whitespace ' ' at 0x10DE74258>, <Integer '1000' at 0x10DE742C0>]

and

>>> sqlparse.parse('SELECT * FROM foo LIMIT 10, 1000')[0].tokens
[..., <Keyword 'LIMIT' at 0x10DE74668>, <Whitespace ' ' at 0x10DE746D0>, <IdentifierList '10, 10...' at 0x10DE751D0>]

It seems once you find the LIMIT keyword just jump two tokens which will contain either an IdentifierList or Integer and update the token accordingly, i.e., for the first case (example code):

>>> s = sqlparse.parse('SELECT * FROM foo LIMIT 1000')[0]
>>> s.tokens[-1].value = '999'
>>> str(s)
'SELECT * FROM foo LIMIT 999'

@timifasubaa timifasubaa force-pushed the fix_sqllab_limit_regex branch 8 times, most recently from b5ebb70 to 5e90d6a Compare July 2, 2018 22:01
@timifasubaa timifasubaa force-pushed the fix_sqllab_limit_regex branch 7 times, most recently from 78029be to c5512c6 Compare July 3, 2018 17:38
@codecov-io
Copy link

codecov-io commented Jul 3, 2018

Codecov Report

Merging #5295 into master will decrease coverage by 15.72%.
The diff coverage is 95.12%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master    #5295       +/-   ##
===========================================
- Coverage   77.15%   61.43%   -15.73%     
===========================================
  Files          44      373      +329     
  Lines        8892    23554    +14662     
  Branches        0     2725     +2725     
===========================================
+ Hits         6861    14471     +7610     
- Misses       2031     9070     +7039     
- Partials        0       13       +13
Impacted Files Coverage Δ
superset/sql_parse.py 99.13% <100%> (+0.34%) ⬆️
superset/db_engine_specs.py 53.6% <75%> (-0.23%) ⬇️
...erset/assets/src/explore/components/ControlRow.jsx 100% <0%> (ø)
superset/assets/src/visualizations/sunburst.js 8.37% <0%> (ø)
.../src/dashboard/components/menu/WithPopoverMenu.jsx 82.14% <0%> (ø)
...ssets/src/dashboard/util/backgroundStyleOptions.js 100% <0%> (ø)
.../dashboard/util/logging/findNonTabChildChartIds.js 0% <0%> (ø)
superset/assets/src/welcome/DashboardTable.jsx 83.33% <0%> (ø)
...t/assets/src/SqlLab/components/QueryStateLabel.jsx 100% <0%> (ø)
superset/assets/src/visualizations/country_map.js 4.9% <0%> (ø)
... and 321 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 22b7c2d...dca0bd0. Read the comment docs.

return sql

@classmethod
def get_limit_from_sql(cls, sql):
limit_pattern = re.compile(r"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the get_limit_from_sql method now obsolete?

'{}, {}'.format(next(limit.get_identifiers()), new_limit)
)
flattened = self._parsed[0].tokens
str_res = ''
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can simply do str(flattened) or flattened.value.

break
if not limit_token:
return limit_token
return self._get_limit_from_token(limit_token)
Copy link
Member

@john-bodley john-bodley Jul 5, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not have this logic on like #149 and thus there's no need for the break? This also means that limit_token only needs to be scoped if there exists a limit keyword. Functions return None by default.

def _extract_limit_from_outermost_layer(self, statement):
    for pos, item in enumerate(statement.tokens):
        if item.ttype in Keyword and item.value.lower() == 'limit':
             return statement.tokens[pos + 2]

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the term outermost may be misleading. You can argue that iterating backwards would find the outermost LIMIT condition.

return self.sql + ' LIMIT ' + str(new_limit)
limit_pos = None
# Add all items to before_str until there is a limit
for pos, item in enumerate(self._parsed[0].tokens):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we repeating the logic here for searching for the limit keyword. Can't we do this all in one place, i.e., I think the logic can further be simplified, by finding and replacing in the same function.

break
limit = self._parsed[0].tokens[limit_pos + 2]
if limit.ttype == sqlparse.tokens.Literal.Number.Integer:
self._parsed[0].tokens[limit_pos + 2].value = new_limit
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The self._parsed[0].tokens array contains a list of references to the Token objects and thus you can mutate the token in-place (the reference remains unchanged) so this can simply be

limit.value = new_limit


def test_limit_with_explicit_offset(self):
self.sql_limit_regex(
textwrap.dedent("""\
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the need for textwrap and the \ after the tripple-quotes?

@john-bodley
Copy link
Member

@timifasubaa this is generally the right approach but I sense there could be some additional simplification, i.e., I don't think we need to first i) find (and store) the limit and then go through and then ii) re-find the limit and replace it. It seems that this could all happen in the same function which would reduce the need to repeat code.

@timifasubaa timifasubaa force-pushed the fix_sqllab_limit_regex branch 2 times, most recently from 8206588 to f599919 Compare July 5, 2018 22:07
@timifasubaa
Copy link
Contributor Author

I will reexamine the usage of limit in the query table in another PR and potentially further simplify the code.

@timifasubaa timifasubaa force-pushed the fix_sqllab_limit_regex branch 7 times, most recently from 0c10221 to 8cad9ed Compare July 13, 2018 20:14
@timifasubaa timifasubaa merged commit f8a6e09 into apache:master Jul 16, 2018
timifasubaa added a commit to airbnb/superset-fork that referenced this pull request Jul 25, 2018
* include items after limit to the modified query

* use sqlparse
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
* include items after limit to the modified query

* use sqlparse
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

LIMIT statement is not correctly caught by the current regular expression
5 participants