refactor(sql): optimize sql query parser #9673

lilykuang · 2020-04-28T18:35:05Z

SUMMARY

When I was researching on optimizing sql parser, I have tried using flatten method to reduce processing time. However, flatten turns sql query into 1 dimensional iterator and it loses the ability to identify item as tokenList. TokenList is used as identifier to process as a table, see below https://github.com/apache/incubator-superset/blob/e5c3e8964d8933069569809325200474157592c2/superset/sql_parse.py#L171-L175

I have found that _extract_from_token was recursively called on the same item for n times (n is length of item.tokens). It could be called only one time for item

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TEST PLAN

ADDITIONAL INFORMATION

REVIEWERS

codecov-io · 2020-04-28T18:42:01Z

Codecov Report

Merging #9673 into master will decrease coverage by 0.76%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #9673      +/-   ##
==========================================
- Coverage   70.79%   70.02%   -0.77%     
==========================================
  Files         587      588       +1     
  Lines       30435    32060    +1625     
  Branches     3152     3166      +14     
==========================================
+ Hits        21545    22449     +904     
- Misses       8776     9498     +722     
+ Partials      114      113       -1

Flag	Coverage Δ
#cypress	`53.68% <ø> (-0.03%)`	⬇️
#javascript	`59.03% <ø> (+0.03%)`	⬆️
#python	`69.61% <100.00%> (-1.34%)`	⬇️

Impacted Files	Coverage Δ
superset/sql_parse.py	`99.36% <100.00%> (-0.01%)`	⬇️
superset/viz_sip38.py	`0.00% <0.00%> (ø)`
superset/errors.py	`100.00% <0.00%> (ø)`
superset/views/core.py	`75.11% <0.00%> (+0.22%)`	⬆️
superset/models/helpers.py	`88.55% <0.00%> (+0.66%)`	⬆️
superset/models/core.py	`86.13% <0.00%> (+0.88%)`	⬆️
superset/db_engine_specs/base.py	`88.72% <0.00%> (+0.93%)`	⬆️
.../src/dashboard/components/gridComponents/Chart.jsx	`89.88% <0.00%> (+1.12%)`	⬆️
superset/views/base.py	`75.16% <0.00%> (+2.81%)`	⬆️
superset-frontend/src/chart/Chart.jsx	`71.66% <0.00%> (+5.00%)`	⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65d185f...e5c3e89. Read the comment docs.

john-bodley

Could you please and a summary in the PR description to describe how the parsing has been optimized?

john-bodley · 2020-04-29T07:52:18Z

superset/sql_parse.py

@@ -222,6 +220,15 @@ def __extract_from_token(self, token: Token):  # pylint: disable=too-many-branch
                    if not self.__is_identifier(token2):
                        self.__extract_from_token(item)

+    def __process_token_group(self, tokens):  # flatten nested tokens to 1D array


sqlparse has a flatten method and thus you may want to explore whether this serves the same purpose as said function.

villebro · 2020-05-26T17:07:51Z

While I don't doubt this does what the PR description says, it would be great if we could transfer some understanding that you've gathered during research for this PR into the code/docs. Reviewing this is quite difficult in part because of the following:

Docstrings are very short or missing. Perhaps some examples could be added to the docstrings to explain what kind of tokens a typical SQL statement is decomposed into.
Documentation for sqlparse is also quite incomprehensive.
Variable names are not very descriptive. For example, subtoken might be more explanatory than token2.

This might be overkill, but it would be great if we could add a test that ensures that self._extract_from_token for a certain test case isn't called more than x times. That would make sure this parsing logic stays performant in the future, too.

codecov-commenter · 2020-05-27T00:50:52Z

Codecov Report

Merging #9673 into master will decrease coverage by 0.40%.
The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master    #9673      +/-   ##
==========================================
- Coverage   70.79%   70.38%   -0.41%     
==========================================
  Files         587      585       -2     
  Lines       30435    31057     +622     
  Branches     3152     3277     +125     
==========================================
+ Hits        21545    21861     +316     
- Misses       8776     9084     +308     
+ Partials      114      112       -2

Flag	Coverage Δ
#cypress	`53.91% <ø> (+0.19%)`	⬆️
#javascript	`59.36% <ø> (+0.36%)`	⬆️
#python	`69.96% <66.66%> (-0.99%)`	⬇️

Impacted Files	Coverage Δ
superset/sql_parse.py	`99.29% <66.66%> (-0.09%)`	⬇️
superset/examples/countries.py	`0.00% <0.00%> (-100.00%)`	⬇️
superset/utils/decorators.py	`55.00% <0.00%> (-32.28%)`	⬇️
superset/viz.py	`57.20% <0.00%> (-14.78%)`	⬇️
superset/db_engine_specs/sqlite.py	`63.33% <0.00%> (-10.01%)`	⬇️
.../src/dashboard/components/RefreshIntervalModal.jsx	`90.00% <0.00%> (-10.00%)`	⬇️
...explore/components/AdhocMetricEditPopoverTitle.jsx	`84.61% <0.00%> (-8.72%)`	⬇️
superset-frontend/src/explore/exploreUtils.js	`74.45% <0.00%> (-8.59%)`	⬇️
...ponents/AdhocFilterEditPopoverSimpleTabContent.jsx	`86.29% <0.00%> (-7.42%)`	⬇️
superset/jinja_context.py	`80.95% <0.00%> (-4.77%)`	⬇️
... and 193 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65d185f...e2e3c9b. Read the comment docs.

villebro

LGTM, thanks for fixing!

* optimize sql query parser * update extract from token * update doc string * pylint doc string

superset-github-bot bot added the preset-io label Apr 28, 2020

pull-request-size bot added the size/S label Apr 28, 2020

lilykuang marked this pull request as draft April 28, 2020 18:35

john-bodley reviewed Apr 29, 2020

View reviewed changes

lilykuang added 2 commits May 12, 2020 11:50

optimize sql query parser

ed46939

update extract from token

e5c3e89

lilykuang force-pushed the lily/sql-query-parser branch from 9142502 to e5c3e89 Compare May 12, 2020 19:49

pull-request-size bot added size/XS and removed size/S labels May 12, 2020

lilykuang changed the title ~~[sql] optimize sql query parser~~ [refactor] optimize sql query parser May 12, 2020

lilykuang closed this May 12, 2020

lilykuang reopened this May 12, 2020

lilykuang changed the title ~~[refactor] optimize sql query parser~~ refactor(sql): optimize sql query parser May 12, 2020

lilykuang marked this pull request as ready for review May 12, 2020 20:08

lilykuang closed this May 13, 2020

lilykuang reopened this May 13, 2020

lilykuang requested a review from john-bodley May 14, 2020 17:15

pull-request-size bot added size/S and removed size/XS labels May 27, 2020

update doc string

7ebf6e3

lilykuang force-pushed the lily/sql-query-parser branch from 65e4c9d to 7ebf6e3 Compare June 9, 2020 17:22

pylint doc string

e2e3c9b

villebro approved these changes Jun 10, 2020

View reviewed changes

villebro merged commit f6cd3a9 into apache:master Jun 10, 2020

villebro deleted the lily/sql-query-parser branch June 10, 2020 06:16

auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020

refactor(sql): optimize sql query parser (apache#9673)

65b8c24

* optimize sql query parser * update extract from token * update doc string * pylint doc string

mistercrunch added the 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels label Feb 28, 2024

mistercrunch added the 🚢 0.37.0 label Feb 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(sql): optimize sql query parser #9673

refactor(sql): optimize sql query parser #9673

lilykuang commented Apr 28, 2020 •

edited

codecov-io commented Apr 28, 2020 •

edited

john-bodley left a comment

john-bodley Apr 29, 2020

villebro commented May 26, 2020

codecov-commenter commented May 27, 2020 •

edited

villebro left a comment

refactor(sql): optimize sql query parser #9673

refactor(sql): optimize sql query parser #9673

Conversation

lilykuang commented Apr 28, 2020 • edited

CATEGORY

SUMMARY

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TEST PLAN

ADDITIONAL INFORMATION

REVIEWERS

codecov-io commented Apr 28, 2020 • edited

Codecov Report

john-bodley left a comment

Choose a reason for hiding this comment

john-bodley Apr 29, 2020

Choose a reason for hiding this comment

villebro commented May 26, 2020

codecov-commenter commented May 27, 2020 • edited

Codecov Report

villebro left a comment

Choose a reason for hiding this comment

lilykuang commented Apr 28, 2020 •

edited

codecov-io commented Apr 28, 2020 •

edited

codecov-commenter commented May 27, 2020 •

edited