Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(sql): optimize sql query parser #9673

Merged
merged 4 commits into from Jun 10, 2020

Conversation

lilykuang
Copy link
Member

@lilykuang lilykuang commented Apr 28, 2020

CATEGORY

Choose one

  • Bug Fix
  • Enhancement (new features, refinement)
  • Refactor
  • Add tests
  • Build / Development Environment
  • Documentation

SUMMARY

When I was researching on optimizing sql parser, I have tried using flatten method to reduce processing time. However, flatten turns sql query into 1 dimensional iterator and it loses the ability to identify item as tokenList. TokenList is used as identifier to process as a table, see below https://github.com/apache/incubator-superset/blob/e5c3e8964d8933069569809325200474157592c2/superset/sql_parse.py#L171-L175

I have found that _extract_from_token was recursively called on the same item for n times (n is length of item.tokens). It could be called only one time for item

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TEST PLAN

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

REVIEWERS

@lilykuang lilykuang marked this pull request as draft April 28, 2020 18:35
@codecov-io
Copy link

codecov-io commented Apr 28, 2020

Codecov Report

Merging #9673 into master will decrease coverage by 0.76%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9673      +/-   ##
==========================================
- Coverage   70.79%   70.02%   -0.77%     
==========================================
  Files         587      588       +1     
  Lines       30435    32060    +1625     
  Branches     3152     3166      +14     
==========================================
+ Hits        21545    22449     +904     
- Misses       8776     9498     +722     
+ Partials      114      113       -1     
Flag Coverage Δ
#cypress 53.68% <ø> (-0.03%) ⬇️
#javascript 59.03% <ø> (+0.03%) ⬆️
#python 69.61% <100.00%> (-1.34%) ⬇️
Impacted Files Coverage Δ
superset/sql_parse.py 99.36% <100.00%> (-0.01%) ⬇️
superset/viz_sip38.py 0.00% <0.00%> (ø)
superset/errors.py 100.00% <0.00%> (ø)
superset/views/core.py 75.11% <0.00%> (+0.22%) ⬆️
superset/models/helpers.py 88.55% <0.00%> (+0.66%) ⬆️
superset/models/core.py 86.13% <0.00%> (+0.88%) ⬆️
superset/db_engine_specs/base.py 88.72% <0.00%> (+0.93%) ⬆️
.../src/dashboard/components/gridComponents/Chart.jsx 89.88% <0.00%> (+1.12%) ⬆️
superset/views/base.py 75.16% <0.00%> (+2.81%) ⬆️
superset-frontend/src/chart/Chart.jsx 71.66% <0.00%> (+5.00%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65d185f...e5c3e89. Read the comment docs.

Copy link
Member

@john-bodley john-bodley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please and a summary in the PR description to describe how the parsing has been optimized?

@@ -222,6 +220,15 @@ def __extract_from_token(self, token: Token): # pylint: disable=too-many-branch
if not self.__is_identifier(token2):
self.__extract_from_token(item)

def __process_token_group(self, tokens): # flatten nested tokens to 1D array
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sqlparse has a flatten method and thus you may want to explore whether this serves the same purpose as said function.

@lilykuang lilykuang changed the title [sql] optimize sql query parser [refactor] optimize sql query parser May 12, 2020
@lilykuang lilykuang closed this May 12, 2020
@lilykuang lilykuang reopened this May 12, 2020
@lilykuang lilykuang changed the title [refactor] optimize sql query parser refactor(sql): optimize sql query parser May 12, 2020
@lilykuang lilykuang marked this pull request as ready for review May 12, 2020 20:08
@lilykuang lilykuang closed this May 13, 2020
@lilykuang lilykuang reopened this May 13, 2020
@villebro
Copy link
Member

While I don't doubt this does what the PR description says, it would be great if we could transfer some understanding that you've gathered during research for this PR into the code/docs. Reviewing this is quite difficult in part because of the following:

  • Docstrings are very short or missing. Perhaps some examples could be added to the docstrings to explain what kind of tokens a typical SQL statement is decomposed into.
  • Documentation for sqlparse is also quite incomprehensive.
  • Variable names are not very descriptive. For example, subtoken might be more explanatory than token2.

This might be overkill, but it would be great if we could add a test that ensures that self._extract_from_token for a certain test case isn't called more than x times. That would make sure this parsing logic stays performant in the future, too.

@codecov-commenter
Copy link

codecov-commenter commented May 27, 2020

Codecov Report

Merging #9673 into master will decrease coverage by 0.40%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9673      +/-   ##
==========================================
- Coverage   70.79%   70.38%   -0.41%     
==========================================
  Files         587      585       -2     
  Lines       30435    31057     +622     
  Branches     3152     3277     +125     
==========================================
+ Hits        21545    21861     +316     
- Misses       8776     9084     +308     
+ Partials      114      112       -2     
Flag Coverage Δ
#cypress 53.91% <ø> (+0.19%) ⬆️
#javascript 59.36% <ø> (+0.36%) ⬆️
#python 69.96% <66.66%> (-0.99%) ⬇️
Impacted Files Coverage Δ
superset/sql_parse.py 99.29% <66.66%> (-0.09%) ⬇️
superset/examples/countries.py 0.00% <0.00%> (-100.00%) ⬇️
superset/utils/decorators.py 55.00% <0.00%> (-32.28%) ⬇️
superset/viz.py 57.20% <0.00%> (-14.78%) ⬇️
superset/db_engine_specs/sqlite.py 63.33% <0.00%> (-10.01%) ⬇️
.../src/dashboard/components/RefreshIntervalModal.jsx 90.00% <0.00%> (-10.00%) ⬇️
...explore/components/AdhocMetricEditPopoverTitle.jsx 84.61% <0.00%> (-8.72%) ⬇️
superset-frontend/src/explore/exploreUtils.js 74.45% <0.00%> (-8.59%) ⬇️
...ponents/AdhocFilterEditPopoverSimpleTabContent.jsx 86.29% <0.00%> (-7.42%) ⬇️
superset/jinja_context.py 80.95% <0.00%> (-4.77%) ⬇️
... and 193 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65d185f...e2e3c9b. Read the comment docs.

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for fixing!

@villebro villebro merged commit f6cd3a9 into apache:master Jun 10, 2020
@villebro villebro deleted the lily/sql-query-parser branch June 10, 2020 06:16
auxten pushed a commit to auxten/incubator-superset that referenced this pull request Nov 20, 2020
* optimize sql query parser

* update extract from token

* update doc string

* pylint doc string
@mistercrunch mistercrunch added the 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels label Feb 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/S 🚢 0.37.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants