refactor(sql): optimize sql query parser #9673
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9673 +/- ##
==========================================
- Coverage 70.79% 70.02% -0.77%
==========================================
Files 587 588 +1
Lines 30435 32060 +1625
Branches 3152 3166 +14
==========================================
+ Hits 21545 22449 +904
- Misses 8776 9498 +722
+ Partials 114 113 -1
Continue to review full report at Codecov.
Could you please add a summary in the PR description describing how the parsing has been optimized?
superset/sql_parse.py
Outdated
@@ -222,6 +220,15 @@ def __extract_from_token(self, token: Token):  # pylint: disable=too-many-branch
if not self.__is_identifier(token2):
    self.__extract_from_token(item)

def __process_token_group(self, tokens):  # flatten nested tokens to 1D array
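A helper like this typically walks the token tree depth-first and collects the leaves. A minimal sketch of the idea, using plain Python lists to stand in for nested sqlparse token groups (the function name and data are illustrative, not the PR's actual code):

```python
def flatten_tokens(tokens):
    """Depth-first walk that yields leaf tokens from nested token groups."""
    flat = []
    for tok in tokens:
        if isinstance(tok, list):          # nested group: recurse into it
            flat.extend(flatten_tokens(tok))
        else:                              # leaf token: keep as-is
            flat.append(tok)
    return flat

# Nested groups collapse into a single 1-D array:
print(flatten_tokens(["SELECT", ["a", [",", "b"]], "FROM", ["t"]]))
# -> ['SELECT', 'a', ',', 'b', 'FROM', 't']
```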
sqlparse has a flatten method, so you may want to explore whether it serves the same purpose as this function.
(force-pushed from 9142502 to e5c3e89)
While I don't doubt this does what the PR description says, it would be great if you could transfer some of the understanding you've gathered while researching this PR into the code/docs. Reviewing this is quite difficult in part because of the following:
This might be overkill, but it would be great if we could add a test that ensures that
Codecov Report
@@ Coverage Diff @@
## master #9673 +/- ##
==========================================
- Coverage 70.79% 70.38% -0.41%
==========================================
Files 587 585 -2
Lines 30435 31057 +622
Branches 3152 3277 +125
==========================================
+ Hits 21545 21861 +316
- Misses 8776 9084 +308
+ Partials 114 112 -2
Continue to review full report at Codecov.
(force-pushed from 65e4c9d to 7ebf6e3)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for fixing!
* optimize sql query parser
* update extract from token
* update doc string
* pylint doc string
CATEGORY
Choose one
SUMMARY
While researching how to optimize the SQL parser, I tried using the flatten method to reduce processing time. However, flatten turns the SQL query into a one-dimensional iterator, which loses the ability to identify an item as a TokenList. TokenList is used as the marker for processing an item as a table, see https://github.com/apache/incubator-superset/blob/e5c3e8964d8933069569809325200474157592c2/superset/sql_parse.py#L171-L175

I found that _extract_from_token was recursively called on the same item n times (where n is the length of item.tokens). It only needs to be called once per item.
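The redundancy described above can be sketched in isolation. In this toy model (TokenList, walk_per_token, and walk_once are illustrative stand-ins, not Superset's actual code), the "before" shape re-enters the walk on the same item once per sub-token, while the "after" shape enters it exactly once:

```python
from collections import Counter

class TokenList:
    """Hypothetical stand-in for a sqlparse TokenList."""
    def __init__(self, tokens=()):
        self.tokens = list(tokens)

calls = Counter()

def _visit(item):
    calls[id(item)] += 1
    for tok in item.tokens:
        pass  # real code would inspect each token here

def walk_per_token(item):
    # Before (as described in the summary): the caller invokes the walk
    # on ``item`` once for every sub-token it contains.
    for _ in item.tokens:
        _visit(item)

def walk_once(item):
    # After: a single invocation covers all of ``item``'s sub-tokens.
    _visit(item)

leaf = TokenList(["a", "b", "c"])

calls.clear(); walk_per_token(leaf)
print(calls[id(leaf)])  # visited 3 times for 3 tokens

calls.clear(); walk_once(leaf)
print(calls[id(leaf)])  # visited exactly once
```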
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
TEST PLAN
ADDITIONAL INFORMATION
REVIEWERS