Skip to content

[SPARK-14398] [SQL] Audit non-reserved keyword list in ANTLR4 parser#12191

Closed
bomeng wants to merge 12 commits intoapache:masterfrom
bomeng:SPARK-14398
Closed

[SPARK-14398] [SQL] Audit non-reserved keyword list in ANTLR4 parser#12191
bomeng wants to merge 12 commits intoapache:masterfrom
bomeng:SPARK-14398

Conversation

@bomeng
Copy link
Contributor

@bomeng bomeng commented Apr 6, 2016

What changes were proposed in this pull request?

I have compared non-reserved list in Antlr3 and Antlr4 one by one as well as all the existing keywords defined in Antlr4, added the missing keywords to the non-reserved keywords list. If we need to support more syntax, we can add more keywords by then.

Any recommendation for the above is welcome.

How was this patch tested?

I manually checked the keywords one by one. Please let me know if there is a better way to test.

Another thought: I suggest to put all the keywords definition and non-reserved list in order, that will be much easier to check in the future.

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55066 has finished for PR 12191 at commit 5c130ba.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55080 has finished for PR 12191 at commit c6fbf0f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public class Java8DatasetAggregatorSuite extends JavaDatasetAggregatorSuiteBase
    • public class JavaDatasetAggregatorSuite extends JavaDatasetAggregatorSuiteBase
    • class JavaDatasetAggregatorSuiteBase implements Serializable

@SparkQA
Copy link

SparkQA commented Apr 6, 2016

Test build #55091 has finished for PR 12191 at commit 18bff08.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

@bomeng what is the point of adding non-reserved keywords if they are not used in parser rules?

The main point of this ticket is that we need to make sure that we do not have regressions compared to the old situation; a non-reserved keyword in the old situation should not be reserved in the new situation. Did you find any of these cases?

@viirya
Copy link
Member

viirya commented Apr 6, 2016

If we don't actually use these non-reserved keywords in any rules, I think we don't need to add them to the list. It might cause confusing too.

@bomeng
Copy link
Contributor Author

bomeng commented Apr 6, 2016

Sorry for my misunderstanding. I thought we want to keep all the keywords that were defined in the Antlr3 and later if we want to use them, we do not have to add them back case by case.

Among the items I added, some of them (e.g. ASC, DESC) needs to be in the non-reserved list, since they are used the parser and were non-reserved before. Should I only focus on those? What is the best way to do it? Please advise. Thanks.

@hvanhovell
Copy link
Contributor

@bomeng No worries. Please focus on the keywords that are reserved in the ANTLR4 parser, but were not in the ANTLR3 parser. The exception being join keywords.

We can add the other keywords when we need them.

@bomeng
Copy link
Contributor Author

bomeng commented Apr 7, 2016

Another try... This time, I've scanned all the existing keywords one by one and added missing non-reserved ones back. So it is more conservative approach. Later on, if we need to support more syntax, we can add more keywords by then. Thanks.

@SparkQA
Copy link

SparkQA commented Apr 7, 2016

Test build #55247 has finished for PR 12191 at commit 2cff552.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

| STATISTICS | ANALYZE | PARTITIONED | EXTERNAL | DEFINED | RECORDWRITER
| REVOKE | GRANT | LOCK | UNLOCK | MSCK | EXPORT | IMPORT | LOAD | VALUES | COMMENT | ROLE
| ROLES | COMPACTIONS | PRINCIPALS | TRANSACTIONS | INDEX | INDEXES | LOCKS | OPTION
| ASC | DESC | LIMIT | METADATA | MINUS | PLUS | RENAME | SETS
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PLUS (+) and MINUS (-) are bit funny, and really shouldn't be used as identifiers. Lets leave them out.

@bomeng
Copy link
Contributor Author

bomeng commented Apr 8, 2016

@hvanhovell I've made the changes by removing +/-. I really want to sort out the keywords in the file if you agree, right now, I have to search one by one and it is tedious. Do you think it is worth to do another JIRA for that?

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55384 has finished for PR 12191 at commit 8d02b4d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 8, 2016

Test build #55386 has finished for PR 12191 at commit 1b5f511.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@bomeng
Copy link
Contributor Author

bomeng commented Apr 11, 2016

@hvanhovell

@viirya
Copy link
Member

viirya commented Apr 13, 2016

@bomeng Can you update description too? Thanks.

@bomeng
Copy link
Contributor Author

bomeng commented Apr 13, 2016

description is updated. thanks.

@SparkQA
Copy link

SparkQA commented Apr 14, 2016

Test build #55855 has finished for PR 12191 at commit 6b5924c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor

@bomeng sorry for not getting back to you sooner. Is sorting the list only for asthetics and ease of searching? It seems like it is not really worth effort if it is, what do you think?

It might have a little merit in terms of performance to group all nonReserved keywords together. The parser has to check if a Token is on the nonReserved list and it does this by switch statements. Having a complete range of nonReserved tokens might allow a JIT/Compiler to optimize this.

@bomeng
Copy link
Contributor Author

bomeng commented Apr 18, 2016

Yes, the reason for sorting the keywords is for ease of searching purpose.
I have checked the generated codes and see the switch/case for each non-reserved words. But to my understanding, case A: case B: ... won't have performance difference as case A | B: ... this should be easily optimized by the compiler.

@hvanhovell
Copy link
Contributor

The compiler should emit a tableswitch instead of a lookupswitch when the nonReserved keywords are grouped together; which is a bit faster. I don't think the improvement is large enought to warrant another change and another PR. So lets merge this one and be done.

LGTM

@hvanhovell
Copy link
Contributor

Merging to master. Thanks!

@asfgit asfgit closed this in 74fe235 Apr 19, 2016
lw-lin pushed a commit to lw-lin/spark that referenced this pull request Apr 20, 2016
## What changes were proposed in this pull request?

I have compared non-reserved list in Antlr3 and Antlr4 one by one as well as all the existing keywords defined in Antlr4, added the missing keywords to the non-reserved keywords list.  If we need to support more syntax, we can add more keywords by then.

Any recommendation for the above is welcome.

## How was this patch tested?

I manually checked the keywords one by one. Please let me know if there is a better way to test.

Another thought: I suggest to put all the keywords definition and non-reserved list in order, that will be much easier to check in the future.

Author: bomeng <bmeng@us.ibm.com>

Closes apache#12191 from bomeng/SPARK-14398.
@bomeng bomeng deleted the SPARK-14398 branch April 27, 2016 22:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants