
[SPARK-26215][SQL] Define reserved/non-reserved keywords based on the ANSI SQL standard #23259

Closed
maropu wants to merge 18 commits into master from maropu/SPARK-26215-WIP

Conversation

maropu
Member

@maropu maropu commented Dec 8, 2018

What changes were proposed in this pull request?

This PR aims to define reserved/non-reserved keywords for Spark SQL based on the ANSI SQL standards and other database-like systems (e.g., PostgreSQL). These systems basically follow the ANSI SQL-2011 standard, but they differ slightly from one another. Therefore, this PR documents all the keywords in docs/sql-reserved-and-non-reserved-key-words.md.

NOTE: This PR only adds a small set of keywords as reserved ones, namely those that are reserved in all the ANSI SQL standards (SQL-92, SQL-99, SQL-2003, SQL-2008, SQL-2011, and SQL-2016) and in PostgreSQL. This is because there is still room to discuss which keywords should be reserved; e.g., interval units (day, hour, minute, second, ...) are reserved in the ANSI SQL standards but not in PostgreSQL. Therefore, we need more research on other database-like systems (e.g., Oracle Database, DB2, SQL Server) in follow-up work.

References:
- The reserved/non-reserved SQL keywords in the ANSI SQL standards: https://developer.mimer.com/wp-content/uploads/2018/05/Standard-SQL-Reserved-Words-Summary.pdf
- SQL Key Words in PostgreSQL: https://www.postgresql.org/docs/current/sql-keywords-appendix.html

How was this patch tested?

Added tests in TableIdentifierParserSuite.
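
For illustration, a minimal Scala sketch (not part of this PR) of how the behavior targeted here could be exercised. The config name `spark.sql.parser.ansi.enabled` and the reserved keyword ANY follow this PR's discussion; the session setup, view name, and error handling are assumptions, and the exact error message is not verified output.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.parser.ParseException

object AnsiKeywordDemo {
  def main(args: Array[String]): Unit = {
    // Local session with the ANSI parser mode from this PR turned on.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ansi-keyword-demo")
      .config("spark.sql.parser.ansi.enabled", "true") // config name as used in this PR
      .getOrCreate()

    spark.range(3).createOrReplaceTempView("t")

    // With ANSI mode on, a keyword reserved by this PR (e.g. ANY) is expected to be
    // rejected as a table alias; with the flag off it should keep parsing as before.
    try {
      spark.sql("SELECT any.id FROM t AS any").show()
    } catch {
      case e: ParseException => println(s"Rejected as expected: ${e.getMessage}")
    }

    spark.stop()
  }
}
```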

@maropu
Member Author

maropu commented Dec 8, 2018

I made this PR to help discuss this topic smoothly.
Any comments/suggestions are welcome.

cc: @gatorsmile @cloud-fan @viirya

@@ -769,7 +774,7 @@ nonReserved
| REVOKE | GRANT | LOCK | UNLOCK | MSCK | REPAIR | RECOVER | EXPORT | IMPORT | LOAD | VALUES | COMMENT | ROLE
| ROLES | COMPACTIONS | PRINCIPALS | TRANSACTIONS | INDEX | INDEXES | LOCKS | OPTION | LOCAL | INPATH
| ASC | DESC | LIMIT | RENAME | SETS
-| AT | NULLS | OVERWRITE | ALL | ANY | ALTER | AS | BETWEEN | BY | CREATE | DELETE
+| AT | NULLS | OVERWRITE | ANY | ALTER | AS | BETWEEN | BY | CREATE | DELETE
Member

Doesn't ANY move to reserved?

Member Author

yea, thanks. you're right.

@SparkQA

SparkQA commented Dec 8, 2018

Test build #99854 has finished for PR 23259 at commit 01bc383.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@dongjoon-hyun
Member

cc @hvanhovell , @mgaido91 , too.

@SparkQA

SparkQA commented Dec 9, 2018

Test build #99880 has finished for PR 23259 at commit 01bc383.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented Dec 9, 2018

thanks @maropu for starting it!

Which SQL standard does Spark SQL follow (e.g., 2011 or 2016)?

I think SQL 2011 is good, but if we can't find a public version, maybe it's also OK to follow Postgres.

Where should we handle reserved keywords?

I think it should be SqlBase.g4, but one problem is that the .g4 file defines non-reserved keywords, not reserved ones. Maybe we need to update it.

Where should we document the list of reserved/non-reserved keywords?

I think the new file you created in this PR is a good place to document it.

@mgaido91
Contributor

+1 for SQL 2011. I downloaded the standard, but I couldn't find any section dedicated to this topic. In the Postgres docs, though, they state that they do not follow the standard strictly: https://www.postgresql.org/docs/11/sql-keywords-appendix.html. Shall we follow that list and follow the standard as it is described there?

@gatorsmile
Member

ping @maropu Is anything blocking this PR?

@maropu
Member Author

maropu commented Jan 30, 2019

I'm working on this now, so I'll update in a few days.

@maropu maropu changed the title [SPARK-26215][SQL][WIP] Define reserved/non-reserved keywords based on the ANSI SQL standard [SPARK-26215][SQL] Define reserved keywords based on the ANSI SQL standard Jan 30, 2019
| SERDEPROPERTIES | SET | SETS | SHOW | SKEWED | SORT | SORTED | START | STATISTICS | STORED | STRATIFY
| STRUCT | TABLE | TABLES | TABLESAMPLE | TBLPROPERTIES | TEMPORARY | TERMINATED | THEN | TO | TOUCH
| TRAILING | TRANSACTION | TRANSACTIONS | TRANSFORM | TRUE | TRUNCATE | UNARCHIVE | UNBOUNDED | UNCACHE
| UNLOCK | UNSET | USE | VALUES | VIEW | WHEN | WHERE | WINDOW | WITH
;


| DATABASE | SELECT | FROM | WHERE | HAVING | TO | TABLE | WITH | NOT
| DIRECTORY
| BOTH | LEADING | TRAILING
: ADD | AFTER | ALL | ALTER | ANALYZE | AND | ANY | ARCHIVE | ARRAY | AS | ASC | AT | BETWEEN | BOTH
Member Author

To make it easier to compare with the reserved entries, I just sorted the nonReserved ones in alphabetical order.

@maropu maropu changed the title [SPARK-26215][SQL] Define reserved keywords based on the ANSI SQL standard [SPARK-26215][SQL] Define reserved/non-reserved keywords based on the ANSI SQL standard Jan 30, 2019
@maropu
Member Author

maropu commented Jan 30, 2019

Not finished yet, so I need more time to brush up the code (some of it is still wrong...)

@SparkQA

SparkQA commented Jan 30, 2019

Test build #101889 has finished for PR 23259 at commit 46029b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 30, 2019

Test build #101890 has finished for PR 23259 at commit bbd5990.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 30, 2019

Test build #101892 has finished for PR 23259 at commit 71455d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the SPARK-26215-WIP branch 2 times, most recently from 15c2046 to 42740b8 on February 1, 2019 08:52
@SparkQA

SparkQA commented Feb 1, 2019

Test build #101985 has finished for PR 23259 at commit 15c2046.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 1, 2019

Test build #101986 has finished for PR 23259 at commit 80005cb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 2, 2019

Test build #102009 has finished for PR 23259 at commit f6bf2e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 4, 2019

OK, this is ready for review. Could anyone take a look?

@maropu
Member Author

maropu commented Feb 21, 2019

retest this please

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102599 has finished for PR 23259 at commit 140bb94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 22, 2019

retest this please

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102602 has finished for PR 23259 at commit 6d4b5ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 22, 2019

retest this please

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102605 has finished for PR 23259 at commit 6d4b5ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 22, 2019

I'm fixing the test failure...

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102607 has finished for PR 23259 at commit 6d4b5ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

displayTitle: SQL Reserved/Non-Reserved Keywords
---

When `spark.sql.parser.ansi.enabled` is set to true (false by default), some keywords are reserved for Spark SQL.
Contributor
@cloud-fan cloud-fan Feb 22, 2019

In Spark SQL there are 2 kinds of keywords: non-reserved and reserved. Non-reserved keywords have a
special meaning only in particular contexts and can be used as identifiers (e.g., table names, view names,
column names, column aliases, table aliases) in other contexts. Reserved keywords can't be used as a
table alias, but can be used as other identifiers.

The list of reserved and non-reserved keywords can change according to the config
`spark.sql.parser.ansi.enabled`, which is false by default.
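
For illustration, a hedged sketch of the distinction described above, assuming a `spark` session started with `spark.sql.parser.ansi.enabled=true`; the exact keyword classification is determined by the grammar in this PR, so the keywords below (TABLES as non-reserved, ALL as reserved) are illustrative assumptions:

```scala
// A non-reserved keyword can still be used as an ordinary identifier, e.g. a column
// alias (TABLES is assumed to stay non-reserved, as it is not an ANSI reserved word).
spark.sql("SELECT 1 AS tables").show()

// A reserved keyword (ALL is moved to reserved by this PR) is expected to be
// rejected by the parser when used as a table alias; this should throw a ParseException.
spark.sql("SELECT all.id FROM range(3) AS all").show()
```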

| ANTI | FULL | INNER | LEFT | SEMI | RIGHT | NATURAL | JOIN | CROSS | ON
| UNION | INTERSECT | EXCEPT | SETMINUS
| {ansi}? ansiReserved
| {!ansi}? reserved
Contributor

nit: maybe defaultReserved

// Let's say you add a new token `NEWTOKEN` and it is not reserved regardless of the `spark.sql.parser.ansi.enabled`
// value. In this case, you must add `NEWTOKEN` to both `ansiNonReserved` and `nonReserved`.

// A list of the reserved keywords below in Spark SQL. These keywords cannot be used for identifiers
Contributor

// The list of the reserved keywords when `spark.sql.parser.ansi.enabled` is true. Currently, we only reserve
// the ANSI keywords that almost all the ANSI SQL standards (SQL-92, SQL-99, SQL-2003, SQL-2008, SQL-2011,
// and SQL-2016) and PostgreSQL reserve.

;

// When `spark.sql.parser.ansi.enabled` = true, the `ansiNonReserved` keywords can be used for identifiers.
// Otherwise (`spark.sql.parser.ansi.enabled` = false), we follow the existing Spark SQL behaviour until v3.0:
Contributor
@cloud-fan cloud-fan Feb 22, 2019

The list of the non-reserved keywords when `spark.sql.parser.ansi.enabled` is true.

Member Author

ok

val keyword = ctx.getText
if (ctx.ansiReserved() != null) {
  throw new ParseException(
    s"'$keyword' is reserved and you cannot use this keyword as an identifier.", ctx)
}
Contributor

hmmm, do we need to do it in this PR? I think this PR just changes the list of non-reserved/reserved keywords according to the config.

Member Author

Ah, OK. But then we have no behaviour change between ansi=true and ansi=false now?

Member Author

OK, I reverted the unrelated changes.

@@ -53,10 +55,96 @@ class TableIdentifierParserSuite extends SparkFunSuite {
"bigint", "binary", "boolean", "current_date", "current_timestamp", "date", "double", "float",
"int", "smallint", "timestamp", "at", "position", "both", "leading", "trailing", "extract")

-val hiveStrictNonReservedKeyword = Seq("anti", "full", "inner", "left", "semi", "right",
+val hiveStrictNonReservedKeywords = Seq("anti", "full", "inner", "left", "semi", "right",
Contributor

Hmm, it seems we didn't have reserved keywords at all before this PR. We only had non-reserved keywords and strict-non-reserved keywords...
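
For context, a hedged sketch (not the actual suite code) of the kind of check such a suite can perform via Catalyst's parser entry point; the chosen keywords are assumptions based on the lists above:

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// A non-reserved keyword should keep parsing as a plain table identifier.
assert(CatalystSqlParser.parseTableIdentifier("view").table == "view")

// Back-quoting always works, even for keywords that are otherwise restricted
// (e.g. the strict-non-reserved keyword ANTI from the list above).
assert(CatalystSqlParser.parseTableIdentifier("`anti`").table == "anti")
```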

@cloud-fan
Contributor

LGTM. This PR has only a small behavior change when ansi mode is on: some keywords can't be used in table aliases. In the follow-up we can forbid reserved keywords as identifiers when ansi mode is on.

@maropu
Member Author

maropu commented Feb 22, 2019

Thanks for the active review!

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102615 has finished for PR 23259 at commit 5b4b5ca.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102627 has finished for PR 23259 at commit d1ab4f4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 22, 2019

retest this please

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102629 has finished for PR 23259 at commit d1ab4f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 22, 2019

Thanks! Merged to master.

@maropu maropu closed this Feb 22, 2019
@maropu
Member Author

maropu commented Feb 22, 2019

I created a new JIRA for the follow-up: https://issues.apache.org/jira/browse/SPARK-26976

maropu added a commit that referenced this pull request Feb 22, 2019
[SPARK-26215][SQL] Define reserved/non-reserved keywords based on the ANSI SQL standard

## What changes were proposed in this pull request?
This PR aims to define reserved/non-reserved keywords for Spark SQL based on the ANSI SQL standards and other database-like systems (e.g., PostgreSQL). These systems basically follow the ANSI SQL-2011 standard, but they differ slightly from one another. Therefore, this PR documents all the keywords in `docs/sql-reserved-and-non-reserved-key-words.md`.

NOTE: This PR only adds a small set of keywords as reserved ones, namely those that are reserved in all the ANSI SQL standards (SQL-92, SQL-99, SQL-2003, SQL-2008, SQL-2011, and SQL-2016) and in PostgreSQL. This is because there is still room to discuss which keywords should be reserved; e.g., interval units (day, hour, minute, second, ...) are reserved in the ANSI SQL standards but not in PostgreSQL. Therefore, we need more research on other database-like systems (e.g., Oracle Database, DB2, SQL Server) in follow-up work.

References:
 - The reserved/non-reserved SQL keywords in the ANSI SQL standards: https://developer.mimer.com/wp-content/uploads/2018/05/Standard-SQL-Reserved-Words-Summary.pdf
 - SQL Key Words in PostgreSQL: https://www.postgresql.org/docs/current/sql-keywords-appendix.html

## How was this patch tested?
Added tests in `TableIdentifierParserSuite`.

Closes #23259 from maropu/SPARK-26215-WIP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
maropu added a commit that referenced this pull request Mar 13, 2019
…mode is on

## What changes were proposed in this pull request?
This PR added code to forbid reserved keywords as identifiers when ANSI mode is on.
This is a follow-up of SPARK-26215 (#23259).

## How was this patch tested?
Added tests in `TableIdentifierParserSuite`.

Closes #23880 from maropu/SPARK-26976.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>