
[SPARK-26215][SQL] Define reserved/non-reserved keywords based on the ANSI SQL standard #23259

Closed
maropu wants to merge 18 commits into master from maropu/SPARK-26215-WIP

Conversation

maropu
Member

@maropu maropu commented Dec 8, 2018

What changes were proposed in this pull request?

This PR aims to define reserved/non-reserved keywords for Spark SQL based on the ANSI SQL standards and other database-like systems (e.g., PostgreSQL). These systems basically follow the ANSI SQL-2011 standard, but they differ slightly from one another. Therefore, this PR documents all the keywords in docs/sql-reserved-and-non-reserved-key-words.md.

NOTE: This PR only adds a small set of keywords as reserved ones, namely those that are reserved in all the ANSI SQL standards (SQL-92, SQL-99, SQL-2003, SQL-2008, SQL-2011, and SQL-2016) and in PostgreSQL. This is because there is still room to discuss which keywords should be reserved; e.g., interval units (day, hour, minute, second, ...) are reserved in the ANSI SQL standards but not in PostgreSQL. Therefore, we need more research on other database-like systems (e.g., Oracle Database, DB2, SQL Server) in follow-up work.

References:
- The reserved/non-reserved SQL keywords in the ANSI SQL standards: https://developer.mimer.com/wp-content/uploads/2018/05/Standard-SQL-Reserved-Words-Summary.pdf
- SQL Key Words in PostgreSQL: https://www.postgresql.org/docs/current/sql-keywords-appendix.html

How was this patch tested?

Added tests in TableIdentifierParserSuite.
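
For illustration, a minimal Scala sketch (not part of this PR) of how the behavior targeted here could be exercised. The config name `spark.sql.parser.ansi.enabled` and the reserved keyword ANY follow this PR's discussion; the session setup, view name, and error handling are assumptions, and the exact error message is not verified output.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.parser.ParseException

object AnsiKeywordDemo {
  def main(args: Array[String]): Unit = {
    // Local session with the ANSI parser mode from this PR turned on.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ansi-keyword-demo")
      .config("spark.sql.parser.ansi.enabled", "true") // config name as used in this PR
      .getOrCreate()

    spark.range(3).createOrReplaceTempView("t")

    // With ANSI mode on, a keyword reserved by this PR (e.g. ANY) is expected to be
    // rejected as a table alias; with the flag off it should keep parsing as before.
    try {
      spark.sql("SELECT any.id FROM t AS any").show()
    } catch {
      case e: ParseException => println(s"Rejected as expected: ${e.getMessage}")
    }

    spark.stop()
  }
}
```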

@maropu
Member Author

maropu commented Dec 8, 2018

I made this PR to help discuss this topic smoothly.
Any comments/suggestions are welcome.

cc: @gatorsmile @cloud-fan @viirya

@@ -769,7 +774,7 @@ nonReserved
| REVOKE | GRANT | LOCK | UNLOCK | MSCK | REPAIR | RECOVER | EXPORT | IMPORT | LOAD | VALUES | COMMENT | ROLE
| ROLES | COMPACTIONS | PRINCIPALS | TRANSACTIONS | INDEX | INDEXES | LOCKS | OPTION | LOCAL | INPATH
| ASC | DESC | LIMIT | RENAME | SETS
-| AT | NULLS | OVERWRITE | ALL | ANY | ALTER | AS | BETWEEN | BY | CREATE | DELETE
+| AT | NULLS | OVERWRITE | ANY | ALTER | AS | BETWEEN | BY | CREATE | DELETE
Member

Doesn't ANY move to reserved?

Member Author

yea, thanks. you're right.

@SparkQA

SparkQA commented Dec 8, 2018

Test build #99854 has finished for PR 23259 at commit 01bc383.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.

@dongjoon-hyun
Member

cc @hvanhovell , @mgaido91 , too.

@SparkQA

SparkQA commented Dec 9, 2018

Test build #99880 has finished for PR 23259 at commit 01bc383.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

cloud-fan commented Dec 9, 2018

thanks @maropu for starting it!

Which SQL standard does Spark SQL follow (e.g., 2011 or 2016)?

I think SQL 2011 is good, but if we can't find a public version, maybe it's also OK to follow Postgres.

Where should we handle reserved keywords?

I think it should be SqlBase.g4, but one problem is that the .g4 file defines non-reserved keywords, not reserved ones. Maybe we need to update it.

Where should we document the list of reserved/non-reserved keywords?

I think the new file you created in this PR is a good place to document it.

@mgaido91
Contributor

+1 for SQL 2011. I downloaded the standard, but I couldn't find any section dedicated to this topic. In the Postgres docs, though, they state that they do not follow the standard strictly: https://www.postgresql.org/docs/11/sql-keywords-appendix.html. Shall we follow that list and follow the standard as it is described there?

@gatorsmile
Member

ping @maropu Is anything blocking this PR?

@maropu
Member Author

maropu commented Jan 30, 2019

I'm working on this now, so I'll update in a few days.

@maropu maropu changed the title [SPARK-26215][SQL][WIP] Define reserved/non-reserved keywords based on the ANSI SQL standard [SPARK-26215][SQL] Define reserved keywords based on the ANSI SQL standard Jan 30, 2019
| SERDEPROPERTIES | SET | SETS | SHOW | SKEWED | SORT | SORTED | START | STATISTICS | STORED | STRATIFY
| STRUCT | TABLE | TABLES | TABLESAMPLE | TBLPROPERTIES | TEMPORARY | TERMINATED | THEN | TO | TOUCH
| TRAILING | TRANSACTION | TRANSACTIONS | TRANSFORM | TRUE | TRUNCATE | UNARCHIVE | UNBOUNDED | UNCACHE
| UNLOCK | UNSET | USE | VALUES | VIEW | WHEN | WHERE | WINDOW | WITH
;


| DATABASE | SELECT | FROM | WHERE | HAVING | TO | TABLE | WITH | NOT
| DIRECTORY
| BOTH | LEADING | TRAILING
: ADD | AFTER | ALL | ALTER | ANALYZE | AND | ANY | ARCHIVE | ARRAY | AS | ASC | AT | BETWEEN | BOTH
Member Author

To make it easier to compare with the reserved entries, I just sorted the nonReserved ones in alphabetical order.

@maropu maropu changed the title [SPARK-26215][SQL] Define reserved keywords based on the ANSI SQL standard [SPARK-26215][SQL] Define reserved/non-reserved keywords based on the ANSI SQL standard Jan 30, 2019
@maropu
Member Author

maropu commented Jan 30, 2019

Not finished yet, so I need more time to brush up the code (some of it is still wrong...)

@SparkQA

SparkQA commented Jan 30, 2019

Test build #101889 has finished for PR 23259 at commit 46029b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 30, 2019

Test build #101890 has finished for PR 23259 at commit bbd5990.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 30, 2019

Test build #101892 has finished for PR 23259 at commit 71455d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu maropu force-pushed the SPARK-26215-WIP branch 2 times, most recently from 15c2046 to 42740b8 on February 1, 2019 08:52
@SparkQA

SparkQA commented Feb 1, 2019

Test build #101985 has finished for PR 23259 at commit 15c2046.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 1, 2019

Test build #101986 has finished for PR 23259 at commit 80005cb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 2, 2019

Test build #102009 has finished for PR 23259 at commit f6bf2e0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 4, 2019

OK, this is ready for review. Could anyone take a look?

@maropu
Member Author

maropu commented Feb 21, 2019

retest this please

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102599 has finished for PR 23259 at commit 140bb94.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 22, 2019

retest this please

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102602 has finished for PR 23259 at commit 6d4b5ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 22, 2019

retest this please

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102605 has finished for PR 23259 at commit 6d4b5ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 22, 2019

I'm fixing the test failure...

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102607 has finished for PR 23259 at commit 6d4b5ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

displayTitle: SQL Reserved/Non-Reserved Keywords
---

When `spark.sql.parser.ansi.enabled` is set to true (false by default), some keywords are reserved for Spark SQL.
Contributor
@cloud-fan cloud-fan Feb 22, 2019

In Spark SQL there are 2 kinds of keywords: non-reserved and reserved. Non-reserved keywords have a
special meaning only in particular contexts and can be used as identifiers (e.g., table names, view names,
column names, column aliases, table aliases) in other contexts. Reserved keywords can't be used as a
table alias, but can be used as other identifiers.

The list of reserved and non-reserved keywords can change according to the config
`spark.sql.parser.ansi.enabled`, which is false by default.
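
For illustration, a hedged sketch of the distinction described above, assuming a `spark` session started with `spark.sql.parser.ansi.enabled=true`; the exact keyword classification is determined by the grammar in this PR, so the keywords below (TABLES as non-reserved, ALL as reserved) are illustrative assumptions:

```scala
// A non-reserved keyword can still be used as an ordinary identifier, e.g. a column
// alias (TABLES is assumed to stay non-reserved, as it is not an ANSI reserved word).
spark.sql("SELECT 1 AS tables").show()

// A reserved keyword (ALL is moved to reserved by this PR) is expected to be
// rejected by the parser when used as a table alias; this should throw a ParseException.
spark.sql("SELECT all.id FROM range(3) AS all").show()
```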

| ANTI | FULL | INNER | LEFT | SEMI | RIGHT | NATURAL | JOIN | CROSS | ON
| UNION | INTERSECT | EXCEPT | SETMINUS
| {ansi}? ansiReserved
| {!ansi}? reserved
Contributor

nit: maybe defaultReserved

// Let's say you add a new token `NEWTOKEN` and it is not reserved regardless of the `spark.sql.parser.ansi.enabled`
// value. In this case, you must add `NEWTOKEN` to both `ansiNonReserved` and `nonReserved`.

// A list of the reserved keywords below in Spark SQL. These keywords cannot be used for identifiers
Contributor

// The list of the reserved keywords when `spark.sql.parser.ansi.enabled` is true. Currently, we only reserve
// the ANSI keywords that almost all the ANSI SQL standards (SQL-92, SQL-99, SQL-2003, SQL-2008, SQL-2011,
// and SQL-2016) and PostgreSQL reserve.

;

// When `spark.sql.parser.ansi.enabled` = true, the `ansiNonReserved` keywords can be used for identifiers.
// Otherwise (`spark.sql.parser.ansi.enabled` = false), we follow the existing Spark SQL behaviour until v3.0:
Contributor
@cloud-fan cloud-fan Feb 22, 2019

The list of the non-reserved keywords when `spark.sql.parser.ansi.enabled` is true.

Member Author

ok

val keyword = ctx.getText
if (ctx.ansiReserved() != null) {
  throw new ParseException(
    s"'$keyword' is reserved and you cannot use this keyword as an identifier.", ctx)
}
Contributor

hmmm, do we need to do it in this PR? I think this PR just changes the list of non-reserved/reserved keywords according to the config.

Member Author

Ah, OK. But then we have no behaviour change between ansi=true and ansi=false now?

Member Author

OK, I reverted the unrelated changes.

@@ -53,10 +55,96 @@ class TableIdentifierParserSuite extends SparkFunSuite {
"bigint", "binary", "boolean", "current_date", "current_timestamp", "date", "double", "float",
"int", "smallint", "timestamp", "at", "position", "both", "leading", "trailing", "extract")

-val hiveStrictNonReservedKeyword = Seq("anti", "full", "inner", "left", "semi", "right",
+val hiveStrictNonReservedKeywords = Seq("anti", "full", "inner", "left", "semi", "right",
Contributor

Hmm, it seems we didn't have reserved keywords at all before this PR. We only had non-reserved keywords and strict-non-reserved keywords...
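
For context, a hedged sketch (not the actual suite code) of the kind of check such a suite can perform via Catalyst's parser entry point; the chosen keywords are assumptions based on the lists above:

```scala
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

// A non-reserved keyword should keep parsing as a plain table identifier.
assert(CatalystSqlParser.parseTableIdentifier("view").table == "view")

// Back-quoting always works, even for keywords that are otherwise restricted
// (e.g. the strict-non-reserved keyword ANTI from the list above).
assert(CatalystSqlParser.parseTableIdentifier("`anti`").table == "anti")
```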

@cloud-fan
Contributor

LGTM. This PR has only a small behavior change when ansi mode is on: some keywords can't be used in table aliases. In the follow-up we can forbid reserved keywords as identifiers when ansi mode is on.

@maropu
Member Author

maropu commented Feb 22, 2019

Thanks for the active review!

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102615 has finished for PR 23259 at commit 5b4b5ca.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102627 has finished for PR 23259 at commit d1ab4f4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 22, 2019

retest this please

@SparkQA

SparkQA commented Feb 22, 2019

Test build #102629 has finished for PR 23259 at commit d1ab4f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Feb 22, 2019

Thanks! Merged to master.

@maropu maropu closed this Feb 22, 2019
@maropu
Member Author

maropu commented Feb 22, 2019

I created a new JIRA for the follow-up: https://issues.apache.org/jira/browse/SPARK-26976

maropu added a commit that referenced this pull request Feb 22, 2019
[SPARK-26215][SQL] Define reserved/non-reserved keywords based on the ANSI SQL standard

## What changes were proposed in this pull request?
This PR aims to define reserved/non-reserved keywords for Spark SQL based on the ANSI SQL standards and other database-like systems (e.g., PostgreSQL). These systems basically follow the ANSI SQL-2011 standard, but they differ slightly from one another. Therefore, this PR documents all the keywords in `docs/sql-reserved-and-non-reserved-key-words.md`.

NOTE: This PR only adds a small set of keywords as reserved ones, namely those that are reserved in all the ANSI SQL standards (SQL-92, SQL-99, SQL-2003, SQL-2008, SQL-2011, and SQL-2016) and in PostgreSQL. This is because there is still room to discuss which keywords should be reserved; e.g., interval units (day, hour, minute, second, ...) are reserved in the ANSI SQL standards but not in PostgreSQL. Therefore, we need more research on other database-like systems (e.g., Oracle Database, DB2, SQL Server) in follow-up work.

References:
 - The reserved/non-reserved SQL keywords in the ANSI SQL standards: https://developer.mimer.com/wp-content/uploads/2018/05/Standard-SQL-Reserved-Words-Summary.pdf
 - SQL Key Words in PostgreSQL: https://www.postgresql.org/docs/current/sql-keywords-appendix.html

## How was this patch tested?
Added tests in `TableIdentifierParserSuite`.

Closes #23259 from maropu/SPARK-26215-WIP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
maropu added a commit that referenced this pull request Mar 13, 2019
…mode is on

## What changes were proposed in this pull request?
This PR added code to forbid reserved keywords as identifiers when ANSI mode is on.
This is a follow-up of SPARK-26215 (#23259).

## How was this patch tested?
Added tests in `TableIdentifierParserSuite`.

Closes #23880 from maropu/SPARK-26976.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>