Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-14762][SQL] TPCDS Q90 fails to parse #12537

Closed
wants to merge 4 commits into from

Conversation

hvanhovell
Copy link
Contributor

What changes were proposed in this pull request?

TPCDS Q90 fails to parse because it uses a reserved keyword as an Identifier; AT was used as an alias for one of the subqueries. AT is not a reserved keyword and should have been registerd as a in the nonReserved rule.

In order to prevent this from happening again I have added tests for all keywords that are non-reserved in Hive. See the nonReserved, sql11ReservedKeywordsUsedAsCastFunctionName & sql11ReservedKeywordsUsedAsIdentifier rules in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g.

How was this patch tested?

Added tests to for all Hive non reserved keywords to TableIdentifierParserSuite.

cc @davies

@hvanhovell
Copy link
Contributor Author

This might fail Hive tests.

@SparkQA
Copy link

SparkQA commented Apr 20, 2016

Test build #56412 has finished for PR 12537 at commit 5cbd4b6.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 20, 2016

Test build #56413 has finished for PR 12537 at commit de9d8be.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

hiveNonReservedKeyword.foreach { nonReserved =>
assert(TableIdentifier(nonReserved) === parseTableIdentifier(nonReserved))
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need one more test case. Table identifier has illegal characters. For example, tab:1. Then, it will directly fall into the group hiveNativeCommands. Then, we got a weird Hive parser error, just like

NoViableAltException(302@[192:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
    at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
    at org.antlr.runtime.DFA.predict(DFA.java:116)
    at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
    at org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45920)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we need to fix this. The HiveSqlParser passes on queries it can't parse to Hive. I think we should disable this behavior after we have integrated the SparkSqlParser & HiveSqlParser. Could you create a JIRA ticket under https://issues.apache.org/jira/browse/SPARK-14776?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just thinking if we can capture it during the Parser? The idea is to issue a better error message when users input an incorrect error message.

How about relaxing the .g4 parsing rule for tableIdentifier? That is, allowing Parser to accept the illegal inputs? Then, verify its correctness in the visitTableIdentifier. If users input an unexpected table/database name, we can easily tell users such inputs are not allowed as the table/database names?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also found a bug in the table identifier. Hive does not allow _a. Do you want me to fix it? Or you want to do it? Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! @hvanhovell I see your points. Fortunately, Hive metastore does not enforce it. Otherwise, we are still unable to have names starting with underscores.

Since we are unable to trust Hive metastore APIs to issue exceptions if we inputting the illegal inputs, I am trying to see if anything we did is not consistent with Hive. Otherwise, it could cause a disaster result. For example, one of them is #12446. We could drop all the partitions if users input a wrong partition column name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile I think the metastore will allow names starting with underscores, since Hive allows names starting with underscores if you put the name in backticks. However we are talking about Hive, and it has surprised me before. Could you open a PR to test if the metastore allows this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metastore allows it. That is why I agreed we can keep it without any change, but we need to add a few test cases to check if this assumption still works if we upgrade the version later. Let me submit a PR for adding these test cases.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do it after #12081 is merged. : ) Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current Parser, we still have another issue. For example, our Parser can accept

CREATE TABLE `tab:1`  ...

However, Hive metastore does not allow it. We hit an exception from the metastore:

org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: [tab:1]: is not a valid table name;
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: [tab:1]: is not a valid table name;
    at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:71)
    at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:147)
    at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:188)

In the next PR, I will add the extra logics to verify the names before calling Hive metastore APIs.

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56486 has finished for PR 12537 at commit c9875e6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56501 has finished for PR 12537 at commit c9875e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #56536 has finished for PR 12537 at commit d48b19e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 21, 2016

Test build #2845 has finished for PR 12537 at commit d48b19e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 22, 2016

Test build #2857 has finished for PR 12537 at commit d48b19e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies
Copy link
Contributor

davies commented Apr 22, 2016

LGTM, merging into master, thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants