New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-14762][SQL] TPCDS Q90 fails to parse #12537
Conversation
This might fail Hive tests. |
Test build #56412 has finished for PR 12537 at commit
|
Test build #56413 has finished for PR 12537 at commit
|
hiveNonReservedKeyword.foreach { nonReserved => | ||
assert(TableIdentifier(nonReserved) === parseTableIdentifier(nonReserved)) | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need one more test case. Table identifier has illegal characters. For example, tab:1
. Then, it will directly fall into the group hiveNativeCommands
. Then, we got a weird Hive parser error, just like
NoViableAltException(302@[192:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)
at org.antlr.runtime.DFA.predict(DFA.java:116)
at org.apache.hadoop.hive.ql.parse.HiveParser_FromClauseParser.tableName(HiveParser_FromClauseParser.java:4747)
at org.apache.hadoop.hive.ql.parse.HiveParser.tableName(HiveParser.java:45920)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, we need to fix this. The HiveSqlParser
passes on queries it can't parse to Hive. I think we should disable this behavior after we have integrated the SparkSqlParser
& HiveSqlParser
. Could you create a JIRA ticket under https://issues.apache.org/jira/browse/SPARK-14776?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just thinking if we can capture it during the Parser? The idea is to issue a better error message when users input an incorrect error message.
How about relaxing the .g4
parsing rule for tableIdentifier
? That is, allowing Parser
to accept the illegal inputs? Then, verify its correctness in the visitTableIdentifier
. If users input an unexpected table/database name, we can easily tell users such inputs are not allowed as the table/database names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also found a bug in the table identifier. Hive does not allow _a
. Do you want me to fix it? Or you want to do it? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! @hvanhovell I see your points. Fortunately, Hive metastore does not enforce it. Otherwise, we are still unable to have names starting with underscores.
Since we are unable to trust Hive metastore APIs to issue exceptions if we inputting the illegal inputs, I am trying to see if anything we did is not consistent with Hive. Otherwise, it could cause a disaster result. For example, one of them is #12446. We could drop all the partitions if users input a wrong partition column name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile I think the metastore will allow names starting with underscores, since Hive allows names starting with underscores if you put the name in backticks. However we are talking about Hive, and it has surprised me before. Could you open a PR to test if the metastore allows this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Metastore allows it. That is why I agreed we can keep it without any change, but we need to add a few test cases to check if this assumption still works if we upgrade the version later. Let me submit a PR for adding these test cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do it after #12081 is merged. : ) Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the current Parser, we still have another issue. For example, our Parser can accept
CREATE TABLE `tab:1` ...
However, Hive metastore does not allow it. We hit an exception from the metastore:
org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: [tab:1]: is not a valid table name;
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: org.apache.hadoop.hive.ql.metadata.HiveException: [tab:1]: is not a valid table name;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:71)
at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:147)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:188)
In the next PR, I will add the extra logics to verify the names before calling Hive metastore APIs.
Test build #56486 has finished for PR 12537 at commit
|
retest this please |
Test build #56501 has finished for PR 12537 at commit
|
Test build #56536 has finished for PR 12537 at commit
|
Test build #2845 has finished for PR 12537 at commit
|
Test build #2857 has finished for PR 12537 at commit
|
LGTM, merging into master, thanks! |
What changes were proposed in this pull request?
TPCDS Q90 fails to parse because it uses a reserved keyword as an Identifier;
AT
was used as an alias for one of the subqueries.AT
is not a reserved keyword and should have been registerd as a in thenonReserved
rule.In order to prevent this from happening again I have added tests for all keywords that are non-reserved in Hive. See the
nonReserved
,sql11ReservedKeywordsUsedAsCastFunctionName
&sql11ReservedKeywordsUsedAsIdentifier
rules in https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g.How was this patch tested?
Added tests to for all Hive non reserved keywords to
TableIdentifierParserSuite
.cc @davies