[SPARK-22431][SQL] Ensure that the datatype in the schema for the table/view metadata is parseable by Spark before persisting it #19747
Conversation
Ok to test
Test build #83879 has finished for PR 19747 at commit
@@ -68,6 +69,48 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
  import hiveContext._
  import spark.implicits._

  test("SPARK-22431: table ctas - illegal nested type") {
IMHO it would be better to put all the illegal cases together, since they share the same logic except for the SQL statements.
@@ -895,6 +897,18 @@ private[hive] object HiveClientImpl {
  Option(hc.getComment).map(field.withComment).getOrElse(field)
}

private def verifyColumnDataType(schema: StructType): Unit = {
  schema.map(col => {
schema.foreach { field =>
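Putting the reviewers' suggestions together (`foreach` instead of `map`, and `AnalysisException` instead of `SparkException`, see below), the validation pattern looks roughly like this sketch. The classes here are simplified stand-ins, not the real Spark types, and the parser stub only rejects a `$` character for illustration:

```scala
// Simplified stand-ins for the real Spark classes (illustration only).
case class StructField(name: String, typeString: String)
class ParseException(msg: String) extends Exception(msg)
class AnalysisException(msg: String, cause: Throwable) extends Exception(msg, cause)

object VerifySketch {
  // Crude stand-in for CatalystSqlParser.parseDataType: treat a '$'
  // in the type string as unparseable.
  def parseDataType(typeString: String): Unit =
    if (typeString.contains('$')) throw new ParseException(s"cannot parse: $typeString")

  // foreach (not map): the result is unused, we only want the side
  // effect of failing fast on an unparseable column type.
  def verifyColumnDataType(schema: Seq[StructField]): Unit =
    schema.foreach { field =>
      try parseDataType(field.typeString)
      catch {
        case e: ParseException =>
          throw new AnalysisException(s"Cannot recognize the data type: ${field.typeString}", e)
      }
    }

  def main(args: Array[String]): Unit = {
    verifyColumnDataType(Seq(StructField("q", "struct<a:int,b:string>"))) // passes
    try {
      verifyColumnDataType(Seq(StructField("q", "struct<$a:int>")))
      sys.error("expected an AnalysisException")
    } catch { case _: AnalysisException => println("ok") }
  }
}
```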
Thanks @wzhfy for your comments. I have addressed them in the latest commit.
I synced up and noticed there are some recent changes that have gone in that change the ALTER TABLE schema codepath in HiveExternalCatalog. I'll take a look and see what changes might be needed for that.
Test build #83888 has finished for PR 19747 at commit
@@ -40,6 +40,22 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {

  setupTestData()

  test("SPARK-22431: table with nested type col with special char") {
Move these two to InMemoryCatalogedDDLSuite
Thanks @gatorsmile for your comments. I have addressed them in the latest commit.
@@ -68,6 +69,36 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
  import hiveContext._
  import spark.implicits._

  test("SPARK-22431: illegal nested type") {
Move these to HiveCatalogedDDLSuite
  CatalystSqlParser.parseDataType(typeString)
} catch {
  case e: ParseException =>
    throw new SparkException(s"Cannot recognize the data type: $typeString", e)
-> AnalysisException
@@ -507,6 +508,7 @@ private[hive] class HiveClientImpl(
  // these properties are still available to the others that share the same Hive metastore.
  // If users explicitly alter these Hive-specific properties through ALTER TABLE DDL, we respect
  // these user-specified values.
  verifyColumnDataType(table.dataSchema)
Do it in HiveExternalCatalog.verifyColumnNames?
Thanks @gatorsmile for the review. I'll incorporate your other comments in my next commit.
In the current codeline, another recent PR changed verifyColumnNames to verifyDataSchema.
The reason I could not put the check in verifyDataSchema (or the old verifyColumnNames):
- verifyDataSchema is called at the beginning of the doCreateTable method. But we cannot error out that early in doCreateTable, because later in that method we create the data source table. If the data source table cannot be stored in a Hive-compatible format, it falls back to storing it in the Spark SQL specific format, which works fine.
- For example, if I put the check there, then the following CREATE of a data source table would throw an exception right away, which we do not want:
CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET
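The fallback behavior described above can be sketched as follows. The names (`saveHiveCompatible`, `doCreateTable`) are hypothetical, and the Hive-compatibility check is a stand-in, not the real HiveExternalCatalog logic:

```scala
sealed trait StoredFormat
case object HiveCompatible extends StoredFormat
case object SparkSqlSpecific extends StoredFormat

object CreateTableSketch {
  // Stand-in check: treat a '$' in the schema string as something that
  // cannot be stored in a Hive-compatible way.
  private def saveHiveCompatible(schemaString: String): StoredFormat =
    if (schemaString.contains('$')) throw new IllegalArgumentException("not Hive compatible")
    else HiveCompatible

  // Mirrors the flow described above: erroring out before this point
  // would prevent the fallback from ever being reached.
  def doCreateTable(schemaString: String): StoredFormat =
    try saveHiveCompatible(schemaString)
    catch {
      case _: IllegalArgumentException =>
        // Fall back to the Spark SQL specific format (schema kept in
        // table properties); reads through Spark still work.
        SparkSqlSpecific
    }
}
```

So a CREATE TABLE like the one above would succeed via the fallback rather than fail, which is why the check has to live further down the codepath.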
I have taken care of adding the check in the new HiveClientImpl.alterTableDataSchema as well and have added some new tests.
Test build #83968 has finished for PR 19747 at commit
Test build #83969 has finished for PR 19747 at commit
withView("v") {
  spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
  assert(spark.sql("SELECT * FROM v").count() == 1L)
Could you check the contents instead of just the row count?
The same applies to the other test cases
@@ -895,6 +898,19 @@ private[hive] object HiveClientImpl {
  Option(hc.getComment).map(field.withComment).getOrElse(field)
}

private def verifyColumnDataType(schema: StructType): Unit = {
  schema.foreach(field => {
    val typeString = field.dataType.catalogString
catalogString is generated by Spark. It is not related to the restriction of Hive or the interaction between Hive and Spark.
See my fix: gatorsmile@bdcb9c8
After applying my fix, you also need to update the test cases to make the exception types consistent.
I have taken your change and incorporated it in the latest commit. Thanks.
…persisting to metastore, and add unit tests
…eption/error message change and also check contents for the query results
Thanks @gatorsmile for your comments. Please take a look. Thanks.
Test build #84238 has finished for PR 19747 at commit
spark.sql("ALTER TABLE t3 ADD COLUMNS (newcol2 STRUCT<`col1`:STRING, col2:Int>)")

val df3 = spark.sql("SELECT * FROM t3")
checkAnswer(df3, Nil)
checkAnswer(spark.table("t3"), Nil)
spark.sql("ALTER TABLE t2 ADD COLUMNS (newcol2 STRUCT<`col1`:STRING, col2:Int>)")

val df2 = spark.sql("SELECT * FROM t2")
checkAnswer(df2, Nil)
The same here
checkAnswer(spark.sql("SELECT * FROM v"), Row(Row("a", 1)) :: Nil)

spark.sql("ALTER VIEW v AS SELECT STRUCT('a' AS `b`, 1 AS b) q1")
val df = spark.sql("SELECT * FROM v")
The same here
spark.sql("CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET")
checkAnswer(sql("SELECT * FROM t"), Nil)
spark.sql("CREATE TABLE x (q STRUCT<col1:INT, col2:STRING>, i1 INT)")
checkAnswer(sql("SELECT * FROM x"), Nil)
The same here
test("SPARK-22431: table with nested type") {
  withTable("t", "x") {
    spark.sql("CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET")
    checkAnswer(sql("SELECT * FROM t"), Nil)
The same here
test("SPARK-22431: view with nested type") {
  withView("v") {
    spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
    checkAnswer(spark.sql("SELECT * FROM v"), Row(Row("a", 1)) :: Nil)
The same here
spark.sql("CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q")
checkAnswer(sql("SELECT * FROM t"), Row(Row("a", 1)) :: Nil)
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
checkAnswer(sql("SELECT * FROM t"), Row(Row("a", 1)) :: Nil)
The same issues in these two test cases
LGTM except a few minor comments.
Thanks @gatorsmile.
LGTM
Test build #84271 has finished for PR 19747 at commit
LGTM - merging to master. Thanks for working on this!
great! Thank you @gatorsmile, @hvanhovell, @wzhfy
What changes were proposed in this pull request?
Description:
create view x as select struct('a' as `$q`, 1 as b) q
Issue/Analysis: Right now, we can create a view with a schema that cannot be read back by Spark from the Hive metastore. For more details, please see the discussion about the analysis and the proposed fix options in comment 1 and comment 2 on SPARK-22431.
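The failure mode can be sketched as a round trip: the catalog type string generated for the struct omits the backtick quoting, so a field name with a special character such as `$q` cannot be parsed back when the schema is later read from the metastore. The code below is a simplified stand-in, not the real StructType/parser implementation:

```scala
object RoundTripSketch {
  // Stand-in for StructType.catalogString: joins name:type pairs
  // without quoting the field names.
  def catalogString(fields: Seq[(String, String)]): String =
    fields.map { case (n, t) => s"$n:$t" }.mkString("struct<", ",", ">")

  // Crude stand-in for CatalystSqlParser.parseDataType: a '$' in an
  // unquoted field name makes the string unparseable.
  def parseable(typeString: String): Boolean = !typeString.contains('$')

  def main(args: Array[String]): Unit = {
    val persisted = catalogString(Seq("$q" -> "string", "b" -> "int"))
    assert(persisted == "struct<$q:string,b:int>") // what gets persisted
    assert(!parseable(persisted))                  // ...and cannot be read back
    println("ok")
  }
}
```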
Proposed changes:
With the fix:
How was this patch tested?
@hvanhovell, Please review and share your thoughts/comments. Thank you so much.