
[Spark-22431][SQL] Ensure that the datatype in the schema for the table/view metadata is parseable by Spark before persisting it #19747

Closed
wants to merge 9 commits

5 participants
@skambha (Contributor) commented Nov 14, 2017:

What changes were proposed in this pull request?

  • JIRA: SPARK-22431 : Creating Permanent view with illegal type

Description:

  • It is possible in Spark SQL to create a permanent view that uses a nested field with an illegal name.
  • For example if we create the following view:
    create view x as select struct('a' as `$q`, 1 as b) q
  • A simple select fails with the following exception:
select * from x;

org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int>
  at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
...

Issue/Analysis: Right now, we can create a view with a schema that cannot be read back by Spark from the Hive metastore. For more details, please see the discussion of the analysis and the proposed fix options in comment 1 and comment 2 on SPARK-22431.

Proposed changes:

  • Fix the Hive table/view codepath to check whether the schema's data types are parseable by Spark before persisting them in the metastore. The change is localized to HiveClientImpl and mirrors the existing check in fromHiveColumn. It is fail-fast: we avoid the scenario where we write something to the metastore that we cannot read back.
  • Added new unit tests.
  • Ran the SQL-related unit test suites (hive/test, sql/test, catalyst/test) successfully.

With the fix:

create view x as select struct('a' as `$q`, 1 as b) q;
17/11/28 10:44:55 ERROR SparkSQLDriver: Failed in [create view x as select struct('a' as `$q`, 1 as b) q]
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int>
	at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$getSparkSQLDataType(HiveClientImpl.scala:884)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:906)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:906)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
...
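The fail-fast check can be sketched as follows. This is a minimal sketch assembled from the diff excerpts quoted in this thread (`verifyColumnDataType`, `catalogString`, `CatalystSqlParser.parseDataType`); the enclosing object name and exact placement are illustrative, not the merged code.

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}
import org.apache.spark.sql.types.StructType

// Hypothetical container for the check; in the PR it lives in HiveClientImpl.
object SchemaVerification {
  // Fail fast: if Spark cannot parse a column's catalog type string now,
  // it will not be able to read the table/view back from the metastore later.
  def verifyColumnDataType(schema: StructType): Unit = {
    schema.foreach { field =>
      val typeString = field.dataType.catalogString
      try {
        CatalystSqlParser.parseDataType(typeString)
      } catch {
        case e: ParseException =>
          throw new SparkException(s"Cannot recognize the data type: $typeString", e)
      }
    }
  }
}
```

In the change under review, this check is invoked from HiveClientImpl before create/alter operations persist the schema, which is why `verifyColumnDataType` appears in the stack trace above.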

How was this patch tested?

  • New unit tests have been added.

@hvanhovell, Please review and share your thoughts/comments. Thank you so much.

@hvanhovell (Contributor) commented Nov 14, 2017:

Ok to test

@SparkQA commented Nov 15, 2017:

Test build #83879 has finished for PR 19747 at commit 6267033.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@@ -68,6 +69,48 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
import hiveContext._
import spark.implicits._

test("SPARK-22431: table ctas - illegal nested type") {

@wzhfy (Contributor) commented Nov 15, 2017:

IMHO it would be better to put all the illegal cases together, since they share the same logic except for the SQL statements.

@@ -895,6 +897,18 @@ private[hive] object HiveClientImpl {
Option(hc.getComment).map(field.withComment).getOrElse(field)
}

private def verifyColumnDataType(schema: StructType): Unit = {
schema.map(col => {

@wzhfy (Contributor) commented Nov 15, 2017:

schema.foreach { field =>

@skambha (Contributor, Author) commented Nov 15, 2017:

Thanks @wzhfy for your comments. I have addressed them in the latest commit.

@skambha (Contributor, Author) commented Nov 15, 2017:

I synced up and noticed that some recent changes have gone in that change the ALTER TABLE schema codepath in HiveExternalCatalog. I'll take a look and see what changes might be needed for that.

@SparkQA commented Nov 15, 2017:

Test build #83888 has finished for PR 19747 at commit cdc4a07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@@ -40,6 +40,22 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {

setupTestData()

test("SPARK-22431: table with nested type col with special char") {

@gatorsmile (Member) commented Nov 15, 2017:

Move these two to InMemoryCatalogedDDLSuite

@skambha (Contributor, Author) replied Nov 17, 2017:

Thanks @gatorsmile for your comments. I have addressed them in the latest commit.

@@ -68,6 +69,36 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
import hiveContext._
import spark.implicits._

test("SPARK-22431: illegal nested type") {

@gatorsmile (Member) commented Nov 15, 2017:

Move these to HiveCatalogedDDLSuite

CatalystSqlParser.parseDataType(typeString)
} catch {
case e: ParseException =>
throw new SparkException(s"Cannot recognize the data type: $typeString", e)

@gatorsmile (Member) commented Nov 15, 2017:

-> AnalysisException

@@ -507,6 +508,7 @@ private[hive] class HiveClientImpl(
// these properties are still available to the others that share the same Hive metastore.
// If users explicitly alter these Hive-specific properties through ALTER TABLE DDL, we respect
// these user-specified values.
verifyColumnDataType(table.dataSchema)

@gatorsmile (Member) commented Nov 15, 2017:

Do it in HiveExternalCatalog.verifyColumnNames?

@skambha (Contributor, Author) replied Nov 16, 2017:

Thanks @gatorsmile for the review. I'll incorporate your other comments in my next commit.

In the current codeline, another recent PR changed verifyColumnNames to verifyDataSchema.

The reason I could not put the check in verifyDataSchema ( or the old verifyColumnNames):

  • verifyDataSchema is called at the beginning of the doCreateTable method. But we cannot error out that early in doCreateTable: later in that method we create the datasource table, and if the datasource table cannot be stored in a Hive-compatible format, it falls back to the Spark SQL-specific format, which works fine.
  • For example, if I put the check there, the following CREATE of a datasource table would throw an exception right away, which we do not want:

CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET

@skambha (Contributor, Author) commented Nov 17, 2017:

I have taken care of adding the check in the new HiveClientImpl.alterTableDataSchema as well and have added some new tests.

@SparkQA commented Nov 17, 2017:

Test build #83968 has finished for PR 19747 at commit e5c2cf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Nov 17, 2017:

Test build #83969 has finished for PR 19747 at commit 3be7b47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

withView("v") {
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
assert(spark.sql("SELECT * FROM v").count() == 1L)

@gatorsmile (Member) commented Nov 23, 2017:

Could you check the contents instead of number of row counts?

@gatorsmile (Member) commented Nov 23, 2017:

The same applies to the other test cases

@@ -895,6 +898,19 @@ private[hive] object HiveClientImpl {
Option(hc.getComment).map(field.withComment).getOrElse(field)
}

private def verifyColumnDataType(schema: StructType): Unit = {
schema.foreach(field => {
val typeString = field.dataType.catalogString

@gatorsmile (Member) commented Nov 23, 2017:

catalogString is generated by Spark. It is not related to the restrictions of Hive or to the interaction between Hive and Spark.

See my fix: gatorsmile@bdcb9c8

After applying my fix, you will also need to update the test cases to make the exception types consistent.

@skambha (Contributor, Author) replied Nov 28, 2017:

I have taken your change and incorporated it in the latest commit. Thanks.

@skambha skambha force-pushed the skambha:spark22431 branch from 3be7b47 to a1c8a6d Nov 28, 2017

@skambha (Contributor, Author) commented Nov 28, 2017:

Thanks @gatorsmile for your comments.
I have incorporated them in the latest commit: a1c8a6d

Please take a look. Thanks.

@SparkQA commented Nov 28, 2017:

Test build #84238 has finished for PR 19747 at commit a1c8a6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
spark.sql("ALTER TABLE t3 ADD COLUMNS (newcol2 STRUCT<`col1`:STRING, col2:Int>)")

val df3 = spark.sql("SELECT * FROM t3")
checkAnswer(df3, Nil)

@gatorsmile (Member) commented Nov 28, 2017:

checkAnswer(spark.table("t3"), Nil)

spark.sql("ALTER TABLE t2 ADD COLUMNS (newcol2 STRUCT<`col1`:STRING, col2:Int>)")

val df2 = spark.sql("SELECT * FROM t2")
checkAnswer(df2, Nil)

@gatorsmile (Member) commented Nov 28, 2017:

The same here

checkAnswer(spark.sql("SELECT * FROM v"), Row(Row("a", 1)) :: Nil)

spark.sql("ALTER VIEW v AS SELECT STRUCT('a' AS `b`, 1 AS b) q1")
val df = spark.sql("SELECT * FROM v")

@gatorsmile (Member) commented Nov 28, 2017:

The same here

spark.sql("CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET")
checkAnswer(sql("SELECT * FROM t"), Nil)
spark.sql("CREATE TABLE x (q STRUCT<col1:INT, col2:STRING>, i1 INT)")
checkAnswer(sql("SELECT * FROM x"), Nil)

@gatorsmile (Member) commented Nov 28, 2017:

The same here

test("SPARK-22431: table with nested type") {
withTable("t", "x") {
spark.sql("CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET")
checkAnswer(sql("SELECT * FROM t"), Nil)

@gatorsmile (Member) commented Nov 28, 2017:

The same here

test("SPARK-22431: view with nested type") {
withView("v") {
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
checkAnswer(spark.sql("SELECT * FROM v"), Row(Row("a", 1)) :: Nil)

@gatorsmile (Member) commented Nov 28, 2017:

The same here

spark.sql("CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q")
checkAnswer(sql("SELECT * FROM t"), Row(Row("a", 1)) :: Nil)
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
checkAnswer(sql("SELECT * FROM t"), Row(Row("a", 1)) :: Nil)

@gatorsmile (Member) commented Nov 28, 2017:

The same issues in these two test cases

@gatorsmile (Member) commented Nov 28, 2017:

LGTM except a few minor comments.

@skambha (Contributor, Author) commented Nov 28, 2017:

Thanks @gatorsmile.
I have addressed your comments in the latest commit. Please take a look. Thanks.

@gatorsmile (Member) commented Nov 28, 2017:

LGTM

@SparkQA commented Nov 28, 2017:

Test build #84271 has finished for PR 19747 at commit d2458d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@hvanhovell (Contributor) commented Nov 28, 2017:

LGTM - merging to master. Thanks for working on this!

@asfgit asfgit closed this in a10b328 Nov 28, 2017

@skambha (Contributor, Author) commented Nov 28, 2017:

Great! Thank you @gatorsmile, @hvanhovell, @wzhfy.
