
[Spark-22431][SQL] Ensure that the datatype in the schema for the table/view metadata is parseable by Spark before persisting it #19747

Closed
wants to merge 9 commits into from

Conversation

skambha
Contributor

@skambha skambha commented Nov 14, 2017

What changes were proposed in this pull request?

  • JIRA: SPARK-22431 : Creating Permanent view with illegal type

Description:

  • It is possible in Spark SQL to create a permanent view that uses a nested field with an illegal name.
  • For example if we create the following view:
    create view x as select struct('a' as `$q`, 1 as b) q
  • A simple select fails with the following exception:
select * from x;

org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int>
  at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
...

Issue/Analysis: Right now, we can create a view with a schema that cannot be read back by Spark from the Hive metastore. For more details, see the discussion of the analysis and proposed fix options in comments 1 and 2 on SPARK-22431.

Proposed changes:

  • Fix the Hive table/view codepath to check whether the schema's data types are parseable by Spark before persisting them in the metastore. The change is localized to HiveClientImpl and mirrors the check in fromHiveColumn. This is fail-fast: we avoid writing something to the metastore that we cannot read back.
  • Added new unit tests.
  • Ran the SQL-related unit test suites (hive/test, sql/test, catalyst/test) successfully.

With the fix:

create view x as select struct('a' as `$q`, 1 as b) q;
17/11/28 10:44:55 ERROR SparkSQLDriver: Failed in [create view x as select struct('a' as `$q`, 1 as b) q]
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int>
	at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$getSparkSQLDataType(HiveClientImpl.scala:884)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:906)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:906)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
...
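The fail-fast idea above can be illustrated with a simplified, self-contained sketch. The names `SchemaCheck`, `firstIllegalField`, and `verify` are hypothetical, not Spark APIs; the actual patch calls `CatalystSqlParser.parseDataType` on each column's `catalogString` inside `HiveClientImpl.verifyColumnDataType`. This toy version only handles flat `struct<...>` type strings and approximates the restriction with a word-character pattern:

```scala
// Hypothetical, simplified stand-in for the fail-fast schema check.
// The real patch parses each column's catalogString with
// CatalystSqlParser.parseDataType; here we approximate the restriction
// by requiring word characters in nested field names, and we only
// handle flat struct<...> type strings (no nesting).
object SchemaCheck {
  // Returns the first offending nested field name, if any, in a type
  // string such as "struct<$q:string,b:int>".
  def firstIllegalField(typeString: String): Option[String] = {
    if (!typeString.startsWith("struct<") || !typeString.endsWith(">")) {
      None // not a struct type; nothing to verify in this sketch
    } else {
      val inner = typeString.stripPrefix("struct<").stripSuffix(">")
      inner.split(",").map(_.split(":")(0).trim)
        .find(name => !name.matches("[a-zA-Z0-9_]+"))
    }
  }

  // Fail fast before persisting, instead of failing later on read.
  def verify(typeString: String): Unit =
    firstIllegalField(typeString).foreach { bad =>
      throw new IllegalArgumentException(
        s"Cannot recognize type string: $typeString (illegal field `$bad`)")
    }
}
```

With this sketch, the type string produced by the failing view, `struct<$q:string,b:int>`, is rejected at write time instead of breaking a later `SELECT`.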

How was this patch tested?

  • New unit tests have been added.

@hvanhovell, Please review and share your thoughts/comments. Thank you so much.

@hvanhovell
Contributor

Ok to test

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83879 has finished for PR 19747 at commit 6267033.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -68,6 +69,48 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
import hiveContext._
import spark.implicits._

test("SPARK-22431: table ctas - illegal nested type") {
Contributor

@wzhfy wzhfy Nov 15, 2017

IMHO it would be better to put all illegal cases together since they share the same logic except the sql statements.

@@ -895,6 +897,18 @@ private[hive] object HiveClientImpl {
Option(hc.getComment).map(field.withComment).getOrElse(field)
}

private def verifyColumnDataType(schema: StructType): Unit = {
schema.map(col => {
Contributor

@wzhfy wzhfy Nov 15, 2017

schema.foreach { field =>

@skambha
Contributor Author

skambha commented Nov 15, 2017

Thanks @wzhfy for your comments. I have addressed them in the latest commit.

@skambha
Contributor Author

skambha commented Nov 15, 2017

I synced up and noticed some recent changes have gone in that change the alter-table-schema codepath in HiveExternalCatalog. I'll take a look and see what changes might be needed for that.

@SparkQA

SparkQA commented Nov 15, 2017

Test build #83888 has finished for PR 19747 at commit cdc4a07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -40,6 +40,22 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {

setupTestData()

test("SPARK-22431: table with nested type col with special char") {
Member

Move these two to InMemoryCatalogedDDLSuite

Contributor Author

Thanks @gatorsmile for your comments. I have addressed them in the latest commit.

@@ -68,6 +69,36 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
import hiveContext._
import spark.implicits._

test("SPARK-22431: illegal nested type") {
Member

Move these to HiveCatalogedDDLSuite

CatalystSqlParser.parseDataType(typeString)
} catch {
case e: ParseException =>
throw new SparkException(s"Cannot recognize the data type: $typeString", e)
Member

-> AnalysisException

@@ -507,6 +508,7 @@ private[hive] class HiveClientImpl(
// these properties are still available to the others that share the same Hive metastore.
// If users explicitly alter these Hive-specific properties through ALTER TABLE DDL, we respect
// these user-specified values.
verifyColumnDataType(table.dataSchema)
Member

Do it in HiveExternalCatalog.verifyColumnNames?

Contributor Author

@skambha skambha Nov 16, 2017

Thanks @gatorsmile for the review. I'll incorporate your other comments in my next commit.

In the current codeline, another recent PR changed verifyColumnNames to verifyDataSchema.

The reason I could not put the check in verifyDataSchema ( or the old verifyColumnNames):

  • verifyDataSchema is called at the beginning of the doCreateTable method, but we cannot error out that early: later in that method we create the datasource table, and if the datasource table cannot be stored in a Hive-compatible format, it falls back to Spark SQL's own format, which works fine.
  • For example, if I put the check there, the following datasource table creation would throw an exception right away, which we do not want:

CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET
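The ordering constraint described above can be sketched as follows. This is a hypothetical simplification (`persistDatasourceTable`, `Persisted`, and `parseable` are illustrative names, not Spark APIs): the verification must only guard the Hive-compatible path, because a datasource table whose schema Hive cannot represent is still valid and simply falls back to Spark's own metadata format.

```scala
// Hypothetical sketch of why the check cannot run at the top of
// doCreateTable: datasource tables whose schemas Hive cannot represent
// are still valid -- they just fall back to Spark's own metadata format.
sealed trait Persisted
case object HiveCompatible extends Persisted
case object SparkSqlSpecific extends Persisted

// `parseable` stands in for the real type-string verification.
def persistDatasourceTable(typeString: String,
                           parseable: String => Boolean): Persisted =
  if (parseable(typeString)) {
    HiveCompatible    // schema round-trips through the Hive metastore
  } else {
    SparkSqlSpecific  // fall back instead of rejecting the CREATE TABLE
  }
```

With a checker that rejects `$` in field names, the `STRUCT<$a:INT, ...>` table above lands on the fallback path rather than raising an error, which is exactly why an early check in verifyDataSchema would wrongly fail this CREATE TABLE.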

@skambha
Contributor Author

skambha commented Nov 17, 2017

I have taken care of adding the check in the new HiveClientImpl.alterTableDataSchema as well and have added some new tests.

@SparkQA

SparkQA commented Nov 17, 2017

Test build #83968 has finished for PR 19747 at commit e5c2cf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 17, 2017

Test build #83969 has finished for PR 19747 at commit 3be7b47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


withView("v") {
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
assert(spark.sql("SELECT * FROM v").count() == 1L)
Member

Could you check the contents instead of number of row counts?

Member

The same applies to the other test cases

@@ -895,6 +898,19 @@ private[hive] object HiveClientImpl {
Option(hc.getComment).map(field.withComment).getOrElse(field)
}

private def verifyColumnDataType(schema: StructType): Unit = {
schema.foreach(field => {
val typeString = field.dataType.catalogString
Member

@gatorsmile gatorsmile Nov 23, 2017

catalogString is generated by Spark. It is not related to the restriction of Hive or the interaction between Hive and Spark

See my fix: gatorsmile@bdcb9c8

After applying my fix, you also need to update the test cases to make the exception types consistent.

Contributor Author

I have taken your change and incorporated it in the latest commit. Thanks.

@skambha
Contributor Author

skambha commented Nov 28, 2017

Thanks @gatorsmile for your comments.
I have incorporated them in the latest commit: a1c8a6d

Please take a look. Thanks.

@SparkQA

SparkQA commented Nov 28, 2017

Test build #84238 has finished for PR 19747 at commit a1c8a6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

spark.sql("ALTER TABLE t3 ADD COLUMNS (newcol2 STRUCT<`col1`:STRING, col2:Int>)")

val df3 = spark.sql("SELECT * FROM t3")
checkAnswer(df3, Nil)
Member

checkAnswer(spark.table("t3"), Nil)

spark.sql("ALTER TABLE t2 ADD COLUMNS (newcol2 STRUCT<`col1`:STRING, col2:Int>)")

val df2 = spark.sql("SELECT * FROM t2")
checkAnswer(df2, Nil)
Member

The same here

checkAnswer(spark.sql("SELECT * FROM v"), Row(Row("a", 1)) :: Nil)

spark.sql("ALTER VIEW v AS SELECT STRUCT('a' AS `b`, 1 AS b) q1")
val df = spark.sql("SELECT * FROM v")
Member

The same here

spark.sql("CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET")
checkAnswer(sql("SELECT * FROM t"), Nil)
spark.sql("CREATE TABLE x (q STRUCT<col1:INT, col2:STRING>, i1 INT)")
checkAnswer(sql("SELECT * FROM x"), Nil)
Member

The same here

test("SPARK-22431: table with nested type") {
withTable("t", "x") {
spark.sql("CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET")
checkAnswer(sql("SELECT * FROM t"), Nil)
Member

The same here

test("SPARK-22431: view with nested type") {
withView("v") {
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
checkAnswer(spark.sql("SELECT * FROM v"), Row(Row("a", 1)) :: Nil)
Member

The same here

spark.sql("CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q")
checkAnswer(sql("SELECT * FROM t"), Row(Row("a", 1)) :: Nil)
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
checkAnswer(sql("SELECT * FROM t"), Row(Row("a", 1)) :: Nil)
Member

The same issues in these two test cases

@gatorsmile
Member

LGTM except a few minor comments.

@skambha
Contributor Author

skambha commented Nov 28, 2017

Thanks @gatorsmile.
I have addressed your comments in the latest commit. Please take a look. Thanks.

@gatorsmile
Member

LGTM

@SparkQA

SparkQA commented Nov 28, 2017

Test build #84271 has finished for PR 19747 at commit d2458d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hvanhovell
Contributor

LGTM - merging to master. Thanks for working on this!

@asfgit asfgit closed this in a10b328 Nov 28, 2017
@skambha
Contributor Author

skambha commented Nov 28, 2017

Great! Thank you @gatorsmile, @hvanhovell, @wzhfy
