
[Spark-22431][SQL] Ensure that the datatype in the schema for the table/view metadata is parseable by Spark before persisting it #19747

Closed
wants to merge 9 commits

5 participants
@skambha (Contributor) commented Nov 14, 2017:

What changes were proposed in this pull request?

  • JIRA: SPARK-22431 : Creating Permanent view with illegal type

Description:

  • It is possible in Spark SQL to create a permanent view that uses a nested field with an illegal name.
  • For example if we create the following view:
    create view x as select struct('a' as `$q`, 1 as b) q
  • A simple select fails with the following exception:
select * from x;

org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int>
  at org.apache.spark.sql.hive.client.HiveClientImpl$.fromHiveColumn(HiveClientImpl.scala:812)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getTableOption$1$$anonfun$apply$11$$anonfun$7.apply(HiveClientImpl.scala:378)
...

Issue/Analysis: Right now, we can create a view with a schema that cannot be read back by Spark from the Hive metastore. For more details, please see the discussion of the analysis and the proposed fix options in comment 1 and comment 2 on SPARK-22431.

Proposed changes:

  • Fix the Hive table/view codepath to check whether the schema's data types are parseable by Spark before persisting them in the metastore. The change is localized to HiveClientImpl and mirrors the existing check in fromHiveColumn. It is fail-fast: we avoid the scenario where we write something to the metastore that we cannot read back.
  • Added new unit tests.
  • Ran the SQL-related unit test suites (hive/test, sql/test, catalyst/test) successfully.

With the fix:

create view x as select struct('a' as `$q`, 1 as b) q;
17/11/28 10:44:55 ERROR SparkSQLDriver: Failed in [create view x as select struct('a' as `$q`, 1 as b) q]
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$q:string,b:int>
	at org.apache.spark.sql.hive.client.HiveClientImpl$.org$apache$spark$sql$hive$client$HiveClientImpl$$getSparkSQLDataType(HiveClientImpl.scala:884)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:906)
	at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$org$apache$spark$sql$hive$client$HiveClientImpl$$verifyColumnDataType$1.apply(HiveClientImpl.scala:906)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
...
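The fail-fast check can be sketched as follows. This is a minimal sketch assembled from the diff excerpts quoted in this thread (`verifyColumnDataType`, `catalogString`, `CatalystSqlParser.parseDataType`); the enclosing object name and exact placement are illustrative, not the merged code.

```scala
import org.apache.spark.SparkException
import org.apache.spark.sql.catalyst.parser.{CatalystSqlParser, ParseException}
import org.apache.spark.sql.types.StructType

// Hypothetical container for the check; in the PR it lives in HiveClientImpl.
object SchemaVerification {
  // Fail fast: if Spark cannot parse a column's catalog type string now,
  // it will not be able to read the table/view back from the metastore later.
  def verifyColumnDataType(schema: StructType): Unit = {
    schema.foreach { field =>
      val typeString = field.dataType.catalogString
      try {
        CatalystSqlParser.parseDataType(typeString)
      } catch {
        case e: ParseException =>
          throw new SparkException(s"Cannot recognize the data type: $typeString", e)
      }
    }
  }
}
```

In the change under review, this check is invoked from HiveClientImpl before create/alter operations persist the schema, which is why `verifyColumnDataType` appears in the stack trace above.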

How was this patch tested?

  • New unit tests have been added.

@hvanhovell, Please review and share your thoughts/comments. Thank you so much.

@hvanhovell (Contributor) commented Nov 14, 2017:

Ok to test

@SparkQA commented Nov 15, 2017:

Test build #83879 has finished for PR 19747 at commit 6267033.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@@ -68,6 +69,48 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
import hiveContext._
import spark.implicits._

test("SPARK-22431: table ctas - illegal nested type") {

@wzhfy (Contributor) commented Nov 15, 2017:

IMHO it would be better to put all the illegal cases together, since they share the same logic except for the SQL statements.

@@ -895,6 +897,18 @@ private[hive] object HiveClientImpl {
Option(hc.getComment).map(field.withComment).getOrElse(field)
}

private def verifyColumnDataType(schema: StructType): Unit = {
schema.map(col => {

@wzhfy (Contributor) commented Nov 15, 2017:

schema.foreach { field =>

@skambha (Contributor, Author) commented Nov 15, 2017:

Thanks @wzhfy for your comments. I have addressed them in the latest commit.

@skambha (Contributor, Author) commented Nov 15, 2017:

I synced up and noticed that some recent changes have gone in that change the ALTER TABLE schema codepath in HiveExternalCatalog. I'll take a look and see what changes might be needed for that.

@SparkQA commented Nov 15, 2017:

Test build #83888 has finished for PR 19747 at commit cdc4a07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@@ -40,6 +40,22 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {

setupTestData()

test("SPARK-22431: table with nested type col with special char") {

@gatorsmile (Member) commented Nov 15, 2017:

Move these two to InMemoryCatalogedDDLSuite

@skambha (Contributor, Author) replied Nov 17, 2017:

Thanks @gatorsmile for your comments. I have addressed them in the latest commit.

@@ -68,6 +69,36 @@ class SQLQuerySuite extends QueryTest with SQLTestUtils with TestHiveSingleton {
import hiveContext._
import spark.implicits._

test("SPARK-22431: illegal nested type") {

@gatorsmile (Member) commented Nov 15, 2017:

Move these to HiveCatalogedDDLSuite

CatalystSqlParser.parseDataType(typeString)
} catch {
case e: ParseException =>
throw new SparkException(s"Cannot recognize the data type: $typeString", e)

@gatorsmile (Member) commented Nov 15, 2017:

-> AnalysisException

@@ -507,6 +508,7 @@ private[hive] class HiveClientImpl(
// these properties are still available to the others that share the same Hive metastore.
// If users explicitly alter these Hive-specific properties through ALTER TABLE DDL, we respect
// these user-specified values.
verifyColumnDataType(table.dataSchema)

@gatorsmile (Member) commented Nov 15, 2017:

Do it in HiveExternalCatalog.verifyColumnNames?

@skambha (Contributor, Author) replied Nov 16, 2017:

Thanks @gatorsmile for the review. I'll incorporate your other comments in my next commit.

In the current codeline, another recent PR changed verifyColumnNames to verifyDataSchema.

The reason I could not put the check in verifyDataSchema ( or the old verifyColumnNames):

  • verifyDataSchema is called at the beginning of the doCreateTable method. But we cannot error out that early in doCreateTable: later in that method we create the datasource table, and if the datasource table cannot be stored in a Hive-compatible format, it falls back to the Spark SQL-specific format, which works fine.
  • For example, if I put the check there, the following CREATE of a datasource table would throw an exception right away, which we do not want:

CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET

@skambha (Contributor, Author) commented Nov 17, 2017:

I have taken care of adding the check in the new HiveClientImpl.alterTableDataSchema as well and have added some new tests.

@SparkQA commented Nov 17, 2017:

Test build #83968 has finished for PR 19747 at commit e5c2cf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA commented Nov 17, 2017:

Test build #83969 has finished for PR 19747 at commit 3be7b47.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

withView("v") {
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
assert(spark.sql("SELECT * FROM v").count() == 1L)

@gatorsmile (Member) commented Nov 23, 2017:

Could you check the contents instead of number of row counts?

@gatorsmile (Member) commented Nov 23, 2017:

The same applies to the other test cases

@@ -895,6 +898,19 @@ private[hive] object HiveClientImpl {
Option(hc.getComment).map(field.withComment).getOrElse(field)
}

private def verifyColumnDataType(schema: StructType): Unit = {
schema.foreach(field => {
val typeString = field.dataType.catalogString

@gatorsmile (Member) commented Nov 23, 2017:

catalogString is generated by Spark. It is not related to the restrictions of Hive or to the interaction between Hive and Spark.

See my fix: gatorsmile@bdcb9c8

After applying my fix, you will also need to update the test cases to make the exception types consistent.

@skambha (Contributor, Author) replied Nov 28, 2017:

I have taken your change and incorporated it in the latest commit. Thanks.

@skambha skambha force-pushed the skambha:spark22431 branch from 3be7b47 to a1c8a6d Nov 28, 2017

@skambha (Contributor, Author) commented Nov 28, 2017:

Thanks @gatorsmile for your comments.
I have incorporated them in the latest commit: a1c8a6d

Please take a look. Thanks.

@SparkQA commented Nov 28, 2017:

Test build #84238 has finished for PR 19747 at commit a1c8a6d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
spark.sql("ALTER TABLE t3 ADD COLUMNS (newcol2 STRUCT<`col1`:STRING, col2:Int>)")

val df3 = spark.sql("SELECT * FROM t3")
checkAnswer(df3, Nil)

@gatorsmile (Member) commented Nov 28, 2017:

checkAnswer(spark.table("t3"), Nil)

spark.sql("ALTER TABLE t2 ADD COLUMNS (newcol2 STRUCT<`col1`:STRING, col2:Int>)")

val df2 = spark.sql("SELECT * FROM t2")
checkAnswer(df2, Nil)

@gatorsmile (Member) commented Nov 28, 2017:

The same here

checkAnswer(spark.sql("SELECT * FROM v"), Row(Row("a", 1)) :: Nil)

spark.sql("ALTER VIEW v AS SELECT STRUCT('a' AS `b`, 1 AS b) q1")
val df = spark.sql("SELECT * FROM v")

@gatorsmile (Member) commented Nov 28, 2017:

The same here

spark.sql("CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET")
checkAnswer(sql("SELECT * FROM t"), Nil)
spark.sql("CREATE TABLE x (q STRUCT<col1:INT, col2:STRING>, i1 INT)")
checkAnswer(sql("SELECT * FROM x"), Nil)

@gatorsmile (Member) commented Nov 28, 2017:

The same here

test("SPARK-22431: table with nested type") {
withTable("t", "x") {
spark.sql("CREATE TABLE t(q STRUCT<`$a`:INT, col2:STRING>, i1 INT) USING PARQUET")
checkAnswer(sql("SELECT * FROM t"), Nil)

@gatorsmile (Member) commented Nov 28, 2017:

The same here

test("SPARK-22431: view with nested type") {
withView("v") {
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
checkAnswer(spark.sql("SELECT * FROM v"), Row(Row("a", 1)) :: Nil)

@gatorsmile (Member) commented Nov 28, 2017:

The same here

spark.sql("CREATE VIEW t AS SELECT STRUCT('a' AS `$a`, 1 AS b) q")
checkAnswer(sql("SELECT * FROM t"), Row(Row("a", 1)) :: Nil)
spark.sql("CREATE VIEW v AS SELECT STRUCT('a' AS `a`, 1 AS b) q")
checkAnswer(sql("SELECT * FROM t"), Row(Row("a", 1)) :: Nil)

@gatorsmile (Member) commented Nov 28, 2017:

The same issues in these two test cases

@gatorsmile (Member) commented Nov 28, 2017:

LGTM except a few minor comments.

@skambha (Contributor, Author) commented Nov 28, 2017:

Thanks @gatorsmile.
I have addressed your comments in the latest commit. Please take a look. Thanks.

@gatorsmile (Member) commented Nov 28, 2017:

LGTM

@SparkQA commented Nov 28, 2017:

Test build #84271 has finished for PR 19747 at commit d2458d4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@hvanhovell (Contributor) commented Nov 28, 2017:

LGTM - merging to master. Thanks for working on this!

@asfgit asfgit closed this in a10b328 Nov 28, 2017

@skambha (Contributor, Author) commented Nov 28, 2017:

Great! Thank you @gatorsmile, @hvanhovell, @wzhfy.
