[SPARK-21912][SQL] ORC/Parquet table should not create invalid column names #19124

Closed
wants to merge 12 commits into from

Conversation

@dongjoon-hyun (Member) commented Sep 4, 2017

What changes were proposed in this pull request?

Currently, users hit job abortions when creating or altering ORC/Parquet tables with invalid column names. We should prevent this by raising an AnalysisException that guides users to rename the columns with aliases, as Parquet data source tables already do.

BEFORE

scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:28:21 ERROR Utils: Aborting task
java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<a b:int>' but ' ' is found.
17/09/04 13:28:21 ERROR FileFormatWriter: Job job_20170904132821_0001 aborted.
17/09/04 13:28:21 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
org.apache.spark.SparkException: Task failed while writing rows.

AFTER

scala> sql("CREATE TABLE orc1 USING ORC AS SELECT 1 `a b`")
17/09/04 13:27:40 ERROR CreateDataSourceTableAsSelectCommand: Failed to write to table orc1
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;
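For reference, the Parquet data source already enforces this kind of check. A minimal sketch of that blacklist-style validation, assuming the character set quoted in the error message above (the real logic lives in ParquetSchemaConverter.checkFieldName and may differ in detail):

    import org.apache.spark.sql.AnalysisException

    def checkFieldName(name: String): Unit = {
      // " ,;{}()\n\t=" are the characters rejected in Parquet field names.
      if (name.matches(".*[ ,;{}()\n\t=].*")) {
        throw new AnalysisException(
          s"""Attribute name "$name" contains invalid character(s) among " ,;{}()\\n\\t=". """ +
            "Please use alias to rename it.")
      }
    }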

How was this patch tested?

Passes Jenkins with a new test case.

@SparkQA commented Sep 4, 2017

Test build #81391 has finished for PR 19124 at commit 808dfe0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -169,6 +171,16 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
}
}
}

private def checkFieldName(name: String): Unit = {
// ,;{}()\n\t= and space are special characters in ORC schema
@tejasapatil (Contributor)

Is this an exhaustive list? E.g., it looks like ? is not allowed either. Given that the underlying lib (ORC) can evolve to support / not support certain chars, it's safer to rely on some method rather than coming up with a blacklist. Can you simply call TypeInfoUtils.getTypeInfoFromTypeString or any related method which would do this check?

Caused by: java.lang.IllegalArgumentException: Error: : expected at the position 8 of 'struct<i?:int,j:int,k:string>' but '?' is found.
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:360)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:331)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:483)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:305)
  at org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:770)
  at org.apache.spark.sql.hive.orc.OrcSerializer.<init>(OrcFileFormat.scala:194)
  at org.apache.spark.sql.hive.orc.OrcOutputWriter.<init>(OrcFileFormat.scala:231)
  at org.apache.spark.sql.hive.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:91)
...
...
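In code, the suggestion amounts to something like the sketch below, which delegates the validation to Hive's own type-string parser instead of a hand-maintained blacklist (assumes hive-serde is on the classpath; this is not the code that was eventually merged):

    import org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils
    import org.apache.spark.sql.AnalysisException

    def checkFieldNameViaHiveParser(name: String): Unit = {
      try {
        // The Hive type parser throws IllegalArgumentException for names it cannot handle.
        TypeInfoUtils.getTypeInfoFromTypeString(s"struct<$name:int>")
      } catch {
        case _: IllegalArgumentException =>
          throw new AnalysisException(
            s"""Column name "$name" contains invalid character(s). Please use alias to rename it.""")
      }
    }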

Member Author

Thank you for the review, @tejasapatil!
That's a good idea. Right, it's not an exhaustive list. I'll update the PR.

@SparkQA commented Sep 5, 2017

Test build #81394 has finished for PR 19124 at commit a738943.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} catch {
case _: IllegalArgumentException =>
throw new AnalysisException(
s"""Attribute name "$name" contains invalid character(s).
Member

Nit: Attribute -> Column

Member Author

Thank you for the review. Sure.

Member Author

I agree with you that column is more accurate here. Previously, I borrowed this from ParquetSchemaConverter:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaConverter.scala#L565-L572

withTable("orc1") {
Seq(" ", "?", ",", ";", "{", "}", "(", ")", "\n", "\t", "=").foreach { name =>
val m = intercept[AnalysisException] {
sql(s"CREATE TABLE orc1 USING ORC AS SELECT 1 `column$name`")
Member

This is CTAS. How about CREATE TABLE?

Member Author

Yep. I'll check the code path, too.

Member Author

It seems to be the same situation as with Parquet: CREATE TABLE passes, but SELECT raises an exception.

scala> sql("CREATE TABLE parquet1(`a b` int) using parquet")
res1: org.apache.spark.sql.DataFrame = []

scala> sql("select * from parquet1").show
org.apache.spark.sql.AnalysisException: Attribute name "a b" contains invalid character(s) among " ,;{}()\n\t=". Please use alias to rename it.;

Member Author

Do we need to add a datasource-specific check in createDataSourceTables for Parquet and ORC?

Member Author

@gatorsmile, I tried the following in CreateDataSourceTableCommand. We can add a check for ParquetFileFormat, but not for OrcFileFormat. Should I change the PR title and scope instead?

    table.provider.get.toLowerCase match {
      case "parquet" =>
        dataSource.schema.map(_.name).foreach(
          org.apache.spark.sql.execution.datasources.parquet.ParquetSchemaConverter.checkFieldName)
      case "orc" =>
        dataSource.schema.map(_.name).foreach(
          org.apache.spark.sql.hive.OrcRelation.checkFieldName)
    }

Member Author

I'll try another way.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21912][SQL] Creating ORC datasource table should check invalid column names [SPARK-21912][SQL] Creating ORC/Parquet datasource table should check invalid column names Sep 5, 2017
@@ -85,6 +87,13 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
}
}

table.provider.get.toLowerCase match {
Member Author

We are able to check here for a normal CREATE TABLE.

@gatorsmile (Member) commented Sep 5, 2017

This just covers CREATE DATA SOURCE TABLES. How about CREATE HIVE SERDE TABLES?

Member Author

Ya. That's a good point!

Member Author

@gatorsmile, I have a question: do we have an issue with Hive SERDE tables?

CREATE TABLE t(`a b` INT) USING hive OPTIONS (fileFormat 'parquet')

I thought the Hive schema is preferred over the Parquet schema.

Member

What is the Hive schema? What is the Parquet schema?

Could you insert rows into the table you created?

Member Author

Oh, I see. It fails.

scala> sql("set spark.sql.hive.convertMetastoreParquet=false")
res5: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> sql("INSERT INTO t VALUES(1)")
17/09/05 11:34:03 ERROR Utils: Aborting task
org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: field ended by ';': expected ';' but got 'b' at line 1:   optional int32 a b
	at org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
	at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:123)


import org.apache.spark.sql.AnalysisException

private[sql] object OrcFileFormat {
Member Author

Fortunately, we already have the new ORC dependency.
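Assembled from the fragments quoted in this thread (the TypeDescription.fromString helper in the new OrcFileFormat object and the catch block shown earlier), the ORC-side check roughly takes the following shape; the object name here is hypothetical and this is a sketch rather than the exact merged code:

    import org.apache.orc.TypeDescription
    import org.apache.spark.sql.AnalysisException
    import org.apache.spark.sql.types.StructType

    object OrcFieldNameCheck {  // mirrors the OrcFileFormat helper added in this PR
      private def checkFieldName(name: String): Unit = {
        try {
          // Let the ORC library's own schema parser decide whether the name is legal.
          TypeDescription.fromString(s"struct<$name:int>")
        } catch {
          case _: IllegalArgumentException =>
            throw new AnalysisException(
              s"""Column name "$name" contains invalid character(s). Please use alias to rename it.""")
        }
      }

      def checkFieldNames(schema: StructType): Unit = schema.fieldNames.foreach(checkFieldName)
    }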

@SparkQA commented Sep 5, 2017

Test build #81403 has finished for PR 19124 at commit cd539fe.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 5, 2017

Test build #81404 has finished for PR 19124 at commit 66aff54.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 5, 2017

Test build #81405 has finished for PR 19124 at commit aa78eaf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author) commented Sep 5, 2017

Hi, @gatorsmile.

I can fix it in most cases, but we have the following test case.

-- !query 2
CREATE TABLE showcolumn1 (col1 int, `col 2` int) USING parquet
-- !query 2 schema
struct<>
-- !query 2 output

In the case of Parquet, CREATE TABLE is currently allowed, while CTAS and SELECT raise AnalysisException. How should I proceed with this?

@dongjoon-hyun (Member Author)

I just updated the output answer file to show the result for review only.

@SparkQA commented Sep 5, 2017

Test build #81408 has finished for PR 19124 at commit 0bf3b43.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -23,7 +23,8 @@ CREATE TABLE showcolumn1 (col1 int, `col 2` int) USING parquet
-- !query 2 schema
Member

Please change it to JSON

Member Author

Sure, thank you for the guidance!

@@ -145,15 +146,27 @@ class DetermineTableStats(session: SparkSession) extends Rule[LogicalPlan] {
* `PreprocessTableInsertion`.
*/
object HiveAnalysis extends Rule[LogicalPlan] {
Member Author

I thought HiveAnalysis was the best place to check this.
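As a rough sketch of that idea, a validation-only rule could walk the plan and validate column names for any Hive-serde CREATE TABLE or CTAS; the actual change hooks the same DDLUtils check into the existing HiveAnalysis cases, so the rule name below is hypothetical:

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.execution.command.DDLUtils
    import org.apache.spark.sql.execution.datasources.CreateTable

    object CheckHiveTableFieldNames extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = {
        plan.foreach {
          // For Hive-serde CREATE TABLE / CTAS, reject invalid column names early,
          // using the helper this PR adds to DDLUtils.
          case CreateTable(tableDesc, _, _) if DDLUtils.isHiveTable(tableDesc) =>
            DDLUtils.checkDataSchemaFieldNames(tableDesc)
          case _ =>
        }
        plan
      }
    }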

@dongjoon-hyun (Member Author)

Thank you, @gatorsmile. The PR has become much more general.
When I started this PR, I didn't realize that current Spark had so many missing cases like this.

-- !query 2 schema
struct<>
-- !query 2 output



-- !query 3
CREATE TABLE showcolumn2 (price int, qty int, year int, month int) USING parquet partitioned by (year, month)
Member

We do not need to change this.

Member Author

Yep. It's reverted.

@gatorsmile (Member)

That is normal. When we find a bug, it usually means it was missed in more than one place. Thus, we need to check all the other code paths that could trigger it.

@@ -85,6 +88,14 @@ case class CreateDataSourceTableCommand(table: CatalogTable, ignoreIfExists: Boo
}
}

table.provider.get.toLowerCase(Locale.ROOT) match {
Member

We should do it in DataSourceAnalysis

Member Author

Thanks. Right.

@@ -83,6 +83,8 @@ class OrcFileFormat extends FileFormat with DataSourceRegister with Serializable
classOf[MapRedOutputFormat[_, _]])
}

org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.checkFieldNames(dataSchema)
Member

Do we still need this when we move the check to DataSourceAnalysis?

Member Author

I see. I'll check this and remove it. Maybe we can remove the similar logic from ParquetFileFormat, too.

@dongjoon-hyun (Member Author)

For that, no. It hasn't been considered yet, like the other code path.

@gatorsmile (Member)

Could this PR cover this scenario?

@SparkQA commented Sep 6, 2017

Test build #81432 has finished for PR 19124 at commit 8ee87dd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

I created SPARK-21929 for "Support ALTER TABLE table_name ADD COLUMNS(..) for ORC data source".

For Parquet ALTER TABLE, yes. I think I can include that here.
But as for the PR title, I'm not sure; it's not clear because the coverage is partial.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21912][SQL] Creating ORC/Parquet datasource table should check invalid column names [SPARK-21912][SQL] ORC/Parquet datasource table should check invalid column names Sep 6, 2017
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21912][SQL] ORC/Parquet datasource table should check invalid column names [SPARK-21912][SQL] ORC/Parquet table should create invalid column names Sep 6, 2017
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-21912][SQL] ORC/Parquet table should create invalid column names [SPARK-21912][SQL] ORC/Parquet table should not create invalid column names Sep 6, 2017
@@ -206,6 +206,9 @@ case class AlterTableAddColumnsCommand(
reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
conf.caseSensitiveAnalysis)

val newDataSchema = StructType(catalogTable.dataSchema ++ columns)
DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))
Member Author

For this command, it's not easy to get CatalogTable at DataSourceStrategy.

Member

    val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
    val newSchema = catalogTable.schema.copy(fields = reorderedSchema.toArray)

    SchemaUtils.checkColumnNameDuplication(
      reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
      conf.caseSensitiveAnalysis)
    DDLUtils.checkFieldNames(catalogTable.copy(schema = newSchema))

    catalog.alterTableSchema(table, newSchema)

Member Author

Actually, excluding the partition columns was intentional.
Maybe I used a misleading PR title and description here.
So far, I have checked dataSchema only. I think partition columns are okay because they are not part of the Parquet/ORC file schema.

Member Author

Is it okay to use the following?

    val reorderedSchema = catalogTable.dataSchema ++ columns ++ catalogTable.partitionSchema
    val newDataSchema = StructType(catalogTable.dataSchema ++ columns)

    SchemaUtils.checkColumnNameDuplication(
      reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
      conf.caseSensitiveAnalysis)
    DDLUtils.checkFieldNames(catalogTable.copy(schema = newDataSchema))

    catalog.alterTableSchema(
      table, catalogTable.schema.copy(fields = reorderedSchema.toArray))

Member Author

Sorry, I found that your code is better. I'll update it like yours.

}

// TODO: After SPARK-21929, we need to check ORC, too.
Seq("PARQUET").foreach { source =>
Member Author

I added only a Parquet test case due to SPARK-21929.

@SparkQA commented Sep 6, 2017

Test build #81435 has finished for PR 19124 at commit c6e9ab6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private[sql] object OrcFileFormat {
private def checkFieldName(name: String): Unit = {
try {
TypeDescription.fromString(s"struct<$name:int>")
Member

parseName doesn't look public though... I don't like this line either, but I could not think of another alternative for now.

@viirya (Member) commented Sep 6, 2017

Oh, right, I forgot that it is Java...

Member Author

Yep. I agree that it's a little ugly now.

} else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde) {
ParquetSchemaConverter.checkFieldNames(table.dataSchema)
} else {
table.provider.get.toLowerCase(Locale.ROOT) match {
Member

table.provider could be None in the previous versions of Spark. Thus, .get is risky.


private[sql] def checkFieldNames(table: CatalogTable): Unit = {
val serde = table.storage.serde
if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
Member

This way is not right. Let's use your previous way with a foreach loop:

    table.provider.foreach {
      _.toLowerCase(Locale.ROOT) match {
        case "hive" =>

Member Author

Yep!

val serde = table.storage.serde
if (serde == HiveSerDe.sourceToSerDe("orc").get.serde) {
OrcFileFormat.checkFieldNames(table.dataSchema)
} else if (serde == HiveSerDe.sourceToSerDe("parquet").get.serde) {
Member

We could have different Parquet serdes, for example parquet.hive.serde.ParquetHiveSerDe and org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe. How about ORC?

Member Author

AFAIK, it's only org.apache.hadoop.hive.ql.io.orc.OrcSerde. I checked again whether Apache ORC 1.4.0 has a renamed one under the hive-storage API, but it doesn't.

For Parquet, I'll handle that too.

@dongjoon-hyun (Member Author)

Oh, thank you for the review, @viirya, @HyukjinKwon, and @gatorsmile!
I'll follow up on your comments!

@@ -848,4 +851,19 @@ object DDLUtils {
}
}
}

private[sql] def checkFieldNames(table: CatalogTable): Unit = {
Member Author

I'll rename this to checkDataSchemaFieldNames.

@SparkQA commented Sep 6, 2017

Test build #81443 has finished for PR 19124 at commit 46847f8.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member Author)

Retest this please

@SparkQA commented Sep 6, 2017

Test build #81445 has finished for PR 19124 at commit 46847f8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SchemaUtils.checkColumnNameDuplication(
reorderedSchema.map(_.name), "in the table definition of " + table.identifier,
conf.caseSensitiveAnalysis)
DDLUtils.checkDataSchemaFieldNames(catalogTable.copy(schema = newSchema))
Member

newSchema also contains the partition schema. How about the partition schema? Do we have the same limits on it?

Member Author

It's okay. Inside checkDataSchemaFieldNames, we only use table.dataSchema, like the following:

ParquetSchemaConverter.checkFieldNames(table.dataSchema)

For the partition columns, we have been allowing special characters.
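To illustrate why this only constrains data columns, here is a tiny sketch with hypothetical column names: CatalogTable keeps partition columns at the end of its schema, and dataSchema excludes them, so a partition column like `part col` never reaches the Parquet/ORC name check.

    import org.apache.spark.sql.types._

    val fullSchema = new StructType()
      .add("price", IntegerType)
      .add("part col", StringType)   // hypothetical partition column containing a space
    val partitionColumnNames = Seq("part col")

    // Mirrors what CatalogTable.dataSchema returns: the schema without the partition columns.
    val dataSchema = StructType(fullSchema.filterNot(f => partitionColumnNames.contains(f.name)))

    assert(dataSchema.fieldNames.sameElements(Array("price")))
    // Only dataSchema is handed to ParquetSchemaConverter.checkFieldNames /
    // OrcFileFormat.checkFieldNames, so the partition column name is never validated there.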

Member

Could you add test cases to ensure that partitioning columns with special characters work?

Member Author

I think this PR passes the above two test cases, too.

@gatorsmile (Member)

LGTM

@gatorsmile (Member)

Thanks! Merged to master.

@asfgit asfgit closed this in eea2b87 Sep 7, 2017
@dongjoon-hyun (Member Author)

@gatorsmile, thank you for your help! This PR was almost entirely made by you.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-21912 branch September 7, 2017 05:28
@dongjoon-hyun (Member Author)

Thank you for reviewing and helping with this PR, @tejasapatil, @viirya, and @HyukjinKwon, too!
