
[SPARK-21144][SQL][BRANCH-2.2] Check column name duplication in read/write paths #18356

Closed (2 commits)

Conversation

@maropu (Member) commented Jun 20, 2017

What changes were proposed in this pull request?

This PR fixes unexpected results when the data schema and the partition schema have duplicate columns:

import org.apache.hadoop.fs.Path

// Both the data column and the partition directory are named "foo".
withTempPath { dir =>
  val basePath = dir.getCanonicalPath
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString)
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString)
  spark.read.parquet(basePath).show()
}

+---+
|foo|
+---+
|  1|
|  1|
|  a|
|  a|
|  1|
|  a|
+---+

This patch adds code that checks for column name duplication when reading from/writing to files. It comes from #17758 and targets the fix at v2.2.
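For reference, a minimal sketch of what such a check could look like, inferred from the call sites and error message visible elsewhere in this PR; the helper name and exact message format here are assumptions, not the final patch:

import org.apache.spark.sql.AnalysisException

// Hedged sketch: reject duplicate column names, honoring the session's
// case-sensitivity setting (names are lower-cased when case-insensitive).
def checkColumnNameDuplication(
    columnNames: Seq[String],
    colType: String,
    caseSensitiveAnalysis: Boolean): Unit = {
  val names = if (caseSensitiveAnalysis) columnNames else columnNames.map(_.toLowerCase)
  if (names.distinct.length != names.length) {
    val duplicateColumns = names.groupBy(identity).collect {
      case (x, ys) if ys.length > 1 => s"`$x`"
    }
    throw new AnalysisException(
      s"Found duplicate column(s) $colType: ${duplicateColumns.mkString(", ")}")
  }
}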

How was this patch tested?

Created a new test suite, SchemaUtilsSuite, and added tests to existing suites: DataFrameSuite, DDLSuite, JDBCWriteSuite, DataFrameReaderWriterSuite, HiveMetastoreCatalog, and HiveDDLSuite.

@maropu (Member, Author) commented Jun 20, 2017

cc: @gatorsmile

@maropu (Member, Author) commented Jun 20, 2017

@gatorsmile This PR includes all the changes from #17758, but did you originally mean it should include only the part that fixes this issue? My bad, it seems I misunderstood. I'll look into the code and narrow it down to this fix.

SparkQA commented Jun 20, 2017

Test build #78268 has finished for PR 18356 at commit 32f0130.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else {
  schema.map(_.name.toLowerCase)
}
checkDuplication(columnNames, "table definition of " + table.identifier)
Reviewer (Member):
Could you revert all the unrelated changes?

maropu (Author):
ok

@gatorsmile (Member) commented Jun 20, 2017

To avoid any potential issue, could you revert all the unrelated changes?

@@ -181,6 +182,10 @@ case class DataSource(
      throw new AnalysisException(
        s"Unable to infer schema for $format. It must be specified manually.")
    }

    SchemaUtils.checkSchemaColumnNameDuplication(
      dataSchema, "the datasource", sparkSession.sessionState.conf.caseSensitiveAnalysis)
Reviewer (Member):
This is the change we need for 2.2

Reviewer (Member):
Actually, it should be dataSchema + partitionSchema.

We also need to issue a meaningful error message. Users can still bypass the error by manually specifying the schema.

maropu (Author):
yea, I'll update soon. Thanks!
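
For context, the bypass mentioned above might look like the following hypothetical snippet, reusing basePath from the repro in the description; an explicit schema skips inference and, with it, the new check:

import org.apache.spark.sql.types.{LongType, StructType}

// Hypothetical illustration: spark.read.schema(...) bypasses schema inference,
// so the duplication check in the inference path never runs.
val userSchema = new StructType().add("foo", LongType)
spark.read.schema(userSchema).parquet(basePath).show()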

Seq((true, ("a", "a")), (false, ("aA", "Aa"))).foreach { case (caseSensitive, (c0, c1)) =>
  withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive.toString) {
    withTempDir { src =>
      Seq(1).toDF(c0).write.mode("overwrite").json(s"$src/$c1=1")
Reviewer (Member):
Hi, @maropu. It seems we can merge the JSON and Parquet test cases into one by using format(...) instead of json() or parquet().

Reviewer (Member):
For spark.read.json, use spark.read.format(...).load instead.

maropu (Author):
oh, thanks! I'll update
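
A sketch of the merged test the reviewers are suggesting, combining the fragments above; the format list and the expected message are taken from elsewhere in this PR, so treat this as illustrative rather than the final test:

Seq("json", "parquet").foreach { format =>
  Seq((true, ("a", "a")), (false, ("aA", "Aa"))).foreach { case (caseSensitive, (c0, c1)) =>
    withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive.toString) {
      withTempDir { src =>
        // Write data column c0 under a partition directory named c1, so the
        // data schema and the partition schema collide (up to case).
        Seq(1).toDF(c0).write.mode("overwrite").format(format).save(s"$src/$c1=1")
        val e = intercept[AnalysisException] {
          spark.read.format(format).load(src.toString)
        }
        assert(e.getMessage.contains("Found duplicate column(s) in the datasource: "))
      }
    }
  }
}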

@maropu (Member, Author) commented Jun 21, 2017

@gatorsmile Is this test valid? https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala#L45 It fails when this PR is applied because dataSchema and partSchema overlap.

SparkQA commented Jun 21, 2017

Test build #78345 has finished for PR 18356 at commit c77d932.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp (Contributor) commented:

test this please

checkColumnNameDuplication(
  (dataSchema ++ partitionSchema).map(_.name),
  "in the datasource",
  sparkSession.sessionState.conf.caseSensitiveAnalysis)
Reviewer (Member):
Since the new RC of 2.2 is out, how about submitting the PR to the master branch using your SchemaUtils.scala with the corresponding test cases, instead of creating a new function checkColumnNameDuplication?

This would simplify future back-port merges to 2.2.

maropu (Author):
ok, I'll do that. Thanks.

val e = intercept[AnalysisException] {
  spark.read.format(format).load(src.toString)
}
assert(e.getMessage.contains("Found duplicate column(s) in the datasource: "))
Reviewer (Member):
BTW, could you also add another test case: even if there are duplicate columns between the data schema and the partition schema, users can still query the data by manually specifying the schema?

maropu (Author):
ok, I'll add test cases in a new PR.

SparkQA commented Jun 21, 2017

Test build #78357 has finished for PR 18356 at commit c77d932.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu closed this on Jun 21, 2017