
[SPARK-21144][SQL][BRANCH-2.2] Check column name duplication in read/write paths #18356

Closed (2 commits)

Conversation

@maropu (Member) commented Jun 20, 2017

What changes were proposed in this pull request?

This PR fixes unexpected results when the data schema and the partition schema have duplicate columns:

import org.apache.hadoop.fs.Path

// Both the data column and the partition directory are named "foo".
withTempPath { dir =>
  val basePath = dir.getCanonicalPath
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=1").toString)
  spark.range(0, 3).toDF("foo").write.parquet(new Path(basePath, "foo=a").toString)
  spark.read.parquet(basePath).show()
}

+---+
|foo|
+---+
|  1|
|  1|
|  a|
|  a|
|  1|
|  a|
+---+

This patch adds code that checks for column name duplication when reading from/writing to files. It comes from #17758 and targets the fix at v2.2.
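For reference, a minimal sketch of what such a check could look like, inferred from the call sites and error message visible elsewhere in this PR; the helper name and exact message format here are assumptions, not the final patch:

import org.apache.spark.sql.AnalysisException

// Hedged sketch: reject duplicate column names, honoring the session's
// case-sensitivity setting (names are lower-cased when case-insensitive).
def checkColumnNameDuplication(
    columnNames: Seq[String],
    colType: String,
    caseSensitiveAnalysis: Boolean): Unit = {
  val names = if (caseSensitiveAnalysis) columnNames else columnNames.map(_.toLowerCase)
  if (names.distinct.length != names.length) {
    val duplicateColumns = names.groupBy(identity).collect {
      case (x, ys) if ys.length > 1 => s"`$x`"
    }
    throw new AnalysisException(
      s"Found duplicate column(s) $colType: ${duplicateColumns.mkString(", ")}")
  }
}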

How was this patch tested?

Created a new test suite, SchemaUtilsSuite, and added tests to existing suites: DataFrameSuite, DDLSuite, JDBCWriteSuite, DataFrameReaderWriterSuite, HiveMetastoreCatalog, and HiveDDLSuite.

@maropu (Member, Author) commented Jun 20, 2017

cc: @gatorsmile

@maropu (Member, Author) commented Jun 20, 2017

@gatorsmile This PR includes all the changes from #17758, but did you originally mean it should include only the part that fixes this issue? My bad, it seems I misunderstood. I'll look into the code and narrow it down to this fix.

SparkQA commented Jun 20, 2017

Test build #78268 has finished for PR 18356 at commit 32f0130.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else {
  schema.map(_.name.toLowerCase)
}
checkDuplication(columnNames, "table definition of " + table.identifier)
Reviewer (Member):
Could you revert all the unrelated changes?

maropu (Author):
ok

@gatorsmile (Member) commented Jun 20, 2017

To avoid any potential issue, could you revert all the unrelated changes?

@@ -181,6 +182,10 @@ case class DataSource(
      throw new AnalysisException(
        s"Unable to infer schema for $format. It must be specified manually.")
    }

    SchemaUtils.checkSchemaColumnNameDuplication(
      dataSchema, "the datasource", sparkSession.sessionState.conf.caseSensitiveAnalysis)
Reviewer (Member):
This is the change we need for 2.2

Reviewer (Member):
Actually, it should be dataSchema + partitionSchema.

We also need to issue a meaningful error message. Users can still bypass the error by manually specifying the schema.

maropu (Author):
yea, I'll update soon. Thanks!
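
For context, the bypass mentioned above might look like the following hypothetical snippet, reusing basePath from the repro in the description; an explicit schema skips inference and, with it, the new check:

import org.apache.spark.sql.types.{LongType, StructType}

// Hypothetical illustration: spark.read.schema(...) bypasses schema inference,
// so the duplication check in the inference path never runs.
val userSchema = new StructType().add("foo", LongType)
spark.read.schema(userSchema).parquet(basePath).show()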

Seq((true, ("a", "a")), (false, ("aA", "Aa"))).foreach { case (caseSensitive, (c0, c1)) =>
  withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive.toString) {
    withTempDir { src =>
      Seq(1).toDF(c0).write.mode("overwrite").json(s"$src/$c1=1")
Reviewer (Member):
Hi, @maropu. It seems we can merge the JSON and Parquet test cases into one by using format(...) instead of json() or parquet().

Reviewer (Member):
For spark.read.json, use spark.read.format(...).load instead.

maropu (Author):
oh, thanks! I'll update
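
A sketch of the merged test the reviewers are suggesting, combining the fragments above; the format list and the expected message are taken from elsewhere in this PR, so treat this as illustrative rather than the final test:

Seq("json", "parquet").foreach { format =>
  Seq((true, ("a", "a")), (false, ("aA", "Aa"))).foreach { case (caseSensitive, (c0, c1)) =>
    withSQLConf(SQLConf.CASE_SENSITIVE.key -> caseSensitive.toString) {
      withTempDir { src =>
        // Write data column c0 under a partition directory named c1, so the
        // data schema and the partition schema collide (up to case).
        Seq(1).toDF(c0).write.mode("overwrite").format(format).save(s"$src/$c1=1")
        val e = intercept[AnalysisException] {
          spark.read.format(format).load(src.toString)
        }
        assert(e.getMessage.contains("Found duplicate column(s) in the datasource: "))
      }
    }
  }
}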

@maropu (Member, Author) commented Jun 21, 2017

@gatorsmile Is this test valid? https://github.com/apache/spark/blob/master/sql/hive/src/test/scala/org/apache/spark/sql/sources/ParquetHadoopFsRelationSuite.scala#L45 It fails when this PR is applied because dataSchema and partSchema overlap.

SparkQA commented Jun 21, 2017

Test build #78345 has finished for PR 18356 at commit c77d932.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@shaneknapp (Contributor) commented:

test this please

checkColumnNameDuplication(
  (dataSchema ++ partitionSchema).map(_.name),
  "in the datasource",
  sparkSession.sessionState.conf.caseSensitiveAnalysis)
Reviewer (Member):
Since the new RC of 2.2 is out, how about submitting the PR to the master branch using your SchemaUtils.scala with the corresponding test cases, instead of creating a new function checkColumnNameDuplication?

This would simplify future back-port merges to 2.2.

maropu (Author):
ok, I'll do that. Thanks.

val e = intercept[AnalysisException] {
  spark.read.format(format).load(src.toString)
}
assert(e.getMessage.contains("Found duplicate column(s) in the datasource: "))
Reviewer (Member):
BTW, could you also add another test case: even if there are duplicate columns between the data schema and the partition schema, users can still query the data by manually specifying the schema?

maropu (Author):
ok, I'll add test cases in a new PR.

SparkQA commented Jun 21, 2017

Test build #78357 has finished for PR 18356 at commit c77d932.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu closed this on Jun 21, 2017