
[SPARK-26012][SQL] Null and '' values should not cause dynamic partition failure of string types #23010

Closed
wants to merge 17 commits

Conversation

eatoncys
Contributor

What changes were proposed in this pull request?

Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
For example, the test below will fail before this PR:

test("Null and '' values should not cause dynamic partition failure of string types") {
withTable("t1", "t2") {
spark.range(3).write.saveAsTable("t1")
spark.sql("select id, cast(case when id = 1 then '' else null end as string) as p" +
" from t1").write.partitionBy("p").saveAsTable("t2")
checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), Row(2, null)))
}
}

The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already exists'.
This PR adds protection against file conflicts by renaming the file when a conflict occurs.


How was this patch tested?

Newly added test.

@SparkQA

SparkQA commented Nov 12, 2018

Test build #98715 has finished for PR 23010 at commit 1f18e27.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys
Contributor Author

eatoncys commented Dec 2, 2018

retest this please

@SparkQA

SparkQA commented Dec 2, 2018

Test build #99576 has finished for PR 23010 at commit 1f18e27.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

cc @cloud-fan

@cloud-fan
Contributor

The root cause is, DynamicPartitionDataWriter treats null and empty string as different partition values, and creates new files. However, null and empty string are converted to __HIVE_DEFAULT_PARTITION__ at the end.

I think we should deal with invalid partition values ahead, so that we don't need to worry about them during writing.

@eatoncys
Contributor Author

eatoncys commented Dec 3, 2018

@cloud-fan Thanks for the review. Do you mean we should filter out invalid partition values in the SQL before writing?

@eatoncys
Contributor Author

eatoncys commented Dec 3, 2018

But we may forget to filter out null values when writing the SQL. The following function protects against this situation and writes null partition values as __HIVE_DEFAULT_PARTITION__:

def getPartitionPathString(col: String, value: String): String = {
  val partitionString = if (value == null || value.isEmpty) {
    DEFAULT_PARTITION_NAME
  } else {
    escapePathName(value)
  }
  escapePathName(col) + "=" + partitionString
}

But DynamicPartitionDataWriter only compares whether the in-memory values are the same, so a file-writing error occurs.
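
A minimal sketch of the mismatch, assuming the helper above is ExternalCatalogUtils.getPartitionPathString (the snippet is illustrative only, not part of this PR):

import org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils

// Both an empty string and null resolve to the same on-disk directory, while the
// dynamic-partition writer keys its open files on the raw in-memory value, so the
// second distinct value tries to create a file under a path that already exists.
val fromEmpty = ExternalCatalogUtils.getPartitionPathString("p", "")
val fromNull  = ExternalCatalogUtils.getPartitionPathString("p", null)
assert(fromEmpty == "p=__HIVE_DEFAULT_PARTITION__")
assert(fromEmpty == fromNull)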

@cloud-fan
Contributor

We should move the logic of normalizing invalid partition values before writing, instead of during writing.

@SparkQA

SparkQA commented Jan 14, 2019

Test build #101182 has finished for PR 23010 at commit 49dfe73.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys
Contributor Author

@cloud-fan I have added a conversion before calculating the partition value, converting empty string-type values to null. Would you like to review it again? Thanks.

val partitionExpression =
  toBoundExprs(description.partitionColumns, description.allColumns).map {
    case e: Expression if e.dataType == StringType =>
      Empty2Null(e)
Contributor

We need to do it earlier. In FileFormatWriter.write we sort the input RDD by partition columns, so we need to normalize partition values before sorting.

Contributor Author

@eatoncys eatoncys Jan 15, 2019

@cloud-fan Thanks for the review. I have moved it before the sort; PartitionColumns is retained because it is used to compute getPartitionPath.

@SparkQA

SparkQA commented Jan 15, 2019

Test build #101218 has finished for PR 23010 at commit f2f777a.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 15, 2019

Test build #101220 has finished for PR 23010 at commit e750515.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 15, 2019

Test build #101217 has finished for PR 23010 at commit 780aa48.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • case class Empty2Null(child: Expression) extends UnaryExpression with String2StringExpression

@SparkQA

SparkQA commented Jan 15, 2019

Test build #101232 has finished for PR 23010 at commit f9701fb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 15, 2019

Test build #101240 has finished for PR 23010 at commit f9701fb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys
Contributor Author

retest this please

@SparkQA

SparkQA commented Jan 15, 2019

Test build #101259 has finished for PR 23010 at commit f9701fb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys
Contributor Author

@cloud-fan Would you like to review it again? Thanks.

@maropu
Member

maropu commented Jan 16, 2019

branch-2.3 has the same issue (I ran the test in branch-2.3 and it failed), so I added "2.3.2" in Affects Version/s. Since the datasource impl. is totally different, we can't simply backport this fix there though...

" from t1").write.partitionBy("p").saveAsTable("t2")
checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), Row(2, null)))
}
}
Member

Can you add tests w/o codegen by using CodegenInterpretedPlanTest?

Contributor Author

@eatoncys eatoncys Jan 16, 2019

Sorry, I don't quite understand what a test 'w/o codegen' means. Could you give an example? Thanks.

Member

@maropu maropu Jan 16, 2019

It's OK to declare class FileFormatWriterSuite extends QueryTest with SharedSQLContext with CodegenInterpretedPlanTest. By default, the current test checks Empty2Null in codegen mode (Empty2Null.doGenCode) only. Please check the implementation of CodegenInterpretedPlanTest.
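
A minimal sketch of that suggestion, assuming the 2.x test helpers named in this thread (QueryTest, SharedSQLContext). CodegenInterpretedPlanTest registers each test twice, once with codegen enabled and once interpreted, so both Empty2Null.doGenCode and the interpreted eval path are exercised:

import org.apache.spark.sql.{QueryTest, Row}
import org.apache.spark.sql.catalyst.plans.CodegenInterpretedPlanTest
import org.apache.spark.sql.test.SharedSQLContext

class FileFormatWriterSuite extends QueryTest with SharedSQLContext
  with CodegenInterpretedPlanTest {

  test("Null and '' values should not cause dynamic partition failure of string types") {
    withTable("t1", "t2") {
      spark.range(3).write.saveAsTable("t1")
      spark.sql("select id, cast(case when id = 1 then '' else null end as string) as p" +
        " from t1").write.partitionBy("p").saveAsTable("t2")
      checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), Row(2, null)))
    }
  }
}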

Contributor Author

Ok, modified, thanks.

@eatoncys
Contributor Author

eatoncys commented Apr 4, 2019

@cloud-fan, sorry, I will fix this PR next week.

@eatoncys
Contributor Author

eatoncys commented Apr 4, 2019

@cloud-fan I have added an analyzer rule to do the empty-string-to-null conversion for partition columns. Would you like to review it again? Thanks.

@SparkQA

SparkQA commented Apr 4, 2019

Test build #104284 has finished for PR 23010 at commit 5a0c58e.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class Empty2Null(child: Expression) extends UnaryExpression with String2StringExpression
  • case class UpdateEmptyValueOfPartitionToNull(conf: SQLConf) extends Rule[LogicalPlan]

@SparkQA

SparkQA commented Apr 4, 2019

Test build #104285 has finished for PR 23010 at commit 8366975.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 8, 2019

Test build #104383 has finished for PR 23010 at commit 9e02dd8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104414 has finished for PR 23010 at commit 360c785.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileFormatWriterSuite extends QueryTest with SharedSQLContext
  • class InsertSuite extends DataSourceTest with SharedSQLContext with CodegenInterpretedPlanTest

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104413 has finished for PR 23010 at commit ab2ea90.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@eatoncys
Contributor Author

eatoncys commented Apr 9, 2019

retest this please

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104422 has finished for PR 23010 at commit 360c785.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileFormatWriterSuite extends QueryTest with SharedSQLContext
  • class InsertSuite extends DataSourceTest with SharedSQLContext with CodegenInterpretedPlanTest

@eatoncys
Contributor Author

eatoncys commented Apr 9, 2019

retest this please

@SparkQA

SparkQA commented Apr 9, 2019

Test build #104424 has finished for PR 23010 at commit 360c785.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileFormatWriterSuite extends QueryTest with SharedSQLContext
  • class InsertSuite extends DataSourceTest with SharedSQLContext with CodegenInterpretedPlanTest

@@ -260,6 +261,75 @@ class FindDataSourceTable(sparkSession: SparkSession) extends Rule[LogicalPlan]
}
}

/** A function that converts the empty string to null for partition values. */
case class Empty2Null(child: Expression) extends UnaryExpression with String2StringExpression {
Contributor

we should override nullable in this expression, to always return true.

  override def convert(v: UTF8String): UTF8String = if (v.numBytes() == 0) null else v
  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    nullSafeCodeGen(ctx, ev, c => {
      val setIsNull = if (nullable) s"${ev.isNull} = true" else ""
Contributor

This is not needed if nullable is always true.
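
A minimal sketch of the expression with both review comments applied (nullable hard-coded to true and the empty-string branch setting isNull directly); the exact codegen template is an assumption based on the excerpt above, not the merged code:

import org.apache.spark.sql.catalyst.expressions.{Expression, String2StringExpression, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.unsafe.types.UTF8String

/** A function that converts the empty string to null for partition values. */
case class Empty2Null(child: Expression) extends UnaryExpression with String2StringExpression {
  // '' is mapped to null, so the result can be null even when the child never is.
  override def nullable: Boolean = true

  override def convert(v: UTF8String): UTF8String = if (v.numBytes() == 0) null else v

  override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
    nullSafeCodeGen(ctx, ev, c => {
      // nullable is always true, so isNull can be set unconditionally for empty input.
      s"""if ($c.numBytes() == 0) {
         |  ${ev.isNull} = true;
         |} else {
         |  ${ev.value} = $c;
         |}""".stripMargin
    })
  }
}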

val actualQuery = updateQueryPlan(i.query, i.partitionColumns.map(_.name))
val partitionColumns = i.partitionColumns.map { col =>
  actualQuery.output.find(a => conf.resolver(a.name, col.name)).getOrElse {
    throw new AnalysisException(
Contributor

we don't need to do this. Other analyzer rules should have already checked it.

s"Unable to resolve ${col.name} given [${i.output.map(_.name).mkString(", ")}]")
}
}
i.copy(partitionColumns = partitionColumns, query = actualQuery)
Contributor

why do we need to update partitionColumns?

Contributor Author

CheckAnalysis will fail if it is not updated, because the attributes in partitionColumns are stale.

val actualQuery = updateQueryPlan(c.query, c.table.partitionColumnNames)
c.copy(query = actualQuery)

case i @ InsertIntoTable(_, partSpec, query, _, _) =>
Contributor

We don't need to handle InsertIntoTable and CreateTable. They are unresolved plan nodes and we will not hit them here.

Contributor Author

@eatoncys eatoncys Apr 10, 2019

InsertIntoTable and CreateTable are handled here for Hive tables, corresponding to InsertIntoHiveTable and CreateHiveTableAsSelectCommand. We cannot access the Hive command code here, so we match on InsertIntoTable and CreateTable instead.

}
}
val partitionSet = AttributeSet(partitionColumns)
var needConvert = false
Contributor

we don't need this. It's OK to add a dummy project, the optimizer will remove it later.

Contributor Author

This is added to get through existing test cases. If it is not added, the test org.apache.spark.sql.hive.execution.HiveQuerySuite "SPARK-3810: PreprocessTableInsertion static partitioning support" will fail, because it checks the number of Project nodes.

    partitionColumnNames: Seq[String]): LogicalPlan = {
  val partitionColumns = partitionColumnNames.map { name =>
    query.output.find(a => conf.resolver(a.name, name)).getOrElse {
      throw new AnalysisException(
Contributor

This is OK as a sanity check.

@cloud-fan
Contributor

I just realized that we may create InsertIntoHadoopFsRelationCommand on the fly, e.g. in OptimizedCreateHiveTableAsSelectCommand. In this case an analyzer rule cannot help.

Another idea: in FileFormatWriter.write, we add a project above the input SparkPlan that applies the Empty2Null expression. Sorry for the back and forth.
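
A minimal sketch of that second idea, assuming it lives in FileFormatWriter.write with plan: SparkPlan and partitionColumns: Seq[Attribute] in scope, and with Empty2Null as sketched earlier in this thread (the helper name and placement are assumptions, not the final change):

import org.apache.spark.sql.catalyst.expressions.{Alias, Attribute, AttributeSet, NamedExpression}
import org.apache.spark.sql.execution.{ProjectExec, SparkPlan}
import org.apache.spark.sql.types.StringType

// Wrap every string-typed partition attribute in Empty2Null so that '' and null collapse
// to the same partition value before the sort and the dynamic-partition writer run.
def addEmpty2NullProject(plan: SparkPlan, partitionColumns: Seq[Attribute]): SparkPlan = {
  val partitionSet = AttributeSet(partitionColumns)
  val projectList: Seq[NamedExpression] = plan.output.map {
    case p if partitionSet.contains(p) && p.dataType == StringType =>
      Alias(Empty2Null(p), p.name)()
    case other => other
  }
  ProjectExec(projectList, plan)
}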

@eatoncys
Contributor Author

eatoncys commented Apr 10, 2019

I just realized that we may create InsertIntoHadoopFsRelationCommand on the fly, e.g. in OptimizedCreateHiveTableAsSelectCommand. In this case an analyzer rule cannot help.

Another idea: in FileFormatWriter.write, we add a project above the input SparkPlan that applies the Empty2Null expression. Sorry for the back and forth.

@cloud-fan Do you mean to move it back to FileFormatWriter.write? Can I submit another PR that does it in FileFormatWriter.write, to compare with this PR? If it is better, we can close this one.

@eatoncys
Contributor Author

@cloud-fan I have submitted another PR (#24334) that does it in FileFormatWriter.write. Would you like to review it? Thanks.

gaborgsomogyi pushed a commit to gaborgsomogyi/spark that referenced this pull request Apr 10, 2019
…ion failure of string types

Dynamic partition will fail when both '' and null values are taken as dynamic partition values simultaneously.
For example, the test below will fail before this PR:

test("Null and '' values should not cause dynamic partition failure of string types") {
withTable("t1", "t2") {
spark.range(3).write.saveAsTable("t1")
spark.sql("select id, cast(case when id = 1 then '' else null end as string) as p" +
" from t1").write.partitionBy("p").saveAsTable("t2")
checkAnswer(spark.table("t2").sort("id"), Seq(Row(0, null), Row(1, null), Row(2, null)))
}
}

The error is: 'org.apache.hadoop.fs.FileAlreadyExistsException: File already exists'.
This PR converts empty strings to null for partition values.
This is an alternative approach to PR apache#23010.


How was this patch tested?
New added test.

Closes apache#24334 from eatoncys/FileFormatWriter.

Authored-by: 10129659 <chen.yanshan@zte.com.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@eatoncys eatoncys closed this Apr 10, 2019