[SPARK-32508][SQL] Disallow empty part col values in partition spec before static partition writing #29316

cxzl25 · 2020-07-31T11:02:01Z

What changes were proposed in this pull request?

When writing to a static partition, check in advance whether any partition column value is empty.

Why are the changes needed?

Currently, when writing to a static partition whose partition column value is empty, the error is only reported after all tasks have completed.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added unit tests.

cxzl25 · 2020-08-07T10:03:18Z

@gatorsmile

In SPARK-19129 (#16583), SessionCatalog started rejecting empty partition values, but InsertIntoHiveTable was not covered, so the tasks are executed first and the job then fails in loadPartition.

Maybe we can check this behavior in advance to avoid task execution?

cxzl25 · 2020-09-08T10:50:21Z

@cloud-fan Could you help review this pull request when you have time? Thank you.

cloud-fan · 2020-09-08T11:56:03Z

ok to test

cloud-fan · 2020-09-08T11:57:22Z

sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala

+   */
+  protected def requireNonEmptyValueInPartitionSpec(specs: Seq[TablePartitionSpec]): Unit = {
+    specs.foreach { s =>
+      if (s.values.exists(_.isEmpty)) {


should this be done in an analyzer rule? how do we do the same check for file source tables?

I moved it to the PreprocessTableInsertion rule.
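The check discussed in this thread can be sketched without any Spark dependency. This is only an illustration: the TablePartitionSpec alias is assumed to mirror Spark's column-to-value map, the object name is hypothetical, and a plain IllegalArgumentException stands in for the analysis error Spark would raise. The message text follows the error shown later in this conversation.

```scala
object PartitionSpecCheck {
  // Assumption: a partition spec is a map from partition column name to value.
  type TablePartitionSpec = Map[String, String]

  /** Throws if any partition spec contains an empty partition column value. */
  def requireNonEmptyValueInPartitionSpec(specs: Seq[TablePartitionSpec]): Unit = {
    specs.foreach { s =>
      if (s.values.exists(_.isEmpty)) {
        val spec = s.map { case (k, v) => s"$k=$v" }.mkString("[", ", ", "]")
        // A plain exception keeps this sketch self-contained.
        throw new IllegalArgumentException(
          s"Partition spec is invalid. The spec ($spec) contains an empty partition column value")
      }
    }
  }
}
```

Running the check before any task is launched is the point of the change: the statement is rejected up front instead of failing after the write has already happened.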

SparkQA · 2020-09-08T14:52:08Z

Test build #128405 has finished for PR 29316 at commit 221564b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-09-08T15:17:17Z

sql/hive/src/test/scala/org/apache/spark/sql/hive/InsertSuite.scala

+    withTable("t1") {
+      spark.sql(
+        """
+          |CREATE TABLE t1 (c1 int)


why is this a problem only for hive tables?

InsertIntoHadoopFsRelationCommand
When manageFilesourcePartitions is turned on, catalog.listPartitions is called, and it checks whether the partition value is empty.

When manageFilesourcePartitions is not turned on, the partition value is currently not checked, so the SQL execution does not fail. If I now move the check logic to the PreprocessTableInsertion rule, such SQL will start to fail.

Perhaps this check should only be performed when tracksPartitionsInCatalog is true and the write targets a static partition.

Hive calls getPartition during loadPartition, and that call checks whether the partition value is empty.

public Partition getPartition(...) {
  ...
      || (val != null && val.length() == 0)) {
    throw new HiveException("get partition: Value for key "
        + field.getName() + " is null or empty");
  }

In the case that manageFilesourcePartitions is not turned on, the partition value is currently not checked, which means that the SQL execution will not fail.

what's the behavior then?

InsertIntoHadoopFsRelationCommand

For both static and dynamic partition writes, DynamicPartitionDataWriter is used, and an empty partition value is turned into the default value (__HIVE_DEFAULT_PARTITION__) by getPartitionPathString.

When manageFilesourcePartitions is not turned on, the partition information is maintained through the filesystem, so nothing checks whether the partition value is empty.
For example, insert overwrite table a partition(d='') select 1 runs successfully.
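A minimal sketch of the mapping described above, assuming the helper only needs to substitute the default partition name. The object and method names here are hypothetical, and the real getPartitionPathString also escapes special characters in the column name and value, which this sketch omits.

```scala
object PartitionPath {
  // Default partition name, as it appears in the logs in this thread.
  val DefaultPartitionName = "__HIVE_DEFAULT_PARTITION__"

  // Simplified stand-in: a null or empty value is written under the default
  // partition directory; any other value is used as-is (no path escaping).
  def partitionPathString(col: String, value: String): String = {
    val partitionString =
      if (value == null || value.isEmpty) DefaultPartitionName else value
    s"$col=$partitionString"
  }
}
```

This is why a filesystem-managed table accepts d='': the empty value never reaches the catalog and simply becomes a directory named after the default partition.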

What's the behavior of Hive? Does Hive treat d='' as d=__HIVE_DEFAULT_PARTITION__ and run it?

hive> INSERT OVERWRITE TABLE t1 PARTITION(d='') select 1;

Caused by: org.apache.hadoop.hive.ql.parse.SemanticException: Line 1:61 Partition not found ''''
  at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer$tableSpec.<init>(BaseSemanticAnalyzer.java:856)
  at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer$tableSpec.<init>(BaseSemanticAnalyzer.java:727)
  at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.getMetaData(SemanticAnalyzer.java:1652)
  ... 23 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: get partition: Value for key d is null or empty
  at org.apache.hadoop.hive.ql.metadata.Hive.getPartition(Hive.java:1900)

hive> INSERT OVERWRITE TABLE t1 PARTITION(d) select 1,'' as d;
result:

Loading data to table x.t1 partition (d=null)
Time taken for load dynamic partitions : 302
Loading partition {d=__HIVE_DEFAULT_PARTITION__}
Time taken for adding to write entity : 1
Partition x.t1{d=__HIVE_DEFAULT_PARTITION__} stats: [numFiles=1, numRows=1, totalSize=201, rawDataSize=85]

SparkQA · 2020-09-08T20:20:55Z

Test build #128416 has finished for PR 29316 at commit ab4e04c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-08T20:47:37Z

Test build #128415 has finished for PR 29316 at commit e012e7a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-09T13:51:24Z

Test build #128442 has finished for PR 29316 at commit 232a835.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-09-10T08:04:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala

@@ -402,6 +403,22 @@ case class PreprocessTableInsertion(conf: SQLConf) extends Rule[LogicalPlan] {
          s"including ${staticPartCols.size} partition column(s) having constant value(s).")
    }

+    val partitionsTrackedByCatalog = conf.manageFilesourcePartitions &&


We shouldn't check the config. The flag is decided by the config that was used to create this table, so catalogTable.get.tracksPartitionsInCatalog is good enough.

When manageFilesourcePartitions == false and tracksPartitionsInCatalog == true, InsertIntoHadoopFsRelationCommand calls neither catalog.listPartitions nor AlterTableAddPartitionCommand, so it does not check whether the partition value is empty.

set spark.sql.hive.manageFilesourcePartitions=false;
insert overwrite table a partition(d='1',h='') select 1;

Currently this SQL does not fail.

I mean, conf.manageFilesourcePartitions is used to create the table. When we need to read/alter table, we should respect table.tracksPartitionsInCatalog.
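The point above can be sketched as a gate that looks only at the table metadata. TableMeta is a hypothetical stand-in for the catalog table, reduced to the two fields this thread mentions; the session config is deliberately absent from the signature.

```scala
object CatalogTrackingGate {
  // Hypothetical, reduced view of the catalog table metadata.
  final case class TableMeta(
      partitionColumnNames: Seq[String],
      tracksPartitionsInCatalog: Boolean)

  // The flag was fixed when the table was created, so reads and writes
  // consult the table itself rather than the current session config.
  def partitionsTrackedByCatalog(table: Option[TableMeta]): Boolean =
    table.exists(t => t.partitionColumnNames.nonEmpty && t.tracksPartitionsInCatalog)
}
```

With this shape, changing spark.sql.hive.manageFilesourcePartitions in the session does not change how an existing table is treated.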

cloud-fan · 2020-09-10T08:05:12Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala

+    // check static partition
+    if (partitionsTrackedByCatalog &&
+      normalizedPartSpec.nonEmpty &&
+      staticPartCols.size == partColNames.size) {


why do we need this check staticPartCols.size == partColNames.size?

staticPartCols contains only the fields whose partition value is specified.
partColNames contains all partition fields.
The comparison is there because only writes to a fully static partition are checked.

insert overwrite table a partition(d='1',h) select 1,''

staticPartCols=[d] partColNames=[d,h]

insert overwrite table a partition(d='1',h='') select 1

staticPartCols=[d,h] partColNames=[d,h]

partition(d='',h) should we fail for this case?

Both Spark and Hive allow an empty partition value in a dynamic partition write.

Loading data to table x.t1 partition (d=1, h=null)
Time taken for load dynamic partitions : 203
Loading partition {d=1, h=__HIVE_DEFAULT_PARTITION__}
Time taken for adding to write entity : 1
Partition x.t1{d=1, h=__HIVE_DEFAULT_PARTITION__} stats: [numFiles=1, numRows=1, totalSize=201, rawDataSize=85]

your example is partition(d='1',h), I was asking about partition(d='',h)

hive> insert overwrite table t1 partition (d='',h) select 1,'';

ERROR ql.Driver (SessionState.java:printError(956)) - FAILED: IllegalArgumentException Can not create a Path from an empty string
java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
  at org.apache.hadoop.fs.Path.<init>(Path.java:135)
  at org.apache.hadoop.fs.Path.<init>(Path.java:94)
  at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFileSinkPlan(SemanticAnalyzer.java:6070)

spark-sql> insert overwrite table t1 partition (d='',h) select 1,'';

[main] INFO InsertIntoHiveTable: Partition `x`.`t1` {d=__HIVE_DEFAULT_PARTITION__, h=__HIVE_DEFAULT_PARTITION__} stats: [numFiles=1, numRows=0, totalSize=212]

then shall we check all staticPartCols have partition values?

Yes, we can add this check, but SQL like this, which used to succeed, will then fail.

But the behavior is inconsistent. It's weird that partition(d='',h) works but partition(d='',h='1') does not.

Makes sense.
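The outcome of this thread, checking every statically specified value whether or not the remaining columns are dynamic, can be sketched as follows. The spec type is an assumption for illustration: a column maps to Some(value) when the PARTITION clause gives it a value and to None when it is dynamic. The object and method names are hypothetical.

```scala
object StaticPartitionCheck {
  // Assumption: Some(value) = static partition column, None = dynamic one.
  type InsertPartitionSpec = Map[String, Option[String]]

  /** Columns whose value is fixed in the PARTITION clause. */
  def staticPartCols(spec: InsertPartitionSpec): Set[String] =
    spec.collect { case (col, Some(_)) => col }.toSet

  /** True if any statically specified value is the empty string. */
  def hasEmptyStaticValue(spec: InsertPartitionSpec): Boolean =
    spec.values.exists(v => v.exists(_.isEmpty))
}
```

Under this rule partition(d='1',h) passes, while partition(d='',h) and partition(d='',h='1') are both rejected, which removes the inconsistency raised above. Dynamic columns are not inspected, so an empty value produced by the query still ends up in the default partition.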

SparkQA · 2020-09-10T23:03:30Z

Test build #128531 has finished for PR 29316 at commit a7b9a17.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-11T07:05:01Z

Test build #128549 has finished for PR 29316 at commit 65f781a.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-11T12:44:39Z

Test build #128560 has finished for PR 29316 at commit 965ed5a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-09-15T14:14:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/rules.scala

+      catalogTable.get.partitionColumnNames.nonEmpty &&
+      catalogTable.get.tracksPartitionsInCatalog
+    if (partitionsTrackedByCatalog &&
+      normalizedPartSpec.nonEmpty) {


nit: the if condition can be put in one line

cloud-fan · 2020-09-15T14:17:50Z

sql/core/src/test/scala/org/apache/spark/sql/sources/InsertSuite.scala

+
+      sql("INSERT INTO TABLE insertTable PARTITION(part1='1', part2='2') SELECT 1")
+      sql("INSERT INTO TABLE insertTable PARTITION(part1='1', part2) SELECT 1 ,'2' AS part2")
+      sql("INSERT INTO TABLE insertTable PARTITION(part1='1', part2) SELECT 1 ,'' AS part2")


So the partition column can be empty string if it's dynamic. Shall we convert the empty string/null in partition spec to __HIVE_DEFAULT_PARTITION__ before calling listPartitions/loadPartition?

Generally speaking, an empty partition value is meaningless, so a static partition value is not allowed to be empty.
With dynamic partitioning, the user may not know that the partition field is null or empty, and the data ends up in the __HIVE_DEFAULT_PARTITION__ partition.

listPartitions

spark-sql> show partitions inserttable ;
part1=1/part2=__HIVE_DEFAULT_PARTITION__
Time taken: 0.2 seconds, Fetched 1 row(s)
spark-sql> desc formatted inserttable partition(part1='1',part2='');
Error in query: Partition spec is invalid. The spec ([part1=1, part2=]) contains an empty partition column value;
spark-sql> desc formatted inserttable partition(part1='1',part2='__HIVE_DEFAULT_PARTITION__');
col_name data_type comment
...
Time taken: 0.348 seconds, Fetched 27 row(s)

The partition value the user sees is __HIVE_DEFAULT_PARTITION__, so the user will not specify an empty partition value to query the partition details.

loadPartition
DynamicPartitionDataWriter#partitionPathExpression already converts a null or empty partition value to __HIVE_DEFAULT_PARTITION__, so the write succeeds without adding an earlier conversion.

SparkQA · 2020-09-15T17:12:15Z

Test build #128720 has finished for PR 29316 at commit b969c73.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-16T07:05:03Z

Test build #128735 has finished for PR 29316 at commit c8ac369.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-09-16T14:24:28Z

Test build #128755 has finished for PR 29316 at commit 76acc07.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-09-17T06:50:29Z

thanks, merging to master!

Disallow empty part col values in partition spec before static partition writing

224b979

probot-autolabeler bot added the SQL label Jul 31, 2020

code style

221564b

cloud-fan reviewed Sep 8, 2020

cxzl25 added 2 commits September 8, 2020 22:53

move to analyzer rule

e012e7a

remove unused import

ab4e04c

cloud-fan reviewed Sep 8, 2020

check when tracksPartitionsInCatalog==true

232a835

cloud-fan reviewed Sep 10, 2020

modify the check logic

a7b9a17

respect tracksPartitionsInCatalog

65f781a

fix ut

965ed5a

cloud-fan reviewed Sep 15, 2020

nit

b969c73

trigger test

c8ac369

trigger test

76acc07

cloud-fan closed this in 92b75dc Sep 17, 2020
