[SPARK-34926][SQL] PartitioningUtils.getPathFragment() should respect partition value is null #32018

AngersZhuuuu · 2021-04-01T02:40:33Z

What changes were proposed in this pull request?

When we insert an empty DataFrame into a partition of a partitioned table, we still call PartitioningUtils.getPathFragment()
to update that partition's metadata.
If the partition value is null, the call throws an exception like:

[info]   java.lang.NullPointerException:
[info]   at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51)
[info]   at scala.collection.immutable.StringOps.length(StringOps.scala:51)
[info]   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:35)
[info]   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[info]   at scala.collection.immutable.StringOps.foreach(StringOps.scala:33)
[info]   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.escapePathName(ExternalCatalogUtils.scala:69)
[info]   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.getPartitionValueString(ExternalCatalogUtils.scala:126)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.$anonfun$getPathFragment$1(PartitioningUtils.scala:354)
[info]   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
[info]   at scala.collection.Iterator.foreach(Iterator.scala:941)
[info]   at scala.collection.Iterator.foreach$(Iterator.scala:941)
[info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
[info]   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
[info]   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

PartitioningUtils.getPathFragment() should support null partition values too.
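A minimal sketch of the failure mode, under stated assumptions: the object name, the simplified escapePathName, and the use of a plain Map and Seq in place of TablePartitionSpec and StructType are all stand-ins for illustration, not Spark's actual code. It only aims to show why a null partition value fails as soon as it reaches a function that walks the characters of the string.

```scala
object NullPartitionValueRepro {
  // Hypothetical, simplified stand-in for ExternalCatalogUtils.escapePathName.
  // It iterates over the characters of the input, so a null input throws a
  // NullPointerException before any escaping happens.
  def escapePathName(path: String): String = {
    val builder = new StringBuilder
    path.foreach { c =>
      if (c == '/' || c == '=' || c == '%') {
        builder.append('%').append(f"${c.toInt}%02X")
      } else {
        builder.append(c)
      }
    }
    builder.toString
  }

  // Same shape as the pre-fix getPathFragment: every value is escaped as-is.
  // A Map and a list of column names stand in for the real Spark types.
  def getPathFragment(spec: Map[String, String], partitionColumns: Seq[String]): String =
    partitionColumns.map { name =>
      escapePathName(name) + "=" + escapePathName(spec(name))
    }.mkString("/")
}
```

With this sketch, a non-null value is escaped normally, while a spec such as Map("c" -> null) fails inside escapePathName, which is consistent with the top frames of the trace above.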

Why are the changes needed?

Fix bug

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

AngersZhuuuu · 2021-04-01T02:56:30Z

ping @MaxGekk Since your PR made partition values support null, I think we need to handle null values here. I hit this in the UT of #30057: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136751/testReport/

SparkQA · 2021-04-01T03:27:00Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41370/

SparkQA · 2021-04-01T03:27:02Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41370/

SparkQA · 2021-04-01T07:07:23Z

Test build #136787 has finished for PR 32018 at commit 960f2c1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk

I am not sure that this is the right fix. I guess null should be replaced by "__HIVE_DEFAULT_PARTITION__", like:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalogUtils.scala

Lines 122 to 127 in 0494dc9

    
    def getPartitionPathString(col: String, value: String): String = {
      val partitionString = if (value == null || value.isEmpty) {
        DEFAULT_PARTITION_NAME
      } else {
        escapePathName(value)
      }

AngersZhuuuu · 2021-04-01T07:47:08Z

Looks like we should handle it in

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala

Lines 351 to 355 in 0494dc9

    
    def getPathFragment(spec: TablePartitionSpec, partitionSchema: StructType): String = {
      partitionSchema.map { field =>
        escapePathName(field.name) + "=" + escapePathName(spec(field.name))
      }.mkString("/")
    }

AngersZhuuuu · 2021-04-01T07:49:59Z

What confused me is that

spark-sql> CREATE TABLE t(i STRING, c string) USING PARQUET PARTITIONED BY (c);
Time taken: 2.12 seconds
spark-sql> INSERT OVERWRITE t PARTITION (c=null) VALUES ('1');
Time taken: 4.984 seconds
spark-sql> desc formatted t partition(c=null);
i	string	NULL
c	string	NULL
# Partition Information
# col_name	data_type	comment
c	string	NULL

# Detailed Partition Information
Database	default
Table	t
Partition Values	[c=null]
Location	hdfs://tl0/user/hive/warehouse/t/c=null
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties	[path=hdfs://tl0/user/hive/warehouse/t, serialization.format=1]
Partition Parameters	{rawDataSize=-1, numFiles=1, transient_lastDdlTime=1617244501, totalSize=396, COLUMN_STATS_ACCURATE=false, numRows=-1}
Created Time	Thu Apr 01 10:35:01 SGT 2021
Last Access	UNKNOWN
Partition Statistics	396 bytes

# Storage Information
Location	hdfs://tl0/user/hive/warehouse/t
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Time taken: 0.135 seconds, Fetched 25 row(s)

So the path can be c=null. With the current code, in which case will the path be null, and in which case can it be __HIVE_DEFAULT_PARTITION__?

MaxGekk · 2021-04-01T08:02:11Z

> In which case will the path be null, and in which case can it be __HIVE_DEFAULT_PARTITION__?

The null partition value should be replaced by __HIVE_DEFAULT_PARTITION__ if it goes via DSv1 and an external catalog (the Hive MetaStore doesn't accept null partition values, and a null partition value in the filesystem, like col0=null, is not compatible with other systems). The v1 In-Memory catalog follows the Hive external catalog, as far as I know.

DSv2 impl should handle null itself. So, we shouldn't replace it by __HIVE_DEFAULT_PARTITION__.

AngersZhuuuu · 2021-04-01T08:09:37Z

Thanks a lot for clarifying this. It had confused me for a long time.

AngersZhuuuu · 2021-04-01T13:19:06Z

> DSv2 impl should handle null itself. So, we shouldn't replace it by __HIVE_DEFAULT_PARTITION__.

I know why I was confused: I tested this in Spark 3.0, and the behavior has since been changed to keep it consistent.

AngersZhuuuu · 2021-04-01T13:22:21Z

ping @MaxGekk Updated. The current code should be OK, since InsertIntoHadoopFsRelationCommand is only used in DSv1 or when converted from Hive-related commands.

I have also checked that PartitioningUtils.getPathFragment is only used in InsertIntoHadoopFsRelationCommand.

MaxGekk

@AngersZhuuuu Can you add a test for the changes?

MaxGekk · 2021-04-01T13:35:30Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala

@@ -350,7 +350,12 @@ object PartitioningUtils {
   */
  def getPathFragment(spec: TablePartitionSpec, partitionSchema: StructType): String = {
    partitionSchema.map { field =>
-      escapePathName(field.name) + "=" + escapePathName(spec(field.name))
+      val value = if (spec(field.name) == null || spec(field.name).isEmpty) {


Can we reuse the existing function or, if not, could you extract the common code?

MaxGekk · 2021-04-01T13:36:41Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala

+      val value = if (spec(field.name) == null || spec(field.name).isEmpty) {
+        DEFAULT_PARTITION_NAME
+      } else {
+        escapePathName(spec(field.name))


The lookup time in spec is probably not a big deal, but I would store spec(field.name) in a val.
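One way the two review suggestions could fit together, as a sketch under stated assumptions only: the object name, the simplified escapePathName, and the body of getPartitionValueString are hypothetical (the helper name appears in the stack trace in the PR description, but its real body is not shown in this thread), and a plain Map and Seq stand in for TablePartitionSpec and StructType.

```scala
object NullSafePathFragment {
  // Replacement value for null or empty partition values, as quoted in the review.
  val DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical, simplified stand-in for ExternalCatalogUtils.escapePathName.
  def escapePathName(path: String): String =
    path.replace("%", "%25").replace("/", "%2F").replace("=", "%3D")

  // Shared helper (assumed body): the null/empty check lives in one place so
  // that every caller building a partition path can reuse it.
  def getPartitionValueString(value: String): String =
    if (value == null || value.isEmpty) DEFAULT_PARTITION_NAME else escapePathName(value)

  def getPathFragment(spec: Map[String, String], partitionColumns: Seq[String]): String =
    partitionColumns.map { name =>
      // The value is looked up once and handed to the shared helper.
      val value = spec(name)
      escapePathName(name) + "=" + getPartitionValueString(value)
    }.mkString("/")
}
```

Extracting the check into a helper also resolves the repeated-lookup concern, since the value is read from the spec once per column.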

SparkQA · 2021-04-01T14:31:42Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41398/

AngersZhuuuu · 2021-04-01T14:32:15Z

> @AngersZhuuuu Can you add a test for the changes?

UT added; it comes from the case in https://issues.apache.org/jira/browse/SPARK-24937. Without the current change it fails:

22:30:04.596 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[info] - SPARK-34926: PartitioningUtils.getPathFragment() should respect partition value is null *** FAILED *** (609 milliseconds)
[info]   java.lang.NullPointerException:
[info]   at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51)
[info]   at scala.collection.immutable.StringOps.length(StringOps.scala:51)
[info]   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:35)
[info]   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[info]   at scala.collection.immutable.StringOps.foreach(StringOps.scala:33)
[info]   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.escapePathName(ExternalCatalogUtils.scala:69)
[info]   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.getPartitionValueString(ExternalCatalogUtils.scala:126)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.$anonfun$getPathFragment$1(PartitioningUtils.scala:354)
[info]   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
[info]   at scala.collection.Iterator.foreach(Iterator.scala:941)
[info]   at scala.collection.Iterator.foreach$(Iterator.scala:941)
[info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
[info]   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
[info]   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

AngersZhuuuu · 2021-04-01T14:32:27Z

Also cc @wangyum

SparkQA · 2021-04-01T15:02:00Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41398/

SparkQA · 2021-04-01T15:42:26Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41401/

SparkQA · 2021-04-01T15:47:05Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41401/

SparkQA · 2021-04-01T17:06:38Z

Test build #136821 has finished for PR 32018 at commit 992001b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-01T17:54:03Z

Test build #136816 has finished for PR 32018 at commit 03325c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2021-04-01T18:17:42Z

Test build #136817 has finished for PR 32018 at commit 1305fe5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk

LGTM, @AngersZhuuuu could you update PR's description.

AngersZhuuuu · 2021-04-01T23:41:05Z

> LGTM, @AngersZhuuuu could you update PR's description.

Done

MaxGekk · 2021-04-02T05:22:49Z

GA is failing on Avro tests, for instance, and the Jenkins build failed on the latest commit. @AngersZhuuuu To continue with the fix, let's re-trigger the tests. Also, @cloud-fan, could you look at this PR, since you reviewed previous changes related to null part values?

MaxGekk · 2021-04-02T05:23:27Z

jenkins, retest this, please

SparkQA · 2021-04-02T06:40:02Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41419/

SparkQA · 2021-04-02T06:40:03Z

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41419/

MaxGekk · 2021-04-02T07:24:58Z

+1, LGTM. Merging to master. The failed GA is a known issue.
Thank you @AngersZhuuuu, and @cloud-fan @wangyum for your review.

MaxGekk · 2021-04-02T07:28:30Z

BTW, @AngersZhuuuu does the issue exist in 3.1/3.0/2.4? If so, please, backport the changes.

AngersZhuuuu · 2021-04-02T07:42:00Z

> BTW, @AngersZhuuuu does the issue exist in 3.1/3.0/2.4? If so, please, backport the changes.

Need to check. I will update here after checking.

SparkQA · 2021-04-02T11:12:35Z

Test build #136841 has finished for PR 32018 at commit 992001b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2021-04-02T13:56:54Z

LGTM2

AngersZhuuuu · 2021-04-02T16:16:57Z

> BTW, @AngersZhuuuu does the issue exist in 3.1/3.0/2.4? If so, please, backport the changes.

I checked all branches; we need to backport to branch-3.0 and branch-3.1. Should I open separate PRs, or can you just merge into those branches?

MaxGekk · 2021-04-02T16:22:49Z

Could you open separate PRs per each branch, please.

AngersZhuuuu · 2021-04-02T16:32:36Z

> Could you open separate PRs per each branch, please.

OK, I will ping you when I finish.

[SPARK-34926][SQL] ExternalCatalogUtils.escapePathName should support…

960f2c1

… null

github-actions bot added the SQL label Apr 1, 2021

MaxGekk reviewed Apr 1, 2021

update

03325c3

MaxGekk reviewed Apr 1, 2021

follow comment

1305fe5

AngersZhuuuu changed the title ~~[SPARK-34926][SQL] ExternalCatalogUtils.escapePathName should support null~~ [SPARK-34926][SQL] PartitioningUtils. getPathFragment() should respect partition value is null Apr 1, 2021

AngersZhuuuu changed the title ~~[SPARK-34926][SQL] PartitioningUtils. getPathFragment() should respect partition value is null~~ [SPARK-34926][SQL] PartitioningUtils.getPathFragment() should respect partition value is null Apr 1, 2021

AngersZhuuuu added 2 commits April 1, 2021 22:27

Update InsertSuite.scala

012ffaf

Update InsertSuite.scala

992001b

MaxGekk approved these changes Apr 1, 2021

wangyum approved these changes Apr 2, 2021

cloud-fan approved these changes Apr 2, 2021

MaxGekk closed this in 65da928 Apr 2, 2021
