Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-34926][SQL] PartitioningUtils.getPathFragment() should respect partition value is null #32018

Closed
wants to merge 5 commits into from

Conversation

AngersZhuuuu
Copy link
Contributor

@AngersZhuuuu AngersZhuuuu commented Apr 1, 2021

What changes were proposed in this pull request?

When we insert data into a partition table partition with empty DataFrame. We will call PartitioningUtils.getPathFragment()
then to update this partition's metadata too.
When we insert to a partition when partition value is null, it will throw exception like

[info]   java.lang.NullPointerException:
[info]   at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51)
[info]   at scala.collection.immutable.StringOps.length(StringOps.scala:51)
[info]   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:35)
[info]   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[info]   at scala.collection.immutable.StringOps.foreach(StringOps.scala:33)
[info]   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.escapePathName(ExternalCatalogUtils.scala:69)
[info]   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.getPartitionValueString(ExternalCatalogUtils.scala:126)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.$anonfun$getPathFragment$1(PartitioningUtils.scala:354)
[info]   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
[info]   at scala.collection.Iterator.foreach(Iterator.scala:941)
[info]   at scala.collection.Iterator.foreach$(Iterator.scala:941)
[info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
[info]   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
[info]   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

PartitioningUtils.getPathFragment() should support null value too

Why are the changes needed?

Fix bug

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added UT

@github-actions github-actions bot added the SQL label Apr 1, 2021
@AngersZhuuuu
Copy link
Contributor Author

ping @MaxGekk Since your pr make partition value support value as null, here I think we need to handle null value and I meet this in #30057 's UT https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/136751/testReport/

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41370/

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41370/

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Test build #136787 has finished for PR 32018 at commit 960f2c1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure that it is right fix. I guess, null should be replaced by "__HIVE_DEFAULT_PARTITION__" like:

def getPartitionPathString(col: String, value: String): String = {
val partitionString = if (value == null || value.isEmpty) {
DEFAULT_PARTITION_NAME
} else {
escapePathName(value)
}

@AngersZhuuuu
Copy link
Contributor Author

AngersZhuuuu commented Apr 1, 2021

I am not sure that it is right fix. I guess, null should be replaced by "__HIVE_DEFAULT_PARTITION__" like:

def getPartitionPathString(col: String, value: String): String = {
val partitionString = if (value == null || value.isEmpty) {
DEFAULT_PARTITION_NAME
} else {
escapePathName(value)
}

Looks like we should handle it in

def getPathFragment(spec: TablePartitionSpec, partitionSchema: StructType): String = {
partitionSchema.map { field =>
escapePathName(field.name) + "=" + escapePathName(spec(field.name))
}.mkString("/")
}

@AngersZhuuuu
Copy link
Contributor Author

What confused me is that

spark-sql> CREATE TABLE t(i STRING, c string) USING PARQUET PARTITIONED BY (c);
Time taken: 2.12 seconds
spark-sql> INSERT OVERWRITE t PARTITION (c=null) VALUES ('1');
Time taken: 4.984 seconds
spark-sql> desc formatted t partition(c=null);
i	string	NULL
c	string	NULL
# Partition Information
# col_name	data_type	comment
c	string	NULL

# Detailed Partition Information
Database	default
Table	t
Partition Values	[c=null]
Location	hdfs://tl0/user/hive/warehouse/t/c=null
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Storage Properties	[path=hdfs://tl0/user/hive/warehouse/t, serialization.format=1]
Partition Parameters	{rawDataSize=-1, numFiles=1, transient_lastDdlTime=1617244501, totalSize=396, COLUMN_STATS_ACCURATE=false, numRows=-1}
Created Time	Thu Apr 01 10:35:01 SGT 2021
Last Access	UNKNOWN
Partition Statistics	396 bytes

# Storage Information
Location	hdfs://tl0/user/hive/warehouse/t
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Time taken: 0.135 seconds, Fetched 25 row(s)

The path can be c=null, for current code, which case the path will be null and which case it can be __HIVE_DEFAULT_PARTITION__?

@MaxGekk
Copy link
Member

MaxGekk commented Apr 1, 2021

which case the path will be null and which case it can be HIVE_DEFAULT_PARTITION?

The null partition value should be replaced to __HIVE_DEFAULT_PARTITION__ if it goes via DSv1 and external catalog (Hive MetaStore doesn't accept null part values, and null part value in filesystem like col0=null is not compatible with other systems). v1 In-Memory catalog follows Hive external catalog as far as I know.

DSv2 impl should handle null itself. So, we shouldn't replace it by __HIVE_DEFAULT_PARTITION__.

@AngersZhuuuu
Copy link
Contributor Author

which case the path will be null and which case it can be HIVE_DEFAULT_PARTITION?

The null partition value should be replaced to __HIVE_DEFAULT_PARTITION__ if it goes via DSv1 and external catalog (Hive MetaStore doesn't accept null part values, and null part value in filesystem like col0=null is not compatible with other systems). v1 In-Memory catalog follows Hive external catalog as far as I know.

DSv2 impl should handle null itself. So, we shouldn't replace it by __HIVE_DEFAULT_PARTITION__.

Thanks you a lot for clarify this problem. This make me confused for a long time.

@AngersZhuuuu
Copy link
Contributor Author

DSv2 impl should handle null itself. So, we shouldn't replace it by __HIVE_DEFAULT_PARTITION__.

I know Why I am confused, I test this in spark 3.0, this behavior has been change to keep consistence.

@AngersZhuuuu
Copy link
Contributor Author

AngersZhuuuu commented Apr 1, 2021

ping @MaxGekk Updated, current code should be ok since InsertIntoHadoopFsRelationCommand only used in DsV1 or convert from hive related command.

and I have checked that PartitioningUtils.getPathFragment only used in InsertIntoHadoopFsRelationCommand

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AngersZhuuuu Can you add a test for the changes?

@@ -350,7 +350,12 @@ object PartitioningUtils {
*/
def getPathFragment(spec: TablePartitionSpec, partitionSchema: StructType): String = {
partitionSchema.map { field =>
escapePathName(field.name) + "=" + escapePathName(spec(field.name))
val value = if (spec(field.name) == null || spec(field.name).isEmpty) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we re-use existing function or if not, could you extract common code?

val value = if (spec(field.name) == null || spec(field.name).isEmpty) {
DEFAULT_PARTITION_NAME
} else {
escapePathName(spec(field.name))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, look up time to spec is not big deal but I would store spec(field.name) to a val.

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-34926][SQL] ExternalCatalogUtils.escapePathName should support null [SPARK-34926][SQL] PartitioningUtils. getPathFragment() should respect partition value is null Apr 1, 2021
@AngersZhuuuu AngersZhuuuu changed the title [SPARK-34926][SQL] PartitioningUtils. getPathFragment() should respect partition value is null [SPARK-34926][SQL] PartitioningUtils.getPathFragment() should respect partition value is null Apr 1, 2021
@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41398/

@AngersZhuuuu
Copy link
Contributor Author

@AngersZhuuuu Can you add a test for the changes?

UT added and it from the case https://issues.apache.org/jira/browse/SPARK-24937, without current change it will failed

22:30:04.596 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[info] - SPARK-34926: PartitioningUtils.getPathFragment() should respect partition value is null *** FAILED *** (609 milliseconds)
[info]   java.lang.NullPointerException:
[info]   at scala.collection.immutable.StringOps$.length$extension(StringOps.scala:51)
[info]   at scala.collection.immutable.StringOps.length(StringOps.scala:51)
[info]   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:35)
[info]   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[info]   at scala.collection.immutable.StringOps.foreach(StringOps.scala:33)
[info]   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.escapePathName(ExternalCatalogUtils.scala:69)
[info]   at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.getPartitionValueString(ExternalCatalogUtils.scala:126)
[info]   at org.apache.spark.sql.execution.datasources.PartitioningUtils$.$anonfun$getPathFragment$1(PartitioningUtils.scala:354)
[info]   at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
[info]   at scala.collection.Iterator.foreach(Iterator.scala:941)
[info]   at scala.collection.Iterator.foreach$(Iterator.scala:941)
[info]   at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
[info]   at scala.collection.IterableLike.foreach(IterableLike.scala:74)
[info]   at scala.collection.IterableLike.foreach$(IterableLike.scala:73)

@AngersZhuuuu
Copy link
Contributor Author

Also cc @wangyum

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41398/

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41401/

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41401/

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Test build #136821 has finished for PR 32018 at commit 992001b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Test build #136816 has finished for PR 32018 at commit 03325c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Apr 1, 2021

Test build #136817 has finished for PR 32018 at commit 1305fe5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, @AngersZhuuuu could you update PR's description.

@AngersZhuuuu
Copy link
Contributor Author

LGTM, @AngersZhuuuu could you update PR's description.

DOne

@MaxGekk
Copy link
Member

MaxGekk commented Apr 2, 2021

GA are failing on Avro tests, for instance. And jenkins build failed on the latest commit. @AngersZhuuuu To continue with the fix, let's re-trigger tests. Also @cloud-fan could you look at this PR since you reviewed previous changes related to null part values.

@MaxGekk
Copy link
Member

MaxGekk commented Apr 2, 2021

jenkins, retest this, please

@SparkQA
Copy link

SparkQA commented Apr 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41419/

@SparkQA
Copy link

SparkQA commented Apr 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/41419/

@MaxGekk
Copy link
Member

MaxGekk commented Apr 2, 2021

+1, LGTM. Merging to master. The failed GA is a known issue.
Thank you @AngersZhuuuu, and @cloud-fan @wangyum for your review.

@MaxGekk MaxGekk closed this in 65da928 Apr 2, 2021
@MaxGekk
Copy link
Member

MaxGekk commented Apr 2, 2021

BTW, @AngersZhuuuu does the issue exist in 3.1/3.0/2.4? If so, please, backport the changes.

@AngersZhuuuu
Copy link
Contributor Author

BTW, @AngersZhuuuu does the issue exist in 3.1/3.0/2.4? If so, please, backport the changes.

Need to check. I will update here after check this

@SparkQA
Copy link

SparkQA commented Apr 2, 2021

Test build #136841 has finished for PR 32018 at commit 992001b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

LGTM2

@AngersZhuuuu
Copy link
Contributor Author

BTW, @AngersZhuuuu does the issue exist in 3.1/3.0/2.4? If so, please, backport the changes.

Checked all branch, we need to backport to branch-3.0/branch-3.1. Should I raise separated pr or you can just merge to that branchs?

@MaxGekk
Copy link
Member

MaxGekk commented Apr 2, 2021

Could you open separate PRs per each branch, please.

@AngersZhuuuu
Copy link
Contributor Author

Could you open separate PRs per each branch, please.

OK, I will ping you when finish these things.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants