[SPARK-33706][SQL] Require fully specified partition identifier in partitionExists() #30667

MaxGekk · 2020-12-08T08:17:23Z

What changes were proposed in this pull request?

1. Check that the partition identifier passed to SupportsPartitionManagement.partitionExists() is fully specified (specifies all values of partition fields).
2. Remove the custom implementation of partitionExists() from InMemoryPartitionTable, and re-use the default implementation from SupportsPartitionManagement.

Why are the changes needed?

The method is supposed to check the existence of a single partition, but currently it can return true for a partially specified partition. This can lead to incorrect command behavior; for instance, a command could modify or place data in the middle of a partition path.
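
To illustrate the problem described above, here is a minimal, self-contained Java sketch. It does not use Spark's classes: `PartialSpecDemo`, the `(year, month)` partition columns, the sample partitions, and the use of `List<Object>` in place of `InternalRow` are all assumptions made for illustration, and the wording of the error message is invented. It contrasts a lenient prefix-style check with a strict one that requires values for all partition fields.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a partitioned table; this is not Spark's API.
class PartialSpecDemo {
  // Partition columns, in order (assumed for illustration).
  static final String[] PARTITION_NAMES = {"year", "month"};

  // Existing partitions, each a full (year, month) identifier.
  static final List<List<Object>> PARTITIONS = Arrays.asList(
      Arrays.<Object>asList(2020, 11),
      Arrays.<Object>asList(2020, 12));

  // Lenient check: treats ident as a prefix, so a partial spec can match.
  static boolean existsLenient(List<Object> ident) {
    for (List<Object> p : PARTITIONS) {
      if (ident.size() <= p.size() && p.subList(0, ident.size()).equals(ident)) {
        return true;
      }
    }
    return false;
  }

  // Strict check: rejects identifiers that do not cover every partition field.
  static boolean existsStrict(List<Object> ident) {
    if (ident.size() != PARTITION_NAMES.length) {
      // Message wording is illustrative only.
      throw new IllegalArgumentException("Expected " + PARTITION_NAMES.length +
          " partition values but got " + ident.size());
    }
    return PARTITIONS.contains(ident);
  }
}
```

Under these assumptions, `existsLenient(Arrays.<Object>asList(2020))` returns true even though `2020` alone names a directory level rather than one partition, while `existsStrict` rejects the same identifier with an `IllegalArgumentException`.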

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running existing test suites:

$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SupportsPartitionManagementSuite"

MaxGekk · 2020-12-08T08:22:13Z

We don't need to catch exceptions from partitionExists() in ALTER TABLE .. ADD/DROP PARTITION here:

spark/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/AlterTableAddPartitionExec.scala

Line 40 in 0a612b6

partSpecs.partition(p => table.partitionExists(p.ident))

because the partition spec resolver guarantees a fully specified partition id after PR #30624.
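
For readers less familiar with the Scala `partition` call quoted above, the same split can be sketched in Java. This is only an analogy under stated assumptions: `SplitByExistenceSketch` and `split` are hypothetical names, and the predicate stands in for `table.partitionExists`, which is assumed here to receive only fully specified identifiers and therefore not to throw.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper; not part of Spark.
class SplitByExistenceSketch {
  // Splits identifiers into those for which the predicate holds (key true)
  // and those for which it does not (key false), preserving input order.
  static <T> Map<Boolean, List<T>> split(List<T> idents, Predicate<T> partitionExists) {
    return idents.stream().collect(Collectors.partitioningBy(partitionExists));
  }
}
```

Because the predicate is never expected to throw for resolved specs, the split needs no try/catch around it.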

MaxGekk · 2020-12-08T08:23:10Z

@cloud-fan @stczwd @rdblue May I ask you to review this PR?

cloud-fan · 2020-12-08T08:27:31Z

...talyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java

@@ -80,9 +79,14 @@ void createPartition(
     * @return true if the partition exists, false otherwise
     */
    default boolean partitionExists(InternalRow ident) {


can we also update the javadoc of this method?

What do you propose? The doc says clearly that the method tests existence of A partition. Do you want to say that ident must contain values for ALL partition fields?

how about "@param ident a partition identifier which must contain all partition fields in order"?

cloud-fan · 2020-12-08T08:28:52Z

...talyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java

+      if (ident.numFields() == partitionNames.length) {
+        return listPartitionIdentifiers(partitionNames, ident).length > 0;
+      } else {
+        throw new IllegalArgumentException("The number of fields (" + ident.numFields() +


@MaxGekk so spark will never call this method with partial partition spec?

Currently, it shouldn't, especially after PR #30624. Maybe I missed some places, but as far as I can see, all calls of partitionExists() pass fully specified partition ids.

OK, then requiring implementations to fail for this case makes sense to me.

As an alternative, we could replace the exact check with an assert. I'm not sure that is an appropriate solution for a default API implementation, though ...
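
The trade-off between the two options can be sketched as follows. This is a hypothetical, Spark-free illustration: `ArityCheckSketch` and both method names are made up, and the message text is invented. The relevant fact is that Java `assert` statements are disabled by default and only take effect when the JVM is started with `-ea`.

```java
// Hypothetical helper contrasting the two validation styles; not Spark code.
class ArityCheckSketch {
  // Explicit check: always enforced, regardless of JVM flags.
  static void requireFullSpec(int identFields, int partitionFields) {
    if (identFields != partitionFields) {
      throw new IllegalArgumentException("Expected " + partitionFields +
          " partition values but got " + identFields);
    }
  }

  // Assertion: enforced only when assertions are enabled (java -ea);
  // otherwise a mismatch passes silently.
  static void assertFullSpec(int identFields, int partitionFields) {
    assert identFields == partitionFields
        : "Expected " + partitionFields + " partition values but got " + identFields;
  }
}
```

With an assert, a connector calling the default implementation in production would typically get no check at all, which is one reason an explicit exception may suit a public API default better.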

SparkQA · 2020-12-08T09:21:35Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37019/

SparkQA · 2020-12-08T10:00:27Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37019/

MaxGekk · 2020-12-08T10:47:13Z

...talyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsPartitionManagement.java

-        String[] partitionNames = partitionSchema().names();
-        String[] requiredNames = Arrays.copyOfRange(partitionNames, 0, ident.numFields());
-        return listPartitionIdentifiers(requiredNames, ident).length > 0;
+      String[] partitionNames = partitionSchema().names();


I hope the indentation of 2 spaces is ok. I see 2 spaces in other Java classes.

SparkQA · 2020-12-08T11:30:06Z

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37029/

SparkQA · 2020-12-08T12:01:49Z

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37029/

SparkQA · 2020-12-08T12:57:17Z

Test build #132419 has finished for PR 30667 at commit 1a10ee2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-12-08T15:19:03Z

Test build #132428 has finished for PR 30667 at commit 4c5cd18.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2020-12-11T07:15:32Z

Any objections to the changes?

cloud-fan · 2020-12-11T12:48:02Z

I'm merging it to master/3.1, as it doesn't change any user-visible behavior. Spark will never call partitionExists with a partial partition spec, so this PR is more of a code cleanup and API doc improvement.

…rtitionExists()

### What changes were proposed in this pull request?
1. Check that the partition identifier passed to `SupportsPartitionManagement.partitionExists()` is fully specified (specifies all values of partition fields).
2. Remove the custom implementation of `partitionExists()` from `InMemoryPartitionTable`, and re-use the default implementation from `SupportsPartitionManagement`.

### Why are the changes needed?
The method is supposed to check existence of one partition but currently it can return `true` for partially specified partition. This can lead to incorrect commands behavior, for instance the commands could modify or place data in the middle of partition path.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running existing test suites:
```
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SupportsPartitionManagementSuite"
```

Closes #30667 from MaxGekk/check-len-partitionExists.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8b97b19)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

MaxGekk added 7 commits December 5, 2020 22:29

Change partitionExists()

7d45e74

Merge remote-tracking branch 'origin/master' into check-len-partitionExists

b45598f

Remove partitionExists() from InMemoryPartitionTable

ba5ccda

Fix indentation

6db6db0

Fix error message

2a23b1d

Add tests for partitionExists()

c82f740

Remove an unused import

1a10ee2

cloud-fan reviewed Dec 8, 2020

github-actions bot added the SQL label Dec 8, 2020

Update Java doc

4c5cd18

MaxGekk commented Dec 8, 2020

cloud-fan closed this in 8b97b19 Dec 11, 2020

MaxGekk deleted the check-len-partitionExists branch February 19, 2021 15:04
