[SPARK-33706][SQL] Require fully specified partition identifier in partitionExists() #30667

Closed
wants to merge 8 commits into from

MaxGekk
Member

@MaxGekk MaxGekk commented Dec 8, 2020

What changes were proposed in this pull request?

  1. Check that the partition identifier passed to SupportsPartitionManagement.partitionExists() is fully specified (specifies all values of partition fields).
  2. Remove the custom implementation of partitionExists() from InMemoryPartitionTable, and re-use the default implementation from SupportsPartitionManagement.

Why are the changes needed?

The method is supposed to check the existence of a single partition, but currently it can return true for a partially specified partition. This can lead to incorrect command behavior; for instance, commands could modify or place data in the middle of a partition path.
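
To illustrate the problem, here is a minimal sketch of a prefix-style existence check. It does not use Spark's classes: `PartialIdentSketch`, the `year`/`month` fields, and the sample partitions are all hypothetical, and plain lists stand in for `InternalRow`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a partitioned table; this is not Spark's API.
final class PartialIdentSketch {
  // Existing partitions, each holding one value per partition field (year, month).
  static final List<List<Object>> PARTITIONS = Arrays.asList(
      Arrays.<Object>asList(2020, 12),
      Arrays.<Object>asList(2021, 1));

  // Lenient check: true if any partition starts with the values in ident,
  // so an identifier covering only the leading fields also matches.
  static boolean existsByPrefix(List<?> ident) {
    return PARTITIONS.stream()
        .anyMatch(p -> ident.size() <= p.size()
            && p.subList(0, ident.size()).equals(ident));
  }

  public static void main(String[] args) {
    System.out.println(existsByPrefix(Arrays.asList(2020, 12))); // true: one partition
    System.out.println(existsByPrefix(Arrays.asList(2020)));     // true: only a path prefix
    System.out.println(existsByPrefix(Arrays.asList(2019)));     // false
  }
}
```

In this sketch the identifier `(2020)` names no single partition, yet the check reports that it exists, which is the behavior the PR sets out to reject.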

Does this PR introduce any user-facing change?

No

How was this patch tested?

By running existing test suites:

$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SupportsPartitionManagementSuite"

@MaxGekk
Member Author

MaxGekk commented Dec 8, 2020

We don't need to catch exceptions from partitionExists() in ALTER TABLE .. ADD/DROP PARTITION, because the partition spec resolver guarantees a fully specified partition id after PR #30624.

@MaxGekk
Member Author

MaxGekk commented Dec 8, 2020

@cloud-fan @stczwd @rdblue May I ask you to review this PR?

@@ -80,9 +79,14 @@ void createPartition(
* @return true if the partition exists, false otherwise
*/
default boolean partitionExists(InternalRow ident) {
Contributor

@cloud-fan cloud-fan Dec 8, 2020

can we also update the javadoc of this method?

Member Author

What do you propose? The doc says clearly that the method tests existence of A partition. Do you want to say that ident must contain values for ALL partition fields?

Contributor

how about @param ident a partition identifier which must contain all partition fields in order?
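
A rough sketch of how the suggested javadoc could sit on a default method that enforces it. This is a simplified, hypothetical model rather than Spark's actual interface: `PartitionManagementSketch` and `InMemoryTableSketch` are made-up names, `Object[]` stands in for `InternalRow`, `partitionNames()` stands in for `partitionSchema().names()`, and the exception message wording is illustrative only.

```java
import java.util.Arrays;

// Simplified, hypothetical stand-in for SupportsPartitionManagement.
interface PartitionManagementSketch {
  /** Names of the partition fields, in order. */
  String[] partitionNames();

  /** Lists partitions whose values for {@code names} equal {@code ident}. */
  Object[][] listPartitionIdentifiers(String[] names, Object[] ident);

  /**
   * Test whether a partition exists.
   *
   * @param ident a partition identifier which must contain all partition fields in order
   * @return true if the partition exists, false otherwise
   * @throws IllegalArgumentException if ident does not cover every partition field
   */
  default boolean partitionExists(Object[] ident) {
    String[] names = partitionNames();
    if (ident.length != names.length) {
      // Message text is an assumption, not the wording used in the PR.
      throw new IllegalArgumentException("Expected " + names.length
          + " partition values but got " + ident.length);
    }
    return listPartitionIdentifiers(names, ident).length > 0;
  }
}

// Toy in-memory table used only to exercise the default method.
final class InMemoryTableSketch implements PartitionManagementSketch {
  private final Object[][] partitions = { {2020, 12}, {2021, 1} };

  @Override public String[] partitionNames() {
    return new String[] {"year", "month"};
  }

  @Override public Object[][] listPartitionIdentifiers(String[] names, Object[] ident) {
    // Assumes names is a leading subset of partitionNames().
    return Arrays.stream(partitions)
        .filter(p -> Arrays.equals(Arrays.copyOfRange(p, 0, names.length), ident))
        .toArray(Object[][]::new);
  }

  public static void main(String[] args) {
    PartitionManagementSketch table = new InMemoryTableSketch();
    System.out.println(table.partitionExists(new Object[] {2020, 12}));
    System.out.println(table.partitionExists(new Object[] {2019, 12}));
  }
}
```

With this shape, implementations that only provide `listPartitionIdentifiers` inherit the length check, which matches the PR's second change of dropping the custom override in `InMemoryPartitionTable`.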

if (ident.numFields() == partitionNames.length) {
return listPartitionIdentifiers(partitionNames, ident).length > 0;
} else {
throw new IllegalArgumentException("The number of fields (" + ident.numFields() +
Contributor

@MaxGekk so Spark will never call this method with a partial partition spec?

Member Author

Currently, it shouldn't, especially after PR #30624. Maybe I missed some places, but as far as I can see, all calls of partitionExists() pass fully specified partition ids.

Contributor

OK, then requiring implementations to fail for this case makes sense to me.

Member Author

@MaxGekk MaxGekk Dec 8, 2020

As an alternative, we could replace the exact check by an assert. Not sure that it is the appropriate solution for default API implementation, though ...
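
For comparison, a small sketch of the two options. One relevant difference is that a Java `assert` is skipped unless the JVM runs with assertions enabled (`-ea`), while an explicit check is always enforced. The class and method names below are hypothetical and the messages are illustrative.

```java
// Hypothetical helper contrasting an explicit check with an assert.
final class IdentCheckSketch {
  // Explicit check: enforced on every call, regardless of JVM flags.
  static void requireFull(int identFields, int schemaFields) {
    if (identFields != schemaFields) {
      throw new IllegalArgumentException(
          "Expected " + schemaFields + " partition values but got " + identFields);
    }
  }

  // Assertion-based check: evaluated only when assertions are enabled (-ea).
  static void assertFull(int identFields, int schemaFields) {
    assert identFields == schemaFields
        : "Expected " + schemaFields + " partition values but got " + identFields;
  }

  // Reports whether assertions are enabled for this class.
  static boolean assertionsEnabled() {
    boolean enabled = false;
    assert enabled = true; // the assignment runs only when assertions are on
    return enabled;
  }

  public static void main(String[] args) {
    System.out.println("assertions enabled: " + assertionsEnabled());
    assertFull(1, 2);  // throws AssertionError only under -ea
    requireFull(1, 2); // always throws IllegalArgumentException
  }
}
```

Under default JVM settings the assert variant would let a partial identifier through silently, which may be why it is a doubtful fit for a default API implementation that third-party tables inherit.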

@github-actions github-actions bot added the SQL label Dec 8, 2020
@SparkQA

SparkQA commented Dec 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37019/

@SparkQA

SparkQA commented Dec 8, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37019/

String[] partitionNames = partitionSchema().names();
String[] requiredNames = Arrays.copyOfRange(partitionNames, 0, ident.numFields());
return listPartitionIdentifiers(requiredNames, ident).length > 0;
String[] partitionNames = partitionSchema().names();
Member Author

@MaxGekk MaxGekk Dec 8, 2020

I hope the indentation of 2 spaces is ok. I see 2 spaces in other Java classes.

@SparkQA

SparkQA commented Dec 8, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37029/

@SparkQA

SparkQA commented Dec 8, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37029/

@SparkQA

SparkQA commented Dec 8, 2020

Test build #132419 has finished for PR 30667 at commit 1a10ee2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 8, 2020

Test build #132428 has finished for PR 30667 at commit 4c5cd18.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Dec 11, 2020

Any objections to the changes?

@cloud-fan
Contributor

I'm merging it to master/3.1, as it doesn't change any user-visible behavior. Spark will never call partitionExists with a partial partition spec, so this PR is more of a code cleanup and API doc improvement.

@cloud-fan cloud-fan closed this in 8b97b19 Dec 11, 2020
cloud-fan pushed a commit that referenced this pull request Dec 11, 2020
…rtitionExists()

### What changes were proposed in this pull request?
1. Check that the partition identifier passed to `SupportsPartitionManagement.partitionExists()` is fully specified (specifies all values of partition fields).
2. Remove the custom implementation of `partitionExists()` from `InMemoryPartitionTable`, and re-use the default implementation from `SupportsPartitionManagement`.

### Why are the changes needed?
The method is supposed to check the existence of a single partition, but currently it can return `true` for a partially specified partition. This can lead to incorrect command behavior; for instance, commands could modify or place data in the middle of a partition path.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running existing test suites:
```
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SupportsPartitionManagementSuite"
```

Closes #30667 from MaxGekk/check-len-partitionExists.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8b97b19)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@MaxGekk MaxGekk deleted the check-len-partitionExists branch February 19, 2021 15:04

3 participants