-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-33706][SQL] Require fully specified partition identifier in partitionExists() #30667
Conversation
@cloud-fan @stczwd @rdblue May I ask you to review this PR. |
@@ -80,9 +79,14 @@ void createPartition( | |||
* @return true if the partition exists, false otherwise | |||
*/ | |||
default boolean partitionExists(InternalRow ident) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also update the javadoc of this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you propose? The doc says clearly that the method tests existence of A partition. Do you want to say that ident
must contain values for ALL partition fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about @param ident a partition identifier which must contain all partition fields in order
?
if (ident.numFields() == partitionNames.length) { | ||
return listPartitionIdentifiers(partitionNames, ident).length > 0; | ||
} else { | ||
throw new IllegalArgumentException("The number of fields (" + ident.numFields() + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGekk so spark will never call this method with partial partition spec?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, it shouldn't especially after this PR #30624. Maybe I missed some places but as far as I can see, all calls of partitionExists()
pass fully specified partition ids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, then requiring implementations to fail for this case makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As an alternative, we could replace the exact check by an assert. Not sure that it is the appropriate solution for default API implementation, though ...
Kubernetes integration test starting |
Kubernetes integration test status success |
String[] partitionNames = partitionSchema().names(); | ||
String[] requiredNames = Arrays.copyOfRange(partitionNames, 0, ident.numFields()); | ||
return listPartitionIdentifiers(requiredNames, ident).length > 0; | ||
String[] partitionNames = partitionSchema().names(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I hope the indentation of 2 spaces is ok. I see 2 spaces in other Java classes.
Kubernetes integration test starting |
Kubernetes integration test status success |
Test build #132419 has finished for PR 30667 at commit
|
Test build #132428 has finished for PR 30667 at commit
|
Any objections for the changes? |
I'm merging it to master/3.1, as it doesn't change any user-visible behavior. Spark will never call |
…rtitionExists() ### What changes were proposed in this pull request? 1. Check that the partition identifier passed to `SupportsPartitionManagement.partitionExists()` is fully specified (specifies all values of partition fields). 2. Remove the custom implementation of `partitionExists()` from `InMemoryPartitionTable`, and re-use the default implementation from `SupportsPartitionManagement`. ### Why are the changes needed? The method is supposed to check existence of one partition but currently it can return `true` for partially specified partition. This can lead to incorrect commands behavior, for instance the commands could modify or place data in the middle of partition path. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running existing test suites: ``` $ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SupportsPartitionManagementSuite" ``` Closes #30667 from MaxGekk/check-len-partitionExists. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 8b97b19) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
SupportsPartitionManagement.partitionExists()
is fully specified (specifies all values of partition fields).partitionExists()
fromInMemoryPartitionTable
, and re-use the default implementation fromSupportsPartitionManagement
.Why are the changes needed?
The method is supposed to check existence of one partition but currently it can return
true
for partially specified partition. This can lead to incorrect commands behavior, for instance the commands could modify or place data in the middle of partition path.Does this PR introduce any user-facing change?
No
How was this patch tested?
By running existing test suites: