[SPARK-37217][SQL] The number of dynamic partitions should be checked early when writing to external tables #34493
Conversation
…s to prevent data deletion
Can one of the admins verify this patch?
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
Thank you for making a PR, @cxzl25.
s"Number of dynamic partitions created is $numWrittenParts" + | ||
s", which is more than $maxDynamicPartitions" + | ||
s". To solve this try to set $maxDynamicPartitionsKey" + | ||
s" to at least $numWrittenParts." |
Do you think we should set `hive.exec.max.dynamic.partitions` automatically from the Spark side in this case?
It is possible to adjust `hive.exec.max.dynamic.partitions` automatically. However, if it is adjusted automatically, many partitions may be created accidentally, and the parameter becomes meaningless.
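For reference, rather than Spark adjusting the value silently, a user who genuinely needs more partitions can raise the limit per session. A minimal sketch, assuming a Hive-enabled SparkSession named spark (the table names here are hypothetical):

// Raise the Hive limit explicitly for this session; table names are hypothetical.
spark.sql("SET hive.exec.max.dynamic.partitions=2000")
spark.sql("INSERT OVERWRITE TABLE ext_tbl PARTITION (p) SELECT value, p FROM src")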
cc @sunchao
cc @viirya too since this is related to your change in SPARK-29295
Hmm, my question is: since we are going to overwrite the table partitions, why do we need to prevent data from being deleted? For any other delete-like command, if a failure happens during deletion, some data will already have been deleted before the failure. I think we don't provide an atomicity guarantee for this command, right?
Yes, I agree with you. But in this case, if the number of dynamic partitions exceeds the limit, the job fails after the target data has already been deleted. The user thinks the operation was not successful, so theoretically the original data should still be there. Alternatively, the user will check whether the number of partitions meets expectations: if it does, they need to adjust the Hive configuration; if it does not, they need to modify the SQL logic.
I feel that even though we can't guarantee operations like delete are atomic, we should make an effort to do so. This PR looks simple enough and fixes a potential issue which could corrupt an external Hive table, so I think it's well worth it?
Yea, this looks simple and is just a check before the Hive operation. For the purpose of adding an early check before running the query, it is fine. I just don't think "preventing data deletion" makes sense as the rationale, because this command is going to delete the data anyway and there is no atomicity.
It may be that the PR title is not clear.
Sounds good to me. Thanks.
# Conflicts:
#	sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
Can we continue to review this PR? @dongjoon-hyun @sunchao @viirya
Sorry @cxzl25, forgot about this. Left a few minor comments, but this looks good to me.
s", which is more than $maxDynamicPartitions" + | ||
s". To solve this try to set $maxDynamicPartitionsKey" + | ||
s" to at least $numWrittenParts." | ||
throw new SparkException(maxDynamicPartitionsErrMsg) |
nit: we may want to group this error message and define it in `QueryExecutionErrors`.
@@ -192,6 +192,17 @@ case class InsertIntoHiveTable(
if (partition.nonEmpty) {
  if (numDynamicPartitions > 0) {
    if (overwrite && table.tableType == CatalogTableType.EXTERNAL) {
      val numWrittenParts = writtenParts.size
      val maxDynamicPartitionsKey = "hive.exec.max.dynamic.partitions"
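Pieced together from the fragments quoted in this review, the early check looks roughly like the sketch below; the exact wiring of `writtenParts` and `hadoopConf` inside InsertIntoHiveTable is an assumption based on the discussion, not a verbatim quote of the merged patch.

// Sketch of the early check, assembled from the diff fragments above.
val numWrittenParts = writtenParts.size
val maxDynamicPartitionsKey = "hive.exec.max.dynamic.partitions"
val maxDynamicPartitions = hadoopConf.getInt(maxDynamicPartitionsKey, 1000)
if (numWrittenParts > maxDynamicPartitions) {
  // Fail before any target partition data is deleted.
  throw new SparkException(
    s"Number of dynamic partitions created is $numWrittenParts" +
      s", which is more than $maxDynamicPartitions" +
      s". To solve this try to set $maxDynamicPartitionsKey" +
      s" to at least $numWrittenParts.")
}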
Can we use `HiveConf.ConfVars.DYNAMICPARTITIONMAXPARTS.varname` for the key, and `HiveConf.ConfVars.DYNAMICPARTITIONMAXPARTS.defaultIntVal` instead of 1000?
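As a sketch, the suggestion would replace the hard-coded values like so (assuming `hadoopConf` is the Hadoop Configuration already in scope):

import org.apache.hadoop.hive.conf.HiveConf

// Use Hive's own constant and default instead of hard-coding them.
val maxDynamicPartitionsKey = HiveConf.ConfVars.DYNAMICPARTITIONMAXPARTS.varname
val maxDynamicPartitions = hadoopConf.getInt(maxDynamicPartitionsKey,
  HiveConf.ConfVars.DYNAMICPARTITIONMAXPARTS.defaultIntVal)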
LGTM with one nit
@@ -1905,4 +1905,14 @@ object QueryExecutionErrors {
def cannotConvertOrcTimestampToTimestampNTZError(): Throwable = {
  new RuntimeException("Unable to convert timestamp of Orc to data type 'timestamp_ntz'")
}

def writePartitionExceedConfigSizeWhenDynamicPartitionError(numWrittenParts: Int,
nit: format
def writePartitionExceedConfigSizeWhenDynamicPartitionError(
numWrittenParts: Int,
maxDynamicPartitions: Int,
maxDynamicPartitionsKey: String): Throwable = {
...
}
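Filled in with the message text quoted earlier in this review, the completed helper would plausibly read as follows; the merged body may differ slightly:

def writePartitionExceedConfigSizeWhenDynamicPartitionError(
    numWrittenParts: Int,
    maxDynamicPartitions: Int,
    maxDynamicPartitionsKey: String): Throwable = {
  // Grouped error message, as suggested, built from the strings in the diff above.
  new SparkException(
    s"Number of dynamic partitions created is $numWrittenParts" +
      s", which is more than $maxDynamicPartitions" +
      s". To solve this try to set $maxDynamicPartitionsKey" +
      s" to at least $numWrittenParts.")
}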
@@ -1905,4 +1905,15 @@ object QueryExecutionErrors {
def cannotConvertOrcTimestampToTimestampNTZError(): Throwable = {
  new RuntimeException("Unable to convert timestamp of Orc to data type 'timestamp_ntz'")
}

def writePartitionExceedConfigSizeWhenDynamicPartitionError(
  numWrittenParts: Int,
nit: we need to use 4-space indentation here.
Sorry, I did not notice the indentation problem here; you already provided an example above, thank you.
LGTM
Merged, thanks! Also going to cherry-pick to branch-3.2.
@cxzl25 could you open another PR to backport this to branch-3.2? I tried to cherry-pick it but there's some conflict.
OK, I will do it now.
…k when writing to external tables

### What changes were proposed in this pull request?
SPARK-29295 introduced a mechanism by which writes to external tables that use dynamic partitioning first delete the data in the target partitions. Assuming that 1001 partitions are written, the data of those 1001 partitions will be deleted first; but because `hive.exec.max.dynamic.partitions` is 1000 by default, loadDynamicPartitions will then fail, even though the data of the 1001 partitions has already been deleted. So we can check whether the number of dynamic partitions is greater than `hive.exec.max.dynamic.partitions` before deleting; the query should fail fast in that case.

### Why are the changes needed?
Avoid deleting data that cannot be recovered when the job fails.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a UT.

Closes apache#34493 from cxzl25/SPARK-37217.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Chao Sun <sunchao@apple.com>
(cherry picked from commit 4b849ef)
branch-3.2 #34889
What changes were proposed in this pull request?
SPARK-29295 introduced a mechanism by which writes to external tables that use dynamic partitioning first delete the data in the target partitions.
Assuming that 1001 partitions are written, the data of those 1001 partitions will be deleted first; but because `hive.exec.max.dynamic.partitions` is 1000 by default, loadDynamicPartitions will then fail, even though the data of the 1001 partitions has already been deleted. So we can check whether the number of dynamic partitions is greater than `hive.exec.max.dynamic.partitions` before deleting; the query should fail fast in that case.
Why are the changes needed?
Avoid deleting data that cannot be recovered when the job fails.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Added a UT.
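To illustrate the scenario the test exercises, a hedged sketch is shown below; the table name and data are hypothetical, and spark is assumed to be a Hive-enabled SparkSession.

// Hypothetical reproduction of the failure mode this patch guards against.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.range(1001).createOrReplaceTempView("src") // 1001 distinct partition values
spark.sql(
  """INSERT OVERWRITE TABLE ext_tbl PARTITION (p)
    |SELECT id AS value, id AS p FROM src""".stripMargin)
// Before the patch: the 1001 target partitions were deleted first, and
// loadDynamicPartitions then failed against the default limit of 1000.
// With the patch: the query fails fast, before any partition data is deleted.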