
[SPARK-37217][SQL] The number of dynamic partitions should early check when writing to external tables #34493

Closed
wants to merge 10 commits

Conversation

cxzl25
Contributor

@cxzl25 cxzl25 commented Nov 5, 2021

What changes were proposed in this pull request?

SPARK-29295 introduced a mechanism by which dynamic-partition overwrites of external tables delete the data in the target partitions first.

Suppose 1001 partitions are written: the data of those 1001 partitions will be deleted first, but because hive.exec.max.dynamic.partitions is 1000 by default, loadDynamicPartitions will then fail, and by that point the data of the 1001 partitions has already been deleted.

So we can check whether the number of dynamic partitions is greater than hive.exec.max.dynamic.partitions before deleting anything; the job should fail fast at that point.
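
In code, a minimal sketch of the proposed check, assembled from the diff fragments quoted later in this conversation (variable names such as writtenParts and hadoopConf follow InsertIntoHiveTable; the error message was later moved into QueryExecutionErrors, so this is not the exact merged code):

  if (overwrite && table.tableType == CatalogTableType.EXTERNAL) {
    // Count the dynamic partitions about to be written and compare against the Hive
    // limit before any existing partition data is deleted.
    val numWrittenParts = writtenParts.size
    val maxDynamicPartitionsKey = "hive.exec.max.dynamic.partitions"
    val maxDynamicPartitions = hadoopConf.getInt(maxDynamicPartitionsKey, 1000)
    if (numWrittenParts > maxDynamicPartitions) {
      val maxDynamicPartitionsErrMsg =
        s"Number of dynamic partitions created is $numWrittenParts" +
          s", which is more than $maxDynamicPartitions" +
          s". To solve this try to set $maxDynamicPartitionsKey" +
          s" to at least $numWrittenParts."
      throw new SparkException(maxDynamicPartitionsErrMsg)
    }
  }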

Why are the changes needed?

Avoid unrecoverable data loss when the job fails.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

@github-actions github-actions bot added the SQL label Nov 5, 2021
@AmplabJenkins

Can one of the admins verify this patch?

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you for making a PR, @cxzl25.

s"Number of dynamic partitions created is $numWrittenParts" +
s", which is more than $maxDynamicPartitions" +
s". To solve this try to set $maxDynamicPartitionsKey" +
s" to at least $numWrittenParts."
Member

Do you think we should set hive.exec.max.dynamic.partitions automatically from the Spark side in this case?

Contributor Author

It is possible to adjust hive.exec.max.dynamic.partitions automatically.
However, if it is adjusted automatically, many partitions may be created by accident, and the parameter becomes meaningless.

https://github.com/apache/hive/blob/135629b8d6b538fed092641537034a9fbc59c7a0/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L1857-L1864

@dongjoon-hyun
Member

cc @sunchao

@sunchao
Member

sunchao commented Nov 7, 2021

cc @viirya too since this is related to your change in SPARK-29295

Member

@viirya viirya left a comment

Hmm, my question is: since we are going to overwrite the table partitions, why do we need to prevent the data from being deleted? For any other delete-like command, if a failure happens during deletion, some data will already have been deleted before the failure. I think we don't provide an atomicity guarantee for this command, right?

@cxzl25
Contributor Author

cxzl25 commented Nov 8, 2021

Hmm, my question is: since we are going to overwrite the table partitions, why do we need to prevent the data from being deleted? For any other delete-like command, if a failure happens during deletion, some data will already have been deleted before the failure. I think we don't provide an atomicity guarantee for this command, right?

Yes, I agree with you. The operation is not guaranteed to be atomic, and data deleted before a failure is not guaranteed to be restored.

But in this case, if the number of dynamic partitions exceeds hive.exec.max.dynamic.partitions, Spark deletes the partition data first, and only when client.loadDynamicPartitions loads the data does it detect that the number of partitions exceeds the configured limit and fail immediately. No data is written to the partitions at all.

From the user's point of view the operation simply failed, so in theory the original data should still be there.

The user then has to check whether the number of partitions matches expectations: if it does, the Hive configuration needs to be raised; if not, the SQL logic needs to be changed. Re-running the SQL also takes time, and the data cannot be read during that period.
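
For illustration only, a hypothetical way to hit this path (the table name, location, and SET statement are made up for the example):

  // Hypothetical reproduction: 1001 dynamic partitions exceed the default
  // hive.exec.max.dynamic.partitions of 1000.
  spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
  spark.sql(
    "CREATE EXTERNAL TABLE ext_t (id INT) PARTITIONED BY (p INT) LOCATION '/tmp/ext_t'")
  // Before this PR: Spark first clears the 1001 target partitions and only then fails in
  // client.loadDynamicPartitions, so any data those partitions held from a previous run
  // is lost. With this PR, the job fails before anything is deleted.
  spark.sql(
    "INSERT OVERWRITE TABLE ext_t PARTITION (p) " +
      "SELECT CAST(id AS INT) AS id, CAST(id AS INT) AS p FROM range(1001)")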

@sunchao
Member

sunchao commented Nov 8, 2021

I feel that even though we can't guarantee operations like delete to be atomic, we should make an effort to do so. This PR looks simple enough and fixes a potential issue that could corrupt an external Hive table, so I think it's well worth it?

Member

@viirya viirya left a comment

Yea, this looks simple and is just a check before the Hive operation. As an early check before running the query, it is fine. I just don't think preventing data deletion makes sense as the rationale, because this command is actually going to delete the data and there is no atomicity.

@cxzl25
Contributor Author

cxzl25 commented Nov 8, 2021

It may be that the PR title is not clear.
Maybe I can change it to
The number of dynamic partitions should early check when writing to external tables?

@viirya
Member

viirya commented Nov 8, 2021

Sounds good to me. Thanks.

@cxzl25 cxzl25 changed the title [SPARK-37217][SQL] Dynamic partitions should fail quickly when writing to external tables to prevent data deletion [SPARK-37217][SQL] The number of dynamic partitions should early check when writing to external tables Nov 8, 2021
# Conflicts:
#	sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala
@cxzl25
Contributor Author

cxzl25 commented Nov 22, 2021

Could we continue reviewing this PR? @dongjoon-hyun @sunchao @viirya

Member

@sunchao sunchao left a comment

Sorry @cxzl25, I forgot about this. Left a few minor comments, but this looks good to me.

s", which is more than $maxDynamicPartitions" +
s". To solve this try to set $maxDynamicPartitionsKey" +
s" to at least $numWrittenParts."
throw new SparkException(maxDynamicPartitionsErrMsg)
Member

nit: we may want to group this error message and define it in QueryExecutionErrors.
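
For illustration, a sketch of what such a grouped error could look like in QueryExecutionErrors (the signature mirrors the later diffs in this conversation and the message mirrors the string built inline above; not necessarily the exact merged code):

  // Error helper grouped in QueryExecutionErrors instead of building the message inline.
  def writePartitionExceedConfigSizeWhenDynamicPartitionError(
      numWrittenParts: Int,
      maxDynamicPartitions: Int,
      maxDynamicPartitionsKey: String): Throwable = {
    new SparkException(
      s"Number of dynamic partitions created is $numWrittenParts" +
        s", which is more than $maxDynamicPartitions" +
        s". To solve this try to set $maxDynamicPartitionsKey" +
        s" to at least $numWrittenParts.")
  }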

@@ -192,6 +192,17 @@ case class InsertIntoHiveTable(
if (partition.nonEmpty) {
if (numDynamicPartitions > 0) {
if (overwrite && table.tableType == CatalogTableType.EXTERNAL) {
val numWrittenParts = writtenParts.size
val maxDynamicPartitionsKey = "hive.exec.max.dynamic.partitions"
Member

can we use HiveConf.ConfVars.DYNAMICPARTITIONMAXPARTS.varname for the key and HiveConf.ConfVars.DYNAMICPARTITIONMAXPARTS.defaultIntVal instead of 1000?
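
For reference, a sketch of that substitution (hadoopConf stands for the Hadoop configuration already available in InsertIntoHiveTable; this mirrors the suggestion rather than quoting the merged code):

  import org.apache.hadoop.hive.conf.HiveConf

  // Use Hive's own constant for the key and its default (1000) instead of hard-coding them.
  val maxDynamicPartitionsKey = HiveConf.ConfVars.DYNAMICPARTITIONMAXPARTS.varname
  val maxDynamicPartitions = hadoopConf.getInt(
    maxDynamicPartitionsKey, HiveConf.ConfVars.DYNAMICPARTITIONMAXPARTS.defaultIntVal)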

Member

@sunchao sunchao left a comment

LGTM with one nit

@@ -1905,4 +1905,14 @@ object QueryExecutionErrors {
def cannotConvertOrcTimestampToTimestampNTZError(): Throwable = {
new RuntimeException("Unable to convert timestamp of Orc to data type 'timestamp_ntz'")
}

def writePartitionExceedConfigSizeWhenDynamicPartitionError(numWrittenParts: Int,
Member

nit: format

  def writePartitionExceedConfigSizeWhenDynamicPartitionError(
      numWrittenParts: Int,
      maxDynamicPartitions: Int,
      maxDynamicPartitionsKey: String): Throwable = {
    ...
  }

@@ -1905,4 +1905,15 @@ object QueryExecutionErrors {
def cannotConvertOrcTimestampToTimestampNTZError(): Throwable = {
new RuntimeException("Unable to convert timestamp of Orc to data type 'timestamp_ntz'")
}

def writePartitionExceedConfigSizeWhenDynamicPartitionError(
numWrittenParts: Int,
Member

nit: we need to use 4-space indentation here.

Contributor Author

Sorry, I did not notice the indentation problem here; you already provided an example above. Thank you.

Member

@sunchao sunchao left a comment

LGTM

@sunchao sunchao closed this in 4b849ef Dec 14, 2021
@sunchao
Member

sunchao commented Dec 14, 2021

Merged, thanks! Also going to cherry-pick to branch-3.2.

@sunchao
Member

sunchao commented Dec 14, 2021

@cxzl25 could you open another PR to backport this to branch-3.2? I tried to cherry-pick it but there's some conflict.

@cxzl25
Contributor Author

cxzl25 commented Dec 14, 2021

@cxzl25 could you open another PR to backport this to branch-3.2? I tried to cherry-pick it but there's some conflict.

OK, I will do it now.

cxzl25 added a commit to cxzl25/spark that referenced this pull request Dec 14, 2021
…k when writing to external tables


Closes apache#34493 from cxzl25/SPARK-37217.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Chao Sun <sunchao@apple.com>

(cherry picked from commit 4b849ef)
@cxzl25
Contributor Author

cxzl25 commented Dec 14, 2021

branch-3.2 #34889
