
[HUDI-5884] Support bulk_insert for insert_overwrite and insert_overwrite_table #8076

Merged

Conversation

boneanxs
Contributor

@boneanxs boneanxs commented Feb 28, 2023

Change Logs

  1. For Spark SQL, add bulk_insert support for insert_overwrite and insert_overwrite_table
  2. Add tests to cover new feature

Impact

To keep behavior consistent, this PR still deletes the old data completely when overwriting the whole table with BULK_INSERT enabled. The tables below summarize current Hudi behavior under different SaveMode and Operation combinations (especially those related to BULK_INSERT and INSERT_OVERWRITE).

DataFrame:

| SaveMode | Operation | Behavior |
| --- | --- | --- |
| Overwrite | BULK_INSERT | delete whole table data |
| Overwrite | INSERT_OVERWRITE_TABLE | use replaceCommit to overwrite all table data |
| Overwrite | INSERT_OVERWRITE | delete whole table data |
| Append | INSERT_OVERWRITE | use replaceCommit to overwrite old partitions |
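For reference, the DataFrame rows above can be exercised with a write along these lines. This is only a sketch: it assumes an existing DataFrame `df` and SparkSession, and the table name, path, and column names (`id`, `dt`, `my_table`) are hypothetical.

```scala
import org.apache.spark.sql.SaveMode

// Sketch: Overwrite + BULK_INSERT corresponds to the first row of the table
// above (deletes the whole table's data before writing).
df.write.format("hudi").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.table.name", "my_table").
  mode(SaveMode.Overwrite).
  save("/tmp/hudi/my_table")
```

Switching the operation to `insert_overwrite` with `SaveMode.Append` would instead take the replaceCommit path that overwrites only the touched partitions.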

SQL:

| SQL Type | Operation | Behavior (before) | Behavior (after) |
| --- | --- | --- | --- |
| INSERT OVERWRITE | BULK_INSERT | delete whole table data | same |
| INSERT OVERWRITE PARTITION | BULK_INSERT | not supported | use replaceCommit to overwrite old partitions |
| INSERT OVERWRITE | INSERT_OVERWRITE_TABLE | use replaceCommit to overwrite all table data | same |
| INSERT OVERWRITE PARTITION | INSERT_OVERWRITE | use replaceCommit to overwrite old partitions | same |
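The SQL rows above can be driven from Spark SQL roughly as follows. This is a sketch: it assumes `hoodie.sql.bulk.insert.enable` is the session config that switches SQL inserts onto the bulk_insert path, and the table and partition names are hypothetical.

```scala
// Sketch: Spark SQL from Scala; table/partition names are hypothetical.
spark.sql("set hoodie.sql.bulk.insert.enable = true")

// INSERT OVERWRITE of the whole table (BULK_INSERT path).
spark.sql("insert overwrite table my_hudi_table select * from staging")

// INSERT OVERWRITE PARTITION: with this PR, this uses replaceCommit
// to overwrite only the targeted partition instead of failing.
spark.sql(
  """insert overwrite table my_hudi_table partition (dt = '2023-01-01')
    |select id, name from staging_day
    |""".stripMargin)
```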

There are still some issues that need to be addressed:

  1. INSERT_OVERWRITE with Overwrite mode should not delete the whole table's data; it only needs to overwrite the old partitions
  2. We should always keep old data for time-travel and ACID compliance, but currently Overwrite mode deletes all data first

Created an issue to track this: HUDI-6286

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@boneanxs boneanxs marked this pull request as draft February 28, 2023 10:50
@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch from f8d8080 to be21282 Compare February 28, 2023 10:53
@boneanxs
Contributor Author

@hudi-bot run azure

@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch from 6a239ad to f384bbc Compare March 4, 2023 04:51
@boneanxs boneanxs changed the title Support bulk_insert for insert_overwrite and insert_overwrite_table [HUDI-5884] Support bulk_insert for insert_overwrite and insert_overwrite_table Mar 7, 2023
@boneanxs boneanxs marked this pull request as ready for review March 7, 2023 02:30
@boneanxs
Contributor Author

boneanxs commented Mar 7, 2023

Hey @alexeykudinkin @nsivabalan, could you please take a look?

mode = SaveMode.Overwrite
isOverWriteTable = true
val mode = if (overwrite) {
SaveMode.Overwrite
Contributor Author

Given that Overwrite mode doesn't care about the old data, do we need to enable bulk_insert by default when it's Overwrite mode?

Member

I think that's a good suggestion. cc @nsivabalan @yihua

} else if (mode == SaveMode.Overwrite && tableExists && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) {
// When user set operation as INSERT_OVERWRITE_TABLE,
// overwrite will use INSERT_OVERWRITE_TABLE operator in doWriteOperation
} else if (mode == SaveMode.Overwrite && tableExists &&
Contributor Author

Not sure why we need to explicitly delete the old data in Overwrite mode; this behavior actually makes Hudi non-ACID-compliant (I kept it here to make the tests pass).

Maybe we should only delete old data when using the drop table command?

Member

Do you mean, for Overwrite mode, we should not delete the basePath. Just overwrite the existing data. If so, I agree with you. Probably something to tackle in another PR.

Contributor Author

Do you mean, for Overwrite mode, we should not delete the basePath. Just overwrite the existing data.

Yea

Probably something to tackle in another PR.

Sure, will fix it in another PR

Member

Please file a JIRA to track this change.

Contributor Author

Sure, created: HUDI-6286

@boneanxs
Contributor Author

boneanxs commented Mar 7, 2023

Hi @stream2000, could you also please review this? This fixes PR #8015.

+ " To use row writer please switch to spark 2 or spark 3");
}

records.write().format(targetFormat)
Contributor Author

This still keeps the old behavior for the bulk_insert here. Maybe we should also use HoodieDatasetBulkInsertHelper.bulkInsert to perform the write operation? We could then reduce a lot of code for handling commit behavior (the path here adds a completed commit, while HoodieDatasetBulkInsertHelper.bulkInsert doesn't, so we need to handle this differently in bulkInsertAsRow).

@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch 2 times, most recently from 8becde2 to 5914b1a Compare March 20, 2023 01:55
@boneanxs
Contributor Author

Gentle ping @alexeykudinkin @xushiyan @danny0405 @yihua

Member

@codope codope left a comment

@boneanxs I am yet to review fully, but have taken one pass. Can you break it down into two PRs - a) don't delete the table location if using SaveMode.Overwrite for bulk_insert, insert_overwrite, b) add support for bulk_insert for insert_overwrite and insert_overwrite_table.

Also, I want to understand the use case when we need this. If you can elaborate a bit more on why we need this, that would be great.

public static final ConfigProperty<String> BULKINSERT_INPUT_DATA_SCHEMA_DDL = ConfigProperty
.key("hoodie.bulkinsert.schema.ddl")
.noDefaultValue()
.withDocumentation("Schema set for row writer/bulk insert.");

public static final ConfigProperty<String> BULKINSERT_OVERWRITE_MODE = ConfigProperty
.key("hoodie.bulkinsert.overwrite.mode")
Member

The value for this config is a write operation type. So, its key should be named accordingly.

import java.util.List;
import java.util.Map;

public abstract class BaseDatasetBulkCommitActionExecutor implements Serializable {
Member

Do we need this abstraction at a higher layer i.e. in hudi-client-common? And then maybe extend in hudi-spark-common for Dataset?

Contributor Author

Yes, at first I tried to put this in hudi-client-common, but BaseDatasetBulkCommitActionExecutor needs to access DataSourceUtils and DataSourceWriteOptions, and I'm not sure it's reasonable to move those classes there. I'm also afraid those two classes have other dependents, which we would then need to change as well.


public abstract class BaseDatasetBulkCommitActionExecutor implements Serializable {

protected final HoodieWriteConfig writeConfig;
Member

Do we need to serialize write config too or can it be transient?

import java.util.List;
import java.util.Map;

public abstract class BaseDatasetBulkCommitActionExecutor implements Serializable {
Member

Suggested change
public abstract class BaseDatasetBulkCommitActionExecutor implements Serializable {
public abstract class BaseDatasetBulkInsertCommitActionExecutor implements Serializable {

mode = SaveMode.Overwrite
isOverWriteTable = true
val mode = if (overwrite) {
SaveMode.Overwrite
Member

I think that's a good suggestion. cc @nsivabalan @yihua

val writeConfig = DataSourceUtils.createHoodieConfig(writerSchemaStr, basePath.toString, tblName, opts)
val executor = mode match {
case SaveMode.Append =>
new DatasetBulkInsertActionExecutor(writeConfig, writeClient, instantTime)
Contributor

Could we use writeClient to do the insert overwrite instead of calling the xxxActionExecutor directly?

Contributor Author

writeClient is specifically for RDD[HoodieRecord]; since all the xxxActionExecutors here are Dataset[Row]-based, I didn't put this logic there.

import java.util.Map;
import java.util.stream.Collectors;

public class DatasetBulkInsertActionExecutor extends BaseDatasetBulkCommitActionExecutor {
Contributor

Maybe we should change DatasetBulkInsertActionExecutor -> DatasetBulkInsertCommitActionExecutor, since it is a subclass of BaseCommitActionExecutor.

Contributor Author

make sense, will change

import java.util.Map;
import java.util.stream.Collectors;

public class DatasetBulkInsertOverwriteActionExecutor extends BaseDatasetBulkCommitActionExecutor {
Contributor

Ditto, DatasetBulkInsertOverwriteCommitActionExecutor

import java.util.List;
import java.util.Map;

public class DatasetBulkInsertOverwriteTableActionExecutor extends DatasetBulkInsertOverwriteActionExecutor {
Contributor

Ditto, DatasetBulkInsertOverwriteTableCommitActionExecutor

@boneanxs
Contributor Author

boneanxs commented Apr 3, 2023

I am yet to review fully, but have taken one pass. Can you break it down into two PRs - a) don't delete the table location if using SaveMode.Overwrite for bulk_insert, insert_overwrite, b) add support for bulk_insert for insert_overwrite and insert_overwrite_table.

Yea, sure, will do so

Also, I want to understand the use case when we need this. If you can elaborate a bit more on why we need this, that would be great.

Currently, we want to migrate all existing Hive tables to Hudi tables. Many of these Hive tables:

  1. usually perform the insert_overwrite operation to overwrite partitions
  2. are written by batch jobs and can contain TB-level data in a single day
  3. don't need to perform tagging or dropping of duplicates

bulk_insert mode fits such a scenario well: we can use it to boost write performance and make it easier for users to migrate existing Hive tables to Hudi.
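The migration scenario described above could be sketched as a daily job like the following. This is only an illustration: it assumes `hoodie.sql.bulk.insert.enable` switches the SQL insert onto the bulk_insert path, and the database, table, and column names are hypothetical.

```scala
// Sketch: migrating one day's Hive partition into a Hudi table using the
// bulk_insert-backed insert overwrite that this PR enables.
spark.sql("set hoodie.sql.bulk.insert.enable = true")

// No tagging or dedup is needed for these batch-written tables, so the
// bulk_insert row-writer path is sufficient and fast.
spark.sql(
  """insert overwrite table hudi_db.events partition (dt = '2023-01-01')
    |select event_id, payload from hive_db.events where dt = '2023-01-01'
    |""".stripMargin)
```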

@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch from 5914b1a to 1fadedf Compare April 26, 2023 08:43
@boneanxs boneanxs requested a review from codope April 27, 2023 02:02
@boneanxs
Contributor Author

boneanxs commented May 9, 2023

Hi @codope @stream2000 Gentle ping... Could you please take a look again?

| partitioned by (dt, hh)
| location '${tmp.getCanonicalPath}/$tableMultiPartition'
""".stripMargin)
test("Test bulk insert with insert into for non partitioned table") {
Member

These tests are only testing for default values of BULKINSERT_OVERWRITE_OPERATION_TYPE right? Can we also test for the other possible value?

Contributor Author

@boneanxs boneanxs May 23, 2023

Test bulk insert with insert overwrite table tests INSERT_OVERWRITE_TABLE, and Test bulk insert with insert overwrite partition tests INSERT_OVERWRITE.

Together these two tests cover all values of BULKINSERT_OVERWRITE_OPERATION_TYPE.

@@ -106,8 +106,14 @@ private class HoodieV1WriteBuilder(writeOptions: CaseInsensitiveStringMap,
override def toInsertableRelation: InsertableRelation = {
new InsertableRelation {
override def insert(data: DataFrame, overwrite: Boolean): Unit = {
val mode = if (overwriteTable || overwritePartition) {
Member

Can you confirm whether, if it's insert_overwrite_table, the table basePath will still be removed?

Contributor Author

With this PR, it won't delete the basePath.

// HoodieSparkSqlWriter#handleSaveModes
// won't delete the path if it's Overwrite mode and INSERT_OVERWRITE_TABLE, INSERT_OVERWRITE

else if (mode == SaveMode.Overwrite && tableExists &&
        (operation != WriteOperationType.INSERT_OVERWRITE_TABLE
          && operation != WriteOperationType.INSERT_OVERWRITE
          && operation != WriteOperationType.BULK_INSERT)) {
        // For INSERT_OVERWRITE_TABLE, INSERT_OVERWRITE and BULK_INSERT with Overwrite mode,
        // we'll use replacecommit to overwrite the old data.
        log.warn(s"hoodie table at $tablePath already exists. Deleting existing data & overwriting with new data.")
        fs.delete(tablePath, true)
        tableExists = false
      }

} else if (mode == SaveMode.Overwrite && tableExists && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) {
// When user set operation as INSERT_OVERWRITE_TABLE,
// overwrite will use INSERT_OVERWRITE_TABLE operator in doWriteOperation
} else if (mode == SaveMode.Overwrite && tableExists &&
Member

Do you mean, for Overwrite mode, we should not delete the basePath. Just overwrite the existing data. If so, I agree with you. Probably something to tackle in another PR.

.options(optsOverrides)
.mode(SaveMode.Append)
.save();
return null;
Member

why return null here?

Contributor Author

BULK_INSERT doesn't need to return WriteStatus (it doesn't need to execute the afterExecute method). Since it calls the DataFrame API records.write() to perform the write operation, the commit data is written after the write finishes (in HoodieDataSourceInternalBatchWrite#commit via dataSourceInternalWriterHelper.commit).

Member

Then how about returning Option<HoodieData<WriteStatus>> or maybe empty HoodieData if the return is not needed at the call site? Returning null can be potentially dangerous, if another author adds some change with the assumption that WriteStatus will always be present.

Contributor Author

make sense, let me change it

@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch from 1fadedf to 851a1c3 Compare May 23, 2023 02:32
@boneanxs
Contributor Author

@hudi-bot run azure

@boneanxs boneanxs requested a review from codope May 24, 2023 03:10
@boneanxs
Contributor Author

boneanxs commented May 26, 2023

Hey @codope, all comments are addressed. Could you please review it again?

Member

@codope codope left a comment

@boneanxs Can you please rebase?

} else if (mode == SaveMode.Overwrite && tableExists && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) {
// When user set operation as INSERT_OVERWRITE_TABLE,
// overwrite will use INSERT_OVERWRITE_TABLE operator in doWriteOperation
} else if (mode == SaveMode.Overwrite && tableExists &&
Member

Please file a JIRA to track this change.

.options(optsOverrides)
.mode(SaveMode.Append)
.save();
return null;
Member

Then how about returning Option<HoodieData<WriteStatus>> or maybe empty HoodieData if the return is not needed at the call site? Returning null can be potentially dangerous, if another author adds some change with the assumption that WriteStatus will always be present.

@boneanxs boneanxs requested a review from codope May 31, 2023 01:59
@boneanxs
Contributor Author

@boneanxs Can you please rebase?

@codope done, and all comments are addressed.

@codope
Member

codope commented Jun 1, 2023

Looks good to me. @yihua @nsivabalan If you can take one pass, that would be great.

@yihua yihua self-assigned this Jun 2, 2023
@boneanxs
Contributor Author

boneanxs commented Jun 8, 2023

@yihua Gentle ping... could you pls help to review it?

@yihua
Contributor

yihua commented Jun 20, 2023

@yihua Gentle ping... could you pls help to review it?

Sorry for the delay. I will review this PR this week.

@yihua
Contributor

yihua commented Jun 20, 2023

@boneanxs meanwhile, could you rebase the PR on the latest master?

Member

@codope codope left a comment

@boneanxs Since this is a breaking change for users who rely on SaveMode.Overwrite, can we just keep the bulk insert part, while extracting the behavior change (i.e. not deleting the table location when using SaveMode.Overwrite for bulk_insert, insert_overwrite) to a separate PR? We intend to make any behavior changes in the 1.0 release while keeping 0.14.0 compatible with previous releases.

@boneanxs
Contributor Author

can we just keep the bulk insert part, while extract the behavior change (i.e. not delete the table location if using SaveMode.Overwrite for bulk_insert, insert_overwrite) to a separate PR

@codope removed the breaking changes

Member

@codope codope left a comment

Thanks @boneanxs for extracting out the breaking change. Left one minor comment for the config. Can you also squash all commits to one?

@codope codope added priority:critical production down; pipelines stalled; Need help asap. and removed priority:blocker labels Jun 28, 2023
@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch from 02b3132 to b796358 Compare June 28, 2023 15:56
@boneanxs boneanxs requested a review from codope June 28, 2023 15:59
@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@boneanxs
Contributor Author

Thanks @boneanxs for extracting out the breaking change. Left one minor comment for the config. Can you also squash all commits to one?

@codope Thanks for reviewing. Addressed the comment and squashed all commits.

Labels
big-needle-movers priority:critical production down; pipelines stalled; Need help asap. release-0.14.0
Projects
Status: ✅ Done

6 participants