
[HUDI-5884] Support bulk_insert for insert_overwrite and insert_overwrite_table #8076

Merged

Conversation

boneanxs
Contributor

@boneanxs boneanxs commented Feb 28, 2023

Change Logs

  1. For Spark SQL, add bulk_insert support for insert_overwrite and insert_overwrite_table
  2. Add tests to cover new feature

Impact

To keep behavior consistent, this PR still deletes the old data completely when overwriting the whole table with BULK_INSERT enabled. The tables below summarize current Hudi behavior under different SaveMode and Operation combinations (especially those related to BULK_INSERT and INSERT_OVERWRITE).

DataFrame:

| SaveMode | Operation | Behavior |
| --- | --- | --- |
| Overwrite | BULK_INSERT | delete whole table data |
| Overwrite | INSERT_OVERWRITE_TABLE | use replaceCommit to overwrite all table data |
| Overwrite | INSERT_OVERWRITE | delete whole table data |
| Append | INSERT_OVERWRITE | use replaceCommit to overwrite old partitions |
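For reference, the DataFrame rows above can be exercised with a write along these lines. This is only a sketch: it assumes an existing DataFrame `df` and SparkSession, and the table name, path, and column names (`id`, `dt`, `my_table`) are hypothetical.

```scala
import org.apache.spark.sql.SaveMode

// Sketch: Overwrite + BULK_INSERT corresponds to the first row of the table
// above (deletes the whole table's data before writing).
df.write.format("hudi").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.datasource.write.recordkey.field", "id").
  option("hoodie.datasource.write.partitionpath.field", "dt").
  option("hoodie.table.name", "my_table").
  mode(SaveMode.Overwrite).
  save("/tmp/hudi/my_table")
```

Switching the operation to `insert_overwrite` with `SaveMode.Append` would instead take the replaceCommit path that overwrites only the touched partitions.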

SQL:

| SQL Type | Operation | Behavior (before) | Behavior (after) |
| --- | --- | --- | --- |
| INSERT OVERWRITE | BULK_INSERT | delete whole table data | same |
| INSERT OVERWRITE PARTITION | BULK_INSERT | not supported | use replaceCommit to overwrite old partitions |
| INSERT OVERWRITE | INSERT_OVERWRITE_TABLE | use replaceCommit to overwrite all table data | same |
| INSERT OVERWRITE PARTITION | INSERT_OVERWRITE | use replaceCommit to overwrite old partitions | same |
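The SQL rows above can be driven from Spark SQL roughly as follows. This is a sketch: it assumes `hoodie.sql.bulk.insert.enable` is the session config that switches SQL inserts onto the bulk_insert path, and the table and partition names are hypothetical.

```scala
// Sketch: Spark SQL from Scala; table/partition names are hypothetical.
spark.sql("set hoodie.sql.bulk.insert.enable = true")

// INSERT OVERWRITE of the whole table (BULK_INSERT path).
spark.sql("insert overwrite table my_hudi_table select * from staging")

// INSERT OVERWRITE PARTITION: with this PR, this uses replaceCommit
// to overwrite only the targeted partition instead of failing.
spark.sql(
  """insert overwrite table my_hudi_table partition (dt = '2023-01-01')
    |select id, name from staging_day
    |""".stripMargin)
```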

There are still some issues that need to be addressed:

  1. INSERT_OVERWRITE with Overwrite mode should not delete the whole table's data; it only needs to overwrite the old partitions
  2. We should always keep old data for time-travel and ACID compliance, but currently Overwrite mode deletes all data first

Created an issue to track this: HUDI-6286

Risk level (write none, low medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default values of the configs are changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instruction to make
    changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@boneanxs boneanxs marked this pull request as draft February 28, 2023 10:50
@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch from f8d8080 to be21282 Compare February 28, 2023 10:53
@boneanxs
Contributor Author

@hudi-bot run azure

@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch from 6a239ad to f384bbc Compare March 4, 2023 04:51
@boneanxs boneanxs changed the title Support bulk_insert for insert_overwrite and insert_overwrite_table [HUDI-5884] Support bulk_insert for insert_overwrite and insert_overwrite_table Mar 7, 2023
@boneanxs boneanxs marked this pull request as ready for review March 7, 2023 02:30
@boneanxs
Contributor Author

boneanxs commented Mar 7, 2023

Hey @alexeykudinkin @nsivabalan, could you please take a look?

mode = SaveMode.Overwrite
isOverWriteTable = true
val mode = if (overwrite) {
SaveMode.Overwrite
Contributor Author

Given that Overwrite mode doesn't care about the old data, do we need to enable bulk_insert by default when it's Overwrite mode?

Member

I think that's a good suggestion. cc @nsivabalan @yihua

} else if (mode == SaveMode.Overwrite && tableExists && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) {
// When user set operation as INSERT_OVERWRITE_TABLE,
// overwrite will use INSERT_OVERWRITE_TABLE operator in doWriteOperation
} else if (mode == SaveMode.Overwrite && tableExists &&
Contributor Author

Not sure why we need to explicitly delete the old data in Overwrite mode; this behavior actually makes Hudi non-ACID-compliant (I kept it here to make the tests pass).

Maybe we should only delete old data when using the drop table command?

Member

Do you mean, for Overwrite mode, we should not delete the basePath. Just overwrite the existing data. If so, I agree with you. Probably something to tackle in another PR.

Contributor Author

Do you mean, for Overwrite mode, we should not delete the basePath. Just overwrite the existing data.

Yea

Probably something to tackle in another PR.

Sure, will fix it in another PR

Member

Please file a JIRA to track this change.

Contributor Author

Sure, created: HUDI-6286

@boneanxs
Contributor Author

boneanxs commented Mar 7, 2023

Hi @stream2000, could you also please review this? This fixes PR #8015.

+ " To use row writer please switch to spark 2 or spark 3");
}

records.write().format(targetFormat)
Contributor Author

This still keeps the old behavior for the bulk_insert here. Maybe we should also use HoodieDatasetBulkInsertHelper.bulkInsert to perform the write operation? We could then reduce a lot of code for handling commit behavior (the path here adds a completed commit, while HoodieDatasetBulkInsertHelper.bulkInsert doesn't, so we need to handle this differently in bulkInsertAsRow).

@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch 2 times, most recently from 8becde2 to 5914b1a Compare March 20, 2023 01:55
@boneanxs
Contributor Author

Gentle ping @alexeykudinkin @xushiyan @danny0405 @yihua

Member

@codope codope left a comment

@boneanxs I am yet to review fully, but have taken one pass. Can you break it down into two PRs - a) don't delete the table location if using SaveMode.Overwrite for bulk_insert, insert_overwrite, b) add support for bulk_insert for insert_overwrite and insert_overwrite_table.

Also, I want to understand the use case when we need this. If you can elaborate a bit more on why we need this, that would be great.

public static final ConfigProperty<String> BULKINSERT_INPUT_DATA_SCHEMA_DDL = ConfigProperty
.key("hoodie.bulkinsert.schema.ddl")
.noDefaultValue()
.withDocumentation("Schema set for row writer/bulk insert.");

public static final ConfigProperty<String> BULKINSERT_OVERWRITE_MODE = ConfigProperty
.key("hoodie.bulkinsert.overwrite.mode")
Member

The value for this config is a write operation type. So, its key should be named accordingly.

import java.util.List;
import java.util.Map;

public abstract class BaseDatasetBulkCommitActionExecutor implements Serializable {
Member

Do we need this abstraction at a higher layer i.e. in hudi-client-common? And then maybe extend in hudi-spark-common for Dataset?

Contributor Author

Yes, at first I tried to put this in hudi-client-common, but BaseDatasetBulkCommitActionExecutor needs to access DataSourceUtils and DataSourceWriteOptions, and I'm not sure it's reasonable to move those classes there. I'm also afraid those two classes have other dependents, which we would then need to change as well.


public abstract class BaseDatasetBulkCommitActionExecutor implements Serializable {

protected final HoodieWriteConfig writeConfig;
Member

Do we need to serialize write config too or can it be transient?

import java.util.List;
import java.util.Map;

public abstract class BaseDatasetBulkCommitActionExecutor implements Serializable {
Member

Suggested change
public abstract class BaseDatasetBulkCommitActionExecutor implements Serializable {
public abstract class BaseDatasetBulkInsertCommitActionExecutor implements Serializable {

mode = SaveMode.Overwrite
isOverWriteTable = true
val mode = if (overwrite) {
SaveMode.Overwrite
Member

I think that's a good suggestion. cc @nsivabalan @yihua

val writeConfig = DataSourceUtils.createHoodieConfig(writerSchemaStr, basePath.toString, tblName, opts)
val executor = mode match {
case SaveMode.Append =>
new DatasetBulkInsertActionExecutor(writeConfig, writeClient, instantTime)
Contributor

Could we use writeClient to do the insert overwrite instead of calling the xxxActionExecutor directly?

Contributor Author

writeClient is specifically for RDD[HoodieRecord]; since all the xxxActionExecutors here are Dataset[Row]-based, I didn't put this logic there.

import java.util.Map;
import java.util.stream.Collectors;

public class DatasetBulkInsertActionExecutor extends BaseDatasetBulkCommitActionExecutor {
Contributor

Maybe we should change DatasetBulkInsertActionExecutor -> DatasetBulkInsertCommitActionExecutor, since it is a subclass of BaseCommitActionExecutor.

Contributor Author

make sense, will change

import java.util.Map;
import java.util.stream.Collectors;

public class DatasetBulkInsertOverwriteActionExecutor extends BaseDatasetBulkCommitActionExecutor {
Contributor

Ditto, DatasetBulkInsertOverwriteCommitActionExecutor

import java.util.List;
import java.util.Map;

public class DatasetBulkInsertOverwriteTableActionExecutor extends DatasetBulkInsertOverwriteActionExecutor {
Contributor

Ditto, DatasetBulkInsertOverwriteTableCommitActionExecutor

@boneanxs
Contributor Author

boneanxs commented Apr 3, 2023

I am yet to review fully, but have taken one pass. Can you break it down into two PRs - a) don't delete the table location if using SaveMode.Overwrite for bulk_insert, insert_overwrite, b) add support for bulk_insert for insert_overwrite and insert_overwrite_table.

Yea, sure, will do so

Also, I want to understand the use case when we need this. If you can elaborate a bit more on why we need this, that would be great.

Currently, we want to migrate all existing Hive tables to Hudi tables. Many of these Hive tables:

  1. usually perform the insert_overwrite operation to overwrite partitions
  2. are written by batch jobs and can contain TB-level data in a single day
  3. don't need to perform tagging or dropping of duplicates

bulk_insert mode fits such a scenario well: we can use it to boost write performance and make it easier for users to migrate existing Hive tables to Hudi.
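The migration scenario described above could be sketched as a daily job like the following. This is only an illustration: it assumes `hoodie.sql.bulk.insert.enable` switches the SQL insert onto the bulk_insert path, and the database, table, and column names are hypothetical.

```scala
// Sketch: migrating one day's Hive partition into a Hudi table using the
// bulk_insert-backed insert overwrite that this PR enables.
spark.sql("set hoodie.sql.bulk.insert.enable = true")

// No tagging or dedup is needed for these batch-written tables, so the
// bulk_insert row-writer path is sufficient and fast.
spark.sql(
  """insert overwrite table hudi_db.events partition (dt = '2023-01-01')
    |select event_id, payload from hive_db.events where dt = '2023-01-01'
    |""".stripMargin)
```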

@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch from 5914b1a to 1fadedf Compare April 26, 2023 08:43
@boneanxs boneanxs requested a review from codope April 27, 2023 02:02
@boneanxs
Contributor Author

boneanxs commented May 9, 2023

Hi @codope @stream2000 Gentle ping... Could you please take a look again?

| partitioned by (dt, hh)
| location '${tmp.getCanonicalPath}/$tableMultiPartition'
""".stripMargin)
test("Test bulk insert with insert into for non partitioned table") {
Member

These tests are only testing for default values of BULKINSERT_OVERWRITE_OPERATION_TYPE right? Can we also test for the other possible value?

Contributor Author

@boneanxs boneanxs May 23, 2023

Test bulk insert with insert overwrite table tests INSERT_OVERWRITE_TABLE, and Test bulk insert with insert overwrite partition tests INSERT_OVERWRITE.

Together these two tests cover all values of BULKINSERT_OVERWRITE_OPERATION_TYPE.

@@ -106,8 +106,14 @@ private class HoodieV1WriteBuilder(writeOptions: CaseInsensitiveStringMap,
override def toInsertableRelation: InsertableRelation = {
new InsertableRelation {
override def insert(data: DataFrame, overwrite: Boolean): Unit = {
val mode = if (overwriteTable || overwritePartition) {
Member

Can you confirm whether, if it's insert_overwrite_table, the table basePath will still be removed?

Contributor Author

With this PR, it won't delete the basePath.

// HoodieSparkSqlWriter#handleSaveModes
// won't delete the path if it's Overwrite mode and INSERT_OVERWRITE_TABLE, INSERT_OVERWRITE

else if (mode == SaveMode.Overwrite && tableExists &&
        (operation != WriteOperationType.INSERT_OVERWRITE_TABLE
          && operation != WriteOperationType.INSERT_OVERWRITE
          && operation != WriteOperationType.BULK_INSERT)) {
        // For INSERT_OVERWRITE_TABLE, INSERT_OVERWRITE and BULK_INSERT with Overwrite mode,
        // we'll use replacecommit to overwrite the old data.
        log.warn(s"hoodie table at $tablePath already exists. Deleting existing data & overwriting with new data.")
        fs.delete(tablePath, true)
        tableExists = false
      }

} else if (mode == SaveMode.Overwrite && tableExists && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) {
// When user set operation as INSERT_OVERWRITE_TABLE,
// overwrite will use INSERT_OVERWRITE_TABLE operator in doWriteOperation
} else if (mode == SaveMode.Overwrite && tableExists &&
Member

Do you mean, for Overwrite mode, we should not delete the basePath. Just overwrite the existing data. If so, I agree with you. Probably something to tackle in another PR.

.options(optsOverrides)
.mode(SaveMode.Append)
.save();
return null;
Member

why return null here?

Contributor Author

BULK_INSERT doesn't need to return WriteStatus (it doesn't need to execute the afterExecute method). Since it calls the DataFrame API records.write() to perform the write operation, the commit data is written after the write finishes (in HoodieDataSourceInternalBatchWrite#commit via dataSourceInternalWriterHelper.commit).

Member

Then how about returning Option<HoodieData<WriteStatus>> or maybe empty HoodieData if the return is not needed at the call site? Returning null can be potentially dangerous, if another author adds some change with the assumption that WriteStatus will always be present.

Contributor Author

make sense, let me change it

@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch from 1fadedf to 851a1c3 Compare May 23, 2023 02:32
@boneanxs
Contributor Author

@hudi-bot run azure

@boneanxs boneanxs requested a review from codope May 24, 2023 03:10
@boneanxs
Contributor Author

boneanxs commented May 26, 2023

Hey @codope, all comments are addressed. Could you please review it again?

Member

@codope codope left a comment

@boneanxs Can you please rebase?

} else if (mode == SaveMode.Overwrite && tableExists && operation != WriteOperationType.INSERT_OVERWRITE_TABLE) {
// When user set operation as INSERT_OVERWRITE_TABLE,
// overwrite will use INSERT_OVERWRITE_TABLE operator in doWriteOperation
} else if (mode == SaveMode.Overwrite && tableExists &&
Member

Please file a JIRA to track this change.

.options(optsOverrides)
.mode(SaveMode.Append)
.save();
return null;
Member

Then how about returning Option<HoodieData<WriteStatus>> or maybe empty HoodieData if the return is not needed at the call site? Returning null can be potentially dangerous, if another author adds some change with the assumption that WriteStatus will always be present.

@boneanxs boneanxs requested a review from codope May 31, 2023 01:59
@boneanxs
Contributor Author

@boneanxs Can you please rebase?

@codope done, and all comments are addressed.

@codope
Member

codope commented Jun 1, 2023

Looks good to me. @yihua @nsivabalan If you can take one pass, that would be great.

@yihua yihua self-assigned this Jun 2, 2023
@boneanxs
Contributor Author

boneanxs commented Jun 8, 2023

@yihua Gentle ping... could you pls help to review it?

@yihua
Contributor

yihua commented Jun 20, 2023

@yihua Gentle ping... could you pls help to review it?

Sorry for the delay. I will review this PR this week.

@yihua
Contributor

yihua commented Jun 20, 2023

@boneanxs meanwhile, could you rebase the PR on the latest master?

Member

@codope codope left a comment

@boneanxs Since this is a breaking change for users who rely on SaveMode.Overwrite, can we just keep the bulk insert part, while extracting the behavior change (i.e. not deleting the table location when using SaveMode.Overwrite for bulk_insert, insert_overwrite) to a separate PR? We intend to make any behavior changes in the 1.0 release while keeping 0.14.0 compatible with previous releases.

@boneanxs
Contributor Author

can we just keep the bulk insert part, while extract the behavior change (i.e. not delete the table location if using SaveMode.Overwrite for bulk_insert, insert_overwrite) to a separate PR

@codope removed the breaking changes

Member

@codope codope left a comment

Thanks @boneanxs for extracting out the breaking change. Left one minor comment for the config. Can you also squash all commits to one?

@codope codope added priority:critical production down; pipelines stalled; Need help asap. and removed priority:blocker labels Jun 28, 2023
@boneanxs boneanxs force-pushed the bulk_insert_as_row_for_insert_overwrite branch from 02b3132 to b796358 Compare June 28, 2023 15:56
@boneanxs boneanxs requested a review from codope June 28, 2023 15:59
@hudi-bot

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@boneanxs
Contributor Author

Thanks @boneanxs for extracting out the breaking change. Left one minor comment for the config. Can you also squash all commits to one?

@codope Thanks for reviewing. Addressed the comment and squashed all commits.

Labels
big-needle-movers priority:critical production down; pipelines stalled; Need help asap. release-0.14.0
Projects
Status: ✅ Done

6 participants