
[HUDI-6478] Simplifying INSERT_INTO configs for spark-sql#9123

Merged
codope merged 7 commits into apache:master from nsivabalan:insert_into_overhaul on Jul 16, 2023

Conversation

@nsivabalan (Contributor) commented Jul 5, 2023:

Change Logs

To simplify the various config options around INSERT_INTO in spark-sql, we are doing an overhaul. Today, 3 to 4 configs interact with INSERT_INTO: the operation type, insert mode, drop-duplicates and enable-bulk-insert configs. Here is what the simplification brings in.

- We will introduce a new config named "hoodie.sql.write.operation" with 3 valid values ("insert", "bulk_insert" and "upsert"). The default value for INSERT_INTO will be "insert".
         - Deprecates "hoodie.sql.insert.mode" and "hoodie.sql.bulk.insert.enable".
         - Also enables "hoodie.merge.allow.duplicate.on.inserts" = true when the operation type is "insert", for both spark-sql and spark-ds. This retains duplicates but still helps with small-file management for "insert"s.
- We will introduce a new config named "hoodie.datasource.insert.dedupe.policy" whose valid values are "ignore", "fail" and "drop", with "ignore" as the default. "fail" mimics the "STRICT" mode we support as of now.
         - Deprecates "hoodie.datasource.insert.drop.dups".
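For illustration, the new configs could be exercised from spark-sql roughly like this (a hypothetical sketch: the table name and columns are invented, and the config keys are taken from this description):

```sql
-- Sketch: use plain insert and drop duplicate records instead of failing
set hoodie.sql.write.operation = insert;
set hoodie.datasource.insert.dedupe.policy = drop;
insert into hudi_tbl values (1, 'a', 10);
```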

When both old and new configs are set, the new config takes effect.
When only new configs are set, the new config takes effect.
When neither is set, the new configs' defaults take effect.
When only old configs are set, the old configs take effect. Please do note that these old configs are deprecated and will be removed completely in 2 releases, so we recommend users migrate to the new configs.

Note: "old" refers to "hoodie.sql.insert.mode" and "new" refers to "hoodie.sql.write.operation".
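The precedence rules above can be sketched as follows (illustrative Python, not Hudi code; the old-to-new value mapping shown is an assumption made for the sketch):

```python
# Illustrative sketch of the config-precedence rules -- not Hudi code.
OLD_KEY = "hoodie.sql.insert.mode"      # deprecated config
NEW_KEY = "hoodie.sql.write.operation"  # new config
NEW_DEFAULT = "insert"                  # new default for INSERT_INTO

# Assumed mapping of deprecated insert modes onto operation values.
OLD_TO_NEW = {"upsert": "upsert", "strict": "insert", "non-strict": "insert"}

def resolve_write_operation(params):
    """Resolve the effective write operation for INSERT_INTO."""
    if NEW_KEY in params:      # new config wins whenever it is set
        return params[NEW_KEY]
    if OLD_KEY in params:      # old config honored only if new one is unset
        return OLD_TO_NEW[params[OLD_KEY]]
    return NEW_DEFAULT         # neither set: new default applies
```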

Behavior change:
With this patch, we are also switching the default behavior of INSERT_INTO to use "insert" as the underlying operation. Until 0.13.1, the default was "upsert": if you ingested the same batch of records in commit1 and again in commit2, hudi performed an upsert and a snapshot read returned only the latest value per record key. With this patch, the default matches what the name (INSERT_INTO) signifies, so ingesting the same batch of records in commit1 and commit2 results in duplicate records on snapshot read. If users explicitly override the respective configs, we will honor them; only the default behavior, where none of the respective configs are overridden, changes.
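The snapshot-read difference can be shown with a toy model (plain Python, not Hudi code; a "table" here is just a list of (record_key, value) pairs):

```python
# Toy model of upsert vs. insert semantics -- not Hudi code.
def write_batch(table, batch, operation):
    """Apply one write to the table using the given operation type."""
    if operation == "upsert":
        merged = dict(table)
        merged.update(dict(batch))   # latest value wins per record key
        return list(merged.items())
    return table + batch             # "insert": duplicates are retained

batch = [("rk1", "val1"), ("rk2", "val2")]

# Old default ("upsert"): re-ingesting the same batch still leaves 2 rows.
t = write_batch(write_batch([], batch, "upsert"), batch, "upsert")
assert len(t) == 2

# New default ("insert"): re-ingesting the same batch leaves 4 rows.
t = write_batch(write_batch([], batch, "insert"), batch, "insert")
assert len(t) == 4
```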

Impact

Usability will be improved for spark-sql users, as we have deprecated a few confusing configs and aligned the behavior with spark datasource writes. This patch also brings the behavior change described above: the default INSERT_INTO operation switches from "upsert" (the default until 0.13.1) to "insert", so ingesting the same batch of records in two commits now yields duplicate records on snapshot read unless the respective configs are explicitly overridden.

Risk level (write none, low medium or high below)

medium

Documentation Update

We will have to call out the behavior change in our release docs and also update our quick start guide accordingly.
https://issues.apache.org/jira/browse/HUDI-6479

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added release-0.14.0 priority:blocker Production down; release blocker labels Jul 5, 2023
if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
&& mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
// enable merge allow duplicates when operation type is insert
mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")
Contributor:
I feel by default, we should never dedup for INSERT operation. That keeps the behavior in line with regular RDBMS.

Member:

i generally agree with this point but i think we want to keep the default backwards compatible.

Contributor:

There is not much point in keeping backwards compatibility if the behavior itself is not correct from the user's intuition: most users of the INSERT operation do not need deduplication, and they do not want to specify a record key either.

Contributor (author):

This is not de-dup; it actually achieves what you are claiming, Danny.
i.e. if you ingest (RK1, val1) in commit1 and (RK1, val2) in commit2 with the insert operation type, a snapshot read will return both values only when "hoodie.merge.allow.duplicate.on.inserts" = true.

Contributor:

Yeah, kind of obscure at first sight. Why not just make the default value of hoodie.merge.allow.duplicate.on.inserts true?

Member:

I think this config is also used for datasource inserts. So, now the behavior of datasource and sql will differ for the insert operation?

Contributor:

yeah, it's better if we can keep the strategy in line.

Contributor (author):

Nope, we are streamlining across spark-ds and spark-sql:
if the operation type is insert, we enable hoodie.merge.allow.duplicate.on.inserts unless the user has explicitly set it.

Contributor:

we do enable hoodie.merge.allow.duplicate.on.inserts

Do you mean disable?

Member:

Then it's alright. Somehow while reviewing this change i thought it's in ProvidesHoodieConfig. Only now I realised it's in HoodieSparkSqlWriter so should be fine.


deducePayloadClassNameLegacy(operation, tableType, insertMode)
} else {
classOf[OverwriteWithLatestAvroPayload].getCanonicalName
// should we also consider old way of doing things.
Member:

i think we should. we can change the behavior in 1.x. But, in 0.14.0, we should map the previous config value to the new config value, e.g. STRICT is equivalent to FAIL_INSERT_DUP_POLICY.

Contributor (author):

This is already taken care of in deducePayloadClassNameLegacy; none of the downstream methods do anything differently. It's only used to deduce the payload class.

@zhuanshenbsj1 (Contributor):

public static final ConfigProperty<String> MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE = ConfigProperty
.key("hoodie.merge.allow.duplicate.on.inserts")
.defaultValue("false")
.markAdvanced()
.withDocumentation("When enabled, we allow duplicate keys even if inserts are routed to merge with an existing file (for ensuring file sizing)."

Should we directly change the default value of "hoodie.merge.allow.duplicate.on.inserts" to true? This parameter only takes effect for the insert operation, and usually users don't want duplicates removed when using insert, so the current default causes trouble for users.

@nsivabalan nsivabalan force-pushed the insert_into_overhaul branch from 7708ff7 to beb523c Compare July 9, 2023 22:26
@nsivabalan (Contributor, author):

hey @danny0405 @codope: updated the patch and rebased with latest master.

@codope (Member) left a comment:

Overall looks good except for some minor clarifications. Please also check the CI failures.


@nsivabalan (Contributor, author):

hey @zhuanshenbsj1: I know we are changing the behavior, but we looked at a few other systems in this space, and in all of them INSERT_INTO can result in duplicates. Compared with others, we are taking a hit by trying to de-dup.

@nsivabalan (Contributor, author):

hey @codope : not sure I understand your question here.
I think this config is also used for datasource inserts. So, now the behavior of datasource and sql will differ for the insert operation?

We are aligning the behavior across both with this patch. Let's sync up f2f and resolve any pending feedback.

@codope (Member) left a comment:

Changes look good. Can land once the CI is green. Can you please also track a docs PR? We need to document the different cases and how they change compared to previous versions.


@nsivabalan nsivabalan force-pushed the insert_into_overhaul branch 2 times, most recently from c4b55ca to 92431ed Compare July 14, 2023 22:11
@nsivabalan nsivabalan force-pushed the insert_into_overhaul branch from 92431ed to 05554e2 Compare July 15, 2023 05:30
@hudi-bot (Collaborator):

CI report:


@codope (Member) left a comment:

Review comments addressed and CI green.

@codope codope merged commit e039dd7 into apache:master Jul 16, 2023
| tblproperties (
| type = '$tableType',
| primaryKey = 'id',
| preCombine = 'name'
Contributor:

The name of this key should be preCombineField here.

See keyTableConfigMapping in HoodieOptionConfig (screenshot omitted).


Labels

priority:blocker Production down; release blocker release-0.14.0


6 participants