
[HUDI-6478] Simplifying INSERT_INTO configs for spark-sql#9123

Merged
codope merged 7 commits into apache:master from nsivabalan:insert_into_overhaul on Jul 16, 2023

Conversation

@nsivabalan (Contributor) commented Jul 5, 2023:

Change Logs

To simplify the various config options around INSERT_INTO in spark-sql, we are doing an overhaul. Today, 3 to 4 configs interact with INSERT_INTO: the operation type, insert mode, drop-duplicates and enable-bulk-insert configs. Here is what the simplification brings in.

- We will introduce a new config named "hoodie.sql.write.operation" with 3 valid values ("insert", "bulk_insert" and "upsert"). The default value for INSERT_INTO will be "insert".
         - Deprecates "hoodie.sql.insert.mode" and "hoodie.sql.bulk.insert.enable".
         - Also enables "hoodie.merge.allow.duplicate.on.inserts" = true when the operation type is "insert", for both spark-sql and spark-ds. This retains duplicates but still helps with small-file management for "insert"s.
- We will introduce a new config named "hoodie.datasource.insert.dedupe.policy" whose valid values are "ignore", "fail" and "drop", with "ignore" as the default. "fail" mimics the "STRICT" mode we support as of now.
         - Deprecates "hoodie.datasource.insert.drop.dups".
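For illustration, the new configs could be exercised from spark-sql roughly like this (a hypothetical sketch: the table name and columns are invented, and the config keys are taken from this description):

```sql
-- Sketch: use plain insert and drop duplicate records instead of failing
set hoodie.sql.write.operation = insert;
set hoodie.datasource.insert.dedupe.policy = drop;
insert into hudi_tbl values (1, 'a', 10);
```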

When both old and new configs are set, the new config takes effect.
When only new configs are set, the new config takes effect.
When neither is set, the new configs' defaults take effect.
When only old configs are set, the old configs take effect. Please do note that these old configs are deprecated and will be removed completely in 2 releases, so we recommend users migrate to the new configs.

Note: "old" refers to "hoodie.sql.insert.mode" and "new" refers to "hoodie.sql.write.operation".
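The precedence rules above can be sketched as follows (illustrative Python, not Hudi code; the old-to-new value mapping shown is an assumption made for the sketch):

```python
# Illustrative sketch of the config-precedence rules -- not Hudi code.
OLD_KEY = "hoodie.sql.insert.mode"      # deprecated config
NEW_KEY = "hoodie.sql.write.operation"  # new config
NEW_DEFAULT = "insert"                  # new default for INSERT_INTO

# Assumed mapping of deprecated insert modes onto operation values.
OLD_TO_NEW = {"upsert": "upsert", "strict": "insert", "non-strict": "insert"}

def resolve_write_operation(params):
    """Resolve the effective write operation for INSERT_INTO."""
    if NEW_KEY in params:      # new config wins whenever it is set
        return params[NEW_KEY]
    if OLD_KEY in params:      # old config honored only if new one is unset
        return OLD_TO_NEW[params[OLD_KEY]]
    return NEW_DEFAULT         # neither set: new default applies
```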

Behavior change:
With this patch, we are also switching the default behavior of INSERT_INTO to use "insert" as the underlying operation. Until 0.13.1, the default was "upsert": if you ingested the same batch of records in commit1 and again in commit2, hudi performed an upsert and a snapshot read returned only the latest value per record key. With this patch, the default matches what the name (INSERT_INTO) signifies, so ingesting the same batch of records in commit1 and commit2 results in duplicate records on snapshot read. If users explicitly override the respective configs, we will honor them; only the default behavior, where none of the respective configs are overridden, changes.
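The snapshot-read difference can be shown with a toy model (plain Python, not Hudi code; a "table" here is just a list of (record_key, value) pairs):

```python
# Toy model of upsert vs. insert semantics -- not Hudi code.
def write_batch(table, batch, operation):
    """Apply one write to the table using the given operation type."""
    if operation == "upsert":
        merged = dict(table)
        merged.update(dict(batch))   # latest value wins per record key
        return list(merged.items())
    return table + batch             # "insert": duplicates are retained

batch = [("rk1", "val1"), ("rk2", "val2")]

# Old default ("upsert"): re-ingesting the same batch still leaves 2 rows.
t = write_batch(write_batch([], batch, "upsert"), batch, "upsert")
assert len(t) == 2

# New default ("insert"): re-ingesting the same batch leaves 4 rows.
t = write_batch(write_batch([], batch, "insert"), batch, "insert")
assert len(t) == 4
```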

Impact

Usability will be improved for spark-sql users, as we have deprecated a few confusing configs and aligned the behavior with spark datasource writes. This patch also brings the behavior change described above: the default INSERT_INTO operation switches from "upsert" (the default until 0.13.1) to "insert", so ingesting the same batch of records in two commits now yields duplicate records on snapshot read unless the respective configs are explicitly overridden.

Risk level (write none, low medium or high below)

medium

Documentation Update

We will have to call out the behavior change in our release docs and also update our quick start guide accordingly.
https://issues.apache.org/jira/browse/HUDI-6479

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@nsivabalan nsivabalan added release-0.14.0 priority:blocker Production down; release blocker labels Jul 5, 2023
if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
&& mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
// enable merge allow duplicates when operation type is insert
mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")
Contributor:
I feel by default, we should never dedup for INSERT operation. That keeps the behavior in line with regular RDBMS.

Member:

i generally agree with this point but i think we want to keep the default backwards compatible.

Contributor:

There is not much point in keeping backwards compatibility if the behavior itself is not correct from the user's intuition: most users of the INSERT operation do not need deduplication, and they do not want to specify a record key either.

Contributor (author):

This is not de-dup; it actually achieves what you are claiming, Danny.
i.e. if you ingest (RK1, val1) in commit1 and (RK1, val2) in commit2 with the insert operation type, a snapshot read will return both values only when "hoodie.merge.allow.duplicate.on.inserts" = true.

Contributor:

Yeah, kind of obscure at first sight. Why not just make the default value of hoodie.merge.allow.duplicate.on.inserts true?

Member:

I think this config is also used for datasource inserts. So, now the behavior of datasource and sql will differ for the insert operation?

Contributor:

yeah, it's better if we can keep the strategy in line.

Contributor (author):

Nope, we are streamlining across spark-ds and spark-sql:
if the operation type is insert, we enable hoodie.merge.allow.duplicate.on.inserts unless the user has explicitly set it.

Contributor:

we do enable hoodie.merge.allow.duplicate.on.inserts

Do you mean disable?

Member:

Then it's alright. Somehow while reviewing this change i thought it's in ProvidesHoodieConfig. Only now I realised it's in HoodieSparkSqlWriter so should be fine.


deducePayloadClassNameLegacy(operation, tableType, insertMode)
} else {
classOf[OverwriteWithLatestAvroPayload].getCanonicalName
// should we also consider old way of doing things.
Member:

i think we should. we can change the behavior in 1.x. But, in 0.14.0, we should map the previous config value to the new config value, e.g. STRICT is equivalent to FAIL_INSERT_DUP_POLICY.

Contributor (author):

This is already taken care of in deducePayloadClassNameLegacy; none of the downstream methods do anything differently. It's only used to deduce the payload class.

@zhuanshenbsj1 (Contributor):

public static final ConfigProperty<String> MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE = ConfigProperty
.key("hoodie.merge.allow.duplicate.on.inserts")
.defaultValue("false")
.markAdvanced()
.withDocumentation("When enabled, we allow duplicate keys even if inserts are routed to merge with an existing file (for ensuring file sizing)."

Should we directly change the default value of "hoodie.merge.allow.duplicate.on.inserts" to true? This parameter only takes effect for the insert operation, and usually users don't want duplicates removed when using insert, so the current default causes trouble for users.

@nsivabalan nsivabalan force-pushed the insert_into_overhaul branch from 7708ff7 to beb523c Compare July 9, 2023 22:26
@nsivabalan (Contributor, author):

hey @danny0405 @codope: updated the patch and rebased with latest master.

@codope (Member) left a comment:

Overall looks good except for some minor clarifications. Please also check the CI failures.


@nsivabalan (Contributor, author):

hey @zhuanshenbsj1: I know we are changing the behavior, but we looked at a few other systems in this space, and in all of them INSERT_INTO can result in duplicates. Compared with others, we are taking a hit by trying to de-dup.

@nsivabalan (Contributor, author):

hey @codope : not sure I understand your question here.
I think this config is also used for datasource inserts. So, now the behavior of datasource and sql will differ for the insert operation?

We are aligning the behavior across both with this patch. Let's sync up f2f and resolve any pending feedback.

@codope (Member) left a comment:

Changes look good. Can land once the CI is green. Can you please also track a docs PR? We need to document the different cases and how they change compared to previous versions.


@nsivabalan nsivabalan force-pushed the insert_into_overhaul branch 2 times, most recently from c4b55ca to 92431ed Compare July 14, 2023 22:11
@nsivabalan nsivabalan force-pushed the insert_into_overhaul branch from 92431ed to 05554e2 Compare July 15, 2023 05:30
@hudi-bot (Collaborator):

CI report:


@codope (Member) left a comment:

Review comments addressed and CI green.

@codope codope merged commit e039dd7 into apache:master Jul 16, 2023
| tblproperties (
| type = '$tableType',
| primaryKey = 'id',
| preCombine = 'name'
Contributor:

The name of this key should be preCombineField here.

See keyTableConfigMapping in HoodieOptionConfig (screenshot omitted).


Labels

priority:blocker Production down; release blocker release-0.14.0


6 participants