[HUDI-6478] Simplifying INSERT_INTO configs for spark-sql#9123
codope merged 7 commits into apache:master
Conversation
if (mergedParams.get(OPERATION.key()).get == INSERT_OPERATION_OPT_VAL && mergedParams.contains(DataSourceWriteOptions.INSERT_DUP_POLICY.key())
    && mergedParams.get(DataSourceWriteOptions.INSERT_DUP_POLICY.key()).get != FAIL_INSERT_DUP_POLICY) {
  // enable merge allow duplicates when operation type is insert
  mergedParams.put(HoodieWriteConfig.MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key(), "true")
I feel that, by default, we should never dedup for the INSERT operation. That keeps the behavior in line with regular RDBMS semantics.
I generally agree with this point, but I think we want to keep the default backwards compatible.
There is not much point in keeping backwards compatibility if the behavior itself is not what users intuitively expect,
because most users of the INSERT operation do not need deduplication, and they do not want to specify a record key either.
This is not de-dup; this is actually achieving what you are claiming, Danny.
i.e.
if you ingest (RK1, val1) in commit1 and (RK1, val2) in commit2 with the insert operation type, a snapshot read will return both values only when you set "MERGE_ALLOW_DUPLICATE_ON_INSERTS_ENABLE.key" = true.
Yeah, kind of obscure at first sight. Why not just set the default value of hoodie.merge.allow.duplicate.on.inserts to true?
I think this config is also used for datasource inserts. So, now the behavior of datasource and sql will differ for the insert operation?
Yeah, it's better if we keep the strategy consistent.
Nope. We are streamlining across spark-ds and spark-sql.
If the operation type is insert, we do enable hoodie.merge.allow.duplicate.on.inserts when the user does not explicitly set it.
we do enable hoodie.merge.allow.duplicate.on.inserts
Do you mean disable?
Then it's alright. Somehow while reviewing this change I thought it was in ProvidesHoodieConfig. Only now I realized it's in HoodieSparkSqlWriter, so it should be fine.
deducePayloadClassNameLegacy(operation, tableType, insertMode)
} else {
  classOf[OverwriteWithLatestAvroPayload].getCanonicalName
  // should we also consider old way of doing things.
I think we should. We can change the behavior in 1.x. But in 0.14.0, we should map the previous config value to the new config value, e.g. STRICT is equivalent to FAIL_INSERT_DUP_POLICY.
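To make the suggested compatibility mapping concrete, here is a minimal sketch, not the actual Hudi implementation: the object name, helper name, and policy value strings are all assumed for illustration, with only the STRICT-to-fail equivalence taken from the comment above.

```scala
// Hypothetical sketch, not actual Hudi code: map legacy "hoodie.sql.insert.mode"
// values onto new insert dup-policy values, e.g. STRICT -> FAIL_INSERT_DUP_POLICY.
object InsertDupPolicyMapping {
  // Mode and policy strings below are assumed for illustration.
  private val legacyToDupPolicy: Map[String, String] = Map(
    "strict"     -> "fail",  // STRICT is equivalent to FAIL_INSERT_DUP_POLICY
    "non-strict" -> "none",  // allow duplicates on insert
    "upsert"     -> "drop"   // de-duplicate incoming records
  )

  // An explicitly set new config wins; otherwise fall back to the mapped legacy value.
  def deduce(legacyMode: Option[String], newPolicy: Option[String]): String =
    newPolicy.getOrElse(
      legacyMode.flatMap(m => legacyToDupPolicy.get(m.toLowerCase)).getOrElse("none"))
}
```

This keeps the old config honored in 0.14.0 while letting the new config take precedence whenever both are set.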
This is already taken care of in deducePayloadClassNameLegacy; none of the downstream methods do anything differently. It's only used to deduce the payload class.
Should we directly change the default value of "hoodie.merge.allow.duplicate.on.inserts" to true? This parameter only takes effect in insert mode, and usually users don't want duplicates removed when using inserts, so de-duplicating will cause trouble for them.
Force-pushed 7708ff7 to beb523c.
hey @danny0405 @codope : Updated the patch, rebased with latest master.
codope left a comment:
Overall looks good except for some minor clarifications. Please also check the CI failures.
hey @zhuanshenbsj1 : I know we are changing the behavior. But we looked at a few other systems in a similar space, and in all of them INSERT_INTO can result in duplicates; we are taking a hit trying to de-dup compared with the others.
hey @codope : not sure I understand your question here. We are aligning the behavior across both with this patch. Let's sync up f2f and resolve any pending feedback.
codope left a comment:
Changes look good. Can land once the CI is green. Can you please also track a docs PR? We need to document the different cases and how they change compared to previous versions.
Force-pushed c4b55ca to 92431ed, then 92431ed to 05554e2.
codope left a comment:
Review comments addressed and CI green.
| tblproperties (
| type = '$tableType',
| primaryKey = 'id',
| preCombine = 'name'

Change Logs
With the intent to simplify the different config options for INSERT_INTO in spark-sql, we are doing an overhaul. Today we have 3 to 4 configs affecting INSERT_INTO: the operation type, insert mode, drop-duplicates, and enable-bulk-insert configs. Here is what the simplification brings in.
When both old and new configs are set, new config will take effect.
When only new configs are set, new config will take effect.
When neither is set, new configs and their default will take effect.
When only old configs are set, old configs will take effect. Please note that we are deprecating these old configs; in 2 releases, we will remove them completely, so we recommend users migrate to the new configs.
Note: "old" refers to "hoodie.sql.insert.mode" and "new" refers to "hoodie.sql.write.operation".
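The precedence rules above can be sketched as follows. This is an illustrative toy, not the actual resolution code: the function name and values are assumed, and the real legacy config carries insert modes rather than operation names.

```scala
// Toy sketch of the stated config precedence; not the actual Hudi code.
def resolveWriteOperation(oldConfig: Option[String], newConfig: Option[String]): String = {
  val newDefault = "insert" // default for "hoodie.sql.write.operation" with INSERT_INTO
  (oldConfig, newConfig) match {
    case (_, Some(newValue))    => newValue   // new config set (alone or with old): new wins
    case (Some(oldValue), None) => oldValue   // only old config set: honored, but deprecated
    case (None, None)           => newDefault // neither set: new default takes effect
  }
}
```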
Behavior change:
With this patch, we are also switching the default behavior of INSERT_INTO to use "insert" as the underlying operation. Until 0.13.1, the default was "upsert": if you ingested the same batch of records in commit1 and again in commit2, Hudi would upsert, and a snapshot read would return only the latest values. With this patch, the default becomes "insert", as the name INSERT_INTO signifies, so ingesting the same batch of records in commit1 and commit2 will result in duplicate records on snapshot read. If users override the respective configs, we will honor them; only the default behavior, where none of the respective configs are explicitly set, changes.
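A toy model of the behavior change, using plain Scala collections rather than Hudi code (the record type and function names are made up for illustration):

```scala
// Toy model of the behavior change, not Hudi code: "upsert" keeps only the
// latest value per record key, while "insert" retains both copies.
case class Record(key: String, value: String)

def ingest(table: Seq[Record], batch: Seq[Record], operation: String): Seq[Record] =
  operation match {
    // Later writes win per key, so a snapshot read sees one row per key.
    case "upsert" => (table ++ batch).groupBy(_.key).values.map(_.last).toSeq
    // Records are appended as-is, so re-ingesting a batch produces duplicates.
    case "insert" => table ++ batch
  }
```

Under the old default, two commits of (RK1, val1) then (RK1, val2) leave one row; under the new default, they leave two.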
Impact
Usability will be improved for spark-sql users, as we have deprecated a few confusing configs and aligned with spark datasource writes. This also brings the behavior change described above: the default operation for INSERT_INTO switches from "upsert" (as in 0.13.1 and earlier) to "insert", so re-ingesting the same batch of records produces duplicates on snapshot read unless the respective configs are explicitly overridden.
Risk level: medium
Documentation Update
We will have to call out the behavior change in our release docs and also update our quick start guide accordingly.
https://issues.apache.org/jira/browse/HUDI-6479
Contributor's checklist