[HUDI-2369] Blog on bulk_insert sort modes #3549
Conversation
A few months back, when I ran some benchmarks, global sorting with bulk_insert took more time than no sorting, which made sense. But this time it wasn't the way I anticipated.
## Bulk insert with different sort modes
Here is a microbenchmark to show the performance difference between different sort modes.

![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/bulkinsert-sort-modes.png)
These figures are good, but I think we can group numbers by sort mode rather than key type, so it's easier to see the differences across the sort modes (the main focus of this blog).
![Upsert followed by bulk_insert with different sort modes](/assets/images/blog/bulkinsert-sort-modes/upsert-sort-modes.png)

As you can see, when data is globally sorted, upserts have lower latency, since many data files can be filtered out.
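The filtering claim can be illustrated with a plain-Python toy model (my own sketch, not Hudi internals): if each file tracks the min/max record key it contains, an upsert only has to open files whose key range overlaps the incoming keys, and a global sort keeps those ranges narrow and non-overlapping.

```python
import random

def file_key_ranges(records, num_files=4, global_sort=False):
    """Split records into files; return each file's (min_key, max_key)."""
    if global_sort:
        records = sorted(records)
    size = -(-len(records) // num_files)  # ceiling division
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    return [(min(c), max(c)) for c in chunks]

def files_to_open(ranges, incoming_keys):
    """Count files whose key range overlaps any incoming key."""
    return sum(1 for lo, hi in ranges if any(lo <= k <= hi for k in incoming_keys))

random.seed(7)
keys = list(range(1000))
random.shuffle(keys)  # arrival order is random

unsorted_ranges = file_key_ranges(keys, global_sort=False)
sorted_ranges = file_key_ranges(keys, global_sort=True)

incoming = [500]  # an upsert touching a single key
# With a random layout, essentially every file's range spans key 500,
# so all files must be checked; with a global sort, exactly one file
# (the one covering [500, 749]) can contain it.
print(files_to_open(unsorted_ranges, incoming))
print(files_to_open(sorted_ranges, incoming))   # 1
```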
We need more motivation here. At a high level, we are saying: pay more cost during the write (say 2x) once, and reap benefits on every subsequent upsert? We need a more compelling case IMO.
I have addressed the feedback.
@vinothchandar: this patch is also good to review. Updated based on our discussion.
different sort modes available out of the box, and how each compares with others.
<!--truncate-->

Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
nit: loading to data -> loading of data.
Apache Hudi supports “bulk_insert” to assist in initial loading of data to a Hudi table. This is expected
to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
aspects. Existing records are never looked up with bulk_insert, and some writer-side optimizations like
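For readers following along, here is a hedged sketch of how the operation and sort mode are selected on the Spark datasource path. The option keys are standard Hudi write configs; the table name and field names below are made-up placeholders.

```python
# Sketch only: option keys are standard Hudi write configs; the table
# name, record key, and partition field here are hypothetical.
hudi_options = {
    "hoodie.table.name": "trips_table",                    # hypothetical name
    "hoodie.datasource.write.operation": "bulk_insert",    # vs "insert" / "upsert"
    "hoodie.datasource.write.recordkey.field": "uuid",     # assumed key column
    "hoodie.datasource.write.partitionpath.field": "region",
    # Sort mode for bulk_insert: NONE, GLOBAL_SORT, or PARTITION_SORT
    "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
}

# Typical usage (requires a SparkSession with the Hudi bundle on the classpath):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```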
Please correct me if I am wrong. AFAIK records will be looked up to perform deduplication in the case of bulk insert as well, which is the same as with the insert operation. Am I missing something here?
I guess you got confused between two configs. One is dedup (combine before insert) and the other is INSERT_DROP_DUPES. Dedup just de-duplicates among the incoming batch of records; INSERT_DROP_DUPES drops records that are already in storage. With the row-writer path, we don't support INSERT_DROP_DUPES.
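To make the distinction above concrete, a sketch of the two knobs being conflated. Both option keys are standard Hudi configs; defaults may differ across versions, so verify against your release.

```python
# Two different dedup knobs (sketch; check defaults for your Hudi version):
dedup_options = {
    # De-duplicates records *within the incoming batch* before writing
    # ("combine before insert").
    "hoodie.combine.before.insert": "true",
    # Drops incoming records whose key *already exists in the table*
    # (the INSERT_DROP_DUPES behavior; per this thread, not supported
    # on the row-writer path).
    "hoodie.datasource.write.insert.drop.duplicates": "true",
}
```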
I still think there is some confusion here. I went through the entire flow of DeltaStreamer. As per the below 2 lines -
hudi/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java, line 469 (at 5d60491):
    if (cfg.filterDupes) {
hudi/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java, line 597 (at 5d60491):
    ValidationUtils.checkArgument(!cfg.filterDupes || cfg.operation != WriteOperationType.UPSERT,
Both types of deduplication happen for INSERT as well as BULK_INSERT cases. Please correct me if I am still getting it wrong @nsivabalan
Thank you for writing this blog @nsivabalan. Quite useful! :)
## Bulk insert with different sort modes
Here is a microbenchmark to show the performance difference between different sort modes.

![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/sort-modes.png) <br/>
It's a bit misleading how sorting adds very little overhead for bulk_insert. Thoughts?
Have fixed the benchmarks. PTAL. With the row writer, I don't see a lot of overhead with sorting; with the write client, there definitely was some overhead.
### Global Sort

As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files
Hudi sorts the records globally by which column? recordKey?
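For what it's worth, my understanding (worth verifying against Hudi's global sort partitioner) is that the global sort key is the partition path plus the record key, not the record key alone. A plain-Python sketch of the effect:

```python
# Toy records as (partition_path, record_key) pairs; not Hudi internals.
records = [
    ("2021/08/02", "key_1"),
    ("2021/08/01", "key_9"),
    ("2021/08/02", "key_7"),
    ("2021/08/01", "key_2"),
]

# Sorting on (partition_path, record_key) groups each partition's rows
# together and orders keys within the partition, so the files cut from
# this ordering get tight, non-overlapping key ranges.
globally_sorted = sorted(records)
print(globally_sorted[0])  # ('2021/08/01', 'key_2')
```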
What is the purpose of the pull request
Blog on bulk_insert sort modes
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.