[HUDI-2369] Blog on bulk_insert sort modes #3549
Conversation
A few months back, when I ran some benchmarks, global sorting with bulk_insert took more time than no sorting, which made sense. But this time it wasn't the way I anticipated.
## Bulk insert with different sort modes
Here is a microbenchmark to show the performance difference between different sort modes.

![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/bulkinsert-sort-modes.png)
These figures are good, but I think we can group numbers by sort mode rather than key type, so it's easier to see the differences across the sort modes (the main focus of this blog).
![Upsert followed by bulk_insert with different sort modes](/assets/images/blog/bulkinsert-sort-modes/upsert-sort-modes.png)

As you can see, when data is globally sorted, upserts have lower latency, since many data files can be filtered out.
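The filtering claim can be illustrated with a plain-Python toy model (my own sketch, not Hudi internals): if each file tracks the min/max record key it contains, an upsert only has to open files whose key range overlaps the incoming keys, and a global sort keeps those ranges narrow and non-overlapping.

```python
import random

def file_key_ranges(records, num_files=4, global_sort=False):
    """Split records into files; return each file's (min_key, max_key)."""
    if global_sort:
        records = sorted(records)
    size = -(-len(records) // num_files)  # ceiling division
    chunks = [records[i:i + size] for i in range(0, len(records), size)]
    return [(min(c), max(c)) for c in chunks]

def files_to_open(ranges, incoming_keys):
    """Count files whose key range overlaps any incoming key."""
    return sum(1 for lo, hi in ranges if any(lo <= k <= hi for k in incoming_keys))

random.seed(7)
keys = list(range(1000))
random.shuffle(keys)  # arrival order is random

unsorted_ranges = file_key_ranges(keys, global_sort=False)
sorted_ranges = file_key_ranges(keys, global_sort=True)

incoming = [500]  # an upsert touching a single key
# With a random layout, essentially every file's range spans key 500,
# so all files must be checked; with a global sort, exactly one file
# (the one covering [500, 749]) can contain it.
print(files_to_open(unsorted_ranges, incoming))
print(files_to_open(sorted_ranges, incoming))   # 1
```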
We need more motivation here. At a high level, we are saying: pay more cost during the write (say 2x) once, and reap benefits on every subsequent upsert? We need a more compelling case IMO.
I have addressed the feedback.
@vinothchandar: this patch is also good to review. Updated based on our discussion.
different sort modes available out of the box, and how each compares with others.
<!--truncate-->

Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
nit: loading to data -> loading of data.
Apache Hudi supports “bulk_insert” to assist in initial loading of data to a Hudi table. This is expected
to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
aspects. Existing records are never looked up with bulk_insert, and some writer-side optimizations like
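For readers following along, here is a hedged sketch of how the operation and sort mode are selected on the Spark datasource path. The option keys are standard Hudi write configs; the table name and field names below are made-up placeholders.

```python
# Sketch only: option keys are standard Hudi write configs; the table
# name, record key, and partition field here are hypothetical.
hudi_options = {
    "hoodie.table.name": "trips_table",                    # hypothetical name
    "hoodie.datasource.write.operation": "bulk_insert",    # vs "insert" / "upsert"
    "hoodie.datasource.write.recordkey.field": "uuid",     # assumed key column
    "hoodie.datasource.write.partitionpath.field": "region",
    # Sort mode for bulk_insert: NONE, GLOBAL_SORT, or PARTITION_SORT
    "hoodie.bulkinsert.sort.mode": "GLOBAL_SORT",
}

# Typical usage (requires a SparkSession with the Hudi bundle on the classpath):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```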
Please correct me if I am wrong. AFAIK records will be looked up to perform deduplication in the case of bulk insert as well, which is the same as with the insert operation. Am I missing something here?
I guess you got confused between two configs. One is dedup (combine before insert) and the other is INSERT_DROP_DUPES. Dedup just de-duplicates among the incoming batch of records; INSERT_DROP_DUPES drops records that are already in storage. With the row-writer path, we don't support INSERT_DROP_DUPES.
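To make the distinction above concrete, a sketch of the two knobs being conflated. Both option keys are standard Hudi configs; defaults may differ across versions, so verify against your release.

```python
# Two different dedup knobs (sketch; check defaults for your Hudi version):
dedup_options = {
    # De-duplicates records *within the incoming batch* before writing
    # ("combine before insert").
    "hoodie.combine.before.insert": "true",
    # Drops incoming records whose key *already exists in the table*
    # (the INSERT_DROP_DUPES behavior; per this thread, not supported
    # on the row-writer path).
    "hoodie.datasource.write.insert.drop.duplicates": "true",
}
```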
I still think there is some confusion here. I went through the entire flow of DeltaStreamer. As per the below 2 lines -
hudi/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/DeltaSync.java, line 469 (at 5d60491):
    if (cfg.filterDupes) {
hudi/hudi-utilities/src/main/java/org/apache/hudi/utilities/deltastreamer/HoodieDeltaStreamer.java, line 597 (at 5d60491):
    ValidationUtils.checkArgument(!cfg.filterDupes || cfg.operation != WriteOperationType.UPSERT,
Both types of deduplication happen for INSERT as well as BULK_INSERT cases. Please correct me if I am still getting it wrong @nsivabalan
Thank you for writing this blog @nsivabalan. Quite useful! :)
## Bulk insert with different sort modes
Here is a microbenchmark to show the performance difference between different sort modes.

![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/sort-modes.png) <br/>
It's a bit misleading how sorting adds very little overhead for bulk_insert. Thoughts?
Have fixed the benchmarks. PTAL. With the row writer, I don't see a lot of overhead with sorting; with the write client, there definitely was some overhead.
### Global Sort

As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files
Hudi sorts the records globally by which column? recordKey?
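For what it's worth, my understanding (worth verifying against Hudi's global sort partitioner) is that the global sort key is the partition path plus the record key, not the record key alone. A plain-Python sketch of the effect:

```python
# Toy records as (partition_path, record_key) pairs; not Hudi internals.
records = [
    ("2021/08/02", "key_1"),
    ("2021/08/01", "key_9"),
    ("2021/08/02", "key_7"),
    ("2021/08/01", "key_2"),
]

# Sorting on (partition_path, record_key) groups each partition's rows
# together and orders keys within the partition, so the files cut from
# this ordering get tight, non-overlapping key ranges.
globally_sorted = sorted(records)
print(globally_sorted[0])  # ('2021/08/01', 'key_2')
```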
What is the purpose of the pull request
Blog on bulk_insert sort modes
Brief change log
Verify this pull request
(Please pick either of the following options)
This pull request is a trivial rework / code cleanup without any test coverage.
(or)
This pull request is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
Committer checklist
Has a corresponding JIRA in PR title & commit
Commit message is descriptive of the change
CI is green
Necessary doc changes done or have another open PR
For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.