[HUDI-2369] Blog on bulk_insert sort modes #3549

Open: wants to merge 5 commits into base: asf-site

nsivabalan
Contributor

What is the purpose of the pull request

Blog on bulk_insert sort modes

Brief change log

(for example:)

  • Modify AnnotationLocation checkstyle rule in checkstyle.xml

Verify this pull request

(Please pick either of the following options)

This pull request is a trivial rework / code cleanup without any test coverage.

(or)

This pull request is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end.
  • Added HoodieClientWriteTest to verify the change.
  • Manually verified the change by running a job locally.

Committer checklist

  • Has a corresponding JIRA in PR title & commit

  • Commit message is descriptive of the change

  • CI is green

  • Necessary doc changes done or have another open PR

  • For large changes, please consider breaking it into sub-tasks under an umbrella JIRA.

@nsivabalan
Contributor Author

A few months back, when I ran some benchmarks, global sorting with bulk_insert took more time than no sorting, which made sense. But this time, the results weren't what I anticipated.

## Bulk insert with different sort modes
Here is a microbenchmark to show the performance difference between different sort modes.

![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/bulkinsert-sort-modes.png)
Member

These figures are good, but I think we can group the numbers by sort mode rather than by key type, so it's easy to see the differences across the sort modes (the main focus of this blog).


![Upsert followed by bulk_insert with different sort modes](/assets/images/blog/bulkinsert-sort-modes/upsert-sort-modes.png)

As you can see, when data is globally sorted, upserts have lower latency since a lot of data files can be filtered out.
Member

We need more motivation here. At a high level, are we saying: pay a higher cost once during writing (say 2x), and reap the benefits on every subsequent upsert? We need a more compelling case IMO.
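
One way to make the file-filtering argument concrete is a toy model in plain Python. This is only a sketch: it assumes an index lookup can skip any file whose min/max record key range excludes the key being upserted, and the helper names `write_files` and `candidate_files` are hypothetical, not Hudi APIs.

```python
import random

def write_files(keys, records_per_file):
    """Split keys into fixed-size 'files' and keep each file's (min, max) key range."""
    files = []
    for start in range(0, len(keys), records_per_file):
        chunk = keys[start:start + records_per_file]
        files.append((min(chunk), max(chunk)))
    return files

def candidate_files(files, lookup_key):
    """Count files whose key range could contain lookup_key and so must be inspected."""
    return sum(1 for low, high in files if low <= lookup_key <= high)

random.seed(7)
shuffled = list(range(10_000))
random.shuffle(shuffled)

# Unsorted input stands in for sort mode NONE; sorted input stands in for a global sort.
unsorted_files = write_files(shuffled, records_per_file=1_000)
sorted_files = write_files(sorted(shuffled), records_per_file=1_000)

lookup = 4_321
print("files to inspect without sorting:", candidate_files(unsorted_files, lookup))
print("files to inspect with global sort:", candidate_files(sorted_files, lookup))
```

With globally sorted keys the ranges do not overlap, so a single file is a candidate for any key; with unsorted keys most files span nearly the whole key space, so the one-time sorting cost is traded against repeated lookups that touch far fewer files.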

@nsivabalan
Contributor Author

I have addressed the feedback.

@nsivabalan
Contributor Author

@vinothchandar: this patch is also ready for review. Updated based on our discussion.

different sort modes available out of the box, and how each compares with others.
<!--truncate-->

Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
Contributor

nit: loading to data -> loading of data.


Apache Hudi supports “bulk_insert” to assist in initial loading to data to a hudi table. This is expected
to be faster when compared to using “insert” or “upsert” operations. Bulk insert differs from insert in two
aspects. Existing records are never looked up with bulk_insert, and some writer side optimizations like
Contributor

Please correct me if I am wrong. AFAIK records will be looked up for performing deduplication in case of bulk insert as well, which is the same case with insert operation. Am I missing something here?

Contributor Author

I guess you got confused between two configs. One is dedup (combine before insert) and the other is Insert_Drop_Dupes. Dedup only dedupes among the incoming batch of records. Insert_Drop_Dupes drops those incoming records that are already in storage. With the row writer path, we don't support Insert_Drop_Dupes.
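
The distinction between the two behaviours can be sketched in plain Python. This is an illustration only: the function names are hypothetical, and "keep the last record seen per key" is a simplifying assumption about how duplicates in a batch are merged. In Hudi the two behaviours are, to my understanding, toggled by `hoodie.combine.before.insert` and `hoodie.datasource.write.insert.drop.duplicates` respectively.

```python
def combine_before_insert(batch):
    """Dedup within the incoming batch only: keep one record per key (last one seen)."""
    latest = {}
    for key, value in batch:
        latest[key] = value
    return list(latest.items())

def drop_existing(batch, stored_keys):
    """Drop incoming records whose key is already present in storage."""
    return [(key, value) for key, value in batch if key not in stored_keys]

stored_keys = {"k1"}  # keys already written to the table
incoming = [("k1", "a"), ("k2", "b"), ("k2", "c"), ("k3", "d")]

deduped = combine_before_insert(incoming)       # never looks at storage
filtered = drop_existing(deduped, stored_keys)  # needs a lookup against storage

print(deduped)
print(filtered)
```

Only the second step requires looking up existing records, which is why it is the costlier of the two.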

Contributor

I still think there is some confusion here. I went through the entire flow of DeltaStreamer. As per the below 2 lines -

  1. ValidationUtils.checkArgument(!cfg.filterDupes || cfg.operation != WriteOperationType.UPSERT,

Both types of deduplication happen for INSERT as well as BULK_INSERT cases. Please correct me if I am still getting it wrong, @nsivabalan.

@pratyakshsharma
Contributor

Thank you for writing this blog @nsivabalan . Quite useful! :)

## Bulk insert with different sort modes
Here is a microbenchmark to show the performance difference between different sort modes.

![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/sort-modes.png) <br/>
Member

It's a bit misleading how sorting appears to add very little overhead for bulk_insert. Thoughts?

Contributor Author

nsivabalan, Sep 7, 2021

I have fixed the benchmarks. PTAL. With the row writer, I don't see a lot of overhead with sorting. With the write client, there was definitely some overhead.
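
For readers who want to reproduce this comparison, a minimal sketch of the writer options involved might look like the following. The helper `bulk_insert_options` and the table name `trips` are hypothetical; the option keys are the Hudi datasource configs as I understand them, and a real write would also need record key, partition path, and precombine settings that are omitted here.

```python
ALLOWED_SORT_MODES = {"NONE", "GLOBAL_SORT", "PARTITION_SORT"}

def bulk_insert_options(table_name, sort_mode="GLOBAL_SORT", row_writer=True):
    """Build a dict of writer options for a bulk_insert run with a given sort mode."""
    if sort_mode not in ALLOWED_SORT_MODES:
        raise ValueError(f"unknown sort mode: {sort_mode}")
    return {
        "hoodie.table.name": table_name,
        "hoodie.datasource.write.operation": "bulk_insert",
        "hoodie.bulkinsert.sort.mode": sort_mode,
        # "true" takes the row writer path; "false" falls back to the write client path.
        "hoodie.datasource.write.row.writer.enable": str(row_writer).lower(),
    }

opts = bulk_insert_options("trips", sort_mode="GLOBAL_SORT", row_writer=True)
for key, value in sorted(opts.items()):
    print(f"{key}={value}")

# With PySpark available, the dict would be passed along the lines of:
#   df.write.format("hudi").options(**opts).mode("overwrite").save(base_path)
```

Flipping `row_writer` while holding `sort_mode` fixed is the comparison described above.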

vinothchandar added this to Ready for Review in PR Tracker Board, Sep 7, 2021
vinothchandar self-assigned this, Sep 7, 2021
@nsivabalan
Contributor Author

bulk_insert_sort_modes

vinothchandar moved this from Ready for Review to Under Discussion PRs in PR Tracker Board, Dec 15, 2021
xushiyan added the docs label, Apr 20, 2022
xushiyan added the pr:wip (Work in Progress/PRs) label, Oct 31, 2022

### Global Sort

As the name suggests, Hudi sorts the records globally across the input partitions, which maximizes the number of files

By which column does Hudi sort the records globally? recordKey?
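
The quoted text does not say which column is used. As a rough illustration of how a global sort differs from sorting within each input partition, here is a plain-Python sketch. It assumes the sort key is the partition path followed by the record key, which is an assumption for illustration and should be confirmed against the Hudi source; the function names are hypothetical.

```python
def global_sort(spark_partitions):
    """Sort all records together, then re-split into at most the same number of partitions."""
    records = sorted(rec for part in spark_partitions for rec in part)
    if not records:
        return [[] for _ in spark_partitions]
    size = -(-len(records) // len(spark_partitions))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]

def partition_sort(spark_partitions):
    """Sort records only within each input partition; no data moves between partitions."""
    return [sorted(part) for part in spark_partitions]

# Each record is (partition_path, record_key); tuples compare in that order.
incoming = [
    [("2021/09/02", "k4"), ("2021/09/01", "k2")],
    [("2021/09/01", "k1"), ("2021/09/02", "k3")],
]

print(global_sort(incoming))
print(partition_sort(incoming))
```

In this toy input, the global sort leaves each output partition holding records for a single partition path, while the per-partition sort leaves both output partitions writing to both partition paths.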

github-actions bot added the size:S (PR with lines of changes in (10, 100]) label, Feb 26, 2024