---
title: "Bulk Insert Sort Modes with Apache Hudi"
excerpt: "Different sort modes available with BulkInsert"
author: shivnarayan
category: blog
---

Apache Hudi supports a `bulk_insert` operation, in addition to `insert` and `upsert`, to ingest data into a Hudi table.
There are different sort modes one can employ when using `bulk_insert`. This blog covers the different sort modes
available out of the box and how each compares with the others.
<!--truncate-->

Apache Hudi supports `bulk_insert` to assist in the initial loading of data into a Hudi table. This is expected
to be faster than using `insert` or `upsert` operations. Bulk insert differs from insert in two
aspects: existing records are never looked up with `bulk_insert`, and some writer-side optimizations, such as
small file handling, are not performed with `bulk_insert`.

Bulk insert offers three different sort modes to cater to different user needs, based on the following principles.

- Sorting gives good compression and upsert performance if the data is laid out well. In particular, if your record keys
have some ordering characteristics (a timestamp prefix, etc.), sorting will assist in trimming down a lot of files
during upsert. If data is sorted by frequently queried columns, queries can leverage parquet predicate pushdown
to trim down the data and ensure lower latency as well.

- Additionally, parquet writing is quite a memory-intensive operation. When writing large volumes of data into a table
that is also partitioned into 1000s of partitions, without sorting of any kind, the writer may have to keep 1000s of
parquet writers open simultaneously, incurring unsustainable memory pressure and eventually leading to crashes.

- It's also desirable to start with the smallest number of files possible when bulk importing data, so as to avoid
metadata overhead later on for writers and queries.

The three sort modes supported out of the box are `PARTITION_SORT`, `GLOBAL_SORT`, and `NONE`.

## Configurations
One can set the config [`hoodie.bulkinsert.sort.mode`](https://hudi.apache.org/docs/configurations.html#withBulkInsertSortMode) to one
of three values, namely `NONE`, `GLOBAL_SORT`, or `PARTITION_SORT`. The default sort mode is `GLOBAL_SORT`.
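
For example, with the Spark datasource writer, the sort mode can be passed as just another write option
alongside the `bulk_insert` operation. Here is a minimal sketch; the table name, record key and partition
path fields, and `basePath` below are illustrative placeholders:

```scala
// Bulk insert into a Hudi table with an explicit sort mode.
// Table name, key/partition fields, and basePath are illustrative.
df.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.partitionpath.field", "partitionpath").
  option("hoodie.datasource.write.operation", "bulk_insert").
  option("hoodie.bulkinsert.sort.mode", "GLOBAL_SORT").
  mode("append").
  save(basePath)
```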

## Different Sort Modes

### Global Sort

As the name suggests, Hudi sorts the records globally (by partition path and record key) across the input partitions, which maximizes the number of files
pruned using key ranges during index lookups for subsequent upserts. This is because each file has non-overlapping
min and max values for keys, which really helps when the key has some ordering characteristics, such as a time-based prefix.
Given we are writing to a single parquet file on a single output partition path on storage at any given time, this mode
greatly helps control memory pressure during large partitioned writes. Also, due to the global sorting, each small table
partition path will be written from at most two Spark partitions and thus contain just two files.
This is the default sort mode for the `bulk_insert` operation in Hudi.
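
Conceptually, this mode amounts to something like the following sketch (not Hudi's exact implementation; the
sort key of partition path plus record key is an assumption based on the non-overlapping key ranges described above):

```scala
import org.apache.hudi.common.model.HoodieRecord
import org.apache.spark.api.java.JavaRDD

// Conceptual sketch of GLOBAL_SORT: order the entire input by
// (partitionPath, recordKey), so each output Spark partition covers a
// contiguous, non-overlapping key range.
def globalSort[T](records: JavaRDD[HoodieRecord[T]],
                  outputSparkPartitions: Int): JavaRDD[HoodieRecord[T]] =
  records.rdd
    .sortBy(r => (r.getPartitionPath, r.getRecordKey),
            ascending = true, numPartitions = outputSparkPartitions)
    .toJavaRDD()
```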

### Partition Sort
In this sort mode, records within a given Spark partition are sorted. However, a given Spark partition
may contain records from several different table partitions. So, even though we sort within each Spark partition, this sort
mode could still result in a large number of files at the end of `bulk_insert`, since records for a given table partition could
be spread across many Spark partitions. During the actual write, the writers may not have many files open
simultaneously, since each file is closed out before moving on to the next (as records are sorted within a Spark partition),
and hence there may not be much memory pressure.
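
A rough sketch of what this mode amounts to with the DataFrame API (illustrative only; the
`partitionpath` and `key` column names and the target parallelism are assumptions):

```scala
// Conceptual sketch of PARTITION_SORT: repartition to the target parallelism,
// then sort records only within each Spark partition (no global ordering).
val outputSparkPartitions = 100  // illustrative parallelism
val partitionSorted = df
  .repartition(outputSparkPartitions)
  .sortWithinPartitions("partitionpath", "key")
```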

### None

In this mode, no transformation (such as sorting) is applied to the user records; they are handed to the writers as-is. So,
when writing large volumes of data into a table partitioned into 1000s of partitions, the writer may have to keep 1000s of
parquet writers open simultaneously, incurring unsustainable memory pressure and eventually leading to crashes. Also,
the min-max key ranges for a given file could be very wide (since records are unsorted), and hence subsequent upserts may read
bloom filters from a lot of files during index lookup. Since records are not sorted, and each writer could get records
spanning many table partitions, this sort mode could result in a huge number of files at the end of the bulk import.
This could also impact your upsert or query performance due to the large number of small files.

## User defined partitioner

If none of the above built-in sort modes suffice, users can implement their own
[partitioner](https://github.com/apache/hudi/blob/master/hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/BulkInsertPartitioner.java)
and plug it into bulk insert as needed.
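
As a minimal sketch (assuming the RDD-based write path and the `repartitionRecords` /
`arePartitionRecordsSorted` methods on the linked interface; the version in the linked source is
authoritative), a hypothetical partitioner that sorts only by partition path might look like this:

```scala
import org.apache.hudi.common.model.HoodieRecord
import org.apache.hudi.table.BulkInsertPartitioner
import org.apache.spark.api.java.JavaRDD

// Hypothetical custom partitioner: sort records by partition path only,
// keeping the file count per table partition low without paying for a
// full global sort on record keys.
class PartitionPathSortPartitioner[T]
    extends BulkInsertPartitioner[JavaRDD[HoodieRecord[T]]] {

  override def repartitionRecords(records: JavaRDD[HoodieRecord[T]],
                                  outputSparkPartitions: Int): JavaRDD[HoodieRecord[T]] =
    records.rdd
      .sortBy(r => r.getPartitionPath, ascending = true,
              numPartitions = outputSparkPartitions)
      .toJavaRDD()

  // Records arrive sorted by partition path within each Spark partition,
  // so the writer can close one file before opening the next.
  override def arePartitionRecordsSorted(): Boolean = true
}
```

Such a class can then be wired in through the writer configuration for a user-defined bulk insert
partitioner (`hoodie.bulkinsert.user.defined.partitioner.class` at the time of writing).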

## Bulk insert with different sort modes
Here is a microbenchmark to show the performance difference between different sort modes.

![Figure showing different sort modes in bulk_insert](/assets/images/blog/bulkinsert-sort-modes/sort-modes.png) <br/>
Figure: Shows performance of different bulk insert variants

This benchmark bulk inserted 10M records into Hudi using each of the different sort modes, followed by an upsert
of 2M records. As you can see, global sorting gives very good upsert performance compared to no sorting.
Partition sort shows higher upsert latency than global sort, due to the metadata overhead from the larger number of files it produces.

## Conclusion
Hopefully this blog has given you good insights into the different sort modes for bulk insert and when to use each.
Please check out the [Get Involved](https://hudi.apache.org/contribute/get-involved) page if you would like to start contributing to Hudi.

