
Fix performance gap in Bulk Insert row-writing path with enabled de-duplication #15747

@hudi-bot

Description


Currently, when `hoodie.combine.before.insert` is set to `true` and `hoodie.bulkinsert.sort.mode` is set to `NONE`, Bulk Insert row-writing performance degrades considerably due to the following circumstances:

  • During de-duplication (within `dedupRows`), records in the incoming RDD are reshuffled (by Spark's default `HashPartitioner`) on `(partition-path, record-key)` into N Spark partitions
  • Since `BulkInsertSortMode.NONE` is used as the partitioner, no re-partitioning is performed afterwards, and therefore each Spark task might end up writing into up to M table partitions
  • This in turn causes an explosion in the number of (small) files created, hurting both write performance and the table's layout
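The interaction above can be illustrated with a small, self-contained simulation (plain Python, not Hudi code; the partition counts and the `crc32`-based stand-in for Spark's `HashPartitioner` are assumptions for illustration): hash-shuffling on `(partition-path, record-key)` scatters every table partition's records across every Spark task, so with `BulkInsertSortMode.NONE` the writer can produce close to N × M files instead of the ideal M.

```python
import zlib

# Hypothetical simulation (not Hudi code) of the small-file explosion:
# hash-partitioning on (partition_path, record_key) spreads each table
# partition's records across all Spark tasks.

N_SPARK_PARTITIONS = 8                                         # N shuffle partitions
TABLE_PARTITIONS = ["2024-01-01", "2024-01-02", "2024-01-03"]  # M = 3

records = [(p, f"key-{i}") for p in TABLE_PARTITIONS for i in range(100)]

# Stand-in for Spark's HashPartitioner: hash(key) % numPartitions.
def spark_task(partition_path: str, record_key: str) -> int:
    return zlib.crc32(f"{partition_path}/{record_key}".encode()) % N_SPARK_PARTITIONS

# With sort mode NONE there is no re-partitioning after dedup, so each
# task writes one file per table partition it happens to hold records for.
task_to_table_partitions = {}
for partition_path, record_key in records:
    task_to_table_partitions.setdefault(
        spark_task(partition_path, record_key), set()
    ).add(partition_path)

files_written = sum(len(paths) for paths in task_to_table_partitions.values())
print(files_written)  # close to N * M = 24, rather than the ideal M = 3
```

In practice the gap is closed by clustering records by partition path before writing (which modes such as `GLOBAL_SORT` or `PARTITION_PATH_REPARTITION` effectively do), so that each Spark task touches far fewer table partitions and emits far fewer files.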

