
Fix performance gap in Bulk Insert row-writing path with enabled de-duplication #15747

@hudi-bot

Description


Currently, when `hoodie.combine.before.insert` is set to `true` and `hoodie.bulkinsert.sort.mode` is set to `NONE`, Bulk Insert row-writing performance degrades considerably due to the following circumstances:

  • During de-duplication (within `dedupRows`), records in the incoming RDD are reshuffled (by Spark's default `HashPartitioner`) on `(partition-path, record-key)` into N Spark partitions
  • Since `BulkInsertSortMode.NONE` is used as the partitioner, no re-partitioning is performed afterwards, and therefore each Spark task might end up writing into up to M table partitions
  • This in turn causes an explosion in the number of (small) files created, hurting both write performance and the table's layout
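The interaction above can be illustrated with a small, self-contained simulation (plain Python, not Hudi code; the partition counts and the `crc32`-based stand-in for Spark's `HashPartitioner` are assumptions for illustration): hash-shuffling on `(partition-path, record-key)` scatters every table partition's records across every Spark task, so with `BulkInsertSortMode.NONE` the writer can produce close to N × M files instead of the ideal M.

```python
import zlib

# Hypothetical simulation (not Hudi code) of the small-file explosion:
# hash-partitioning on (partition_path, record_key) spreads each table
# partition's records across all Spark tasks.

N_SPARK_PARTITIONS = 8                                         # N shuffle partitions
TABLE_PARTITIONS = ["2024-01-01", "2024-01-02", "2024-01-03"]  # M = 3

records = [(p, f"key-{i}") for p in TABLE_PARTITIONS for i in range(100)]

# Stand-in for Spark's HashPartitioner: hash(key) % numPartitions.
def spark_task(partition_path: str, record_key: str) -> int:
    return zlib.crc32(f"{partition_path}/{record_key}".encode()) % N_SPARK_PARTITIONS

# With sort mode NONE there is no re-partitioning after dedup, so each
# task writes one file per table partition it happens to hold records for.
task_to_table_partitions = {}
for partition_path, record_key in records:
    task_to_table_partitions.setdefault(
        spark_task(partition_path, record_key), set()
    ).add(partition_path)

files_written = sum(len(paths) for paths in task_to_table_partitions.values())
print(files_written)  # close to N * M = 24, rather than the ideal M = 3
```

In practice the gap is closed by clustering records by partition path before writing (which modes such as `GLOBAL_SORT` or `PARTITION_PATH_REPARTITION` effectively do), so that each Spark task touches far fewer table partitions and emits far fewer files.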

