Status: Open
Labels: from-jira, priority:blocker (Production down; release blocker), status:pr-available (Pull request available), type:bug (Bug reports and fixes)
Description
Currently, when the flag `hoodie.combine.before.insert` is set to `true` and `hoodie.bulkinsert.sort.mode` is set to `NONE`, Bulk Insert Row Writing performance degrades considerably, for the following reasons:
- During de-duplication (within `dedupRows`), records in the incoming RDD are reshuffled (by Spark's default `HashPartitioner`) based on `(partition-path, record-key)` into N partitions.
- Because `BulkInsertSortMode.NONE` is used as the partitioner, no re-partitioning is performed afterwards, so each Spark task may end up writing into M table partitions.
- This in turn causes an explosion in the number of (small) files created, hurting both write performance and the table's layout.
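The failure mode above can be sketched with a small plain-Python simulation (no Spark or Hudi APIs involved; all names here are illustrative, not Hudi internals). It models the dedup shuffle as a hash partition on `(partition-path, record-key)`, then counts how many files get written when, as with sort mode `NONE`, each task opens one file per distinct table partition it sees:

```python
# Illustrative simulation of the small-file explosion: hash-based dedup
# shuffle followed by no re-partitioning (BulkInsertSortMode.NONE analogue).
# All function and field names are hypothetical, for illustration only.

def hash_partition(records, num_spark_partitions):
    """Mimic Spark's default HashPartitioner on (partition_path, record_key)."""
    parts = [[] for _ in range(num_spark_partitions)]
    for rec in records:
        idx = hash((rec["partition_path"], rec["record_key"])) % num_spark_partitions
        parts[idx].append(rec)
    return parts

def files_written(spark_partitions):
    """With no re-partitioning, each task writes one file per distinct
    table partition it sees, so files ~= tasks x table partitions."""
    return sum(len({r["partition_path"] for r in part}) for part in spark_partitions)

# 10 table partitions (by date), 100 records each.
records = [{"partition_path": f"2024-01-{d:02d}", "record_key": f"k{d}-{i}"}
           for d in range(1, 11) for i in range(100)]

# Hash shuffle scatters every table partition across all 8 Spark tasks,
# so the file count approaches 8 * 10 = 80 small files.
shuffled = hash_partition(records, num_spark_partitions=8)
print(files_written(shuffled))

# Partitioning by partition_path instead keeps it at exactly 10 files.
by_path = {}
for rec in records:
    by_path.setdefault(rec["partition_path"], []).append(rec)
print(files_written(list(by_path.values())))
```

The second partitioning strategy is the intuition behind sorting or partitioning bulk-insert input by partition path: records destined for the same table partition land in the same task, keeping the file count proportional to the number of table partitions rather than tasks × partitions.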
JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-5685
- Type: Bug
- Epic: https://issues.apache.org/jira/browse/HUDI-3249
- Fix version(s):
- 1.1.0