[HUDI-5342] Add new bulk insert sort modes repartitioning data by partition path by yihua · Pull Request #7402 · apache/hudi

yihua · 2022-12-07T07:03:45Z

Change Logs

This PR adds two new bulk insert sort modes, PARTITION_PATH_REPARTITION and PARTITION_PATH_REPARTITION_AND_SORT, which do the following:

For a physically partitioned table, they repartition the input records by partition path, limiting the shuffle parallelism to the specified outputSparkPartitions. PARTITION_PATH_REPARTITION_AND_SORT additionally sorts the records by partition path within each Spark partition.
For a physically non-partitioned table, they simply coalesce the input rows to outputSparkPartitions.
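
The partitioned-table behavior described above can be illustrated with a small plain-Java simulation. This is only a sketch, not the code in this PR: it does not use Spark, and the class `PartitionPathRepartitionSketch`, the type `Rec`, and the method `repartition` are hypothetical names made up for illustration. It assumes records are routed by a hash of the partition path, which may differ from how the actual partitioner assigns records. The coalesce path for non-partitioned tables is not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java simulation of the idea; all names here are hypothetical, not Hudi APIs.
final class PartitionPathRepartitionSketch {

    /** A record reduced to the two fields this sketch needs. */
    static final class Rec {
        public final String partitionPath;
        public final String key;

        public Rec(String partitionPath, String key) {
            this.partitionPath = partitionPath;
            this.key = key;
        }
    }

    /**
     * Routes each record to one of outputPartitions buckets by hashing its
     * partition path, so all records sharing a partition path land in the
     * same bucket. Optionally sorts each bucket by partition path, which
     * mirrors the extra step of the *_AND_SORT mode.
     */
    public static List<List<Rec>> repartition(
            List<Rec> input, int outputPartitions, boolean sortWithinBucket) {
        List<List<Rec>> buckets = new ArrayList<>();
        for (int i = 0; i < outputPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (Rec r : input) {
            // floorMod keeps the index non-negative even for negative hash codes.
            int idx = Math.floorMod(r.partitionPath.hashCode(), outputPartitions);
            buckets.get(idx).add(r);
        }
        if (sortWithinBucket) {
            for (List<Rec> bucket : buckets) {
                bucket.sort(Comparator.comparing((Rec r) -> r.partitionPath));
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(
                new Rec("2022/12/02", "k1"),
                new Rec("2022/12/01", "k2"),
                new Rec("2022/12/02", "k3"),
                new Rec("2022/12/01", "k4"));
        List<List<Rec>> buckets = repartition(input, 3, true);
        for (int i = 0; i < buckets.size(); i++) {
            for (Rec r : buckets.get(i)) {
                System.out.println("bucket " + i + ": " + r.partitionPath + " " + r.key);
            }
        }
    }
}
```

Because the bucket index depends only on the partition path, the number of buckets caps the parallelism while each partition path is written from a single bucket.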

New unit tests are added to verify the added functionality.

Impact

This PR adds two new bulk insert sort modes. Existing sort modes are not affected.

Risk level

none

Documentation Update

This PR revises docs for hoodie.bulkinsert.sort.mode in HoodieWriteConfig.
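
As a minimal sketch of selecting one of the new modes, the snippet below sets `hoodie.bulkinsert.sort.mode` in a `java.util.Properties` object. The config key and the mode names come from this PR; the class `BulkInsertSortModeConfigSketch` and the helper `withSortMode` are hypothetical, and how the properties reach the Hudi writer is not shown here. It also assumes a Hudi build that includes this change.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of Hudi.
final class BulkInsertSortModeConfigSketch {

    // Config key named in this PR's documentation update.
    public static final String SORT_MODE_KEY = "hoodie.bulkinsert.sort.mode";

    /** Builds a Properties object carrying only the bulk insert sort mode. */
    public static Properties withSortMode(String sortMode) {
        Properties props = new Properties();
        props.setProperty(SORT_MODE_KEY, sortMode);
        return props;
    }

    public static void main(String[] args) {
        // One of the two mode names added by this PR.
        Properties props = withSortMode("PARTITION_PATH_REPARTITION_AND_SORT");
        System.out.println(SORT_MODE_KEY + "=" + props.getProperty(SORT_MODE_KEY));
    }
}
```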

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

...udi-client-common/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertSortMode.java

...src/main/java/org/apache/hudi/execution/bulkinsert/PartitionPathRedistributePartitioner.java

.../java/org/apache/hudi/execution/bulkinsert/PartitionPathRedistributePartitionerWithRows.java

hudi-bot · 2022-12-10T02:58:25Z

CI report:

d58e2fd UNKNOWN
3b206a6 Azure: SUCCESS

Bot commands

@hudi-bot supports the following commands:

@hudi-bot run azure: re-run the last Azure build

alexeykudinkin reviewed Dec 7, 2022

codope added the priority:blocker (Production down; release blocker), writer-core, and release-0.12.2 (Patches targeted for 0.12.2) labels Dec 8, 2022

codope assigned codope and alexeykudinkin Dec 8, 2022

codope force-pushed the HUDI-5342-add-partition-path-redistribute-sort-mode branch from 3d67bab to 4d1e882 Compare December 8, 2022 10:38

codope mentioned this pull request Dec 8, 2022

[HUDI-5351] Handle populateMetaFields when repartitioning in sort partitioner #7411

Merged

4 tasks

[HUDI-5342] Add a new bulk insert sort mode redistributing data by pa…

742da08

…rtition path

yihua force-pushed the HUDI-5342-add-partition-path-redistribute-sort-mode branch from 4d1e882 to b67d424 Compare December 9, 2022 22:48

Rename the new sort mode

48a4bd1

yihua force-pushed the HUDI-5342-add-partition-path-redistribute-sort-mode branch from b67d424 to 48a4bd1 Compare December 9, 2022 22:51

yihua changed the title ~~[HUDI-5342] Add a new bulk insert sort mode redistributing data by partition path~~ [HUDI-5342] Add a new bulk insert sort mode repartitioning data by partition path Dec 9, 2022

yihua added 2 commits December 9, 2022 16:20

Add PARTITION_PATH_REPARTITION_AND_SORT sort mode

d58e2fd

Add docs

3b206a6

yihua changed the title ~~[HUDI-5342] Add a new bulk insert sort mode repartitioning data by partition path~~ [HUDI-5342] Add new bulk insert sort modes repartitioning data by partition path Dec 10, 2022

alexeykudinkin approved these changes Dec 10, 2022

alexeykudinkin merged commit ca3333d into apache:master Dec 10, 2022

hudi-bot mentioned this pull request Dec 9, 2025

Add new bulk insert sort mode based on partition path #15615

Open

alexeykudinkin merged 4 commits into apache:master from yihua:HUDI-5342-add-partition-path-redistribute-sort-mode
