[HUDI-5342] Add new bulk insert sort modes repartitioning data by partition path #7402

Merged
alexeykudinkin merged 4 commits into apache:master from yihua:HUDI-5342-add-partition-path-redistribute-sort-mode on Dec 10, 2022

@yihua (Contributor) commented Dec 7, 2022

Change Logs

This PR adds two new bulk insert sort modes, PARTITION_PATH_REPARTITION and PARTITION_PATH_REPARTITION_AND_SORT, which do the following:

  • For a physically partitioned table, they repartition the input records by partition path, limiting the shuffle parallelism to the specified outputSparkPartitions. PARTITION_PATH_REPARTITION_AND_SORT additionally sorts the records by partition path within each Spark partition.
  • For a physically non-partitioned table, they simply coalesce the input rows into outputSparkPartitions partitions.
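
The behavior described in the two bullets above can be illustrated with a small, Spark-free simulation. This is only a sketch under stated assumptions, not the code in this PR: the `(partition_path, record_key)` record shape, the function names, and the use of CRC32 as the hash are all hypothetical stand-ins chosen for illustration, and the "coalesce" here is a plain contiguous chunking rather than Spark's partition merging.

```python
import math
import zlib
from typing import List, Tuple

# Hypothetical record shape for illustration: (partition_path, record_key).
Record = Tuple[str, str]


def repartition_by_partition_path(
    records: List[Record],
    output_spark_partitions: int,
    sort_within_partition: bool = False,
) -> List[List[Record]]:
    """Simulate the partitioned-table case.

    Every record with the same partition path lands in the same output
    bucket, and there are exactly `output_spark_partitions` buckets.
    CRC32 is an assumed stand-in for whatever hash Spark applies.
    """
    if output_spark_partitions < 1:
        raise ValueError("output_spark_partitions must be >= 1")
    buckets: List[List[Record]] = [[] for _ in range(output_spark_partitions)]
    for record in records:
        path_hash = zlib.crc32(record[0].encode("utf-8"))
        buckets[path_hash % output_spark_partitions].append(record)
    if sort_within_partition:
        # Mirrors the extra step of PARTITION_PATH_REPARTITION_AND_SORT:
        # order records by partition path inside each bucket.
        for bucket in buckets:
            bucket.sort(key=lambda r: r[0])
    return buckets


def coalesce_rows(records: List[Record], output_spark_partitions: int) -> List[List[Record]]:
    """Simulate the non-partitioned case: no keyed shuffle, just fewer chunks.

    Contiguous slicing is a simplification of Spark's coalesce.
    """
    if output_spark_partitions < 1:
        raise ValueError("output_spark_partitions must be >= 1")
    chunk_size = max(1, math.ceil(len(records) / output_spark_partitions))
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]
```

Because a whole partition path maps to one bucket, each output task would write to a small set of table partitions, which is the point of repartitioning by partition path; the optional sort then groups rows of the same path together within a task.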

New unit tests are added to verify the added functionality.

Impact

This PR adds two new bulk insert sort modes. Existing sort modes are not affected.

Risk level

none

Documentation Update

This PR revises docs for hoodie.bulkinsert.sort.mode in HoodieWriteConfig.
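
As a rough illustration of how the new modes might be selected, the sketch below builds a set of writer options keyed by hoodie.bulkinsert.sort.mode. The helper name `bulk_insert_options` is hypothetical, and the `hoodie.datasource.write.operation` key comes from Hudi's general write options rather than from this PR; treat the whole block as an assumption-laden example, not documented usage.

```python
from typing import Dict

SORT_MODE_KEY = "hoodie.bulkinsert.sort.mode"

# The two modes introduced by this PR.
NEW_SORT_MODES = (
    "PARTITION_PATH_REPARTITION",
    "PARTITION_PATH_REPARTITION_AND_SORT",
)


def bulk_insert_options(sort_mode: str) -> Dict[str, str]:
    """Return writer options selecting one of the new sort modes.

    Hypothetical helper: it only assembles a dict and does not talk to
    Spark or Hudi.
    """
    if sort_mode not in NEW_SORT_MODES:
        raise ValueError(f"unexpected sort mode: {sort_mode!r}")
    return {
        # Assumed from Hudi's general write options, not from this PR.
        "hoodie.datasource.write.operation": "bulk_insert",
        SORT_MODE_KEY: sort_mode,
    }
```

A dict like this would typically be passed to a Spark DataFrame writer, for example via `.options(**bulk_insert_options("PARTITION_PATH_REPARTITION"))`, alongside whatever table name, record key, and partition path settings the job already uses.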

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@codope added the priority:blocker (Production down; release blocker), writer-core, and release-0.12.2 (Patches targeted for 0.12.2) labels on Dec 8, 2022
@codope force-pushed the HUDI-5342-add-partition-path-redistribute-sort-mode branch from 3d67bab to 4d1e882 on December 8, 2022 10:38
@yihua force-pushed the HUDI-5342-add-partition-path-redistribute-sort-mode branch from 4d1e882 to b67d424 on December 9, 2022 22:48
@yihua force-pushed the HUDI-5342-add-partition-path-redistribute-sort-mode branch from b67d424 to 48a4bd1 on December 9, 2022 22:51
@yihua changed the title from "[HUDI-5342] Add a new bulk insert sort mode redistributing data by partition path" to "[HUDI-5342] Add a new bulk insert sort mode repartitioning data by partition path" on Dec 9, 2022
@yihua changed the title from "[HUDI-5342] Add a new bulk insert sort mode repartitioning data by partition path" to "[HUDI-5342] Add new bulk insert sort modes repartitioning data by partition path" on Dec 10, 2022
@alexeykudinkin alexeykudinkin merged commit ca3333d into apache:master Dec 10, 2022
nsivabalan pushed a commit that referenced this pull request Dec 13, 2022
…tition path (#7402)

alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
…tition path (apache#7402)

alexeykudinkin pushed a commit that referenced this pull request Dec 14, 2022
…tition path (#7402)

fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…tition path (apache#7402)
