[HUDI-5342] Add new bulk insert sort modes repartitioning data by partition path #7402
Merged
alexeykudinkin merged 4 commits into apache:master on Dec 10, 2022
...udi-client-common/src/main/java/org/apache/hudi/execution/bulkinsert/BulkInsertSortMode.java
...src/main/java/org/apache/hudi/execution/bulkinsert/PartitionPathRedistributePartitioner.java
.../java/org/apache/hudi/execution/bulkinsert/PartitionPathRedistributePartitionerWithRows.java
3d67bab to 4d1e882
4d1e882 to b67d424
b67d424 to 48a4bd1
alexeykudinkin approved these changes on Dec 10, 2022
nsivabalan pushed a commit that referenced this pull request on Dec 13, 2022
…tition path (#7402)
This PR adds two new bulk insert sort modes, PARTITION_PATH_REPARTITION and PARTITION_PATH_REPARTITION_AND_SORT, which do the following. For a physically partitioned table, they repartition the input records based on the partition path, limiting the shuffle parallelism to the specified outputSparkPartitions. For PARTITION_PATH_REPARTITION_AND_SORT, an additional step sorts the records by partition path within each Spark partition. For a physically non-partitioned table, they simply coalesce the input rows to outputSparkPartitions. New unit tests verify the added functionality.
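Change Logs
This PR adds two new bulk insert sort modes, PARTITION_PATH_REPARTITION and PARTITION_PATH_REPARTITION_AND_SORT, which do the following:
For a physically partitioned table, repartition the input records based on the partition path, limiting the shuffle parallelism to the specified outputSparkPartitions. For PARTITION_PATH_REPARTITION_AND_SORT, an additional step sorts the records by partition path within each Spark partition.
For a physically non-partitioned table, simply coalesce the input rows to outputSparkPartitions.
New unit tests are added to verify the added functionality.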
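The behaviour described in the change log can be illustrated with a small plain-Java simulation. This is only a sketch under stated assumptions, not the code from this PR: it does not use Spark or Hudi, the class `PartitionPathRepartitionSketch`, the `Rec` type, and both method names are hypothetical, `String.hashCode()` stands in for Spark's hash partitioning, and `coalesce` is modelled as merging neighbouring input partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class PartitionPathRepartitionSketch {

    /** Hypothetical record: only the partition path and a record key matter here. */
    static final class Rec {
        final String partitionPath;
        final String key;

        Rec(String partitionPath, String key) {
            this.partitionPath = partitionPath;
            this.key = key;
        }
    }

    /**
     * Partitioned table: hash each record's partition path into one of
     * numOutput buckets (numOutput must be positive), so every record of a
     * given partition path ends up in the same bucket.
     */
    static List<List<Rec>> repartitionByPath(List<List<Rec>> input, int numOutput, boolean sort) {
        List<List<Rec>> out = new ArrayList<>();
        for (int i = 0; i < numOutput; i++) {
            out.add(new ArrayList<>());
        }
        for (List<Rec> part : input) {
            for (Rec r : part) {
                // Assumption: String.hashCode() stands in for Spark's hash partitioning.
                out.get(Math.floorMod(r.partitionPath.hashCode(), numOutput)).add(r);
            }
        }
        if (sort) {
            // The *_AND_SORT variant: order records by partition path inside each bucket.
            for (List<Rec> bucket : out) {
                bucket.sort(Comparator.comparing((Rec r) -> r.partitionPath));
            }
        }
        return out;
    }

    /**
     * Non-partitioned table: merge neighbouring input partitions into at most
     * numOutput partitions, never increasing the partition count.
     */
    static List<List<Rec>> coalesce(List<List<Rec>> input, int numOutput) {
        int n = Math.min(numOutput, input.size());
        List<List<Rec>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(new ArrayList<>());
        }
        for (int i = 0; i < input.size(); i++) {
            out.get(i * n / input.size()).addAll(input.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<Rec>> input = new ArrayList<>();
        input.add(Arrays.asList(new Rec("2022/12/11", "k1"), new Rec("2022/12/10", "k2")));
        input.add(Arrays.asList(new Rec("2022/12/10", "k3"), new Rec("2022/12/12", "k4")));
        input.add(Arrays.asList(new Rec("2022/12/12", "k5")));

        List<List<Rec>> buckets = repartitionByPath(input, 2, true);
        for (int b = 0; b < buckets.size(); b++) {
            for (Rec r : buckets.get(b)) {
                System.out.println("bucket " + b + ": " + r.partitionPath + " " + r.key);
            }
        }
        System.out.println("coalesced partitions: " + coalesce(input, 2).size());
    }
}
```

In this sketch the number of output buckets plays the role of outputSparkPartitions: it caps how many buckets exist regardless of how many distinct partition paths the input contains, so several partition paths may share a bucket while no single path is split across buckets.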
Impact
This PR adds two new bulk insert sort modes. Existing sort modes are not affected.
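Because the new modes are opt-in, a writer would have to select one explicitly. The following is a minimal sketch, assuming the mode is chosen through the hoodie.bulkinsert.sort.mode key named in this PR and that writer options are collected in a `java.util.Properties` object; the class `BulkInsertSortModeConfigSketch` and the helper `writerProps` are hypothetical, and how the properties reach the Hudi writer is not shown.

```java
import java.util.Properties;

public class BulkInsertSortModeConfigSketch {

    /** Hypothetical helper: builds writer properties with the given sort mode. */
    static Properties writerProps(String sortMode) {
        Properties props = new Properties();
        // Key and mode names are taken from this PR's description.
        props.setProperty("hoodie.bulkinsert.sort.mode", sortMode);
        return props;
    }

    public static void main(String[] args) {
        Properties props = writerProps("PARTITION_PATH_REPARTITION_AND_SORT");
        // prints PARTITION_PATH_REPARTITION_AND_SORT
        System.out.println(props.getProperty("hoodie.bulkinsert.sort.mode"));
    }
}
```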
Risk level
none
Documentation Update
This PR revises the docs for hoodie.bulkinsert.sort.mode in HoodieWriteConfig.
Contributor's checklist