[HUDI-5338] Adjust coalesce behavior within NONE sort mode for bulk insert #7396

yihua · 2022-12-07T01:29:55Z

Change Logs

Before this change, the NONE sort mode for bulk insert coalesced the input records or rows to the bulk insert shuffle parallelism (hoodie.bulkinsert.shuffle.parallelism), which reduced the parallelism. This could increase write latency when the reduced parallelism left cluster workers underutilized.

This PR adjusts NONE sort mode for bulk insert so that, by default, coalesce is not applied, matching the default parquet write behavior. The NONE sort mode still applies coalesce for clustering as the clustering operation relies on the bulk insert and the specified number of output Spark partitions to write a specific number of files.
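The rule described above can be sketched in plain Java with no Spark dependency. This is only an illustration under stated assumptions: the class, method, and parameter names are hypothetical and not taken from the PR's code. The only Spark fact relied on is that coalesce without a shuffle can merge partitions but never increases their count.

```java
// Hypothetical sketch; class, method, and parameter names are illustrative only
// and do not mirror Hudi's actual NonSortPartitioner API.
final class NoneSortModeSketch {
    private NoneSortModeSketch() {}

    /**
     * Returns the number of Spark partitions the writer would see in NONE sort mode.
     *
     * @param inputPartitions            partition count of the incoming records or rows
     * @param requestedOutputPartitions  requested count (shuffle parallelism or clustering target)
     * @param enforceNumOutputPartitions true when the caller (e.g. clustering) needs the requested count
     */
    static int outputPartitions(int inputPartitions,
                                int requestedOutputPartitions,
                                boolean enforceNumOutputPartitions) {
        if (inputPartitions <= 0 || requestedOutputPartitions <= 0) {
            throw new IllegalArgumentException("partition counts must be positive");
        }
        if (!enforceNumOutputPartitions) {
            // Default bulk insert path after this PR: no coalesce, input parallelism is kept.
            return inputPartitions;
        }
        // Clustering path: coalesce toward the requested count.
        // A coalesce without shuffle can only merge partitions, so the count never grows.
        return Math.min(inputPartitions, requestedOutputPartitions);
    }
}
```

With 200 input partitions and a requested count of 50, the sketch returns 200 for a plain bulk insert and 50 when the requested count must be respected.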

New tests are added for the behavior change.

Impact

Removing coalesce within NONE sort mode for bulk insert reduces write latency when the input parallelism is higher than the bulk insert shuffle parallelism and the lower parallelism would leave cluster workers underutilized.
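To make the utilization argument concrete, here is a rough back-of-the-envelope sketch. It assumes one task per core and one task per Spark partition, and all names and numbers are hypothetical; real latency also depends on skew, I/O, and scheduling.

```java
// Hypothetical sketch; assumes one task per core and one task per partition.
final class WorkerUtilizationSketch {
    private WorkerUtilizationSketch() {}

    /** Number of cores that can be busy at the same time for the given partition count. */
    static int busyCores(int partitions, int totalCores) {
        if (partitions <= 0 || totalCores <= 0) {
            throw new IllegalArgumentException("arguments must be positive");
        }
        // At most one running task per partition, and at most one task per core.
        return Math.min(partitions, totalCores);
    }

    /** Fraction of the cluster's cores that can be kept busy, between 0 and 1. */
    static double utilization(int partitions, int totalCores) {
        return (double) busyCores(partitions, totalCores) / totalCores;
    }
}
```

For example, on a hypothetical 100-core cluster, coalescing 200 input partitions down to 50 lets at most half of the cores work on the write, while keeping all 200 partitions can keep every core busy.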

For clustering, there is no behavior change, i.e., coalesce still happens in NONE sort mode for bulk insert in clustering.

Risk level

low

Documentation Update

HUDI-5339 tracks updating the docs for the behavior change in NONE sort mode for bulk insert.

Contributor's checklist

Read through contributor's guide
Change Logs and Impact were stated clearly
Adequate tests were added if applicable
CI passed

alexeykudinkin

@yihua we also need to

Review other impls that actually use the parallelism hint (like GlobalSortPartitioner) and make sure that we use max(config, input_parallelism)

...hudi-spark-client/src/main/java/org/apache/hudi/execution/bulkinsert/NonSortPartitioner.java

Zouxxyy · 2022-12-07T17:30:11Z

@yihua @alexeykudinkin Could you review this? #7372 (comment). I think they are somewhat related. Moreover, bulk insert is also used in clustering, where its parallelism is determined by the number of clustering output files in one group.

yihua · 2022-12-09T19:39:04Z

@yihua we also need to

Review other impls that actually use the parallelism hint (like GlobalSortPartitioner) and make sure that we use max(config, input_parallelism)

I created a ticket for the follow-up: HUDI-5360. Regarding the GlobalSortPartitioner and some other partitioners, we do want to reduce the parallelism instead of maximizing/keeping the input parallelism for file sizing, so that larger files can be created.
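The trade-off in this exchange can be sketched as two policies for picking the output parallelism, plus a rough average-size estimate. This is a hedged illustration only: the class and method names are hypothetical, and the size estimate ignores skew, compression, and any file size limits the writer applies.

```java
// Hypothetical sketch; names are illustrative and not part of Hudi's API.
final class ParallelismPolicySketch {
    private ParallelismPolicySketch() {}

    /** Reviewer's suggestion: never go below the input parallelism. */
    static int keepAtLeastInput(int configured, int input) {
        return Math.max(configured, input);
    }

    /** Policy described in the reply for sorting partitioners: use the configured value. */
    static int useConfigured(int configured, int input) {
        return configured;
    }

    /** Rough average bytes written per output partition (integer division). */
    static long avgBytesPerPartition(long totalBytes, int partitions) {
        if (totalBytes < 0 || partitions <= 0) {
            throw new IllegalArgumentException("invalid arguments");
        }
        return totalBytes / partitions;
    }
}
```

With a configured parallelism of 50 and an input parallelism of 200, the first policy yields 200 partitions and the second yields 50, so for the same input the second produces roughly four times as much data per partition, which is the "larger files" effect mentioned in the reply.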

yihua · 2022-12-09T19:41:10Z

@yihua @alexeykudinkin Could you review this? #7372 (comment). I think they are somewhat related.

Thanks for raising this. I'll check the PR.

Moreover, bulk insert is also used in clustering, where its parallelism is determined by the number of clustering output files in one group.

Thanks for pointing this out. Yes, I also noticed this. I revised the PR so that the NONE sort mode can still respect the specified number of output partitions for clustering.

yihua · 2022-12-10T03:46:35Z

CI passed before rebasing, and the affected tests also pass after rebasing. I'll merge this PR once the GitHub Actions checks pass.

hudi-bot · 2022-12-10T04:34:59Z

CI report:

637214c Azure: SUCCESS
c5a9d2e Azure: PENDING

alexeykudinkin reviewed Dec 7, 2022

codope added the priority:blocker, writer-core (issues relating to core transactions/write actions), and release-0.12.2 (patches targeted for 0.12.2) labels Dec 7, 2022

codope assigned alexeykudinkin Dec 7, 2022

yihua mentioned this pull request Dec 9, 2022

[HUDI-5358] Fix flaky tests in TestCleanerInsertAndCleanByCommits #7420

Merged

alexeykudinkin approved these changes Dec 9, 2022

yihua changed the title ~~[HUDI-5338] Remove coalesce within NONE sort mode for bulk insert~~ [HUDI-5338] Adjust coalesce behavior within NONE sort mode for bulk insert Dec 9, 2022

yihua force-pushed the HUDI-5338-remove-coalesce-none-sort branch from 87f44b3 to 637214c on December 9, 2022 22:12

yihua added 6 commits December 9, 2022 19:06

[HUDI-5338] Remove coalesce within NONE sort mode for bulk insert

e5010c5

Fix test

f2b1225

Add a config for NONE sort mode to decide if the number of output partitions must be respected

8998e3a

Remove flaky test fix

5e87ac7

Revise Javadocs and add new tests

9f1106d

Address naming nit

20dd6c6

yihua force-pushed the HUDI-5338-remove-coalesce-none-sort branch from 637214c to 20dd6c6 on December 10, 2022 03:20

yihua added 2 commits December 9, 2022 19:39

Fix rebase

9eff997

Add validation on the number of partitions

c5a9d2e

yihua merged commit 273a5bb into apache:master Dec 10, 2022

alexeykudinkin mentioned this pull request Feb 6, 2023

[HUDI-5716] Cleaning up Partitioners hierarchy #7872

Open
