[HUDI-5338] Adjust coalesce behavior within NONE sort mode for bulk insert #7396

Merged
merged 8 commits into from
Dec 10, 2022

Conversation

yihua
Contributor

@yihua yihua commented Dec 7, 2022

Change Logs

Before this change, the NONE sort mode for bulk insert coalesced the input records or rows to the bulk insert shuffle parallelism (hoodie.bulkinsert.shuffle.parallelism), reducing the parallelism of the write. This could increase write latency when the reduced parallelism leaves the cluster workers underutilized.

This PR adjusts NONE sort mode for bulk insert so that, by default, coalesce is not applied, matching the default parquet write behavior. The NONE sort mode still applies coalesce for clustering as the clustering operation relies on the bulk insert and the specified number of output Spark partitions to write a specific number of files.
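
To make the rule above concrete, here is a minimal, Spark-free sketch of the partition-count decision. The class, method, and flag names are hypothetical and are not taken from the Hudi code; the only external fact assumed is that Spark's coalesce without a shuffle can lower the partition count but never raise it.

```java
final class NoneSortModeSketch {
    /**
     * Hypothetical helper (not Hudi's API): models how many Spark partitions,
     * and therefore write tasks, leave the NONE sort mode after this change.
     *
     * @param inputPartitions            partitions of the incoming records or rows
     * @param requestedPartitions        number of output partitions asked for by the caller
     * @param enforceNumOutputPartitions true for clustering, false for plain bulk insert
     */
    static int outputPartitions(int inputPartitions,
                                int requestedPartitions,
                                boolean enforceNumOutputPartitions) {
        if (inputPartitions <= 0 || requestedPartitions <= 0) {
            throw new IllegalArgumentException("partition counts must be positive");
        }
        if (!enforceNumOutputPartitions) {
            // Plain bulk insert: no coalesce, the input parallelism is kept as-is.
            return inputPartitions;
        }
        // Clustering: coalesce to the requested count. A shuffle-free coalesce
        // cannot increase the number of partitions, hence the min().
        return Math.min(inputPartitions, requestedPartitions);
    }

    public static void main(String[] args) {
        System.out.println("bulk insert: " + outputPartitions(400, 50, false));
        System.out.println("clustering:  " + outputPartitions(400, 50, true));
    }
}
```

With 400 input partitions and 50 requested, the sketch keeps 400 partitions for a plain bulk insert and returns 50 for clustering.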

New tests are added for the behavior change.

Impact

Removing coalesce from the NONE sort mode for bulk insert reduces write latency when the input parallelism is higher than the bulk insert shuffle parallelism and the lower shuffle parallelism would otherwise leave the cluster workers underutilized.

For clustering, there is no behavior change, i.e., coalesce still happens in NONE sort mode for bulk insert in clustering.
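
As a rough illustration of the impact, the sketch below builds the writer options for a bulk insert in NONE sort mode and compares the number of write tasks before and after this PR. The three option keys are existing Hudi configs; the class and helper methods are hypothetical, and the before/after numbers only model the coalesce step, not the rest of the write path.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class BulkInsertNoneModeOptions {
    static final String SHUFFLE_PARALLELISM = "hoodie.bulkinsert.shuffle.parallelism";

    // Options one would pass to the Hudi Spark writer for a NONE-mode bulk insert.
    static Map<String, String> writerOptions(int shuffleParallelism) {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.datasource.write.operation", "bulk_insert");
        opts.put("hoodie.bulkinsert.sort.mode", "NONE");
        opts.put(SHUFFLE_PARALLELISM, Integer.toString(shuffleParallelism));
        return opts;
    }

    // Before this PR (assumed model): input is coalesced to the shuffle parallelism.
    static int writeTasksBefore(int inputPartitions, Map<String, String> opts) {
        int configured = Integer.parseInt(opts.get(SHUFFLE_PARALLELISM));
        return Math.min(inputPartitions, configured);
    }

    // After this PR (assumed model): NONE mode keeps the input partitions for bulk insert.
    static int writeTasksAfter(int inputPartitions) {
        return inputPartitions;
    }

    public static void main(String[] args) {
        Map<String, String> opts = writerOptions(50);
        int inputPartitions = 400;
        System.out.println("before: " + writeTasksBefore(inputPartitions, opts) + " write tasks");
        System.out.println("after:  " + writeTasksAfter(inputPartitions) + " write tasks");
    }
}
```

On a cluster with more than 50 free task slots, the "before" case leaves slots idle during the write, which is the latency cost described above.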

Risk level

low

Documentation Update

HUDI-5339 for updating docs regarding the behavior change in NONE sort mode for bulk insert.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

Contributor

@alexeykudinkin alexeykudinkin left a comment

@yihua we also need to

  • Review other impls that actually use the parallelism hint (like GlobalSortPartitioner) and make sure that we use max(config, input_parallelism)

@codope codope added priority:blocker writer-core Issues relating to core transactions/write actions release-0.12.2 Patches targetted for 0.12.2 labels Dec 7, 2022
@Zouxxyy
Contributor

Zouxxyy commented Dec 7, 2022

@yihua @alexeykudinkin Could you review this? #7372 (comment). I think they are somewhat related. Moreover, bulk insert is also used in clustering, where its parallelism is determined by the number of output files in one clustering group.

@yihua
Contributor Author

yihua commented Dec 9, 2022

@yihua we also need to

  • Review other impls that actually use the parallelism hint (like GlobalSortPartitioner) and make sure that we use max(config, input_parallelism)

I created a ticket for the follow-up: HUDI-5360. Regarding the GlobalSortPartitioner and some other partitioners, we do want to reduce the parallelism instead of maximizing/keeping the input parallelism for file sizing, so that larger files can be created.
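
To illustrate the trade-off between the two policies, here is a small back-of-the-envelope sketch. All names are hypothetical. It assumes data is spread evenly and that each Spark partition writes roughly one file, which is a simplification (a partition that spans several Hudi partition paths writes more than one file).

```java
final class ParallelismHintSketch {
    // Policy suggested in the review: never go below the input parallelism.
    static int keepInputParallelism(int configured, int inputPartitions) {
        return Math.max(configured, inputPartitions);
    }

    // Simplified model of a sorting partitioner: it shuffles to the configured
    // parallelism, even when that is lower than the input parallelism.
    static int useConfiguredParallelism(int configured, int inputPartitions) {
        return configured;
    }

    // Rough file size if the data is spread evenly, one file per partition.
    static long approxFileSizeMiB(long totalMiB, int partitions) {
        return totalMiB / partitions;
    }

    public static void main(String[] args) {
        long totalMiB = 100L * 1024; // 100 GiB of input, an arbitrary example
        int configured = 200;
        int inputPartitions = 1000;

        int kept = keepInputParallelism(configured, inputPartitions);
        int reduced = useConfiguredParallelism(configured, inputPartitions);

        System.out.println(kept + " partitions, ~" + approxFileSizeMiB(totalMiB, kept) + " MiB per file");
        System.out.println(reduced + " partitions, ~" + approxFileSizeMiB(totalMiB, reduced) + " MiB per file");
    }
}
```

Under these assumptions, keeping 1000 partitions yields files of about 102 MiB each, while reducing to 200 partitions yields files of about 512 MiB each, which is why the lower parallelism is preferred for file sizing in the sorting partitioners.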

@yihua
Contributor Author

yihua commented Dec 9, 2022

@yihua @alexeykudinkin Could you review this? #7372 (comment). I think they are somewhat related.

Thanks for raising this. I'll check the PR.

Moreover, bulk insert is also used in clustering, where its parallelism is determined by the number of output files in one clustering group.

Thanks for pointing this out. Yes, I also noticed this. I revised the PR so that the NONE sort mode can still respect the specified number of output partitions for clustering.

@yihua yihua changed the title [HUDI-5338] Remove coalesce within NONE sort mode for bulk insert [HUDI-5338] Adjust coalesce behavior within NONE sort mode for bulk insert Dec 9, 2022
@yihua yihua force-pushed the HUDI-5338-remove-coalesce-none-sort branch from 87f44b3 to 637214c Compare December 9, 2022 22:12
@yihua yihua force-pushed the HUDI-5338-remove-coalesce-none-sort branch from 637214c to 20dd6c6 Compare December 10, 2022 03:20
@yihua
Contributor Author

yihua commented Dec 10, 2022

CI passed before rebasing, and the affected tests also pass after rebasing. I'll merge this PR once the GitHub Actions checks pass.

@yihua yihua merged commit 273a5bb into apache:master Dec 10, 2022
nsivabalan pushed a commit that referenced this pull request Dec 13, 2022
…nsert (#7396)

This PR adjusts NONE sort mode for bulk insert so that, by default, coalesce is not applied, matching the default parquet write behavior. The NONE sort mode still applies coalesce for clustering as the clustering operation relies on the bulk insert and the specified number of output Spark partitions to write a specific number of files.
alexeykudinkin pushed a commit to onehouseinc/hudi that referenced this pull request Dec 14, 2022
…nsert (apache#7396)

alexeykudinkin pushed a commit that referenced this pull request Dec 14, 2022
…nsert (#7396)

fengjian428 pushed a commit to fengjian428/hudi that referenced this pull request Apr 5, 2023
…nsert (apache#7396)

Labels
priority:blocker release-0.12.2 Patches targetted for 0.12.2 writer-core Issues relating to core transactions/write actions
Projects
Archived in project
5 participants