[HUDI-7452] Repartition row dataset in S3/GCS based on task size #10777

Merged
yihua merged 1 commit into apache:master from vinishjail97:HUDI-7452-Repartition-S3-GCS on Feb 29, 2024
@vinishjail97
Contributor

Change Logs

The current code calls coalesce, which can only decrease the number of partitions, never increase it. This PR adds a function, coalesceOrRepartition, which either coalesces or repartitions the dataset by comparing the RDD's current partition count with the numPartitions computed from the task/partition size.
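The decision described above can be sketched roughly as follows. This is an illustration in plain Python with no Spark dependency, not the code merged in this PR: the helper names `target_partitions` and `coalesce_or_repartition`, the byte-based sizing, and the "none" case for equal counts are all assumptions made for the example.

```python
from typing import Tuple


def target_partitions(total_bytes: int, bytes_per_partition: int) -> int:
    """Hypothetical helper: partitions needed so each holds about bytes_per_partition."""
    if bytes_per_partition <= 0:
        raise ValueError("bytes_per_partition must be positive")
    # Integer ceiling division; keep at least one partition for an empty batch.
    return max(1, (total_bytes + bytes_per_partition - 1) // bytes_per_partition)


def coalesce_or_repartition(current_partitions: int, num_partitions: int) -> Tuple[str, int]:
    """Hypothetical helper: choose which Spark operation reaches num_partitions.

    Returns the operation name and the resulting partition count.
    """
    if num_partitions < current_partitions:
        # Shrinking: coalesce merges existing partitions without a full shuffle.
        return ("coalesce", num_partitions)
    if num_partitions > current_partitions:
        # Growing: coalesce cannot add partitions, so a shuffle-based
        # repartition is required.
        return ("repartition", num_partitions)
    # Assumption for this sketch: leave the dataset alone when counts match.
    return ("none", current_partitions)


if __name__ == "__main__":
    wanted = target_partitions(total_bytes=10_000, bytes_per_partition=1_000)
    print(coalesce_or_repartition(current_partitions=4, num_partitions=wanted))
```

With a real Spark dataset, the two branches would map onto `coalesce(numPartitions)` and `repartition(numPartitions)` respectively; the sketch only models the choice between them.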

Impact

Improves the S3/GCS sources so that parallelism can be increased or decreased based on partition size.

Risk level (write none, low, medium or high below)

Medium

Documentation Update

None.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:S PR with lines of changes in (10, 100] label Feb 28, 2024
@vinishjail97
Contributor Author

@hudi-bot run azure

@apache apache deleted a comment from hudi-bot Feb 28, 2024
@hudi-bot
Collaborator

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@yihua yihua merged commit 98af701 into apache:master Feb 29, 2024

Labels

release-0.15.0 size:S PR with lines of changes in (10, 100]

3 participants