[HUDI-6863] Revert auto-tuning of dedup parallelism #9722

Merged
Merged 2 commits into apache:master on Sep 16, 2023

Conversation

@yihua (Contributor) commented on Sep 15, 2023

Change Logs

Before this PR, the auto-tuning logic for dedup parallelism dictated the write parallelism, so the user-configured `hoodie.upsert.shuffle.parallelism` was ignored. This PR reverts #6802 to fix the issue.
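
For context, a minimal, illustrative sketch of a Spark upsert that sets `hoodie.upsert.shuffle.parallelism` explicitly, the user-configured value that the auto-tuning logic ended up overriding (table name, record key field, and path below are hypothetical, not from this PR):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class UpsertParallelismExample {
  // Writes the given DataFrame to a Hudi table as an upsert with an explicit
  // shuffle parallelism. Table name, key field, and path are illustrative only.
  static void upsert(Dataset<Row> df) {
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "example_table")               // hypothetical table name
        .option("hoodie.datasource.write.recordkey.field", "uuid")  // hypothetical key field
        .option("hoodie.datasource.write.operation", "upsert")
        // The user-configured value that the auto-tuned dedup parallelism overrode before this PR.
        .option("hoodie.upsert.shuffle.parallelism", "200")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/example_table");                           // hypothetical path
  }
}
```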

Impact

Performance fix

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@apache deleted a comment from hudi-bot on Sep 15, 2023
@nsivabalan (Contributor) commented:

Let's revisit the problems #6802 was tackling. The main issue it was addressing is making our shuffle parallelism dynamic, relative to the incoming df's number of partitions. So, if someone is running 1000s of pipelines, they don't need to statically set the right value for shuffle parallelism for each of the 1000 pipelines.

Can you help me understand what issue we are hitting that warrants reverting it?
Also, this would mean that we are going back to the old state where we expect users to explicitly configure the shuffle parallelism.
If so, do we have a plan around dynamically choosing the right shuffle partition value depending on the incoming batch?

@yihua (Contributor, Author) commented on Sep 15, 2023

> Let's revisit the problems #6802 was tackling. The main issue it was addressing is making our shuffle parallelism dynamic, relative to the incoming df's number of partitions. So, if someone is running 1000s of pipelines, they don't need to statically set the right value for shuffle parallelism for each of the 1000 pipelines.
>
> Can you help me understand what issue we are hitting that warrants reverting it? Also, this would mean that we are going back to the old state where we expect users to explicitly configure the shuffle parallelism. If so, do we have a plan around dynamically choosing the right shuffle partition value depending on the incoming batch?

This PR does not revert the dynamic determination of the shuffle parallelism. The dynamically decided target shuffle parallelism is still passed in as the `int parallelism` argument of `deduplicateRecords`. Without the revert, the user loses the ability to override the parallelism through the shuffle parallelism configs, because `parallelism` can be ignored inside this method and the rest of the write DAG then uses the new parallelism.
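
A simplified sketch of the pattern described above (hypothetical code, not the actual Hudi implementation; only the `deduplicateRecords` name and the `parallelism` argument come from this discussion):

```java
import org.apache.spark.api.java.JavaRDD;

// Hypothetical sketch of the behavior being reverted; not the real Hudi code.
class DedupParallelismSketch<T> {
  JavaRDD<T> deduplicateRecords(JavaRDD<T> records, int parallelism) {
    // Before the revert (sketch): the shuffle parallelism is derived from the
    // incoming RDD's partition count, so the caller-supplied `parallelism`
    // (which reflects hoodie.upsert.shuffle.parallelism) is ignored.
    int effectiveParallelism = Math.max(records.getNumPartitions(), 1);

    // After the revert (sketch): honor the caller-supplied value instead.
    // int effectiveParallelism = parallelism;

    // The partitioning chosen here carries through to the rest of the write DAG.
    return records.distinct(effectiveParallelism);
  }
}
```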

@nsivabalan (Contributor) left a review comment:

One minor comment. Source code changes look good.

@apache deleted a comment from hudi-bot on Sep 16, 2023
@yihua (Contributor, Author) commented on Sep 16, 2023

CI is green.
[Screenshot: CI checks passing, Sep 15, 2023]

@yihua merged commit ea8f925 into apache:master on Sep 16, 2023
31 checks passed
prashantwason pushed a commit that referenced this pull request on Sep 19, 2023:
Before this PR, the auto-tuning logic for dedup parallelism dictates the write parallelism so that the user-configured `hoodie.upsert.shuffle.parallelism` is ignored.  This commit reverts #6802 to fix the issue.