[HUDI-6863] Revert auto-tuning of dedup parallelism #9722

Merged
Merged 2 commits into apache:master on Sep 16, 2023

Conversation

@yihua (Contributor) commented on Sep 15, 2023

Change Logs

Before this PR, the auto-tuning logic for dedup parallelism dictated the write parallelism, so the user-configured `hoodie.upsert.shuffle.parallelism` was ignored. This PR reverts #6802 to fix the issue.
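
For context, a minimal, illustrative sketch of a Spark upsert that sets `hoodie.upsert.shuffle.parallelism` explicitly, the user-configured value that the auto-tuning logic ended up overriding (table name, record key field, and path below are hypothetical, not from this PR):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class UpsertParallelismExample {
  // Writes the given DataFrame to a Hudi table as an upsert with an explicit
  // shuffle parallelism. Table name, key field, and path are illustrative only.
  static void upsert(Dataset<Row> df) {
    df.write()
        .format("hudi")
        .option("hoodie.table.name", "example_table")               // hypothetical table name
        .option("hoodie.datasource.write.recordkey.field", "uuid")  // hypothetical key field
        .option("hoodie.datasource.write.operation", "upsert")
        // The user-configured value that the auto-tuned dedup parallelism overrode before this PR.
        .option("hoodie.upsert.shuffle.parallelism", "200")
        .mode(SaveMode.Append)
        .save("/tmp/hudi/example_table");                           // hypothetical path
  }
}
```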

Impact

Performance fix

Risk level

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@apache deleted a comment from hudi-bot on Sep 15, 2023
@nsivabalan (Contributor) commented:

Let's revisit the problems #6802 was tackling. The main issue it was addressing is making our shuffle parallelism dynamic, relative to the incoming df's number of partitions. So, if someone is running 1000s of pipelines, they don't need to statically set the right value for shuffle parallelism for each of the 1000 pipelines.

Can you help me understand what issue we are hitting that warrants reverting it?
Also, this would mean that we are going back to the old state where we expect users to explicitly configure the shuffle parallelism.
If so, do we have a plan around dynamically choosing the right shuffle partition value depending on the incoming batch?

@yihua (Contributor, Author) commented on Sep 15, 2023

> Let's revisit the problems #6802 was tackling. The main issue it was addressing is making our shuffle parallelism dynamic, relative to the incoming df's number of partitions. So, if someone is running 1000s of pipelines, they don't need to statically set the right value for shuffle parallelism for each of the 1000 pipelines.
>
> Can you help me understand what issue we are hitting that warrants reverting it? Also, this would mean that we are going back to the old state where we expect users to explicitly configure the shuffle parallelism. If so, do we have a plan around dynamically choosing the right shuffle partition value depending on the incoming batch?

This PR does not revert the dynamic determination of the shuffle parallelism. The dynamically decided target shuffle parallelism is still passed in as the `int parallelism` argument of `deduplicateRecords`. Without the revert, the user loses the ability to override the parallelism through the shuffle parallelism configs, because `parallelism` can be ignored inside this method and the rest of the write DAG then uses the new parallelism.
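
A simplified sketch of the pattern described above (hypothetical code, not the actual Hudi implementation; only the `deduplicateRecords` name and the `parallelism` argument come from this discussion):

```java
import org.apache.spark.api.java.JavaRDD;

// Hypothetical sketch of the behavior being reverted; not the real Hudi code.
class DedupParallelismSketch<T> {
  JavaRDD<T> deduplicateRecords(JavaRDD<T> records, int parallelism) {
    // Before the revert (sketch): the shuffle parallelism is derived from the
    // incoming RDD's partition count, so the caller-supplied `parallelism`
    // (which reflects hoodie.upsert.shuffle.parallelism) is ignored.
    int effectiveParallelism = Math.max(records.getNumPartitions(), 1);

    // After the revert (sketch): honor the caller-supplied value instead.
    // int effectiveParallelism = parallelism;

    // The partitioning chosen here carries through to the rest of the write DAG.
    return records.distinct(effectiveParallelism);
  }
}
```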

@nsivabalan (Contributor) left a review comment:

One minor comment. Source code changes look good.

@apache deleted a comment from hudi-bot on Sep 16, 2023
@yihua (Contributor, Author) commented on Sep 16, 2023

CI is green.
[Screenshot: CI checks passing, Sep 15, 2023]

@yihua merged commit ea8f925 into apache:master on Sep 16, 2023
31 checks passed
prashantwason pushed a commit that referenced this pull request on Sep 19, 2023:
Before this PR, the auto-tuning logic for dedup parallelism dictates the write parallelism so that the user-configured `hoodie.upsert.shuffle.parallelism` is ignored.  This commit reverts #6802 to fix the issue.