[HUDI-6990] Configurable clustering task parallelism #9925

Merged
merged 4 commits into apache:master on Nov 4, 2023

Conversation

@Askwang (Contributor) commented Oct 26, 2023

Change Logs

When Spark executes clustering, it generates one read task per file in the clustering plan, which can be far too many tasks. This change makes the parallelism configurable when reading the files.
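
Roughly, the idea is to cap the parallelism of the RDD that reads the file slices instead of spawning one Spark task per file. A minimal sketch of that idea (the variable names, the getter on the write config, and the surrounding execution strategy are assumptions for illustration, not the exact merged code):

// Sketch only: cap the read parallelism with the new config instead of
// using one task per file in the clustering plan.
int readParallelism = Math.min(clusteringOps.size(),
        writeConfig.getClusteringGroupReadParallelism()); // assumed getter name
JavaRDD<ClusteringOperation> opsRdd = jsc.parallelize(clusteringOps, readParallelism);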

Impact

None

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@@ -161,6 +161,13 @@ public class HoodieClusteringConfig extends HoodieConfig {
+ "value will let the clustering job run faster, while it will give additional pressure to the "
+ "execution engines to manage more concurrent running jobs.");

public static final ConfigProperty<Integer> CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty
.key("hoodie.clustering.read.records.parallelism")
.defaultValue(20)
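
For context, a user would set the new option like any other clustering config. A hypothetical usage sketch via the Spark datasource writer (the key and default below are taken from this revision of the diff; the key is renamed later in the review):

// Hypothetical usage sketch; key and default as shown in this revision of the diff.
df.write().format("hudi")
        .option("hoodie.clustering.inline", "true")
        .option("hoodie.clustering.read.records.parallelism", "20")
        .mode(SaveMode.Append)
        .save(basePath);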
Contributor:

Maybe just rename to hoodie.clustering.parallelism

Contributor Author (@Askwang):

changed

Contributor:

I'm wondering where the default of 20 comes from; should we limit the parallelism by default? Do we have a similar option for compaction?

Contributor Author (@Askwang):

We verified that 20 is a relatively stable value for running the job. The param hoodie.clustering.plan.strategy.target.file.max.bytes is 1g, which covers many files; reading that with a parallelism of 20 is enough, and there is no need to increase executor memory.

We should limit the parallelism by default; this can greatly reduce the number of tasks.

Compaction does something similar: it parallelizes the compaction operations by their number.

// HoodieCompactor.java
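// One compact() call per operation: the parallelism follows the number of operations.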
return context.parallelize(operations).map(operation -> compact(
        compactionHandler, metaClient, config, operation, compactionInstantTime, maxInstantTime, instantRange, taskContextSupplier, executionHelper))
        .flatMap(List::iterator);

Contributor:

cc @boneanxs to take a look~

Contributor Author (@Askwang):

cc @boneanxs Is the config name hoodie.clustering.rdd.read.parallelism OK, or do you have better advice?

Contributor:

What about the sum of clusteringGroup.getNumOutputFileGroups over all clustering groups, which is actually the final write parallelism?

Contributor Author (@Askwang):

> What about the sum of clusteringGroup.getNumOutputFileGroups over all clustering groups, which is actually the final write parallelism?

Not very friendly. The number of clustering groups is controlled by hoodie.clustering.plan.strategy.max.num.groups; if we increase that config, the read parallelism grows with it. In our case, we generate 100 groups and the read parallelism would be 200, which is not helpful for the reading tasks.
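
For clarity, the suggestion above would derive the read parallelism roughly as follows (a hypothetical sketch against the Avro plan model, not what this PR implements):

// Hypothetical sketch of the suggestion above: derive the read parallelism
// from the planned write parallelism (sum of output file groups).
int readParallelism = clusteringPlan.getInputGroups().stream()
        .mapToInt(HoodieClusteringGroup::getNumOutputFileGroups)
        .sum();
// e.g. 100 clustering groups with 2 output file groups each yields 200 read tasks,
// which is the scaling objected to above.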

Contributor:

Maybe hoodie.clustering.group.read.parallelism

Contributor Author (@Askwang):

changed

@danny0405 danny0405 changed the title [HUDI-6990] control read records parallelism [HUDI-6990] Configurable clustering task parallelism Oct 27, 2023
@Askwang (Contributor Author) commented Nov 1, 2023

@hudi-bot run azure

@danny0405 (Contributor):

@hudi-bot run azure

3 similar comments
@danny0405 (Contributor):

@hudi-bot run azure

@Askwang (Contributor Author) commented Nov 2, 2023

@hudi-bot run azure

@danny0405 (Contributor):

@hudi-bot run azure

@Askwang (Contributor Author) commented Nov 3, 2023

Azure seems to have some problems.

@danny0405 (Contributor):

You can rebase with the latest master to re-trigger it.

@Askwang (Contributor Author) commented Nov 3, 2023

@hudi-bot run azure

@hudi-bot commented Nov 3, 2023

CI report:

Bot commands — @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

@danny0405 (Contributor) merged commit 4768776 into apache:master on Nov 4, 2023

28 checks passed