[HUDI-6990] Configurable clustering task parallelism #9925

Merged
merged 4 commits into apache:master on Nov 4, 2023

Conversation

@Askwang (Contributor) commented Oct 26, 2023

Change Logs

When Spark executes clustering, it generates one read task per file in the clustering plan, which can be far too many tasks. This change makes the parallelism configurable when reading the files.
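
Roughly, the idea is to cap the parallelism of the RDD that reads the file slices instead of spawning one Spark task per file. A minimal sketch of that idea (the variable names, the getter on the write config, and the surrounding execution strategy are assumptions for illustration, not the exact merged code):

// Sketch only: cap the read parallelism with the new config instead of
// using one task per file in the clustering plan.
int readParallelism = Math.min(clusteringOps.size(),
        writeConfig.getClusteringGroupReadParallelism()); // assumed getter name
JavaRDD<ClusteringOperation> opsRdd = jsc.parallelize(clusteringOps, readParallelism);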

Impact

None

Risk level (write none, low, medium or high below)

If medium or high, explain what verification was done to mitigate the risks.

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change

  • The config description must be updated if new configs are added or the default value of the configs is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@@ -161,6 +161,13 @@ public class HoodieClusteringConfig extends HoodieConfig {
+ "value will let the clustering job run faster, while it will give additional pressure to the "
+ "execution engines to manage more concurrent running jobs.");

public static final ConfigProperty<Integer> CLUSTERING_READ_RECORDS_PARALLELISM = ConfigProperty
.key("hoodie.clustering.read.records.parallelism")
.defaultValue(20)
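
For context, a user would set the new option like any other clustering config. A hypothetical usage sketch via the Spark datasource writer (the key and default below are taken from this revision of the diff; the key is renamed later in the review):

// Hypothetical usage sketch; key and default as shown in this revision of the diff.
df.write().format("hudi")
        .option("hoodie.clustering.inline", "true")
        .option("hoodie.clustering.read.records.parallelism", "20")
        .mode(SaveMode.Append)
        .save(basePath);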
Contributor:

Maybe just rename to hoodie.clustering.parallelism

Contributor Author (@Askwang):

changed

Contributor:

I'm wondering where the default of 20 comes from; should we limit the parallelism by default? Do we have a similar option for compaction?

Contributor Author (@Askwang):

We verified that 20 is a relatively stable value for running the job. The param hoodie.clustering.plan.strategy.target.file.max.bytes is 1g, which covers many files; reading that with a parallelism of 20 is enough, and there is no need to increase executor memory.

We should limit the parallelism by default; this can greatly reduce the number of tasks.

Compaction does something similar: it parallelizes the compaction operations by their number.

// HoodieCompactor.java
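// One compact() call per operation: the parallelism follows the number of operations.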
return context.parallelize(operations).map(operation -> compact(
        compactionHandler, metaClient, config, operation, compactionInstantTime, maxInstantTime, instantRange, taskContextSupplier, executionHelper))
        .flatMap(List::iterator);

Contributor:

cc @boneanxs to take a look~

Contributor Author (@Askwang):

cc @boneanxs Is the config name hoodie.clustering.rdd.read.parallelism OK, or do you have better advice?

Contributor:

What about the sum of clusteringGroup.getNumOutputFileGroups over all clustering groups, which is actually the final write parallelism?

Contributor Author (@Askwang):

> What about the sum of clusteringGroup.getNumOutputFileGroups over all clustering groups, which is actually the final write parallelism?

Not very friendly. The number of clustering groups is controlled by hoodie.clustering.plan.strategy.max.num.groups; if we increase that config, the read parallelism grows with it. In our case, we generate 100 groups and the read parallelism would be 200, which is not helpful for the reading tasks.
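
For clarity, the suggestion above would derive the read parallelism roughly as follows (a hypothetical sketch against the Avro plan model, not what this PR implements):

// Hypothetical sketch of the suggestion above: derive the read parallelism
// from the planned write parallelism (sum of output file groups).
int readParallelism = clusteringPlan.getInputGroups().stream()
        .mapToInt(HoodieClusteringGroup::getNumOutputFileGroups)
        .sum();
// e.g. 100 clustering groups with 2 output file groups each yields 200 read tasks,
// which is the scaling objected to above.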

Contributor:

Maybe hoodie.clustering.group.read.parallelism

Contributor Author (@Askwang):

changed

@danny0405 danny0405 changed the title [HUDI-6990] control read records parallelism [HUDI-6990] Configurable clustering task parallelism Oct 27, 2023
@Askwang (Contributor Author) commented Nov 1, 2023

@hudi-bot run azure

@danny0405 (Contributor):

@hudi-bot run azure

3 similar comments
@danny0405 (Contributor):

@hudi-bot run azure

@Askwang (Contributor Author) commented Nov 2, 2023

@hudi-bot run azure

@danny0405 (Contributor):

@hudi-bot run azure

@Askwang (Contributor Author) commented Nov 3, 2023

Azure seems to have some problems.

@danny0405 (Contributor):

You can rebase with the latest master to re-trigger it.

@Askwang (Contributor Author) commented Nov 3, 2023

@hudi-bot run azure

@hudi-bot commented Nov 3, 2023

CI report:

Bot commands — @hudi-bot supports the following commands:
  • @hudi-bot run azure — re-run the last Azure build

@danny0405 (Contributor) merged commit 4768776 into apache:master on Nov 4, 2023

28 checks passed