
[HUDI-8340] Fixing functional index record generation using spark distributed computation #12091

Closed
nsivabalan wants to merge 3 commits into apache:master from nsivabalan:HUDI-8340-fixingFunctionalIndexUpdates1

Conversation

@nsivabalan (Contributor)

Change Logs

Fixing functional index record generation using spark distributed computation.
This patch is stacked on top of #12090

Impact

Fixing functional index record generation using spark distributed computation

Risk level (write none, low, medium or high below)

low

Documentation Update

Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".

  • The config description must be updated if new configs are added or the default value of a config is changed
  • Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
    ticket number here and follow the instructions to make changes to the website.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:M PR with lines of changes in (100, 300] label Oct 13, 2024
@codope codope force-pushed the HUDI-8340-fixingFunctionalIndexUpdates1 branch from 978ae60 to 438f932 Compare October 14, 2024 09:14
@codope codope changed the title [HUDI-8340][HUDI-8341] Fixing functional index record generation using spark distributed computation [HUDI-8340] Fixing functional index record generation using spark distributed computation Oct 14, 2024
int parallelism = Math.min(partitionFileSlicePairs.size(),
    dataWriteConfig.getMetadataConfig().getFunctionalIndexParallelism());
List<Pair<String, Pair<String, Long>>> partitionFilePathPairs = new ArrayList<>();
commitMetadata.getPartitionToWriteStats().forEach((dataPartition, writeStats) ->
    writeStats.forEach(writeStat -> partitionFilePathPairs.add(
        Pair.of(writeStat.getPartitionPath(),
            Pair.of(new StoragePath(dataMetaClient.getBasePath(), writeStat.getPath()).toString(),
                writeStat.getFileSizeInBytes())))));
Member:
Needed to combine with the base path to form the full path (that's what readers expect).

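The reviewer's point above is that write stats carry paths relative to the table base path, so the two must be joined before readers can resolve the file. A minimal, self-contained sketch of that join, using a hypothetical `fullPath` helper in place of Hudi's `StoragePath(basePath, relativePath)` constructor (the base path and file name below are illustrative, not from the PR):

```java
public class FullPathExample {
    // Hypothetical stand-in for new StoragePath(basePath, relativePath).toString():
    // joins the table base path with the relative path recorded in a write stat.
    static String fullPath(String basePath, String relativePath) {
        return basePath.endsWith("/") ? basePath + relativePath
                                      : basePath + "/" + relativePath;
    }

    public static void main(String[] args) {
        String basePath = "s3://bucket/table";               // illustrative table base path
        String writeStatPath = "2024/10/14/file-1.parquet";  // illustrative writeStat.getPath()
        System.out.println(fullPath(basePath, writeStatPath));
    }
}
```

Passing only `writeStat.getPath()` would hand readers a partition-relative path they cannot open, which is the bug this snippet of the patch addresses.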
Comment on lines +177 to +178
ForkJoinPool customThreadPool = new ForkJoinPool(parallelism);
List<HoodieRecord> allRecords = customThreadPool.submit(() ->
@codope (Member) commented on Oct 14, 2024:
Can't use engine context parallelism, because we need the context for dataset creation on which Spark functions can be applied. So, going with usual Java parallelism, which is still better than the previous sequential stats computation.

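The pattern codope describes above, a `ForkJoinPool` constructed with an explicit parallelism, works because a parallel stream submitted from inside a `ForkJoinPool` task runs on that pool's workers rather than the JVM-wide common pool, so concurrency stays bounded by the configured value. A minimal sketch (the parallelism value and the squaring work are illustrative stand-ins for the per-file-slice record generation in the patch):

```java
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class CustomPoolExample {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        int parallelism = 4;  // illustrative; the patch uses min(pairs, configured index parallelism)
        ForkJoinPool customThreadPool = new ForkJoinPool(parallelism);
        try {
            // submit(...).get() makes the parallel stream execute on customThreadPool,
            // not on ForkJoinPool.commonPool(), bounding concurrency to `parallelism`.
            List<Integer> result = customThreadPool.submit(() ->
                    IntStream.rangeClosed(1, 8)
                             .parallel()
                             .map(i -> i * i)   // stand-in for per-file record generation
                             .boxed()
                             .collect(Collectors.toList())
            ).get();
            System.out.println(result.stream().mapToInt(Integer::intValue).sum());
        } finally {
            customThreadPool.shutdown();
        }
    }
}
```

This keeps the heavy work off the common pool, which matters in a shared JVM, while still parallelizing what was previously a sequential computation.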
@codope codope force-pushed the HUDI-8340-fixingFunctionalIndexUpdates1 branch from b2d9b4f to 4ebb134 Compare October 14, 2024 19:28
@hudi-bot (Collaborator)

CI report:

Bot commands: @hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@nsivabalan nsivabalan closed this Oct 19, 2024

Labels

size:M PR with lines of changes in (100, 300]


3 participants