[HUDI-8340] Fixing functional index record generation using spark distributed computation #12091
Closed
nsivabalan wants to merge 3 commits into apache:master
978ae60 to 438f932
codope reviewed on Oct 14, 2024
int parallelism = Math.min(partitionFileSlicePairs.size(), dataWriteConfig.getMetadataConfig().getFunctionalIndexParallelism());
List<Pair<String, Pair<String, Long>>> partitionFilePathPairs = new ArrayList<>();
commitMetadata.getPartitionToWriteStats().forEach((dataPartition, writeStats) -> writeStats.forEach(writeStat -> partitionFilePathPairs.add(
    Pair.of(writeStat.getPartitionPath(), Pair.of(new StoragePath(dataMetaClient.getBasePath(), writeStat.getPath()).toString(), writeStat.getFileSizeInBytes())))));
Member
The relative path needs to be combined with the base path to form the full path (that's what readers expect).
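As a rough illustration of the comment above, the sketch below joins a table base path with a write stat's relative path. `FullPathSketch` and `toFullPath` are hypothetical names used only for this example; the helper is a plain-string stand-in for `new StoragePath(basePath, relativePath).toString()` in the diff and does not reproduce Hudi's actual path handling.

```java
public class FullPathSketch {

    // Hypothetical stand-in for new StoragePath(basePath, relativePath).toString().
    // Joins the two parts with exactly one "/" between them.
    static String toFullPath(String basePath, String relativePath) {
        String base = basePath.endsWith("/")
            ? basePath.substring(0, basePath.length() - 1)
            : basePath;
        String rel = relativePath.startsWith("/")
            ? relativePath.substring(1)
            : relativePath;
        return base + "/" + rel;
    }

    public static void main(String[] args) {
        // Illustrative values only: a write stat typically carries a path
        // relative to the table base path, while readers want the full path.
        String basePath = "/tmp/hudi_table";
        String relativePath = "2024/10/14/file1.parquet";
        System.out.println(toFullPath(basePath, relativePath));
    }
}
```

Without this step, a reader handed only the relative path would have no way to know which table (or storage scheme) it belongs to.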
codope approved these changes on Oct 14, 2024
codope reviewed on Oct 14, 2024
Comment on lines +177 to +178
ForkJoinPool customThreadPool = new ForkJoinPool(parallelism);
List<HoodieRecord> allRecords = customThreadPool.submit(() ->
Member
We can't use the engine context's parallelism here, because the context is needed to create the dataset on which the Spark functions are applied. So this goes with plain Java parallelism, which is still better than the previous sequential stats computation.
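The pattern described here can be sketched in isolation as follows. This is a minimal example, not the PR's code: `BoundedParallelismSketch`, `computeAll`, and the use of `String::length` as the per-file work are all hypothetical placeholders. It relies on the commonly used (but implementation-specific) behavior that a parallel stream started from inside a `ForkJoinPool` task runs on that pool rather than the common pool.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ForkJoinPool;
import java.util.stream.Collectors;

public class BoundedParallelismSketch {

    // Hypothetical helper: applies a placeholder computation to every file path
    // using at most maxParallelism worker threads.
    static List<Integer> computeAll(List<String> filePaths, int maxParallelism) {
        // ForkJoinPool rejects a parallelism of 0, so clamp to at least 1
        // in case the input list is empty.
        int parallelism = Math.max(1, Math.min(filePaths.size(), maxParallelism));
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            // Submitting the parallel stream as a task keeps its work on this pool.
            return pool.submit(() ->
                filePaths.parallelStream()
                    .map(String::length) // placeholder for the per-file stats computation
                    .collect(Collectors.toList())
            ).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            // Release the worker threads once the computation is done.
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("a", "bb", "ccc");
        System.out.println(computeAll(files, 8));
    }
}
```

Capping the pool size at the number of inputs avoids creating idle threads, and shutting the pool down in `finally` avoids leaking them if the computation fails.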
b2d9b4f to 4ebb134
Change Logs
Fixing functional index record generation using spark distributed computation.
This patch is stacked on top of #12090
Impact
Fixing functional index record generation using spark distributed computation
Risk level (write none, low medium or high below)
low
Documentation Update
Describe any necessary documentation update if there is any new feature, config, or user-facing change. If not, put "none".
Contributor's checklist