[HUDI-6476] Improve the performance of getAllPartitionPaths#9121

Merged
danny0405 merged 1 commit into apache:master from wecharyu:HUDI-6476 on Jul 5, 2023

@wecharyu wecharyu commented Jul 4, 2023

Change Logs

Currently, Hudi lists the status of every file in the Hudi table directory. This can be avoided to improve the performance of getAllPartitionPaths, especially for non-partitioned tables with many files. This patch makes the following changes:

  • remove one stage from getPartitionPathWithPathPrefix()
  • check only directories when looking for the PartitionMetadata
  • avoid listStatus of .hoodie/.hoodie_partition_metadata
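The idea behind the list above can be sketched as follows. This is a minimal illustration, not the code from the patch: it uses `java.nio.file` on a local file system instead of Hadoop's `FileSystem`, and the class `PartitionPathSketch` and its methods are hypothetical names. It assumes that a directory containing a `.hoodie_partition_metadata` file is a partition and needs no further traversal, and that the `.hoodie` folder is never a partition.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; not part of Hudi.
final class PartitionPathSketch {
  // Names as they appear in the PR description; assumed here for illustration.
  static final String METAFOLDER = ".hoodie";
  static final String METAFILE = ".hoodie_partition_metadata";

  /** Returns the sorted relative paths of all directories under basePath that hold a metafile. */
  static List<String> findPartitions(Path basePath) throws IOException {
    List<String> partitions = new ArrayList<>();
    collect(basePath, basePath, partitions);
    Collections.sort(partitions);
    return partitions;
  }

  private static void collect(Path basePath, Path dir, List<String> out) throws IOException {
    // Point lookup: a single existence check instead of listing every file in dir.
    if (Files.exists(dir.resolve(METAFILE))) {
      out.add(basePath.relativize(dir).toString());
      return; // assumed: a partition directory is not traversed further
    }
    // The filter keeps only sub-directories, so plain files are never visited.
    try (DirectoryStream<Path> children = Files.newDirectoryStream(dir, Files::isDirectory)) {
      for (Path child : children) {
        if (!child.getFileName().toString().equals(METAFOLDER)) {
          collect(basePath, child, out);
        }
      }
    }
  }
}
```

For a non-partitioned table whose base directory holds the metafile, `findPartitions` returns a single empty relative path after one existence check, without touching the data files.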

Impact

Performance improvement.

Risk level (write none, low, medium or high below)

None.

Documentation Update

None.

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

hudi-bot commented Jul 4, 2023

CI report:

partitionPaths.addAll(result.stream().filter(entry -> entry.getKey().isPresent()).map(entry -> entry.getKey().get())
.collect(Collectors.toList()));

good point, the code looks much simpler!

@danny0405 danny0405 self-assigned this Jul 5, 2023
@danny0405 danny0405 added the writer-core and area:performance (Performance optimizations) labels Jul 5, 2023
@danny0405 danny0405 merged commit 72f0477 into apache:master Jul 5, 2023

FileSystem fileSystem = path.getFileSystem(hadoopConf.get());
return Arrays.stream(fileSystem.listStatus(path));
if (HoodiePartitionMetadata.hasPartitionMetadata(fileSystem, path)) {
return Stream.of(Pair.of(Option.of(FSUtils.getRelativePartitionPath(new Path(datasetBasePath), path)), Option.empty()));

The partition metafile could have extensions like parquet, orc, etc. Did we consider that?

This was in the previous code:
fileStatus.getPath().getName().startsWith(HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX)

Thanks for your careful check, this case will be handled in HoodiePartitionMetadata#hasPartitionMetadata:

static List<Path> baseFormatMetaFilePaths(Path partitionPath) {
return Stream.of(HoodieFileFormat.PARQUET.getFileExtension(), HoodieFileFormat.ORC.getFileExtension())
.map(ext -> new Path(partitionPath, HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX + ext))
.collect(Collectors.toList());
}
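
To make the reply concrete, here is a rough sketch of how a bounded set of point lookups can replace the earlier prefix match on listed file names. It is an illustration under assumptions, not Hudi's implementation: it uses `java.nio.file` rather than Hadoop's `FileSystem`, the class `MetafileCheckSketch` is hypothetical, and the extensions `.parquet` and `.orc` are assumed to be what `HoodieFileFormat.PARQUET.getFileExtension()` and `HoodieFileFormat.ORC.getFileExtension()` return.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; not part of Hudi.
final class MetafileCheckSketch {
  // Stand-in for HoodiePartitionMetadata.HOODIE_PARTITION_METAFILE_PREFIX (value assumed).
  static final String METAFILE_PREFIX = ".hoodie_partition_metadata";

  /** Candidate metafile paths: the plain text file plus one per assumed base-format extension. */
  static List<Path> candidateMetaFiles(Path partitionPath) {
    return Stream.of("", ".parquet", ".orc")
        .map(ext -> partitionPath.resolve(METAFILE_PREFIX + ext))
        .collect(Collectors.toList());
  }

  /** True if any candidate exists; at most three existence checks, no directory listing. */
  static boolean hasPartitionMetadata(Path partitionPath) {
    return candidateMetaFiles(partitionPath).stream().anyMatch(Files::exists);
  }
}
```

One difference from a `startsWith` prefix match is worth noting: a file with the same prefix but an extension outside the candidate list is not recognized by this sketch.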
