
[HUDI-8073] Add hosts to storage path info and use it if present #11761

Merged
yihua merged 6 commits into apache:master from jonvex:abstract_engine_file_format on Aug 16, 2024

jonvex commented on Aug 12, 2024

Change Logs

FileSplit carries host information. When it is present, it is added to the storage path info and used.
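
For illustration only (this is not taken from the PR's diff), the "use hosts if present" idea can be sketched with stand-in types. `SplitSketch` and `HostHints.hostsOrEmpty` are hypothetical names introduced here, and the real change may be structured differently.

```java
import java.util.Arrays;

// Hypothetical stand-in for a file split that may carry preferred hosts.
final class SplitSketch {
  private final String path;
  private final String[] hosts; // may be null when the split has no locality info

  SplitSketch(String path, String[] hosts) {
    this.path = path;
    this.hosts = hosts;
  }

  String getPath() {
    return path;
  }

  String[] getHosts() {
    return hosts;
  }
}

final class HostHints {
  private HostHints() {
  }

  // Returns a copy of the split's hosts when present, otherwise an empty array,
  // so callers can pass the result along without a null check.
  static String[] hostsOrEmpty(SplitSketch split) {
    String[] hosts = split.getHosts();
    if (hosts == null || hosts.length == 0) {
      return new String[0];
    }
    return Arrays.copyOf(hosts, hosts.length);
  }
}
```

Falling back to an empty array is an assumption of this sketch; it keeps splits without locality information working as before.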

Impact

Possibly better performance for Hive.

Risk level (write none, low, medium or high below)

low

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

github-actions bot added the size:M (PR with lines of changes in (100, 300]) label on Aug 12, 2024
jonvex requested a review from yihua on August 13, 2024 01:33

CTTY left a comment

LGTM

jonvex and others added 2 commits on August 13, 2024 21:33
   */
-  public abstract ClosableIterator<T> getFileRecordIterator(
+  protected abstract ClosableIterator<T> getFileRecordIterator(
       StoragePath filePath, long start, long length, Schema dataSchema, Schema requiredSchema,

Can we keep this only? The caller should know the path to use. Otherwise, we may hide issues where path info is null.


package org.apache.hudi.storage;

public interface StorageFile {

The same goal can be achieved without this interface by adding the locations to StoragePathInfo.
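
A rough sketch of that suggestion, assuming a simplified path-info class with an optional locations field. `PathInfoSketch` is a hypothetical stand-in and does not reproduce Hudi's actual StoragePathInfo.

```java
import java.util.Arrays;

// Hypothetical, simplified stand-in for a path-info class with optional locations.
final class PathInfoSketch {
  private final String path;
  private final long length;
  private final long modificationTime;
  private final String[] locations; // null when no host information was supplied

  // Existing-style constructor: no locality information.
  PathInfoSketch(String path, long length, long modificationTime) {
    this(path, length, modificationTime, null);
  }

  // Additional constructor that also records the hosts serving the file.
  PathInfoSketch(String path, long length, long modificationTime, String[] locations) {
    this.path = path;
    this.length = length;
    this.modificationTime = modificationTime;
    this.locations = locations == null ? null : Arrays.copyOf(locations, locations.length);
  }

  String getPath() {
    return path;
  }

  long getLength() {
    return length;
  }

  long getModificationTime() {
    return modificationTime;
  }

  // Returns a copy of the recorded hosts, or null if none were recorded.
  String[] getLocations() {
    return locations == null ? null : Arrays.copyOf(locations, locations.length);
  }

  boolean hasLocations() {
    return locations != null && locations.length > 0;
  }
}
```

With the field on the path info itself, callers that do not care about locality keep using the three-argument constructor, and no separate interface is needed to carry the hosts.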

jonvex requested a review from yihua on August 15, 2024 20:18
fileStatus.getModificationTime());
}

public static StoragePathInfo convertToStoragePathInfo(FileStatus fileStatus, String[] locations) {

LocatedFileStatus (which extends FileStatus) stores BlockLocation[] locations. As a follow-up, see if we want to leverage that.
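
One way such a follow-up might look, sketched with stand-in types so it runs without Hadoop on the classpath. `BlockSketch` and `BlockHosts.distinctHosts` are hypothetical; in Hadoop, the hosts of a block are exposed through `BlockLocation.getHosts()`.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical stand-in for a block location that lists the hosts holding a block.
final class BlockSketch {
  private final String[] hosts;

  BlockSketch(String... hosts) {
    this.hosts = hosts;
  }

  String[] getHosts() {
    return hosts;
  }
}

final class BlockHosts {
  private BlockHosts() {
  }

  // Collects the distinct hosts across all blocks, preserving first-seen order.
  static String[] distinctHosts(BlockSketch[] blocks) {
    Set<String> seen = new LinkedHashSet<>();
    if (blocks != null) {
      for (BlockSketch block : blocks) {
        if (block == null || block.getHosts() == null) {
          continue; // skip blocks without host information
        }
        for (String host : block.getHosts()) {
          seen.add(host);
        }
      }
    }
    return seen.toArray(new String[0]);
  }
}
```

The resulting array could then be passed as the `locations` argument of a conversion method like the one above. Whether deduplicating hosts across blocks is the right policy is an open design choice, not something this thread settles.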

hudi-bot commented

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build
