[HUDI-7336] Introduce new HoodieStorage abstraction #10567

Merged
merged 4 commits into from
Jan 29, 2024

Conversation

yihua
Contributor

@yihua yihua commented Jan 26, 2024

Change Logs

This PR introduces the HoodieStorage abstraction and Hudi's counterparts to the Hadoop File System classes (org.apache.hadoop.fs.[FileSystem, Path, PathFilter, FileStatus]) to decouple Hudi's implementation from Hadoop classes, making it much easier to plug in different file system implementations. Detailed changes include:

  • HoodieStorage interface: the counterpart of Hadoop's FileSystem. It provides all I/O APIs on files and directories in storage, such as open, read, etc. It can also host storage-layer optimizations like caching, federated storage layout, hot/cold storage separation, etc. It must be implemented for each particular storage system.
    • HoodieHadoopStorage implements HoodieStorage with Hadoop's FileSystem.
  • HoodieLocation: the counterpart of Hadoop's Path. We migrate and simplify the path parsing logic in this class.
  • HoodieLocationFilter interface: the counterpart of Hadoop's PathFilter.
  • HoodieFileStatus: the counterpart of Hadoop's FileStatus. It keeps the location, length, isDirectory, and modification time, which are the fields used by Hudi.
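
To make the layering above concrete, here is a minimal, self-contained sketch of how such an abstraction could look. It is not the code in this PR: the names `Storage`, `Location`, `LocationFilter`, `FileStatus`, `LocalStorage`, and every method signature are hypothetical stand-ins chosen for illustration, and the implementation is backed by `java.nio.file` rather than Hadoop's FileSystem so that it runs without extra dependencies.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: all type and method names here are assumptions, not Hudi's API.
public class StorageSketch {

  /** Hypothetical stand-in for HoodieLocation: an immutable location string. */
  static final class Location {
    private final String value;

    Location(String value) {
      // Drop a single trailing slash so "/a/b/" and "/a/b" compare equal as strings.
      this.value = (value.length() > 1 && value.endsWith("/"))
          ? value.substring(0, value.length() - 1) : value;
    }

    /** Last component of the location, i.e. the text after the final '/'. */
    String getName() { return value.substring(value.lastIndexOf('/') + 1); }

    @Override public String toString() { return value; }
  }

  /** Hypothetical stand-in for HoodieLocationFilter. */
  interface LocationFilter { boolean accept(Location location); }

  /** Hypothetical stand-in for HoodieFileStatus: only the four fields named above. */
  static final class FileStatus {
    final Location location;
    final long length;
    final boolean isDirectory;
    final long modificationTime;

    FileStatus(Location location, long length, boolean isDirectory, long modificationTime) {
      this.location = location;
      this.length = length;
      this.isDirectory = isDirectory;
      this.modificationTime = modificationTime;
    }
  }

  /** Hypothetical stand-in for HoodieStorage: callers depend only on this interface. */
  interface Storage {
    OutputStream create(Location location, boolean overwrite) throws IOException;
    InputStream open(Location location) throws IOException;
    boolean exists(Location location);
    FileStatus getFileStatus(Location location) throws IOException;
    List<FileStatus> listDirectEntries(Location dir, LocationFilter filter) throws IOException;
  }

  /** One possible implementation, backed by the local file system via java.nio. */
  static final class LocalStorage implements Storage {
    private static Path toPath(Location location) { return Paths.get(location.toString()); }

    @Override public OutputStream create(Location location, boolean overwrite) throws IOException {
      Path path = toPath(location);
      if (!overwrite && Files.exists(path)) {
        throw new IOException("Already exists: " + location);
      }
      return Files.newOutputStream(path); // default options create or truncate the file
    }

    @Override public InputStream open(Location location) throws IOException {
      return Files.newInputStream(toPath(location));
    }

    @Override public boolean exists(Location location) { return Files.exists(toPath(location)); }

    @Override public FileStatus getFileStatus(Location location) throws IOException {
      Path path = toPath(location);
      boolean dir = Files.isDirectory(path);
      // Directories are reported with length 0 in this sketch.
      return new FileStatus(location, dir ? 0L : Files.size(path), dir,
          Files.getLastModifiedTime(path).toMillis());
    }

    @Override public List<FileStatus> listDirectEntries(Location dir, LocationFilter filter)
        throws IOException {
      List<FileStatus> result = new ArrayList<>();
      try (DirectoryStream<Path> children = Files.newDirectoryStream(toPath(dir))) {
        for (Path child : children) {
          Location location = new Location(dir + "/" + child.getFileName());
          if (filter.accept(location)) {
            result.add(getFileStatus(location));
          }
        }
      }
      return result;
    }
  }

  public static void main(String[] args) throws IOException {
    Storage storage = new LocalStorage();
    Location dir = new Location(Files.createTempDirectory("storage-sketch").toString());
    Location file = new Location(dir + "/data.parquet");
    try (OutputStream out = storage.create(file, false)) {
      out.write("hello".getBytes(StandardCharsets.UTF_8));
    }
    List<FileStatus> parquet =
        storage.listDirectEntries(dir, l -> l.getName().endsWith(".parquet"));
    System.out.println(parquet.size() + " file(s), first length = " + parquet.get(0).length);
  }
}
```

The point of the sketch is the dependency direction: calling code sees only the `Storage` interface and the small value classes, so swapping the local implementation for a Hadoop-backed one (the role HoodieHadoopStorage plays in this PR) would not change any call site.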

This is part of the effort to provide a Hudi storage abstraction and decouple hudi-common from Hadoop dependencies. For reference, the single large PR containing all of the changes can be found here: #10360.

Impact

No impact, as this PR only adds the new classes and does not yet integrate them into existing code paths.

Risk level

none

Documentation Update

N/A

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@yihua
Contributor Author

yihua commented Jan 26, 2024

Currently, this PR contains some changes from #10564. The PR is reviewable. Please do not merge until #10564 is merged.

Contributor

@danny0405 danny0405 left a comment

Looks good in general. I'm wondering, do we have a reference system for the HoodieStorage abstraction?

@yihua yihua changed the title [HUDI-7336] Introduce new HoodieStorage abstraction [HUDI-7336][RFR|DNM] Introduce new HoodieStorage abstraction Jan 26, 2024
@yihua
Contributor Author

yihua commented Jan 26, 2024

Looks good in general. I'm wondering, do we have a reference system for the HoodieStorage abstraction?

Yes, it's mainly based on Hadoop's FileSystem class, so that we can replace any FileSystem calls with HoodieStorage calls.

Contributor

@leesf leesf left a comment

Nice abstraction. LGTM

@yihua yihua force-pushed the HUDI-7336-hoodie-storage-abstraction branch from ea050a0 to c5ebd50 Compare January 29, 2024 01:52
@yihua yihua changed the title [HUDI-7336][RFR|DNM] Introduce new HoodieStorage abstraction [HUDI-7336] Introduce new HoodieStorage abstraction Jan 29, 2024
@hudi-bot

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

@yihua
Contributor Author

yihua commented Jan 29, 2024

Azure CI is green.

@yihua yihua merged commit 4c4fe76 into apache:master Jan 29, 2024
32 checks passed
yihua added a commit that referenced this pull request Feb 27, 2024
This commit introduces the `HoodieStorage` abstraction and Hudi's counterparts to the Hadoop File System classes (`org.apache.hadoop.fs.`[`FileSystem`, `Path`, `PathFilter`, `FileStatus`]) to decouple Hudi's implementation from Hadoop classes, making it much easier to plug in different file system implementations.
yihua added a commit that referenced this pull request Apr 18, 2024
…mmon module (#10591)

This commit makes the changes to replace most `FileSystem`, `Path`, and `FileStatus` usage with `HoodieStorage`, `StoragePath` and `StoragePathInfo` (introduced in #10567, renamed in #10672) in `hudi-common` module, to remove dependency on Hadoop FS abstraction which is not essential to most Hudi core read and write logic.

This commit still keeps using the Hadoop FileSystem-based implementation under the hood.  A follow-up PR will make `HoodieStorage` and I/O implementation pluggable.
yihua added a commit that referenced this pull request May 15, 2024
…mmon module (#10591)

This commit makes the changes to replace most `FileSystem`, `Path`, and `FileStatus` usage with `HoodieStorage`, `StoragePath` and `StoragePathInfo` (introduced in #10567, renamed in #10672) in `hudi-common` module, to remove dependency on Hadoop FS abstraction which is not essential to most Hudi core read and write logic.

This commit still keeps using the Hadoop FileSystem-based implementation under the hood.  A follow-up PR will make `HoodieStorage` and I/O implementation pluggable.