Skip to content

[HUDI-7776] Simplify HoodieStorage instance fetching#11259

Merged
yihua merged 12 commits intoapache:masterfrom
yihua:HUDI-7776-simplify-storage-fetch
May 25, 2024
Merged

[HUDI-7776] Simplify HoodieStorage instance fetching#11259
yihua merged 12 commits intoapache:masterfrom
yihua:HUDI-7776-simplify-storage-fetch

Conversation

@yihua
Copy link
Contributor

@yihua yihua commented May 19, 2024

Change Logs

This PR simplifies HoodieStorage instance fetching to pass down the HoodieStorage instance from the meta client as much as possible, instead of using reflection, which may not work with a given file system instance like TrinoFileSystem instance.

The major changes in this PR include:

  • Introduces a new config hoodie.storage.class for instantiation of HoodieStorage class in Spark engine only.
  • Makes HoodieIOFactory to instantiate with HoodieStorage instance and use the instance for creating readers and writers; makes HoodieStorage to store the StorageConfiguration instance and adds a new API #newInstance to HoodieStorage.
  • Removes usage of HoodieStorageUtils.getStorage on the read path in hudi-common and hudi-io modules. Before this PR, HoodieStorage instantiated through reflection with the path and storage configuration through HoodieStorageUtils.getStorage does not work with a provided file system instance like TrinoFileSystem. With this PR, HoodieStorage instance is passed down from the meta client as much as possible. For engines like Spark, we still have the reflection code for HoodieHadoopStorage to make it work on the executor side.
  • Refactors the code so that the HoodieStorage is passed down to the places it is needed.

Impact

As above.

Risk level

low

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Change Logs and Impact were stated clearly
  • Adequate tests were added if applicable
  • CI passed

@github-actions github-actions bot added the size:XL PR with lines of changes > 1000 label May 19, 2024
@yihua yihua force-pushed the HUDI-7776-simplify-storage-fetch branch 6 times, most recently from 24ea421 to 9cb1f6b Compare May 20, 2024 20:14
@yihua yihua marked this pull request as ready for review May 20, 2024 20:14
@yihua yihua force-pushed the HUDI-7776-simplify-storage-fetch branch from 9cb1f6b to e02d4a3 Compare May 20, 2024 20:18
@hudi-bot
Copy link
Collaborator

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

+ "by Hudi. The provided class should implement `org.apache.hudi.io.storage.HoodieIOFactory`.");


public static final ConfigProperty<String> HOODIE_STORAGE_CLASS = ConfigProperty
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a separate config? Can we infer from HOODIE_IO_FACTORY_CLASS? Or maybe we can add infer function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can replace HOODIE_IO_FACTORY_CLASS with HOODIE_STORAGE_CLASS since wherever the HoodieIOFactory is instantiated, the HoodieStorage is needed. So we can add a new API getIOFactory() in HoodieStorage to get the HoodieIOFactory instance and only use HOODIE_STORAGE_CLASS if the instantiation through reflection is needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rethinking about this, the reason we cannot directly use a getter to return the HoodieIOFactory instance from HoodieStorage instance is that HoodieIOFactory and related classes are in hudi-common module because they use Hudi concepts such as HoodieRecord, BloomFilter, etc., while the HoodieStorage is in hudi-io module (hudi-common module depends on hudi-io). For now, the best way is to keep two configs. For Hadoop-based implementation, no configs are required as the default are the HoodieHadoopIOFactory and HoodieHadoopStorage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created HUDI-7789 for future effort of moving HoodieIOFactory to hudi-io module and keeping one config only.

@yihua yihua force-pushed the HUDI-7776-simplify-storage-fetch branch from 2006412 to 853c2ae Compare May 24, 2024 16:36
@yihua
Copy link
Contributor Author

yihua commented May 24, 2024

@hudi-bot run azure

@yihua yihua closed this May 24, 2024
@yihua yihua reopened this May 24, 2024
@yihua yihua force-pushed the HUDI-7776-simplify-storage-fetch branch from 25906d9 to 35f0698 Compare May 24, 2024 20:59
@yihua
Copy link
Contributor Author

yihua commented May 24, 2024

Azure CI is green on another PR with the same changes (this PR could not trigger Azure CI somehow).
Screenshot 2024-05-24 at 15 50 48

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-0.15.0 size:XL PR with lines of changes > 1000

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants