Support for loading from an AbstractFileSystem #315

Closed
bilelomrani1 opened this issue Sep 1, 2023 · 5 comments
Labels
feat/misc Feature: Miscellaneous type/feature Type: Feature

Comments

@bilelomrani1

A problem I face very often with HuggingFace transformers is efficiently loading a model from private cloud storage. Unfortunately, transformers does not support fsspec URLs in its .from_pretrained API. As a result, loading a checkpoint from cloud storage is both inefficient and slightly ugly, because the files first have to be downloaded to local disk.

A better alternative would be to load directly from an fsspec filesystem:

encoder = BERTEncoder.load(
   fs=GCSFileSystem(...),
   device=torch.device("cuda", index=0),
)

or perhaps by passing an fsspec-compliant URL directly:

encoder = BERTEncoder.load(
   url="gs://my-bucket/.../my-model/",
   device=torch.device("cuda", index=0),
)

The HuggingFace Hub can also be accessed through fsspec (documentation), which might help abstract the storage layer completely.

It would be a very nice and useful addition to the package for cases where hosting on the Hub is not possible.
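
As an illustration of reading a checkpoint without staging it on local disk, here is a minimal sketch. It uses fsspec's built-in in-memory filesystem as a stand-in for cloud storage; the path, the payload, and the helper `read_checkpoint` are made up for the example and are not part of any library's loading API.

```python
import io

import fsspec

# Stand-in for a remote bucket: fsspec's built-in in-memory filesystem.
fs = fsspec.filesystem("memory")

# Hypothetical checkpoint path and payload, for illustration only.
checkpoint_path = "/my-model/model.bin"
fs.pipe(checkpoint_path, b"fake-checkpoint-bytes")


def read_checkpoint(fs, path):
    """Read a file from any fsspec filesystem straight into memory."""
    with fs.open(path, "rb") as f:
        return io.BytesIO(f.read())


# The bytes never touch local disk; a loader could consume this buffer.
buffer = read_checkpoint(fs, checkpoint_path)
```

Because the helper only relies on `fs.open`, the same function should work unchanged with other fsspec filesystems, assuming the corresponding backend package is installed.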

@danieldk
Collaborator

danieldk commented Sep 4, 2023

Thanks for the suggestion, that sounds like a great idea! We'll add it to our todo list.

@shadeMe added the type/feature (Type: Feature) and feat/misc (Feature: Miscellaneous) labels on Sep 4, 2023
@danieldk
Collaborator

danieldk commented Sep 27, 2023

The main branch now has experimental support for loading from an fsspec filesystem.

@bilelomrani1
Author

Hi @danieldk, thank you very much! I will gladly look into it and test it this week, and will get back to you.

@bilelomrani1
Author

I tested a few loading scenarios that I target in development and production:

Loading from GCS

fs, model_path = fsspec.core.url_to_fs("gs://my-bucket/testing/model")
model = AutoCausalLM.from_fsspec(fs=fs, model_path=model_path)

Loading from a DVC repository on Gitlab with artifacts hosted on GCS

fs, model_path = fsspec.core.url_to_fs("dvc://data/model", dvc={"url": ..., "rev": "e3d34"})
model = AutoCausalLM.from_fsspec(fs=fs, model_path=model_path)

Loading directly from a zip file, versioned in a DVC repository, hosted on GCS

fs, model_path = fsspec.core.url_to_fs("zip://model::dvc://data/model.zip", dvc={"url": ..., "rev": "d623af"})
model = AutoCausalLM.from_fsspec(fs=fs, model_path=model_path)

I can confirm that these three scenarios work exactly as expected 🎉!

Thank you @danieldk, it's very nice that these somewhat less simple loading scenarios work out of the box. Great job!
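
For readers unfamiliar with the chained URL in the third scenario: fsspec separates filesystem layers with `::`, with the outermost layer first, so `zip://model::dvc://data/model.zip` means "the path `model` inside a zip archive that is itself read from `dvc://data/model.zip`". The sketch below only illustrates that layering with plain string handling; `split_chained_url` is a hypothetical helper, not an fsspec function, and fsspec does its own parsing internally.

```python
def split_chained_url(url):
    """Split an fsspec-style chained URL into (protocol, path) layers.

    Layers are returned outermost first. Simplification: a segment
    without "://" is treated as a bare protocol name with an empty path.
    """
    layers = []
    for part in url.split("::"):
        protocol, _, path = part.partition("://")
        layers.append((protocol, path))
    return layers


layers = split_chained_url("zip://model::dvc://data/model.zip")
for protocol, path in layers:
    print(f"{protocol}: {path!r}")
```

Keyword arguments such as `dvc={...}` in the calls above are addressed to a layer by its protocol name, which is how the DVC options reach the inner filesystem rather than the zip layer.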

@danieldk
Collaborator

danieldk commented Oct 8, 2023

Thanks a lot for testing!
