Add support for file write/upload operations with HfHubRepository #354

Merged: 11 commits, Nov 8, 2023

@shadeMe (Collaborator) commented Oct 6, 2023

Description

This PR introduces support for uploading files to Hugging Face Hub repositories through the current RepositoryFile API. In the case of fsspec repositories, we support implicit streaming of data using buffered IO. However, this is not possible with HF repositories due to their use of Git as a backing store. Instead, we need to allocate and write to files on the host's local storage before uploading them to the remote repo as a Git commit operation.

The current implementation hides this detail by using a proxy to represent the remote file that we want to write to. The HfHubFile class wraps either a locally cached file from a HF repo or a temporary file on the local storage. The user can open and write to the latter like they would with any (fsspec) file - its contents will be uploaded to the repo as soon as the file handle is closed.
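The proxy idea described above can be illustrated with a minimal, self-contained sketch. It is not the PR's HfHubFile: the class name `UploadOnCloseFile` and the `upload` callback are hypothetical, and the callback stands in for whatever actually commits the file to the Hub. Writes land in a local temporary file, and the buffered contents are handed to the callback once, when the handle is closed.

```python
import tempfile
from typing import Callable, Dict


class UploadOnCloseFile:
    """Hypothetical proxy for a remote file (not the PR's actual class).

    Writes are buffered in a local temporary file; the `upload` callback
    receives the remote path and the full contents when the file is closed.
    """

    def __init__(self, remote_path: str, upload: Callable[[str, bytes], None]) -> None:
        self.remote_path = remote_path
        self._upload = upload
        self._tmp = tempfile.TemporaryFile(mode="w+b")
        self._closed = False

    def write(self, data: bytes) -> int:
        return self._tmp.write(data)

    def close(self) -> None:
        # Closing twice must not trigger a second upload.
        if self._closed:
            return
        self._closed = True
        self._tmp.seek(0)
        try:
            self._upload(self.remote_path, self._tmp.read())
        finally:
            self._tmp.close()

    def __enter__(self) -> "UploadOnCloseFile":
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()


# Stand-in for the remote repo: a dict that records uploaded files.
uploaded: Dict[str, bytes] = {}

with UploadOnCloseFile("config.json", uploaded.__setitem__) as f:
    f.write(b'{"hidden_size": ')
    f.write(b"768}")
    nothing_uploaded_yet = "config.json" not in uploaded
```

In this sketch nothing reaches the stand-in repo until the `with` block exits, which mirrors the "uploaded as soon as the file handle is closed" behaviour the description mentions.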

Furthermore, we also introduce the concept of transactional file writes using a context manager. This lets the user batch multiple file operations that get uploaded to the repo as a single commit. Transaction support for fsspec repositories will be added in a follow-up PR.
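As a rough illustration of batching writes into a single commit, here is a self-contained sketch built on `contextlib.contextmanager`. The names `InMemoryRepo`, `Transaction` and `transaction()` are assumptions made for this example and are not taken from the PR; a list of dicts stands in for the remote commit history.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, List


class Transaction:
    """Hypothetical collector for file writes that belong to one commit."""

    def __init__(self) -> None:
        self.pending: Dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self.pending[path] = data


class InMemoryRepo:
    """Hypothetical stand-in for a remote repo; each commit is a dict of files."""

    def __init__(self) -> None:
        self.commits: List[Dict[str, bytes]] = []

    @contextmanager
    def transaction(self) -> Iterator[Transaction]:
        tx = Transaction()
        yield tx
        # Only reached when the body finished without raising, so a
        # failed transaction leaves the commit history untouched.
        if tx.pending:
            self.commits.append(dict(tx.pending))


repo = InMemoryRepo()
with repo.transaction() as tx:
    tx.write("config.json", b"{}")
    tx.write("model.safetensors", b"\x00\x01")
```

Both writes end up in one entry of `repo.commits`. If the body of the `with` block raises, the code after `yield` never runs and nothing is committed, which is one plausible way to get all-or-nothing semantics.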

Other changes:

  • Repository.file returns a lazily-loaded file, i.e., its existence is only checked when the file is opened using the open method (or when the exists method is called).
  • Fixed a bug in FsspecRepository.open that called unstrip_protocol on a None object.
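The lazy-loading behaviour in the first bullet can be sketched as follows. This is an assumption-laden illustration, not the library's code: `LazyFile` and its dict-backed store are hypothetical, and only the `open` and `exists` method names come from the description above.

```python
import io
from typing import Dict, IO


class LazyFile:
    """Hypothetical lazily-checked file handle backed by an in-memory store."""

    def __init__(self, path: str, store: Dict[str, bytes]) -> None:
        # Construction never touches the store, so it cannot fail
        # for a file that does not exist.
        self.path = path
        self._store = store

    def exists(self) -> bool:
        return self.path in self._store

    def open(self) -> IO[bytes]:
        # Existence is only checked here, at the point of use.
        if not self.exists():
            raise FileNotFoundError(self.path)
        return io.BytesIO(self._store[self.path])


store = {"config.json": b"{}"}
present = LazyFile("config.json", store)
missing = LazyFile("does-not-exist.json", store)  # no error yet
```

Creating `missing` succeeds; the error only surfaces when `missing.open()` is called, or the caller can probe with `missing.exists()` first.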

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.

@shadeMe added the type/feature (Type: Feature) and feat/misc (Feature: Miscellaneous) labels Oct 6, 2023
@danieldk (Contributor) left a comment

🎉

Added some comments

curated_transformers/repository/hf_hub.py (outdated, resolved)
curated_transformers/repository/hf_hub.py (outdated, resolved)
curated_transformers/tests/repository/test_hf_hub.py (outdated, resolved)
curated_transformers/tokenizers/tokenizer.py (outdated, resolved)
@shadeMe added this to the v2.0.0 milestone Oct 19, 2023

@rmitsch left a comment

I'm not familiar with the codebase yet, so my comments are somewhat superficial. Overall this looks reasonable to me. I'll do a second pass after the first batch of comments has been addressed.

curated_transformers/repository/file.py (resolved)
curated_transformers/repository/fsspec.py (resolved)
curated_transformers/repository/hf_hub.py (resolved)
curated_transformers/repository/hf_hub.py (resolved)
curated_transformers/repository/hf_hub.py (outdated, resolved)

@rmitsch left a comment

🎉

@svlandeg merged commit 7a436e0 into explosion:main Nov 8, 2023
9 checks passed
@shadeMe deleted the feature/repo-transactions-hfhub branch November 8, 2023 17:37
Labels
feat/misc Feature: Miscellaneous type/feature Type: Feature

4 participants