I hope this message finds you well. I would like to discuss the possibility of adjusting the current codebase to enable streaming of datasets directly from HuggingFace, eliminating the need for downloading them. This enhancement can significantly streamline the workflow, reduce storage requirements, and improve efficiency, especially for users working with limited local storage or in environments where data download speeds are a bottleneck.
Dataset streaming can be implemented with HuggingFace's `datasets` library, which supports on-the-fly data access. The change would involve integrating this functionality into the existing data handling pipeline while keeping it compatible with current usage, so that existing users see a seamless transition.
The high-level steps include:
1. Updating the data loading functions to use HuggingFace's `load_dataset` with streaming enabled.
2. Ensuring all downstream processes can handle data in a streamed format without requiring local storage.
3. Conducting thorough testing to verify the integrity and performance of the streamed data pipeline.
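To make the first two steps concrete, here is a minimal sketch of what I have in mind. It is not based on this project's actual loading code: `stream_examples`, `batched`, and the `loader` parameter are hypothetical names I made up for illustration. The only library call it relies on is `load_dataset(..., streaming=True)` from `datasets`, which returns an iterable dataset that fetches examples as you iterate instead of downloading everything up front.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def stream_examples(
    dataset_name: str,
    split: str = "train",
    limit: Optional[int] = None,
    loader: Optional[Callable[..., Iterable[dict]]] = None,
) -> Iterator[dict]:
    """Yield examples lazily instead of downloading the full dataset first.

    `loader` is a hypothetical hook (not part of any library) that lets the
    helper be tested offline with a stub in place of `load_dataset`.
    """
    if loader is None:
        # Imported lazily so the stub path works without `datasets` installed.
        from datasets import load_dataset

        loader = load_dataset

    # streaming=True gives an iterable dataset; data is fetched on iteration.
    dataset = loader(dataset_name, split=split, streaming=True)

    # islice with limit=None iterates over the whole stream.
    yield from islice(dataset, limit)


def batched(examples: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Group a stream into batches without needing to know its length."""
    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch
```

Downstream code would then consume something like `batched(stream_examples(name, limit=1000), 32)` rather than indexing into a fully downloaded dataset. The main design point is that a streamed dataset cannot be relied on for `len()` or random access, so any part of the pipeline that assumes those would need to be adapted.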
If you are interested, I can raise a pull request with the proposed changes for your review. This would allow us to collaboratively refine and integrate this feature into the project.
Looking forward to your thoughts on this.
Best regards,
Vishesh Yadav