Proposal for Streaming HuggingFace Datasets to Optimize Workflow #41

vishesh9131 · 2024-07-14T22:16:45Z

I hope this message finds you well. I would like to propose adjusting the current codebase so that datasets can be streamed directly from HuggingFace instead of being downloaded in full first. This would reduce local storage requirements and let processing start without waiting for a complete download, which helps most for users with limited local storage or in environments where download speed is a bottleneck.

Dataset streaming can be implemented with HuggingFace's datasets library, which supports on-the-fly data access. The change would integrate this functionality into the existing data handling pipeline while keeping it compatible for current users.
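As a rough illustration of the idea, here is a minimal sketch using load_dataset with streaming=True from the datasets library. The function name stream_examples and its loader parameter are hypothetical and not part of this project's codebase; the loader hook is only there so the function can be exercised without network access.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterator, Optional


def stream_examples(
    dataset_name: str,
    split: str = "train",
    limit: Optional[int] = None,
    loader: Optional[Callable[..., Any]] = None,
) -> Iterator[Dict[str, Any]]:
    """Yield examples lazily instead of downloading the whole dataset.

    `loader` is a hypothetical injection point for tests; by default the
    real `datasets.load_dataset` is used.
    """
    if loader is None:
        # Imported lazily so the module can be loaded without `datasets`.
        from datasets import load_dataset

        loader = load_dataset

    # With streaming=True, load_dataset returns an iterable dataset that
    # fetches examples on demand rather than materializing them on disk.
    dataset = loader(dataset_name, split=split, streaming=True)

    # islice with stop=None iterates over the whole stream.
    return islice(dataset, limit)
```

A caller could then write `for example in stream_examples("some/dataset", limit=100): ...`, where "some/dataset" is a placeholder for whichever dataset the project uses.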

The high-level steps include:

1. Updating the data loading functions to utilize HuggingFace's load_dataset with streaming enabled.
2. Ensuring all downstream processes can handle data in a streamed format without requiring local storage.
3. Conducting thorough testing to verify the integrity and performance of the streamed data pipeline.
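One thing downstream code has to cope with is that a streamed dataset is consumed as an iterator and cannot be indexed at random. The sketch below shows one way batching could work under that constraint; the helper name batched is an assumption for illustration, not an existing function in this project.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List


def batched(
    examples: Iterable[Dict[str, Any]], batch_size: int
) -> Iterator[List[Dict[str, Any]]]:
    """Group a stream of examples into lists of at most `batch_size`.

    Works on any iterable, so it needs neither len() nor indexing,
    and only one batch is held in memory at a time.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    iterator = iter(examples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch
```

Because it only relies on iteration, the same helper can be tested against a small in-memory list and then used unchanged on a streamed dataset.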

If you are interested, I can raise a pull request with the proposed changes for your review. This would allow us to collaboratively refine and integrate this feature into the project.

Looking forward to your thoughts on this.
Best regards,

Vishesh Yadav
