Add functionality to load/save distisets to/from disk #673

Merged
plaguss merged 19 commits into develop from distiset-to-disk on May 29, 2024

plaguss (Contributor) commented May 24, 2024

Description

This PR adds two methods to Distiset: save_to_disk, which writes the contents to disk, and load_from_disk, which reads them back. Both can also target remote storage, depending on the storage_options passed, following the implementation in Hugging Face's datasets: https://huggingface.co/docs/datasets/filesystems#saving-serialized-datasets

from distilabel.distiset import Distiset
from distilabel.pipeline import Pipeline

with Pipeline(...) as pipe:
    distiset = pipe.run()

# Pass storage_options as well to target remote storage.
distiset.save_to_disk(distiset_path)
ds = Distiset.load_from_disk(distiset_path)
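
For the remote-storage case mentioned in the description, a minimal sketch might look like the following. It assumes that both methods accept a `storage_options` keyword argument, mirroring `datasets`, and that the dictionary uses the fsspec/s3fs-style `key`, `secret`, and `client_kwargs` entries. The helper names `make_s3_storage_options` and `roundtrip_distiset` are hypothetical and not part of this PR.

```python
from typing import Any, Dict, Optional


def make_s3_storage_options(
    key: str, secret: str, endpoint_url: Optional[str] = None
) -> Dict[str, Any]:
    """Build an fsspec/s3fs-style storage_options dict (hypothetical helper)."""
    options: Dict[str, Any] = {"key": key, "secret": secret}
    if endpoint_url is not None:
        # Only needed for S3-compatible services with a custom endpoint.
        options["client_kwargs"] = {"endpoint_url": endpoint_url}
    return options


def roundtrip_distiset(
    distiset: Any,
    distiset_path: str,
    storage_options: Optional[Dict[str, Any]] = None,
) -> Any:
    """Save a Distiset and load it back from the same location.

    Assumption: save_to_disk and load_from_disk both take a
    storage_options keyword, as the PR description suggests.
    """
    # Imported lazily so the helper above works without distilabel installed.
    from distilabel.distiset import Distiset

    distiset.save_to_disk(distiset_path, storage_options=storage_options)
    return Distiset.load_from_disk(distiset_path, storage_options=storage_options)
```

Keeping the options in one dictionary means the same credentials are passed to both the save and the load call; with `storage_options=None` the same helper would cover the local-disk case.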

@plaguss plaguss added this to the 1.2.0 milestone May 24, 2024
@plaguss plaguss self-assigned this May 24, 2024
@plaguss plaguss linked an issue May 24, 2024 that may be closed by this pull request
@plaguss plaguss added the enhancement (New feature or request) label and removed the improvement label May 24, 2024
@plaguss plaguss marked this pull request as ready for review May 27, 2024 10:16
plaguss (Contributor, Author) commented May 27, 2024

@rasdani did you have time to try saving to and loading from S3 with this branch?

rasdani (Contributor) commented May 27, 2024

yes, saving and loading with S3 works so far! :)

However, keep in mind that you need to append default/train (or your respective config/split name) to your S3 directory path.
This is expected behaviour from datasets.load_from_disk(), though.

EDIT: hold on, I just realised I tested with datasets.load_from_disk() and not Distiset.load_from_disk(). Will try in a minute.
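
The path detail in this comment can be sketched as follows. This is only an illustration, assuming the layout described above, where each split is stored under `<distiset_path>/<config>/<split>`. The helpers `split_path` and `load_split` and the bucket name are hypothetical; `datasets.load_from_disk` is the real function from Hugging Face's `datasets`.

```python
import posixpath
from typing import Any, Dict, Optional


def split_path(distiset_path: str, config: str = "default", split: str = "train") -> str:
    """Return the directory of one config/split inside a saved Distiset.

    Assumption: the layout is <distiset_path>/<config>/<split>, as reported
    in the comment above.
    """
    # posixpath keeps forward slashes, which is what S3-style URLs need.
    return posixpath.join(distiset_path, config, split)


def load_split(
    distiset_path: str,
    config: str = "default",
    split: str = "train",
    storage_options: Optional[Dict[str, Any]] = None,
) -> Any:
    """Load a single split directly with datasets.load_from_disk."""
    # Imported lazily so split_path works without datasets installed.
    from datasets import load_from_disk

    return load_from_disk(
        split_path(distiset_path, config, split),
        storage_options=storage_options,
    )
```

For example, `split_path("s3://my-bucket/distiset")` yields `s3://my-bucket/distiset/default/train`, which is the path `datasets.load_from_disk()` would need.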

gabrielmbmb (Member) left a comment

LGTM!

plaguss and others added 4 commits May 28, 2024 12:27
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
plaguss and others added 3 commits May 28, 2024 12:35
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
Co-authored-by: Gabriel Martín Blázquez <gmartinbdev@gmail.com>
@plaguss plaguss merged commit 7e9230b into develop May 29, 2024
4 checks passed
@plaguss plaguss deleted the distiset-to-disk branch May 29, 2024 08:36
Labels
enhancement New feature or request
Development

Successfully merging this pull request may close these issues.

[FEATURE] Saving/Loading of Distiset with S3 bucket (or locally)
4 participants