Return `distiset` from `Pipeline.run` #417
…filename to the cache filenames
… running a step process
Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>
LGTM so far, but still need to test it further!
    return pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "plaguss/test",
Unrelated to this PR, but maybe it's time to create `argilla-internal-testing` in Hugging Face Hub? 😆
yes... the current workflow is weird at best 😆
LGTM! We can rename some files, but looks good :)
Description
This PR creates a `Distiset` class, which is a wrapper around a dictionary containing the internal `datasets.Dataset` objects generated during the `Pipeline.run` method. Each key corresponds to a `leaf_step` name in the internal `DAG`, and each value is a `datasets.Dataset`. It has two methods:

- `push_to_hub`: pushes the `Distiset` to the hub, where each configuration corresponds to one of the subsets.
- `train_test_split`: transforms each of the internal `datasets.Dataset` objects into a `datasets.DatasetDict` (all the subsets get the same train/test sizes).

When it finishes, the `Pipeline.run` method locates the (parquet) files written via `_WriteBuffer` in the cache folder and generates the `Distiset` from them.

Dummy example:
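A minimal sketch of the wrapper idea described above, not the code from this PR. It assumes plain Python lists of row dictionaries stand in for `datasets.Dataset` objects, so it runs with the standard library only. The name `MiniDistiset`, the `seed` argument, and the leaf step names are hypothetical and are not part of distilabel.

```python
import random
from typing import Dict, List

Rows = List[dict]


class MiniDistiset(dict):
    """Hypothetical stand-in for a Distiset: maps a leaf step name to its rows."""

    def train_test_split(
        self, train_size: float, seed: int = 42
    ) -> Dict[str, Dict[str, Rows]]:
        """Split every subset with the same train/test proportion."""
        if not 0.0 < train_size < 1.0:
            raise ValueError("train_size must be strictly between 0 and 1")
        splits: Dict[str, Dict[str, Rows]] = {}
        for name, rows in self.items():
            shuffled = list(rows)  # copy so the original subset is untouched
            random.Random(seed).shuffle(shuffled)
            cut = round(len(shuffled) * train_size)
            splits[name] = {"train": shuffled[:cut], "test": shuffled[cut:]}
        return splits


# One key per (hypothetical) leaf step, one collection of rows per key.
distiset = MiniDistiset(
    {
        "leaf_step_1": [{"text": f"a{i}"} for i in range(10)],
        "leaf_step_2": [{"text": f"b{i}"} for i in range(20)],
    }
)
splits = distiset.train_test_split(train_size=0.8)
```

In this sketch the split returns a new dictionary rather than modifying the wrapper in place; whether the real `Distiset.train_test_split` behaves the same way is not stated in the description.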
Closes #373