Return distiset from Pipeline.run #417

Merged
merged 46 commits into from
Mar 20, 2024

@plaguss (Contributor) commented Mar 13, 2024

Description

This PR adds a Distiset class, a wrapper around a dictionary that holds the datasets.Dataset objects generated during the Pipeline.run method. Each key is the name of a leaf step in the internal DAG, and each value is a datasets.Dataset. It has two methods:

  • push_to_hub: pushes the Distiset to the Hugging Face Hub, where each configuration corresponds to one of the subsets.
  • train_test_split: transforms each internal datasets.Dataset into a datasets.DatasetDict, with the same train/test sizes for all subsets.
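The wrapper described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `DistisetSketch` and `ToyDataset` are hypothetical names, the toy class stands in for datasets.Dataset so the snippet runs without extra dependencies, and returning a new mapping from `train_test_split` (rather than modifying in place) is an assumption.

```python
from typing import Any


class DistisetSketch(dict):
    """Hypothetical sketch: maps leaf step name -> dataset-like object."""

    def train_test_split(self, train_size: float = 0.8) -> "DistisetSketch":
        # Apply the same train/test sizes to every subset.
        return DistisetSketch(
            {
                name: ds.train_test_split(train_size=train_size)
                for name, ds in self.items()
            }
        )

    def push_to_hub(self, repo_id: str, **kwargs: Any) -> None:
        # One Hub configuration per subset, named after the leaf step.
        for name, ds in self.items():
            ds.push_to_hub(repo_id, config_name=name, **kwargs)


class ToyDataset:
    """Stand-in for datasets.Dataset (assumption: only the two calls used above)."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.pushed = []

    def train_test_split(self, train_size: float):
        # Deterministic split without shuffling, for illustration only.
        cut = int(len(self.rows) * train_size)
        return {
            "train": ToyDataset(self.rows[:cut]),
            "test": ToyDataset(self.rows[cut:]),
        }

    def push_to_hub(self, repo_id, config_name, **kwargs):
        # Records the call instead of uploading anything.
        self.pushed.append((repo_id, config_name))


distiset = DistisetSketch({"generate_response": ToyDataset(range(10))})
split = distiset.train_test_split(train_size=0.8)
distiset.push_to_hub("user/repo")  # "user/repo" is a placeholder repo id
```

With real datasets.Dataset values, the same two calls (`train_test_split` with `train_size`, and `push_to_hub` with `config_name`) should be available, though the actual Distiset signatures may differ from this sketch.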

Once it finishes, the Pipeline.run method locates the Parquet files written via _WriteBuffer in the cache folder and builds the Distiset from them.
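As a rough sketch of the lookup step, the snippet below groups Parquet files by leaf step. The folder layout (one sub-folder per leaf step under the cache folder) and the helper name `find_leaf_step_files` are assumptions made for illustration; the PR does not spell out how _WriteBuffer lays out its files.

```python
import tempfile
from pathlib import Path
from typing import Dict, List


def find_leaf_step_files(cache_dir: Path) -> Dict[str, List[Path]]:
    """Group parquet files by sub-folder (assumed: one folder per leaf step)."""
    files_by_step: Dict[str, List[Path]] = {}
    for step_dir in sorted(p for p in cache_dir.iterdir() if p.is_dir()):
        parquet_files = sorted(step_dir.glob("*.parquet"))
        if parquet_files:
            files_by_step[step_dir.name] = parquet_files
    return files_by_step


# Demo on a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    (cache / "generate_response").mkdir()
    for i in range(2):
        (cache / "generate_response" / f"{i:05d}.parquet").touch()
    found = find_leaf_step_files(cache)
    summary = {step: [f.name for f in files] for step, files in found.items()}
```

Each list of files could then be loaded into a datasets.Dataset, for example with `datasets.load_dataset("parquet", data_files=[...])`, and stored under its leaf step name.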

Dummy example:

with Pipeline() as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
    rename_columns = RenameColumns(name="rename_columns", input_batch_size=12)
    generate_response = GenerateResponse(
        name="generate_response", input_batch_size=16
    )

    load_hub_dataset.connect(rename_columns)
    rename_columns.connect(generate_response)

    ds = pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "plaguss/test",
                "split": "train",
            },
            "rename_columns": {
                "rename_mappings": {
                    "prompt": "instruction",
                },
            },
        }
    )
# >>> ds
# Distiset({"generate_response": Dataset(...)})

Closes #373

@plaguss plaguss self-assigned this Mar 18, 2024
@plaguss plaguss marked this pull request as ready for review March 18, 2024 11:45
@alvarobartt alvarobartt added the enhancement New feature or request label Mar 18, 2024
@alvarobartt alvarobartt added this to the 1.0.0 milestone Mar 18, 2024
@alvarobartt alvarobartt linked an issue Mar 18, 2024 that may be closed by this pull request
plaguss and others added 2 commits March 18, 2024 12:49
Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>
@alvarobartt (Member) left a comment

LGTM so far, but still need to test it further!

src/distilabel/pipeline/local.py (outdated, resolved)
src/distilabel/pipeline/local.py (outdated, resolved)
return pipeline.run(
    parameters={
        "load_dataset": {
            "repo_id": "plaguss/test",

Member:

Unrelated to this PR, but maybe it's time to create argilla-internal-testing in Hugging Face Hub? 😆

Contributor (PR author):

yes... the current workflow is weird at best 😆

@gabrielmbmb (Member) left a comment

LGTM! We can rename some files, but looks good :)

src/distilabel/pipeline/local.py (outdated, resolved)
src/distilabel/pipeline/local.py (outdated, resolved)
src/distilabel/utils/data.py (outdated, resolved)
@plaguss plaguss merged commit 59dfd56 into core-refactor Mar 20, 2024
4 checks passed
@plaguss plaguss deleted the distiset branch March 20, 2024 08:54
Labels: enhancement
Linked issue: Pipeline.run method returning Distiset instance
3 participants