Return `distiset` from `Pipeline.run` #417

plaguss · 2024-03-13T15:41:54Z

Description

This PR adds a Distiset class: a wrapper around a dictionary holding the datasets.Dataset objects generated during the Pipeline.run method. Each key is the name of a leaf step in the internal DAG, and each value is a datasets.Dataset. It has two methods:

push_to_hub: pushes the Distiset to the Hub, where each configuration corresponds to one of the subsets.
train_test_split: transforms each internal datasets.Dataset into a datasets.DatasetDict (all the subsets get the same train/test sizes).
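
To make the wrapper idea concrete, here is a minimal sketch of a dict keyed by leaf-step name whose train_test_split applies the same proportions to every subset. It is not the code from this PR: the name DistisetSketch is hypothetical, and plain lists of rows stand in for datasets.Dataset so the example runs without dependencies. Real code would more likely delegate to Dataset.train_test_split and collect the results in a DatasetDict.

```python
import random
from typing import Any, Dict, List, Optional

# Assumption: a list of row dicts stands in for datasets.Dataset.
Rows = List[Dict[str, Any]]


class DistisetSketch(dict):
    """Hypothetical wrapper: maps each leaf-step name to its rows."""

    def train_test_split(
        self,
        train_size: float,
        shuffle: bool = True,
        seed: Optional[int] = None,
    ) -> Dict[str, Dict[str, Rows]]:
        """Split every subset with the same train/test proportions."""
        if not 0 < train_size < 1:
            raise ValueError("train_size must be strictly between 0 and 1")

        result: Dict[str, Dict[str, Rows]] = {}
        for name, rows in self.items():
            rows = list(rows)  # copy so the wrapped data is left untouched
            if shuffle:
                # A fresh generator per subset keeps the split reproducible.
                random.Random(seed).shuffle(rows)
            cut = int(len(rows) * train_size)
            result[name] = {"train": rows[:cut], "test": rows[cut:]}
        return result


distiset = DistisetSketch(
    {
        "generate_response": [{"id": i} for i in range(10)],
        "other_leaf_step": [{"id": i} for i in range(20)],
    }
)
splits = distiset.train_test_split(train_size=0.8, seed=42)
```

Subclassing dict keeps the usual access pattern (distiset["generate_response"]) while leaving room for dataset-level helpers such as push_to_hub.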

After it finishes, the Pipeline.run method locates the (parquet) files written via _WriteBuffer in the cache folder and generates the Distiset from them.
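
The lookup step can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: the function name collect_parquet_files is hypothetical, and the layout (one sub-folder per leaf step, each holding that step's parquet files) is assumed because the description does not spell out the real cache structure.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def collect_parquet_files(data_dir: Path) -> Dict[str, List[Path]]:
    """Group parquet files by the name of the folder that contains them.

    Assumed layout (hypothetical): <data_dir>/<leaf_step_name>/*.parquet
    """
    files_by_step: Dict[str, List[Path]] = defaultdict(list)
    # Sorting keeps the file order stable across runs and platforms.
    for path in sorted(Path(data_dir).glob("*/*.parquet")):
        files_by_step[path.parent.name].append(path)
    return dict(files_by_step)
```

Each list of files could then be handed to a parquet loader, for example datasets.load_dataset("parquet", data_files=[...]), to build the per-step dataset that the wrapper stores under the step's name.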

Dummy example:

with Pipeline() as pipeline:
    load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
    rename_columns = RenameColumns(name="rename_columns", input_batch_size=12)
    generate_response = GenerateResponse(
        name="generate_response", input_batch_size=16
    )

    load_hub_dataset.connect(rename_columns)
    rename_columns.connect(generate_response)

    ds = pipeline.run(
        parameters={
            "load_dataset": {
                "repo_id": "plaguss/test",
                "split": "train",
            },
            "rename_columns": {
                "rename_mappings": {
                    "prompt": "instruction",
                },
            },
        }
    )
# >>> ds
# Distiset({"generate_response": Dataset(...)})

Closes #373

alvarobartt

LGTM so far, but still need to test it further!

alvarobartt · 2024-03-18T11:52:38Z

tests/integration/test_pipe_simple.py

+        return pipeline.run(
+            parameters={
+                "load_dataset": {
+                    "repo_id": "plaguss/test",


Unrelated to this PR, but maybe it's time to create argilla-internal-testing in Hugging Face Hub? 😆

yes... the current workflow is weird at best 😆

gabrielmbmb

LGTM! We can rename some files, but looks good :)

plaguss added 30 commits February 28, 2024 11:29

Add signature cache dir for the pipeline

555a1ad

Add test for the signature and update path for the write buffer

2b748d4

Merge with main

5661b85

Add all the information from the steps to create the signature

0e60693

Merge with core-refactor

42df7f9

Merge branch 'core-refactor' of https://github.com/argilla-io/distilabel into caching

4e57cc3

Ensure the adjacency is also checked when loading from dict

72ec3b4

Save pipeline to cache dir and load back automatically

0ed7b63

Merge and solve conflict

2bbd47d

Draft advances caching

f4a8b36

Merge and solve conflict

1f2abca

Make the batching system ser/deserializable

ab6b2a9

Merge with core-refactor

c59504d

Update create_signature to the new pipeline serialization format

366cbe5

Update tests to account for the batch manager in the pipeline dump

9e3d863

Add batch manager to the BasePipeline and improve docstrings

a676d46

Load batch_manager from the dag only if wasn't loaded previously from the cache

41c2b79

Store the _BatchManager in its own file next to the pipeline

b47c8d1

Fix signature from dict values

12dbf2d

Sort the step items to obtain always the same signature and add data filename to the cache filenames

97747b9

Update tests with the sorted items in the signature

8aa91b2

Write pipeline as yaml by default

ae1d094

Remove unused test

f1dd524

Remove repeated functions

3517156

Make seq_no public in _BatchManagerStep to simplify serialization

fbe5351

Merge with core-refactor

de2a0ff

Transform last_batch_received to public in _BatchManagerStep

8397c12

Cache results after each step is done running inside the process wrapper

e0ab82c

Update example to showcase caching and caching at a given point after running a step process

2918f52

Add helper function to generate dummy batches of data

ca18b26

plaguss added 3 commits March 13, 2024 16:30

Add initial tests for _WriteBuffer

639acc9

Add draft for write buffer with parquet files

8f7c2e6

Solve conflicts

387ea74

plaguss self-assigned this Mar 18, 2024

plaguss added 3 commits March 18, 2024 12:43

Update tests to check for the returned datasets

d201a3b

Return DatasetDict from Pipeline run method

aa2a3db

Merge branch 'core-refactor' of https://github.com/argilla-io/distilabel into distiset

6eb70eb

plaguss marked this pull request as ready for review March 18, 2024 11:45

alvarobartt requested review from gabrielmbmb and alvarobartt March 18, 2024 11:45

alvarobartt added the enhancement New feature or request label Mar 18, 2024

alvarobartt added this to the 1.0.0 milestone Mar 18, 2024

alvarobartt linked an issue Mar 18, 2024 that may be closed by this pull request

Pipeline.run method returning Distiset instance #373

Closed

alvarobartt reviewed Mar 18, 2024

alvarobartt reviewed Mar 18, 2024

plaguss and others added 2 commits March 18, 2024 12:49

Update src/distilabel/utils/data.py

7fc42cd

Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>

Update docs

71cefcc

alvarobartt reviewed Mar 18, 2024

plaguss and others added 5 commits March 18, 2024 12:54

Update src/distilabel/pipeline/local.py

7ae0888

Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>

Update src/distilabel/pipeline/local.py

6f7fdcc

Co-authored-by: Alvaro Bartolome <alvaro@argilla.io>

Update docs

ea88903

Merge branch 'core-refactor' of https://github.com/argilla-io/distilabel into distiset

e62da0a

Create Distiset to return a wrapper around datasets from Pipeline.run

a66813a

gabrielmbmb approved these changes Mar 19, 2024

plaguss added 3 commits March 20, 2024 09:38

Merge and solve conflicts

cb83da5

Apply comments from code review

3c743ea

Fix import in test

31749fa

plaguss merged commit 59dfd56 into core-refactor Mar 20, 2024
4 checks passed

plaguss deleted the distiset branch March 20, 2024 08:54
