
Implement caching mechanism for the pipelines #370

Merged · 37 commits · Mar 18, 2024

Conversation

plaguss (Contributor) commented Mar 1, 2024

Description

This PR implements the first version of caching for Pipeline objects. It works by serializing and saving the pipeline content after each Step finishes (run as a callback after each process call) to a folder under ~/.cache/distilabel/pipelines:

  • batch_manager.json: Contains the serialized _BatchManager, the object in charge of managing the batches internally in the Pipeline.
  • pipeline.yaml: Contains the serialized Pipeline in YAML format, which can be reused within the CLI (CLI with run command #403).
  • data.jsonl: Generations saved as a jsonl file (this file may be modified or removed).
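
The cache layout above can be sketched as follows. This is only an illustration of the three files and their formats; the function name and the serialized contents are hypothetical, not distilabel's actual serialization logic (the real state comes from `_BatchManager.dump()` and the pipeline's YAML serialization):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of the cache folder described above. File names match
# the PR; the payloads here are placeholders, not distilabel's real state.
def cache_pipeline_state(cache_dir: Path, batch_manager_state: dict, generations: list) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    # batch_manager.json: the _BatchManager's internal batch bookkeeping
    (cache_dir / "batch_manager.json").write_text(json.dumps(batch_manager_state))
    # data.jsonl: one generation per line
    with (cache_dir / "data.jsonl").open("w") as f:
        for row in generations:
            f.write(json.dumps(row) + "\n")

cache_dir = Path(tempfile.mkdtemp()) / "pipelines" / "demo"
cache_pipeline_state(
    cache_dir,
    {"last_batch": 3},  # placeholder state
    [{"instruction": "hi", "response": "I don't know"}],
)
print(sorted(p.name for p in cache_dir.iterdir()))  # → ['batch_manager.json', 'data.jsonl']
```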

The following example, written under the tests/integrations folder, can be run as a demonstration:

from typing import Any, Dict, Generator, List

from distilabel.pipeline.local import Pipeline
from distilabel.steps.base import RuntimeParameter, Step, StepInput
from distilabel.steps.generators.huggingface import LoadHubDataset


class RenameColumns(Step):
    rename_mappings: RuntimeParameter[Dict[str, str]]

    @property
    def inputs(self) -> List[str]:
        return []

    @property
    def outputs(self) -> List[str]:
        return list(self.rename_mappings.values())  # type: ignore

    def process(self, inputs: StepInput) -> Generator[List[Dict[str, Any]], None, None]:
        outputs = []
        for input in inputs:
            outputs.append(
                {self.rename_mappings.get(k, k): v for k, v in input.items()}  # type: ignore
            )
        yield outputs


class GenerateResponse(Step):
    @property
    def inputs(self) -> List[str]:
        return ["instruction"]

    def process(self, inputs: StepInput) -> Generator[List[Dict[str, Any]], None, None]:
        import time

        time.sleep(0.8)

        print("***** NOT CACHED ******", len(inputs))
        for input in inputs:
            input["response"] = "I don't know"

        # NOTE: Caching here to save the evolution of the _BatchManager
        self.pipeline._cache()
        yield inputs

    @property
    def outputs(self) -> List[str]:
        return ["response"]


def test_pipeline_cached():
    def run_pipeline():
        with Pipeline() as pipeline:
            load_hub_dataset = LoadHubDataset(name="load_dataset", batch_size=8)
            rename_columns = RenameColumns(name="rename_columns", input_batch_size=12)
            generate_response = GenerateResponse(
                name="generate_response", input_batch_size=16
            )

            load_hub_dataset.connect(rename_columns)
            rename_columns.connect(generate_response)

            pipeline.run(
                parameters={
                    "load_dataset": {
                        "repo_id": "plaguss/test",
                        "split": "train",
                    },
                    "rename_columns": {
                        "rename_mappings": {
                            "prompt": "instruction",
                        },
                    },
                }
            )

    run_pipeline()
    print()
    print("----- RUNNING PIPELINE AGAIN -----")
    print()
    run_pipeline()

if __name__ == "__main__":
    test_pipeline_cached()

The script runs the pipeline twice; looking at the logs should show the effect of passing through the cached batches. Currently, we cache only after a step finishes, but this can be managed by the user. See for example the GenerateResponse step above, which calls the pipeline._cache method before yielding its results.
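
The skip-on-rerun behaviour the logs show can be sketched in isolation. This is a toy model of the idea, not distilabel's implementation: `load_cached_batch_ids`, the `batch_id` field, and the append-to-jsonl scheme are all illustrative assumptions here:

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch: on a re-run, batches already recorded in data.jsonl
# are skipped instead of being recomputed. Names and fields are illustrative.
def load_cached_batch_ids(cache_file: Path) -> set:
    if not cache_file.exists():
        return set()
    return {json.loads(line)["batch_id"] for line in cache_file.read_text().splitlines()}

def run_step(cache_file: Path, batches):
    done = load_cached_batch_ids(cache_file)
    processed = []
    for batch_id, rows in batches:
        if batch_id in done:
            continue  # cached on a previous run: skip recomputation
        processed.append(batch_id)
        with cache_file.open("a") as f:
            f.write(json.dumps({"batch_id": batch_id}) + "\n")
    return processed

cache_file = Path(tempfile.mkdtemp()) / "data.jsonl"
first = run_step(cache_file, [(0, ["a"]), (1, ["b"])])
second = run_step(cache_file, [(0, ["a"]), (1, ["b"]), (2, ["c"])])
print(first, second)  # → [0, 1] [2]: the second run only touches the new batch
```

The second call processes only batch 2, mirroring what the PR's example prints on its second pipeline run.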

This PR would close #389

@plaguss plaguss added the enhancement New feature or request label Mar 1, 2024
@plaguss plaguss self-assigned this Mar 1, 2024
@plaguss plaguss marked this pull request as ready for review March 11, 2024 08:52
@plaguss plaguss merged commit eecdab4 into core-refactor Mar 18, 2024
4 checks passed
@plaguss plaguss deleted the caching branch March 18, 2024 09:02
@alvarobartt alvarobartt linked an issue Mar 18, 2024 that may be closed by this pull request