<a href="https://colab.research.google.com/github/bacoco/LLM-Finetuning/blob/main/clean_existing_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clean an existing preference dataset

- **Goal**: Clean an existing preference dataset by providing AI feedback on the quality of the data.
- **Libraries**: [argilla](https://github.com/argilla-io/argilla), [hf-inference-endpoints](https://github.com/huggingface/huggingface_hub)
- **Components**: [LoadDataFromDicts](https://distilabel.argilla.io/dev/components-gallery/steps/loaddatafromdicts/), [UltraFeedback](https://distilabel.argilla.io/latest/components-gallery/tasks/ultrafeedback/), [KeepColumns](https://distilabel.argilla.io/latest/components-gallery/steps/keepcolumns/), [PreferenceToArgilla](https://distilabel.argilla.io/latest/components-gallery/steps/preferencetoargilla/), [InferenceEndpointsLLM](https://distilabel.argilla.io/latest/components-gallery/llms/inferenceendpointsllm/), [GlobalStep](https://distilabel.argilla.io/latest/sections/how_to_guides/basic/step/global_step/)

## Getting Started

### Install the dependencies

To complete this tutorial, you need to install the distilabel SDK and a few third-party libraries via pip. We will be using **the free but rate-limited Hugging Face serverless Inference API**, so we need to install it as an extra distilabel dependency. You can install everything by running the following commands:

> Check the available extras [in the documentation](https://distilabel.argilla.io/latest/sections/getting_started/installation/#extras).

In [None]:
!pip install "distilabel[hf-inference-endpoints]"

In [None]:
!pip install "transformers~=4.0" "torch~=2.0" "huggingface_hub~=0.24.0"

Let's make the required imports:

In [None]:
import random

from datasets import load_dataset

from distilabel.llms import InferenceEndpointsLLM
from distilabel.pipeline import Pipeline
from distilabel.steps import (
    KeepColumns,
    LoadDataFromDicts,
    PreferenceToArgilla,
)
from distilabel.steps.tasks import UltraFeedback

You'll need an `HF_TOKEN` to use the HF Inference Endpoints. Log in to use it directly within this notebook.

In [None]:
import os
from google.colab import userdata
from huggingface_hub import login

login(token=os.getenv("HF_TOKEN") or userdata.get("HF_TOKEN"), add_to_git_credential=True)

Token is valid (permission: fineGrained).
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.cache/huggingface/token
Login successful


### (optional) Deploy Argilla

You can skip this step or replace it with any other data evaluation tool. However, model quality depends on data quality, so we recommend looking at your data. If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/).

Along with that, you will need to install Argilla as a distilabel extra.

In [None]:
!pip install "distilabel[argilla, hf-inference-endpoints]"

## The dataset

In this case, we will clean a preference dataset, so we will use the [`Intel/orca_dpo_pairs`](https://huggingface.co/datasets/Intel/orca_dpo_pairs) dataset from the Hugging Face Hub.

In [None]:
from IPython.display import IFrame

IFrame("https://huggingface.co/datasets/Intel/orca_dpo_pairs/embed/viewer/default/train", frameborder="0", width="100%", height="560px")

In [None]:
dataset = load_dataset("Intel/orca_dpo_pairs", split="train[:20]")


Next, we will shuffle the `chosen` and `rejected` responses within each pair so that the judge cannot pick up on their position, and we will track the original labels in an `order` column.

In [None]:
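def shuffle_and_track(chosen, rejected):
    # Shuffle (label, text) pairs so the labels stay correct even when both texts are identical
    pair = [("chosen", chosen), ("rejected", rejected)]
    random.shuffle(pair)
    order = [label for label, _ in pair]
    generations = [text for _, text in pair]
    return {"generations": generations, "order": order}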

dataset = dataset.map(lambda x: shuffle_and_track(x["chosen"], x["rejected"]))

In [None]:
dataset = dataset.to_list()
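
A quick sanity check on the shuffle, as a minimal sketch on toy rows rather than the real dataset. It uses a seeded `random.Random` (an assumption added here for reproducibility; the cell above uses the global `random` module) and counts how often `chosen` lands in the first position. The helper name `shuffle_pair` and the toy rows are hypothetical.

```python
import random
from collections import Counter


def shuffle_pair(chosen, rejected, rng):
    """Shuffle a (chosen, rejected) pair and record which label ended up in each position."""
    labeled = [("chosen", chosen), ("rejected", rejected)]
    rng.shuffle(labeled)
    return {
        "generations": [text for _, text in labeled],
        "order": [label for label, _ in labeled],
    }


# Hypothetical toy rows standing in for the real dataset
rng = random.Random(42)  # seeded so the shuffle is reproducible
rows = [{"chosen": f"good answer {i}", "rejected": f"bad answer {i}"} for i in range(1000)]
shuffled = [shuffle_pair(r["chosen"], r["rejected"], rng) for r in rows]

# Count which label sits in the first position; both counts should be roughly equal
first_position = Counter(s["order"][0] for s in shuffled)
print(first_position)
```

If one label dominates the first position, the judge could learn to favor a slot rather than the content, which is what the shuffle is meant to prevent.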

#### (Alternative) As a custom step

You can also [create a custom step](https://distilabel.argilla.io/latest/sections/how_to_guides/basic/step/#defining-custom-steps) in a separate module, import it and add it to the pipeline after loading the `orca_dpo_pairs` dataset using the `LoadDataFromHub` step.

In [None]:
# shuffle_step.py
import random
from typing import TYPE_CHECKING, List
from distilabel.steps import GlobalStep, StepInput

if TYPE_CHECKING:
    from distilabel.steps.typing import StepOutput

# @requirements(["required_package"]) # a warning will be raised if missing
class ShuffleStep(GlobalStep):
    @property
    def inputs(self) -> List[str]:
        return ["instruction", "chosen", "rejected"]

    @property
    def outputs(self) -> List[str]:
        return ["instruction", "generations", "order"]

    def process(self, inputs: StepInput) -> "StepOutput":
        outputs = []

        for row in inputs:
            # Shuffle (label, text) pairs so labels stay correct even for identical texts
            pair = [("chosen", row["chosen"]), ("rejected", row["rejected"])]
            random.shuffle(pair)
            order = [label for label, _ in pair]
            generations = [text for _, text in pair]

            outputs.append({"instruction": row["instruction"], "generations": generations, "order": order})

        yield outputs

In [None]:
from shuffle_step import ShuffleStep

## Define the pipeline

To clean an existing preference dataset, we will need to define a `Pipeline` with all the necessary steps. A similar workflow can also be used to clean an SFT dataset. Below, we will go over each step in detail.

### Load the dataset

We will use the dataset we just shuffled as source data.

- Component: `LoadDataFromDicts`
- Input columns: `system`, `question`, `chosen`, `rejected`, `generations` and `order`, the same keys as in the loaded list of dictionaries.
- Output columns: `system`, `instruction`, `chosen`, `rejected`, `generations` and `order`. We will use `output_mappings` to rename the columns.

In [None]:
load_dataset = LoadDataFromDicts(
    data=dataset[:1],
    output_mappings={"question": "instruction"},
    pipeline=Pipeline(name="showcase-pipeline"), # optional
)
load_dataset.load()
next(load_dataset.process())

([{'system': '',
   'question': "You will be given a definition of a task first, then some input of the task.\nThis task is about using the specified sentence and converting the sentence to Resource Description Framework (RDF) triplets of the form (subject, predicate object). The RDF triplets generated must be such that the triplets accurately capture the structure and semantics of the input sentence. The input is a sentence and the output is a list of triplets of the form [subject, predicate, object] that capture the relationships present in the sentence. When a sentence has more than 1 RDF triplet possible, the output must contain all of them.\n\nAFC Ajax (amateurs)'s ground is Sportpark De Toekomst where Ajax Youth Academy also play.\nOutput:",
   'chosen': '[\n  ["AFC Ajax (amateurs)", "has ground", "Sportpark De Toekomst"],\n  ["Ajax Youth Academy", "plays at", "Sportpark De Toekomst"]\n]',
   'rejected': " Sure, I'd be happy to help! Here are the RDF triplets for the input senten

### Evaluate the responses

To evaluate the quality of the responses, we will use [`meta-llama/Meta-Llama-3.1-70B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3.1-70B-Instruct), applying the `UltraFeedback` task, which can judge the responses along different dimensions (helpfulness, honesty, instruction-following, truthfulness). Here we set `aspect="overall-rating"` to get a single overall score per response. For an SFT dataset, you can use [`PrometheusEval`](../papers/prometheus.md) instead.

- Component: `UltraFeedback` task with LLMs using `InferenceEndpointsLLM`
- Input columns: `instruction`, `generations`
- Output columns: `ratings`, `rationales`, `distilabel_metadata`, `model_name`

For your use case and to improve the results, you can use any [other LLM of your choice](https://distilabel.argilla.io/latest/components-gallery/llms/).

In [None]:
evaluate_responses = UltraFeedback(
    aspect="overall-rating",
    llm=InferenceEndpointsLLM(
        model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        generation_kwargs={"max_new_tokens": 512, "temperature": 0.7},
    ),
    pipeline=Pipeline(name="showcase-pipeline"), # optional
)
evaluate_responses.load()
next(
    evaluate_responses.process(
        [
            {
                "instruction": "What's the capital of Spain?",
                "generations": ["Madrid", "Barcelona"],
            }
        ]
    )
)



[{'instruction': "What's the capital of Spain?",
  'generations': ['Madrid', 'Barcelona'],
  'ratings': [5, 1],
  'rationales': ['Text 1 provides accurate information, is confident in its answer, and aligns perfectly with the instruction, which asks for the capital of Spain, and Madrid is indeed the capital.',
   'Text 2 is incorrect as Barcelona is not the capital of Spain, although it is a major city in the country. The answer does not align with the instruction and introduces inaccurate information.'],
  'distilabel_metadata': {'raw_output_ultra_feedback_0': '#### Output for Text 1\nRating: 5 (Excellent)\nRationale: Text 1 provides accurate information, is confident in its answer, and aligns perfectly with the instruction, which asks for the capital of Spain, and Madrid is indeed the capital.\n\n#### Output for Text 2\nRating: 1 (Low Quality)\nRationale: Text 2 is incorrect as Barcelona is not the capital of Spain, although it is a major city in the country. The answer does not alig
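
The raw judge output stored in `distilabel_metadata` follows a simple text layout, which is handy when debugging rows whose ratings look wrong. The sketch below shows how such text could be split into ratings and rationales with a regular expression. It is an illustration only, not distilabel's own parser, and the `raw_output` string is a shortened, hypothetical sample in the same layout.

```python
import re

# Hypothetical, shortened sample in the same layout as the raw output above
raw_output = (
    "#### Output for Text 1\n"
    "Rating: 5 (Excellent)\n"
    "Rationale: Madrid is indeed the capital.\n\n"
    "#### Output for Text 2\n"
    "Rating: 1 (Low Quality)\n"
    "Rationale: Barcelona is not the capital of Spain.\n"
)

# Capture the numeric rating and the rationale line that follows it
pattern = re.compile(r"Rating:\s*(\d+).*?\nRationale:\s*(.+)")
matches = pattern.findall(raw_output)

ratings = [int(rating) for rating, _ in matches]
rationales = [text.strip() for _, text in matches]

print(ratings)  # [5, 1]
print(rationales)
```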

### Keep only the required columns

We will get rid of the unneeded columns.

- Component: `KeepColumns`
- Input columns: `system`, `instruction`, `chosen`, `rejected`, `generations`, `order`, `ratings`, `rationales`, `distilabel_metadata` and `model_name`
- Output columns: `instruction`, `generations`, `order`, `ratings`, `rationales` and `model_name`

In [None]:
keep_columns = KeepColumns(
    columns=[
        "instruction",
        "generations",
        "order",
        "ratings",
        "rationales",
        "model_name",
    ],
    pipeline=Pipeline(name="showcase-pipeline"), # optional
)
keep_columns.load()
next(
    keep_columns.process(
        [
            {
                "system": "",
                "instruction": "What's the capital of Spain?",
                "chosen": "Madrid",
                "rejected": "Barcelona",
                "generations": ["Madrid", "Barcelona"],
                "order": ["chosen", "rejected"],
                "ratings": [5, 1],
                "rationales": ["", ""],
                "model_name": "meta-llama/Meta-Llama-3.1-70B-Instruct",
            }
        ]
    )
)

[{'instruction': "What's the capital of Spain?",
  'generations': ['Madrid', 'Barcelona'],
  'order': ['chosen', 'rejected'],
  'ratings': [5, 1],
  'rationales': ['', ''],
  'model_name': 'meta-llama/Meta-Llama-3.1-70B-Instruct'}]
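
The `order` and `ratings` columns kept above are what make a cleaning decision possible. As a minimal sketch (the helper `judge_agrees` is hypothetical and not part of distilabel), a row can be flagged when the judge does not rate the originally chosen response strictly higher than the rejected one. Treating missing ratings as "cannot confirm" is an assumption made here for safety.

```python
def judge_agrees(order, ratings):
    """Return True when the response labeled "chosen" received the strictly higher rating."""
    # Defensive assumption: treat missing ratings (None) as "cannot confirm"
    if ratings is None or any(r is None for r in ratings):
        return False
    by_label = dict(zip(order, ratings))
    return by_label["chosen"] > by_label["rejected"]


# Same values as the example row above
row = {"order": ["chosen", "rejected"], "ratings": [5, 1]}
print(judge_agrees(row["order"], row["ratings"]))  # True

# Same ratings, but the shuffle put the rejected response first
flipped = {"order": ["rejected", "chosen"], "ratings": [5, 1]}
print(judge_agrees(flipped["order"], flipped["ratings"]))  # False
```

Rows where this returns `False` are good candidates for manual review in Argilla.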

### (Optional) Further data curation

You can use Argilla to further curate your data.

- Component: `PreferenceToArgilla` step
- Input columns: `instruction`, `generations`, `generation_models`, `ratings`
- Output columns: `instruction`, `generations`, `generation_models`, `ratings`

In [None]:
to_argilla = PreferenceToArgilla(
    dataset_name="cleaned-dataset",
    dataset_workspace="argilla",
    api_url="https://[your-owner-name]-[your-space-name].hf.space",
    api_key="[your-api-key]",
    num_generations=2
)

## Run the pipeline

Below, you can see the full pipeline definition:

> For more information about how steps, tasks and pipelines work, we have prepared [these guides](https://distilabel.argilla.io/latest/sections/how_to_guides/).

In [None]:
with Pipeline(name="clean-dataset", cache_dir="./my_cache_dir", requirements=["distilabel"]) as pipeline:

    load_dataset = LoadDataFromDicts(
        data=dataset, output_mappings={"question": "instruction"}
    )

    evaluate_responses = UltraFeedback(
        aspect="overall-rating",
        llm=InferenceEndpointsLLM(
            model_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
            tokenizer_id="meta-llama/Meta-Llama-3.1-70B-Instruct",
        ),
    )

    keep_columns = KeepColumns(
        columns=[
            "instruction",
            "generations",
            "order",
            "ratings",
            "rationales",
            "model_name",
        ]
    )

    to_argilla = PreferenceToArgilla(
        dataset_name="cleaned-dataset",
        dataset_workspace="argilla",
        api_url="https://[your-owner-name]-[your-space-name].hf.space",
        api_key=userdata.get("ARGILLA_API_KEY"),
        num_generations=2,
    )

    load_dataset.connect(evaluate_responses)
    evaluate_responses.connect(keep_columns)
    keep_columns.connect(to_argilla)

    # load_dataset >> evaluate_responses >> keep_columns >> to_argilla # alternative to `.connect`

Let's now run the pipeline and clean our preference dataset.

In [None]:
distiset = pipeline.run(
    parameters={
        evaluate_responses.name: {
            "llm": {
                "generation_kwargs": {
                    "temperature": 0.7,
                    "max_new_tokens": 512,
                }
            }
        },
    },
    use_cache=True,
)

Let's check it! If you have loaded the data to Argilla, you can [start annotating in the Argilla UI](https://docs.argilla.io/latest/how_to_guides/annotate/).

You can push the dataset to the Hub for sharing with the community and [embed it to explore the data](https://huggingface.co/docs/hub/datasets-viewer-embed).

> Check [here](https://distilabel.argilla.io/latest/sections/how_to_guides/advanced/distiset/) how to make the most of a distiset.
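
Once the pipeline has run, the rated rows can be turned back into a cleaned preference dataset. The sketch below works on plain dictionaries with the columns kept by `KeepColumns`. The function name `rebuild_pairs`, the tie-dropping rule, and the sample rows are assumptions for illustration, not distilabel behavior.

```python
def rebuild_pairs(rows):
    """Rebuild (chosen, rejected) pairs from judge ratings, dropping ties and unrated rows."""
    cleaned = []
    for row in rows:
        ratings = row["ratings"]
        # Skip rows the judge could not rate or rated equally
        if ratings is None or None in ratings or ratings[0] == ratings[1]:
            continue
        best = 0 if ratings[0] > ratings[1] else 1
        cleaned.append(
            {
                "instruction": row["instruction"],
                "chosen": row["generations"][best],
                "rejected": row["generations"][1 - best],
                # True when the judge overturned the original label
                "flipped": row["order"][best] == "rejected",
            }
        )
    return cleaned


# Hypothetical rows shaped like the pipeline output
rows = [
    {"instruction": "Capital of Spain?", "generations": ["Madrid", "Barcelona"],
     "order": ["chosen", "rejected"], "ratings": [5, 1]},
    {"instruction": "2 + 2?", "generations": ["5", "4"],
     "order": ["chosen", "rejected"], "ratings": [1, 5]},
    {"instruction": "Say hi", "generations": ["Hi!", "Hello!"],
     "order": ["rejected", "chosen"], "ratings": [4, 4]},
]
cleaned = rebuild_pairs(rows)
print(len(cleaned))  # 2
print([c["flipped"] for c in cleaned])  # [False, True]
```

Whether to drop ties, keep them, or send them to human review is a design choice. Dropping them is simply the most conservative option for a sketch.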

In [None]:
distiset.push_to_hub("[your-owner-name]/example-cleaned-preference-dataset")

In [None]:
from IPython.display import IFrame

IFrame("https://huggingface.co/datasets/distilabel-internal-testing/example-cleaned-preference-dataset/embed/viewer/default/train", frameborder="0", width="100%", height="560px")

## Conclusions

In this tutorial, we showcased the detailed steps to build a pipeline for cleaning a preference dataset using distilabel. You can customize this pipeline for your own use cases, such as cleaning an SFT dataset or adding custom steps.

We used a preference dataset as our starting point and shuffled the data to avoid any bias. Next, we evaluated the responses using a model through the serverless Hugging Face Inference API, following the UltraFeedback standards. Finally, we kept the needed columns and used Argilla for further curation.