# Label datasets with LLMs, good annotation guidelines, and distilabel

In this notebook, we will use the `distilabel` library to label datasets with LLMs. We will use the `ArgillaLabeller` class to label any dataset hosted on Argilla, based on its written dataset definition: guidelines, fields, and questions. This is useful because the same labelling interface works for any dataset and any questions hosted on Argilla.

## Getting started

### Deploy the Argilla server

If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla by following [this guide](https://docs.argilla.io/latest/getting_started/quickstart/).

### Install dependencies

In [3]:
%pip install "distilabel[argilla,llama_cpp,outlines] @ git+https://github.com/argilla-io/distilabel.git@develop"
!pip install numpy==1.26.4
!pip install outlines==0.0.36
!pip install llama_cpp_python==0.2.85

### Download the model

For this example, we will use a [quantized version of Llama 3.2 1B](https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF/tree/main). This model should be good enough for basic labelling tasks.

In [22]:

import os

import requests

# Direct download link and local filename for the quantized GGUF model
url = "https://huggingface.co/hugging-quants/Llama-3.2-1B-Instruct-Q8_0-GGUF/resolve/main/llama-3.2-1b-instruct-q8_0.gguf?download=true"
filename = "llama-3.2-1b-instruct-q8_0.gguf"


def download_file(url, filename):
    # Stream the response so the model is written to disk chunk by chunk
    response = requests.get(url, stream=True)
    response.raise_for_status()

    with open(filename, "wb") as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)


# Only download the model if it is not already present locally
if not os.path.exists(filename):
    print(f"Downloading {filename}...")
    download_file(url, filename)
    print(f"Download complete: {filename}")
else:
    print(f"{filename} already exists. Skipping download.")


Downloading llama-3.2-1b-instruct-q8_0.gguf...
Download complete: llama-3.2-1b-instruct-q8_0.gguf
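
A download can be interrupted, or can return an error page instead of the model, and the cell above only checks that a file exists. The sketch below is one way to sanity-check the result before loading it: GGUF files begin with the four ASCII bytes `GGUF`, so reading the first four bytes catches the most obvious failures. The helper name `looks_like_gguf` is hypothetical and not part of `distilabel` or `llama_cpp`, and the check does not validate the rest of the file.

```python
from pathlib import Path

# GGUF files start with these four magic bytes.
GGUF_MAGIC = b"GGUF"


def looks_like_gguf(path: str) -> bool:
    """Return True if the file exists and begins with the GGUF magic bytes."""
    file = Path(path)
    if not file.is_file():
        return False
    with file.open("rb") as handle:
        return handle.read(len(GGUF_MAGIC)) == GGUF_MAGIC


if __name__ == "__main__":
    # Filename taken from the download cell above.
    print(looks_like_gguf("llama-3.2-1b-instruct-q8_0.gguf"))
```

If the check returns `False`, deleting the file and rerunning the download cell is the simplest recovery.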


## Upload some basic datasets to Argilla

We will now choose two datasets from Hugging Face and upload them to Argilla: one for the `LabelQuestion` and one for the `TextQuestion`.

- https://huggingface.co/datasets/dair-ai/emotion: emotions as classes  
- https://huggingface.co/datasets/fka/awesome-chatgpt-prompts: chatgpt prompts as text

We will use the `rg.Dataset.from_hub` method to upload example datasets to Argilla. We will also use the `rg.Settings` class to map the fields to the questions.

In [None]:
import argilla as rg
import os

client = rg.Argilla(
    api_url=os.environ["ARGILLA_API_URL_DEV"], api_key=os.environ["ARGILLA_API_KEY_DEV"]
)

settings = rg.Settings(
    fields=[
        rg.TextField(
            name="prompt",
            title="Prompt",
            description="Provide a concise response to the prompt",
        )
    ],
    questions=[
        rg.TextQuestion(
            name="response",
            title="Response",
            description="Provide a concise response to the prompt",
        )
    ],
    mapping={"prompt": "prompt"},
)

rg.Dataset.from_hub(
    repo_id="fka/awesome-chatgpt-prompts",
    name="awesome-chatgpt-prompts",
    split="train[:100]",
    client=client,
    with_records=True,
    settings=settings,
)

settings = rg.Settings(
    fields=[
        rg.TextField(
            name="text",
            title="Text",
            description="The text to classify by emotion",
        )
    ],
    questions=[
        rg.LabelQuestion(
            name="emotion",
            title="Emotion",
            description="Provide a single label for the emotion of the text",
            labels=["joy", "anger", "sadness", "fear", "surprise", "love"],
        )
    ],
    mapping={"text": "text"},
)

rg.Dataset.from_hub(
    repo_id="dair-ai/emotion",
    name="emotion",
    split="train[:100]",
    client=client,
    with_records=True,
    settings=settings,
)
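
The cells in this notebook read the Argilla URL and API key from `os.environ`, which raises a bare `KeyError` when a variable is missing. As a small optional sketch, the helper below fails with a clearer message instead. The name `require_env` is hypothetical and not part of Argilla; the `env` parameter exists only so the helper can be exercised without touching the real environment.

```python
import os
from typing import Mapping, Optional


def require_env(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the stripped value of an environment variable or raise a clear error."""
    source = os.environ if env is None else env
    value = source.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set or is empty.")
    return value
```

With this in place, `api_url=require_env("ARGILLA_API_URL_DEV")` could replace the direct `os.environ[...]` lookup.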

## Label the datasets

We will now label the datasets using the `ArgillaLabeller` class, which uses a `LlamaCppLLM` to generate the labels. These labels are then converted into `rg.Suggestion` objects and added to the records. For the sake of the example, we only label a small batch of records at a time (controlled by the `limit` argument), using a while loop that continuously fetches pending records from Argilla for both datasets and labels them with the LLM. After labelling, we update the dataset with the new records.

In [None]:
import os

import argilla as rg
from distilabel.llms.llamacpp import LlamaCppLLM
from distilabel.steps.tasks import ArgillaLabeller

client = rg.Argilla(
    api_url=os.environ["ARGILLA_API_URL_DEV"], api_key=os.environ["ARGILLA_API_KEY_DEV"]
)

# Initialize the labeller with the model and fields
labeller = ArgillaLabeller(
    llm=LlamaCppLLM(
        model_path="llama-3.2-1b-instruct-q8_0.gguf",
        n_ctx=8000,
        extra_kwargs={"max_new_tokens": 8000, "temperature": 0.0},
    )
)
labeller.load()

# Define the datasets, questions, and fields to use with the labeller
datasets = ["emotion", "awesome-chatgpt-prompts"]
fields: list[str] = ["text", "prompt"]
questions = ["emotion", "response"]

# Loop over the datasets, questions, and fields.
# Note: this loop runs until it is interrupted manually.
while True:
    for dataset_name, question_name, field_name in zip(datasets, questions, fields):
        # Get information from the Argilla dataset definition
        dataset = client.datasets(name=dataset_name, workspace="argilla")
        field = dataset.settings.fields[field_name]
        question = dataset.settings.questions[question_name]

        # Fetch a batch of records that are still pending
        pending_records_filter = rg.Filter(("status", "==", "pending"))
        pending_records = list(
            dataset.records(
                query=rg.Query(filter=pending_records_filter),
                limit=1,
            )
        )
        if not pending_records:
            # Nothing left to label in this dataset
            continue

        # Process the pending records
        result = next(
            labeller.process(
                [
                    {
                        "record": record,
                        "fields": [field],
                        "question": question,
                        "guidelines": dataset.guidelines,
                    }
                    for record in pending_records
                ]
            )
        )

        # Add the suggestions to the records
        for record, suggestion in zip(pending_records, result):
            record.suggestions.add(rg.Suggestion(**suggestion["suggestion"]))

        # Log the updated records
        dataset.records.log(pending_records)

![label dataset argilla labeller](./images/label_datasets%20with_llms_annotation_guidelines_and_distilabel.png)
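
The loop above polls Argilla indefinitely. For a one-off run it can be convenient to stop once nothing is pending, or after a fixed number of rounds. The sketch below shows that pattern in plain Python, independent of Argilla and `distilabel`: `label_in_batches`, `fetch_pending`, and `label_batch` are hypothetical names standing in for the fetch and labelling steps. It assumes that labelled items stop being returned by the fetch function; if records with a suggestion still count as pending, the fetch would need to exclude them, which is why `max_rounds` is there as a safety cap.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def label_in_batches(
    fetch_pending: Callable[[int], List[T]],
    label_batch: Callable[[List[T]], None],
    batch_size: int = 5,
    max_rounds: int = 100,
) -> int:
    """Fetch and label batches until nothing is pending or max_rounds is reached.

    Returns the number of items handed to label_batch.
    """
    labelled = 0
    for _ in range(max_rounds):
        batch = fetch_pending(batch_size)
        if not batch:
            # Nothing left to label, so stop polling
            break
        label_batch(batch)
        labelled += len(batch)
    return labelled


if __name__ == "__main__":
    # In-memory stand-in for a dataset with 12 pending records
    queue = [f"record-{i}" for i in range(12)]

    def fetch(limit: int) -> List[str]:
        return queue[:limit]

    def label(batch: List[str]) -> None:
        # Pretend to label by removing the batch from the pending queue
        del queue[: len(batch)]

    print(label_in_batches(fetch, label))  # prints 12
```

In the notebook's setting, `fetch_pending` would wrap the `dataset.records(...)` query and `label_batch` would wrap the `labeller.process(...)` call together with `dataset.records.log(...)`.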