# Train Zero Shot Text Classification with SetFit on Argilla Labeler

- **Goal**: Show a standard workflow for a text classification task, including zero-shot suggestions and model fine-tuning.
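- **Dataset**: [emotion](https://huggingface.co/datasets/dair-ai/emotion), a dataset of short English texts that need to be classified as one of six emotions: joy, anger, sadness, fear, surprise, or love.
- **Libraries**: [datasets](https://github.com/huggingface/datasets), [transformers](https://github.com/huggingface/transformers), [setfit](https://github.com/huggingface/setfit), [distilabel](https://github.com/argilla-io/distilabel), [scikit-learn](https://github.com/scikit-learn/scikit-learn)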
- **Components**: [TextField](https://docs.argilla.io/latest/reference/argilla/settings/fields/#src.argilla.settings._field.TextField), [LabelQuestion](https://docs.argilla.io/latest/reference/argilla/settings/questions/#src.argilla.settings._question.LabelQuestion), [Suggestion](https://docs.argilla.io/latest/reference/argilla/records/suggestions/), [Query](https://docs.argilla.io/dev/reference/argilla/search/#rgquery_1), [Filter](https://docs.argilla.io/dev/reference/argilla/search/#rgfilter)

## Getting started

### Deploy the Argilla server

If you have already deployed Argilla, you can skip this step. Otherwise, you can quickly deploy Argilla by following [this guide](../getting_started/quickstart.md).

### Set up the environment

To complete this tutorial, you need to install the Argilla SDK and a few third-party libraries via `pip`.

# Argilla Labeler 

In [1]:
!pip install scikit-learn


In [2]:
import os
from uuid import uuid4

import argilla as rg

client = rg.Argilla(api_key="argilla.apikey", api_url="http://localhost:6900")

labels = ["joy", "anger", "sadness", "fear", "surprise", "love"]

settings = rg.Settings(
    fields=[
        rg.TextField(
            name="text",
            title="Text",
            description="Provide a concise response to the prompt",
        )
    ],
    questions=[
        rg.LabelQuestion(
            name="label",
            title="Emotion",
            description="Provide a single label for the emotion of the text",
            labels=labels,
        )
    ],
    mapping={"text": "text"},
)

dataset_name = f"emotion-{uuid4()}"

rg.Dataset.from_hub(
    repo_id="dair-ai/emotion",
    name=dataset_name,
    split="train",
    client=client,
    with_records=True,
    settings=settings,
)


Dataset(id=UUID('387dccb7-31bb-4d8e-a571-1b6f98769ed3') inserted_at=datetime.datetime(2024, 10, 10, 16, 20, 15, 601414) updated_at=datetime.datetime(2024, 10, 10, 16, 20, 15, 765278) name='emotion-e45a52fc-9dbf-480f-9fa7-b5c5e8e46d52' status='ready' guidelines=None allow_extra_metadata=False distribution=OverlapTaskDistributionModel(strategy='overlap', min_submitted=1) workspace_id=UUID('735cae0d-eb08-45c3-ad79-0a11ad4dd2c2') last_activity_at=datetime.datetime(2024, 10, 10, 16, 20, 15, 765278))
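The cell above hard-codes the API URL and key. As a minimal sketch, the same values could be read from environment variables instead. The helper name `argilla_connection_settings` is hypothetical, and the variable names `ARGILLA_API_URL` and `ARGILLA_API_KEY` are an assumption about how the deployment is configured; the fallbacks are the values used in this notebook.

```python
import os


def argilla_connection_settings(env=None):
    """Return (api_url, api_key), falling back to the notebook's local values.

    `env` is any mapping; it defaults to the process environment.
    """
    env = os.environ if env is None else env
    api_url = env.get("ARGILLA_API_URL", "http://localhost:6900")
    api_key = env.get("ARGILLA_API_KEY", "argilla.apikey")
    return api_url, api_key


api_url, api_key = argilla_connection_settings()
# The pair could then be passed on, e.g. rg.Argilla(api_url=api_url, api_key=api_key)
```

This keeps credentials out of the notebook while leaving the rest of the cell unchanged.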

## Label the datasets

We will now label the dataset with the `ArgillaLabeller` task, which uses a `LlamaCppLLM` to predict a label for each record. These labels can then be converted into `rg.Suggestion` objects and added to the records. First, we collect a small set of example records to use as few-shot examples: we iterate over the records in Argilla, keep at most 8 records per label, and turn each record's existing suggestion into a response.

In [3]:
from collections import Counter
from distilabel.llms.llamacpp import LlamaCppLLM
from distilabel.steps.tasks import ArgillaLabeller

dataset = client.datasets(name=dataset_name, workspace="argilla")

example_records = []

counter = Counter()
max_samples = 32


for record in dataset.records(with_responses=True, with_suggestions=True):
    value = record.suggestions["label"].value
    counter[value] += 1
    if counter[value] > 8:
        continue
    record.responses.add(
        rg.Response(question_name="label", value=value, user_id=client.me)
    )
    example_records.append(record)

In [4]:
example_records = [record.to_dict() for record in example_records]

In [None]:
fixed_example_records = []

for record in example_records:
    record["responses"] = record["responses"]["label"]
    fixed_example_records.append(record)
    
example_records = fixed_example_records

In [5]:
len(example_records)

48
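The count of 48 follows from the cap in the loop above: 6 labels with at most 8 examples each. A minimal sketch of the same capping logic on plain tuples, without Argilla, is shown below. The function name `cap_per_label` and the toy records are hypothetical and only illustrate the idea.

```python
from collections import Counter


def cap_per_label(labelled_texts, max_per_label=8):
    """Keep at most `max_per_label` (text, label) pairs for each label."""
    counts = Counter()
    kept = []
    for text, label in labelled_texts:
        if counts[label] >= max_per_label:
            continue  # this label already has enough examples
        counts[label] += 1
        kept.append((text, label))
    return kept


labels = ["joy", "anger", "sadness", "fear", "surprise", "love"]
# Toy data: 120 records, cycling through the six labels (20 per label)
toy_records = [(f"text {i}", labels[i % len(labels)]) for i in range(120)]
examples = cap_per_label(toy_records)
print(len(examples))  # 48
```

A label with fewer than 8 records in the source would contribute fewer examples, so the total can be smaller than 48 on other data.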

In [6]:
# randomly shuffle the records

import random

random.shuffle(example_records)

# Distilabel can label your records

The `process` method of the `ArgillaLabeller` class sends each record to the LLM and returns a suggested label for it. Below, we load the emotion dataset and call `process` on the first 10 test records with an increasing number of example records, so we can compare the predictions against the true labels.

In [7]:
from datasets import load_dataset

train_dataset = load_dataset("dair-ai/emotion", split="train[:100]")
test_dataset = load_dataset("dair-ai/emotion", split="test")

In [9]:
from sklearn.metrics import classification_report

results_scores = {}
max_eval_records = 10

for n_samples in [0, 8, 16, 32]:
    # Initialize the labeller with the model and fields
    labeller = ArgillaLabeller(
        llm=LlamaCppLLM(
            model_path="llama-3.2-1b-instruct-q8_0.gguf",
            n_ctx=8000,
            extra_kwargs={"max_new_tokens": 8000, "temperature": 0.0},
        ),
        example_records=example_records[: n_samples + 1],
    )
    labeller.load()
    predictions = []
    true = []
    results = labeller.process(
        [
            {
                "record": rg.Record(fields={"text": sample["text"]}),
                "fields": dataset.fields,
                "question": dataset.questions[0],
                "guidelines": dataset.guidelines,
            }
            for sample in test_dataset.select(range(max_eval_records))
        ]
    )
    # `process` is a generator; the first batch holds one result per input
    for sample, result in zip(test_dataset.select(range(max_eval_records)), next(results)):
        true.append(test_dataset.features["label"].int2str(sample["label"]))
        # Keep None when the LLM did not return a usable suggestion
        if not result["suggestion"]:
            predictions.append(None)
            continue
        predictions.append(result["suggestion"]["value"])

    results_scores[n_samples] = {"true": true, "predictions": predictions}

    del labeller



In [11]:
for n_samples, data in results_scores.items():
    true = data["true"]
    predictions = data["predictions"]
    
    acc = sum([t == p for t, p in zip(true, predictions) if p is not None]) / len([p for p in predictions if p is not None])
    
    print(f"Accuracy for {n_samples} samples: {acc}")

Accuracy for 0 samples: 0.3
Accuracy for 8 samples: 0.2
Accuracy for 16 samples: 0.2
Accuracy for 32 samples: 0.3333333333333333
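The accuracy above ignores records for which the labeller returned no suggestion, and it does not check whether a predicted value is one of the allowed labels. The sketch below makes both effects visible. The `true` and `predictions` lists are copied from the 32-example entry of `results_scores` printed at the end of this notebook; the helper `accuracy` and its `skip_missing` flag are hypothetical names.

```python
allowed_labels = {"joy", "anger", "sadness", "fear", "surprise", "love"}

true = ["sadness", "sadness", "sadness", "joy", "sadness",
        "fear", "anger", "joy", "joy", "anger"]
predictions = ["emotion", "sadness", "sadness", "emotions", "sadness",
               "joy", None, "sadness", "sadness", "joy"]


def accuracy(true, predictions, skip_missing=True):
    """Share of correct predictions; optionally ignore missing (None) ones."""
    pairs = list(zip(true, predictions))
    if skip_missing:
        pairs = [(t, p) for t, p in pairs if p is not None]
    if not pairs:
        return 0.0
    return sum(t == p for t, p in pairs) / len(pairs)


lenient = accuracy(true, predictions)                     # None is skipped
strict = accuracy(true, predictions, skip_missing=False)  # None counts as wrong
# Predictions that are not None but fall outside the label set
invalid = [p for p in predictions if p is not None and p not in allowed_labels]

print(lenient, strict, invalid)
```

With only 10 evaluation records, a single prediction shifts the score by about 10 points, so differences between the runs should be read with caution.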


# SetFit Efficient Classifier

Let's make the required imports:

In [18]:
from datasets import load_dataset, Dataset
from setfit import (
    SetFitModel,
    Trainer,
    get_templated_dataset,
    sample_dataset,
    TrainingArguments,
)

In [19]:
labels = ["joy", "anger", "sadness", "fear", "surprise", "love"]

In [20]:
zero_ds = get_templated_dataset(
    candidate_labels=labels,
    sample_size=8,
)
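To make the zero-shot training data less opaque, here is a rough pure-Python sketch of what a templated dataset looks like. It assumes a template of the form `"This sentence is {}"` and that label ids follow the position of each name in `candidate_labels`; both are assumptions about `get_templated_dataset`, and `templated_examples` is a hypothetical stand-in, not the library function.

```python
def templated_examples(candidate_labels, sample_size=8, template="This sentence is {}"):
    """Build `sample_size` synthetic texts per label from a fixed template."""
    texts, label_ids = [], []
    for label_id, label_name in enumerate(candidate_labels):
        for _ in range(sample_size):
            texts.append(template.format(label_name))
            label_ids.append(label_id)  # id = position in candidate_labels
    return {"text": texts, "label": label_ids}


labels = ["joy", "anger", "sadness", "fear", "surprise", "love"]
sketch = templated_examples(labels, sample_size=8)
print(len(sketch["text"]))  # 48
print(sketch["text"][0], sketch["label"][0])
```

Under these assumptions the dataset holds 6 × 8 = 48 rows, which agrees with the `Map: ... 48/48` line in the training log below.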

In [113]:
def train_model(model_name, train_dataset, eval_dataset):
    model = SetFitModel.from_pretrained(model_name)

    args = TrainingArguments(
        num_epochs=1,
        eval_strategy="epoch",
        save_strategy="epoch",
        load_best_model_at_end=True,
    )
    
    trainer = Trainer(
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        # args=args,
    )

    trainer.train()

    results = trainer.evaluate()

    print(results)

    return model, results

Let's train the model on the templated zero-shot dataset. We will use `TaylorAI/bge-micro-v2`, available on the [Hugging Face Hub](https://huggingface.co/TaylorAI/bge-micro-v2).

In [114]:
model, results = train_model(
    model_name="TaylorAI/bge-micro-v2", train_dataset=zero_ds, eval_dataset=test_dataset
)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.




Map: 100%|██████████| 48/48 [00:00<00:00, 8534.77 examples/s]
***** Running training *****
  Num unique pairs = 1920
  Batch size = 16
  Num epochs = 2

{'embedding_loss': 0.2085, 'grad_norm': 1.5014283657073975, 'learning_rate': 8.333333333333333e-07, 'epoch': 0.01}

{'embedding_loss': 0.1017, 'grad_norm': 0.5493279695510864, 'learning_rate': 1.7592592592592595e-05, 'epoch': 0.42}

{'embedding_loss': 0.0155, 'grad_norm': 0.08654134720563889, 'learning_rate': 1.2962962962962964e-05, 'epoch': 0.83}

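The log line `Num unique pairs = 1920` can be reproduced with a small counting exercise. The sketch below assumes that pairs are formed from all combinations of two distinct training texts, that a pair is positive when both texts share a label, and that the smaller group is oversampled to match the larger one. This is one reading that matches the logged number for 48 balanced examples; the actual sampling in SetFit may differ between versions.

```python
from collections import Counter
from math import comb


def contrastive_pair_counts(label_ids):
    """Count same-label and different-label pairs, plus an oversampled total."""
    n = len(label_ids)
    positive = sum(comb(count, 2) for count in Counter(label_ids).values())
    negative = comb(n, 2) - positive
    # Assumed oversampling: both groups are brought up to the larger size
    oversampled_total = 2 * max(positive, negative)
    return positive, negative, oversampled_total


# 6 labels with 8 examples each, as in the templated zero-shot dataset
label_ids = [label for label in range(6) for _ in range(8)]
print(contrastive_pair_counts(label_ids))  # (168, 960, 1920)
```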

In [42]:
test_dataset.to_pandas()

| | text | label |
|---|---|---|
| 0 | im feeling rather rotten so im not very ambiti... | 0 |
| 1 | im updating my blog because i feel shitty | 0 |
| 2 | i never make her separate from me because i do... | 0 |
| 3 | i left with my bouquet of red and yellow tulip... | 1 |
| 4 | i was feeling a little vain when i did this one | 0 |
| ... | ... | ... |
| 1995 | i just keep feeling like someone is being unki... | 3 |
| 1996 | im feeling a little cranky negative after this... | 3 |
| 1997 | i feel that i am useful to my people and that ... | 1 |
| 1998 | im feeling more comfortable with derby i feel ... | 1 |


In [41]:
results

{'accuracy': 0.085}
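When reading this score, it may help to check how integer label ids map to label names on each side. The sketch below assumes the dataset's label order is `sadness, joy, love, anger, fear, surprise`, which is consistent with the rows shown above (label `0` on the sad texts, label `1` on the joyful ones), and that the templated training set assigns ids by position in `labels`. Both points are assumptions, not something this notebook verifies.

```python
# Assumed order of the dataset's label names (id 0 = sadness, id 1 = joy, ...)
dataset_label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
# Order used for `labels` in this notebook
candidate_labels = ["joy", "anger", "sadness", "fear", "surprise", "love"]

# Ids whose meaning differs between the two orderings
mismatched_ids = [
    label_id
    for label_id, name in enumerate(candidate_labels)
    if dataset_label_names[label_id] != name
]

# Reordering the candidate labels to follow the dataset removes the mismatch
aligned_labels = sorted(candidate_labels, key=dataset_label_names.index)

print(mismatched_ids)
print(aligned_labels)
```

If the two orderings really differ like this, the same id denotes different emotions in the training and test sets, which could pull the accuracy below the level expected from random guessing over six classes.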

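Next, we train the same model on 8, 16, and 32 labeled samples per class drawn from the training split and compare the results. You can save a trained model locally or push it to the Hub, and then load it from there.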

In [97]:
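# Keep the SetFit scores separate from the labeller's `results_scores`
setfit_scores = {}

for n_samples in [8, 16, 32]:
    # Sample from the full training subset each time; reassigning
    # `train_dataset` would shrink the pool for later iterations
    sampled_dataset = sample_dataset(
        train_dataset, label_column="label", num_samples=n_samples
    )
    model, results = train_model(
        model_name="TaylorAI/bge-micro-v2",
        train_dataset=sampled_dataset,
        eval_dataset=test_dataset,
    )
    print(f"Results for {n_samples} samples:")
    print(results)
    setfit_scores[n_samples] = results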

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Map: 100%|██████████| 43/43 [00:00<00:00, 6174.64 examples/s]
***** Running training *****
  Num unique pairs = 1528
  Batch size = 16
  Num epochs = 4

{'embedding_loss': 0.1935, 'grad_norm': 0.5774672627449036, 'learning_rate': 5.128205128205128e-07, 'epoch': 0.01}

{'embedding_loss': 0.2234, 'grad_norm': 0.8519500494003296, 'learning_rate': 1.9362318840579713e-05, 'epoch': 0.52}

In [None]:
# Save and load locally
# model.save_pretrained("text_classification_model")
# model = SetFitModel.from_pretrained("text_classification_model")

# Push and load in HF
# model.push_to_hub("[username]/text_classification_model")
# model = SetFitModel.from_pretrained("[username]/text_classification_model")

In [98]:
print(results_scores)

{0: {'true': ['sadness', 'sadness', 'sadness', 'joy', 'sadness', 'fear', 'anger', 'joy', 'joy', 'anger'], 'predictions': ['sadness', 'sadness', 'sadness', 'sadness', 'sadness', 'sadness', 'sadness', 'sadness', 'sadness', 'sadness']}, 8: {'true': ['sadness', 'sadness', 'sadness', 'joy', 'sadness', 'fear', 'anger', 'joy', 'joy', 'anger'], 'predictions': ['sadness', 'sadness', 'sadness', 'sadness', 'sadness', 'sadness', 'sadness', 'sadness', 'sadness', 'sadness']}, 16: {'true': ['sadness', 'sadness', 'sadness', 'joy', 'sadness', 'fear', 'anger', 'joy', 'joy', 'anger'], 'predictions': ['fear', 'sadness', 'sadness', 'joy', 'sadness', 'sadness', 'anger', 'sadness', 'sadness', 'fear']}, 32: {'true': ['sadness', 'sadness', 'sadness', 'joy', 'sadness', 'fear', 'anger', 'joy', 'joy', 'anger'], 'predictions': ['emotion', 'sadness', 'sadness', 'emotions', 'sadness', 'joy', None, 'sadness', 'sadness', 'joy']}}
