<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M3_2_SetFit_Hatespeech_%26_distilroberta_v3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SetFit (25 examples per class) vs BERT (1000 examples)

In this tutorial, we perform hate speech classification using SetFit and BERT. We read tweets from a CSV file and balance the number of samples in each class. Then, we split the data into a training set and a testing set.

We fine-tune a pre-trained SetFit model on a few-shot sample of the training set (25 examples per class) and evaluate its performance on the testing set. Code for pushing the model to the 🤗 Hub is provided but commented out. Next, we fine-tune a pre-trained BERT model on 1000 examples from the training set and evaluate its performance on the same testing set. We save the fine-tuned model locally and push it to the Hub.

We evaluate using a classification report that includes precision, recall, F1 score, and support for each class.
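
To make the columns of the report concrete, here is a minimal sketch that computes per-class precision, recall, F1, and support by hand. The labels and predictions are made-up toy values, not outputs of this notebook, and `per_class_report` is a hypothetical helper. In the tutorial itself, `classification_report` from scikit-learn does this work.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for each true label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # class p was predicted, but wrongly
            fn[t] += 1  # the true class t was missed
    report = {}
    for label in sorted(set(y_true)):
        predicted = tp[label] + fp[label]   # how often the class was predicted
        support = tp[label] + fn[label]     # how often the class truly occurs
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = (precision, recall, f1, support)
    return report

# Toy labels and predictions (illustrative values only)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report = per_class_report(y_true, y_pred)
for label, (p, r, f1, n) in report.items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f} support={n}")
```

Precision asks "of everything predicted as this class, how much was right?", while recall asks "of everything that truly belongs to this class, how much was found?". F1 is the harmonic mean of the two.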

In [None]:
# Install the necessary packages
!pip install setfit --q
!pip install accelerate -U --q


In [None]:
!git clone https://github.com/huggingface/transformers
!pip install /content/transformers

Cloning into 'transformers'...
Processing ./transformers
Building wheels for collected packages: transformers
  Created wheel for transformers: filename=transformers-4.38.0.dev0-py3-none-any.whl

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import evaluate
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler
from transformers import AutoTokenizer, TrainingArguments, Trainer, pipeline, AutoModelForSequenceClassification
from sentence_transformers.losses import CosineSimilarityLoss
from datasets import Dataset, load_dataset
from setfit import SetFitModel, SetFitTrainer, sample_dataset

## Reading Data

The code reads in the hate speech dataset from a given URL using the `pandas` library, and creates a pandas dataframe with the 'text' and 'label' columns.


In [None]:
## PREPPING THE DATA ##

# Read in the data from a CSV file
data = pd.read_csv('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/twitter_hate.zip')

# Rename and reorder the columns
data_df = pd.DataFrame({'label':data['class'], 'text':data['tweet']})





## Fixing Sample Imbalance

The `RandomUnderSampler` from the `imblearn` library fixes the class imbalance in the dataset by randomly undersampling the overrepresented classes until every class has as many samples as the smallest one.
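
As an illustration of the idea, the following sketch undersamples a toy dataset using only the standard library. It is not how `imblearn` is implemented. `random_undersample` and the toy rows are hypothetical, and the sketch assumes the behaviour of shrinking every class to the size of the smallest one.

```python
import random
from collections import Counter, defaultdict

def random_undersample(rows, label_key="label", seed=42):
    """Randomly drop rows so every class keeps as many rows as the smallest class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for label in sorted(by_class):
        # Sample without replacement, so no row is duplicated
        balanced.extend(rng.sample(by_class[label], target))
    return balanced

# Toy, imbalanced data (illustrative values only): 3, 10, and 5 rows per class
rows = (
    [{"text": f"a{i}", "label": 0} for i in range(3)]
    + [{"text": f"b{i}", "label": 1} for i in range(10)]
    + [{"text": f"c{i}", "label": 2} for i in range(5)]
)
balanced = random_undersample(rows)
counts = Counter(row["label"] for row in balanced)
print(counts)
```

Undersampling throws data away, which is acceptable here because the comparison only uses small training sets anyway.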

## Splitting Data

The `train_test_split` method from the `datasets` library is used to split the dataset into a training set and a testing set.
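
The sketch below shows what a shuffled 80/20 split does, using plain Python lists. `simple_split` is a hypothetical helper, not the `datasets` implementation. The total of 4290 rows is an assumption, chosen to be consistent with the 858 test rows that appear in the classification reports of this notebook.

```python
import random

def simple_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of `items` and split off a `test_size` fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Row indices stand in for the balanced dataset (assumed size: 4290 rows)
train, test = simple_split(range(4290), test_size=0.2)
print(len(train), len(test))
```

Note that the split in this notebook is called without a seed, so the exact rows in each set change between runs. Passing a fixed seed, as in the sketch, makes a split repeatable.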


In [None]:
# Fix sample imbalance using RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
data_df_res, y_res = rus.fit_resample(data_df, data_df['label'])

# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(data_df_res)

# Split the dataset into training and testing sets
dataset = dataset.train_test_split(test_size=0.2)



In [None]:
# Simulate the few-shot regime by sampling 25 examples per class in the training set
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=25)
eval_dataset = dataset["test"]



## SetFit Model

The code below loads a pre-trained Sentence Transformer, wraps it in a `SetFitTrainer`, and trains a SetFit model on the few-shot training set. The cell after it evaluates the trained model on the evaluation set and prints the evaluation metrics.

This version of SetFit uses a scikit-learn classification head. It is also possible to add a neural layer to the SBERT model as the classification head. Check out the original example for that here: https://github.com/huggingface/setfit
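
Before the classification head is fitted, SetFit fine-tunes the sentence embedding model contrastively on pairs of training texts. The sketch below is a simplified illustration of how such pairs can be built from a few labelled examples, not SetFit's actual code. `generate_pairs` is a hypothetical helper, the toy texts are made up, and the scheme of one positive and one negative pair per example per iteration is an assumption.

```python
import random

def generate_pairs(texts, labels, num_iterations=20, seed=42):
    """Build (text_a, text_b, similarity) triples for contrastive fine-tuning.

    In every iteration, each example is paired once with a random example of
    the same class (similarity 1.0) and once with a random example of a
    different class (similarity 0.0). Requires at least two examples per
    class and at least two classes.
    """
    rng = random.Random(seed)
    indexed = list(enumerate(zip(texts, labels)))
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in indexed:
            same = [t for j, (t, lab) in indexed if lab == label and j != i]
            other = [t for _, (t, lab) in indexed if lab != label]
            pairs.append((text, rng.choice(same), 1.0))
            pairs.append((text, rng.choice(other), 0.0))
    return pairs

# Toy few-shot set with two examples per class (illustrative text only)
texts = ["good film", "great movie", "bad film", "awful movie"]
labels = [1, 1, 0, 0]
pairs = generate_pairs(texts, labels, num_iterations=2)
print(len(pairs))  # 4 examples * 2 pairs * 2 iterations = 16
for pair in pairs[:4]:
    print(pair)
```

Training on these pairs with a cosine-similarity loss pulls same-class texts together and pushes different-class texts apart in embedding space. This is why a handful of labelled examples can yield thousands of training signals.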


In [None]:
# Load a pre-trained SBERT model from Hugging Face model hub
model_setfit = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create SetFitTrainer and train the SetFit model
trainer_setfit = SetFitTrainer(
    model=model_setfit,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20, # The number of text pairs to generate for contrastive learning
    num_epochs=1, # The number of epochs to use for contrastive learning
    column_mapping={"text": "text", "label": "label"} # Map dataset columns to text/label expected by trainer
)
trainer_setfit.train()


model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.
Applying column mapping to the training dataset
Applying column mapping to the evaluation dataset



***** Running training *****
  Num unique pairs = 3000
  Batch size = 16
  Num epochs = 1
  Total optimization steps = 188
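
The figures in this training log can be reproduced by hand from the trainer settings. The short calculation below assumes that each training example contributes one positive and one negative pair per iteration, an assumption that matches the 3000 pairs reported in the log.

```python
import math

examples_per_class = 25   # num_samples passed to sample_dataset
num_classes = 3           # hate, offense, nothing
num_iterations = 20       # SetFitTrainer argument
batch_size = 16           # SetFitTrainer argument

num_examples = examples_per_class * num_classes   # 75 few-shot examples
num_pairs = num_examples * num_iterations * 2     # positive + negative pair (assumed)
steps = math.ceil(num_pairs / batch_size)         # the last batch is only partly filled

print(num_examples, num_pairs, steps)  # 75 3000 188
```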




In [None]:
# Evaluate the performance of the trained SetFit model on the testing dataset
metrics_setfit = trainer_setfit.evaluate()

preds_setfit = model_setfit(eval_dataset['text'])
target_names = ['hate', 'offense', 'nothing']
print(classification_report(eval_dataset['label'], preds_setfit, target_names=target_names))

# Save the trained SetFit model to the HF hub
# trainer_setfit.push_to_hub("my-awesome-setfit-model")

# Download from Hub and run inference
# model_setfit = SetFitModel.from_pretrained("myname/my-awesome-setfit-model")
# Run inference
# preds = model_setfit(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])


              precision    recall  f1-score   support

        hate       0.64      0.72      0.68       288
     offense       0.63      0.58      0.60       263
     nothing       0.81      0.77      0.79       307

    accuracy                           0.69       858
   macro avg       0.69      0.69      0.69       858
weighted avg       0.70      0.69      0.69       858



## BERT Model

This section loads a pre-trained BERT model and tokenizer and fine-tunes the model for three-class text classification. We tokenize the datasets, set up a `Trainer`, and train the model. The fine-tuned model and tokenizer are then saved to the local file system. Finally, we load the saved model into a `pipeline` to classify the testing set, convert the pipeline's output labels to match the labels in the original dataset, and evaluate the result with the `classification_report` function.

A few building blocks do most of the work. The `Trainer` class from the Hugging Face `transformers` library runs the fine-tuning loop, and the `compute_metrics` function tells it how to turn model outputs into evaluation metrics. The `save_pretrained` method writes the fine-tuned model and tokenizer to the local file system, and `from_pretrained` loads them again later. The `pipeline` function performs text classification without the need for a custom inference script.
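
A `compute_metrics` function receives the model's raw logits together with the true labels, picks the highest-scoring class per row, and compares it with the label. The sketch below shows that logic in plain Python with made-up logits. `accuracy_from_logits` is a hypothetical stand-in for the NumPy and `evaluate` based version used in this notebook.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy logits for 4 tweets and 3 classes (illustrative values only)
logits = [
    [2.1, 0.3, -1.0],   # predicted class 0
    [0.2, 1.7, 0.1],    # predicted class 1
    [-0.5, 0.4, 3.2],   # predicted class 2
    [1.0, 1.5, 0.2],    # predicted class 1, but the true label is 0
]
labels = [0, 1, 2, 0]
result = accuracy_from_logits(logits, labels)
print(result)  # {'accuracy': 0.75}
```

Returning a dictionary keeps the metric name attached to its value, which is the shape the `Trainer` expects from `compute_metrics`.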


In [None]:
# Load a pre-trained BERT model and tokenizer
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
model_bert = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                                num_labels=3,
                                                                ignore_mismatched_sizes=True).to('cuda')


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Note that we fine-tune BERT on only 1000 examples and use all 858 available observations from the test set for evaluation.
Because this tutorial is a demonstration rather than a proper model development pipeline, we pass the test set to the `Trainer` as the evaluation set. In a real project this is not good practice, and a separate validation set should be used instead.
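
For a rough sense of how much training 1000 examples amounts to, the calculation below estimates the number of optimization steps. It assumes the `TrainingArguments` defaults are left unchanged (3 epochs, per-device batch size of 8) and that a single GPU is used. These assumptions are not confirmed by the notebook's outputs.

```python
import math

train_examples = 1000      # size of the BERT training subset
per_device_batch_size = 8  # TrainingArguments default (assumed unchanged)
num_epochs = 3             # TrainingArguments default (assumed unchanged)
num_devices = 1            # assumption: a single Colab GPU

steps_per_epoch = math.ceil(train_examples / (per_device_batch_size * num_devices))
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)  # 125 375
```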


In [None]:
# Prepare the datasets for fine-tuning the BERT model
def tokenize_function(examples):
    return tokenizer_bert(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(858))


In [None]:
# Set up training arguments
training_args = TrainingArguments(output_dir="bert_trainer")

# Define the evaluation metric for the fine-tuned BERT model
metric_bert = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric_bert.compute(predictions=predictions, references=labels)

# Set up the Trainer for the fine-tuned BERT model and train it
trainer_bert = Trainer(
    model=model_bert,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer_bert.train()

# Save the fine-tuned BERT model and tokenizer to the local file system
model_bert.save_pretrained('model_bert')
tokenizer_bert.save_pretrained('model_bert')



('model_bert/tokenizer_config.json',
 'model_bert/special_tokens_map.json',
 'model_bert/vocab.txt',
 'model_bert/added_tokens.json',
 'model_bert/tokenizer.json')

The saved model can now be pushed to the Hugging Face Hub or stored elsewhere.

In [None]:
!pip install huggingface-hub --q

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
model_bert.push_to_hub("HamidBekam/bert_classification")


CommitInfo(commit_url='https://huggingface.co/HamidBekam/bert_classification/commit/eb58e396fe053b62286a1657b2f81e3ad029bcec', commit_message='Upload BertForSequenceClassification', commit_description='', oid='eb58e396fe053b62286a1657b2f81e3ad029bcec', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
tokenizer_bert.push_to_hub("HamidBekam/bert_classification")


CommitInfo(commit_url='https://huggingface.co/HamidBekam/bert_classification/commit/26e4f038967e2ac719ea1d750a08c4031b0b5c10', commit_message='Upload tokenizer', commit_description='', oid='26e4f038967e2ac719ea1d750a08c4031b0b5c10', pr_url=None, pr_revision=None, pr_num=None)

In [None]:
# Use the saved fine-tuned BERT model and tokenizer to perform text classification on the testing set
classifier = pipeline("text-classification", model="model_bert", device=0)
preds_bert = classifier(eval_dataset['text'])

In [None]:
# Convert the output labels to match the labels in the original dataset
preds_bert_num = [x['label'] for x in preds_bert]
mapping = {'LABEL_0':0,'LABEL_1':1,'LABEL_2':2}
preds_bert_num = [mapping[x] for x in preds_bert_num]

# Print the classification report for the fine-tuned BERT model
target_names = ['hate', 'offense', 'nothing']
print(classification_report(eval_dataset['label'], preds_bert_num, target_names=target_names))

              precision    recall  f1-score   support

        hate       0.78      0.74      0.76       288
     offense       0.77      0.80      0.79       263
     nothing       0.91      0.92      0.91       307

    accuracy                           0.82       858
   macro avg       0.82      0.82      0.82       858
weighted avg       0.82      0.82      0.82       858
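
The label conversion step before this report relies on a hand-written dictionary from `LABEL_0`, `LABEL_1`, and `LABEL_2` to integers. As a sketch of an alternative, the index can be parsed from the label name, which also catches unexpected label formats. `label_to_id` is a hypothetical helper and the scores below are made up.

```python
def label_to_id(label_name):
    """Turn a default pipeline label such as 'LABEL_2' into the integer 2."""
    prefix, _, index = label_name.rpartition("_")
    if prefix != "LABEL" or not index.isdigit():
        raise ValueError(f"Unexpected label format: {label_name!r}")
    return int(index)

# Shape of a text-classification pipeline result (scores are made up)
preds = [
    {"label": "LABEL_0", "score": 0.91},
    {"label": "LABEL_2", "score": 0.77},
    {"label": "LABEL_1", "score": 0.64},
]
class_names = ["hate", "offense", "nothing"]  # same order as target_names
ids = [label_to_id(p["label"]) for p in preds]
names = [class_names[i] for i in ids]
print(ids, names)
```

Failing loudly on an unknown label is a deliberate choice here: a silently wrong mapping would shift every row of the classification report.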

