<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebooks/M4_SetFit_Hatespeech_and_distilroberta.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SetFit (25 examples) vs BERT (1000 examples)

In this tutorial, we perform hate speech classification using SetFit and BERT. We read tweets from a CSV file and balance the number of samples in each class. Then, we split the data into a training set and a testing set.

We first train a pre-trained SetFit model on a few-shot sample of the training set (25 examples per class) and evaluate it on the testing set. Code for pushing the model to the 🤗 Hub is provided but commented out. Next, we fine-tune a pre-trained BERT model on 1,000 training examples, evaluate it on the same testing set, and save the fine-tuned model locally.

We evaluate using a classification report that includes precision, recall, F1 score, and support for each class.
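
To make the columns of the classification report concrete, here is a minimal sketch that computes precision, recall, F1 score, and support per class from scratch. The helper `per_class_report` and the toy labels are made up for illustration; the notebook itself relies on `sklearn.metrics.classification_report`.

```python
def per_class_report(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} computed from scratch."""
    labels = sorted(set(y_true) | set(y_pred))
    report = {}
    for label in labels:
        # Count true positives, false positives and false negatives for this class
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        support = tp + fn  # number of true examples of this class
        report[label] = (precision, recall, f1, support)
    return report

# Toy labels, made up for illustration (0 = hate, 1 = offense, 2 = nothing)
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
report = per_class_report(y_true, y_pred)

for label, (precision, recall, f1, support) in report.items():
    print(f"class {label}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} support={support}")
```

Support is simply the number of true examples of each class, which is why it stays the same whichever model produced the predictions.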

In [None]:
# Install the necessary packages
!pip install setfit -q

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import evaluate
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler
from transformers import AutoTokenizer, TrainingArguments, Trainer, pipeline, AutoModelForSequenceClassification
from sentence_transformers.losses import CosineSimilarityLoss
from datasets import Dataset, load_dataset
from setfit import SetFitModel, SetFitTrainer, sample_dataset

## Reading Data

The code below reads the hate speech dataset from a URL with `pandas` and builds a DataFrame with `label` and `text` columns.


In [None]:
## PREPPING THE DATA ##

# Read in the data from a CSV file
data = pd.read_csv('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/twitter_hate.zip')

# Rename and reorder the columns
data_df = pd.DataFrame({'label':data['class'], 'text':data['tweet']})



## Fixing Sample Imbalance

The `RandomUnderSampler` from the `imblearn` library fixes the class imbalance in the dataset by randomly undersampling the overrepresented classes. With its default settings, every class is reduced to the size of the smallest class.
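
The idea behind random undersampling can be sketched in plain Python. This is only an illustration under the assumption that every class is reduced to the size of the smallest one; the helper `random_undersample` and the toy rows are hypothetical and are not part of `imblearn`.

```python
import random
from collections import defaultdict

def random_undersample(rows, label_key="label", seed=42):
    """Randomly downsample every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)
    n_min = min(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        # Draw n_min rows without replacement from each class
        balanced.extend(rng.sample(by_label[label], n_min))
    return balanced

# Toy, imbalanced data made up for illustration: 3 / 10 / 5 rows per class
rows = (
    [{"label": 0, "text": f"hate {i}"} for i in range(3)]
    + [{"label": 1, "text": f"offense {i}"} for i in range(10)]
    + [{"label": 2, "text": f"nothing {i}"} for i in range(5)]
)
balanced = random_undersample(rows)
print(len(balanced))
```

After undersampling, each of the three classes contributes the same number of rows, at the cost of discarding data from the larger classes.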

## Splitting Data

The `train_test_split` method of the `Dataset` class from the `datasets` library splits the dataset into a training set (80%) and a testing set (20%).


In [None]:
# Fix sample imbalance using RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
data_df_res, y_res = rus.fit_resample(data_df, data_df['label'])

# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(data_df_res)

# Split the dataset into training and testing sets
dataset = dataset.train_test_split(test_size=0.2, seed=42)

In [None]:
# Simulate the few-shot regime by sampling 25 examples per class in the training set
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=25)
eval_dataset = dataset["test"]

## SetFit Model

The cells below load a pre-trained Sentence Transformer (SBERT) model, wrap it in a `SetFitTrainer`, train it on the few-shot training set, and evaluate it on the testing set.

This version of SetFit uses a scikit-learn classification head. It is also possible to add a neural classification head to the SBERT model instead. See the original examples here: https://github.com/huggingface/setfit
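
The contrastive step that `num_iterations` controls can be illustrated with a simplified sketch. It assumes that, in each iteration, every training example is paired once with a random example of the same class (target similarity 1.0) and once with a random example of a different class (target similarity 0.0). The function `generate_pairs` is hypothetical and is not SetFit's actual implementation.

```python
import random

def generate_pairs(texts, labels, num_iterations, seed=42):
    """Build (text_a, text_b, target_similarity) triples for contrastive training."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(num_iterations):
        for i, (text, label) in enumerate(zip(texts, labels)):
            # Candidates with the same label (excluding the example itself)
            positives = [t for j, (t, l) in enumerate(zip(texts, labels)) if l == label and j != i]
            # Candidates with any other label
            negatives = [t for t, l in zip(texts, labels) if l != label]
            pairs.append((text, rng.choice(positives), 1.0))
            pairs.append((text, rng.choice(negatives), 0.0))
    return pairs

# Toy few-shot data made up for illustration: 2 examples for each of 3 classes
texts = ["a1", "a2", "b1", "b2", "c1", "c2"]
labels = [0, 0, 1, 1, 2, 2]
pairs = generate_pairs(texts, labels, num_iterations=20)
print(len(pairs))  # 2 pairs per example per iteration
```

Under this assumption, a handful of labeled examples yields many training pairs, which is what lets the sentence encoder be fine-tuned with so little data. `CosineSimilarityLoss` then pushes the cosine similarity of each pair's embeddings toward its target value.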


In [None]:
# Load a pre-trained SBERT model from Hugging Face model hub
model_setfit = SetFitModel.from_pretrained("sentence-transformers/paraphrase-mpnet-base-v2")

# Create SetFitTrainer and train the SetFit model
trainer_setfit = SetFitTrainer(
    model=model_setfit,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    loss_class=CosineSimilarityLoss,
    metric="accuracy",
    batch_size=16,
    num_iterations=20, # The number of text pairs to generate for contrastive learning
    num_epochs=1, # The number of epochs to use for contrastive learning
    column_mapping={"text": "text", "label": "label"} # Map dataset columns to text/label expected by trainer
)
trainer_setfit.train()

In [None]:
# Evaluate the performance of the trained SetFit model on the testing dataset
metrics_setfit = trainer_setfit.evaluate()

preds_setfit = model_setfit(eval_dataset['text'])
target_names = ['hate', 'offense', 'nothing']
print(classification_report(eval_dataset['label'], preds_setfit, target_names=target_names))

# Save the trained SetFit model to the HF hub
# trainer_setfit.push_to_hub("my-awesome-setfit-model")

# Download from Hub and run inference
# model_setfit = SetFitModel.from_pretrained("myname/my-awesome-setfit-model")
# Run inference
# preds = model_setfit(["i loved the spiderman movie!", "pineapple on pizza is the worst 🤮"])

## BERT Model

This section loads a pre-trained BERT model and tokenizer and fine-tunes the model for text classification. We tokenize the datasets, set up the `Trainer`, and train the model. We then save the model and tokenizer to the local file system, use them to classify the testing set, and convert the output labels to match the labels in the original dataset. Finally, we evaluate the fine-tuned BERT model with the `classification_report` function.

The `Trainer` class from the Hugging Face `transformers` library runs the fine-tuning loop, and the user-defined `compute_metrics` function computes the evaluation metric during training. The `save_pretrained` method writes the fine-tuned model and tokenizer to the local file system, and `from_pretrained` loads them again for later use. The `pipeline` function then performs text classification without the need for a custom inference script.
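
The label conversion step can be sketched without running a model. A text-classification pipeline returns one dictionary per input with a `label` and a `score`, and a model configured only with `num_labels` uses default names such as `LABEL_0`. The helper `label_to_id` and the `fake_preds` values below are made up for illustration.

```python
def label_to_id(label):
    """Convert a default pipeline label such as 'LABEL_2' to the integer 2."""
    prefix, _, index = label.rpartition("_")
    if prefix != "LABEL" or not index.isdigit():
        raise ValueError(f"Unexpected label format: {label!r}")
    return int(index)

# Made-up predictions in the shape a text-classification pipeline returns
fake_preds = [
    {"label": "LABEL_1", "score": 0.91},
    {"label": "LABEL_0", "score": 0.55},
    {"label": "LABEL_2", "score": 0.78},
]
pred_ids = [label_to_id(p["label"]) for p in fake_preds]
print(pred_ids)
```

Parsing the index avoids hard-coding one dictionary entry per class, and raising on unexpected names guards against a model that was saved with custom `id2label` names.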


In [None]:
# Load a pre-trained BERT model and tokenizer
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")
model_bert = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                                num_labels=3,
                                                                ignore_mismatched_sizes=True).to('cuda')

Note that we fine-tune BERT on only 1,000 examples and evaluate on all 858 observations in the test set.
Because this tutorial is not a full model development pipeline, we also pass the test set to the `Trainer` as its evaluation set. In a real project this is not good practice: use a separate validation set during training and keep the test set for the final evaluation.
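
For comparison, a more careful setup would hold out a validation set in addition to the test set. The sketch below shows one way to do that in plain Python. The function `three_way_split` and its split sizes are assumptions for illustration and are not used elsewhere in this notebook.

```python
import random

def three_way_split(items, val_size=0.1, test_size=0.2, seed=42):
    """Shuffle items and split them into train, validation and test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_test = round(len(items) * test_size)
    n_val = round(len(items) * val_size)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# 100 toy example ids, made up for illustration
train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))
```

With such a split, the validation set would serve as the `Trainer`'s evaluation set, and the test set would be touched only once for the final classification report.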


In [None]:
# Prepare the datasets for fine-tuning the BERT model
def tokenize_function(examples):
    return tokenizer_bert(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(1000))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(858))

In [None]:
# Set up training arguments
training_args = TrainingArguments(output_dir="bert_trainer")

# Define the evaluation metric for the fine-tuned BERT model
metric_bert = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric_bert.compute(predictions=predictions, references=labels)

# Set up the Trainer for the fine-tuned BERT model and train it
trainer_bert = Trainer(
    model=model_bert,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    compute_metrics=compute_metrics,
)

trainer_bert.train()

# Save the fine-tuned BERT model and tokenizer to the local file system
model_bert.save_pretrained('model_bert')
tokenizer_bert.save_pretrained('model_bert')

The saved model could now be pushed to the 🤗 Hub or stored elsewhere.

In [None]:
# Use the saved fine-tuned BERT model and tokenizer to perform text classification on the testing set
classifier = pipeline("text-classification", model="model_bert", device=0)
preds_bert = classifier(eval_dataset['text'])

In [None]:
# Convert the output labels to match the labels in the original dataset
preds_bert_num = [x['label'] for x in preds_bert]
mapping = {'LABEL_0':0,'LABEL_1':1,'LABEL_2':2}
preds_bert_num = [mapping[x] for x in preds_bert_num]

# Print the classification report for the fine-tuned BERT model
target_names = ['hate', 'offense', 'nothing']
print(classification_report(eval_dataset['label'], preds_bert_num, target_names=target_names))