<a href="https://colab.research.google.com/github/aaubs/ds-master/blob/main/notebook/M3_2_SetFit_Hatespeech_%26_distilroberta_v4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# SetFit (25 examples) vs BERT (1000 examples)

In this tutorial, we classify hate speech in tweets using SetFit and BERT. We read the tweets from a CSV file, balance the number of samples per class, and split the data into a training set and a test set.

We train a pre-trained SetFit model on a few-shot sample of the training set and evaluate it on the test set. Next, we fine-tune a pre-trained BERT model in the same way and evaluate it on the test set. Finally, we save the fine-tuned model; code for pushing it to the 🤗 Hub is provided at the end of the notebook.

We evaluate using a classification report that includes precision, recall, F1 score, and support for each class.

In [None]:
!pip install setfit


In [None]:
# Import required libraries
import pandas as pd
import numpy as np
from datasets import Dataset, load_dataset
import evaluate
from sklearn.metrics import classification_report
from imblearn.under_sampling import RandomUnderSampler
from setfit import SetFitModel, Trainer, TrainingArguments, sample_dataset



In [None]:
## PREPPING THE DATA ##

# Read in the data from a CSV file
data = pd.read_csv('https://github.com/SDS-AAU/SDS-master/raw/master/M2/data/twitter_hate.zip')

# Rename and reorder the columns
data_df = pd.DataFrame({'label':data['class'], 'text':data['tweet']})

In [None]:
data_df.head()



,label,text
0,2,!!! RT @mayasolovely: As a woman you shouldn't...
1,1,!!!!! RT @mleew17: boy dats cold...tyga dwn ba...
2,1,!!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...
3,1,!!!!!!!!! RT @C_G_Anderson: @viva_based she lo...
4,1,!!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...


In [None]:
# Inspect a few example tweets for one label (here: label 2)
data_df[data_df.label == 2].text[0:5]

,text
0,!!! RT @mayasolovely: As a woman you shouldn't...
40,""" momma said no pussy cats inside my doghouse """
63,"""@Addicted2Guys: -SimplyAddictedToGuys http://..."
66,"""@AllAboutManFeet: http://t.co/3gzUpfuMev"" woo..."
67,"""@Allyhaaaaa: Lemmie eat a Oreo &amp; do these..."


In [None]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24783 entries, 0 to 24782
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   24783 non-null  int64 
 1   text    24783 non-null  object
dtypes: int64(1), object(1)
memory usage: 387.4+ KB


## Fixing Sample Imbalance

The `RandomUnderSampler` from the `imblearn` library fixes the class imbalance in the dataset by randomly undersampling the overrepresented classes until every class has as many samples as the smallest one.
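The idea behind random undersampling can be sketched in plain Python. This is only an illustration under simplifying assumptions, not how `imblearn` implements it: the helper `random_undersample` is hypothetical, and the toy labels merely reuse the class counts of the tweet dataset (19190, 4163, and 1430).

```python
import random
from collections import Counter, defaultdict


def random_undersample(labels, seed=42):
    """Return sorted row indices that keep an equal number of rows per class."""
    rng = random.Random(seed)

    # Group row indices by class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Every class is cut down to the size of the smallest class
    target = min(len(rows) for rows in by_class.values())

    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, target))
    return sorted(kept)


# Toy labels with the same class counts as the tweet dataset
labels = [1] * 19190 + [2] * 4163 + [0] * 1430
kept = random_undersample(labels)
balanced_counts = Counter(labels[i] for i in kept)

print(balanced_counts)
print(len(kept))
```

Each class ends up with 1430 rows, 4290 in total, which agrees with the shape of `data_df_res` in the notebook.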

## Splitting Data

The `train_test_split` method from the `datasets` library is used to split the dataset into a training set and a testing set.
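As a rough illustration of what an 80/20 split does, the sketch below shuffles row indices and cuts them into two disjoint parts. The helper `shuffled_split` is hypothetical; the `datasets` implementation may shuffle and round differently.

```python
import random


def shuffled_split(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices and split them into train and test index lists."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)

    # The first n_test shuffled indices form the test set, the rest the training set
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]


# 4290 rows is the size of the balanced dataset in this notebook
train_idx, test_idx = shuffled_split(4290)
print(len(train_idx), len(test_idx))
```

With 4290 rows and `test_size=0.2`, this yields 3432 training rows and 858 test rows, the same sizes as the `DatasetDict` in the notebook.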


In [None]:
data_df.label.value_counts()



label,count
1,19190
2,4163
0,1430


In [None]:
# Fix sample imbalance using RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
data_df_res, y_res = rus.fit_resample(data_df, data_df['label'])
data_df_res.reset_index(drop=True, inplace=True)

# Convert the pandas DataFrame to a Hugging Face Dataset
dataset = Dataset.from_pandas(data_df_res)

# Split the dataset into training and testing sets
dataset = dataset.train_test_split(test_size=0.2)

In [None]:
data_df_res.shape



(4290, 2)

In [None]:
dataset

DatasetDict({
    train: Dataset({
        features: ['label', 'text'],
        num_rows: 3432
    })
    test: Dataset({
        features: ['label', 'text'],
        num_rows: 858
    })
})

In [None]:
# Simulate the few-shot regime by sampling 25 examples per class
train_dataset = sample_dataset(dataset["train"], label_column="label", num_samples=25)
eval_dataset = dataset["test"].select(range(500))
test_dataset = dataset["test"].select(range(500, len(dataset["test"])))

In [None]:
train_dataset

Dataset({
    features: ['label', 'text'],
    num_rows: 75
})
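The few-shot sampling step can be pictured as drawing a fixed number of rows from each class. The sketch below is a plain-Python approximation, not SetFit's `sample_dataset` implementation; `sample_per_class` and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict


def sample_per_class(rows, label_key="label", num_samples=25, seed=42):
    """Draw up to num_samples rows from every class, then shuffle the result."""
    rng = random.Random(seed)

    # Group rows by their label
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    sampled = []
    for label in sorted(by_class):
        group = by_class[label]
        # Classes with fewer rows than num_samples are kept whole
        sampled.extend(rng.sample(group, min(num_samples, len(group))))

    rng.shuffle(sampled)
    return sampled


# Toy training set: 3432 rows spread evenly over three classes
rows = [{"label": i % 3, "text": f"tweet {i}"} for i in range(3432)]
few_shot = sample_per_class(rows)
few_shot_counts = Counter(row["label"] for row in few_shot)

print(len(few_shot))
```

With three classes and 25 samples per class, the result has 75 rows, matching `num_rows: 75` for `train_dataset`.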

In [None]:
train_dataset[1]

{'label': 0,
 'text': '@coughlan616 Youre racist against white people whigger. Anti-Racist is a codeword for anti-white.\nAnti-zionist is a codeword for antisemite.'}

In [None]:
# Load a SetFit model from Hub
model = SetFitModel.from_pretrained(
    "sentence-transformers/paraphrase-mpnet-base-v2",
    labels=['hate', 'offense', 'nothing'],
)

model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [None]:
# Set up training arguments
args = TrainingArguments(
    batch_size=16,
    num_epochs=1,  # more epochs may improve accuracy at the cost of longer training
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

In [None]:
# Set up training process
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    metric="accuracy",
    # column_mapping={"sentence": "text", "label": "label"}  # Map dataset columns to text/label expected by trainer
)


In [None]:
# Start training
trainer.train()

***** Running training *****
  Num unique pairs = 3750
  Batch size = 16
  Num epochs = 1


Epoch,Training Loss,Validation Loss
1,0.0016,0.247896
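The `Num unique pairs = 3750` line in the training log can be reproduced with a little combinatorics. The sketch assumes that SetFit builds contrastive pairs from all distinct sentence pairs and, by default, oversamples the smaller group (positive or negative pairs) to the size of the larger one. That reading is an assumption about the library's behaviour; only the arithmetic is certain.

```python
from math import comb

num_classes = 3
per_class = 25
total = num_classes * per_class  # 75 training sentences

# Positive pairs: two different sentences from the same class
positive_pairs = num_classes * comb(per_class, 2)

# Negative pairs: every remaining pair of sentences from different classes
negative_pairs = comb(total, 2) - positive_pairs

# Assumed balancing: the smaller group is oversampled to the size of the larger one
balanced_pairs = 2 * max(positive_pairs, negative_pairs)

print(positive_pairs, negative_pairs, balanced_pairs)
```

Under these assumptions there are 900 positive and 1875 negative pairs, and balancing them gives 3750, the figure shown in the log.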



In [None]:
# Evaluation
metrics = trainer.evaluate(test_dataset)
print(metrics)

***** Running evaluation *****


{'accuracy': 0.6927374301675978}


In [None]:
# The model is now fine-tuned; try it on a few example sentences
preds = model.predict(["I hate people", "You look so fucking stupid", "what weather we are having"])
print(preds)

['hate', 'hate', 'nothing']


In [None]:
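# Predict on the test data for the classification report
# (predictions must come from test_dataset, since they are compared with test_dataset['label'] below)
preds_setfit = model(test_dataset['text'])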



In [None]:
len(test_dataset['label'])



34

In [None]:
# Mapping for string labels to numeric labels
label_mapping = {'hate': 0, 'offense': 1, 'nothing': 2}

# Convert string predictions (preds_setfit) to numeric labels
numeric_preds = [label_mapping[pred] for pred in preds_setfit]

# True labels are already numeric, so we can use them directly
true_labels = test_dataset['label']

target_names = ['hate', 'offense', 'nothing']

# Generate the classification report
print(classification_report(true_labels, numeric_preds, target_names=target_names))

              precision    recall  f1-score   support

        hate       0.82      0.82      0.82        11
     offense       0.92      0.92      0.92        12
     nothing       0.91      0.91      0.91        11

    accuracy                           0.88        34
   macro avg       0.88      0.88      0.88        34
weighted avg       0.88      0.88      0.88        34



### SetFit with BERT

In this section, we load a pre-trained BERT model (`bert-base-uncased`) as the body of a SetFit model, together with its tokenizer, and fine-tune it for text classification. We prepare the datasets, set up the Trainer, and train the model. We then evaluate the fine-tuned model on the test set: the predicted string labels are converted to the numeric labels used in the original dataset and passed to the `classification_report` function. The trained model and tokenizer are saved to the local file system in the next section.

In [None]:
!pip install -q transformers



In [None]:
from transformers import AutoTokenizer, pipeline
#from sentence_transformers.losses import CosineSimilarityLoss

In [None]:
# Load a pre-trained BERT tokenizer
tokenizer_bert = AutoTokenizer.from_pretrained("bert-base-uncased")




In [None]:
# Load the BERT model through SetFit
model_bert = SetFitModel.from_pretrained("bert-base-uncased", labels=['hate', 'offense', 'nothing'])




model_head.pkl not found on HuggingFace Hub, initialising classification head with random weights. You should TRAIN this model on a downstream task to use it for predictions and inference.


In [None]:
# Prepare the datasets for fine-tuning the BERT model
def tokenize_function(examples):
    return tokenizer_bert(examples["text"], padding="max_length", truncation=True)

# Tokenize text data for BERT classification
tokenized_datasets = dataset.map(tokenize_function, batched=True)


In [None]:
tokenized_datasets

DatasetDict({
    train: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 134
    })
    test: Dataset({
        features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 34
    })
})

In [None]:
# Simulate the few-shot regime by sampling 25 examples per class
small_train_dataset = sample_dataset(tokenized_datasets["train"], label_column="label", num_samples=25)  # more samples may improve accuracy at the cost of longer training
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500))
small_test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500, len(tokenized_datasets["test"])))

In [None]:
small_train_dataset

Dataset({
    features: ['label', 'text', 'input_ids', 'token_type_ids', 'attention_mask'],
    num_rows: 75
})

In [None]:
# Set up training arguments - we keep SetFit's default training parameters and only set the output directory
training_args = TrainingArguments(output_dir="bert_trainer")

In [None]:
# Set up trainer
trainer_bert = Trainer(
    model=model_bert,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
    metric="accuracy",
    # column_mapping={"sentence": "text", "label": "label"}  # Map dataset columns to text/label expected by trainer
)


In [None]:
# train the model
trainer_bert.train()

***** Running training *****
  Num unique pairs = 3750
  Batch size = 16
  Num epochs = 1


Step,Training Loss
1,0.288
50,0.2346
100,0.0746
150,0.0031
200,0.0013



In [None]:
# Evaluation
metrics = trainer_bert.evaluate(small_test_dataset)
print(metrics)

***** Running evaluation *****


{'accuracy': 0.6764705882352942}


In [None]:
# Now this is the fine-tuned model
preds = model_bert.predict(["I hate people", "You look so fucking stupid", "what weather we are having"])
print(preds)

['hate', 'hate', 'nothing']


In [None]:
# Run our model on the test data
preds_bert = model_bert(small_test_dataset['text'])



In [None]:
# Checking the output
preds_bert

['hate',
 'hate',
 'nothing',
 'hate',
 'hate',
 'nothing',
 'hate',
 'hate',
 'hate',
 'hate',
 'nothing',
 'offense',
 'hate',
 'offense',
 'offense',
 'offense',
 'hate',
 'offense',
 'hate',
 'offense',
 'offense',
 'hate',
 'hate',
 'hate',
 'nothing',
 'offense',
 'nothing',
 'hate',
 'hate',
 'offense',
 'offense',
 'offense',
 'hate',
 'nothing']

In [None]:
# compare to original test data
small_test_dataset["label"]

[0,
 0,
 2,
 1,
 1,
 2,
 2,
 0,
 0,
 1,
 2,
 2,
 0,
 1,
 1,
 0,
 0,
 1,
 2,
 1,
 1,
 2,
 2,
 0,
 2,
 1,
 2,
 0,
 1,
 1,
 0,
 1,
 0,
 2]

In [None]:
# Mapping for string labels to numeric labels
label_mapping = {'hate': 0, 'offense': 1, 'nothing': 2}

# Convert string predictions (preds_bert) to numeric labels
numeric_preds = [label_mapping[pred] for pred in preds_bert]

# True labels are already numeric, so we can use them directly
true_labels = small_test_dataset['label']

target_names = ['hate', 'offense', 'nothing']

# Generate the classification report
print(classification_report(true_labels, numeric_preds, target_names=target_names))

              precision    recall  f1-score   support

        hate       0.53      0.82      0.64        11
     offense       0.73      0.67      0.70        12
     nothing       1.00      0.55      0.71        11

    accuracy                           0.68        34
   macro avg       0.75      0.68      0.68        34
weighted avg       0.75      0.68      0.68        34
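As a cross-check, the per-class figures in this report can be recomputed by hand from the `preds_bert` output and the true labels printed earlier. The sketch below copies those two lists verbatim and uses a small hypothetical helper, `per_class_scores`, instead of `classification_report`.

```python
# Predictions and true labels copied from the notebook outputs above
preds_bert = (
    "hate hate nothing hate hate nothing hate hate hate hate nothing offense "
    "hate offense offense offense hate offense hate offense offense hate hate hate "
    "nothing offense nothing hate hate offense offense offense hate nothing"
).split()
true_labels = [0, 0, 2, 1, 1, 2, 2, 0, 0, 1, 2, 2, 0, 1, 1, 0, 0,
               1, 2, 1, 1, 2, 2, 0, 2, 1, 2, 0, 1, 1, 0, 1, 0, 2]

label_mapping = {"hate": 0, "offense": 1, "nothing": 2}
numeric_preds = [label_mapping[p] for p in preds_bert]


def per_class_scores(y_true, y_pred, cls):
    """Return precision, recall, F1, and support for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    predicted = sum(p == cls for p in y_pred)  # rows predicted as cls
    support = sum(t == cls for t in y_true)    # rows whose true label is cls
    precision = tp / predicted if predicted else 0.0
    recall = tp / support if support else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1, support


scores = {
    name: per_class_scores(true_labels, numeric_preds, cls)
    for name, cls in label_mapping.items()
}
accuracy = sum(t == p for t, p in zip(true_labels, numeric_preds)) / len(true_labels)

for name, (precision, recall, f1, support) in scores.items():
    print(f"{name:>8}  {precision:.2f}  {recall:.2f}  {f1:.2f}  {support}")
print(f"accuracy  {accuracy:.2f}")
```

For `hate`, 9 of the 17 predictions are correct (precision 0.53) and 9 of the 11 true examples are found (recall 0.82). Overall, 23 of 34 predictions are correct, which is the 0.68 accuracy in the report.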



### Saving model and uploading to HuggingFace Hub

In [None]:
# Save the fine-tuned BERT model and tokenizer to the local file system
model_bert.save_pretrained('model_bert')
tokenizer_bert.save_pretrained('model_bert')

In [None]:
# Load the tokenizer and model from the local directory
# tokenizer = AutoTokenizer.from_pretrained('model_bert')
# model = SetFitModel.from_pretrained('model_bert')

In [None]:
# Reload the saved fine-tuned model and classify the test set.
# The saved directory is in SetFit format (sentence-transformer body plus classification head),
# so it is loaded with SetFitModel; the transformers text-classification pipeline
# would not pick up the trained head.
classifier = SetFitModel.from_pretrained("model_bert")
preds_classifier_bert = classifier.predict(small_test_dataset['text'])

In [None]:
preds_classifier_bert

#### Saving to the Hugging Face Hub
The saved model can now be pushed to the Hugging Face Hub or stored elsewhere.

In [None]:
# Install the Hugging Face Hub client
!pip install -q huggingface-hub

In [None]:
from huggingface_hub import notebook_login

# Log in through the notebook - a Hugging Face access token is required
notebook_login()

In [None]:
# Push the model to your Hugging Face Hub user space - replace "Username" with your Hugging Face Hub username
model_bert.push_to_hub("Username/bert_classification")

In [None]:
# Push tokenizer to Huggingface Hub user space
tokenizer_bert.push_to_hub("Username/bert_classification")

In [None]:
# Load the tokenizer and model from the Hub
#tokenizer = AutoTokenizer.from_pretrained("Username/bert_classification")
#model = SetFitModel.from_pretrained("Username/bert_classification")