<a href="https://www.kaggle.com/code/yeemeitsang/spam-text-classification-roberta?scriptVersionId=130286904" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

**Introduction**

Welcome to this walkthrough on building a spam text classifier with pretrained models from Hugging Face. In this notebook, we fine-tune a roberta-base model on SMS data provided by freeCodeCamp so that it can distinguish spam from non-spam ("ham") messages. The messages are tokenized with the RoBERTa tokenizer, and the model is trained on a GPU for faster processing.

The freeCodeCamp test suite for the [Neural Network SMS Text Classifier project](https://www.freecodecamp.org/learn/machine-learning-with-python/machine-learning-with-python-projects/neural-network-sms-text-classifier) is used as a reference for evaluating the model. You can also write your own messages and see how the model handles them.

Note that to successfully run this notebook, you will need access to wandb, which can be used to track your experiments and visualize your results.
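
If you do not have a wandb account, one possible workaround is to switch the integration off. The sketch below assumes the `report_to` parameter of `TrainingArguments`; the dictionary name `no_wandb_kwargs` is made up for illustration and is not part of the original notebook.

```python
# Hypothetical helper dict (not in the original notebook): extra keyword
# arguments asking the Hugging Face Trainer not to report to any logging
# integration, wandb included.
no_wandb_kwargs = {"report_to": "none"}

# Intended use later in the notebook (shown as a comment so this cell
# runs without transformers installed):
#   training_args = TrainingArguments(output_dir="./results", **no_wandb_kwargs)
print(no_wandb_kwargs)
```

With reporting disabled, the training loss and validation metrics are still printed in the notebook output.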

If you prefer building your own models from scratch, you may find [my GitHub repository](https://github.com/a-t-em/Keras-LSTM-spam-text-classification) on spam text classification useful as a reference. The repository contains an example of a TensorFlow LSTM model using character embeddings trained on the same dataset.

In [2]:
!pip install transformers


In [65]:
#import libraries
# (the TensorFlow/Keras imports were dropped: nothing in this notebook uses them)
from transformers import pipeline
from transformers import Trainer, TrainingArguments
from transformers import RobertaTokenizer, RobertaForSequenceClassification
import torch
import pandas as pd
import numpy as np
from torch.utils.data import Dataset
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

**Load data and prepare training and validation datasets**

In [66]:
# get data from freeCodeCamp
!wget https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
!wget https://cdn.freecodecamp.org/project-data/sms/valid-data.tsv

train_file_path = "train-data.tsv"
test_file_path = "valid-data.tsv"

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
--2023-05-20 10:20:21--  https://cdn.freecodecamp.org/project-data/sms/train-data.tsv
Resolving cdn.freecodecamp.org (cdn.freecodecamp.org)... 172.67.70.149, 104.26.2.33, 104.26.3.33, ...
Connecting to cdn.freecodecamp.org (cdn.freecodecamp.org)|172.67.70.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 358233 (350K) [text/tab-separated-values]
Saving to: ‘train-data.tsv.1’


2023-05-20 10:20:22 (21.9 MB/s) - ‘train-data.tsv.1’ saved [358233/358233]

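
The `huggingface/tokenizers` warning in the output above is harmless here, and the message itself suggests a way to silence it. A minimal sketch, assuming the cell is run before the tokenizer is first used:

```python
import os

# Tell the tokenizers library up front not to use parallelism, so that
# forking a subprocess (as `!wget` does) no longer triggers the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```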

In [67]:
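#load and view train data
# note: header=0 treats the first line of the file as a header row and replaces it with `names`;
# if the TSV has no header line, pass header=None instead so the first message is not dropped
df_train = pd.read_csv(train_file_path, sep='\t', header=0, names=['label', 'text'])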
df_train.head()

Unnamed: 0,label,text
0,ham,you can never do nothing
1,ham,"now u sound like manky scouse boy steve,like! ..."
2,ham,mum say we wan to go then go... then she can s...
3,ham,never y lei... i v lazy... got wat? dat day ü ...
4,ham,in xam hall boy asked girl tell me the startin...


In [68]:
#load and view validation data
# note: as with the train data, header=0 consumes the first line of the file as a header row
df_valid = pd.read_csv(test_file_path, sep='\t', header=0, names=['label', 'text'])
df_valid.head()

Unnamed: 0,label,text
0,ham,"not much, just some textin'. how bout you?"
1,ham,i probably won't eat at all today. i think i'm...
2,ham,don‘t give a flying monkeys wot they think and...
3,ham,who are you seeing?
4,ham,your opinion about me? 1. over 2. jada 3. kusr...


In [69]:
#prepare train data
train_text = df_train.text.values
# use .loc rather than chained indexing, which can silently assign to a copy
df_train.loc[df_train.label == 'spam', 'label'] = 1
df_train.loc[df_train.label == 'ham', 'label'] = 0
train_label = df_train.label.values
train_text, train_label

(array(['you can never do nothing',
        'now u sound like manky scouse boy steve,like! i is travelling on da bus home.wot has u inmind 4 recreation dis eve?',
        'mum say we wan to go then go... then she can shun bian watch da glass exhibition...',
        ...,
        'free entry into our £250 weekly competition just text the word win to 80086 now. 18 t&c www.txttowin.co.uk',
        '-pls stop bootydelious (32/f) is inviting you to be her friend. reply yes-434 or no-434 see her: www.sms.ac/u/bootydelious stop? send stop frnd to 62468',
        "tell my  bad character which u dnt lik in me. i'll try to change in  &lt;#&gt; . i ll add tat 2 my new year resolution. waiting for ur reply.be frank...good morning."],
       dtype=object),
 array([0, 0, 0, ..., 1, 1, 0], dtype=object))

In [70]:
#prepare validation data
val_text = df_valid.text.values
# use .loc rather than chained indexing, which can silently assign to a copy
df_valid.loc[df_valid.label == 'spam', 'label'] = 1
df_valid.loc[df_valid.label == 'ham', 'label'] = 0
val_label = df_valid.label.values
val_text, val_label

(array(["not much, just some textin'. how bout you?",
        "i probably won't eat at all today. i think i'm gonna pop. how was your weekend? did u miss me?",
        'don‘t give a flying monkeys wot they think and i certainly don‘t mind. any friend of mine and all that!',
        ...,
        "where are you ? what are you doing ? are yuou working on getting the pc to your mom's ? did you find a spot that it would work ? i need you",
        'ur cash-balance is currently 500 pounds - to maximize ur cash-in now send cash to 86688 only 150p/msg. cc: 08708800282 hg/suite342/2lands row/w1j6hl',
        'not heard from u4 a while. call 4 rude chat private line 01223585334 to cum. wan 2c pics of me gettin shagged then text pix to 8552. 2end send stop 8552 sam xxx'],
       dtype=object),
 array([0, 0, 0, ..., 0, 1, 1], dtype=object))

In [72]:
# choose a tokenizer
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# define custom dataset
class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, index):
        text = self.texts[index]
        label = self.labels[index]

        encoded_text = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=100,
            padding='max_length',
            truncation=True,
            return_token_type_ids=False,
            return_attention_mask=True,
            return_tensors='pt'
        )

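        # drop the batch dimension added by return_tensors='pt'
        input_ids = encoded_text['input_ids'].squeeze(0)
        attention_mask = encoded_text['attention_mask'].squeeze(0)
        # labels live in an object array, so cast to int explicitly
        label = torch.tensor(int(label), dtype=torch.long)

        # tensors are created on the CPU; the Trainer moves each batch to the GPU
        return {
            'input_ids': input_ids,
            'attention_mask': attention_mask,
            'labels': label
        }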

# create datasets
train_dataset = TextClassificationDataset(train_text, train_label, tokenizer)
eval_dataset = TextClassificationDataset(val_text, val_label, tokenizer)
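
To make the effect of `max_length`, `padding='max_length'` and `truncation=True` more concrete, here is a toy sketch in plain Python. It is not the RoBERTa tokenizer: the function `pad_or_truncate` and the token ids 100, 101, ... are invented for illustration. Only the special ids follow roberta-base (`<s>` = 0, `<pad>` = 1, `</s>` = 2).

```python
# Special token ids used by roberta-base.
BOS_ID, PAD_ID, EOS_ID = 0, 1, 2

def pad_or_truncate(token_ids, max_length):
    """Toy imitation of padding='max_length' with truncation=True."""
    # Truncate the content so that <s> and </s> still fit.
    content = list(token_ids[: max_length - 2])
    input_ids = [BOS_ID] + content + [EOS_ID]
    attention_mask = [1] * len(input_ids)
    # Pad on the right; padded positions get attention mask 0.
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token ids: one short message and one that exceeds max_length.
short = pad_or_truncate([100, 101, 102], max_length=8)
long = pad_or_truncate(range(100, 120), max_length=8)
print(short)
print(long)
```

Every item therefore has the same length, which lets the Trainer stack items into a batch without a custom collator. The attention mask tells the model which positions are padding.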

**Build and train model**

In [73]:
# create model
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifi

In [74]:
#define custom metrics for validation; metric_for_best_model='accuracy' below needs an 'accuracy' entry, otherwise the Trainer raises an error
def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='weighted')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'precision': precision,
        'recall': recall,
        'f1': f1
    }
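
As a rough illustration of what `average='weighted'` computes, the sketch below re-implements the support-weighted averages in plain Python on made-up labels. The function name `weighted_scores` and the toy data are assumptions for this example only. Support-weighted recall always equals accuracy, which is why those two columns coincide in the training log.

```python
def weighted_scores(labels, preds):
    """Support-weighted precision, recall and F1, plus plain accuracy."""
    n = len(labels)
    precision = recall = f1 = 0.0
    for c in sorted(set(labels)):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        support = sum(1 for y in labels if y == c)    # true members of class c
        predicted = sum(1 for p in preds if p == c)   # items predicted as c
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / n                          # class share of the data
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / n
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Made-up, imbalanced toy data (0 = ham, 1 = spam).
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]
toy_preds = [0, 0, 0, 0, 0, 1, 1, 0]
scores = weighted_scores(toy_labels, toy_preds)
print(scores)
```

Because ham messages dominate the data, weighted scores are driven mostly by the ham class. Passing `average=None` to `precision_recall_fscore_support` would report the spam class separately.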

In [75]:
#set training parameters
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=8,
    warmup_steps=10,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    fp16=True,  # enable mixed precision training
    evaluation_strategy='epoch',  # evaluate after each epoch
    save_strategy='epoch',  # save once per epoch
    learning_rate=5e-5,  # same as the TrainingArguments default
    load_best_model_at_end=True,  # load the best model at the end of training
    metric_for_best_model='accuracy',
    greater_is_better=True
)

# create the Trainer and fine-tune the model (the Trainer uses the GPU when one is available)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics  
)

trainer.train()



Epoch,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
1,0.0532,0.053014,0.991373,0.991398,0.991373,0.991272
2,0.0421,0.088807,0.981308,0.982653,0.981308,0.981667
3,0.047,0.033458,0.994249,0.994232,0.994249,0.994222


TrainOutput(global_step=786, training_loss=0.06010622421217936, metrics={'train_runtime': 197.8736, 'train_samples_per_second': 63.343, 'train_steps_per_second': 3.972, 'total_flos': 644108196852000.0, 'train_loss': 0.06010622421217936, 'epoch': 3.0})
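
The numbers in the `TrainOutput` above can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and the Trainer's default linear learning-rate schedule. The number of training samples is inferred from the logged throughput rather than read from the data, and `linear_schedule_lr` is a hypothetical helper written for this example.

```python
import math

# Figures copied from the TrainingArguments and TrainOutput above.
train_runtime = 197.8736        # seconds
samples_per_second = 63.343
num_epochs = 3
batch_size = 16
warmup_steps = 10
base_lr = 5e-5

# Samples processed in total, divided by the number of epochs.
samples_per_epoch = round(train_runtime * samples_per_second / num_epochs)
# The last, partial batch of an epoch still counts as one step.
steps_per_epoch = math.ceil(samples_per_epoch / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_schedule_lr(step):
    """Linear warmup to base_lr, then linear decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = (total_steps - step) / (total_steps - warmup_steps)
    return base_lr * max(0.0, remaining)

print(samples_per_epoch, steps_per_epoch, total_steps)
for step in (0, 5, 10, total_steps // 2, total_steps):
    print(step, linear_schedule_lr(step))
```

The step count reproduces `global_step=786`. Under these assumptions the learning rate peaks after only 10 steps and then decays for the rest of training.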

In [56]:
#optional: save model 
#model.save_pretrained('./saved_model')

In [None]:
#model = RobertaForSequenceClassification.from_pretrained('./saved_model')

**Make predictions**

In [76]:
#helper function for testing
def predict_message(pred_text):
    # encode the message
    encoded_msg = tokenizer.encode_plus(
        pred_text,
        add_special_tokens=True,
        max_length=100,
        padding='max_length',
        truncation=True,
        return_token_type_ids=False,
        return_attention_mask=True,
        return_tensors='pt'
    )

    # move the input tensors onto the same device as the model
    device = model.device
    encoded_msg = {k: v.to(device) for k, v in encoded_msg.items()}

    # make the prediction with dropout disabled and no gradient tracking
    model.eval()
    with torch.no_grad():
        prediction = model(**encoded_msg)
        label = prediction.logits.argmax(dim=-1).item()
    # class 1 is spam, class 0 is ham (same encoding as the training labels)
    output = [label, 'spam' if label == 1 else 'ham']
    return output
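
`predict_message` returns only the winning class. If a confidence value is also wanted, the two logits can be passed through a softmax. The sketch below does this in plain Python on made-up logits; the values in `example_logits` are invented and are not output from the model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits in the order [ham, spam].
example_logits = [-2.0, 3.0]
probs = softmax(example_logits)
label = max(range(len(probs)), key=probs.__getitem__)   # same result as argmax
print(label, probs[label])
```

On the real model output, `torch.softmax(prediction.logits, dim=-1)` would play the same role.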

In [77]:
#view one example
predict_message('our new mobile video service is live. just install on your phone to start watching.')

[1, 'spam']

In [78]:
#the freeCodeCamp test suite
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
  passed = True

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    print(prediction)
    if prediction[1] != ans:
      passed = False

  if passed:
    print("You passed the challenge. Great job!")
  else:
    print("You haven't passed yet. Keep trying.")

test_predictions()

[0, 'ham']
[1, 'spam']
[0, 'ham']
[1, 'spam']
[1, 'spam']
[0, 'ham']
[0, 'ham']
You passed the challenge. Great job!


**Bonus: compare results to model without additional training**

In [79]:
#choose a pretrained model from Hugging Face: RoBERTa fine-tuned on SST-2 (sentiment analysis), used here without any spam-specific training
classifier = pipeline("text-classification", model="textattack/roberta-base-SST-2")

Some weights of the model checkpoint at textattack/roberta-base-SST-2 were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [80]:
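#helper function for testing (this redefines predict_message and replaces the fine-tuned helper above)
# the SST-2 model predicts sentiment, so treating LABEL_1 as 'ham' and the other label as 'spam' is only a rough proxy
# the first element of the output is the pipeline's score for the predicted label, not a class index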
def predict_message(pred_text):
  results = classifier(pred_text)
  probability = results[0]['score']
  if results[0]['label'] == 'LABEL_1':
    output = [probability, 'ham']
  else:
    output = [probability, 'spam']
  return output

In [81]:
#test message
predict_message('Congrats! You have won a trip to Italy! Click here to claim your prize.')

[0.9688261151313782, 'ham']

In [82]:
#the freeCodeCamp test suite
def test_predictions():
  test_messages = ["how are you doing today",
                   "sale today! to stop texts call 98912460324",
                   "i dont want to go. can we try it a different day? available sat",
                   "our new mobile video service is live. just install on your phone to start watching.",
                   "you have won £1000 cash! call to claim your prize.",
                   "i'll bring it tomorrow. don't forget the milk.",
                   "wow, is your arm alright. that happened to me one time too"
                  ]

  test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
  passed = True

  for msg, ans in zip(test_messages, test_answers):
    prediction = predict_message(msg)
    print(prediction)
    if prediction[1] != ans:
      passed = False

  if passed:
    print("You passed the challenge. Great job!")
  else:
    print("You haven't passed yet. Keep trying.")

test_predictions()

[0.9967869520187378, 'ham']
[0.8398308753967285, 'spam']
[0.9915432929992676, 'spam']
[0.9446273446083069, 'ham']
[0.9607493281364441, 'ham']
[0.8403170704841614, 'ham']
[0.9938607215881348, 'ham']
You haven't passed yet. Keep trying.
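
To put the two runs side by side, the sketch below counts correct answers on the seven test messages. The prediction lists are transcribed from the outputs printed above; `count_correct` is a hypothetical helper for this comparison.

```python
test_answers = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]

# Predicted labels transcribed from the two test_predictions() outputs above.
fine_tuned = ["ham", "spam", "ham", "spam", "spam", "ham", "ham"]
sst2_baseline = ["ham", "spam", "spam", "ham", "ham", "ham", "ham"]

def count_correct(preds, answers):
    """Number of positions where the prediction matches the expected label."""
    return sum(1 for p, a in zip(preds, answers) if p == a)

n = len(test_answers)
print(f"fine-tuned roberta-base: {count_correct(fine_tuned, test_answers)}/{n}")
print(f"SST-2 model, no spam training: {count_correct(sst2_baseline, test_answers)}/{n}")
```

Seven messages are far too few for a reliable estimate, but the gap suggests that a sentiment model is a poor stand-in for a classifier trained on spam data.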
