# SMS Spam Classification with Pretrained Language Models

## Data Collection

<b>Summary :- </b>The SMS Spam Collection is a public set of labeled SMS messages collected for mobile phone spam research. The dataset contains 5,574 English messages, each tagged as either spam ("1") or non-spam ("0").

<b>Classification task :- </b>The text classification task is to take an SMS message as input and determine whether it is spam ("1") or not ("0"). Several factors make the task non-trivial. Spam messages cannot be identified just by looking for fixed words like 'good', 'bad', or 'spam'. It is not only the words but also their combination and the context in which they are used that decide whether a message is spam. For example, the following message is spam: <br><br>
"Sunshine Quiz Wkly Q! Win a top Sony DVD player if u know which country the Algarve is in? Txt ansr to 82277. £1.50 SP:Tyrone"<br><br>
Such messages cannot be identified by looking for a predefined set of words. They require more complex models that can learn patterns in messages, which is what makes spam detection a non-trivial task.
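
To make the point concrete, here is a minimal sketch of a fixed-word filter applied to the example message. The function `keyword_filter` and the set `SPAM_KEYWORDS` are hypothetical names introduced only for illustration; the word list simply reuses the fixed words mentioned above.

```python
import re

# Hypothetical fixed-word list, taken from the words named in the text above.
SPAM_KEYWORDS = {"good", "bad", "spam"}

def keyword_filter(message: str) -> bool:
    """Return True if any fixed keyword occurs as a whole word in the message."""
    words = set(re.findall(r"[a-z0-9']+", message.lower()))
    return not SPAM_KEYWORDS.isdisjoint(words)

example = ("Sunshine Quiz Wkly Q! Win a top Sony DVD player if u know which "
           "country the Algarve is in? Txt ansr to 82277. £1.50 SP:Tyrone")

# The spam example contains none of the fixed words, so the filter misses it.
print(keyword_filter(example))         # False
print(keyword_filter("This is spam"))  # True
```

Extending the word list would catch this particular message, but each new spam template would need another manual rule, which is the motivation for a learned classifier.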

<b>Statistics :-</b><br>
Labeled examples = 5574<br>
Examples labeled spam = 747<br>
Examples labeled non-spam = 4827<br>
Unique words = 17929<br>
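
The counts above imply a strongly imbalanced dataset. The short calculation below, using only the numbers listed in the statistics, shows the spam share and the accuracy that a trivial classifier would reach by always predicting non-spam. It is a sanity-check sketch, not part of the original pipeline.

```python
n_spam = 747
n_non_spam = 4827
n_total = n_spam + n_non_spam  # should equal the 5574 labeled examples

spam_share = n_spam / n_total
# Accuracy of a classifier that always predicts the majority class (non-spam).
majority_baseline = max(n_spam, n_non_spam) / n_total

print(f"Total examples: {n_total}")
print(f"Spam share: {spam_share:.1%}")                             # 13.4%
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")  # 86.6%
```

Any model trained on this data should be compared against that baseline, since an accuracy near it can be reached without detecting any spam at all.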

## Text classification

In this part, a pretrained language model (BERT) is fine-tuned on the dataset. Because the model is large, the first step is to switch to a GPU runtime.

### Adding a hardware accelerator

Go to the menu and add a GPU as follows:

`Edit > Notebook Settings > Hardware accelerator > (GPU)`

Run the following cell to confirm that the GPU is detected.

In [None]:
import torch

# Confirm that the GPU is detected

assert torch.cuda.is_available()
torch.manual_seed(0)
# Get the GPU device name.
device_name = torch.cuda.get_device_name()
n_gpu = torch.cuda.device_count()
print(f"Found device: {device_name}, n_gpu: {n_gpu}")
device = torch.device("cuda")

Found device: Tesla T4, n_gpu: 1


## Installing Hugging Face's Transformers library
This project uses Hugging Face's Transformers (https://github.com/huggingface/transformers), an open-source library that provides general-purpose architectures for natural language understanding and generation with a collection of various pretrained models made by the NLP community. This library will allow us to easily use pretrained models like `BERT` and perform experiments on top of them. These models can be used to solve downstream target tasks, such as text classification, question answering, and sequence labeling.

Run the following cell to install Hugging Face's Transformers library.

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.30.2-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers)
  Downloading huggingface_hub-0.16.2-py3-none-any.whl (268 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

# Data Prep and Model Specification

The CSV file of the dataset, titled **SMSSpamCollection.csv**, has one column "text" containing the messages and another column "label" containing integer labels.

In [None]:
from helpers import tokenize_and_format, flat_accuracy
import pandas as pd

df = pd.read_csv('SMSSpamCollection.csv')

df = df.sample(frac=1, random_state=42).reset_index(drop=True)

texts = df.text.values
labels = df.label.values

### tokenize_and_format() is a helper function provided in helpers.py ###
input_ids, attention_masks = tokenize_and_format(texts)

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(labels)

# Print sentence 0, now as a list of IDs.
print('Original: ', texts[0])
print('Token IDs:', input_ids[0])

Original:  You still coming tonight?
Token IDs: tensor([ 101, 2017, 2145, 2746, 3892, 1029,  102,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0])
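
The printed tensor has 64 entries: the real tokens come first (101 and 102 are the `[CLS]` and `[SEP]` IDs in the `bert-base-uncased` vocabulary) and the rest is padding with ID 0. The sketch below shows how such padding and the matching attention mask can be built. It is an illustration in plain Python, not the actual code of `tokenize_and_format`, which lives in `helpers.py` and is not shown here; `pad_with_mask` and `MAX_LENGTH` are hypothetical names, and the length of 64 is inferred from the output above.

```python
MAX_LENGTH = 64  # assumption: inferred from the 64 IDs printed above
PAD_ID = 0       # [PAD] ID in the bert-base-uncased vocabulary

def pad_with_mask(token_ids, max_length=MAX_LENGTH, pad_id=PAD_ID):
    """Right-pad token_ids to max_length and build the attention mask.

    Truncation here is naive slicing; a real tokenizer would keep [SEP].
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)            # 1 marks a real token
    n_pad = max_length - len(ids)
    return ids + [pad_id] * n_pad, mask + [0] * n_pad  # 0 marks padding

# Token IDs of "You still coming tonight?" as printed above.
ids = [101, 2017, 2145, 2746, 3892, 1029, 102]
padded, mask = pad_with_mask(ids)

print(len(padded), sum(mask))  # 64 7
```

The attention mask lets the model ignore the padded positions, so messages of different lengths can share one fixed-size tensor.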


## Create train/test/validation splits

Here the dataset is split into 3 parts: a training set, a validation set, and a testing set. Each item in the dataset is a 3-tuple containing an input_id tensor, an attention_mask tensor, and a label tensor.

In [None]:
total = len(df)

num_train = int(total * .8)
num_val = int(total * .1)
num_test = total - num_train - num_val

# make lists of 3-tuples (already shuffled the dataframe in cell above)

train_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_train)]
val_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_train, num_val+num_train)]
test_set = [(input_ids[i], attention_masks[i], labels[i]) for i in range(num_val + num_train, total)]

train_text = [texts[i] for i in range(num_train)]
val_text = [texts[i] for i in range(num_train, num_val+num_train)]
test_text = [texts[i] for i in range(num_val + num_train, total)]
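
As a quick check of the split arithmetic, the sketch below repeats the size computation with the row count hard-coded. The value 5574 is an assumption that the CSV contains every labeled message from the statistics section; in the notebook the count comes from `len(df)`.

```python
total = 5574  # assumption: the CSV holds all 5,574 labeled messages

num_train = int(total * .8)
num_val = int(total * .1)
# The test set takes whatever remains, so no example is dropped by rounding.
num_test = total - num_train - num_val

print(num_train, num_val, num_test)  # 4459 557 558

# The three index ranges are contiguous and cover the whole dataset.
train_range = range(0, num_train)
val_range = range(num_train, num_train + num_val)
test_range = range(num_train + num_val, total)
print(len(train_range) + len(val_range) + len(test_range) == total)  # True
```

Because the dataframe was shuffled before slicing, each contiguous range acts as a random sample, although the class ratio can still differ slightly between splits.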


Here we choose the model we want to fine-tune from https://huggingface.co/transformers/pretrained_models.html. Because the task requires assigning one label to each message, we will be using BertForSequenceClassification below.

In [None]:
from transformers import BertForSequenceClassification, AdamW, BertConfig

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 2, # The number of output labels.
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

# Tell pytorch to run this model on the GPU.
model.cuda()


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

The cell below defines the approach used to tune the hyperparameters. It is an experiment with different configurations to find the one that works best (i.e., gives the highest accuracy) on the validation set.

In [None]:
# Hyperparameters were tuned using grid search over batch_size, learning_rate, epsilon and epochs
# Set of hyperparameters considered for grid search :-
#   Batch size :- {32, 64, 128}
#   Learning rate :- {1e-3, 5e-3, 1e-2, 5e-2}
#   Epsilon :- {1e-7, 1e-8, 1e-9}
#   Epochs :- {3, 5, 10}

# batch_size = 32
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-7 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 3

# batch_size = 32
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-8 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 5

# batch_size = 32
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-9 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 5

# batch_size = 32
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-7 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 10

# batch_size = 32
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-8 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 10

# batch_size = 32
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-9 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 3

# batch_size = 32
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-7 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 5

# batch_size = 32
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-8 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 10

# batch_size = 64
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-9 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 10

# batch_size = 64
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-8 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 5

# batch_size = 64
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-7 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 3

# batch_size = 64
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-8 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 5

# batch_size = 64
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-9 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 5

# batch_size = 64
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-8 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 10

# batch_size = 64
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-9 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 10

# batch_size = 128
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-7 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 3

# batch_size = 128
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-8 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 5

# batch_size = 128
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-3, # args.learning_rate - default is 5e-5
#                   eps = 1e-9 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 5

# batch_size = 128
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-7 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 3

# batch_size = 128
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-8 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 10

# batch_size = 128
# optimizer = AdamW(model.parameters(),
#                   lr = 1e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-9 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 10

# batch_size = 128
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-7 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 3

# batch_size = 128
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-8 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 5

# batch_size = 128
# optimizer = AdamW(model.parameters(),
#                   lr = 5e-2, # args.learning_rate - default is 5e-5
#                   eps = 1e-9 # args.adam_epsilon  - default is 1e-8
#                 )
# epochs = 5

batch_size = 64
optimizer = AdamW(model.parameters(),
                  lr = 1e-2, # args.learning_rate - default is 5e-5
                  eps = 1e-9 # args.adam_epsilon  - default is 1e-8
                )
epochs = 3



# Model fine-tuning
The following code performs fine-tuning of the model, monitors the loss, and checks the validation accuracy.

In [None]:
import numpy as np
# function to get validation accuracy
def get_validation_performance(val_set):
    # Put the model in evaluation mode
    model.eval()

    # Tracking variables
    total_eval_accuracy = 0
    total_eval_loss = 0

    num_batches = int(len(val_set)/batch_size) + 1

    total_correct = 0

    for i in range(num_batches):

      end_index = min(batch_size * (i+1), len(val_set))

      batch = val_set[i*batch_size:end_index]

      if len(batch) == 0: continue

      input_id_tensors = torch.stack([data[0] for data in batch])
      input_mask_tensors = torch.stack([data[1] for data in batch])
      label_tensors = torch.stack([data[2] for data in batch])

      # Move tensors to the GPU
      b_input_ids = input_id_tensors.to(device)
      b_input_mask = input_mask_tensors.to(device)
      b_labels = label_tensors.to(device)

      # Tell pytorch not to bother with constructing the compute graph during
      # the forward pass, since this is only needed for backprop (training).
      with torch.no_grad():

        # Forward pass, calculate logit predictions.
        outputs = model(b_input_ids,
                                token_type_ids=None,
                                attention_mask=b_input_mask,
                                labels=b_labels)
        loss = outputs.loss
        logits = outputs.logits

        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the number of correctly labeled examples in batch
        pred_flat = np.argmax(logits, axis=1).flatten()
        labels_flat = label_ids.flatten()
        num_correct = np.sum(pred_flat == labels_flat)
        total_correct += num_correct

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_correct / len(val_set)
    return avg_val_accuracy

In [None]:
import random

# training loop

# For each epoch...
for epoch_i in range(0, epochs):
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode.
    model.train()

    # For each batch of training data...
    num_batches = int(len(train_set)/batch_size) + 1

    for i in range(num_batches):
        end_index = min(batch_size * (i+1), len(train_set))

        batch = train_set[i*batch_size:end_index]

        if len(batch) == 0: continue

        input_id_tensors = torch.stack([data[0] for data in batch])
        input_mask_tensors = torch.stack([data[1] for data in batch])
        label_tensors = torch.stack([data[2] for data in batch])

        # Move tensors to the GPU
        b_input_ids = input_id_tensors.to(device)
        b_input_mask = input_mask_tensors.to(device)
        b_labels = label_tensors.to(device)

        # Clear the previously calculated gradient
        model.zero_grad()

        # Perform a forward pass (evaluate the model on this training batch).
        outputs = model(b_input_ids,
                                token_type_ids=None,
                                attention_mask=b_input_mask,
                                labels=b_labels)
        loss = outputs.loss
        logits = outputs.logits

        total_train_loss += loss.item()

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Update parameters and take a step using the computed gradient.
        optimizer.step()

    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure performance on validation set.
    print(f"Total loss: {total_train_loss}")
    val_acc = get_validation_performance(val_set)
    print(f"Validation accuracy: {val_acc}")

print("")
print("Training complete!")



Training...
Total loss: 114.32864111661911
Validation accuracy: 0.8725314183123878

Training...
Total loss: 33.65106010437012
Validation accuracy: 0.8725314183123878

Training...
Total loss: 36.12894009053707
Validation accuracy: 0.8725314183123878

Training complete!


# Evaluate the model on the test set
After finding the hyperparameters that achieve the highest validation accuracy, it's time to evaluate the model on the test set! The cell below computes the test set accuracy.

In [None]:
get_validation_performance(test_set)

0.8494623655913979

<b>Hyperparameter selection process :- </b>Grid search was used for hyperparameter tuning. In this method, a grid of all possible combinations of hyperparameter values is constructed. The pretrained BERT model is then fine-tuned on the training set and evaluated on the validation set for every combination. The combination that produces the best-performing model on the validation set is selected as the final set of hyperparameters.

Range of hyperparameters considered for grid search is as follows :-

1. Batch size :- {32, 64, 128}
2. Learning rate :- {1e-3, 5e-3, 1e-2, 5e-2}
3. Epsilon :- {1e-7, 1e-8, 1e-9}
4. Epochs :- {3, 5, 10}
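
The grid over these ranges can be enumerated as sketched below. This is an illustrative outline rather than the code used in the notebook: `select_best` is a hypothetical helper, and its `evaluate` argument stands for a function that would fine-tune a fresh model with the given configuration and return its validation accuracy.

```python
from itertools import product

# Value ranges copied from the list above.
grid = {
    "batch_size": [32, 64, 128],
    "learning_rate": [1e-3, 5e-3, 1e-2, 5e-2],
    "epsilon": [1e-7, 1e-8, 1e-9],
    "epochs": [3, 5, 10],
}

# One dict per combination: the Cartesian product of all value lists.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 108

def select_best(configs, evaluate):
    """Return the configuration with the highest score.

    `evaluate` is a hypothetical callable mapping a config dict to a
    validation accuracy; ties resolve to the first configuration seen.
    """
    return max(configs, key=evaluate)
```

With 3 x 4 x 3 x 3 = 108 combinations, a full grid means 108 separate fine-tuning runs, each of which should start from the original pretrained weights so that runs do not influence each other.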

<b>Why the chosen hyperparameters are better :-</b>

Hyperparameters can affect model performance in several ways :-

1. A high learning rate can cause training to diverge, while a very small learning rate may fail to reach the optimum within the given number of epochs.

2. Training for too many epochs may make the model overfit the training data.

3. A smaller batch size tends to generalize better, but if it is too small the gradients become noisy and training can be unstable.

In summary, the chosen hyperparameters work better than the others because, based on the points above, their values are neither too high nor too low for obtaining the best accuracy on the validation set.

<b>Discrepancy between test and val accuracy :- </b>There is a gap of about 2.3 percentage points between validation accuracy (87.3%) and test accuracy (84.9%). This can happen when the validation set and the test set have different distributions. It can also happen when the hyperparameters are selected on the validation set, so that the model effectively overfits it.
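
One way to judge how large this gap is relative to sampling noise is a rough binomial standard-error estimate, sketched below. The split sizes 557 and 558 are assumptions derived from a total of 5,574 rows, and treating both accuracies as independent binomial proportions is a simplification.

```python
import math

# Accuracies reported above; split sizes are assumed from 5574 rows.
val_acc, n_val = 0.8725314183123878, 557
test_acc, n_test = 0.8494623655913979, 558

def standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1 - acc) / n)

gap = val_acc - test_acc
# Standard error of the difference between two independent proportions.
se_gap = math.sqrt(standard_error(val_acc, n_val) ** 2
                   + standard_error(test_acc, n_test) ** 2)

print(f"gap = {gap:.4f}")
print(f"standard error of the gap = {se_gap:.4f}")
print(f"gap in standard errors = {gap / se_gap:.2f}")
```

Under these assumptions the gap is on the order of one standard error, so part of it may simply reflect the small size of the two evaluation sets.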

The next step is to perform an *error analysis* on the model. The code below prints out up to five test set examples that the model gets **wrong**. The text cell that follows gives a qualitative analysis of these examples.

In [None]:
## print out up to 5 test set examples that the model gets wrong

# Put the model in evaluation mode
model.eval()

input_id_tensors = torch.stack([data[0] for data in test_set])
input_mask_tensors = torch.stack([data[1] for data in test_set])
label_tensors = torch.stack([data[2] for data in test_set])

# Move tensors to the GPU
b_input_ids = input_id_tensors.to(device)
b_input_mask = input_mask_tensors.to(device)
b_labels = label_tensors.to(device)

# Tell pytorch not to bother with constructing the compute graph during
# the forward pass, since this is only needed for backprop (training).
with torch.no_grad():

    # Forward pass, calculate logit predictions.
    outputs = model(b_input_ids,
                            token_type_ids=None,
                            attention_mask=b_input_mask,
                            labels=b_labels)

    logits = outputs.logits

    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()

    # Calculate the number of correctly labeled examples in batch
    pred_flat = np.argmax(logits, axis=1).flatten()
    labels_flat = label_ids.flatten()

    errIndices = []
    for i in range(len(pred_flat)):
        if(pred_flat[i] != labels_flat[i]):
            errIndices.append(i)


    # Sample at most 5 errors; random.sample raises ValueError if fewer exist.
    for i in random.sample(errIndices, min(5, len(errIndices))):
        print('Message :- ', texts[num_val + num_train + i])
        print('Prediction :- ', pred_flat[i])
        print('Label :- ', labels_flat[i])
        print()

Message :-  You have 1 new voicemail. Please call 08719181513.
Prediction :-  0
Label :-  1

Message :-  SMS. ac JSco: Energy is high but u may not know where 2channel it. 2day ur leadership skills r strong. Psychic? Reply ANS w/question. End? Reply END JSCO
Prediction :-  0
Label :-  1

Message :-  Cashbin.co.uk (Get lots of cash this weekend!) www.cashbin.co.uk Dear Welcome to the weekend We have got our biggest and best EVER cash give away!! These..
Prediction :-  0
Label :-  1

Message :-  Urgent Please call 09066612661 from landline. £5000 cash or a luxury 4* Canary Islands Holiday await collection. T&Cs SAE award. 20M12AQ. 150ppm. 16+ “
Prediction :-  0
Label :-  1

Message :-  Dear Voucher Holder 2 claim this weeks offer at your PC go to http://www.e-tlp.co.uk/expressoffer Ts&Cs apply.2 stop texts txt STOP to 80062.
Prediction :-  0
Label :-  1



All five examples have label 1, meaning they are spam, but the model classifies them as non-spam. This is probably because the dataset is imbalanced: spam accounts for only 747 of the 5,574 messages. The validation accuracy is also identical after every epoch, which is consistent with the model predicting non-spam for almost every message. Class imbalance is therefore a probable cause of the low accuracy.

A possible future step to improve the classifier would be to use a weighted loss function during training. This involves assigning a higher weight to spam messages, which can prevent the model from being biased towards non-spam messages. This way, the penalty of misclassifying a spam message will be higher than misclassifying a non-spam message and the model can efficiently learn to classify spam messages with higher accuracy.
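
A minimal sketch of the weighting idea follows, in plain Python so that the arithmetic is visible. The inverse-frequency formula used for the weights is one common heuristic and is an assumption here, not something the notebook prescribes; `weighted_cross_entropy` is a hypothetical helper. In PyTorch the same weights could be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`, with the loss presumably computed from `outputs.logits` instead of using `outputs.loss`.

```python
import math

# Class counts from the statistics section: 0 = non-spam, 1 = spam.
counts = {0: 4827, 1: 747}
n_total = sum(counts.values())

# Assumed heuristic: weight_c = N / (num_classes * n_c), so the rare class
# receives the larger weight.
weights = {c: n_total / (len(counts) * n) for c, n in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class), normalised by the sum of weights."""
    numerator = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    denominator = sum(weights[y] for y in labels)
    return numerator / denominator

# Two messages, both predicted as 90% non-spam; the second is actually spam.
probs = [(0.9, 0.1), (0.9, 0.1)]
labels = [0, 1]

unweighted = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(probs, labels, weights)
print(f"weights: {weights}")
print(f"unweighted loss: {unweighted:.3f}, weighted loss: {weighted:.3f}")
```

The weighted loss is larger in this example because the missed spam message dominates the average, which is the pressure that would push the model away from always predicting non-spam.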

Another way to address this issue is to use ensemble methods to combine predictions of multiple models trained on different subsets of the data. This can improve the model's performance for spam messages by reducing the impact of noise and biases in the individual models.
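
The simplest form of such an ensemble is majority voting over the labels predicted by each model, sketched below. The three prediction lists are made-up placeholders for illustration, not outputs of models trained in this notebook, and `majority_vote` is a hypothetical helper.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine per-model label lists into one label per message by majority vote."""
    combined = []
    # zip(*...) groups the votes that all models cast for the same message.
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Placeholder predictions from three hypothetical models on four messages.
model_a = [0, 1, 0, 1]
model_b = [0, 0, 0, 1]
model_c = [1, 1, 0, 0]

print(majority_vote([model_a, model_b, model_c]))  # [0, 1, 0, 1]
```

Using an odd number of models avoids ties for a binary task; averaging the predicted probabilities would be an alternative when the models expose them.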