Subhan Ahmad

# Spam Classification # 

The purpose of this project is to fine-tune a pretrained neural network with PyTorch to classify whether a given text message is spam. I use the existing "sms_spam" dataset from [Hugging Face](https://huggingface.co/datasets/sms_spam#discussion-of-biases) to train the model.

Resources obtained from:

[Professor Andrew Leahy's Github](https://github.com/aleahy-work/CS-STAT323-W24)

[machine-learning-book](https://github.com/rasbt/machine-learning-book)

# Loading Data # 

I import the load_dataset function to load the "sms_spam" data from Hugging Face.

In [50]:
from datasets import load_dataset

In [51]:
dataset = load_dataset("sms_spam")

In [52]:
dataset

DatasetDict({
    train: Dataset({
        features: ['sms', 'label'],
        num_rows: 5574
    })
})

The dataset has two features: sms, which holds the message text, and label, which holds the corresponding class. A label of 0 means the text is not spam, and a label of 1 means it is spam.
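
Spam is often much rarer than legitimate messages, so it can help to check the class balance before training. The following is a minimal sketch, assuming the labels are available as a plain list of 0/1 integers (for example `dataset['train']['label']`). The helper name `label_distribution`, the "ham" name for non-spam, and the toy labels are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

# Label convention described above: 0 = not spam ("ham"), 1 = spam.
LABEL_NAMES = {0: "ham", 1: "spam"}

def label_distribution(labels):
    """Return the fraction of examples for each class name."""
    counts = Counter(labels)
    total = len(labels)
    return {LABEL_NAMES[label]: counts[label] / total for label in sorted(counts)}

# Toy labels for illustration only; with the real data one would pass
# dataset['train']['label'] instead.
toy_labels = [0, 0, 0, 1]
print(label_distribution(toy_labels))  # {'ham': 0.75, 'spam': 0.25}
```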

# Splitting the data into train and validation sets #

I use the train_test_split function to hold out 10% of the data as a validation set. No seed is passed, so the exact split differs from run to run.

In [53]:
train_valid = dataset['train'].train_test_split(test_size=0.1) 

In [54]:
train_valid

DatasetDict({
    train: Dataset({
        features: ['sms', 'label'],
        num_rows: 5016
    })
    test: Dataset({
        features: ['sms', 'label'],
        num_rows: 558
    })
})

In [55]:
train_valid.column_names

{'train': ['sms', 'label'], 'test': ['sms', 'label']}

In [56]:
mytrain = train_valid['train']

In [57]:
myvalid = train_valid['test']
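
The split sizes above (5016 and 558) and the effect of a seed can be illustrated without the datasets library. This is a sketch of the idea only, not the library's implementation: it assumes the test portion is rounded up, which is consistent with the 558 rows reported above, and `split_indices` is a hypothetical helper. In the notebook itself, passing `seed=` to `train_test_split` would make the split repeatable.

```python
import math
import random

def split_indices(n_rows, test_size, seed):
    """Shuffle row indices and split them into train and test parts."""
    n_test = math.ceil(test_size * n_rows)   # assumption: test part is rounded up
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # fixed seed -> same split every run
    return indices[n_test:], indices[:n_test]

train_idx, valid_idx = split_indices(5574, test_size=0.1, seed=42)
print(len(train_idx), len(valid_idx))  # 5016 558
```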

# Tokenizing #

DistilBertTokenizerFast is the fast tokenizer for DistilBERT, optimized for use during training and inference. Loading the pretrained tokenizer ensures that the text is split into the same vocabulary the model was pretrained on, so it can be fed into the model.

In [58]:
from transformers import DistilBertTokenizerFast
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

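The tokenize_function takes the text from the "sms" column and converts it into token IDs that the model can process. With truncation=True and padding='max_length', every message is cut or padded to the tokenizer's maximum length (512 tokens for this checkpoint).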

In [59]:
def tokenize_function(examples):
    return tokenizer(examples["sms"], truncation=True, padding='max_length')
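
To make the effect of truncation and padding concrete, here is a simplified sketch in plain Python. It is not the tokenizer's actual implementation: `pad_and_truncate` is a hypothetical helper, the token IDs are made up, and the real tokenizer also keeps the closing special token when it truncates. A pad ID of 0 is assumed.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut a token-ID list to max_length, or right-pad it with pad_id."""
    ids = token_ids[:max_length]
    attention_mask = [1] * len(ids)          # 1 = real token
    n_pad = max_length - len(ids)
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,  # 0 = padding
    }

# Hypothetical token IDs, not real tokenizer output.
short = pad_and_truncate([101, 2023, 102], max_length=5)
long = pad_and_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short)  # {'input_ids': [101, 2023, 102, 0, 0], 'attention_mask': [1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [101, 1, 2, 3, 4], 'attention_mask': [1, 1, 1, 1, 1]}
```

Because SMS messages are short, padding every one to the maximum length means most positions are padding. Padding each batch only to its longest message (for example with the DataCollatorWithPadding class from transformers) is a common alternative that was not used here.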

The map function applies tokenize_function to the training and validation sets in batches.

In [60]:
train_tokened = mytrain.map(tokenize_function, batched=True)

In [61]:
valid_tokened = myvalid.map(tokenize_function, batched=True)


# Fine Tuning #

The DistilBertForSequenceClassification class is part of the Hugging Face transformers library and is designed for sequence classification tasks using DistilBERT. DistilBERT is a distilled, smaller and faster version of the BERT (Bidirectional Encoder Representations from Transformers) model for natural language understanding tasks.

In [62]:
from transformers import DistilBertForSequenceClassification

We want to perform our computations on a GPU if one is available, and we train the model for 3 epochs. The pretrained model we use is the 'distilbert-base-uncased' checkpoint, a DistilBERT model trained on uncased (lowercased) English text. Loading it into DistilBertForSequenceClassification adds a new, randomly initialized classification head, which is why the warning below asks for the model to be trained.

In [63]:
import torch
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
NUM_EPOCHS = 3

In [64]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(DEVICE)
model.train();

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Calling list_metrics() returns a list of strings with the names of the available metrics, which can be used to evaluate models on specific datasets or tasks. Here the "accuracy" metric is loaded, and the compute_metrics function uses it to score the predictions against the labels during evaluation. Note that list_metrics and load_metric are deprecated in datasets; the separate evaluate library is their replacement.

In [65]:
from datasets import list_metrics
metrics_list = list_metrics()

In [66]:
from datasets import load_metric
import numpy as np


metric = load_metric("accuracy")

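def compute_metrics(eval_pred):
    # eval_pred holds the model outputs (logits) and the true labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)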

You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this metric from the next major release of `datasets`.
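
What compute_metrics calculates can be written out in a few lines of plain Python. This is a dependency-free sketch of the same idea (argmax over the logits, then the fraction of matches), not the code the metric library runs. `accuracy_from_logits` and the toy values are hypothetical.

```python
def accuracy_from_logits(logits, labels):
    """Pick the highest-scoring class per row and compare with the labels."""
    predictions = [max(range(len(row)), key=row.__getitem__) for row in logits]
    correct = sum(pred == label for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}

# Toy logits: one row per message, one column per class (0 = not spam, 1 = spam).
toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.2], [-0.3, 0.4]]
toy_labels = [0, 1, 1, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # {'accuracy': 0.75}
```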


training_args is an instance of the TrainingArguments class from the Hugging Face transformers library. Model checkpoints and outputs are saved to './results', the model is trained for 3 epochs with a batch size of 16 for both training and evaluation, logs such as training metrics and evaluation results are written to './logs', the training loss is logged every 10 steps, and evaluation_strategy="epoch" runs an evaluation at the end of each epoch.

The Trainer brings together all the components needed for training: the model, the training and evaluation datasets, the optimizer (Adam with a learning rate of 5e-5), the training arguments, and the function that computes the evaluation metric. Once the Trainer is set up, calling trainer.train() starts the training process.

In [69]:
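from transformers import Trainer, TrainingArguments

# Adam optimizer with a small learning rate for fine-tuning
optim = torch.optim.Adam(model.parameters(), lr=5e-5)

training_args = TrainingArguments(
    output_dir='./results',          # checkpoints and outputs
    num_train_epochs=NUM_EPOCHS,     # 3 epochs, defined above
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    logging_dir='./logs',            # training and evaluation logs
    logging_steps=10,                # log the training loss every 10 steps
    evaluation_strategy="epoch"      # evaluate at the end of each epoch
)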

trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    args=training_args,
    train_dataset=train_tokened,
    eval_dataset=valid_tokened,
    optimizers=(optim, None) 
)

In [70]:
trainer.train()

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.0504        | 0.061099        | 0.980287 |
| 2     | 0.001         | 0.100197        | 0.982079 |
| 3     | 0.0003        | 0.097228        | 0.985663 |


Checkpoint destination directory ./results/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.


TrainOutput(global_step=942, training_loss=0.02862816490887935, metrics={'train_runtime': 930.485, 'train_samples_per_second': 16.172, 'train_steps_per_second': 1.012, 'total_flos': 1993369414975488.0, 'train_loss': 0.02862816490887935, 'epoch': 3.0})
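
The global_step=942 in the output follows from the split size and the training arguments. The arithmetic below is a sketch that assumes a single device and no gradient accumulation, which is consistent with the numbers reported above.

```python
import math

train_rows = 5016      # size of the training split
batch_size = 16        # per_device_train_batch_size
num_epochs = 3         # num_train_epochs
logging_steps = 10     # logging_steps

steps_per_epoch = math.ceil(train_rows / batch_size)  # the last batch is partial
total_steps = steps_per_epoch * num_epochs
log_entries = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_entries)  # 314 942 94
```

A checkpoint directory named checkpoint-500 appears in the warning above because the Trainer saves a checkpoint every 500 steps by default, and only step 500 falls within the 942 steps of this run.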

In [72]:
trainer.evaluate()

{'eval_loss': 0.09722808003425598,
 'eval_accuracy': 0.985663082437276,
 'eval_runtime': 10.8146,
 'eval_samples_per_second': 51.597,
 'eval_steps_per_second': 3.236,
 'epoch': 3.0}

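After three epochs the model reaches a validation accuracy of about 0.9857, which corresponds to 550 of the 558 validation messages being classified correctly. The validation loss rises from 0.061 after the first epoch to 0.097 after the third while the training loss keeps falling, which may indicate some overfitting.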
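
Accuracy alone can look high when one class dominates, so precision and recall for the spam class are a useful complement. The sketch below shows how they could be computed from predictions and labels. `precision_recall_f1` is a hypothetical helper and the toy values are not results from this model.

```python
def precision_recall_f1(predictions, labels, positive=1):
    """Compute precision, recall and F1 for the positive (spam) class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)  # spam caught
    fp = sum(p == positive and y != positive for p, y in pairs)  # false alarms
    fn = sum(p != positive and y == positive for p, y in pairs)  # spam missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}

# Toy predictions and labels for illustration only.
toy_preds = [1, 0, 1, 0, 0, 0]
toy_true = [1, 0, 0, 1, 0, 0]
print(precision_recall_f1(toy_preds, toy_true))
# {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```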

In [73]:
trainer.save_model('sms_model')
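
The saved model could later be reloaded to classify new messages. The following is a sketch under stated assumptions rather than a tested part of the notebook: `predict_spam` and `softmax` are hypothetical helpers, the 0.5 threshold is an arbitrary choice, and the tokenizer is reloaded by name because the Trainer above was not given one and is therefore assumed not to have saved it alongside the model.

```python
import math

def softmax(logits):
    """Convert a list of raw scores into probabilities that sum to 1."""
    peak = max(logits)                        # subtract the max for stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_spam(texts, model_dir="sms_model"):
    """Return (label, spam probability) for each text using the saved model."""
    # Imported here so the softmax helper works without these libraries.
    import torch
    from transformers import (DistilBertForSequenceClassification,
                              DistilBertTokenizerFast)

    tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")
    model = DistilBertForSequenceClassification.from_pretrained(model_dir)
    model.eval()                              # inference mode (no dropout)

    inputs = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    with torch.no_grad():                     # no gradients needed for inference
        logits = model(**inputs).logits

    results = []
    for row in logits.tolist():
        spam_prob = softmax(row)[1]           # index 1 = spam
        results.append(("spam" if spam_prob >= 0.5 else "ham", spam_prob))
    return results
```

Calling `predict_spam(["some message text"])` would return one (label, probability) pair per input message.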