### Emilie Dubief

# Introduction

Text classification is the process of sorting text data into categories depending on what the data contains. There can be many categories, but here we will use binary classification. It is a common task in NLP (Natural Language Processing), and several methods can be used, from traditional machine learning methods like Logistic Regression to Transfer Learning approaches.

In order to find the best approach for our problem, let's recap what we are dealing with.

Here, we work with emails that are either spam or not spam, so we need to build a binary text classification model. The data is split into two files: a training file and a testing file. We will use the training file for training and validation, and the testing file for the submission file.

First, you will see the final methodology applied to this problem, along with the earlier algorithms I tried. Then comes a summary of the results of each trial. Finally, there is a short conclusion and the references used in this notebook.

# Table of contents

* [Methodology - NLP Model](#methodology)
* [Results](#results)
* [Conclusion](#conclusion)
* [References](#references)

In [None]:
!pip install -q --upgrade transformers huggingface_hub

import pandas as pd
pd.set_option('display.max_rows', 36)
pd.set_option("display.max_colwidth", 150)

seed = 42
import numpy as np
np.random.seed(seed)

from sklearn.model_selection import train_test_split

# Regex
import re

# Naive Bayes
from sklearn.naive_bayes import MultinomialNB

# LSTM
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Transfer Learning (RoBERTa)
import torch
from datasets import Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# competition metric for local evaluation
from sklearn.metrics import matthews_corrcoef
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Methodology <a class="anchor"  id="methodology"></a>

In this section, you will be able to see all the different steps I took to achieve the building of my final model.

## Read in the training data

The first thing to do was to read the data from the train.csv file, which contains a sample of spam and not-spam emails.

In [None]:
train = pd.read_csv("/kaggle/input/u-tad-spam-not-spam-2025-edition/train.csv", index_col = "row_id")

X = train["text"]        # Content of the emails
y = train["spam_label"]  # Labels: Spam or Not-Spam

# Splitting the data for training and validation
raw_X_train, raw_X_val, raw_y_train, raw_y_val = train_test_split(X, y, test_size=0.2, random_state=42)

## Data cleaning

Then, I cleaned the data to remove insignificant content in the emails, such as repetitive labels.

This cleaning process is not used in the final model because it did not improve the score, but the first algorithms I tried relied on it.
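As a minimal sketch of this kind of cleaning, the function below strips a leading "Subject:" label and leading "re :" markers, then collapses line breaks into single spaces. It is a hedged variant and not the exact code used for the earlier trials: the anchored, case-insensitive patterns and the sample email are assumptions made for illustration.

```python
import re

def clean_text(text):
    """Remove common email labels and normalise whitespace (illustrative variant)."""
    # Drop a leading "Subject:" label, whatever its capitalisation
    text = re.sub(r"^subject\s*:\s*", "", text, flags=re.IGNORECASE)
    # Drop one or more leading "re :" markers, but leave words such as "are :" alone
    text = re.sub(r"^(re\s*:\s*)+", "", text, flags=re.IGNORECASE)
    # Replace line breaks and repeated spaces with a single space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

# Made-up email used only to show the effect of each step
sample = "Subject: re : meeting\nare : you free tomorrow ?"
cleaned = clean_text(sample)
print(cleaned)
```

Replacing line breaks with a space rather than an empty string keeps the last word of one line from being glued to the first word of the next.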

In [None]:
# # Cleaning data
# def clean_text(text):
#     text = re.sub(r'Subject: ', '', text)  # remove Subject: 
#     text = re.sub(r're : ', '', text)      # remove re: 
#     text = re.sub(r'\n', '', text)         # remove \n
#     return text

# cleaned_X_train = raw_X_train.apply(clean_text).tolist()
# cleaned_X_val = raw_X_val.apply(clean_text).tolist()

## Conversion for transfer learning

This step is for the final algorithm used: Transfer Learning. Here, we only need to convert our data into a Hugging Face `Dataset` so that it can be used with the functions of the library. The content does not change, only the type.

In [None]:
# Convert into Dataframes
train_df = pd.DataFrame({
    "text": raw_X_train.values,
    "label": raw_y_train.values
})

val_df = pd.DataFrame({
    "text": raw_X_val.values,
    "label": raw_y_val.values
})

# Convert to HuggingFace Dataset
train_ds = Dataset.from_pandas(train_df)
val_ds = Dataset.from_pandas(val_df)

## Tokenisation

The next step is to split each email into tokens, the units that the model actually processes, so that it can analyse the data better.
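To make the idea concrete, here is a toy sketch of what a tokenizer returns when every text is truncated or padded to a fixed length. It uses a whitespace split and a made-up vocabulary, so it is not the subword tokenizer of the pre-trained model; `build_vocab`, `encode`, the special ids and the sample sentences are all assumptions made for illustration.

```python
# Toy illustration only: whitespace tokens and a made-up vocabulary
PAD_ID, UNK_ID = 0, 1

def build_vocab(texts):
    """Assign an integer id to every distinct lowercase word."""
    vocab = {"[PAD]": PAD_ID, "[UNK]": UNK_ID}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab, max_len):
    """Map words to ids, truncate on the right, then pad up to max_len."""
    ids = [vocab.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_len]                      # right-side truncation
    padding = max_len - len(ids)
    return {
        "input_ids": ids + [PAD_ID] * padding,
        # 1 marks a real token, 0 marks padding the model should ignore
        "attention_mask": [1] * len(ids) + [0] * padding,
    }

vocab = build_vocab(["win a free prize now", "meeting at noon"])
short_email = encode("free prize", vocab, max_len=4)
long_email = encode("win a free prize now please", vocab, max_len=4)
print(short_email)
print(long_email)
```

The short email is padded and the long one is cut, so both end up with the same length, which is what lets the emails be stacked into batches.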

In [None]:
max_len = 128   

## Transfer learning
model_name = "distilbert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    add_prefix_space=False,
    truncation_side="right"
)

def tokenize(batch):
    # Truncate or pad every email to exactly max_len tokens
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=max_len)

X_train = train_ds.map(tokenize, batched=True)
X_val = val_ds.map(tokenize, batched=True)

X_train = X_train.rename_column("label", "labels")
X_val   = X_val.rename_column("label", "labels")

X_train.set_format("torch")
X_val.set_format("torch")

## NLP model

Then, we build the model on top of the pre-trained `distilbert-base-uncased` checkpoint, so that transfer learning can improve our results.

After that, we train the model on the training split and evaluate it on the validation split after each epoch.

In [None]:
## Transfer Learning (DistilBERT)

model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)

import os
os.environ["WANDB_DISABLED"] = "true"

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    logging_steps=10,          
    disable_tqdm=False         
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=X_train,
    eval_dataset=X_val
)

print("Starting training...")
trainer.train()

## Make prediction

Here, we test our model on the validation data we kept aside, and we print the accuracy and the Matthews correlation coefficient (the competition metric) to see how it did.
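For reference, the Matthews correlation coefficient can be computed by hand from the four confusion counts as MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below does this in plain Python on made-up labels; the helper names and the sample values are assumptions made for illustration, and returning 0 when the denominator is zero is a common convention.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = spam and 0 = not spam."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)

def matthews(y_true, y_pred):
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: the coefficient is 0 when a row or column of the table is empty
    return 0.0 if denominator == 0 else (tp * tn - fp * fn) / denominator

# Made-up labels: 3 spam emails and 5 not-spam emails
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_model = [1, 1, 0, 0, 0, 0, 0, 1]       # one spam missed, one false alarm
y_always_ham = [0] * len(y_true)          # never predicts spam

print(accuracy(y_true, y_model), matthews(y_true, y_model))
print(accuracy(y_true, y_always_ham), matthews(y_true, y_always_ham))
```

The second case shows why the two numbers are printed together: a classifier that never predicts spam still gets a decent accuracy on imbalanced data, while its Matthews coefficient is 0.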

In [None]:
# Make prediction on validation data
preds = trainer.predict(X_val)
y_pred_val = preds.predictions.argmax(axis=1)

# Print results
print("Accuracy :", accuracy_score(raw_y_val, y_pred_val))
print("Matthews coef:", matthews_corrcoef(raw_y_val, y_pred_val))
print(classification_report(raw_y_val, y_pred_val))

## Real data to test the model on

In this section, we apply the trained model to the real test data to produce the submission.

### This is the test data that you are asked to make predictions for

In [None]:
test = pd.read_csv("/kaggle/input/u-tad-spam-not-spam-2025-edition/test.csv", index_col = "row_id")

raw_X_test = test["text"]

test_ds = Dataset.from_pandas(pd.DataFrame({"text": raw_X_test.values}))

# Reuse the tokenize function defined in the Tokenisation step

test_ds = test_ds.map(tokenize, batched=True)
test_ds.set_format("torch")

In [None]:
preds_test = trainer.predict(test_ds)
y_pred = preds_test.predictions.argmax(axis=1)

print(y_pred[:10])

### Submit your predictions in a `submission.csv` file for scoring on the [leaderboard](https://www.kaggle.com/competitions/u-tad-spam-not-spam-2025-edition/leaderboard)
To submit your notebook click on **Submit to competition** and then **Submit**.

In [None]:
# do not modify this code
submission = pd.read_csv("/kaggle/input/u-tad-spam-not-spam-2025-edition/sample_submission.csv")
submission["spam_label"] = y_pred
submission.to_csv('submission.csv',index=False)

In [None]:
submission.head()

# Results <a class="anchor"  id="results"></a>

Reaching this final model, with an accuracy score of 0.91, took several trials.

## Logistic Regression

The first step was to update the starter notebook to build a basic logistic regression model, which gave me a score of 0.75.

Then, I added the cleaning process using regex. I removed the common labels in emails such as "Subject:" and "re:". I also removed the "\n" at the end of the lines. This cleaning increased my score to 0.77.

After that, I changed the vectorizer to the best one I found when I did the tests in class: CountVectorizer. It improved my score to an accuracy of 0.80.

Then, following what we did in class, I tried to remove the English stopwords. Yet it didn't help my model and made my accuracy score decrease to 0.79. Thus, I chose not to remove them.

To try everything I could with logistic regression, I tried another vectorizer: TfidfVectorizer. I didn't have good results with it during the tests in class, but I read it was a good option. Unfortunately, it gave me an accuracy score of 0.72, so I switched back to CountVectorizer.
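The comparison between vectorizers described above can be sketched as follows. This is a toy example on a few made-up emails, not the competition data, so it does not reproduce the scores reported here; the `build` helper, the `models` dictionary and `max_iter=1000` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up emails: 1 = spam, 0 = not spam
texts = [
    "win a free prize now",
    "free money claim your prize",
    "claim your free cash now",
    "cheap pills free offer",
    "meeting at noon tomorrow",
    "please review the attached report",
    "lunch tomorrow with the team",
    "the project report is attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

def build(vectorizer):
    """Chain a vectorizer and a logistic regression into one estimator."""
    return make_pipeline(vectorizer, LogisticRegression(max_iter=1000))

# One pipeline per configuration tried in this section
models = {
    "count": build(CountVectorizer()),
    "count_no_stopwords": build(CountVectorizer(stop_words="english")),
    "tfidf": build(TfidfVectorizer()),
}

new_emails = ["free prize offer", "team meeting report"]
for name, model in models.items():
    model.fit(texts, labels)
    print(name, model.predict(new_emails))
```

Wrapping the vectorizer and the classifier in one pipeline means that switching from one vectorizer to another is a one-line change, and the same transformation is applied at training and prediction time.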

Then, to improve the model, I tried a Naive Bayes classifier with a basic configuration, which resulted in an accuracy of 0.80.

I wanted to try the different configurations of this classifier. The first one was the sensitivity to uncommon words:

* sensitive to uncommon words -> 0.77 accuracy
* insensitive to uncommon words -> 0.81 accuracy

At first, I thought making it sensitive to uncommon words was important, because not-spam emails could use familiar words that are not in the vocabulary of spam. Yet, judging by the results, I guess that spam is getting better at fooling us with uncommon words.

Another configuration was to let the model be influenced by the class proportions in the training data: if there is more spam in the training data, it leans towards spam when in doubt. By default this option is set to true. I tried switching it to false, but it decreased my score to 0.76, so I switched it back to true.
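The two options are not named in the text above. Assuming they correspond to the `alpha` (additive smoothing) and `fit_prior` parameters of scikit-learn's `MultinomialNB`, their effect can be sketched on a tiny made-up dataset. The `spam_probability` helper, the sample emails and the values of `alpha` are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up, deliberately imbalanced data: 3 spam emails (1) and 1 not-spam email (0)
texts = ["free prize now", "free cash now", "win free prize", "meeting at noon"]
labels = [1, 1, 1, 0]

def spam_probability(text, alpha, fit_prior):
    """Train a small Naive Bayes pipeline and return P(spam) for one text."""
    model = make_pipeline(
        CountVectorizer(),
        MultinomialNB(alpha=alpha, fit_prior=fit_prior),
    )
    model.fit(texts, labels)
    # Classes are sorted, so column 1 holds the probability of label 1 (spam)
    return model.predict_proba([text])[0][1]

# "hello" was never seen in training, so only the class prior can decide
with_prior = spam_probability("hello", alpha=1.0, fit_prior=True)
without_prior = spam_probability("hello", alpha=1.0, fit_prior=False)
print(with_prior, without_prior)

# "cash" appears once, in a spam email: less smoothing gives it more weight
low_smoothing = spam_probability("cash", alpha=0.1, fit_prior=False)
high_smoothing = spam_probability("cash", alpha=1.0, fit_prior=False)
print(low_smoothing, high_smoothing)
```

With `fit_prior=True` an email made only of unknown words gets the training proportion of spam as its probability, while `fit_prior=False` falls back to an even split. A smaller `alpha` makes the model react more strongly to words it has rarely seen.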

## LSTM (Long Short-Term Memory)

I had tried everything I could with logistic regression, so I wanted to try another model: LSTM (Long Short-Term Memory). I built a basic model using the cleaning process I wrote for logistic regression. At first, I trained for only 5 epochs and obtained a score of 0.76.

There was not much to tune in the model apart from the number of epochs, so I added 15 epochs. My accuracy score didn't change.

## Transfer Learning

The last method I chose to try was Transfer Learning, since it worked well for image classification. Moreover, since our dataset is small, building on a model that had already been pre-trained on much more text seemed to be a great solution.

From the articles I read, RoBERTa appeared to be an excellent model for spam/not-spam classification. I applied it to the data with only 2 epochs and got an accuracy of 0.90.

Then I added 3 more epochs, which improved my accuracy to 0.91.

Adding more epochs after that didn't seem to improve my score and took several hours, so I stopped at 5 epochs.

The last test I wanted to do was to clean the data with my cleaning process. Yet it only reduced my accuracy score to 0.89.
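One way to check whether extra epochs still help is to have the `Trainer` report validation metrics at the end of every epoch. The sketch below shows a metrics function that could be passed through the `compute_metrics` argument of `Trainer`; the function body, the metric names and the fake logits are assumptions made for illustration and were not part of the runs reported here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef

def compute_metrics(eval_pred):
    """Turn raw model outputs into the two scores tracked in this notebook."""
    logits, labels = eval_pred
    # The predicted class is the one with the highest logit
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "mcc": matthews_corrcoef(labels, predictions),
    }

# Fake logits for 4 emails (columns: not spam, spam) and their true labels
fake_logits = np.array([[2.0, -1.0], [0.1, 0.3], [1.5, 0.2], [-0.5, 0.9]])
fake_labels = np.array([0, 1, 1, 1])
metrics = compute_metrics((fake_logits, fake_labels))
print(metrics)
```

Passing `compute_metrics=compute_metrics` when creating the `Trainer`, together with an epoch-level evaluation strategy, would print these values after each epoch, which makes it easier to see when the score stops improving.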

# Conclusion <a class="anchor"  id="conclusion"></a>

This experiment was really interesting: I learned to work with text data and even tried more things than with image classification.

I think I covered a large array of methods, but I know there are many more, for other kinds of classification or for multiple categories. I found it more interesting than image classification because I was able to better understand how it works.

I ended up with an accuracy score of 0.91, which is really satisfying for me because I was stuck at 0.80 with the first models I used.

With more time, I could try more algorithms. I have read articles about many other models and parameters for text classification.

# References <a class="anchor"  id="references"></a>

GeeksforGeeks article about Text Classification - https://www.geeksforgeeks.org/nlp/what-is-text-classification/

Article about the different NLP models for Text Classification - https://mljourney.com/best-nlp-models-for-text-classification-in-2025/