# 1. Data Acquisition & Exploration

In [1]:
# Use data set SMSSpamCollection

import pandas as pd

# Load dataset from raw GitHub URL
url = "https://raw.githubusercontent.com/ahsanb6/Assignment-Building-a-Text-Classification-Pipeline/main/SMSSpamCollection"

# Dataset has no header, tab-separated; we'll assign column names manually
df = pd.read_csv(url, sep='\t', header=None, names=['label', 'message'])

# Basic info
print("Total messages:", len(df))
print("\nClass distribution:\n", df['label'].value_counts())
df.head()


Total messages: 5572

Class distribution:
 label
ham     4825
spam     747
Name: count, dtype: int64


| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
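
The class counts above show a strong imbalance, which matters when reading accuracy figures later on. As a quick sketch that uses only the printed counts (4825 ham, 747 spam), the share of spam and the accuracy of a hypothetical classifier that always answers "ham" can be computed like this:

```python
# Class counts copied from the value_counts() output above
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Fraction of messages that are spam
spam_share = spam_count / total

# Accuracy of a hypothetical baseline that labels every message as ham
majority_baseline_accuracy = ham_count / total

print(f"Spam share: {spam_share:.3f}")
print(f"Always-ham baseline accuracy: {majority_baseline_accuracy:.3f}")
```

A model therefore has to beat roughly 0.87 accuracy to be useful at all, which is one reason to look at precision and recall on the spam class rather than accuracy alone.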


In [2]:
# Show 5 examples of ham
print(" Example 'ham' messages:\n")
print(df[df['label'] == 'ham']['message'].sample(5, random_state=1).to_string(index=False))

# Show 5 examples of spam
print("\n Example 'spam' messages:\n")
print(df[df['label'] == 'spam']['message'].sample(5, random_state=2).to_string(index=False))

 Example 'ham' messages:

                     Ok enjoy . R u there in home.
Yo, the game almost over? Want to go to walmart...
                        Shall i come to get pickle
 Hi. Hope you had a good day. Have a better night.
          K..u also dont msg or reply to his msg..

 Example 'spam' messages:

Someone has contacted our dating service and en...
Want 2 get laid tonight? Want real Dogging loca...
   3. You have received your mobile content. Enjoy
WELL DONE! Your 4* Costa Del Sol Holiday or £50...
Enjoy the jamster videosound gold club with you...


# 2. Pre‑processing Pipeline

In [3]:
import re

# Function to lowercase and clean text
def clean_text(text):
    text = text.lower()  # lowercase
    text = re.sub(r'[^a-z\s]', '', text)  # remove punctuation and numbers
    return text

# Apply on a few example messages
df['cleaned'] = df['message'].apply(clean_text)

# Show before and after
print("Original Message:\n", df['message'].iloc[0])
print("\nCleaned Message:\n", df['cleaned'].iloc[0])

Original Message:
 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

Cleaned Message:
 go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat
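
One side effect of `clean_text` is that digits and currency symbols are deleted outright, although phone numbers and prices may be useful spam cues. The sketch below contrasts the original function with a hypothetical variant, `clean_text_keep_cues`, that replaces them with placeholder words instead. The variant, the placeholder names and the sample message are illustrative assumptions, not part of the notebook's pipeline.

```python
import re

def clean_text(text):
    # Same behaviour as the notebook's function: lowercase, then delete
    # every character that is not a-z or whitespace
    text = text.lower()
    return re.sub(r'[^a-z\s]', '', text)

def clean_text_keep_cues(text):
    # Hypothetical variant: keep a trace of currency symbols and numbers
    text = text.lower()
    text = re.sub(r'[£$€]', ' currencytoken ', text)  # placeholder word for currency
    text = re.sub(r'\d+', ' numtoken ', text)         # placeholder word for digit runs
    text = re.sub(r'[^a-z\s]', ' ', text)             # remaining punctuation becomes a space
    return re.sub(r'\s+', ' ', text).strip()          # collapse repeated whitespace

# Made-up sample message, not taken from the dataset
sample = "Win £100 now! Call 0800 today."

print(repr(clean_text(sample)))
print(repr(clean_text_keep_cues(sample)))
```

Whether the placeholders actually help would have to be checked on the validation set; this only shows what information the current regex discards.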


In [4]:
# Tokenization + Stopword Removal

import nltk
nltk.download('stopwords')

from nltk.corpus import stopwords

# Get English stopwords
stop_words = set(stopwords.words('english'))

# Tokenize using .split() and remove stopwords
def tokenize_and_remove_stopwords(text):
    tokens = text.split()  # simple split by whitespace
    filtered = [word for word in tokens if word not in stop_words]
    return filtered

# Apply on cleaned text
df['tokens'] = df['cleaned'].apply(tokenize_and_remove_stopwords)

# Show example
print("Cleaned Message:\n", df['cleaned'].iloc[0])
print("\nTokens (without stopwords):\n", df['tokens'].iloc[0])

Cleaned Message:
 go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat

Tokens (without stopwords):
 ['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amore', 'wat']


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [5]:
# Lemmatization

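import nltk
from nltk.stem import WordNetLemmatizer

# WordNetLemmatizer needs the WordNet corpus to be available locally
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

# Function: lemmatize tokens
# Note: lemmatize() treats every token as a noun by default (pos='n'),
# so verb forms such as 'got' are left unchanged
def lemmatize_tokens(tokens):
    return [lemmatizer.lemmatize(token) for token in tokens]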

# Apply lemmatization
df['lemmas'] = df['tokens'].apply(lemmatize_tokens)

# Show example
print("Original Message:\n", df['message'].iloc[0])
print("\nLemmatized Tokens:\n", df['lemmas'].iloc[0])

Original Message:
 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

Lemmatized Tokens:
 ['go', 'jurong', 'point', 'crazy', 'available', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'got', 'amore', 'wat']


# 3. Feature Engineering

In [6]:
# Bag of Words (BoW) – with CountVectorizer

from sklearn.feature_extraction.text import CountVectorizer

# Join lemmas back into strings for vectorizer
df['text_ready'] = df['lemmas'].apply(lambda x: ' '.join(x))

# Initialize vectorizer (Bag of Words)
vectorizer = CountVectorizer()

# Fit and transform text into feature vectors
X_bow = vectorizer.fit_transform(df['text_ready'])

# Show shape and some sample features
print("Shape of BoW matrix:", X_bow.shape)
print("Sample features:\n", vectorizer.get_feature_names_out()[:10])

Shape of BoW matrix: (5572, 7950)
Sample features:
 ['aa' 'aah' 'aaniye' 'aaooooright' 'aathilove' 'aathiwhere' 'ab' 'abbey'
 'abdomen' 'abeg']
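
To make the Bag of Words step less of a black box, here is a simplified pure-Python sketch of what `CountVectorizer` does with its default settings. It is not scikit-learn's actual code. It mimics the default `token_pattern`, which keeps only tokens of two or more word characters, so single-letter tokens such as `n` and `e` never become features. The function name `bag_of_words` and the toy documents are illustrative.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def bag_of_words(docs):
    # Tokenize every document (lowercased, tokens of 2+ characters only)
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    # Vocabulary: all distinct tokens in alphabetical order
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    # One row of counts per document, one column per vocabulary word
    rows = []
    for toks in tokenized:
        row = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[index[tok]] = count
        rows.append(row)
    return vocab, rows

docs = ["go jurong point crazy", "free entry free prize n e"]
vocab, rows = bag_of_words(docs)

print(vocab)
for row in rows:
    print(row)
```

The real vectorizer stores the same counts in a sparse matrix, which is what keeps the 5572 x 7950 matrix above manageable.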


In [7]:
# TF‑IDF Vectorization

from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
tfidf = TfidfVectorizer()

# Fit and transform text
X_tfidf = tfidf.fit_transform(df['text_ready'])

# Show shape and sample features
print("Shape of TF-IDF matrix:", X_tfidf.shape)
print("Sample features:\n", tfidf.get_feature_names_out()[:10])

Shape of TF-IDF matrix: (5572, 7950)
Sample features:
 ['aa' 'aah' 'aaniye' 'aaooooright' 'aathilove' 'aathiwhere' 'ab' 'abbey'
 'abdomen' 'abeg']
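
The TF-IDF matrix has the same shape as the BoW matrix because only the cell values change. With its default settings (`smooth_idf=True`, `norm='l2'`), `TfidfVectorizer` multiplies each raw count by `ln((1 + n) / (1 + df)) + 1`, where `n` is the number of documents and `df` is the number of documents containing the word, and then scales each row to unit length. The sketch below reproduces that arithmetic for one row; the function name and the toy numbers are illustrative.

```python
import math

def tfidf_row(counts, doc_freq, n_docs):
    # Weight each raw count by the smoothed inverse document frequency
    weights = [
        c * (math.log((1 + n_docs) / (1 + df)) + 1)
        for c, df in zip(counts, doc_freq)
    ]
    # L2-normalize the row so that its Euclidean length is 1
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm if norm else 0.0 for w in weights]

# Toy case: two words that each occur once in this message.
# The first word appears in 1 of 2 documents, the second in both.
row = tfidf_row(counts=[1, 1], doc_freq=[1, 2], n_docs=2)
print(row)
```

The rarer word ends up with the larger weight, which is the whole point of the IDF factor.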


In [8]:
!pip install gensim



In [9]:
# Word Embeddings (Dense Vectors using pre-trained GloVe)

import gensim.downloader as api
import numpy as np

# Load pre-trained GloVe vectors (50 dimensions)
glove = api.load("glove-wiki-gigaword-50")

# Function to average word vectors
def get_avg_word_vector(tokens):
    vectors = [glove[word] for word in tokens if word in glove]
    if len(vectors) == 0:
        return np.zeros(50)
    else:
        return np.mean(vectors, axis=0)

# Apply to all messages
X_glove = np.vstack(df['lemmas'].apply(get_avg_word_vector))

print("Shape of GloVe averaged matrix:", X_glove.shape)

Shape of GloVe averaged matrix: (5572, 50)
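
Averaging silently skips every token that is missing from the embedding vocabulary and falls back to a zero vector when nothing is found. The sketch below shows that behaviour on a hypothetical 3-dimensional toy embedding, together with a small `coverage` helper (an addition for illustration, not part of the notebook) that reports how many tokens were actually found.

```python
import numpy as np

# Hypothetical 3-dimensional embeddings standing in for the 50-d GloVe vectors
toy_vectors = {
    "free": np.array([1.0, 0.0, 2.0]),
    "prize": np.array([3.0, 2.0, 0.0]),
}

def average_vector(tokens, vectors, dim):
    # Keep only tokens that have an embedding, then average them
    found = [vectors[tok] for tok in tokens if tok in vectors]
    if not found:
        return np.zeros(dim)  # same zero-vector fallback as in the notebook
    return np.mean(found, axis=0)

def coverage(tokens, vectors):
    # Fraction of tokens that have an embedding
    if not tokens:
        return 0.0
    return sum(tok in vectors for tok in tokens) / len(tokens)

message = ["free", "prize", "wkly", "comp"]
avg = average_vector(message, toy_vectors, dim=3)
cov = coverage(message, toy_vectors)

print(avg, cov)
```

Running a helper like `coverage` over `df['lemmas']` with the real `glove` object would show how much of the SMS vocabulary the embeddings actually cover.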


# 4. Modelling & Evaluation

In [11]:
# Data Split

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Encode labels (ham=0, spam=1)
le = LabelEncoder()
y = le.fit_transform(df['label'])

# Split: 70% train, 10% val, 20% test
X_train_bow, X_temp_bow, y_train, y_temp = train_test_split(X_bow, y, test_size=0.3, random_state=42)
X_val_bow, X_test_bow, y_val, y_test = train_test_split(X_temp_bow, y_temp, test_size=2/3, random_state=42)

# Repeat for TF-IDF (same row count and random_state, so rows stay aligned with y_train / y_val / y_test)
X_train_tfidf, X_temp_tfidf = train_test_split(X_tfidf, test_size=0.3, random_state=42)
X_val_tfidf, X_test_tfidf = train_test_split(X_temp_tfidf, test_size=2/3, random_state=42)

# Repeat for GloVe (again relying on the same random_state to keep rows aligned with the labels)
X_train_glove, X_temp_glove = train_test_split(X_glove, test_size=0.3, random_state=42)
X_val_glove, X_test_glove = train_test_split(X_temp_glove, test_size=2/3, random_state=42)

print("Train size:", X_train_bow.shape[0])
print("Val size:", X_val_bow.shape[0])
print("Test size:", X_test_bow.shape[0])

Train size: 3900
Val size: 557
Test size: 1115
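
The split above is not stratified, so the spam share can drift between the three subsets. As a sketch of one alternative, the example below splits row indices once with `stratify` and could then reuse those indices for every feature matrix (for example `X_bow[idx_train]`). The label array is rebuilt from the class counts printed earlier and all variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the same class counts as the dataset (0 = ham, 1 = spam)
y_demo = np.array([0] * 4825 + [1] * 747)
idx = np.arange(len(y_demo))

# 70% train, 30% temporary, keeping the spam share equal in both parts
idx_train, idx_temp, y_train_demo, y_temp_demo = train_test_split(
    idx, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)
# Split the temporary part into 10% validation and 20% test
idx_val, idx_test, y_val_demo, y_test_demo = train_test_split(
    idx_temp, y_temp_demo, test_size=2/3, stratify=y_temp_demo, random_state=42
)

for name, labels in [("train", y_train_demo), ("val", y_val_demo), ("test", y_test_demo)]:
    print(name, len(labels), round(labels.mean(), 3))
```

Splitting indices once also removes the need to repeat `train_test_split` per feature type and to rely on identical `random_state` values to keep rows aligned.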


In [12]:
# Train Multinomial Naive Bayes

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Initialize and train model
nb = MultinomialNB()
nb.fit(X_train_bow, y_train)

# Predict on validation set
y_val_pred = nb.predict(X_val_bow)

# Evaluate
print("📊 Naive Bayes on BoW Features (Validation Set):\n")
print(classification_report(y_val, y_val_pred, target_names=le.classes_))

📊 Naive Bayes on BoW Features (Validation Set):

              precision    recall  f1-score   support

         ham       0.99      0.98      0.99       491
        spam       0.86      0.95      0.91        66

    accuracy                           0.98       557
   macro avg       0.93      0.97      0.95       557
weighted avg       0.98      0.98      0.98       557
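
Note that the vectorizers in section 3 were fit on the full dataset before the split, so the vocabulary also contains words that occur only in validation or test messages. A sketch of one way to avoid this is to put the vectorizer and the classifier into a single scikit-learn pipeline that is fit on training text only. The toy messages and labels below are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training and validation messages (1 = spam, 0 = ham)
train_texts = [
    "free prize call now",
    "win cash prize today",
    "see you at lunch",
    "call me when home",
]
train_labels = [1, 1, 0, 0]
val_texts = ["free cash now", "lunch at home", "unseenword"]

# The vectorizer learns its vocabulary from the training texts only
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

vocabulary = model.named_steps["countvectorizer"].vocabulary_
val_pred = model.predict(val_texts)

print(sorted(vocabulary))
print(val_pred)
```

Words that appear only in the validation texts are simply ignored at prediction time, which mirrors what would happen with genuinely new messages.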



In [13]:
# Train Logistic Regression on BoW (GloVe follows in the next cell)

from sklearn.linear_model import LogisticRegression

# Initialize model
lr_bow = LogisticRegression(max_iter=1000)
lr_bow.fit(X_train_bow, y_train)

# Predict
y_val_pred_lr_bow = lr_bow.predict(X_val_bow)

# Evaluate
print("📊 Logistic Regression on BoW Features:\n")
print(classification_report(y_val, y_val_pred_lr_bow, target_names=le.classes_))

📊 Logistic Regression on BoW Features:

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       491
        spam       1.00      0.86      0.93        66

    accuracy                           0.98       557
   macro avg       0.99      0.93      0.96       557
weighted avg       0.98      0.98      0.98       557



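Compared with Naive Bayes, Logistic Regression has higher precision on spam (1.00 vs 0.86).

Its spam recall is clearly lower (0.86 vs 0.95), so it misses more spam messages.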

In [14]:
# Logistic Regression on GloVe Dense Vectors

# Initialize model
lr_glove = LogisticRegression(max_iter=1000)
lr_glove.fit(X_train_glove, y_train)

# Predict
y_val_pred_glove = lr_glove.predict(X_val_glove)

# Evaluate
print("📊 Logistic Regression on GloVe Embeddings:\n")
print(classification_report(y_val, y_val_pred_glove, target_names=le.classes_))

📊 Logistic Regression on GloVe Embeddings:

              precision    recall  f1-score   support

         ham       0.95      0.96      0.95       491
        spam       0.66      0.61      0.63        66

    accuracy                           0.92       557
   macro avg       0.80      0.78      0.79       557
weighted avg       0.91      0.92      0.91       557



In [15]:
# Compare all models on the validation set (spam is the positive class, label 1)

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd

# Helper to collect results
def get_metrics(y_true, y_pred, model_name, feature_type):
    return {
        "Model": model_name,
        "Features": feature_type,
        "Accuracy": accuracy_score(y_true, y_pred),
        "Precision (Spam)": precision_score(y_true, y_pred),
        "Recall (Spam)": recall_score(y_true, y_pred),
        "F1-score (Spam)": f1_score(y_true, y_pred)
    }

# Collect metrics for each model
results = []

# Naive Bayes (BoW)
results.append(get_metrics(y_val, y_val_pred, "Naive Bayes", "BoW"))

# Logistic Regression (BoW)
results.append(get_metrics(y_val, y_val_pred_lr_bow, "Logistic Regression", "BoW"))

# Logistic Regression (GloVe)
results.append(get_metrics(y_val, y_val_pred_glove, "Logistic Regression", "GloVe"))

# Create table
results_df = pd.DataFrame(results)

# Round values to 2 decimals for readability
results_df = results_df.round(2)

# Show table
results_df

| | Model | Features | Accuracy | Precision (Spam) | Recall (Spam) | F1-score (Spam) |
|---|---|---|---|---|---|---|
| 0 | Naive Bayes | BoW | 0.98 | 0.86 | 0.95 | 0.91 |
| 1 | Logistic Regression | BoW | 0.98 | 1.0 | 0.86 | 0.93 |
| 2 | Logistic Regression | GloVe | 0.92 | 0.66 | 0.61 | 0.63 |


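Sparse BoW features worked much better than averaged GloVe embeddings on this small dataset (spam F1 of 0.91 to 0.93 vs 0.63). TF‑IDF features were built above but no model was trained on them, so they are not part of this comparison.

Naive Bayes caught the most spam (recall 0.95), at the cost of lower spam precision (0.86).

Logistic Regression on BoW was very precise on spam (no false positives on the validation set) but missed about 14% of the spam messages (recall 0.86).

Logistic Regression on GloVe embeddings underperformed on every metric, most clearly on the spam class (precision 0.66, recall 0.61).

All figures are validation-set results; the held-out test set is not evaluated in this notebook.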
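
For readers who want to check the spam metrics by hand, the sketch below computes precision, recall and F1 from confusion counts. The counts are an inference, not something the notebook prints: 57 true positives, 0 false positives and 9 false negatives are the values consistent with the rounded Logistic Regression (BoW) report for 66 spam messages.

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: share of predicted spam that really is spam
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of real spam that was caught
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return precision, recall, f1

# Inferred counts for Logistic Regression on BoW (assumption, see text above)
precision, recall, f1 = precision_recall_f1(tp=57, fp=0, fn=9)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The rounded results match the reported 1.00, 0.86 and 0.93.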

# 5. Analysis & Discussion

**1. Naive Bayes vs Logistic Regression**

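Naive Bayes is a simple probabilistic model that works well on word-count features.

On the validation set it found almost all spam messages (spam recall 0.95), but about 14% of the messages it flagged were actually ham (spam precision 0.86).

Logistic Regression learns a weight for each word instead of assuming that words occur independently of each other.

It never marked ham as spam on the validation set (spam precision 1.00), but it missed more spam (spam recall 0.86).

If the priority is to catch as much spam as possible, Naive Bayes is the better choice.

If the priority is to avoid marking ham as spam, Logistic Regression is the better choice.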
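
The choice between the two priorities does not have to be made only by switching models. Both classifiers can return spam probabilities (`predict_proba` in scikit-learn), and the cut-off applied to them controls the precision/recall trade-off. The sketch below uses made-up probabilities and labels to show the effect of lowering the threshold from the default 0.5; none of the numbers come from the notebook's models.

```python
def precision_recall_at(probs, labels, threshold):
    # Predict spam (1) whenever the spam probability reaches the threshold
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical spam probabilities for four spam and four ham messages
spam_probs = [0.95, 0.80, 0.45, 0.30, 0.60, 0.10, 0.05, 0.20]
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]

default_cut = precision_recall_at(spam_probs, true_labels, threshold=0.5)
lower_cut = precision_recall_at(spam_probs, true_labels, threshold=0.25)

print("threshold 0.50:", default_cut)
print("threshold 0.25:", lower_cut)
```

In this toy case the lower threshold catches every spam message. On real data precision usually drops as recall rises, so the threshold should be chosen on the validation set.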

**2. BoW vs GloVe (Feature Types)**

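BoW (Bag of Words) represents each message by how often each vocabulary word appears in it.

It worked well with both models (validation accuracy 0.98).

GloVe maps each word to a dense vector that encodes its meaning; here each message is represented by the average of its 50-dimensional word vectors.

GloVe did not work as well here, most likely because:

Our SMS data is small and informal, with a lot of slang and abbreviations.

The GloVe vectors used here (glove-wiki-gigaword-50) were trained on Wikipedia and newswire text, not on short messages.

Overall, BoW was the better feature type for this small SMS dataset.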

# 6. Reproducibility & Code Quality

In [16]:
!pip freeze > requirements.txt

In [17]:
from google.colab import files
files.download('requirements.txt')
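
`pip freeze` in a hosted notebook records every package in the environment, not only the ones this notebook imports. As a sketch of a leaner alternative, the snippet below uses the standard-library `importlib.metadata` to pin only the packages the notebook relies on. The package list is an assumption derived from the imports above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names of the libraries imported in this notebook (assumed list)
PACKAGES = ["pandas", "numpy", "scikit-learn", "nltk", "gensim"]

def pinned_requirements(packages):
    lines = []
    for name in packages:
        try:
            # Pin the exact version installed in the current environment
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            # Leave a comment line when a package is not installed
            lines.append(f"# {name} not installed in this environment")
    return lines

requirement_lines = pinned_requirements(PACKAGES)
print("\n".join(requirement_lines))
```

The resulting lines can be written to `requirements.txt` in place of the full `pip freeze` output.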
