# SMS Spam Classification with Logistic Regression

This notebook trains and evaluates Logistic Regression models for SMS spam detection using three feature types:

- TF-IDF features
- Word2Vec Skip-gram embeddings (mean pooled)
- Word2Vec CBOW embeddings (mean pooled)

The workflow covers preprocessing, feature engineering, model tuning, evaluation, and comparison.

## 1. Setup and Imports

I'm using scikit-learn for modeling and evaluation, and gensim for Word2Vec training.

In [1]:
# Install gensim if needed
try:
    import gensim  # noqa: F401
except ImportError:
    !pip -q install gensim


In [2]:
import os
import re
import random
from pathlib import Path

import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

## 2. Load Dataset

If `SMSSpamCollection` is available, I use it; otherwise, I fall back to `processed_sms.csv` if it is present.

In [12]:
data_dir = Path("/content")
raw_path = data_dir / "SMSSpamCollection"
processed_path = data_dir / "processed_sms.csv"

if raw_path.exists():
    df = pd.read_csv(raw_path, sep="\t", header=None, names=["label", "text"])
elif processed_path.exists():
    df = pd.read_csv(processed_path)
    if "text" not in df.columns and "clean_text" in df.columns:
        df = df.rename(columns={"clean_text": "text"})
else:
    raise FileNotFoundError(f"Dataset not found in {data_dir}.")

df = df[["label", "text"]].dropna().reset_index(drop=True)
df.head()

| | label | text |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [None]:
from google.colab import drive
drive.mount('/content/drive')

## 3. Text Preprocessing

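Cleaning consists of lowercasing, replacing every non-alphanumeric character with a space, and collapsing repeated whitespace.
Tokenization is a plain whitespace split: TF-IDF consumes the cleaned strings, while the Word2Vec models consume the token lists.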

In [13]:
def clean_text(text: str) -> str:
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

def tokenize(text: str):
    return text.split()

df["clean_text"] = df["text"].astype(str).apply(clean_text)
df["tokens"] = df["clean_text"].apply(tokenize)

label_encoder = LabelEncoder()
df["label_encoded"] = label_encoder.fit_transform(df["label"])

df[["label", "label_encoded", "clean_text"]].head()

| | label | label_encoded | clean_text |
|---|---|---|---|
| 0 | ham | 0 | go until jurong point crazy available only in ... |
| 1 | ham | 0 | ok lar joking wif u oni |
| 2 | spam | 1 | free entry in 2 a wkly comp to win fa cup fina... |
| 3 | ham | 0 | u dun say so early hor u c already then say |
| 4 | ham | 0 | nah i don t think he goes to usf he lives arou... |
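
To make the effect of the cleaning step concrete, here is a minimal standalone sketch that reuses the same `clean_text` and `tokenize` logic on two made-up messages (the sample strings are illustrative and not taken from the dataset). Apostrophes, currency symbols, and hyphens all become spaces, so contractions and phone numbers split into several tokens.

```python
import re

def clean_text(text: str) -> str:
    # Same steps as the notebook: lowercase, replace non-alphanumerics
    # with spaces, then collapse repeated whitespace.
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list:
    # Plain whitespace split, as in the notebook.
    return text.split()

# Hypothetical sample messages, for illustration only.
samples = ["Don't miss out!!! WIN £1000 now", "Call 0800-123-456 :)"]
for s in samples:
    cleaned = clean_text(s)
    print(repr(cleaned), tokenize(cleaned))
# 'don t miss out win 1000 now' ['don', 't', 'miss', 'out', 'win', '1000', 'now']
# 'call 0800 123 456' ['call', '0800', '123', '456']
```

Whether splitting contractions like this helps or hurts is an open design choice; it does mean that single-letter tokens such as `t` reach the Word2Vec vocabulary.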


## 4. Train-Test Split

I use an 80/20 stratified split to preserve the class balance in both the training and test sets.

In [14]:
X_text = df["clean_text"]
X_tokens = df["tokens"]
y = df["label_encoded"]

X_text_train, X_text_test, X_tokens_train, X_tokens_test, y_train, y_test = train_test_split(
    X_text, X_tokens, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
)

print("Train size:", len(X_text_train))
print("Test size:", len(X_text_test))

Train size: 4457
Test size: 1115
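
As a quick illustration of what `stratify=y` buys, the sketch below splits synthetic labels and compares the positive rate in each part. The 13% positive rate and the array sizes are assumptions chosen for the example, not values read from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 1000 items with 13% positives (made-up ratio).
y = np.array([1] * 130 + [0] * 870)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratification, both parts should show nearly the same positive rate.
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"positive rate: train={train_rate:.3f}, test={test_rate:.3f}")
```

The same check can be run on the notebook's own `y_train` and `y_test` by calling `.mean()` on each.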


## 5. Evaluation Utilities

All models share the same evaluation protocol: accuracy, precision, recall, and F1-score, with spam (encoded as 1) as the positive class.

In [15]:
results = []

def evaluate_model(model_name, y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred)
    rec = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)

    results.append({
        "model": model_name,
        "accuracy": acc,
        "precision": prec,
        "recall": rec,
        "f1_score": f1
    })

    print(f"\n=== {model_name} ===")
    print("Accuracy:", acc)
    print("Precision:", prec)
    print("Recall:", rec)
    print("F1-score:", f1)
    print("\nClassification Report:\n")
    print(classification_report(y_true, y_pred, target_names=label_encoder.classes_))

## 6. Logistic Regression with TF-IDF

I compute TF-IDF features (unigrams and bigrams) on the cleaned training text and tune the Logistic Regression hyperparameters with GridSearchCV.

In [16]:
tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X_train_tfidf = tfidf.fit_transform(X_text_train)
X_test_tfidf = tfidf.transform(X_text_test)

param_grid = {
    "C": [0.1, 1, 10],
    "solver": ["liblinear", "lbfgs"]
}

lr_tfidf = LogisticRegression(max_iter=1000, random_state=RANDOM_SEED)
grid_tfidf = GridSearchCV(lr_tfidf, param_grid, cv=5, scoring="f1", n_jobs=-1)
grid_tfidf.fit(X_train_tfidf, y_train)

best_tfidf = grid_tfidf.best_estimator_
y_pred_tfidf = best_tfidf.predict(X_test_tfidf)

print("Best TF-IDF params:", grid_tfidf.best_params_)
evaluate_model("LogReg + TF-IDF", y_test, y_pred_tfidf)

Best TF-IDF params: {'C': 10, 'solver': 'liblinear'}

=== LogReg + TF-IDF ===
Accuracy: 0.9847533632286996
Precision: 0.9925373134328358
Recall: 0.8926174496644296
F1-score: 0.9399293286219081

Classification Report:

              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       966
        spam       0.99      0.89      0.94       149

    accuracy                           0.98      1115
   macro avg       0.99      0.95      0.97      1115
weighted avg       0.98      0.98      0.98      1115



## 7. Logistic Regression with Word2Vec Skip-gram

I train Word2Vec (Skip-gram, `sg=1`) on the training tokens only and build document vectors by mean pooling the word vectors.

In [17]:
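def train_word2vec(sentences, sg):
    # sg=1 selects Skip-gram; sg=0 selects CBOW.
    # Note: with workers > 1, gensim training is not fully deterministic,
    # even with a fixed seed, so scores can vary slightly between runs.
    return Word2Vec(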
        sentences=sentences,
        vector_size=100,
        window=5,
        min_count=2,
        workers=4,
        sg=sg,
        epochs=20,
        seed=RANDOM_SEED
    )

def document_vector(tokens, w2v_model):
    vectors = [w2v_model.wv[token] for token in tokens if token in w2v_model.wv]
    if len(vectors) == 0:
        return np.zeros(w2v_model.vector_size)
    return np.mean(vectors, axis=0)

# Train Skip-gram model
w2v_skipgram = train_word2vec(X_tokens_train.tolist(), sg=1)

X_train_skipgram = np.vstack([document_vector(tokens, w2v_skipgram) for tokens in X_tokens_train])
X_test_skipgram = np.vstack([document_vector(tokens, w2v_skipgram) for tokens in X_tokens_test])

lr_skipgram = LogisticRegression(max_iter=1000, random_state=RANDOM_SEED)
grid_skipgram = GridSearchCV(lr_skipgram, param_grid, cv=5, scoring="f1", n_jobs=-1)
grid_skipgram.fit(X_train_skipgram, y_train)

best_skipgram = grid_skipgram.best_estimator_
y_pred_skipgram = best_skipgram.predict(X_test_skipgram)

print("Best Skip-gram params:", grid_skipgram.best_params_)
evaluate_model("LogReg + Word2Vec Skip-gram", y_test, y_pred_skipgram)

Best Skip-gram params: {'C': 10, 'solver': 'lbfgs'}

=== LogReg + Word2Vec Skip-gram ===
Accuracy: 0.9847533632286996
Precision: 0.9714285714285714
Recall: 0.912751677852349
F1-score: 0.9411764705882353

Classification Report:

              precision    recall  f1-score   support

         ham       0.99      1.00      0.99       966
        spam       0.97      0.91      0.94       149

    accuracy                           0.98      1115
   macro avg       0.98      0.95      0.97      1115
weighted avg       0.98      0.98      0.98      1115
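
The mean-pooling step can be illustrated without training a model. The sketch below uses a hypothetical dictionary of 3-dimensional vectors in place of `w2v_model.wv`. It mirrors the logic of `document_vector`, including the zero-vector fallback for messages with no known tokens.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors standing in for a trained vocabulary.
toy_vectors = {
    "free":  np.array([1.0, 0.0, 2.0]),
    "prize": np.array([3.0, 2.0, 0.0]),
}
VECTOR_SIZE = 3

def mean_pool(tokens, vectors, size):
    # Keep only tokens that have a vector; unknown words are skipped.
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        # No known tokens: fall back to a zero vector.
        return np.zeros(size)
    return np.mean(known, axis=0)

print(mean_pool(["free", "prize", "xyz"], toy_vectors, VECTOR_SIZE))  # [2. 1. 1.]
print(mean_pool(["xyz"], toy_vectors, VECTOR_SIZE))                   # [0. 0. 0.]
```

Mean pooling discards word order and weights every known token equally, which is worth keeping in mind when comparing these features with TF-IDF bigrams.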



## 8. Logistic Regression with Word2Vec CBOW

I train Word2Vec (CBOW, `sg=0`) on the training tokens only and again use mean pooling for the document vectors.

In [18]:
# Train CBOW model
w2v_cbow = train_word2vec(X_tokens_train.tolist(), sg=0)

X_train_cbow = np.vstack([document_vector(tokens, w2v_cbow) for tokens in X_tokens_train])
X_test_cbow = np.vstack([document_vector(tokens, w2v_cbow) for tokens in X_tokens_test])

lr_cbow = LogisticRegression(max_iter=1000, random_state=RANDOM_SEED)
grid_cbow = GridSearchCV(lr_cbow, param_grid, cv=5, scoring="f1", n_jobs=-1)
grid_cbow.fit(X_train_cbow, y_train)

best_cbow = grid_cbow.best_estimator_
y_pred_cbow = best_cbow.predict(X_test_cbow)

print("Best CBOW params:", grid_cbow.best_params_)
evaluate_model("LogReg + Word2Vec CBOW", y_test, y_pred_cbow)

Best CBOW params: {'C': 10, 'solver': 'liblinear'}

=== LogReg + Word2Vec CBOW ===
Accuracy: 0.979372197309417
Precision: 0.95
Recall: 0.8926174496644296
F1-score: 0.9204152249134948

Classification Report:

              precision    recall  f1-score   support

         ham       0.98      0.99      0.99       966
        spam       0.95      0.89      0.92       149

    accuracy                           0.98      1115
   macro avg       0.97      0.94      0.95      1115
weighted avg       0.98      0.98      0.98      1115



## 9. Results Comparison

This section summarizes the performance metrics across the three feature types for easy comparison.

In [None]:
# Create comprehensive comparison table with hyperparameters
comparison_data = {
    "Model": [
        "LogReg + TF-IDF",
        "LogReg + Word2Vec Skip-gram", 
        "LogReg + Word2Vec CBOW"
    ],
    "Best C": [
        grid_tfidf.best_params_["C"],
        grid_skipgram.best_params_["C"],
        grid_cbow.best_params_["C"]
    ],
    "Best Solver": [
        grid_tfidf.best_params_["solver"],
        grid_skipgram.best_params_["solver"],
        grid_cbow.best_params_["solver"]
    ],
    "Accuracy": [
        f"{results[0]['accuracy']:.4f}",
        f"{results[1]['accuracy']:.4f}",
        f"{results[2]['accuracy']:.4f}"
    ],
    "Precision": [
        f"{results[0]['precision']:.4f}",
        f"{results[1]['precision']:.4f}",
        f"{results[2]['precision']:.4f}"
    ],
    "Recall": [
        f"{results[0]['recall']:.4f}",
        f"{results[1]['recall']:.4f}",
        f"{results[2]['recall']:.4f}"
    ],
    "F1-Score": [
        f"{results[0]['f1_score']:.4f}",
        f"{results[1]['f1_score']:.4f}",
        f"{results[2]['f1_score']:.4f}"
    ]
}

comparison_df = pd.DataFrame(comparison_data)
print("\n" + "="*80)
print("COMPREHENSIVE MODEL COMPARISON: HYPERPARAMETERS & PERFORMANCE METRICS")
print("="*80)
print(comparison_df.to_string(index=False))
print("="*80)

In [19]:
results_df = pd.DataFrame(results)
results_df

| | model | accuracy | precision | recall | f1_score |
|---|---|---|---|---|---|
| 0 | LogReg + TF-IDF | 0.984753 | 0.992537 | 0.892617 | 0.939929 |
| 1 | LogReg + Word2Vec Skip-gram | 0.984753 | 0.971429 | 0.912752 | 0.941176 |
| 2 | LogReg + Word2Vec CBOW | 0.979372 | 0.95 | 0.892617 | 0.920415 |


### Comprehensive Model Comparison

The `comparison_df` table built above places all three models side by side, pairing the best hyperparameters found by GridSearchCV with the corresponding test-set metrics.

## 10. Observations and Experimental Findings

### Key Results Summary

Based on the experimental outputs, all three models perform strongly on the SMS spam classification task, with test accuracy between 97.94% and 98.48% and spam F1 between 92.04% and 94.12%:

**1. TF-IDF Model Performance (98.48% Accuracy)**
- **Best Hyperparameters**: C=10, solver='liblinear'
- **Strengths**: Highest precision (99.25%), excellent at avoiding false positives
- **F1-Score**: 93.99%
- **Observation**: TF-IDF excels at capturing important spam keywords and n-gram patterns, making it highly effective for this task

**2. Word2Vec Skip-gram Model Performance (98.48% Accuracy)**  
- **Best Hyperparameters**: C=10, solver='lbfgs'
- **Strengths**: Best F1-score (94.12%), balanced precision-recall tradeoff, highest recall (91.28%)
- **Observation**: Mean-pooled Skip-gram embeddings give the best balance between detecting spam (recall) and maintaining precision, although the F1 lead over TF-IDF is only about 0.1 percentage points

**3. Word2Vec CBOW Model Performance (97.94% Accuracy)**
- **Best Hyperparameters**: C=10, solver='liblinear'
- **Performance**: Slightly lower than other models but still excellent (95% precision, 89.26% recall, 92.04% F1)
- **Observation**: CBOW provides good semantic representations but shows marginally lower performance compared to Skip-gram on this dataset

### Comparative Analysis

1. **Accuracy**: TF-IDF and Skip-gram tied at 98.48%, outperforming CBOW (97.94%) by a small margin
2. **Precision**: TF-IDF leads with 99.25%, followed by Skip-gram (97.14%) and CBOW (95%)
3. **Recall**: Skip-gram achieved the highest recall at 91.28%, indicating better spam detection capability
4. **F1-Score**: Skip-gram leads slightly at 94.12%, followed closely by TF-IDF (93.99%) and CBOW (92.04%)
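
The reported metrics can be traced back to a handful of individual errors. The counts below are inferred from the reported precision and recall together with the class supports (149 spam, 966 ham); they are not read from a confusion matrix in the notebook. The sketch recomputes every metric from them as a consistency check.

```python
# Test-set sizes taken from the classification reports.
TOTAL, SPAM = 1115, 149

# Inferred (true positives, false positives) per model.
counts = {
    "TF-IDF":    (133, 1),
    "Skip-gram": (136, 4),
    "CBOW":      (133, 7),
}

metrics = {}
for name, (tp, fp) in counts.items():
    fn = SPAM - tp  # spam messages the model missed
    precision = tp / (tp + fp)
    recall = tp / SPAM
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = 1 - (fp + fn) / TOTAL
    metrics[name] = {"accuracy": accuracy, "precision": precision,
                     "recall": recall, "f1": f1, "errors": fp + fn}
    print(f"{name:10s} FP={fp} FN={fn} acc={accuracy:.4f} "
          f"prec={precision:.4f} rec={recall:.4f} f1={f1:.4f}")
```

Under these counts, TF-IDF and Skip-gram each make 17 errors in total, which explains the tied accuracy: Skip-gram catches three more spam messages but raises three more false alarms. With differences this small, the ranking could plausibly change under a different random split.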

### Insights

- **TF-IDF** remains highly competitive for sparse text classification, particularly when precision is critical
- **Word2Vec Skip-gram** provides the best overall balance and may generalize better to unseen semantic variations
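- **Hyperparameter C=10** was selected for all three models. In scikit-learn, `C` is the inverse of regularization strength, so this points to weaker regularization working best; because 10 is the largest value in the grid, even larger values may be worth testing
- All models perform very well on the 'ham' class (F1 of 0.99) but leave more room for improvement on 'spam' detection (89-91% recall)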
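
For reference, the role of `C` can be read off the L2-penalized logistic regression objective. The form below follows the one given in the scikit-learn user guide, written for labels $y_i \in \{-1, +1\}$, weights $w$, and intercept $c$; treat it as a sketch of the standard formulation rather than a description of solver internals.

$$
\min_{w,\,c} \; \frac{1}{2} w^\top w \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i \left(x_i^\top w + c\right)\right)\right)
$$

Because `C` multiplies the data-fit term, increasing it reduces the relative weight of the penalty $\frac{1}{2} w^\top w$, so larger values of `C` correspond to weaker regularization.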

### Recommendations for Further Experimentation

- Experiment with larger Word2Vec vector sizes (200, 300) to capture richer semantic information  
- Increase Word2Vec training epochs beyond 20 to improve embedding quality
- Try ensemble methods combining TF-IDF and Word2Vec features
- Explore class-weighted logistic regression to improve spam recall
- Consider using pre-trained embeddings (GloVe, FastText) for comparison
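
As a starting point for the class-weighting idea, the sketch below compares an unweighted model with `class_weight="balanced"` on synthetic imbalanced data. The data from `make_classification`, the 13% positive rate, and the feature count are all assumptions for illustration; no result here says anything about the SMS dataset itself.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for real features (about 13% positives).
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.87, 0.13], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

recalls = {}
for label, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    # class_weight="balanced" reweights samples inversely to class frequency.
    model = LogisticRegression(
        max_iter=1000, class_weight=class_weight, random_state=42
    )
    model.fit(X_train, y_train)
    recalls[label] = recall_score(y_test, model.predict(X_test))
    print(f"{label:10s} positive-class recall = {recalls[label]:.3f}")
```

Class weighting typically trades some precision for recall, so it would be worth tracking both metrics, for example by adding `"class_weight": [None, "balanced"]` to the existing `param_grid`.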