
# **Email Spam Classification using Machine Learning**
**Author:** Ameen Mohammad

This notebook implements a full machine learning pipeline for spam classification.



## **1. Data Exploration**

We begin by exploring the dataset: checking for missing values, class distribution, duplicate entries, and other key statistics.


In [None]:
import pandas as pd

# Load dataset
df = pd.read_csv('/content/spam.csv', encoding='latin-1')

# Keep only relevant columns
df = df[['label', 'Text']]
df.columns = ['label', 'text']

# Display dataset info
df.info()

# Display first few rows
df.head()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5574 entries, 0 to 5573
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   label   5574 non-null   object
 1   text    5572 non-null   object
dtypes: object(2)
memory usage: 87.2+ KB


Unnamed: 0,label,text
0,notspam,"Go until jurong point, crazy.. Available only ..."
1,notspam,"Go until jurong point, crazy.. Available only ..."
2,notspam,Ok lar... Joking wif u oni...
3,spam,Free entry in 2 a wkly comp to win FA Cup fina...
4,notspam,


In [60]:
# Handle missing values before calculating text length
df['text'] = df['text'].fillna('')  # Replace NaN with an empty string

# Check dataset shape
print(f"Dataset Shape: {df.shape}\n")

# Count the number of spam vs. not spam emails
print("Spam vs. Not Spam Counts:")
print(df['label'].value_counts(), "\n")

# Check for missing values
print("Missing Values per Column:")
print(df.isnull().sum(), "\n")

# Identify duplicate rows
duplicate_count = df.duplicated().sum()
print(f"Number of Duplicate Entries: {duplicate_count}\n")

# Extra Statistic: Calculate average email length
df['text_length'] = df['text'].apply(len)  # Now safe to apply len()
avg_length = df['text_length'].mean()
print(f"Average Email Length: {avg_length:.2f} characters\n")

# Display summary statistics
df.describe()


Dataset Shape: (5574, 2)

Spam vs. Not Spam Counts:
label
notspam    4827
spam        747
Name: count, dtype: int64 

Missing Values per Column:
label    0
text     0
dtype: int64 

Number of Duplicate Entries: 406

Average Email Length: 80.11 characters



Unnamed: 0,text_length
count,5574.0
mean,80.10531
std,59.698668
min,0.0
25%,36.0
50%,61.0
75%,121.0
max,910.0
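
The output above reports 406 duplicate rows. If they stay in the data, identical messages can end up in both the training and test sets, which may inflate test scores. The following is a minimal sketch of removing them; the `toy` frame and its contents are illustrative stand-ins for `df`, not data from the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's df (illustrative data, not from the dataset)
toy = pd.DataFrame({
    "label": ["notspam", "notspam", "spam", "notspam"],
    "text": ["see you soon", "see you soon", "win a prize now", "ok"],
})

# Drop exact (label, text) repeats, keep the first occurrence, and reindex
deduped = toy.drop_duplicates(subset=["label", "text"], keep="first").reset_index(drop=True)

print(len(toy), "->", len(deduped))
```

Resetting the index matters for a notebook like this one, which later selects rows of the feature matrices by index position.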



## **2. Data Cleaning**

This section applies text preprocessing techniques to prepare the data for model training.


In [61]:

# Import necessary libraries for text preprocessing
import re
import string
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download necessary resources
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

# Convert text to lowercase
df['text'] = df['text'].str.lower()

# Remove punctuation, numbers, and special characters
df['text'] = df['text'].apply(lambda x: re.sub(r'[^a-z\s]', '', x))

# Tokenization
df['text'] = df['text'].apply(word_tokenize)

# Remove stopwords
stop_words = set(stopwords.words('english'))
df['text'] = df['text'].apply(lambda tokens: [word for word in tokens if word not in stop_words])

# Apply stemming
stemmer = PorterStemmer()
df['text'] = df['text'].apply(lambda tokens: [stemmer.stem(word) for word in tokens])

# Join tokens back into a single string
df['text'] = df['text'].apply(lambda tokens: ' '.join(tokens))

# Display first few rows after preprocessing
df.head()


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,label,text,text_length
0,notspam,go jurong point crazi avail bugi n great world...,111
1,notspam,go jurong point crazi avail bugi n great world...,111
2,notspam,ok lar joke wif u oni,29
3,spam,free entri wkli comp win fa cup final tkt st m...,155
4,notspam,,0


### **Justification for Data Cleaning Steps**
1. **Convert text to lowercase**:
   - Ensures uniformity in the text data. For example, "SPAM" and "spam" are treated as the same word.
   - Reduces the dimensionality of the feature space by removing case-sensitive duplicates.

2. **Remove punctuation, numbers, and special characters**:
   - Punctuation and special characters are treated as noise here. This keeps the vocabulary small, at the cost of discarding possible spam cues such as "$" or "!!!".
   - Numbers are removed as well. Because this happens before vectorization, numeric patterns (e.g., "win $1000") are not available to the n-gram features; only the surrounding words are.

3. **Tokenization**:
   - Tokenization splits text into individual words or tokens, which is a necessary step for stopword removal, stemming, and feature extraction (e.g., TF-IDF, BoW).

4. **Remove stopwords**:
   - Stopwords (e.g., "the", "and", "is") are common words that carry little signal for spam detection.
   - Removing them reduces noise and shrinks the feature space.

5. **Apply stemming**:
   - Stemming reduces words to their root form (e.g., "running" → "run"). This groups related word forms together, reducing the feature space and improving model generalization.
   - For example, "spam", "spamming", and "spammed" are all reduced to "spam".
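
A quick check of what the cleaning regex does to a message that contains an amount may be useful here. In the sketch below, `strip_non_letters` mirrors the regex used in the cleaning cell, while `keep_number_marker` is a hypothetical variant (not part of the notebook) that replaces each digit run with a placeholder token so that the presence of a number can still reach the n-gram features.

```python
import re

def strip_non_letters(text):
    """Mirror the notebook's cleaning step: lowercase, keep only a-z and whitespace."""
    return re.sub(r"[^a-z\s]", "", text.lower())

def keep_number_marker(text):
    """Hypothetical variant: turn each digit run into the token 'num' first."""
    text = re.sub(r"\d+", " num ", text.lower())
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(text.split())  # collapse repeated whitespace

sample = "Win $1000 now!"
print(strip_non_letters(sample).split())   # the amount is gone entirely
print(keep_number_marker(sample).split())  # a 'num' token survives
```

Whether a placeholder token helps is an empirical question for this dataset; the sketch only shows what information each variant keeps.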


## **3. Feature Engineering**

This section extracts numerical features from the text data to train machine learning models.


In [62]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Convert text into TF-IDF features
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)  # Unigrams and Bigrams
X_tfidf = tfidf_vectorizer.fit_transform(df['text'])

# Convert text into BoW features
bow_vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=5000)  # Unigrams and Bigrams
X_bow = bow_vectorizer.fit_transform(df['text'])

# Encode labels
df['label'] = df['label'].map({'spam': 1, 'notspam': 0})
y = df['label']

# Display shape of feature matrices
print(f"TF-IDF Feature Matrix Shape: {X_tfidf.shape}")
print(f"BoW Feature Matrix Shape: {X_bow.shape}")

TF-IDF Feature Matrix Shape: (5574, 5000)
BoW Feature Matrix Shape: (5574, 5000)


### **Why Use TF-IDF for Feature Extraction?**
TF-IDF (**Term Frequency-Inverse Document Frequency**) was chosen as the primary feature extraction method because:
- **It reduces the impact of common words:** Unlike Bag-of-Words (BoW), TF-IDF assigns lower weights to terms that appear in many messages and higher weights to terms concentrated in few. Since stopwords such as "the" and "and" are already removed during cleaning, this mainly affects the remaining high-frequency terms.
- **It captures more meaningful word importance:** TF-IDF emphasizes distinctive words that differentiate spam from non-spam.
- **It helps prevent bias towards long documents:** Since BoW counts raw word occurrences, longer texts produce larger feature values. `TfidfVectorizer` L2-normalizes each document vector by default, which removes this length effect.
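
To make the weighting concrete, the sketch below computes the smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, which is the form scikit-learn's `TfidfVectorizer` uses with its default `smooth_idf=True`. The toy corpus and the helper name `smoothed_idf` are illustrative assumptions, not part of the notebook.

```python
import math

# Toy corpus of already-tokenized messages (illustrative only)
docs = [
    ["free", "prize", "call"],
    ["call", "me", "later"],
    ["free", "call", "now"],
    ["see", "you", "later"],
]

def smoothed_idf(term, corpus):
    """idf(t) = ln((1 + n) / (1 + df(t))) + 1, with n documents and
    df(t) documents containing the term."""
    n = len(corpus)
    df = sum(1 for doc in corpus if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

# Terms found in fewer documents receive a larger weight
for term in ["call", "free", "prize"]:
    print(term, round(smoothed_idf(term, docs), 3))
```

The term that appears in three of four messages ("call") gets the smallest weight, and the one that appears in a single message ("prize") gets the largest.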

### **Why Use BoW for Feature Extraction?**
Bag-of-Words (BoW) was chosen as a secondary feature extraction method because:
- It is simple and effective for capturing word frequency.
- It works well with n-grams (unigrams and bigrams), which can help capture sequences of words that are important for spam detection.
- It provides a baseline for comparison with more advanced methods like TF-IDF.

### **Comparison with BoW**
- **BoW** captures word frequency but does not differentiate between important and unimportant words.
- **TF-IDF** gives **less weight** to common words and **higher weight** to rare but meaningful words.
- Since spam detection relies on identifying **key spam-related words (e.g., "win", "free", "credit")**, TF-IDF is expected to be more effective; Section 5 tests this expectation against BoW.

### **Conclusion**
Because TF-IDF assigns higher importance to distinguishing words and normalizes word frequency, it is used as the primary representation for this task, with BoW kept as a baseline for comparison.



## **4. Model Training**

This section trains multiple machine learning models for spam classification.


In [63]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Split indices once
indices = y.index
train_indices, test_indices = train_test_split(indices, test_size=0.2, random_state=42)

# Split TF-IDF features using the same indices
X_train_tfidf = X_tfidf[train_indices]
X_test_tfidf = X_tfidf[test_indices]

# Split BoW features using the same indices
X_train_bow = X_bow[train_indices]
X_test_bow = X_bow[test_indices]

# Split labels using the same indices
y_train = y[train_indices]
y_test = y[test_indices]

# Train Logistic Regression (TF-IDF)
log_reg = LogisticRegression()
log_reg.fit(X_train_tfidf, y_train)

# Train Support Vector Machine (SVM) (TF-IDF)
svm = SVC()
svm.fit(X_train_tfidf, y_train)

# Train Random Forest (TF-IDF); a fixed seed makes reruns reproducible
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_tfidf, y_train)

# Train Logistic Regression using BoW
log_reg_bow = LogisticRegression()
log_reg_bow.fit(X_train_bow, y_train)

# Display training completion message
print("Models trained successfully on both TF-IDF and BoW features.")

Models trained successfully on both TF-IDF and BoW features.


### **Justification for Model Selection**
1. **Logistic Regression**:
   - Logistic Regression is a simple and interpretable model that works well for binary classification tasks like spam detection.
   - It is computationally efficient and performs well with high-dimensional sparse data (e.g., text data represented by TF-IDF or BoW).

2. **Support Vector Machine (SVM)**:
   - SVM is effective for high-dimensional data and can handle non-linear decision boundaries using kernel functions.
   - It is particularly well-suited for text classification because it can capture complex relationships between features (e.g., TF-IDF vectors).

3. **Random Forest**:
   - Random Forest is an ensemble method that combines multiple decision trees to improve generalization and reduce overfitting.
   - It is robust to noise and can handle non-linear relationships in the data, making it a good choice for text classification.
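
One possible refinement of the training setup is to fit the vectorizer on the training split only and to stratify the split, so that no test-set vocabulary statistics reach the model and both splits keep the same spam ratio. The sketch below shows this with a scikit-learn pipeline on toy data; the texts, labels, and `max_iter` value are illustrative assumptions rather than the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy data (illustrative only): 1 = spam, 0 = notspam
texts = ["free prize call now", "win cash free entry", "free win prize",
         "claim free cash now", "see you later", "ok call me later",
         "lunch at noon", "running late sorry", "are you home", "thanks see you"]
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# stratify keeps the spam/notspam ratio similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42, stratify=labels
)

# The vectorizer is fit inside the pipeline, so it only sees training text
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)
preds = model.predict(X_test)
print(len(preds), "test predictions")
```

Swapping `LogisticRegression` for `SVC` or `RandomForestClassifier` in the same pipeline would keep the comparison between models on equal footing.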


## **5. Model Evaluation**

This section evaluates model performance using accuracy, precision, recall, and F1-score.
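
Accuracy alone can be misleading on this dataset because the classes are imbalanced (4827 notspam vs. 747 spam in the exploration output). The sketch below computes the four metrics by hand for a degenerate classifier that labels every message as notspam; the helper `binary_metrics` is an illustrative name, not a scikit-learn function.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, plus precision, recall, and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Class counts reported in the data exploration output
y_true = [0] * 4827 + [1] * 747
# A degenerate classifier that predicts notspam for every message
y_pred = [0] * len(y_true)

acc, prec, rec, f1 = binary_metrics(y_true, y_pred)
print(f"accuracy={acc:.4f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Such a classifier is right on every notspam message yet catches no spam at all, which is why spam recall and F1-score are reported alongside accuracy below.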


In [64]:
from sklearn.metrics import accuracy_score, classification_report

# Evaluate Logistic Regression (TF-IDF)
y_pred_logreg = log_reg.predict(X_test_tfidf)
print("### Logistic Regression (TF-IDF) ###")
print(f"Accuracy: {accuracy_score(y_test, y_pred_logreg):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_logreg))
print("="*50)

# Evaluate SVM (TF-IDF)
y_pred_svm = svm.predict(X_test_tfidf)
print("### SVM (TF-IDF) ###")
print(f"Accuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_svm))
print("="*50)

# Evaluate Random Forest (TF-IDF)
y_pred_rf = rf.predict(X_test_tfidf)
print("### Random Forest (TF-IDF) ###")
print(f"Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("="*50)

# Evaluate Logistic Regression using BoW
y_pred_bow = log_reg_bow.predict(X_test_bow)
print("### Logistic Regression (BoW) ###")
print(f"Accuracy: {accuracy_score(y_test, y_pred_bow):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred_bow))
print("="*50)



### Logistic Regression (TF-IDF) ###
Accuracy: 0.9587
Classification Report:
              precision    recall  f1-score   support

           0       0.95      1.00      0.98       944
           1       0.99      0.74      0.85       171

    accuracy                           0.96      1115
   macro avg       0.97      0.87      0.91      1115
weighted avg       0.96      0.96      0.96      1115

### SVM (TF-IDF) ###
Accuracy: 0.9776
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99       944
           1       0.99      0.86      0.92       171

    accuracy                           0.98      1115
   macro avg       0.98      0.93      0.95      1115
weighted avg       0.98      0.98      0.98      1115

### Random Forest (TF-IDF) ###
Accuracy: 0.9713
Classification Report:
              precision    recall  f1-score   support

           0       0.97      1.00      0.98       944
           1       0.98     

### **Comparison of Feature Extraction Methods: TF-IDF vs. BoW**
To compare different text feature extraction techniques, I tested:
- **TF-IDF (Term Frequency-Inverse Document Frequency)**
- **BoW (Bag-of-Words with Unigrams and Bigrams)**

### **Results**
- **TF-IDF Accuracy (Best Model - SVM):** 97.76%
- **BoW Accuracy (Logistic Regression):** 98.21%
- **Spam recall and F1-score:** BoW with Logistic Regression scored higher on both (89% recall, 94% F1) than the best TF-IDF model, SVM (86% recall, 92% F1), and it also had the highest overall accuracy.

### **Final Model Comparison Table**
| Model                   | Feature Extraction | Accuracy   | Spam Recall | Spam F1-score |
|-------------------------|--------------------|------------|-------------|---------------|
| **Logistic Regression** | TF-IDF             | **95.87%** | **74%**     | **85%**       |
| **SVM**                 | TF-IDF             | **97.76%** | **86%**     | **92%**       |
| **Random Forest**       | TF-IDF             | **97.49%** | **86%**     | **91%**       |
| **Logistic Regression** | BoW                | **98.21%** | **89%**     | **94%**       |


### **Conclusion**
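- **With the same Logistic Regression model, BoW outperformed TF-IDF in this experiment (98.21% vs. 95.87% accuracy, 89% vs. 74% spam recall), even though raw counts give no extra weight to rarer, more distinctive terms.**
- **TF-IDF remains a reasonable choice because it down-weights terms common to many messages, but these results do not show it outperforming BoW.**
- **A hybrid approach combining TF-IDF & BoW features could further improve spam classification performance.**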
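
One way the hybrid idea could be sketched is to stack the two sparse feature matrices side by side, so that each message is described by both its TF-IDF weights and its raw counts. The toy texts and variable names below are illustrative; in a real run the vectorizers would be fit on the training split only.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy, already-cleaned messages (illustrative only)
texts = ["free prize call now", "see you later", "win cash free entri"]

tfidf = TfidfVectorizer(ngram_range=(1, 2))
bow = CountVectorizer(ngram_range=(1, 2))

X_tfidf_toy = tfidf.fit_transform(texts)
X_bow_toy = bow.fit_transform(texts)

# Column-wise concatenation: rows stay aligned, columns are appended
X_hybrid = hstack([X_tfidf_toy, X_bow_toy]).tocsr()

print(X_tfidf_toy.shape, X_bow_toy.shape, X_hybrid.shape)
```

Because counts and TF-IDF weights live on different scales, a scale-sensitive model such as an SVM may need the count block rescaled before this is a fair test.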



## **6. Conclusion**

### **Summary of Findings**
- Logistic Regression, SVM, and Random Forest were tested.
- Model performance varied based on feature extraction method.
- Future improvements could include deeper text embeddings (e.g., Word2Vec) or hyperparameter tuning.

### **Best Performing Model**
- **Logistic Regression (BoW)** achieved the highest accuracy (98.21%), spam recall (89%), and spam F1-score (94%).
- Among the TF-IDF models, **SVM** performed best (97.76% accuracy, 86% recall, 92% F1-score), with **Random Forest** close behind at the same recall (86%) and a 91% F1-score.

### **Final Choice**
- On the metrics reported here, BoW with Logistic Regression is the best choice for both **overall accuracy** and **spam detection (recall)**.
- If TF-IDF features are required, SVM with TF-IDF is the strongest option.

### **Future Improvements**
To further improve spam classification:
- Try a hybrid model combining TF-IDF and BoW features for better accuracy and recall.
- Use deep learning (LSTMs or Transformers like BERT) to improve performance.
- Fine-tune hyperparameters for the best-performing models.
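
As a starting point for the tuning step, the sketch below runs a small cross-validated grid search over an SVM pipeline and scores candidates by spam F1. The toy data, the parameter grid, and the choice of `cv=3` are assumptions made for illustration, not values taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data (illustrative only): 1 = spam, 0 = notspam
texts = ["free prize call now", "win cash free entry", "free win prize today",
         "claim free cash now", "urgent win free ticket", "free entry win cash prize",
         "see you later", "ok call me later", "lunch at noon today",
         "running late sorry", "are you home now", "thanks see you soon"]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("svm", SVC())])

# Hypothetical grid; step-name prefixes route each value to the right step
param_grid = {"svm__C": [0.1, 1, 10], "svm__kernel": ["linear", "rbf"]}

# scoring="f1" targets the positive (spam) class rather than raw accuracy
search = GridSearchCV(pipe, param_grid, scoring="f1", cv=3)
search.fit(texts, labels)

print(search.best_params_)
```

Running the search on the pipeline, rather than on a precomputed feature matrix, refits the vectorizer inside each fold so the validation folds stay unseen.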