# Spam Email Detection

## Introduction

This project focuses on detecting spam emails using a dataset containing information from 5,172 randomly selected email files. The goal is to build a classification model that can accurately distinguish between spam and not-spam emails based on the content of the emails.

## Source

This dataset is available on Kaggle at the following link:

> https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

## About the Dataset

The dataset is provided in a CSV file with the following characteristics:

- **Rows**: 5,172 rows, each representing an individual email.
- **Columns**: 3,002 columns in total.
  - **First Column**: Indicates the email name. The names have been anonymized with numbers to protect privacy.
  - **Last Column**: Contains the labels for classification:
    - `1` for spam emails.
    - `0` for not-spam emails.
  - **Remaining 3,000 Columns**: These columns represent the 3,000 most common words across all emails, excluding non-alphabetical characters. Each cell in these columns contains the count of the respective word in the corresponding email.

This compact representation allows for efficient processing and analysis of email data without needing to work with separate text files.
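
To make the word-count representation concrete, here is a minimal sketch of how such a table could be built from raw text. It is not the procedure the dataset authors used: the `tokenize` and `build_count_table` helpers, the regular expression, and the three toy emails are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: keep lowercase alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def build_count_table(emails, vocab_size):
    # The vocabulary is the vocab_size most common words across all emails.
    totals = Counter()
    for text in emails:
        totals.update(tokenize(text))
    vocab = [word for word, _ in totals.most_common(vocab_size)]

    # One row per email, one column per vocabulary word, cells hold counts.
    rows = []
    for text in emails:
        counts = Counter(tokenize(text))
        rows.append([counts[word] for word in vocab])
    return vocab, rows


# Hypothetical toy emails, not taken from the dataset.
emails = [
    "Win money now, win big money",
    "Meeting moved to Monday",
    "Claim your money now",
]
vocab, rows = build_count_table(emails, vocab_size=3)
for row in rows:
    print(dict(zip(vocab, row)))
```

In the real dataset the same idea is applied with a 3,000-word vocabulary, which is why most cells are zero.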

## Problem Statement

1. **Model Training**: Train a model on the training set so that it can identify whether an email is spam.
2. **Model Evaluation**: Evaluate the trained model using accuracy, precision, recall, and F1 score.
3. **Model Optimization**: Optimize the model with cross-validation and hyperparameter tuning.

### Load Libraries

In [50]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Model and evaluation metrics
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Optimization
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

# Oversampling
from imblearn.over_sampling import SMOTE

### Settings

In [55]:
# Warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
model_path = "../models"
# csv_path = os.path.join(data_path, "emails_fr.csv")
csv_path = os.path.join(data_path, "emails_or.csv")
# csv_path = os.path.join(data_path, "emails_pca.csv")

### Load Data

In [56]:
df = pd.read_csv(csv_path)

In [57]:
# Check Data
df.head()

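| Unnamed: 0 | the | to | ect | and | for | of | a | you | hou | in | ... | connevey | jay | valued | lay | infrastructure | military | allowing | ff | dry | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 8 | 13 | 24 | 6 | 6 | 2 | 102 | 1 | 27 | 18 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 8 | 0 | 0 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 5 | 22 | 0 | 5 | 1 | 51 | 2 | 10 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 7 | 6 | 17 | 1 | 5 | 2 | 57 | 0 | 9 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |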


### Preprocessing

In [58]:
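# Separate input features and target
# Note: X keeps the first column ("Unnamed: 0"), which looks like an email identifier rather than a word count
X = df.iloc[:, :-1]
y = df.iloc[:, -1]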

In [59]:
# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= 42)

In [60]:
# Standardize the data (fit on the training set only)
scaler = StandardScaler()
# scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

### Model Building and Evaluation

In [61]:
# Function to train and evaluate a model
# Relies on the notebook-level X_train_s, X_test_s, and y_test
def train_evaluate(model, y_train):
    # Train the model on the scaled training set
    model.fit(X_train_s, y_train)

    # Predict on training and testing data
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Print evaluation metrics for training and testing
    print("=" * 60)
    print("EVALUATION METRICS FOR TRAINING")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred):.3f}")
    print(f"Precision: {precision_score(y_train, y_train_pred):.3f}")
    print(f"Recall: {recall_score(y_train, y_train_pred):.3f}")
    print(f"F1: {f1_score(y_train, y_train_pred):.3f}\n")
    print("=" * 60)
    print("EVALUATION METRICS FOR TESTING")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred):.3f}")
    print(f"Precision: {precision_score(y_test, y_test_pred):.3f}")
    print(f"Recall: {recall_score(y_test, y_test_pred):.3f}")
    print(f"F1: {f1_score(y_test, y_test_pred):.3f}")

In [62]:
# Train the KNN Classifier with training data and evaluate performance with 4 metrics
knn = KNeighborsClassifier()
train_evaluate(knn, y_train)

EVALUATION METRICS FOR TRAINING
Accuracy: 0.897
Precision: 0.742
Recall: 0.972
F1: 0.841

EVALUATION METRICS FOR TESTING
Accuracy: 0.849
Precision: 0.642
Recall: 0.972
F1: 0.773


### Insights

The evaluation metrics for the KNN classifier on the Spam Email classification task show very high recall but modest precision, with a moderate drop from training to testing. Here’s a detailed analysis:

#### Training Metrics (High scores):

- **Accuracy (0.90)**: The model correctly classifies **89.7%** of the training data. This indicates that the model has learned the patterns in the training data well.
- **Precision (0.74)**: The precision of **74.2%** means that of all the emails classified as spam, **74.2%** were actually spam. The model is making a noticeable number of false positive predictions, where legitimate emails are incorrectly classified as spam.
- **Recall (0.97)**: Recall is very high at **97.2%**, indicating that the model is correctly identifying almost all actual spam emails in the training data.
- **F1 Score (0.84)**: The F1 score, which balances precision and recall, is **84.1%**, indicating good overall performance. The lower precision pulls the F1 score down a bit, suggesting a need to reduce false positives.

#### Testing Metrics (Slightly lower):

- **Accuracy (0.85)**: The model correctly classifies **84.9%** of the testing data. The gap of about five percentage points from the training accuracy (**89.7%**) indicates slight overfitting.
- **Precision (0.64)**: The precision on the test set is **64.2%**, which is lower than on the training set. This suggests the model is making more false positive predictions on the test data (i.e., classifying legitimate emails as spam).
- **Recall (0.97)**: The recall on the test set is **97.2%**, meaning the model correctly identifies **97.2%** of the actual spam emails.
- **F1 Score (0.77)**: The F1 score for the test data is **77.3%**, which is lower than the training F1 score (**84.1%**). Since recall is unchanged, this drop is driven entirely by the lower precision on the test set.
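
As a quick consistency check on the figures above, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The sketch below uses only the rounded numbers printed by the notebook; the helper name `f1_from` is an illustrative choice.

```python
def f1_from(precision, recall):
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) as printed by the baseline KNN run.
reported = {
    "training": (0.742, 0.972, 0.841),
    "testing": (0.642, 0.972, 0.773),
}

for split, (precision, recall, f1_reported) in reported.items():
    f1 = f1_from(precision, recall)
    print(f"{split}: recomputed F1 = {f1:.4f}, reported F1 = {f1_reported:.3f}")
```

The recomputed values agree with the reported ones to within the rounding of the inputs.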

#### Analysis of Performance:

- **High Recall, Low Precision:** On both the training and testing data, the model has very high recall (**97.2%**) but relatively low precision, especially on the test set (**64.2%**). The model catches almost all spam emails (few false negatives) but flags many legitimate emails as spam (many false positives). **In spam detection**, high recall keeps spam out of the inbox, but which error is costlier depends on the application: with low precision, important emails might be wrongly classified as spam and go unnoticed, which frustrates users.
- **Generalization:** The model performs somewhat worse on the test set than on the training set, which is expected. The ten-point drop in precision (**from 74.2% to 64.2%**) suggests some overfitting when it comes to distinguishing legitimate from spam emails. Generalization could be improved through hyperparameter tuning; for KNN, a larger number of neighbors has a regularizing effect.
- **F1 Score:** The F1 score balances precision and recall, and its drop from training (**0.841**) to testing (**0.773**) reflects the precision issue. Since the F1 score is lower on the test set, it indicates that the model’s ability to balance catching spam emails while minimizing false positives deteriorates slightly when applied to new data.
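
One way to trade some recall for precision, not used in this notebook, is to raise the probability threshold at which an email is labeled spam. The sketch below illustrates the idea on synthetic data from `make_classification`; the data, the thresholds, and the variable names are assumptions, and the numbers it prints say nothing about the email dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, mildly imbalanced stand-in for the email data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
# Fraction of the 5 neighbors that are labeled 1 ("spam").
spam_proba = clf.predict_proba(X_te)[:, 1]

results = []
for threshold in (0.5, 0.7, 0.9):
    y_pred = (spam_proba >= threshold).astype(int)
    precision = precision_score(y_te, y_pred, zero_division=0)
    recall = recall_score(y_te, y_pred)
    results.append((threshold, precision, recall))
    print(f"threshold={threshold}: precision={precision:.3f}, recall={recall:.3f}")
```

Raising the threshold can only shrink the set of emails flagged as spam, so recall never increases; whether precision improves enough to justify the loss has to be checked on held-out data.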

### Model Optimization

- Try to find the optimal model using hyperparameter tuning and cross-validation.

In [63]:
# K-fold cross-validation (5 folds, no shuffling)
kf = KFold(n_splits=5)

knn_cv = KNeighborsClassifier()

# Mean accuracy across folds; note this runs on the full, unscaled X
cvs = cross_val_score(knn_cv, X, y, cv=kf)
print(cvs.mean())

0.8429128119838767


In [64]:
# Hyperparameter tuning
def tune_hyperparameter(model, param_dict):
    # Define the grid search, scored on precision
    gscv = GridSearchCV(estimator=model,
                        param_grid=param_dict,
                        cv=5,
                        verbose=1, scoring="precision")
    # Fit the search; note this uses the full, unscaled X and y
    gscv.fit(X, y)

    # Get the best parameters and print the best score
    best_params = gscv.best_params_
    print(f"Best Score: {gscv.best_score_}")

    return best_params
    

In [65]:
# Define Hyperparameter
param_dict = {
    "n_neighbors": [5, 7, 9, 11, 13],
    "weights": ["uniform", "distance"]
}
cv_model = KNeighborsClassifier()
best_params = tune_hyperparameter(cv_model, param_dict)
print(f"Best Parameter set: {best_params}")

Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best Score: 0.7175680862904007
Best Parameter set: {'n_neighbors': 13, 'weights': 'distance'}
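
The search above runs on the full, unscaled dataset, while the final model is trained on standardized features. A possible alternative, sketched here on synthetic data rather than on the email dataset, is to wrap the scaler and the classifier in a `Pipeline` and tune on the training split only, so that scaling is refit inside each fold and the test set stays unseen. The synthetic data, the stratified shuffled folds, and the variable names are assumptions of this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import precision_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the email features and labels.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=30, weights=[0.7, 0.3], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Scaling happens inside the pipeline, so each CV fold fits its own scaler.
pipe = Pipeline([("scaler", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {
    "knn__n_neighbors": [5, 7, 9, 11, 13],
    "knn__weights": ["uniform", "distance"],
}
search = GridSearchCV(
    pipe,
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="precision",
)
search.fit(X_tr, y_tr)  # training split only

test_precision = precision_score(y_te, search.predict(X_te))
print(search.best_params_)
print(f"CV precision: {search.best_score_:.3f}, test precision: {test_precision:.3f}")
```

With this layout the cross-validated score and the test score are measured under the same preprocessing, which makes them easier to compare.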


In [66]:
# train and evaluate with best parameters
model = KNeighborsClassifier(**best_params)
train_evaluate(model, y_train)

EVALUATION METRICS FOR TRAINING
Accuracy: 1.000
Precision: 1.000
Recall: 1.000
F1: 1.000

EVALUATION METRICS FOR TESTING
Accuracy: 0.795
Precision: 0.565
Recall: 0.992
F1: 0.720


### Insights

After hyperparameter tuning for the Spam Email classification with the KNN classifier, I obtained the following results:

#### Training Metrics (Perfect scores):

- **Accuracy (1.0)**: The model achieves perfect accuracy on the training set, classifying all emails (both spam and legitimate) correctly.
- **Precision (1.0)**: The precision is also perfect, meaning that every email classified as spam was indeed spam.
- **Recall (1.0)**: The model detects **100%** of the actual spam emails in the training data (no false negatives).
- **F1 Score (1.0)**: The F1 score is a perfect **1.0**, reflecting the model's flawless performance on the training data.
  
#### Testing Metrics (Noticeably lower):

- **Accuracy (0.80)**: On the test set, the model's accuracy drops to **79.5%**, indicating that the model does not generalize as well to unseen data.
- **Precision (0.57)**: Precision is quite low (**56.5%**), meaning that only **56.5%** of the emails classified as spam on the test set are actually spam. This suggests a high number of false positives (legitimate emails classified as spam).
- **Recall (0.99)**: Recall remains very high (**99.2%**), indicating that the model successfully identifies almost all spam emails in the test set, with very few false negatives.
- **F1 Score (0.72)**: The F1 score is **0.72**, reflecting the imbalance between precision and recall, but it shows reasonable overall performance, albeit not as good as on the training data.

#### Analysis of Performance:

- **High Recall, Low Precision:** The recall on the test set remains very high (**99.2%**), meaning the model is still very good at catching almost all spam emails. However, the precision (**56.5%**) is quite low, meaning the model is classifying a significant number of legitimate emails as spam (false positives). Low precision could be problematic in a real-world spam detection scenario, where users might miss important legitimate emails if they are wrongly classified as spam.
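- **Overfitting:** The perfect training scores are largely an artifact of the selected `weights="distance"` setting rather than of a small **k** (the search selected `n_neighbors=13`). With distance weighting, each training email is its own nearest neighbor at distance zero, so its own label dominates the vote and the training metrics are trivially perfect. The training scores therefore say little about generalization. The test metrics are the meaningful comparison, and they fell relative to the default model: accuracy from **84.9%** to **79.5%** and precision from **64.2%** to **56.5%**. Note also that the grid search ran on the unscaled data, while the final model was trained on standardized features, so the selected parameters may not be the best ones for the scaled data.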
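
One factor worth checking when a KNN model scores perfectly on its own training data is the `weights="distance"` setting chosen by the grid search. With distance weighting, a training point queried against itself sits at distance zero and its own label decides the prediction. The sketch below, which uses random features and random labels as an assumed stand-in, shows that training accuracy is perfect even when there is nothing to learn.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Pure noise: features and labels are unrelated.
rng = np.random.default_rng(0)
X_noise = rng.normal(size=(200, 5))
y_noise = rng.integers(0, 2, size=200)

acc = {}
for weights in ("uniform", "distance"):
    knn_demo = KNeighborsClassifier(n_neighbors=13, weights=weights)
    knn_demo.fit(X_noise, y_noise)
    # Accuracy on the same data the model was fitted on.
    acc[weights] = knn_demo.score(X_noise, y_noise)
    print(f"weights={weights}: training accuracy = {acc[weights]:.3f}")
```

Because the distance-weighted model reaches perfect training accuracy on noise, a perfect training score with this setting is not by itself evidence about how well the model has learned the data.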

### Oversampling

In [67]:
# Define SMOTE
smote = SMOTE()
X_train_r, y_train_r = smote.fit_resample(X_train, y_train)
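
SMOTE does not duplicate rows; it creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbors. The following is a simplified NumPy sketch of that core step, not imblearn's implementation; `smote_like_sample` and the toy points are hypothetical.

```python
import numpy as np


def smote_like_sample(X_minority, k, rng):
    # Pick a random minority point.
    i = rng.integers(len(X_minority))
    x = X_minority[i]

    # Find its k nearest minority neighbors (excluding itself).
    dists = np.linalg.norm(X_minority - x, axis=1)
    dists[i] = np.inf
    neighbor_idx = np.argsort(dists)[:k]

    # Interpolate a random fraction of the way toward one neighbor.
    j = rng.choice(neighbor_idx)
    gap = rng.random()
    return x + gap * (X_minority[j] - x)


rng = np.random.default_rng(42)
# Hypothetical minority-class points at the corners of the unit square.
X_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = np.array(
    [smote_like_sample(X_minority, k=2, rng=rng) for _ in range(5)]
)
print(synthetic)
```

Each synthetic point lies on a line segment between two existing minority points, so oversampling fills in the space between spam examples rather than adding new information.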

In [68]:
# Standardize the resampled training data (overwrites X_train_s and X_test_s)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train_r)
X_test_s = scaler.transform(X_test)

In [72]:
# Try KNN Classifier
knn_r = KNeighborsClassifier(**best_params)
train_evaluate(knn_r, y_train_r)

EVALUATION METRICS FOR TRAINING
Accuracy: 1.000
Precision: 1.000
Recall: 1.000
F1: 1.000

EVALUATION METRICS FOR TESTING
Accuracy: 0.663
Precision: 0.440
Recall: 1.000
F1: 0.611


### Conclusion

After applying oversampling on the training data, the evaluation metrics are as follows:

#### Training Metrics (Perfect scores):

- **Accuracy (1.0)**: The model achieves perfect accuracy on the training set, classifying all emails (both spam and legitimate) correctly.
- **Precision (1.0)**: The precision is also perfect, meaning that every email classified as spam was indeed spam.
- **Recall (1.0)**: The model detects **100%** of the actual spam emails in the training data (no false negatives).
- **F1 Score (1.0)**: The F1 score is a perfect **1.0**, reflecting the model's flawless performance on the training data.

#### Testing Metrics:

- **Accuracy (0.66)**: The accuracy on the test set is quite low (**66.3%**), showing that the model is struggling to generalize well to unseen data.
- **Precision (0.44)**: Precision has dropped significantly to **44%**, meaning that less than half of the emails classified as spam are actually spam. This suggests a high number of false positives (legitimate emails misclassified as spam).
- **Recall (1.0)**: Recall is perfect, meaning that the model identifies all actual spam emails in the test set. This implies that the model is very aggressive in classifying emails as spam, hence it catches every spam email but at the cost of many false positives.
- **F1 Score (0.61)**: The F1 score is **0.611**, which reflects the imbalance between precision and recall. Despite high recall, the low precision drags down the overall performance.

#### Analysis of Performance:

- **Overfitting due to Oversampling:** The perfect scores on the oversampled training data are not evidence of a good fit. They come largely from the distance-weighted KNN, in which every training point is its own zero-distance neighbor. SMOTE adds synthetic spam examples interpolated between existing spam emails, which enlarges the region of feature space that KNN labels as spam. The model becomes more aggressive at predicting spam: recall reaches **100%**, but precision falls to **44%** and the model generalizes poorly to new, unseen data.
- **Evaluate Different Models:** KNN may not be the best model for this problem, particularly with 3,000 word-count features, where distance-based methods tend to struggle. Consider evaluating other classifiers such as **Random Forest, XGBoost, or SVM**. These models often cope better with high-dimensional data and offer options for handling class imbalance, such as class weights.
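
As a starting point for that comparison, the sketch below fits a Random Forest with `class_weight="balanced"` on synthetic imbalanced data. It was not run on the email dataset, so it says nothing about how the model would perform there; the data and the hyperparameters are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced stand-in for the email data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=30, n_informative=10,
    weights=[0.7, 0.3], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# class_weight="balanced" reweights classes inversely to their frequency.
rf = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=42
)
rf.fit(X_tr, y_tr)
y_pred = rf.predict(X_te)

scores = {
    "precision": precision_score(y_te, y_pred),
    "recall": recall_score(y_te, y_pred),
    "f1": f1_score(y_te, y_pred),
}
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Tree-based models do not rely on feature scaling, so the standardization step could be skipped for this model.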