# Spam Email Detection

## Introduction

This project focuses on detecting spam emails using a dataset containing information from 5,172 randomly selected email files. The goal is to build a classification model that can accurately distinguish between spam and not-spam emails based on the content of the emails.

## Source

This dataset is available on Kaggle at the following link:

> https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv

## About the Dataset

The dataset is provided in a CSV file with the following characteristics:

- **Rows**: 5,172 rows, each representing an individual email.
- **Columns**: 3,002 columns in total.
  - **First Column**: Indicates the email name. The names have been anonymized with numbers to protect privacy.
  - **Last Column**: Contains the labels for classification:
    - `1` for spam emails.
    - `0` for not-spam emails.
  - **Remaining 3,000 Columns**: These columns represent the 3,000 most common words across all emails, excluding non-alphabetical characters. Each cell in these columns contains the count of the respective word in the corresponding email.

This compact representation allows for efficient processing and analysis of email data without needing to work with separate text files.
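
As a rough illustration of how one row of such a count matrix could be produced, the sketch below counts vocabulary words in a single email. The three-word vocabulary, the sample text, and the `count_words` helper are hypothetical; the dataset's actual preprocessing is not documented here.

```python
import re
from collections import Counter

# Hypothetical vocabulary; the real dataset uses its 3,000 most common words.
VOCABULARY = ["the", "to", "free"]


def count_words(text, vocabulary):
    """Return the count of each vocabulary word in `text`, in vocabulary order."""
    # Lowercase and keep alphabetic runs only, dropping digits and punctuation.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


row = count_words("Free offer! Reply to claim the FREE prize.", VOCABULARY)
print(row)  # [1, 1, 2]
```

Each email becomes one fixed-length row, so every email can be compared on the same columns regardless of its length.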

## Problem Statement

1. **Model Training**: Train the model on the training dataset so that it can identify whether an email is spam or not.
2. **Model Evaluation**: Evaluate the trained model using metrics such as accuracy, precision, recall, and F1 score.
3. **Model Optimization**: Optimize the model's performance with cross-validation and hyperparameter tuning.

### Load Libraries

In [87]:
# General
import pandas as pd
import numpy as np
import os
import warnings
import pickle

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split

# Model and evaluation metrics
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Optimization
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

### Settings

In [86]:
# Warnings
warnings.filterwarnings("ignore")

# Path
data_path = "../data"
model_path = "../models"
# csv_path = os.path.join(data_path, "emails_fr.csv")
csv_path = os.path.join(data_path, "emails_or.csv")
# csv_path = os.path.join(data_path, "emails_pca.csv")

### Load Data

In [68]:
df = pd.read_csv(csv_path)

In [69]:
# Check Data
df.head()

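| Unnamed: 0 | the | to | ect | and | for | of | a | you | hou | in | ... | connevey | jay | valued | lay | infrastructure | military | allowing | ff | dry | Prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 8 | 13 | 24 | 6 | 6 | 2 | 102 | 1 | 27 | 18 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 8 | 0 | 0 | 4 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 5 | 22 | 0 | 5 | 1 | 51 | 2 | 10 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 7 | 6 | 17 | 1 | 5 | 2 | 57 | 0 | 9 | 3 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |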


### Preprocessing

In [70]:
# Separate input and output features
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
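
The `df.head()` output shows a leading `Unnamed: 0` column, which looks like a row identifier rather than a word count, and `df.iloc[:, :-1]` keeps it among the inputs. If that reading is right, a variant that excludes it could look like the sketch below. The tiny frame is made up for illustration, and dropping the column in the real notebook could change the metrics reported later.

```python
import pandas as pd

# Made-up stand-in for the real CSV, with the same leading and trailing columns.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "the": [0, 8, 0],
    "to": [0, 13, 0],
    "Prediction": [0, 0, 1],
})

# Assumption: "Unnamed: 0" is an identifier, so it is excluded from the inputs.
X_words = toy.drop(columns=["Unnamed: 0", "Prediction"])
y_label = toy["Prediction"]

print(list(X_words.columns))  # ['the', 'to']
```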

In [71]:
# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= 42)

In [72]:
# Standardize the data
scaler = StandardScaler()
# scaler = MinMaxScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)
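
The scaler is fitted on the training split only and then reused on the test split, so test statistics never influence the transformation. A small sketch on made-up numbers (the `demo_scaler` name and the arrays are illustrative) shows what that means in practice:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: one feature, three training rows and two test rows.
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [4.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean and std from train
test_scaled = demo_scaler.transform(test)        # reuses the training statistics

print(demo_scaler.mean_)    # mean learned from the training column only
print(test_scaled.ravel())  # test values expressed in training-set units
```

Tree splits depend only on the ordering of feature values, so scaling is generally not expected to change a decision tree's predictions. The step matters more if a scale-sensitive model is swapped in later.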

### Model Building and Evaluation

In [75]:
# Function to train and evaluate a model
def train_evaluate(model):
    # Train the model with training set
    model.fit(X_train_s, y_train)

    # Predict on training and testing data
    y_train_pred = model.predict(X_train_s)
    y_test_pred = model.predict(X_test_s)

    # Print evaluation metrics for training and testing
    print("=" * 60)
    print("EVALUATION METRICS FOR TRAINING")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_train, y_train_pred):.2f}")
    print(f"Precision: {precision_score(y_train, y_train_pred):.2f}")
    print(f"Recall: {recall_score(y_train, y_train_pred):.2f}")
    print(f"F1: {f1_score(y_train, y_train_pred):.2f}\n")
    print("=" * 60)
    print("EVALUATION METRICS FOR TESTING")
    print("=" * 60)
    print(f"Accuracy: {accuracy_score(y_test, y_test_pred):.2f}")
    print(f"Precision: {precision_score(y_test, y_test_pred):.2f}")
    print(f"Recall: {recall_score(y_test, y_test_pred):.2f}")
    print(f"F1: {f1_score(y_test, y_test_pred):.2f}")

In [76]:
# Train the Decision Tree Classifier with training data and evaluate performance with 4 metrics
dtc = DecisionTreeClassifier()
train_evaluate(dtc)

EVALUATION METRICS FOR TRAINING
Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1: 1.00

EVALUATION METRICS FOR TESTING
Accuracy: 0.94
Precision: 0.88
Recall: 0.89
F1: 0.88


### Findings

- Training with the default hyperparameters gives a test **accuracy** of **94%**. All training metrics are 1.00, and the gap to the test scores suggests some overfitting.
- Precision is critical when false positives are costly or harmful. For example, in spam detection, if an email is incorrectly marked as spam (a false positive), the user might miss important messages. The test **precision** is **88%**.
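
To make the four metrics concrete, the sketch below computes them from confusion-matrix counts. The counts are hypothetical and are not the notebook's actual test results; they were chosen only so the arithmetic is easy to follow.

```python
# Hypothetical confusion-matrix counts, not taken from the notebook's test set.
tp, fp, fn, tn = 88, 12, 11, 889

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)  # share of emails flagged as spam that really are spam
recall = tp / (tp + fn)     # share of actual spam that was flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1: {f1:.2f}")
```

With these counts, 12 legitimate emails out of 100 flagged ones would land in the spam folder, which is the cost that precision measures.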

### Model Optimization

- Look for a better model using hyperparameter tuning and cross-validation.

In [82]:
# KFold Cross validation
kf = KFold(n_splits= 5)

dtc_cv = DecisionTreeClassifier()

# Cross validation
cvs = cross_val_score(dtc_cv, X, y, cv= kf)
print(cvs.mean())

0.9149061594077359
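
`KFold(n_splits=5)` splits the rows in their stored order without shuffling. If the file happens to be ordered by class, folds can end up with uneven spam ratios. One possible alternative, shown here on synthetic labels rather than the email data, is `StratifiedKFold`, which keeps the class ratio roughly constant across folds:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels sorted by class: 70 not-spam (0) followed by 30 spam (1).
y_demo = np.array([0] * 70 + [1] * 30)
X_demo = np.zeros((len(y_demo), 1))  # feature values are irrelevant for splitting

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
spam_per_fold = [
    int(y_demo[test_idx].sum()) for _, test_idx in skf.split(X_demo, y_demo)
]

print(spam_per_fold)  # each validation fold holds 6 of the 30 spam rows
```

Such a splitter could be passed as `cv=skf` to `cross_val_score`. Whether it changes the score here depends on how the rows in the CSV are ordered.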


In [83]:
# Define Hyperparameter
param_dict = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [None, 2, 3, 4, 5],
    "min_samples_split": [2, 3, 4, 5],
    "min_samples_leaf": [1, 2, 3, 4, 5]
}

In [84]:
# Hyperparameter tuning
cv_model = DecisionTreeClassifier()

gvcv = GridSearchCV(estimator= cv_model,
                   param_grid= param_dict,
                   cv= 5,
                   verbose= 1)
gvcv.fit(X, y)
best_params = gvcv.best_params_
print(f"Best Parameter set: {best_params}")
print(f"Best Score: {gvcv.best_score_}")

Fitting 5 folds for each of 400 candidates, totalling 2000 fits
Best Parameter set: {'criterion': 'entropy', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 3, 'splitter': 'best'}
Best Score: 0.9247917027592022
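
The "400 candidates, totalling 2000 fits" line follows directly from the grid: every combination of the listed values is one candidate, and each candidate is fitted once per fold. A quick check using only the standard library:

```python
from itertools import product

param_dict = {
    "criterion": ["gini", "entropy"],
    "splitter": ["best", "random"],
    "max_depth": [None, 2, 3, 4, 5],
    "min_samples_split": [2, 3, 4, 5],
    "min_samples_leaf": [1, 2, 3, 4, 5],
}

# One candidate per combination of values: 2 * 2 * 5 * 4 * 5.
candidates = list(product(*param_dict.values()))
n_folds = 5

print(len(candidates))            # 400
print(len(candidates) * n_folds)  # 2000
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on all the data passed to `fit`.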


In [85]:
# train and evaluate with best parameters
model = DecisionTreeClassifier(**best_params)
train_evaluate(model)

EVALUATION METRICS FOR TRAINING
Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1: 1.00

EVALUATION METRICS FOR TESTING
Accuracy: 0.95
Precision: 0.90
Recall: 0.92
F1: 0.91


### Conclusion

- Hyperparameter tuning raised test accuracy from 94% to **95%** and test **precision** from 88% to **90%**. Recall and F1 also improved, from 0.89 to 0.92 and from 0.88 to 0.91.
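
Note that `gvcv.fit(X, y)` searches over the full dataset, so the rows later used as the test set also take part in choosing the hyperparameters. A stricter variant, sketched here on synthetic data from `make_classification` with a deliberately small grid, would run the search on the training split only and keep the test split for a single final check. The variable names and the choice of `scoring="precision"` are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email word counts.
X_syn, y_syn = make_classification(n_samples=300, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=42, stratify=y_syn
)

# Small illustrative grid; the notebook's grid is larger.
small_grid = {"criterion": ["gini", "entropy"], "max_depth": [None, 3, 5]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid=small_grid,
    cv=5,
    scoring="precision",  # assumption: tune for precision, as discussed in Findings
)
search.fit(X_tr, y_tr)  # the test split is never seen during the search

# The refitted best estimator is evaluated once on the held-out rows.
test_accuracy = accuracy_score(y_te, search.best_estimator_.predict(X_te))
print(search.best_params_)
print(f"Held-out accuracy: {test_accuracy:.2f}")
```

Fixing `random_state` on the classifier also makes repeated runs comparable, since an unseeded tree can give slightly different scores from run to run.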

### Model Saving

In [88]:
# Save the optimal model for future use to identify spam email.
dt_model_path = os.path.join(model_path, "spam_detector_dt.pkl")
with open(dt_model_path, "wb") as dt_model:
    pickle.dump(model, dt_model)
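
To reuse the saved file later, the model can be read back with `pickle.load`. The round trip below uses a tiny made-up model and a temporary directory instead of `../models`. Unpickling should only be done with files from a trusted source, and ideally with the same scikit-learn version that wrote them.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set: one word-count feature, label 1 means spam.
X_tiny = [[0], [1], [5], [6]]
y_tiny = [0, 0, 1, 1]
tiny_model = DecisionTreeClassifier(random_state=0).fit(X_tiny, y_tiny)

with tempfile.TemporaryDirectory() as tmp_dir:
    tmp_path = os.path.join(tmp_dir, "spam_detector_dt.pkl")
    with open(tmp_path, "wb") as f:
        pickle.dump(tiny_model, f)
    with open(tmp_path, "rb") as f:
        restored = pickle.load(f)

# The restored model predicts exactly like the original one.
print(restored.predict([[0], [6]]))  # [0 1]
```

Since the notebook's model was trained on `X_train_s`, new emails would need the same column order and the same fitted `scaler` applied before calling `predict`. The cell above saves only the model, not the scaler.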