# E-Mail Spam Classification
## YZV 311E Term Project

Abdullah Bilici, 150200330

Bora Boyacıoğlu, 150200310

Import the necessary libraries.

In [1]:
import numpy as np
import pandas as pd

import zipfile
import os

from utils import concatenate_loader, evaluate_model
from dataloader import DataLoader

%load_ext autoreload
%autoreload 2

## Read Data

In [None]:
file_path = "../Data/data"

# Unzip the zip file
with zipfile.ZipFile(file_path + ".zip", 'r') as zip_ref:
    zip_ref.extractall(file_path)


In [None]:
# These data loaders help us handle data that is too large to process in one go
train_data = DataLoader("../Data/data/data_train.npy", shuffle=True, batch_size=64)
test_data = DataLoader("../Data/data/data_test.npy", shuffle=False, batch_size=64)
validation_data = DataLoader("../Data/data/data_validation.npy", shuffle=False, batch_size=64)

In [None]:
# Remove unnecessary files and folders
os.remove("../Data/data/data.npy")
os.remove("../Data/data/data_train.npy")
os.remove("../Data/data/data_test.npy")
os.remove("../Data/data/data_validation.npy")
os.rmdir("../Data/data/")

In [None]:
test_data

## Model Selection

Because of its simplicity, efficiency, and effectiveness, our first choice for E-Mail classification is **Naive Bayes**, specifically Multinomial Naive Bayes. Evaluating it on held-out data will show whether it is a good choice. If not, our next trial will be **SVM**, which can model more complex decision boundaries. After that, we can try **Random Forest**, **Logistic Regression**, or other models to see if they do even better.

In [None]:
# Convert training data loader back to a dataset
X_train, y_train = concatenate_loader(train_data)
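
The internals of `concatenate_loader` are not shown in this notebook. As a rough illustration only, the sketch below assumes the loader is an iterable of `(X_batch, y_batch)` pairs and stacks them into two arrays. The function name `concatenate_batches` and the toy loader are hypothetical stand-ins, not the project's actual code.

```python
import numpy as np


def concatenate_batches(loader):
    """Stack every (X_batch, y_batch) pair from an iterable into two arrays."""
    X_parts, y_parts = [], []
    for X_batch, y_batch in loader:
        X_parts.append(X_batch)
        y_parts.append(y_batch)
    return np.concatenate(X_parts, axis=0), np.concatenate(y_parts, axis=0)


# Hypothetical stand-in for a data loader: three batches of 4 rows and 5 features
rng = np.random.default_rng(0)
toy_loader = [(rng.random((4, 5)), rng.integers(0, 2, size=4)) for _ in range(3)]

X_all, y_all = concatenate_batches(toy_loader)
print(X_all.shape, y_all.shape)
```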

### Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
# Initialize the Naive Bayes model
mnb = MultinomialNB()

# Train the model
mnb.fit(X_train, y_train)

In [None]:
# Evaluate on validation set
print("Validation Results:")
evaluate_model(validation_data, mnb)

# Evaluate on test set
print("Test Results:")
evaluate_model(test_data, mnb)

It turns out that Naive Bayes is not an ideal model here. The recall in particular is too low: the model labels a significant share of positive samples as negative, while identifying negative samples accurately.
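
To make the pattern described above concrete, here is a small sketch with made-up labels (not the project's data), assuming `1` marks the positive class. It shows how a model can reach decent accuracy and perfect precision while its recall stays low. The helper `precision_recall` is written for this illustration and is not the notebook's `evaluate_model`.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class from two label sequences."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: four positives, of which the model finds only one
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy, precision, recall)  # 0.7 1.0 0.25
```

Every negative is classified correctly, so accuracy and precision look fine, yet three of the four positives are missed.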

### Support Vector Machines (SVM)

In [None]:
from sklearn.svm import SVC

In [None]:
# Initialize the SVM model
svc = SVC()

# Train the model
svc.fit(X_train, y_train)

In [None]:
# Evaluate on validation set
print("Validation Results:")
evaluate_model(validation_data, svc)

# Evaluate on test set
print("Test Results:")
evaluate_model(test_data, svc)

This time, the model produced much better results. Accuracy and precision are very good, while recall can still be improved: some positive samples are still labeled as negative, but far fewer than before. We could tune the hyperparameters to fit the data better. However, training takes much longer this time. Another model may still work better, so let's keep trying.
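
One common way to reduce SVM training time, which this notebook does not try, is to swap the kernel-based `SVC` for a linear SVM. The sketch below uses scikit-learn's `LinearSVC` on synthetic data from `make_classification`. The dataset, the scaling step, and `C=1.0` are all assumptions made for illustration, and a linear model may not match the results of the default RBF kernel on the real data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the e-mail features (not the project's data)
X, y = make_classification(n_samples=2000, n_features=50, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.25, random_state=42)

# Scaling usually helps linear SVMs converge; dual=False suits n_samples > n_features
linear_svm = make_pipeline(StandardScaler(), LinearSVC(C=1.0, dual=False))
linear_svm.fit(X_tr, y_tr)

val_accuracy = linear_svm.score(X_val, y_val)
print(f"Validation accuracy: {val_accuracy:.3f}")
```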

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Set the parameters
n_estimators = 100
random_state = 42

# Initialize the Random Forest model
rfc = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)

# Train the model
rfc.fit(X_train, y_train)

In [None]:
# Evaluate on validation set
print("Validation Results:")
evaluate_model(validation_data, rfc)

# Evaluate on test set
print("Test Results:")
evaluate_model(test_data, rfc)

With 100 estimators, Random Forest did an even better job than SVM. This time, recall is over 0.94 as well. Let's try one more model to make sure we are picking the best one.

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
# Set the parameters
max_iter = 10000

# Initialize the Logistic Regression model
log_reg = LogisticRegression(max_iter=max_iter)

# Train the model
log_reg.fit(X_train, y_train)

In [None]:
# Evaluate on validation set
print("Validation Results:")
evaluate_model(validation_data, log_reg)

# Evaluate on test set
print("Test Results:")
evaluate_model(test_data, log_reg)

This time, the model didn't do a great job: recall is below 60%.

#### Conclusions

After trying four different models, we can say that Random Forest is a great choice to continue with. In the next part, we will apply **hyperparameter tuning** to get even better results. We will also make sure we're not **overfitting**.

## Random Forest: Hyperparameter Tuning

Now that we've picked Random Forest as our model, there is a crucial step before continuing: tuning. We can improve the model's performance by adjusting its hyperparameters. We also need to make sure that the model is not overfitting.
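
A simple way to look for overfitting is to compare a metric on the training data with the same metric on held-out data: a large gap suggests the model has memorized the training set. The sketch below does this for recall on synthetic data. The dataset and the split are assumptions made for illustration, not the project's loaders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the e-mail features (not the project's data)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_tr, y_tr)

# Recall on data the model has seen versus data it has not
train_recall = recall_score(y_tr, forest.predict(X_tr))
val_recall = recall_score(y_val, forest.predict(X_val))
gap = train_recall - val_recall
print(f"train recall={train_recall:.3f}, validation recall={val_recall:.3f}, gap={gap:.3f}")
```

Unconstrained forests often score near 1.0 on their own training data, so the size of the gap matters more than the training score itself.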

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    # 'auto' was removed in scikit-learn 1.3; for classifiers it was the same as 'sqrt'
    'max_features': ['sqrt', 'log2']
}
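
Grid search trains one model per parameter combination and per cross-validation fold, so the cost grows multiplicatively with every list in the grid. The sketch below counts the fits for an illustrative grid covering four of the parameters above under 5-fold cross-validation. The helper `count_fits` and `example_grid` are written for this illustration.

```python
from math import prod


def count_fits(param_grid, cv):
    """Return (number of parameter combinations, number of model fits)."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates, n_candidates * cv


# Illustrative grid covering four of the parameters
example_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [None, 10, 20, 30],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

n_candidates, n_fits = count_fits(example_grid, cv=5)
print(n_candidates, n_fits)  # 108 540
```

Each additional option for a further parameter, such as `max_features`, multiplies both numbers again, which is worth keeping in mind before launching a long search.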

In [None]:
# Initialize the GridSearchCV
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42), 
    param_grid=param_grid, 
    cv=5,       # Number of cross-validation folds (k-fold)
    n_jobs=-1,  # Use all available cores
    verbose=2
)

# Fit GridSearchCV
grid_search.fit(X_train, y_train)

# Best parameters and score
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)
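
By default, `GridSearchCV` ranks classifier candidates by accuracy. Since recall has been the weak point in the comparisons above, one option is to pass `scoring="recall"` and then check the refitted `best_estimator_` on held-out data. The sketch below shows this on synthetic data with a deliberately tiny grid so that it runs quickly. The dataset and grid values are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the e-mail features (not the project's data)
X, y = make_classification(n_samples=400, n_features=20, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Tiny illustrative grid; a real search would cover more values
small_grid = {"n_estimators": [10, 25], "max_depth": [None, 5]}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=small_grid,
    scoring="recall",  # rank candidates by recall instead of accuracy
    cv=3,
)
search.fit(X_tr, y_tr)

# best_estimator_ is refitted on all of X_tr with the best parameters
held_out_recall = recall_score(y_val, search.best_estimator_.predict(X_val))
print("Best parameters:", search.best_params_)
print(f"Cross-validated recall: {search.best_score_:.3f}")
print(f"Held-out recall: {held_out_recall:.3f}")
```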