# E-Mail Spam Classification
## YZV 311E Term Project

Abdullah Bilici, 150200330

Bora Boyacıoğlu, 150200310

Import the necessary libraries.

In [1]:
import zipfile
import os

from utils import concatenate_loader, evaluate_model
from dataloader import DataLoader

%load_ext autoreload
%autoreload 2

## Read Data

In [2]:
file_path = "../Data/data"

# Unzip the zip file
with zipfile.ZipFile(file_path + ".zip", 'r') as zip_ref:
    zip_ref.extractall(file_path)


In [3]:
# These data loaders help us handle large data in batches
train_data = DataLoader("../Data/data/data_train.npy", shuffle=True, batch_size=64)
test_data = DataLoader("../Data/data/data_test.npy", shuffle=False, batch_size=64)
validation_data = DataLoader("../Data/data/data_validation.npy", shuffle=False, batch_size=64)

In [4]:
# Remove unnecessary files and folders
os.remove("../Data/data/data.npy")
os.remove("../Data/data/data_train.npy")
os.remove("../Data/data/data_test.npy")
os.remove("../Data/data/data_validation.npy")
os.rmdir("../Data/data/")

In [5]:
test_data

Data with shape of (1146, 16908), shuffle = False, batch_size = 64

## Model Selection

Because of its simplicity, efficiency, and effectiveness, our first choice for e-mail classification is **Naive Bayes**, specifically Multinomial Naive Bayes. We will evaluate it on the validation and test sets to see whether it is a good choice. If not, our next trial will be **SVM**, which tends to cope well with high-dimensional feature spaces. After that, we can try **Random Forest**, **Logistic Regression**, or other models to see if they perform even better.

In [6]:
# Convert training data loader back to a dataset
X_train, y_train = concatenate_loader(train_data)
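
The step above turns a batched loader back into full arrays. A minimal sketch of that idea follows, assuming the loader yields `(features, labels)` pairs per batch. The helper `concatenate_batches` and the toy batches are hypothetical stand-ins; the project's own `concatenate_loader` in `utils` may work differently.

```python
import numpy as np


def concatenate_batches(batches):
    """Stack an iterable of (X_batch, y_batch) pairs into two full arrays."""
    X_parts, y_parts = [], []
    for X_batch, y_batch in batches:
        X_parts.append(X_batch)
        y_parts.append(y_batch)
    # Features are stacked row-wise, labels are joined into one 1-D vector
    return np.vstack(X_parts), np.concatenate(y_parts)


# Toy stand-in for a loader: three batches of 4 samples with 5 features each
rng = np.random.default_rng(0)
toy_batches = [
    (rng.integers(0, 3, size=(4, 5)), rng.integers(0, 2, size=4))
    for _ in range(3)
]

X_all, y_all = concatenate_batches(toy_batches)
print(X_all.shape, y_all.shape)
```

Collecting the parts in lists and stacking once at the end avoids repeatedly reallocating a growing array inside the loop.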

### Naive Bayes

In [7]:
from sklearn.naive_bayes import MultinomialNB

In [8]:
# Initialize the Naive Bayes model
mnb = MultinomialNB()

# Train the model
mnb.fit(X_train, y_train)

In [9]:
# Evaluate on validation set
print("Validation Results:")
evaluate_model(validation_data, mnb)

# Evaluate on test set
print("Test Results:")
evaluate_model(test_data, mnb)

Validation Results:
Confusion Matrix:
[[TP: 111	FP: 0	]
 [FN: 175	TN: 802	]]

Classification Report:
Accuracy : 0.8392
Precision: 1.0000
Recall   : 0.3881
F1 Score : 0.5592

Test Results:
Confusion Matrix:
[[TP: 106	FP: 0	]
 [FN: 156	TN: 826	]]

Classification Report:
Accuracy : 0.8566
Precision: 1.0000
Recall   : 0.4046
F1 Score : 0.5761



It turns out that Naive Bayes is not an ideal model here. Recall in particular is too low (about 0.39 on validation and 0.40 on test): the model labels a large share of the positive samples as negative, even though it produces no false positives and classifies the negative samples correctly.
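
To make the link between the confusion matrix and the reported scores explicit, here is a small sketch that recomputes the metrics from the Naive Bayes validation counts shown above. The function name `metrics_from_counts` is introduced for illustration only; it is not the project's `evaluate_model`.

```python
def metrics_from_counts(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    # Guard against empty denominators so the function never divides by zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Counts taken from the Naive Bayes validation confusion matrix
nb_validation = metrics_from_counts(tp=111, fp=0, fn=175, tn=802)
for name, value in nb_validation.items():
    print(f"{name:<9}: {value:.4f}")
```

With zero false positives, precision is perfect, so the low F1 score is driven entirely by the 175 false negatives.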

### Support Vector Machines (SVM)

In [10]:
from sklearn.svm import SVC

In [11]:
# Initialize the SVM model
svc = SVC()

# Train the model
svc.fit(X_train, y_train)

In [12]:
# Evaluate on validation set
print("Validation Results:")
evaluate_model(validation_data, svc)

# Evaluate on test set
print("Test Results:")
evaluate_model(test_data, svc)

Validation Results:
Confusion Matrix:
[[TP: 242	FP: 1	]
 [FN: 44	TN: 801	]]

Classification Report:
Accuracy : 0.9586
Precision: 0.9959
Recall   : 0.8462
F1 Score : 0.9149

Test Results:
Confusion Matrix:
[[TP: 210	FP: 2	]
 [FN: 52	TN: 824	]]

Classification Report:
Accuracy : 0.9504
Precision: 0.9906
Recall   : 0.8015
F1 Score : 0.8861



This time, the model produced much better results. Accuracy and precision are very good, while recall (0.85 on validation, 0.80 on test) can still be improved. Positive samples are still being classified as negative, but far less often. We could tune the hyperparameters to fit the data better; however, training took considerably longer this time. There is a possibility that another model will work better, so let's keep trying.
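
One possible way to address the long training time, offered as an untested assumption rather than something evaluated on this dataset: `SVC()` uses an RBF kernel by default, whereas scikit-learn's `LinearSVC` fits a linear SVM and is often faster on high-dimensional data. The arrays `X_toy` and `y_toy` below are synthetic stand-ins, not the e-mail data.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Synthetic count-like features standing in for word counts (assumption)
rng = np.random.default_rng(42)
X_toy = rng.poisson(1.0, size=(200, 50)).astype(float)
# Toy label: positive when the first five features sum to more than 5
y_toy = (X_toy[:, :5].sum(axis=1) > 5).astype(int)

# Linear SVM; C and max_iter are illustrative values, not tuned ones
linear_svc = LinearSVC(C=1.0, max_iter=5000)
linear_svc.fit(X_toy[:150], y_toy[:150])

# Predict on the 50 samples held out from training
preds = linear_svc.predict(X_toy[150:])
print("Held-out accuracy on toy data:", (preds == y_toy[150:]).mean())
```

Whether a linear kernel would match the RBF results on the real data is an open question; the sketch only shows how the swap would look.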

### Random Forest

In [13]:
from sklearn.ensemble import RandomForestClassifier

In [14]:
# Set the parameters
n_estimators = 100
random_state = 42

# Initialize the Random Forest model
rfc = RandomForestClassifier(n_estimators=n_estimators, random_state=random_state)

# Train the model
rfc.fit(X_train, y_train)

In [15]:
# Evaluate on validation set
print("Validation Results:")
evaluate_model(validation_data, rfc)

# Evaluate on test set
print("Test Results:")
evaluate_model(test_data, rfc)

Validation Results:
Confusion Matrix:
[[TP: 266	FP: 4	]
 [FN: 20	TN: 798	]]

Classification Report:
Accuracy : 0.9779
Precision: 0.9852
Recall   : 0.9301
F1 Score : 0.9568

Test Results:
Confusion Matrix:
[[TP: 244	FP: 4	]
 [FN: 18	TN: 822	]]

Classification Report:
Accuracy : 0.9798
Precision: 0.9839
Recall   : 0.9313
F1 Score : 0.9569



Using 100 estimators, Random Forest did an even better job than SVM. This time, recall is above 0.93 on both sets as well. Let's try one more model to make sure we are using the best one.

### Logistic Regression

In [16]:
from sklearn.linear_model import LogisticRegression

In [17]:
# Set the parameters
max_iter = 10000

# Initialize the Logistic Regression model
log_reg = LogisticRegression(max_iter=max_iter)

# Train the model
log_reg.fit(X_train, y_train)

In [18]:
# Evaluate on validation set
print("Validation Results:")
evaluate_model(validation_data, log_reg)

# Evaluate on test set
print("Test Results:")
evaluate_model(test_data, log_reg)

Validation Results:
Confusion Matrix:
[[TP: 162	FP: 0	]
 [FN: 124	TN: 802	]]

Classification Report:
Accuracy : 0.8860
Precision: 1.0000
Recall   : 0.5664
F1 Score : 0.7232

Test Results:
Confusion Matrix:
[[TP: 153	FP: 0	]
 [FN: 109	TN: 826	]]

Classification Report:
Accuracy : 0.8998
Precision: 1.0000
Recall   : 0.5840
F1 Score : 0.7373



This time, the model didn't do a great job: precision is perfect, but recall is below 60% on both sets.

#### Conclusions

After trying four different models, we can say that Random Forest is the best choice to continue with. In the next part, we will apply **hyperparameter tuning** to try to get even better results. We will also make sure we're not **overfitting**.
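
The comparison behind this conclusion can be reproduced from the test-set confusion matrices reported above. The following is a small sketch; the dictionary layout and the helper `recall_and_f1` are illustrative choices, not part of the project code.

```python
# Test-set counts (TP, FP, FN, TN) copied from the outputs above
test_counts = {
    "Naive Bayes": (106, 0, 156, 826),
    "SVM": (210, 2, 52, 824),
    "Random Forest": (244, 4, 18, 822),
    "Logistic Regression": (153, 0, 109, 826),
}


def recall_and_f1(tp, fp, fn, tn):
    """Return (recall, F1) for one confusion matrix."""
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return recall, f1


# Rank the models by test recall, the metric of main concern
ranked = sorted(
    test_counts,
    key=lambda name: recall_and_f1(*test_counts[name])[0],
    reverse=True,
)

for name in ranked:
    recall, f1 = recall_and_f1(*test_counts[name])
    print(f"{name:<20} recall={recall:.4f}  f1={f1:.4f}")
```

Ranking by recall rather than accuracy matters here because all four models have few or no false positives, so the false negatives are what separate them.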

## Random Forest: Hyperparameter Tuning

Now that we've picked Random Forest as our model, there is a crucial step before continuing: tuning. We can improve the performance of the model by adjusting its hyperparameters. We also need to make sure that the model is not overfitting.
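
One common way to check for overfitting is to compare a metric on the training data with the same metric on held-out data; a large gap suggests the model is memorizing the training set. The sketch below illustrates this on synthetic data. The arrays, the split, and the name `gap` are assumptions for illustration, not the project's own check.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# Synthetic count-like features with a noisy label (stand-in for real data)
rng = np.random.default_rng(0)
X_toy = rng.poisson(1.0, size=(400, 30))
y_toy = (X_toy[:, 0] + X_toy[:, 1] + rng.normal(0, 1, 400) > 2).astype(int)

# Simple split: first 300 rows for training, last 100 for validation
X_tr, y_tr = X_toy[:300], y_toy[:300]
X_val, y_val = X_toy[300:], y_toy[300:]

forest = RandomForestClassifier(n_estimators=50, random_state=42)
forest.fit(X_tr, y_tr)

# Recall on seen vs. unseen data; the difference is the generalization gap
train_recall = recall_score(y_tr, forest.predict(X_tr))
val_recall = recall_score(y_val, forest.predict(X_val))
gap = train_recall - val_recall

print(f"train recall={train_recall:.4f}  val recall={val_recall:.4f}  gap={gap:.4f}")
```

Fully grown random forests usually score close to perfectly on their own training data, so it is the size of the gap, not the training score itself, that is informative.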

In [19]:
from sklearn.model_selection import RandomizedSearchCV

In [20]:
# Define the parameter distributions
param_distributions = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']  # 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
}

In [21]:
# Initialize RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=100,
    cv=5,
    scoring='recall',
    verbose=2,
    random_state=42,
    n_jobs=-1
)

# Fit RandomizedSearchCV
random_search.fit(X_train, y_train)

# Best parameters and score
print("Best parameters:", random_search.best_params_)
print("Best score:", random_search.best_score_)

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best parameters: {'n_estimators': 400, 'min_samples_split': 2, 'min_samples_leaf': 1, 'max_features': 'sqrt', 'max_depth': None}
Best score: 0.9202646815550042


Using **Randomized Search**, we arrived at a configuration very close to the original one: the best parameters match those of the original model (the scikit-learn defaults) except for `n_estimators`, which is 400 instead of 100. The best score is the mean cross-validated **recall**, as recall was our main concern.

After all this computation time and nearly identical results, tuning doesn't seem to have been necessary. Still, it was worth checking. Now, it's time to build our model.
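
To see how close the candidates in a randomized search actually were, the fitted search object exposes `cv_results_`. The sketch below runs a deliberately tiny search on synthetic data and lists the candidates by rank. The toy data, the small parameter grid, and the name `toy_search` are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (not the e-mail dataset)
rng = np.random.default_rng(1)
X_toy = rng.poisson(1.0, size=(200, 20))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 2).astype(int)

# A very small search so the example runs in a few seconds
toy_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={"n_estimators": [10, 20, 30], "max_depth": [None, 5]},
    n_iter=4,
    cv=3,
    scoring="recall",
    random_state=42,
)
toy_search.fit(X_toy, y_toy)

# List candidates from best to worst by mean cross-validated recall
results = toy_search.cv_results_
order = np.argsort(results["rank_test_score"])
for i in order:
    print(
        results["params"][i],
        f"mean={results['mean_test_score'][i]:.4f}",
        f"std={results['std_test_score'][i]:.4f}",
    )
```

If the top candidates differ by less than the fold-to-fold standard deviation, the search has not found a meaningfully better configuration.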

## Random Forest Model

Using the parameters found in the Randomized Search, let's build the model.

In [24]:
# Set the parameters
params = random_search.best_params_

In [25]:
# Initialize the Random Forest model
model = RandomForestClassifier(**params)

# Train the model
model.fit(X_train, y_train)

In [26]:
# Evaluate on validation set
print("Validation Results:")
evaluate_model(validation_data, model)

# Evaluate on test set
print("Test Results:")
evaluate_model(test_data, model)

Validation Results:
Confusion Matrix:
[[TP: 264	FP: 2	]
 [FN: 22	TN: 800	]]

Classification Report:
Accuracy : 0.9779
Precision: 0.9925
Recall   : 0.9231
F1 Score : 0.9565

Test Results:
Confusion Matrix:
[[TP: 245	FP: 8	]
 [FN: 17	TN: 818	]]

Classification Report:
Accuracy : 0.9770
Precision: 0.9684
Recall   : 0.9351
F1 Score : 0.9515

