# Spam Email Detection with Neural Networks
Prepared By Deepa Francis<br>
For BrainStation<br>
On July 31, 2023

# Table of Contents
[1. Configuring Resources](#cr) <br>
- [1.1. Setting up Libraries](#sl) <br>
- [1.2. Load Data](#ld) <br>
[2. Neural Network](#nn) <br>
- [2.1. Architecture](#nnr) <br>
- [2.2. TF-IDF Model Evaluation](#tf) <br>
- [2.3. Sentence2Vec Model Evaluation](#sv) <br>

<a id = "cr"></a>
## 1. Configuring Resources

This section imports the required libraries and loads the pre-split datasets so that we can compare the performance of neural network models on each feature representation.

<a id = "sl"></a>
### 1.1. Setting up Libraries

In [1]:
# Data handling and plotting
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Feature scaling
from sklearn.preprocessing import MinMaxScaler

# Evaluation metrics and plots
from sklearn.metrics import (
    classification_report,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    ConfusionMatrixDisplay,
    roc_curve,
    roc_auc_score,
)

# Neural network framework
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

# Silence library warnings to keep the notebook output readable
import warnings
warnings.filterwarnings('ignore')

<a id = "ld"></a>
### 1.2. Load Data

In [2]:
# Load the data
X_train = pd.read_csv('X_train.csv') 
X_test = pd.read_csv('X_test.csv') 
X_validation = pd.read_csv('X_validation.csv') 

X_train_Vec = pd.read_csv('X_train_Vec.csv') 
X_test_Vec = pd.read_csv('X_test_Vec.csv') 
X_val_Vec = pd.read_csv('X_val_Vec.csv')

y_train = pd.read_csv('y_train.csv') 
y_test = pd.read_csv('y_test.csv') 
y_validation = pd.read_csv('y_validation.csv') 

In [3]:
# Check the shapes of the datasets
print(f'The shape of X_train is {X_train.shape}')
print(f'The shape of X_test is {X_test.shape}')
print(f'The shape of X_validation is {X_validation.shape}')
print('')
print(f'The shape of X_train_Vec is {X_train_Vec.shape}')
print(f'The shape of X_test_Vec is {X_test_Vec.shape}')
print(f'The shape of X_val_Vec is {X_val_Vec.shape}')
print('')
print(f'The shape of y_train is {y_train.shape}')
print(f'The shape of y_test is {y_test.shape}')
print(f'The shape of y_validation is {y_validation.shape}')

The shape of X_train is (22400, 228)
The shape of X_test is (12000, 228)
The shape of X_validation is (5600, 228)

The shape of X_train_Vec is (22400, 428)
The shape of X_test_Vec is (12000, 428)
The shape of X_val_Vec is (5600, 428)

The shape of y_train is (22400, 1)
The shape of y_test is (12000, 1)
The shape of y_validation is (5600, 1)
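
The printed shapes can also be checked programmatically before training. The sketch below is one possible way to do that: `check_splits` is a hypothetical helper (not part of the original notebook), and the tiny DataFrames are stand-ins rather than the real CSV files. It verifies that every split has the same number of feature columns and that each feature matrix has as many rows as its label frame.

```python
import pandas as pd


def check_splits(splits):
    """splits maps a split name to an (X, y) pair of DataFrames.

    Raises ValueError if feature counts differ across splits or if a
    feature matrix and its labels have different row counts.
    Returns the shared number of feature columns.
    """
    n_features = {name: X.shape[1] for name, (X, _) in splits.items()}
    if len(set(n_features.values())) != 1:
        raise ValueError(f"Feature counts differ across splits: {n_features}")
    for name, (X, y) in splits.items():
        if len(X) != len(y):
            raise ValueError(f"{name}: {len(X)} feature rows but {len(y)} labels")
    return next(iter(n_features.values()))


# Tiny stand-in frames (illustrative only, not the spam dataset)
X_a = pd.DataFrame({"f1": [0.1, 0.2, 0.3], "f2": [1.0, 0.0, 1.0]})
y_a = pd.DataFrame({"label": [0, 1, 0]})
X_b = pd.DataFrame({"f1": [0.5], "f2": [0.0]})
y_b = pd.DataFrame({"label": [1]})

n_cols = check_splits({"train": (X_a, y_a), "test": (X_b, y_b)})
print(f"All splits share {n_cols} feature columns")
```

With the notebook's own variables, the same call would be made once for the `X_train`/`X_test`/`X_validation` group and once for the `_Vec` group, since the two representations have different column counts.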


<a id = "nn"></a>
## 2. Neural Network

The main objective is to compare the performance of the same neural network under two text representation methods: TF-IDF Vectorizer and Sentence2Vec. The network is trained on the data represented with each approach in turn and evaluated with the same performance metrics (accuracy, precision, recall, and F1-score).

Each approach may have its strengths and weaknesses depending on the nature of the text data and the complexity of the task at hand. By evaluating the neural network's performance under both methods, we can determine which representation technique yields better results for the given scenario.

<a id = "nnr"></a>
### 2.1. Architecture

**Create a Neural Network**
- First we create a sequential model, which is a linear stack of layers built one layer at a time.

- The model has three hidden layers, each followed by dropout regularization and batch normalization.

    - Dense: The dense layer is a fully connected layer with 40 neurons. The activation function used is ReLU (Rectified Linear Unit), which helps introduce non-linearity into the model.
    - Dropout: Dropout is a regularization technique that randomly drops out a fraction (0.2 in this case) of the neurons during training, which helps prevent overfitting.
    - BatchNormalization: Batch normalization standardizes the activations passed to the next layer to approximately zero mean and unit variance over each mini-batch, which helps stabilize and speed up training.

- The output layer is a dense layer with a single neuron, using the sigmoid activation function. Since this is a binary classification problem (spam or not spam), the sigmoid activation function outputs a probability between 0 and 1, indicating the likelihood of an email being spam.

- The model is compiled with the Adam optimizer, Binary Crossentropy loss function (suitable for binary classification), and Binary Accuracy metric (used to monitor the accuracy during training).

- The model is trained using the fit method with the training data (X_train, y_train) for 500 epochs (full passes over the training set). Setting verbose=0 suppresses the per-epoch progress output.

- After training, the model is evaluated on both the validation and test datasets. The training accuracy is extracted from the training history, and the evaluation results (loss and accuracy) for the validation and test datasets are obtained.

- Finally, we generate predictions on the test data with the trained model. The predicted probabilities are rounded to binary predictions (0 or 1), and the true labels (y_test) are converted to integers (0 or 1). The classification report for the test data then summarizes how well the model classifies spam emails, listing precision, recall, F1-score, and support for both classes (spam and not spam).
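
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual cell: `build_model` is a hypothetical helper name, all three hidden layers are assumed to use 40 units with the Dense → Dropout → BatchNormalization order listed above, and the data is a small synthetic stand-in so the example runs on its own. Only 2 epochs are used here instead of the 500 described in the text.

```python
import numpy as np
from sklearn.metrics import classification_report
from tensorflow import keras
from tensorflow.keras import layers


def build_model(input_dim, units=40, dropout_rate=0.2):
    """Sequential binary classifier: 3 x (Dense -> Dropout -> BatchNormalization) + sigmoid."""
    model = keras.Sequential()
    model.add(keras.Input(shape=(input_dim,)))
    for _ in range(3):
        # Assumption: every hidden layer uses the same width and dropout rate.
        model.add(layers.Dense(units, activation="relu"))
        model.add(layers.Dropout(dropout_rate))
        model.add(layers.BatchNormalization())
    # Single sigmoid unit: outputs the estimated probability of the positive (spam) class.
    model.add(layers.Dense(1, activation="sigmoid"))
    model.compile(
        optimizer=keras.optimizers.Adam(),
        loss=keras.losses.BinaryCrossentropy(),
        metrics=[keras.metrics.BinaryAccuracy()],
    )
    return model


# Synthetic stand-in data (NOT the spam dataset), only so the sketch is self-contained.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 8)).astype("float32")
y_demo = (X_demo[:, 0] > 0.5).astype("float32")

model = build_model(input_dim=X_demo.shape[1])
# The text trains for 500 epochs; 2 are used here to keep the demo quick.
history = model.fit(X_demo, y_demo, epochs=2, verbose=0)

# Training accuracy from the history, then loss and accuracy from evaluate().
train_acc = history.history["binary_accuracy"][-1]
demo_loss, demo_acc = model.evaluate(X_demo, y_demo, verbose=0)

# Round the predicted probabilities to 0/1 and compare them with integer labels.
y_prob = model.predict(X_demo, verbose=0)
y_pred = np.round(y_prob).astype(int).ravel()
y_true = y_demo.astype(int)
print(classification_report(y_true, y_pred, zero_division=0))
```

In the notebook's setting, `build_model` would be called once with the TF-IDF feature count and once with the Sentence2Vec feature count, with `evaluate` run on the validation and test sets rather than on the training data as in this stand-in.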

<a id = "tf"></a>
### 2.2. TF-IDF Model Evaluation

<a id = "sv"></a>
### 2.3. Sentence2Vec Model Evaluation