## Spam Email Classification

In this notebook, we will explore the Spam Email dataset and train a machine learning model to classify emails as spam or not spam. We will be using a Random Forest Classifier, which is well suited to classification problems like this one.

Random Forests, or RFs for short, are ensembles of decision trees. Each tree learns decision boundaries over the features, and combining the trees' predictions tends to give accurate classifications with a small error rate.

### Step 1: Importing the necessary libraries

Since the only input in this dataset is text, with no numerical features to preprocess, there are only three steps to this notebook.

1. CSV I/O
2. Tokenizing or Vectorizing
3. Training the model

In [None]:
import pandas as pd # For CSV I/O
import numpy as np # For array operations (rounding the predictions)

import warnings
warnings.filterwarnings('ignore')

In [None]:
data = pd.read_csv('/kaggle/input/spam-email-dataset/emails.csv')
data.head()

In [None]:
# Shuffle all rows (frac=1 samples the entire dataframe), then reset the index.
data = data.sample(frac=1).reset_index(drop=True)
display(data)
class_names=['Not Spam', 'Spam']

In [None]:
# Splitting into text and spam, or formally our independent (X) and dependent (y) variables.
X_data = data['text']
y_data = data['spam']

### Step 2: Tokenizing / Vectorizing

What is the difference? 

In this notebook I use sklearn.feature_extraction.text.CountVectorizer(), but in the past I have also used keras.preprocessing.text.Tokenizer() with other datasets that were more NLP-centric. What to use when?

It comes down to the use case. As I am doing simple text classification with a classical model, scikit-learn's version is preferred. However, for broader NLP use cases, or if a deep learning model is in use, pairing the model with the tokenizer from its own framework (keras/tensorflow) works better.
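To make the idea of count vectorizing concrete, here is a minimal pure-Python sketch of a bag-of-words transform. It is not scikit-learn's implementation: the helper name bag_of_words and the toy sentences are made up for illustration, and the tokenizing rule (lowercase, keep words of two or more characters) is my assumption based on CountVectorizer's default settings.

```python
import re
from collections import Counter

def bag_of_words(docs):
    """Hypothetical helper: turn documents into rows of word counts."""
    # Lowercase and keep tokens of 2+ word characters
    # (assumed to mirror CountVectorizer's defaults).
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in docs]
    # One column per distinct token, sorted alphabetically.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        rows.append([counts.get(tok, 0) for tok in vocab])
    return vocab, rows

# Toy messages, invented for this example.
docs = ["Win a free prize now", "Meeting at noon", "Free free offer"]
vocab, rows = bag_of_words(docs)
print(vocab)
for row in rows:
    print(row)
```

Each message becomes one row and each distinct word becomes one column, so the result has shape (number of messages, vocabulary size). The word "free" appears twice in the last message, so its column holds 2 in that row.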

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

Here we print the shape of the vectorized data next to the shape of the original text column.

In [None]:
count_vect = CountVectorizer()
X_vect = count_vect.fit_transform(X_data)
print(X_vect.shape, X_data.shape)

### Step 3: Model Training
We will be using train_test_split to split the data into training and testing sets, and then fit our Random Forest Classifier (RFC) on X_train and y_train.

Arguments: 
1. random_state: Controls the randomness when building trees as well as the sampling of features when looking for the best split at each node.

2. max_features: The number of features to consider when looking for the best split (from sklearn's documentation). It accepts "sqrt", "log2" or None (use all features), as well as an int or a float.

I tested "log2" first and got 91% on the Not Spam class, while "sqrt" performed noticeably better, with 95% precision.
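To get a feel for how different "sqrt" and "log2" are on text data, here is a small sketch that computes how many candidate features each setting would give a tree at every split. The helper features_per_split is hypothetical, the vocabulary size of 30,000 is an invented example (not the size of this dataset), and truncating to an integer is my assumption about how sklearn rounds.

```python
import math

def features_per_split(n_features, max_features):
    """Hypothetical helper: candidate features examined at each split."""
    if max_features == "sqrt":
        return max(1, math.isqrt(n_features))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    if max_features is None:
        return n_features  # every feature is a candidate
    raise ValueError(f"unsupported option: {max_features!r}")

n_features = 30_000  # assumed vocabulary size, for illustration only
for option in ("sqrt", "log2", None):
    print(option, features_per_split(n_features, option))
# sqrt 173
# log2 14
# None 30000
```

With a vocabulary this large, "log2" lets each split look at far fewer words than "sqrt". That may be one reason for the gap I saw, though I have not verified it.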

In [None]:
results = []

X_train, X_test, y_train, y_test = train_test_split(X_vect, y_data, test_size=0.3, random_state=64)

model = RandomForestClassifier(random_state=64, max_features='sqrt')
model.fit(X_train, y_train)
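For intuition about what test_size=0.3 and random_state do in the split above, here is a rough pure-Python sketch of a seeded shuffle-and-split. It does not reproduce sklearn's actual shuffling (a different random number generator is used), the helper shuffled_split is hypothetical, and rounding the test count up is my assumption.

```python
import math
import random

def shuffled_split(n_samples, test_size=0.3, seed=64):
    """Hypothetical helper: return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    # A fixed seed makes the shuffle, and so the split, reproducible.
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * n_samples)  # assumed rounding rule
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = shuffled_split(10)
print(len(train_idx), len(test_idx))
# 7 3
```

Every row ends up in exactly one of the two sets, and rerunning with the same seed gives the same split.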

In [None]:
# Predict labels for the test set. predict() already returns 0/1 class labels,
# so rounding leaves them unchanged; it is kept only as a safeguard.
y_pred = model.predict(X_test)
y_pred2 = np.round(y_pred, 0)

# Ground-truth labels for the same test rows (not the full y_data).
y_true = y_test

In [None]:
from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred2, target_names=class_names, digits=3))
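The report lists precision, recall and F1 score for each class. As a rough sketch of how those numbers come about, here is a pure-Python version computed on toy labels. The labels are invented for illustration (they are not this notebook's results) and the helper per_class_metrics is hypothetical.

```python
def per_class_metrics(y_true, y_pred, label):
    """Hypothetical helper: precision, recall and F1 for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Toy labels: 0 = Not Spam, 1 = Spam.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 0, 1, 0, 1, 0, 1, 0]

for label, name in enumerate(["Not Spam", "Spam"]):
    p, r, f = per_class_metrics(y_true_toy, y_pred_toy, label)
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# Not Spam: precision=0.800 recall=0.800 f1=0.800
# Spam: precision=0.667 recall=0.667 f1=0.667
```

Precision answers "of the mails flagged as this class, how many really were?", while recall answers "of the mails that really were this class, how many did we catch?".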

This is the end of the notebook. I hope you got a glimpse of how machine learning algorithms are implemented on a dataset like this. For any queries, email: akshathmangudi@gmail.com

Notebook by Akshath Mangudi