# Spam Email Classification

In this project I will create a binary classifier that can distinguish spam emails from ham (non-spam) emails using real emails as my dataset. The dataset is from [SpamAssassin](https://spamassassin.apache.org/old/publiccorpus/). It consists of email messages, email IDs, subject lines and their labels (0 for ham, 1 for spam). My training dataset has 8,348 labeled examples, and the test set contains 1,000 unlabeled examples.

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
sns.set(style = "whitegrid", 
        color_codes = True,
        font_scale = 1.5)

In [3]:
import zipfile
with zipfile.ZipFile('spam_ham_data.zip') as item:
    item.extractall()

In [None]:
# Loading training and test datasets
original_training_data = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

# Convert the emails to lowercase as the first step of text processing.
original_training_data['email'] = original_training_data['email'].str.lower()
test['email'] = test['email'].str.lower()

original_training_data.head()

In [None]:
# Fill in NaN values with empty strings
print('Before imputation:')
print(original_training_data.isnull().sum())
original_training_data = original_training_data.fillna('')
print('------------')
print('After imputation:')
print(original_training_data.isnull().sum())

In [None]:
#Examples of a spam and ham email
first_ham = original_training_data.loc[original_training_data['spam'] == 0, 'email'].iloc[0]
first_spam = original_training_data.loc[original_training_data['spam'] == 1, 'email'].iloc[0]
print(first_ham)
print(first_spam)

The spam email is an ad that tries to convince the reader to check out a product. It could be flagged as spam by looking for key phrases such as "come in here and see how..." and "our methods are guaranteed...", or similar phrases. The ham email includes a personal phrase, "thanks, misha!", which makes it seem less automated.

In [5]:
# This creates a 90/10 train-validation split on our labeled data.
from sklearn.model_selection import train_test_split
train, val = train_test_split(original_training_data, test_size = 0.1, random_state = 42)

# Feature Engineering and Exploratory Data Analysis

Below I create a function called `words_in_texts` that takes in a list of `words` and a pandas `Series` of email `texts`. It outputs a 2-dimensional `NumPy` array containing one row for each email text. Each row contains 0 or 1 values, one for each word in the `words` list. If the j-th word in `words` appears in the i-th entry of the `texts` Series, the output element at index (i, j) is 1; otherwise it is 0.

In [None]:
def words_in_texts(words, texts):
    """
    Args:
        words (list): words to find
        texts (Series): strings to search in
    
    Returns:
        A 2D NumPy array of 0s and 1s with shape (n, p) where 
        n is the number of texts and p is the number of words.
    """
    # Build one row per email and one 0/1 column per word (substring match),
    # then convert to a NumPy array so the return type matches the docstring.
    indicator_array = np.array(
        [[int(word in text) for word in words] for text in texts]
    )
    return indicator_array
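
As a quick check of the indexing convention described above, here is a minimal sketch on two made-up strings. The helper name `words_in_texts_demo` and the toy inputs are assumptions for illustration only; it mirrors the substring-matching logic of `words_in_texts` rather than replacing it.

```python
import numpy as np
import pandas as pd

def words_in_texts_demo(words, texts):
    # Hypothetical stand-in for words_in_texts: 1 if the word is a
    # substring of the text, 0 otherwise.
    return np.array([[int(word in text) for word in words] for text in texts])

demo_words = ['hello', 'bye', 'world']
demo_texts = pd.Series(['hello', 'hello worldhello'])

demo = words_in_texts_demo(demo_words, demo_texts)
print(demo)        # one row per text, one column per word
print(demo.shape)  # (number of texts, number of words)
```

Note that the match is on substrings, so `'world'` is found inside `'worldhello'` even though it is not a separate token.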

We need to identify some features that allow us to distinguish spam emails from ham emails. One idea is to compare the distribution of a single feature in spam emails to the distribution of the same feature in ham emails. If the feature is itself a binary indicator, such as whether a certain word occurs in the text, this amounts to comparing the proportion of spam emails with the word to the proportion of ham emails with the word.
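
To make the comparison concrete, here is a small sketch on toy data (the six emails and the word `free` are made up, not taken from the SpamAssassin corpus). Because the indicator is 0/1, its mean within each class is exactly the proportion of emails in that class that contain the word.

```python
import pandas as pd

# Toy data (hypothetical): three spam and three ham emails.
toy = pd.DataFrame({
    'email': ['free money now', 'claim your free prize', 'win cash',
              'meeting at noon', 'lunch is free today', 'see attached notes'],
    'spam': [1, 1, 1, 0, 0, 0],
})

# Binary indicator: does the email contain the word?
toy['has_free'] = toy['email'].str.contains('free', regex=False).astype(int)

# Mean of a 0/1 column per class = proportion of that class with the word.
proportions = toy.groupby('spam')['has_free'].mean()
print(proportions)
```

A word is a promising feature when these two proportions differ a lot between the classes.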

In [None]:
from IPython.display import display, Markdown
df = pd.DataFrame({
    'word_1': [1, 0, 1, 0],
    'word_2': [0, 1, 0, 1],
    'type': ['spam', 'ham', 'ham', 'ham']
})
display(Markdown("> Our original DataFrame has a `type` column and some columns corresponding to words. You can think of each row as a sentence, and the value of 1 or 0 indicates whether the word occurs in this sentence."))
display(df);
display(Markdown("> `melt` will turn columns into entries in a variable column. Notice how `word_1` and `word_2` become entries in `variable`; their values are stored in the value column."))
display(df.melt("type"))

The following plot (which was created using `sns.barplot`) compares the proportion of emails in each class containing a particular set of words. 

In [None]:
train_copy = train.copy()
words = ['discount', 'deal', 'guarantee', 'urgent', 'fast', 'regret']
#words = ['body', 'business', 'html', 'money', 'offer', 'please']
for word in words:
    results = []
    for email in train['email']:
        if (word in email):
            results.append(1)
        else:
            results.append(0)
    train_copy[word] = results
spams = {}
hams = {}
for i in words:
    spam = sum(train_copy[(train_copy[i] == 1) & (train_copy['spam'] == 1)][i])
    ham = sum(train_copy[(train_copy[i] == 1) & (train_copy['spam'] == 0)][i])
    spam_prop = spam/sum(train_copy['spam'])
    ham_prop = ham/(len(train_copy[i]) - sum(train_copy['spam']))
    spams[i] = spam_prop
    hams[i] = ham_prop
df = pd.DataFrame()
for key in spams:
    df[key] = [hams[key], spams[key]]
df['type'] = ['Ham', 'Spam']
df2 = df.melt('type')

train = train.reset_index(drop=True) # We must do this in order to preserve the ordering of emails to labels for words_in_texts
plt.figure(figsize=(8,6))

sns.barplot(data = df2, x = 'variable', y = 'value', hue = 'type')
plt.title("Frequency of Words in Spam/Ham Emails")
plt.ylabel("Proportion of Emails")
plt.xlabel("Words")
plt.tight_layout()
plt.show()

In [None]:
some_words = ['drug', 'bank', 'prescription', 'memo', 'private']
X_train = pd.DataFrame(words_in_texts(some_words, train['email']))
Y_train = np.array(train['spam'])
X_train[:5], Y_train[:5]

Now I will train a logistic regression model on the training set. The training accuracy turns out to be only about 0.76.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, Y_train)

training_accuracy = model.score(X_train, Y_train)
print("Training Accuracy: ", training_accuracy)

Calculating precision and recall:

In [None]:
preds = model.predict(X_train)
TP = sum((preds == 1) & (Y_train == 1))
FP = sum((preds == 1) & (Y_train == 0))
TN = sum((preds == 0) & (Y_train == 0))
FN = sum((preds == 0) & (Y_train == 1))
logistic_predictor_precision = TP/(TP+FP)
logistic_predictor_recall = TP/(TP+FN)
logistic_predictor_fpr = FP/(FP+TN)

print(f"{TP=}, {TN=}, {FP=}, {FN=}")
print(f"{logistic_predictor_precision=:.2f}, {logistic_predictor_recall=:.2f}, {logistic_predictor_fpr=:.2f}")

The classifier makes far more false negatives than false positives: 1699 false negatives versus 122 false positives. Predicting 0 (ham) for every email would give a prediction accuracy of about 0.745 (74.5%). Since this is very close to the prediction accuracy of our logistic regression classifier, the classifier is barely better than this trivial baseline. In other words, we might as well have marked no emails as spam and gotten nearly the same accuracy.
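
The baseline argument can be sketched as follows. The label array is a toy example (an assumption, not the real training labels); the point is that a predictor that always outputs 0 has accuracy equal to the fraction of ham emails, and a recall of 0.

```python
import numpy as np

# Toy labels (hypothetical): 6 ham (0) and 2 spam (1).
y = np.array([0, 0, 0, 1, 0, 1, 0, 0])

# The "zero predictor" marks every email as ham.
zero_preds = np.zeros_like(y)

baseline_accuracy = np.mean(zero_preds == y)  # equals 1 - y.mean()
baseline_recall = np.sum((zero_preds == 1) & (y == 1)) / np.sum(y == 1)

print(f"{baseline_accuracy=:.2f}, {baseline_recall=:.2f}")
```

This is why accuracy alone is misleading on imbalanced data: the baseline scores well on accuracy while catching no spam at all.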

# Building my own model

In [None]:
some_words = ['>', '<p', '!!!', 'offer', 'business', 'please', '2002', 'border="0"', 'but', 'they'] #'savings'
X_train = words_in_texts(some_words, train['email'])
Y_train = np.array(train['spam'])

my_model = LogisticRegression()
my_model.fit(X_train, Y_train)
my_model.score(X_train, Y_train)

train_c = train.copy()
val_c = val.copy()
def contains_col(dataframe, wordlist):
    for i in wordlist:
        dataframe[i] = dataframe['email'].apply(lambda x: 1 if i in x else 0)
contains_col(train_c, some_words)
contains_col(val_c, some_words)
train_c["! count"] = train_c['email'].str.count('!')
val_c["! count"] = val_c['email'].str.count('!')
trainCX = train_c.iloc[:, 4:]
trainCY = train['spam']
valCX = val_c.iloc[:, 4:]
valCY = val_c['spam']
model3 = LogisticRegression()
model3.fit(trainCX, trainCY)
preds = model3.predict(valCX)
TP = sum((preds == 1) & (val_c['spam'] == 1))
FP = sum((preds == 1) & (val_c['spam'] == 0))
TN = sum((preds == 0) & (val_c['spam'] == 0))
FN = sum((preds == 0) & (val_c['spam'] == 1))
logistic_predictor_precision = TP/(TP+FP)  # low precision means many false positives
logistic_predictor_recall = TP/(TP+FN)  # low recall means many false negatives
logistic_predictor_fpr = FP/(FP+TN)
logistic_predictor_fnr = FN/(FN+TP)

print(f"{TP=}, {TN=}, {FP=}, {FN=}")
print(f"{logistic_predictor_precision=:.2f}, {logistic_predictor_recall=:.2f}, {logistic_predictor_fpr=:.2f}, {logistic_predictor_fnr=:.2f}")

Results: TP=141, TN=595, FP=18, FN=81,
logistic_predictor_precision=0.89, logistic_predictor_recall=0.64, logistic_predictor_fpr=0.03, logistic_predictor_fnr=0.36
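
As a sanity check, the reported rates can be recomputed from the four counts above using plain arithmetic. The accuracy value is not stated in the results; it is derived here from the same counts.

```python
# Counts reported above for the validation set.
TP, TN, FP, FN = 141, 595, 18, 81

precision = TP / (TP + FP)   # share of predicted spam that really is spam
recall = TP / (TP + FN)      # share of actual spam that was caught
fpr = FP / (FP + TN)         # share of ham wrongly flagged as spam
fnr = FN / (FN + TP)         # share of spam that slipped through (1 - recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)  # derived, not reported above

print(f"{precision=:.2f}, {recall=:.2f}, {fpr=:.2f}, {fnr=:.2f}, {accuracy=:.2f}")
```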

# Findings

To start with, I wanted to find the most impactful words, so I created a table with every word in the emails, a "spam_counts" column (occurrences of that word in spam emails), a "ham_counts" column (occurrences of that word in ham emails) and a "spam/ham" column (the ratio of spam_counts to ham_counts). I initially chose words with very high and very low spam/ham ratios, but this gave low precision and recall. I realized this was probably because some of these words had extremely low spam_counts and ham_counts, and because there was collinearity among some of the features. So I filtered the table to only include words with spam_counts and ham_counts greater than 500, and picked the lowest and highest ratios from that table. I then went through a process of trial and error, looking for the words that gave the highest accuracy, precision and recall, until I found a good fit. I also added a column for the number of exclamation marks because I found that spam emails had more than 3x as many exclamation marks as ham emails.
At first, I simply chose words with the highest "spam/ham" ratio, but found that precision and recall were very low. So I filtered the table to only include words with spam_counts and ham_counts greater than 500. This helped, but precision and especially recall were still low. I realized I also needed to include words with a low "spam/ham" ratio, and ended up choosing 5 words with a very low ratio and 5 words with a high ratio. Arriving at these 10 words required going through the list multiple times, adding and removing words and seeing how each change affected precision and recall, until I found a list that seemed like the best fit. The most surprising thing for me was that the best features were strings I would never have considered, such as `border="0"`. It's surprising to me that indicators such as these are so common in ham emails. It was also surprising that certain words with very large "spam/ham" ratios had a negative impact on the model, as I would have expected them to help the model train better. In fact, the words with the highest ratios were some of the most damaging to the model's accuracy, and the best words were the ones that did not have such a stark difference in the number of occurrences in spam vs ham emails.
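
The code that built the word table is not shown in this notebook, so the following is only a sketch of how such a table could be constructed. The toy emails, the whitespace tokenization, the helper `count_words` and the `min_count` value are all assumptions (the analysis above used a cutoff of 500 on the real data).

```python
from collections import Counter
import pandas as pd

# Toy data (hypothetical): two spam and three ham emails.
toy = pd.DataFrame({
    'email': ['free offer free', 'offer ends today',
              'lunch today', 'notes from today', 'free lunch'],
    'spam': [1, 1, 0, 0, 0],
})

def count_words(texts):
    # Count whitespace-separated tokens across all texts (assumed tokenization).
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts

spam_counts = count_words(toy.loc[toy['spam'] == 1, 'email'])
ham_counts = count_words(toy.loc[toy['spam'] == 0, 'email'])

# One row per word; words missing from one class get a count of 0.
table = pd.DataFrame({
    'spam_counts': pd.Series(dict(spam_counts)),
    'ham_counts': pd.Series(dict(ham_counts)),
}).fillna(0)

# Keep only words that are frequent enough in both classes, then take the ratio.
min_count = 1  # assumption for the toy data
filtered = table[(table['spam_counts'] >= min_count) &
                 (table['ham_counts'] >= min_count)].copy()
filtered['spam/ham'] = filtered['spam_counts'] / filtered['ham_counts']
print(filtered.sort_values('spam/ham'))
```

Filtering before taking the ratio also avoids dividing by zero for words that never appear in ham emails.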

Creating a heatmap

In [None]:
words = ['face="verdana"><font', 'width=3d"550"', 'align=3d"right"><font', 'align=3d"center">=20', 'spamassassin-sightings@lists.sourceforge.net', 'wrote:', 'bgcolor="#000000"><img', 'url:', 'size="-1">', 'height="9"']
train_cop = train.copy()
def contains_col(dataframe, wordlist):
    for i in wordlist:
        dataframe[i] = dataframe['email'].apply(lambda x: 1 if i in x else 0)
contains_col(train_cop, words)
train_cop["! count"] = train_cop['email'].str.count('!')
trainCOP = train_cop.iloc[:, 4:]
correl = trainCOP.corr()
sns.heatmap(correl, annot=True, annot_kws={"size":8})

This heatmap was used to test the words I was originally going to use for my model. Those words turned out to be unsuitable, and the heatmap above offers one reason why: there is strong collinearity among many of the features. In fact, 7 out of 10 are collinear with at least 2 other features. Highly correlated features carry largely redundant information, which likely hurt the model's predictions and helps explain why my precision and recall were so low when using these words.
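
Reading collinear pairs off a heatmap by eye gets tedious, so here is a minimal sketch of listing them programmatically. The three toy features and the 0.8 cutoff are assumptions for illustration; on the real data the same loop could be run on the correlation matrix computed above.

```python
from itertools import combinations
import numpy as np
import pandas as pd

# Toy binary features (hypothetical): 'b' is a near copy of 'a',
# while 'c' is unrelated to both.
a = np.repeat([0, 1], 100)
b = a.copy()
b[:5] = 1              # flip a handful of entries
c = np.tile([0, 1], 100)
features = pd.DataFrame({'a': a, 'b': b, 'c': c})

corr = features.corr().abs()

# Collect every pair of distinct features above the cutoff.
cutoff = 0.8  # assumed threshold for "strongly correlated"
strong_pairs = [
    (x, y, round(corr.loc[x, y], 2))
    for x, y in combinations(corr.columns, 2)
    if corr.loc[x, y] > cutoff
]
print(strong_pairs)
```

Dropping one feature from each strongly correlated pair is one simple way to reduce the redundancy.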

# ROC Curve

In [None]:
model3.predict_proba(trainCX)

In [None]:
from sklearn.metrics import roc_curve
import plotly.express as px
model = model3
y = trainCY
x = trainCX
def predict_threshold(model, X, T): 
    prob_one = model.predict_proba(X)[:, 1]
    return (prob_one >= T).astype(int)

def tpr_threshold(X, Y, T):
    Y_hat = predict_threshold(model, X, T)
    return np.sum((Y_hat == 1) & (Y == 1)) / np.sum(Y == 1)

def fpr_threshold(X, Y, T):
    Y_hat = predict_threshold(model, X, T)
    return np.sum((Y_hat == 1) & (Y == 0)) / np.sum(Y == 0)

thresholds = np.linspace(0, 1, 100)
tprs = [tpr_threshold(x, y, t) for t in thresholds]
fprs = [fpr_threshold(x, y, t) for t in thresholds]

fig = px.line(x=fprs, y = tprs, hover_name = thresholds, title="ROC Curve")
fig.update_xaxes(title = "False Positive Rate")
fig.update_yaxes(title = "True Positive Rate")
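
The cell above imports `roc_curve` from scikit-learn but builds the curve by hand. As a cross-check, here is a minimal sketch of the library route on four made-up scores (the labels and scores are toy values, not model output). `roc_curve` picks its own thresholds from the distinct scores, and `roc_auc_score` summarizes the curve as a single number.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and predicted spam probabilities (hypothetical).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (fpr, tpr) point per threshold chosen by scikit-learn.
toy_fpr, toy_tpr, toy_thresholds = roc_curve(y_true, scores)

# Area under the ROC curve: 1.0 is perfect, 0.5 is random guessing.
toy_auc = roc_auc_score(y_true, scores)

print(toy_fpr, toy_tpr)
print(f"{toy_auc=:.2f}")
```

With the real model, the same calls would take `trainCY` and `model3.predict_proba(trainCX)[:, 1]` as inputs.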