# Naive Bayes Spam Filter

Explore how well the scikit-learn Naive Bayes classifiers perform on the UCI ML Repository Spambase dataset.

In [25]:
import urllib.request

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB

In [11]:
# create the url for the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data"

# import the data from the UCI archive website
raw_data = urllib.request.urlopen(url)
dataset = np.loadtxt(raw_data, delimiter=',')
print(dataset[0])

[  0.      0.64    0.64    0.      0.32    0.      0.      0.      0.
   0.      0.      0.64    0.      0.      0.      0.32    0.      1.29
   1.93    0.      0.96    0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.      0.      0.
   0.      0.      0.      0.      0.      0.      0.778   0.      0.
   3.756  61.    278.      1.   ]


In [13]:
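# split the dataset into features and target
# columns 0-47 are the 48 word-frequency features; the character-frequency and
# capital-run-length columns are left out, and the last column is the spam label
X = dataset[:, 0:48]
y = dataset[:, -1]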

In [22]:
# view count of target values
value_counts = np.unique(y, return_counts=True)
not_spam_count = value_counts[1][0]
spam_count = value_counts[1][1]
total_count = not_spam_count + spam_count

print(f'Label: {value_counts[0][0]} Count: {not_spam_count} Percent: {round(not_spam_count / total_count, 2)*100}')
print(f'Label: {value_counts[0][1]} Count: {spam_count} Percent: {round(spam_count / total_count, 2)*100}')

Label: 0.0 Count: 2788 Percent: 61.0
Label: 1.0 Count: 1813 Percent: 39.0


### This shows a moderate imbalance between spam and non-spam (about 61% vs. 39%)
If the model labels a large share of the spam in the test data as non-spam, randomly oversampling the spam examples in the training data may help by giving the model more spam to learn from.
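The resampling idea mentioned above could look roughly like the following sketch. It uses `sklearn.utils.resample` to oversample the minority class with replacement. The helper name `oversample_minority` and the toy arrays are hypothetical stand-ins for `X_train` and `y_train`, and the helper assumes a binary target.

```python
import numpy as np
from sklearn.utils import resample


def oversample_minority(X, y, random_state=0):
    """Randomly duplicate minority-class rows until both classes are the same size."""
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    n_majority = int(counts.max())

    X_min, y_min = X[y == minority], y[y == minority]
    X_maj, y_maj = X[y != minority], y[y != minority]

    # Draw minority rows with replacement until they match the majority count.
    X_up, y_up = resample(
        X_min, y_min, replace=True, n_samples=n_majority, random_state=random_state
    )
    return np.vstack([X_maj, X_up]), np.concatenate([y_maj, y_up])


# Toy stand-in for the training split: 6 non-spam rows, 4 spam rows.
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1], dtype=float)

X_bal, y_bal = oversample_minority(X_toy, y_toy)
print(np.unique(y_bal, return_counts=True))
```

Resampling should be applied to the training split only, so that the test set keeps its original class mix.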

In [23]:
# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=17)

### Bernoulli NB

In [27]:
# create a Bernoulli Naive Bayes model
# note: binarize expects a numeric threshold; True is treated as 1.0 here
BernNB = BernoulliNB(binarize=True)
BernNB.fit(X_train, y_train)
print(BernNB)

y_true = y_test
y_pred = BernNB.predict(X_test)

print(accuracy_score(y_true, y_pred), '\n')
print(confusion_matrix(y_true, y_pred),'\n')
print(classification_report(y_true, y_pred))

BernoulliNB(binarize=True)
0.8577633007600435 

[[491  56]
 [ 75 299]] 

              precision    recall  f1-score   support

         0.0       0.87      0.90      0.88       547
         1.0       0.84      0.80      0.82       374

    accuracy                           0.86       921
   macro avg       0.85      0.85      0.85       921
weighted avg       0.86      0.86      0.86       921



### Multinomial NB

In [28]:
MultiNB = MultinomialNB()
MultiNB.fit(X_train, y_train)
print(MultiNB)

y_pred = MultiNB.predict(X_test)

print(accuracy_score(y_true, y_pred), '\n')
print(confusion_matrix(y_true, y_pred),'\n')
print(classification_report(y_true, y_pred))

MultinomialNB()
0.8816503800217155 

[[454  93]
 [ 16 358]] 

              precision    recall  f1-score   support

         0.0       0.97      0.83      0.89       547
         1.0       0.79      0.96      0.87       374

    accuracy                           0.88       921
   macro avg       0.88      0.89      0.88       921
weighted avg       0.90      0.88      0.88       921



### Gaussian NB

In [29]:
GaussNB = GaussianNB()
GaussNB.fit(X_train, y_train)
print(GaussNB)

y_pred = GaussNB.predict(X_test)

print(accuracy_score(y_true, y_pred), '\n')
print(confusion_matrix(y_true, y_pred),'\n')
print(classification_report(y_true, y_pred))

GaussianNB()
0.8197611292073833 

[[388 159]
 [  7 367]] 

              precision    recall  f1-score   support

         0.0       0.98      0.71      0.82       547
         1.0       0.70      0.98      0.82       374

    accuracy                           0.82       921
   macro avg       0.84      0.85      0.82       921
weighted avg       0.87      0.82      0.82       921



### Try Bernoulli NB with different binarize values

In [32]:
# try a lower binarize threshold, since the features are continuous frequencies rather than binary values
BernNB = BernoulliNB(binarize=0.1)
BernNB.fit(X_train, y_train)
print(BernNB)

y_pred = BernNB.predict(X_test)

print(accuracy_score(y_true, y_pred), '\n')
print(confusion_matrix(y_true, y_pred),'\n')
print(classification_report(y_true, y_pred))

BernoulliNB(binarize=0.1)
0.9109663409337676 

[[515  32]
 [ 50 324]] 

              precision    recall  f1-score   support

         0.0       0.91      0.94      0.93       547
         1.0       0.91      0.87      0.89       374

    accuracy                           0.91       921
   macro avg       0.91      0.90      0.91       921
weighted avg       0.91      0.91      0.91       921
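To see why the threshold matters, here is a small sketch of what binarization does to a few of the word-frequency values from the first row printed earlier. `BernoulliNB` maps a feature to 1 when it is strictly greater than `binarize` and to 0 otherwise; the helper `binarize_features` is a hypothetical stand-in for that step, not the library's internal code.

```python
import numpy as np

# A few word-frequency values taken from the first row of the dataset.
freqs = np.array([0.0, 0.64, 0.32, 1.29, 1.93, 0.96])


def binarize_features(values, threshold):
    """Return 1 where a value is strictly above the threshold, else 0."""
    return (values > threshold).astype(int)


at_1 = binarize_features(freqs, 1.0)
at_01 = binarize_features(freqs, 0.1)

print(at_1)   # [0 0 0 1 1 0]
print(at_01)  # [0 1 1 1 1 1]
```

With a threshold of 1.0, most nonzero frequencies collapse to 0, so the model loses the information that those words were present. That may be part of why the lower threshold scores better here.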



In [35]:
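# sweep over several binarize thresholds (each one is scored on the test set)
# note: 0.75 is out of sequence with the other values; 0.075 may have been intended
bin_values = [0.05, 0.75, 0.1, 0.125, 0.15]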

for bn in bin_values:
    BernNB = BernoulliNB(binarize=bn)
    BernNB.fit(X_train, y_train)
    
    print(BernNB)
    
    y_true = y_test
    y_pred = BernNB.predict(X_test)
    
    print(classification_report(y_true, y_pred))

BernoulliNB(binarize=0.05)
              precision    recall  f1-score   support

         0.0       0.90      0.93      0.92       547
         1.0       0.89      0.85      0.87       374

    accuracy                           0.90       921
   macro avg       0.90      0.89      0.89       921
weighted avg       0.90      0.90      0.90       921

BernoulliNB(binarize=0.75)
              precision    recall  f1-score   support

         0.0       0.90      0.89      0.89       547
         1.0       0.84      0.85      0.84       374

    accuracy                           0.87       921
   macro avg       0.87      0.87      0.87       921
weighted avg       0.87      0.87      0.87       921

BernoulliNB(binarize=0.1)
              precision    recall  f1-score   support

         0.0       0.91      0.94      0.93       547
         1.0       0.91      0.87      0.89       374

    accuracy                           0.91       921
   macro avg       0.91      0.90      0.91       921
weighted avg       0.91      0.91      0.91       921
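The loop above scores every threshold on the test set, so the chosen value may be tuned to that particular split. One alternative is to select the threshold by cross-validation on the training data only. The sketch below uses `GridSearchCV` with a scorer for non-spam recall; the synthetic `X_demo` and `y_demo` arrays are hypothetical stand-ins for `X_train` and `y_train`, so the selected value says nothing about Spambase itself.

```python
import numpy as np
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(17)

# Synthetic stand-in data (not Spambase): class 1 rows get larger feature values.
y_demo = rng.integers(0, 2, size=400)
X_demo = rng.exponential(scale=0.2, size=(400, 10)) * (1 + y_demo[:, None])

# Score each candidate by recall on class 0 (non-spam).
non_spam_recall = make_scorer(recall_score, pos_label=0)

search = GridSearchCV(
    BernoulliNB(),
    param_grid={"binarize": [0.05, 0.75, 0.1, 0.125, 0.15]},
    scoring=non_spam_recall,
    cv=5,
)
search.fit(X_demo, y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

The test set is then used once, at the end, to report the performance of the selected model.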

## Final Analysis

### Main Objective: Keep non-spam emails out of the spam folder

Because classification involves trade-offs, it is important to decide which metric matters most. For spam detection, **the most important metric to optimize is Non-Spam Recall**: the fraction of non-spam emails that are correctly kept out of the spam folder. The remainder (1 - Non-Spam Recall) is the fraction of non-spam emails incorrectly classified as spam. An important email landing in the spam folder is usually a bigger problem for the user than an occasional spam email slipping through the filter into the inbox.
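As a worked example, non-spam recall can be computed with `recall_score` by treating class 0 as the positive label. The sketch below rebuilds label arrays that reproduce the confusion matrix reported for `BernoulliNB(binarize=0.1)`; the `_demo` arrays are illustrative reconstructions, not the notebook's actual predictions.

```python
import numpy as np
from sklearn.metrics import recall_score

# Confusion matrix reported above: [[515, 32], [50, 324]]
# (rows = true class, columns = predicted class; 0 = non-spam, 1 = spam).
y_true_demo = np.array([0] * 547 + [1] * 374)
y_pred_demo = np.array([0] * 515 + [1] * 32 + [0] * 50 + [1] * 324)

# Recall for class 0: share of real non-spam that stays out of the spam folder.
non_spam_recall = recall_score(y_true_demo, y_pred_demo, pos_label=0)

# The complement is the share of real non-spam sent to the spam folder.
sent_to_spam = 1 - non_spam_recall

print(round(non_spam_recall, 4))  # 0.9415
print(round(sent_to_spam, 4))     # 0.0585
```

In other words, about 32 of the 547 non-spam test emails would have landed in the spam folder with this model.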

Optimizing for Non-Spam Recall:

**BernoulliNB (binarize=0.1) has the best performance**

- Non-Spam Recall: 0.94
- Non-Spam Precision: 0.91


### If eliminating spam were the most important objective

If eliminating spam were the most important outcome, even at the expense of falsely identifying non-spam emails as spam, then the **Gaussian NB model** would be the best performer, since it has the highest Spam Recall (0.98) and the highest Non-Spam Precision (0.98). It catches the largest share of spam, but it also misclassifies non-spam emails as spam far more often than the other models: its Non-Spam Recall is only 0.71.
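The trade-off between the two objectives can be tabulated directly from the confusion matrices printed earlier. This is a small sketch in plain Python; the dictionary keys and the helper `class_recalls` are hypothetical names chosen for illustration.

```python
# Confusion matrices copied from the outputs above
# (rows = true class, columns = predicted class; 0 = non-spam, 1 = spam).
reported = {
    "BernoulliNB(binarize=True)": [[491, 56], [75, 299]],
    "MultinomialNB()": [[454, 93], [16, 358]],
    "GaussianNB()": [[388, 159], [7, 367]],
    "BernoulliNB(binarize=0.1)": [[515, 32], [50, 324]],
}


def class_recalls(cm):
    """Return (non-spam recall, spam recall) for a 2x2 confusion matrix."""
    (kept, flagged), (missed, caught) = cm
    return kept / (kept + flagged), caught / (caught + missed)


recalls = {name: class_recalls(cm) for name, cm in reported.items()}

for name, (non_spam, spam) in recalls.items():
    print(f"{name:28s} non-spam recall={non_spam:.2f}  spam recall={spam:.2f}")

best_non_spam = max(recalls, key=lambda name: recalls[name][0])
best_spam = max(recalls, key=lambda name: recalls[name][1])

print(best_non_spam)  # BernoulliNB(binarize=0.1)
print(best_spam)      # GaussianNB()
```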
