# Naive Bayes Spam Filtering

### Overview

Nobody likes spam, so a classifier that labels a message as spam or not spam ("ham") is useful. In this lab we build one for SMS messages.

### Builds on
None

### Run time
approx. 20-30 minutes

### Notes

We will use Naive Bayes classification, a simple probabilistic method that works well on word-count features.
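
As a reminder of what the classifier computes, the Naive Bayes decision rule can be sketched as follows. The notation is our own and not part of the lab: $c$ is a class (spam or ham) and $w_1, \dots, w_n$ are the words of one message.

$$
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(w_i \mid c)
$$

The "naive" part is the assumption that the words are conditionally independent given the class.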

## Step 1: Load Data
We will load the data into a pandas dataframe. Each message is marked "ham" or "spam", and that column will be our label.

In [None]:
import pandas as pd

# data_location = 'https://elephantscale-public.s3.amazonaws.com/data/spam/SMSSpamCollection.txt'
data_location = '/data/spam/SMSSpamCollection.txt'

dataset = pd.read_csv(data_location, sep='\t')
dataset
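
To see what `read_csv` does with a tab-separated file, here is a minimal sketch on a three-row sample. The sample rows are invented, and the header names `isspam` and `text` are assumed from the hints later in this lab.

```python
import io
import pandas as pd

# Hypothetical sample in the same tab-separated layout as the lab file.
# The header names 'isspam' and 'text' are an assumption based on the lab hints.
sample = io.StringIO(
    "isspam\ttext\n"
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tWINNER!! You have won a prize, call now\n"
    "ham\tI'll call you later\n"
)

# sep='\t' tells pandas that columns are separated by tabs, not commas
sample_df = pd.read_csv(sample, sep="\t")

print(sample_df.shape)             # (rows, columns)
print(sample_df.columns.tolist())  # column names taken from the header line
print(sample_df.head())
```

Checking `shape` and `columns` right after loading is a quick way to confirm that the separator and header were read as expected.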

## Step 2: Explore Data

In [None]:
## TODO :  Count spam/ham
## Hint : group by 'isspam'
## Question : Is there a data skew?
dataset.groupby("???").size()
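
Skew matters because it sets the bar for accuracy. The sketch below uses made-up labels (8 ham, 2 spam) to show how class shares translate into a majority-class baseline. `value_counts` is used here as an alternative to `groupby(...).size()`.

```python
import pandas as pd

# Hypothetical labels: 8 ham and 2 spam messages (not the lab data)
labels = pd.Series(["ham"] * 8 + ["spam"] * 2, name="isspam")

counts = labels.value_counts()                # absolute count per class
shares = labels.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)

# Accuracy of a trivial model that always predicts the most common class
baseline_accuracy = shares.max()
print("majority-class baseline:", baseline_accuracy)
```

With skewed data, a classifier has to beat this baseline before its accuracy means much.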

## Step 3: Vectorize Using TF/IDF

Let's start with TF/IDF vectorization. TF/IDF counts how often each term occurs in a message (term frequency), and then down-weights terms that occur in many messages across the dataset (inverse document frequency).

This produces very high-dimensional data, because every distinct word in the dataset becomes a dimension.
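
A minimal sketch of this on a three-message corpus, using the same two pipeline steps as the lab. The messages are invented. The vocabulary is read from `vocabulary_` so the snippet does not depend on a particular scikit-learn version.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Hypothetical mini-corpus, not taken from the lab dataset
docs = [
    "free prize call now",
    "call me now",
    "are we still on for lunch",
]

demo_pipeline = Pipeline([("vec", CountVectorizer()),
                          ("tfidf", TfidfTransformer())])
demo_x = demo_pipeline.fit_transform(docs)  # sparse matrix: one row per message

# Recover the words in column order from the fitted vectorizer
vec = demo_pipeline.named_steps["vec"]
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
idf = demo_pipeline.named_steps["tfidf"].idf_

print("shape:", demo_x.shape)  # (messages, distinct words)
for word, weight in zip(vocab, idf):
    print(f"{word:>6}  idf={weight:.3f}")

# With default settings each row is scaled to unit length
row_norms = np.linalg.norm(demo_x.toarray(), axis=1)
print("row norms:", row_norms)
```

Words that appear in two messages ("call", "now") get a lower IDF weight than words that appear in only one, and the number of columns equals the number of distinct words.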

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer


pipeline = Pipeline([('vec', CountVectorizer()),
                     ('tfidf', TfidfTransformer())])

## TODO : vectorize the 'text' column
x = pipeline.fit_transform(dataset['???'])
x

In [None]:
## we can leave y as 'spam' / 'ham'
# y = dataset['isspam']

## TODO : encode y
## Hint : use the 'isspam' column
y = pd.factorize(dataset['???'])[0]
y
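
`pd.factorize` returns both the integer codes and the labels they stand for. A small sketch on invented labels shows how to tell which number means which class:

```python
import pandas as pd

# Hypothetical label column
labels = pd.Series(["ham", "spam", "ham", "ham", "spam"])

# codes: one integer per row; uniques: the label behind each integer,
# numbered in order of first appearance
codes, uniques = pd.factorize(labels)

print(codes)
print(uniques)
```

Because codes are assigned in order of first appearance, the class of the first row always becomes 0.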

## Step 4: Split train/test

In [None]:
## TODO: Use training / test split of 80%/20%

from sklearn.model_selection import train_test_split

x_train,x_test,y_train, y_test = train_test_split(x,y,  test_size=???)
## to control train/test split set random_state to a number
# x_train,x_test,y_train, y_test = train_test_split(x,y, random_state=0, test_size=0.3)

print ("x_train :" , x_train.shape )
print ("x_test :", x_test.shape)
print ("y_train :", y_train.shape)
print ("y_test :", y_test.shape)
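
When classes are imbalanced, a plain random split can leave the test set with too few positives. One option, sketched here on synthetic data with an arbitrary 75%/25% split, is the `stratify` argument, which keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 80 of class 0 and 20 of class 1
demo_features = np.arange(100).reshape(-1, 1)
demo_labels = np.array([0] * 80 + [1] * 20)

# stratify=demo_labels preserves the 80/20 class ratio in train and test;
# random_state makes the split reproducible
f_train, f_test, l_train, l_test = train_test_split(
    demo_features, demo_labels,
    test_size=0.25, random_state=0, stratify=demo_labels)

print("train size:", f_train.shape, "positives:", l_train.sum())
print("test size :", f_test.shape, "positives:", l_test.sum())
```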

## Step 5: Run Naive Bayes

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB ()

##TODO : fit on (x_train, y_train)
model = nb.fit(x_train, ???)
model
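
To make the fit and predict steps concrete, here is a toy sketch on a hand-written count matrix. The three "words" and all counts are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts; columns stand for ["free", "lunch", "prize"]
toy_x = np.array([
    [2, 0, 1],   # spam
    [1, 0, 2],   # spam
    [0, 2, 0],   # ham
    [0, 1, 0],   # ham
])
toy_y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

# alpha=1.0 is Laplace smoothing, so unseen words never get zero probability
toy_model = MultinomialNB(alpha=1.0).fit(toy_x, toy_y)

# A new message containing "free" once and "prize" once
new_msg = np.array([[1, 0, 1]])
pred = toy_model.predict(new_msg)
proba = toy_model.predict_proba(new_msg)

print("classes    :", toy_model.classes_)
print("prediction :", pred)
print("probability:", proba)
```

`predict_proba` returns one column per class, in the order given by `classes_`.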

## Step 6: Inspect the Model

In [None]:
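## coef_ / intercept_ are no longer available on MultinomialNB in recent
## scikit-learn releases; these are the equivalent fitted attributes
print('feature log probabilities : ', model.feature_log_prob_)
print('class log priors : ', model.class_log_prior_)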
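
The raw arrays are hard to read on their own. One way to interpret them, sketched below on four invented messages, is to rank words by how much more likely they are under one class than the other.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training messages
texts = ["win a free prize now", "free cash prize",
         "lunch at noon", "see you at lunch"]
labels = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

vec = CountVectorizer()
counts = vec.fit_transform(texts)
clf = MultinomialNB().fit(counts, labels)

# Words in column order
words = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))

# log P(word | spam) - log P(word | ham): large values point towards spam
log_ratio = clf.feature_log_prob_[1] - clf.feature_log_prob_[0]
order = np.argsort(log_ratio)[::-1]
top_spam_words = words[order[:2]]

print("most spam-like words:", top_spam_words)
```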

## Step 7: Evaluate the Model

### 7.1: Predict on test data


In [None]:
## TODO : predict on 'x_test'
y_pred = model.predict(???)
y_pred

In [None]:
a = pd.DataFrame ({'label' : y_test, 'prediction' : y_pred })
a

### 7.2: Score

Let's see how well our model performs, using accuracy as the measure.

In [None]:
## TODO : score the model using (x_test, y_test)
model.score(???,???)

In [None]:
from sklearn.metrics import accuracy_score

## TODO : score the model using (y_test, y_pred)
accuracy_score(y_test, ???)

### 7.3: Confusion Matrix

Hmm, the positive case (spam) didn't do as well as the negative case (ham). Why is that?

In [None]:
from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
cm
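
For a binary problem the four cells of the matrix have names. The sketch below, on invented labels, unpacks them and normalizes each row to show how often each true class is predicted correctly.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 6 ham (0) and 4 spam (1)
true_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
pred_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Rows are true classes, columns are predicted classes
demo_cm = confusion_matrix(true_labels, pred_labels)
tn, fp, fn, tp = demo_cm.ravel()
print("TN, FP, FN, TP =", tn, fp, fn, tp)

# Divide each row by its total; the diagonal is the per-class hit rate
row_rates = demo_cm / demo_cm.sum(axis=1, keepdims=True)
per_class_recall = np.diag(row_rates)
print("per-class recall:", per_class_recall)
```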

In [None]:
## Plot confusion matrix
%matplotlib inline
import matplotlib.pyplot as plt
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

import my_utils

my_utils.plot_confusion_matrix(cm, target_names=['ham','spam'], normalize=False)

In [None]:
## Metrics calculated from Confusion Matrix
from sklearn.metrics import precision_recall_fscore_support

pd.DataFrame(list(precision_recall_fscore_support(y_test, y_pred)),
             columns=['ham', 'spam'],
             index=['Precision', 'Recall', "F1", "Support"])
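
These metrics follow directly from the confusion-matrix counts. As a sanity check, the sketch below computes them by hand for the positive class on invented labels and compares the result with scikit-learn.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical labels: 6 ham (0) and 4 spam (1)
true_labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
pred_labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# Counts for the positive (spam) class
tp = np.sum((true_labels == 1) & (pred_labels == 1))
fp = np.sum((true_labels == 0) & (pred_labels == 1))
fn = np.sum((true_labels == 1) & (pred_labels == 0))

precision = tp / (tp + fp)  # of everything flagged as spam, how much was spam
recall = tp / (tp + fn)     # of all real spam, how much was caught
f1 = 2 * precision * recall / (precision + recall)

# scikit-learn returns one value per class; index 1 is the spam class
sk_precision, sk_recall, sk_f1, sk_support = precision_recall_fscore_support(
    true_labels, pred_labels)

print("by hand :", precision, recall, f1)
print("sklearn :", sk_precision[1], sk_recall[1], sk_f1[1])
```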

### 7.4: ROC Curve & AUC

For this to work, `y` must be encoded as a number, as we did with `pd.factorize` in Step 3.

In [None]:
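from sklearn.metrics import roc_auc_score

## use the predicted probability of class 1 rather than hard 0/1 predictions
y_score = model.predict_proba(x_test)[:, 1]
roc_auc_score(y_test, y_score)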

In [None]:
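from sklearn.metrics import roc_curve, auc

## ROC needs a continuous score: take the predicted probability of class 1
y_score = model.predict_proba(x_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)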
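
A ROC curve is traced by sweeping a decision threshold over a continuous score, so it is most informative when given scores or probabilities rather than 0/1 predictions. The sketch below, on invented scores, compares the two.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and classifier scores
true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.6, 0.4, 0.45, 0.7, 0.8, 0.9])

# Hard predictions: the scores cut at a single threshold of 0.5
hard = (scores >= 0.5).astype(int)

auc_from_scores = roc_auc_score(true_labels, scores)
auc_from_hard = roc_auc_score(true_labels, hard)

fpr_s, tpr_s, _ = roc_curve(true_labels, scores)
fpr_h, tpr_h, _ = roc_curve(true_labels, hard)

print("AUC from scores:", auc_from_scores, "curve points:", len(fpr_s))
print("AUC from hard  :", auc_from_hard, "curve points:", len(fpr_h))
```

With hard predictions the curve collapses to a single operating point joined to the two corners, which hides how the model behaves at other thresholds.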

In [None]:
plt.title('Receiver Operating Characteristic')
plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
plt.legend(loc='lower right')
plt.plot([0,1],[0,1],'r--')
plt.xlim([-0.1,1.2])
plt.ylim([-0.1,1.2])
plt.ylabel('True Positive Rate')
plt.xlabel('False Positive Rate')
plt.show()

## Step 8: Run your own data!

Now it's your turn! Make a new dataframe with some sample test data of your own creation. Make some "spammy" SMSes and some ordinary ones. See how our spam filter does.

In [None]:
mydata = pd.DataFrame ( { 'text' : [
                                     'can we meet for lunch?',
                                     'win win win instant tickets!',
                                     'ultra cheap medications!!!',
                                     'your text here'
]})

mydata

In [None]:
my_x = pipeline.transform(mydata['text'])
my_x

In [None]:
my_pred = model.predict(my_x)
my_pred

In [None]:
mydata['prediction'] = my_pred
mydata
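
One thing to keep in mind when testing your own messages: `transform` reuses the vocabulary learned during fitting, so words the vectorizer has never seen are silently dropped. A minimal sketch with an invented two-message vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training messages that define the vocabulary
vec = CountVectorizer().fit(["free prize now", "lunch at noon"])

# Both words are in the vocabulary, so both are counted
known = vec.transform(["free lunch"])

# Neither token was seen during fit, so the row is all zeros
unknown = vec.transform(["fr33 pr1ze"])

print("known tokens counted  :", known.sum())
print("unknown tokens counted:", unknown.sum())
```

A message made up entirely of unseen words gives the model no word evidence, so its prediction rests on the class priors alone.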

## FUN : How will you defeat this algorithm? :-) 

If you were a spammer, how could you defeat this algorithm?

<img src="../assets/images/come-tothe-dark-side-iin-we-have-cookies.png" />

## Further Reading
Check out [Amazon Comprehend](https://us-west-2.console.aws.amazon.com/comprehend/v2/home?region=us-west-2#welcome) to parse natural text and extract meaning.