# Naive Bayes: keeping it simple, sweetheart

Training a spam classifier with sklearn.naive_bayes is fairly uncomplicated in Python.
As usual, most of the time goes into preparing the data, and most of the code below just loads the training data
into a pandas DataFrame:

In [25]:
import os
import io
import numpy
from pandas import DataFrame, concat
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def readFiles(path):
    # Walk the directory tree and yield (file path, message body) for each e-mail.
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            # Use a separate name so the 'path' argument is not overwritten.
            filepath = os.path.join(root, filename)

            inBody = False
            lines = []
            # The with-block closes the file even if reading fails.
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        # The first blank line separates the headers from the body.
                        inBody = True
            # Each line keeps its own '\n', so this join doubles the line breaks.
            message = '\n'.join(lines)
            yield filepath, message


def dataFrameFromDirectory(path, classification):
    # Build one row per e-mail, indexed by file path and labeled with the given class.
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0; concat stacks the two frames instead.
data = concat([
    dataFrameFromDirectory('c:/Users/John/Desktop/ML_coding/emails/spam', 'spam'),
    dataFrameFromDirectory('c:/Users/John/Desktop/ML_coding/emails/ham', 'ham'),
])




Let's have a look at that DataFrame:

In [27]:
data.head(10)


| | class | message |
|---|---|---|
| `c:/Users/John/Desktop/ML_coding/emails/spam\00001.7848dde101aa985090474a91ec93fcf0` | spam | `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Tr...` |
| `c:/Users/John/Desktop/ML_coding/emails/spam\00002.d94f1b97e48ed3b553b3508d116e6a09` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `c:/Users/John/Desktop/ML_coding/emails/spam\00003.2ee33bc6eacdb11f38d052c44819ba6c` | spam | `1) Fight The Risk of Cancer!\n\nhttp://www.adc...` |
| `c:/Users/John/Desktop/ML_coding/emails/spam\00004.eac8de8d759b7e74154f142194282724` | spam | `##############################################...` |
| `c:/Users/John/Desktop/ML_coding/emails/spam\00005.57696a39d7d84318ce497886896bf90d` | spam | `I thought you might like these:\n\n1) Slim Dow...` |
| `c:/Users/John/Desktop/ML_coding/emails/spam\00006.5ab5620d3d7c6c0db76234556a16f6c1` | spam | `A POWERHOUSE GIFTING PROGRAM You Don't Want To...` |
| `c:/Users/John/Desktop/ML_coding/emails/spam\00007.d8521faf753ff9ee989122f6816f87d7` | spam | `Help wanted. We are a 14 year old fortune 500...` |
| `c:/Users/John/Desktop/ML_coding/emails/spam\00008.dfd941deb10f5eed78b1594b131c9266` | spam | `<html>\n\n<head>\n\n<title>ReliaQuote - Save U...` |
| `c:/Users/John/Desktop/ML_coding/emails/spam\00009.027bf6e0b0c4ab34db3ce0ea4bf2edab` | spam | `TIRED OF THE BULL OUT THERE?\n\nWant To Stop L...` |
| `c:/Users/John/Desktop/ML_coding/emails/spam\00010.445affef4c70feec58f9198cfbc22997` | spam | `Dear ricardo1 ,\n\n\n\n<html>\n\n<body>\n\n<ce...` |


Use the CountVectorizer to split each message into words and count how often each word occurs, then feed those counts to a MultinomialNB classifier. Call fit() and you have a trained spam filter. How could you integrate it into an e-mail service that doesn't already have one? Could you use this knowledge to improve an existing spam filter?

In [18]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
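
To make the CountVectorizer step more concrete, here is a minimal sketch on two made-up messages. The `toy_*` names and the messages themselves are assumptions for illustration only; they are not part of the e-mail data set.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, used only for illustration.
toy_messages = ["free money free", "lunch money"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each word to a column index (assigned in alphabetical order).
vocabulary = toy_vectorizer.vocabulary_
# Each row is one message; each column holds the count of one word.
dense_counts = toy_counts.toarray()

print(sorted(vocabulary))  # ['free', 'lunch', 'money']
print(dense_counts)        # [[2 0 1]
                           #  [0 1 1]]
```

The classifier never sees the text itself, only these per-word counts, which is why the same fitted vectorizer has to be reused when new messages are classified.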

Now try it out and see how well it works:

In [31]:
examples = ['FREE Viagra free samples!!!', "Hey Daria, are we still meeting for lunch?"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham'], dtype='<U4')
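
One possible way to approach the integration question raised earlier is to bundle the vectorizer and the classifier into a single object, so a mail service only has to call one method. The sketch below uses scikit-learn's `make_pipeline` and `predict_proba` on a tiny made-up training set; the messages, labels, and the `spam_filter` name are assumptions for illustration, not the data used above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up training data, used only for illustration.
toy_messages = [
    "free prize claim your free money now",
    "win money free samples",
    "are we still meeting for lunch",
    "see you at the meeting tomorrow",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# The pipeline vectorizes the text and then classifies it in one call.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(toy_messages, toy_labels)

new_messages = ["free money", "lunch meeting tomorrow"]
labels = spam_filter.predict(new_messages)
probabilities = spam_filter.predict_proba(new_messages)

# classes_ gives the column order of predict_proba.
for text, label, row in zip(new_messages, labels, probabilities):
    print(text, "->", label, dict(zip(spam_filter.classes_, row.round(3))))
```

Looking at the probabilities rather than only the label shows how confident the model is, which can help when deciding where to set a threshold for moving mail to a spam folder.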

## Activity

The data set is small, so the spam classifier isn't very good.
I tried a few different messages, such as:
"Russian Girls want to talk to you now FREE service!"
"FREE ways to make lots of money from home!!!"
with varying levels of success. Most of the errors were false negatives, that is, spam that slipped through as ham.
To build your own classifier, take the directory path:
c:/Users/John/Desktop/ML_coding/emails/spam
and change it to the directory where you put the associated sample e-mails (and do the same for the ham path).

Next, I will apply a train/test split to this spam classifier so I can see how well it predicts messages it was not trained on.
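
As a rough preview, a train/test split could look something like the sketch below. It uses scikit-learn's `train_test_split` on a small made-up data set; the messages, the 25% test size, and the `toy_*` names are assumptions for illustration. With the real data, `data['message'].values` and `data['class'].values` would take the place of the toy lists.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up messages, used only for illustration.
toy_messages = [
    "free prize claim your free money now",
    "win money free samples",
    "cheap pills free shipping",
    "make money from home fast",
    "are we still meeting for lunch",
    "see you at the meeting tomorrow",
    "can you send me the report",
    "thanks for lunch yesterday",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

# Hold out 25% of the messages; stratify keeps both classes in each part.
train_text, test_text, train_labels, test_labels = train_test_split(
    toy_messages, toy_labels, test_size=0.25, random_state=0, stratify=toy_labels
)

# Fit the vocabulary on the training messages only, then reuse it for the test set.
toy_vectorizer = CountVectorizer()
train_counts = toy_vectorizer.fit_transform(train_text)
test_counts = toy_vectorizer.transform(test_text)

toy_classifier = MultinomialNB()
toy_classifier.fit(train_counts, train_labels)

# score() returns the fraction of held-out messages classified correctly.
accuracy = toy_classifier.score(test_counts, test_labels)
print("held-out accuracy:", accuracy)
```

Fitting the vectorizer on the training part alone keeps the test messages truly unseen, so the accuracy is a fairer estimate of how the filter would behave on new mail.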