# Naive Bayes (the easy way)

We'll cheat by using sklearn.naive_bayes to train a spam classifier! Most of the code is just loading our training data into a pandas DataFrame that we can play with:

In [1]:
import os
import io
import numpy
from pandas import DataFrame
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
#Alain: Activity for train/test
from sklearn.model_selection import train_test_split

def readFiles(path):
    for root, dirnames, filenames in os.walk(path):
        for filename in filenames:
            # Use a separate name so the `path` argument is not overwritten
            filepath = os.path.join(root, filename)

            # Skip the email headers: the body starts after the first blank line
            inBody = False
            lines = []
            with io.open(filepath, 'r', encoding='latin1') as f:
                for line in f:
                    if inBody:
                        lines.append(line)
                    elif line == '\n':
                        inBody = True
            message = '\n'.join(lines)
            yield filepath, message


def dataFrameFromDirectory(path, classification):
    rows = []
    index = []
    for filename, message in readFiles(path):
        rows.append({'message': message, 'class': classification})
        index.append(filename)

    return DataFrame(rows, index=index)

# DataFrame.append was removed in pandas 2.0; pandas.concat does the same job
from pandas import concat

data = concat([
    dataFrameFromDirectory('/Users/amartens/Documents/Udemy/DataScience-Python3/emails/spam', 'spam'),
    dataFrameFromDirectory('/Users/amartens/Documents/Udemy/DataScience-Python3/emails/ham', 'ham'),
], sort=False)


Let's have a look at that DataFrame:

In [2]:
data.head()

Unnamed: 0,message,class
/Users/amartens/Documents/Udemy/DataScience-Python3/emails/spam/00249.5f45607c1bffe89f60ba1ec9f878039a,"Dear Homeowner,\n\n \n\nInterest Rates are at ...",spam
/Users/amartens/Documents/Udemy/DataScience-Python3/emails/spam/00373.ebe8670ac56b04125c25100a36ab0510,ATTENTION: This is a MUST for ALL Computer Use...,spam
/Users/amartens/Documents/Udemy/DataScience-Python3/emails/spam/00214.1367039e50dc6b7adb0f2aa8aba83216,This is a multi-part message in MIME format.\n...,spam
/Users/amartens/Documents/Udemy/DataScience-Python3/emails/spam/00210.050ffd105bd4e006771ee63cabc59978,IMPORTANT INFORMATION:\n\n\n\nThe new domain n...,spam
/Users/amartens/Documents/Udemy/DataScience-Python3/emails/spam/00033.9babb58d9298daa2963d4f514193d7d6,This is the bottom line. If you can GIVE AWAY...,spam


Now we will use a CountVectorizer to convert each message into a vector of word counts, and feed those counts into a MultinomialNB classifier. Call fit() and we've got a trained spam filter ready to go! It's just that easy.

In [3]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(data['message'].values)

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
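
To make the word-count step concrete, here is a minimal sketch of what CountVectorizer produces. The two messages and the `toy_` names are made up for illustration and are not part of the email data set:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up messages, for illustration only (not from the email corpus)
toy_messages = ["Free money now", "Money for free, free for all"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_messages)

# vocabulary_ maps each lowercased token to a column index;
# sorting by that index lists the words in column order
vocabulary = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocabulary)
# ['all', 'for', 'free', 'money', 'now']

# One row per message, one column per vocabulary word, values are counts
print(toy_counts.toarray())
# [[0 0 1 1 1]
#  [1 2 2 1 0]]
```

Word order and punctuation are discarded; only how often each word appears in a message is kept, which is all MultinomialNB needs.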

Let's try it out:

In [5]:
examples = ['Free Viagra now!!!', "Hi Bob, how about a game of golf tomorrow?", "Do you want free dollars?", "Just a simple message"]
example_counts = vectorizer.transform(examples)
predictions = classifier.predict(example_counts)
predictions

array(['spam', 'ham', 'spam', 'ham'], dtype='<U4')
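
predict() only returns the winning label. If you also want to see how confident the classifier is, predict_proba() returns one probability per class. The sketch below trains on a tiny made-up data set (the messages, labels and `toy_` names are assumptions for illustration), so it runs without the email files:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up training set, for illustration only
toy_messages = [
    "win free cash now",
    "free prize claim now",
    "lunch meeting tomorrow",
    "see you at golf tomorrow",
]
toy_labels = ["spam", "spam", "ham", "ham"]

toy_vectorizer = CountVectorizer()
toy_classifier = MultinomialNB()
toy_classifier.fit(toy_vectorizer.fit_transform(toy_messages), toy_labels)

# Score one new message; it must go through the same fitted vectorizer
new_counts = toy_vectorizer.transform(["free cash tomorrow"])
probabilities = toy_classifier.predict_proba(new_counts)[0]

# classes_ gives the column order of predict_proba (sorted label names)
for label, probability in zip(toy_classifier.classes_, probabilities):
    print(label, round(probability, 3))
```

The same call should work on the notebook's own classifier, for example classifier.predict_proba(example_counts).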

## Activity

Our data set is small, so our spam classifier isn't actually very good. Try running some different test emails through it and see if you get the results you expect.

If you really want to challenge yourself, try applying train/test to this spam classifier - see how well it can predict some subset of the ham and spam emails.

In [58]:
#train_test_split was imported in the first cell
#Activity - hold out 20% of the emails as a test set

from sklearn.metrics import r2_score

train, test = train_test_split(data, test_size=0.2)
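
A plain random split gives a different train/test partition on every run and does not guarantee the same spam/ham ratio in both parts. As a possible refinement (not used in the cell above), train_test_split accepts random_state and stratify. The sketch below uses a made-up frame with the same two columns as `data`; the seed value is an arbitrary choice:

```python
from pandas import DataFrame
from sklearn.model_selection import train_test_split

# Made-up frame with the same columns as `data`: 8 ham rows and 2 spam rows
toy = DataFrame({
    'message': ['message %d' % i for i in range(10)],
    'class': ['ham'] * 8 + ['spam'] * 2,
})

toy_train, toy_test = train_test_split(
    toy,
    test_size=0.5,
    random_state=0,         # arbitrary fixed seed, makes the split repeatable
    stratify=toy['class'],  # keep the ham/spam ratio the same in both parts
)

print(toy_train['class'].value_counts().to_dict())
print(toy_test['class'].value_counts().to_dict())
```

With stratification, each half of this toy frame ends up with 4 ham rows and 1 spam row.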

In [59]:
train.head()

Unnamed: 0,message,class
/Users/amartens/Documents/Udemy/DataScience-Python3/emails/ham/01584.c2bc0fb5826431ed3df58a0fc968c068,"On Tue, 10 Sep 2002, Rose, Bobby wrote:\n\n\n\...",ham
/Users/amartens/Documents/Udemy/DataScience-Python3/emails/ham/01947.1d30e15168424f7d1342c5dcd60f22b2,"URL: http://www.newsisfree.com/click/-3,776472...",ham
/Users/amartens/Documents/Udemy/DataScience-Python3/emails/ham/01071.5d83f457fafaabe795b84a483a42a9e1,"Once upon a time, Brian wrote :\n\n\n\n> hey i...",ham
/Users/amartens/Documents/Udemy/DataScience-Python3/emails/ham/00120.4de6f88fbcb22a39a0498f84d9ce358b,> That always amazes me about 'regular' dreams...,ham
/Users/amartens/Documents/Udemy/DataScience-Python3/emails/ham/01606.cf1844a356849ed8cdafb12185afd52f,"If you examine the log further, you'll see deb...",ham


In [60]:
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(train['message'].values)

classifier = MultinomialNB()
targets = train['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [67]:
#Let's apply it to test data set
test_counts = vectorizer.transform(test['message'])

In [79]:
#Predict spam or ham based on the test data using the trained model above
predictions = classifier.predict(test_counts)

In [80]:
#Map the labels to 1 (spam) or 0 (ham) for both test and predictions
testBool = [1 if label == 'spam' else 0 for label in test['class'].values]
predictionBool = [1 if label == 'spam' else 0 for label in predictions]


print("Length of testBool list: ", str(len(testBool)))
print("Length of predictionBool list: ", str(len(predictionBool)))

Length of testBool list:  600
Length of predictionBool list:  600


In [81]:
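#Now calculate the R-squared for the test set
#Note: R-squared is a regression metric; it can be computed on 0/1 labels,
#but the fraction of correct predictions (accuracy) is the more usual score for a classifier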
r2_score(testBool, predictionBool)

0.7540530015781599
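
For a classifier, accuracy and a confusion matrix are usually easier to interpret than R-squared, and they work directly on the string labels, so no 0/1 conversion is needed. This is a minimal sketch on made-up labels; in the notebook you would pass test['class'].values and predictions instead:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, for illustration only
true_labels = ['spam', 'ham', 'ham', 'spam', 'ham', 'ham']
predicted_labels = ['spam', 'ham', 'spam', 'ham', 'ham', 'ham']

# Fraction of messages that were labelled correctly (4 of 6 here)
accuracy = accuracy_score(true_labels, predicted_labels)
print(accuracy)

# Rows are the true class, columns the predicted class, in the order given by `labels`
matrix = confusion_matrix(true_labels, predicted_labels, labels=['ham', 'spam'])
print(matrix)
# [[3 1]
#  [1 1]]
```

The off-diagonal cells separate the two kinds of mistakes: ham flagged as spam (top right) and spam that slipped through as ham (bottom left).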