# Using Naive Bayes to classify text
In this example we use the Enron-Spam dataset (http://www2.aueb.gr/users/ion/data/enron-spam/) to build a simple Naive Bayes classifier that separates spam from legitimate mail (ham).

## Read in the datafiles
Here we read in the data files. For this to work, first run `download.sh` in the `spam-data` folder.

In [3]:
import os
import pandas as pd
import numpy as np

NEWLINE = '\n'
SKIP_FILES = {'cmds'}

def read_files(path):
    # os.walk already descends into every subdirectory, so no manual recursion is needed
    for root, dir_names, file_names in os.walk(path):
        for file_name in file_names:
            if file_name in SKIP_FILES:
                continue
            file_path = os.path.join(root, file_name)
            if not os.path.isfile(file_path):
                continue
            past_header, lines = False, []
            # latin-1 maps every byte to a character, so decoding never fails
            with open(file_path, encoding="latin-1") as f:
                for line in f:
                    if past_header:
                        lines.append(line)
                    elif line == NEWLINE:
                        # the first blank line separates the mail headers from the body
                        past_header = True
            content = NEWLINE.join(lines)
            yield file_path, content


def build_data_frame(path, classification):
    rows = []
    index = []
    for file_name, text in read_files(path):
        rows.append({'text': text, 'class': classification})
        index.append(file_name)

    data_frame = pd.DataFrame(rows, index=index)
    return data_frame


HAM = 'ham'
SPAM = 'spam'

SOURCES = [
    ('./spam-data/beck-s',      HAM),
    ('./spam-data/farmer-d',    HAM),
    ('./spam-data/kaminski-v',  HAM),
    ('./spam-data/kitchen-l',   HAM),
    ('./spam-data/lokay-m',     HAM),
    ('./spam-data/williams-w3', HAM),
    ('./spam-data/BG',          SPAM),
    ('./spam-data/GP',          SPAM),
    ('./spam-data/SH',          SPAM)
]

# DataFrame.append was removed in pandas 2.0, so build one frame per source and concatenate once
data = pd.concat([build_data_frame(path, classification)
                  for path, classification in SOURCES])

data = data.reindex(np.random.permutation(data.index))

## Data format
Below you can see that the result is a DataFrame indexed by file path, with one column for the class (ham/spam) and one for the message text.

In [4]:
data.head()

| | class | text |
|---|---|---|
| ./spam-data/GP/part1/msg836.eml | spam | `<html><body><a href="">\n\nNo doctor visit nee...` |
| ./spam-data/lokay-m/articles/69 | ham | `Note: I'm sure you have this on your radar sc...` |
| ./spam-data/SH/HP/prodmsg.2.446410.2005717 | spam | `Hi,\n\nI have been using your site for a long ...` |
| ./spam-data/GP/part9/msg13294.eml | spam | `<HTML><HEAD><META HTTP-EQUIV=3D"Content-Type" ...` |
| ./spam-data/kaminski-v/poland/4 | ham | `Szanowny Panie Kaminski!!!\n\n\n\nBardzo dziek...` |


## Vectorize
The model cannot work with raw text, so we need to turn each message into a numeric array. For this we can use `CountVectorizer`: it creates a column for each word in the corpus and then stores, for every row (message), how often that word occurs.
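
To make the word-count idea concrete, here is a minimal sketch on a two-sentence toy corpus. The names `toy_docs`, `toy_vectorizer` and `toy_counts` are made up for illustration and are not part of the notebook's dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, only to illustrate the resulting matrix
toy_docs = ["free money free", "meeting about money"]

toy_vectorizer = CountVectorizer()
toy_counts = toy_vectorizer.fit_transform(toy_docs)  # sparse matrix: documents x words

# vocabulary_ maps each word to its column index; sort the words by that index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)                 # ['about', 'free', 'meeting', 'money']
print(toy_counts.toarray())
# [[0 2 0 1]
#  [1 0 1 1]]
```

Each row is one document and each column one word, so "free" is counted twice in the first row. On the real data the matrix is far too wide to print, which is why `fit_transform` returns it in sparse form.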

In [8]:
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(data['text'].values)

## Train classifier
Fit a multinomial Naive Bayes classifier on the word counts, using the class column (ham/spam) as the target.

In [9]:
from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
targets = data['class'].values
classifier.fit(counts, targets)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
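
The `alpha=1.0` in the output above is the additive (Laplace) smoothing parameter. As a rough sketch of what fitting computes, the per-class word probabilities can be reproduced by hand as `(count of word in class + alpha) / (total words in class + alpha * vocabulary size)`. The small matrix `X` and labels `y` below are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: 3 documents, 3 vocabulary words
X = np.array([[2, 0, 1],    # spam
              [1, 0, 2],    # spam
              [0, 3, 0]])   # ham
y = np.array(['spam', 'spam', 'ham'])

model = MultinomialNB(alpha=1.0).fit(X, y)

# Recompute the smoothed log-probabilities of each word given the spam class
alpha = 1.0
spam_counts = X[y == 'spam'].sum(axis=0)                 # word totals over spam documents
manual_log_prob = np.log((spam_counts + alpha) /
                         (spam_counts.sum() + alpha * X.shape[1]))

spam_row = list(model.classes_).index('spam')
matches = np.allclose(manual_log_prob, model.feature_log_prob_[spam_row])
print(matches)  # True
```

Because of the smoothing, the second word, which never occurs in a spam document, still gets a small non-zero probability instead of ruling out spam entirely.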

## Test our classifier

In [13]:
examples = ['Free Viagra call today!', "I'm going to learn something about Machine learning"]
example_counts = count_vectorizer.transform(examples)
predictions = classifier.predict(example_counts)

print(predictions)

['spam' 'ham']
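
The vectorize, train and predict steps can also be chained into a single scikit-learn `Pipeline`, so that new text is always transformed with the same vocabulary before it is classified. The sketch below assumes a tiny hand-written corpus (`toy_texts`, `toy_labels`) in place of the Enron data, so the printed probabilities are only illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical miniature training set standing in for the real DataFrame
toy_texts = [
    "free viagra call now",
    "win free money today",
    "meeting agenda for monday",
    "notes from the machine learning course",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# The pipeline applies the vectorizer first, then feeds the counts to the classifier
pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("classifier", MultinomialNB()),
])
pipeline.fit(toy_texts, toy_labels)

new_messages = ["free money call today", "machine learning meeting"]
toy_predictions = pipeline.predict(new_messages)
toy_probabilities = pipeline.predict_proba(new_messages)  # columns follow pipeline.classes_

print(toy_predictions)   # ['spam' 'ham']
print(pipeline.classes_)
print(toy_probabilities)
```

`predict_proba` shows how confident the model is in each label, which is useful when deciding whether a borderline message should be flagged.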
