# Naive Bayes Example

We are going to explore **classifying emails as spam or ham (not spam) using Naive Bayes.**

Agenda:
* Read in Files
* Tokenize with CountVectorizer
* Examine the data
* Generate a few relevant variables for manually calculating Naive Bayes
* Use SKLearn's Implementation of Naive Bayes

In [None]:
import numpy as np
import pandas as pd

To motivate this example, we'll be using a modified form of the [Ling-Spam](http://csmining.org/index.php/ling-spam-datasets.html) dataset. More precisely, this is a set of 960 emails that people have manually read and labeled as spam or ham.

This modified dataset comes from the Stanford OpenClassroom (and should be available in your repo). The following preprocessing steps have already been applied to the original emails:
* Stop Word Removal - remove common, non-meaningful words such as "and", "the", "of"
* Lemmatization - reduce similar words to a common form. "Includes", "Included", "Include" are all shortened to "Include"
* Non-Word Removal - remove numbers, punctuation, tabs. (In practice, you might want to keep punctuation! E.g. if you see "Buy this now !!!!!!", the multiple exclamation marks are pretty useful in classifying spam/ham)

Finally, I've condensed the entire set of files into a single CSV containing the following columns:
* Set: Train/Test - I've explicitly set the emails into train or test sets
* Label: Spam/Ham
* Text: Words in the email

## Read in Files

In [None]:
path_to_repo="/Users/brianchung/Desktop/ga-ds/"
data = pd.read_csv(path_to_repo + '/07_naive/07_emailspam.csv')
data.head()

In [None]:
# Let's see how the training/testing sample are set up
# Technically, cheating by looking at the "test" samples as well.
# In general, want to discipline ourselves to not looking at test set until the end
data.groupby(['Set','Label']).count()

In [None]:
# Let's go ahead and for now, split up our data into a train_data, and test_data
train_data = data[ data.Set == 'Train']
test_data  = data[ data.Set == 'Test' ]

# Tokenize

To get probabilities for $P(w|c)$, we need to know how often each word is repeated across the various emails. We COULD go in, split the text, and create a dict of `{word: count}`, but it's easier to let sklearn do it for us.
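For reference, the manual approach mentioned above might look like the following minimal sketch. The two toy emails and the names `toy_emails` and `word_counts` are made up for illustration and are not part of the dataset.

```python
from collections import Counter

# Toy documents standing in for email text (hypothetical, not from the dataset)
toy_emails = ["buy cheap meds buy now", "meeting agenda attached"]

# Split each email on whitespace and tally every word across all emails
word_counts = Counter()
for email in toy_emails:
    word_counts.update(email.split())

print(dict(word_counts))
```

This gives one global tally; to get per-email counts (one row per document) you would keep a separate `Counter` per email, which is essentially what the vectorizer below does for us.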


## CountVectorizer

`sklearn` comes with many built-in feature extraction and manipulation tools. For dealing with text data, there is the `sklearn.feature_extraction.text` module, which contains the **`CountVectorizer`**.

CountVectorizer is a class that transforms text (either in the form of a list of strings, dataframe columns, Series), and outputs a matrix of document x tokens (where tokens represent a word or a phrase).

For instance, if we had a 2 element array of ["apple is an apple", "why is this blue"], the output matrix would look like this:

is|an|apple|why|this|blue
--|--|-----|---|----|----
 1| 1|    2|  0|   0|   0
 1| 0|    0|  1|   1|   1

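Here the first row represents the tokenization of the first sentence, "apple is an apple", and the second row represents the tokenization of the second sentence, "why is this blue". (The column order shown is illustrative; `CountVectorizer` itself orders columns alphabetically by token.)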


The `CountVectorizer` (and most feature extraction methods in sklearn) follows a very simple interface:
- `fit` takes a dataset and learns the features it's trying to extract. In this case that means that the algorithm learns the vocabulary of all samples
- `transform` takes a dataset and produces the matrix as described above, based on the vocabulary (or feature elements) it learned.
- `fit_transform` combines the two steps at once.

For example, you may want to fit a vocabulary to a training set, transform the training set to train a model and then continually transform any new incoming examples you want to classify. You will generally only perform the fit step once but the transform step many times for any new datasets.
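To make the `fit` / `transform` split concrete, here is a rough pure-Python sketch of the interface. `TinyCountVectorizer` is a hypothetical, heavily simplified stand-in written for this notebook. The real `CountVectorizer` uses a different tokenizer and returns a sparse matrix, so treat this only as an illustration of the idea.

```python
class TinyCountVectorizer:
    """Hypothetical, simplified stand-in for sklearn's CountVectorizer."""

    def fit(self, docs):
        # Learn the vocabulary: every distinct lowercase word, sorted alphabetically
        words = sorted({w for doc in docs for w in doc.lower().split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count only words seen during fit; unseen words are silently dropped
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for w in doc.lower().split():
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        # Convenience wrapper: learn the vocabulary, then count against it
        return self.fit(docs).transform(docs)


tiny = TinyCountVectorizer()
train_rows = tiny.fit_transform(["apple is an apple", "why is this blue"])
new_rows = tiny.transform(["this apple is red"])  # "red" was never fitted

print(sorted(tiny.vocabulary_))
print(train_rows)
print(new_rows)
```

Note how the new sentence is counted against the vocabulary learned earlier, so "red" simply does not appear in the output. This is the same reason we fit only on the training emails and later reuse the fitted vectorizer on the test emails.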

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# Fit learns the vocabularies within the training data
vect.fit(train_data.Text)

# When we pass in a set of documents to transform, countvectorizer returns a matrix of frequencies
# with only words that have been fitted into the vocabulary
train_csr = vect.transform(train_data.Text)

In [None]:
#It returns a sparse matrix, which is more efficient when only a few entries of the matrix are non-zero
print(type(train_csr), '\n')
print(train_csr.shape, '\n')

In [None]:
# Can also convert to a dense (full) matrix
print(train_csr.toarray(), '\n')

print(train_csr.toarray().shape, '\n')
print(train_data.shape)

In [None]:
# Well, it has the same number of documents, but what are the words?
# get_feature_names() was removed in recent scikit-learn releases; use get_feature_names_out()
vect.get_feature_names_out()

In [None]:
# Can convert to a pandas dataframe too (generally not needed for SKLearn methods)
train_df = pd.DataFrame( train_csr.toarray(), columns=vect.get_feature_names_out() )
train_df.head()

In [None]:
# How many times does 'aa' occur throughout all the training documents?
print(train_df.columns[0])
print(train_df.iloc[:,0].sum())

In [None]:
# How many tokens are there per row? 
# I.E. How many words are there per email? 
# Tokens are individual words in this case 
train_df.sum(axis=1).head()

In [None]:
# Whats the most frequent token?
token_count = train_df.sum(axis=0)

print("Max:", token_count.max())
# idxmax returns the index label (the word itself) rather than its position
print("Word:", token_count.idxmax())

In [None]:
# What are the total word frequencies for spam emails?
train_labels = train_data.Label
train_df[ train_labels == 'Spam'].sum(axis=0).head()

# Exercise

If you have not already, review and run the code blocks up to this point.

You will need the following defined (at least):
* train_data
* train_labels
* train_df 

**Count the total number of emails in train_data. Save this to "doc_total"**

**Count the total number of spam emails in train_data. Save this to "doc_spam"**

**Count the total number of ham emails in train_data. Save this to "doc_ham"**

In [None]:
# Your code here




assert(doc_spam + doc_ham == doc_total)

**How many total unique words are there in the training set (Vocabulary)? Hint: This is the number of columns in train_df. Save this to "V_train"**

In [None]:
#Your code here

assert(V_train==19073)

**What is the total count of words in spam emails? Save this as "word_total_spam".**

**What is the total count of words in ham emails? Save this as "word_total_ham".**

i.e. if there are two spam emails:

    A: "hello world hello"
    B: "is spam"
then word_total_spam = 3 + 2 = 5

Hint: Subset train_df on the labels, and use 'sum' to get the total frequencies. You may need to call xx.sum().sum() to get the sum of every element, where xx is the dataframe.


In [None]:
#Your code here


assert(word_total_spam==105771 and word_total_ham==86102)

**Generate the total counts per word per class.**

Hint: Subset train_df on the labels, and use 'sum' to get total frequencies similar to above (Use the axis= argument as well in sum).

Save the resulting series as **word_count_ham**, and **word_count_spam**.

In [None]:
# Your code here


assert( len(word_count_spam)==19073 and len(word_count_ham)==19073 and 
       np.sum(word_count_spam)==105771 and np.sum(word_count_ham)==86102 and
      type(word_count_spam) is pd.Series and type(word_count_ham) is pd.Series )

**What is the probability of seeing the word "web" given class == "Spam"? What is the probability of seeing the word "web" given class == "Ham"?**

In [None]:
#Your code here



**What is the probability of seeing the word "Brrahhhhh" given class == "Spam"? What about "Ham"?**

Hint: Laplace Smoothing

In [None]:
#Your code here



----

Now you're ready to classify documents! Why? Let's think about it:

* You can calculate the class priors P(C==ham) and P(C==spam). This is just doc_ham/doc_total, and doc_spam/doc_total.
* You can also iterate through the class labels, and generate P(w|C==ham) and P(w|C==spam). This is just the count of the word `w` in the class divided by the total number of words in the class (plus the Laplace smoothing terms from the previous exercise).
* In essence, if there were a `NaiveBayes` model in SKLearn, it's as if you had calculated all the components of `fit`.
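Written out with Laplace (add-one) smoothing, these pieces combine as follows. The symbols here are notation introduced for illustration: $n_{w,c}$ is the count of word $w$ across class $c$ (`word_count_spam[w]` or `word_count_ham[w]`), $N_c$ is the total word count of the class (`word_total_spam` or `word_total_ham`), $V$ is `V_train`, and $f_{w,d}$ is the number of times $w$ appears in the document $d$ being classified.

$$P(c) = \frac{\text{doc}_c}{\text{doc}_{\text{total}}}, \qquad P(w|c) = \frac{n_{w,c} + 1}{N_c + V}$$

$$\log P(c|d) = \log P(c) + \sum_{w \in d} f_{w,d} \, \log P(w|c) + \text{const}$$

The constant is the same for every class, so we can ignore it and simply pick the class with the larger score. Working with logs avoids multiplying many tiny probabilities together, which would underflow for long emails.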

```
Algorithm:
    Calculate class priors P(spam), P(ham)
    For each row in the data:
        Start a running log probability for each class at its log prior
            [logP_spam = log(P(spam)), logP_ham = log(P(ham))]
        For each word w in row:
            Calculate the smoothed P(w|spam), then add its weighted log to logP_spam
            [logP_spam += data frequency of w * log( (word_count_spam[w] + 1) / (word_total_spam + V_train) )]
            Calculate the smoothed P(w|ham), then add its weighted log to logP_ham
            [logP_ham  += data frequency of w * log( (word_count_ham[w] + 1) / (word_total_ham + V_train) )]
        Compare logP_spam and logP_ham. Classify this row as whichever is higher
```
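Before running this on the real emails, it can help to trace the algorithm on numbers small enough to check by hand. The sketch below is a toy version under stated assumptions: the counts in `doc_counts` and `word_counts`, the 3-word vocabulary, and the `classify` helper are all hypothetical and not taken from the dataset.

```python
import math

# Hypothetical toy training statistics (not taken from the email dataset)
doc_counts = {"spam": 2, "ham": 2}
word_counts = {
    "spam": {"buy": 3, "now": 2, "meeting": 0},
    "ham": {"buy": 0, "now": 1, "meeting": 4},
}
vocab_size = 3


def classify(doc_word_freqs):
    """Return (label, scores) for a {word: frequency} dict."""
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        class_total = sum(word_counts[label].values())
        # Start from the log prior, then add frequency * log(smoothed P(w|c))
        score = math.log(doc_counts[label] / total_docs)
        for word, freq in doc_word_freqs.items():
            count = word_counts[label].get(word, 0)
            score += freq * math.log((count + 1) / (class_total + vocab_size))
        scores[label] = score
    # The class with the highest log score wins
    return max(scores, key=scores.get), scores


label, scores = classify({"buy": 2, "now": 1})
print(label, scores)
```

With these made-up counts, an email containing "buy" twice and "now" once scores higher for spam, while an email containing only "meeting" scores higher for ham.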


I've included my version of Naive Bayes using a Multinomial model, but you will have the opportunity to write your own later as well!

## Naive Bayes Prediction on the Training Set

In [None]:
# Recreate a count vectorizer here, and fit it on the training data
vect = CountVectorizer()
vect.fit(train_data.Text)

# Transform the training data into a sparse matrix, and then to a pandas dataframe for ease
train_csr = vect.transform(train_data.Text)
train_df = pd.DataFrame( train_csr.toarray(), columns=vect.get_feature_names_out() )


# Save a list of the truth and one for my predictions
all_true_labels = train_data.Label
all_pred_labels = [""]*len(train_data.Label)


# Calculate our class priors
# np.float was removed from NumPy; the built-in float does the same job here
p_c_spam = float(doc_spam) / float(doc_total)
p_c_ham  = float(doc_ham) / float(doc_total)

for row_idx in range(train_data.shape[0]):

    
    # Save a running log probability 
    logP_spam = np.float64(0.0)
    logP_ham  = np.float64(0.0)
    
    # Grab all the tokens for that row in the form of a Series w/index = word, value = frequency of word
    words = train_df.iloc[row_idx,:]
    words = words[ words > 0]
    
    for word in words.index:
        cnt = words[word]

        # Calculate the quantity p(w|spam)
        p_word_spam = 0.0
        if word in word_count_spam.index:
            p_word_spam = (word_count_spam[word] + 1.0) / ( float(word_total_spam) + V_train )
        else:
            # If this word is not in our dictionary, give it the laplace smoothed prior!
            p_word_spam = 1.0 / ( float(word_total_spam) + V_train )


        # Calculate the quantity p(w|ham)
        p_word_ham = 0.0
        if word in word_count_ham.index:
            p_word_ham = (word_count_ham[word] + 1.0) / ( float(word_total_ham) + V_train )
        else:
            # If this word is not in our dictionary, give it the laplace smoothed prior!
            p_word_ham = 1.0 / ( float(word_total_ham) + V_train )
        
        # Add the correct count * log( p(w|c) ) to both categories
        logP_spam = logP_spam + cnt * np.log(p_word_spam)
        logP_ham = logP_ham +   cnt * np.log(p_word_ham)

    logP_spam = logP_spam + np.log(p_c_spam)
    logP_ham = logP_ham + np.log(p_c_ham)

    # Check which value is higher. Remember we didn't compute any normalization, since we don't care
    if logP_spam > logP_ham: 
        all_pred_labels[row_idx] = "Spam"
    else:
        all_pred_labels[row_idx] = "Ham"


# Compare as plain arrays so the Series index cannot affect the comparison
print("Accuracy:", np.mean(np.array(all_pred_labels) == all_true_labels.values))

## Naive Bayes Prediction on the Test Set

In [None]:
# Test set Using the Probabilities we've calculated with the training set
# !!Notice that for p(w|c) we are still using the probabilities off the training set!


# Similar to before, but we don't call fit again! We use the same vocabulary as in the training set
test_csr = vect.transform(test_data.Text)
test_df = pd.DataFrame( test_csr.toarray(), columns=vect.get_feature_names_out() )


all_true_labels = test_data.Label
all_pred_labels = [""]*len(test_data.Label)

p_c_spam = float(doc_spam) / float(doc_total)
p_c_ham  = float(doc_ham) / float(doc_total)

for row_idx in range(test_data.shape[0]):

    # Save a running log probability
    logP_spam = 0.0
    logP_ham  = 0.0
    
    words = test_df.iloc[row_idx,:]
    words = words[ words > 0 ]

    for word in words.index:
        cnt = float(words[word])

        p_word_spam = 0.0
        if word in word_count_spam.index:
            p_word_spam = (word_count_spam[word] + 1.0) / ( float(word_total_spam) + V_train )
        else:
            # Give it the laplace smoothed prior!
            p_word_spam = 1.0 / ( float(word_total_spam) + V_train )

        p_word_ham = 0.0
        if word in word_count_ham.index:
            p_word_ham = (word_count_ham[word] + 1.0) / ( float(word_total_ham) + V_train )
        else:
            # Give it the laplace smoothed prior!
            p_word_ham = 1.0 / ( float(word_total_ham) + V_train )
        
        logP_spam = logP_spam + cnt * np.log(p_word_spam)
        logP_ham  = logP_ham  + cnt * np.log(p_word_ham)
        
    logP_spam = logP_spam + np.log(p_c_spam)
    logP_ham = logP_ham + np.log(p_c_ham)
    
    if logP_spam > logP_ham: 
        all_pred_labels[row_idx] = "Spam"
    else:
        all_pred_labels[row_idx] = "Ham"


# Compare as plain arrays so the Series index cannot affect the comparison
print("Accuracy:", np.mean(np.array(all_pred_labels) == all_true_labels.values))

That was quite a lot of work (and probably poor coding) to create a multinomial Naive Bayes solution. You'll have a chance later on to improve upon this and write your own version.
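As one possible direction for that improvement, the per-word loops can be replaced by a single matrix product. The following is only a sketch on made-up numbers: `X`, `class_word_counts`, `class_doc_counts` and `classes` are hypothetical toy arrays standing in for the document-term matrix, the per-class word counts, and the per-class document counts computed earlier.

```python
import numpy as np

# Hypothetical toy inputs: rows are documents, columns are vocabulary words
X = np.array([[2, 1, 0],
              [0, 0, 3]])
# Per-class total count of each word (row 0 = spam, row 1 = ham)
class_word_counts = np.array([[3, 2, 0],
                              [0, 1, 4]])
class_doc_counts = np.array([2, 2])
classes = np.array(["Spam", "Ham"])

# Laplace-smoothed log P(w|c), shape (n_classes, n_words).
# The row sum of (counts + 1) equals word_total + V, the usual denominator.
smoothed = class_word_counts + 1.0
log_p_w_c = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Log priors, shape (n_classes,)
log_prior = np.log(class_doc_counts / class_doc_counts.sum())

# One matrix product scores every document against every class
scores = X @ log_p_w_c.T + log_prior
pred = classes[scores.argmax(axis=1)]
print(pred)
```

Each row of `scores` holds the unnormalized log probabilities for one document, which is the same quantity the loops above accumulate word by word.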

# SKLearn `naive_bayes`

For now, let's also see how SKLearn's implementation of naive bayes operates.

There are many variants of Naive Bayes in the module, and they all mostly differ in how they encode P(x|c) (or the sampling model)
* MultinomialNB() : What we've learned so far, where multiple counts of words increase the probability of a class
* BernoulliNB(): Where the presence or absence (1 and 0s, not counts) will increase or decrease the probability of a class
* GaussianNB(): Where the probability of a class is based on how far this feature is from the center of the class. This would be good for continuous normally distributed features
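To see how the first two variants treat the same email differently, here is a small sketch. The word counts and the probability tables `p_w` and `q_w` are invented for illustration, and the multinomial coefficient is left out because it is identical for every class.

```python
import math

# Hypothetical word counts for one email over a 3-word vocabulary
counts = {"buy": 3, "now": 1, "meeting": 0}

# Multinomial view: every occurrence counts, absent words contribute nothing.
# p_w holds assumed P(w|c) values that sum to 1 over the vocabulary.
p_w = {"buy": 0.5, "now": 0.3, "meeting": 0.2}
multinomial_ll = sum(n * math.log(p_w[w]) for w, n in counts.items())

# Bernoulli view: only presence/absence matters, and absent words count too.
# q_w holds assumed P(word appears in a document | c); these need not sum to 1.
q_w = {"buy": 0.8, "now": 0.6, "meeting": 0.1}
present = {w: int(n > 0) for w, n in counts.items()}
bernoulli_ll = sum(
    math.log(q_w[w]) if present[w] else math.log(1.0 - q_w[w])
    for w in counts
)

print(multinomial_ll, bernoulli_ll)
```

In the multinomial score "buy" contributes three times, whereas in the Bernoulli score it contributes once and the absence of "meeting" contributes as well.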

In [None]:
from sklearn.naive_bayes import MultinomialNB

vect = CountVectorizer()
vect.fit(train_data.Text)

mm = MultinomialNB()


train_csr = vect.transform(train_data.Text)
y_train = train_data.Label

mm.fit(train_csr,y_train)
print("Training:", mm.score(train_csr, y_train))

test_csr = vect.transform(test_data.Text)
y_test = test_data.Label
print("Testing:", mm.score(test_csr, y_test))

In [None]:
from sklearn.naive_bayes import BernoulliNB

vect = CountVectorizer()
vect.fit(train_data.Text)

mm = BernoulliNB()


train_csr = vect.transform(train_data.Text)
y_train = train_data.Label

mm.fit(train_csr,y_train)
print("Training:", mm.score(train_csr, y_train))

test_csr = vect.transform(test_data.Text)
y_test = test_data.Label
print("Testing:", mm.score(test_csr, y_test))

### Ngrams

With CountVectorizer, we don't need to only have single words as tokens. We can have phrases of any length as tokens.

For instance, by default the sentence "DONT BUY" would be tokenized into the single words "dont" and "buy" (CountVectorizer lowercases text by default). However, we might expect the full two-word phrase (a 2-gram, or bigram) to carry important information.

Using the `ngram_range` parameter of CountVectorizer, we can specify the range of n-gram lengths we want. For instance, `ngram_range=(1,5)` would generate n-grams of length 1 (single words), length 2 (2-word phrases), length 3 (3-word phrases), and so on up to 5.

Be careful! Setting ngram_range too high is not likely to help, and it increases the computing cost greatly!
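To get a feel for what an n-gram range expands to, here is a rough pure-Python sketch. `word_ngrams` is a hypothetical helper written for this notebook, not part of sklearn, and it splits on whitespace rather than using CountVectorizer's own tokenizer.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of text for n in the inclusive ngram_range."""
    tokens = text.lower().split()
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        # Slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


print(word_ngrams("DONT BUY THIS", ngram_range=(1, 2)))
# ['dont', 'buy', 'this', 'dont buy', 'buy this']
```

A sentence of `k` words produces roughly `k` extra features for every additional n-gram length, which is why a wide range over thousands of emails quickly inflates the vocabulary.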

In [None]:
#Similar to above, but I've used the ngram_range argument in creating the CountVectorizer
ngram2vect = CountVectorizer(ngram_range=(1,2))
ngram2vect.fit(train_data.Text)
ngram2_fn = ngram2vect.get_feature_names_out()
ngram2_fn

### Some helpful tools in diagnosing your models

In [None]:
vect = CountVectorizer()
vect.fit(train_data.Text)

train_csr = vect.transform(train_data.Text)
y_train = train_data.Label

mm = MultinomialNB()
mm.fit(train_csr,y_train)

print(mm.score(train_csr, y_train))
print(mm.class_count_)
print(mm.class_log_prior_)
print(mm.class_prior)  # constructor parameter; None unless priors were passed in
print(mm.classes_)
print(mm.predict_log_proba(train_csr))

In [None]:
from sklearn.naive_bayes import MultinomialNB
# sklearn.cross_validation was removed; cross_val_score now lives in model_selection
from sklearn.model_selection import cross_val_score
cross_val_score(MultinomialNB(), train_csr, y_train, cv=5, scoring="accuracy")