# Document Classification with Naive Bayes - Lab

## Introduction

In this lesson, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [3]:
# Your code here
import pandas as pd
df = pd.read_csv('SMSSpamCollection', delimiter='\t', names=['label','text'])
#that's right, this is a tsv with no column names and nobody told me
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [8]:
# Your code here
df_spam = df[df['label'] == 'spam']
df_ham_temp = df[df['label'] == 'ham']
df_ham = df_ham_temp.iloc[:(len(df_spam))]
df_equal = pd.concat([df_ham, df_spam])
df_equal.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
6,ham,Even my brother is not like to speak with me. ...


In [9]:
df_equal.tail()

Unnamed: 0,label,text
5537,spam,Want explicit SEX in 30 secs? Ring 02073162414...
5540,spam,ASKED 3MOBILE IF 0870 CHATLINES INCLU IN FREE ...
5547,spam,Had your contract mobile 11 Mnths? Latest Moto...
5566,spam,REMINDER FROM O2: To get 2.50 pounds free call...
5567,spam,This is the 2nd time we have tried 2 contact u...


In [18]:
p_classes = dict(df_equal['label'].value_counts(normalize=True))
#for some reason we need p_classes in order to keep track of the class names?
#this might be a way to easily keep track of the relative frequency of the classes when there are more than two of them
p_classes

{'ham': 0.5, 'spam': 0.5}

## Train-test split

Now implement a train-test split on the dataset: 

In [26]:
# Your code here
from sklearn.model_selection import train_test_split
X = df_equal['text']
y = df_equal['label']
X_train, X_test, y_train, y_test = train_test_split(X, y)
train_df = pd.concat([y_train, X_train], axis=1)
test_df = pd.concat([y_test, X_test], axis=1)

## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [27]:
# Your code here
class_word_freq = {} #create outer "housing" dictionary
classes = train_df['label'].unique() #create list of unique classes in dataset
for class_ in classes:
    temp_df = train_df[train_df['label'] == class_] #target only the rows belonging to the current class
    bag = {} #create empty dictionary to house word counts
    for row in temp_df.index:
        doc = temp_df['text'][row] #create variable representing the text entry belonging to the current row
        for word in doc.split(): 
            bag[word] = bag.get(word, 0) + 1 #creating frequency dictionary for each word in the current row
    class_word_freq[class_] = bag #filling outer dictionary
class_word_freq

#what this code does is create a dictionary where the keys are the unique class names
#the values are frequency dictionaries, designating frequency of unique words belonging to that class

{'spam': {'You': 64,
  'are': 58,
  'awarded': 29,
  'a': 283,
  'SiPix': 4,
  'Digital': 4,
  'Camera!': 4,
  'call': 137,
  '09061221061': 2,
  'from': 88,
  'landline.': 14,
  'Delivery': 4,
  'within': 5,
  '28days.': 2,
  'T': 11,
  'Cs': 9,
  'Box177.': 2,
  'M221BP.': 2,
  '2yr': 2,
  'warranty.': 2,
  '150ppm.': 4,
  '16': 23,
  '.': 13,
  'p': 2,
  'p£3.99': 2,
  'Message': 1,
  'Important': 2,
  'information': 4,
  'for': 130,
  'O2': 4,
  'user.': 2,
  'Today': 2,
  'is': 116,
  'your': 147,
  'lucky': 5,
  'day!': 2,
  '2': 131,
  'find': 12,
  'out': 33,
  'why': 5,
  'log': 4,
  'onto': 5,
  'http://www.urawinner.com': 5,
  'there': 7,
  'fantastic': 3,
  'surprise': 2,
  'awaiting': 4,
  'you': 120,
  'Free': 23,
  'entry': 20,
  'in': 45,
  'weekly': 16,
  'comp': 7,
  'chance': 13,
  'to': 466,
  'win': 22,
  'an': 18,
  'ipod.': 2,
  'Txt': 47,
  'POD': 3,
  '80182': 2,
  'get': 37,
  '(std': 4,
  'txt': 49,
  'rate)': 3,
  "T&C's": 4,
  'apply': 10,
  '08452810073': 

## Count the total corpus words
Calculate V, the total number of words in the corpus: 

In [28]:
# Your code here
vocabulary = set()
for text in train_df['text']:
    for word in text.split():
        vocabulary.add(word) 
V = len(vocabulary)
V

#this creates a set of all the UNIQUE words in a dataframe

5976

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [29]:
# Your code here
def bag_it(doc):
    bag = {}
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
    return bag

#this creates a frequency dictionary with keys = unique words in test document 
#and values = # of occurrences in test document

## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.

In [25]:
# Your code here
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc) #create frequency dictionary of unique words in test document
    classes = [] #initialize empty list to contain unique possible classes
    posteriors = [] #initialize empty list to contain posterior probabilities 
    for class_ in class_word_freq.keys(): #loop through each unique class in list of possible classes
        p = np.log(p_classes[class_]) #logarithm of normalized rate of occurrence of the class in dataset
        for word in bag.keys(): #looping through each unique word in test document
            num = bag[word] + 1 
            #define numerator as # of occurrences of word in test document (plus one for Laplacian smoothing)
            denom = class_word_freq[class_].get(word, 0) + V 
            #define denominator as # of occurrences of word in class (plus total corpus for Laplacian smoothing)
            p *= np.log(num/denom)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(posteriors)]

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [21]:
# Your code here
import numpy as np
y_hat_train = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
residuals = y_hat_train == y_train
residuals.value_counts(normalize=True)

True     0.525
False    0.475
dtype: float64

## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!