# Document Classification with Naive Bayes - Lab

## Introduction

In this lab, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

You will be able to:  

* Implement document classification using Naive Bayes
* Understand the need for the Laplacian smoothing correction
* Explain how to code a bag of words representation

## Import the Dataset

To start, import the dataset stored in the text file `SMSSpamCollection`.

In [None]:
#Your code here
import numpy as np
import pandas as pd
df = pd.read_csv('SMSSpamCollection', sep='\t', names=['ham_or_spam','sms'])
df.ham_or_spam.value_counts()

## Account for Class Imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [None]:
# Number of spam messages (the minority class)
(df.ham_or_spam == 'spam').sum()

In [None]:
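#Your code here

# Downsample ham so it matches the number of spam messages
n_spam = (df.ham_or_spam == 'spam').sum()
ham_idx = np.random.choice(df[df.ham_or_spam == 'ham'].index, n_spam, replace=False)
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement
data = pd.concat([df.loc[ham_idx], df[df.ham_or_spam == 'spam']])
data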

## Train - Test Split

Now implement a train-test split on your dataset.

In [None]:
X = data.drop(columns=['ham_or_spam'])
y = data['ham_or_spam']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y)

## Create the Word Frequency Dictionary for Each Class

Create a word frequency dictionary for each class.

In [None]:
train = X_train.assign(target=y_train)
train

In [None]:
#Your code here
from collections import Counter

def bag_it(s):
    return dict(Counter(s.split()))

def combine_freqs(wdcounts):
    c = Counter({})
    for wdcount in wdcounts:
        c += Counter(wdcount)
    return dict(c)

wordfreqs = {}
for class_ in train.target.unique():
    wordfreqs[class_] = combine_freqs(train[train.target == class_].sms.apply(bag_it))

wordfreqs

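## Count the Total Corpus Words

Calculate V, the number of unique words in the corpus (the vocabulary size). Laplace smoothing adds V to the denominator of every word likelihood.

In [None]:
#Your code here
# V is the vocabulary size: the number of distinct words across both classes
V = len(set(wordfreqs['spam']) | set(wordfreqs['ham']))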
V
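
To see why the smoothing correction matters, here is a small worked sketch with made-up counts. The `toy_freqs` dictionary, the word "prize", and the vocabulary size of 5 are illustrative assumptions, not values from the SMS data. Without smoothing, a word never seen in a class gets a likelihood of zero, which zeroes out the whole product. Adding 1 to the numerator and the vocabulary size to the denominator keeps that likelihood small but nonzero.

```python
# Made-up word counts for a single class (illustrative only)
toy_freqs = {"free": 3, "call": 2, "now": 1}
n_class = sum(toy_freqs.values())   # 6 word tokens observed in this class
vocab_size = 5                      # assumed number of distinct words across all classes

def likelihood(word, smooth):
    """Estimate P(word | class), optionally with add-one (Laplace) smoothing."""
    count = toy_freqs.get(word, 0)
    if smooth:
        return (count + 1) / (n_class + vocab_size)
    return count / n_class

unsmoothed = likelihood("prize", smooth=False)  # "prize" never appears in this class
smoothed = likelihood("prize", smooth=True)
print(unsmoothed, smoothed)
```

Here the unsmoothed estimate is exactly zero, while the smoothed estimate is 1/11.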

## Create a Bag of Words Function

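Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` that builds a bag of words representation from a document's text. If you already wrote `bag_it()` while building the word frequency dictionaries above, you can reuse it here.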

In [None]:
#Your code here
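
As a minimal sketch of what the representation captures, the hypothetical helper below (named `bag_of_words` so it does not clash with your own `bag_it()`) counts whitespace-separated tokens. The example sentences are made up. Note that a bag of words keeps word counts but discards word order.

```python
from collections import Counter

def bag_of_words(text):
    """Map each whitespace-separated token in `text` to its number of occurrences."""
    return dict(Counter(text.split()))

bag = bag_of_words("free entry free prize call now")
print(bag)

# Two documents with the same words in a different order share one bag
same_bag = bag_of_words("call now free") == bag_of_words("free now call")
print(same_bag)
```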


## Implementing Naive Bayes

Now, implement a master function to build a Naive Bayes classifier. Be sure to sum log probabilities rather than multiply raw probabilities, to avoid underflow.
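
The underflow problem is easy to reproduce. The sketch below uses 500 made-up word likelihoods of 0.001 each (illustrative values, not estimates from the SMS data): their product is too small for a 64-bit float, while the sum of their logs stays finite.

```python
import numpy as np

# 500 made-up word likelihoods, each equal to 0.001 (illustrative values)
probs = np.full(500, 1e-3)

product = np.prod(probs)          # 1e-1500 is below the float64 range, so this underflows
log_sum = np.sum(np.log(probs))   # stays finite: 500 * log(0.001)
print(product, log_sum)
```

Because the logarithm is monotonic, comparing summed log probabilities picks the same class as comparing the products would.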

In [None]:
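#Your code here
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc)
    classes = []
    posteriors = []
    for class_ in class_word_freq.keys():
        # Start from the log prior, then add one log likelihood per word occurrence
        p = np.log(p_classes[class_])
        # Total number of word tokens observed in this class
        n_class = sum(class_word_freq[class_].values())
        for word, count in bag.items():
            # Laplace smoothing: (count of word in class + 1) / (words in class + V)
            num = class_word_freq[class_].get(word, 0) + 1
            denom = n_class + V
            p += count * np.log(num / denom)
        classes.append(class_)
        posteriors.append(p)
    if return_posteriors:
        print(posteriors)
    return classes[np.argmax(posteriors)]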

## Test Out Your Classifier

Finally, test out your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [None]:
#Your Code here
class_word_freq = wordfreqs
p_classes = {"spam" : .5, "ham" : .5}
classify_doc(train.iloc[7]['sms'], class_word_freq, p_classes, V, return_posteriors=True)

In [None]:
y_hat_train = X_train.sms.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
residuals = y_train == y_hat_train
residuals.value_counts(normalize=True)
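
The cell above scores the classifier on the training data; applying the same comparison to `X_test` and `y_test` gives a less optimistic estimate. As a self-contained sketch of the measurement itself, the example below uses made-up labels and predictions (the `y_true` and `y_pred` values are illustrative assumptions) to compute accuracy and a small confusion table.

```python
import pandas as pd

# Made-up true labels and predictions (illustrative only)
y_true = pd.Series(["spam", "ham", "ham", "spam"])
y_pred = pd.Series(["spam", "ham", "spam", "spam"])

# Accuracy is the fraction of predictions that match the true label
accuracy = (y_true == y_pred).mean()

# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(y_true, y_pred, rownames=["actual"], colnames=["predicted"])
print(accuracy)
print(confusion)
```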

In [None]:
V

## Level-Up

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.
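
One possible shape for such a class is sketched below. The name `NaiveBayesTextClassifier`, its methods, and the tiny training set are all hypothetical; this is not the `DocClassifier` used in the cells that follow, whose implementation is not shown in this lab. The sketch assumes whitespace tokenization, add-one smoothing, and priors estimated from class frequencies.

```python
from collections import Counter
import math

class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over whitespace tokens with add-one smoothing.

    Hypothetical sketch; not the DocClassifier imported in this lab.
    """

    def fit(self, docs, labels):
        self.word_freqs = {}  # class label -> Counter of word counts
        class_counts = Counter(labels)
        for doc, label in zip(docs, labels):
            self.word_freqs.setdefault(label, Counter()).update(doc.split())
        n_docs = sum(class_counts.values())
        # Log priors estimated from class frequencies in the training data
        self.log_priors = {c: math.log(n / n_docs) for c, n in class_counts.items()}
        # Vocabulary size: distinct words seen in any class
        self.V = len(set().union(*self.word_freqs.values()))
        # Total number of word tokens per class
        self.class_totals = {c: sum(f.values()) for c, f in self.word_freqs.items()}
        return self

    def log_posteriors(self, doc):
        """Return the unnormalized log posterior of each class for `doc`."""
        scores = {}
        for c, freqs in self.word_freqs.items():
            score = self.log_priors[c]
            denom = self.class_totals[c] + self.V
            for word, count in Counter(doc.split()).items():
                # A Counter returns 0 for words it has never seen
                score += count * math.log((freqs[word] + 1) / denom)
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_posteriors(doc)
        return max(scores, key=scores.get)


# Tiny made-up training set (illustrative only)
docs = ["win a free prize now", "free cash prize",
        "see you at lunch", "lunch at noon tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

clf = NaiveBayesTextClassifier().fit(docs, labels)
prediction_spam = clf.predict("free prize")
prediction_ham = clf.predict("lunch tomorrow")
print(prediction_spam, prediction_ham)
```

With the SMS data, a class like this could be fit with something like `NaiveBayesTextClassifier().fit(X_train.sms, y_train)`, since it only needs an iterable of texts and an iterable of labels.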

In [None]:
# Reload the documentclassifier module so that edits to it are picked up
import importlib
import documentclassifier
importlib.reload(documentclassifier)
from documentclassifier import DocClassifier
dc = DocClassifier(df, text='sms', target='ham_or_spam')
dc.V

In [None]:
dc.data.target.value_counts()

In [None]:
dc.train.groupby('target').count()

In [None]:
dc.print_residuals(dc.X_train, dc.y_train)

In [None]:
dc.print_residuals(dc.X_test, dc.y_test)

In [None]:
dc.bag_it(dc.data['text'].iloc[0])

In [None]:
bag_it(data['sms'].iloc[0])

In [None]:
# Class labels ordered from most to least frequent
list(dc.data.target.value_counts().index)

In [None]:
# Least frequent class label (value_counts sorts in descending order)
df.ham_or_spam.value_counts().index[-1]

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!