# Document Classification with Naive Bayes - Lab

## Introduction

In this lesson, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [110]:
# Your code here
import numpy as np
import pandas as pd

df = pd.read_csv('SMSSpamCollection', sep='\t', header=None)
df.columns = ['label', 'text']

In [111]:
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Train-test split

Now implement a train-test split on the dataset: 

In [95]:
# Your code here
from sklearn.model_selection import train_test_split
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=19)
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)

## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

(This would only be applied on training set)

In [96]:
# Your code here
train_df.label.value_counts()

ham     3617
spam     562
Name: label, dtype: int64

In [97]:
ham_df = train_df[train_df.label == 'ham'].sample(n=len(train_df[train_df.label == 'spam']), random_state=19)
spam_df = train_df[train_df.label == 'spam']

train_df_undersmpled = pd.concat([ham_df, spam_df])

In [98]:
train_df_undersmpled.label.value_counts()

spam    562
ham     562
Name: label, dtype: int64

In [99]:
p_classes = dict(train_df_undersmpled.label.value_counts(normalize=True))

In [100]:
p_classes

{'spam': 0.5, 'ham': 0.5}

## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [101]:
train_df_undersmpled.label.unique()

array(['ham', 'spam'], dtype=object)

In [102]:
# Your code here
class_word_freq = {}

classes = train_df_undersmpled.label.unique()

for class_ in classes:
    temp_df = train_df_undersmpled[train_df_undersmpled.label == class_]
    bag = {}
    
    for row in temp_df.index:
        doc = temp_df['text'][row].lower()
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
            
    class_word_freq[class_] = bag

## Count the total corpus words
Calculate V, the total number of words in the corpus: 

In [103]:
# Your code here
vocabulary = set()

for text in train_df_undersmpled['text']:
    for word in text.lower().split():
        vocabulary.add(word)
        
V = len(vocabulary)
V

5413

## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [104]:
# Your code here
def bag_it(doc):
    doc = doc.lower()
    bag = {}
    
    for word in doc.split():
        bag[word] = bag.get(word, 0) + 1
        
    return bag

## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.

In [105]:
# Your code here
def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc)
    
    classes = []
    posteriors = []
    
    for class_ in class_word_freq.keys():
        p = np.log(p_classes[class_])
        for word in bag.keys():
            num = bag[word] + 1
            denom = class_word_freq[class_].get(word, 0) + V
            p += (num/denom)
            
        classes.append(class_)
        posteriors.append(p)
    
    if return_posteriors:
        print(posteriors)
        
    return classes[np.argmax(posteriors)]

## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [106]:
X_test

4209    Or i go home first lar ü wait 4 me lor.. I put...
5466    http//tms. widelive.com/index. wml?id=820554ad...
2455         Left dessert. U wan me 2 go suntec look 4 u?
2513    Hiya , have u been paying money into my accoun...
80                                 Sorry, I'll call later
                              ...                        
805                             K I'll be there before 4.
3102                         Pathaya enketa maraikara pa'
2961                     Sir send to group mail check it.
539     Ummmmmaah Many many happy returns of d day my ...
2398                            Neshanth..tel me who r u?
Name: text, Length: 1393, dtype: object

In [107]:
# Your code here
y_hat_train = X_train.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))
y_hat_test = X_test.map(lambda x: classify_doc(x, class_word_freq, p_classes, V))

train_resid = y_train == y_hat_train
print('Training Accuracy:', train_resid.value_counts(normalize=True)[True])

test_resid = y_test == y_hat_test
print('Testing Accuracy:', test_resid.value_counts(normalize=True)[True])

Training Accuracy: 0.35654462790141184
Testing Accuracy: 0.34960516870064606


In [108]:
test_resid.value_counts()

False    906
True     487
dtype: int64

## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.

In [134]:
class BoW():
    def __init__(self):
        self.df = pd.DataFrame()
        self.p_classes = {}
        self.classes = []
        self.class_word_freq = {}
        self.V = 0
        
    
    def _p_classes(self):
        p_classes = dict(self.df.label.value_counts(normalize=True))
        return p_classes
    
    
    def _classes(self):
        labels = self.df.label.unique()
        return labels

    
    def _class_word_freq(self):
        classes = self._classes()
        for class_ in classes:
            temp_df = df[df.label == class_]
            bag = {}

            for row in temp_df.index:
                doc = temp_df['text'][row].lower()
                for word in doc.split():
                    bag[word] = bag.get(word, 0) + 1

            self.class_word_freq[class_] = bag
        return self.class_word_freq
    
    
    def _vocabulary(self):
        vocab = set()

        for text in self.df.text:
            for word in text.lower().split():
                vocabulary.add(word)

        return len(vocab)
    
    
    def fit(self, df):
        # concat X and y to get 1 dataframe
        self.df = df
        
        # get P for each class
        self.p_classes = self._p_classes()
        self.classes = self._classes()
        
        # get word frequency for each class
        self.class_word_freq = self._class_word_freq()
            
        # calculate the number of words in corpus vocabulary
        self.V = self._vocabulary()
    
    
    def _bag_it(self, doc):
        doc = doc.lower()
        bag = {}

        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1

        return bag
        
    
    def predict(self, X, return_posteriors=False):
        preds = []
        for doc in X:
            # bag the document to be classified
            bag = self._bag_it(doc)

            # initiate classes & posteriors lists
            classes = []
            posteriors = []

            # iterate through each class in class_word_freq
            for class_ in self.class_word_freq.keys():
                p = np.log(self.p_classes[class_])
                for word in bag.keys():
                    num = bag[word] + 1
                    denom = self.class_word_freq[class_].get(word, 0) + V
                    p += (num/denom)

                classes.append(class_)
                posteriors.append(p)

            if return_posteriors:
                print(posteriors)

            preds.append(classes[np.argmax(posteriors)])
        return preds

In [135]:
bow = BoW()
bow.fit(train_df_undersmpled)

In [139]:
y_hat_test = bow.predict(X_test)
test_resid = y_test == y_hat_test
test_resid.value_counts(normalize=True)

False    0.872936
True     0.127064
Name: label, dtype: float64

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!