# Document Classification with Naive Bayes - Lab

## Introduction

In this lesson, you'll practice implementing the Naive Bayes algorithm on your own.

## Objectives

In this lab you will:  

* Implement document classification using Naive Bayes

## Import the dataset

To start, import the dataset stored in the text file `'SMSSpamCollection'`.

In [1]:
# Your code here
import pandas as pd
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'text'])
df.head()

Unnamed: 0,label,text
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."


## Account for class imbalance

To help your algorithm perform more accurately, subset the dataset so that the two classes are of equal size. To do this, keep all of the instances of the minority class (spam) and subset examples of the majority class (ham) to an equal number of examples.

In [3]:
# Your code here
# Subset examples to achieve class balance
spam = df[df['label'] == 'spam']
ham = df[df['label'] == 'ham'].sample(n=len(spam), random_state=42)
balanced_data = pd.concat([spam, ham])

# Reset index of balanced dataset
balanced_data = balanced_data.reset_index(drop=True)

# Verify class balance
print(balanced_data['label'].value_counts())


spam    747
ham     747
Name: label, dtype: int64


## Train-test split

Now implement a train-test split on the dataset: 

In [4]:
# Your code here
from sklearn.model_selection import train_test_split
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
train_df = pd.concat([X_train, y_train], axis=1)
test_df = pd.concat([X_test, y_test], axis=1)


## Create the word frequency dictionary for each class

Create a word frequency dictionary for each class: 

In [10]:
# Your code here
class_word_freq = {} 
classes = train_df['label'].unique()
for class_ in classes:
    temp_df = train_df[train_df['label'] == class_]
    bag = {}
    for row in temp_df.index:
        doc = temp_df['text'][row]
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    class_word_freq[class_] = bag


## Count the total corpus words
Calculate V, the total number of words in the corpus: 

In [11]:
# Your code here
total_vocab = set()

for text in train_df['text']:
    for word in text.split():
        total_vocab.add(word)
V = len(total_vocab)


## Create a bag of words function

Before implementing the entire Naive Bayes algorithm, create a helper function `bag_it()` to create a bag of words representation from a document's text.

In [12]:
# Your code here
def bag_it(text):
    """
    Convert a text document to a bag of words representation.
    """
    bag = {}
    for word in text.split():
        bag[word] = bag.get(word, 0) + 1
    return bag


## Implementing Naive Bayes

Now, implement a master function to build a naive Bayes classifier. Be sure to use the logarithmic probabilities to avoid underflow.

In [13]:
# Your code here
import math

def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    words = doc.split()
    class_scores = {}
    for class_, class_freq in class_word_freq.items():
        # calculate log class probability p(class_)
        log_class_score = math.log(p_classes[class_])
        for word in words:
            # calculate log P(word|class_)
            word_freq = class_freq.get(word, 0)
            log_word_score = math.log((word_freq + 1) / (sum(class_freq.values()) + V))
            log_class_score += log_word_score
        class_scores[class_] = log_class_score
    if return_posteriors:
        # exponentiate and normalize to get posterior probabilities
        log_sum = sum(class_scores.values())
        return {class_: math.exp(score - log_sum) for class_, score in class_scores.items()}
    else:
        # return the class with the highest score
        return max(class_scores, key=class_scores.get)


## Test your classifier

Finally, test your classifier and measure its accuracy. Don't be perturbed if your results are sub-par; industry use cases would require substantial additional preprocessing before implementing the algorithm in practice.

In [15]:
# Your code here
import math

def bag_it(text):
    bag_of_words = {}
    for word in text.split():
        bag_of_words[word] = bag_of_words.get(word, 0) + 1
    return bag_of_words

def classify_doc(doc, class_word_freq, p_classes, V, return_posteriors=False):
    bag = bag_it(doc)
    posteriors = []
    for label in p_classes.keys():
        # calculate log of prior
        log_prior = math.log(p_classes[label])
        # calculate log of likelihood
        word_freq = class_word_freq[label]
        log_likelihood = 0.0
        for word in bag.keys():
            log_likelihood += bag[word] * math.log(word_freq.get(word, 0.0) + 1.0)
        # calculate log of posterior
        log_posterior = log_prior + log_likelihood
        posteriors.append((label, log_posterior))
    # return posteriors if requested
    if return_posteriors:
        return posteriors
    # otherwise, return class with maximum posterior probability
    return max(posteriors, key=lambda x:x[1])[0]

# subset the dataset so that the two classes are of equal size
spam_df = df[df['label'] == 'spam']
ham_df = df[df['label'] == 'ham'].sample(n=len(spam_df), random_state=42)
train_df = pd.concat([spam_df, ham_df], axis=0, ignore_index=True)

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(train_df['text'], train_df['label'], test_size=0.2, random_state=42)

# create word frequency dictionary for each class
class_word_freq = {} 
classes = train_df['label'].unique()
for class_ in classes:
    temp_df = train_df[train_df['label'] == class_]
    bag = {}
    for row in temp_df.index:
        doc = temp_df['text'][row]
        for word in doc.split():
            bag[word] = bag.get(word, 0) + 1
    class_word_freq[class_] = bag

# count the total corpus words
V = len(set([word for class_ in classes for text in train_df[train_df['label'] == class_]['text'] for word in text.split()]))

# calculate class priors
p_classes = {}
for class_ in classes:
    p_classes[class_] = len(train_df[train_df['label'] == class_]) / len(train_df)

# test the classifier on the testing set and calculate accuracy
correct = 0
total = len(X_test)
for i, doc in X_test.iteritems():
    pred = classify_doc(doc, class_word_freq, p_classes, V)
    if pred == y_test[i]:
        correct += 1
accuracy = correct / total

print("Accuracy: {:.4f}".format(accuracy))


Accuracy: 0.9766


## Level up (Optional)

Rework your code into an appropriate class structure so that you could easily implement the algorithm on any given dataset.

## Summary

Well done! In this lab, you practiced implementing Naive Bayes for document classification!