Apply Bayes' Rule to multi-class document classification <br> (from scratch)
-----

![](images/bayes_rule.png)

------
Naive Bayes Classification Steps
-------

1. Get labeled data
1. Preprocess
1. Apply Multinomial Naive Bayes
    1. Calculate document class priors
    1. Calculate conditional probabilities of each word for each class
    1. Calculate the proportional probabilities for each class of a new document
    1. Pick the winning class
1. Evaluate with metrics

Get data & preprocess
-----

In [41]:
reset -fs

In [42]:
from collections import namedtuple

In [43]:
Data = namedtuple('Data', 'id_num label tokens')

In [44]:
data = Data(id_num=42, label='cat', tokens="🐱 🐱 🐶 🐈 ".split())

In [45]:
train = [Data(42, 'cat',  "🐈 🐯 🐱 🐩 🐱".split()),
         Data(43, 'dog',  "🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶".split()),
         Data(45, 'cat',  "🐈 🐈 🐯 🐶 🐈".split()),
         Data(45, 'cat',  "🐈 🐈 🐈".split()),
         Data(48, 'dog',  "🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶 ".split()),
        ]

Calculate document class priors
---- 

$$P(c) = \frac{N_c}{N}$$

In [46]:
# What labels are we dealing with?
labels = {d.label for d in train}
labels

{'cat', 'dog'}

In [47]:
# How many documents are we dealing with?
n_docs = len(train)
n_docs

5

In [48]:
from collections import defaultdict

In [49]:
# For each label, compute the prior: the fraction of training documents with that label
doc_priors = defaultdict(float)

for label in labels:
    doc_priors[label] = sum(1 for d in train if d.label == label)/n_docs

print(*doc_priors.items(), sep='\n')

('cat', 0.6)
('dog', 0.4)
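
The same priors can also be tallied in a single pass with `collections.Counter`. This is a minimal sketch, not the notebook's own code; `train_labels` and `label_counts` are names assumed here for illustration, and the label list is copied by hand from the training documents above.

```python
from collections import Counter

# Labels of the five training documents above, in order (copied by hand).
train_labels = ['cat', 'dog', 'cat', 'cat', 'dog']

label_counts = Counter(train_labels)      # N_c for each class
n_docs = sum(label_counts.values())       # N

# P(c) = N_c / N
doc_priors = {label: count / n_docs for label, count in label_counts.items()}
print(doc_priors)  # {'cat': 0.6, 'dog': 0.4}
```

Counting once and dividing avoids rescanning the training list for every label, which matters only when there are many classes.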


Calculate conditional probabilities of each word for each class
-----

In [50]:
# Collect every token from the training documents (duplicates included)
vocab = []

for doc in train:
    vocab.extend(doc.tokens)
    
print("Vocab:", vocab)

Vocab: ['🐈', '🐯', '🐱', '🐩', '🐱', '🐶', '🐶', '🐈', '🐶', '🐩', '🐈', '🐶', '🐶', '🐈', '🐈', '🐯', '🐶', '🐈', '🐈', '🐈', '🐈', '🐶', '🐶', '🐯', '🐈', '🐩', '🐱', '🐩', '🐶', '🐩', '🐶']


In [51]:
# Unique tokens
set(vocab)

{'🐈', '🐩', '🐯', '🐱', '🐶'}

In [52]:
# Number of unique tokens, cardinality
v = len(set(vocab))
print("Cardinality of vocab:", v)

Cardinality of vocab: 5


In [53]:
# A default dict of default dicts; inner default dict is a probability
cond_prob = defaultdict(lambda: defaultdict(float))

for label in labels:
    
    label_tokens = []
    for doc in train:
        # For a given label, get a list of all the tokens for all the docs
        if doc.label == label:
            label_tokens.extend(doc.tokens)

    for token in vocab:
        # Find conditional probability: token count / total count
        cond_prob[label][token] = label_tokens.count(token) / len(label_tokens) 

cond_prob

defaultdict(<function __main__.<lambda>()>,
            {'cat': defaultdict(float,
                         {'🐈': 0.5384615384615384,
                          '🐯': 0.15384615384615385,
                          '🐱': 0.15384615384615385,
                          '🐩': 0.07692307692307693,
                          '🐶': 0.07692307692307693}),
             'dog': defaultdict(float,
                         {'🐈': 0.16666666666666666,
                          '🐯': 0.05555555555555555,
                          '🐱': 0.05555555555555555,
                          '🐩': 0.2222222222222222,
                          '🐶': 0.5})})

In [54]:
# Check that each label's conditional probabilities sum to 1 (a valid pmf)
for label in labels:
    assert abs(sum(cond_prob[label].values()) - 1) < 1e-9

Given a new document without a label, calculate the proportional probabilities for each class
-------

$$P(c \mid X) \propto P(c) \cdot \prod_{i=1}^n P(x_i \mid c)$$

In [55]:
import operator
from functools import reduce

def product(iterable):
    return reduce(operator.mul, iterable, 1)
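
On Python 3.8 and later, `math.prod` does the same job as the `product` helper. For long documents, multiplying many small probabilities can underflow to zero, so a common alternative is to add logarithms instead. The sketch below assumes hypothetical helper names `score` and `log_score`, reuses the cat prior and P(🐈 | cat) from above, and adds a made-up list of tiny likelihoods purely to show the underflow.

```python
import math

def score(prior, likelihoods):
    """Unnormalized posterior: prior times the product of likelihoods."""
    return prior * math.prod(likelihoods)

def log_score(prior, likelihoods):
    """Same quantity in log space; every probability must be > 0."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)

# A document of three 🐈 tokens scored against 'cat' (values from above).
prior_cat = 0.6
likelihoods = [7 / 13] * 3
direct = score(prior_cat, likelihoods)
via_logs = math.exp(log_score(prior_cat, likelihoods))
print(math.isclose(direct, via_logs))  # True

# Hypothetical long document: 100 tokens, each with likelihood 1e-5.
long_doc = [1e-5] * 100
print(score(0.5, long_doc))      # 0.0, the product underflows
print(log_score(0.5, long_doc))  # a finite negative number, still comparable
```

Because the logarithm is monotonic, the class with the largest log score is also the class with the largest product. Note that `math.log(0)` raises `ValueError`, so log scoring only works when no probability is exactly zero.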

In [66]:
# test = Data(id_num=90, label=None, tokens="🐱".split())
test = Data(id_num=91, label=None, tokens="🐶 🐶".split()) 
# test = Data(id_num=92, label=None, tokens="🐶 🐱".split())
# test = Data(id_num=93, label=None, tokens="🐈 🐈 🐶 🐶 🐩 🐱 🐱".split())
# test = Data(id_num=94, label=None, tokens="🐬".split()) # Out of sample prediction

prob_predicted = defaultdict(float)
for label in labels:
    # For each label, compute the unnormalized posterior: prior times the product of token likelihoods
    prob_predicted[label] = doc_priors[label] * product(cond_prob[label][t] for t in test.tokens)
    
print(*dict(prob_predicted).items(), sep='\n')

('cat', 0.003550295857988166)
('dog', 0.1)
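
These scores are proportional to the posterior, not probabilities themselves, which is why they do not sum to 1. Dividing each by their total recovers P(c | X). A minimal sketch, with the two scores copied from the output above and `posterior` as an assumed name:

```python
# Unnormalized scores for the test document 🐶 🐶, copied from the output above.
scores = {'cat': 0.003550295857988166, 'dog': 0.1}

evidence = sum(scores.values())  # plays the role of P(X), the normalizing constant
posterior = {label: s / evidence for label, s in scores.items()}

for label, p in posterior.items():
    print(label, round(p, 3))
# cat 0.034
# dog 0.966
```

Normalizing does not change which class wins, so it can be skipped when only the predicted label is needed.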


Pick the winning class
-----

In [57]:
from operator import itemgetter

In [58]:
# Naive approach: take the label with the highest score
label, prob = max(prob_predicted.items(),
                  key=itemgetter(1))
print("The predicted class is:", label)

The predicted class is: dog


<br>
<br> 
<br>

----

In [59]:
# Handle ties and fall back to document priors if winning probability is zero
label, prob = max(prob_predicted.items(),
                  key=itemgetter(1))
if prob > 0:
    print("The predicted class is: ", end="")
    print(*(k for k, v in prob_predicted.items() if v == prob))
else:
    label, prob = max(doc_priors.items(),
                      key=itemgetter(1))
    print("The predicted class is:", label)

The predicted class is: dog
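
The last step in the outline, evaluating with metrics, is not worked through above. As a rough sketch, the pieces can be wrapped into two functions and scored on the training documents themselves. The names `fit` and `predict` are assumptions made here, and training accuracy is an optimistic figure; a real evaluation would use held-out labeled documents, which this notebook does not include.

```python
from collections import Counter, defaultdict
from math import prod

# Training documents from above as (label, tokens) pairs.
train = [
    ('cat', "🐈 🐯 🐱 🐩 🐱".split()),
    ('dog', "🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶".split()),
    ('cat', "🐈 🐈 🐯 🐶 🐈".split()),
    ('cat', "🐈 🐈 🐈".split()),
    ('dog', "🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶".split()),
]

def fit(docs):
    """Return (priors, cond_prob), estimated without smoothing."""
    label_counts = Counter(label for label, _ in docs)
    token_counts = defaultdict(Counter)
    for label, tokens in docs:
        token_counts[label].update(tokens)
    priors = {c: n / len(docs) for c, n in label_counts.items()}
    cond_prob = {c: {t: n / sum(counts.values()) for t, n in counts.items()}
                 for c, counts in token_counts.items()}
    return priors, cond_prob

def predict(tokens, priors, cond_prob):
    """Pick the highest-scoring label; fall back to the largest prior on all-zero scores."""
    scores = {c: priors[c] * prod(cond_prob[c].get(t, 0.0) for t in tokens)
              for c in priors}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        best = max(priors, key=priors.get)
    return best

priors, cond_prob = fit(train)
predictions = [predict(tokens, priors, cond_prob) for _, tokens in train]
correct = sum(p == label for p, (label, _) in zip(predictions, train))
accuracy = correct / len(train)
print(f"Training accuracy: {accuracy:.2f}")  # Training accuracy: 1.00
```

Unlike the cell above, this sketch returns a single label and does not report ties; which behavior is preferable depends on how ties should be handled downstream.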


Summary
------

- Naive Bayes (NB) is a simple and effective baseline algorithm for text classification
- To apply NB, compute the class priors and the per-class token probabilities, then score each class for a new document and pick the highest
- Python's Standard Library (`collections`, `functools`, `operator`) has helper functions that keep the code short and readable

<br>
<br> 
<br>

----