<center><h2>Implement Naive Bayes <br>From Scratch</h2></center>

<center><h2>Bayes' Theorem</h2></center>


<br>
<center><img src="images/bayes_rule.png" width="100%"/></center>

<center><h2>Training Naive Bayes</h2></center>

1. Acquire labeled data
1. Preprocess data
1. Calculate document class priors
1. Calculate the conditional probability of each word given each class
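
The steps above can be sketched as a single function. This is a minimal sketch rather than the notebook's own code: the name `train_naive_bayes`, the `(label, document)` pair format, and the toy documents are assumptions for illustration, and tokens are assumed to be separated by whitespace.

```python
from collections import Counter, defaultdict


def train_naive_bayes(data):
    """Hypothetical helper: data is a list of (label, document) pairs."""
    # Class priors: P(c) = N_c / N
    label_counts = Counter(label for label, _ in data)
    priors = {label: n / len(data) for label, n in label_counts.items()}

    # Count tokens per class, then normalize to P(token | c)
    token_counts = defaultdict(Counter)
    for label, doc in data:
        token_counts[label].update(doc.split())

    cond_prob = {}
    for label, counts in token_counts.items():
        total = sum(counts.values())
        cond_prob[label] = {token: n / total for token, n in counts.items()}
    return priors, cond_prob


# Toy usage with made-up documents
toy = [("spam", "buy now now"), ("ham", "see you now")]
priors, cond_prob = train_naive_bayes(toy)
```

In this sketch, a token that never occurs in a class is simply absent from that class's dictionary instead of being stored with probability 0.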

Acquire data & preprocess
-----

In [91]:
reset -fs

In [92]:
corpus  = ["🐈 🐯 🐱 🐩 🐱", 
           "🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶", 
           "🐈 🐈 🐯 🐶 🐈",  
           "🐈 🐈 🐈",
           "🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶 "]

labels = ['cat', 'dog', 'cat', 'cat','dog'] 

In [93]:
data = [['cat', "🐈 🐯 🐱 🐩 🐱"],
        ['dog', "🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶"],
        ['cat', "🐈 🐈 🐯 🐶 🐈"],
        ['cat', "🐈 🐈 🐈"], 
        ['dog', "🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶 "]]

In [94]:
for label, item in data:
    print(f"{label}: {item}")

cat: 🐈 🐯 🐱 🐩 🐱
dog: 🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶
cat: 🐈 🐈 🐯 🐶 🐈
cat: 🐈 🐈 🐈
dog: 🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶 


<center><h2>Calculate class priors</h2></center>

$$P(c) = \frac{N_c}{N}$$

In [95]:
# What labels are we dealing with?
labels = set(labels)
labels

{'cat', 'dog'}

In [96]:
# How many documents are we dealing with?
n_docs = len(corpus)
n_docs

5

In [97]:
from collections import defaultdict

doc_priors = defaultdict(float)

In [98]:
# For each label, find the baseline probability of occurrence (the class prior)
for label in labels:
    doc_priors[label] = sum(1 for item_label, _ in data if item_label == label) / n_docs

print(*doc_priors.items(), sep='\n')

('dog', 0.4)
('cat', 0.6)
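
The same priors can also be computed with `collections.Counter`, which counts the documents per class in one pass. This is an alternative sketch, not the notebook's method; `doc_labels`, `label_counts`, `n_total`, and `priors` are illustrative names.

```python
from collections import Counter

# Labels copied from the `data` list above
doc_labels = ['cat', 'dog', 'cat', 'cat', 'dog']

label_counts = Counter(doc_labels)   # N_c for each class
n_total = len(doc_labels)            # N
priors = {label: count / n_total for label, count in label_counts.items()}
print(priors)
```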


<center><h2>Conditional probabilities of word by class</h2></center>

In [99]:
vocab = []

In [100]:
# Collect every token in the corpus (with repeats); the vocabulary is the set of unique tokens
for _, doc in data:
    vocab.extend(doc.split())
    
print("Vocab:", vocab)

Vocab: ['🐈', '🐯', '🐱', '🐩', '🐱', '🐶', '🐶', '🐈', '🐶', '🐩', '🐈', '🐶', '🐶', '🐈', '🐈', '🐯', '🐶', '🐈', '🐈', '🐈', '🐈', '🐶', '🐶', '🐯', '🐈', '🐩', '🐱', '🐩', '🐶', '🐩', '🐶']


In [101]:
# Unique tokens
set(vocab)

{'🐈', '🐩', '🐯', '🐱', '🐶'}

In [102]:
# Number of unique tokens, aka cardinality
v = len(set(vocab))
print("Cardinality of vocab:", v)

Cardinality of vocab: 5


In [103]:
# Nested defaultdict: cond_prob[label][token] holds P(token | label), defaulting to 0.0
cond_prob = defaultdict(lambda: defaultdict(float))

In [80]:
for label in labels:    
    label_tokens = []
    for item_label, doc in data:
        # For a given label, get a list of all the tokens for all the docs 
        if item_label == label:
            label_tokens.extend(doc.split())

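    # dict.fromkeys de-duplicates vocab while keeping first-seen order
    for token in dict.fromkeys(vocab):
        # Conditional probability: count of token in this class / total tokens in this class
        cond_prob[label][token] = label_tokens.count(token) / len(label_tokens) 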

In [90]:
for label, sub_dict in cond_prob.items():
    print(label.title(), ":")
    for token, prob in sub_dict.items():
        print(token, prob)

Dog :
🐈 0.16666666666666666
🐯 0.05555555555555555
🐱 0.05555555555555555
🐩 0.2222222222222222
🐶 0.5
Cat :
🐈 0.5384615384615384
🐯 0.15384615384615385
🐱 0.15384615384615385
🐩 0.07692307692307693
🐶 0.07692307692307693


In [60]:
# Test that each label's conditional distribution is a probability mass function (pmf): it must sum to 1
from math import isclose

for label in labels:
    assert isclose(sum(cond_prob[label].values()), 1)
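
The vocabulary cardinality `v` computed earlier does not enter the unsmoothed estimate above. One common place it appears is add-one (Laplace) smoothing, which gives unseen tokens a small nonzero probability. The sketch below is an assumption about how that variant could look, not something this notebook does; `smoothed_cond_prob`, `alpha`, and the three-token vocabulary are hypothetical.

```python
from collections import Counter
from math import isclose


def smoothed_cond_prob(label_tokens, vocabulary, alpha=1):
    """Hypothetical helper: add-alpha estimate of P(token | class)."""
    counts = Counter(label_tokens)
    # Each vocabulary entry contributes alpha pseudo-counts to the denominator
    denom = len(label_tokens) + alpha * len(vocabulary)
    return {token: (counts[token] + alpha) / denom for token in vocabulary}


vocabulary = {"a", "b", "c"}        # made-up three-token vocabulary
tokens_in_class = ["a", "a", "b"]   # "c" never occurs in this class

probs = smoothed_cond_prob(tokens_in_class, vocabulary)
assert isclose(sum(probs.values()), 1)   # still a pmf
assert probs["c"] > 0                    # unseen token keeps nonzero mass
```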

<center><h2>Predicting with Naive Bayes</h2></center>

1. Acquire and process the new data
1. For each new data point, calculate the proportional probabilities for each class
1. Pick the winning class
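
The steps above can be sketched as one function. This is a minimal sketch under stated assumptions: `predict` is a hypothetical name, the model is passed in as plain dictionaries, tokens are whitespace-separated, and the two-class model is made up for illustration.

```python
from math import prod  # available in Python 3.8+


def predict(doc, priors, cond_prob):
    """Hypothetical helper: return (best_label, scores) for one document."""
    tokens = doc.split()                       # process the new data
    scores = {
        label: priors[label] * prod(cond_prob[label].get(t, 0.0) for t in tokens)
        for label in priors                    # score each class
    }
    best = max(scores, key=scores.get)         # pick the winning class
    return best, scores


# Made-up two-class model, for illustration only
priors = {"pos": 0.5, "neg": 0.5}
cond_prob = {"pos": {"good": 0.75, "bad": 0.25},
             "neg": {"good": 0.25, "bad": 0.75}}

best, scores = predict("good good bad", priors, cond_prob)
```

The scores are only proportional to the class probabilities, so they are useful for ranking classes but do not sum to 1.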

<center><h2>Given a new document, <br> calculate the proportional probabilities for each class</h2></center>

$$P(c \mid X) \propto P(c) \cdot \prod_{i=1}^{n} P(x_i \mid c)$$
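
For long documents, a product of many small probabilities can underflow to zero. A common workaround, not used in this notebook, is to sum logarithms instead; because the logarithm is monotonic, the ranking of the classes is unchanged. The sketch below assumes every probability is strictly positive; `log_score` and the numbers are hypothetical.

```python
from math import isclose, log, prod


def log_score(prior, token_probs):
    """Hypothetical helper: log P(c) + sum of log P(x_i | c)."""
    return log(prior) + sum(log(p) for p in token_probs)


prior = 0.4
token_probs = [0.5, 0.25, 0.25]    # made-up P(x_i | c) values

direct = prior * prod(token_probs)          # product form
via_logs = log_score(prior, token_probs)    # log-sum form
assert isclose(log(direct), via_logs)       # the two forms agree
```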

In [104]:
# Define product function
import operator
from functools import reduce

def product(iterable):
    return reduce(operator.mul, iterable, 1)

In [72]:
# new_input = "🐱"
# new_input = "🐱 🐩 "
new_input = "🐱 🐩 🐩"

prob_predicted = defaultdict(float)
for label in labels:
    # For each label, score = prior * product of P(token | label) over the input tokens (unnormalized posterior)
    prob_predicted[label] = doc_priors[label] * product(cond_prob[label][t] for t in new_input.split())
    
print(*dict(prob_predicted).items(), sep='\n')

('dog', 0.0010973936899862826)
('cat', 0.0005461993627674101)
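
The printed scores correspond to the input `🐱 🐩 🐩` and can be reproduced by hand from the conditional probability table: in dog documents 🐱 occurs once and 🐩 four times out of 18 tokens, while in cat documents 🐱 occurs twice and 🐩 once out of 13 tokens. The sketch below checks this with exact fractions and then normalizes the two scores so they sum to 1; the variable names are illustrative.

```python
from fractions import Fraction
from math import isclose

# score = prior * P(🐱 | class) * P(🐩 | class) ** 2
dog = Fraction(2, 5) * Fraction(1, 18) * Fraction(4, 18) ** 2
cat = Fraction(3, 5) * Fraction(2, 13) * Fraction(1, 13) ** 2

assert isclose(float(dog), 0.0010973936899862826)
assert isclose(float(cat), 0.0005461993627674101)

# Normalize so the two scores sum to 1
total = dog + cat
posterior = {"dog": float(dog / total), "cat": float(cat / total)}
print(posterior)
```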


<center><h2>Pick the winning class</h2></center>

In [74]:
from operator import itemgetter

In [77]:
label, prob = max(prob_predicted.items(), key=itemgetter(1))
print("The predicted class is: ", end="")
print(*(k for k, v in prob_predicted.items() if v == prob))

The predicted class is: dog


In [76]:
# Handle ties and fall back to document priors if winning probability is zero

label, prob = max(prob_predicted.items(), key=itemgetter(1))
if prob > 0:
    print("The predicted class is: ", end="")
    print(*(k for k, v in prob_predicted.items() if v == prob))
else:
    label, prob = max(doc_priors.items(),
                      key=itemgetter(1))
    print("The predicted class is:", label)

The predicted class is: dog
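
The tie and zero-score handling above can be packaged as a small function. A zero score for every class can occur when an input token has conditional probability 0 in each class, for example a token never seen in training. This is a sketch with a hypothetical name, `pick_winners`, and made-up scores.

```python
def pick_winners(scores, priors):
    """Hypothetical helper: labels tied for the top score, or the top-prior label."""
    top = max(scores.values())
    if top > 0:
        # Return every label that shares the top score, so ties are visible
        return [label for label, s in scores.items() if s == top]
    # Every class scored zero: fall back to the class with the largest prior
    return [max(priors, key=priors.get)]


priors = {"cat": 0.6, "dog": 0.4}

clear_win = pick_winners({"cat": 0.02, "dog": 0.05}, priors)
tie = pick_winners({"cat": 0.03, "dog": 0.03}, priors)
fallback = pick_winners({"cat": 0.0, "dog": 0.0}, priors)
```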


<center><img src="images/questions.png" width="65%"/></center>

<br>
<br> 
<br>

<center><h2>Bonus Material</h2></center>

- Other implementations by Brian Spiering
    - [Using NormalDist from statistics module](https://github.com/brianspiering/naive_bayes_classifer_in_python_3_8)
    - [Naive Bayes for Text Classification](https://github.com/brianspiering/bayesian-text)
    
- [Implementation by mircealex](https://github.com/mircealex)