<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Implement-Naive-Bayes-From-Scratch" data-toc-modified-id="Implement-Naive-Bayes-From-Scratch-1">Implement Naive Bayes From Scratch</a></span></li><li><span><a href="#Bayes'-Theorem" data-toc-modified-id="Bayes'-Theorem-2">Bayes' Theorem</a></span></li><li><span><a href="#Learning-Outcomes" data-toc-modified-id="Learning-Outcomes-3">Learning Outcomes</a></span></li><li><span><a href="#A-Recipe-for-Naive-Bayes-Classification" data-toc-modified-id="A-Recipe-for-Naive-Bayes-Classification-4">A Recipe for Naive Bayes Classification</a></span></li><li><span><a href="#Acquire-data-&amp;-preprocess" data-toc-modified-id="Acquire-data-&amp;-preprocess-5">Acquire data &amp; preprocess</a></span></li><li><span><a href="#Instantiation" data-toc-modified-id="Instantiation-6">Instantiation</a></span></li><li><span><a href="#self" data-toc-modified-id="self-7"><code>self</code></a></span></li><li><span><a href="#What-the-heck-is-@dataclass?" data-toc-modified-id="What-the-heck-is-@dataclass?-8">What the heck is <code>@dataclass</code>?</a></span></li><li><span><a href="#Dataclassess-make-class-easier-to-write" data-toc-modified-id="Dataclassess-make-class-easier-to-write-9">Dataclassess make class easier to write</a></span></li><li><span><a href="#Data-classes-allow-for-more-human-readable-code" data-toc-modified-id="Data-classes-allow-for-more-human-readable-code-10">Data classes allow for more human-readable code</a></span></li><li><span><a href="#Calculate-document-class-priors" data-toc-modified-id="Calculate-document-class-priors-11">Calculate document class priors</a></span></li><li><span><a href="#Calculate-conditional-probabilities-of-each-word-for-each-class" data-toc-modified-id="Calculate-conditional-probabilities-of-each-word-for-each-class-12">Calculate conditional probabilities of each word for each class</a></span></li><li><span><a 
href="#Given-a-new-document-without-a-label,--calculate-the-proportional-probabilities-for-each-class" data-toc-modified-id="Given-a-new-document-without-a-label,--calculate-the-proportional-probabilities-for-each-class-13">Given a new document without a label,  calculate the proportional probabilities for each class</a></span></li><li><span><a href="#Pick-the-winning-class" data-toc-modified-id="Pick-the-winning-class-14">Pick the winning class</a></span></li><li><span><a href="#Ideas-for-Improvement" data-toc-modified-id="Ideas-for-Improvement-15">Ideas for Improvement</a></span></li><li><span><a href="#Summary" data-toc-modified-id="Summary-16">Summary</a></span></li><li><span><a href="#Bonus-Material" data-toc-modified-id="Bonus-Material-17">Bonus Material</a></span></li><li><span><a href="#Further-Study" data-toc-modified-id="Further-Study-18">Further Study</a></span></li></ul></div>

Implement Naive Bayes From Scratch
-----

Bayes' Theorem
------

$$ P(A | B) = \frac{P(B|A)P(A)}{P(B)} $$


![](images/bayes_rule.png)
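
As a quick numeric check of the formula, the sketch below plugs numbers into Bayes' theorem. The three input probabilities are made-up values chosen only for illustration; they do not come from this notebook's data.

```python
# Hypothetical numbers, chosen only to illustrate the formula
p_a = 0.01              # P(A): the prior
p_b_given_a = 0.90      # P(B|A): the likelihood
p_b_given_not_a = 0.05  # P(B|not A)

# P(B) via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")
```

Even with a strong likelihood, the small prior keeps the posterior low, which is why the prior term matters in the classifier built below.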

Learning Outcomes
------

__By The End Of This Notebook You Should Be Able To:__

1. Write idiomatic Python to model data. 
1. Implement Naive Bayes in pure Python.

------
A Recipe for Naive Bayes Classification
-------

1. Acquire labeled training data
1. With training data:
    1. Calculate document class priors
    1. Calculate conditional probabilities of each word for each class
1. With new data:
    1. Calculate the proportional probabilities for each class of new document
    1. Pick the winning class

Acquire data & preprocess
-----

In [2]:
reset -fs

```python
# Let's make a data class to hold our data
data = LabeledTextData(id_num=42, label='cat', words="🐱 🐱 🐈 🐶".split())
```

In [3]:
class LabeledTextData:
    def __init__(self, id_num, label, words):
        self.id_num = id_num
        self.label = label 
        self.words = words

Instantiation
----

Instantiation is the process of creating instances from classes.

`def __init__` is how that happens in Python.

`__init__` is short for initialize.

`__init__` is commonly called the __constructor__ (strictly speaking, it initializes an instance that Python has already created). It makes guarantees during initialization: every instance of this class will have these attributes by default.
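
To see when `__init__` runs, here is a minimal sketch. The `Pet` class is a hypothetical example, not part of this notebook's data model.

```python
class Pet:  # hypothetical class, for illustration only
    def __init__(self, name):
        # Runs automatically each time Pet(...) is called
        print(f"Initializing {name}")
        self.name = name

pet = Pet("Felix")  # calling the class triggers __init__
print(pet.name)
```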

`self`
------

`self` refers to the specific instance (think of it as "this particular one")

Use `self.<attribute>` (for example, `self.label`) to access an instance's data.

`self` is kinda silly but I'll show you a trick later.
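
A small sketch of what `self` does, using a hypothetical `Tally` class: each instance keeps its own state, and calling a method on an instance passes that instance in as `self`.

```python
class Tally:  # hypothetical class, for illustration only
    def __init__(self):
        self.count = 0

    def add(self):
        # self is whichever instance the method was called on
        self.count += 1

a = Tally()
b = Tally()
a.add()
a.add()
Tally.add(b)  # equivalent to b.add(): the instance is passed in as self

print(a.count, b.count)
```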

In [4]:
data = LabeledTextData(id_num=42, label='cat', words="🐱 🐱 🐈 🐶".split())

Our `LabeledTextData` class is mostly data. We can use `dataclass` (new in Python 3.7) to simplify our code.

In [5]:
from dataclasses import dataclass

In [6]:
@dataclass
class LabeledTextData:
    id_num: int
    label: str
    words: list

What the heck is `@dataclass`?
----

The `@` symbol above a function or class applies a decorator.

You can think of a decorator as a hat that gives the function or class superpowers.

We might discuss decorators in more detail later.
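
In the meantime, here is a minimal sketch of a hand-written decorator. The names `shout` and `greet` are hypothetical; the point is only that a decorator takes a function and returns a modified version of it.

```python
from functools import wraps

def shout(func):  # hypothetical decorator, for illustration only
    @wraps(func)  # keep the original function's name and docstring
    def wrapper(*args, **kwargs):
        # Call the original function, then modify its result
        return func(*args, **kwargs).upper()
    return wrapper

@shout
def greet(name):
    return f"hello, {name}"

print(greet("cat"))
```

Writing `@shout` above `greet` is shorthand for `greet = shout(greet)`.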

Dataclasses make classes easier to write
--------

Dataclasses reduce boilerplate code. Boilerplate is code that you have to write but that does not add unique value.

You don't need to write out explicit `self` on every line and you don't need `__init__…`.
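
The sketch below compares a hand-written class with a dataclass holding the same fields. The class names `PlainDoc` and `DataDoc` are hypothetical. Besides `__init__`, `@dataclass` also generates `__repr__` and `__eq__` by default.

```python
from dataclasses import dataclass

class PlainDoc:  # hypothetical name: the hand-written version
    def __init__(self, id_num, label, words):
        self.id_num = id_num
        self.label = label
        self.words = words

@dataclass
class DataDoc:  # hypothetical name: the dataclass version
    id_num: int
    label: str
    words: list

# Plain classes compare by identity; dataclasses compare field by field
print(PlainDoc(1, 'cat', ['🐱']) == PlainDoc(1, 'cat', ['🐱']))
print(DataDoc(1, 'cat', ['🐱']) == DataDoc(1, 'cat', ['🐱']))

# Dataclasses also get a readable repr for free
print(DataDoc(1, 'cat', ['🐱']))
```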

Data classes allow for more human-readable code
-----

In [7]:
data = LabeledTextData(42, 'cat',  "🐈 🐯 🐱 🐩 🐱".split())

In [22]:
# Data classes allow for . (dot) access
data.label

'cat'

In [23]:
# Dot access chains fluently with method calls, even on custom classes
data.label.title()

'Cat'

In [8]:
train = [LabeledTextData(42, 'cat',  "🐈 🐯 🐱 🐩 🐱".split()),
         LabeledTextData(43, 'dog',  "🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶".split()),
         LabeledTextData(45, 'cat',  "🐈 🐈 🐯 🐶 🐈".split()),
         LabeledTextData(45, 'cat',  "🐈 🐈 🐈".split()),
         LabeledTextData(48, 'dog',  "🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶 ".split()),
        ]

Calculate document class priors
---- 

$$P(c) = \frac{N_c}{N}$$

In [9]:
# What labels are we dealing with?
labels = {doc.label for doc in train}
labels

{'cat', 'dog'}

In [10]:
# How many documents are we dealing with?
n_docs = len(train)
n_docs

5

In [11]:
from collections import defaultdict

In [12]:
# For each label, find the baseline probability of occurrence
class_counts = defaultdict(int)

# Find the count of each category
for doc in train:
    class_counts[doc.label] += 1

# Convert them to probabilities
doc_priors = {label: count/n_docs for label, count in class_counts.items()}
    
print(*doc_priors.items(), sep='\n')

('cat', 0.6)
('dog', 0.4)
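
As an alternative sketch, `collections.Counter` can do the counting in one step. The list `train_labels` below simply restates the labels of the five training documents; the variable names are assumptions chosen to avoid clobbering the ones above.

```python
from collections import Counter

# Labels of the five training documents, in order
train_labels = ['cat', 'dog', 'cat', 'cat', 'dog']

label_counts = Counter(train_labels)
n_labels = len(train_labels)

# P(c) = N_c / N
priors = {label: count / n_labels for label, count in label_counts.items()}
print(priors)
```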


Calculate conditional probabilities of each word for each class
-----

In [13]:
# Get all tokens, aka the vocabulary
vocab = []

for doc in train:
    vocab.extend(doc.words)
    
print("Vocab:", vocab)

Vocab: ['🐈', '🐯', '🐱', '🐩', '🐱', '🐶', '🐶', '🐈', '🐶', '🐩', '🐈', '🐶', '🐶', '🐈', '🐈', '🐯', '🐶', '🐈', '🐈', '🐈', '🐈', '🐶', '🐶', '🐯', '🐈', '🐩', '🐱', '🐩', '🐶', '🐩', '🐶']


In [14]:
# Unique tokens
set(vocab)

{'🐈', '🐩', '🐯', '🐱', '🐶'}

In [15]:
# Number of unique tokens, aka cardinality
v = len(set(vocab))
print("Cardinality of vocab:", v)

Cardinality of vocab: 5


In [16]:
# A default dict of default dicts; inner default dict is probability
cond_prob = defaultdict(lambda: defaultdict(float))

for label in labels:
    
    label_words = []
    for doc in train:
        # For a given label, get a list of all the tokens for all the docs
        if doc.label == label:
            label_words.extend(doc.words)

    for word in vocab:
        # Find conditional probability: word count / total count
        cond_prob[label][word] = label_words.count(word) / len(label_words) 

cond_prob

defaultdict(<function __main__.<lambda>()>,
            {'dog': defaultdict(float,
                         {'🐈': 0.16666666666666666,
                          '🐯': 0.05555555555555555,
                          '🐱': 0.05555555555555555,
                          '🐩': 0.2222222222222222,
                          '🐶': 0.5}),
             'cat': defaultdict(float,
                         {'🐈': 0.5384615384615384,
                          '🐯': 0.15384615384615385,
                          '🐱': 0.15384615384615385,
                          '🐩': 0.07692307692307693,
                          '🐶': 0.07692307692307693})})
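
To check one row of that table by hand, the sketch below recounts the 'dog' documents with `Counter` and prints exact fractions. It is a verification aid under the same counting rule (word count / total count), not a replacement for the cell above.

```python
from collections import Counter
from fractions import Fraction

# The two 'dog' training documents
dog_docs = ["🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶".split(),
            "🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶".split()]

# Pool the tokens across documents and count each one in a single pass
dog_counts = Counter(word for doc in dog_docs for word in doc)
n_dog_tokens = sum(dog_counts.values())

for word, count in dog_counts.most_common():
    print(word, Fraction(count, n_dog_tokens))
```

For example, 🐶 appears 9 times out of 18 'dog' tokens, which matches the 0.5 in the table.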

In [17]:
# Test that each label is a probability mass function (pmf). A pmf sums to 1
from math import isclose

for label in labels:
    assert isclose(sum(cond_prob[label].values()), 1)

Given a new document without a label,  calculate the proportional probabilities for each class
-------

$$ P(c | X) \propto P(c) \cdot \prod_{i=1}^n P(x_i | c)$$
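
Here is a worked instance of the formula using exact fractions. The priors and conditional probabilities are copied from the values computed earlier; the two-word document `new_doc` is a hypothetical example.

```python
from fractions import Fraction

# Values copied from the priors and conditional probabilities computed above
priors = {'cat': Fraction(3, 5), 'dog': Fraction(2, 5)}
likelihood = {
    'cat': {'🐶': Fraction(1, 13), '🐱': Fraction(2, 13)},
    'dog': {'🐶': Fraction(9, 18), '🐱': Fraction(1, 18)},
}

new_doc = ['🐶', '🐱']  # hypothetical unlabeled document

scores = {}
for label in priors:
    score = priors[label]           # start from P(c)
    for word in new_doc:
        score *= likelihood[label][word]  # multiply in each P(x_i | c)
    scores[label] = score

for label, score in scores.items():
    print(label, score, float(score))
```

The scores do not sum to 1, since they are only proportional to the posterior, but comparing them is enough to choose a class.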

In [18]:
import operator
from functools import reduce

def product(iterable):
    return reduce(operator.mul, iterable, 1)
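
On Python 3.8 and later, the standard library's `math.prod` does the same job. The sketch below compares it with the `reduce`-based helper on a few integers.

```python
import math
import operator
from functools import reduce

def product(iterable):
    return reduce(operator.mul, iterable, 1)

numbers = [2, 3, 4]
print(product(numbers), math.prod(numbers))

# Both return 1 for an empty input, the multiplicative identity
print(product([]), math.prod([]))
```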

In [19]:
# Pick one test document to classify (swap in any of the commented-out alternatives)
# test = LabeledTextData(id_num=90, label=None, words="🐱".split())
test = LabeledTextData(id_num=91, label=None, words="🐶 🐶".split())
# test = LabeledTextData(id_num=92, label=None, words="🐶 🐱".split())
# test = LabeledTextData(id_num=93, label=None, words="🐈 🐈 🐶 🐶 🐩 🐱 🐱".split())
# test = LabeledTextData(id_num=94, label=None, words="🐬 ".split()) # Out of sample prediction

prob_predicted = defaultdict(float)
for label in labels:
    # For each label, multiply the prior by the conditional probability of each token that appears
    prob_predicted[label] = doc_priors[label] * product(cond_prob[label][t] for t in test.words)
    
print(*dict(prob_predicted).items(), sep='\n')

Pick the winning class
----

In [None]:
from operator import itemgetter

In [None]:
# Naive
label, prob = max(prob_predicted.items(),
                  key=itemgetter(1))
print("The predicted class is:", label)

<br>
<br> 
<br>

----

In [20]:
from operator import itemgetter

# Handle ties and fall back to document priors if winning probability is zero
label, prob = max(prob_predicted.items(),
                  key=itemgetter(1))
if prob > 0:
    print("The predicted class is: ", end="")
    print(*(k for k, p in prob_predicted.items() if p == prob))
else:
    label, prob = max(doc_priors.items(),
                      key=itemgetter(1))
    print("The predicted class is:", label)
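
The scores being compared are only proportional to the class probabilities. If actual probabilities are wanted, one option is to divide each score by their sum. This is a sketch under that assumption; the `scores` values are for a hypothetical document "🐶 🐱", built from the priors and conditional probabilities computed earlier.

```python
# Unnormalized scores for a hypothetical document "🐶 🐱"
scores = {'cat': 0.6 * (1 / 13) * (2 / 13),
          'dog': 0.4 * (9 / 18) * (1 / 18)}

total = sum(scores.values())

if total > 0:
    # Rescale so the values sum to 1
    posteriors = {label: s / total for label, s in scores.items()}
else:
    # Every score is zero (e.g. an unseen word): nothing to normalize
    posteriors = dict.fromkeys(scores, 0.0)

print(posteriors)
```

Normalizing does not change which class wins; it only makes the numbers easier to interpret.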

Ideas for Improvement
-----

1. Switch from human-readable data class to matrix
    - Matrix would allow our code to be simpler
    - Faster (vectorized operations and fewer passes through the data)
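
A rough sketch of what the matrix representation could look like, in pure Python: one row per document, one column per vocabulary word, each cell a count. The two-document corpus and all variable names here are hypothetical; in practice an array library would usually hold the matrix so that the operations can be vectorized.

```python
docs = ["🐈 🐯 🐱".split(), "🐶 🐶 🐈".split()]  # hypothetical mini-corpus

# Fix a column order for the vocabulary
columns = sorted({word for doc in docs for word in doc})
col_index = {word: j for j, word in enumerate(columns)}

# One row per document, one column per word, each cell a count
matrix = [[0] * len(columns) for _ in docs]
for i, doc in enumerate(docs):
    for word in doc:
        matrix[i][col_index[word]] += 1

print(columns)
for row in matrix:
    print(row)
```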
    

Summary
------

- Naive Bayes (NB) is a simple and powerful algorithm for text classification
- To apply NB, follow a step-by-step process to calculate each probability
- Python's Standard Library has functions to write elegant and performant code
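
The steps above can be collected into two small functions. This is a minimal sketch, not a definitive implementation: the function names `train_nb` and `predict_nb` and the `(label, words)` pair format are assumptions made for this example, and `math.prod` requires Python 3.8+.

```python
from collections import Counter, defaultdict
from math import prod  # Python 3.8+

def train_nb(docs):
    """docs: list of (label, words) pairs. Returns (priors, cond_prob)."""
    label_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    for label, words in docs:
        word_counts[label].update(words)

    n_docs = sum(label_counts.values())
    priors = {label: n / n_docs for label, n in label_counts.items()}

    cond_prob = {}
    for label, counts in word_counts.items():
        total = sum(counts.values())
        cond_prob[label] = {w: c / total for w, c in counts.items()}
    return priors, cond_prob

def predict_nb(words, priors, cond_prob):
    """Return the label with the highest unnormalized score."""
    scores = {label: priors[label] * prod(cond_prob[label].get(w, 0.0) for w in words)
              for label in priors}
    best = max(scores, key=scores.get)
    # Fall back to the largest prior if every score is zero
    return best if scores[best] > 0 else max(priors, key=priors.get)

docs = [('cat', "🐈 🐯 🐱 🐩 🐱".split()),
        ('dog', "🐶 🐶 🐈 🐶 🐩 🐈 🐶 🐶".split()),
        ('cat', "🐈 🐈 🐯 🐶 🐈".split()),
        ('cat', "🐈 🐈 🐈".split()),
        ('dog', "🐶 🐶 🐯 🐈 🐩 🐱 🐩 🐶 🐩 🐶".split())]

priors, cond_prob = train_nb(docs)
print(predict_nb("🐶 🐶".split(), priors, cond_prob))
print(predict_nb(["🐬"], priors, cond_prob))  # unseen word: falls back to priors
```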

Bonus Material
----

<center><img src="images/bayesian_evol.png" width="75%"/></center>

Further Study
------

- [Data Science Handbook on Naive Bayes](https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html)
- https://monkeylearn.com/blog/practical-explanation-naive-bayes-classifier/
- http://www.statsoft.com/textbook/naive-bayes-classifier (sorry about the blue font)
- Read [Text classification and Naive Bayes](https://web.stanford.edu/class/cs124/lec/naivebayes.pdf)   
- Read [Naive Bayes blogpost](http://sebastianraschka.com/Articles/2014_naive_bayes_1.html)
- https://towardsdatascience.com/naive-bayes-classifier-81d512f50a7c


<center><img src="https://imgs.xkcd.com/comics/modified_bayes_theorem.png" width="75%"/></center>