In [None]:
import re
import numpy as np
import matplotlib.pyplot as plt

## Bayes' Rule reviewed 

$P(Y|X) = \frac{P(Y)*P(X|Y)}{P(X)}$. 

## Fitting and predicting using Naive Bayes

Let's imagine I have the dataset below. The independent variable (zeroth column) is the type of candy: *peanut M&M*, *regular M&M* or *raisin M&M*. The dependent variable (first column) is what I did with it: *ate it* or *did not eat it*. In a numeric encoding, I would use "1" to represent *ate it* and "0" to represent *did not eat it*, and "0" to represent *peanut M&M*, "1" to represent *regular M&M* and "2" to represent *raisin M&M*.

Side note: we typically convert qualitative values to numbers for machine learning; it enables us to use all the power of numpy, at some cost in readability to humans. I'm not going to do this today, for readability to us, but on Friday we'll be back to 'numeric' representation of features.
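For reference, here is a minimal sketch of what that numeric conversion could look like. The names `candies`, `names` and `codes` are mine, not part of this notebook. `np.unique` assigns codes in sorted order, so they need not match the hand-picked numbering described above.

```python
import numpy as np

# Hypothetical example column of qualitative values
candies = np.array(['peanut M&M', 'raisin M&M', 'regular M&M', 'peanut M&M'])

# names holds the sorted distinct values; codes holds, for each entry,
# the index of its value in names
names, codes = np.unique(candies, return_inverse=True)
print(names)
print(codes)

# Decoding: index names with the codes to get the original strings back
decoded = names[codes]
print(decoded)
```

Keeping `names` around is what lets us translate predictions back into something readable.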

In [None]:
data = np.array([['peanut M&M', 'ate it'], ['peanut M&M', 'ate it'], ['peanut M&M', 'ate it'], ['raisin M&M', 'did not eat it'], ['raisin M&M', 'ate it'], ['regular M&M', 'did not eat it'], ['peanut M&M', 'did not eat it'], ['regular M&M', 'did not eat it'], ['raisin M&M', 'did not eat it'], ['raisin M&M', 'did not eat it']])
print(data)

__Fit__

* Calculate the likelihood of each type of M&M (*peanut M&M*, *regular M&M* and *raisin M&M*) given *ate it*, and the likelihood of each type given *did not eat it*. For example, $P(peanut\ M\&M|ate\ it)$ is the fraction of eaten candies that were peanut M&Ms.
* Calculate the prior for *ate it* and the prior for *did not eat it*.

Store both sets of values.

In [None]:
def fit(data):
    likelihoods = {}
    cats = sorted(np.unique(data[:, 1]))
    values = sorted(np.unique(data[:, 0]))
    for cat in cats:
        likelihoods[cat] = {}
        for value in values:
            # P(value | cat): among the rows labeled cat, the fraction that have this feature value
            likelihoods[cat][value] = len(data[(data[:, 0] == value) & (data[:, -1] == cat)]) / len(data[data[:, -1] == cat])
 
    priors = {cat : len(data[np.where(data[:, -1] == cat)]) / len(data) for cat in cats}
    return likelihoods, priors

likelihoods, priors = fit(data)
print(likelihoods, priors)

__Predict__

Given a new observation, *peanut M&M*, what is my most likely behavior?

* Compare $P(ate\ it|peanut\ M\&M)$ and $P(did\ not\ eat\ it|peanut\ M\&M)$.

$P(ate\ it|peanut\ M\&M) = \frac{P(ate\ it)*P(peanut\ M\&M|ate\ it)}{P(peanut\ M\&M)}$

$P(did\ not\ eat\ it|peanut\ M\&M) = \frac{P(did\ not\ eat\ it)*P(peanut\ M\&M|did\ not\ eat\ it)}{P(peanut\ M\&M)}$

Note that since $P(peanut\ M\&M)$ is in the denominator in both cases, we can ignore it completely for the comparison. So we only need the prior and the likelihood, both of which we calculated during __fit__.

This means that the Naive Bayes "formula" is $\arg\max_Y P(Y)*P(X|Y)$.
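To make the comparison concrete, here is the arithmetic done by hand, as a sketch using counts read off the training data above. The variable names are mine, and I use `Fraction` only to keep the numbers exact.

```python
from fractions import Fraction

# Counts from the training data: 10 rows, 4 'ate it', 6 'did not eat it';
# 3 of the 4 eaten candies and 1 of the 6 uneaten candies are peanut M&Ms.
prior_ate, prior_not = Fraction(4, 10), Fraction(6, 10)
lik_peanut_given_ate = Fraction(3, 4)
lik_peanut_given_not = Fraction(1, 6)

# Numerators of Bayes' rule: P(Y) * P(X|Y)
score_ate = prior_ate * lik_peanut_given_ate
score_not = prior_not * lik_peanut_given_not
print(score_ate, score_not)

# Dividing both by P(peanut M&M) = 4/10 rescales them but keeps their order
evidence = Fraction(4, 10)
print(score_ate / evidence, score_not / evidence)
```

The two numerators add up to $P(peanut\ M\&M)$, which is why dividing by it turns them into probabilities that sum to 1 without changing which one is larger.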

In [None]:
def predict_one(datum, likelihoods, priors):
    res = []
    for cat in likelihoods:
        try:
            likelihood = likelihoods[cat][datum]
        except KeyError:
            # Unseen feature value: use a neutral likelihood so that only the prior decides
            print("feature value " + datum + " is not in likelihoods, returning most frequent category!")
            likelihood = 1
        res.append(priors[cat]*likelihood)
    print(res)
    return list(likelihoods.keys())[np.argmax(res)]

* Which is higher?

In [None]:
predict_one('peanut M&M', likelihoods, priors)

So, my classifier will say that I ate the peanut M&M.

__What if we got a new observation, *raisin M&M*?__

### Score

Let's say I have the test data below. 
* Does this data include all my labels?
* Does this data include all possible values for my one independent variable?

In [None]:
test = np.array([['peanut M&M', 'did not eat it'], ['peanut M&M', 'ate it'], ['raisin M&M', 'did not eat it'], ['raisin M&M', 'ate it'], ['regular M&M', 'did not eat it']])
print(test)

How well does my Naive Bayes model perform on this test data? 

In [None]:
def predict(data, likelihoods, priors):
    return np.array([predict_one(datum, likelihoods, priors) for datum in data])

In [None]:
yhat = predict(test[:, 0], likelihoods, priors)
print(test[:, -1], yhat)

* What is the accuracy?
* What is the confusion matrix?

In [None]:
# These come directly from day 24

def accuracy(y, yhat):
    assert len(y) == len(yhat)
    # Fraction of predictions that match the true labels (0.0 if none match)
    return np.mean(y == yhat)
    
def confusion_matrix(y, yhat, cats):
    "Thanks to https://stackoverflow.com/questions/2148543/how-to-write-a-confusion-matrix-in-python"
    assert len(y) == len(yhat)
    result = np.zeros((len(cats), len(cats)))
    for i in range(len(y)):
        result[cats.index(y[i])][cats.index(yhat[i])] += 1
    return result

In [None]:
cats = sorted(priors)  # label names; the cats inside fit() is local, so rebuild the list here
print(accuracy(test[:, 1], yhat))
print(cats, confusion_matrix(test[:, 1], yhat, cats))

## What if I have multiple independent variables?

Let's take the same training data as before, but add one more feature: the length of each piece of candy.

Because numpy won't let me have arrays of heterogeneous types, I'm going to express length as 's', 'm' or 'l'.

In [None]:
data = np.array([
    ['peanut M&M', 'l', 'ate it'], 
    ['peanut M&M', 'l', 'ate it'], 
    ['peanut M&M', 'l', 'ate it'], 
    ['raisin M&M', 'm', 'did not eat it'], 
    ['raisin M&M', 'l', 'ate it'], 
    ['regular M&M', 's', 'did not eat it'], 
    ['peanut M&M', 'm', 'did not eat it'], 
    ['regular M&M', 's', 'did not eat it'], 
    ['raisin M&M', 's', 'did not eat it'], 
    ['raisin M&M', 'm', 'did not eat it']])
print(data)

Looking at the data, is the length feature independent of the type feature?
* How many large, medium and small peanut M&Ms are there?
* How many large, medium and small raisin M&Ms are there?
* How many large, medium and small regular M&Ms are there?
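One way to answer these questions is to count each (type, length) pair. This is a sketch; `types`, `lengths` and `pair_counts` are my own names, and the two arrays are copied from the training data above.

```python
import numpy as np
from collections import Counter

# Type and length columns copied from the training data above
types = np.array(['peanut M&M', 'peanut M&M', 'peanut M&M', 'raisin M&M', 'raisin M&M',
                  'regular M&M', 'peanut M&M', 'regular M&M', 'raisin M&M', 'raisin M&M'])
lengths = np.array(['l', 'l', 'l', 'm', 'l', 's', 'm', 's', 's', 'm'])

# Count how often each (type, length) pair occurs; missing pairs count as 0
pair_counts = Counter(zip(types.tolist(), lengths.tolist()))

for t in np.unique(types):
    print(t, {size: pair_counts[(t, size)] for size in ['l', 'm', 's']})
```

If the two features were independent, each type would show roughly the same mix of sizes.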

### Conditional independence

When we fit a Naive Bayes model across multiple independent features, we assume those features are *conditionally independent*, given $Y$. In other words, once we know the label, knowing the value of one feature tells us nothing more about the value of any other feature: $P(x_i|Y, x_j) = P(x_i|Y)$.

The Naive Bayes formula in this case, where $X = \{x_1, x_2, ...\}$, looks like:
$\hat{Y} = \arg\max_Y P(Y)*P(X|Y) = \arg\max_Y P(Y)*\prod_i{P(x_i|Y)}$.

We can do the multiplication because of this *conditional independence*. This *conditional independence* assumption is the "naive" in Naive Bayes.

Even if the independent variables aren't really conditionally independent, Naive Bayes is still a surprisingly strong baseline model for many tasks.
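As a rough check of how well the assumption holds here, the sketch below compares a joint likelihood with the product of its two single-feature likelihoods, for one label and one pair of values. `train` is my own copy of the training data above, and the choice of *raisin M&M* and *m* is arbitrary.

```python
import numpy as np
from fractions import Fraction

# Copy of the training data above (type, length, label)
train = np.array([
    ['peanut M&M', 'l', 'ate it'],
    ['peanut M&M', 'l', 'ate it'],
    ['peanut M&M', 'l', 'ate it'],
    ['raisin M&M', 'm', 'did not eat it'],
    ['raisin M&M', 'l', 'ate it'],
    ['regular M&M', 's', 'did not eat it'],
    ['peanut M&M', 'm', 'did not eat it'],
    ['regular M&M', 's', 'did not eat it'],
    ['raisin M&M', 's', 'did not eat it'],
    ['raisin M&M', 'm', 'did not eat it']])

# Restrict to one label, then compare P(x1, x2 | Y) with P(x1 | Y) * P(x2 | Y)
not_eaten = train[train[:, -1] == 'did not eat it']
n = len(not_eaten)
is_raisin = not_eaten[:, 0] == 'raisin M&M'
is_medium = not_eaten[:, 1] == 'm'

joint = Fraction(int(np.sum(is_raisin & is_medium)), n)
product = Fraction(int(np.sum(is_raisin)), n) * Fraction(int(np.sum(is_medium)), n)
print(joint, product)
```

The two numbers differ, so the assumption does not hold exactly on this sample; with only ten rows that is not surprising.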

### Missing feature values

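Ok, the fact that there are 0 medium or large regular M&Ms hints at a bigger problem: some feature values never show up with some labels in the training data. For example, I never ate a regular M&M, so the unsmoothed estimate is $P(regular\ M\&M|ate\ it) = 0$. If we ever *do* (during testing or inference) see a regular M&M, that 0 will "zero out" the whole product for *ate it*, no matter what the other features say, but that's not really safe (especially given we are looking at a small data sample). So we use __Laplace smoothing__: for a value $x_i$ of a feature that has $|V|$ distinct values, and a label $y_j \in Y$:
* instead of $P(x_i|y_j) \sim \frac{count(x_i, y_j)}{count(y_j)}$,
* we use $P(x_i|y_j) \sim \frac{count(x_i, y_j)+1}{count(y_j) + |V|}$.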

In this way, there's some (small!) likelihood for every value of each independent variable.
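Here is a small sketch of the effect, assuming the standard add-one form of Laplace smoothing, where 1 is added to each count and the denominator grows by the number of distinct values the feature can take. The helper `smoothed` is a hypothetical name of mine, and the counts come from the training data above.

```python
from fractions import Fraction

def smoothed(count_value_and_label, count_label, n_values):
    """Add-one estimate of P(value | label)."""
    return Fraction(count_value_and_label + 1, count_label + n_values)

# 4 candies were eaten, none of them a regular M&M,
# and the type feature has 3 distinct values.
unsmoothed = Fraction(0, 4)
print(unsmoothed, smoothed(0, 4, 3))

# Eaten candies by type: 3 peanut, 1 raisin, 0 regular.
# The smoothed estimates over all three types still sum to 1.
total = smoothed(3, 4, 3) + smoothed(1, 4, 3) + smoothed(0, 4, 3)
print(total)
```

Adding the number of distinct values to the denominator is what keeps the smoothed estimates a proper probability distribution.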



__Fit__

* Calculate the likelihood, given each of *ate it* and *did not eat it*, of:
  * *peanut M&M*, *regular M&M* and *raisin M&M*
  * *s*, *m* and *l*
* Calculate the prior for *ate it* and the prior for *did not eat it*.

Store both sets of values.

In [None]:
def fit(data):
    likelihoods = {}
    cats = sorted(np.unique(data[:, -1]))
    for cat in cats:
        likelihoods[cat] = {}
        for feature in range(data.shape[1]-1):
            values = sorted(np.unique(data[:, feature]))
            for value in values:
                # Laplace-smoothed P(value | cat): add 1 to the count, and the number of distinct values to the denominator
                likelihoods[cat][value] = (len(data[(data[:, feature] == value) & (data[:, -1] == cat)]) + 1) / (len(data[data[:, -1] == cat]) + len(values))
 
    priors = {cat : len(data[np.where(data[:, -1] == cat)]) / len(data) for cat in cats}
    return likelihoods, priors

likelihoods, priors = fit(data)
print(likelihoods, priors)

### Predict

Calculate $\arg\max_Y P(Y)*\prod_i{P(x_i|Y)}$

In [None]:
def predict_one(datum, likelihoods, priors):
    res = []
    for cat in likelihoods:
        product = 1
        for value in datum:
            try:
                likelihood = likelihoods[cat][value]
            except KeyError:
                # Unseen feature value: a neutral likelihood of 1 leaves the product unchanged
                print("feature value " + value + " is not in likelihoods, ignoring this feature!")
                likelihood = 1
            product = product*likelihood
        res.append(priors[cat]*product)
    return list(likelihoods.keys())[np.argmax(res)]

Did I eat the large peanut M&M?

In [None]:
predict_one(np.array(['peanut M&M', 'l']), likelihoods, priors)

Did I eat the large raisin M&M?

In [None]:
predict_one(np.array(['raisin M&M', 'l']), likelihoods, priors)

### Score

Am I doing better with one more feature?

In [None]:
test = np.array([['peanut M&M', 's', 'did not eat it'], ['peanut M&M', 'l', 'ate it'], ['raisin M&M', 's', 'did not eat it'], ['raisin M&M', 'l', 'ate it'], ['regular M&M', 'm', 'did not eat it']])
print(test)

In [None]:
yhat = predict(test[:, :-1], likelihoods, priors)
print(test[:, -1], yhat)

In [None]:
cats = sorted(priors)  # label names; the cats inside fit() is local, so rebuild the list here
print(accuracy(test[:, -1], yhat))
print(cats, confusion_matrix(test[:, -1], yhat, cats))