In [29]:
import re
import numpy as np
import matplotlib.pyplot as plt

## Bayes' Rule reviewed 

$P(Y|X) = \frac{P(Y) \cdot P(X|Y)}{P(X)}$, where:
* $P(Y|X)$ is the *posterior*
* $P(Y)$ is the *prior*
* $P(X|Y)$ is the *likelihood*
* $P(X)$ is the *evidence* (or *normalization*)

## Fitting and predicting using Naive Bayes

Let's imagine I have an infinite bag of M&Ms. Some of them are peanut, some raisin and some regular. I only like peanut M&Ms. The thing is, I'm in a dark room so I can only guess the type of the M&M by feel. I repeatedly reach into the bag, grab one M&M and then either eat it or do not eat it. This gives the dataset below:

| Candy | Outcome |
| ----- | ---- |
| peanut M&M | ate it |
| peanut M&M | ate it |
| peanut M&M | ate it |
| raisin M&M | did not eat it |
| raisin M&M | ate it |
| regular M&M | did not eat it |
| peanut M&M | did not eat it |
| regular M&M | did not eat it |
| raisin M&M | did not eat it |
| raisin M&M | did not eat it |

In code:
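A minimal sketch of how the table above might be written as a numpy array, mirroring the two-feature array used later in this notebook. The variable name `data_one_feature` is illustrative only.

```python
import numpy as np

# Each row is [candy type, outcome], copied from the table above.
data_one_feature = np.array([
    ['peanut M&M', 'ate it'],
    ['peanut M&M', 'ate it'],
    ['peanut M&M', 'ate it'],
    ['raisin M&M', 'did not eat it'],
    ['raisin M&M', 'ate it'],
    ['regular M&M', 'did not eat it'],
    ['peanut M&M', 'did not eat it'],
    ['regular M&M', 'did not eat it'],
    ['raisin M&M', 'did not eat it'],
    ['raisin M&M', 'did not eat it']])
print(data_one_feature.shape)  # (10, 2)
```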

Side note: we typically convert qualitative values to numbers for machine learning; this lets us use all the power of numpy, at some cost in human readability. I'm not going to do that today, to keep the examples readable, but on Friday we'll be back to a numeric representation of features.

We want to fit a model that will tell us the probability that we ate it (or did not eat it) given the type of candy it is. We can do that by calculating:
* The prior that we ate it (or did not eat it)
* The likelihood of the type of candy given that we ate it (or did not eat it) 
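The two quantities above can be sketched directly from the counts in the single-feature table. This is a rough illustration without smoothing; the names `rows`, `prior`, `likelihood`, `evidence` and `posterior` are assumptions made for this example and are not the notebook's `fit` function.

```python
from collections import Counter
from fractions import Fraction

# (candy, outcome) pairs with the same counts as the table above
rows = (
    [("peanut M&M", "ate it")] * 3
    + [("raisin M&M", "ate it")] * 1
    + [("peanut M&M", "did not eat it")] * 1
    + [("raisin M&M", "did not eat it")] * 3
    + [("regular M&M", "did not eat it")] * 2
)

outcome_counts = Counter(outcome for _, outcome in rows)
candy_counts = Counter(candy for candy, _ in rows)
joint_counts = Counter(rows)

# Prior P(outcome), likelihood P(candy | outcome), evidence P(candy)
prior = {o: Fraction(n, len(rows)) for o, n in outcome_counts.items()}
likelihood = {
    (c, o): Fraction(joint_counts[(c, o)], outcome_counts[o])
    for c in candy_counts for o in outcome_counts
}
evidence = {c: Fraction(n, len(rows)) for c, n in candy_counts.items()}

def posterior(outcome, candy):
    """Bayes' Rule: P(outcome | candy) = prior * likelihood / evidence."""
    return prior[outcome] * likelihood[(candy, outcome)] / evidence[candy]

print(prior["ate it"])                        # 2/5
print(likelihood[("peanut M&M", "ate it")])   # 3/4
print(posterior("ate it", "peanut M&M"))      # 3/4
```

The posterior of 3/4 agrees with reading the table directly: three of the four peanut M&Ms were eaten.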


## What if I have multiple independent variables?

It's fairly obvious that I *must* be making my eat/don't eat decisions based on more than just the candy type. Just look at the test data! 

Let's take the same training data as before, but add one more feature: the length of each piece of candy. 

Because a numpy array holds a single data type, I'm going to express length as 's', 'm' or 'l' rather than as a number.

| Candy | Length | Outcome |
| ----- | ---- | --- |
| peanut M&M | l | ate it |
| peanut M&M | l | ate it |
| peanut M&M | l | ate it |
| raisin M&M | m | did not eat it |
| raisin M&M | l | ate it |
| regular M&M | s | did not eat it |
| peanut M&M | m | did not eat it |
| regular M&M | s | did not eat it |
| raisin M&M | s | did not eat it |
| raisin M&M | s | did not eat it |

In code:

In [30]:
data = np.array([
    ['peanut M&M', 'l', 'ate it'], 
    ['peanut M&M', 'l', 'ate it'], 
    ['peanut M&M', 'l', 'ate it'], 
    ['raisin M&M', 'm', 'did not eat it'], 
    ['raisin M&M', 'l', 'ate it'], 
    ['regular M&M', 's', 'did not eat it'], 
    ['peanut M&M', 'm', 'did not eat it'], 
    ['regular M&M', 's', 'did not eat it'], 
    ['raisin M&M', 's', 'did not eat it'], 
    ['raisin M&M', 's', 'did not eat it']])
print(data)

[['peanut M&M' 'l' 'ate it']
 ['peanut M&M' 'l' 'ate it']
 ['peanut M&M' 'l' 'ate it']
 ['raisin M&M' 'm' 'did not eat it']
 ['raisin M&M' 'l' 'ate it']
 ['regular M&M' 's' 'did not eat it']
 ['peanut M&M' 'm' 'did not eat it']
 ['regular M&M' 's' 'did not eat it']
 ['raisin M&M' 's' 'did not eat it']
 ['raisin M&M' 's' 'did not eat it']]


Looking at the data, is the length feature independent of the type feature?
* How many large, medium and small peanut M&Ms are there?
* How many large, medium and small raisin M&Ms are there?
* How many large, medium and small regular M&Ms are there?
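One way to answer these questions is to cross-tabulate type against length. The sketch below copies the (candy, length) pairs from the table above; `pairs` and `table` are names chosen for this example.

```python
from collections import Counter

# (candy, length) pairs copied from the table above
pairs = [
    ("peanut M&M", "l"), ("peanut M&M", "l"), ("peanut M&M", "l"),
    ("raisin M&M", "m"), ("raisin M&M", "l"), ("regular M&M", "s"),
    ("peanut M&M", "m"), ("regular M&M", "s"), ("raisin M&M", "s"),
    ("raisin M&M", "s"),
]
counts = Counter(pairs)

# One row per candy type, one column per length
table = {
    candy: {size: counts[(candy, size)] for size in ("l", "m", "s")}
    for candy in sorted({candy for candy, _ in pairs})
}
for candy, row in table.items():
    print(candy, row)
```

If length were independent of type, each row would show roughly the same mix of sizes.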

### Conditional independence

When we fit a Naive Bayes model across multiple features, we assume those features are *conditionally independent*, given $Y$. In other words, once we know the label, the value of one feature tells us nothing more about the value of any other feature.

The Naive Bayes formula in this case, where $X = \{x_1, x_2, \ldots\}$, looks like:
$\hat{Y} = \arg\max_Y P(Y) \cdot P(X|Y) = \arg\max_Y P(Y) \cdot \prod_i P(x_i|Y)$.

We can replace $P(X|Y)$ with the product of the $P(x_i|Y)$ because of this *conditional independence*. The evidence $P(X)$ is dropped because it is the same for every $Y$, so it does not change the argmax. This *conditional independence* assumption is the "naive" in Naive Bayes.
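
For the two features used in this notebook (candy type and length), the product can be written out explicitly. This is only the general formula specialised to two terms:

$$\hat{Y} = \arg\max_Y \; P(Y) \cdot P(\text{candy}|Y) \cdot P(\text{length}|Y)$$

For instance, the score for *ate it* on a large raisin M&M is $P(\text{ate it}) \cdot P(\text{raisin M\&M}|\text{ate it}) \cdot P(\text{l}|\text{ate it})$.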

Even if the independent variables aren't really conditionally independent, Naive Bayes is still a surprisingly strong baseline model for many tasks.

### Missing feature values

Let's make a table of counts, like before:

| Outcome | Peanut M&M | Regular M&M | Raisin M&M | l | m | s |
| -------- | ------- | ---------- | ------------ | -- | -- | -- |
| Ate it         | 3 | 0 | 1 | 4 | 0 | 0 |
| Did not eat it | 1 | 2 | 3 | 0 | 2 | 4 |
| Total |  4 | 2 | 4 | 4 | 2 | 4 |

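Ok, the fact that there are 0s in this table is a problem. It means that if we ever *do* see (during testing or inference), for example, a small peanut M&M, $P(\text{small} | \text{ate it})$ will "zero out" the whole product even though the peanutness should count towards edibility. So we use __Laplace smoothing__: for each feature value $x_i$ and outcome $Y$:
* instead of $P(x_i|Y) \approx \frac{|x_i \& Y|}{|Y|}$, 
* we use $P(x_i|Y) \approx \frac{|x_i \& Y|+1}{|Y| + V}$, where $|x_i \& Y|$ is the number of training rows with value $x_i$ and outcome $Y$, $|Y|$ is the number of training rows with outcome $Y$, and $V$ is the number of distinct values the feature can take (3 for candy type and 3 for length).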

In this way, there's some (small!) likelihood for every value of each independent variable.
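
A small sketch of the smoothing step. It assumes the smoothed estimate is (rows with this value and outcome + 1) / (rows with this outcome + number of distinct values of the feature). The function name `smoothed_likelihood` and its parameters are hypothetical, chosen for this example.

```python
from fractions import Fraction

def smoothed_likelihood(n_both, n_outcome, n_values):
    """Laplace-smoothed estimate of P(value | outcome).

    n_both: training rows with this value and this outcome
    n_outcome: training rows with this outcome
    n_values: number of distinct values the feature can take
    """
    return Fraction(n_both + 1, n_outcome + n_values)

# Length counts among the 4 candies that were eaten: l=4, m=0, s=0
eaten_lengths = {"l": 4, "m": 0, "s": 0}
smoothed = {size: smoothed_likelihood(n, 4, 3) for size, n in eaten_lengths.items()}
for size, p in smoothed.items():
    print(size, p)
print(sum(smoothed.values()))  # 1
```

Adding the number of distinct values to the denominator keeps the smoothed estimates for one feature summing to 1 within each outcome.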



### Fit

* Calculate the prior for *ate it* and the prior for *did not eat it*: $P(\text{ate it}) = 4 / 10$; $P(\text{did not eat it}) = 6 / 10$

* Calculate the smoothed likelihood, given each of *ate it* and *did not eat it*, of:
  * *peanut M&M*, *regular M&M* and *raisin M&M*. For example, the likelihood of *peanut M&M* given *ate it* = (3 (peanut M&Ms that were eaten) + 1 (smoothing)) / (4 (candies that were eaten) + 3 (possible candy types)) = 4/7
  * *s*, *m* and *l*
  
| Y | Peanut M&M | Regular M&M | Raisin M&M | l | m | s |
| -------- | ------- | ---------- | ------------ | -- | -- | -- |
| Ate it         | 4/7 | 1/7 | 2/7 | 5/7 | 1/7 | 1/7 |
| Did not eat it | 2/9 | 3/9 | 4/9 | 1/9 | 3/9 | 5/9 |

Store both sets of values.

In code:

In [31]:
def fit(data):
    """Estimate smoothed likelihoods P(value | category) and priors P(category)."""
    likelihoods = {}
    labels = data[:, -1]
    cats = sorted(np.unique(labels))
    for cat in cats:
        # note: keyed by value only, so values must be distinct across features
        likelihoods[cat] = {}
        n_cat = int(np.sum(labels == cat))  # rows with this outcome
        for feature in range(data.shape[1] - 1):
            column = data[:, feature]
            values = sorted(np.unique(column))
            for value in values:
                # rows with this feature value and this outcome
                n_both = int(np.sum((column == value) & (labels == cat)))
                print(value, n_both, n_cat, len(values))
                # Laplace smoothing: add 1 to the count, and the number of
                # distinct values of this feature to the denominator
                likelihoods[cat][value] = (n_both + 1) / (n_cat + len(values))

    priors = {cat: int(np.sum(labels == cat)) / len(data) for cat in cats}
    return likelihoods, priors

likelihoods, priors = fit(data)
print(likelihoods, priors)

peanut M&M 3 4 3
raisin M&M 1 4 3
regular M&M 0 4 3
l 4 4 3
m 0 4 3
s 0 4 3
peanut M&M 1 6 3
raisin M&M 3 6 3
regular M&M 2 6 3
l 0 6 3
m 2 6 3
s 4 6 3
{'ate it': {'peanut M&M': 0.5714285714285714, 'raisin M&M': 0.2857142857142857, 'regular M&M': 0.14285714285714285, 'l': 0.7142857142857143, 'm': 0.14285714285714285, 's': 0.14285714285714285}, 'did not eat it': {'peanut M&M': 0.2222222222222222, 'raisin M&M': 0.4444444444444444, 'regular M&M': 0.3333333333333333, 'l': 0.1111111111111111, 'm': 0.3333333333333333, 's': 0.5555555555555556}} {'ate it': 0.4, 'did not eat it': 0.6}


### Predict

Given a new observation, *large raisin M&M*, what is my most likely behavior?

* Compare $P(\text{ate it}|\text{large}, \text{raisin M\&M})$ and $P(\text{did not eat it}|\text{large}, \text{raisin M\&M})$. 

Calculate $\arg\max_Y P(Y) \cdot \prod_i P(x_i|Y)$:
* ate it: 
* did not eat it:
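
As a worked check of this calculation, the sketch below multiplies the prior by the smoothed likelihoods for a large raisin M&M. It assumes add-one smoothing with 3 possible values per feature, as in the peanut M&M example in the Fit section. The dictionaries are hard-coded from the counts table for this illustration.

```python
from fractions import Fraction

priors = {"ate it": Fraction(4, 10), "did not eat it": Fraction(6, 10)}

# Smoothed P(value | outcome) = (count + 1) / (rows with that outcome + 3)
likelihoods = {
    "ate it": {
        "raisin M&M": Fraction(1 + 1, 4 + 3),
        "l": Fraction(4 + 1, 4 + 3),
    },
    "did not eat it": {
        "raisin M&M": Fraction(3 + 1, 6 + 3),
        "l": Fraction(0 + 1, 6 + 3),
    },
}

datum = ["raisin M&M", "l"]
scores = {}
for outcome in priors:
    score = priors[outcome]
    for value in datum:
        score *= likelihoods[outcome][value]
    scores[outcome] = score

for outcome, score in scores.items():
    print(outcome, score, round(float(score), 4))
print(max(scores, key=scores.get))  # ate it
```

The two scores are 40/490 and 24/810 before reduction, which are the values in the *raisin M&M*, *l* row of the table in the Score section.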

In code:

In [32]:
def predict_one(datum, likelihoods, priors):
    """Return the category with the largest prior * product of likelihoods."""
    cats = list(likelihoods.keys())
    res = []
    for cat in cats:
        product = 1
        for value in datum:
            try:
                likelihood = likelihoods[cat][value]
            except KeyError:
                # unseen feature value: use a neutral factor of 1 so it is ignored
                print("feature value " + value + " is not in likelihoods, ignoring it!")
                likelihood = 1
            product = product * likelihood
        res.append(priors[cat] * product)
    return cats[np.argmax(res)]

def predict(data, likelihoods, priors):
    return np.array([predict_one(datum, likelihoods, priors) for datum in data])

### Score

Am I doing better with one more feature? Let's see! Here's my augmented test data:

| Candy | Length | Outcome |
| ----- | ---- | --- |
| peanut M&M | s | did not eat it |
| peanut M&M | l | ate it |
| raisin M&M | s | did not eat it |
| raisin M&M | l | ate it |
| regular M&M | m | did not eat it |

| Y | x1 | x2 | P(ate it)\*P(x1\|ate it)\*P(x2\|ate it) | P(did not eat it)\*P(x1\|did not eat it)\*P(x2\|did not eat it) | Yhat |
| --- | --- | --- | --- | --- | --- |
| did not eat it | peanut M&M | s | 16/490 | 60 / 810  | did not eat it  |
| ate it | peanut M&M | l | 80/490 | 12/810 | ate it |
| did not eat it | raisin M&M | s | 8/490 | 120/810 | did not eat it | 
| ate it | raisin M&M | l | 40/490 | 24/810 | ate it | 
| did not eat it | regular M&M | m | 4/490 | 54/810  | did not eat it |

Based on this test data:
* What is the accuracy?
* What is the confusion matrix?

|  | ate it | did not eat it |
| -- | --- | ---- |
| ate it | | |
| did not eat it | | |

In code:

In [33]:
test = np.array([['peanut M&M', 's', 'did not eat it'], ['peanut M&M', 'l', 'ate it'], ['raisin M&M', 's', 'did not eat it'], ['raisin M&M', 'l', 'ate it'], ['regular M&M', 'm', 'did not eat it']])
print(test)

[['peanut M&M' 's' 'did not eat it']
 ['peanut M&M' 'l' 'ate it']
 ['raisin M&M' 's' 'did not eat it']
 ['raisin M&M' 'l' 'ate it']
 ['regular M&M' 'm' 'did not eat it']]


In [34]:
yhat = predict(test[:, :-1], likelihoods, priors)
print(test[:, -1], yhat)

['did not eat it' 'ate it' 'did not eat it' 'ate it' 'did not eat it'] ['did not eat it' 'ate it' 'did not eat it' 'ate it' 'did not eat it']


In [35]:
# These come directly from day 24

def accuracy(y, yhat):
    assert len(y) == len(yhat)
    # fraction of predictions that match the true labels
    return np.mean(y == yhat)
    
def confusion_matrix(y, yhat, cats):
    "Thanks to https://stackoverflow.com/questions/2148543/how-to-write-a-confusion-matrix-in-python"
    assert len(y) == len(yhat)
    # rows are the actual category, columns are the predicted category
    result = np.zeros((len(cats), len(cats)))
    for i in range(len(y)):
        result[cats.index(y[i])][cats.index(yhat[i])] += 1
    return result

In [36]:
# category order for the rows and columns of the confusion matrix
cats = [str(cat) for cat in priors]
print(accuracy(test[:, -1], yhat))
print(cats, confusion_matrix(test[:, -1], yhat, cats))

1.0
['ate it', 'did not eat it'] [[2. 0.]
 [0. 3.]]
