# The One Goal for Today

To implement a Naive Bayes model using Python.

# Bayes' Rule reviewed 

$P(Y|X) = \frac{P(Y)*P(X|Y)}{P(X)}$. 
* $P(Y|X)$ is the *posterior*
* $P(Y)$ is the *prior*
* $P(X|Y)$ is the *likelihood*
* $P(X)$ is the *evidence* (or *normalization*)
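
As a quick numeric check of the rule, here is a minimal sketch. The probabilities are made-up illustrative values (they are not taken from the M&M data below), and `posterior`, `p_y` and `p_x_given_y` are hypothetical names. It also shows that the evidence $P(X)$ can be obtained by summing $P(Y)*P(X|Y)$ over every value of $Y$.

```python
from fractions import Fraction

def posterior(prior, likelihood, evidence):
    """Bayes' rule: P(Y|X) = P(Y) * P(X|Y) / P(X)."""
    return prior * likelihood / evidence

# Hypothetical two-class problem (illustrative numbers only)
p_y = {"a": Fraction(1, 4), "b": Fraction(3, 4)}          # priors P(Y)
p_x_given_y = {"a": Fraction(4, 5), "b": Fraction(1, 5)}  # likelihoods P(X|Y) for one observed X

# Evidence via the law of total probability: P(X) = sum over Y of P(Y) * P(X|Y)
evidence = sum(p_y[y] * p_x_given_y[y] for y in p_y)

posteriors = {y: posterior(p_y[y], p_x_given_y[y], evidence) for y in p_y}
print(evidence, posteriors)
```

Because the evidence is the sum of the numerators, the posteriors always add up to 1.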

# Fitting and predicting using Naive Bayes

Let's imagine I have an *infinite* bag of M&Ms. Some of them are peanut, some raisin and some chocolate. I only like peanut M&Ms. The thing is, I'm in a dark room, so I can only guess the type of an M&M by feel. I repeatedly reach into the bag, grab one M&M and then either eat it or do not eat it. Unlike Wednesday, this time we record the length of the M&M as well as the type. Because a numpy array can't hold heterogeneous types, I'm going to express length as 's', 'm' or 'l' (small, medium or large). This gives the dataset below:



| Type | Length | Outcome |
| ----- | ---- | --- |
| peanut M&M (N) | l | ate it (E) |
| peanut M&M (N) | l | ate it (E) |
| peanut M&M (N) | l | ate it (E) |
| raisin M&M (R) | m | did not eat it (D) |
| raisin M&M (R) | l | ate it (E) |
| chocolate M&M (C) | s | did not eat it (D) |
| peanut M&M (N) | m | did not eat it (D) |
| chocolate M&M (C) | s | did not eat it (D) |
| raisin M&M (R) | s | did not eat it (D) |
| raisin M&M (R) | s | did not eat it (D) |

In code:

In [None]:
import re
import numpy as np
import matplotlib.pyplot as plt

In [None]:
data = np.array([
    ['N', 'l', 'E'], 
    ['N', 'l', 'E'], 
    ['N', 'l', 'E'], 
    ['R', 'm', 'D'], 
    ['R', 'l', 'E'], 
    ['C', 's', 'D'], 
    ['N', 'm', 'D'], 
    ['C', 's', 'D'], 
    ['R', 's', 'D'], 
    ['R', 's', 'D']])
print(data.dtype)

Side note: we typically convert qualitative values to numbers for machine learning; it enables us to use all the power of numpy, at some cost in readability to humans. 

In [None]:
# Convert the features to numbers

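def convert(data):
    mappings = []
    # write into a new integer array so the string input is neither modified nor truncated
    converted = np.zeros(data.shape, dtype=int)
    for col in range(data.shape[1]):
        values = sorted(np.unique(data[:, col]))
        mappings.append(values)
        for row in range(data.shape[0]):
            converted[row, col] = values.index(data[row, col])
    return converted, mappings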

def convert_one(datum, mappings):
    return [mappings[i].index(x) for i, x in enumerate(datum)]

data, mappings = convert(data)
print(data, "\n", mappings)
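
NumPy can also do this encoding directly. The sketch below is an alternative, not the approach used in the rest of this notebook: `np.unique(..., return_inverse=True)` returns the sorted unique values together with the integer code of each entry. `encode`, `raw` and `mappings_sketch` are hypothetical names, and `raw` repeats just three rows of the dataset above.

```python
import numpy as np

def encode(raw):
    """Encode each column of a 2-D string array as integers 0..k-1."""
    encoded = np.zeros(raw.shape, dtype=int)
    col_mappings = []
    for col in range(raw.shape[1]):
        # values: sorted unique labels; codes: position in values of each row's label
        values, codes = np.unique(raw[:, col], return_inverse=True)
        encoded[:, col] = codes.reshape(-1)
        col_mappings.append(values.tolist())
    return encoded, col_mappings

raw = np.array([['N', 'l', 'E'], ['R', 'm', 'D'], ['C', 's', 'D']])
encoded, mappings_sketch = encode(raw)
print(encoded)
print(mappings_sketch)
```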

We want to fit a model that will tell us the probability that I ate it (or did not eat it) given the type of candy it is *and* its length. 

## Conditional independence

When we fit a Naive Bayes model across multiple features (independent variables), we assume those features are *conditionally independent* given $Y$. In other words, once we know the label, the value of one feature tells us nothing more about the values of the other features.

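The Naive Bayes prediction in this case, where $X = \{x_1, x_2, \ldots\}$, looks like:
$\hat{Y} = argmax_Y P(Y)*P(X|Y) = argmax_Y P(Y)*\prod_i{P(x_i|Y)}$.

The evidence $P(X)$ is dropped because it is the same for every $Y$, so it does not change which $Y$ wins.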

We can do the product of $P(x_i|Y)$ because of this *conditional independence*. This *conditional independence* assumption is the "naive" in Naive Bayes.

Looking at the data, is the length feature *really* independent of the type feature?
* How many large, medium and small peanut M&Ms are there?
* How many large, medium and small raisin M&Ms are there?
* How many large, medium and small chocolate M&Ms are there?

Even if the independent variables aren't really conditionally independent, Naive Bayes is still a surprisingly strong baseline model for many tasks.
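
One way to answer the counting questions above is to tally the lengths seen for each type. This is a minimal sketch using only the standard library; `rows` re-enters the training table and `length_counts` is a hypothetical name.

```python
from collections import Counter

# (type, length, outcome) rows copied from the training table
rows = [
    ('N', 'l', 'E'), ('N', 'l', 'E'), ('N', 'l', 'E'), ('R', 'm', 'D'),
    ('R', 'l', 'E'), ('C', 's', 'D'), ('N', 'm', 'D'), ('C', 's', 'D'),
    ('R', 's', 'D'), ('R', 's', 'D'),
]

# Count how often each (type, length) pair occurs
length_counts = Counter((kind, length) for kind, length, _ in rows)

for kind in ['N', 'R', 'C']:
    print(kind, {length: length_counts[(kind, length)] for length in ['l', 'm', 's']})
```

If length told us nothing about type, each row would show roughly the same proportions. Here the peanut M&Ms are mostly large and the chocolate ones are all small. Strictly, the Naive Bayes assumption is about independence within each outcome, but the same counting approach applies there too.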

## Missing feature values

Let's make a table of counts, like before:

| Y | N | C | R | l | m | s |
| -------- | ------- | ---------- | ------------ | -- | -- | -- |
| E         | 3 | 0 | 1 | 4 | 0 | 0 |
| D         | 1 | 2 | 3 | 0 | 2 | 4 |
| Total     | 4 | 2 | 4 | 4 | 2 | 4 |
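
The E and D rows of this table can be reproduced with a few lines of Python. This is a sketch with hypothetical names (`rows`, `type_counts`, `length_counts_by_outcome`), independent of the numpy code used elsewhere in this notebook.

```python
from collections import Counter

# (type, length, outcome) rows copied from the training table
rows = [
    ('N', 'l', 'E'), ('N', 'l', 'E'), ('N', 'l', 'E'), ('R', 'm', 'D'),
    ('R', 'l', 'E'), ('C', 's', 'D'), ('N', 'm', 'D'), ('C', 's', 'D'),
    ('R', 's', 'D'), ('R', 's', 'D'),
]

# Count each feature value together with the outcome it was seen with
type_counts = Counter((outcome, kind) for kind, _, outcome in rows)
length_counts_by_outcome = Counter((outcome, length) for _, length, outcome in rows)

for outcome in ['E', 'D']:
    by_type = [type_counts[(outcome, k)] for k in ['N', 'C', 'R']]
    by_length = [length_counts_by_outcome[(outcome, v)] for v in ['l', 'm', 's']]
    print(outcome, by_type, by_length)
```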

OK, the fact that there are 0s in this table is a problem. It means that if we ever *do* see (during testing or inference), for example, a small peanut M&M, $P(s|E) = 0$ will "zero out" the whole product even though the peanutness should lean towards edibility. So we use __Laplace smoothing__: for each $x_i \in X$:
* instead of $P(x_i|Y) \sim \frac{|x_i \& Y|}{|Y|}$, 
* we use $P(x_i|Y) \sim \frac{|x_i \& Y|+1}{|Y| + k_i}$, where $k_i$ is the number of unique values of feature $i$. 

In this way, there's some (small!) likelihood for every value of each independent variable.
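
To make the effect concrete, here is a small sketch that estimates $P(s|E)$ from the counts in the table above, with and without smoothing. It adds 1 to the count and adds the number of unique values of the feature (3 lengths) to the denominator. The helper name `smoothed_likelihood` and its parameters are hypothetical.

```python
from fractions import Fraction

def smoothed_likelihood(count_value_and_y, count_y, n_values, smooth=True):
    """Estimate P(x_i | Y) from counts, optionally with add-one (Laplace) smoothing."""
    if smooth:
        return Fraction(count_value_and_y + 1, count_y + n_values)
    return Fraction(count_value_and_y, count_y)

# Small M&Ms that were eaten: 0 of the 4 eaten ones; length has 3 unique values
print(smoothed_likelihood(0, 4, 3, smooth=False))
print(smoothed_likelihood(0, 4, 3))

# The smoothed estimates for l, m and s given E still sum to 1
total = sum(smoothed_likelihood(c, 4, 3) for c in [4, 0, 0])
print(total)
```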

## Fit

* Calculate the prior for E and the prior for D: $P(E) = 4/10$; $P(D) = 6/10$

* Calculate the likelihood of each feature value (each type of M&M and each length) given E and given D. For example, the likelihood of N given E = (3 (peanut M&Ms that were eaten) + 1 (smoothing)) / (4 (ate it) + 3 (unique feature values)). The likelihood of R given D = (3 (raisin M&Ms that were not eaten) + 1 (smoothing)) / (6 (did not eat it) + 3 (unique feature values)).
 
| Y | N | C | R | l | m | s |
| -------- | ------- | ---------- | ------------ | -- | -- | -- |
| E         | 4/7 | 1/7 | 2/7 | 5/7 | 1/7 | 1/7 |
| D         | 2/9 | 3/9 | 4/9 | 1/9 | 3/9 | 5/9 |

Store both sets of values.

In code:

In [None]:
def fit(data):
    # we will use this to store the likelihoods
    likelihoods = []
    # cats is all the unique values in the last column of the data, i.e. all the unique values of the dependent variable
    cats = sorted(np.unique(data[:, -1]))
    for cat in cats:
        # is_cat marks the rows whose label is cat
        is_cat = data[:, -1] == cat
        # likelihoods_cat is the likelihoods for this value of the dependent variable
        likelihoods_cat = []
        # for each feature 
        for feature in range(data.shape[1]-1):
            # values is the unique values of this feature
            values = sorted(np.unique(data[:, feature]))
            for value in values:
                # add P(value|cat) to likelihoods_cat; don't forget to use Laplace smoothing!
                count = np.count_nonzero((data[:, feature] == value) & is_cat)
                likelihoods_cat.append((count + 1) / (np.count_nonzero(is_cat) + len(values)))
        likelihoods.append(likelihoods_cat)
    # calculate P(cat) for each cat
    priors = [np.count_nonzero(data[:, -1] == cat) / len(data) for cat in cats]
    # a naive Bayes model is a set of likelihoods and a set of priors; we don't need to worry about the evidence because it's the same for every cat
    return likelihoods, priors

likelihoods, priors = fit(data)
print(likelihoods, priors)

## Predict

Given a new observation, *large raisin M&M*, what is my most likely behavior?

* Compare $P(E|l, R)$ and $P(D|l, R)$. 

Calculate $argmax_Y P(Y)*\prod_i{P(x_i|Y)}$:
* Y=E: $4/10*5/7*2/7$
* Y=D: $6/10*4/9*1/9$
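
The same arithmetic can be sketched with exact fractions. The values come from the priors and smoothed likelihoods above; the variable names are hypothetical. Dividing each score by the sum of the scores (the evidence under this model) turns them into posteriors that add up to 1.

```python
from fractions import Fraction

# Priors and smoothed likelihoods taken from the Fit section
prior = {'E': Fraction(4, 10), 'D': Fraction(6, 10)}
p_large = {'E': Fraction(5, 7), 'D': Fraction(1, 9)}   # P(l | Y)
p_raisin = {'E': Fraction(2, 7), 'D': Fraction(4, 9)}  # P(R | Y)

# Unnormalized score P(Y) * P(l|Y) * P(R|Y) for each outcome
scores = {y: prior[y] * p_large[y] * p_raisin[y] for y in prior}
best = max(scores, key=scores.get)

# Normalize by the sum of the scores to get posteriors
total = sum(scores.values())
posteriors_lr = {y: scores[y] / total for y in scores}

print({y: round(float(s), 3) for y, s in scores.items()})
print(best, round(float(posteriors_lr[best]), 3))
```

The score for E (about .082) is larger than the score for D (about .030), so the most likely behavior is that I eat it.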

In code:

In [None]:
def predict_one(datum, likelihoods, priors, mappings):
    # we will use this to store the (unnormalized) probability of each outcome
    res = []
    datum = convert_one(datum, mappings)
    print(datum)
    # for each possible outcome cat, likelihoods_cat holds P(value|cat) for every feature value
    for j, likelihoods_cat in enumerate(likelihoods):
        product = 1
        # offset is where the current feature's values start in likelihoods_cat
        offset = 0
        # for each feature value in the datum
        for i, value in enumerate(datum):
            try:
                # get P(value|cat)
                likelihood = likelihoods_cat[offset + value]
            except IndexError:
                print("feature value " + str(value) + " is not in likelihoods, ignoring this feature!")
                likelihood = 1
            print(likelihoods_cat, value, likelihood)
            # all the independent variables are assumed to be conditionally independent of each other
            product = product*likelihood
            # move past this feature's block of values (features may have different numbers of values)
            offset += len(mappings[i])
        # multiply the likelihood by the prior for cat
        res.append(priors[j]*product)
    print(res)
    # return the most likely outcome
    return mappings[-1][np.argmax(res)]

def predict(data, likelihoods, priors, mappings):
    return np.array([predict_one(datum, likelihoods, priors, mappings) for datum in data])

In [None]:
predict_one(['R', 'l'], likelihoods, priors, mappings)
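
Each row of `likelihoods` is a flat list, so looking up $P(x_i|Y)$ needs the starting position of each feature's block of values. Below is a minimal sketch of that bookkeeping. `feature_offsets` is a hypothetical helper, and `example_mappings` is a made-up case in which the two features have different numbers of values.

```python
from itertools import accumulate

def feature_offsets(feature_mappings):
    """Start index of each feature's block in a flat likelihood list."""
    sizes = [len(values) for values in feature_mappings]
    # Running total of the sizes of all earlier features: 0, s0, s0+s1, ...
    return [0] + list(accumulate(sizes))[:-1]

# Hypothetical mappings: 3 types but only 2 lengths
example_mappings = [['C', 'N', 'R'], ['l', 's']]
offsets = feature_offsets(example_mappings)
print(offsets)

# Flat index of feature 1 (length), value 's'
flat_index = offsets[1] + example_mappings[1].index('s')
print(flat_index)
```

When every feature has the same number of values, as in the M&M data (3 types, 3 lengths), the offset of feature `i` is simply `i` times that number.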

## Score

Am I doing better with one more feature? Let's see! Here's my augmented test data:

| Type | Length | Outcome |
| ----- | ---- | --- |
| peanut M&M (N) | s | did not eat it (D) |
| peanut M&M (N) | l | ate it (E) |
| raisin M&M (R) | s | did not eat it (D) |
| raisin M&M (R) | l | ate it (E) |
| chocolate M&M (C) | m | did not eat it (D) |

Figuring out the (unnormalized) posteriors using the priors and likelihoods:

| Y | x1 | x2 | P(E)\*P(x1\|E)\*P(x2\|E) | P(D)\*P(x1\|D)\*P(x2\|D) | $\hat{Y}$ |
| --- | --- | --- | --- | --- | --- |
| D | N | s | .033 | .074 | D |
| E | N | l | .163 | .015 | E |
| D | R | s | .016 | .148 | D |
| E | R | l | .082 | .030 | E |
| D | C | m | .008 | .067 | D |
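
The table can be recomputed from the priors and smoothed likelihoods of the Fit section. This is a sketch with hypothetical names (`prior`, `lik`, `test_rows`, `predictions`); it uses exact fractions and only rounds when printing.

```python
from fractions import Fraction

prior = {'E': Fraction(4, 10), 'D': Fraction(6, 10)}
# Smoothed likelihoods P(value | Y); type and length labels do not overlap,
# so one dictionary per outcome is enough
lik = {
    'E': {'N': Fraction(4, 7), 'C': Fraction(1, 7), 'R': Fraction(2, 7),
          'l': Fraction(5, 7), 'm': Fraction(1, 7), 's': Fraction(1, 7)},
    'D': {'N': Fraction(2, 9), 'C': Fraction(3, 9), 'R': Fraction(4, 9),
          'l': Fraction(1, 9), 'm': Fraction(3, 9), 's': Fraction(5, 9)},
}

# (type, length, true outcome) rows from the test table
test_rows = [('N', 's', 'D'), ('N', 'l', 'E'), ('R', 's', 'D'),
             ('R', 'l', 'E'), ('C', 'm', 'D')]

predictions = []
for x1, x2, truth in test_rows:
    # Unnormalized posterior for each outcome
    row_scores = {y: prior[y] * lik[y][x1] * lik[y][x2] for y in prior}
    guess = max(row_scores, key=row_scores.get)
    predictions.append(guess)
    print(truth, x1, x2,
          round(float(row_scores['E']), 3), round(float(row_scores['D']), 3), guess)
```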

Based on this test data:
* What is the accuracy?
* What is the confusion matrix?

| Guess/Truth | E | D |
| -- | --- | ---- |
| E | 2 |0 |
| D | 0 | 3 |

In code:

In [None]:
test = np.array([['N', 's', 'D'], ['N', 'l', 'E'], ['R', 's', 'D'], ['R', 'l', 'E'], ['C', 'm', 'D']])
print(test)

In [None]:
yhat = predict(test[:, :-1], likelihoods, priors, mappings)
print(test[:, -1], yhat)

In [None]:
# from day 24
def accuracy(y, yhat):
    assert len(y) == len(yhat)
    # fraction of predictions that match the true labels (0.0 if none match)
    return np.mean(np.asarray(y) == np.asarray(yhat))
    
def confusion_matrix(y, yhat, cats):
    "Thanks to https://stackoverflow.com/questions/2148543/how-to-write-a-confusion-matrix-in-python"
    assert len(y) == len(yhat)
    # make a matrix of all zeros
    result = np.zeros((len(cats), len(cats)))
    # update the confusion matrix for each pair of y, yhat
    for i in range(len(y)):
        result[cats.index(y[i])][cats.index(yhat[i])] += 1
    return result

In [None]:
print(accuracy(test[:, -1], yhat))
print(confusion_matrix(test[:, -1], yhat, mappings[-1]))

## How can I get more robust estimates?

Right now, our estimates are literally count-and-divide (and smooth!) on our training data. Can we do better?

Yes, we can, but not without doing some *approximation*.

If we __look at our data__, in particular if we plot it, we can often identify the distribution it follows. For example, the data may follow a Gaussian distribution. Then, instead of counting and dividing, the *fit* step uses the training data to estimate the parameters of that distribution (for a Gaussian, a mean and a variance per feature and per class). This may be more robust (generalize better) than just count-and-divide.
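
As a rough sketch of what fitting a distribution could look like, suppose length were recorded as a number instead of 's', 'm' or 'l'. The measurements below are invented for illustration (they are not part of the dataset), and `lengths_mm` and `models` are hypothetical names. The sketch uses `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

# Hypothetical lengths in millimetres for each outcome (illustrative only)
lengths_mm = {
    'E': [18.0, 19.5, 20.0, 21.0],
    'D': [9.0, 10.0, 11.5, 12.0, 14.0, 15.0],
}

# "Fit" = estimate a mean and a standard deviation for each outcome
models = {y: NormalDist.from_samples(values) for y, values in lengths_mm.items()}

# The likelihood of a new length is the Gaussian density at that value
new_length = 17.0
for y, dist in models.items():
    print(y, round(dist.mean, 3), round(dist.stdev, 3), dist.pdf(new_length))
```

The density takes the place of the count-based $P(x_i|Y)$ in the product; the priors are still estimated by counting.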

For more information please refer to the scikit-learn documentation link below.

## Resources

* https://www.datacamp.com/community/tutorials/naive-bayes-scikit-learn
* https://scikit-learn.org/stable/modules/naive_bayes.html