# Review of probability; Bayes' rule

Today we are going to review terminology for talking about probabilities, and then talk about Bayes' rule. 

How is this relevant to data analysis, particularly machine learning? Compare the two small data sets below. In each case, the last column is the dependent variable. What is the relationship between the dependent and independent variables in each case?

In [23]:
import numpy as np

A = np.array([[3,3,4,2,1,4,2,1], [2,4,1,3,1,3,4,2]]).T

B = np.array([A[:, 0], A[:, 0]]).T
print("A:", A, "\n", "B:", B)

A: [[3 2]
 [3 4]
 [4 1]
 [2 3]
 [1 1]
 [4 3]
 [2 4]
 [1 2]] 
 B: [[3 3]
 [3 3]
 [4 4]
 [2 2]
 [1 1]
 [4 4]
 [2 2]
 [1 1]]


Now, in data analysis we very rarely have *all the data*. In most cases, we have a *sample* of the data that we assume / hope / know to be big enough to generalize over. 

If dataset A (or B) above were *all the data* we could calculate actual probabilities; for example, the probability that the zeroth-column value of a row (written A[0, 0] as shorthand in what follows) equals 4. 

Even if A (or B) is a *sample* of the data, we still calculate relative frequencies, and if the sample is large enough they will approximate probabilities. 
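As a small illustration (a sketch, not part of the original notebook), relative frequencies for the zeroth column of A can be computed with `np.unique`. The variable names `values`, `counts` and `rel_freq` are chosen here for illustration only.

```python
import numpy as np

# Dataset A from above: eight rows, two columns.
A = np.array([[3, 3, 4, 2, 1, 4, 2, 1], [2, 4, 1, 3, 1, 3, 4, 2]]).T

# Count how often each distinct value occurs in column 0.
values, counts = np.unique(A[:, 0], return_counts=True)

# Relative frequency = count / number of rows.
rel_freq = counts / len(A)

for v, f in zip(values, rel_freq):
    print(f"value {v}: relative frequency {f}")
```

In this dataset each of the values 1 to 4 occurs twice in column 0, so every relative frequency is 2/8.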

So let's review some basic probability terminology.

* Two events X and Y are *independent* if P(X,Y) = P(X)*P(Y).
* If not, they are *dependent*, and we need a *conditional* probability to relate them: P(X,Y) = P(X)*P(Y|X).
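The independence condition above can be checked on the two small datasets for one pair of values. This is only a sketch: the helper `joint_and_product` is a made-up name, probabilities are estimated as relative frequencies, and eight rows are far too few to draw firm conclusions.

```python
import numpy as np

A = np.array([[3, 3, 4, 2, 1, 4, 2, 1], [2, 4, 1, 3, 1, 3, 4, 2]]).T
B = np.array([A[:, 0], A[:, 0]]).T

def joint_and_product(data, x, y):
    """Return (P(col0 = x, col1 = y), P(col0 = x) * P(col1 = y)),
    both estimated as relative frequencies over the rows of `data`."""
    col0_is_x = data[:, 0] == x
    col1_is_y = data[:, 1] == y
    joint = np.mean(col0_is_x & col1_is_y)
    product = np.mean(col0_is_x) * np.mean(col1_is_y)
    return joint, product

joint_A, product_A = joint_and_product(A, 1, 1)
joint_B, product_B = joint_and_product(B, 1, 1)
print("A:", joint_A, product_A)
print("B:", joint_B, product_B)
```

If the two numbers in a pair differ, the two columns are not independent for that pair of values, at least as far as this sample can tell.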

## Bayes' rule

Let's talk about the probability that A[0, 1] = 1 *and* A[0, 0] = 1: $P(A[0,0] = 1, A[0,1] = 1)$. Here A[0,0] and A[0,1] stand for the zeroth-column and first-column values of a row of dataset A.
* Consider A[0,0] first. $P(A[0,0]=1) = 1/4$. Now *if* A[0,0] = 1, what is the probability that A[0,1] = 1? In other words, what is $P(A[0,1]=1|A[0,0]=1)$? Well, $P(A[0,1]=1|A[0,0]=1) = 1/2$. So $P(A[0,0] = 1, A[0,1] = 1) = P(A[0,0]=1)*P(A[0,1]=1|A[0,0]=1) = 1/4*1/2 = 1/8$.
* Now consider A[0,1] first. $P(A[0,1]=1) = 1/4$. Now *if* A[0,1] = 1, what is the probability that A[0,0] = 1? In other words, what is $P(A[0,0]=1|A[0,1]=1)$? Well, $P(A[0,0]=1|A[0,1]=1) = 1/2$. So $P(A[0,0] = 1, A[0,1] = 1) = P(A[0,1]=1)*P(A[0,0]=1|A[0,1]=1) = 1/4*1/2 = 1/8$.

So it works both ways! In other words, $P(X,Y) = P(X)*P(Y|X) = P(Y)*P(X|Y)$. So $P(Y|X) = \frac{P(Y)*P(X|Y)}{P(X)}$. 

And that is Bayes' rule.
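The two factorizations, and Bayes' rule itself, can be checked numerically on dataset A. The sketch below uses exact fractions; the helpers `prob` and `cond_prob` are illustrative names, not part of the original notebook, and probabilities are again relative frequencies over the eight rows.

```python
import numpy as np
from fractions import Fraction

A = np.array([[3, 3, 4, 2, 1, 4, 2, 1], [2, 4, 1, 3, 1, 3, 4, 2]]).T

def prob(mask):
    """Relative frequency of rows where `mask` is True, as an exact fraction."""
    return Fraction(int(mask.sum()), len(mask))

def cond_prob(event, given):
    """Relative frequency of `event` among the rows where `given` is True."""
    return Fraction(int((event & given).sum()), int(given.sum()))

x = A[:, 0] == 1  # event X: column 0 equals 1
y = A[:, 1] == 1  # event Y: column 1 equals 1

joint = prob(x & y)                 # P(X, Y) counted directly
via_x = prob(x) * cond_prob(y, x)   # P(X) * P(Y|X)
via_y = prob(y) * cond_prob(x, y)   # P(Y) * P(X|Y)

# Bayes' rule: P(Y|X) = P(Y) * P(X|Y) / P(X)
bayes = prob(y) * cond_prob(x, y) / prob(x)

print(joint, via_x, via_y)
print(bayes, cond_prob(y, x))
```

All three values on the first printed line agree, and the value from Bayes' rule matches the conditional probability counted directly from the rows.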

Let's cover some more terminology. Each of the four parts of Bayes' rule has a name:
* prior
* likelihood
* posterior
* evidence (or normalization)

Match the part to the label:
* P(Y|X) <- posterior
* P(Y) <- prior
* P(X|Y) <- likelihood
* P(X) <- evidence (or normalization)

## Bayes' rule example

Now let's do an example using these tasty peanut M&Ms I have here. In my cup, there are three M&Ms. They might all be yellow, they might all be blue, or some of them might be yellow and some blue. In other words, here are the possibilities:
* 3 blues, 0 yellows
* 2 blues, 1 yellow
* 1 blue, 2 yellows
* 0 blues, 3 yellows

I can draw *one* M&M from the cup, without looking. Suppose it is blue (written $1b$). Then, *given* that one M&M, let's see if we can estimate the probability that the cup held 2 blues and 1 yellow:
* $P(2b1y | 1b) = P(2b1y)*P(1b|2b1y) / P(1b)$. 
* $P(2b1y) = 1/4$, assuming all four possibilities are equally likely before the draw.  
* $P(1b|2b1y) = 2/3$.
* $P(1b) = P(1b|3b0y)*P(3b0y) + P(1b|2b1y)*P(2b1y) + P(1b|1b2y)*P(1b2y) + P(1b|0b3y)*P(0b3y) = 1*1/4 + 2/3*1/4 + 1/3*1/4 + 0 = 1/2$.

$P(2b1y | 1b) = (1/4*2/3) / (1/2) = 1/3$.

Let's repeat for the other three possible outcomes:

__1b2y__

* $P(1b2y | 1b) = P(1b2y)*P(1b|1b2y) / P(1b)$. 
* $P(1b2y) = 1/4$.  
* $P(1b|1b2y) = 1/3$.
* $P(1b) = P(1b|3b0y)*P(3b0y) + P(1b|2b1y)*P(2b1y) + P(1b|1b2y)*P(1b2y) + P(1b|0b3y)*P(0b3y) = 1*1/4 + 2/3*1/4 + 1/3*1/4 + 0 = 1/2$.

So $P(1b2y | 1b) = (1/4*1/3)/ (1/2) = 1/6$.

__3b0y__
* $P(3b0y | 1b) = P(3b0y)*P(1b|3b0y) / P(1b)$. 
* $P(3b0y) = 1/4$.  
* $P(1b|3b0y) = 1$.
* $P(1b) = P(1b|3b0y)*P(3b0y) + P(1b|2b1y)*P(2b1y) + P(1b|1b2y)*P(1b2y) + P(1b|0b3y)*P(0b3y) = 1*1/4 + 2/3*1/4 + 1/3*1/4 + 0 = 1/2$.

So $P(3b0y | 1b) = (1/4*1)/ (1/2) = 1/2$.

__0b3y__
* $P(0b3y | 1b) = P(0b3y)*P(1b|0b3y) / P(1b)$. 
* $P(0b3y) = 1/4$.  
* $P(1b|0b3y) = 0$.
* $P(1b) = P(1b|3b0y)*P(3b0y) + P(1b|2b1y)*P(2b1y) + P(1b|1b2y)*P(1b2y) + P(1b|0b3y)*P(0b3y) = 1*1/4 + 2/3*1/4 + 1/3*1/4 + 0 = 1/2$.

So $P(0b3y | 1b) = (1/4*0)/ (1/2) = 0$.

Sanity check: does the sum of all four probabilities equal 1?
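The whole calculation for the three-M&M cup can be sketched in a few lines of Python with exact fractions. The sketch assumes a blue M&M was drawn and that all four cup compositions are equally likely beforehand (the uniform prior used above). The dictionary names are illustrative.

```python
from fractions import Fraction

n = 3  # number of M&Ms in the cup

# Each hypothesis is the number of blue M&Ms in the cup: 0, 1, 2 or 3.
hypotheses = range(n + 1)

# Assumption: uniform prior, every composition equally likely before the draw.
prior = {b: Fraction(1, n + 1) for b in hypotheses}

# Likelihood of drawing one blue M&M when the cup holds b blues.
likelihood = {b: Fraction(b, n) for b in hypotheses}

# Evidence P(1b): sum over all hypotheses of likelihood * prior.
evidence = sum(prior[b] * likelihood[b] for b in hypotheses)

# Posterior for each hypothesis via Bayes' rule.
posterior = {b: prior[b] * likelihood[b] / evidence for b in hypotheses}

for b in hypotheses:
    print(f"{b} blue, {n - b} yellow: {posterior[b]}")
print("sum:", sum(posterior.values()))
```

The evidence comes out as 1/2 and the four posteriors as 0, 1/6, 1/3 and 1/2, matching the hand calculation, and they sum to 1.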

## Exercise 

Now you do one! Each of you will have a cup with *4* M&Ms. 

* What are the possible outcomes for color combinations in your cup?
* Draw one M&M. Given this M&M, what is the probability of each color combination in your cup?

## Fitting and predicting using Naive Bayes

Let's imagine I have the dataset below. For the dependent variable (first column), I use "1" to represent *ate it* and "0" to represent *did not eat it*. For the independent variable (zeroth column), I use "0" to represent *peanut M&M*, "1" to represent *regular M&M* and "2" to represent *raisin M&M*.

(Side note: we typically convert qualitative values to numbers for machine learning; it enables us to use all the power of numpy, at some cost in readability to humans.)

In [30]:
A = np.array([[0, 1], [0, 1], [1, 0], [2, 0], [2, 1], [1, 0], [0, 1], [1, 0], [2, 0], [2, 0], [0, 1], [0, 0], [1, 0]])
print(A)

[[0 1]
 [0 1]
 [1 0]
 [2 0]
 [2 1]
 [1 0]
 [0 1]
 [1 0]
 [2 0]
 [2 0]
 [0 1]
 [0 0]
 [1 0]]


__Fit__

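* Calculate the likelihood of *peanut M&M* given *ate it*, of *regular M&M* given *ate it*, of *raisin M&M* given *ate it*, and of each type of M&M given *did not eat it*. (The likelihood is $P(X|Y)$: the probability of the observation given the behavior, not the other way around.)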
* Calculate the prior for *ate it* and the prior for *did not eat it*.

Store both sets of values.

__Predict__

Given a new observation, *peanut M&M*, what is my most likely behavior?

* Calculate $P(\text{ate it} | \text{peanut M\&M})$ and $P(\text{did not eat it} | \text{peanut M\&M})$. Note that since $P(\text{peanut M\&M})$ is the denominator in both cases, it does not change which of the two is larger, so we can skip it when all we want is the more likely behavior. So we only need the prior and the likelihood, both of which we calculated during __fit__.
* Which is higher?
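The fit and predict steps above could be sketched as follows. This is one possible minimal implementation, not the only way to do it: the function names `fit` and `predict` and the dictionary layout are choices made here for illustration, and no smoothing is applied, so a combination never seen in the data gets a likelihood of exactly 0.

```python
import numpy as np

A = np.array([[0, 1], [0, 1], [1, 0], [2, 0], [2, 1], [1, 0], [0, 1],
              [1, 0], [2, 0], [2, 0], [0, 1], [0, 0], [1, 0]])
X, y = A[:, 0], A[:, 1]  # X: type of M&M, y: 1 = ate it, 0 = did not eat it

def fit(X, y):
    """Estimate priors P(y) and likelihoods P(x | y) as relative frequencies."""
    classes = np.unique(y)
    feature_values = np.unique(X)
    priors = {int(c): float(np.mean(y == c)) for c in classes}
    likelihoods = {
        (int(v), int(c)): float(np.mean(X[y == c] == v))
        for c in classes
        for v in feature_values
    }
    return priors, likelihoods

def predict(x, priors, likelihoods):
    """Return the class with the largest prior * likelihood, plus all scores.
    The evidence P(x) is skipped because it is the same for every class."""
    scores = {c: priors[c] * likelihoods[(x, c)] for c in priors}
    return max(scores, key=scores.get), scores

priors, likelihoods = fit(X, y)
label, scores = predict(0, priors, likelihoods)  # 0 = peanut M&M
print("prediction:", label)
print("unnormalized scores:", scores)
```

For a peanut M&M the score for *ate it* is (5/13)*(4/5) = 4/13 and the score for *did not eat it* is (8/13)*(1/8) = 1/13, so the sketch predicts *ate it*. Dividing each score by their sum would give the actual posterior probabilities.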

