In [1]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir('../notebook_format')
from formats import load_style
load_style()

In [None]:
os.chdir(path)
%matplotlib inline
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt

# Bayes Theorem

**Conjoint probability** is a fancy way to say the probability that two things are true. If you learned about probability in the context of coin tosses and dice, you might have learned the following formula:

$$p(A \text{ and } B) = p(A)p(B)$$

This means that if I toss two fair coins, the probability that both come up heads is 0.5 * 0.5 = 0.25.

The formula above only works when A and B are independent, meaning that the outcome of event A does not change the probability of event B, or more formally, $p(B|A) = p(B)$, where $p(B|A)$ denotes the probability of B given that A is true. As an example where the events are not independent, suppose A means that it rains today and B means that it rains tomorrow. If I know that it rains today, then it is more likely that it will rain tomorrow, so $p(B|A) > p(B)$.

Thus when the two events are not independent of one another, the formula above becomes:

$$p(A \text{ and } B) = p(A)p(B|A)$$

So if the chance of rain on any given day is 0.5, the chance of rain on two consecutive days is not 0.25, but probably a bit higher.
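The difference between the two formulas can be checked by brute-force enumeration. The sketch below is only an illustration: the events (`first_heads`, `second_heads`, `any_heads`) and the helper functions are made up for this example and use two fair coin tosses rather than the rain scenario, since the text gives no numbers for $p(B|A)$ there.

```python
from fractions import Fraction
from itertools import product

# Sample space: the four equally likely outcomes of tossing two fair coins.
outcomes = list(product('HT', repeat=2))

def prob(event):
    """Probability of an event, given as a predicate over outcomes."""
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))

def cond_prob(event, given):
    """p(event | given), computed by restricting the sample space to `given`."""
    restricted = [o for o in outcomes if given(o)]
    favorable = [o for o in restricted if event(o)]
    return Fraction(len(favorable), len(restricted))

def both(e1, e2):
    """The event that e1 and e2 are both true."""
    return lambda o: e1(o) and e2(o)

# Hypothetical events used only for this illustration.
first_heads = lambda o: o[0] == 'H'
second_heads = lambda o: o[1] == 'H'
any_heads = lambda o: 'H' in o

# Independent events: p(A and B) equals p(A) p(B).
p_independent = prob(both(first_heads, second_heads))

# Dependent events: knowing the first toss is heads guarantees "any heads",
# so only the general formula p(A) p(B|A) gives the right answer.
p_dependent = prob(both(first_heads, any_heads))
chain_rule = prob(first_heads) * cond_prob(any_heads, first_heads)
naive_product = prob(first_heads) * prob(any_heads)

print(p_independent, p_dependent, chain_rule, naive_product)
```

Using `Fraction` keeps the arithmetic exact, so the equality between `p_dependent` and `chain_rule` is not blurred by floating-point rounding.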

Next, we know that conjunction is symmetric (commutative), meaning that $p(A \text{ and } B) = p(B \text{ and } A)$. Putting the pieces together, $p(A)p(B|A) = p(B)p(A|B)$. Dividing both sides by $p(B)$ gives **Bayes's theorem**:

$$p(A|B) = \frac{p(A)p(B|A)}{p(B)}$$

Using this formula, let's consider the following cookie problem: Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each.
Now suppose you choose one of the bowls at random and without looking select a cookie at random. The cookie is vanilla. What is the probability that it came from Bowl 1? 

Applying Bayes's theorem gives $p(B_1|V) = \frac{p(B_1)p(V|B_1)}{p(V)}$. We know that $p(B_1)$, the probability that we chose Bowl 1, is 1/2; $p(V|B_1)$, the probability of getting a vanilla cookie from Bowl 1, is 3/4; and $p(V)$, the probability of drawing a vanilla cookie from either bowl, is 5/8 (there are 50 vanilla cookies among the 80 cookies in both bowls, and each bowl holds the same number of cookies).

Plugging these values into the formula gives $p(B_1|V) = \frac{(1/2)(3/4)}{5/8} = 3/5$. So the vanilla cookie that we randomly selected is more likely to have come from Bowl 1.
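The same arithmetic can be written as a few lines of Python. This is a minimal sketch; the variable names (`bowls`, `p_bowl1`, and so on) are chosen here for illustration and are not part of any library.

```python
from fractions import Fraction

# Cookie counts per bowl, as stated in the problem.
bowls = {
    'Bowl 1': {'vanilla': 30, 'chocolate': 10},
    'Bowl 2': {'vanilla': 20, 'chocolate': 20},
}

# p(B_1): a bowl is chosen at random, so each has probability 1/2.
p_bowl1 = Fraction(1, 2)

# p(V | B_1): fraction of vanilla cookies in Bowl 1.
bowl1 = bowls['Bowl 1']
p_vanilla_given_bowl1 = Fraction(bowl1['vanilla'], sum(bowl1.values()))

# p(V): vanilla cookies over all cookies. Pooling the counts like this is
# valid here only because both bowls are equally likely and equally full.
total_vanilla = sum(b['vanilla'] for b in bowls.values())
total_cookies = sum(sum(b.values()) for b in bowls.values())
p_vanilla = Fraction(total_vanilla, total_cookies)

# Bayes's theorem: p(B_1 | V) = p(B_1) p(V | B_1) / p(V)
p_bowl1_given_vanilla = p_bowl1 * p_vanilla_given_bowl1 / p_vanilla

print(p_vanilla)              # 5/8
print(p_bowl1_given_vanilla)  # 3/5
```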

## Diachronic Interpretation

An alternative way of looking at Bayes's theorem is that it gives us a way to update the probability of a hypothesis $H$ in light of some body of data $D$. This is called the **diachronic interpretation**, where “diachronic” means that something is happening over time. In other words, the probability of the hypothesis changes over time as we see new data. With this in mind, we can rewrite Bayes's theorem using this new notation:

$$p(H|D) = \frac{p(H)p(D|H)}{p(D)}$$

- $p(H)$ is the probability of the hypothesis before we see the data, called the **prior probability**. Sometimes we can compute the prior based on background information. For example, the cookie problem specifies that we choose one of the two bowls at random, so each of the two hypotheses (the cookie came from Bowl 1 or from Bowl 2) has a prior of 1/2. In other cases the prior is subjective; that is, people might disagree, either because they use different background information or because they interpret the same information differently.
- $p(H|D)$ is what we want to compute, the probability of the hypothesis after we see the data, called the **posterior probability**.
- $p(D|H)$ is the probability of the data under the hypothesis, called the **likelihood**. This is usually the easiest part to compute.
- $p(D)$ is the probability of the data under any hypothesis, called the **normalizing constant**. In the cookie problem, there are only two hypotheses, and exactly one of them is true. In that case we can compute $p(D)$ using the law of total probability, which says that if there are two mutually exclusive ways that something might happen, and they cover all the possibilities, you can add up the probabilities like this: $p(D) = p(B_1) p(D|B_1) + p(B_2) p(D|B_2)$. Plugging in the values from the cookie problem, we have $p(D) = (1/2) (3/4) + (1/2) (1/2) = 5/8$.
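The four terms above fit together into one small update step that works for any number of mutually exclusive, exhaustive hypotheses. The sketch below assumes a hypothetical helper named `update`, written for this illustration, that takes priors and likelihoods as dictionaries keyed by hypothesis.

```python
from fractions import Fraction

def update(priors, likelihoods):
    """Return (posteriors, p_data) for mutually exclusive, exhaustive hypotheses.

    priors and likelihoods are dicts keyed by hypothesis name.
    """
    # Numerator of Bayes's theorem for each hypothesis: p(H) p(D|H).
    unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
    # Law of total probability: p(D) is the sum of those products.
    p_data = sum(unnormalized.values())
    posteriors = {h: u / p_data for h, u in unnormalized.items()}
    return posteriors, p_data

# Cookie problem: D is "the cookie is vanilla".
priors = {'B1': Fraction(1, 2), 'B2': Fraction(1, 2)}
likelihoods = {'B1': Fraction(3, 4), 'B2': Fraction(1, 2)}

posteriors, p_data = update(priors, likelihoods)
print(p_data)      # 5/8
print(posteriors)  # B1 -> 3/5, B2 -> 2/5
```

Because the posteriors are obtained by dividing by their own sum, they always add up to 1, which is why $p(D)$ is called the normalizing constant.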

For many problems involving conditional probability, Bayes’s theorem provides a divide-and-conquer strategy. If $p(A|B)$ is hard to compute, or hard to measure experimentally, check whether it might be easier to compute the other terms in Bayes’s theorem.

Let's look at another problem.

## The M&M Problem

M&M’s are small candy-coated chocolates that come in a variety of colors. Mars, Inc., which makes M&M’s, changes the mixture of colors from time to time. In 1995, they introduced blue M&M’s. Before then, the color mix in a bag of plain M&M’s was 30% Brown, 20% Yellow, 20% Red, 10% Green, 10% Orange, 10% Tan. Afterward it was 24% Blue, 20% Green, 16% Orange, 14% Yellow, 13% Red, 13% Brown.

Suppose a friend of yours has two bags of M&M’s, and he tells you that one bag is from 1994 and the other is from 1996. He won’t tell you which is which, but he gives you one M&M from each bag. One is Yellow and one is Green. What is the probability that the Yellow one came from the 1994 bag?

This is similar to the cookie problem, with the twist that this time we draw one sample from each bag. This problem also gives us a chance to use the table method, which is useful for solving problems like this on paper. The first step is to enumerate the hypotheses. Call the bag that the Yellow M&M came from Bag 1, and the bag that the Green M&M came from Bag 2. Then the hypotheses are:

- A: Bag 1 is from 1994, which implies that Bag 2 is from 1996.
- B: Bag 1 is from 1996 and Bag 2 from 1994.

Now we construct a table with a row for each hypothesis and a column for each term in Bayes’s theorem:

|     | Prior $p(H)$ | Likelihood $p(D \mid H)$ | Prior * Likelihood $p(H)p(D \mid H)$ | Posterior $p(H \mid D)$ |
| --- |:-----------:|:-------------------:|:-------------------------------:|:------------------:|
| A   | 1/2         | (20)(20)            | 200                             | 20/27              |
| B   | 1/2         | (14)(10)            | 70                              | 7/27               |

- The first column has the priors. Based on the statement of the problem, it is reasonable to choose $p(A) = p(B) = 1/2$.
- The second column has the likelihoods, which follow from the information in the problem. For example, if $A$ is true, the yellow M&M came from the 1994 bag with probability 20%, and the green came from the 1996 bag with probability 20%. If $B$ is true, the yellow M&M came from the 1996 bag with probability 14%, and the green came from the 1994 bag with probability 10%. Because the selections are independent, we get the conjoint probability by multiplying the two numbers.
- The third column is just the product of the previous two. The sum of this column, 270, is the normalizing constant, $p(D)$. To get the last column, which contains the posteriors, we divide the third column by the normalizing constant.

Well, you might be bothered by one detail. In the table above, we wrote $p(D|H)$ in terms of pure numbers, not probabilities, which means it is off by a factor of 10,000. But that cancels out when we divide through by the normalizing constant, so it doesn’t affect the result.
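The table can be reproduced in code, which also makes it easy to confirm that the factor of 10,000 cancels. This is a sketch under the assumptions stated in the problem; the names `mix_1994`, `mix_1996`, and `table` are chosen for this example only.

```python
from fractions import Fraction

# Color mixes in percent, as given in the problem statement.
mix_1994 = {'brown': 30, 'yellow': 20, 'red': 20,
            'green': 10, 'orange': 10, 'tan': 10}
mix_1996 = {'blue': 24, 'green': 20, 'orange': 16,
            'yellow': 14, 'red': 13, 'brown': 13}

# Each hypothesis says which mix Bag 1 (yellow) and Bag 2 (green) follow.
hypotheses = {
    'A': (mix_1994, mix_1996),  # Bag 1 from 1994, Bag 2 from 1996
    'B': (mix_1996, mix_1994),  # Bag 1 from 1996, Bag 2 from 1994
}
prior = Fraction(1, 2)

def table(scale):
    """Return (posteriors, normalizing constant).

    scale=1 uses the raw percentages as in the table above;
    scale=100 converts them to true probabilities first.
    """
    products = {}
    for name, (bag1, bag2) in hypotheses.items():
        # The draws are independent, so the likelihood is a product.
        likelihood = Fraction(bag1['yellow'], scale) * Fraction(bag2['green'], scale)
        products[name] = prior * likelihood
    total = sum(products.values())
    return {name: p / total for name, p in products.items()}, total

post_raw, total_raw = table(scale=1)
post_prob, total_prob = table(scale=100)

print(total_raw, total_prob)  # 270 and 27/1000
print(post_raw == post_prob)  # True: the scale factor cancels
```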

## Discussion

Among Bayesians, there are two approaches to choosing prior distributions. Some recommend choosing the prior that best represents background information about the problem; in that case the prior is said to be informative. The problem with using an informative prior is that people might use different background information (or interpret it differently). So informative priors often seem subjective.

The alternative is a so-called uninformative prior, which is intended to be as unrestricted as possible, in order to let the data speak for themselves. In some cases you can identify a unique prior that has some desirable property, like representing minimal prior information about the estimated quantity.

Uninformative priors are appealing because they seem more objective. But I am generally in favor of using informative priors. Why? First, Bayesian analysis is always based on modeling decisions. Choosing the prior is one of those decisions, but it is not the only one, and it might not even be the most subjective. So even if an uninformative prior is more objective, the entire analysis is still subjective.

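Also, for most practical problems, you are likely to be in one of two regimes: either you have a lot of data or not very much. **If you have a lot of data, the choice of the prior doesn’t matter very much**; informative and uninformative priors yield almost the same results. If you have little data, the prior carries more weight in the posterior, so using relevant background information can make a real difference.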
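The effect of the amount of data on the importance of the prior can be illustrated numerically. The sketch below is not from the problems above: it assumes a hypothetical coin with unknown probability of heads, a grid of candidate values for that probability, and two made-up priors (one flat, one peaked at 0.5). It then compares the posterior means after a small and a large number of tosses with the same proportion of heads.

```python
import numpy as np

def grid_posterior(prior, grid, heads, tosses):
    """Posterior over a grid of candidate heads-probabilities."""
    # Work in log space so that large counts do not underflow.
    log_like = heads * np.log(grid) + (tosses - heads) * np.log(1 - grid)
    log_post = np.log(prior) + log_like
    post = np.exp(log_post - log_post.max())
    return post / post.sum()

# Hypotheses: the coin's probability of heads is one of these values.
grid = np.linspace(0.01, 0.99, 99)

# Uninformative prior: every value equally plausible.
uniform_prior = np.ones_like(grid) / grid.size

# Informative prior (an assumption for this example): bell-shaped around 0.5.
peaked = np.exp(-0.5 * ((grid - 0.5) / 0.1) ** 2)
peaked_prior = peaked / peaked.sum()

def mean_gap(heads, tosses):
    """Absolute difference between the posterior means under the two priors."""
    means = [np.sum(grid * grid_posterior(p, grid, heads, tosses))
             for p in (uniform_prior, peaked_prior)]
    return abs(means[0] - means[1])

gap_small = mean_gap(heads=7, tosses=10)       # little data
gap_large = mean_gap(heads=700, tosses=1000)   # a lot of data

print(gap_small, gap_large)
```

With only 10 tosses the two priors lead to visibly different posterior means, while with 1000 tosses the gap shrinks substantially, because the likelihood dominates the prior.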

## Reference

- [Bayes's Rule](https://arbital.com/p/bayes_rule/?l=1zq)
- [confidence interval and credible intervals' difference](http://stats.stackexchange.com/questions/2272/whats-the-difference-between-a-confidence-interval-and-a-credible-interval)
- [Gibbs Sampling for the Uninitiated](http://www.umiacs.umd.edu/~resnik/pubs/gibbs.pdf)