In [1]:
# __future__ imports must come before any other statement in the cell
from __future__ import print_function
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas import Series, DataFrame
import pytz
from pytz import common_timezones, all_timezones
import matplotlib
matplotlib.style.use('ggplot')
%matplotlib inline
from datetime import datetime

## Think Bayes: Bayesian Statistics Made Simple

There are several excellent modules for doing Bayesian statistics in Python, including ```pymc``` and ```OpenBUGS```.
I chose not to use them for this book because you need a fair amount of background knowledge to get started with these modules, and I want to keep the prerequisites minimal. If you know Python and a little bit about probability, you are ready to start this book.


Chapter 1 is about probability and Bayes's theorem; it has no code. Chapter 2 introduces ```Pmf```, a thinly disguised Python dictionary I use to represent a probability mass function (PMF). Then Chapter 3 introduces ```Suite```, a kind of Pmf that provides a framework for doing Bayesian updates. And that's just about all there is to it.

Well, almost. In some of the later chapters, I use analytic distributions including the Gaussian (normal) distribution, the exponential and Poisson distributions, and the beta distribution. In Chapter 15 I break out the less-common Dirichlet distribution, but I explain it as I go along. If you are not familiar with these distributions, you can read about them on Wikipedia.

In [2]:
785000./311000000

0.0025241157556270097

The usual notation for conditional probability is 
$p(A|B)$, which is the probability of $A$ given that $B$ is true. In this example, $A$ represents the prediction that I will have a heart attack in the next year, and $B$ is the set of conditions I listed.

### Conjoint probability

**Conjoint probability** is a fancy way to say the probability that two things are true. I write $p(A \mbox{ and } B)$ to mean the probability that $A$ and $B$ are both true.
In general, the probability of a conjunction is

$
p(A \mbox{ and } B) = p(A) p(B|A)
$

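for any $A$ and $B$. In general $p(B|A)$ need not equal $p(B)$: knowing that it rained today presumably makes rain tomorrow more likely. So if the chance of rain on any given day is 0.5, the chance of rain on two consecutive days is not 0.25, but probably a bit higher.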
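
As a rough numerical illustration of the rule above, the sketch below compares the product rule with the naive calculation that treats the two days as independent. The value `p_rain_given_rain = 0.6` is an assumption chosen only for illustration; the text does not say how much higher the conditional probability is.

```python
# A = rain today, B = rain tomorrow.
p_rain = 0.5             # p(A): chance of rain on any given day (from the text)
p_rain_given_rain = 0.6  # p(B|A): ASSUMED value, for illustration only

# General rule: p(A and B) = p(A) p(B|A)
p_two_days = p_rain * p_rain_given_rain

# Naive calculation that treats the two days as independent
p_if_independent = p_rain * p_rain

print(p_two_days)
print(p_if_independent)
```

With any conditional probability above 0.5, the first number comes out larger than 0.25.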

#### The cookie problem

We'll get to Bayes's theorem soon, but I wanted to motivate it with an example called the cookie problem. Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each. Now suppose you choose one of the bowls at random and, without looking, select a cookie at random. The cookie is vanilla. What is the probability that it came from Bowl 1?

This is a conditional probability; we want $p(\mbox{Bowl 1} | \mbox{vanilla})$, but it is not obvious how to compute it. If I asked a different question - the probability of a vanilla cookie given Bowl 1 - it would be easy:

$
p(\mbox{vanilla} | \mbox{Bowl 1}) = \frac{30}{40}
$

Sadly, $p(A|B)$ is _not_ the same as $p(B|A)$, but there is a way to get from one to the other: Bayes's theorem.

#### Bayes's theorem

At this point we have everything we need to derive Bayes's theorem. We'll start with the observation that conjunction is commutative; that is,

$
p(A \mbox{ and } B) = p(B \mbox{ and } A)
$

Next, we write the probability of a conjunction:

$
p(A \mbox{ and } B) = p(A)p(B|A)
$

Since we have not said anything about what $A$ and $B$ mean, they are interchangeable. Interchanging them yields

$
p(B \mbox{ and } A) = p(B)p(A|B)
$

That's all we need. Pulling these pieces together, we get

$
p(B)p(A|B) = p(A)p(B|A)
$

which means there are two ways to compute the conjunction. If you have $p(A)$, you multiply by the conditional probability
$p(B|A)$. Or you can do it the other way around: if you know $p(B)$, you multiply by $p(A|B)$. Either way you should get the same thing. 
Finally we can divide through by $p(B)$:

$
p(A|B)  = \frac{p(A) p(B|A)}{p(B)}
$

And that's Bayes's theorem! It might not look like much, but it turns out to be surprisingly powerful.
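
One way to convince yourself of the derivation is to check it on numbers. The sketch below uses a made-up 2x2 table of counts (the counts and the helper name `bayes` are illustrative assumptions, not part of the book's code). It confirms that both ways of computing the conjunction agree, and that Bayes's theorem recovers $p(A|B)$.

```python
from fractions import Fraction

def bayes(p_a, p_b_given_a, p_b):
    """Return p(A|B) = p(A) p(B|A) / p(B)."""
    return p_a * p_b_given_a / p_b

# HYPOTHETICAL counts of 100 equally likely outcomes,
# keyed by (A is true, B is true)
counts = {(True, True): 20, (True, False): 20,
          (False, True): 10, (False, False): 50}
total = sum(counts.values())

n_a = counts[True, True] + counts[True, False]  # outcomes where A holds
n_b = counts[True, True] + counts[False, True]  # outcomes where B holds
n_ab = counts[True, True]                       # outcomes where both hold

p_a = Fraction(n_a, total)
p_b = Fraction(n_b, total)
p_a_and_b = Fraction(n_ab, total)
p_b_given_a = Fraction(n_ab, n_a)
p_a_given_b = Fraction(n_ab, n_b)

# Two ways to compute the conjunction give the same result
assert p_a * p_b_given_a == p_a_and_b
assert p_b * p_a_given_b == p_a_and_b

# Bayes's theorem recovers p(A|B) from the other three quantities
assert bayes(p_a, p_b_given_a, p_b) == p_a_given_b
```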

For example, we can use it to solve the cookie problem. I'll write $B_1$ for the hypothesis that the cookie came from Bowl 1 and $V$ for the vanilla cookie. Plugging into Bayes's theorem we get

$
p(B_{1} | V) = \frac{p(B_{1})p(V | B_{1})}{p(V)}
$

The term on the left is what we want: the probability of Bowl 1, given that we chose a vanilla cookie. The terms on the right are:

* $p(B_1)$: This is the probability that we chose Bowl 1, unconditioned by what kind of cookie we got. Since the problem says we chose a bowl at random, we can assume $p(B_1) = \frac{1}{2}$.
* $p(V|B_1)$: This is the probability of getting a vanilla cookie from Bowl 1, which is $\frac{3}{4}$.
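* $p(V)$: This is the probability of drawing a vanilla cookie from either bowl. Since we had an equal chance of choosing either bowl and the bowls contain the same number of cookies, we had the same chance of choosing any cookie. Between the two bowls there are 50 vanilla cookies out of 80, so $p(V) = \frac{5}{8}$. Equivalently, $p(V) = p(B_1)p(V|B_1) + p(B_2)p(V|B_2) = \frac{1}{2} \cdot \frac{3}{4} + \frac{1}{2} \cdot \frac{1}{2} = \frac{5}{8}$.

Putting it together, $p(B_1|V) = \frac{(1/2)(3/4)}{5/8} = \frac{3}{5}$.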
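
The arithmetic for the cookie problem can be checked with exact fractions. The sketch below is only an illustration with made-up variable names; the final cross-check by direct counting relies on the observation that every cookie is equally likely to be chosen.

```python
from fractions import Fraction

p_b1 = Fraction(1, 2)            # p(B_1): prior probability of Bowl 1
p_b2 = Fraction(1, 2)            # p(B_2): prior probability of Bowl 2
p_v_given_b1 = Fraction(30, 40)  # p(V|B_1): 30 vanilla out of 40 cookies
p_v_given_b2 = Fraction(20, 40)  # p(V|B_2): 20 vanilla out of 40 cookies

# p(V): total probability of drawing a vanilla cookie
p_v = p_b1 * p_v_given_b1 + p_b2 * p_v_given_b2

# Bayes's theorem: p(B_1|V) = p(B_1) p(V|B_1) / p(V)
p_b1_given_v = p_b1 * p_v_given_b1 / p_v

print(p_v)           # 5/8
print(p_b1_given_v)  # 3/5

# Cross-check by counting: 30 of the 50 vanilla cookies are in Bowl 1
assert p_b1_given_v == Fraction(30, 30 + 20)
```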

#### The diachronic interpretation

There is another way to think of Bayes's theorem: it gives us a way to update the probability of a hypothesis, $H$, in light of some body of data, $D$.

This way of thinking about Bayes's theorem is called the **diachronic interpretation**. "Diachronic" means that something is happening over time; in this case the probability of the hypothesis changes over time as we see new data.

Rewriting Bayes's theorem with $H$ and $D$ yields:

$
p(H|D) = \frac{p(H) p(D|H)}{p(D)}
$

In this interpretation, each term has a name:

* $p(H)$ is the probability of the hypothesis before we see the data, called the prior probability, or just **prior**.
* $p(H|D)$ is what we want to compute, the probability of the hypothesis after we see the data, called the **posterior**.
* $p(D|H)$ is the probability of the data under the hypothesis, called the **likelihood**.
* $p(D)$ is the probability of the data under any hypothesis, called the **normalizing constant**.

Sometimes we can compute the prior based on background information. For example, the cookie problem specifies that we choose a bowl at random with equal probability. In other cases the prior is subjective; that is, reasonable people might disagree, either because they use different background information or because they interpret the same information differently.

The likelihood is usually the easiest part to compute. In the cookie problem, if we know which bowl the cookie came from, we find the probability of a vanilla cookie by counting.

The normalizing constant can be tricky. It is supposed to be the probability of seeing the data under any hypothesis at all, but in the most general case it is hard to nail down what that means.

Most often we simplify things by specifying a set of hypotheses that are

* **Mutually exclusive**: At most one hypothesis in the set can be true, and
* **Collectively exhaustive**: There are no other possibilities; at least one of the hypotheses has to be true.

I use the word **suite** for a set of hypotheses that has these properties.

In the cookie problem, there are only two hypotheses - the cookie came from Bowl 1 or Bowl 2 - and they are mutually exclusive and collectively exhaustive. In that case we can compute $p(D)$ using the law of total probability, which says that if there are two exclusive ways that something might happen, you can add up the probabilities like this:

$
p(D) = p(B_1)p(D|B_1) + p(B_2)p(D|B_2)
$
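
The same recipe extends to any suite of hypotheses: multiply each prior by its likelihood, then divide by the sum of those products, which is $p(D)$. The function below is a minimal sketch of that idea using plain dictionaries. The name `update` and the dictionary layout are assumptions made for illustration; this is not the `Suite` class introduced in Chapter 3.

```python
from fractions import Fraction

def update(prior, likelihood):
    """Return (posterior, p_data) for a suite of hypotheses.

    prior:      dict mapping each hypothesis to p(H)
    likelihood: dict mapping each hypothesis to p(D|H)
    """
    # Unnormalized posteriors: p(H) p(D|H) for each hypothesis
    unnorm = {h: prior[h] * likelihood[h] for h in prior}
    # Law of total probability: p(D) is the sum over the whole suite
    p_data = sum(unnorm.values())
    # Dividing by p(D) makes the posteriors add up to 1
    posterior = {h: p / p_data for h, p in unnorm.items()}
    return posterior, p_data

# The cookie problem: D is "the cookie is vanilla"
prior = {'Bowl 1': Fraction(1, 2), 'Bowl 2': Fraction(1, 2)}
likelihood = {'Bowl 1': Fraction(3, 4), 'Bowl 2': Fraction(1, 2)}

posterior, p_data = update(prior, likelihood)
print(p_data)               # 5/8
print(posterior['Bowl 1'])  # 3/5
print(posterior['Bowl 2'])  # 2/5
```

Because the hypotheses are mutually exclusive and collectively exhaustive, the posterior probabilities sum to 1.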
