In [1]:
%matplotlib widget
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb as binom_coeff

# 1. Understanding Bayesian inference

## 1.1. Developing an intuition

The Bayes rule provides a formal (mathematical) description of a type of inference, the act of guessing by means of observation and knowledge. The Bayes rule equates the **posterior** probability $p(h \mid D)$ to the product of the **prior** probability $p(h)$ and the **likelihood** $p(D \mid h)$, normalized by the **marginal** probability $p(D)$:

$$p(h \mid D) = \frac{p(h)p(D \mid h)}{p(D)}$$

For someone who is not used to hearing this statistical jargon, this formula may appear a bit incoherent. But try not to resort to rote memorization. The intuition behind the rule is crucial and, frankly, quite elegant. If you are like me, you find it hard to think in new and unfamiliar terms. Try replacing some of the words that you are not absolutely comfortable with by more everyday notions.

For instance, you can swap "probability" with "strength of belief" or "conviction". Note an important distinction between belief _content_ and belief _strength_. When we say that there is _high probability_ that "a coin is fair", we are expressing a _strong conviction_ in a "belief" that a coin should behave in certain ways. From here on, I will refer to belief strength as conviction and to belief content simply as belief.

Further, you can replace the Latin terms "posterior" and "prior" with the more intuitive "after-the-fact" and "before-the-fact" (respectively). Thus, you can think of the _prior probability_ as _conviction before the fact_ and the _posterior probability_ as _conviction after the fact_. The "fact" here refers to observing some data. Note that conviction after an observation and conviction before the observation are on opposite sides of the Bayes rule equation, which means that they are related, yet not identical. Observations can change the strength of our beliefs and the Bayes rule describes one way in which this can happen. At this point it is very useful to rearrange the equation in order to emphasize that the Bayes rule is an updating procedure:

$$p(h \mid D) =  p(h) \times \frac{p(D \mid h)}{p(D)}$$

The initial and updated convictions differ by a factor that includes the likelihood term in the numerator and the marginal term in the denominator. What are they? Let's start with the numerator. Likelihood (not to be confused with probability) is a kind of score for the quality of some belief (i.e. belief accuracy). How do we assess our beliefs? Directly or indirectly, we do it by comparing them to our observations. When a belief is accurate, it can be used to generate guesses that correspond to (our subjective) reality. More observations enable a better assessment. To give you an example, imagine that I am convinced in a belief that "my farts don't smell". Using a sophisticated "cupping" procedure I find out day-to-day that as a matter of fact they are indeed extremely funky. My belief is no good, and I should take it less seriously. But note that if I only sniff my fart once and it happens to stink, I should not be so quick to think that my belief is strictly inaccurate (maybe it was a freaky one?). As a result, my conviction in my original belief will not change a lot. So, you can think of likelihood simply as the accuracy of a belief, keeping in mind that the only way to test this accuracy is to compare the belief to reality.

Finally, let's break down the marginal likelihood term $p(D)$ in the denominator. Mathematically, marginalization of likelihood is a kind of averaging of likelihoods (accuracies) across different beliefs, $p(D) = \int_{H} p(D \vert h)p(h)dh$ (there is an equivalent formulation for a discrete conditional variable in which we take a sum over probability-weighted values). Note that the integrand is the same product of likelihood and prior that appears in the numerator of the Bayes rule, only now we are considering the entire space of beliefs included in $H$. In simpler terms, the marginal likelihood is the combined accuracy of various beliefs, weighted by how strongly we believe in them. Therefore, if we are equally convinced in all of our beliefs about something (this is called a uniform prior), the marginal likelihood is simply the mean accuracy of these beliefs. Conversely, the accuracy of beliefs that we feel strongly about will influence the average accuracy a lot more. Try and think of different combinations of beliefs' accuracies and their strengths and how these factors determine the magnitude of the denominator.
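To make the weighted-average reading concrete, here is a small sketch for the discrete case. The three likelihood values and both priors are made-up numbers chosen only for illustration; they do not come from any data in this notebook.

```python
import numpy as np

# Hypothetical accuracies (likelihoods) of three competing beliefs
likelihoods = np.array([0.9, 0.5, 0.1])

# Equal conviction in every belief (a uniform prior)
uniform_prior = np.full(3, 1 / 3)
# Strong conviction in the first (most accurate) belief
skewed_prior = np.array([0.8, 0.1, 0.1])

# Discrete marginal likelihood: sum of likelihood x prior over all beliefs
marginal_uniform = np.sum(likelihoods * uniform_prior)
marginal_skewed = np.sum(likelihoods * skewed_prior)

print('Uniform prior: {:.3f} (plain mean: {:.3f})'.format(marginal_uniform, likelihoods.mean()))
print('Skewed prior:  {:.3f}'.format(marginal_skewed))
```

Under the uniform prior the marginal equals the plain mean of the likelihoods, while under the skewed prior it is pulled toward the likelihood of the belief we hold most strongly.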

As you can see, the denominator combines the accuracy of a belief with its strength. It is useful to think of what this product of accuracy and conviction ($p(D \vert h) \times p(h)$) stands for. I find it helpful to think about it as a conviction discounted by accuracy (because the ultimate result of the Bayes rule is probability, which we interpret as belief strength). Actually, likelihood can be greater than 1, since it is a product of probability _densities_ (!), not probabilities. Thus, even the term accuracy is a bit misleading, because we usually think of accuracy as a normalized quantity (i.e. a percentage). It would be more accurate (no pun intended) to say that likelihood is a measure of the amount of evidence for a certain belief. It is unbounded and the more positive data you have, the more evidence you have for a belief. However, it can never be negative because we cannot have a negative density. So, a more sensible way to formulate likelihood is the amount of evidence for a belief.
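A quick way to convince yourself that a likelihood can exceed 1 is to evaluate a narrow continuous density. The sketch below assumes a normal distribution with a small standard deviation and three invented observations; `normal_pdf` is a helper written here for illustration, not part of the notebook.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    # Density of a normal distribution with mean mu and standard deviation sigma
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Invented observations that sit close to the assumed mean
observations = np.array([0.02, -0.05, 0.01])

# Belief: the data come from a normal distribution with mu = 0 and sigma = 0.1
densities = normal_pdf(observations, mu=0.0, sigma=0.1)
likelihood = np.prod(densities)

print('Densities: ', densities)
print('Likelihood:', likelihood)
```

Every density here is larger than 1, and so is their product, yet none of them can be negative.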

Having been introduced to each of the terms in the Bayes rule, we should now be able to think about it in a very intuitive way. It should be clear how this equation describes a situation where:

> I am aware of various beliefs about the world in which I can be more or less convinced. Experiencing new things may change how strongly I feel about my beliefs. How? Well, if a belief corresponds well with my observations, I should be more convinced of its truth. On the other hand, if there are more accurate alternative beliefs that fit my observations better, I should be less convinced about my original less accurate belief. That is all there is to it. Changing convictions in my beliefs before-the-fact requires me to know two things: (1) how accurate my belief is against what I observed, and (2) how accurate other beliefs are against what I observed.

So, the right-hand side of the Bayes rule simply tells us how the data updates an old conviction $p(h)$ into a new conviction $p(h \mid D)$. If we rearrange it slightly (as above), it becomes very clear that the strength of the initial belief should grow in proportion to its accuracy and shrink in proportion to the accuracies of other beliefs:

$$\text{updated conviction in belief}= \text{initial conviction in belief} \times \frac{\text{accuracy of initial belief}}{\text{accuracy of alternative beliefs}}$$

The Bayes rule is super fun because there are multiple ways of thinking about it. If we expand the denominator and put the prior back into the numerator we get:

$$p(h \mid D) = \frac{p(h)p(D \mid h)}{\int_{H} p(D \vert h)p(h)dh}$$

Remember how we said that the product of accuracy and conviction can be thought of as "discounted conviction", a combination of the amount of evidence and strength of a belief? Well, in this form of the Bayes rule we can see that our conviction in a belief after-the-fact equals the discounted or upgraded conviction in a belief before-the-fact, relative to the sum-total of discounted or upgraded convictions in other beliefs. 
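For a finite set of beliefs the integral in the denominator becomes a sum, and the whole rule fits in a few lines. This is only a sketch under that discrete assumption; the function name `bayes_update` and the example numbers are mine.

```python
import numpy as np

def bayes_update(prior, likelihood):
    # Conviction before the fact, discounted (or upgraded) by the evidence
    discounted = np.asarray(prior, dtype=float) * np.asarray(likelihood, dtype=float)
    # Divide by the sum-total over all beliefs so the convictions add up to 1
    return discounted / discounted.sum()

# Three hypothetical beliefs with made-up convictions and likelihoods
posterior = bayes_update(prior=[0.5, 0.3, 0.2], likelihood=[0.1, 0.4, 0.4])
print(posterior)
```

The first belief starts out as the strongest but has the least evidence behind it, so it ends up with the weakest conviction after the update.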

Hopefully, this section has provided you with some intuitive understanding of what the Bayes rule stands for. Now let's consider a simple example.

## 1.2. A simple example

Statisticians love to talk about unfair coins, even though they are a myth and you can't really bias a coin flip. However, imagine someone gave you a coin and told you that it was biased to come up heads only 30% of the time. This is an extraordinary claim, since you think that a biased coin is nearly impossible. We can formalize this belief by stating that the probability of the coin being biased is really low. To do that, we need to express our belief mathematically. This can be done through the Bernoulli PMF, which has a single parameter $\pi$:

$$P(X=x \mid \pi) = \pi^{x} (1 - \pi)^{1-x}$$

where $x \in \{0, 1\}$ (0 = tails, 1 = heads). The parameter $\pi$ expresses our belief in a precise numerical manner. It corresponds to the proportion of heads we expect (under our belief) if we tossed the coin indefinitely. If we believe that the coin is fair, $\pi$ should be exactly $.5$. If we think the coin is biased towards heads, we could say that $\pi = .8$. In general, $p(\pi = q)$ corresponds to the strength of the belief that the coin comes up heads a proportion $q$ of the time in the long run.

Let's say our prior conviction that the coin is in fact biased to come up heads 30% of the time is very low, $p(\pi=.3)=.01$. If a fair coin is the only alternative we consider, this means that we are almost sure that the coin is fair, $p(\pi=.5) = .99$. But let's be good scientists and pit our beliefs against some actual data. Let's say we flip a coin 5 times and get the following sequence `[1,0,0,0,0]`. How does observing this sequence update our beliefs? The Bayes rule tells us that we need to know the likelihoods of all our beliefs (we have two, $\pi=.3$ and $\pi=.5$). By definition, the likelihood of a belief given independent observations is the product of their probability densities (or masses) under that belief. So, the likelihood that the coin is biased is:

$$\text{likelihood}(\{1,0,0,0,0\};\pi=.3) = .3^{1} \times (1 - .3)^{4} = .072$$

and the likelihood that it is fair is:

$$\text{likelihood}(\{1,0,0,0,0\};\pi=.5) = .5^{1} \times (1 - .5)^{4} = .031$$

So it looks like there is more evidence for the belief in a biased coin, compared to the belief in a fair coin. So, should we believe more strongly now that the coin is biased? We would if we did not have any strong beliefs about the improbability of a biased coin. Remember, the likelihoods only tell us how the initial convictions change. Since we felt so strongly about the biased coin being so improbable, the conviction in that belief shrinks the evidence:

$p(\pi=.3) \times \text{likelihood}(\{1,0,0,0,0\};\pi=.3) = .01 \times .072 = .00072$

while the conviction in a fair coin barely changes the amount of evidence:

$p(\pi=.5) \times \text{likelihood}(\{1,0,0,0,0\};\pi=.5) = .99 \times .031 = .031$

As a result, our disbelief that the coin is biased barely changes:

$$p(\pi=.3 \mid \{1,0,0,0,0\}) = \frac{.00072}{.031 + .00072} = .023$$
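The five-toss update is easy to check numerically. The sketch below multiplies the Bernoulli masses of the individual tosses, as in the definition of likelihood for independent events; the helper name `sequence_likelihood` is mine. Note that multiplying both likelihoods by the same constant (for instance a binomial coefficient) would leave the posterior untouched, because the constant cancels in the ratio.

```python
import numpy as np

tosses = np.array([1, 0, 0, 0, 0])  # 1 = heads, 0 = tails
prior = {0.3: 0.01, 0.5: 0.99}      # conviction in each value of pi before the data

def sequence_likelihood(x, pi):
    # Product of Bernoulli masses over independent tosses
    return np.prod(pi ** x * (1 - pi) ** (1 - x))

# Conviction before the fact, discounted by the evidence for each belief
discounted = {pi: p * sequence_likelihood(tosses, pi) for pi, p in prior.items()}
posterior_biased = discounted[0.3] / sum(discounted.values())

print('Likelihood (pi=.3) = {:.4f}'.format(sequence_likelihood(tosses, 0.3)))
print('Likelihood (pi=.5) = {:.4f}'.format(sequence_likelihood(tosses, 0.5)))
print('Posterior (pi=.3)  = {:.4f}'.format(posterior_biased))
```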

What would happen if we had more evidence? Suppose we tossed the coin 100 times and got 30 heads. Would you be convinced that it was biased, despite not believing so in the beginning? Let's see what the Bayes rule would say. Note that the number of heads in multiple independent coin tosses follows a binomial distribution. Like the Bernoulli PMF, the binomial PMF has only one free parameter, which we can interpret in the same way (the probability of heads). The likelihood function is:

$$\text{likelihood}(\{N, k\};\pi) = \binom{N}{k} \pi^{k} (1 - \pi)^{N-k}$$

where $N$ and $k$ are our data: $N$ stands for the total number of tosses, $k$ stands for the number of heads in a sequence, and $\binom{N}{k}$ is the binomial coefficient. I will use Python to help me with the calculations:

In [30]:
def binom_likelihood(N, k, pi):
    B = binom_coeff(N, k) # binomial coefficient
    likelihood = B * pi**k * (1 - pi)**(N - k) # likelihood of data, given pi
    return likelihood

# Beliefs and their strengths before data
coin_biased = .01
coin_fair = .99

# Data
N = 100
k = 30

# Updating
evidence_biased = binom_likelihood(N, k, pi=.3)
evidence_fair = binom_likelihood(N, k, pi=.5)

new_coin_biased = (evidence_biased*coin_biased)/sum([evidence_fair*coin_fair, evidence_biased*coin_biased])

print('Posterior = {:.4}'.format(new_coin_biased))

Posterior = 0.9742
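To see how the conviction shifts as evidence accumulates, the same update can be repeated for growing numbers of tosses. This is a sketch that assumes the proportion of heads stays at exactly 30% for every sample size; the toss counts are invented, and `binom_likelihood` is repeated here only so that the block runs on its own.

```python
from scipy.special import comb

def binom_likelihood(N, k, pi):
    # Binomial likelihood of k heads in N tosses, given pi
    return comb(N, k) * pi**k * (1 - pi)**(N - k)

prior_biased, prior_fair = 0.01, 0.99

# Hypothetical experiments of increasing size, each with 30% heads
experiments = [(10, 3), (20, 6), (50, 15), (100, 30)]

posteriors = []
for N, k in experiments:
    discounted_biased = binom_likelihood(N, k, pi=0.3) * prior_biased
    discounted_fair = binom_likelihood(N, k, pi=0.5) * prior_fair
    posteriors.append(discounted_biased / (discounted_biased + discounted_fair))
    print('N = {:3d}, k = {:2d}: posterior = {:.4f}'.format(N, k, posteriors[-1]))
```

With only a handful of tosses the strong prior dominates; as the sample grows, the evidence gradually overwhelms it.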


# 2. Estimating the posterior 
## 2.1. Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) is a sampling method that allows us to characterize a distribution without knowing its mathematical properties, by randomly sampling values from the distribution. MCMC can be broken down into two basic components:
- **Monte Carlo** is the practice of estimating the properties of a distribution by random sampling
- **Markov Chain** is a Markovian process in which the samples are generated sequentially, such that a sample drawn on step 1 influences the sample drawn on step 2, but no further. Each sample is thus determined only by its immediate predecessor, and the whole sequence of samples forms a Markov chain

MCMC is used in Bayesian analyses because it allows us to approximate aspects of the posterior that are not easily obtained by analytic derivation.
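As a rough illustration of both components, here is a minimal sketch of the Metropolis algorithm, which is one particular MCMC method that I am picking for its simplicity. It samples the coin's $\pi$ given 30 heads in 100 tosses. The uniform prior over $\pi$, the proposal width, the number of steps and the burn-in length are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k = 100, 30  # data: 30 heads in 100 tosses

def log_unnormalized_posterior(pi):
    # Uniform prior on (0, 1) times the binomial likelihood, up to a constant
    if pi <= 0 or pi >= 1:
        return -np.inf
    return k * np.log(pi) + (N - k) * np.log(1 - pi)

n_steps = 20000
samples = np.empty(n_steps)
current = 0.5
current_logp = log_unnormalized_posterior(current)

for i in range(n_steps):
    # Markov chain: the proposal depends only on the current sample
    proposal = current + rng.normal(scale=0.05)
    proposal_logp = log_unnormalized_posterior(proposal)
    # Accept with probability min(1, posterior ratio); otherwise stay put
    if np.log(rng.uniform()) < proposal_logp - current_logp:
        current, current_logp = proposal, proposal_logp
    samples[i] = current

# Monte Carlo: summarize the posterior from the samples (after a burn-in)
kept = samples[2000:]
print('Posterior mean of pi: {:.3f}'.format(kept.mean()))
```

Note that the sampler never needs the marginal probability $p(D)$: only the ratio of unnormalized posteriors enters the acceptance step, which is what makes the approach practical when the denominator of the Bayes rule is hard to compute.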