In [15]:
%matplotlib notebook
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb as binom_coeff

# 1. Understanding Bayesian inference

## 1.1. Intuition

The Bayes rule provides a formal (mathematical) description of a type of inference: the act of guessing that is aided by observation and knowledge. The Bayes rule equates the **posterior** probability $p(h \mid D)$ to the product of the **prior** probability $p(h)$ and the **likelihood** $p(D \mid h)$, normalized by the **marginal** probability $p(D)$:

$$p(h \mid D) = \frac{p(h)p(D \mid h)}{p(D)}$$

For someone who is not used to hearing this statistical jargon, this formula may appear a bit incoherent. But try not to resort to rote memorization. The intuition behind this is crucial and, frankly, quite elegant.

If you are like me, you find it hard to think in new and unfamiliar terms. Try replacing some of the words that you are not absolutely comfortable with by more everyday notions. For instance, you can swap "probability" with "strength of a belief" or "conviction". Note the important distinction between belief _strength_ and belief _content_. When we talk about the _probability_ of "a coin being fair", we are talking about the _conviction_ in a "belief". From here on, I will refer to belief strength as conviction and to belief content simply as a belief (for example, the "belief that the coin is fair"). Furthermore, feel free to replace the Latin terms "posterior" and "prior" with the more intuitive "after-the-fact" and "before-the-fact" (respectively). Thus, you can think of the _prior probability_ as _conviction before the fact_ and the _posterior probability_ as _conviction after the fact_. The "fact" here just means data or observation. Note that conviction after an observation and conviction before an observation are on the opposite sides of the _equation_ -- they are _related_, but not identical. Observations can change the strength of our beliefs and the Bayes rule describes one way in which this can happen.

The initial and updated convictions differ by a factor that includes the likelihood term and the marginal term. What are they? Let's start with the former. Likelihood (not to be confused with probability) is a kind of score for the quality of some belief (i.e. belief accuracy). How do we assess our beliefs? Directly or indirectly, we do it by comparing them to our observations. When a belief is accurate, it can be used to generate guesses that correspond to (our subjective) reality. More observations enable a better assessment. To give you a fun example, suppose I am convinced of the belief that "my farts don't smell". Using a sophisticated "cupping" procedure, I find out that as a matter of fact they are deeply, quintessentially foul. My conviction is no good, and I should take it less seriously. But note that if I only sniff the cup once and it happens to stink, I should not be so quick to say that my conviction is entirely inaccurate (maybe it was a freak accident?). I think it helps greatly to think of likelihood as the accuracy of a belief, keeping in mind that the only way to test this accuracy is to compare the belief to reality.

Finally, let's break down the marginal likelihood term $p(D)$, before we put all the terms together and make intuitive sense of the Bayes rule. Mathematically, marginalization is a kind of averaging of likelihoods across beliefs, $p(D) = \int_{H} p(D \mid h)p(h)\,dh$ (there is an equivalent formulation for a discrete set of beliefs, in which we take a probability-weighted sum instead of an integral). Note that the integrand contains the familiar likelihood term, only now we are considering the entire space of beliefs included in $H$. Therefore, in simpler terms, the marginal likelihood is the expected accuracy of various beliefs, weighted by their respective strengths or convictions. Accordingly, if we are equally convinced of all our beliefs about something, the marginal likelihood is simply the mean accuracy of these beliefs.

Very well. We now should be able to think about the Bayes rule in a very intuitive way: 

> I am aware of various beliefs about the world in which I can be more or less convinced. Experiencing new things may change my convictions. How? Well, if my belief corresponds well with my observations, I should believe in it more strongly. On the other hand, if there are more accurate alternative beliefs that fit my observations better, I should be less convinced of the belief I endorse. That is all there is to it. Changing convictions in my beliefs before-the-fact requires me to come up with answers to two questions: (1) how accurate is my belief against what I observed, and (2) how accurate are other beliefs against what I observed.

So, the right-hand side of the Bayes rule simply tells us how the data updates an old belief $p(h)$ into a new belief $p(h \mid D)$. If we rearrange it slightly, it becomes very clear that the strength of my initial belief should grow in proportion to its accuracy and shrink in proportion to accuracies of other beliefs:

$$\text{updated conviction in belief}= \text{initial conviction in belief} \times \frac{\text{accuracy of initial belief}}{\text{accuracy of alternative beliefs}}$$

formally expressed as:

$$p(h \mid D) =  p(h) \times \frac{p(D \mid h)}{p(D)}$$
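To make the update rule concrete, here is a minimal sketch with just two competing beliefs about a coin. The two candidate biases (0.5 and 0.3), the equal initial convictions, and the toss sequence are illustrative assumptions chosen for this sketch.

```python
import numpy as np

# Illustrative assumptions: two candidate beliefs and equal initial convictions
beliefs = np.array([0.5, 0.3])    # belief content: probability of heads
prior = np.array([0.5, 0.5])      # conviction before the fact, p(h)

data = np.array([1, 0, 0, 0, 0])  # hypothetical tosses (1 = heads)
heads = data.sum()
tails = data.size - heads

# Accuracy of each belief against the data, p(D | h)
likelihood = beliefs**heads * (1 - beliefs)**tails

# Expected accuracy across all beliefs, p(D)
marginal = np.sum(likelihood * prior)

# Conviction after the fact, p(h | D)
posterior = prior * likelihood / marginal

print(posterior, posterior.sum())
```

The belief that predicts the observed tosses better gains conviction at the expense of the other, and the updated convictions still sum to 1.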

## 1.2. A simple example

Statisticians love to talk about unfair coins, even though it is a myth and you can't really bias a coin flip. However, imagine some extraordinary and authoritative person gave you a couple of coins and told you that one of them is fair, but the other is in fact biased to come up heads only 30% of the time. Since you trust your source a lot, you are almost convinced that it is true.

We can formalize our belief about the fairness of a coin using a binomial random variable $X_{binomial}$, whose probability distribution has only one free parameter:

$$P(X_{binomial}=k \mid N, \pi) = f(\pi) = \binom{N}{k} \pi^{k} (1 - \pi)^{N-k}$$

where $k$ and $N$ are constants expressing the number of heads ($k$) in some number of tosses ($N$), and $\binom{N}{k}$ is the binomial coefficient. The parameter $\pi$ expresses our belief in a precise numerical manner. If we believe that the coin is fair, $\pi$ should be exactly $.5$. It corresponds to the proportion of heads we expect (under our belief) if we tossed the coin indefinitely. If we think the coin is biased towards heads, we could say that $\pi = .8$. In general, $p(\pi = x)$ corresponds to the strength of the belief that the coin is biased to come up heads a proportion $x$ of the time in the long run.
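One way to see how $\pi$ encodes a belief is to evaluate the binomial formula for several candidate values. The sketch below assumes a hypothetical outcome of 1 head in 5 tosses and uses `scipy.stats.binom.pmf` to score each candidate; the grid of candidates is an arbitrary choice.

```python
import numpy as np
from scipy.stats import binom

k, N = 1, 5                            # hypothetical data: 1 head in 5 tosses
candidates = np.linspace(0.1, 0.9, 9)  # candidate values of pi

# Binomial probability of k heads in N tosses under each candidate belief
scores = binom.pmf(k, N, candidates)

best = candidates[np.argmax(scores)]
for p, s in zip(candidates, scores):
    print(f"pi = {p:.1f}: score = {s:.4f}")
print("best-scoring candidate:", round(best, 1))
```

Among these candidates, the one closest to the observed proportion of heads ($1/5$) scores highest.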

Let's say our prior belief is that the coin is in fact biased to come up heads _exactly_ 30% of the time, so $\pi = .3$. Suppose we then toss the coin five times and observe one head followed by four tails. So, how does observing this sequence update our beliefs? The Bayes rule! Note that we need to take great care in describing our beliefs in a formal way.

First, what is our initial belief? Saying that we believe that the coin should come up heads _exactly_ 30% of the time is saying that the probability that $\pi=.3$ is 100%:

$$p(\pi=.3) = 1.0$$

This is our initial belief before observation.

For starters, we can compute the accuracy of our belief against the data (i.e. compute the likelihood). Any likelihood function is a function of its parameters, holding the data fixed (remember, we are assessing the quality of our beliefs against the data). The binomial likelihood of one head in five tosses is:

$$P(X_{binomial}=1 \mid N=5, \pi=.3) = \binom{5}{1} \times .3^{1} \times .7^{4}$$

We could compute this by hand, but let's use the computer to do it:

In [17]:
def binom_likelihood(data, pi):
    """Binomial likelihood of the number of heads in `data`, given pi."""
    k = np.sum(data)      # number of heads (ones)
    N = np.size(data)     # number of tosses
    B = binom_coeff(N, k) # binomial coefficient, "N choose k"
    P = B * pi**k * (1 - pi)**(N - k) # likelihood of data, given pi
    return P

D  = np.array([1,0,0,0,0])
pi = .3

L = binom_likelihood(D, pi)
# Note the colon in the format spec: '{0:.4f}', not '{0.4f}'
print('Yay, so the belief that pi==.3 is \'{0:.4f}\' likely'.format(L))  # roughly 0.36

The likelihood is less than 1, so it looks like the prior should be scaled down by quite a lot! This does not really make sense: a belief that the coin is rigged to come up heads only 20% of the time ($\theta = .2$, where $\theta$ plays the role of $\pi$ above) should, if anything, be reinforced by observing a sample with 20% heads. For this particular sequence, the likelihood under that belief is

$$p(D \mid \theta=.2) = \prod_{i=1}^{|D|} 0.2^{D_i} \, 0.8^{1 - D_i} = 0.2 \times 0.8 \times 0.8 \times 0.8 \times 0.8 = 0.08192$$

The reason is that we still haven't calculated the marginal probability. To calculate it we need to consider the entire space of parameters (we only have $\theta$, which ranges between 0 and 1) and for each parameter value compute the likelihood of our data, weighted by the prior. As a shortcut, we can count all possible sequences of 0s and 1s of length 5, of which there are $2^5$, and treat them as equally likely, which tells us that $p(D) = \frac{1}{2^5} = 0.03125$. Strictly speaking, this is the probability of the sequence under a fair coin; averaging the likelihood over a uniform prior on $\theta$ instead gives $\int_0^1 \theta (1 - \theta)^4 \, d\theta = \frac{1}{30} \approx 0.0333$, which is close.

So, the factor by which we update our belief is $.08192 / .03125 \approx 2.6$ (or about $2.5$ with the uniform-prior marginal).
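The averaging over beliefs can also be checked numerically. The sketch below assumes a uniform prior over $\theta$ (an assumption made for this sketch) and approximates the marginal probability of the sequence by averaging the likelihood over a fine grid; the grid size is arbitrary.

```python
import numpy as np

D = np.array([1, 0, 0, 0, 0])
heads = np.sum(D == 1)
tails = np.sum(D == 0)

# Midpoints of a fine grid over theta in (0, 1); the grid size is an arbitrary choice
n_grid = 10_000
thetas = (np.arange(n_grid) + 0.5) / n_grid

# Likelihood of the exact sequence under each theta
likelihoods = thetas**heads * (1 - thetas)**tails

# Under a uniform prior (an assumption), p(D) is the mean likelihood
marginal = likelihoods.mean()

# Ratio by which the conviction in theta = .2 is updated
factor = (0.2**heads * 0.8**tails) / marginal
print(marginal, factor)
```

Under this assumption the average comes out close to $1/30$, so the conviction in $\theta = .2$ grows by a factor of roughly 2.5.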

In [18]:
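def sequence_likelihood(data, given):
    """Probability of the exact 0/1 sequence in `data` when P(heads) = `given`."""
    # No binomial coefficient here: the order of the tosses is taken as fixed
    return (given**np.sum(data==1))*((1-given)**np.sum(data==0))

# D = np.array([1,0,0,0,0])
# thetas = np.linspace(0,1,50)
# likelihoods = [sequence_likelihood(D, given=theta) for theta in thetas]
# print(sequence_likelihood(D, given=.2))

# # plt.close()
# plt.plot(thetas, likelihoods)
# print(1/2**5)

y = 15      # number of heads
N = 20      # number of tosses
D = np.zeros(N)
D[:y] = 1   # first y tosses are heads, the rest are tails
likelihood0 = sequence_likelihood(D, .5)  # likelihood under a fair coin, 0.5**20
print(D,likelihood0)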

[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 0. 0.] 9.5367431640625e-07


# 2. Estimating the posterior 
## 2.1. Markov Chain Monte Carlo
Markov Chain Monte Carlo (MCMC) is a sampling method that allows us to characterize a distribution without knowing all of its mathematical properties, by randomly sampling values from it. MCMC can be broken down into two basic components:
- **Monte Carlo** is the practice of estimating the properties of a distribution by random sampling
- **Markov Chain** refers to a Markovian process in which the samples are generated sequentially, such that each sample depends only on the one drawn immediately before it. For example, the sample drawn on step 1 directly influences the sample drawn on step 2, but it affects later samples only through the ones in between

MCMC is used in Bayesian analyses because it allows us to approximate aspects of the posterior that are not easily obtained by analytic derivation.
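As a rough illustration of the idea, the sketch below runs a random-walk Metropolis sampler on a coin with 15 heads in 20 tosses. Metropolis is one common MCMC algorithm, picked here as an assumption since the text does not name a specific one. The uniform prior, the proposal width, the number of steps, and the burn-in length are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

heads, tails = 15, 5  # 15 heads in 20 tosses

def log_unnormalized_posterior(theta):
    """Log of likelihood times a uniform prior on (0, 1)."""
    if theta <= 0 or theta >= 1:
        return -np.inf
    return heads * np.log(theta) + tails * np.log(1 - theta)

n_steps = 20_000
samples = np.empty(n_steps)
theta = 0.5  # arbitrary starting value

for i in range(n_steps):
    # Monte Carlo part: propose a random move around the current value
    proposal = theta + rng.normal(scale=0.1)
    # Markov part: the decision depends only on the current value
    log_ratio = log_unnormalized_posterior(proposal) - log_unnormalized_posterior(theta)
    if np.log(rng.uniform()) < log_ratio:
        theta = proposal
    samples[i] = theta

kept = samples[2_000:]  # discard burn-in samples
print(kept.mean())
```

With a uniform prior the exact posterior here is a Beta(16, 6) distribution with mean $16/22 \approx 0.727$, so the sample mean should land near that value. Note that the sampler never needs the marginal $p(D)$, because it cancels in the ratio of posteriors.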

In [1]:
.8 ** 4

0.4096000000000001