In this post, I'm going to write about how the ever-versatile normal distribution can be used to approximate a Bayesian posterior distribution, a result that is not a direct consequence of the central limit theorem.  The result has a straightforward proof (given some assumptions), which I will attempt to sketch, and I'll also simulate a few different scenarios to see how it works in practice.

<!-- TEASER_END -->

# Background

We'll start by reviewing some simple terminology and definitions regarding Bayesian methods to make the later discussion a bit easier to follow.  The main problem we're trying to solve: given a probability model \\(\mathcal{M}\\) (e.g. normal distribution) and some data denoted by \\(y\\) (e.g. observations of customer purchases or readings from a machine), we wish to find a probability distribution of the parameters \\(\theta\\) (e.g. \\(\mu\\) and \\(\sigma\\) in the case of a normal distribution).  The basic idea is given by Bayes' theorem:

$$
P(\theta | y) = \frac{P(y | \theta) P(\theta)}{P(y)} = C \cdot P(y | \theta) P(\theta) \tag{1}
$$

Some definitions:

 * \\(P(\theta | y)\\) is called the **posterior** distribution.
 * \\(P(y | \theta)\\) is called the **likelihood function**.
 * \\(P(\theta)\\) is called the **prior** distribution.
 * \\(P(y)\\) is called the **marginal likelihood**.

Notice the second form in Equation 1, where the \\(\frac{1}{P(y)}\\) term is replaced by a constant \\(C\\).  It is usually written this way because once you've collected all your data, it's fixed.  Thus, the probability of it occurring does not change with respect to the parameters of your model \\(\theta\\).  Think of it as a normalizing constant that makes the posterior a proper probability distribution (i.e. one that sums, or for continuous \\(\theta\\) integrates, to \\(1\\)).
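To make the role of the normalizing constant concrete, here is a minimal sketch of Equation 1 on a grid.  It assumes a hypothetical example that is not part of the original discussion: \\(y\\) is 3 successes in 10 Bernoulli trials, \\(\theta\\) is the success rate, and the prior is flat on \\([0, 1]\\).  The function name `grid_posterior` is made up for illustration.

```python
def grid_posterior(successes, trials, grid_size=1001):
    """Approximate P(theta | y) on an evenly spaced grid over [0, 1].

    Assumes a Bernoulli likelihood and a flat prior (illustrative choices).
    """
    thetas = [i / (grid_size - 1) for i in range(grid_size)]
    width = 1.0 / (grid_size - 1)
    prior = 1.0  # flat prior density P(theta) = 1 on [0, 1]
    # Unnormalized posterior: likelihood times prior.  The binomial
    # coefficient is dropped because it does not depend on theta.
    unnorm = [t**successes * (1 - t) ** (trials - successes) * prior for t in thetas]
    # P(y) is approximated by a Riemann sum over theta, so C = 1 / marginal.
    marginal = sum(unnorm) * width
    density = [u / marginal for u in unnorm]
    return thetas, density, width


thetas, density, width = grid_posterior(successes=3, trials=10)
total = sum(density) * width  # should be 1 up to floating point error
mode = thetas[density.index(max(density))]
print("integral of posterior:", total)
print("posterior mode:", mode)
```

Dividing by the Riemann sum is exactly the multiplication by \\(C\\) in Equation 1: it rescales the curve without changing its shape or the location of its peak.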

[Bayesian inference](https://en.wikipedia.org/wiki/Bayesian_inference) usually follows these high level steps:

1. Decide on a probability model \\(\mathcal{M}\\).
2. Decide on a prior distribution that encodes your previous knowledge about the problem.  For example, if we're modeling the average conversion rate of an email campaign, we might look at previous campaigns and find a number around (or a distribution centred at) 1%.
3. Given samples \\(y\\), find the posterior distribution of your model parameters \\(\theta\\) according to Equation 1.  This is usually done via simulation except for the most basic cases where the posterior has a closed form expression.
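As a rough illustration of these three steps in one of the basic closed-form cases, the sketch below uses a binomial model for the email campaign example with a Beta prior, which is a standard conjugate pair.  The prior `Beta(1, 99)` (mean 1%) and the campaign numbers are assumptions invented for the example, as are the function names.

```python
def beta_binomial_update(prior_a, prior_b, conversions, sends):
    """Return the Beta posterior parameters after observing binomial data."""
    post_a = prior_a + conversions
    post_b = prior_b + (sends - conversions)
    return post_a, post_b


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Step 1: the model is Binomial(sends, theta), with theta the conversion rate.
# Step 2: a prior centred at 1%, here Beta(1, 99) (an assumed choice).
prior_a, prior_b = 1, 99

# Step 3: hypothetical campaign data, 18 conversions out of 1000 emails.
post_a, post_b = beta_binomial_update(prior_a, prior_b, conversions=18, sends=1000)

print("prior mean:    ", beta_mean(prior_a, prior_b))
print("posterior:      Beta(%d, %d)" % (post_a, post_b))
print("posterior mean:", beta_mean(post_a, post_b))
```

The posterior mean lands between the prior mean and the observed rate of 1.8%, and moves closer to the observed rate as more emails are sent.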

Of course, there is a lot of nuance to each of these steps but by and large this is what usually happens.  

The big contrast between this method and more traditional statistical methods (frequentist methods) is that the latter focus more on the likelihood function.  A popular estimate for \\(\theta\\) is the [maximum likelihood estimate](https://en.wikipedia.org/wiki/Maximum_likelihood) (MLE), the value of \\(\theta\\) that maximizes the likelihood function.  There's a huge amount of literature written on the differences between the two, but here are some of the bigger points:

* The MLE provides a point estimate (a single value) of the parameter (sometimes accompanied by a confidence interval).  The Bayesian method provides a proper probability distribution for the parameter.
* The Bayesian method is "more" subjective because of the use of the prior.  The reasoning is that you *should* use everything you know about the problem when conducting inference.  However, finding a prior that accurately represents your state of knowledge is definitely a challenge.
* The MLE is in a way still "subjective" because the choice of model \\(\mathcal{M}\\) is an implicit "prior".
* Bayesian methods treat the model parameters as random variables (i.e. the posterior) while frequentist methods usually treat the parameters as [fixed](link://slug/hypothesis-testing) (with the assumption that there is some theoretical "true" value) and what you're trying to find is a confidence interval that "traps" the fixed value a certain percentage of the time.
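The first point can be seen in a small sketch.  The data (7 successes in 20 Bernoulli trials) and the flat `Beta(1, 1)` prior are assumptions chosen only for illustration.  The interval is estimated from random draws, so the exact numbers depend on the seed.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

successes, trials = 7, 20  # hypothetical data

# Frequentist point estimate: the MLE of a Bernoulli rate is the sample proportion.
mle = successes / trials

# Bayesian answer: with a flat Beta(1, 1) prior, the posterior is
# Beta(1 + successes, 1 + failures), a full distribution over theta.
post_a = 1 + successes
post_b = 1 + (trials - successes)
post_mean = post_a / (post_a + post_b)

# Summarize the posterior with a 95% interval estimated from sorted draws.
draws = sorted(random.betavariate(post_a, post_b) for _ in range(20000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]

print("MLE:                  ", mle)
print("posterior mean:       ", round(post_mean, 4))
print("95%% posterior interval: (%.3f, %.3f)" % (lower, upper))
```

The MLE is one number, while the posterior supports statements such as "\\(\theta\\) lies in this interval with 95% probability", which is a different claim from the coverage guarantee of a confidence interval.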

With that incredibly brief introduction to Bayesian methods, let's see how we can approximate the posterior with a normal distribution.


# Normal Approximation to the Posterior Distribution [<sup>[1]</sup>](#fn-1)


### Assumptions 

The key assumption is that we have independent and identically distributed ([i.i.d](https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables)) data, \\(y_1, \ldots, y_n\\), drawn from the "true distribution", which we label as \\(f(\cdot)\\) for its density function.  These results only hold given that we have a "true distribution" from which the data was sampled.  If this is not the case, it isn't hard to pick some specific values of \\(y\\) that violate the result presented here.

If the "true distribution" \\(f(\cdot)\\) actually comes from the same family of distributions as our model \\(\mathcal{M}\\), then the "true distribution" should have a fixed unique parameter \\(\theta_0\\).  We expect our posterior distribution to concentrate around \\(\theta_0\\) as the number of samples \\(n\\) increases.  If \\(f(\cdot)\\) is not part of the same family of distributions as our model, then we expect that there is a unique value \\(\theta_0\\) in our model that minimizes the "discrepancy" between our model and the true distribution given by \\(f(\cdot)\\).  The discrepancy between the true distribution \\(F\\) and our model distribution \\(\mathcal{M}\\) is measured by the [Kullback-Leibler (KL) divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence):

$$
KL(F || \mathcal{M}) = E\left[\log\left(\frac{f(y_i)}{p(y_i|\theta)}\right)\right] = \int \log\left(\frac{f(y_i)}{p(y_i|\theta)}\right) f(y_i) \, dy_i
$$

If we denote our model density by \\(p(\cdot|\theta)\\), then we can see that the KL divergence is minimized when \\(p(y_i|\theta) = f(y_i)\\), where the \\(\log\\) evaluates to \\(0\\).  When they're not equal, if the densities are close to each other, the \\(\log\\) factor is small and so is the KL divergence.  Otherwise, we get a larger value.
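To see what \\(\theta_0\\) looks like when the true distribution is outside the model family, the sketch below evaluates the KL divergence numerically and minimizes it over a grid.  The specific choices are assumptions for illustration only: the true distribution is Exponential with rate 1, and the model is a normal distribution with unknown mean \\(\theta\\) and variance fixed at 1.

```python
import math


def kl_to_normal_model(theta, upper=30.0, steps=2000):
    """KL(F || M) for F = Exponential(1) and model N(theta, 1).

    The integral over y is approximated with the midpoint rule on (0, upper).
    """
    h = upper / steps
    total = 0.0
    for j in range(steps):
        y = (j + 0.5) * h
        log_f = -y  # log density of Exponential(rate=1)
        log_p = -0.5 * math.log(2 * math.pi) - 0.5 * (y - theta) ** 2
        total += (log_f - log_p) * math.exp(log_f) * h
    return total


# Grid search over candidate values of theta in [0, 2].
thetas = [i / 100 for i in range(201)]
kl_values = [kl_to_normal_model(t) for t in thetas]
kl_min = min(kl_values)
theta_0 = thetas[kl_values.index(kl_min)]

print("theta_0:", theta_0)
print("KL at theta_0:", round(kl_min, 4))
```

For this particular model the minimizer is the mean of the true distribution, so the grid search should settle at \\(\theta_0 = 1\\).  The divergence at the minimum is still positive because no normal density matches an exponential one exactly.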

We can already see that, given this setup, as \\(n\\) grows, the posterior will become more and more concentrated around our theoretical \\(\theta_0\\).

There are also a few other assumptions about the "regularity" of the likelihood function and prior.  A few examples are that the likelihood function is continuous, \\(\theta_0\\) is not on the boundary of the parameter space, and the prior is non-zero at the point of convergence.  See section 4.3 in *Bayesian Data Analysis* for a more complete treatment.

### Proof Outline

The proof takes three steps:

1. Show that, given a finite parameter space, the posterior probability at \\(\theta_0\\) approaches \\(1\\) as \\(n \rightarrow \infty\\).
2. Show that the result extends to a continuous parameter space by chopping it up into a finite set of intervals, of which the one containing \\(\theta_0\\) will converge to probability \\(1\\) as \\(n \rightarrow \infty\\), as above.
3. Show that as the mass of the posterior density function gathers around \\(\theta_0\\), the distribution can be approximated by a normal distribution.
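Before the proofs, here is a numerical preview of where step 3 ends up.  It is a sketch under assumptions that are not taken from the proof itself: the data are Bernoulli with a flat prior, so the posterior is a Beta distribution, and the normal approximation uses the common construction of centring at the posterior mode with variance equal to the inverse of the negative second derivative of the log posterior at the mode.  All function names are made up for the example.

```python
import math


def beta_pdf(t, a, b):
    """Density of Beta(a, b) at t, computed in log space for stability."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(t) + (b - 1) * math.log(1 - t))


def normal_pdf(t, mean, sd):
    """Density of N(mean, sd^2) at t."""
    z = (t - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def normal_approx(a, b):
    """Mode and curvature-based standard deviation for Beta(a, b), a, b > 1."""
    mode = (a - 1) / (a + b - 2)
    # Negative second derivative of the log density, evaluated at the mode.
    curvature = (a - 1) / mode**2 + (b - 1) / (1 - mode) ** 2
    return mode, math.sqrt(1 / curvature)


def relative_error(successes, trials, grid_size=2000):
    """Largest gap between the exact and approximate densities, relative to the peak."""
    a, b = 1 + successes, 1 + trials - successes  # flat Beta(1, 1) prior
    mode, sd = normal_approx(a, b)
    grid = [i / grid_size for i in range(1, grid_size)]
    exact = [beta_pdf(t, a, b) for t in grid]
    approx = [normal_pdf(t, mode, sd) for t in grid]
    worst = max(abs(e - g) for e, g in zip(exact, approx))
    return worst / max(exact)


err_small = relative_error(successes=3, trials=10)
err_large = relative_error(successes=300, trials=1000)
print("n = 10:   relative error", round(err_small, 4))
print("n = 1000: relative error", round(err_large, 4))
```

With the same observed proportion, the gap between the exact posterior and its normal approximation should shrink as \\(n\\) grows, because the skew of the posterior fades as the mass gathers around a single point.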

We'll show the proof of each one of these steps separately.

**Theorem 1**: If the parameter space \\(\Theta\\) is finite and \\(P(\theta = \theta_0) > 0\\), then \\(P(\theta=\theta_0 | y) \rightarrow 1\\) as \\(n \rightarrow \infty\\), where \\(\theta_0\\) is the value that minimizes the KL divergence.
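A small simulation can make the statement of Theorem 1 tangible.  The setup is an assumption chosen for illustration: Bernoulli data, a five-point parameter space with a uniform prior, and a true rate of 0.3 that is itself one of the candidate values (so the KL minimizer \\(\theta_0\\) is simply the true parameter).  The function name `finite_posterior` is hypothetical.

```python
import math
import random


def finite_posterior(thetas, prior, successes, trials):
    """Posterior over a finite set of Bernoulli rates, computed in log space."""
    log_post = [
        math.log(p) + successes * math.log(t) + (trials - successes) * math.log(1 - t)
        for t, p in zip(thetas, prior)
    ]
    shift = max(log_post)  # subtract the max to avoid underflow
    weights = [math.exp(lp - shift) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


random.seed(0)  # fixed seed so the sketch is repeatable
thetas = [0.1, 0.3, 0.5, 0.7, 0.9]  # finite parameter space
prior = [0.2] * len(thetas)  # uniform prior, so P(theta = theta_0) > 0
true_theta = 0.3

data = [1 if random.random() < true_theta else 0 for _ in range(1000)]

results = {}
for n in (10, 100, 1000):
    post = finite_posterior(thetas, prior, sum(data[:n]), n)
    results[n] = post[thetas.index(true_theta)]
    print("n = %4d  P(theta = 0.3 | y) = %.4f" % (n, results[n]))
```

The posterior probability at \\(\theta_0\\) may wobble for small \\(n\\), but it should be very close to \\(1\\) by \\(n = 1000\\), since every other candidate is penalized by a log-likelihood gap that grows roughly linearly in \\(n\\).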




## References and Further Reading

* Wikipedia: [Bayesian Inference](https://en.wikipedia.org/wiki/Bayesian_inference), [Maximum Likelihood Estimate](https://en.wikipedia.org/wiki/Maximum_likelihood)
* *Bayesian Data Analysis*, Gelman, Carlin, Stern


## Notes

List of Notes: [^1]

[^1]: This section was based upon the proof in Appendix "Outline of proofs of limit theorems" in *Bayesian Data Analysis*.  I recommend taking a look at it for the full details.

