<a href="https://colab.research.google.com/github/USCbiostats/PM520/blob/main/Lab_7_Intro_Bayesian_Inference.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Don't Stop Believin', or: Intro to Bayesian Inference
$\newcommand{\data}{\text{Data}}$
$\newcommand{\E}{\mathbb{E}}$
So far, we've focused primarily on [likelihood](https://en.wikipedia.org/wiki/Likelihood_function)-based inference using [maximum likelihood estimation](https://en.wikipedia.org/wiki/Maximum_likelihood_estimation), or MLE. Recall that MLE roughly seeks $$\hat{\theta}_{MLE} := \arg \max_{\theta} \ell(\theta | \data) = \arg \max_{\theta} \log \Pr(\data | \theta),$$
where $\ell(\theta | \data)$ is a log-likelihood function and $\theta$ are the parameters of interest. This procedure operationally reflects identifying a value of $\theta$ such that our observed $\data$ is most likely.

In contrast to this regime, [_Bayesian_ inference](https://en.wikipedia.org/wiki/Bayesian_inference) operationally reflects updating our _beliefs_ about $\theta$ conditioned on having observed $\data$. The celebrated [Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) states,
$$\Pr(\theta | \data) = \frac{\Pr(\data | \theta) \Pr(\theta)}{\Pr(\data)},$$
where $\Pr(\theta | \data)$ is the [_posterior_ probability](https://en.wikipedia.org/wiki/Posterior_probability) for $\theta$ and reflects our uncertainty in the values that $\theta$ may take on, $\Pr(\data | \theta)$ is our likelihood, $\Pr(\theta)$ is a [_prior_ probability](https://en.wikipedia.org/wiki/Prior_probability) (or _prior_) over $\theta$ and $\Pr(\data)$ is a [_marginal_ probability/likelihood](https://en.wikipedia.org/wiki/Marginal_likelihood) of the data. In the case that $\theta$ is continuous we have,
$$\Pr(\data) = \int \Pr(\data | \theta') \Pr(\theta')d\theta',$$
and in the case that $\theta$ is discrete we have,
$$\Pr(\data) = \sum_{\theta'} \Pr(\data | \theta') \Pr(\theta').$$



## Example
Say we have a diagnostic test that correctly flags a _sick_ individual with a true-positive (TP) rate of 0.9, and incorrectly flags a healthy individual with a false-positive rate of 0.1. The prevalence of this disease in the population is 0.05. If someone takes this test and tests positive, what is the probability they are indeed sick? Formally, $\newcommand{\sick}{\text{sick}}$
$$\begin{align*}\Pr(\sick | +) &=
  \frac{\Pr(+ | \sick) \Pr(\sick)}{\Pr(+)} =
  \frac{\Pr(+ | \sick) \Pr(\sick)}{\Pr(+ | \sick)\Pr(\sick) + \Pr(+ | \neg \sick)\Pr(\neg \sick)} \\
  &= \frac{0.9 \times 0.05}{(0.9 \times 0.05) + (0.1 \times 0.95)} \approx 0.32
  \end{align*}$$
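
The same calculation can be written as a few lines of code. This is a minimal sketch; the function name `posterior_sick` and its argument names are illustrative choices, not part of the lab.

```python
def posterior_sick(tp_rate, fp_rate, prevalence):
    """Pr(sick | +) for a binary test, via Bayes' theorem."""
    # numerator: Pr(+ | sick) * Pr(sick)
    joint_sick = tp_rate * prevalence
    # marginal Pr(+): sum over both disease states
    marginal_pos = joint_sick + fp_rate * (1.0 - prevalence)
    return joint_sick / marginal_pos


post = posterior_sick(tp_rate=0.9, fp_rate=0.1, prevalence=0.05)
print(f"Pr(sick | +) = {post:.3f}")  # Pr(sick | +) = 0.321
```

Even with a fairly accurate test, the low prevalence keeps the posterior probability well below one half.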

## Bayesian Estimators
When performing Bayesian inference, we often seek to identify the posterior expectation of $\theta$ which is given by, $$\E[\theta | \data] = \int \theta \Pr(\theta | \data) d\theta$$ (or analogously using summation for discrete $\theta$). Importantly, this is a _conditional_ expectation! This is the expected value (or mean) given our observations $\data$, which is different from the unconditional, or _prior_ expectation $\E[\theta] = \int \theta \Pr(\theta) d \theta$. This further reflects how our beliefs about the values $\theta$ may take on change after having observed $\data$.

As currently stated, it is somewhat unclear why we should select $\hat{\theta} = \E[\theta | \data]$ as our [estimator](https://en.wikipedia.org/wiki/Bayes_estimator), but we can reformulate this procedure under a _risk_ framework. Let's define the risk of some estimate $\hat{\theta}$ as, $$\E[L(\theta, \hat{\theta}) | \data]$$ where $L(\theta, \hat{\theta})$ is a [loss function](https://en.wikipedia.org/wiki/Loss_function) that reflects some notion of _distance_ between some putative "true" value $\theta$ and our estimate $\hat{\theta}$. If we select a quadratic loss, we have
$$\E[(\theta - \hat{\theta})^2 | \data],$$ which is the posterior [mean squared error](https://en.wikipedia.org/wiki/Mean_squared_error), or MSE. Differentiating with respect to $\hat{\theta}$ and setting the result to zero gives $-2\E[\theta | \data] + 2\hat{\theta} = 0$, so the risk is minimized at $\hat{\theta} = \E[\theta | \data]$, which is the posterior expectation!
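
As a quick numerical check that the posterior mean minimizes the quadratic risk, the sketch below evaluates the risk over a grid of candidate estimates. The discrete posterior used here (five support points and their probabilities) is made up purely for illustration.

```python
import jax.numpy as jnp

# hypothetical discrete posterior over five values of theta (illustrative only)
theta = jnp.array([0.1, 0.3, 0.5, 0.7, 0.9])
post = jnp.array([0.10, 0.40, 0.30, 0.15, 0.05])  # probabilities sum to 1

# posterior expectation E[theta | Data]
post_mean = jnp.sum(theta * post)

# quadratic risk E[(theta - theta_hat)^2 | Data] for each candidate theta_hat
candidates = jnp.linspace(0.0, 1.0, 1001)
sq_err = (theta[None, :] - candidates[:, None]) ** 2
risk = jnp.sum(post[None, :] * sq_err, axis=1)

# the candidate with the smallest risk should sit at the posterior mean
best = candidates[jnp.argmin(risk)]
print(f"posterior mean = {float(post_mean):.3f}, risk minimizer = {float(best):.3f}")
```

The grid search is only a sanity check; the closed-form answer is the posterior mean itself.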

An alternative approach may be to seek the values $\theta$ which maximize our posterior $\Pr(\theta | \data)$, hence its name [maximum a-posteriori](https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation), or MAP for short. Its definition is given by, $$\hat{\theta}_{MAP} := \arg \max_\theta \log \Pr(\theta | \data).$$ While this can be a valid approach in some sense, it seeks a value that is most-probable given our observations $\data$, or the distributional [mode](https://en.wikipedia.org/wiki/Mode_(statistics)), which may not be a good reflection of the inherent uncertainty around its value.
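
To see how the mode and the mean can disagree, here is a minimal sketch that evaluates a skewed posterior on a grid. The unnormalized density $\theta(1 - \theta)^8$ is a hypothetical choice made only so the two estimates differ visibly.

```python
import jax.numpy as jnp

# grid over theta in [0, 1]
theta = jnp.linspace(0.0, 1.0, 2001)

# hypothetical skewed, unnormalized posterior density (illustrative only)
dens = theta * (1.0 - theta) ** 8

# normalize on the uniform grid so the weights sum to one
weights = dens / jnp.sum(dens)

# posterior mean: weighted average over the grid
post_mean = jnp.sum(theta * weights)

# MAP: the grid point where the density is largest
post_map = theta[jnp.argmax(dens)]

print(f"MAP = {float(post_map):.3f}, posterior mean = {float(post_mean):.3f}")
```

For a right-skewed posterior like this one, the mean sits to the right of the mode because it accounts for the mass in the tail.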

## Example
Let's say we have some possibly biased coin whose outcomes are heads (H) or tails (T). We can model the likelihood of observing the outcomes from a sequence of $N = N_H + N_T$ tosses using the [Binomial distribution](https://en.wikipedia.org/wiki/Binomial_distribution), given by $$\Pr(N_H, N_T | p_H) = \begin{pmatrix}N \\ N_H\end{pmatrix} p_H^{N_H}(1 - p_H)^{(N - N_H)}.$$ Furthermore we can model our _prior_ belief of the probability of observing H by a [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) as,
$$\Pr(p_H) = \frac{1}{B(\alpha, \beta)}p_H^{\alpha - 1}(1 - p_H)^{\beta - 1}.$$

Given these two definitions, we can state the posterior probability of $p_H$ given $N_H, N_T$ as,

$$\begin{align*}
\Pr(p_H | N_H, N_T) &= \frac{\Pr(N_H, N_T | p_H) \Pr(p_H)}{\Pr(N_H, N_T)} \\
  &\propto \Pr(N_H, N_T | p_H) \Pr(p_H) \\
  &= \begin{pmatrix}N \\ N_H\end{pmatrix} p_H^{N_H}(1 - p_H)^{N_T}
  \times \frac{1}{B(\alpha, \beta)}p_H^{\alpha - 1}(1 - p_H)^{\beta - 1} \\
  &\propto p_H^{N_H}(1 - p_H)^{N_T} \times p_H^{\alpha - 1}(1 - p_H)^{\beta - 1} \\
  &= p_H^{(N_H + \alpha - 1)}(1 - p_H)^{(N_T + \beta - 1)} \\
\Pr(p_H | N_H, N_T) &= \text{Beta}(p_H | \widetilde{\alpha}, \widetilde{\beta}),
\end{align*}$$
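where $\widetilde{\alpha} = N_H + \alpha$ and $\widetilde{\beta} = N_T + \beta$, since a $\text{Beta}(\widetilde{\alpha}, \widetilde{\beta})$ density is proportional to $p_H^{\widetilde{\alpha} - 1}(1 - p_H)^{\widetilde{\beta} - 1}$.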

In [None]:
# code for above example
import jax
import jax.numpy as jnp
import jax.random as rdm
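
One way the coin example might be filled in is sketched below. The true $p_H = 0.7$, the number of tosses, the PRNG seed, and the $\text{Beta}(2, 2)$ prior are all assumptions chosen for illustration. The final check confirms that the likelihood kernel times the prior differs from the Beta posterior only by a constant in $p_H$.

```python
import jax.numpy as jnp
import jax.random as rdm
from jax.scipy.stats import beta as beta_dist

# simulate tosses from a coin with a hypothetical true p_H (assumption)
key = rdm.PRNGKey(0)
true_p, n_tosses = 0.7, 50
tosses = rdm.bernoulli(key, p=true_p, shape=(n_tosses,))
n_heads = int(tosses.sum())
n_tails = n_tosses - n_heads

# Beta(2, 2) prior: a mild belief that the coin is roughly fair (assumption)
alpha, beta = 2.0, 2.0

# conjugate update: add observed counts to the prior parameters
alpha_post = alpha + n_heads
beta_post = beta + n_tails

# point estimates
mle = n_heads / n_tosses
post_mean = alpha_post / (alpha_post + beta_post)
post_map = (alpha_post - 1.0) / (alpha_post + beta_post - 2.0)  # valid when both > 1

# sanity check: log-likelihood kernel + log prior minus the Beta posterior
# log-density should be (numerically) constant across p_H
p_grid = jnp.linspace(0.2, 0.9, 8)
log_unnorm = (
    n_heads * jnp.log(p_grid)
    + n_tails * jnp.log1p(-p_grid)
    + beta_dist.logpdf(p_grid, alpha, beta)
)
log_post = beta_dist.logpdf(p_grid, alpha_post, beta_post)
gap = log_unnorm - log_post

print(f"N_H = {n_heads}, N_T = {n_tails}")
print(f"MLE = {mle:.3f}, posterior mean = {post_mean:.3f}, MAP = {post_map:.3f}")
print(f"spread of the gap across the grid = {float(gap.max() - gap.min()):.2e}")
```

The posterior mean lands between the prior mean (0.5 here) and the MLE, and moves toward the MLE as the number of tosses grows.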


## Conjugate Priors and Exponential Families
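Recall our definition of exponential families, such that the probability of an observation is given by $$\Pr(x | \eta) \propto \exp(\eta \cdot T(x) - A(\eta)).$$ If our _prior_ over $\eta$ is of the _same_ exponential family, or,
$$\Pr(\eta) \propto \exp(\eta \cdot \chi - \nu A(\eta)),$$
then multiplying the prior by the likelihood simply adds the exponents, and the posterior stays in that family with updated parameters,
$$\Pr(\eta | x) \propto \exp(\eta \cdot (\chi + T(x)) - (\nu + 1) A(\eta)).$$
Such a prior is called [_conjugate_](https://en.wikipedia.org/wiki/Conjugate_prior) to the likelihood.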

In [None]:
# code for above using our ExpFam classes
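
The course's ExpFam classes are not shown in this section, so the sketch below does not use them. It instead hand-codes the Bernoulli family in natural parameters ($\eta = \log\frac{p}{1 - p}$, $T(x) = x$, $A(\eta) = \log(1 + e^{\eta})$) to illustrate the conjugate update $\chi \leftarrow \chi + \sum_i T(x_i)$, $\nu \leftarrow \nu + n$. The helper names, the data, and the prior values are all hypothetical.

```python
import jax.numpy as jnp


def log_partition(eta):
    """A(eta) for the Bernoulli family: log(1 + exp(eta))."""
    return jnp.logaddexp(0.0, eta)


def log_prior_kernel(eta, chi, nu):
    """Unnormalized log conjugate prior: eta * chi - nu * A(eta)."""
    return eta * chi - nu * log_partition(eta)


def conjugate_update(chi, nu, suff_stats):
    """Add the summed sufficient statistics and the sample size."""
    return chi + jnp.sum(suff_stats), nu + suff_stats.shape[0]


# six hypothetical coin tosses (1 = heads); for Bernoulli, T(x) = x
x = jnp.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])

# prior parameters; on the p scale this choice corresponds to Beta(chi, nu - chi) = Beta(2, 2)
chi0, nu0 = 2.0, 4.0
chi1, nu1 = conjugate_update(chi0, nu0, x)

# check on a grid of eta: log prior kernel + log likelihood equals the
# log kernel evaluated at the updated parameters
eta_grid = jnp.linspace(-3.0, 3.0, 13)
log_lik = jnp.sum(
    eta_grid[:, None] * x[None, :] - log_partition(eta_grid)[:, None], axis=1
)
lhs = log_prior_kernel(eta_grid, chi0, nu0) + log_lik
rhs = log_prior_kernel(eta_grid, chi1, nu1)

print(f"updated chi = {float(chi1)}, updated nu = {float(nu1)}")
print(f"max |lhs - rhs| = {float(jnp.max(jnp.abs(lhs - rhs))):.2e}")
```

With four heads and two tails, the updated parameters correspond to $\text{Beta}(6, 4)$ on the $p$ scale, matching the Beta-Binomial update from the coin example.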