# Bayesian Modeling

- Bowen Li
- 2018/01/17

## Introduction

Bayesian modeling is one of the most important machine learning techniques. This notebook summarizes its methodology and implementation.

## Bayes' Theorem

Thomas Bayes and Pierre-Simon Laplace were aware of what is now known as **Bayes' theorem**, or the so-called **inverse probability**, which is a simple and beautiful relation between two conditional probabilities, each the inverse of the other. I will borrow from [Michael Hochster (Director of Data Science at Stitch Fix)'s Quora answer](https://www.quora.com/What-is-an-intuitive-explanation-of-Bayes-Rule/answer/Michael-Hochster) and provide another example to illustrate the concept of Bayes' theorem.

**Example (Rich and Happy).** Your friend is trying to convince you that money cannot buy happiness, citing a Harvard study that shows only 10% of happy people are rich; that is, $P(\text{rich}\ |\ \text{happy}) = 10\%$. What we really want to know, however, is what proportion of rich people are happy, that is, $P(\text{happy}\ |\ \text{rich})$. How can we obtain it?

Let's do some probability calculation:

$$
P(\text{happy}\ |\ \text{rich})
= \frac{P(\text{happy}, \text{rich})}{P(\text{rich})}
= \frac{P(\text{rich}\ |\ \text{happy}) \times P(\text{happy})}{P(\text{rich})}
$$

From the above equation we can see that we need to know the proportion of happy people in the whole population and the proportion of rich people. Suppose we know that 40% of people are happy and 5% of people are rich; then the result follows. Summarizing the information we have:

- 10% of happy people are rich: $P(\text{rich}\ |\ \text{happy}) = 10\%$
- 40% of people are happy: $P(\text{happy}) = 40\%$
- 5% of people are rich: $P(\text{rich}) = 5\%$

Hence $P(\text{happy}\ |\ \text{rich}) = 10\% \times 40\% / 5\% = 80\%$; that is, 80% of rich people are happy. So a strong majority of rich people are happy. Let's work hard and smart to be rich. :-)
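The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the helper name `bayes_rule` is made up for illustration, and the three inputs are the numbers assumed in the example.

```python
# Inputs taken from the Rich and Happy example.
p_rich_given_happy = 0.10  # P(rich | happy)
p_happy = 0.40             # P(happy)
p_rich = 0.05              # P(rich)


def bayes_rule(p_b_given_a, p_a, p_b):
    """Return P(a | b) = P(b | a) * P(a) / P(b). Hypothetical helper."""
    return p_b_given_a * p_a / p_b


p_happy_given_rich = bayes_rule(p_rich_given_happy, p_happy, p_rich)
print(f"P(happy | rich) = {p_happy_given_rich:.0%}")
```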

**Example (Breast Cancer).** TBD

## Probability Models

Most of Bayesian modeling amounts to calculating the following quantities in one way or another:

- **Likelihood:** $p(x | \theta)$
- **Prior distribution:** $p(\theta)$
- **Marginal likelihood:**
  $p(x) = \int p(x | \theta) p(\theta) \, d\theta$
- **Posterior distribution:**
  $p(\theta | x) = p(x | \theta) p(\theta) / p(x) \propto p(x | \theta) p(\theta)$
- **Predictive distribution:**
  $p(x_{new} | x) = \int p(x_{new} | \theta) p(\theta | x) \, d\theta$
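To make these five quantities concrete, here is a rough grid-approximation sketch. It assumes a Binomial likelihood, a Uniform(0, 1) prior, and made-up data (7 successes in 10 trials); none of these choices come from the text above, and the integrals are replaced by midpoint sums.

```python
from math import comb


def likelihood(x, n, theta):
    """Binomial likelihood p(x | theta) for x successes in n trials."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


# Discretize theta on a midpoint grid so that integrals become sums.
m = 1000
width = 1.0 / m
grid = [(i + 0.5) * width for i in range(m)]
prior = [1.0 for _ in grid]  # assumed prior density: Uniform(0, 1)

x, n = 7, 10  # hypothetical data

# Marginal likelihood: p(x) = integral of p(x | theta) p(theta) dtheta.
marginal = sum(likelihood(x, n, t) * p for t, p in zip(grid, prior)) * width

# Posterior density p(theta | x) evaluated on the grid.
posterior = [likelihood(x, n, t) * p / marginal for t, p in zip(grid, prior)]

# Predictive probability that one new trial succeeds:
# p(x_new = 1 | x) = integral of theta * p(theta | x) dtheta.
predictive = sum(t * q for t, q in zip(grid, posterior)) * width

print(f"marginal = {marginal:.4f}, predictive = {predictive:.4f}")
```

With a uniform prior the exact answers are $p(x) = 1 / (n + 1)$ and $p(x_{new} = 1 | x) = (x + 1) / (n + 2)$, which gives a quick sanity check on the grid values.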

### De Finetti's Theorem

TODO

## Point Estimation

### MLE vs. MAP

**Maximum Likelihood Estimation (MLE).** Find the $\theta$ that **maximizes the likelihood function** $p(x | \theta)$:

$$
\theta_{MLE} = \arg\max_{\theta} p(x | \theta)
$$

**Maximum a Posteriori (MAP).** Find the $\theta$ that **maximizes the posterior distribution** $p(\theta | x)$:

$$
\theta_{MAP} = \arg\max_{\theta} p(\theta | x) = \arg\max_{\theta} p(x | \theta) p(\theta)
$$
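As an illustration of the difference, the sketch below compares the two estimators for a Binomial likelihood with a Beta prior. The model, the data, and the function names are assumptions chosen for the example. Under them the MLE is $x / n$ and the MAP is the mode of the Beta posterior, $(x + \alpha - 1) / (n + \alpha + \beta - 2)$, which is valid when both updated Beta parameters exceed 1.

```python
def binomial_mle(x, n):
    """MLE of theta for x successes in n Binomial trials."""
    return x / n


def binomial_map(x, n, alpha, beta):
    """MAP of theta under a Beta(alpha, beta) prior.

    This is the mode of Beta(alpha + x, beta + n - x); the formula
    assumes alpha + x > 1 and beta + n - x > 1.
    """
    return (x + alpha - 1) / (n + alpha + beta - 2)


x, n = 7, 10  # hypothetical data
theta_mle = binomial_mle(x, n)
theta_map = binomial_map(x, n, alpha=2, beta=2)
print(f"MLE = {theta_mle:.3f}, MAP = {theta_map:.3f}")
```

With a flat Beta(1, 1) prior the two estimates coincide, since the prior term is constant in $\theta$.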

### Bayes Estimator

TODO

Nevertheless, MLE, MAP, and the Bayes estimator above provide only point estimates of the parameter of interest, $\theta$. In typical Bayesian modeling, we would like to obtain the full distribution of $\theta$ given the observed data $x$.

## Conjugate Prior


**Conjugate Prior.** A family of prior distributions is conjugate to a likelihood if multiplying any prior in the family by the likelihood yields a posterior in the same family.

### Binomial-Beta Conjugacy

- Likelihood: $x | \theta \sim Binomial(n, \theta)$

$$
p(x | \theta) = \binom{n}{x} \theta^x (1 - \theta)^{n - x}
$$

- Prior: $\theta \sim Beta(\alpha, \beta)$

$$
p(\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)} \theta^{\alpha-1} (1 - \theta)^{\beta-1}
$$

- Posterior: $\theta | x \sim Beta(\alpha^*, \beta^*)$, with $\alpha^* = \alpha + x$ and $\beta^* = \beta + n - x$

For details please see later.
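A minimal sketch of the Binomial-Beta update, using the standard result that the posterior is $Beta(\alpha + x, \beta + n - x)$. The function name and the example numbers are hypothetical.

```python
def beta_binomial_update(alpha, beta, x, n):
    """Return the Beta posterior parameters after x successes in n trials."""
    return alpha + x, beta + (n - x)


# Hypothetical prior Beta(2, 2) and data: 7 successes in 10 trials.
alpha_post, beta_post = beta_binomial_update(2.0, 2.0, x=7, n=10)

# The mean of Beta(a, b) is a / (a + b).
posterior_mean = alpha_post / (alpha_post + beta_post)
print(alpha_post, beta_post, round(posterior_mean, 4))
```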

### Multinomial-Dirichlet Conjugacy

- Likelihood: $x | \theta \sim Multinomial(n, \theta_1,...,\theta_k)$

$$
p(x | \theta) = \binom{n}{x_1,...,x_k} \theta_1^{x_1} \cdots \theta_k^{x_k}
$$

- Prior: $\theta \sim Dirichlet(\alpha_1,...,\alpha_k)$

$$
p(\theta) = \frac{\Gamma(\sum_{i=1}^k \alpha_i)}{\prod_{i=1}^k \Gamma(\alpha_i)} \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}
$$

- Posterior: $\theta | x \sim Dirichlet(\alpha_1^*,...,\alpha_k^*)$, with $\alpha_i^* = \alpha_i + x_i$

For details please see later.
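The Multinomial-Dirichlet update follows the same pattern: each concentration parameter is incremented by the matching count, $\alpha_i^* = \alpha_i + x_i$. The sketch below uses a made-up three-category example and a hypothetical function name.

```python
def dirichlet_multinomial_update(alphas, counts):
    """Return the Dirichlet posterior parameters given category counts."""
    if len(alphas) != len(counts):
        raise ValueError("alphas and counts must have the same length")
    return [a + c for a, c in zip(alphas, counts)]


# Hypothetical prior Dirichlet(1, 1, 1) and counts from n = 10 trials.
alphas_post = dirichlet_multinomial_update([1, 1, 1], [3, 5, 2])

# The mean of Dirichlet(a_1, ..., a_k) is a_i / sum(a).
total = sum(alphas_post)
posterior_mean = [a / total for a in alphas_post]
print(alphas_post, posterior_mean)
```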

### Poisson-Gamma Conjugacy

- Likelihood: $x | \theta \sim Poisson(\theta)$

$$
p(x | \theta) = \frac{e^{-\theta} \theta^x}{x!}
$$

- Prior: $\theta \sim Gamma(\alpha_1, \alpha_2)$

$$
p(\theta) = \frac{\alpha_2^{\alpha_1}}{\Gamma(\alpha_1)} \theta^{\alpha_1 - 1} e^{-\alpha_2 \theta}
$$

- Posterior: $\theta | x \sim Gamma(\alpha_1^*, \alpha_2^*)$, with $\alpha_1^* = \alpha_1 + x$ and $\alpha_2^* = \alpha_2 + 1$

For details please see later.
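For several i.i.d. Poisson observations, the update generalizes to adding the sum of the counts to the shape $\alpha_1$ and the number of observations to the rate $\alpha_2$. A small sketch, with a hypothetical function name and made-up data:

```python
def gamma_poisson_update(shape, rate, observations):
    """Return the Gamma posterior (shape, rate) given Poisson counts.

    Uses the rate parameterization, matching the prior density above.
    """
    return shape + sum(observations), rate + len(observations)


# Hypothetical prior Gamma(shape=2, rate=1) and three observed counts.
shape_post, rate_post = gamma_poisson_update(2.0, 1.0, [3, 4, 2])

# The mean of Gamma(shape, rate) is shape / rate.
posterior_mean = shape_post / rate_post
print(shape_post, rate_post, posterior_mean)
```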

### General Exponential Family Conjugacy

- Likelihood: 

$$
p(x | \eta) = h(x) \exp(\eta^T T(x) - A(\eta))
$$

- Prior:

$$
p(\eta) = H(\tau, n_0) \exp(\tau^T \eta - n_0 A(\eta))
$$

- Posterior:

$$
p(\eta | x) = H(\tau^*, n_0^*) \exp(\tau^{*T}\eta - n_0^* A(\eta))
$$

where, after observing $n$ i.i.d. data points $x_1, \dots, x_n$, the prior parameters $(\tau, n_0)$ are replaced by $(\tau^*, n_0^*) = (\tau + \sum_{j=1}^n T(x_j), n_0 + n)$ in the posterior. For details please see later.
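The parameter update can be read off by multiplying the prior by the likelihood of $n$ i.i.d. observations and collecting the terms in $\eta$. The following is a sketch of that step, dropping factors that do not depend on $\eta$:

$$
\begin{aligned}
p(\eta | x_1, \dots, x_n)
&\propto \left[ \prod_{j=1}^n h(x_j) \exp(\eta^T T(x_j) - A(\eta)) \right] H(\tau, n_0) \exp(\tau^T \eta - n_0 A(\eta)) \\
&\propto \exp\left( \left(\tau + \sum_{j=1}^n T(x_j)\right)^T \eta - (n_0 + n) A(\eta) \right)
\end{aligned}
$$

The last line has the same functional form as the prior, so the normalizing constant must be $H(\tau + \sum_j T(x_j), n_0 + n)$.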

### Gaussian-Gaussian Conjugacy

- Likelihood: $x | \mu, \sigma^2 \sim Gaussian(\mu, \sigma^2)$
- If $\sigma^2$ is fixed,
  * Prior: $\mu \sim Gaussian(\mu_0, \sigma_0^2)$
  * Posterior: $\mu | x \sim Gaussian(\cdot)$
  * Note: $Var(X) = E[Var(X | \mu)] + Var[E(X | \mu)]$
- If $\mu$ is fixed,
  * Prior: $\sigma^2 \sim InverseGamma(\theta_1, \theta_2)$
  * Posterior: $\sigma^2 | x \sim InverseGamma(\cdot)$
  * Note: $X \sim InverseGamma(\cdot) \Rightarrow X^{-1} \sim Gamma(\cdot)$
- If both $(\mu, \sigma^2)$ are unknown, fit by stagewise modeling in terms of the precision $\tau = 1 / \sigma^2$:
  * $\mu | \tau \sim N(\mu_0, (n_0 \tau)^{-1})$, i.e. with precision $n_0 \tau$
  * $\tau \sim Gamma(\alpha, \beta)$
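For the fixed-$\sigma^2$ case, the posterior of $\mu$ has a closed form in which precisions (inverse variances) add. The sketch below implements that standard result for $n$ i.i.d. observations; the function name and the numbers are made up for illustration.

```python
def gaussian_mean_update(mu0, var0, observations, var):
    """Posterior (mean, variance) of mu for a Gaussian with known variance.

    Prior: mu ~ Gaussian(mu0, var0). Likelihood: x_j ~ Gaussian(mu, var).
    """
    n = len(observations)
    # Posterior precision is the prior precision plus n data precisions.
    precision_post = 1.0 / var0 + n / var
    # Posterior mean is the precision-weighted combination of prior and data.
    mean_post = (mu0 / var0 + sum(observations) / var) / precision_post
    return mean_post, 1.0 / precision_post


# Hypothetical prior Gaussian(0, 1), known variance 1, three observations.
mean_post, var_post = gaussian_mean_update(0.0, 1.0, [1.0, 2.0, 3.0], 1.0)
print(mean_post, var_post)
```

The posterior mean lands between the prior mean and the sample mean, and the posterior variance is smaller than the prior variance.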

### Multivariate Gaussian Conjugacy

- Likelihood: $x | \mu, \Sigma \sim MultivariateGaussian(\mu, \Sigma)$
  * Similar to the univariate Gaussian case.
- If $\Sigma$ is fixed,
  * Prior: $\mu \sim MultivariateGaussian(\mu_0, \Sigma_0)$
  * Posterior: $\mu | x \sim MultivariateGaussian(\cdot)$
- If $\mu$ is fixed,
  * Prior: $\Sigma \sim InverseWishart(\theta_1, \theta_2)$
  * The inverse-Wishart is the multivariate generalization of the inverse-gamma.
  * The precision matrix $\Sigma^{-1} \sim Wishart(\cdot)$, which is the multivariate generalization of the gamma.
  * Posterior: $\Sigma | x \sim InverseWishart(\cdot)$
- If both $(\mu, \Sigma)$ are unknown:
  * $\mu | \Sigma \sim N(\mu_0, \Sigma / n_0)$
  * $\Sigma \sim InverseWishart(\alpha, \beta)$
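For the fixed-$\Sigma$ case, the posterior of $\mu$ mirrors the univariate result, with precision matrices in place of scalar precisions. As a sketch, assuming $n$ i.i.d. observations with sample mean $\bar{x}$ (the symbols $\mu_n$ and $\Sigma_n$ are introduced here for illustration):

$$
\Sigma_n = \left( \Sigma_0^{-1} + n \Sigma^{-1} \right)^{-1}, \qquad
\mu_n = \Sigma_n \left( \Sigma_0^{-1} \mu_0 + n \Sigma^{-1} \bar{x} \right)
$$

so that $\mu | x_1, \dots, x_n \sim MultivariateGaussian(\mu_n, \Sigma_n)$.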

## Jeffreys Priors

- Invariance principle: the Jeffreys prior is invariant under a change of variable (reparameterization) of $\theta$.
- Jeffreys prior: proportional to the square root of the Fisher information,
  $\pi_J(\theta) \propto I(\theta)^{1/2}$.
- Fisher information:
  $I(\theta) = E\left[ -\frac{d^2}{d\theta^2} \log p(x | \theta) \right]$
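As a worked example, consider a single Bernoulli trial, $p(x | \theta) = \theta^x (1 - \theta)^{1 - x}$ with $x \in \{0, 1\}$. This model is chosen here only for illustration. Using $E[x] = \theta$:

$$
\begin{aligned}
\log p(x | \theta) &= x \log \theta + (1 - x) \log (1 - \theta) \\
-\frac{d^2}{d\theta^2} \log p(x | \theta) &= \frac{x}{\theta^2} + \frac{1 - x}{(1 - \theta)^2} \\
I(\theta) &= \frac{1}{\theta} + \frac{1}{1 - \theta} = \frac{1}{\theta (1 - \theta)} \\
\pi_J(\theta) &\propto \theta^{-1/2} (1 - \theta)^{-1/2}
\end{aligned}
$$

which is the kernel of a $Beta(1/2, 1/2)$ distribution.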

## References

- Jordan (2010)'s lecture notes on Bayesian Modeling and Inference.
- Hochster's Quora answer on [What is an intuitive explanation of Bayes' Rule?](https://www.quora.com/What-is-an-intuitive-explanation-of-Bayes-Rule/answer/Michael-Hochster)