References:
* Pattern Recognition and Machine Learning (Bishop)
* Wikipedia

# Bayesian inference


## Bayesian inference for the Binomial distribution

$\text{Bin}(r|n, \mu) = {n \choose r} \mu^r (1-\mu)^{n-r}$.


Suppose that the prior distribution of $\mu$ is a __Beta distribution__. That is, $$p(\mu) = \text{Beta}(\mu|a,b) \propto  \mu^{a-1} (1-\mu)^{b-1}$$ for some $a,b > 0$. Then the posterior distribution of $\mu$, after observing $r$ successes in $n$ trials, is as follows:
$$p(\mu|n,r) \propto p(\mu) p(r|n,\mu) \propto  \mu^{r+a-1} (1-\mu)^{n-r+b-1}$$


Thus $p(\mu|n,r) = \text{Beta}(\mu|r+a,n-r+b)$.


Summary:
* prior = $\text{Beta}(\mu|a,b)$
* likelihood function = $\text{Bin}(r|n, \mu)$
* posterior = $\text{Beta}(\mu|r+a,n-r+b)$
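
The update in the summary amounts to adding the observed counts to the prior parameters. A minimal sketch, assuming hypothetical helper names (`beta_binomial_update`, `beta_mean`) and illustrative numbers that are not part of the original notes:

```python
def beta_binomial_update(a, b, n, r):
    """Return the posterior Beta parameters (r + a, n - r + b)."""
    if not 0 <= r <= n:
        raise ValueError("r must satisfy 0 <= r <= n")
    return a + r, b + (n - r)


def beta_mean(a, b):
    """Mean of Beta(a, b), which is a / (a + b)."""
    return a / (a + b)


# Illustrative values: prior Beta(2, 2), then 7 successes in 10 trials.
a_post, b_post = beta_binomial_update(a=2.0, b=2.0, n=10, r=7)
print(a_post, b_post)               # 9.0 5.0
print(beta_mean(a_post, b_post))    # posterior mean, 9/14
```

The posterior mean $9/14$ lies between the prior mean $1/2$ and the observed frequency $7/10$, as expected when the prior acts like extra pseudo-counts.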

## Bayesian inference for the Multinomial distribution

$\text{Mult}(\mathbf{r}|\boldsymbol{\mu},n) = \displaystyle{{n \choose r_1\cdots r_m} \prod_{i=1}^m \mu_i^{r_i}}$, where $\sum_{i=1}^m r_i = n$.


Suppose that the prior distribution of $\boldsymbol{\mu}=(\mu_1,\ldots,\mu_m)$ is a __Dirichlet distribution__. That is, $$p(\boldsymbol{\mu}) = \text{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha}) \propto  \prod_{i=1}^m \mu_i^{\alpha_i -1}$$ for some $\boldsymbol{\alpha} = (\alpha_1,\ldots,\alpha_m)$ with $\alpha_i > 0$. Then the posterior distribution of $\boldsymbol{\mu}$ is as follows:

$$p(\boldsymbol{\mu}|n,\mathbf{r}) \propto p(\boldsymbol{\mu}) p(\mathbf{r}|n,\boldsymbol{\mu}) \propto \prod_{i=1}^m \mu_i^{r_i + \alpha_i -1}$$

Thus $p(\boldsymbol{\mu}|n,\mathbf{r}) = \text{Dir}(\boldsymbol{\mu}|\mathbf{r} + \boldsymbol{\alpha})$.


Summary:

* prior = $\text{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})$
* likelihood function = $\text{Mult}(\mathbf{r}|\boldsymbol{\mu},n)$
* posterior = $\text{Dir}(\boldsymbol{\mu}|\mathbf{r} + \boldsymbol{\alpha})$
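
As in the binomial case, the posterior parameters are the prior parameters plus the observed counts, component by component. A small sketch, assuming hypothetical function names and made-up counts for three categories:

```python
def dirichlet_multinomial_update(alpha, counts):
    """Return the posterior Dirichlet parameters r + alpha (elementwise)."""
    if len(alpha) != len(counts):
        raise ValueError("alpha and counts must have the same length")
    return [a + r for a, r in zip(alpha, counts)]


def dirichlet_mean(alpha):
    """Mean of Dir(alpha): each alpha_i divided by the sum of all alpha_i."""
    total = sum(alpha)
    return [a / total for a in alpha]


# Illustrative values: uniform prior Dir(1, 1, 1) and counts r = (3, 5, 2).
alpha_post = dirichlet_multinomial_update([1.0, 1.0, 1.0], [3, 5, 2])
print(alpha_post)                   # [4.0, 6.0, 3.0]
print(dirichlet_mean(alpha_post))   # posterior mean of each mu_i
```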

# Gaussian distribution

## Maximum likelihood for the Gaussian

Suppose $\mathcal{D} = \{x_1,\ldots,x_N\}$ is a dataset of _i.i.d._ samples drawn from $N(\mu,\sigma^2)$. Maximizing the log likelihood function
$$\ln p(\mathcal{D}|\mu,\sigma^2) = \sum_{n=1}^N \ln p(x_n|\mu,\sigma^2),$$ we get

* $\mu_{ML} = \frac{1}{N}\sum_n x_n$
* $\sigma_{ML}^2 = \frac{1}{N} \sum_n (x_n - \mu_{ML})^2$


and we can show 
* $E[\mu_{ML}] = \mu$
* $E[\sigma_{ML}^2] = \frac{N-1}{N} \sigma^2$

where $\mu$ and $\sigma$ are the population mean and standard deviation. 

Note that $\frac{1}{N} \sum_n (x_n - \mu_{ML})^2$ is a biased estimate of variance and $\frac{1}{N-1} \sum_n (x_n - \mu_{ML})^2$ is unbiased.
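
The bias factor $\frac{N-1}{N}$ can be checked numerically by averaging $\sigma_{ML}^2$ over many simulated datasets. The following is only a rough Monte Carlo sketch; the function names, the seed, and the chosen values of $\mu$, $\sigma$, and $N$ are assumptions made for illustration:

```python
import random


def ml_estimates(xs):
    """Return (mu_ML, sigma2_ML) for a list of samples."""
    n = len(xs)
    mu_ml = sum(xs) / n
    var_ml = sum((x - mu_ml) ** 2 for x in xs) / n   # divides by N, not N - 1
    return mu_ml, var_ml


def average_ml_variance(mu, sigma, n, trials, seed=0):
    """Average sigma2_ML over `trials` simulated datasets of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        total += ml_estimates(xs)[1]
    return total / trials


avg = average_ml_variance(mu=1.0, sigma=2.0, n=5, trials=20000)
expected = (5 - 1) / 5 * 2.0 ** 2    # (N - 1) / N * sigma^2 = 3.2
print(avg, expected)
```

With a small $N$ such as 5, the average should land near $3.2$ rather than the true variance $4$, which makes the underestimate easy to see.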

Maximum likelihood solutions are prone to over-fitting; the systematic underestimate of the variance above is one example of this.


## Bayesian inference for the Gaussian distribution

* likelihood function: $p(\mathcal{D}|\mu,\sigma^2) = \prod_{n=1}^N N(x_n|\mu,\sigma^2)$
* prior: 
    * When $\sigma^2$ is known, assume $p(\mu)$ follows $N(\mu_0,\sigma_0^2)$ for some $\mu_0$ and $\sigma_0$.
    * When $\mu$ is known, let $\lambda=\sigma^{-2}$ and assume $p(\lambda)$ follows $\text{Gamma}(\lambda|a,b) \propto \lambda^{a-1} e^{-b\lambda}$ for some $a,b > 0$.
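
The notes list only the priors. The corresponding posteriors are the standard conjugate results (they are not stated in the text above, so they are spelled out here as an addition): for known $\sigma^2$, the posterior of $\mu$ is $N(\mu_N,\sigma_N^2)$ with $\frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2}$ and $\mu_N = \sigma_N^2\left(\frac{\mu_0}{\sigma_0^2} + \frac{N\mu_{ML}}{\sigma^2}\right)$; for known $\mu$, the posterior of $\lambda$ is $\text{Gamma}(\lambda|a + \frac{N}{2},\, b + \frac{1}{2}\sum_n (x_n-\mu)^2)$. A minimal sketch with hypothetical function names and toy data:

```python
def posterior_mean_known_variance(xs, sigma2, mu0, sigma0_2):
    """Posterior N(mu_N, sigma_N^2) of mu when sigma^2 is known."""
    n = len(xs)
    mu_ml = sum(xs) / n
    precision_n = 1.0 / sigma0_2 + n / sigma2        # precisions add up
    sigma_n2 = 1.0 / precision_n
    mu_n = sigma_n2 * (mu0 / sigma0_2 + n * mu_ml / sigma2)
    return mu_n, sigma_n2


def posterior_precision_known_mean(xs, mu, a, b):
    """Posterior Gamma(a_N, b_N) of lambda = 1 / sigma^2 when mu is known."""
    n = len(xs)
    a_n = a + n / 2
    b_n = b + 0.5 * sum((x - mu) ** 2 for x in xs)
    return a_n, b_n


data = [1.0, 2.0, 3.0]   # toy data, chosen only for illustration
print(posterior_mean_known_variance(data, sigma2=1.0, mu0=0.0, sigma0_2=1.0))  # (1.5, 0.25)
print(posterior_precision_known_mean(data, mu=2.0, a=1.0, b=1.0))              # (2.5, 2.0)
```

In the first case the posterior mean $1.5$ sits between the prior mean $0$ and $\mu_{ML} = 2$, weighted by the respective precisions.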
   
   
## Mixtures of Gaussians

$p(\mathbf{x}) = \sum_{k=1}^K \pi_k N(\mathbf{x}|\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$, where $\sum_{k=1}^K\pi_k = 1$ and $\pi_k\geq 0, \forall k$.
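
To make the definition concrete, the density can be evaluated as a weighted sum of component densities. The sketch below is restricted to the one-dimensional case (scalar means and variances instead of $\boldsymbol{\mu}_k$ and $\boldsymbol{\Sigma}_k$) to stay within the standard library; the function names and parameter values are assumptions for illustration:

```python
import math


def normal_pdf(x, mu, sigma2):
    """Density of the univariate Gaussian N(x | mu, sigma2)."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)


def mixture_pdf(x, weights, means, variances):
    """Density of a 1-D Gaussian mixture: sum_k pi_k * N(x | mu_k, sigma2_k)."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


# Illustrative two-component mixture.
weights, means, variances = [0.3, 0.7], [-2.0, 1.0], [1.0, 0.5]
print(mixture_pdf(0.0, weights, means, variances))
```

The constraints on $\pi_k$ are exactly what guarantees that the mixture integrates to 1, which is why the sketch rejects weights that violate them.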

# Akaike information criterion (AIC) and Bayesian information criterion (BIC)

Reference: https://en.wikipedia.org

Given a collection of models for the data, AIC estimates the quality of each model, relative to each of the other models. Thus, AIC provides a means for model selection.

Suppose that we have a statistical model of some data. Let $k$ be the number of estimated parameters in the model. Let $\hat{L}$ be the maximum value of the likelihood function for the model. Then the AIC value of the model is the following. $$\text{AIC} = 2k - 2\ln(\hat{L})$$

Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value. 

AIC rewards goodness of fit (as assessed by the likelihood function), but it also includes a penalty that is an increasing function of the number of estimated parameters. The penalty discourages overfitting, which is desired because increasing the number of parameters in the model almost always improves the goodness of the fit.



The formula for the Bayesian information criterion (BIC) is similar to the formula for AIC, but with a different penalty for the number of parameters: $$\text{BIC} = k\ln n - 2\ln(\hat{L})$$ With AIC the penalty is $2k$, whereas with BIC the penalty is $k\ln n$, where $n$ is the number of observations in the data.
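
Both criteria are simple to compute once the maximized log likelihood $\ln(\hat{L})$ is available. A minimal sketch; the function names and the two candidate models with their log likelihoods are invented purely for illustration:

```python
import math


def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L_hat), with log_likelihood = ln(L_hat)."""
    return 2 * k - 2 * log_likelihood


def bic(k, n, log_likelihood):
    """BIC = k ln(n) - 2 ln(L_hat), with n the number of observations."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical candidates: (number of parameters, maximized log likelihood).
candidates = {"simple": (3, -100.0), "complex": (8, -97.0)}
n_obs = 100

for name, (k, ll) in candidates.items():
    print(name, aic(k, ll), bic(k, n_obs, ll))
```

In this made-up comparison the simpler model has the lower value under both criteria, because the gain of 3 in log likelihood does not offset the penalty for 5 extra parameters. Since $\ln n > 2$ once $n \geq 8$, BIC penalizes extra parameters more heavily than AIC for all but very small datasets.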