$\newcommand{\ind}[1]{\left[#1\right]}$

### Model

- Data set
$$ {\cal D} = \{ x_1, \dots, x_N \} $$
- Model with parameter $\theta$
$$ p({\cal D} | \theta) $$

<img src="fig13b.png" width='360' align='center'>


### Maximum Likelihood

- Maximum Likelihood (ML)
$$ \theta^{\text{ML}} = \arg\max_{\theta} \log p({\cal D} | \theta) $$
- Predictive distribution
$$ p(x_{N+1} |  {\cal D} ) \approx  p(x_{N+1} |  \theta^{\text{ML}})  $$

### Maximum A Posteriori

- Prior
$$ p(\theta) $$

- Maximum a posteriori (MAP): regularised maximum likelihood
$$
\theta^{\text{MAP}} = \arg\max_{\theta} \log \left[ p({\cal D} | \theta) p(\theta) \right]
$$

- Predictive distribution
$$ p(x_{N+1} |  {\cal D} ) \approx  p(x_{N+1} |  \theta^{\text{MAP}})  $$

### Bayesian Learning

- We treat parameters on the same footing as all other variables
- We integrate over unknown parameters rather than using point estimates (remember the many-dice example)
 - Self-regularisation, avoids overfitting
 - Natural setup for online adaptation
 - Model selection


- Predictive distribution
\begin{eqnarray}
p(x_{N+1} ,  {\cal D} ) &=& \int d\theta \;\; p(x_{N+1} |  \theta) p( {\cal D}| \theta) p(\theta)  \\
 &=& \int d\theta \;\; p(x_{N+1}|  \theta) p( {\cal D}, \theta)   \\
 &=& \int d\theta \;\; p(x_{N+1}|  \theta) p(  \theta| {\cal D}) p({\cal D})   \\
 &=&  p({\cal D}) \int d\theta \;\; p(x_{N+1}|  \theta) p(  \theta| {\cal D})  \\
p(x_{N+1} |  {\cal D} ) &=& \int d\theta \;\; p(x_{N+1} |  \theta) p(\theta | {\cal D}) 
\end{eqnarray}

The interpretation is that past data update the prior $p(\theta)$ to the posterior $p(\theta | {\cal D})$, which then plays the role of the prior for the current prediction.

- Bayesian learning is just inference ...




### Independent Coin Flips

Suppose we have a coin that is flipped several times independently. A natural question is whether we can predict the outcome of the next flip.

It depends. If we already know that the coin is fair, there is nothing to learn from past data, and future flips are indeed independent of previous flips. However, if we don't know the bias of the coin, we can estimate it from past data to make a better prediction. Mathematically, the model is identical to

<img src="fig13b.png" width='220' align='center'>

Here, $\theta$ is the parameter of the coin.

#### Maximum Likelihood Estimation

We observe the outcome of $N$ coin flips $\{x^{(n)}\}_{n=1\dots N}$ where $x^{(n)} \in \left\{0,1\right\}$. The model is a Bernoulli distribution with parameter $\pi = (\pi_0, \pi_1)$. We have $\pi_0 = 1 - \pi_1$ where $0 \leq \pi_1 \leq 1$. 

\begin{eqnarray}
x^{(n)} & \sim & p(x|\pi) = (1-\pi_1)^{1-x^{(n)} } \pi_1^{x^{(n)} }
\end{eqnarray}

The loglikelihood is 

\begin{eqnarray}
{\cal L}(\pi_1) & = & \sum_{n=1}^N (1- x^{(n)}) \log (1 - \pi_1) + \sum_{n=1}^N x^{(n)} \log (\pi_1)  \\
& = & \log (1 - \pi_1) \sum_{n=1}^N (1- x^{(n)})  + \log (\pi_1) \sum_{n=1}^N x^{(n)}  
\end{eqnarray}

We define the number of $0$'s  
\begin{eqnarray}
c_0 = \sum_{n=1}^N (1- x^{(n)})
\end{eqnarray}
and $1$'s as
\begin{eqnarray}
c_1 = \sum_{n=1}^N  x^{(n)}
\end{eqnarray}

\begin{eqnarray}
{\cal L}(\pi_1) & = & \log (1 - \pi_1) c_0  + \log (\pi_1) c_1  
\end{eqnarray}

We compute the gradient 
\begin{eqnarray}
\frac{\partial}{\partial \pi_1} {\cal L}(\pi_1) & = & - \frac{c_0}{1 - \pi_1} + \frac{c_1}{\pi_1}  = 0 
\end{eqnarray}

The solution is quite predictable
\begin{eqnarray}
\pi_1 & = &\frac{c_1}{c_0 + c_1}  = \frac{c_1}{N}  
\end{eqnarray}
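
The closed-form estimate can be checked numerically. The following is a minimal sketch, not part of the original derivation: the data vector `x` is made up for illustration, and the helper name `log_likelihood` is hypothetical. It computes the counts $c_0$ and $c_1$, evaluates $\pi_1 = c_1/N$, and compares the result with a brute-force search over a grid of candidate values.

```python
import math

# Hypothetical observations (made up for illustration): 1 and 0 are the two outcomes.
x = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]

c1 = sum(x)                    # number of 1's
c0 = sum(1 - xn for xn in x)   # number of 0's

def log_likelihood(pi1, c0, c1):
    """Bernoulli log-likelihood: c0 * log(1 - pi1) + c1 * log(pi1)."""
    return c0 * math.log(1.0 - pi1) + c1 * math.log(pi1)

# Closed-form maximum likelihood estimate.
pi1_ml = c1 / (c0 + c1)

# Brute-force check: pick the grid point with the highest log-likelihood.
grid = [k / 1000 for k in range(1, 1000)]
pi1_grid = max(grid, key=lambda p: log_likelihood(p, c0, c1))

print(pi1_ml, pi1_grid)
```

The grid search is only a sanity check; it agrees with the closed form up to the grid resolution.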

#### Maximum A-posteriori estimation

We need a prior over the probability parameter. One choice is the beta distribution

\begin{eqnarray}
p(\pi_1) & = & \mathcal{B}(\pi_1; \alpha, \beta) =  \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta) } \pi_1^{\alpha-1} (1-\pi_1)^{\beta-1}
\end{eqnarray}

The log joint distribution of the data and the parameter is
\begin{eqnarray}
\log p(X, \pi_1) & = & \log p(\pi_1) + \sum_{n=1}^N \log p(x^{(n)}|\pi_1) \\
& = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\
& & + (\alpha-1) \log \pi_1 + (\beta-1) \log(1-\pi_1) \\
& & + c_1 \log (\pi_1)  + c_0 \log (1 - \pi_1) \\
& = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\
& & + (\alpha + c_1 -1) \log \pi_1 + (\beta + c_0 -1) \log(1-\pi_1) 
\end{eqnarray}

The gradient is 

\begin{eqnarray}
\frac{\partial}{\partial \pi_1} \log p(X, \pi_1)  & = & - \frac{\beta + c_0 -1}{1 - \pi_1} + \frac{\alpha + c_1 -1}{\pi_1}  = 0 
\end{eqnarray}

We can solve for the parameter.
\begin{eqnarray}
\pi_1 (\beta + c_0 -1) & = & (1 - \pi_1) (\alpha + c_1 -1) \\ 
\pi_1 \beta + \pi_1 c_0 - \pi_1 & = & \alpha  + c_1  - 1 - \pi_1 \alpha - \pi_1 c_1 + \pi_1 \\ 
\pi_1  & = & \frac{\alpha - 1  + c_1}{\alpha + \beta  - 2 + c_0 + c_1}
\end{eqnarray}

When the prior is flat, i.e., when $\alpha = \beta = 1$, MAP and ML solutions coincide.
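
A small sketch of the MAP formula, under assumptions that are not in the original text: the counts are hypothetical, the function name `map_estimate` is made up, and the formula is only used where $\alpha + c_1 > 1$ and $\beta + c_0 > 1$ so that the stationary point is an interior maximum. It illustrates that a flat prior reproduces the ML estimate, while a symmetric prior with $\alpha = \beta > 1$ pulls the estimate towards $1/2$.

```python
def map_estimate(c0, c1, alpha, beta):
    """MAP estimate of pi_1 under a Beta(alpha, beta) prior.

    Assumes alpha + c1 > 1 and beta + c0 > 1 (interior maximum).
    """
    return (alpha - 1 + c1) / (alpha + beta - 2 + c0 + c1)

# Hypothetical counts of 0's and 1's.
c0, c1 = 3, 7

pi1_ml = c1 / (c0 + c1)                                # maximum likelihood
pi1_flat = map_estimate(c0, c1, alpha=1.0, beta=1.0)   # flat prior
pi1_reg = map_estimate(c0, c1, alpha=5.0, beta=5.0)    # prior centred at 1/2

print(pi1_ml, pi1_flat, pi1_reg)
```

With these counts the regularised estimate lies between $1/2$ and the ML estimate, which is the sense in which MAP acts as regularised maximum likelihood.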

#### Full Bayesian inference

We infer the posterior

\begin{eqnarray}
p(\pi_1| X) & = & \frac{p(\pi_1, X)}{p(X)} 
\end{eqnarray}


The log joint density is 
\begin{eqnarray}
\log p(X, \pi_1) & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\
& & + (\alpha + c_1 -1) \log \pi_1 + (\beta + c_0 -1) \log(1-\pi_1) 
\end{eqnarray}

At this stage, we may try to evaluate the integral 
$$
p(X)  =  \int d\pi_1 p(X, \pi_1) 
$$

Rather than evaluating this integral directly, a simple approach works in the spirit of 'completing the square': we add and subtract terms so that part of the expression can be identified as the log of a known, normalized density. Here the added and subtracted terms are the log normalizing constant of a beta density with parameters $\alpha + c_1$ and $\beta + c_0$.

\begin{eqnarray}
\log p(X, \pi_1) & = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\
& & - \log \Gamma(\alpha + \beta + c_0 + c_1) + \log \Gamma(\alpha + c_1) + \log \Gamma(\beta + c_0) \\
& & + \log \Gamma(\alpha + \beta + c_0 + c_1) - \log \Gamma(\alpha + c_1) - \log \Gamma(\beta + c_0) \\
& & + (\alpha + c_1 -1) \log \pi_1 + (\beta + c_0 -1) \log(1-\pi_1) \\
& = & \log \Gamma(\alpha + \beta) -\log \Gamma(\alpha) - \log \Gamma(\beta) \\
& & - \log \Gamma(\alpha + \beta + c_0 + c_1) + \log \Gamma(\alpha + c_1) + \log \Gamma(\beta + c_0) \\
& & + \log \mathcal{B}(\pi_1; \alpha + c_1, \beta + c_0) \\
& = & \log p(X) + \log p(\pi_1| X)
\end{eqnarray}

Exponentiating both sides of the resulting expression, we see that
\begin{eqnarray}
p(\pi_1| X) & = & \mathcal{B}(\pi_1; \alpha + c_1, \beta + c_0) \\
p(X) & = & \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(\alpha + c_1)\Gamma(\beta + c_0)}{\Gamma(\alpha + \beta + c_0 + c_1)}
\end{eqnarray}
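
The closed-form marginal likelihood can be compared against a direct numerical evaluation of the integral $\int d\pi_1\, p(X, \pi_1)$. This is a sketch under stated assumptions: the counts and hyperparameters are hypothetical, the function names are made up, and the integral is approximated with a plain midpoint rule rather than any particular library routine.

```python
import math

def log_marginal_likelihood(c0, c1, alpha, beta):
    """Closed-form log p(X) for the beta-Bernoulli model."""
    lg = math.lgamma
    return (lg(alpha + beta) - lg(alpha) - lg(beta)
            + lg(alpha + c1) + lg(beta + c0)
            - lg(alpha + beta + c0 + c1))

def log_joint(pi1, c0, c1, alpha, beta):
    """log p(X, pi_1): log beta prior plus Bernoulli log-likelihood."""
    lg = math.lgamma
    return (lg(alpha + beta) - lg(alpha) - lg(beta)
            + (alpha + c1 - 1) * math.log(pi1)
            + (beta + c0 - 1) * math.log(1.0 - pi1))

# Hypothetical counts and prior hyperparameters.
c0, c1, alpha, beta = 3, 7, 2.0, 2.0

# Midpoint rule on (0, 1) with M equal-width cells.
M = 20000
pX_numeric = sum(
    math.exp(log_joint((k + 0.5) / M, c0, c1, alpha, beta)) for k in range(M)
) / M

pX_closed = math.exp(log_marginal_likelihood(c0, c1, alpha, beta))

# Parameters of the beta posterior p(pi_1 | X).
post_a, post_b = alpha + c1, beta + c0

print(pX_closed, pX_numeric, post_a, post_b)
```

Working with `math.lgamma` keeps the computation in the log domain, which avoids overflow of the gamma function when the counts become large.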

##### Alternative Derivation
Alternatively, we may directly write
\begin{eqnarray}
p(X, \pi_1) & = & \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}   \pi_1^{(\alpha + c_1 -1)} (1-\pi_1)^{(\beta + c_0 -1)} 
\end{eqnarray}

\begin{eqnarray}
p(X) &=& \int d\pi_1 p(X, \pi_1)  =  \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \int d\pi_1  \pi_1^{(\alpha + c_1 -1)} (1-\pi_1)^{(\beta + c_0 -1)} 
\end{eqnarray}


From the definition of the beta distribution, we can arrive at the 'formula' for the integral 
\begin{eqnarray}
1 &=& \int d\pi \mathcal{B}(\pi; a, b) \\
& = & \int d\pi \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \pi^{(a -1)} (1-\pi)^{(b -1)} \\
\frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)} & = & \int d\pi \pi^{(a -1)} (1-\pi)^{(b -1)}
\end{eqnarray}
Substituting $a = \alpha + c_1$ and $b = \beta + c_0$ gives the expression for $p(X)$ obtained above.
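
As a worked consequence, the same integral formula also gives the predictive probability of the next flip under the beta posterior $\mathcal{B}(\pi_1; a, b)$ with $a = \alpha + c_1$ and $b = \beta + c_0$. The notation $x^{(N+1)}$ for the next observation is an assumption made here for illustration; the last step uses $\Gamma(a+1) = a\,\Gamma(a)$ and $N = c_0 + c_1$.

\begin{eqnarray}
p(x^{(N+1)} = 1 | X) & = & \int d\pi_1 \; \pi_1 \, \mathcal{B}(\pi_1; a, b) \\
& = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \int d\pi_1 \; \pi_1^{a} (1-\pi_1)^{(b -1)} \\
& = & \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a + 1)\Gamma(b)}{\Gamma(a + b + 1)} \\
& = & \frac{a}{a + b} = \frac{\alpha + c_1}{\alpha + \beta + N}
\end{eqnarray}

Unlike the ML plug-in estimate $c_1/N$, this prediction never assigns probability $0$ or $1$ to an outcome when $\alpha, \beta > 0$.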

#### An Approximation
For large $x$ and fixed $a$, we have the following approximation
\begin{eqnarray}
\log \Gamma(x + a) - \log \Gamma(x) & \approx & a \log(x) \\
\Gamma(x + a)  & \approx & \Gamma(x) x^a
\end{eqnarray}
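
A quick numerical check of this approximation, as a sketch: the offset `a` and the test values of `x` are arbitrary choices made for illustration, and the function names `lhs` and `rhs` are hypothetical. The gap between the two sides should shrink as `x` grows.

```python
import math

def lhs(x, a):
    """Exact difference log Gamma(x + a) - log Gamma(x)."""
    return math.lgamma(x + a) - math.lgamma(x)

def rhs(x, a):
    """Approximation a * log(x)."""
    return a * math.log(x)

a = 2.5  # arbitrary fixed offset
xs = (10.0, 100.0, 1000.0, 10000.0)

# Absolute error of the approximation for increasing x.
errs = [abs(lhs(x, a) - rhs(x, a)) for x in xs]

for x, err in zip(xs, errs):
    print(x, lhs(x, a), rhs(x, a), err)
```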

When $c_0$ and $c_1$ are large, we obtain:

\begin{eqnarray}
p(X) & \approx & \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \frac{\Gamma(c_1)\Gamma(c_0)c_0^{\beta}c_1^{\alpha}}{\Gamma(c_0 + c_1)(c_0+c_1)^{\alpha + \beta}}
\end{eqnarray}

Letting $\hat{\pi}_1 = c_1/(c_0+c_1)$ and $N = c_0 + c_1$, we have
\begin{eqnarray}
p(X) & \approx &  \frac{\Gamma(c_1)\Gamma(c_0)}{\Gamma(c_0 + c_1)} (1-\hat{\pi}_1) \hat{\pi}_1 \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} (1-\hat{\pi}_1)^{\beta-1}\hat{\pi}_1^{\alpha-1}
\end{eqnarray}
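
To see how good this approximation is, one can compare it with the exact $\log p(X)$ for small and large counts. The sketch below is an illustration under assumptions: the counts and hyperparameters are hypothetical, and the function names are made up. The approximate expression is the log of the last formula, i.e. the count-only term, the factor $(1-\hat{\pi}_1)\hat{\pi}_1$, and the prior density evaluated at $\hat{\pi}_1$.

```python
import math

def log_pX_exact(c0, c1, alpha, beta):
    """Exact log marginal likelihood of the beta-Bernoulli model."""
    lg = math.lgamma
    return (lg(alpha + beta) - lg(alpha) - lg(beta)
            + lg(alpha + c1) + lg(beta + c0)
            - lg(alpha + beta + c0 + c1))

def log_pX_approx(c0, c1, alpha, beta):
    """Large-count approximation of log p(X)."""
    lg = math.lgamma
    pi_hat = c1 / (c0 + c1)
    # Log of the beta prior density evaluated at pi_hat.
    log_prior_at_pi_hat = (lg(alpha + beta) - lg(alpha) - lg(beta)
                           + (alpha - 1) * math.log(pi_hat)
                           + (beta - 1) * math.log(1.0 - pi_hat))
    return (lg(c1) + lg(c0) - lg(c0 + c1)
            + math.log(1.0 - pi_hat) + math.log(pi_hat)
            + log_prior_at_pi_hat)

alpha, beta = 2.0, 2.0  # hypothetical prior hyperparameters

# Same empirical frequency, small versus large sample.
err_small = abs(log_pX_exact(3, 7, alpha, beta)
                - log_pX_approx(3, 7, alpha, beta))
err_large = abs(log_pX_exact(3000, 7000, alpha, beta)
                - log_pX_approx(3000, 7000, alpha, beta))

print(err_small, err_large)
```

The error in the log domain is noticeable for a handful of flips and becomes negligible once both counts are in the thousands.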


### Estimating a Categorical distribution

#### Maximum Likelihood Estimation


We observe a dataset $\{x^{(n)}\}_{n=1\dots N}$ with $x^{(n)} \in \{1, \dots, S\}$. The model for a single observation is a categorical distribution with parameter $\pi = (\pi_1, \dots, \pi_S)$:

\begin{eqnarray}
x^{(n)} & \sim & p(x|\pi) = \prod_{s=1}^{S} \pi_s^{\ind{s = x^{(n)}}}
\end{eqnarray}
where $\sum_s \pi_s  = 1$.

The loglikelihood of the entire dataset is

\begin{eqnarray}
{\cal L}(\pi_1,\dots,\pi_S) & = & \sum_{n=1}^N\sum_{s=1}^S \ind{s = x^{(n)}} \log \pi_s
\end{eqnarray}
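Because of the constraint $\sum_s \pi_s = 1$, this is a constrained optimisation problem. We form the Lagrangian and set its derivative with respect to $\pi_s$ to zero: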
\begin{eqnarray}
\Lambda(\pi, \lambda) & = & \sum_{n=1}^N\sum_{s'=1}^S \ind{s' = x^{(n)}} \log \pi_{s'}  + \lambda \left( 1 - \sum_{s'} \pi_{s'} \right ) \\
\frac{\partial \Lambda(\pi, \lambda)}{\partial \pi_s} & = & \sum_{n=1}^N \ind{s = x^{(n)}} \frac{1}{\pi_s} - \lambda = 0 \\
\pi_s & = & \frac{\sum_{n=1}^N \ind{s = x^{(n)}}}{\lambda}
\end{eqnarray}

We solve for $\lambda$
\begin{eqnarray}
1 & = & \sum_s \pi_s = \frac{\sum_{s=1}^S \sum_{n=1}^N \ind{s = x^{(n)}}}{\lambda} \\
\lambda & = & \sum_{s=1}^S \sum_{n=1}^N \ind{s = x^{(n)}} =  \sum_{n=1}^N 1 = N
\end{eqnarray}

Hence
\begin{eqnarray}
\pi_s & = & \frac{\sum_{n=1}^N \ind{s = x^{(n)}}}{N}
\end{eqnarray}
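
In other words, the ML estimate of a categorical distribution is the vector of normalized counts. A minimal sketch, assuming hypothetical observations with states labelled $1, \dots, S$:

```python
from collections import Counter

# Hypothetical observations over S = 4 states labelled 1..S.
S = 4
x = [1, 3, 3, 2, 1, 3, 4, 3, 1, 3]

N = len(x)
counts = Counter(x)  # counts[s] = number of n with x^(n) == s (0 if s is absent)

# ML estimate: pi_s = count of state s divided by N.
pi_ml = [counts[s] / N for s in range(1, S + 1)]

print(pi_ml, sum(pi_ml))
```

States that never occur in the data receive probability zero under ML, which is one motivation for the prior-based estimates discussed for the coin.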



### Fair/Fake Coin

### Change point
- Coin switch
- Coal Mining Data
 - Single Change Point
 - Multiple Change Point
