### Bayesian Machine Learning


- Suppose that your task is to predict a 'new' datum $x$, based on $N$ observations $D=\{x_1,\dotsc,x_N\}$.

- The Bayesian approach for this task involves three stages: 
  1. Model specification
  1. Parameter estimation (inference, learning)
  1. Prediction (apply the model)
  
Let's discuss these three stages in a bit more detail:

1. **Model specification**. 
Your first task is to propose a model with tuning parameters $\theta$ for generating the data $D$.
  - This involves specification of $p(D|\theta)$ and a prior for the parameters $p(\theta)$.
  - _You_ choose the distribution $p(x|\theta)$ based on your physical understanding of the data generating process.
  - Note that, for independent observations $x_n$,
$$ p(D|\theta) = \prod_{n=1}^N p(x_n|\theta)$$
  - _You_ choose the prior $p(\theta)$ to reflect what you know about the parameter values before you see the data $D$.
2. **Parameter estimation**.
After model specification, use Bayes rule to find the posterior distribution for the parameters,
$$
p(\theta|D) = \frac{p(D|\theta) p(\theta)}{p(D)} \propto p(D|\theta) p(\theta)
$$  
  - Note that there's **no need for you to design a _smart_ parameter estimation algorithm**. The only complexity lies in the computational issues.  
  - [Q.]: What if I have more candidate models, say $\mathcal{M} = \{m_1,\ldots,m_K\}$?
  - [A.]: Specify a prior $p(m)$ for the models and use Bayes again to absorb what we can learn from the data,
$$ 
p(m|D) = \frac{p(D|m) p(m)}{p(D)} \propto p(D|m)p(m)
$$
  - This "recipe" works only if the RHS factors can be evaluated; this is what machine learning is about.
$\Rightarrow$ **Machine learning is easy, apart from computational details:)**
3. **Prediction**. 
Given the data $D$, our knowledge about the yet unobserved datum $x$ is captured by

$$
p(x|D) = \int p(x,\theta|D) \,\mathrm{d}\theta = \int p(x|\theta) p(\theta|D) \,\mathrm{d}\theta
$$

  - Again, no need to invent a special prediction algorithm. Probability theory takes care of all that. The complexity of prediction is just computational: how to carry out the marginalization over $\theta$.
  
  - What did we learn from $D$? Without access to $D$, we would predict new observations through

$$
p(x) = \int p(x,\theta) \,\mathrm{d}\theta = \int p(x|\theta) p(\theta) \,\mathrm{d}\theta
$$

  - Remaining problem: How good really were our model assumptions $p(x|\theta)$ and $p(\theta)$?  More on this in part 2 (Tjalkens).  
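The three stages can be sketched numerically for a one-parameter model by replacing the integrals over $\theta$ with sums over a grid. This is only an illustration: the Bernoulli model, the flat prior, the grid resolution, the data and all function names are assumptions chosen for the sketch, not part of the recipe itself.

```python
# Grid approximation of the three Bayesian stages for a scalar parameter theta.
# Assumptions (not from the text): Bernoulli model, flat prior, 1001-point grid.

def likelihood(data, theta):
    """Stage 1: p(D|theta) for independent binary observations (1 = success)."""
    p = 1.0
    for x in data:
        p *= theta if x == 1 else 1.0 - theta
    return p

def flat_prior(theta):
    """Stage 1: p(theta), here constant on [0, 1]."""
    return 1.0

def posterior_on_grid(data, grid, prior):
    """Stage 2: p(theta|D) on the grid, normalized so the weights sum to 1."""
    unnormalized = [likelihood(data, t) * prior(t) for t in grid]
    evidence = sum(unnormalized)      # plays the role of p(D), up to grid spacing
    return [u / evidence for u in unnormalized]

def predict(x, grid, weights):
    """Stage 3: p(x|D) as a weighted sum of p(x|theta) over the grid."""
    return sum((t if x == 1 else 1.0 - t) * w for t, w in zip(grid, weights))

grid = [i / 1000 for i in range(1001)]
data = [1, 0, 1, 1, 0, 0, 1]          # hypothetical observations

prior_weights = posterior_on_grid([], grid, flat_prior)    # no data: prior only
post_weights = posterior_on_grid(data, grid, flat_prior)

p_before = predict(1, grid, prior_weights)   # prediction without access to D
p_after = predict(1, grid, post_weights)     # prediction after seeing D
print(p_before, p_after)
```

Comparing `p_before` with `p_after` shows what was learned from $D$. For models with many parameters a grid is no longer practical, which is where the computational issues mentioned above come in.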
  

### Machine Learning and the Scientific Method Revisited

- Bayesian probability theory provides a unified framework for information processing (and even the Scientific Method).

\includegraphics[width=12.5cm]{./figures/fig-Bayesian-scientific-method}
%\caption{The Scientific Method}




### EXAMPLE: Coin Tossing

- Observe a sequence of $N$ coin tosses with $n$ heads. What's the probability that heads ($h$) comes up next?

1. **Model Specification** 

  - Assume a Bernoulli variable $p(x_n=h|\mu)=\mu$, leading to

$$   
p(D|\mu) = \prod_{n=1}^N p(x_n|\mu) = \mu^n (1-\mu)^{N-n}
$$

  - Also, assume prior belief is governed by a **beta distribution**

$$
p(\mu) = \mathcal{B}(\mu|\alpha,\beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} \mu^{\alpha-1}(1-\mu)^{\beta-1}
$$


  - NB: The beta distribution is defined over the continuous range $[0,1]$ and has the same functional form in $\mu$ as the binomial likelihood.
  - $\alpha$ and $\beta$ are called **hyperparameters**, since they parameterize the distribution for another parameter ($\mu$). E.g., $\alpha=\beta=1$ (uniform prior), $\alpha=\beta=0.5$ (Jeffreys prior), $\alpha=\beta=0$ (improper Haldane prior).
  - If $\alpha,\beta$ are positive integers, then $\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} = \frac{(\alpha+\beta-1)!}{(\alpha-1)!\,(\beta-1)!}$




#### Coin toss example (2): Parameter estimation

- Infer the posterior PDF over $\mu$ through Bayes rule

\begin{align}
p(\mu|D) &= \frac{p(D|\mu)p(\mu|\alpha,\beta)}{\int_0^1 p(D|\mu)p(\mu|\alpha,\beta)\,\mathrm{d}\mu } \\ 
        &= \frac{\mu^n (1-\mu)^{N-n} \times \mu^{\alpha-1}(1-\mu)^{\beta-1}}{\int_0^1 \mu^n (1-\mu)^{N-n}\mu^{\alpha-1}(1-\mu)^{\beta-1} \,\mathrm{d}\mu} \\
        &= \frac{(N+\alpha+\beta-1)!}{(n+\alpha-1)!(N-n+\beta-1)!}  \mu^{n+\alpha-1} (1-\mu)^{N-n+\beta-1}
\end{align}

where we assumed integer $\alpha$ and $\beta$ and used the formula for the _beta integral_ (for non-negative integers $p$ and $q$) $$\int_0^1 x^p (1-x)^q \,\mathrm{d}x = \frac{p!q!}{(p+q+1)!}$$

- Essentially, **here ends the machine learning activity**

- Note: $p(\mu|D) = \mathcal{B}(\mu|\,n+\alpha, N-n+\beta)$ is again a beta distribution, or

$$
\text{beta} \propto \text{binomial} \times \text{beta}\notag
$$

- The Beta distribution is a **conjugate prior** for the Binomial distribution
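Conjugacy means the posterior can be obtained by simply updating the hyperparameters. The sketch below compares that closed-form update with a brute-force normalization of likelihood times prior. The function names, the prior $\mathcal{B}(\mu|2,3)$, the data counts and the midpoint-rule integration are assumptions made for illustration.

```python
from math import gamma

def beta_pdf(mu, a, b):
    """Beta density B(mu|a,b), using the Gamma-function normalization."""
    norm = gamma(a + b) / (gamma(a) * gamma(b))
    return norm * mu ** (a - 1) * (1 - mu) ** (b - 1)

def conjugate_update(alpha, beta, n, N):
    """Posterior hyperparameters after observing n heads in N tosses."""
    return alpha + n, beta + (N - n)

def brute_force_posterior(mu, alpha, beta, n, N, steps=10_000):
    """Likelihood times prior, normalized by a midpoint-rule integral over [0, 1]."""
    def unnormalized(m):
        return m ** n * (1 - m) ** (N - n) * beta_pdf(m, alpha, beta)
    evidence = sum(unnormalized((k + 0.5) / steps) for k in range(steps)) / steps
    return unnormalized(mu) / evidence

alpha, beta, n, N = 2, 3, 4, 7        # hypothetical prior and data counts
a_post, b_post = conjugate_update(alpha, beta, n, N)

closed_form = beta_pdf(0.6, a_post, b_post)
numerical = brute_force_posterior(0.6, alpha, beta, n, N)
print(a_post, b_post, closed_form, numerical)
```

The two posterior values should agree up to integration error, which is the practical content of the conjugacy statement: no numerical integration is needed for this model.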



#### Coin Toss Example (3): Prediction

- Marginalize over the parameter posterior to get the predictive PDF, given the data $D$,

\begin{align}
p(h|D)  &= \int_0^1 p(h|\mu)p(\mu|D) \,\mathrm{d}\mu \\
  &= \frac{(N+\alpha+\beta-1)!}{(n+\alpha-1)!(N-n+\beta-1)!}  \int_0^1 \mu \times  \mu^{n+\alpha-1} (1-\mu)^{N-n+\beta-1} \,\mathrm{d}\mu  \\
  &= \frac{n+\alpha}{N+\alpha+\beta} \qquad \text{(Laplace's rule of succession when } \alpha=\beta=1\text{)}
\end{align}

- For large $N$, $p(h|D)=(n+\alpha)/(N+\alpha+\beta)$ approaches the relative frequency $n/N$.
- Example: for uniform prior ($\alpha=\beta=1$) and $D=\{hthhtth\}$, we get
 $$  p(h|D)=\frac{n+1}{N+2} = \frac{4+1}{7+2} = \frac{5}{9}$$
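The predictive probability is a one-line computation once the counts are known. A minimal sketch with exact rational arithmetic follows; the function name and the string encoding of the tosses are assumptions for illustration.

```python
from fractions import Fraction

def predict_heads(tosses, alpha=1, beta=1):
    """p(h|D) = (n + alpha) / (N + alpha + beta) for integer hyperparameters."""
    n = tosses.count("h")     # number of heads
    N = len(tosses)           # number of tosses
    return Fraction(n + alpha, N + alpha + beta)

print(predict_heads("hthhtth"))   # 5/9, matching the example above
print(predict_heads(""))          # 1/2, the prediction before any data
```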


#### Coin Toss Example: What did we learn?

- What did we learn from the data? Before seeing any data, we think that $p(h)=\left. p(h|D) \right|_{n=N=0} = \alpha / (\alpha + \beta)$.
- After the $N$ coin tosses, we think that 
$$
p(h|D) = (n+\alpha)/(N+\alpha+\beta)
$$
- Note the following decomposition

\begin{align}
    p(h|\,D,\alpha,\beta) &= (n+\alpha)/(N+\alpha+\beta) \\
        &= \frac{n}{N+\alpha+\beta} + \frac{\alpha}{N+\alpha+\beta} \\
        &= \frac{N}{N+\alpha+\beta}\cdot \frac{n}{N} + \frac{\alpha+\beta}{N+\alpha+\beta} \cdot \frac{\alpha}{\alpha+\beta} \\
        &= \underbrace{\frac{\alpha}{\alpha+\beta}}_{\text{prior}} + \underbrace{\frac{N}{N+\alpha+\beta}}_{\text{gain}}\cdot \left( \underbrace{\frac{n}{N}}_{\text{MLE}} - \underbrace{\frac{\alpha}{\alpha+\beta}}_{\text{prior}} \right)
    \end{align}

- Note that the estimate lies between the prior prediction and the MLE, since the gain lies between 0 and 1.
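The decomposition above can be checked with exact arithmetic. The sketch below assumes integer hyperparameters and $N>0$ (so that the MLE $n/N$ exists); the function name and the example counts are illustrative.

```python
from fractions import Fraction

def prior_gain_mle(n, N, alpha, beta):
    """Return (prior, gain, mle, estimate) with estimate = prior + gain * (mle - prior)."""
    prior = Fraction(alpha, alpha + beta)
    gain = Fraction(N, N + alpha + beta)
    mle = Fraction(n, N)
    return prior, gain, mle, prior + gain * (mle - prior)

prior, gain, mle, estimate = prior_gain_mle(n=4, N=7, alpha=1, beta=1)
print(prior, gain, mle, estimate)

# The corrected prior prediction equals the direct formula (n+alpha)/(N+alpha+beta).
direct = Fraction(4 + 1, 7 + 1 + 1)
print(estimate == direct)
```

As $N$ grows the gain moves toward 1, so the estimate moves from the prior prediction toward the MLE.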


#### Bayesian Evolution of $p(\mu|D)$ for the Coin Toss

\begin{figure}[h]\centering
\includegraphics[height=5cm]{./figures/fig-coin-toss-posterior}
\end{figure}

- Evolution of the posterior probability $p(\mu|D)$ as the number of coin tosses increases. Upper-left: solid curve for a uniform prior; dashed curve for a Gaussian prior.

$\Rightarrow$ **with more data, the relevance of the prior diminishes**.
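The shrinking influence of the prior can also be seen numerically. The sketch below tracks the posterior mean under two different beta priors as $N$ grows. Note the assumptions: both priors are beta distributions (the figure uses a Gaussian prior for the dashed curve), and the hyperparameters and the 70% heads ratio are made up for illustration.

```python
from fractions import Fraction

def posterior_mean(n, N, alpha, beta):
    """Mean of the beta posterior B(mu | n + alpha, N - n + beta)."""
    return Fraction(n + alpha, N + alpha + beta)

priors = {"uniform": (1, 1), "informative": (10, 2)}   # hypothetical choices
gaps = []
for N in (0, 10, 100, 1000):
    n = 7 * N // 10                                    # hypothetical data: 70% heads
    means = {name: posterior_mean(n, N, a, b) for name, (a, b) in priors.items()}
    gaps.append(abs(means["uniform"] - means["informative"]))
    print(N, {name: round(float(m), 4) for name, m in means.items()})
```

The gap between the two posterior means shrinks with every increase of $N$, even though the two priors disagree strongly before any data are seen.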

        


### From Posterior to Point-Estimate
Sometimes we want just one 'best' parameter (vector), rather than a posterior distribution over parameters. Why?

- Recall Bayesian prediction
$$
p(x|D) = \int p(x|\theta)p(\theta|D)\,\mathrm{d}{\theta}
$$

- If we approximate posterior $p(\theta|D)$ by a delta function for one 'best' value $\hat\theta$, then the predictive distribution collapses to

$$
p(x|D) \approx \int p(x|\theta)\delta(\theta-\hat\theta)\,\mathrm{d}{\theta} = p(x|\hat\theta)
$$

- This is the model $p(x|\theta)$ evaluated at $\theta=\hat\theta$.
- Note that $p(x|\hat\theta)$ is much easier to evaluate than the integral for full Bayesian prediction.



### Some Well-known Point-Estimates

1. **Bayes estimate**
$$
\hat \theta_{bayes}  = \int \theta \, p\left( \theta |D \right) \,\mathrm{d}{\theta}
$$

  - (homework). Prove that the Bayes estimate minimizes the expected squared error, i.e., prove that
$$
\hat \theta_{bayes} = \arg\min_{\hat \theta} \int_\theta (\hat \theta -\theta)^2 p \left( \theta |D \right) \,\mathrm{d}{\theta}
$$

2. **Maximum A Posteriori** (MAP) estimate 
$$
\hat \theta_{\text{map}}=  \arg\max _{\theta} p\left( \theta |D \right) =
\arg \max_{\theta}  p\left(D |\theta \right) \, p\left(\theta \right)
$$

3. **Maximum Likelihood** (ML) estimate

$$
\hat \theta_{ml}  = \arg \max_{\theta}  p\left(D |\theta\right)
$$

  - Note that Maximum Likelihood is MAP with uniform prior
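For the coin toss model the three point estimates have closed forms, because the posterior is $\mathcal{B}(\mu|a,b)$ with $a=n+\alpha$ and $b=N-n+\beta$: the posterior mean is $a/(a+b)$ and, for $a,b>1$, the posterior mode is $(a-1)/(a+b-2)$. The sketch below evaluates them; the function name and the example counts are assumptions for illustration, and integer hyperparameters are assumed so that exact fractions can be used.

```python
from fractions import Fraction

def point_estimates(n, N, alpha, beta):
    """Bayes, MAP and ML estimates of mu for n heads in N tosses."""
    a, b = n + alpha, N - n + beta        # posterior is B(mu | a, b)
    if a <= 1 or b <= 1:
        raise ValueError("the mode formula needs both posterior hyperparameters > 1")
    return {
        "bayes": Fraction(a, a + b),          # posterior mean
        "map": Fraction(a - 1, a + b - 2),    # posterior mode
        "ml": Fraction(n, N),                 # maximizer of the likelihood
    }

uniform = point_estimates(n=4, N=7, alpha=1, beta=1)
informative = point_estimates(n=4, N=7, alpha=2, beta=2)
print(uniform)
print(informative)
```

With the uniform prior the MAP and ML estimates coincide, while the Bayes estimate still differs from both.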



### Learning with Maximum Likelihood Estimation

Consider the task: predict a datum $x$ from an observed data set $D$.
1. Model specification. 
  - Choose a model $m$ with parameters $\theta \in \Theta$ and the data generating distribution $p(x|\theta,m)$. (No need for priors.)
2. Learning 
  - By Maximum Likelihood (ML) optimization,
$$ 
    \hat \theta  = \arg \max_{\theta}  p(D |\theta,m)
$$
3. Prediction. 
  - Easy: plug the estimate into the model,
$$ 
    p(x|D) \approx p(x|\hat\theta,m)
$$

- Note that this is a computationally easy approximation to the Bayesian approach. What is the price?
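For the coin toss model the ML recipe reduces to $\hat\mu = n/N$ followed by the plug-in prediction $p(h|\hat\mu)=\hat\mu$. The sketch below compares it with the Bayesian prediction under a uniform prior on a very small and on a larger data set. Function names and data are hypothetical, chosen to make the difference visible.

```python
from fractions import Fraction

def ml_predict_heads(tosses):
    """Plug-in prediction p(h | mu_hat) with mu_hat = n / N."""
    return Fraction(tosses.count("h"), len(tosses))

def bayes_predict_heads(tosses, alpha=1, beta=1):
    """Full Bayesian prediction (n + alpha) / (N + alpha + beta)."""
    return Fraction(tosses.count("h") + alpha, len(tosses) + alpha + beta)

small = "ttt"                        # three tails, no heads yet
large = "h" * 600 + "t" * 400        # hypothetical long record

print(ml_predict_heads(small), bayes_predict_heads(small))
print(float(ml_predict_heads(large)), float(bayes_predict_heads(large)))
```

On the three-toss record the plug-in prediction assigns probability zero to heads, whereas the Bayesian prediction does not. On the long record the two predictions are nearly identical.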


### Report Card on Maximum Likelihood Estimation
- Maximum Likelihood (ML) is MAP with uniform prior, or MAP is 'penalized' ML
$$
\hat \theta_{map}  = \arg \max _\theta  \{ \overbrace{\log
p\left( D|\theta  \right)}^{\mbox{log-likelihood}} + \overbrace{\log
p\left( \theta \right)}^{\mbox{penalty}} \}
$$
- (good!). Works rather well if we have a lot of data because the influence of the prior diminishes with more data.
- (bad). Cannot be used for model comparison. E.g., the best model generally does not correspond to the largest likelihood (see part 2, Tjalkens).
- (good). Often computationally feasible. Useful fact (since $\log$ is monotonically increasing):
$$\arg\max_\theta \log p(D|\theta) =  \arg\max_\theta p(D|\theta)$$
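A small grid search illustrates why the log-likelihood is usually the objective of choice in practice. The sketch assumes the Bernoulli coin model; the grid, the data sets and the function names are illustrative. It first confirms that both objectives pick the same grid point on a short record, and then shows that on a long record the likelihood itself underflows in double precision while the log-likelihood remains usable.

```python
from math import log

def likelihood(mu, tosses):
    """p(D|mu) as a product over tosses (1 = heads, 0 = tails)."""
    p = 1.0
    for x in tosses:
        p *= mu if x == 1 else 1.0 - mu
    return p

def log_likelihood(mu, tosses):
    """log p(D|mu), written as a sum of logs via the head and tail counts."""
    n, N = sum(tosses), len(tosses)
    return n * log(mu) + (N - n) * log(1.0 - mu)

grid = [i / 1000 for i in range(1, 1000)]    # open interval avoids log(0)

small = [1, 0, 1, 1, 0, 0, 1]                # the sequence hthhtth
best_lik = max(grid, key=lambda mu: likelihood(mu, small))
best_log = max(grid, key=lambda mu: log_likelihood(mu, small))

large = [1] * 1200 + [0] * 800               # hypothetical long record
peak_lik = likelihood(0.6, large)            # evaluated at the true maximizer n/N
best_log_large = max(grid, key=lambda mu: log_likelihood(mu, large))

print(best_lik, best_log, peak_lik, best_log_large)
```

Sums of logs stay within floating-point range where long products of probabilities do not, so maximizing $\log p(D|\theta)$ is both equivalent and numerically safer.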

$\Rightarrow$ **ML estimation is an approximation to Bayesian learning**, but, for good reason, it is a very popular learning method when a lot of data is available.

