# 4F13: Probabilistic Machine Learning

Lecturer: Prof. Carl Edward Rasmussen

----

# Table of Contents

>## 1. Introduction to Probabilistic Machine Learning
* 1.1. Mathematical Models
* 1.2. Linear in the Parameters Regression
* 1.3. Likelihood and Noise
* 1.4. Probability Basics
* 1.5. Bayesian Inference and Prediction in Finite Regression Models

>## 2. Gaussian Process
* 2.1. Introduction
* 2.2. Gaussian Process
* 2.3. Non-Parametric Gaussian Process Models
* 2.4. GP Marginal Likelihood and Hyperparameters
* 2.5. Linear in the Parameters Models and GP
* 2.6. Finite and Infinite Basis GPs
* 2.7. Covariance Functions

>## 3. Probabilistic Ranking
* 3.1. Introduction to Ranking
* 3.2. Gibbs Sampling
* 3.3. Gibbs Sampling in TrueSkill
* 3.4. Factor Graphs and Message Passing
* 3.5. Message Passing in TrueSkill

>## 4. Modelling Document Collections
* 4.1. Introduction
* 4.2. Discrete Binary Distributions
* 4.3. Discrete Categorical Distribution
* 4.4. Document Models
* 4.5. Gibbs Sampling for Bayesian Mixture
* 4.6. Latent Dirichlet Allocation for Topic Modelling

# 1. Introduction to Probabilistic Machine Learning

## 1.1. Mathematical Models

* **Purpose**

>1. Make predictions
>2. Generalise from observations (inter/extrapolation)
>3. Understand and interpret statistical relationships
>4. Evaluate the relative probability of hypotheses about the data
>5. Compress or summarise data
>6. Generate more data, from a similar distribution

* **Originate from**

>1. **First Principles** (e.g. Newtonian mechanics)
>2. **Observations** (data) $\Rightarrow$ ***Machine Learning*** significantly relies on data

* **Rely on**

>1. **Knowledge** (expressed through ***priors***)
>2. **Assumptions** (e.g. conditional independence)
>3. **Simplifying assumptions** (if they are 'good enough')

* **Terminology:**

><img src="images/image1_01.png" width=300>
>
>* $y$: observations
>* $x$: unobserved (hidden, latent) variables; their number grows with the data
>* $A$: parameter for transitions / $C$: parameter for emissions; their number is fixed

## 1.2. Linear in the Parameters Regression

* **Model**

>$$f_w(x)=\sum^{M}_{j=0}{w_j\Phi_j(x)} \;\;\; \text{where} \;\;\; \Phi_j(x)=x^j$$
>
>$$\hat{y}=\Phi \mathbf{w}$$

* **Least Squares Fit**

>$$\text{cost:} \;\;\; \text{E}(\mathbf{w})=(y-\hat{y})^T(y-\hat{y})=(y-\Phi \mathbf{w})^T(y-\Phi \mathbf{w})$$

>$$\frac{\partial \text{E}(\mathbf{w})}{\partial \mathbf{w}}=0 \;\;\; \Rightarrow \;\;\; \hat{\mathbf{w}}=(\Phi^T\Phi)^{-1}\Phi^Ty$$
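
The least squares fit above can be sketched in NumPy as follows. This is a minimal illustration, not the course's reference code: the helper names `design_matrix` and `least_squares_fit` and the toy data are assumptions made for the example. It uses `np.linalg.lstsq` rather than forming $(\Phi^T\Phi)^{-1}$ explicitly, which gives the same minimiser but is usually better conditioned.

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial design matrix Phi with columns x^0, x^1, ..., x^M."""
    return np.vander(x, M + 1, increasing=True)

def least_squares_fit(x, y, M):
    """Weights minimising E(w) = (y - Phi w)^T (y - Phi w)."""
    Phi = design_matrix(x, M)
    w_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w_hat

# Hypothetical toy data: a noiseless quadratic, so the fit should recover it.
x = np.linspace(-1.0, 1.0, 20)
y = 1.0 - 2.0 * x + 0.5 * x**2
w_hat = least_squares_fit(x, y, M=2)
```

On this noiseless data `w_hat` matches the generating coefficients up to floating point error; with noisy data and a large $M$ the same code reproduces the overfitting behaviour discussed next.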

* **Overfitting**

><img src="images/image1_02.png" width=400>

## 1.3. Likelihood and Noise

><img src="images/image1_03.png" width=400>

* **Gaussian Noise Assumption**

>$$\boldsymbol{\epsilon} \sim \mathcal{N}(\boldsymbol{\epsilon};\mathbf{0},\sigma^2_{noise}\mathbf{I})$$

* **Maximum Likelihood Estimate**

>$$p(\mathbf{y}|\mathbf{f}, {\sigma}^2_{noise})=\mathcal{N}(\mathbf{y};\mathbf{f},\sigma^2_{noise}\mathbf{I})=\left( \frac{1}{\sqrt{2\pi \sigma^2_{noise}}} \right)^N \exp{\left( -\frac{||\mathbf{y}-\mathbf{f}||^2}{2\sigma^2_{noise}}\right)}$$

>* **ML solution = Least squares solution**
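>
>A short derivation, assuming $\mathbf{f}=\Phi\mathbf{w}$ as in Section 1.2, shows why the two solutions coincide. Taking the logarithm of the likelihood gives
>
>$$\log p(\mathbf{y}|\mathbf{w},\sigma^2_{noise}) = -\frac{N}{2}\log(2\pi\sigma^2_{noise}) - \frac{1}{2\sigma^2_{noise}}(\mathbf{y}-\Phi\mathbf{w})^T(\mathbf{y}-\Phi\mathbf{w})$$
>
>The first term does not depend on $\mathbf{w}$, and the second is $-\text{E}(\mathbf{w})/(2\sigma^2_{noise})$. For any fixed $\sigma^2_{noise}$, maximising the likelihood over $\mathbf{w}$ is therefore the same as minimising $\text{E}(\mathbf{w})$, which gives $\hat{\mathbf{w}}=(\Phi^T\Phi)^{-1}\Phi^T\mathbf{y}$.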

## 1.4. Probability Basics
* **Sum Rule**

>$$p(A)=\sum_B{p(A,B)}\;\;\;\text{or}\;\;\;p(A)=\int_B{p(A,B)dB}$$

* **Product Rule**

>$$p(A,B)=p(A|B)p(B)$$

* **Bayes' Rule**

>$$p(A|B)=\frac{p(A,B)}{p(B)}=\frac{p(B|A)p(A)}{p(B)}$$
>
>* $p(A)$: **marginal**
>* $p(B|A)$: **conditional**
>* $p(A,B)$: **joint**
>* If $A$ and $B$ are independent, $p(A,B)=p(A)p(B)$
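
The three rules can be checked numerically on a small discrete example. The joint table below is hypothetical, chosen only so that the numbers are easy to follow, and the function names are assumptions made for this sketch.

```python
# Hypothetical joint distribution p(A, B) over two binary variables.
joint = {
    (0, 0): 0.3, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.4,
}

def marginal_A(a):
    """Sum rule: p(A=a) = sum over B of p(A=a, B)."""
    return sum(p for (ai, _), p in joint.items() if ai == a)

def marginal_B(b):
    """Sum rule: p(B=b) = sum over A of p(A, B=b)."""
    return sum(p for (_, bi), p in joint.items() if bi == b)

def cond_A_given_B(a, b):
    """Product rule rearranged: p(A=a | B=b) = p(A=a, B=b) / p(B=b)."""
    return joint[(a, b)] / marginal_B(b)

def cond_B_given_A(b, a):
    """Product rule rearranged: p(B=b | A=a) = p(A=a, B=b) / p(A=a)."""
    return joint[(a, b)] / marginal_A(a)

# Bayes' rule: p(A|B) = p(B|A) p(A) / p(B), evaluated at A=1, B=1.
lhs = cond_A_given_B(1, 1)
rhs = cond_B_given_A(1, 1) * marginal_A(1) / marginal_B(1)
```

Both `lhs` and `rhs` equal $0.4/0.5=0.8$ up to floating point error. Since $p(A{=}1)\,p(B{=}1)=0.6\times0.5=0.3\neq0.4$, the two variables in this table are not independent.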

## 1.5. Bayesian Inference and Prediction in Finite Regression Models

* **Posterior**

>$$p(\mathbf{w}|\mathbf{x},\mathbf{y},\mathcal{M})=\frac{p(\mathbf{w}|\mathcal{M})p(\mathbf{y}|\mathbf{x},\mathbf{w},\mathcal{M})}{p(\mathbf{y}|\mathbf{x},\mathcal{M})}=\mathcal{N}(\mathbf{w;μ,\Sigma})$$
>$$\;$$
>* **Gaussian Prior**: $p(\mathbf{w}|\mathcal{M})=\mathcal{N}(\mathbf{w};\textbf{0},\sigma^2_\mathbf{w}\mathbf{I})$
>$$\;$$
>* **Gaussian Likelihood**: $p(\mathbf{y}|\mathbf{x},\mathbf{w},\mathcal{M})=\mathcal{N}(\mathbf{y};\mathbf{Φw},\sigma^2_{noise}\mathbf{I})$
>$$\;$$
>* **Marginal Likelihood**: $p(\mathbf{y}|\mathbf{x},\mathcal{M})=\int{p(\mathbf{w}|\mathbf{x},\mathcal{M})p(\mathbf{y}|\mathbf{x},\mathbf{w},\mathcal{M})\text{d}\mathbf{w}}$
>$$\;$$
>$$\mathbf{μ}=\left( \mathbf{Φ}^T\mathbf{Φ} + \frac{\sigma^2_{noise}}{\sigma^2_{\mathbf{w}}}\mathbf{I} \right)^{-1}\mathbf{Φ}^T\mathbf{y} \;\;\;,\;\;\; \mathbf{Σ}=\left( \sigma^{-2}_{noise}\mathbf{Φ}^T\mathbf{Φ}+\sigma^{-2}_{\mathbf{w}}\mathbf{I}\right)^{-1}$$
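
The posterior mean and covariance can be computed directly from these expressions. The sketch below assumes a polynomial basis and synthetic data; the function name `posterior`, the hyperparameter values, and the generating weights are illustrative choices, not part of the notes.

```python
import numpy as np

def posterior(Phi, y, sigma2_w, sigma2_noise):
    """Posterior N(w; mu, Sigma) under prior N(0, sigma2_w I) and Gaussian noise."""
    M = Phi.shape[1]
    # Sigma = (Phi^T Phi / sigma2_noise + I / sigma2_w)^{-1}
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2_noise + np.eye(M) / sigma2_w)
    # mu = Sigma Phi^T y / sigma2_noise, algebraically equal to
    # (Phi^T Phi + (sigma2_noise / sigma2_w) I)^{-1} Phi^T y
    mu = Sigma @ Phi.T @ y / sigma2_noise
    return mu, Sigma

# Hypothetical data: quadratic function plus Gaussian noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
Phi = np.vander(x, 3, increasing=True)   # columns 1, x, x^2
w_true = np.array([1.0, -2.0, 0.5])
y = Phi @ w_true + 0.1 * rng.standard_normal(x.size)

mu, Sigma = posterior(Phi, y, sigma2_w=1.0, sigma2_noise=0.01)
```

Compared with the least squares solution, the only change in the mean is the added term $\frac{\sigma^2_{noise}}{\sigma^2_{\mathbf{w}}}\mathbf{I}$, which shrinks the weights towards the prior mean of zero.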


* **Predictive Distribution**

>\begin{align}
p(y_*|x_*,\mathbf{x},\mathbf{y},\mathcal{M})&=\int{p(y_*,\mathbf{w}|\mathbf{x},\mathbf{y},x_*,\mathcal{M})\text{d}\mathbf{w}}\\
&= \int{p(y_*|\mathbf{w},x_*,\mathcal{M})p(\mathbf{w}|\mathbf{x},\mathbf{y},\mathcal{M})\text{d}\mathbf{w}}\\
&=\mathcal{N}(y_*;\mathbf{Φ}(x_*)^T\mathbf{μ},\mathbf{Φ}(x_*)^T\mathbf{Σ}\mathbf{Φ}(x_*)+\sigma^2_{noise})
\end{align}
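
Given a posterior $\mathcal{N}(\mathbf{w};\mathbf{μ},\mathbf{Σ})$, the predictive mean and variance at a test input are two small matrix products. In the sketch below the posterior values, the linear basis, and the function name `predict` are hypothetical; for a single scalar output $y_*$ the noise term is the scalar $\sigma^2_{noise}$.

```python
import numpy as np

def predict(phi_star, mu, Sigma, sigma2_noise):
    """Predictive mean and variance of y_* for the feature vector phi(x_*)."""
    mean = phi_star @ mu
    # Uncertainty from the weights plus observation noise.
    var = phi_star @ Sigma @ phi_star + sigma2_noise
    return mean, var

# Hypothetical posterior over the weights of a linear basis (1, x).
mu = np.array([1.0, -2.0])
Sigma = np.diag([0.04, 0.09])

x_star = 0.5
phi_star = np.array([1.0, x_star])
mean, var = predict(phi_star, mu, Sigma, sigma2_noise=0.01)
```

Here the mean is $1 - 2\times0.5 = 0$ and the variance is $0.04 + 0.25\times0.09 + 0.01 = 0.0725$. The variance never falls below $\sigma^2_{noise}$, however well the weights are determined.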

* **Evidence** (marginal likelihood, used to select between models)

>\begin{align}
p(\mathcal{M}|\mathbf{x,y}) \propto p(\mathbf{y|\mathcal{M},x}) &= \int{p(\mathbf{w}|\mathbf{x},\mathcal{M})p(\mathbf{y}|\mathbf{x},\mathbf{w},\mathcal{M})\text{d}\mathbf{w}}\\
&=\mathcal{N}(\mathbf{y;0,\sigma^2_{w}ΦΦ}^T+\sigma^2_{noise}\mathbf{I})
\end{align}
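
The evidence can be evaluated as the log density of a zero-mean Gaussian and compared across models. The following is a sketch under assumed settings: the function name `log_evidence`, the hyperparameter values, and the synthetic linear data are illustrative. A Cholesky factorisation is used for the determinant and the quadratic form, which is a common numerical choice rather than something the notes prescribe.

```python
import numpy as np

def log_evidence(Phi, y, sigma2_w, sigma2_noise):
    """log N(y; 0, sigma2_w Phi Phi^T + sigma2_noise I)."""
    N = y.size
    K = sigma2_w * Phi @ Phi.T + sigma2_noise * np.eye(N)
    L = np.linalg.cholesky(K)                    # K = L L^T
    alpha = np.linalg.solve(L, y)                # alpha^T alpha = y^T K^{-1} y
    log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log |K|
    return -0.5 * (alpha @ alpha + log_det + N * np.log(2.0 * np.pi))

# Hypothetical data generated from a straight line plus noise.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 20)
y = 2.0 * x + 0.1 * rng.standard_normal(x.size)

# Log evidence of polynomial models of order M = 0, 1, 2, 3.
evidences = {
    M: log_evidence(np.vander(x, M + 1, increasing=True), y, 1.0, 0.01)
    for M in range(4)
}
```

On data like this the constant model ($M=0$) cannot explain the slope, so its evidence is far lower than that of the linear model ($M=1$). How the higher orders rank depends on the hyperparameters and the noise realisation.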
