# MLMI1: Introduction to Machine Learning

Lecturer: Dr. Richard Turner

----

# Table of Contents
## 1. Introduction
* 1.1. Radioactive Decay - Heuristic Approach
* 1.2. Radioactive Decay - Probabilistic Approach
* 1.3. Inference and Decision Making

## 2. Regression
* 2.1. Linear Regression
* 2.2. Non-linear Regression
* 2.3. Regularisation
* 2.4. Bayesian Linear Regression

## 3. Classification
* 3.1. Binary Logistic Classification
* 3.2. kNN Classification
* 3.3. Multi-class Softmax Classification
* 3.4. Overfitting in Classification
* 3.5. Overfitting in Classification
* 3.6. Bayesian Classification
* 3.7. Bayesian Logistic Regression

## 4. Clustering
* 4.1. K-means Algorithm
* 4.2. K-means as Optimisation
* 4.3. K-means++

## 5. EM Method
* 5.1. Mixture of Gaussian: Generative Model
* 5.2. KL Divergence
* 5.3. EM Algorithm
* 5.4. EM - Application to 1D data

## 6. Sequence Modelling
* 6.1. Markov Models
* 6.2. N-gram Models (discrete data)
* 6.3. AR Gaussian Models (continuous data)
* 6.4. HMM (discrete hidden state)
* 6.5. HMM (continuous hidden state)
* 6.6. Kalman Filter

------

# 1. Introduction
## 1.1. Radioactive Decay - Heuristic Approach

* **Problem**

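>Decay events are observed only inside the window $(x_{\text{min}}, x_{\text{max}})$, where they follow
>
>$$p(x|\lambda) = \frac{1}{Z(\lambda)} \exp\bigg(-\frac{x}{\lambda}\bigg), \qquad x_{\text{min}} < x < x_{\text{max}}$$
>
>The task is to estimate the decay constant $\lambda$ from the observed events.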

* **Histogram-based**

>* $C_x$: No. of events in $[x-w/2,x+w/2]$
>
>\begin{align}
\mathbb{E}(C_x) &= N \int^{x+w/2}_{x-w/2}p(x'|\lambda) \, \text{d}x'
~\approx N w \, p(x|\lambda) = \frac{N w}{Z(\lambda)} \exp\bigg(-\frac{x}{\lambda}\bigg) \\
\log(\mathbb{E}(C_{x})) &= -\frac{x}{\lambda} + \text{const.} \rightarrow \text{find } \lambda \text{ from least squares}
\end{align}

>* **Problems:** the estimate of $\lambda$ depends on the choice of bins / no uncertainty estimate / why least squares? (this can be justified)
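
The histogram recipe above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the data are synthetic (drawn by the hypothetical helper `sample_truncated_exp` with a made-up true value $\lambda = 10$ on the window $(1, 20)$), empty bins are skipped because their log-count is undefined, and the fit is unweighted ordinary least squares.

```python
import math
import random

def sample_truncated_exp(lam, x_min, x_max, rng):
    """Inverse-CDF draw from p(x|lam) proportional to exp(-x/lam) on (x_min, x_max)."""
    u = rng.random()
    a = math.exp(-x_min / lam)
    b = math.exp(-x_max / lam)
    return -lam * math.log(a - u * (a - b))

def histogram_lambda(xs, x_min, x_max, n_bins):
    """Fit log(C_x) = -x/lam + const by least squares over the non-empty bins."""
    w = (x_max - x_min) / n_bins
    counts = [0] * n_bins
    for x in xs:
        k = min(int((x - x_min) / w), n_bins - 1)
        counts[k] += 1
    # (bin centre, log count) pairs; empty bins are dropped.
    pts = [(x_min + (k + 0.5) * w, math.log(c)) for k, c in enumerate(counts) if c > 0]
    mx = sum(x for x, _ in pts) / len(pts)
    my = sum(y for _, y in pts) / len(pts)
    slope = sum((x - mx) * (y - my) for x, y in pts) / sum((x - mx) ** 2 for x, _ in pts)
    return -1.0 / slope  # slope of the fitted line is -1/lam

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [sample_truncated_exp(10.0, 1.0, 20.0, rng) for _ in range(20000)]
lam_hat = histogram_lambda(xs, 1.0, 20.0, n_bins=10)
print(f"lambda estimate with 10 bins: {lam_hat:.2f}")
```

Re-running `histogram_lambda` with a different `n_bins` generally gives a slightly different estimate, which is the bin-dependence problem noted above.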

* **Statistic-based (e.g. mean)**

>* **Calculate mean**

>\begin{align}
\mu = \mathbb{E}(x) = \int x \; p(x|\lambda) \; \text{d}x = \int x \; \frac{1}{Z(\lambda)} \text{exp}\bigg(-\frac{x}{\lambda}\bigg) \; \text{d}x
\end{align}

>* **Calculate $Z(\lambda)$**

>\begin{align}
Z(\lambda) = \int_{x_{\text{min}}}^{x_{\text{max}}} \exp(-x/\lambda) \; \text{d}x = \lambda \left [ \exp(-x_{\text{min}}/\lambda) - \exp(-x_{\text{max}}/\lambda) \right]
\end{align}

>* **Apply it to $\mu$**

>$$
\frac{d}{d(1/\lambda)} \left( \int_{x_{min}}^{x_{max}} \text{exp}(-x/\lambda) dx \right)= \int_{x_{min}}^{x_{max}} \frac{\partial}{\partial (1/\lambda)} \text{exp}(-x/\lambda) dx = - \int_{x_{min}}^{x_{max}} x \, \text{exp}(-x/\lambda) dx
$$
>
>$$
\begin{align}
\mu  &= - \frac{1}{Z(\lambda)} \frac{d}{d(1/\lambda)} \int_{x_{\text{min}}}^{x_{\text{max}}} \exp(-x/\lambda) \; dx = - \frac{d}{d(1/\lambda)} \log Z(\lambda) \\
&= \lambda +  \frac{  
x_{\text{min}} \exp(-x_{\text{min}}/\lambda) - x_{\text{max}} \exp(-x_{\text{max}}/\lambda) 
}{\exp(-x_{\text{min}}/\lambda) - \exp(-x_{\text{max}}/\lambda)} \rightarrow \text{find } \lambda
\end{align}
$$

>* **Sanity Check**
>  * **1.** $\lambda \rightarrow 0 \;\; \Rightarrow \;\; \mu \rightarrow x_{min}$
>  * **2.** $\lambda \rightarrow \infty \;\; \Rightarrow \;\; \mu \rightarrow (x_{max}+x_{min})/2$
>  * **3.** $x_{min}=0 \; \text{and} \; x_{max} \rightarrow \infty \;\; \Rightarrow \;\; Z(\lambda) \rightarrow \lambda$

>* **Problems:** Why $\mathbb{E}(x)$? / What if $\hat{\mu}$ is higher than $(x_{\text{max}} + x_{\text{min}} )/2$? (By sanity check 2, no finite $\lambda > 0$ then matches the mean.)
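
The final "find $\lambda$" step has no closed form, so one option is to invert $\mu(\lambda)$ numerically. The sketch below uses bisection, relying on $\mu(\lambda)$ increasing from $x_{\text{min}}$ towards the midpoint as in the sanity checks. The function names and the search bracket `[1e-3, 1e4]` are assumptions made for illustration, not part of the notes.

```python
import math

def truncated_mean(lam, x_min, x_max):
    """mu(lam) from the formula above, with exp(-x_min/lam) factored out to avoid underflow."""
    r = math.exp(-(x_max - x_min) / lam)
    return lam + (x_min - x_max * r) / (1.0 - r)

def solve_lambda(mu_hat, x_min, x_max, lo=1e-3, hi=1e4, iters=200):
    """Bisection for mu(lam) = mu_hat; needs x_min < mu_hat < (x_min + x_max) / 2."""
    if not (x_min < mu_hat < 0.5 * (x_min + x_max)):
        raise ValueError("no finite positive lambda matches this mean")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid, x_min, x_max) < mu_hat:
            lo = mid   # mean too small, so lambda must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip: compute the mean for a known lambda, then recover lambda from it.
mu = truncated_mean(5.0, 1.0, 20.0)
lam_hat = solve_lambda(mu, 1.0, 20.0)
print(f"mu = {mu:.4f}, recovered lambda = {lam_hat:.4f}")
```

The `ValueError` branch makes the second problem above concrete: a sample mean at or above the midpoint cannot be matched by any finite positive $\lambda$.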

## 1.2. Radioactive Decay - Probabilistic Approach

* **Bayes' Rule**

>$$p(\lambda | \mathcal{D}) = \frac{p(\lambda) p(\mathcal{D} | \lambda )}{p(\mathcal{D})} \propto p(\lambda) \prod_{n=1}^N p( x_n | \lambda )$$

* **Likelihood** (the sample mean is a sufficient statistic)

>$$p(\{ x_n \}_{n=1}^N | \lambda) = \frac{1}{Z(\lambda)^N} \exp\left(-\frac{1}{\lambda} \sum_{n = 1}^{N} x_n \right)
$$

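>* Small $\lambda$: inconsistent with the observations, so the likelihood is close to zero
>* Large $\lambda$: the likelihood flattens out, because the density over the window approaches a uniform distribution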


* **Prior** (assume uniform)

>$$\begin{align}
p(\lambda) = \mathcal{U}(\lambda; 0, 100)
\end{align}$$

* **Posterior Distribution**

>* MAP (Maximum a Posteriori): mode
>* Uncertainty: RMS variation around the MAP value
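
Because $\lambda$ is one-dimensional, the posterior can be evaluated directly on a grid. The sketch below does this for the uniform prior on $(0, 100)$ and reports the MAP value and the RMS spread around it. The six data points, the window $(1, 20)$ and the grid size are hypothetical values chosen only for illustration.

```python
import math

def log_likelihood(lam, xs, x_min, x_max):
    """log p({x_n} | lam) = -N log Z(lam) - sum(x_n) / lam for the truncated exponential."""
    # log Z = log(lam) - x_min/lam + log(1 - exp(-(x_max - x_min)/lam)), arranged to avoid underflow.
    log_z = math.log(lam) - x_min / lam + math.log1p(-math.exp(-(x_max - x_min) / lam))
    return -len(xs) * log_z - sum(xs) / lam

def grid_posterior(xs, x_min, x_max, lam_max=100.0, n_grid=2000):
    """Posterior density on a midpoint grid over (0, lam_max) under a uniform prior."""
    step = lam_max / n_grid
    grid = [(i + 0.5) * step for i in range(n_grid)]
    logp = [log_likelihood(lam, xs, x_min, x_max) for lam in grid]
    top = max(logp)
    unnorm = [math.exp(v - top) for v in logp]  # subtract the max before exponentiating
    norm = sum(unnorm) * step                    # Riemann-sum normalisation
    return grid, [u / norm for u in unnorm], step

xs = [1.5, 2.0, 3.0, 4.0, 5.0, 12.0]  # hypothetical decay positions in (1, 20)
grid, post, step = grid_posterior(xs, x_min=1.0, x_max=20.0)

lam_map = grid[post.index(max(post))]
rms = math.sqrt(sum(p * (lam - lam_map) ** 2 for lam, p in zip(grid, post)) * step)
print(f"MAP = {lam_map:.2f}, RMS spread around MAP = {rms:.2f}")
```

With so few data points the spread is wide, since large values of $\lambda$ keep non-negligible posterior mass where the likelihood is flat.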

* **Posterior Predictive** (less confident than the MAP predictive)

>$$p(x^\star \lvert \{x_n\}_{n=1}^N) = \int p(x^\star \lvert  \lambda) p(\lambda | \{x_n\}_{n=1}^N) \text{d} \lambda$$

* **MLE** (alternative to full Bayesian)

>$$\lambda_{\text{ML}} = \underset{\lambda}{\mathrm{arg\,max}} \;\; p(\{ x_n \}_{n=1}^N | \lambda ) = \underset{\lambda}{\mathrm{arg\,max}} \;\;\prod_{n=1}^N p( x_n | \lambda )$$

>* Comparison to MAP (uniform prior on $(0, \lambda_{\text{max}})$)

>$$\lambda_{\text{MAP}} = \underset{\lambda<\lambda_{\text{max}}}{\mathrm{arg\,max}} \;\; p(\lambda) p(\{ x_n \}_{n=1}^N | \lambda )  = \underset{\lambda<\lambda_{\text{max}}}{\mathrm{arg\,max}} \;\; \frac{1}{\lambda_{\text{max}}} \prod_{n=1}^N p( x_n | \lambda ) $$

>* MLE is recovered in the limit $\lambda_{\text{max}} \rightarrow \infty$
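
As a worked step, the maximisation can be tied back to Section 1.1. Taking the log of the likelihood and using $\mu(\lambda) = -\frac{d}{d(1/\lambda)} \log Z(\lambda)$ from Section 1.1, the stationarity condition reads:

>$$
\begin{align}
\log p(\{ x_n \}_{n=1}^N | \lambda) &= -N \log Z(\lambda) - \frac{1}{\lambda} \sum_{n=1}^N x_n \\
\frac{d}{d(1/\lambda)} \log p(\{ x_n \}_{n=1}^N | \lambda) &= N \mu(\lambda) - \sum_{n=1}^N x_n = 0
\;\; \Rightarrow \;\; \mu(\lambda_{\text{ML}}) = \frac{1}{N} \sum_{n=1}^N x_n
\end{align}
$$

So, for this model, the ML estimate matches the model mean to the sample mean, which gives one answer to "why $\mathbb{E}(x)$?". This assumes the sample mean lies below $(x_{\text{max}} + x_{\text{min}})/2$; otherwise the likelihood keeps increasing with $\lambda$ and has no finite maximiser.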

* **Alternative** (posterior samples)

>* Draw samples $\lambda \sim p(\lambda | \{ x_n\}_{n=1}^N)$
>* The samples are typical values of the decay constant that are consistent with the observed data
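
One simple way to draw such samples in one dimension is to discretise the posterior on a grid and sample grid cells in proportion to their mass. The sketch below is an assumption-laden illustration: the data, the window $(1, 20)$, the grid resolution and the helper name `sample_posterior` are all hypothetical, and a grid sampler is only one of several possible choices.

```python
import math
import random

def log_likelihood(lam, xs, x_min, x_max):
    """log p({x_n} | lam) = -N log Z(lam) - sum(x_n) / lam."""
    log_z = math.log(lam) - x_min / lam + math.log1p(-math.exp(-(x_max - x_min) / lam))
    return -len(xs) * log_z - sum(xs) / lam

def sample_posterior(xs, x_min, x_max, n_samples, lam_max=100.0, n_grid=2000, seed=0):
    """Draw lam from a grid approximation of p(lam | data) under a U(0, lam_max) prior."""
    step = lam_max / n_grid
    grid = [(i + 0.5) * step for i in range(n_grid)]
    logp = [log_likelihood(lam, xs, x_min, x_max) for lam in grid]
    top = max(logp)
    weights = [math.exp(v - top) for v in logp]  # unnormalised posterior mass per cell
    rng = random.Random(seed)
    # Choose cells in proportion to their mass, then jitter uniformly inside each cell.
    cells = rng.choices(grid, weights=weights, k=n_samples)
    return [c + (rng.random() - 0.5) * step for c in cells]

xs = [1.5, 2.0, 3.0, 4.0, 5.0, 12.0]  # hypothetical decay positions in (1, 20)
samples = sample_posterior(xs, 1.0, 20.0, n_samples=1000)
print("first five samples:", [round(s, 2) for s in samples[:5]])
print("sample mean:", round(sum(samples) / len(samples), 2))
```

Averaging $p(x^\star | \lambda)$ over these samples gives a Monte Carlo approximation of the posterior predictive integral.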

## 1.3. Inference and Decision Making

* **Medical Diagnosis** (a: disease / b: test result)

>$$p(b=1|a=1)=0.95 \;\;\;,\;\;\; p(b=0|a=0)=0.95 \;\;\;,\;\;\; p(a=1)=0.05$$

>\begin{align}
p(a=1|b=1)&=\frac{p(b=1|a=1)p(a=1)}{p(b=1)} \\
&=\frac{p(b=1|a=1)p(a=1)}{p(b=1|a=1)p(a=1)+p(b=1|a=0)p(a=0)} = \frac{1}{2}
\end{align}
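
The arithmetic can be checked with a few lines of Python. The argument names `sensitivity` and `specificity` are labels introduced here for $p(b=1|a=1)$ and $p(b=0|a=0)$; they do not appear in the notes.

```python
def posterior_disease(sensitivity, specificity, prior):
    """p(a=1 | b=1) via Bayes' rule for a binary disease a and a binary test b."""
    true_pos = sensitivity * prior                    # p(b=1|a=1) p(a=1)
    false_pos = (1.0 - specificity) * (1.0 - prior)   # p(b=1|a=0) p(a=0)
    return true_pos / (true_pos + false_pos)

p = posterior_disease(sensitivity=0.95, specificity=0.95, prior=0.05)
print(p)  # close to 0.5, matching the 1/2 above up to floating-point rounding
```

The two terms in the denominator are equal here (0.95 x 0.05 each), which is why a test that is 95% accurate still leaves only an even chance of disease after a positive result.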

* **Medical Treatment**

>* **Reward:**

>$$
\begin{bmatrix}
R(a = 0, t = 0) & R(a = 0, t = 1) \\
R(a = 1, t = 0) &R(a = 1, t = 1) \\
\end{bmatrix} =
\begin{bmatrix}
10 & 7\\
3 &5\\
\end{bmatrix}
$$

>* **Conditional Reward:**

>$$R(t)=\sum_a R(a,t)p(a|b=1)$$
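
As a short worked example, the sketch below evaluates $R(t)$ for both decisions using the reward table above and the posterior $p(a=1|b=1) = 1/2$ from the diagnosis example. Reading $t=1$ as "treat" and $t=0$ as "do not treat" is an assumption about the notation.

```python
# Reward table keyed by (a, t): a = disease state, t = treatment decision.
REWARD = {(0, 0): 10.0, (0, 1): 7.0, (1, 0): 3.0, (1, 1): 5.0}

def expected_reward(t, p_disease):
    """R(t) = sum_a R(a, t) p(a | b=1), where p_disease = p(a=1 | b=1)."""
    return REWARD[(0, t)] * (1.0 - p_disease) + REWARD[(1, t)] * p_disease

p_disease = 0.5  # posterior from the diagnosis example
rewards = {t: expected_reward(t, p_disease) for t in (0, 1)}
best_t = max(rewards, key=rewards.get)
print(rewards, best_t)  # {0: 6.5, 1: 6.0} 0
```

With these rewards the two decisions break even at $p(a=1|b=1) = 0.6$ (solve $10 - 7p = 7 - 2p$), so $t=1$ has the higher expected reward only when the posterior probability of disease exceeds 0.6.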