# 3. Classification

## 3.1. Binary Logistic Classification

* **Linear projection + Logistic activation**
* **Algorithm**

>$$ a_n = \mathbf{w}^\top \mathbf{x}_n = \sum_{d=1}^D w_d x_{n,d} $$

>$$p(y_n = 1 | \mathbf{x}_n, \mathbf{w}) = \sigma(a_n)  = \frac{1}{1 + \text{exp}({-\mathbf{w}^\top \mathbf{x}_n})}$$

* **Optimisation Method**

>* **Likelihood**

>$$
p(\{ y_n \}_{n=1}^N|\{\mathbf{x}_n\}_{n=1}^N, \mathbf{w}) = \prod^N_{n = 1} \sigma(\mathbf{w}^\top\mathbf{x}_n)^{y_n} \big(1 - \sigma(\mathbf{w}^\top\mathbf{x}_n)\big)^{1-y_n}
$$

>* **Log-likelihood (LL)**

>$$
\mathcal{L}(\mathbf{w}) =\text{log}~p(\{ y_n \}_{n=1}^N|\{\mathbf{x}_n\}_{n=1}^N, \mathbf{w}) = \sum^N_{n = 1} \left[ y_n\text{log}~\sigma(\mathbf{w}^\top\mathbf{x}_n)+(1-y_n)\text{log}~\big(1 - \sigma(\mathbf{w}^\top\mathbf{x}_n)\big) \right]
$$

>* **Gradient Ascent**

>$$\frac{\partial \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}} = \sum^N_{n = 1} \big(y_n - \sigma(\mathbf{w}^\top\mathbf{x}_n)\big)\mathbf{x}_n 
\;\;\;\Rightarrow\;\;\;
\mathbf{w}_{i+1} = \mathbf{w}_{i} + \eta \frac{\partial \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}}\bigg|_{\mathbf{w}_{i}}$$

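The likelihood, gradient and update rule above can be sketched in plain Python. This is a minimal illustration rather than a reference implementation: the toy data, the step size `eta`, the number of steps and all function names are assumptions made for the example.

```python
import math


def sigmoid(a):
    """Logistic function, written to avoid overflow for large |a|."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)
    return e / (1.0 + e)


def log_likelihood(w, X, y):
    """Sum over n of y_n log s_n + (1 - y_n) log(1 - s_n)."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        s = sigmoid(sum(w_d * x_d for w_d, x_d in zip(w, x_n)))
        total += y_n * math.log(s) + (1 - y_n) * math.log(1.0 - s)
    return total


def gradient(w, X, y):
    """Sum over n of (y_n - s_n) x_n."""
    grad = [0.0] * len(w)
    for x_n, y_n in zip(X, y):
        s = sigmoid(sum(w_d * x_d for w_d, x_d in zip(w, x_n)))
        for d, x_d in enumerate(x_n):
            grad[d] += (y_n - s) * x_d
    return grad


def fit(X, y, eta=0.1, steps=500):
    """Plain gradient ascent on the log-likelihood, starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(steps):
        g = gradient(w, X, y)
        w = [w_d + eta * g_d for w_d, g_d in zip(w, g)]
    return w


# Toy data (hypothetical): the first feature is a constant 1 acting as a bias.
# The classes overlap, so the maximum-likelihood weights are finite.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.5], [1.0, 1.0], [1.0, 2.0], [1.0, -0.5]]
y = [0, 0, 0, 1, 1, 1]
w_hat = fit(X, y)
```

At convergence the gradient is close to zero, which is a quick sanity check that the update has reached a stationary point of $\mathcal{L}(\mathbf{w})$.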

## 3.2. kNN Classification

* **Cheap to train & Expensive to use & No uncertainty measure**
* **Distance Metrics ($L_p$ distance)**

>$$d(\mathbf{x}_1, \mathbf{x}_2) = \bigg[\sum_{d} \big|x_{1,d} - x_{2,d}\big|^p\bigg]^{1/p}$$

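A small sketch of kNN with the $L_p$ distance, assuming a majority vote over the `k` nearest neighbours (the voting rule, the toy data and the function names are illustrative choices, not taken from the notes).

```python
from collections import Counter


def lp_distance(x1, x2, p=2.0):
    """L_p distance between two equal-length feature vectors (p >= 1)."""
    return sum(abs(a - b) ** p for a, b in zip(x1, x2)) ** (1.0 / p)


def knn_predict(x_query, X_train, y_train, k=3, p=2.0):
    """Majority vote among the k training points closest to x_query."""
    # All the work happens at query time: one distance per training point.
    by_distance = sorted(
        zip(X_train, y_train), key=lambda pair: lp_distance(x_query, pair[0], p)
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters in 2-D.
X_train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
y_train = ["a", "a", "a", "b", "b", "b"]
```

There is no training step at all, which is the "cheap to train" part; every prediction scans the full training set, which is the "expensive to use" part.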

## 3.3. Multi-class Softmax Classification

* **Compute $K$ activations, one per class, using each class's weight vector $\mathbf{w}_k$**
* **Prob. contours: not linear / Decision boundaries: linear**
* **Algorithm**

>$$
p(y_{n} = k |\mathbf{x}_n, \{\mathbf{w}_k\}_{k=1}^K) = \frac{\exp(a_{n,k})}{\sum_{k'=1}^K \text{exp}(a_{n,k'})} = \frac{\text{exp}(\mathbf{w}_k^\top \mathbf{x}_n)}{\sum_{k'=1}^K \exp(\mathbf{w}_{k'}^\top \mathbf{x}_n)}
$$

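The softmax above can be sketched as follows. Subtracting the largest activation before exponentiating is a standard numerical safeguard that is assumed here, not stated in the notes; the function names are illustrative.

```python
import math


def softmax(activations):
    """Map K activations a_k to probabilities exp(a_k) / sum_k' exp(a_k')."""
    # Subtracting the max leaves the ratio unchanged but avoids overflow.
    a_max = max(activations)
    exps = [math.exp(a - a_max) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def class_probabilities(W, x):
    """W is a list of K weight vectors w_k; x is a single input vector."""
    activations = [sum(w_d * x_d for w_d, x_d in zip(w_k, x)) for w_k in W]
    return softmax(activations)
```

With $K = 2$ and one activation fixed at zero, the softmax reduces to the logistic sigmoid of the other activation.
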
* **MLE** (one-hot encoding: $y_{n,k} = 1$ if $y_n = k$, else $0$)

>* **Likelihood**

>\begin{align}
p(\{y_{n}\}_{n=1}^N|\{\mathbf{x}_n\}_{n=1}^N, \{\mathbf{w}_k\}_{k=1}^K) &= \prod_{n = 1}^N \prod_{k = 1}^K s_{n,k}^{y_{n,k}}
\end{align}
>
>$$s_{n,k} = p(y_{n} = k |\mathbf{x}_n, \{\mathbf{w}_k\}_{k=1}^K) = \frac{\text{exp}(\mathbf{w}_k^\top \mathbf{x}_n)}{\sum_{k'} \exp(\mathbf{w}_{k'}^\top \mathbf{x}_n)}$$

>* **LL**

>\begin{align}
\mathcal{L}(\{\mathbf{w}_k\}_{k=1}^K) &= \sum_{n = 1}^N \sum_{k = 1}^K y_{n,k} \log s_{n,k}
\end{align}





>* **Derivative**

>\begin{align}
\frac{\partial \mathcal{L}(\{\mathbf{w}_k\}_{k=1}^K)}{\partial \mathbf{w}_j} = \sum^N_{n = 1} (y_{n,j} - s_{n,j}) \mathbf{x}_n
\end{align}

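As a sanity check on the derivative, the sketch below evaluates the softmax log-likelihood and its gradient on a tiny made-up problem. The data, weights and function names are hypothetical; the one-hot rows `Y` follow the encoding used in the likelihood.

```python
import math


def softmax_probs(W, x):
    """Class probabilities s_k for one input x, given K weight vectors W."""
    a = [sum(w_d * x_d for w_d, x_d in zip(w_k, x)) for w_k in W]
    a_max = max(a)
    exps = [math.exp(a_k - a_max) for a_k in a]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood(W, X, Y):
    """Sum over n, k of y_{n,k} log s_{n,k}; Y holds one-hot rows."""
    return sum(
        y_nk * math.log(s_nk)
        for x_n, y_n in zip(X, Y)
        for y_nk, s_nk in zip(y_n, softmax_probs(W, x_n))
        if y_nk
    )


def gradient(W, X, Y):
    """grad[j] = sum over n of (y_{n,j} - s_{n,j}) x_n."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for x_n, y_n in zip(X, Y):
        s_n = softmax_probs(W, x_n)
        for j in range(len(W)):
            for d, x_d in enumerate(x_n):
                grad[j][d] += (y_n[j] - s_n[j]) * x_d
    return grad


# Toy problem (hypothetical values): N = 3 points, D = 2 features, K = 3 classes.
X = [[1.0, 0.5], [1.0, -1.0], [1.0, 2.0]]
Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.05]]
```

Comparing `gradient(W, X, Y)` against central finite differences of `log_likelihood` is a cheap way to catch sign or index mistakes. The per-class gradients also sum to zero, because both $y_{n,j}$ and $s_{n,j}$ sum to one over $j$.
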
## 3.4. Non-linear Classification

* **Non-linear binary logistic classification**

>$$ a_n = w_0 + w_1 \phi_{1}(\mathbf{x}_n) + w_2 \phi_{2}(\mathbf{x}_n) + \dots + w_D \phi_{D}(\mathbf{x}_n) = \mathbf{w}^\top \boldsymbol{\Phi}(\mathbf{x}_n) $$

>\begin{equation}
\boldsymbol{\Phi} =  \begin{pmatrix}
1 & \phi_1(x_1) & \cdots & \phi_D(x_1)\\\
1 & \phi_1(x_2) & \cdots & \phi_D(x_2)\\\
\vdots & \vdots & \ddots & \vdots \\\
1 & \phi_1(x_N) & \cdots & \phi_D(x_N)\\\
\end{pmatrix}
\end{equation}

>\begin{align}
p(y_n = 1 | \mathbf{x}_n, \mathbf{w}) = \sigma(\mathbf{w}^\top \boldsymbol{\Phi}(\mathbf{x}_n))
\end{align}

* **Example:** **isotropic Gaussian basis fn.** a.k.a. **radial basis fn.**

>$$\phi_{d}(\mathbf{x}) = \exp\left(-\frac{1}{2 l^2} \lVert \mathbf{x} - \boldsymbol{\mu}_{d}\rVert^2\right)$$

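A minimal sketch of the radial basis expansion feeding a logistic classifier. The centres, the length-scale and the function names are placeholders chosen for the example; how to pick the centres is not specified in the notes.

```python
import math


def rbf_features(x, centres, length_scale):
    """Return [1, phi_1(x), ..., phi_D(x)] for one input vector x."""
    features = [1.0]  # constant feature that pairs with the bias w_0
    for mu in centres:
        sq_dist = sum((x_i - mu_i) ** 2 for x_i, mu_i in zip(x, mu))
        features.append(math.exp(-sq_dist / (2.0 * length_scale ** 2)))
    return features


def predict_proba(w, x, centres, length_scale):
    """sigma(w^T Phi(x)) with w = [w_0, w_1, ..., w_D]."""
    phi = rbf_features(x, centres, length_scale)
    a = sum(w_d * phi_d for w_d, phi_d in zip(w, phi))
    return 1.0 / (1.0 + math.exp(-a))
```

The model stays linear in $\mathbf{w}$, so the gradient-ascent update from binary logistic classification applies unchanged once $\mathbf{x}_n$ is replaced by its feature vector.
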
## 3.5. Overfitting in classification

>Overfitting occurs when the model can contort itself to assign probability $1$ to every training data point $\rightarrow$ overconfident predictions that do not generalise to test data

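>**Linear classification:** overfits when the training data classes are perfectly separable by a linear decision boundary, because the likelihood keeps increasing as $\lVert\mathbf{w}\rVert \rightarrow \infty$ and the sigmoid sharpens into a step function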

>**Non-linear logistic classification:** overfit when a large number of basis functions are used and the length-scales are short

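The separable case can be illustrated numerically. The sketch below uses a hypothetical 1-D dataset split perfectly at $x = 0$ and a bias-free logistic model; scaling the weight up keeps raising the log-likelihood towards $0$, so maximum likelihood has no finite optimum.

```python
import math


def log_sigmoid(a):
    """log sigma(a), computed without overflow for large |a|."""
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))


def log_likelihood(w, xs, ys):
    """Bernoulli log-likelihood of a 1-D logistic model with no bias."""
    # log(1 - sigma(a)) equals log sigma(-a).
    return sum(
        log_sigmoid(w * x) if y == 1 else log_sigmoid(-w * x)
        for x, y in zip(xs, ys)
    )


# Toy data (hypothetical): the classes are split perfectly at x = 0.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

# Log-likelihood for increasingly large weights.
lls = [log_likelihood(w, xs, ys) for w in (1.0, 10.0, 100.0)]
```
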
## 3.6. Bayesian Classification

* **Algorithm**

>1. Likelihood $p(\mathcal{D}|\mathbf{w})$
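2. Assume an isotropic Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{w}; 0, \boldsymbol{\Sigma}_0=\lambda \boldsymbol{I})$
3. Apply the Laplace approximation to the unnormalised posterior $p(\mathcal{D}|\mathbf{w})\,p(\mathbf{w})$, fitting a Gaussian $\mathcal{N}(\mathbf{w}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$ at its mode
4. Use this Gaussian as the approximate posterior $p(\mathbf{w}|\mathcal{D}) \approx \mathcal{N}(\mathbf{w}; \boldsymbol{\mu}, \boldsymbol{\Sigma})$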
5. Calculate the approximate predictive distribution $p(y^*| \mathcal{D})$

* **Taylor Expansion of $\log p(z)$ ($z_0$: mode, so the first-order term vanishes)**

>\begin{align}
\text{log}~p(z) \approx \text{log}~p(z_0) + \frac{1}{2}(z - z_0)^2\frac{d^2}{dz^2}\text{log}~p(z)\bigg|_{z=z_0}
\end{align}

* **Laplace approximation of $p(z)$**

>\begin{align}
\text{log}~\mathcal{N}(z; z_0, \sigma^2) = \text{const. } - \frac{1}{2\sigma^2}(z - z_0)^2
\end{align}

>$$\frac{1}{\sigma^2} = - \frac{d^2}{dz^2}\text{log}~p(z)\bigg|_{z=z_0}$$

* **Example**

>\begin{align}
p(z) = \frac{a}{\pi}\frac{1}{a^2 + z^2}~\text{ (already normalised)}
\end{align}
>* Mode at $z = 0$
>\begin{align}
\frac{d^2}{dz^2}\text{log}~p(z) \bigg|_{z=0} = \bigg[ \frac{4z^2}{(a^2 + z^2)^2} - \frac{2}{a^2 + z^2} \bigg]_{z = 0} = -\frac{2}{a^2}
\end{align}
>* Laplace approximation of $p(z)$:
>$$
\mathcal{N}(z; 0, a^2/2)
$$

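The worked example can be checked numerically. The sketch below estimates the second derivative of $\log p(z)$ at the mode with a central finite difference and inverts it to get the Laplace variance; the value of `a`, the step `h` and the function names are arbitrary choices for illustration.

```python
import math


def log_cauchy(z, a):
    """log p(z) for p(z) = (a / pi) / (a^2 + z^2)."""
    return math.log(a / math.pi) - math.log(a * a + z * z)


def laplace_variance(log_p, z0, h=1e-4):
    """sigma^2 = -1 / (d^2/dz^2 log p) at z0, via a central difference."""
    second = (log_p(z0 + h) - 2.0 * log_p(z0) + log_p(z0 - h)) / (h * h)
    return -1.0 / second


a = 1.5  # hypothetical scale parameter
var = laplace_variance(lambda z: log_cauchy(z, a), z0=0.0)
```

The result agrees with the analytic value $a^2/2$ up to finite-difference error. The Gaussian matches the density well near the mode but has much lighter tails than the original distribution.
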
* **Laplace approximation for multivariate distributions**

>\begin{align}
\text{log}~p(\mathbf{z}) &\approx \text{log}~p(\mathbf{z}_0) + \frac{1}{2}\sum_i\sum_j (z_i - z_{0,i})(z_j - z_{0,j})\frac{\partial^2}{\partial z_i \partial z_j}\text{log}~p(\mathbf{z})\bigg|_{\mathbf{z}=\mathbf{z}_0}\\
&= \text{log}~p(\mathbf{z}_0) + \frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^\top \bigg[\nabla \nabla \text{log}~p(\mathbf{z})\Big|_{\mathbf{z}=\mathbf{z}_0}\bigg] (\mathbf{z} - \mathbf{z}_0)\\
\;\\
\text{log}~q(\mathbf{z}) &= \text{const. } - \frac{1}{2} (\mathbf{z} - \mathbf{z}_0)^\top \boldsymbol{\Sigma}^{-1} (\mathbf{z} - \mathbf{z}_0)\\
\text{Negative Hessian at the mode: }\boldsymbol{\Sigma}^{-1} &= -\nabla \nabla \text{log}~p(\mathbf{z})\Big|_{\mathbf{z}=\mathbf{z}_0}.\\
\end{align}
>$$\;$$


## 3.7. Bayesian Logistic Regression

* **Determine mode & Calculate Hessian at the mode**
* **Obtaining Hessian**:

>* Use a Gaussian prior: $\mathcal{N}(\mathbf{w}; \mathbf{m}_0, \boldsymbol{\Sigma}_0)$
>$$\;$$
>\begin{align}
p(\mathbf{w}| \{y_n, \mathbf{x}_n\}) &\propto p(\{y_n\} | \{\mathbf{x}_n\}, \mathbf{w})\, p(\mathbf{w})\\
\text{log}~p(\mathbf{w}| \{y_n, \mathbf{x}_n\}) &= \text{const. } -\frac{1}{2} (\mathbf{w} - \mathbf{m}_0)^\top \boldsymbol{\Sigma}^{-1}_0(\mathbf{w} - \mathbf{m}_0) \\
&\;\;\;\; + \sum^N_{n = 1} \left[ y_n\text{log}~\sigma(\mathbf{w}^\top\mathbf{x}_n)+(1-y_n)\text{log}~\big(1 - \sigma(\mathbf{w}^\top\mathbf{x}_n)\big) \right]\\
\nabla \text{log}~p(\mathbf{w}| \{y_n, \mathbf{x}_n\}) &= -\boldsymbol{\Sigma}^{-1}_0(\mathbf{w} - \mathbf{m}_0) + \sum^N_{n = 1} \big(y_n - \sigma(\mathbf{w}^\top\mathbf{x}_n)\big)\mathbf{x}_n\\
\boldsymbol{\Sigma}^{-1} = -\nabla \nabla \text{log}~p(\mathbf{w}| \{y_n, \mathbf{x}_n\})\Big|_{\mathbf{w}_{MAP}} &= \boldsymbol{\Sigma}^{-1}_0 + \sum^N_{n = 1} \sigma(\mathbf{w}_{MAP}^\top\mathbf{x}_n)\big(1 - \sigma(\mathbf{w}_{MAP}^\top\mathbf{x}_n)\big)\mathbf{x}_n \mathbf{x}_n^\top\\
\end{align}
>$$\;$$
>* $\therefore$ The Laplace approximation: $\mathcal{N}(\mathbf{w}; \mathbf{w}_{MAP}, \boldsymbol{\Sigma})$

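To make the two steps concrete, here is a sketch for the simplest possible case: a single scalar weight ($D = 1$), so that $\boldsymbol{\Sigma}_0$ and $\boldsymbol{\Sigma}$ reduce to scalar variances. Using Newton's method to find the mode is an assumption of this example (the notes do not say how the mode is found), and the toy data and names are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def grad_log_posterior(w, xs, ys, m0, s0_sq):
    """Scalar form of -Sigma_0^{-1}(w - m_0) + sum_n (y_n - sigma_n) x_n."""
    data_term = sum((y - sigmoid(w * x)) * x for x, y in zip(xs, ys))
    return -(w - m0) / s0_sq + data_term


def precision(w, xs, s0_sq):
    """Scalar form of Sigma_0^{-1} + sum_n sigma_n (1 - sigma_n) x_n^2."""
    return 1.0 / s0_sq + sum(
        sigmoid(w * x) * (1.0 - sigmoid(w * x)) * x * x for x in xs
    )


def laplace_posterior(xs, ys, m0=0.0, s0_sq=1.0, iters=25):
    """Newton's method for w_MAP, then the Laplace variance at that mode."""
    w = m0
    for _ in range(iters):
        w += grad_log_posterior(w, xs, ys, m0, s0_sq) / precision(w, xs, s0_sq)
    return w, 1.0 / precision(w, xs, s0_sq)


# Toy data (hypothetical): linearly separable, yet the prior keeps w_MAP finite.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]
w_map, var = laplace_posterior(xs, ys)
```

The returned variance is smaller than the prior variance because each data point adds a non-negative term to the precision.
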
* **Predictive Distribution**:

>\begin{align}
p(y^* = 1| \mathbf{x}^*, \{y_n, \mathbf{x}_n\}) &= \int p(y^* = 1| \mathbf{x}^*, \mathbf{w}) p(\mathbf{w} | \{y_n, \mathbf{x}_n\})d\mathbf{w} = \int \sigma(\mathbf{w}^\top\mathbf{x}^*) p(\mathbf{w} | \{y_n, \mathbf{x}_n\})d\mathbf{w}\\
&\approx \int \sigma(\mathbf{w}^\top\mathbf{x}^*) q(\mathbf{w})d\mathbf{w}\\
&= \int \int \sigma(a)\delta(a - \mathbf{w}^\top\mathbf{x}^*)da~q(\mathbf{w})d\mathbf{w}\\
&= \int \sigma(a)p(a)da, \text{ where }~p(a) = \int \delta(a - \mathbf{w}^\top\mathbf{x}^*)q(\mathbf{w})d\mathbf{w}\\
\end{align}
>$$\;$$
>* The expression for $p(a)$ can be simplified by noting that the Dirac delta $\delta(a - \mathbf{w}^\top\mathbf{x}^*)$ imposes a linear constraint on $\mathbf{w}$, so the integral $\int \delta(a - \mathbf{w}^\top\mathbf{x}^*) q(\mathbf{w}) d\mathbf{w}$ integrates out $\mathbf{w}$ along all directions orthogonal to $\mathbf{x}^*$. Since a marginal of a Gaussian is also Gaussian, $p(a)$ is Gaussian, and its mean and variance fully characterise it:

* **Mean and variance of $p(a)$**:

>\begin{align}
\mu_a &= \int a\, p(a)\, da = \int a \int \delta(a - \mathbf{w}^\top\mathbf{x}^*) q(\mathbf{w}) d\mathbf{w}\, da = \int  \mathbf{w}^\top\mathbf{x}^* q(\mathbf{w}) d\mathbf{w} = \mathbf{w}_{MAP}^\top\mathbf{x}^*\\
~\\
\sigma_a^2 &= \int \big(a^2 - \mu_a^2\big)p(a) da = \int \int \big(a^2 - \mu_a^2\big)\delta(a - \mathbf{w}^\top\mathbf{x}^*)q(\mathbf{w}) d\mathbf{w}\, da \\
&= \int \big((\mathbf{w}^\top\mathbf{x}^*)^2 - (\mathbf{w}_{MAP}^\top\mathbf{x}^*)^2\big)q(\mathbf{w}) d\mathbf{w}\\
&= \int (\mathbf{w}^\top\mathbf{x}^*)^2 q(\mathbf{w}) d\mathbf{w} - \mathbf{x}^{*\top}\mathbf{w}_{MAP}\mathbf{w}_{MAP}^\top\mathbf{x}^*\\
&= \mathbf{x}^{*\top} \bigg[\int\mathbf{w}\mathbf{w}^\top q(\mathbf{w}) d\mathbf{w} \bigg]\mathbf{x}^* - \mathbf{x}^{*\top}\mathbf{w}_{MAP}\mathbf{w}_{MAP}^\top\mathbf{x}^*\\
&= \mathbf{x}^{*\top}\bigg[\mathbf{w}_{MAP}\mathbf{w}_{MAP}^\top + \boldsymbol{\Sigma} \bigg] \mathbf{x}^* - \mathbf{x}^{*\top}\mathbf{w}_{MAP}\mathbf{w}_{MAP}^\top\mathbf{x}^*\\
&= \mathbf{x}^{*\top}\boldsymbol{\Sigma}\mathbf{x}^*
\end{align}
>$$\;$$
>* Therefore $p(a) = \mathcal{N}(a; \mathbf{w}_{MAP}^\top\mathbf{x}^*, \mathbf{x}^{*\top}\boldsymbol{\Sigma}\mathbf{x}^*)$
>* The remaining integral $\int \sigma(a)\mathcal{N}(a; \mathbf{w}_{MAP}^\top\mathbf{x}^*, \mathbf{x}^{*\top}\boldsymbol{\Sigma}\mathbf{x}^*)da$ is the convolution of a sigmoid with a Gaussian and has no closed form, so another approximation must be made.

* **Approximate the sigmoid using the probit function**:

>$$\sigma(a) \approx \Phi(\lambda a) = \int_{-\infty}^{\lambda a} \mathcal{N}(z; 0, 1) dz$$
>$$\;$$
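>* The scaling constant $\lambda$ is picked such that the gradients of $\sigma(a)$ and $\Phi(\lambda a)$ are equal at the origin: $\sigma'(0) = \frac{1}{4}$ and $\frac{d}{da}\Phi(\lambda a)\big|_{a=0} = \frac{\lambda}{\sqrt{2\pi}}$, giving $\lambda^2 = \pi/8$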

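A quick numerical look at how close the scaled probit is to the sigmoid, assuming the slope-matching value $\lambda^2 = \pi/8$. The grid range and the names are arbitrary choices for the example; the probit is written with the standard-library error function.

```python
import math

LAMBDA = math.sqrt(math.pi / 8.0)  # from matching the slopes at a = 0


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def probit(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# Slopes at the origin: sigma'(0) = sigma(0)(1 - sigma(0)) and
# d/da Phi(lambda a) at a = 0 equals lambda times the N(0, 1) density at 0.
slope_sigmoid = sigmoid(0.0) * (1.0 - sigmoid(0.0))
slope_probit = LAMBDA / math.sqrt(2.0 * math.pi)

# Largest gap between the two curves on a grid over [-10, 10].
max_gap = max(
    abs(sigmoid(i / 100.0) - probit(LAMBDA * i / 100.0))
    for i in range(-1000, 1001)
)
```

Both slopes come out as $1/4$, and on this grid the two curves never differ by more than about two percentage points.
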
* **Predictive Distribution Integral**

>* Under this approximation, it can be shown that the predictive distribution integral is equal to another scaled probit:
>$$\;$$
>\begin{align}
\int \sigma(a)\mathcal{N}(a; \mu, \sigma^2)da &\approx \int \Phi(\lambda a)\mathcal{N}(a; \mu, \sigma^2)da\\
&= \Phi\Bigg(\frac{\mu}{(\lambda^{-2} + \sigma^2)^{1/2}}\Bigg), \text{ where } \mu = \mathbf{w}_{MAP}^\top \mathbf{x}^*, ~\sigma^2 = \mathbf{x}^{*\top}\boldsymbol{\Sigma}\mathbf{x}^*\\
\end{align}
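
The closed form can be compared against brute-force numerical integration of $\int \sigma(a)\mathcal{N}(a; \mu, \sigma^2)da$. This is a rough sketch: it assumes $\lambda^2 = \pi/8$, uses a simple trapezoid rule with an arbitrary grid, and all names are illustrative.

```python
import math

LAMBDA_SQ = math.pi / 8.0  # assumed slope-matching value of lambda^2


def probit(x):
    """Standard normal CDF, written with the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def predictive_probit(mu, var):
    """Phi(mu / sqrt(lambda^-2 + sigma^2)), the closed-form approximation."""
    return probit(mu / math.sqrt(1.0 / LAMBDA_SQ + var))


def predictive_numeric(mu, var, n=20001, width=12.0):
    """Trapezoid-rule estimate of the integral of sigma(a) N(a; mu, var)."""
    sd = math.sqrt(var)
    lo = mu - width * sd
    h = 2.0 * width * sd / (n - 1)
    norm = sd * math.sqrt(2.0 * math.pi)
    total = 0.0
    for i in range(n):
        a = lo + i * h
        sig = 0.5 * (1.0 + math.tanh(0.5 * a))  # overflow-safe sigmoid
        dens = math.exp(-0.5 * ((a - mu) / sd) ** 2) / norm
        weight = 0.5 if i in (0, n - 1) else 1.0
        total += weight * sig * dens * h
    return total
```

For a fixed mean, a larger predictive variance $\sigma^2$ pulls the probability towards $0.5$, which is how the Bayesian treatment tempers confident predictions far from the training data.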