# 1. Inference
## 1.1. Radioactive Decay - Heuristic Approach

* **Problem**

>$$p(x|\lambda) = \frac{1}{Z(\lambda)} \text{exp}\bigg(-\frac{x}{\lambda}\bigg)$$

* **Histogram-based**

>\begin{align}
\mathbb{E}(C_x) &= N \int^{x+w/2}_{x-w/2}p(x|\lambda) dx
~\approx N w p(x|\lambda) = \frac{N w}{Z(\lambda)} \exp\bigg(-\frac{x}{\lambda}\bigg) \\
\log(\mathbb{E}(C_{x})) &= -\frac{x}{\lambda} + \text{const.} \rightarrow \text{find } \lambda \text{ from least squares}
\end{align}
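
A minimal NumPy sketch of the histogram recipe above, on synthetic data. The values `true_lambda`, the observation window `[0.5, 8.0]` and the bin count are illustrative assumptions, not part of the notes; the fit is an unweighted straight line through the log-counts.

```python
import numpy as np

def fit_lambda_histogram(x, bins=20):
    """Estimate lambda from the slope of log(bin counts) against bin centres."""
    counts, edges = np.histogram(x, bins=bins)
    centres = 0.5 * (edges[:-1] + edges[1:])
    keep = counts > 0                      # log(0) is undefined, so skip empty bins
    slope, _ = np.polyfit(centres[keep], np.log(counts[keep]), deg=1)
    return -1.0 / slope                    # log E(C_x) = -x / lambda + const.

rng = np.random.default_rng(0)
true_lambda = 2.0                          # assumed value, for illustration only
x = rng.exponential(scale=true_lambda, size=20000)
x = x[(x > 0.5) & (x < 8.0)]               # assumed window [x_min, x_max]
lambda_hat = fit_lambda_histogram(x)
print(lambda_hat)
```

The estimate depends on the bin width $w$ and on how sparsely populated the tail bins are, which is one reason this approach is only a heuristic.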

* **Statistic-based (e.g. mean)**

>\begin{align}
Z(\lambda) = \int^{x_{\text{max}}}_{x_{\text{min}}} \exp(-x/\lambda) \; \text{d}x = \lambda \left [ \exp(-x_{\text{min}}/\lambda) - \exp(-x_{\text{max}}/\lambda) \right]
\end{align}

>$$
\mu  = - \frac{d}{d(1/\lambda)} \log Z(\lambda) = \lambda +  \frac{  
x_{\text{min}} \exp(-x_{\text{min}}/\lambda) - x_{\text{max}} \exp(-x_{\text{max}}/\lambda) 
}{\exp(-x_{\text{min}}/\lambda) - \exp(-x_{\text{max}}/\lambda)}
$$
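
The mean equation above has no closed-form solution for $\lambda$, so one option is to solve it numerically. The sketch below uses bisection, relying on the fact that the truncated mean increases with $\lambda$; the function names, the bracket `[lo, hi]` and the window `[0.5, 8.0]` are assumptions made for illustration.

```python
import numpy as np

def truncated_mean(lam, x_min, x_max):
    """Mean of p(x | lambda) on [x_min, x_max], using the formula above."""
    e_min, e_max = np.exp(-x_min / lam), np.exp(-x_max / lam)
    return lam + (x_min * e_min - x_max * e_max) / (e_min - e_max)

def solve_lambda(sample_mean, x_min, x_max, lo=1e-2, hi=1e3, iters=200):
    """Bisection on lambda; the bracket [lo, hi] is an assumed search range."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if truncated_mean(mid, x_min, x_max) < sample_mean:
            lo = mid                       # mean too small, so lambda must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip: compute the mean for a known lambda, then recover lambda from it.
target_mean = truncated_mean(2.0, 0.5, 8.0)
lambda_hat = solve_lambda(target_mean, 0.5, 8.0)
print(lambda_hat)
```

In practice `target_mean` would be replaced by the sample mean of the observed decay positions.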

## 1.2. Radioactive Decay - Probabilistic Approach

* **Likelihood** (the data enter only through $\sum_n x_n$, so the sample mean is a sufficient statistic)

>$$p(\{ x_n \}_{n=1}^N | \lambda) = \frac{1}{Z(\lambda)^N} \exp\left(-\frac{1}{\lambda} \sum_{n = 1}^{N} x_n \right)
$$

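* **Posterior Predictive** (averages over the uncertainty in $\lambda$, so it is less confident than the MAP plug-in predictive $p(x^\star \lvert \lambda_{\text{MAP}})$)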

>$$p(x^\star \lvert \{x_n\}_{n=1}^N) = \int p(x^\star \lvert  \lambda) p(\lambda | \{x_n\}_{n=1}^N) \text{d} \lambda$$

* **MLE & MAP**

>$$\lambda_{\text{ML}} = \underset{\lambda}{\mathrm{arg\max}} \;\; p(\{ x_n \}_{n=1}^N | \lambda ) = \underset{\lambda}{\mathrm{arg\max}} \;\;\prod_{n=1}^N p( x_n | \lambda )$$

>$$\lambda_{\text{MAP}} = \underset{\lambda<\lambda_{\text{max}}}{\mathrm{arg\max}} \;\; p(\lambda) p(\{ x_n \}_{n=1}^N | \lambda )  = \underset{\lambda<\lambda_{\text{max}}}{\mathrm{arg\max}} \;\; \frac{1}{\lambda_{\text{max}}} \prod_{n=1}^N p( x_n | \lambda ) $$
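
The likelihood, the MAP estimate under the flat prior, and the posterior predictive can all be approximated on a grid of $\lambda$ values. This is a sketch under stated assumptions: the data array is made up, the window and `lam_max` are illustrative, and the grid starts at 0.2 rather than 0 to avoid numerical underflow.

```python
import numpy as np

def log_Z(lam, x_min, x_max):
    """log of Z(lambda) = lambda * [exp(-x_min/lambda) - exp(-x_max/lambda)]."""
    return np.log(lam * (np.exp(-x_min / lam) - np.exp(-x_max / lam)))

def log_likelihood(lam, x, x_min, x_max):
    """log p({x_n} | lambda) = -N log Z(lambda) - sum_n x_n / lambda."""
    return -len(x) * log_Z(lam, x_min, x_max) - np.sum(x) / lam

x_min, x_max, lam_max = 0.5, 8.0, 10.0                    # assumed values
x = np.array([0.9, 1.4, 2.2, 0.7, 3.1, 1.1, 5.0, 1.8])    # made-up observations

grid = np.linspace(0.2, lam_max, 2000)
log_post = log_likelihood(grid, x, x_min, x_max)   # flat prior only adds a constant
lam_map = grid[np.argmax(log_post)]                # equals lambda_ML when lambda_ML < lam_max

weights = np.exp(log_post - log_post.max())        # unnormalised posterior on the grid
weights /= weights.sum()

def posterior_predictive(x_star):
    """Grid approximation of the integral of p(x* | lambda) p(lambda | data)."""
    likelihood = np.exp(-x_star / grid - log_Z(grid, x_min, x_max))
    return np.sum(weights * likelihood)

print(lam_map, posterior_predictive(1.0))
```

Because the prior is uniform on $\lambda < \lambda_{\text{max}}$, the MAP and ML estimates only differ when the likelihood peaks beyond $\lambda_{\text{max}}$.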

# 2. Regression

## 2.1. Linear Regression

* **Least Squares Fit - Cost fn.**

>$$C_2 = \big|\big|\mathbf{y} - \mathbf{X}\mathbf{w}\big|\big|^2 = \big(\mathbf{y} - \mathbf{X}\mathbf{w}\big)^\top \big(\mathbf{y} - \mathbf{X}\mathbf{w}\big)$$

* **Solution** (a point estimate: it carries no uncertainty about $\mathbf{w}$)

>\begin{align}
\frac{\partial C_2}{\partial \mathbf{w}} &= -2\mathbf{X}^\top\big(\mathbf{y} - \mathbf{X}\mathbf{w}\big)=0\\
\implies &\boxed{\mathbf{w} = \big( \mathbf{X}^\top\mathbf{X}\big)^{-1}\mathbf{X}^\top \mathbf{y}}
\end{align}
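
A small sketch of the boxed solution on noise-free toy data (the data and the function name `least_squares` are illustrative). It solves the normal equations $\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y}$ directly rather than forming the inverse, which is usually the better-conditioned choice.

```python
import numpy as np

def least_squares(X, y):
    """Solve (X^T X) w = X^T y without forming the inverse explicitly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # column of ones gives the intercept
y = 1.0 + 2.0 * x                           # noise-free toy targets
w = least_squares(X, y)
print(w)
```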

* **MLE Fit** ($\mathbf{y} = \mathbf{Xw}+\boldsymbol{\epsilon}$, with $\epsilon_n \sim \mathcal{N}(0, \sigma_y^2)$)

>\\[p(\mathbf{y}\mid\mathbf{X}, \mathbf{w}, \sigma_y^2) = \frac{1}{(2\pi \sigma_y^2)^{N/2}}\text{exp}\left(-\frac{1}{2\sigma_y^2}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w})\right)\\]

>\\[-\mathcal{L}(\mathbf{w}) = \frac{N}{2}\log(2\pi \sigma_y^2) +\frac{1}{2\sigma_y^2}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w})\\]

>\\[\boxed{\text{Least squares} \equiv \text{minimize}~ (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w}) \Leftrightarrow \text{Maximum-likelihood}}\\]

## 2.2. Non-linear Regression

* **MLE Fit**

>\\[y_n = w_0 + w_1 \phi_{1}(x_n) + w_2 \phi_{2}(x_n) + \dots + w_D \phi_{D}(x_n) + \epsilon_n = \boldsymbol{\phi}(x_n)^\top \mathbf{w} + \epsilon_n.\\]

>$$ \Rightarrow \mathbf{y} = \boldsymbol{\Phi}\mathbf{w} + \boldsymbol{\epsilon}$$

>\begin{equation}
\text{design matrix: }\boldsymbol{\Phi} =  \begin{pmatrix}
1 & \phi_1(x_1) & \cdots & \phi_D(x_1)\\\
1 & \phi_1(x_2) & \cdots & \phi_D(x_2)\\\
\vdots & \vdots & \ddots & \vdots \\\
1 & \phi_1(x_N) & \cdots & \phi_D(x_N)
\end{pmatrix}
\end{equation}

>\\[C_2 = \big|\big| \mathbf{y} - \boldsymbol{\Phi}\mathbf{w}\big|\big|^2 = \big(\mathbf{y} - \boldsymbol{\Phi}\mathbf{w}\big)^\top \big(\mathbf{y} - \boldsymbol{\Phi}\mathbf{w}\big)\\]

>\\[- \mathcal{L}(\mathbf{w}) = - \text{log}~ p(\mathbf{y}|\boldsymbol{\Phi}, \mathbf{w}, \sigma_y^2) = \frac{N}{2}\text{log}(2\pi \sigma_y^2) + \frac{1}{2\sigma_y^2}(\mathbf{y} - \boldsymbol{\Phi}\mathbf{w})^\top (\mathbf{y} - \boldsymbol{\Phi}\mathbf{w})\\]

>\begin{align}
\boxed{\mathbf{w} = \big( \boldsymbol{\Phi}^\top\boldsymbol{\Phi}\big)^{-1}\boldsymbol{\Phi}^\top \mathbf{y}}
\end{align}
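
To make the design matrix concrete, the sketch below assumes a polynomial basis $\phi_d(x) = x^d$ (one possible choice; the notes do not fix the basis) and fits noise-free quadratic data. `np.linalg.lstsq` is used in place of the explicit inverse; it returns the same least-squares solution when $\boldsymbol{\Phi}^\top\boldsymbol{\Phi}$ is invertible.

```python
import numpy as np

def design_matrix(x, D):
    """Rows are phi(x_n) = (1, x_n, x_n^2, ..., x_n^D); polynomial basis assumed."""
    return np.vander(x, N=D + 1, increasing=True)

def fit_mle(Phi, y):
    """Least-squares / maximum-likelihood weights for y = Phi w + noise."""
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return w

x = np.linspace(-1.0, 1.0, 20)
y = 0.5 - x + 2.0 * x**2                    # toy targets, no noise
Phi = design_matrix(x, D=2)
w = fit_mle(Phi, y)
print(w)
```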

## 2.3. Regularisation

* **$L_p$ regularisation**: penalise $||\mathbf{w}||_p^p$ (below, $p = 2$)

>\\[C_2^{(\text{reg})} = \big|\big|\mathbf{y} - \boldsymbol{\Phi}\mathbf{w}\big|\big|^2 + \alpha\,\mathbf{w}^\top\mathbf{w} \\]

* **MAP Fit** - Gaussian prior: $p(\mathbf{w}|\sigma_\mathbf{w}^2)=\mathcal{N}(\mathbf{w};\mathbf{0},\sigma^2_\mathbf{w}\mathbf{I})$

>\begin{align}
\mathbf{w}^{\text{MAP}} & = \underset{\mathbf{w}}{\mathrm{arg\,max}} \; p(\mathbf{w} | \{x_n,y_n\}_{n=1}^N,\sigma_y^2,\sigma_{\mathbf{w}}^2)\\
& = \underset{\mathbf{w}}{\mathrm{arg\,max}} \frac{1}{(2\pi \sigma_\mathbf{w}^2)^{D/2}}\text{exp}\big(-\frac{1}{2\sigma_\mathbf{w}^2}\mathbf{w}^\top \mathbf{w} \big) \times \frac{1}{(2\pi \sigma_y^2)^{N/2}}\text{exp}\big(-\frac{1}{2\sigma_y^2}(\mathbf{y} - \boldsymbol{\Phi}\mathbf{w})^\top (\mathbf{y} - \boldsymbol{\Phi}\mathbf{w})\big) \\
& = \underset{\mathbf{w}}{\mathrm{arg\,min}} \;   (\mathbf{y} - \boldsymbol{\Phi}\mathbf{w})^\top (\mathbf{y} - \boldsymbol{\Phi}\mathbf{w}) + \alpha \mathbf{w}^\top \mathbf{w}  \;\; \text{where} \;\;\alpha = \frac{\sigma_y^2}{\sigma_\mathbf{w}^2}
\end{align}

* **Optimisation**

>\begin{align}
\frac{1}{2}\frac{\partial}{\partial \mathbf{w}}\Big[\big|\big|\mathbf{y} - \boldsymbol{\Phi}\mathbf{w}\big|\big|^2 + \alpha\,\mathbf{w}^\top\mathbf{w}\Big] &= - \boldsymbol{\Phi}^\top(\mathbf{y} - \boldsymbol{\Phi}\mathbf{w}) + \alpha\mathbf{w} \\
&= -\boldsymbol{\Phi}^\top\mathbf{y} + \boldsymbol{\Phi}^\top\boldsymbol{\Phi}\mathbf{w} + \alpha \mathbf{I}\mathbf{w} = 0
\end{align}

>$$\implies \boxed{\mathbf{w} = (\boldsymbol{\Phi}^\top\boldsymbol{\Phi} + \alpha\mathbf{I})^{-1}\boldsymbol{\Phi}^\top\mathbf{y}}$$
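
The boxed regularised solution can be sketched in a few lines. The quadratic basis, the sine targets and `alpha = 0.1` are assumptions chosen for illustration; setting `alpha = 0` recovers the unregularised least-squares weights.

```python
import numpy as np

def fit_map(Phi, y, alpha):
    """w = (Phi^T Phi + alpha I)^{-1} Phi^T y, computed with a linear solve."""
    D = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + alpha * np.eye(D), Phi.T @ y)

x = np.linspace(0.0, 1.0, 10)
Phi = np.column_stack([np.ones_like(x), x, x**2])   # assumed quadratic basis
y = np.sin(2.0 * np.pi * x)                         # toy targets

alpha = 0.1                                         # assumed sigma_y^2 / sigma_w^2
w_map = fit_map(Phi, y, alpha)
w_ls = fit_map(Phi, y, 0.0)                         # no regularisation
print(np.linalg.norm(w_map), np.linalg.norm(w_ls))
```

Larger values of $\alpha$ correspond to a tighter prior (smaller $\sigma_{\mathbf{w}}^2$) and shrink the weights further towards zero.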

## 2.4. Bayesian Linear Regression
* **Inference**

>* **Prior and Likelihood**

>$$\begin{align}
p(\mathbf{w}| \sigma_{\mathbf{w}}^2) &= \frac{1}{(2\pi \sigma_{\mathbf{w}}^2)^{D/2}}\text{exp}\big(-\frac{1}{2\sigma_{\mathbf{w}}^2}\mathbf{w}^\top \mathbf{w}\big)\\
p(\mathbf{y}|\mathbf{X}, \mathbf{w}, \sigma_y^2) &= \frac{1}{(2\pi \sigma_y^2)^{N/2}}\text{exp}\left(-\frac{1}{2\sigma_y^2}(\mathbf{y} - \boldsymbol{\Phi}\mathbf{w})^\top (\mathbf{y} - \boldsymbol{\Phi}\mathbf{w})\right)
\end{align}$$

>* **Posterior Distribution**

>\begin{align}
p(\mathbf{w}|\mathbf{y}, \mathbf{X}, \sigma_{\mathbf{w}}^2, \sigma_{y}^2) = \mathcal{N}(\mathbf{w}; \boldsymbol{\mu}_{\mathbf{w} | \mathbf{y}, \mathbf{X} },\Sigma_{\mathbf{w} | \mathbf{y}, \mathbf{X} }).
\end{align}
>
>\begin{align}
\Sigma_{\mathbf{w} | \mathbf{y}, \mathbf{X} } & = \left( \frac{1}{\sigma_y^2} \boldsymbol{\Phi}^\top \boldsymbol{\Phi} + \frac{1}{\sigma_{\mathbf{w}}^2} \mathbf{I} \right)^{-1} \;\;\; \text{and} \;\;\;
\boldsymbol{\mu}_{\mathbf{w} | \mathbf{y}, \mathbf{X} } =  \Sigma_{\mathbf{w} | \mathbf{y}, \mathbf{X} } \frac{1}{\sigma_y^2}  \boldsymbol{\Phi}^\top \mathbf{y}
\end{align}

>* **MAP Setting:** the posterior is Gaussian, so its mode coincides with its mean: $\mathbf{w}^{\text{MAP}} = \boldsymbol{\mu}_{\mathbf{w} | \mathbf{y}, \mathbf{X} }$
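
A sketch of the posterior formulas on synthetic data. The linear basis, the noise levels `sigma_y2` and `sigma_w2`, and the generating weights are all assumed for illustration. The posterior mean should match the regularised least-squares solution with $\alpha = \sigma_y^2 / \sigma_{\mathbf{w}}^2$.

```python
import numpy as np

def posterior(Phi, y, sigma_y2, sigma_w2):
    """Gaussian posterior over w: returns (mean, covariance)."""
    D = Phi.shape[1]
    precision = Phi.T @ Phi / sigma_y2 + np.eye(D) / sigma_w2
    Sigma = np.linalg.inv(precision)
    mu = Sigma @ Phi.T @ y / sigma_y2
    return mu, Sigma

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 15)
Phi = np.column_stack([np.ones_like(x), x])          # assumed linear basis
sigma_y2, sigma_w2 = 0.05, 1.0                       # assumed variances
y = 0.3 + 0.8 * x + rng.normal(scale=np.sqrt(sigma_y2), size=x.shape)

mu, Sigma = posterior(Phi, y, sigma_y2, sigma_w2)
print(mu)
print(np.sqrt(np.diag(Sigma)))                       # posterior std. dev. of each weight
```

Unlike the point estimates of the previous sections, `Sigma` quantifies how uncertain each weight remains after seeing the data.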

# 3. Classification

## 3.1. Binary Logistic Classification

* **Algorithm**

>$$p(y_n = 1 | \mathbf{x}_n, \mathbf{w}) = \sigma(a_n)  = \frac{1}{1 + \text{exp}(-\mathbf{w}^\top \mathbf{x}_n)}, \qquad a_n = \mathbf{w}^\top \mathbf{x}_n$$

* **Optimisation Method**

>$$
p(\{ y_n \}_{n=1}^N|\{\mathbf{x}_n\}_{n=1}^N, \mathbf{w}) = \prod^N_{n = 1} \sigma(\mathbf{w}^\top\mathbf{x}_n)^{y_n} \big(1 - \sigma(\mathbf{w}^\top\mathbf{x}_n)\big)^{1-y_n}
$$

>$$
\mathcal{L}(\mathbf{w}) =\text{log}~p(\{ y_n \}_{n=1}^N|\{\mathbf{x}_n\}_{n=1}^N, \mathbf{w}) = \sum^N_{n = 1} \left[ y_n\text{log}~\sigma(\mathbf{w}^\top\mathbf{x}_n)+(1-y_n)\text{log}~\big(1 - \sigma(\mathbf{w}^\top\mathbf{x}_n)\big) \right]
$$

>$$\frac{\partial \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}} = \sum^N_{n = 1} \big(y_n - \sigma(\mathbf{w}^\top\mathbf{x}_n)\big)\mathbf{x}_n 
\;\;\;\Rightarrow\;\;\;
\mathbf{w}_{i+1} = \mathbf{w}_{i} + \eta \frac{\partial \mathcal{L}(\mathbf{w})}{\partial \mathbf{w}}\bigg|_{\mathbf{w}_{i}}$$
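
The gradient-ascent update above can be sketched as follows. The toy data set (deliberately not linearly separable, so the maximum is finite), the step size `eta` and the iteration count are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_likelihood(w, X, y):
    p = sigmoid(X @ w)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w, X, y):
    """Sum over n of (y_n - sigma(w^T x_n)) x_n."""
    return X.T @ (y - sigmoid(X @ w))

def fit(X, y, eta=0.1, steps=500):
    """Plain gradient ascent on the log-likelihood, starting from w = 0."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w + eta * gradient(w, X, y)
    return w

x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 1, 0, 1, 1])                 # classes overlap near x = 0
X = np.column_stack([np.ones_like(x), x])        # bias column plus one input
w = fit(X, y)
print(w, log_likelihood(w, X, y))
```

If the classes were perfectly separable, the likelihood would keep increasing as the weights grow without bound, so the fixed iteration budget would act as the only stopping rule.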

## 3.2. Multi-class Softmax Classification

* **Compute $K$ activations $a_{n,k} = \mathbf{w}_k^\top \mathbf{x}_n$, one per class, each with its own weight vector $\mathbf{w}_k$**
* **Probability contours are non-linear; decision boundaries are linear**
* **Algorithm**

>$$
p(y_{n} = k |\mathbf{x}_n, \{\mathbf{w}_k\}_{k=1}^K) = \frac{\exp(a_{n,k})}{\sum_{k'=1}^K \text{exp}(a_{n,k'})} = \frac{\text{exp}(\mathbf{w}_k^\top \mathbf{x}_n)}{\sum_{k'=1}^K \exp(\mathbf{w}_{k'}^\top \mathbf{x}_n)}
$$

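* **MLE** (one-hot encoding: $y_{n,k} = 1$ if $y_n = k$ and $0$ otherwise; $s_{n,k} = p(y_{n} = k |\mathbf{x}_n, \{\mathbf{w}_k\}_{k=1}^K)$)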

>\begin{align}
p(\{y_{n}\}_{n=1}^N|\{\mathbf{x}_n\}_{n=1}^N, \{\mathbf{w}_k\}_{k=1}^K) &= \prod_{n = 1}^N \prod_{k = 1}^K s_{n,k}^{y_{n,k}}
\end{align}
>
>$$\mathcal{L}(\{\mathbf{w}_k\}_{k=1}^K) = \sum_{n = 1}^N \sum_{k = 1}^K y_{n,k} \log s_{n,k}
\;\;\;\Rightarrow\;\;\;
\frac{\partial \mathcal{L}(\{\mathbf{w}_k\}_{k=1}^K)}{\partial \mathbf{w}_j} = \sum^N_{n = 1} (y_{n,j} - s_{n,j}) \mathbf{x}_n$$
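
A sketch of the softmax probabilities and the gradient above in matrix form. The weight vectors $\mathbf{w}_k$ are stored as the columns of a matrix `W` (a layout assumed here), and the random inputs and labels are purely illustrative.

```python
import numpy as np

def softmax(A):
    """Row-wise softmax of the activations a_{n,k}."""
    A = A - A.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def log_likelihood(W, X, Y):
    """Sum over n and k of y_{n,k} log s_{n,k}."""
    return np.sum(Y * np.log(softmax(X @ W)))

def gradient(W, X, Y):
    """Column j holds sum_n (y_{n,j} - s_{n,j}) x_n."""
    return X.T @ (Y - softmax(X @ W))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                # 5 inputs with 3 features each
labels = np.array([0, 2, 1, 1, 0])
Y = np.eye(3)[labels]                      # one-hot encoding of the labels
W = rng.normal(size=(3, 3))                # column k is w_k
G = gradient(W, X, Y)
print(G)
```

The same gradient-ascent loop as in the binary case applies, with `W` updated by `W + eta * gradient(W, X, Y)`.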

## 3.3. Non-linear Classification

* **Non-linear binary logistic classification**

>$$ a_n = w_0 + w_1 \phi_{1}(\mathbf{x}_n) + w_2 \phi_{2}(\mathbf{x}_n) + \dots + w_D \phi_{D}(\mathbf{x}_n) = \mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n) $$

>\begin{equation}
\boldsymbol{\Phi} =  \begin{pmatrix}
1 & \phi_1(\mathbf{x}_1) & \cdots & \phi_D(\mathbf{x}_1)\\\
1 & \phi_1(\mathbf{x}_2) & \cdots & \phi_D(\mathbf{x}_2)\\\
\vdots & \vdots & \ddots & \vdots \\\
1 & \phi_1(\mathbf{x}_N) & \cdots & \phi_D(\mathbf{x}_N)
\end{pmatrix}
\end{equation}

>\begin{align}
p(y_n = 1 | \mathbf{x}_n, \mathbf{w}) = \sigma(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_n))
\end{align}

* **Example:** **isotropic Gaussian basis fn.** a.k.a. **radial basis fn.**

>$$\phi_{d}(\mathbf{x}) = \exp \left( -\frac{1}{2 l^2} || \mathbf{x} - \boldsymbol{\mu}_{d}||^2 \right)$$
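
A sketch of the radial basis features and the resulting class probability. The centres, the length scale and the helper names are illustrative assumptions; in practice the centres $\boldsymbol{\mu}_d$ might be placed on a grid or at the training inputs.

```python
import numpy as np

def rbf_features(X, centres, length_scale):
    """Phi[n, d] = exp(-||x_n - mu_d||^2 / (2 l^2)), with a leading column of ones."""
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    Phi = np.exp(-0.5 * sq_dist / length_scale**2)
    return np.column_stack([np.ones(len(X)), Phi])

def predict_proba(w, X, centres, length_scale):
    """p(y = 1 | x, w) = sigma(w^T phi(x)) for each row of X."""
    a = rbf_features(X, centres, length_scale) @ w
    return 1.0 / (1.0 + np.exp(-a))

centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])   # assumed centres mu_d
Phi = rbf_features(centres, centres, length_scale=1.0)     # features evaluated at the centres
p = predict_proba(np.zeros(4), centres, centres, length_scale=1.0)
print(Phi)
print(p)
```

The weights can then be fitted with the same gradient ascent as in the linear case, with the rows of `Phi` taking the place of $\mathbf{x}_n$.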

## 3.4. Bayesian Classification

* **Taylor Expansion of $\log p(z)$ about its mode $z_0$** (the first-order term vanishes at the mode)

>\begin{align}
\text{log}~p(z) \approx \text{log}~p(z_0) + \frac{1}{2}(z - z_0)^2\frac{d^2}{dz^2}\text{log}~p(z)\bigg|_{z = z_0}
\end{align}

* **Laplace approximation of $p(z)$**

>\begin{align}
\text{log}~\mathcal{N}(z; z_0, \sigma^2) = \text{const. } - \frac{1}{2\sigma^2}(z - z_0)^2
\end{align}

>$$\frac{1}{\sigma^2} = - \frac{d^2}{dz^2}\text{log}~p(z)\bigg|_{z = z_0}$$
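
A one-dimensional sketch of the Laplace approximation, using a finite-difference second derivative. The example density (an unnormalised Gamma with shape `a = 5`) is an assumption chosen because its mode, $z_0 = a - 1$, and curvature are known in closed form; the mode is supplied by hand rather than found by optimisation.

```python
import numpy as np

def laplace_approximation(log_p, z0, h=1e-4):
    """Return (z0, sigma^2) with 1/sigma^2 = -d^2/dz^2 log p(z) evaluated at z0."""
    second = (log_p(z0 + h) - 2.0 * log_p(z0) + log_p(z0 - h)) / h**2
    return z0, -1.0 / second

a = 5.0                                         # assumed shape parameter

def log_p(z):
    """Unnormalised log-density of Gamma(a, 1): (a - 1) log z - z."""
    return (a - 1.0) * np.log(z) - z

z0, sigma2 = laplace_approximation(log_p, z0=a - 1.0)
print(z0, sigma2)
```

For this density the exact curvature gives $\sigma^2 = a - 1$, so the finite-difference result can be checked against it.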