# 5. Neural Networks for Computer Vision

## 5.1. Single Neuron

* **Formulation**

>$$a = \sum^D_{d=0} w_d z_d \;\;\;\rightarrow\;\;\; x(a) = \frac{1}{1 + \exp (-a)}$$

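>* $\hat{\mathbf{w}} = \mathbf{w}/|\mathbf{w}|$: orientation of the decision boundary (the boundary is perpendicular to $\hat{\mathbf{w}}$)
>* $|\mathbf{w}|$: steepness of the sigmoid across the boundary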
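
As a minimal sketch of the formulation above (assuming NumPy, and assuming the $d=0$ term is a constant bias input $z_0 = 1$, which the notes do not state explicitly), the neuron output can be computed as follows. The names `neuron_output`, `x_soft` and `x_sharp` are illustrative only.

```python
import numpy as np

def neuron_output(w, z):
    """Logistic neuron: a = sum_d w_d z_d, then x = 1 / (1 + exp(-a))."""
    a = np.dot(w, z)
    return 1.0 / (1.0 + np.exp(-a))

# Assumption: z[0] = 1 plays the role of the bias input (the d = 0 term).
z = np.array([1.0, 2.0, -1.0])
w = np.array([0.0, 1.0, 1.0])

x_soft = neuron_output(w, z)        # activation a = 1
x_sharp = neuron_output(10 * w, z)  # same direction, larger |w|: a = 10
```

Scaling $\mathbf{w}$ leaves the boundary $a = 0$ where it is but pushes the output on either side closer to 0 or 1, which is the sense in which $|\mathbf{w}|$ controls steepness.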

* **Training**

>$$G (\mathbf{w}) = - \sum_n \big[ t^{(n)} \log x (\mathbf{z}^{(n)};\mathbf{w}) + (1-t^{(n)}) \log ( 1 - x (\mathbf{z}^{(n)};\mathbf{w})) \big]$$

>$$\frac{d}{d\mathbf{w}} G(\mathbf{w}) = \sum_n \frac{dG_n(\mathbf{w})}{dx^{(n)}} \frac{dx^{(n)}}{d\mathbf{w}} = - \sum_n (t^{(n)} - x^{(n)}) \mathbf{z}^{(n)} \;\;\;\rightarrow\;\;\; \mathbf{w} \leftarrow \mathbf{w}-\eta \frac{d}{d\mathbf{w}} G(\mathbf{w})$$
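
The objective, its gradient and the update rule can be sketched in NumPy as below. This is an illustration under stated assumptions, not a reference implementation: the toy data, the learning rate `eta`, the step count and all function names are made up for the example, and the first column of `Z` is assumed to be the constant bias input.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Z, t):
    """G(w) for inputs Z of shape (N, D+1) and binary targets t of shape (N,)."""
    x = sigmoid(Z @ w)
    return -np.sum(t * np.log(x) + (1 - t) * np.log(1 - x))

def grad_cross_entropy(w, Z, t):
    """dG/dw = -sum_n (t_n - x_n) z_n."""
    x = sigmoid(Z @ w)
    return -Z.T @ (t - x)

def train(Z, t, eta=0.1, steps=200):
    """Batch gradient descent: w <- w - eta * dG/dw, starting from w = 0."""
    w = np.zeros(Z.shape[1])
    for _ in range(steps):
        w = w - eta * grad_cross_entropy(w, Z, t)
    return w

# Hypothetical 1-D toy problem; column 0 is the bias input z_0 = 1.
Z = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])
w = train(Z, t)
```

Because this toy data is linearly separable, $|\mathbf{w}|$ keeps growing the longer the loop runs, which is one motivation for the regulariser introduced next.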

* **Regularised Training**

>$$E(\mathbf{w}) = \frac{1}{2} \sum_i w_i^2 \;\;\;\rightarrow\;\;\; M (\mathbf{w}) = [G (\mathbf{w}) + \alpha E(\mathbf{w})]$$

>$$\frac{d}{d\mathbf{w}} M(\mathbf{w}) = - \sum_n (t^{(n)} - x^{(n)}) \mathbf{z}^{(n)} + \alpha \mathbf{w}$$
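
A small sketch of the regularised objective and gradient, again assuming NumPy and a hypothetical separable toy data set. The values of `alpha`, `eta` and `steps` are arbitrary choices for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def regularised_objective(w, Z, t, alpha):
    """M(w) = G(w) + alpha * E(w), with E(w) = 0.5 * sum_i w_i^2."""
    x = sigmoid(Z @ w)
    G = -np.sum(t * np.log(x) + (1 - t) * np.log(1 - x))
    E = 0.5 * np.sum(w ** 2)
    return G + alpha * E

def regularised_gradient(w, Z, t, alpha):
    """dM/dw = -sum_n (t_n - x_n) z_n + alpha * w."""
    x = sigmoid(Z @ w)
    return -Z.T @ (t - x) + alpha * w

def gradient_descent(Z, t, alpha, eta=0.1, steps=300):
    w = np.zeros(Z.shape[1])
    for _ in range(steps):
        w = w - eta * regularised_gradient(w, Z, t, alpha)
    return w

# Hypothetical toy data; column 0 is assumed to be the bias input.
Z = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
t = np.array([0.0, 0.0, 1.0, 1.0])

w_plain = gradient_descent(Z, t, alpha=0.0)  # no weight decay
w_decay = gradient_descent(Z, t, alpha=1.0)  # with weight decay
```

On this data the weight-decay term settles at a finite weight vector, so `w_decay` has a smaller norm than `w_plain`.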

* **Probabilistic Interpretation**

>\begin{align}
p(t|\mathbf{w},\mathbf{z}) &= x^t (1-x)^{(1-t)} \\
p(D|\mathbf{w},Z) &= \exp (-G(\mathbf{w})) \\
p(\mathbf{w}|\alpha) &= \frac{1}{Z_W(\alpha)} \exp (-\alpha E(\mathbf{w})) \\
p(\mathbf{w}|D,\alpha) &= \frac{1}{Z_M} \exp (-G(\mathbf{w})-\alpha E(\mathbf{w}))
\end{align}

>* The result of training: **(locally) most probable weight vector given data**
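
The link between the four distributions and the training objective can be spelled out in a few lines. This sketch assumes the targets are independent given $\mathbf{w}$ and the inputs, and that $Z_M$ absorbs every factor that does not depend on $\mathbf{w}$:

>\begin{align}
p(D|\mathbf{w},Z) &= \prod_n p(t^{(n)}|\mathbf{w},\mathbf{z}^{(n)}) = \exp (-G(\mathbf{w})) \\
p(\mathbf{w}|D,\alpha) &\propto p(D|\mathbf{w},Z) \, p(\mathbf{w}|\alpha) \\
-\log p(\mathbf{w}|D,\alpha) &= G(\mathbf{w}) + \alpha E(\mathbf{w}) + \log Z_M = M(\mathbf{w}) + \text{const}
\end{align}

Minimising $M(\mathbf{w})$ is therefore the same as maximising the posterior $p(\mathbf{w}|D,\alpha)$, and gradient descent finds a local maximum of it.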

## 5.2. Single Hidden Layer Neural Networks

* **Framework**

><img src='images/image17.png' width=400>

* **Training**

>\begin{align}
G(W,\mathbf{w}) &= - \sum_n \big[ t^{(n)} \log x^{(n)} + (1-t^{(n)}) \log ( 1 - x^{(n)}) \big] \\
E(W,\mathbf{w}) &= \frac{1}{2} \sum_i w_i^2 + \frac{1}{2} \sum_{i,j} W_{ij}^2
\end{align}

* **Back-Propagation**

>$$\frac{dG(W,\mathbf{w})}{dW_{ij}} = \sum_{n} \frac{dG(W,\mathbf{w})}{dx^{(n)}} \frac{dx^{(n)}}{da^{(n)}} \frac{da^{(n)}}{dx_i^{(n)}} \frac{dx_i^{(n)}}{da_i^{(n)}} \frac{da_i^{(n)}}{dW_{ij}}$$
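
The chain rule above can be sketched in NumPy as follows. The network layout is an assumption, since the figure is not reproduced here: hidden units $x_i = \sigma(a_i)$ with $a_i = \sum_j W_{ij} z_j$, a single output $x = \sigma(a)$ with $a = \sum_i w_i x_i$, and no separate bias terms. All function names and the random toy data are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(W, w, Z):
    """Hidden outputs X_hid (N, H) and network outputs x_out (N,)."""
    X_hid = sigmoid(Z @ W.T)     # a_i = sum_j W_ij z_j
    x_out = sigmoid(X_hid @ w)   # a = sum_i w_i x_i
    return X_hid, x_out

def loss(W, w, Z, t):
    """Cross-entropy G(W, w) summed over the data set."""
    _, x = forward(W, w, Z)
    return -np.sum(t * np.log(x) + (1 - t) * np.log(1 - x))

def backprop(W, w, Z, t):
    """Gradients of G with respect to W (H, D) and w (H,)."""
    X_hid, x_out = forward(W, w, Z)
    delta_out = -(t - x_out)                    # dG/dx * dx/da, per data point
    grad_w = X_hid.T @ delta_out                # da/dw_i = x_i
    # da/dx_i = w_i and dx_i/da_i = x_i (1 - x_i)
    delta_hid = np.outer(delta_out, w) * X_hid * (1 - X_hid)
    grad_W = delta_hid.T @ Z                    # da_i/dW_ij = z_j, summed over n
    return grad_W, grad_w

# Hypothetical toy problem: 5 points, 2 inputs, 3 hidden units.
rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 2))
t = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
W = 0.5 * rng.standard_normal((3, 2))
w = 0.5 * rng.standard_normal(3)
grad_W, grad_w = backprop(W, w, Z, t)
```

A finite-difference comparison against `loss` is a cheap way to check a hand-written gradient like this one.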

## 5.3. Hierarchical Models with Many Hidden Layers

* **Justifications**

>1. Visual scenes are hierarchically organised
>2. Biological vision is hierarchically organised
>3. Shallow architectures are inefficient at representing deep functions

* **Initialisation Methods**

>1. Unsupervised pre-training (e.g. using a restricted Boltzmann machine)
>2. Recursively apply back-propagation
>3. Initialise randomly, but ensure activations have $\mu=0$ and $\sigma^2=1$ across the training data
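
One possible reading of the third method, sketched for a single layer: draw random weights, then rescale and shift them so that each unit's activation has zero mean and unit variance over the training inputs. This is an assumption about what is intended, not the method from the notes. Here "activation" is taken to mean the pre-nonlinearity sum $a$, and `data_dependent_init` is a hypothetical helper name.

```python
import numpy as np

def data_dependent_init(Z, n_units, rng):
    """Random weights W (n_units, D) and biases b (n_units,) such that
    a = Z @ W.T + b has mean 0 and variance 1 per unit over the rows of Z."""
    W = rng.standard_normal((n_units, Z.shape[1]))
    std = (Z @ W.T).std(axis=0)       # spread of each unit's activation
    W = W / std[:, None]              # rescale so the variance becomes 1
    b = -(Z @ W.T).mean(axis=0)       # shift so the mean becomes 0
    return W, b

# Hypothetical training inputs with a non-zero mean and large spread.
rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(200, 4))
W, b = data_dependent_init(Z, n_units=6, rng=rng)
A = Z @ W.T + b
```

For a deep network the same step would be applied one layer at a time, feeding each layer the outputs of the already-initialised layer below it.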

## 5.4. Convolutional Neural Networks

* **Key Ideas** (all three ideas reduce the number of parameters)

>* Image statistics are **translation invariant** 
>  * $\rightarrow$ tie weights together
>* **Low-level features:** local
>  * $\rightarrow$ allow only **local connectivity**
>* **High-level features:** coarse, abstract, invariant to translation, rotation, lighting, ...
>  * $\rightarrow$ **subsample** up the hierarchy

* **Framework**

><img src = 'images/image19.png' width=500>

* **Building Blocks**

>1. Convolutional stage: $a_{i,j} = \sum_{k,l} w_{k,l} z_{i-k,j-l}$
>2. Non-linear stage: $y_{i,j}=f(a_{i,j})$ (sigmoid, ReLU, tanh, ...)
>3. Pooling stage: $x_{i,j} = \underset{|k|<\tau, |l|<\tau}{\max} y_{i-k,j-l}$
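
The three stages can be sketched directly from the formulas above, assuming NumPy and single-channel 2-D inputs. Several details are assumptions the notes leave open: the convolution is evaluated only where the kernel fits entirely inside the input, ReLU is picked as the non-linearity, pooling windows are clipped at the image border, and the `stride` argument (which does the actual subsampling) is an addition not present in the pooling formula.

```python
import numpy as np

def conv2d_valid(z, w):
    """a[i, j] = sum_{k,l} w[k, l] * z[i - k, j - l], restricted to positions
    where every index is inside z (output is smaller than the input)."""
    K, L = w.shape
    w_flipped = w[::-1, ::-1]  # true convolution flips the kernel
    out = np.empty((z.shape[0] - K + 1, z.shape[1] - L + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(w_flipped * z[i:i + K, j:j + L])
    return out

def relu(a):
    """One choice of non-linearity f: y = max(a, 0)."""
    return np.maximum(a, 0.0)

def max_pool(y, tau, stride=1):
    """x[i, j] = max of y over |k| < tau, |l| < tau around (i, j).
    Windows are clipped at the border; stride > 1 subsamples the output."""
    r = tau - 1  # |k| < tau means k in {-(tau - 1), ..., tau - 1}
    rows = range(0, y.shape[0], stride)
    cols = range(0, y.shape[1], stride)
    out = np.empty((len(rows), len(cols)))
    for oi, i in enumerate(rows):
        for oj, j in enumerate(cols):
            out[oi, oj] = y[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1].max()
    return out

# Hypothetical 4x4 input and 2x2 kernel.
z = np.arange(16, dtype=float).reshape(4, 4)
w = np.array([[1.0, 0.0], [0.0, -1.0]])
features = max_pool(relu(conv2d_valid(z, w)), tau=2, stride=2)
```

With this kernel each convolution output is $z_{i,j} - z_{i-1,j-1}$, which is constant for the ramp input used here, so the example is easy to check by hand.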

* **Training**

>* **Back-propagation:** optimisation over a mini-batch of data
>* **Data Augmentation:** shift, rotation, mirroring, local distortion, ...
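
A minimal sketch of how mini-batching and augmentation might fit together, assuming NumPy and single-channel images stored as an array of shape `(N, H, W)`. Only mirroring and shifting are shown. The wrap-around shift via `np.roll`, the `max_shift` parameter and the function names are choices made for this example, and the gradient step itself is left out.

```python
import numpy as np

def augment(image, rng, max_shift=2):
    """Random horizontal mirror followed by a random shift.
    Assumption: pixels shifted off one edge wrap around to the other."""
    if rng.random() < 0.5:
        image = image[:, ::-1]
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(image, (dy, dx), axis=(0, 1))

def minibatches(images, targets, batch_size, rng):
    """Yield shuffled (augmented images, targets) mini-batches for one epoch."""
    order = rng.permutation(len(images))
    for start in range(0, len(images), batch_size):
        idx = order[start:start + batch_size]
        batch = np.stack([augment(images[i], rng) for i in idx])
        yield batch, targets[idx]

# Hypothetical data set: 10 random 8x8 images with integer labels.
rng = np.random.default_rng(0)
images = rng.random((10, 8, 8))
targets = np.arange(10)
batches = list(minibatches(images, targets, batch_size=4, rng=rng))
```

Each batch would be passed through the network and used for one back-propagation update. Since a fresh augmentation is drawn every time an image is visited, the network rarely sees exactly the same input twice.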