# NN Notation

- Training set: $ \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \;..., (x^{(m)}, y^{(m)})  \} $

- No. layers in the network: $ L $
- No. units in layer $l$ : $ s_l$

## Binary Classification ($y = 0$ or $1$)

Single output unit.

$h_\Theta (x) \in \mathbb{R} $

$s_L = 1 \;\;;\;\;  K = 1$

## Multi-class classification (K classes)

$ y \in \mathbb{R}^K $

$K$ output units

$h_\Theta (x) \in \mathbb{R}^K $

$s_L = K \;\;;\;\;  K \geqslant 3$

# NN Cost function

Generalisation of logistic regression cost function.

__Logistic regression cost function__

$$ J(\theta) = -\frac{1}{m} \sum\limits_{i=1}^m \left[ y^{(i)}\;\log h_\theta(x^{(i)}) + (1-y^{(i)})\;\log\left(1 - h_\theta(x^{(i)})\right) \right] + \frac{\lambda}{2m}\sum\limits_{j=1}^n\theta^2_j $$

__Neural network cost function__

If $h_\Theta (x) \in \mathbb{R}^K$, then $(h_\Theta (x))_k$ denotes the $k^{th}$ output of the neural network.

$$
\begin{split}
J(\Theta) = -\frac{1}{m} \sum\limits_{i=1}^m \sum\limits_{k=1}^K \left[ y^{(i)}_k\;\log (h_\Theta(x^{(i)}))_k + (1-y^{(i)}_k)\;\log\left(1 - (h_\Theta(x^{(i)}))_k\right) \right]\\
+ \frac{\lambda}{2m}\sum\limits_{l=1}^{L-1} \sum\limits_{i=1}^{s_l} \sum\limits_{j=1}^{s_{l+1}} (\Theta^{(l)}_{ji})^2
\end{split}
$$

The first term compares $(h_\Theta(x))_k$, the $k^{th}$ output of the network, with the corresponding label component $y_k$, summed over all $K$ outputs and all training examples.

The regularisation term sums $(\Theta^{(l)}_{ji})^2$ over every layer $l$, every unit $i$ in layer $l$ and every unit $j$ in layer $l+1$. The bias weights ($i = 0$) are not regularised.
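As a rough illustration of the cost above, the following Python sketch evaluates it for outputs that have already been computed. The function name `nn_cost` and the list-of-rows matrix layout are assumptions made for this example, and column 0 of each weight matrix is assumed to hold the bias weights.

```python
import math

def nn_cost(H, Y, thetas, lam):
    """Regularised cross-entropy cost of a neural network (illustrative sketch).

    H      : list of m network outputs, each a list of K values in (0, 1)
    Y      : list of m labels, each a list of K values in {0, 1}
    thetas : list of weight matrices (lists of rows); column 0 is assumed to be the bias
    lam    : regularisation strength lambda
    """
    m = len(Y)
    # Double sum over examples i and output units k
    data_term = 0.0
    for h, y in zip(H, Y):
        for h_k, y_k in zip(h, y):
            data_term += y_k * math.log(h_k) + (1 - y_k) * math.log(1 - h_k)
    # Triple sum over layers and weights, skipping the bias column
    reg_term = 0.0
    for theta in thetas:
        for row in theta:
            reg_term += sum(w * w for w in row[1:])
    return -data_term / m + lam / (2 * m) * reg_term
```

For a single example with `H = [[0.5, 0.5]]` and `Y = [[1, 0]]`, the data term reduces to $2\log 2$, which is a quick sanity check on the signs.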

# NN backpropagation 

__Gradient computation__

$J(\Theta)$ as defined above

Goal is to minimise $J(\Theta)$: $ \min\limits_\Theta J(\Theta) $

In order to do so, one must compute:

   - $J(\Theta)$
   - $\frac{\partial}{\partial \Theta^{(l)}_{ij} } J(\Theta)$

__Forward propagation:__

On one training example

(assuming 4 layers)

$ a^{(1)} = x $

$ z^{(2)} = \Theta^{(1)}a^{(1)} $

$ a^{(2)} = g(z^{(2)}) $

$ z^{(3)} = \Theta^{(2)}a^{(2)} $

$ a^{(3)} = g(z^{(3)}) $

$ z^{(4)} = \Theta^{(3)}a^{(3)} $

$ a^{(4)} = h_\Theta(x) = g(z^{(4)}) $
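The forward pass can be sketched in plain Python as follows. The equations above leave the bias unit implicit; this sketch prepends $a_0 = 1$ to each layer's input, which is an assumption chosen to match the $10\times11$ weight shapes used later in these notes. The names `sigmoid` and `forward` are illustrative only.

```python
import math

def sigmoid(z):
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, thetas):
    """Return the activations a^(1), ..., a^(L) for one training example x.

    thetas is a list of weight matrices (lists of rows); column 0 of each
    matrix is assumed to multiply the bias unit.
    """
    a = list(x)                 # a^(1) = x
    activations = [a]
    for theta in thetas:
        a_bias = [1.0] + a      # prepend the bias unit a_0 = 1
        z = [sum(w * v for w, v in zip(row, a_bias)) for row in theta]
        a = [sigmoid(z_j) for z_j in z]
        activations.append(a)
    return activations          # the last entry is h_Theta(x)
```

With all weights set to zero every unit outputs $g(0) = 0.5$, which makes a convenient smoke test.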

__Backward propagation:__

On one training example

(assuming 4 layers)

For each node, $\delta^{(l)}_j$ is the 'error' of node $j$ in layer $l$, that is, the error of its activation $a^{(l)}_j$.

For each output unit (layer $L=4$)

$\delta^{(4)}_j = a^{(4)}_j - y_j = (h_\Theta(x) )_j - y_j $

$\bar{\delta^{(4)}} = \bar{a^{(4)} } - \bar{y}$

For the two internal (hidden) layers:

$\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} .* g'(z^{(3)}) $

where $g'(z^{(3)})$ simplifies to $ a^{(3)} .* (1-a^{(3)}) $

$\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} .* g'(z^{(2)}) $, where $g'(z^{(2)})$ simplifies to $ a^{(2)} .* (1-a^{(2)}) $
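The simplification $g'(z) = a .* (1 - a)$ holds when $g$ is the sigmoid function. A small Python sketch, with illustrative function names, compares the closed form against a two-sided numerical difference:

```python
import math

def sigmoid(z):
    """Logistic activation g(z)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_gradient(z):
    """Closed-form derivative g'(z) = a * (1 - a), with a = g(z)."""
    a = sigmoid(z)
    return a * (1 - a)

def numeric_gradient(z, eps=1e-5):
    """Two-sided difference approximation of g'(z)."""
    return (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)

# The two values should agree to many decimal places
print(sigmoid_gradient(0.3), numeric_gradient(0.3))
```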

## Implementation of Backpropagation 

__Algorithm__

Training set $ \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \;..., (x^{(m)}, y^{(m)})  \} $

Set $\Delta^{(l)}_{ij} = 0$ (for all $l, i, j$)

For $i$ = 1 to $m$

$\;\;\;\;$ Set $a^{(1)} = x^{(i)}$

$\;\;\;\;$ Forward propagation to compute $a^{(l)}$ for $l = 2, 3,\;..., L$

$\;\;\;\;$ Compute $\delta^{(L)} = a^{(L)} - y^{(i)}$

$\;\;\;\;$ Compute $\delta^{(L-1)}, \delta^{(L-2)},\;..., \delta^{(2)}$

$\;\;\;\;$ $\Delta^{(l)}_{ij} := \Delta^{(l)}_{ij} + a^{(l)}_j\delta^{(l+1)}_i$


($\Delta^{(l)}_{ij} := \Delta^{(l)}_{ij} + a^{(l)}_j\delta^{(l+1)}_i$ can be vectorized as $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$)

Outside the For loop:

$ D^{(l)}_{ij} = \frac{1}{m}\Delta^{(l)}_{ij}\;\;\;\;\;\;\;\;\;\;\;$ if $j = 0$

$ D^{(l)}_{ij} = \frac{1}{m}\left(\Delta^{(l)}_{ij} + \lambda\Theta^{(l)}_{ij}\right) \;$ if $j \neq 0$

Calculating $D$ is equivalent to calculating the derivative terms:

$ \frac{\partial}{\partial \Theta^{(l)}_{ij}} J(\Theta) = D^{(l)}_{ij} $
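The whole algorithm can be sketched in plain Python as below. This is a minimal illustration, not a reference implementation: the name `backprop` and the list-of-rows matrix layout are assumptions, a sigmoid activation is assumed, a bias unit $a_0 = 1$ is prepended to every non-output layer, and the regularisation term is scaled by $\frac{1}{m}$ so that it matches the $\frac{\lambda}{2m}$ factor in the cost function.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def backprop(X, Y, thetas, lam):
    """Return D, a list of gradient matrices with the same shapes as thetas."""
    m = len(X)
    # Delta^(l) accumulators, initialised to zero for all l, i, j
    Delta = [[[0.0] * len(row) for row in theta] for theta in thetas]

    for x, y in zip(X, Y):
        # Forward propagation; every non-output layer stores its bias unit too
        acts = [[1.0] + list(x)]
        for l, theta in enumerate(thetas):
            a = [sigmoid(sum(w * v for w, v in zip(row, acts[-1]))) for row in theta]
            if l < len(thetas) - 1:
                a = [1.0] + a
            acts.append(a)

        # Output-layer error: delta^(L) = a^(L) - y
        delta = [a_k - y_k for a_k, y_k in zip(acts[-1], y)]

        # Walk backwards through the layers
        for l in range(len(thetas) - 1, -1, -1):
            a_l = acts[l]
            # Delta^(l)_ij += a^(l)_j * delta^(l+1)_i
            for i, d_i in enumerate(delta):
                for j, a_j in enumerate(a_l):
                    Delta[l][i][j] += a_j * d_i
            if l > 0:
                # delta^(l) = (Theta^(l))^T delta^(l+1) .* a^(l) .* (1 - a^(l)),
                # dropping the bias unit (j = 0), which has no error term
                theta = thetas[l]
                delta = [
                    sum(theta[i][j] * delta[i] for i in range(len(delta)))
                    * a_l[j] * (1 - a_l[j])
                    for j in range(1, len(a_l))
                ]

    # Average, adding regularisation to every non-bias weight (j != 0)
    return [
        [[(Delta[l][i][j] + (lam * w if j != 0 else 0.0)) / m
          for j, w in enumerate(row)]
         for i, row in enumerate(theta)]
        for l, theta in enumerate(thetas)
    ]
```

A common way to gain confidence in such a routine is to compare its output with a numerical two-sided difference of $J(\Theta)$ on a small network.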

### Unrolling Parameters

Optimisation routines take a cost function that returns the cost and the gradient. They assume that the parameters and the gradient are vectors.

This is not the case in our neural network, where both the weights $\Theta^{(l)}$ and the gradients $D^{(l)}$ are matrices. Both need to be __unrolled__ into vectors.

__Example__

Assume a neural network with four layers: $s_1 = 10, s_2 = 10, s_3 = 10, s_4 = 1$

The weight and gradient matrices are of sizes:

$ \Theta^{(1)} \in \mathbb{R}^{10\times11},  \Theta^{(2)} \in \mathbb{R}^{10\times11}, \Theta^{(3)} \in \mathbb{R}^{1\times11} $ 

$ D^{(1)} \in \mathbb{R}^{10\times11}, D^{(2)} \in \mathbb{R}^{10\times11}, D^{(3)} \in \mathbb{R}^{1\times11} $

To turn them into vectors, they need to be unrolled. Once the optimisation has completed, they are reshaped back into their original sizes.

This can be done in Octave as follows:

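```Octave
% Unroll the weight and gradient matrices into single column vectors
thetaVec = [Theta1(:); Theta2(:); Theta3(:)];
DVec = [D1(:); D2(:); D3(:)];

% Reshape the unrolled vector back into the original matrices
Theta1 = reshape(thetaVec(1:110), 10, 11);
Theta2 = reshape(thetaVec(111:220), 10, 11);
Theta3 = reshape(thetaVec(221:231), 1, 11);
```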
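For comparison, the same idea can be sketched in plain Python. The helper names `unroll` and `reshape` are made up for this example, and matrices are assumed to be lists of rows. Elements are taken column by column to mirror what Octave's `(:)` does.

```python
def unroll(matrices):
    """Flatten a list of matrices into one vector, column by column."""
    vec = []
    for M in matrices:
        rows, cols = len(M), len(M[0])
        vec.extend(M[i][j] for j in range(cols) for i in range(rows))
    return vec

def reshape(vec, shapes):
    """Rebuild matrices of the given (rows, cols) shapes from an unrolled vector."""
    matrices, start = [], 0
    for rows, cols in shapes:
        chunk = vec[start:start + rows * cols]
        # Element (i, j) sits at position j * rows + i in column-major order
        matrices.append([[chunk[j * rows + i] for j in range(cols)]
                         for i in range(rows)])
        start += rows * cols
    return matrices

# Shapes from the example above: 110 + 110 + 11 = 231 elements in total
shapes = [(10, 11), (10, 11), (1, 11)]
thetas = [[[0.0] * cols for _ in range(rows)] for rows, cols in shapes]
theta_vec = unroll(thetas)
restored = reshape(theta_vec, shapes)
```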

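```python
# Two-sided difference approximation of d/dx x^3 at x = 1 with epsilon = 0.01 (exact value: 3)
((1+0.01)**3-(1-0.01)**3)/(2*0.01)
```

Output: `3.0001000000000055`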