# Shallow neural network

**Notation**

* superscript in square brackets refers to the NN layer, superscript in round brackets refers to the observation (training example)
* input (features) / hidden (not observed in the data) / output (target) layers
* example of a NN with one input, one hidden and one output layer  

    * input layer $x = a^{[0]}$
    * hidden layer with 4 elements $a^{[1]} = \begin{bmatrix} a_1^{[1]} \\ ... \\ a_4^{[1]} \end{bmatrix}$
    * output layer $\hat{y} = a^{[2]}$

* hidden and output layers store internal parameters $w$ and $b$

**Forward pass**

Hidden layer  

* vectorization of the operations, i.e. $W^{[1]} \cdot X + b^{[1]} = Z^{[1]}$
* the inner elements of the pass, written out for a single example with 3 features and 4 hidden units

    * $Z^{[1]} = \begin{bmatrix} w_1^{[1] T} \\ w_2^{[1] T} \\ w_3^{[1] T}\\ w_4^{[1] T}\\\end{bmatrix} \cdot 
    \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} +
    \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]}\\ b_4^{[1]}\\\end{bmatrix} =
    \begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]}\\ z_4^{[1]}\\\end{bmatrix} $
    * activation applied element-wise on $Z^{[1]}$: $a^{[1]} = \begin{bmatrix} a_1^{[1]} \\ \vdots \\ a_4^{[1]} \end{bmatrix} = \sigma(Z^{[1]})$

* analogous for any other hidden layer
* the output layer works the same way as in logistic regression
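
The hidden-layer pass above can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the input values, the seed and the `0.01` weight scale are arbitrary assumptions, and the loop version is only there to show that row $i$ of $W^{[1]}$ plays the role of $w_i^{[1]T}$.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)           # seed chosen arbitrarily

x = np.array([[0.5], [-1.0], [2.0]])     # one example with 3 features, shape (3, 1)
W1 = rng.standard_normal((4, 3)) * 0.01  # 4 hidden units, one row per unit
b1 = np.zeros((4, 1))

# Vectorized hidden layer: z^{[1]} = W^{[1]} x + b^{[1]}, then element-wise activation
z1 = W1 @ x + b1
a1 = sigmoid(z1)

# Same computation unit by unit: z_i = w_i^T x + b_i
z1_loop = np.array([[W1[i] @ x[:, 0] + b1[i, 0]] for i in range(4)])

print(z1.shape, a1.shape)  # both (4, 1)
```

The matrix product and the per-unit loop give the same `z1`; the vectorized form just avoids the explicit Python loop.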

**Vectorizing across multiple examples**

* for a simple NN classifier, with one hidden and one output layer
    * $Z^{[1]} = W^{[1]} \cdot X + b^{[1]}$
    * $A^{[1]} = \sigma (Z^{[1]} )$
    * $Z^{[2]} = W^{[2]} \cdot A^{[1]}  + b^{[2]}$
    * $A^{[2]} = \sigma (Z^{[2]}) = \hat{y}$
    * all matrices on the LHS have examples stacked horizontally (one column per example) and units (neurons) stacked vertically (one row per unit)
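
A sketch of the four equations above for $m$ examples at once. The layer sizes (`n_x`, `n_h`, `n_y`), the number of examples `m` and the random data are assumptions made for illustration; the point is the shapes, with one column per example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Forward pass for one hidden layer and one output layer, both sigmoid."""
    Z1 = W1 @ X + b1    # (n_h, m); b1 of shape (n_h, 1) is broadcast across columns
    A1 = sigmoid(Z1)    # (n_h, m)
    Z2 = W2 @ A1 + b2   # (n_y, m)
    A2 = sigmoid(Z2)    # (n_y, m), the predictions y_hat
    return Z1, A1, Z2, A2

# Hypothetical sizes: 3 features, 4 hidden units, 1 output, 5 examples
n_x, n_h, n_y, m = 3, 4, 1, 5
rng = np.random.default_rng(1)
X = rng.standard_normal((n_x, m))        # each column is one example
W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

Z1, A1, Z2, A2 = forward(X, W1, b1, W2, b2)
print(Z1.shape, A2.shape)  # (4, 5) (1, 5)
```

Column $j$ of every output matrix is exactly what the single-example pass would produce for example $j$.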

**Activation functions**

* sigmoid $a = \frac{1}{1+e^{-z}}$ (almost never used except for the output layer)
* tanh $a = \frac{e^z-e^{-z}}{e^z+e^{-z}}$ (shifted and scaled version of sigmoid; centers activations around 0 in hidden layers, which improves the learning speed)
* relu $a = \max(0, z)$ (improves learning as the derivative stays at 1 for $z > 0$ instead of shrinking toward 0)
* leaky relu $a = \max(0.01z, z)$ (like relu, but keeps a small non-zero derivative for $z < 0$)
* non-linear activation functions make sure that more complex systems than just linear ones can be learned
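
The four activations can be written in a few lines of NumPy. The function names and the `slope` parameter are illustrative choices, and `tanh` is spelled out from the formula rather than calling `np.tanh`. The last lines sketch why a non-linearity is needed: two stacked layers without activation collapse into a single linear map.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    return (np.exp(z) - np.exp(-z)) / (np.exp(z) + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):
    return np.maximum(slope * z, z)

z = np.array([-2.0, 0.0, 3.0])
relu_out = relu(z)         # values: 0, 0, 3
leaky_out = leaky_relu(z)  # values: -0.02, 0, 3

# tanh is a shifted and scaled sigmoid: tanh(z) = 2 * sigmoid(2z) - 1
tanh_from_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0

# Without activations, two layers are equivalent to one linear layer
rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((1, 4))
x = rng.standard_normal((3, 1))
two_linear_layers = W2 @ (W1 @ x)
one_linear_layer = (W2 @ W1) @ x   # same result
```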

**Backpropagation**

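* chain rule, output layer: $\frac{\partial L(a^{[2]},y)}{\partial W^{[2]}} = \frac{\partial L}{\partial a^{[2]}} \frac{\partial a^{[2]}}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial W^{[2]}}$
* chain rule, hidden layer: $\frac{\partial L(a^{[2]},y)}{\partial W^{[1]}} = \frac{\partial L}{\partial a^{[2]}} \frac{\partial a^{[2]}}{\partial z^{[2]}} \frac{\partial z^{[2]}}{\partial a^{[1]}} \frac{\partial a^{[1]}}{\partial z^{[1]}} \frac{\partial z^{[1]}}{\partial W^{[1]}}$
* make sure that matrices have correct dimensions: each gradient has the same shape as the parameter it belongs to, e.g. $dW^{[1]}$ has the shape of $W^{[1]}$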
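
A hedged sketch of the backward pass for the two-layer sigmoid network from the vectorized section. The notes do not name a loss, so binary cross-entropy averaged over the $m$ examples is an assumption here; with it, the first two chain-rule factors at the output simplify to $A^{[2]} - Y$. The function names and the `grads` dictionary are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(X, Y, W1, b1, W2, b2):
    """Assumed loss: binary cross-entropy averaged over the m examples."""
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)
    return float(-np.mean(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)))

def backward(X, Y, W1, b1, W2, b2):
    """Gradients of the cost above w.r.t. all parameters."""
    m = X.shape[1]
    A1 = sigmoid(W1 @ X + b1)
    A2 = sigmoid(W2 @ A1 + b2)

    dZ2 = A2 - Y                              # dL/da2 * da2/dz2, shape (n_y, m)
    dW2 = dZ2 @ A1.T / m                      # same shape as W2
    db2 = dZ2.sum(axis=1, keepdims=True) / m  # same shape as b2

    dZ1 = (W2.T @ dZ2) * A1 * (1 - A1)        # dz2/da1 * da1/dz1, shape (n_h, m)
    dW1 = dZ1 @ X.T / m                       # same shape as W1
    db1 = dZ1.sum(axis=1, keepdims=True) / m  # same shape as b1
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

# Hypothetical data: 3 features, 4 hidden units, 1 output, 5 examples
rng = np.random.default_rng(2)
X = rng.standard_normal((3, 5))
Y = rng.integers(0, 2, size=(1, 5)).astype(float)
W1, b1 = rng.standard_normal((4, 3)) * 0.5, np.zeros((4, 1))
W2, b2 = rng.standard_normal((1, 4)) * 0.5, np.zeros((1, 1))

grads = backward(X, Y, W1, b1, W2, b2)
for name, param in [("dW1", W1), ("db1", b1), ("dW2", W2), ("db2", b2)]:
    assert grads[name].shape == param.shape   # the dimension check from the notes
```

Comparing one entry of `dW1` against a finite-difference estimate of `cost` is a cheap way to catch a wrong factor or a misplaced transpose.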

**Random initialization**

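* weights cannot be initialized at 0: all hidden units would compute the same function and receive the same update, so they stay symmetrical and cannot learn different things
* small random values improve convergence for sigmoid/tanh (large values push $z$ into the flat regions of the activation, where gradients are small)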
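
A minimal sketch of such an initialization. The helper name `init_params`, the `0.01` scale, the zero biases and the layer sizes are assumptions for illustration, not values taken from these notes. The comparison at the end shows the symmetry problem: with zero weights every hidden unit produces the same activations.

```python
import numpy as np

def init_params(n_x, n_h, n_y, scale=0.01, seed=0):
    """Small random weights, zero biases (scale and seed are illustrative)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.standard_normal((n_h, n_x)) * scale,
        "b1": np.zeros((n_h, 1)),
        "W2": rng.standard_normal((n_y, n_h)) * scale,
        "b2": np.zeros((n_y, 1)),
    }

params = init_params(n_x=3, n_h=4, n_y=1)
X = np.random.default_rng(1).standard_normal((3, 5))  # 5 hypothetical examples

# Zero weights: every row (hidden unit) of A1 is identical
A1_zero = np.tanh(np.zeros((4, 3)) @ X)

# Small random weights: hidden units differ, and z stays near 0 where tanh is steep
A1_rand = np.tanh(params["W1"] @ X + params["b1"])
```

Biases can start at zero because the random weights already break the symmetry between units.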