# Deep neural network

* a neural network with a large number of hidden layers

**Notation**  

* $l$ -> layer index
* $n^{[l]}$ -> number of units in layer $l$, $n^{[0]}=n_x$ (number of input features)
* $a^{[l]}$ -> activations in layer $l$, $a^{[l]}=g^{[l]}(z^{[l]})$, where $g^{[l]}$ is the activation function of layer $l$
* $W^{[l]}, b^{[l]}$ -> parameters used to compute $z^{[l]}$

**Forward prop**

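* generalized approach: $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$, $a^{[l]} = g^{[l]}(z^{[l]})$, with $a^{[0]} = x$
* analogous for the vectorized approach: $Z^{[l]} = W^{[l]} A^{[l-1]} + b^{[l]}$
* the for-loop iterating over the layers cannot be avoided (only the computation within a layer is vectorized)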
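
The layer-by-layer recursion can be sketched in NumPy as follows. This is a minimal illustration, not a reference implementation: the layer sizes, the choice of ReLU and sigmoid, and the helper names (`forward`, `relu`, `sigmoid`) are assumptions made for the example.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward(X, params, activations):
    """Run A^{[l]} = g^{[l]}(W^{[l]} A^{[l-1]} + b^{[l]}) for l = 1..L.

    params: list of (W, b) tuples, one per layer.
    activations: list of callables g^{[l]}, one per layer.
    """
    A = X  # A^{[0]} = X
    for (W, b), g in zip(params, activations):  # explicit loop over layers
        Z = W @ A + b  # vectorized over all m examples (columns of A)
        A = g(Z)
    return A

rng = np.random.default_rng(0)
layer_sizes = [2, 3, 1]  # n^{[0]}, n^{[1]}, n^{[2]} (illustrative values)
params = [
    (rng.standard_normal((n_out, n_in)) * 0.01, np.zeros((n_out, 1)))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
X = rng.standard_normal((2, 5))  # m = 5 examples stored as columns
A_out = forward(X, params, [relu, sigmoid])
```

The loop runs once per layer, while each iteration handles all $m$ examples in a single matrix product.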

**Matrix dimensions**

* an example:
    * $n^{[0]}= 2, n^{[1]}= 3$
    * $z^{[1]} = W^{[1]} a^{[0]} + b^{[1]}$, with $a^{[0]} = x$
    * matrix operation -> $\begin{bmatrix} . \\ . \\ . \end{bmatrix} = \begin{bmatrix} . & . \\ . & . \\ . & . \end{bmatrix} \cdot \begin{bmatrix} . \\ . \end{bmatrix}$, i.e. $(3, 1) = (3, 2) \cdot (2, 1)$ (w/o bias term)
    * the general rule for dimensions is
        * for $W^{[l]} : (n^{[l]}, n^{[l-1]})$
        * for $dW^{[l]} : (n^{[l]}, n^{[l-1]})$
        * for $b^{[l]} : (n^{[l]}, 1)$
        * for $db^{[l]} : (n^{[l]}, 1)$
        * for $Z^{[l]}, A^{[l]} : (n^{[l]}, m)$
        * for $dZ^{[l]}, dA^{[l]} : (n^{[l]}, m)$
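
A quick way to catch dimension bugs is to assert these shapes while stepping through the layers. The sketch below assumes illustrative layer sizes `[2, 3, 1]`, a batch of `m = 4`, and a hypothetical `init_params` helper; none of these come from the notes.

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """Create W^{[l]} of shape (n^{[l]}, n^{[l-1]}) and b^{[l]} of shape (n^{[l]}, 1)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_sizes)):
        params[f"W{l}"] = rng.standard_normal((layer_sizes[l], layer_sizes[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_sizes[l], 1))
    return params

layer_sizes = [2, 3, 1]  # n^{[0]}, n^{[1]}, n^{[2]}
m = 4                    # number of examples
params = init_params(layer_sizes)

A = np.ones((layer_sizes[0], m))  # A^{[0]} has shape (n^{[0]}, m)
shapes = {}
for l in range(1, len(layer_sizes)):
    # b has shape (n^{[l]}, 1) and is broadcast across the m columns
    Z = params[f"W{l}"] @ A + params[f"b{l}"]
    assert Z.shape == (layer_sizes[l], m)
    shapes[l] = Z.shape
    A = np.tanh(Z)  # element-wise, so A^{[l]} keeps the shape of Z^{[l]}
```

Note that $b^{[l]}$ stays $(n^{[l]}, 1)$ even in the vectorized case; NumPy broadcasting copies it across the $m$ columns.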

**Deep representation**

* learning hierarchical representations (e.g. CNNs) in images, audio, or other data, from simple to more complex abstractions
* depth allows using exponentially fewer units to learn complex patterns (e.g. the XOR of many inputs, an argument borrowed from circuit theory)
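
The circuit-theory intuition can be illustrated with the parity (multi-input XOR) function. The sketch below is an analogy with logic gates, not a statement about a particular neural network: a balanced tree of 2-input XOR gates needs $n - 1$ gates at depth $\lceil \log_2 n \rceil$, whereas a two-level AND/OR circuit (a DNF formula) has to enumerate every odd-parity input pattern, which is $2^{n-1}$ terms. The function name `xor_tree` is made up for this example.

```python
def xor_tree(bits):
    """Compute the parity of `bits` with a balanced tree of 2-input XOR gates.

    Returns (result, gates_used, depth).
    """
    level = list(bits)
    gates = 0
    depth = 0
    while len(level) > 1:
        nxt = []
        # pair up neighbours; each pair costs one XOR gate
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] ^ level[i + 1])
            gates += 1
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # odd element passes through unchanged
        level = nxt
        depth += 1
    return level[0], gates, depth

n = 8
result, gates, depth = xor_tree([1, 0, 1, 1, 0, 0, 1, 0])
dnf_terms = 2 ** (n - 1)  # odd-parity patterns a two-level circuit must list
print(result, gates, depth, dnf_terms)  # 0 7 3 128
```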

**Building blocks**

* Forward
    * $z^{[l]} = W^{[l]} a^{[l-1]} + b^{[l]}$
    * $a^{[l]} = g^{[l]} (z^{[l]})$
    * output $a^{[l]}$, cache $z^{[l]}$
    
* Backward
    * input $da^{[l]}$
        * cache $z^{[l]}$

    * output $da^{[l-1]}$
        * also outputs the gradients $dW^{[l]}$, $db^{[l]}$

* Schema description (see the slides, building blocks part of the presentation)
    * for layer $l$, we have
        * Forward pass
            * inputs $a^{[l-1]}$
            * outputs $a^{[l]}$
            * params $W^{[l]}$, $b^{[l]}$
            * cache for $Z^{[l]}$
        * Back pass
            * input cache $Z^{[l]}$
            * inputs $da^{[l]}$
            * outputs $da^{[l-1]}$
            * params $W^{[l]}$, $b^{[l]}$
            * additional outputs are the gradients of the layer's parameters, $dW^{[l]}$, $db^{[l]}$

    * the operations are chained together across the layers for both passes
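
One possible way to code these blocks is a pair of functions per layer, with the forward function returning a cache that the backward function consumes. This is a sketch under several assumptions: `tanh` is used for every layer, the cost is a toy $J = \frac{1}{m}\sum \frac{1}{2} a^2$ chosen only so the example runs end to end, and the cache stores $A^{[l-1]}$ and $W^{[l]}$ alongside $Z^{[l]}$ because the backward step needs them. The names `layer_forward` and `layer_backward` are hypothetical.

```python
import numpy as np

def tanh_prime(z):
    return 1 - np.tanh(z) ** 2

def layer_forward(A_prev, W, b, g):
    """Forward block: input A^{[l-1]}, output A^{[l]} plus a cache."""
    Z = W @ A_prev + b
    cache = (A_prev, W, Z)  # assumption: cache more than just Z
    return g(Z), cache

def layer_backward(dA, cache, g_prime):
    """Backward block: input dA^{[l]}, outputs dA^{[l-1]}, dW^{[l]}, db^{[l]}."""
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    dZ = dA * g_prime(Z)                      # element-wise product
    dW = dZ @ A_prev.T / m                    # averaged over the m examples
    db = dZ.sum(axis=1, keepdims=True) / m
    dA_prev = W.T @ dZ
    return dA_prev, dW, db

rng = np.random.default_rng(0)
sizes = [2, 3, 1]  # illustrative layer sizes
params = [
    (rng.standard_normal((n_out, n_in)), np.zeros((n_out, 1)))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
X = rng.standard_normal((2, 4))  # m = 4 examples as columns

# Forward pass: chain the blocks and collect one cache per layer.
A, caches = X, []
for W, b in params:
    A, cache = layer_forward(A, W, b, np.tanh)
    caches.append(cache)

# Backward pass: walk the caches in reverse order.
dA = A  # per-example derivative of the toy loss 0.5 * a^2
grads = []
for cache in reversed(caches):
    dA, dW, db = layer_backward(dA, cache, tanh_prime)
    grads.append((dW, db))
grads.reverse()  # grads[l-1] now holds (dW^{[l]}, db^{[l]})
```

Each gradient has the same shape as the parameter it belongs to, so a gradient-descent update can be applied to `params` and `grads` pairwise.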

* Reiteration of the backward pass
    * Input $da^{[l]}$
    * Output $da^{[l-1]}, dW^{[l]}, db^{[l]}$
    * $dz^{[l]} = da^{[l]} * g^{[l]'} (z^{[l]})$ (element-wise product)
    * $dW^{[l]} = dz^{[l]} \cdot a^{[l-1]T}$
    * $db^{[l]} = dz^{[l]}$
    * $da^{[l-1]} = W^{[l]T} \cdot dz^{[l]}$
    * substituting $da^{[l]} = W^{[l+1]T} \cdot dz^{[l+1]}$ gives $dz^{[l]} = W^{[l+1]T} dz^{[l+1]} * g^{[l]'} (z^{[l]})$
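
For $m$ examples stacked as columns, the same step is commonly written in vectorized form. The version below assumes the cost is the average of the per-example losses, which is where the $\frac{1}{m}$ factors come from; the notes do not state this convention explicitly. Here $*$ denotes the element-wise product and the sum runs over the columns of $dZ^{[l]}$.

$$
\begin{aligned}
dZ^{[l]} &= dA^{[l]} * g^{[l]'}(Z^{[l]}) \\
dW^{[l]} &= \frac{1}{m} \, dZ^{[l]} A^{[l-1]T} \\
db^{[l]} &= \frac{1}{m} \sum_{i=1}^{m} dZ^{[l](i)} \\
dA^{[l-1]} &= W^{[l]T} dZ^{[l]}
\end{aligned}
$$

The resulting shapes match the dimension rules: $dW^{[l]}$ is $(n^{[l]}, n^{[l-1]})$, $db^{[l]}$ is $(n^{[l]}, 1)$, and $dA^{[l-1]}$ is $(n^{[l-1]}, m)$.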

**Additional reading**
* https://jonaslalin.com/2021/12/10/feedforward-neural-networks-part-1/
* https://jonaslalin.com/2021/12/21/feedforward-neural-networks-part-2/
* https://jonaslalin.com/2021/12/22/feedforward-neural-networks-part-3/
* https://jonaslalin.com/2021/10/12/forward-vs-reverse-accumulation-mode/