# Deep Neural Network

## Deep L-layer neural network

- Logistic regression is considered a shallow neural network (a 1-layer NN)
- A network with 1 hidden layer is a 2-layer NN
- A network with 2 hidden layers is a 3-layer NN
- Lastly, a network with 5 hidden layers could be considered a deep neural network

Notation: (You can also use the screenshot as a reference)
- L = 4 (# of layers)
- $n^{[l]}$ = # of units in layer $l$
    - $n^{[0]}$ = $n_x$ = 3
    - $n^{[1]}$ = 5
    - $n^{[2]}$ = 5
    - $n^{[3]}$ = 3
    - $n^{[4]}$ = $n^{[L]}$ = 4
- $a^{[l]}$ = activations in layer $l$
- $a^{[l]}$ = $g^{[l]}(z^{[l]})$
- $W^{[l]}$ and $b^{[l]}$ = weights and bias used to compute $z^{[l]}$

- $\hat{y} = a^{[L]}$

<img src="./images/deep_68.png" alt="Drawing" style="width: 450px;"/>


## Forward Propagation in a Deep Network

Must remember: Andrew Ng consistently comes back to forward propagation. It is the mechanism that passes the inputs through the network, layer by layer, to produce the output.

Remember: The indexing of $A$ differs from that of $W$ and $b$. The activations start at layer 0, where $A^{[0]}$ is the input $X$, but the parameters start at layer 1, so the first weight matrix is $W^{[1]}$ and the first bias is $b^{[1]}$.

Vectorization
- $Z^{[1]} = W^{[1]}A^{[0]} + b^{[1]}$
- $A^{[1]} = g^{[1]}(Z^{[1]})$
- $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$
- $A^{[2]} = g^{[2]}(Z^{[2]})$
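
The same two lines repeat for every layer, so the forward pass can be written as a loop over $l = 1, \dots, L$. The following is a minimal NumPy sketch, assuming ReLU for the hidden layers and sigmoid for the output layer. The function names, the `params` dictionary and the layer sizes are hypothetical choices made for illustration, not something fixed by these notes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def forward_propagation(X, params, L):
    """Return A^[L] for inputs X of shape (n^[0], m).

    params is a dict holding W1..WL and b1..bL (hypothetical naming).
    """
    A = X  # A^[0] = X
    for l in range(1, L + 1):
        Z = params[f"W{l}"] @ A + params[f"b{l}"]  # Z^[l] = W^[l] A^[l-1] + b^[l]
        A = sigmoid(Z) if l == L else relu(Z)      # A^[l] = g^[l](Z^[l])
    return A

# Toy usage with hypothetical layer sizes n^[0..3] = 3, 5, 5, 1 and m = 4 examples
rng = np.random.default_rng(0)
layer_dims = [3, 5, 5, 1]
L = len(layer_dims) - 1
params = {}
for l in range(1, L + 1):
    params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
    params[f"b{l}"] = np.zeros((layer_dims[l], 1))

X = rng.standard_normal((3, 4))
Y_hat = forward_propagation(X, params, L)
print(Y_hat.shape)  # (1, 4)
```

Note that the loop starts at 1 because `A` starts as $A^{[0]} = X$, while the first parameters are $W^{[1]}$ and $b^{[1]}$.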


## Getting your matrix dimensions right

<img src="./images/deep_69.png" alt="Drawing" style="width: 350px;"/>

- Pay attention to the hidden units in each of the layers

<img src="./images/deep_70.png" alt="Drawing" style="width: 350px;"/>

- Follow the steps:
- The input $x = a^{[0]}$ is a 2 by 1 vector ($n^{[0]} = 2$) and $z^{[1]}$ is a 3 by 1 vector ($n^{[1]} = 3$).
- Thus, by the rules of matrix multiplication, the weights $W^{[1]}$ must be a 3 by 2 matrix
- Moreover, this is an ($n^{[1]}, n^{[0]}$) matrix.
- This rule can be applied to every layer, and it would work.
- Thus for $W^{[l]}: (n^{[l]}, n^{[l-1]})$
- Since the bias is added to $W^{[l]}a^{[l-1]}$, it must have the same shape as $z^{[l]}$:
    - $b^{[l]}: (n^{[l]}, 1)$

- Remember, the gradients have the same dimensions as the parameters they correspond to:
- $dW^{[l]}: (n^{[l]}, n^{[l-1]})$
- $db^{[l]}: (n^{[l]}, 1)$
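
These shape rules for the parameters (and therefore for their gradients) can be checked mechanically. A minimal sketch, assuming a hypothetical helper `initialize_parameters` and illustrative layer sizes `[2, 3, 5, 1]`, of which only the first two match the example in these notes:

```python
import numpy as np

def initialize_parameters(layer_dims, seed=0):
    """Create W^[l] of shape (n^[l], n^[l-1]) and b^[l] of shape (n^[l], 1)."""
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        params[f"W{l}"] = rng.standard_normal((layer_dims[l], layer_dims[l - 1])) * 0.01
        params[f"b{l}"] = np.zeros((layer_dims[l], 1))
    return params

layer_dims = [2, 3, 5, 1]  # hypothetical n^[0], n^[1], n^[2], n^[3]
params = initialize_parameters(layer_dims)

for l in range(1, len(layer_dims)):
    # Fails loudly if a shape does not follow the rule
    assert params[f"W{l}"].shape == (layer_dims[l], layer_dims[l - 1])
    assert params[f"b{l}"].shape == (layer_dims[l], 1)
    print(l, params[f"W{l}"].shape, params[f"b{l}"].shape)
```

Assertions like these are a cheap way to catch dimension bugs before training starts.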

The following is an example:
- <img src="./images/deep_71.png" alt="Drawing" style="width: 350px;"/>
- All the training examples are stacked together: each column is one training example (e.g. one person) and each row is one variable. Compared with the usual one-row-per-example layout, the data is transposed!
- <img src="./images/deep_72.png" alt="Drawing" style="width: 350px;"/>
- Notice how we are changing the 1 to $m$: we are no longer looking at one observation but at all $m$ training examples at once!
- $z^{[l]}, a^{[l]}: (n^{[l]}, 1)$
- $Z^{[l]}, A^{[l]}: (n^{[l]}, m)$
- Special case (layer 0):
- $A^{[0]} = X: (n^{[0]}, m)$
- $dZ^{[l]}, dA^{[l]}: (n^{[l]}, m)$
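
One detail worth seeing in code: $b^{[l]}$ keeps the shape $(n^{[l]}, 1)$ even when there are $m$ examples. In NumPy, the addition broadcasts the bias across the $m$ columns. A small sketch with hypothetical sizes (2 units in, 3 units out, 4 examples) and made-up values chosen so the result is easy to read:

```python
import numpy as np

n_prev, n_curr, m = 2, 3, 4  # hypothetical n^[l-1], n^[l], number of examples

W = np.ones((n_curr, n_prev))                          # (n^[l], n^[l-1])
b = np.arange(n_curr, dtype=float).reshape(n_curr, 1)  # (n^[l], 1), values 0, 1, 2
A_prev = np.ones((n_prev, m))                          # (n^[l-1], m), one column per example

Z = W @ A_prev + b  # b is broadcast across the m columns

print(Z.shape)   # (3, 4)
print(Z[:, 0])   # [2. 3. 4.]
```

Every column of `Z` is identical here because every column of `A_prev` is identical; the point is only that the shapes line up as $(n^{[l]}, m)$.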

## Why deep representations?

The intuition is that a deep network breaks a complex problem down into simpler pieces, like a puzzle, and then assembles those pieces into a correct interpretation of the data.

There has been a comparison with the brain. We start with small, seemingly insignificant concepts and then build a complex understanding out of these small tasks.

Another reason why deep networks do well:
- From circuit theory and deep learning
    - Informally: There are functions you can compute with a "small" L-layer deep neural network that shallower networks require exponentially more hidden units to compute
    - A shallow network typically has one hidden layer, so it needs far more hidden units in that layer to compute what a small L-layer deep network computes (and the deep network probably does not even need many units per layer)
    
Why is that the case:
- Think back to information theory: a key idea was to use a decision tree to narrow down the choices step by step rather than compare every choice against every other choice. The concept here is the same.
- With a deep network, the depth needed is on the order of $\log(n)$: with 8 different inputs, we would need 3 layers, since $\log_2(8) = 3$
- However, a shallow network has to account for every possible combination of the inputs, which is exponentially large: $2^8 = 256$
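
To make the counting concrete, here is a small sketch using the parity (XOR) of $n$ bits, a common illustration of this depth argument. The choice of parity and the helper name `xor_tree` are assumptions, not something stated in these notes. Combining inputs pairwise takes $\log_2(n)$ levels, while listing every input combination takes $2^n$ rows.

```python
from itertools import product

def xor_tree(bits):
    """Parity of bits via pairwise XOR; returns (parity, depth).

    Assumes len(bits) is a power of 2.
    """
    level = list(bits)
    depth = 0
    while len(level) > 1:
        # Each pass halves the number of values, like one layer of a deep network
        level = [level[i] ^ level[i + 1] for i in range(0, len(level), 2)]
        depth += 1
    return level[0], depth

n = 8
parity, depth = xor_tree([1, 0, 1, 1, 0, 0, 1, 0])
print(parity, depth)  # 0 3

# A flat lookup has to cover every possible input combination
combos = list(product([0, 1], repeat=n))
print(len(combos))    # 256
```

The tree reaches the answer in 3 levels, whereas the flat enumeration has 256 rows for the same 8 inputs.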

## Building blocks of deep neural networks
Forward Propagation
- <img src="./images/deep_73.png" alt="Drawing" style="width: 400px;"/>

Backward/Forward Propagation
- <img src="./images/deep_74.png" alt="Drawing" style="width: 400px;"/>


## Forward and Backward Propagation

Forward propagation for layer $l$
- Input $a^{[l-1]}$
- Output $a^{[l]}$, cache $(z^{[l]})$

Backward propagation for layer $l$
- Input $da^{[l]}$
- Output $da^{[l-1]}$, $dW^{[l]}$, $db^{[l]}$

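One important thing to know:
- Backpropagation starts at the output layer. For the logistic loss, the derivative with respect to the output $a = a^{[L]}$ is $da^{[L]} = -\frac{y}{a} + \frac{1-y}{1-a}$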
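
A minimal sketch of one backward step, assuming a sigmoid output layer trained with the logistic (cross-entropy) loss. The function name `layer_backward` and the decision to cache $A^{[l-1]}$ and $W^{[l]}$ alongside $Z^{[l]}$ are assumptions made so the example is self-contained; the data is random toy data.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_backward(dA, cache):
    """One backward step for a sigmoid layer.

    Input:  dA^[l] and the cache stored during the forward pass.
    Output: dA^[l-1], dW^[l], db^[l].
    """
    A_prev, W, Z = cache
    m = A_prev.shape[1]
    A = sigmoid(Z)
    dZ = dA * A * (1.0 - A)                     # dZ^[l] = dA^[l] * g'(Z^[l])
    dW = (dZ @ A_prev.T) / m                    # same shape as W^[l]
    db = np.sum(dZ, axis=1, keepdims=True) / m  # same shape as b^[l]
    dA_prev = W.T @ dZ                          # same shape as A^[l-1]
    return dA_prev, dW, db

# Toy setup: 3 inputs, 1 sigmoid output unit, m = 4 examples
rng = np.random.default_rng(0)
A_prev = rng.standard_normal((3, 4))
W = rng.standard_normal((1, 3))
b = np.zeros((1, 1))
Y = np.array([[1.0, 0.0, 1.0, 0.0]])

Z = W @ A_prev + b
A = sigmoid(Z)

dA = -Y / A + (1.0 - Y) / (1.0 - A)  # derivative of the logistic loss w.r.t. the output
dA_prev, dW, db = layer_backward(dA, (A_prev, W, Z))
print(dA_prev.shape, dW.shape, db.shape)  # (3, 4) (1, 3) (1, 1)
```

For this sigmoid-plus-logistic-loss combination, `dZ` simplifies algebraically to `A - Y`, which is a handy sanity check on the code.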

## Parameters vs Hyperparameters

What are hyperparameters?
- Parameters: $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, W^{[3]}, b^{[3]}$
- Hyperparameters could be:
    - The learning rate $\alpha$
    - Number of iterations
    - Number of hidden layers, $L$
    - Number of hidden units, $n^{[1]}, n^{[2]}, \dots$
    - Choice of activation function (e.g. ReLU)
- Other hyperparameters could be:
    - Momentum, mini-batch size, regularization

You must play around with the hyperparameters to better understand how they work. 
 
The process is very empirical: we have to find the best hyperparameters through experimentation!
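
As a small illustration of this kind of experiment, the sketch below trains logistic regression on a four-point toy dataset with three candidate learning rates and compares the final cost. Everything here is an assumption made for illustration: the helper `train_logistic`, the toy data, and the candidate values. The weights `w` and `b` are the parameters (learned); `learning_rate` and `num_iterations` are the hyperparameters (chosen by hand).

```python
import numpy as np

def train_logistic(X, Y, learning_rate, num_iterations):
    """Plain gradient descent for logistic regression; returns the final cost."""
    w = np.zeros((X.shape[0], 1))  # parameters: learned from the data
    b = 0.0
    m = X.shape[1]
    for _ in range(num_iterations):
        A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
        dZ = A - Y
        w -= learning_rate * (X @ dZ.T) / m
        b -= learning_rate * np.sum(dZ) / m
    A = 1.0 / (1.0 + np.exp(-(w.T @ X + b)))
    cost = -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(cost)

# Made-up toy data: one feature, four examples
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0.0, 0.0, 1.0, 1.0]])

costs = {}
for lr in [0.01, 0.1, 1.0]:  # candidate learning rates (hyperparameter)
    costs[lr] = train_logistic(X, Y, learning_rate=lr, num_iterations=100)
    print(lr, costs[lr])
```

On a real problem the comparison would use a held-out set and more candidates; the ranking seen on this toy data says nothing about which learning rate works elsewhere.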

## What does this have to do with the brain?

- <img src="./images/deep_75.png" alt="Drawing" style="width: 400px;"/>

While there is a loose analogy between a biological neuron and a unit in a neural network, even neuroscientists do not have a concrete idea of what a single neuron does.
