# Setting up your optimization problem

## Normalizing inputs
When normalizing the inputs, the goal is to transform the values so that each feature is centered around 0 (zero mean) and has a comparable scale (unit variance).
1. Subtract the mean of each feature from every value of that feature.
2. Divide the result of step 1 by the standard deviation of that feature (the square root of its variance).

<img src="./images/improv_11.png" alt="Drawing" style="width: 600px;"/>

Scale the training set and the test set with the same mean and standard deviation, both computed on the training set
- <img src="./images/improv_12.png" alt="Drawing" style="width: 350px;"/>
- Notice that once the inputs are scaled, the cost surface is more symmetric, so gradient descent can take larger steps and tends to reach a minimum in fewer iterations
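
The two steps above can be sketched with NumPy as follows. This is a minimal illustration, not the course's code: the toy arrays and the names `X_train`, `X_test`, `mu`, and `sigma` are assumptions, and the layout (rows are examples, columns are features) is also an assumption.

```python
import numpy as np

# Toy data (assumed layout: one row per example, one column per feature)
X_train = np.array([[1.0, 200.0],
                    [2.0, 300.0],
                    [3.0, 400.0],
                    [4.0, 500.0]])
X_test = np.array([[2.5, 350.0],
                   [5.0, 600.0]])

# Statistics are computed on the training set only
mu = X_train.mean(axis=0)     # per-feature mean
sigma = X_train.std(axis=0)   # per-feature standard deviation

# Step 1: subtract the mean. Step 2: divide by the standard deviation.
X_train_norm = (X_train - mu) / sigma

# The test set reuses the training-set mu and sigma
X_test_norm = (X_test - mu) / sigma

print(X_train_norm.mean(axis=0), X_train_norm.std(axis=0))
```

Reusing `mu` and `sigma` from the training set means both sets go through exactly the same transformation.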

## Vanishing / Exploding gradients
When training a very deep network, the derivatives (gradients) can become either very large or very small.

In the example below, we treat every layer as linear (the activation function is the identity). Thus y_hat can be written as the product of weight matrices shown in the image below.

<img src="./images/improv_50.png" alt="Drawing" style="width: 550px;"/>

The image above explains that when training a deep neural network, one difficulty is that the activations (and likewise the gradients) can grow or shrink exponentially with the number of layers. If the weights are slightly larger than 1 the values explode; if they are slightly smaller than 1 they vanish, and gradient descent then takes tiny steps and learns very slowly.

The image above is not an accurate representation of a deep network, since it has no activation function. It does serve the purpose of showing how using the same weight matrix throughout the network can be an issue.
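
To make the exponential effect concrete, here is a small sketch of a purely linear network in which every layer uses the same weight matrix. The helper `linear_network_output`, the scales 1.5 and 0.5, and the depth of 50 layers are all illustrative assumptions, not values taken from the image.

```python
import numpy as np

def linear_network_output(weight_scale, num_layers, x):
    """Push x through num_layers linear layers that all share W = weight_scale * I."""
    W = weight_scale * np.eye(x.shape[0])
    a = x
    for _ in range(num_layers):
        a = W @ a  # no activation function and no bias
    return a

x = np.array([1.0, 1.0])

exploding = linear_network_output(1.5, 50, x)  # each entry equals 1.5 ** 50
vanishing = linear_network_output(0.5, 50, x)  # each entry equals 0.5 ** 50

print(exploding, vanishing)
```

With a scale just above 1 the output grows to hundreds of millions after 50 layers, and with a scale below 1 it shrinks to almost zero.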

## Weight Initialization for Deep Networks

<img src="./images/improv_51.png" alt="Drawing" style="width: 550px;"/>

As we learned in the previous section, the weights must be initialized carefully. A partial solution is to scale the randomly initialized weights of each layer according to the number of inputs feeding into that layer (and, in one variant, also the number of outputs). The more inputs a unit has, the smaller its weights are made, which keeps the weighted sum z at a similar scale in every layer and reduces the chance of exploding/vanishing gradients.

The formulas used are in the image above.

Since we do not want to suffer from vanishing/exploding gradients, this method helps keep the scale of each layer's output roughly constant, so it neither grows nor shrinks too quickly with depth.

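```python
import numpy as np

n_layer_minus_1 = 4  # units in the previous layer (example value)
n_layer_1 = 3        # units in the current layer (example value)
shape = (n_layer_1, n_layer_minus_1)

# randn draws from a standard normal; rand would give only values in [0, 1)
# ReLU: He initialization, variance 2 / n_layer_minus_1
W_layer_1 = np.random.randn(*shape) * np.sqrt(2 / n_layer_minus_1)

# tanh: variance 1 / n_layer_minus_1
W_layer_1 = np.random.randn(*shape) * np.sqrt(1 / n_layer_minus_1)

# Xavier (Glorot) initialization: variance 2 / (n_layer_minus_1 + n_layer_1)
W_layer_1 = np.random.randn(*shape) * np.sqrt(2 / (n_layer_minus_1 + n_layer_1))
```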
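
As a rough check of why the scaling factor matters, the sketch below compares the spread of z = W a for unscaled and scaled weights in a single linear layer. The layer sizes, the random inputs, and the names `A_prev`, `W_unscaled`, and `W_scaled` are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

n_prev, n_curr, m = 500, 100, 1000  # inputs per unit, units, examples (assumed sizes)

# Inputs to the layer, drawn with mean 0 and variance 1
A_prev = rng.standard_normal((n_prev, m))

# Same random weights, with and without the sqrt(1 / n_prev) factor
W_unscaled = rng.standard_normal((n_curr, n_prev))
W_scaled = W_unscaled * np.sqrt(1 / n_prev)

Z_unscaled = W_unscaled @ A_prev
Z_scaled = W_scaled @ A_prev

# Without scaling the spread of z grows like sqrt(n_prev); with scaling it stays near 1
print(Z_unscaled.std(), Z_scaled.std())
```

Stacking many layers repeats this effect, which is why the unscaled version tends to explode as the network gets deeper.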

## Numerical approximation of gradients
- <img src="./images/improv_14.png" alt="Drawing" style="width: 550px;"/>

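The idea is that we build the 'triangle' using both $\theta + \epsilon$ and $\theta - \epsilon$ to get a closer approximation to the actual gradient. This two-sided difference is more accurate than the one-sided difference, which uses only $\theta + \epsilon$.

Interesting: the approximation error depends on which difference we use. It is on the order of $\epsilon^2$ for the two-sided difference (width $2\epsilon$) but on the order of $\epsilon$ for the one-sided difference (width $\epsilon$).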
- <img src="./images/improv_15.png" alt="Drawing" style="width: 450px;"/>
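
A short numerical comparison of the two approximations is sketched below. The function $f(\theta) = \theta^3$, the point $\theta = 1$, and $\epsilon = 0.01$ are assumed example values chosen because the true derivative, 3, is easy to verify by hand.

```python
def f(theta):
    return theta ** 3

theta = 1.0
epsilon = 0.01
true_derivative = 3 * theta ** 2  # analytic derivative of theta ** 3

# One-sided difference: uses only theta + epsilon
one_sided = (f(theta + epsilon) - f(theta)) / epsilon

# Two-sided difference: uses theta + epsilon and theta - epsilon
two_sided = (f(theta + epsilon) - f(theta - epsilon)) / (2 * epsilon)

one_sided_error = abs(one_sided - true_derivative)
two_sided_error = abs(two_sided - true_derivative)

print(one_sided_error, two_sided_error)
```

Here the one-sided error is about 0.03 (on the order of $\epsilon$) while the two-sided error is about 0.0001 (on the order of $\epsilon^2$).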


## Gradient checking

This section is useful when you have to debug the gradient calculations. Your goal is to check whether the derivative computed by backpropagation matches the actual derivative of the cost function.

Gradient check for a neural network:
- Take $W^{[1]}$, $b^{[1]}$, ..., $W^{[L]}$, $b^{[L]}$ and reshape them into one large vector $\theta$.
- Take $dW^{[1]}$, $db^{[1]}$, ..., $dW^{[L]}$, $db^{[L]}$ and reshape them into one large vector $d\theta$.


<img src="./images/improv_16.png" alt="Drawing" style="width: 550px;"/>

- Notice that we compute the numerical approximation and then compare it with the gradient computed by backpropagation.
- We use the Euclidean distance between the approximated and the computed gradient vectors, normalized by the sum of their norms.
- In the image, the instructor provides thresholds for acceptable versus unacceptable differences.
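
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the toy cost function, the parameter values, and the helper names `numerical_gradient` and `gradient_check` are hypothetical and stand in for a real network's forward and backward passes.

```python
import numpy as np

def numerical_gradient(cost_fn, theta, epsilon=1e-7):
    """Two-sided difference approximation, one parameter at a time."""
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_minus = theta.copy()
        theta_plus[i] += epsilon
        theta_minus[i] -= epsilon
        grad_approx[i] = (cost_fn(theta_plus) - cost_fn(theta_minus)) / (2 * epsilon)
    return grad_approx

def gradient_check(cost_fn, grad, theta, epsilon=1e-7):
    """Euclidean distance between the gradients, normalized by the sum of their norms."""
    grad_approx = numerical_gradient(cost_fn, theta, epsilon)
    numerator = np.linalg.norm(grad_approx - grad)
    denominator = np.linalg.norm(grad_approx) + np.linalg.norm(grad)
    return numerator / denominator

# Toy parameters standing in for W1 and b1, reshaped into one vector
W1 = np.array([[0.5, -1.0], [2.0, 0.3]])
b1 = np.array([0.1, -0.2])
theta = np.concatenate([W1.ravel(), b1.ravel()])

def cost_fn(t):
    # Assumed toy cost with a known derivative: t + cos(t)
    return 0.5 * np.sum(t ** 2) + np.sum(np.sin(t))

correct_grad = theta + np.cos(theta)  # analytic gradient
buggy_grad = theta                    # deliberate bug: the cos term is forgotten

diff_ok = gradient_check(cost_fn, correct_grad, theta)
diff_bad = gradient_check(cost_fn, buggy_grad, theta)
print(diff_ok, diff_bad)
```

A rule of thumb often quoted for $\epsilon = 10^{-7}$ is that a difference near $10^{-7}$ is reassuring while one near $10^{-3}$ points to a bug; treat these as rough guidance rather than as the exact values in the image.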

## Gradient Checking Implementation Notes

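- Don't use gradient checking during training; use it only to debug, since the numerical approximation is slow.
- If the algorithm fails grad check, look at the individual components to try to identify the bug.
    - e.g. the mismatch may show up only in the b's (db) of some layers while the dW values agree
- Remember regularization: if the cost includes a regularization term, the gradient must include its derivative as well.
- Doesn't work with dropout, because the randomly dropped neurons change on every pass, so there is no fixed cost function to check against.
    - We could set keep_prob to 1.0 just to run the check, then turn dropout back on.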
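
The "look at components" tip can be sketched by computing the relative difference separately for each block of the gradient. The example below is an assumption-laden illustration: it uses a tiny logistic regression instead of a deep network, and `buggy_gradient` contains a deliberately planted bug in `db` so that the per-block comparison has something to find.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # theta packs the weights w followed by the bias b
    w, b = theta[:-1], theta[-1]
    a = sigmoid(X @ w + b)
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

def buggy_gradient(theta, X, y):
    w, b = theta[:-1], theta[-1]
    a = sigmoid(X @ w + b)
    dw = X.T @ (a - y) / len(y)  # correct
    db = np.sum(a - y)           # deliberate bug: missing division by m
    return np.append(dw, db)

def numerical_gradient(theta, X, y, epsilon=1e-7):
    grad_approx = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = epsilon
        grad_approx[i] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * epsilon)
    return grad_approx

def relative_difference(a, b):
    return np.linalg.norm(a - b) / (np.linalg.norm(a) + np.linalg.norm(b))

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))           # 20 examples, 3 features (toy data)
y = (rng.random(20) > 0.5).astype(float)   # random binary labels
theta = rng.standard_normal(4)             # 3 weights followed by 1 bias

grad = buggy_gradient(theta, X, y)
grad_approx = numerical_gradient(theta, X, y)

# Compare each block of the flattened gradient on its own
blocks = {"dw": slice(0, 3), "db": slice(3, 4)}
differences = {name: relative_difference(grad_approx[s], grad[s])
               for name, s in blocks.items()}
print(differences)
```

The `dw` block agrees with the numerical gradient while the `db` block is far off, which narrows the search to the line that computes `db`.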