# Regularizing your neural network

## Regularization

<img src="./images/improv_3.png" alt="Drawing" style="width: 350px;"/>

- Notice that we are adding a regularization term proportional to the squared norm of W!
- The equation above is L2 regularization.
- We usually do not regularize b, although in practice you could. W is a high-dimensional parameter, while b is typically a single number, so regularizing it makes little difference.
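
The formula itself lives in the image above, so the following is only a minimal sketch. It assumes the usual form for logistic regression: the average cross-entropy loss plus `(lambd / (2 * m))` times the sum of squared entries of `w`. The function name `l2_regularized_cost` and the sample values are hypothetical.

```python
import numpy as np

def l2_regularized_cost(y_hat, y, w, lambd):
    """Average cross-entropy plus an L2 penalty on w (b is not penalized)."""
    m = y.shape[0]
    eps = 1e-12  # guards against log(0)
    cross_entropy = -np.mean(
        y * np.log(y_hat + eps) + (1 - y) * np.log(1 - y_hat + eps)
    )
    l2_penalty = (lambd / (2 * m)) * np.sum(np.square(w))
    return cross_entropy + l2_penalty

# Hypothetical labels, predictions, and weights
y = np.array([1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
w = np.array([0.5, -1.5, 2.0])

cost_plain = l2_regularized_cost(y_hat, y, w, lambd=0.0)
cost_reg = l2_regularized_cost(y_hat, y, w, lambd=0.7)
```

The two costs differ only by the penalty, so the same predictions cost more when the weights are large.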

<img src="./images/improv_4.png" alt="Drawing" style="width: 500px;"/>

- Although this has the same form as the L2 norm (a sum of squared entries), for a matrix it is not called the L2 norm. Instead, it is called the `Frobenius norm`.
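
As a quick check of that definition, the sketch below computes the squared Frobenius norm two ways: by summing squared entries, and with `np.linalg.norm(W, ord="fro")`. The layer shapes, `lambd`, and `m` are made-up values, and `frobenius_penalty` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weight matrices W1, W2, W3 of a small 3-layer network
weights = [
    rng.standard_normal((4, 3)),
    rng.standard_normal((5, 4)),
    rng.standard_normal((1, 5)),
]

def frobenius_penalty(weights, lambd, m):
    """(lambd / 2m) * sum over layers of the squared Frobenius norm."""
    return (lambd / (2 * m)) * sum(np.sum(np.square(W)) for W in weights)

# The squared Frobenius norm is just the sum of squared entries
manual = sum(np.sum(np.square(W)) for W in weights)
via_numpy = sum(np.linalg.norm(W, ord="fro") ** 2 for W in weights)

penalty = frobenius_penalty(weights, lambd=0.7, m=100)
```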


Implementing the regularization (this affects backprop):

<img src="./images/improv_5.png" alt="Drawing" style="width: 500px;"/>

Substituting the formulas:
- Notice that L2 regularization is sometimes called weight decay, since each update multiplies the weights by a factor slightly less than 1, in addition to subtracting the gradient of W.

<img src="./images/improv_6.png" alt="Drawing" style="width: 500px;"/>

We add the regularization term to penalize large weights, so the cost can no longer be lowered simply by letting the weights grow. When we compute the derivatives, we also see that the extra term shrinks the weights further at every update (smaller weights are a good thing when you want to prevent overfitting).
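
The weight-decay view can be verified numerically. The sketch below assumes the regularized gradient is the backprop gradient plus `(lambd / m) * W`, and shows that the ordinary update equals shrinking `W` by `1 - alpha * lambd / m` and then taking the unregularized step. All values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((3, 2))
dW_backprop = rng.standard_normal((3, 2))  # gradient of the unregularized cost
alpha, lambd, m = 0.1, 0.7, 50             # hypothetical hyperparameters

# Form 1: add the regularization gradient, then take a normal step
dW = dW_backprop + (lambd / m) * W
W_update = W - alpha * dW

# Form 2: shrink W first ("weight decay"), then take the unregularized step
decay = 1 - alpha * lambd / m
W_decay_form = decay * W - alpha * dW_backprop
```

Both forms produce the same weights, and `decay` is slightly below 1, which is where the name comes from.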

## Why regularization reduces overfitting?
Remember, if the regularization is large, then the weights are pushed close to 0. This means the model will not be able to overfit the data.

Remember, as you increase the regularization, each update shrinks the weights by a larger amount.

Another way of understanding this:
- When lambda is large, the weights are smaller.
- With smaller weights, the pre-activation values z are smaller too. For the tanh function, the activations then stay in the roughly linear region around 0, not in the outer, saturated regions.
- Every layer then behaves almost like a linear function. Thus, the network can only compute something close to a simple linear model rather than a complicated one.
- <img src="./images/improv_7.png" alt="Drawing" style="width: 350px;"/>
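
To make the "linear portion" argument concrete, this small sketch compares `tanh(z)` with `z` itself for small and large inputs. The input ranges are arbitrary choices for illustration.

```python
import numpy as np

z_small = np.linspace(-0.1, 0.1, 5)  # what small weights tend to produce
z_large = np.linspace(-3.0, 3.0, 5)  # what large weights can produce

# How far tanh is from the identity (a purely linear function) on each range
err_small = np.max(np.abs(np.tanh(z_small) - z_small))
err_large = np.max(np.abs(np.tanh(z_large) - z_large))
```

For the small range the gap is tiny, so the layer acts almost linearly. For the large range the gap is big because tanh saturates.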


## Dropout Regularization
In the image below, each node has a 50 percent probability of being removed.
- Once the node is removed, we will remove any incoming/outgoing links to that node. 

- <img src="./images/improv_8.png" alt="Drawing" style="width: 550px;"/>


How is dropout regularization done?
- We use "inverted dropout":
```python
import numpy as np
keep_prob = 0.8                # probability of keeping each neuron
a3 = np.random.randn(5, 10)    # stand-in for the layer-3 activations from forward prop

# If we are using dropout at layer 3
d3 = np.random.rand(a3.shape[0], a3.shape[1]) < keep_prob  # boolean keep mask
a3 = np.multiply(a3, d3)       # zero out the dropped neurons
a3 /= keep_prob                # scale up so the expected value is unchanged
```
- If `keep_prob` is 0.8, then each neuron has a 0.2 probability of being dropped.
- The last step looks a bit odd. Dividing by `keep_prob` restores the expected value of the activations after neurons have been dropped.
- Think of it like this: we train with fewer neurons, yet we use all the neurons at test time. Thus, the expected value of the activations must stay the same, which keeps training and test-time activations on the same scale.
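
The claim about the expected value can be checked empirically. The sketch below wraps the three steps in a hypothetical helper, `inverted_dropout`, and applies it to an all-ones activation matrix, so the mean after dropout is easy to read off. The matrix size and seed are arbitrary.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    """Drop units with probability 1 - keep_prob, then rescale by 1 / keep_prob."""
    mask = rng.random(a.shape) < keep_prob
    return (a * mask) / keep_prob, mask

rng = np.random.default_rng(0)
a3 = np.ones((50, 2000))  # stand-in activations, all equal to 1
a3_dropped, d3 = inverted_dropout(a3, keep_prob=0.8, rng=rng)

kept_fraction = d3.mean()       # close to keep_prob
mean_after = a3_dropped.mean()  # close to the original mean of 1.0
```

Surviving units are scaled from 1 to 1 / 0.8 = 1.25, which offsets the roughly 20 percent of units that were zeroed out.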

## Understanding Dropout
Intuition: a unit cannot rely on any one feature, so it has to spread out its weights, since any one feature could be dropped out!

You can also use a different keep probability for each layer.
- <img src="./images/improv_9.png" alt="Drawing" style="width: 350px;"/>
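
One way to express per-layer keep probabilities is a list with one entry per layer. The forward pass below is a rough sketch, not a full implementation: it assumes tanh activations everywhere, omits biases, and uses a hypothetical function name `forward_with_dropout`. A keep probability of 1.0 means no dropout for that layer.

```python
import numpy as np

def forward_with_dropout(x, weights, keep_probs, rng):
    """Forward pass applying inverted dropout with a separate keep_prob per layer."""
    a = x
    for W, keep_prob in zip(weights, keep_probs):
        a = np.tanh(W @ a)                     # biases omitted for brevity
        mask = rng.random(a.shape) < keep_prob
        a = a * mask / keep_prob
    return a

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 10))  # 3 features, 10 examples
weights = [
    rng.standard_normal((7, 3)),
    rng.standard_normal((7, 7)),  # largest matrix, so it gets the lowest keep_prob
    rng.standard_normal((1, 7)),
]

out = forward_with_dropout(x, weights, keep_probs=[0.8, 0.5, 1.0], rng=rng)
out_no_dropout = forward_with_dropout(x, weights, keep_probs=[1.0, 1.0, 1.0], rng=rng)
```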

## Other regularization methods

Data augmentation:
- You can train with more examples by transforming the ones you already have (for example, images).
- Think of adding noise, rotating the image, or shifting its position.
- This gives us more data without having to collect more training examples.
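
As a rough illustration of these transformations, the sketch below produces four variants of a single grayscale image using plain NumPy. The `augment` helper, the 8x8 image, the noise level, and the shift range are all hypothetical. Note that `np.roll` wraps pixels around the edge, so it is only a crude stand-in for a real translation.

```python
import numpy as np

def augment(image, rng):
    """Return a few simple variants of a 2-D grayscale image."""
    flipped = np.fliplr(image)                # mirror left-right
    rotated = np.rot90(image)                 # rotate by 90 degrees
    shift = int(rng.integers(-2, 3))          # shift by -2..2 pixels
    shifted = np.roll(image, shift, axis=1)   # wraps around at the border
    noisy = image + rng.normal(0.0, 0.05, size=image.shape)  # Gaussian noise
    return [flipped, rotated, shifted, noisy]

rng = np.random.default_rng(0)
image = rng.random((8, 8))  # stand-in for a real training image
variants = augment(image, rng)
```

Each variant keeps the original label, so one labeled example becomes several.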


Early stopping:
- <img src="./images/improv_10.png" alt="Drawing" style="width: 550px;"/>
- Notice that if we stop midway, the weights will not have grown large yet, so the network will not have fit the training data too closely.
- Larger weights would make the network perform better on the training set than on the dev set.
- Drawback
    - Orthogonalization: early stopping breaks it, because we are trying to do two things at once. We stop optimizing the cost function before it is well optimized, and at the same time we are trying not to overfit!
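
A minimal sketch of the stopping rule follows, assuming we record the dev set error after each iteration (or epoch). The `patience` parameter is an assumption added for illustration and is not part of the notes above: it is the number of consecutive non-improving checks tolerated before stopping. The error values are made up.

```python
def early_stopping_index(dev_errors, patience=2):
    """Return the index of the best dev error seen before patience runs out."""
    best_error = float("inf")
    best_iter = 0
    bad_checks = 0
    for i, err in enumerate(dev_errors):
        if err < best_error:
            best_error, best_iter, bad_checks = err, i, 0
        else:
            bad_checks += 1
            if bad_checks >= patience:
                break  # dev error has stopped improving
    return best_iter

# Hypothetical dev errors: they fall, then rise as the network starts to overfit
dev_errors = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_at = early_stopping_index(dev_errors)
```

In practice you would keep a copy of the weights from iteration `stop_at` and use those as the final model.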