## \[Assignment 1\] Initialization 

#### 1. Zeros initialization

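In this case every hidden unit in a layer computes the same output and, by symmetry, receives the same backpropagated gradient, so we won't be making use of the separate hidden units.

In effect, the network collapses into a thin, deep, string-like network with a single effective unit per layer.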
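
The symmetry argument above can be checked numerically. The sketch below is not the assignment's code: it assumes a small two-layer network (tanh hidden layer, sigmoid output, cross-entropy loss), and `two_layer_grads` is a hypothetical helper written only for this illustration.

```python
import numpy as np

def two_layer_grads(X, Y, W1, b1, W2, b2):
    """Weight gradients of the mean cross-entropy loss for a tanh -> sigmoid net."""
    m = X.shape[1]
    # Forward pass
    A1 = np.tanh(W1 @ X + b1)
    A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
    # Backward pass
    dZ2 = A2 - Y
    dW2 = dZ2 @ A1.T / m
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
    dW1 = dZ1 @ X.T / m
    return dW1, dW2

np.random.seed(0)                              # arbitrary seed, for repeatability
X = np.random.randn(3, 5)                      # 3 features, 5 samples
Y = np.array([[1.0, 0.0, 1.0, 1.0, 0.0]])      # binary labels

n_hidden = 4
W1, b1 = np.zeros((n_hidden, 3)), np.zeros((n_hidden, 1))
W2, b2 = np.zeros((1, n_hidden)), np.zeros((1, 1))

dW1, dW2 = two_layer_grads(X, Y, W1, b1, W2, b2)

# Each row of dW1 is the gradient for one hidden unit: all rows are identical
rows_identical = np.allclose(dW1, dW1[0])
print(rows_identical)
print(dW1)
```

In this particular setup the weight gradients are not just identical but exactly zero, because the hidden activations and `W2` are both zero, so the hidden units never differentiate from one another.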




<font color='blue'>
    
**What you should remember**:
- The weights $W^{[l]}$ should be initialized randomly to break symmetry. 
- However, it's okay to initialize the biases $b^{[l]}$ to zeros. Symmetry is still broken so long as $W^{[l]}$ is initialized randomly. 


#### 2.  Random initialization with large values

**Observations**:

- When we initialize the weights to very large values, it is very likely that the sigmoid output of the last layer is either very close to 1 or very close to 0 (think of the sigmoid of -1000 or 1000). 
- This makes the loss very high whenever a prediction is wrong, and with random initialization we expect to get about half of the samples wrong. The result can be an `inf` loss, which is not good. 

- Poor initialization leads to vanishing/exploding gradients.

- Eventually we might get good results but initializing with high values causes training convergence to slow down.
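
The saturation effect in the observations above can be reproduced directly. This is a minimal sketch, assuming a naively implemented sigmoid and the binary cross-entropy loss; `sigmoid` and `log_loss` are illustrative helpers, not the assignment's functions.

```python
import numpy as np

def sigmoid(z):
    # Naive sigmoid; exp overflows for very negative z, so silence that warning
    with np.errstate(over="ignore"):
        return 1.0 / (1.0 + np.exp(-z))

def log_loss(a, y):
    # Per-sample binary cross-entropy; log(0) yields -inf, so silence that warning
    with np.errstate(divide="ignore"):
        return -(y * np.log(a) + (1 - y) * np.log(1 - a))

z = np.array([-1000.0, 1000.0])   # pre-activations produced by huge weights
a = sigmoid(z)                    # saturates to exactly 0.0 and 1.0 in float64
y = np.array([1.0, 0.0])          # labels chosen so both predictions are wrong

loss = log_loss(a, y)
grad = a * (1 - a)                # derivative of the sigmoid at z

print(a)
print(loss)
print(grad)
```

Both samples end up with an infinite loss and a sigmoid derivative of exactly zero, which is the combination that makes training with very large initial weights slow to recover.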



**Why randn (standard normal) and not rand (uniform)?**

- Think of the uniform sampler: `rand` draws from $[0, 1)$, so every weight is non-negative and values near the upper end of the range are just as likely as values near 0. 

- Larger pre-activations push the last-layer sigmoid toward saturation, where its gradient is very small (the sigmoid gradient is only large around the center, i.e. for inputs near 0).

- With randn the weights are centered on 0 and most of them fall close to it, which tends to yield larger gradient values at the beginning of training.

- Uniform initialization, on the other hand, would tend to converge more slowly.

In [2]:
# Check what happens when the weights are scaled by a large factor
import numpy as np
scale = 10
w_l = np.random.randn(3,2) * scale 

In [3]:
w_l

array([[ -6.37196125, -10.52565532],
       [ 27.63194141, -13.91045437],
       [ -2.2577012 ,  -7.41340445]])

In [8]:
np.random.rand(10,1) #range is [0,1)

array([[0.92864105],
       [0.7002517 ],
       [0.38039037],
       [0.103201  ],
       [0.09608372],
       [0.15344371],
       [0.50486605],
       [0.78872799],
       [0.21165572],
       [0.68714761]])

In [10]:
np.random.randn(10,1)  # unbounded range; standard normal (mean=0, std=1)

array([[-0.4394421 ],
       [ 0.24302633],
       [ 1.2852578 ],
       [-0.19889892],
       [ 1.09327393],
       [ 0.12481458],
       [-0.21623744],
       [ 0.70704534],
       [-1.81829869],
       [-0.17450156]])
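
The small samples in the cells above hint at the difference between the two samplers; a larger draw makes it easier to see. The sketch below only summarizes the two distributions, with an arbitrary seed and sample size.

```python
import numpy as np

np.random.seed(0)          # arbitrary seed, for repeatability
n = 100_000
u = np.random.rand(n)      # uniform on [0, 1)
g = np.random.randn(n)     # standard normal (mean=0, std=1)

print("rand : min=%.3f max=%.3f mean=%.3f" % (u.min(), u.max(), u.mean()))
print("randn: min=%.3f max=%.3f mean=%.3f" % (g.min(), g.max(), g.mean()))

# Share of standard-normal samples within one standard deviation of 0
frac_within_one_std = np.mean(np.abs(g) < 1.0)
print("fraction of randn samples with |x| < 1: %.3f" % frac_within_one_std)
```

The `rand` samples are all non-negative with a mean near 0.5, while the `randn` samples are centered on 0 with roughly two thirds of them falling within one standard deviation.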

**Congratulations**! You've completed this notebook on Initialization. 

Here's a quick recap of the main takeaways:

<font color='blue'>
    
- Different initializations lead to very different results
- Random initialization is used to break symmetry and make sure different hidden units can learn different things
- Resist initializing to values that are too large!
- He initialization works well for networks with ReLU activations
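
For reference, He initialization scales standard-normal weights by $\sqrt{2 / n^{[l-1]}}$, where $n^{[l-1]}$ is the number of units in the previous layer. A minimal sketch follows; `he_init` and the `layer_dims` argument are hypothetical names for this illustration, not necessarily the assignment's own function.

```python
import numpy as np

def he_init(layer_dims, seed=0):
    """Hypothetical helper: He initialization for a fully connected network.

    layer_dims lists the layer sizes, starting with the input size.
    """
    rng = np.random.RandomState(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # Random weights break symmetry; the scale depends on the fan-in
        params["W" + str(l)] = rng.randn(n_curr, n_prev) * np.sqrt(2.0 / n_prev)
        # Biases can safely start at zero
        params["b" + str(l)] = np.zeros((n_curr, 1))
    return params

params = he_init([1000, 500, 1])
print(params["W1"].shape, params["W1"].std(), np.sqrt(2.0 / 1000))
```

The empirical standard deviation of `W1` should land close to $\sqrt{2/1000}$, which keeps the initial weights small without making them zero.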

## \[Assignment 2\] Regularization

**$L_2$ Regularization**

<br>
<font color='blue'>
    
**What you should remember:** the implications of L2-regularization on:
- The cost computation:
    - A regularization term is added to the cost.
- The backpropagation function:
    - There are extra terms in the gradients with respect to weight matrices.
- Weights end up smaller ("weight decay"): 
    - Weights are pushed to smaller values.
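
The first two points can be sketched in code. This assumes the common form of the penalty, $\frac{\lambda}{2m}\sum_l \|W^{[l]}\|_F^2$, whose gradient adds $\frac{\lambda}{m} W^{[l]}$ to each $dW^{[l]}$; `l2_penalty` and `l2_grad` are illustrative helper names, and `lambd` and `m` are arbitrary example values.

```python
import numpy as np

def l2_penalty(weights, lambd, m):
    # Term added to the cost: (lambda / 2m) * sum of squared weights
    return (lambd / (2.0 * m)) * sum(np.sum(np.square(W)) for W in weights)

def l2_grad(W, lambd, m):
    # Extra term added to dW during backpropagation
    return (lambd / m) * W

np.random.seed(1)                  # arbitrary seed, for repeatability
W1 = np.random.randn(2, 3)
W2 = np.random.randn(1, 2)
lambd, m = 0.7, 5                  # example regularization strength and batch size

penalty = l2_penalty([W1, W2], lambd, m)
extra_dW1 = l2_grad(W1, lambd, m)

# Finite-difference check of one entry of the extra gradient term
eps = 1e-6
W1_shift = W1.copy()
W1_shift[0, 0] += eps
numeric = (l2_penalty([W1_shift, W2], lambd, m) - penalty) / eps

print(penalty, extra_dW1[0, 0], numeric)
```

Because the extra gradient term is proportional to the weight itself, every update shrinks each weight slightly toward zero, which is where the name "weight decay" comes from.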


    
    
**Dropout**
    <br>
    
Critical observation: when you shut down some neurons, you actually modify the model, so at every iteration you train a new model that uses only a subset of the neurons. Dropout helps with overfitting, intuitively because it teaches every neuron to be less sensitive to any particular input, since that input might disappear on another iteration. Quoting the assignment notes:
   
    "When you shut some neurons down, you actually modify your model. The idea behind drop-out is that at each iteration, you train a different model that uses only a subset of your neurons. With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time."
    

    
Reminder: when you apply dropout to a layer, do NOT forget to scale the surviving activations back up by dividing them by keep_prob!


<br>
<font color='blue'>
    
**What you should remember about dropout:**
- Dropout is a regularization technique.
- You only use dropout during training. Don't use dropout (randomly eliminate nodes) during test time.
- Apply dropout both during forward and backward propagation.
- During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works even when keep_prob takes values other than 0.5.
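
The points above can be sketched as a pair of functions. This is an illustrative version of inverted dropout, not the assignment's implementation; `dropout_forward` and `dropout_backward` are hypothetical names, and the all-ones activations are chosen only to make the expected value easy to read.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    # Keep each unit with probability keep_prob, then rescale the survivors
    D = rng.rand(*A.shape) < keep_prob
    A_drop = A * D / keep_prob
    return A_drop, D

def dropout_backward(dA, D, keep_prob):
    # Reuse the same mask and the same scaling as in the forward pass
    return dA * D / keep_prob

rng = np.random.RandomState(0)     # arbitrary seed, for repeatability
A = np.ones((50, 2000))            # dummy activations
keep_prob = 0.8

A_drop, D = dropout_forward(A, keep_prob, rng)
dA_drop = dropout_backward(np.ones_like(A), D, keep_prob)

print(D.mean())                    # fraction of units kept, close to keep_prob
print(A.mean(), A_drop.mean())     # expected activation is roughly preserved
```

At test time neither function is called: the activations pass through unchanged, which is consistent only because the training-time division by keep_prob kept their expected value the same.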