# Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization

## Practical aspects of Deep Learning

### Regularization
- L1 regularization: $J(w^{[1]},w^{[2]},\dots) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2m}\sum_{l}\lVert w^{[l]} \rVert_1$
    - The L1 norm is the sum of the absolute values of all elements of $w$.
- L2 regularization: $J(w^{[1]},w^{[2]},\dots) = \frac{1}{m}\sum_{i=1}^{m}\mathcal{L}(\hat{y}^{(i)},y^{(i)}) + \frac{\lambda}{2m}\sum_{l}\lVert w^{[l]} \rVert^2$
    - The squared L2 norm (Euclidean norm for a vector, Frobenius norm for a matrix) is the sum of the squares of the elements of the $w$ matrix.
    - L2 regularization is also called weight decay.
    - L2 regularization relies on the assumption that a model with small weights is simpler than a model with large weights. Thus, by penalizing the squared values of the weights in the cost function, we drive all the weights to smaller values. This leads to a smoother model in which the output changes more slowly as the input changes.
    - Backpropagation with L2 regularization
        - For each gradient, the regularization term's gradient needs to be added: $\frac{d}{dW}\left(\frac{1}{2}\frac{\lambda}{m}\lVert W \rVert^2\right) = \frac{\lambda}{m}W$
- Dropout regularization
    - Dropout randomly shuts down some neurons in each iteration. The idea behind dropout is that at each iteration, you train a different model that uses only a subset of your neurons. __With dropout, your neurons thus become less sensitive to the activation of one other specific neuron, because that other neuron might be shut down at any time.__
    - Only use dropout during training. __Don't use dropout (randomly eliminate nodes) during test time__, since you do not want your predictions to be random.
    - __During training time, divide each dropout layer by keep_prob to keep the same expected value for the activations__. For example, if keep_prob is 0.5, then we will on average shut down half the nodes, so the output will be scaled by 0.5 since only the remaining half are contributing to the solution. Dividing by 0.5 is equivalent to multiplying by 2. Hence, the output now has the same expected value. You can check that this works for values of keep_prob other than 0.5.
    - __Apply dropout during both forward and backward propagation.__
    - Implementation tips:
        - Create a random matrix $D^{[l]}$ of the same dimensions as $A^{[l]}$, with entries drawn uniformly from $[0, 1)$.
        - Use $D^{[l]} = (D^{[l]} < \text{keep\_prob})$ to get a mask matrix.
        - $A^{[l]} = A^{[l]} * D^{[l]}$ (element-wise) to shut down some neurons.
        - $A^{[l]} = A^{[l]} / \text{keep\_prob}$ to ensure that the cost still has the same expected value as without dropout (this is called inverted dropout).
        - Backpropagation with dropout
            - In forward propagation some neurons were shut down by applying a mask $D^{[l]}$ to $A^{[l]}$. In backpropagation, we have to shut down the same neurons by reapplying $D^{[l]}$ to $dA^{[l]}$.
            - In forward propagation, $A^{[l]}$ is divided by keep_prob, so $dA^{[l]}$ must also be divided by the same keep_prob.
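
The dropout steps above can be sketched in NumPy as follows. This is a minimal illustration, not a full training loop: the function names `dropout_forward` and `dropout_backward` and the toy activation matrix are assumptions made for the example.

```python
import numpy as np

def dropout_forward(A, keep_prob, rng):
    """Apply inverted dropout to activations A; return the result and the mask."""
    D = rng.random(A.shape) < keep_prob   # boolean mask: True means "keep this unit"
    A_dropped = A * D / keep_prob         # zero the dropped units, rescale the rest
    return A_dropped, D

def dropout_backward(dA, D, keep_prob):
    """Reapply the same mask and the same scaling to the upstream gradient."""
    return dA * D / keep_prob

# Toy activations (hypothetical): all ones, so the effect of the mask is easy to see.
rng = np.random.default_rng(0)
A = np.ones((4, 1000))
A_dropped, D = dropout_forward(A, keep_prob=0.8, rng=rng)
dA = dropout_backward(np.ones_like(A), D, keep_prob=0.8)
```

Because the surviving activations are divided by `keep_prob`, the mean of `A_dropped` stays close to the mean of `A`, which is the point of inverted dropout.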

### Normalizing Input
- Normalizing your inputs speeds up the training process.
- Normalization steps:
     1. Get the mean of the training set: $\mu = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}$
     2. Subtract this mean from each input: $X = X - \mu$
         - This makes your input centered around 0
     3. Get the variance of the (centered) training set: $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m}(x^{(i)})^2$ (element-wise square)
     4. Normalize the variance: $X = X / \sigma$, so that each feature has unit variance
- These steps should be applied to the __training, dev, and test sets, but using the mean and variance of the training set__; you do not want to normalize your training and test sets differently.
- Why normalize?
    - If we don't normalize the inputs, the cost function will be elongated and optimization will take a long time.
    - If we normalize the inputs, our features will be on similar scales, the cost function will look more symmetric, and we can use a larger learning rate.
- However, for image datasets, it is simpler, more convenient, and works almost as well to just divide every row (each image unrolled into a row vector) by 255, the maximum value of a pixel channel.
- If you have tanh/sigmoid 
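
The normalization steps can be sketched as below. The helper names `fit_normalizer` and `normalize`, the small `eps` guard, and the synthetic data are assumptions made for illustration; the key point from the notes is that the test set reuses the training-set statistics.

```python
import numpy as np

def fit_normalizer(X_train, eps=1e-8):
    """Per-feature mean and standard deviation; X_train has shape (n_x, m)."""
    mean = X_train.mean(axis=1, keepdims=True)
    std = X_train.std(axis=1, keepdims=True)
    return mean, std + eps   # eps (an assumption) guards against constant features

def normalize(X, mean, std):
    """Center and scale X with statistics computed elsewhere (the training set)."""
    return (X - mean) / std

# Synthetic data (hypothetical): two features on very different scales.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[[5.0], [-3.0]], scale=[[10.0], [0.1]], size=(2, 500))
X_test = rng.normal(loc=[[5.0], [-3.0]], scale=[[10.0], [0.1]], size=(2, 100))

mean, std = fit_normalizer(X_train)
X_train_norm = normalize(X_train, mean, std)
X_test_norm = normalize(X_test, mean, std)   # same mean/std as the training set
```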

### Exploding/vanishing gradient

- Very deep neural networks can suffer from vanishing and exploding gradients. A partial solution is a more careful choice of the random initialization for your neural network.
- A well-chosen initialization can
    - speed up the convergence of gradient descent
    - increase the odds of gradient descent converging to a lower training (and generalization) error
- As discussed before, the weights w[l] should be initialized randomly to break symmetry and make sure different hidden units/neurons learn different things.
- However, it is OK to initialize the biases b[l] to zeros; symmetry is still broken as long as the weights are random.
- Don't initialize to values that are too large, to avoid exploding the cost.
    - ``` w[l] = np.random.randn(n[l], n[l-1]) * 0.01 ```
    - We need small values because with sigmoid (or tanh), for example, if the weights are too large you are more likely to end up with very large values of Z even at the very start of training. This causes your tanh or sigmoid activation function to saturate, thus slowing down learning. If you don't have any sigmoid or tanh activation functions in your neural network, this is less of an issue.
    - The constant 0.01 is alright for networks with 1 hidden layer. For a deep NN a different constant can be used, but it will still be a small number.
- One reasonable thing to do is to set the variance of $w_i$ to $1/n$, where $n$ is the number of inputs to the layer. In practice:
    ``` w[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(1/n[l-1])```. This is called __Xavier initialization__ and works better for tanh activations.
- If you're using a ReLU activation, the following works better:
    ``` w[l] = np.random.randn(n[l], n[l-1]) * np.sqrt(2/n[l-1])```. This is called __He initialization__.
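
As a rough sketch of the two schemes above, the helper below initializes every layer of a network from a list of layer sizes. The function name `initialize_parameters`, the `method` switch, and the example layer sizes are assumptions for illustration only.

```python
import numpy as np

def initialize_parameters(layer_dims, method="he", seed=0):
    """Return W/b for each layer; layer_dims[0] is the input size."""
    rng = np.random.default_rng(seed)
    numerator = {"he": 2.0, "xavier": 1.0}[method]   # variance is numerator / n[l-1]
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(numerator / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))      # zero biases are fine
    return params

# Hypothetical network: 1000 inputs, one hidden layer of 500 units, 1 output.
params = initialize_parameters([1000, 500, 1], method="he")
```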


# Optimization
### Mini Batch Gradient Descent
- Vectorization allows you to efficiently compute on __m__ examples, where __each column of X and Y represents a training example__.
    - $X = [x^{(1)}, x^{(2)}, \dots, x^{(m)}]$, with shape ($n_x$, m)
    - $Y = [y^{(1)}, y^{(2)}, \dots, y^{(m)}]$, with shape ($n_y$, m)
- When you take gradient steps with respect to all $m$ examples on each step, it is called __Batch Gradient Descent__.
- But when m is large, the computation is still slow. To speed up the algorithm, we use __mini-batch gradient descent__.

- To build mini-batches from the training set (X, Y):
   1. Shuffle: Create a shuffled version of the training set (X, Y). The shuffling step ensures that examples will be split randomly into different mini-batches.
    ```
    permutation = list(np.random.permutation(m))
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation].reshape((1,m))
    ```
   2. Partition: Partition the shuffled (X, Y) into mini-batches of size mini_batch_size. Note that the number of training examples is not always divisible by mini_batch_size, so the last mini-batch may be smaller. Training then loops over the mini-batches:
   ```
   initialize parameters
   for i in range(0, num_iterations):
       for t in range(0, num_mini_batches):
           forward propagate on mini-batch t
           compute cost
           backward propagate
           update parameters
   ```
   
- With mini-batch gradient descent, the cost won't go down on every step as it does with batch gradient descent. It can have some ups and downs, but generally it will go down.
- If each mini_batch has just 1 example, it's called __Stochastic Gradient Descent__
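
The shuffle and partition steps described above can be sketched as one function. The name `random_mini_batches` and the tiny toy arrays are assumptions for illustration; the last mini-batch is simply smaller when m is not divisible by the batch size.

```python
import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    """Shuffle the columns of (X, Y) together and split them into mini-batches."""
    m = X.shape[1]
    rng = np.random.default_rng(seed)
    permutation = rng.permutation(m)
    shuffled_X = X[:, permutation]
    shuffled_Y = Y[:, permutation]

    mini_batches = []
    for start in range(0, m, mini_batch_size):
        end = start + mini_batch_size   # slicing clips at m for the last batch
        mini_batches.append((shuffled_X[:, start:end], shuffled_Y[:, start:end]))
    return mini_batches

# Toy data (hypothetical): 10 examples, so a batch size of 4 leaves a final batch of 2.
X = np.arange(2 * 10).reshape(2, 10)
Y = np.arange(10).reshape(1, 10)
batches = random_mini_batches(X, Y, mini_batch_size=4)
```

Setting `mini_batch_size=1` gives stochastic gradient descent, and `mini_batch_size=m` gives batch gradient descent.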

###  Gradient descent with momentum
#### Exponentially weighted average
- $v_t = \beta v_{t-1} + (1-\beta)\theta_t$
- $v_t$ approximately averages over the last $1/(1-\beta)$ data points, because $(1-\epsilon)^{1/\epsilon} \approx \frac{1}{e}$ with $\epsilon = 1-\beta$: the weight of a past value decays to about $1/e$ after $1/(1-\beta)$ steps.
- Bias correction:
    - Initializing $v_0 = 0$ causes the moving average to start off really low for the first few steps. It takes a few iterations for the average to "build up".
    - To fix this problem, we use $v_t^{\text{corrected}} = v_t/(1-\beta^t)$. The nice thing is that $\beta^t$ approaches zero as $t$ gets large, so the correction fades away.
    - But people usually do not bother to implement bias correction, because most would rather wait for that initial period to pass, accept a slightly more biased estimate, and go from there.
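
A small sketch of the exponentially weighted average with optional bias correction follows. The function name and the constant toy sequence are assumptions; a constant input makes the startup bias easy to see, since the true average is the constant itself.

```python
def exp_weighted_average(values, beta=0.9, bias_correction=True):
    """Return the running exponentially weighted average after each value."""
    v = 0.0
    averages = []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        # Keep v itself uncorrected; only the reported value is rescaled.
        averages.append(v / (1 - beta ** t) if bias_correction else v)
    return averages

# Toy sequence (hypothetical): five readings that are all 10.0.
raw = exp_weighted_average([10.0] * 5, beta=0.9, bias_correction=False)
corrected = exp_weighted_average([10.0] * 5, beta=0.9, bias_correction=True)
```

Without correction the first average is far below 10 because it is pulled toward the zero initialization; with correction every reported value is close to 10.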

#### Gradient descent with momentum
- Because mini-batch gradient descent makes a parameter update after seeing just a subset of examples, the direction of the update has some variance, and so the path taken by mini-batch gradient descent will "oscillate" towards convergence. Using momentum can reduce these oscillations. 
- Momentum takes the past gradients into account to smooth out the update. The "direction" of the previous gradients is stored in the variable $v$. Formally, $v$ is the exponentially weighted average of the gradients from previous steps, and it is used to update your weights in place of the raw gradient. $v$ can also be thought of as the "velocity" of a ball rolling downhill, building up speed (and momentum) according to the direction of the gradient/shape of the hill.
- __The momentum algorithm almost always works faster than standard gradient descent.__
- The idea is to calculate the exponentially weighted averages for your gradient and then update your weights with the new value.
    
    ``` 
    vdw = 0, vdb = 0
    on iteration t:
        # can be mini-batch or batch gradient descent
        compute dw, db on current mini-batch
        vdw = beta * vdw + (1-beta) * dw
        vdb = beta * vdb + (1-beta) * db
        w = w - learning_rate * vdw
        b = b - learning_rate * vdb
    ```
- $\beta$ is another ```hyperparameter```. beta = 0.9 is very common and works well in most cases. Common values for $\beta$ range from 0.8 to 0.999.
- The larger the momentum $\beta$ is, the smoother the update, because more of the past gradients (``` ~ 1/(1-beta)```) are taken into account. But if $\beta$ is too big, it can also smooth out the updates too much.

- In practice, people don't bother implementing bias correction.
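
The momentum update can be sketched as a single step function. The name `momentum_step` and the quadratic toy objective are assumptions chosen so the gradient is trivial to write down; bias correction is omitted, as the notes suggest.

```python
import numpy as np

def momentum_step(w, dw, v, learning_rate=0.1, beta=0.9):
    """One gradient descent step with momentum; returns updated (w, v)."""
    v = beta * v + (1 - beta) * dw   # exponentially weighted average of gradients
    w = w - learning_rate * v        # step along the averaged direction
    return w, v

# Toy problem (hypothetical): minimize f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = momentum_step(w, dw=w, v=v)
```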


### RMSprop (Root mean square prop)
- Pseudo code:
   ```
   sdw = 0, sdb = 0
   on iteration t:
       # can be mini-batch or batch gradient descent
       compute dw, db on current mini-batch
       sdw = beta * sdw + (1 - beta) * dw^2   # squaring is element-wise
       sdb = beta * sdb + (1 - beta) * db^2   # squaring is element-wise
       w = w - learning_rate * dw / sqrt(sdw)
       b = b - learning_rate * db / sqrt(sdb)
   ```
- RMSprop makes the updates move more slowly in the direction with large oscillations (the "vertical" direction in the usual contour picture) and faster in the "horizontal" direction toward the minimum.
- Avoid dividing by zero when sdw is close to zero by adding a small value epsilon (e.g. epsilon = 10^-8) to the denominator: ``` w = w - learning_rate * dw/(sqrt(sdw) + epsilon)```
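
A minimal NumPy sketch of one RMSprop step is shown below. The function name `rmsprop_step` and the toy gradient with two very different scales are assumptions for illustration.

```python
import numpy as np

def rmsprop_step(w, dw, s, learning_rate=0.01, beta=0.9, epsilon=1e-8):
    """One RMSprop step; returns updated (w, s)."""
    s = beta * s + (1 - beta) * dw ** 2                   # element-wise square
    w = w - learning_rate * dw / (np.sqrt(s) + epsilon)   # epsilon avoids division by zero
    return w, s

# Toy gradient (hypothetical): one steep direction and one very flat direction.
w = np.array([1.0, 1.0])
s = np.zeros_like(w)
dw = np.array([100.0, 0.01])
w_new, s_new = rmsprop_step(w, dw, s)
step = w - w_new
```

Even though the two gradient components differ by four orders of magnitude, the two step sizes come out nearly equal, which is how RMSprop damps the steep direction and speeds up the flat one.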

### Adam (Adaptive moment estimation) optimization algorithm
- Adam and RMSprop are among the optimization algorithms that have worked very well with a lot of NN architectures.
- Adam optimization simply puts RMSprop and momentum together.
- How does Adam work:
    - It calculates an exponentially weighted average of past gradients, and stores it in variable $v$.
    - It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variable $s$.
    - It updates the parameters in a direction based on combining the information from the previous two steps.
- Pseudo Code:
```
vdw = 0, vdb = 0
sdw = 0, sdb = 0
on iteration t:
    compute dw, db on current mini-batch

    vdw = beta1 * vdw + (1 - beta1) * dw      # momentum
    vdb = beta1 * vdb + (1 - beta1) * db      # momentum

    sdw = beta2 * sdw + (1 - beta2) * dw^2    # RMSprop, element-wise square
    sdb = beta2 * sdb + (1 - beta2) * db^2    # RMSprop, element-wise square

    # bias correction: store in separate variables so the running averages are not overwritten
    vdw_corrected = vdw / (1 - beta1^t)
    vdb_corrected = vdb / (1 - beta1^t)
    sdw_corrected = sdw / (1 - beta2^t)
    sdb_corrected = sdb / (1 - beta2^t)

    w = w - learning_rate * vdw_corrected / (sqrt(sdw_corrected) + epsilon)
    b = b - learning_rate * vdb_corrected / (sqrt(sdb_corrected) + epsilon)
```
- Hyperparameters for Adam:
    - Learning rate: needed to be tuned.
    - beta1: parameter of the momentum - 0.9 is recommended by default.
    - beta2: parameter of the RMSprop - 0.999 is recommended by default.
    - epsilon: 10^-8 is recommended by default.
- Momentum usually helps, but with a small learning rate and a simple dataset its impact can be almost negligible. The large oscillations in the cost come from the fact that some mini-batches are more difficult than others for the optimization algorithm. Adam outperforms plain mini-batch gradient descent and momentum, converging a lot faster.
- In a high-dimensional space, it's unlikely to get stuck in a bad local optimum as long as you're training a reasonably large neural network. Plateaus, however, are a problem and can make learning pretty slow. This is where algorithms like momentum, RMSprop or Adam can speed up the rate at which you move down the plateau and then get off it.
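
One Adam step can be sketched in NumPy as follows, using the default hyperparameters listed above. The function name `adam_step` and the toy gradient are assumptions for illustration; note that the bias-corrected values are kept separate from the running averages `v` and `s`.

```python
import numpy as np

def adam_step(w, dw, v, s, t, learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam update at iteration t (t starts at 1); returns updated (w, v, s)."""
    v = beta1 * v + (1 - beta1) * dw         # momentum term
    s = beta2 * s + (1 - beta2) * dw ** 2    # RMSprop term, element-wise square
    v_corrected = v / (1 - beta1 ** t)       # bias correction, kept separate from v
    s_corrected = s / (1 - beta2 ** t)       # bias correction, kept separate from s
    w = w - learning_rate * v_corrected / (np.sqrt(s_corrected) + epsilon)
    return w, v, s

# Toy gradient (hypothetical): f(w) = 0.5 * ||w||^2, so dw = w.
w0 = np.array([1.0, -2.0])
w1, v1, s1 = adam_step(w0, dw=w0, v=np.zeros(2), s=np.zeros(2), t=1)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by roughly `learning_rate` regardless of the gradient's magnitude.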


### Learning rate decay
- Mini-batch gradient descent tends to keep oscillating around the optimum rather than settling exactly on it. By making the learning rate decay over the course of training, it ends up much closer to the optimum, because the steps near the optimum are smaller.
- ``` learning_rate = (1/(1+decay_rate * epoch_num))*learning_rate_0```
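
The decay formula above can be checked with a few lines of Python. The function name and the example values (`learning_rate_0 = 0.2`, `decay_rate = 1.0`) are assumptions for illustration.

```python
def decayed_learning_rate(learning_rate_0, decay_rate, epoch_num):
    """Learning rate after epoch_num epochs under 1 / (1 + decay_rate * epoch_num) decay."""
    return learning_rate_0 / (1 + decay_rate * epoch_num)

# Learning rate for epochs 0..3 with hypothetical settings.
schedule = [decayed_learning_rate(0.2, decay_rate=1.0, epoch_num=e) for e in range(4)]
```

With these settings the learning rate halves after the first epoch and keeps shrinking more slowly afterwards.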

# Hyperparameter tuning, Batch normalization and programming Frameworks

## Batch Normalization
- Previously we normalized the input by subtracting the mean and dividing by the standard deviation. This helped a lot with the shape of the cost function and with reaching the minimum faster.
- The question is: for any hidden layer, can we normalize ```A[l]``` so that ```w[l+1], b[l+1]``` train faster? This is what batch normalization is about.
- There is some debate in the deep learning literature about whether you should normalize the values before the activation function, Z[l], or after applying the activation function, A[l]. In practice, __normalizing Z[l] is done much more often__.
- Algorithm:
    - On layer l, given ```Z[l] = [z(1), ..., z(m)], i = 1 to m```
    - Compute ```mean = 1/m * sum(z[i])```
    - Compute ```variance = 1/m * sum((z[i] - mean)^2)```
    - Then ```z_norm[i] = (z[i] - mean)/np.sqrt(variance + epsilon)``` (epsilon is added for numerical stability in case variance = 0)
    - Then ```z_tilde[i] = gamma * z_norm[i] + beta```
        - This lets the values follow a distribution with a different mean and variance
        - __Gamma and beta are learnable parameters of the model__
        - The NN thereby learns the distribution of the outputs
- If we are using batch normalization, the parameters b[1], ..., b[L] are not needed anymore, since they are eliminated by the mean subtraction step.
- So the parameters become $W^{[l]}$, $\gamma^{[l]}$ and $\beta^{[l]}$
        

    ```
    Z[l] = w[l]A[l-1]
    Z_norm[l] = ...   # normalize Z[l] with the mini-batch mean and variance
    Z_tilde[l] = gamma[l] * Z_norm[l] + beta[l]
    ```
- Why does batch norm work?
    - Batch norm makes parameters deeper in your network, say at layer 10, more robust to changes in the parameters of earlier layers, say layer 1, since the input to the deeper layer keeps the same mean and variance, controlled by $\beta$ and $\gamma$.
    - Batch norm has a slight regularization effect
        - Each mini-batch is scaled by the mean/variance computed on just that mini-batch, which adds some noise to the values $z[l]$ within that mini-batch. So, similar to dropout, it adds some noise to each hidden layer's activations. Adding noise to the hidden units forces the downstream hidden units not to rely too much on any one hidden unit, which gives a slight regularization effect.
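
The batch norm forward computation for one layer and one mini-batch can be sketched as below. The function name `batchnorm_forward`, the shapes, and the toy values for `gamma` and `beta` are assumptions for illustration; the backward pass is not shown.

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, epsilon=1e-8):
    """Normalize Z of shape (n_units, m) over the mini-batch, then scale and shift."""
    mean = Z.mean(axis=1, keepdims=True)
    variance = Z.var(axis=1, keepdims=True)
    Z_norm = (Z - mean) / np.sqrt(variance + epsilon)
    Z_tilde = gamma * Z_norm + beta
    return Z_tilde, mean, variance

# Toy mini-batch (hypothetical): 4 hidden units, 64 examples.
rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(4, 64))
gamma = np.full((4, 1), 2.0)
beta = np.full((4, 1), 0.5)
Z_tilde, mean, variance = batchnorm_forward(Z, gamma, beta)

# Adding a constant bias to Z changes nothing: the mean subtraction removes it.
Z_tilde_with_bias, _, _ = batchnorm_forward(Z + 7.0, gamma, beta)
```

After the transform each unit has mean `beta` and standard deviation `gamma`, and the comparison with `Z + 7.0` illustrates why the bias b[l] becomes redundant.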

### Batch Normalization at test time
- When we train a NN with batch normalization, we compute the mean and variance of each mini-batch.
- When using batch normalization, the bias parameter b[l] is not needed anymore, since normalization zeroes out the mean in the hidden layer.
- At test time, we might need to process examples one at a time. The mean and variance of a single example don't make sense.
- __We have to compute an estimate of the mean and variance to use at test time.__
- __We can use an exponentially weighted average of the mean and variance across the mini-batches seen during training, and use these estimates at test time. This method is also called a "running average".__
- Pseudo Code
    ```
    for t = 1 ... num of iterations:
        compute forward prop on mini-batch X{t}
            in each hidden layer, use BN to replace Z[l] with Z_tilde[l]
        compute the cost
        use backprop to compute dw[l], dbeta[l], dgamma[l]
        update parameters
    ```
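
The running average idea can be sketched as follows: during training, each mini-batch's mean and variance update a running estimate, and at test time a single example is normalized with those estimates. The function names, the `momentum=0.9` weight, and the synthetic mini-batches are assumptions for illustration.

```python
import numpy as np

def update_running_stats(running_mean, running_var, batch_mean, batch_var, momentum=0.9):
    """Exponentially weighted average of the mini-batch statistics."""
    running_mean = momentum * running_mean + (1 - momentum) * batch_mean
    running_var = momentum * running_var + (1 - momentum) * batch_var
    return running_mean, running_var

def batchnorm_inference(z, running_mean, running_var, gamma, beta, epsilon=1e-8):
    """Normalize test-time inputs with the running estimates, not batch statistics."""
    z_norm = (z - running_mean) / np.sqrt(running_var + epsilon)
    return gamma * z_norm + beta

# Simulated training (hypothetical data): 200 mini-batches for a layer with 3 units.
rng = np.random.default_rng(0)
running_mean = np.zeros((3, 1))
running_var = np.ones((3, 1))
for _ in range(200):
    Z = rng.normal(loc=2.0, scale=3.0, size=(3, 64))
    running_mean, running_var = update_running_stats(
        running_mean, running_var,
        Z.mean(axis=1, keepdims=True), Z.var(axis=1, keepdims=True))

# Test time: a single example can now be normalized on its own.
z_single = np.full((3, 1), 2.0)
out = batchnorm_inference(z_single, running_mean, running_var, gamma=1.0, beta=0.0)
```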

## Tuning process
- Hyperparameters, roughly in order of importance:
    - Learning rate
    - Momentum beta
    - Mini-batch size
    - No. of hidden units
    - No. of layers
    - Learning rate decay
    - Regularization lambda
    - Activation functions
    - Adam beta1 & beta2
- When tuning the hyperparameters, try random values rather than a grid.
- You can use a ```coarse-to-fine sampling scheme```
    - When you find some hyperparameter values that give you better performance, zoom into a smaller region around these values and sample more densely within this space.
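
Random sampling followed by a coarse-to-fine pass can be sketched as below. Everything specific here is an assumption for illustration: the function name, the two hyperparameters chosen, their ranges, and the choice to sample the learning rate on a log scale (a common practice that the notes above do not spell out).

```python
import numpy as np

def sample_hyperparameters(rng, lr_exp_range=(-4, 0), hidden_units_range=(50, 100)):
    """Draw one random configuration (hypothetical ranges)."""
    # Assumption: learning rate sampled uniformly in log10 space.
    learning_rate = 10 ** rng.uniform(*lr_exp_range)
    hidden_units = int(rng.integers(hidden_units_range[0], hidden_units_range[1] + 1))
    return {"learning_rate": learning_rate, "hidden_units": hidden_units}

rng = np.random.default_rng(0)

# Coarse pass: random points over the whole range.
coarse = [sample_hyperparameters(rng) for _ in range(20)]

# Fine pass: suppose (hypothetically) the best results were near 1e-2,
# so zoom into a narrower range and sample more densely there.
fine = [sample_hyperparameters(rng, lr_exp_range=(-2.5, -1.5)) for _ in range(20)]
```

Each sampled configuration would then be used to train and evaluate a model; that part is omitted here.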

## Tensorflow

- Writing and running programs in TensorFlow has the following steps:
    - Create tensors (variables) that are not yet executed/evaluated
    - Write operations between those tensors
    - Initialize your tensors
    - Create a session
    - Run the session; this runs the operations you've written above.
- In TensorFlow, a placeholder is a variable whose value you assign only later. To specify values for a placeholder, pass them in with a "feed dictionary" (the ```feed_dict``` argument) when you run the session.

- The basic building blocks are ```tf.constant```, ```tf.placeholder``` and ```tf.Variable```.
- Instead of needing to write code to compute the cost function, we can use:
    ```tf.nn.sigmoid_cross_entropy_with_logits(logits= , labels =) ```
- To initialize weights in NN use:
    ```
    w1 = tf.get_variable("W1", [25, 12288], initializer=tf.contrib.layers.xavier_initializer(seed=1))
    b1 = tf.get_variable("b1", [25, 1], initializer=tf.zeros_initializer())
    ```
- For an L-layer NN, it is important to note that forward propagation stops at the last linear output Z[L] (Z3 in a 3-layer network), since TensorFlow takes the last layer's linear output as input to the function computing the loss. Therefore, you don't need A[L].
- There are two typical ways to create and use sessions in TensorFlow:
    Method 1
    ```
    sess = tf.Session()
    result = sess.run(..., feed_dict = {placeholder: value})
    sess.close()
    ```
    Method 2
    ```
    with tf.Session() as sess:
        sess.run(init)
        result = sess.run(..., feed_dict = {placeholder: value})   # the session is closed for you on exit
    ```
- To reset the graph, use ``` tf.reset_default_graph()```
- ```tf.one_hot(labels, depth, axis)``` converts class indices into one-hot vectors of length ```depth``` along the given axis.
- In TensorFlow, all the backpropagation and the parameter updates are taken care of in 1 line of code. After you compute the cost function, you create an "optimizer" object. When it is run, it performs an optimization step on the given cost with the chosen method and learning rate. For instance, for gradient descent:
 ``` optimizer = tf.train.GradientDescentOptimizer(learning_rate = learning_rate).minimize(cost)```

- To run the optimization you would do:
     ``` _, c = sess.run([optimizer, cost], feed_dict = {X:minibatch_X, Y: minibatch_Y})```
   - This computes the backpropagation by passing through the TensorFlow graph in reverse order.
- When coding, we often use _ as a "throwaway" variable to store values that we won't need to use later.

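- Differences between Tensor and Variable:
    - A Variable is mutable; a Tensor is immutable.
    - A Variable is often used to store a weight matrix, filter, etc.
    - Anywhere a Tensor is used, a Variable can be used as well.
    - A Variable needs initialization, and all operations on a Variable have to run in a session.
    - A Variable explicitly allocates memory. After a variable is defined, it must be initialized before it can be used, and all computations involving variables must be run under a session. In contrast, tensors created by operations such as constant or zeros are recorded in the graph, so they have no separate memory of their own, while other tensors that result from operations on tensors exist only while the program is running.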
    