## Tuning process

- Hyperparameters need to be tuned to get the best performance out of a model.
- Hyperparameters, listed roughly in order of importance:
    - Learning rate $\alpha$
    - Momentum $\beta$
    - Mini-batch size
    - \# of hidden units
    - \# of layers
    - Learning rate decay
    - Regularization $\lambda$
    - Activation function
    - Adam $\beta_{1}$, $\beta_{2}$ and $\epsilon$
- It's hard to say in advance which hyperparameter matters most; it depends heavily on the problem.
- One way to tune is to lay out a grid of $N$ values per hyperparameter and then try every combination of settings.
- Prefer random sampling over a grid: random points try more distinct values of each hyperparameter, which helps when you don't know in advance which ones matter.
- Use a coarse-to-fine sampling scheme: when some hyperparameter values give better performance, zoom in on a smaller region around them and sample more densely within that region.
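
Random sampling followed by a coarse-to-fine pass can be sketched as below. This is a minimal illustration, not a recipe: `toy_score` is a hypothetical stand-in for a validation score, and the ranges, the sample count of 25, and the width of the fine region are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_score(alpha, hidden_units):
    # Hypothetical stand-in for a validation score (higher is better).
    # It peaks near alpha = 0.01 and 64 hidden units.
    return -((np.log10(alpha) + 2.0) ** 2) - ((hidden_units - 64) / 64.0) ** 2

def sample(n, log_alpha_range, units_range):
    # Draw n random settings instead of walking a fixed grid.
    log_alpha = rng.uniform(*log_alpha_range, size=n)
    units = rng.integers(units_range[0], units_range[1] + 1, size=n)
    return 10.0 ** log_alpha, units

def best_of(alphas, units):
    scores = toy_score(alphas, units)
    i = int(np.argmax(scores))
    return alphas[i], units[i], scores[i]

# Coarse pass: sample the whole search space.
alphas, units = sample(25, (-4.0, 0.0), (16, 256))
coarse_alpha, coarse_units, coarse_score = best_of(alphas, units)

# Fine pass: sample more densely in a smaller region around the coarse best.
center = np.log10(coarse_alpha)
fine_alphas, fine_unit_counts = sample(
    25,
    (center - 0.5, center + 0.5),
    (max(1, coarse_units - 32), coarse_units + 32),
)
fine_alpha, fine_units, fine_score = best_of(fine_alphas, fine_unit_counts)

best_score = max(coarse_score, fine_score)
print("coarse best:", coarse_alpha, coarse_units, coarse_score)
print("fine best:  ", fine_alpha, fine_units, fine_score)
```

In a real search, `toy_score` would be replaced by training a model and measuring its performance on a dev set.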


## Using an appropriate scale to pick hyperparameters

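- Suppose a hyperparameter ranges from $a$ to $b$. When that range spans several orders of magnitude (for example, a learning rate between $0.0001$ and $1$), it's better to sample on a logarithmic scale than on a linear one, so that each order of magnitude gets a comparable share of the samples.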
<img src="screenshot/13.PNG" style="width:600px;height:350px;">
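
Sampling on a logarithmic scale can be sketched as follows. The helper name `sample_log_uniform` and the specific ranges are assumptions for illustration. The second part applies the same idea to a momentum-like $\beta$ close to $1$ by sampling $1 - \beta$ on a log scale, which is a common practice rather than something stated in the notes above.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_log_uniform(low, high, size):
    """Sample values uniformly on a log10 scale between low and high (both > 0)."""
    r = rng.uniform(np.log10(low), np.log10(high), size=size)
    return 10.0 ** r

# Learning rate between 1e-4 and 1: each decade gets a similar share of samples.
alphas = sample_log_uniform(1e-4, 1.0, size=10_000)
frac_small = np.mean(alphas < 1e-3)   # share of samples in the lowest decade

# A linear scale would put almost no samples in that lowest decade.
linear = rng.uniform(1e-4, 1.0, size=10_000)
frac_small_linear = np.mean(linear < 1e-3)

# For beta in [0.9, 0.999], sample (1 - beta) on a log scale instead.
one_minus_beta = sample_log_uniform(1e-3, 1e-1, size=10_000)
betas = 1.0 - one_minus_beta

print("log scale, fraction below 1e-3:   ", frac_small)
print("linear scale, fraction below 1e-3:", frac_small_linear)
```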

## Hyperparameter tuning in practice: Pandas vs. Caviar

- Intuitions about hyperparameter settings from one application area may or may not transfer to a different one.
- If you don't have many computational resources, you can "babysit" a single model.
- Panda approach (babysitting one model):
    - On day $0$, initialize the parameters randomly and start training
    - Watch the learning curve (the cost) gradually decrease over the day
    - Each day, nudge the hyperparameters a little during training
- Caviar approach (training many models in parallel):
    - If you have enough computational resources, you can run many models in parallel with different hyperparameter settings, and at the end of the day(s) compare the results and pick the best one.

<img src="screenshot/14.PNG" style="width:600px;height:350px;">

## Normalizing activations in a network

- One of the most important ideas in the rise of deep learning has been an algorithm called batch normalization. Batch normalization speeds up learning.

<img src="screenshot/15.PNG" style="width:600px;height:350px;">
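
The forward computation of batch normalization for one layer can be sketched in NumPy as below. This is a minimal sketch under stated assumptions: `Z` holds the pre-activations with shape `(n_units, m)` for a mini-batch of `m` examples, `gamma` and `beta` are the learnable scale and shift (this `beta` is unrelated to the momentum $\beta$), and `eps` is a small constant for numerical stability. The function name and the default value of `eps` are illustrative choices.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each unit over the mini-batch, then rescale and shift."""
    mu = Z.mean(axis=1, keepdims=True)           # per-unit mean over the mini-batch
    var = Z.var(axis=1, keepdims=True)           # per-unit variance over the mini-batch
    Z_norm = (Z - mu) / np.sqrt(var + eps)       # zero mean, unit variance
    Z_tilde = gamma * Z_norm + beta              # learnable scale and shift
    return Z_tilde, mu, var

rng = np.random.default_rng(0)
Z = rng.normal(loc=3.0, scale=5.0, size=(4, 64))  # 4 units, mini-batch of 64
gamma = np.ones((4, 1))
beta = np.zeros((4, 1))

Z_tilde, mu, var = batch_norm_forward(Z, gamma, beta)
print("mean per unit after batch norm:    ", Z_tilde.mean(axis=1))
print("variance per unit after batch norm:", Z_tilde.var(axis=1))
```

With `gamma = 1` and `beta = 0` the output has zero mean and unit variance per unit; other values let the network choose a different mean and variance for each unit.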

## Fitting batch normalization into a neural network

- Using batch norm in a NN with $3$ hidden layers
<img src="screenshot/16.PNG" style="width:600px;height:350px;">
- Working with mini-batches
<img src="screenshot/17.PNG" style="width:600px;height:350px;">
- Implementing gradient descent
<img src="screenshot/18.PNG" style="width:600px;height:350px;">

## Why does batch normalization work?

- The first reason is the same as the reason we normalize the input $X$: values on a similar scale make optimization faster
- The second reason is that batch normalization reduces the problem of the input values to a layer shifting as the earlier layers are updated (covariate shift)
- Batch normalization also acts as a mild regularizer:
    - Each mini-batch is scaled by the mean/variance computed on just that mini-batch
    - This adds some noise to the values $Z^{[l]}$ within that mini-batch, so, similar to dropout, it adds some noise to each hidden layer's activations
    - This has a slight regularization effect
    - A larger mini-batch size reduces this noise and therefore the regularization effect
    - Don't rely on batch normalization for regularization. It's intended to normalize the hidden units' activations and thereby speed up learning. For regularization, use other techniques.
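
The link between mini-batch size and noise can be illustrated with a small simulation. This is only a sketch: `population` is a hypothetical stand-in for the values one hidden unit takes over the whole training set, and the batch sizes are arbitrary. It measures how much the mini-batch mean fluctuates from batch to batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for one unit's z values over the whole training set.
population = rng.normal(loc=0.0, scale=1.0, size=100_000)

def batch_mean_spread(batch_size, n_batches=2000):
    # Draw many mini-batches and measure how much their means vary.
    batches = rng.choice(population, size=(n_batches, batch_size))
    return batches.mean(axis=1).std()

spread_64 = batch_mean_spread(64)
spread_512 = batch_mean_spread(512)

print("std of mini-batch mean, batch size 64: ", spread_64)
print("std of mini-batch mean, batch size 512:", spread_512)
```

The larger mini-batch gives a steadier estimate of the mean, so less noise is injected into the normalized values.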


## Batch normalization at test time

- When we train a NN with batch normalization, we compute the mean and the variance over each mini-batch
- At test time we might need to process examples one at a time. The mean and the variance of a single example don't make sense.
- So we need separate estimates of the mean and variance to use at test time
- We can use an exponentially weighted average of the means and variances seen across the mini-batches during training
- This method is also sometimes called a "running average"
- At test time, these estimates replace the mini-batch statistics
- In practice you will most often use a deep learning framework, which provides a default implementation of this
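
Keeping running averages during training and using them at test time can be sketched as below. This is a minimal sketch, not how any particular framework implements it: the function names, the `running` dictionary, the `momentum` value of `0.9`, and the synthetic data are all assumptions for illustration.

```python
import numpy as np

def batch_norm_train_step(Z, gamma, beta, running, momentum=0.9, eps=1e-8):
    """Normalize with mini-batch statistics and update the running averages."""
    mu = Z.mean(axis=1, keepdims=True)
    var = Z.var(axis=1, keepdims=True)
    # Exponentially weighted averages across mini-batches.
    running["mu"] = momentum * running["mu"] + (1 - momentum) * mu
    running["var"] = momentum * running["var"] + (1 - momentum) * var
    return gamma * (Z - mu) / np.sqrt(var + eps) + beta

def batch_norm_test(z, gamma, beta, running, eps=1e-8):
    """Normalize with the running estimates; works for a single example."""
    return gamma * (z - running["mu"]) / np.sqrt(running["var"] + eps) + beta

rng = np.random.default_rng(0)
n_units = 3
gamma, beta = np.ones((n_units, 1)), np.zeros((n_units, 1))
running = {"mu": np.zeros((n_units, 1)), "var": np.ones((n_units, 1))}

# "Training": synthetic mini-batches with true mean 2 and variance 9.
for _ in range(200):
    Z = rng.normal(loc=2.0, scale=3.0, size=(n_units, 64))
    batch_norm_train_step(Z, gamma, beta, running)

# "Test": one example at a time, using the running estimates.
z_single = rng.normal(loc=2.0, scale=3.0, size=(n_units, 1))
out = batch_norm_test(z_single, gamma, beta, running)

print("running mean:", running["mu"].ravel())
print("running var: ", running["var"].ravel())
```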

## Softmax Regression

- Softmax layer
<img src="screenshot/19.PNG" style="width:600px;height:350px;">
- Loss function
<img src="screenshot/20.PNG" style="width:600px;height:350px;">
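
A softmax layer and its cross-entropy loss can be sketched in NumPy as below. The shapes are an assumption: `Z` is `(n_classes, m)` with one column per example, and `Y` is one-hot with the same shape. Subtracting the column maximum before exponentiating and adding `eps` inside the logarithm are numerical-stability choices added here; they do not change the result in exact arithmetic beyond the tiny `eps`.

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax for Z of shape (n_classes, m)."""
    shifted = Z - Z.max(axis=0, keepdims=True)   # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

def cross_entropy(Y, A, eps=1e-12):
    """Mean cross-entropy loss; Y is one-hot, A holds predicted probabilities."""
    return float(-np.mean(np.sum(Y * np.log(A + eps), axis=0)))

Z = np.array([[5.0], [2.0], [-1.0], [3.0]])      # 4 classes, 1 example
A = softmax(Z)
Y = np.array([[1.0], [0.0], [0.0], [0.0]])       # true class is class 0
loss = cross_entropy(Y, A)

print("probabilities:", A.ravel())
print("loss:", loss)
```

The loss only depends on the probability assigned to the true class, so minimizing it pushes that probability toward $1$.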

## Deep learning framework

- How to choose a deep learning framework
    - Ease of programming (development and deployment)
    - Running speed
    - Truly open (open source with good governance)
- Programming frameworks can not only shorten your coding time but sometimes also perform optimizations that speed up your code

- TensorFlow
- Example 1
    
```python
# Written against the TensorFlow 1.x graph API; the compat.v1 import lets it run on TensorFlow 2.x as well.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

w = tf.Variable(0, dtype=tf.float32)                              # the parameter to optimize, initialized to 0
cost = tf.add(tf.add(w ** 2, tf.multiply(-10.0, w)), 25.0)        # equivalent to: cost = w**2 - 10*w + 25
train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)    # one gradient descent step, learning rate 0.01
init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
session.run(w)    # evaluates w; the value is 0.0 at this point
session.run(train)

print("W after one iteration:", session.run(w))
for i in range(1000):
    session.run(train)
print("W after 1000 iterations:", session.run(w))

```

- Example 2




```python
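import numpy as np
# Written against the TensorFlow 1.x graph API; the compat.v1 import lets it run on TensorFlow 2.x as well.
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

coefficients = np.array([[1.], [-10.], [25.]])         # coefficients of the quadratic cost

x = tf.placeholder(tf.float32, [3, 1])                 # placeholder fed with the coefficients at run time
w = tf.Variable(0, dtype=tf.float32)                   # the parameter to optimize, initialized to 0
cost = x[0][0]*w**2 + x[1][0]*w + x[2][0]

train = tf.train.GradientDescentOptimizer(0.01).minimize(cost)

init = tf.global_variables_initializer()
session = tf.Session()
session.run(init)
session.run(w)                                         # evaluates w; the value is 0.0 at this point
session.run(train, feed_dict={x: coefficients})        # one gradient descent step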

print("W after one iteration:", session.run(w))

for i in range(1000):
    session.run(train, feed_dict = {x: coefficients})

print("W after 1000 iterations:", session.run(w))

```
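
For comparison, the same minimization can be written without a framework by computing the gradient by hand. This plain-Python sketch is not part of the TensorFlow examples; it only illustrates what the optimizer does for the cost $w^2 - 10w + 25$, whose derivative is $2w - 10$ and whose minimum is at $w = 5$. The function names are illustrative.

```python
def cost(w):
    return w ** 2 - 10.0 * w + 25.0

def grad(w):
    # Derivative of the cost with respect to w, worked out by hand.
    return 2.0 * w - 10.0

learning_rate = 0.01
w = 0.0

# One gradient descent step.
w = w - learning_rate * grad(w)
w_after_one = w
print("w after one iteration:", w_after_one)

# 1000 more steps.
for _ in range(1000):
    w = w - learning_rate * grad(w)
print("w after 1000 more iterations:", w)
print("cost at the final w:", cost(w))
```

A framework saves you from deriving `grad` yourself: you only define the cost, and the gradient is computed automatically.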