# Neural Networks - Practical Advice<a id="Top"></a>

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
Table of Contents
<ul>
<li>1. <a href="#Part_1">Vanishing/Exploding Gradients</a></li>
<li>2. <a href="#Part_2">Reusing Pretrained Layers</a></li>
<li>3. <a href="#Part_3">Faster Optimizers</a></li>
<li>4. <a href="#Part_4">Avoiding Overfitting Through Regularization</a></li>    
</ul>
</font>
</div>

## 1. Vanishing/Exploding Gradients<a id="Part_1"></a>

From the previous chapter, we know that training a network amounts to determining its weights and biases so that the loss is minimal. Backpropagation lets us work out the gradients from the output layer all the way back to the input layer, so every layer can be trained: we now have a way of updating the network's trainable parameters, i.e. weights and biases.

Unfortunately, gradients often get smaller and smaller as the algorithm progresses down to the lower layers. As a result, the gradient descent update leaves the lower layers virtually unchanged, and training never converges to a good solution. This is called the __vanishing gradient__ problem. In some cases, the opposite happens: the gradients get bigger and bigger as we walk down the layers, creating the so-called __exploding gradients__ problem.

For the logistic activation, the problem can be understood as follows. When the input is large in absolute value, the logistic function saturates at either 0 or 1, with a derivative extremely close to zero. Thus when backpropagation kicks in, there is essentially no gradient to propagate back through the network.
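
To make the saturation argument concrete, here is a minimal sketch that evaluates the logistic function and its derivative $\sigma'(z) = \sigma(z)(1-\sigma(z))$ at a few inputs. The function names are illustrative, and the last lines ignore the weights to isolate the effect of the activation alone.

```python
import math

def logistic(z):
    """Logistic (sigmoid) function."""
    return 1.0 / (1.0 + math.exp(-z))

def logistic_grad(z):
    """Derivative of the logistic function: sigma(z) * (1 - sigma(z))."""
    s = logistic(z)
    return s * (1.0 - s)

# The derivative peaks at 0.25 for z = 0 and collapses for large |z|.
for z in (0.0, 2.0, 5.0, 10.0):
    print(f"z = {z:5.1f}  sigma = {logistic(z):.5f}  derivative = {logistic_grad(z):.2e}")

# Backpropagation multiplies in one such factor per layer. Even in the best
# case (derivative = 0.25 everywhere) ten layers scale the gradient by 0.25**10.
best_case_factor = logistic_grad(0.0) ** 10
print(f"best-case factor after 10 layers: {best_case_factor:.2e}")
```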

### Xavier and He Initialization

The vanishing/exploding gradient problem has to do with how the network is initialized and which activation function is used. The former issue can be tackled with Xavier and He initialization, which prescribes that the connection weights be initialized randomly as follows, using either a uniform distribution over $[-r, r]$ or a normal distribution with mean 0 and standard deviation $\sigma$:

| Activation Function | Uniform Distribution | Normal Distribution |
|---------------------|----------------------|---------------------|
| Logistic | $r = \sqrt{\displaystyle\frac{6}{n_i + n_o}}$  | $\sigma= \sqrt{\displaystyle\frac{2}{n_i + n_o}}$  | 
| $\tanh$  | $r = 4\sqrt{\displaystyle\frac{6}{n_i + n_o}}$ | $\sigma= 4\sqrt{\displaystyle\frac{2}{n_i + n_o}}$ | 
| ReLU     | $r = \sqrt{\displaystyle\frac{12}{n_i + n_o}}$ | $\sigma= \sqrt{\displaystyle\frac{4}{n_i + n_o}}$  | 

where $n_i$ and $n_o$ denote the number of inputs and outputs of the layer whose weights are being initialized. The strategy for the ReLU function is sometimes called __He Initialization__.
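
The table can be turned into a small helper. The sketch below is only an illustration: the function names and the layer sizes (300 inputs, 100 outputs) are assumptions, and the scale factors are read directly off the table above.

```python
import math
import random

def xavier_he_params(n_in, n_out, activation="logistic"):
    """Return (r, sigma) from the table for the given activation."""
    scale = {"logistic": 1.0, "tanh": 4.0, "relu": math.sqrt(2.0)}[activation]
    r = scale * math.sqrt(6.0 / (n_in + n_out))      # uniform on [-r, r]
    sigma = scale * math.sqrt(2.0 / (n_in + n_out))  # normal with mean 0
    return r, sigma

def init_weights_uniform(n_in, n_out, activation="logistic", seed=0):
    """Sample an n_in x n_out weight matrix uniformly from [-r, r]."""
    rng = random.Random(seed)
    r, _ = xavier_he_params(n_in, n_out, activation)
    return [[rng.uniform(-r, r) for _ in range(n_out)] for _ in range(n_in)]

# Hypothetical layer with 300 inputs and 100 outputs, ReLU activation.
r, sigma = xavier_he_params(300, 100, "relu")
W = init_weights_uniform(300, 100, "relu")
print(f"r = {r:.4f}, sigma = {sigma:.4f}")
```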

### Nonsaturating Activation Functions

As mentioned previously, the vanishing/exploding gradient problem also has to do with the activation function. ReLU does not saturate for positive inputs, but it suffers from a problem of its own known as __dying ReLU__: if, during training, a neuron's weights are updated such that the weighted sum of its inputs is negative for every instance in the training set, the neuron stops outputting anything other than zero. When this happens, the neuron is unlikely to come back to life, since the gradient of ReLU is zero whenever its input is negative.

To solve the problem, one could adopt __leaky ReLU__ or __ELU__.

$$
  \mbox{Leaky ReLU}(z) = \max(\alpha z, z)
$$
where $\alpha$ is a hyperparameter. ELU is defined as follows

$$ 
        \mbox{ELU}(z) = \left\{
           \begin{array}{lc}
               \alpha(e^z-1 ) & \mbox{if}\,\, z < 0,\\
                           z  & \mbox{if}\,\, z \geq 0.
            \end{array}
        \right.
$$
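
Both definitions translate directly into code. In this sketch the default values `alpha=0.01` for leaky ReLU and `alpha=1.0` for ELU are assumptions chosen for illustration, not values given in the text.

```python
import math

def leaky_relu(z, alpha=0.01):
    """max(alpha * z, z); alpha=0.01 is an assumed default."""
    return max(alpha * z, z)

def elu(z, alpha=1.0):
    """alpha * (exp(z) - 1) for z < 0, z otherwise; alpha=1.0 is assumed."""
    return alpha * (math.exp(z) - 1.0) if z < 0 else z

# Columns: z, plain ReLU, leaky ReLU, ELU.
for z in (-2.0, -0.5, 0.0, 1.5):
    print(z, max(0.0, z), leaky_relu(z), round(elu(z), 4))
```

Unlike plain ReLU, both functions return a nonzero value (and have a nonzero slope) for negative inputs, which is what lets a neuron recover.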

### Batch Normalization

He initialization and ELU activation combined can reduce the vanishing/exploding gradient problem significantly at the beginning of training, but there is no guarantee the problem will not resurface later. Batch normalization, proposed in 2015, has been shown to considerably improve the convergence of deep neural network training.

The idea is to add an operation in the model just before the activation function of each layer: zero-center and normalize the inputs, then scale and shift the result using two additional parameters per layer:

$$ 
   \begin{align}
     \mu_B       &= \frac{1}{m_B}\sum_{i=1}^{m_B}\, \mathbf{x}^{(i)} \\
     \sigma_B^2  &= \frac{1}{m_B}\sum_{i=1}^{m_B}\, \left( \mathbf{x}^{(i)} - \mu_B \right)^2 \\
     \hat{\mathbf{x}}^{(i)} &= \frac{\mathbf{x}^{(i)} - \mu_B}{\sqrt{\sigma_B^2+\epsilon}} \\
     \mathbf{z}^{(i)} &= \gamma\,\hat{\mathbf{x}}^{(i)} + \beta
   \end{align}
$$

Here
- $\mu_B$ is the empirical mean, computed over the whole mini-batch $B$.
- $\sigma_B$ is the empirical standard deviation, also computed over the whole mini-batch $B$.
- $m_B$ is the number of instances in the mini-batch.
- $\hat{\mathbf{x}}^{(i)}$ is the zero-centered and normalized input.
- $\gamma$ is the scaling parameter for the layer under consideration.
- $\beta$ is the shifting parameter for the layer.
- $\epsilon$ is a tiny number (typically set to $10^{-3}$) to avoid division by zero.
- $\mathbf{z}^{(i)}$ is the output of the batch normalization operation.

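At test time there is no mini-batch to compute statistics from, so one uses the mean and standard deviation of the whole training set instead. In total, each batch-normalized layer therefore has four additional parameters (not hyperparameters): scale $\gamma$ and offset $\beta$, which are learned, and mean $\mu$ and standard deviation $\sigma$, which are estimated from the data.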
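
As a minimal sketch of the four steps, the function below batch-normalizes a single scalar feature across a mini-batch. The function name and the values of `gamma` and `beta` are illustrative assumptions; a real layer would apply the same arithmetic to every feature.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Batch-normalize one scalar feature across a mini-batch.

    Returns the outputs z together with the batch mean and variance.
    """
    m_B = len(batch)
    mu_B = sum(batch) / m_B
    var_B = sum((x - mu_B) ** 2 for x in batch) / m_B
    # Zero-center and normalize, then scale and shift.
    x_hat = [(x - mu_B) / math.sqrt(var_B + eps) for x in batch]
    z = [gamma * xh + beta for xh in x_hat]
    return z, mu_B, var_B

batch = [2.0, 4.0, 6.0, 8.0]
z, mu_B, var_B = batch_norm(batch, gamma=2.0, beta=0.5)
print(mu_B, var_B)
print(z)
```

Whatever the scale of the inputs, the outputs have mean `beta` and a standard deviation close to `gamma`.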

### Gradient Clipping

Another popular technique to alleviate the exploding gradients problem is to simply clip the gradients during backpropagation so that they never exceed some threshold.
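
A minimal sketch of clipping by value is shown below; the function name and the threshold of 1.0 are assumptions for illustration.

```python
def clip_gradients(grads, threshold=1.0):
    """Clip each gradient component into [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in grads]

grads = [0.3, -4.2, 12.0, -0.7]
clipped = clip_gradients(grads, threshold=1.0)
print(clipped)  # [0.3, -1.0, 1.0, -0.7]
```

Note that clipping component by component can change the direction of the gradient vector; rescaling the whole vector by its norm is a common alternative when the direction matters.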

## 2. Reusing Pretrained Layers<a id="Part_2"></a>
<a href="#Top">Back to page top</a>

### Transfer learning in action

In training neural networks, the goal is to determine, for each layer, the bias $b$ and weights $w$ of each neuron. These variables are often collectively called *trainable variables*. There are cases where a layer holds more than just biases and weights. For example, when applying batch normalization, each layer has four additional variables: scale $\gamma$ and offset $\beta$, which are trained, and mean $\mu$ and standard deviation $\sigma$, which are estimated from the data.

So in this sense, transfer learning means reusing the trainable variables of a pretrained network. One way of doing this is simply to reload the saved model and continue training it on the new task. However, we often want to reuse only part of the saved model, say the trainable variables of the lower layers. How do we retrieve these variables in TensorFlow? TensorFlow provides the `tf.get_collection()` function for this task:

```python
reuse_vars = tf.get_collection(key, scope=None)
```

Here `key` is the name of the collection we wish to retrieve, and `scope` is an optional regular expression string (matched with `re.match`) used to filter the results by name. Here is an application example from the Hands-On book:

```python
reuse_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='hidden[123]')
```

The key `tf.GraphKeys.TRAINABLE_VARIABLES` belongs to the `tf.GraphKeys` class, which holds the standard collection names. `hidden[123]` is a regular expression that matches any variable whose name scope starts with `hidden1`, `hidden2`, or `hidden3`. Namely, we are getting the variables of hidden layers 1 to 3.

With `tf.get_collection()` we have retrieved the desired variables. The next step is to use `tf.train.Saver()` to load their values from the pre-trained model. In particular, we need to feed `tf.train.Saver()` a dictionary that specifies the variables to be loaded. We can now piece together the above code segments to write a program that reuses the first three hidden layers of a saved model:
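```python
import tensorflow as tf  # TensorFlow 1.x API

# [... build new model with the same definition as before for hidden layers 1-3 ...]

init = tf.global_variables_initializer()

# Trainable variables of hidden layers 1-3 in the new graph
reuse_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='hidden[123]')
# Map each variable's name in the checkpoint to the variable to restore into
reuse_var_dict = {var.op.name: var for var in reuse_vars}
original_saver = tf.train.Saver(reuse_var_dict)  # restores layers 1-3 only
new_saver = tf.train.Saver()                     # saves the whole new model

with tf.Session() as sess:
    sess.run(init)
    original_saver.restore(sess, '/path/to_the_saved_model.ckpt')
    # [... train the model ...]
    new_saver.save(sess, '/path/to_the_new_model.ckpt')
```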
The code does the following:
1. Build the new model, making sure to copy the saved model's hidden layers 1-3.
2. Initialize variables.
3. Get the trainable variables of hidden layers 1-3 using `tf.get_collection()`.
4. Create a dictionary that maps the name of each variable in the saved model to the corresponding variable in the new model.
5. Create a saver that restores only these variables from the saved model.
6. Create a separate saver to save the new model.
7. Train the new model.
8. Save the new model.

This process assumes that __both the saved and new models are performing the same or similar task__.

### Freezing and caching lower layers

There are times when the reused layers have already learned to detect low-level features of the dataset. In this case, it is possible to exclude them from training. These are called _frozen layers_. In TensorFlow, we can use the following piece of code to achieve this:
```python
train_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='hidden[34]|outputs')
training_op = optimizer.minimize(loss, var_list=train_vars)
```
In this example, we just pass a list of variables to be trained (retrieved by calling `tf.get_collection()`) to the optimizer. That's it!

Now that layers 1 and 2 are frozen, we can run the entire dataset through them once and store the outputs of layer 2, then use these cached outputs as inputs when training layers 3 and 4 and the output layer. This is possible because layer 2's outputs are fixed (remember that layers 1 and 2 will not be trained). Doing so saves a lot of training time, provided we have enough memory to store the outputs.
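
The caching idea can be sketched without any framework. In the toy example below, `frozen_layers` is a hypothetical stand-in for hidden layers 1 and 2, and the dataset and batch size are made up for illustration.

```python
def frozen_layers(x):
    """Stand-in for the frozen hidden layers 1 and 2 (hypothetical)."""
    return [max(0.0, 2.0 * v - 1.0) for v in x]

dataset = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]

# One pass through the frozen layers. Their weights never change,
# so these outputs stay valid for every training epoch.
cached_h2 = [frozen_layers(x) for x in dataset]

def training_batches(cache, batch_size):
    """Yield mini-batches of cached layer-2 outputs for the upper layers."""
    for start in range(0, len(cache), batch_size):
        yield cache[start:start + batch_size]

for epoch in range(2):
    for h2_batch in training_batches(cached_h2, batch_size=2):
        pass  # feed h2_batch to hidden3, hidden4 and the output layer here
```

The frozen layers are evaluated once per instance instead of once per instance per epoch, which is where the time saving comes from.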

### Practical advice
- Normally, the output layer of the saved model should be replaced, because it is most likely not useful for the new task and may not even have the right number of outputs.
- The upper hidden layers of the saved model should often be replaced as well. This is because upper layers are trained to recognize higher-level features, which are task-dependent.

To find the right number of layers to use:
1. Freeze all copied layers first, then train the model and evaluate its performance.
2. Gradually unfreeze these layers and monitor the model's performance change. __The more training data you have, the more layers you can unfreeze__.
3. If there is no improvement, try dropping the top hidden layers and freezing all remaining hidden layers from the saved model again.

Iterate these steps until you find the right number of layers to reuse.

### Getting pretrained models: Model Zoo
1. TensorFlow (<a href="https://www.tensorflow.org/about/uses">Link</a>). E.g. VGG, Inception, ResNet for image classification.
2. Caffe Model Zoo (<a href="http://caffe.berkeleyvision.org/model_zoo.html">Link</a>). E.g. LeNet, AlexNet, VGGNet, etc. for image classification.

## 3. Faster Optimizers<a id="Part_3"></a>
<a href="#Top">Back to page top</a>

### Momentum optimization
Recall the Gradient Descent update:

$$ \theta' = \theta - \eta \nabla_\theta J(\theta) $$

The learning rate $\eta$ is fixed and the update does not care about what the earlier gradients were. If the local gradient is small, then the update goes very slowly.

Momentum optimization does exactly the opposite: it cares greatly about the previous gradients. At each iteration, it adds the local gradient to the momentum vector $\mathbf{m}$, and updates the weights by subtracting the momentum vector:

$$ 
   \begin{align}
     \mathbf{m}' &= \beta\,\mathbf{m} + \eta \nabla_\theta J(\theta) \\
     \theta' &= \theta - \mathbf{m}' 
   \end{align}
$$

If the gradient remains constant, one can show that the size of the weight updates approaches a maximum (the terminal velocity) equal to that gradient multiplied by $\eta / (1-\beta)$. Let $G$ denote the constant gradient and $\mathbf{m}^{(0)}$ the initial value of the momentum vector. Unrolling the update formula gives:

$$  
  \begin{align}
    \mathbf{m}^{(1)} &= \beta\,\mathbf{m}^{(0)} + \eta G\\
    \mathbf{m}^{(2)} &= \beta\,\mathbf{m}^{(1)} + \eta G
                      = \beta^2\,\mathbf{m}^{(0)} + \beta\eta G + \eta G\\
    \mathbf{m}^{(3)} &= \beta\,\mathbf{m}^{(2)} + \eta G
                      = \beta^3\,\mathbf{m}^{(0)} + \beta^2\eta G + \beta\eta G + \eta G\\
    \vdots \\
    \mathbf{m}^{(k)} &= \beta^k\,\mathbf{m}^{(0)} + 
                        \left( 
                            \beta^{k-1} + \ldots + \beta^2 + \beta + 1
                        \right) \eta G
  \end{align}
$$

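Since $0 \le \beta < 1$, the term $\beta^k\,\mathbf{m}^{(0)}$ vanishes as $k \to \infty$ and the geometric series in $\beta$ converges to $1/(1-\beta)$, so $\mathbf{m}^{(k)} \to \eta G/(1-\beta)$. For example, with $\beta = 0.9$ the terminal update is 10 times larger than the plain gradient descent step $\eta G$.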
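
The limit is easy to check numerically. The sketch below iterates the momentum update with a constant scalar gradient; the values of `G`, `eta` and `beta` are arbitrary choices for illustration.

```python
def momentum_after(k, grad, eta=0.1, beta=0.9, m0=0.0):
    """Apply m <- beta * m + eta * grad for k steps with a constant gradient."""
    m = m0
    for _ in range(k):
        m = beta * m + eta * grad
    return m

G, eta, beta = 2.0, 0.1, 0.9
limit = eta * G / (1.0 - beta)   # predicted terminal step size
for k in (1, 10, 50, 200):
    print(k, momentum_after(k, G, eta, beta))
print("predicted limit:", limit)
```

The first step equals the plain gradient descent step `eta * G`, and later steps grow toward the predicted limit.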

### Nesterov Accelerated Gradient
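This is essentially momentum optimization, with the only difference being that the gradient is now measured not at the current position $\theta$ but slightly ahead in the direction of the momentum. Since the update subtracts the momentum vector, that look-ahead point is $\theta - \beta\mathbf{m}$:

$$ 
   \begin{align}
     \mathbf{m}' &= \beta\,\mathbf{m} + \eta \nabla_\theta J(\theta-\beta\mathbf{m}) \\
     \theta' &= \theta - \mathbf{m}' 
   \end{align}
$$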

This small modification has been shown to lead to faster optimization than plain momentum optimization and to help reduce oscillations near the minimum.
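
Here is a minimal sketch comparing the two updates on a toy one-dimensional loss $J(\theta) = \theta^2/2$, which is an assumption chosen for illustration. Because the update subtracts $\mathbf{m}$, the look-ahead point used below is $\theta - \beta\mathbf{m}$.

```python
def grad_J(theta):
    """Gradient of the toy loss J(theta) = theta**2 / 2 (an assumption)."""
    return theta

def momentum_step(theta, m, eta=0.1, beta=0.9):
    """Plain momentum: gradient measured at the current position."""
    m = beta * m + eta * grad_J(theta)
    return theta - m, m

def nesterov_step(theta, m, eta=0.1, beta=0.9):
    """Nesterov: gradient measured at the look-ahead point theta - beta * m."""
    m = beta * m + eta * grad_J(theta - beta * m)
    return theta - m, m

def run(step, n_steps=200, theta0=5.0):
    """Run an optimizer step function from theta0 and return the final theta."""
    theta, m = theta0, 0.0
    for _ in range(n_steps):
        theta, m = step(theta, m)
    return theta

print("momentum:", run(momentum_step))
print("nesterov:", run(nesterov_step))
```

On this toy problem both runs end near the minimum at 0; printing `theta` at every step is a simple way to see how much each variant oscillates on the way there.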

### AdaGrad

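AdaGrad (short for _adaptive gradient_) uses an __adaptive learning rate__: each parameter's step is scaled down by the accumulated squares of its past gradients. The algorithm updates the weights as follows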

$$ 
   \begin{align}
          s_i' &= s_i + \left( \frac{\partial}{\partial \theta_i} J(\theta) \right)^2 \\
     \theta_i' &= \theta_i - \frac{\eta}{\sqrt{s_i'+\epsilon}} \frac{\partial}{\partial \theta_i} J(\theta)
   \end{align}
$$

So the algorithm scales down the learning rate, and does so faster for steep dimensions than for dimensions with gentler slopes. Hence the name __adaptive learning rate__. According to the Hands-On book, the technique works well for simple quadratic problems, but often stops too early when training neural networks because the learning rate gets scaled down too much. In general, AdaGrad should be avoided for training deep neural networks.
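
The following sketch applies the update to a toy loss with one steep and one gentle dimension. The loss, the learning rate and the function name are assumptions for illustration.

```python
import math

def adagrad_step(theta, s, grads, eta=0.1, eps=1e-10):
    """One AdaGrad update on lists of parameters, accumulators and gradients."""
    s_new = [s_i + g * g for s_i, g in zip(s, grads)]
    theta_new = [t - eta * g / math.sqrt(s_i + eps)
                 for t, s_i, g in zip(theta, s_new, grads)]
    return theta_new, s_new

# Toy loss J = 0.5 * (100 * theta_0**2 + theta_1**2): dimension 0 is steep.
theta, s = [1.0, 1.0], [0.0, 0.0]
for step in range(3):
    grads = [100.0 * theta[0], theta[1]]
    theta, s = adagrad_step(theta, s, grads)
    effective_lr = [0.1 / math.sqrt(s_i + 1e-10) for s_i in s]
    print(step, theta, effective_lr)
```

The effective learning rate of the steep dimension drops much faster than that of the gentle one, and neither can ever increase again because `s` only grows.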

### RMSProp

RMSProp was proposed by Tieleman and Hinton to fix AdaGrad's tendency to slow down too fast and stop too early. The algorithm goes as follows

$$ 
   \begin{align}
          s_i' &= \beta s_i + (1-\beta)\left( \frac{\partial}{\partial \theta_i} J(\theta) \right)^2 \\
     \theta_i' &= \theta_i - \frac{\eta}{\sqrt{s_i'+\epsilon}} \frac{\partial}{\partial \theta_i} J(\theta)
   \end{align}
$$

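So the update of the $\mathbf{s}$ vector now uses exponential decay, which means it accumulates only the gradients from the most recent iterations rather than all gradients since the beginning of training. The decay rate $\beta$ is normally set to 0.9. The algorithm turns out to work much better than AdaGrad. It also generally performs better than momentum optimization and Nesterov accelerated gradient.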
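
A minimal sketch of the update is given below, followed by a comparison of the two accumulators under a constant gradient. The default `eta=0.01`, the value of `eps` and the function name are assumptions for illustration.

```python
import math

def rmsprop_step(theta, s, grads, eta=0.01, beta=0.9, eps=1e-10):
    """One RMSProp update on lists of parameters, accumulators and gradients."""
    s_new = [beta * s_i + (1.0 - beta) * g * g for s_i, g in zip(s, grads)]
    theta_new = [t - eta * g / math.sqrt(s_i + eps)
                 for t, s_i, g in zip(theta, s_new, grads)]
    return theta_new, s_new

# With a constant gradient g, AdaGrad's accumulator grows without bound,
# while RMSProp's settles near g**2, so its step size does not decay to zero.
g = 2.0
s_adagrad, s_rmsprop = 0.0, 0.0
for _ in range(100):
    s_adagrad += g * g
    s_rmsprop = 0.9 * s_rmsprop + 0.1 * g * g
print(s_adagrad, s_rmsprop)
```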

### Adam Optimization

Adam stands for __adaptive moment estimation__. The method combines the ideas of momentum optimization and RMSProp:

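$$ 
   \begin{align}
          m_i' &= \beta_1 m_i + (1-\beta_1) \frac{\partial}{\partial \theta_i} J(\theta) \\
          s_i' &= \beta_2 s_i + (1-\beta_2)\left( \frac{\partial}{\partial \theta_i} J(\theta) \right)^2 \\
     \hat{m}_i &= \frac{m_i'}{1-\beta_1^T}\\
     \hat{s}_i &= \frac{s_i'}{1-\beta_2^T}\\
     \theta_i' &= \theta_i - \frac{\eta\,\hat{m}_i}{\sqrt{\hat{s}_i+\epsilon}}
   \end{align}
$$

Here $T$ is the iteration number (starting at 1). The third and fourth steps correct the bias of $\mathbf{m}$ and $\mathbf{s}$ toward zero at the start of training, since both are initialized at 0.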

Therefore, the method keeps track of

1. the exponentially decaying average of past gradients.
2. the exponentially decaying average of past squared gradients.

Because Adam is an adaptive learning rate technique, it requires less tuning of the learning rate hyperparameter $\eta$. The default $\eta = 0.001$ usually works well.
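
The sketch below implements one Adam update per call, with the bias correction written out. The values `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are commonly used defaults and are assumptions here, as is the toy loss.

```python
import math

def adam_step(theta, m, s, grads, T, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; T is the iteration number, starting at 1."""
    m = [beta1 * m_i + (1.0 - beta1) * g for m_i, g in zip(m, grads)]
    s = [beta2 * s_i + (1.0 - beta2) * g * g for s_i, g in zip(s, grads)]
    # Bias correction: m and s start at 0 and are biased toward 0 early on.
    m_hat = [m_i / (1.0 - beta1 ** T) for m_i in m]
    s_hat = [s_i / (1.0 - beta2 ** T) for s_i in s]
    theta = [t - eta * mh / math.sqrt(sh + eps)
             for t, mh, sh in zip(theta, m_hat, s_hat)]
    return theta, m, s

theta, m, s = [1.0, -3.0], [0.0, 0.0], [0.0, 0.0]
for T in range(1, 4):
    grads = [2.0 * t for t in theta]   # gradient of the toy loss sum(theta_i**2)
    theta, m, s = adam_step(theta, m, s, grads, T)
    print(T, theta)
```

Thanks to the bias correction, the very first step has magnitude close to `eta` for every parameter, regardless of the size of its gradient.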

### Learning Rate Scheduling
Four methods to tune the learning rate on the fly in order to speed up convergence:
1. __Predetermined piecewise constant learning rate__
    Set the learning rate every $N$ steps by hand. Works well, but requires fiddling around.

2. __Performance scheduling__
    Measure the validation error at every $N$ steps and reduce the learning rate by a factor of $\lambda$ when the error stops dropping.

3. __Exponential scheduling__
    Schedule $\eta$ as follows:

    $$\eta(t) = \eta_0\,10^{-t/r}$$

    where $\eta_0$ and $r$ are two new hyperparameters.

4. __Power scheduling__
    Use a power law to schedule $\eta$:

    $$ \eta(t) = \eta_0(1+t/r)^{-c}$$

    where $c$ is an additional hyperparameter.

Since AdaGrad, RMSProp, and Adam optimization automatically reduce the effective learning rate during training, no additional learning rate schedule is necessary for these algorithms.
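
The closed-form schedules are one-liners. In the sketch below, the function names and all default values (`eta0`, `r`, `c`, the boundaries and the per-interval rates) are assumptions chosen for illustration.

```python
def exponential_schedule(t, eta0=0.1, r=1000):
    """eta(t) = eta0 * 10**(-t / r): drops by a factor of 10 every r steps."""
    return eta0 * 10.0 ** (-t / r)

def power_schedule(t, eta0=0.1, r=1000, c=1.0):
    """eta(t) = eta0 * (1 + t / r)**(-c)."""
    return eta0 * (1.0 + t / r) ** (-c)

def piecewise_constant_schedule(t, boundaries=(1000, 3000),
                                values=(0.1, 0.01, 0.001)):
    """Hand-picked learning rate for each interval between boundaries."""
    for boundary, value in zip(boundaries, values):
        if t < boundary:
            return value
    return values[-1]

for t in (0, 1000, 2000, 5000):
    print(t, exponential_schedule(t), power_schedule(t),
          piecewise_constant_schedule(t))
```

With the same $\eta_0$ and $r$, the power schedule decays much more slowly than the exponential one.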

## 4. Avoiding Overfitting Through Regularization<a id="Part_4"></a>
<a href="#Top">Back to page top</a>

### Early Stopping

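Evaluate the model on a validation set at regular intervals, and save a "winner" snapshot whenever it outperforms the previous "winner" snapshots. Interrupt training once no new winner has appeared for a set number of steps, then restore the last winner snapshot.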
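
The bookkeeping can be sketched on a plain list of validation errors. The `patience` parameter (how many evaluations without improvement to tolerate before interrupting training) and the error values are assumptions for illustration.

```python
def early_stopping(val_errors, patience=3):
    """Return (best_step, best_error, steps_run) for a stream of validation errors."""
    best_error = float("inf")
    best_step = None
    checks_since_best = 0
    steps_run = 0
    for step, err in enumerate(val_errors):
        steps_run = step + 1
        if err < best_error:
            best_error, best_step = err, step   # new "winner" snapshot
            checks_since_best = 0
        else:
            checks_since_best += 1
            if checks_since_best >= patience:
                break                           # interrupt training
    return best_step, best_error, steps_run

errors = [0.9, 0.7, 0.6, 0.62, 0.61, 0.65, 0.5, 0.4]
print(early_stopping(errors, patience=3))  # (2, 0.6, 6)
```

In this example training stops after the sixth evaluation and never sees the later improvement to 0.4, which shows why `patience` should not be set too low.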

### $\ell_1$ and $\ell_2$ Regularization

Use $\ell_1$ and $\ell_2$ regularization to constrain a neural network's connection weights (but typically not its biases).
In TensorFlow, many functions that create variables, such as `get_variable()` or `fully_connected()`, accept a `*_regularizer` argument for each created variable, e.g. `weights_regularizer`. One can pass any function that takes weights as an argument and returns the corresponding regularization loss. An example given in the Hands-On book goes as follows:
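```python
import tensorflow as tf  # TensorFlow 1.x API
from tensorflow.contrib.framework import arg_scope
from tensorflow.contrib.layers import fully_connected

# X, n_hidden1, n_hidden2, n_outputs and base_loss are assumed to be defined.
# Every fully_connected layer in this scope gets an l1 weight regularizer.
with arg_scope(
        [fully_connected],
        weights_regularizer=tf.contrib.layers.l1_regularizer(scale=0.1)):
    hidden1 = fully_connected(X, n_hidden1, scope='hidden1')
    hidden2 = fully_connected(hidden1, n_hidden2, scope='hidden2')
    logits = fully_connected(hidden2, n_outputs, activation_fn=None, scope='out')

# Collect the per-layer regularization losses and add them to the base loss.
reg_loss = tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES)
loss = tf.add_n([base_loss] + reg_loss, name='loss')
```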
The `with` block creates two hidden layers and one output layer, along with nodes in the graph that compute the $\ell_1$ regularization loss for each layer's weights. TensorFlow adds these nodes to a special collection containing all the regularization losses. We need to add these losses to the overall loss, which is what the last two lines of the code do.
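
To see what these regularization losses compute, here is a framework-free sketch. The weight matrices, the `base_loss` value and the function names are made up for illustration.

```python
def l1_penalty(weights, scale):
    """scale * sum of absolute values of all weights."""
    return scale * sum(abs(w) for row in weights for w in row)

def l2_penalty(weights, scale):
    """scale * sum of squared weights (conventions differ by a factor of 1/2)."""
    return scale * sum(w * w for row in weights for w in row)

W1 = [[0.5, -1.0], [2.0, 0.0]]   # hypothetical weights of one layer
W2 = [[-0.25], [0.75]]           # hypothetical weights of the next layer
base_loss = 1.2                  # hypothetical data loss

# One regularization loss per layer, added to the base loss.
reg_losses = [l1_penalty(W, scale=0.1) for W in (W1, W2)]
loss = base_loss + sum(reg_losses)
print(reg_losses, loss)
```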

### Dropout

At every training step, every neuron has a probability $p$ of being temporarily dropped out from the network, meaning it will be entirely ignored during this training step, but it might be activated again in the next step. The hyperparameter $p$ is called __dropout rate__. After training, neurons don't get dropped anymore. 

<img src="./images/fig_dropout.png" width='400'>

In TensorFlow, use the `dropout()` function from `tf.contrib.layers`. When its `is_training` argument is `False`, it turns off the dropout operation, which is what we want during model evaluation.

Rules of thumb:
- If the model overfits, increase the dropout rate, i.e. reduce the `keep_prob` parameter.
- Conversely, if the model underfits, decrease the dropout rate.
- Reduce the dropout rate for small layers.
- Increase the dropout rate for large layers that have lots of neurons.

Note that __dropout tends to significantly slow down convergence__, but it usually results in a much better model when tuned properly.
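
The mechanism itself fits in a few lines. The sketch below is an assumption-laden illustration rather than TensorFlow's implementation: it scales the surviving activations by `1 / keep_prob` during training (often called inverted dropout) so that their expected value is unchanged.

```python
import random

def dropout(activations, p, training, rng=random):
    """Drop each activation with probability p during training.

    Survivors are divided by the keep probability (1 - p) so the expected
    value is unchanged; this scaling choice is an assumption of this sketch.
    """
    if not training:
        return list(activations)        # no dropout after training
    keep_prob = 1.0 - p
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]

rng = random.Random(42)
h = [1.0, 2.0, 3.0, 4.0]
print(dropout(h, p=0.5, training=True, rng=rng))
print(dropout(h, p=0.5, training=False))
```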

### Max-Norm Regularization

As the name suggests, max-norm regularization constrains the weights $\mathbf{w}$ of the incoming connections of each neuron such that $\|\mathbf{w}\|_2 \leq r$, where $r$ is the max-norm hyperparameter and $\|\cdot\|_2$ denotes the $\ell_2$ norm. This is typically implemented by computing $\|\mathbf{w}\|_2$ after each training step and, if it exceeds $r$, rescaling the weights:

$$
  \mathbf{w}' = \mathbf{w}\cdot\frac{r}{\|\mathbf{w}\|_2}
$$

Reducing $r$ increases the amount of regularization and helps reduce overfitting.
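
A minimal sketch of the clipping step for one neuron's incoming weight vector is shown below; the function name and the example vectors are assumptions for illustration.

```python
import math

def max_norm_clip(w, r):
    """Rescale w so that its l2 norm is at most r; leave it alone otherwise."""
    norm = math.sqrt(sum(x * x for x in w))
    if norm <= r:
        return list(w)
    return [x * r / norm for x in w]

print(max_norm_clip([3.0, 4.0], r=1.0))   # norm 5 exceeds r: rescaled to norm 1
print(max_norm_clip([0.3, 0.4], r=1.0))   # norm 0.5 is within r: unchanged
```

In practice this would be applied to every neuron's weight vector after each training step.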

### Data Augmentation

The method consists of generating new training instances from existing ones, artificially boosting the size of the training set. This reduces overfitting, making it a regularization technique.
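- The modifications used to create new instances must be learnable.
- The generated instances should be realistic: ideally, a human should not be able to tell which instances were generated and which were not.
- Simply adding white noise will not help, because white noise is not learnable.
- It is often preferable to generate new instances on the fly during training rather than store them, which saves memory.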

For tasks that involve images, typical ways of generating new instances include: transposing, rotating, resizing, flipping, and cropping, as well as adjusting the brightness, contrast, saturation, and hue.
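
As a rough sketch of on-the-fly augmentation, the generator below flips and shifts tiny images represented as lists of pixel rows. The function names, the 50% flip probability and the one-pixel shift range are assumptions for illustration; a real pipeline would use an image library.

```python
import random

def flip_horizontal(image):
    """Mirror an image given as a list of pixel rows."""
    return [row[::-1] for row in image]

def shift_right(image, pixels, fill=0):
    """Shift every row to the right, padding the left edge with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]

def augmented_stream(images, n, seed=0):
    """Yield n augmented images on the fly instead of storing them."""
    rng = random.Random(seed)
    for _ in range(n):
        image = rng.choice(images)
        if rng.random() < 0.5:
            image = flip_horizontal(image)
        yield shift_right(image, rng.randint(0, 1))

image = [[1, 2, 3],
         [4, 5, 6]]
print(flip_horizontal(image))   # [[3, 2, 1], [6, 5, 4]]
print(shift_right(image, 1))    # [[0, 1, 2], [0, 4, 5]]
for augmented in augmented_stream([image], n=3):
    print(augmented)
```

Because the generator produces each instance only when it is requested, the augmented training set never has to fit in memory.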