# Understanding the vanishing gradient problem through visualization

Deep neural networks can work very well, yet few people get promising results simply by making a network *deep*. Two things made deep models practical:

* Computational power and data have grown tremendously. More complex models need more data and faster computers to be feasible.
* Researchers came to recognize and understand the difficulties of training a deep model.

In this tutorial, we share some insights into techniques that researchers have found useful for training deep models, using MXNet and its visualization tool, TensorBoard.

Let’s recap some of the relevant issues on training a deep model:

* Weight initialization. If you initialize the network with small random weights and look at the gradients as they propagate down from the top layer, you will find that they get smaller and smaller. The first layer barely changes because its gradients are too small to produce a significant update. If the first layers cannot learn effectively, it is impossible to train a good deep model.
* Nonlinear activation. With `sigmoid` or `tanh` as the activation function, the gradient shrinks layer by layer in the same way. Recall the formulas for the parameter update and the gradient: each layer multiplies the backpropagated gradient by the derivative of its activation, which is at most 0.25 for `sigmoid`.
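Both effects can be reproduced without any framework. The following is a minimal NumPy sketch, not the tutorial's training code: it builds a randomly initialized MLP with the same layer widths used later in this tutorial, feeds it made-up inputs and labels, and backpropagates a softmax cross-entropy loss by hand. The function name `first_layer_grad_rms`, the fake data, and the batch size of 128 are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def first_layer_grad_rms(acti, scale=0.1, seed=0,
                         sizes=(784, 512, 256, 128, 64, 32, 16, 10)):
    """RMS of dLoss/dW for the first layer of a randomly initialized MLP."""
    rng = np.random.default_rng(seed)
    batch = 128
    x = rng.random((batch, sizes[0]))            # made-up inputs in [0, 1)
    labels = rng.integers(0, sizes[-1], batch)   # made-up class labels
    weights = [rng.uniform(-scale, scale, (n_in, n_out))
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]

    # Forward pass, remembering each hidden layer's activation derivative.
    derivs = []
    h = x
    for w in weights[:-1]:
        z = h @ w
        if acti == "sigmoid":
            h = sigmoid(z)
            derivs.append(h * (1.0 - h))             # sigmoid'(z) <= 0.25
        else:
            h = np.maximum(z, 0.0)
            derivs.append((z > 0).astype(z.dtype))   # relu'(z) is 0 or 1
    logits = h @ weights[-1]

    # Gradient of the mean softmax cross-entropy with respect to the logits.
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(batch), labels] -= 1.0
    delta = p / batch

    # Backward pass: every hidden layer multiplies by W^T and the derivative.
    for w, d in zip(reversed(weights[1:]), reversed(derivs)):
        delta = (delta @ w.T) * d
    grad_w0 = x.T @ delta                        # gradient of the first weight
    return np.sqrt(np.mean(grad_w0 ** 2))


for acti in ("sigmoid", "relu"):
    print(acti, first_layer_grad_rms(acti))
```

With the same small uniform weights, the `sigmoid` network should report a first-layer gradient that is orders of magnitude smaller than the `relu` network, because each sigmoid layer scales the backpropagated signal by a factor of at most 0.25.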

## Experiment Setting

Here we create a simple MLP for the MNIST dataset and visualize the learning process through the loss/accuracy and the gradient distributions while changing the initialization and activation settings.

## General Setting 

We adopt an MLP as our model and run our experiments on the MNIST dataset. We then visualize the weights and gradients of a layer using `Monitor` in MXNet and `Histogram` in TensorBoard.

### Network Structure

Here's the network structure:

```python
import mxnet as mx


def get_mlp(acti="relu"):
    """
    multi-layer perceptron with a configurable activation type
    """
    data = mx.symbol.Variable('data')
    fc   = mx.symbol.FullyConnected(data = data, name='fc', num_hidden=512)
    act  = mx.symbol.Activation(data = fc, name='act', act_type=acti)
    fc0  = mx.symbol.FullyConnected(data = act, name='fc0', num_hidden=256)
    act0 = mx.symbol.Activation(data = fc0, name='act0', act_type=acti)
    fc1  = mx.symbol.FullyConnected(data = act0, name='fc1', num_hidden=128)
    act1 = mx.symbol.Activation(data = fc1, name='act1', act_type=acti)
    fc2  = mx.symbol.FullyConnected(data = act1, name = 'fc2', num_hidden = 64)
    act2 = mx.symbol.Activation(data = fc2, name='act2', act_type=acti)
    fc3  = mx.symbol.FullyConnected(data = act2, name='fc3', num_hidden=32)
    act3 = mx.symbol.Activation(data = fc3, name='act3', act_type=acti)
    fc4  = mx.symbol.FullyConnected(data = act3, name='fc4', num_hidden=16)
    act4 = mx.symbol.Activation(data = fc4, name='act4', act_type=acti)
    fc5  = mx.symbol.FullyConnected(data = act4, name='fc5', num_hidden=10)
    mlp  = mx.symbol.SoftmaxOutput(data = fc5, name = 'softmax')
    return mlp
```

As you might have noticed, we intentionally add more layers than usual, because the vanishing gradient problem becomes more severe as the network gets deeper.


### Weight Initialization

We compare two weight initializations, `uniform` and `xavier`.

```python
if args.init == 'uniform':
    init = mx.init.Uniform(0.1)
elif args.init == 'xavier':
    init = mx.init.Xavier(factor_type="in", magnitude=2.34)
```

Note that we intentionally choose a near-zero scale for `uniform`.
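To get a feel for the two settings, the sketch below compares the sampling bound of `Uniform(0.1)` with the bound Xavier would use for each layer of the MLP above. It assumes that MXNet's `Xavier` with its default uniform distribution and `factor_type="in"` samples from `[-b, b]` with `b = sqrt(magnitude / fan_in)`; check the initializer documentation for your MXNet version before relying on this. The helper `xavier_bound` is hypothetical and written only for this illustration.

```python
import numpy as np

# (fan_in, fan_out) of each FullyConnected layer in get_mlp for 784-pixel inputs.
layer_shapes = [(784, 512), (512, 256), (256, 128), (128, 64),
                (64, 32), (32, 16), (16, 10)]


def xavier_bound(fan_in, magnitude=2.34):
    """Uniform bound for Xavier with factor_type="in" (assumed formula)."""
    return np.sqrt(magnitude / fan_in)


uniform_bound = 0.1  # mx.init.Uniform(0.1) samples from [-0.1, 0.1]
for fan_in, fan_out in layer_shapes:
    print(f"fan_in={fan_in:4d}  uniform=+/-{uniform_bound:.3f}  "
          f"xavier=+/-{xavier_bound(fan_in):.3f}")
```

Under that assumption, Xavier adapts the bound to each layer's width, whereas `Uniform(0.1)` uses the same bound everywhere.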

### Activation Function

We compare two activation functions, `sigmoid` and `relu`.

```python
# acti = sigmoid or relu.
act  = mx.symbol.Activation(data = fc, name='act', act_type=acti)
```

## Logging with TensorBoard and Monitor

To monitor the weights and gradients of this network under different settings, we can use MXNet's `Monitor` for logging and TensorBoard for visualization.

### Usage

Here's a code snippet from `train_model.py`:

```python
import mxnet as mx
import numpy as np
from tensorboard import summary
from tensorboard import FileWriter

# where to keep your TensorBoard logging files
logdir = './logs/'
summary_writer = FileWriter(logdir)

# callback for mx.mon.Monitor
def get_gradient(g):
    # flatten the gradient into a 1-D array
    grad = g.asnumpy().flatten()
    # log it to TensorBoard as a histogram
    s = summary.histogram('fc_backward_weight', grad)
    summary_writer.add_summary(s)
    # statistic printed by Monitor: the RMS of the gradient entries
    return mx.nd.norm(g)/np.sqrt(g.size)

# monitor the weight gradient of the first fully-connected layer, once per epoch
mon = mx.mon.Monitor(int(args.num_examples/args.batch_size), get_gradient, pattern='fc_backward_weight')

# training
model.fit(
        X                  = train,
        eval_data          = val,
        eval_metric        = eval_metrics,
        kvstore            = kv,
        monitor            = mon,
        epoch_end_callback = checkpoint)

# close summary_writer
summary_writer.close()
```
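The value returned by `get_gradient` is the L2 norm of the gradient divided by the square root of its number of entries, which is the root mean square (RMS) of the entries. This is the number that appears next to `fc_backward_weight` in the training log. A small NumPy check, using made-up gradient values:

```python
import numpy as np

g = np.array([[3e-7, -4e-7], [0.0, 12e-7]])   # made-up gradient values

stat = np.linalg.norm(g) / np.sqrt(g.size)    # same formula as the callback
rms = np.sqrt(np.mean(g ** 2))                # root mean square of the entries

print(stat, rms)
```

Dividing by `sqrt(g.size)` makes the statistic comparable across layers of different sizes, since it no longer grows with the number of weights.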

```python
import sys
sys.path.append('./mnist/')
from train_mnist import *
```

## What to expect?

If a setting suffers from the vanishing gradient problem, the gradients passed down from the top should be very close to zero, and the weights of the network barely change.

### Uniform and Sigmoid

```python
# Uniform and sigmoid
args = parse_args('uniform', 'uniform_sigmoid')
data_shape = (784, )
net = get_mlp("sigmoid")

# train
train_model.fit(args, net, get_iterator(data_shape))
```

As the log below shows, the `fc_backward_weight` metric is very close to zero, and it barely changes from batch to batch.

```
2017-01-07 15:44:38,845 Node[0] Batch:       1 fc_backward_weight             5.1907e-07	
2017-01-07 15:44:38,846 Node[0] Batch:       1 fc_backward_weight             4.2085e-07	
2017-01-07 15:44:38,847 Node[0] Batch:       1 fc_backward_weight             4.31894e-07	
2017-01-07 15:44:38,848 Node[0] Batch:       1 fc_backward_weight             5.80652e-07

2017-01-07 15:45:50,199 Node[0] Batch:    4213 fc_backward_weight             5.49988e-07	
2017-01-07 15:45:50,200 Node[0] Batch:    4213 fc_backward_weight             5.89305e-07	
2017-01-07 15:45:50,201 Node[0] Batch:    4213 fc_backward_weight             3.71941e-07	
2017-01-07 15:45:50,202 Node[0] Batch:    4213 fc_backward_weight             8.05085e-07
```

You might wonder why there are 4 different `fc_backward_weight` values per batch: it is because we train on 4 CPUs.

### Uniform and ReLU

```python
# Uniform and relu
args = parse_args('uniform', 'uniform_relu')
data_shape = (784, )
net = get_mlp("relu")

# train
train_model.fit(args, net, get_iterator(data_shape))
```

```
2017-03-11 11:04:11,110 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus=None, init='uniform', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, name='uniform_relu', network='mlp', num_epochs=10, num_examples=60000, save_model_prefix=None)
2017-03-11 11:04:12,619 Node[0] Start training with [cpu(0), cpu(1), cpu(2), cpu(3)]
2017-03-11 11:04:14,613 Node[0] Batch:       1 fc_backward_weight             0.00025978
2017-03-11 11:04:14,614 Node[0] Batch:       1 fc_backward_weight             0.000253863
2017-03-11 11:04:14,615 Node[0] Batch:       1 fc_backward_weight             0.000261572
2017-03-11 11:04:14,615 Node[0] Batch:       1 fc_backward_weight             0.000264203
2017-03-11 11:05:05,561 Node[0] Epoch[0] Resetting Data Iterator
2017-03-11 11:05:05,588 Node[0] Epoch[0] Time cost=52.952
2017-03-11 11:05:09,414 Node[0] Epoch[0] Validation-accuracy=0.647736
```

Even with a "poor" initialization, the model can still converge quickly given a proper activation function, and the gradient magnitudes differ significantly from the sigmoid case.

```
2017-01-07 15:54:12,286 Node[0] Batch:       1 fc_backward_weight             0.000267409	
2017-01-07 15:54:12,287 Node[0] Batch:       1 fc_backward_weight             0.00031988	
2017-01-07 15:54:12,288 Node[0] Batch:       1 fc_backward_weight             0.000306785	
2017-01-07 15:54:12,289 Node[0] Batch:       1 fc_backward_weight             0.000347533

2017-01-07 15:55:25,936 Node[0] Batch:    4213 fc_backward_weight             0.0226081	
2017-01-07 15:55:25,937 Node[0] Batch:    4213 fc_backward_weight             0.0039793	
2017-01-07 15:55:25,937 Node[0] Batch:    4213 fc_backward_weight             0.0306151	
2017-01-07 15:55:25,938 Node[0] Batch:    4213 fc_backward_weight             0.00818676
```
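To compare runs without reading the raw log by eye, the Monitor lines can be averaged per batch. The sketch below is one way to do that, assuming the log format shown above; the regular expression and the helper `mean_per_batch` are hypothetical and written for this illustration. The sample lines are the first-batch entries copied from the two logs in this tutorial.

```python
import re
from collections import defaultdict

# Matches lines such as "... Batch:       1 fc_backward_weight   5.1907e-07"
LINE = re.compile(r"Batch:\s+(\d+)\s+(\S+)\s+([0-9.eE+-]+)")


def mean_per_batch(log_text, name="fc_backward_weight"):
    """Average the monitored statistic over devices for each batch number."""
    values = defaultdict(list)
    for match in LINE.finditer(log_text):
        batch, stat, value = match.groups()
        if stat == name:
            values[int(batch)].append(float(value))
    return {batch: sum(v) / len(v) for batch, v in values.items()}


sigmoid_log = """
2017-01-07 15:44:38,845 Node[0] Batch:       1 fc_backward_weight             5.1907e-07
2017-01-07 15:44:38,846 Node[0] Batch:       1 fc_backward_weight             4.2085e-07
2017-01-07 15:44:38,847 Node[0] Batch:       1 fc_backward_weight             4.31894e-07
2017-01-07 15:44:38,848 Node[0] Batch:       1 fc_backward_weight             5.80652e-07
"""

relu_log = """
2017-01-07 15:54:12,286 Node[0] Batch:       1 fc_backward_weight             0.000267409
2017-01-07 15:54:12,287 Node[0] Batch:       1 fc_backward_weight             0.00031988
2017-01-07 15:54:12,288 Node[0] Batch:       1 fc_backward_weight             0.000306785
2017-01-07 15:54:12,289 Node[0] Batch:       1 fc_backward_weight             0.000347533
"""

sigmoid_mean = mean_per_batch(sigmoid_log)[1]
relu_mean = mean_per_batch(relu_log)[1]
print("sigmoid:", sigmoid_mean)
print("relu:   ", relu_mean)
print("ratio:  ", relu_mean / sigmoid_mean)
```

On these lines, the ReLU gradients at the first batch are several hundred times larger than the sigmoid ones, which matches the difference in how quickly the two settings learn.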

### Xavier and Sigmoid  

```python
# Xavier and sigmoid
args = parse_args('xavier', 'xavier_sigmoid')
data_shape = (784, )
net = get_mlp("sigmoid")

# train
train_model.fit(args, net, get_iterator(data_shape))
```

```
2017-01-07 15:59:10,021 Node[0] start with arguments Namespace(batch_size=128, data_dir='mnist/', gpus=None, init='xavier', kv_store='local', load_epoch=None, lr=0.1, lr_factor=1, lr_factor_epoch=1, model_prefix=None, name='xavier_sigmoid', network='mlp', num_epochs=10, num_examples=60000, save_model_prefix=None)
2017-01-07 15:59:13,299 Node[0] Start training with [cpu(0), cpu(1), cpu(2), cpu(3)]
2017-01-07 15:59:15,909 Node[0] Batch:       1 fc_backward_weight             9.27798e-06
2017-01-07 15:59:15,909 Node[0] Batch:       1 fc_backward_weight             8.58008e-06
2017-01-07 15:59:15,910 Node[0] Batch:       1 fc_backward_weight             8.96261e-06
2017-01-07 15:59:15,911 Node[0] Batch:       1 fc_backward_weight             7.33611e-06
2017-01-07 15:59:20,779 Node[0] Epoch[0] Resetting Data Iterator
2017-01-07 15:59:20,780 Node[0] Epoch[0] Time cost=7.433
2017-01-07 15:59:21,086 Node[0] Epoch[0] Validation-accuracy=0.105769
```

## Visualization

Now start TensorBoard and point it at the log directory:

```bash
tensorboard --logdir=logs/
```

![Dashboard](https://github.com/zihaolucky/tensorboard/raw/data/docs/tutorial/mnist/pic1.png)

![dist](https://github.com/zihaolucky/tensorboard/raw/data/docs/tutorial/mnist/pic2.png)

![hist](https://github.com/zihaolucky/tensorboard/raw/data/docs/tutorial/mnist/pic3.png)



## References

You might find these materials useful:

[1] [Rohan #4: The vanishing gradient problem – A Year of Artificial Intelligence](https://ayearofai.com/rohan-4-the-vanishing-gradient-problem-ec68f76ffb9b#.bojpejg3o)    
[2] [On the difficulty of training recurrent and deep neural networks - YouTube](https://www.youtube.com/watch?v=A7poQbTrhxc)    
[3] [What is the vanishing gradient problem? - Quora](https://www.quora.com/What-is-the-vanishing-gradient-problem)