# Lecture 6: Going Deeper

An important issue in training deep architectures is controlling the amplitude of the gradient, which is tightly related to controlling the activations.
In particular, we must ensure that:
* the gradient does not vanish
* the gradient amplitude is homogeneous, so that all parts of the network train at the same rate
* the gradient does not vary too unpredictably when the weights change

An additional issue when training large architectures is the computational cost, which is usually the main practical problem.


## Rectifiers

* ReLU: its derivative does not vanish for positive inputs, hence it trains better than the *tanh* function. It is also much cheaper to compute.
![relu_vs_tanh](img/lecture6/relu_vs_tanh.png)
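As a small illustration of the point above, the following sketch compares the derivatives of tanh and ReLU at inputs far from zero. The input values are chosen arbitrarily for the example.

```python
import torch

# Inputs far from zero, where tanh saturates
x = torch.tensor([-6.0, -3.0, 3.0, 6.0], requires_grad=True)

# Derivative of tanh: 1 - tanh(x)^2, which is tiny in the saturated regime
torch.tanh(x).sum().backward()
grad_tanh = x.grad.clone()

# Derivative of ReLU: 0 for negative inputs, exactly 1 for positive inputs
x.grad.zero_()
torch.relu(x).sum().backward()
grad_relu = x.grad.clone()

print(grad_tanh)
print(grad_relu)
```

The tanh derivative is below 0.01 at every one of these points, while the ReLU derivative stays at 1 wherever the unit is active.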

However, ReLU can lead to dead neurons: a unit whose input stays negative outputs zero, receives zero gradient, and may never be activated again. **Leaky ReLU** addresses this with a small non-zero slope for negative inputs:
![leaky_relu](img/lecture6/leaky_relu.png)
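A minimal sketch of the difference, using `torch.nn.functional`. The slope `0.01` is PyTorch's default for `leaky_relu` and is passed explicitly here for clarity; the input values are arbitrary.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 1.0], requires_grad=True)

# ReLU: no gradient flows through units with negative input
F.relu(x).sum().backward()
grad_relu = x.grad.clone()

# Leaky ReLU: negative inputs still receive a small gradient
x.grad.zero_()
F.leaky_relu(x, negative_slope=0.01).sum().backward()
grad_leaky = x.grad.clone()

print(grad_relu)
print(grad_leaky)
```

With ReLU the two negative inputs get a zero gradient, whereas with Leaky ReLU they get a gradient equal to the negative slope.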


* **Maxout layer**: takes max of several linear units. It can encode ReLU, absolute value, or approximate any convex function.

* **Concatenated Rectified Linear Unit (CReLU)**: doubles the number of activations but keeps the norm of signal intact during forward and backward passes.
![crelu](img/lecture6/crelu.png)
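The two layers above can be sketched in a few lines. `Maxout` and `crelu` are hypothetical helpers written for this illustration, not PyTorch built-ins, and the sizes are arbitrary.

```python
import torch
from torch import nn


class Maxout(nn.Module):
    """Max over k linear units per output (hypothetical helper)."""

    def __init__(self, in_features, out_features, k):
        super().__init__()
        self.out_features, self.k = out_features, k
        # One linear map producing k candidate values per output unit
        self.linear = nn.Linear(in_features, out_features * k)

    def forward(self, x):
        z = self.linear(x).view(*x.shape[:-1], self.out_features, self.k)
        return z.max(dim=-1).values


def crelu(x, dim=1):
    """Concatenate relu(x) and relu(-x) along dim (hypothetical helper)."""
    return torch.cat([torch.relu(x), torch.relu(-x)], dim=dim)


x = torch.randn(4, 8)
m = Maxout(8, 3, k=2)
out = m(x)
c = crelu(x)

print(out.shape)  # 3 outputs per sample
print(c.shape)    # twice as many activations as x
# relu(x) and relu(-x) have disjoint supports, so the L2 norm is unchanged
print(torch.allclose(c.norm(dim=1), x.norm(dim=1)))
```

The last line checks the norm-preservation property of CReLU in the forward pass.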


## Dropout

Dropout is a regularization technique.
It consists of removing units at random during the forward pass on each sample, and putting them all back at test time.

This method increases independence between units and distributes the representation. It generally improves performance.

> In a standard neural network, the derivative received by each parameter tells it how it should change so the final loss function is reduced, given what all other units are doing. Therefore, units may change in a way that they fix up the mistakes of the other units. This may lead to complex co-adaptations. This in turn leads to overfitting because these co-adaptations do not generalize to unseen data. We hypothesize that for each hidden unit, dropout prevents co-adaptation by making the presence of other hidden units unreliable. Therefore, a hidden unit cannot rely on other specific units to correct its mistakes. It must perform well in a wide variety of different contexts provided by the other hidden units.
(Srivastava et al., 2014)   

![dropout](img/lecture6/dropout_network.png)
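The mechanism can be sketched by hand. This is a minimal illustration, assuming the "inverted" variant in which surviving units are rescaled by `1 / (1 - p)` during training so that nothing needs to change at test time. `dropout_manual` is a hypothetical helper name.

```python
import torch


def dropout_manual(x, p=0.5, training=True):
    """Zero each unit with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return x
    # Keep each unit independently with probability 1 - p
    mask = (torch.rand_like(x) >= p).to(x.dtype)
    return x * mask / (1.0 - p)


torch.manual_seed(0)
x = torch.ones(1000, 100)
y = dropout_manual(x, p=0.75)

print(y.unique())       # only 0 and 1 / (1 - p) = 4 occur
print(y.mean().item())  # close to 1: the expected activation is preserved
print(torch.equal(dropout_manual(x, training=False), x))  # identity at test time
```

Rescaling during training keeps the expected value of each activation the same in both modes.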

In PyTorch: `torch.nn.Dropout` (a `torch.nn.Module`), with default probability `p = 0.5`.



```python
import torch
from torch import nn

x = torch.ones(3, 9, requires_grad=True)
print(x.detach())

# Each unit is zeroed with probability 0.75; survivors are scaled by 1 / (1 - 0.75) = 4
dropout = nn.Dropout(p=0.75)
y = dropout(x)
print(y.detach())

# Sum of the per-row L2 norms, back-propagated to x
l = y.norm(2, 1).sum()
l.backward()
print(x.grad)
```

Output (the mask is random, so the values differ between runs):

```
tensor([[ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.],
        [ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.]])
tensor([[ 0.,  0.,  4.,  0.,  0.,  0.,  0.,  4.,  0.],
        [ 0.,  4.,  0.,  0.,  4.,  0.,  4.,  4.,  0.],
        [ 4.,  0.,  0.,  0.,  4.,  0.,  0.,  0.,  4.]])
tensor([[ 0.0000,  0.0000,  2.8284,  0.0000,  0.0000,  0.0000,  0.0000,
          2.8284,  0.0000],
        [ 0.0000,  2.0000,  0.0000,  0.0000,  2.0000,  0.0000,  2.0000,
          2.0000,  0.0000],
        [ 2.3094,  0.0000,  0.0000,  0.0000,  2.3094,  0.0000,  0.0000,
          0.0000,  2.3094]])
```


In a network, dropout can simply be added as a layer:

```python
>>> model = nn.Sequential(nn.Linear(10, 100), nn.ReLU(), nn.Dropout(), nn.Linear(100, 2))
```

A model using dropout has to be set in **train** or **test** mode:

```python
>>> dropout = nn.Dropout()
>>> model = nn.Sequential(nn.Linear(10, 100), dropout, nn.Linear(100, 2))
>>> dropout.training
True
>>> model.train(False)  # equivalent to model.eval()
Sequential(
  (0): Linear(in_features=10, out_features=100, bias=True)
  (1): Dropout(p=0.5, inplace=False)
  (2): Linear(in_features=100, out_features=2, bias=True)
)
>>> dropout.training
False
```

In 2D activation maps, neighboring units are generally correlated, so dropping individual units has virtually no effect.
-> Spatial dropout: drops entire channels instead of individual units.

```python
>>> dropout2d = nn.Dropout2d()
>>> x = torch.ones(2, 3, 2, 2)
>>> dropout2d(x)  # whole channels are zeroed; the draw is random
tensor([[[[0., 0.],
          [0., 0.]],

         [[0., 0.],
          [0., 0.]],

         [[2., 2.],
          [2., 2.]]],


        [[[2., 2.],
          [2., 2.]],

         [[0., 0.],
          [0., 0.]],

         [[2., 2.],
          [2., 2.]]]])
```

Another variant of dropout is DropConnect, which drops individual connections (weights) between units instead of whole units.
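A rough sketch of the idea, under simplifying assumptions: one weight mask is sampled per forward pass and shared by the whole batch, surviving weights are rescaled by `1 / (1 - p)` as in dropout, and the full weights are used at test time. `DropConnectLinear` is a hypothetical class, not a PyTorch module, and this is not meant to reproduce the published method exactly.

```python
import torch
from torch import nn
import torch.nn.functional as F


class DropConnectLinear(nn.Linear):
    """Linear layer that drops individual weights during training (sketch)."""

    def __init__(self, in_features, out_features, p=0.5):
        super().__init__(in_features, out_features)
        self.p = p

    def forward(self, x):
        if not self.training:
            return F.linear(x, self.weight, self.bias)
        # Assumption: mask the weight matrix itself, rescaling the survivors
        w = F.dropout(self.weight, p=self.p, training=True)
        return F.linear(x, w, self.bias)


torch.manual_seed(0)
layer = DropConnectLinear(10, 5, p=0.5)
x = torch.randn(3, 10)

layer.train()
y_train = layer(x)  # random: a different weight mask on each call

layer.eval()
y_eval = layer(x)   # deterministic: all connections are used
print(torch.equal(y_eval, F.linear(x, layer.weight, layer.bias)))
```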

## Activation normalization

The goal is to keep proper statistics of the activations and derivatives throughout the network.
Why?
* To learn (much) faster.

-> **Batch normalization**

* During training, batch normalization **shifts and rescales according to the mean and variance estimated on the batch.**
* During test, it simply shifts and rescales according to the empirical moments estimated during training.
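The two regimes can be observed with `nn.BatchNorm1d`. This is a small sketch with arbitrary sizes; the module's default `momentum=0.1` is what produces the running estimate checked in the comment.

```python
import torch
from torch import nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(3)
x = torch.randn(64, 3) * 5.0 + 2.0  # activations with mean near 2 and std near 5

# Training mode: normalize with the statistics of the current batch
bn.train()
y = bn(x)
print(y.mean(dim=0))                 # approximately 0 for each feature
print(y.std(dim=0, unbiased=False))  # approximately 1 for each feature

# The running mean moved from 0 toward the batch mean by a factor momentum = 0.1
print(bn.running_mean, 0.1 * x.mean(dim=0))

# Test mode: normalize with the running estimates instead of the batch statistics
bn.eval()
z = bn(x)
print(torch.allclose(y, z))          # the two modes give different outputs
```

After a single training step the running estimates are still far from the batch statistics, which is why the outputs of the two modes differ here.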

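Furthermore, there is **layer normalization**, proposed by Ba et al. (2016). It normalizes across the units of a layer for each sample, instead of across the batch, so it behaves the same way during training and test. It is reported to give similar or better improvements than BN.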
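A short sketch with `nn.LayerNorm`, which computes its statistics over the units of a layer separately for each sample. The sizes are arbitrary.

```python
import torch
from torch import nn

torch.manual_seed(0)
x = torch.randn(4, 10) * 3.0 + 1.0
ln = nn.LayerNorm(10)
y = ln(x)

print(y.mean(dim=1))                 # approximately 0 for every sample
print(y.std(dim=1, unbiased=False))  # approximately 1 for every sample

# The result for one sample does not depend on the rest of the batch
print(torch.allclose(ln(x[:1]), y[:1], atol=1e-6))
```

Because no batch statistics are involved, the module needs no running averages.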

## Residual Networks

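Residual: the difference between the value before the block and the one needed after. A residual block learns this difference `f(x)` and adds it to its input through an identity pass-through, so its output is `x + f(x)`.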


![residual](img/lecture6/residual.png)
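A minimal sketch of such a block, assuming a small two-layer perceptron for the residual branch. `ResidualBlock` and its layer sizes are illustrative choices, not the architecture from the figure.

```python
import torch
from torch import nn


class ResidualBlock(nn.Module):
    """Computes x + f(x), where f is a small learned branch (sketch)."""

    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        # Identity pass-through plus the learned residual
        return x + self.f(x)


block = ResidualBlock(16)
x = torch.randn(8, 16)

# With the last layer of the branch zeroed, f(x) = 0 and the block is the identity
nn.init.zeros_(block.f[2].weight)
nn.init.zeros_(block.f[2].bias)
print(torch.equal(block(x), x))
```

The check at the end shows why the pass-through helps: when the branch contributes nothing, the signal and its gradient still flow through unchanged.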



## Smart Initialization

### Look Linear initialization
Combining LL initialization with CReLU makes the network linear at initialization.

![ll_init](img/lecture6/ll_init.png)
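A sketch of why this works, assuming the mirrored block structure `[W, -W]` for the weights that follow a CReLU: since `relu(x) - relu(-x) = x`, the layer computes exactly `W x`. The names and sizes below are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
d_in, d_out = 6, 4
W = torch.randn(d_out, d_in)

# The layer after CReLU sees 2 * d_in inputs; initialize it with mirrored blocks
layer = nn.Linear(2 * d_in, d_out, bias=False)
with torch.no_grad():
    layer.weight.copy_(torch.cat([W, -W], dim=1))


def crelu(x):
    """Concatenate relu(x) and relu(-x) along the feature dimension."""
    return torch.cat([torch.relu(x), torch.relu(-x)], dim=1)


x = torch.randn(5, d_in)
y = layer(crelu(x))

# W relu(x) - W relu(-x) = W (relu(x) - relu(-x)) = W x
print(torch.allclose(y, x @ W.t(), atol=1e-5))
```

Training then moves the two halves of the weight matrix apart, and the layer becomes nonlinear.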

## Summary

Techniques to enable training of deep networks:
* **rectifiers** to prevent the gradient from vanishing during backward pass
* **dropout** to force a distributed representation (**prevents overfitting and dependence between units**)
* **batch normalization** to dynamically maintain the statistics of activations (makes training much faster)
* **identity pass-through (resnet, residual block)** to keep a structured gradient and distribute representation
* **smart initialization** (look linear) to put the gradient in a good regime

Using multiple GPUs with `nn.DataParallel` (running this example requires at least two CUDA devices):

```python
import torch
from torch import nn


class Dummy(nn.Module):
    def __init__(self, m):
        super().__init__()
        self.m = m

    def forward(self, x):
        # Report the size of the batch slice and the GPU this replica runs on
        print('Dummy.forward', x.size(), torch.cuda.current_device())
        return self.m(x)


x = torch.randn(50, 10).cuda()
m = Dummy(nn.Linear(10, 5)).cuda()

print('Without data_parallel')
y = m(x)
print()
# DataParallel splits the batch along dimension 0 across the available GPUs
mp = nn.DataParallel(m)
print('With data_parallel')
y = mp(x)
```

Output:

```
Without data_parallel
Dummy.forward torch.Size([50, 10]) 0

With data_parallel
Dummy.forward torch.Size([25, 10]) 0
Dummy.forward torch.Size([25, 10]) 1
```