In [7]:
import torch
from torch import Tensor, LongTensor
from torch.autograd import Variable
from torch import nn, optim, cuda
from torch.nn import functional as F
from torchvision import datasets

## Cross-entropy loss
* Measures the difference between two probability distributions
* Here: how close the predicted distribution is to the true distribution
* Usually used in **classification** problems
* More info: https://stackoverflow.com/a/41990932


In [6]:
# torch.nn.CrossEntropyLoss
f = Variable(Tensor([[3, 3, -2], [1, -1, 2]]))
target = Variable(LongTensor([0, 1]))
criterion = nn.CrossEntropyLoss()
criterion(f, target)

tensor(2.0228)
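
To see where the value 2.0228 comes from, the same loss can be recomputed by hand. This is a minimal sketch in plain Python rather than PyTorch; the helper name `cross_entropy` is illustrative, and it assumes the loss is averaged over the samples, which is the default behaviour of `nn.CrossEntropyLoss`.

```python
import math

def cross_entropy(logits, targets):
    """Mean over samples of -log softmax(logits)[target]."""
    losses = []
    for row, t in zip(logits, targets):
        # log-sum-exp, shifted by the row maximum for numerical stability
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        # -log(softmax(row)[t]) = log_sum_exp - row[t]
        losses.append(log_sum_exp - row[t])
    return sum(losses) / len(losses)

logits = [[3.0, 3.0, -2.0], [1.0, -1.0, 2.0]]
targets = [0, 1]
loss = cross_entropy(logits, targets)
print(round(loss, 4))  # 2.0228
```

The first sample contributes about 0.70 (the target class ties for the highest score), the second about 3.35 (the target class has the lowest score), and their mean is the reported loss.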

---
## LogSoftmax & NLLLoss
* **Negative log likelihood loss**: useful for **training** a classification problem with C classes
  * torch.nn.NLLLoss()
* **Log-softmax**: usually the final layer of a network **trained with NLLLoss()**
  * torch.nn.LogSoftmax()
* If a network should output log-probabilities, it can end with a LogSoftmax() layer and be trained with NLLLoss()
* nn.CrossEntropyLoss() combines LogSoftmax() and NLLLoss() in a single criterion
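
The split into a log-softmax layer and a negative log-likelihood loss can be sketched in plain Python. The function names `log_softmax` and `nll_loss` are illustrative stand-ins, not the PyTorch modules, and the scores are the ones from the cross-entropy cell above. Under these assumptions, chaining the two steps gives the same number as the cross-entropy criterion.

```python
import math

def log_softmax(row):
    """Log-probabilities of one sample (what a log-softmax layer outputs)."""
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum_exp for v in row]

def nll_loss(log_probs, targets):
    """Mean negative log-likelihood of the target classes."""
    picked = [-lp[t] for lp, t in zip(log_probs, targets)]
    return sum(picked) / len(picked)

scores = [[3.0, 3.0, -2.0], [1.0, -1.0, 2.0]]
targets = [0, 1]

log_probs = [log_softmax(row) for row in scores]
loss = nll_loss(log_probs, targets)

# Each row of log_probs exponentiates to a probability distribution
row_sums = [sum(math.exp(lp) for lp in row) for row in log_probs]
print(round(loss, 4), row_sums)
```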

---
## Stochastic gradient descent
* Update the parameters w(t) after every sample (instead of once per pass over all n samples)
* However, this does not benefit from the speed-up of batch processing
* Hence, **mini-batch stochastic gradient descent** is used
  * Visit the samples in "mini-batches" (a few tens of samples) and update the parameters after each mini-batch
  * The noise in the mini-batch gradient helps **evade local minima**
* Example: see the training loop with torch.optim further down
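
The mini-batch update can be illustrated on a toy problem without PyTorch. This is a sketch under stated assumptions: the data (points on the line y = 2x + 1), the function name `minibatch_sgd`, and all hyper-parameter values are made up for the illustration, and the samples are visited in a fixed order rather than shuffled.

```python
def minibatch_sgd(xs, ys, lr=0.1, batch_size=5, nb_epochs=300):
    """Fit y = w * x + b with mini-batch SGD on the mean squared error."""
    w, b = 0.0, 0.0
    for _ in range(nb_epochs):
        for start in range(0, len(xs), batch_size):
            xb = xs[start:start + batch_size]
            yb = ys[start:start + batch_size]
            # Gradient of the mean squared error over this mini-batch only
            errors = [w * x + b - y for x, y in zip(xb, yb)]
            grad_w = 2 * sum(e * x for e, x in zip(errors, xb)) / len(xb)
            grad_b = 2 * sum(errors) / len(xb)
            # One parameter update per mini-batch
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy data on the line y = 2x + 1 (hypothetical, noise-free)
xs = [i / 10 for i in range(20)]
ys = [2 * x + 1 for x in xs]
w, b = minibatch_sgd(xs, ys)
print(round(w, 3), round(b, 3))
```

With 20 samples and a mini-batch size of 5, the parameters are updated four times per epoch instead of once.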

---
## Momentum and moment estimation
* Deep learning relies on smarter use of the gradient, using *statistics* over its past values to make a *smarter step* with the current one
* "Momentum" adds *inertia* to the choice of the step direction (see slide 5-P.22)
  * With γ = 0, this is the same as plain Stochastic Gradient Descent (SGD)
  * With γ > 0, the advantages are:
    * it can "go through" local barriers,
    * it accelerates if the gradient does not change much,
    * it dampens oscillations in narrow valleys.
    
Vanilla SGD | With Momentum
----- | -----
![dampening1](img/lecture5/moment_dampening1.png) {width=75%} | ![dampening2](img/lecture5/moment_dampening2.png) {width=75%}
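
The role of γ can be checked numerically. The slide's exact formulation is not reproduced in these notes, so the sketch below assumes one common form of the update, u ← γ·u + η·g followed by w ← w − u; the function name `momentum_step` and the numbers are illustrative.

```python
def momentum_step(w, u, grad, eta, gamma):
    """One update: u <- gamma * u + eta * grad, then w <- w - u."""
    u = gamma * u + eta * grad
    return w - u, u

# gamma = 0 reduces to plain SGD: w <- w - eta * grad
w_plain, _ = momentum_step(w=1.0, u=0.0, grad=0.5, eta=0.1, gamma=0.0)

# With a constant gradient, the step grows towards eta * grad / (1 - gamma)
w, u = 0.0, 0.0
for _ in range(200):
    w, u = momentum_step(w, u, grad=1.0, eta=0.1, gamma=0.9)
print(w_plain, round(u, 6))
```

With γ = 0.9 and a gradient that does not change, the effective step ends up about ten times larger than η times the gradient, which is the acceleration mentioned in the list above.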
   
#### Adam Algorithm
* Uses moving averages of each gradient coordinate and of its square to rescale each coordinate separately (see slide5-P.25)

Vanilla SGD | Adam algorithm
----- | -----
![dampening1](img/lecture5/moment_dampening1.png) {width=75%} | ![adamalgo](img/lecture5/moment_adam.png) {width=75%}
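
The per-coordinate rescaling can be sketched in plain Python. This follows the usual statement of the Adam update with bias correction; the function name `adam_step` is illustrative, and the default values of `lr`, `beta1`, `beta2` and `eps` are commonly used ones, assumed here rather than taken from the slides.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on lists of coordinates; t is the 1-based step count."""
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, grad, m, v):
        mi = beta1 * mi + (1 - beta1) * gi        # moving average of the gradient
        vi = beta2 * vi + (1 - beta2) * gi * gi   # moving average of its square
        m_hat = mi / (1 - beta1 ** t)             # bias correction
        v_hat = vi / (1 - beta2 ** t)
        new_w.append(wi - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v

# Two coordinates whose gradients differ by four orders of magnitude
w0 = [0.0, 0.0]
w1, m1, v1 = adam_step(w0, grad=[100.0, 0.01], m=[0.0, 0.0], v=[0.0, 0.0], t=1)
steps = [a - b for a, b in zip(w0, w1)]
print(steps)
```

Both coordinates move by roughly `lr` on the first step, even though their gradients are very different in magnitude. Plain SGD would move the first coordinate 10,000 times further than the second.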


In [None]:
# The plain version is the one used in practical 4
# Stochastic Gradient Descent with torch.optim
eta = 1e-1
optimizer = torch.optim.SGD(model.parameters(), lr = eta)
# optimizer = torch.optim.Adam(model.parameters(), lr = eta)

for e in range(25):
    for b in range(0, train_input.size(0), mini_batch_size):
        output = model(train_input.narrow(0, b, mini_batch_size))
        loss = criterion(output, train_target.narrow(0, b, mini_batch_size))
        model.zero_grad()
        loss.backward()
        # The optimizer updates the parameters after every mini-batch
        optimizer.step()

---
## Putting things together

In [13]:
class Net(nn.Module):
    def __init__(self, nb_hiddens):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=5)
        self.fc1 = nn.Linear(256, nb_hiddens)
        self.fc2 = nn.Linear(nb_hiddens, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), kernel_size=3, stride=3))
        x = F.relu(F.max_pool2d(self.conv2(x), kernel_size=2, stride=2))
        x = F.relu(self.fc1(x.view(-1, 256)))
        x = self.fc2(x)
        return x
    
train_set = datasets.MNIST('./data/mnist/', train = True, download = True)
train_input = Variable(train_set.train_data.view(-1, 1, 28, 28).float())
train_target = Variable(train_set.train_labels)

model, criterion = Net(200), nn.CrossEntropyLoss()

if cuda.is_available():
    model.cuda()
    criterion.cuda()
    train_input, train_target = train_input.cuda(), train_target.cuda()
    
# Normalize the input to zero mean and unit standard deviation
mu, std = train_input.data.mean(), train_input.data.std()
train_input.data.sub_(mu).div_(std)

lr, nb_epochs, batch_size = 1e-1, 10, 100

optimizer = optim.SGD(model.parameters(), lr = lr)

for k in range(nb_epochs):
    sum_loss = 0
    for b in range(0, train_input.size(0), batch_size):
        output = model(train_input.narrow(0, b, batch_size))
        loss = criterion(output, train_target.narrow(0, b, batch_size))
        sum_loss += loss.data
        model.zero_grad()
        loss.backward()
        optimizer.step()
    print(k, sum_loss)


0 tensor(172.3186)
1 tensor(36.4774)
2 tensor(23.9906)
3 tensor(17.9796)
4 tensor(14.3676)
5 tensor(11.5056)
6 tensor(9.3696)
7 tensor(7.4484)
8 tensor(5.9644)
9 tensor(4.7334)
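
The 256 input features of `fc1` in the network above follow from the spatial sizes after each convolution and pooling step. The sketch below recomputes them; the helper `out_size` is illustrative and assumes no padding and no dilation, which matches the layers as they are declared in `Net`.

```python
def out_size(size, kernel, stride=1):
    """Spatial size after a convolution or pooling without padding."""
    return (size - kernel) // stride + 1

size = 28                                   # MNIST images are 28x28
size = out_size(size, kernel=5)             # conv1              -> 24
size = out_size(size, kernel=3, stride=3)   # max_pool2d (3, 3)  -> 8
size = out_size(size, kernel=5)             # conv2              -> 4
size = out_size(size, kernel=2, stride=2)   # max_pool2d (2, 2)  -> 2
nb_features = 64 * size * size              # 64 channels out of conv2
print(nb_features)  # 256
```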


---
# REGULARIZATION

Regularization is the process of modifying the model to prevent overfitting while preserving its ability to generalize.
> Regularization, fundamentally, means changing the model slightly to avoid overfitting while keeping its generality (generality being the ability to describe a wide range of data, in both the training and the test set). More specifically, we look for a way to **move the solution of the loss-minimization problem to a nearby point**. The direction of the move is the one that makes the model less complex, even though the value of the loss function increases slightly.   
Reference: [machinelearningcoban.com](https://machinelearningcoban.com/2017/03/04/overfitting/#-regularization "Regularization")   

Types of regularization:
* Early stopping
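* L<sub>2</sub> regularization (weight decay): adds a penalty on the squared L<sub>2</sub> norm of the weights
* L<sub>1</sub> regularization: adds a penalty on the L<sub>1</sub> norm of the weights, which tends to produce sparse weights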
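
The two penalties can be sketched in plain Python. The function names and numbers are illustrative assumptions. The L<sub>2</sub> version assumes the penalty is written as (weight_decay / 2)·‖w‖², so that its gradient is weight_decay·w; in PyTorch the same effect is usually obtained through the `weight_decay` argument of `optim.SGD`.

```python
def sgd_step_l2(w, grad, lr, weight_decay):
    """SGD step on loss + (weight_decay / 2) * ||w||^2."""
    # The penalty adds weight_decay * w_i to each gradient coordinate
    return [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, grad)]

def l1_penalty(w, lam):
    """L1 penalty term lam * sum(|w_i|), to be added to the loss."""
    return lam * sum(abs(wi) for wi in w)

w = [1.0, -2.0]
# With a zero data gradient, every weight shrinks by the factor (1 - lr * weight_decay)
w_next = sgd_step_l2(w, grad=[0.0, 0.0], lr=0.1, weight_decay=0.5)
penalty = l1_penalty(w, lam=0.01)
print(w_next, penalty)
```

The shrinking factor is where the name "weight decay" comes from: without any signal from the data, the weights decay geometrically towards zero.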

---
## Weight Initialization
* Relies on controlling the variances (slide5-P.50) so that:
  * the gradient does not vanish
  * weights evolve at the same rate across layers during training and no layer reaches a saturation behavior before others
  

The formulas give a variance; the standard deviation used to draw the weights is std = sqrt(variance).   


First Type | Xavier initialization
--- | ---
![weight1](img/lecture5/weight1.png) | ![xavier](img/lecture5/xavier.png)


* These formulas give the variance (or std) with which parameters such as weights and biases are drawn at random, so that the gradients keep a similar magnitude across all layers and do not vanish
![betterweight](img/lecture5/graph1.png "Before and after Xavier initialization")

In [None]:
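# Xavier implementation (same computation as torch.nn.init.xavier_normal_)
import math
from torch.nn.init import _calculate_fan_in_and_fan_out

def xavier_init(tensor, gain = 1):
    # fan_in = N(l-1), fan_out = N(l)
    fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
    # important: the variance is 2 / (fan_in + fan_out), std is its square root
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    # Fill the tensor in place without recording the operation in autograd
    with torch.no_grad():
        return tensor.normal_(0, std)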

So ReLU impacts the forward and backward pass as if the weights had half their variances, which motivates multiplying them by a corrective gain of √2.
*(He et al., 2015)*
![Init coefficients](img/lecture5/init_coef.png "Coefficients for each type of activation functions")
   
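Using these values in practice, with V the variance of the weights of layer l, N(l-1) its number of inputs and N(l) its number of outputs:
* For the ReLU activation function: V = 2.0 / N(l-1)
* For the tanh activation function: V = 1.0 / N(l-1) (this is Xavier init)
* V = 2.0 / (N(l-1) + N(l)): this is also Xavier init   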
[Reference: deeplearning.ai](https://www.coursera.org/learn/deep-neural-network/lecture/RwqYe/weight-initialization-for-deep-networks)
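
The three variance rules listed above can be turned into standard deviations with a few lines of plain Python. The function name `init_std`, the scheme labels, and the layer size (256 inputs, 200 outputs) are illustrative choices for this sketch.

```python
import math

def init_std(fan_in, fan_out, scheme):
    """Standard deviation of the initial weights of a layer."""
    if scheme == "relu":      # V = 2 / N(l-1)
        variance = 2.0 / fan_in
    elif scheme == "tanh":    # V = 1 / N(l-1)
        variance = 1.0 / fan_in
    elif scheme == "xavier":  # V = 2 / (N(l-1) + N(l))
        variance = 2.0 / (fan_in + fan_out)
    else:
        raise ValueError("unknown scheme: " + scheme)
    return math.sqrt(variance)

# Example: a fully connected layer with 256 inputs and 200 outputs
for scheme in ("relu", "tanh", "xavier"):
    print(scheme, round(init_std(256, 200, scheme), 4))
```

The ReLU rule gives a standard deviation √2 times larger than the tanh rule for the same layer, which is the corrective gain mentioned above.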

---
## Data Normalization

The analysis behind the weight initialization relies on keeping the activation variance constant.   
For this to hold, the variance must not only remain unchanged through the layers, it must also be correct at the input:   
**V(x<sup>(0)</sup>) = 1**

In [None]:
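# Either normalize with a single scalar mean and std ...
mu, std = train_input.mean(), train_input.std()
# ... OR normalize component-wise (this overrides the line above; keep only one).
# Note: a component that is constant over the training set has std 0,
# so dividing by it would produce NaN or inf.
mu, std = train_input.mean(0), train_input.std(0)

# Use the training statistics for both the training and the test set
train_input.sub_(mu).div_(std)
test_input.sub_(mu).div_(std)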
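
The effect of the scalar normalization can be checked on a handful of numbers. This sketch uses the standard library instead of PyTorch; the values and the helper names `fit_normalizer` and `normalize` are made up for the illustration. `statistics.stdev` is the sample standard deviation (divides by n − 1), which is also what `Tensor.std()` computes by default.

```python
import statistics

def fit_normalizer(values):
    """Mean and sample standard deviation of the training values."""
    return statistics.mean(values), statistics.stdev(values)

def normalize(values, mu, std):
    return [(v - mu) / std for v in values]

train = [0.0, 50.0, 100.0, 150.0, 255.0]   # hypothetical pixel intensities
test = [10.0, 200.0]

mu, std = fit_normalizer(train)
train_n = normalize(train, mu, std)
test_n = normalize(test, mu, std)          # reuses the training mu and std

train_mean = statistics.mean(train_n)
train_std = statistics.stdev(train_n)
print(train_mean, train_std)
```

The normalized training values have mean 0 and standard deviation 1 up to rounding error. The test values are shifted and scaled by the same constants, so they are not guaranteed to have exactly these statistics.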

---
## Choosing the Network structure

* Reuse or start from a **"well known, that works"** structure
* Split feature extraction / inference
* Modulate the capacity until the model overfits a small subset, but neither overfits nor underfits the full set
* Capacity increases with more layers, more channels, larger receptive fields, or more units.
* Regularization to reduce the capacity or induce sparsity.
* Identify common paths for siamese-like architectures ([more info on Siamese networks](https://www.quora.com/What-are-Siamese-neural-networks-what-applications-are-they-good-for-and-why "Siamese network"))
* Identify what path(s) or sub-parts need more/less capacity
* Use prior knowledge about the "scale of meaningful context" to size filters / combinations of filters (e.g. knowing the size of objects in a scene, or the maximum duration of a sound snippet that matters).
* Grid-search all the variations that come to mind.

More on the learning rate. We want to:
* reduce the loss quickly -> large learning rate
* not be trapped in a bad minimum -> large learning rate
* not bounce around in narrow valleys -> small learning rate
* not oscillate around a minimum -> small learning rate

-> Use a larger step size first, and a smaller one at the end.
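
One simple way to go from a large to a small step size is a step decay schedule. The sketch below is only an illustration: the function name `step_decay` and the values (start at 0.1, divide by 10 every 10 epochs) are assumptions, not settings from the lecture. PyTorch ships comparable schedulers in `torch.optim.lr_scheduler`, for example `StepLR`.

```python
def step_decay(epoch, lr0=1e-1, factor=0.1, every=10):
    """Learning rate for an epoch: lr0 is multiplied by `factor` every `every` epochs."""
    return lr0 * factor ** (epoch // every)

for epoch in (0, 9, 10, 25):
    print(epoch, step_decay(epoch))
```

Inside a training loop, the value returned for the current epoch would be used as the optimizer's learning rate for that epoch.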