# Autograd & Modules

## Back to basics

Training loop over CIFAR10 (40,000 train images, 10,000 test images). What happens if you
- Remove the `ReLU()`? 
- Increase the learning rate?
- Stack more layers? 
- Don't normalize the input?
- Perform more epochs?

Can you completely overfit the training set (i.e. get 100% accuracy?)

This code is highly non-modulable. Can you create functions for each specific task and train it from the command line, e.g. something like 
`python main.py --batch_size 64 --n_epochs 10 --lr 0.01`?
(hint: see [this](https://github.com/pytorch/examples/blob/master/mnist/main.py))

Your training went well. Good. Why not save the weights of the network (`net.state_dict()`) using `torch.save()`?

You have a GPU (remote or local). Where do you put the magic `.cuda()` to switch the training to a GPU? Is it faster?

In [1]:
import torch
import torchvision
import torch.nn as nn
import torchvision.transforms as t

# define network structure 
net = nn.Sequential(nn.Linear(3 * 32 * 32, 1000), nn.ReLU(), nn.Linear(1000, 10))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr = 0.01, momentum=0.9, weight_decay=1e-4)

# load data
to_tensor =  t.ToTensor()
normalize = t.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010))
flatten =  t.Lambda(lambda x:x.view(-1))

transform_list = t.Compose([to_tensor, normalize, flatten])
train_set = torchvision.datasets.CIFAR10(root='.', train=True, transform=transform_list, download=True)
test_set = torchvision.datasets.CIFAR10(root='.', train=False, transform=transform_list, download=True)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64)

# === Train === ###
net.train()

# train loop
for epoch in range(3):
    train_correct = 0
    train_loss = 0
    print('Epoch {}'.format(epoch))
    
    # loop per epoch 
    for i, (batch, targets) in enumerate(train_loader):

        output = net(batch)
        loss = criterion(output, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        pred = output.max(1, keepdim=True)[1]
        train_correct += pred.eq(targets.view_as(pred)).sum().item()
        train_loss += loss

        if i % 100 == 10: print('Train loss {:.4f}, Train accuracy {:.2f}%'.format(
            train_loss / (i * 64), 100 * train_correct / (i * 64)))
        
print('End of training.\n')
    
# === Test === ###
test_correct = 0
net.eval()

# loop, over whole test set
for i, (batch, targets) in enumerate(test_loader):
    
    output = net(batch)
    pred = output.max(1, keepdim=True)[1]
    test_correct += pred.eq(targets.view_as(pred)).sum().item()
    
print('End of testing. Test accuracy {:.2f}%'.format(
    100 * test_correct / (len(test_loader) * 64)))

Files already downloaded and verified
Files already downloaded and verified
Epoch 0
Train loss 0.0367, Train accuracy 24.38%
Train loss 0.0300, Train accuracy 33.84%
Train loss 0.0291, Train accuracy 36.24%
Train loss 0.0287, Train accuracy 37.26%
Train loss 0.0282, Train accuracy 38.20%
Train loss 0.0280, Train accuracy 38.71%
Train loss 0.0278, Train accuracy 39.22%
Train loss 0.0278, Train accuracy 39.53%
Epoch 1
Train loss 0.0282, Train accuracy 47.66%
Train loss 0.0267, Train accuracy 43.91%
Train loss 0.0265, Train accuracy 44.27%
Train loss 0.0265, Train accuracy 44.32%
Train loss 0.0262, Train accuracy 44.74%
Train loss 0.0262, Train accuracy 44.92%
Train loss 0.0261, Train accuracy 45.09%
Train loss 0.0262, Train accuracy 45.14%
Epoch 2
Train loss 0.0269, Train accuracy 55.16%
Train loss 0.0263, Train accuracy 47.05%
Train loss 0.0259, Train accuracy 47.57%
Train loss 0.0259, Train accuracy 47.78%
Train loss 0.0256, Train accuracy 47.99%
Train loss 0.0255, Train accuracy 48.15

## Autograd

Autograd handles well almost every basic tensor operation you could think of!

In [3]:
# don't hit enter before to guessed the answer!
x = torch.Tensor(4, 10)
x.requires_grad=True
loss = x[:, :4].sum()
loss.backward()
x.grad

tensor([[1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0., 0., 0.],
        [1., 1., 1., 1., 0., 0., 0., 0., 0., 0.]])

In [4]:
# don't hit enter before to guessed the answer!
x = torch.Tensor(2, 3)
x.requires_grad=True
y = torch.Tensor([[1, 2], [3, 4]])
loss = y.mm(x).sum()
loss.backward()
x.grad

tensor([[4., 4., 4.],
        [6., 6., 6.]])

In [5]:
# don't hit enter before to guessed the answer!
x = torch.Tensor([[1, 2, 3, 4]])
x.requires_grad=True
y = 2 * x
y.backward(torch.Tensor([[1, 0, 0, 0]]))
x.grad

tensor([[2., 0., 0., 0.]])

## Create your own module

In [7]:
from torch.nn import Parameter

class Permutation(nn.Module):

    def __init__(self, input_features, axis=1):
        super(Permutation, self).__init__()
        self.input_features = input_features
        self.axis = axis
        self.perm = Parameter(torch.randperm(self.input_features), requires_grad=False)

    def forward(self, input):
        return input.index_select(self.axis, self.perm)

    def __repr__(self):
        return self.__class__.__name__ + '(' \
            + str(self.input_features) + ', ' \
            + str(self.output_features) + ', ' \
            + 'axis=' + str(self.axis) + ')'
            
net = Permutation(10)
x = torch.arange(10).view(1, 10)
print(x)
y = net(x)
print(y)

tensor([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
tensor([[5, 2, 3, 8, 7, 6, 0, 9, 1, 4]])


In [8]:
print(net.state_dict())

OrderedDict([('perm', tensor([5, 2, 3, 8, 7, 6, 0, 9, 1, 4]))])


## Hooks

Forward/backward hooks allow to catch the activations/backpropagated gradients! First, print the norm of the activations.

In [9]:
# create a FC 
def deep_net(n_layers, features=1000):
    layers = []
    for _ in range(n_layers):
        layers.append(nn.Linear(features, features))
        layers.append(nn.ReLU())
        
    return nn.Sequential(*layers)

# print activations norms
def forward_hook(module, input, output):
    print('Layer {}\t Activations norm: {:.4f}'.format(
        module.__class__.__name__, output.norm().item()))

# register hook for every layer 
def register_forward_hooks(net, forward_hook):
    for layer in net.children():
        layer.register_forward_hook(forward_hook)

In [10]:
net = deep_net(10)
register_forward_hooks(net, forward_hook)

x = torch.randn(512, 1000)
y = net(x)

Layer Linear	 Activations norm: 412.4592
Layer ReLU	 Activations norm: 291.7160
Layer Linear	 Activations norm: 170.7109
Layer ReLU	 Activations norm: 122.4180
Layer Linear	 Activations norm: 71.3020
Layer ReLU	 Activations norm: 48.7552
Layer Linear	 Activations norm: 31.0433
Layer ReLU	 Activations norm: 21.3409
Layer Linear	 Activations norm: 17.6968
Layer ReLU	 Activations norm: 12.1569
Layer Linear	 Activations norm: 15.1106
Layer ReLU	 Activations norm: 10.9613
Layer Linear	 Activations norm: 14.2940
Layer ReLU	 Activations norm: 10.0419
Layer Linear	 Activations norm: 14.0194
Layer ReLU	 Activations norm: 9.7039
Layer Linear	 Activations norm: 14.0066
Layer ReLU	 Activations norm: 9.7330
Layer Linear	 Activations norm: 14.4152
Layer ReLU	 Activations norm: 9.9619


Is this network well initialized? Why?

Next, print the size of the backpropagated gradients

In [22]:
# print  sizes
def backward_hook(module, grad_input, grad_output):
    print('Layer {}\t Backpropagated gradient size: {}'.format(
        module.__class__.__name__, grad_output[0].size()))

# register hook for every layer 
def register_backward_hooks(net, backward_hook):
    for layer in net.children():
        layer.register_backward_hook(backward_hook)

In [23]:
net = deep_net(10)
register_backward_hooks(net, backward_hook)

x = torch.rand(512, 1000)
y = net(x).sum()
y.backward()

Layer ReLU	 Backpropagated gradient size: torch.Size([512, 1000])
Layer Linear	 Backpropagated gradient size: torch.Size([512, 1000])
Layer ReLU	 Backpropagated gradient size: torch.Size([512, 1000])
Layer Linear	 Backpropagated gradient size: torch.Size([512, 1000])
Layer ReLU	 Backpropagated gradient size: torch.Size([512, 1000])
Layer Linear	 Backpropagated gradient size: torch.Size([512, 1000])
Layer ReLU	 Backpropagated gradient size: torch.Size([512, 1000])
Layer Linear	 Backpropagated gradient size: torch.Size([512, 1000])
Layer ReLU	 Backpropagated gradient size: torch.Size([512, 1000])
Layer Linear	 Backpropagated gradient size: torch.Size([512, 1000])
Layer ReLU	 Backpropagated gradient size: torch.Size([512, 1000])
Layer Linear	 Backpropagated gradient size: torch.Size([512, 1000])
Layer ReLU	 Backpropagated gradient size: torch.Size([512, 1000])
Layer Linear	 Backpropagated gradient size: torch.Size([512, 1000])
Layer ReLU	 Backpropagated gradient size: torch.Size([512, 100

The backpropagated gradients are not to be mistaken with the gradients w.r.t the weights! Can you recover the gradients w.r.t the weights using the backpropagated gradients and the input of the layer? 

In [24]:
net = nn.Sequential(nn.Linear(1, 2, bias=False), nn.Linear(2, 1, bias=False))
net[1].weight.data = torch.Tensor([[1, 2], [3, 4]])
x = torch.ones(1, 1)

loss = net(x).sum()
loss.backward()
print('How to recover this?', net[0].weight.grad)
print('Weights of second layer', net[1].weight)
print('Backpropagated gradients', net[1].weight)

How to recover this? tensor([[4.],
        [6.]])
Weights of second layer Parameter containing:
tensor([[1., 2.],
        [3., 4.]], requires_grad=True)
Backpropagated gradients Parameter containing:
tensor([[1., 2.],
        [3., 4.]], requires_grad=True)


## Buffers

In [25]:
class Normalize(nn.Module):

    def __init__(self, num_features, momentum=0.1):
        super(Normalize, self).__init__()
        self.num_features = num_features
        self.momentum = momentum
        self.register_buffer('running_mean', torch.Tensor(num_features))
        self.reset_parameters()

    def reset_parameters(self):
        self.running_mean.zero_()

    def forward(self, input):
        # training mode: use batch statistics and update running statistics
        if self.training:
            mean = input.mean(dim=0)
            self.running_mean.mul_(1-self.momentum).add_(self.momentum, mean.data)

        # eval mode: use running statistics
        else:
            mean = self.running_mean

        # return output
        return input - mean

In [26]:
net = nn.Sequential(nn.Linear(10, 10), Normalize(10), nn.Linear(10, 2))
x = torch.rand(3, 10)

net.train()
print(net[1].running_mean)
net(x)
print(net[1].running_mean)

net.eval()
print(net[1].running_mean)
net(x)
print(net[1].running_mean)

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([-0.0505,  0.0084, -0.0412,  0.0265,  0.0038, -0.0156,  0.0816, -0.0134,
         0.0900,  0.0227])
tensor([-0.0505,  0.0084, -0.0412,  0.0265,  0.0038, -0.0156,  0.0816, -0.0134,
         0.0900,  0.0227])
tensor([-0.0505,  0.0084, -0.0412,  0.0265,  0.0038, -0.0156,  0.0816, -0.0134,
         0.0900,  0.0227])


In [27]:
print(net.state_dict())

OrderedDict([('0.weight', tensor([[ 0.1115, -0.1369,  0.0000, -0.0608, -0.1517,  0.1381, -0.2863, -0.1260,
         -0.0241, -0.1595],
        [ 0.1688,  0.0803,  0.2164, -0.1989, -0.0444, -0.1110, -0.0143, -0.0228,
         -0.2681, -0.2556],
        [ 0.1423, -0.0197, -0.0733,  0.1175, -0.1611, -0.2186, -0.0668, -0.2200,
          0.2234, -0.0160],
        [ 0.2108,  0.0021, -0.0887, -0.1745, -0.1142, -0.1653,  0.0424,  0.0575,
          0.0745,  0.2680],
        [ 0.0193,  0.2182, -0.2062,  0.1599, -0.1594, -0.1948,  0.1907, -0.1503,
         -0.0568,  0.1625],
        [ 0.1952, -0.2669,  0.1554, -0.2945, -0.0373,  0.1569, -0.2045, -0.2011,
          0.0663, -0.2940],
        [ 0.2757,  0.1726, -0.2297,  0.1615,  0.1225,  0.1398,  0.0226,  0.2855,
          0.1712, -0.0851],
        [ 0.0820, -0.2636,  0.2857, -0.2934,  0.1559,  0.1338, -0.1358, -0.1520,
          0.0096, -0.1594],
        [ 0.2417,  0.0266,  0.2303,  0.2760,  0.2888, -0.0185,  0.2606,  0.2783,
         -0.2460, -0.

## Gradchecking

In [28]:
x = torch.rand(256, 2, requires_grad=True).double()
y = torch.randint(0, 10, (256, ), requires_grad=True).double()
custom_op = nn.Linear(2, 10).double()
res = torch.autograd.gradcheck(custom_op, (x, ))
print(res)

True


## Autograd tips and tricks

Pointers are everywhere!

In [29]:
net = nn.Linear(2, 2)
w = net.weight
print(w)

x = torch.rand(1, 2)
y = net(x).sum()
y.backward()
net.weight.data -= 0.01 * net.weight.grad # <--- What is this?
print(w)

Parameter containing:
tensor([[-0.0943,  0.6001],
        [ 0.4884, -0.3737]], requires_grad=True)
Parameter containing:
tensor([[-0.1038,  0.5979],
        [ 0.4789, -0.3759]], requires_grad=True)


In [30]:
net = nn.Linear(2, 2)
w = net.weight.clone()
print(w)

x = torch.rand(1, 2)
y = net(x).sum()
y.backward()
net.weight.data -= 0.01 * net.weight.grad # <--- What is this?
print(w)

tensor([[ 0.1687, -0.1027],
        [-0.6148,  0.2257]], grad_fn=<CloneBackward>)
tensor([[ 0.1687, -0.1027],
        [-0.6148,  0.2257]], grad_fn=<CloneBackward>)


Sharing weights 

In [31]:
net = nn.Sequential(nn.Linear(2, 2), nn.Linear(2, 2))
net[0].weight = net[1].weight  # weight sharing

x = torch.rand(1, 2)
y = net(x).sum()
y.backward()
print(net[0].weight.grad)
print(net[1].weight.grad)

tensor([[ 0.7365,  0.7855],
        [-0.6110, -0.6640]])
tensor([[ 0.7365,  0.7855],
        [-0.6110, -0.6640]])
