### Notes on the 60-minute blitz PyTorch tutorial: introducing the nn neural net module

The code below builds a neural net with the following layers:

    Input: 1@32x32 -- a single-channel 32x32 image
    C1: 6@28x28 -- by applying 6 5x5 convolutions
    S2: 6@14x14 -- by 2x2 max pooling on each of the 6 channels
    C3: 16@10x10 -- by applying 16 5x5 convolutions, each spanning all 6 input channels
    S4: 16@5x5 -- by 2x2 max pooling on each of the 16 channels
    F5: fully connected layer, 120 units
    F6: fully connected layer, 84 units
    F7: fully connected layer, 10 units (outputs)
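
The sizes in the list above follow from simple arithmetic: a 5x5 convolution with stride 1 and no padding shrinks each side by 4, and 2x2 pooling halves it. Here is a minimal sketch that checks this in plain Python. The helper names `conv_out` and `pool_out` are made up for illustration and are not PyTorch functions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Side length after a square convolution (standard output-size formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Side length after non-overlapping pooling with a square window."""
    return size // window


side = 32
c1 = conv_out(side, 5)   # C1: 5x5 convolution
s2 = pool_out(c1, 2)     # S2: 2x2 max pooling
c3 = conv_out(s2, 5)     # C3: 5x5 convolution
s4 = pool_out(c3, 2)     # S4: 2x2 max pooling
flat = 16 * s4 * s4      # number of inputs to the first fully connected layer
print(c1, s2, c3, s4, flat)
```

The last number is where the `16 * 5 * 5` in the definition of `fc1` comes from.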


In [1]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F


class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        
        # these are the parametrized components of the neural net.
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1   = nn.Linear(16 * 5 * 5, 120)  # an affine operation: y = Wx + b
        self.fc2   = nn.Linear(120, 84)
        self.fc3   = nn.Linear(84, 10)

    def forward(self, x):
        
        # backward is defined automatically by autograd.
        # Each of the next two statements chains
        # convolution -> relu -> maxpool.
        # steps C1, S2
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))  # Max pooling over a (2, 2) window
        # steps C3, S4
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)  # If the window is square you can specify just a single number
        
        # Flatten each sample into a vector of 16 * 5 * 5 = 400 features,
        # keeping the batch dimension; -1 lets view() infer the batch size.
        x = x.view(-1, self.num_flat_features(x))
        
        # apply F5, relu
        x = F.relu(self.fc1(x))
        # apply F6, relu
        x = F.relu(self.fc2(x))
        # apply F7 and done
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features

net = Net()
print(net)

Net (
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear (400 -> 120)
  (fc2): Linear (120 -> 84)
  (fc3): Linear (84 -> 10)
)


### The learnable parameters can be listed with net.parameters()

In [2]:
params = list(net.parameters())
print("there are {} parameter objects in the net".format(len(params)))
weightct = params[0].size()
print("The weights to layer C1 map from {} input channel to {} channels, each with {} convolution parameters".format(weightct[1], weightct[0], weightct[2]*weightct[3]))


there are 10 parameter objects in the net
The weights to layer C1 map from 1 input channel to 6 channels, each with 25 convolution parameters


In [3]:
# The entries of size 6, 16, 120, 84, and 10 are the bias vectors.
# Each layer contributes two parameter tensors, its weights and then its
# bias, so 5 layers give the 10 parameter objects counted above.
for ii in range(len(params)):
    print(ii, params[ii].size())

0 torch.Size([6, 1, 5, 5])
1 torch.Size([6])
2 torch.Size([16, 6, 5, 5])
3 torch.Size([16])
4 torch.Size([120, 400])
5 torch.Size([120])
6 torch.Size([84, 120])
7 torch.Size([84])
8 torch.Size([10, 84])
9 torch.Size([10])
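
As a quick check on the sizes printed above, the total number of learnable values can be counted by multiplying out each shape. This sketch copies the shapes by hand from the output rather than reading them from the net, so it runs without PyTorch.

```python
from math import prod

# Shapes copied from the printout above: weight, bias for each of the 5 layers.
shapes = [
    (6, 1, 5, 5), (6,),      # conv1
    (16, 6, 5, 5), (16,),    # conv2
    (120, 400), (120,),      # fc1
    (84, 120), (84,),        # fc2
    (10, 84), (10,),         # fc3
]

counts = [prod(shape) for shape in shapes]  # number of values in each tensor
total = sum(counts)
print(counts)
print(total)
```

Almost all of the parameters sit in the first fully connected layer (400 x 120 weights).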



# Note about mini-batches in torch.nn

The entire torch.nn package only supports inputs that are a mini-batch of samples, not a single sample.

For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width.

If you have a single sample, just use input.unsqueeze(0) to add a fake batch dimension.
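
A small sketch of the unsqueeze trick follows. It uses plain tensors, as recent PyTorch releases allow, rather than the Variable wrapper used elsewhere in these notes. The name `image` is just an illustrative stand-in for a single sample.

```python
import torch

image = torch.randn(1, 32, 32)   # one sample: channels x height x width
batch = image.unsqueeze(0)       # insert a batch dimension of size 1 at position 0
print(image.shape, batch.shape)
```

The data is unchanged. Only the shape gains a leading dimension, which is what nn.Conv2d expects.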

The input to the net is a 4D tensor. In this example it is a batch holding one randomly generated sample, a single-channel 32x32 image, so its shape is 1 x 1 x 32 x 32.

In [4]:
input = Variable(torch.randn((1,1,32,32)))

# You don't call forward() explicitly: calling the module runs it for you.
out = net(input)
print(out)


Variable containing:
 0.1191 -0.0391 -0.0911 -0.0144  0.1056  0.0148  0.0118 -0.0789  0.1334 -0.0505
[torch.FloatTensor of size 1x10]



# It is a bit weird how backpropagation works. 

First you zero out the gradient buffers of all the parameters. These buffers (each parameter's .grad) are where the backpropagated gradients accumulate.

Then, since the output of the net is 10-dimensional rather than a scalar, you have to give backward() a gradient of the same shape as the output to start from: $dZ/dx_i$, the gradient of some downstream scalar $Z$ with respect to each output $x_i$. In this case it is random.

Note that out.backward() returns nothing.

Note also that the call out.backward() does not explicitly reference the network. So what is it doing?

Every Variable in the computation graph of 'out' that requires a gradient (here, the weights and biases of the net) gets its gradient computed and stored in its .grad attribute.


In [None]:
# You need to clear the gradients in each variable, or the new results 
# will be added to what's already there.
net.zero_grad()
out.backward(torch.randn(1, 10))
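
To see the accumulation on something smaller than the net, here is a minimal sketch using a single tensor `w` as a stand-in for a parameter. It is written against the plain-tensor API of recent PyTorch releases (an assumption, since these notes use the older Variable API).

```python
import torch

w = torch.ones(3, requires_grad=True)   # stand-in for one network parameter

(2 * w).sum().backward()
first = w.grad.clone()                  # gradient of sum(2w) is 2 per element

(2 * w).sum().backward()                # second backward, no zeroing in between
second = w.grad.clone()                 # the new gradient was added to the old one

w.grad.zero_()                          # reset the buffer by hand
(2 * w).sum().backward()
third = w.grad.clone()                  # same as after the first call

print(first, second, third)
```

net.zero_grad() serves the same purpose as the manual reset here, applied to every parameter of the net.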


In [12]:
# How to see the parameters themselves?
# parameters() returns a generator that yields each Parameter in turn,
# so loop over it (or wrap it in list()) to see the values.
# For conv1 it yields the weight tensor and then the bias vector.
print(net.conv1.parameters())
for param in net.conv1.parameters():
    print(param)

<generator object Module.parameters at 0x7fadf97b2c50>
Parameter containing:
(0 ,0 ,.,.) = 
 -0.1021 -0.0174  0.1215 -0.0019  0.0973
 -0.0035 -0.0146  0.1239 -0.1160 -0.0383
  0.1767  0.1816  0.1941  0.0389  0.1270
 -0.0529  0.1996  0.0967  0.0255  0.1016
 -0.1507  0.0912  0.1385  0.1273 -0.1909

(1 ,0 ,.,.) = 
  0.1432 -0.1984 -0.1351 -0.1715  0.0474
  0.1262  0.0971 -0.0033 -0.0196 -0.1496
 -0.0555  0.1132 -0.1805  0.1554 -0.0678
 -0.0557 -0.0401  0.0557 -0.0195 -0.0014
 -0.1601  0.0202  0.0182  0.1517  0.0191

(2 ,0 ,.,.) = 
 -0.1816  0.0594  0.1996 -0.0621  0.1648
  0.0702  0.0767 -0.1108  0.0961 -0.0520
 -0.1225 -0.1699 -0.0082 -0.0744  0.0446
  0.1444  0.1849  0.0453  0.0824 -0.1156
  0.0489 -0.0804 -0.1365  0.1102 -0.0840

(3 ,0 ,.,.) = 
  0.0219 -0.0350  0.0316 -0.1324  0.1103
  0.1659  0.1660 -0.0234 -0.1602 -0.0108
  0.1611 -0.0371 -0.0661 -0.0128  0.1859
  0.0217  0.0385 -0.0635  0.0217 -0.1827
  0.0617  0.0964 -0.1648  0.1463  0.1507

(4 ,0 ,.,.) = 
 -0.1769  0.0607  0.0163

## Loss functions

In [14]:
output = net(input)

# The target is a dummy: just the values 1 through 10.
# nn.MSELoss() builds a loss object (the criterion); calling it with the
# output and the target returns a Variable holding the loss value.
# (torch.range is deprecated in later PyTorch releases in favor of
# torch.arange, whose end point is exclusive.)

target = Variable(torch.range(1, 10))
criterion = nn.MSELoss()
loss = criterion(output, target)
print(loss)

Variable containing:
 38.4115
[torch.FloatTensor of size 1]
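
The printed loss can be reproduced by hand: by default MSELoss averages the squared differences between output and target. The sketch below assumes the network output here is the same as the one printed in cell In [4], which should hold because the weights have not been updated in between. The values are copied from that printout.

```python
# Output values copied from the In [4] printout (rounded to 4 decimals there).
output = [0.1191, -0.0391, -0.0911, -0.0144, 0.1056,
          0.0148, 0.0118, -0.0789, 0.1334, -0.0505]
target = [float(i) for i in range(1, 11)]   # 1, 2, ..., 10

squared_errors = [(o - t) ** 2 for o, t in zip(output, target)]
mse = sum(squared_errors) / len(squared_errors)   # mean over the 10 elements
print(round(mse, 4))
```

Because the outputs are all close to zero, the loss is close to the mean of the squared targets, 38.5.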



In [16]:
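# The tutorial suggests following the graph backward from the loss using creator.
# (Later PyTorch releases renamed creator to grad_fn and
# previous_functions to next_functions.)
# MSELoss
print(loss.creator)

# Linear is the affine map y = Wx + b of the last layer, fc3. Walking
# backward from the loss it comes first because fc3 has no ReLU after it.
# Linear
print(loss.creator.previous_functions[0][0])
# ReLU following fc2, which autograd reports as Threshold
print(loss.creator.previous_functions[0][0].previous_functions[0][0])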


<torch.nn._functions.thnn.auto.MSELoss object at 0x7fadf80ffac8>
<torch.nn._functions.linear.Linear object at 0x7fadf80ff908>
<torch.nn._functions.thnn.auto.Threshold object at 0x7fadf80ff828>


In [17]:

# In this step, we look at the gradient buffers before and after backprop.
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
Variable containing:
 0
 0
 0
 0
 0
 0
[torch.FloatTensor of size 6]

conv1.bias.grad after backward
Variable containing:
 0.1713
 0.0999
 0.0469
-0.1110
 0.0136
-0.0019
[torch.FloatTensor of size 6]



## Weight update

The gradient buffers of the variables just contain the backpropagated gradient data; it's up to the user to modify the weights using it.

The simplest update rule is stochastic gradient descent (SGD): weight = weight - learning_rate * gradient.

In [18]:
# Set a learning rate, loop over all the parameters in the network, and
# subtract learning_rate times the gradient from each one.
# sub_ ends in an underscore because it subtracts in place.

learning_rate = 0.01
for f in net.parameters():
    f.data.sub_(f.grad.data * learning_rate)
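
The same update rule can be watched at work on a one-dimensional toy problem. This plain-Python sketch minimizes f(w) = (w - 3)^2, a function chosen only for illustration, by repeatedly stepping against its gradient.

```python
def grad(w):
    """Derivative of the toy objective f(w) = (w - 3)**2."""
    return 2.0 * (w - 3.0)


w = 0.0
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * grad(w)   # same rule as the loop above, for a single weight

print(w)
```

Each step shrinks the distance to the minimum at w = 3 by a constant factor, so w ends up very close to 3.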

### torch.optim package

From the tutorial:

- However, as you use neural networks, you want to use various different update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc. To enable this, we built a small package: torch.optim that implements all these methods. Using it is very simple:

In [19]:
import torch.optim as optim

# create your optimizer with learning rate 
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad()   # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update

This appears to be straightforward enough. The optimizer holds references to the parameters; you run the forward pass and backprop, and then optimizer.step() updates each parameter from the gradient stored in its .grad, according to the chosen update rule.
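
To see those steps inside an actual loop, here is a minimal sketch on a made-up regression problem. The data, the `true_w` weights, and the single-layer `model` are all assumptions for illustration and are not part of the tutorial. It uses the plain-tensor API of recent PyTorch releases.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical toy problem: recover y = x @ true_w from 16 random samples.
x = torch.randn(16, 3)
true_w = torch.tensor([[1.0], [-2.0], [0.5]])
y = x @ true_w

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.05)

losses = []
for step in range(200):
    optimizer.zero_grad()            # clear gradients left over from the previous step
    loss = criterion(model(x), y)    # forward pass and loss
    loss.backward()                  # fill each parameter's .grad
    optimizer.step()                 # apply the SGD update
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Swapping optim.SGD for another optimizer such as optim.Adam changes only the line that constructs the optimizer; the loop body stays the same.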