## 1. Neural Networks
+ Neural networks are built with the `torch.nn` package.
+ `nn` depends on `autograd` to define models and differentiate them.
+ `nn.Module` contains layers, and a method `forward(input)` that returns the `output`.
![convnet](../reports/figures/1_neural_network.png)
+ Training procedure for a neural network:
    + Define a NN with learnable parameters (or weights)
    + Iterate over dataset of inputs
    + Process input through Network
    + Compute the loss (how far is the output from being correct)
    + Propagate gradients back into the network's parameters
    + Update the weights of the network, typically using a simple rule: `weight = weight - learning_rate * gradient`
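The steps above can be sketched end to end on a toy problem. This is a minimal illustration, not the network defined in this section: the dataset (16 points on the line `y = 2x + 1`), the single `nn.Linear` layer, the batch size of 4, and the learning rate are all assumptions chosen to keep the example short.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy dataset (an assumption for illustration): 16 points on the line y = 2x + 1
inputs = torch.linspace(-1, 1, 16).unsqueeze(1)   # shape (16, 1)
targets = 2 * inputs + 1

# 1. Define a network with learnable parameters
model = nn.Linear(1, 1)
criterion = nn.MSELoss()
learning_rate = 0.1

initial_loss = criterion(model(inputs), targets).item()

for epoch in range(50):
    # 2. Iterate over the dataset in mini-batches of 4 samples
    for start in range(0, len(inputs), 4):
        x = inputs[start:start + 4]
        y = targets[start:start + 4]

        output = model(x)              # 3. process input through the network
        loss = criterion(output, y)    # 4. compute the loss

        model.zero_grad()              # clear gradients from the previous step
        loss.backward()                # 5. propagate gradients back

        # 6. update: weight = weight - learning_rate * gradient
        with torch.no_grad():
            for p in model.parameters():
                p -= learning_rate * p.grad

final_loss = criterion(model(inputs), targets).item()
print(initial_loss, final_loss)
```

The loss over the full dataset should end up well below its starting value, since the data is noise-free and the model can fit it exactly.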

### **1.1 Define a network**

In [17]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    """
    Summary:
        You only need to define the forward function. The backward function
        (where the gradients are computed) is defined automatically by autograd.
        You can use any of the Tensor operations in the forward function.

    """

    def __init__(self):
        
        super(Net, self).__init__()
        # 1 input image channel, 6 output channels, 3x3 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 3)
        self.conv2 = nn.Conv2d(6, 16, 3)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16*6*6, 120) # 16 channels * 6x6 spatial size left after conv/pool on a 32x32 input
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)
    
    def forward(self, x):
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, you can specify just a single number
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        size = x.size()[1:] # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


# print out the NN
net = Net()
print(net)

Net(
  (conv1): Conv2d(1, 6, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(3, 3), stride=(1, 1))
  (fc1): Linear(in_features=576, out_features=120, bias=True)
  (fc2): Linear(in_features=120, out_features=84, bias=True)
  (fc3): Linear(in_features=84, out_features=10, bias=True)
)
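The `in_features=576` of `fc1` follows from how each stage shrinks a 32x32 input: a 3x3 convolution without padding removes 2 pixels per dimension, and 2x2 max pooling halves the size (rounding down). A small sketch that traces the shapes, using standalone layers with the same settings as `conv1` and `conv2` rather than the `Net` class itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Standalone layers with the same configuration as Net.conv1 and Net.conv2
conv1 = nn.Conv2d(1, 6, 3)
conv2 = nn.Conv2d(6, 16, 3)

x = torch.randn(1, 1, 32, 32)             # (batch, channels, height, width)
x = F.max_pool2d(F.relu(conv1(x)), 2)     # conv: 32 -> 30, pool: 30 -> 15
after_stage1 = tuple(x.shape)
x = F.max_pool2d(F.relu(conv2(x)), 2)     # conv: 15 -> 13, pool: 13 -> 6 (floor)
after_stage2 = tuple(x.shape)

flat_features = x[0].numel()              # features per sample: 16 * 6 * 6
print(after_stage1, after_stage2, flat_features)
# (1, 6, 15, 15) (1, 16, 6, 6) 576
```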


In [18]:
# net.parameters() are the learnable parameters
params = list(net.parameters())
print("You have {} learnable parameters.".format(len(params)))
print("Size of Conv1's weights {}".format(params[0].size()))

You have 10 learnable parameters.
Size of Conv1's weights torch.Size([6, 1, 3, 3])
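Note that the count of 10 refers to parameter tensors (one weight and one bias for each of the 5 layers), not to individual scalar weights. A sketch that lists them by name and counts the scalars, assuming the same five layers held in an `nn.ModuleDict` (a stand-in for the `Net` class above):

```python
import torch.nn as nn

# Same layer configuration as Net, held in a ModuleDict for illustration
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 3),
    "conv2": nn.Conv2d(6, 16, 3),
    "fc1": nn.Linear(16 * 6 * 6, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

# Each layer contributes a weight tensor and a bias tensor
for name, p in layers.named_parameters():
    print(name, tuple(p.shape), p.numel())

num_tensors = len(list(layers.parameters()))
total_scalars = sum(p.numel() for p in layers.parameters())
print(num_tensors, total_scalars)
# 10 81194
```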


In [19]:
# Random 32x32 input. The expected input size of this network (LeNet) is 32x32.
input = torch.randn(1, 1, 32, 32)
out = net(input)
print(out) 

tensor([[ 0.0342, -0.0895, -0.0471, -0.0365, -0.0176, -0.1162,  0.1336, -0.0312,
          0.1568, -0.0211]], grad_fn=<AddmmBackward>)


In [20]:
# Zero the gradient buffers of all parameters, then backprop with random gradients
net.zero_grad()
out.backward(torch.randn(1, 10))

**Recap:**
+ `torch.Tensor` - A multi-dimensional array with support for autograd operations like backward(). Also holds the gradient w.r.t. the tensor.

+ `nn.Module` - Neural network module. Convenient way of encapsulating parameters, with helpers for moving them to GPU, exporting, loading, etc.

+ `nn.Parameter` - A kind of Tensor, that is automatically registered as a parameter when assigned as an attribute to a Module.

+ `autograd.Function` - Implements forward and backward definitions of an autograd operation. Every Tensor operation creates at least a single Function node that connects to functions that created a Tensor and encodes its history.
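The automatic registration of `nn.Parameter` can be seen in a small sketch. The `Scale` module below is hypothetical and exists only for illustration: the attribute wrapped in `nn.Parameter` shows up in `named_parameters()`, while the plain tensor attribute does not.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(3))   # registered automatically
        self.offset = torch.zeros(3)               # plain tensor: not registered

    def forward(self, x):
        return x * self.scale + self.offset

module = Scale()
param_names = [name for name, _ in module.named_parameters()]
print(param_names)
# ['scale']
```

Only registered parameters are returned by `parameters()`, so only they would be updated by an optimizer built from `module.parameters()`.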

### **1.2 Loss Function**
+ A loss function takes the (output, target) pair of inputs, and computes a value that estimates how far away the output is from the target.
+ [more loss functions](https://pytorch.org/docs/nn.html#loss-functions)

In [21]:
output = net(input)
target = torch.randn(10)
target = target.view(1, -1)
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

tensor(1.4548, grad_fn=<MseLossBackward>)
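With its default settings, `nn.MSELoss` returns the mean of the squared element-wise differences. A short check against the formula written out by hand, using random stand-in tensors of the same shape as `output` and `target` (not the values from the cell above):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(1, 10)   # stand-in for the network output
target = torch.randn(1, 10)   # stand-in for the target

loss = nn.MSELoss()(output, target)
manual = ((output - target) ** 2).mean()   # mean squared error by hand
print(loss.item(), manual.item())
```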


In [22]:
# So, when we call loss.backward(), the whole graph is differentiated w.r.t. the network's parameters, and all Tensors in the graph that have requires_grad=True will have their .grad Tensor accumulated with the gradient.

print(loss.grad_fn)
print(loss.grad_fn.next_functions[0][0])
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])

<MseLossBackward object at 0x7f602d7e37b8>
<AddmmBackward object at 0x7f602dc38e48>
<AccumulateGrad object at 0x7f602dc38c18>
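Instead of chaining `next_functions[0][0]` by hand, the graph can be walked recursively. This is a sketch on a tiny graph rather than on the network's loss; the `walk` helper is a hypothetical name, and the exact node class names (for example whether they carry a trailing `0`) depend on the PyTorch version.

```python
import torch

def walk(fn, depth=0, names=None):
    """Depth-first walk over an autograd graph, collecting node type names."""
    if names is None:
        names = []
    if fn is None:          # inputs that do not require grad have no node
        return names
    names.append(type(fn).__name__)
    print("  " * depth + type(fn).__name__)
    for next_fn, _ in fn.next_functions:
        walk(next_fn, depth + 1, names)
    return names

w = torch.ones(3, requires_grad=True)
loss = (w * 2).sum()
names = walk(loss.grad_fn)
```

The walk ends at an `AccumulateGrad` node, which is where the gradient is written into the `.grad` of the leaf tensor `w`.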


### **1.3 Backprop**
+ To backpropagate the error, all we have to do is call `loss.backward()`. You need to clear the existing gradients first, though; otherwise the new gradients are accumulated into the existing ones.
+ Now we shall call `loss.backward()`, and have a look at conv1’s bias gradients before and after the backward.

In [23]:
net.zero_grad() # zeros the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)

conv1.bias.grad before backward
tensor([0., 0., 0., 0., 0., 0.])
conv1.bias.grad after backward
tensor([-0.0036,  0.0146, -0.0029,  0.0294,  0.0171,  0.0029])
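The accumulation behaviour is easy to see on a single tensor. A minimal sketch, using a hand-made tensor `w` instead of the network's parameters: calling `backward()` twice without zeroing doubles the stored gradient, and zeroing restores the expected value.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w ** 2).sum().backward()
first = w.grad.clone()          # d/dw sum(w^2) = 2w -> [2., 4.]

(w ** 2).sum().backward()       # no zeroing: the new gradient is added to w.grad
accumulated = w.grad.clone()    # [4., 8.]

w.grad.zero_()                  # clear the buffer before the next backward pass
(w ** 2).sum().backward()
after_zero = w.grad.clone()     # [2., 4.] again

print(first, accumulated, after_zero)
```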


### **1.4 Update the weights**
Stochastic Gradient Descent (SGD):
+ `weight = weight - learning_rate * gradient`

In [25]:
# We can implement this using simple Python code:
learning_rate = 0.01
with torch.no_grad():  # the update itself should not be tracked by autograd
    for f in net.parameters():
        f.sub_(f.grad * learning_rate)

In [27]:
# torch.optim implements various update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop
optimizer.zero_grad() # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() # does the update
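For plain SGD (no momentum or weight decay), `optimizer.step()` applies the same rule as the manual update. A sketch that checks this on two identical copies of a small model; the `nn.Linear(4, 2)` layer and the random data are assumptions made for illustration, not the `net` from this section.

```python
import copy
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
model_a = nn.Linear(4, 2)             # updated by hand
model_b = copy.deepcopy(model_a)      # updated by optim.SGD
x = torch.randn(8, 4)
y = torch.randn(8, 2)
criterion = nn.MSELoss()
lr = 0.01

# Manual rule: weight = weight - learning_rate * gradient
criterion(model_a(x), y).backward()
with torch.no_grad():
    for p in model_a.parameters():
        p.sub_(lr * p.grad)

# The same single step through the optimizer
optimizer = optim.SGD(model_b.parameters(), lr=lr)
optimizer.zero_grad()
criterion(model_b(x), y).backward()
optimizer.step()

same = all(
    torch.allclose(pa, pb)
    for pa, pb in zip(model_a.parameters(), model_b.parameters())
)
print(same)
```

The optimizer becomes more useful once the rule is no longer a one-liner, as with momentum or Adam, because it keeps the extra per-parameter state for you.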