In [None]:
%matplotlib inline

<div class="alert alert-info"><h4>Further reading:</h4><p>This notebook is adapted from the <a href="https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html">PyTorch: A 60 Minute Blitz</a> tutorial on the PyTorch website. For documentation and more tutorials, visit <a href="https://pytorch.org">pytorch.org</a></p></div>


# Neural Networks

Neural networks can have a very large number of parameters. Setting them by hand would be impractical, but ``autograd`` computes the gradients needed to train them (see the previous notebook).

In PyTorch, neural nets are built using the ``torch.nn`` package, specifically ``nn.Module``. You just have to provide two things:
1. The layers of the network, and 
2. A method ``forward(input)`` that takes in some input data, passes it through the layers, and returns an ``output``

Once you've described your network, you have to train it. That will typically look something like:

- Iterate over a dataset of inputs, passing each input through the network and getting an output
- Compute the loss for each input (how far the output is from being correct)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:
  ``weight = weight - learning_rate * gradient``
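
As a minimal sketch of that update rule, the loop below minimizes a one-parameter loss by hand. The quadratic loss, the starting weight, and the learning rate are illustrative assumptions and have nothing to do with the LeNet example that follows.

```python
# Minimize loss(w) = (w - 3)^2 with plain gradient descent.
# The loss, the starting weight and the learning rate are made up for illustration.

def gradient(w):
    # d/dw (w - 3)^2 = 2 * (w - 3)
    return 2.0 * (w - 3.0)

weight = 0.0
learning_rate = 0.1

for step in range(100):
    weight = weight - learning_rate * gradient(weight)

print(weight)  # very close to 3.0, the minimizer of the loss
```

In a real network the gradient is not written by hand; it comes from backpropagation, which is what the rest of this notebook sets up.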

As an example, let's build [LeNet](https://doi.org/10.1109%2F5.726791), a network that classifies images of handwritten digits:

![convnet](./mnist_cnn.png)

## Define the network

Let’s define this network:


In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):

    def __init__(self):
        super(Net, self).__init__()
        # first some convolutions (see lecture notes for details on convolution)
        # 1 input image channel, 6 output channels, 5x5 square convolution kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5) # 6 in, 16 out, 5x5 convolution
        
        # Next some linear (technically "affine") operations, y = Wx + b
        # 16x5x5 in (see diagram above), 120 out
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84) # 120 in, 84 out
        self.fc3 = nn.Linear(84, 10) # 84 in, 10 out

    def forward(self, x):
        # given input x, apply the first convolution
        x = self.conv1(x)
        # then apply a relu activation function
        x = F.relu(x)
        # then subsample with a 2x2 max pooling window
        x = F.max_pool2d(x, 2)
        # repeat with the second convolution
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        # flatten the dimensions beyond the first (that's the batch dimension)
        x = torch.flatten(x, 1)
        # pass through the linear operations, with relus in between
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

In [None]:
net = Net()
print(net)
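
The ``16 * 5 * 5`` input size of ``fc1`` can be checked by hand. The sketch below assumes the settings used in ``Net`` (no padding, convolution stride 1, pooling stride equal to the pooling window) and a 32x32 input. ``conv_out`` and ``pool_out`` are hypothetical helper names, not PyTorch functions.

```python
def conv_out(size, kernel, stride=1, padding=0):
    # output side length of a convolution along one spatial dimension
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, window):
    # max pooling with stride equal to the window and no padding
    return (size - window) // window + 1

size = 32                      # 32x32 input image
size = conv_out(size, 5)       # conv1: 32 -> 28
size = pool_out(size, 2)       # pool:  28 -> 14
size = conv_out(size, 5)       # conv2: 14 -> 10
size = pool_out(size, 2)       # pool:  10 -> 5

flat_features = 16 * size * size   # 16 channels of 5x5 feature maps
print(size, flat_features)         # 5 400
```

This is also why the network expects 32x32 inputs: a different input size would change the number of features reaching ``fc1``.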

We defined the ``forward`` function, and the ``backward`` function (where gradients are computed) is automatically defined for us with ``autograd``.

The learnable parameters of a model are returned by ``net.parameters()``:

In [None]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's weight; the list continues with conv1's bias, conv2's weight, conv2's bias, etc.
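
To see what each entry of the parameter list is, ``named_parameters()`` pairs every tensor with its name. The sketch below rebuilds the same layer sizes in an ``nn.ModuleDict`` purely so that it runs on its own; the variable name ``layers`` is an assumption for illustration. Calling ``named_parameters()`` on the network defined above should list the same names and shapes.

```python
import torch.nn as nn

# Same layer sizes as Net, held in a ModuleDict so the block is self-contained.
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

for name, p in layers.named_parameters():
    print(name, tuple(p.shape), p.numel())

num_tensors = len(list(layers.parameters()))          # one weight and one bias per layer
total = sum(p.numel() for p in layers.parameters())   # number of individual scalars
print(num_tensors, total)                             # 10 61706
```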

Let's try a random 32x32 input:

In [None]:
input = torch.randn(1, 1, 32, 32) # 1 image, 1 channel, with dimensions 32x32
out = net(input)
print(out)

Zero the gradient buffers of all parameters and backpropagate with random
gradients:



In [None]:
net.zero_grad()
out.backward(torch.randn(1, 10))

<div class="alert alert-info"><h4>Note</h4>The torch.nn package generally expects inputs that are a mini-batch of samples, rather than a single sample. For example, nn.Conv2d will take in a 4D Tensor of nSamples x nChannels x Height x Width. If you have a single sample, you can use input.unsqueeze(0) to add a fake batch dimension.</div>
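
A short sketch of the batch dimension trick from the note, using a made-up single-channel image (the names ``single`` and ``batch`` are chosen for illustration):

```python
import torch

single = torch.randn(1, 32, 32)   # one image: nChannels x Height x Width
batch = single.unsqueeze(0)       # add a batch dimension of size 1 at position 0

print(single.shape)               # torch.Size([1, 32, 32])
print(batch.shape)                # torch.Size([1, 1, 32, 32])
```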

Before proceeding further, let's recap all the classes you’ve seen so far.

**Recap:**
  -  ``torch.Tensor`` - A *multi-dimensional array* with support for autograd
     operations like ``backward()``. Also *holds the gradient* w.r.t. the
     tensor.
  -  ``nn.Module`` - Neural network module. *Convenient way of
     encapsulating parameters*, with helpers for moving them to GPU,
     exporting, loading, etc.
  -  ``nn.Parameter`` - A kind of Tensor, that is *automatically
     registered as a parameter when assigned as an attribute to a*
     ``Module``.
  -  ``autograd.Function`` - Implements *forward and backward definitions
     of an autograd operation*. Every ``Tensor`` operation creates at
     least a single ``Function`` node that connects to functions that
     created a ``Tensor`` and *encodes its history*.
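
The automatic registration of ``nn.Parameter`` can be seen in a small sketch. ``Scale`` is a hypothetical module written only for this illustration; it has one ``nn.Parameter`` attribute and one plain tensor attribute.

```python
import torch
import torch.nn as nn

class Scale(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(3))   # registered as a parameter
        self.offset = torch.zeros(3)              # plain tensor attribute: not registered

names = [name for name, _ in Scale().named_parameters()]
print(names)  # ['gain']
```

Only ``gain`` would be seen by an optimizer built from ``Scale().parameters()``; the plain tensor is ignored.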

**At this point, we covered:**
  -  Defining a neural network
  -  Processing inputs and calling backward

**Still Left:**
  -  Computing the loss
  -  Updating the weights of the network

## Loss Function
A loss function takes the (output, target) pair of inputs, and computes a
value that estimates how far away the output is from the target.

There are several different
[loss functions](https://pytorch.org/docs/nn.html#loss-functions) under the
``nn`` package.
A simple one is ``nn.MSELoss``, which computes the mean-squared error
between the output and the target.

For example:



In [None]:
output = net(input)
target = torch.randn(10)  # a dummy target, for example
target = target.view(1, -1)  # make it the same shape as output
criterion = nn.MSELoss()  # nn.MSELoss is a class: instantiate it first, then call the instance

loss = criterion(output, target)
print(loss)
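
To make the definition concrete, the sketch below compares ``nn.MSELoss`` with the mean of squared differences written out by hand. The tensors are made-up toy values and the ``toy_`` names are chosen so they do not clash with the variables used in this notebook.

```python
import torch
import torch.nn as nn

toy_output = torch.tensor([[1.0, 2.0, 3.0]])
toy_target = torch.tensor([[1.0, 0.0, 5.0]])

toy_criterion = nn.MSELoss()                       # default reduction is 'mean'
toy_loss = toy_criterion(toy_output, toy_target)
manual = ((toy_output - toy_target) ** 2).mean()   # (0 + 4 + 4) / 3

print(toy_loss.item(), manual.item())              # both approximately 2.667
```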

Now, if you follow ``loss`` in the backward direction, using its ``.grad_fn`` attribute, you will see a graph of computations that looks like this:

    input -> conv2d -> relu -> maxpool2d -> conv2d -> relu
          -> maxpool2d -> flatten -> linear -> relu -> linear
          -> relu -> linear -> MSELoss -> loss

So, when we call ``loss.backward()``, the whole graph is differentiated
w.r.t. the neural net parameters, and all Tensors in the graph that have
``requires_grad=True`` will have their ``.grad`` Tensor accumulated with the
gradient.

For illustration, let's follow a few steps backward:

In [None]:
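print(loss.grad_fn)  # MSELoss (printed as an MseLossBackward node)
print(loss.grad_fn.next_functions[0][0])  # Linear (typically printed as an AddmmBackward node)
print(loss.grad_fn.next_functions[0][0].next_functions[0][0])  # first input of that Linear node; often AccumulateGrad (fc3's bias) rather than ReLU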
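
The same walk is easier to follow on a tiny graph. The sketch below uses made-up tensors ``x`` and ``y``; the exact node class names (such as the trailing ``0``) are an implementation detail and may differ between PyTorch versions.

```python
import torch

x = torch.ones(2, requires_grad=True)
y = (x * 2).sum()

last_op = y.grad_fn                       # node recorded for .sum()
prev_op = last_op.next_functions[0][0]    # node recorded for the multiplication

print(type(last_op).__name__)             # a SumBackward node
print(type(prev_op).__name__)             # a MulBackward node
```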

## Backprop
To backpropagate the error, all we have to do is call ``loss.backward()``. We first need to clear the existing gradients, though; otherwise the new gradients are added to the ones already stored.

Let's look at conv1's bias gradients before and after the backward.

In [None]:
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)
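
Gradient accumulation is easy to observe on a single tensor. In this sketch ``w`` is a made-up leaf tensor, and zeroing its ``.grad`` by hand plays the role that ``net.zero_grad()`` plays for every parameter of a module.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()        # gradient after one backward pass: [3., 3.]

(w * 3).sum().backward()      # no zeroing in between
second = w.grad.clone()       # the new gradient was added: [6., 6.]

w.grad.zero_()                # clear the buffer
(w * 3).sum().backward()
third = w.grad.clone()        # back to [3., 3.]

print(first, second, third)
```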

**Read Later:** The neural network package contains various modules and loss functions that form the building blocks of deep neural networks. A full list with documentation is [here](https://pytorch.org/docs/nn).


## Update the weights
A simple update rule is Stochastic Gradient Descent (SGD):

In [None]:
# weight = weight - learning_rate * gradient
learning_rate = 0.01
with torch.no_grad():  # the update itself must not be tracked by autograd
    for p in net.parameters():
        p.sub_(p.grad * learning_rate)

In practice, though, there's a package called ``torch.optim`` that implements this and many other update rules for us:

In [None]:
import torch.optim as optim

# create your optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01)

# in your training loop:
optimizer.zero_grad() # zero the gradient buffers
output = net(input)
loss = criterion(output, target)
loss.backward()
optimizer.step() # Does the update

<div class="alert alert-info"><h4>Note</h4>Observe how gradient buffers had to be manually set to zero using optimizer.zero_grad(). This is because gradients are accumulated as explained in the Backprop section.</div>
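
Putting the pieces together, a full training loop simply repeats those five steps. The sketch below is a toy setup under stated assumptions: synthetic regression data, a single ``nn.Linear`` layer instead of LeNet, and a learning rate and step count picked by hand. The names ``model``, ``loss_fn``, ``opt`` and ``losses`` are chosen for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Made-up regression data: y = x @ true_w + 0.5
x = torch.randn(64, 3)
true_w = torch.tensor([[1.0], [-2.0], [0.5]])
y = x @ true_w + 0.5

model = nn.Linear(3, 1)
loss_fn = nn.MSELoss()
opt = optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(200):
    opt.zero_grad()                     # clear gradients from the previous step
    step_loss = loss_fn(model(x), y)    # forward pass and loss
    step_loss.backward()                # compute gradients
    opt.step()                          # weight = weight - lr * gradient
    losses.append(step_loss.item())

print(losses[0], losses[-1])            # the loss should shrink substantially
```

Real training would iterate over mini-batches from a dataset rather than reuse one fixed batch, but the order of the calls inside the loop stays the same.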