In [None]:
%matplotlib inline

# Introduction to PyTorch

(Adapted from [Deep Learning with PyTorch: A 60 Minute Blitz](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html))


What is PyTorch?
================

It’s a Python-based scientific computing package. It's mainly used in two
ways:

- A deep learning research platform that provides good flexibility
   and speed. This is what we'll be using PyTorch for.
- A replacement for NumPy to use the power of GPUs. If you want to know what a GPU is, <a href="https://en.wikipedia.org/wiki/Graphics_processing_unit">you can check out the Wikipedia article after class</a>. We won't be using GPUs or learning about them in this class.

Getting Started
---------------

### Tensors

Tensors are similar to lists, except they make multidimensional computations easy. Work through the examples below.

In [None]:
import torch

Construct a 5x3 matrix, uninitialized:


In [None]:
x = torch.empty(5, 3)  # allocates memory without initializing it
print(x)

Fill a tensor with zeros:

In [None]:
x.zero_()
print(x)

Construct a Tensor from Python lists:

In [None]:
x = torch.Tensor([[1, 2, 3], [4, 5, 6]])
print(x)

Construct a randomly initialized matrix:



In [None]:
x = torch.rand(5, 3)
print(x)

Get tensor's size:


In [None]:
print(x.size())
print(x.shape) # Another way of getting the size

<div class="alert alert-info"><h4>Note</h4><p>``torch.Size`` is in fact a tuple, so it supports all tuple operations.</p></div>
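As a small sketch of what that note implies (the variable names here are illustrative only), a size can be unpacked, indexed, and compared like any other tuple:

```python
import torch

x = torch.rand(5, 3)
size = x.size()

rows, cols = size            # tuple unpacking
print(rows, cols)
print(size[0], len(size))    # indexing and len() work too
print(isinstance(size, tuple), size == (5, 3))
```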

### Operations

There are multiple syntaxes for operations. In the following
example, we will take a look at the addition operation.

Addition: syntax 1



In [None]:
y = torch.rand(5, 3)
print(x + y)

Addition: syntax 2

In [None]:
print(torch.add(x, y))

Addition: providing an output tensor as argument


In [None]:
result = torch.empty(5, 3)  # uninitialized tensor that will receive the sum
torch.add(x, y, out=result)
print(result)

Addition: in-place


In [None]:
# adds x to y
y.add_(x)
print(y)

<div class="alert alert-info"><h4>Note</h4><p>Any operation that mutates a tensor in-place is post-fixed with an ``_``.
    For example: ``x.copy_(y)``, ``x.t_()``, will change ``x``.</p></div>

You can use two numbers within the indexing operation to refer to a specific spot, or use syntax like ``[:,2]`` to refer to a particular column (and ``[2,:]`` to refer to a particular row). Try playing with this in the cell below - make sure you can do the following:
- Print out a specific row of the matrix using a single line of code. This one is done for you.
- Print out a specific column of the matrix using a single line of code.
- Replace a particular entry with another number



In [None]:
print(x)
print(x[1,:])

Resizing: If you want to resize/reshape a tensor, you can use its ``view`` method (e.g. ``x.view(16)``). Use print statements in the cell below to understand what y and z are, and what is meant by each element of the returned size. Then make your own 4x6 matrix and reshape it into a 12x2 matrix.


In [None]:
x = torch.randn(4, 4)
y = x.view(16)
z = x.view(-1, 8)  # the size -1 is inferred from other dimensions
print(x.size(), y.size(), z.size())

**Learn more later if you wish:**


  100+ Tensor operations, including transposing, indexing, slicing,
  mathematical operations, linear algebra, random numbers, etc.,
  are described
  [here](http://pytorch.org/docs/torch).



Autograd: automatic differentiation
===================================

Central to all neural networks in PyTorch is the ``autograd`` package.
Let’s first briefly visit this, and we will then go to training our
first neural network.


The ``autograd`` package provides automatic differentiation for all operations
on Tensors. It is a define-by-run framework, which means that your backpropagation is
defined by how your code is run, and that every single iteration can be
different. (Not quite sure what that means? That's okay, you'll learn more below. If you want to follow up after class, <a href="https://towardsdatascience.com/battle-of-the-deep-learning-frameworks-part-i-cff0e3841750">check out this blog post.</a>)

Let us see this in simpler terms with some examples.


Tensor
--------
``torch.Tensor`` is the central class of the package. If you set its attribute ``.requires_grad`` as ``True``, it starts to track all operations on it. When you finish your computation you can call ``.backward()`` and have all the gradients computed automatically (gradient is just a term for derivative when we have multiple variables - having the gradients computed automatically allows us to automatically perform backpropagation without figuring out the derivatives ourselves). The gradient for this tensor will be accumulated into .grad attribute.

To stop a tensor from tracking history, you can call .detach() to detach it from the computation history, and to prevent future computation from being tracked.

To prevent tracking history (and using memory), you can also wrap the code block in ``with torch.no_grad():``. This can be particularly helpful when evaluating a model because the model may have trainable parameters with ``requires_grad=True``, but for which we don’t need the gradients.
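Here is a minimal sketch of the two ways of switching off history tracking described above. The tensor ``w`` and the other variable names are made up for illustration.

```python
import torch

w = torch.ones(3, requires_grad=True)

tracked = w * 2                 # recorded in the computation history
print(tracked.requires_grad)    # True

detached = tracked.detach()     # same values, but cut off from the history
print(detached.requires_grad)   # False

with torch.no_grad():
    untracked = w * 2           # computed without recording any history
print(untracked.requires_grad)  # False
```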

There’s one more class which is very important for autograd implementation - a ``Function``.

``Tensor`` and ``Function`` are interconnected and build up an acyclic graph, that encodes a complete history of computation. Each tensor has a ``.grad_fn`` attribute that references a ``Function`` that has created the ``Tensor`` (except for Tensors created by the user - their grad_fn is None).

If you want to compute the derivatives, you can call ``.backward()`` on a ``Tensor``. If the ``Tensor`` is a scalar (i.e. it holds a single element), you don’t need to specify any arguments to ``backward()``. However, if it has more elements, you need to specify a ``gradient`` argument that is a tensor of matching shape.



Create a tensor and set ``requires_grad=True`` to track computation with it:

In [None]:
x = torch.ones(2, 2, requires_grad=True)
print(x)

Do an operation on the tensor:

In [None]:
y = x + 2
print(y)

``y`` was created as a result of an operation, so it has a ``grad_fn``.


In [None]:
print(y.grad_fn)

Do more operations on ``y``:


In [None]:
z = y * y * 3
out = z.mean()

print(z, out)

Gradients
---------
Let's perform backpropagation now.

In [None]:
out.backward()

Now that we've performed backpropagation, we'll print the gradients (i.e. $d(out)/dx$).


In [None]:
print(x.grad)

You should get a matrix in which every entry is ``4.5``. Let’s call the ``out``
*Tensor* “$o$”.
We have that $o = \frac{1}{4}\sum_i z_i$,
$z_i = 3(x_i+2)^2$ and $z_i\bigr\rvert_{x_i=1} = 27$. Try computing the derivative here with respect to the variable $x_{i}$ (one of the elements of the original 2x2 matrix of x's). Do you end up with 4.5? Double click this cell to see the derivation, but try it with your partner first.

<!--
$\frac{\partial o}{\partial x_i} = \frac{3}{2}(x_i+2)$, hence
$\frac{\partial o}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$.
-->
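
If you'd like to sanity-check the value without autograd, one option is a finite-difference estimate in plain Python: nudge one entry of $x$ slightly and see how much $o$ changes. This is only a rough sketch; the helper ``o_value`` and the step size ``eps`` are choices made for this illustration.

```python
def o_value(xs):
    """o = mean of 3 * (x_i + 2)^2 over all entries."""
    return sum(3 * (x + 2) ** 2 for x in xs) / len(xs)

xs = [1.0, 1.0, 1.0, 1.0]   # the 2x2 matrix of ones, flattened
eps = 1e-6

# Nudge only the first entry up and down and measure the change in o
up = [xs[0] + eps] + xs[1:]
down = [xs[0] - eps] + xs[1:]
estimate = (o_value(up) - o_value(down)) / (2 * eps)
print(round(estimate, 4))
```

The printed estimate should agree with the entry autograd reported in ``x.grad``.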



You can do many amazing things with autograd! Work through what's happening in the example below, and why you get the gradient results that you do. (Adding some print statements may be helpful...) What happens if you change the input to backward? <a href="https://pytorch.org/docs/stable/autograd.html#torch.autograd.backward">You can look in the docs to learn a little more about backward if you like.</a>


In [None]:
x = torch.randn(3, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

In [None]:
gradients = torch.tensor([0.1, 1.0, 0.0001], dtype=torch.float)
y.backward(gradients)

print(x.grad)

**Learn more later if you wish:**

Documentation of ``autograd`` and ``Function`` is at
http://pytorch.org/docs/autograd




Neural Networks
===============

Neural networks can be constructed using the ``torch.nn`` package.

You've now had a glimpse of ``autograd``; ``nn`` depends on
``autograd`` to define models and differentiate them.
An ``nn.Module`` contains layers, and a method ``forward(input)`` that
returns the ``output`` when given input.

For example, look at this network that classifies digit images:

![](images/mnist.png)

It is a feed-forward network. It takes the input, feeds it
through several layers one after the other, and then finally gives the
output.

A typical training procedure for a neural network is as follows:

- Define the neural network that has some learnable parameters (or
  weights)
- Iterate over a dataset of inputs
- Process input through the network
- Compute the loss (how far is the output from being correct - the error that we talked about in class)
- Propagate gradients back into the network’s parameters
- Update the weights of the network, typically using a simple update rule:
  ``weight = weight - learning_rate * gradient``

This should sound familiar from our discussions in class.
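To make the update rule concrete, here is a toy sketch in plain Python with a single weight. The loss $(w - 3)^2$, the starting weight, and the learning rate are all invented for illustration; a real network applies the same rule to every weight at once.

```python
def toy_loss(w):
    return (w - 3.0) ** 2

def toy_gradient(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2.0 * (w - 3.0)

weight = 0.0
learning_rate = 0.1

for step in range(50):
    # The simple update rule from the list above
    weight = weight - learning_rate * toy_gradient(weight)

print(weight, toy_loss(weight))
```

After enough steps the weight settles close to 3, the value that minimizes this toy loss.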

We'll define the network, and show how to process input and train it. If you want more detail on any of these points, <a href="https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html#define-the-network">the original tutorial gives more depth (mastery of this material is not needed for this class)</a>.

Define the network
------------------

Let’s define a convolutional neural network that we'll use to recognize images of digits; read through the code below and try to get a sense of what's happening - note down what doesn't make sense, and return to those questions at the end of the lab:



In [None]:
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.inputsize = 28 # Height of the input image (we assume square images)
        kernel_size1 = 5 # How big the convolutions are
        padding1 = (kernel_size1-1)//2 # We add extra zero columns to the sides of the image
        self.conv1 = nn.Conv2d(1, 10, kernel_size=kernel_size1, padding = padding1) # 10 filters (different convolution patterns) in first convolution layer
        self.pool = nn.MaxPool2d(2, 2)
        kernel_size2 = 5
        padding2 = (kernel_size2-1)//2
        self.conv2 = nn.Conv2d(10, 20, kernel_size=kernel_size2, padding = padding2) # 20 filters in second conv. layer
        # How many outputs we have for one dimension and 1 filter
        # Based on formula here: https://pytorch.org/docs/stable/nn.html#torch.nn.Conv2d
        output_size_one_dim= ((self.inputsize+2*padding1-(kernel_size1-1))/2 +2*padding2-(kernel_size2-1))/2 
        self.fc1size=int(20*output_size_one_dim**2) # Total inputs to fully-connected layer is number of filters * total filter outputs in x direction * total filter outputs in y direction
        self.fc1 = nn.Linear(self.fc1size, 50) # 50 hidden units in first fully-connected layer
        self.fc2 = nn.Linear(50, 10) # 10 output units

    def forward(self, x):

        # first convolutional layer
        h_conv1 = self.conv1(x)
        h_conv1 = F.relu(h_conv1)  # ReLU, the rectified linear unit - a particular instance of the "g" function we talked about in class
        h_conv1_pool = self.pool(h_conv1)

        # second convolutional layer
        h_conv2 = self.conv2(h_conv1_pool)
        h_conv2 = F.relu(h_conv2) 
        h_conv2_pool = self.pool(h_conv2)

        # fully-connected layer
        h_fc1 = h_conv2_pool.view(-1, self.fc1size) # this reshapes the tensor, so it's flat to give as input to the fc layer    
        h_fc1 = self.fc1(h_fc1)
        h_fc1 = F.relu(h_fc1) 


        # classifier output
        output = self.fc2(h_fc1)
        output = F.log_softmax(output, dim=1)

        return output

net = Net()

We only had to define the ``forward`` function, and the ``backward``
function (where gradients are computed) is automatically defined for us
using ``autograd``.
You can use any of the Tensor operations in the ``forward`` function.
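The size arithmetic in ``__init__`` can be hard to follow, so here is a rough sketch of the same calculation in plain Python. The helper ``conv_out`` is not part of PyTorch; it is a hypothetical function that applies the output-size formula from the ``Conv2d`` documentation, assuming a dilation of 1, and the same formula is assumed to apply to the pooling layers.

```python
def conv_out(size, kernel, padding=0, stride=1):
    """Output length along one dimension (dilation of 1 assumed)."""
    return (size + 2 * padding - (kernel - 1) - 1) // stride + 1

size = 28
size = conv_out(size, kernel=5, padding=2)   # conv1 keeps the size at 28
size = conv_out(size, kernel=2, stride=2)    # pooling halves it to 14
size = conv_out(size, kernel=5, padding=2)   # conv2 keeps the size at 14
size = conv_out(size, kernel=2, stride=2)    # pooling halves it to 7

fc1_inputs = 20 * size * size                # 20 filters, 7x7 outputs each
print(size, fc1_inputs)                      # 7 980
```

So the first fully-connected layer receives 980 inputs per image, which is the value stored in ``self.fc1size``.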




Let's try a random 28x28 input:


In [None]:
input = torch.randn(1, 1, 28, 28)
out = net(input)
print(out)

We get 10 outputs because we'll be learning to distinguish between 10 digits (0-9). The output unit with the highest activation will be our best guess at the true digit.



**Recap:**

  - ``torch.Tensor`` - A multi-dimensional array with support for autograd operations like ``backward()``. Also holds the gradient w.r.t. the tensor.
  - ``nn.Module`` - Neural network module.
  - ``nn.Parameter`` - A kind of ``Tensor`` that is automatically registered as a parameter when assigned as an attribute to a ``Module``.
  - ``autograd.Function`` - Implements *forward and backward definitions of an autograd operation*. Every ``Tensor`` operation creates at least a single ``Function`` node that connects to functions that created a ``Tensor`` and encodes its history.
  
  
Let's load some data to make things more concrete and train our network.

Loading MNIST data
-------------
For this tutorial, we will use the MNIST dataset of hand-drawn digits. MNIST is one of the most famous datasets in machine learning. It is a digit recognition task, where the goal is to classify images of handwritten digits with the right label, i.e., one of '0', '1', '2', ..., '9'. The training set consists of 60,000 images (roughly 6,000 per digit), and the test set has 10,000 additional images. The images in MNIST are greyscale and 28x28 pixels in size.

``torchvision`` is a module that has data loaders for common datasets such as
ImageNet, CIFAR10, MNIST, etc. and data transformers for images. <a href="https://pytorch.org/docs/stable/torchvision/datasets.html">You can read about the available datasets here.</a> Run the code below to load the MNIST data.

In [None]:
import torchvision
import torchvision.transforms as transforms

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5,), (0.5,))])  # MNIST images have a single channel

trainset = torchvision.datasets.MNIST(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=100,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.MNIST(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=100,
                                         shuffle=False, num_workers=2)

Let's show some of the training images to see what we're working with:

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# functions to show an image


def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))


# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images[:4]))
# print labels
print(' '.join('%5s' % labels[j].item() for j in range(4)))

Defining a loss function and optimizing
-------------
Loss is another word for the error we've been talking about in class. The loss function takes the (output, target) pair of inputs, and computes a
value that estimates how far away the output is from the target. There are several different
[loss functions](http://pytorch.org/docs/nn.html#loss-functions) under the
nn package.
A simple loss is ``nn.MSELoss``, which computes the mean-squared error
between the input and the target. This is similar to the error we've been using in class, but it averages over the outputs to give a single error value even though there are ten outputs.

For example:


In [None]:
input = images[0] # The first image shown up above
output = net(input.unsqueeze(0)) # PyTorch expects an extra dimension - as if we were passing in multiple images
target = torch.zeros(10) # Start with a zero for every digit
target[labels[0]] = 1 # Set the spot for the true label of the first image to 1
target = target.view(1, -1)  # make it the same shape as output
print("target:",target)
criterion = nn.MSELoss()

loss = criterion(output, target)
print("output:",output)
print("loss:",loss)
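
To see what a mean-squared error computes, here is the same calculation written out by hand in plain Python. The output values and the true label are made up for illustration and do not come from the network.

```python
toy_outputs = [0.1, 0.0, 0.2, 0.9, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0]  # made-up outputs
true_label = 3

# One-hot target: 1 at the true label, 0 everywhere else
toy_target = [1.0 if i == true_label else 0.0 for i in range(10)]

squared_errors = [(o - t) ** 2 for o, t in zip(toy_outputs, toy_target)]
mse = sum(squared_errors) / len(squared_errors)  # average over the ten outputs
print(mse)
```

The ten squared errors are averaged into a single number, which is why the loss is one value even though there are ten outputs.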

Now that we have a loss function, we'll improve the weights using (stochastic) gradient descent. "Stochastic gradient descent" is the form of gradient descent we've talked about most often in class: we optimize the weights after one (or a small batch) of training examples, rather than only after seeing the entire training set. This is computationally easier for large datasets, and in practice, updating with a small number of examples at once, rather than only one or the whole training set, tends to perform best.

Here's an example of using gradient descent in PyTorch:

In [None]:
import torch.optim as optim

# create the optimizer - SGD is stochastic gradient descent,
# net is our neural net and lr is learning rate
optimizer = optim.SGD(net.parameters(), lr=0.001) 

# to optimize for our single example (from above)
optimizer.zero_grad()   # zero the gradient buffers
output = net(input.unsqueeze(0))
print(output)
print(target.squeeze(0))
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update

# Now we'll see how our loss changed
output = net(input.unsqueeze(0))
loss = criterion(output, target)
print(loss) # Compare this to what you saw above for the loss - it's a little smaller; we'll need more data to do well though

**Learn more later if you wish:**

  The neural network package contains various modules and loss functions
  that form the building blocks of deep neural networks. A full list with
  documentation is [here](http://pytorch.org/docs/nn)

Training the network
-------------

To fully train the network, we'll do the same sort of thing as above, with the optimizer, except we'll run through the whole training set. Above, we defined ``trainloader`` - here's the line again:

In [None]:
trainloader = torch.utils.data.DataLoader(trainset, batch_size=100,
                                          shuffle=True, num_workers=2)


batch_size refers to how many images we'll use for each gradient descent update. So, this says to use 100 images for each update. The DataLoader class is set up so that when we iterate over the trainloader, the amount of data in each iteration is equal to the batch_size. ``shuffle`` means that each time we iterate through the trainloader again, the order of the images will be shuffled. (Confused by this? Try printing the ``shape`` of the labels or inputs in the "Training loop" cell below. On any tensor, the shape attribute is defined so you can say things like ``print(labels.shape)``. Because this is a big loop, I'd recommend saying ``break`` afterwards so you don't print thousands of lines...)
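As a quick sketch of how ``batch_size`` shapes each iteration, the example below builds a loader over 250 random "images" rather than MNIST. The ``fake_*`` names and the random data are assumptions made purely for illustration.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 250 fake single-channel 28x28 "images" with fake labels 0-9
fake_images = torch.randn(250, 1, 28, 28)
fake_labels = torch.randint(0, 10, (250,))
fake_loader = DataLoader(TensorDataset(fake_images, fake_labels),
                         batch_size=100, shuffle=True)

# Record the shape of the inputs and labels in every batch
batch_shapes = [(inputs.shape, labels.shape) for inputs, labels in fake_loader]
for shapes in batch_shapes:
    print(shapes)
```

With 250 items and a batch size of 100, the loader yields two full batches and a final, smaller batch of 50.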

Before we get to running the loop, though, we're going to make one other change. We're going to use a different loss function. Squared-error doesn't make that much sense for a classification problem with 10 classes because it will spend lots of time trying to exactly match the zeros on all the classes that are incorrect when what we care about is that the highest output is on the output node representing the correct class. We'll use the CrossEntropy loss function for this; <a href="https://pytorch.org/docs/stable/nn.html#torch.nn.CrossEntropyLoss">you can read about it in the PyTorch documentation</a> if you'd like to know the details, but intuitively, this loss is treating the output nodes like probabilities of each class, and it decreases as the probability on the correct class increases. The cell right below this changes the criterion we'll optimize, and also sets up the optimizer.

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=.01) # Re-create the optimizer as a reminder of where the gradient descent comes in; note the learning rate is now 0.01 rather than 0.001
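
For intuition about the cross-entropy loss, here is a hand-rolled sketch in plain Python for a single example with three classes. The function ``cross_entropy`` and the scores are made up for illustration; this is not PyTorch's implementation, only the same idea of taking the negative log of the softmax probability on the correct class.

```python
import math

def cross_entropy(scores, true_class):
    """Negative log of the softmax probability assigned to true_class."""
    exps = [math.exp(s) for s in scores]
    prob = exps[true_class] / sum(exps)
    return -math.log(prob)

true_class = 2
unsure = [0.0, 0.0, 0.0]      # equal scores: probability 1/3 on each class
confident = [0.0, 0.0, 4.0]   # most of the probability on the correct class

print(cross_entropy(unsure, true_class))
print(cross_entropy(confident, true_class))
```

The second loss is much smaller than the first, matching the intuition that the loss decreases as the probability on the correct class increases.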


In [None]:
for epoch in range(1):  # Usually, we'd loop over the data multiple times, but to save time, we'll just do it once

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs
        inputs, labels = data
        
        # We'd need the line below if we were using MSELoss because that criterion expects a tensor that has 10
        # outputs. We don't need it if you use the cross entropy loss because that's specifically for classification
        # problems and thus can work with getting just a label.
        # labels = torch.eye(10).index_select(dim=0, index=labels)
        
        # zero the parameter gradients
        optimizer.zero_grad()
        
        # forward + backward + optimize
        outputs = net(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 100 == 99:    # print every 100 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 100))
            running_loss = 0.0

print('Finished Training')

So what does that mean? The loss went down, but is it any good? We can check this by comparing the class label that the neural network
outputs to the ground truth (the label from the dataset). If the prediction is
correct, we add the sample to the list of correct predictions.

Let's try this with a few images from the test set. The test set is more digit data, but not digit data we've trained on (why might we want to focus on accuracy on a test set rather than accuracy on the training set?).

In [None]:
dataiter = iter(testloader)
images, labels = next(dataiter)
imshow(torchvision.utils.make_grid(images[:4]))
print('GroundTruth: ', ' '.join('%5s' % labels[j].item() for j in range(4)))

Okay, now let us see what the neural network thinks these examples above are:


In [None]:
outputs = net(images)
print(outputs[:4])

The outputs are activations for the 10 classes.
The higher the activation for a class, the more the network
thinks that the image is of the particular class.
So, let's get the index with highest activation:



In [None]:
_, predicted = torch.max(outputs.data, 1)

print('Predicted: ', ' '.join('%5s' % predicted[j].item()
                              for j in range(4)))

How well does it do? Compare the predicted to what you saw as the true labels above.

Let's look at how the network performs on the whole dataset. We'll calculate how many of the 10000 test images it classifies correctly.



In [None]:
correct = 0
total = 0
with torch.no_grad():  # no gradients are needed for evaluation
    for data in testloader:  # loop over every batch in the test set
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

That looks waaay better than chance, which is 10% accuracy (randomly picking
a class out of 10 classes).
Seems like the network learned something, even if it's not perfect.



Extra time?
---------
If you have extra time in class while we're looking at the lab, there are lots of things you might try.
- Go back to any questions you had after reading the neural net code. Which things do you still not understand? Try looking in the documentation to learn more - e.g. <a href="https://pytorch.org/docs/stable/nn.html#torch.nn.Conv2d">the documentation for Conv2d</a> may help with understanding what's going on there.
- Substitute in a different dataset. <a href="https://pytorch.org/docs/stable/torchvision/datasets.html#torchvision.datasets.FashionMNIST">FashionMNIST</a> is one dataset that has the same dimensions as MNIST. <a href="https://github.com/zalandoresearch/fashion-mnist">Read about it here.</a>
- Experiment with changing the learning rate. How does the loss change over time? You may want to add additional loops through the training set. Make sure you reinitialize the network each time (otherwise, it's already starting from pretty good weights!).
- Calculate the accuracy on the training set and the test set at each iteration of training.
- Examine some cases where the network makes errors - do they seem like reasonable errors to you?
- Look in the documentation of PyTorch for the parameters of the optimizer. What other optimizers might you try? Do they make much of a difference? What happens if you use a different criterion, like MSELoss?