# Getting Started With PyTorch

In this tutorial, we're going to dive into the basics of running [PyTorch](http://pytorch.org/) on Linux, including installation, creating and training a simple neural network that can recognize digits, and finally a more complicated example that uses convolutional neural networks (CNNs) to improve accuracy. This won't be a full introduction to neural networks, but I'll be explaining concepts as they crop up in our code.

While a computer with a GPU is not entirely necessary for this tutorial, it's recommended. If you'd prefer following along in a Jupyter notebook, you can find this article on [GitHub](https://github.com/falloutdurham).

## Installing PyTorch

The easiest way to install PyTorch is to use the Anaconda Python distribution. If you have it installed, getting the latest PyTorch is just entering this on the command line:

		conda install pytorch torchvision -c pytorch
		
If you'd rather use Python's `pip`, then for Python 2.7, it's:

		pip install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp27-cp27mu-linux_x86_64.whl
		pip install torchvision
		
And for Python 3.6, you'd use:

		pip3 install http://download.pytorch.org/whl/cu80/torch-0.3.0.post4-cp36-cp36m-linux_x86_64.whl
		pip3 install torchvision		

Note that if you want to use GPU-accelerated calculations, you'll need to have the [CUDA](https://developer.nvidia.com/cuda-zone) libraries installed as well (and therefore an [NVIDIA](https://nvidia.com) graphics card).

## Our First Model

Having got PyTorch installed, we're going to do the "Hello world" of deep learning, which is creating a neural network that can look at the images of handwritten digits from a dataset called MNIST and output which number it's looking at. Here's what some of the digits look like:

![](mnist.png)

Firstly, we're going to need to get our hands on the dataset. While we could download these directly from the [MNIST website](http://yann.lecun.com/exdb/mnist/) and build scaffolding to load them into PyTorch, the framework allows us to download standard reference datasets like MNIST, CIFAR-10, COCO, and others without much fuss.

	

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.autograd import Variable



In [2]:
# Named `transform` (singular) so that we don't shadow the imported
# `transforms` module, which would break this cell if it were run twice.
transform = transforms.Compose([
                               transforms.ToTensor(),
                               transforms.Normalize((0.1307,), (0.3081,))])

train_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=True, download=True,
               transform=transform),
        batch_size=64, shuffle=True)
test_loader = torch.utils.data.DataLoader(
        datasets.MNIST('../data', train=False, transform=transform),
        batch_size=64, shuffle=True)

This code creates two `DataLoader` objects. The first downloads the MNIST dataset if it isn't already present (so `test_loader` simply reuses the files that `train_loader` downloaded), and each serves up random batches of 64 images: `train_loader` from MNIST's 60,000 training images and `test_loader` from its 10,000 test images. You can also see a `transform` argument applied to both datasets. PyTorch's `torchvision` package lets you build a pipeline of transformations for data augmentation, including random cropping, rotation, reflection, and scaling, which are applied to images as they get pulled out of the DataLoader.

In our example, we're not doing any augmentation, but we are taking advantage of the pipeline to turn the image data into a tensor (in MNIST's case, a 1x28x28 tensor, as the images are all grayscale and 28x28 pixels) and to normalize that tensor using the mean and standard deviation of the MNIST dataset as a whole. This takes us from an array of integer pixels going from 0…255 to a tensor of floating-point values running from about -0.42 to 2.82, with a mean of roughly 0 and a standard deviation of roughly 1 across the dataset. We do this because neural network training does a lot better with small, centered inputs than with the full integer pixel values.
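
To make the normalization step concrete, here's a small sketch in plain Python (no PyTorch needed) of the arithmetic that `ToTensor()` followed by `Normalize((0.1307,), (0.3081,))` applies to a single pixel. The helper name `normalize_pixel` is made up for this illustration; it isn't part of `torchvision`.

```python
# Plain-Python sketch of the per-pixel arithmetic. `normalize_pixel` is a
# hypothetical helper for illustration, not a torchvision function.
MNIST_MEAN = 0.1307  # mean passed to transforms.Normalize in the loader code
MNIST_STD = 0.3081   # standard deviation passed to transforms.Normalize

def normalize_pixel(raw_value):
    """Map an integer pixel in 0..255 to its normalized float value."""
    scaled = raw_value / 255.0                 # ToTensor(): 0..255 -> 0..1
    return (scaled - MNIST_MEAN) / MNIST_STD   # Normalize(): subtract mean, divide by std

darkest = normalize_pixel(0)      # a completely black pixel
brightest = normalize_pixel(255)  # a completely white pixel
print(round(darkest, 2), round(brightest, 2))
```

A black pixel lands a little below zero and a white pixel a little below three, so the values are small and centered rather than squeezed into a fixed symmetric range.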
 
Next, let's create our first neural network by creating a new Python class that inherits from PyTorch's `nn.Module`:


In [3]:
class FirstNet(nn.Module):
    def __init__(self,image_size):
        super(FirstNet, self).__init__()
        self.image_size = image_size
        self.fc0 = nn.Linear(image_size, 1000)
        self.fc1 = nn.Linear(1000, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = x.view(-1, self.image_size)
        x = F.relu(self.fc0(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.log_softmax(x, dim=1)  # normalize across the 10 classes, not across the batch

The general convention for these network classes is that you create all your layers in the constructor, and then lay out their relationship in the `forward()` method. Here, we're creating a very simple network where all our layers are `Linear`, the classic 'fully-connected' neural network layer, which applies a linear transformation (a weighted sum of the inputs plus a bias) to its input; the weights in each layer are initialized randomly. We start with `image_size`, the number of pixels in each MNIST image, and the network ends with 10 outputs, corresponding to the 10 digits (zero to nine) that we're attempting to recognize.
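
As a rough illustration of what a fully-connected layer computes, here's a plain-Python sketch. The helpers `linear` and `count_params` are names I've invented for this example rather than PyTorch functions, and the tiny weights are arbitrary; only the layer sizes (784, 1000, 50, 10) come from `FirstNet`.

```python
# Plain-Python sketch of a fully-connected layer. `linear` and `count_params`
# are illustrative helpers, not PyTorch functions.
def linear(x, weights, bias):
    """Compute y[j] = sum_i weights[j][i] * x[i] + bias[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def count_params(in_features, out_features):
    """One weight per input/output pair, plus one bias per output."""
    return in_features * out_features + out_features

# A tiny layer with 3 inputs and 2 outputs (arbitrary example values).
y = linear([1.0, 2.0, 3.0],
           weights=[[0.5, 0.0, -1.0], [1.0, 1.0, 1.0]],
           bias=[0.1, -0.5])

# Layer sizes taken from FirstNet: 784 -> 1000 -> 50 -> 10.
total = (count_params(784, 1000) + count_params(1000, 50)
         + count_params(50, 10))
print(y, total)
```

Almost all of the trainable values sit in the first layer, because every one of the 784 pixels is connected to every one of the 1000 units.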

`forward()` then shows us how an image flows through the network. Firstly, we have to convert the image tensor (1x28x28 once it comes through the transformation pipeline) into a shape that the first layer can understand. We do this via the `view()` method, which in this case _flattens_ each image into a row of 784 values, the input size of the first linear layer. The `-1` tells PyTorch to work out the remaining dimension itself, so a batch of 64 images becomes a 64x784 tensor.
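
The shape arithmetic behind that flattening can be sketched in a few lines of plain Python. `infer_view_shape` is a hypothetical helper that mimics how a `-1` dimension gets resolved; it is not how PyTorch implements `view()` internally.

```python
# Plain-Python shape arithmetic. `infer_view_shape` is a hypothetical helper,
# not PyTorch's implementation of view().
def infer_view_shape(shape, row_length):
    """Return the (rows, row_length) shape that view(-1, row_length) would give."""
    total = 1
    for dim in shape:
        total *= dim  # total number of elements in the tensor
    if total % row_length != 0:
        raise ValueError("elements do not divide evenly into rows")
    return (total // row_length, row_length)

single_image = infer_view_shape((1, 28, 28), 28 * 28)
full_batch = infer_view_shape((64, 1, 28, 28), 28 * 28)
print(single_image, full_batch)
```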

The next three lines of code all apply the layers to the incoming data in turn, but there's also an `F.relu()` call happening at each level. What is this? Well, it's an example of an _activation function_. These functions are applied to the outputs of each layer and insert non-linearity into the system. Without them, a stack of linear layers collapses into a single linear transformation, so we'd essentially just have a linear model; with them, neural networks gain their power as universal function approximators. There are many different types of activation function, but most modern deep learning architectures will use the ReLU, or Rectified Linear Unit. While this sounds intimidating, it's literally just a function f(x) where f(x) = max(x,0), i.e. the function returns zero if the output is less than zero, or the original output if it's greater than zero.
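
Here's a minimal plain-Python sketch of both ideas: ReLU itself, and why stacking linear layers without an activation buys you nothing. The function names and the weights `2.0` and `-3.0` are arbitrary choices for illustration.

```python
# Plain-Python sketch; the function names and weights are made up for illustration.
def relu(x):
    """f(x) = max(x, 0)"""
    return max(x, 0.0)

layer_outputs = [-2.0, -0.5, 0.0, 1.5, 3.0]
activated = [relu(v) for v in layer_outputs]

# Two linear layers with no activation in between...
def two_linear_layers(x, w1=2.0, w2=-3.0):
    return w2 * (w1 * x)

# ...behave exactly like one linear layer whose weight is w1 * w2.
def one_linear_layer(x, w=-6.0):
    return w * x

collapses = all(two_linear_layers(x) == one_linear_layer(x)
                for x in layer_outputs)
print(activated, collapses)
```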

Finally, we apply a different activation, `log_softmax`, to the output of the final layer. Softmax squashes the ten outputs into the range 0…1 so that they sum to 1 and can be read as probability estimates for each class; `log_softmax` returns the logarithm of those probabilities, which is what the loss function we'll use during training expects. To determine the predicted class of an image, we pick the class with the highest value.
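
The network's `forward()` returns `F.log_softmax`, so it may help to see the relationship between raw scores, softmax probabilities, and their logarithms. This is a plain-Python sketch for a single image with three made-up scores; the helper functions are my own, not PyTorch's.

```python
import math

# Plain-Python sketch; `softmax` and `log_softmax` here are illustrative
# re-implementations, not the PyTorch functions.
def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def log_softmax(scores):
    """Logarithm of the softmax probabilities (always <= 0)."""
    return [math.log(p) for p in softmax(scores)]

scores = [1.0, 3.0, 0.5]  # made-up outputs for three classes
probs = softmax(scores)
log_probs = log_softmax(scores)

# The predicted class is the index of the largest value; taking the log
# doesn't change which one that is.
predicted = max(range(len(log_probs)), key=lambda i: log_probs[i])
print(probs, log_probs, predicted)
```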

Creating an instance of the network is done in the traditional Python way of calling the constructor:

In [4]:
model = FirstNet(image_size=28*28)


If you have a GPU-enabled machine, you can copy this model to the GPU by calling the `cuda()` method:

In [5]:
model.cuda()

FirstNet(
  (fc0): Linear(in_features=784, out_features=1000)
  (fc1): Linear(in_features=1000, out_features=50)
  (fc2): Linear(in_features=50, out_features=10)
)

## Training And Testing

Having created our model, we now need to train it. In some frameworks, like Keras, most of this will be handled for you behind the scenes, but PyTorch requires that you write an explicit training procedure. Here's an example, taken from the PyTorch examples:


In [11]:
optimizer = optim.SGD(model.parameters(), lr=0.001)

def train(epoch):
    model.train()
    for batch_idx, (data, labels) in enumerate(train_loader):
        if torch.cuda.is_available():
            data, labels = data.cuda(), labels.cuda()
        data, labels = Variable(data), Variable(labels)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output, labels)
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.data[0]))

While there's a lot going on here, it's fairly straightforward if we take it a line at a time. Firstly, before we create the `train()` function itself, we instantiate our optimizer, which will update the weights of the network's layers after each batch from the DataLoader, nudging them towards values that will hopefully make the predictions more accurate as training continues.

There are various optimizers that you can choose from, including [RMSProp](http://ruder.io/optimizing-gradient-descent/index.html#rmsprop), [AdaGrad](http://ruder.io/optimizing-gradient-descent/index.html#adagrad), and the one most commonly used today, [Adam](http://ruder.io/optimizing-gradient-descent/index.html#adam). But here we're simply going to use the classic vanilla [Stochastic Gradient Descent](http://ruder.io/optimizing-gradient-descent/index.html#stochasticgradientdescent) with a learning rate of 0.001. The learning rate tells the optimizer how far to move the weights during each update; too high and your network may overshoot and bounce between high and low accuracy without settling, while too low may see training take a very long time. 0.001 is a decent starting choice.
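
To get a feel for the learning rate, here's a toy sketch in plain Python that runs gradient descent on f(w) = w², whose minimum is at w = 0. The function `descend` and the three learning rates are my own choices for this toy problem; the values that work well here say nothing about what works for MNIST.

```python
# Toy gradient descent on f(w) = w**2, whose gradient is 2*w.
# `descend` and the learning rates below are illustrative assumptions.
def descend(lr, steps=50, w=1.0):
    """Apply the SGD update rule w <- w - lr * gradient for a number of steps."""
    for _ in range(steps):
        grad = 2.0 * w
        w = w - lr * grad
    return w

too_high = descend(lr=1.1)     # each step overshoots the minimum and lands further away
too_low = descend(lr=0.001)    # moves in the right direction, but barely
reasonable = descend(lr=0.1)   # steadily closes in on the minimum at 0
print(too_high, too_low, reasonable)
```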

In the `train()` function itself, we first put the model in training mode and then loop through all the batches in the dataset. For each batch, we copy the image data and the labels (i.e. what digit an image represents) to the GPU if available, and call `optimizer.zero_grad()` to clear the gradients left over from the previous batch.

The images in this batch are then passed through the model to generate the `output` tensor, our predictions. This is then compared to what they should have been (the `labels`) via a _loss function_. We're using the negative log likelihood loss function (`F.nll_loss`) here, which is commonly used in classification architectures and expects the log-probabilities that `log_softmax` produces.
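
As a sketch of what the negative log likelihood loss measures, here's a plain-Python version for a batch of two images and three classes. The helper `nll_loss` below is my own simplified re-implementation (averaging over the batch), and the probabilities are made up.

```python
import math

# Simplified plain-Python re-implementation for illustration only.
def nll_loss(log_probs_batch, labels):
    """Average of -log(probability assigned to the correct class)."""
    picked = [-row[label] for row, label in zip(log_probs_batch, labels)]
    return sum(picked) / len(picked)

# Made-up log-probabilities for two images over three classes.
batch = [
    [math.log(0.7), math.log(0.2), math.log(0.1)],
    [math.log(0.1), math.log(0.1), math.log(0.8)],
]

good_loss = nll_loss(batch, labels=[0, 2])  # the model favored the right classes
bad_loss = nll_loss(batch, labels=[1, 0])   # the model favored the wrong classes
print(good_loss, bad_loss)
```

The loss is small when the model puts high probability on the correct class and grows quickly as that probability shrinks.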

We then invoke PyTorch magic. The call to `loss.backward()` runs backpropagation, working out the gradient of the loss with respect to the values in the layers (the 'weights'). Then, by calling `optimizer.step()`, we adjust the weights using this gradient and the optimizer's update rule. You can think of this as a ball rolling through a landscape, where we're trying to get to the bottom. Each step, we nudge the network in the direction we think is down.
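
Here's a hand-worked sketch of one such step for a 'network' with a single weight, in plain Python. I'm using a squared-error loss as a stand-in because its gradient is easy to write down; it is not the loss used in this tutorial, and all the names and numbers are invented for illustration.

```python
# One-weight model: prediction = w * x, loss = (prediction - y) ** 2.
# The squared-error loss is a stand-in chosen for easy differentiation.
def loss_fn(w, x, y):
    return (w * x - y) ** 2

def analytic_grad(w, x, y):
    """Chain rule: d(loss)/dw = 2 * (w*x - y) * x. This is the kind of value backward() computes."""
    return 2.0 * (w * x - y) * x

def numeric_grad(w, x, y, eps=1e-6):
    """Finite-difference estimate, used here only to sanity-check the chain rule."""
    return (loss_fn(w + eps, x, y) - loss_fn(w - eps, x, y)) / (2 * eps)

w, x, y = 0.5, 2.0, 3.0
grad = analytic_grad(w, x, y)
check = numeric_grad(w, x, y)

lr = 0.1
new_w = w - lr * grad  # the update an SGD step applies to each weight

loss_before = loss_fn(w, x, y)
loss_after = loss_fn(new_w, x, y)
print(grad, check, loss_before, loss_after)
```

The gradient is negative, so the step increases `w`, and the loss drops as a result: one nudge downhill.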

Finally, every 100 batches we print out some debugging information: the current epoch, how far through it we are, and the loss on the current training batch.




The `test()` function below then switches the model into evaluation mode, makes predictions on the test set, and reports the accuracy of the model. If we run the train/test cycle for 10 iterations (also known as _epochs_), we get an accuracy of about 71% in the run shown at the end of this article. Not bad, but we can do better without much effort.

In [7]:
def test():
    model.eval()
    test_loss = 0
    correct = 0
    for data, target in test_loader:
        if torch.cuda.is_available():
            data, target = data.cuda(), target.cuda()
        data, target = Variable(data, volatile=True), Variable(target)
        output = model(data)
        test_loss += F.nll_loss(output, target, size_average=False).data[0] # sum up batch loss
        pred = output.data.max(1, keepdim=True)[1] # get the index of the max log-probability
        correct += pred.eq(target.data.view_as(pred)).cpu().sum()

    test_loss /= len(test_loader.dataset)
    print('\nTest set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        test_loss, correct, len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
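
The accuracy bookkeeping in `test()` boils down to taking the index of the largest log-probability for each image and counting how often it matches the label. Here's a plain-Python sketch with three made-up predictions; `argmax` and `accuracy` are helpers I've written for this illustration.

```python
# Plain-Python sketch of the accuracy calculation; the helpers and the
# log-probabilities below are made up for illustration.
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

def accuracy(log_probs_batch, labels):
    """Return (number correct, fraction correct) for a batch."""
    predictions = [argmax(row) for row in log_probs_batch]
    correct = sum(1 for p, t in zip(predictions, labels) if p == t)
    return correct, correct / len(labels)

batch = [
    [-0.1, -2.5, -3.0],  # predicts class 0
    [-1.8, -0.3, -2.2],  # predicts class 1
    [-0.9, -1.2, -1.1],  # predicts class 0
]
labels = [0, 1, 2]

correct, fraction = accuracy(batch, labels)
print(correct, fraction)
```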


## Convolutions For The Win!

Most computer vision deep learning architectures these days are made up of stacks of convolutional layers, forming _convolutional neural networks (CNNs)_, instead of the fully-connected layers shown above. These layers can be thought of as a group of small filters that pass over the image, each trained to look for certain things, so one filter might end up recognizing eyes, another might seek out noses, and so on. Here's a very basic CNN:

In [8]:
class CNNNet(nn.Module):
    def __init__(self):
        super(CNNNet, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
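
If you're wondering where the `320` in `fc1` comes from, it's the number of values left once a 28x28 image has passed through both convolution and pooling stages. Here's a plain-Python sketch of that shape arithmetic, assuming the defaults used above (stride 1 and no padding for the convolutions, non-overlapping 2x2 pooling windows); `conv_out` and `pool_out` are hypothetical helper names.

```python
# Shape arithmetic for CNNNet. `conv_out` and `pool_out` are hypothetical
# helpers; they assume stride-1, unpadded convolutions and non-overlapping pools.
def conv_out(size, kernel_size, stride=1, padding=0):
    """Width/height after a convolution."""
    return (size + 2 * padding - kernel_size) // stride + 1

def pool_out(size, window):
    """Width/height after max pooling with a window x window kernel."""
    return size // window

size = 28                   # MNIST images are 28x28
size = conv_out(size, 5)    # after conv1 (kernel_size=5)
size = pool_out(size, 2)    # after the first max_pool2d(..., 2)
size = conv_out(size, 5)    # after conv2 (kernel_size=5)
size = pool_out(size, 2)    # after the second max_pool2d(..., 2)

channels = 20               # conv2 produces 20 output channels
flattened = channels * size * size
print(size, flattened)
```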

If we re-initialize the optimizer and create a new `model` with this network and run again for 10 epochs, we reach an accuracy of about 92%. Aside from the convolutional layers (`Conv2d`), the other new concepts introduced here are MaxPooling, which is a form of downsampling, and Dropout, which forces the network to randomly discount a number of activations when it's in training mode. This helps the model to train in a more generalizable fashion, i.e. learning to discern the structure of what makes a 1 instead of just learning to recognize exact pixel values from the training images.
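
Both ideas are simple enough to sketch in plain Python. The `max_pool_2x2` and `dropout` functions below are my own illustrative versions working on lists, not the PyTorch implementations; the dropout sketch zeroes individual values and scales the survivors by 1/(1-p) so the expected total stays the same, and it does nothing outside training mode.

```python
import random

# Illustrative list-based versions, not the PyTorch implementations.
def max_pool_2x2(grid):
    """Keep the largest value from each non-overlapping 2x2 block."""
    return [[max(grid[r][c], grid[r][c + 1],
                 grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, len(grid[0]), 2)]
            for r in range(0, len(grid), 2)]

def dropout(values, p=0.5, training=True, rng=random):
    """Zero each value with probability p while training; scale the rest by 1/(1-p)."""
    if not training:
        return list(values)  # evaluation mode: pass everything through unchanged
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

grid = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 0, 5, 6],
        [1, 2, 7, 8]]
pooled = max_pool_2x2(grid)  # a 4x4 grid shrinks to 2x2

activations = [1.0] * 8
dropped = dropout(activations, rng=random.Random(0))  # seeded for repeatability
untouched = dropout(activations, training=False)
print(pooled, dropped, untouched)
```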

## Where To Go Next

That's it for this tutorial - if you're eager to learn more about the framework, then the [PyTorch tutorials site](http://pytorch.org/tutorials/) has all sorts of examples, from image classification to translating text between different languages. If you're looking to explore deep learning in general using PyTorch, I recommend having a look at the [fast.ai](https://fast.ai) course. It'll take you through all the theory of deep learning while staying focussed on its applications in a very accessible manner.

In [9]:
print(model)

FirstNet(
  (fc0): Linear(in_features=784, out_features=1000)
  (fc1): Linear(in_features=1000, out_features=50)
  (fc2): Linear(in_features=50, out_features=10)
)


In [12]:
for epoch in range(1, 10 + 1):
    train(epoch)
    test()

  



Test set: Average loss: 2.0090, Accuracy: 4920/10000 (49%)


Test set: Average loss: 1.5291, Accuracy: 6129/10000 (61%)


Test set: Average loss: 1.2115, Accuracy: 6621/10000 (66%)


Test set: Average loss: 1.0563, Accuracy: 6810/10000 (68%)


Test set: Average loss: 0.9694, Accuracy: 6914/10000 (69%)


Test set: Average loss: 0.9144, Accuracy: 6974/10000 (70%)


Test set: Average loss: 0.8767, Accuracy: 7017/10000 (70%)


Test set: Average loss: 0.8487, Accuracy: 7079/10000 (71%)


Test set: Average loss: 0.8283, Accuracy: 7112/10000 (71%)


Test set: Average loss: 0.8114, Accuracy: 7134/10000 (71%)



In [13]:
model = CNNNet()

In [14]:
if torch.cuda.is_available():
    model.cuda()
optimizer = optim.SGD(model.parameters(), lr=0.001)

In [15]:
for epoch in range(1, 10 + 1):
    train(epoch)
    test()


Test set: Average loss: 2.2180, Accuracy: 3538/10000 (35%)


Test set: Average loss: 1.8973, Accuracy: 7246/10000 (72%)


Test set: Average loss: 1.0909, Accuracy: 7957/10000 (80%)


Test set: Average loss: 0.6365, Accuracy: 8592/10000 (86%)


Test set: Average loss: 0.4749, Accuracy: 8886/10000 (89%)


Test set: Average loss: 0.3969, Accuracy: 8991/10000 (90%)


Test set: Average loss: 0.3472, Accuracy: 9085/10000 (91%)


Test set: Average loss: 0.3121, Accuracy: 9132/10000 (91%)


Test set: Average loss: 0.2860, Accuracy: 9202/10000 (92%)


Test set: Average loss: 0.2683, Accuracy: 9231/10000 (92%)

