PyTorch Introduction
================
 
<div class="alert alert-info">
    <strong>Note:</strong> This exercise is optional and only serves as an introduction and cheatsheet to the general concepts of PyTorch.
</div>

PyTorch is a scientific computing package for Python:

-  Tensor and Neural Network computations (in particular deep learning)
-  Research oriented (in comparison to e.g. TensorFlow)
-  Dynamic computational graph (in comparison to e.g. TensorFlow)
-  “NumPy on the GPU”
-  Backend and API heavily inspired by the original Torch written in Lua

An in-depth tutorial of the concepts described in this notebook can be found [here](https://github.com/jcjohnson/pytorch-examples).

In [1]:
%matplotlib inline
import numpy as np
import torch

Tensors
=====

The PyTorch `Tensor` class is very similar to the NumPy `ndarray` class. Their main distinction is the ability of PyTorch Tensors to be used on a GPU which lets them benefit from vastly accelerated and parallelized computations. In order to work with PyTorch it is crucial to understand the basic behavior of its `Tensor` class.

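Let's start by creating a regular `5x3` matrix `Tensor`. Note that `torch.Tensor(5, 3)` only allocates the memory and does not initialize it, so the printed entries are arbitrary: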

In [2]:
x = torch.Tensor(5, 3)
print(x)


 8.3571e-35  4.5831e-41  8.3571e-35
 4.5831e-41 -1.9296e+12  4.5829e-41
 8.4078e-45  4.5831e-41  3.3409e-34
 1.2612e-44  3.8522e-34  4.5831e-41
 1.6816e-44  4.5831e-41  0.0000e+00
[torch.FloatTensor of size 5x3]



The same matrix can be initialized with random entries:



In [3]:
x = torch.rand(5, 3)
print(x)


 0.3163  0.5654  0.2828
 0.9385  0.4899  0.3137
 0.8784  0.5316  0.8473
 0.8816  0.1655  0.0295
 0.4115  0.7418  0.7783
[torch.FloatTensor of size 5x3]



The size of a `Tensor` can be retrieved with:



In [4]:
print(x.size())

torch.Size([5, 3])


<div class="alert alert-info">
    <h3>Note</h3>
    <p>`torch.Size` is in fact a tuple, so it supports the same operations as a regular Python tuple.</p>
    <p>In contrast to the static computational graph of, for example, TensorFlow, the dynamic graph of PyTorch makes it possible to retrieve information such as the size of a `Tensor` at any time during runtime.</p>
</div>
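
The tuple behavior of `torch.Size` mentioned in the note can be checked directly. The following is a minimal sketch; the variable names `rows` and `cols` are chosen for illustration only.

```python
import torch

x = torch.rand(5, 3)
size = x.size()

# torch.Size subclasses tuple, so unpacking, indexing and len() all work.
rows, cols = size
print(isinstance(size, tuple))   # True
print(size[0], len(size))        # 5 2
print(rows * cols == x.numel())  # True
```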

Tensor Operations
--------

There are multiple syntaxes for `Tensor` operations. We illustrate the different options using `Tensor` addition as an example.

Regular (NumPy) syntax:



In [5]:
y = torch.rand(5, 3)
print(x + y)


 0.3809  1.1081  1.1115
 1.5668  0.7949  1.0771
 1.8586  1.4302  1.0375
 1.2942  0.6657  0.7280
 0.9071  1.2504  1.5418
[torch.FloatTensor of size 5x3]



PyTorch syntax:



In [6]:
print(torch.add(x, y))


 0.3809  1.1081  1.1115
 1.5668  0.7949  1.0771
 1.8586  1.4302  1.0375
 1.2942  0.6657  0.7280
 0.9071  1.2504  1.5418
[torch.FloatTensor of size 5x3]



PyTorch syntax with specific output variable:



In [7]:
result = torch.Tensor(5, 3)
torch.add(x, y, out=result)
print(result)


 0.3809  1.1081  1.1115
 1.5668  0.7949  1.0771
 1.8586  1.4302  1.0375
 1.2942  0.6657  0.7280
 0.9071  1.2504  1.5418
[torch.FloatTensor of size 5x3]



PyTorch syntax for in-place operations:

In [8]:
# adds x to y
y.add_(x)
print(y)


 0.3809  1.1081  1.1115
 1.5668  0.7949  1.0771
 1.8586  1.4302  1.0375
 1.2942  0.6657  0.7280
 0.9071  1.2504  1.5418
[torch.FloatTensor of size 5x3]



<div class="alert alert-info">
    <h3>Note</h3>
    <p>Any operation that mutates a `Tensor` in-place is post-fixed with an ``_``.</p>
    <p>For example, ``x.copy_(y)`` copies ``y`` into ``x``, and ``x.t_()`` transposes ``x`` in place.</p>
</div>
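
The difference between the in-place and the out-of-place variants can be illustrated with a small sketch. It assumes a PyTorch version that provides `torch.tensor` (0.4 or later); the values are arbitrary.

```python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
y = torch.zeros(2, 2)

# Out-of-place: returns a new Tensor and leaves y untouched.
z = y.add(x)
print(y.sum().item())     # 0.0

# In-place (trailing underscore): y itself is overwritten.
y.add_(x)
print(torch.equal(y, z))  # True

# In-place transpose: x now holds its own transpose.
x.t_()
print(x[0, 1].item())     # 3.0
```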

`Tensor` indexing works just like standard NumPy indexing. More recent PyTorch versions also support NumPy-style `Tensor` [broadcasting](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html).



In [9]:
print(x[:, 1])


 0.5654
 0.4899
 0.5316
 0.1655
 0.7418
[torch.FloatTensor of size 5]
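
As a rough illustration of broadcasting, the sketch below combines Tensors of different shapes. It assumes a PyTorch version with broadcasting support and `torch.tensor`; the names `row`, `col`, `shifted` and `grid` are made up for this example.

```python
import torch

x = torch.zeros(5, 3)
row = torch.tensor([1.0, 2.0, 3.0])                            # shape (3,)
col = torch.tensor([[10.0], [20.0], [30.0], [40.0], [50.0]])   # shape (5, 1)

# The row vector is virtually repeated along the first dimension.
shifted = x + row
print(shifted.shape)       # torch.Size([5, 3])

# (3,) and (5, 1) are both expanded to (5, 3) before the addition.
grid = row + col
print(grid[1, 2].item())   # 23.0
```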



NumPy: There and back again
---------------------------

Converting a PyTorch `Tensor` to a NumPy `ndarray` and vice versa is very simple. The `Tensor` and the `ndarray` will share the location of the underlying memory, and changing one will also change the other.

Converting a `Tensor` to an `ndarray` works by simply calling the `Tensor.numpy()` method:

In [10]:
a = torch.ones(5)
b = a.numpy()
print(a)
print(b)


 1
 1
 1
 1
 1
[torch.FloatTensor of size 5]

[ 1.  1.  1.  1.  1.]


Changing the `Tensor` affects the `ndarray` as well:

In [11]:
a.add_(1)
print(a)
print(b)


 2
 2
 2
 2
 2
[torch.FloatTensor of size 5]

[ 2.  2.  2.  2.  2.]


The conversion from an `ndarray` to a `Tensor` is just as simple and holds the same properties:

In [12]:
a = np.ones(5)
b = torch.from_numpy(a)
np.add(a, 1, out=a)
print(a)
print(b)

[ 2.  2.  2.  2.  2.]

 2
 2
 2
 2
 2
[torch.DoubleTensor of size 5]



Every `Tensor` allocated on the CPU (except the `torch.CharTensor`) supports converting to
NumPy and back.
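
Because the memory is shared, an explicit copy is needed whenever the `Tensor` should be independent of the `ndarray`. The following sketch shows one way to do this with `Tensor.clone()`; the names `shared` and `copied` are illustrative.

```python
import numpy as np
import torch

a = np.ones(3)
shared = torch.from_numpy(a)          # shares memory with a
copied = torch.from_numpy(a).clone()  # independent copy of the data

a[0] = 5.0
print(shared[0].item())  # 5.0
print(copied[0].item())  # 1.0
print(shared.dtype)      # torch.float64
```

The dtype of the `ndarray` (here `float64`) is carried over to the `Tensor`.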

Tensors on the GPU
------------------

PyTorch Tensors can be moved onto a GPU using the ``Tensor.cuda()`` method. Before converting a GPU `Tensor` to NumPy it has to be moved back to the CPU by calling the ``Tensor.cpu()`` method.



In [13]:
# first check if cuda is available
if torch.cuda.is_available():
    x = x.cuda()
    y = y.cuda()
    z = x + y
    
    print(z)
    print(z.cpu().numpy())
else:
    print("CUDA not available.")


 0.6972  1.6734  1.3942
 2.5053  1.2848  1.3908
 2.7370  1.9618  1.8848
 2.1758  0.8312  0.7575
 1.3186  1.9923  2.3201
[torch.cuda.FloatTensor of size 5x3 (GPU 0)]

[[ 0.69717693  1.6734134   1.39424455]
 [ 2.505337    1.28478622  1.39080715]
 [ 2.7369628   1.96182108  1.88478351]
 [ 2.17579365  0.83123899  0.75750089]
 [ 1.31861353  1.99226499  2.32007504]]
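
In PyTorch 0.4 and later the same idea is often written in a device-agnostic way with `torch.device` and `Tensor.to()`. The sketch below assumes such a version and falls back to the CPU if CUDA is not available.

```python
import torch

# Pick the GPU if one is available, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

x = torch.rand(5, 3).to(device)      # move an existing Tensor
y = torch.rand(5, 3, device=device)  # or create it on the device directly
z = x + y

# A Tensor has to live on the CPU before it can be converted to NumPy.
z_np = z.cpu().numpy()
print(z.device.type, z_np.shape)
```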


More on PyTorch Tensors
-----------------------

The documentation of many more `Tensor` operations, including transposing, indexing, slicing, mathematical operations, linear algebra, random numbers can be found [here](http://pytorch.org/docs/torch).


Autograd - automatic differentiation
===================================

Central to all neural networks in PyTorch is the ``autograd`` package. The package provides automatic differentiation for all operations on Tensors. PyTorch is a define-by-run framework, which means that the calculation of gradients (e.g. during backpropagation) is defined at runtime and can be different at every single iteration.

Variable
--------

``autograd.Variable`` is the central class of the package. It wraps the `Tensor` class, and supports nearly all of its operations. Once a computation graph is executed the ``Variable.backward()`` method can be used to automatically compute all the gradients.

If the ``Variable`` is not a scalar, the ``backward()`` method requires an additional ``grad_output`` argument which matches the shape of the ``Variable``. ``grad_output`` is supposed to be the gradient w.r.t. the given output. For a scalar ``Variable``, ``grad_output`` is assumed to be `torch.Tensor([1.0])`.

The underlying, raw tensor can be accessed with the ``Variable.data`` attribute and the
gradient w.r.t. this variable is accumulated into ``Variable.grad``.

<img src="http://pytorch.org/tutorials/_images/Variable.png">

In addition to the ``Variable`` class the `autograd` package provides a `Function` class. The two classes are interconnected and build up an acyclic graph which encodes a complete history of computation. Each `Variable` has
a ``Variable.grad_fn`` attribute which references the ``Function`` (e.g. an operation such as addition) that created the respective ``Variable`` and thereby determines its gradient. For Variables that were created by the user and not as a result of an operation the ``grad_fn`` attribute is ``None``.
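
The role of ``grad_fn`` can be inspected in a few lines. This sketch assumes PyTorch 0.4 or later, where `Variable` has been merged into `Tensor` and ``requires_grad`` can be passed directly when creating a `Tensor`.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)  # created by the user
y = x + 2                                 # created by an operation

print(x.grad_fn is None)      # True
print(y.grad_fn is not None)  # True
print(x.is_leaf, y.is_leaf)   # True False
```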

The following simple examples will illustrate the basic concepts of the ``autograd`` package.

In [14]:
from torch.autograd import Variable

Create a `Variable` object:



In [15]:
x = Variable(torch.ones(2, 2), requires_grad=True)
print(x)

Variable containing:
 1  1
 1  1
[torch.FloatTensor of size 2x2]



Apply an operation to the `Variable`:



In [16]:
y = x + 2
print(y)

Variable containing:
 3  3
 3  3
[torch.FloatTensor of size 2x2]



Since ``y`` was created as a result of an operation, it has a ``grad_fn`` attribute (a `Function`) that is not `None`:



In [17]:
print(y.grad_fn)

<torch.autograd.function.AddConstantBackward object at 0x7fc1d413d9a8>


Applying more operations to `y` extends the computational graph:



In [18]:
z = y * y * 3
out = z.mean()

print(z, out)

Variable containing:
 27  27
 27  27
[torch.FloatTensor of size 2x2]
 Variable containing:
 27
[torch.FloatTensor of size 1]



Gradients
---------
The gradient w.r.t. the input `x` can now be computed (backpropagated) with ``out.backward()``. Remember that for a scalar this is equivalent to calling ``out.backward(torch.Tensor([1.0]))``.



In [19]:
out.backward()

The input `x` was a `2x2` `Variable` and therefore $\frac{d(out)}{dx}$ yields a matrix with the same shape:



In [20]:
print(x.grad)

Variable containing:
 4.5000  4.5000
 4.5000  4.5000
[torch.FloatTensor of size 2x2]



For such a small computation graph the solution can easily be verified:

The output w.r.t. the input is given as
$$
\begin{aligned}
    out &= \frac{1}{4}\sum_i z_i \\
        &= \frac{1}{4}\sum_i 3y_i y_i \\
        &= \frac{1}{4}\sum_i 3(x_i+2)^2.
\end{aligned}
$$

Therefore the gradient is $\frac{\partial out}{\partial x_i} = \frac{3}{2}(x_i+2)$, which yields
$\frac{\partial out}{\partial x_i}\bigr\rvert_{x_i=1} = \frac{9}{2} = 4.5$ for a particular input $x_i=1$.
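
The derivation can also be checked numerically. The following sketch assumes PyTorch 0.4 or later (a `Tensor` with ``requires_grad=True`` instead of a `Variable`) and compares the gradient computed by `autograd` with the closed-form expression $\frac{3}{2}(x_i+2)$.

```python
import torch

x = torch.ones(2, 2, requires_grad=True)
y = x + 2
out = (y * y * 3).mean()
out.backward()

# Closed-form gradient from the derivation: 3/2 * (x_i + 2)
analytic = 1.5 * (x.detach() + 2)

print(torch.allclose(x.grad, analytic))  # True
print(x.grad[0, 0].item())               # 4.5
```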



The `autograd` package in combination with the dynamic graph structure allows for data-dependent control flow, such as a loop whose number of iterations is only known at runtime:



In [21]:
x = torch.randn(3)
x = Variable(x, requires_grad=True)

y = x * 2
while y.data.norm() < 1000:
    y = y * 2

print(y)

Variable containing:
 1022.6609
 -657.0992
-1235.1356
[torch.FloatTensor of size 3]
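
Since `y` is not a scalar here, backpropagating through it requires a gradient argument of the same shape. The sketch below assumes PyTorch 0.4 or later; the values in `grad_output` and the counter `doublings` are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
x = torch.randn(3, requires_grad=True)

y = x * 2
doublings = 1
while y.detach().norm() < 1000:
    y = y * 2
    doublings += 1

# y is a vector, so backward() needs a gradient of matching shape.
grad_output = torch.tensor([0.1, 1.0, 0.0001])
y.backward(grad_output)

# Each y_i equals x_i * 2**doublings, so the gradient is scaled accordingly.
expected = grad_output * 2 ** doublings
print(torch.allclose(x.grad, expected))  # True
```

Even though the number of loop iterations depends on the random input, `autograd` tracks exactly the operations that were executed.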




Neural Networks
===============

The `Tensor` and `Variable` classes in combination with the `autograd` package build the foundation for constructing Neural Networks (NNs) with PyTorch. To further facilitate the construction and training of a NN, the ``torch.nn`` package, which depends on `autograd` to define NN models and differentiate them, includes additional NN-specific classes and helper functions.

For example, the ``nn.Module`` class serves as the base class for NN models. It contains all the individual layers as well as the ``Module.forward(x)`` method, which passes the input ``x`` through the network and returns the output of the NN.

The following is a graphical illustration of the well-known *LeNet* NN from Yann LeCun. This NN was trained to classify the MNIST dataset of handwritten digit images:

<img src="http://pytorch.org/tutorials/_images/mnist.png">

It is a simple feed-forward network which takes the input, feeds it through several layers one after the other, and then finally produces the classification output.


Define *LeNet* with PyTorch
--------------------------

The following is an example implementation of the classification network above: 



In [22]:
import torch
from torch.autograd import Variable
import torch.nn as nn
import torch.nn.functional as F


class LeNet(nn.Module):

    def __init__(self):
        """
        Class constructor which preinitializes NN layers with trainable
        parameters.
        """
        super(LeNet, self).__init__()
        # 1 input image channel, 6 output channels, 5x5 square convolution
        # kernel
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        """
        Forwards the input x through each of the NN layers and outputs the result.
        """
        # Max pooling over a (2, 2) window
        x = F.max_pool2d(F.relu(self.conv1(x)), (2, 2))
        # If the window is square, a single number is sufficient
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        # An efficient transition from spatial conv layers to flat 1D fully 
        # connected layers is achieved by only changing the "view" on the
        # underlying data and memory structure.
        x = x.view(-1, self.num_flat_features(x))
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

    def num_flat_features(self, x):
        """
        Computes the number of features if the spatial input x is transformed
        to a 1D flat input.
        """
        size = x.size()[1:]  # all dimensions except the batch dimension
        num_features = 1
        for s in size:
            num_features *= s
        return num_features


net = LeNet()
print(net)

LeNet (
  (conv1): Conv2d(1, 6, kernel_size=(5, 5), stride=(1, 1))
  (conv2): Conv2d(6, 16, kernel_size=(5, 5), stride=(1, 1))
  (fc1): Linear (400 -> 120)
  (fc2): Linear (120 -> 84)
  (fc3): Linear (84 -> 10)
)
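
The input size of `fc1` (`16 * 5 * 5 = 400`) follows from the spatial sizes after each layer for a `32x32` input. The arithmetic can be sketched in plain Python; the helper functions `conv_out` and `pool_out` are hypothetical and only exist for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, window):
    """Output size of non-overlapping max pooling along one dimension."""
    return size // window


size = 32
size = conv_out(size, kernel=5)   # conv1: 32 -> 28
size = pool_out(size, window=2)   # pool:  28 -> 14
size = conv_out(size, kernel=5)   # conv2: 14 -> 10
size = pool_out(size, window=2)   # pool:  10 -> 5

flat_features = 16 * size * size  # 16 channels after conv2
print(size, flat_features)        # 5 400
```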


Thanks to the `autograd` package, a NN merely requires the definition of the ``Module.forward()`` method. The ``.backward()`` function (which backpropagates the gradients) is automatically defined. Any `Tensor` operation is allowed in the ``forward`` function.

The learnable parameters of a model are returned by ``Module.parameters()``:



In [23]:
params = list(net.parameters())
print(len(params))
print(params[0].size())  # conv1's .weight

10
torch.Size([6, 1, 5, 5])
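
The ten parameter Tensors are the weight and the bias of each of the five layers. As a sketch, the same layers can be collected in an ``nn.ModuleDict`` (an assumption of this example, available in more recent PyTorch versions) and listed with ``named_parameters()``:

```python
import torch.nn as nn

# The same five layers as in LeNet, gathered in a container for inspection.
layers = nn.ModuleDict({
    "conv1": nn.Conv2d(1, 6, 5),
    "conv2": nn.Conv2d(6, 16, 5),
    "fc1": nn.Linear(16 * 5 * 5, 120),
    "fc2": nn.Linear(120, 84),
    "fc3": nn.Linear(84, 10),
})

# Every layer contributes a weight and a bias Tensor.
for name, param in layers.named_parameters():
    print(name, tuple(param.shape), param.numel())

total = sum(p.numel() for p in layers.parameters())
print(total)  # 61706
```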


The input to the `forward` method is an ``autograd.Variable``, and so is the output:



In [24]:
x = Variable(torch.randn(1, 1, 32, 32))
output = net(x)
print(output)

Variable containing:
-0.0497  0.0175  0.0472 -0.0009 -0.0625  0.0819  0.0941  0.0521 -0.0082  0.1112
[torch.FloatTensor of size 1x10]



Before backpropagating, here with a random gradient as an example, the gradient buffers of all parameters should be set to zero:



In [25]:
net.zero_grad()
output.backward(torch.randn(1, 10))

<div class="alert alert-info">
    <h3>Note</h3>
    <p>Calling the ``Variable.backward()`` method a second time before new inputs are forwarded will throw an error. This is due to PyTorch deleting all the intermediary results in order to reduce memory consumption. Calling the ``.backward()`` method with the `retain_graph=True` argument keeps those results.
    </p>
</div>
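
The effect of `retain_graph=True` can be reproduced on a tiny graph. This is a sketch assuming PyTorch 0.4 or later; the flag `second_backward_failed` is just a name used here to record the outcome.

```python
import torch

x = torch.ones(2, requires_grad=True)
out = (x * x).sum()

out.backward(retain_graph=True)  # intermediate results are kept
out.backward()                   # works, but now the graph is freed

# Gradients of both calls were accumulated: 2*x + 2*x = 4
print(x.grad)

second_backward_failed = False
try:
    out.backward()               # the graph no longer exists
except RuntimeError:
    second_backward_failed = True
print(second_backward_failed)    # True
```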

<div class="alert alert-info">
    <h3>Note</h3>
    <p>The entire ``torch.nn`` package only supports inputs that are a mini-batch of samples, and not a single sample.

    For example, ``nn.Conv2d`` will take in a 4D Tensor of ``nSamples x nChannels x Height x Width``.

    If you have a single sample, just use ``x.unsqueeze(0)`` to add a fake batch dimension.
    </p>
</div>
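
A short sketch of adding the batch dimension to a single sample before passing it to a layer (the names `sample`, `batch` and `features` are illustrative):

```python
import torch
import torch.nn as nn

sample = torch.randn(1, 32, 32)   # nChannels x Height x Width
batch = sample.unsqueeze(0)       # nSamples x nChannels x Height x Width

conv = nn.Conv2d(1, 6, 5)
features = conv(batch)

print(batch.shape)     # torch.Size([1, 1, 32, 32])
print(features.shape)  # torch.Size([1, 6, 28, 28])
```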

Loss Function
-------------
A loss function takes the (output, target) pair as inputs, and computes a value that estimates "how far away" the output is from the target.

There are several different loss functions predefined under the `torch.nn` package. An example of a simple loss is the ``nn.MSELoss`` which computes the mean-squared error between the input and the target value.

More examples of predefined losses are documented [here](http://pytorch.org/docs/nn.html#loss-functions).

An MSE loss example:

In [26]:
output = net(x)
target = Variable(torch.arange(1, 11))  # a dummy target: the values 1 to 10
criterion = nn.MSELoss()

loss = criterion(output, target)
print(loss)

Variable containing:
 38.0207
[torch.FloatTensor of size 1]
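
What ``nn.MSELoss`` computes can be reproduced by hand on a small example. The sketch below assumes PyTorch 0.4 or later and uses made-up values for `output` and `target`.

```python
import torch
import torch.nn as nn

output = torch.tensor([[0.5, 1.5, 2.0]])
target = torch.tensor([[1.0, 1.0, 4.0]])

loss = nn.MSELoss()(output, target)

# Mean of the squared differences: (0.25 + 0.25 + 4.0) / 3
manual = ((output - target) ** 2).mean()

print(loss.item())    # 1.5
print(manual.item())  # 1.5
```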



When ``loss.backward()`` is called, the whole graph is differentiated w.r.t. the loss, and all Variables in the graph will have their ``Variable.grad`` attribute accumulated with the gradient.

Backpropagate the Loss
--------------------

A crucial step for optimizing the network weights is the backpropagation of the loss. The nature of a computational graph makes this as easy as calling ``loss.backward()``. But since the gradients will be accumulated to already existing gradients, one has to clear them first.

In [None]:
net.zero_grad()     # zeroes the gradient buffers of all parameters

print('conv1.bias.grad before backward')
print(net.conv1.bias.grad)

loss.backward()

print('conv1.bias.grad after backward')
print(net.conv1.bias.grad)
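
Why the gradients have to be cleared becomes visible on a single weight vector. This is a sketch assuming PyTorch 0.4 or later; `w` stands in for a network parameter.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

(w * 3).sum().backward()
first = w.grad.clone()    # gradient of one backward pass: [3, 3]

(w * 3).sum().backward()
second = w.grad.clone()   # accumulated over two passes: [6, 6]

w.grad.zero_()            # clear the buffer, as zero_grad() does
(w * 3).sum().backward()
third = w.grad.clone()    # back to [3, 3]

print(first, second, third)
```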

Weights Optimization
------------------
The simplest update rule used in practice for optimizing the weights of a NN is the Stochastic Gradient Descent (SGD):

``weight = weight - learning_rate * gradient``

Like any other NN component the optimization step can be implemented with the basic PyTorch classes.

For example:

In [None]:
def sgd_step(net):
    learning_rate = 0.01
    for f in net.parameters():
        f.data.sub_(f.grad.data * learning_rate)
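
In PyTorch 0.4 and later a manual update like this is usually wrapped in ``torch.no_grad()`` instead of going through ``.data``. The following sketch applies one such step to a single weight vector; the loss and the values are chosen only for illustration.

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)
loss = (w ** 2).sum()
loss.backward()            # gradient is 2 * w = [2, -4]

learning_rate = 0.1
with torch.no_grad():      # the update itself must not be recorded
    w -= learning_rate * w.grad

print(w)  # values 0.8 and -1.6
```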

However, the PyTorch framework contains a small optimization package called ``torch.optim``. It includes various predefined update rules such as SGD, Nesterov-SGD, Adam, RMSProp, etc.

<div class="alert alert-info">
    <h3>Note</h3>
    <p>Common optimization options such as the L2-regularization (see ``weight_decay`` argument) are already included in the predefined optimization schemes.</p>
</div>

Using it is very simple:

In [None]:
import torch.optim as optim

# create an optimizer
optimizer = optim.SGD(net.parameters(), lr=0.01, weight_decay=1e-3)

# a single step of an example training loop
optimizer.zero_grad()   # zero the gradient buffers
output = net(x)
loss = criterion(output, target)
loss.backward()
optimizer.step()    # Does the update based on the accumulated gradients
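
Repeating this step in a loop gives a complete, if minimal, training procedure. The sketch below is not the LeNet example from above: it fits a small ``nn.Linear`` model to synthetic data (both are assumptions made to keep the example fast and self-contained) and records the loss at every step.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Synthetic regression problem: the target is the sum of the inputs.
inputs = torch.randn(16, 3)
targets = inputs.sum(dim=1, keepdim=True)

model = nn.Linear(3, 1)
criterion = nn.MSELoss()
optimizer = optim.SGD(model.parameters(), lr=0.05)

losses = []
for step in range(100):
    optimizer.zero_grad()                     # clear old gradients
    loss = criterion(model(inputs), targets)  # forward pass
    loss.backward()                           # backpropagate
    optimizer.step()                          # update the weights
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The recorded loss should shrink over the course of the loop, which is a quick sanity check that the four calls are in the right order.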

Recap
==============

  -  ``torch.Tensor`` - A multi-dimensional array.
  -  ``autograd.Variable`` - Wraps a `Tensor` and records the history of
     operations applied to it. Has the same API as a ``Tensor``, with
     some additions like ``backward()``. Also holds the gradient
     w.r.t. the tensor.
  -  ``nn.Module`` - Neural network module. Convenient way of
     encapsulating parameters, with helpers for moving them to GPU,
     exporting, loading, etc.
  -  ``nn.Parameter`` - A kind of `Variable`, that is automatically
     registered as a parameter when assigned as an attribute to a
     ``Module``.
  -  ``autograd.Function`` - Implements forward and backward definitions
     of an autograd operation. Every ``Variable`` operation creates at
     least a single ``Function`` node that connects to the functions that
     created a ``Variable`` and encodes its history.
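
The automatic registration of ``nn.Parameter`` attributes can be seen in a small sketch. The `Scale` module below is hypothetical and only serves to contrast a registered parameter with a plain `Tensor` attribute.

```python
import torch
import torch.nn as nn


class Scale(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(3))  # registered automatically
        self.offset = torch.zeros(3)               # plain Tensor, not registered

    def forward(self, x):
        return x * self.weight + self.offset


module = Scale()
names = [name for name, _ in module.named_parameters()]
print(names)  # ['weight']
```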

<div class="alert alert-info">
    <h3>Note</h3>
    <p>The `torchvision` package includes many predefined helper functions specifically designed for solving computer vision problems.</p>
</div>