## Neural Network Training with ~~Tensorflow~~ PyTorch
We will look at how this lab can be done in PyTorch, a deep learning library developed by Facebook. We will focus on how it differs from Tensorflow, and not care too much about the actual results.

Let's start with Exercise 1. Here we make a simple computation:

In [1]:
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def exercise_1():
    x = torch.Tensor([2])
    y = torch.Tensor([3])
    op1 = x + y
    op2 = x * y
    op3 = op2 ** op1
    return op3

exercise_1()

tensor([ 7776.])

We get the result without any `Session` or calling `.run()`. PyTorch executes all operations immediately, on the line they are written. There is no C++ computation graph built up behind the scenes. This greatly helps debugging, since errors tend to happen on a specific line, and you can break in the debugger to see the exact state of the computations.
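To make the eager execution concrete, here is a minimal sketch (the variable names mirror `exercise_1`, and `torch.tensor` is used instead of `torch.Tensor` purely as a stylistic assumption). Every intermediate value is available as soon as its line has run:

```python
import torch

# Each line runs immediately, so intermediate results can be printed
# (or inspected in a debugger) right where they are computed.
x = torch.tensor([2.0])
y = torch.tensor([3.0])

op1 = x + y           # 2 + 3
op2 = x * y           # 2 * 3
print(op1.item(), op2.item())

op3 = op2 ** op1      # 6 ** 5
print(op3.item())
```

No session or graph compilation step is involved; a breakpoint on any of these lines shows concrete numbers.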

In this example, we just got back a plain tensor, which is not very useful for gradient descent. To get gradient information, we have to use PyTorch's autodiff facilities.

In [2]:
from torch.autograd import Variable

def exercise_1_autodiff():
    x = Variable(torch.Tensor([2]), requires_grad=True)
    y = Variable(torch.Tensor([3]), requires_grad=True)
    op1 = x + y
    op2 = x * y
    op3 = op2 ** op1
    return op3

OP3 = exercise_1_autodiff()
OP3

tensor([ 7776.])

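`Variable` is a thin wrapper around the tensor object (side note: tensors can share memory with NumPy arrays, which reduces copies). Variables track gradient information and make it possible to propagate gradients from the output back to the input, that is, backprop. They do this with function objects (`grad_fn`) that point to each other, which builds up a graph that we can walk. Since PyTorch 0.4, `Variable` and tensors are merged: the wrapper still works but simply returns a tensor, and `requires_grad` can be set on the tensor directly.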

In [3]:
def _expand_children(_op):
    children = [f[0] for f in _op.next_functions]
    edges = [(repr(_op), repr(child)) for child in children]

    return children, edges

def walk_grad_fn_bf(op):
    queue, edges = _expand_children(op.grad_fn)
    while queue:
        op = queue.pop(0)
        children, new_edges = _expand_children(op)
        queue += children
        edges += new_edges

    for e in edges:
        print("{} -> {}".format(e[0].replace("\n", " "), e[1].replace("\n", " ")))
        
walk_grad_fn_bf(OP3)

<PowBackward1 object at 0x7f4001058290> -> <MulBackward1 object at 0x7f40010582d0>
<PowBackward1 object at 0x7f4001058290> -> <AddBackward1 object at 0x7f4001058310>
<MulBackward1 object at 0x7f40010582d0> -> <AccumulateGrad object at 0x7f4001058290>
<MulBackward1 object at 0x7f40010582d0> -> <AccumulateGrad object at 0x7f4001058350>
<AddBackward1 object at 0x7f4001058310> -> <AccumulateGrad object at 0x7f4001058290>
<AddBackward1 object at 0x7f4001058310> -> <AccumulateGrad object at 0x7f4001058350>


We see that the variable itself has a "Pow" that points to a "Mul" and an "Add", which is expected. Those two operations in turn point to gradient accumulators. Note that the same two accumulators appear twice. This is because "Mul" and "Add" both depend on both variables, `x` and `y`, so each accumulator has to sum the gradient contributions from both "Mul" and "Add".
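The summing of contributions can be checked numerically. The sketch below is not part of the original lab; it rebuilds the same expression, calls `.backward()`, and compares the accumulated gradients with a hand-derived formula for f = (x·y)^(x+y). The names `expected_dx` and `expected_dy` are made up for the illustration.

```python
import math
import torch

x = torch.tensor([2.0], requires_grad=True)
y = torch.tensor([3.0], requires_grad=True)

op1 = x + y
op2 = x * y
op3 = op2 ** op1

# Walk the graph backwards. Each leaf's accumulator sums the
# contribution arriving via "Mul" and the one arriving via "Add".
op3.backward()

# Analytic gradients of f = (x * y) ** (x + y):
#   df/dx = f * (ln(x * y) + (x + y) / x)
#   df/dy = f * (ln(x * y) + (x + y) / y)
f = 6.0 ** 5
expected_dx = f * (math.log(6.0) + 5.0 / 2.0)
expected_dy = f * (math.log(6.0) + 5.0 / 3.0)

print(x.grad.item(), expected_dx)
print(y.grad.item(), expected_dy)
```

In each formula the logarithm term comes from the "Add" path (the exponent) and the fraction comes from the "Mul" path (the base), which is exactly the sum the accumulators perform.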

Also note that we had to mark the two inputs with `requires_grad=True`. For efficiency, gradients are only tracked through an operation if at least one of its inputs requires gradients. The default for `requires_grad` is False, since a Variable usually wraps the input data, and we want to optimize the network weights, not the input.
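A small sketch of how the `requires_grad` flag propagates, assuming plain tensors; the names `a` and `w` are hypothetical stand-ins for an input and a weight:

```python
import torch

a = torch.tensor([2.0])                       # requires_grad defaults to False
w = torch.tensor([3.0], requires_grad=True)   # stands in for a weight

untracked = a * a    # no input requires gradients: nothing is recorded
tracked = a * w      # one input requires gradients: the op is recorded

print(untracked.requires_grad, untracked.grad_fn)
print(tracked.requires_grad)

tracked.backward()
print(a.grad, w.grad)   # only w receives a gradient, d(a * w)/dw = a
```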

All other exercises involve training networks, so let's set up the machinery we need for that.

In [4]:
def _train_loop(loader, net, criterion, optimizer):
    net.train()
    total_loss = 0
    for image, target in loader:
        image = Variable(image).to(device)
        target = Variable(target).to(device)

        prediction = net(image)
        loss = criterion(prediction, target)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        
    return total_loss / float(len(loader.dataset))

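`_train_loop` is a helper that trains for a single epoch. `loader` provides batches of training samples, `net` is our neural net, `criterion` is our cost function and `optimizer` is, naturally, our optimizer. During training, we sum the loss of every batch and report that sum divided by the number of training samples. Since `CrossEntropyLoss` already averages over each batch by default, the reported number is roughly the mean loss divided by the batch size, which is fine for tracking progress.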

`net.train()` sets things like dropout and batch normalization to the proper state for training.

There are several things to note here. First we wrap the input in Variable, which is required by the autodiff backend. Then we call `.to(device)`. This transfers the data to the GPU, if one is available.

We then do the forward pass on the net and the criterion, which is self-explanatory. There is no `feed_dict`.

More interesting is the `.zero_grad()` call on the optimizer. As we saw in the first example, Variables, like the network weights, have gradient accumulators that can be added to via separate paths through the back-propagation graph. And they don't know when they should be zeroed. By calling `.zero_grad()`, we zero the gradient accumulators for all Variables the optimizer is optimizing on.
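The accumulation is easy to observe on a toy parameter. This is a minimal sketch; the tensor `w` and the expression `3 * w` are made up for the illustration:

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(3 * w).sum().backward()
after_first = w.grad.clone()     # gradient of 3 * w is 3 per element

(3 * w).sum().backward()         # no reset in between: the new gradient is added
after_second = w.grad.clone()

optimizer.zero_grad()            # reset before the next backward pass
(3 * w).sum().backward()
after_reset = w.grad.clone()

print(after_first, after_second, after_reset)
```

Without the `.zero_grad()` call, the third backward pass would have pushed the stored gradient to 9 per element instead of 3.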

Calling `.backward()` on the loss computes the gradient of `loss` with respect to every Variable that `loss` depends on.

Finally, we call `.step()` on the optimizer, which will do some sort of optimization step on all Variables it's working on, based on the gradient information that is now stored in those Variables.

Calling `.item()` on the loss returns a plain Python number, which side-steps the autograd machinery, so that `total_loss` won't become connected to any back-propagation graph.
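The training loop accumulates `loss.item()` rather than `loss` itself. A short sketch of the difference, using a made-up one-element "loss":

```python
import torch

w = torch.tensor([1.0], requires_grad=True)
loss = (w * 2).sum()

as_number = loss.item()   # plain Python float, no link to the graph
as_tensor = 0 + loss      # still a tensor that carries a grad_fn

print(type(as_number))
print(as_tensor.grad_fn is not None)
```

Summing tensors like `as_tensor` over an epoch would keep every batch's graph reachable, while summing floats keeps nothing alive.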

It should be noted that the back-propagation graph is rebuilt on each forward pass, so we can add, remove or change components in the model between batches, which is very hard to do in Tensorflow. It also means you can use normal Python control flow; you don't need things like `tf.control_dependencies`.
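As an illustration of ordinary Python control flow inside a model, here is a hypothetical toy module (`DynamicDepthNet` is not part of the lab) whose depth is chosen per call and which branches on a value computed from the data:

```python
import torch

class DynamicDepthNet(torch.nn.Module):
    """Hypothetical toy module: applies the same layer a variable number of times."""
    def __init__(self):
        super(DynamicDepthNet, self).__init__()
        self.layer = torch.nn.Linear(4, 4)

    def forward(self, x, repeats):
        for _ in range(repeats):         # plain Python loop
            x = torch.relu(self.layer(x))
        if x.sum().item() > 0:           # plain Python branch on a computed value
            x = x * 2
        return x

net = DynamicDepthNet()
x = torch.randn(2, 4)
grad_shapes = []
for repeats in (1, 3):
    net.zero_grad()
    out = net(x, repeats)                # the graph is rebuilt for this call
    out.sum().backward()
    grad_shapes.append(tuple(net.layer.weight.grad.shape))
print(grad_shapes)
```

Both calls back-propagate through a differently shaped graph without any special graph-building constructs.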

That was a helper for training, now we make a similar one for testing.

In [5]:
def _test_loop(loader, net):
    net.eval()
    correct = 0
    for image, target in loader:
        with torch.no_grad():
            image = Variable(image).to(device)
            target = Variable(target).to(device)

            prediction = net(image)

            _, prediction = torch.max(prediction, dim=1)
            correct += torch.sum(prediction == target).item()
        
    return correct / float(len(loader.dataset))

This is very similar to the training loop, but here we run the forward pass inside `torch.no_grad()`. This turns off gradient tracking for everything computed in the block, which is the whole net. This improves efficiency, since the net doesn't have to store the information necessary for back-propagation. `net.eval()` disables dropout and makes batch normalization use its running statistics instead of updating them.
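The effect of `torch.no_grad()` and `net.eval()`, as used in `_test_loop`, can be seen on a tiny made-up network with a dropout layer:

```python
import torch

net = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Dropout(p=0.5))
x = torch.ones(1, 4)

net.eval()                      # dropout becomes the identity
with torch.no_grad():           # no graph is recorded inside this block
    out_a = net(x)
    out_b = net(x)

print(out_a.requires_grad)          # no gradient tracking
print(torch.equal(out_a, out_b))    # deterministic: no random dropout mask

net.train()
out_train = net(x)
print(out_train.requires_grad)      # tracking is back outside no_grad
```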

Finally we make a function to train for several epochs.

In [6]:
from torch.utils.data import DataLoader
from torchvision.datasets import CIFAR10
from torchvision.transforms import ToTensor

def epoch_loop(net, optimizer, epochs):
    print(net)
    
    data_train = CIFAR10("data", train=True, download=False, transform=ToTensor())
    data_test = CIFAR10("data", train=False, download=False, transform=ToTensor())

    def loader_factory(_data):
        return DataLoader(_data, batch_size=128, shuffle=True, num_workers=8, pin_memory=True)

    criterion = torch.nn.CrossEntropyLoss()

    for epoch in range(epochs):
        loss = _train_loop(loader_factory(data_train), net, criterion, optimizer)
        top1_accuracy = _test_loop(loader_factory(data_test), net)
        print("Epoch {}: Top-1 = {:.1f}%, Loss = {:.3e}".format(epoch, 100 * top1_accuracy, loss))

The interesting thing here is `DataLoader`. It is a multi-process data loader provided by PyTorch. It lets you run IO, image decompression and any data augmentation you want in multiple worker processes, and it collects the results into batches. `pin_memory` puts the batches in page-locked host memory, from which transfers to the GPU are faster. `transform` is the set of operations applied to the raw data. In this case, we just convert it to tensors, but we could use some of torchvision's pre-defined transforms, like random crops and flips. Unlike the method used in the lab description, this draws a new random augmentation for each sample in every epoch.

Usually it is better to inherit from `Dataset` and write your own data augmentation code, instead of relying on the pre-defined transforms. This is different from Tensorflow, where loaders are built into the graph and you are limited to what Tensorflow is providing. That is actually a recurring schism between PyTorch and Tensorflow. PyTorch uses standard packages, like `multiprocessing` and `pillow`, while Tensorflow builds everything itself.
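A minimal sketch of hand-written augmentation code, assuming the common pattern of subclassing `torch.utils.data.Dataset` and handing the result to `DataLoader`. The class `RandomFlipDataset` and the random stand-in images are hypothetical; real code would load CIFAR-10 instead.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class RandomFlipDataset(Dataset):
    """Hypothetical dataset over in-memory images that flips each image
    horizontally with probability 0.5 every time it is fetched."""
    def __init__(self, images, targets):
        self.images = images      # float tensor of shape (N, C, H, W)
        self.targets = targets    # long tensor of shape (N,)

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = self.images[index]
        if torch.rand(1).item() < 0.5:
            image = image.flip(-1)    # flip along the width axis
        return image, self.targets[index]

images = torch.rand(10, 3, 32, 32)         # random stand-in data
targets = torch.randint(0, 10, (10,))
loader = DataLoader(RandomFlipDataset(images, targets), batch_size=4, shuffle=True)

batch_shapes = [(tuple(image.shape), tuple(target.shape)) for image, target in loader]
print(batch_shapes)
```

Because the flip is decided inside `__getitem__`, every epoch sees a freshly randomized version of each image.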

Here we only print the training and test errors as text. Facebook recommends a package called `visdom` for visualization. I didn't like it, but there is an excellent TensorBoard wrapper for PyTorch, `tensorboardX`, which exposes almost the complete interface, and TensorBoard is nice.

With all that training stuff in place, let's define some nets. I think everything below is quite straightforward. Note that we also call `.to(device)` on the nets, to move the parameters to the GPU. And we don't specify the input image size anywhere, so if we input something with the wrong size, we will only find out during the forward pass.
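The `Linear` input sizes in the nets below, such as `5 * 16 ** 2`, have to be worked out by hand. A small helper like the following can do the arithmetic; `output_size` is a hypothetical name, and the formula assumes no dilation and the default floor rounding:

```python
def output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# Net 4: 32x32 input -> Conv2d(kernel 3, padding 1) -> MaxPool2d(kernel 3, stride 2, padding 1)
after_conv = output_size(32, kernel_size=3, stride=1, padding=1)
after_pool = output_size(after_conv, kernel_size=3, stride=2, padding=1)
flat_features = 5 * after_pool ** 2   # 5 output channels from the convolution

print(after_conv, after_pool, flat_features)
```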

In [7]:
class Flatten(torch.nn.Module):
    def forward(self, x):
        batch_size = x.size(0)
        return x.view(batch_size, -1)

def exercise_2_to_4():
    print("Running Net 2:")
    net = torch.nn.Sequential(
        Flatten(),
        torch.nn.Linear(3 * 32 ** 2, 10, bias=True),
    ).to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=1e-2)
    epoch_loop(net, optimizer, epochs=20)
    
    print("\nRunning Net 3:")
    net = torch.nn.Sequential(
        Flatten(),
        torch.nn.Linear(3 * 32 ** 2, 128, bias=True),
        torch.nn.ReLU(),
        torch.nn.Linear(128, 10, bias=True),
    ).to(device)
    optimizer = torch.optim.SGD(net.parameters(), lr=1e-3, momentum=0.99)
    epoch_loop(net, optimizer, epochs=20)
    
    print("\nRunning Net 4:")
    net = torch.nn.Sequential(
        torch.nn.Conv2d(in_channels=3, out_channels=5, kernel_size=3, padding=1, bias=True),
        torch.nn.ReLU(),
        torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        Flatten(),
        torch.nn.Linear(5 * 16 ** 2, 128, bias=True),
        torch.nn.ReLU(),
        torch.nn.Linear(128, 10, bias=True),
    ).to(device)
    optimizer = torch.optim.Adam(net.parameters())
    epoch_loop(net, optimizer, epochs=20)
    
exercise_2_to_4()

Running Net 2:
Sequential(
  (0): Flatten()
  (1): Linear(in_features=3072, out_features=10, bias=True)
)
Epoch 0: Top-1 = 31.4%, Loss = 1.558e-02
Epoch 1: Top-1 = 35.0%, Loss = 1.459e-02
Epoch 2: Top-1 = 34.5%, Loss = 1.431e-02
Epoch 3: Top-1 = 36.9%, Loss = 1.414e-02
Epoch 4: Top-1 = 37.6%, Loss = 1.402e-02
Epoch 5: Top-1 = 37.8%, Loss = 1.394e-02
Epoch 6: Top-1 = 38.1%, Loss = 1.388e-02
Epoch 7: Top-1 = 39.2%, Loss = 1.381e-02
Epoch 8: Top-1 = 38.6%, Loss = 1.376e-02
Epoch 9: Top-1 = 38.9%, Loss = 1.371e-02
Epoch 10: Top-1 = 38.3%, Loss = 1.367e-02
Epoch 11: Top-1 = 38.3%, Loss = 1.364e-02
Epoch 12: Top-1 = 37.6%, Loss = 1.361e-02
Epoch 13: Top-1 = 38.6%, Loss = 1.359e-02
Epoch 14: Top-1 = 39.1%, Loss = 1.356e-02
Epoch 15: Top-1 = 39.3%, Loss = 1.353e-02
Epoch 16: Top-1 = 39.9%, Loss = 1.350e-02
Epoch 17: Top-1 = 37.7%, Loss = 1.348e-02
Epoch 18: Top-1 = 39.0%, Loss = 1.345e-02
Epoch 19: Top-1 = 40.1%, Loss = 1.344e-02

Running Net 3:
Sequential(
  (0): Flatten()
  (1): Linear(in_fe

Not very impressive results for any of the networks. But hopefully you have learned some of the strengths of PyTorch, and how it compares to Tensorflow. Maybe the simplicity of PyTorch makes you think that it is better than Tensorflow, but that is not necessarily true.

Since Tensorflow makes a C++ computation graph, it can apply much heavier optimizations to that graph. Those optimizations could be things like fusing batch normalization into adjacent layers, or common-subexpression-elimination. And since you have your computations in a monolithic C++ blob, it is very easy to deploy on servers, which may be important to you. PyTorch on the other hand, is much more geared towards research.
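To give a feel for what fusing batch normalization into an adjacent layer means, here is a sketch of the arithmetic, written in PyTorch only for consistency with the rest of this text. It says nothing about how Tensorflow implements the optimization; the layer sizes and random statistics are made up.

```python
import torch

torch.manual_seed(0)
linear = torch.nn.Linear(8, 4)
bn = torch.nn.BatchNorm1d(4)

with torch.no_grad():
    # Give the batch-norm layer non-trivial statistics and affine parameters.
    bn.running_mean.uniform_(-1, 1)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-1, 1)
bn.eval()   # inference mode: use the running statistics

# Fold y = gamma * (W x + b - mean) / sqrt(var + eps) + beta into one Linear layer.
fused = torch.nn.Linear(8, 4)
with torch.no_grad():
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.copy_(linear.weight * scale.unsqueeze(1))
    fused.bias.copy_((linear.bias - bn.running_mean) * scale + bn.bias)

    x = torch.randn(16, 8)
    reference = bn(linear(x))
    folded = fused(x)

max_difference = (reference - folded).abs().max().item()
print(max_difference)
```

The fused layer does one matrix multiplication where the original pair did a multiplication plus a normalization, which is the kind of saving a graph-level optimizer can apply automatically.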