# Introduction to Deep Learning Packages
In this tutorial, we'll take you through developing models to classify images in PyTorch from start to finish. We'll go through preprocessing, building neural networks, and experimentation.

Let's get started!

In [1]:
import argparse
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from tqdm import tqdm

## What is PyTorch
It's a Python-based deep learning library. It's very popular amongst researchers because of its speed and flexibility.

At the base of PyTorch is the idea of a `Tensor`.
A `Tensor` is just an `n-dimensional` array, like a NumPy `ndarray`.

For example, let's make a random `3x3` tensor.
We can inspect tensors by printing them, and get their size with `.size()`.

In [2]:
a = torch.rand(3,3)
print(a)
print(a.size())

tensor([[ 0.9705,  0.3110,  0.8510],
        [ 0.8947,  0.1625,  0.9597],
        [ 0.7825,  0.8919,  0.2780]])
torch.Size([3, 3])


We can also take a nested Python list and convert it to a tensor.


In [3]:
b = torch.Tensor([[1,2,3],[4,5,6]])
print(b)
print(b.size())

tensor([[ 1.,  2.,  3.],
        [ 4.,  5.,  6.]])
torch.Size([2, 3])
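
To make the NumPy analogy concrete, here is a minimal sketch of moving data between the two libraries. The variable names are made up for illustration; `torch.from_numpy`, `Tensor.numpy` and `torch.tensor` are the standard conversion routes.

```python
import numpy as np
import torch

# Start from a NumPy array and wrap it as a tensor (shares memory with the array).
arr = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], dtype=np.float32)
t = torch.from_numpy(arr)
print(t.size(), t.dtype)

# Convert back to NumPy when needed.
back = t.numpy()
print(type(back), back.shape)

# torch.tensor copies the data and infers the dtype from the values,
# so a list of Python ints becomes an integer tensor.
ints = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(ints.dtype)
```

Note that `torch.tensor` keeps integers as integers, whereas the `torch.Tensor(...)` call above produced floats. This matters later, because operations such as `torch.matmul` expect both operands to share a dtype.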


### Operations on Tensors
Operations between tensors produce new tensors.
You can use regular Python syntax to add or multiply them. PyTorch also has nice functions for matrix multiplication and reshaping tensors.


In [4]:
a = a + 4
print(a)
d = a * 2
print(d)
e = a - d
print(e)

print(a.size(), b.size())
# Won't work because the number of columns in a
# doesn't match the number of rows in b
# c = torch.matmul(a, b)
# This will work
c = torch.matmul(b, a)
print(c.size())


tensor([[ 4.9705,  4.3110,  4.8510],
        [ 4.8947,  4.1625,  4.9597],
        [ 4.7825,  4.8919,  4.2780]])
tensor([[ 9.9410,  8.6220,  9.7021],
        [ 9.7894,  8.3250,  9.9193],
        [ 9.5651,  9.7838,  8.5559]])
tensor([[-4.9705, -4.3110, -4.8510],
        [-4.8947, -4.1625, -4.9597],
        [-4.7825, -4.8919, -4.2780]])
torch.Size([3, 3]) torch.Size([2, 3])
torch.Size([2, 3])


If you're running into a bug, it's often helpful to step through and check your dimensions.
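
As a small sketch of that debugging habit, the snippet below triggers a shape mismatch on purpose, fixes it with a transpose, and shows how `view` reshapes a tensor. The tensors here are fresh random ones, not the `a` and `b` from the cells above.

```python
import torch

a = torch.rand(3, 3)
b = torch.rand(2, 3)

# matmul needs the inner dimensions to agree: (3x3) @ (2x3) does not.
try:
    torch.matmul(a, b)
except RuntimeError as err:
    print("shape mismatch:", err)

# Transposing b gives (3x3) @ (3x2), which is valid.
c = torch.matmul(a, b.t())
print(c.size())

# view reshapes without copying; -1 lets PyTorch infer that dimension.
flat = a.view(-1)
print(flat.size())
print(a.view(1, -1).size())
```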

### The magic: Autograd
The power behind PyTorch comes from its automatic differentiation engine, Autograd. To turn it on, construct your tensors with `requires_grad=True`.

Every computation you make, e.g. `c = a + b`, adds to a computation graph, with node `c` being linked to `a` and `b` via a `+` operator.

<img src="compute_graph.png">

If you call `.backward()` on your final node, autograd will work out all the gradients for you and store the values in `a.grad` and `b.grad`.

Let's look at an example.

Consider the function
`y = a*(x^2) + b`, where `a = b = 1`. This is a simple parabola. 
<img src="parabola.png">

The compute graph for this would be:
<img src="compute_graph_parabola.png">

From basic calculus, we know that the derivative of `y` with respect to `x` is
`dy/dx = 2a x`. So the derivative at `x = 1` is `2`.
This wasn't very hard, but let's see how autograd can do this automatically.

In [5]:
a = torch.ones(1, requires_grad=False)
b = torch.ones(1, requires_grad=False)
x = torch.ones(1, requires_grad=True)
y = a*(x*x)  + b
print(y)
y.backward()
print("x.grad={}".format(x.grad))


tensor([ 2.])
x.grad=tensor([ 2.])
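
The same check can be repeated at another point to confirm that autograd agrees with `dy/dx = 2a x`. This is a sketch with arbitrarily chosen values (`a = 2`, `b = 1`, `x = 3`), not part of the original example.

```python
import torch

a = torch.tensor([2.0])
b = torch.tensor([1.0])
x = torch.tensor([3.0], requires_grad=True)

y = a * (x * x) + b              # y = 2 * 3^2 + 1 = 19
y.backward()

grad_value = x.grad.item()       # autograd's answer
analytic = (2 * a * x.detach()).item()   # dy/dx = 2 * a * x = 12
print(y.item(), grad_value, analytic)

# Gradients accumulate across backward calls, so reset before reusing x.
x.grad.zero_()
```

The final line matters in practice: calling `backward()` twice without resetting adds the new gradient to the old one, which is why training loops clear gradients at every step.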


### Why autograd is exciting
Now, this may have seemed trivial and contrived, but this flexible automatic differentiation process really shines when our computation graph is large and complex, e.g. when it's a neural network.

If we place our whole model and the loss calculation into our computation graph, then a single call to `backward` will compute all the gradients, which makes it very easy to train neural networks.
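
A minimal sketch of that idea, using a toy two-layer network and fake data (the layer sizes, inputs and labels are all made up for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# A toy two-layer network; the sizes are arbitrary.
net = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))

inputs = torch.rand(5, 4)                 # batch of 5 fake examples
targets = torch.tensor([0, 2, 1, 0, 2])   # fake class labels

loss = F.cross_entropy(net(inputs), targets)
loss.backward()                           # one call fills in every parameter's .grad

for name, param in net.named_parameters():
    print(name, tuple(param.grad.shape))
```

Each parameter ends up with a gradient of the same shape as itself, which is what an optimizer needs to update it.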




# The Task: MNIST, Digit Classification
<img src="mnist.png">

In this lab, we'll build a neural network to classify hand-written digits.



## Step 1: Loading Data and Preprocessing
Let's start by loading the data.
We're going to normalize our images to have zero mean and unit variance. We'll do this using some torchvision transforms. This generally helps stabilize learning, and is common practice.

In [6]:
normalize_image = transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,))
                ])

all_train = datasets.MNIST('data', train=True, download=True,
                        transform=normalize_image)
num_train = int(len(all_train)*.8)
train = [all_train[i] for i in range(num_train)]
dev = [all_train[i] for i in range(num_train,len(all_train))]
test = datasets.MNIST('data', train=False, download=True, 
                      transform=normalize_image)
                           


Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!


In [7]:
train[0][0].size()

torch.Size([1, 28, 28])
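
To make the normalization step concrete, the sketch below applies the same arithmetic by hand to a fake image: subtract the mean, then divide by the standard deviation. The constants are the ones passed to `transforms.Normalize` above; the random image is a stand-in for a real digit.

```python
import torch

mean, std = 0.1307, 0.3081   # the constants passed to transforms.Normalize above

# A fake 1x28x28 "image" with pixel values in [0, 1], as ToTensor would produce.
torch.manual_seed(0)
image = torch.rand(1, 28, 28)

# Per-channel normalization: subtract the mean, divide by the standard deviation.
normalized = (image - mean) / std

print(normalized.size())
print(normalized.min().item(), normalized.max().item())
```

The shape is unchanged; only the value range shifts, so a black pixel (0) maps to a negative number and a white pixel (1) to a positive one.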

## Step 2: Building a model

All PyTorch models should be implemented as subclasses of `nn.Module`.

To build a model you need to:
a) define what parameters it'll need in its `__init__` function
b) define the model's computation, using those parameters, in a `forward` function.


To keep things simple, let's define a simple linear classifier, like logistic regression. We'll experiment with more complex models soon.

In [8]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc = nn.Linear(28*28, 10)

    def forward(self, x):
        batch_size, num_channels, height, width = x.size()
        x = x.view(batch_size, -1)
        return self.fc(x)
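
Before training, it can help to push a fake batch through the model and check the output shape. The sketch below re-declares the same linear classifier under a hypothetical name (`LinearClassifier`) so that it runs on its own.

```python
import torch
import torch.nn as nn

class LinearClassifier(nn.Module):  # same structure as the Model cell above
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(28 * 28, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)   # flatten each image to a 784-vector
        return self.fc(x)

clf = LinearClassifier()
fake_batch = torch.rand(4, 1, 28, 28)   # 4 fake MNIST-sized images
logits = clf(fake_batch)
print(logits.size())                     # one score per digit class

num_params = sum(p.numel() for p in clf.parameters())
print(num_params)                        # 784 * 10 weights + 10 biases
```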


## Step 3. Defining our training procedure

To train our model, let's introduce a couple new PyTorch ideas.

A [DataLoader](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) is an iterator that goes over our entire dataset and selects batches. 
We'll be using this to iterate through our train/dev/test sets.
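
As a sketch of what a `DataLoader` does, the snippet below batches a tiny made-up dataset. `TensorDataset` is used only to keep the example free of downloads.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 fake examples with 3 features each, plus integer labels.
features = torch.arange(30, dtype=torch.float32).view(10, 3)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

batch_sizes = []
for batch_features, batch_labels in loader:
    batch_sizes.append(batch_features.size(0))
    print(batch_features.size(), batch_labels.tolist())

# 10 examples in batches of 4 give batches of 4, 4 and 2.
print(batch_sizes)
```

With `shuffle=True` the order of examples changes on every pass, but every example still appears exactly once per epoch.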

An [Optimizer](https://pytorch.org/docs/stable/optim.html) defines an update rule. In class, we've discussed vanilla SGD, which is one method to compute the next weight, given the current weight and gradient. There are plenty of other optimizers you can try from the PyTorch library.

Let's initialize our data loaders and optimizer now.


In [22]:
# Training settings
batch_size = 64
epochs = 15
lr = .01
momentum = 0.5


train_loader = torch.utils.data.DataLoader(train, batch_size=batch_size, shuffle=True)
dev_loader = torch.utils.data.DataLoader(dev, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test, batch_size=batch_size, shuffle=True)


model = Model()
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)
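
The update rule that an optimizer applies can be sketched by hand. The example below uses a toy loss and compares one step of `optim.SGD` against the vanilla rule `w - lr * grad`. Momentum is left at its default of 0 here, so this illustrates plain SGD rather than the exact `momentum=0.5` configuration used above.

```python
import torch
import torch.optim as optim

lr = 0.01
w = torch.tensor([1.0, -2.0], requires_grad=True)

loss = (w ** 2).sum()   # toy loss with gradient 2 * w
loss.backward()

# What vanilla SGD should produce: w - lr * grad.
expected = w.detach() - lr * w.grad

# Let the optimizer apply the same rule (momentum left at its default of 0).
sgd = optim.SGD([w], lr=lr)
sgd.step()

print(w.detach(), expected)
```

With a nonzero momentum, the optimizer additionally keeps a running combination of past gradients and steps along that instead of the raw gradient.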


To train our model:

1) we'll randomly sample batches from our train loader

2) compute our loss (using standard `cross_entropy`)

3) compute our gradients (by calling `backward()` on our loss)

4) update our neural network with an `optimizer.step()`, and go back to 1)

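I've added some extra code here to log our accuracy and average loss for the epoch. Note that the logged loss is the sum of the per-batch mean losses divided by the number of samples, so it is roughly the mean per-example loss divided by the batch size.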


In [23]:
def train_epoch(model, train_loader, optimizer, epoch):
    model.train() # Set the nn.Module to train mode. 
    total_loss = 0
    correct = 0
    num_samples = len(train_loader.dataset)
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad()
        output = model(data)
        loss = F.cross_entropy(output, target)
        loss.backward()
        optimizer.step()
        pred = output.max(1, keepdim=True)[1] # get the index of the largest logit
        correct += pred.eq(target.view_as(pred)).sum().item()
        total_loss += loss.item() # Accumulate a Python float, not the computation graph

    print('Train Epoch: {} \tLoss: {:.4f}, Accuracy: {}/{} ({:.0f}%)'.format(
            epoch, total_loss / num_samples, 
            correct, 
            num_samples,
            100. * correct / num_samples))
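
One detail of the logging above is worth a sketch: `F.cross_entropy` returns the mean loss over a batch by default, so summing those means and dividing by the number of samples gives a value scaled down by roughly the batch size. The example below uses fake logits and labels to compare that quantity with a true per-example average obtained via `reduction='sum'`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(128, 10)            # fake model outputs for 128 examples
targets = torch.randint(0, 10, (128,))   # fake labels
batch_size = 64

sum_of_batch_means = 0.0
sum_of_example_losses = 0.0
for start in range(0, 128, batch_size):
    out = logits[start:start + batch_size]
    tgt = targets[start:start + batch_size]
    sum_of_batch_means += F.cross_entropy(out, tgt).item()
    sum_of_example_losses += F.cross_entropy(out, tgt, reduction='sum').item()

logged = sum_of_batch_means / 128          # the quantity train_epoch prints
per_example = sum_of_example_losses / 128  # mean loss per example
print(logged, per_example, per_example / logged)
```

The ratio comes out at the batch size, which is why the losses printed in this tutorial look so small. Either convention is fine for comparing epochs, as long as it is applied consistently.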


## Step 3.5 Define our evaluation loop
Similar to above, we'll also loop through our dev or test set, and compute our loss and accuracy. 
This lets us see how well our model is generalizing. 

In [24]:
def eval_epoch(model, test_loader, name):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad(): # No gradients are needed for evaluation
        for data, target in test_loader:
            output = model(data)
            test_loss += F.cross_entropy(output, target).item() # add this batch's mean loss
            pred = output.max(1, keepdim=True)[1] # get the index of the largest logit
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print('\n{} set: Average loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
        name,
        test_loss, 
        correct, 
        len(test_loader.dataset),
        100. * correct / len(test_loader.dataset)))
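
Two small points about evaluation can be checked in isolation. First, `output.max(1, keepdim=True)[1]` picks the same class indices as the shorter `output.argmax(dim=1)`. Second, evaluation does not need gradients, and `torch.no_grad()` is the standard context manager for switching gradient tracking off. The tensors below are fake stand-ins for model outputs.

```python
import torch

torch.manual_seed(0)
output = torch.randn(6, 10)                 # fake logits for 6 examples
target = torch.randint(0, 10, (6,))

pred_max = output.max(1, keepdim=True)[1]   # form used above, shape (6, 1)
pred_argmax = output.argmax(dim=1)          # same indices, shape (6,)

correct_a = pred_max.eq(target.view_as(pred_max)).sum().item()
correct_b = (pred_argmax == target).sum().item()
print(correct_a, correct_b)

# Inside torch.no_grad(), results do not track gradients.
w = torch.ones(3, requires_grad=True)
with torch.no_grad():
    y = w * 2
print(y.requires_grad)
```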


## Step 4: Training the model

In [25]:

for epoch in range(1, epochs + 1):
    train_epoch(model, train_loader, optimizer, epoch)
    eval_epoch(model,  dev_loader, "Dev")
    print("---")

Train Epoch: 1 	Loss: 0.0308, Accuracy: 16512/48000 (34%)

Dev set: Average loss: 0.0259, Accuracy: 5506/12000 (46%)

---
Train Epoch: 2 	Loss: 0.0220, Accuracy: 28596/48000 (60%)

Dev set: Average loss: 0.0187, Accuracy: 7990/12000 (67%)

---
Train Epoch: 3 	Loss: 0.0168, Accuracy: 32857/48000 (68%)

Dev set: Average loss: 0.0150, Accuracy: 8613/12000 (72%)

---
Train Epoch: 4 	Loss: 0.0143, Accuracy: 34473/48000 (72%)

Dev set: Average loss: 0.0133, Accuracy: 8929/12000 (74%)

---
Train Epoch: 5 	Loss: 0.0130, Accuracy: 35453/48000 (74%)

Dev set: Average loss: 0.0125, Accuracy: 9049/12000 (75%)

---
Train Epoch: 6 	Loss: 0.0122, Accuracy: 36175/48000 (75%)

Dev set: Average loss: 0.0118, Accuracy: 9177/12000 (76%)

---
Train Epoch: 7 	Loss: 0.0116, Accuracy: 36588/48000 (76%)

Dev set: Average loss: 0.0114, Accuracy: 9259/12000 (77%)

---
Train Epoch: 8 	Loss: 0.0112, Accuracy: 36956/48000 (77%)

Dev set: Average loss: 0.0107, Accuracy: 9361/12000 (78%)

---
Train Epoch: 9 	Loss: 0.

## Step 5. Experiment with MLP
This model gets a dev accuracy of 93%, which isn't too bad. However, the power of neural networks comes from composing layers with nonlinearities.
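
To see why the nonlinearities matter, here is a sketch showing that two stacked linear layers with nothing in between collapse into a single linear map, while inserting a ReLU breaks that equivalence. Layer sizes and inputs are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
fc1 = nn.Linear(5, 4)
fc2 = nn.Linear(4, 3)
x = torch.randn(64, 5)

# Without a nonlinearity, fc2(fc1(x)) is itself a single linear map.
stacked = fc2(fc1(x))
W = fc2.weight @ fc1.weight                 # combined weight, shape (3, 5)
b = fc2.weight @ fc1.bias + fc2.bias        # combined bias, shape (3,)
collapsed = x @ W.t() + b
same_without_relu = torch.allclose(stacked, collapsed, atol=1e-5)
print(same_without_relu)

# With a ReLU in between, the composition is no longer that linear map.
with_relu = fc2(F.relu(fc1(x)))
same_with_relu = torch.allclose(with_relu, collapsed, atol=1e-5)
print(same_with_relu)
```

So adding more linear layers alone would not make the model more expressive than the linear classifier from Step 2; the ReLU between layers is what does.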

Let's try a more complex model.

In [20]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.fc1 = nn.Linear(28*28, 200)
        self.fc2 = nn.Linear(200, 200)
        self.fc3 = nn.Linear(200, 10)
        

    def forward(self, x):
        batch_size, num_channels, height, width = x.size()
        x = x.view(batch_size, -1)
        hidden1 = F.relu(self.fc1(x))
        hidden2 = F.relu(self.fc2(hidden1))
        logit = self.fc3(hidden2)
        return logit
    
model = Model()
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

for epoch in range(1, epochs + 1):
    train_epoch(model, train_loader, optimizer, epoch)
    eval_epoch(model,  dev_loader, "Dev")
    print("---")

Train Epoch: 1 	Loss: 0.0097, Accuracy: 40389/48000 (84%)

Dev set: Average loss: 0.0044, Accuracy: 11013/12000 (92%)

---
Train Epoch: 2 	Loss: 0.0041, Accuracy: 44392/48000 (92%)

Dev set: Average loss: 0.0034, Accuracy: 11271/12000 (94%)

---
Train Epoch: 3 	Loss: 0.0031, Accuracy: 45206/48000 (94%)

Dev set: Average loss: 0.0027, Accuracy: 11425/12000 (95%)

---
Train Epoch: 4 	Loss: 0.0025, Accuracy: 45763/48000 (95%)

Dev set: Average loss: 0.0023, Accuracy: 11507/12000 (96%)

---
Train Epoch: 5 	Loss: 0.0021, Accuracy: 46156/48000 (96%)

Dev set: Average loss: 0.0021, Accuracy: 11578/12000 (96%)

---
Train Epoch: 6 	Loss: 0.0018, Accuracy: 46392/48000 (97%)

Dev set: Average loss: 0.0019, Accuracy: 11588/12000 (97%)

---
Train Epoch: 7 	Loss: 0.0015, Accuracy: 46639/48000 (97%)

Dev set: Average loss: 0.0017, Accuracy: 11625/12000 (97%)

---
Train Epoch: 8 	Loss: 0.0013, Accuracy: 46832/48000 (98%)

Dev set: Average loss: 0.0017, Accuracy: 11622/12000 (97%)

---
Train Epoch: 9 	

## Step 6. Experiment with CNN
A 3-layer MLP gets a dev accuracy of 97%, a strong improvement over the simple linear model.
Now let's experiment with a convolutional neural network.


In [21]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.hidden_dim = 128
        self.conv1 = nn.Conv2d(1, self.hidden_dim, kernel_size=3)
        self.fc = nn.Linear(self.hidden_dim, 10)
        
    def forward(self, x):
        batch_size, num_channels, height, width = x.size()
        
        hidden = F.relu(self.conv1(x))
        hidden = hidden.view((batch_size, self.hidden_dim, -1))
        hidden,_ = torch.max(hidden, dim=-1)
        logit = self.fc(hidden)
        return logit
    
model = Model()
optimizer = optim.SGD(model.parameters(), lr=lr, momentum=momentum)

for epoch in range(1, epochs + 1):
    train_epoch(model, train_loader, optimizer, epoch)
    eval_epoch(model,  dev_loader, "Dev")
    print("---")

Train Epoch: 1 	Loss: 0.0307, Accuracy: 16339/48000 (34%)

Dev set: Average loss: 0.0254, Accuracy: 6226/12000 (52%)

---
Train Epoch: 2 	Loss: 0.0220, Accuracy: 27688/48000 (58%)

Dev set: Average loss: 0.0190, Accuracy: 7714/12000 (64%)

---
Train Epoch: 3 	Loss: 0.0176, Accuracy: 31061/48000 (65%)

Dev set: Average loss: 0.0161, Accuracy: 8174/12000 (68%)

---
Train Epoch: 4 	Loss: 0.0153, Accuracy: 33100/48000 (69%)

Dev set: Average loss: 0.0144, Accuracy: 8574/12000 (71%)

---
Train Epoch: 5 	Loss: 0.0138, Accuracy: 34563/48000 (72%)

Dev set: Average loss: 0.0129, Accuracy: 9016/12000 (75%)

---
Train Epoch: 6 	Loss: 0.0128, Accuracy: 35446/48000 (74%)

Dev set: Average loss: 0.0123, Accuracy: 9023/12000 (75%)

---
Train Epoch: 7 	Loss: 0.0120, Accuracy: 36143/48000 (75%)

Dev set: Average loss: 0.0120, Accuracy: 9011/12000 (75%)

---
Train Epoch: 8 	Loss: 0.0114, Accuracy: 36781/48000 (77%)

Dev set: Average loss: 0.0115, Accuracy: 9065/12000 (76%)

---
Train Epoch: 9 	Loss: 0.
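
To trace the tensor shapes through the convolutional model above, the sketch below runs a fake batch through the same steps one at a time. It also compares the manual max over positions with `F.adaptive_max_pool2d`, a built-in way to express the same global max pooling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv1 = nn.Conv2d(1, 128, kernel_size=3)   # same layer as in the model above
x = torch.rand(2, 1, 28, 28)               # 2 fake MNIST-sized images

hidden = F.relu(conv1(x))
print(hidden.size())                        # 3x3 kernel, no padding: 28 -> 26

flat = hidden.view(2, 128, -1)
print(flat.size())                          # 26 * 26 = 676 positions per channel

pooled, _ = torch.max(flat, dim=-1)
print(pooled.size())                        # one max per channel

# The same global max pooling via a built-in functional.
pooled_builtin = F.adaptive_max_pool2d(hidden, 1).view(2, 128)
print(torch.equal(pooled, pooled_builtin))
```

Because each of the 128 filters is reduced to a single number per image, the final linear layer only sees whether a pattern occurred somewhere, not where it occurred.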

## Step 7. Explore further.
You can try different model architectures, different optimizers, learning rates and regularization strategies. Neural networks are incredibly flexible, and so the space to explore is enormous. Once you're done exploring, take your best model (i.e. the one that achieves the best results on the dev set) and run it on test!

In [None]:
eval_epoch(model,  test_loader, "Test")
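
As one possible starting point for the exploration suggested in Step 7, the sketch below combines a different optimizer (Adam), weight decay and dropout. The class name `DropoutMLP` and all hyperparameter values are hypothetical and untuned, and the single step runs on fake data only to show that the pieces fit together.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class DropoutMLP(nn.Module):   # hypothetical variant, not from the tutorial
    def __init__(self, hidden_dim=200, dropout=0.5):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, hidden_dim)
        self.drop = nn.Dropout(dropout)
        self.fc2 = nn.Linear(hidden_dim, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)
        return self.fc2(self.drop(F.relu(self.fc1(x))))

candidate = DropoutMLP()
# Adam with weight decay; both values are untuned guesses.
adam = optim.Adam(candidate.parameters(), lr=1e-3, weight_decay=1e-4)

# One training step on fake data.
data = torch.rand(8, 1, 28, 28)
target = torch.randint(0, 10, (8,))
candidate.train()              # dropout is active in train mode
adam.zero_grad()
loss = F.cross_entropy(candidate(data), target)
loss.backward()
adam.step()
print(loss.item())
```

With dropout in the model, the `model.train()` and `model.eval()` calls in the training and evaluation loops start to matter, since dropout is only applied in train mode.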

## Step 8. Now try it on your own on CIFAR
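
One thing that changes when moving to CIFAR is the input shape: the images are 3-channel, 32x32 color images rather than 1x28x28. A minimal sketch of the linear classifier adapted to that shape follows, assuming CIFAR-10 with its 10 classes. The class name is hypothetical and the batch is fake data, so loading the real dataset is left to you.

```python
import torch
import torch.nn as nn

class CifarLinear(nn.Module):   # hypothetical name; assumes CIFAR-10 (10 classes)
    def __init__(self, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(3 * 32 * 32, num_classes)

    def forward(self, x):
        return self.fc(x.view(x.size(0), -1))

fake_cifar_batch = torch.rand(4, 3, 32, 32)   # stand-in for real CIFAR images
cifar_logits = CifarLinear()(fake_cifar_batch)
print(cifar_logits.size())
```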
