# Training a ConvNet in PyTorch

In this notebook, you'll learn how to use the powerful PyTorch framework to specify a conv net architecture and train it on the CIFAR-10 dataset.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data import DataLoader
from torch.utils.data import sampler

import torchvision.datasets as dset
import torchvision.transforms as T

import numpy as np

import timeit

## What's this PyTorch business?

You've written a lot of code in this assignment to provide a whole host of neural network functionality. Dropout, Batch Norm, and 2D convolutions are some of the workhorses of deep learning in computer vision. You've also worked hard to make your code efficient and vectorized.

For the last part of this assignment, though, we're going to leave behind your beautiful codebase and instead migrate to one of two popular deep learning frameworks: in this instance, PyTorch (or TensorFlow, if you switch over to that notebook). 

Why?

* Our code will now run on GPUs! Much faster training. When using a framework like PyTorch or TensorFlow you can harness the power of the GPU for your own custom neural network architectures without having to write CUDA code directly (which is beyond the scope of this class).
* We want you to be ready to use one of these frameworks for your project so you can experiment more efficiently than if you were writing every feature you want to use by hand. 
* We want you to stand on the shoulders of giants! TensorFlow and PyTorch are both excellent frameworks that will make your lives a lot easier, and now that you understand their guts, you are free to use them :) 
* We want you to be exposed to the sort of deep learning code you might run into in academia or industry. 

## How will I learn PyTorch?

If you've used Torch before, but are new to PyTorch, this tutorial might be of use: http://pytorch.org/tutorials/beginner/former_torchies_tutorial.html

Otherwise, this notebook will walk you through much of what you need to do to train models in PyTorch. See the end of the notebook for some links to helpful tutorials if you want to learn more or need further clarification on topics that aren't fully explained here.

## Load Datasets

We load the CIFAR-10 dataset. This might take a couple of minutes the first time you do it, but the files should stay cached after that.

In [2]:
class ChunkSampler(sampler.Sampler):
    """Samples elements sequentially from some offset. 
    Arguments:
        num_samples: # of desired datapoints
        start: offset where we should start selecting from
    """
    def __init__(self, num_samples, start = 0):
        self.num_samples = num_samples
        self.start = start

    def __iter__(self):
        return iter(range(self.start, self.start + self.num_samples))

    def __len__(self):
        return self.num_samples

NUM_TRAIN = 49000
NUM_VAL = 1000

cifar10_train = dset.CIFAR10('./cs231n/datasets', train=True, download=True,
                           transform=T.ToTensor())
loader_train = DataLoader(cifar10_train, batch_size=64, sampler=ChunkSampler(NUM_TRAIN, 0))

cifar10_val = dset.CIFAR10('./cs231n/datasets', train=True, download=True,
                           transform=T.ToTensor())
loader_val = DataLoader(cifar10_val, batch_size=64, sampler=ChunkSampler(NUM_VAL, NUM_TRAIN))

cifar10_test = dset.CIFAR10('./cs231n/datasets', train=False, download=True,
                          transform=T.ToTensor())
loader_test = DataLoader(cifar10_test, batch_size=64)


Files already downloaded and verified
Files already downloaded and verified
Files already downloaded and verified
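
The training and validation loaders above read from the same underlying training set and differ only in which indices their sampler yields. As a minimal, framework-free sketch of that split (the helper name `chunk_indices` is hypothetical and simply mirrors what `ChunkSampler.__iter__` returns):

```python
def chunk_indices(num_samples, start=0):
    """Mirror ChunkSampler.__iter__: sequential indices starting at an offset."""
    return range(start, start + num_samples)

NUM_TRAIN = 49000
NUM_VAL = 1000

train_idx = chunk_indices(NUM_TRAIN, 0)        # indices 0 .. 48999
val_idx = chunk_indices(NUM_VAL, NUM_TRAIN)    # indices 49000 .. 49999

# The two ranges are disjoint and together cover all 50,000 training images.
assert set(train_idx).isdisjoint(val_idx)
assert len(train_idx) + len(val_idx) == 50000
print(train_idx[0], train_idx[-1], val_idx[0], val_idx[-1])
```

Because the indices are sequential rather than shuffled, each epoch visits the training images in the same order.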


For now, we're going to use a CPU-friendly datatype. Later, we'll switch to a datatype that will move all our computations to the GPU and measure the speedup.

In [3]:
dtype = torch.FloatTensor # the CPU datatype

# Constant to control how frequently we print train loss
print_every = 100

# This is a little utility that we'll use to reset the model
# if we want to re-initialize all our parameters
def reset(m):
    if hasattr(m, 'reset_parameters'):
        m.reset_parameters()

## Example Model

### Some assorted tidbits

Let's start by looking at a simple model. First, note that PyTorch operates on Tensors, which are n-dimensional arrays functionally analogous to numpy's ndarrays, with the additional feature that they can be used for computations on GPUs.

We'll provide you with a Flatten function, which we explain here. Remember that our image data (and more relevantly, our intermediate feature maps) are initially N x C x H x W, where:
* N is the number of datapoints
* C is the number of channels
* H is the height of the intermediate feature map in pixels
* W is the width of the intermediate feature map in pixels

This is the right way to represent the data when we are doing something like a 2D convolution, which needs a spatial understanding of where the intermediate features are relative to each other. When we feed data into fully connected affine layers, however, we want each datapoint to be represented by a single vector -- it's no longer useful to segregate the different channels, rows, and columns of the data. So, we use a "Flatten" operation to collapse the C x H x W values per representation into a single long vector. The Flatten function below first reads in the N, C, H, and W values from a given batch of data, and then returns a "view" of that data. "View" is analogous to numpy's "reshape" method: it reshapes x's dimensions to be N x ??, where ?? is allowed to be anything (in this case, it will be the product C * H * W, but we don't need to specify that explicitly).

In [4]:
class Flatten(nn.Module):
    def forward(self, x):
        N, C, H, W = x.size() # read in N, C, H, W
        return x.view(N, -1)  # "flatten" the C * H * W values into a single vector per image
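
To make the role of `-1` concrete, here is a small sketch of the arithmetic behind inferring the missing dimension. The helper `infer_view_shape` is hypothetical and only illustrates the bookkeeping; it is not how PyTorch implements `view`.

```python
def infer_view_shape(shape, target):
    """Resolve a single -1 in `target` so the element count matches `shape`."""
    total = 1
    for d in shape:
        total *= d
    known = 1
    for d in target:
        if d != -1:
            known *= d
    if total % known != 0:
        raise ValueError("target shape is incompatible with the input shape")
    # Replace the -1 entry with whatever is left over.
    return tuple(total // known if d == -1 else d for d in target)

# A batch of 64 CIFAR-10 images: 3 * 32 * 32 = 3072 values per image.
print(infer_view_shape((64, 3, 32, 32), (64, -1)))  # (64, 3072)
```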

### The example model itself

The first step to training your own model is defining its architecture.

Here's an example of a convolutional neural network defined in PyTorch -- try to understand what each line is doing, remembering that each layer is composed upon the previous layer. We haven't trained anything yet - that'll come next - for now, we want you to understand how everything gets set up. nn.Sequential is a container which applies each layer one after the other.

In this example, you see 2D convolutional layers (Conv2d), ReLU activations, and fully-connected layers (Linear). You also see the Cross-Entropy loss function and the Adam optimizer being used.

Make sure you understand why the parameters of the Linear layer are 5408 and 10.
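
One way to check the 5408 figure is to apply the usual output-size formula for an unpadded convolution. The function name `conv_output_size` below is an illustrative assumption, not part of PyTorch:

```python
def conv_output_size(in_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution (floor division, no dilation)."""
    return (in_size + 2 * padding - kernel_size) // stride + 1

# 32x32 input, 7x7 kernel, stride 2, no padding.
side = conv_output_size(32, kernel_size=7, stride=2)   # (32 - 7) // 2 + 1 = 13
flat = 32 * side * side                                 # 32 filters * 13 * 13
print(side, flat)  # 13 5408
```

The second parameter of the Linear layer, 10, is the number of CIFAR-10 classes.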


In [5]:
# Here's where we define the architecture of the model... 
simple_model = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=7, stride=2),
                nn.ReLU(inplace=True),
                Flatten(), # see above for explanation
                nn.Linear(5408, 10), # affine layer ## 32 filters, conv reduction makes input 13*13
              )

# Set the type of all data in this model to be FloatTensor 
simple_model.type(dtype)

loss_fn = nn.CrossEntropyLoss().type(dtype)
optimizer = optim.Adam(simple_model.parameters(), lr=1e-2) # lr sets the learning rate of the optimizer

PyTorch supports many other layer types, loss functions, and optimizers - you will experiment with these next. Here's the official API documentation for these (if any of the parameters used above were unclear, this resource will also be helpful). One note: what we call in the class "spatial batch norm" is called "BatchNorm2d" in PyTorch.

* Layers: http://pytorch.org/docs/nn.html
* Activations: http://pytorch.org/docs/nn.html#non-linear-activations
* Loss functions: http://pytorch.org/docs/nn.html#loss-functions
* Optimizers: http://pytorch.org/docs/optim.html#algorithms

## Training a specific model

In this section, we're going to specify a model for you to construct. The goal here isn't to get good performance (that'll be next), but instead to get comfortable with understanding the PyTorch documentation and configuring your own model. 

Using the code provided above as guidance, and using the following PyTorch documentation, specify a model with the following architecture:

* 7x7 Convolutional Layer with 32 filters and stride of 1
* ReLU Activation Layer
* Spatial Batch Normalization Layer
* 2x2 Max Pooling layer with a stride of 2
* Affine layer with 1024 output units
* ReLU Activation Layer
* Affine layer from 1024 input units to 10 outputs

And finally, set up a **cross-entropy** loss function and the **RMSprop** learning rule.

In [6]:
fixed_model_base = nn.Sequential( # You fill this in!
    nn.Conv2d(3, 32, kernel_size=7, stride=1),
    nn.ReLU(inplace=True),
    nn.BatchNorm2d(32),
    nn.MaxPool2d(2, stride=2),
    Flatten(),
    nn.Linear(5408, 1024),
    nn.ReLU(inplace=True),
    nn.Linear(1024, 10),
            )

fixed_model = fixed_model_base.type(dtype)

To make sure you're doing the right thing, use the following tool to check the dimensionality of your output (it should be 64 x 10, since our batches have size 64 and the output of the final affine layer should be 10, corresponding to our 10 classes):

In [7]:
## Now we're going to feed a random batch into the model you defined and make sure the output is the right size
x = torch.randn(64, 3, 32, 32).type(dtype)
x_var = Variable(x.type(dtype)) # Construct a PyTorch Variable out of your input data
ans = fixed_model(x_var)        # Feed it through the model! 

# Check to make sure what comes out of your model
# is the right dimensionality... this should be True
# if you've done everything correctly
np.array_equal(np.array(ans.size()), np.array([64, 10]))       

True

### GPU!

Now, we're going to switch the dtype of the model and our data to the GPU-friendly tensors, and see what happens... everything is the same, except we are casting our model and input tensors as this new dtype instead of the old one.

If this returns false, or otherwise fails in a not-graceful way (i.e., with some error message), you may not have an NVIDIA GPU available on your machine. If you're running locally, we recommend you switch to Google Cloud and follow the instructions to set up a GPU there. If you're already on Google Cloud, something is wrong -- make sure you followed the instructions on how to request and use a GPU on your instance. If you did, post on Piazza or come to Office Hours so we can help you debug.

In [5]:
# Verify that CUDA is properly configured and you have a GPU available

torch.cuda.is_available()

True

In [9]:
import copy
gpu_dtype = torch.cuda.FloatTensor

fixed_model_gpu = copy.deepcopy(fixed_model_base).type(gpu_dtype)

x_gpu = torch.randn(64, 3, 32, 32).type(gpu_dtype)
x_var_gpu = Variable(x_gpu) # Construct a PyTorch Variable out of the GPU input data
ans = fixed_model_gpu(x_var_gpu)        # Feed it through the model! 

# Check to make sure what comes out of your model
# is the right dimensionality... this should be True
# if you've done everything correctly
np.array_equal(np.array(ans.size()), np.array([64, 10]))

True

Run the following cell to evaluate the performance of the forward pass running on the CPU:

In [10]:
%%timeit 
ans = fixed_model(x_var)

10 loops, best of 3: 26.1 ms per loop


... and now the GPU:

In [11]:
%%timeit 
torch.cuda.synchronize() # Make sure there are no pending GPU computations
ans = fixed_model_gpu(x_var_gpu)        # Feed it through the model! 
torch.cuda.synchronize() # Make sure there are no pending GPU computations

100 loops, best of 3: 1.89 ms per loop


You should observe that even a simple forward pass like this is significantly faster on the GPU. So for the rest of the assignment (and when you go train your models in assignment 3 and your project!), you should use the GPU datatype for your model and your tensors. As a reminder, that is *torch.cuda.FloatTensor* (available in this notebook as *gpu_dtype*).

### Train the model.

Now that you've seen how to define a model and do a single forward pass of some data through it, let's walk through how you'd actually train one whole epoch over your training data (using the fixed_model_gpu defined above).

Make sure you understand how each PyTorch function used below corresponds to what you implemented in your custom neural network implementation.

Note that because we are not resetting the weights anywhere below, if you run the cell multiple times, you are effectively training multiple epochs (so your performance should improve).

First, set up an RMSprop optimizer (using a 1e-3 learning rate) and a cross-entropy loss function:

In [14]:
loss_fn = nn.CrossEntropyLoss(size_average=False)
optimizer = optim.RMSprop(fixed_model_gpu.parameters(), lr=1e-3)

In [26]:
# This sets the model in "training" mode. This is relevant for some layers that may have different behavior
# in training mode vs testing mode, such as Dropout and BatchNorm. 
fixed_model_gpu.train()

# Load one batch at a time.
for t, (x, y) in enumerate(loader_train):
    x_var = Variable(x.type(gpu_dtype))
    y_var = Variable(y.type(gpu_dtype).long())

    # This is the forward pass: predict the scores for each class, for each x in the batch.
    scores = fixed_model_gpu(x_var)
    
    # Use the correct y values and the predicted y values to compute the loss.
    loss = loss_fn(scores, y_var)
    
    if (t + 1) % print_every == 0:
        print('t = %d, loss = %.4f' % (t + 1, loss.data[0]))

    # Zero out all of the gradients for the variables which the optimizer will update.
    optimizer.zero_grad()
    
    # This is the backwards pass: compute the gradient of the loss with respect to each 
    # parameter of the model.
    loss.backward()
    
    # Actually update the parameters of the model using the gradients computed by the backwards pass.
    optimizer.step()

t = 100, loss = 10.0971
t = 200, loss = 3.8753
t = 300, loss = 4.6107
t = 400, loss = 7.4986
t = 500, loss = 21.0644
t = 600, loss = 9.4087
t = 700, loss = 7.7326


Now you've seen how the training process works in PyTorch. To save you writing boilerplate code, we're providing the following helper functions to help you train for multiple epochs and check the accuracy of your model:

In [6]:
def train(model, loss_fn, optimizer, num_epochs = 1):
    for epoch in range(num_epochs):
        print('Starting epoch %d / %d' % (epoch + 1, num_epochs))
        model.train()
        for t, (x, y) in enumerate(loader_train):
            x_var = Variable(x.type(gpu_dtype))
            y_var = Variable(y.type(gpu_dtype).long())

            scores = model(x_var)
            
            loss = loss_fn(scores, y_var)
            if (t + 1) % print_every == 0:
                print('t = %d, loss = %.4f' % (t + 1, loss.data[0]))

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def check_accuracy(model, loader):
    if loader.dataset.train:
        print('Checking accuracy on validation set')
    else:
        print('Checking accuracy on test set')   
    num_correct = 0
    num_samples = 0
    model.eval() # Put the model in test mode (the opposite of model.train(), essentially)
    for x, y in loader:
        x_var = Variable(x.type(gpu_dtype), volatile=True)

        scores = model(x_var)
        _, preds = scores.data.cpu().max(1)
        num_correct += (preds == y).sum()
        num_samples += preds.size(0)
    acc = float(num_correct) / num_samples
    print('Got %d / %d correct (%.2f)' % (num_correct, num_samples, 100 * acc))
    return acc
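
The core of `check_accuracy` is an argmax over class scores followed by a comparison with the labels. A minimal pure-Python sketch of that computation, using made-up scores and a hypothetical helper named `accuracy`:

```python
def accuracy(scores, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    num_correct = 0
    for row, label in zip(scores, labels):
        pred = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        num_correct += int(pred == label)
    return num_correct / len(labels)

# Four examples, three classes (illustrative values only).
scores = [[0.1, 2.0, -1.0], [1.5, 0.2, 0.3], [0.0, 0.1, 0.9], [0.3, 0.2, 0.1]]
labels = [1, 0, 1, 0]
print(accuracy(scores, labels))  # 0.75
```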

### Check the accuracy of the model.

Let's see the train and check_accuracy code in action -- feel free to use these methods when evaluating the models you develop below.

You should get a training loss of around 1.2-1.4, and a validation accuracy of around 50-60%. As mentioned above, if you re-run the cells, you'll be training more epochs, so your performance will improve past these numbers.

But don't worry about getting these numbers better -- this was just practice before you tackle designing your own model.

In [27]:
torch.cuda.random.manual_seed(12345)
fixed_model_gpu.apply(reset)
train(fixed_model_gpu, loss_fn, optimizer, num_epochs=1)
check_accuracy(fixed_model_gpu, loader_val)

Starting epoch 1 / 1
t = 100, loss = 85.0193
t = 200, loss = 101.1028
t = 300, loss = 85.9814
t = 400, loss = 80.3795
t = 500, loss = 73.1408
t = 600, loss = 85.3632
t = 700, loss = 75.7075
Checking accuracy on validation set
Got 595 / 1000 correct (59.50)
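
The logged losses look much larger than the 1.2-1.4 range mentioned above because `loss_fn` was created with `size_average=False`, which sums the loss over each batch instead of averaging it. Assuming full batches of 64 at the logged iterations, dividing by the batch size recovers per-example values:

```python
batch_size = 64
# Losses printed by the run above (summed over each batch).
summed_losses = [85.0193, 101.1028, 85.9814, 80.3795, 73.1408, 85.3632, 75.7075]

per_example = [round(loss / batch_size, 2) for loss in summed_losses]
print(per_example)
```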


### Don't forget the validation set!

And note that you can use the check_accuracy function to evaluate on either the test set or the validation set, by passing either **loader_test** or **loader_val** as the second argument to check_accuracy. You should not touch the test set until you have finished your architecture and hyperparameter tuning, and only run the test set once at the end to report a final value. 

## Train a _great_ model on CIFAR-10!

Now it's your job to experiment with architectures, hyperparameters, loss functions, and optimizers to train a model that achieves **>=70%** accuracy on the CIFAR-10 **validation** set. You can use the check_accuracy and train functions from above.

### Things you should try:
- **Filter size**: Above we used 7x7; this makes pretty pictures but smaller filters may be more efficient
- **Number of filters**: Above we used 32 filters. Do more or fewer do better?
- **Pooling vs Strided Convolution**: Do you use max pooling or just stride convolutions?
- **Batch normalization**: Try adding spatial batch normalization after convolution layers and vanilla batch normalization after affine layers. Do your networks train faster?
- **Network architecture**: The network above has two layers of trainable parameters. Can you do better with a deep network? Good architectures to try include:
    - [conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
    - [conv-relu-conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
    - [batchnorm-relu-conv]xN -> [affine]xM -> [softmax or SVM]
- **Global Average Pooling**: Instead of flattening and then having multiple affine layers, perform convolutions until your image gets small (7x7 or so) and then perform an average pooling operation to get a 1x1 feature map of shape (1, 1, Filter#), which is then reshaped into a (Filter#) vector. This is used in [Google's Inception Network](https://arxiv.org/abs/1512.00567) (See Table 1 for their architecture).
- **Regularization**: Add L2 weight regularization, or perhaps use Dropout.
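
As an illustration of the global average pooling idea in the list above, the following pure-Python sketch averages each channel of a (C, H, W) feature map down to a single number. The nested-list representation and the function name are assumptions for clarity; in PyTorch one option would be an `nn.AvgPool2d` whose kernel covers the whole feature map.

```python
def global_average_pool(feature_map):
    """feature_map: nested lists shaped (C, H, W) -> list of C channel means."""
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Two channels, each 2x2 (illustrative values only).
fmap = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
print(global_average_pool(fmap))  # [2.5, 2.0]
```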

### Tips for training
For each network architecture that you try, you should tune the learning rate and regularization strength. When doing this there are a couple important things to keep in mind:

- If the parameters are working well, you should see improvement within a few hundred iterations
- Remember the coarse-to-fine approach for hyperparameter tuning: start by testing a large range of hyperparameters for just a few training iterations to find the combinations of parameters that are working at all.
- Once you have found some sets of parameters that seem to work, search more finely around these parameters. You may need to train for more epochs.
- You should use the validation set for hyperparameter search, and save your test set for evaluating your architecture on the best parameters as selected by the validation set.
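
The coarse-to-fine idea can be sketched as sampling learning rates log-uniformly over a wide range, then again over a narrower range around the best coarse result. The ranges, the helper `sample_log_uniform`, and the placeholder `best_exp` below are assumptions for illustration; in practice `best_exp` would come from your own validation results.

```python
import random

def sample_log_uniform(low_exp, high_exp, n, rng):
    """Draw n values of 10**u with u uniform in [low_exp, high_exp]."""
    return [10 ** rng.uniform(low_exp, high_exp) for _ in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Coarse stage: three decades, from 1e-5 to 1e-2.
coarse = sample_log_uniform(-5, -2, 5, rng)

# Fine stage: half a decade on either side of a hypothetical best exponent.
best_exp = -3.0  # placeholder, not a measured result
fine = sample_log_uniform(best_exp - 0.5, best_exp + 0.5, 5, rng)

print(sorted(coarse))
print(sorted(fine))
```

Sampling the exponent rather than the value itself spreads trials evenly across orders of magnitude.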

### Going above and beyond
If you are feeling adventurous there are many other features you can implement to try and improve your performance. You are **not required** to implement any of these; however they would be good things to try for extra credit.

- Alternative update steps: For the assignment we implemented SGD+momentum, RMSprop, and Adam; you could try alternatives like AdaGrad or AdaDelta.
- Alternative activation functions such as leaky ReLU, parametric ReLU, ELU, or MaxOut.
- Model ensembles
- Data augmentation
- New Architectures
  - [ResNets](https://arxiv.org/abs/1512.03385) where the input from the previous layer is added to the output.
  - [DenseNets](https://arxiv.org/abs/1608.06993) where each layer receives the concatenated outputs of all preceding layers in the block.
  - [This blog has an in-depth overview](https://chatbotslife.com/resnets-highwaynets-and-densenets-oh-my-9bb15918ee32)
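
The difference between the two skip-connection styles in the list above comes down to addition versus concatenation. A toy sketch on plain Python lists, where `layer` is a made-up stand-in for a conv-relu block and each list element plays the role of a channel:

```python
def layer(x):
    """Stand-in for a learned transformation F(x); here it just doubles values."""
    return [2.0 * v for v in x]

def residual_block(x):
    # ResNet style: output = x + F(x), so the shape is unchanged.
    return [a + b for a, b in zip(x, layer(x))]

def dense_block(x):
    # DenseNet style: output = concat(x, F(x)), so the channel count grows.
    return x + layer(x)

x = [1.0, 2.0]
print(residual_block(x))  # [3.0, 6.0]
print(dense_block(x))     # [1.0, 2.0, 2.0, 4.0]
```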

If you do decide to implement something extra, clearly describe it in the "Extra Credit Description" cell below.

### What we expect
At the very least, you should be able to train a ConvNet that gets at least 70% accuracy on the validation set. This is just a lower bound - if you are careful it should be possible to get accuracies much higher than that! Extra credit points will be awarded for particularly high-scoring models or unique approaches.

You should use the space below to experiment and train your network. 

Have fun and happy training!

In [8]:
torch.cuda.random.manual_seed(12345)
gpu_dtype = torch.cuda.FloatTensor

#### helper functions

def get_linear_size(flt_num, flt_size, stride, num_pool=1):
    return flt_num*(((32-flt_size)//stride + 1)**2)//((num_pool)**2)

def check_linear_size(flt_size, stride):
    # The filter tiles the 32-pixel input exactly only when stride divides (32 - flt_size).
    return (32 - flt_size) % stride == 0

def create_bests():
    return {"l_rate":-1,"num_first_stage":-1,"num_second_stage":-1, "stride":-1, "flt_num":-1, "flt_size":-1, "val":-1, "batch_after_affine": False}


def check_model(model, model_num, flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine, num_epochs=2):
    loss_fn = nn.CrossEntropyLoss(size_average=False)
    optimizer = optim.Adam(model.parameters(), lr=l_rate)
    train(model, loss_fn, optimizer, num_epochs=num_epochs)
    acc = check_accuracy(model, loader_val)
    maybe_save_variables(model_num, acc, flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine)

def maybe_save_variables(model_num, acc, flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine):
    best = bests[model_num]
    best_val = bests["val"]
    if acc > best["val"]:
        best["flt_size"] = flt_size
        best["flt_num"] = flt_num
        best["stride"] = stride
        best["l_rate"] = l_rate
        best["num_first_stage"] = num_first_stage
        best["num_second_stage"] = num_second_stage
        best["batch_after_affine"] = batch_after_affine
        best["val"] = acc
        print("Check:", model_num, best)
    if acc > best_val: # best of the models check
        bests["val"] = acc
        bests["model"] = model_num
        print("Check best model:", model_num, acc)
        
def train_one(model):
    for t, (x, y) in enumerate(loader_train):
        x_var = Variable(x.type(gpu_dtype))

        scores = model(x_var)
        print(scores.size())
        break
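
When a model stacks several convolutions and pooling layers, the input size of the first Linear layer depends on every layer in the stack. One way to compute it is to trace the spatial size layer by layer. The helper `trace_flat_size` and its tuple-based layer spec are hypothetical, and the sketch assumes unpadded layers with floor division:

```python
def trace_flat_size(in_channels, in_size, layers):
    """Return channels * H * W after a stack of unpadded conv/pool layers.

    layers: list of ("conv", out_channels, kernel, stride)
            or ("pool", kernel, stride) tuples.
    """
    channels, size = in_channels, in_size
    for spec in layers:
        if spec[0] == "conv":
            _, channels, kernel, stride = spec
        elif spec[0] == "pool":
            _, kernel, stride = spec
        else:
            raise ValueError("unknown layer type: %r" % (spec[0],))
        size = (size - kernel) // stride + 1
        if size < 1:
            raise ValueError("feature map shrank below 1x1")
    return channels * size * size

# fixed_model: 7x7 conv (32 filters, stride 1) then 2x2 max pool with stride 2.
print(trace_flat_size(3, 32, [("conv", 32, 7, 1), ("pool", 2, 2)]))  # 5408

# One [conv-relu-conv-relu-pool] stage with 32 3x3 filters, stride 1.
stage = [("conv", 32, 3, 1), ("conv", 32, 3, 1), ("pool", 2, 2)]
print(trace_flat_size(3, 32, stage))  # 6272
```

For that second configuration, `get_linear_size(32, 3, 1, 2)` evaluates to 7200, since it appears to account for only one convolution before pooling; a mismatch like this would surface as a size error when the model is run.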

In [8]:
#### models

# [conv-relu-conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
def run_model0(flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine, num_epochs=2):
    linear_size = get_linear_size(flt_num, flt_size, stride, 2**num_first_stage)
    
    first_stage = lambda start:[
        nn.Conv2d(start, flt_num, kernel_size=flt_size, stride=stride),
        nn.ReLU(inplace=True),
        nn.Conv2d(flt_num, flt_num, kernel_size=flt_size, stride=stride),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2, stride=2),
    ]
    first_stages = []
    start = 3
    for num_first in range(num_first_stage):
        first_stages.extend(first_stage(start))
        start = flt_num

    second_stages = []
    size = linear_size
    for num_second in range(num_second_stage):
        second_stages.append(nn.Linear(size, 1024))
        size = 1024
        if batch_after_affine:
            second_stages.append(nn.BatchNorm1d(size))
                        
    model = nn.Sequential(
        *first_stages,
        Flatten(),
        *second_stages,
        nn.Linear(1024, 10)
    ).type(gpu_dtype)

    check_model(model, 0, flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine, num_epochs)



# [conv-relu-pool]xN -> [affine]xM -> [softmax or SVM]
def run_model1(flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine, num_epochs=2):
    linear_size = get_linear_size(flt_num, flt_size, stride, 2**num_first_stage)
    
    first_stage = lambda start:[
        nn.Conv2d(start, flt_num, kernel_size=flt_size, stride=stride),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2, stride=2),
    ]
        
    first_stages = []
    start = 3
    for num_first in range(num_first_stage):
        first_stages.extend(first_stage(start))
        start = flt_num

    second_stages = []
    size = linear_size
    for num_second in range(num_second_stage):
        second_stages.append(nn.Linear(size, 1024))
        size = 1024
        if batch_after_affine:
            second_stages.append(nn.BatchNorm1d(size))
                        
    model = nn.Sequential(
        *first_stages,
        Flatten(),
        *second_stages,
        nn.Linear(1024, 10)
    ).type(gpu_dtype)

    check_model(model, 1, flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine, num_epochs)



# [batchnorm-relu-conv]xN -> [affine]xM -> [softmax or SVM]
def run_model2(flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine, num_epochs=2):
    linear_size = get_linear_size(flt_num, flt_size, stride)

    first_stage = lambda start:[
        nn.BatchNorm2d(start),
        nn.ReLU(inplace=True),
        nn.Conv2d(start, flt_num, kernel_size=flt_size, stride=stride),
    ]
        
    first_stages = []
    start = 3
    for num_first in range(num_first_stage):
        first_stages.extend(first_stage(start))
        start = flt_num

    second_stages = []
    size = linear_size
    for num_second in range(num_second_stage):
        second_stages.append(nn.Linear(size, 1024))
        size = 1024
        if batch_after_affine:
            second_stages.append(nn.BatchNorm1d(size))
                        
    model = nn.Sequential(
        *first_stages,
        Flatten(),
        *second_stages,
        nn.Linear(1024, 10)
    ).type(gpu_dtype)

    check_model(model, 2, flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine, num_epochs)

In [13]:
# Train your model here, and make sure the output of this cell is the accuracy of your best model on the 
# train, val, and test sets. Here's some code to get you started. The output of this cell should be the training
# and validation accuracy on your best model (measured by validation accuracy).
filter_sizes = [1,3,5,7]
num_filters = [32,64,128,256]
strides = [1,2,3]
num_first_stages = [1,3,5] # can't do 7, too small output
num_second_stages = [1,3,5,7]
batch_after_affines = [True, False]
lr_step = 1
lr_range = np.arange(-4,-2,lr_step)
learning_rates = [10**np.random.uniform(n,n+lr_step) for n in lr_range]
best_model = None

# Commented out so that it is not run accidentally
'''
bests = {0: create_bests(), 1: create_bests(), 2: create_bests(), "model": None, "val": -1}

for flt_size in filter_sizes:
    for flt_num in num_filters:
        for stride in strides:
            for l_rate in learning_rates:
                for num_first_stage in num_first_stages:
                    for num_second_stage in num_second_stages:
                        for batch_after_affine in batch_after_affines:
                            if not check_linear_size(flt_size, stride):
                                print("filter doesn't fit, ", flt_num, flt_size, stride)
                                continue
                            try:
                                print("flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine")
                                print(flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine)
                                print("model0")
                                run_model0(flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine, num_epochs=1)
                                print("model1")
                                run_model1(flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine, num_epochs=1)
                                print("model2")
                                run_model2(flt_size, flt_num, stride, l_rate, num_first_stage, num_second_stage, batch_after_affine, num_epochs=1)
                            except RuntimeError as err:
                                print("Encountered error with sizes, moving on ", err)
                        

print(bests)
'''
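
To get a feel for how expensive the grid above would be, the number of configurations that pass the `check_linear_size` test can be counted without training anything. This sketch re-declares the lists from the cell so that it is self-contained, and assumes two sampled learning rates (one per entry of `np.arange(-4, -2, 1)`):

```python
from itertools import product

filter_sizes = [1, 3, 5, 7]
num_filters = [32, 64, 128, 256]
strides = [1, 2, 3]
num_first_stages = [1, 3, 5]
num_second_stages = [1, 3, 5, 7]
batch_after_affines = [True, False]
num_learning_rates = 2  # one sample per decade in [-4, -2)

def fits(flt_size, stride):
    """Same condition as check_linear_size: stride divides (32 - flt_size)."""
    return (32 - flt_size) % stride == 0

valid = [
    combo
    for combo in product(filter_sizes, num_filters, strides,
                         range(num_learning_rates), num_first_stages,
                         num_second_stages, batch_after_affines)
    if fits(combo[0], combo[2])
]

# Each valid configuration trains three models (model0, model1, model2).
print(len(valid), 3 * len(valid))  # 960 2880
```

Even at one epoch per run, that many training runs is substantial, which is a good argument for the coarse-to-fine strategy over an exhaustive grid.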




In [38]:
# baseline
'''
best_val = -1
best_flt_size = -1
best_flt_num = -1
best_lr = -1
best_stride = -1
lr_step = 1
lr_range = np.arange(-4,-2,lr_step)
learning_rates = [10**np.random.uniform(n,n+lr_step) for n in lr_range]


for flt_size in filter_sizes:
    for flt_num in num_filters:
        for stride in strides:
            for l_rate in learning_rates:
                if (32-flt_size)/stride != (32-flt_size)//stride:
                    print("didn't fit", flt_size,stride)
                    continue # filter doesn't fit, keep going

                linear_size = flt_num*(((32-flt_size)//stride + 1)**2)//4 # from pooling
                model = nn.Sequential(
                    nn.Conv2d(3, flt_num, kernel_size=flt_size, stride=stride),
                    nn.ReLU(inplace=True),
                    nn.BatchNorm2d(flt_num),
                    nn.MaxPool2d(2, stride=2),
                    Flatten(),
                    nn.Linear(linear_size, 1024),
                    nn.ReLU(inplace=True),
                    nn.Linear(1024, 10),
                ).type(gpu_dtype)

                loss_fn = nn.CrossEntropyLoss(size_average=False)
                optimizer = optim.Adam(model.parameters(), lr=l_rate)
                train(model, loss_fn, optimizer, num_epochs=1)
                acc = check_accuracy(model, loader_val)

                if acc > best_val:
                    best_val = acc
                    best_flt_size = flt_size
                    best_flt_num = flt_num
                    best_lr = l_rate
                    best_stride = stride
                    print("check:",best_lr, np.log10(best_lr), best_flt_size, best_flt_num, best_stride, best_val)
'''

# training AlexNet-like model
lr_step = 1
lr_range = np.arange(-5,-2,lr_step)
learning_rates = [10**np.random.uniform(n,n+lr_step) for n in lr_range]
reg_step = 1
reg_range = np.arange(-9,5,reg_step)
regs = [10**np.random.uniform(n,n+reg_step) for n in reg_range]
best_stride = stride = 1
best_flt_size = flt_size = 7
best_flt_num = flt_num = 128
best_lr = -1
best_reg = -1
best_val = -1
for reg in regs:
    for l_rate in learning_rates:
        if (32-flt_size)/stride != (32-flt_size)//stride:
            print("didn't fit", flt_size,stride)
            continue # filter doesn't fit, keep going

        linear_size = flt_num * (((32 - flt_size)//stride + 1)//2)**2 # only used by the commented-out baseline model below
        #model = nn.Sequential(
        #    nn.Conv2d(3, flt_num, kernel_size=flt_size, stride=stride),
        #    nn.ReLU(inplace=True),
        #    nn.BatchNorm2d(flt_num),
        #    nn.MaxPool2d(2, stride=2),
        #   Flatten(),
        #    nn.Linear(linear_size, 1024),
        #    nn.BatchNorm1d(1024),
        #    nn.ReLU(inplace=True),
        #    nn.Linear(1024, 10),
        #).type(gpu_dtype)
        # spatial sizes in the comments assume stride = 1 and 32x32 inputs
        model = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=5, stride=stride),     # 32x32 -> 28x28
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=stride),   # 28x28 -> 26x26
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, stride=2),                          # 26x26 -> 13x13
            nn.Conv2d(128, 256, kernel_size=1, stride=stride),  # 13x13 -> 13x13
            nn.BatchNorm2d(256),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(2, stride=2),                          # 13x13 -> 6x6
            Flatten(),                                          # 256 * 6 * 6 = 9216
            nn.Dropout(),
            nn.Linear(9216, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU(inplace=True),
            nn.Dropout(),
            nn.Linear(1024, 10),
        ).type(gpu_dtype)

        loss_fn = nn.CrossEntropyLoss(size_average=False)
        optimizer = optim.Adam(model.parameters(), lr=l_rate, weight_decay=reg)
        train(model, loss_fn, optimizer, num_epochs=1)
        acc = check_accuracy(model, loader_val)

        if acc > best_val:
            best_val = acc
            best_lr = l_rate
            best_reg = reg
            print("check:",best_lr, np.log10(best_lr), np.log10(best_reg), best_val)
print("best l, reg, acc", best_lr, np.log10(best_lr), np.log10(best_reg), best_val)

#fixed_model_gpu.apply(reset)

Starting epoch 1 / 1
t = 100, loss = 106.1959
t = 200, loss = 101.7970
t = 300, loss = 102.2394
t = 400, loss = 85.2358
t = 500, loss = 80.0087
t = 600, loss = 84.8061
t = 700, loss = 93.2553
Checking accuracy on validation set
Got 588 / 1000 correct (58.80)
check: 5.505810946822641e-05 -4.25917870507 -8.93423687917 0.588
Starting epoch 1 / 1
t = 100, loss = 90.0830
t = 200, loss = 96.5742
t = 300, loss = 78.6894
t = 400, loss = 66.9081
t = 500, loss = 66.7567
t = 600, loss = 68.7619
t = 700, loss = 85.4061
Checking accuracy on validation set
Got 632 / 1000 correct (63.20)
check: 0.0001732125615835059 -3.76142061562 -8.93423687917 0.632
Starting epoch 1 / 1
t = 100, loss = 81.5981
t = 200, loss = 88.5796
t = 300, loss = 84.1179
t = 400, loss = 59.0202
t = 500, loss = 63.6259
t = 600, loss = 68.4827
t = 700, loss = 78.6402
Checking accuracy on validation set
Got 636 / 1000 correct (63.60)
check: 0.0016324348817746165 -2.78716413382 -8.93423687917 0.636
Starting epoch 1 / 1
t = 100, loss

In [39]:
# use best model to run 10 epochs
best_lrate = 10**-2.78716413382
best_reg = 10**-7.40552998346
best_stride = stride = 1

best_model = model = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=5, stride=stride),
        nn.BatchNorm2d(64),
        nn.ReLU(inplace=True),
        nn.Conv2d(64, 128, kernel_size=3, stride=stride),
        nn.BatchNorm2d(128),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2, stride=2),
        nn.Conv2d(128, 256, kernel_size=1, stride=stride),
        nn.BatchNorm2d(256),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2, stride=2),
        Flatten(),
        nn.Dropout(),
        nn.Linear(9216, 1024),
        nn.BatchNorm1d(1024),
        nn.ReLU(inplace=True),
        nn.Dropout(),
        nn.Linear(1024, 10)).type(gpu_dtype)
loss_fn = nn.CrossEntropyLoss(size_average=False)
optimizer = optim.Adam(model.parameters(), lr=best_lrate, weight_decay=best_reg)
train(model, loss_fn, optimizer, num_epochs=10)
acc = check_accuracy(model, loader_val)
print("Acc", acc)

Starting epoch 1 / 10
t = 100, loss = 90.0598
t = 200, loss = 88.2754
t = 300, loss = 83.4237
t = 400, loss = 61.3282
t = 500, loss = 64.6695
t = 600, loss = 68.3118
t = 700, loss = 84.6836
Starting epoch 2 / 10
t = 100, loss = 55.7722
t = 200, loss = 67.2878
t = 300, loss = 57.2617
t = 400, loss = 45.0917
t = 500, loss = 49.9258
t = 600, loss = 57.0327
t = 700, loss = 71.9670
Starting epoch 3 / 10
t = 100, loss = 43.1624
t = 200, loss = 63.5872
t = 300, loss = 52.7473
t = 400, loss = 43.8407
t = 500, loss = 47.0367
t = 600, loss = 49.1844
t = 700, loss = 57.7728
Starting epoch 4 / 10
t = 100, loss = 36.0154
t = 200, loss = 44.4081
t = 300, loss = 40.7104
t = 400, loss = 37.5453
t = 500, loss = 40.8666
t = 600, loss = 42.7433
t = 700, loss = 55.4944
Starting epoch 5 / 10
t = 100, loss = 35.0007
t = 200, loss = 37.3310
t = 300, loss = 38.0863
t = 400, loss = 36.4998
t = 500, loss = 36.2273
t = 600, loss = 33.3302
t = 700, loss = 40.2598
Starting epoch 6 / 10
t = 100, loss = 26.5500
t = 

### Describe what you did 

In the cell below you should write an explanation of what you did, any additional features that you implemented, and any visualizations or graphs that you make in the process of training and evaluating your network.

tl;dr: I tried to build every model programmatically, but the search took too long and the generated architectures weren't expressive enough. In the end I hand-tuned a handful of layers, which brought the accuracy to 80% on the validation set (and 79.75% on the test set!).

--

My approach was to loop over hyperparameters for each model. First, using the basic model, I looped over the filter size, the number of filters, the learning rate, and the stride, training each configuration for only 1 epoch. The best result for this baseline model was:

`best l_rate=10^-3.61685174265, flt_s=5, flt_n=128, stride=1, acc=0.65`
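
The nested loops over filter size, number of filters, and stride can be flattened with `itertools.product`, filtering out the combinations where the filter does not tile the image exactly. This is only a sketch: the candidate values below are hypothetical and are not the ones used in the actual search.

```python
import itertools

IMAGE_SIZE = 32  # CIFAR-10 images are 32x32


def filter_fits(image_size, flt_size, stride):
    """Return True when an unpadded filter tiles the input exactly."""
    return (image_size - flt_size) % stride == 0


# Hypothetical candidate values, chosen only for illustration.
flt_sizes = [3, 5, 7]
flt_nums = [32, 64, 128]
strides = [1, 2, 3]

# Keep only the configurations whose filter fits; the learning-rate loop
# and the call to train() would go inside a loop over `configs`.
configs = [
    (flt_size, flt_num, stride)
    for flt_size, flt_num, stride in itertools.product(flt_sizes, flt_nums, strides)
    if filter_fits(IMAGE_SIZE, flt_size, stride)
]

print(len(configs), "of", len(flt_sizes) * len(flt_nums) * len(strides), "configurations fit")
```

Building the list up front makes it easy to see how many training runs a search will take before starting it.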

I then ran the same loops with the [conv-relu-pool]xN -> [affine]xM -> [softmax or SVM] model, adding extra loops for N, M, and whether or not to add a batchnorm layer after the affine layer (see code above). Next, I used the same loops on the [conv-relu-conv-relu-pool]xN -> [affine]xM -> [softmax or SVM] and [batchnorm-relu-conv]xN -> [affine]xM -> [softmax or SVM] models and compared the results. Again, I ran only 1 epoch per iteration to see which models were the most promising starting points. This process took about 6 hours, so I commented out the code for this first run.

Note that for some values of N, the repeated pooling reduced the spatial output size to 0. In the interest of time, I caught these exceptions and moved on.

It was difficult to make sure the input and output sizes of each layer matched exactly (see the run_modelx code above). I'm surprised PyTorch doesn't have a dynamic setting so that these layers scale to the inputs they receive; it would make experimenting with different architectures easier (but maybe I'm missing something? I would love to hear if there is a way of doing this).
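
One way to take the guesswork out of the layer sizes is to compute them before building the model. The sketch below is a hypothetical helper (`conv_out` and `flattened_size` are not part of PyTorch or of this notebook). It applies the usual output-size formula for unpadded, undilated convolution and pooling layers, and raises an error as soon as a layer would shrink the output to nothing.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a conv or pooling layer (floor division, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_size(image_size, layers, out_channels):
    """Number of features fed to the first linear layer.

    `layers` is a list of (kind, kernel, stride) tuples; `kind` is only a label.
    """
    size = image_size
    for kind, kernel, stride in layers:
        if size < kernel:
            raise ValueError(
                "%s layer with kernel %d does not fit a %dx%d input" % (kind, kernel, size, size)
            )
        size = conv_out(size, kernel, stride)
    return out_channels * size * size


# The AlexNet-like model from this notebook: 32 -> 28 -> 26 -> 13 -> 13 -> 6
alexnet_like = [("conv", 5, 1), ("conv", 3, 1), ("pool", 2, 2), ("conv", 1, 1), ("pool", 2, 2)]
print(flattened_size(32, alexnet_like, out_channels=256))  # 9216, matching nn.Linear(9216, 1024)
```

Recent PyTorch releases also include `nn.LazyLinear`, which infers `in_features` on the first forward pass; whether it is available depends on the installed version.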

* Model 0 did best with 1 first stage, 1 second stage, stride 1, 256 filters, and a learning rate around 10^-2.3. With more stages, performance declined, suggesting that the extra second stages might be contributing to overfitting.
* Model 1 did best with 1 first stage, 1 second stage, stride 1, 128 filters, and a learning rate around 10^-3.
* Model 2 did best with 3 first stages, 1 second stage, stride 1, 256 filters, and a learning rate around 10^-2.6.

All models benefited from using batchnorm after the second-stage affine layers.

Of these, model 1 did the best, with a 1-epoch accuracy of 0.512. If these experiments were run correctly, then it seems that a stride of 1 preserves the most information, and that more filters are better. However, the baseline model still outperformed all of these models, with a 1-epoch accuracy of 0.65. It could be that I did something wrong in the setup, or that the baseline model extracts just enough information to do well on the validation set and the extra information from the other models only hurts accuracy. Because of this, I trained the baseline model for 25 epochs, but wasn't able to get above 68%.


At that point, I realized that my programmatic way of creating these models wasn't expressive enough: every conv layer had the same filter size, and every linear layer had the same size. Rather than changing the program, in the interest of time I modeled the network after AlexNet, where the filter sizes start out large and get smaller while the number of filters starts out small and gets larger. I started with two conv layers and then tried three, with three doing better. From the experiments above, I had learned that adding a batchnorm layer after an affine layer worked well consistently, and that more filters seemed to increase accuracy, so the last conv layer has 256 filters. I added dropout in case there was any overfitting. With that, I was able to tune the learning rate and regularization strength (weight_decay) to reach 80% accuracy on the validation set in 10 epochs.
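
The learning rate and regularization search draws one random value from each power-of-ten bin, so every order of magnitude is covered by exactly one sample. Here is a minimal standalone sketch of that sampling scheme using only the standard library; the function name `stratified_log_uniform` and the fixed seed are assumptions made for illustration.

```python
import math
import random


def stratified_log_uniform(low_exp, high_exp, step=1, rng=random):
    """Draw one value 10**u per exponent bin [n, n + step), for n in low_exp..high_exp."""
    samples = []
    n = low_exp
    while n < high_exp:
        samples.append(10 ** rng.uniform(n, n + step))
        n += step
    return samples


rng = random.Random(0)  # fixed seed so the sketch is repeatable
learning_rates = stratified_log_uniform(-5, -2, rng=rng)  # 3 bins: 1e-5 .. 1e-2
regs = stratified_log_uniform(-9, 5, rng=rng)             # 14 bins: 1e-9 .. 1e5

for lr in learning_rates:
    print("lr = %.3e (log10 = %.2f)" % (lr, math.log10(lr)))
```

Sampling in log space matters because the useful values of both hyperparameters span several orders of magnitude; sampling uniformly in linear space would put almost every draw in the top decade.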

## Test set -- run this only once

Now that we've gotten a result we're happy with, we test our final model (which you should store in best_model) on the test set. This is the score we would achieve in a competition. Think about how this compares to your validation set accuracy.

In [40]:
check_accuracy(best_model, loader_test)

Checking accuracy on test set
Got 7975 / 10000 correct (79.75)


0.7975
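
One rough way to compare this with the validation accuracy is to look at the sampling noise of each estimate. The sketch below assumes a validation accuracy of about 0.80 on the 1000-image validation set (the figure quoted in the write-up) and treats each prediction as an independent coin flip. That is a simplification, and it ignores the fact that the hyperparameters were chosen on the validation set.

```python
import math


def accuracy_std_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


val_acc, val_n = 0.80, 1000        # approximate validation accuracy (assumed)
test_acc, test_n = 0.7975, 10000   # from the test-set output above

val_se = accuracy_std_error(val_acc, val_n)
test_se = accuracy_std_error(test_acc, test_n)
gap = val_acc - test_acc
gap_se = math.sqrt(val_se ** 2 + test_se ** 2)

print("validation: %.4f +/- %.4f" % (val_acc, val_se))
print("test:       %.4f +/- %.4f" % (test_acc, test_se))
print("gap:        %.4f (standard error %.4f)" % (gap, gap_se))
```

Under these assumptions the gap between the two accuracies is smaller than its standard error, so the test result is consistent with the validation result.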

## Going further with PyTorch

The next assignment will make heavy use of PyTorch. You might also find it useful for your projects. 

Here's a nice tutorial by Justin Johnson that shows off some of PyTorch's features, like dynamic graphs and custom NN modules: http://pytorch.org/tutorials/beginner/pytorch_with_examples.html

If you're interested in reinforcement learning for your final project, this is a good (more advanced) DQN tutorial in PyTorch: http://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html