<a href="https://colab.research.google.com/github/dylanwalker/MGSC496/blob/main/MGSC496_R09.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#@title Your Info

your_name = '' #@param {type:"string"}
your_email = '' #@param {type:"string"}
today_date = '' #@param {type:"date"}


# How to "read" this notebook

As you go through this notebook (or any notebook for this class), you will encounter new concepts and python code that implements them -- just like you would see in a textbook. Of course, in a textbook, it's easy to read code and an explanation of what it does and think that you understand it.
<br />
<br />

### Learn by doing
But this notebook is different from a textbook because it allows you to not just read the code, but play with it. **You can and should try out changing the code that you see**. In fact, in many places throughout this reading notebook, you will be asked to write your own code to experiment with a concept that was just covered. This is a form of "active reading" and the idea behind it is that we really learn by **doing**. 
<br />
<br />

### Change everything
But don't feel limited to only change code when I prompt you. This notebook is your learning environment and your playground. I encourage you to try changing and running all the code throughout the notebook and even to **add your own notes and new code blocks**. Adding comments to code to explain what you are testing, experimenting with or trying to do is really helpful to understand what you were thinking when you revisit it later. 
<br />
<br />
### Make this notebook your own
Make this notebook your own. Write your questions and thoughts. At the end of every reading notebook, I will ask the same set of questions to try to elicit your questions, reaction and feedback. When we review the reading notebook in class, I encourage you to   



Before we begin, go to Runtime->Change runtime type and make sure GPU is selected -- we'll want to be able to use a GPU for training our neural networks.

Note: sometimes using tensors on the GPU can be a pain, because the error reporting is not as nice. For this reason, when developing a NN, it is best to do so using the CPU. Once you have it all sorted out, you can add a little bit of code to move it to the GPU. However, if you don't want to have to reset the runtime and start from scratch, it's best to set the runtime type to GPU now so that you have the option to move things to the GPU at any point. If you don't, then the "machine" you are running your code on will not have access to a GPU at all.
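
One common way to keep code runnable on either kind of machine is to pick the device once and reuse it. The snippet below is a minimal sketch of that idea; the names `device` and `toy_model` are made up for illustration and are not used elsewhere in this notebook.

```python
import torch

# Use the GPU when the runtime has one; otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# A tiny stand-in model (hypothetical, for illustration only).
toy_model = torch.nn.Linear(3, 2).to(device)

# Inputs must live on the same device as the model's parameters.
x = torch.rand(5, 3, device=device)
y = toy_model(x)
print(device, y.shape)
```

With this pattern you can develop on the CPU and later switch to the GPU without editing every `.to(...)` call.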

# Code Preface

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
from torch.utils.data import DataLoader
import torchvision
from torchvision import transforms


if not os.path.exists('./models'):
  os.mkdir('./models')


# Defining Neural Network Architectures

In the past examples, we have discussed multi-layered neural networks, but haven't actually shown any. We built a simple linear model using `torch.nn.Linear()`, but how do you build more complex architectures by using pytorch's existing layers?  

There are two conventional ways to do this:
1. Chain layers together using `torch.nn.Sequential()`
2. Create a class for your neural network that inherits from the `torch.nn.Module` base class and  implements the `forward()` method in it.

I will show examples of these two approaches:

In [None]:
model = torch.nn.Sequential(torch.nn.Linear(3,8),torch.nn.ReLU(),torch.nn.Linear(8,2),torch.nn.ReLU(),torch.nn.Softmax(dim=1))
model

In [None]:
#torch.rand(5,3)
model(torch.rand(5,3))

The sequential approach works when you have a simple network, but an alternative (and much more configurable and robust) approach is to define a class with a constructor and a `forward()` method: 

In [None]:
class BasicNet(torch.nn.Module):
  def __init__(self):
    # The constructor calls the base class constructor and then defines the layers that will be used (ordering doesn't matter here, as layers are just properties)
    super().__init__()
    self.fc1 = torch.nn.Linear(3,8) # fc here stands for "fully connected", as in all (3) neurons from the previous layer are connected to all (8) neurons of the next layer.
    self.relu1 = torch.nn.ReLU() # Activation functions are actually specified as separate layers; here we are using a Rectified Linear Unit 
    self.fc2 = torch.nn.Linear(8,2)
    self.relu2 = torch.nn.ReLU()
    self.softmax = torch.nn.Softmax(dim=1) # this last layer is a softmax, which transforms all inputs to outputs between (0,1) and the sum will be 1 (so we can treat them as probabilities)
  
  def forward(self,x):
    # The forward() method describes how an input tensor (the argument x) will be passed through the layers.
    # Here, the order matters.
    # Also note that we can do other things to the data at any point between the layers (such as functionally transform it in some way)
    #  -- we could add noise to the data somewhere in between some layers, normalize it, randomly drop or forget some of it... etc.
    #  Advanced NN approaches will often use such tricks. 
    x = self.fc1(x)
    x = self.relu1(x)
    x = self.fc2(x)
    x = self.relu2(x)
    x = self.softmax(x)
    return x


In [None]:
model = BasicNet() # Make an object of type BasicNet 
y = model(torch.rand(5,3)) # calling the model object runs forward() for us (this is the conventional way to call it)
print(y)

In the above you can see that the values in  each "row" of the output sum to 1. This is because we applied a SoftMax layer to the dimension 1 (which ensures that all output values will be between (0,1) and that they add up to 1). This is useful if we wanted to interpret dimension 1 as a "class label" for each datapoint and the values in each row of data as the probability that the datapoint belonged to one of the two possible classes.
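
To check the "rows sum to 1" claim directly, here is a small sketch that applies a softmax to a random tensor and sums along each dimension. The tensor `logits` is just random stand-in data, not output from `BasicNet`.

```python
import torch

torch.manual_seed(0)                      # only so the printout is repeatable
logits = torch.rand(5, 2)                 # 5 datapoints, 2 raw scores each
probs = torch.nn.Softmax(dim=1)(logits)   # normalize across dimension 1 (within each row)

row_sums = probs.sum(dim=1)   # one sum per datapoint
col_sums = probs.sum(dim=0)   # one sum per class; these need not be 1
print(probs)
print(row_sums)
print(col_sums)
```

Every entry of `row_sums` should be 1 (up to floating point error), while `col_sums` generally is not.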

Note: you can actually define a class and also use `torch.nn.Sequential()` to make logical blocks of layers within your class.  For example:
```python
class SomeNet(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.fc_block1 = torch.nn.Sequential(torch.nn.Linear(3,8),torch.nn.ReLU())
    self.fc_block2 = torch.nn.Sequential(torch.nn.Linear(8,2),torch.nn.ReLU())
    self.softmax = torch.nn.Softmax(dim=1) # dim=1, because dimension 0 is the batch dimension
...
```

This is convenient if you want to organize your layers.
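
For completeness, here is one way such a block-structured class might look with a `forward()` method filled in. This is a sketch under my own assumptions: the class name `BlockNet` is hypothetical, and the `forward()` shown is just the obvious "pass through each block in order" choice.

```python
import torch

class BlockNet(torch.nn.Module):  # hypothetical name, for illustration only
    def __init__(self):
        super().__init__()
        # Each block bundles a linear layer with its activation.
        self.fc_block1 = torch.nn.Sequential(torch.nn.Linear(3, 8), torch.nn.ReLU())
        self.fc_block2 = torch.nn.Sequential(torch.nn.Linear(8, 2), torch.nn.ReLU())
        self.softmax = torch.nn.Softmax(dim=1)  # dim 0 is the batch dimension

    def forward(self, x):
        x = self.fc_block1(x)   # (n, 3) -> (n, 8)
        x = self.fc_block2(x)   # (n, 8) -> (n, 2)
        return self.softmax(x)  # each row becomes a set of probabilities

block_out = BlockNet()(torch.rand(5, 3))
print(block_out.shape)
```

The `forward()` method gets shorter because each block hides its own internal ordering.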

I'd like to turn our attention to architecture in pytorch but before I do that, because many of the concepts apply to tensor data that is 2D or higher, we'll take a brief foray into `torchvision` and image processing, so I can explain how we represent images as tensors.   

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise: Make your own Neural Network Class</font>


Make your own neural network class by following the examples above. Remember your class should have a constructor that defines and stores the layers; and a forward method that passes the input through each layer in the right order. Feel free to use `torch.nn.Sequential` if you wish.


Your neural net class should be called `MyNet` and have the following architecture:
* A fully connected (linear) layer that takes input of 7 dimensions and outputs 32 dimensions
* A sigmoid layer (you can use `torch.nn.`
* Another fully connected (linear) layer with an input that matches the output of the last layer and an output of 64 dimensions
* Another sigmoid layer
* A third fully connected (linear) layer with an input that matches the output of the last layer and an output of 5 dimensions
* A softmax layer

When you have defined your class, make an instance of an object of that class and pass a random torch tensor of shape (3,7). Also try passing a random torch tensor of shape (10,7).

Unlike in the examples above, however, I want you to print out `x.shape` in between each layer that you pass it through in the `.forward` method. This will allow us to "see" how the tensor changes shape as it goes through each layer.


In [None]:
# Define your NN class

In [None]:
# Make an instance of an object of your NN Class

In [None]:
# Run a random tensor of shape (3,7) through your model

In [None]:
# Run a random tensor of shape (10,7) through your model

Q: Why did the model allow us to pass different tensors of shape `(n,7)` into it (for `n`=3, 10)? Could we have passed a tensor of shape `(n,7)` for arbitrary `n`? 

A: Pytorch is built to assume that the first dimension of tensors that you pass into models will always be the "batch" dimension, as the designers understood that we would like to take advantage of vectorized computation to run large batches of data through our model at once. So, yes, it would have accepted a tensor of shape `(n,7)` for arbitrary `n` as input.
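
A quick way to convince yourself of this is to push batches of several sizes through a single linear layer. The sketch below uses one stand-alone `torch.nn.Linear(7, 32)` layer (not the full exercise network) and also shows that the last dimension, unlike the batch dimension, is not free to change.

```python
import torch

layer = torch.nn.Linear(7, 32)

# The batch dimension n can be anything; only the last dimension must be 7.
shapes = {}
for n in (1, 3, 10, 256):
    out = layer(torch.rand(n, 7))
    shapes[n] = tuple(out.shape)
print(shapes)

# A mismatched feature dimension is rejected with a RuntimeError.
try:
    layer(torch.rand(3, 8))
    mismatch_rejected = False
except RuntimeError:
    mismatch_rejected = True
print(mismatch_rejected)
```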

Looking at how the shape of `x` changes as it passes through the layers, do you understand what is going on?


<hr/>

# Working with Image data

## MNIST - Handwritten Digit Data

Pytorch has a bunch of useful utilities for working with image data under the `torchvision` module. 

Let's look at some example image datasets and how to use some of these utilities in practice. This will put us in a better position to discuss architecture.

First we'll need to import a bunch of modules:

In [None]:
# imports
import torch
import numpy as np
import matplotlib.pyplot as plt
import torchvision
from torchvision import transforms


We'll use one of the datasets built into torchvision called MNIST, a dataset of 60,000 28x28 grayscale pixel images of handwritten digits, each of which belongs to one of ten classes (0-9), and which is described in detail [here](http://yann.lecun.com/exdb/mnist/). (btw, **NIST** stands for National Institute of Standards and Technology, who released the first dataset before it was **M**odified by others and thus called **MNIST**). 

We'll use `torchvision.datasets.MNIST()` but we don't want to work with the raw image data alone as it is delivered as a set of PIL (Python Image Library) image objects. We'll want to convert the single grayscale _channel of the image_ to a tensor and then normalize it so that the values all fall between (-1,1). 

To accomplish this, we'll use a tool from torchvision's transforms module, `transforms.Compose()`, which lets us chain a bunch of transformations together. We'll chain `transforms.ToTensor()` and `transforms.Normalize()`, which takes the mean and std for the single grayscale channel. If you work on images, there are tons of useful image transformations in torchvision.
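
The arithmetic behind the normalization step is simple enough to sketch with plain tensors. `transforms.ToTensor()` scales pixel intensities into the range [0, 1], and normalizing with a mean of 0.5 and a std of 0.5 computes `(x - 0.5) / 0.5`. The values in `pixels` below are made-up intensities, not real MNIST data.

```python
import torch

# Made-up pixel intensities, already scaled to [0, 1] as ToTensor() would do.
pixels = torch.tensor([0.0, 0.25, 0.5, 0.75, 1.0])

mean, std = 0.5, 0.5
normalized = (pixels - mean) / std   # the per-channel formula used by Normalize
restored = normalized * std + mean   # undoing it (same as normalized/2 + 0.5)

print(normalized)
print(restored)
```

So 0 maps to -1, 0.5 maps to 0, and 1 maps to 1, which is where the (-1,1) range comes from.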

In [None]:
# Chain a bunch of transformations together using torchvision.transforms.Compose
transform_mnist = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize(mean=(0.5, ), std=(0.5, ))]) # First make the input data a tensor, then apply a normalization to the single grayscale channel of the image

With those transforms defined, we can actually grab the dataset and apply the chain of transformations all in one line of code. The method to download and load MNIST also allows you to set the argument `train=True` if you are going to use the images to train a NN (as opposed to testing it). 

In [None]:
# Grab training data from one of the built-in datasets (MNIST)
trainset_mnist = torchvision.datasets.MNIST(root='./mnist', train=True,
                                        download=True, transform=transform_mnist)

In [None]:
trainset_mnist

In [None]:
trainset_mnist.data.shape

In [None]:
trainset_mnist.classes

Just to see what one image looks like as a bunch of tensors:

In [None]:
exampleImage, exampleClassLabel = trainset_mnist[0]
print(exampleImage.shape) # 1  28x28 tensor -- Each of the 28x28 values represents the intensity of the pixel (white = high intensity ; black = low intensity) 
print(exampleImage)
print(exampleClassLabel) # the class label; we have to look at trainset_mnist.classes to see what this means

We'll want some way to show the image, so we'll make a quick function to do this. Because I know we'll work with some color images in a minute, I'm going to define a function that can work with those too.

The function uses `plt.imshow()`, which knows how to display a single color channel image if it's given as a 2D numpy array (height, width) or a 3 color (R,G,B) image if it's given as a 3D numpy array (height, width, color).  The function below just does three things:
 - swaps the dimensions around so they are what `plt.imshow()` expects
 - undoes our normalization
 - changes the tensor into a numpy array
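
The dimension swap is the least obvious of these steps, so here is a sketch of it on random stand-in tensors (`fake_color` and `fake_gray` are made up; they only mimic the channel-first shapes that torchvision produces).

```python
import torch

fake_color = torch.rand(3, 32, 32)   # (channel, height, width), like a CIFAR image tensor
fake_gray = torch.rand(1, 28, 28)    # (channel, height, width), like an MNIST image tensor

# Move the channel dimension to the end for a color image.
color_for_plot = fake_color.permute(1, 2, 0)
# Drop the channel dimension entirely for a single-channel image.
gray_for_plot = fake_gray[0]

print(color_for_plot.shape, gray_for_plot.shape)
```

Note that `permute()` only reorders dimensions; no pixel values change.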

In [None]:
def imshow(img):
  if img.shape[0]==3: # it's probably (color,height,width) so make it (height,width,color) which is what plt.imshow() wants
    img = img.permute(1,2,0)
  elif img.shape[0]==1: # it's probably (1,height,width) so make it just (height,width) which is what plt.imshow() wants for a single channel
    img = img[0]
  img = img/2 + 0.5 # undo our normalization, just to show the image, because plt's imshow() expects numbers to be between (0,1)
  plt.imshow(img.cpu().numpy()) # plt's imshow() knows how to work with numpy arrays, not tensors, so we'll convert it first

Now let's grab a random example item from our dataset. We know there are 60,000 ( `trainset_mnist.data.shape[0]` ) items, so we'll use `np.random.randint()` to grab an index in this range, then we'll display its single grayscale channel using our custom `imshow()` function. 

You should run this a few times, to get a feel for the data:

In [None]:
exImage,exLabel = trainset_mnist[ np.random.randint(0,trainset_mnist.data.shape[0]) ] # each item is a tuple: the first element is the image and the second is the class label index
imshow(exImage)
print(exLabel) # the class label index (an int); trainset_mnist.classes[exLabel] would translate it into the class label

## CIFAR-10: Images of different objects (animals, vehicles, etc.)

We'll use one of the datasets built into torchvision called CIFAR10, a dataset of 50,000 32x32 pixel images -- each belonging to one of ten classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck) that is described in detail [here](https://www.cs.toronto.edu/~kriz/cifar.html). (btw, CIFAR stands for Canadian Institute for Advanced Research, though I always think of it as "Can It Fly And Run"). 

Just as we did before for the MNIST example, we'll use `torchvision.datasets.CIFAR10()` but we don't want to work with the raw image data alone as it is delivered as a set of PIL (Python Image Library) image objects. The only difference here is that these images are not a single grayscale channel but are in color. We'll want to convert ***each color channel of the image*** to a tensor and then normalize it so that the values all fall between (-1,1). 

Again, just as before, we'll use `transforms.Compose()` to chain `transforms.ToTensor()` and then `transforms.Normalize()`. The difference here is that since we have three color channels instead of just one, we supply a 3-tuple for the `mean` and `std` parameters of `Normalize()` to specify the mean and std for each of the three color channels. 

In [None]:
# Chain a bunch of transformations together using torchvision.transforms.Compose
transform_cifar = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))]) # First make the input data a tensor, then apply a normalization to each of the 3 color channels of the image

Now we'll grab the (training) dataset and apply those transformations in a single line: 

In [None]:
# Grab training data from one of the built-in datasets (CIFAR10)
trainset_cifar = torchvision.datasets.CIFAR10(root='./cifar10', train=True,
                                        download=True, transform=transform_cifar)

In [None]:
trainset_cifar

In [None]:
trainset_cifar.data.shape

In [None]:
trainset_cifar.classes

Just to see what one image looks like as a bunch of tensors:

In [None]:
exampleImage, exampleClassLabel = trainset_cifar[0]
print(exampleImage.shape) # 3 different 32x32 tensors (one for each color channel) -- Each of the 32x32 values represents the intensity of the color for that pixel 
print(exampleImage)
print(exampleClassLabel) # the class label; we have to look at trainset_cifar.classes to see what this means

Now let's grab a random example item from our dataset. We know there are 50,000 ( `trainset_cifar.data.shape[0]` ) items, so we'll use `np.random.randint()` to grab an index in this range, then we'll display the image using our custom `imshow()` function. 

You should run this a few times, to get a feel for the data:

In [None]:
#imshow(exampleImage[0])

exImage, exLabel = trainset_cifar[ np.random.randint(0,trainset_cifar.data.shape[0]) ]
imshow(exImage)
print(trainset_cifar.classes[exLabel]) # remember trainset_cifar.classes is a list of the class labels, so this will translate the class label index (an int) into the class label


This same approach can be used to work with other image datasets, though the particulars (the number of pixels, color channels) may differ.

Now let's see how we can define and train a NN to classify these images.  We'll design a NN that could operate on either dataset, but start by training it on the MNIST dataset as this is a much simpler classification task.

# Training a NN to Classify Images

Below is a big codeblock!  Let's break it down piece by piece to see what we're doing:

1. Import stuff
2. Define our `imshow()` function from before
3. Define `nnSave()` function to save our NN to a file
4. Define `nnLoad()` function to load a NN from a file
5. Define a neural net class for use with images
   - we'll chain together the following layers:
     - input layer
     - Fully Connected hidden layer 1
     - ReLU
     - Fully Connected hidden layer 2
     - ReLU
     - Fully Connected output layer
     - LogSoftmax
6. A fit function for training our neural network
7. A test function for evaluating our neural network
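
One detail worth a quick check before reading the code: the network ends in a log-softmax, and later in this notebook it is trained with `torch.nn.functional.nll_loss`. The sketch below, on random stand-in scores (`scores` and `targets` are made up), suggests why that pairing is reasonable: log-softmax followed by the negative log-likelihood loss gives the same number as PyTorch's cross-entropy loss applied to the raw scores.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
scores = torch.randn(4, 10)              # raw outputs for 4 datapoints, 10 classes
targets = torch.tensor([3, 0, 9, 1])     # made-up true class indexes

log_probs = F.log_softmax(scores, dim=1) # what a final LogSoftmax(dim=1) layer produces
loss_a = F.nll_loss(log_probs, targets)  # NLL loss expects log-probabilities
loss_b = F.cross_entropy(scores, targets) # cross-entropy expects the raw scores

print(loss_a.item(), loss_b.item())
print(torch.exp(log_probs).sum(dim=1))   # exponentiating recovers probabilities
```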

Let's have a look at each of these:

In [None]:
# 1. Import Stuff
import torch
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader


# 2. Our imshow() function from earlier
def imshow(img):
  if img.shape[0]==3: # it's probably (color,height,width) so make it (height,width,color) which is what plt.imshow() wants
    img = img.permute(1,2,0)
  elif img.shape[0]==1: # it's probably (1,height,width) so make it just (height,width) which is what plt.imshow() wants for a single channel
    img = img[0]
  img = img/2 + 0.5 # undo our normalization, just to show the image, because plt's imshow() expects numbers to be between (0,1)
  plt.imshow(img.cpu().numpy()) # plt's imshow() knows how to work with numpy arrays, not tensors, so we'll convert it first

# 3. A function to save our NN to a file
def nnSave(model,opt,path):
  torch.save({'model_class': model.__class__, # this is a pointer to the definition of the model's class
              'model_args': model.init_args, # init_args is the only property we have to add to a NN class ourselves for this function to work.
              'model_state_dict': model.state_dict(),
              'opt_class': opt.__class__,
              'opt_args': opt.defaults,
              'opt_state_dict':opt.state_dict()},
            path)

# 4. A function to load our NN from a file  
def nnLoad(path):
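  cp = torch.load(path, weights_only=False) # the checkpoint stores class objects, not just tensors, so recent pytorch versions need weights_only=False (only do this for files you trust)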
  model = cp['model_class'](**cp['model_args']) # equivalent to model = ModelClass(arg1,arg2,...)
  model.load_state_dict(cp['model_state_dict'])
  opt = cp['opt_class'](model.parameters(),**cp['opt_args']) # equivalent to opt = OptClass(arg1,arg2,...)
  opt.load_state_dict(cp['opt_state_dict'])
  return model, opt


# 5. Definition of a Neural Network called ImgNet -- (input layer, FC hidden layer 1, ReLU, FC hidden layer 2, ReLU, FC output layer, LogSoftmax)
class ImgNet(torch.nn.Module):
  def __init__(self,sizeInput,sizeHiddenLayer1,sizeHiddenLayer2,sizeOutput):
    self.init_args = {k:v for k,v in locals().items() if k!='self' and k!='__class__'} # this funny line captures the name and values of the args so we can save them w/ nnSave()
    super().__init__()
    self.fc1 = torch.nn.Linear(sizeInput,sizeHiddenLayer1)
    self.relu1 = torch.nn.ReLU()
    self.fc2 = torch.nn.Linear(sizeHiddenLayer1,sizeHiddenLayer2)
    self.relu2 = torch.nn.ReLU()
    self.fc3 = torch.nn.Linear(sizeHiddenLayer2,sizeOutput)
    self.logsoftmax = torch.nn.LogSoftmax(dim=1) # We are using dim=1 here because the 0th dimension will be the batch dimension
  
  def forward(self,x):
    x = self.fc1(x)
    x = self.relu1(x)
    x = self.fc2(x)
    x = self.relu2(x)
    x = self.fc3(x)
    x = self.logsoftmax(x)
    return x

# 6. Fit function for training a NN
def fit(num_epochs, model, train_dl, loss_fn, opt):
  model.train() # make sure the model is in training mode (instead of eval mode)
  for epoch in range(num_epochs):
    running_loss=0
    for xb,yb in train_dl: 
      xb = xb.view(xb.shape[0],-1) # This will keep the first dimension as the batch dimension and flatten all the others
      xb = xb.to("cuda",non_blocking = True) # this puts the tensor in the GPU's memory. non_blocking=True ensures that RAM->GPU RAM copy doesn't block other operations
      yb = yb.to("cuda", non_blocking = True)
      opt.zero_grad() # We'll start by zero'ing the gradient. We could have done this at the end of this loop, but this ensures we have no errant gradients lying around for the first iteration of the loop
      pred = model(xb) # run the input through the model and get the predictions 
      loss = loss_fn(pred, yb) # calculate the loss -- we'd have to check that the loss_fn gets the prediction and true values in the form it expects -- so its wise to check the docs of whatever loss_fn we use
      loss.backward() # propagate the loss backward
      opt.step() # tell the optimizer to do its thing
      running_loss+=loss.item() # add up the running loss (remember the output of loss will be a scalar, so loss.item() will just be a numerical value)
    print(f"Epoch {epoch} loss = {running_loss/len(train_dl)}") # print out the epoch's loss (the per-batch losses averaged over the number of batches)

# 7. Test function for evaluating a NN
def test(model, test_dl, loss_fn):
  model.eval() # put the model into evaluation mode -- may affect some types of layers (e.g., dropout)
  with torch.no_grad():
    running_loss = 0
    total = 0
    correct = 0
    numClasses = len(test_dl.dataset.classes)
    cm = np.zeros((numClasses,numClasses),dtype=np.int32) # an empty matrix to hold the confusion matrix, we'll sum the confusion matrices for each batch
    #print(cm.shape)
    for xb, yb in test_dl:
      xb = xb.view(xb.shape[0],-1)
      xb = xb.to("cuda")
      yb = yb.to("cuda")
      pred = model(xb)
      predLabels = torch.argmax(pred,dim=1)
      cm += confusion_matrix(yb.cpu().numpy(),predLabels.cpu().numpy(),labels=range(numClasses)) # add this batch's confusion matrix to the total matrix -- we have to specify the list of class indexes, or sklearn will shorten our cm to only the classes seen
      loss = loss_fn(pred,yb)
      running_loss+=loss.item()
    ave_loss = running_loss/len(test_dl)
    acc = np.diag(cm)/cm.sum(axis=1) # the per class accuracy is the diagonals (tp) divided by all cases of that class
    return cm, acc, ave_loss


   

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise: Analyze the above code</font>

Look at the different pieces of the above codeblock and explain each of them as prompted in the questions below.



1. Unlike our previous examples of Neural Net class, the above `ImgNet` has some parameters or arguments that you must pass in when you create an instance of this class. Below, list each of these parameters and explain how they are used to define the architecture of this model:

ENTER YOUR ANSWER HERE

2. Look at the following line in the `fit()` function: 
```
xb = xb.view(xb.shape[0],-1)
```
Now skim over the [documentation for the `torch.Tensor.view()` method](https://pytorch.org/docs/stable/generated/torch.Tensor.view.html). 

* Remember that `xb` is a batch of input tensors. What is the first argument that we passed into this method? What does it describe? 


ENTER YOUR ANSWER HERE


* The second argument that we pass into this method is `-1`. This essentially "flattens" the tensor. Create a random "3D" tensor called `my_t` below of size (5,8,3). Now run the code `my_t.view(my_t.shape[0],-1)` and look at the output tensor.

In [None]:
# Try it out 

Do you see how the `-1` argument works in `.view()`? Modify your code above by trying out different shapes for `my_t` (You can try this with 3D tensors of different shapes, 4D tensors, etc.).

3. Look at the fit function. Pay attention to where `opt.zero_grad()` and `opt.step()` are called. Are we using gradient descent or stochastic gradient descent here? Why?

ENTER YOUR ANSWER HERE

<hr/>

## Classifying MNIST Digits

Let's build a neural net to classify MNIST handwritten digit images.

The code below repeats what we used earlier to download the MNIST dataset, and then defines data loader objects that serve the data in batches.
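
If you haven't used a `DataLoader` in a while, here is a sketch of what one hands back on each iteration. It uses random stand-in data wrapped in a `TensorDataset` (the names `fake_images`, `fake_labels` and `fake_dl` are made up) so it runs without downloading anything.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 200 random "images" shaped like MNIST tensors, with random labels 0-9.
fake_images = torch.rand(200, 1, 28, 28)
fake_labels = torch.randint(0, 10, (200,))
fake_dl = DataLoader(TensorDataset(fake_images, fake_labels), batch_size=64, shuffle=True)

xb, yb = next(iter(fake_dl))       # one batch of inputs and labels
flat = xb.view(xb.shape[0], -1)    # keep the batch dimension, flatten the rest

print(xb.shape, yb.shape, flat.shape)
print(len(fake_dl))                # number of batches per epoch
```

With 200 items and a batch size of 64, the loader yields 4 batches per epoch; the last one is smaller than the rest.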



In [None]:
# Load the MNIST dataset
transform_mnist = transforms.Compose( [transforms.ToTensor(), transforms.Normalize(mean=(0.5,), std=(0.5,)) ] )
trainset_mnist = torchvision.datasets.MNIST('./mnist', download=True, train=True, transform=transform_mnist)
testset_mnist = torchvision.datasets.MNIST('./mnist', download=True, train=False, transform=transform_mnist)

batch_size = 64
train_dl_mnist = DataLoader(trainset_mnist, batch_size=batch_size, shuffle=True)
test_dl_mnist = DataLoader(testset_mnist, batch_size=batch_size, shuffle=True)
imgnet_mnist = ImgNet(28*28,128,64,10).cuda()

# Train the NN
loss_fn_mnist = torch.nn.functional.nll_loss
opt_mnist = torch.optim.SGD(imgnet_mnist.parameters(), lr=0.003, momentum=0.9) # where did I get these "magic numbers?"  Trial and error and voodoo.
fit(15, imgnet_mnist, train_dl_mnist, loss_fn_mnist, opt_mnist)

In [None]:
nnSave(imgnet_mnist,opt_mnist,'./models/imgnet_mnist.pt')

In [None]:
# If we had already saved this to a file, we could uncomment the lines below to load it:
#imgnet_mnist, opt_mnist = nnLoad('./models/imgnet_mnist.pt')
#imgnet_mnist = imgnet_mnist.cuda()

Before we evaluate the model's performance systematically, let's just get a feel for how it did by looking at some of the images and the model's predictions.

The code below will:
- grab an item from the data loader (remember it has shuffling, so every time we run it, a different item will be grabbed)
- get the NN's predicted label for that image (pay attention to the use of `torch.argmax()`)
- show the image and show the label
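
As a reminder of what `torch.argmax()` does with the model's output, here is a sketch on a hand-written tensor of log-probabilities (the numbers in `log_probs` are made up, not produced by our model).

```python
import torch

# Two datapoints, three classes; each row holds log-probabilities.
log_probs = torch.log(torch.tensor([[0.1, 0.7, 0.2],
                                    [0.6, 0.3, 0.1]]))

# With dim=1 we get the index of the largest value in each row.
pred_labels = torch.argmax(log_probs, dim=1)
print(pred_labels)

# For a batch containing a single datapoint, argmax over the whole tensor
# gives the same index, and .item() turns it into a plain python int.
single = torch.argmax(log_probs[0:1]).item()
print(single)
```

Because the logarithm is increasing, the largest log-probability belongs to the most probable class.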

We'll run the below a few times:

In [None]:
# Grab one data item and get the model's predicted label for it
images, labels = next(iter(test_dl_mnist))
img = images[0].to("cuda")
label = labels[0].to("cuda").item()

with torch.no_grad():
  predLabel = torch.argmax(imgnet_mnist(img.view(1,-1))).item()

imshow(img)
print(f"Predicted label was: {predLabel} ; Actual label was: {label}")


Ok, let's turn to formally evaluating the model's performance.

In [None]:
# Formally evaluate the model's performance on the entire test data using our test function.
cm_mnist,acc_mnist,ave_loss_mnist = test(imgnet_mnist, test_dl_mnist, loss_fn_mnist)
print(ave_loss_mnist)
print(cm_mnist)
print(acc_mnist)

**How did we do?**

Our NN classifier performs pretty well. From the confusion matrix, we can see the types of digits that it struggles with (e.g., confusing a "4" for a "9" or a "2" with a "7") -- these types of mistakes make sense as you can imagine making the same mistake yourself depending on the person's handwriting. Overall the per-class accuracy is relatively high.
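
To make the link between the confusion matrix and the accuracy numbers concrete, here is a sketch on a small made-up 3-class matrix (the counts in `cm` are invented for illustration and are not results from our model). Rows are the true classes and columns are the predicted classes, matching sklearn's convention.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[50,  2,  0],
               [ 3, 45,  2],
               [ 1,  4, 43]])

per_class_acc = np.diag(cm) / cm.sum(axis=1)  # correct predictions / all cases of that class
overall_acc = np.trace(cm) / cm.sum()         # all correct predictions / all cases

print(per_class_acc)
print(overall_acc)
```

The off-diagonal entries are the mistakes: `cm[2, 1] = 4` means four items of class 2 were predicted to be class 1.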

## Classifying CIFAR10 Images

Now let's turn to the CIFAR10 dataset and see if a NN with the same architecture can do well in this more complicated context of distinguishing objects from one another. This is a much harder task.

In [None]:
# Load the CIFAR10 data
transform_cifar = transforms.Compose( [ transforms.ToTensor(),transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) ] )
trainset_cifar = torchvision.datasets.CIFAR10(root='./cifar10', train=True, download=True, transform=transform_cifar)
testset_cifar = torchvision.datasets.CIFAR10(root='./cifar10', train=False, download=True, transform=transform_cifar)

batch_size = 64
train_dl_cifar = DataLoader(trainset_cifar, batch_size=batch_size, shuffle=True)
test_dl_cifar = DataLoader(testset_cifar, batch_size=batch_size, shuffle=True)
imgnet_cifar = ImgNet(3*32*32,128,64,10).cuda()

# Train the NN
loss_fn_cifar = torch.nn.functional.nll_loss
opt_cifar = torch.optim.SGD(imgnet_cifar.parameters(), lr=0.003, momentum=0.9) # where did I get these "magic numbers?"  Trial and error and voodoo.
fit(15, imgnet_cifar, train_dl_cifar, loss_fn_cifar, opt_cifar)


In [None]:
nnSave(imgnet_cifar,opt_cifar,'./models/imgnet_cifar.pt')

In [None]:
# If we had already saved this to a file, we could uncomment the lines below to load it:
#imgnet_cifar, opt_cifar = nnLoad('./models/imgnet_cifar.pt')
#imgnet_cifar = imgnet_cifar.cuda()

In [None]:
# Print an example
images, labels = next(iter(test_dl_cifar))
img = images[0].to("cuda")
label = labels[0].to("cuda").item() # .item() gives a plain python int we can use to index the list of class names

with torch.no_grad():
  predLabel = torch.argmax(imgnet_cifar(img.view(1,-1))).item()

imshow(img)
print(f"Predicted label was: {testset_cifar.classes[predLabel]} ; Actual label was: {testset_cifar.classes[label]}")


In [None]:
# Formally evaluate the model's performance on the entire test data using our test function.
cm_cifar,acc_cifar,ave_loss_cifar = test(imgnet_cifar, test_dl_cifar, loss_fn_cifar)
print(ave_loss_cifar)
print(cm_cifar)
print(acc_cifar)

Uh oh. It appears that our simple ImgNet model isn't so great at classifying the more complex images from the CIFAR10 dataset.  Given enough training examples (and enough training time), it may be that our model could learn more sophisticated features. We could always add more layers to give it more parameters to learn. After all, in principle a sufficiently large NN with enough params can approximate a very wide class of functions... But is there a better way that we can *help* the NN learn, by building in structure that is better suited to the patterns found in images? We're now in a position to talk about some more complex architectures. 
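
To put a number on "parameters", the sketch below rebuilds the same layer sizes we used for CIFAR10 as a `torch.nn.Sequential` (a stand-in named `fc_net`, so it runs without the `ImgNet` class or a GPU) and counts the learnable values.

```python
import torch

# Same layer sizes as ImgNet(3*32*32, 128, 64, 10), written as a Sequential stand-in.
fc_net = torch.nn.Sequential(
    torch.nn.Linear(3 * 32 * 32, 128), torch.nn.ReLU(),
    torch.nn.Linear(128, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 10), torch.nn.LogSoftmax(dim=1))

# Every weight and bias is a learnable parameter; numel() counts the entries of each tensor.
n_params = sum(p.numel() for p in fc_net.parameters())
first_layer = sum(p.numel() for p in fc_net[0].parameters())

print(n_params, first_layer)
```

The large majority of the parameters sit in the first fully connected layer, because every one of its 128 neurons is connected to every one of the 3,072 input values.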

# Architecture

Research into building new types of neural networks has advanced rapidly to solve a rich variety of different machine learning problems in the realms of computer vision, natural language processing, and many other contexts. These advances have led to all sorts of new types of layers that have been implemented in pytorch.

We'll look at the following concepts, the layers associated with them, and discuss the ideas behind them:
- Receptive Fields
- Pooling
- Convolution
- BatchNorm
- Dropout




## Receptive Fields

We have seen fully connected layers, where each neuron receives an input from all the neurons in the previous layer. However, what if we wanted to construct a layer that is much sparser? What if we wanted each neuron to receive inputs from only some neurons in the previous layer? Consider a 2D grid of neurons as the input layer. A good example of this is the input layer for image data. Now suppose we want each neuron in the next layer to receive inputs from only a small "window" (or "slice" or "chunk") from the input layer. We call this "window" the **<font color=blue>receptive field</font>**.

---
<figure>
<figtitle>
<font size=5 color=6699FF>
Receptive Fields
</font>
</figtitle>
<div align=left>
<img src="https://drive.google.com/uc?id=1DjUfLYPHC5wln9-q4pa2Y-q1hlYsEiXr" width=450>
</div>
<figcaption>
<font color=669999>The picture above shows an 11x11 2D input layer and a 3x3 receptive field.<br> It also shows a single neuron in the next layer that receives the receptive<br> field as inputs. Notice that this is very different from a fully connected<br> layer (where each neuron in the next layer receives an input from all the<br>neurons in the previous layer). It is much sparser.</font>  
</figcaption>
</figure>

---

With images, the receptive field is typically a set of adjacent pixels in part of the image. With text applications, this could be numerical values that represent the words surrounding a particular word. It makes sense that we would want to capture some "local information" within the data.

Because we often want to define an entire layer that operates on a set of receptive fields, we don't define each receptive field manually, but instead define:
- the **receptive field size**, which is the "window size" (e.g., 3x3 in the image above)
- the **stride**, which is how much in each direction the field should move to define the next receptive field.

---
<figure>
<figtitle>
<font size=5 color=6699FF>
Receptive Fields and Stride
</font>
</figtitle>
<div align=left>
<img src="https://drive.google.com/uc?id=1wBPzGF4YMRL1zOpVo0dupPRnKdnX1ka3" width=600>
</div>
<figcaption>
<font color=669999>The picture above illustrates an 11x11 input layer, a 3x3 receptive field, and a stride of 1 in<br>the horizontal direction (you can have different strides in each direction).</font>
</figcaption>
</figure>

---

Because we are moving the sliding window around and we may want to be able to pass over the edges of our input, it is typical to incorporate **padding** by adding some extra zero values around the input:

---
<figure>
<figtitle>
<font size=5 color=6699FF>
Receptive Fields and Padding
</font>
</figtitle>
<div align=left >
<img src="https://drive.google.com/uc?id=1kvwBLFFH_eJdPQKs6o3odhgdEnQnIY7u" width=450>
</div>
<figcaption>
<font color=669999>The picture above shows a 9x9 input layer with 1 layer of padding around<br> both the width and height dimension of the input. It also shows each of<br> the neurons in the next layer that receive the different receptive fields.</font>
</figcaption>
</figure>

---

***How many neurons would make up the output layer?*** 

It's important to know this, because when we chain together layers in a NN, the size of the output of one layer must match the size of the input of the next layer.  The formula for this (which you can also reason through yourself) is:

$O_d = \frac{I_d + 2*P_d - R_d}{S_d} + 1$

where:
- $I_d$ is the size of the input layer in the $d^{th}$ dimension
- $P_d$ is the padding size in the $d^{th}$ dimension
- $R_d$ is the receptive field size in the $d^{th}$ dimension
- $S_d$ is the stride in the $d^{th}$ dimension

So for the last picture I showed above, the calculation would be:

$ O_d = \frac{9 + 2*1 - 3}{1} + 1 = 9$
(which is the same for the vertical and horizontal dimensions of the input).
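
One way to check this count empirically is to ask PyTorch to extract every receptive field and then count them. This is only a sketch: it assumes the helper `torch.nn.functional.unfold` (not used elsewhere in this notebook), and the variable names are just for illustration.

```python
import torch
import torch.nn.functional as F

# A 9x9 input with a batch dimension and a channel dimension: shape (1, 1, 9, 9)
x = torch.rand(1, 1, 9, 9)

# Extract every 3x3 receptive field, using a stride of 1 and 1 layer of zero padding.
# The result has shape (batch, values per receptive field, number of receptive fields).
patches = F.unfold(x, kernel_size=3, padding=1, stride=1)
print(patches.shape)  # torch.Size([1, 9, 81]) -> 3*3 = 9 values in each field

num_fields = patches.shape[-1]
print(num_fields)     # 81 receptive fields, i.e., a 9x9 output layer
```

Try changing `stride` or `padding` and compare the number of receptive fields with what the formula predicts.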

<br />

Some things to note:
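* Naturally, the output size in any dimension needs to be an integer, so not every stride fits the input exactly (since the above formula divides by stride). When the division is not exact, PyTorch's pooling and convolution layers round the result down by default, so any receptive field that would hang over the edge is simply skipped. And of course, it must be the case that the receptive field $R_d \le I_d + 2*P_d$.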

* Note that from this formula, you can see that operations on receptive fields (like the pooling and convolution operations that we'll talk about in the next section) can effectively shrink the size of data passing from input to output, depending on the choice of stride and padding. For this reason, one should take care with applying multiple operations like that in successive layers -- you *could shrink your data down to nothing!* While I've shown everything with 2D inputs, the notions extend to an arbitrary number of dimensions.
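
To see the shrinking in action, here is a small sketch that applies the same 2x2, stride-2 operation over and over to a 32x32 input and records the size after each step. It assumes `torch.nn.MaxPool2d` (introduced in the Pooling section below) as the operation; any receptive-field operation with the same size, stride and padding would shrink the data in the same way.

```python
import torch

pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)  # 2x2 receptive field, stride 2, no padding

x = torch.rand(1, 1, 32, 32)  # a single 32x32 input with one channel
sizes = [x.shape[-1]]
while x.shape[-1] >= 2:       # stop once a 2x2 window no longer fits
    x = pool(x)
    sizes.append(x.shape[-1])
print(sizes)  # [32, 16, 8, 4, 2, 1]

# A 7x7 input: (7 - 2)/2 + 1 = 3.5, which PyTorch rounds down to 3
odd_out = pool(torch.rand(1, 1, 7, 7))
print(odd_out.shape)  # torch.Size([1, 1, 3, 3])
```

After only five of these layers there is a single value left, so one more layer would have nothing to work with.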




<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise: Calculate the Output Dimension of a pooling layer</font>

Write a python function that takes arguments `input_size_d`, `padding_size_d`, `receptive_field_size_d`, and `stride_size_d` and returns `output_size_d` using the formula above.

<br />

Suppose you have a tensor of shape `(9,9)` and you run it through a pooling layer with a padding of (1,1), a receptive field of (2,3) and a stride of (1,1). Use your function to calculate the output size in each dimension (i.e., call your function twice, once for each dimension).



In [None]:
# Write your function get_pooling_output_d(input_size_d, padding_size_d, receptive_field_size_d, stride_size_d)

In [None]:
# Get the output size of first dimension

In [None]:
# Get the output size of second dimension

<hr/>

## Pooling

Pooling layers enact some aggregation operation on a receptive field. The aggregation could be: average, max, min, etc. Let's look at some examples for max pooling and average pooling: 

<figure>
<figtitle>
<font size=5 color=6699FF>
<div align=left>
Pooling: Max and Average
</div>
</font>
</figtitle>
<div align=left>
<img src="https://drive.google.com/uc?id=1cMZ5wWbwatB3YbYHDShlAFXWX5qc-tYO" width=400>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
<img src="https://drive.google.com/uc?id=1MhVT3O5Z9pi8DROHKx0Ck_V8K9qTtUJT" width=400>
</div>
<figcaption>
<font color=669999>
An example of 2x2 pooling with a stride of 2 for max pooling (left) and average pooling (right).
</font>
</figcaption>
</figure>

Notice that pooling layers don't have any weights. During backpropagation, the gradient is routed backward to all the neurons in the receptive field that contributed to it. So, for example, in max pooling, only the neuron in the receptive field that held the max value would propagate the gradient backwards (similarly in min pooling). In average pooling, all the neurons in the receptive field would propagate the gradient. 
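
We can watch this gradient routing happen with autograd. The sketch below uses the functional versions of the pooling operations (`torch.nn.functional.max_pool2d` and `avg_pool2d`) on a single 2x2 receptive field; the tensor values are made up for illustration.

```python
import torch
import torch.nn.functional as F

# One 2x2 receptive field, shape (batch, channel, height, width) = (1, 1, 2, 2)
x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]], requires_grad=True)

# Max pooling: only the neuron holding the max value (4) receives gradient
F.max_pool2d(x, kernel_size=2).sum().backward()
max_grad = x.grad.clone()
print(max_grad)  # 1 at the position of the 4, and 0 everywhere else

# Average pooling: every neuron in the receptive field receives an equal share
x.grad.zero_()
F.avg_pool2d(x, kernel_size=2).sum().backward()
avg_grad = x.grad.clone()
print(avg_grad)  # 0.25 at every position
```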

### How to add a pooling layer in Pytorch

It's relatively simple to add a pooling layer in Pytorch, by using `torch.nn.MaxPool2d` (there's a 1d version as well). We simply have to specify:
- `kernel_size` - the size of the receptive field you are pooling. It can be a single integer, such as 2, for a 2x2 square pool, or a tuple such as (2,3) for a rectangular 2x3 pool.
- `stride` - the stride, which can be a single integer, such as 2, or a tuple such as (2,1) for striding different amounts in the x and y directions.
- `padding` - the amount of padding added on each side of the input before the pooling is applied (see the discussion of padding in the Receptive Fields section above). It can be a single integer for equal padding on the x and y dimensions, or a tuple for different padding in the x and y dimensions.
- There are other parameters that we'll typically not specify and just use the defaults, but you can read more about it in the pytorch documentation [here](https://pytorch.org/docs/stable/nn.html#maxpool2d)

And, of course, there are also `torch.nn.AvgPool2d()` and a few other types of pooling layers that take similar arguments.

Here's how we make a MaxPool2d layer with Pytorch:

In [None]:
p = torch.nn.MaxPool2d(kernel_size=2,stride=2)
print(p)

There aren't any parameters for the NN to learn for a pooling layer, because it is just pooling the receptive field and performing some aggregate operation. We will have to think about the shape of the output from such an operation (in many cases the pooling will act to shrink the input) -- so you will have to use the formula above to make sure the dimensions of the output of each layer are matched to the dimensions of the input of the next layer in your NN.

Here's a simple example:

In [None]:
someInput = torch.rand(1,4,4) # e.g., a single 4x4 example input
print(someInput)
print(p(someInput)) # max pooling applied to the input
# According to the formula, we have (same for both x and y dimension) Od = (4 - 2)/2 + 1 = 2 

# You can play around with the parameters of the pooling layer and the dimensions of someInput to see how this works
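
Because the example above uses random numbers, it can be hard to check by hand. Here is a sketch with a fixed input (the numbers 0 through 15 laid out in a 4x4 grid, chosen only for illustration) that compares max pooling and average pooling side by side.

```python
import torch

# A fixed 4x4 input with values 0..15, shape (1, 4, 4)
x = torch.arange(16, dtype=torch.float32).view(1, 4, 4)
print(x)

max_pool = torch.nn.MaxPool2d(kernel_size=2, stride=2)
avg_pool = torch.nn.AvgPool2d(kernel_size=2, stride=2)

max_out = max_pool(x)
avg_out = avg_pool(x)
print(max_out)  # [[5, 7], [13, 15]]         -> the largest value in each 2x2 window
print(avg_out)  # [[2.5, 4.5], [10.5, 12.5]] -> the mean of each 2x2 window
```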

## Convolution

Convolution is similar to pooling, but it introduces a weight for each neuron in the receptive field (RF). The weights form a grid with the same dimensions as the receptive field. We call this grid of weights a **<font color=blue>filter</font>** (also called a **kernel**). The output is simply the sum of each RF neuron times the corresponding filter weight. 
<br><br>

The actual calculation for a convolution (with stride=1) is given by:


$ C_{m,n} = \sum_i\sum_j Filter_{i,j} * Input_{m-i,n-j}$

<br>

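But it's actually much more intuitive than that. A convolution is just sliding the filter box over the input and multiplying each input term by the filter weight in the corresponding box and then adding up the result. (Strictly speaking, the formula above flips the filter before sliding it. PyTorch's convolution layers skip the flip, which is technically called a cross-correlation. Because the filter weights are learned anyway, the distinction doesn't matter in practice.)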
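
The sliding-window description can be sketched directly with two loops and compared against PyTorch's built-in `torch.nn.functional.conv2d` (which slides the filter without flipping it). The input and filter values here are made up for illustration, and the loop assumes a stride of 1 and no padding.

```python
import torch
import torch.nn.functional as F

inp = torch.arange(16, dtype=torch.float32).view(4, 4)  # a 4x4 input with values 0..15
filt = torch.tensor([[1., 0.],
                     [0., -1.]])                        # a 2x2 filter

# With stride 1 and no padding, the output is (4 - 2 + 1) x (4 - 2 + 1) = 3x3
out_h = inp.shape[0] - filt.shape[0] + 1
out_w = inp.shape[1] - filt.shape[1] + 1

manual = torch.zeros(out_h, out_w)
for m in range(out_h):
    for n in range(out_w):
        # The receptive field whose top-left corner is at (m, n)
        window = inp[m:m + filt.shape[0], n:n + filt.shape[1]]
        # Multiply each input by the matching filter weight and add up the result
        manual[m, n] = (window * filt).sum()

# The built-in version expects (batch, channel, height, width) shapes
builtin = F.conv2d(inp.view(1, 1, 4, 4), filt.view(1, 1, 2, 2)).view(out_h, out_w)

print(manual)                            # every entry is -5 for this input and filter
print(torch.allclose(manual, builtin))   # True
```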

<br>

---

<figure>
<figtitle>
<font size=5 color=6699FF>A convolution operation</font>
</figtitle>
<div align=left>
<img src="https://drive.google.com/uc?id=11c_7Pcfl-xjVLTReCO1Rds07DcFhZL-6" width=600 />
</div>
<figcaption>
<font color=669999>
A convolution operation on a 4x4 input using a 2x2 filter with a stride of 1 in each direction.
</font>
</figcaption>
</figure>


---

<br>

Try to calculate the values of the output for the above example to ensure that you understand what is happening here. To get a general sense of what convolution does and why it can be such a powerful tool, we'll look at some animations:

<br>

---

<figure>
<figtitle>
<font size=5 color=6699FF>How Convolution Works (Animated)</font>
</figtitle>
<div align=left>
<img src="https://drive.google.com/uc?id=1aiJ3Dbb7BkmKeFuw7HcEhasZ1GzOdODP" width=600>
</div>
<figcaption>
<font color=669999>
A convolution operation on a 6x6 input using a 3x3 filter (also called a kernel). The output of a<br> convolution operation is sometimes called a feature map.
</font>
</figcaption>
</figure>

---

<br>

Have a look at the filter (kernel) chosen in the example above. It has positive values on the top row, zero values in the middle row and negative values on the bottom row. 

<font color=blue>**Q**</font>: What kind of *receptive field input* would yield a large value when multiplied by this filter?  

<font color=cc6600>**A**</font>: A receptive field that has a horizontal edge. 

Do you see why?

When you see a convolution operation, you should always pay attention to the filter that is used and get a feel for what it might do. Filters can be shaped to pick out "contrast" such as edges at different angles or bright spots. They can also be chosen to "blur" or "smooth out" an image.
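
Here is a small sketch of that idea on a synthetic image. The filter values (1, 0, -1) are an assumption chosen to match the pattern described above (positive top row, zero middle row, negative bottom row), not necessarily the exact numbers in the animation.

```python
import torch
import torch.nn.functional as F

# A synthetic 6x6 image: bright top half (1s), dark bottom half (0s),
# so there is a single horizontal edge across the middle
img = torch.zeros(1, 1, 6, 6)
img[:, :, :3, :] = 1.0

# Positive top row, zero middle row, negative bottom row
horizontal = torch.tensor([[ 1.,  1.,  1.],
                           [ 0.,  0.,  0.],
                           [-1., -1., -1.]]).view(1, 1, 3, 3)

edge_map = F.conv2d(img, horizontal)
print(edge_map[0, 0])
# rows 0 and 3 are all 0 (the window sees a flat region),
# rows 1 and 2 are all 3 (the window straddles the edge)

# The same filter rotated to look for vertical edges finds nothing in this image
vertical = horizontal.transpose(2, 3)
vert_map = F.conv2d(img, vertical)
print(vert_map[0, 0])  # all zeros
```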

Let's look at another example.


---

<figure>
<figtitle>
<font size=5 color=6699FF>Convolution with Stride > 1</font>
</figtitle>
<div align=left>
<img src="https://drive.google.com/uc?id=11lwVxjqIKiD7FYzwBbZyij7fOyObL3p4" width=600>
</div>
<figcaption>
<font color=669999>
A convolution operation on a 7x7 input using a 3x3 filter (also called a kernel), with a stride of 2<br> in each direction.
</font>
</figcaption>
</figure>

---
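
To get a feel for what the stride does, the sketch below convolves a 7x7 input with a 3x3 filter using a stride of 1 and a stride of 2. The input values and the all-ones filter are made up for illustration.

```python
import torch
import torch.nn.functional as F

x = torch.arange(49, dtype=torch.float32).view(1, 1, 7, 7)  # a 7x7 input
k = torch.ones(1, 1, 3, 3)                                  # a 3x3 filter of all ones

s1 = F.conv2d(x, k, stride=1)  # (7 - 3)/1 + 1 = 5 -> 5x5 output
s2 = F.conv2d(x, k, stride=2)  # (7 - 3)/2 + 1 = 3 -> 3x3 output
print(s1.shape, s2.shape)

# A stride of 2 visits every second window position, so its output matches
# every second row and column of the stride-1 output
print(torch.equal(s2, s1[:, :, ::2, ::2]))  # True
```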
<br>

If some of this terminology (e.g., filter, kernel) seems familiar to you, it is no accident. In image processing (e.g., Photoshop) and in audio processing (e.g., Ableton), applying filters through convolution long predates the use of convolution in neural networks. It is a central mathematical operation in ***digital signal processing***.

To show you how it can be useful, have a look at these convolutions applied to an image of the facade of a building (from Idstein, Germany) with different filter choices:

---
<figure>
<figtitle>
<font size=5 color=6699FF>Convolution for Edge Detection</font>
</figtitle>
<div align=left>
<img src="https://drive.google.com/uc?id=1icO8VUDxDKTmoap8_1hKsi_-1ydA4GYx" width=600>
</div>
<figcaption>
<font color=669999>Using convolution with different filters can detect edges in different orientations. This is why the<br> output of a convolution is sometimes referred to as a "Feature Map", because it can pick out<br> different features of the input.</font>
</figcaption>
</figure>

---
<br><br>
In neural networks, the weights that make up the filters of convolutional layers are **learned** the same way all NN parameters are learned -- through training (i.e., via backpropagation). Each filter also has a single bias value, which is shared across all the neurons in the feature map that the filter produces.



### How to add a convolutional layer in Pytorch

It's relatively simple to add a convolutional layer into a NN in Pytorch using a `torch.nn.Conv2d()` layer. We simply have to specify:
- `in_channels` -- The number of 2d planes or input channels (this is analogous to e.g., the number of color channels of an image -- so with CIFAR10, it would be 3)
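- `out_channels` -- the number of output channels produced by the convolution. This is the number of filters or kernels that the NN will learn. Each filter has its own grid of weights for every input channel.
- `kernel_size` -- the size of the filter (i.e., of the receptive field). It can be a single number, such as 3, for a square 3x3 filter, or a tuple such as (3,4) for a rectangular filter.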
- `stride` -- the stride parameter. It can be a single number, such as 3, or a tuple such as (1,2) if we want the stride to be different in the x and y dimensions of the input.
- `padding` -- the padding around the input. It can be a single number, such as 1, or a tuple such as (1,2) if we want different padding around x and y dimensions of the input.
- There are some other parameters that we'll typically just set as defaults. You can read about them in the pytorch documentation [here](https://pytorch.org/docs/stable/nn.html#conv2d)

For example:

In [None]:
import torch

c = torch.nn.Conv2d(in_channels=1,out_channels=3, stride = 1, padding=1, kernel_size=3)
print(c.weight.shape)
print(c.weight)

The tensors that are shown above are the filters (i.e., the parameters that our NN will learn when we train it), initialized to some random values. Notice that we specified a single input channel, and 3 output channels, so we have 3 different filters (or kernels), each of shape 3x3.

If we wanted, we could make the filter (or kernel) be non-square:

In [None]:
c = torch.nn.Conv2d(in_channels=1,out_channels=3, stride = 1, padding=1,kernel_size=(3,4))
print(c.weight.shape)
print(c.weight)

Notice that this has 3 rectangular filters of shape 3x4 that will be convolved over the single input channel (e.g., a grayscale image).

If we have 3 color channels for our image, as is the case with CIFAR10 images, we would specify `in_channels=3`:

In [None]:
c = torch.nn.Conv2d(in_channels=3,out_channels=3, stride = 1, padding=1,kernel_size=3)
print(c.weight.shape)
print(c.weight)

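Notice that, in this case, the weight tensor has shape `(3,3,3,3)`: our NN will be learning 3 filters (one per output channel), and each filter has a 3x3 grid of weights for each of the 3 color channels.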
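
To see what such a layer does to the shape of the data, we can pass a batch of CIFAR10-sized tensors through it. This is only a sketch: the batch below is random noise standing in for 4 real images, and the parameter count is computed just to show where the learnable numbers live.

```python
import torch

c = torch.nn.Conv2d(in_channels=3, out_channels=3, stride=1, padding=1, kernel_size=3)

batch = torch.rand(4, 3, 32, 32)  # stand-in for 4 CIFAR10 images: (N, C, H, W)
out = c(batch)

# (32 + 2*1 - 3)/1 + 1 = 32, so the height and width are unchanged
print(out.shape)       # torch.Size([4, 3, 32, 32])

print(c.weight.shape)  # torch.Size([3, 3, 3, 3]) -> (out_channels, in_channels, 3, 3)
print(c.bias.shape)    # torch.Size([3])          -> one bias per output channel

n_params = sum(p.numel() for p in c.parameters())
print(n_params)        # 3*3*3*3 weights + 3 biases = 84
```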

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise: Use a PyTorch Convolution layer to explore different filters </font>

PyTorch convolution layers have filter weights that are learned, just as any weights in NNs are trained. However, we can set the weights of these layers manually to see how 2D convolutions with different filters affect image inputs. Let's do that and explore some different filters applied to (just one color channel of) CIFAR images.

<br />

Do the following:
1. Run the code below for a slightly modified version of the imshow function that will allow us to plot multiple images side by side.

2. Run the code to grab an example image and label from the CIFAR data.

3. Let's create the filters (kernels) that are shown in the image above labelled "Convolution for Edge Detection". Each of these is a 3x3 filter. We can define it by calling `torch.FloatTensor()` on a multidimensional array or list. Note that we will want to create a tensor of shape `(1,1,3,3)`, which is the shape of the weights of a convolution layer: the 1st dimension is the number of output channels, the 2nd dimension is the number of input channels, and the last two dimensions are the height and width of the filter. So, you can define these filters by calling `torch.FloatTensor()` on a list of lists and then calling `.view(1,1,3,3)` to change it to the shape we want. Call these filter tensors `filter1`, `filter2`, etc. Start by setting the variable `filter = filter1` (later we will change this to look at all the filters we made).

4. Create a 2d convolution layer called `c` with 1 in channel, 1 out channel, padding and stride of 1, and a kernel size of 3x3.

5. We can set the weights of a conv2d layer manually, but we have to do this within a `with torch.no_grad():` block. Inside the block, copy the filter into the layer with `c.weight.copy_(filter)`. (Assigning a plain tensor with `c.weight = filter` raises a `TypeError`, because PyTorch expects a `torch.nn.Parameter` there.)

6. Grab the first color channel of the CIFAR image `exImage[0]`. We want to make this a tensor of shape `(1,1,32,32)`. Use `.view()` to accomplish this and call the resulting tensor `input`.

7. Run `input` through the convolutional layer `c` to get the `output`. Look at the shape of `output`.

8. Now we will plot the input image, the filter, and the output image side by side using our `imshow` function. `imshow` will take a list of tensors to display, but wants each one to be of shape `(1,X,Y)`. So, we can plot a `(1,32,32)` shaped tensor (like our input and output)  or even a `(1,3,3)` shaped tensor like `filter`. Use `.view()` to change the input, output and filter tensors to these shapes, make a list of them and pass it to `imshow`.

Finally, re-run the above but set `filter = filter2`, `filter = filter3`, etc.



In [None]:
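# 1. Run the code below to define a modified imshow function that plots images side by side
#  You can now pass this function a list of images represented as tensors, and it will plot each side by side
import numpy as np
import matplotlib.pyplot as plt

def imshow(img, transform=True, titles=None):
  if not isinstance(img, list):
    img = [img]  # wrap a single tensor so the loop below always sees a list
  fig, ax = plt.subplots(1, len(img), figsize=(15,15))
  for i, im in enumerate(img):
    im = im.detach().cpu()  # detach so that layer outputs (which track gradients) can be converted to numpy
    if im.shape[0]==3:
      im = im.permute(1,2,0)  # (3,H,W) -> (H,W,3), the layout matplotlib expects for color images
    elif im.shape[0]==1:
      im = im[0]  # (1,H,W) -> (H,W) for a single-channel image
    if transform:
      im = im/2 + 0.5  # rescale values from [-1,1] to [0,1]
    if isinstance(ax, np.ndarray):  # several images: ax is an array of axes
      ax[i].imshow(im.numpy())
      if titles is not None:
        ax[i].set_title(titles[i])
    else:  # a single image: ax is a single axes object
      ax.imshow(im.numpy())
      if titles is not None:
        ax.set_title(titles[0])
      fig.set_figheight(5)
      fig.set_figwidth(5)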

In [None]:
# 2. Grab an example image and its label from the CIFAR dataset -- you can rerun this until you get one that you like
exImage, exLabel = trainset_cifar[ np.random.randint(0,trainset_cifar.data.shape[0]) ]
imshow(exImage)
print(trainset_cifar.classes[exLabel])

In [None]:
# 3. Create the filters: filter1, filter2, etc. by looking at the 3x3 filters shown in the image "Convolution for Edge Detection" above.
#   You can define each filter by making a list of lists, then calling torch.FloatTensor() on it, and then calling .view(1,1,3,3). Then define the variable filter
#   and set it equal to filter1 (later we'll change this to filter2, filter3, etc.)
# INSERT YOUR CODE HERE


# 4. Create a 2d convolution layer called c with 1 in channel, 1 out channel, padding and stride of 1, and a kernel size of 3x3.
# INSERT YOUR CODE HERE


# 5. Under a "with torch.no_grad():" block, set the weights of the conv2d layer manually to the filter, using c.weight.copy_(filter)
# INSERT YOUR CODE HERE


# 6. Grab the first color channel of the example CIFAR image (exImage), exImage[0], and then apply .view(1,1,32,32) so that the conv2d layer can get the input shape it expects
# INSERT YOUR CODE HERE


# 7. Run "input" through the convolution layer and call the result "output". Look at the shape of output
# INSERT YOUR CODE HERE


# 8. Use our imshow function to plot the input image, filter, and output image side by side. We'll want to pass imshow a list of tensors. Each tensor in the list should have the
#    shape (1,x,x) instead of (1,1,x,x)  where x=32 for the input and output images, and x=3 for the filter. 
# INSERT YOUR CODE HERE


# Now, repeat the above with different filters. Do you see how the convolution is picking up on different "visual features" (e.g., horizontal edges, diagonal edges, etc.)
#  of the input image?


<hr/>

## Batch Normalization

Batch normalization is a way of keeping the outputs from a layer in your NN from getting out of control. It normalizes the outputs from a layer for each "batch" so that the mean activation value is close to 0 and its standard deviation is close to 1. This can lead to substantial speedup in training a NN. These layers are typically used after convolutions as well as fully connected layers.


### How to add a batch normalization layer in Pytorch

It's pretty straightforward -- we just use `torch.nn.BatchNorm2d` (for 2d inputs) or `torch.nn.BatchNorm1d` (for 1d inputs).

We just have to specify `num_features`, which is analogous to the number of channels in an image for the 2d case. Usually we are passing a batch of N images, each with C channels, a Height (H) and a Width (W), so the tensors going into this layer will have shape (N,C,H,W) and `num_features` should be equal to C. Of course, if you did some convolution in the prior step, then C would have to match the number of `out_channels` from that convolution.

In [None]:
bn = torch.nn.BatchNorm2d(1)
print(bn)
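
To see the normalization at work, the sketch below feeds a batch with a deliberately large mean and spread through a `BatchNorm2d` layer in training mode and compares the per-channel statistics before and after. The batch is random noise and the numbers 10 and 5 are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)

bn = torch.nn.BatchNorm2d(num_features=3)

# A batch of 16 "images" with 3 channels, with mean around 10 and standard deviation around 5
x = torch.randn(16, 3, 8, 8) * 5.0 + 10.0

bn.train()  # in training mode, the statistics of the current batch are used
y = bn(x)

# Per-channel statistics, computed over the batch, height and width dimensions
x_mean, x_std = x.mean(dim=(0, 2, 3)), x.std(dim=(0, 2, 3))
y_mean, y_std = y.mean(dim=(0, 2, 3)), y.std(dim=(0, 2, 3))
print(x_mean, x_std)  # roughly 10 and 5 for each channel
print(y_mean, y_std)  # close to 0 and 1 for each channel
```

The layer also has two learnable parameters per channel (`bn.weight` and `bn.bias`), which let the NN rescale and shift the normalized values if that turns out to be useful during training.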

## Dropout

Dropout is a technique for training Neural Networks that tries to make sure that no neuron in the network relies too much on other neurons, so that each neuron "learns" something useful on its own.  It is a type of **regularization**. 
- Regularization is a category of techniques that reduce the potential for overfitting, with the goal of lowering the error on data the model has not seen. 

The idea of dropout is that, during ***training time***, we will randomly ignore some neurons, each with some probability (i.e., pretend as if they are not present in the network): 

<img src="https://drive.google.com/uc?id=1nThNWfnnC5omfMEJIm0JOSocYh6XcJaY" width=600>

By doing so, we encourage neurons to not become entirely dependent on other neurons (which may "dropout" with some probability in each training iteration):

<img src="https://drive.google.com/uc?id=1VBsbcHlFcZiwe9SV9eHIH8Icj-gyNfmg" width=300>

Note that **at test time** all neurons will be kept in the network. To ensure that this is the case, we have to indicate to our model when we are training and when we are testing. We do this with `model.train()` and `model.eval()`. You may have noticed these lines of code in the functions that I supplied above for `fit()` and `test()`. 
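
Here is a minimal sketch of the difference between the two modes, using a `torch.nn.Dropout` layer (introduced in the next subsection) on a tensor of ones. Which entries get dropped is random, so your output in training mode will vary from run to run.

```python
import torch

torch.manual_seed(0)

do = torch.nn.Dropout(p=0.5)
x = torch.ones(1, 10)

do.train()  # training mode: each value is dropped (set to 0) with probability p
train_out = do(x)
print(train_out)  # a mix of 0s and 2s

do.eval()   # evaluation mode: dropout does nothing
eval_out = do(x)
print(eval_out)   # all ones, identical to the input
```

Notice that the values that survive in training mode are scaled up by 1/(1-p) (here, from 1 to 2). PyTorch does this so that the expected size of the activations is the same in training and at test time.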


### How to add dropout in Pytorch


In practice, it is relatively straightforward to add dropout to a neural net in PyTorch using `torch.nn.Dropout()` or `torch.nn.Dropout2d()`. We just need to add a dropout layer after any layer where we might want some of the neurons to be dropped. When we do so, we have to specify the probability, `p`, that each neuron in the previous layer will be dropped (with `torch.nn.Dropout2d()`, entire channels are dropped together rather than individual neurons).


In [None]:
do = torch.nn.Dropout2d(p=0.5)
print(do)
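
To tie the pieces together, here is a rough sketch of how convolution, batch normalization, pooling and dropout layers might be chained for CIFAR10-sized inputs. The class name `TinyCNN` and all of the layer sizes are hypothetical choices made for illustration (this is not a tuned architecture, and it assumes the ReLU activation). The comments track the shape after each layer using the output size formula from the Receptive Fields section.

```python
import torch

class TinyCNN(torch.nn.Module):  # hypothetical name, for illustration only
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),   # (N,3,32,32)  -> (N,16,32,32)
            torch.nn.BatchNorm2d(16),                           # num_features matches out_channels
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2, stride=2),        # (N,16,32,32) -> (N,16,16,16)
            torch.nn.Conv2d(16, 32, kernel_size=3, padding=1),  # (N,16,16,16) -> (N,32,16,16)
            torch.nn.BatchNorm2d(32),
            torch.nn.ReLU(),
            torch.nn.MaxPool2d(kernel_size=2, stride=2),        # (N,32,16,16) -> (N,32,8,8)
        )
        self.classifier = torch.nn.Sequential(
            torch.nn.Flatten(),                                 # (N,32,8,8) -> (N,2048)
            torch.nn.Dropout(p=0.5),                            # only active in training mode
            torch.nn.Linear(32 * 8 * 8, num_classes),           # (N,2048) -> (N,10)
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyCNN()
model.eval()  # switch off dropout and use the running batch norm statistics
logits = model(torch.rand(4, 3, 32, 32))  # random stand-in for 4 CIFAR10 images
print(logits.shape)  # torch.Size([4, 10])
```

If you change a kernel size, stride or padding, the input size of the final `Linear` layer has to be recalculated so that it still matches the flattened output of the last pooling layer.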

# Feedback
What did you think about this notebook? What questions do you have? Were any parts confusing? Write your thoughts in the text box below.

<font size =2> note: You can double click this text box in colab to edit it.</font>

PUT YOUR THOUGHTS HERE

# Submit
Don't forget to submit your notebook before class! Make sure you have saved your work (**Colab Menu: File-> Save**) and then download a pure python copy (**Colab Menu: File-> Download -> Download .py**) and a python notebook copy (**Colab Menu: File-> Download -> Download .ipynb**). You will upload both of these to the assignment on the canvas page.
