<a href="https://colab.research.google.com/github/dylanwalker/MGSC496/blob/main/MGSC496_L11.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This is the in-class notebook for MGSC496 Lecture 11.

---



In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import torch
from torch.utils.data import DataLoader
import torchvision
from torchvision import transforms


if not os.path.exists('./models'):
  os.mkdir('./models')


Things to mention:
* Flattening images with `xb.view(xb.shape[0],-1)`
 * Especially with convolutional layers, we need to flatten the output of the 2D layers before operations like softmax, i.e., to get from the 2D image to a final class prediction
* Discuss `model.train()` and `model.eval()`

* Increasing loss / exploding gradients
 * Let's take a look at an example of this, below under **Misc:Increasing Loss / Exploding Gradients**
 * [A good explanation of why too-large learning rates can lead to gradient explosion](https://stats.stackexchange.com/questions/315664/gradient-descent-explodes-if-learning-rate-is-too-large)
* Cross Entropy Loss
 * Cross Entropy Loss is a good loss function to use for classification problems (esp w/ multiple classes).
 * It compares two distributions (and is often used to compare a one-hot encoded class label to a softmax predicted prob of a data point belonging to each class)
 * Note that the Cross Entropy Loss is asymmetric, so make sure that you are putting the labels as the second argument!
 * [Comparing Cross Entropy Loss to MSE](https://dhruvmetha.medium.com/why-cross-entropy-loss-6f221202c8b8#:~:text=Comparison%20with%20Mean%20Squared%20Error%20(L2%20loss))
* Three more advanced topics that we didn't cover are:
 * Text/sequence embeddings -- getting a numerical vector out of text (a sequence of words) or even sequences of other data. You may have heard of Word2Vec, which is a popular example of this.
  * Often this is done by masking (hiding) some of the elements in the sequence (e.g., a word) and creating an NN that predicts the hidden elements. In doing so, we can make the neurons of some middle layer of the NN represent the input sequence as a set of numerical values. These become the "vector" that represents the text.
 * Recurrent Neural Networks: Feed the output of the NN back into it as an input.
  * This brings new types of problems to deal with, like vanishing/exploding gradients, as backward propagation has to go through the net multiple times.
  * Can sacrifice long-term memory of what is happening. There are some approaches to dealing with this (like letting some inputs skip over the pass through the NN). You may have heard of LSTMs (Long Short-Term Memory networks), which are examples of RNNs.
 * Transformers - An entirely different approach to handling input sequence data compared to RNNs. This is much more similar to the sequence embedding approaches discussed above, but uses learned numbers to capture how much attention is paid to different parts of the sequence when trying to predict a masked part of it.
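
Two of the points above (flattening, and `model.train()` vs `model.eval()`) are easy to see on toy tensors. The snippet below is a minimal sketch with made-up data; the tensor sizes and the dropout probability are illustrative assumptions, not values used elsewhere in this notebook.

```python
import torch

# A fake batch of 4 RGB images, each 32x32: (batch, channel, height, width)
xb = torch.rand(4, 3, 32, 32)

# Keep the batch dimension and flatten everything else
flat = xb.view(xb.shape[0], -1)
print(flat.shape)  # torch.Size([4, 3072])

# Dropout behaves differently in training mode and evaluation mode
drop = torch.nn.Dropout(p=0.5)
ones = torch.ones(1, 8)

drop.train()
train_out = drop(ones)  # each entry is either 0 or 1/(1-p) = 2
drop.eval()
eval_out = drop(ones)   # dropout is switched off: the input passes through unchanged

print(train_out)
print(eval_out)
```

This is why we switch to `model.eval()` before measuring performance: layers such as dropout would otherwise keep randomly zeroing activations.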

# Reading Exercise Solution: Make Your Own Neural Network Class

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise:Make your own Neural Network Class</font>


Make your own neural network class by following the examples above. Remember your class should have a constructor that defines and stores the layers; and a forward method that passes the input through each layer in the right order. Feel free to use `torch.nn.Sequential` if you wish.


Your neural net class should be called `MyNet` and have the following architecture:
* A fully connected (linear) layer that takes input of 7 dimensions and outputs 32 dimensions
* A sigmoid layer (you can use `torch.nn.Sigmoid`)
* Another fully connected (linear) layer with an input that matches the output of the last layer and an output of 64 dimensions
* Another sigmoid layer
* A third fully connected (linear) layer with an input that matches the output of the last layer and an output of 5 dimensions
* A softmax layer

When you have defined your class, make an instance of an object of that class and pass a random torch tensor of shape (3,7). Also try passing a random torch tensor of shape (10,7).

Unlike in the examples above, however, I want you to print out `x.shape` in between each layer that you pass it through in the `.forward` method. This will allow us to "see" how the tensor changes shape as it goes through each layer.

In [None]:
# Define your NN class
class MyNet(torch.nn.Module):
  def __init__(self):
    super().__init__()
    self.fc1 = torch.nn.Linear(7,32) 
    self.sig1 = torch.nn.Sigmoid()  
    self.fc2 = torch.nn.Linear(32,64)
    self.sig2 = torch.nn.Sigmoid()
    self.fc3 = torch.nn.Linear(64,5)
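    self.softmax = torch.nn.Softmax(dim=1) # dim=1 is the class dimension; dim=0 is the batch dimension, so softmax over it would mix different data points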
  
  def forward(self,x):
    print(x.shape)
    x = self.fc1(x)
    print(x.shape)
    x = self.sig1(x)
    print(x.shape)
    x = self.fc2(x)
    print(x.shape)
    x = self.sig2(x)
    print(x.shape)
    x = self.fc3(x)
    print(x.shape)
    x = self.softmax(x)
    print(x.shape)
    return x

In [None]:
# Make an instance of an object of your NN Class
model = MyNet()

In [None]:
# Run a random tensor of shape (3,7) through your model
model(torch.rand(3,7)) # calling the model object runs .forward() for us (plus any registered hooks)

In [None]:
# Run a random tensor of shape (10,7) through your model
model(torch.rand(10,7)) # calling the model object runs .forward() for us (plus any registered hooks)

Q: Why did the model allow us to pass different tensors of shape `(n,7)` into it (for `n`=3, 10)? Could we have passed a tensor of shape `(n,7)` for arbitrary `n`? 

A: PyTorch layers treat the first dimension of the tensors that you pass into a model as the "batch" dimension, as the designers understood that we would like to take advantage of vectorized computation to run large batches of data through our model at once. A linear layer only requires the last dimension to match its number of input features (here 7). So, yes, it would have accepted a tensor of shape `(n,7)` for arbitrary `n` as input.

Looking at how the shape of `x` changes as it passes through the layers, do you understand what is going on?
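
To make the answer concrete, here is a small sketch using a standalone linear layer and softmax (not the `MyNet` class itself). It shows the batch dimension passing straight through, and softmax over `dim=1` giving every data point its own probability distribution.

```python
import torch

layer = torch.nn.Linear(7, 32)  # acts on the last dimension only

for n in (1, 3, 10, 1000):
    out = layer(torch.rand(n, 7))
    print(n, out.shape)  # the batch dimension n passes straight through

# Softmax over dim=1 normalizes across the classes of each row
scores = torch.rand(3, 5)
probs = torch.nn.Softmax(dim=1)(scores)
print(probs.sum(dim=1))  # each of the 3 rows sums to 1
```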

<hr/>

# Reading Exercise Solution: Calculate the Output Dimension of a pooling layer

$O_d = \frac{I_d + 2*P_d - R_d}{S_d} + 1$

where:
- $I_d$ is the size of the input layer in the $d^{th}$ dimension
- $P_d$ is the padding size in the $d^{th}$ dimension
- $R_d$ is the receptive field size in the $d^{th}$ dimension
- $S_d$ is the stride in the $d^{th}$ dimension

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise: Calculate the Output Dimension of a pooling layer</font>

Write a python function that takes arguments `input_size_d`, `padding_size_d`, `receptive_field_size_d`, and `stride_size_d` and returns `output_size_d` using the formula above.

Suppose you have a tensor of shape `(9,9)` and you run it through a pooling layer with a padding of (1,1), a receptive field of (2,3) and a stride of (1,1). Use your function to calculate the output size in each dimension (i.e., call your function twice, once for each dimension).



In [None]:
# Try it out
def get_pooling_output_d(input_size_d, padding_size_d, receptive_field_size_d, stride_size_d):
  output_size_d = (input_size_d + 2*padding_size_d - receptive_field_size_d)/stride_size_d + 1
  return output_size_d
  

In [None]:
# Try it out
get_pooling_output_d(9,1,2,1)

In [None]:
get_pooling_output_d(9,1,3,1)
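
As a sanity check, we can compare the formula with what PyTorch actually produces. This is a sketch: `expected_size` is a hypothetical helper that mirrors the function above but uses integer (floor) division, which is how PyTorch rounds by default when the stride does not divide evenly.

```python
import torch

def expected_size(i, p, r, s):
    # Same formula as above, with integer (floor) division
    return (i + 2 * p - r) // s + 1

pool = torch.nn.MaxPool2d(kernel_size=(2, 3), stride=(1, 1), padding=(1, 1))
x = torch.rand(1, 1, 9, 9)  # (batch, channel, height, width)
y = pool(x)

print(y.shape)  # torch.Size([1, 1, 10, 9])
print(expected_size(9, 1, 2, 1), expected_size(9, 1, 3, 1))  # 10 9
```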

<hr/>

# Reading Exercise Solution: Make a Convolution Layer and apply it to a random tensor

<hr/>
<img src="https://drive.google.com/uc?id=1sk8CSP26YY7sfyzmHGFXncuNRujkvu9v" align="left">

<font size=3 color="darkred">Exercise: Use a PyTorch Convolution layer to explore different filters </font>

PyTorch convolution layers have filter weights that are learned, just as any weights in NNs are trained. However, we can set the weights of these layers manually to see how 2D convolutions with different filters affect image inputs. Let's do that and explore some different filters applied to (just one color channel of) CIFAR images.

<br />

Do the following:
1. Run the code below for a slightly modified version of the imshow function that will allow us to plot multiple images side by side.

2. Run the code to grab an example image and label from the CIFAR data.

3. Let's create the filters (kernels) that are shown in the animated image above labelled "Convolution for Edge Detection". Each of these is a 3x3 filter. We can define it by calling `torch.tensor()` on a multidimensional array or list. Note that we will want to create a tensor of shape `(1,1,3,3)`, because the weight of a `Conv2d` layer has shape `(out_channels, in_channels, kernel_height, kernel_width)`: the 1st dimension is the number of output channels (1), the 2nd dimension is the number of input channels (1), and the last two dimensions are the height and width of the kernel. So, you can define these filters by calling `torch.tensor()` on a list of lists and then calling `.view(1,1,3,3)` to change it to the shape we want. Call these filter tensors `filter1`, `filter2`, etc. Start by setting the variable `filter=filter1` (later we will change this to look at all the filters we made).

4. Create a 2d convolution layer called `c` with 1 in channel, 1 out channel, padding and stride of 1, and a kernel size of 3x3.

5. We can set the weights of a conv2d layer manually, using `c.weight = ...` but we have to do this within a `with torch.no_grad():` block. Set the weight of your convolution layer to `filter1`.

6. Grab the first color channel of the CIFAR image `exImage[0]`. We want to make this a tensor of shape `(1,1,32,32)`. Use `.view()` to accomplish this and call the resulting tensor `input`

7. Run `input` through the convolutional layer `c` to get the `output`. Look at the shape of `output`.

8. Now we will plot the input image, the filter, and the output image side by side using our `imshow` function. `imshow` will take a list of tensors to display, but wants each one to be of shape `(1,X,Y)`. So, we can plot a `(1,32,32)` shaped tensor (like our input and output)  or even a `(1,3,3)` shaped tensor like `filter`. Use `.view()` to change the input, output and filter tensors to these shapes, make a list of them and pass it to `imshow`

Finally, re-run the above but set `filter=filter2`, `filter=filter3`, etc.



In [None]:
# 1. Run the code below to define a modified imshow function that plots images side by side
#  You can now pass this function a list of images represented as tensors, and it will plot each side by side
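def imshow(img,transform=True, titles = None):
  if not(isinstance(img,list)): # allow a single tensor as well as a list of tensors
    img = [img]
  fig, ax = plt.subplots(1,len(img),figsize=(15,15))
  for i,im in enumerate(img):
    if im.shape[0]==3: # (color,height,width) -> (height,width,color), which is what plt.imshow() wants
      im = im.permute(1,2,0)
    elif im.shape[0]==1: # (1,height,width) -> (height,width) for a single channel
      im = im[0]
    if transform:
      im = im/2 + 0.5 # undo the CIFAR normalization, just to show the image
    if isinstance(ax,np.ndarray): # several images: ax is an array of axes
      ax[i].imshow(im.cpu().numpy())
      if titles is not None:
        ax[i].set_title(titles[i])
    else: # a single image: ax is a single axes object
      ax.imshow(im.cpu().numpy())
      fig.set_figheight(5)
      fig.set_figwidth(5)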


In [None]:
# Load the CIFAR dataset
transform_cifar = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5))]) 

trainset_cifar = torchvision.datasets.CIFAR10(root='./cifar10', train=True,
                                        download=True, transform=transform_cifar)


In [None]:
# 2. Grab an example image and its label from the CIFAR dataset
exImage, exLabel = trainset_cifar[ np.random.randint(0,trainset_cifar.data.shape[0]) ]
imshow(exImage)
print(trainset_cifar.classes[exLabel])

In [None]:
# 3. through 8. 
filter1 = torch.tensor([[[[1.,2.,1.],[0.,0.,0.],[-1.,-2.,-1.]]]])
filter2 = torch.tensor([[[[0.,-1.,0.],[-1.,5.,-1.],[0.,-1.,0.]]]])
filter3 = torch.tensor([[[[0.,1.,2.],[-1.,0.,1.],[-2.,-1.,0.]]]])
filter4 = torch.tensor([[[[1.,0.,-1.],[0.,0.,0.],[-1.,0.,1.]]]])
filter = filter3

print(trainset_cifar.classes[exLabel])
c = torch.nn.Conv2d(in_channels=1, out_channels=1, stride=1, padding=1, kernel_size=(3,3))
with torch.no_grad():
  c.weight = torch.nn.Parameter(filter)
input = exImage[0]
output = c(input.view(1,1,32,32))
imshow([input,filter.view(1,3,3),output.view(1,32,32).detach()], titles=['input','filter','output'])
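
To see what such a filter does without needing the CIFAR download, here is a small sketch on a synthetic image with a single horizontal edge. The 8x8 image and the `bias=False` setting are my own choices for illustration; the cell above keeps the layer's randomly initialized bias, which adds a constant offset to the output.

```python
import torch

# A synthetic 8x8 "image": top half dark (0), bottom half bright (1)
img = torch.zeros(1, 1, 8, 8)
img[:, :, 4:, :] = 1.0

# Same values as filter1 above, shape (out_channels, in_channels, 3, 3)
sobel = torch.tensor([[[[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]]]])

conv = torch.nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=1, bias=False)
with torch.no_grad():
    conv.weight.copy_(sobel)  # overwrite the randomly initialized weights in place

out = conv(img).detach()
print(out.shape)        # padding=1 keeps the 8x8 size
print(out[0, 0, :, 3])  # one column: nonzero at the edge (rows 3 and 4) and at the zero-padded bottom border
```

The flat regions give 0 because the filter's weights sum to zero. Only places where the brightness changes vertically produce a response.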

<hr/>

# In-Class Exercise: Train a CNN for labeling CIFAR-10 images

While simple fully connected networks perform well on the MNIST data, they perform poorly on the CIFAR-10 images, which are more complex and have three color channels. By using a more advanced architecture built on convolution layers, we will build a better model than the fully connected one.

Before writing the code for a more advanced model, let's think about its structure. First, we need to capture edges or other features in images. Convolution layers are what we need for this purpose. The number of output channels should be larger than the number of input channels so that the model can learn many features, as in the following line.

```python
torch.nn.Conv2d(in_channels=3, out_channels=sizeOutChannels, kernel_size=3, padding=1)
```

Next, by adding a batch normalization layer, we can speed up training and often improve performance. Note that `num_features` in `BatchNorm2d` should match the `out_channels` of the previous convolution layer.

```python
torch.nn.BatchNorm2d(num_features = sizeOutChannels)
```

Any activation layer can be added after the batch normalization layer. In this lecture, we will use the ReLU function. 

```python
torch.nn.ReLU()
```

Lastly, add a pooling layer to aggregate values. 

```python
torch.nn.MaxPool2d(kernel_size=2, stride=2)
```

Our model is built on the combinations of convolution layers, batch normalization layers, activation layers, and pooling layers. 

Below, create a new model class called CnnCIFAR (it needs to inherit from `torch.nn.Module`).

You can model it after the NN that we built above. However, to make your `forward` method simpler and to better organize your layers, you should  combine layers into logical blocks using `torch.nn.Sequential()`.

In your constructor:
- Don't forget to call the super's constructor first
- add arguments `sizeOutChannels`(the out_channels of the Conv2D layer), `sizeHiddenLayer` (the out features of the fully connected linear layer) to your constructor.
- define a convolution layer block that consists of sequential layers of:
 - a Conv2d
 - a BatchNorm2d
 - a ReLU
 - a MaxPool2d (`kernel_size=2`, `stride=2`)
- define a fully connected layer block that consists of sequential layers of:
 - a linear layer with the appropriate input feature size to match the output of the convolution layer (it will be some number *`sizeOutChannels` -- you'll have to figure out what that number is based on the `kernel_size` and `stride` of the MaxPool2d layer) and the output size given by our argument `sizeHiddenLayer`
 - a ReLU
 - a Dropout (`p=0.2`)
 - another linear layer with output size of 10 (for each of the 10 classes a CIFAR image can belong to).

Define your `forward` method to pass the input x through the logical layer blocks that are defined in your constructor.
- IMPORTANT: you'll want to flatten the output of the convolution layer block before passing it into the fully connected layer block. You can do this with `x.view(x.size(0),-1)`
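
If you are unsure what input size the first linear layer needs, one common trick is to push a dummy tensor through the convolution block and look at the flattened shape. The sketch below uses a hypothetical block (1 input channel, 8 output channels, 28x28 inputs) rather than the exercise's architecture, so you can still work out the CIFAR numbers yourself.

```python
import torch

# Hypothetical block for illustration only (not the exercise's architecture)
block = torch.nn.Sequential(
    torch.nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.MaxPool2d(kernel_size=2, stride=2),
)

with torch.no_grad():
    dummy = torch.zeros(1, 1, 28, 28)  # a single fake image
    out = block(dummy)

print(out.shape)  # torch.Size([1, 8, 14, 14])
n_features = out.view(out.size(0), -1).shape[1]
print(n_features)  # 8 * 14 * 14 = 1568
```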


In [None]:
# Write your CnnCIFAR class here

Run the code below (to ensure that CIFAR10 is loaded and transformed properly, in case a session disconnect happened earlier).

In [None]:
# Load the CIFAR10 data
transform_cifar = transforms.Compose( [ transforms.ToTensor(),transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) ] )
trainset_cifar = torchvision.datasets.CIFAR10(root='./cifar10', train=True, download=True, transform=transform_cifar)
testset_cifar = torchvision.datasets.CIFAR10(root='./cifar10', train=False, download=True, transform=transform_cifar)

batch_size = 64
train_dl_cifar = DataLoader(trainset_cifar, batch_size=batch_size, shuffle=True)
test_dl_cifar = DataLoader(testset_cifar, batch_size=batch_size, shuffle=True)

Run the code below to define the model with the given arguments for `sizeOutChannels` and `sizeHiddenLayer`, and to define a loss function (here we'll use the cross entropy loss) and an optimizer (here we'll use Stochastic Gradient Descent).

In [None]:
cnnCIFAR = CnnCIFAR(sizeOutChannels = 16, sizeHiddenLayer = 50)
cnnCIFAR = cnnCIFAR.cuda() # define the model for cuda

cnn_CIFAR_loss_fn = torch.nn.CrossEntropyLoss() # use cross entropy loss
cnn_CIFAR_opt = torch.optim.SGD(cnnCIFAR.parameters(), lr=0.003, momentum=0.9) # where did I get these "magic numbers?"  Trial and error and voodoo.
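
The cell above calls `.cuda()`, which raises an error on a machine without a GPU. A device-agnostic pattern looks roughly like the sketch below; the tiny linear model and random batch are stand-ins for `cnnCIFAR` and a batch of images.

```python
import torch

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = torch.nn.Linear(4, 2).to(device)  # stand-in for cnnCIFAR
batch = torch.rand(8, 4).to(device)       # stand-in for a batch of inputs

out = model(batch)
print(out.device, out.shape)
```

The model and every tensor fed to it must live on the same device, which is why both are moved with `.to(device)`.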

In [None]:
cnnCIFAR.train()
# We will train the model for 15 epochs, the same as the previous fully connected network.
for epoch in range(15):
    running_loss = 0.0
    for inputs, labels in train_dl_cifar:
        # data to train
        inputs = inputs.cuda()
        labels = labels.cuda()

        # zero the gradients left over from the previous batch
        cnn_CIFAR_opt.zero_grad()

        # calculate loss and update parameters
        outputs = cnnCIFAR(inputs)
        loss = cnn_CIFAR_loss_fn(outputs, labels)
        loss.backward()
        cnn_CIFAR_opt.step()

        # Sum losses
        running_loss += loss.item()

    print(f"Epoch {epoch+1} loss = {running_loss/len(train_dl_cifar)}") # print out the average loss per batch for this epoch

Let's evaluate the trained model.


In [None]:
cnnCIFAR.eval() # put the model into evaluation mode -- may affect some types of layers (e.g., dropout)
with torch.no_grad():
  running_loss = 0
  total = 0
  correct = 0
  numClasses = len(test_dl_cifar.dataset.classes)
  cm = np.zeros((numClasses,numClasses),dtype=np.int32) # an empty matrix to hold the confusion matrix, we'll sum the confusion matrices for each batch
  for xb, yb in test_dl_cifar:
    xb = xb.cuda()
    yb = yb.cuda()
    pred = cnnCIFAR(xb)
    predLabels = torch.argmax(pred,dim=1)
    cm += confusion_matrix(yb.cpu().numpy(),predLabels.cpu().numpy(),labels=range(0,10)) # add this batch's confusion matrix to the total matrix -- we have to specify the list of class indexes (as the keyword argument labels), or sklearn will shorten our cm to only the classes seen

In [None]:
acc = np.diag(cm)/cm.sum(axis=1)
print(cm, '\n', acc)
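
To make the accuracy calculation concrete, here is a small sketch on a made-up 3-class confusion matrix (the numbers are invented for illustration). With sklearn's convention, rows are the true classes and columns are the predicted classes.

```python
import numpy as np

# A made-up confusion matrix for 3 classes (rows = true class, columns = predicted class)
cm_example = np.array([[8, 1, 1],
                       [2, 6, 2],
                       [0, 3, 7]])

per_class_acc = np.diag(cm_example) / cm_example.sum(axis=1)  # correct / all true members of the class
overall_acc = np.trace(cm_example) / cm_example.sum()         # correct / all predictions

print(per_class_acc)  # [0.8 0.6 0.7]
print(overall_acc)    # 0.7
```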

The convolutional neural network performs better than the previous fully connected network. Could we still do better? We only added one convolutional layer block, but these images belong to many different classes. You should now go back and experiment with the model. What if you run it for more epochs? What if you change the arguments for our model (e.g., try adjusting the parameters `sizeOutChannels` and `sizeHiddenLayer`)? You could also try adding another convolutional layer block. Remember that the final output needs to match the number of classes we're trying to predict. See if you can do better. Don't be afraid to google around for examples of CNNs applied to images. What do they do? What kind of performance can they achieve on CIFAR (this is a well-known, standard dataset)?

#### Solution: Don't look until you've tried it

In [None]:
import os
import pandas as pd
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import torch
from torch.utils.data import DataLoader
import torchvision
from torchvision import transforms


if not os.path.exists('./models'):
  os.mkdir('./models')

# Load the CIFAR10 data
transform_cifar = transforms.Compose( [ transforms.ToTensor(),transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) ] )
trainset_cifar = torchvision.datasets.CIFAR10(root='./cifar10', train=True, download=True, transform=transform_cifar)
testset_cifar = torchvision.datasets.CIFAR10(root='./cifar10', train=False, download=True, transform=transform_cifar)

batch_size = 64
train_dl_cifar = DataLoader(trainset_cifar, batch_size=batch_size, shuffle=True)
test_dl_cifar = DataLoader(testset_cifar, batch_size=batch_size, shuffle=True)

# Define your CnnCIFAR class
class CnnCIFAR(torch.nn.Module):
  def __init__(self, sizeOutChannels, sizeHiddenLayer):
    super(CnnCIFAR, self).__init__()
    self.conv_layer = torch.nn.Sequential(
        torch.nn.Conv2d(in_channels=3, out_channels=sizeOutChannels, kernel_size=3, padding=1),
        torch.nn.BatchNorm2d(num_features = sizeOutChannels),
        torch.nn.ReLU(),
        torch.nn.MaxPool2d(kernel_size=2, stride=2)
    )
    self.fc_layer = torch.nn.Sequential(
        torch.nn.Linear(sizeOutChannels*16*16, sizeHiddenLayer),
        torch.nn.ReLU(),
        torch.nn.Dropout(p=0.2),
        torch.nn.Linear(sizeHiddenLayer, 10) # return values to predict a class among 10 labels.
    )  

  def forward(self, x):
    # conv_layer
    x = self.conv_layer(x)
    # flatten
    x = x.view(x.size(0), -1)
    # fc_layer
    x = self.fc_layer(x)
    return x 


# Define the model and loss and optimizer
cnnCIFAR = CnnCIFAR(sizeOutChannels = 16, sizeHiddenLayer = 50)
cnnCIFAR = cnnCIFAR.cuda() # define the model for cuda

cnn_CIFAR_loss_fn = torch.nn.CrossEntropyLoss() # use cross entropy loss
cnn_CIFAR_opt = torch.optim.SGD(cnnCIFAR.parameters(), lr=0.003, momentum=0.9) # where did I get these "magic numbers?"  Trial and error and voodoo.

# Train the model
cnnCIFAR.train()
# We will train the model for 15 epochs, the same as the previous fully connected network.
for epoch in range(15):
    running_loss = 0.0
    for inputs, labels in train_dl_cifar:
        # data to train
        inputs = inputs.cuda()
        labels = labels.cuda()

        # zero the gradients left over from the previous batch
        cnn_CIFAR_opt.zero_grad()

        # calculate loss and update parameters
        outputs = cnnCIFAR(inputs)
        loss = cnn_CIFAR_loss_fn(outputs, labels)
        loss.backward()
        cnn_CIFAR_opt.step()

        # Sum losses
        running_loss += loss.item()

    print(f"Epoch {epoch+1} loss = {running_loss/len(train_dl_cifar)}") # print out the average loss per batch for this epoch


# Evaluate the trained model
cnnCIFAR.eval() # put the model into evaluation mode -- may affect some types of layers (e.g., dropout)
with torch.no_grad():
  running_loss = 0
  total = 0
  correct = 0
  numClasses = len(test_dl_cifar.dataset.classes)
  cm = np.zeros((numClasses,numClasses),dtype=np.int32) # an empty matrix to hold the confusion matrix, we'll sum the confusion matrices for each batch
  for xb, yb in test_dl_cifar:
    xb = xb.cuda()
    yb = yb.cuda()
    pred = cnnCIFAR(xb)
    predLabels = torch.argmax(pred,dim=1)
    cm += confusion_matrix(yb.cpu().numpy(),predLabels.cpu().numpy(),labels=range(0,10)) # add this batch's confusion matrix to the total matrix -- we have to specify the list of class indexes, or sklearn will shorten our cm to only the classes seen

acc = np.diag(cm)/cm.sum(axis=1)
print(cm, '\n', acc)



# Misc: Load MNIST and Train a simple NN model on it 

In [None]:
# 1. Import Stuff
import torch
import numpy as np
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
import torchvision
from torchvision import transforms
from torch.utils.data import DataLoader


# 2. Our imshow() function from earlier
def imshow(img):
  if img.shape[0]==3: # its probably (color,width,height) so make it (width,height,color) which is what plt.imshow() wants
    img = img.permute(1,2,0)
  elif img.shape[0]==1: # its probably a (1,width,height) so make it just (width,height) which is what plt.imshow() wants for a single channel
    img = img[0]
  img = img/2 + 0.5 # undo our normalization, just to show the image, because plt's imshow() expects numbers to be between (0,1)
  plt.imshow(img.cpu().numpy()) # plt's imshow() knows how to work with numpy arrays, not tensors, so we'll convert it first

# 3. A function to save our NN to a file
def nnSave(model,opt,path):
  torch.save({'model_class': model.__class__, # this is a pointer to the definition of the model's class
              'model_args': model.init_args, # init_args is the only property we have to add to a NN class ourselves for this function to work.
              'model_state_dict': model.state_dict(),
              'opt_class': opt.__class__,
              'opt_args': opt.defaults,
              'opt_state_dict':opt.state_dict()},
            path)

# 4. A function to load our NN from a file  
def nnLoad(path):
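  cp = torch.load(path, weights_only=False) # the checkpoint stores Python classes, not just tensors, so recent PyTorch versions need weights_only=False to unpickle it -- only load files you trust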
  model = cp['model_class'](**cp['model_args']) # equivalent to model = ModelClass(arg1,arg2,...)
  model.load_state_dict(cp['model_state_dict'])
  opt = cp['opt_class'](model.parameters(),**cp['opt_args']) # equivalent to opt = OptClass(arg1,arg2,...)
  opt.load_state_dict(cp['opt_state_dict'])
  return model, opt


# 5. Definition of a Neural Network called ImgNet -- (input layer, FC hidden layer 1, ReLU, FC hidden layer 2, ReLU, FC hidden layer 3, ReLU, logSoftMax)
class ImgNet(torch.nn.Module):
  def __init__(self,sizeInput,sizeHiddenLayer1,sizeHiddenLayer2,sizeOutput):
    self.init_args = {k:v for k,v in locals().items() if k!='self' and k!='__class__'} # this funny line captures the name and values of the args so we can save them w/ nnSave()
    super().__init__()
    self.fc1 = torch.nn.Linear(sizeInput,sizeHiddenLayer1)
    self.relu1 = torch.nn.ReLU()
    self.fc2 = torch.nn.Linear(sizeHiddenLayer1,sizeHiddenLayer2)
    self.relu2 = torch.nn.ReLU()
    self.fc3 = torch.nn.Linear(sizeHiddenLayer2,sizeOutput)
    self.logsoftmax = torch.nn.LogSoftmax(dim=1) # We are using dim=1 here because the 0th dimension will be the batch dimension
  
  def forward(self,x):
    x = self.fc1(x)
    x = self.relu1(x)
    x = self.fc2(x)
    x = self.relu2(x)
    x = self.fc3(x)
    x = self.logsoftmax(x)
    return x

# 6. Fit function for training a NN
def fit(num_epochs, model, train_dl, loss_fn, opt):
  model.train() # make sure the model is in training mode (instead of eval mode)
  for epoch in range(num_epochs):
    running_loss=0
    for xb,yb in train_dl: 
      xb = xb.view(xb.shape[0],-1) # This will keep the first dimension as the batch dimension and flatten all the others
      xb = xb.to("cuda",non_blocking = True) # this puts the tensor in the GPU's memory. non_blocking=True ensures that RAM->GPU RAM copy doesn't block other operations
      yb = yb.to("cuda", non_blocking = True)
      opt.zero_grad() # We'll start by zero'ing the gradient. We could have done this at the end of this loop, but this ensures we have no errant gradients lying around for the first iteration of the loop
      pred = model(xb) # run the input through the model and get the predictions 
      loss = loss_fn(pred, yb) # calculate the loss -- we'd have to check that the loss_fn gets the prediction and true values in the form it expects -- so its wise to check the docs of whatever loss_fn we use
      loss.backward() # propagate the loss backward
      opt.step() # tell the optimizer to do its thing
      running_loss+=loss.item() # add up the running loss (remember the output of loss will be a scalar, so loss.item() will just be a numerical value)
    print(f"Epoch {epoch} loss = {running_loss/len(train_dl)}") # print out the average loss per batch for this epoch

# 7. Test function for evaluating a NN
def test(model, test_dl, loss_fn):
  model.eval() # put the model into evaluation mode -- may affect some types of layers (e.g., dropout)
  with torch.no_grad():
    running_loss = 0
    total = 0
    correct = 0
    numClasses = len(test_dl.dataset.classes)
    cm = np.zeros((numClasses,numClasses),dtype=np.int32) # an empty matrix to hold the confusion matrix, we'll sum the confusion matrices for each batch
    #print(cm.shape)
    for xb, yb in test_dl:
      xb = xb.view(xb.shape[0],-1)
      xb = xb.to("cuda")
      yb = yb.to("cuda")
      pred = model(xb)
      predLabels = torch.argmax(pred,dim=1)
      cm += confusion_matrix(yb.cpu().numpy(),predLabels.cpu().numpy(),labels=range(0,10)) # add this batch's confusion matrix to the total matrix -- we have to specify the list of class indexes, or sklearn will shorten our cm to only the classes seen
      loss = loss_fn(pred,yb)
      running_loss+=loss.item()
    ave_loss = running_loss/len(test_dl)
    acc = np.diag(cm)/cm.sum(axis=1) # the per class accuracy is the diagonals (tp) divided by all cases of that class
    return cm, acc, ave_loss
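
The `nnSave`/`nnLoad` helpers above store the model class and constructor arguments along with the weights. A lighter alternative, sketched below, saves only the `state_dict` and rebuilds the model in code before loading. The tiny linear model and the in-memory buffer are stand-ins for a real model and a file under `./models`.

```python
import io
import torch

model_a = torch.nn.Linear(3, 2)
buffer = io.BytesIO()                      # in-memory stand-in for a file in ./models
torch.save(model_a.state_dict(), buffer)   # save only the weights (a dict of tensors)

buffer.seek(0)
model_b = torch.nn.Linear(3, 2)            # must be built with the same architecture
model_b.load_state_dict(torch.load(buffer))

x = torch.rand(5, 3)
print(torch.equal(model_a(x), model_b(x)))  # True: both models now hold the same weights
```

The trade-off is that you have to remember the architecture and its arguments yourself, which is exactly what `init_args` records in the helpers above.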


   

In [None]:
# Load the MNIST dataset
transform_mnist = transforms.Compose( [transforms.ToTensor(), transforms.Normalize(mean=(0.5,), std=(0.5,)) ] )
trainset_mnist = torchvision.datasets.MNIST('./mnist', download=True, train=True, transform=transform_mnist)
testset_mnist = torchvision.datasets.MNIST('./mnist', download=True, train=False, transform=transform_mnist)

batch_size = 64
train_dl_mnist = DataLoader(trainset_mnist, batch_size=batch_size, shuffle=True)
test_dl_mnist = DataLoader(testset_mnist, batch_size=batch_size, shuffle=True)
imgnet_mnist = ImgNet(28*28,128,64,10).cuda()


In [None]:
# Train the NN on the MNIST data
loss_fn_mnist = torch.nn.functional.nll_loss
opt_mnist = torch.optim.SGD(imgnet_mnist.parameters(), lr=0.003, momentum=0.9) # where did I get these "magic numbers?"  Trial and error and voodoo.
fit(15, imgnet_mnist, train_dl_mnist, loss_fn_mnist, opt_mnist)
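
Why `nll_loss` here, when the CNN exercise used `CrossEntropyLoss`? `ImgNet` already ends in a `LogSoftmax` layer, and cross entropy on raw scores is the same as negative log likelihood on log-softmax outputs. The sketch below checks this on random numbers (the scores and labels are made up).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 10)          # raw scores for 4 data points, 10 classes
labels = torch.tensor([3, 0, 9, 1])  # true class indexes

loss_a = F.nll_loss(F.log_softmax(logits, dim=1), labels)  # LogSoftmax output + nll_loss
loss_b = F.cross_entropy(logits, labels)                   # cross entropy on the raw scores

print(loss_a.item(), loss_b.item())  # the two values agree
```

So a model that outputs raw scores should be paired with cross entropy, and a model that ends in log-softmax should be paired with `nll_loss`. Note the argument order in both cases: predictions first, labels second.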

In [None]:
# Show prediction for a random MNIST image
images, labels = next(iter(test_dl_mnist))
img = images[0].to("cuda")
label = labels[0].to("cuda").item()

with torch.no_grad():
  predLabel = torch.argmax(imgnet_mnist(img.view(1,-1))).item()

imshow(img)
print(f"Predicted label was: {predLabel} ; Actual label was: {label}")


In [None]:
# Evaluate performance of our NN on MNIST holdout data
cm_mnist, acc_mnist, ave_loss_mnist = test(imgnet_mnist, test_dl_mnist, loss_fn_mnist)
print(ave_loss_mnist)
print(cm_mnist)
print(acc_mnist)

# Misc: Load CIFAR and Train a simple NN model on it

In [None]:
# Load the CIFAR10 data
transform_cifar = transforms.Compose( [ transforms.ToTensor(),transforms.Normalize(mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5)) ] )
trainset_cifar = torchvision.datasets.CIFAR10(root='./cifar10', train=True, download=True, transform=transform_cifar)
testset_cifar = torchvision.datasets.CIFAR10(root='./cifar10', train=False, download=True, transform=transform_cifar)

batch_size = 64
train_dl_cifar = DataLoader(trainset_cifar, batch_size=batch_size, shuffle=True)
test_dl_cifar = DataLoader(testset_cifar, batch_size=batch_size, shuffle=True)
imgnet_cifar = ImgNet(3*32*32,128,64,10).cuda()

# Train the NN
loss_fn_cifar = torch.nn.functional.nll_loss
opt_cifar = torch.optim.SGD(imgnet_cifar.parameters(), lr=0.003, momentum=0.9) # where did I get these "magic numbers?"  Trial and error and voodoo.
fit(15, imgnet_cifar, train_dl_cifar, loss_fn_cifar, opt_cifar)


In [None]:
# Print an example
images, labels = next(iter(test_dl_cifar))
img = images[0].to("cuda")
label = labels[0].to("cuda").item() # convert to a plain Python int so it can index the list of class names

with torch.no_grad():
  predLabel = torch.argmax(imgnet_cifar(img.view(1,-1))).item()

imshow(img)
print(f"Predicted label was: {testset_cifar.classes[predLabel]} ; Actual label was: {testset_cifar.classes[label]}")


# Misc: Increasing Loss / Exploding Gradients

Let's look at a very simple example of a single neuron (linear layer) with no activation function -- i.e., this is just linear regression (using the wine quality data). 

Here's an example of where the loss function "turns around" during training.  This is an example of an "exploding gradient". I captured here the range of epochs where the loss "turns around" by carefully selecting the learning rate.

In [None]:
wine_df.info() # wine_df is loaded in the next cell, so run that cell first

In [None]:
import torch
import matplotlib.pyplot as plt
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader, random_split
from sklearn.preprocessing import StandardScaler # needed for the scaler used below


wine_df = pd.read_csv('https://raw.githubusercontent.com/dylanwalker/MGSC496/main/datasets/winequality-red.csv')

scaler = StandardScaler()
scaler.fit(wine_df.loc[:,'fixed acidity':'alcohol'])
wine_df.loc[:,'fixed acidity':'alcohol'] = scaler.transform(wine_df.loc[:,'fixed acidity':'alcohol'])
wine_df.quality = wine_df.quality/10


inputs = torch.tensor(wine_df.loc[:,'fixed acidity':'alcohol'].values, dtype=torch.float32)
targets = torch.tensor(wine_df.loc[:,'quality'].values, dtype=torch.float32) 

full_ds = TensorDataset(inputs, targets)
train_ds, test_ds = random_split(full_ds, [0.8, 0.2])

batch_size = len(train_ds)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=True)


def fit_gd(num_epochs, model, train_dl, test_dl, loss_fn, opt):
    training_losses = []
    validation_losses = []
    model.train() 
    for epoch in range(num_epochs):
      loss_for_epoch = 0
      vloss_for_epoch = 0
      for xb,yb in train_dl: 
        pred = model(xb) 
        loss = loss_fn(pred, yb.view(-1,1)) 
        loss_for_epoch += loss.item() 
        loss.backward()
      opt.step() 
      opt.zero_grad()
      training_losses.append(loss_for_epoch)
      # Calculate validation loss for each epoch
      model.eval()
      for xvb, yvb in test_dl:
        with torch.no_grad():
          predv = model(xvb)
          vloss = loss_fn(predv, yvb.view(-1,1))
          vloss_for_epoch += vloss.item()
      validation_losses.append(vloss_for_epoch)
      model.train()
    return training_losses, validation_losses 

num_epochs_exploding = 100 
lr_exploding = 3.229e-1
lr_reasonable = 1e-5
lr = lr_exploding
num_epochs = num_epochs_exploding

model = torch.nn.Linear(11,1) 
opt = torch.optim.SGD(model.parameters(), lr=lr)
loss_fn = torch.nn.functional.mse_loss



training_losses, validation_losses = fit_gd(num_epochs, model, train_dl, test_dl, loss_fn, opt)

In [None]:
import plotly.express as px
losses_df = pd.DataFrame({'epoch':range(len(training_losses)),'training_loss':training_losses,'validation_loss':validation_losses})
fig = px.line(losses_df,x='epoch',y=['training_loss','validation_loss'])
fig.update_layout(yaxis_title='loss')

You can see that the loss function is decreasing as we would expect, at first. But at some point it turns around. If we were to keep training, the gradient would eventually explode and go to "infinity" (past the point where we can keep track of the values, so it would become "inf" or "NaN").

The primary culprit in this example is that our learning rate is too high. In other words, the steps we are taking to adjust the weights of our NN are just too big. This is one reason why you always need to look at the training loss.
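
The same effect shows up in the simplest possible setting. The sketch below runs plain gradient descent on the one-dimensional function $f(w) = \frac{1}{2} a w^2$; the function, the value of `a`, and both learning rates are illustrative assumptions, not taken from the wine example. Each step multiplies `w` by `(1 - lr*a)`, so the loss shrinks when that factor has magnitude below 1 and grows when it is above 1.

```python
def gradient_descent(lr, steps=20, w=1.0, a=4.0):
    # Minimize f(w) = 0.5 * a * w**2, whose gradient is a * w
    losses = []
    for _ in range(steps):
        w = w - lr * a * w  # one gradient descent step
        losses.append(0.5 * a * w ** 2)
    return losses

small = gradient_descent(lr=0.1)  # |1 - lr*a| = 0.6 < 1: the loss shrinks
large = gradient_descent(lr=0.6)  # |1 - lr*a| = 1.4 > 1: the loss grows every step

print(small[0], small[-1])
print(large[0], large[-1])
```

With several weights, some directions can still be shrinking while one direction grows, which is one plausible reading of why the loss above falls first and only later turns around.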
