# L-9-3: Inverse Classroom: It’s not working! Help!

These exercises will give you some experience debugging problems that typically come up when doing machine learning in practice.

**Outline**

0. General Set-up
1. Debugging A Bad Training Set-up
2. Image Segmentation with DICE Loss
3. Fixing the Data-Processing Pipeline
4. Test Performance is Too Good!

## 0. General Set-up

Here we provide general code set-up: package requirements, train-loaders, etc.

In [0]:
## Some general imports we may need:
import numpy as np
import torch
import torch.nn as nn
import torch.utils.data
import matplotlib.pyplot as plt
import time
import struct

Make sure the GPU is enabled. In Colab, use the menu bar at the top: click `Runtime` -> `Change runtime type` -> `Hardware Accelerator` -> `GPU`.

In [0]:
gpu_boole = torch.cuda.is_available()

## 3. Fixing the Data-Processing Pipeline

* As in L-9-1, you are given a training loop for MNIST.
* There is a runtime error! How do you fix it?
* As it turns out, there are other errors too, all related to data processing.
* **Deliverables:**
    * Fix all runtime error(s).
    * Continue debugging until you achieve 90% or greater test accuracy in 10 or fewer epochs of training.
    * Describe the fixes you made.

**Defining the model and optimizer:**
We define the model and optimizer here.

In [0]:
## Defining the model:
class Net(nn.Module):
  def __init__(self, input_size, width, num_classes):
    super(Net, self).__init__()

    ##feedforward layers:
    self.ff1 = nn.Linear(input_size, width) #input

    self.ff2 = nn.Linear(width, width) #hidden layers
    self.ff3 = nn.Linear(width, width)

    self.ff_out = nn.Linear(width, num_classes) #logit layer     

    ##activations:
    self.relu = nn.ReLU()
                
  def forward(self, input_data):
    out = self.relu(self.ff1(input_data)) 
    out = self.relu(self.ff2(out)) 
    out = self.relu(self.ff3(out))
    out = self.ff_out(out)
    return out #returns class logits (unnormalized scores) for each image

net = Net(input_size = 784, width = 500, num_classes = 10)
if gpu_boole:
  net = net.cuda()

optimizer = torch.optim.SGD(net.parameters(), lr = 0.01)
loss_metric = nn.CrossEntropyLoss()


**Data-loading:** In L-9-1, we made use of the `torchvision` package for data-loading and preprocessing. We abstracted away some preprocessing steps. Here, we are giving a more granular implementation that you may have to adjust in various ways. This is more akin to what you will see in practice with other datasets.
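To make the file format behind this granular loader concrete, here is a minimal sketch of how an IDX header can be parsed. It uses a tiny byte string built in memory rather than the real MNIST files, and the names `parse_idx`, `header`, `payload`, and `toy` are hypothetical, introduced only for illustration. It assumes the unsigned-byte IDX layout: two zero bytes, one data-type byte (`0x08`), one byte giving the number of dimensions, then one big-endian 4-byte integer per dimension.

```python
import io
import struct

import numpy as np


def parse_idx(buffer):
    """Parse an unsigned-byte IDX stream into a NumPy array (illustrative helper)."""
    # Magic number: two zero bytes, a data-type code, and the number of dimensions.
    zero, data_type, dims = struct.unpack('>HBB', buffer.read(4))
    if zero != 0 or data_type != 0x08:
        raise ValueError('expected an unsigned-byte IDX stream')
    # One big-endian unsigned 32-bit integer per dimension.
    shape = tuple(struct.unpack('>I', buffer.read(4))[0] for _ in range(dims))
    # The remaining bytes are the raw values in row-major order.
    return np.frombuffer(buffer.read(), dtype=np.uint8).reshape(shape)


# Toy stream: two "images" of size 2x3 holding the values 0..11.
payload = bytes(range(12))
header = struct.pack('>HBB', 0, 0x08, 3) + struct.pack('>III', 2, 2, 3)
toy = parse_idx(io.BytesIO(header + payload))
print(toy.shape, toy.dtype)  # (2, 2, 3) uint8
```

Inspecting the shape and dtype right after parsing is a cheap way to confirm that the header was read as expected before any further preprocessing.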


In [0]:
#Downloading and unzipping MNIST data files:
!curl -O http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
!curl -O http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
!curl -O http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
!curl -O http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
!gunzip t*-ubyte.gz -f

In [0]:
##Loading files into numpy arrays:
def read_idx(filename, boole=0):
    with open(filename, 'rb') as f:
        zero, data_type, dims = struct.unpack('>HBB', f.read(4))
        shape = tuple(struct.unpack('>I', f.read(4))[0] for d in range(dims))
        if boole:
          return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape).astype(np.float32)*10     
        else:
          return np.frombuffer(f.read(), dtype=np.uint8).reshape(shape).copy() #copy, since frombuffer arrays are read-only

xtrain = read_idx('train-images-idx3-ubyte', 1)
xtest = read_idx('t10k-images-idx3-ubyte', 1)
ytrain = read_idx('train-labels-idx1-ubyte')
ytest = read_idx('t10k-labels-idx1-ubyte')

np.random.shuffle(xtrain); np.random.shuffle(ytrain);

xtrain = torch.Tensor(xtrain)
ytrain = torch.Tensor(ytrain)
ytrain = ytrain.reshape([-1,1])
xtest = torch.Tensor(xtest)
ytest = torch.Tensor(ytest)
ytest = ytest.reshape([-1,1])

## data_loaders:
train = torch.utils.data.TensorDataset(xtrain, ytrain)
test = torch.utils.data.TensorDataset(xtest, ytest)

train_loader = torch.utils.data.DataLoader(train, batch_size=128)
test_loader = torch.utils.data.DataLoader(test, batch_size=128, shuffle=False)

**Defining training and test loss and accuracy functions:** These functions will be useful in our training loop to view our training and test loss/accuracy at each epoch.

In [0]:
def train_eval(verbose = 1):
    correct = 0
    total = 0
    loss_sum = 0
    for images, labels in train_loader:
        if gpu_boole:
            images, labels = images.cuda(), labels.cuda()
        images = images.view(-1, 28*28)
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted.float() == labels.float()).sum()

        loss_sum += loss_metric(outputs,labels)
        
    if verbose:
        print('Train accuracy: %f %%' % (100 * correct / total))
        print('Train loss: %f' % (loss_sum.cpu().data.numpy().item() / total))

    return 100.0 * correct / total, loss_sum.cpu().data.numpy().item() / total
    
def test_eval(verbose = 1):
    correct = 0
    total = 0
    loss_sum = 0
    for images, labels in test_loader:
        if gpu_boole:
            images, labels = images.cuda(), labels.cuda()
        images = images.view(-1, 28*28)
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted.float() == labels.float()).sum()

        loss_sum += loss_metric(outputs,labels)

    if verbose:
        print('Test accuracy: %f %%' % (100 * correct / total))
        print('Test loss: %f' % (loss_sum.cpu().data.numpy().item() / total))

    return 100.0 * correct / total, loss_sum.cpu().data.numpy().item() / total
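
The accuracy bookkeeping used in the evaluation functions above can be sketched on toy tensors. This is only an illustration under stated assumptions: `batch_accuracy`, `toy_logits`, and `toy_labels` are hypothetical names, and the helper assumes logits of shape `(N, C)` and integer labels of shape `(N,)`. The last lines show what happens to the comparison when the labels instead have shape `(N, 1)`.

```python
import torch


def batch_accuracy(logits, labels):
    """Return (number correct, batch size); assumes logits (N, C) and labels (N,)."""
    predicted = logits.argmax(dim=1)  # index of the largest logit in each row
    correct = (predicted == labels).sum().item()
    return correct, labels.size(0)


toy_logits = torch.tensor([[2.0, 0.1, 0.3],
                           [0.2, 1.5, 0.1],
                           [0.9, 0.8, 0.7],
                           [0.0, 0.1, 3.0]])
toy_labels = torch.tensor([0, 1, 2, 2])

correct, total = batch_accuracy(toy_logits, toy_labels)
print(100.0 * correct / total)  # 75.0

# Comparing shape (N,) against shape (N, 1) broadcasts to an (N, N) matrix,
# so a later .sum() would count pairwise matches rather than per-sample ones.
pairwise = toy_logits.argmax(dim=1) == toy_labels.reshape(-1, 1)
print(pairwise.shape)  # torch.Size([4, 4])
```

Printing the shape of the comparison result is a quick sanity check whenever an accuracy figure looks implausible.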

**Training loop:** Here we give the training loop. The number of epochs is set, and the batch-wise training loss is recorded and plotted at the end.

**IMPORTANT NOTE:** If you encounter a NaN loss when re-running this code cell, you will need to re-instantiate your model and optimizer by re-running the "Defining the model and optimizer:" code cell above.
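
One way to notice a NaN loss early is to test each batch loss for finiteness. The sketch below is not part of the original exercise code, and `loss_is_finite` is a hypothetical helper name. It relies only on `torch.isfinite`.

```python
import torch


def loss_is_finite(loss):
    """Return True if a scalar loss tensor is neither NaN nor infinite."""
    return bool(torch.isfinite(loss).all())


print(loss_is_finite(torch.tensor(0.37)))          # True
print(loss_is_finite(torch.tensor(float('nan'))))  # False
print(loss_is_finite(torch.tensor(float('inf'))))  # False
```

Inside a training loop, a check such as `if not loss_is_finite(loss): break` would stop training at the first bad batch, which makes it easier to see when the loss diverged.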

In [0]:
#re-initializing network weights:
def weights_init(m):
    if isinstance(m, nn.Conv2d) or isinstance(m, nn.Linear):
        torch.nn.init.xavier_uniform_(m.weight.data) #in-place initializer; the name without the trailing underscore is deprecated

weights_init(net)

#number of epochs to train for:
epochs = 10

#defining batch train loss recording arrays for later visualization/plotting:
loss_batch_store = []

print("Starting Training")
#training loop:
for epoch in range(epochs):
  time1 = time.time() #timekeeping

  for i, (x,y) in enumerate(train_loader):

    if gpu_boole:
      x = x.cuda()
      y = y.cuda()

    #loss calculation and gradient update:

    if i > 0 or epoch > 0:
      optimizer.zero_grad()
    outputs = net.forward(x)
    loss = loss_metric(outputs,y)
    loss.backward()

    if i > 0 or epoch > 0:
      loss_batch_store.append(loss.cpu().data.numpy().item())
                  
    ##performing update:
    optimizer.step()

  print("Epoch",epoch+1,':')
  train_perc, train_loss = train_eval()
  test_perc, test_loss = test_eval()

  time2 = time.time() #timekeeping
  print('Elapsed time for epoch:',time2 - time1,'s')
  print('ETA of completion:',(time2 - time1)*(epochs - epoch - 1)/60,'minutes')
  print()

## Plotting batch-wise train loss curve:
plt.plot(loss_batch_store, '-o', label = 'train_loss', color = 'blue')
plt.xlabel('Minibatch Number')
plt.ylabel('Sample-wise Loss At Last minibatch')
plt.legend()
plt.show()


**Debugging outline and hints:** For an easier time, follow these hints:

1. First, you should get a reshaping error. This can be fixed in the training loop block.
2. Second, you should get a casting error. Try modifying *y* so that it has the correct data type. It is easiest to do this in the Data-loading block.
3. Third, you should get an error in your loss function. Go to the Data-loading block: are the shapes of your *y* tensors correct?
4. Fourth, in the Data-loading block, check the min and max values of your *x* tensors. Do they look correct?
5. Fifth, in the Data-loading block, check that your data-label mapping is correct.
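
Several of the hints above come down to inspecting tensors. A minimal sketch of an inspection helper follows. The name `describe` and the toy tensors are hypothetical and stand in for the notebook's own *x* and *y* tensors.

```python
import torch


def describe(name, tensor):
    """Print and return the shape, dtype, min, and max of a tensor."""
    summary = {
        'shape': tuple(tensor.shape),
        'dtype': tensor.dtype,
        'min': tensor.min().item(),
        'max': tensor.max().item(),
    }
    print(name, summary)
    return summary


# Toy stand-ins for an image batch and its labels.
toy_x = torch.tensor([[0.0, 127.5], [255.0, 64.0]])
toy_y = torch.tensor([3, 7])

x_summary = describe('toy_x', toy_x)
y_summary = describe('toy_y', toy_y)
```

Calling a helper like this on the real tensors after each preprocessing step shows where a shape, dtype, or value range stops matching what the model and loss expect. For the fifth hint, one option is to display a few images next to their labels with `plt.imshow` and check by eye that they agree.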



**Describe your debugging process.**

1. What errors/issues did you identify and resolve?

[Your text here]

2. What was your process for debugging? Describe the order of steps you took, any debugging dead ends you ran into, etc. There isn't necessarily a correct answer here; we just want to see an overview of what you tried and what you corrected.

[Your text here]