# PyTorch & MNIST Intro

Let's walk through a simple example of PyTorch and MNIST as a way to level-set on neural networks (NNs) and the use of notebooks. This is our first introductory class, with the initial goal of level-setting and kicking off discussions on training neural networks.

Please complete all challenges below for 10 (+ 2 extra) points in total. 

In [None]:
import torch
import torchvision
import matplotlib.pyplot as pl
random_seed = 1
torch.backends.cudnn.enabled = False
torch.manual_seed(random_seed)

## Dataset

MNIST is probably the most traditionally used dataset for neural networks, as it poses a relatively challenging computer vision problem: recognizing single digits from images of handwriting. Classically, each sample in this dataset is a $28 \times 28$ matrix of pixel intensities.
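
The loaders defined in this notebook pass each image through `ToTensor()`, which scales pixel values to $[0, 1]$, and then through `Normalize((0.1307,), (0.3081,))`, which subtracts a mean and divides by a standard deviation. The following is a minimal sketch of that arithmetic in plain Python; the helper name `normalize_pixel` and the tiny 2x2 "image" are hypothetical and only for illustration.

```python
# Mean and standard deviation passed to Normalize in this notebook's loaders.
MNIST_MEAN = 0.1307
MNIST_STD = 0.3081


def normalize_pixel(value, mean=MNIST_MEAN, std=MNIST_STD):
    """Apply (value - mean) / std, the per-pixel operation Normalize performs."""
    return (value - mean) / std


# A made-up 2x2 "image" with intensities already scaled to [0, 1].
image = [[0.0, 0.5],
         [1.0, 0.1307]]
normalized = [[normalize_pixel(p) for p in row] for row in image]

# Flattening a 28x28 image row by row gives a vector of 28 * 28 values,
# which is the input dimension the feedforward model expects.
flat_length = 28 * 28

print(normalized)
print(flat_length)
```

A pixel equal to the mean maps to zero, so after normalization the background (dark) pixels become negative and the bright strokes become positive.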

In [None]:
batch_size_train = 64
batch_size_test = 1000

In [None]:
train_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('./files/', train=True, download=True,
                                transform=torchvision.transforms.Compose([
                                    torchvision.transforms.ToTensor(),
                                    torchvision.transforms.Normalize((0.1307,), (0.3081,))
                                ])),
    batch_size=batch_size_train,
    shuffle=True)

test_loader = torch.utils.data.DataLoader(
    torchvision.datasets.MNIST('./files/', train=False, download=True,
                                transform=torchvision.transforms.Compose([
                                    torchvision.transforms.ToTensor(),
                                    torchvision.transforms.Normalize(
                                        (0.1307,), (0.3081,))
                                ])),
    batch_size=batch_size_test,
    shuffle=True)

In [None]:
examples = enumerate(test_loader)
batch_idx, (example_data, example_targets) = next(examples)

In [None]:
fig = pl.figure()
for i in range(6):
    pl.subplot(2,3,i+1)
    pl.tight_layout()
    pl.imshow(example_data[i][0], cmap='gray', interpolation='none')
    pl.title("Ground Truth: {}".format(example_targets[i]))
    pl.xticks([])
    pl.yticks([])
pl.show()

## Building the model

We need to specify the model through a Python class. Below we show how to create a feedforward neural network model using PyTorch.

In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [None]:
class FeedforwardNeuralNetModel(nn.Module):
    def __init__(self):
        input_dim = 28*28
        num_classes = 10
        super(FeedforwardNeuralNetModel, self).__init__()
        self.fc1 = nn.Linear(input_dim, num_classes) 

    def forward(self, x):
        out = self.fc1(x)
        return F.log_softmax(out, dim=1)  # normalize over the class dimension

You'll need to instantiate this class as well as an optimizer, which will apply an algorithm to find the internal parameters of that model, such as matrix weights and biases. As an example, we will use the Stochastic Gradient Descent algorithm.
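
To make the role of the optimizer more concrete, here is a minimal sketch of one SGD-with-momentum update in plain Python. It assumes the update rule documented for `torch.optim.SGD` with the default dampening of 0 and without Nesterov momentum; the function `sgd_momentum_step` and the toy objective $f(w) = (w - 3)^2$ are hypothetical and are not part of this notebook's model.

```python
def sgd_momentum_step(param, grad, buf, lr, momentum):
    """One update: the buffer accumulates gradients, the parameter follows the buffer."""
    buf = momentum * buf + grad      # running velocity
    param = param - lr * buf         # step against the velocity
    return param, buf


def grad_fn(w):
    """Gradient of the toy objective f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


w, buf = 0.0, 0.0
for _ in range(200):
    w, buf = sgd_momentum_step(w, grad_fn(w), buf, lr=0.1, momentum=0.1)

print(w)  # approaches the minimum at w = 3
```

In the notebook, `optimizer.step()` applies this kind of update to every weight and bias of the network, using the gradients computed by `loss.backward()`.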

In [None]:
learning_rate = 0.001
momentum = 0.1
log_interval = 10

In [None]:
network = FeedforwardNeuralNetModel()
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)

## Training

Next, we will define the training procedure.
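
The training loop in this notebook combines `F.log_softmax` in the model with `F.nll_loss` as the objective. As a rough sketch of what these two steps compute for a single sample, written in plain Python (the function names mirror the PyTorch ones, but these are illustrative re-implementations, and the logits are made up):

```python
import math


def log_softmax(logits):
    """Log of the softmax probabilities, computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one sample."""
    return -log_probs[target]


logits = [2.0, 0.5, -1.0]            # made-up scores for three classes
log_probs = log_softmax(logits)

loss_if_target_0 = nll_loss(log_probs, 0)   # the model favors class 0
loss_if_target_2 = nll_loss(log_probs, 2)   # the model gives class 2 a low score

print(loss_if_target_0, loss_if_target_2)
```

The loss is small when the target class has a high predicted probability and large otherwise. By default, `F.nll_loss` averages this quantity over the samples in a batch.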

In [None]:
n_epochs = 5

train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

In [None]:
! mkdir -p results

In [None]:
def train(epoch):
    network.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer.zero_grad() # clears gradients
        output = network(data.reshape(-1, 28*28))
        
        loss = F.nll_loss(output, target)
        loss.backward()
        
        optimizer.step()
        
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), loss.item())
            )
            train_losses.append(loss.item())
            train_counter.append((batch_idx*64) + ((epoch-1)*len(train_loader.dataset)))
            torch.save(network.state_dict(), f'./results/model_iteration-{epoch}.pth')
            torch.save(optimizer.state_dict(), f'./results/optimizer_iteration-{epoch}.pth')

In [None]:
train(1)

<br />
Alongside training, we will also monitor the performance of the model on a set of samples not seen during training.
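
The evaluation function in this notebook counts a prediction as correct when the class with the highest output equals the label. A minimal sketch of that accuracy computation in plain Python, using made-up log-probabilities for three samples (the helper names `argmax` and `accuracy` are hypothetical):

```python
def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(batch_log_probs, targets):
    """Fraction of samples whose predicted class matches the target."""
    correct = sum(1 for lp, t in zip(batch_log_probs, targets) if argmax(lp) == t)
    return correct / len(targets)


# Made-up outputs for three samples and three classes.
batch_log_probs = [
    [-0.1, -2.5, -3.0],   # predicts class 0
    [-1.2, -0.4, -2.0],   # predicts class 1
    [-0.9, -1.1, -1.3],   # predicts class 0
]
targets = [0, 1, 2]

acc = accuracy(batch_log_probs, targets)
print(acc)
```

Here two of the three predictions match their targets, so the accuracy is two thirds.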

In [None]:
def test():
    network.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            output = network(data.reshape(-1, 28*28))
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum over the batch; averaged below
            pred = output.data.max(1, keepdim=True)[1]
            correct += pred.eq(target.data.view_as(pred)).sum()
    test_loss /= len(test_loader.dataset)
    test_losses.append(test_loss)
    print('\nTest set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n' \
          .format(test_loss, correct, len(test_loader.dataset), 100. * correct / len(test_loader.dataset))
    )

In [None]:
test()

## Training and evaluating for multiple epochs

Let's train now for all desired epochs.

In [None]:
for epoch in range(2, n_epochs + 1): # starts from the second iteration
  train(epoch)
  test()

## Model performance

Finally, we can inspect the results.

In [None]:
fig = pl.figure()
pl.plot(train_counter, train_losses, color=(0.2, 0.2, 1.0))
pl.scatter(test_counter[:-1], test_losses, color=(1.0, 0.2, 0.2))

pl.legend(['Train Loss', 'Test Loss'], loc='upper right', frameon=False)
pl.xlabel('Training Samples')
pl.ylabel('Log Likelihood Loss')

pl.show()

It's always important to inspect individual examples to convince yourself the model is behaving as expected.

In [None]:
with torch.no_grad():
  output = network(example_data.reshape(1000, 28*28))

**Challenge: (2pt)** Can you explain why we are using `torch.no_grad()`?

**Answer:** `torch.no_grad()` disables gradient tracking inside its block: PyTorch does not build the computation graph needed for backpropagation, which reduces both computation and memory use. Here we only run a forward pass to obtain predictions (inference), so the weights are not updated and no gradients are needed. It should not wrap the training step itself, because `loss.backward()` requires gradients.
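
As a small, self-contained check of what `torch.no_grad()` changes, the sketch below runs the same computation with and without the context manager. The tensor `w` is a made-up stand-in for a model parameter.

```python
import torch

# A stand-in for a trainable parameter.
w = torch.ones(3, requires_grad=True)

# Outside no_grad: the result is attached to the autograd graph.
tracked = (w * 2).sum()

# Inside no_grad: the same computation is not recorded.
with torch.no_grad():
    untracked = (w * 2).sum()

print(tracked.requires_grad, untracked.requires_grad)
print(tracked.grad_fn is not None, untracked.grad_fn is None)
```

Both results hold the same value; only the first one can be used to call `backward()`.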

In [None]:
fig = pl.figure()
for i in range(6):
    pl.subplot(2,3,i+1)
    pl.tight_layout()
    pl.imshow(example_data[i][0], cmap='gray', interpolation='none')
    pl.title("Prediction: {}".format(output.data.max(1, keepdim=True)[1][i].item()))
    pl.xticks([])
    pl.yticks([])
fig

**Challenge: (1pt)** Redo the plot above, but showcase only misclassifications (i.e., cases in which the model was wrong).

**Answer:** Challenge completed in the code below:

In [None]:
# Find up to six incorrect predictions in the example batch
predictions = output.argmax(dim=1)
inc_example_data = []
inc_outputs = []
idx = 0  # scan the whole batch, including the six samples plotted above
while idx < len(example_data) and len(inc_example_data) < 6:
    predicted = predictions[idx].item()
    if predicted != example_targets[idx].item():
        inc_example_data.append(example_data[idx])
        inc_outputs.append(predicted)
    idx += 1

# Plot the misclassified examples in a new figure
fig = pl.figure()
for i in range(len(inc_example_data)):  # may be fewer than six
    pl.subplot(2,3,i+1)
    pl.tight_layout()
    pl.imshow(inc_example_data[i][0], cmap='gray', interpolation='none')
    pl.title("Prediction: {}".format(inc_outputs[i]))
    pl.xticks([])
    pl.yticks([])
fig

## Loading trained models

Eventually, you will want to load a model you trained in the past, either to run inference or to continue training. The training function we developed above saves artifacts that contain all of the data and metadata about the model, assuming you have the right model class. Let's inspect those files:

In [None]:
! ls results

To load a model:

In [None]:
trained_model = FeedforwardNeuralNetModel()
model_state_dict = torch.load("results/model_iteration-1.pth")
trained_model.load_state_dict(model_state_dict)

Before proceeding, let's inspect `model_state_dict`:

In [None]:
model_state_dict.keys()

In [None]:
model_state_dict['fc1.weight'].shape

In [None]:
f,axs = pl.subplots(3,3, figsize=(8,8))

c = 0
for ax in axs:
    for sax in ax:
        sax.imshow(model_state_dict['fc1.weight'][c].reshape((28,28)), 
                   cmap = pl.get_cmap('Blues'))
        c += 1
        sax.axis('off')

pl.show()

Let's do the same for the optimizer:

In [None]:
optimizer = optim.SGD(trained_model.parameters(), lr=learning_rate, momentum=momentum)
optimizer_state_dict = torch.load("results/optimizer_iteration-3.pth")
optimizer.load_state_dict(optimizer_state_dict)

In [None]:
optimizer_state_dict.keys()

**Challenge (1pt):** Can you explain the data in this dictionary?

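**Answer**: The dictionary has two keys, `state` and `param_groups`. `state` holds the optimizer's internal state for each parameter, indexed by parameter: for SGD with momentum, entry `0` is the `momentum_buffer` for `fc1.weight` (shape $10 \times 784$) and entry `1` is the `momentum_buffer` for `fc1.bias` (shape $10$). `param_groups` lists the hyperparameters used during training, such as the learning rate (`lr`), `momentum`, `dampening`, `weight_decay`, and `nesterov`, together with the indices of the parameters (`params`) they apply to.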
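
One way to check what the optimizer's state dictionary holds is to build a small optimizer, take a single step, and print its contents. The sketch below does this for a stand-alone `nn.Linear` layer with the same dimensions as `fc1`; the variable names `layer`, `opt`, and `sd` and the dummy all-zeros batch are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.optim as optim

# A stand-alone layer with the same shape as fc1 (784 inputs, 10 outputs).
layer = nn.Linear(28 * 28, 10)
opt = optim.SGD(layer.parameters(), lr=0.001, momentum=0.1)

# Take one step on a dummy batch so that the momentum buffers are created.
loss = layer(torch.zeros(4, 28 * 28)).sum()
loss.backward()
opt.step()

sd = opt.state_dict()
print(sd.keys())

# Per-parameter state: index 0 is the weight matrix, index 1 is the bias vector.
for idx, entry in sd['state'].items():
    print(idx, tuple(entry['momentum_buffer'].shape))

# Hyperparameters and the parameter indices they apply to.
group = sd['param_groups'][0]
print(group['lr'], group['momentum'], group['params'])
```

The buffers have the same shapes as the parameters they belong to, because each buffer stores a running combination of that parameter's past gradients.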

## Final challenges

* **(1 pt)** What happens if you use only 10% of the available training data? Plot the difference in performance of the network.
* **(0.5pt)** What happens if you remove 80% of all samples with label 5. Do you see a difference in performance? Is this difference homogeneous?
* **(0.5pt)** What happens if you change parameters like the learning rate and momentum? Plot the difference.
* **(2pt)** Can you add more layers to this neural network? Start with one additional layer (often called "hidden layer"). What changes can you observe in doing so?
* **(2pt)** Can you add regularization to this model? Look for L1, L2, and drop-out regularizations. What changes do you observe?
* **[stretch] (2pt)** Can you change this model and turn it into a convolutional neural network?

* **(1 pt)** What happens if you use only 10% of the available training data? Plot the difference in performance of the network.

In [None]:
def train_mod(epoch, trainer, train_losses, train_counter, network, optimizer):
    network.train()
    for batch_idx, (data, target) in enumerate(trainer):
        optimizer.zero_grad()
        output = network(data.reshape(-1, 28*28))
        
        loss = F.nll_loss(output, target)
        loss.backward()
        
        optimizer.step()
        
        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(trainer.dataset),
                100. * batch_idx / len(trainer), loss.item())
            )
            train_losses.append(loss.item())
            train_counter.append((batch_idx*64) + ((epoch-1)*len(trainer.dataset)))

def test_mod(test_losses, network):
    network.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():
        for data, target in test_loader:
            output = network(data.reshape(-1, 28*28))
            test_loss += F.nll_loss(output, target, reduction='sum').item()  # sum over the batch; averaged below
            pred = output.data.max(1, keepdim=True)[1]
            correct += pred.eq(target.data.view_as(pred)).sum()
    test_loss /= len(test_loader.dataset)
    test_losses.append(test_loss)
    print('\nTest set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n' \
          .format(test_loss, correct, len(test_loader.dataset), 100. * correct / len(test_loader.dataset))
    )

In [None]:
train_dataset = torchvision.datasets.MNIST('./files/', train=True, download=True,
                                           transform=torchvision.transforms.Compose([
                                               torchvision.transforms.ToTensor(),
                                               torchvision.transforms.Normalize((0.1307,), (0.3081,))
                                           ]))
train_10 = torch.utils.data.Subset(train_dataset, list(range(int(0.1 * len(train_dataset)))))
train_loader_10 = torch.utils.data.DataLoader(train_10, batch_size=batch_size_train, shuffle=True)

network_10 = FeedforwardNeuralNetModel()
optimizer_10 = optim.SGD(network_10.parameters(), lr=learning_rate, momentum=momentum)
n_epochs = 5

train_losses_10 = []
train_counter_10 = []
test_losses_10 = []
test_counter_10 = [i*len(train_loader_10.dataset) for i in range(n_epochs + 1)]
    
# Training
for epoch in range(1, n_epochs + 1):
    train_mod(epoch, train_loader_10, train_losses_10, train_counter_10, network_10, optimizer_10)
    test_mod(test_losses_10, network_10)

In [None]:
fig = pl.figure()
pl.scatter(test_counter[:-1], test_losses, color=(1.0, 0.2, 0.2))
pl.scatter(test_counter_10[:-1], test_losses_10, color='purple')

pl.legend(['Test Loss Original', 'Test Loss 10% Training'], loc='upper right', frameon=False)
pl.xlabel('Training Samples')
pl.ylabel('Log Likelihood Test Loss')

pl.show()

**Answer:** Using 10% of the available training data results in similar training performance but a higher test-set loss. As seen in the graph above, the test loss was higher at every epoch when the network was trained on 10 percent of the training data than when it was trained on the full dataset.
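
The subset used for this challenge takes the first 10% of the indices. As an alternative sketch (an assumption, not what the cell above does), the indices can be sampled at random so that the subset does not depend on the order in which the samples are stored. The helper `random_subset_indices` is hypothetical; 60000 is the size of the MNIST training set.

```python
import random


def random_subset_indices(dataset_size, fraction, seed=1):
    """Pick a reproducible random sample of indices covering `fraction` of the data."""
    rng = random.Random(seed)                 # local generator, fixed seed
    n_keep = int(fraction * dataset_size)
    return sorted(rng.sample(range(dataset_size), n_keep))


indices = random_subset_indices(60000, 0.1)
print(len(indices), indices[:5])
```

The resulting list could then be passed to `torch.utils.data.Subset(train_dataset, indices)` in place of `list(range(...))`.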

* **(0.5pt)** What happens if you remove 80% of all samples with label 5. Do you see a difference in performance? Is this difference homogeneous?


In [None]:
five_idx = [i for i in range(len(train_loader.dataset)) if train_loader.dataset[i][1] == 5]
all_idx = [i for i in range(len(train_loader.dataset))]
to_remove = five_idx[:int(len(five_idx) * 0.8)]

train_20 = torch.utils.data.Subset(train_dataset, list(set(all_idx) - set(to_remove)))
train_20_loader = torch.utils.data.DataLoader(train_20, batch_size=batch_size_train, shuffle=True)

n_epochs = 5
train_losses_20 = []
train_counter_20 = []
test_losses_20 = []
test_counter_20 = [i*len(train_20_loader.dataset) for i in range(n_epochs + 1)]  # count samples of the reduced set

network_20 = FeedforwardNeuralNetModel()
optimizer_20 = optim.SGD(network_20.parameters(), lr=learning_rate, momentum=momentum)

for i in range(1,n_epochs+1):
    train_mod(i, train_20_loader, train_losses_20, train_counter_20, network_20, optimizer_20)  # log into the lists for this experiment
    test_mod(test_losses_20, network_20)

In [None]:
fig = pl.figure()
pl.scatter(test_counter[:-1], test_losses, color=(1.0, 0.2, 0.2))
pl.scatter(test_counter_20[:-1], test_losses_20, color='purple')

pl.legend(['Test Loss Original', 'Test Loss 20% 5 labels'], loc='upper right', frameon=False)
pl.xlabel('Training Samples')
pl.ylabel('Log Likelihood Test Loss')

pl.show()

**Answer:** Looking at the plot above, we can see that the test loss increases at every epoch when 80 percent of the training samples labeled "5" are removed from the training set. The two scatter plots follow the same general pattern, which suggests that the difference could be homogeneous, although the aggregate test loss cannot confirm this; a per-class accuracy comparison would be needed.
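
Whether a change in performance is homogeneous across digits can be checked by computing the accuracy separately for each label. The following is a minimal sketch in plain Python; the function `per_class_accuracy` is hypothetical, and the target and prediction lists are made-up toy values, not results from this notebook.

```python
from collections import defaultdict


def per_class_accuracy(targets, predictions):
    """Return {label: fraction of samples with that label predicted correctly}."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(targets, predictions):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {label: correct[label] / total[label] for label in sorted(total)}


# Toy data: the label 5 is confused more often than the others.
targets = [5, 5, 5, 5, 3, 3, 8, 8]
predictions = [5, 3, 8, 5, 3, 3, 8, 8]

acc = per_class_accuracy(targets, predictions)
print(acc)
```

Applied to the test set for both networks, a table like this would show whether the loss of accuracy is concentrated on one digit or spread across all of them.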

* **(0.5pt)** What happens if you change parameters like the learning rate and momentum? Plot the difference.


In [None]:
lrs = [0.1, 0.01, 0.001]
momentums = [1, 0.5, 0.001]
colors = ['red', 'blue', 'green']
results = []
n_epochs = 5

# Loop-local names keep the global `momentum`, `network`, and `optimizer` unchanged
for lr, mom in zip(lrs, momentums):
    network_temp = FeedforwardNeuralNetModel()
    optimizer_temp = optim.SGD(network_temp.parameters(), lr=lr, momentum=mom)

    train_losses_temp = []
    train_counter_temp = []
    test_losses_temp = []
    test_counter_temp = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

    for j in range(1,n_epochs+1):
        train_mod(j, train_loader, train_losses_temp, train_counter_temp, network_temp, optimizer_temp)
        test_mod(test_losses_temp, network_temp)

    # Keep the per-epoch test losses of this configuration for plotting
    results.append(test_losses_temp)

In [None]:
fig = pl.figure()
pl.scatter(test_counter[:-1], results[0], color=colors[0])
pl.scatter(test_counter[:-1], results[1], color=colors[1])
pl.scatter(test_counter[:-1], results[2], color=colors[2])

pl.legend([f"lr={lrs[0]},momentum={momentums[0]}", f"lr={lrs[1]},momentum={momentums[1]}", f"lr={lrs[2]},momentum={momentums[2]}"], loc='upper right', frameon=False)
pl.xlabel('Training Samples')
pl.ylabel('Log Likelihood Test Loss')

pl.show()

**Answer:** Increasing the momentum resulted in an increase in test loss, and a similar effect can be observed with the learning rate. Because the two were varied together in this experiment, their individual effects cannot be separated from this plot alone. The learning rate determines how large a step is taken during gradient descent, so large values can overshoot and fail to converge to the optimum. Momentum determines how much of the previous update carries over into the current one; values that are too high (such as 1, which never decays past gradients) can also prevent convergence and produce inconsistent results.
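
To study the learning rate and the momentum separately, one option is to train once for every combination of the two instead of pairing them up. The sketch below only builds such a grid and the matching plot labels; the learning rates are the ones used above, while the momentum values are illustrative assumptions.

```python
from itertools import product

lrs = [0.1, 0.01, 0.001]
momentums = [0.0, 0.5, 0.9]   # illustrative values, chosen for this sketch

# Every (learning rate, momentum) pair, so each factor can be compared
# while the other one is held fixed.
grid = list(product(lrs, momentums))
labels = [f"lr={lr}, momentum={mom}" for lr, mom in grid]

for label in labels:
    print(label)
```

Each pair in `grid` would correspond to one fresh network and optimizer, trained and evaluated in the same way as in the loop above.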

* **(2pt)** Can you add more layers to this neural network? Start with one additional layer (often called "hidden layer"). What changes can you observe in doing so?


In [None]:
class FeedforwardNeuralNetModel2(nn.Module):
    def __init__(self):
        super(FeedforwardNeuralNetModel2, self).__init__()
        input_dim = 28 * 28
        hidden_dim1 = 512  
        hidden_dim2 = 256
        out_dim = 10 

        self.fc1 = nn.Linear(input_dim, hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.out = nn.Linear(hidden_dim2, out_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        out = self.out(x)
        return F.log_softmax(out, dim=1)  # normalize over the class dimension


network_hid = FeedforwardNeuralNetModel2()
optimizer_hid = optim.SGD(network_hid.parameters(), lr=learning_rate, momentum=momentum)
n_epochs = 10

train_losses_hid = []
train_counter_hid = []
test_losses_hid = []
test_counter_hid = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

network = FeedforwardNeuralNetModel()  # baseline: the original model without hidden layers
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum) 

for i in range(1,n_epochs+1):
    train_mod(i, train_loader, train_losses_hid, train_counter_hid, network_hid, optimizer_hid)
    test_mod(test_losses_hid, network_hid)
    train_mod(i, train_loader, train_losses, train_counter, network, optimizer)
    test_mod(test_losses, network)

In [None]:
fig = pl.figure()
pl.scatter(test_counter[:-1], test_losses, color="red")
pl.scatter(test_counter_hid[:-1], test_losses_hid, color="purple")

pl.legend(['Test Loss Original', 'Test Loss Additional Layers'], loc='upper right', frameon=False)
pl.xlabel('Training Samples')
pl.ylabel('Log Likelihood Loss')

pl.show()

**Answer:** Yes, I was able to add more layers to the neural network, which now consists of an input layer, two hidden layers (with 512 and 256 units), and an output layer. Adding these layers initially resulted in a higher test loss after the first epoch, but in later epochs the deeper model either beat or matched the test loss of the original model.
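
One concrete change that comes with the extra layers is the number of trainable parameters. A short sketch of that arithmetic, assuming fully connected layers with a bias term as in `nn.Linear` (the helper `linear_params` is hypothetical):

```python
def linear_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Original model: a single layer from 784 inputs to 10 classes.
original = linear_params(28 * 28, 10)

# Deeper model: 784 -> 512 -> 256 -> 10.
layer_sizes = [28 * 28, 512, 256, 10]
deeper = sum(linear_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:]))

print(original, deeper)
```

The deeper model has far more parameters to fit, which is one reason it can need more epochs before its test loss improves on the single-layer model.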

* **(2pt)** Can you add regularization to this model? Look for L1, L2, and drop-out regularizations. What changes do you observe?


In [None]:
def l1_regularization(model, l1_lambda):
    l1_norm = sum(param.abs().sum() for param in model.parameters())
    return l1_lambda * l1_norm

network_l1 = FeedforwardNeuralNetModel2()
optimizer_l1 = optim.SGD(network_l1.parameters(), lr=learning_rate, momentum=momentum)

n_epochs = 10
log_interval = 10
l1_lambda = 0.001 

train_losses_l1 = []
train_counter_l1 = []
test_losses_l1 = []
test_counter_l1 = [i * len(train_loader.dataset) for i in range(n_epochs + 1)]

def train_l1(epoch):
    network_l1.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        optimizer_l1.zero_grad()  # clear the gradients of network_l1, not of the baseline
        output = network_l1(data.reshape(-1, 28*28))

        loss = F.nll_loss(output, target)

        l1_loss = l1_regularization(network_l1, l1_lambda)
        total_loss = loss + l1_loss

        total_loss.backward()  
        optimizer_l1.step()  

        if batch_idx % log_interval == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_loader.dataset),
                100. * batch_idx / len(train_loader), total_loss.item())
            )
            train_losses_l1.append(total_loss.item())
            train_counter_l1.append((batch_idx * len(data)) + ((epoch - 1) * len(train_loader.dataset)))


network = FeedforwardNeuralNetModel2()
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)

train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

n_epochs = 10
for i in range(1, n_epochs + 1):
    train_l1(i)
    test_mod(test_losses_l1, network_l1)
    train_mod(i, train_loader, train_losses, train_counter, network, optimizer)
    test_mod(test_losses, network)

In [None]:
fig = pl.figure()
pl.scatter(test_counter[:-1], test_losses, color="red")
pl.scatter(test_counter_l1[:-1], test_losses_l1, color="purple")

pl.legend(['Test Loss Original', 'Test Loss L1 Reg'], loc='upper right', frameon=False)
pl.xlabel('Training Samples')
pl.ylabel('Log Likelihood Loss')

pl.show()

After implementing L1 regularization in the neural network designed in the previous question, we can observe that the test loss is higher at every epoch than for the same network without regularization. Further tuning (such as searching for the best learning rate, momentum, and `l1_lambda`) is required to get better performance.
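
For reference, the two penalties considered in this challenge differ only in how each weight contributes. The sketch below computes both for a made-up list of weights in plain Python; the function names and values are hypothetical. Note that the `weight_decay` argument of `optim.SGD` adds `weight_decay * w` to each gradient, which corresponds to a penalty of `(weight_decay / 2) * sum(w ** 2)`.

```python
def l1_penalty(weights, lam):
    """lam * sum of absolute values: pushes small weights toward exactly zero."""
    return lam * sum(abs(w) for w in weights)


def l2_penalty(weights, lam):
    """lam * sum of squares: shrinks large weights more strongly."""
    return lam * sum(w * w for w in weights)


weights = [0.5, -1.5, 2.0]   # made-up weights
lam = 0.001

l1 = l1_penalty(weights, lam)
l2 = l2_penalty(weights, lam)
print(l1, l2)
```

Either penalty is added to the data loss before calling `backward()`, so a larger `lam` trades accuracy on the training data for smaller weights.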

In [None]:
network_l2 = FeedforwardNeuralNetModel2()
l2_lambda = 0.001

optimizer_l2 = optim.SGD(network_l2.parameters(), lr=learning_rate, momentum=momentum, weight_decay=l2_lambda)

n_epochs = 10

train_losses_l2 = []
train_counter_l2 = []
test_losses_l2 = []
test_counter_l2 = [i * len(train_loader.dataset) for i in range(n_epochs + 1)]

network = FeedforwardNeuralNetModel2()
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)

train_losses = []
train_counter = []
test_losses = []
test_counter = [i * len(train_loader.dataset) for i in range(n_epochs + 1)]

for i in range(1, n_epochs + 1):
    train_mod(i, train_loader, train_losses_l2, train_counter_l2, network_l2, optimizer_l2)
    test_mod(test_losses_l2, network_l2)

    train_mod(i, train_loader, train_losses, train_counter, network, optimizer)
    test_mod(test_losses, network)

In [None]:
fig = pl.figure()
pl.scatter(test_counter[:-1], test_losses, color="red")
pl.scatter(test_counter_l2[:-1], test_losses_l2, color="purple")

pl.legend(['Test Loss Original', 'Test Loss L2 Reg'], loc='upper right', frameon=False)
pl.xlabel('Training Samples')
pl.ylabel('Log Likelihood Loss')

pl.show()

**Mention how L2 performed**

In [None]:
class FeedforwardNeuralNetModel3(nn.Module):
    def __init__(self, dropout_prob=0.5):
        super(FeedforwardNeuralNetModel3, self).__init__()
        input_dim = 28 * 28
        hidden_dim1 = 512  
        hidden_dim2 = 256
        out_dim = 10 

        self.fc1 = nn.Linear(input_dim, hidden_dim1)
        self.fc2 = nn.Linear(hidden_dim1, hidden_dim2)
        self.out = nn.Linear(hidden_dim2, out_dim)

        self.dropout = nn.Dropout(p=dropout_prob)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)

        x = F.relu(self.fc2(x))
        x = self.dropout(x)

        out = self.out(x)
        return F.log_softmax(out, dim=1)

network_drop = FeedforwardNeuralNetModel3(dropout_prob=0.5)
optimizer_drop = optim.SGD(network_drop.parameters(), lr=learning_rate, momentum=momentum)
n_epochs = 10

train_losses_drop = []
train_counter_drop = []
test_losses_drop = []
test_counter_drop = [i * len(train_loader.dataset) for i in range(n_epochs + 1)]

train_losses = []
train_counter = []
test_losses = []
test_counter = [i * len(train_loader.dataset) for i in range(n_epochs + 1)]

network = FeedforwardNeuralNetModel3(dropout_prob=0.0)
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)

for i in range(1, n_epochs + 1):
    train_mod(i, train_loader, train_losses_drop, train_counter_drop, network_drop, optimizer_drop)
    test_mod(test_losses_drop, network_drop)

    train_mod(i, train_loader, train_losses, train_counter, network, optimizer)
    test_mod(test_losses, network)

In [None]:
fig = pl.figure()
pl.scatter(test_counter[:-1], test_losses, color="red")
pl.scatter(test_counter_drop[:-1], test_losses_drop, color="purple")

pl.legend(['Test Loss Original', 'Test Loss Dropout Reg'], loc='upper right', frameon=False)
pl.xlabel('Training Samples')
pl.ylabel('Log Likelihood Loss')

pl.show()

**Mention how Dropout reg performed**

* **[stretch] (2pt)** Can you change this model and turn it into a convolutional neural network?

In [None]:
class CNN(nn.Module):
    def __init__(self):
        super(CNN, self).__init__()
        
        num_classes = 100  
        num_classes2 = 10
        
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1, stride=2)  
        self.conv2 = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1, stride=2) 
        self.conv3 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3, padding=1)          
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc1 = nn.Linear(128 * 3 * 3, num_classes)
        self.out = nn.Linear(num_classes, num_classes2)

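    def forward(self, x):
        x = x.view(-1, 1, 28, 28)          # (N, 784) -> (N, 1, 28, 28)

        x = F.relu(self.conv1(x))          # -> (N, 32, 14, 14)
        x = F.relu(self.conv2(x))          # -> (N, 64, 7, 7)
        x = F.relu(self.conv3(x))          # -> (N, 128, 7, 7)
        x = self.pool(x)                   # -> (N, 128, 3, 3)
        x = x.reshape(x.size(0), -1)       # flatten to (N, 128 * 3 * 3) for the linear layers
        x = F.relu(self.fc1(x))
        out = self.out(x)

        return F.log_softmax(out, dim=1)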

In [None]:
network = CNN()
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)

train_losses_cnn = []
train_counter_cnn = []
test_losses_cnn = []

n_epochs = 10
test_counter_cnn = [i * len(train_loader.dataset) for i in range(n_epochs + 1)]


for i in range(1, n_epochs + 1):
    train_mod(i, train_loader, train_losses_cnn, train_counter_cnn, network, optimizer)  # pass the CNN and its optimizer
    test_mod(test_losses_cnn, network)

In [None]:
fig = pl.figure()
pl.scatter(test_counter[:-1], test_losses, color="red")
pl.scatter(test_counter_cnn[:-1], test_losses_cnn, color="purple")

pl.legend(['Test Loss Original', 'Test Loss CNN'], loc='upper right', frameon=False)
pl.xlabel('Training Samples')
pl.ylabel('Log Likelihood Loss')

pl.show()

We can see in the graph above that the loss of the CNN model is still worse than our original model, indicating that additional optimization is required to yield best performance.
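
The input size of the first linear layer in the CNN (`128 * 3 * 3`) follows from how each convolution and pooling step changes the spatial size of the image. A short sketch of that arithmetic, using the standard output-size formula with floor division (the helper `conv_out` is hypothetical):

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


size = 28
size = conv_out(size, kernel=3, stride=2, padding=1)   # conv1
after_conv1 = size
size = conv_out(size, kernel=3, stride=2, padding=1)   # conv2
after_conv2 = size
size = conv_out(size, kernel=3, stride=1, padding=1)   # conv3
after_conv3 = size
size = conv_out(size, kernel=2, stride=2, padding=0)   # 2x2 max pooling
after_pool = size

# 128 output channels, each of size after_pool x after_pool.
flattened = 128 * after_pool * after_pool
print(after_conv1, after_conv2, after_conv3, after_pool, flattened)
```

The feature maps must be flattened to this length before they are passed to the first linear layer.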