## PyTorch: Implementing a CNN for MNIST

We achieved around 92% accuracy using two fully connected layers. Let's see if we can improve performance by switching to a convolutional neural network architecture. First we run our usual imports and then set up the data loaders.

In [1]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from torch.autograd import Variable
import torchvision.transforms as transforms
import torchvision.datasets as dsets
from timeit import default_timer as timer

# Rescale the pixel values from [0, 1] to [-1, 1]. This is really important!
# MNIST images have a single channel, so Normalize takes one mean and one std.
transform = transforms.Compose(
    [transforms.ToTensor(),
    transforms.Normalize((0.5,), (0.5,))])

training_data = dsets.MNIST(root="./data", train = True, transform=transform, download = True)
testing_data = dsets.MNIST(root="./data", train = False, transform=transform, download = True)

batch_size = 64
train_loader = torch.utils.data.DataLoader(dataset=training_data, batch_size=batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(dataset=testing_data, batch_size=len(testing_data), shuffle=False)

Now set up the CNN:

In [11]:
# 1. Define and instantiate the network as a subclass of nn.Module.
# Unlike the last example, we will simplify the number of parameters and leave num_outputs fixed at 10.
class cnn(nn.Module):

    def __init__(self, num_input_channels):
        super(cnn, self).__init__()

        # Layer 1
        # num_input_channels input channels (1 for MNIST: the grayscale image);
        # 10 output channels (i.e. the layer produces 10 feature maps);
        # uses a 5x5 kernel over the input channels to build the output channels.
        self.conv1 = nn.Sequential(
            nn.Conv2d(in_channels=num_input_channels, out_channels=10, kernel_size=5, stride=1, padding=0),
            nn.ReLU()
        )
        
        self.conv2 = nn.Sequential(
            nn.Conv2d(in_channels=10, out_channels=20, kernel_size=5, stride=1, padding=0),
            nn.ReLU()
        )
        self.maxpool = nn.MaxPool2d((2,2))
        
        # Add a dropout layer to add a bit of regularization.
        self.drop = nn.Dropout2d(p=0.25, inplace=False)
        
        #Why 20*10*10? See below.
        self.fc3 = nn.Sequential(
            nn.Linear(in_features=20*10*10, out_features=100),
            nn.ReLU()
        )
        
        self.fc4 = nn.Sequential(
            nn.Linear(100,10),
            nn.ReLU()
        )

    # x: input to the network
    def forward(self, x):

        # Pass x through conv1 using relu activations.
        x = self.conv1(x)
        x = self.conv2(x)  
        x = self.maxpool(x) 
        x = self.drop(x)
        ## We need to flatten all of the feature map activations into a column vector to pass into the fc layers.
        ## Why is it 20*10*10? See below.
        x = x.view(-1, 20*10*10) 
        x = self.fc3(x)
        x = self.fc4(x)
        return x
    
    

Keep track of the spatial size of each convolutional layer's output so you know what kernel size, stride, and padding the next layer needs. The output size of a conv layer is a function of the input size (W), the receptive field (kernel) size (F), the stride with which the kernel is applied (S), and the amount of zero padding used (P) on the border. You should convince yourself that the number of neurons that "fit" along each spatial dimension is given by:

$$\frac{W - F + 2P}{S} + 1$$

So if the input to the conv layer is 7x7, the kernel size is 3x3, the stride is 1, and there is no zero padding, the output will have dimensions 5x5 (as $(7-3+0)/1 + 1 = 5$). Note that the depth of the input (e.g. the number of color channels or of feature maps at the previous layer) does not affect the spatial size: each kernel spans every input channel, so any number of input feature maps can be combined to produce as many feature maps as you want at the subsequent layer.
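The formula is easy to wrap in a small helper for checking layer sizes by hand. This is a minimal sketch: `conv_output_size` is a hypothetical name, not a PyTorch function, and it assumes square inputs and kernels.

```python
def conv_output_size(w, f, s=1, p=0):
    """Spatial output size for input size w, kernel size f, stride s, padding p.

    Hypothetical helper (not part of PyTorch); assumes square inputs and kernels.
    """
    span = w - f + 2 * p
    if span < 0:
        raise ValueError("kernel is larger than the padded input")
    # Floor division: when the stride does not divide evenly, the leftover
    # border positions are dropped (this matches nn.Conv2d's default behaviour).
    return span // s + 1


# The 7x7 input, 3x3 kernel, stride 1, no padding example from the text.
print(conv_output_size(7, 3))  # 5
```

The same helper covers pooling layers, since a 2x2 maxpool with stride 2 is just `conv_output_size(w, 2, s=2)`.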

Let's look at the network above. For conv1, the input is 28x28, the kernel is 5x5, the stride is 1, and there is no zero padding. Each of the 10 output feature maps therefore has size $(28-5+0)/1 + 1 = 24$, so 24x24.

In conv2, the input is 24x24, the kernel is 5x5, the stride is 1, and there is no zero padding. Each of the 20 output feature maps will have dimension $(24-5+0)/1 + 1 = 20$, so 20x20.

The output of conv2 goes through a maxpool with 2x2 kernel with stride of 2. The output of the maxpool will thus have shape $(20-2+0)/2 + 1 = 10$, so 10x10. 

Note that we need to reshape the activations in the 20 10x10 feature maps coming out of the maxpool layer into a single vector as input to the fully connected layer. According to these calculations, we have $10*10$ features per feature map, and there are 20 feature maps after the maxpool, so the total size of the fc3 input is $10*10*20 = 2000$.
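These sizes can also be checked empirically by pushing a dummy image through layers configured like the ones in the network. A minimal sketch, assuming PyTorch 0.4 or later (for `torch.no_grad`); the layers here are fresh, untrained stand-ins rather than the trained model, and the names `dummy`, `a1`, `a2`, and `pooled` are illustrative.

```python
import torch
import torch.nn as nn

# Stand-in layers with the same hyperparameters as conv1, conv2 and maxpool.
conv1 = nn.Conv2d(in_channels=1, out_channels=10, kernel_size=5, stride=1, padding=0)
conv2 = nn.Conv2d(in_channels=10, out_channels=20, kernel_size=5, stride=1, padding=0)
maxpool = nn.MaxPool2d((2, 2))

# One fake grayscale 28x28 image: (batch, channels, height, width).
dummy = torch.zeros(1, 1, 28, 28)

with torch.no_grad():  # only shapes matter here, no gradients needed
    a1 = conv1(dummy)
    a2 = conv2(a1)
    pooled = maxpool(a2)

print(tuple(a1.shape))      # (1, 10, 24, 24)
print(tuple(a2.shape))      # (1, 20, 20, 20)
print(tuple(pooled.shape))  # (1, 20, 10, 10)

# Number of values per image after flattening: the in_features of fc3.
flat_features = pooled[0].numel()
print(flat_features)        # 2000
```

Printing shapes like this is a quick way to find the right `in_features` whenever the kernel size, stride, or padding changes.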


In [3]:
the_net = cnn(num_input_channels=1) ##MNIST digits are grayscale. 
print(the_net)

num_parameters = 0
for x in the_net.parameters():
    print(x.shape)
    
for x in the_net.parameters():
    num_params = 1
    for dim_size in x.shape:
        num_params *= dim_size
    num_parameters += num_params
    
print('number of network parameters: %i' % (num_parameters))

cnn(
  (conv1): Sequential(
    (0): Conv2d (1, 10, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU()
  )
  (conv2): Sequential(
    (0): Conv2d (10, 20, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU()
  )
  (maxpool): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), dilation=(1, 1))
  (drop): Dropout2d(p=0.25)
  (fc3): Sequential(
    (0): Linear(in_features=2000, out_features=100)
    (1): ReLU()
  )
  (fc4): Sequential(
    (0): Linear(in_features=100, out_features=10)
    (1): ReLU()
  )
)
torch.Size([10, 1, 5, 5])
torch.Size([10])
torch.Size([20, 10, 5, 5])
torch.Size([20])
torch.Size([100, 2000])
torch.Size([100])
torch.Size([10, 100])
torch.Size([10])
number of network parameters: 206390


So we have 206,390 weights to learn (recall we only had about 8000 in the two fully connected layer network)!
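The total can be reproduced by hand from the layer shapes printed above. The sketch below uses two hypothetical helpers, `conv_params` and `linear_params` (not PyTorch functions), which count one weight per connection plus one bias per output.

```python
def conv_params(in_channels, out_channels, kernel_size):
    # One kernel_size x kernel_size filter per (input, output) channel pair,
    # plus one bias per output channel.
    return out_channels * in_channels * kernel_size * kernel_size + out_channels


def linear_params(in_features, out_features):
    # One weight per (input, output) pair, plus one bias per output.
    return in_features * out_features + out_features


counts = {
    "conv1": conv_params(1, 10, 5),
    "conv2": conv_params(10, 20, 5),
    "fc3": linear_params(20 * 10 * 10, 100),
    "fc4": linear_params(100, 10),
}

for name, n in counts.items():
    print("%s: %d" % (name, n))

total = sum(counts.values())
print("total: %d" % total)  # 206390
```

Almost all of the parameters (200,100 of them) sit in fc3; the two convolutional layers together contribute only 5,280.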

**Let's use the GPU for this much bigger network.**

In [4]:
the_net = cnn(num_input_channels=1) ##MNIST digits are grayscale. 
the_net.cuda()
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(the_net.parameters(), lr=0.01)

Time to train! **Note that all Variables go onto the GPU**. Also observe that we need to move some variables back to the **CPU** in order to do comparisons with data stored in RAM.
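In PyTorch 0.4 and later, the same data movement is usually written with `torch.device` and `.to(...)`, and plain tensors can be passed without the `Variable` wrapper. The sketch below shows the pattern on a tiny stand-in model with random data, so it runs with or without a GPU; `tiny_model`, `fake_images`, and `fake_labels` are hypothetical placeholders, not part of this notebook.

```python
import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical stand-ins for the network and one minibatch.
tiny_model = nn.Linear(4, 3).to(device)
fake_images = torch.randn(8, 4)           # created on the CPU
fake_labels = torch.randint(0, 3, (8,))   # stays on the CPU

# Inputs must live on the same device as the model's parameters.
outputs = tiny_model(fake_images.to(device))
_, predicted = torch.max(outputs, 1)

# The labels are still on the CPU, so bring the predictions back before comparing.
correct = (predicted.cpu() == fake_labels).sum().item()
print("%d of %d correct" % (correct, fake_labels.size(0)))
```

Written this way, the same loop runs unchanged on machines without CUDA.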

In [5]:
num_epochs = 25
for epoch in range(num_epochs):
    start = timer()
    for batch_num , (minibatch_of_images, minibatch_of_labels) in enumerate(train_loader):
    
        the_batch = Variable(minibatch_of_images).cuda()
        labels = Variable(minibatch_of_labels).cuda()
        
        optimizer.zero_grad()
        
        output = the_net(the_batch)
        
        loss = loss_function(output, labels)
        
        loss.backward()
        
        optimizer.step()
        
        # Print the training loss every 300 minibatches.
        # (loss.item() requires PyTorch 0.4+; older versions used loss.data[0].)
        if batch_num % 300 == 0:
            print("At epoch %i, minibatch %i. Loss: %.4f." % (epoch, batch_num, loss.item()))
            
    end = timer()
    print("Epoch %i finished! It took: %.4f seconds" % (epoch, end - start))
    correct = 0
    total = 0
    for data in test_loader:
        images, labels = data
        outputs = the_net(Variable(images).cuda())
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        # Accumulate inside the loop so every test batch is counted.
        correct += (predicted.cpu() == labels).sum()
    print('Accuracy of the network on the %d test images: %.2f %%' % (total, 100.0 * correct / total))

At epoch 0, minibatch 0. Loss: 2.3048.
At epoch 0, minibatch 300. Loss: 1.6442.
At epoch 0, minibatch 600. Loss: 0.3903.
At epoch 0, minibatch 900. Loss: 0.2438.
Epoch 0 finished! It took: 5.7396 seconds
Accuracy of the network on the 10000 test images: 92.03 %
At epoch 1, minibatch 0. Loss: 0.4225.
At epoch 1, minibatch 300. Loss: 0.1288.
At epoch 1, minibatch 600. Loss: 0.1220.
At epoch 1, minibatch 900. Loss: 0.1005.
Epoch 1 finished! It took: 5.1696 seconds
Accuracy of the network on the 10000 test images: 94.77 %
At epoch 2, minibatch 0. Loss: 0.1395.
At epoch 2, minibatch 300. Loss: 0.1230.
At epoch 2, minibatch 600. Loss: 0.0808.
At epoch 2, minibatch 900. Loss: 0.1665.
Epoch 2 finished! It took: 5.1457 seconds
Accuracy of the network on the 10000 test images: 96.37 %
At epoch 3, minibatch 0. Loss: 0.2040.
At epoch 3, minibatch 300. Loss: 0.2229.
At epoch 3, minibatch 600. Loss: 0.1274.
At epoch 3, minibatch 900. Loss: 0.0636.
Epoch 3 finished! It took: 5.1681 seconds
Accuracy o

< 2% test error is awesome! 
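One caveat when reading these accuracies: the evaluation loop above never calls `the_net.eval()`, so as written the dropout layer is still randomly zeroing activations while the test set is scored. Calling `the_net.eval()` before testing and `the_net.train()` before the next epoch switches dropout off and back on. The sketch below demonstrates the difference on a standalone `nn.Dropout` layer, used here as a small stand-in for the network's `self.drop`.

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.25)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p and the
# survivors are scaled by 1 / (1 - p) to keep the expected value unchanged.
drop.train()
train_out = drop(x)

# Evaluation mode: dropout is a no-op and the input passes through untouched.
drop.eval()
eval_out = drop(x)

print("zeros in train mode:", int((train_out == 0).sum()))
print("zeros in eval mode: ", int((eval_out == 0).sum()))
```

Because of this, test accuracy measured in training mode is typically a little noisier than it would be with dropout disabled.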

BTW -- how long would an epoch take on a CPU?

In [10]:
the_net = cnn(num_input_channels=1) ##MNIST digits are grayscale. 
optimizer = torch.optim.SGD(the_net.parameters(), lr=0.01)
start = timer()
for batch_num , (minibatch_of_images, minibatch_of_labels) in enumerate(train_loader):

    the_batch = Variable(minibatch_of_images)
    labels = Variable(minibatch_of_labels)

    optimizer.zero_grad()

    output = the_net(the_batch)

    loss = loss_function(output, labels)

    loss.backward()

    optimizer.step()

    # Print the training loss every 300 minibatches.
    # (loss.item() requires PyTorch 0.4+; older versions used loss.data[0].)
    if batch_num % 300 == 0:
        print("At minibatch %i. Loss: %.4f." % (batch_num, loss.item()))

end = timer()
print("One epoch on the CPU took: %.4f seconds" % (end - start))

At minibatch 0. Loss: 2.3033.
At minibatch 300. Loss: 1.9801.
At minibatch 600. Loss: 0.5785.
At minibatch 900. Loss: 0.4037.
One epoch on the CPU took: 26.3057 seconds


**So about 5x the time**... and this is a relatively small CNN! 