This is an example of how to build a convolutional neural network (CNN) that recognizes handwritten digits in the MNIST data set.

In [2]:
import torch
import torch.nn as nn
from torch.autograd import Variable
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms

## Convolutional Neural Network

Convolutional neural networks take biological inspiration from the animal visual cortex. The visual cortex consists of a complex arrangement of cells. Each cell is sensitive to a specific region of the visual field, called its receptive field. The receptive fields partially overlap so that together they cover the entire visual field. CNNs borrow this idea: specialized components each handle a specific task, much as neurons in the visual cortex respond to specific characteristics.

In a CNN, an input is passed through a series of <b>convolution layers</b>, <b>nonlinear layers</b>, <b>pooling layers</b>, and <b>fully-connected layers</b> to produce an output. In image classification, the input is an image represented as an array of pixel values, and the output is either a single class or a probability distribution over the classes that best describe the image.

### Convolution layer 

The first layer in a CNN is the convolutional layer. This layer extracts features, such as edges and curves, from the input image. It consists of a set of filters. We slide each filter across the width and height of the input image, computing element-wise products of the filter values and the pixel values they cover. The products are summed to give one output value, and the process is repeated as the filter moves across the input volume. The output produced by sliding one filter over all locations is called an activation map or feature map. Each filter spans the full depth of the input, and the depth of the output equals the number of filters.

Every image can be considered as a matrix of pixel values. Consider a 5x5 image whose pixel values are only 0 and 1 (in a grayscale image the pixel values usually range from 0 to 255, so the green matrix below is a special case). Also, consider another 3x3 matrix, which is the filter. The convolution of the 5x5 image and the 3x3 filter can be computed as shown in the animation below:
<img src="https://i.stack.imgur.com/I7DBr.gif" style="width: 400px;">
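
The sliding-window computation described above can be sketched in plain Python. This is a minimal illustration rather than how a library implements it: the helper name `convolve2d_valid` is hypothetical, the 5x5 image and 3x3 filter values are chosen for illustration and are not read from the animation, and the sketch assumes square inputs with no padding.

```python
def convolve2d_valid(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map.

    Assumes square `image` and `kernel` given as lists of lists.
    Like the convolution layers in PyTorch, the filter is not flipped.
    """
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    feature_map = []
    for r in range(out_size):
        row = []
        for c in range(out_size):
            # Element-wise products of the filter and the patch it covers, summed up
            total = sum(
                image[r * stride + i][c * stride + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        feature_map.append(row)
    return feature_map


# Illustrative values (an assumption, not taken from the figure)
image = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

feature_map = convolve2d_valid(image, kernel)
for row in feature_map:
    print(row)
# [4, 3, 4]
# [2, 4, 3]
# [2, 3, 4]
```

With a stride of 1 the 3x3 filter fits at 3x3 positions inside the 5x5 image, which is why the feature map is 3x3.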

Three parameters control the size of the output:
<ul>
<li><b>Depth</b>: Depth corresponds to the number of filters used for the convolution operation.

<li><b>Stride</b>: Stride is the number of pixels by which we slide the filter matrix over the input matrix. By default, the stride is 1, so the filter slides one pixel at a time. With a stride of 2, the filter jumps two pixels at a time, which produces a smaller output volume.

<li><b>Zero-padding</b>: Zero-padding is used to pad the input volume with zeros around the border. This allows us to control the spatial size of the output volume.
</ul>
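
How stride and zero-padding set the spatial size of the output can be made concrete with the usual output-size formula, floor((W - F + 2P) / S) + 1, where W is the input width, F the filter width, P the padding, and S the stride. The sketch below is only an illustration; the helper name `conv_output_size` is hypothetical and it ignores dilation.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial size (width or height) of a convolution output, ignoring dilation."""
    return (input_size - filter_size + 2 * padding) // stride + 1


# 5x5 image, 3x3 filter, stride 1, no padding
print(conv_output_size(5, 3))                        # 3
# 28x28 image, 5x5 filter, padding 2: the spatial size is preserved
print(conv_output_size(28, 5, padding=2))            # 28
# Same input with stride 2: the output is roughly halved
print(conv_output_size(28, 5, stride=2, padding=2))  # 14
```

The depth of the output is not part of this formula: it is simply the number of filters.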

### Non-Linear Layer (ReLU)

ReLU (Rectified Linear Unit) is an element-wise activation function that replaces all negative values in the feature map with zero. It implements the function $y = \max(0, x)$, so the input and output sizes of this layer are the same.

<img src="https://www.embedded-vision.com/sites/default/files/technical-articles/CadenceCNN/Figure8.jpg" style="width: 600px;">
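
As a small illustration of ReLU, the sketch below applies `F.relu` to a made-up 2x2 feature map (the values are hypothetical). Negative entries become zero and the shape is unchanged.

```python
import torch
import torch.nn.functional as F

# Hypothetical feature map with positive, zero, and negative entries
feature_map = torch.tensor([[-3.0, 1.5],
                            [0.0, -0.2]])

activated = F.relu(feature_map)   # element-wise max(0, x)
print(activated)
print(activated.shape == feature_map.shape)
```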

### Pooling Layer

This layer reduces the spatial size of the representation. It helps control overfitting by reducing the number of parameters and the amount of computation in the network. The most common form of pooling uses the max operation. The example shown below uses max pooling with a 2x2 window. We slide the window with a stride of 2 and take the maximum value in each region.

<img src="https://qph.ec.quoracdn.net/main-qimg-8afedfb2f82f279781bfefa269bc6a90" style="width: 600px;">
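
Max pooling with a 2x2 window and a stride of 2 can be tried directly with `F.max_pool2d`. The 4x4 input below is an illustrative assumption, not necessarily the values in the figure. Each non-overlapping 2x2 region is reduced to its maximum, so the 4x4 map becomes 2x2.

```python
import torch
import torch.nn.functional as F

# Illustrative 4x4 input, reshaped to (batch, channels, height, width)
x = torch.tensor([[1., 1., 2., 4.],
                  [5., 6., 7., 8.],
                  [3., 2., 1., 0.],
                  [1., 2., 3., 4.]]).view(1, 1, 4, 4)

pooled = F.max_pool2d(x, kernel_size=2, stride=2)
print(pooled.view(2, 2))
# Maximum of each 2x2 region: [[6, 8], [3, 4]]
```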

### Fully Connected Layer

Every unit in this layer is connected to every output of the previous layer. The layer performs classification on the features extracted by the convolutional layers and downsampled by the pooling layers, computing a weighted sum of the features plus a bias offset.

<img src="https://cdn-images-1.medium.com/max/1600/1*Kdnux0Kw1yQ4D8dq__mYCA.png" style="width: 300px;">
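
The "weighted sum plus bias" can be checked against `nn.Linear` in a few lines. This is a minimal sketch with hypothetical sizes (4 input features, 3 outputs) and random inputs; it only shows that the layer computes `x @ W.T + b`.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

fc = nn.Linear(4, 3)            # hypothetical sizes: 4 features in, 3 scores out
features = torch.randn(2, 4)    # batch of 2 random feature vectors

out = fc(features)
# The same computation written out: weighted sum of the features plus the bias
manual = features @ fc.weight.t() + fc.bias

print(tuple(out.shape))
print(torch.allclose(out, manual))
```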

In [10]:
class Model(nn.Module):
  def __init__(self):
    super(Model, self).__init__()
    self.conv1 = nn.Conv2d(1, 32, 5, padding=2)
    self.conv2 = nn.Conv2d(32, 64, 5, padding=2)
    self.fc1 = nn.Linear(64*7*7, 1024)
    self.fc2 = nn.Linear(1024, 10)

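  def forward(self, x):
    # conv -> 2x2 max pool -> ReLU: 1x28x28 -> 32x14x14
    x = F.relu(F.max_pool2d(self.conv1(x), 2))
    # conv -> 2x2 max pool -> ReLU: 32x14x14 -> 64x7x7
    x = F.relu(F.max_pool2d(self.conv2(x), 2))
    # Flatten each sample to a vector of 64*7*7 = 3136 features
    x = x.view(-1, 64*7*7)
    x = F.relu(self.fc1(x))
    # Dropout is active only in training mode
    x = F.dropout(x, training=self.training)
    x = self.fc2(x)
    # Log-probabilities over the 10 digit classes; recent PyTorch versions expect an explicit dim
    return F.log_softmax(x, dim=1)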

model = Model()
model

Model (
  (conv1): Conv2d(1, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (conv2): Conv2d(32, 64, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
  (fc1): Linear (3136 -> 1024)
  (fc2): Linear (1024 -> 10)
)
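
The `64*7*7` input size of `fc1` follows from the shapes produced by the two convolution and pooling stages. The sketch below traces those shapes with a dummy 28x28 input. It assumes the same layer settings as the `Model` above but builds the layers standalone, so it is a sanity check and not part of the model itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same settings as conv1 and conv2 in the Model above
conv1 = nn.Conv2d(1, 32, 5, padding=2)
conv2 = nn.Conv2d(32, 64, 5, padding=2)

x = torch.zeros(1, 1, 28, 28)   # one dummy MNIST-sized grayscale image
with torch.no_grad():
    h1 = F.max_pool2d(conv1(x), 2)    # padding keeps 28x28, pooling halves it
    h2 = F.max_pool2d(conv2(h1), 2)   # padding keeps 14x14, pooling halves it
    flat = h2.view(-1, 64 * 7 * 7)

print(tuple(h1.shape))     # (1, 32, 14, 14)
print(tuple(h2.shape))     # (1, 64, 7, 7)
print(tuple(flat.shape))   # (1, 3136)
```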

We load the training and test data from the MNIST dataset.

In [11]:
batch_size = 50
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=True, download=True, transform=transforms.ToTensor()),
    batch_size=batch_size, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('data', train=False, transform=transforms.ToTensor()),
    batch_size=1000)

<b>Optimizer:</b> The optimizer updates the model parameters based on the computed gradients. Here we use stochastic gradient descent (SGD).

In [12]:
optimizer = optim.SGD(model.parameters(), lr=0.0003)
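
Without momentum or weight decay, one SGD step moves each parameter against its gradient: `w <- w - lr * grad`. The sketch below checks this on a toy parameter. The names `w` and `toy_optimizer`, the toy loss, and the learning rate of 0.1 are assumptions made for illustration and are unrelated to the model above.

```python
import torch
import torch.optim as optim

w = torch.tensor([1.0, -2.0], requires_grad=True)   # toy parameter
toy_optimizer = optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()        # toy loss whose gradient is 2 * w
toy_optimizer.zero_grad()
loss.backward()

# What plain SGD should produce: w - lr * grad
expected = w.detach() - 0.1 * w.grad
toy_optimizer.step()

print(w.detach())
print(torch.allclose(w.detach(), expected))
```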

<b>Loss Function:</b> For training and evaluation, we have to define a loss function that measures how closely the model's predictions match the target classes.

In [13]:
criterion = nn.CrossEntropyLoss()
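
`nn.CrossEntropyLoss` combines a log-softmax with the negative log-likelihood loss. The sketch below illustrates this on random scores (the logits and labels are made-up stand-ins). It also shows that passing log-probabilities, as the `Model` above does, gives the same loss value, because applying log-softmax a second time leaves log-probabilities unchanged.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

logits = torch.randn(4, 10)            # random scores: 4 samples, 10 classes
target = torch.tensor([3, 0, 9, 1])    # hypothetical labels

loss_fn = nn.CrossEntropyLoss()
ce = loss_fn(logits, target)

# The same loss written as log-softmax followed by negative log-likelihood
manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Feeding log-probabilities instead of raw scores yields the same value
ce_on_log_probs = loss_fn(F.log_softmax(logits, dim=1), target)

print(torch.allclose(ce, manual))
print(torch.allclose(ce, ce_on_log_probs))
```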

In [14]:
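def train(epoch):
  model.train()
  # Count batches from 1 so that a message is printed on every 1000th batch
  for i, (data, target) in enumerate(train_loader, 1):
    # Recent PyTorch versions accept tensors directly; Variable wrappers are no longer needed
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    # Batch accuracy in percent (computed but not reported)
    prediction = output.data.max(1)[1]
    accuracy = prediction.eq(target).sum().item() / len(target) * 100
    loss.backward()
    optimizer.step()
    if i % 1000 == 0:
      # The value printed after "Train Step" is the epoch index
      print('\nTrain Step: {}\tLoss: {:.3f}'.format(epoch, loss.item()))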

In [15]:
def test():
  model.eval()
  correct = 0
  # Gradients are not needed for evaluation
  with torch.no_grad():
    for data, target in test_loader:
      output = model(data)
      # The index of the largest log-probability is the predicted digit
      prediction = output.max(1)[1]
      correct += prediction.eq(target).sum().item()

  print('Test set: Accuracy: {:.2f}%'.format(100. * correct / len(test_loader.dataset)))

In [16]:
for epoch in range(15):
  train(epoch)
  test()

Train Step: 0	Loss: 2.289

Test set: Accuracy: 24.83%
Train Step: 1	Loss: 2.268

Test set: Accuracy: 37.09%
Train Step: 2	Loss: 2.254

Test set: Accuracy: 49.42%
Train Step: 3	Loss: 2.188

Test set: Accuracy: 58.45%
Train Step: 4	Loss: 2.116

Test set: Accuracy: 65.52%
Train Step: 5	Loss: 1.998

Test set: Accuracy: 70.11%
Train Step: 6	Loss: 1.648

Test set: Accuracy: 72.65%
Train Step: 7	Loss: 1.211

Test set: Accuracy: 77.72%
Train Step: 8	Loss: 0.850

Test set: Accuracy: 81.59%
Train Step: 9	Loss: 0.868

Test set: Accuracy: 83.75%
Train Step: 10	Loss: 0.512

Test set: Accuracy: 85.32%
Train Step: 11	Loss: 0.672

Test set: Accuracy: 86.81%
Train Step: 12	Loss: 0.386

Test set: Accuracy: 88.04%
Train Step: 13	Loss: 0.790

Test set: Accuracy: 88.84%
Train Step: 14	Loss: 0.337

Test set: Accuracy: 89.41%
