# Introduction 

In this post, we implement Lenet, widely acknowledged to be one of the first convolutional neural networks used in a practical setting, and also amongst the first to be trained via the backpropagation algorithm rather than being hand-designed. The original Lenet model was developed by Yann LeCun, Leon Bottou, Yoshua Bengio and Patrick Haffner to classify handwritten digits from 0 to 9. It was accurate enough that many banks adopted it to scan the digits on the millions of checks deposited at ATMs every day. In fact, there are some ATMs which still use the original code from the creators of Lenet today!

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.datasets as datasets

from tqdm import tqdm
from torchvision import transforms

# The architecture

We will implement Lenet-5, the version of the model that was used widely by banks to process checks at ATMs. Lenet-5 has seven layers: three convolutional layers, two pooling layers, and two fully connected layers. This is a very small network by modern standards. Many neural network architectures published in the last decade have tens of layers, and some have over one hundred.

However, old and small does not mean weak. We'll see that for the problem it was designed to solve, gray-scale hand-written digit recognition, Lenet-5 does extremely well. It actually made it to production at a large scale and performed very accurately and reliably, which is more than can be said for many far larger and fancier models being developed at companies today.


We stick as closely as possible to the original reference implementation that is described in the paper. Doing this, it turns out, is not straightforward. This is because the paper was written a long time ago, well before the most recent years of deep learning research. To complicate things, many implementations of Lenet available online deviate in ways big and small from the original paper, usually by incorporating convenient simplifications, and also by drawing on findings that were not known when Lenet was first released. We point these out where we can, because at least for me, it was interesting to see all the small ways in which the impact of so much research effort ends up being reflected in even 'simple' code.

# Lenet pooling layer

The Lenet pooling layer as described in the original paper is not commonly used today. Average pooling and max pooling have come to be more commonly used, but they were not yet the standard choices at the time the authors were writing. Note that most online implementations do not implement the precise pooling layer described in the paper.

The paper actually pools information after the convolutional layers in a specific way. First, a 2 by 2 kernel with a stride of 2 is passed over the input. These parameters are chosen to ensure that the kernel neighborhoods don't overlap. The 4 entries in each kernel neighborhood are then summed, and the sum is passed through a linear layer with learned coefficients. In our implementation, to accomplish the summation, we use the LPPool2d module from PyTorch, with a norm type of 1, a kernel size of 2, and a stride of 2. According to the documentation, a norm type of 1 gives sum pooling, which is exactly what we want.
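To make the summation concrete, here is a minimal check on a toy 4 by 4 input (the input values are made up for illustration). It compares `nn.LPPool2d` with a norm type of 1 against average pooling scaled by the window size, which should give the same sums.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A 4x4 single-channel "image" with values 0..15, shaped (batch, channel, H, W)
x = torch.arange(16, dtype=torch.float32).reshape(1, 1, 4, 4)

# norm_type=1 with a 2x2 window and stride 2 sums each non-overlapping window
sum_pool = nn.LPPool2d(norm_type=1, kernel_size=2, stride=2)
summed = sum_pool(x)

# Sum pooling is average pooling scaled by the window size (2 * 2 = 4)
scaled_avg = 4 * F.avg_pool2d(x, kernel_size=2, stride=2)

print(summed.squeeze())
print(torch.allclose(summed, scaled_avg))
```

The top-left window holds 0, 1, 4 and 5, so the first output entry is 10.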

In [None]:
class LenetPool(nn.Module):
    """Subsampling layer from LeCun et al. (1998)"""
    def __init__(self, kernel_size, sum_stride, in_features, out_features):
        """Initialization"""
        super(LenetPool, self).__init__()
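        # norm_type=1 turns LPPool2d into a sum over each window
        self.sum_pool = nn.LPPool2d(1, kernel_size, stride=sum_stride)
        # nn.Linear acts on the last (width) dimension of the pooled tensor
        self.linear = nn.Linear(in_features, out_features)
        self.layers = nn.Sequential(self.sum_pool, self.linear)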
    
    def forward(self, x):
        """Forward pass"""
        return self.layers(x)
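
The `nn.Linear` in `LenetPool` mixes values along the width of the pooled tensor. For comparison, here is a minimal sketch of a stricter reading of the paper's description, in which each feature map gets one learned coefficient and one learned bias. `PerMapSubsample`, `coeff` and `bias` are hypothetical names, and the initial values of 1 and 0 are assumptions.

```python
import torch
import torch.nn as nn

class PerMapSubsample(nn.Module):
    """Hypothetical variant: sum pooling, then one scale and one bias per feature map."""
    def __init__(self, channels, kernel_size=2, stride=2):
        super().__init__()
        self.sum_pool = nn.LPPool2d(1, kernel_size, stride=stride)
        # Shape (1, C, 1, 1) broadcasts across batch, height and width
        self.coeff = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        # Sum each 2x2 window, then apply the per-map affine transform
        return self.sum_pool(x) * self.coeff + self.bias

layer = PerMapSubsample(channels=6)
out = layer(torch.randn(8, 6, 28, 28))
num_params = sum(p.numel() for p in layer.parameters())
print(out.shape, num_params)
```

With six feature maps this variant has 12 trainable parameters, far fewer than a dense layer over the width.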

# Lenet activation function

Just like the Lenet pooling layer, the activation function described in the original paper is not commonly used today. Rectified Linear Units (ReLU) and their variants and relatives have become the default activation, and there has been a move away from sigmoidal functions like the sigmoid and tanh. This is largely to do with learning dynamics: these functions saturate, becoming very flat at large input magnitudes, which results in small gradients and hence slower learning. However, this was not yet common practice when the authors were designing Lenet. Furthermore, the activation function used in Lenet is not a vanilla tanh. The authors made some tweaks which they thought would improve performance.

The first modification is that the input to the activation is first passed through a linear layer, or in the paper's language, multiplied by a weight $S$ that sets the slope of the activation function near its origin. Secondly, the activation function output is then scaled by $A$, a value that is hard-coded in the paper to $1.7159$.

Online implementations tend to either use ReLU or just straight up tanh without these modifications, and performance doesn't seem to be affected much. But we stick to this implementation as that is what's in the paper. It's cool to think that actually going through the original work makes all these differences show up.

In [None]:
class LenetSigmoid(nn.Module):
    """Activation function for Lenet-5"""
    def __init__(self, in_features, out_features):
        super(LenetSigmoid, self).__init__()
        self.S = nn.Linear(in_features, out_features, bias=False)
        self.act = nn.Tanh()
        self.A = 1.7159
    
    def forward(self, x):
        """Forward pass"""
        x = self.S(x)
        x = self.act(x)
        x = self.A*x
        return x
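
To see what the scaling by $A$ does, here is a minimal sketch of the same idea with a single scalar slope in place of the linear layer. `ScalarSlopeTanh` is a hypothetical name, and the initial slope of 2/3 is an assumption: it is the value commonly paired with $A = 1.7159$, because it makes the function pass very close to 1 at an input of 1.

```python
import torch
import torch.nn as nn

class ScalarSlopeTanh(nn.Module):
    """Hypothetical variant: A * tanh(S * x) with one learnable scalar slope S."""
    def __init__(self, init_slope=2.0 / 3.0, amplitude=1.7159):
        super().__init__()
        self.S = nn.Parameter(torch.tensor(init_slope))  # slope near the origin
        self.A = amplitude                               # fixed output amplitude

    def forward(self, x):
        return self.A * torch.tanh(self.S * x)

act = ScalarSlopeTanh()
x = torch.tensor([-100.0, -1.0, 0.0, 1.0, 100.0])
y = act(x).detach()
print(y)
```

The outputs stay within plus or minus 1.7159, and inputs of -1 and 1 map to roughly -1 and 1, so the function is still far from saturation at those values.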

# The Lenet architecture

The Lenet architecture apart from this is straightforward, and the network given below is a faithful reproduction, apart from two interesting changes.

The first change is layer C3. In the original paper, the authors are careful to note that not all feature maps from S2's output are sent to every filter in C3. Instead, each of the 16 filters in C3 reads only a subset of S2's six feature maps: some read three, some read four, and only one reads all six. There are two reasons for this: 1) to encourage individual features to adapt and 2) to keep the number of connections manageable due to the limited amount of computational resources available at the time. This scheme reminded me of dropout regularization, which it predates by quite some time! Implementing it would have taken more work than I'm interested in putting in at the moment, but it might make for a nice future extension.
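
As a starting point for that extension, here is a minimal sketch of a partially connected C3 built by masking the weights of an ordinary convolution. `SparseC3` and `C3_INPUTS` are hypothetical names. The connection table is an assumption, written to follow Table I of the paper, and is worth checking against it before relying on it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed connection table: for each of the 16 C3 maps, the S2 maps it reads
C3_INPUTS = [
    [0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5], [0, 4, 5], [0, 1, 5],
    [0, 1, 2, 3], [1, 2, 3, 4], [2, 3, 4, 5],
    [0, 3, 4, 5], [0, 1, 4, 5], [0, 1, 2, 5],
    [0, 1, 3, 4], [1, 2, 4, 5], [0, 2, 3, 5],
    [0, 1, 2, 3, 4, 5],
]

class SparseC3(nn.Module):
    """Hypothetical C3 layer that only uses the connections listed in a table."""
    def __init__(self, table=C3_INPUTS, in_channels=6, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, len(table), kernel_size)
        mask = torch.zeros(len(table), in_channels, 1, 1)
        for out_idx, in_idxs in enumerate(table):
            mask[out_idx, in_idxs] = 1.0
        # A buffer moves with .to(device) but is not trained
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Zero the weights of disallowed (output, input) pairs on every pass
        return F.conv2d(x, self.conv.weight * self.mask, self.conv.bias)

c3 = SparseC3()
out = c3(torch.randn(2, 6, 14, 14))
active_weights = int(c3.mask.sum().item()) * 5 * 5
print(out.shape, active_weights + 16)
```

Because the masked weights are multiplied by zero, they also receive zero gradient, so the disallowed connections never contribute during training.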

The second change is to the output layer. The authors actually use a layer of radial basis function (RBF) units as the output layer. The network outputs 84 values, which are compared against a 7 by 12 bitmap of a stylized character for each class. The class whose bitmap is closest to the output, measured by squared Euclidean distance, is selected. We don't do this because a softmax final layer for classifiers is very standard these days, and also because it doesn't seem to affect performance much, but this is another nice future exercise, especially to keep the spirit of being authentic to what is actually in the paper.
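
Here is a minimal sketch of that distance-based output, assuming random placeholder prototypes instead of the real character bitmaps. `rbf_scores` and `prototypes` are hypothetical names, and the fake F6 outputs are constructed to sit near two chosen classes.

```python
import torch

torch.manual_seed(0)

# Placeholder prototypes: one 84-value (7 x 12) vector of +1/-1 per class.
# The real ones would be stylized character bitmaps; random values are an assumption.
prototypes = torch.randint(0, 2, (10, 84)).float() * 2 - 1

def rbf_scores(f6_out, prototypes):
    """Squared Euclidean distance from each F6 output to each class prototype."""
    # (N, 1, 84) - (1, 10, 84) -> (N, 10, 84), then sum over the 84 values
    diff = f6_out.unsqueeze(1) - prototypes.unsqueeze(0)
    return diff.pow(2).sum(dim=2)

# Fake F6 outputs that sit close to the prototypes of classes 3 and 7
f6_out = prototypes[[3, 7]] + 0.1 * torch.randn(2, 84)

# The predicted class is the one with the smallest distance
predicted = rbf_scores(f6_out, prototypes).argmin(dim=1)
print(predicted)  # tensor([3, 7])
```

Unlike a softmax layer, lower scores are better here, which is why the prediction uses `argmin` instead of `argmax`.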

In [None]:
class Lenet(nn.Module):
    """Lenet-5 architecture"""
    def __init__(self, num_classes):
        super(Lenet, self).__init__()
        
        # Convolutional block
        self.c1 = nn.Conv2d(1, 6, 5)
        self.a1 = LenetSigmoid(28, 28)
        
        # Pooling block
        self.s2 = LenetPool(2, 2, 14, 14)
        self.a2 = LenetSigmoid(14, 14)
        
        # Convolutional block
        self.c3 = nn.Conv2d(6, 16, 5)
        self.a3 = LenetSigmoid(10, 10)
        
        # Pooling block
        self.s4 = LenetPool(2, 2, 5, 5)
        self.a4 = LenetSigmoid(5, 5)
        
        # Flatten to prepare for fully connected blocks
        self.flatten = nn.Flatten()
        
        # Fully connected block
        self.f5 = nn.Linear(400, 120)
        self.a5 = LenetSigmoid(120, 120)
        
        # Fully connected block
        self.f6 = nn.Linear(120, 84)
        self.a6 = LenetSigmoid(84, 84)
        
        # Output block
        self.f7 = nn.Linear(84, num_classes)
        
        # Wrap all layers in a Sequential block
        self.layers = nn.Sequential(self.c1, self.a1, 
                                    self.s2, self.a2,
                                    self.c3, self.a3, 
                                    self.s4, self.a4,
                                    self.flatten,
                                    self.f5, self.a5, 
                                    self.f6, self.a6, 
                                    self.f7,)
    
    def forward(self, x):
        """Lenet-5 forward pass"""
        return self.layers(x)

# Testing for shapes

One side note while doing this: testing to make sure that we are getting the correct shapes turned out to be important. Given below is a very simple utility function that prints out the output shapes from applying each layer in sequence. One silly mistake that I was making while initially doing this was testing out without a batch dimension, and so the training loop wasn't working. 

In [None]:
def run_tests(shape, num_classes):
    """Test the forward pass for shapes"""
    
    # Create test instance
    x = torch.randn(shape)
    
    # Initialize the model
    model = Lenet(num_classes)
    
    # Iterate through the layers
    for layer in model.layers:
        
        # Apply each layer
        x = layer(x)
        
        # Print the summary
        print(layer.__class__.__name__, ':', x.shape)
    
    # Return the final output so the caller can inspect it
    return x
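
The shapes printed by this function can be cross-checked by hand with the usual output-size formula for convolutions and pooling windows. The sketch below is plain arithmetic, assuming no padding and a 32 by 32 input; `conv_out` is a hypothetical helper name.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size along one spatial dimension of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32
trace = [("input", size)]

# (layer name, kernel size, stride) for the spatial layers of Lenet-5
for name, kernel, stride in [("c1", 5, 1), ("s2", 2, 2), ("c3", 5, 1), ("s4", 2, 2)]:
    size = conv_out(size, kernel, stride)
    trace.append((name, size))

# 16 feature maps of size x size are flattened before the first linear layer
flattened = 16 * size * size

for name, s in trace:
    print(f"{name}: {s} x {s}")
print("flattened:", flattened)
```

The spatial sizes go 32, 28, 14, 10, 5, and the flattened size of 400 matches the input of `f5`. Calling `run_tests((1, 1, 32, 32), 10)`, with the batch dimension included, should report the same spatial sizes.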

# Training / evaluation results

Given below is a standard training and evaluation loop. Note that during preprocessing we resize the images to 32 by 32, because MNIST images are 28 by 28 while Lenet-5 expects 32 by 32 inputs. Apart from that, we just convert the images to tensors and apply a normalization. Since I am on an M1 Mac, I use the MPS device so that the GPU is utilized. The results show that the training loss decreases fairly steadily, and we get a high test accuracy of more than 97%, close to, though a little below, the roughly 99% reported in the original paper!
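
Resizing interpolates the pixels. One alternative, sketched here on a random stand-in image and not used in the loop below, is to pad each 28 by 28 image with a 2-pixel border of zeros so the digit itself is left untouched.

```python
import torch
import torch.nn.functional as F

# Stand-in for one MNIST image after ToTensor: (channel, H, W) with values in [0, 1]
image = torch.rand(1, 28, 28)

# pad takes (left, right, top, bottom) for the last two dimensions
padded = F.pad(image, (2, 2, 2, 2), mode="constant", value=0.0)

print(image.shape, "->", padded.shape)
```

In a torchvision pipeline, `transforms.Pad(2)` placed before `transforms.ToTensor()` should have the same effect. Padding before normalization matters, since a zero background pixel becomes -1 after `Normalize((0.5,), (0.5,))`.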

In [None]:
if __name__ == '__main__':
    
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    
    # Hyperparameters
    num_classes = 10
    batch_size = 128
    learning_rate = 0.001
    epochs = 20
    
    # Transform for MNIST (Normalize to mean 0.5, std 0.5)
    transform = transforms.Compose([transforms.Resize((32, 32)), 
                                    transforms.ToTensor(), 
                                    transforms.Normalize((0.5,), (0.5,))])

    # Load MNIST dataset
    train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
    test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

    train_loader = torch.utils.data.DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
    test_loader = torch.utils.data.DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)

    # Initialize model, loss function, and optimizer
    model = Lenet(num_classes).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # Training loop
    for epoch in range(epochs):
        model.train()
        total_loss = 0

        for batch_idx, (data, target) in tqdm(enumerate(train_loader)):
            
            # Send the batch to the selected device
            data, target = data.to(device), target.to(device)
            
            # Zero the gradients
            optimizer.zero_grad()

            # Forward pass
            output = model(data)

            # Compute loss
            loss = criterion(output, target)
            total_loss += loss.item()

            # Backward pass
            loss.backward()

            # Update weights
            optimizer.step()

        print(f"Epoch {epoch+1}/{epochs}, Loss: {total_loss/len(train_loader):.4f}")
        
    # Evaluate the model
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)
            
            
            _, predicted = torch.max(output, 1)
            total += target.size(0)
            correct += (predicted == target).sum().item()

    accuracy = 100 * correct / total
    print(f"Test Accuracy: {accuracy:.2f}%")

469it [00:17, 26.69it/s]
Epoch 1/20, Loss: 0.3028
469it [00:17, 26.87it/s]
Epoch 2/20, Loss: 0.1270
469it [00:15, 30.61it/s]
Epoch 3/20, Loss: 0.0970
469it [00:15, 30.85it/s]
Epoch 4/20, Loss: 0.0822
469it [00:15, 30.26it/s]
Epoch 5/20, Loss: 0.0746
469it [00:16, 29.07it/s]
Epoch 6/20, Loss: 0.0672
469it [00:15, 30.37it/s]
Epoch 7/20, Loss: 0.0647
469it [00:15, 30.53it/s]
Epoch 8/20, Loss: 0.0597
469it [00:15, 30.88it/s]
Epoch 9/20, Loss: 0.0532
469it [00:15, 30.61it/s]
Epoch 10/20, Loss: 0.0555
469it [00:15, 30.67it/s]
Epoch 11/20, Loss: 0.0521
469it [00:15, 30.74it/s]
Epoch 12/20, Loss: 0.0520
469it [00:15, 30.47it/s]
Epoch 13/20, Loss: 0.0511
469it [00:15, 30.45it/s]
Epoch 14/20, Loss: 0.0472
469it [00:15, 30.34it/s]
Epoch 15/20, Loss: 0.0520
469it [00:16, 28.73it/s]
Epoch 16/20, Loss: 0.0474
469it [00:15, 30.16it/s]
Epoch 17/20, Loss: 0.0471
469it [00:15, 29.87it/s]
Epoch 18/20, Loss: 0.0473
469it [00:15, 30.37it/s]
Epoch 19/20, Loss: 0.0467
469it [00:15, 30.50it/s]
Epoch 20/20, Loss: 0.0438
Test Accuracy: 97.86%


# Conclusion

There you have it, Lenet-5 in all (most) of its splendor! The most fun part of writing this post was noticing the small implementation differences between the original paper and the different versions online, including some in textbooks! As might be expected, these changes don't impact performance much, but they were still surprising to see. Lenet was one of the first examples of how neural network approaches could be competitive with the then dominant machine learning paradigm of support vector machines, and it served as an inspiration to the early deep learning pioneers that their ideas were promising and could be turned into useful innovations.