<a href="https://colab.research.google.com/github/foxtrotmike/CS909/blob/master/resnet_mnist.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Simplified ResNet for MNIST Digit Recognition
By [Fayyaz Minhas](https://sites.google.com/view/fayyaz/home)

In this tutorial, we'll build and examine a simplified ResNet (Residual Network) model for recognizing digits in the MNIST dataset using PyTorch. ResNet's key idea is the residual block, whose skip connections make deeper neural networks trainable by mitigating the vanishing gradient problem. Let's explore how to implement this architecture step-by-step.

## Setting the Stage
First, we establish our working environment by importing necessary libraries and configuring our device. This step ensures that our model can leverage GPU acceleration if available:

In [1]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')


## Preparing the Data
The MNIST dataset, consisting of 28x28 pixel grayscale images of handwritten digits, serves as our training and testing ground. We utilize PyTorch's torchvision package to load and transform the dataset into tensor format for our model. Data loaders are then employed to batch and shuffle our dataset, preparing it for the training and testing process.

In [2]:
# Hyperparameters
num_epochs = 5
num_classes = 10
batch_size = 100
learning_rate = 0.001

# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data/',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='./data/',
                                          train=False,
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)
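
As a rough illustration of what batching and shuffling mean here, the sketch below splits sample indices into mini-batches in plain Python. It is not how `DataLoader` is implemented; the helper name `iterate_minibatches` and the fixed seed are assumptions made for the example (a real `DataLoader` with `shuffle=True` draws a new order every epoch).

```python
import math
import random

def iterate_minibatches(num_samples, batch_size, shuffle, seed=0):
    """Yield lists of sample indices, one list per mini-batch (hypothetical helper)."""
    indices = list(range(num_samples))
    if shuffle:
        # A fixed seed keeps this illustration reproducible.
        random.Random(seed).shuffle(indices)
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]

# MNIST has 60,000 training images; with batch_size = 100 this gives
# 600 mini-batches per epoch, the step count reported in the training log.
train_batches = list(iterate_minibatches(60_000, 100, shuffle=True))
steps_per_epoch = math.ceil(60_000 / 100)
print(len(train_batches), steps_per_epoch)  # 600 600
```

Every sample appears in exactly one mini-batch per epoch; shuffling only changes which samples end up together.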

## Building the Residual Block
The heart of ResNet is the Residual Block. This component is designed to learn residual functions with reference to the layer inputs, allowing the network to have additional paths for gradient flow during backpropagation. This is achieved by introducing a shortcut (or skip connection) that bypasses one or more layers.

### Components of a Residual Block
A residual block typically contains two convolutional layers, each followed by a normalization layer, together with an activation function and a skip connection. In the context of the MNIST example:

* Convolutional Layers: These layers hold the learnable filters. The first convolutional layer (conv1) applies a set of filters to the input and is followed by batch normalization (bn1). The second convolutional layer (conv2) further processes the output of the first and is followed by another batch normalization layer (bn2).
* Batch Normalization (BN): Each convolutional layer is followed by a batch normalization layer. BN normalizes the output of the preceding convolution, channel by channel, by subtracting the batch mean and dividing by the batch standard deviation. This stabilizes and speeds up training and reduces the sensitivity to network initialization.
* Activation Function: A non-linear activation function (ReLU, in this case) is applied after the first batch normalization layer and again after the skip connection has been added. The ReLU (Rectified Linear Unit) function introduces non-linearity into the model, allowing it to learn more complex patterns in the data.
* Skip Connection: The most critical feature of a residual block is the skip connection that adds the input of the residual block to its output. This mechanism, implemented by the out += residual step shown below, allows the gradient to be directly backpropagated to earlier layers. If the dimensions of the input and output of the residual block do not match, a downsampling layer (downsample) adjusts the dimensions of the input before it is added to the block's output. This is often accomplished with a convolutional layer with a kernel size of 1 (also known as a 1x1 convolution) in the skip connection.
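
The batch-normalization step in the list above can be made concrete with a few numbers. The following is a minimal plain-Python sketch, not PyTorch's implementation: the function name `batch_norm` is hypothetical, and `gamma`, `beta`, and `eps` stand in for the learnable scale, learnable shift, and small stabilizing constant that a layer such as `nn.BatchNorm2d` maintains per channel.

```python
import math

def batch_norm(values, eps=1e-5, gamma=1.0, beta=0.0):
    """Normalize a list of activations with batch statistics, then scale and shift."""
    mean = sum(values) / len(values)
    # Biased variance (divide by N) is used for the normalization itself.
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]

# Four activations of one channel across a mini-batch (made-up numbers).
activations = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(activations)
print([round(v, 3) for v in normalized])  # [-1.342, -0.447, 0.447, 1.342]
```

After normalization the values have mean close to 0 and variance close to 1, whatever the scale of the incoming activations was.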

In [3]:
'''
if self.downsample is not None:
    residual = self.downsample(x)
out += residual
'''

### How It Works
The key idea behind a residual block is to learn a residual function F(x) that is added to the identity mapping of the input, so that the block computes F(x) + x rather than learning the entire transformation from scratch. This concept is encapsulated in the block's forward path:

* The input x is passed through the first convolutional layer, batch normalization, and ReLU, and then through the second convolutional layer and batch normalization.
* Simultaneously, the input x is carried over directly through the skip connection. If necessary, it is transformed to match the dimensions.
* The outputs of the convolutional path and the skip connection path are added together.
* A final ReLU activation is applied to the combined output.

This architecture allows the network to learn more efficiently. If additional layers do not improve the model, they can learn to approximate a zero function, so that the block effectively passes its input along the skip connection. Thus, even very deep networks can be trained with far less degradation, leading to significant improvements in various deep learning tasks.
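
To see why the skip connection helps gradients, consider a deliberately simplified toy model: scalar "layers" of the form F(x) = a * x, with no ReLU or batch normalization. The function name `gradient_through_stack` and the chosen numbers are assumptions for illustration only. By the chain rule, a plain stack multiplies the gradient by a at each layer, while a residual stack multiplies it by 1 + a.

```python
def gradient_through_stack(depth, local_slope, residual):
    """Derivative of the stack's output with respect to its input (toy scalar model).

    Each toy layer computes F(x) = local_slope * x.
    Plain stack:     x_next = F(x)      -> per-layer factor is local_slope
    Residual stack:  x_next = x + F(x)  -> per-layer factor is 1 + local_slope
    """
    factor = 1.0 + local_slope if residual else local_slope
    grad = 1.0
    for _ in range(depth):
        grad *= factor  # chain rule: multiply the per-layer derivatives
    return grad

plain = gradient_through_stack(depth=20, local_slope=0.1, residual=False)
skip = gradient_through_stack(depth=20, local_slope=0.1, residual=True)
zero_branch = gradient_through_stack(depth=20, local_slope=0.0, residual=True)
print(f"plain: {plain:.3e}, residual: {skip:.3f}, residual with F = 0: {zero_branch:.1f}")
```

In this toy setting the plain stack's gradient shrinks towards zero with depth, whereas the residual stack keeps a usable gradient. When the learned branch is exactly zero, the stack reduces to the identity and the gradient is 1.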

Here is the complete code for the residual block:

In [4]:
# Residual block
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out

## Assembling the ResNet Model
Assembling the ResNet model means stacking multiple residual blocks into a single network, so that gradients can flow through the skip connections during backpropagation. The simplified version below, tailored to MNIST, consists of an initial convolution, two layers of residual blocks, adaptive average pooling, and a fully connected output layer.



In [5]:

# ResNet
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super(ResNet, self).__init__()
        self.in_channels = 16
        self.conv = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU(inplace=True)
        self.layer1 = self.make_layer(block, 16, layers[0], stride=1)
        self.layer2 = self.make_layer(block, 32, layers[1], stride=2)
        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))  # Pool each feature map to 1x1, whatever its input size
        self.fc = nn.Linear(32, num_classes)

    def make_layer(self, block, out_channels, blocks, stride):
        downsample = None
        if (stride != 1) or (self.in_channels != out_channels):
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))
        layers = []
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels
        for i in range(1, blocks):
            layers.append(block(out_channels, out_channels))
        return nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv(x)
        out = self.bn(out)
        out = self.relu(out)
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.avg_pool(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out


### Initializing the ResNet Model
The ResNet class extends nn.Module, PyTorch's base class for all neural network modules. The constructor of the ResNet class initializes the components that will be used in the network.

* block: This parameter specifies the type of residual block to use within the network. In our case, it's the ResidualBlock class defined earlier.
* layers: A list that specifies the number of residual blocks to be included in each layer of the network. For example, [2, 2] means there are two sections (or layers) in the network, each containing two residual blocks.
* num_classes: The number of output classes. For MNIST, this is 10, corresponding to the digits 0 through 9.

### Constructing the Initial Convolutional Layer
Before the residual blocks, an initial convolutional layer processes the input images. This layer prepares the input for the residual blocks by applying a 3x3 convolution that preserves the spatial dimensions (due to stride=1 and padding=1) and increases the number of channels from 1 (grayscale image) to 16. This is followed by batch normalization and ReLU activation.
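
The effect of kernel size, stride, and padding on the spatial size can be checked with the standard convolution output-size formula. The helper below is a small sketch (the name `conv_output_size` is hypothetical, and dilation is assumed to be 1); it is applied to the three convolution configurations that appear in this model.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension (dilation 1)."""
    return (size + 2 * padding - kernel_size) // stride + 1

# Initial layer: 3x3 kernel, stride 1, padding 1 keeps 28x28 images at 28x28.
print(conv_output_size(28, kernel_size=3, stride=1, padding=1))  # 28
# The same kernel with stride 2 (first block of layer2) halves the size.
print(conv_output_size(28, kernel_size=3, stride=2, padding=1))  # 14
# The 1x1 shortcut convolution with stride 2 and no padding also gives 14,
# so both branches of the block can be added element-wise.
print(conv_output_size(28, kernel_size=1, stride=2, padding=0))  # 14
```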

### Building Layers with Residual Blocks
The core of the ResNet model is its layers of residual blocks. The make_layer method constructs these layers:

* Each call to make_layer creates a sequence of residual blocks (blocks) with specified output channels (out_channels) and a stride that controls downsampling.
* The first block in each sequence may include downsampling (stride != 1) to reduce the spatial dimensions of the feature maps, following the ResNet design of decreasing resolution while increasing the number of channels.
* If the stride is not 1 or the number of channels changes, a downsample branch (a 1x1 convolution followed by batch normalization) adjusts the input's dimensions to enable element-wise addition with the block's output.
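
The decision logic described in this list can be traced without building any PyTorch modules. The sketch below mirrors the conditions used in make_layer and simply records which kind of shortcut each block would get; the helper name `plan_layer` and the dictionary layout are assumptions made for this illustration.

```python
def plan_layer(in_channels, out_channels, num_blocks, stride):
    """Describe the blocks a make_layer-style builder would create.

    Returns (channels after the layer, list of block descriptions).
    """
    needs_projection = stride != 1 or in_channels != out_channels
    blocks = [{
        "in": in_channels, "out": out_channels, "stride": stride,
        "shortcut": "1x1 conv + BN" if needs_projection else "identity",
    }]
    # The remaining blocks keep the shape, so their shortcut is a plain identity.
    for _ in range(1, num_blocks):
        blocks.append({"in": out_channels, "out": out_channels,
                       "stride": 1, "shortcut": "identity"})
    return out_channels, blocks

plan = []
channels = 16  # output channels of the initial convolution
for out_channels, stride in [(16, 1), (32, 2)]:  # layer1 and layer2
    channels, blocks = plan_layer(channels, out_channels, num_blocks=2, stride=stride)
    plan.extend(blocks)

for block in plan:
    print(block)
```

With the [2, 2] configuration used in this tutorial, only the first block of layer2 gets a projection shortcut, because that is the only place where the stride and the channel count change.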

### Adaptive Pooling and Output Layer
After the residual blocks, an adaptive average pooling layer reduces each feature map to a single value, making the network's output size independent of the input size. This pooled output is then flattened and passed through a fully connected layer to produce the final class predictions. The output of this layer matches the number of classes in the dataset, providing the logits for each class.
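
Pooling to a 1x1 output amounts to averaging each feature map over all of its spatial positions. The plain-Python sketch below (with a hypothetical helper `global_average_pool` and made-up values) shows why the result has one number per channel regardless of the spatial resolution.

```python
def global_average_pool(feature_maps):
    """Reduce each 2-D feature map (a list of rows) to its mean value."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Two channels at 2x2 resolution, and two channels at 4x4 resolution.
small = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [0.0, 8.0]]]
large = [[[4.0] * 4 for _ in range(4)], [[2.0] * 4 for _ in range(4)]]
print(global_average_pool(small))  # [4.0, 2.0]
print(global_average_pool(large))  # [4.0, 2.0]
```

Both inputs produce a vector of length 2 (one entry per channel), which is why the fully connected layer in the model only needs to know the channel count, 32, and not the image size.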

### Forward Pass
The forward method defines the path of the input data through the network.
It sequentially applies the initial convolutional layer, the residual layers, the adaptive pooling, and the fully connected layer.
The skip connections within each residual block ensure that the input signal can bypass the convolutional operations, facilitating the training of deeper networks by improving gradient flow.

## Training and Testing
With our model defined, we proceed to train it on the MNIST dataset. During training, we update the model parameters after every mini-batch and print the loss every 100 steps. After training, we evaluate the accuracy on the test set and save the model weights. The training and evaluation code is as follows:

In [6]:
model = ResNet(ResidualBlock, [2, 2], num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, i+1, total_step, loss.item()))

# Test the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))

# Save the model checkpoint
torch.save(model.state_dict(), 'resnet_model.ckpt')


Epoch [1/5], Step [100/600], Loss: 0.7457
Epoch [1/5], Step [200/600], Loss: 0.2319
Epoch [1/5], Step [300/600], Loss: 0.2321
Epoch [1/5], Step [400/600], Loss: 0.1181
Epoch [1/5], Step [500/600], Loss: 0.1656
Epoch [1/5], Step [600/600], Loss: 0.1268
Epoch [2/5], Step [100/600], Loss: 0.0602
Epoch [2/5], Step [200/600], Loss: 0.0578
Epoch [2/5], Step [300/600], Loss: 0.0854
Epoch [2/5], Step [400/600], Loss: 0.0738
Epoch [2/5], Step [500/600], Loss: 0.0862
Epoch [2/5], Step [600/600], Loss: 0.1333
Epoch [3/5], Step [100/600], Loss: 0.0799
Epoch [3/5], Step [200/600], Loss: 0.0346
Epoch [3/5], Step [300/600], Loss: 0.0621
Epoch [3/5], Step [400/600], Loss: 0.0748
Epoch [3/5], Step [500/600], Loss: 0.1218
Epoch [3/5], Step [600/600], Loss: 0.0456
Epoch [4/5], Step [100/600], Loss: 0.0359
Epoch [4/5], Step [200/600], Loss: 0.0518
Epoch [4/5], Step [300/600], Loss: 0.0235
Epoch [4/5], Step [400/600], Loss: 0.0382
Epoch [4/5], Step [500/600], Loss: 0.0354
Epoch [4/5], Step [600/600], Loss:
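
To help interpret the loss values in the log, the sketch below recomputes the two quantities used above in plain Python: the cross-entropy loss (the mean negative log-probability of the true class under a softmax over the logits) and the accuracy (how often the largest logit is at the true class). The function names and the example logits are made up for illustration; this is not PyTorch's implementation.

```python
import math

def cross_entropy(logits_batch, labels):
    """Mean negative log-likelihood of the true labels under softmax(logits)."""
    total = 0.0
    for logits, label in zip(logits_batch, labels):
        # Subtract the max logit before exponentiating for numerical stability.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_sum_exp - logits[label]
    return total / len(labels)

def accuracy(logits_batch, labels):
    """Percentage of samples whose largest logit is at the true label's index."""
    predicted = [max(range(len(logits)), key=logits.__getitem__)
                 for logits in logits_batch]
    correct = sum(int(p == y) for p, y in zip(predicted, labels))
    return 100.0 * correct / len(labels)

# Three made-up samples with three classes each (MNIST has ten classes).
logits_batch = [[2.0, 0.5, 0.1], [0.2, 0.1, 3.0], [1.0, 1.2, 0.3]]
labels = [0, 2, 0]
print(f"loss = {cross_entropy(logits_batch, labels):.4f}")
print(f"accuracy = {accuracy(logits_batch, labels):.1f} %")  # 66.7 %

# A classifier that assigns equal logits to all ten digits has loss ln(10).
uniform_loss = cross_entropy([[0.0] * 10], [3])
print(f"uniform-guess loss = {uniform_loss:.4f}")  # 2.3026
```

The uniform-guess value of about 2.30 is a useful reference point: losses well below it, such as those in the log above, indicate that the model assigns most of the probability to the correct digit.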

## Conclusion
By implementing a simplified ResNet architecture, we've built a digit-recognition model for the MNIST dataset whose training loss drops quickly within the first few epochs. Residual blocks, with their skip connections, allow deeper networks to be trained more effectively. This tutorial illustrates the practical application of ResNet and underscores the importance of architectural innovations in improving model performance.

Here is the complete code in a single block.

In [7]:
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms

# Device configuration
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

# Hyperparameters
num_epochs = 5
num_classes = 10
batch_size = 100
learning_rate = 0.001

# MNIST dataset
train_dataset = torchvision.datasets.MNIST(root='./data/',
                                           train=True,
                                           transform=transforms.ToTensor(),
                                           download=True)

test_dataset = torchvision.datasets.MNIST(root='./data/',
                                          train=False,
                                          transform=transforms.ToTensor())

# Data loader
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
                                           batch_size=batch_size,
                                           shuffle=True)

test_loader = torch.utils.data.DataLoader(dataset=test_dataset,
                                          batch_size=batch_size,
                                          shuffle=False)

# Residual block
class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, downsample=None):
        super(ResidualBlock, self).__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.downsample = downsample

    def forward(self, x):
        residual = x
        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)
        out = self.conv2(out)
        out = self.bn2(out)
        if self.downsample is not None:
            residual = self.downsample(x)
        out += residual
        out = self.relu(out)
        return out

# ResNet
class ResNet(nn.Module):
    def __init__(self, block, layers, num_classes=10):
        super(ResNet, self).__init__()
        self.in_channels = 16
        self.conv = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1, bias=False)
        self.bn = nn.BatchNorm2d(16)
        self.relu = nn.ReLU(inplace=True)
        self.layer1 = self.make_layer(block, 16, layers[0], stride=1)
        self.layer2 = self.make_layer(block, 32, layers[1], stride=2)
        self.avg_pool = nn.AdaptiveAvgPool2d((1, 1))  # Pool each feature map to 1x1, whatever its input size
        self.fc = nn.Linear(32, num_classes)

    def make_layer(self, block, out_channels, blocks, stride):
        downsample = None
        if (stride != 1) or (self.in_channels != out_channels):
            downsample = nn.Sequential(
                nn.Conv2d(self.in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels))
        layers = []
        layers.append(block(self.in_channels, out_channels, stride, downsample))
        self.in_channels = out_channels
        for i in range(1, blocks):
            layers.append(block(out_channels, out_channels))
        return nn.Sequential(*layers)

    def forward(self, x):
        out = self.conv(x)
        out = self.bn(out)
        out = self.relu(out)
        out = self.layer1(out)
        out = self.layer2(out)
        out = self.avg_pool(out)
        out = out.view(out.size(0), -1)
        out = self.fc(out)
        return out

model = ResNet(ResidualBlock, [2, 2], num_classes).to(device)

# Loss and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

# Train the model
total_step = len(train_loader)
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (i+1) % 100 == 0:
            print('Epoch [{}/{}], Step [{}/{}], Loss: {:.4f}'.format(epoch+1, num_epochs, i+1, total_step, loss.item()))

# Test the model
model.eval()
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('Test Accuracy of the model on the 10000 test images: {} %'.format(100 * correct / total))

# Save the model checkpoint
torch.save(model.state_dict(), 'resnet_model.ckpt')

Epoch [1/5], Step [100/600], Loss: 0.6256
Epoch [1/5], Step [200/600], Loss: 0.2521
Epoch [1/5], Step [300/600], Loss: 0.1570
Epoch [1/5], Step [400/600], Loss: 0.2393
Epoch [1/5], Step [500/600], Loss: 0.1080
Epoch [1/5], Step [600/600], Loss: 0.1157
Epoch [2/5], Step [100/600], Loss: 0.0701
Epoch [2/5], Step [200/600], Loss: 0.1114
Epoch [2/5], Step [300/600], Loss: 0.0491
Epoch [2/5], Step [400/600], Loss: 0.0185
Epoch [2/5], Step [500/600], Loss: 0.0881
Epoch [2/5], Step [600/600], Loss: 0.0315
Epoch [3/5], Step [100/600], Loss: 0.0377
Epoch [3/5], Step [200/600], Loss: 0.0598
Epoch [3/5], Step [300/600], Loss: 0.0303
Epoch [3/5], Step [400/600], Loss: 0.0234
Epoch [3/5], Step [500/600], Loss: 0.0731
Epoch [3/5], Step [600/600], Loss: 0.0512
Epoch [4/5], Step [100/600], Loss: 0.0527
Epoch [4/5], Step [200/600], Loss: 0.0272
Epoch [4/5], Step [300/600], Loss: 0.0251
Epoch [4/5], Step [400/600], Loss: 0.0806
Epoch [4/5], Step [500/600], Loss: 0.0345
Epoch [4/5], Step [600/600], Loss: