# Classification of the MNIST-dataset

Classifying the hand-drawn digits in the MNIST dataset has been a benchmark for models for years. The dataset contains 60 000 training images and 10 000 test images. This might seem like a lot, but considering that the average neural network used for classification on this dataset has about 200 000 parameters, the accuracy such models reach is rather interesting. In this notebook, I explore the prediction accuracy of a basic neural network and of one with convolutions.

This notebook contains the first Convolutional Neural Network I have written, and the daunting task of tuning the layers proved quite a struggle. The process I went through was as follows:

#### Convolutions:

Almost all CNNs start with a convolution layer. The aim is to extract patterns (or features) from the images, to better emulate the way humans classify images. These convolutions come with quite a lot of tuning, however. We start with channels. The MNIST dataset is grayscale, so each image has one input channel. The number of output channels is the number of filters applied in the layer and can be thought of as the number of different patterns we would like the layer to learn. Increasing this hyperparameter increases the complexity of the model and can lead to overfitting. It is common to choose 16 or 32, but without some experimenting the best choice is unclear. Next, we have to choose the size of the filter, its stride (how far it moves at each step) and the padding of the image. Using reference (2), we can pick these hyperparameters so that every pixel of the image is covered by the same number of filter positions. For our 28x28 images the filter size will most likely be 3x3; with a stride of 1, this requires a padding of 1 on each side.
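
The padding choice can be checked with the standard output-size formula for a convolution, `(size + 2 * padding - kernel) // stride + 1`. The sketch below is only an illustration; the helper name `conv_output_size` is made up for this example and is not part of `MNIST_classifier`.

```python
def conv_output_size(size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# 28x28 MNIST image, 3x3 filter, stride 1, padding 1: the size is preserved
padded = conv_output_size(28, kernel=3, stride=1, padding=1)

# Same filter without padding: the output shrinks by one pixel on each side
unpadded = conv_output_size(28, kernel=3, stride=1, padding=0)

print(padded, unpadded)  # 28 26
```

With a padding of 1 the output stays at 28x28, so the filter is centred on every pixel once, including those on the border.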

#### Batch-Normalization and Dropout
A quick search on Google Scholar for the importance of Batch Normalization (BN) and Dropout, and for where to put them, highlights the ongoing debate on this matter. Dropout is the process of choosing neurons at random and ignoring their output. It aims to reduce overfitting by encouraging the network to rely on multiple neurons for a given pattern. Batch Normalization aims to speed up convergence to a minimal loss by reducing the internal covariate shift. The original paper by Ioffe and Szegedy (reference 3) suggests putting batch normalization before the activation functions. This, however, arguably defeats the point of BN, as supported by reference 4. Also, reference 5 (and many tests) suggests that combining dropout regularization with BN leads to worse results than using BN alone. As such, BN will be used without dropout, and the BN layers will be placed after the activation layers.
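
To make the two operations concrete, here is a minimal pure-Python sketch of what each one does to a single feature during training. It is an assumption-laden illustration, not the code used in this notebook: the function names are hypothetical, the dropout shown is the common "inverted" variant that rescales the kept values, and the batch normalization omits the learnable scale and shift that `nn.BatchNorm2d` adds when `affine=True`.

```python
import random
from statistics import fmean, pvariance


def dropout(values, p=0.5, rng=random):
    """Inverted dropout: zero each value with probability p, rescale the rest."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def batch_norm(values, eps=1e-5):
    """Normalize one feature across a batch to zero mean and unit variance."""
    mean = fmean(values)
    var = pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]


rng = random.Random(0)                 # fixed seed so the run is repeatable
activations = [0.5, 1.5, 2.0, 4.0]     # one feature over a batch of four samples

dropped = dropout(activations, p=0.5, rng=rng)
normalized = batch_norm(activations)

print(dropped)
print(normalized)
```

Each dropped value is either zero or twice the original, while the normalized values have a mean of zero and a variance close to one.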



In [1]:
import MNIST_classifier as mn
from torch import nn
import torch
from torch.utils.data import TensorDataset, DataLoader
training_dataloader, testing_dataloader = mn.init_MNIST(True)




0.13066047430038452
0.30810782313346863


In [2]:
model = mn.CNN().cuda()
#model.train_from_dataset(1, training_dataloader, testing_dataloader)

In [3]:
from sklearn.datasets import fetch_openml
from tensorflow.keras.preprocessing.image import ImageDataGenerator # type: ignore
mnist = fetch_openml('mnist_784', version=1)

# Get the data and target
X, y = mnist["data"], mnist["target"]
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
X_train = torch.tensor(X_train.astype(float).values)
X_test = torch.tensor(X_test.astype(float).values)
y_train = torch.tensor(y_train.astype(int).values)
y_test = torch.tensor(y_test.astype(int).values)

test_dataset = TensorDataset(X_test, y_test)
test_dataloader = DataLoader(test_dataset, batch_size=512, shuffle=True)

datagen = ImageDataGenerator(
    rotation_range=10,
    zoom_range=0.10,
    width_shift_range=0.1,
    height_shift_range=0.1,
)
#model.train_from_generator(40000, datagen, X_train, y_train, test_dataloader)

In [4]:
class CNN(nn.Module):
    
    def __init__(self):
        super(CNN, self).__init__()
        # is_available must be called; the bare function object is always truthy
        self.device = "cuda" if torch.cuda.is_available() else "cpu"
        self.features = nn.Sequential(
            nn.Conv2d(in_channels=1, out_channels=16, kernel_size=(3,3), stride=1, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(16),
            nn.MaxPool2d(2),
            nn.Conv2d(in_channels=16, out_channels=32, kernel_size=(3, 3), stride=1, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(32),
            nn.MaxPool2d(2)
            
        )
        self.classify = nn.Sequential(
            nn.Linear(in_features=7*7*32, out_features=256),
            nn.ReLU(),
            nn.BatchNorm1d(256),
            nn.Linear(in_features=256, out_features=10)         
        )
        

    def forward(self, x):
        x = self.features(x)
        x = x.view(-1, 32*7*7)
        x = self.classify(x)
        return x

In [5]:
model = CNN().cpu()
# map_location lets weights saved from a GPU run load on a CPU-only machine
model.load_state_dict(torch.load("Number Drawing Game/models/final_model.pth", map_location="cpu"))
model.eval()

CNN(
  (features): Sequential(
    (0): Conv2d(1, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
    (2): BatchNorm2d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (4): Conv2d(16, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (5): ReLU()
    (6): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (classify): Sequential(
    (0): Linear(in_features=1568, out_features=256, bias=True)
    (1): ReLU()
    (2): BatchNorm1d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (3): Linear(in_features=256, out_features=10, bias=True)
  )
)
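
The layer shapes printed above are enough to work out the number of trainable parameters by hand. The sketch below does this in plain Python; the helper names are hypothetical, and it counts only weights, biases and the BN scale and shift (the BN running statistics are buffers, not trainable parameters).

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def batch_norm_params(n):
    """Learnable scale and shift of a BN layer with affine=True."""
    return 2 * n


# The padded 3x3 convolutions keep the spatial size; each MaxPool2d(2) halves it
side = 28
side //= 2                    # after the first pooling layer: 14
side //= 2                    # after the second pooling layer: 7
flattened = side * side * 32  # input size of the first linear layer

counts = {
    "conv1": conv_params(1, 16, 3),
    "bn1": batch_norm_params(16),
    "conv2": conv_params(16, 32, 3),
    "bn2": batch_norm_params(32),
    "fc1": linear_params(flattened, 256),
    "bn3": batch_norm_params(256),
    "fc2": linear_params(256, 10),
}
total = sum(counts.values())

print(flattened)  # 1568
print(total)      # 409642
```

Almost all of these parameters sit in the first linear layer (`fc1`, 401 664 of them), not in the convolutions.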

In [6]:
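import pandas as pd
import torchvision.transforms as transforms

# Load the Kaggle test set: one flattened 28x28 image (784 pixel columns) per row
test_df = pd.read_csv(filepath_or_buffer="data/test.csv")
kaggle_test = torch.Tensor(test_df.values)

# Normalize with the two values printed in the first cell's output (used as mean and std)
transform = transforms.Compose([
    transforms.Normalize((0.13066047430038452,), (0.30810782313346863,))
])
inputs = transform(kaggle_test.unsqueeze(1)).view(-1, 1, 28, 28)

# Inference only, so gradient tracking is not needed
with torch.no_grad():
    output = model(inputs)
numbers = torch.argmax(output, dim=-1)

# Write the predictions in the Kaggle submission format (ImageId starts at 1);
# the with-block closes the file, so no explicit f.close() is required
with open("output.txt", "w") as f:
    f.write("ImageId,Label\n")
    for i, label in enumerate(numbers.tolist(), start=1):
        f.write(f"{i},{label}\n")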


## References

<ol>
<li>Zhang, A., Lipton, Z., Li, M. and Smola, A. (2023). Dive Into Deep Learning. [online] Cambridge University Press. Available at: https://d2l.ai.</li>
<li>asiltureli.github.io. (n.d.). ConvNet Size Calculator. [online] Available at: https://asiltureli.github.io/Convolution-Layer-Calculator/ [Accessed 4 May 2024].</li>
<li>Ioffe, S. and Szegedy, C. (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. [online] Available at: https://arxiv.org/abs/1502.03167 [Accessed 4 May 2024].</li>
<li>Rosebrock, A. (2021). Convolutional Neural Networks (CNNs) and Layer Types. [online] PyImageSearch. Available at: https://pyimagesearch.com/2021/05/14/convolutional-neural-networks-cnns-and-layer-types/.</li>
<li>Li, X., Chen, S., Hu, X. and Yang, J. (2018). Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift. [online] Available at: https://arxiv.org/pdf/1801.05134 [Accessed 4 May 2024].</li>
</ol>