# MNIST Digit Recognizer

**Authors: Clement, Calvin, Tilova**

---

Welcome to the very first project of the **Tequila Chicas**! We will be classifying images of handwritten digits into their corresponding digit classes. This project follows the guidelines and uses the dataset provided by the Kaggle competition [here](https://www.kaggle.com/competitions/digit-recognizer/overview).

## Introduction  


<a id = 'toc'></a>
    
## Table of Contents
---
1. [Convolutional Neural Network](#CNN)

**Importing Libraries**

In [1]:
import numpy as np
import pandas as pd

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Train_Test_Split
from sklearn.model_selection import train_test_split

# Scaling
from sklearn.preprocessing import StandardScaler

# PyTorch
import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader

# Progress bar from tqdm
from tqdm.notebook import tnrange

# suppress warnings
import warnings
warnings.filterwarnings('ignore')

<a id = 'CNN'></a>
### 1. Convolutional Neural Network
---
Load the training and test set CSV files.

In [2]:
df_train = pd.read_csv('../data/train.csv')
df_test = pd.read_csv('../data/test.csv')
df_train.shape, df_test.shape

((42000, 785), (28000, 784))

We need to extract our independent (X) and dependent (y) variables from the dataset as NumPy arrays. The first column holds the label and the remaining 784 columns hold the pixel values.

In [3]:
X = df_train.iloc[:, 1:].to_numpy()
y = df_train.iloc[:, 0].to_numpy()

# sanity check
print(X.shape, y.shape)

(42000, 784) (42000,)


We will use **train_test_split()** to split our dataset into training and validation sets.
- The validation set is 25% of the data.
- `stratify=y` ensures that the class distribution remains the same in both the training and validation sets.

In [4]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, stratify=y)
X_train.shape, y_train.shape

((31500, 784), (31500,))
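
To make the effect of `stratify=y` concrete, here is a minimal, hand-rolled sketch of a stratified split using only the standard library. It is not how scikit-learn implements it; the helper name `stratified_split` and the toy labels are assumptions made purely for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction=0.25, seed=0):
    """Return (train_idx, val_idx) so each class keeps roughly the same share in both sets."""
    rng = random.Random(seed)

    # Group the sample indices by their class label
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx


# Toy, imbalanced labels: 80 zeros and 20 ones (hypothetical data)
labels = [0] * 80 + [1] * 20
train_idx, val_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
val_counts = Counter(labels[i] for i in val_idx)
print(train_counts, val_counts)
```

Both sets keep the original 80/20 class ratio, which is what we want for the ten digit classes as well.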

#### 1.1 Image Preprocessing

---

**Steps**
1. Scale the data
2. Reshape each 1-D array of 784 pixels into a 2-D 28x28 image
3. Convert the arrays into Torch tensors

In [5]:
# instantiate standard scaler
ss = StandardScaler()

# fit and transform training
X_train = ss.fit_transform(X_train)

# ONLY transform X_val
X_val = ss.transform(X_val)
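
As a rough sketch of what happens to a single pixel column during scaling, the example below standardizes values with statistics learned from the training data only and then reuses them for the validation data. The helper names and the pixel values are hypothetical; `StandardScaler` does the same per column, and it also avoids dividing by zero for constant columns (such as border pixels that are always 0).

```python
from statistics import mean, pstdev


def fit_standardizer(column):
    """Learn the mean and population standard deviation from training values only."""
    mu = mean(column)
    sigma = pstdev(column)
    # A constant column has sigma == 0; fall back to 1 to avoid dividing by zero
    return mu, (sigma if sigma > 0 else 1.0)


def standardize(column, mu, sigma):
    """Apply z = (x - mu) / sigma to every value."""
    return [(value - mu) / sigma for value in column]


train_pixel = [0, 0, 128, 255, 255]  # hypothetical training values for one pixel
val_pixel = [0, 64, 255]             # hypothetical validation values for the same pixel

mu, sigma = fit_standardizer(train_pixel)          # fit on training data only
train_scaled = standardize(train_pixel, mu, sigma)
val_scaled = standardize(val_pixel, mu, sigma)     # reuse the training statistics

print(train_scaled)
print(val_scaled)
```

Fitting on the training set only keeps information about the validation set from leaking into the preprocessing.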

In [6]:
# reshape training & validation
X_train = np.array(X_train).reshape(-1, 28, 28)
X_val = np.array(X_val).reshape(-1, 28, 28)

# sanity check
print(X_train.shape, X_val.shape)

(31500, 28, 28) (10500, 28, 28)


In [7]:
### To torch tensors ###
# Independent Variables
X_train = torch.tensor(X_train, dtype=torch.float32)
X_val = torch.tensor(X_val, dtype=torch.float32)

# Dependent Variable
y_train = torch.tensor(y_train, dtype=torch.long)
y_val = torch.tensor(y_val, dtype=torch.long)

# Sanity Check
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)

torch.Size([31500, 28, 28]) torch.Size([31500]) torch.Size([10500, 28, 28]) torch.Size([10500])


For a simple Convolutional Neural Network, we can have an architecture that looks like this: 

In [8]:
class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        
        # Define the convolutional layers
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, stride=1, padding=1)
        
        # Define the max pooling layer
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        
        # Define the fully connected (linear) layers
        self.fc1 = nn.Linear(32 * 7 * 7, 128)  # Two 2x2 poolings reduce 28x28 to 7x7, with 32 channels
        self.fc2 = nn.Linear(128, 10)  # Output size is 10 for 10 classes (digits 0-9)

        # Define activation function
        self.relu = nn.ReLU()

    def forward(self, x):
        # Apply first convolutional layer followed by ReLU activation and max pooling
        x = self.pool(self.relu(self.conv1(x)))

        # Apply second convolutional layer followed by ReLU activation and max pooling
        x = self.pool(self.relu(self.conv2(x)))

        # Flatten the output for the fully connected layers
        x = x.view(x.size(0), -1)  # Keep the batch dimension and flatten the 32x7x7 feature maps

        # Apply the first fully connected layer with ReLU, then the output layer
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Create an instance of the SimpleCNN model
model = SimpleCNN()


#### 1.2 Multi-Convolution Layering Explained

Here we explain the layering for the CNN architecture below.

Convolutional Neural Networks use convolution to extract key features from inputs such as images. A kernel (or window) slides across the image, and at each position the dot product of the window and the underlying pixels is computed. For example, on a 28x28 pixel image with 1 colour channel (greyscale), a 3x3 window slides across the single channel to produce an output feature map. Depending on whether or not padding is added, the feature map may or may not keep the input's 28x28 dimensions.
> **Note 1**: The convolution we are using is 2D, meaning the window slides over the height and width of each channel. Each (colour) channel has its own convolution window: a greyscale image has 1 window, while an RGB image has a separate window for each colour channel. Mind you, having only 1 convolution window for 1 channel is **not** the same as defining only one convolution layer in our architecture!

> **Note 1.1**: You can think of `out_channels` in `nn.Conv2d` as the number of filters we apply to our images. For our 1x28x28 images, setting `out_channels=32` essentially says to look at each 28x28 image through 32 different filters (e.g. edge detection, sharpening, blurring). In practice the filters are learned during training rather than chosen by hand.

> **Note 2**: 1x28x28 = 784 is the number of **features** we have for each image.

Now for colour images, say RGB, we have 3 channels instead of 1, so our features become 3x28x28. Since the convolution is 2D, we have 3 windows: one for red, one for green, and one for blue. Each window convolves over its own 28x28 channel to extract features.
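
To illustrate the sliding window on a single channel, here is a minimal pure-Python sketch with stride 1. The function `conv2d`, the 4x4 image, and the vertical-edge kernel are all hypothetical and exist only for this example. Like `nn.Conv2d`, it computes a cross-correlation (the kernel is not flipped), but unlike `nn.Conv2d` it has no bias and no learned weights.

```python
def conv2d(image, kernel, padding=0):
    """Slide a square `kernel` over `image` (lists of lists) with stride 1."""
    k = len(kernel)
    if padding:
        # Surround the image with a border of zeros
        width = len(image[0]) + 2 * padding
        image = (
            [[0] * width for _ in range(padding)]
            + [[0] * padding + list(row) + [0] * padding for row in image]
            + [[0] * width for _ in range(padding)]
        )
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    # At each position, take the dot product of the window and the kernel
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj] for di in range(k) for dj in range(k))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical 4x4 image with a vertical edge, and a simple vertical-edge kernel
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

valid = conv2d(image, kernel)            # no padding: 4x4 -> 2x2
same = conv2d(image, kernel, padding=1)  # padding=1:  4x4 -> 4x4
print(valid)
print(len(same), len(same[0]))
```

Without padding the feature map shrinks, while `padding=1` with a 3x3 kernel keeps the input size, which is the setting used in our models.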

To calculate the number of input features for our fully connected layer, we look at how many channels we end up with after the convolutions and how the image size has changed, then multiply the two. In our case, we start with 1x28x28 images, that is, 1 (colour) channel of 28x28 pixels. With `kernel_size=3` and `padding=1`, each convolution keeps the image size, while each max pooling with `kernel_size=2` and `stride=2` halves it. The stride is how many pixels the pooling window moves at each step, so after one max pooling our 28x28 image becomes 14x14, and after the second it becomes 7x7. The architecture below uses two convolutional blocks, and the last one has 64 output channels, giving 7x7x64 = 3136 input features. If the commented-out third block were enabled, pooling again would give 3.5x3.5; we can't have half pixels, so floor division leaves 3x3, and the fc input size would become 3x3 times that block's output channels.

$$ \begin{align}
\text{layer 1 image size} &= \frac{28}{2}\times\frac{28}{2} = 14\times14\\
\text{layer 2 image size} &= \frac{14}{2}\times\frac{14}{2} = 7\times7\\
\text{fc input features} &= 7\times7\times64 = 3136 \text{ (last layer out channels is 64)}
\end{align}$$
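
As a quick check of this arithmetic for the two active blocks of `MultilayerCNN`, the sketch below applies the standard output-size formula, floor((n + 2p - k) / s) + 1. The helper `conv_output_size` is our own illustrative name, not a PyTorch function.

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (uses floor division)."""
    return (size + 2 * padding - kernel_size) // stride + 1


size = 28
for block in (1, 2):
    size = conv_output_size(size, kernel_size=3, stride=1, padding=1)  # the convolution keeps the size
    size = conv_output_size(size, kernel_size=2, stride=2)             # the max pooling halves it
    print(f"after block {block}: {size}x{size}")

# 64 output channels in the last active block
fc_in_features = 64 * size * size
print(fc_in_features)

# A third pooling step would floor 3.5 down to 3
third_block_size = conv_output_size(size, kernel_size=2, stride=2)
print(third_block_size)
```

The result matches the `in_features=3136` reported for the first linear layer when the model is printed.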


In [9]:
class MultilayerCNN(nn.Module):
    """Basic multi-layer CNN architecture."""

    def __init__(self):
        """Define the main components of the network including the loss and optimizer."""
        super(MultilayerCNN, self).__init__()

        self.conv_layer = nn.Sequential(

            # Convolutional block 1
            nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1), # Setting padding = 1 to not lose information in the edges of the image
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # Convolutional block 2
            nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),

            # # Convolutional block 3
            # nn.Conv2d(in_channels=64, out_channels=72, kernel_size=3, padding=1),
            # nn.ReLU(inplace=True),
            # nn.MaxPool2d(kernel_size=2, stride=2),
        ) # the flattened output has 64 * 7 * 7 = 3136 features


        self.fc_layer = nn.Sequential(
            nn.Linear(in_features=64*7*7, out_features=192),  # 64 channels x 7 x 7 after two poolings
            nn.ReLU(inplace=True),
            nn.Linear(in_features=192, out_features=10),  # 10 output classes (digits 0-9)
        )

        self.softmax_layer = nn.Softmax(dim=1)

        self.cross_entropy_loss = nn.CrossEntropyLoss()
        self.optimizer = torch.optim.SGD(self.parameters(), lr=.001, momentum=0.9)

    def forward(self, x):
        """Perform forward pass."""
        # pass through the convolutional layers
        x = self.conv_layer(x)

        # flatten the output of the convolution
        x = x.view(x.size(0), -1)

        # pass through the fully connected layers
        x = self.fc_layer(x)

        if not self.training:
            x = self.softmax_layer(x)

        return x

    def predict(self, x):
        '''
        Get hard class predictions from the
        feature data
        '''
        predictions = self.forward(x)

        # Find highest class logit, notice we don't need to convert to
        # probabilities to do hard predictions, we can simply choose the
        # highest values
        hard_class_predictions = torch.argmax(predictions, dim=1)

        return hard_class_predictions

In [10]:
# initialize the model
CNN_model = MultilayerCNN()
#CNN_model = SimpleCNN()
CNN_model

MultilayerCNN(
  (conv_layer): Sequential(
    (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU(inplace=True)
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    (3): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (4): ReLU(inplace=True)
    (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc_layer): Sequential(
    (0): Linear(in_features=3136, out_features=192, bias=True)
    (1): ReLU(inplace=True)
    (2): Linear(in_features=192, out_features=10, bias=True)
  )
  (softmax_layer): Softmax(dim=1)
  (cross_entropy_loss): CrossEntropyLoss()
)
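
The `predict` method takes the `argmax` of whatever `forward` returns, whether that is raw logits (training mode) or softmax probabilities (evaluation mode). The small sketch below, using hypothetical scores and hand-written helpers, shows why both give the same class: softmax preserves the ordering of the scores.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


logits = [1.2, -0.3, 4.1, 0.0]  # hypothetical raw scores for 4 classes
probs = softmax(logits)

print(argmax(logits), argmax(probs))
```

Keeping the softmax out of the training path also matters because `nn.CrossEntropyLoss` expects raw logits.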

Let's try this on a single sample.

In [11]:
# take the first training example
sample = X_train[0]
sample_label = y_train[0]

In [12]:
print(sample.shape, sample_label.shape)

torch.Size([28, 28]) torch.Size([])


PyTorch's convolutional layers expect 4D tensors, usually in the form `(batch_size, channels, height, width)`. Since each of our images only has the pixel dimensions (28x28), we need to add the other two dimensions by using `unsqueeze()`.

In [13]:
sample.unsqueeze(0).unsqueeze(0).shape

torch.Size([1, 1, 28, 28])

In [14]:
# pass it through the model
outputs = CNN_model(sample.unsqueeze(0).unsqueeze(0))

In [15]:
# find the loss
CNN_model.cross_entropy_loss(outputs, sample_label.unsqueeze(0))

tensor(2.2904, grad_fn=<NllLossBackward>)
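
A useful sanity check for this number: a model that assigns equal probability to each of the 10 classes has a cross-entropy loss of -ln(1/10). The short calculation below shows that value, which is what we would roughly expect from an untrained network.

```python
import math

num_classes = 10

# Cross-entropy when every class gets probability 1 / num_classes
uniform_loss = -math.log(1 / num_classes)
print(round(uniform_loss, 4))  # 2.3026
```

The loss on our sample is close to this, so the untrained model is essentially guessing.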

Applying this to our full training set:
- First let's look at the training and validation shapes
- Apply the `unsqueeze()` method to add a channel dimension after the batch dimension

In [16]:
X_train.shape

torch.Size([31500, 28, 28])

In [17]:
X_train.unsqueeze(1).shape

torch.Size([31500, 1, 28, 28])

In [18]:
X_train = X_train.unsqueeze(1)
X_val = X_val.unsqueeze(1)

Now that we have the appropriate dimensions for the model, we need to combine the independent (X) and dependent (y) variables into a single `TensorDataset` so it can be fed into the `DataLoader`.

In [19]:
#For both sets
train_data = TensorDataset(X_train, y_train)
val_data = TensorDataset(X_val, y_val)

In [20]:
# this iterator returns four training examples at a time so we will update the model after every 4 images
train_dataloader = DataLoader(train_data, batch_size=4, shuffle=True)

# this iterator returns 1024 validation examples at a time (for fast evaluation)
val_dataloader = DataLoader(val_data, batch_size=1024, shuffle=False)
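
To get a feel for what these batch sizes mean, the sketch below counts the batches per pass over each set, assuming the default `drop_last=False` so that the last, smaller batch is kept. The sample counts come from the shapes printed earlier.

```python
import math

n_train, n_val = 31500, 10500

# Number of batches the loaders yield per pass (the last batch may be smaller)
train_batches = math.ceil(n_train / 4)
val_batches = math.ceil(n_val / 1024)
last_val_batch = n_val - (val_batches - 1) * 1024

print(train_batches, val_batches, last_val_batch)
```

With a batch size of 4, the model weights are updated several thousand times per epoch, which makes each epoch slow but gives many small update steps.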

To evaluate our model's performance, we can create a function that will return the model's accuracy:

- Create 2 variables: correct and total
- Go through the dataloader and find the predictions
- Calculate the share of labels the model predicted correctly by dividing correct by total

In [21]:
def get_accuracy(model, dataloader):
    '''
    Helper function to get classification accuracy (in percent) for a model over the items in dataloader.
    taken from: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html
    '''
    correct = 0
    total = 0

    # Gradients are not needed for evaluation
    with torch.no_grad():
        # Go through all of the data
        for images, labels in dataloader:
            # Get the prediction of the net on the images
            predicted = model.predict(images)

            total += labels.size(0)

            # Count those we got correct
            correct += (predicted == labels).sum().item()

    return 100 * correct / total


print(f"Pre training accuracy: {get_accuracy(CNN_model, val_dataloader)}%")

Pre training accuracy: 17.047619047619047%


In [22]:
# re-initialize the model
CNN_model = MultilayerCNN()


# MAIN EPOCH LOOP: the epochs are the number of times we loop through the entire training set.
for epoch in tnrange(10, desc="Total epochs: "):
    
    loss_sum = 0
    acc_sum = 0
    
    # BATCH LOOP: loop over the data batches using the data loader
    for i, data in enumerate(train_dataloader, 0):

        # get the inputs
        inputs, labels = data
        
        # FORWARD PASS and loss calculation
        outputs = CNN_model(inputs)
        loss = CNN_model.cross_entropy_loss(outputs, labels)
        
        # BACKWARD PASS but zero the gradients first to delete the old ones
        # as pytorch accumulates gradients by default
        CNN_model.optimizer.zero_grad()
        loss.backward()

        # UPDATE: the model weights are updated
        CNN_model.optimizer.step()
        
        # MONITORING: save loss and accuracy on the batch to track the training
        loss_sum += loss.item()

        # Get the prediction of the net on the images
        predicted = CNN_model.predict(inputs)
        acc_sum += (labels == predicted).sum().item() / labels.shape[0]
        
    
    # print summary of training metrics
    loss_avg = loss_sum / len(train_dataloader)
    acc_avg = acc_sum / len(train_dataloader)
    test_acc = get_accuracy(CNN_model, val_dataloader)
    
    print(f"Avg loss: {np.round(loss_avg, 4)} | "\
          f"Avg training accuracy: {np.round(acc_avg*100, 2)} | "\
          f"Avg test accuracy: {np.round(test_acc, 2)}")

print('Finished Training') 


Avg loss: 0.2331 | Avg training accuracy: 94.05 | Avg test accuracy: 97.48
Avg loss: 0.0617 | Avg training accuracy: 98.59 | Avg test accuracy: 97.86
Avg loss: 0.038 | Avg training accuracy: 99.11 | Avg test accuracy: 98.48
Avg loss: 0.0272 | Avg training accuracy: 99.5 | Avg test accuracy: 98.37
Avg loss: 0.0198 | Avg training accuracy: 99.68 | Avg test accuracy: 98.61
Avg loss: 0.0137 | Avg training accuracy: 99.77 | Avg test accuracy: 98.61
Avg loss: 0.0115 | Avg training accuracy: 99.81 | Avg test accuracy: 98.57
Avg loss: 0.0082 | Avg training accuracy: 99.89 | Avg test accuracy: 98.81
Avg loss: 0.005 | Avg training accuracy: 99.94 | Avg test accuracy: 98.62
Avg loss: 0.0034 | Avg training accuracy: 99.97 | Avg test accuracy: 98.75
Finished Training
