# GDAA 2010 – Data Mining Modelling

## Project #2 - Applied Deep Learning in Python using PyTorch

### Alex Moss

## Introduction

For this second project, industry-standard Deep Learning workflows will be used to generate class label predictions on a set of images. The purpose of this assignment is to gain experience building Deep Learning models in a Python-based environment for making predictions on unlabelled image data. The PyTorch framework will be used to build the models.
The work for this project will be executed in a Jupyter Notebook and be accompanied by detailed comments that explain the dataset characteristics, modelling procedures, and results. The provided “dl_leafsnap.ipynb” on Brightspace will be used as a template for this code.

First things first, import all of the necessary packages to run the code for this project.

In [None]:
import torch
import torchvision
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import transforms
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import os
import sklearn
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score



Assign the compute device to an object called 'device'. The GPU is selected if CUDA is available; otherwise the CPU is used.

In [None]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
device

## Data Loading & Transformation

The dataset for this project is a commonly used machine learning dataset known as the MNIST database. It contains 70,000 images of handwritten digits from 0 to 9. This code chunk downloads the dataset, converts each image to a tensor using the 'transforms.ToTensor()' command, then loads the predefined training and testing sets. The classes for the dataset are then set from 0 to 9.

In [None]:
# Define the directory where you want to save the dataset
download_path = r'E:\NSCC\Semester_2\GDAA2010_DataMiningModelling\Project_2\data\MNIST'

# Define the transformation
transform = transforms.ToTensor()

# Create the training dataset object
trainset = torchvision.datasets.MNIST(root=download_path, train=True, download=True, transform=transform)

# Create the test dataset object
testset = torchvision.datasets.MNIST(root=download_path, train=False, download=True, transform=transform)

classes = ('0', '1', '2', '3', '4', '5', '6', '7', '8', '9')
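
To make the effect of 'transforms.ToTensor()' concrete, here is a minimal NumPy sketch of the same conversion for an 8-bit grayscale image: a channel axis is added in front and pixel values are rescaled from 0-255 to 0-1. The 2 by 2 `pixels` array is made-up toy data, not part of MNIST.

```python
import numpy as np

# Toy 2x2 grayscale image with 8-bit pixel values (hypothetical data).
pixels = np.array([[0, 255], [128, 64]], dtype=np.uint8)

# Mimic ToTensor for uint8 input: add a leading channel axis
# and rescale the values to the range [0, 1].
scaled = pixels.astype(np.float32)[np.newaxis, :, :] / 255.0

print(scaled.shape)  # (1, 2, 2): channel, height, width
print(scaled.min(), scaled.max())
```

The same layout is why an MNIST sample has one channel in front of its 28 by 28 pixel grid.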


This code chunk verifies the shape of the tensors. 

In [None]:
train_iter = iter(trainset)

image, label = next(train_iter)

image.shape, label

The above output indicates that the tensors created by PyTorch for this project contain a single channel, because they are grayscale images. We can also see that the images are 28 by 28 pixels. The final number in the output, 5, indicates that this particular tensor has a label of 5. Let's check on those labels with the code chunk below.

In [None]:
print(trainset.class_to_idx)

Each numbered label simply identifies the handwritten digit that the sample represents.

The tensors in this project represent 28 by 28 pixel images. These tensors can be visualized graphically using the matplotlib package. To ensure the data was loaded into the project correctly, one of the tensors will be plotted.

In [None]:
index = 2024 # This selects the image from the training data by index number

image, label = trainset[index] # Get the image and its label

plt.imshow(image.permute(1, 2, 0))
plt.title(f"Label: {classes[label]}")
plt.show()

## Data Preparation

The primary component of the data prep in this project is splitting the data into the appropriate sets. In our case, there will be a training set and a testing set. The training set will then be split further into a training set and a validation set. First, let's see how the data is split so far.

In [None]:
len(trainset), len(testset)

The default split is roughly 86/14, meaning 60,000 samples will be for training and 10,000 for testing. Now, the training set needs to be split into training and validation sets. A 75/25 split will be used, giving 45,000 training samples and 15,000 validation samples. A batch size also needs to be set. The value used in the code below is 64. Further experimenting to achieve better results may occur, resulting in a tweaking of the batch size.

In [None]:
trainset, valset = torch.utils.data.random_split(trainset, [45000, 15000])
len(trainset), len(valset), len(testset)

# Set the batch size
batch_size = 64
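
One optional refinement, not used in the cell above: `random_split` accepts a `generator` argument, so seeding it makes the split repeatable between runs. The sketch below assumes a small stand-in dataset (`toy_set`) and a hypothetical helper `split_indices`; the seed value is arbitrary.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the 60,000-sample training set (hypothetical data).
toy_set = TensorDataset(torch.arange(100).unsqueeze(1),
                        torch.zeros(100, dtype=torch.long))

def split_indices(seed):
    # A seeded generator makes the 75/25 split repeatable.
    generator = torch.Generator().manual_seed(seed)
    train_part, val_part = random_split(toy_set, [75, 25], generator=generator)
    return train_part.indices, val_part.indices

train_a, val_a = split_indices(seed=0)
train_b, val_b = split_indices(seed=0)

print(len(train_a), len(val_a))  # 75 25
print(train_a == train_b)        # same seed, same split
```

With a fixed seed, accuracy differences between runs are easier to attribute to parameter changes rather than to a different validation set.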

It's also helpful to know how many batches our data will be divided into during training. The code chunk below checks that quickly for us.

In [None]:
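import math
# The data loaders keep the final partial batch, so round up rather than down
print(f'Number of batches in the training set: {math.ceil(len(trainset) / batch_size)}')
print(f'Number of batches in the validation set: {math.ceil(len(valset) / batch_size)}')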
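
As a quick check on the arithmetic: a data loader that keeps its final partial batch (the PyTorch default, `drop_last=False`) produces the ceiling of samples divided by batch size, and the last batch holds whatever is left over. The helper name `batch_layout` is hypothetical and used only for illustration.

```python
import math

def batch_layout(num_samples, batch_size):
    """Return (number of batches, size of the final batch) when the partial batch is kept."""
    num_batches = math.ceil(num_samples / batch_size)
    last_batch = num_samples - (num_batches - 1) * batch_size
    return num_batches, last_batch

print(batch_layout(45000, 64))  # (704, 8)
print(batch_layout(15000, 64))  # (235, 24)
```

The small final batches (8 and 24 samples) matter later when accuracy is averaged per batch.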

This code chunk simply verifies that the training dataset is being viewed as a subset of our original dataset.

In [None]:
type(trainset)

## Model Preparation

Although the CPU is being used for model training instead of a GPU, data loaders will still be set up to remain consistent with the sample code. Each loader feeds the model one batch at a time, and only the training loader shuffles its data.

In [None]:
trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                          shuffle=True, num_workers=2)

valloader = torch.utils.data.DataLoader(valset, batch_size=batch_size,
                                          shuffle=False, num_workers=2)

testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                         shuffle=False, num_workers=2)

Now that the data has been prepped, the model is ready to be defined. This project will build a Convolutional Neural Network, which works well for image classification tasks.

The model will be stored as a class. Within our class, various functions will be defined to accomplish the necessary tasks for this model.

The first main function outlines the layers to be included in the model, and their properties. This will include, for this model, a range of convolutional layers, max pooling layers, linear layers, and dropout layers (among other details). The output channels of one layer need to match the subsequent input channels of the next layer. How these values are initially determined is a bit arbitrary, and often requires considerable iteration to get right. 

The second main function is the forward function, which provides the instructions for how data should flow through the model from one layer to the next. It also defines the activation functions applied at various points throughout the model, which are meant to capture the nonlinearities of the data during training. Basically, these help the model find the curve that fits the data, especially when the data is very complex.


In [None]:
class NeuralNet(nn.Module):
    def __init__(self):
        super().__init__()

        # Convolutional layers
        self.conv1 = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=5, padding=2) #Ensure the first input channel is = 1, because our Tensors are single channel.
        self.pool1 = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=5, padding=2)
        self.pool2 = nn.MaxPool2d(2, 2)
        self.conv3 = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=4, padding=1)
        self.pool3 = nn.MaxPool2d(2, 2)

        # Calculate the size of the flattened features after all convolutional and pooling layers
        self._flattened_features = self._get_conv_output_size()

        # Fully connected layers
        self.fc1 = nn.Linear(in_features=self._flattened_features, out_features=1024)
        self.drop1 = nn.Dropout(p=0.3)
        self.fc2 = nn.Linear(in_features=1024, out_features=512)
        self.drop2 = nn.Dropout(p=0.3)
        self.out = nn.Linear(in_features=512, out_features=10) ### Ensure output features matches the number of classes (10 digits)

    def _get_conv_output_size(self):
        # Create a dummy input to pass through the convolutional layers only
        dummy_input = torch.zeros(1, 1, 28, 28)  # Use the provided input dimensions
        dummy_input = self.conv1(dummy_input)
        dummy_input = self.pool1(dummy_input)
        dummy_input = self.conv2(dummy_input)
        dummy_input = self.pool2(dummy_input)
        dummy_input = self.conv3(dummy_input)
        dummy_input = self.pool3(dummy_input)
        return int(torch.flatten(dummy_input, 1).size(1))

    def forward(self, x):
        # Convolutional and pooling layers
        x = F.relu(self.conv1(x))
        x = self.pool1(x)
        x = F.relu(self.conv2(x))
        x = self.pool2(x)
        x = F.relu(self.conv3(x))
        x = self.pool3(x)

        # Flatten the output for the fully connected layers
        x = torch.flatten(x, 1)

        # Fully connected layers with ReLU activations and dropout
        x = F.relu(self.fc1(x))
        x = self.drop1(x)
        x = F.relu(self.fc2(x))
        x = self.drop2(x)

        # Output layer
        x = self.out(x)

        return x
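
The flattened feature size that `_get_conv_output_size` finds with a dummy input can also be worked out by hand. The sketch below applies the standard output-size formula, floor((size + 2 * padding - kernel) / stride) + 1, to each layer. The helper names `conv_out` and `pool_out` are hypothetical, and a dilation of 1 is assumed.

```python
def conv_out(size, kernel, padding=0, stride=1):
    # Standard output-size formula for a convolution (dilation of 1 assumed).
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    # Max pooling without padding follows the same formula.
    return (size - kernel) // stride + 1

size = 28  # MNIST images are 28 by 28 pixels
for kernel, padding in [(5, 2), (5, 2), (4, 1)]:  # conv1, conv2, conv3
    size = pool_out(conv_out(size, kernel, padding))
    print(size)  # 14, then 7, then 3

flattened = 256 * size * size  # 256 output channels from conv3
print(flattened)  # 2304
```

The first two convolutions preserve the spatial size because a 5 by 5 kernel with padding 2 adds back what the kernel removes; the third shrinks 7 to 6 before pooling.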


Now that the model architecture has been defined, the model needs to be pushed to the device.

In [None]:
net = NeuralNet()
net.to(device)

The code chunk below verifies the shape of the initial and final outputs.

In [None]:
for i, data in enumerate(trainloader):
    inputs, labels = data[0].to(device), data[1].to(device)
    print(f'input shape: {inputs.shape}')
    print(f'after network shape: {net(inputs).shape}')
    break

This next code chunk checks how many parameters the model will be estimating.

In [None]:
# numel() returns the number of elements in each parameter tensor
num_params = sum(p.numel() for p in net.parameters())

print(f'Number of parameters in the model: {num_params:,}')
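
The parameter count can be cross-checked by hand from the layer definitions. A convolutional layer has one kernel per input/output channel pair plus one bias per output channel, and a linear layer has a weight matrix plus one bias per output feature. Pooling and dropout layers have no parameters. The helper names below are hypothetical.

```python
def conv_params(in_ch, out_ch, kernel):
    # One kernel x kernel filter per (input, output) channel pair, plus one bias per output channel.
    return in_ch * out_ch * kernel * kernel + out_ch

def linear_params(in_features, out_features):
    # Weight matrix plus one bias per output feature.
    return in_features * out_features + out_features

layers = {
    'conv1': conv_params(1, 64, 5),
    'conv2': conv_params(64, 128, 5),
    'conv3': conv_params(128, 256, 4),
    'fc1': linear_params(2304, 1024),  # 2304 = 256 channels x 3 x 3 after the last pooling layer
    'fc2': linear_params(1024, 512),
    'out': linear_params(512, 10),
}

for name, count in layers.items():
    print(f'{name}: {count:,}')

total = sum(layers.values())
print(f'Total: {total:,}')  # Total: 3,621,386
```

Most of the parameters sit in `fc1`, the first fully connected layer, rather than in the convolutional layers.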

The next two lines of code are very important. Firstly, the loss function for the model needs to be defined. The loss function is the metric that the model will be trying to minimize. In this case, the CrossEntropyLoss will be used.

Next, we define the optimizer and its learning rate. The learning rate controls how aggressively the model corrects its weights at each update. The Adam optimizer is used here, and the learning rate will start at 0.0001, but may be tweaked to achieve better results.

In [None]:
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(net.parameters(), lr=0.0001)
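
To give a feel for what the cross-entropy loss measures, here is a plain-Python sketch for a single sample: the raw model outputs (logits) are turned into probabilities with softmax, and the loss is the negative log of the probability given to the true class. The function name and the example logits are made up for illustration; `nn.CrossEntropyLoss` applies the same idea and, by default, averages over the batch.

```python
import math

def cross_entropy(logits, target):
    # Subtract the max logit for numerical stability before exponentiating.
    shifted = [z - max(logits) for z in logits]
    exps = [math.exp(z) for z in shifted]
    prob_true = exps[target] / sum(exps)  # softmax probability of the true class
    return -math.log(prob_true)

logits = [4.0, 0.5, 0.1]  # hypothetical scores for a 3-class problem
confident_right = cross_entropy(logits, target=0)
confident_wrong = cross_entropy(logits, target=2)
print(round(confident_right, 3), round(confident_wrong, 3))
```

The loss is small when the highest score belongs to the true class and grows quickly when the model is confidently wrong, which is what pushes the weights in the right direction.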

## Training the Model

Training a deep learning model can be tricky, in part because it is easy to run into computational bottlenecks when extremely complex models are run on very large datasets. To keep training manageable, the data is processed one batch at a time, and each full pass through the training data is called an epoch. This also gives the model regular opportunities to report back on the training's progress in real time.

However, before the model can be trained, the training loop needs to be defined. The code chunk below sets up the training epoch.

In [None]:
def train_one_epoch():
  net.train(True)

  running_loss = 0.0
  running_accuracy = 0.0

  for batch_index, data in enumerate(trainloader):
    inputs, labels = data[0].to(device), data[1].to(device)

    optimizer.zero_grad()

    outputs = net(inputs) # shape: [batch_size, 10]
    correct = torch.sum(labels == torch.argmax(outputs, dim=1)).item()
    running_accuracy += correct / labels.size(0) # the final batch may be smaller than batch_size

    loss = criterion(outputs, labels)
    running_loss += loss.item()
    loss.backward()
    optimizer.step()

    if batch_index % 200 == 199:  # print every 200 batches
      avg_loss_across_batches = running_loss / 200
      avg_acc_across_batches = (running_accuracy / 200) * 100
      print('Batch {0}, Loss: {1:.3f}, Accuracy: {2:.1f}%'.format(batch_index+1,
                                                          avg_loss_across_batches,
                                                          avg_acc_across_batches))
      running_loss = 0.0
      running_accuracy = 0.0

    
  print()

Now that the training epoch has been set up, the validation epoch needs to be set up.

In [None]:
def validate_one_epoch():
    net.train(False)
    running_loss = 0.0
    running_accuracy = 0.0
    
    for i, data in enumerate(valloader):
        inputs, labels = data[0].to(device), data[1].to(device)
        
        with torch.no_grad():
            outputs = net(inputs) # shape: [batch_size, 10]
            correct = torch.sum(labels == torch.argmax(outputs, dim=1)).item()
            running_accuracy += correct / labels.size(0) # the final batch may be smaller than batch_size
            loss = criterion(outputs, labels) # One number, the average batch loss
            running_loss += loss.item()
        
    avg_loss_across_batches = running_loss / len(valloader)
    avg_acc_across_batches = (running_accuracy / len(valloader)) * 100
    
    print('Val Loss: {0:.3f}, Val Accuracy: {1:.1f}%'.format(avg_loss_across_batches,
                                                            avg_acc_across_batches))
    print('***************************************************')
    print()
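
Because the final batch is usually smaller than the nominal batch size, how accuracy is averaged across batches makes a small difference. The sketch below compares dividing every batch by the nominal size with pooling the raw counts, using a hypothetical perfect run over the 15,000 validation samples. The function name `summarize` is illustrative only.

```python
def summarize(batches, nominal_batch_size):
    """batches: list of (num_correct, num_samples) pairs, one per batch."""
    # Dividing by the nominal batch size under-counts a smaller final batch.
    nominal = sum(c / nominal_batch_size for c, _ in batches) / len(batches)
    # Pooling counts over all batches weights every sample equally.
    pooled = sum(c for c, _ in batches) / sum(n for _, n in batches)
    return nominal, pooled

# Hypothetical perfect run: 234 full batches of 64 plus one final batch of 24.
perfect_run = [(64, 64)] * 234 + [(24, 24)]
nominal, pooled = summarize(perfect_run, nominal_batch_size=64)
print(f'{nominal:.4f} {pooled:.4f}')  # 0.9973 1.0000
```

The gap is small for MNIST at this batch size, but it grows as the batch size gets larger relative to the dataset.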

Now that the training and validation loops have been properly defined, the model can officially begin training. The number of epochs may be changed later to optimize results.

In [None]:
num_epochs = 5

for epoch_index in range(num_epochs):
    print(f'Epoch: {epoch_index + 1}\n')
    
    train_one_epoch()
    validate_one_epoch()
    
print('Finished Training')

## Evaluating the Model

With our model now trained and validated, we can evaluate the model using standard methods for classification models. Our evaluation will include classification accuracy, true/false positive and negative rates, and other values. These are also summarized in the confusion matrix, which can be nicely formatted using `matplotlib` and `seaborn`.


The metrics themselves will come from a range of `sklearn` metrics, as seen below. Testing our model involves running our test dataset into our model to generate predictions, and then cross-referencing the predictions against the actual class labels to check for errors. 

In [None]:
# Get predicted labels from the model
predicted_labels = []
true_labels = []

# Iterate through test set and collect predictions
net.eval()  # make sure dropout is disabled during evaluation
with torch.no_grad():  # gradients are not needed for inference
    for images, labels in testloader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        predicted_labels.extend(predicted.cpu().numpy())
        true_labels.extend(labels.cpu().numpy())

# Calculate accuracy
accuracy = accuracy_score(true_labels, predicted_labels)
print("Accuracy:", accuracy)

# Generate classification report
class_report = classification_report(true_labels, predicted_labels, target_names=classes)


# Generate confusion matrix
conf_matrix = confusion_matrix(true_labels, predicted_labels)

Print both the classification report and the confusion matrix.

In [None]:
print("Classification Report:\n", class_report)

In [None]:
print("Confusion Matrix:\n", conf_matrix)
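
The true/false positive and negative counts for each class can be read straight off a confusion matrix. The sketch below uses a made-up 3-class matrix laid out the same way as the `sklearn` output (rows are true labels, columns are predicted labels). The function name `per_class_rates` is hypothetical.

```python
# Toy 3-class confusion matrix (hypothetical counts).
toy_matrix = [
    [50, 2, 3],
    [4, 40, 1],
    [6, 0, 44],
]

def per_class_rates(matrix, k):
    total = sum(sum(row) for row in matrix)
    tp = matrix[k][k]                        # class k predicted as class k
    fn = sum(matrix[k]) - tp                 # class k predicted as something else
    fp = sum(row[k] for row in matrix) - tp  # other classes predicted as class k
    tn = total - tp - fn - fp                # everything not involving class k
    return {
        'tp': tp, 'fp': fp, 'fn': fn, 'tn': tn,
        'precision': tp / (tp + fp),
        'recall': tp / (tp + fn),
    }

for k in range(len(toy_matrix)):
    print(k, per_class_rates(toy_matrix, k))
```

The precision and recall columns of the classification report are derived from these same per-class counts.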

This code chunk visualizes the confusion matrix with improved formatting.

In [None]:
# Extract class labels from the dataset
class_labels = testset.classes

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.set(font_scale=1.2)  # Set font scale
sns.heatmap(conf_matrix, annot=True, cmap="Blues", fmt="d", xticklabels=class_labels, yticklabels=class_labels)
plt.xlabel("Predicted labels")
plt.ylabel("True labels")
plt.title("Confusion Matrix")
plt.show()

## Saving & Testing the Trained Model

Once training results are satisfactory, the code chunk below should be run to save the model so that it does not need to be retrained every time predictions are to be made on new data.

In [None]:
# Define the directory path to save the model (created if it does not exist yet)
save_dir = r"E:\NSCC\Semester_2\GDAA2010_DataMiningModelling\DL\models"
os.makedirs(save_dir, exist_ok=True)

# Define the file name for the saved model
model_name = "MNIST_epoch5batch64.pth"
save_path = os.path.join(save_dir, model_name)

# Save the model
torch.save(net.state_dict(), save_path)

print(f"Model saved at: {save_path}")


The newly saved model can now be loaded back into the project separately and tested on the same test data as earlier to ensure it is working as intended.

In [None]:
# Define the model architecture and move it to the same device as the data
net_test = NeuralNet().to(device)

save_path = os.path.join(save_dir, model_name)
# Load the saved model state dictionary onto that device
net_test.load_state_dict(torch.load(save_path, map_location=device))

# Evaluate the model on the test data
net_test.eval()
correct = 0
total = 0
with torch.no_grad():
    for images, labels in testloader:
        images, labels = images.to(device), labels.to(device)
        outputs = net_test(images)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

# Calculate accuracy
accuracy = 100 * correct / total
print(f"Accuracy of the network on the test images: {accuracy:.2f}%")
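
The save and load steps can be tried in isolation on a tiny model. The sketch below assumes a stand-in `nn.Linear` layer instead of the full network and writes to an in-memory buffer rather than a file, so nothing touches the disk. It shows that a freshly built model gives identical outputs once the saved state dictionary is loaded into it.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
original = nn.Linear(4, 2)  # tiny stand-in model (hypothetical)

# Save only the learned parameters, here to an in-memory buffer instead of a file.
buffer = io.BytesIO()
torch.save(original.state_dict(), buffer)
buffer.seek(0)

# Rebuild the same architecture, then load the saved parameters into it.
restored = nn.Linear(4, 2)
restored.load_state_dict(torch.load(buffer, map_location='cpu'))
restored.eval()

sample = torch.ones(1, 4)
with torch.no_grad():
    outputs_match = torch.equal(original(sample), restored(sample))
print(outputs_match)  # True
```

Saving the state dictionary stores only the weights, which is why the architecture has to be rebuilt before loading.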

## Conclusion/Thoughts

Compared with the first machine learning project, there was much less to play around with this time. After modifying the code to accommodate the smaller, grayscale, 10-class MNIST dataset, there wasn't much else to change from the sample code that used the leafsnap dataset. That said, it was interesting to see the accuracy improve slightly after adjusting some of the tweakable parameters. The first run already showed an impressive accuracy of 97.93%, which rose to 98.86% after increasing the batch size and epoch count. Finally, after raising the learning rate to 0.001 and setting the batch size to 10, a final accuracy score of 98.96% was achieved.

Overall, this was a neat project to watch in action. Watching the model print its results as it actively improves its accuracy score is extremely cool, and the plan is to try running this on more custom, geospatial data. It will also be interesting to run through the time series version of this code.