## Assignment 6, 10/10 fun

#### Transitioning from Keras/TensorFlow to PyTorch:

Upon request, this notebook was transformed to Pytorch instead of using the code provided. Bare requirement results were yielded with the original code by decreasing the learning rate to 0.1 and increasing the epochs to 5. This gave 92% test accuracy in the FNN and by adding just another line "layer" in the CNN yielded 97% test accuracy.

*   Replacing Keras layers with equivalent PyTorch `nn.Module` implementations.
*   Using PyTorch's `DataLoader` for data loading and batching.
*   Implementing the training loop using PyTorch's `optim` optimizers and manual gradient management.
*   Adjusting the code to align with PyTorch's tensor-based operations and dynamic computation graph.

In [1]:
import tensorflow as tf
from tensorflow import keras


def load_mnist():
    (x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

    # Normalize the input data - MNIST data is pixel arrays, so divide by max pixel value 255
    x_train = x_train/255.0
    x_test = x_test/255.0

    # Output is categorical - map from digit target to vector (e.g. 2 -> [0,0,1,0,0,0,0,0,0,0])
    y_train = keras.utils.to_categorical(y_train, num_classes=10)
    y_test = keras.utils.to_categorical(y_test, num_classes=10)

    return x_train, y_train, x_test, y_test


def build_model(cnn=True):

    model = keras.Sequential()

    # Input is 28x28 image, single channel (grayscale)
    model.add(keras.Input(shape=(28, 28, 1)))

    if not cnn:

        ###  Fully connected neural network ###

        # Input is multidimensional, flattened to single dimension
        model.add(keras.layers.Flatten())
        # Add a hidden layer - units is number of neurons/layer width
        model.add(keras.layers.Dense(units=16, activation="relu"))
        # TODO add more dense layers and/or vary number of units for increased complexity of FNN

    else:

        ###  Convolutional neural network  ###

        # Add convolutional layer - filters is depth of layer output and kernel_size the convolution window
        model.add(keras.layers.Conv2D(filters=8, kernel_size=(2, 2), activation="relu", padding="same"))
        model.add(keras.layers.Conv2D(filters=8, kernel_size=(2, 2), activation="relu", padding="same"))

        # Add pooling layer to downscale (MaxPooling downscales by returning the maximum value in each input window)
        model.add(keras.layers.MaxPooling2D(pool_size=(2, 2)))
        # TODO add more layers and/or experiment with different number of filters, different kernel_size or pool_size

        # Flatten internal dimensions before output - additional dense layers could also be included after this line
        model.add(keras.layers.Flatten())

    # Final model layer - the same for all model architectures
    # Activation is softmax
    model.add(keras.layers.Dense(units=10, activation="softmax"))

    return model



if __name__ == "__main__":

    # TODO try different values for epochs and learning_rate to improve model performance
    epochs = 5
    learning_rate = 0.1

    x_train, y_train, x_test, y_test = load_mnist()
    model = build_model(cnn=True)  # set cnn=True for convolutional network, false for MLP

    # Compile model - Stochastic gradient descent is chosen for the optimizer and categorical cross entropy for the
    # loss calculation
    optimizer = keras.optimizers.SGD(learning_rate=learning_rate)
    model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=['accuracy'])

    # Show model architecture details and compare parameter counts
    model.summary()

    # Train the model on training data
    history = model.fit(x_train, y_train, epochs=epochs, batch_size=128, verbose=1, validation_split=0.1)

    # Evaluate the model on test data
    test_loss, test_acc = model.evaluate(x_test, y_test, verbose=1)
    print(f"Test accuracy: {test_acc:.4f}")


Epoch 1/5
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step - accuracy: 0.6972 - loss: 0.9327 - val_accuracy: 0.9187 - val_loss: 0.2779
Epoch 2/5
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step - accuracy: 0.9111 - loss: 0.3058 - val_accuracy: 0.9495 - val_loss: 0.1818
Epoch 3/5
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step - accuracy: 0.9415 - loss: 0.1971 - val_accuracy: 0.9623 - val_loss: 0.1318
Epoch 4/5
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step - accuracy: 0.9601 - loss: 0.1388 - val_accuracy: 0.9733 - val_loss: 0.0964
Epoch 5/5
[1m422/422[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 6ms/step - accuracy: 0.9684 - loss: 0.1073 - val_accuracy: 0.9748 - val_loss: 0.0934
[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - accuracy: 0.9659 - loss: 0.1088
Test accuracy: 0.9705


In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np

In [8]:
def load_mnist():
    """
    Load and prepare the MNIST dataset.
    Returns train and test data as tensors.
    """
    # Define transformations: Convert to tensor
    transform = transforms.ToTensor()
    
    # Download and load training data
    train_dataset = datasets.MNIST('./data', train=True, download=True, transform=transform)
    test_dataset = datasets.MNIST('./data', train=False, download=True, transform=transform)
    
    # Create data loaders
    train_loader = DataLoader(train_dataset, batch_size=128, shuffle=True)
    test_loader = DataLoader(test_dataset, batch_size=128, shuffle=False)
    
    return train_loader, test_loader


In [9]:
def build_model(cnn=True):
    if not cnn:
        # Fully connected neural network
        model = nn.Sequential(
            nn.Flatten(),  # Input is multidimensional, flattened to single dimension
            nn.Linear(28*28, 128),  # Hidden layer with 16 neurons
            nn.BatchNorm1d(128),       # Add batch normalization
            nn.ReLU(),
            nn.Dropout(0.2),           # Add dropout
            nn.Linear(128, 64),        # Additional hidden layer
             nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.Linear(64, 10),  # Output layer with 10 neurons (one per digit)
        )
    else:
        # Convolutional neural network
        model = nn.Sequential(
            # Conv layer with 8 filters and 2x2 kernel
            nn.Conv2d(1, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),       # Add batch normalization
            nn.ReLU(),
            nn.MaxPool2d(2),
            
            #One more convolutional layer
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),       # Add batch normalization
            nn.ReLU(),
            nn.MaxPool2d(2),
            
            
            #Pieced Together
            nn.Flatten(),  # Flatten dimensions before output
            nn.Linear(64 * 7 * 7, 128),  # Output layer
            nn.BatchNorm1d(128),       # Add batch normalization
            nn.ReLU(),
            nn.Dropout(0.2),           # Add dropout
            nn.Linear(128, 10),  # Output layer
        )
    
    return model

#### Model changes made

**Fully Connected Neural Network (FNN):**

*   Increased the number of neurons in the first hidden layer from 16 to 128 to enhance the model's capacity to learn more complex patterns.
*   Implemented Batch Normalization after each hidden layer to stabilize training and potentially allow for higher learning rates.
*   Added Dropout layers to prevent overfitting by randomly setting a fraction of input units to 0 during training.
*   Introduced an additional hidden layer to increase the model's depth and ability to capture more abstract features.
*   Ensured the output layer matches the number of classes in the dataset (10 for MNIST).

**Convolutional Neural Network (CNN):**

*   Increased the number of filters in the convolutional layers from 8 to 32 to enable the network to learn a richer set of features.
*   Implemented Batch Normalization after each convolutional layer to improve training stability and speed.
*   Added an extra Convolutional layer to allow the network to learn more complex spatial hierarchies.
*   Combined the convolutional layers with fully connected layers, using a 128-neuron hidden layer before the final 10-neuron output layer for classification.

In [10]:
def train_and_evaluate(model, train_loader, test_loader, epochs=10, learning_rate=0.1):
    """Training and evaluation function for both models"""
    # Set up loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=learning_rate, weight_decay=1e-4)
    
    # Train the model
    for epoch in range(epochs):
        running_loss = 0.0
        model.train()  # Set model to training mode
        for inputs, labels in train_loader:
            # Zero the parameter gradients
            optimizer.zero_grad()
            
            # Forward pass, backward pass, optimize
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            
            running_loss += loss.item()
            print(f'Epoch {epoch+1}, Loss: {running_loss/len(train_loader):.3f}')
    
    # Evaluate the model
    correct = 0
    total = 0
    model.eval()  # Set model to evaluation mode
    with torch.no_grad():
        for inputs, labels in test_loader:
            outputs = model(inputs)
            _, predicted = torch.max(outputs, 1)
            total += labels.size(0)
            correct += (predicted == labels).sum().item()
    
    test_acc = 100 * correct / total
    print(f"Test accuracy: {test_acc:.2f}%")
    
    return test_acc

####  Optimizations:

*   Increased the number of training epochs to 10 to allow the model more time to learn from the data.
*   Decreased the learning rate to 0.1 to prevent overshooting and promote more stable convergence.
*   Added weight decay (L2 regularization) to the SGD optimizer to prevent overfitting by penalizing large weights.
*   Ensured the optimizer step is called after the backward pass to update model parameters.

In [None]:
# Load MNIST dataset
train_loader, test_loader = load_mnist()

# Hyperparameters
epochs = 10
learning_rate = 0.1

# Build FNN model
fnn_model = build_model(cnn=False)

# Print model architecture
print("FNN Model Architecture:")
print(fnn_model)
print("\n" + "="*50 + "\n")

# Train and evaluate FNN model
print("Training Fully Connected Neural Network...")
fnn_accuracy = train_and_evaluate(
    model=fnn_model,
    train_loader=train_loader,
    test_loader=test_loader,
    epochs=epochs,
    learning_rate=learning_rate
)

FNN Model Architecture:
Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=128, bias=True)
  (2): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (3): ReLU()
  (4): Dropout(p=0.2, inplace=False)
  (5): Linear(in_features=128, out_features=64, bias=True)
  (6): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (7): ReLU()
  (8): Linear(in_features=64, out_features=10, bias=True)
)


Training Fully Connected Neural Network...
Epoch 1, Loss: 0.005
Epoch 1, Loss: 0.010
Epoch 1, Loss: 0.014
Epoch 1, Loss: 0.017
Epoch 1, Loss: 0.021
Epoch 1, Loss: 0.024
Epoch 1, Loss: 0.027
Epoch 1, Loss: 0.030
Epoch 1, Loss: 0.033
Epoch 1, Loss: 0.036
Epoch 1, Loss: 0.038
Epoch 1, Loss: 0.041
Epoch 1, Loss: 0.043
Epoch 1, Loss: 0.045
Epoch 1, Loss: 0.048
Epoch 1, Loss: 0.050
Epoch 1, Loss: 0.052
Epoch 1, Loss: 0.054
Epoch 1, Loss: 0.056
Epoch 1, Loss: 0.058
Epoch 1, Loss: 0.060
Epoch 1, Loss: 0.06

In [12]:
# Hyperparameters
epochs = 10
learning_rate = 0.1

# Build CNN model
cnn_model = build_model(cnn=True)

# Print model architecture
print("CNN Model Architecture:")
print(cnn_model)
print("\n" + "="*50 + "\n")

# Train and evaluate CNN model
print("Training Convolutional Neural Network...")
cnn_accuracy = train_and_evaluate(
    model=cnn_model,
    train_loader=train_loader,
    test_loader=test_loader,
    epochs=epochs,
    learning_rate=learning_rate
)

CNN Model Architecture:
Sequential(
  (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU()
  (3): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (4): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (5): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (6): ReLU()
  (7): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  (8): Flatten(start_dim=1, end_dim=-1)
  (9): Linear(in_features=3136, out_features=128, bias=True)
  (10): BatchNorm1d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (11): ReLU()
  (12): Dropout(p=0.2, inplace=False)
  (13): Linear(in_features=128, out_features=10, bias=True)
)


Training Convolutional Neural Network...
Epoch 1, Loss: 0.005
Epoch 1, Loss: 0.008
Epoch 1, Loss: 0.011
Epoch 1, Loss: 0.013
Epoch 1, Loss: 0.015


#### How Softmax and Categorical Cross-Entropy Work for MNIST Classification

1. **Softmax Function:**
   - Converts the model's output into 10 probabilities (one for each digit 0-9)
   - Makes sure all probabilities add up to 1.0

2. **Categorical Cross-Entropy Loss:**
   - Measures how far the predicted probabilities are from the correct answer
   - Creates larger penalties when the model is confident but wrong

3. **Learning Process:**
   - When the model makes a mistake (e.g., thinks a "3" is a "5"), the loss produces gradients
   - These gradients update the network weights to make the correct digit more likely next time
   - Over many training examples, the model learns which visual features identify each digit