# Assignment - 1

This assignment has two sections:

    a) Study of hyperparameter impact - after you complete the program, add a one-page discussion and conclusion regarding hyperparameter impact to your notebook.
    b) Study of data bias impact - after you complete the program, add a one-page discussion and conclusion regarding data bias impact to your notebook.
### Model: 
![alt text](model.png "Architecture")

In [None]:
import time
import torch
import itertools
import torchvision

import numpy as np
import matplotlib.pyplot as plt

from torch import nn, optim
from torchvision import datasets, transforms
from torch.utils.data import TensorDataset, DataLoader

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

flatten = itertools.chain.from_iterable

# Some helper functions

def plot_loss(loss_as_list):
    """
    Plot the loss curve from a list of loss terms.
    """
    plt.plot(loss_as_list, 'k')
    _ = plt.title("Loss Curve")
    _ = plt.xlabel("Iterations (batches)")
    _ = plt.ylabel("Loss")
    
def get_classification_results(model, loader):
    """
    Print the accuracy, classification report, and confusion matrix
    of a trained model evaluated on the data in `loader`.
    """
    correct, total = 0, 0
    predictions = []
    true_labels = []

    with torch.no_grad(): # gradients are not needed for evaluation
        for xs, ts in loader: # use the loader that was passed in
            xs = xs.view(-1, 784) # flatten the image
            zs = model(xs) # do forward pass
            pred = zs.max(1, keepdim=True)[1] # get the index of the max logit
            correct += pred.eq(ts.view_as(pred)).sum().item() # count equal values
            total += int(ts.shape[0]) # get total values

            predictions.append(pred.view(-1).tolist()) # store plain Python ints
            true_labels.append(ts.tolist())

    accuracy = correct / total
    y_true = list(flatten(true_labels))
    y_pred = list(flatten(predictions))
    conf_matrix = confusion_matrix(y_true, y_pred)
    cl_report = classification_report(y_true, y_pred, digits=4)

    print("Accuracy: {:.4f}".format(accuracy))
    print(cl_report)
    print(conf_matrix)

In the past three lectures, we saw how to load data, define a multi-layer perceptron (MLP) model, and train it. In this assignment, we will learn about nuances of the dataset and some inherent limitations of deep neural networks. The required helper code is provided, and students are encouraged to read the documentation and write their own code to carry out some of the operations.

## a) Impact of Hyperparameters in Learning Performance

Hyperparameters are parameters that are set before the learning process begins and that influence the learning performance of your algorithm. They are tunable and control how the model handles different aspects of the forward and backward passes during optimization. For example, as we saw, changing the learning rate ($\eta$) can improve or degrade learning performance: a large $\eta$ can make training unstable, while a very small $\eta$ produces tiny weight updates and slow progress.
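
As a rough illustration of both failure modes, the sketch below runs plain gradient descent on the toy objective $f(w) = w^2$, which is not the MNIST model. The function name `gradient_descent` and the three step sizes are assumptions chosen only for this example.

```python
def gradient_descent(eta, w0=1.0, steps=20):
    """Minimize f(w) = w**2 with a fixed learning rate eta (toy example)."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w      # derivative of w**2
        w = w - eta * grad  # standard gradient descent update
    return w

# Hypothetical step sizes: too small, reasonable, too large.
final_small = gradient_descent(eta=0.001)
final_good = gradient_descent(eta=0.3)
final_large = gradient_descent(eta=1.1)

print("eta=0.001 ->", final_small)
print("eta=0.3   ->", final_good)
print("eta=1.1   ->", final_large)
```

With the small step size the weight barely moves from its starting value, with the moderate one it approaches the minimum at 0, and with the large one each update overshoots so the magnitude of `w` grows.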

Let us look at some hyperparameters and understand how they impact learning performance. First, we will define the hyperparameters and train the Deep Neural Network (DNN) model we discussed in Lecture 2.

### Load original MNIST data

In [None]:
torch.manual_seed(13)

N_train = 64
N_test = 256

# We will use torch.utils.data.DataLoader to wrap our dataset.
# This provides easier batching, shuffling, parallel loading, etc.
# Calling torchvision.datasets.MNIST() will download and format the MNIST
# dataset with the transforms we specify. Here, in the transforms we first convert
# the image to PyTorch tensor, and then normalize the image based on a given mean
# and standard deviation. Normalizing the image does: image = (image - mean) / std.
# We shuffle the data as well by defining shuffle=True.

train_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('../Datasets/', train=True, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=N_train, shuffle=True)

test_loader = torch.utils.data.DataLoader(
  torchvision.datasets.MNIST('../Datasets/', train=False, download=True,
                             transform=torchvision.transforms.Compose([
                               torchvision.transforms.ToTensor(),
                               torchvision.transforms.Normalize(
                                 (0.1307,), (0.3081,))
                             ])),
  batch_size=N_test, shuffle=True)
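
To make the `Normalize` arithmetic concrete, here is a small NumPy sketch of `image = (image - mean) / std` using the mean and standard deviation from the cell above. The three pixel values are made up for illustration; they assume pixels already scaled to the range [0, 1], which is what `ToTensor()` produces.

```python
import numpy as np

mean, std = 0.1307, 0.3081  # values passed to Normalize above

# Hypothetical pixel intensities in [0, 1]: black, "average", and white.
pixels = np.array([0.0, 0.1307, 1.0])

normalized = (pixels - mean) / std
print(normalized)
```

A pixel equal to the mean maps to 0, darker pixels become negative, and the brightest pixels end up a few standard deviations above 0.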

### Define the model and hyperparameters

In [None]:
input_size = 784
hidden_sizes = [128, 64]
output_size = 10

# Hyper Parameters
lr = 0.003 # learning rate
NUM_EPOCHS = 10

In [None]:
def MLP():
    """
    A function implementation of the model definition.
    """
    model = nn.Sequential(nn.Linear(input_size, hidden_sizes[0]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[0], hidden_sizes[1]),
                      nn.ReLU(),
                      nn.Linear(hidden_sizes[1], output_size)
                     )
    return model

In [None]:
def train(model, NUM_EPOCHS, train_loader):
    """
    A function to train the neural network model.
    """
    loss_fn = nn.CrossEntropyLoss() # also called criterion sometimes.

    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    start = time.time()
    loss_as_list = []

    for EPOCH in range(NUM_EPOCHS):
        running_loss = 0
        for images, labels in train_loader:
            # Flatten MNIST images into a 784 long vector
            images = images.view(images.shape[0], -1)

            # Training pass
            optimizer.zero_grad()

            output = model(images)
            loss = loss_fn(output, labels)
            loss_as_list.append(loss.item()) # store a plain float, not the graph-attached tensor

            #This is where the model learns by backpropagating
            loss.backward()

            #And optimizes its weights here
            optimizer.step()

            running_loss += loss.item()
        # Report the mean loss over all batches in this epoch
        print("Epoch {} - Training loss: {}".format(EPOCH, running_loss/len(train_loader)))

    print("\nTraining Time (in minutes) =",(time.time()-start)/60)
    return loss_as_list

In [None]:
# Define the model
model = MLP()

# Start training the model on the train_loader.
loss_values = train(model, NUM_EPOCHS, train_loader)

In [None]:
# Plot the loss curve
plot_loss(loss_values)

In [None]:
# This function we defined in the beginning generates a classification report based
# on https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn.metrics.classification_report
# We also generate a confusion matrix to study the misclassifications.
get_classification_results(model, test_loader)

In [None]:
# Save the model
torch.save(model, 'mnist_original.pt')

In [None]:
# Study the representations

# Extract the hidden layer latent dense representations to study the distribution of the representations.
# We can select a specific layer for our study.
# Let us extract the penultimate layer as the embedding layer. The slice `[:-2]` keeps every module up to, but not including, the last two (the final ReLU and the output layer).

embd_model = nn.Sequential(*list(model.children()))[:-2]

In [None]:
# Note that our new model doesn't have the output layer or the ReLU activation layer before it.
embd_model
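
For readers who want to check the slicing behaviour in isolation, the sketch below applies the same `[:-2]` slice to a much smaller stand-in network. The layer sizes and the names `toy_model` and `toy_embd` are assumptions for illustration and are not part of the assignment model.

```python
import torch
from torch import nn

torch.manual_seed(0)

# A small stand-in with the same Linear/ReLU pattern as MLP().
toy_model = nn.Sequential(
    nn.Linear(8, 6), nn.ReLU(),
    nn.Linear(6, 4), nn.ReLU(),
    nn.Linear(4, 3),
)

# Drop the last two modules: the final ReLU and the output Linear layer.
toy_embd = toy_model[:-2]

x = torch.randn(5, 8)   # a batch of 5 hypothetical inputs
with torch.no_grad():   # no gradients needed to read out embeddings
    z = toy_embd(x)

embedding_shape = tuple(z.shape)
num_layers = len(toy_embd)
print(toy_embd)
print(embedding_shape)
```

The sliced network ends in the second `Linear` layer, so the embeddings are its raw outputs before any ReLU is applied, and it shares its weights with the original model.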

In [None]:
# We can define a z vector and append all activations to that.
zs = []
for xs, ts in test_loader:
    xs = xs.view(-1, 784) # flatten the image
    zs.append(embd_model(xs).detach().numpy()) # do forward pass to extract embeddings and append to zs.

zs = np.vstack(zs) # Stack all the embeddings. This will give you a 10000 x 64 array, since the embedding size (out_features of (2) Linear) is 64.

zs_mean = np.average(zs, axis=0) # Find the mean of all embeddings along the sample axis (axis=0). This will give you a vector of length 64.

zs_std = np.std(zs, axis=0) # Find the standard deviation of all embeddings along the same axis.
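
As a hint for working with these statistics, the sketch below computes the per-dimension mean and standard deviation, plus a covariance matrix, on a random stand-in array rather than on real embeddings. The array `fake_zs` and its size are assumptions; `np.cov` with `rowvar=False` is one possible way to treat each column as a variable, not necessarily the intended solution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for an embedding matrix: 1000 samples, 64 dimensions each.
fake_zs = rng.normal(size=(1000, 64))

fake_mean = fake_zs.mean(axis=0)  # one mean per embedding dimension
fake_std = fake_zs.std(axis=0)    # one standard deviation per dimension

# rowvar=False tells np.cov that columns are variables and rows are samples.
fake_cov = np.cov(fake_zs, rowvar=False)

print(fake_mean.shape, fake_std.shape, fake_cov.shape)
```

The diagonal of the covariance matrix holds the per-dimension variances (with the `N - 1` normalization that `np.cov` uses by default), and the off-diagonal entries show how pairs of embedding dimensions vary together.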

In [None]:
# Find the covariance of embeddings.


### Change Learning Rate

Great! Our model reaches 97.71% accuracy (your exact number may differ slightly from run to run). What happens when we change the learning rate to $\eta = 0.05$? Please add your code below.

In [None]:
# Define new hyper parameters here


In [None]:
# Define the model
model_lr = ...

In [None]:
# Start training the model on the train_loader.
model_lr_loss_values = ...

In [None]:
# Plot the loss curve


In [None]:
# Classification results


In [None]:
# Save the model
torch.save(model_lr, 'mnist_change_lr.pt')

In [None]:
# Study the representations

# Let us extract the penultimate layer as the embedding layer.

embd_model_lr = ...

In [None]:
# Note that our new model doesn't have the output layer or the ReLU activation layer before it.
embd_model_lr

In [None]:
# We can define a z vector and append all activations to that.
zs_embd_model_lr = []
for xs, ts in test_loader:
    xs = xs.view(-1, 784) # flatten the image
    zs_embd_model_lr.append(embd_model_lr(xs).detach().numpy()) # do forward pass to extract embeddings and append to zs.

zs_embd_model_lr = ... # Stack all the embeddings. 

zs_embd_model_lr_mean = ... # Find mean of all embeddings with respect to depth axis (axis=0).

zs_embd_model_lr_std = ... # Find the standard deviation of all embeddings with respect to the depth axis.

In [None]:
# Find the covariance of embeddings.


What are your findings? Write below. (Make sure you change the type of the cell to Markdown.)

### Change Number of Layers

What happens if we increase the number of layers in the neural network? Can we expect higher accuracy? Create three hidden layers instead of two. The third hidden layer size can be 32.
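
Before building the larger network, it may help to see how little the parameter count changes. The helper `count_mlp_parameters` below is a hypothetical utility written for this illustration; it assumes fully connected layers with biases, as in `nn.Linear`, and uses the layer sizes mentioned in this notebook.

```python
def count_mlp_parameters(layer_sizes):
    """Count weights and biases of a fully connected network (illustrative helper)."""
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total

two_hidden = count_mlp_parameters([784, 128, 64, 10])
three_hidden = count_mlp_parameters([784, 128, 64, 32, 10])

print("Two hidden layers:  ", two_hidden)
print("Three hidden layers:", three_hidden)
```

The extra hidden layer of size 32 adds fewer than 2,000 parameters, so it is worth checking whether any change in accuracy comes from the added depth rather than from raw capacity.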

In [None]:
input_size = 784
hidden_sizes_bigger_MLP = [] # Fill in the three hidden layer sizes here
output_size = 10

def bigger_MLP():
    """
    A function implementation of the model definition.
    """
    model = nn.Sequential(
                         # Fill your network here
                         )
    return model

In [None]:
# Define the model
bigger_model = ...

In [None]:
# Start training the model on the train_loader.
bigger_model_loss_values = ...

In [None]:
# Plot the loss curve


In [None]:
# Classification results


In [None]:
# Save the model
torch.save(bigger_model, 'mnist_bigger_model.pt')

In [None]:
# Study the representations

# Let us extract the penultimate layer as the embedding layer.

embd_bigger_model = ...

In [None]:
# Note that our new model doesn't have the output layer or the ReLU activation layer before it.
embd_bigger_model

In [None]:
# We can define a z vector and append all activations to that.
zs_embd_bigger_model = []
for xs, ts in test_loader:
    xs = xs.view(-1, 784) # flatten the image
    zs_embd_bigger_model.append(embd_bigger_model(xs).detach().numpy()) # do forward pass to extract embeddings and append to zs.

zs_embd_bigger_model = ... # Stack all the embeddings. 

zs_embd_bigger_model_mean = ... # Find mean of all embeddings with respect to depth axis (axis=0).

zs_embd_bigger_model_std = ... # Find the standard deviation of all embeddings with respect to the depth axis.

In [None]:
# Find the covariance of embeddings.


### Discussion and Conclusion Section - Major Finding (1 Page)

What piqued your interest? Did you find any changes in the learning performance and representations when you changed the hyperparameters or increased the size of the model? Write your findings below.

## b) Data Bias due to Class Imbalance

MNIST is a relatively clean dataset. It has 60,000 training samples, and each class has roughly 6,000 of them, so the classes are close to balanced. In fact, here is the proof.

In [1]:
# Transform to normalized Tensors 
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])

train_dataset = datasets.MNIST('../Datasets/', train=True, transform=transform, download=True)
test_dataset = datasets.MNIST('../Datasets/', train=False, transform=transform, download=True)


train_loader = DataLoader(train_dataset, batch_size=len(train_dataset))
test_loader = DataLoader(test_dataset, batch_size=len(test_dataset))

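# Each loader yields the whole dataset as a single batch, so fetch it only once.
train_images, train_labels = next(iter(train_loader))
train_dataset_array = train_images.numpy()
train_dataset_array_labels = train_labels.numpy()

test_images, test_labels = next(iter(test_loader))
test_dataset_array = test_images.numpy()
test_dataset_array_labels = test_labels.numpy()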


In [None]:
unique_vals = np.unique(train_dataset_array_labels, return_counts=True)

plt.bar(unique_vals[0], unique_vals[1], color='gray')
plt.title("MNIST Training Data Distribution")
plt.xlabel("Class")
plt.ylabel("Num Samples")
_ = plt.xticks(unique_vals[0])

We can see that there is not much difference in the number of samples per class. Now, what happens if one class has considerably fewer samples? Below is some code to reduce the number of samples from class `1`. Train, test, and study the embeddings of a model on the biased dataset and compare with the previous study.

In [None]:
count = 0
max_count = 6500
pos2del = []
label2del = 1
for i in range(len(train_dataset_array_labels)):
    if train_dataset_array_labels[i] == label2del and count < max_count:
        count += 1
        pos2del.append(i)

In [None]:
train_biased_X = np.delete(train_dataset_array, pos2del, axis=0)
train_biased_Y = np.delete(train_dataset_array_labels, pos2del, axis=0)
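
The loop-and-delete steps above can also be sketched in vectorized form. The example below uses a tiny made-up label array instead of MNIST, and the names `label_to_drop` and `max_drop` are assumptions that mirror `label2del` and `max_count`.

```python
import numpy as np

# Toy stand-in data: 9 samples with 2 features each.
labels = np.array([1, 0, 1, 2, 1, 1, 0, 2, 1])
features = np.arange(len(labels) * 2).reshape(len(labels), 2)

label_to_drop = 1  # class to under-sample
max_drop = 3       # remove at most this many samples of that class

# Positions of the first `max_drop` samples belonging to the chosen class.
drop_idx = np.flatnonzero(labels == label_to_drop)[:max_drop]

kept_features = np.delete(features, drop_idx, axis=0)
kept_labels = np.delete(labels, drop_idx, axis=0)

print(drop_idx)
print(np.unique(kept_labels, return_counts=True))
```

As in the loop version, the first matching samples are removed in order, so the remaining samples of the class are the ones that appear latest in the array.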

In [None]:
N_train = 64
N_test = 256

t_mnist_assn_1a_train_X = torch.Tensor(train_biased_X)
t_mnist_assn_1a_train_Y = torch.Tensor(train_biased_Y).type(torch.LongTensor)
    
train_data = TensorDataset(t_mnist_assn_1a_train_X, t_mnist_assn_1a_train_Y)
train_loader_bias = DataLoader(train_data, batch_size=N_train, shuffle=True)

# Note that we didn't do any operations on the test loader.

In [None]:
unique_vals = np.unique(train_biased_Y, return_counts=True)

plt.bar(unique_vals[0], unique_vals[1], color='gray')
plt.title("MNIST Training Data Distribution - Class 1 Imbalance")
plt.xlabel("Class")
plt.ylabel("Num Samples")
_ = plt.xticks(unique_vals[0])

In [None]:
input_size = 784
hidden_sizes = [128, 64]
output_size = 10

# Hyper Parameters
lr = 0.003 # learning rate
NUM_EPOCHS = 10

In [None]:
# Define the model
model_biased_1 = MLP()

# Start training the model on the train_loader_bias.
loss_values_model_biased_1 = ...

In [None]:
# Plot the loss curve


In [None]:
# Classification results


In [None]:
# Save the model
torch.save(model_biased_1, 'mnist_biased_1.pt')

In [None]:
# Study the representations

# Let us extract the penultimate layer as the embedding layer.

embd_model_biased_1 = ...

In [None]:
# Note that our new model doesn't have the output layer or the ReLU activation layer before it.
embd_model_biased_1

In [None]:
# We can define a z vector and append all activations to that.
zs_embd_model_biased_1 = []
for xs, ts in test_loader:
    xs = xs.view(-1, 784) # flatten the image
    zs_embd_model_biased_1.append(embd_model_biased_1(xs).detach().numpy()) # do forward pass to extract embeddings and append to zs.

zs_embd_model_biased_1 = ... # Stack all the embeddings. 

zs_embd_model_biased_1_mean = ... # Find mean of all embeddings with respect to depth axis (axis=0).

zs_embd_model_biased_1_std = ... # Find the standard deviation of all embeddings with respect to the depth axis.

In [None]:
# Find the covariance of embeddings.


### Discussion and Conclusion Section - Major Finding (1 Page)