# HW1 - Exploring MLPs with PyTorch

# Problem 1: Simple MLP for Binary Classification
In this problem, you will train a simple MLP to classify two handwritten digits: 0 vs. 1. We provide starter code that walks through the task step by step. However, you do not need to follow these exact steps, as long as you complete the sections marked as <span style="color:red">[YOUR TASK]</span>.

## Dataset Setup
We will use the [MNIST dataset](http://yann.lecun.com/exdb/mnist/), which the `torchvision` package supports out of the box. We can load the dataset as follows (it will take up about 63 MB of disk space):


In [63]:
import torch
from torchvision import transforms, datasets
import numpy as np
import pandas as pd
import sklearn
import torch.nn as nn


In [64]:
import platform, time
print(platform.mac_ver())
# torch.has_mps is deprecated; query the MPS backend directly
torch.backends.mps.is_available()

('14.2.1', ('', '', ''), 'arm64')




True

In [65]:
device = torch.device('cpu')

In [66]:
# if not torch.backends.mps.is_available():
#     if not torch.backends.mps.is_built():
#         print("MPS not available because the current PyTorch install was not "
#               "built with MPS enabled.")
#     else:
#         print("MPS not available because the current MacOS version is not 12.3+ "
#               "and/or you do not have an MPS-enabled device on this machine.")
    
# else:
#     device = torch.device("mps")
#     print('mps enabled')

In [67]:
# define the data pre-processing
# convert the input to the range [-1, 1].
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize(0.5, 0.5)]
    )

# Load the MNIST dataset 
# this command requires Internet to download the dataset
# (a relative root keeps the notebook portable across machines)
mnist = datasets.MNIST(root='./data',
                       train=True, 
                       download=True, 
                       transform=transform)
mnist_test = datasets.MNIST(root='./data',
                            train=False, 
                            download=True, 
                            transform=transform)
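
The comment "convert the input to the range [-1, 1]" can be checked with a little arithmetic. The sketch below assumes that `ToTensor()` has already scaled pixels to `[0, 1]` and reproduces the `(x - mean) / std` computation of `Normalize(0.5, 0.5)` in plain `torch`; the pixel values are made up for illustration.

```python
import torch

# Hypothetical pixel values after ToTensor(), which scales uint8 [0, 255] to [0, 1].
pixels = torch.tensor([0.0, 0.25, 0.5, 1.0])

mean, std = 0.5, 0.5
# Same arithmetic that Normalize(0.5, 0.5) applies: (x - mean) / std.
normalized = (pixels - mean) / std

# Inverting the transform recovers the [0, 1] values (useful for plotting).
restored = normalized * std + mean

print(normalized)  # values now span [-1, 1]
print(restored)
```

The inverse step is handy later when visualizing images, since plotting libraries usually expect values in `[0, 1]`.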

In [68]:
from torch.utils.data import DataLoader, random_split

# `train_labels` is deprecated in torchvision; `targets` holds the same labels
print("Frequencies: ", torch.bincount(mnist.targets))
print(len(torch.bincount(mnist.targets)))
# Filter for digits 0 and 1
# train_data = [data for data in mnist if data[1] < 2]
# train_index = mnist.train_labels<2
# mnist.data = mnist.data[train_index]
# mnist.targets = mnist.targets[train_index]
# # Your code goes here
# mnist_test.data = mnist_test.data[test_index]
# mnist_test.targets = mnist_test.targets[test_index]

Frequencies:  tensor([5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949])
10




In [69]:
# Split training data into training and validation sets
# Your code goes here
# train_set = ...
# val_set = ...
train_len = int(len(mnist) * 0.8)
val_len = len(mnist) - train_len
train_set, val_set = random_split(mnist, [train_len, val_len])

# Define DataLoaders to access data in batches
train_loader = DataLoader(train_set, batch_size=1028, shuffle=True)
# Your code goes here
val_loader = DataLoader(val_set, batch_size = 1028, shuffle=False)
test_loader = DataLoader(mnist_test, batch_size = 1028, shuffle=False)

In [70]:
train_len

48000

# Problem 2: MNIST 10-class classification

Now we want to train an MLP to handle multi-class classification for all 10 digits in the MNIST dataset. We will use the full MNIST dataset without filtering for specific digits. You may modify the MLP so that it can be used for multi-class classification.
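
One detail that often matters when moving to multi-class classification: `nn.CrossEntropyLoss` expects raw logits (no softmax in the model) and integer class indices as targets. The following is a minimal sketch on random, made-up logits showing that the loss matches a manual `log_softmax` followed by `nll_loss`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical batch: 4 samples, 10 classes (one logit per digit).
logits = torch.randn(4, 10)
# Targets are integer class indices, not one-hot vectors.
targets = torch.tensor([3, 0, 9, 1])

# CrossEntropyLoss applies log-softmax internally, so the model should output raw logits.
loss = nn.CrossEntropyLoss()(logits, targets)

# Equivalent manual computation for comparison.
manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

print(loss.item(), manual.item())
```

This is why the final layer of a multi-class MLP can simply be a linear layer with 10 outputs.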

<span style="color:red">[YOUR TASK]</span>
- Implement the training loop and evaluation section. Report the hyper-parameters you choose.
- Experiment with different numbers of neurons in the hidden layer and note any changes in performance.
- Write a brief analysis of the model's performance, including any challenges faced and how they were addressed.

In our implementations, we trained our network for 10 epochs in about 20 seconds on a laptop.
When you define a new model, remember to update the optimizer!
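
The reminder about the optimizer can be illustrated with a small sketch. An optimizer holds references to the parameters it was constructed with, so after defining a new model, an old optimizer would not touch the new weights. The tiny `nn.Linear` models and variable names below are placeholders for illustration only.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

old_model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(old_model.parameters(), lr=0.1)

# Define a new model but (mistakenly) keep the old optimizer.
new_model = nn.Linear(4, 2)
before = new_model.weight.detach().clone()

loss = new_model(torch.ones(3, 4)).sum()
loss.backward()

# The stale optimizer only knows old_model's parameters, which have no gradients.
optimizer.step()
stale_unchanged = torch.equal(new_model.weight, before)

# Re-create the optimizer for the new model; now the step updates its weights.
optimizer = torch.optim.SGD(new_model.parameters(), lr=0.1)
optimizer.step()
fresh_changed = not torch.equal(new_model.weight, before)

print(stale_unchanged, fresh_changed)
```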



In [71]:
class MulticlassMLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super(MulticlassMLP, self).__init__()
        # Your code goes here
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.activation = nn.Sigmoid()
        self.fc2 = nn.Linear(hidden_dim, out_dim)
        
    def forward(self, x):
        # Your code goes here
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x)
        
        return x

# Your code goes here
hidden_dim = int(np.sqrt(28*28*10))
model = MulticlassMLP(in_dim=28 * 28,
                  hidden_dim=hidden_dim,
                  out_dim=10).to(device)
print(model)

MulticlassMLP(
  (fc1): Linear(in_features=784, out_features=88, bias=True)
  (activation): Sigmoid()
  (fc2): Linear(in_features=88, out_features=10, bias=True)
)


In [72]:
def ten_digit(batch_size, hidden_dim, optimizer, device='cpu'):  # device: 'cpu' or 'mps'
    device = torch.device(device)
    # Define DataLoaders to access data in batches
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    # Your code goes here
    val_loader = DataLoader(val_set, batch_size = batch_size, shuffle=False)
    test_loader = DataLoader(mnist_test, batch_size = batch_size, shuffle=False)
    
    model = MulticlassMLP(in_dim=28 * 28,
                  hidden_dim=hidden_dim,
                  out_dim=10).to(device)
    # print(model)
    criterion = nn.CrossEntropyLoss()
    
    if optimizer == 'adam':
        lr = 1e-3
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    else:
        lr=1e-2
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    
    num_epochs = 10
    # training
    start_time = time.time()
    for epoch in range(num_epochs):
        correct, count = 0, 0 
        for data, target in train_loader:
            # move the batch to the selected device
            data, target = data.to(device), target.to(device)
            # free the gradient from the previous batch
            optimizer.zero_grad()
            # reshape the image into a vector
            data = data.view(data.size(0), -1)
            # model forward
            output = model(data)
            # compute the loss
            loss = criterion(output, target)
            # model backward
            loss.backward()
            # update the model parameters
            optimizer.step()
            
            # adding this for train accuracy 
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()
            count += data.size(0)
        
        train_acc = 100. * correct / count
        # print(f'Training accuracy: {train_acc:.2f}%')

    training_time = time.time()- start_time
    # print(training_time)
    
    # validation (eval mode, no gradient tracking)
    model.eval()
    val_loss = count = 0
    correct = total = 0
    with torch.no_grad():
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            data = data.view(data.size(0), -1)
            output = model(data)
            val_loss += criterion(output, target).item()
            count += 1
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()
            total += data.size(0)
        
    val_loss = val_loss / count
    val_acc = 100. * correct / total
    # print(f'Validation loss: {val_loss:.2f}, accuracy: {val_acc:.2f}%')
    
    # test
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            data = data.view(data.size(0), -1)
            output = model(data)
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()
            total += data.size(0)
            
    test_acc = 100. * correct / total
    # print(f'Test Accuracy: {test_acc:.2f}%')
    
    return training_time, train_acc, val_acc, test_acc

In [76]:
import pandas as pd

results = []
devices = ['cpu', 'mps']
batch_sizes = [64, 128, 1024]
optimizers = ['adam', 'sgd']
# learning_rates= [1e-4, 1e-3, 1e-2, 1e-1]
hidden_dims = [16, 32, 64]
for batch_size in batch_sizes:
    for optimizer in optimizers:
        for device in devices:
            for hidden_dim in hidden_dims:
                training_time, train_acc, val_acc, test_acc = ten_digit(batch_size=batch_size, 
                                                                        optimizer=optimizer,
                                                                        hidden_dim=hidden_dim,
                                                                        # lr = lr, 
                                                                        device=device )
                lr = 1e-3 if optimizer=='adam' else 1e-2
                print([device, batch_size, optimizer, lr, hidden_dim,  training_time, train_acc, val_acc, test_acc])
                results.append([device, batch_size, optimizer, lr, hidden_dim,  training_time, train_acc, val_acc, test_acc])



headers = ['Device', 'Batch size', 'Optimizer', 'LR', 'Hidden Dim', 
           'Training Time', 'Train Acc', 'Val Acc', 'Test Acc']
df = pd.DataFrame(results, columns=headers)

['cpu', 64, 'adam', 0.001, 16, 29.785605907440186, 93.33333333333333, 92.89166666666667, 93.11]
['cpu', 64, 'adam', 0.001, 32, 30.65951418876648, 95.59583333333333, 94.975, 95.12]
['cpu', 64, 'adam', 0.001, 64, 30.598462104797363, 97.10208333333334, 95.88333333333334, 96.18]
['mps', 64, 'adam', 0.001, 16, 70.20412874221802, 93.47916666666667, 93.10833333333333, 93.37]
['mps', 64, 'adam', 0.001, 32, 64.60245871543884, 95.71458333333334, 94.96666666666667, 95.1]
['mps', 64, 'adam', 0.001, 64, 68.29115414619446, 97.02083333333333, 95.975, 96.22]
['cpu', 64, 'sgd', 0.01, 16, 28.56167697906494, 88.98125, 89.03333333333333, 89.35]
['cpu', 64, 'sgd', 0.01, 32, 28.658032178878784, 90.04375, 90.06666666666666, 90.38]
['cpu', 64, 'sgd', 0.01, 64, 29.286179065704346, 90.01875, 90.11666666666666, 90.72]
['mps', 64, 'sgd', 0.01, 16, 54.447420835494995, 89.04166666666667, 89.4, 89.41]
['mps', 64, 'sgd', 0.01, 32, 54.18169593811035, 89.82916666666667, 90.05, 90.57]
['mps', 64, 'sgd', 0.01, 64, 53.712

In [80]:
df

| | Device | Batch size | Optimizer | LR | Hidden Dim | Training Time | Train Acc | Val Acc | Test Acc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cpu | 64 | adam | 0.001 | 16 | 29.785606 | 93.333333 | 92.891667 | 93.11 |
| 1 | cpu | 64 | adam | 0.001 | 32 | 30.659514 | 95.595833 | 94.975 | 95.12 |
| 2 | cpu | 64 | adam | 0.001 | 64 | 30.598462 | 97.102083 | 95.883333 | 96.18 |
| 3 | mps | 64 | adam | 0.001 | 16 | 70.204129 | 93.479167 | 93.108333 | 93.37 |
| 4 | mps | 64 | adam | 0.001 | 32 | 64.602459 | 95.714583 | 94.966667 | 95.1 |
| 5 | mps | 64 | adam | 0.001 | 64 | 68.291154 | 97.020833 | 95.975 | 96.22 |
| 6 | cpu | 64 | sgd | 0.01 | 16 | 28.561677 | 88.98125 | 89.033333 | 89.35 |
| 7 | cpu | 64 | sgd | 0.01 | 32 | 28.658032 | 90.04375 | 90.066667 | 90.38 |
| 8 | cpu | 64 | sgd | 0.01 | 64 | 29.286179 | 90.01875 | 90.116667 | 90.72 |
| 9 | mps | 64 | sgd | 0.01 | 16 | 54.447421 | 89.041667 | 89.4 | 89.41 |


In [79]:
df.to_csv('sigmoid_hyperopt.csv')

In [81]:
class MulticlassMLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super(MulticlassMLP, self).__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.activation = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, out_dim)
        
    def forward(self, x):
        # Your code goes here
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x)
        
        return x

# Your code goes here
hidden_dim = int(np.sqrt(28*28*10))
model = MulticlassMLP(in_dim=28 * 28,
                  hidden_dim=hidden_dim,
                  out_dim=10).to(device)
print(model)

MulticlassMLP(
  (fc1): Linear(in_features=784, out_features=88, bias=True)
  (activation): ReLU()
  (fc2): Linear(in_features=88, out_features=10, bias=True)
)


In [83]:
import pandas as pd

results = []
devices = ['cpu']
batch_sizes = [64, 128, 1024]
optimizers = ['adam', 'sgd']
# learning_rates= [1e-4, 1e-3, 1e-2, 1e-1]
hidden_dims = [16, 32, 64, 128]
for batch_size in batch_sizes:
    for optimizer in optimizers:
        for device in devices:
            for hidden_dim in hidden_dims:
                training_time, train_acc, val_acc, test_acc = ten_digit(batch_size=batch_size, 
                                                                        optimizer=optimizer,
                                                                        hidden_dim=hidden_dim,
                                                                        # lr = lr, 
                                                                        device=device )
                lr = 1e-3 if optimizer=='adam' else 1e-2
                print([device, batch_size, optimizer, lr, hidden_dim,  training_time, train_acc, val_acc, test_acc])
                results.append([device, batch_size, optimizer, lr, hidden_dim,  training_time, train_acc, val_acc, test_acc])



headers = ['Device', 'Batch size', 'Optimizer', 'LR', 'Hidden Dim', 
           'Training Time', 'Train Acc', 'Val Acc', 'Test Acc']
df = pd.DataFrame(results, columns=headers)
df.to_csv('relu_hyperopt_q2.csv')
df

['cpu', 64, 'adam', 0.001, 16, 29.184563159942627, 88.03333333333333, 87.925, 88.14]
['cpu', 64, 'adam', 0.001, 32, 30.543816089630127, 95.29583333333333, 94.175, 94.25]
['cpu', 64, 'adam', 0.001, 64, 31.19202208518982, 96.96458333333334, 96.225, 96.5]
['cpu', 64, 'adam', 0.001, 128, 32.99071264266968, 97.86458333333333, 96.95, 96.95]
['cpu', 64, 'sgd', 0.01, 16, 27.672548055648804, 92.31875, 92.2, 92.65]
['cpu', 64, 'sgd', 0.01, 32, 28.152845859527588, 92.80208333333333, 92.65, 92.93]
['cpu', 64, 'sgd', 0.01, 64, 28.80337882041931, 93.4625, 93.3, 93.56]
['cpu', 64, 'sgd', 0.01, 128, 29.64063310623169, 93.91458333333334, 93.63333333333334, 93.97]
['cpu', 128, 'adam', 0.001, 16, 26.838134050369263, 93.07291666666667, 92.55, 92.78]
['cpu', 128, 'adam', 0.001, 32, 27.21410608291626, 95.55833333333334, 95.125, 95.59]
['cpu', 128, 'adam', 0.001, 64, 28.684025287628174, 96.66875, 96.06666666666666, 96.28]
['cpu', 128, 'adam', 0.001, 128, 29.71784520149231, 97.76458333333333, 96.5583333333333

| | Device | Batch size | Optimizer | LR | Hidden Dim | Training Time | Train Acc | Val Acc | Test Acc |
|---|---|---|---|---|---|---|---|---|---|
| 0 | cpu | 64 | adam | 0.001 | 16 | 29.184563 | 88.033333 | 87.925 | 88.14 |
| 1 | cpu | 64 | adam | 0.001 | 32 | 30.543816 | 95.295833 | 94.175 | 94.25 |
| 2 | cpu | 64 | adam | 0.001 | 64 | 31.192022 | 96.964583 | 96.225 | 96.5 |
| 3 | cpu | 64 | adam | 0.001 | 128 | 32.990713 | 97.864583 | 96.95 | 96.95 |
| 4 | cpu | 64 | sgd | 0.01 | 16 | 27.672548 | 92.31875 | 92.2 | 92.65 |
| 5 | cpu | 64 | sgd | 0.01 | 32 | 28.152846 | 92.802083 | 92.65 | 92.93 |
| 6 | cpu | 64 | sgd | 0.01 | 64 | 28.803379 | 93.4625 | 93.3 | 93.56 |
| 7 | cpu | 64 | sgd | 0.01 | 128 | 29.640633 | 93.914583 | 93.633333 | 93.97 |
| 8 | cpu | 128 | adam | 0.001 | 16 | 26.838134 | 93.072917 | 92.55 | 92.78 |
| 9 | cpu | 128 | adam | 0.001 | 32 | 27.214106 | 95.558333 | 95.125 | 95.59 |


# Problem 3: Handling Class Imbalance in MNIST Dataset
In this problem, we will explore how to handle class imbalance, which is very common in real-world applications. A modified MNIST dataset is created as follows: we keep all instances of digit “0” and only 1\% of the instances of digit “1”, in both the training and test sets:

In [None]:
# Filter for digits 0 and 1
train_0 = [data for data in mnist if data[1] == 0]
train_1 = [data for data in mnist if data[1] == 1]
train_1 = train_1[:len(train_1) // 100]
train_data = train_0 + train_1

For such a class imbalance problem, accuracy may not be a good metric: a model that always predicts "0", regardless of the input, can be about 99\% accurate. Instead, we use the $F_1$ score as the evaluation metric:
$$F_1 = 2\cdot\frac{\text{precision}\cdot \text{recall}}{\text{precision} + \text{recall}}$$
where precision and recall are defined as:
$$\text{precision}=\frac{\text{number of instances correctly predicted as "1"}}{\text{number of instances predicted as "1"}}$$
$$\text{recall}=\frac{\text{number of instances correctly predicted as "1"}}{\text{number of instances labeled as "1"}}$$
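
The definitions above can be sketched in a few lines of plain Python. The function name `f1_score_binary` and the toy label lists are hypothetical, and returning 0 when precision or recall is undefined is an assumed convention rather than part of the definition.

```python
def f1_score_binary(preds, labels, positive=1):
    """F1 score for the positive class, following the formulas above."""
    # Correctly predicted as "1".
    tp = sum(1 for p, y in zip(preds, labels) if p == positive and y == positive)
    predicted_pos = sum(1 for p in preds if p == positive)
    actual_pos = sum(1 for y in labels if y == positive)
    # Assumed convention: treat an undefined ratio (zero denominator) as 0.
    precision = tp / predicted_pos if predicted_pos else 0.0
    recall = tp / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: 99 zeros and a single one.
labels = [0] * 99 + [1]

# A classifier that always predicts "0" looks accurate but has zero F1.
always_zero = [0] * 100
accuracy = sum(p == y for p, y in zip(always_zero, labels)) / len(labels)
f1_always_zero = f1_score_binary(always_zero, labels)

# A classifier that finds the "1" but also raises one false alarm.
one_false_alarm = [0] * 98 + [1, 1]
f1_false_alarm = f1_score_binary(one_false_alarm, labels)

print(accuracy, f1_always_zero, f1_false_alarm)
```

In the second case precision is 1/2 and recall is 1, so $F_1 = 2 \cdot \frac{0.5 \cdot 1}{0.5 + 1} = \frac{2}{3}$.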

To handle such a problem, some changes to the training may be necessary. Some suggestions include: 
1) Adjusting the class weights in the loss function, i.e., use a larger weight for the minority class when computing the loss.
2) Implementing resampling techniques (either undersampling the majority class or oversampling the minority class).
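
Both suggestions can be sketched with standard PyTorch utilities. This is only one possible setup under stated assumptions: the label vector is synthetic, and the inverse-frequency weighting scheme is an illustrative choice, not a prescribed one.

```python
import torch
import torch.nn as nn
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)

# Hypothetical imbalanced labels: 990 zeros and 10 ones.
labels = torch.cat([torch.zeros(990, dtype=torch.long),
                    torch.ones(10, dtype=torch.long)])
class_counts = torch.bincount(labels, minlength=2).float()

# 1) Class weights: inverse-frequency weights (an assumed scheme),
#    so errors on the minority class cost more in the loss.
class_weights = class_counts.sum() / (2 * class_counts)
criterion = nn.CrossEntropyLoss(weight=class_weights)

# 2) Oversampling: draw each sample with probability inversely
#    proportional to its class frequency, with replacement.
sample_weights = 1.0 / class_counts[labels]
sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)

# The sampler would be passed as DataLoader(..., sampler=sampler) instead of shuffle=True.
drawn = labels[list(sampler)]
minority_fraction = drawn.float().mean().item()
print(class_weights, minority_fraction)
```

With this sampler, roughly half of the drawn samples belong to the minority class, at the cost of seeing the same few "1" images many times per epoch.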

<span style="color:red">[YOUR TASK]</span>
- Create the imbalanced datasets with all "0" digits and only 1\% of the "1" digits.
- Implement the training loop and evaluation section (implementing the $F_1$ metric).
- Ignore the class imbalance problem and train the MLP. Report your hyper-parameter details and the $F_1$ score on the test set (as the baseline).
- Explore modifications to improve performance on the class imbalance problem. Report your modifications and the resulting $F_1$ score on the test set.

In [37]:
# Your code goes here

<span style="color:red">[EXTRA BONUS]</span>

If the hyper-parameters are chosen properly, the baseline can perform satisfactorily on the class imbalance problem with 1% digit "1". We want to challenge the baseline and handle more severely imbalanced datasets.

In [None]:
import random
N = 1000
# generate a class-imbalanced dataset controlled by "N"
train_0 = [data for data in mnist if data[1] == 0]
train_1 = [data for data in mnist if data[1] == 1]
random.shuffle(train_1)
train_1 = train_1[:len(train_1) // N]
train_data = train_0 + train_1

Can you propose new ways to handle the class imbalance problem and achieve stable and satisfactory performance for large $N = 500, \; 1000, \; \cdots$?

In [2]:
# Your code goes here

# Problem 4: Reconstruct the MNIST images by Regression
In this problem, we want to train the MLP (with only one hidden layer) to complete a regression task: reconstruct the input image. The goal of this task is dimensionality reduction, so we set the hidden layer dimension to a smaller number, say 50. Once the MLP can reconstruct the input images accurately, the hidden layer gives us a lower-dimensional representation of the MNIST images.

Since this is a reconstruction task, the labels of the images are not needed, and the target is the same as the inputs. Mean Squared Error (MSE) is recommended as the loss function:

In [None]:
criterion = nn.MSELoss()

Another tip is to add a `torch.nn.Tanh()` activation layer to the end of the model. Recall that our data pre-processing converts the data into the range $[-1, 1]$:

In [None]:
# define the data pre-processing
# convert the input to the range [-1, 1].
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize(0.5, 0.5)]
    )

Having a `torch.nn.Tanh()` activation layer at the end of the model constrains the model output to the range $[-1, 1]$, which matches the range of the targets and makes training easier.
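
The two tips can be combined as in the sketch below. The class name `ReconstructionMLP`, the ReLU hidden activation, and the random stand-in batch are assumptions for illustration; only the 50-unit hidden layer, the `Tanh` output, and `nn.MSELoss` come from the description above.

```python
import torch
import torch.nn as nn

class ReconstructionMLP(nn.Module):  # hypothetical name
    def __init__(self, in_dim=28 * 28, hidden_dim=50):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden_dim)   # compress to 50 dimensions
        self.activation = nn.ReLU()                    # assumed hidden activation
        self.decoder = nn.Linear(hidden_dim, in_dim)   # expand back to 784 pixels
        self.out = nn.Tanh()                           # bound the output to [-1, 1]

    def forward(self, x):
        code = self.activation(self.encoder(x))
        return self.out(self.decoder(code))

torch.manual_seed(0)
model = ReconstructionMLP()
criterion = nn.MSELoss()

# Stand-in for a batch of flattened, normalized images in [-1, 1].
batch = torch.rand(8, 28 * 28) * 2 - 1
recon = model(batch)

# The target of the regression is the input itself; labels are not used.
loss = criterion(recon, batch)
print(recon.shape, loss.item())
```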

<span style="color:red">[YOUR TASK]</span>
- Define an MLP with only one hidden layer and set the hidden layer dimension as 50. Train the MLP to reconstruct input images from all 10 digits.
- Report the Mean Squared Error on the training, validation, and test sets. Report your hyper-parameter details.
- Pick 5 images for each digit from the test set. Visualize the original images and the reconstructed images using the MLP.

In [38]:
# Your code goes here