# HW1 - Exploring MLPs with PyTorch

# Problem 1: Simple MLP for Binary Classification
In this problem, you will train a simple MLP to classify two handwritten digits: 0 vs 1. We provide starter code that walks through the task step by step. However, you do not need to follow these exact steps, as long as you complete the task in the sections marked as <span style="color:red">[YOUR TASK]</span>.

## Dataset Setup
We will use the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). The `torchvision` package supports this dataset and can download it directly (the dataset will take up about 63 MB of your disk space).

# Problem 3: Handling Class Imbalance in MNIST Dataset
In this problem, we will explore how to handle class imbalance, which is very common in real-world applications. A modified MNIST dataset is created as follows: we keep all instances of digit "0" and only 1\% of the instances of digit "1", for both the training and test sets.

For such a class imbalance problem, accuracy may not be a good metric. Always predicting "0" regardless of the input can be 99\% accurate. Instead, we use the $F_1$ score as the evaluation metric:
$$F_1 = 2\cdot\frac{\text{precision}\cdot \text{recall}}{\text{precision} + \text{recall}}$$
where precision and recall are defined as:
$$\text{precision}=\frac{\text{number of instances correctly predicted as "1"}}{\text{number of instances predicted as "1"}}$$
$$\text{recall}=\frac{\text{number of instances correctly predicted as "1"}}{\text{number of instances labeled as "1"}}$$
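
To make the definitions above concrete, here is a minimal sketch that computes precision, recall and $F_1$ from raw counts. The helper name `f1_from_counts`, the toy counts, and the convention of returning 0 when a denominator is zero are assumptions made for illustration, not part of the assignment.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, F1) for the positive class "1".

    tp: instances correctly predicted as "1"
    fp: instances predicted as "1" but labeled "0"
    fn: instances labeled "1" but predicted as "0"
    Assumption: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


# Hypothetical test set: 990 zeros and 10 ones.
n_zeros, n_ones = 990, 10

# Classifier A always predicts "0": every "1" is missed.
acc_a = n_zeros / (n_zeros + n_ones)
p_a, r_a, f1_a = f1_from_counts(tp=0, fp=0, fn=n_ones)

# Classifier B finds 8 of the 10 ones and raises 2 false alarms.
acc_b = (n_zeros - 2 + 8) / (n_zeros + n_ones)
p_b, r_b, f1_b = f1_from_counts(tp=8, fp=2, fn=2)

print(f"A: accuracy={acc_a:.3f}, F1={f1_a:.3f}")
print(f"B: accuracy={acc_b:.3f}, F1={f1_b:.3f}")
```

The two classifiers have almost the same accuracy, but only the $F_1$ score separates the one that never detects a "1" from the one that detects most of them.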

To handle such a problem, some changes to the training may be necessary. Some suggestions include: 
1) Adjusting the class weights in the loss function, i.e., use a larger weight for the minority class when computing the loss.
2) Implementing resampling techniques (either undersampling the majority class or oversampling the minority class).
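
The second suggestion can be sketched in plain Python as follows. This is only one possible form of oversampling (drawing minority indices with replacement until both classes have the same size); the function name `oversample_minority` and the toy labels are hypothetical. In PyTorch, the resulting index list could be wrapped with `torch.utils.data.Subset`, or a `WeightedRandomSampler` could be used instead.

```python
import random


def oversample_minority(labels, minority=1, seed=0):
    """Return shuffled sample indices in which the minority class is
    repeated (drawn with replacement) until it matches the majority class."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if not minority_idx:
        raise ValueError("no minority samples to draw from")
    # number of extra minority draws needed to balance the two classes
    n_extra = max(len(majority_idx) - len(minority_idx), 0)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    indices = majority_idx + minority_idx + extra
    rng.shuffle(indices)
    return indices


# Toy labels (assumption): 95 zeros and 5 ones.
labels = [0] * 95 + [1] * 5
balanced = oversample_minority(labels)
n_pos = sum(labels[i] == 1 for i in balanced)
print(len(balanced), n_pos)
```

Resampling of this kind should be applied to the training split only, so that the validation and test metrics still reflect the original class distribution.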

<span style="color:red">[YOUR TASK]</span>
- Create the imbalanced dataset with all "0" digits and only 1\% of the "1" digits.
- Implement the training loop and the evaluation section (including the $F_1$ metric).
- Ignore the class imbalance problem and train the MLP. Report your hyper-parameter details and the $F_1$ score on the test set (as the baseline).
- Explore modifications that improve performance on the class imbalance problem. Report your modifications and the resulting $F_1$ scores on the test set.

In [112]:
import torch
from torchvision import transforms, datasets
import numpy as np
import pandas as pd
import sklearn
import torch.nn as nn
import time

In [113]:
device = torch.device('cpu')

In [114]:
# define the data pre-processing
# convert the input to the range [-1, 1].
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize(0.5, 0.5)]
    )

# Load the MNIST dataset 
# this command requires Internet to download the dataset
mnist = datasets.MNIST(root='./data',
                       train=True,
                       download=True,
                       transform=transform)
mnist_test = datasets.MNIST(root='./data',
                            train=False,
                            download=True,
                            transform=transform)

In [115]:
from torch.utils.data import DataLoader, random_split

print("Frequencies: ", torch.bincount(mnist.targets))
print(len(torch.bincount(mnist.targets)))

Frequencies:  tensor([5923, 6742, 5958, 6131, 5842, 5421, 5918, 6265, 5851, 5949])
10


In [116]:
# Filter for digits 0 and 1
import random
N= 100
train_0 = [data for data in mnist if data[1] == 0]
train_1 = [data for data in mnist if data[1] == 1]
# random.shuffle(train_1)
print(len(train_0), len(train_1), len(train_1) + len( train_0) )
train_1 = train_1[:len(train_1) // N]
print(len(train_0), len(train_1), len(train_1) + len( train_0) )
train_set = train_0 + train_1


test_0 = [data for data in mnist_test if data[1] == 0]
test_1 = [data for data in mnist_test if data[1] == 1]
print(len(test_0), len(test_1), len(test_1) + len( test_0) )

test_1 = test_1[:len(test_1) // N]
print(len(test_0), len(test_1), len(test_1) + len( test_0) )

test_set = test_0 + test_1

5923 6742 12665
5923 67 5990
980 1135 2115
980 11 991


In [117]:
# Split training data into training and validation sets
train_len = int(len(train_set) *.8)
val_len = len(train_set) - train_len
train_set, val_set = random_split(train_set, [train_len, val_len])

# Define DataLoaders to access data in batches
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
val_loader = DataLoader(val_set, batch_size = 64, shuffle=False)
test_loader = DataLoader(test_set, batch_size = 64, shuffle=False)

In [118]:
# Define your MLP
class SimpleMLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super(SimpleMLP, self).__init__()
        # Your code goes here
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.activation = nn.ReLU()
        self.fc2 = nn.Linear(hidden_dim, out_dim)
        
    def forward(self, x):
        # Your code goes here
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x)
        return x

# Your code goes here
hidden_dim = 4
model = SimpleMLP(in_dim=28 * 28,
                  hidden_dim=hidden_dim,
                  out_dim=2).to(device)
print(model)

SimpleMLP(
  (fc1): Linear(in_features=784, out_features=4, bias=True)
  (activation): ReLU()
  (fc2): Linear(in_features=4, out_features=2, bias=True)
)


In [119]:
# Your code goes here
def precision_score(labels, predictions):
    # fraction of instances predicted as "1" that are truly labeled "1"
    labels, predictions = np.array(labels), np.array(predictions)
    predictions_1 = np.sum(predictions == 1)
    correct_1 = np.sum((predictions == 1) & (labels == 1))
    precision = correct_1 / predictions_1 if predictions_1 > 0 else 0.0
    return precision

def recall_score(labels, predictions):
    # fraction of instances labeled "1" that are predicted as "1"
    labels, predictions = np.array(labels), np.array(predictions)
    correct_1 = np.sum((predictions == 1) & (labels == 1))
    labels_1 = np.sum(labels == 1)
    recall = correct_1 / labels_1 if labels_1 > 0 else 0.0
    return recall

def f1_score(labels, predictions):
    precision = precision_score(labels, predictions)
    recall = recall_score(labels, predictions)
    # guard against 0/0 when no "1" is predicted correctly
    if precision + recall == 0:
        return 0.0
    f1 = (2 * (recall * precision)) / (precision + recall)
    return f1

In [120]:
from sklearn.metrics import confusion_matrix

In [124]:
def two_digit(weight, batch_size=64):
    model = SimpleMLP(in_dim=28 * 28,
                  hidden_dim=hidden_dim,
                  out_dim=2).to(device)
    
    criterion = nn.CrossEntropyLoss(weight = weight)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
    num_epochs = 10
    
    # training
    start_time = time.time()
    for epoch in range(num_epochs):
        correct, count = 0, 0 
        for data, target in train_loader:
            # move the batch to the target device
            data, target = data.to(device), target.to(device)
            # clear the gradients from the previous batch
            optimizer.zero_grad()
            # reshape the image into a vector
            data = data.view(data.size(0), -1)
            # model forward
            output = model(data)
            # compute the loss
            loss = criterion(output, target)
            # model backward
            loss.backward()
            # update the model parameters
            optimizer.step()
            
            # adding this for train accuracy 
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()
            count += data.size(0)
        
        train_acc = 100. * correct / count
        # print(f'Training accuracy: {train_acc:.2f}%')

    training_time = time.time()- start_time
    # print(training_time)
    
    # validation
    val_loss = count = 0
    correct = total = 0
    val_preds = []; val_labels = []
    model.eval()
    with torch.no_grad():  # gradients are not needed for evaluation
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            data = data.view(data.size(0), -1)
            output = model(data)
            val_loss += criterion(output, target).item()
            count += 1
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()
            total += data.size(0)
            val_preds.append(pred)
            val_labels.append(target)

    # move to the CPU before converting to NumPy arrays
    val_preds = torch.cat(val_preds).cpu().numpy()
    val_labels = torch.cat(val_labels).cpu().numpy()
    assert len(val_preds) == len(val_set)
    
    val_loss = val_loss / count
    val_acc = 100. * correct / total
    # print(f'Validation loss: {val_loss:.2f}, accuracy: {val_acc:.2f}%')
    f1_validation = f1_score(labels = val_labels, predictions = val_preds)
    # print(f'F1 score validation: {f1_validation:.2f}')
    
    # test
    model.eval()
    correct = total = 0
    test_preds = []; test_labels=[]

    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            data = data.view(data.size(0), -1)
            output = model(data)
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()
            total += data.size(0)
            test_preds.append(pred)
            test_labels.append(target)
        
    # move to the CPU before converting to NumPy arrays
    test_preds = torch.cat(test_preds).cpu().numpy()
    test_labels = torch.cat(test_labels).cpu().numpy()
    assert len(test_preds) == len(test_set)   
    test_acc = 100. * correct / total
    # print(f'Test Accuracy: {test_acc:.2f}%')
    # print(f'Validation loss: {val_loss:.2f}, accuracy: {val_acc:.2f}%')
    f1_test = f1_score(labels = test_labels, predictions =test_preds)
    # print(f'F1 score test: {f1_test:.2f}')
    conf_matrix = confusion_matrix(test_labels, test_preds)

    # Print the confusion matrix
    print("Confusion Matrix:")
    print(conf_matrix)
    
    return training_time, train_acc, val_acc, test_acc, f1_validation, f1_test

In [125]:
# weight_0 = len(train_set) / len(train_0)
# weight_1 = len(train_set) / len(train_1)
# weight_0, weight_1 = weight_0/ (weight_0 + weight_1), weight_1/ (weight_0 + weight_1)
total = len(train_0) + len(train_1)
weight_0 = total/len(train_0) 
weight_1 = total/ len(train_1)
compensation = torch.tensor([weight_0, weight_1], dtype=torch.float32)
compensation

tensor([ 1.0113, 89.4030])
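
The two weights printed above follow from the class counts shown earlier in this notebook (5923 zeros and 67 ones). The sketch below reproduces that arithmetic and illustrates, in plain Python, how per-class weights change a mean-reduced cross-entropy. The helper `weighted_cross_entropy` and the toy logits are hypothetical; the weighted mean it computes (sum of weighted losses divided by the sum of the weights of the targets) is, to my understanding, how `nn.CrossEntropyLoss(weight=...)` averages with its default reduction.

```python
import math

# Class counts in the imbalanced training data (from the cell output above).
counts = {0: 5923, 1: 67}
total = sum(counts.values())
# Inverse-frequency weights: total / count for each class.
weights = {c: total / n for c, n in counts.items()}
print({c: round(w, 4) for c, w in weights.items()})


def weighted_cross_entropy(logits, targets, class_weights):
    """Weighted mean of per-sample cross-entropy:
    sum(w[y_i] * loss_i) / sum(w[y_i])."""
    num = den = 0.0
    for z, y in zip(logits, targets):
        m = max(z)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(v - m) for v in z))
        loss = log_norm - z[y]  # equals -log softmax(z)[y]
        num += class_weights[y] * loss
        den += class_weights[y]
    return num / den


# Toy batch (assumption): the model favors class 0 for both samples,
# so the sample labeled "1" is misclassified.
logits = [[2.0, 0.0], [2.0, 0.0]]
targets = [0, 1]
plain = weighted_cross_entropy(logits, targets, {0: 1.0, 1: 1.0})
weighted = weighted_cross_entropy(logits, targets, weights)
print(f"unweighted loss={plain:.4f}, weighted loss={weighted:.4f}")
```

With the inverse-frequency weights, the misclassified "1" dominates the loss, which is the effect the reweighting is meant to have.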

In [126]:
batch_size = 64
results = []
weights = [[1,1], [1, (max(compensation)/ min(compensation))/10], compensation]

# for batch_size in batch_sizes:
for weight in weights:
    # ratio between the largest and the smallest class weight
    reweight_factor = float(max(weight) / min(weight))
    # build the class-weight tensor for this configuration
    weight = torch.tensor([float(w) for w in weight], dtype=torch.float32).to(device)
    training_time, train_acc, val_acc, test_acc, f1_validation, f1_test = two_digit(batch_size=batch_size, weight=weight)
    results.append([batch_size, reweight_factor, training_time, train_acc, val_acc, test_acc, f1_validation, f1_test])

headers = ['Batch size', 'Weighting Factor', 'Training Time (s)', 'Train Acc (%)', 'Val Acc (%)', 'Test Acc (%)', 'F1-Val', 'F1-Test']
df = pd.DataFrame(results, columns=headers)
df.to_csv('q3_hyperopt.csv')
df

Confusion Matrix:
[[980   0]
 [ 11   0]]
Confusion Matrix:
[[980   0]
 [  0  11]]
Confusion Matrix:
[[980   0]
 [  0  11]]


| | Batch size | Weighting Factor | Training Time (s) | Train Acc (%) | Val Acc (%) | Test Acc (%) | F1-Val | F1-Test |
|---|---|---|---|---|---|---|---|---|
| 0 | 64 | 1.0 | 0.442843 | 98.789649 | 99.248748 | 98.89001 | 0.0 | 0.0 |
| 1 | 64 | 8.840298 | 0.464865 | 99.958264 | 99.916528 | 100.0 | 0.947368 | 1.0 |
| 2 | 64 | 88.402977 | 0.460066 | 100.0 | 100.0 | 100.0 | 1.0 | 1.0 |


<span style="color:red">[EXTRA BONUS]</span>

If the hyper-parameters are chosen properly, the baseline can perform satisfactorily on the class imbalance problem with 1% of digit "1". We now want to challenge the baseline with more severely imbalanced datasets.

In [13]:
import random
N = 1000
# generate a class-imbalanced dataset controlled by "N"
train_0 = [data for data in mnist if data[1] == 0]
train_1 = [data for data in mnist if data[1] == 1]
random.shuffle(train_1)
train_1 = train_1[:len(train_1) // N]
train_data = train_0 + train_1

Can you propose new ways to handle the class imbalance problem that achieve stable and satisfactory performance for large $N = 500, \; 1000, \; \cdots$?
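
One direction that is sometimes tried for severe imbalance, offered here only as a sketch and not as the expected solution, is to tune the decision threshold on the predicted probability of class "1" using validation data instead of always taking the `argmax`. The function names and the toy probabilities below are hypothetical; the sketch uses the identity $F_1 = 2TP / (2TP + FP + FN)$, which follows from the definitions of precision and recall.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 of the rule 'predict 1 when P(class=1) >= threshold'."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0


def best_threshold(probs, labels, candidates):
    """Return the candidate threshold with the highest F1 (first one on ties)."""
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))


# Toy validation data (assumption): the model ranks the two "1" samples
# highest but never assigns them a probability above 0.5.
probs = [0.02, 0.05, 0.10, 0.30, 0.40, 0.45]
labels = [0, 0, 0, 0, 1, 1]
candidates = [k / 100 for k in range(5, 100, 5)]

default_f1 = f1_at_threshold(probs, labels, 0.5)
best = best_threshold(probs, labels, candidates)
tuned_f1 = f1_at_threshold(probs, labels, best)
print(f"F1 at 0.5: {default_f1:.2f}, F1 at tuned threshold: {tuned_f1:.2f}")
```

The threshold should be selected on the validation set only, and the test set used once for the final report.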

In [14]:
# Your code goes here

# Problem 4: Reconstruct the MNIST images by Regression
In this problem, we want to train the MLP (with only one hidden layer) to complete a regression task: reconstructing the input image. The goal of this task is dimensionality reduction, so we set the hidden layer dimension to a smaller number, say 50. Once the MLP can reconstruct the input images accurately, its hidden layer gives a lower-dimensional representation of the MNIST images.

Since this is a reconstruction task, the labels of the images are not needed, and the target is the same as the inputs. Mean Squared Error (MSE) is recommended as the loss function:

In [15]:
criterion = nn.MSELoss()
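
As a small illustration of what this loss measures, the sketch below computes the mean of the squared element-wise differences in plain Python, which is what `nn.MSELoss()` returns with its default `reduction='mean'`. The helper `mse` and the four-pixel toy "image" are assumptions for illustration.

```python
def mse(outputs, targets):
    """Mean of the squared element-wise differences."""
    diffs = [(o - t) ** 2 for o, t in zip(outputs, targets)]
    return sum(diffs) / len(diffs)


# Toy flattened image in [-1, 1] and a hypothetical reconstruction of it.
x = [-1.0, 0.0, 1.0, 0.5]
x_hat = [-1.0, 0.5, 0.5, 0.5]

print(mse(x_hat, x))  # the target of the reconstruction is the input itself
print(mse(x, x))      # a perfect reconstruction has zero error
```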

Another tip is to add a `torch.nn.Tanh()` activation layer to the end of the model. Recall that our data pre-processing converts the data into the range $[-1, 1]$:

In [16]:
# define the data pre-processing
# convert the input to the range [-1, 1].
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize(0.5, 0.5)]
    )

Having a `torch.nn.Tanh()` activation layer at the end of the model can convert the output of the model into the range $[-1, 1]$, making the training easier.
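
The range argument can be checked with a few numbers. `ToTensor()` scales pixel values to $[0, 1]$, and `Normalize(0.5, 0.5)` then computes $(x - 0.5) / 0.5$, while `tanh` maps any real number into $(-1, 1)$. The sketch below reproduces both mappings in plain Python; the helper `normalize` and the sample values are illustrative assumptions.

```python
import math


def normalize(pixel, mean=0.5, std=0.5):
    """Same arithmetic as Normalize(0.5, 0.5) applied to one pixel."""
    return (pixel - mean) / std


# Pixel intensities after ToTensor(), i.e. already in [0, 1].
pixels = [0.0, 0.25, 0.5, 1.0]
scaled = [normalize(p) for p in pixels]
print(scaled)  # targets of the reconstruction lie in [-1, 1]

# Hypothetical raw outputs of the last linear layer, before Tanh.
raw_outputs = [-5.0, -0.5, 0.0, 3.0]
squashed = [math.tanh(v) for v in raw_outputs]
print([round(v, 4) for v in squashed])  # all values lie strictly inside (-1, 1)
```

Because the outputs and the targets share the same range, the model does not have to learn to keep its predictions bounded.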

<span style="color:red">[YOUR TASK]</span>
- Define an MLP with only one hidden layer and set the hidden layer dimension as 50. Train the MLP to reconstruct input images from all 10 digits.
- Report the Mean Squared Error on the training, validation and test sets. Report your hyper-parameter details.
- Pick 5 images for each digit from the test set. Visualize the original images and the reconstructed images using the MLP.
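
For a sense of scale, the following sketch counts the parameters of a 784-50-784 MLP, assuming two fully connected layers with biases (as in the `nn.Linear` layers used earlier), and computes how much the hidden layer compresses each image. The helper `linear_params` is hypothetical.

```python
def linear_params(in_dim, out_dim):
    """Parameters of a fully connected layer: weight matrix plus bias vector."""
    return in_dim * out_dim + out_dim


in_dim, hidden_dim = 28 * 28, 50

encoder = linear_params(in_dim, hidden_dim)   # 784 -> 50
decoder = linear_params(hidden_dim, in_dim)   # 50 -> 784
total_params = encoder + decoder
compression = in_dim / hidden_dim

print(encoder, decoder, total_params)
print(f"each image is compressed by a factor of {compression:.2f}")
```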

In [17]:
# Your code goes here