# HW1 - Exploring MLPs with PyTorch

# Problem 1: Simple MLP for Binary Classification
In this problem, you will train a simple MLP to classify two handwritten digits: 0 vs 1. We provide starter code that walks through the task step by step. You do not need to follow these exact steps, as long as you complete the tasks in the sections marked <span style="color:red">[YOUR TASK]</span>.

## Dataset Setup
We will use the [MNIST dataset](http://yann.lecun.com/exdb/mnist/). The `torchvision` package supports this dataset, so we can load it as follows (the dataset takes up 63 MB of disk space):

In [20]:
import torch
from torchvision import transforms, datasets
import numpy as np
import pandas as pd
import sklearn
import torch.nn as nn

In [21]:
import platform, time
print(platform.mac_ver())
# torch.has_mps is deprecated; torch.backends.mps.is_built() reports the same information
torch.backends.mps.is_built()

('14.2.1', ('', '', ''), 'arm64')

True

In [22]:
device = torch.device('cpu')

In [23]:
# if not torch.backends.mps.is_available():
#     if not torch.backends.mps.is_built():
#         print("MPS not available because the current PyTorch install was not "
#               "built with MPS enabled.")
#     else:
#         print("MPS not available because the current MacOS version is not 12.3+ "
#               "and/or you do not have an MPS-enabled device on this machine.")
    
# else:
#     device = torch.device("mps")
#     print('mps enabled')
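
The commented-out check above can be condensed into a small helper. This is only a sketch: `pick_device` and `chosen_device` are hypothetical names that are not part of the starter code, and the helper assumes that falling back to the CPU is acceptable when MPS is unavailable.

```python
import torch

def pick_device():
    """Return the MPS device when it is usable, otherwise fall back to the CPU."""
    # torch.backends.mps may be missing in older PyTorch builds, hence getattr
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

chosen_device = pick_device()
print(chosen_device)
```

The results reported later in this notebook were obtained with `device = torch.device('cpu')`, so switching devices will change the timings.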

In [24]:
# define the data pre-processing
# convert the input to the range [-1, 1].
transform = transforms.Compose(
    [transforms.ToTensor(), transforms.Normalize(0.5, 0.5)]
    )

# Load the MNIST dataset
# this command requires Internet access to download the dataset
# a relative root keeps the notebook portable across machines
mnist = datasets.MNIST(root='./data',
                       train=True,
                       download=True,
                       transform=transform)
mnist_test = datasets.MNIST(root='./data',
                            train=False,
                            download=True,
                            transform=transform)
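
To see why this pipeline maps the input to $[-1, 1]$, the arithmetic can be traced for a single pixel. The sketch below uses plain Python instead of `torchvision`; `normalize_pixel` is a hypothetical helper introduced only for illustration, and it assumes the raw image is stored as 8-bit values in $[0, 255]$.

```python
def normalize_pixel(raw, mean=0.5, std=0.5):
    """Map a raw 8-bit pixel value through the two steps of the transform."""
    scaled = raw / 255.0          # ToTensor: [0, 255] -> [0.0, 1.0]
    return (scaled - mean) / std  # Normalize(0.5, 0.5): [0.0, 1.0] -> [-1.0, 1.0]

for raw in (0, 128, 255):
    print(raw, "->", round(normalize_pixel(raw), 4))
```

A black pixel (0) ends up at -1, a white pixel (255) at 1, and mid-gray lands close to 0.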

In Problem 1, we focus only on binary classification between the digits 0 and 1, so we filter the dataset to keep only samples of those two digits. We also randomly split the original training data into two disjoint sets: a new training set containing 80\% of the original training samples and a validation set containing the remaining 20\%. We provide incomplete code as a hint:

In [25]:
from torch.utils.data import DataLoader, random_split

# Filter for digits 0 and 1
# train_data = [data for data in mnist if data[1] < 2]
train_index = mnist.targets < 2  # boolean mask: True where the label is 0 or 1
mnist.data = mnist.data[train_index]
mnist.targets = mnist.targets[train_index]
# Your code goes here
test_index = mnist_test.targets < 2  # same mask for the test split
mnist_test.data = mnist_test.data[test_index]
mnist_test.targets = mnist_test.targets[test_index]

In [26]:
# Split training data into training and validation sets
# Your code goes here
# train_set = ...
# val_set = ...
train_len = int(len(mnist) *.8)
val_len = len(mnist) - train_len
train_set, val_set = random_split(mnist, [train_len, val_len])

# Define DataLoaders to access data in batches
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
# Your code goes here
val_loader = DataLoader(val_set, batch_size = 64, shuffle=False)
test_loader = DataLoader(mnist_test, batch_size = 64, shuffle=False)

In [27]:
train_len

10132
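
`random_split` draws a new random permutation on every call, so the split above changes between runs. If a reproducible split is wanted, one option is to pass a seeded generator. The sketch below uses a toy dataset in place of MNIST; `seeded_split`, the seed value, and the toy data are assumptions made for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Toy stand-in for the filtered MNIST data: 10 samples with 1 feature each.
toy = TensorDataset(torch.arange(10).float().unsqueeze(1), torch.arange(10) % 2)

def seeded_split(dataset, train_fraction=0.8, seed=0):
    """Split a dataset into train/validation subsets with a fixed random seed."""
    n_train = int(len(dataset) * train_fraction)
    n_val = len(dataset) - n_train
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [n_train, n_val], generator=generator)

train_a, val_a = seeded_split(toy)
train_b, val_b = seeded_split(toy)
print(len(train_a), len(val_a))
print(list(train_a.indices) == list(train_b.indices))
```

Two calls with the same seed return the same indices, and the training and validation subsets remain disjoint.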

## Define an MLP
We want to define a simple MLP with only one hidden layer. You can use ``torch.nn.Linear`` to define a single MLP layer and pick an activation layer you like. Since our inputs are images with $28\times28$ pixels, the input dimension is $28\times28=784$. The problem is a binary classification, so the output dimension is 2.

In [28]:

# Define your MLP
class SimpleMLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super(SimpleMLP, self).__init__()
        # Your code goes here
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.activation = nn.Sigmoid()
        self.fc2 = nn.Linear(hidden_dim, out_dim)
        
    def forward(self, x):
        # Your code goes here
        x = self.fc1(x)
        x = self.activation(x)
        x = self.fc2(x)
        
        return x

# Your code goes here
hidden_dim = 5
model = SimpleMLP(in_dim=28 * 28,
                  hidden_dim=hidden_dim,
                  out_dim=2).to(device)
print(model)

SimpleMLP(
  (fc1): Linear(in_features=784, out_features=5, bias=True)
  (activation): Sigmoid()
  (fc2): Linear(in_features=5, out_features=2, bias=True)
)
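
As a quick sanity check on the printed architecture, the number of trainable parameters can be worked out by hand: each linear layer has one weight per input-output pair plus one bias per output. The helper `linear_params` and the variable names below are hypothetical and used only for this calculation.

```python
def linear_params(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features

n_in, n_hidden, n_out = 28 * 28, 5, 2
fc1_params = linear_params(n_in, n_hidden)   # 784 * 5 + 5
fc2_params = linear_params(n_hidden, n_out)  # 5 * 2 + 2
total_params = fc1_params + fc2_params
print(fc1_params, fc2_params, total_params)
```

With a hidden size of 5, almost all of the parameters sit in the first layer.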


## Train the MLP
To train the model, we need to define a loss function (criterion) and an optimizer. The loss function tells us how far the model’s prediction is from the label. Once we have the loss, PyTorch can compute the gradient of the loss with respect to the model parameters automatically. The optimizer uses the gradient to update the model. For classification problems, we often use the cross-entropy loss. For the optimizer, we can use either stochastic gradient descent (SGD) or Adam:

In [29]:
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
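
Note that `nn.CrossEntropyLoss` expects raw scores (logits) and applies the log-softmax itself, which is why the MLP above has no softmax layer at its output. The following sketch checks this on two made-up samples; the tensors `logits` and `labels` are illustrative values, not data from the assignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0], [0.5, 0.5]])  # raw model outputs for 2 samples
labels = torch.tensor([0, 1])                     # true classes

# Built-in loss: log-softmax + negative log-likelihood, averaged over the batch
builtin = nn.CrossEntropyLoss()(logits, labels)

# The same quantity computed step by step
log_probs = F.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(len(labels)), labels].mean()

print(builtin.item(), manual.item())
```

Both values agree, so adding an explicit softmax before this loss would apply the normalization twice.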

There are several hyper-parameters in the optimizer (please see the [PyTorch documentation](https://pytorch.org/docs/stable/optim.html) for details). You can play with the hyper-parameters and see how they influence the training.

Now we have almost everything we need to train the model. We provide sample code for the training loop:

In [30]:
num_epochs = 10
start_time = time.time()
for epoch in range(num_epochs):
    model.train()  # make sure the model is in training mode
    correct, count = 0, 0
    for data, target in train_loader:
        # move the batch to the selected device
        data, target = data.to(device), target.to(device)
        # clear the gradients left over from the previous batch
        optimizer.zero_grad()
        # reshape the image into a vector
        data = data.view(data.size(0), -1)
        # model forward
        output = model(data)
        # compute the loss
        loss = criterion(output, target)
        # model backward
        loss.backward()
        # update the model parameters
        optimizer.step()
        
        # adding this for train accuracy 
        pred = output.argmax(dim=1)
        correct += (pred == target).sum().item()
        count += data.size(0)
    
    train_acc = 100. * correct / count
    print(f'Training accuracy: {train_acc:.2f}%')

training_time = time.time()- start_time
print(training_time)

Training accuracy: 98.33%
Training accuracy: 99.74%
Training accuracy: 99.85%
Training accuracy: 99.89%
Training accuracy: 99.90%
Training accuracy: 99.92%
Training accuracy: 99.93%
Training accuracy: 99.93%
Training accuracy: 99.94%
Training accuracy: 99.97%
6.095792293548584


After training, we can use the validation dataset to estimate how well our model performs on new samples:

In [31]:
model.eval()  # switch to evaluation mode
val_loss = count = 0
correct = total = 0
with torch.no_grad():  # no gradients are needed for validation
    for data, target in val_loader:
        data, target = data.to(device), target.to(device)
        data = data.view(data.size(0), -1)
        output = model(data)
        val_loss += criterion(output, target).item()
        count += 1
        pred = output.argmax(dim=1)
        correct += (pred == target).sum().item()
        total += data.size(0)

val_loss = val_loss / count  # average loss per batch
val_acc = 100. * correct / total
print(f'Validation loss: {val_loss:.2f}, accuracy: {val_acc:.2f}%')

Validation loss: 0.02, accuracy: 99.80%


In [32]:
model.eval()
correct = total = 0
with torch.no_grad():
    for data, target in test_loader:
        data, target = data.to(device), target.to(device)
        data = data.view(data.size(0), -1)
        output = model(data)
        pred = output.argmax(dim=1)
        correct += (pred == target).sum().item()
        total += data.size(0)
        
test_acc = 100. * correct / total
print(f'Test Accuracy: {test_acc:.2f}%')

Test Accuracy: 99.95%


You can also run validation after each epoch, but remember not to train (backward pass and parameter update) on the validation dataset. Use the validation set to tune your model. Once you are done, report the performance on the test set. (You are encouraged not to use the test set for validation; that is, use the test set only once, after you are happy with the validation performance.)
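
One way to run validation after each epoch is to wrap the evaluation loop in a function and call it at the end of every epoch. The sketch below is one possible shape for such a function, not the reference solution; `evaluate` and the toy model and data used to exercise it are assumptions. Unlike the per-batch average used elsewhere in this notebook, it weights each batch loss by the batch size, so the result is a per-sample average.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def evaluate(model, loader, criterion, device):
    """Return (mean loss per sample, accuracy in percent) without training."""
    model.eval()
    loss_sum, correct, total = 0.0, 0, 0
    with torch.no_grad():  # no backward pass, no parameter update
        for data, target in loader:
            data, target = data.to(device), target.to(device)
            output = model(data.view(data.size(0), -1))
            loss_sum += criterion(output, target).item() * data.size(0)
            correct += (output.argmax(dim=1) == target).sum().item()
            total += data.size(0)
    return loss_sum / total, 100.0 * correct / total

# Toy check with random images and labels (not MNIST)
torch.manual_seed(0)
toy_data = TensorDataset(torch.randn(20, 1, 28, 28), torch.randint(0, 2, (20,)))
toy_loader = DataLoader(toy_data, batch_size=8)
toy_model = nn.Sequential(nn.Linear(784, 5), nn.Sigmoid(), nn.Linear(5, 2))
toy_loss, toy_acc = evaluate(toy_model, toy_loader, nn.CrossEntropyLoss(), torch.device("cpu"))
print(f"loss: {toy_loss:.4f}, accuracy: {toy_acc:.2f}%")
```

In the training loop, a call such as `evaluate(model, val_loader, criterion, device)` would go at the end of each epoch, followed by `model.train()` before the next epoch starts.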

<span style="color:red">[YOUR TASK]</span>
- Filter all samples representing digits "0" or "1" from the MNIST datasets. 
- Randomly split the training data into a training set (80\% of the training samples) and a validation set (20\% of the training samples).
- Define an MLP with 1 hidden layer and train the MLP to classify the digits "0" vs "1". Report your MLP design and training details (which optimizer, number of epochs, learning rate, etc.).
- Keep the other hyper-parameters the same, and train the model with different batch sizes: 2, 16, 128, 1024. Report the time cost and the training, validation, and test set accuracy of your model.


In our implementation, we trained our network for 10 epochs in about 10 seconds on a laptop, reaching a test accuracy of 99\%.

One tip about the hidden layer size is to begin with a small number, say $16\sim 64$. Some people find that $$\text{hidden size} = \sqrt{\text{input size}\times \text{output size}}$$ is a good choice in practice. If your model's training accuracy is too low, you can double the hidden layer size. However, if the training accuracy is high but the validation accuracy is much lower, consider a smaller hidden layer size, because your model is at risk of overfitting.
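
For this problem, the square-root rule of thumb above can be evaluated directly. The function name `geometric_mean_hidden_size` is hypothetical, and rounding to the nearest integer is an arbitrary choice made for this sketch.

```python
import math

def geometric_mean_hidden_size(input_size, output_size):
    """Rule of thumb: hidden size = sqrt(input size * output size), rounded."""
    return round(math.sqrt(input_size * output_size))

suggested_hidden = geometric_mean_hidden_size(28 * 28, 2)
print(suggested_hidden)  # 40
```

The suggested value of 40 falls inside the $16\sim 64$ starting range mentioned above.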


In [33]:
def two_digit(batch_size):
    # Define DataLoaders to access data in batches
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    # Your code goes here
    val_loader = DataLoader(val_set, batch_size = batch_size, shuffle=False)
    test_loader = DataLoader(mnist_test, batch_size = batch_size, shuffle=False)
    
    model = SimpleMLP(in_dim=28 * 28,
                  hidden_dim=hidden_dim,
                  out_dim=2).to(device)
    # print(model)
    criterion = nn.CrossEntropyLoss()
    # You can play with different optimizers
    # optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    
    num_epochs = 10
    
    # training
    start_time = time.time()
    for epoch in range(num_epochs):
        correct, count = 0, 0 
        for data, target in train_loader:
            # move the batch to the selected device
            data, target = data.to(device), target.to(device)
            # clear the gradients left over from the previous batch
            optimizer.zero_grad()
            # reshape the image into a vector
            data = data.view(data.size(0), -1)
            # model forward
            output = model(data)
            # compute the loss
            loss = criterion(output, target)
            # model backward
            loss.backward()
            # update the model parameters
            optimizer.step()
            
            # adding this for train accuracy 
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()
            count += data.size(0)
        print(f'Epoch {epoch+1}, Loss: {loss.item():.4f}')

        train_acc = 100. * correct / count
        # print(f'Training accuracy: {train_acc:.2f}%')

    training_time = time.time()- start_time
    # print(training_time)
    
    # validation
    model.eval()  # switch to evaluation mode
    val_loss = count = 0
    correct = total = 0
    with torch.no_grad():  # no gradients are needed for validation
        for data, target in val_loader:
            data, target = data.to(device), target.to(device)
            data = data.view(data.size(0), -1)
            output = model(data)
            val_loss += criterion(output, target).item()
            count += 1
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()
            total += data.size(0)

    val_loss = val_loss / count  # average loss per batch
    val_acc = 100. * correct / total
    # print(f'Validation loss: {val_loss:.2f}, accuracy: {val_acc:.2f}%')
    
    # test
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            data = data.view(data.size(0), -1)
            output = model(data)
            pred = output.argmax(dim=1)
            correct += (pred == target).sum().item()
            total += data.size(0)
            
    test_acc = 100. * correct / total
    # print(f'Test Accuracy: {test_acc:.2f}%')
    print('Hyperopt run done')
    return training_time, train_acc, val_acc, test_acc

In [34]:
batch_sizes = [2, 16, 128, 1024]
results = []

for batch_size in batch_sizes:
    training_time, train_acc, val_acc, test_acc = two_digit(batch_size=batch_size)
    results.append([batch_size,training_time, train_acc, val_acc, test_acc])

headers = ['Batch size', 'Training Time', 'Train Acc', 'Val Acc', 'Test Acc']
df = pd.DataFrame(results, columns=headers)


Epoch 1, Loss: 0.0038
Epoch 2, Loss: 0.0013
Epoch 3, Loss: 0.0003
Epoch 4, Loss: 0.0012
Epoch 5, Loss: 0.0003
Epoch 6, Loss: 0.0001
Epoch 7, Loss: 0.0001
Epoch 8, Loss: 0.0001
Epoch 9, Loss: 0.0001
Epoch 10, Loss: 0.0001
Hyperopt run done
Epoch 1, Loss: 0.0708
Epoch 2, Loss: 0.0350
Epoch 3, Loss: 0.0138
Epoch 4, Loss: 0.0130
Epoch 5, Loss: 0.0053
Epoch 6, Loss: 0.0045
Epoch 7, Loss: 0.0065
Epoch 8, Loss: 0.0022
Epoch 9, Loss: 0.0016
Epoch 10, Loss: 0.0012
Hyperopt run done
Epoch 1, Loss: 0.2647
Epoch 2, Loss: 0.1728
Epoch 3, Loss: 0.1197
Epoch 4, Loss: 0.0973
Epoch 5, Loss: 0.0759
Epoch 6, Loss: 0.0663
Epoch 7, Loss: 0.0509
Epoch 8, Loss: 0.0491
Epoch 9, Loss: 0.0399
Epoch 10, Loss: 0.0346
Hyperopt run done
Epoch 1, Loss: 0.6370
Epoch 2, Loss: 0.5631
Epoch 3, Loss: 0.5066
Epoch 4, Loss: 0.4771
Epoch 5, Loss: 0.4590
Epoch 6, Loss: 0.4270
Epoch 7, Loss: 0.4068
Epoch 8, Loss: 0.3801
Epoch 9, Loss: 0.3821
Epoch 10, Loss: 0.3466
Hyperopt run done


In [35]:
print(device)
df.to_csv('question_1.csv')
df

cpu


| | Batch size | Training Time (s) | Train Acc (%) | Val Acc (%) | Test Acc (%) |
|---|---|---|---|---|---|
| 0 | 2 | 28.409046 | 99.940782 | 99.842084 | 99.905437 |
| 1 | 16 | 10.273708 | 99.950651 | 99.802606 | 99.905437 |
| 2 | 128 | 5.733365 | 99.940782 | 99.763127 | 99.905437 |
| 3 | 1024 | 5.083993 | 98.874852 | 99.368338 | 99.479905 |
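
One possible reading of these numbers, offered as a hypothesis and not as a verified explanation, is that the batch size changes how many parameter updates the optimizer performs when the number of epochs is fixed. The sketch below counts the updates for the 10132 training samples and 10 epochs used above; `updates_per_epoch` is a hypothetical helper name.

```python
import math

n_train = 10132   # size of the training split reported earlier
n_epochs = 10

def updates_per_epoch(n_samples, batch_size):
    """Number of optimizer steps in one pass over the data (last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

total_updates = {}
for bs in (2, 16, 128, 1024):
    total_updates[bs] = updates_per_epoch(n_train, bs) * n_epochs
    print(f"batch size {bs:>4}: {total_updates[bs]:>6} updates in {n_epochs} epochs")
```

With batch size 2 the optimizer takes 50660 steps, which would be consistent with the longest training time, while batch size 1024 allows only 100 steps, which may be why its accuracy is lower at the same learning rate.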


In [36]:
df = pd.read_csv('question_1.csv')
latex_table = df.to_latex(index=False)
print(latex_table)

\begin{tabular}{rrrrrr}
\toprule
Unnamed: 0 & Batch size & Training Time  & Train Acc &  Val Acc & Test Acc \\
\midrule
0 & 2 & 28.409046 & 99.940782 & 99.842084 & 99.905437 \\
1 & 16 & 10.273708 & 99.950651 & 99.802606 & 99.905437 \\
2 & 128 & 5.733365 & 99.940782 & 99.763127 & 99.905437 \\
3 & 1024 & 5.083993 & 98.874852 & 99.368338 & 99.479905 \\
\bottomrule
\end{tabular}

