## Exercise 0: Train your model on GPU (0 points)

Some tasks in this assignment take a long time to run on a CPU. For example, in our test on Exercise 3 Task 4, training the full model for 1 epoch took roughly 2 hours on a CPU. We therefore highly recommend training your model on a GPU.

To do so, first enable the GPU on Colab (this will restart the runtime): click `Runtime` -> `Change runtime type` and select a GPU under `Hardware accelerator`. You can then run the following code to check that the GPU is correctly initialized and available.



In [None]:
import torch
print(f'Can I use GPU now? -- {torch.cuda.is_available()}')

Can I use GPU now? -- True


### You must manually move your model and data to the GPU (and sometimes back to the CPU)
After setting up the GPU on Colab, you should move your **model** and **data** to the GPU. We give a simple example below. Use the `to` function for this task: see [torch.Tensor.to](https://pytorch.org/docs/stable/generated/torch.Tensor.to.html) to move a tensor to the GPU (typically your mini-batch of data in each iteration) and [torch.nn.Module.to](https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.to) to move your NN model to the GPU (assuming you subclass [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html)). Note that `to()` on a tensor returns a NEW tensor, while `to()` on an NN model modifies the model in place. To be safe, the best pattern is `obj = obj.to(device)`. For operations that need the data in CPU memory, such as converting to NumPy or plotting, you will need to move a tensor back to the CPU via the `cpu()` function.
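
The difference between the two `to` variants can be checked with a minimal sketch. It assumes a dtype conversion instead of a device move so that it also runs on a CPU-only machine; the return-value behavior is the same for device moves. The names `t`, `t64`, `layer`, and `returned` are illustrative only.

```python
import torch
import torch.nn as nn

# Tensor.to returns a new tensor when a conversion is needed;
# the original tensor keeps its dtype (and device).
t = torch.zeros(3)                  # float32 by default
t64 = t.to(torch.float64)
print(t.dtype, t64.dtype)

# Module.to converts the module's parameters in place and returns the module itself.
layer = nn.Linear(2, 1)
returned = layer.to(torch.float64)
print(returned is layer)
print(layer.weight.dtype)
```

Because reassigning the result is correct in both cases, `obj = obj.to(device)` is a safe habit for tensors and modules alike.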

Once the model and input data are on the GPU, everything else works the same way. This is the beauty of PyTorch GPU acceleration: none of the other code needs to be altered.

To summarize, you need to 1) enable GPU acceleration in Colab, 2) put the model on the GPU, and 3) put the input data (i.e., the batch of samples) onto the GPU using `to()` after it is loaded by the data loaders (usually you only put one batch of data on the GPU at a time).

In [None]:
import torch.nn as nn
rand_tensor = torch.rand(5,2)
simple_model = nn.Sequential(nn.Linear(2,10), nn.ReLU(), nn.Linear(10,1))
print(f'input is on {rand_tensor.device}')
print(f'model parameters are on {[param.device for param in simple_model.parameters()]}')
print(f'output is on {simple_model(rand_tensor).device}')

device = torch.device('cuda')
# ----------- <Your code> ---------------
# Move rand_tensor and model onto the GPU device
rand_tensor = rand_tensor.to(device)
simple_model = simple_model.to(device)
# --------- <End your code> -------------
print(f'input is on {rand_tensor.device}')
print(f'model parameters are on {[param.device for param in simple_model.parameters()]}')
print(f'output is on {simple_model(rand_tensor).device}')

input is on cpu
model parameters are on [device(type='cpu'), device(type='cpu'), device(type='cpu'), device(type='cpu')]
output is on cpu
input is on cuda:0
model parameters are on [device(type='cuda', index=0), device(type='cuda', index=0), device(type='cuda', index=0), device(type='cuda', index=0)]
output is on cuda:0
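
The three-step summary above can be sketched as a small loop that moves one batch at a time. This is only an illustration under assumptions: the toy dataset, `toy_loader`, and `toy_model` are hypothetical stand-ins for your own data loader and network, and the CPU fallback is added so the sketch also runs without a GPU.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Assumption: fall back to the CPU when no GPU is available.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hypothetical toy data standing in for a real dataset.
toy_dataset = TensorDataset(torch.rand(20, 2), torch.rand(20, 1))
toy_loader = DataLoader(toy_dataset, batch_size=5)

# Move the model once, before the loop.
toy_model = nn.Sequential(nn.Linear(2, 10), nn.ReLU(), nn.Linear(10, 1)).to(device)

for inputs, targets in toy_loader:
    # Move only the current batch onto the same device as the model.
    inputs, targets = inputs.to(device), targets.to(device)
    outputs = toy_model(inputs)
    # Bring the result back to the CPU before extracting a Python number.
    batch_mean = outputs.detach().cpu().mean().item()
    print(f'batch on {inputs.device}, mean output {batch_mean:.4f}')
```

Moving the batch inside the loop keeps GPU memory usage bounded by a single batch, whatever the dataset size.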


## Exercise 1: Why use a CNN rather than only fully connected layers? (30 points)

In this exercise, you will build two models for the **MNIST** dataset: one uses only fully connected layers, and the other uses a standard CNN layout (convolutional layers everywhere except the last layer, which is fully connected). The two models should reach roughly the same accuracy. Your task is to compare their numbers of parameters (a huge number of parameters can affect training/testing time, memory requirements, overfitting, etc.).

### Task 1: Following the structure used in the instructions, you should create

*   One network named **OurFC**, which should consist of only fully connected layers

  *   You should decide how many layers and how many hidden dimensions you want in your network 
  *   Your final accuracy on the test dataset should lie roughly around 90% ($\pm$2%)
  *   There is no need to make the neural network unnecessarily complex; your total training time should be no longer than 3 minutes

*   Another network named **OurCNN**, which applies a standard CNN structure
  *   Again, you should decide how many layers and how many channels you want for each layer.
  *   Your final accuracy on the test dataset should lie roughly around 90% ($\pm$2%)
  *   A standard CNN structure can be composed as **[Conv2d, MaxPooling, ReLU] x num_conv_layers + FC x num_fc_layers**

* Train and test your network on MNIST data as in the instructions
* You are **required** to print the loss during training and the loss and accuracy during testing, as in the instructions.

In [None]:
# Import MNIST data, pre-process it and split it to test and train set
import torchvision

train_batch_size, test_batch_size = 64, 1000

transform = torchvision.transforms.Compose([torchvision.transforms.ToTensor(),
                                           torchvision.transforms.Normalize((0.1307,),(0.3081,))]) # MNIST data has 0.1307 mean and 0.3081 standard deviation

train_dataset = torchvision.datasets.MNIST('/data', train=True, download=True, transform=transform)
test_dataset = torchvision.datasets.MNIST('/data', train=False, download=True, transform=transform)

print(f'Training Data: {train_dataset}')
print(f'Testing Data: {test_dataset}')

train_loader = torch.utils.data.DataLoader(train_dataset, train_batch_size, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_dataset, test_batch_size, shuffle=False)

Training Data: Dataset MNIST
    Number of datapoints: 60000
    Root location: /data
    Split: Train
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.1307,), std=(0.3081,))
           )
Testing Data: Dataset MNIST
    Number of datapoints: 10000
    Root location: /data
    Split: Test
    StandardTransform
Transform: Compose(
               ToTensor()
               Normalize(mean=(0.1307,), std=(0.3081,))
           )
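
As a small illustration of what the `Normalize` step in the transform above does, the sketch below applies the same per-channel formula, `(x - mean) / std`, by hand with plain `torch`. The three pixel values are made-up examples, not MNIST data.

```python
import torch

mean, std = 0.1307, 0.3081                  # values used in the transform above
pixels = torch.tensor([0.0, 0.1307, 1.0])   # ToTensor() scales pixel values to [0, 1]

# Normalize subtracts the mean and divides by the standard deviation per channel.
normalized = (pixels - mean) / std
print(normalized)
```

A pixel equal to the dataset mean maps to 0, so after normalization the inputs are roughly zero-centered with unit variance.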


In [None]:
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Fully Connected Network Class
class OurFC(nn.Module):

  def __init__(self):
    super(OurFC, self).__init__()
    self.fc = nn.Linear(784, 10)

  def forward(self, x):
    x = x.view(-1, x.size()[2]*x.size()[3])
    x = self.fc(x)
    return F.log_softmax(x, -1)

# CNN Network Class
class OurCNN(nn.Module):

  def __init__(self):
    super(OurCNN, self).__init__()
    self.conv = nn.Conv2d(1, 1, kernel_size=5) # Using only 1 filter to match the accuracy ~92%, otherwise it shoots to 95%
    self.fc = nn.Linear(144, 10)
  
  def forward(self, x):
    x = self.conv(x)
    x = F.relu(F.max_pool2d(x, 2))
    x = x.view(-1, x.size()[1]*x.size()[2]*x.size()[3])
    x = self.fc(x)
    return F.log_softmax(x, -1)

FCClassifier = OurFC()
FCOptimizer = optim.SGD(FCClassifier.parameters(), lr=0.01, momentum=0.7)

CNNClassifier = OurCNN()
CNNOptimizar = optim.SGD(CNNClassifier.parameters(), lr=0.01, momentum=0.7)
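
The input size of `144` for the fully connected layer in `OurCNN` follows from the layer shapes. The sketch below checks this with a dummy input, assuming the same `Conv2d(1, 1, kernel_size=5)` and `max_pool2d(x, 2)` settings as the class above; the variable names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dummy = torch.zeros(1, 1, 28, 28)          # one MNIST-sized image: (batch, channel, H, W)
conv = nn.Conv2d(1, 1, kernel_size=5)      # same settings as the conv layer in OurCNN

after_conv = conv(dummy)                   # 28 - 5 + 1 = 24 pixels per side
after_pool = F.max_pool2d(after_conv, 2)   # 24 / 2 = 12 pixels per side
flat = after_pool.flatten(start_dim=1)     # 1 channel * 12 * 12 = 144 features

print(after_conv.shape, after_pool.shape, flat.shape)
```

If you change the kernel size, the number of channels, or the pooling, the input size of the first fully connected layer has to change accordingly.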


In [None]:
import time

# Function to train the given classifier using the given optimizer
def train(classifier, optimizer, epoch, train_loss, train_counter):
  classifier.train()  # Switch to training mode, since test() leaves the model in eval mode
  start_time = time.time()
  for batch_idx, (images, targets) in enumerate(train_loader):
    optimizer.zero_grad()
    outputs = classifier(images)
    loss = F.nll_loss(outputs, targets)
    loss.backward()
    optimizer.step()

    if batch_idx%10 == 0:
      train_loss.append(loss.item())
      train_counter.append(train_batch_size*batch_idx + (epoch-1)*len(train_loader.dataset))
    if batch_idx%100 == 0:
      print(f'Epoch: {epoch} [{batch_idx*len(images)}/{len(train_loader.dataset)}] Training Loss: {loss.item()}')
  print(f'Total training time for epoch: {epoch} is {time.time()-start_time} seconds')

# Function to evaluate the given classifier on the test set (the optimizer argument is unused)
def test(classifier, optimizer, epoch, test_loss, test_counter):
  classifier.eval()

  loss = 0
  correct = 0

  with torch.no_grad():
    for images, targets in test_loader:
      outputs = classifier(images)
      loss += F.nll_loss(outputs, targets, reduction='sum').item()
      prediction = outputs.data.max(1, keepdim=True)[1]
      correct += prediction.eq(targets.data.view_as(prediction)).sum()

  test_loss.append(loss/len(test_loader.dataset))
  test_counter.append(len(train_loader.dataset)*epoch)
  print(f'Test Result on epoch: {epoch}: Avg loss is {loss/len(test_loader.dataset)}, Accuracy: {100*(correct/len(test_loader.dataset))}')


In [None]:
FC_train_loss, FC_train_counter = [], []
FC_test_loss, FC_test_counter = [], []
CNN_train_loss, CNN_train_counter = [], []
CNN_test_loss, CNN_test_counter = [], []
max_epoch = 3

# Function to train and test the model
def train_and_test(classifier, optimizer, train_loss, train_counter, test_loss, test_counter):
  for epoch in range(1, max_epoch+1):
    train(classifier, optimizer, epoch, train_loss, train_counter)
    test(classifier, optimizer, epoch, test_loss, test_counter)

print(f'>>>>>>>>>>>>>>Printing Results for FC Network (OurFC)<<<<<<<<<<<<<<')
train_and_test(FCClassifier, FCOptimizer, FC_train_loss, FC_train_counter, FC_test_loss, FC_test_counter)
print(f'>>>>>>>>>>>>>>Printing Results for CNN Network (OurCNN)<<<<<<<<<<<<<<')
train_and_test(CNNClassifier, CNNOptimizar, CNN_train_loss, CNN_train_counter, CNN_test_loss, CNN_test_counter)

>>>>>>>>>>>>>>Printing Results for FC Network (OurFC)<<<<<<<<<<<<<<
Epoch: 1 [0/60000] Training Loss: 0.14794547855854034
Epoch: 1 [6400/60000] Training Loss: 0.19575344026088715
Epoch: 1 [12800/60000] Training Loss: 0.5624825358390808
Epoch: 1 [19200/60000] Training Loss: 0.30876073241233826
Epoch: 1 [25600/60000] Training Loss: 0.2982504069805145
Epoch: 1 [32000/60000] Training Loss: 0.4672214090824127
Epoch: 1 [38400/60000] Training Loss: 0.3649389147758484
Epoch: 1 [44800/60000] Training Loss: 0.3104346990585327
Epoch: 1 [51200/60000] Training Loss: 0.18228305876255035
Epoch: 1 [57600/60000] Training Loss: 0.450327068567276
Total training time for epoch: 1 is 12.64466118812561 seconds
Test Result on epoch: 1: Avg loss is 0.27425271682739255, Accuracy: 92.2699966430664
Epoch: 2 [0/60000] Training Loss: 0.45664820075035095
Epoch: 2 [6400/60000] Training Loss: 0.35054245591163635
Epoch: 2 [12800/60000] Training Loss: 0.2514461874961853
Epoch: 2 [19200/60000] Training Loss: 0.384651303

### Task 2: Compare the number of parameters that are used in both your neural networks by printing out the total number of parameters for both of your networks.

**Note:** You need to clearly show which number corresponds to which network.

In [None]:
# Function to count number of learnable parameters in a model
def count_num_of_parameters(model):
  return sum(parameter.numel() for parameter in model.parameters() if parameter.requires_grad)

print(f'OurFC Parameters: {count_num_of_parameters(FCClassifier)}')
print(f'OurCNN Parameters: {count_num_of_parameters(CNNClassifier)}')

OurFC Parameters: 7850
OurCNN Parameters: 1476
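
The two totals printed above can be reproduced by hand from the layer shapes. The sketch below assumes the exact definitions shown earlier: one `Linear(784, 10)` layer for `OurFC`, and one `Conv2d(1, 1, kernel_size=5)` followed by `Linear(144, 10)` for `OurCNN`.

```python
# OurFC: a single Linear(784, 10) layer.
fc_params = 784 * 10 + 10                  # weights + biases

# OurCNN: Conv2d(1, 1, kernel_size=5) followed by Linear(144, 10).
conv_params = 1 * 1 * 5 * 5 + 1            # out_channels * in_channels * kH * kW + biases
cnn_fc_params = 144 * 10 + 10              # weights + biases
cnn_params = conv_params + cnn_fc_params

print(f'OurFC: {fc_params}, OurCNN: {cnn_params}, ratio: {fc_params / cnn_params:.2f}')
```

Nearly all of the CNN's parameters sit in its fully connected layer. The convolution itself contributes only 26, because the same 5x5 kernel is shared across every image position.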
