**Group-08**<br/>
<font style="color:red"> **Belhassen Ghoul <br/> Robin Ehrensperger <br/> Dominic Diedenhofen**</font>

In [10]:
import torch
from torch.utils.data import DataLoader, random_split
from torchvision import datasets
from torchvision.transforms import ToTensor
from torchsummary import summary
import numpy as np
import matplotlib.pyplot as plt

### Loading Data

Load the training and test partitions of the MNIST dataset.

Prepare the training by splitting the training partition into a training set and a validation set.

In [11]:
training_data = datasets.MNIST(root="data", train=True, download=True, transform=ToTensor())
test_data = datasets.MNIST(root="data", train=False, download=True, transform=ToTensor())

In [12]:
# Partition into train and validate

### YOUR CODE START ###
training_set, validation_set = random_split(training_data,[50000,10000])

### YOUR CODE END ###
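The split above is drawn at random on every run. If reproducible results are wanted, `random_split` accepts a seeded `torch.Generator`. The following is a minimal sketch on a synthetic stand-in dataset (the names `toy_data`, `toy_train`, `toy_valid` and the seed `0` are assumptions made for illustration, not part of the exercise):

```python
import torch
from torch.utils.data import TensorDataset, random_split

# Synthetic stand-in for MNIST so the sketch runs without a download.
toy_data = TensorDataset(torch.randn(600, 1, 28, 28), torch.randint(0, 10, (600,)))

# A seeded generator makes the split identical on every run.
generator = torch.Generator().manual_seed(0)
toy_train, toy_valid = random_split(toy_data, [500, 100], generator=generator)

# Repeating the split with the same seed selects the same indices.
toy_train_again, _ = random_split(
    toy_data, [500, 100], generator=torch.Generator().manual_seed(0)
)
same_split = toy_train.indices == toy_train_again.indices
print(len(toy_train), len(toy_valid), same_split)
```

For MNIST the same call would take `training_data` and the lengths `[50000, 10000]`.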

### MLP

Implement an MLP model that can be configured with an arbitrary number of layers and units per layer.

To that end, implement a suitable sub-class of `torch.nn.Module` with a constructor that accepts the following arguments:
* `units`: list of integers that specify the number of units in the different layers. The first element is the number of units in the input layer (layer '0'); the last element is the number of output units, i.e. the number of classes the classifier is designed for (10 for an MNIST classifier). Hence, the MLP has $n$ hidden layers if `units` has $n+2$ elements.
* `activation_class`: Class name of the activation function layer to be used (such as `torch.nn.ReLU`). Instances can be created by `activation_class()` and added to the succession of layers defined by the model. 

Alternatively, you can implement a utility method that creates a `torch.nn.Sequential` model accordingly. 
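The `torch.nn.Sequential` alternative could look roughly like the sketch below. The function name `make_mlp` and the default `torch.nn.ReLU` are assumptions chosen for illustration; the exercise does not prescribe them.

```python
import torch


def make_mlp(units, activation_class=torch.nn.ReLU):
    """Build Flatten -> (Linear -> activation)* -> Linear from a list of layer sizes."""
    layers = [torch.nn.Flatten()]
    for i, (n_in, n_out) in enumerate(zip(units[:-1], units[1:])):
        layers.append(torch.nn.Linear(n_in, n_out))
        # No activation after the output layer: CrossEntropyLoss expects raw logits.
        if i < len(units) - 2:
            layers.append(activation_class())
    return torch.nn.Sequential(*layers)


toy_model = make_mlp([28 * 28, 300, 100, 10])
logits = toy_model(torch.zeros(4, 1, 28, 28))
print(logits.shape)
```

Passing a different class, for example `make_mlp([784, 50, 10], torch.nn.Tanh)`, swaps the activation without touching the layer logic.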


In [115]:
### YOUR CODE START ###

class MLP(torch.nn.Module):

    def __init__(self, units, activation_class=torch.nn.ReLU):
        super().__init__()
        self.units = list(units)
        self.activation_class = activation_class
        self.flatten = torch.nn.Flatten()
        # One Linear layer per consecutive pair in `units`; an activation
        # follows every layer except the output layer.
        layers = []
        for i in range(len(units) - 1):
            layers.append(torch.nn.Linear(units[i], units[i + 1]))
            if i < len(units) - 2:
                layers.append(activation_class())
        self.layers = torch.nn.Sequential(*layers)

    def forward(self, x):
        # Returns raw logits; CrossEntropyLoss applies log-softmax itself.
        return self.layers(self.flatten(x))
        

### YOUR CODE END ###

In [116]:
model = MLP([28*28,300, 100, 10])

from torchsummary import summary
summary(model, (1,28,28))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
           Flatten-1                  [-1, 784]               0
            Linear-2                  [-1, 300]         235,500
              ReLU-3                  [-1, 300]               0
            Linear-4                  [-1, 100]          30,100
              ReLU-5                  [-1, 100]               0
            Linear-6                   [-1, 10]           1,010
Total params: 266,610
Trainable params: 266,610
Non-trainable params: 0
----------------------------------------------------------------
Input size (MB): 0.00
Forward/backward pass size (MB): 0.01
Params size (MB): 1.02
Estimated Total Size (MB): 1.03
----------------------------------------------------------------
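The parameter counts in the summary can be checked by hand: a linear layer with $n_{in}$ inputs and $n_{out}$ outputs has $n_{in} \cdot n_{out}$ weights plus $n_{out}$ biases. A small sketch of that arithmetic (the helper name `count_mlp_params` is made up for this illustration):

```python
def count_mlp_params(units):
    # Each Linear layer has n_in * n_out weights plus n_out biases.
    return [n_in * n_out + n_out for n_in, n_out in zip(units[:-1], units[1:])]


per_layer = count_mlp_params([28 * 28, 300, 100, 10])
total = sum(per_layer)
# Parameters are stored as 32-bit floats, i.e. 4 bytes each.
size_mb = total * 4 / 1024 ** 2
print(per_layer, total, round(size_mb, 2))  # [235500, 30100, 1010] 266610 1.02
```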


### Training Loop

For training, implement a method with the arguments:
* `model`: Model to be trained
* `lr`: Learning rate
* `nepochs`: Number of epochs
* `batchsize`: Batch size
* `training_data`: Training set (subclass of `Dataset`)
* `validation_data`: Validation set (subclass of `Dataset`)

Record the training and validation cost and accuracy for every epoch so that the progress of the training can be monitored. <br>
Note that for the training cost and accuracy you can use the per batch quantities averaged over an epoch. 

Furthermore, you can use the SGD optimizer of pytorch (`torch.optim.SGD`) - but without momentum.
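Without momentum, one `torch.optim.SGD` step is the plain update $w \leftarrow w - \text{lr} \cdot \nabla_w L$. The sketch below checks this on a two-element parameter; the toy loss $\sum_i w_i^2$ and the values are assumptions picked so the result is easy to verify by hand.

```python
import torch

w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
optimizer = torch.optim.SGD([w], lr=0.5)  # momentum defaults to 0

loss = (w ** 2).sum()  # gradient is 2 * w
optimizer.zero_grad()
loss.backward()

# Manual update computed before the optimizer changes w in place.
expected = w.detach() - 0.5 * w.grad
optimizer.step()

print(w.detach(), expected)
```

With $w = (1, -2)$ the gradient is $(2, -4)$, so a step with learning rate 0.5 lands exactly on $(0, 0)$.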

In [118]:
def train_eval(model, lr, nepochs, nbatch, training_set, validation_set):
    # finally return the sequence of per epoch values
    cost_hist = []
    cost_hist_valid = []
    acc_hist = []
    acc_hist_valid = []

    cost_ce = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)

    ### YOUR CODE START ###
    
    # epoch: current epoch
    # cost, cost_valid, acc, acc_valid: cost and accuracy (for training, validation set) per epoch

    training_loader = DataLoader(training_set, batch_size=nbatch, shuffle=True)
    # Validation needs no shuffling; one large batch keeps the evaluation fast.
    validation_loader = DataLoader(validation_set, batch_size=10000, shuffle=False)

    size = len(training_loader.dataset)
    nbatches = len(training_loader)

    for epoch in range(nepochs):
        # Reset the running sums every epoch so values do not leak across epochs.
        cost, acc = 0.0, 0.0
        model.train()
        for X, Y in training_loader:
            pred = model(X)
            loss = cost_ce(pred, Y)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # .item() yields plain floats that are detached from the autograd graph.
            acc += (pred.argmax(dim=1) == Y).sum().item()
            cost += loss.item()

        cost /= nbatches
        acc /= size

        cost_valid, acc_valid = 0.0, 0.0
        model.eval()
        with torch.no_grad():
            for X, Y in validation_loader:
                pred = model(X)
                acc_valid += (pred.argmax(dim=1) == Y).sum().item()
                # Weight each batch loss by its size to get the mean over the whole set.
                cost_valid += cost_ce(pred, Y).item() * len(Y)
        cost_valid /= len(validation_loader.dataset)
        acc_valid /= len(validation_loader.dataset)

        print("Epoch %i: %f, %f, %f, %f" % (epoch, cost, acc, cost_valid, acc_valid))

        ### YOUR CODE END ###
        
        cost_hist.append(cost)
        cost_hist_valid.append(cost_valid)
        acc_hist.append(acc)
        acc_hist_valid.append(acc_valid)
    return cost_hist, cost_hist_valid, acc_hist, acc_hist_valid

In [119]:
units = (784,300,100,10)  # configuration of the `model` instance created above
nepochs = 20
lr = 0.5

train_eval(model,lr=lr,nepochs=nepochs,nbatch=64,training_set=training_set,validation_set=validation_set)

Epoch 0: 0.329284, 0.896100, 0.130532, 0.961200
Epoch 1: 0.109761, 0.966258, 0.562918, 0.866600
Epoch 2: 0.075648, 0.976199, 0.107886, 0.967600
Epoch 3: 0.052647, 0.983080, 0.293440, 0.933300
Epoch 4: 0.040176, 0.986620, 0.186789, 0.952000
Epoch 5: 0.029490, 0.990380, 0.101680, 0.971800
Epoch 6: 0.024980, 0.992000, 0.077242, 0.978900
Epoch 7: 0.020006, 0.993860, 0.087881, 0.976000
Epoch 8: 0.010044, 0.996980, 0.093514, 0.978400
Epoch 9: 0.013439, 0.995380, 0.084583, 0.979700
Epoch 10: 0.012179, 0.996100, 0.099076, 0.978600
Epoch 11: 0.007437, 0.997440, 0.104426, 0.978800
Epoch 12: 0.007060, 0.997680, 0.102571, 0.979700
Epoch 13: 0.008387, 0.997400, 0.112578, 0.978400
Epoch 14: 0.006891, 0.997820, 0.110990, 0.978800
Epoch 15: 0.005433, 0.998200, 1.801062, 0.878800
Epoch 16: 0.010228, 0.997640, 0.117600, 0.976700
Epoch 17: 0.004936, 0.998520, 0.100398, 0.981700
Epoch 18: 0.001167, 0.999780, 0.097673, 0.983400
Epoch 19: 0.000236, 1.000000, 0.098411, 0.982800
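The four sequences returned by `train_eval` are easier to judge as curves than as a log. A possible plotting sketch follows; `plot_history` is a hypothetical helper, and the numbers are simply the first five epochs copied from the log above.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for running outside Jupyter; drop inside a notebook
import matplotlib.pyplot as plt


def plot_history(cost_hist, cost_hist_valid, acc_hist, acc_hist_valid):
    """Hypothetical helper: cost and accuracy per epoch, side by side."""
    epochs = range(len(cost_hist))
    fig, (ax_cost, ax_acc) = plt.subplots(1, 2, figsize=(10, 4))
    ax_cost.plot(epochs, cost_hist, label="training")
    ax_cost.plot(epochs, cost_hist_valid, label="validation")
    ax_cost.set_xlabel("epoch")
    ax_cost.set_ylabel("cost")
    ax_cost.legend()
    ax_acc.plot(epochs, acc_hist, label="training")
    ax_acc.plot(epochs, acc_hist_valid, label="validation")
    ax_acc.set_xlabel("epoch")
    ax_acc.set_ylabel("accuracy")
    ax_acc.legend()
    return fig


# Values copied from epochs 0 to 4 of the log above.
cost_hist = [0.329284, 0.109761, 0.075648, 0.052647, 0.040176]
cost_hist_valid = [0.130532, 0.562918, 0.107886, 0.293440, 0.186789]
acc_hist = [0.896100, 0.966258, 0.976199, 0.983080, 0.986620]
acc_hist_valid = [0.961200, 0.866600, 0.967600, 0.933300, 0.952000]
fig = plot_history(cost_hist, cost_hist_valid, acc_hist, acc_hist_valid)
```

In the log, the validation cost jumps at epochs 1 and 15 while the training cost keeps falling. A plot makes such spikes easy to spot, and they may indicate that a learning rate of 0.5 is on the high side.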



### Exploration

Now use this functionality to explore different layer configurations: 
* Number of layers
* Number of units per layer
* Suitable learning rate
* Suitable number of epochs.

Use a batchsize of 64.

Make sure that you choose a sufficiently large number of epochs so that the learning has more or less stabilized (converged).
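One way to organise such an exploration is to loop over a list of configurations and collect one result row per setting. The sketch below is only an outline under stated assumptions: it uses random stand-in data instead of MNIST, a made-up `build` helper, and an arbitrary grid `configs`, so the printed accuracies carry no meaning.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
# Random stand-in data (assumption); replace with the MNIST training split.
X = torch.randn(256, 784)
Y = torch.randint(0, 10, (256,))
loader = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)


def build(units):
    # Linear layers with ReLU in between, no activation after the output layer.
    layers = []
    for i, (n_in, n_out) in enumerate(zip(units[:-1], units[1:])):
        layers.append(torch.nn.Linear(n_in, n_out))
        if i < len(units) - 2:
            layers.append(torch.nn.ReLU())
    return torch.nn.Sequential(*layers)


# Hypothetical grid of (units, lr, nepochs) settings.
configs = [((784, 10), 0.1, 2), ((784, 32, 10), 0.1, 2), ((784, 64, 32, 10), 0.05, 2)]

rows = []
for units, lr, nepochs in configs:
    model = build(units)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(nepochs):
        for xb, yb in loader:
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()
    with torch.no_grad():
        acc = (model(X).argmax(dim=1) == Y).float().mean().item()
    rows.append((units, nepochs, lr, acc))

# Print the rows in the layout of a markdown table.
for units, nepochs, lr, acc in rows:
    print(f"| {units} | {nepochs} | {lr} | {acc:.1%} |")
```

In the notebook itself, the inner loops would be replaced by a call to `train_eval` with `nbatch=64`, using the last entries of the returned histories for the table.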

### Summary

Summarize your findings with the different settings in a table.

| Units | nepochs | lr | Acc (Train) | Acc (Valid) |
| --- | :-: | :-: | :-: | :-: |
| (784,10,10) | 20 | 0.5 | 94.1% | 93.4% |
| (784,300,100,10) | 20 | 0.5 | 100.0% | 98.3% |

<font style="color:red">The model is much better than the old one (98.3% versus 93% validation accuracy), but we pay for this with longer computation time.</font>
