# Task 2.
Using the MNIST data set, randomly select 10000 images as the training set and 5000 images as the validation set, and use the test set as provided. Explore how the network architecture (number of hidden layers, number of dimensions per hidden layer, activation functions), batch normalization, optimization algorithm, learning rate, and batch size affect the training accuracy and validation accuracy. Finally, take the model that achieved the best validation accuracy and train it further on the whole training set: what are your training accuracy, validation accuracy, and test accuracy? Please summarize what you have learned from this experiment (HW8b.ipynb). (30 points)

### Download MNIST dataset

In [1]:
import torch
import torch.nn as nn

from torch.utils.data import random_split
from torch.utils.data import DataLoader

from torchvision import datasets
from torchvision.transforms import ToTensor


In [2]:
torch.manual_seed(1)

<torch._C.Generator at 0x7fb4bfd2c870>

In [3]:
# download the MNIST dataset
dataset = datasets.MNIST(
    root='data/', 
    train=True, 
    transform=ToTensor(), 
    download=True
)

In [4]:
# split dataset into initial training, training, and validation datasets
init_train, train_ds, valid_ds  = random_split(dataset, [10000, 45000, 5000])

# get MNIST test dataset
test_ds = datasets.MNIST(root='data/', train=False, transform=ToTensor())

In [5]:
# create dataloaders
batch_size = 64

init_train_dl = DataLoader(init_train, batch_size, shuffle=True)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=True)

### Explore models

In [6]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print("Using {} device".format(device))

Using cpu device


**Base model: ```model_0```**

In [7]:
# construct a base model
hidden_units = [32, 16]
image_size = init_train[0][0].shape
input_size = image_size[0] * image_size[1] * image_size[2]
all_layers = [nn.Flatten()]

for hidden_unit in hidden_units:
    layer = nn.Linear(input_size, hidden_unit)
    all_layers.append(layer)
    all_layers.append(nn.ReLU())
    input_size = hidden_unit
    
all_layers.append(nn.Linear(hidden_units[-1], 10))
model_0 = nn.Sequential(*all_layers)
model_0

Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=32, bias=True)
  (2): ReLU()
  (3): Linear(in_features=32, out_features=16, bias=True)
  (4): ReLU()
  (5): Linear(in_features=16, out_features=10, bias=True)
)

In [8]:
# set up the loss function and optimizer
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(model_0.parameters(), lr=learning_rate)

In [9]:
def train_and_validate(model, train_dl, valid_dl, optimizer, loss_fn, num_epochs=20):
    for epoch in range(num_epochs):
        # Training loop
        acc_hist_train = 0
        loss_hist_train = 0
        for x_batch, y_batch in train_dl:
            pred = model(x_batch)
            loss = loss_fn(pred, y_batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
            acc_hist_train += is_correct.sum()
            loss_hist_train += loss.item() * x_batch.size(0)  # accumulate loss
        acc_hist_train /= len(train_dl.dataset)
        loss_hist_train /= len(train_dl.dataset)

        # Validation loop
        acc_hist_valid = 0
        loss_hist_valid = 0
        with torch.no_grad():
            for x_batch, y_batch in valid_dl:
                pred = model(x_batch)
                loss = loss_fn(pred, y_batch)
                is_correct = (torch.argmax(pred, dim=1) == y_batch).float()
                acc_hist_valid += is_correct.sum()
                loss_hist_valid += loss.item() * x_batch.size(0)
            acc_hist_valid /= len(valid_dl.dataset)
            loss_hist_valid /= len(valid_dl.dataset)

        print(f'Epoch [{epoch+1:0>2}/{num_epochs}], Train Loss: {loss_hist_train:.4f}, Train Acc: {acc_hist_train:.4f}, Valid Loss: {loss_hist_valid:.4f}, Valid Acc: {acc_hist_valid:.4f}')

In [10]:
train_and_validate(model_0, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 1.3329, Train Acc: 0.5996, Valid Loss: 0.6420, Valid Acc: 0.8184
Epoch [02/20], Train Loss: 0.5098, Train Acc: 0.8545, Valid Loss: 0.4246, Valid Acc: 0.8806
Epoch [03/20], Train Loss: 0.3747, Train Acc: 0.8952, Valid Loss: 0.3600, Valid Acc: 0.8954
Epoch [04/20], Train Loss: 0.3205, Train Acc: 0.9092, Valid Loss: 0.3280, Valid Acc: 0.9086
Epoch [05/20], Train Loss: 0.2868, Train Acc: 0.9171, Valid Loss: 0.3005, Valid Acc: 0.9162
Epoch [06/20], Train Loss: 0.2618, Train Acc: 0.9233, Valid Loss: 0.2968, Valid Acc: 0.9158
Epoch [07/20], Train Loss: 0.2455, Train Acc: 0.9310, Valid Loss: 0.2836, Valid Acc: 0.9178
Epoch [08/20], Train Loss: 0.2280, Train Acc: 0.9357, Valid Loss: 0.2759, Valid Acc: 0.9224
Epoch [09/20], Train Loss: 0.2168, Train Acc: 0.9379, Valid Loss: 0.2726, Valid Acc: 0.9236
Epoch [10/20], Train Loss: 0.2018, Train Acc: 0.9409, Valid Loss: 0.2605, Valid Acc: 0.9240
Epoch [11/20], Train Loss: 0.1926, Train Acc: 0.9456, Valid Loss: 0.2578, Valid 

**First variation, ```model_1```: test a higher dimensional hidden layer**

In [14]:
# construct model with a higher dimensional hidden layer
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.ReLU(), 
            nn.Linear(32, 32), 
            nn.ReLU(), 
            nn.Linear(32, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
    
model_1 = Model()
model_1

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): ReLU()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): ReLU()
    (4): Linear(in_features=32, out_features=10, bias=True)
  )
)

In [15]:
# evaluate the model
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(model_1.parameters(), lr=learning_rate)

train_and_validate(model_1, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 1.2091, Train Acc: 0.6400, Valid Loss: 0.5031, Valid Acc: 0.8646
Epoch [02/20], Train Loss: 0.4005, Train Acc: 0.8860, Valid Loss: 0.3543, Valid Acc: 0.8944
Epoch [03/20], Train Loss: 0.3195, Train Acc: 0.9079, Valid Loss: 0.3141, Valid Acc: 0.9074
Epoch [04/20], Train Loss: 0.2793, Train Acc: 0.9209, Valid Loss: 0.2962, Valid Acc: 0.9146
Epoch [05/20], Train Loss: 0.2579, Train Acc: 0.9255, Valid Loss: 0.2874, Valid Acc: 0.9128
Epoch [06/20], Train Loss: 0.2371, Train Acc: 0.9313, Valid Loss: 0.2770, Valid Acc: 0.9166
Epoch [07/20], Train Loss: 0.2155, Train Acc: 0.9377, Valid Loss: 0.2624, Valid Acc: 0.9210
Epoch [08/20], Train Loss: 0.2021, Train Acc: 0.9404, Valid Loss: 0.2547, Valid Acc: 0.9240
Epoch [09/20], Train Loss: 0.1875, Train Acc: 0.9450, Valid Loss: 0.2652, Valid Acc: 0.9210
Epoch [10/20], Train Loss: 0.1790, Train Acc: 0.9465, Valid Loss: 0.2546, Valid Acc: 0.9236
Epoch [11/20], Train Loss: 0.1653, Train Acc: 0.9539, Valid Loss: 0.2468, Valid 

Widening the second hidden layer from 16 to 32 units allowed the model to fit the initial training set better, but performance on the validation set actually decreased slightly.

**Second variation ```model_2```: add another hidden layer to the base model**

In [18]:
# construct model with three hidden layers
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.ReLU(), 
            nn.Linear(32, 32), 
            nn.ReLU(), 
            nn.Linear(32, 16), 
            nn.ReLU(), 
            nn.Linear(16, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
    
model_2 = Model()
model_2

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): ReLU()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): ReLU()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): ReLU()
    (6): Linear(in_features=16, out_features=10, bias=True)
  )
)

In [19]:
# evaluate the model
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(model_2.parameters(), lr=learning_rate)

train_and_validate(model_2, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 1.4398, Train Acc: 0.5359, Valid Loss: 0.6894, Valid Acc: 0.8024
Epoch [02/20], Train Loss: 0.5639, Train Acc: 0.8360, Valid Loss: 0.4875, Valid Acc: 0.8560
Epoch [03/20], Train Loss: 0.4382, Train Acc: 0.8746, Valid Loss: 0.4065, Valid Acc: 0.8838
Epoch [04/20], Train Loss: 0.3691, Train Acc: 0.8930, Valid Loss: 0.3567, Valid Acc: 0.8940
Epoch [05/20], Train Loss: 0.3142, Train Acc: 0.9098, Valid Loss: 0.3275, Valid Acc: 0.9028
Epoch [06/20], Train Loss: 0.2793, Train Acc: 0.9183, Valid Loss: 0.3021, Valid Acc: 0.9074
Epoch [07/20], Train Loss: 0.2534, Train Acc: 0.9279, Valid Loss: 0.2783, Valid Acc: 0.9202
Epoch [08/20], Train Loss: 0.2350, Train Acc: 0.9300, Valid Loss: 0.2822, Valid Acc: 0.9166
Epoch [09/20], Train Loss: 0.2078, Train Acc: 0.9400, Valid Loss: 0.2720, Valid Acc: 0.9192
Epoch [10/20], Train Loss: 0.1946, Train Acc: 0.9412, Valid Loss: 0.2554, Valid Acc: 0.9218
Epoch [11/20], Train Loss: 0.1791, Train Acc: 0.9477, Valid Loss: 0.2543, Valid 

Adding a hidden layer seems to improve model performance by a noticeable amount. Performance on the initial training set is better than that of the base model. The validation loss is similar between the two models, but the validation accuracy is better for the current model. For the following variations, I will modify this model, ```model_2```.
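
To put the architecture comparison in context, the sketch below counts the trainable parameters of the three layouts tried so far. `build_mlp` and `count_parameters` are hypothetical helpers introduced only for this illustration; they are not part of the original notebook.

```python
import torch.nn as nn

def build_mlp(hidden_units, activation=nn.ReLU, input_size=28 * 28, num_classes=10):
    """Hypothetical helper: Flatten, then Linear + activation per hidden layer."""
    layers = [nn.Flatten()]
    in_features = input_size
    for out_features in hidden_units:
        layers.append(nn.Linear(in_features, out_features))
        layers.append(activation())
        in_features = out_features
    # output layer returns raw logits, one per class
    layers.append(nn.Linear(in_features, num_classes))
    return nn.Sequential(*layers)

def count_parameters(model):
    """Total number of trainable weights and biases."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

param_counts = {
    "model_0": count_parameters(build_mlp([32, 16])),      # base model
    "model_1": count_parameters(build_mlp([32, 32])),      # wider second layer
    "model_2": count_parameters(build_mlp([32, 32, 16])),  # extra hidden layer
}
print(param_counts)
```

The three models differ by only about a thousand parameters, because the first layer (784 inputs to 32 units) dominates the count. This may be one reason the differences in validation accuracy between them are small.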

**Third variation ```model_3```: test different activation functions on ```model_2```**

1. Add a ```softmax()``` activation function in the output layer (since it is a multi-class classification problem).

In [32]:
# construct a model that applies softmax to the output layer
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.ReLU(), 
            nn.Linear(32, 32), 
            nn.ReLU(), 
            nn.Linear(32, 16), 
            nn.ReLU(), 
            nn.Linear(16, 10),
            nn.Softmax(dim=1)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
    
model_3 = Model()
model_3

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): ReLU()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): ReLU()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): ReLU()
    (6): Linear(in_features=16, out_features=10, bias=True)
    (7): Softmax(dim=1)
  )
)

In [33]:
# evaluate the model
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(model_3.parameters(), lr=learning_rate)

train_and_validate(model_3, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 2.1295, Train Acc: 0.3400, Valid Loss: 1.9287, Valid Acc: 0.5714
Epoch [02/20], Train Loss: 1.8162, Train Acc: 0.6743, Valid Loss: 1.7323, Valid Acc: 0.7606
Epoch [03/20], Train Loss: 1.6902, Train Acc: 0.7909, Valid Loss: 1.6838, Valid Acc: 0.7908
Epoch [04/20], Train Loss: 1.6622, Train Acc: 0.8098, Valid Loss: 1.6627, Valid Acc: 0.8136
Epoch [05/20], Train Loss: 1.6487, Train Acc: 0.8213, Valid Loss: 1.6551, Valid Acc: 0.8156
Epoch [06/20], Train Loss: 1.6408, Train Acc: 0.8282, Valid Loss: 1.6478, Valid Acc: 0.8214
Epoch [07/20], Train Loss: 1.6321, Train Acc: 0.8356, Valid Loss: 1.6453, Valid Acc: 0.8224
Epoch [08/20], Train Loss: 1.6278, Train Acc: 0.8387, Valid Loss: 1.6428, Valid Acc: 0.8236
Epoch [09/20], Train Loss: 1.6241, Train Acc: 0.8428, Valid Loss: 1.6426, Valid Acc: 0.8242
Epoch [10/20], Train Loss: 1.6188, Train Acc: 0.8474, Valid Loss: 1.6412, Valid Acc: 0.8254
Epoch [11/20], Train Loss: 1.6166, Train Acc: 0.8485, Valid Loss: 1.6363, Valid 

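Adding a softmax activation makes the model worse. This is expected: ```nn.CrossEntropyLoss``` already applies log-softmax internally and expects raw logits, so the extra ```Softmax``` layer normalizes the outputs twice.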
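
The effect of applying softmax before ```nn.CrossEntropyLoss``` can be checked on a single hand-made example. The sketch below is an illustration with made-up tensors, not code from the notebook. It compares the loss on confident raw logits with the loss on the best possible softmax output (a one-hot probability vector).

```python
import math
import torch
import torch.nn as nn

loss_fn = nn.CrossEntropyLoss()
target = torch.tensor([0])  # the correct class is class 0

# Confident raw logits: the loss can get arbitrarily close to zero.
logits = torch.tensor([[20.0] + [0.0] * 9])
loss_on_logits = loss_fn(logits, target).item()

# Best case after an extra Softmax layer: a one-hot probability vector.
# CrossEntropyLoss treats these probabilities as logits and applies
# log-softmax again, so the loss cannot drop below log(1 + 9/e).
probs = torch.tensor([[1.0] + [0.0] * 9])
loss_on_probs = loss_fn(probs, target).item()

floor = math.log(1 + 9 / math.e)  # about 1.46 for 10 classes
print(loss_on_logits, loss_on_probs, floor)
```

With 10 classes the loss is bounded below by roughly 1.46 once the outputs are squashed into [0, 1], which is consistent with the training loss above levelling off around 1.6.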

2. Try a ```Tanh()``` activation function between each layer.

In [31]:
# construct model with Tanh() activation function
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.Tanh(), 
            nn.Linear(32, 32), 
            nn.Tanh(), 
            nn.Linear(32, 16), 
            nn.Tanh(), 
            nn.Linear(16, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
    
model_3 = Model()
model_3

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): Tanh()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): Tanh()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): Tanh()
    (6): Linear(in_features=16, out_features=10, bias=True)
  )
)

In [26]:
# evaluate the model
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(model_3.parameters(), lr=learning_rate)

train_and_validate(model_3, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 1.4537, Train Acc: 0.6403, Valid Loss: 0.8828, Valid Acc: 0.8228
Epoch [02/20], Train Loss: 0.6591, Train Acc: 0.8569, Valid Loss: 0.5250, Valid Acc: 0.8806
Epoch [03/20], Train Loss: 0.4208, Train Acc: 0.9027, Valid Loss: 0.3892, Valid Acc: 0.9018
Epoch [04/20], Train Loss: 0.3131, Train Acc: 0.9248, Valid Loss: 0.3280, Valid Acc: 0.9108
Epoch [05/20], Train Loss: 0.2537, Train Acc: 0.9373, Valid Loss: 0.2949, Valid Acc: 0.9174
Epoch [06/20], Train Loss: 0.2117, Train Acc: 0.9458, Valid Loss: 0.2664, Valid Acc: 0.9248
Epoch [07/20], Train Loss: 0.1804, Train Acc: 0.9530, Valid Loss: 0.2456, Valid Acc: 0.9300
Epoch [08/20], Train Loss: 0.1518, Train Acc: 0.9631, Valid Loss: 0.2413, Valid Acc: 0.9312
Epoch [09/20], Train Loss: 0.1323, Train Acc: 0.9675, Valid Loss: 0.2369, Valid Acc: 0.9336
Epoch [10/20], Train Loss: 0.1150, Train Acc: 0.9735, Valid Loss: 0.2274, Valid Acc: 0.9382
Epoch [11/20], Train Loss: 0.1015, Train Acc: 0.9765, Valid Loss: 0.2366, Valid 

The model seems to actually perform better with ```Tanh()``` than with ```ReLU()``` as the activation function.

3. Try a ```Sigmoid()``` activation function between each layer.

In [27]:
# construct model with Sigmoid activation function
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.Sigmoid(), 
            nn.Linear(32, 32), 
            nn.Sigmoid(), 
            nn.Linear(32, 16), 
            nn.Sigmoid(), 
            nn.Linear(16, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits
    
model_3 = Model()
model_3

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): Sigmoid()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): Sigmoid()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): Sigmoid()
    (6): Linear(in_features=16, out_features=10, bias=True)
  )
)

In [28]:
# evaluate the model
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(model_3.parameters(), lr=learning_rate)

train_and_validate(model_3, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 2.2532, Train Acc: 0.1602, Valid Loss: 2.0975, Valid Acc: 0.2210
Epoch [02/20], Train Loss: 1.9239, Train Acc: 0.2481, Valid Loss: 1.7869, Valid Acc: 0.2842
Epoch [03/20], Train Loss: 1.6914, Train Acc: 0.4225, Valid Loss: 1.5998, Valid Acc: 0.4738
Epoch [04/20], Train Loss: 1.5133, Train Acc: 0.5085, Valid Loss: 1.4388, Valid Acc: 0.6022
Epoch [05/20], Train Loss: 1.3324, Train Acc: 0.6232, Valid Loss: 1.2486, Valid Acc: 0.6350
Epoch [06/20], Train Loss: 1.1455, Train Acc: 0.6641, Valid Loss: 1.0868, Valid Acc: 0.6626
Epoch [07/20], Train Loss: 0.9994, Train Acc: 0.7009, Valid Loss: 0.9657, Valid Acc: 0.7132
Epoch [08/20], Train Loss: 0.8853, Train Acc: 0.7418, Valid Loss: 0.8644, Valid Acc: 0.7494
Epoch [09/20], Train Loss: 0.7915, Train Acc: 0.7872, Valid Loss: 0.7828, Valid Acc: 0.7858
Epoch [10/20], Train Loss: 0.7103, Train Acc: 0.8158, Valid Loss: 0.7174, Valid Acc: 0.8208
Epoch [11/20], Train Loss: 0.6402, Train Acc: 0.8511, Valid Loss: 0.6579, Valid 

The best performance is still when ```Tanh()``` is used as the activation function. From now on I will use ```Tanh()``` as the activation function.

**Fourth variation ```model_4```: add batch normalization**

In [34]:
# construct model with batch normalization
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32),
            nn.BatchNorm1d(32),
            nn.Tanh(),
            nn.Linear(32, 32),
            nn.BatchNorm1d(32),
            nn.Tanh(),
            nn.Linear(32, 16),
            nn.BatchNorm1d(16),
            nn.Tanh(),
            nn.Linear(16, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

    
model_4 = Model()
model_4

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (2): Tanh()
    (3): Linear(in_features=32, out_features=32, bias=True)
    (4): BatchNorm1d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (5): Tanh()
    (6): Linear(in_features=32, out_features=16, bias=True)
    (7): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (8): Tanh()
    (9): Linear(in_features=16, out_features=10, bias=True)
  )
)

In [35]:
# evaluate the model
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(model_4.parameters(), lr=learning_rate)

train_and_validate(model_4, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 1.1783, Train Acc: 0.8032, Valid Loss: 0.7759, Valid Acc: 0.8764
Epoch [02/20], Train Loss: 0.5888, Train Acc: 0.8967, Valid Loss: 0.4751, Valid Acc: 0.9046
Epoch [03/20], Train Loss: 0.3859, Train Acc: 0.9187, Valid Loss: 0.3832, Valid Acc: 0.9064
Epoch [04/20], Train Loss: 0.2915, Train Acc: 0.9345, Valid Loss: 0.3160, Valid Acc: 0.9166
Epoch [05/20], Train Loss: 0.2378, Train Acc: 0.9413, Valid Loss: 0.3198, Valid Acc: 0.9092
Epoch [06/20], Train Loss: 0.1972, Train Acc: 0.9500, Valid Loss: 0.2706, Valid Acc: 0.9240
Epoch [07/20], Train Loss: 0.1728, Train Acc: 0.9559, Valid Loss: 0.2656, Valid Acc: 0.9222
Epoch [08/20], Train Loss: 0.1459, Train Acc: 0.9630, Valid Loss: 0.2588, Valid Acc: 0.9246
Epoch [09/20], Train Loss: 0.1227, Train Acc: 0.9704, Valid Loss: 0.2422, Valid Acc: 0.9278
Epoch [10/20], Train Loss: 0.1113, Train Acc: 0.9729, Valid Loss: 0.2587, Valid Acc: 0.9228
Epoch [11/20], Train Loss: 0.1009, Train Acc: 0.9731, Valid Loss: 0.2459, Valid 

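Batch normalization does not seem to improve the model: training accuracy rises faster in the first epochs, but validation accuracy is no better than without it. I will not use it going forward.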
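
One caveat about this comparison: ```train_and_validate``` never calls ```model.eval()```, so the ```BatchNorm1d``` layers normalize each validation batch with that batch's own statistics rather than the running estimates. The sketch below uses a standalone layer and random data (both are assumptions made for the illustration) to show that the two modes give different outputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 3 + 5  # batch with non-zero mean and non-unit variance

bn.train()
out_train = bn(x)  # uses this batch's mean/variance and updates the running stats

bn.eval()
out_eval = bn(x)   # uses the running mean/variance accumulated so far

same_output = torch.allclose(out_train, out_eval)
print(same_output)
```

Calling ```model.train()``` at the start of each training epoch and ```model.eval()``` before the validation loop would make the validation numbers reflect inference-time behavior. Whether that changes the conclusion for ```model_4``` has not been tested here.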

**Fifth variation, ```model_5```: try different optimizers**

Try the SGD optimizer.

In [50]:
# construct model with Tanh() activation function
# this is the model with the best validation accuracy so far
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.Tanh(), 
            nn.Linear(32, 32), 
            nn.Tanh(), 
            nn.Linear(32, 16), 
            nn.Tanh(), 
            nn.Linear(16, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model_5 = Model()
model_5

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): Tanh()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): Tanh()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): Tanh()
    (6): Linear(in_features=16, out_features=10, bias=True)
  )
)

In [39]:
# evaluate the model
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.SGD(model_5.parameters(), lr=learning_rate)

train_and_validate(model_5, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 2.3183, Train Acc: 0.1255, Valid Loss: 2.3075, Valid Acc: 0.1450
Epoch [02/20], Train Loss: 2.3097, Train Acc: 0.1453, Valid Loss: 2.2996, Valid Acc: 0.1636
Epoch [03/20], Train Loss: 2.3014, Train Acc: 0.1634, Valid Loss: 2.2920, Valid Acc: 0.1870
Epoch [04/20], Train Loss: 2.2934, Train Acc: 0.1868, Valid Loss: 2.2845, Valid Acc: 0.2188
Epoch [05/20], Train Loss: 2.2856, Train Acc: 0.2329, Valid Loss: 2.2772, Valid Acc: 0.2714
Epoch [06/20], Train Loss: 2.2778, Train Acc: 0.2836, Valid Loss: 2.2698, Valid Acc: 0.3126
Epoch [07/20], Train Loss: 2.2700, Train Acc: 0.3150, Valid Loss: 2.2622, Valid Acc: 0.3360
Epoch [08/20], Train Loss: 2.2620, Train Acc: 0.3293, Valid Loss: 2.2545, Valid Acc: 0.3464
Epoch [09/20], Train Loss: 2.2538, Train Acc: 0.3401, Valid Loss: 2.2464, Valid Acc: 0.3496
Epoch [10/20], Train Loss: 2.2453, Train Acc: 0.3424, Valid Loss: 2.2380, Valid Acc: 0.3542
Epoch [11/20], Train Loss: 2.2364, Train Acc: 0.3446, Valid Loss: 2.2290, Valid 

With the same learning rate of 0.001, SGD learns far more slowly than Adam: validation accuracy is only about 0.35 after 10 epochs. I will keep Adam as the optimizer.
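
One plausible explanation, offered here as an assumption rather than something the notebook verifies, is that the two optimizers scale their updates differently. Plain SGD moves each weight by `lr * grad`, while Adam's first update is close to `lr` in size regardless of the gradient magnitude. The sketch below measures the first step on a single made-up parameter; `first_step_size` is a hypothetical helper.

```python
import torch

def first_step_size(optimizer_cls, grad_value, lr=0.001):
    """Hypothetical helper: size of the first update for a scalar weight."""
    w = torch.nn.Parameter(torch.tensor([1.0]))
    optimizer = optimizer_cls([w], lr=lr)
    w.grad = torch.tensor([grad_value])  # set the gradient by hand
    optimizer.step()
    return abs(1.0 - w.item())

# With a small gradient of 0.01:
sgd_step = first_step_size(torch.optim.SGD, 0.01)    # lr * grad, about 1e-5
adam_step = first_step_size(torch.optim.Adam, 0.01)  # about lr, i.e. 1e-3
print(sgd_step, adam_step)
```

If this is the cause, SGD with a larger learning rate or with momentum (for example `torch.optim.SGD(model_5.parameters(), lr=0.1, momentum=0.9)`) might be a fairer comparison. That setting was not tried in this notebook.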

**Sixth variation: test the effect of different learning rates on ```model_5```**

1. learning_rate = 0.01

In [52]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.Tanh(), 
            nn.Linear(32, 32), 
            nn.Tanh(), 
            nn.Linear(32, 16), 
            nn.Tanh(), 
            nn.Linear(16, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model_5 = Model()
model_5

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): Tanh()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): Tanh()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): Tanh()
    (6): Linear(in_features=16, out_features=10, bias=True)
  )
)

In [53]:
# evaluate the model with learning_rate = 0.01
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.01
optimizer = torch.optim.Adam(model_5.parameters(), lr=learning_rate)

train_and_validate(model_5, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 0.6785, Train Acc: 0.8063, Valid Loss: 0.4129, Valid Acc: 0.8748
Epoch [02/20], Train Loss: 0.3234, Train Acc: 0.9082, Valid Loss: 0.3549, Valid Acc: 0.8970
Epoch [03/20], Train Loss: 0.2648, Train Acc: 0.9217, Valid Loss: 0.3262, Valid Acc: 0.9072
Epoch [04/20], Train Loss: 0.2289, Train Acc: 0.9326, Valid Loss: 0.2967, Valid Acc: 0.9168
Epoch [05/20], Train Loss: 0.1863, Train Acc: 0.9456, Valid Loss: 0.2818, Valid Acc: 0.9214
Epoch [06/20], Train Loss: 0.1663, Train Acc: 0.9523, Valid Loss: 0.2710, Valid Acc: 0.9274
Epoch [07/20], Train Loss: 0.1677, Train Acc: 0.9505, Valid Loss: 0.2817, Valid Acc: 0.9254
Epoch [08/20], Train Loss: 0.1544, Train Acc: 0.9535, Valid Loss: 0.2901, Valid Acc: 0.9228
Epoch [09/20], Train Loss: 0.1378, Train Acc: 0.9564, Valid Loss: 0.2870, Valid Acc: 0.9270
Epoch [10/20], Train Loss: 0.1395, Train Acc: 0.9567, Valid Loss: 0.2686, Valid Acc: 0.9306
Epoch [11/20], Train Loss: 0.1263, Train Acc: 0.9631, Valid Loss: 0.2969, Valid 

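The model does not perform as well with a larger learning rate: it learns faster in the first epochs, but the validation loss then fluctuates instead of decreasing steadily.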

2. learning_rate = 0.0001

In [54]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.Tanh(), 
            nn.Linear(32, 32), 
            nn.Tanh(), 
            nn.Linear(32, 16), 
            nn.Tanh(), 
            nn.Linear(16, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model_5 = Model()
model_5

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): Tanh()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): Tanh()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): Tanh()
    (6): Linear(in_features=16, out_features=10, bias=True)
  )
)

In [55]:
# evaluate the model with learning_rate = 0.0001
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.0001
optimizer = torch.optim.Adam(model_5.parameters(), lr=learning_rate)

train_and_validate(model_5, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 2.1945, Train Acc: 0.2992, Valid Loss: 2.0443, Valid Acc: 0.5058
Epoch [02/20], Train Loss: 1.8883, Train Acc: 0.5725, Valid Loss: 1.7515, Valid Acc: 0.5958
Epoch [03/20], Train Loss: 1.6409, Train Acc: 0.6408, Valid Loss: 1.5446, Valid Acc: 0.6668
Epoch [04/20], Train Loss: 1.4562, Train Acc: 0.6928, Valid Loss: 1.3790, Valid Acc: 0.6996
Epoch [05/20], Train Loss: 1.3041, Train Acc: 0.7260, Valid Loss: 1.2406, Valid Acc: 0.7290
Epoch [06/20], Train Loss: 1.1766, Train Acc: 0.7490, Valid Loss: 1.1261, Valid Acc: 0.7528
Epoch [07/20], Train Loss: 1.0703, Train Acc: 0.7655, Valid Loss: 1.0295, Valid Acc: 0.7688
Epoch [08/20], Train Loss: 0.9813, Train Acc: 0.7788, Valid Loss: 0.9496, Valid Acc: 0.7840
Epoch [09/20], Train Loss: 0.9067, Train Acc: 0.7912, Valid Loss: 0.8824, Valid Acc: 0.8002
Epoch [10/20], Train Loss: 0.8426, Train Acc: 0.8105, Valid Loss: 0.8250, Valid Acc: 0.8164
Epoch [11/20], Train Loss: 0.7873, Train Acc: 0.8263, Valid Loss: 0.7744, Valid 

The model takes too long to converge with a smaller learning rate. I will keep the learning rate at 0.001.
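
Each run above re-creates ```Model()``` without resetting the random seed, so the runs start from different initial weights. A minimal sketch of a more controlled sweep is shown below; `make_model` and `make_runs` are hypothetical helpers, and the seed value is an arbitrary choice.

```python
import torch
import torch.nn as nn

def make_model():
    """Same layout as model_5: three Tanh hidden layers (32, 32, 16)."""
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 32), nn.Tanh(),
        nn.Linear(32, 32), nn.Tanh(),
        nn.Linear(32, 16), nn.Tanh(),
        nn.Linear(16, 10),
    )

def make_runs(learning_rates, seed=1):
    """Build one (model, optimizer) pair per learning rate, all with identical initial weights."""
    runs = {}
    for lr in learning_rates:
        torch.manual_seed(seed)  # reset the RNG so every model starts the same
        model = make_model()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        runs[lr] = (model, optimizer)
    return runs

runs = make_runs([0.01, 0.001, 0.0001])
```

Each pair could then be passed to the existing training function, e.g. `train_and_validate(model, init_train_dl, valid_dl, optimizer, loss_fn)`, so that the learning rate is the only thing that changes between the models. The shuffling order of the dataloader would still vary from run to run.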

**Seventh variation: test the effect of different batch sizes on ```model_5```**

1. batch_size = 2

In [56]:
# create dataloaders
batch_size = 2

init_train_dl = DataLoader(init_train, batch_size, shuffle=True)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=True)

In [57]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.Tanh(), 
            nn.Linear(32, 32), 
            nn.Tanh(), 
            nn.Linear(32, 16), 
            nn.Tanh(), 
            nn.Linear(16, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model_5 = Model()
model_5

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): Tanh()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): Tanh()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): Tanh()
    (6): Linear(in_features=16, out_features=10, bias=True)
  )
)

In [58]:
# evaluate the model
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(model_5.parameters(), lr=learning_rate)

train_and_validate(model_5, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 0.5772, Train Acc: 0.8459, Valid Loss: 0.3363, Valid Acc: 0.9014
Epoch [02/20], Train Loss: 0.2627, Train Acc: 0.9254, Valid Loss: 0.2803, Valid Acc: 0.9202
Epoch [03/20], Train Loss: 0.2017, Train Acc: 0.9427, Valid Loss: 0.2980, Valid Acc: 0.9150
Epoch [04/20], Train Loss: 0.1651, Train Acc: 0.9532, Valid Loss: 0.2500, Valid Acc: 0.9298
Epoch [05/20], Train Loss: 0.1361, Train Acc: 0.9601, Valid Loss: 0.2504, Valid Acc: 0.9366
Epoch [06/20], Train Loss: 0.1148, Train Acc: 0.9657, Valid Loss: 0.2456, Valid Acc: 0.9344
Epoch [07/20], Train Loss: 0.0980, Train Acc: 0.9708, Valid Loss: 0.2607, Valid Acc: 0.9336
Epoch [08/20], Train Loss: 0.0902, Train Acc: 0.9741, Valid Loss: 0.2544, Valid Acc: 0.9354
Epoch [09/20], Train Loss: 0.0861, Train Acc: 0.9731, Valid Loss: 0.2574, Valid Acc: 0.9354
Epoch [10/20], Train Loss: 0.0720, Train Acc: 0.9784, Valid Loss: 0.2395, Valid Acc: 0.9410
Epoch [11/20], Train Loss: 0.0639, Train Acc: 0.9821, Valid Loss: 0.2611, Valid 

The performance of the model is similar to that with ```batch_size=64```, but training takes significantly longer.

2. batch_size = 128

In [59]:
# create dataloaders
batch_size = 128

init_train_dl = DataLoader(init_train, batch_size, shuffle=True)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=True)

In [60]:
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.Tanh(), 
            nn.Linear(32, 32), 
            nn.Tanh(), 
            nn.Linear(32, 16), 
            nn.Tanh(), 
            nn.Linear(16, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model_5 = Model()
model_5

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): Tanh()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): Tanh()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): Tanh()
    (6): Linear(in_features=16, out_features=10, bias=True)
  )
)

In [61]:
# evaluate the model
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(model_5.parameters(), lr=learning_rate)

train_and_validate(model_5, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 1.7552, Train Acc: 0.5482, Valid Loss: 1.2614, Valid Acc: 0.7464
Epoch [02/20], Train Loss: 1.0009, Train Acc: 0.7993, Valid Loss: 0.7960, Valid Acc: 0.8392
Epoch [03/20], Train Loss: 0.6670, Train Acc: 0.8635, Valid Loss: 0.5791, Valid Acc: 0.8748
Epoch [04/20], Train Loss: 0.4941, Train Acc: 0.8953, Valid Loss: 0.4628, Valid Acc: 0.8974
Epoch [05/20], Train Loss: 0.3957, Train Acc: 0.9129, Valid Loss: 0.3994, Valid Acc: 0.9040
Epoch [06/20], Train Loss: 0.3300, Train Acc: 0.9243, Valid Loss: 0.3581, Valid Acc: 0.9108
Epoch [07/20], Train Loss: 0.2867, Train Acc: 0.9331, Valid Loss: 0.3299, Valid Acc: 0.9164
Epoch [08/20], Train Loss: 0.2484, Train Acc: 0.9403, Valid Loss: 0.3101, Valid Acc: 0.9170
Epoch [09/20], Train Loss: 0.2199, Train Acc: 0.9476, Valid Loss: 0.2902, Valid Acc: 0.9224
Epoch [10/20], Train Loss: 0.1964, Train Acc: 0.9525, Valid Loss: 0.2780, Valid Acc: 0.9226
Epoch [11/20], Train Loss: 0.1728, Train Acc: 0.9588, Valid Loss: 0.2724, Valid 

The performance is similar. The validation accuracy is slightly worse than when ```batch_size=64```. I will keep ```batch_size=64```.
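
The batch-size comparison above was run by hand, re-creating the loaders and the model for each value. A minimal sketch of the same sweep written as a loop is shown here. It assumes random stand-in data instead of MNIST and a hypothetical `run_epoch` helper (not the notebook's `train_and_validate`), so the printed losses only illustrate the mechanics and say nothing about the real results.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(1)

# Stand-in data (assumption): 256 random "images" with random labels, not MNIST.
fake_ds = TensorDataset(torch.rand(256, 1, 28, 28), torch.randint(0, 10, (256,)))

def make_model():
    # Same layer sizes as the Tanh network used in this section.
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, 32), nn.Tanh(),
        nn.Linear(32, 32), nn.Tanh(),
        nn.Linear(32, 16), nn.Tanh(),
        nn.Linear(16, 10),
    )

def run_epoch(model, loader, optimizer, loss_fn):
    # Hypothetical helper: one pass over the loader, returns the mean loss per sample.
    model.train()
    total_loss, n = 0.0, 0
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * y.size(0)
        n += y.size(0)
    return total_loss / n

results = {}
for batch_size in (64, 128):
    model = make_model()  # fresh weights so every run starts from scratch
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    loader = DataLoader(fake_ds, batch_size, shuffle=True)
    results[batch_size] = run_epoch(model, loader, optimizer, nn.CrossEntropyLoss())

for batch_size, loss in results.items():
    print(f"batch_size={batch_size}: mean train loss after 1 epoch = {loss:.4f}")
```

Building a fresh model and optimizer inside the loop keeps the runs independent, so a difference between batch sizes is not confounded by weights carried over from the previous run.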

### The best model:

In [62]:
# create dataloaders
batch_size = 64

init_train_dl = DataLoader(init_train, batch_size, shuffle=True)
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size, shuffle=True)
test_dl = DataLoader(test_ds, batch_size, shuffle=True)

In [63]:
# construct the best performing model
class Model(nn.Module):
    def __init__(self):
        super(Model, self).__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 32), 
            nn.Tanh(), 
            nn.Linear(32, 32), 
            nn.Tanh(), 
            nn.Linear(32, 16), 
            nn.Tanh(), 
            nn.Linear(16, 10)
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

best_model = Model()
best_model

Model(
  (flatten): Flatten(start_dim=1, end_dim=-1)
  (linear_relu_stack): Sequential(
    (0): Linear(in_features=784, out_features=32, bias=True)
    (1): Tanh()
    (2): Linear(in_features=32, out_features=32, bias=True)
    (3): Tanh()
    (4): Linear(in_features=32, out_features=16, bias=True)
    (5): Tanh()
    (6): Linear(in_features=16, out_features=10, bias=True)
  )
)

In [64]:
# train on the initial 10,000-image training set and validate after each epoch
loss_fn = nn.CrossEntropyLoss()
learning_rate = 0.001
optimizer = torch.optim.Adam(best_model.parameters(), lr=learning_rate)

train_and_validate(best_model, init_train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 1.4223, Train Acc: 0.6529, Valid Loss: 0.7984, Valid Acc: 0.8368
Epoch [02/20], Train Loss: 0.6017, Train Acc: 0.8671, Valid Loss: 0.4747, Valid Acc: 0.8888
Epoch [03/20], Train Loss: 0.3986, Train Acc: 0.9055, Valid Loss: 0.3701, Valid Acc: 0.9048
Epoch [04/20], Train Loss: 0.3071, Train Acc: 0.9244, Valid Loss: 0.3147, Valid Acc: 0.9160
Epoch [05/20], Train Loss: 0.2501, Train Acc: 0.9383, Valid Loss: 0.2854, Valid Acc: 0.9202
Epoch [06/20], Train Loss: 0.2102, Train Acc: 0.9470, Valid Loss: 0.2575, Valid Acc: 0.9286
Epoch [07/20], Train Loss: 0.1815, Train Acc: 0.9532, Valid Loss: 0.2488, Valid Acc: 0.9292
Epoch [08/20], Train Loss: 0.1577, Train Acc: 0.9616, Valid Loss: 0.2348, Valid Acc: 0.9318
Epoch [09/20], Train Loss: 0.1349, Train Acc: 0.9672, Valid Loss: 0.2388, Valid Acc: 0.9302
Epoch [10/20], Train Loss: 0.1213, Train Acc: 0.9707, Valid Loss: 0.2355, Valid Acc: 0.9308
Epoch [11/20], Train Loss: 0.1030, Train Acc: 0.9762, Valid Loss: 0.2354, Valid 

### Train the best model on the training dataset and evaluate:

In [65]:
# continue training best_model (already fit on init_train) on the 45,000-image
# training split, validating after each epoch
train_and_validate(best_model, train_dl, valid_dl, optimizer, loss_fn)

Epoch [01/20], Train Loss: 0.1923, Train Acc: 0.9448, Valid Loss: 0.1694, Valid Acc: 0.9538
Epoch [02/20], Train Loss: 0.1320, Train Acc: 0.9617, Valid Loss: 0.1595, Valid Acc: 0.9556
Epoch [03/20], Train Loss: 0.1117, Train Acc: 0.9671, Valid Loss: 0.1513, Valid Acc: 0.9548
Epoch [04/20], Train Loss: 0.0963, Train Acc: 0.9721, Valid Loss: 0.1451, Valid Acc: 0.9546
Epoch [05/20], Train Loss: 0.0852, Train Acc: 0.9749, Valid Loss: 0.1435, Valid Acc: 0.9580
Epoch [06/20], Train Loss: 0.0762, Train Acc: 0.9777, Valid Loss: 0.1447, Valid Acc: 0.9566
Epoch [07/20], Train Loss: 0.0700, Train Acc: 0.9798, Valid Loss: 0.1364, Valid Acc: 0.9606
Epoch [08/20], Train Loss: 0.0623, Train Acc: 0.9825, Valid Loss: 0.1487, Valid Acc: 0.9570
Epoch [09/20], Train Loss: 0.0602, Train Acc: 0.9820, Valid Loss: 0.1444, Valid Acc: 0.9576
Epoch [10/20], Train Loss: 0.0526, Train Acc: 0.9847, Valid Loss: 0.1369, Valid Acc: 0.9638
Epoch [11/20], Train Loss: 0.0485, Train Acc: 0.9860, Valid Loss: 0.1436, Valid 

The training accuracy reaches 99.24%.
The validation accuracy reaches 96.32%.

In [76]:
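# evaluate the model on the test set in a single forward pass
# (test_ds.data holds raw uint8 pixels, so scale to [0, 1] to match ToTensor)
with torch.no_grad():
    pred = best_model(test_ds.data / 255.)
is_correct = (torch.argmax(pred, dim=1) == test_ds.targets).float()
print(f'Test accuracy: {is_correct.mean():.4f}')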

Test accuracy: 0.9632


In [74]:
# evaluate the model on the test set again, batch by batch through the DataLoader
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in test_dl:
        outputs = best_model(inputs)
        predicted = outputs.argmax(dim=1)  # index of the largest logit per sample
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy on the test set: {:.2f}%'.format(100 * correct / total))

Accuracy on the test set: 96.32%
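
The two test-set cells above compute the same quantity in two ways. A sketch of the batch-wise version wrapped in a reusable function is given here. The name `accuracy` is hypothetical, and the check at the end uses a tiny stand-in model and dataset (an assumption for illustration) instead of `best_model` and `test_dl`.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(model, loader):
    # Hypothetical helper: fraction of correctly classified samples in `loader`.
    model.eval()  # matters only if the model has BatchNorm or Dropout layers
    correct, total = 0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for inputs, labels in loader:
            predicted = model(inputs).argmax(dim=1)
            correct += (predicted == labels).sum().item()
            total += labels.size(0)
    return correct / total

# Stand-in check (assumption): an identity "model" over one-hot inputs,
# so the predicted class is the position of the 1 in each row.
inputs = torch.eye(4)
labels = torch.tensor([0, 1, 2, 0])  # the last label is deliberately wrong
demo_dl = DataLoader(TensorDataset(inputs, labels), batch_size=2)
demo_acc = accuracy(nn.Identity(), demo_dl)
print(f"Accuracy: {demo_acc:.2f}")
```

With a helper like this, the same call could be applied to the training, validation, and test loaders, which would make the three reported accuracies directly comparable.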


### Summary

There are many ways to modify the construction and training of a neural network, and the best solution is sometimes not the expected one. I thought that adding a softmax() function would improve performance since this is a multi-class classification problem, but this was not the case. I also did not expect the Tanh() activation function to perform better than ReLU(), and again I was wrong. It is important to try many different options when developing neural networks, since the best-performing model may not have the construction you expect.
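
One likely reason the softmax() layer did not help: `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, so a softmax at the end of the model is normalized a second time inside the loss. The sketch below checks this on made-up logits (illustrative values, not outputs of any model in this notebook).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Made-up logits for 2 samples and 3 classes (illustrative values only).
logits = torch.tensor([[4.0, 1.0, -2.0],
                       [0.5, 3.0, 0.0]])
targets = torch.tensor([0, 1])

loss_fn = nn.CrossEntropyLoss()

# CrossEntropyLoss on raw logits equals NLL loss on log-softmax outputs.
loss_logits = loss_fn(logits, targets)
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), targets)

# Feeding softmax outputs into CrossEntropyLoss applies softmax a second time.
loss_double = loss_fn(F.softmax(logits, dim=1), targets)

print(f"logits -> CrossEntropyLoss:  {loss_logits.item():.4f}")
print(f"log_softmax -> nll_loss:     {loss_manual.item():.4f}")
print(f"softmax -> CrossEntropyLoss: {loss_double.item():.4f}")
```

The first two values agree, while the third is larger even though the predictions are the same: softmax outputs lie in [0, 1], so after the second normalization the correct class can never receive a probability close to 1, which limits how small the loss can get.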