## 3 Gradient Descent

## By Hans Martin Aannestad

1. Load and preprocess the CIFAR-10 dataset. Split it into 3 datasets: training, validation and
test. Take a subset of these datasets by keeping only 2 labels: bird and plane.

In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
from torch.utils.data import random_split
from datetime import datetime

torch.manual_seed(265)

batch_size =  256
n_epoch =  30
loss_fn = nn.CrossEntropyLoss()
seed =  265


device = (torch.device('cuda') if torch.cuda.is_available()
          else torch.device('cpu'))
print(f"Training on device {device}.")

Training on device cpu.


In [2]:
def load_cifar(train_val_split=0.9, data_path='../data/', preprocessor=None):
    
    # Define preprocessor if not already given
    if preprocessor is None:
        preprocessor = transforms.Compose([
            transforms.ToTensor(),
            transforms.Normalize((0.4915, 0.4823, 0.4468),
                                (0.2470, 0.2435, 0.2616))
        ])
    
    # load datasets
    data_train_val = datasets.CIFAR10(
        data_path,       
        train=True,      
        download=True,
        transform=preprocessor)

    data_test = datasets.CIFAR10(
        data_path, 
        train=False,
        download=True,
        transform=preprocessor)

    # train/validation split
    n_train = int(len(data_train_val)*train_val_split)
    n_val =  len(data_train_val) - n_train

    data_train, data_val = random_split(
        data_train_val, 
        [n_train, n_val],
        generator=torch.Generator().manual_seed(123)
    )

    print("Size of the train dataset:        ", len(data_train))
    print("Size of the validation dataset:   ", len(data_val))
    print("Size of the test dataset:         ", len(data_test))
    
    return (data_train, data_val, data_test)

cifar10_train, cifar10_val, cifar10_test = load_cifar()

# Now define a lighter version of CIFAR10: cifar
label_map = {0: 0, 2: 1}
class_names = ['airplane', 'bird']

# For each dataset, keep only airplanes and birds
cifar2_train = [(img, label_map[label]) for img, label in cifar10_train if label in [0, 2]]
cifar2_val = [(img, label_map[label]) for img, label in cifar10_val if label in [0, 2]]
cifar2_test = [(img, label_map[label]) for img, label in cifar10_test if label in [0, 2]]

print('Size of the training dataset: ', len(cifar2_train))
print('Size of the validation dataset: ', len(cifar2_val))
print('Size of the test dataset: ', len(cifar2_test))

Files already downloaded and verified
Files already downloaded and verified
Size of the train dataset:         45000
Size of the validation dataset:    5000
Size of the test dataset:          10000
Size of the training dataset:  9017
Size of the validation dataset:  983
Size of the test dataset:  2000


2. Write a MyMLP class that implements a MLP in PyTorch (so only fully connected layers) such that:

(a) The input dimension is 3072 (= 32*32*3) and the output dimension is 2 (for the 2
classes).

(b) The hidden layers have respectively 512, 128 and 32 hidden units.

(c) All activation functions are ReLU. The last layer has no activation function since the
cross-entropy loss already includes a softmax activation function.


In [3]:
class MyMLP(nn.Module):
    def __init__(self):
        super().__init__()  # to inherit the '__init__' method from the 'nn.Module' class
        # Add whatever you want here (e.g layers and activation functions)
        # The order and names don't matter here but it is easier to understand
        # if you go for Layer1, fun1, layer2, fun2, etc
        # Some conventions:
  
        # - fc for fully connected

        self.flat = nn.Flatten()
        # 32*32*3: determined by our dataset: 32x32 RGB images
        self.fc0 = nn.Linear(32*32*3, 512)
        self.act0 = nn.ReLU()
        self.fc1 = nn.Linear(512,128)
        self.act1 = nn.ReLU()
        self.fc2 = nn.Linear(128, 32)
        self.act2 = nn.ReLU()
        # 2: determined by our number of classes (birds and planes)
        self.fc3 = nn.Linear(32, 2) # no activation since cross-entropy loss
        
    # Remember, we saw earlier that `forward` defines the 
    # computation performed at every call (the forward pass) and that it
    # should be overridden by all subclasses.
    def forward(self, x):
        # Now the order matters! 
        out = self.flat(x)
        out = self.act0(self.fc0(out))
        out = self.act1(self.fc1(out))
        out = self.act2(self.fc2(out))
        out = self.fc3(out)
        return out

3. Write a train(n epochs, optimizer, model, loss fn, train loader) function that trains
model for n epochs epochs given an optimizer optimizer, a loss function loss fn and a dataloader train loader.

In [4]:
def train(n_epochs, optimizer, model, loss_fn, train_loader):
    
    n_batch = len(train_loader)
    losses_train = []
    model.train()
    optimizer.zero_grad(set_to_none=True)
    
    for epoch in range(1, n_epochs + 1):
        
        loss_train = 0.0
        for imgs, labels in train_loader:

            imgs = imgs.to(device=device) 
            labels = labels.to(device=device)

            outputs = model(imgs)
            
            loss = loss_fn(outputs, labels)
            loss.backward()
            
            optimizer.step()
            optimizer.zero_grad()

            loss_train += loss.item()
            
        losses_train.append(loss_train / n_batch)

        if epoch == 1 or epoch % 5 == 0:
            print('{}  |  Epoch {}  |  Training loss {:.3f}'.format(
                datetime.now().time(), epoch, loss_train / n_batch))


4. Write a similar function train manual update that has no optimizer parameter, but a learning rate lr parameter instead and that manually updates each trainable parameter of model
using equation (3). Do not forget to zero out all gradients after each iteration.

In [5]:
def train_manual(n_epochs, model, loss_fn, train_loader, lr=1e-2):
    model.train()
    
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            imgs = imgs.to(device=device) 
            labels = labels.to(device=device)

            outputs = model(imgs)
            loss = loss_fn(outputs, labels)
            loss.backward()

            with torch.no_grad():
                for i, p in enumerate(model.parameters()):
                    if p.grad is not None:
                        grad = p.grad.detach()
                        p.data = p.data - lr*grad
                        p.grad.zero_()     #  zero out all gradients                   

            loss_train += loss.item()

        if epoch == 1 or epoch % 5 == 0:
            print('{}  |  Epoch {}  |  Training loss {:.3f}'.format(
                datetime.now().time(), epoch,
                loss_train / len(train_loader)))

5. Train 2 instances of MyMLP, one using train and the other using train manual update (use
the same parameter values for both models). Compare their respective training losses. To get
exactly the same results with both functions, see section 3.3.

In [6]:
def compute_accuracy(model, loader):
    model.eval()
    correct = 0
    total = 0

    with torch.no_grad():
        for imgs, labels in loader:
            imgs = imgs.to(device=device)
            labels = labels.to(device=device)

            outputs = model(imgs)
            _, predicted = torch.max(outputs, dim=1)
            total += labels.shape[0]
            correct += int((predicted == labels).sum())

    acc =  correct / total
    print("Accuracy: {:.2f}".format(acc))
    return acc

In [7]:
# Now we can instantiate a model with the architecture defined in the cell above
model = MyMLP()

# Our model can be inspected exactly as we inspected our model in the previous tutorial (which was then defined using nn.Sequential) 
numel_list = [p.numel() for p in model.parameters()]
print("Total number of parameters: ", sum(numel_list))
print("Number of parameter per layer: ", numel_list)

img, _ = cifar2_train[0]
# Again we can feed a input and get the output exactly the same way as before
output_tensor = model(img.unsqueeze(0))
print("Output: \n", output_tensor)

Total number of parameters:  1643234
Number of parameter per layer:  [1572864, 512, 65536, 128, 4096, 32, 64, 2]
Output: 
 tensor([[-0.0430,  0.0806]], grad_fn=<AddmmBackward0>)


In [8]:
train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=batch_size, shuffle=True)
model1 = MyMLP().to(device=device) 
optimizer = optim.SGD(model1.parameters(), lr=1e-2)

train(
    n_epochs = n_epoch,
    optimizer = optimizer,
    model = model1,
    loss_fn = loss_fn,
    train_loader = train_loader,
)

20:20:23.377197  |  Epoch 1  |  Training loss 0.681
20:20:36.816051  |  Epoch 5  |  Training loss 0.553
20:20:51.805633  |  Epoch 10  |  Training loss 0.472
20:21:07.123797  |  Epoch 15  |  Training loss 0.425
20:21:22.501640  |  Epoch 20  |  Training loss 0.388
20:21:37.913883  |  Epoch 25  |  Training loss 0.355
20:21:53.114117  |  Epoch 30  |  Training loss 0.326


In [9]:
train_loader = torch.utils.data.DataLoader(cifar2_train, batch_size=batch_size, shuffle=True)
model2 = MyMLP().to(device=device) 
lr = 1e-2

train_manual(
    n_epochs = n_epoch,
    model = model2,
    loss_fn = loss_fn,
    train_loader = train_loader,
    lr = lr,
)

20:21:56.343620  |  Epoch 1  |  Training loss 0.675
20:22:09.038886  |  Epoch 5  |  Training loss 0.532
20:22:24.727741  |  Epoch 10  |  Training loss 0.456
20:22:40.234515  |  Epoch 15  |  Training loss 0.413
20:22:55.796016  |  Epoch 20  |  Training loss 0.375
20:23:11.315099  |  Epoch 25  |  Training loss 0.345
20:23:26.881493  |  Epoch 30  |  Training loss 0.318


In [10]:
def train_manual(n_epochs, model, loss_fn, train_loader, lr=1e-2):
    model.train()
    
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            imgs = imgs.to(device=device) 
            labels = labels.to(device=device)

            outputs = model(imgs)
            loss = loss_fn(outputs, labels)
            loss.backward()

            with torch.no_grad():
                for i, p in enumerate(model.parameters()):
                    if p.grad is not None:
                        grad = p.grad.detach()
                        p.data = p.data - lr*grad
                        p.grad.zero_()     #  zero out all gradients                   

            loss_train += loss.item()

        if epoch == 1 or epoch % 5 == 0:
            print('{}  |  Epoch {}  |  Training loss {:.3f}'.format(
                datetime.now().time(), epoch,
                loss_train / len(train_loader)))

6. Modify train manual update by adding a L2 regularization term in your manual parameter
update. Add an additional weight decay parameter to train manual update. Compare
again train and train manual update results with 0 < weight decay < 1.

In [11]:
def train_manual_w_decay(n_epochs, model, loss_fn, train_loader, lr=1e-2, w_decay=0.):
    model.train()
 
    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            imgs = imgs.to(device=device) 
            labels = labels.to(device=device)

            outputs = model(imgs)
            loss = loss_fn(outputs, labels)
            loss.backward()

            with torch.no_grad():
                for i, p in enumerate(model.parameters()):
                    if p.grad is not None:
                        grad = p.grad.detach()
                                            
                        grad = grad + w_decay*p.data # L2 regularization term

                        p.data = p.data - lr*grad
                        p.grad.zero_()     #  zero out all gradients                   

            loss_train += loss.item()

        if epoch == 1 or epoch % 5 == 0:
            print('{}  |  Epoch {}  |  Training loss {:.3f}'.format(
                datetime.now().time(), epoch,
                loss_train / len(train_loader)))

In [12]:
model3 = MyMLP().to(device=device) 
lr = 1e-2
w_decay = 0.02

train_manual_w_decay(
    n_epochs = n_epoch,
    model = model3,
    loss_fn = loss_fn,
    w_decay = w_decay,
    train_loader = train_loader,
    lr = lr,
)

20:23:30.126481  |  Epoch 1  |  Training loss 0.678
20:23:43.132679  |  Epoch 5  |  Training loss 0.561
20:23:59.186072  |  Epoch 10  |  Training loss 0.480
20:24:15.139791  |  Epoch 15  |  Training loss 0.439
20:24:31.131823  |  Epoch 20  |  Training loss 0.411
20:24:47.200158  |  Epoch 25  |  Training loss 0.382
20:25:03.849528  |  Epoch 30  |  Training loss 0.359


7. Modify train manual update by adding a momentum term in your parameter update. Add
an additional momentum parameter to train manual update. Check again the correctness of
the new update rule by comparing it to train function (with 0 < momentum < 1).

In [13]:
def train_manual_w_decay_mom(n_epochs, model, loss_fn, train_loader, lr=1e-2, w_decay=0.,c_mom=0.):
    model.train()
    
    mom_lst = []
    curr_mom_lst = []

    for epoch in range(1, n_epochs + 1):
        loss_train = 0.0
        for imgs, labels in train_loader:
            imgs = imgs.to(device=device) 
            labels = labels.to(device=device)

            outputs = model(imgs)
            loss = loss_fn(outputs, labels)
            loss.backward()

            with torch.no_grad():
                for i, p in enumerate(model.parameters()):
                    if p.grad is not None:
                        grad = p.grad.detach()
                                            
                        grad = grad + w_decay*p.data # L2 regularization term

                        if mom_lst == []:  # first time
                          if curr_mom_lst == []:
                            curr_mom_lst = [grad]
                          else:
                            curr_mom_lst.append(grad)
                        else:
                            grad = grad + mom_lst[i]*c_mom # build up some momentum!
                            mom_lst[i] = grad

                        p.data = p.data - lr*grad
                        p.grad.zero_()     #  zero out all gradients
                   
            loss_train += loss.item()

        if epoch == 1 or epoch % 5 == 0:
            print('{}  |  Epoch {}  |  Training loss {:.3f}'.format(
                datetime.now().time(), epoch,
                loss_train / len(train_loader)))

In [14]:
model4 = MyMLP().to(device=device) 
lr = 1e-2
w_decay = 0.02
c_mom = 0.75

train_manual_w_decay_mom(
    n_epochs = n_epoch,
    model = model4,
    loss_fn = loss_fn,
    w_decay = w_decay,
    c_mom = c_mom,
    train_loader = train_loader,
    lr = lr,
)


20:25:07.382203  |  Epoch 1  |  Training loss 0.684
20:25:20.413573  |  Epoch 5  |  Training loss 0.585
20:25:36.474543  |  Epoch 10  |  Training loss 0.496
20:25:52.659864  |  Epoch 15  |  Training loss 0.452
20:26:10.106236  |  Epoch 20  |  Training loss 0.420
20:26:31.571426  |  Epoch 25  |  Training loss 0.393
20:26:48.716245  |  Epoch 30  |  Training loss 0.368


In [15]:
def model_val(model, train_loader, val_loader):
    model.eval()
    accs = {}
    for name, loader in [("train", train_loader), ("val", val_loader)]:
        correct = 0
        total = 0
        with torch.no_grad():
            for imgs, labels in loader:
                imgs = imgs.to(device=device)
                labels = labels.to(device=device)

                outputs = model(imgs)
                _, predicted = torch.max(outputs, dim=1)
                total += labels.shape[0]
                correct += int((predicted == labels).sum())

        print("Acc {}: {:.3f}".format(name , correct / total))
        accs[name] = correct / total
    return accs

In [16]:
train_load = torch.utils.data.DataLoader(cifar2_train, batch_size=batch_size, shuffle=False)
val_load = torch.utils.data.DataLoader(cifar2_val, batch_size=batch_size, shuffle=False)

print("\n Torch auto mode: model1")
model_val(model1, train_load, val_load)
print("\n Manual train: model2")
model_val(model2, train_load, val_load)
print("\n Manual train with weight decay: model3")
model_val(model3, train_load, val_load)
print("\n Manual train with weight decay + momentum: model4")
model_val(model4, train_load, val_load)


 Torch auto mode: model1
Acc train: 0.880
Acc val: 0.842

 Manual train: model2
Acc train: 0.888
Acc val: 0.845

 Manual train with weight decay: model3
Acc train: 0.865
Acc val: 0.829

 Manual train with weight decay + momentum: model4
Acc train: 0.858
Acc val: 0.813


{'train': 0.8576023067539092, 'val': 0.8128179043743642}

8. Train different instances (at least 4) of the MyMLP model with different learning rate, momentum
and weight decay values . You can choose the same values as in the
gradient descent output.txt file.

I run out of memory with too many epochs in Google colab, 15 works..

In [23]:
model41 = MyMLP().to(device=device) 
lr = 0.01
w_decay = 0
c_mom = 0

train_manual_w_decay_mom(
    n_epochs = 15,
    model = model41,
    loss_fn = loss_fn,
    w_decay = w_decay,
    c_mom = c_mom,
    train_loader = train_loader,
    lr = lr,
)

20:39:48.950942  |  Epoch 1  |  Training loss 0.670
20:40:01.802657  |  Epoch 5  |  Training loss 0.564
20:40:17.945280  |  Epoch 10  |  Training loss 0.494
20:40:33.657585  |  Epoch 15  |  Training loss 0.446


In [24]:
model42 = MyMLP().to(device=device) 
lr = 0.01
w_decay = 0.01
c_mom = 0

train_manual_w_decay_mom(
    n_epochs = 15,
    model = model42,
    loss_fn = loss_fn,
    w_decay = w_decay,
    c_mom = c_mom,
    train_loader = train_loader,
    lr = lr,
)

20:40:37.057725  |  Epoch 1  |  Training loss 0.673
20:40:49.713985  |  Epoch 5  |  Training loss 0.555
20:41:05.600759  |  Epoch 10  |  Training loss 0.481
20:41:21.336818  |  Epoch 15  |  Training loss 0.440


In [25]:
model43 = MyMLP().to(device=device) 
lr = 0.01
w_decay = 0.001
c_mom = 0.9

train_manual_w_decay_mom(
    n_epochs = 15,
    model = model43,
    loss_fn = loss_fn,
    w_decay = w_decay,
    c_mom = c_mom,
    train_loader = train_loader,
    lr = lr,
)

20:41:24.686031  |  Epoch 1  |  Training loss 0.686
20:41:37.314978  |  Epoch 5  |  Training loss 0.606
20:41:53.074532  |  Epoch 10  |  Training loss 0.504
20:42:08.867373  |  Epoch 15  |  Training loss 0.454


In [26]:
model44 = MyMLP().to(device=device) 
lr = 0.01
w_decay = 0.01
c_mom = 0.8

train_manual_w_decay_mom(
    n_epochs = 15,
    model = model44,
    loss_fn = loss_fn,
    w_decay = w_decay,
    c_mom = c_mom,
    train_loader = train_loader,
    lr = lr,
)

20:42:12.225143  |  Epoch 1  |  Training loss 0.679
20:42:24.915765  |  Epoch 5  |  Training loss 0.563
20:42:40.854527  |  Epoch 10  |  Training loss 0.477
20:42:56.945045  |  Epoch 15  |  Training loss 0.438


In [27]:
print("\n Manual train with weight decay + momentum: model41")
model_val(model41, train_load, val_load)
print("\n Manual train with weight decay + momentum: model42")
model_val(model42, train_load, val_load)
print("\n Manual train with weight decay + momentum: model43")
model_val(model43, train_load, val_load)
print("\n Manual train with weight decay + momentum: model44")
model_val(model44, train_load, val_load)


 Manual train with weight decay + momentum: model41
Acc train: 0.819
Acc val: 0.811

 Manual train with weight decay + momentum: model42
Acc train: 0.818
Acc val: 0.804

 Manual train with weight decay + momentum: model43
Acc train: 0.811
Acc val: 0.805

 Manual train with weight decay + momentum: model44
Acc train: 0.819
Acc val: 0.811


{'train': 0.8192303426860374, 'val': 0.8107833163784334}

• Wrap your computations inside a ”with torch.no grad():” context.
• Remember that trainable parameters can be accessed using ”for p in model.parameters()”
or ”for name, p in model.named parameters()”.
• Remember that parameter values can then be accessed using ”p.data” and their gradients
using ”p.grad”.
• Gradient descent rules with L2-regularization and momentum can be found in the documentation of torch.optim.SGD.


In [29]:
def model_test(model, test_loader):
    model.eval()
    accs = {}
    for name, loader in [("TEST", test_loader)]:
        correct = 0
        total = 0
        with torch.no_grad():
            for imgs, labels in loader:
                imgs = imgs.to(device=device)
                labels = labels.to(device=device)

                outputs = model(imgs)
                _, predicted = torch.max(outputs, dim=1)
                total += labels.shape[0]
                correct += int((predicted == labels).sum())

        print("Test Acc {}: {:.3f}".format(name , correct / total))
        accs[name] = correct / total
    return accs

In [31]:
# Pick model44 as best
'''
 Manual train with weight decay + momentum: model44
Acc train: 0.819
Acc val: 0.811
{'train': 0.8192303426860374, 'val': 0.8107833163784334}'''

test_load = torch.utils.data.DataLoader(cifar2_test, batch_size=batch_size, shuffle=False)

cifar2_test

print("\n Manual train with weight decay + momentum: model44")
model_test(model44, test_load)


 Manual train with weight decay + momentum: model44
Test Acc TEST: 0.822


{'TEST': 0.8215}