# Optimization and Regularization

In the exercises below you will implement your own loss function (cross-entropy loss) and optimizer (Adam and AdamW), and add regularization both to the loss function and directly to the optimizer.
You will train and evaluate a LeNet architecture on the CIFAR-10 image classification task with your own loss function and optimizer, and compare the results with the official PyTorch implementations.

### Import packages

In [2]:
!pip3 install torch torchvision


In [4]:
from torchvision import datasets, transforms
from torch.utils.data import DataLoader
import torch
from torch import nn
import torch.nn.functional as F
from torch.optim import Optimizer
import numpy as np

# %% Global Constant
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(f'Using device: {device}')

Using device: cpu


### Define the model architecture and training loop

In [None]:
# %% Network Architecture
class LeNet5(nn.Module):
    def __init__(self, n_classes=10):
        super(LeNet5, self).__init__()

        self.c1 = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5) #rgb(3 channels), 6 feature maps
        self.c2 = nn.Conv2d(in_channels=6, out_channels=16, kernel_size=5) # 6 & 16 feature maps
        self.c3 = nn.Conv2d(in_channels=16, out_channels=120, kernel_size=5) # 16 & 120 feature maps
        self.s1 = nn.AvgPool2d(kernel_size=2) #reduces spatial dimensions by taking the average over 2x2 regions
        self.s2 = nn.AvgPool2d(kernel_size=2) #subsampling Layer

        self.classifier = nn.Sequential( #fully connected Classifier Layers
            nn.Linear(in_features=120, out_features=84), nn.Tanh(), #linear layer maps 120 features to 84 neurons - Tanh activation introduces non-linearity
            nn.Linear(in_features=84, out_features=n_classes),) #linear layer maps 84 neurons to output classes

    def forward(self, x):
        x = self.s1(torch.tanh(self.c1(x))) #apply conv1, tanh, then subsampling1
        x = self.s2(torch.tanh(self.c2(x)))
        x = torch.tanh(self.c3(x)) #apply conv3 and tanh
        x = torch.flatten(x, 1) # flatten the tensor into a 1D vector per sample (keep batch dimension)
        output = self.classifier(x)
        probs = F.softmax(output, dim=1) #apply softmax to get class probabilities
        return output, probs
    
    def fit(self, train_data, valid_data, loss_function, optimizer, nb_epochs=100, device='cuda'):
        train_loss_per_epoch = [] # List to store training loss for each epoch
        valid_loss_per_epoch = []

        # Process each epoch
        for i in range(1, nb_epochs + 1):
            train_running_loss = 0
            valid_running_loss = 0
            sample_counter = 0

            # Perform training iteration (1 Epoch)
            self.train()
            for x_train, y_train in train_data:
                optimizer.zero_grad() # Zero the gradients

                x_train = x_train.to(device)
                y_train = y_train.to(device)

                # Forward pass
                y_hat, _ = self.forward(x_train) 
                loss = loss_function(y_hat, y_train) 
                train_running_loss += loss.item() * x_train.size(0)

                # Backward pass
                loss.backward()
                optimizer.step()
                sample_counter += x_train.shape[0]

            epoch_loss = train_running_loss / sample_counter
            train_loss_per_epoch.append(epoch_loss)

            # Evaluate on validation data (after 1 epoch)
            self.eval()
            sample_counter = 0
            with torch.no_grad():  # no gradients are needed for evaluation
                for x_val, y_val in valid_data:
                    x_val = x_val.to(device)
                    y_val = y_val.to(device)

                    # Forward pass and record loss
                    y_hat, _ = self.forward(x_val)
                    loss = loss_function(y_hat, y_val)
                    valid_running_loss += loss.item() * x_val.size(0)

                    sample_counter += x_val.shape[0]

            epoch_valid_loss = valid_running_loss / sample_counter
            valid_loss_per_epoch.append(epoch_valid_loss)

            print('Epoch {:04} -- Train loss: {:.04} -- Validation loss: {:.04}'.format(i, epoch_loss,epoch_valid_loss))

        return train_loss_per_epoch, valid_loss_per_epoch
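
The layer sizes in `LeNet5` can be checked by hand. The following is a minimal sketch in plain Python, assuming 32x32 CIFAR-10 inputs, no padding and stride 1 for the convolutions, and pooling with stride equal to the kernel size; the helper names `conv_out`, `pool_out`, `conv_params` and `linear_params` are hypothetical and not part of the notebook. It traces the spatial size through the network, which explains why the classifier starts with `in_features=120`, and counts the trainable parameters.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_out(size, kernel):
    """Output size of pooling whose stride equals its kernel size."""
    return (size - kernel) // kernel + 1


# Trace the spatial size through c1 -> s1 -> c2 -> s2 -> c3
size = 32
trace = [size]
for layer, kernel in [("conv", 5), ("pool", 2), ("conv", 5), ("pool", 2), ("conv", 5)]:
    size = conv_out(size, kernel) if layer == "conv" else pool_out(size, kernel)
    trace.append(size)

# c3 has 120 output channels, each reduced to size x size
flat_features = 120 * size * size


def conv_params(c_in, c_out, kernel):
    """Weights plus biases of a 2D convolution with a square kernel."""
    return c_in * c_out * kernel * kernel + c_out


def linear_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


total_params = (conv_params(3, 6, 5) + conv_params(6, 16, 5) + conv_params(16, 120, 5)
                + linear_params(120, 84) + linear_params(84, 10))

print(trace, flat_features, total_params)
```

The last convolution reduces each feature map to a single value, so flattening yields exactly 120 features per sample.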

### Training preparations

* Specify the training hyperparameters for this exercise.
* Define a helper function for evaluation.
* Load and normalize the dataset (CIFAR-10).

Specify the folder below to which the CIFAR-10 data should be downloaded; the download happens automatically when the training dataset is created.

In [7]:
###### data location ######
cifar10_local = '/home/your_user_name/cifar10_data'

###### Training Hyperparameters ######
batch_size = 32
learning_rate = 0.001
nb_epochs = 10

###### Evaluation Helper ######
def get_accuracy_helper(data_loader):
    # %% Make prediction on images in data_loader
    predictions_prob = np.vstack([estimator(x_test.to(device))[1].cpu().detach().numpy() for x_test, _ in data_loader])
    predictions = np.argmax(predictions_prob, axis=1)
    y_test = np.hstack([y_test for _, y_test in data_loader])
    # Compute accuracy:
    accuracy = np.count_nonzero(predictions == y_test) / len(y_test)
    return accuracy    

##### Dataset ######
# transformations applied to dataset
transform = transforms.Compose(
    [transforms.ToTensor(),  # transforms image values to range [0,1] for all channels
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])  # normalizes values to range [-1,1] 

# %% Setup datasets for train and testing on the CIFAR-10 dataset
# specify your local folder as root for the data
train_dataset = datasets.CIFAR10(root=cifar10_local, train=True, transform=transform, download=True)
test_dataset = datasets.CIFAR10(root=cifar10_local, train=False, transform=transform)

train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size)
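
The comment on `transforms.Normalize` above claims that the values end up in the range [-1, 1]. A minimal sketch of that arithmetic, assuming the per-channel rule `(value - mean) / std` with mean and standard deviation both 0.5; the function name `normalize` is hypothetical:

```python
def normalize(value, mean=0.5, std=0.5):
    """Per-channel normalization rule: (value - mean) / std."""
    return (value - mean) / std


# ToTensor() produces values in [0, 1]; check the two ends and the midpoint
lowest = normalize(0.0)
middle = normalize(0.5)
highest = normalize(1.0)
print(lowest, middle, highest)
```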



## Training of the model using Gradient Descent
Run training for the LeNet network, using vanilla (mini-batch) gradient descent as the optimizer and cross-entropy as the loss function. Report results on the train and test sets.

In [9]:
# set seed for reproducible training numbers
torch.manual_seed(0)

# %% Network training with Gradient Descent
estimator = LeNet5().to(device)

### start; your code here
# define the SGD optimizer (plain mini-batch gradient descent, i.e. no momentum)
optimizer = torch.optim.SGD(estimator.parameters(), lr=learning_rate)
# define the CE-loss function 
loss_function = nn.CrossEntropyLoss()
### end of your code;

nb_train_params = sum(p.numel() for p in estimator.parameters() if p.requires_grad)
print('Starting to train a LeNet architecture with {} parameters for CIFAR-10 dataset.'.format(nb_train_params))

# launch training
### start; your code here
# call the model's fit function
tr_loss, va_loss = estimator.fit(train_loader, test_loader, loss_function, optimizer, nb_epochs=10, device=device)
### end of your code;


# run test
train_acc = get_accuracy_helper(train_loader)
print('Accuracy on Train Set: {}'.format(train_acc))
test_acc = get_accuracy_helper(test_loader)
print('Accuracy on Test Set: {}'.format(test_acc))

Starting to train a LeNet architecture with 62006 parameters for CIFAR-10 dataset.

## Training of the model using Adam optimizer
Run training again, this time with Adam as the optimizer. Report results on the train and test sets.

In [None]:
# set seed for reproducible training numbers
torch.manual_seed(0)

# %% Network training with Adam
estimator = LeNet5().to(device)

### start; your code here
# define the Adam optimizer
optimizer = torch.optim.Adam(estimator.parameters(), lr=0.001)
# define the CE-loss function 
loss_function = nn.CrossEntropyLoss()
### end of your code;

nb_train_params = sum(p.numel() for p in estimator.parameters() if p.requires_grad)
print('Starting to train a LeNet architecture with {} parameters for CIFAR-10 dataset.'.format(nb_train_params))

# launch training
### start; your code here
# call the model's fit function
tr_loss, va_loss = estimator.fit(train_loader, test_loader, loss_function, optimizer, nb_epochs=10, device=device)
### end of your code;

# run test
train_acc = get_accuracy_helper(train_loader)
print('Accuracy on Train Set: {}'.format(train_acc))
test_acc = get_accuracy_helper(test_loader)
print('Accuracy on Test Set: {}'.format(test_acc))

# Implement your own Cross Entropy loss

We use PyTorch's `nn.Module` class to implement the CE loss.
What you need to do:

### Exercise 1: CE-loss

* See this lecture's slides (which show the *binary* CE loss formula; adjust it for the multi-class case) and https://gombru.github.io/2018/05/23/cross_entropy_loss/ for the multi-class CE loss formula you need to implement

### Exercise 2: L2 Regularization

* Add L2 regularization; see the lecture slides for the L2-regularized *binary* objective function and adjust it for the multi-class case

Note that when training the network, we apply regularization either in the loss or in the optimizer, never both at the same time. Both variants are included so that you learn to implement regularization in more than one way and get comfortable working with loss functions and optimizers.
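
For orientation, the multi-class objective targeted by the two exercises can be written as follows. The notation is an assumption and not taken from the lecture slides: $N$ is the batch size, $C$ the number of classes, $y_{i,c}$ the one-hot target, $\hat{y}_{i,c}$ the softmax output, $\epsilon$ a small constant for numerical stability, $\lambda$ the `l2` hyperparameter, and $W_l$ the weight matrices (biases excluded). The factor $\frac{1}{2N}$ is one common convention; the constant used in the lecture may differ.

```latex
\mathcal{L}_{\mathrm{CE}}
  = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C}
    y_{i,c} \, \log\left(\hat{y}_{i,c} + \epsilon\right)

\mathcal{L}
  = \mathcal{L}_{\mathrm{CE}}
  + \frac{\lambda}{2N} \sum_{l} \lVert W_l \rVert_F^2
```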


In [None]:
class MyCrossEntropyLoss(nn.Module):
    
    def __init__(self, params, l2=0):
        super().__init__()
        self.l2 = l2
        # access to parameter values and gradients
        self.para = params

    def forward(self, y_predicted, y_target):
        # to align pytorch's cross entropy function and this custom one we need to apply softmax here
        # (otherwise input to loss function would need to be adjusted in LeNet code)
        y_predicted = F.softmax(y_predicted, dim=1)
        
        ### start; your code here
        # Exercise 1: CE-loss
        # cross-entropy term
        # take log of predicted output probabilities of your samples and then multiply with target vectors
        # for stability: add a small epsilon before taking log to avoid nan values when taking log of 0
        # hint1: easier if y_target consists of one-hot vectors (i.e. y_predicted.size() == y_target.size())
        # hint2: this can be done using vectorization (i.e. no loop should be needed here)

        
        
        
        # sum CE loss over all samples
        
        # take average of the CE sum to get avg loss per sample
         
        # negate the result
        loss =
        ### end of your code; 
        
        # using L2 regularization
        if self.l2 > 0:
            # loop over all parameters (weight matrices and bias vectors)
            for p in self.para:
                # p.data contains the current parameter values. In the used network all weights are matrices, so
                # we filter biases based on that fact
                if len(p.data.size()) == 1:
                    # skip bias vectors
                    continue
                    
                ### start; your code here
                # Exercise 2: L2 Regularization
                # loss term for L2 regularization
                # calculate L2 term (squared L2 norm of weight matrix)
                
                # calculate weight of L2 term using self.l2
                
                # multiply weight and term and add to CE loss
                
                ### end of your code;

        return loss
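
One possible way to fill in the steps described in the comments above is sketched below as a standalone function, so that it can be compared directly with `F.cross_entropy`. It is a sketch under assumptions, not the reference solution: the function name `cross_entropy_sketch`, its `weights` argument, and the `l2 / (2 * batch_size)` scaling are choices made here for illustration, and the lecture slides may define the regularization constant differently.

```python
import torch
import torch.nn.functional as F


def cross_entropy_sketch(logits, targets, weights=(), l2=0.0, eps=1e-12):
    """Mean multi-class CE loss, optionally with an L2 penalty on weight matrices."""
    probs = F.softmax(logits, dim=1)
    # Turn class indices into one-hot vectors so shapes match the predictions
    one_hot = F.one_hot(targets, num_classes=logits.size(1)).to(probs.dtype)
    # Log-probability of the true class per sample, averaged and negated
    loss = -(one_hot * torch.log(probs + eps)).sum(dim=1).mean()
    if l2 > 0:
        batch_size = logits.size(0)
        for w in weights:
            if w.dim() > 1:  # skip bias vectors
                # Assumed scaling: l2 / (2 * batch size) times the squared L2 norm
                loss = loss + (l2 / (2 * batch_size)) * w.pow(2).sum()
    return loss


torch.manual_seed(0)
logits = torch.randn(4, 10)
targets = torch.tensor([1, 0, 9, 3])

own = cross_entropy_sketch(logits, targets)
ref = F.cross_entropy(logits, targets)
print(own.item(), ref.item())

# With a 2x3 matrix of ones, l2=0.1 and batch size 4 the penalty is 0.1 / 8 * 6
regularized = cross_entropy_sketch(logits, targets, weights=[torch.ones(2, 3)], l2=0.1)
penalty = (regularized - own).item()
```

Without the penalty, the two printed values should agree up to floating-point error, which is the same check the notebook asks for later when comparing against the official implementation.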

# Implement your own Adam optimizer

We subclass PyTorch's `Optimizer` base class to implement the Adam optimizer. Through its `state` variable, this class stores the latest exponential moving averages for the Momentum and RMSprop terms, so we don't have to manage them ourselves. What you need to do:

### Exercise 1: Adam
* See this lecture's slides and/or the Adam paper* for the exact formulas you need to implement
* Don't forget to add bias correction for both the Momentum and RMSprop terms

### Exercise 2: Add Regularization
* Add L2 regularization directly into the optimizer according to page 3, Algorithm 2 "Adam with L2 regularization", in the referenced paper** (pytorch's official implementation follows this)
* Add "decoupled weight decay" into the optimizer (page 3, Algorithm 2 "Adam with decoupled weight decay") -> this is called the AdamW optimizer variant (shown to be better than vanilla Adam with L2 regularization for cases examined in the paper)
* Don't forget to scale the L2 hyperparameter by dividing it by batch size (as shown in the lecture)

\*Adam: https://arxiv.org/abs/1412.6980
\*\*Decoupled Weight Decay Regularization: https://arxiv.org/abs/1711.05101
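
For reference, the update rules from the two papers can be summarized as below. The notation follows the papers rather than the lecture slides, which is an assumption: $\theta$ are the parameters, $g_t$ the gradient, $\alpha$ the learning rate, $\lambda$ the L2 / weight decay factor, and $\eta_t$ an optional schedule multiplier (1 when no schedule is used). The term marked "L2" appears only in Adam with L2 regularization, the term marked "AdamW" only with decoupled weight decay. Note that PyTorch's `AdamW` multiplies the decay term by the learning rate, so results can differ from the paper's form for the same $\lambda$.

```latex
g_t = \nabla f_t(\theta_{t-1}) + \underbrace{\lambda \theta_{t-1}}_{\text{L2}}

m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
\qquad
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}
\qquad
\hat{v}_t = \frac{v_t}{1 - \beta_2^t}

\theta_t = \theta_{t-1} - \eta_t \left(
    \alpha \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
    + \underbrace{\lambda \theta_{t-1}}_{\text{AdamW}}
\right)
```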

In [None]:
# %% Optimizer
class MyAdamOptimizer(Optimizer):
    
    ### start; your code here:
    # set default parameters for Adam from the slides here
    def __init__(self, params, lr=, beta1=, beta2=, eps=, l2=0, adamw=False):  
    ### end of your code;     
        self.adamw = adamw
        defaults = dict(lr=lr, beta1=beta1, beta2=beta2, eps=eps, l2=l2)
        super(MyAdamOptimizer, self).__init__(params, defaults)
        
    #def __setstate__(self, state):
    #    super(MyOptimizerAdam, self).__setstate__(state)
        
    def step(self, closure=None):
        for group in self.param_groups:
            beta1 = group['beta1']
            beta2 = group['beta2']
            eps = group['eps']
            lr = group['lr']
            l2 = group['l2']

            for p in group['params']:
                if p.grad is None:
                    continue
                if p.grad.is_sparse:
                    raise RuntimeError('Adam does not support sparse gradients.')
                # get the state (a python dict) of the optimizer 
                # this dict contains last iteration's values for Momentum, RMSprop and tracks step count
                state = self.state[p]
                if len(state) == 0:
                    # initializing
                    state['step'] = 1
                    state['exp_avg'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                    state['exp_avg_sq'] = torch.zeros_like(p, memory_format=torch.preserve_format)
                else:
                    state['step'] += 1
                
                # gradient of a learnable parameter (weight matrix or bias vector)
                d_p = p.grad
                
                # Part of Exercise 2.1: Adjustments to include L2 regularization
                if not self.adamw and (l2 > 0):
                    if len(p.data.size()) > 1:
                        # adjust the weight matrix gradient
                        
                # End of Exercise 2.1
                
                # tracked state variables; keep tracking them for the next iteration by updating exp_avg and
                # exp_avg_sq with this iteration's values (below)
                step = state['step']
                exp_avg = state['exp_avg']  # last iteration's Momentum term before bias correction
                exp_avg_sq = state['exp_avg_sq']  # last iteration's RMSprop term before bias correction

                ### start; your code here
                # calculate Momentum term and store in exp_avg
                
                # calculate RMSprop term and store in exp_avg_sq
                
                # bias correction for Momentum
                
                # bias correction for RMSprop
                
                # Adam update value (Adam step size)

                
                # Part of Exercise 2.2: Adjustments to include decoupled weight decay (AdamW)
                if self.adamw and (l2 > 0):
                    if len(p.data.size()) > 1:
                        # adjust Adam step size (weight matrix updates)
                        
                # End of Exercise 2.2
                
                # full update value (including learning rate) to be subtracted from current parameter
                
                # update the weight matrix or bias vector
                # hint: the weight matrix or bias vector is stored in p.data
                
                ### end of your code;

                
        return 
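
The update described in the comments above can be sketched for a single tensor as follows. This is an illustration under assumptions rather than the reference solution: `adam_step` is a hypothetical standalone function, the defaults are the values suggested in the Adam paper, the decay term is multiplied by the learning rate (following the order of the template comments and PyTorch's `AdamW`), and the bias-vector filter and the batch-size scaling of `l2` requested in the exercise are left out for brevity.

```python
import torch


def adam_step(p, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              l2=0.0, adamw=False):
    """Apply one in-place Adam update to tensor p (which must not require grad)."""
    state["step"] = state.get("step", 0) + 1
    t = state["step"]
    if "exp_avg" not in state:
        state["exp_avg"] = torch.zeros_like(p)
        state["exp_avg_sq"] = torch.zeros_like(p)

    # Adam with L2 regularization: the penalty enters through the gradient
    if l2 > 0 and not adamw:
        grad = grad + l2 * p

    # Exponential moving averages (Momentum and RMSprop terms), updated in place
    m = state["exp_avg"].mul_(beta1).add_(grad, alpha=1 - beta1)
    v = state["exp_avg_sq"].mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias correction for both terms
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (v_hat.sqrt() + eps)

    # AdamW: decoupled weight decay is added to the step, not to the gradient
    if l2 > 0 and adamw:
        update = update + l2 * p

    p.sub_(lr * update)


# Compare with torch.optim.Adam on f(w) = sum(w^2), whose gradient is 2w
torch.manual_seed(0)
w_ref = torch.randn(3, 2, requires_grad=True)
w_own = w_ref.detach().clone()
ref_opt = torch.optim.Adam([w_ref], lr=1e-3)
state = {}
for _ in range(5):
    ref_opt.zero_grad()
    (w_ref ** 2).sum().backward()
    ref_opt.step()
    adam_step(w_own, 2 * w_own, state)

max_diff = (w_ref.detach() - w_own).abs().max().item()
print(max_diff)
```

Inside an `Optimizer` subclass the same arithmetic would operate on `p.data`, `p.grad` and `self.state[p]` instead of explicit arguments.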

## Training with your own loss function and Adam optimizer
Run training again with your own Adam optimizer and CE loss function.
Without L2 regularization, the performance of your implementation should closely match that of the official PyTorch implementation. If the results are not comparable, something is likely wrong in your code.

Once you have confirmed that your implementation works as expected without L2 regularization, move on to the next point.

Run the following three training scenarios and report the accuracy on both the train and test sets. Don't forget to re-initialize the network and reset the random seed before each training run:

* MyAdam; MyCE with L2=0.1;
* MyAdam with L2=0.1; MyCE;
* MyAdam with L2=0.1 and adamw=True; MyCE;

Optional: Compare PyTorch's L2-regularized Adam with your own L2-regularized Adam. The results may differ somewhat because of implementation differences.

Note that we could not confirm a significant improvement in test performance from L2 on this particular task with the given hyperparameters, but that is not representative of the usefulness of L2, or of regularization in general.
Feel free to, for example, train the network longer to observe overfitting and address it with regularization, and report your findings in the tutorial.

In [None]:
# set seed for reproducible training numbers
torch.manual_seed(0)

# %% Network training with your own optimizer and loss function
estimator = LeNet5().to(device)

### start; your code here
# define your adam optimizer
optimizer = 
# define your CE-loss function 
loss_function = 
### end of your code;

nb_train_params = sum(p.numel() for p in estimator.parameters() if p.requires_grad)
print('Starting to train a LeNet architecture with {} parameters for CIFAR-10 dataset.'.format(nb_train_params))

# launch training
### start; your code here
# call the model's fit function
tr_loss, va_loss = 
### end of your code;

# run test
train_acc = get_accuracy_helper(train_loader)
print('Accuracy on Train Set: {}'.format(train_acc))
test_acc = get_accuracy_helper(test_loader)
print('Accuracy on Test Set: {}'.format(test_acc))