# CNN initialization, activation and optimization

This notebook demonstrates the performance of CNNs with different configurations.

In particular, it explores Kaiming-He, Xavier, uniform and PyTorch default weight initialization. It also compares the ReLU and sigmoid activation functions and the Adam and SGD optimizers.

## Set up paths and imports

In [None]:
import os

import torch
import torch.nn.functional as F
from torchvision import transforms

if not os.path.exists("./notebooks"):
    %cd ..

import src.model
from src.training import do_train, do_test
from src.dataset import prepare_dataset_loaders
from src.data_processing import load_mean_std
from src.config import DATASET_DIR

wandb_enabled = False

## 1. Load standardization data and define Config

In [None]:
mean, std = load_mean_std(f"{DATASET_DIR}/scaling_params.json")

he = lambda m: torch.nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
xavier = lambda m: torch.nn.init.xavier_uniform_(m.weight)
uniform = lambda m: torch.nn.init.uniform_(m.weight)

transform = transforms.Compose([
    transforms.Resize((32,32)),
    transforms.ToTensor(),
    transforms.Normalize(mean, std)
])
criterion = torch.nn.CrossEntropyLoss()

class Config:
    def __init__(self, lr=0.001, epochs=40, batch_size=32):
        self.learning_rate = lr
        self.epochs = epochs
        self.batch_size = batch_size
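
The three initializers defined above differ mainly in the range from which weights are drawn. As a rough illustration, the sketch below computes those ranges in plain Python for a hypothetical 3x3 convolution with 3 input and 16 output channels. The layer shape is an assumption made for this example only and is not taken from `OriginalSizeCNN`.

```python
import math

def conv_fans(in_channels, out_channels, kernel_size):
    """Fan-in and fan-out of a square 2D convolution kernel."""
    receptive_field = kernel_size * kernel_size
    return in_channels * receptive_field, out_channels * receptive_field

def kaiming_uniform_bound(fan_in, gain=math.sqrt(2.0)):
    # kaiming_uniform_ with nonlinearity='relu' (gain sqrt(2)) and the
    # default mode='fan_in' samples from U(-bound, bound).
    return gain * math.sqrt(3.0 / fan_in)

def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    # xavier_uniform_ samples from U(-bound, bound).
    return gain * math.sqrt(6.0 / (fan_in + fan_out))

# Hypothetical layer shape, chosen only for illustration.
fan_in, fan_out = conv_fans(in_channels=3, out_channels=16, kernel_size=3)
he_bound = kaiming_uniform_bound(fan_in)
xavier_bound = xavier_uniform_bound(fan_in, fan_out)

print(f"fan_in={fan_in}, fan_out={fan_out}")
print(f"Kaiming-He uniform: U(-{he_bound:.4f}, {he_bound:.4f})")
print(f"Xavier uniform:     U(-{xavier_bound:.4f}, {xavier_bound:.4f})")
# uniform_ with default arguments samples from U(0, 1), independent of layer size.
print("Plain uniform:      U(0.0000, 1.0000)")
```

Note that the plain `uniform_` call is the only one of the three that is not centered at zero and does not scale with the layer size.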

### Optionally initialize W&B project

In [None]:
wandb_enabled = True

## 2. Choose device

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Choose your architecture

The configuration below proved to be the best overall and achieved the fastest convergence.

Almost all other configurations presented in this notebook, while slightly worse, still achieved similar performance.

In [None]:
name = "INIT:PyTorch-ACT:Relu-OPT:Adam-LR:0.001"
model = src.model.OriginalSizeCNN(
    initialize=None,
    activation=F.relu,
)
config = Config(
    lr=0.001,
)
optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

Switching the activation on the final layer to sigmoid visibly slows down convergence.

However, the results after 50 epochs are extremely close to those obtained with ReLU.
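
One plausible contributor to the slower start, offered here as an assumption rather than something measured in this notebook, is that the sigmoid derivative never exceeds 0.25 and shrinks quickly for large inputs, while the ReLU derivative is exactly 1 for any positive input. A minimal sketch in plain Python:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)), maximal at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # d/dx relu(x) is 1 for positive inputs and 0 otherwise
    return 1.0 if x > 0 else 0.0

# Compare the local gradients at a few pre-activation values.
for x in (0.0, 2.0, 5.0):
    print(f"x={x:>4}: sigmoid'={sigmoid_grad(x):.5f}  relu'={relu_grad(x):.1f}")
```

Smaller local gradients mean smaller weight updates for the same learning rate, which is consistent with slower convergence that still reaches a similar final accuracy.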

In [None]:
name = "INIT:PyTorch-ACT:Sigmoid-OPT:Adam-LR:0.001"
model = src.model.OriginalSizeCNN(
    initialize=None,
    activation=F.sigmoid,
)
config = Config(
    lr=0.001,
)
optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

This configuration converges very slowly due to the combination of the SGD optimizer and a learning rate of 0.001.

It's worth noting that the same learning rate yields good results when used with Adam.
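
A rough way to see why the same learning rate behaves so differently: plain SGD moves a weight by `lr * grad`, whereas Adam normalizes the gradient by its running magnitude, so its first step has size close to `lr` no matter how small the gradient is. The sketch below works this out for a single scalar weight. The gradient value is hypothetical and the hyperparameters are Adam's usual defaults.

```python
import math

def sgd_step(grad, lr):
    """Size of one plain SGD update (no momentum)."""
    return lr * grad

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of the very first Adam update, with bias correction."""
    m = (1 - beta1) * grad          # first-moment estimate
    v = (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1)         # bias-corrected, equals grad at step 1
    v_hat = v / (1 - beta2)         # bias-corrected, equals grad**2 at step 1
    return lr * m_hat / (math.sqrt(v_hat) + eps)

grad = 0.01  # hypothetical small gradient
lr = 0.001

sgd_update = sgd_step(grad, lr)
adam_update = adam_first_step(grad, lr)

print(f"SGD update:  {sgd_update:.2e}")
print(f"Adam update: {adam_update:.2e}")
```

With small gradients, SGD therefore needs a much larger learning rate to take steps of comparable size.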

In [None]:
name = "INIT:PyTorch-ACT:Sigmoid-OPT:SDG-LR:0.001"
model = src.model.OriginalSizeCNN(
    initialize=None,
    activation=F.sigmoid,
)
config = Config(
    lr=0.001,
)
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)

The four configurations presented below all show very similar performance, suggesting that for a simple dataset the choice of initialization method makes almost no difference.

In [None]:
name = "INIT:PyTorch-ACT:Sigmoid-OPT:SDG-LR:0.05"
model = src.model.OriginalSizeCNN(
    initialize=None,
    activation=F.sigmoid,
)
config = Config(
    lr=0.05,
)
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)

In [None]:
name = "INIT:He-ACT:Sigmoid-OPT:SDG-LR:0.05"
model = src.model.OriginalSizeCNN(
    initialize=he,
    activation=F.sigmoid,
)
config = Config(
    lr=0.05,
)
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)

In [None]:
name = "INIT:Xavier-ACT:Sigmoid-OPT:SDG-LR:0.05"
model = src.model.OriginalSizeCNN(
    initialize=xavier,
    activation=F.sigmoid,
)
config = Config(
    lr=0.05,
)
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)

In [None]:
name = "INIT:Uniform-ACT:Sigmoid-OPT:SDG-LR:0.05"
model = src.model.OriginalSizeCNN(
    initialize=uniform,
    activation=F.sigmoid,
)
config = Config(
    lr=0.05,
)
optimizer = torch.optim.SGD(model.parameters(), lr=config.learning_rate)

The last configuration struggles to converge and achieves very poor results. The cause lies in the increased learning rate of 0.05. While this learning rate works well for SGD, it seems to result in high instability for Adam.

In [None]:
name = "INIT:PyTorch-ACT:Sigmoid-OPT:Adam-LR:0.05"
model = src.model.OriginalSizeCNN(
    initialize=None,
    activation=F.sigmoid,
)
config = Config(
    lr=0.05,
)
optimizer = torch.optim.Adam(model.parameters(), lr=config.learning_rate)

## Train the model

In [None]:
train_loader, val_loader, test_loader = prepare_dataset_loaders(transform, config.batch_size)
run = do_train(name, train_loader, val_loader, config, model, criterion, optimizer, device, wandb_enabled)
do_test(name, test_loader, model.__class__, run, device, wandb_enabled)

### Comparison of models
A comparison of the architectures trainable with this notebook is available [here](https://api.wandb.ai/links/patonymous-warsaw-university-of-technology/sajmu7qa).

Almost all CNNs presented here achieved around 0.9 validation accuracy within 50 epochs. The notable exception is the combination of the SGD optimizer with a learning rate of 0.001, which showed very slow convergence. The fastest to learn and the best overall proved to be the combination of ReLU activation, PyTorch default initialization and the Adam optimizer with a learning rate of 0.001.

The differences between initialization methods become insignificant after the first few epochs. It seems that for a simple classification problem such as this one, it's possible to find multiple well-performing configurations. Adam with ReLU and a learning rate of 0.001 achieves results as good as SGD with sigmoid and a learning rate of 0.05.
