<a href="https://colab.research.google.com/github/YDIB11/2025/blob/main/HW3_practical_ex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 3: optimization of a CNN model
The task of this homework is to optimize a CNN model for the CIFAR-100 dataset. You are free to define the architecture of the model and the training procedure. The only constraints are:
- It must be a `torch.nn.Module` object
- The number of trained parameters must be less than 1 million
- The test dataset must not be used for any step of training.
- The final training notebook should run on Google Colab in approximately 1 hour at most.
- Do not modify the random seeds, as they are needed for reproducibility.

For grading, you must use the `evaluate` function defined below. It takes a model as input and prints the test accuracy.

As a guideline, you are expected to **discuss** and motivate your choices regarding:
- Model architecture
- Hyperparameters (learning rate, batch size, etc)
- Regularization methods
- Optimizer
- Validation scheme

Code without any explanation of the choices will not be accepted. Test accuracy is not the only measure of success for this homework.

Remember that most of the training process is randomized: store your model's weights after training and load them before the evaluation!

In [None]:
import time
t0 = time.time()

## Example

### Loading packages and libraries

In [None]:
import random
import numpy as np
import torch
import torchvision


# Fix all random seeds
seed = 42
random.seed(seed)
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)

# For full determinism
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# Import the best device available
device = torch.device('cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu')
print('Using device:', device)

# load the data
train_dataset = torchvision.datasets.CIFAR100(root='./data', train=True, download=True, transform=torchvision.transforms.ToTensor())

Using device: cuda


In [None]:
test_dataset = torchvision.datasets.CIFAR100(root='./data', train=False, download=True, transform=torchvision.transforms.ToTensor())

def evaluate(model):
    params_count = sum(p.numel() for p in model.parameters())
    print('The model has {} parameters'.format(params_count))

    if params_count > int(1e6):
        print('The model has too many parameters! Not allowed to evaluate.')
        return

    model = model.to(device)
    model.eval()
    correct = 0
    total = 0

    test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=32, shuffle=False)
    for images, labels in test_loader:
        images, labels = images.to(device), labels.to(device)
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()


    # print in bold red in a notebook
    print('\033[1m\033[91mAccuracy on the test set: {}%\033[0m'.format(100 * correct / total))


### Example of a simple CNN model

In [None]:
class TinyNet(torch.nn.Module):
    def __init__(self):
        super(TinyNet, self).__init__()
        self.conv1 = torch.nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)
        self.conv2 = torch.nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)
        self.fc1 = torch.nn.Linear(8*8*64, 128)
        self.fc2 = torch.nn.Linear(128, 100)

    def forward(self, x):
        x = torch.nn.functional.relu(self.conv1(x))
        x = torch.nn.functional.max_pool2d(x, 2)
        x = torch.nn.functional.relu(self.conv2(x))
        x = torch.nn.functional.max_pool2d(x, 2)
        x = x.view(-1, 8*8*64)
        x = torch.nn.functional.relu(self.fc1(x))
        x = self.fc2(x)
        return x

print("Model parameters: ", sum(p.numel() for p in TinyNet().parameters()))

Model parameters:  556708


### Example of basic training

In [None]:

model = TinyNet()
model.to(device)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters())

train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=32, shuffle=True)
for epoch in range(10):
    for i, (images, labels) in enumerate(train_loader):
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

    print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch+1, 10, loss.item()))


Epoch [1/10], Loss: 4.6138
Epoch [2/10], Loss: 4.5809
Epoch [3/10], Loss: 4.5756
Epoch [4/10], Loss: 4.5898
Epoch [5/10], Loss: 4.5962
Epoch [6/10], Loss: 4.5979
Epoch [7/10], Loss: 4.6147
Epoch [8/10], Loss: 4.5324
Epoch [9/10], Loss: 4.5006
Epoch [10/10], Loss: 4.3964


In [None]:
# save the model on a file
torch.save(model.state_dict(), 'tiny_net.pt')

loaded_model = TinyNet()
loaded_model.load_state_dict(torch.load('tiny_net.pt', weights_only=True))
evaluate(loaded_model)

The model has 556708 parameters
[1m[91mAccuracy on the test set: 3.36%[0m


## Solution

### Setup choices
- Use torch/torchvision with AMP (`torch.amp.autocast`, `torch.amp.GradScaler`) for faster training on GPU.
- Use a cosine LR scheduler for smooth decay without manual milestones.
- Reuse the seed and device setup from the example cells.
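
To make the "smooth decay" concrete, the following is a minimal sketch of the closed-form cosine annealing curve. It assumes a minimum learning rate of 0 and reuses the base rate of 0.1 and the 50-epoch horizon from the training cell further down. The helper name `cosine_lr` is hypothetical and is not part of PyTorch; it only illustrates the shape that `CosineAnnealingLR` is expected to follow when stepped once per epoch.

```python
import math

def cosine_lr(epoch, base_lr=0.1, t_max=50, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# Learning rate at the start of each epoch (assumed values: lr=0.1, 50 epochs)
schedule = [cosine_lr(e) for e in range(51)]
for e in (0, 10, 25, 40, 50):
    print(f"epoch {e:2d}: lr = {schedule[e]:.4f}")
```

The curve is flat near the start and the end and steepest in the middle, so the model spends several epochs at a high rate before settling into small steps, with no milestone epochs to tune.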


In [None]:
from torchvision import transforms
from torch.utils.data import DataLoader, random_split, Subset
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR
import torch.nn.functional as F
# Device-agnostic AMP API (replaces the deprecated torch.cuda.amp imports)
from torch.amp import autocast, GradScaler

In [None]:
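# Data
# CIFAR-100 channel statistics, kept for reference only:
# normalization is applied inside the model, not in the transforms
mean = (0.5071, 0.4867, 0.4408)
std = (0.2675, 0.2565, 0.2761)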
train_tf = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25)
])

val_tf = transforms.Compose([
    transforms.ToTensor()
])


full_train = torchvision.datasets.CIFAR100(root='./data', train=True, download=True, transform=train_tf)
val_size = 5000
train_size = len(full_train) - val_size
train_set, val_split = random_split(full_train, [train_size, val_size])
# Rebuild the validation subset on a second dataset view so it uses val_tf instead of train_tf
val_set = Subset(torchvision.datasets.CIFAR100(root='./data', train=True, download=False, transform=val_tf),
                 val_split.indices)

train_loader = DataLoader(train_set, batch_size=128, shuffle=True, num_workers=2, pin_memory=True)
val_loader = DataLoader(val_set, batch_size=256, shuffle=False, num_workers=2, pin_memory=True)

# Model (depthwise-separable residual blocks)
class DWBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        mid = out_ch // 2
        self.reduce = nn.Conv2d(in_ch, mid, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(mid)
        self.dw = nn.Conv2d(mid, mid, 3, stride=stride, padding=1, groups=mid, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)
        self.expand = nn.Conv2d(mid, out_ch, 1, bias=False)
        self.bn3 = nn.BatchNorm2d(out_ch)
        self.skip = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False) if (in_ch != out_ch or stride != 1) else None

    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.reduce(x)))
        out = F.relu(self.bn2(self.dw(out)))
        out = self.bn3(self.expand(out))
        if self.skip is not None:
            identity = self.skip(identity)
        return F.relu(out + identity)

class SmallNet(nn.Module):
    def __init__(self, num_classes=100):
        super().__init__()
        mean = torch.tensor([0.5071, 0.4867, 0.4408]).view(1,3,1,1)
        std = torch.tensor([0.2675, 0.2565, 0.2761]).view(1,3,1,1)
        self.register_buffer('mean', mean)
        self.register_buffer('std', std)
        self.stem = nn.Sequential(
            nn.Conv2d(3, 96, 3, padding=1, bias=False),
            nn.BatchNorm2d(96),
            nn.ReLU(inplace=True)
        )
        self.stage1 = nn.Sequential(DWBlock(96, 128), DWBlock(128, 128))
        self.stage2 = nn.Sequential(DWBlock(128, 192, stride=2), DWBlock(192, 192))
        self.stage3 = nn.Sequential(DWBlock(192, 256, stride=2), DWBlock(256, 256))
        self.stage4 = nn.Sequential(DWBlock(256, 448, stride=2), DWBlock(448, 448))
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(448, num_classes)

    def forward(self, x):
        x = (x - self.mean) / self.std
        x = self.stem(x)
        x = self.stage1(x)
        x = self.stage2(x)
        x = self.stage3(x)
        x = self.stage4(x)
        x = self.pool(x).flatten(1)
        return self.fc(x)

model = SmallNet().to(device)
print("Trainable params:", sum(p.numel() for p in model.parameters() if p.requires_grad))


Trainable params: 845380


### Data & model design
- Augmentations: random crop + flip + random erasing to improve generalization on CIFAR-100; no normalization in transforms because we normalize inside the model, so train/val/test all see consistent preprocessing.
- Validation split: 5k held-out from training via `random_split`; val loader uses only ToTensor to match test-time preprocessing.
- Model (SmallNet): residual bottleneck blocks (1x1 reduce, 3x3 depthwise, 1x1 expand) keep the parameter count low while retaining representational power; residual skips help optimization. Channel progression 96→128→192→256→448 gives ~0.85M params (845,380), under the 1M cap.
- In-model normalization: registered mean/std buffers apply CIFAR-100 normalization in `forward`, ensuring consistency with the provided `evaluate` loader (which uses raw ToTensor).
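
The parameter budget can be checked by hand. The sketch below redoes the arithmetic for the layer shapes defined in `DWBlock` and `SmallNet` above, using plain Python and no PyTorch. The helper name `dw_block_params` is introduced here for illustration only. The dense 3x3 convolution at the end is a hypothetical comparison point, not a layer of the model.

```python
def dw_block_params(in_ch, out_ch, stride=1):
    """Parameter count of one DWBlock (all convolutions are bias-free)."""
    mid = out_ch // 2
    reduce = in_ch * mid                        # 1x1 conv
    depthwise = mid * 3 * 3                     # 3x3 depthwise conv, one filter per channel
    expand = mid * out_ch                       # 1x1 conv
    batchnorm = 2 * mid + 2 * mid + 2 * out_ch  # weight and bias per BatchNorm channel
    skip = in_ch * out_ch if (in_ch != out_ch or stride != 1) else 0  # 1x1 projection
    return reduce + depthwise + expand + batchnorm + skip

# (in_ch, out_ch, stride) for the eight blocks in stage1..stage4
blocks = [(96, 128, 1), (128, 128, 1), (128, 192, 2), (192, 192, 1),
          (192, 256, 2), (256, 256, 1), (256, 448, 2), (448, 448, 1)]

stem = 3 * 96 * 3 * 3 + 2 * 96   # 3x3 conv plus BatchNorm
body = sum(dw_block_params(*b) for b in blocks)
head = 448 * 100 + 100           # final linear layer with bias
total = stem + body + head
print(total)                     # 845380, the count printed by the notebook

# Hypothetical comparison: a single dense 3x3 conv mapping 448 -> 448 channels
dense_3x3 = 448 * 448 * 3 * 3
print(dense_3x3, dw_block_params(448, 448))
```

A single dense 3x3 convolution at the last stage's width would already exceed the 1M cap on its own, which is why the blocks use the reduce/depthwise/expand factorization.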


In [None]:
# Train/val loop
epochs = 50
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
scaler = GradScaler('cuda')
best_val = 0.0
best_path = 'best_smallnet.pth'

for epoch in range(epochs):
    model.train()
    tot_loss = tot_correct = tot_seen = 0
    for imgs, labels in train_loader:
        imgs, labels = imgs.to(device, non_blocking=True), labels.to(device, non_blocking=True)
        optimizer.zero_grad()
        with autocast('cuda'):
            logits = model(imgs)
            loss = criterion(logits, labels)
        scaler.scale(loss).backward()
        scaler.unscale_(optimizer)
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)
        scaler.update()

        tot_loss += loss.item() * labels.size(0)
        tot_correct += (logits.argmax(1) == labels).sum().item()
        tot_seen += labels.size(0)

    scheduler.step()
    train_loss = tot_loss / tot_seen
    train_acc = tot_correct / tot_seen

    model.eval()
    val_correct = val_seen = 0
    with torch.no_grad():
        for imgs, labels in val_loader:
            imgs, labels = imgs.to(device, non_blocking=True), labels.to(device, non_blocking=True)
            logits = model(imgs)
            val_correct += (logits.argmax(1) == labels).sum().item()
            val_seen += labels.size(0)
    val_acc = val_correct / val_seen
    print(f"Epoch {epoch+1}/{epochs} - train_loss {train_loss:.4f} - train_acc {train_acc:.4f} - val_acc {val_acc:.4f}")

    if val_acc > best_val:
        best_val = val_acc
        torch.save(model.state_dict(), best_path)

print("Best val_acc:", best_val)


Epoch 1/50 - train_loss 4.1262 - train_acc 0.0838 - val_acc 0.1578
Epoch 2/50 - train_loss 3.4942 - train_acc 0.2039 - val_acc 0.2264
Epoch 3/50 - train_loss 3.1045 - train_acc 0.2945 - val_acc 0.2660
Epoch 4/50 - train_loss 2.8715 - train_acc 0.3593 - val_acc 0.3620
Epoch 5/50 - train_loss 2.7240 - train_acc 0.3983 - val_acc 0.4066
Epoch 6/50 - train_loss 2.6113 - train_acc 0.4320 - val_acc 0.4196
Epoch 7/50 - train_loss 2.5299 - train_acc 0.4524 - val_acc 0.4056
Epoch 8/50 - train_loss 2.4701 - train_acc 0.4710 - val_acc 0.4532
Epoch 9/50 - train_loss 2.4201 - train_acc 0.4876 - val_acc 0.3614
Epoch 10/50 - train_loss 2.3780 - train_acc 0.5015 - val_acc 0.4330
Epoch 11/50 - train_loss 2.3344 - train_acc 0.5140 - val_acc 0.4708
Epoch 12/50 - train_loss 2.3030 - train_acc 0.5208 - val_acc 0.4914
Epoch 13/50 - train_loss 2.2716 - train_acc 0.5334 - val_acc 0.4718
Epoch 14/50 - train_loss 2.2405 - train_acc 0.5409 - val_acc 0.4982
Epoch 15/50 - train_loss 2.2097 - train_acc 0.5541 - val_

### Training setup
- Optimizer: SGD with momentum 0.9 and weight decay 5e-4.
- LR: 0.1 with cosine annealing over 50 epochs.
- Loss: CrossEntropy with label smoothing 0.1 for regularization.
- AMP: mixed precision (`autocast`, `GradScaler`) for speed, plus gradient-norm clipping at 1.0 for stability.
- Batch sizes: 128 train / 256 val to balance throughput and memory.
- Checkpointing: save best model by validation accuracy.
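
As an illustration of what label smoothing does to the loss, here is a minimal pure-Python sketch. It assumes the usual formulation in which the one-hot target is mixed with a uniform distribution over the classes, with weight `eps` on the uniform part. The function name `smoothed_cross_entropy` and the example logits are made up for this illustration.

```python
import math

def smoothed_cross_entropy(logits, target, eps=0.1):
    """Cross-entropy against a one-hot target mixed with a uniform distribution."""
    k = len(logits)
    # Numerically stable log-softmax
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    log_probs = [v - log_z for v in logits]
    # Smoothed target: eps / k on every class, plus (1 - eps) on the true class
    q = [eps / k + (1.0 - eps) * (i == target) for i in range(k)]
    return -sum(qi * lp for qi, lp in zip(q, log_probs))

logits = [4.0, 1.0, 0.5, -1.0]   # hypothetical confident, correct prediction for class 0
hard = smoothed_cross_entropy(logits, target=0, eps=0.0)
soft = smoothed_cross_entropy(logits, target=0, eps=0.1)
print(f"hard targets: {hard:.4f}  smoothed targets: {soft:.4f}")
```

With smoothing, a confident correct prediction is still charged a small loss, which discourages the logits from growing without bound and acts as a regularizer.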


In [None]:
# Load best and run provided evaluate() on the test set
model.load_state_dict(torch.load(best_path, map_location=device, weights_only=True))
model.to(device)
evaluate(model)

The model has 845380 parameters
[1m[91mAccuracy on the test set: 72.06%[0m


### Evaluation
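- Reload the best checkpoint (highest validation accuracy) before testing.
- Use the provided `evaluate` function to report test accuracy; the test set is touched only at this step, never during training or model selection.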


In [None]:
t1 = time.time()
elapsed = t1 - t0
print(f"Total wall time: {elapsed/60:.1f} min")
if elapsed < 3600:
    print("✅ End-to-end run completed in under 1 hour.")
else:
    print("⚠️ Took longer than 1 hour.")


Total wall time: 19.6 min
✅ End-to-end run completed in under 1 hour.
