# Assignment Module 2: Pet Classification

The goal of this assignment is to implement a neural network that classifies images of 37 breeds of cats and dogs from the [Oxford-IIIT-Pet dataset](https://www.robots.ox.ac.uk/~vgg/data/pets/). The assignment is divided into two parts: first, you will be asked to implement from scratch your own neural network for image classification; then, you will fine-tune a pretrained network provided by PyTorch.

## Dataset

The following cells contain the code to download and access the dataset you will be using in this assignment. Note that, although this dataset features each and every image from [Oxford-IIIT-Pet](https://www.robots.ox.ac.uk/~vgg/data/pets/), it uses a different train-val-test split than the original authors.

In [29]:
!git clone https://github.com/CVLAB-Unibo/ipcv-assignment-2.git

fatal: destination path 'ipcv-assignment-2' already exists and is not an empty directory.


In [42]:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
from torch import Tensor
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from torchsummary import summary
import torch.nn.functional as F
import torch.optim.lr_scheduler as lr_scheduler
import matplotlib.pyplot as plt
from PIL import Image
from typing import List, Tuple, Dict, Optional
from pathlib import Path
from tqdm import tqdm
import pandas as pd
import seaborn as sns

In [31]:
# Check for CUDA availability
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {DEVICE}")

def fix_random(seed: int) -> None:
    """Fix all the possible sources of randomness.

    Args:
        seed: the seed to use.
    """
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)

    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

fix_random(42)

Using device: cuda


In [32]:
class OxfordPetDataset(Dataset):
    def __init__(self, split: str, transform=None) -> None:
        super().__init__()

        self.root = Path("ipcv-assignment-2") / "dataset"
        self.split = split
        self.names, self.labels = self._get_names_and_labels()
        self.transform = transform

    def __len__(self) -> int:
        return len(self.labels)

    def __getitem__(self, idx) -> Tuple[Tensor, int]:
        img_path = self.root / "images" / f"{self.names[idx]}.jpg"
        img = Image.open(img_path).convert("RGB")
        label = self.labels[idx]
        
        if self.transform:
            img = self.transform(img)

        return img, label
    
    def get_num_classes(self) -> int:
        return max(self.labels) + 1

    def _get_names_and_labels(self) -> Tuple[List[str], List[int]]:
        names = []
        labels = []

        with open(self.root / "annotations" / f"{self.split}.txt") as f:
            for line in f:
                name, label = line.replace("\n", "").split(" ")
                names.append(name)
                labels.append(int(label) - 1)  # shift labels so they start at 0

        return names, labels
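
The parsing logic in `_get_names_and_labels` can be checked on in-memory text before touching the real split files. The sketch below assumes each annotation line has the form `<image_name> <label>` with labels starting at 1, as the `- 1` above implies. The helper name `parse_annotations` and the sample lines are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def parse_annotations(lines: Iterable[str]) -> Tuple[List[str], List[int]]:
    """Parse '<name> <label>' lines into names and zero-based labels."""
    names, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines instead of failing on them
            continue
        name, label = line.split(" ")
        names.append(name)
        labels.append(int(label) - 1)
    return names, labels


# Hypothetical sample lines, for illustration only
sample = ["Abyssinian_1 1", "beagle_7 5", "Abyssinian_2 1", ""]
names, labels = parse_annotations(sample)
class_counts = Counter(labels)  # samples per class, useful to spot imbalance
print(names)
print(class_counts.most_common())
```

Running the same counter over `train_dataset.labels` would show how many training images each breed has.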

## Part 1: design your own network

Your goal is to implement a convolutional neural network for image classification and train it from scratch on `OxfordPetDataset`. You can consider the task accomplished once you reach a classification accuracy of ~60% on the test split. You are free to achieve this however you want, except for a few rules you must follow:

- Compile this notebook by displaying the results obtained by the best model you found throughout your experimentation; then show how, by removing some of its components, its performance drops. In other words, do an *ablation study* to prove that your design choices have a positive impact on the final result.

- Do not instantiate an off-the-shelf PyTorch network. Instead, construct your network as a composition of existing PyTorch layers. In more concrete terms, you can use e.g. `torch.nn.Linear`, but you cannot use e.g. `torchvision.models.alexnet`.

- Show your results and ablations with plots, tables, images, etc. — the clearer, the better.

Don't be too concerned with your model performance: the ~60% is just to give you an idea of when to stop. Keep in mind that a thoroughly justified model with lower accuracy will be awarded more points than a model with higher accuracy but weak experimental validation.
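
One possible way to organise the ablation study is to describe the best model as a set of on/off components and generate one variant per removed component. The sketch below is only an illustration: the component names in `BASELINE` and the helper `ablation_variants` are hypothetical placeholders for whatever design choices your own model makes.

```python
from typing import Dict, List

# Hypothetical component flags; replace them with your own design choices
BASELINE: Dict[str, bool] = {
    "squeeze_excitation": True,
    "hard_swish": True,
    "data_augmentation": True,
    "dropout": True,
}


def ablation_variants(baseline: Dict[str, bool]) -> List[Dict[str, object]]:
    """Return the full model plus one variant per disabled component."""
    variants: List[Dict[str, object]] = [{"name": "full", "config": dict(baseline)}]
    for component, enabled in baseline.items():
        if not enabled:
            continue  # nothing to remove for components that are already off
        config = dict(baseline)  # copy so the baseline is never mutated
        config[component] = False
        variants.append({"name": f"no_{component}", "config": config})
    return variants


variants = ablation_variants(BASELINE)
for v in variants:
    print(v["name"], v["config"])
```

Each variant would then be trained with the same seed and schedule, so that differences in validation accuracy can be attributed to the removed component.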

In [33]:
IMG_SIZE = (224, 224)  # A common size for image classification tasks
# ImageNet mean and std for normalization
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.TrivialAugmentWide(),
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD)
])

val_test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(MEAN, STD)
])

# Create Dataset instances
train_dataset = OxfordPetDataset(split="train", transform=train_transform)
val_dataset = OxfordPetDataset(split="val", transform=val_test_transform)
test_dataset = OxfordPetDataset(split="test", transform=val_test_transform)

# Create DataLoader instances
BATCH_SIZE = 128 # You can tune this hyperparameter

train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, shuffle=True, num_workers=2)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=False, num_workers=2)

# Get number of classes
NUM_CLASSES = train_dataset.get_num_classes()
INPUT_DIM = len(train_dataset[0][0])  # number of channels of one transformed image (3 for RGB)
print(f"Number of classes: {NUM_CLASSES}")
print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")

Number of classes: 37
Training samples: 3669
Validation samples: 1834
Test samples: 1846
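
The split sizes printed above determine how many optimizer steps one epoch contains, which matters when a scheduler is stepped once per batch. A small worked sketch, assuming the default `drop_last=False` of `DataLoader`; the helper names are hypothetical:

```python
import math

# Split sizes and batch size as printed and defined above
SPLIT_SIZES = {"train": 3669, "val": 1834, "test": 1846}
BATCH_SIZE = 128


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of batches yielded per epoch when the last partial batch is kept."""
    return math.ceil(num_samples / batch_size)


def last_batch_size(num_samples: int, batch_size: int) -> int:
    """Size of the final batch, which may be smaller than batch_size."""
    remainder = num_samples % batch_size
    return remainder if remainder else batch_size


for split, n in SPLIT_SIZES.items():
    print(split, batches_per_epoch(n, BATCH_SIZE), last_batch_size(n, BATCH_SIZE))
```

With these numbers the training loader yields 29 batches per epoch, the last one holding 85 images.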


In [34]:
class HSwish(nn.Module):
    def forward(self, x):
        return x * F.relu6(x + 3) / 6

class SEBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.fc1 = nn.Conv2d(channels, channels // 4, 1)
        self.fc2 = nn.Conv2d(channels // 4, channels, 1)
    
    def forward(self, x):
        s = F.adaptive_avg_pool2d(x, 1)
        s = F.relu(self.fc1(s))
        s = F.relu6(self.fc2(s) + 3) / 6  # Hard sigmoid
        return x * s

class InvertedResidual(nn.Module):
    def __init__(self, inp, exp, out, k, s, se, hs):
        super().__init__()
        self.use_res = (s == 1 and inp == out)
        
        layers = []
        # Expand
        if exp != inp:
            layers += [nn.Conv2d(inp, exp, 1, bias=False), nn.BatchNorm2d(exp), 
                      HSwish() if hs else nn.ReLU()]
        # Depthwise
        layers += [nn.Conv2d(exp, exp, k, s, k//2, groups=exp, bias=False), 
                  nn.BatchNorm2d(exp), HSwish() if hs else nn.ReLU()]
        # SE
        if se:
            layers.append(SEBlock(exp))
        # Project
        layers += [nn.Conv2d(exp, out, 1, bias=False), nn.BatchNorm2d(out)]
        
        self.conv = nn.Sequential(*layers)
    
    def forward(self, x):
        if self.use_res:
            return x + self.conv(x)
        return self.conv(x)

class MobileNetV3(nn.Module):
    def __init__(self, num_classes=37, mode='large'):
        super().__init__()
        
        # [inp, exp, out, kernel, stride, SE, HS]
        if mode == 'large':
            cfg = [
                [16, 16, 16, 3, 1, 0, 0], [16, 64, 24, 3, 2, 0, 0], [24, 72, 24, 3, 1, 0, 0],
                [24, 72, 40, 5, 2, 1, 0], [40, 120, 40, 5, 1, 1, 0], [40, 120, 40, 5, 1, 1, 0],
                [40, 240, 80, 3, 2, 0, 1], [80, 200, 80, 3, 1, 0, 1], [80, 184, 80, 3, 1, 0, 1],
                [80, 184, 80, 3, 1, 0, 1], [80, 480, 112, 3, 1, 1, 1], [112, 672, 112, 3, 1, 1, 1],
                [112, 672, 160, 5, 2, 1, 1], [160, 960, 160, 5, 1, 1, 1], [160, 960, 160, 5, 1, 1, 1]
            ]
            last_ch = 960
        else:  # small
            cfg = [
                [16, 16, 16, 3, 2, 1, 0], [16, 72, 24, 3, 2, 0, 0], [24, 88, 24, 3, 1, 0, 0],
                [24, 96, 40, 5, 2, 1, 1], [40, 240, 40, 5, 1, 1, 1], [40, 240, 40, 5, 1, 1, 1],
                [40, 120, 48, 5, 1, 1, 1], [48, 144, 48, 5, 1, 1, 1], [48, 288, 96, 5, 2, 1, 1],
                [96, 576, 96, 5, 1, 1, 1], [96, 576, 96, 5, 1, 1, 1]
            ]
            last_ch = 576
        
        # Stem
        self.stem = nn.Sequential(
            nn.Conv2d(3, 16, 3, 2, 1, bias=False),
            nn.BatchNorm2d(16),
            HSwish()
        )
        
        # Blocks
        layers = []
        for inp, exp, out, k, s, se, hs in cfg:
            layers.append(InvertedResidual(inp, exp, out, k, s, se, hs))
        self.blocks = nn.Sequential(*layers)
        
        # Head
        self.head = nn.Sequential(
            nn.Conv2d(cfg[-1][2], last_ch, 1, bias=False),
            nn.BatchNorm2d(last_ch),
            HSwish(),
            nn.AdaptiveAvgPool2d(1)
        )
        
        # Classifier
        self.classifier = nn.Sequential(
            nn.Linear(last_ch, 1280 if mode == 'large' else 1024),
            HSwish(),
            nn.Dropout(0.2),
            nn.Linear(1280 if mode == 'large' else 1024, num_classes)
        )
    
    def forward(self, x):
        x = self.stem(x)
        x = self.blocks(x)
        x = self.head(x)
        x = x.view(x.size(0), -1)
        x = self.classifier(x)
        return x

def mobilenetv3_large(num_classes=37):
    return MobileNetV3(num_classes, 'large')

def mobilenetv3_small(num_classes=37):
    return MobileNetV3(num_classes, 'small')
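
For reference, the two piecewise-linear activations used in the code above can be written in closed form. The expressions below restate what `HSwish.forward` and the gate in `SEBlock.forward` compute; the names h-sigmoid and h-swish are the usual labels for these functions.

$$
\begin{aligned}
\operatorname{ReLU6}(z) &= \min(\max(z, 0), 6) \\
\operatorname{h\text{-}sigmoid}(x) &= \frac{\operatorname{ReLU6}(x + 3)}{6} \\
\operatorname{h\text{-}swish}(x) &= x \cdot \frac{\operatorname{ReLU6}(x + 3)}{6}
\end{aligned}
$$

Both functions are 0 for $x \le -3$. For $x \ge 3$ the gate saturates at 1, so h-swish reduces to the identity.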

In [35]:
def train_epoch(model: nn.Module,
                dataloader: DataLoader,
                criterion: nn.Module,
                optimizer: optim.Optimizer,
                device: torch.device,
                scheduler: Optional[lr_scheduler.LRScheduler] = None) -> Tuple[float, float]:
    model.train()
    epoch_loss = 0.0
    correct_predictions = 0
    total_samples = 0

    for inputs, labels in dataloader:
        inputs, labels = inputs.to(device), labels.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
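        # Per-batch stepping suits schedulers such as OneCycleLR;
        # epoch-based schedulers should be stepped outside this function.
        if scheduler is not None:
            scheduler.step()

        # Weight the batch loss by batch size so the epoch average is per-sample
        epoch_loss += loss.item() * inputs.size(0)
        preds = outputs.argmax(dim=1)
        correct_predictions += (preds == labels).sum().item()
        total_samples += labels.size(0)

    avg_loss = epoch_loss / total_samples
    avg_acc = correct_predictions / total_samples
    return avg_loss, avg_acc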

def evaluate_model(model: nn.Module,
                   dataloader: DataLoader,
                   criterion: nn.Module,
                   device: torch.device) -> Tuple[float, float]:
    model.eval()
    epoch_loss = 0.0
    correct_predictions = 0
    total_samples = 0

    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, labels)

            epoch_loss += loss.item() * inputs.size(0)
            preds = outputs.argmax(dim=1)
            correct_predictions += (preds == labels).sum().item()
            total_samples += labels.size(0)

    avg_loss = epoch_loss / total_samples
    avg_acc = correct_predictions / total_samples
    return avg_loss, avg_acc

def plot_history(history: Dict[str, List[float]]):
    plt.figure(figsize=(12, 4))
    
    plt.subplot(1, 2, 1)
    plt.plot(history['train_loss'], label='Train Loss')
    plt.plot(history['val_loss'], label='Validation Loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Loss Over Epochs')
    plt.grid(True, which='both', linestyle='--', linewidth=0.5)
    plt.legend()

    plt.subplot(1, 2, 2)
    plt.plot(history['train_acc'], label='Train Accuracy')
    plt.plot(history['val_acc'], label='Validation Accuracy')
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.title('Accuracy Over Epochs')
    plt.grid(True, which='both', linestyle='--', linewidth=0.5)
    plt.legend()

    plt.tight_layout()
    plt.show()
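
Both `train_epoch` and `evaluate_model` multiply each batch loss by `inputs.size(0)` before dividing by the total number of samples. With the default `reduction='mean'` of `nn.CrossEntropyLoss`, this gives a per-sample average that stays correct when the last batch is smaller. A minimal sketch in plain Python, with made-up batch losses and a hypothetical helper name:

```python
from typing import List, Tuple


def epoch_average(batch_stats: List[Tuple[float, int]]) -> float:
    """Per-sample mean computed from (mean_batch_loss, batch_size) pairs."""
    total_loss = sum(loss * size for loss, size in batch_stats)
    total_samples = sum(size for _, size in batch_stats)
    return total_loss / total_samples


# Hypothetical values: one full batch and a smaller final batch
stats = [(1.0, 128), (2.0, 64)]
weighted = epoch_average(stats)
unweighted = sum(loss for loss, _ in stats) / len(stats)
print(f"weighted: {weighted:.4f}, unweighted: {unweighted:.4f}")
```

The unweighted mean over batches would overstate the influence of the smaller final batch.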

In [37]:
model = mobilenetv3_large(num_classes=37).to(DEVICE)
x = torch.randn(1, 3, 224, 224).to(DEVICE)
print(f"Output: {model(x).shape}")
print(f"Params: {sum(p.numel() for p in model.parameters()):,}")

Output: torch.Size([1, 37])
Params: 4,247,595
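
As a complement to the shape check above, the spatial resolution of the feature maps can be traced by hand from the kernel sizes and strides in the `'large'` configuration. The sketch below assumes padding of `kernel // 2`, as used in the stem and in `InvertedResidual`. The helper names are hypothetical.

```python
from typing import List, Tuple


def conv_out(size: int, kernel: int, stride: int) -> int:
    """Output size of a convolution with padding = kernel // 2."""
    padding = kernel // 2
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride) pairs copied from the 'large' cfg above
LARGE_BLOCKS: List[Tuple[int, int]] = [
    (3, 1), (3, 2), (3, 1), (5, 2), (5, 1), (5, 1), (3, 2), (3, 1),
    (3, 1), (3, 1), (3, 1), (3, 1), (5, 2), (5, 1), (5, 1),
]


def trace_resolution(input_size: int, blocks: List[Tuple[int, int]]) -> List[int]:
    """Feature-map side length after the stem and after each block."""
    sizes = [conv_out(input_size, 3, 2)]  # stem: 3x3 convolution, stride 2
    for kernel, stride in blocks:
        sizes.append(conv_out(sizes[-1], kernel, stride))
    return sizes


sizes = trace_resolution(224, LARGE_BLOCKS)
print(sizes)
```

For a 224x224 input the last block produces a 7x7 map, which `nn.AdaptiveAvgPool2d(1)` in the head then reduces to 1x1.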


In [46]:
EPOCHS = 50
HIDDEN_DIM = 6

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}
best_val_acc = 0.0

# main progress bar over epochs
pbar = tqdm(range(EPOCHS), desc="Training", ncols=100)

for epoch in pbar:
    # --- Training ---
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, DEVICE)
    
    # --- Validation ---
    val_loss, val_acc = evaluate_model(model, val_loader, criterion, DEVICE)
    
    # --- Record metrics ---
    history['train_loss'].append(train_loss)
    history['train_acc'].append(train_acc)
    history['val_loss'].append(val_loss)
    history['val_acc'].append(val_acc)

    pbar.set_postfix({
        "Train Loss": f"{train_loss:.4f}",
        "Val Loss": f"{val_loss:.4f}",
        "Train Acc": f"{train_acc:.4f}",
        "Val Acc": f"{val_acc:.4f}"    
        })



[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

[A[A

Training: 100%|█| 50/50 [2:30:36<00:00, 180.73s/it, Train Loss=1.6871, Val Loss=1.5185, Train Acc=0.
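
The training cell initializes `best_val_acc`. A common way to use such a variable is to keep the weights from the best validation epoch and, optionally, to stop once accuracy plateaus. The sketch below shows the bookkeeping in plain Python. The `BestMetricTracker` class, the patience value, and the accuracy values are hypothetical and serve only as illustration.

```python
from typing import List, Optional


class BestMetricTracker:
    """Track the best validation accuracy and count epochs without improvement."""

    def __init__(self, patience: int = 10) -> None:
        self.patience = patience
        self.best_value: float = 0.0
        self.best_epoch: Optional[int] = None
        self.epochs_since_best = 0

    def update(self, epoch: int, value: float) -> bool:
        """Record a new value; return True if it is the best so far."""
        if value > self.best_value:
            self.best_value = value
            self.best_epoch = epoch
            self.epochs_since_best = 0
            return True
        self.epochs_since_best += 1
        return False

    @property
    def should_stop(self) -> bool:
        return self.epochs_since_best >= self.patience


# Hypothetical validation accuracies, not results from this notebook
val_accs: List[float] = [0.10, 0.25, 0.24, 0.30, 0.29, 0.28]
tracker = BestMetricTracker(patience=2)
for epoch, acc in enumerate(val_accs):
    if tracker.update(epoch, acc):
        pass  # in the real loop: e.g. torch.save(model.state_dict(), "best.pt")
    if tracker.should_stop:
        break
print(tracker.best_epoch, tracker.best_value)
```

In the actual loop, `tracker.update(epoch, val_acc)` would be called right after `evaluate_model`, and the saved weights would be reloaded before evaluating on the test split.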
