# Lab 3

This week we look at a new dataset, which the commands below download and unpack.

!curl https://storage.googleapis.com/aiolympiadmy/ioai-2025-tsp/lab3.zip -o lab3.zip

!unzip -q lab3.zip

The commands above should create a `train/` folder and a `test/` folder in this directory. Each folder contains an `images/` and a `labels/` subdirectory.

In this dataset, each image is associated with a text file. In the text file, each row represents a single bounding box around a person in the image. Each bounding box is represented by 5 numbers: x_center, y_center, width, height, objectness. Objectness is always 1. The other four values are expressed as fractions of the image dimensions. For example, to get the pixel location of x_center, multiply it by the width of the image.
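As a minimal sketch of the label format described above, the snippet below parses one label row and converts it to pixel corner coordinates. The helper name `label_row_to_pixel_box` and the sample row are illustrative assumptions, not part of the lab files.

```python
def label_row_to_pixel_box(row, im_width, im_height):
    """Convert 'x_center y_center width height objectness' (fractions) to pixel corners."""
    x_center, y_center, width, height, objectness = map(float, row.split())
    xmin = (x_center - width / 2) * im_width
    ymin = (y_center - height / 2) * im_height
    xmax = (x_center + width / 2) * im_width
    ymax = (y_center + height / 2) * im_height
    return xmin, ymin, xmax, ymax, objectness


# Hypothetical row: a box centered in a 400x200 image, half as wide and half as tall
box = label_row_to_pixel_box("0.5 0.5 0.5 0.5 1", im_width=400, im_height=200)
print(box)  # (100.0, 50.0, 300.0, 150.0, 1.0)
```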

Recall the network you worked with in Lab 2:
```python
class FCN(nn.Module):
    def __init__(self, backbone, head):
        super().__init__()
        self.backbone = backbone
        self.head = head

    def forward(self, x):
        x = self.backbone(x)
        x = self.head(x)
        return x

backbone = ...
head = ...
model = FCN(backbone, head)
```

Here's an idea. Given that the backbone defined above will output a spatial feature map, we can iterate over each cell location in the spatial feature map, and predict if there is an object to detect in that cell (objectness). Notice that this is very similar to FCN segmentation where we were practically doing pixel-wise classification. However, we take this a bit further. If objectness in each cell is high enough, we predict an object within a bounding box as specified by our neural network.
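The per-cell decision described above can be sketched in plain Python. This assumes the network emits raw objectness scores (logits) that are squashed with a sigmoid, and that 0.5 is the cut-off. Both are assumptions for illustration, and `cells_with_objects` is a hypothetical helper.

```python
import math


def cells_with_objects(objectness_logits, threshold=0.5):
    """Return (row, col) of every grid cell whose sigmoid(objectness) exceeds threshold."""
    hits = []
    for row, logits_row in enumerate(objectness_logits):
        for col, logit in enumerate(logits_row):
            probability = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
            if probability > threshold:
                hits.append((row, col))
    return hits


# Hypothetical 2x3 grid of raw objectness scores (logits)
grid = [[-3.0, 2.0, -1.0],
        [0.5, -4.0, -2.0]]
print(cells_with_objects(grid))  # [(0, 1), (1, 0)]
```

Each selected cell then contributes one bounding box, read from the other four channels at the same grid location.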

## Another new network

Create a new class that is almost the same as `FCN` above, but specify the `head` to be an `nn.Conv2d` layer with kernel size 1. This layer should output 5 channels, corresponding to the 5 numbers in each row of our labels. This network will be architecturally simpler than the FCN of Lab 2!

_1 pt granted upon completion of network definition_

In [4]:
import torchvision.models as models
import torch.nn as nn

class FCN(nn.Module):
    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Conv2d(in_channels=512, out_channels=5, kernel_size=1)

    def forward(self, x):
        x = self.backbone(x)
        x = self.head(x)
        return x

device = 'xpu'

backbone = models.resnet34(pretrained=True)
backbone = nn.Sequential(*list(backbone.children())[:-2])
backbone.add_module('adaptive_pool', nn.AdaptiveAvgPool2d((8, 8)))

model = FCN(backbone).to(device)



## Dataset and dataloaders

Now go ahead and build your Dataset and DataLoaders in PyTorch.

Note that the dataset should calculate and return x_offset and y_offset instead of x_center and y_center. If you leave x_center and y_center as is, you will force your network to also learn how to predict larger values of x_center and y_center with increasing values of x and y on the spatial feature map output by the backbone! You can resolve this by storing the offset between the bounding box center and the center of the cell of the spatial feature map your bounding box is in. I'll help you a little by providing an example:

```python
ipdb>  grid_size
8

ipdb>  grid_y
4

ipdb>  grid_y_min
0.5

ipdb>  y_center
0.5132275132275133

ipdb>  y_offset
0.013227513227513255
```
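The values in the debugger session above are consistent with the arithmetic below. This is a reconstruction, assuming `grid_y_min` is the normalized coordinate where the cell starts. If you measure the offset from the cell center instead, subtract `(grid_y + 0.5) / grid_size`.

```python
grid_size = 8
y_center = 0.5132275132275133

grid_y = int(y_center * grid_size)   # index of the cell containing the center
grid_y_min = grid_y / grid_size      # normalized coordinate where that cell starts
y_offset = y_center - grid_y_min     # what the dataset stores instead of y_center

print(grid_y, grid_y_min, y_offset)
```

Whichever reference point you choose, use the same one when decoding predictions back to `y_center`.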

_1 pt granted upon successfully running the code below_,

```python
train_dataset = ...
test_dataset = ...
train_dataloader = DataLoader(train_dataset, ...)
test_dataloader = DataLoader(test_dataset, ...)
one_X, one_y = next(iter(test_dataset))
batch_X, batch_y = next(iter(test_dataloader))
```

_and demonstrating the following results_:
- `one_X.shape` = (3, im_h, im_w) where im_h and im_w are height and width of the image
- `one_y.shape` = (5, gy, gx) where gy and gx are height and width of the spatial feature map
- `batch_X.shape` = (B, 3, im_h, im_w) where B is batch size
- `batch_y.shape` = (B, 5, gy, gx)

Unless you've gone out of your way to resize your images to a non-square shape, `im_h` should be equal to `im_w`, and `gy` should be equal to `gx`.

In [5]:
import torch
from torch.utils.data import Dataset, DataLoader
from PIL import Image
import os
import numpy as np
from torchvision import transforms


class DetectionDataset(Dataset):
    def __init__(self, root_dir, split='train', transform=None):
        self.root_dir = root_dir
        self.split = split
        self.transform = transform
        self.images_dir = os.path.join(root_dir, split, 'images')
        self.labels_dir = os.path.join(root_dir, split, 'labels')
        
        # Get image filenames (PNGs), sorted so the sample order is reproducible
        self.image_files = sorted(f for f in os.listdir(self.images_dir) if f.endswith('.png'))
        self.label_files = [f.replace('.png', '.txt') for f in self.image_files]
        
        # Grid dimensions (gy and gx are both 8)
        self.gy = 8
        self.gx = 8

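        # Default transform: resize to 416x416 and convert to tensor.
        # No RandomHorizontalFlip here: flipping the image without also
        # mirroring x_center in the labels would misalign the boxes.
        if self.transform is None:
            steps = [transforms.Resize((416, 416))]  # Resize to fixed dimensions
            if split == 'train':
                # Photometric augmentation only; it does not move the boxes
                steps.append(transforms.ColorJitter(0.2, 0.2, 0.2))
            steps.append(transforms.ToTensor())  # Converts to [0,1] and (C, H, W)
            self.transform = transforms.Compose(steps)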

    def __len__(self):
        return len(self.image_files)

    def __getitem__(self, idx):
        # Load image
        img_path = os.path.join(self.images_dir, self.image_files[idx])
        image = Image.open(img_path).convert("RGB")
        image_tensor = self.transform(image)  # Apply transform to PIL Image
        
        # Load labels and compute targets
        targets = torch.zeros(5, self.gy, self.gx)  # [x_offset, y_offset, width, height, objectness]
        label_path = os.path.join(self.labels_dir, self.label_files[idx])
        
        with open(label_path, 'r') as f:
            for line in f:
                data = line.strip().split()
                x_center, y_center, width, height, _ = map(float, data)  # objectness is always 1
                
                # Find cell indices (i, j); clamp so a center of exactly 1.0 stays inside the grid
                j = min(int(x_center * self.gx), self.gx - 1)  # Column index
                i = min(int(y_center * self.gy), self.gy - 1)  # Row index
                
                # Compute cell's center (normalized coordinates)
                cell_x = (j + 0.5) / self.gx
                cell_y = (i + 0.5) / self.gy
                
                # Compute offsets
                x_offset = x_center - cell_x
                y_offset = y_center - cell_y
                
                # Update targets for this cell
                targets[0, i, j] = x_offset
                targets[1, i, j] = y_offset
                targets[2, i, j] = width
                targets[3, i, j] = height
                targets[4, i, j] = 1.0  # Objectness (always 1 for valid boxes)

        return image_tensor, targets

# Example usage:
train_dataset = DetectionDataset(root_dir='lab3', split='train')
test_dataset = DetectionDataset(root_dir='lab3', split='test')

# Check first sample
try:
    img, target = train_dataset[0]
    print("Sample loaded successfully:")
    print("Image shape:", img.shape)
    print("Target shape:", target.shape)
except Exception as e:
    print("Error loading dataset:", e)

# Create Dataloaders
train_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=4, shuffle=False)

print(f"Train images: {len(train_dataset.image_files)}")  # Number of PNGs found in train/images
print(f"Test images: {len(test_dataset.image_files)}")   # Number of PNGs found in test/images

# Verify shapes
one_X, one_y = next(iter(test_dataset))
print("one_X.shape:", one_X.shape)  # Should be (3, H, W)
print("one_y.shape:", one_y.shape)  # Should be (5, 8, 8)

batch_X, batch_y = next(iter(test_dataloader))
print("batch_X.shape:", batch_X.shape)  # (4, 3, H, W)
print("batch_y.shape:", batch_y.shape)  # (4, 5, 8, 8)

Sample loaded successfully:
Image shape: torch.Size([3, 416, 416])
Target shape: torch.Size([5, 8, 8])
Train images: 136
Test images: 34
one_X.shape: torch.Size([3, 416, 416])
one_y.shape: torch.Size([5, 8, 8])
batch_X.shape: torch.Size([4, 3, 416, 416])
batch_y.shape: torch.Size([4, 5, 8, 8])


## Visualization

Being able to see your data and predictions is important for diagnosing computer vision applications. Create a visualization function that overlays bounding boxes on their respective images. You can use the function template below.

```python
def plot_batch_predictions(imgs, outputs):
    # img should have shape (batch, 3, im_h, im_w)
    # outputs should have shape (batch, 5, gy, gx)
    ...
```

Here is some sample matplotlib code to save you time:

```python
import numpy as np
import matplotlib.pyplot as plt
import torch

# This creates a matplotlib figure with 4 cols
# Note that axes is an array of individual `ax`
batch_size = 4
fig, axes = plt.subplots(ncols=batch_size, figsize=(20, 8))
axes = axes if isinstance(axes, np.ndarray) else [axes]  # Handle batch_size=1

# This is how to plot an image
# Note that imshow requires image dimensions to be (H, W, 3)
# while Pytorch works with image dimensions (3, H, W)!
mock_tensor = torch.rand(3, 128, 128)
mock_np = mock_tensor.permute(1, 2, 0).contiguous().numpy()
im_height, im_width = mock_np.shape[:2]
# Usually you would just use plt.imshow where plt will grab the latest ax
# When you have as many as 4 in this example, specify which ax to use
ax = axes[0]
ax.imshow(mock_np)

# This is how to draw rectangles using matplotlib
# Example box in normalized coordinates (fractions of the image size)
x_center, y_center, width, height = 0.5, 0.5, 0.25, 0.5
xmin = int((x_center - width / 2) * im_width)
xmax = int((x_center + width / 2) * im_width)
ymin = int((y_center - height / 2) * im_height)
ymax = int((y_center + height / 2) * im_height)
# plt.Rectangle takes the (xmin, ymin) corner, then width and height in pixels
rect = plt.Rectangle(
    (xmin, ymin), xmax - xmin, ymax - ymin,
    linewidth=2, edgecolor='red', facecolor='none'
)
# Add the rectangle to the Axes
ax.add_patch(rect)
```

Remember that your network outputs x_offset and y_offset, so you need to convert them back to x_center and y_center before drawing!
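A minimal sketch of that conversion, assuming the offset was measured from the cell center. `decode_center` is a hypothetical helper. If your dataset measures offsets from the cell's starting edge instead, drop the `+ 0.5`.

```python
def decode_center(offset, cell_index, grid_size, im_size):
    """Recover a pixel-space center from a per-cell offset (cell-center convention)."""
    cell_center = (cell_index + 0.5) / grid_size  # normalized center of the cell
    return (cell_center + offset) * im_size


# Hypothetical prediction: column 4 of an 8-wide grid, offset -0.0625, 416 px wide image
x_center_px = decode_center(offset=-0.0625, cell_index=4, grid_size=8, im_size=416)
print(x_center_px)  # 208.0
```

The same function applies to the y axis by passing the row index and the image height.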

_1 pt granted upon plotting one batch of images and labels from `test_dataloader` using `plot_batch_predictions`_

In [6]:
import matplotlib.pyplot as plt
import torch

def plot_batch_predictions(imgs, outputs, threshold=0.5):
    batch_size = imgs.shape[0]
    fig, axes = plt.subplots(ncols=batch_size, figsize=(20, 8))
    
    if batch_size == 1:  # Handle single-image batches
        axes = [axes]
    
    for idx in range(batch_size):
        ax = axes[idx]
        # Convert tensor to numpy image (H, W, 3)
        img = imgs[idx].cpu().permute(1, 2, 0).numpy()
        ax.imshow(img)
        
        output = outputs[idx].cpu().detach()
        gy, gx = output.shape[1], output.shape[2]  # Grid dimensions (8x8)
        
        # Convert outputs to bounding boxes
        for i in range(gy):
            for j in range(gx):
                obj_prob = torch.sigmoid(output[4, i, j])
                if obj_prob > threshold:
                    # Get predicted box parameters
                    x_offset = output[0, i, j].item()
                    y_offset = output[1, i, j].item()
                    width = output[2, i, j].item()
                    height = output[3, i, j].item()
                    
                    # Calculate absolute coordinates
                    cell_x = (j + 0.5) / gx  # Normalized cell center
                    cell_y = (i + 0.5) / gy
                    x_center = (cell_x + x_offset) * img.shape[1]  # Pixel coordinates
                    y_center = (cell_y + y_offset) * img.shape[0]
                    w = width * img.shape[1]
                    h = height * img.shape[0]
                    
                    # Create rectangle
                    rect = plt.Rectangle(
                        (x_center - w/2, y_center - h/2), w, h,
                        linewidth=2, edgecolor='red', facecolor='none'
                    )
                    ax.add_patch(rect)
        
        ax.axis('off')  # Remove axis ticks
    plt.show()

# Get a batch from dataloader

In [7]:
batch_X, _ = next(iter(test_dataloader))
batch_X = batch_X.to(device)

# Get predictions

In [8]:
with torch.no_grad():
  outputs = model(batch_X)

print(outputs.shape)

torch.Size([4, 5, 8, 8])


# Visualize predictions


# NOTE: Run the cell below only in Google Colab or Kaggle (not verified); elsewhere your kernel may crash (mine did)

In [None]:
#plot_batch_predictions(batch_X, outputs, threshold=0.5)


## Setting up loss calculations

Objectness is essentially a binary classification task, while predicting the correct bounding box is a regression task. 

Set up loss function calculations for objectness using binary cross entropy, and for bounding box localization using MSE. Create both losses in the same function for convenience, but return them separately instead of as a sum so they are easy to log later on. 

Concept-wise this is pretty straightforward. However, implementation-wise, you will need to place your tensors with great care. I'll help you a bit by providing you with this template below.

```python
def custom_loss(preds, targets):
    # both preds and targets should have shape (B, 5, gy, gx)
    # where B is batch size, gy and gx are spatial feature map h and w 
    ...
    return objectness_loss, localization_loss
```
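The pure-Python sketch below illustrates the two ingredients on a handful of hand-picked numbers: binary cross entropy on raw objectness scores, and MSE restricted to cells that contain an object. Restricting the localization loss to positive cells is one possible design choice, not a requirement stated above, and the helper names and values are hypothetical. In PyTorch you would operate on whole tensors instead of Python lists.

```python
import math


def bce_with_logits(logit, target):
    """Numerically stable binary cross entropy on a raw score (logit)."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def masked_mse(preds, targets, mask):
    """Mean squared error over the entries where mask is True; 0.0 if none are."""
    errors = [(p - t) ** 2 for p, t, m in zip(preds, targets, mask) if m]
    return sum(errors) / len(errors) if errors else 0.0


# Hypothetical flattened values for three grid cells
objectness_logits = [0.0, 2.0, -2.0]
objectness_targets = [0.0, 1.0, 0.0]
objectness_loss = sum(
    bce_with_logits(x, z) for x, z in zip(objectness_logits, objectness_targets)
) / len(objectness_logits)

# Only the middle cell contains an object, so only it counts for localization
has_object = [t > 0.5 for t in objectness_targets]
width_preds = [0.9, 0.4, 0.7]
width_targets = [0.0, 0.5, 0.0]
localization_loss = masked_mse(width_preds, width_targets, has_object)

print(objectness_loss, localization_loss)
```

Without the mask, the many empty cells (whose box targets are all zero) would dominate the regression loss and pull predicted widths and heights toward zero.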

_1 pt granted upon completion of loss function calculation_

In [9]:
import torch.nn.functional as F

def custom_loss(preds, targets):
    """Calculate objectness (BCE) and localization (MSE) losses"""
    # Objectness loss (binary classification)
    obj_preds = preds[:, 4, :, :]  # (B, gy, gx)
    obj_targets = targets[:, 4, :, :]
    obj_loss = F.binary_cross_entropy_with_logits(obj_preds, obj_targets)
    
    # Localization loss (regression)
    loc_preds = preds[:, :4, :, :]  # (B, 4, gy, gx)
    loc_targets = targets[:, :4, :, :]
    
    # Mask for cells containing objects
    mask = (obj_targets > 0.5).unsqueeze(1)  # (B, 1, gy, gx)
    mask = mask.expand_as(loc_preds)  # Match dimensions (B, 4, gy, gx)
    
    # Filter predictions and targets using mask
    loc_preds_filtered = loc_preds[mask]
    loc_targets_filtered = loc_targets[mask]
    
    # Calculate MSE loss only for positive cells
    loc_loss = F.mse_loss(loc_preds_filtered, loc_targets_filtered) if loc_preds_filtered.numel() > 0 \
               else torch.tensor(0.0, device=preds.device)
    
    return obj_loss, loc_loss

## Model evaluation and baseline score

Create a `test_one_epoch` function that takes the model and the test dataloader as arguments. Calculate and return box IOU score (`torchvision.ops.box_iou`) in a dictionary like so:

```python
>>> metrics = test_one_epoch(model, test_dataloader)
>>> print(metrics)
{"miou": 0.005}
```
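For intuition about the metric, here is a minimal pure-Python sketch of IoU for a single pair of boxes. `box_iou_xyxy` is a hypothetical helper for illustration. `torchvision.ops.box_iou` likewise expects boxes in corner format `(x1, y1, x2, y2)`, so center/width/height boxes must be converted first.

```python
def box_iou_xyxy(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7
print(box_iou_xyxy((0, 0, 2, 2), (1, 1, 3, 3)))
```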

_1 pt granted upon implementing `test_one_epoch` and seeing the mean IOU score of the untrained model_

In [10]:
from torchvision.ops import box_iou

def test_one_epoch(model, test_loader, threshold=0.5, img_size=416):
    model.eval()
    total_iou = 0.0
    total_samples = 0
    
    with torch.no_grad():
        for batch_X, batch_y in test_loader:
            batch_X = batch_X.to(device)
            outputs = model(batch_X)
            
            # Convert outputs and targets to boxes
            for i in range(batch_X.size(0)):  # Iterate through batch
                # Get predictions for this image
                pred_boxes = []
                output = outputs[i]  # (5, gy, gx)
                gy, gx = output.shape[1], output.shape[2]
                
                # Convert model outputs to boxes
                for y in range(gy):
                    for x in range(gx):
                        obj_score = torch.sigmoid(output[4, y, x])
                        if obj_score > threshold:
                            # Calculate box coordinates
                            x_offset = output[0, y, x].item()
                            y_offset = output[1, y, x].item()
                            width = output[2, y, x].item()
                            height = output[3, y, x].item()
                            
                            # Convert to pixel coordinates
                            cell_x = (x + 0.5) / gx
                            cell_y = (y + 0.5) / gy
                            x_center = (cell_x + x_offset) * img_size
                            y_center = (cell_y + y_offset) * img_size
                            w = width * img_size
                            h = height * img_size
                            
                            pred_boxes.append(torch.tensor([
                                x_center - w/2,  # xmin
                                y_center - h/2,  # ymin
                                x_center + w/2,  # xmax
                                y_center + h/2   # ymax
                            ]))
                
                # Convert ground truth to boxes
                gt_boxes = []
                target = batch_y[i]  # (5, gy, gx)
                for y in range(gy):
                    for x in range(gx):
                        if target[4, y, x] == 1:  # Object exists
                            x_offset = target[0, y, x].item()
                            y_offset = target[1, y, x].item()
                            width = target[2, y, x].item()
                            height = target[3, y, x].item()
                            
                            # Convert to pixel coordinates
                            cell_x = (x + 0.5) / gx
                            cell_y = (y + 0.5) / gy
                            x_center = (cell_x + x_offset) * img_size
                            y_center = (cell_y + y_offset) * img_size
                            w = width * img_size
                            h = height * img_size
                            
                            gt_boxes.append(torch.tensor([
                                x_center - w/2,
                                y_center - h/2,
                                x_center + w/2,
                                y_center + h/2
                            ]))
                
                # Calculate IoU if we have both predictions and ground truth
                if pred_boxes and gt_boxes:
                    pred_tensor = torch.stack(pred_boxes).to(device)
                    gt_tensor = torch.stack(gt_boxes).to(device)
                    
                    iou_matrix = box_iou(pred_tensor, gt_tensor)
                    best_ious = iou_matrix.max(dim=0).values  # For each GT box
                    mean_iou = best_ious.mean().item()
                    
                    total_iou += mean_iou
                    total_samples += 1
                elif gt_boxes:  # No predictions but has GT (count as 0 IoU)
                    total_samples += 1

    return {"miou": total_iou / total_samples if total_samples else 0.0}

In [11]:
metrics = test_one_epoch(model, test_dataloader)
print(f"Untrained model mIoU: {metrics['miou']:.4f}")
# Typical output: {"miou": 0.001-0.01} 

Untrained model mIoU: 0.0007


## Model training

Train your model on the training set. Track objectness loss and localization loss during training for every 10 minibatches (a). I will leave it up to you to choose how to combine your losses.
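One common option is a weighted sum. The sketch below is only an illustration: `combine_losses` and the weights `lambda_obj` and `lambda_loc` are hypothetical names, and the default values are arbitrary starting points rather than recommended settings.

```python
def combine_losses(objectness_loss, localization_loss, lambda_obj=1.0, lambda_loc=5.0):
    """Weighted sum of the two losses; works on floats and on PyTorch tensors alike."""
    return lambda_obj * objectness_loss + lambda_loc * localization_loss


# Hypothetical loss values from one minibatch
total = combine_losses(0.25, 0.5)
print(total)  # 2.75
```

Logging the two terms separately, as asked in (a), makes it easier to see which weight needs adjusting.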

At the end of every epoch, show metrics on both train (b) and test data (c), and plot prediction outputs of the first batch of the test dataset (d). Save the best performing model with the highest mean IOU score on test (e).

You don't need to run training for too long. I suspect <50 epochs will be sufficient.

_1 pt granted upon completion of (a) to (e)._

_Another 1 pt granted for exceeding 0.4 mean IOU on the test dataset._

In [12]:
import torch
import torch.optim as optim
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Training setup
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
scheduler = ReduceLROnPlateau(optimizer, mode='max', patience=3, factor=0.2, verbose=True)
best_miou = 0.0

# Training loop
for epoch in range(50):
    model.train()
    epoch_obj_loss = 0.0
    epoch_loc_loss = 0.0
    
    # Training phase
    for batch_idx, (X, y) in enumerate(train_dataloader):
        X, y = X.to(device), y.to(device)
        
        optimizer.zero_grad()
        outputs = model(X)
        obj_loss, loc_loss = custom_loss(outputs, y)
        total_loss = obj_loss + loc_loss
        total_loss.backward()
        optimizer.step()

        epoch_obj_loss += obj_loss.item()
        epoch_loc_loss += loc_loss.item()

        # Log every 10 batches
        if (batch_idx + 1) % 10 == 0:
            avg_obj = epoch_obj_loss / (batch_idx + 1)
            avg_loc = epoch_loc_loss / (batch_idx + 1)
            print(f"Epoch {epoch+1} Batch {batch_idx+1}: "
                  f"Obj Loss: {avg_obj:.4f}, Loc Loss: {avg_loc:.4f}")

    # Validation phase
    model.eval()
    train_metrics = test_one_epoch(model, train_dataloader)
    test_metrics = test_one_epoch(model, test_dataloader)
    scheduler.step(test_metrics['miou'])

    print(f"\nEpoch {epoch+1} Results:")
    print(f"Train mIoU: {train_metrics['miou']:.4f}")
    print(f"Test mIoU: {test_metrics['miou']:.4f}")
    print(f"Objectness Loss: {epoch_obj_loss/len(train_dataloader):.4f}")
    print(f"Localization Loss: {epoch_loc_loss/len(train_dataloader):.4f}\n")

    # Save best model
    if test_metrics['miou'] > best_miou:
        best_miou = test_metrics['miou']
        torch.save(model.state_dict(), "best_model.pth")
        print(f"New best model saved with mIoU: {best_miou:.4f}")

    # Visualization
    test_sample = next(iter(test_dataloader))
    test_images, _ = test_sample
    with torch.no_grad():
        outputs = model(test_images.to(device))
    plot_batch_predictions(test_images, outputs.cpu())

print(f"Training complete. Best Test mIoU: {best_miou:.4f}")



Epoch 1 Batch 10: Obj Loss: 0.2878, Loc Loss: 0.8711
Epoch 1 Batch 20: Obj Loss: 0.2183, Loc Loss: 0.4515
Epoch 1 Batch 30: Obj Loss: 0.1941, Loc Loss: 0.3552

Epoch 1 Results:
Train mIoU: 0.0000
Test mIoU: 0.0000
Objectness Loss: 0.1883
Localization Loss: 0.3192




## Post training

Create a plot that contains four subplots: image with true bounding boxes, image with predicted bounding boxes, predicted objectness over the spatial feature map, and true objectness over the spatial feature map. Repeat this plot for a few images.

_1 pt granted upon completion of the above_

In [None]:
def plot_results(imgs, outputs, targets):
    # Plot images with predicted and true boxes, objectness heatmaps...
    ...

What learning task did you just perform in this notebook?

_1 pt granted upon finding the right answer_

In [None]:
# Your answer here

This dataset is 17 years old this year. What is it called?

_1 pt granted upon finding the right answer_

In [None]:
# Your answer here

## EX: Going off track

Name a limitation of this training setup and briefly explain your reasoning.

_1 pt granted upon a satisfactory answer_

In [None]:
# Your work here

Show me how you can modify this training setup to attain better performance.

_2 pts granted upon scoring at least 0.2 mean IOU higher than the best model above. Partial credit may be granted at the grader's discretion. A bonus +1 pt may be granted for outstanding improvements._

In [None]:
# Your work here