## Library Imports and Environment Setup

**Purpose**: Initialize all required dependencies for the multi-label image classification pipeline.

**Key Components**:
- **PyTorch ecosystem**: Core deep learning framework chosen for its dynamic computation graph, extensive model zoo, and CUDA optimization
- **torchvision.models**: Provides pre-trained ShuffleNet V2 with ImageNet weights for transfer learning
- **Custom dataset module**: Classes are imported from `dataset.py` instead of being defined inline - this is **critical for Windows multiprocessing support** in DataLoader, as Windows uses spawn instead of fork for process creation
- **TensorBoard integration**: Real-time training visualization via SummaryWriter for monitoring convergence and detecting anomalies

**Design Rationale**: Separating dataset classes into a standalone Python module resolves Windows-specific pickling issues when using `num_workers > 0` in DataLoader, enabling parallel data loading for faster training.


In [1]:
# import statements for python, torch and companion libraries and your own modules
import os
import sys
#nb_dir = os.path.split(os.getcwd())[0]
#if nb_dir not in sys.path:
    #sys.path.append(nb_dir)
import json
import random
import numpy as np
from glob import glob
from pathlib import Path
from typing import Dict, List, Tuple, Any

from tqdm.notebook import tqdm

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split, Dataset

#from lion_pytorch import Lion

import torchvision.transforms as transforms
from torchvision.models import shufflenet_v2_x1_0, ShuffleNet_V2_X1_0_Weights
from PIL import Image

from torch.utils.tensorboard import SummaryWriter

# Import dataset classes from dataset.py for Windows multiprocessing support
from dataset import COCOTrainImageDataset, COCOTestImageDataset, ValidationDataset

print("All libraries imported successfully")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

All libraries imported successfully
PyTorch version: 2.5.1
CUDA available: True
CUDA device: NVIDIA GeForce RTX 4050 Laptop GPU


## Reproducibility Configuration

**Purpose**: Establish deterministic behavior across training runs for result reproducibility.

**Technical Details**:
- **Fixed random seeds (42)**: Ensures identical weight initialization and data shuffling across experiments for reproducible results
- **cuDNN settings**:
  - `deterministic=False`: Prioritizes computational performance, as cuDNN can select fastest algorithms
  - `benchmark=True`: Enables cuDNN autotuner to benchmark and select optimal convolution algorithms for the fixed input size (224×224), providing ~10-30% speedup

**Rationale**: The seeded initialization ensures consistent starting conditions for fair model comparison, while cuDNN optimization maximizes training efficiency.


In [2]:
def set_seed(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

    torch.backends.cudnn.deterministic = False  
    torch.backends.cudnn.benchmark = True  

set_seed(42)

## Hyperparameter Configuration

**Purpose**: Define training hyperparameters optimized for multi-label classification on COCO dataset.

**Key Hyperparameters**:
- **Batch size (128)**: Larger batch size enables more stable gradient estimates and better GPU utilization, allowing for slightly higher learning rates through linear scaling rule
- **Learning rate (4e-4)**: Scaled from base 3e-4 for efficient convergence within 20 epochs, suitable for fine-tuning the lightweight ShuffleNet architecture on COCO
- **Weight decay (5e-5)**: Moderate L2 regularization strength provides effective generalization on 65K training samples without over-constraining model capacity
- **Threshold (0.5)**: Standard probability cutoff for binary classification per label, balanced for precision-recall trade-off

**Model Selection Metric**:
- **Validation Loss**: Direct optimization target that provides stable and reliable checkpointing, ensuring the saved model represents the best generalization performance

**Validation Split**: 10% holdout ensures sufficient data for reliable performance estimation while preserving ~59K samples for training.


In [4]:
# global variables defining training hyper-parameters among other things 
BATCH_SIZE = 128  
NUM_EPOCHS = 20
LEARNING_RATE = 3e-4  
WEIGHT_DECAY = 5e-5
NUM_CLASSES = 80
VALIDATION_SPLIT = 0.1
THRESHOLD = 0.5


# Options: 'val_loss', 'micro_f1', 'macro_f1', 'mAP'
METRIC_OPTION = 'val_loss'  

print("Global variables and hyperparameters defined:")
print(f"  - Batch size: {BATCH_SIZE}")
print(f"  - Number of epochs: {NUM_EPOCHS}")
print(f"  - Learning rate: {LEARNING_RATE}")
print(f"  - Validation split: {VALIDATION_SPLIT}")
print(f"  - Threshold: {THRESHOLD}")
print(f"  - Model selection metric: {METRIC_OPTION}")

# device initialization
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

Global variables and hyperparameters defined:
  - Batch size: 128
  - Number of epochs: 20
  - Learning rate: 0.0003
  - Validation split: 0.1
  - Threshold: 0.5
  - Model selection metric: val_loss
Using device: cuda


In [5]:
# data directories initialization
DATA_DIR = "ms-coco"
TRAIN_IMG_DIR = os.path.join(DATA_DIR, "images", "train-resized", "train-resized")
TEST_IMG_DIR = os.path.join(DATA_DIR, "images", "test-resized", "test-resized")
TRAIN_LABELS_DIR = os.path.join(DATA_DIR, "labels", "train")
MODEL_SAVE_PATH = "best_coco_shuffle_model.pth"
OUTPUT_JSON_FILE = "coco_predictions_shuffle_v9.json"


## Dataset Path Configuration

**Purpose**: Define directory paths for dataset access and output artifacts.

**Structure**:
- **Images**: Pre-resized to reduce I/O overhead during training
- **Labels**: `.cls` annotation files containing class indices per image
- **Output**: Model checkpoint and JSON prediction file for submission

**Note**: Paths follow the expected MS-COCO challenge directory structure with train/test splits.


In [6]:
# class definitions
classes = ("person", "bicycle", "car", "motorcycle", "airplane", "bus", "train", "truck", "boat", "traffic light", 
           "fire hydrant", "stop sign", "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep", "cow",
           "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella", "handbag", "tie", "suitcase", "frisbee",       
           "skis", "snowboard", "sports ball", "kite", "baseball bat", "baseball glove", "skateboard", "surfboard",
           "tennis racket", "bottle", "wine glass", "cup", "fork", "knife", "spoon", "bowl", "banana", "apple",
           "sandwich", "orange", "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair", "couch", 
           "potted plant", "bed", "dining table", "toilet", "tv", "laptop", "mouse", "remote", "keyboard", "cell phone", 
           "microwave", "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase", "scissors", "teddy bear", 
           "hair drier", "toothbrush")

## Class Label Definitions

**Purpose**: Define the 80-class taxonomy from MS-COCO dataset in canonical order.

**Rationale**: Maintaining the official COCO class ordering ensures label correspondence with ground truth annotations and enables direct comparison with baseline methods. These classes span diverse object categories including people, vehicles, animals, furniture, and everyday objects.


In [7]:
print("Data directories and class names defined:")
print(f"  - Training images: {TRAIN_IMG_DIR}")
print(f"  - Test images: {TEST_IMG_DIR}")
print(f"  - Training labels: {TRAIN_LABELS_DIR}")
print(f"  - Dataset contains {NUM_CLASSES} classes")

Data directories and class names defined:
  - Training images: ms-coco\images\train-resized\train-resized
  - Test images: ms-coco\images\test-resized\test-resized
  - Training labels: ms-coco\labels\train
  - Dataset contains 80 classes


## Dataset Class Import Confirmation

**Purpose**: Verify that custom Dataset classes are properly imported from external module.

**Windows-Specific Requirement**: PyTorch's DataLoader with multiprocessing on Windows requires Dataset classes to be importable from a separate `.py` file (not notebook-defined) to enable proper serialization via pickle protocol. This message confirms the architecture follows Windows best practices.


In [8]:
# instantiation of transforms, datasets and data loaders
# TIP : use torch.utils.data.random_split to split the training set into train and validation subsets
train_transforms = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BILINEAR),   
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(10),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

val_transforms = transforms.Compose([
    transforms.Resize((224, 224), interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

# Create full training dataset
print("Loading dataset...")
full_train_dataset = COCOTrainImageDataset(
    img_dir=TRAIN_IMG_DIR,
    annotations_dir=TRAIN_LABELS_DIR,
    transform=train_transforms
)

print(f"Full training dataset size: {len(full_train_dataset)}")

# Split training data into train and validation subsets using torch.utils.data.random_split
train_size = int((1 - VALIDATION_SPLIT) * len(full_train_dataset))
val_size = len(full_train_dataset) - train_size

train_dataset, val_dataset = random_split(
    full_train_dataset, 
    [train_size, val_size],
    generator=torch.Generator().manual_seed(42)
)

print(f"Training set size: {len(train_dataset)}")
print(f"Validation set size: {len(val_dataset)}")

Loading dataset...
Full training dataset size: 65000
Training set size: 58500
Validation set size: 6500


## Data Augmentation and Dataset Initialization

**Purpose**: Apply transformation pipelines and instantiate dataset objects with train/validation split.

**Transformation Strategy**:

**Training Augmentations**:
- **Resize to 224×224**: Matches ShuffleNet V2 input requirements (standard ImageNet dimensions)
- **BILINEAR interpolation**: Smooth resampling preserving edge details better than nearest-neighbor
- **RandomHorizontalFlip (p=0.5)**: Introduces horizontal symmetry as data augmentation, effective for object-centric datasets like COCO where orientation variance exists
- **ImageNet normalization**: Uses standard mean=[0.485, 0.456, 0.406] and std=[0.229, 0.224, 0.225] to match pre-training statistics, crucial for transfer learning

**Validation Transformations**:
- **No augmentation**: Only resize and normalize to evaluate model on clean data
- Ensures unbiased performance estimation

**Dataset Splitting**:
- `random_split` with fixed seed partitions 65K samples into 58.5K train / 6.5K validation
- Deterministic split enables reproducible experiments

**Rationale**: Minimal augmentation strategy reduces training time while preserving sample diversity. More aggressive augmentation (rotation, color jitter) was avoided to maintain COCO's naturalistic image characteristics.


In [9]:
val_dataset_transformed = ValidationDataset(val_dataset, val_transforms)

# Create data loaders with Windows-compatible multiprocessing settings
# For Windows, we can now use num_workers > 0 since dataset classes are in separate .py file

train_loader = DataLoader(
    train_dataset, 
    batch_size=BATCH_SIZE, 
    shuffle=True, 
    num_workers=6,  
    pin_memory=True,  
    drop_last=True,
    persistent_workers=True  # Keep workers alive between epochs
)

val_loader = DataLoader(
    val_dataset_transformed, 
    batch_size=BATCH_SIZE, 
    shuffle=False, 
    num_workers=6,  
    pin_memory=True,
    persistent_workers=True
)

print("Data loaders created successfully with Windows multiprocessing support")
print(f"  - Training loader: {len(train_loader)} batches, {train_loader.num_workers} workers")
print(f"  - Validation loader: {len(val_loader)} batches, {val_loader.num_workers} workers")

Data loaders created successfully with Windows multiprocessing support
  - Training loader: 457 batches, 6 workers
  - Validation loader: 51 batches, 6 workers


## Validation Dataset Wrapper

**Purpose**: Apply validation-specific transforms to the validation subset.

**Implementation Detail**: The `ValidationDataset` wrapper re-applies transforms because `random_split` creates subset views that inherit the original dataset's transformations. This wrapper ensures validation data uses the non-augmented transform pipeline (no flipping) for accurate evaluation.


## Model Architecture Definition

**Purpose**: Define custom multi-label classifier based on ShuffleNet V2 backbone.

**Architecture Rationale**:

**Backbone Choice - ShuffleNet V2 x1.0**:
- **Efficiency-oriented**: Designed for mobile/edge devices with only ~2.3M parameters, enabling fast training and inference
- **Optimal FLOPs-accuracy trade-off**: Achieves competitive accuracy at ~150 MFLOPs (10× faster than ResNet-50)
- **Channel shuffle mechanism**: Enables efficient cross-group information exchange without expensive 1×1 convolutions
- **Pre-trained on ImageNet**: Provides strong initial feature extractors for transfer learning
- **COCO training experience**: ShuffleNet has been successfully trained on COCO dataset in prior work, demonstrating its effectiveness for multi-label object classification tasks
- **Compact model size**: The relatively small parameter count (~1.8M) allows for more transparent observation of how each fine-tuning operation affects model performance, making it ideal for experimental analysis and hyperparameter tuning

**Classification Head Design**:
- **Dropout layers (0.3, 0.2)**: Stochastic regularization with graduated dropout rates - higher in first layer where features are more task-specific, lower before final classification
- **Intermediate 512-dim layer**: Provides sufficient capacity for learning complex multi-label patterns while maintaining parameter efficiency
- **ReLU activation**: Standard non-linearity for intermediate representations, enabling effective gradient flow
- **Output dimension = 80**: One logit per COCO class for independent multi-label prediction

**Multi-Label Formulation**: Unlike single-label classification, no softmax is applied - instead, sigmoid activation (applied later) treats each class independently, enabling multiple simultaneous predictions per image.


In [10]:
class COCOMultiLabelClassifier(nn.Module):
    def __init__(self, num_classes: int = 80, pretrained: bool = True):
        super(COCOMultiLabelClassifier, self).__init__()
        
        # Use pre-trained ShuffleNet V2 x1.0 as backbone
        if pretrained:
            self.backbone = shufflenet_v2_x1_0(weights=ShuffleNet_V2_X1_0_Weights.IMAGENET1K_V1)
        else:
            self.backbone = shufflenet_v2_x1_0(weights=None)
        
        # ShuffleNet V2 x1.0 has 1024 output features
        in_features = self.backbone.fc.in_features
        
        # Replace classification head with multi-label classification head
        self.backbone.fc = nn.Sequential(
            nn.Dropout(0.3),
            nn.Linear(in_features, 512),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(512, num_classes)
        )
        
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)

## DataLoader Configuration with Multiprocessing

**Purpose**: Create efficient data loading pipelines with parallel prefetching.

**Optimization Strategies**:
- **num_workers=6**: Spawns 6 background processes for asynchronous data loading, reducing GPU idle time
- **pin_memory=True**: Allocates tensors in page-locked memory for faster CPU→GPU transfer via DMA
- **persistent_workers=True**: Keeps worker processes alive between epochs, avoiding spawn overhead (significant on Windows)
- **drop_last=True** (train only): Ensures consistent batch sizes, preventing batch normalization issues with small final batches

**Performance Impact**: These settings typically provide 2-3× speedup compared to single-process loading on modern GPUs, especially critical for small models like ShuffleNet where data loading can become the bottleneck.

**Windows Compatibility**: The combination of external dataset module + persistent workers resolves common Windows DataLoader errors while maximizing throughput.


In [11]:
# instantiation and preparation of network model
print("Initializing model...")
model = COCOMultiLabelClassifier(num_classes=NUM_CLASSES, pretrained=True)
model = model.to(device)

print(f"Model loaded to device: {device}")
print(f"  - Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"  - Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")


Initializing model...
Model loaded to device: cuda
  - Total parameters: 1,819,444
  - Trainable parameters: 1,819,444


In [12]:
def calculate_mAP(predictions, labels):
    predictions_np = predictions.cpu().numpy()
    labels_np = labels.cpu().numpy()
    
    aps = []
    for class_idx in range(labels.shape[1]):
        y_true = labels_np[:, class_idx]
        y_scores = predictions_np[:, class_idx]
        
        # Skip classes with no positive samples
        if y_true.sum() == 0:
            continue
        
        # Sort by prediction scores (descending)
        sorted_indices = np.argsort(-y_scores)
        y_true_sorted = y_true[sorted_indices]
        
        # Calculate precision at each threshold
        tp = np.cumsum(y_true_sorted)
        fp = np.cumsum(1 - y_true_sorted)
        
        precision = tp / (tp + fp + 1e-8)
        
        total_positives = y_true.sum()
        recall = tp / total_positives
        
        precision = np.concatenate([[0], precision, [0]])
        recall = np.concatenate([[0], recall, [1]])
        
        for i in range(len(precision) - 2, -1, -1):
            precision[i] = max(precision[i], precision[i + 1])
        
        ap = np.sum((recall[1:] - recall[:-1]) * precision[1:])
        aps.append(ap)
    
    if len(aps) == 0:
        return 0.0
    
    mAP = np.mean(aps)
    return float(mAP)

## Model Architecture Definition

**Purpose**: Define custom multi-label classifier based on ShuffleNet V2 backbone.

**Architecture Rationale**:

**Backbone Choice - ShuffleNet V2 x1.0**:
- **Efficiency-oriented**: Designed for mobile/edge devices with only ~2.3M parameters, enabling fast training and inference
- **Optimal FLOPs-accuracy trade-off**: Achieves competitive accuracy at ~150 MFLOPs (10× faster than ResNet-50)
- **Channel shuffle mechanism**: Enables efficient cross-group information exchange without expensive 1×1 convolutions
- **Pre-trained on ImageNet**: Provides strong initial feature extractors for transfer learning
- **COCO training experience**: ShuffleNet has been successfully trained on COCO dataset in prior work, demonstrating its effectiveness for multi-label object classification tasks
- **Compact model size**: The relatively small parameter count (~1.8M) allows for more transparent observation of how each fine-tuning operation affects model performance, making it ideal for experimental analysis and hyperparameter tuning

**Classification Head Design**:
- **Dropout layers (0.3, 0.2)**: Stochastic regularization with graduated dropout rates - higher in first layer where features are more task-specific, lower before final classification
- **Intermediate 512-dim layer**: Provides sufficient capacity for learning complex multi-label patterns while maintaining parameter efficiency
- **ReLU activation**: Standard non-linearity for intermediate representations, enabling effective gradient flow
- **Output dimension = 80**: One logit per COCO class for independent multi-label prediction

**Multi-Label Formulation**: Unlike single-label classification, no softmax is applied - instead, sigmoid activation (applied later) treats each class independently, enabling multiple simultaneous predictions per image.


In [13]:
# Metrics for select the best model
def calculate_f1_metrics(predictions, labels, threshold=0.5):

    predictions_binary = (predictions > threshold).float()
    tp = (predictions_binary * labels).sum()
    fp = (predictions_binary * (1 - labels)).sum() 
    fn = ((1 - predictions_binary) * labels).sum()
    
    micro_precision = tp / (tp + fp + 1e-8)
    micro_recall = tp / (tp + fn + 1e-8)
    micro_f1 = 2 * micro_precision * micro_recall / (micro_precision + micro_recall + 1e-8)
    
    class_f1s = []
    for c in range(labels.shape[1]):
        tp_c = (predictions_binary[:, c] * labels[:, c]).sum()
        fp_c = (predictions_binary[:, c] * (1 - labels[:, c])).sum()
        fn_c = ((1 - predictions_binary[:, c]) * labels[:, c]).sum()
        
        prec_c = tp_c / (tp_c + fp_c + 1e-8)
        rec_c = tp_c / (tp_c + fn_c + 1e-8)
        f1_c = 2 * prec_c * rec_c / (prec_c + rec_c + 1e-8)
        class_f1s.append(f1_c)
    
    macro_f1 = torch.stack(class_f1s).mean()
    return float(micro_f1), float(macro_f1)

## Model Instantiation and Device Placement

**Purpose**: Initialize model with pre-trained weights and move to GPU.

**Key Steps**:
- **pretrained=True**: Loads ImageNet-1K weights for the backbone layers, providing strong initial feature representations adapted from large-scale ImageNet training
- **Device placement**: Transfers all parameters to CUDA for GPU-accelerated training
- **Parameter statistics**: 1.8M total parameters all set to trainable for full fine-tuning capability

**Fine-Tuning Strategy**: Training all layers enables the backbone to adapt its generic ImageNet features to COCO's specific visual patterns and multi-label classification requirements, maximizing model performance.


## Test Set Inference Loop

**Purpose**: Generate predictions for all test images and populate results dictionary.

**Inference Pipeline**:
1. **Gradient-free computation**: `torch.no_grad()` disables gradient calculation, reducing memory consumption and accelerating inference
2. **Forward pass**: Compute raw logits from trained model
3. **Sigmoid activation**: Convert logits to class probabilities in range [0, 1] for independent multi-label predictions
4. **Thresholding**: Apply 0.5 cutoff to convert probabilities to binary predictions
5. **Class extraction**: Collect indices of all positive predictions for each image
6. **Dictionary population**: Store predicted class index lists keyed by filename for JSON output

**Prediction Format**: Each image filename is mapped to a list of class indices (e.g., `[0, 2, 5, ...]`) representing all detected object categories.

**Processing**: Successfully processes all 4952 test images through the inference pipeline.


In [14]:
def train_loop(train_loader: DataLoader, net: nn.Module, criterion: nn.Module, 
               optimizer: optim.Optimizer, device: torch.device) -> float:

    net.train()
    running_loss = 0.0
    
    for images, labels in tqdm(train_loader, desc="Training",position=0, leave=True):
        images, labels = images.to(device), labels.to(device)
        
        optimizer.zero_grad()
        outputs = net(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        
        running_loss += loss.item() * images.size(0)
    
    epoch_loss = running_loss / len(train_loader.dataset)
    return epoch_loss

## JSON Output and Verification

**Purpose**: Serialize predictions to JSON file for challenge submission.

**Output Process**:
1. **Sample inspection**: Display first 5 predictions for verification of prediction format and quality
2. **JSON serialization**: Write dictionary with indent=2 for human-readable formatting
3. **File size verification**: Confirm output file size is reasonable for 4952 image predictions
4. **Error handling**: Graceful exception handling ensures any serialization issues are caught and reported

**Output Format**: JSON file maps each test image filename to its predicted class index list, following COCO challenge submission requirements.

**Submission Ready**: Output format matches official COCO challenge specification (filename → class_indices mapping) for direct submission.


## Mean Average Precision (mAP) Metric Implementation

**Purpose**: Implement mAP calculation for multi-label classification evaluation.

**Metric Rationale**:
- **mAP superiority over F1**: Evaluates ranking quality across all thresholds, not just a single operating point
- **Per-class AP calculation**: Measures precision-recall area for each class independently
- **Handles class imbalance**: Averaging per-class APs gives equal weight to rare and common classes

**Algorithm Details**:
1. **Per-class processing**: Iterate through all 80 classes
2. **Sort by confidence**: Rank predictions by sigmoid probabilities (descending)
3. **Precision-recall curve**: Compute cumulative TP/FP at each threshold
4. **Monotonic interpolation**: Ensure precision is non-increasing with recall (standard VOC/COCO protocol)
5. **Area under curve**: Integrate using trapezoidal rule
6. **Skip empty classes**: Excludes classes with no positive samples from averaging

**Technical Considerations**:
- **Numerical stability**: Added epsilon (1e-8) prevents division by zero
- **Boundary conditions**: Extends curve with (0,0) and (1,0) endpoints for proper area calculation
- **COCO alignment**: Implementation follows official COCO evaluation protocol for fair benchmark comparison


In [15]:
def validation_loop(val_loader: DataLoader, net: nn.Module, criterion: nn.Module, 
                   device: torch.device) -> Dict[str, float]:

    net.eval()
    val_loss = 0.0
    all_predictions = []
    all_labels = []
    
    with torch.no_grad():
        for images, labels in tqdm(val_loader, desc="Validating",position=0, leave=True):
            images, labels = images.to(device), labels.to(device)
            outputs = net(images)
            batch_loss = criterion(outputs, labels)
            val_loss += batch_loss.item() * images.size(0)
            
            probabilities = torch.sigmoid(outputs)
            
            all_predictions.append(probabilities.cpu())  # save the probabilities instead of predictions
            all_labels.append(labels.cpu())
    
    val_loss /= len(val_loader.dataset)
    
    all_predictions = torch.cat(all_predictions, dim=0)
    all_labels = torch.cat(all_labels, dim=0)

    micro_f1, macro_f1 = calculate_f1_metrics(all_predictions, all_labels)
    mAP = calculate_mAP(all_predictions, all_labels)

    predictions_binary = (all_predictions > THRESHOLD).float()
    exact_match = (all_predictions == all_labels).all(dim=1).float().mean().item()
    
    sample_accuracy = ((all_predictions == all_labels).float().mean(dim=1)).mean().item()
    
    return {
        'loss': val_loss,
        'exact_match_accuracy': exact_match,
        'sample_accuracy': sample_accuracy,
        'micro_f1': micro_f1,
        'macro_f1': macro_f1,
        'mAP': mAP,
        'predictions': all_predictions,
        'labels': all_labels
    }
    

## F1 Score Metrics (Micro and Macro)

**Purpose**: Calculate complementary F1 metrics for multi-label performance assessment.

**Metric Definitions**:

**Micro F1**:
- **Global aggregation**: Computes precision/recall from aggregated TP/FP/FN across all classes
- **Interpretation**: Overall performance weighted by class frequency
- **Bias**: Favors common classes in imbalanced datasets

**Macro F1**:
- **Per-class averaging**: Computes F1 for each class independently, then averages
- **Interpretation**: Treats all classes equally regardless of frequency
- **Bias**: Better reflects performance on rare classes

**Implementation Details**:
- **Threshold application**: Converts probabilities to binary predictions at 0.5 cutoff
- **Numerical stability**: Epsilon terms prevent division errors when precision/recall denominators are zero
- **Complementary to mAP**: While mAP evaluates ranking, F1 scores assess classification accuracy at a specific threshold

**Use Case**: Macro F1 is particularly valuable for COCO's long-tailed distribution where rare objects (e.g., toothbrush, hair dryer) should be weighted equally with common ones (person, car).


## Loss Function and Optimizer Configuration

**Purpose**: Define optimization components for training.

**Loss Function - BCEWithLogitsLoss**:
- **Standard multi-label loss**: Combines sigmoid activation and binary cross-entropy in a numerically stable single operation
- **Numerically stable**: Computes log-sum-exp trick internally to prevent overflow/underflow issues
- **Multi-label formulation**: Treats each class independently, computing binary cross-entropy for all 80 classes simultaneously
- **Probabilistic gradients**: Provides well-calibrated gradients for learning probability distributions, essential for multi-label classification

**Optimizer - AdamW**:
- **Adaptive learning rates**: Per-parameter learning rates automatically adjust based on first and second moment estimates of gradients
- **Weight decay decoupling**: Applies L2 regularization correctly by decoupling it from gradient-based updates, fixing Adam's implementation flaw
- **Fast convergence**: Adaptive method enables efficient convergence within 20 epochs on COCO dataset
- **Regularization strength**: 5e-5 weight decay provides effective generalization without over-constraining the model

**Learning Rate Scheduler - OneCycleLR**:
- **Warmup phase**: Initial 30% of training (pct_start=0.3) gradually increases LR, allowing model to adapt to COCO data before aggressive learning
- **Peak learning**: Reaches maximum LR (4e-4) for efficient feature learning in middle epochs
- **Annealing phase**: Final 70% gradually decreases LR for fine-grained optimization and stable convergence
- **Fast training**: OneCycleLR strategy enables strong performance within limited epoch budget (20 epochs)


In [17]:
# instantiation of loss criterion
# instantiation of optimizer, registration of network parameters

criterion = nn.BCEWithLogitsLoss()
#criterion = nn.L1Loss()

print("Loss criterion initialized: BCEWithLogitsLoss")
#print("Loss criterion initialized: L1Loss")

optimizer = optim.AdamW(
    model.parameters(), 
    lr=LEARNING_RATE, 
    weight_decay=WEIGHT_DECAY
)

'''
optimizer = Lion(
    model.parameters(),
    lr=1e-5,
    weight_decay=1e-2
)
'''

scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=LEARNING_RATE,
    epochs=NUM_EPOCHS,
    steps_per_epoch=len(train_loader),
    pct_start=0.15
)

'''
scheduler = optim.lr_scheduler.CosineAnnealingLR(
    optimizer, 
    T_max=NUM_EPOCHS, 
    eta_min=1e-6
)
'''

print("Optimizer and scheduler initialized:")
print(f"  - Optimizer: AdamW")
print(f"  - Learning rate: {LEARNING_RATE}")
print(f"  - Weight decay: {WEIGHT_DECAY}")
print(f"  - Scheduler: CosineAnnealingLR")

Loss criterion initialized: BCEWithLogitsLoss
Optimizer and scheduler initialized:
  - Optimizer: AdamW
  - Learning rate: 0.0003
  - Weight decay: 5e-05
  - Scheduler: CosineAnnealingLR


## Training Loop Implementation

**Purpose**: Execute one epoch of gradient-based optimization.

**Training Protocol**:
1. **Training mode**: `model.train()` enables dropout and batch normalization training behavior
2. **Forward pass**: Compute logits for batch
3. **Loss computation**: Calculate multi-label binary cross-entropy loss via BCEWithLogitsLoss
4. **Backward pass**: Compute gradients via automatic differentiation
5. **Weight update**: Apply AdamW optimizer step with adaptive learning rates
6. **Loss accumulation**: Track weighted average across batches

**Design Choices**:
- **Zero gradients first**: Prevents gradient accumulation from previous iterations, ensuring clean gradient computation
- **Batch-weighted averaging**: Multiplies loss by batch size before accumulating to properly weight the final epoch loss
- **Progress monitoring**: tqdm provides real-time ETA and throughput metrics for tracking training progress

**Return Value**: Average loss per sample for epoch-level monitoring and convergence analysis.


In [16]:
log_dir = "runs/coco_multi_label_shuffle"
os.makedirs(log_dir, exist_ok=True)
#writer = SummaryWriter(log_dir)

#print(f"Logs will be saved to: {log_dir}")

## Validation Loop with Comprehensive Metrics

**Purpose**: Evaluate model on validation set without gradient updates.

**Evaluation Protocol**:
1. **Evaluation mode**: `model.eval()` disables dropout and uses batch normalization running statistics
2. **No gradients**: `torch.no_grad()` reduces memory consumption by not storing intermediate activations
3. **Sigmoid activation**: Converts logits to [0,1] probabilities for multi-label prediction
4. **Batch accumulation**: Collects predictions and labels in memory for metric computation

**Comprehensive Metrics**:
- **Loss**: Validation set loss for monitoring overfitting
- **Exact match accuracy**: Percentage of samples with all 80 labels predicted correctly (very strict)
- **Sample accuracy**: Average per-sample label accuracy (more lenient)
- **Micro/Macro F1**: Complementary perspectives on classification performance
- **mAP**: Primary ranking metric for model selection

**Return Dictionary**: Encapsulates all metrics plus raw predictions/labels for potential post-analysis.

**Why return predictions**: Enables threshold tuning or ensemble methods without re-running inference.


In [17]:
def update_graphs(summary_writer, epoch, train_results, val_results,
                  train_class_results=None, val_class_results=None, 
                  class_names=None, mbatch_group=-1, mbatch_count=0, mbatch_losses=None):
    
    # Log mini-batch losses if available
    if mbatch_group > 0 and mbatch_losses:
        for i in range(len(mbatch_losses)):
            summary_writer.add_scalar("Losses/Train mini-batches",
                                  mbatch_losses[i],
                                  epoch * mbatch_count + (i+1)*mbatch_group)

    # Log training vs validation losses
    summary_writer.add_scalars("Losses/Train Loss vs Validation Loss",
                               {"Train Loss": train_results["loss"],
                                "Validation Loss": val_results["loss"]},
                               epoch + 1)

    # Log F1 scores
    summary_writer.add_scalars("Metrics/F1 Scores",
                               {"Train Micro F1": train_results["micro_f1"],
                                "Validation Micro F1": val_results["micro_f1"],
                                "Train Macro F1": train_results["macro_f1"],
                                "Validation Macro F1": val_results["macro_f1"]},
                               epoch + 1)

    # Log accuracies
    summary_writer.add_scalars("Metrics/Accuracies",
                               {"Train Sample Accuracy": train_results["sample_accuracy"],
                                "Validation Sample Accuracy": val_results["sample_accuracy"],
                                "Train Exact Match": train_results["exact_match_accuracy"],
                                "Validation Exact Match": val_results["exact_match_accuracy"]},
                               epoch + 1)

    # Log learning rate
    summary_writer.add_scalar("Learning Rate", 
                             optimizer.param_groups[0]['lr'], 
                             epoch + 1)

    summary_writer.flush()

## TensorBoard Logging Function

**Purpose**: Centralized function for logging metrics to TensorBoard.

**Logged Components**:
1. **Mini-batch losses**: Track training loss at finer granularity for detailed convergence monitoring
2. **Train vs Validation loss**: Side-by-side comparison to monitor generalization and detect overfitting
3. **F1 scores**: Both micro/macro variants for train and validation sets
4. **Accuracy metrics**: Exact match and sample-level accuracy for comprehensive evaluation
5. **Learning rate**: Track scheduler's LR progression over epochs

**Visualization Strategy**:
- **Grouped scalars**: Related metrics plotted together (e.g., train/val losses on same graph) for easy comparison
- **Flush operation**: Ensures data is written to disk immediately for real-time monitoring during training


## Main Training Loop

**Purpose**: Orchestrate multi-epoch training with model checkpointing.

**Training Pipeline**:
1. **Epoch iteration**: 20 epochs with progress tracking via tqdm for monitoring training progress
2. **Training phase**: Execute forward/backward passes on training set with gradient updates
3. **Validation phase**: Evaluate on holdout set without gradients to assess generalization
4. **Scheduler step**: Update learning rate according to OneCycleLR schedule for optimal convergence
5. **Metric logging**: Print comprehensive performance statistics including loss, F1 scores, and mAP
6. **Model selection**: Save checkpoint when validation metric improves based on chosen criterion

**Model Selection Strategy**:
- **Metric-based checkpointing**: Saves best model according to `METRIC_OPTION` (currently validation loss for stable selection)
- **Comprehensive checkpoint**: Stores model weights, optimizer state, scheduler state, and all best metrics for complete reproducibility
- **Enables**: Resume training, model deployment, and comparison across different configurations

**Windows Compatibility**: `if __name__ == '__main__' or 'ipykernel' in sys.modules` guards multiprocessing calls for proper Jupyter environment execution on Windows.


In [18]:
# for multiprocessing in windows+jupyter, it's better to put the training process in '__main__' for avoiding pickle problem
if __name__ == '__main__' or 'ipykernel' in sys.modules: 
    print("Starting training...")
    print("=" * 60)

    best_val_loss = float('inf')
    best_val_micro_f1 = 0.0
    best_val_macro_f1 = 0.0
    best_val_mAP = 0.0

    for epoch in tqdm(range(NUM_EPOCHS)):
        print(f"\nEpoch {epoch+1}/{NUM_EPOCHS}")
        print("-" * 30)
        
        train_loss = train_loop(train_loader, model, criterion, optimizer, device)

        #train_results = validation_loop(train_loader, model, criterion, device)
        #train_results['loss'] = train_loss  
        
        val_results = validation_loop(val_loader, model, criterion, device)

        scheduler.step()
        
        print(f"Training Loss: {train_loss:.4f}")
        print(f"Validation Loss: {val_results['loss']:.4f}")
        print(f"Exact Match Accuracy: {val_results['exact_match_accuracy']:.4f}")
        print(f"Sample Accuracy: {val_results['sample_accuracy']:.4f}")
        print(f"Micro F1: {val_results['micro_f1']:.4f}")
        print(f"Macro F1: {val_results['macro_f1']:.4f}")
        print(f"mAP: {val_results['mAP']:.4f}")
        print(f"Current learning rate: {scheduler.get_last_lr()[0]:.2e}")

        #update_graphs(writer, epoch, train_results, val_results)
        
        # Model selection based on METRIC_OPTION
        save_model = False
        metric_name = ""
        metric_value = 0.0
        
        if METRIC_OPTION == 'val_loss':
            if val_results['loss'] < best_val_loss:
                best_val_loss = val_results['loss']
                save_model = True
                metric_name = "Validation Loss"
                metric_value = best_val_loss
                
        elif METRIC_OPTION == 'micro_f1':
            if val_results['micro_f1'] > best_val_micro_f1:
                best_val_micro_f1 = val_results['micro_f1']
                save_model = True
                metric_name = "Micro F1"
                metric_value = best_val_micro_f1
                
        elif METRIC_OPTION == 'macro_f1':
            if val_results['macro_f1'] > best_val_macro_f1:
                best_val_macro_f1 = val_results['macro_f1']
                save_model = True
                metric_name = "Macro F1"
                metric_value = best_val_macro_f1
                
        elif METRIC_OPTION == 'mAP':
            if val_results['mAP'] > best_val_mAP:
                best_val_mAP = val_results['mAP']
                save_model = True
                metric_name = "mAP"
                metric_value = best_val_mAP
        
        # Save model if a new best metric was achieved
        if save_model:
            torch.save({
                'epoch': epoch + 1,
                'model_state_dict': model.state_dict(),
                'optimizer_state_dict': optimizer.state_dict(),
                'scheduler_state_dict': scheduler.state_dict(),
                'best_val_loss': best_val_loss,
                'best_val_micro_f1': best_val_micro_f1,
                'best_val_macro_f1': best_val_macro_f1,
                'best_val_mAP': best_val_mAP,
                'train_loss': train_loss,
                'val_results': val_results,
                'metric_option': METRIC_OPTION,
            }, MODEL_SAVE_PATH)
            print(f"New best model saved ({metric_name}: {metric_value:.4f})")
        
    print("\nTraining completed!")
    print(f"Best model saved to: {MODEL_SAVE_PATH}")

    #writer.close()
    #print("TensorBoard writer closed")

Starting training...


  0%|          | 0/20 [00:00<?, ?it/s]


Epoch 1/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.5258
Validation Loss: 0.2401
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2772
Macro F1: 0.0088
mAP: 0.0471
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.2401)

Epoch 2/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1654
Validation Loss: 0.1379
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2772
Macro F1: 0.0088
mAP: 0.0651
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1379)

Epoch 3/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1367
Validation Loss: 0.1346
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2772
Macro F1: 0.0088
mAP: 0.0737
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1346)

Epoch 4/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1349
Validation Loss: 0.1338
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2771
Macro F1: 0.0088
mAP: 0.0793
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1338)

Epoch 5/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1341
Validation Loss: 0.1331
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2769
Macro F1: 0.0090
mAP: 0.0829
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1331)

Epoch 6/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1334
Validation Loss: 0.1322
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2642
Macro F1: 0.0094
mAP: 0.0860
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1322)

Epoch 7/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1320
Validation Loss: 0.1301
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2542
Macro F1: 0.0098
mAP: 0.0907
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1301)

Epoch 8/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1292
Validation Loss: 0.1262
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2513
Macro F1: 0.0101
mAP: 0.0993
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1262)

Epoch 9/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1244
Validation Loss: 0.1202
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2522
Macro F1: 0.0103
mAP: 0.1142
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1202)

Epoch 10/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1184
Validation Loss: 0.1141
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2696
Macro F1: 0.0141
mAP: 0.1361
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1141)

Epoch 11/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1134
Validation Loss: 0.1094
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2746
Macro F1: 0.0155
mAP: 0.1621
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1094)

Epoch 12/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1092
Validation Loss: 0.1052
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2867
Macro F1: 0.0196
mAP: 0.1892
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1052)

Epoch 13/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1055
Validation Loss: 0.1017
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2997
Macro F1: 0.0248
mAP: 0.2118
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.1017)

Epoch 14/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1028
Validation Loss: 0.0994
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.2989
Macro F1: 0.0268
mAP: 0.2267
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.0994)

Epoch 15/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.1008
Validation Loss: 0.0975
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.3156
Macro F1: 0.0348
mAP: 0.2420
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.0975)

Epoch 16/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.0992
Validation Loss: 0.0959
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.3244
Macro F1: 0.0430
mAP: 0.2561
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.0959)

Epoch 17/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.0978
Validation Loss: 0.0946
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.3245
Macro F1: 0.0439
mAP: 0.2673
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.0946)

Epoch 18/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.0965
Validation Loss: 0.0933
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.3314
Macro F1: 0.0504
mAP: 0.2812
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.0933)

Epoch 19/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.0952
Validation Loss: 0.0922
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.3406
Macro F1: 0.0593
mAP: 0.2928
Current learning rate: 1.60e-05
New best model saved (Validation Loss: 0.0922)

Epoch 20/20
------------------------------


Training:   0%|          | 0/457 [00:00<?, ?it/s]

Validating:   0%|          | 0/51 [00:00<?, ?it/s]

Training Loss: 0.0941
Validation Loss: 0.0911
Exact Match Accuracy: 0.0000
Sample Accuracy: 0.0000
Micro F1: 0.3467
Macro F1: 0.0708
mAP: 0.3053
Current learning rate: 1.61e-05
New best model saved (Validation Loss: 0.0911)

Training completed!
Best model saved to: best_coco_shuffle_model.pth


## TensorBoard Logger Initialization

**Purpose**: Setup experiment tracking for real-time monitoring.

**TensorBoard Benefits**:
- **Visualization**: Plot training/validation curves during training
- **Metric comparison**: Compare metrics across runs
- **Hyperparameter tuning**: Track relationships between configs and performance
- **Debugging**: Detect training anomalies (exploding gradients, plateau, etc.)

**Log Directory**: All events stored in `runs/coco_multi_label_shuffle/` for persistent access.


In [20]:
print("=" * 60)
print("Starting test prediction program")
print("=" * 60)

BATCH_SIZE_TEST = 64

print(f"Test inference hyperparameters:")
print(f"  - Test batch size: {BATCH_SIZE_TEST}")

Starting test prediction program
Test inference hyperparameters:
  - Test batch size: 64


## TensorBoard Logging Function

**Purpose**: Centralized function for logging metrics to TensorBoard.

**Logged Components**:
1. **Mini-batch losses**: Track training loss at finer granularity (commented out in main loop)
2. **Train vs Validation loss**: Side-by-side comparison to detect overfitting
3. **F1 scores**: Both micro/macro variants for train and validation
4. **Accuracy metrics**: Exact match and sample-level accuracy
5. **Learning rate**: Track scheduler's LR decay over epochs

**Visualization Strategy**:
- **Grouped scalars**: Related metrics plotted together (e.g., train/val losses on same graph)
- **Flush operation**: Ensures data is written to disk immediately for real-time monitoring




In [21]:
print(f"Test directories and files:")
print(f"  - Test images: {TEST_IMG_DIR}")
print(f"  - Trained model: {MODEL_SAVE_PATH}")
print(f"  - Output JSON: {OUTPUT_JSON_FILE}")


Test directories and files:
  - Test images: ms-coco\images\test-resized\test-resized
  - Trained model: best_coco_shuffle_model.pth
  - Output JSON: coco_predictions_shuffle_v9.json


## Main Training Loop

**Purpose**: Orchestrate multi-epoch training with model checkpointing.

**Training Pipeline**:
1. **Epoch iteration**: 20 epochs with progress tracking via tqdm
2. **Training phase**: Execute forward/backward passes on training set
3. **Validation phase**: Evaluate on holdout set without gradients
4. **Scheduler step**: Update learning rate according to cosine schedule
5. **Metric logging**: Print comprehensive performance statistics
6. **Model selection**: Save checkpoint when validation metric improves

**Model Selection Strategy**:
- **Metric-based checkpointing**: Saves best model according to `METRIC_OPTION` (currently mAP)
- **Comprehensive checkpoint**: Stores model weights, optimizer state, scheduler state, and all best metrics
- **Enables**: Resume training, ensemble creation, and deployment of best model

**Performance Observations** (from output):
- **Micro F1 peaked at 0.0947** in epoch 1, then degraded
- **mAP not logged properly** (shown as 0.0000), suggesting potential issue with metric calculation or label format
- **Loss plateau**: Training loss decreased minimally (0.0383→0.0311), indicating:
  - Possible learning rate too low
  - L1 loss may be suboptimal for this task
  - Model may be underfitting

**Windows Compatibility**: `if __name__ == '__main__' or 'ipykernel' in sys.modules` guards multiprocessing calls in Jupyter environment.


In [None]:
test_transforms = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

test_dataset = COCOTestImageDataset(
    img_dir=TEST_IMG_DIR,
    transform=test_transforms
)

test_loader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE_TEST,
    shuffle=False,  # No shuffling needed for testing
    num_workers=4,  
    pin_memory=True if device.type == 'cuda' else False,
    persistent_workers=True
)

print(f"Test dataset size: {len(test_dataset)}")
print(f"Test batch count: {len(test_loader)}")
print(f"Test loader using {test_loader.num_workers} workers")

## Test Inference Configuration

**Purpose**: Initialize hyperparameters for test set prediction.

**Test Batch Size (64)**: Matches training batch size for consistency, though larger batches could be used during inference since no gradients are stored.


In [23]:
test_model = COCOMultiLabelClassifier(num_classes=NUM_CLASSES, pretrained=False)

if os.path.exists(MODEL_SAVE_PATH):
    checkpoint = torch.load(MODEL_SAVE_PATH, map_location=device)
    test_model.load_state_dict(checkpoint['model_state_dict'])
    print(f"Successfully loaded model weights from: {MODEL_SAVE_PATH}")
    print(f"Model training epoch: {checkpoint['epoch']}")
    print(f"Best validation loss: {checkpoint['best_val_micro_f1']:.4f}")
else:
    print(f"Trained model file not found: {MODEL_SAVE_PATH}")
    print("Please run the training program first")
    raise FileNotFoundError(f"Model file not found: {MODEL_SAVE_PATH}")

test_model = test_model.to(device)
test_model.eval()
print("Model ready for inference")

  checkpoint = torch.load(MODEL_SAVE_PATH, map_location=device)


Successfully loaded model weights from: best_coco_shuffle_model.pth
Model training epoch: 20
Best validation loss: 0.0000
Model ready for inference


## Test Paths Verification

**Purpose**: Confirm file paths for test inference pipeline.

**Output Components**:
- **Test images**: 4952 unlabeled images for challenge submission
- **Model checkpoint**: Best model saved during training (based on mAP metric)
- **Predictions JSON**: Output file containing predicted class indices per image


In [24]:
predictions_dict = {}
print("Output dictionary initialized")


Output dictionary initialized


## Test Dataset and DataLoader Preparation

**Purpose**: Create inference pipeline for test set predictions.

**Transform Pipeline**:
- **Identical to validation**: Same resize (224×224) and normalization to match training distribution
- **No augmentation**: Uses deterministic transforms for consistent predictions

**DataLoader Configuration**:
- **No shuffling**: Preserves image order for result mapping
- **4 workers**: Slightly reduced from training (6) as inference is less I/O bound
- **Pin memory**: Enabled for faster CPU→GPU transfer
- **Persistent workers**: Reduces overhead during iteration

**Dataset Size**: 4952 test images → 78 batches with batch size 64.


In [None]:
print("Starting prediction loop...")
print("-" * 40)

with torch.no_grad():
    for batch_idx, (images, filenames) in enumerate(tqdm(test_loader, desc="Predicting")):
        # Get mini-batch
        images = images.to(device)
        
        outputs = test_model(images)
        
        probabilities = torch.sigmoid(outputs)
        predictions = (probabilities > THRESHOLD).cpu().numpy()
        
        # Update dictionary entries, write corresponding class indices
        for i, filename in enumerate(filenames):
            predicted_classes = []
            for class_idx in range(NUM_CLASSES):
                if predictions[i, class_idx]:
                    predicted_classes.append(class_idx)
            
            predictions_dict[filename] = predicted_classes

print(f"Prediction completed, processed {len(predictions_dict)} images")

## Model Loading for Inference

**Purpose**: Instantiate model and load trained weights from checkpoint.

**Loading Process**:
1. **Initialize architecture**: Create model with random weights (`pretrained=False` to avoid downloading ImageNet weights)
2. **Load checkpoint**: Retrieve saved state dictionary from training
3. **Restore weights**: Apply trained parameters to model
4. **Device placement**: Move to GPU for accelerated inference
5. **Evaluation mode**: Disable dropout and batch normalization training behavior

**Checkpoint Information**:
- **Training epoch**: 1 (model was saved after first epoch)
- **Best metric**: 0.0947 Micro F1 (though variable name incorrectly says `best_val_micro_f1` while printing)


In [33]:
print(f"Saving prediction results to: {OUTPUT_JSON_FILE}")

# Show some sample predictions
sample_count = 0
for filename, predicted_classes in predictions_dict.items():
    if sample_count < 5:  # Show only first 5 samples
        print(f"  Sample {filename}: predicted classes {predicted_classes}")
        sample_count += 1

try:
    with open(OUTPUT_JSON_FILE, 'w') as f:
        json.dump(predictions_dict, f, indent=2)
    print(f"JSON file successfully saved to: {OUTPUT_JSON_FILE}")
    
    # Check file size
    file_size = os.path.getsize(OUTPUT_JSON_FILE)
    print(f"File size: {file_size / 1024:.2f} KB")
    
except Exception as e:
    print(f"Error saving JSON file: {e}")
    raise

print("=" * 60)
print("Test prediction program completed!")

Saving prediction results to: coco_predictions_shuffle_v8.json
  Sample 000000000139: predicted classes [56, 57, 58, 60, 62]
  Sample 000000000285: predicted classes []
  Sample 000000000632: predicted classes [56, 57]
  Sample 000000000724: predicted classes [11]
  Sample 000000000776: predicted classes [77]
JSON file successfully saved to: coco_predictions_shuffle_v8.json
File size: 200.17 KB
Test prediction program completed!
