# RADIO Encoder Evaluation

This notebook demonstrates how to use NVIDIA's RADIO (Reduce All Domains Into One) encoder for semantic segmentation.

**Requirements:**
- Hugging Face authentication: Run `huggingface-cli login`
- ~2GB download for RADIO model (first time only)
- Pascal VOC 2012 dataset (see setup below)

## Dataset Setup (One-time)

Download Pascal VOC 2012 from Kaggle:
1. Go to: https://www.kaggle.com/datasets/huanghanchina/pascal-voc-2012
2. Download the dataset
3. Extract to `./data/pascal_voc/`

Expected structure:
```
data/pascal_voc/
  └── VOCdevkit/
      └── VOC2012/
          ├── JPEGImages/
          ├── SegmentationClass/
          └── ImageSets/
```
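
Before running the cells below, it can help to confirm that the extracted archive matches this layout. The following is a minimal sketch using only the standard library; `check_voc_layout` is a hypothetical helper (not part of `vlm_eval`), and it checks only the three directories listed above.

```python
from pathlib import Path

# Sub-directories the expected structure lists under VOC2012/.
REQUIRED_DIRS = ("JPEGImages", "SegmentationClass", "ImageSets")


def check_voc_layout(root):
    """Return the required directories that are missing under `root`.

    `root` is the extraction directory (e.g. "./data/pascal_voc").
    An empty list means the layout matches the expected structure.
    """
    voc_dir = Path(root) / "VOCdevkit" / "VOC2012"
    return [name for name in REQUIRED_DIRS if not (voc_dir / name).is_dir()]


missing = check_voc_layout("./data/pascal_voc")
if missing:
    print("Missing directories:", ", ".join(missing))
else:
    print("Pascal VOC layout looks complete.")
```

If any directory is reported missing, the archive was probably extracted one level too high or too low relative to `./data/pascal_voc/`.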

**Alternative**: To test immediately without downloading anything, use DummyDataset (change `pascal_voc` to `dummy` in the dataset cell below)

## 1. Setup and Imports

In [None]:
import sys
from pathlib import Path

# Add project to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
from tqdm.notebook import tqdm

# Import framework components
from vlm_eval import EncoderRegistry, HeadRegistry, DatasetRegistry

print("✓ Imports successful!")

## 2. Load Dataset

In [None]:
# List available datasets
print("Available datasets:", DatasetRegistry.list_available())

# Create dataset (50-image subset for fast testing)
dataset = DatasetRegistry.get(
    "pascal_voc",
    root="./data/pascal_voc",
    split="val",
    subset_size=50,
    image_size=518  # RADIO's recommended size for segmentation
)

print(f"\nDataset: {dataset.__class__.__name__}")
print(f"Number of samples: {len(dataset)}")
print(f"Number of classes: {dataset.num_classes}")
print(f"Class names: {dataset.class_names[:5]}...")  # Show first 5

## 3. Visualize Real Dataset Samples

In [None]:
# Get a sample
sample = dataset[0]
image = sample["image"]
mask = sample["mask"]

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Image
axes[0].imshow(image.permute(1, 2, 0).numpy())
axes[0].set_title(f"Image: {sample['filename']}")
axes[0].axis('off')

# Mask (tab20 has only 20 colors, so classes 19 and 20 share one)
axes[1].imshow(mask.numpy(), cmap='tab20', vmin=0, vmax=20)
axes[1].set_title("Segmentation Mask (21 classes)")
axes[1].axis('off')

plt.tight_layout()
plt.show()

print(f"Image shape: {image.shape}")
print(f"Mask shape: {mask.shape}")
print(f"Unique classes in mask: {sorted(mask.unique().tolist())}")
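
`tab20` provides only 20 distinct colors, so with 21 classes the last two are drawn in the same color. One alternative is the palette conventionally used for Pascal VOC label images, which is generated by spreading the bits of each class index across the RGB channels. The sketch below reproduces that bit-shifting scheme in plain Python; `voc_colormap` is an illustrative name, and whether the dataset wrapper keeps the 255 "void" label in its masks is not something this notebook confirms.

```python
def voc_colormap(n=256):
    """Return `n` (r, g, b) tuples following the Pascal VOC palette scheme.

    Bits 0, 1, 2 of the class index feed the red, green and blue channels,
    starting at the most significant bit of each channel; the index is then
    shifted right by three and the process repeats.
    """
    colors = []
    for index in range(n):
        r = g = b = 0
        c = index
        for j in range(8):
            r |= ((c >> 0) & 1) << (7 - j)
            g |= ((c >> 1) & 1) << (7 - j)
            b |= ((c >> 2) & 1) << (7 - j)
            c >>= 3
        colors.append((r, g, b))
    return colors


palette = voc_colormap(21)
print("background:", palette[0])   # (0, 0, 0)
print("class 1:   ", palette[1])   # (128, 0, 0)
print("class 15:  ", palette[15])  # (192, 128, 128)
```

To use the palette with `imshow`, the tuples can be scaled to the 0-1 range and wrapped in `matplotlib.colors.ListedColormap`; with 21 colors and `vmin=0, vmax=20`, each class index then gets its own color.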

## 4. Create RADIO Encoder

Load NVIDIA's RADIO model from Hugging Face.

In [None]:
# List available encoders
print("Available encoders:", EncoderRegistry.list_available())

# Create RADIO encoder
print("\nLoading RADIO encoder (this may take a moment on first run)...")
encoder = EncoderRegistry.get("radio", variant="base", pretrained=True)

print(f"\nEncoder: {encoder.__class__.__name__}")
print(f"Output channels: {encoder.output_channels}")
print(f"Patch size: {encoder.patch_size}")
print(f"Parameters: {encoder.get_num_parameters():,}")

## 5. Create Segmentation Head

In [None]:
# Create head
head = HeadRegistry.get(
    "linear_probe",
    encoder=encoder,
    num_classes=dataset.num_classes,  # 21 for Pascal VOC (20 object classes + background)
    freeze_encoder=True  # Freeze RADIO for linear probing
)

print(f"Head: {head.__class__.__name__}")
print(f"Total parameters: {head.get_num_parameters():,}")
print(f"Trainable parameters: {head.get_num_parameters(trainable_only=True):,}")
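
The gap between the total and trainable counts printed above is what makes this a linear probe: the encoder is frozen, and only a small classifier on top of its features is trained. As a rough cross-check, and assuming the head is a single 1x1 convolution (equivalently, a per-pixel linear layer) with bias, which this notebook does not confirm, the trainable count would be `in_channels * num_classes + num_classes`. `linear_probe_param_count` is a hypothetical helper for that arithmetic.

```python
def linear_probe_param_count(in_channels, num_classes, bias=True):
    """Parameters of one linear map from feature channels to class logits.

    Assumes a single 1x1 convolution / per-pixel linear layer; a head with
    extra layers (normalization, hidden layers) would have more.
    """
    weights = in_channels * num_classes
    biases = num_classes if bias else 0
    return weights + biases


# Example: 1024 feature channels (the figure quoted under Key Insights)
# and the 21 Pascal VOC classes.
print(linear_probe_param_count(1024, 21))  # 21525
```

If the trainable count printed by the cell above is much larger than this estimate, the head likely contains more than a single linear layer, or the encoder is not actually frozen.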

## 6. Run Forward Pass

In [None]:
# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

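# Move both modules to the device and switch to eval mode
# (harmless if the head already registers the encoder as a submodule)
encoder = encoder.to(device).eval()
head = head.to(device).eval()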

# Get one batch (2 samples)
dataloader = DataLoader(dataset, batch_size=2, shuffle=False)
batch = next(iter(dataloader))
images = batch["image"].to(device)
masks = batch["mask"].to(device)

# Forward pass
with torch.no_grad():
    features = encoder(images)
    logits = head(features)
    predictions = logits.argmax(dim=1)

print(f"Features: {features.shape}")
print(f"Predictions: {predictions.shape}")
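
For a ViT-style encoder, the spatial size of `features` is the input size divided by the patch size, so the printed shape can be predicted from `image_size` and `encoder.patch_size`. The helper names below are hypothetical, and the patch sizes in the example are illustrative values rather than a statement about the loaded RADIO variant; use the value printed in Section 4.

```python
def feature_grid(image_size, patch_size):
    """Number of whole patches along each side, and leftover pixels per side."""
    return image_size // patch_size, image_size % patch_size


def nearest_multiple(image_size, patch_size):
    """Closest size to `image_size` that is an exact multiple of `patch_size`."""
    return max(patch_size, round(image_size / patch_size) * patch_size)


for patch in (14, 16):  # illustrative patch sizes, not read from the encoder
    side, leftover = feature_grid(518, patch)
    print(
        f"patch={patch}: {side}x{side} grid, {leftover} px left over, "
        f"nearest exact size {nearest_multiple(518, patch)}"
    )
```

When the leftover is non-zero, what happens to the extra pixels (cropping, padding, or resizing) depends on the encoder wrapper, so choosing an `image_size` that divides evenly avoids the ambiguity.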

## 7. Visualize RADIO Features

In [None]:
# Visualize the first six feature channels of the first image in the batch
feature_maps = features[0].cpu().numpy()

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i in range(6):
    axes[i].imshow(feature_maps[i], cmap='viridis')
    axes[i].set_title(f"Feature Channel {i}")
    axes[i].axis('off')

plt.suptitle("RADIO Spatial Features", fontsize=14)
plt.tight_layout()
plt.show()

## 8. Visualize Predictions on Real Images

In [None]:
# Visualize predictions
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

for i in range(2):
    # Input image
    axes[i, 0].imshow(images[i].cpu().permute(1, 2, 0).numpy())
    axes[i, 0].set_title("Input Image")
    axes[i, 0].axis('off')
    
    # Ground truth
    axes[i, 1].imshow(masks[i].cpu().numpy(), cmap='tab20', vmin=0, vmax=20)
    axes[i, 1].set_title("Ground Truth")
    axes[i, 1].axis('off')
    
    # Prediction
    axes[i, 2].imshow(predictions[i].cpu().numpy(), cmap='tab20', vmin=0, vmax=20)
    axes[i, 2].set_title("Prediction (Untrained)")
    axes[i, 2].axis('off')

plt.tight_layout()
plt.show()

print("Note: the head is untrained, so these predictions are not meaningful. After training, RADIO is expected to reach roughly 70-80% mIoU on Pascal VOC.")
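
mIoU (mean intersection over union) averages, over classes, the ratio of correctly labeled pixels to the union of predicted and ground-truth pixels for that class. The sketch below is a dependency-free version that works on flat sequences of class indices; `mean_iou` is a hypothetical helper rather than part of `vlm_eval`, and the `ignore_index=255` default assumes the masks keep Pascal VOC's void label, which this notebook does not confirm.

```python
def mean_iou(preds, targets, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in `preds` or `targets`.

    `preds` and `targets` are equal-length sequences of integer class
    indices. Pixels whose target equals `ignore_index` are skipped.
    """
    intersection = [0] * num_classes
    pred_count = [0] * num_classes
    target_count = [0] * num_classes

    for p, t in zip(preds, targets):
        if t == ignore_index:
            continue
        pred_count[p] += 1
        target_count[t] += 1
        if p == t:
            intersection[p] += 1

    ious = []
    for c in range(num_classes):
        union = pred_count[c] + target_count[c] - intersection[c]
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(intersection[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Tiny example: four pixels, two classes, last pixel ignored.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 255], num_classes=2))  # 0.5
```

With the tensors from Section 6, `mean_iou(predictions.flatten().tolist(), masks.flatten().tolist(), num_classes=21)` would score the current batch. For a full evaluation, the per-class counts should be accumulated over the whole dataset before dividing, rather than averaging per-batch scores.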

## Summary

✅ Loaded **real Pascal VOC 2012 dataset** with semantic labels  
✅ Created **NVIDIA RADIO encoder** (~300M parameters)  
✅ Extracted rich **spatial features** for segmentation  
✅ Visualized predictions on **real-world images**  

### Next Steps
1. Compare with SimpleCNN: See `04_model_comparison.ipynb`
2. Train the linear probe head for better predictions
3. Try different input resolutions (RADIO is flexible)

### Key Insights
- RADIO provides **1024-dim features** vs SimpleCNN's 256-dim
- Pretrained at large scale, which tends to improve transfer to downstream tasks
- Suitable for dense prediction tasks like segmentation