# RADIO Encoder Evaluation

This notebook demonstrates how to use NVIDIA's RADIO (Reduce All Domains Into One) encoder for semantic segmentation.

**Requirements:**
- Hugging Face authentication: Run `huggingface-cli login` before starting
- ~2GB download for RADIO model (first time only)
- ~450MB download for Pascal VOC dataset (first time only)

## 1. Setup and Imports

In [None]:
import sys
from pathlib import Path

# Add project to path
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root))

import torch
import torch.nn as nn
from torch.utils.data import DataLoader
import matplotlib.pyplot as plt
import numpy as np
from tqdm.notebook import tqdm

# Import framework components
from vlm_eval import EncoderRegistry, HeadRegistry, DatasetRegistry
from vlm_eval.encoders import RADIOEncoder
from vlm_eval.heads import LinearProbeHead
from vlm_eval.datasets import PascalVOCDataset

print("✓ Imports successful!")

## 2. Load Pascal VOC Dataset

We'll use a subset of 50 images for faster testing.

In [None]:
# List available datasets
print("Available datasets:", DatasetRegistry.list_available())

# Create dataset (subset of 50 images)
dataset = DatasetRegistry.get(
    "pascal_voc",
    root="./data/pascal_voc",
    split="val",
    download=True,
    subset_size=50,
    image_size=518  # RADIO's recommended size for segmentation
)

print(f"\nDataset: {dataset.__class__.__name__}")
print(f"Number of samples: {len(dataset)}")
print(f"Number of classes: {dataset.num_classes}")
print(f"Class names: {dataset.class_names[:5]}...")  # Show first 5

## 3. Visualize Dataset Samples

In [None]:
# Get a sample
sample = dataset[0]
image = sample["image"]
mask = sample["mask"]

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Image
axes[0].imshow(image.permute(1, 2, 0).numpy())
axes[0].set_title(f"Image: {sample['filename']}")
axes[0].axis('off')

# Mask
axes[1].imshow(mask.numpy(), cmap='tab20', vmin=0, vmax=20)
axes[1].set_title("Segmentation Mask")
axes[1].axis('off')

plt.tight_layout()
plt.show()

print(f"Image shape: {image.shape}")
print(f"Mask shape: {mask.shape}")
print(f"Unique classes in mask: {mask.unique().tolist()}")

## 4. Create RADIO Encoder

Load NVIDIA's RADIO model from Hugging Face.

In [None]:
# List available encoders
print("Available encoders:", EncoderRegistry.list_available())

# Create RADIO encoder
print("\nLoading RADIO encoder (this may take a moment on first run)...")
encoder = EncoderRegistry.get("radio", variant="base", pretrained=True)

print(f"\nEncoder: {encoder.__class__.__name__}")
print(f"Output channels: {encoder.output_channels}")
print(f"Patch size: {encoder.patch_size}")
print(f"Parameters: {encoder.get_num_parameters():,}")
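
The patch size printed above determines how coarse the spatial feature map is. For a ViT-style encoder, each side of the feature grid is typically the image side divided by the patch size. The sketch below works this out for the 518-pixel input used in this notebook. The patch sizes 14 and 16 are illustrative assumptions, not values read from the RADIO checkpoint, and `feature_grid` is a hypothetical helper rather than part of `vlm_eval`.

```python
def feature_grid(image_size: int, patch_size: int) -> tuple[int, int]:
    """Return (patches_per_side, leftover_pixels) for a square input."""
    if patch_size <= 0:
        raise ValueError("patch_size must be positive")
    return image_size // patch_size, image_size % patch_size


# Patch sizes here are assumptions for illustration only.
for patch_size in (14, 16):
    side, leftover = feature_grid(518, patch_size)
    print(f"patch_size={patch_size}: {side}x{side} grid, {leftover} px left over")
```

A non-zero leftover means the input side is not a multiple of the patch size, so the encoder would have to crop, pad, or resize; compare against the value of `encoder.patch_size` reported in the cell above.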

## 5. Create Segmentation Head

Add a linear probe head for semantic segmentation.

In [None]:
# List available heads
print("Available heads:", HeadRegistry.list_available())

# Create head
head = HeadRegistry.get(
    "linear_probe",
    encoder=encoder,
    num_classes=21,  # Pascal VOC: 20 object classes + background
    freeze_encoder=True  # Freeze RADIO weights for linear probing
)

print(f"\nHead: {head.__class__.__name__}")
print(f"Total parameters: {head.get_num_parameters():,}")
print(f"Trainable parameters: {head.get_num_parameters(trainable_only=True):,}")
print(f"Head-only parameters: {head.get_head_parameters():,}")
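
A linear probe for segmentation is commonly a single linear classifier applied at every spatial location, which is equivalent to a 1x1 convolution over the frozen feature map. The NumPy sketch below illustrates that idea and the resulting parameter count. It is not the `LinearProbeHead` implementation, which may add normalization or upsampling, and the feature width of 768 and the 37x37 grid are assumed values chosen for illustration.

```python
import numpy as np


def linear_probe_logits(features, weight, bias):
    """Apply the same linear classifier at every spatial location.

    features: (B, C, H, W), weight: (K, C), bias: (K,) -> logits (B, K, H, W)
    """
    logits = np.einsum("kc,bchw->bkhw", weight, features)
    return logits + bias[None, :, None, None]


rng = np.random.default_rng(0)
channels, num_classes = 768, 21  # 768 is an assumed feature width
features = rng.standard_normal((2, channels, 37, 37))
weight = rng.standard_normal((num_classes, channels)) * 0.01
bias = np.zeros(num_classes)

logits = linear_probe_logits(features, weight, bias)
num_params = weight.size + bias.size  # C * K weights plus K biases
print(logits.shape, num_params)
```

Because only `C * K + K` values are trainable, the probe's accuracy mostly reflects how linearly separable the frozen features already are.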

## 6. Create DataLoader

In [None]:
dataloader = DataLoader(
    dataset,
    batch_size=2,  # Small batch size for RADIO (large model)
    shuffle=False,
    num_workers=0
)

print(f"DataLoader created with batch_size={dataloader.batch_size}")
print(f"Number of batches: {len(dataloader)}")

## 7. Run Forward Pass

Test the RADIO encoder and segmentation head.

In [None]:
# Setup device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

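# Move model to device and switch to inference mode
head = head.to(device)
head.eval()
# Also move the encoder explicitly, in case the head does not register it as a submodule
encoder = encoder.to(device)
encoder.eval()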

# Get one batch
batch = next(iter(dataloader))
images = batch["image"].to(device)
masks = batch["mask"].to(device)

print("\nInput shapes:")
print(f"  Images: {images.shape}")
print(f"  Masks: {masks.shape}")

# Forward pass
print("\nRunning forward pass...")
with torch.no_grad():
    features = encoder(images)
    logits = head(features)
    predictions = logits.argmax(dim=1)

print("\nOutput shapes:")
print(f"  Features: {features.shape}")
print(f"  Logits: {logits.shape}")
print(f"  Predictions: {predictions.shape}")

## 8. Visualize RADIO Features

Visualize the spatial features extracted by RADIO.

In [None]:
# Visualize feature maps (first 6 channels)
idx = 0  # First image in batch
feature_maps = features[idx].cpu().numpy()

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.flatten()

for i in range(6):
    axes[i].imshow(feature_maps[i], cmap='viridis')
    axes[i].set_title(f"Feature Channel {i}")
    axes[i].axis('off')

plt.suptitle("RADIO Spatial Features (First 6 Channels)", fontsize=14)
plt.tight_layout()
plt.show()

print(f"Feature map shape: {feature_maps.shape}")
print(f"Feature value range: [{feature_maps.min():.3f}, {feature_maps.max():.3f}]")
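
Individual channels are hard to interpret on their own, because no single channel has to align with a semantic concept. A common alternative is to project every spatial location onto the top three principal components of the feature map and display the result as an RGB image. The NumPy sketch below shows that swapped-in technique; `pca_to_rgb` is a hypothetical helper, and the random array stands in for `feature_maps`, whose real shape is assumed to be `(C, H, W)`.

```python
import numpy as np


def pca_to_rgb(feature_map):
    """Project a (C, H, W) feature map onto its top 3 principal components.

    Returns an (H, W, 3) array scaled to [0, 1] for use with plt.imshow.
    """
    c, h, w = feature_map.shape
    flat = feature_map.reshape(c, h * w).T          # (H*W, C)
    flat = flat - flat.mean(axis=0, keepdims=True)  # center each channel
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    projected = flat @ vt[:3].T                     # (H*W, 3)
    lo = projected.min(axis=0, keepdims=True)
    hi = projected.max(axis=0, keepdims=True)
    scaled = (projected - lo) / np.maximum(hi - lo, 1e-8)
    return scaled.reshape(h, w, 3)


# Random stand-in for `feature_maps`; the shape is an assumption.
demo = np.random.default_rng(0).standard_normal((64, 37, 37))
rgb = pca_to_rgb(demo)
print(rgb.shape)
```

In the notebook, `plt.imshow(pca_to_rgb(feature_maps))` would show all channels at once, provided `feature_maps` has at least three channels.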

## 9. Visualize Predictions

Compare input images, ground truth masks, and model predictions.

In [None]:
# Visualize both samples in batch
fig, axes = plt.subplots(2, 3, figsize=(15, 10))

for i in range(2):
    # Input image
    axes[i, 0].imshow(images[i].cpu().permute(1, 2, 0).numpy())
    axes[i, 0].set_title("Input Image")
    axes[i, 0].axis('off')
    
    # Ground truth mask
    axes[i, 1].imshow(masks[i].cpu().numpy(), cmap='tab20', vmin=0, vmax=20)
    axes[i, 1].set_title("Ground Truth")
    axes[i, 1].axis('off')
    
    # Prediction
    axes[i, 2].imshow(predictions[i].cpu().numpy(), cmap='tab20', vmin=0, vmax=20)
    axes[i, 2].set_title("Prediction (Untrained)")
    axes[i, 2].axis('off')

plt.tight_layout()
plt.show()

print("Note: Predictions are essentially random because the head has not been trained.")
print("Training the linear probe head would improve these predictions significantly.")
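
Training a probe of this kind usually means minimizing per-pixel cross-entropy while skipping Pascal VOC's void pixels (label 255, the same `ignore_index` used by the metric functions in this notebook). The NumPy sketch below shows that loss in isolation. `masked_cross_entropy` is a hypothetical helper, not part of `vlm_eval`; in a PyTorch training loop, `torch.nn.CrossEntropyLoss(ignore_index=255)` plays the same role.

```python
import numpy as np


def masked_cross_entropy(logits, target, ignore_index=255):
    """Mean per-pixel cross-entropy over pixels whose label is not ignore_index.

    logits: (K, H, W) float array, target: (H, W) integer array.
    """
    k = logits.shape[0]
    flat_logits = logits.reshape(k, -1).T            # (H*W, K)
    flat_target = target.reshape(-1)
    valid = flat_target != ignore_index
    if not valid.any():
        return 0.0
    z = flat_logits[valid]
    z = z - z.max(axis=1, keepdims=True)             # stabilize the softmax
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(z.shape[0]), flat_target[valid]]
    return float(-picked.mean())


# An uninformative head (all-zero logits) scores log(21) whatever the labels are.
logits = np.zeros((21, 4, 4))
target = np.full((4, 4), 255)
target[:2] = 3  # top half labeled class 3, bottom half void
print(masked_cross_entropy(logits, target), np.log(21))
```

A loss near `log(21)` is therefore the reference point for an untrained 21-class head; training should push it well below that value.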

## 10. Calculate Evaluation Metrics

Compute mean IoU and pixel accuracy on the first 10 batches (20 images) of the validation subset.

In [None]:
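def compute_iou(pred, target, num_classes=21, ignore_index=255):
    """Compute mean IoU over the classes present in the prediction or target.

    Pixels labeled `ignore_index` in the target are excluded.
    """
    ious = []
    pred = pred.flatten()
    target = target.flatten()

    # Drop void pixels so predictions there do not inflate the union
    valid = target != ignore_index
    pred = pred[valid]
    target = target[valid]

    for cls in range(num_classes):
        pred_mask = pred == cls
        target_mask = target == cls

        intersection = (pred_mask & target_mask).sum().float()
        union = (pred_mask | target_mask).sum().float()

        # Skip classes absent from both prediction and target
        if union > 0:
            ious.append((intersection / union).item())

    return np.mean(ious) if ious else 0.0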

def compute_pixel_accuracy(pred, target, ignore_index=255):
    """Compute pixel accuracy."""
    valid_mask = target != ignore_index
    correct = (pred[valid_mask] == target[valid_mask]).sum()
    total = valid_mask.sum()
    return (correct / total).item() if total > 0 else 0.0

# Evaluate on a few batches
head.eval()
ious = []
pixel_accs = []

print("Evaluating on validation subset...")
with torch.no_grad():
    for i, batch in enumerate(tqdm(dataloader)):
        if i >= 10:  # Evaluate on the first 10 batches (20 images at batch_size=2)
            break
        
        images = batch["image"].to(device)
        masks = batch["mask"].to(device)
        
        features = encoder(images)
        logits = head(features)
        predictions = logits.argmax(dim=1)
        
        # Compute metrics
        for j in range(predictions.shape[0]):
            iou = compute_iou(predictions[j], masks[j])
            pixel_acc = compute_pixel_accuracy(predictions[j], masks[j])
            ious.append(iou)
            pixel_accs.append(pixel_acc)

print("\nResults (untrained model):")
print(f"Mean IoU: {np.mean(ious):.4f}")
print(f"Pixel Accuracy: {np.mean(pixel_accs):.4f}")
print("\nNote: These are baseline metrics for an untrained head.")
print("With a trained head, RADIO is expected to reach roughly 70-80% mIoU on Pascal VOC.")
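
The loop above averages IoU image by image, so the result depends on which classes happen to appear in each image. Benchmark mIoU figures for Pascal VOC are usually computed differently: a confusion matrix is accumulated over the whole dataset, and per-class IoU is derived from it once at the end. A NumPy sketch of that variant follows; `confusion_matrix` and `miou_from_confusion` are hypothetical helpers, not part of `vlm_eval`, and the tiny 3-class arrays are made-up inputs for illustration.

```python
import numpy as np


def confusion_matrix(pred, target, num_classes=21, ignore_index=255):
    """Return a (K, K) matrix with rows = ground truth, columns = prediction."""
    pred = np.asarray(pred).reshape(-1)
    target = np.asarray(target).reshape(-1)
    valid = target != ignore_index
    index = target[valid] * num_classes + pred[valid]
    counts = np.bincount(index, minlength=num_classes * num_classes)
    return counts.reshape(num_classes, num_classes)


def miou_from_confusion(conf):
    """Mean IoU over classes that occur in the ground truth or predictions."""
    intersection = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - intersection
    present = union > 0
    if not present.any():
        return 0.0
    return float((intersection[present] / union[present]).mean())


# Accumulate over "images", then compute the metric once at the end.
total = np.zeros((3, 3), dtype=np.int64)
pairs = [
    (np.array([0, 0, 1, 1]), np.array([0, 0, 1, 255])),  # last pixel is void
    (np.array([2, 2, 1, 0]), np.array([2, 1, 1, 0])),
]
for pred, target in pairs:
    total += confusion_matrix(pred, target, num_classes=3)
print(miou_from_confusion(total))
```

To use this with the evaluation loop, the predictions and masks would first be moved to NumPy, for example with `predictions[j].cpu().numpy()`.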

## Summary

You've successfully:
1. ✅ Loaded a 50-image subset of the Pascal VOC validation split
2. ✅ Created NVIDIA's RADIO encoder from Hugging Face
3. ✅ Added a linear probe segmentation head
4. ✅ Run forward passes and extracted spatial features
5. ✅ Visualized RADIO features and predictions
6. ✅ Calculated evaluation metrics (IoU, pixel accuracy)

### Next Steps

1. **Train the linear probe head**: Fit the segmentation head on Pascal VOC while keeping the encoder frozen
2. **Compare with other encoders**: See `04_model_comparison.ipynb`
3. **Try different datasets**: ADE20K, Cityscapes, etc.
4. **Experiment with input sizes**: RADIO supports flexible resolutions

### Key Takeaways

- **RADIO** is a powerful vision foundation model from NVIDIA
- It provides rich **spatial features** suitable for dense prediction tasks
- The framework makes it easy to **swap encoders** and compare performance
- **Linear probing** is an efficient way to evaluate encoder quality