# 03. Vision Backbone Deep Dive

**Goal**: Understand how OpenVLA processes visual information using DINOv2 and SigLIP.

## What We'll Learn
1. Vision Transformer (ViT) fundamentals
2. DINOv2: Self-supervised visual features
3. SigLIP: Text-aligned visual features
4. Feature fusion strategy
5. Practical visualization of features

---
## 1. Vision Transformer Fundamentals

Both DINOv2 and SigLIP are based on the **Vision Transformer (ViT)** architecture.

### How ViT Works

```
┌────────────────────────────────────────────────────────────────┐
│                  Vision Transformer Pipeline                    │
├────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Input Image (224 × 224 × 3)                                   │
│           │                                                     │
│           ▼                                                     │
│  ┌──────────────────────────────────────────┐                  │
│  │ 1. Patch Extraction                       │                  │
│  │    Split into 14×14 patches (16×16 each) │                  │
│  │    = 196 patches                          │                  │
│  └──────────────────────────────────────────┘                  │
│           │                                                     │
│           ▼                                                     │
│  ┌──────────────────────────────────────────┐                  │
│  │ 2. Linear Embedding                       │                  │
│  │    Each patch → 1024-dim vector          │                  │
│  │    + [CLS] token + Position embeddings   │                  │
│  └──────────────────────────────────────────┘                  │
│           │                                                     │
│           ▼                                                     │
│  ┌──────────────────────────────────────────┐                  │
│  │ 3. Transformer Encoder (24 layers)        │                  │
│  │    Multi-Head Self-Attention              │                  │
│  │    + Feed-Forward Networks                │                  │
│  └──────────────────────────────────────────┘                  │
│           │                                                     │
│           ▼                                                     │
│  Output: 197 tokens × 1024 dims                                │
│  ([CLS] token + 196 patch tokens)                              │
│                                                                 │
└────────────────────────────────────────────────────────────────┘
```
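
To make steps 1 and 2 of the pipeline concrete, here is a minimal NumPy sketch of patch extraction and linear embedding. The weights are random stand-ins for learned parameters, and `patchify` is a helper name introduced only for this illustration.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    patches = image[: gh * patch_size, : gw * patch_size]
    patches = patches.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3), dtype=np.float32)

patches = patchify(image, patch_size=16)  # one row per patch

# Random stand-ins for the learned embedding, [CLS] token and position table
embed_dim = 1024
w_embed = rng.standard_normal((patches.shape[1], embed_dim), dtype=np.float32) * 0.02
cls_token = np.zeros((1, embed_dim), dtype=np.float32)
pos_embed = rng.standard_normal((patches.shape[0] + 1, embed_dim), dtype=np.float32) * 0.02

# Prepend [CLS], then add position embeddings to every token
tokens = np.concatenate([cls_token, patches @ w_embed], axis=0) + pos_embed
print(patches.shape, tokens.shape)
```

Each 16×16×3 patch flattens to 768 numbers before the linear layer lifts it to the model width.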

In [None]:
import torch
import numpy as np
from PIL import Image
import matplotlib.pyplot as plt

# Visualize patch extraction
def visualize_patches(image_size=224, patch_size=16):
    """Visualize how an image is divided into patches."""
    n_patches = image_size // patch_size
    
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))
    
    # Create sample image
    sample_img = np.random.randint(0, 255, (image_size, image_size, 3), dtype=np.uint8)
    
    # Original image
    axes[0].imshow(sample_img)
    axes[0].set_title(f"Original Image ({image_size}×{image_size})")
    axes[0].axis('off')
    
    # Image with patch grid
    axes[1].imshow(sample_img)
    for i in range(n_patches + 1):
        axes[1].axhline(y=i * patch_size, color='red', linewidth=1)
        axes[1].axvline(x=i * patch_size, color='red', linewidth=1)
    axes[1].set_title(f"Patches ({n_patches}×{n_patches} = {n_patches**2} patches)")
    axes[1].axis('off')
    
    plt.tight_layout()
    plt.show()
    
    print(f"\nViT-L/{patch_size} Configuration:")
    print(f"  Image size: {image_size}×{image_size}")
    print(f"  Patch size: {patch_size}×{patch_size}")
    print(f"  Number of patches: {n_patches**2}")
    print(f"  + 1 [CLS] token = {n_patches**2 + 1} tokens total")

visualize_patches()

---
## 2. DINOv2: Self-Supervised Visual Features

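**DINO** (self-**di**stillation with **no** labels) learns visual features without any labeled data. DINOv2 scales the same idea up to a large curated image collection (the `lvd142m` tag in the checkpoint name below).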

### Training Approach
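- Teacher-student self-distillation, where the teacher is an exponential moving average (EMA) of the student
- Student learns to match the teacher's output on differently augmented views of the same image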
- Discovers semantic structure naturally
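
The training loop above can be sketched in a few lines of NumPy. This is a simplified DINO-style objective, not the full DINOv2 recipe (which adds further loss terms and regularizers). The temperatures, the momentum value and the function names are illustrative assumptions.

```python
import numpy as np

def softmax(x, temperature):
    z = x / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy from the centered, sharpened teacher to the student."""
    p_teacher = softmax(teacher_logits - center, t_teacher)  # fixed target, no gradient
    log_p_student = np.log(softmax(student_logits, t_student) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher weights trail the student as an exponential moving average."""
    return momentum * teacher_w + (1.0 - momentum) * student_w

rng = np.random.default_rng(0)
teacher_out = rng.standard_normal((4, 8))  # teacher logits for one augmented view
center = teacher_out.mean(axis=0)          # running center, here just the batch mean

# A student that reproduces the teacher's distribution exactly vs. a random one
student_match = (teacher_out - center) * (0.1 / 0.04)
student_rand = rng.standard_normal((4, 8))

loss_match = dino_loss(student_match, teacher_out, center)
loss_rand = dino_loss(student_rand, teacher_out, center)
print(f"matching student: {loss_match:.3f}   random student: {loss_rand:.3f}")
```

The matching student reaches the lowest possible value (the teacher's own entropy), which is the target the real student is pushed toward across augmented views.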

### What DINOv2 Captures
- Object boundaries and parts
- Semantic segmentation (emergent property)
- Spatial relationships
- Rich local features

In [None]:
# Load DINOv2 directly for exploration
import timm

print("Loading DINOv2-Large...")
dinov2 = timm.create_model('vit_large_patch14_dinov2.lvd142m', pretrained=True)
dinov2.eval()
print(f"DINOv2 loaded: {sum(p.numel() for p in dinov2.parameters())/1e6:.1f}M parameters")

In [None]:
# Inspect DINOv2 architecture
print("DINOv2-L Architecture:")
print("="*60)
print(f"Patch embedding: {dinov2.patch_embed}")
print(f"Num transformer blocks: {len(dinov2.blocks)}")
print(f"Hidden dimension: {dinov2.embed_dim}")
print(f"Num attention heads: {dinov2.blocks[0].attn.num_heads}")

In [None]:
# Process an image through DINOv2
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform

# Get DINOv2's preprocessing config
config = resolve_data_config({}, model=dinov2)
transform = create_transform(**config)

print("DINOv2 Image Transform:")
print(f"  Input size: {config['input_size']}")
print(f"  Mean: {config['mean']}")
print(f"  Std: {config['std']}")

In [None]:
# Create a random-noise stand-in for a robot camera image and process it
sample_image = Image.fromarray(
    np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)
)

# Transform image
input_tensor = transform(sample_image).unsqueeze(0)
print(f"Input tensor shape: {input_tensor.shape}")

# Forward pass
with torch.no_grad():
    # Get per-token features (not pooled)
    features = dinov2.forward_features(input_tensor)
    n_prefix = dinov2.num_prefix_tokens  # CLS token(s) that precede the patch tokens
    n_patch = features.shape[1] - n_prefix
    print(f"DINOv2 output shape: {features.shape}")
    print(f"  - Batch: {features.shape[0]}")
    print(f"  - Tokens: {features.shape[1]} ({n_prefix} CLS + {n_patch} patches)")
    print(f"  - Feature dim: {features.shape[2]}")
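
The token count printed above depends on the input resolution as well as the patch size. timm checkpoints carry their own default resolution (518×518 for this DINOv2 checkpoint, as far as we can tell), so the count can be much larger than the 256 patches OpenVLA works with at 224×224. A small helper, hypothetical and for illustration only, makes the arithmetic explicit:

```python
def vit_token_count(image_size, patch_size, num_prefix_tokens=1):
    """Number of tokens a ViT produces for a square input image."""
    grid = image_size // patch_size
    return grid * grid + num_prefix_tokens

configs = [
    ("ViT-L/16 @ 224", 224, 16, 1),
    ("ViT-L/14 @ 224", 224, 14, 1),
    ("ViT-L/14 @ 518", 518, 14, 1),
    ("ViT-L/16 @ 256, no CLS", 256, 16, 0),
]
for name, size, patch, prefix in configs:
    print(f"{name:<24} -> {vit_token_count(size, patch, prefix)} tokens")
```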

In [None]:
# Visualize attention patterns in DINOv2
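def get_attention_maps(model, image_tensor):
    """Extract per-block attention maps [B, heads, N, N] from a timm ViT.

    timm's Attention.forward returns only the output tensor, so the weights
    are recomputed in the hook from the block's own qkv projection. This
    assumes timm's attribute layout (qkv, num_heads, scale, optional q/k norm).
    """
    attention_maps = []

    def hook_fn(module, inputs, output):
        x = inputs[0]  # the block's normalized input tokens
        B, N, C = x.shape
        qkv = module.qkv(x).reshape(B, N, 3, module.num_heads, C // module.num_heads)
        q, k, _ = qkv.permute(2, 0, 3, 1, 4).unbind(0)
        q = getattr(module, 'q_norm', torch.nn.Identity())(q)
        k = getattr(module, 'k_norm', torch.nn.Identity())(k)
        attn = (q * module.scale) @ k.transpose(-2, -1)
        attention_maps.append(attn.softmax(dim=-1).detach())

    hooks = [block.attn.register_forward_hook(hook_fn) for block in model.blocks]

    with torch.no_grad():
        _ = model(image_tensor)

    for hook in hooks:
        hook.remove()

    return attention_maps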

print("DINOv2 attention patterns capture semantic structure:")
print("  - Different heads attend to different semantic parts")
print("  - Early layers: low-level features (edges, textures)")
print("  - Middle layers: object parts")
print("  - Late layers: high-level semantics")
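
To read an attention map spatially, it helps to see what one looks like as an array. The sketch below computes scaled dot-product attention weights for random queries and keys, then reshapes the [CLS] row into a patch grid. The shapes assume ViT-L at 224×224 with 14×14 patches (16 heads of 64 dims, 1 [CLS] + 256 patches); real maps would come from a trained model.

```python
import numpy as np

def attention_weights(q, k):
    """softmax(q k^T / sqrt(d)) over keys; q and k have shape (heads, tokens, d)."""
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
heads, tokens, head_dim = 16, 257, 64
q = rng.standard_normal((heads, tokens, head_dim))
k = rng.standard_normal((heads, tokens, head_dim))

attn = attention_weights(q, k)  # (heads, query tokens, key tokens)

# Row 0 is the [CLS] query; dropping column 0 leaves its attention to the patches
cls_to_patches = attn[:, 0, 1:].reshape(heads, 16, 16)
print(attn.shape, cls_to_patches.shape)
```

Each `cls_to_patches[h]` can be passed to `plt.imshow` to see where head `h` looks.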

---
## 3. SigLIP: Text-Aligned Visual Features

**SigLIP** (Sigmoid Loss for Language-Image Pre-training) learns to align images with text.

### Training Approach
- Contrastive learning on image-text pairs
- Pairwise sigmoid loss: every image-text pair is scored independently, so no batch-wide softmax normalization is needed
- Large-scale web data (billions of pairs)
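
A minimal NumPy sketch of the pairwise sigmoid loss follows. The temperature `t` and bias `b` are learnable in the real model; the values here are illustrative starting points, and the embeddings are random stand-ins.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sigmoid_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over every image-text pair in the batch."""
    logits = t * l2_normalize(img_emb) @ l2_normalize(txt_emb).T + b
    labels = 2.0 * np.eye(len(img_emb)) - 1.0  # +1 for matching pairs, -1 otherwise
    # -log(sigmoid(z)) written as log(1 + exp(-z)) for numerical stability
    per_pair = np.logaddexp(0.0, -labels * logits)
    return float(per_pair.sum() / len(img_emb))

rng = np.random.default_rng(0)
img = rng.standard_normal((8, 32))
aligned = sigmoid_loss(img, img + 0.01 * rng.standard_normal((8, 32)))
unrelated = sigmoid_loss(img, rng.standard_normal((8, 32)))
print(f"aligned pairs: {aligned:.3f}   unrelated pairs: {unrelated:.3f}")
```

Because each pair contributes its own binary term, the loss needs no normalization across the batch, unlike a softmax-based contrastive loss.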

### What SigLIP Captures
- Text-image correspondence
- Compositional understanding ("red ball on table")
- Action-relevant features
- Language-grounded concepts

In [None]:
# Load SigLIP for exploration (ViT-L/16 at 256px; OpenVLA itself uses the SO400M/14 variant at 224px)
print("Loading SigLIP-Large...")
siglip = timm.create_model('vit_large_patch16_siglip_256', pretrained=True)
siglip.eval()
print(f"SigLIP loaded: {sum(p.numel() for p in siglip.parameters())/1e6:.1f}M parameters")

In [None]:
# Compare DINOv2 vs SigLIP architectures
print("Architecture Comparison:")
print("="*60)
print(f"{'Feature':<25} {'DINOv2-L':<15} {'SigLIP-L':<15}")
print("-"*60)
print(f"{'Patch size':<25} {'14×14':<15} {'16×16':<15}")
print(f"{'Hidden dim':<25} {dinov2.embed_dim:<15} {siglip.embed_dim:<15}")
print(f"{'Num layers':<25} {len(dinov2.blocks):<15} {len(siglip.blocks):<15}")
print(f"{'Training':<25} {'Self-supervised':<15} {'Contrastive':<15}")
print(f"{'Strength':<25} {'Semantics':<15} {'Text align':<15}")

In [None]:
# Process same image through SigLIP
siglip_config = resolve_data_config({}, model=siglip)
siglip_transform = create_transform(**siglip_config)

siglip_input = siglip_transform(sample_image).unsqueeze(0)

with torch.no_grad():
    siglip_features = siglip.forward_features(siglip_input)
    print(f"SigLIP output shape: {siglip_features.shape}")
    print(f"  - Batch: {siglip_features.shape[0]}")
    print(f"  - Tokens: {siglip_features.shape[1]}")
    print(f"  - Feature dim: {siglip_features.shape[2]}")

---
## 4. Feature Fusion in OpenVLA

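OpenVLA runs the same image through both encoders and fuses their patch features, so the LLM receives DINOv2's spatial detail together with SigLIP's language-aligned semantics.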
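
The OpenVLA paper describes the fusion as channel-wise: the two feature vectors for the same patch are placed side by side. A minimal NumPy sketch, using random stand-ins for the encoder outputs and assuming 256 patches with 1024-dim DINOv2 and 1152-dim SigLIP features:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_patches = 2, 256  # 224 / 14 = 16, so a 16 x 16 patch grid

# Random stand-ins for each encoder's patch features
dino_patches = rng.standard_normal((batch, n_patches, 1024), dtype=np.float32)
siglip_patches = rng.standard_normal((batch, n_patches, 1152), dtype=np.float32)

# Channel-wise fusion: same patch index, features side by side
fused = np.concatenate([dino_patches, siglip_patches], axis=-1)
print(fused.shape)

# Token i of the fused tensor still describes image patch i
assert np.array_equal(fused[:, 5, :1024], dino_patches[:, 5])
assert np.array_equal(fused[:, 5, 1024:], siglip_patches[:, 5])
```

Concatenating along the token axis instead would double the sequence the LLM has to attend over. Joining along the feature axis keeps it at 256 visual tokens.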

In [None]:
# Shapes follow the channel-wise fusion described in the OpenVLA paper
# (DINOv2 ViT-L/14 + SigLIP SO400M/14); the cells below print the actual values.
fusion_diagram = """
┌─────────────────────────────────────────────────────────────────────┐
│                    OpenVLA Feature Fusion                           │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│                    [Input Image 224×224]                            │
│                              │                                      │
│                 ┌────────────┴────────────┐                         │
│                 │                         │                         │
│                 ▼                         ▼                         │
│          ┌──────────────┐          ┌──────────────┐                 │
│          │    DINOv2    │          │    SigLIP    │                 │
│          │   ViT-L/14   │          │  SO400M/14   │                 │
│          └──────┬───────┘          └──────┬───────┘                 │
│                 │                         │                         │
│                 ▼                         ▼                         │
│          [B, 256, 1024]            [B, 256, 1152]                   │
│         (Rich semantics)           (Text-aligned)                   │
│                 │                         │                         │
│                 └────────────┬────────────┘                         │
│                              │                                      │
│                              ▼                                      │
│                   ┌────────────────────┐                            │
│                   │    Concatenate     │                            │
│                   │ along feature dim  │                            │
│                   └──────────┬─────────┘                            │
│                              │                                      │
│                              ▼                                      │
│                       [B, 256, 2176]                                │
│                    (Fused: 1024 + 1152)                             │
│                              │                                      │
│                              ▼                                      │
│                   ┌────────────────────┐                            │
│                   │     Projector      │                            │
│                   │  MLP: 2176 → 4096  │                            │
│                   └──────────┬─────────┘                            │
│                              │                                      │
│                              ▼                                      │
│                       [B, 256, 4096]                                │
│                     (LLM-ready tokens)                              │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
"""
print(fusion_diagram)

In [None]:
# Load actual OpenVLA vision backbone
from transformers import AutoModelForVision2Seq, AutoProcessor

print("Loading OpenVLA to inspect vision backbone...")
vla = AutoModelForVision2Seq.from_pretrained(
    "openvla/openvla-7b",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained("openvla/openvla-7b", trust_remote_code=True)

In [None]:
# Explore OpenVLA's vision backbone structure
vision_backbone = vla.vision_backbone

print("OpenVLA Vision Backbone:")
print("="*60)
print(f"Type: {type(vision_backbone).__name__}")

# List sub-components
for name, child in vision_backbone.named_children():
    params = sum(p.numel() for p in child.parameters())
    print(f"  {name}: {params/1e6:.1f}M parameters")

In [None]:
# Trace feature extraction
sample_image = Image.fromarray(
    np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8)
)

# Process through OpenVLA's processor
inputs = processor("Pick up the red block", sample_image)
pixel_values = inputs['pixel_values'].to(vla.dtype)

print(f"Pixel values shape: {pixel_values.shape}")

# Extract vision features
with torch.no_grad():
    vision_features = vla.vision_backbone(pixel_values)
    print(f"\nVision backbone output: {vision_features.shape}")
    
    # Project to LLM space
    projected = vla.projector(vision_features)
    print(f"Projected features: {projected.shape}")
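
The `pixel_values` tensor printed above has six channels rather than three. As far as we understand OpenVLA's Hugging Face processor, it normalizes the image once per encoder and stacks the two copies along the channel axis; the backbone then splits them apart again. The sketch below reproduces that layout in NumPy. The normalization statistics are assumptions (ImageNet statistics for DINOv2, 0.5/0.5 for SigLIP), so check them against the processor's own configuration.

```python
import numpy as np

def normalize(image_uint8, mean, std):
    """uint8 HWC image -> float32 CHW array, normalized per channel."""
    x = image_uint8.astype(np.float32) / 255.0
    x = (x - np.asarray(mean, dtype=np.float32)) / np.asarray(std, dtype=np.float32)
    return x.transpose(2, 0, 1)

# Assumed statistics, not read from the OpenVLA processor
DINO_STATS = ((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
SIGLIP_STATS = ((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

# Stack both normalized copies along the channel axis, then add a batch axis
stacked = np.concatenate(
    [normalize(image, *DINO_STATS), normalize(image, *SIGLIP_STATS)], axis=0
)[None]

# Inside the backbone, the six channels are split back into two 3-channel inputs
dino_in, siglip_in = np.split(stacked, [3], axis=1)
print(stacked.shape, dino_in.shape, siglip_in.shape)
```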

---
## 5. Why This Combination Works for Robotics

### DINOv2 Contribution
- **Object discovery**: Naturally segments objects without labels
- **Spatial understanding**: Captures where things are
- **Part-whole relationships**: Understands object structure

### SigLIP Contribution  
- **Instruction grounding**: Maps language to visual concepts
- **Compositional understanding**: "red block" vs "blue block"
- **Action relevance**: Learned from web data describing actions

In [None]:
# Conceptual comparison of what each encoder captures
comparison_table = """
┌─────────────────┬──────────────────────┬──────────────────────┐
│  Robot Task     │  DINOv2 Captures     │  SigLIP Captures     │
├─────────────────┼──────────────────────┼──────────────────────┤
│                 │                      │                      │
│ "Pick up the    │ - Object boundaries  │ - "Red" vs "blue"    │
│  red block"     │ - Block shape/size   │ - "Block" concept    │
│                 │ - Spatial position   │ - "Pick up" action   │
│                 │                      │                      │
├─────────────────┼──────────────────────┼──────────────────────┤
│                 │                      │                      │
│ "Place cup in   │ - Cup outline        │ - "Cup" vs "mug"     │
│  the drawer"    │ - Drawer structure   │ - "In" relationship  │
│                 │ - Opening detection  │ - "Drawer" concept   │
│                 │                      │                      │
├─────────────────┼──────────────────────┼──────────────────────┤
│                 │                      │                      │
│ "Stack blocks   │ - Each block's       │ - "Stack" action     │
│  by size"       │   position           │ - Size comparison    │
│                 │ - Relative sizes     │ - Order concept      │
│                 │                      │                      │
└─────────────────┴──────────────────────┴──────────────────────┘
"""
print(comparison_table)

---
## 6. Practical Feature Visualization

In [None]:
# Create a simple visualization of feature activations
def visualize_feature_statistics(features, name):
    """Visualize basic statistics of extracted features."""
    features_np = features.detach().float().cpu().numpy()  # .cpu() so GPU tensors also work
    
    print(f"\n{name} Feature Statistics:")
    print("="*50)
    print(f"Shape: {features_np.shape}")
    print(f"Mean: {features_np.mean():.4f}")
    print(f"Std: {features_np.std():.4f}")
    print(f"Min: {features_np.min():.4f}")
    print(f"Max: {features_np.max():.4f}")
    
    # Feature magnitude per token
    token_magnitudes = np.linalg.norm(features_np[0], axis=1)
    
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.hist(features_np.flatten(), bins=50, alpha=0.7)
    plt.title(f"{name}: Feature Value Distribution")
    plt.xlabel("Value")
    plt.ylabel("Count")
    
    plt.subplot(1, 2, 2)
    plt.plot(token_magnitudes)
    plt.title(f"{name}: Token Magnitude")
    plt.xlabel("Token Index")
    plt.ylabel("L2 Norm")
    
    plt.tight_layout()
    plt.show()

# Visualize the fused features
visualize_feature_statistics(vision_features, "Fused Vision")

In [None]:
# Visualize the projected features
visualize_feature_statistics(projected, "Projected (LLM-ready)")

In [None]:
# Compare with standalone encoders
with torch.no_grad():
    # DINOv2 alone
    dino_input = transform(sample_image).unsqueeze(0)
    dino_features = dinov2.forward_features(dino_input)
    visualize_feature_statistics(dino_features, "DINOv2 Only")
    
    # SigLIP alone
    siglip_input = siglip_transform(sample_image).unsqueeze(0)
    siglip_feats = siglip.forward_features(siglip_input)
    visualize_feature_statistics(siglip_feats, "SigLIP Only")
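
Histograms show the scale of the features but not their spatial layout. A common way to look at patch features is to project them onto their top three principal components and display the result as an RGB image. The sketch below does this with plain NumPy on random stand-in features; `pca_rgb` is a helper introduced here for illustration.

```python
import numpy as np

def pca_rgb(patch_features, grid_size):
    """Project (N, D) patch features onto their top-3 principal components
    and rescale each component to [0, 1] so the result can be shown as RGB."""
    x = patch_features - patch_features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, sorted by explained variance
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:3].T  # (N, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / (hi - lo + 1e-8)
    return proj.reshape(grid_size, grid_size, 3)

rng = np.random.default_rng(0)
fake_patches = rng.standard_normal((256, 64))  # stand-in for real patch features
rgb = pca_rgb(fake_patches, grid_size=16)
print(rgb.shape)
```

With real features, drop the prefix tokens first and set `grid_size` to the square root of the number of remaining patch tokens, then pass the result to `plt.imshow`. On a random-noise input the picture will look like noise; a real camera frame is needed to see object structure.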

---
## 7. The Projector: Bridging Vision and Language

In [None]:
# Inspect the projector architecture
projector = vla.projector

print("Projector Architecture:")
print("="*60)
for name, module in projector.named_modules():
    if name:  # Skip root
        if isinstance(module, torch.nn.Linear):
            print(f"{name}: Linear({module.in_features} → {module.out_features})")
        else:
            print(f"{name}: {module.__class__.__name__}")

In [None]:
# Understanding projector dimensionality
print("\nDimensionality Flow:")
print("="*60)
print(f"Vision features: {vision_features.shape[-1]} dims")
# The Hugging Face OpenVLA model exposes the LLM as `language_model`
print(f"LLM embedding: {vla.language_model.config.hidden_size} dims")
print(f"Projected: {projected.shape[-1]} dims")
print(f"\nThe projector maps vision features to match LLM embedding dimension.")
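
For intuition about what the projector computes, here is a NumPy sketch of a three-layer MLP. The layout (Linear → GELU → Linear → GELU → Linear, with a first hidden width of 4× the vision dim) is our assumption about OpenVLA's fused projector; the architecture printed by the cell above is the ground truth. The forward pass uses scaled-down dimensions to stay light, and the GELU is the tanh approximation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_dims(vision_dim, llm_dim):
    """Assumed widths: vision -> 4 * vision -> llm -> llm."""
    return [vision_dim, 4 * vision_dim, llm_dim, llm_dim]

def make_projector(vision_dim, llm_dim, rng):
    dims = layer_dims(vision_dim, llm_dim)
    return [
        (rng.standard_normal((d_in, d_out)) * 0.02, np.zeros(d_out))
        for d_in, d_out in zip(dims[:-1], dims[1:])
    ]

def project(x, layers):
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:  # no activation after the last layer
            x = gelu(x)
    return x

def count_params(vision_dim, llm_dim):
    dims = layer_dims(vision_dim, llm_dim)
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims[:-1], dims[1:]))

rng = np.random.default_rng(0)
# Scaled-down stand-ins for the real 2176 -> 4096 mapping
layers = make_projector(vision_dim=34, llm_dim=64, rng=rng)
tokens = rng.standard_normal((1, 256, 34))
projected_demo = project(tokens, layers)
print(projected_demo.shape)
print(f"{count_params(2176, 4096) / 1e6:.1f}M parameters at full size")
```

The projector only changes the last axis, so the number of visual tokens is unchanged.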

---
## Summary

### Key Insights

1. **Dual Vision Encoder**: OpenVLA uses both DINOv2 and SigLIP
   - DINOv2: Rich semantic features from self-supervised learning
   - SigLIP: Text-aligned features from contrastive learning

2. **ViT Architecture**: Both use Vision Transformers
   - Image → patches → tokens → self-attention → features
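   - 256 patch tokens per encoder at 224×224 with 14×14 patches (16 × 16 grid)

3. **Feature Fusion**: Channel-wise concatenation
   - Each patch's DINOv2 (1024-dim) and SigLIP (1152-dim) features are joined into one 2176-dim token
   - Preserves both semantic richness and text alignment without adding tokens

4. **Projector**: Maps to LLM space
   - Fused vision dim (2176) → LLM dim (4096)
   - Projected patch tokens enter the LLM's input sequence alongside the text tokens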

### Why This Matters for Robot Actions
- DINOv2 tells the model **where objects are** and their **structure**
- SigLIP tells the model **what the instruction means** visually
- Together, they enable precise action prediction

### Next Steps
→ Continue to **04_action_tokenization.ipynb** to understand how continuous actions become tokens.

In [None]:
# Clean up
del vla, dinov2, siglip
torch.cuda.empty_cache()
print("Memory cleared.")