# 03 - Model Architecture

In this notebook, we'll design and implement neural network architectures for video similarity learning.

## Learning Objectives

By the end of this notebook, you will:
- Understand different similarity learning architectures
- Implement Siamese networks and triplet networks
- Design custom loss functions for similarity learning
- Compare different architectural choices
- **Complete 5 hands-on exercises** requiring architectural design, plus a final design report

## Key Concepts

**Siamese Networks**: Two weight-sharing branches that embed a pair of inputs so their similarity can be measured.

**Triplet Networks**: Networks that learn from triplets of examples (anchor, positive, negative) using a single shared encoder.

**Contrastive Loss**: Loss function that pulls similar pairs closer and pushes dissimilar pairs apart until they are at least a margin away.

**Triplet Loss**: Loss function that requires the positive to be closer to the anchor than the negative by at least a margin.
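
To make the two losses concrete, here is a small worked example in plain Python. It is only a sketch: the function names and the distances are made up for illustration, and the default margins (1.0 and 0.3) simply mirror the defaults used by the loss classes later in this notebook.

```python
def contrastive_loss(distance, label, margin=1.0):
    """Per-pair contrastive loss; label is 1 for similar, 0 for different."""
    similar_term = label * distance ** 2
    different_term = (1 - label) * max(0.0, margin - distance) ** 2
    return similar_term + different_term


def triplet_loss(pos_distance, neg_distance, margin=0.3):
    """Per-triplet loss: zero once the negative is farther than the positive by `margin`."""
    return max(0.0, pos_distance - neg_distance + margin)


# Similar pair: any remaining distance is penalised quadratically.
print(contrastive_loss(distance=0.5, label=1))
# Dissimilar pair inside the margin: penalised until it reaches the margin.
print(contrastive_loss(distance=0.5, label=0))
# Dissimilar pair already beyond the margin: no loss.
print(contrastive_loss(distance=1.5, label=0))
# Easy triplet (negative is far enough away): no loss.
print(triplet_loss(pos_distance=0.4, neg_distance=0.9))
# Hard triplet (negative is barely farther than the positive): positive loss.
print(triplet_loss(pos_distance=0.7, neg_distance=0.8))
```

Note that both losses go to zero once the margin is satisfied, so well-separated examples stop contributing gradient.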

In [None]:
# Import required libraries
import sys
import os
from pathlib import Path

# Add the project root to the path
project_root = Path.cwd().parent
sys.path.append(str(project_root))

import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm

# Import our utilities
from utils.model_utils import VideoFeatureExtractor, get_model_summary
from utils.video_utils import load_video, create_frame_transforms
from utils.data_utils import VideoDataset

# Set up plotting
plt.style.use('default')
sns.set_palette("husl")

print("Libraries imported successfully!")
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

## 1. Understanding Similarity Learning

Let's start by understanding the fundamental concepts of similarity learning.
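
Before looking at the diagrams, it may help to see weight sharing in a few lines of code. This is a toy sketch, not the notebook's model: the tiny `encoder`, the 16-dimensional inputs, and the noise level are all assumptions chosen to keep the example fast.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# One encoder, used for both inputs: this weight sharing is what makes a network "Siamese".
encoder = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))

video_a = torch.randn(4, 16)                   # 4 toy "videos", each a 16-dim feature vector
video_b = video_a + 0.01 * torch.randn(4, 16)  # near-duplicates of video_a
video_c = torch.randn(4, 16)                   # unrelated inputs

with torch.no_grad():
    emb_a, emb_b, emb_c = encoder(video_a), encoder(video_b), encoder(video_c)

sim_ab = F.cosine_similarity(emb_a, emb_b, dim=1)
sim_ac = F.cosine_similarity(emb_a, emb_c, dim=1)
print("near-duplicate pairs:", sim_ab)
print("unrelated pairs:     ", sim_ac)
```

Because both inputs pass through the same weights, near-identical inputs land on near-identical embeddings even before any training. Training then shapes the embedding space so that semantically similar videos also end up close.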

In [None]:
# Visualize similarity learning concept
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Siamese Network
axes[0].text(0.5, 0.5, 'Siamese Network\n\nInput A → Encoder → Features A\nInput B → Encoder → Features B\n\nCompare Features', 
             ha='center', va='center', transform=axes[0].transAxes, fontsize=12)
axes[0].set_title('Siamese Network')
axes[0].axis('off')

# Triplet Network
axes[1].text(0.5, 0.5, 'Triplet Network\n\nAnchor → Encoder → Features A\nPositive → Encoder → Features P\nNegative → Encoder → Features N\n\nLearn: d(A,P) < d(A,N)', 
             ha='center', va='center', transform=axes[1].transAxes, fontsize=12)
axes[1].set_title('Triplet Network')
axes[1].axis('off')

# Contrastive Learning
axes[2].text(0.5, 0.5, 'Contrastive Learning\n\nSimilar pairs: Pull closer\nDifferent pairs: Push apart\n\nLearn meaningful representations',
             ha='center', va='center', transform=axes[2].transAxes, fontsize=12)
axes[2].set_title('Contrastive Learning')
axes[2].axis('off')

plt.tight_layout()
plt.show()

## 2. Basic Siamese Network Implementation

Let's implement a basic Siamese network for video similarity learning.

In [None]:
class VideoSiameseNetwork(nn.Module):
    """Basic Siamese network for video similarity learning"""
    
    def __init__(self, feature_dim=2048, embedding_dim=128):
        super(VideoSiameseNetwork, self).__init__()
        
        # Feature extractor (called on the input frames in forward_one)
        self.feature_extractor = VideoFeatureExtractor(
            model_name='resnet50',
            image_size=224,
            pooling_strategy='mean'
        )
        
        # Embedding layers
        self.embedding = nn.Sequential(
            nn.Linear(feature_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, embedding_dim)
        )
        
        # Similarity layer
        self.similarity_layer = nn.Sequential(
            nn.Linear(embedding_dim * 2, 128),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()
        )
        
    def forward_one(self, video_frames):
        """Extract features and create embedding for one video"""
        features = self.feature_extractor.extract_features(video_frames)
        embedding = self.embedding(features)
        return embedding
    
    def forward(self, video1_frames, video2_frames):
        """Forward pass for two videos"""
        # Extract embeddings
        embedding1 = self.forward_one(video1_frames)
        embedding2 = self.forward_one(video2_frames)
        
        # Concatenate embeddings
        combined = torch.cat([embedding1, embedding2], dim=1)
        
        # Predict similarity
        similarity = self.similarity_layer(combined)
        
        # squeeze(-1) keeps the batch dimension even when the batch size is 1
        return similarity.squeeze(-1)

# Create model
model = VideoSiameseNetwork()
print("Siamese network created successfully!")
print(f"Total parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Trainable parameters: {sum(p.numel() for p in model.parameters() if p.requires_grad):,}")

## 🎯 EXERCISE 1: Siamese Network Analysis

**Task**: Analyze and improve the Siamese network architecture.

**Requirements**:
1. Calculate the number of parameters in each layer of the network
2. Implement a different similarity function (cosine similarity, Euclidean distance)
3. Add batch normalization to improve training stability
4. Create a visualization of the network architecture
5. Suggest architectural improvements for better performance
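
As one possible starting point for requirement 1, the sketch below counts parameters per tensor with `named_parameters()`. It uses a stand-in `embedding_head` with the same layer sizes as `VideoSiameseNetwork.embedding`, so it runs without the ResNet backbone. The helper name `count_parameters_per_layer` is hypothetical.

```python
import torch.nn as nn

# Stand-in for the embedding head of VideoSiameseNetwork (same layer sizes),
# so the counts can be checked without loading the ResNet backbone.
embedding_head = nn.Sequential(
    nn.Linear(2048, 512),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Dropout(0.3),
    nn.Linear(256, 128),
)


def count_parameters_per_layer(module):
    """Return {parameter_name: number_of_elements} for every parameter tensor."""
    return {name: p.numel() for name, p in module.named_parameters()}


per_layer = count_parameters_per_layer(embedding_head)
for name, count in per_layer.items():
    print(f"{name:<12} {count:>10,}")
print(f"{'total':<12} {sum(per_layer.values()):>10,}")
```

A `Linear(in, out)` layer holds `in * out` weights plus `out` biases, which you can use to check the printed numbers by hand. The same helper should work on the full model.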

**Your code here**:

In [None]:
# TODO: Write your Siamese network analysis code

# 1. Calculate parameters per layer
# Your code here...

# 2. Implement different similarity functions
# Your code here...

# 3. Add batch normalization
# Your code here...

# 4. Create architecture visualization
# Your code here...

# 5. Suggest improvements
# Your code here...

## 3. Triplet Network Implementation

Now let's implement a triplet network for more effective similarity learning.
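
The triplet network in this section L2-normalizes its embeddings. A useful consequence is that squared Euclidean distance and cosine similarity then carry the same information. The check below is a small sketch on random vectors, with the batch size and dimension chosen arbitrarily.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
a = F.normalize(torch.randn(5, 128), p=2, dim=1)
b = F.normalize(torch.randn(5, 128), p=2, dim=1)

squared_distance = F.pairwise_distance(a, b).pow(2)
cosine = F.cosine_similarity(a, b, dim=1)

# For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b), so distances lie in [0, 2].
identity_holds = torch.allclose(squared_distance, 2 - 2 * cosine, atol=1e-3)
print(identity_holds)
```

Since distances between unit vectors are bounded by 2, margins are usually chosen well below that value.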

In [None]:
class VideoTripletNetwork(nn.Module):
    """Triplet network for video similarity learning"""
    
    def __init__(self, feature_dim=2048, embedding_dim=128):
        super(VideoTripletNetwork, self).__init__()
        
        # Feature extractor
        self.feature_extractor = VideoFeatureExtractor(
            model_name='resnet50',
            image_size=224,
            pooling_strategy='mean'
        )
        
        # Embedding network (shared across anchor, positive, negative)
        self.embedding = nn.Sequential(
            nn.Linear(feature_dim, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.BatchNorm1d(256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, embedding_dim)
        )
        
    def forward_one(self, video_frames):
        """Extract embedding for one video"""
        features = self.feature_extractor.extract_features(video_frames)
        embedding = self.embedding(features)
        # L2 normalize embeddings
        embedding = F.normalize(embedding, p=2, dim=1)
        return embedding
    
    def forward(self, anchor_frames, positive_frames, negative_frames):
        """Forward pass for triplet (anchor, positive, negative)"""
        anchor_embedding = self.forward_one(anchor_frames)
        positive_embedding = self.forward_one(positive_frames)
        negative_embedding = self.forward_one(negative_frames)
        
        return anchor_embedding, positive_embedding, negative_embedding

# Create triplet network
triplet_model = VideoTripletNetwork()
print("Triplet network created successfully!")
print(f"Total parameters: {sum(p.numel() for p in triplet_model.parameters()):,}")

## 4. Loss Functions for Similarity Learning

Let's implement different loss functions for similarity learning.
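
As a sanity check on the triplet formulation used in this section, the sketch below compares a hand-written hinge on Euclidean distances with PyTorch's built-in `nn.TripletMarginLoss`. The random embeddings and the noise scale are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
anchor = F.normalize(torch.randn(8, 128), dim=1)
positive = F.normalize(anchor + 0.02 * torch.randn(8, 128), dim=1)
negative = F.normalize(torch.randn(8, 128), dim=1)

margin = 0.3
# Hand-written version: hinge on (positive distance - negative distance + margin).
manual = torch.clamp(
    F.pairwise_distance(anchor, positive) - F.pairwise_distance(anchor, negative) + margin,
    min=0.0,
).mean()

# PyTorch's built-in equivalent (Euclidean distance, mean reduction by default).
builtin = nn.TripletMarginLoss(margin=margin, p=2)(anchor, positive, negative)

print(manual.item(), builtin.item())
```

Writing the loss by hand is still worthwhile for learning and for custom variants, but the built-in version is a convenient reference to test against.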

In [None]:
class ContrastiveLoss(nn.Module):
    """Contrastive loss for Siamese networks"""
    
    def __init__(self, margin=1.0):
        super(ContrastiveLoss, self).__init__()
        self.margin = margin
    
    def forward(self, embedding1, embedding2, label):
        """
        Args:
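            embedding1, embedding2: Embeddings of two videos, shape (batch, dim),
                e.g. the output of forward_one (not the sigmoid similarity score)
            label: Float tensor of shape (batch,): 1 if similar, 0 if different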
        """
        # Calculate Euclidean distance
        distance = F.pairwise_distance(embedding1, embedding2)
        
        # Contrastive loss
        loss_similar = label * torch.pow(distance, 2)
        loss_different = (1 - label) * torch.pow(torch.clamp(self.margin - distance, min=0.0), 2)
        
        loss = torch.mean(loss_similar + loss_different)
        return loss

class TripletLoss(nn.Module):
    """Triplet loss for triplet networks"""
    
    def __init__(self, margin=0.3):
        super(TripletLoss, self).__init__()
        self.margin = margin
    
    def forward(self, anchor, positive, negative):
        """
        Args:
            anchor, positive, negative: Embeddings of triplet
        """
        # Calculate distances
        pos_distance = F.pairwise_distance(anchor, positive)
        neg_distance = F.pairwise_distance(anchor, negative)
        
        # Triplet loss
        loss = torch.clamp(pos_distance - neg_distance + self.margin, min=0.0)
        loss = torch.mean(loss)
        
        return loss

# Test loss functions
print("Loss functions created successfully!")
print("\nContrastive Loss:")
print("- Pulls similar pairs closer together")
print("- Pushes different pairs apart (beyond margin)")
print("\nTriplet Loss:")
print("- Ensures positive is closer to anchor than negative")
print("- Uses margin to create separation")

## 🎯 EXERCISE 2: Loss Function Analysis

**Task**: Analyze and compare different loss functions for similarity learning.

**Requirements**:
1. Implement additional loss functions (N-pair loss, angular loss)
2. Create a function to visualize loss landscapes
3. Compare the behavior of different loss functions on sample data
4. Analyze the impact of margin values on training
5. Suggest optimal loss functions for different scenarios
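
For requirement 4, one way to begin is to measure how many triplets are still "active" (non-zero loss) at each margin. The sketch below does this on synthetic unit-length embeddings. The data, the noise level, and the list of margins are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# Toy unit-length embeddings; positives are noisy copies of the anchors.
anchor = F.normalize(torch.randn(256, 64), dim=1)
positive = F.normalize(anchor + 0.1 * torch.randn(256, 64), dim=1)
negative = F.normalize(torch.randn(256, 64), dim=1)

pos_distance = F.pairwise_distance(anchor, positive)
neg_distance = F.pairwise_distance(anchor, negative)

active_fraction = {}
for margin in [0.1, 0.3, 0.5, 1.0, 1.5]:
    per_triplet = torch.clamp(pos_distance - neg_distance + margin, min=0.0)
    # A triplet is "active" when it still contributes a non-zero loss (and gradient).
    active_fraction[margin] = (per_triplet > 0).float().mean().item()
    print(f"margin={margin:.1f}  mean loss={per_triplet.mean().item():.4f}  "
          f"active triplets={active_fraction[margin]:.1%}")
```

A larger margin can only keep more triplets active. The trade-off is between too few informative triplets (small margin) and a target that may be impossible to satisfy (very large margin).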

**Your code here**:

In [None]:
# TODO: Write your loss function analysis code

# 1. Implement additional loss functions
# Your code here...

# 2. Create loss landscape visualization
# Your code here...

# 3. Compare loss behaviors
# Your code here...

# 4. Analyze margin impact
# Your code here...

# 5. Suggest optimal losses
# Your code here...

## 5. Advanced Architectures

Let's implement more advanced architectures for video similarity learning.
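
The attention-based model in this section applies self-attention over frame-level features and then averages over frames. The sketch below traces the tensor shapes through those two steps using toy sizes (64-dimensional features instead of 2048) so it runs quickly.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, num_frames, feature_dim = 2, 16, 64  # small toy sizes, not the notebook's 2048

attention = nn.MultiheadAttention(embed_dim=feature_dim, num_heads=8, batch_first=True)
frame_features = torch.randn(batch_size, num_frames, feature_dim)

# Self-attention: the same tensor serves as query, key and value.
attended, weights = attention(frame_features, frame_features, frame_features)
video_features = attended.mean(dim=1)  # average over the frame axis

print(attended.shape)        # (batch, frames, feature_dim)
print(weights.shape)         # (batch, frames, frames), averaged over heads
print(video_features.shape)  # (batch, feature_dim)
```

Note that `embed_dim` must be divisible by `num_heads`. Each row of `weights` sums to 1 and shows how much one frame attends to every other frame.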

In [None]:
class AttentionBasedSiamese(nn.Module):
    """Siamese network with attention mechanism"""
    
    def __init__(self, feature_dim=2048, embedding_dim=128, num_heads=8):
        super(AttentionBasedSiamese, self).__init__()
        
        self.feature_extractor = VideoFeatureExtractor(
            model_name='resnet50',
            image_size=224,
            pooling_strategy='none'  # Get frame-level features
        )
        
        # Attention mechanism
        self.attention = nn.MultiheadAttention(
            embed_dim=feature_dim,
            num_heads=num_heads,
            batch_first=True
        )
        
        # Embedding layers
        self.embedding = nn.Sequential(
            nn.Linear(feature_dim, 512),
            nn.BatchNorm1d(512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, embedding_dim)
        )
        
        # Similarity layer
        self.similarity_layer = nn.Sequential(
            nn.Linear(embedding_dim * 2, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
            nn.Sigmoid()
        )
        
    def forward_one(self, video_frames):
        """Extract features with attention"""
        # Get frame-level features
        frame_features = self.feature_extractor.extract_frame_features(video_frames)
        
        # Apply self-attention
        attended_features, _ = self.attention(frame_features, frame_features, frame_features)
        
        # Global average pooling
        video_features = torch.mean(attended_features, dim=1)
        
        # Create embedding
        embedding = self.embedding(video_features)
        return embedding
    
    def forward(self, video1_frames, video2_frames):
        """Forward pass"""
        embedding1 = self.forward_one(video1_frames)
        embedding2 = self.forward_one(video2_frames)
        
        combined = torch.cat([embedding1, embedding2], dim=1)
        similarity = self.similarity_layer(combined)
        
        # squeeze(-1) keeps the batch dimension even when the batch size is 1
        return similarity.squeeze(-1)

# Create attention-based model
attention_model = AttentionBasedSiamese()
print("Attention-based Siamese network created successfully!")
print(f"Total parameters: {sum(p.numel() for p in attention_model.parameters()):,}")

## 🎯 EXERCISE 3: Advanced Architecture Design

**Task**: Design and implement advanced architectures for video similarity learning.

**Requirements**:
1. Implement a temporal convolutional network (TCN) for video processing
2. Design a hierarchical attention mechanism
3. Create a multi-scale feature fusion architecture
4. Implement a graph neural network for video similarity
5. Compare the computational complexity of different architectures
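
For requirement 1, a possible starting point is a residual block of dilated 1D convolutions over the frame axis. This is a minimal sketch under stated assumptions: the class name `TemporalBlock` is hypothetical, the padding is symmetric (so the block is not causal), and the sizes are toy values.

```python
import torch
import torch.nn as nn


class TemporalBlock(nn.Module):
    """One residual block of dilated 1D convolutions over the frame axis."""

    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        # This padding keeps the sequence length unchanged for odd kernel sizes.
        padding = (kernel_size - 1) // 2 * dilation
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=padding, dilation=dilation),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=padding, dilation=dilation),
        )
        self.activation = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, time); the skip connection adds the input back.
        return self.activation(x + self.conv(x))


# Doubling the dilation per block grows the temporal receptive field quickly.
tcn = nn.Sequential(*[TemporalBlock(channels=32, dilation=d) for d in (1, 2, 4)])

frame_features = torch.randn(2, 16, 32)     # (batch, time, dim)
out = tcn(frame_features.transpose(1, 2))   # Conv1d expects (batch, dim, time)
video_embedding = out.mean(dim=2)           # pool over time -> (batch, dim)
print(out.shape, video_embedding.shape)
```

Compared with self-attention, whose cost grows quadratically with the number of frames, a stack like this scales linearly in sequence length. That difference is worth including in the complexity comparison.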

**Your code here**:

In [None]:
# TODO: Write your advanced architecture design code

# 1. Implement TCN
# Your code here...

# 2. Design hierarchical attention
# Your code here...

# 3. Create multi-scale fusion
# Your code here...

# 4. Implement GNN
# Your code here...

# 5. Compare complexity
# Your code here...

## 6. Model Evaluation and Comparison

Let's create functions to evaluate and compare different architectures.
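
The evaluation code in this section reports ROC AUC. One way to build intuition for that number is its ranking interpretation: the fraction of (similar, dissimilar) pairs where the similar one receives the higher score. The plain-Python helper below, `pairwise_auc`, is a hypothetical illustration of that idea, not a replacement for `roc_auc_score`.

```python
def pairwise_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]  # predicted similarity for each pair
auc_value = pairwise_auc(labels, scores)
print(auc_value)  # 3 of the 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, independent of any decision threshold.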

In [None]:
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)

def evaluate_model(model, test_loader, device='cpu'):
    """Evaluate a similarity learning model"""
    model.to(device)
    model.eval()
    
    similarities = []
    labels = []
    
    with torch.no_grad():
        for batch in test_loader:
            video1, video2, label = batch
            
            # Move to device
            video1 = video1.to(device)
            video2 = video2.to(device)
            
            # Get predictions
            pred = model(video1, video2)
            
            # reshape(-1) keeps this working when a batch holds a single pair
            similarities.extend(pred.cpu().numpy().reshape(-1))
            labels.extend(label.cpu().numpy().reshape(-1))
    
    # Calculate metrics
    similarities = np.array(similarities)
    labels = np.array(labels)
    
    # ROC AUC
    auc = roc_auc_score(labels, similarities)
    
    # Average precision (summarizes the precision-recall curve)
    ap = average_precision_score(labels, similarities)
    
    return {
        'auc': auc,
        'ap': ap,
        'similarities': similarities,
        'labels': labels
    }

def compare_architectures(models, test_loader, device='cpu'):
    """Compare multiple model architectures"""
    results = {}
    
    for name, model in models.items():
        print(f"Evaluating {name}...")
        results[name] = evaluate_model(model, test_loader, device)
        
    # Create comparison plot
    fig, axes = plt.subplots(1, 2, figsize=(15, 6))
    
    # ROC curves
    for name, result in results.items():
        fpr, tpr, _ = roc_curve(result['labels'], result['similarities'])
        axes[0].plot(fpr, tpr, label=f'{name} (AUC={result["auc"]:.3f})')
    
    axes[0].plot([0, 1], [0, 1], 'k--', alpha=0.5)
    axes[0].set_xlabel('False Positive Rate')
    axes[0].set_ylabel('True Positive Rate')
    axes[0].set_title('ROC Curves')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)
    
    # Precision-Recall curves
    for name, result in results.items():
        precision, recall, _ = precision_recall_curve(result['labels'], result['similarities'])
        axes[1].plot(recall, precision, label=f'{name} (AP={result["ap"]:.3f})')
    
    axes[1].set_xlabel('Recall')
    axes[1].set_ylabel('Precision')
    axes[1].set_title('Precision-Recall Curves')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)
    
    plt.tight_layout()
    plt.show()
    
    return results

print("Model evaluation functions created successfully!")

## 🎯 EXERCISE 4: Model Evaluation and Analysis

**Task**: Evaluate and analyze different model architectures.

**Requirements**:
1. Create synthetic test data for model evaluation
2. Implement additional evaluation metrics (F1-score, confusion matrix)
3. Analyze model performance on different types of video pairs
4. Create a model complexity vs performance trade-off analysis
5. Suggest ensemble methods for improved performance
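
For requirement 2, the sketch below shows one way to turn similarity scores into a confusion matrix and an F1-score without extra libraries. The 0.5 threshold, the helper names, and the toy labels and scores are assumptions for illustration.

```python
def confusion_counts(labels, scores, threshold=0.5):
    """Threshold similarity scores and count (tp, fp, fn, tn)."""
    tp = fp = fn = tn = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); returns 0 when that denominator is 0."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.2, 0.7, 0.1]

tp, fp, fn, tn = confusion_counts(labels, scores)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"F1={f1_from_counts(tp, fp, fn):.3f}")
```

Unlike AUC and average precision, these metrics depend on the chosen threshold, so it is worth sweeping the threshold on a validation set rather than fixing it at 0.5.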

**Your code here**:

In [None]:
# TODO: Write your model evaluation and analysis code

# 1. Create synthetic test data
# Your code here...

# 2. Implement additional metrics
# Your code here...

# 3. Analyze performance on different video types
# Your code here...

# 4. Complexity vs performance analysis
# Your code here...

# 5. Suggest ensemble methods
# Your code here...

## 7. Architecture Design Patterns

Let's explore common design patterns for video similarity learning.

In [None]:
# Design pattern: Feature fusion
class FeatureFusionModule(nn.Module):
    """Module for fusing different types of features"""
    
    def __init__(self, feature_dims, fusion_dim=256):
        super(FeatureFusionModule, self).__init__()
        
        self.feature_dims = feature_dims
        self.fusion_dim = fusion_dim
        
        # Individual feature projections
        self.projections = nn.ModuleList([
            nn.Linear(dim, fusion_dim) for dim in feature_dims
        ])
        
        # Fusion layer
        self.fusion = nn.Sequential(
            nn.Linear(fusion_dim * len(feature_dims), fusion_dim),
            nn.ReLU(),
            nn.Dropout(0.3)
        )
        
    def forward(self, features_list):
        """Fuse multiple feature vectors"""
        # Project each feature to same dimension
        projected_features = []
        for i, features in enumerate(features_list):
            projected = self.projections[i](features)
            projected_features.append(projected)
        
        # Concatenate and fuse
        concatenated = torch.cat(projected_features, dim=1)
        fused = self.fusion(concatenated)
        
        return fused

# Design pattern: Multi-scale processing
class MultiScaleProcessor(nn.Module):
    """Process video at multiple temporal scales"""
    
    def __init__(self, base_dim, scales=(1, 2, 4)):  # tuple avoids a shared mutable default
        super(MultiScaleProcessor, self).__init__()
        
        self.scales = scales
        self.processors = nn.ModuleList([
            nn.Conv1d(base_dim, base_dim, kernel_size=scale, stride=scale)
            for scale in scales
        ])
        
        # Fusion layer
        self.fusion = nn.Linear(base_dim * len(scales), base_dim)
        
    def forward(self, features):
        """Process features at multiple scales"""
        # features shape: (batch, time, dim); time must be >= max(self.scales)
        features = features.transpose(1, 2)  # (batch, dim, time)
        
        multi_scale_features = []
        for processor in self.processors:
            processed = processor(features)
            # Global average pooling
            pooled = torch.mean(processed, dim=2)
            multi_scale_features.append(pooled)
        
        # Fuse multi-scale features
        concatenated = torch.cat(multi_scale_features, dim=1)
        fused = self.fusion(concatenated)
        
        return fused

print("Design pattern modules created successfully!")
print("\nFeatureFusionModule: Combines different feature types")
print("MultiScaleProcessor: Processes video at multiple temporal scales")

## 🎯 EXERCISE 5: Architecture Design Patterns

**Task**: Implement and analyze different architecture design patterns.

**Requirements**:
1. Implement a residual connection pattern for video processing
2. Design a skip connection architecture for multi-scale features
3. Create a bottleneck design for efficient processing
4. Implement a pyramid network for hierarchical feature extraction
5. Compare the effectiveness of different design patterns
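
Requirements 1 and 3 can be combined in a single small module. The sketch below is one possible starting point: a bottleneck MLP wrapped in a residual connection, compared against a plain full-width linear layer. The class name `ResidualMLPBlock` and the sizes (256 and 64) are hypothetical choices.

```python
import torch
import torch.nn as nn


class ResidualMLPBlock(nn.Module):
    """Bottleneck MLP with a residual (identity) connection: y = x + f(x)."""

    def __init__(self, dim, bottleneck_dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, bottleneck_dim),  # squeeze to a narrower width
            nn.ReLU(),
            nn.Linear(bottleneck_dim, dim),  # expand back to the input width
        )

    def forward(self, x):
        return x + self.body(x)


def count_parameters(module):
    """Total number of elements across all parameter tensors."""
    return sum(p.numel() for p in module.parameters())


block = ResidualMLPBlock(dim=256, bottleneck_dim=64)
full_width = nn.Linear(256, 256)

x = torch.randn(4, 256)
y = block(x)
print(y.shape)
print(count_parameters(block), count_parameters(full_width))
```

The residual path requires input and output widths to match, which is why the bottleneck expands back to `dim`. The parameter counts show how the narrow middle layer reduces cost relative to a single full-width layer.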

**Your code here**:

In [None]:
# TODO: Write your architecture design patterns code

# 1. Implement residual connections
# Your code here...

# 2. Design skip connections
# Your code here...

# 3. Create bottleneck design
# Your code here...

# 4. Implement pyramid network
# Your code here...

# 5. Compare design patterns
# Your code here...

## 🎯 FINAL EXERCISE: Architecture Design Report

**Task**: Write a comprehensive report on architecture design for video similarity learning.

**Requirements**:
1. Compare Siamese vs Triplet networks for different scenarios
2. Analyze the impact of attention mechanisms on performance
3. Recommend optimal architectures for different video types
4. Suggest architectural improvements for real-time applications
5. Propose a complete architecture design methodology

**Your report here** (write in markdown):

In [None]:
# TODO: Write your architecture design report
report = """
## Architecture Design Report

### Siamese vs Triplet Networks:
[Your analysis here]

### Attention Mechanism Impact:
[Your analysis here]

### Optimal Architectures:
[Your recommendations here]

### Real-time Improvements:
[Your suggestions here]

### Design Methodology:
[Your proposal here]
"""

print(report)

## Summary

In this notebook, we've learned:

✅ **Similarity Learning**: Understanding Siamese and Triplet networks
✅ **Loss Functions**: Implementing contrastive and triplet losses
✅ **Advanced Architectures**: Attention mechanisms and feature fusion
✅ **Design Patterns**: Common patterns for video similarity learning
✅ **Model Evaluation**: Comparing different architectures
✅ **5 Interactive Exercises**: Hands-on architectural design

### Key Takeaways:

1. **Architecture Choice Matters**: Different architectures work better for different scenarios
2. **Loss Function Design**: The choice of loss function significantly impacts learning
3. **Attention Mechanisms**: Can improve performance by weighting the most informative frames before pooling
4. **Design Patterns**: Reusable components can speed up development
5. **Evaluation is Key**: Proper evaluation helps choose the best architecture

### Next Steps:

In the next notebook, we'll learn about **Training Setup** - how to prepare data and configure training parameters.

---

**Questions to think about:**
- Which architecture would work best for your specific video domain?
- How would you handle videos of very different lengths?
- What loss function would be most appropriate for your use case?
- How would you design an architecture for real-time video similarity?
- What evaluation metrics are most important for your application?