# Week 8: Deep Learning Architectures

## Objectives
- Understand the evolution of deep learning architectures
- Implement AlexNet for image classification
- Learn about skip connections and build ResNet blocks
- Implement LSTM for time series prediction
- Understand the Transformer architecture and attention mechanism
- Explore Vision Transformers (ViT)

This notebook provides a comprehensive tour through the major deep learning architectures that have shaped modern AI.


In [None]:
import numpy as np
import math
from utils import (
    show_result, generate_image_data, generate_time_series_data,
    generate_sequence_classification_data, train_test_split,
    Conv2d, MaxPool2d, ReLU, Dropout, Linear, BatchNorm2d,
    LSTM, LSTMCell, MultiHeadAttention, PositionalEncoding,
    scaled_dot_product_attention, softmax, accuracy, mse,
    test_alexnet_architecture, test_resnet_skip_connection,
    test_lstm_forward, test_transformer_attention, test_vit_patch_embedding
)


## 1. Introduction to Deep Learning Architectures

### Brief History
- **2012**: AlexNet wins ImageNet, sparking the deep learning revolution
- **2015**: ResNet introduces skip connections, enabling very deep networks (152+ layers)
- **2017**: Transformers revolutionize NLP with attention mechanisms
- **2020**: Vision Transformers (ViT) show transformers can excel at computer vision

### Key Architecture Families

1. **Convolutional Neural Networks (CNNs)**
   - Designed for spatial data (images)
   - Use local connectivity and weight sharing
   - Examples: LeNet, AlexNet, VGG, ResNet, Inception

2. **Recurrent Neural Networks (RNNs)**
   - Designed for sequential data (text, time series)
   - Maintain hidden state across time steps
   - Examples: Vanilla RNN, LSTM, GRU

3. **Transformers**
   - Use attention mechanisms to process sequences
   - Can be parallelized (unlike RNNs)
   - Examples: BERT, GPT, T5, ViT

### Why Different Architectures?
- **Inductive biases**: Built-in assumptions about the data structure
- **CNNs** assume spatial locality and translation invariance
- **RNNs** assume sequential dependencies
- **Transformers** make fewer assumptions, learn patterns from data
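
To make the CNN bias concrete, the sketch below checks translation equivariance for a 1-D cross-correlation: shifting the input shifts the filter response by the same amount, which is the property that pooling later turns into approximate invariance. The signal, kernel, and shift are made-up illustration values, and `np.correlate` stands in for a convolution layer.

```python
import numpy as np

# Toy 1-D "image": a small bump surrounded by zeros (illustrative values).
signal = np.zeros(12)
signal[3:6] = [1.0, 2.0, 3.0]

# One shared filter slides over every position (weight sharing).
kernel = np.array([1.0, 0.0, -1.0])

shift = 2
response = np.correlate(signal, kernel, mode="same")
response_of_shifted = np.correlate(np.roll(signal, shift), kernel, mode="same")

# Equivariance: filtering the shifted input equals shifting the filtered output.
# (This holds here because the bump stays away from the array borders.)
is_equivariant = np.allclose(np.roll(response, shift), response_of_shifted)
print("shift-equivariant:", is_equivariant)
```

A fully connected layer has no such guarantee, because every input position gets its own weights.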


## 2. AlexNet: The CNN Revolution

AlexNet (2012) was the breakthrough that brought deep learning to mainstream computer vision.

### Architecture Overview
```
Input (224×224×3)
    ↓
Conv1 (11×11, stride 4) → 96 filters → ReLU → MaxPool
    ↓
Conv2 (5×5) → 256 filters → ReLU → MaxPool
    ↓
Conv3 (3×3) → 384 filters → ReLU
    ↓
Conv4 (3×3) → 384 filters → ReLU
    ↓
Conv5 (3×3) → 256 filters → ReLU → MaxPool
    ↓
Flatten
    ↓
FC1 (4096) → ReLU → Dropout
    ↓
FC2 (4096) → ReLU → Dropout
    ↓
FC3 (num_classes)
```
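
The spatial size after each stage follows from the usual output-size formula, which is a quick way to work out the input width of the first fully connected layer. The sketch below traces a 224×224 input through the stack. The padding values (2 for Conv1 and Conv2, 1 for Conv3 to Conv5) and the 3×3, stride-2 pooling are assumptions that match the hints in Exercise 1, not something the diagram fixes.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224
size = conv_output_size(size, kernel=11, stride=4, padding=2)  # Conv1
size = conv_output_size(size, kernel=3, stride=2)               # MaxPool
size = conv_output_size(size, kernel=5, padding=2)              # Conv2
size = conv_output_size(size, kernel=3, stride=2)               # MaxPool
for _ in range(3):                                              # Conv3, Conv4, Conv5
    size = conv_output_size(size, kernel=3, padding=1)
size = conv_output_size(size, kernel=3, stride=2)               # MaxPool

# Conv5 has 256 filters, so this is the length of the flattened feature vector.
flattened_features = 256 * size * size
print(size, flattened_features)
```

If you choose different padding or pooling in your own implementation, rerun the trace to get the matching input size for `fc1`.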

### Key Innovations
1. **ReLU activation**: Faster training than sigmoid/tanh
2. **Dropout**: Regularization to prevent overfitting
3. **Data augmentation**: Random crops, flips
4. **GPU training**: Made deep networks practical
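
As an illustration of the dropout idea, here is a minimal NumPy sketch of the common "inverted" variant, which rescales the surviving activations during training so that nothing needs to change at evaluation time. The function name and signature are hypothetical, and the `Dropout` class in `utils` may be implemented differently.

```python
import numpy as np

def inverted_dropout(x, p=0.5, training=True, rng=None):
    """Zero each activation with probability p and rescale the survivors."""
    if not training or p == 0.0:
        return x  # evaluation mode: pass activations through unchanged
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(x.shape) >= p
    return x * keep_mask / (1.0 - p)

activations = np.ones((4, 1000))
dropped = inverted_dropout(activations, p=0.5, rng=np.random.default_rng(0))
evaluated = inverted_dropout(activations, p=0.5, training=False)

# Rescaling keeps the expected activation roughly unchanged during training.
print("train-time mean:", dropped.mean())
```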


In [None]:
# Exercise 1: Implement AlexNet

class AlexNet:
    def __init__(self, num_classes=10):
        """
        Initialize AlexNet architecture.
        For simplicity, we'll use a slightly smaller version.
        
        Args:
            num_classes: Number of output classes
        """
        # TODO: Define convolutional layers
        # Hint: Use Conv2d, ReLU, MaxPool2d from utils
        # self.conv1 = Conv2d(3, 96, kernel_size=11, stride=4, padding=2)
        # self.conv2 = Conv2d(96, 256, kernel_size=5, padding=2)
        # ... continue for conv3, conv4, conv5
        
        raise NotImplementedError
        
        # TODO: Define fully connected layers
        # self.fc1 = Linear(256 * 6 * 6, 4096)  # Adjust input size based on your conv layers
        # self.fc2 = Linear(4096, 4096)
        # self.fc3 = Linear(4096, num_classes)
        
        # TODO: Define activation and regularization
        # self.relu = ReLU()
        # self.dropout = Dropout(p=0.5)
        # self.maxpool = MaxPool2d(kernel_size=3, stride=2)
    
    def forward(self, x):
        """
        Forward pass through AlexNet.
        
        Args:
            x: Input tensor of shape (batch_size, 3, 224, 224)
        
        Returns:
            Output tensor of shape (batch_size, num_classes)
        """
        # TODO: Implement forward pass
        # 1. Pass through conv layers with ReLU and pooling
        # 2. Flatten the output
        # 3. Pass through FC layers with ReLU and dropout
        # 4. Return final output
        
        raise NotImplementedError


In [None]:
# Test AlexNet
res = test_alexnet_architecture(AlexNet)
show_result("Exercise 1 – AlexNet Architecture", res)


## 3. Skip Connections and ResNet

### The Vanishing Gradient Problem
As networks get deeper, gradients can vanish during backpropagation, making training difficult.

### Skip Connections (Residual Connections)
ResNet's key innovation: add the input directly to the output of a layer block.

```
x ─────────────────┐
│                  │
Conv → BN → ReLU   │
│                  │
Conv → BN          │
│                  │
└──────── + ←──────┘
         │
       ReLU
         │
      output
```

**Mathematical formulation:**
- Without skip: $y = F(x)$
- With skip: $y = F(x) + x$

### Why Skip Connections Work
1. **Gradient flow**: Gradients can flow directly through the skip connection
2. **Identity mapping**: Network can learn identity function easily (just set F(x) ≈ 0)
3. **Ensemble effect**: Multiple paths through the network
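
The gradient-flow argument can be checked numerically. Differentiating $y = F(x) + x$ gives $\frac{dy}{dx} = F'(x) + 1$, so every residual layer contributes a factor of at least 1 whenever $F'(x) \geq 0$. The sketch below compares a plain and a residual stack of scalar layers $F(h) = \tanh(w h)$. The depth, weight, and input are arbitrary illustration values, not taken from any real network.

```python
import numpy as np

depth = 30
weight = 0.5   # assumed per-layer weight, identical in every layer
x = 0.3        # arbitrary input value

def input_gradient(use_skip):
    """d(output)/d(input) for a stack of scalar layers F(h) = tanh(weight * h)."""
    h, grad = x, 1.0
    for _ in range(depth):
        f = np.tanh(weight * h)
        local = weight * (1.0 - f ** 2)             # dF/dh, at most `weight`
        if use_skip:
            h, grad = f + h, grad * (local + 1.0)   # y = F(h) + h
        else:
            h, grad = f, grad * local               # y = F(h)
    return grad

plain_grad = input_gradient(use_skip=False)
residual_grad = input_gradient(use_skip=True)
print(f"plain: {plain_grad:.3e}, residual: {residual_grad:.3e}")
```

In the plain stack the gradient is a product of factors no larger than 0.5, so it shrinks geometrically with depth. The "+ 1" from the skip path keeps the residual gradient from vanishing.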


In [None]:
# Exercise 2: Implement ResNet Residual Block

class ResidualBlock:
    def __init__(self, in_channels, out_channels, stride=1):
        """
        Initialize a residual block.
        
        Args:
            in_channels: Number of input channels
            out_channels: Number of output channels
            stride: Stride for the first convolution
        """
        # TODO: Define the main path (F(x))
        # self.conv1 = Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, padding=1)
        # self.bn1 = BatchNorm2d(out_channels)
        # self.conv2 = Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1)
        # self.bn2 = BatchNorm2d(out_channels)
        # self.relu = ReLU()
        
        raise NotImplementedError
        
        # TODO: Define the skip connection
        # If dimensions change, use a 1x1 conv to match dimensions
        # self.downsample = None
        # if stride != 1 or in_channels != out_channels:
        #     self.downsample = ...
    
    def forward(self, x):
        """
        Forward pass through the residual block.
        
        Args:
            x: Input tensor of shape (batch_size, in_channels, H, W)
        
        Returns:
            Output tensor of shape (batch_size, out_channels, H', W')
        """
        # TODO: Implement forward pass with skip connection
        # 1. Save input for skip connection: identity = x
        # 2. Pass through conv1 → bn1 → relu
        # 3. Pass through conv2 → bn2 (no ReLU yet!)
        # 4. If needed, apply downsample to identity
        # 5. Add: out = out + identity
        # 6. Apply final ReLU
        
        raise NotImplementedError


In [None]:
# Test Residual Block
res = test_resnet_skip_connection(ResidualBlock)
show_result("Exercise 2 – ResNet Skip Connection", res)


### Demo: ResNet on MNIST

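The demo below builds a small ResNet from your `ResidualBlock` and checks that a forward pass runs on synthetic MNIST-like data. It verifies the architecture only; no training is performed.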


In [None]:
# Simple ResNet for MNIST (28x28 grayscale images)
class SimpleResNet:
    def __init__(self, num_classes=10):
        """
        A simple ResNet for MNIST classification.
        Uses your ResidualBlock implementation.
        """
        self.conv1 = Conv2d(1, 64, kernel_size=3, stride=1, padding=1)
        self.bn1 = BatchNorm2d(64)
        self.relu = ReLU()
        
        # Residual blocks
        self.layer1 = ResidualBlock(64, 64)
        self.layer2 = ResidualBlock(64, 128, stride=2)
        self.layer3 = ResidualBlock(128, 256, stride=2)
        
        # Final classification
        self.avgpool = MaxPool2d(kernel_size=7)  # Global pooling over the 7x7 map (max pooling here; ResNet itself uses average pooling)
        self.fc = Linear(256, num_classes)
    
    def forward(self, x):
        # Initial conv
        x = self.relu(self.bn1(self.conv1(x)))
        
        # Residual blocks
        x = self.layer1.forward(x)
        x = self.layer2.forward(x)
        x = self.layer3.forward(x)
        
        # Classification head
        x = self.avgpool(x)
        x = x.reshape(x.shape[0], -1)  # Flatten
        x = self.fc(x)
        return x

# Generate some dummy MNIST-like data
print("Generating synthetic MNIST-like data...")
X, y = generate_image_data(n_samples=100, img_size=28, n_channels=1, n_classes=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print(f"Train set: {X_train.shape}, Test set: {X_test.shape}")
print(f"\nNote: In a real scenario, you would train this network with gradient descent.")
print("For this demo, we just verify the architecture works.")

try:
    model = SimpleResNet(num_classes=10)
    output = model.forward(X_train[:4])  # Forward pass on 4 samples
    print(f"\n✓ ResNet forward pass successful!")
    print(f"  Input shape: (4, 1, 28, 28)")
    print(f"  Output shape: {output.shape} (batch_size=4, num_classes=10)")
except Exception as e:
    print(f"\n✗ Error in ResNet: {e}")


## 4. LSTM for Time Series

### Why RNNs?
- Standard neural networks assume independence between inputs
- Sequences have temporal dependencies: $x_t$ depends on $x_{t-1}, x_{t-2}, ...$
- RNNs maintain a hidden state that captures information from previous time steps

### Vanilla RNN Problem
Simple RNNs suffer from vanishing/exploding gradients over long sequences.
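
The effect can be seen by tracking the Jacobian $\partial h_t / \partial h_0$ of a vanilla RNN $h_t = \tanh(W h_{t-1} + U x_t)$, which is a product of one-step Jacobians. The sketch below is only an illustration under stated assumptions: the weights are random, and `W` is deliberately rescaled to spectral norm 0.5 so that the product shrinks. With a spectral norm above 1 the same product can grow instead (exploding gradients).

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, seq_len = 8, 50

# Recurrent weights rescaled to spectral norm 0.5 (assumption chosen to show vanishing).
W = rng.standard_normal((hidden_size, hidden_size))
W *= 0.5 / np.linalg.norm(W, 2)
U = rng.standard_normal((hidden_size, 1))
inputs = rng.standard_normal((seq_len, 1))

h = np.zeros(hidden_size)
jacobian = np.eye(hidden_size)   # accumulates d h_t / d h_0
norms = []
for x_t in inputs:
    h = np.tanh(W @ h + U @ x_t)
    # One-step Jacobian of h_t with respect to h_{t-1}: diag(1 - h_t^2) @ W.
    step = (1.0 - h ** 2)[:, None] * W
    jacobian = step @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))

print(f"after 5 steps: {norms[4]:.2e}, after {seq_len} steps: {norms[-1]:.2e}")
```

Because each one-step Jacobian has norm at most 0.5 here, the influence of $h_0$ on $h_t$ decays at least as fast as $0.5^t$.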

### LSTM: Long Short-Term Memory
LSTM solves this with a gating mechanism:

1. **Forget gate** ($f_t$): What to forget from cell state
2. **Input gate** ($i_t$): What new information to add
3. **Output gate** ($o_t$): What to output

**LSTM equations:**
```
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)      # Forget gate
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)      # Input gate
g_t = tanh(W_g · [h_{t-1}, x_t] + b_g)   # Candidate values
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)      # Output gate

c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t          # Update cell state
h_t = o_t ⊙ tanh(c_t)                    # Update hidden state
```

where σ is the sigmoid function and ⊙ is element-wise multiplication.
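
The equations translate almost line for line into NumPy. The sketch below runs one cell over a short random sequence. Stacking the four weight matrices $W_f, W_i, W_g, W_o$ into a single matrix `W` is a layout choice made here for brevity, and `lstm_step` is a hypothetical helper, so the `LSTMCell` in `utils` may be organized differently.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step; W maps [h_prev, x_t] to four stacked gate pre-activations."""
    n = h_prev.shape[-1]
    z = np.concatenate([h_prev, x_t], axis=-1) @ W + b
    f_t = sigmoid(z[..., 0 * n:1 * n])   # forget gate
    i_t = sigmoid(z[..., 1 * n:2 * n])   # input gate
    g_t = np.tanh(z[..., 2 * n:3 * n])   # candidate values
    o_t = sigmoid(z[..., 3 * n:4 * n])   # output gate
    c_t = f_t * c_prev + i_t * g_t       # update cell state
    h_t = o_t * np.tanh(c_t)             # update hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
batch_size, input_size, hidden_size = 2, 3, 4
W = rng.standard_normal((hidden_size + input_size, 4 * hidden_size)) * 0.1
b = np.zeros(4 * hidden_size)

h = np.zeros((batch_size, hidden_size))
c = np.zeros((batch_size, hidden_size))
sequence = rng.standard_normal((5, batch_size, input_size))  # (seq_len, batch, features)
for x_t in sequence:
    h, c = lstm_step(x_t, h, c, W, b)
print(h.shape, c.shape)
```

Note that the cell state is updated additively (`f_t * c_prev + i_t * g_t`), which is what lets information and gradients persist across many time steps.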


In [None]:
# Exercise 3: Implement LSTM Model for Time Series Prediction

class LSTMModel:
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        """
        Initialize LSTM model for time series prediction.
        
        Args:
            input_size: Number of input features per time step
            hidden_size: Number of hidden units
            num_layers: Number of LSTM layers
            output_size: Number of output features
        """
        # TODO: Initialize LSTM and output layer
        # Hint: Use the LSTM class from utils
        # self.lstm = LSTM(input_size, hidden_size, num_layers)
        # self.fc = Linear(hidden_size, output_size)
        
        raise NotImplementedError
    
    def forward(self, x):
        """
        Forward pass through LSTM.
        
        Args:
            x: Input tensor of shape (batch_size, seq_len, input_size)
        
        Returns:
            Output tensor of shape (batch_size, output_size)
        """
        # TODO: Implement forward pass
        # 1. Pass through LSTM: output, (h_n, c_n) = self.lstm.forward(x)
        # 2. Take the last time step: last_output = output[:, -1, :]
        # 3. Pass through FC layer: prediction = self.fc(last_output)
        # 4. Return prediction
        
        raise NotImplementedError


In [None]:
# Test LSTM
res = test_lstm_forward(LSTMModel)
show_result("Exercise 3 – LSTM Forward Pass", res)


In [None]:
# Demo: LSTM for Time Series Prediction
print("Generating synthetic time series data...")
X_ts, y_ts = generate_time_series_data(n_samples=200, seq_len=50, n_features=1)
X_train, X_test, y_train, y_test = train_test_split(X_ts, y_ts, test_size=0.2)

print(f"Train set: X={X_train.shape}, y={y_train.shape}")
print(f"Test set: X={X_test.shape}, y={y_test.shape}")
print(f"\nTask: Predict future values from historical sequence")

try:
    model = LSTMModel(input_size=1, hidden_size=32, num_layers=2, output_size=1)
    predictions = model.forward(X_train[:5])
    print(f"\n✓ LSTM prediction successful!")
    print(f"  Input shape: {X_train[:5].shape}")
    print(f"  Output shape: {predictions.shape}")
    print(f"\nSample predictions vs actual:")
    for i in range(min(3, len(predictions))):
        print(f"  Sample {i}: pred={predictions[i][0]:.3f}, actual={y_train[i][0]:.3f}")
except Exception as e:
    print(f"\n✗ Error in LSTM: {e}")


## 5. Transformers and Attention

### Motivation
- RNNs process a sequence one step at a time, so computation cannot be parallelized across time steps and training is slow
- Long-range dependencies remain challenging even with LSTMs
- **Solution**: Attention mechanisms

### Attention Mechanism
**Core idea**: For each position, compute a weighted sum over all positions.

**Intuition**: When reading "The cat sat on the mat", to understand "sat", we should attend to "cat" (subject) and "mat" (object).

### Scaled Dot-Product Attention

**Inputs:**
- Query (Q): What am I looking for?
- Key (K): What do I contain?
- Value (V): What do I actually store?

**Formula:**
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

**Steps:**
1. Compute similarity: $QK^T$ (dot product)
2. Scale by $\sqrt{d_k}$ to prevent large values
3. Apply softmax to get attention weights
4. Weighted sum of values: multiply by $V$
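
Step 2 is easy to overlook, so here is a small numerical illustration of why it matters. For random vectors with unit-variance entries, the dot product of a query and a key has standard deviation of about $\sqrt{d_k}$, and unscaled scores of that size push the softmax toward a near one-hot output. The sketch uses random data and a local helper named `stable_softmax` (hypothetical, so it does not shadow `softmax` from `utils`).

```python
import numpy as np

def stable_softmax(scores):
    """Softmax over the last axis, shifted by the max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_k, seq_len = 512, 6
q = rng.standard_normal(d_k)             # one query vector
K = rng.standard_normal((seq_len, d_k))  # one key per position

raw_scores = K @ q                       # typical magnitude grows like sqrt(d_k)
scaled_scores = raw_scores / np.sqrt(d_k)

raw_weights = stable_softmax(raw_scores)
scaled_weights = stable_softmax(scaled_scores)
print("max weight, unscaled:", raw_weights.max())
print("max weight, scaled:  ", scaled_weights.max())
```

A saturated softmax has very small gradients, so dividing by $\sqrt{d_k}$ keeps the scores in a range where the attention weights can still be learned.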

### Multi-Head Attention
- Run attention multiple times in parallel with different learned projections
- Allows attending to different aspects (e.g., syntactic vs semantic)
- Concatenate outputs and project again
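
The splitting and concatenation amount to a reshape and a transpose. The sketch below shows only that bookkeeping, with the learned projections left out. `split_heads` and `merge_heads` are hypothetical helper names, and the `MultiHeadAttention` class in `utils` may structure this differently.

```python
import numpy as np

def split_heads(x, num_heads):
    """(batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_head)."""
    batch_size, seq_len, d_model = x.shape
    d_head = d_model // num_heads
    return x.reshape(batch_size, seq_len, num_heads, d_head).transpose(0, 2, 1, 3)

def merge_heads(x):
    """(batch, num_heads, seq_len, d_head) -> (batch, seq_len, d_model)."""
    batch_size, num_heads, seq_len, d_head = x.shape
    return x.transpose(0, 2, 1, 3).reshape(batch_size, seq_len, num_heads * d_head)

x = np.arange(2 * 5 * 8, dtype=np.float32).reshape(2, 5, 8)
heads = split_heads(x, num_heads=4)   # each head sees a d_head = 2 slice of the features
restored = merge_heads(heads)         # concatenation undoes the split exactly
print(heads.shape, restored.shape)
```

Scaled dot-product attention is then applied independently to each head's `(seq_len, d_head)` slice before the heads are merged and projected.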


In [None]:
# Exercise 4: Implement Scaled Dot-Product Attention

def student_attention(Q, K, V, mask=None):
    """
    Implement scaled dot-product attention.
    
    Args:
        Q: Query matrix of shape (batch_size, seq_len, d_k)
        K: Key matrix of shape (batch_size, seq_len, d_k)
        V: Value matrix of shape (batch_size, seq_len, d_k)
        mask: Optional mask of shape (seq_len, seq_len)
    
    Returns:
        output: Attention output of shape (batch_size, seq_len, d_k)
        attention_weights: Attention weights of shape (batch_size, seq_len, seq_len)
    """
    # TODO: Implement scaled dot-product attention
    # 1. Get d_k from the last dimension of Q
    # 2. Compute scores: Q @ K^T / sqrt(d_k)
    #    Hint: Use np.matmul(Q, K.transpose(0, 2, 1)) for batch matrix multiply
    # 3. Apply mask if provided: scores = scores + (mask * -1e9)
    # 4. Apply softmax along the last dimension
    #    Hint: Use softmax from utils
    # 5. Compute output: attention_weights @ V
    # 6. Return output and attention_weights
    
    raise NotImplementedError


In [None]:
# Test Attention
res = test_transformer_attention(student_attention)
show_result("Exercise 4 – Scaled Dot-Product Attention", res)


### Understanding Attention Weights

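Let's inspect the attention weights for a small self-attention example. The embeddings here are random rather than learned, so the weights illustrate the mechanics of attention, not trained behavior.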


In [None]:
# Demo: Attention Visualization
print("Creating sample sequence for attention demo...\n")

# Simple example: 4 words, 8-dimensional embeddings
seq_len = 4
d_model = 8
batch_size = 1

# Create simple embeddings (in practice, these would be learned)
np.random.seed(42)
Q = np.random.randn(batch_size, seq_len, d_model).astype(np.float32)
K = Q.copy()  # Self-attention: keys are same as queries
V = Q.copy()  # Values are also the same

# Compute attention
try:
    output, weights = student_attention(Q, K, V)
    
    print("Attention Weights Matrix:")
    print("(Each row shows how much position i attends to all positions)\n")
    print("      Pos0   Pos1   Pos2   Pos3")
    for i in range(seq_len):
        row_str = f"Pos{i}: "
        for j in range(seq_len):
            row_str += f"{weights[0, i, j]:.3f}  "
        print(row_str)
    
    print("\nNote: Each row sums to 1.0 (softmax normalization)")
    print("Higher values = stronger attention")
    
    print(f"\nOutput shape: {output.shape}")
    print("Output is a weighted combination of all value vectors.")
    
except Exception as e:
    print(f"Error: {e}")
    print("Complete Exercise 4 first!")


### Transformer Encoder Block

A complete transformer encoder block consists of:
1. Multi-head self-attention
2. Add & Norm (residual connection + layer normalization)
3. Feed-forward network (two linear layers with ReLU)
4. Add & Norm again

```
Input
  ↓
Multi-Head Attention
  ↓
Add & Norm (+ residual)
  ↓
Feed Forward (FFN)
  ↓
Add & Norm (+ residual)
  ↓
Output
```
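
A minimal sketch of the "Add & Norm" step, assuming the post-norm ordering shown in the diagram (residual add first, then layer normalization). The learnable gain and bias of a full layer norm are omitted for brevity, and both function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position's feature vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def add_and_norm(x, sublayer_output):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_output)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 10, 64))                # (batch, seq_len, d_model)
sublayer_output = rng.standard_normal((2, 10, 64))  # stand-in for attention or FFN output
out = add_and_norm(x, sublayer_output)
print(out.shape)
```

Unlike batch normalization, the statistics are computed per token over the feature axis, so the result does not depend on the batch size or the sequence length.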


In [None]:
# Demo: Complete Transformer Encoder (provided code)

class TransformerEncoder:
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        """
        Initialize Transformer Encoder.
        
        Args:
            d_model: Model dimension
            num_heads: Number of attention heads
            d_ff: Feed-forward dimension
            dropout: Dropout rate
        """
        self.attention = MultiHeadAttention(d_model, num_heads)
        self.ffn_1 = Linear(d_model, d_ff)
        self.ffn_2 = Linear(d_ff, d_model)
        self.relu = ReLU()
        self.dropout = Dropout(dropout)
    
    def forward(self, x, mask=None):
        """
        Forward pass through transformer encoder.
        
        Args:
            x: Input of shape (batch_size, seq_len, d_model)
        
        Returns:
            Output of shape (batch_size, seq_len, d_model)
        """
        # Multi-head self-attention with residual
        attn_output = self.attention.forward(x, x, x, mask)
        x = x + self.dropout(attn_output)  # Residual connection
        # In practice, we'd add layer normalization here
        
        # Feed-forward network with residual
        ffn_output = self.ffn_2(self.relu(self.ffn_1(x)))
        x = x + self.dropout(ffn_output)  # Residual connection
        
        return x

# Test the encoder
print("Testing Transformer Encoder...\n")
d_model = 64
num_heads = 4
d_ff = 256
seq_len = 10
batch_size = 2

encoder = TransformerEncoder(d_model, num_heads, d_ff)
x = np.random.randn(batch_size, seq_len, d_model).astype(np.float32)
output = encoder.forward(x)

print(f"✓ Transformer Encoder working!")
print(f"  Input shape:  {x.shape}")
print(f"  Output shape: {output.shape}")
print(f"\nKey components:")
print(f"  - Multi-head attention: {num_heads} heads")
print(f"  - Model dimension: {d_model}")
print(f"  - Feed-forward dimension: {d_ff}")
print(f"  - Two residual connections (attention + FFN)")


In [None]:
# Demo: Positional Encoding
print("Demonstrating Positional Encoding...\n")

d_model = 64
max_len = 100
pos_encoder = PositionalEncoding(d_model, max_len)

# Create sample embeddings
seq_len = 20
batch_size = 1
embeddings = np.random.randn(batch_size, seq_len, d_model).astype(np.float32) * 0.1

# Add positional encoding
embeddings_with_pos = pos_encoder.forward(embeddings)

print(f"Original embeddings shape: {embeddings.shape}")
print(f"With positional encoding: {embeddings_with_pos.shape}")
print(f"\nPositional encoding allows the model to use position information!")
print(f"Without it, 'cat sat mat' = 'mat cat sat' = 'sat mat cat'")

# Show a few positional encoding values
print(f"\nSample positional encodings (first 3 positions, first 8 dims):")
for pos in range(3):
    print(f"Position {pos}: {pos_encoder.pe[pos, :8]}")


## 6. Vision Transformers (ViT)

### Can Transformers Replace CNNs?
In 2020, the Vision Transformer (ViT) showed that the answer is **yes**, provided there is enough training data.

### ViT Architecture

**Key idea**: Treat an image as a sequence of patches.

```
Image (224×224×3)
    ↓
Split into patches (16×16) → 196 patches
    ↓
Flatten each patch → 196 vectors of size 768
    ↓
Linear projection (patch embedding)
    ↓
Add [CLS] token + positional encoding
    ↓
Transformer Encoder (12-24 layers)
    ↓
[CLS] token → Classification Head
```
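
The numbers in the diagram follow from simple arithmetic, sketched below for the 224×224 RGB input with 16×16 patches. The variable names are illustrative only.

```python
img_size, patch_size, in_channels = 224, 16, 3

patches_per_side = img_size // patch_size
n_patches = patches_per_side ** 2                    # number of tokens from the image
patch_dim = in_channels * patch_size * patch_size    # length of one flattened patch
seq_len = n_patches + 1                              # patches plus the [CLS] token
attention_entries = seq_len ** 2                     # attention matrix size per head

print(n_patches, patch_dim, seq_len, attention_entries)
```

Halving the patch size quadruples the number of patches and multiplies the attention matrix size by roughly sixteen, which is why the patch size is a key cost trade-off in ViT.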

### ViT vs CNN

**CNN advantages:**
- Strong inductive biases (locality, translation invariance)
- Works well with less data
- More efficient for small images

**ViT advantages:**
- Global receptive field from layer 1
- More flexible (no hardcoded filters)
- Scales better with data and compute
- Can capture long-range structure across a whole image, though attention cost grows quadratically with the number of patches


In [None]:
# Exercise 5: Implement Patch Embedding for ViT

class PatchEmbedding:
    def __init__(self, img_size=224, patch_size=16, in_channels=3, d_model=768):
        """
        Initialize patch embedding layer.
        
        Args:
            img_size: Input image size (assumes square images)
            patch_size: Size of each patch (assumes square patches)
            in_channels: Number of input channels (3 for RGB)
            d_model: Embedding dimension
        """
        self.img_size = img_size
        self.patch_size = patch_size
        self.n_patches = (img_size // patch_size) ** 2
        
        # TODO: Create a linear projection for patches
        # Each patch is patch_size × patch_size × in_channels flattened
        # self.patch_dim = in_channels * patch_size * patch_size
        # self.proj = Linear(self.patch_dim, d_model)
        
        raise NotImplementedError
    
    def forward(self, x):
        """
        Convert image to patch embeddings.
        
        Args:
            x: Input image of shape (batch_size, in_channels, img_size, img_size)
        
        Returns:
            Patch embeddings of shape (batch_size, n_patches, d_model)
        """
        # TODO: Implement patch extraction and embedding
        # 1. Extract patches from the image
        #    Rearrange (B, C, H, W) → (B, n_patches, patch_dim)
        #    (a bare np.reshape yields this shape, but grouping pixels by patch
        #    needs a reshape plus transpose, or a loop)
        # 2. Apply linear projection to each patch
        # 3. Return patch embeddings
        
        # Hint: You can use np.reshape or a loop over patches
        # Advanced: Use array reshaping tricks
        
        raise NotImplementedError


In [None]:
# Test Patch Embedding
res = test_vit_patch_embedding(PatchEmbedding)
show_result("Exercise 5 – ViT Patch Embedding", res)


In [None]:
# Demo: Complete ViT Forward Pass
print("Demonstrating Vision Transformer (ViT)...\n")

class VisionTransformer:
    def __init__(self, img_size=224, patch_size=16, in_channels=3, 
                 num_classes=1000, d_model=768, num_heads=12, num_layers=12, d_ff=3072):
        """
        Complete Vision Transformer implementation.
        """
        self.patch_embed = PatchEmbedding(img_size, patch_size, in_channels, d_model)
        self.n_patches = self.patch_embed.n_patches
        
        # CLS token (learnable parameter)
        self.cls_token = np.random.randn(1, 1, d_model).astype(np.float32) * 0.02
        
        # Positional encoding
        self.pos_encoding = PositionalEncoding(d_model, max_len=self.n_patches + 1)
        
        # Transformer encoder layers
        self.encoders = [TransformerEncoder(d_model, num_heads, d_ff) for _ in range(num_layers)]
        
        # Classification head
        self.head = Linear(d_model, num_classes)
    
    def forward(self, x):
        """
        Forward pass through ViT.
        
        Args:
            x: Input images of shape (batch_size, in_channels, img_size, img_size)
        
        Returns:
            Class logits of shape (batch_size, num_classes)
        """
        batch_size = x.shape[0]
        
        # 1. Patch embedding
        x = self.patch_embed.forward(x)  # (B, n_patches, d_model)
        
        # 2. Prepend CLS token
        cls_tokens = np.repeat(self.cls_token, batch_size, axis=0)
        x = np.concatenate([cls_tokens, x], axis=1)  # (B, n_patches+1, d_model)
        
        # 3. Add positional encoding
        x = self.pos_encoding.forward(x)
        
        # 4. Pass through transformer encoders
        for encoder in self.encoders:
            x = encoder.forward(x)
        
        # 5. Classification using CLS token
        cls_output = x[:, 0, :]  # Take CLS token
        logits = self.head(cls_output)
        
        return logits

try:
    # Create a small ViT for demonstration
    vit = VisionTransformer(
        img_size=224, 
        patch_size=16, 
        num_classes=10,
        d_model=192,  # Smaller for demo
        num_heads=3,
        num_layers=6,  # Fewer layers for demo
        d_ff=768
    )
    
    # Test forward pass
    test_img = np.random.randn(2, 3, 224, 224).astype(np.float32)
    output = vit.forward(test_img)
    
    print(f"✓ Vision Transformer working!")
    print(f"\nArchitecture:")
    print(f"  - Image size: 224×224")
    print(f"  - Patch size: 16×16")
    print(f"  - Number of patches: {vit.n_patches}")
    print(f"  - Embedding dimension: 192")
    print(f"  - Attention heads: 3")
    print(f"  - Transformer layers: 6")
    print(f"\nForward pass:")
    print(f"  - Input: {test_img.shape}")
    print(f"  - Output: {output.shape}")
    print(f"\nViT treats images as sequences of patches!")
    print(f"No convolutions needed – pure transformer architecture.")
    
except Exception as e:
    print(f"Error: {e}")
    print("Make sure Exercise 5 is completed!")


## 7. Summary and Comparison

### Architecture Comparison

| Architecture | Best For | Key Innovation | Parameters (typical) |
|--------------|----------|----------------|---------------------|
| **AlexNet** | Image classification | Deep CNNs, ReLU, Dropout | ~60M |
| **ResNet** | Very deep networks | Skip connections | 25M-60M |
| **LSTM** | Sequential data | Gating mechanism | Varies |
| **Transformer** | Long sequences | Attention, parallelization | 100M-1B+ |
| **ViT** | Images (with lots of data) | Patch-based transformers | 86M-632M |
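
As a sanity check on the AlexNet row, the parameter count can be estimated from the layer shapes in Section 2. The sketch assumes ordinary (ungrouped) convolutions, a 6×6×256 feature map before `FC1`, and a 1000-class ImageNet-sized head, so it lands slightly above the commonly quoted figure of about 60M.

```python
def conv_params(in_channels, out_channels, kernel):
    """Weights plus biases of a square-kernel convolution."""
    return in_channels * out_channels * kernel * kernel + out_channels

def linear_params(in_features, out_features):
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features

num_classes = 1000  # ImageNet-sized head (Exercise 1 defaults to 10 classes)
layer_params = {
    "conv1": conv_params(3, 96, 11),
    "conv2": conv_params(96, 256, 5),
    "conv3": conv_params(256, 384, 3),
    "conv4": conv_params(384, 384, 3),
    "conv5": conv_params(384, 256, 3),
    "fc1": linear_params(256 * 6 * 6, 4096),
    "fc2": linear_params(4096, 4096),
    "fc3": linear_params(4096, num_classes),
}

total = sum(layer_params.values())
fc_total = sum(count for name, count in layer_params.items() if name.startswith("fc"))
print(f"total: {total / 1e6:.1f}M, fully connected share: {fc_total / total:.0%}")
```

Most of the parameters sit in the fully connected layers, which is one reason later CNNs such as ResNet replace them with global pooling and a single small classifier.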

### When to Use What?

**Use CNNs (AlexNet/ResNet) when:**
- Working with images
- Limited training data
- Need translation invariance
- Want efficiency

**Use RNNs/LSTMs when:**
- Sequential data with temporal dependencies
- Online/streaming processing
- Audio, time series, text (small scale)

**Use Transformers when:**
- Need long-range dependencies
- Have lots of data and compute
- Want parallelization
- NLP tasks, large-scale vision

### Modern Trends (2024)
1. **Hybrid architectures**: Combining convolutions and attention (e.g., CoAtNet), alongside CNNs modernized with transformer-inspired design choices (e.g., ConvNeXt)
2. **Efficient transformers**: Reducing computational cost
3. **Vision-language models**: CLIP, Flamingo (multimodal)
4. **Foundation models**: Pre-trained on massive data, fine-tuned for tasks


## 8. Reflection Questions

1. **Skip connections**: Why do skip connections help with training very deep networks? What problem do they solve?

2. **LSTM vs Transformer**: Both handle sequential data. What are the key differences in how they process sequences? When would you choose one over the other?

3. **Attention mechanism**: Explain in your own words how scaled dot-product attention works. Why is the scaling factor $\sqrt{d_k}$ important?

4. **ViT vs CNN**: Vision Transformers treat images as sequences of patches, while CNNs use convolutions. What are the trade-offs? Why does ViT need more data than CNNs?

5. **Inductive biases**: CNNs have strong inductive biases (locality, translation invariance), while Transformers have fewer. What does this mean for learning and generalization?


_Write your answers here._


## 9. Next Steps

### For Your Projects
1. **Start simple**: Use pre-trained models (transfer learning)
2. **Image tasks**: Try ResNet or ViT from PyTorch/TensorFlow
3. **Sequence tasks**: Use transformers (Hugging Face library)
4. **Don't reinvent**: Leverage existing implementations

### Further Learning
- **Papers**: Original papers (AlexNet, ResNet, Attention is All You Need, ViT)
- **Courses**: CS231n (Stanford), CS224n (Stanford)
- **Implementations**: PyTorch tutorials, TensorFlow guides
- **Practice**: Kaggle competitions, personal projects

**Congratulations!** You've now seen the major architectures powering modern AI. 🎉
