<a href="https://colab.research.google.com/github/gitmystuff/DTSC5502/blob/main/Module_13-Deep_Learning/deep_learning_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 🧠 Deep Learning Fundamentals: A Comprehensive Assignment

## From ANNs to Transformers with PyTorch

---

### 📋 Assignment Overview

In this assignment, you will explore the evolution of neural network architectures, from basic Artificial Neural Networks (ANNs) to modern Transformer models. You'll implement each architecture, understand their differences, and learn when to apply each one.

### 🎯 Learning Objectives

By the end of this assignment, you will be able to:

1. **Understand** the architecture and components of ANN, CNN, RNN, LSTM, GRU, Attention, and Transformers
2. **Implement** each neural network type using PyTorch
3. **Compare** the architectures and identify their strengths and weaknesses
4. **Apply** appropriate activation functions, optimizers, and loss functions
5. **Explain** forward propagation and backpropagation
6. **Utilize** regularization techniques to prevent overfitting
7. **Evaluate** models using appropriate metrics

### 📚 Table of Contents

1. [Part 1: Setup and Fundamentals](#part1)
2. [Part 2: Artificial Neural Networks (ANN)](#part2)
3. [Part 3: Convolutional Neural Networks (CNN)](#part3)
4. [Part 4: Recurrent Neural Networks (RNN)](#part4)
5. [Part 5: Long Short-Term Memory (LSTM) & GRU](#part5)
6. [Part 6: Attention Mechanisms](#part6)
7. [Part 7: Transformers](#part7)
8. [Part 8: Comparison and Analysis](#part8)
9. [Part 9: Final Project](#part9)

---

<a name="part1"></a>
# Part 1: Setup and Fundamentals

## 1.1 Environment Setup

In [None]:
# Check if running on Google Colab
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    print("Running on Google Colab!")
    # Enable GPU
    # Go to Runtime > Change runtime type > Hardware accelerator > GPU
else:
    print("Running locally")

In [None]:
# Import necessary libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset, random_split

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import torchvision
import torchvision.transforms as transforms

import warnings
warnings.filterwarnings('ignore')

# Set random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

# Check for GPU availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"Memory Allocated: {torch.cuda.memory_allocated(0)/1024**2:.2f} MB")

## 1.2 Understanding the Training Pipeline

Before diving into architectures, let's understand the fundamental concepts that apply to ALL neural networks.

### Forward Propagation
Forward propagation is the process of passing input data through the network to get predictions:

$$\hat{y} = f(W_n \cdot f(W_{n-1} \cdot ... f(W_1 \cdot x + b_1) ... + b_{n-1}) + b_n)$$
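
As a minimal sketch of the formula above, the snippet below runs a forward pass by hand and compares it with the equivalent `nn.Linear` modules. The layer sizes (4 → 3 → 2), the ReLU hidden activation, and the identity activation on the output layer are assumptions chosen for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical two-layer network: 4 inputs -> 3 hidden units -> 2 outputs
layer1 = nn.Linear(4, 3)
layer2 = nn.Linear(3, 2)
x = torch.randn(5, 4)  # batch of 5 samples

# Manual forward pass: y_hat = W2 · f(W1 · x + b1) + b2, with f = ReLU
hidden = torch.relu(x @ layer1.weight.T + layer1.bias)
y_manual = hidden @ layer2.weight.T + layer2.bias

# Same computation using the modules directly
y_module = layer2(torch.relu(layer1(x)))

print(torch.allclose(y_manual, y_module, atol=1e-6))
```

Each `nn.Linear` stores its weight with shape `(out_features, in_features)`, which is why the manual version multiplies by the transpose.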

### Backpropagation
Backpropagation computes gradients of the loss with respect to weights using the chain rule:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial W}$$
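
To make the chain rule concrete, here is a small sketch that compares a hand-computed gradient with the one autograd produces. The single-weight model and squared-error loss are assumptions chosen to keep the arithmetic easy to follow.

```python
import torch

# Single linear neuron with squared-error loss: y_hat = w * x, L = (y_hat - y)^2
x = torch.tensor(2.0)
y = torch.tensor(1.0)
w = torch.tensor(3.0, requires_grad=True)

y_hat = w * x
loss = (y_hat - y) ** 2
loss.backward()  # autograd applies the chain rule for us

# Chain rule by hand: dL/dw = dL/dy_hat * dy_hat/dw = 2 * (y_hat - y) * x
manual_grad = 2 * (y_hat.detach() - y) * x

print(w.grad.item(), manual_grad.item())  # 20.0 20.0
```

In a deeper network the same product simply gains one factor per layer between the loss and the weight in question.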

### The Training Loop
```python
for epoch in range(num_epochs):
    for inputs, targets in dataloader:
        # 1. Forward pass
        outputs = model(inputs)
        
        # 2. Compute loss
        loss = criterion(outputs, targets)
        
        # 3. Backward pass (compute gradients)
        optimizer.zero_grad()  # Clear previous gradients
        loss.backward()        # Compute gradients
        
        # 4. Update weights
        optimizer.step()
```

## 1.3 Activation Functions

Activation functions introduce non-linearity into the network, allowing it to learn complex patterns.

In [None]:
# Visualize different activation functions
x = torch.linspace(-5, 5, 100)

# Define activation functions
activations = {
    'Sigmoid': torch.sigmoid(x),
    'Tanh': torch.tanh(x),
    'ReLU': F.relu(x),
    'LeakyReLU': F.leaky_relu(x, 0.1),
    'ELU': F.elu(x),
    'GELU': F.gelu(x),
    'Softplus': F.softplus(x),
    'Swish/SiLU': F.silu(x)
}

fig, axes = plt.subplots(2, 4, figsize=(16, 8))
axes = axes.flatten()

for idx, (name, y) in enumerate(activations.items()):
    axes[idx].plot(x.numpy(), y.numpy(), 'b-', linewidth=2)
    axes[idx].axhline(y=0, color='k', linestyle='-', linewidth=0.5)
    axes[idx].axvline(x=0, color='k', linestyle='-', linewidth=0.5)
    axes[idx].set_title(name, fontsize=14)
    axes[idx].set_xlabel('x')
    axes[idx].set_ylabel('f(x)')
    axes[idx].grid(True, alpha=0.3)
    axes[idx].set_xlim(-5, 5)

plt.tight_layout()
plt.suptitle('Activation Functions', fontsize=16, y=1.02)
plt.show()

### 📝 Exercise 1.1: Activation Functions

Answer the following questions:

1. Why do we need activation functions in neural networks?
2. What is the "vanishing gradient problem" and which activation functions suffer from it?
3. Why is ReLU the most commonly used activation function?
4. When would you use Sigmoid vs Softmax for the output layer?

**Your Answers:**

1. [Your answer here]

2. [Your answer here]

3. [Your answer here]

4. [Your answer here]

## 1.4 Loss Functions

Loss functions measure how well our model's predictions match the actual targets.

In [None]:
# Demonstrate different loss functions

# Sample predictions and targets
predictions = torch.tensor([0.9, 0.2, 0.8, 0.1], dtype=torch.float32)
targets_binary = torch.tensor([1.0, 0.0, 1.0, 0.0], dtype=torch.float32)

predictions_reg = torch.tensor([2.5, 0.0, 2.1, 1.6], dtype=torch.float32)
targets_reg = torch.tensor([3.0, -0.5, 2.0, 1.5], dtype=torch.float32)

# Binary Cross-Entropy Loss (for binary classification)
bce_loss = nn.BCELoss()
print(f"Binary Cross-Entropy Loss: {bce_loss(predictions, targets_binary):.4f}")

# Mean Squared Error Loss (for regression)
mse_loss = nn.MSELoss()
print(f"MSE Loss: {mse_loss(predictions_reg, targets_reg):.4f}")

# Mean Absolute Error Loss (for regression)
mae_loss = nn.L1Loss()
print(f"MAE Loss: {mae_loss(predictions_reg, targets_reg):.4f}")

# Cross-Entropy Loss (for multi-class classification)
ce_loss = nn.CrossEntropyLoss()
predictions_multi = torch.tensor([[2.0, 1.0, 0.1], [0.5, 2.0, 0.3]], dtype=torch.float32)
targets_multi = torch.tensor([0, 1], dtype=torch.long)
print(f"Cross-Entropy Loss: {ce_loss(predictions_multi, targets_multi):.4f}")

### Loss Function Selection Guide

| Task | Loss Function | Output Activation |
|------|---------------|-------------------|
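| Binary Classification | BCELoss | Sigmoid |
| Binary Classification | BCEWithLogitsLoss | None (raw logits; sigmoid is applied inside the loss) |
| Multi-class Classification | CrossEntropyLoss | None (raw logits; log-softmax is applied inside the loss) |
| Multi-label Classification | BCEWithLogitsLoss | None (raw logits; apply sigmoid only at inference) |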
| Regression | MSELoss or L1Loss | None (linear) |
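
The pairing of loss function and output activation can be checked numerically. The sketch below, using made-up logits, shows that `BCEWithLogitsLoss` on raw logits matches `BCELoss` on sigmoid outputs, and that `CrossEntropyLoss` on raw logits matches negative log-likelihood on log-softmax outputs. In other words, these losses apply the activation internally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Binary case: illustrative raw logits (not probabilities)
logits = torch.tensor([2.0, -1.5, 0.3, -0.7])
targets = torch.tensor([1.0, 0.0, 1.0, 0.0])

loss_from_logits = nn.BCEWithLogitsLoss()(logits, targets)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), targets)
print(f"BCEWithLogitsLoss: {loss_from_logits:.4f}  BCELoss(sigmoid): {loss_from_probs:.4f}")

# Multi-class case: CrossEntropyLoss = log_softmax followed by NLL loss
multi_logits = torch.tensor([[2.0, 1.0, 0.1], [0.5, 2.0, 0.3]])
multi_targets = torch.tensor([0, 1])

ce = nn.CrossEntropyLoss()(multi_logits, multi_targets)
nll = F.nll_loss(F.log_softmax(multi_logits, dim=1), multi_targets)
print(f"CrossEntropyLoss: {ce:.4f}  NLL(log_softmax): {nll:.4f}")
```

Passing already-softmaxed probabilities to `CrossEntropyLoss` applies softmax twice, which still trains but flattens the gradients, so the model's final layer should emit raw logits.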

## 1.5 Optimizers

Optimizers update the network weights based on computed gradients.

In [None]:
# Visualize optimizer behavior on a simple loss landscape
def rosenbrock(x, y):
    """Rosenbrock function - a common test function for optimization"""
    return (1 - x)**2 + 100 * (y - x**2)**2

# Create a simple model to demonstrate optimizers
class SimpleModel(nn.Module):
    def __init__(self, start_x, start_y):
        super().__init__()
        self.x = nn.Parameter(torch.tensor([start_x]))
        self.y = nn.Parameter(torch.tensor([start_y]))

    def forward(self):
        return rosenbrock(self.x, self.y)

def optimize_path(optimizer_class, lr, start_x=-1.0, start_y=1.0, steps=1000, **kwargs):
    model = SimpleModel(start_x, start_y)
    optimizer = optimizer_class(model.parameters(), lr=lr, **kwargs)

    path = [(model.x.item(), model.y.item())]

    for _ in range(steps):
        optimizer.zero_grad()
        loss = model()
        loss.backward()
        optimizer.step()
        path.append((model.x.item(), model.y.item()))

    return path

# Compare different optimizers
optimizers_config = {
    'SGD': (optim.SGD, {'lr': 0.001}),
    'SGD + Momentum': (optim.SGD, {'lr': 0.001, 'momentum': 0.9}),
    'Adam': (optim.Adam, {'lr': 0.01}),
    'RMSprop': (optim.RMSprop, {'lr': 0.01}),
}

# Create contour plot
x_range = np.linspace(-2, 2, 100)
y_range = np.linspace(-1, 3, 100)
X, Y = np.meshgrid(x_range, y_range)
Z = rosenbrock(X, Y)

fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.flatten()

for idx, (name, (opt_class, params)) in enumerate(optimizers_config.items()):
    path = optimize_path(opt_class, **params)
    path = np.array(path)

    axes[idx].contour(X, Y, Z, levels=np.logspace(-1, 3, 20), cmap='viridis', alpha=0.6)
    axes[idx].plot(path[:, 0], path[:, 1], 'r.-', markersize=2, linewidth=1, alpha=0.7)
    axes[idx].plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start')
    axes[idx].plot(path[-1, 0], path[-1, 1], 'r*', markersize=15, label='End')
    axes[idx].plot(1, 1, 'b^', markersize=10, label='Global Min')
    axes[idx].set_title(f'{name}', fontsize=14)
    axes[idx].set_xlabel('x')
    axes[idx].set_ylabel('y')
    axes[idx].legend()
    axes[idx].set_xlim(-2, 2)
    axes[idx].set_ylim(-1, 3)

plt.tight_layout()
plt.suptitle('Optimizer Comparison on Rosenbrock Function', fontsize=16, y=1.02)
plt.show()

### Optimizer Summary

| Optimizer | Key Features | Best For |
|-----------|--------------|----------|
| **SGD** | Simple, reliable | When you need fine control |
| **SGD + Momentum** | Faster convergence, smooths updates | General purpose |
| **Adam** | Adaptive learning rates, handles sparse gradients | Default choice, NLP |
| **AdamW** | Adam with proper weight decay | Transformers |
| **RMSprop** | Adaptive, handles non-stationary objectives | RNNs |
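
To see what an optimizer step actually does, the following sketch compares one `optim.SGD` update with the plain rule `w <- w - lr * grad`. The two-element parameter and the quadratic loss are assumptions picked so the result is easy to verify by hand.

```python
import torch
import torch.optim as optim

w = torch.nn.Parameter(torch.tensor([1.0, -2.0]))
optimizer = optim.SGD([w], lr=0.1)

loss = (w ** 2).sum()  # gradient is 2 * w
loss.backward()

# Plain SGD rule computed by hand: w <- w - lr * grad
expected = w.detach() - 0.1 * w.grad

optimizer.step()
print(w.detach(), expected)  # both are [0.8, -1.6]
```

Momentum, Adam, and RMSprop build on this rule by keeping running statistics of past gradients, which is why they take different paths across the same loss surface.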

## 1.6 Weight Initialization

Proper initialization is crucial for training deep networks effectively.

In [None]:
# Demonstrate different initialization methods
def visualize_initialization(init_fn, name, in_features=100, out_features=100):
    layer = nn.Linear(in_features, out_features)
    init_fn(layer.weight)
    return layer.weight.data.numpy().flatten()

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

initializations = {
    'Uniform [-1, 1]': lambda w: nn.init.uniform_(w, -1, 1),
    'Normal (0, 1)': lambda w: nn.init.normal_(w, 0, 1),
    'Xavier Uniform': nn.init.xavier_uniform_,
    'Xavier Normal': nn.init.xavier_normal_,
    'Kaiming Uniform (He)': lambda w: nn.init.kaiming_uniform_(w, mode='fan_in'),
    'Kaiming Normal (He)': lambda w: nn.init.kaiming_normal_(w, mode='fan_in'),
}

for idx, (name, init_fn) in enumerate(initializations.items()):
    weights = visualize_initialization(init_fn, name)
    axes[idx].hist(weights, bins=50, density=True, alpha=0.7, color='steelblue')
    axes[idx].set_title(f'{name}\nμ={weights.mean():.3f}, σ={weights.std():.3f}')
    axes[idx].set_xlabel('Weight Value')
    axes[idx].set_ylabel('Density')

plt.tight_layout()
plt.suptitle('Weight Initialization Methods', fontsize=16, y=1.02)
plt.show()

### Initialization Guidelines

- **Xavier/Glorot**: Use with Sigmoid or Tanh activations
- **Kaiming/He**: Use with ReLU and its variants
- **Orthogonal**: Good for RNNs
- **Normal/Uniform**: Rarely the best choice for deep networks
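
The guidelines above come down to choosing the weight variance from the layer's fan-in and fan-out. As a rough check, the sketch below compares the empirical standard deviation with the standard formulas: `sqrt(2 / fan_in)` for Kaiming normal with ReLU, and `sqrt(2 / (fan_in + fan_out))` for Xavier normal. The 512 x 512 layer size is an arbitrary assumption.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 512, 512
weight = torch.empty(fan_out, fan_in)  # same layout as nn.Linear.weight

# Kaiming (He) normal: std = sqrt(2 / fan_in) for ReLU
nn.init.kaiming_normal_(weight, mode='fan_in', nonlinearity='relu')
kaiming_std = weight.std().item()
expected_kaiming = math.sqrt(2.0 / fan_in)

# Xavier (Glorot) normal: std = sqrt(2 / (fan_in + fan_out))
nn.init.xavier_normal_(weight)
xavier_std = weight.std().item()
expected_xavier = math.sqrt(2.0 / (fan_in + fan_out))

print(f"Kaiming: empirical {kaiming_std:.4f}, formula {expected_kaiming:.4f}")
print(f"Xavier:  empirical {xavier_std:.4f}, formula {expected_xavier:.4f}")
```

Kaiming uses a larger variance than Xavier because ReLU zeroes roughly half of the activations, and the extra factor compensates for that loss of signal.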

## 1.7 Regularization Techniques

Regularization helps prevent overfitting.
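
One common technique is dropout. In training mode, `nn.Dropout` zeroes each element with probability `p` and scales the survivors by `1 / (1 - p)`, so the expected activation is unchanged. In evaluation mode it is the identity. The sketch below checks both behaviours on a vector of ones; the vector length is an arbitrary assumption.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
p = 0.5
dropout = nn.Dropout(p=p)
x = torch.ones(10000)

# Training mode: elements are either dropped (0) or scaled by 1 / (1 - p)
dropout.train()
y_train = dropout(x)
kept_values = y_train[y_train != 0].unique()
train_mean = y_train.mean().item()
print(f"Surviving values: {kept_values.tolist()}, mean: {train_mean:.3f}")

# Evaluation mode: dropout does nothing
dropout.eval()
y_eval = dropout(x)
print(f"Eval output unchanged: {torch.equal(y_eval, x)}")
```

This is why calling `model.train()` and `model.eval()` at the right time matters: forgetting `eval()` leaves predictions noisy at test time.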

In [None]:
# Demonstrate Dropout
class DropoutDemo(nn.Module):
    def __init__(self, dropout_rate=0.5):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout_rate)

    def forward(self, x):
        return self.dropout(x)

# Create sample input
sample_input = torch.ones(1, 10)
dropout_model = DropoutDemo(0.5)

# Show dropout effect in training mode
dropout_model.train()
print("Training mode (dropout active):")
for i in range(3):
    output = dropout_model(sample_input)
    print(f"  Pass {i+1}: {output.numpy().flatten()}")

# Show dropout effect in evaluation mode
dropout_model.eval()
print("\nEvaluation mode (dropout inactive):")
for i in range(3):
    output = dropout_model(sample_input)
    print(f"  Pass {i+1}: {output.numpy().flatten()}")

In [None]:
# Demonstrate L1 and L2 Regularization
class RegularizedModel(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

    def l1_regularization(self, lambda_l1=0.01):
        """Calculate L1 regularization term"""
        l1_norm = sum(p.abs().sum() for p in self.parameters())
        return lambda_l1 * l1_norm

    def l2_regularization(self, lambda_l2=0.01):
        """Calculate L2 regularization term"""
        l2_norm = sum(p.pow(2).sum() for p in self.parameters())
        return lambda_l2 * l2_norm

# Example usage
model = RegularizedModel(10, 20, 2)
x = torch.randn(32, 10)
y = torch.randint(0, 2, (32,))

criterion = nn.CrossEntropyLoss()
output = model(x)

# Loss with regularization
base_loss = criterion(output, y)
l1_reg = model.l1_regularization(0.01)
l2_reg = model.l2_regularization(0.01)

print(f"Base Loss: {base_loss.item():.4f}")
print(f"L1 Regularization: {l1_reg.item():.4f}")
print(f"L2 Regularization: {l2_reg.item():.4f}")
print(f"Total Loss (with L2): {(base_loss + l2_reg).item():.4f}")

# Note: PyTorch's optimizers have built-in weight_decay for L2 regularization
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=0.01)  # weight_decay = L2

## 1.8 Batch Normalization & Layer Normalization

In [None]:
# Demonstrate Batch Normalization
batch_size = 4
features = 3

# Create sample data
x = torch.randn(batch_size, features) * 10 + 5  # Mean ~5, Std ~10
print("Input:")
print(x)
print(f"Mean per feature: {x.mean(dim=0)}")
print(f"Std per feature: {x.std(dim=0)}")

# Apply Batch Normalization
bn = nn.BatchNorm1d(features)
bn.train()  # Training mode
x_bn = bn(x)

print("\nAfter Batch Normalization:")
print(x_bn)
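print(f"Mean per feature (should be ~0): {x_bn.mean(dim=0)}")
# BatchNorm normalizes with the biased variance, so compare against unbiased=False
print(f"Std per feature (should be ~1): {x_bn.std(dim=0, unbiased=False)}")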

In [None]:
# Layer Normalization (commonly used in Transformers)
seq_len = 4
hidden_size = 3

x = torch.randn(batch_size, seq_len, hidden_size) * 10 + 5
print("Input shape:", x.shape)
print("Input[0]:")
print(x[0])

# Apply Layer Normalization
ln = nn.LayerNorm(hidden_size)
x_ln = ln(x)

print("\nAfter Layer Normalization:")
print(x_ln[0])
print(f"\nMean per token (should be ~0): {x_ln[0].mean(dim=-1)}")
# LayerNorm normalizes with the biased variance, so compare against unbiased=False
print(f"Std per token (should be ~1): {x_ln[0].std(dim=-1, unbiased=False)}")

### Normalization Comparison

| Type | Normalizes Over | Best For |
|------|-----------------|----------|
| **Batch Norm** | Batch dimension (per feature/channel) | CNNs, large batches |
| **Layer Norm** | Feature dimension (per sample) | Transformers, RNNs |
| **Instance Norm** | Spatial dimensions (per sample, per channel) | Style transfer |
| **Group Norm** | Groups of channels | Small batch sizes |
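
The difference between the first two rows is only the axis over which the mean and variance are taken. The sketch below reproduces `nn.BatchNorm1d` (training mode) and `nn.LayerNorm` by hand. The tensor shapes are arbitrary assumptions, and both layers are left at their default affine parameters (scale 1, shift 0).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Batch norm: statistics over the batch dimension, one pair per feature
xb = torch.randn(16, 8) * 10 + 5
bn = nn.BatchNorm1d(8).train()
xb_manual = (xb - xb.mean(dim=0)) / torch.sqrt(xb.var(dim=0, unbiased=False) + bn.eps)
print(torch.allclose(bn(xb), xb_manual, atol=1e-5))

# Layer norm: statistics over the feature dimension, one pair per token
xl = torch.randn(2, 4, 8) * 10 + 5
ln = nn.LayerNorm(8)
mean = xl.mean(dim=-1, keepdim=True)
var = xl.var(dim=-1, unbiased=False, keepdim=True)
xl_manual = (xl - mean) / torch.sqrt(var + ln.eps)
print(torch.allclose(ln(xl), xl_manual, atol=1e-5))
```

Because layer norm never looks across the batch, it behaves identically in training and evaluation and is unaffected by batch size.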

## 1.9 Learning Rate Scheduling

In [None]:
# Visualize different learning rate schedules
def get_lr_schedule(scheduler, optimizer, epochs):
    """Record the learning rate at each epoch for a scheduler that needs no metric."""
    lrs = []
    for _ in range(epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        scheduler.step()
    return lrs

# Create dummy model and optimizer
dummy_model = nn.Linear(10, 1)
epochs = 100

fig, axes = plt.subplots(2, 3, figsize=(15, 8))
axes = axes.flatten()

schedules = [
    ('StepLR (step=30, γ=0.1)', lambda opt: optim.lr_scheduler.StepLR(opt, step_size=30, gamma=0.1)),
    ('ExponentialLR (γ=0.95)', lambda opt: optim.lr_scheduler.ExponentialLR(opt, gamma=0.95)),
    ('CosineAnnealingLR', lambda opt: optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)),
    ('ReduceLROnPlateau', lambda opt: optim.lr_scheduler.ReduceLROnPlateau(opt, patience=10)),
    ('OneCycleLR', lambda opt: optim.lr_scheduler.OneCycleLR(opt, max_lr=0.1, total_steps=epochs)),
    ('CosineAnnealingWarmRestarts', lambda opt: optim.lr_scheduler.CosineAnnealingWarmRestarts(opt, T_0=20)),
]

for idx, (name, scheduler_fn) in enumerate(schedules):
    optimizer = optim.SGD(dummy_model.parameters(), lr=0.1)
    scheduler = scheduler_fn(optimizer)

    lrs = []
    for epoch in range(epochs):
        lrs.append(optimizer.param_groups[0]['lr'])
        if 'Plateau' in name:
            # Simulate varying loss
            fake_loss = 1.0 / (epoch + 1) + np.random.random() * 0.1
            scheduler.step(fake_loss)
        else:
            scheduler.step()

    axes[idx].plot(range(epochs), lrs, 'b-', linewidth=2)
    axes[idx].set_title(name)
    axes[idx].set_xlabel('Epoch')
    axes[idx].set_ylabel('Learning Rate')
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.suptitle('Learning Rate Schedules', fontsize=16, y=1.02)
plt.show()

## 1.10 Data Handling: Train/Validation/Test Splits

In [None]:
# Demonstrate proper data splitting
from torch.utils.data import Subset

# Create synthetic dataset
n_samples = 1000
X = torch.randn(n_samples, 10)
y = torch.randint(0, 2, (n_samples,))

dataset = TensorDataset(X, y)

# Method 1: Using random_split
train_size = int(0.7 * len(dataset))
val_size = int(0.15 * len(dataset))
test_size = len(dataset) - train_size - val_size

train_dataset, val_dataset, test_dataset = random_split(
    dataset, [train_size, val_size, test_size],
    generator=torch.Generator().manual_seed(42)
)

print(f"Total samples: {len(dataset)}")
print(f"Training samples: {len(train_dataset)} ({len(train_dataset)/len(dataset)*100:.1f}%)")
print(f"Validation samples: {len(val_dataset)} ({len(val_dataset)/len(dataset)*100:.1f}%)")
print(f"Test samples: {len(test_dataset)} ({len(test_dataset)/len(dataset)*100:.1f}%)")

# Create DataLoaders
batch_size = 32

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"\nNumber of batches:")
print(f"  Train: {len(train_loader)}")
print(f"  Validation: {len(val_loader)}")
print(f"  Test: {len(test_loader)}")

## 1.11 Utility Functions for Training

Let's create reusable training and evaluation functions.

In [None]:
class EarlyStopping:
    """Early stopping to stop training when validation loss doesn't improve."""
    def __init__(self, patience=7, min_delta=0, restore_best_weights=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best_weights = restore_best_weights
        self.best_loss = None
        self.counter = 0
        self.best_weights = None

    def __call__(self, val_loss, model):
        if self.best_loss is None:
            self.best_loss = val_loss
            # Clone the tensors: state_dict().copy() is shallow and would keep tracking the live weights
            self.best_weights = {k: v.detach().clone() for k, v in model.state_dict().items()}
        elif val_loss > self.best_loss - self.min_delta:
            self.counter += 1
            if self.counter >= self.patience:
                if self.restore_best_weights:
                    model.load_state_dict(self.best_weights)
                return True
        else:
            self.best_loss = val_loss
            self.best_weights = {k: v.detach().clone() for k, v in model.state_dict().items()}
            self.counter = 0
        return False


def train_epoch(model, train_loader, criterion, optimizer, device):
    """Train for one epoch."""
    model.train()
    running_loss = 0.0
    correct = 0
    total = 0

    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)

        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, targets)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        _, predicted = outputs.max(1)
        total += targets.size(0)
        correct += predicted.eq(targets).sum().item()

    return running_loss / len(train_loader), 100. * correct / total


def evaluate(model, data_loader, criterion, device):
    """Evaluate the model."""
    model.eval()
    running_loss = 0.0
    correct = 0
    total = 0
    all_predictions = []
    all_targets = []

    with torch.no_grad():
        for inputs, targets in data_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            outputs = model(inputs)
            loss = criterion(outputs, targets)

            running_loss += loss.item()
            _, predicted = outputs.max(1)
            total += targets.size(0)
            correct += predicted.eq(targets).sum().item()

            all_predictions.extend(predicted.cpu().numpy())
            all_targets.extend(targets.cpu().numpy())

    return running_loss / len(data_loader), 100. * correct / total, all_predictions, all_targets


def train_model(model, train_loader, val_loader, criterion, optimizer,
                num_epochs, device, scheduler=None, early_stopping=None):
    """Complete training loop with validation."""
    history = {'train_loss': [], 'train_acc': [], 'val_loss': [], 'val_acc': []}

    for epoch in range(num_epochs):
        train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, val_acc, _, _ = evaluate(model, val_loader, criterion, device)

        history['train_loss'].append(train_loss)
        history['train_acc'].append(train_acc)
        history['val_loss'].append(val_loss)
        history['val_acc'].append(val_acc)

        if scheduler:
            if isinstance(scheduler, optim.lr_scheduler.ReduceLROnPlateau):
                scheduler.step(val_loss)
            else:
                scheduler.step()

        if (epoch + 1) % 5 == 0 or epoch == 0:
            print(f'Epoch [{epoch+1}/{num_epochs}] '
                  f'Train Loss: {train_loss:.4f} Train Acc: {train_acc:.2f}% '
                  f'Val Loss: {val_loss:.4f} Val Acc: {val_acc:.2f}%')

        if early_stopping and early_stopping(val_loss, model):
            print(f'Early stopping triggered at epoch {epoch+1}')
            break

    return history


def plot_training_history(history, title="Training History"):
    """Plot training and validation metrics."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

    ax1.plot(history['train_loss'], label='Train Loss')
    ax1.plot(history['val_loss'], label='Val Loss')
    ax1.set_xlabel('Epoch')
    ax1.set_ylabel('Loss')
    ax1.set_title('Loss over Epochs')
    ax1.legend()
    ax1.grid(True, alpha=0.3)

    ax2.plot(history['train_acc'], label='Train Accuracy')
    ax2.plot(history['val_acc'], label='Val Accuracy')
    ax2.set_xlabel('Epoch')
    ax2.set_ylabel('Accuracy (%)')
    ax2.set_title('Accuracy over Epochs')
    ax2.legend()
    ax2.grid(True, alpha=0.3)

    plt.suptitle(title, fontsize=14)
    plt.tight_layout()
    plt.show()


def plot_confusion_matrix(y_true, y_pred, class_names=None):
    """Plot confusion matrix."""
    cm = confusion_matrix(y_true, y_pred)
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                xticklabels=class_names, yticklabels=class_names)
    plt.xlabel('Predicted')
    plt.ylabel('Actual')
    plt.title('Confusion Matrix')
    plt.show()


print("Utility functions defined successfully!")

---
<a name="part2"></a>
# Part 2: Artificial Neural Networks (ANN)

## 2.1 Theory

An **Artificial Neural Network (ANN)**, also known as a **Feedforward Neural Network** or **Multi-Layer Perceptron (MLP)**, is the simplest form of neural network.

### Architecture
- **Input Layer**: Receives input features
- **Hidden Layers**: Process information through weighted connections
- **Output Layer**: Produces predictions

### Key Characteristics
- Fully connected layers (each neuron connects to all neurons in adjacent layers)
- Information flows in one direction (forward)
- No memory of previous inputs
- Works with fixed-size inputs

### When to Use ANNs
- Tabular/structured data
- Classification tasks
- Regression problems
- When input features don't have spatial or temporal relationships
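
Because every layer is fully connected, the parameter count grows with the product of adjacent layer widths. The sketch below counts parameters by hand for a 784 → 512 → 256 → 10 MLP (the same sizes as the MNIST model built later in this part) and compares the result with PyTorch's own count. The helper `linear_params` is a hypothetical name introduced for illustration.

```python
import torch.nn as nn

def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

sizes = [784, 512, 256, 10]
manual_total = sum(linear_params(a, b) for a, b in zip(sizes[:-1], sizes[1:]))

mlp = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
module_total = sum(p.numel() for p in mlp.parameters())

print(manual_total, module_total)  # 535818 535818
```

About 75% of these parameters sit in the first layer, which is one reason fully connected networks scale poorly to larger images.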

In [None]:
# Load MNIST dataset for ANN demonstration
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))
])

# Download and load training data
train_dataset_full = torchvision.datasets.MNIST(
    root='./data', train=True, download=True, transform=transform
)

test_dataset = torchvision.datasets.MNIST(
    root='./data', train=False, download=True, transform=transform
)

# Split training into train and validation
train_size = int(0.9 * len(train_dataset_full))
val_size = len(train_dataset_full) - train_size
train_dataset, val_dataset = random_split(train_dataset_full, [train_size, val_size])

# Create data loaders
train_loader_mnist = DataLoader(train_dataset, batch_size=64, shuffle=True)
val_loader_mnist = DataLoader(val_dataset, batch_size=64, shuffle=False)
test_loader_mnist = DataLoader(test_dataset, batch_size=64, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Validation samples: {len(val_dataset)}")
print(f"Test samples: {len(test_dataset)}")

# Visualize some samples
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
for i, ax in enumerate(axes.flatten()):
    img, label = train_dataset_full[i]
    ax.imshow(img.squeeze(), cmap='gray')
    ax.set_title(f'Label: {label}')
    ax.axis('off')
plt.suptitle('MNIST Sample Images', fontsize=14)
plt.tight_layout()
plt.show()

## 2.2 Building an ANN from Scratch

In [None]:
class ANN(nn.Module):
    """
    Artificial Neural Network (Multi-Layer Perceptron)

    Architecture:
    Input (784) -> FC(512) -> ReLU -> Dropout -> FC(256) -> ReLU -> Dropout -> FC(10)
    """
    def __init__(self, input_size=784, hidden_sizes=(512, 256), num_classes=10, dropout_rate=0.2):
        super(ANN, self).__init__()

        # Build layers dynamically
        layers = []
        prev_size = input_size

        for hidden_size in hidden_sizes:
            layers.extend([
                nn.Linear(prev_size, hidden_size),
                nn.ReLU(),
                nn.Dropout(dropout_rate)
            ])
            prev_size = hidden_size

        layers.append(nn.Linear(prev_size, num_classes))

        self.network = nn.Sequential(*layers)

        # Initialize weights
        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')
                nn.init.zeros_(m.bias)

    def forward(self, x):
        # Flatten input: (batch_size, 1, 28, 28) -> (batch_size, 784)
        x = x.view(x.size(0), -1)
        return self.network(x)

# Create model
ann_model = ANN().to(device)
print(ann_model)
print(f"\nTotal parameters: {sum(p.numel() for p in ann_model.parameters()):,}")

In [None]:
# Train the ANN
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(ann_model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)
early_stopping = EarlyStopping(patience=5)

print("Training ANN on MNIST...\n")
ann_history = train_model(
    ann_model, train_loader_mnist, val_loader_mnist, criterion, optimizer,
    num_epochs=20, device=device, scheduler=scheduler, early_stopping=early_stopping
)

In [None]:
# Plot training history
plot_training_history(ann_history, "ANN Training History")

# Evaluate on test set
test_loss, test_acc, predictions, targets = evaluate(ann_model, test_loader_mnist, criterion, device)
print(f"\nTest Results: Loss: {test_loss:.4f}, Accuracy: {test_acc:.2f}%")

# Classification report
print("\nClassification Report:")
print(classification_report(targets, predictions, digits=4))

# Confusion matrix
plot_confusion_matrix(targets, predictions, class_names=[str(i) for i in range(10)])

### 📝 Exercise 2.1: Modify the ANN

Experiment with the ANN architecture:

1. Add more hidden layers
2. Change the number of neurons
3. Try different activation functions
4. Adjust the dropout rate

Document your findings below.

In [None]:
# YOUR CODE HERE: Create a modified ANN
class ModifiedANN(nn.Module):
    def __init__(self):
        super(ModifiedANN, self).__init__()
        # TODO: Implement your modified architecture
        pass

    def forward(self, x):
        # TODO: Implement forward pass
        pass

# Train and compare with the original

**Your Findings:**

[Document what you observed when modifying the architecture]

---
<a name="part3"></a>
# Part 3: Convolutional Neural Networks (CNN)

## 3.1 Theory

**Convolutional Neural Networks** are designed to process data with grid-like topology, such as images.

### Key Components

1. **Convolutional Layers**: Apply learnable filters to detect features
2. **Pooling Layers**: Reduce spatial dimensions while retaining important information
3. **Fully Connected Layers**: Final classification/regression

### Key Concepts

- **Local Connectivity**: Each neuron connects only to a small region of the input
- **Parameter Sharing**: Same filter is used across the entire input
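- **Translation Equivariance**: Shifting the input shifts the feature map by the same amount, so a filter detects a feature wherever it appears; pooling then adds approximate invariance to small shifts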

### When to Use CNNs
- Image classification
- Object detection
- Image segmentation
- Any data with spatial structure (audio spectrograms, etc.)
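
Two practical consequences of these ideas are easy to compute: how the spatial size shrinks through a stack of layers, and how few parameters a convolution needs. The sketch below uses the standard output-size formula `floor((n + 2*padding - kernel_size) / stride) + 1`. The helper `conv_out_size` is a hypothetical name, and the 28 x 28 input with 3 x 3 convolutions and 2 x 2 max pooling is an assumption matching MNIST-style images.

```python
import torch
import torch.nn as nn

def conv_out_size(n, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (n + 2 * padding - kernel_size) // stride + 1

# Trace a 28x28 input through three (3x3 conv, padding=1) + (2x2 max pool) blocks
size = 28
sizes = [size]
for _ in range(3):
    size = conv_out_size(size, kernel_size=3, stride=1, padding=1)  # conv keeps the size
    size = conv_out_size(size, kernel_size=2, stride=2)             # pool halves it (floor)
    sizes.append(size)
print(sizes)  # [28, 14, 7, 3]

# Parameter sharing: one 3x3 conv with 32 filters vs. a dense layer with the same output
conv = nn.Conv2d(1, 32, kernel_size=3, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())  # 32 * (1*3*3) + 32 biases
dense_params = (28 * 28) * (32 * 28 * 28) + 32 * 28 * 28
print(conv_params, dense_params)

# Confirm the formula against an actual tensor
out = nn.MaxPool2d(2)(conv(torch.zeros(1, 1, 28, 28)))
print(tuple(out.shape))  # (1, 32, 14, 14)
```

The convolution needs a few hundred parameters where the equivalent dense layer would need millions, because the same 3 x 3 filter is reused at every position.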

In [None]:
# Visualize convolution operation
def visualize_convolution():
    # Create a simple image
    image = torch.zeros(1, 1, 8, 8)
    image[0, 0, 2:6, 2:6] = 1  # White square in center

    # Define filters
    edge_detector_h = torch.tensor([[-1, -1, -1],
                                     [ 0,  0,  0],
                                     [ 1,  1,  1]], dtype=torch.float32).view(1, 1, 3, 3)

    edge_detector_v = torch.tensor([[-1,  0,  1],
                                     [-1,  0,  1],
                                     [-1,  0,  1]], dtype=torch.float32).view(1, 1, 3, 3)

    # Apply convolutions
    output_h = F.conv2d(image, edge_detector_h, padding=1)
    output_v = F.conv2d(image, edge_detector_v, padding=1)

    fig, axes = plt.subplots(1, 4, figsize=(16, 4))

    axes[0].imshow(image.squeeze(), cmap='gray')
    axes[0].set_title('Original Image')

    axes[1].imshow(edge_detector_h.squeeze(), cmap='RdBu')
    axes[1].set_title('Horizontal Edge Filter')

    axes[2].imshow(output_h.squeeze().detach(), cmap='RdBu')
    axes[2].set_title('Horizontal Edges Detected')

    axes[3].imshow(output_v.squeeze().detach(), cmap='RdBu')
    axes[3].set_title('Vertical Edges Detected')

    for ax in axes:
        ax.axis('off')

    plt.tight_layout()
    plt.show()

visualize_convolution()

In [None]:
# Visualize pooling operations
def visualize_pooling():
    # Create sample feature map
    feature_map = torch.tensor([[[[1, 2, 3, 4],
                                   [5, 6, 7, 8],
                                   [9, 10, 11, 12],
                                   [13, 14, 15, 16]]]], dtype=torch.float32)

    max_pool = nn.MaxPool2d(kernel_size=2, stride=2)
    avg_pool = nn.AvgPool2d(kernel_size=2, stride=2)

    max_pooled = max_pool(feature_map)
    avg_pooled = avg_pool(feature_map)

    fig, axes = plt.subplots(1, 3, figsize=(12, 4))

    # Original
    im0 = axes[0].imshow(feature_map.squeeze(), cmap='viridis')
    axes[0].set_title('Original (4x4)')
    for i in range(4):
        for j in range(4):
            axes[0].text(j, i, f'{int(feature_map[0,0,i,j])}', ha='center', va='center', color='white')

    # Max Pooled
    im1 = axes[1].imshow(max_pooled.squeeze(), cmap='viridis')
    axes[1].set_title('Max Pooled (2x2)')
    for i in range(2):
        for j in range(2):
            axes[1].text(j, i, f'{int(max_pooled[0,0,i,j])}', ha='center', va='center', color='white')

    # Avg Pooled
    im2 = axes[2].imshow(avg_pooled.squeeze(), cmap='viridis')
    axes[2].set_title('Avg Pooled (2x2)')
    for i in range(2):
        for j in range(2):
            axes[2].text(j, i, f'{avg_pooled[0,0,i,j]:.1f}', ha='center', va='center', color='white')

    for ax in axes:
        ax.axis('off')

    plt.tight_layout()
    plt.show()

visualize_pooling()

## 3.2 Building a CNN

In [None]:
class CNN(nn.Module):
    """
    Convolutional Neural Network for image classification.

    Architecture:
    Conv1(32) -> BN -> ReLU -> MaxPool ->
    Conv2(64) -> BN -> ReLU -> MaxPool ->
    Conv3(128) -> BN -> ReLU -> MaxPool ->
    Flatten -> FC(256) -> ReLU -> Dropout -> FC(10)
    """
    def __init__(self, in_channels=1, num_classes=10):
        super(CNN, self).__init__()

        # Convolutional layers
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(32)

        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(64)

        self.conv3 = nn.Conv2d(64, 128, kernel_size=3, padding=1)
        self.bn3 = nn.BatchNorm2d(128)

        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

        # Fully connected layers
        # After 3 pooling operations: 28 -> 14 -> 7 -> 3
        self.fc1 = nn.Linear(128 * 3 * 3, 256)
        self.dropout = nn.Dropout(0.5)
        self.fc2 = nn.Linear(256, num_classes)

        self._initialize_weights()

    def _initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, nn.Linear):
                nn.init.kaiming_normal_(m.weight, mode='fan_in', nonlinearity='relu')

    def forward(self, x):
        # Convolutional blocks
        x = self.pool(F.relu(self.bn1(self.conv1(x))))  # 28->14
        x = self.pool(F.relu(self.bn2(self.conv2(x))))  # 14->7
        x = self.pool(F.relu(self.bn3(self.conv3(x))))  # 7->3

        # Flatten
        x = x.view(x.size(0), -1)

        # Fully connected
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)

        return x

# Create model
cnn_model = CNN().to(device)
print(cnn_model)
print(f"\nTotal parameters: {sum(p.numel() for p in cnn_model.parameters()):,}")
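
The `128 * 3 * 3` input size of `fc1` follows from the standard output-size formula for convolution and pooling layers, `floor((n + 2p - k) / s) + 1`. A minimal sketch of that arithmetic, assuming 28x28 inputs as in MNIST; `conv_output_size` is a hypothetical helper written for this illustration, not part of PyTorch:

```python
def conv_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor division)."""
    return (size + 2 * padding - kernel_size) // stride + 1

size = 28            # assumed input height/width (MNIST)
trace = [size]
for _ in range(3):
    # 3x3 convolution with padding=1 keeps the spatial size
    size = conv_output_size(size, kernel_size=3, stride=1, padding=1)
    # 2x2 max pooling with stride 2 halves it (rounding down)
    size = conv_output_size(size, kernel_size=2, stride=2)
    trace.append(size)

flattened = 128 * size * size  # channels * height * width after the last pool
print(trace, flattened)
```

If the input resolution changes, the same helper gives the new `in_features` value for `fc1`.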

In [None]:
# Train the CNN
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(cnn_model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3, factor=0.5)
early_stopping = EarlyStopping(patience=5)

print("Training CNN on MNIST...\n")
cnn_history = train_model(
    cnn_model, train_loader_mnist, val_loader_mnist, criterion, optimizer,
    num_epochs=20, device=device, scheduler=scheduler, early_stopping=early_stopping
)

In [None]:
# Plot training history and evaluate
plot_training_history(cnn_history, "CNN Training History")

test_loss, test_acc, predictions, targets = evaluate(cnn_model, test_loader_mnist, criterion, device)
print(f"\nTest Results: Loss: {test_loss:.4f}, Accuracy: {test_acc:.2f}%")

In [None]:
# Visualize learned filters from first conv layer
def visualize_filters(model):
    filters = model.conv1.weight.data.cpu()
    n_filters = filters.shape[0]

    fig, axes = plt.subplots(4, 8, figsize=(16, 8))
    axes = axes.flatten()

    for i in range(min(32, n_filters)):
        axes[i].imshow(filters[i, 0], cmap='gray')
        axes[i].axis('off')
        axes[i].set_title(f'Filter {i+1}')

    plt.suptitle('Learned Filters (First Conv Layer)', fontsize=14)
    plt.tight_layout()
    plt.show()

visualize_filters(cnn_model)

### 📝 Exercise 3.1: CNN Feature Maps

Visualize the feature maps at different layers of the CNN for a sample image.

In [None]:
# YOUR CODE HERE: Visualize feature maps
def visualize_feature_maps(model, image):
    """
    Pass an image through the CNN and visualize intermediate feature maps.

    Hint: Use forward hooks or manually pass through each layer
    """
    # TODO: Implement feature map visualization
    pass

# Get a sample image
sample_image, label = test_dataset[0]
# visualize_feature_maps(cnn_model, sample_image.unsqueeze(0).to(device))

## 3.3 Data Augmentation for CNNs

In [None]:
# Demonstrate data augmentation
augmentation_transforms = transforms.Compose([
    transforms.RandomRotation(15),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),
])

# Get a sample image
original_image, label = train_dataset_full[0]

# Apply augmentations
fig, axes = plt.subplots(2, 5, figsize=(15, 6))
axes = axes.flatten()

axes[0].imshow(original_image.squeeze(), cmap='gray')
axes[0].set_title('Original')
axes[0].axis('off')

for i in range(1, 10):
    # Convert to PIL for augmentation
    pil_image = transforms.ToPILImage()(original_image)
    augmented = augmentation_transforms(pil_image)
    axes[i].imshow(augmented, cmap='gray')
    axes[i].set_title(f'Augmented {i}')
    axes[i].axis('off')

plt.suptitle('Data Augmentation Examples', fontsize=14)
plt.tight_layout()
plt.show()

---
<a name="part4"></a>
# Part 4: Recurrent Neural Networks (RNN)

## 4.1 Theory

**Recurrent Neural Networks** are designed for sequential data where the order matters.

### Key Characteristics

- **Memory**: Maintains hidden state that captures information from previous time steps
- **Weight Sharing**: Same weights used across all time steps
- **Variable-Length Input**: Can process sequences of any length

### The Hidden State
$$h_t = \tanh(W_{hh} \cdot h_{t-1} + W_{xh} \cdot x_t + b_h)$$
$$y_t = W_{hy} \cdot h_t + b_y$$
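
The recurrence can be checked against PyTorch directly. The sketch below, which assumes a single-layer `nn.RNN` with arbitrary small sizes, unrolls the hidden-state equation by hand using the layer's own weights and compares the result with the layer's output. PyTorch stores two bias vectors (`bias_ih_l0` and `bias_hh_l0`); their sum plays the role of $b_h$.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch, seq_len, input_size, hidden_size = 2, 5, 3, 4  # arbitrary sizes
rnn = nn.RNN(input_size=input_size, hidden_size=hidden_size, batch_first=True)
x = torch.randn(batch, seq_len, input_size)

with torch.no_grad():
    W_xh, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0
    b_h = rnn.bias_ih_l0 + rnn.bias_hh_l0

    # Unroll h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h), starting from h_0 = 0
    h = torch.zeros(batch, hidden_size)
    states = []
    for t in range(seq_len):
        h = torch.tanh(x[:, t] @ W_xh.T + h @ W_hh.T + b_h)
        states.append(h)
    manual_out = torch.stack(states, dim=1)  # (batch, seq_len, hidden_size)

    out, h_n = rnn(x)  # reference result from PyTorch

matches = torch.allclose(manual_out, out, atol=1e-5)
print(matches)
```

The same weights are reused at every step of the loop, which is the weight sharing listed under Key Characteristics.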

### When to Use RNNs
- Time series prediction
- Natural language processing
- Speech recognition
- Any sequential data

### Limitations
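- **Vanishing Gradient Problem**: Gradients shrink roughly exponentially with the number of time steps they are propagated back through (and can also explode), so early inputs receive almost no learning signal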
- **Short-term Memory**: Difficulty learning long-range dependencies

In [None]:
# Create a synthetic sequence dataset for RNN demonstration
def generate_sequence_data(n_samples=1000, seq_length=50, n_features=1):
    """
    Generate synthetic time series data.
    Task: Predict if the sum of the sequence is positive or negative.
    """
    X = torch.randn(n_samples, seq_length, n_features)
    y = (X.sum(dim=(1, 2)) > 0).long()  # Binary classification
    return X, y

# Generate data
X_seq, y_seq = generate_sequence_data(n_samples=5000)

# Split data
X_train, X_temp, y_train, y_temp = train_test_split(X_seq, y_seq, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

# Create datasets and loaders
train_dataset_seq = TensorDataset(X_train, y_train)
val_dataset_seq = TensorDataset(X_val, y_val)
test_dataset_seq = TensorDataset(X_test, y_test)

train_loader_seq = DataLoader(train_dataset_seq, batch_size=64, shuffle=True)
val_loader_seq = DataLoader(val_dataset_seq, batch_size=64, shuffle=False)
test_loader_seq = DataLoader(test_dataset_seq, batch_size=64, shuffle=False)

print(f"Training samples: {len(train_dataset_seq)}")
print(f"Sequence shape: {X_train.shape}")
print(f"Class distribution - 0: {(y_train == 0).sum().item()}, 1: {(y_train == 1).sum().item()}")

## 4.2 Building an RNN

In [None]:
class SimpleRNN(nn.Module):
    """
    Simple Recurrent Neural Network.

    Architecture:
    Input -> RNN layers -> FC -> Output
    """
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, num_classes=2, dropout=0.3):
        super(SimpleRNN, self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # RNN layer
        self.rnn = nn.RNN(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,  # Input shape: (batch, seq, features)
            dropout=dropout if num_layers > 1 else 0,
            nonlinearity='tanh'
        )

        # Output layer
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        # Forward through RNN
        # out: (batch, seq_len, hidden_size)
        # hidden: (num_layers, batch, hidden_size)
        out, hidden = self.rnn(x, h0)

        # Use the last time step's output
        out = out[:, -1, :]  # (batch, hidden_size)

        # Fully connected layer
        out = self.fc(out)

        return out

# Create model
rnn_model = SimpleRNN().to(device)
print(rnn_model)
print(f"\nTotal parameters: {sum(p.numel() for p in rnn_model.parameters()):,}")

In [None]:
# Train the RNN
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(rnn_model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)
early_stopping = EarlyStopping(patience=10)

print("Training RNN on sequence data...\n")
rnn_history = train_model(
    rnn_model, train_loader_seq, val_loader_seq, criterion, optimizer,
    num_epochs=30, device=device, scheduler=scheduler, early_stopping=early_stopping
)

In [None]:
plot_training_history(rnn_history, "Simple RNN Training History")

test_loss, test_acc, _, _ = evaluate(rnn_model, test_loader_seq, criterion, device)
print(f"\nTest Results: Loss: {test_loss:.4f}, Accuracy: {test_acc:.2f}%")

## 4.3 Visualizing the Vanishing Gradient Problem

In [None]:
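# Inspect per-parameter gradient magnitudes of the trained RNN
def analyze_gradients(model, data_loader, device):
    """Return the mean absolute gradient of every parameter for one batch.

    These are per-parameter averages, not per-time-step values, so they
    only hint at vanishing gradients (e.g. tiny recurrent-weight gradients).
    """
    model.train()

    # Get a batch
    inputs, targets = next(iter(data_loader))
    inputs, targets = inputs.to(device), targets.to(device)

    # Clear gradients left over from training so they do not accumulate
    model.zero_grad()

    # Forward pass
    outputs = model(inputs)
    loss = nn.CrossEntropyLoss()(outputs, targets)

    # Backward pass
    loss.backward()

    # Collect the mean absolute gradient of each parameter
    gradients = {}
    for name, param in model.named_parameters():
        if param.grad is not None:
            gradients[name] = param.grad.abs().mean().item()

    return gradients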

# Analyze gradients
rnn_gradients = analyze_gradients(rnn_model, train_loader_seq, device)

print("Gradient magnitudes:")
for name, grad_mag in rnn_gradients.items():
    print(f"  {name}: {grad_mag:.6f}")
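
Per-parameter averages do not show how the gradient changes along the sequence. One way to see that directly, sketched here under the assumption of a freshly initialized single-layer `nn.RNN` on random data (not the trained `rnn_model`), is to ask autograd for the gradient of the loss with respect to the input at every time step:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch, seq_len, hidden_size = 32, 50, 64  # assumed sizes for illustration
rnn = nn.RNN(input_size=1, hidden_size=hidden_size, batch_first=True)
head = nn.Linear(hidden_size, 2)

# requires_grad=True lets us read d(loss)/d(x_t) for every time step t
x = torch.randn(batch, seq_len, 1, requires_grad=True)
targets = torch.randint(0, 2, (batch,))

out, _ = rnn(x)
loss = F.cross_entropy(head(out[:, -1, :]), targets)  # loss uses the last step only
loss.backward()

# L2 norm of the input gradient at each time step, pooled over batch and features
grad_norms = x.grad.pow(2).sum(dim=(0, 2)).sqrt()  # shape: (seq_len,)

print(f"first step: {grad_norms[0].item():.3e}")
print(f"last step:  {grad_norms[-1].item():.3e}")
```

With PyTorch's default initialization the norms for the earliest time steps are typically many orders of magnitude smaller than for the last ones, which is the vanishing gradient effect. Plotting `grad_norms` on a log scale shows the decay along the sequence.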

### 📝 Exercise 4.1: RNN for Sequence Classification

Modify the RNN to:
1. Use bidirectional processing
2. Use attention over all time steps instead of just the last one

Compare the performance.

In [None]:
# YOUR CODE HERE: Create a Bidirectional RNN
class BidirectionalRNN(nn.Module):
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, num_classes=2):
        super(BidirectionalRNN, self).__init__()
        # TODO: Implement bidirectional RNN
        pass

    def forward(self, x):
        # TODO: Implement forward pass
        pass

---
<a name="part5"></a>
# Part 5: LSTM & GRU

## 5.1 LSTM Theory

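**Long Short-Term Memory (LSTM)** networks mitigate the vanishing gradient problem with a gating mechanism: the cell state is updated additively, so gradients can flow across many time steps without being repeatedly squashed.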

### LSTM Gates

1. **Forget Gate**: Decides what information to discard from cell state
   $$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

2. **Input Gate**: Decides what new information to store
   $$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
   $$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$

3. **Cell State Update**:
   $$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

4. **Output Gate**: Decides what to output
   $$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
   $$h_t = o_t \odot \tanh(C_t)$$
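
These equations map directly onto a few tensor operations. As a sketch, assuming `nn.LSTMCell` and arbitrary small sizes, the step below recomputes one LSTM update by hand from the cell's weights and checks it against PyTorch. PyTorch stacks the four weight blocks in the order input, forget, candidate, output, and keeps separate input and hidden weight matrices rather than one matrix applied to the concatenation $[h_{t-1}, x_t]$; the two forms are equivalent.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, hidden_size, batch = 3, 4, 2  # arbitrary sizes
cell = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(batch, input_size)
h_prev = torch.randn(batch, hidden_size)
c_prev = torch.randn(batch, hidden_size)

with torch.no_grad():
    # All four pre-activations at once: (batch, 4 * hidden_size)
    gates = (x_t @ cell.weight_ih.T + cell.bias_ih
             + h_prev @ cell.weight_hh.T + cell.bias_hh)
    i_pre, f_pre, g_pre, o_pre = gates.chunk(4, dim=1)

    f_t = torch.sigmoid(f_pre)      # forget gate
    i_t = torch.sigmoid(i_pre)      # input gate
    c_tilde = torch.tanh(g_pre)     # candidate cell state
    o_t = torch.sigmoid(o_pre)      # output gate

    c_t = f_t * c_prev + i_t * c_tilde   # additive cell-state update
    h_t = o_t * torch.tanh(c_t)          # new hidden state

    h_ref, c_ref = cell(x_t, (h_prev, c_prev))  # reference from PyTorch

lstm_step_matches = (torch.allclose(h_t, h_ref, atol=1e-5)
                     and torch.allclose(c_t, c_ref, atol=1e-5))
print(lstm_step_matches)
```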

In [None]:
class LSTM(nn.Module):
    """
    Long Short-Term Memory Network.
    """
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, num_classes=2, dropout=0.3):
        super(LSTM, self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=False
        )

        # Layer normalization
        self.layer_norm = nn.LayerNorm(hidden_size)

        # Output layer
        self.fc = nn.Linear(hidden_size, num_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # Initialize hidden and cell states
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        # Forward through LSTM
        out, (hidden, cell) = self.lstm(x, (h0, c0))

        # Use the last time step's output
        out = out[:, -1, :]
        out = self.layer_norm(out)
        out = self.dropout(out)
        out = self.fc(out)

        return out

# Create the LSTM model (training happens in the next cell)
lstm_model = LSTM().to(device)
print(lstm_model)
print(f"\nTotal parameters: {sum(p.numel() for p in lstm_model.parameters()):,}")

In [None]:
# Train LSTM
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(lstm_model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)
early_stopping = EarlyStopping(patience=10)

print("Training LSTM...\n")
lstm_history = train_model(
    lstm_model, train_loader_seq, val_loader_seq, criterion, optimizer,
    num_epochs=30, device=device, scheduler=scheduler, early_stopping=early_stopping
)

## 5.2 GRU (Gated Recurrent Unit)

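GRU is a simplified variant of LSTM: it merges the cell state into the hidden state and uses two gates instead of three, so it has roughly three quarters of the parameters of an LSTM with the same hidden size.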

### GRU Gates

1. **Update Gate**: Combines forget and input gates
   $$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$

2. **Reset Gate**: Controls how much past information to forget
   $$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$

3. **Hidden State**:
   $$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$$
   $$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$
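
The parameter savings can be checked by counting. A minimal sketch, assuming single-layer PyTorch modules with the same `input_size=1` and `hidden_size=64` used in this notebook; `count_params` is a helper defined here for illustration:

```python
import torch.nn as nn

def count_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 1, 64
# One weight block: hidden*(input+hidden) weights plus 2*hidden biases
# (PyTorch keeps separate input and hidden bias vectors)
block = hidden_size * (input_size + hidden_size) + 2 * hidden_size

counts = {
    'RNN': count_params(nn.RNN(input_size, hidden_size)),
    'GRU': count_params(nn.GRU(input_size, hidden_size)),
    'LSTM': count_params(nn.LSTM(input_size, hidden_size)),
}
for name, n in counts.items():
    print(f"{name:4s}: {n:6,d} parameters = {n // block} x {block}")
```

Each gate or candidate adds one weight block of the same size, so the GRU (3 blocks) sits between the plain RNN (1) and the LSTM (4). PyTorch's `nn.GRU` documentation writes the final interpolation with the roles of $z_t$ and $1 - z_t$ swapped relative to the formula above; the two conventions are equivalent up to relabeling the gate.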

In [None]:
class GRU(nn.Module):
    """
    Gated Recurrent Unit Network.
    """
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, num_classes=2, dropout=0.3):
        super(GRU, self).__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # GRU layer
        self.gru = nn.GRU(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Output layer
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # Initialize hidden state
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_size).to(x.device)

        # Forward through GRU
        out, hidden = self.gru(x, h0)

        # Use the last time step's output
        out = out[:, -1, :]
        out = self.fc(out)

        return out

# Create the GRU model (training happens in the next cell)
gru_model = GRU().to(device)
print(gru_model)
print(f"\nTotal parameters: {sum(p.numel() for p in gru_model.parameters()):,}")

In [None]:
# Train GRU
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(gru_model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)
early_stopping = EarlyStopping(patience=10)

print("Training GRU...\n")
gru_history = train_model(
    gru_model, train_loader_seq, val_loader_seq, criterion, optimizer,
    num_epochs=30, device=device, scheduler=scheduler, early_stopping=early_stopping
)

In [None]:
# Compare RNN, LSTM, and GRU
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss comparison
axes[0].plot(rnn_history['val_loss'], label='RNN', alpha=0.8)
axes[0].plot(lstm_history['val_loss'], label='LSTM', alpha=0.8)
axes[0].plot(gru_history['val_loss'], label='GRU', alpha=0.8)
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Validation Loss')
axes[0].set_title('Validation Loss Comparison')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Accuracy comparison
axes[1].plot(rnn_history['val_acc'], label='RNN', alpha=0.8)
axes[1].plot(lstm_history['val_acc'], label='LSTM', alpha=0.8)
axes[1].plot(gru_history['val_acc'], label='GRU', alpha=0.8)
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Validation Accuracy (%)')
axes[1].set_title('Validation Accuracy Comparison')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Final test results
print("\nFinal Test Results:")
print("-" * 40)
for name, model in [('RNN', rnn_model), ('LSTM', lstm_model), ('GRU', gru_model)]:
    test_loss, test_acc, _, _ = evaluate(model, test_loader_seq, criterion, device)
    params = sum(p.numel() for p in model.parameters())
    print(f"{name:6s} - Accuracy: {test_acc:.2f}%, Loss: {test_loss:.4f}, Params: {params:,}")

### 📝 Exercise 5.1: LSTM vs GRU Analysis

1. Which model performed better and why?
2. How does the number of parameters compare?
3. Try increasing the sequence length to 100 and 200. How does each model perform?

**Your Analysis:**

[Write your analysis here]

---
<a name="part6"></a>
# Part 6: Attention Mechanisms

## 6.1 Theory

**Attention** allows the model to focus on relevant parts of the input when making predictions.

### Basic Attention Formula

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where:
- **Q (Query)**: What we're looking for
- **K (Key)**: What we're matching against
- **V (Value)**: What we retrieve
- **d_k**: Dimension of keys (for scaling)

### Types of Attention
1. **Self-Attention**: Q, K, V all come from the same sequence
2. **Cross-Attention**: Q comes from one sequence, K and V from another
3. **Multi-Head Attention**: Multiple attention operations in parallel

In [None]:
# Visualize attention mechanism
def visualize_attention_concept():
    # Simulate attention weights for a sentence
    words = ['The', 'cat', 'sat', 'on', 'the', 'mat']
    query_word = 'sat'

    # Simulated attention scores (higher = more relevant)
    attention_scores = torch.tensor([0.1, 0.4, 1.0, 0.2, 0.05, 0.15])
    attention_weights = F.softmax(attention_scores, dim=0)

    fig, axes = plt.subplots(1, 2, figsize=(14, 4))

    # Raw scores
    axes[0].bar(words, attention_scores.numpy(), color='steelblue')
    axes[0].set_title(f'Raw Attention Scores (Query: "{query_word}")')
    axes[0].set_ylabel('Score')

    # Softmax weights
    axes[1].bar(words, attention_weights.numpy(), color='coral')
    axes[1].set_title('Attention Weights after Softmax')
    axes[1].set_ylabel('Weight')

    plt.tight_layout()
    plt.show()

visualize_attention_concept()

In [None]:
class ScaledDotProductAttention(nn.Module):
    """
    Scaled Dot-Product Attention.

    Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
    """
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, query, key, value, mask=None):
        """
        Args:
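            query: (..., seq_len_q, d_k); leading dims are batch (and heads)
            key: (..., seq_len_k, d_k)
            value: (..., seq_len_k, d_v)
            mask: Optional mask broadcastable to (..., seq_len_q, seq_len_k);
                positions where mask == 0 are blocked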
        """
        d_k = query.size(-1)

        # Compute attention scores
        scores = torch.matmul(query, key.transpose(-2, -1)) / np.sqrt(d_k)

        # Apply mask (optional)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))

        # Softmax to get attention weights
        attention_weights = F.softmax(scores, dim=-1)
        attention_weights = self.dropout(attention_weights)

        # Apply attention to values
        output = torch.matmul(attention_weights, value)

        return output, attention_weights

# Test the attention module
batch_size = 2
seq_len = 4
d_model = 8

attention = ScaledDotProductAttention()
attention.eval()  # disable dropout so each row of the printed weights sums to 1
q = torch.randn(batch_size, seq_len, d_model)
k = torch.randn(batch_size, seq_len, d_model)
v = torch.randn(batch_size, seq_len, d_model)

output, weights = attention(q, k, v)
print(f"Input shape: {q.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")
print(f"\nAttention weights (first batch):\n{weights[0].detach().numpy().round(3)}")
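
The optional `mask` argument is what turns this into causal (autoregressive) attention. A minimal sketch using plain tensor operations and the same convention as the class above, where positions with `mask == 0` are blocked; the sizes are arbitrary:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, d_k = 1, 4, 8  # arbitrary sizes
q = torch.randn(batch_size, seq_len, d_k)
k = torch.randn(batch_size, seq_len, d_k)

# Lower-triangular mask: position i may attend to positions 0..i only
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

scores = q @ k.transpose(-2, -1) / d_k ** 0.5
scores = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)  # masked positions get weight 0

print(weights[0])
```

Every row still sums to 1, but all weight above the diagonal is zero, so no position looks at later positions. The built-in `nn.MultiheadAttention` uses the opposite boolean convention for `attn_mask`: `True` marks positions that may not be attended to.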

In [None]:
class MultiHeadAttention(nn.Module):
    """
    Multi-Head Attention.

    Runs multiple attention operations in parallel, then concatenates results.
    """
    def __init__(self, d_model, num_heads, dropout=0.1):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"

        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        # Linear projections
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

        self.attention = ScaledDotProductAttention(dropout)

    def forward(self, query, key, value, mask=None):
        batch_size = query.size(0)

        # Linear projections and reshape for multi-head
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, d_k)
        query = self.W_q(query).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        key = self.W_k(key).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)
        value = self.W_v(value).view(batch_size, -1, self.num_heads, self.d_k).transpose(1, 2)

        # Apply attention
        attn_output, attn_weights = self.attention(query, key, value, mask)

        # Concatenate heads
        # (batch, num_heads, seq_len, d_k) -> (batch, seq_len, d_model)
        attn_output = attn_output.transpose(1, 2).contiguous().view(batch_size, -1, self.d_model)

        # Final linear projection
        output = self.W_o(attn_output)

        return output, attn_weights

# Test multi-head attention
mha = MultiHeadAttention(d_model=64, num_heads=8)
x = torch.randn(2, 10, 64)  # (batch, seq_len, d_model)
output, weights = mha(x, x, x)  # Self-attention
print(f"Input shape: {x.shape}")
print(f"Output shape: {output.shape}")
print(f"Attention weights shape: {weights.shape}")

## 6.2 LSTM with Attention

In [None]:
class LSTMWithAttention(nn.Module):
    """
    LSTM with self-attention mechanism.
    """
    def __init__(self, input_size=1, hidden_size=64, num_layers=2, num_classes=2, dropout=0.3):
        super().__init__()

        self.hidden_size = hidden_size
        self.num_layers = num_layers

        # LSTM layer
        self.lstm = nn.LSTM(
            input_size=input_size,
            hidden_size=hidden_size,
            num_layers=num_layers,
            batch_first=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Attention layer
        self.attention = nn.MultiheadAttention(
            embed_dim=hidden_size,
            num_heads=4,
            dropout=dropout,
            batch_first=True
        )

        # Layer norm
        self.layer_norm = nn.LayerNorm(hidden_size)

        # Output layers
        self.fc = nn.Linear(hidden_size, num_classes)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        batch_size = x.size(0)

        # Initialize hidden states
        h0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)
        c0 = torch.zeros(self.num_layers, batch_size, self.hidden_size).to(x.device)

        # LSTM forward
        lstm_out, _ = self.lstm(x, (h0, c0))  # (batch, seq_len, hidden_size)

        # Self-attention over LSTM outputs
        attn_out, attn_weights = self.attention(lstm_out, lstm_out, lstm_out)

        # Residual connection and layer norm
        attn_out = self.layer_norm(lstm_out + attn_out)

        # Global average pooling over sequence
        pooled = attn_out.mean(dim=1)  # (batch, hidden_size)

        # Output
        out = self.dropout(pooled)
        out = self.fc(out)

        return out

# Create the LSTM with Attention model (training happens in the next cell)
lstm_attn_model = LSTMWithAttention().to(device)
print(lstm_attn_model)
print(f"\nTotal parameters: {sum(p.numel() for p in lstm_attn_model.parameters()):,}")

In [None]:
# Train LSTM with Attention
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(lstm_attn_model.parameters(), lr=0.001)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, factor=0.5)
early_stopping = EarlyStopping(patience=10)

print("Training LSTM with Attention...\n")
lstm_attn_history = train_model(
    lstm_attn_model, train_loader_seq, val_loader_seq, criterion, optimizer,
    num_epochs=30, device=device, scheduler=scheduler, early_stopping=early_stopping
)

In [None]:
plot_training_history(lstm_attn_history, "LSTM with Attention Training History")

test_loss, test_acc, _, _ = evaluate(lstm_attn_model, test_loader_seq, criterion, device)
print(f"\nTest Results: Loss: {test_loss:.4f}, Accuracy: {test_acc:.2f}%")

---
<a name="part7"></a>
# Part 7: Transformers

## 7.1 Theory

**Transformers** use self-attention as the primary mechanism, completely eliminating recurrence.

### Key Components

1. **Positional Encoding**: Injects sequence position information
2. **Multi-Head Attention**: Parallel attention operations
3. **Feed-Forward Network**: Position-wise fully connected layers
4. **Residual Connections**: Skip connections around each sub-layer
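5. **Layer Normalization**: Normalizes activations around each sub-layer (the encoder block below applies it after the residual addition, the post-norm arrangement)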

### Advantages over RNNs
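- Parallelizable: all time steps are processed at once instead of one after another
- Better at capturing long-range dependencies, since any two positions are connected by a single attention step
- Faster training on parallel hardware (GPUs/TPUs) for typical sequence lengths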

In [None]:
class PositionalEncoding(nn.Module):
    """
    Positional Encoding using sine and cosine functions.

    PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    def __init__(self, d_model, max_len=5000, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))

        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)

        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        """
        Args:
            x: (batch, seq_len, d_model)
        """
        x = x + self.pe[:, :x.size(1), :]
        return self.dropout(x)

# Visualize positional encoding
pe = PositionalEncoding(d_model=64, max_len=100, dropout=0)
dummy_input = torch.zeros(1, 100, 64)
pe_output = pe(dummy_input)

plt.figure(figsize=(12, 6))
plt.imshow(pe.pe[0, :50, :].numpy(), aspect='auto', cmap='viridis')
plt.colorbar()
plt.xlabel('Dimension')
plt.ylabel('Position')
plt.title('Positional Encoding Visualization')
plt.show()

In [None]:
class TransformerEncoderBlock(nn.Module):
    """
    Single Transformer Encoder block.
    """
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()

        # Multi-head attention
        self.self_attn = nn.MultiheadAttention(
            embed_dim=d_model,
            num_heads=num_heads,
            dropout=dropout,
            batch_first=True
        )

        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout)
        )

        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_out, attn_weights = self.self_attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + self.dropout(attn_out))

        # Feed-forward with residual connection
        ffn_out = self.ffn(x)
        x = self.norm2(x + ffn_out)

        return x, attn_weights

In [None]:
class TransformerClassifier(nn.Module):
    """
    Transformer model for sequence classification.
    """
    def __init__(self, input_size=1, d_model=64, num_heads=4, num_layers=2,
                 d_ff=256, num_classes=2, max_len=100, dropout=0.1):
        super().__init__()

        # Input projection
        self.input_proj = nn.Linear(input_size, d_model)

        # Positional encoding (max_len must be at least seq_len + 1 because of the CLS token)
        self.pos_encoder = PositionalEncoding(d_model, max_len, dropout)

        # Transformer encoder layers
        self.encoder_layers = nn.ModuleList([
            TransformerEncoderBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])

        # Classification head
        self.fc = nn.Sequential(
            nn.Linear(d_model, d_model),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(d_model, num_classes)
        )

        # CLS token (learnable)
        self.cls_token = nn.Parameter(torch.randn(1, 1, d_model))

    def forward(self, x):
        batch_size = x.size(0)

        # Project input
        x = self.input_proj(x)  # (batch, seq_len, d_model)

        # Add CLS token
        cls_tokens = self.cls_token.expand(batch_size, -1, -1)
        x = torch.cat([cls_tokens, x], dim=1)  # (batch, seq_len+1, d_model)

        # Add positional encoding
        x = self.pos_encoder(x)

        # Pass through encoder layers
        attn_weights_list = []
        for encoder_layer in self.encoder_layers:
            x, attn_weights = encoder_layer(x)
            attn_weights_list.append(attn_weights)

        # Use CLS token for classification
        cls_output = x[:, 0, :]  # (batch, d_model)

        # Classification
        output = self.fc(cls_output)

        return output

# Create transformer model
transformer_model = TransformerClassifier().to(device)
print(transformer_model)
print(f"\nTotal parameters: {sum(p.numel() for p in transformer_model.parameters()):,}")

In [None]:
# Train Transformer
criterion = nn.CrossEntropyLoss()
optimizer = optim.AdamW(transformer_model.parameters(), lr=0.001, weight_decay=0.01)
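# CosineAnnealingLR is stepped without a metric (unlike ReduceLROnPlateau);
# this assumes train_model handles both scheduler types
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=30)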
early_stopping = EarlyStopping(patience=10)

print("Training Transformer...\n")
transformer_history = train_model(
    transformer_model, train_loader_seq, val_loader_seq, criterion, optimizer,
    num_epochs=30, device=device, scheduler=scheduler, early_stopping=early_stopping
)

In [None]:
plot_training_history(transformer_history, "Transformer Training History")

test_loss, test_acc, _, _ = evaluate(transformer_model, test_loader_seq, criterion, device)
print(f"\nTest Results: Loss: {test_loss:.4f}, Accuracy: {test_acc:.2f}%")

---
<a name="part8"></a>
# Part 8: Comparison and Analysis

## 8.1 Architecture Comparison

In [None]:
# Compare all sequence models
models_comparison = {
    'Simple RNN': (rnn_model, rnn_history),
    'LSTM': (lstm_model, lstm_history),
    'GRU': (gru_model, gru_history),
    'LSTM + Attention': (lstm_attn_model, lstm_attn_history),
    'Transformer': (transformer_model, transformer_history),
}

# Create comparison table
print("\n" + "="*80)
print("ARCHITECTURE COMPARISON - SEQUENCE CLASSIFICATION")
print("="*80)
print(f"{'Model':<20} {'Parameters':>12} {'Test Acc':>12} {'Best Val Acc':>14} {'Epochs':>8}")
print("-"*80)

results = []
for name, (model, history) in models_comparison.items():
    params = sum(p.numel() for p in model.parameters())
    test_loss, test_acc, _, _ = evaluate(model, test_loader_seq, criterion, device)
    best_val_acc = max(history['val_acc'])
    epochs = len(history['val_acc'])

    print(f"{name:<20} {params:>12,} {test_acc:>11.2f}% {best_val_acc:>13.2f}% {epochs:>8}")
    results.append({'Model': name, 'Params': params, 'Test Acc': test_acc, 'Best Val Acc': best_val_acc})

print("="*80)

In [None]:
# Visual comparison
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Training loss comparison
ax1 = axes[0, 0]
for name, (model, history) in models_comparison.items():
    ax1.plot(history['train_loss'], label=name, alpha=0.8)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Training Loss')
ax1.set_title('Training Loss Comparison')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Validation loss comparison
ax2 = axes[0, 1]
for name, (model, history) in models_comparison.items():
    ax2.plot(history['val_loss'], label=name, alpha=0.8)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Validation Loss')
ax2.set_title('Validation Loss Comparison')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Validation accuracy comparison
ax3 = axes[1, 0]
for name, (model, history) in models_comparison.items():
    ax3.plot(history['val_acc'], label=name, alpha=0.8)
ax3.set_xlabel('Epoch')
ax3.set_ylabel('Validation Accuracy (%)')
ax3.set_title('Validation Accuracy Comparison')
ax3.legend()
ax3.grid(True, alpha=0.3)

# Parameters vs Performance
ax4 = axes[1, 1]
for r in results:
    ax4.scatter(r['Params'], r['Test Acc'], s=100, label=r['Model'])
ax4.set_xlabel('Number of Parameters')
ax4.set_ylabel('Test Accuracy (%)')
ax4.set_title('Parameters vs Performance')
ax4.legend()
ax4.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

## 8.2 Summary Table: When to Use Each Architecture

| Architecture | Best For | Pros | Cons |
|-------------|----------|------|------|
| **ANN (MLP)** | Tabular data, simple classification | Simple, fast to train | No built-in handling of spatial or temporal structure |
| **CNN** | Images, spatial data | Translation equivariant (roughly invariant after pooling), parameter efficient through weight sharing | Fixed input size when followed by fully connected layers, no temporal modeling |
| **RNN** | Short, simple sequences | Handles variable-length inputs | Vanishing gradients, slow sequential training |
| **LSTM** | Long sequences, time series | Gating captures long-range dependencies | More parameters, slower than GRU |
| **GRU** | Same tasks as LSTM | Fewer parameters, faster than LSTM | Can underperform LSTM on some tasks |
| **Transformer** | NLP, complex sequences | Parallelizable, captures long-range deps | Quadratic memory in sequence length |
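
The parameter and memory claims in the table can be checked with a few lines. This is a minimal sketch, assuming single-layer recurrent modules with illustrative sizes (`input_size=16` and `hidden_size=64` are not taken from this assignment's models). `count_params` is a hypothetical helper.

```python
import torch.nn as nn

def count_params(module):
    """Total number of parameters in a module."""
    return sum(p.numel() for p in module.parameters())

input_size, hidden_size = 16, 64  # illustrative sizes only

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
gru = nn.GRU(input_size, hidden_size, batch_first=True)

# One gate holds an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors.
per_gate = hidden_size * (input_size + hidden_size) + 2 * hidden_size

print(f"RNN : {count_params(rnn):,} (1 x {per_gate:,})")
print(f"GRU : {count_params(gru):,} (3 x {per_gate:,})")
print(f"LSTM: {count_params(lstm):,} (4 x {per_gate:,})")

# Self-attention computes one score per (query, key) pair, so the score
# matrix for a single head has seq_len * seq_len entries.
attention_entries = {n: n * n for n in (128, 256, 512)}
for n, entries in attention_entries.items():
    print(f"seq_len={n:>4}: {entries:,} attention scores per head")
```

A GRU layer has three gate-sized weight blocks against four for an LSTM, which is where the "fewer parameters" entry comes from. Doubling the sequence length quadruples the attention score matrix.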

---
<a name="part9"></a>
# Part 9: Final Project

## 9.1 Project: Build Your Own Deep Learning Pipeline

### Requirements

Choose ONE of the following projects:

#### Option A: Image Classification with CNN
- Use the CIFAR-10 dataset
- Implement a CNN with at least 3 convolutional layers
- Include batch normalization and dropout
- Implement data augmentation
- Achieve at least 80% test accuracy
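
One possible starting point for Option A is sketched below. It is not a reference solution and makes no claim about reaching the accuracy target. `SmallCifarCNN` and `augment_batch` are hypothetical names, the layer widths are arbitrary, and the augmentation is written in plain PyTorch so the block runs without downloading CIFAR-10 (in practice, `torchvision.transforms` such as `RandomCrop` and `RandomHorizontalFlip` are the usual route).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCifarCNN(nn.Module):
    """Three conv blocks (conv -> batch norm -> ReLU -> max pool) plus a dropout classifier."""

    def __init__(self, num_classes=10, dropout=0.3):
        super().__init__()

        def block(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(2),  # halves height and width
            )

        # 32x32 input -> 16x16 -> 8x8 -> 4x4 after three pooling steps
        self.features = nn.Sequential(block(3, 32), block(32, 64), block(64, 128))
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(dropout),
            nn.Linear(128 * 4 * 4, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

def augment_batch(images, pad=4):
    """Random horizontal flip per image, then one random shifted crop per batch."""
    flip_mask = torch.rand(images.size(0)) < 0.5
    images = torch.where(flip_mask[:, None, None, None], images.flip(dims=[3]), images)
    _, _, h, w = images.shape
    padded = F.pad(images, (pad, pad, pad, pad))  # zero-pad all four sides
    top = torch.randint(0, 2 * pad + 1, (1,)).item()
    left = torch.randint(0, 2 * pad + 1, (1,)).item()
    return padded[:, :, top:top + h, left:left + w]

# Shape check on random data standing in for a CIFAR-10 batch.
images = torch.randn(4, 3, 32, 32)
augmented = augment_batch(images)
cnn = SmallCifarCNN()
cnn.eval()
with torch.no_grad():
    logits = cnn(augmented)
print(augmented.shape, logits.shape)
```

Augmentation belongs in the training loop only. Validation and test batches should be passed to the model unchanged.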

#### Option B: Sequence Prediction with LSTM/Transformer
- Generate or use a time series dataset
- Implement both LSTM and Transformer models
- Compare their performance
- Visualize attention weights
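
For the attention-weight requirement in Option B, one way to obtain a matrix to plot is `nn.MultiheadAttention`, which returns the weights alongside its output. The sketch below assumes a standalone attention layer on random data. The sizes and variable names are illustrative and are not tied to the models built earlier in this notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, d_model = 2, 12, 16  # illustrative sizes

attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)
attn.eval()

x = torch.randn(batch_size, seq_len, d_model)
with torch.no_grad():
    # Self-attention: the same tensor serves as query, key, and value.
    out, weights = attn(x, x, x, need_weights=True)

# weights has shape (batch, query position, key position), averaged over heads.
# Each row is a probability distribution over the key positions.
print(out.shape, weights.shape)
print(weights[0].sum(dim=-1))
```

Passing `weights[0].numpy()` to `plt.imshow` gives a heatmap for the first sequence, with query positions on the vertical axis and key positions on the horizontal axis.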

#### Option C: Text Classification
- Use a text classification dataset (e.g., IMDB reviews)
- Implement word embeddings
- Build both RNN-based and Transformer-based classifiers
- Compare and analyze results
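
The word-embedding step in Option C can be sketched as follows, assuming whitespace tokenization and a vocabulary built from the training texts. The two sentences are toy examples rather than IMDB data, and `vocab`, `encode`, and `max_len` are hypothetical names.

```python
import torch
import torch.nn as nn

reviews = ["a great movie", "a dull and boring movie"]  # toy examples

# Index 0 is reserved for padding and index 1 for unknown tokens.
vocab = {"<pad>": 0, "<unk>": 1}
for text in reviews:
    for token in text.split():
        vocab.setdefault(token, len(vocab))

def encode(text, max_len):
    """Map tokens to ids, truncate to max_len, and right-pad with the <pad> id."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in text.split()][:max_len]
    return ids + [vocab["<pad>"]] * (max_len - len(ids))

max_len = 5
batch = torch.tensor([encode(text, max_len) for text in reviews])  # (batch, max_len)

# padding_idx keeps the <pad> vector at zero and excludes it from gradient updates.
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8, padding_idx=0)
embedded = embedding(batch)  # (batch, max_len, embedding_dim)

print(batch)
print(embedded.shape)
```

The resulting `(batch, max_len, embedding_dim)` tensor is the input shape that both a `batch_first=True` recurrent layer and a Transformer encoder expect, so the same embedding layer can feed either classifier.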

### Deliverables

1. **Code**: Well-documented implementation
2. **Analysis**:
   - Architecture description and justification
   - Training curves and metrics
   - Comparison with baseline
3. **Discussion**:
   - What worked well?
   - What challenges did you face?
   - How would you improve the model?

In [None]:
# YOUR FINAL PROJECT CODE HERE

# Example structure:

# 1. Load and preprocess data
# ...

# 2. Define model architecture
# ...

# 3. Training loop
# ...

# 4. Evaluation and visualization
# ...

# 5. Analysis and conclusions
# ...

## 9.2 Saving and Loading Models

In [None]:
# Save a model
def save_model(model, optimizer, epoch, path='model_checkpoint.pth'):
    """Save model checkpoint."""
    torch.save({
        'epoch': epoch,
        'model_state_dict': model.state_dict(),
        'optimizer_state_dict': optimizer.state_dict(),
    }, path)
    print(f"Model saved to {path}")

# Load a model
def load_model(model, optimizer, path='model_checkpoint.pth'):
    """Load model checkpoint."""
    # map_location lets a checkpoint saved on a GPU load on a CPU-only machine
    checkpoint = torch.load(path, map_location='cpu')
    model.load_state_dict(checkpoint['model_state_dict'])
    optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
    epoch = checkpoint['epoch']
    print(f"Model loaded from {path} (epoch {epoch})")
    return model, optimizer, epoch

# Example usage
# save_model(transformer_model, optimizer, epoch=30, path='transformer_checkpoint.pth')
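
A quick way to confirm that a checkpoint restores both the model and the optimizer is a round trip through a temporary file. The sketch below is self-contained and mirrors the checkpoint layout used above. `model_a`, `model_b`, and the temporary path are hypothetical and stand in for a real model and file.

```python
import os
import tempfile
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Train a tiny stand-in model for one step so the optimizer has state to save.
model_a = nn.Linear(4, 2)
opt_a = optim.AdamW(model_a.parameters(), lr=0.001)
loss = model_a(torch.randn(8, 4)).pow(2).mean()
loss.backward()
opt_a.step()

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_checkpoint.pth")
    torch.save({
        'epoch': 1,
        'model_state_dict': model_a.state_dict(),
        'optimizer_state_dict': opt_a.state_dict(),
    }, path)
    checkpoint = torch.load(path, map_location='cpu')

# Restore into freshly constructed objects with the same architecture.
model_b = nn.Linear(4, 2)
opt_b = optim.AdamW(model_b.parameters(), lr=0.001)
model_b.load_state_dict(checkpoint['model_state_dict'])
opt_b.load_state_dict(checkpoint['optimizer_state_dict'])

restored = all(torch.equal(p, q) for p, q in zip(model_a.parameters(), model_b.parameters()))
print(f"Weights restored: {restored}, resume from epoch {checkpoint['epoch']}")
```

The model and optimizer must be constructed with the same architecture before loading, because a state dict stores tensors by parameter name and does not store the class definition.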

---

# 📋 Assignment Checklist

Before submitting, ensure you have completed:

- [ ] **Part 1**: Answered Exercise 1.1 (Activation Functions)
- [ ] **Part 2**: Completed Exercise 2.1 (Modified ANN)
- [ ] **Part 3**: Completed Exercise 3.1 (CNN Feature Maps)
- [ ] **Part 4**: Completed Exercise 4.1 (Bidirectional RNN)
- [ ] **Part 5**: Completed Exercise 5.1 (LSTM vs GRU Analysis)
- [ ] **Part 9**: Final Project with all deliverables

### Grading Rubric

| Component | Points |
|-----------|--------|
| Exercises (1.1 - 5.1) | 4 |
| Final Project Implementation | 3 |
| Final Project Analysis | 2 |
| Code Quality & Documentation | 1 |
| **Total** | **10** |

---

## 📚 Additional Resources

- [PyTorch Documentation](https://pytorch.org/docs/stable/index.html)
- [Deep Learning Book by Goodfellow et al.](https://www.deeplearningbook.org/)
- [Attention Is All You Need (Transformer Paper)](https://arxiv.org/abs/1706.03762)
- [Understanding LSTM Networks](https://colah.github.io/posts/2015-08-Understanding-LSTMs/)
- [The Illustrated Transformer](https://jalammar.github.io/illustrated-transformer/)

---

**Good luck with your assignment!** 🚀