# Introduction to LLM Engineering

Welcome to the first exercise of our course on LLM Engineering. 

This notebook will introduce you to the fundamental concepts of PyTorch, neural networks, and good programming practices that will serve as a foundation for working with more complex language models. By the end of this exercise, you'll be comfortable with implementing neural networks in PyTorch, understanding key concepts like train/evaluation modes, and applying best practices for model development and evaluation.

Ensure you understand each section thoroughly, as they form the basis for complex topics covered later in the course. If you are not familiar with the concepts covered here, please catch up on them.


In [None]:
# Import PyTorch for tensor operations and building neural networks
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# Dataclasses and typing help structure data neatly and clearly
from dataclasses import dataclass
from typing import List, Tuple, Dict, Optional, Union

# Importing libraries for data manipulation and visualization
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing libraries for data loading and preprocessing
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import json
import random
import os
from datetime import datetime

# Set a random seed for reproducibility
def set_seed(seed: int = 42) -> None:
    """Set seed for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False

set_seed()

## Part 1: Typing and Object-Oriented Programming (OOP) in Neural Networks

Clean code structure, type annotations, and object-oriented programming (OOP) principles improve maintainability, readability, and reusability, which matters most in large-scale ML projects.
In this section, we'll create a basic neural network using these principles.

### Building a Simple Neural Network

A neural network consists of layers performing linear transformations and non-linear activations. Each layer applies transformations to the input data.
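
To make this concrete, here is a minimal sketch of what a single layer computes. The layer sizes and the names `layer`, `manual`, and `builtin` are assumptions chosen for illustration and are not part of the course code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, chosen only for illustration
layer = nn.Linear(in_features=4, out_features=3)
x = torch.randn(2, 4)  # batch of 2 samples with 4 features each

# Linear transformation: x @ W^T + b, followed by ReLU (negatives become zero)
manual = torch.clamp(x @ layer.weight.T + layer.bias, min=0)
builtin = F.relu(layer(x))

print(torch.allclose(manual, builtin, atol=1e-6))
print(builtin.shape)
```

Both computations should agree, which shows that `nn.Linear` followed by `F.relu` is just a matrix multiplication, a bias addition, and an element-wise clamp.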

**Example:** Here, we define a simple feedforward neural network with one hidden layer and dropout.



In [None]:
class SimpleNN(nn.Module):
    """A simple neural network with configurable architecture."""
    
    def __init__(self, input_size: int, hidden_size: int, output_size: int, dropout_rate: float = 0.2):
        """
        Initialize the neural network.
        
        Args:
            input_size: Number of input features
            hidden_size: Number of neurons in hidden layer
            output_size: Number of output classes
            dropout_rate: Dropout probability for regularization
        """
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.dropout = nn.Dropout(dropout_rate)
        self.fc2 = nn.Linear(hidden_size, output_size)
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass through the network.
        
        Args:
            x: Input tensor of shape [batch_size, input_size]
            
        Returns:
            Output tensor of shape [batch_size, output_size]
        """
        x = F.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.fc2(x)
        return x
    
    def predict(self, x: torch.Tensor) -> torch.Tensor:
        """
        Make predictions using the model.
        
        Args:
            x: Input tensor
            
        Returns:
            Predicted class indices
        """
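        # Note: call model.eval() beforehand so that dropout is disabled
        with torch.no_grad():
            outputs = self.forward(x)
            # The index of the largest logit is the predicted class
            predicted = torch.argmax(outputs, dim=1)
        return predicted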


The `SimpleNN` class demonstrates how to inherit from PyTorch's `nn.Module`. We initialize the layers in the `__init__` method and implement a `forward` method that passes data through the network. We can also add further utility methods, such as `predict`, to make predictions based on the model outputs.


This pattern of encapsulating model logic in a class will be used throughout your work with neural networks and LLMs.
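
One practical benefit of this pattern is that `nn.Module` registers every layer assigned as an attribute, so its parameters can be listed and counted. The sketch below uses a hypothetical `TinyNet` class (not part of the course code) to illustrate this.

```python
import torch.nn as nn


class TinyNet(nn.Module):
    """Hypothetical two-layer network, used only for this illustration."""

    def __init__(self) -> None:
        super().__init__()
        self.fc1 = nn.Linear(10, 20)
        self.fc2 = nn.Linear(20, 2)

    def forward(self, x):
        return self.fc2(self.fc1(x).relu())


tiny = TinyNet()

# Layers assigned as attributes are registered automatically
param_names = [name for name, _ in tiny.named_parameters()]
n_params = sum(p.numel() for p in tiny.parameters())

print(param_names)
print(n_params)  # (10*20 + 20) + (20*2 + 2) = 262
```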

In [None]:
# Let's create an instance of our model
input_size = 10
hidden_size = 20
output_size = 2
model = SimpleNN(input_size, hidden_size, output_size)
print(f"Model architecture:\n{model}")

# Example forward pass
example_input = torch.randn(5, input_size) # Batch of 5 samples
output = model(example_input) # Forward pass
prediction = model.predict(example_input) # Prediction
print(f"\nExample input shape: {example_input.shape}")
print(f"Example output shape: {output.shape}")
print(f"Example output:\n{output}")
print(f"Predicted classes:\n{prediction}")

### Exercise: Implement a deeper neural network with multiple hidden layers

For this exercise, create a new class called `DeepNN` that extends `nn.Module`. Your implementation should include:
* 3 hidden layers with configurable sizes
* ReLU activations between layers
* Dropout for regularization
* `__init__` and `forward` methods


In [None]:
class DeepNN(nn.Module):
    """
    Your implementation here...
    """
    pass


## Part 2: Structured Outputs and Parsing

When working with LLMs and other complex models, it's important to structure your outputs in a way that makes them easy to process and interpret. In this section, we'll implement structured outputs using data classes.
Using Python's `dataclass` simplifies this process by clearly defining expected attributes and their types.

We define a class `ModelOutput` to structure the outputs of our model. This class has methods for converting outputs to different formats (dict, JSON).
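
Tensors cannot be passed to `json.dumps` directly, which is why a conversion step via `.tolist()` is needed. The following minimal sketch shows a JSON round trip with a hypothetical `ScoreOutput` dataclass; the class and field names are assumptions made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Dict, List

import torch


@dataclass
class ScoreOutput:
    """Hypothetical container holding a single tensor of scores."""
    scores: torch.Tensor

    def to_dict(self) -> Dict[str, List[float]]:
        # Convert the tensor to a plain Python list so json can serialize it
        return {"scores": self.scores.tolist()}


out = ScoreOutput(scores=torch.tensor([0.5, 1.5]))
encoded = json.dumps(out.to_dict())

# Decode again and rebuild the dataclass from the parsed values
decoded = json.loads(encoded)
restored = ScoreOutput(scores=torch.tensor(decoded["scores"]))

print(encoded)
print(torch.equal(out.scores, restored.scores))
```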

In [None]:
@dataclass
class ModelOutput:
    """Data class for structured model outputs."""
    logits: torch.Tensor
    probabilities: torch.Tensor
    predictions: torch.Tensor
    
    def to_dict(self) -> Dict:
        """Convert outputs to dictionary format."""
        return {
            "logits": self.logits.tolist(),
            "probabilities": self.probabilities.tolist(),
            "predictions": self.predictions.tolist()
        }
    
    def to_json(self) -> str:
        """Convert outputs to JSON string."""
        return json.dumps(self.to_dict())
    
    @classmethod
    def from_logits(cls, logits: torch.Tensor) -> 'ModelOutput':
        """Create a ModelOutput instance from logits."""
        probabilities = F.softmax(logits, dim=1)
        predictions = torch.argmax(probabilities, dim=1)
        return cls(logits=logits, probabilities=probabilities, predictions=predictions)

In [None]:
# Let's extend our SimpleNN to use structured outputs
class StructuredNN(SimpleNN):
    """Neural network with structured outputs."""
    
    def forward_structured(self, x: torch.Tensor) -> ModelOutput:
        """
        Forward pass with structured output.
        
        Args:
            x: Input tensor
            
        Returns:
            Structured model output
        """
        logits = super().forward(x)
        return ModelOutput.from_logits(logits)
    


In [None]:
# Create a model instance and test the structured outputs
structured_model = StructuredNN(input_size, hidden_size, output_size)
structured_output = structured_model.forward_structured(example_input)

print(f"Structured output example:\n")
print(f"Logits shape: {structured_output.logits.shape}")
print(f"Probabilities shape: {structured_output.probabilities.shape}")
print(f"Predictions shape: {structured_output.predictions.shape}")
print(f"\nJSON representation:\n{structured_output.to_json()[:200]}...")

### Exercise: Parsing structured output
Implement a function called `parse_output` that takes a `ModelOutput` object and returns a human-readable string. Your function should:
* Extract the most confident prediction
* Include its associated probability
* Format the information in a clear, readable way.

In [None]:
# TODO: Create a function called parse_output

## Part 3: Understanding torch.no_grad() and train/eval modes

Gradients measure how much a small change in each parameter changes the loss, and they guide the parameter updates during training.
Managing gradients efficiently is vital for training neural networks and optimizing inference.
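
As a minimal sketch of how a gradient guides an update, consider a single hypothetical parameter `w` and a squared-error loss. The values and the step size of 0.1 are assumptions for illustration.

```python
import torch

# One trainable parameter, one input, and one target (illustrative values)
w = torch.tensor(3.0, requires_grad=True)
x, target = torch.tensor(2.0), torch.tensor(4.0)

loss = (w * x - target) ** 2  # (3*2 - 4)^2 = 4
loss.backward()               # d(loss)/dw = 2 * (w*x - target) * x = 8
grad_value = w.grad.item()

# One manual gradient descent step; no_grad keeps the update out of the graph
with torch.no_grad():
    w -= 0.1 * w.grad

print(loss.item(), grad_value, w.item())
```

The gradient is positive, so the update decreases `w`, which moves `w * x` closer to the target.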

For efficient and correct model development, it's crucial to understand PyTorch's execution modes.
PyTorch models have two primary operating modes: training and evaluation. These modes control how certain layers (such as dropout and batch normalization) behave. Whether gradients are computed is controlled separately.

- `.train()` sets the model to training mode (the default).
- `.eval()` sets the model to evaluation mode.
- `torch.no_grad()` disables gradient calculation, saving memory during inference. It does not change the model's mode.


In [None]:
# Let's demonstrate different behaviors in train and eval modes
model = SimpleNN(input_size, hidden_size, output_size)
input_data = torch.randn(5, input_size)

# Training mode (default)
model.train()
print("Training mode (model.train()):")
print("-" * 30)

# Forward pass with gradients
output_train = model(input_data)
print(f"Output shape: {output_train.shape}")
print(f"Requires gradient: {output_train.requires_grad}")

# Make another pass, observe dropout
output_train2 = model(input_data)
diff_train = torch.sum(torch.abs(output_train - output_train2)).item()
print(f"Difference between two forward passes: {diff_train:.6f} (due to dropout)")

# Evaluation mode
model.eval()
print("\nEvaluation mode (model.eval()):")
print("-" * 30)

# Forward pass in eval mode
output_eval = model(input_data)
print(f"Output shape: {output_eval.shape}")
print(f"Requires gradient: {output_eval.requires_grad}")

# Make another pass, observe no dropout
output_eval2 = model(input_data)
diff_eval = torch.sum(torch.abs(output_eval - output_eval2)).item()
print(f"Difference between two forward passes: {diff_eval:.6f} (should be 0, no dropout)")

# Using torch.no_grad()
print("\nUsing torch.no_grad():")
print("-" * 30)
with torch.no_grad():
    output_no_grad = model(input_data)
    print(f"Output shape: {output_no_grad.shape}")
    print(f"Requires gradient: {output_no_grad.requires_grad}")

**Dropout Behavior in Different Modes**
Dropout is a regularization technique that randomly sets a portion of neurons to zero during training to prevent overfitting. However, this random behavior would be problematic during inference:

* In training mode (`model.train()`): Dropout randomly deactivates neurons based on the dropout probability. Each forward pass produces slightly different results due to this randomness. This intentional noise helps the model generalize better.
* In evaluation mode (`model.eval()`): Dropout is disabled (all neurons are active). The output is deterministic (the same input always produces the same output). No randomness is introduced into the prediction.

This difference is why `model.eval()` must be called before making predictions on new data to ensure consistent results.
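
The sketch below illustrates this on a standalone `nn.Dropout` layer. It also shows a detail not mentioned above: in training mode, PyTorch scales the surviving values by `1 / (1 - p)` so that the expected activation stays the same. The input of all ones and `p=0.5` are assumptions chosen to make the effect easy to see.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# survivors are scaled by 1 / (1 - p) = 2
dropout.train()
train_out = dropout(x)

# Evaluation mode: dropout is the identity function
dropout.eval()
eval_out = dropout(x)

print(sorted(train_out.unique().tolist()))
print(torch.equal(eval_out, x))
```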

### Exercise: Comparing memory usage with and without gradients

Training neural networks requires computing gradients for parameter updates. This process:
* Stores intermediate activations in memory for backpropagation
* Creates a computational graph that tracks operations
* Significantly increases memory usage as model size grows

When performing inference (making predictions), we don't need these gradients. Using `torch.no_grad()`:
* Disables gradient calculation
* Reduces memory usage by not storing intermediate activations
* Speeds up computation by not building the computational graph
* Prevents gradients from being tracked or accumulated by accident during evaluation

Task: Implement a function that measures and compares memory usage when processing a large batch with and without gradient calculation. When would you use which mode?

In [None]:
# TODO: Create a function that measures and compares the memory usage
# when processing a large batch with and without gradients

batch = torch.randn(10000, input_size)  # Large batch


# Hint: You can use the torch.cuda.memory_allocated() function if using GPU


## Part 4: Data Classes in Python
Hyperparameters are settings chosen before training, such as the learning rate, batch size, and number of epochs, that affect how the model learns and performs. Managing them with dataclasses keeps experiments organized and easier to reproduce.

Data classes provide a clean way to organize configuration and results in your machine learning projects. This section introduces Python's dataclass decorator and demonstrates how to use it effectively.

In [None]:
@dataclass
class TrainingConfig:
    """Data class for neural network training configuration."""
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 10
    weight_decay: float = 0.0001
    early_stopping_patience: int = 3
    device: str = "cuda" if torch.cuda.is_available() else "cpu"
    
    def __post_init__(self):
        """Validate configuration after initialization."""
        if self.learning_rate <= 0:
            raise ValueError("Learning rate must be positive")
        if self.batch_size <= 0:
            raise ValueError("Batch size must be positive")
        if self.early_stopping_patience < 0:
            raise ValueError("Patience must be non-negative")
            
    def to_dict(self) -> Dict:
        """Convert config to dictionary."""
        return {
            "learning_rate": self.learning_rate,
            "batch_size": self.batch_size,
            "epochs": self.epochs,
            "weight_decay": self.weight_decay,
            "early_stopping_patience": self.early_stopping_patience,
            "device": self.device
        }

The `TrainingConfig` class shows how to define default values for configuration parameters. We can validate configurations with `__post_init__` and add a method for converting configurations to other formats.
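
The standard library's `dataclasses` module also provides `asdict` and `replace`, which can help when converting configurations or deriving variants of them. The sketch below uses a hypothetical, reduced `SketchConfig` class so that it stays self-contained.

```python
from dataclasses import asdict, dataclass, replace


@dataclass
class SketchConfig:
    """Hypothetical config with two fields, for illustration only."""
    learning_rate: float = 0.001
    epochs: int = 10

    def __post_init__(self) -> None:
        if self.learning_rate <= 0:
            raise ValueError("Learning rate must be positive")


base = SketchConfig()

# asdict builds the dictionary from the declared fields
as_dict = asdict(base)

# replace creates a modified copy and runs __post_init__ again
variant = replace(base, learning_rate=0.01)

print(as_dict)
print(variant)
```

Because `replace` goes through `__init__`, an invalid variant such as `replace(base, learning_rate=-1.0)` raises the same `ValueError` as direct construction.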

In [None]:
# Create and use a configuration
config = TrainingConfig(learning_rate=0.01, epochs=20)
print(f"Training configuration:\n{config}")

# Modify and validate configurations
try:
    invalid_config = TrainingConfig(learning_rate=-0.1)
except ValueError as e:
    print(f"Validation works: {e}")

### Exercise: Create an ExperimentResults data class
Create a data class called `ExperimentResults` to store experimental results. Your implementation should:
* Store model name, training time, best epoch, and metrics
* Include methods for saving to and loading from JSON files
* Support optional test metrics (for when test data isn't available)

In [None]:

# TODO: Create a data class called ExperimentResults that stores:
# - model_name: str
# - training_time: float (in seconds)
# - best_epoch: int
# - train_metrics: Dict (containing accuracy, loss, etc.)
# - validation_metrics: Dict
# - test_metrics: Optional[Dict]
# Include methods for saving to and loading from JSON files

## Part 5: Advanced NN Concepts: Regularization and Activation Functions

As models grow in complexity, understanding regularization techniques and activation functions becomes crucial for achieving good performance. This section introduces advanced neural network concepts that will improve your models.

In [None]:
class AdvancedNN(nn.Module):
    """Neural network with various regularization techniques and activation functions."""
    
    def __init__(
        self, 
        input_size: int, 
        hidden_sizes: List[int], 
        output_size: int,
        dropout_rate: float = 0.2,
        activation: str = "relu",
        use_batch_norm: bool = True,
        weight_constraint: Optional[float] = None
    ):
        """
        Initialize the advanced neural network.
        
        Args:
            input_size: Number of input features
            hidden_sizes: List of hidden layer sizes
            output_size: Number of output classes
            dropout_rate: Dropout probability
            activation: Activation function name ('relu', 'leaky_relu', 'elu', 'gelu', etc.)
            use_batch_norm: Whether to use batch normalization
            weight_constraint: Maximum norm for weight constraint (L2 norm)
        """
        super(AdvancedNN, self).__init__()
        
        self.layers = nn.ModuleList()
        self.batch_norms = nn.ModuleList()
        self.dropouts = nn.ModuleList()
        self.weight_constraint = weight_constraint
        
        # Input layer
        self.layers.append(nn.Linear(input_size, hidden_sizes[0]))
        if use_batch_norm:
            self.batch_norms.append(nn.BatchNorm1d(hidden_sizes[0]))
        
        # Hidden layers
        for i in range(len(hidden_sizes) - 1):
            self.layers.append(nn.Linear(hidden_sizes[i], hidden_sizes[i+1]))
            if use_batch_norm:
                self.batch_norms.append(nn.BatchNorm1d(hidden_sizes[i+1]))
            self.dropouts.append(nn.Dropout(dropout_rate))
        
        # Output layer
        self.layers.append(nn.Linear(hidden_sizes[-1], output_size))
        # Dropout for the last hidden layer (the output layer itself gets none)
        self.dropouts.append(nn.Dropout(dropout_rate))
        
        # Activation function
        if activation == "relu":
            self.activation = F.relu
        elif activation == "leaky_relu":
            self.activation = F.leaky_relu
        elif activation == "elu":
            self.activation = F.elu
        elif activation == "gelu":
            self.activation = F.gelu
        else:
            raise ValueError(f"Unsupported activation function: {activation}")
            
        # Flag for batch normalization
        self.use_batch_norm = use_batch_norm
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass through the network."""
        # Apply each layer with activation, batch norm, and dropout
        for i in range(len(self.layers) - 1):
            x = self.layers[i](x)
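            if self.use_batch_norm:
                # In training mode, BatchNorm1d cannot compute batch statistics
                # from a single sample, so skip it for a batch size of 1
                if x.size(0) > 1 or not self.training:
                    x = self.batch_norms[i](x)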
            x = self.activation(x)
            x = self.dropouts[i](x)
            
        # Apply weight constraints if specified
        if self.weight_constraint is not None and self.training:
            for layer in self.layers:
                w = layer.weight.data
                norm = torch.norm(w, 2, dim=1, keepdim=True)
                desired = torch.clamp(norm, 0, self.weight_constraint)
                w = w * (desired / (1e-7 + norm))
                layer.weight.data = w
        
        # Output layer (no activation or dropout)
        x = self.layers[-1](x)
        return x

# Create a simple test for our advanced model
hidden_sizes = [32, 16]
advanced_model = AdvancedNN(
    input_size=10, 
    hidden_sizes=hidden_sizes, 
    output_size=2, 
    activation="gelu",
    weight_constraint=3.0
)

print(f"Advanced model architecture:\n{advanced_model}")

# Test forward pass
test_input = torch.randn(8, input_size)
test_output = advanced_model(test_input)
print(f"\nTest input shape: {test_input.shape}")
print(f"Test output shape: {test_output.shape}")

With this class, we added configurable activation functions (ReLU, Leaky ReLU, ELU, GELU) and regularization techniques (dropout, batch normalization, weight constraints).
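
To see how these activation functions differ, the following sketch applies each of them, with PyTorch's default settings, to the same three illustrative inputs. The dictionary names are assumptions for this example.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 2.0])

# The same functions that AdvancedNN selects by name
activations = {
    "relu": F.relu,
    "leaky_relu": F.leaky_relu,
    "elu": F.elu,
    "gelu": F.gelu,
}
results = {name: fn(x) for name, fn in activations.items()}

for name, values in results.items():
    print(f"{name:>10}: {values.tolist()}")
```

The functions agree for the positive input and differ mainly in how they treat negative inputs: ReLU maps them to zero, while the others let a small negative signal through.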

### Exercise: Compare activation functions and regularization
Implement a function that trains the same model architecture with different activation functions and regularization settings, then compares the results. Your function should:
* Test at least 3 different combinations of activations and regularization
* Report training time, convergence speed, and final accuracy
* Visualize the differences in performance

In [None]:

# TODO: Create a function that trains the same model architecture with different activation functions and regularization settings, then compares the results

# Hands-On: Text Classification with Neural Networks

In this part, we'll apply the concepts from Part 1 to a practical text classification task. We'll build a sentiment analyzer for movie reviews, taking you through the entire machine learning workflow from data preprocessing to model evaluation.
We will use the IMDb movie review dataset, which is hosted on the [Hugging Face](https://huggingface.co/datasets/stanfordnlp/imdb) dataset hub.

In [None]:
import pandas as pd

print("\nLoading and preprocessing example text data...")

splits = {'train': 'plain_text/train-00000-of-00001.parquet', 'test': 'plain_text/test-00000-of-00001.parquet', 'unsupervised': 'plain_text/unsupervised-00000-of-00001.parquet'}
df = pd.read_parquet("hf://datasets/stanfordnlp/imdb/" + splits["train"]).sample(1000, random_state=42)

reviews = df['text'].tolist()
labels = df['label'].tolist()

print(f"Loaded {len(reviews)} training samples and {len(labels)} labels.")
print(f"Sample text: {reviews[0]}")
print(f"Sample label: {labels[0]}")

[StanfordNLP/imdb](https://huggingface.co/datasets/stanfordnlp/imdb) is a large movie review dataset for binary sentiment classification. It provides 50,000 highly polar labeled movie reviews, split evenly into a training and a test set; we randomly sample 1,000 reviews from the training split. A label of "0" denotes negative sentiment, and "1" denotes positive sentiment.

## Part 6: Data Preprocessing and Feature Engineering

Before training any neural network, proper data preprocessing is essential. This section demonstrates how to prepare text data for a classification task.

In [None]:
@dataclass
class TextDataset:
    """Dataset class for text classification."""
    texts: List[str]
    labels: List[int]
    vectorizer: Optional[CountVectorizer] = None
    is_fitted: bool = False
    
    def preprocess(self, max_features: int = 1000) -> None:
        """
        Preprocess the text data by vectorizing it.
        
        Args:
            max_features: Maximum number of features (vocabulary size)
        """
        if not self.vectorizer:
            self.vectorizer = CountVectorizer(max_features=max_features)
            self.vectorizer.fit(self.texts)
            self.is_fitted = True
    
    def get_features(self) -> np.ndarray:
        """Convert texts to feature vectors."""
        if not self.is_fitted:
            raise ValueError("Dataset has not been preprocessed. Call preprocess() first.")
        return self.vectorizer.transform(self.texts).toarray()
    
    def get_tensor_dataset(self) -> Tuple[torch.Tensor, torch.Tensor]:
        """Get PyTorch tensors for features and labels."""
        features = self.get_features()
        return (
            torch.tensor(features, dtype=torch.float32),
            torch.tensor(self.labels, dtype=torch.long)
        )
    
    def get_vocabulary_size(self) -> int:
        """Get the size of the vocabulary."""
        if not self.is_fitted:
            raise ValueError("Dataset has not been preprocessed. Call preprocess() first.")
        return len(self.vectorizer.get_feature_names_out())
    
    def transform_new_texts(self, texts: List[str]) -> np.ndarray:
        """Transform new texts using the fitted vectorizer."""
        if not self.is_fitted:
            raise ValueError("Dataset has not been preprocessed. Call preprocess() first.")
        return self.vectorizer.transform(texts).toarray()

# Create our dataset and preprocess it
dataset = TextDataset(reviews, labels)
dataset.preprocess(max_features=100)  # Limit features for the example

print(f"Vocabulary size: {dataset.get_vocabulary_size()}")
features = dataset.get_features()
print(f"Feature matrix shape: {features.shape}")

# Show a sample of the vectorized data
sample_idx = 0
sample_text = reviews[sample_idx]
sample_vector = features[sample_idx]
print(f"\nSample text: '{sample_text}'")
print(f"Vectorized (first 10 features): {sample_vector[:10]}...")

The `TextDataset` class encapsulates dataset operations in a reusable class. As we cannot work with the reviews in their string format right away, we transform them to numeric vectors using `CountVectorizer`. We convert the raw data into PyTorch tensors.

## Part 7: Train/Validation/Test Splits

Properly splitting your data is crucial for reliable model evaluation. This section covers how to create train, validation, and test sets while maintaining consistent preprocessing.

In [None]:

def split_dataset(
    dataset: TextDataset, 
    val_size: float = 0.15, 
    test_size: float = 0.15,
    random_state: int = 42
) -> Tuple[TextDataset, TextDataset, TextDataset]:
    """
    Split a dataset into train, validation, and test sets.
    
    Args:
        dataset: The original dataset
        val_size: Proportion for validation
        test_size: Proportion for testing
        random_state: Random seed for reproducibility
        
    Returns:
        Tuple of (train_dataset, val_dataset, test_dataset)
    """
    # First split into train+val and test
    texts_train_val, texts_test, labels_train_val, labels_test = train_test_split(
        dataset.texts, dataset.labels, test_size=test_size, random_state=random_state
    )
    
    # Then split train+val into train and val
    adjusted_val_size = val_size / (1 - test_size)
    texts_train, texts_val, labels_train, labels_val = train_test_split(
        texts_train_val, labels_train_val, test_size=adjusted_val_size, random_state=random_state
    )
    
    # Create datasets with the same preprocessing
    train_dataset = TextDataset(texts_train, labels_train, dataset.vectorizer, dataset.is_fitted)
    val_dataset = TextDataset(texts_val, labels_val, dataset.vectorizer, dataset.is_fitted)
    test_dataset = TextDataset(texts_test, labels_test, dataset.vectorizer, dataset.is_fitted)
    
    return train_dataset, val_dataset, test_dataset

# Split our dataset
train_dataset, val_dataset, test_dataset = split_dataset(dataset)

# Create PyTorch tensors
X_train, y_train = train_dataset.get_tensor_dataset()
X_val, y_val = val_dataset.get_tensor_dataset()
X_test, y_test = test_dataset.get_tensor_dataset()

print(f"Training set: {len(train_dataset.texts)} examples")
print(f"Validation set: {len(val_dataset.texts)} examples")
print(f"Test set: {len(test_dataset.texts)} examples")
print(f"\nFeature tensor shapes:")
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_val: {X_val.shape}, y_val: {y_val.shape}")
print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")

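We split the data randomly into train, validation, and test sets. Note that `train_test_split` only stratifies by label when its `stratify` argument is set, which `split_dataset` does not do. Because the second split operates on the remaining train+val portion, the validation proportion is rescaled to `val_size / (1 - test_size)`. Each split is then wrapped in a dataset object that shares the same fitted vectorizer.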
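
`train_test_split` accepts a `stratify` argument that keeps the label proportions the same in every split. The sketch below compares a plain random split with a stratified one on hypothetical, imbalanced toy labels.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 80 negative and 20 positive labels
toy_labels = [0] * 80 + [1] * 20
toy_texts = [f"review {i}" for i in range(100)]

# Plain random split: class proportions in the test set may drift
_, _, _, y_test_plain = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=0
)

# Stratified split: the 80/20 ratio is preserved in the test set
_, _, _, y_test_strat = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=0, stratify=toy_labels
)

print(Counter(y_test_plain))
print(Counter(y_test_strat))
```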

### Exercise: Implement a PyTorch DataLoader
Create a function that returns DataLoader objects for training, validation, and test sets. Your implementation should:
* Implement proper batching for efficient training
* Shuffle training data to improve convergence
* Include an option for balanced sampling to handle class imbalance

In [None]:
# TODO: Create a function that returns DataLoader objects for training, validation and test sets
# Make sure to implement proper batching and shuffling (for training only)
# Include an optional parameter for balanced sampling based on class distribution



## Part 8: Loss Functions and Optimization

Choosing appropriate loss functions and optimizers is critical for effective model training. This section provides a framework for training neural networks in PyTorch.
The `train_model` function demonstrates:
* Setting up loss functions and optimizers
* Implementing the training loop
* Tracking training and validation metrics
* Early stopping to prevent overfitting
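
One detail worth knowing when setting up the loss function: `nn.CrossEntropyLoss` expects raw logits and applies log-softmax internally, so the model should not apply softmax itself before the loss. The sketch below checks this against a manual computation on illustrative values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative logits for 2 samples and 2 classes, with their true labels
logits = torch.tensor([[2.0, 0.5], [0.1, 1.2]])
targets = torch.tensor([0, 1])

# Built-in loss on raw logits
criterion = nn.CrossEntropyLoss()
loss_builtin = criterion(logits, targets)

# Manual version: mean negative log-probability of the true class
log_probs = F.log_softmax(logits, dim=1)
loss_manual = -log_probs[torch.arange(len(targets)), targets].mean()

print(loss_builtin.item(), loss_manual.item())
```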

In [None]:

@dataclass
class TrainingResult:
    """Data class for tracking training results."""
    train_losses: List[float]
    val_losses: List[float]
    train_accuracies: List[float]
    val_accuracies: List[float]
    best_epoch: int
    best_val_accuracy: float
    training_time: float

def train_model(
    model: nn.Module,
    X_train: torch.Tensor,
    y_train: torch.Tensor,
    X_val: torch.Tensor,
    y_val: torch.Tensor,
    config: TrainingConfig
) -> TrainingResult:
    """
    Train a PyTorch model.
    
    Args:
        model: The neural network model
        X_train: Training features
        y_train: Training labels
        X_val: Validation features
        y_val: Validation labels
        config: Training configuration
        
    Returns:
        Training results
    """
    device = torch.device(config.device)
    model = model.to(device)
    
    # Move data to device
    X_train = X_train.to(device)
    y_train = y_train.to(device)
    X_val = X_val.to(device)
    y_val = y_val.to(device)
    
    # Define loss function and optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(
        model.parameters(), 
        lr=config.learning_rate, 
        weight_decay=config.weight_decay
    )
    
    # Training loop
    train_losses = []
    val_losses = []
    train_accuracies = []
    val_accuracies = []
    
    best_val_accuracy = 0.0
    best_epoch = 0
    patience_counter = 0
    
    start_time = datetime.now()
    
    for epoch in range(config.epochs):
        # Training phase
        model.train()
        outputs = model(X_train)
        loss = criterion(outputs, y_train)
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        # Calculate training accuracy
        _, train_preds = torch.max(outputs, 1)
        train_acc = (train_preds == y_train).float().mean().item()
        
        # Validation phase
        model.eval()
        with torch.no_grad():
            val_outputs = model(X_val)
            val_loss = criterion(val_outputs, y_val).item()
            
            _, val_preds = torch.max(val_outputs, 1)
            val_acc = (val_preds == y_val).float().mean().item()
        
        # Store metrics
        train_losses.append(loss.item())
        val_losses.append(val_loss)
        train_accuracies.append(train_acc)
        val_accuracies.append(val_acc)
        
        # Check for improvement
        if val_acc > best_val_accuracy:
            best_val_accuracy = val_acc
            best_epoch = epoch
            patience_counter = 0
        else:
            patience_counter += 1
            
        # Early stopping
        if patience_counter >= config.early_stopping_patience:
            print(f"Early stopping at epoch {epoch+1}")
            break
            
        # Print progress
        print(f"Epoch {epoch+1}/{config.epochs} | "
              f"Train Loss: {loss.item():.4f} | "
              f"Train Acc: {train_acc:.4f} | "
              f"Val Loss: {val_loss:.4f} | "
              f"Val Acc: {val_acc:.4f}")
    
    training_time = (datetime.now() - start_time).total_seconds()
    
    return TrainingResult(
        train_losses=train_losses,
        val_losses=val_losses,
        train_accuracies=train_accuracies,
        val_accuracies=val_accuracies,
        best_epoch=best_epoch,
        best_val_accuracy=best_val_accuracy,
        training_time=training_time
    )

# Create a text classification model
input_size = dataset.get_vocabulary_size()
hidden_size = 32
output_size = 2  # Binary classification
text_model = SimpleNN(input_size, hidden_size, output_size)

# Define training configuration
training_config = TrainingConfig(
    learning_rate=0.01,
    batch_size=4,  # Note: train_model performs full-batch updates and does not use this value
    epochs=20,
    weight_decay=0.0001,
    early_stopping_patience=5
)

print(f"Training model on {input_size} features...")
training_result = train_model(text_model, X_train, y_train, X_val, y_val, training_config)


## Part 9: Visualizing Training Progress

Monitoring training progress helps identify issues early and tune hyperparameters effectively. This section shows how to visualize training metrics and mark the best epoch found during training.

In [None]:

def plot_training_progress(result: TrainingResult) -> None:
    """
    Plot training and validation metrics.
    
    Args:
        result: Training results
    """
    epochs = range(1, len(result.train_losses) + 1)
    
    plt.figure(figsize=(12, 5))
    
    # Plot losses
    plt.subplot(1, 2, 1)
    plt.plot(epochs, result.train_losses, 'b-', label='Training Loss')
    plt.plot(epochs, result.val_losses, 'r-', label='Validation Loss')
    plt.axvline(x=result.best_epoch + 1, color='g', linestyle='--', label='Best Epoch')
    plt.title('Training and Validation Loss')
    plt.xlabel('Epochs')
    plt.ylabel('Loss')
    plt.legend()
    
    # Plot accuracies
    plt.subplot(1, 2, 2)
    plt.plot(epochs, result.train_accuracies, 'b-', label='Training Accuracy')
    plt.plot(epochs, result.val_accuracies, 'r-', label='Validation Accuracy')
    plt.axvline(x=result.best_epoch + 1, color='g', linestyle='--', label='Best Epoch')
    plt.title('Training and Validation Accuracy')
    plt.xlabel('Epochs')
    plt.ylabel('Accuracy')
    plt.legend()
    
    plt.tight_layout()
    plt.show()

# Plot training progress
plot_training_progress(training_result)

print(f"Best validation accuracy: {training_result.best_val_accuracy:.4f} at epoch {training_result.best_epoch+1}")
print(f"Training time: {training_result.training_time:.2f} seconds")

The function plots loss and accuracy per epoch and marks the best epoch with a vertical line. This makes it easy to pick the best model checkpoint and to spot overfitting (validation loss rising while training loss keeps falling) and other training issues.
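The same check can be made numerically instead of by eye. The following is a minimal sketch, not part of the notebook's pipeline: the helper name `find_overfitting_onset`, the `window` parameter, and the demo loss values are all hypothetical, and the rule used (training loss falling while validation loss rises for several consecutive epochs) is only one simple heuristic.

```python
from typing import List, Optional


def find_overfitting_onset(
    train_losses: List[float],
    val_losses: List[float],
    window: int = 3,
) -> Optional[int]:
    """
    Return the 0-based epoch index that starts the first run of `window`
    consecutive epochs in which training loss decreases while validation
    loss increases. Return None if no such run exists.
    """
    run_start = None
    run_length = 0
    for epoch in range(1, len(train_losses)):
        train_improving = train_losses[epoch] < train_losses[epoch - 1]
        val_worsening = val_losses[epoch] > val_losses[epoch - 1]
        if train_improving and val_worsening:
            if run_length == 0:
                run_start = epoch
            run_length += 1
            if run_length >= window:
                return run_start
        else:
            # The run is broken, so start counting again
            run_length = 0
    return None


# Made-up curves for illustration only (not results from this notebook)
demo_train = [0.70, 0.60, 0.50, 0.42, 0.35, 0.30, 0.26]
demo_val = [0.72, 0.65, 0.60, 0.58, 0.61, 0.64, 0.68]

onset = find_overfitting_onset(demo_train, demo_val)
# Here "best" means lowest validation loss; the notebook's train_model
# tracks the best epoch by validation accuracy instead
best_by_val_loss = min(range(len(demo_val)), key=demo_val.__getitem__)
print(f"Lowest validation loss at epoch index {best_by_val_loss}, "
      f"divergence starts at epoch index {onset}")
```

With real results you could pass `training_result.train_losses` and `training_result.val_losses` to the helper. On a dataset as small as ours the curves are noisy, so treat the result as a hint rather than a verdict.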

### Exercise: Implement learning rate scheduling
Learning rate scheduling is an important technique for achieving better convergence, especially in deeper networks.
Modify the train_model function to include a learning rate scheduler. Your implementation should:
* Implement at least one scheduling strategy (StepLR, ReduceLROnPlateau, or CosineAnnealingLR)
* Track and plot learning rate changes over epochs
* Compare training with and without scheduling
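Before wiring a scheduler into train_model, it can help to see what two of these strategies do to the learning rate. The sketch below is plain Python and deliberately does not solve the exercise: it only evaluates the closed-form schedules that StepLR and CosineAnnealingLR are documented to follow. The function names and the hyperparameter values (`step_size=5`, `gamma=0.5`, `t_max=20`) are assumptions chosen for illustration.

```python
import math


def step_lr(base_lr: float, epoch: int, step_size: int, gamma: float) -> float:
    """Step decay: multiply the learning rate by gamma every step_size epochs."""
    return base_lr * gamma ** (epoch // step_size)


def cosine_annealing_lr(
    base_lr: float, epoch: int, t_max: int, eta_min: float = 0.0
) -> float:
    """Cosine annealing: decay smoothly from base_lr to eta_min over t_max epochs."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + cosine)


base_lr = 0.01
num_epochs = 20

step_schedule = [step_lr(base_lr, e, step_size=5, gamma=0.5) for e in range(num_epochs)]
cosine_schedule = [cosine_annealing_lr(base_lr, e, t_max=num_epochs) for e in range(num_epochs)]

for e in (0, 5, 10, 15):
    print(f"epoch {e:2d} | step: {step_schedule[e]:.5f} | cosine: {cosine_schedule[e]:.5f}")
```

In PyTorch itself, the scheduler classes in torch.optim.lr_scheduler wrap an optimizer and are advanced by calling their step() method, typically once per epoch. ReduceLROnPlateau differs from the other two in that its step() expects the monitored metric (for example the validation loss) as an argument.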

In [None]:
# TODO: Modify the train_model function to include a learning rate scheduler


## Part 10: Reproducibility and Experiment Tracking


Maintaining reproducibility and tracking experiments is crucial for research and development. This section introduces a framework for experiment configuration and tracking.
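The core idea is a save/load round trip: a configuration written to disk should come back identical. As a standalone illustration, here is a minimal sketch using only the standard library. The classes `DemoTrainingParams` and `DemoExperiment` and the helper functions are hypothetical stand-ins, not the classes used in this notebook, and `dataclasses.asdict` is used here as one convenient way to convert nested dataclasses to dictionaries.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass


@dataclass
class DemoTrainingParams:
    """Hypothetical stand-in for a training configuration."""
    learning_rate: float = 0.01
    epochs: int = 20


@dataclass
class DemoExperiment:
    """Hypothetical stand-in for an experiment configuration."""
    name: str
    training: DemoTrainingParams
    seed: int = 42


def save_experiment(config: DemoExperiment, filepath: str) -> None:
    # asdict converts nested dataclasses recursively into plain dictionaries
    with open(filepath, "w") as f:
        json.dump(asdict(config), f, indent=2)


def load_experiment(filepath: str) -> DemoExperiment:
    with open(filepath, "r") as f:
        raw = json.load(f)
    # JSON only knows dictionaries, so the nested dataclass is rebuilt by hand
    raw["training"] = DemoTrainingParams(**raw["training"])
    return DemoExperiment(**raw)


original = DemoExperiment(name="demo_run", training=DemoTrainingParams(learning_rate=0.001))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "config.json")
    save_experiment(original, path)
    restored = load_experiment(path)

# Dataclasses compare field by field, so this checks the full round trip
print("Round trip preserved the configuration:", restored == original)
```

Checking `restored == original` is a cheap test worth keeping around: it fails as soon as a field is added to the dataclass but forgotten in the save or load code.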

In [None]:

@dataclass
class ExperimentConfig:
    """Data class for experiment configuration."""
    experiment_name: str
    model_type: str
    model_params: Dict
    training_params: TrainingConfig
    dataset_params: Dict
    seed: int = 42
    
    def save(self, filepath: str) -> None:
        """Save experiment configuration to a JSON file."""
        with open(filepath, 'w') as f:
            # Convert all dataclasses to dictionaries
            config_dict = {
                'experiment_name': self.experiment_name,
                'model_type': self.model_type,
                'model_params': self.model_params,
                'training_params': self.training_params.to_dict(),
                'dataset_params': self.dataset_params,
                'seed': self.seed
            }
            json.dump(config_dict, f, indent=2)
    
    @classmethod
    def load(cls, filepath: str) -> 'ExperimentConfig':
        """Load experiment configuration from a JSON file."""
        with open(filepath, 'r') as f:
            config_dict = json.load(f)
            # Convert dictionaries back to dataclasses
            training_params = TrainingConfig(**config_dict['training_params'])
            return cls(
                experiment_name=config_dict['experiment_name'],
                model_type=config_dict['model_type'],
                model_params=config_dict['model_params'],
                training_params=training_params,
                dataset_params=config_dict['dataset_params'],
                seed=config_dict['seed']
            )

# Example: Creating an experiment configuration
experiment_config = ExperimentConfig(
    experiment_name="text_classification_bow",
    model_type="SimpleNN",
    model_params={
        "input_size": input_size,
        "hidden_size": hidden_size,
        "output_size": output_size
    },
    training_params=training_config,
    dataset_params={
        "max_features": 100,
        "train_size": len(train_dataset.texts),
        "val_size": len(val_dataset.texts),
        "test_size": len(test_dataset.texts)
    }
)

# Save experiment configuration
os.makedirs("experiments", exist_ok=True)
config_path = os.path.join("experiments", "experiment_config.json")
experiment_config.save(config_path)
print(f"Saved experiment configuration to {config_path}")

def log_metrics(result: TrainingResult, log_path: str) -> None:
    """
    Log training metrics to a file.
    
    Args:
        result: Training results
        log_path: Path to log file
    """
    with open(log_path, 'w') as f:
        metrics = {
            "train_losses": result.train_losses,
            "val_losses": result.val_losses,
            "train_accuracies": result.train_accuracies,
            "val_accuracies": result.val_accuracies,
            "best_epoch": result.best_epoch,
            "best_val_accuracy": result.best_val_accuracy,
            "training_time": result.training_time,
            "timestamp": datetime.now().isoformat()
        }
        json.dump(metrics, f, indent=2)

# Log metrics from our experiment
log_path = os.path.join("experiments", "metrics.json")
log_metrics(training_result, log_path)

Our basic implementation covers the fundamentals of tracking: it records the configuration and the metrics of a run. For larger projects and team environments, external experiment tracking tools such as Weights & Biases (wandb), MLflow, or TensorBoard offer significant advantages.

## Part 11: Model Saving and Loading


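This section shows how to save model weights together with metadata about the training conditions, and how to load them again for inference. Note that we save only the weights (the model's state_dict), not the architecture, so the model class and its constructor arguments must be available at load time.
As your models become more complex and training runs take longer, saving checkpoints with metadata becomes essential for reproducibility, comparison, and collaboration.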
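To make the point about weights versus architecture concrete, here is a minimal sketch of a state_dict round trip. It assumes a tiny `nn.Linear` layer as a stand-in model and writes to an in-memory buffer instead of a file; the metadata content is made up for the demo.

```python
import io

import torch
import torch.nn as nn

# A tiny stand-in model; any nn.Module works the same way
original_model = nn.Linear(4, 2)

# Save weights plus metadata into an in-memory buffer (a file path works too)
buffer = io.BytesIO()
torch.save(
    {"model_state_dict": original_model.state_dict(), "metadata": {"note": "demo"}},
    buffer,
)

# The checkpoint holds no architecture, so we must rebuild the same
# architecture ourselves before loading the weights into it
buffer.seek(0)
checkpoint = torch.load(buffer, map_location="cpu")
restored_model = nn.Linear(4, 2)
restored_model.load_state_dict(checkpoint["model_state_dict"])

# Both models should now produce identical outputs for the same input
x = torch.randn(3, 4)
with torch.no_grad():
    original_output = original_model(x)
    restored_output = restored_model(x)

print("Outputs match:", torch.equal(original_output, restored_output))
print("Metadata:", checkpoint["metadata"])
```

If `restored_model` were built with different layer sizes, load_state_dict would raise an error about mismatched shapes, which is why the constructor arguments are worth storing alongside the weights.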

In [None]:

def save_model(model: nn.Module, filepath: str, metadata: Optional[Dict] = None) -> None:
    """
    Save model weights and optional metadata.
    
    Args:
        model: PyTorch model
        filepath: Path to save the model
        metadata: Optional metadata dictionary
    """
    save_dict = {
        'model_state_dict': model.state_dict(),
        'metadata': metadata or {}
    }
    torch.save(save_dict, filepath)
    print(f"Model saved to {filepath}")

from typing import Type  # for annotating a class (not an instance) below

def load_model(model_class: Type[nn.Module], filepath: str, **model_params) -> Tuple[nn.Module, Dict]:
    """
    Load model weights and metadata.
    
    Args:
        model_class: PyTorch model class
        filepath: Path to the saved model
        **model_params: Parameters to initialize the model
        
    Returns:
        Tuple of (model, metadata)
    """
    # Create a new model instance
    model = model_class(**model_params)
    
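    # Load the saved state; map_location="cpu" lets a checkpoint saved on a GPU
    # be loaded on a machine without one
    checkpoint = torch.load(filepath, map_location="cpu")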
    model.load_state_dict(checkpoint['model_state_dict'])
    
    return model, checkpoint['metadata']

# Save our trained model
model_metadata = {
    "accuracy": training_result.best_val_accuracy,
    "epoch": training_result.best_epoch,
    "timestamp": datetime.now().isoformat(),
    "vocabulary_size": input_size
}

model_path = os.path.join("experiments", "text_classifier.pt")
save_model(text_model, model_path, model_metadata)

# Load the model for inference
loaded_model, metadata = load_model(
    SimpleNN, 
    model_path, 
    input_size=input_size, 
    hidden_size=hidden_size, 
    output_size=output_size
)

print(f"Loaded model trained to accuracy: {metadata['accuracy']:.4f}")

## Part 12: Running Inference

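Finally, we'll demonstrate how to use a trained model for inference on new data. The inference pipeline has three steps: vectorize the text with the same fitted vectorizer used for training, run a forward pass in evaluation mode without tracking gradients, and convert the model's raw outputs into class probabilities.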

In [None]:

def predict_sentiment(
    model: nn.Module, 
    vectorizer: CountVectorizer, 
    text: str,
    device: str = "cpu"
) -> Dict:
    """
    Predict sentiment of a text.
    
    Args:
        model: Trained model
        vectorizer: Fitted CountVectorizer
        text: Input text
        device: Device to run inference on
        
    Returns:
        Dictionary with prediction results
    """
    # Vectorize the input text
    features = vectorizer.transform([text]).toarray()
    
    # Convert to tensor
    X = torch.tensor(features, dtype=torch.float32).to(device)
    
    # Move the model to the same device as the input, then set eval mode
    model = model.to(device)
    model.eval()
    
    # Make prediction
    with torch.no_grad():
        outputs = model(X)
        probabilities = F.softmax(outputs, dim=1)
        predicted_class = torch.argmax(probabilities, dim=1).item()
    
    # Get class probabilities
    class_probs = probabilities[0].cpu().numpy()
    
    return {
        "text": text,
        "predicted_class": predicted_class,
        "positive_probability": float(class_probs[1]),
        "negative_probability": float(class_probs[0])
    }

# Test the model with some new reviews
test_reviews = [
    "The movie was fantastic and I would watch it again. Highly recommended!",
    "A complete waste of time with terrible acting, I hated it and would not recommend it",
    "Not great, not terrible, just an average film"
]

print("Running inference on new reviews:")
for review in test_reviews:
    result = predict_sentiment(loaded_model, dataset.vectorizer, review)
    sentiment = "Positive" if result["predicted_class"] == 1 else "Negative"
    print(f"\nText: '{result['text']}'")
    print(f"Prediction: {sentiment} (Confidence: {max(result['positive_probability'], result['negative_probability']):.4f})")
    print(f"Probabilities: Positive = {result['positive_probability']:.4f}, Negative = {result['negative_probability']:.4f}")


The above code demonstrates:
* Preprocessing new data consistently
* Running inference efficiently
* Interpreting model outputs
* Checking predictions qualitatively on a few unseen examples
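To make the interpretation step explicit, the following minimal sketch reproduces in plain Python what the softmax and argmax calls do to the model's raw outputs (logits). The helper names `softmax` and `interpret` and the example logits are hypothetical; the label order follows the notebook's convention that class 0 is negative and class 1 is positive.

```python
import math
from typing import Dict, List, Sequence


def softmax(logits: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() from overflowing on large logits
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def interpret(
    logits: List[float],
    labels: Sequence[str] = ("Negative", "Positive"),
) -> Dict:
    """Map logits to a predicted label, its confidence, and all probabilities."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {
        "label": labels[best],
        "confidence": probs[best],
        "probabilities": dict(zip(labels, probs)),
    }


# Made-up logits: one clear-cut case and one close call
confident = interpret([-2.0, 3.0])
uncertain = interpret([0.1, 0.0])

for name, result in (("confident", confident), ("uncertain", uncertain)):
    print(f"{name}: {result['label']} ({result['confidence']:.4f})")
```

A confidence close to 0.5 in a binary task, as in the second case, means the model barely prefers one class. For a neutral review like the third test example above, that is the most a two-class model can express.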

## Resources and Best Practices

In this notebook, we've covered the fundamentals of neural networks with PyTorch, from basic architecture to training, evaluation, experiment tracking, model saving and loading, and inference. These concepts form the foundation for working with more complex models like LLMs in future exercises.

### Recommended Resources

* [PyTorch Documentation](https://pytorch.org/docs/stable/index.html) - Official documentation for PyTorch
* [Deep Learning with PyTorch](https://isip.piconepress.com/courses/temple/ece_4822/resources/books/Deep-Learning-with-PyTorch.pdf) - Comprehensive book on PyTorch
* [Dive into Deep Learning](https://d2l.ai/) - Interactive deep learning book with code examples
* [Weights & Biases](https://wandb.ai/site) - Tool for experiment tracking and visualization

### Self-Assessment
Complete the following self-assessment to gauge your understanding:

1. What is the difference between model.train() and model.eval() modes?
2. Why is torch.no_grad() important during inference?
3. What are the benefits of using data classes in ML projects?
4. What steps are necessary to ensure reproducibility in ML experiments?
5. Why is proper train/validation/test splitting important?
6. How would you modify our text classifier to handle longer texts? (Hint: think about the limitations of Bag-of-Words.)

Bonus: Try implementing a simple Transformer model using PyTorch's nn.TransformerEncoder.