# Notebook 2: Your First Model - A Multilayer Perceptron (MLP)

## Data Loading and Visualization

### The MNIST Dataset

MNIST (Modified National Institute of Standards and Technology) is one of the most famous datasets in machine learning. It contains 70,000 grayscale images of handwritten digits (0-9), each 28x28 pixels. The dataset is split into 60,000 training images and 10,000 test images.

### torchvision.datasets

PyTorch's `torchvision.datasets` module provides easy access to many popular datasets, including MNIST. It handles downloading, extracting, and organizing the data for you.

### DataLoader

The `DataLoader` is a crucial component that groups your data into batches, shuffles it (for training), and makes it easy to loop over. Without it, you'd have to manually manage batches, shuffling, and iteration—tasks that `DataLoader` handles efficiently.
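
To make the batching behavior concrete, here is a minimal sketch that wraps ten made-up samples in a `TensorDataset` and iterates over them in batches of four. The toy data and the names `toy_dataset` and `toy_loader` are illustrative assumptions, not part of this notebook's MNIST pipeline.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in data: 10 samples with 3 features each, plus integer labels.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
toy_dataset = TensorDataset(features, labels)

# shuffle=False keeps the order fixed so the batches are easy to inspect.
toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=False)

batch_sizes = [xb.shape[0] for xb, yb in toy_loader]
print("Number of batches:", len(toy_loader))
print("Samples per batch:", batch_sizes)
```

The final batch is smaller because 10 is not a multiple of 4. Passing `drop_last=True` to `DataLoader` would discard that incomplete batch instead.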

### transforms.ToTensor()

The `transforms.ToTensor()` step converts a PIL image or NumPy array into a PyTorch tensor. For the 8-bit images used here, it also scales pixel values from [0, 255] to [0.0, 1.0] and moves the channel dimension to the front, so a 28x28 grayscale image becomes a float tensor of shape `(1, 28, 28)`.

In [None]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Define transforms
transform = transforms.ToTensor()

# Download and create datasets
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

test_dataset = datasets.MNIST(
    root='./data',
    train=False,
    download=True,
    transform=transform
)

# Create DataLoaders
batch_size = 64
train_dataloader = DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=True
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=batch_size,
    shuffle=False
)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Batch size: {batch_size}")


## Inspecting the Data Shape

Understanding tensor shapes is critical for debugging neural networks. Shape mismatches are among the most common sources of errors in deep learning code.

Let's think about what we expect:
- **Batch size**: We set `batch_size=64`, so each batch contains 64 images.
- **Image dimensions**: MNIST images are grayscale (1 channel) and are 28x28 pixels.
- **Shape convention**: PyTorch uses the format `(batch_size, channels, height, width)`.

Therefore, the shape of a single batch should be **(64, 1, 28, 28)**.

The labels will be a 1D tensor of shape **(64,)** containing the digit labels (0-9).
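
One caveat to the expected shape: 60,000 is not a multiple of 64, so the final batch of each epoch is smaller than the rest. The arithmetic below is a quick check, assuming the `DataLoader` default of `drop_last=False`.

```python
import math

num_train_samples = 60_000  # size of the MNIST training split
batch_size = 64

num_batches = math.ceil(num_train_samples / batch_size)
last_batch_size = num_train_samples - (num_batches - 1) * batch_size

print("Batches per epoch:", num_batches)
print("Size of the final batch:", last_batch_size)
```

Code that hard-codes a batch dimension of 64 would therefore fail on the last batch, which is one reason to read the batch size from `X.shape[0]` rather than assuming it.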


In [None]:
# Get one batch of data
X, y = next(iter(train_dataloader))

print("Image tensor shape:", X.shape)
print("Label tensor shape:", y.shape)
print("\nExpected image shape: (64, 1, 28, 28)")
print("Expected label shape: (64,)")
print(f"\nLabels in this batch: {y[:10].tolist()}...")  # Show first 10 labels


## Building the MLP Model

### nn.Module - The Base Class

All PyTorch models inherit from `nn.Module`, the base class for all neural network modules. When you create a model, you must define two essential methods:

1. **`__init__(self)`**: Where you define the layers of your network. This is where you instantiate all the layers (Linear, ReLU, etc.) that will be used in the forward pass.

2. **`forward(self, x)`**: Where you define how data flows through those layers. This method runs when you call `model(X)`. Prefer `model(X)` over calling `model.forward(X)` directly, because `model(X)` also runs any hooks registered on the module.

### The Layers We'll Use

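- **`nn.Flatten()`**: Flattens each image from shape `(1, 28, 28)` into a vector of 784 values while keeping the batch dimension, so a batch of shape `(64, 1, 28, 28)` becomes `(64, 784)`. This is necessary because fully-connected layers expect each sample as a 1D feature vector.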

- **`nn.Linear(in_features, out_features)`**: A standard fully-connected (dense) layer. It performs the operation `y = xW^T + b`, where W is a weight matrix and b is a bias vector.

- **`nn.ReLU()`**: The Rectified Linear Unit activation function. It applies `max(0, x)` element-wise, introducing non-linearity into the network. Non-linearity is essential for neural networks to learn complex patterns.
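
The three layers above can be tried in isolation before assembling them into a model. This sketch pushes a random stand-in batch (not real MNIST data) through each layer, prints the shapes, and checks `nn.Linear` against the formula `y = xW^T + b`. The variable names are illustrative.

```python
import torch
import torch.nn as nn

# Random stand-in for a batch of 64 MNIST images (values are not real data).
dummy_batch = torch.randn(64, 1, 28, 28)

flatten = nn.Flatten()        # keeps dim 0 (the batch), flattens the rest
linear = nn.Linear(784, 128)  # weight has shape (128, 784), bias has shape (128,)
relu = nn.ReLU()

flat = flatten(dummy_batch)
hidden = linear(flat)
activated = relu(hidden)

print("After Flatten:", tuple(flat.shape))
print("After Linear: ", tuple(hidden.shape))
print("After ReLU:   ", tuple(activated.shape))

# nn.Linear computes x @ W.T + b; compare against a manual computation.
manual = flat @ linear.weight.T + linear.bias
matches_formula = torch.allclose(hidden, manual, atol=1e-4)
print("Matches manual formula:", matches_formula)
print("Smallest value after ReLU:", activated.min().item())
```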


In [None]:
class SimpleMLP(nn.Module):
    def __init__(self):
        super().__init__()
        # Flatten 28x28 image to 784
        self.flatten = nn.Flatten()
        # First layer: 784 -> 128
        self.linear1 = nn.Linear(784, 128)
        self.relu1 = nn.ReLU()
        # Second layer: 128 -> 64
        self.linear2 = nn.Linear(128, 64)
        self.relu2 = nn.ReLU()
        # Output layer: 64 -> 10 (one for each digit 0-9)
        self.linear3 = nn.Linear(64, 10)
    
    def forward(self, x):
        x = self.flatten(x)
        x = self.linear1(x)
        x = self.relu1(x)
        x = self.linear2(x)
        x = self.relu2(x)
        x = self.linear3(x)
        return x

# Instantiate the model
model = SimpleMLP()
print(model)


## The Training Essentials

To train a neural network, you need three components:

1. **Loss Function**: Measures how wrong the model's predictions are. For classification tasks with multiple classes, we use `nn.CrossEntropyLoss`. It combines a log-softmax and a negative log-likelihood loss in one numerically stable operation, so the model should output raw scores (logits) without a softmax layer of its own.

2. **Optimizer**: Adjusts the model's weights to reduce the loss. `torch.optim.Adam` is a popular choice: it adapts the step size for each parameter and is a reasonable default for many problems. The optimizer needs access to the model's parameters (`model.parameters()`) so it knows which weights to update.

3. **The Training Loop**: The iterative process of:
   - Feeding data to the model
   - Computing the loss
   - Updating the weights
   - Repeating until the model learns
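
The statement that `nn.CrossEntropyLoss` folds the softmax step and the negative log-likelihood into one operation can be checked directly. The sketch below uses made-up logits and labels and compares the built-in loss with `F.log_softmax` followed by `F.nll_loss`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Made-up logits for a batch of 4 samples and 10 classes, with made-up labels.
logits = torch.randn(4, 10)
targets = torch.tensor([3, 0, 7, 9])

# Built-in loss: expects raw logits, not probabilities.
loss_builtin = nn.CrossEntropyLoss()(logits, targets)

# The same quantity computed in two explicit steps.
log_probs = F.log_softmax(logits, dim=1)
loss_manual = F.nll_loss(log_probs, targets)

print("CrossEntropyLoss:      ", loss_builtin.item())
print("log_softmax + nll_loss:", loss_manual.item())
```

Because the loss applies log-softmax internally, adding a softmax layer to the end of the model would apply the normalization twice.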


In [None]:
# Instantiate loss function
loss_fn = nn.CrossEntropyLoss()

# Instantiate optimizer
# lr = learning rate (how big of steps to take)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

print("Loss function:", loss_fn)
print("Optimizer:", optimizer)


## The Training Loop Explained

Each training iteration consists of five critical steps:

1. **Forward Pass**: Pass the batch of images through the model to get predictions. The model outputs raw scores (logits) for each of the 10 digit classes.

2. **Calculate Loss**: Compare the model's predictions to the true labels using the loss function. This gives us a single number representing how wrong the model is.

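3. **Backpropagation**: Calculate the gradient of the loss with respect to every parameter, which tells us in which direction and how strongly each weight should change. This is what `loss.backward()` does: it computes those gradients and stores them in each parameter's `.grad` attribute.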

4. **Update Weights**: The optimizer takes a "step" in the right direction to reduce the loss. This is `optimizer.step()`—it updates all the model's parameters using the computed gradients.

5. **Zero Gradients**: Reset the gradients to zero before the next batch. This is `optimizer.zero_grad()`—critical because PyTorch accumulates gradients by default, and we want fresh gradients for each batch.
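
Gradient accumulation in step 5 is easy to observe on a single tensor. The sketch below uses a made-up two-element parameter `w` rather than a real model. It calls `backward()` twice without resetting, then resets `w.grad` by hand, which is a stand-in for what `optimizer.zero_grad()` does for every parameter the optimizer manages.

```python
import torch

# A single trainable parameter, used only to illustrate accumulation.
w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: the gradient of sum(w * w) with respect to w is 2 * w.
(w * w).sum().backward()
grad_after_first = w.grad.clone()

# Second backward pass without resetting: the new gradient is added on top.
(w * w).sum().backward()
grad_after_second = w.grad.clone()

# Reset the stored gradient, then run backward again to get a fresh value.
w.grad = None
(w * w).sum().backward()
grad_after_reset = w.grad.clone()

print("After first backward:  ", grad_after_first.tolist())
print("After second backward: ", grad_after_second.tolist())
print("After reset + backward:", grad_after_reset.tolist())
```

Without the reset, each batch's update would be computed from the sum of all gradients seen so far rather than from the current batch alone.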


In [None]:
def train(dataloader, model, loss_fn, optimizer, epochs=5):
    """
    Train the model for a specified number of epochs.
    
    Args:
        dataloader: DataLoader providing batches of training data
        model: The neural network model
        loss_fn: Loss function
        optimizer: Optimizer for updating weights
        epochs: Number of training epochs
    """
    model.train()  # Set model to training mode
    
    for epoch in range(epochs):
        total_loss = 0.0
        num_batches = 0
        
        for batch_idx, (X, y) in enumerate(dataloader):
            # Step 1: Forward pass
            pred = model(X)
            
            # Step 2: Calculate loss
            loss = loss_fn(pred, y)
            
            # Step 3: Backpropagation
            loss.backward()
            
            # Step 4: Update weights
            optimizer.step()
            
            # Step 5: Zero gradients
            optimizer.zero_grad()
            
            total_loss += loss.item()
            num_batches += 1
            
            # Print progress every 100 batches
            if (batch_idx + 1) % 100 == 0:
                avg_loss = total_loss / num_batches
                print(f'Epoch {epoch + 1}/{epochs}, Batch {batch_idx + 1}/{len(dataloader)}, Loss: {avg_loss:.4f}')
        
        # Print average loss for the epoch
        avg_loss = total_loss / num_batches
        print(f'Epoch {epoch + 1}/{epochs} completed. Average Loss: {avg_loss:.4f}\n')

# Train the model
print("Starting training...\n")
train(train_dataloader, model, loss_fn, optimizer, epochs=5)
print("Training completed!")
