# Module 2: Neural Networks & Deep Learning Basics

## Overview

Neural networks are computational models inspired by the structure of biological neurons in the brain. They form the backbone of modern deep learning and power applications ranging from image recognition to natural language processing.

In this notebook, we will build up an understanding of neural networks from first principles:

1. **The Perceptron** - The simplest neural unit, and its limitations
2. **Activation Functions** - Non-linearities that give networks their power
3. **Forward Pass & Backpropagation** - How networks learn through gradient descent
4. **Loss Functions** - How we measure prediction quality
5. **PyTorch Introduction** - The modern framework for building neural networks
6. **MNIST Digit Classification** - A real-world example tying everything together

By the end of this module you will be able to:
- Implement a neural network from scratch using NumPy
- Understand forward propagation and backpropagation at the mathematical level
- Build and train neural networks using PyTorch
- Classify handwritten digits with >95% accuracy

---
## 1. Setup

First, let's install and import the libraries we need.

In [None]:
!pip install -q torch numpy matplotlib torchvision

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import matplotlib.pyplot as plt
from torchvision import datasets, transforms
from torch.utils.data import DataLoader

%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

# Reproducibility
np.random.seed(42)
torch.manual_seed(42)

print(f"PyTorch version: {torch.__version__}")
print(f"NumPy version: {np.__version__}")
print("Setup complete.")

---
## 2. The Perceptron

The **perceptron** is the simplest form of a neural network -- a single neuron. It takes a set of inputs, multiplies each by a weight, adds a bias, and passes the result through an activation function.

Mathematically:

$$y = f\left(\sum_{i=1}^{n} w_i \cdot x_i + b\right) = f(\mathbf{w} \cdot \mathbf{x} + b)$$

where:
- $\mathbf{x}$ = input vector
- $\mathbf{w}$ = weight vector
- $b$ = bias term
- $f$ = activation function (e.g., step function)

### How the Perceptron Learns

The perceptron learning rule updates weights based on the error:

$$w_i \leftarrow w_i + \eta \cdot (y_{\text{true}} - y_{\text{pred}}) \cdot x_i$$

where $\eta$ is the learning rate.
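
To make the rule concrete, here is a minimal sketch of a single update on one misclassified example. The starting weights, the input, and $\eta = 0.1$ are illustrative values assumed for this sketch, not taken from any training run in this notebook.

```python
import numpy as np

# Illustrative values (assumed for this sketch)
w = np.array([0.0, 0.0])   # weights
b = 0.0                    # bias
eta = 0.1                  # learning rate
x = np.array([1.0, 1.0])   # input
y_true = 0                 # desired output

# Step activation: 1 if the weighted sum is >= 0, else 0
y_pred = int(np.dot(w, x) + b >= 0)   # 0.0 >= 0, so the prediction is 1

# Perceptron learning rule: w <- w + eta * (y_true - y_pred) * x
error = y_true - y_pred               # -1: the unit fired when it should not have
w = w + eta * error * x
b = b + eta * error

print(w, b)  # [-0.1 -0.1] -0.1
```

After this update the weighted sum for the same input is negative, so the unit now outputs 0 for it. When the prediction is already correct, the error is 0 and nothing changes.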

In [None]:
class Perceptron:
    """A single perceptron (neuron) implemented from scratch with NumPy."""

    def __init__(self, n_inputs, learning_rate=0.1):
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0.0
        self.lr = learning_rate

    def step_function(self, x):
        """Binary step activation: returns 1 if x >= 0, else 0."""
        return (x >= 0).astype(int)

    def predict(self, X):
        """Forward pass: compute weighted sum + bias, apply activation."""
        linear_output = np.dot(X, self.weights) + self.bias
        return self.step_function(linear_output)

    def train(self, X, y, n_epochs=100):
        """Train the perceptron using the perceptron learning rule."""
        errors_per_epoch = []
        for epoch in range(n_epochs):
            total_errors = 0
            for xi, yi in zip(X, y):
                prediction = self.predict(xi.reshape(1, -1))[0]
                error = yi - prediction
                if error != 0:
                    total_errors += 1
                    self.weights += self.lr * error * xi
                    self.bias += self.lr * error
            errors_per_epoch.append(total_errors)
            if total_errors == 0:
                print(f"Converged at epoch {epoch + 1}")
                break
        return errors_per_epoch


print("Perceptron class defined.")

### Learning AND and OR Gates

The perceptron can learn linearly separable functions like AND and OR.

In [None]:
# Define the truth tables
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

y_and = np.array([0, 0, 0, 1])
y_or  = np.array([0, 1, 1, 1])
y_xor = np.array([0, 1, 1, 0])

# --- AND Gate ---
print("=" * 40)
print("Training Perceptron on AND gate")
print("=" * 40)
p_and = Perceptron(n_inputs=2, learning_rate=0.1)
errors_and = p_and.train(X, y_and, n_epochs=100)
predictions_and = p_and.predict(X)
for xi, yi, pred in zip(X, y_and, predictions_and):
    status = "OK" if yi == pred else "WRONG"
    print(f"  Input: {xi} -> Expected: {yi}, Predicted: {pred}  [{status}]")
print(f"  Weights: {p_and.weights}, Bias: {p_and.bias:.4f}")

# --- OR Gate ---
print("\n" + "=" * 40)
print("Training Perceptron on OR gate")
print("=" * 40)
p_or = Perceptron(n_inputs=2, learning_rate=0.1)
errors_or = p_or.train(X, y_or, n_epochs=100)
predictions_or = p_or.predict(X)
for xi, yi, pred in zip(X, y_or, predictions_or):
    status = "OK" if yi == pred else "WRONG"
    print(f"  Input: {xi} -> Expected: {yi}, Predicted: {pred}  [{status}]")
print(f"  Weights: {p_or.weights}, Bias: {p_or.bias:.4f}")

# --- XOR Gate (will fail!) ---
print("\n" + "=" * 40)
print("Training Perceptron on XOR gate")
print("=" * 40)
p_xor = Perceptron(n_inputs=2, learning_rate=0.1)
errors_xor = p_xor.train(X, y_xor, n_epochs=100)
predictions_xor = p_xor.predict(X)
correct = 0
for xi, yi, pred in zip(X, y_xor, predictions_xor):
    status = "OK" if yi == pred else "WRONG"
    if yi == pred:
        correct += 1
    print(f"  Input: {xi} -> Expected: {yi}, Predicted: {pred}  [{status}]")
print(f"  Accuracy: {correct}/{len(y_xor)}")
print(f"  -> The perceptron CANNOT learn XOR because it is not linearly separable!")

In [None]:
# Visualize the decision boundaries
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

gate_names = ["AND", "OR", "XOR"]
gate_labels = [y_and, y_or, y_xor]
gate_models = [p_and, p_or, p_xor]

for ax, name, labels, model in zip(axes, gate_names, gate_labels, gate_models):
    # Plot data points
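    for xi, yi in zip(X, labels):
        if yi == 1:
            # Filled marker: an edge color is meaningful here
            ax.scatter(xi[0], xi[1], c='blue', marker='o', s=200, zorder=5,
                       edgecolors='black', linewidth=1.5)
        else:
            # 'x' is an unfilled marker; passing edgecolors would trigger a Matplotlib warning
            ax.scatter(xi[0], xi[1], c='red', marker='x', s=200, zorder=5,
                       linewidth=1.5)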

    # Plot decision boundary: w1*x1 + w2*x2 + b = 0 => x2 = -(w1*x1 + b) / w2
    w1, w2 = model.weights
    b = model.bias
    if abs(w2) > 1e-8:
        x1_range = np.linspace(-0.5, 1.5, 100)
        x2_boundary = -(w1 * x1_range + b) / w2
        ax.plot(x1_range, x2_boundary, 'g--', linewidth=2, label='Decision boundary')

    ax.set_xlim(-0.5, 1.5)
    ax.set_ylim(-0.5, 1.5)
    ax.set_xlabel('$x_1$', fontsize=12)
    ax.set_ylabel('$x_2$', fontsize=12)
    ax.set_title(f'{name} Gate', fontsize=14)
    ax.legend(fontsize=10)
    ax.set_aspect('equal')

plt.suptitle('Perceptron Decision Boundaries', fontsize=16, y=1.02)
plt.tight_layout()
plt.show()
print("Notice: AND and OR have clean linear boundaries. XOR cannot be separated by a single line.")

**Key Insight:** A single perceptron can only learn **linearly separable** functions. XOR requires a non-linear decision boundary, which means we need **multiple layers** of neurons -- a neural network.
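
One way to see why a second layer helps is to wire one up by hand. The sketch below uses hand-picked (not learned) weights, assumed purely for illustration: two hidden step units compute OR and NAND, and an output unit takes the AND of the two, which is exactly XOR.

```python
import numpy as np

def step(z):
    """Binary step activation: 1 where z >= 0, else 0."""
    return (z >= 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer: column 0 computes OR, column 1 computes NAND (hand-picked weights)
W_hidden = np.array([[1.0, -1.0],
                     [1.0, -1.0]])
b_hidden = np.array([-0.5, 1.5])

# Output unit: AND of the two hidden outputs
w_out = np.array([1.0, 1.0])
b_out = -1.5

hidden = step(X @ W_hidden + b_hidden)
output = step(hidden @ w_out + b_out)
print(output)  # [0 1 1 0]
```

Each unit on its own is still a linear separator. The non-linear decision boundary comes from composing them.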

---
## 3. Activation Functions

Activation functions introduce **non-linearity** into neural networks. Without them, stacking multiple layers would be equivalent to a single linear transformation (since a composition of linear functions is still linear).
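
The collapse of stacked linear layers can be checked numerically. This is a small sketch with randomly generated weights (shapes and seed are arbitrary choices for the example): two layers without an activation give the same output as one layer with weights $W_1 W_2$ and bias $b_1 W_2 + b_2$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))     # 5 samples, 3 features

W1, b1 = rng.normal(size=(3, 4)), rng.normal(size=4)
W2, b2 = rng.normal(size=(4, 2)), rng.normal(size=2)

# Two stacked layers with no activation in between
two_layers = (x @ W1 + b1) @ W2 + b2

# One equivalent linear layer
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b

print(np.allclose(two_layers, one_layer))  # True

# With a ReLU in between, the same collapsed layer is no longer guaranteed to match
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))
```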

### Common Activation Functions

| Function | Formula | Range | Use Case |
|----------|---------|-------|----------|
| Sigmoid | $\sigma(x) = \frac{1}{1+e^{-x}}$ | $(0, 1)$ | Binary classification output |
| Tanh | $\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$ | $(-1, 1)$ | Hidden layers (zero-centered) |
| ReLU | $\max(0, x)$ | $[0, \infty)$ | Default for hidden layers |
| Leaky ReLU | $\max(\alpha x, x)$ | $(-\infty, \infty)$ | Avoids dying ReLU problem |

In [None]:
# Implement activation functions and their derivatives

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x) ** 2

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_derivative(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

print("Activation functions defined.")
print(f"  sigmoid(0) = {sigmoid(0):.4f}")
print(f"  relu(-1) = {relu(-1):.4f}, relu(1) = {relu(1):.4f}")
print(f"  tanh(0) = {tanh(0):.4f}")
print(f"  leaky_relu(-1) = {leaky_relu(-1):.4f}")

In [None]:
# Plot all activation functions and their derivatives side by side
x = np.linspace(-5, 5, 500)

activations = [
    ('Sigmoid', sigmoid, sigmoid_derivative, 'tab:blue'),
    ('ReLU', relu, relu_derivative, 'tab:orange'),
    ('Tanh', tanh, tanh_derivative, 'tab:green'),
    ('Leaky ReLU', leaky_relu, leaky_relu_derivative, 'tab:red'),
]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot activations
for name, func, _, color in activations:
    axes[0].plot(x, func(x), label=name, color=color, linewidth=2)
axes[0].axhline(y=0, color='black', linewidth=0.5)
axes[0].axvline(x=0, color='black', linewidth=0.5)
axes[0].set_title('Activation Functions', fontsize=14)
axes[0].set_xlabel('x', fontsize=12)
axes[0].set_ylabel('f(x)', fontsize=12)
axes[0].legend(fontsize=11)
axes[0].set_ylim(-1.5, 5)

# Plot derivatives
for name, _, deriv, color in activations:
    axes[1].plot(x, deriv(x), label=f"{name}'", color=color, linewidth=2)
axes[1].axhline(y=0, color='black', linewidth=0.5)
axes[1].axvline(x=0, color='black', linewidth=0.5)
axes[1].set_title('Derivatives of Activation Functions', fontsize=14)
axes[1].set_xlabel('x', fontsize=12)
axes[1].set_ylabel("f'(x)", fontsize=12)
axes[1].legend(fontsize=11)
axes[1].set_ylim(-0.2, 1.5)

plt.tight_layout()
plt.show()

print("Why ReLU is popular:")
print("  1. Computationally efficient (just a max operation)")
print("  2. Does not saturate for positive values, so its gradient stays at 1 there")
print("  3. By contrast, sigmoid/tanh derivatives approach 0 for large |x| -> vanishing gradients")
print("  4. Sparse activation: neurons output exactly 0 for negative inputs")
print("\nThe vanishing gradient problem:")
print("  When gradients are very small (close to 0), weight updates become tiny.")
print("  In deep networks, these small gradients multiply during backpropagation,")
print("  making it nearly impossible for early layers to learn. ReLU mitigates this")
print("  because its gradient is 1 for all positive inputs (though it is 0 for negative ones).")

---
## Exercise 1: 2-Layer Neural Network from Scratch (NumPy) for XOR

Now that we know a single perceptron cannot solve XOR, let's build a 2-layer neural network from scratch using NumPy.

**Architecture:**
- Input layer: 2 neurons (for the two XOR inputs)
- Hidden layer: 4 neurons with sigmoid activation
- Output layer: 1 neuron with sigmoid activation

**Your task:** Implement the forward pass, backward pass (backpropagation), and training loop.

**Hint for backpropagation:**
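- Output error: $\delta_{out} = (y - \hat{y}) \odot \sigma'(z_{out})$
- Hidden error: $\delta_{hidden} = (\delta_{out} W_{out}^T) \odot \sigma'(z_{hidden})$
- Weight update: $W \leftarrow W + \eta \cdot a^T \delta$, where $a$ is the input to that layer and $\odot$ denotes element-wise multiplication

The update adds rather than subtracts because $\delta$ is built from $(y - \hat{y})$, so $a^T \delta$ already points in the direction that reduces the squared error.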
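
As a sanity check on the sign convention in these hints, the sketch below compares $a^T \delta_{out}$ against a finite-difference estimate of the gradient of $\tfrac{1}{2}\sum (y - \hat{y})^2$ for a single sigmoid output layer. The random activations, the missing bias, and the names `step_direction` and `numeric_grad` are assumptions made for this illustration; it is not the exercise solution.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
a = rng.normal(size=(4, 3))        # stand-in for hidden activations (4 samples, 3 units)
y = np.array([[0.0], [1.0], [1.0], [0.0]])
W = rng.normal(size=(3, 1))        # stand-in for the output weights (no bias, for brevity)

def loss(W):
    """Half the sum of squared errors for a sigmoid output layer."""
    y_hat = sigmoid(a @ W)
    return 0.5 * np.sum((y - y_hat) ** 2)

# Hint formulas
y_hat = sigmoid(a @ W)
delta_out = (y - y_hat) * y_hat * (1 - y_hat)   # sigma'(z) = sigma(z) * (1 - sigma(z))
step_direction = a.T @ delta_out                # the a^T . delta term

# Central finite differences for dL/dW
eps = 1e-6
numeric_grad = np.zeros_like(W)
for i in range(W.shape[0]):
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[i, 0] += eps
    W_minus[i, 0] -= eps
    numeric_grad[i, 0] = (loss(W_plus) - loss(W_minus)) / (2 * eps)

# a^T . delta is the NEGATIVE gradient, which is why the update uses "+"
print(np.allclose(step_direction, -numeric_grad, atol=1e-6))  # True
```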

In [None]:
# Exercise 1: Implement a 2-layer neural network for XOR

class NeuralNetworkXOR:
    def __init__(self, input_size=2, hidden_size=4, output_size=1, lr=1.0):
        np.random.seed(42)
        self.lr = lr
        # Initialize weights and biases
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_derivative(self, x):
        s = self.sigmoid(x)
        return s * (1 - s)

    def forward(self, X):
        """Forward pass through the network.
        TODO: Implement the forward pass.
        1. Compute hidden layer pre-activation: z1 = X @ W1 + b1
        2. Apply sigmoid activation: a1 = sigmoid(z1)
        3. Compute output pre-activation: z2 = a1 @ W2 + b2
        4. Apply sigmoid activation: a2 = sigmoid(z2)
        Store z1, a1, z2, a2 as instance variables for backprop.
        Return a2 (the prediction).
        """
        self.z1 = None  # TODO
        self.a1 = None  # TODO
        self.z2 = None  # TODO
        self.a2 = None  # TODO
        return self.a2

    def backward(self, X, y):
        """Backward pass (backpropagation).
        TODO: Implement backpropagation.
        1. Compute output layer delta: d2 = (y - a2) * sigmoid_derivative(z2)
        2. Compute hidden layer delta: d1 = (d2 @ W2.T) * sigmoid_derivative(z1)
        3. Update W2, b2 using a1 and d2
        4. Update W1, b1 using X and d1
        """
        m = X.shape[0]
        # TODO: Implement gradient computation and weight updates
        pass

    def train(self, X, y, epochs=10000):
        """Training loop.
        TODO: Implement the training loop.
        For each epoch:
        1. Forward pass
        2. Compute loss (MSE)
        3. Backward pass
        4. Record loss every 1000 epochs
        Return list of losses.
        """
        losses = []
        # TODO: Implement training loop
        return losses


# Test data
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([[0], [1], [1], [0]])

print("Exercise 1: Implement the forward, backward, and train methods above.")
print("Then run the cell below to test your implementation.")

### Solution

In [None]:
# Solution: Complete 2-layer neural network for XOR

class NeuralNetworkXOR_Solution:
    def __init__(self, input_size=2, hidden_size=4, output_size=1, lr=1.0):
        np.random.seed(42)
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_derivative(self, x):
        s = self.sigmoid(x)
        return s * (1 - s)

    def forward(self, X):
        """Forward pass through the network."""
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y):
        """Backward pass (backpropagation)."""
        m = X.shape[0]

        # Output layer gradients
        d2 = (y - self.a2) * self.sigmoid_derivative(self.z2)

        # Hidden layer gradients
        d1 = (d2 @ self.W2.T) * self.sigmoid_derivative(self.z1)

        # Update weights and biases
        self.W2 += self.lr * (self.a1.T @ d2) / m
        self.b2 += self.lr * np.sum(d2, axis=0, keepdims=True) / m
        self.W1 += self.lr * (X.T @ d1) / m
        self.b1 += self.lr * np.sum(d1, axis=0, keepdims=True) / m

    def train(self, X, y, epochs=10000):
        """Training loop."""
        losses = []
        for epoch in range(epochs):
            # Forward pass
            output = self.forward(X)

            # Compute MSE loss
            loss = np.mean((y - output) ** 2)

            # Backward pass
            self.backward(X, y)

            # Record loss
            if epoch % 1000 == 0:
                losses.append(loss)
                if epoch % 2000 == 0:
                    print(f"  Epoch {epoch:5d} | Loss: {loss:.6f}")

        return losses


# Train the network
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([[0], [1], [1], [0]])

print("Training 2-layer neural network on XOR...")
print("=" * 40)
nn_xor = NeuralNetworkXOR_Solution(input_size=2, hidden_size=4, output_size=1, lr=2.0)
losses = nn_xor.train(X_xor, y_xor, epochs=10000)

# Test predictions
print("\nFinal predictions:")
predictions = nn_xor.forward(X_xor)
for xi, yi, pred in zip(X_xor, y_xor, predictions):
    rounded = round(pred[0])
    status = "OK" if yi[0] == rounded else "WRONG"
    print(f"  Input: {xi} -> Expected: {yi[0]}, Predicted: {pred[0]:.4f} (rounded: {rounded}) [{status}]")

# Plot training loss
plt.figure(figsize=(8, 4))
plt.plot(range(0, 10000, 1000), losses, 'b-o', linewidth=2, markersize=4)
plt.xlabel('Epoch', fontsize=12)
plt.ylabel('MSE Loss', fontsize=12)
plt.title('XOR Training Loss (2-Layer Neural Network)', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nSuccess! The 2-layer network can learn XOR, unlike a single perceptron.")
print("The hidden layer creates a non-linear feature space where XOR becomes separable.")

---
## 4. Forward Pass & Backpropagation

### Step-by-Step Walkthrough

Let's trace through a concrete example to see exactly how forward propagation and backpropagation work.

**Network Architecture:** 2 inputs -> 2 hidden neurons -> 1 output

#### The Computation Graph

```
x1 --w1--> [h1] --w5-->
      \   /              \
       \ /                [o1] --> Loss
       / \                /
      /   \              /
x2 --w3--> [h2] --w6-->
```

Each arrow represents a weight. Each node computes a weighted sum plus bias, then applies an activation function.

### The Chain Rule

Backpropagation is simply the **chain rule** applied systematically. If we have a composition of functions:

$$L = f(g(h(x)))$$

Then:

$$\frac{\partial L}{\partial x} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial h} \cdot \frac{\partial h}{\partial x}$$

In a neural network, this means we compute gradients layer by layer, from the output back to the input -- hence "back" propagation.
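
The chain rule can be verified on the smallest possible case. The sketch below uses a one-weight "network" with illustrative values (assumed for this example) and compares the product of local derivatives with a finite-difference estimate.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A one-weight "network": z = w * x, a = sigmoid(z), L = (y - a)^2
x, y, w = 0.5, 1.0, 0.8   # illustrative values

z = w * x
a = sigmoid(z)

# Chain rule: dL/dw = dL/da * da/dz * dz/dw
dL_da = -2.0 * (y - a)
da_dz = a * (1.0 - a)
dz_dw = x
analytic = dL_da * da_dz * dz_dw

# Finite-difference estimate of the same derivative
def loss(w):
    return (y - sigmoid(w * x)) ** 2

eps = 1e-6
numeric = (loss(w + eps) - loss(w - eps)) / (2 * eps)

print(abs(analytic - numeric) < 1e-8)  # True
```

A check like this is a common way to catch sign or indexing mistakes in a hand-written backward pass.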

In [None]:
# Manual step-by-step forward and backward pass

print("=" * 60)
print("STEP-BY-STEP FORWARD PASS AND BACKPROPAGATION")
print("=" * 60)

# Initialize with specific values for clarity
x = np.array([[0.5, 0.8]])  # Input
y_true = np.array([[1.0]])  # Target

# Weights (small, fixed for demonstration)
W1 = np.array([[0.1, 0.3],
               [0.2, 0.4]])
b1 = np.array([[0.01, 0.02]])

W2 = np.array([[0.5],
               [0.6]])
b2 = np.array([[0.03]])

print("\n--- FORWARD PASS ---")
print(f"Input x = {x}")
print(f"Target y = {y_true}")

# Layer 1: Hidden
z1 = x @ W1 + b1
print(f"\nHidden layer pre-activation:")
print(f"  z1 = x @ W1 + b1 = {x} @ {W1.tolist()} + {b1}")
print(f"  z1 = {z1}")

a1 = sigmoid(z1)
print(f"  a1 = sigmoid(z1) = {a1}")

# Layer 2: Output
z2 = a1 @ W2 + b2
print(f"\nOutput layer pre-activation:")
print(f"  z2 = a1 @ W2 + b2 = {z2}")

a2 = sigmoid(z2)
print(f"  a2 = sigmoid(z2) = {a2}  (this is our prediction)")

# Loss (MSE)
loss = np.mean((y_true - a2) ** 2)
print(f"\nMSE Loss = (y - a2)^2 = ({y_true[0,0]:.4f} - {a2[0,0]:.4f})^2 = {loss:.6f}")

print("\n--- BACKWARD PASS (Backpropagation) ---")
print("Using chain rule to compute gradients...")

# dL/da2
dL_da2 = -2 * (y_true - a2)
print(f"\ndL/da2 = -2(y - a2) = {dL_da2}")

# da2/dz2
da2_dz2 = sigmoid_derivative(z2)
print(f"da2/dz2 = sigmoid'(z2) = {da2_dz2}")

# dL/dz2 (delta at output)
delta2 = dL_da2 * da2_dz2
print(f"delta2 = dL/dz2 = dL/da2 * da2/dz2 = {delta2}")

# Gradients for W2 and b2
dL_dW2 = a1.T @ delta2
dL_db2 = delta2
print(f"\ndL/dW2 = a1.T @ delta2 = {dL_dW2.flatten()}")
print(f"dL/db2 = delta2 = {dL_db2.flatten()}")

# Backpropagate to hidden layer
dL_da1 = delta2 @ W2.T
print(f"\ndL/da1 = delta2 @ W2.T = {dL_da1}")

da1_dz1 = sigmoid_derivative(z1)
delta1 = dL_da1 * da1_dz1
print(f"delta1 = dL/da1 * sigmoid'(z1) = {delta1}")

# Gradients for W1 and b1
dL_dW1 = x.T @ delta1
dL_db1 = delta1
print(f"\ndL/dW1 = x.T @ delta1:")
print(f"  {dL_dW1}")
print(f"dL/db1 = delta1 = {dL_db1}")

# Update weights
lr = 0.5
print(f"\n--- WEIGHT UPDATE (lr={lr}) ---")
W2_new = W2 - lr * dL_dW2
b2_new = b2 - lr * dL_db2
W1_new = W1 - lr * dL_dW1
b1_new = b1 - lr * dL_db1

print(f"W2: {W2.flatten()} -> {W2_new.flatten()}")
print(f"b2: {b2.flatten()} -> {b2_new.flatten()}")
print(f"W1:\n  {W1} ->\n  {W1_new}")

# Verify: forward pass with new weights should give lower loss
z1_new = x @ W1_new + b1_new
a1_new = sigmoid(z1_new)
z2_new = a1_new @ W2_new + b2_new
a2_new = sigmoid(z2_new)
loss_new = np.mean((y_true - a2_new) ** 2)

print(f"\n--- VERIFICATION ---")
print(f"Old prediction: {a2[0,0]:.6f}, Loss: {loss:.6f}")
print(f"New prediction: {a2_new[0,0]:.6f}, Loss: {loss_new:.6f}")
print(f"Loss decreased by: {loss - loss_new:.6f}")
print(f"The gradient step moved us closer to the target ({y_true[0,0]}).")

---
## 5. Loss Functions

Loss functions measure how far our predictions are from the true values. The choice of loss function depends on the task:

- **Regression** (predicting continuous values): Mean Squared Error (MSE)
- **Classification** (predicting categories): Cross-Entropy Loss

### Mean Squared Error (MSE)

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

- Penalizes large errors heavily (quadratic)
- Always non-negative, equals 0 when predictions are perfect

### Binary Cross-Entropy

$$\text{BCE} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right]$$

- Natural pairing with sigmoid output
- Measures the divergence between the predicted probability distribution and the true distribution
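
One reason the sigmoid pairing is considered natural: when $\hat{y} = \sigma(z)$, the derivative of BCE with respect to the pre-activation $z$ simplifies to $\hat{y} - y$. The sketch below checks this numerically for a single example. The logit and label are illustrative values, and `bce_from_logit` is a helper name introduced only for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_from_logit(z, y):
    """Binary cross-entropy for one example, written as a function of the logit z."""
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

z, y = 0.3, 1.0           # illustrative logit and label
eps = 1e-6

numeric = (bce_from_logit(z + eps, y) - bce_from_logit(z - eps, y)) / (2 * eps)
analytic = sigmoid(z) - y  # dBCE/dz simplifies to (y_hat - y)

print(np.isclose(numeric, analytic))  # True
```

Unlike the MSE case, no $\sigma'(z)$ factor remains in this gradient, so a confidently wrong prediction still produces a large learning signal.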

In [None]:
# Implement and visualize loss functions

def mse_loss(y_true, y_pred):
    """Mean Squared Error loss."""
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, epsilon=1e-15):
    """Binary Cross-Entropy loss. Epsilon prevents log(0)."""
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))


# Demonstrate
y_true_val = 1.0
y_preds = np.linspace(0.01, 0.99, 200)

mse_values = [(y_true_val - yp) ** 2 for yp in y_preds]
bce_values = [-(y_true_val * np.log(yp) + (1 - y_true_val) * np.log(1 - yp)) for yp in y_preds]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# MSE plot
axes[0].plot(y_preds, mse_values, 'b-', linewidth=2)
axes[0].axvline(x=y_true_val, color='red', linestyle='--', label=f'True value = {y_true_val}')
axes[0].set_xlabel(r'Predicted value ($\hat{y}$)', fontsize=12)  # raw string: '\h' is not a valid escape
axes[0].set_ylabel('Loss', fontsize=12)
axes[0].set_title('Mean Squared Error (Regression)', fontsize=14)
axes[0].legend(fontsize=11)

# BCE plot
axes[1].plot(y_preds, bce_values, 'r-', linewidth=2)
axes[1].axvline(x=y_true_val, color='blue', linestyle='--', label=f'True label = {int(y_true_val)}')
axes[1].set_xlabel(r'Predicted probability ($\hat{y}$)', fontsize=12)  # raw string: '\h' is not a valid escape
axes[1].set_ylabel('Loss', fontsize=12)
axes[1].set_title('Binary Cross-Entropy (Classification)', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].set_ylim(0, 5)

plt.tight_layout()
plt.show()

print("When to use each loss function:")
print("  MSE: Regression tasks (predicting house prices, temperatures, etc.)")
print("  Cross-Entropy: Classification tasks (spam detection, image classification, etc.)")
print("\nKey difference:")
print("  MSE penalizes errors quadratically.")
print("  BCE penalizes confident wrong predictions very heavily (loss -> infinity as")
print("  predicted probability -> 0 when true label is 1).")

---
## 6. PyTorch Introduction

PyTorch is a deep learning framework that provides:
1. **Tensors** - Like NumPy arrays, but with GPU acceleration
2. **Autograd** - Automatic differentiation (no manual backprop!)
3. **nn.Module** - Building blocks for neural networks

### Tensors

In [None]:
# Device detection
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")
if device == "cuda":
    print(f"  GPU: {torch.cuda.get_device_name(0)}")

print("\n--- Tensor Creation ---")

# From Python lists
t1 = torch.tensor([1, 2, 3, 4])
print(f"From list: {t1}, dtype: {t1.dtype}")

# From NumPy
np_arr = np.array([[1.0, 2.0], [3.0, 4.0]])
t2 = torch.from_numpy(np_arr)
print(f"From NumPy:\n{t2}")

# Special tensors
t_zeros = torch.zeros(2, 3)
t_ones = torch.ones(2, 3)
t_rand = torch.randn(2, 3)  # Normal distribution
print(f"\nZeros:\n{t_zeros}")
print(f"Random (normal):\n{t_rand}")

print("\n--- Tensor Operations ---")

a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

print(f"a + b =\n{a + b}")
print(f"a * b (element-wise) =\n{a * b}")
print(f"a @ b (matrix multiply) =\n{a @ b}")
print(f"a.mean() = {a.mean():.2f}")
print(f"a.shape = {a.shape}")

# Move to GPU if available
a_device = a.to(device)
print(f"\nTensor on device: {a_device.device}")

### Autograd: Automatic Differentiation

PyTorch can automatically compute gradients for us. No more manual backpropagation!
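
One caveat worth knowing when relying on autograd: `backward()` adds into `.grad` rather than overwriting it. The short sketch below (values chosen for illustration) shows the accumulation and the reset, which is why training loops clear gradients before each backward pass.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# First backward pass: d(x^2)/dx = 2x = 6
(x ** 2).backward()
print(x.grad.item())  # 6.0

# Second backward pass without resetting: the new gradient is added to the old one
(x ** 2).backward()
print(x.grad.item())  # 12.0

# Reset, then run a fresh pass
x.grad.zero_()
(x ** 2).backward()
print(x.grad.item())  # 6.0
```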

In [None]:
# Autograd demo
print("--- Autograd Demo ---")

# Create a tensor with gradient tracking
x = torch.tensor(3.0, requires_grad=True)
print(f"x = {x}")

# Define a function: y = x^2 + 2x + 1
y = x**2 + 2*x + 1
print(f"y = x^2 + 2x + 1 = {y.item():.2f}")

# Compute gradient: dy/dx = 2x + 2
y.backward()
print(f"dy/dx at x=3: {x.grad.item():.2f} (expected: 2*3 + 2 = 8)")

print("\n--- Autograd with Vectors ---")

# More complex example
w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
x_vec = torch.tensor([4.0, 5.0, 6.0])

# y = w . x (dot product)
y = torch.dot(w, x_vec)
print(f"w = {w.detach()}")  # detach() is the recommended way to view values without gradient tracking
print(f"x = {x_vec}")
print(f"y = w . x = {y.item():.2f}")

# dy/dw = x
y.backward()
print(f"dy/dw = {w.grad} (expected: {x_vec}, since dy/dw_i = x_i)")

print("\n--- Autograd with a Mini Neural Network ---")

# Simulate a tiny forward + backward pass
torch.manual_seed(42)
x_in = torch.randn(1, 3)       # 1 sample, 3 features
w_hidden = torch.randn(3, 2, requires_grad=True)
w_out = torch.randn(2, 1, requires_grad=True)
target = torch.tensor([[1.0]])

# Forward
hidden = torch.sigmoid(x_in @ w_hidden)
output = torch.sigmoid(hidden @ w_out)
loss = ((target - output) ** 2).mean()

print(f"Input:  {x_in.numpy().flatten()}")
print(f"Output: {output.item():.4f}")
print(f"Target: {target.item():.4f}")
print(f"Loss:   {loss.item():.6f}")

# Backward - PyTorch computes ALL gradients automatically!
loss.backward()
print(f"\nGradients computed automatically:")
print(f"  dL/dw_hidden shape: {w_hidden.grad.shape}")
print(f"  dL/dw_hidden:\n  {w_hidden.grad}")
print(f"  dL/dw_out shape: {w_out.grad.shape}")
print(f"  dL/dw_out:\n  {w_out.grad}")
print("\nNo manual gradient calculation needed!")

### nn.Module: Building Blocks

`nn.Module` is the base class for all neural network modules in PyTorch. It handles parameter management, GPU transfer, and more.
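
To get a feel for what "parameter management" covers, it helps to know how many parameters a fully connected layer holds: one weight per input-output pair plus one bias per output. The sketch below is plain Python; `linear_param_count` and the 2 -> 4 -> 1 layer sizes are hypothetical, chosen only to illustrate the arithmetic.

```python
def linear_param_count(in_features, out_features):
    """Parameters in a fully connected layer: weights plus one bias per output."""
    return in_features * out_features + out_features

# Hypothetical layer sizes for a 2 -> 4 -> 1 network
layer_sizes = [2, 4, 1]
counts = [linear_param_count(n_in, n_out)
          for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts, sum(counts))  # [12, 5] 17
```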

In [None]:
# Building a neural network with nn.Module

class SimpleNet(nn.Module):
    """A simple 2-layer neural network."""

    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.layer1 = nn.Linear(input_size, hidden_size)
        self.layer2 = nn.Linear(hidden_size, output_size)
        self.relu = nn.ReLU()
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.relu(self.layer1(x))
        x = self.sigmoid(self.layer2(x))
        return x


# Create and inspect the network
model = SimpleNet(input_size=2, hidden_size=4, output_size=1)
print("Network architecture:")
print(model)

print("\nModel parameters:")
total_params = 0
for name, param in model.named_parameters():
    print(f"  {name}: shape={param.shape}, params={param.numel()}")
    total_params += param.numel()
print(f"  Total parameters: {total_params}")

# Test forward pass
x_test = torch.tensor([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
with torch.no_grad():
    output = model(x_test)
print(f"\nTest forward pass (3 samples):")
print(f"  Input shape:  {x_test.shape}")
print(f"  Output shape: {output.shape}")
print(f"  Outputs: {output.squeeze().numpy()}")

---
## 7. MNIST Digit Classification

Now let's put it all together and build a real neural network that classifies handwritten digits from the MNIST dataset.

**MNIST:**
- 70,000 grayscale images of handwritten digits (0-9)
- Each image is 28x28 pixels = 784 input features
- 60,000 training images, 10,000 test images

**Our MLP Architecture:**
- Input: 784 (flattened 28x28 image)
- Hidden 1: 128 neurons, ReLU
- Hidden 2: 64 neurons, ReLU
- Output: 10 neurons (one per digit class)
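
The output layer listed above has no activation of its own. A classifier like this typically emits raw scores (logits) and leaves the softmax to the loss function. As a rough NumPy sketch of what a softmax cross-entropy does with such scores (an illustration of the math under assumed inputs, not PyTorch's implementation):

```python
import numpy as np

def softmax_cross_entropy(logits, targets):
    """Mean cross-entropy from raw logits, using the log-sum-exp trick for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Two illustrative samples with 10 class scores each
logits = np.zeros((2, 10))
logits[0, 3] = 5.0            # sample 0 strongly favors class 3
targets = np.array([3, 7])    # sample 1 has uniform scores, so its loss is log(10)

loss = softmax_cross_entropy(logits, targets)
print(loss)
```

Subtracting the row maximum before exponentiating does not change the result but avoids overflow for large scores.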

In [None]:
# Load MNIST dataset

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])

train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transform)

train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

print(f"Training set: {len(train_dataset)} images")
print(f"Test set:     {len(test_dataset)} images")
print(f"Image shape:  {train_dataset[0][0].shape} (channels x height x width)")
print(f"Classes:      {train_dataset.classes}")

# Visualize some samples
fig, axes = plt.subplots(2, 8, figsize=(14, 4))
for i, ax in enumerate(axes.flat):
    image, label = train_dataset[i]
    ax.imshow(image.squeeze(), cmap='gray')
    ax.set_title(f'Label: {label}', fontsize=10)
    ax.axis('off')
plt.suptitle('MNIST Sample Images', fontsize=14)
plt.tight_layout()
plt.show()

In [None]:
# Define the MLP model

class MNISTClassifier(nn.Module):
    """Multi-Layer Perceptron for MNIST digit classification."""

    def __init__(self, hidden1=128, hidden2=64, activation='relu'):
        super().__init__()
        self.flatten = nn.Flatten()
        self.fc1 = nn.Linear(784, hidden1)
        self.fc2 = nn.Linear(hidden1, hidden2)
        self.fc3 = nn.Linear(hidden2, 10)

        # Select activation function
        if activation == 'relu':
            self.activation = nn.ReLU()
        elif activation == 'sigmoid':
            self.activation = nn.Sigmoid()
        elif activation == 'tanh':
            self.activation = nn.Tanh()
        else:
            raise ValueError(f"Unknown activation: {activation}")

    def forward(self, x):
        x = self.flatten(x)
        x = self.activation(self.fc1(x))
        x = self.activation(self.fc2(x))
        x = self.fc3(x)  # No activation on output (raw logits for CrossEntropyLoss)
        return x


# Create model
model_mnist = MNISTClassifier(hidden1=128, hidden2=64, activation='relu').to(device)
print("MNIST Classifier Architecture:")
print(model_mnist)

total_params = sum(p.numel() for p in model_mnist.parameters())
print(f"\nTotal parameters: {total_params:,}")

In [None]:
# Training function

def train_model(model, train_loader, test_loader, epochs=5, lr=0.001):
    """Train the model and return training history."""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr)

    train_losses = []
    test_accuracies = []

    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        num_batches = 0

        for batch_idx, (data, target) in enumerate(train_loader):
            data, target = data.to(device), target.to(device)

            optimizer.zero_grad()
            output = model(data)
            loss = criterion(output, target)
            loss.backward()
            optimizer.step()

            running_loss += loss.item()
            num_batches += 1

        avg_loss = running_loss / num_batches
        train_losses.append(avg_loss)

        # Evaluate on test set
        model.eval()
        correct = 0
        total = 0
        with torch.no_grad():
            for data, target in test_loader:
                data, target = data.to(device), target.to(device)
                output = model(data)
                _, predicted = torch.max(output, 1)
                total += target.size(0)
                correct += (predicted == target).sum().item()

        accuracy = 100.0 * correct / total
        test_accuracies.append(accuracy)

        print(f"  Epoch {epoch+1}/{epochs} | Train Loss: {avg_loss:.4f} | Test Accuracy: {accuracy:.2f}%")

    return train_losses, test_accuracies


# Train the model
print("Training MNIST Classifier (784 -> 128 -> 64 -> 10)...")
print("=" * 60)
train_losses, test_accuracies = train_model(model_mnist, train_loader, test_loader, epochs=5, lr=0.001)
print(f"\nFinal test accuracy: {test_accuracies[-1]:.2f}%")

In [None]:
# Visualize training results

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Training loss curve
axes[0].plot(range(1, len(train_losses) + 1), train_losses, 'b-o', linewidth=2, markersize=6)
axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Training Loss', fontsize=12)
axes[0].set_title('Training Loss Curve', fontsize=14)
axes[0].grid(True, alpha=0.3)

# Test accuracy curve
axes[1].plot(range(1, len(test_accuracies) + 1), test_accuracies, 'g-o', linewidth=2, markersize=6)
axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Test Accuracy (%)', fontsize=12)
axes[1].set_title('Test Accuracy Curve', fontsize=14)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Visualize sample predictions

model_mnist.eval()
test_images, test_labels = next(iter(test_loader))
test_images_dev = test_images.to(device)

with torch.no_grad():
    outputs = model_mnist(test_images_dev)
    probabilities = F.softmax(outputs, dim=1)
    _, predictions = torch.max(outputs, 1)

predictions = predictions.cpu()
probabilities = probabilities.cpu()

# Show 16 sample predictions
fig, axes = plt.subplots(2, 8, figsize=(16, 5))
for i, ax in enumerate(axes.flat):
    img = test_images[i].squeeze()
    pred = predictions[i].item()
    true = test_labels[i].item()
    conf = probabilities[i][pred].item() * 100

    ax.imshow(img, cmap='gray')
    color = 'green' if pred == true else 'red'
    ax.set_title(f'Pred: {pred} ({conf:.0f}%)\nTrue: {true}', fontsize=9, color=color)
    ax.axis('off')

plt.suptitle('MNIST Predictions (green=correct, red=incorrect)', fontsize=14)
plt.tight_layout()
plt.show()

# Count correct/incorrect in this batch
batch_correct = (predictions == test_labels).sum().item()
print(f"Batch accuracy: {batch_correct}/{len(test_labels)} = {100*batch_correct/len(test_labels):.1f}%")

---
## Exercise 2: Experiment with Hidden Layer Sizes

How does the size of the hidden layers affect the model's performance? In this exercise, you will train multiple MNIST classifiers with different hidden layer sizes and compare their accuracies.

**Your task:**
1. Train models with hidden sizes: 32, 64, 128, 256
2. Record the final test accuracy for each
3. Plot a comparison
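
Before training, it can help to know how many parameters each configuration has, since that drives memory use and training time. The sketch below assumes the classifier is three fully connected layers with biases (784 -> hidden -> hidden // 2 -> 10), as the `forward` method above suggests. `mlp_param_count` is a hypothetical helper for illustration, not part of the notebook.

```python
def mlp_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    Each pair of consecutive layers (n_in, n_out) contributes
    n_in * n_out weights plus n_out biases.
    """
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Assumed layout for this exercise: 784 -> size -> size // 2 -> 10
for size in [32, 64, 128, 256]:
    sizes = [784, size, size // 2, 10]
    print(f"hidden={size:<4} parameters={mlp_param_count(sizes):,}")
```

Almost all parameters sit in the first layer (784 x hidden), so the count grows roughly linearly with the first hidden size.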

In [None]:
# Exercise 2: Compare different hidden layer sizes

hidden_sizes = [32, 64, 128, 256]
results = {}  # Will store {size: (losses, accuracies)}

for size in hidden_sizes:
    # TODO: Create a MNISTClassifier with hidden1=size, hidden2=size//2
    model = None  # TODO

    # TODO: Train the model for 5 epochs
    losses, accuracies = None, None  # TODO

    # TODO: Store results
    # results[size] = (losses, accuracies)
    pass

# TODO: Plot comparison of test accuracies for each hidden size
# Create a plot with epochs on x-axis and accuracy on y-axis
# with one line per hidden size

print("Exercise 2: Implement the training loop above and plot the results.")

### Solution

In [None]:
# Solution: Compare different hidden layer sizes

hidden_sizes = [32, 64, 128, 256]
results = {}

for size in hidden_sizes:
    print(f"\nTraining with hidden_size={size} (hidden2={size//2})...")
    print("-" * 60)

    torch.manual_seed(42)
    model = MNISTClassifier(hidden1=size, hidden2=size//2, activation='relu').to(device)

    param_count = sum(p.numel() for p in model.parameters())
    print(f"  Parameters: {param_count:,}")

    losses, accuracies = train_model(model, train_loader, test_loader, epochs=5, lr=0.001)
    results[size] = (losses, accuracies)

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

colors = ['tab:blue', 'tab:orange', 'tab:green', 'tab:red']
epochs_range = range(1, 6)

for (size, (losses, accuracies)), color in zip(results.items(), colors):
    axes[0].plot(epochs_range, losses, '-o', color=color, linewidth=2, markersize=5,
                 label=f'hidden={size}')
    axes[1].plot(epochs_range, accuracies, '-o', color=color, linewidth=2, markersize=5,
                 label=f'hidden={size}')

axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Training Loss', fontsize=12)
axes[0].set_title('Training Loss by Hidden Layer Size', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Test Accuracy (%)', fontsize=12)
axes[1].set_title('Test Accuracy by Hidden Layer Size', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary table
print("\nSummary:")
print(f"{'Hidden Size':<15} {'Final Accuracy':>15} {'Parameters':>12}")
print("-" * 45)
for size in hidden_sizes:
    acc = results[size][1][-1]
    # Weights + biases of the three linear layers: 784 -> size -> size//2 -> 10
    params = 784 * size + size + size * (size // 2) + (size // 2) + (size // 2) * 10 + 10
    print(f"{size:<15} {acc:>14.2f}% {params:>12,}")

print("\nObservations:")
print("  - Larger hidden layers generally achieve higher accuracy.")
print("  - But the improvement diminishes: each doubling of size yields a smaller accuracy gain.")
print("  - Larger models train slower and use more memory.")
print("  - The right size depends on the task complexity and available compute.")

---
## Exercise 3: Visualize Activation Function Effects on Training

Different activation functions can dramatically affect how quickly and how well a network trains. In this exercise, you will train the same MNIST architecture with three different activation functions and compare their convergence behavior.

**Your task:**
1. Train MNIST classifiers with sigmoid, ReLU, and tanh activations
2. Use the same architecture (784 -> 128 -> 64 -> 10) and learning rate
3. Plot the training loss curves for all three on the same figure
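
One reason the activation choice can matter is the size of each function's derivative, because backpropagation multiplies these local derivatives layer by layer. The sketch below is a deliberately simplified illustration: it ignores the weights and assumes every layer sees the same pre-activation value. `chained_grad` is a hypothetical helper, and the numbers are not a prediction of the MNIST results.

```python
import math


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum is 0.25 at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def tanh_grad(z):
    """Derivative of tanh; its maximum is 1.0 at z = 0."""
    return 1.0 - math.tanh(z) ** 2


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_grad(grad_fn, z, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return grad_fn(z) ** depth


# Compare how the product shrinks with depth at a moderately large input.
for name, fn in [("sigmoid", sigmoid_grad), ("tanh", tanh_grad), ("relu", relu_grad)]:
    values = [chained_grad(fn, 2.0, depth) for depth in (1, 2, 5)]
    print(name, ["%.2e" % v for v in values])
```

With only two hidden layers, as in this exercise, the effect is modest; it becomes more pronounced as depth grows. Note also that ReLU's derivative is exactly 0 for negative inputs, so it is not free of gradient problems either.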

In [None]:
# Exercise 3: Compare activation functions on MNIST

activation_functions = ['sigmoid', 'relu', 'tanh']
activation_results = {}  # Will store {activation: (losses, accuracies)}

for act_fn in activation_functions:
    # TODO: Create a MNISTClassifier with this activation function
    model = None  # TODO

    # TODO: Train the model for 5 epochs
    losses, accuracies = None, None  # TODO

    # TODO: Store results
    # activation_results[act_fn] = (losses, accuracies)
    pass

# TODO: Plot training loss curves for all three activations on the same figure
# Also plot accuracy curves

print("Exercise 3: Implement the training loop above and plot the results.")

### Solution

In [None]:
# Solution: Compare activation functions on MNIST

activation_functions = ['sigmoid', 'relu', 'tanh']
activation_results = {}

for act_fn in activation_functions:
    print(f"\nTraining with activation={act_fn}...")
    print("-" * 60)

    torch.manual_seed(42)
    model = MNISTClassifier(hidden1=128, hidden2=64, activation=act_fn).to(device)
    losses, accuracies = train_model(model, train_loader, test_loader, epochs=5, lr=0.001)
    activation_results[act_fn] = (losses, accuracies)

# Plot comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

act_colors = {'sigmoid': 'tab:blue', 'relu': 'tab:orange', 'tanh': 'tab:green'}
epochs_range = range(1, 6)

for act_fn in activation_functions:
    losses, accuracies = activation_results[act_fn]
    color = act_colors[act_fn]
    axes[0].plot(epochs_range, losses, '-o', color=color, linewidth=2,
                 markersize=6, label=act_fn.capitalize())
    axes[1].plot(epochs_range, accuracies, '-o', color=color, linewidth=2,
                 markersize=6, label=act_fn.capitalize())

axes[0].set_xlabel('Epoch', fontsize=12)
axes[0].set_ylabel('Training Loss', fontsize=12)
axes[0].set_title('Training Loss by Activation Function', fontsize=14)
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

axes[1].set_xlabel('Epoch', fontsize=12)
axes[1].set_ylabel('Test Accuracy (%)', fontsize=12)
axes[1].set_title('Test Accuracy by Activation Function', fontsize=14)
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Summary
print("\nSummary:")
print(f"{'Activation':<15} {'Final Accuracy':>15} {'Final Loss':>12}")
print("-" * 45)
for act_fn in activation_functions:
    losses, accuracies = activation_results[act_fn]
    print(f"{act_fn.capitalize():<15} {accuracies[-1]:>14.2f}% {losses[-1]:>12.4f}")

print("\nObservations:")
print("  - ReLU typically converges fastest and achieves the highest accuracy.")
print("  - Sigmoid is usually slowest: its derivative never exceeds 0.25, so gradients shrink as they propagate back.")
print("  - Tanh usually beats sigmoid (zero-centered, larger gradients near 0) but still saturates for large |x|.")
print("  - This is why ReLU is the default choice for hidden layers in modern networks.")

---
## 8. Summary & References

### Key Takeaways

1. **The Perceptron** is the simplest neural unit. It can learn only linearly separable functions (such as AND and OR), not functions that are not linearly separable (such as XOR).

2. **Activation functions** introduce non-linearity. ReLU is the most popular choice for hidden layers because its gradient does not saturate for positive inputs, which mitigates (but does not eliminate) the vanishing gradient problem.

3. **Backpropagation** is the systematic application of the chain rule to compute gradients layer by layer, from the output back to the input.

4. **Loss functions** quantify how wrong our predictions are. MSE is the usual choice for regression, and cross-entropy for classification.

5. **PyTorch** provides tensors (GPU-accelerated arrays), autograd (automatic differentiation), and nn.Module (building blocks for networks).

6. **Deeper and wider networks** can model more complex functions, but with diminishing returns and increased computational cost.

7. **The choice of activation function** significantly impacts training dynamics. ReLU typically trains faster than sigmoid and tanh in deep networks, although the size of the gap depends on depth, initialization, and optimizer.
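
To make takeaway 4 concrete, here is a minimal pure-Python sketch of the two losses. It is an illustration under stated assumptions, not the notebook's implementation: the function names are hypothetical, and it handles a single example rather than a batch. It also shows why the classifier returns raw logits: `nn.CrossEntropyLoss` applies log-softmax internally, so the model should not apply softmax itself.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the true class index `target`."""
    return -math.log(softmax(logits)[target])


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# A confident, correct prediction has a lower loss than an uninformed one.
print(cross_entropy([5.0, 0.0, 0.0], target=0))
print(cross_entropy([0.0, 0.0, 0.0], target=0))
print(mse([1.0, 2.0], [1.0, 4.0]))
```

For uniform logits over 10 classes the cross-entropy is ln(10), about 2.30, which is roughly the loss an untrained MNIST classifier starts from.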

### What's Next

In the next module, we will explore:
- Convolutional Neural Networks (CNNs) for image data
- Recurrent Neural Networks (RNNs) for sequential data
- Regularization techniques (dropout, batch normalization)
- Transfer learning

### References

- **Book:** Goodfellow, I., Bengio, Y., & Courville, A. (2016). *Deep Learning*. MIT Press. Chapters 6-8 cover feedforward networks, regularization, and optimization. Available free at [deeplearningbook.org](https://www.deeplearningbook.org/)

- **Course:** fast.ai - *Practical Deep Learning for Coders*. A hands-on course that teaches deep learning from the top down. [course.fast.ai](https://course.fast.ai/)

- **Tutorial:** PyTorch - *Deep Learning with PyTorch: A 60 Minute Blitz*. Official PyTorch tutorial covering tensors, autograd, and neural networks. [pytorch.org/tutorials](https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html)

- **Paper:** Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). *Learning representations by back-propagating errors*. Nature, 323(6088), 533-536. The foundational paper on backpropagation.

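- **Paper:** Glorot, X., & Bengio, Y. (2010). *Understanding the difficulty of training deep feedforward neural networks*. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR 9, 249-256. Analyzes why gradients vanish in deep networks and proposes an initialization strategy (Xavier/Glorot initialization).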