# MNIST Neural Network: Function Explanations

## Line-by-Line Breakdown of Core Functions

This notebook provides detailed explanations of the four key functions in the MNIST neural network:
1. **compute_loss()** - Measuring prediction errors
2. **backward()** - Backpropagation & weight updates
3. **train()** - The main training loop
4. **predict()** - Making predictions

---

## Function 1: `compute_loss(self, y_true, y_pred)`

**Purpose:** Calculate cross-entropy loss to measure how wrong the predictions are.

**Location in code:** Lines 318-323

### The Complete Function

```python
def compute_loss(self, y_true, y_pred):
    """Cross-entropy loss"""
    m = y_true.shape[0]                                    # Line 320
    log_likelihood = -np.log(y_pred[range(m), y_true.argmax(axis=1)])  # Line 321
    loss = np.sum(log_likelihood) / m                      # Line 322
    return loss                                             # Line 323
```

### Line-by-Line Breakdown

**Line 320: `m = y_true.shape[0]`**
- Gets the batch size (number of samples)
- If `y_true` has shape (128, 10), then `m = 128`
- This tells us how many images we're evaluating

**Line 321: `y_true.argmax(axis=1)`**
- Converts one-hot encoding to class indices
- Example: `[0, 0, 0, 1, 0, ...]` (class 3) → becomes `3`
- Does this for all 128 samples

**Line 321: `y_pred[range(m), y_true.argmax(axis=1)]`**
- **Advanced indexing** (fancy indexing)
- For each sample i, gets: `y_pred[i, true_class_i]`
- Extracts ONLY the predicted probability for the correct class
- Example:
  ```
  Sample 1: y_pred = [0.1, 0.2, 0.1, 0.6, 0.0, ...]  true_class = 3
  → Extract: 0.6 (the probability for class 3)
  
  Sample 2: y_pred = [0.3, 0.5, 0.1, 0.1, ...]  true_class = 1
  → Extract: 0.5 (the probability for class 1)
  ```

**Line 321: `-np.log(...)`**
- Apply negative logarithm to each extracted probability
- **Why logarithm?** It heavily penalizes wrong predictions
  - If probability = 0.9 (good): loss = -log(0.9) ≈ 0.105 (small)
  - If probability = 0.5 (uncertain): loss = -log(0.5) ≈ 0.693 (medium)
  - If probability = 0.1 (wrong): loss = -log(0.1) ≈ 2.303 (large)
- Result: array of 128 loss values (one per sample)

**Line 322: `np.sum(log_likelihood) / m`**
- Sum all 128 individual losses
- Divide by batch size (128) to get AVERAGE loss
- **Why average?** Makes loss independent of batch size (small batches don't artificially have small losses)
- Result: single scalar value

**Line 323: `return loss`**
- Return the average cross-entropy loss
- Used to monitor training progress
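
One caveat the function above does not handle: if the network assigns a probability of exactly 0 to the true class, `np.log` returns `-inf` and the loss becomes infinite. The following is a minimal sketch of a guarded variant. The name `safe_cross_entropy` and the `eps` value are assumptions for illustration, not part of the original code.

```python
import numpy as np

def safe_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy with probabilities clipped away from 0 (hypothetical helper)."""
    m = y_true.shape[0]
    true_classes = y_true.argmax(axis=1)
    # Probability the model assigned to each sample's correct class
    p_correct = y_pred[np.arange(m), true_classes]
    # Clip so that log(0) can never occur
    p_correct = np.clip(p_correct, eps, 1.0)
    return float(np.sum(-np.log(p_correct)) / m)

y_true = np.array([[0, 1], [1, 0]])
y_pred = np.array([[1.0, 0.0], [1.0, 0.0]])  # first sample: true class has probability 0
loss = safe_cross_entropy(y_true, y_pred)
print(loss)  # finite instead of inf
```

Clipping changes the loss only for probabilities below `eps`, so ordinary batches give the same value as the unguarded version.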

### Visual Example

```
Input:
  y_true (one-hot):  [[0,0,1,0], [1,0,0,0]]  # True classes: 2 and 0
  y_pred (softmax):  [[0.1, 0.2, 0.6, 0.1], [0.8, 0.1, 0.05, 0.05]]

Step 1: Get true classes
  y_true.argmax(axis=1) → [2, 0]

Step 2: Extract predictions for true classes
  y_pred[0, 2] = 0.6
  y_pred[1, 0] = 0.8
  Result: [0.6, 0.8]

Step 3: Apply negative log
  -log(0.6) ≈ 0.511
  -log(0.8) ≈ 0.223
  Result: [0.511, 0.223]

Step 4: Average
  (0.511 + 0.223) / 2 ≈ 0.367

Output: 0.367 (average loss)
```
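
The worked example can be reproduced directly in NumPy. This sketch uses the same two samples and the same four steps; with unrounded intermediate values the average comes out near 0.367.

```python
import numpy as np

y_true = np.array([[0, 0, 1, 0],
                   [1, 0, 0, 0]])
y_pred = np.array([[0.1, 0.2, 0.6, 0.1],
                   [0.8, 0.1, 0.05, 0.05]])

m = y_true.shape[0]
true_classes = y_true.argmax(axis=1)          # array([2, 0])
p_correct = y_pred[range(m), true_classes]    # array([0.6, 0.8])
per_sample = -np.log(p_correct)               # approx [0.511, 0.223]
loss = per_sample.sum() / m                   # approx 0.367
print(true_classes, p_correct, per_sample.round(3), round(loss, 3))
```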

### What It Means
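- **Lower loss = better predictions**
- For a single sample, loss = -log(p), where p is the probability given to the correct class:
- Loss 0.1 → p ≈ 0.90: confident and correct ✓✓✓
- Loss 0.5 → p ≈ 0.61: leaning toward the correct class ✓✓
- Loss 2.0 → p ≈ 0.14: most of the probability went to wrong classes ✗✗✗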

## Function 2: `backward(self, X, y_true, learning_rate=0.01)`

**Purpose:** Compute gradients via backpropagation and update all network weights.

**Location in code:** Lines 325-343

### The Complete Function

```python
def backward(self, X, y_true, learning_rate=0.01):
    """Backpropagation"""
    m = X.shape[0]                                        # Line 327
    
    # Output layer gradients
    dz2 = self.a2 - y_true                               # Line 330
    dW2 = np.dot(self.a1.T, dz2) / m                     # Line 331
    db2 = np.sum(dz2, axis=0, keepdims=True) / m         # Line 332
    
    # Hidden layer gradients
    dz1 = np.dot(dz2, self.W2.T) * relu_derivative(self.z1)  # Line 335
    dW1 = np.dot(X.T, dz1) / m                           # Line 336
    db1 = np.sum(dz1, axis=0, keepdims=True) / m         # Line 337
    
    # Update weights
    self.W2 -= learning_rate * dW2                       # Line 340
    self.b2 -= learning_rate * db2                       # Line 341
    self.W1 -= learning_rate * dW1                       # Line 342
    self.b1 -= learning_rate * db1                       # Line 343
```

### Line-by-Line Breakdown

**Line 327: `m = X.shape[0]`**
- Batch size (e.g., 128)
- Used to normalize gradients

#### OUTPUT LAYER (Working Backwards)

**Line 330: `dz2 = self.a2 - y_true`**
- **Key line!** Compute error at output layer
- `self.a2` = predictions from softmax, shape (128, 10)
- `y_true` = one-hot labels, shape (128, 10)
- Result: (128, 10) matrix of errors
- This is the **gradient of the loss w.r.t. the pre-softmax scores `z2`**; the simple form comes from combining softmax with cross-entropy
- Example: if prediction [0.1, 0.9] and truth [0, 1], error = [0.1, -0.1]
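
The claim that the gradient is simply `a2 - y_true` can be checked numerically. The sketch below compares that formula against central finite differences for one sample. The `softmax` and `loss_fn` helpers are assumptions written for this check, since the network's own forward pass is not shown in this section.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())      # subtract max for numerical stability
    return e / e.sum()

def loss_fn(z, true_class):
    # Cross-entropy for a single sample, as a function of the logits
    return -np.log(softmax(z)[true_class])

rng = np.random.default_rng(0)
z = rng.normal(size=10)          # logits for one sample
true_class = 3
y = np.zeros(10)
y[true_class] = 1.0

analytic = softmax(z) - y        # same form as dz2 = a2 - y_true

# Central finite differences, one logit at a time
h = 1e-6
numeric = np.zeros_like(z)
for k in range(z.size):
    step = np.zeros_like(z)
    step[k] = h
    numeric[k] = (loss_fn(z + step, true_class) - loss_fn(z - step, true_class)) / (2 * h)

max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)  # tiny: the two gradients agree
```

The check is per sample; the `1/m` averaging over the batch is applied later, when `dW2` and `db2` are computed.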

**Line 331: `dW2 = np.dot(self.a1.T, dz2) / m`**
- Compute gradient for hidden→output weights
- `self.a1` = hidden layer activations, shape (batch, hidden) = (128, 128); `self.a1.T` has shape (hidden, batch) = (128, 128). Batch size and hidden size both happen to be 128 here
- Matrix multiplication: (128, 128) × (128, 10) = (128, 10), the same shape as `W2` ✓
- **What it means:** How much each hidden activation contributed to the output error
- Divide by m: average the gradient over the batch

**Line 332: `db2 = np.sum(dz2, axis=0, keepdims=True) / m`**
- Compute gradient for output bias
- Sum errors across all samples (axis=0): (128, 10) → (1, 10)
- Each output neuron gets one bias gradient value
- Divide by m: average per sample

#### HIDDEN LAYER (Backpropagate Further)

**Line 335: `np.dot(dz2, self.W2.T)`**
- **Propagate error backwards through W2**
- Take output layer errors and "spread them back" to hidden layer
- (128, 10) × (10, 128) = (128, 128)
- This tells us: "how much did each hidden neuron contribute to the output error?"

**Line 335: `* relu_derivative(self.z1)`**
- **Apply ReLU derivative**
- For each hidden neuron: if z1 < 0, the gradient is 0 (the neuron was inactive for that sample, so it contributed nothing)
- If z1 > 0, the gradient passes through unchanged (coefficient = 1)
- This respects the ReLU activation function
- Result: (128, 128) hidden layer gradient
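
The `relu_derivative` helper is not shown in this section. A common definition, given here as an assumption rather than the author's exact code, is a 0/1 mask over the pre-activations:

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def relu_derivative(z):
    # 1.0 where the unit was active, 0.0 elsewhere.
    # The derivative at exactly 0 is set to 0 by convention (an assumption here).
    return (z > 0).astype(float)

z1 = np.array([[-2.0, 0.0, 3.0]])        # pre-activations of three hidden units
upstream = np.array([[0.5, 0.5, 0.5]])   # error arriving from the output layer
dz1 = upstream * relu_derivative(z1)     # only the active unit keeps its gradient
print(relu(z1), dz1)
```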

**Line 336: `dW1 = np.dot(X.T, dz1) / m`**
- Compute gradient for input→hidden weights
- `X.T` = input transposed, shape (784, 128)
- Matrix multiplication: (784, 128) × (128, 128) = (784, 128) ✓
- This tells us: "how much does each input pixel affect the hidden layer error?"

**Line 337: `db1 = np.sum(dz1, axis=0, keepdims=True) / m`**
- Compute gradient for hidden bias
- Sum across samples: (128, 128) → (1, 128)

#### WEIGHT UPDATES (Gradient Descent)

**Lines 340-343: Update all weights and biases**
```python
self.W2 -= learning_rate * dW2    # Move opposite to gradient
self.b2 -= learning_rate * db2
self.W1 -= learning_rate * dW1
self.b1 -= learning_rate * db1
```
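- **Gradient descent step**: move each parameter in the direction that reduces loss
- `learning_rate` controls step size (default 0.01 in this signature; `train()` passes its own value, 0.1 by default)
- Large gradient = large update
- Small gradient = small update
- All four gradients are computed before any parameter changes, so every update is based on the same pre-update weights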
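
To see how the learning rate scales each step, here is a toy sketch on a one-parameter loss. The function `(w - 3)^2` is a hypothetical stand-in chosen for illustration and has nothing to do with the network's actual loss surface.

```python
def loss(w):
    return (w - 3.0) ** 2

def grad(w):
    return 2.0 * (w - 3.0)

def descend(w, learning_rate, steps):
    # Same update rule as backward(): parameter -= learning_rate * gradient
    for _ in range(steps):
        w -= learning_rate * grad(w)
    return w

w_small = descend(0.0, learning_rate=0.01, steps=20)   # about 1.0, still far from 3
w_large = descend(0.0, learning_rate=0.1, steps=20)    # about 2.97, close to 3
print(w_small, w_large)
```

After the same number of steps, the larger rate has moved much closer to the minimum at `w = 3`. On a real network a rate that is too large can overshoot instead, so the value is usually tuned.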

### Visual Flow

```
Forward Pass (already done):
  X (128, 784) → W1 → z1 → ReLU → a1 (128, 128) → W2 → z2 → Softmax → a2 (128, 10)

Backward Pass (what backward() does):
  Error at output: dz2 = a2 - y_true (128, 10)
       ↓
  Compute dW2, db2 (gradients for W2, b2)
       ↓
  Backprop to hidden: dz1 = dz2 @ W2.T * relu_derivative (128, 128)
       ↓
  Compute dW1, db1 (gradients for W1, b1)
       ↓
  Update all weights: W1 -= lr * dW1, etc.
```
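
The shapes in this flow can be traced with random data. The sketch below assumes the forward pass is linear → ReLU → linear → softmax, as the diagram suggests, and uses `(z1 > 0)` in place of `relu_derivative`. A batch size of 32 is chosen deliberately so it cannot be confused with the hidden width of 128.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n_in, n_hidden, n_out = 32, 784, 128, 10

X = rng.random((batch, n_in))
y_true = np.eye(n_out)[rng.integers(0, n_out, size=batch)]   # one-hot labels
W1 = rng.normal(scale=0.01, size=(n_in, n_hidden))
b1 = np.zeros((1, n_hidden))
W2 = rng.normal(scale=0.01, size=(n_hidden, n_out))
b2 = np.zeros((1, n_out))

# Forward pass (assumed structure)
z1 = X @ W1 + b1
a1 = np.maximum(0, z1)
z2 = a1 @ W2 + b2
exp_z = np.exp(z2 - z2.max(axis=1, keepdims=True))
a2 = exp_z / exp_z.sum(axis=1, keepdims=True)

# Backward pass, same formulas as backward()
dz2 = a2 - y_true
dW2 = a1.T @ dz2 / batch
db2 = dz2.sum(axis=0, keepdims=True) / batch
dz1 = (dz2 @ W2.T) * (z1 > 0)
dW1 = X.T @ dz1 / batch
db1 = dz1.sum(axis=0, keepdims=True) / batch

# Every gradient has the same shape as the parameter it updates
for name, g, p in [("W1", dW1, W1), ("b1", db1, b1), ("W2", dW2, W2), ("b2", db2, b2)]:
    print(name, g.shape, p.shape)
```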

## Function 3: `train(self, X_train, y_train, X_test, y_test, epochs=10, batch_size=128, learning_rate=0.1)`

**Purpose:** Main training loop that orchestrates everything.

**Location in code:** Lines 345-384

### The Complete Function (Simplified Structure)

```python
def train(self, X_train, y_train, X_test, y_test, epochs=10, batch_size=128, learning_rate=0.1):
    n_batches = len(X_train) // batch_size              # Line 347
    
    for epoch in range(epochs):                         # Line 349
        # Shuffle training data
        indices = np.random.permutation(len(X_train))   # Line 351
        X_shuffled = X_train[indices]                   # Line 352
        y_shuffled = y_train[indices]                   # Line 353
        
        # Mini-batch training
        for i in range(n_batches):                      # Line 356
            start_idx = i * batch_size                  # Line 357
            end_idx = start_idx + batch_size            # Line 358
            
            X_batch = X_shuffled[start_idx:end_idx]     # Line 360
            y_batch = y_shuffled[start_idx:end_idx]     # Line 361
            
            # Forward and backward pass
            self.forward(X_batch)                       # Line 364
            self.backward(X_batch, y_batch, learning_rate)  # Line 365
        
        # Evaluate on full sets
        train_pred = self.forward(X_train)              # Line 368
        test_pred = self.forward(X_test)                # Line 369
        
        train_loss = self.compute_loss(y_train, train_pred)   # Line 371
        test_loss = self.compute_loss(y_test, test_pred)      # Line 372
        
        # Compute accuracy
        train_acc = np.mean(...)  # Line 374
        test_acc = np.mean(...)   # Line 375
        
        # Store and print
        self.loss_history.append(test_loss)             # Line 377
        self.accuracy_history.append(test_acc)          # Line 378
        print(...)                                       # Line 380
```

### Line-by-Line Breakdown

**Line 347: `n_batches = len(X_train) // batch_size`**
- Calculate number of batches per epoch
- With 60,000 images and batch_size=128: 60000 // 128 = 468 batches
- Each epoch will have 468 update steps
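
The batch arithmetic is worth spelling out, because integer division silently drops the remainder. A short sketch of the numbers involved:

```python
import math

n_samples, batch_size = 60_000, 128

n_batches = n_samples // batch_size                          # 468 full batches
used = n_batches * batch_size                                # 59904 samples seen per epoch
leftover = n_samples - used                                  # 96 samples skipped per epoch
n_batches_with_partial = math.ceil(n_samples / batch_size)   # 469 if the partial batch were kept

print(n_batches, used, leftover, n_batches_with_partial)
```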

**Line 349: `for epoch in range(epochs):`**
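- Main training loop: repeat 10 times (for epochs=10)
- Each epoch = one pass over the shuffled training data; because 60000 is not a multiple of 128, the last 96 shuffled samples are skipped each epoch (a different 96 each time)
- In each epoch, weights get updated 468 times (once per batch)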

**Line 351: `indices = np.random.permutation(len(X_train))`**
- Create random ordering of sample indices
- Example: [45, 120, 3, 99, 5, ...] (random order)
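- **Why shuffle?**
  - Keeps the network from seeing the same batches in the same order every epoch
  - Avoids long runs of a single class if the data is stored sorted by label, which would bias consecutive updates
  - Each epoch sees data in different order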

**Lines 352-353: Reorder training data**
```python
X_shuffled = X_train[indices]  # Reorder images
y_shuffled = y_train[indices]  # Reorder labels (keep alignment!)
```
- Apply shuffled indices to both images and labels
- Maintains alignment: image[i] still matches label[i]
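
A small sketch of why a single shared permutation keeps images and labels paired. The toy arrays are assumptions for illustration: each label is derived from its row so the pairing can be verified after shuffling.

```python
import numpy as np

rng = np.random.default_rng(42)

X = np.arange(10).reshape(5, 2)   # five "images" with two features each
y = X[:, 0] // 2                  # labels 0..4, recoverable from each row

indices = rng.permutation(len(X))
X_shuffled = X[indices]           # same index array applied to both
y_shuffled = y[indices]

# Recompute each label from its shuffled row and compare
aligned = np.array_equal(X_shuffled[:, 0] // 2, y_shuffled)
print(indices, aligned)
```

Shuffling `X` and `y` with two separate calls to `permutation` would break this pairing, which is why the index array is created once and reused.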

**Line 356: `for i in range(n_batches):`**
- Loop through all 468 batches
- Each iteration processes 128 samples

**Lines 357-358: Calculate batch boundaries**
```python
start_idx = i * batch_size      # Batch 0: 0, Batch 1: 128, Batch 2: 256, ...
end_idx = start_idx + batch_size  # Batch 0: 128, Batch 1: 256, ...
```

**Lines 360-361: Extract batch**
```python
X_batch = X_shuffled[start_idx:end_idx]  # 128 images
y_batch = y_shuffled[start_idx:end_idx]  # 128 labels
```
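
If the final partial batch should also be used, one option is to step through the data with `range(start, stop, step)` instead of a precomputed batch count. This is a sketch of an alternative, not the original loop; `iter_minibatches` is a hypothetical helper name.

```python
import numpy as np

def iter_minibatches(X, y, batch_size):
    """Yield (X_batch, y_batch) slices; the last batch may be smaller (hypothetical helper)."""
    for start in range(0, len(X), batch_size):
        end = start + batch_size          # slicing past the end is safe in NumPy
        yield X[start:end], y[start:end]

X = np.zeros((300, 784))
y = np.zeros((300, 10))
batch_sizes = [len(xb) for xb, _ in iter_minibatches(X, y, 128)]
print(batch_sizes)  # [128, 128, 44]
```

Because `backward()` divides by `X.shape[0]`, a smaller final batch would still be averaged correctly.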

**Line 364: `self.forward(X_batch)`**
- Forward pass on 128 images
- Computes predictions (`self.a2`) and stores the intermediate values `self.z1` and `self.a1` that `backward()` reads

**Line 365: `self.backward(X_batch, y_batch, learning_rate)`**
- Backward pass: compute gradients for this batch
- Update weights based on these 128 samples

**Lines 368-369: Evaluate after epoch**
```python
train_pred = self.forward(X_train)  # Predict on ALL 60K training images
test_pred = self.forward(X_test)    # Predict on ALL 10K test images
```
- After processing all 468 batches, evaluate on complete datasets
- Test accuracy shows generalization

**Lines 371-372: Compute loss**
```python
train_loss = self.compute_loss(y_train, train_pred)  # Loss on training set
test_loss = self.compute_loss(y_test, test_pred)    # Loss on test set
```

**Lines 374-375: Compute accuracy**
```python
train_acc = np.mean(np.argmax(train_pred, axis=1) == np.argmax(y_train, axis=1))
test_acc = np.mean(np.argmax(test_pred, axis=1) == np.argmax(y_test, axis=1))
```
- Convert predictions to class indices
- Compare with true class indices
- Take the mean of the matches: the fraction correct, between 0 and 1 (multiply by 100 for a percentage)
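
A tiny worked example of the accuracy computation, using made-up probabilities for four samples and three classes:

```python
import numpy as np

y_true = np.eye(3)[[0, 1, 2, 1]]        # one-hot labels for classes 0, 1, 2, 1
y_pred = np.array([[0.7, 0.2, 0.1],     # predicts 0 (correct)
                   [0.1, 0.8, 0.1],     # predicts 1 (correct)
                   [0.5, 0.3, 0.2],     # predicts 0 (wrong, true class is 2)
                   [0.2, 0.6, 0.2]])    # predicts 1 (correct)

correct = np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1)
accuracy = np.mean(correct)             # booleans average as 0s and 1s
print(correct, accuracy)                # 3 of 4 correct -> 0.75
```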

**Lines 377-378: Store history**
```python
self.loss_history.append(test_loss)      # Save for plotting
self.accuracy_history.append(test_acc)   # Save for plotting
```

### Training Flow Example

```
Epoch 1:
  Shuffle 60K images
  Batch 1: Process images 0-127, update weights
  Batch 2: Process images 128-255, update weights
  ...
  Batch 468: Process images 59776-59903, update weights
  → Evaluate: Train loss=0.8, Test loss=0.9, Train acc=78%, Test acc=76%

Epoch 2:
  Shuffle 60K images (different order!)
  Batch 1-468: Same process with new order
  → Evaluate: Train loss=0.5, Test loss=0.6, Train acc=85%, Test acc=84%

...(repeat until epoch 10)...

Epoch 10:
  Shuffle and process all batches
  → Evaluate: Train loss=0.2, Test loss=0.3, Train acc=96%, Test acc=95%
```

## Function 4: `predict(self, X)`

**Purpose:** Make predictions on new data.

**Location in code:** Lines 386-389

### The Complete Function

```python
def predict(self, X):
    """Make predictions"""
    probabilities = self.forward(X)              # Line 388
    return np.argmax(probabilities, axis=1)      # Line 389
```

### Line-by-Line Breakdown

**Line 388: `probabilities = self.forward(X)`**
- Run forward pass on input X
- Input X shape: (n_samples, 784) - e.g., (100, 784) for 100 test images
- Output shape: (n_samples, 10) - probabilities for each class
- Each row contains 10 probabilities that sum to 1
- Example:
  ```
  Sample 1: [0.01, 0.05, 0.02, 0.88, 0.01, 0.00, 0.01, 0.01, 0.01, 0.00]
  Sample 2: [0.02, 0.92, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.00, 0.00]
  Sample 3: [0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]
  ```
  - Sample 1 is most confident it's a 3 (88% probability)
  - Sample 2 is most confident it's a 1 (92% probability)
  - Sample 3 is uncertain about everything (10% each)

**Line 389: `np.argmax(probabilities, axis=1)`**
- **argmax** = "find the index of the maximum value"
- For each row (axis=1), find which column has the highest probability
- Example:
  ```
  [0.01, 0.05, 0.02, 0.88, 0.01, ...] → 3 (highest at index 3)
  [0.02, 0.92, 0.01, 0.01, 0.01, ...] → 1 (highest at index 1)
  [0.10, 0.10, 0.10, 0.10, 0.10, ...] → 0 (first maximum at index 0)
  ```
- Output shape: (n_samples,) - single predicted class per sample
- Example output: [3, 1, 0, ...]

**Line 389: `return ...`**
- Return array of predicted class indices

### Why Two Steps?

**Why not just return probabilities?**
- `forward()` gives probabilities (soft output): [0.1, 0.2, 0.7]
- We need class labels (hard output): 2
- `argmax` converts from "how confident?" to "which class?"
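
When the confidence is also needed, both pieces can be returned together. The sketch below is one possible extension; `predict_with_confidence` is a hypothetical name and is not part of the original class.

```python
import numpy as np

def predict_with_confidence(probabilities):
    """Return the predicted class and its probability for each row (hypothetical helper)."""
    labels = np.argmax(probabilities, axis=1)      # which class?
    confidence = np.max(probabilities, axis=1)     # how confident?
    return labels, confidence

probabilities = np.array([[0.1, 0.2, 0.7],
                          [0.4, 0.35, 0.25]])
labels, confidence = predict_with_confidence(probabilities)
print(labels, confidence)  # classes [2, 0] with confidences [0.7, 0.4]
```

The second row shows why the probability can matter: the label is 0, but with only 40% confidence.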

### Visual Example

```
Input: 5 test images
  X shape: (5, 784)

After forward():
  probabilities = [
    [0.0, 0.0, 0.05, 0.90, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05],  # Probably 3
    [0.0, 0.95, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.05],   # Probably 1
    [0.2, 0.2, 0.2, 0.2, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0],      # Uncertain
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.98, 0.02],    # Probably 8
    [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.55]  # Probably 9
  ]

After argmax(axis=1):
  predictions = [3, 1, 0, 8, 9]
  (Pick the class with highest probability for each sample)
```

### Common Use Cases

```python
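import numpy as np
from sklearn.metrics import accuracy_score  # assumes scikit-learn is installed

# Make predictions on test set
predictions = nn.predict(X_test)  # Array of class indices, e.g. [3, 1, 5, 2, ...]

# Compare with true labels (assumes y_test is one-hot, as in train())
y_test_labels = np.argmax(y_test, axis=1)
accuracy = accuracy_score(y_test_labels, predictions)

# Predict on single image
new_image = X_test[0:1]  # Slice keeps the batch dimension: shape (1, 784)
pred = nn.predict(new_image)  # Length-1 array holding the predicted class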
```

## Summary Table

| Function | Input | Output | Purpose | When Called |
|----------|-------|--------|---------|-------------|
| `compute_loss()` | True labels (n, 10)<br/>Predictions (n, 10) | Scalar loss value | Measure prediction error | Once per epoch, on the full train and test sets |
| `backward()` | Batch images (128, 784)<br/>Batch labels (128, 10) | None<br/>(updates weights) | Backprop + weight update | Every batch (468 times/epoch) |
| `train()` | 60K train imgs<br/>60K train labels<br/>10K test imgs<br/>10K test labels | None<br/>(stores history) | Run complete training | Once at start |
| `predict()` | Test images (n, 784) | Class indices (n,) | Convert probabilities to predictions | After training, for evaluation |

---

## The Complete Pipeline

```
1. nn = DigitRecognitionNetwork()  # Create network
   
2. nn.train(X_train, y_train, X_test, y_test, epochs=10)
   └─ For each epoch:
      ├─ Shuffle data
      ├─ For each batch:
      │  ├─ forward() → compute predictions
      │  └─ backward() → compute gradients, update weights
      ├─ forward() + compute_loss() → loss on full train and test sets
      └─ Print accuracy and loss
   
3. predictions = nn.predict(X_test)  # Make predictions
   ├─ forward() → get probabilities
   └─ argmax() → convert to class indices
   
4. accuracy = accuracy_score(y_test_labels, predictions)  # Evaluate
```