# Neural Network Fundamentals

## Part 5: Training - Learning from Mistakes

### The Brain's Decision Committee - Chapter 5

---

## The Story So Far...

In Part 4, our committee member attempted their first classification task. They looked at images of vertical and horizontal lines and tried to identify them. The results were... not great. With random weights, they achieved about 50% accuracy - no better than flipping a coin.

**But here's the beautiful thing about neural networks: they can learn from their mistakes.**

In this notebook, we'll teach our Perceptron how to improve. We'll show it examples, tell it when it's wrong, and let it gradually adjust its weights until it becomes an expert line detector.

This is **training** - the heart of machine learning.

---

## What You'll Learn in Part 5

This is one of the most important notebooks in the series. By the end, you will understand:

1. **Loss Functions** - How to measure "how wrong" a prediction is
2. **Why We Square Errors** - The mathematical reason behind MSE
3. **Binary Cross-Entropy** - The preferred loss for classification (and why!)
4. **Gradient Descent** - The algorithm that finds better weights
5. **Learning Rate** - How fast to adjust (and what happens if it's wrong)
6. **The Gradient** - The direction of steepest improvement
7. **Backpropagation** - How errors flow backward through the network
8. **The Training Loop** - Putting it all together
9. **Watch It Learn** - See the Perceptron go from 50% to 95%+ accuracy!

---

## Prerequisites

Make sure you've completed:
- **Parts 0-1:** Matrices (`neural_network_fundamentals.ipynb`)
- **Part 2:** Single Neuron (`part_2_single_neuron.ipynb`)
- **Part 3:** Activation Functions (`part_3_activation_functions.ipynb`)
- **Part 4:** The Perceptron (`part_4_perceptron.ipynb`)


---

## Setup: Import Dependencies and Recreate Our Tools

Let's bring in everything we built in previous notebooks.


In [None]:
# =============================================================================
# PART 5: TRAINING - SETUP AND IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

print("Setup complete!")
print("="*60)


In [None]:
# =============================================================================
# RECREATE OUR TOOLS FROM PREVIOUS NOTEBOOKS
# =============================================================================

# -----------------------------------------------------------------------------
# Our canonical line images (from Part 1)
# -----------------------------------------------------------------------------
vertical_line = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
horizontal_line = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])
vertical_flat = vertical_line.flatten()
horizontal_flat = horizontal_line.flatten()

# -----------------------------------------------------------------------------
# Dataset generator (from Part 4)
# -----------------------------------------------------------------------------
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (label=1) and horizontal (label=0) line images."""
    if seed is not None:
        np.random.seed(seed)
    
    X, y = [], []
    
    for i in range(n_samples):
        image = np.zeros((3, 3))
        
        if i < n_samples // 2:  # Vertical lines
            col = np.random.randint(0, 3)
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:  # Horizontal lines
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)
    
    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# -----------------------------------------------------------------------------
# Sigmoid activation function (from Part 3)
# -----------------------------------------------------------------------------
def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# -----------------------------------------------------------------------------
# Basic Perceptron class (from Part 4) - We'll add training later!
# -----------------------------------------------------------------------------
class Perceptron:
    """A single-layer Perceptron for binary classification."""
    
    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs
    
    def forward(self, x):
        """Compute the forward pass."""
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)
    
    def predict(self, x):
        """Make a binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0

# Generate our training dataset
X_train, y_train = generate_line_dataset(n_samples=100, noise_level=0.0, seed=42)

print("Tools recreated from previous notebooks!")
print(f"  - Vertical/Horizontal line templates")
print(f"  - Dataset generator")
print(f"  - Sigmoid activation")
print(f"  - Basic Perceptron class")
print(f"\nTraining dataset: {len(X_train)} samples")
print(f"  - {sum(y_train)} vertical lines (label=1)")
print(f"  - {len(y_train) - sum(y_train)} horizontal lines (label=0)")


---

## 5.1 The Error: How Wrong Are We?

Before we can improve, we need to measure **how wrong** our predictions are. This is the foundation of learning.

### The Basic Idea

When our Perceptron makes a prediction, we compare it to the actual answer:

```
Error = Actual Value - Predicted Value
      = y - ŷ
```

### A Concrete Example

Let's say we show the Perceptron a **vertical line** (actual label y = 1):

| Scenario | Prediction (ŷ) | Error (y - ŷ) | Interpretation |
|----------|----------------|---------------|----------------|
| Perfect | 1.0 | 1.0 - 1.0 = 0.0 | No error! |
| Good | 0.9 | 1.0 - 0.9 = 0.1 | Small error |
| Bad | 0.3 | 1.0 - 0.3 = 0.7 | Big error! |
| Terrible | 0.0 | 1.0 - 0.0 = 1.0 | Maximum error |

### Committee Analogy

*"The committee member votes on a case. After the vote, the supervisor reveals the correct answer. The difference between their vote and the correct answer is their ERROR - and they need to learn from it."*

### Why Error Matters

The error tells us two things:
1. **How much** to adjust (larger error = bigger adjustment needed)
2. **Which direction** to adjust (positive error = increase output, negative = decrease)

Let's see this with real numbers:


In [None]:
# =============================================================================
# CALCULATING ERROR: Step by Step
# =============================================================================

# Create an untrained Perceptron
perceptron = Perceptron(n_inputs=9)

print("="*70)
print("CALCULATING ERROR: Step by Step")
print("="*70)

# Test on a vertical line (actual label = 1)
print("\n" + "-"*70)
print("Example 1: Testing on a VERTICAL line")
print("-"*70)

y_actual = 1  # The true label (it IS a vertical line)
y_predicted = perceptron.forward(vertical_flat)

print(f"\n  Step 1: Get the actual label")
print(f"          y (actual) = {y_actual}")
print(f"          This means: 'This IS a vertical line'")

print(f"\n  Step 2: Get the prediction from our Perceptron")
print(f"          ŷ (predicted) = {y_predicted:.4f}")
print(f"          This means: '{y_predicted*100:.1f}% confident it's vertical'")

print(f"\n  Step 3: Calculate the error")
print(f"          error = y - ŷ")
print(f"          error = {y_actual} - {y_predicted:.4f}")
error_vertical = y_actual - y_predicted
print(f"          error = {error_vertical:.4f}")

print(f"\n  Interpretation:")
if error_vertical > 0:
    print(f"          The error is POSITIVE ({error_vertical:.4f})")
    print(f"          This means: The Perceptron underestimated! It should output HIGHER.")
else:
    print(f"          The error is NEGATIVE ({error_vertical:.4f})")
    print(f"          This means: The Perceptron overestimated! It should output LOWER.")

# Test on a horizontal line (actual label = 0)
print("\n" + "-"*70)
print("Example 2: Testing on a HORIZONTAL line")
print("-"*70)

y_actual_h = 0  # The true label (it is NOT a vertical line)
y_predicted_h = perceptron.forward(horizontal_flat)

print(f"\n  Step 1: Get the actual label")
print(f"          y (actual) = {y_actual_h}")
print(f"          This means: 'This is NOT a vertical line'")

print(f"\n  Step 2: Get the prediction from our Perceptron")
print(f"          ŷ (predicted) = {y_predicted_h:.4f}")
print(f"          This means: '{y_predicted_h*100:.1f}% confident it's vertical'")

print(f"\n  Step 3: Calculate the error")
print(f"          error = y - ŷ")
print(f"          error = {y_actual_h} - {y_predicted_h:.4f}")
error_horizontal = y_actual_h - y_predicted_h
print(f"          error = {error_horizontal:.4f}")

print(f"\n  Interpretation:")
if error_horizontal > 0:
    print(f"          The error is POSITIVE ({error_horizontal:.4f})")
    print(f"          This means: The Perceptron underestimated!")
elif error_horizontal < 0:
    print(f"          The error is NEGATIVE ({error_horizontal:.4f})")
    print(f"          This means: The Perceptron overestimated! It should output LOWER.")
else:
    print(f"          The error is ZERO - perfect prediction!")


---

## 5.2 Loss Functions: The Teacher's Grading System

Before we look at specific formulas, let's understand **what a loss function is and why we need one**.

### What is a Loss Function?

A **loss function** (also called a "cost function" or "objective function") is a mathematical formula that:
- Takes in predictions and actual labels
- Outputs a **single number** representing "how wrong" the predictions are
- **Lower is better** - a loss of 0 means perfect predictions

### Why Do We Need Loss Functions?

Think about learning anything - you need **feedback** to improve. The loss function provides that feedback:

| Without Loss Function | With Loss Function |
|----------------------|-------------------|
| "Your predictions are wrong" | "Your predictions are 0.25 wrong" |
| Vague, not actionable | Precise, quantifiable |
| Can't compare methods | Can compare: 0.25 vs 0.15 |
| Can't track progress | Can see improvement over time |

### The Role of Loss in Training

Loss functions are the **heart of machine learning**. The entire training process is:
1. Make predictions
2. **Calculate loss** (how wrong?)
3. Adjust weights to **reduce loss**
4. Repeat

**The weights that minimize loss are the "best" weights** - that's the entire goal of training!

### Committee Analogy

*"The loss function is like a performance review score. Every time the committee member makes a decision, they get a score. A perfect decision scores 0. A terrible decision scores high. The member's goal is to adjust their behavior to minimize this score over time."*

---

## 5.2.1 Mean Squared Error (MSE): Our First Loss Function

Now let's look at a specific loss function: **Mean Squared Error** (MSE).

Simple error (y - ŷ) has a problem: when we average it over several samples, positive and negative errors can cancel out!

**Example:** If we have two predictions:
- Prediction 1: error = +0.5 (underestimated)
- Prediction 2: error = -0.5 (overestimated)
- Average error = (+0.5 + -0.5) / 2 = 0 ← Looks perfect, but it's NOT!

### The Solution: Square the Errors

By squaring each error before averaging, we solve this problem:

$$\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

Let's break this formula down piece by piece:

| Symbol | Meaning | Example |
|--------|---------|---------|
| $n$ | Number of samples | 100 images |
| $y_i$ | Actual label for sample $i$ | 1 (vertical) |
| $\hat{y}_i$ | Predicted value for sample $i$ | 0.7 |
| $(y_i - \hat{y}_i)$ | Error for sample $i$ | 1 - 0.7 = 0.3 |
| $(y_i - \hat{y}_i)^2$ | Squared error | 0.3² = 0.09 |
| $\frac{1}{n}\sum$ | Average of all squared errors | Mean |

### Why Square?

Squaring the errors has three important benefits:

1. **No Cancellation:** Positive and negative errors both become positive
2. **Penalize Big Errors:** A small error (0.1) becomes tiny (0.01), but a big error (0.9) becomes large (0.81)
3. **Smooth Landscape:** The squared error is smooth and differentiable everywhere, giving a "bowl"-like shape in simple cases that's easy to optimize (more on this later)
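The first two benefits are easy to check numerically. Here is a minimal sketch, assuming NumPy is available; the variable names (`example_errors`, `small_error`, `big_error`) are illustrative only and reuse the numbers from the text above:

```python
import numpy as np

# Benefit 1: signed errors cancel when averaged, squared errors do not
example_errors = np.array([0.5, -0.5])        # one under-, one over-estimate
mean_signed = np.mean(example_errors)
mean_squared = np.mean(example_errors ** 2)
print(f"Mean signed error:  {mean_signed:.2f}")
print(f"Mean squared error: {mean_squared:.2f}")

# Benefit 2: squaring widens the gap between small and big errors
small_error, big_error = 0.1, 0.9
ratio_before = big_error / small_error
ratio_after = big_error ** 2 / small_error ** 2
print(f"Big/small ratio before squaring: {ratio_before:.0f}x")
print(f"Big/small ratio after squaring:  {ratio_after:.0f}x")
```

The big error is 9 times the small one before squaring and 81 times after, which is why MSE pushes the model hardest on its worst predictions.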

### Let's Calculate MSE Step by Step:


In [None]:
# =============================================================================
# MEAN SQUARED ERROR: Step by Step Calculation
# =============================================================================

print("="*70)
print("MEAN SQUARED ERROR (MSE): Step by Step")
print("="*70)

# Let's use 5 samples to make this clear
sample_actuals = np.array([1, 1, 0, 0, 1])       # True labels
sample_predictions = np.array([0.9, 0.6, 0.3, 0.1, 0.5])  # Our predictions

print("\nOur data:")
print(f"  Actual labels (y):      {sample_actuals}")
print(f"  Predictions (ŷ):        {sample_predictions}")

# Step 1: Calculate each error
print("\n" + "-"*70)
print("STEP 1: Calculate each error (y - ŷ)")
print("-"*70)
errors = sample_actuals - sample_predictions
print(f"\n  Sample 1: {sample_actuals[0]} - {sample_predictions[0]} = {errors[0]:.2f}")
print(f"  Sample 2: {sample_actuals[1]} - {sample_predictions[1]} = {errors[1]:.2f}")
print(f"  Sample 3: {sample_actuals[2]} - {sample_predictions[2]} = {errors[2]:.2f}")
print(f"  Sample 4: {sample_actuals[3]} - {sample_predictions[3]} = {errors[3]:.2f}")
print(f"  Sample 5: {sample_actuals[4]} - {sample_predictions[4]} = {errors[4]:.2f}")
print(f"\n  All errors: {errors}")

# Step 2: Square each error
print("\n" + "-"*70)
print("STEP 2: Square each error (to make all positive)")
print("-"*70)
squared_errors = errors ** 2
print(f"\n  Sample 1: ({errors[0]:.2f})² = {squared_errors[0]:.4f}")
print(f"  Sample 2: ({errors[1]:.2f})² = {squared_errors[1]:.4f}")
print(f"  Sample 3: ({errors[2]:.2f})² = {squared_errors[2]:.4f}")
print(f"  Sample 4: ({errors[3]:.2f})² = {squared_errors[3]:.4f}")
print(f"  Sample 5: ({errors[4]:.2f})² = {squared_errors[4]:.4f}")
print(f"\n  Squared errors: {squared_errors}")

# Step 3: Take the mean
print("\n" + "-"*70)
print("STEP 3: Take the mean (average)")
print("-"*70)
mse = np.mean(squared_errors)
print(f"\n  Sum of squared errors: {np.sum(squared_errors):.4f}")
print(f"  Number of samples: {len(squared_errors)}")
print(f"  MSE = Sum / n = {np.sum(squared_errors):.4f} / {len(squared_errors)}")
print(f"  MSE = {mse:.4f}")

# The MSE function
print("\n" + "-"*70)
print("THE MSE FUNCTION (for reuse)")
print("-"*70)

def mse_loss(y_true, y_pred):
    """
    Mean Squared Error loss function.
    
    Formula: MSE = (1/n) * Σ(y - ŷ)²
    
    Parameters:
        y_true: Array of actual labels (0 or 1)
        y_pred: Array of predicted probabilities (0 to 1)
    
    Returns:
        Single value representing average squared error
    """
    return np.mean((y_true - y_pred) ** 2)

# Verify our calculation
print(f"\n  Using our function: mse_loss(y, ŷ) = {mse_loss(sample_actuals, sample_predictions):.4f}")
print(f"  Our manual calculation: {mse:.4f}")
print(f"  Match: {'Yes!' if abs(mse_loss(sample_actuals, sample_predictions) - mse) < 0.0001 else 'No'}")


---

## 5.3 Binary Cross-Entropy: The Better Loss for Classification

MSE works, but for **classification** problems (like our V/H detection), there's a better loss function: **Binary Cross-Entropy** (BCE).

### First, Let's Understand the Name

The name "Binary Cross-Entropy" has three parts:

| Term | Meaning | Our Context |
|------|---------|-------------|
| **Binary** | Two classes only | Vertical (1) or Horizontal (0) |
| **Cross** | Comparing two distributions | Comparing predictions vs reality |
| **Entropy** | Measure of uncertainty/surprise | How "surprised" we are by the outcome |

**Entropy** comes from information theory. It measures uncertainty:
- If something is certain (100% probability), entropy is 0 - no surprise!
- If something is uncertain (50/50), entropy is high - maximum surprise!

**Cross-entropy** compares what we PREDICTED against what ACTUALLY happened.
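To make "entropy as uncertainty" concrete, here is a small sketch of the entropy of a yes/no event. The helper name `binary_entropy` and the tiny clipping constant are assumptions for illustration, not part of the notebook's toolkit:

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in nats) of a yes/no event that happens with probability p."""
    p = np.clip(p, 1e-15, 1 - 1e-15)  # keep log() away from 0
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

# Entropy shrinks as the outcome becomes more certain
for p in [0.5, 0.9, 0.99, 1.0]:
    print(f"p = {p:<5} entropy = {binary_entropy(p):.4f}")
```

Entropy peaks at p = 0.5 (where it equals ln 2 ≈ 0.693) and falls to essentially 0 as p approaches 1, matching the two bullet points above.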

### Why Not Just Use MSE?

MSE works for regression (predicting continuous values like house prices), but classification has a special property: **we're predicting probabilities**.

**The Problem with MSE:** When the prediction is very wrong (e.g., predicting 0.01 for a true label of 1), MSE gives an error of 0.99² = 0.98. That's bad, but is it bad *enough*?

Consider: predicting 0.01 when you should predict 1.0 means you were **99% confident and COMPLETELY wrong**. That deserves a HUGE penalty!

**BCE's Solution:** BCE uses logarithms, which give **much harsher penalties** for confident wrong answers.

### The Logarithm: Why It's Perfect for This

The **logarithm** is a special mathematical function. Here's why it works for measuring surprise:

- `log(1) = 0` → If probability was 100%, no surprise at all
- `log(0.5) ≈ -0.69` → Uncertain, some surprise
- `log(0.1) ≈ -2.30` → Low probability, big surprise!
- `log(0.01) ≈ -4.61` → Very low probability, huge surprise!
- `log(0)` is undefined (it tends to -∞) → Zero probability, infinite surprise (impossible event!)

The negative sign flips these to positive loss values: `-log(0.01) = 4.61`

### The Intuition: Measuring "Surprise"

Think of cross-entropy as measuring **how surprised** you are by the actual answer:

| Prediction | Actual | BCE Value | Interpretation |
|------------|--------|-----------|----------------|
| 0.99 | 1 | 0.01 | "Not surprised at all - I expected this!" |
| 0.5 | 1 | 0.69 | "Somewhat surprised - I was uncertain" |
| 0.01 | 1 | 4.61 | "VERY surprised! I was confident it was NOT 1!" |
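The table values can be checked directly, side by side with the squared error for the same predictions. This is a minimal sketch, assuming NumPy and a true label of y = 1; the array names are illustrative:

```python
import numpy as np

# Predictions for a sample whose true label is y = 1
y_hat = np.array([0.99, 0.5, 0.01])

squared_error = (1.0 - y_hat) ** 2   # per-sample MSE term
log_loss = -np.log(y_hat)            # per-sample "surprise" when y = 1

for p, se, ll in zip(y_hat, squared_error, log_loss):
    print(f"ŷ = {p:<5} squared error = {se:.4f}   -log(ŷ) = {ll:.4f}")
```

For predictions between 0 and 1 the squared error can never exceed 1, while `-log(ŷ)` keeps growing as ŷ approaches 0. That is the "harsher penalty" described earlier.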

### The Mathematics

$$\text{BCE} = -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \cdot \log(\hat{y}_i) + (1-y_i) \cdot \log(1-\hat{y}_i) \right]$$

This looks scary! Let's break it down:

**When the actual label y = 1 (it IS vertical):**
- The formula simplifies to: $-\log(\hat{y})$
- If we predicted high (ŷ = 0.9): $-\log(0.9) = 0.105$ (low loss - good!)
- If we predicted low (ŷ = 0.1): $-\log(0.1) = 2.303$ (high loss - bad!)

**When the actual label y = 0 (it is NOT vertical):**
- The formula simplifies to: $-\log(1 - \hat{y})$
- If we predicted low (ŷ = 0.1): $-\log(0.9) = 0.105$ (low loss - good!)
- If we predicted high (ŷ = 0.9): $-\log(0.1) = 2.303$ (high loss - bad!)
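Both branches of the formula can be sketched with one small helper. The name `bce_single` is hypothetical, and the sketch skips clipping because every prediction used here is strictly between 0 and 1:

```python
import numpy as np

def bce_single(y, y_hat):
    """BCE for a single sample (illustrative helper, no clipping)."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# (actual label, prediction) pairs from the four bullet points above
cases = [(1, 0.9), (1, 0.1), (0, 0.1), (0, 0.9)]
for y, y_hat in cases:
    print(f"y = {y}, ŷ = {y_hat}: BCE = {bce_single(y, y_hat):.3f}")
```

Whichever label is correct, only one of the two terms survives, so a confident right answer costs about 0.105 and a confident wrong answer costs about 2.303 in both directions.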

### Committee Analogy

*"BCE measures how embarrassed the committee member should be. If they confidently voted 'definitely vertical!' (0.99) and it turned out to be horizontal, they should be VERY embarrassed. The logarithm captures this severe penalty for confident wrong answers."*

### Let's Implement and Compare:


In [None]:
# =============================================================================
# BINARY CROSS-ENTROPY: Step by Step
# =============================================================================

print("="*70)
print("BINARY CROSS-ENTROPY (BCE): Step by Step")
print("="*70)

# First, let's understand the log function
print("\n" + "-"*70)
print("UNDERSTANDING THE LOGARITHM")
print("-"*70)
print("""
The natural log (ln or log) has a special property:
  - log(1) = 0        (no surprise when probability matches reality)
  - log(0.5) = -0.69  (some surprise)
  - log(0.1) = -2.30  (very surprised!)
  - log(0.01) = -4.61 (extremely surprised!)

As the probability gets closer to 0, log goes to -infinity.
That's why BCE severely punishes confident wrong predictions!
""")

# Show the log curve
print("  Let's calculate -log(ŷ) for different predictions:")
predictions = [0.99, 0.9, 0.7, 0.5, 0.3, 0.1, 0.01]
print(f"\n  {'Prediction (ŷ)':<18} {'-log(ŷ)':<15} {'Interpretation'}")
print("  " + "-"*60)
for p in predictions:
    neg_log = -np.log(p)
    if neg_log < 0.5:
        interp = "Low loss (good prediction)"
    elif neg_log < 1.5:
        interp = "Medium loss"
    else:
        interp = "High loss (bad prediction!)"
    print(f"  {p:<18} {neg_log:<15.4f} {interp}")

print("\n" + "-"*70)
print("BCE CALCULATION FOR A SINGLE SAMPLE")
print("-"*70)

# Example 1: Actual is 1, prediction is 0.9 (good prediction)
y_true_1 = 1
y_pred_1 = 0.9

print(f"\n  Example 1: Actual y = {y_true_1}, Predicted ŷ = {y_pred_1}")
print(f"  (This is a GOOD prediction for a vertical line)")
print(f"\n  BCE formula: -[y * log(ŷ) + (1-y) * log(1-ŷ)]")
print(f"\n  Since y = 1, the (1-y) term becomes 0, so:")
print(f"  BCE = -[{y_true_1} * log({y_pred_1}) + 0]")
print(f"  BCE = -log({y_pred_1})")
print(f"  BCE = -({np.log(y_pred_1):.4f})")
bce_1 = -np.log(y_pred_1)
print(f"  BCE = {bce_1:.4f}")

# Example 2: Actual is 1, prediction is 0.1 (bad prediction)
y_true_2 = 1
y_pred_2 = 0.1

print(f"\n  Example 2: Actual y = {y_true_2}, Predicted ŷ = {y_pred_2}")
print(f"  (This is a BAD prediction for a vertical line)")
print(f"\n  Since y = 1:")
print(f"  BCE = -log({y_pred_2})")
print(f"  BCE = -({np.log(y_pred_2):.4f})")
bce_2 = -np.log(y_pred_2)
print(f"  BCE = {bce_2:.4f}")

print(f"\n  Notice: The bad prediction has {bce_2/bce_1:.1f}x higher loss!")

# The BCE function
print("\n" + "-"*70)
print("THE BCE FUNCTION (for reuse)")
print("-"*70)

def binary_cross_entropy(y_true, y_pred):
    """
    Binary Cross-Entropy loss function.
    
    Formula: BCE = -(1/n) * Σ[y*log(ŷ) + (1-y)*log(1-ŷ)]
    
    Parameters:
        y_true: Array of actual labels (0 or 1)
        y_pred: Array of predicted probabilities (0 to 1)
    
    Returns:
        Single value representing average cross-entropy loss
    """
    # Clip predictions to avoid log(0) which is undefined
    epsilon = 1e-15  # A tiny number
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    # Calculate BCE
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return bce

print("""
def binary_cross_entropy(y_true, y_pred):
    # Clip to avoid log(0) - would be undefined!
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    
    # BCE formula
    return -np.mean(y_true * np.log(y_pred) + 
                    (1 - y_true) * np.log(1 - y_pred))
""")


In [None]:
# =============================================================================
# VISUALIZE: MSE vs BCE
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Generate prediction values from 0.01 to 0.99
y_pred_range = np.linspace(0.01, 0.99, 100)

# When actual label is 1 (vertical line)
mse_when_y_is_1 = (1 - y_pred_range) ** 2
bce_when_y_is_1 = -np.log(y_pred_range)

# Plot for y = 1
ax1 = axes[0]
ax1.plot(y_pred_range, mse_when_y_is_1, 'b-', linewidth=2, label='MSE')
ax1.plot(y_pred_range, bce_when_y_is_1, 'r-', linewidth=2, label='BCE')
ax1.set_xlabel('Prediction (ŷ)', fontsize=12)
ax1.set_ylabel('Loss', fontsize=12)
ax1.set_title('When Actual y = 1 (Vertical Line)\nLower prediction = Higher loss', fontsize=12, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlim(0, 1)
ax1.set_ylim(0, 5)
ax1.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)
ax1.annotate('If we predict 0.1\nBCE = 2.3 (harsh!)\nMSE = 0.81', 
             xy=(0.1, 2.3), xytext=(0.3, 3.5),
             fontsize=9, arrowprops=dict(arrowstyle='->', color='red'))

# When actual label is 0 (horizontal line)
mse_when_y_is_0 = y_pred_range ** 2
bce_when_y_is_0 = -np.log(1 - y_pred_range)

# Plot for y = 0
ax2 = axes[1]
ax2.plot(y_pred_range, mse_when_y_is_0, 'b-', linewidth=2, label='MSE')
ax2.plot(y_pred_range, bce_when_y_is_0, 'r-', linewidth=2, label='BCE')
ax2.set_xlabel('Prediction (ŷ)', fontsize=12)
ax2.set_ylabel('Loss', fontsize=12)
ax2.set_title('When Actual y = 0 (Horizontal Line)\nHigher prediction = Higher loss', fontsize=12, fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlim(0, 1)
ax2.set_ylim(0, 5)
ax2.axvline(x=0.5, color='gray', linestyle=':', alpha=0.5)
ax2.annotate('If we predict 0.9\nBCE = 2.3 (harsh!)\nMSE = 0.81', 
             xy=(0.9, 2.3), xytext=(0.5, 3.5),
             fontsize=9, arrowprops=dict(arrowstyle='->', color='red'))

plt.tight_layout()
plt.show()

print("\nKEY INSIGHT: BCE vs MSE")
print("="*60)
print("""
Notice how BCE (red line) rises much more steeply than MSE (blue line)
as predictions get worse?

This is why BCE is preferred for classification:
  - It SEVERELY punishes confident wrong predictions
  - A prediction of 0.1 when the answer is 1 has BCE loss of 2.3
  - The same prediction has MSE loss of only 0.81

BCE creates stronger learning signals when the model is confidently wrong,
which usually helps it learn faster than it would with MSE!
""")


---

## 5.4 Gradient Descent: Finding Better Weights

Now we know HOW WRONG we are (the loss). But how do we make our predictions BETTER?

This is where **optimization** comes in - the process of finding the best values for our weights.

### The Optimization Problem

Our Perceptron has 9 weights + 1 bias = **10 numbers** to choose. Each combination of these 10 numbers gives different predictions and a different loss.

**The Question:** Out of the infinite possible combinations, which gives the LOWEST loss?

**The Naive Approach:** Try all combinations! 
- But with continuous numbers, there are infinitely many combinations
- Even with just 100 values per parameter: 100^10 = 10^20 combinations
- That's more than the number of grains of sand on Earth!

**The Smart Approach:** Use mathematics to **guide our search** toward better values.

### What is a Derivative? (A Quick Refresher)

The **derivative** tells you how much one quantity changes when you change another.

**Simple Example:** You're driving a car.
- Position = where you are
- Derivative of position = **speed** (how fast position changes)
- Derivative of speed = **acceleration** (how fast speed changes)

**For Our Loss Function:**
- Loss = how wrong we are
- Derivative of loss w.r.t. weight = how much loss changes when we change the weight

If the derivative is:
- **Positive:** Increasing the weight increases loss → we should DECREASE the weight
- **Negative:** Increasing the weight decreases loss → we should INCREASE the weight
- **Zero:** The slope is flat - we're at a minimum (or a maximum, or another flat spot)!
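The three cases can be sketched by estimating the slope numerically. This assumes a toy loss L(w) = w² + 0.5 chosen purely for illustration; `toy_loss` and `numerical_derivative` are hypothetical helpers, and the central-difference estimate is an approximation, not how training computes gradients:

```python
def toy_loss(w):
    """Toy loss with its minimum at w = 0 (an assumption for illustration)."""
    return w ** 2 + 0.5

def numerical_derivative(f, w, h=1e-6):
    """Estimate the slope of f at w by nudging w a tiny bit each way."""
    return (f(w + h) - f(w - h)) / (2 * h)

for w in [2.0, -2.0, 0.0]:
    slope = numerical_derivative(toy_loss, w)
    if slope > 1e-6:
        advice = "positive slope -> DECREASE the weight"
    elif slope < -1e-6:
        advice = "negative slope -> INCREASE the weight"
    else:
        advice = "flat -> nothing to gain by moving"
    print(f"w = {w:+.1f}: slope ≈ {slope:+.4f}  ({advice})")
```

The sign of the slope alone tells us which way to move the weight. The size of the slope hints at how far we are from the flat spot.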

### The Key Idea: The Loss Landscape

Imagine the loss as a **landscape** where:
- The **height** at any point = how wrong we are (higher = worse)
- The **position** = our current weights
- Our **goal** = find the lowest point (minimum loss)

We want to "roll downhill" until we find the bottom!

### The Algorithm: Gradient Descent

**Gradient** means "slope" - it tells us which way is uphill.

**Gradient Descent** means:
1. Look at the slope where we are
2. Take a step in the **opposite direction** (downhill)
3. Repeat until we reach the bottom

### The Mathematics

$$w_{new} = w_{old} - \alpha \cdot \frac{\partial L}{\partial w}$$

Let's break this down:

| Symbol | Meaning | Intuition |
|--------|---------|-----------|
| $w_{new}$ | Updated weight | Where we're going |
| $w_{old}$ | Current weight | Where we are |
| $\alpha$ | Learning rate | How big a step to take |
| $\frac{\partial L}{\partial w}$ | Gradient (slope) | Which way is uphill |
| $-$ | Subtraction | We go OPPOSITE to uphill (= downhill) |
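A single update is simple enough to work through by hand. This sketch assumes a toy loss L(w) = w² + 0.5 (so the gradient is 2w), a starting weight of 2.5 and a learning rate of 0.3; all three are illustrative choices:

```python
def toy_loss(w):
    """Toy loss L(w) = w² + 0.5 (an assumption for illustration)."""
    return w ** 2 + 0.5

w_old = 2.5          # where we are
alpha = 0.3          # learning rate: how big a step to take
gradient = 2 * w_old # dL/dw for the toy loss: which way is uphill

w_new = w_old - alpha * gradient  # step in the OPPOSITE direction

print(f"gradient = {gradient:.2f}")
print(f"w_new = {w_old} - {alpha} * {gradient:.2f} = {w_new:.2f}")
print(f"loss: {toy_loss(w_old):.2f} -> {toy_loss(w_new):.2f}")
```

One step moves the weight from 2.5 to 1.0 and drops the toy loss from 6.75 to 1.50.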

### Committee Analogy

*"The gradient is like a compass that always points uphill. We want to go DOWNHILL (less error), so we walk in the opposite direction. The learning rate decides whether we take small careful steps or big bold leaps."*

### Let's Visualize This:


In [None]:
# =============================================================================
# VISUALIZE: The Loss Landscape and Gradient Descent
# =============================================================================

# Create a simple 1D loss landscape (parabola)
# This represents how loss changes as we change ONE weight
weight_values = np.linspace(-3, 3, 100)
loss_landscape = weight_values ** 2 + 0.5  # Simple parabola with minimum at w=0

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: The loss landscape
ax1 = axes[0]
ax1.plot(weight_values, loss_landscape, 'b-', linewidth=3)
ax1.fill_between(weight_values, loss_landscape, alpha=0.2)
ax1.set_xlabel('Weight Value (w)', fontsize=12)
ax1.set_ylabel('Loss (L)', fontsize=12)
ax1.set_title('The Loss Landscape\n(For a Single Weight)', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)

# Mark the minimum
ax1.scatter([0], [0.5], color='green', s=200, zorder=5, marker='*', label='Minimum (goal)')
ax1.annotate('Our goal: Find this minimum!', xy=(0, 0.5), xytext=(0.5, 2),
            fontsize=10, arrowprops=dict(arrowstyle='->', color='green'))

# Show current position
current_w = 2.0
current_loss = current_w ** 2 + 0.5
ax1.scatter([current_w], [current_loss], color='red', s=150, zorder=5, label='Current position')
ax1.axvline(x=current_w, color='red', linestyle='--', alpha=0.3)
ax1.legend()

# Plot 2: Gradient descent animation (multiple steps)
ax2 = axes[1]
ax2.plot(weight_values, loss_landscape, 'b-', linewidth=2, alpha=0.5)
ax2.fill_between(weight_values, loss_landscape, alpha=0.1)
ax2.set_xlabel('Weight Value (w)', fontsize=12)
ax2.set_ylabel('Loss (L)', fontsize=12)
ax2.set_title('Gradient Descent: Rolling Downhill\n(Learning Rate α = 0.3)', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)

# Simulate gradient descent
learning_rate = 0.3
w = 2.5  # Starting position
path = [(w, w**2 + 0.5)]

for step in range(8):
    gradient = 2 * w  # Derivative of w² is 2w
    w = w - learning_rate * gradient  # Gradient descent update
    loss = w ** 2 + 0.5
    path.append((w, loss))

# Plot the path
path = np.array(path)
ax2.plot(path[:, 0], path[:, 1], 'ro-', markersize=8, linewidth=2, label='Gradient descent path')
ax2.scatter([path[0, 0]], [path[0, 1]], color='red', s=200, zorder=5, marker='o', label='Start')
ax2.scatter([path[-1, 0]], [path[-1, 1]], color='green', s=200, zorder=5, marker='*', label='End (near minimum)')

# Add step numbers
for i, (w_val, l_val) in enumerate(path):
    ax2.annotate(f'{i}', xy=(w_val, l_val), xytext=(w_val+0.1, l_val+0.3),
                fontsize=9, fontweight='bold')

ax2.legend(loc='upper right')

plt.tight_layout()
plt.show()

print("\nGRADIENT DESCENT STEPS:")
print("="*60)
print(f"{'Step':<6} {'Weight (w)':<15} {'Gradient (2w)':<15} {'Update':<20} {'Loss'}")
print("-"*60)
w = 2.5
for step in range(6):
    gradient = 2 * w
    update = -learning_rate * gradient
    loss = w ** 2 + 0.5
    print(f"{step:<6} {w:<15.4f} {gradient:<15.4f} {update:<20.4f} {loss:.4f}")
    w = w + update  # Same as w = w - learning_rate * gradient

print("-"*60)
print(f"\nStarted at w = 2.5 (loss = 6.75)")
print(f"After 6 updates: w = {w:.4f} (loss = {w**2 + 0.5:.4f})")
print(f"Getting closer to the minimum at w = 0 (loss = 0.5)!")


---

## 5.5 Learning Rate: How Fast to Adjust

The **learning rate** (α, alpha) controls how big each step is. It's one of the most important choices in training!

### Parameters vs Hyperparameters

Before we dive in, let's clarify an important distinction:

| Term | What It Is | Examples | Who Sets It? |
|------|------------|----------|--------------|
| **Parameters** | Values the model LEARNS | Weights, Bias | The training algorithm |
| **Hyperparameters** | Settings WE choose before training | Learning rate, number of epochs | The human (you!) |

The learning rate is a **hyperparameter** - we choose it before training, and it affects HOW the model learns (but is not learned itself).

### Why Learning Rate Matters So Much

The learning rate multiplies the gradient to determine the step size:

```
step = learning_rate × gradient
new_weight = old_weight - step
```

**The Problem:** Gradients can vary wildly:
- Sometimes the gradient is 10.0 (steep slope)
- Sometimes it's 0.001 (nearly flat)

**The Learning Rate's Job:** Scale these gradients to reasonable step sizes.

### The Goldilocks Problem

| Learning Rate | Effect | Problem |
|---------------|--------|---------|
| **Too Large** (α = 1.0) | Big steps | Overshoot! Miss the minimum, bounce around |
| **Too Small** (α = 0.001) | Tiny steps | Very slow; may stall on flat regions before reaching the minimum |
| **Just Right** (α = 0.1) | Medium steps | Steady progress toward minimum |
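
For the toy loss `L(w) = w² + 0.5` used in this notebook's plots, each update multiplies `w` by `(1 - 2α)`, so the three regimes can be checked directly. This is a minimal sketch: `shrink_factor` and `simulate` are hypothetical helper names, and the thresholds apply only to this particular loss, not to training in general.

```python
def shrink_factor(alpha):
    """Factor by which w is multiplied each step for L(w) = w**2 + 0.5.

    Update: w_new = w - alpha * 2w = (1 - 2*alpha) * w
    """
    return 1 - 2 * alpha


def simulate(start_w, alpha, steps):
    """Apply plain gradient descent (no clipping) and return the final w."""
    w = start_w
    for _ in range(steps):
        w = w - alpha * (2 * w)  # gradient of w**2 + 0.5 is 2w
    return w


for alpha in (1.0, 0.1, 0.001):
    factor = shrink_factor(alpha)
    final_w = simulate(start_w=2.5, alpha=alpha, steps=20)
    print(f"alpha={alpha:<6} factor per step={factor:+.3f}  w after 20 steps={final_w:+.4f}")
```

With α = 1.0 the factor is -1, so `w` flips between +2.5 and -2.5 and never gets closer to 0. With α = 0.001 the factor is 0.998, so 20 steps barely move it.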

### The Mathematics

Remember our update formula:

$$w_{new} = w_{old} - \alpha \cdot \text{gradient}$$

- If gradient = 10 and α = 0.1: step size = 1.0 (reasonable)
- If gradient = 10 and α = 1.0: step size = 10.0 (too big!)
- If gradient = 10 and α = 0.001: step size = 0.01 (too small!)
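
The same arithmetic can be tabulated for both a steep and a nearly flat gradient. The values below are the ones quoted in this section; the dictionary name `step_sizes` is just an illustrative choice.

```python
gradients = [10.0, 0.001]           # steep slope vs. nearly flat (values from the text)
learning_rates = [1.0, 0.1, 0.001]

step_sizes = {}
for g in gradients:
    for alpha in learning_rates:
        step_sizes[(g, alpha)] = alpha * g   # step = learning_rate × gradient
        print(f"gradient={g:<6} alpha={alpha:<6} step={step_sizes[(g, alpha)]:.6f}")
```

Notice that a learning rate that looks reasonable on the steep slope produces an almost invisible step on the flat one, which is why a single value is always a compromise.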

### Committee Analogy

*"The learning rate is how much the committee member adjusts after each mistake. Too much adjustment, and they overcorrect wildly. Too little, and they never improve. The right amount leads to steady learning."*

### Let's See All Three Scenarios:


In [None]:
# =============================================================================
# VISUALIZE: Learning Rate Effects
# =============================================================================

def run_gradient_descent(start_w, learning_rate, steps=15):
    """Run gradient descent and return the path."""
    w = start_w
    path = [(w, w**2 + 0.5)]
    for _ in range(steps):
        gradient = 2 * w  # Derivative of w²
        w = w - learning_rate * gradient
        w = np.clip(w, -5, 5)  # Prevent explosion
        loss = w ** 2 + 0.5
        path.append((w, loss))
    return np.array(path)

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
weight_values = np.linspace(-3, 3, 100)
loss_landscape = weight_values ** 2 + 0.5

scenarios = [
    (0.9, 'TOO LARGE (α=0.9)', 'red', 'Overshoots and bounces!'),
    (0.3, 'JUST RIGHT (α=0.3)', 'green', 'Steady progress!'),
    (0.05, 'TOO SMALL (α=0.05)', 'blue', 'Very slow progress...')
]

for ax, (lr, title, color, desc) in zip(axes, scenarios):
    ax.plot(weight_values, loss_landscape, 'k-', linewidth=1, alpha=0.3)
    ax.fill_between(weight_values, loss_landscape, alpha=0.1, color='gray')
    
    path = run_gradient_descent(start_w=2.5, learning_rate=lr)
    ax.plot(path[:, 0], path[:, 1], 'o-', color=color, markersize=6, linewidth=1.5)
    ax.scatter([path[0, 0]], [path[0, 1]], color=color, s=150, zorder=5, marker='s', label='Start')
    ax.scatter([path[-1, 0]], [path[-1, 1]], color='black', s=150, zorder=5, marker='*', label='End')
    
    ax.set_xlabel('Weight (w)', fontsize=11)
    ax.set_ylabel('Loss', fontsize=11)
    ax.set_title(f'{title}\n{desc}', fontsize=11, fontweight='bold')
    ax.set_xlim(-3, 3)
    ax.set_ylim(0, 10)
    ax.grid(True, alpha=0.3)
    ax.legend(loc='upper right', fontsize=9)
    
    # Show final loss
    ax.annotate(f'Final loss: {path[-1, 1]:.2f}', xy=(0, 8), fontsize=10, ha='center')

plt.tight_layout()
plt.show()

print("\nLEARNING RATE COMPARISON:")
print("="*60)
for lr, title, _, _ in scenarios:
    path = run_gradient_descent(start_w=2.5, learning_rate=lr)
    print(f"\n{title}")
    print(f"  Final weight: {path[-1, 0]:.4f}")
    print(f"  Final loss:   {path[-1, 1]:.4f}")
    print(f"  Optimal loss: 0.5000 (at w=0)")
    print(f"  Distance from optimal: {abs(path[-1, 1] - 0.5):.4f}")


---

## 5.6 The Gradient: Which Way is Down?

We've been using the word "gradient" - but what IS it exactly, and how do we calculate it?

### What is a Gradient?

The **gradient** is the **derivative** (slope) of the loss with respect to each weight. It tells us:
- **How much** the loss changes when we change a weight
- **Which direction** increases the loss (so we go the opposite way!)

### Regular Derivatives vs Partial Derivatives

**Regular derivative:** When you have ONE variable.
- Example: If `f(x) = x²`, then `df/dx = 2x`

**Partial derivative (∂):** When you have MULTIPLE variables and you want to see the effect of changing just ONE while keeping others fixed.
- Example: If `f(x, y) = x² + y²`, then:
  - `∂f/∂x = 2x` (how f changes when x changes, y held constant)
  - `∂f/∂y = 2y` (how f changes when y changes, x held constant)

**In our Perceptron:**
- Loss depends on 9 weights + 1 bias = 10 variables
- We need 10 partial derivatives (one for each parameter)
- The **gradient** is the collection of ALL these partial derivatives
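
One way to build intuition for partial derivatives is to estimate them numerically: nudge one variable, hold the other fixed, and measure how the output changes. The sketch below does this for the `f(x, y) = x² + y²` example above; `numerical_partials` is a hypothetical helper, not something defined earlier in the notebook.

```python
def f(x, y):
    return x**2 + y**2


def numerical_partials(func, x, y, h=1e-6):
    """Central-difference estimates of (∂f/∂x, ∂f/∂y) at the point (x, y)."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)  # y held constant
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)  # x held constant
    return df_dx, df_dy


x0, y0 = 3.0, -2.0
approx = numerical_partials(f, x0, y0)
exact = (2 * x0, 2 * y0)   # analytic partials: 2x and 2y
print("numerical:", approx)
print("analytic: ", exact)
```

The two results should agree to several decimal places, which confirms that `2x` and `2y` really are the slopes in each direction.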

### The Notation "w.r.t." (With Respect To)

You'll often see "gradient of L w.r.t. w" - this means "how does L change when we change w?"

`∂L/∂w` is read as "partial derivative of L with respect to w"

### The Chain Rule: Breaking Down Complex Functions

Our Perceptron has multiple operations chained together:

```
x → [weighted sum] → z → [sigmoid] → ŷ → [BCE loss] → L
    w · x + b                                
```

To find how changing **w** affects the final loss **L**, we use the **chain rule**:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial w}$$

This looks complicated, but each piece is simple!

### The Beautiful Simplification

For sigmoid activation with BCE loss, all the calculus simplifies to:

$$\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$$

And for the bias:

$$\frac{\partial L}{\partial b} = (\hat{y} - y)$$

That's it! The gradient is just:
- **(prediction - actual)** × **input**
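
For readers who want to see where the simplification comes from, here is a sketch of the derivation for a single sample, assuming the natural logarithm in BCE and the sigmoid derivative $\hat{y}(1-\hat{y})$:

$$L = -\left[y \log \hat{y} + (1 - y)\log(1 - \hat{y})\right]$$

$$\frac{\partial L}{\partial \hat{y}} = -\frac{y}{\hat{y}} + \frac{1 - y}{1 - \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}$$

$$\frac{\partial \hat{y}}{\partial z} = \hat{y}(1 - \hat{y}), \qquad \frac{\partial z}{\partial w} = x$$

$$\frac{\partial L}{\partial w} = \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \cdot \hat{y}(1 - \hat{y}) \cdot x = (\hat{y} - y) \cdot x$$

The bias follows the same chain with $\frac{\partial z}{\partial b} = 1$, which gives $\frac{\partial L}{\partial b} = (\hat{y} - y)$.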

### Why This Formula Makes Intuitive Sense

| Part | Meaning | Intuition |
|------|---------|-----------|
| $(\hat{y} - y)$ | Error | How wrong we are (and in which direction) |
| $x$ | Input | Which inputs contributed to the output |

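If we predicted too high (ŷ > y), the error is positive, so gradient descent decreases the weights attached to positive inputs, which lowers z and therefore ŷ.
If an input was large, its weight changes more (because it had more influence on the output). If an input was 0, its weight does not change at all for this sample.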
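
A common sanity check for a hand-derived gradient is to compare it against a finite-difference estimate. The sketch below does that for the formula above. It redefines `sigmoid` so the block runs on its own, and the weights, input, and label are hypothetical example values rather than the notebook's trained model.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce_loss(w, b, x, y):
    """BCE loss of a single sigmoid neuron on one sample."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))


# Hypothetical example values (not the notebook's trained weights)
rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=9)
b = 0.0
x = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)  # middle column lit
y = 1.0

# Analytic gradient from the formula (ŷ - y) · x
y_hat = sigmoid(np.dot(w, x) + b)
analytic = (y_hat - y) * x

# Numerical gradient: nudge one weight at a time (central difference)
h = 1e-6
numerical = np.zeros_like(w)
for i in range(len(w)):
    step = np.zeros_like(w)
    step[i] = h
    numerical[i] = (bce_loss(w + step, b, x, y) - bce_loss(w - step, b, x, y)) / (2 * h)

max_diff = np.max(np.abs(analytic - numerical))
print("max difference:", max_diff)
```

A very small maximum difference indicates that the closed-form gradient matches the actual slope of the loss.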

### Let's Calculate Gradients Step by Step:


In [None]:
# =============================================================================
# CALCULATING GRADIENTS: Step by Step
# =============================================================================

print("="*70)
print("CALCULATING GRADIENTS: Step by Step")
print("="*70)

# Use our Perceptron on a vertical line
x = vertical_flat.copy()
y_true = 1  # It IS vertical

# Get the prediction
y_pred = perceptron.forward(x)

print(f"\nInput (x): {x}")
print(f"Actual label (y): {y_true}")
print(f"Prediction (ŷ): {y_pred:.4f}")

# Calculate the error term
print("\n" + "-"*70)
print("STEP 1: Calculate the error term (ŷ - y)")
print("-"*70)
error = y_pred - y_true
print(f"\n  error = ŷ - y")
print(f"  error = {y_pred:.4f} - {y_true}")
print(f"  error = {error:.4f}")

if error > 0:
    print(f"\n  Interpretation: Error is POSITIVE ({error:.4f})")
    print(f"  This means we predicted TOO HIGH - need to decrease output")
else:
    print(f"\n  Interpretation: Error is NEGATIVE ({error:.4f})")
    print(f"  This means we predicted TOO LOW - need to increase output")

# Calculate gradient for weights
print("\n" + "-"*70)
print("STEP 2: Calculate gradient for each weight")
print("-"*70)
print(f"\n  Formula: ∂L/∂w = (ŷ - y) × x = error × x")
print(f"\n  For each weight w_i, the gradient is: error × x_i")

gradient_weights = error * x
print(f"\n  Gradients for all 9 weights:")
print(f"  error × x = {error:.4f} × {x}")
print(f"           = [{', '.join([f'{g:.4f}' for g in gradient_weights])}]")

# Show which weights should change
print(f"\n  Let's interpret this (as a 3x3 grid):")
grad_grid = gradient_weights.reshape(3, 3)
print(f"    {grad_grid[0]}")
print(f"    {grad_grid[1]}")
print(f"    {grad_grid[2]}")

print(f"\n  Notice: Only the middle column has non-zero gradients!")
print(f"  That's because only those pixels had value 1 in the input.")
print(f"  Weights for other pixels don't need to change (input was 0).")

# Calculate gradient for bias
print("\n" + "-"*70)
print("STEP 3: Calculate gradient for bias")
print("-"*70)
gradient_bias = error
print(f"\n  Formula: ∂L/∂b = (ŷ - y) = error")
print(f"  Bias gradient = {gradient_bias:.4f}")

# Show the update
print("\n" + "-"*70)
print("STEP 4: Apply the update (with learning rate α = 0.5)")
print("-"*70)
learning_rate = 0.5
print(f"\n  Update formula: w_new = w_old - α × gradient")
print(f"\n  For weight w₁ (position 1, middle column):")
old_w1 = perceptron.weights[1]
new_w1 = old_w1 - learning_rate * gradient_weights[1]
print(f"    w₁_new = {old_w1:.4f} - {learning_rate} × {gradient_weights[1]:.4f}")
print(f"    w₁_new = {old_w1:.4f} - {learning_rate * gradient_weights[1]:.4f}")
print(f"    w₁_new = {new_w1:.4f}")
print(f"\n  Since error was negative, w₁ INCREASED to make output higher next time!")


---

## 5.7 Backpropagation: Tracing the Blame

**Backpropagation** ("backprop") is the algorithm that calculates gradients by flowing errors BACKWARD through the network.

### Why Backpropagation is Revolutionary

Before backpropagation was popularized in 1986 (by Rumelhart, Hinton, and Williams), training neural networks was incredibly difficult. People didn't know how to efficiently calculate gradients for networks with many layers.

**The Problem:** In a network with multiple layers, changing one weight affects EVERYTHING that comes after it. How do you figure out exactly how much each weight contributed to the final error?

**The Solution:** Backpropagation! It uses the chain rule to efficiently calculate ALL gradients in ONE backward pass through the network.

### The Name Explained

- **Back:** We start from the OUTPUT (the error) and work BACKWARD
- **Propagation:** The error "propagates" (spreads) to earlier layers

Think of it like blame assignment:
1. The final output was wrong
2. What caused it to be wrong? Trace backward...
3. These specific weights were most responsible
4. Adjust them accordingly

### For Our Single Neuron

In our simple Perceptron, backpropagation is straightforward:

```
    FORWARD PASS (left to right):
    x → [w·x + b] → z → [sigmoid] → ŷ → [compare to y] → Loss
    
    BACKWARD PASS (right to left):
    Loss → ∂L/∂ŷ → ∂L/∂z → ∂L/∂w, ∂L/∂b
           ↑          ↑          ↑
        "How does   "How does   "How does
         loss       loss        loss
         change     change      change
         with ŷ?"   with z?"    with w,b?"
```

### Committee Analogy

*"Backpropagation is like a post-mortem after a mistake. The committee asks: 'What went wrong?' They trace the decision back: 'The final vote was wrong. Why? The weighted sum was off. Why? These specific weights gave too much importance to the wrong evidence.' Then they adjust those specific weights."*

### The Backprop Flow for Our Perceptron

| Step | Calculation | Formula |
|------|-------------|---------|
| 1 | Loss gradient w.r.t. output | $\frac{\partial L}{\partial \hat{y}} = \frac{\hat{y} - y}{\hat{y}(1-\hat{y})}$ (from BCE) |
| 2 | Output gradient w.r.t. pre-activation | $\frac{\partial \hat{y}}{\partial z} = \hat{y}(1-\hat{y})$ (sigmoid derivative) |
| 3 | Pre-activation gradient w.r.t. weights | $\frac{\partial z}{\partial w} = x$ |
| 4 | Chain them together | $\frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x$ |

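The beautiful thing: steps 1 and 2 combine to give us just $(\hat{y} - y)$, because the $\hat{y}(1-\hat{y})$ from the sigmoid derivative cancels the denominator of the BCE derivative!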
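
The table rows can also be computed one at a time to watch the cancellation happen. This is a minimal sketch with hypothetical single-sample values; the variable names (`dL_dyhat`, `dyhat_dz`, and so on) are illustrative and do not appear elsewhere in the notebook.

```python
import numpy as np

# Hypothetical single-sample values for illustration
x = np.array([0, 1, 0, 0, 1, 0, 0, 1, 0], dtype=float)
y = 1.0
w = np.full(9, 0.1)
b = 0.0

# Forward pass
z = np.dot(w, x) + b
y_hat = 1.0 / (1.0 + np.exp(-z))

# Backward pass, one table row at a time
dL_dyhat = -(y / y_hat) + (1 - y) / (1 - y_hat)  # step 1: BCE derivative
dyhat_dz = y_hat * (1 - y_hat)                    # step 2: sigmoid derivative
dz_dw = x                                         # step 3: z = w·x + b
dL_dz = dL_dyhat * dyhat_dz                       # steps 1 and 2 combined
dL_dw = dL_dz * dz_dw                             # step 4: full chain
dL_db = dL_dz                                     # ∂z/∂b = 1

print(f"dL/dz via chain rule: {dL_dz:.6f}")
print(f"(y_hat - y):          {y_hat - y:.6f}")
```

Both printed numbers should match, which is the cancellation shown numerically.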


---

## 5.8 The Training Loop: Putting It All Together

Now we have all the pieces! Let's build the complete training algorithm.

### Why Do We Need a Loop?

A single gradient descent step makes only a TINY improvement. To go from random weights to good weights, we need MANY small steps.

**Illustrative Example** (numbers chosen for intuition, not measured):
- Start: Loss = 0.7, Accuracy = 50%
- After 1 step: Loss = 0.69, Accuracy = 51%  (tiny improvement)
- After 10 steps: Loss = 0.5, Accuracy = 65%
- After 100 steps: Loss = 0.1, Accuracy = 95%

Each step nudges the weights slightly. Over many steps, these tiny nudges accumulate into major improvements!

### Why Multiple Epochs?

**Problem:** One pass through the data isn't enough.
- With 100 samples, we only make 100 weight updates
- The model might not have "seen" enough patterns
- Early samples were processed with very different weights than late samples

**Solution:** Go through the data MULTIPLE times (epochs).
- Epoch 1: First exposure to all samples
- Epoch 2: Second look, with better weights now
- Epoch 3: Refinement continues
- ...

Each epoch, the model gets better at the task!

### The Training Loop Algorithm

```
FOR each epoch (pass through the data):
    FOR each sample (x, y) in the training data:
        
        1. FORWARD PASS: Get prediction
           ŷ = sigmoid(w · x + b)
        
        2. COMPUTE LOSS: How wrong?
           L = BCE(y, ŷ)
        
        3. COMPUTE GRADIENTS: Which way to go?
           ∂L/∂w = (ŷ - y) × x
           ∂L/∂b = (ŷ - y)
        
        4. UPDATE WEIGHTS: Take a step downhill
           w = w - α × ∂L/∂w
           b = b - α × ∂L/∂b
    
    Record average loss for this epoch
```

### Key Terms

| Term | Meaning |
|------|---------|
| **Epoch** | One complete pass through all training data |
| **Sample** | One training example (input + label) |
| **Batch** | Group of samples processed together (we use batch size = 1 here) |
| **Iteration** | One weight update |
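
These terms are tied together by simple arithmetic: the number of iterations per epoch is the number of samples divided by the batch size, rounded up. The sketch below uses a hypothetical helper, `updates_per_epoch`, and a hypothetical dataset of 100 samples trained for 50 epochs.

```python
import math


def updates_per_epoch(n_samples, batch_size=1):
    """Number of weight updates (iterations) in one epoch."""
    return math.ceil(n_samples / batch_size)


n_samples = 100      # hypothetical dataset size
epochs = 50          # hypothetical number of passes

per_epoch = updates_per_epoch(n_samples, batch_size=1)
total = per_epoch * epochs
print(f"{per_epoch} iterations per epoch, {total} iterations in total")
```

With batch size 1, every sample triggers its own update, so iterations per epoch equals the number of samples.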

### Let's Build Our Trainable Perceptron!


In [None]:
# =============================================================================
# THE TRAINABLE PERCEPTRON: Complete Implementation
# =============================================================================

class TrainablePerceptron:
    """
    A Perceptron that can learn from examples!
    
    This class includes:
    - Forward pass (prediction)
    - Loss calculation (BCE)
    - Gradient calculation (backpropagation)
    - Weight update (gradient descent)
    - Full training loop
    """
    
    def __init__(self, n_inputs):
        """Initialize with random weights."""
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs
        
        # For tracking training progress
        self.loss_history = []
        self.accuracy_history = []
    
    def forward(self, x):
        """Forward pass: compute prediction."""
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)
    
    def predict(self, x):
        """Binary prediction (0 or 1)."""
        return 1 if self.forward(x) >= 0.5 else 0
    
    def compute_loss(self, y_true, y_pred):
        """Compute BCE loss for one sample."""
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    
    def train(self, X, y, learning_rate=0.1, epochs=100, verbose=True):
        """
        Train the Perceptron using gradient descent.
        
        Parameters:
            X: Training inputs, shape (n_samples, n_features)
            y: Training labels, shape (n_samples,)
            learning_rate: Step size for gradient descent
            epochs: Number of passes through the data
            verbose: Whether to print progress
        
        Returns:
            List of losses for each epoch
        """
        self.loss_history = []
        self.accuracy_history = []
        
        if verbose:
            print("="*70)
            print("TRAINING STARTED")
            print("="*70)
            print(f"  Samples: {len(X)}")
            print(f"  Epochs: {epochs}")
            print(f"  Learning rate: {learning_rate}")
            print()
        
        for epoch in range(epochs):
            total_loss = 0
            correct = 0
            
            # Go through each training sample
            for i in range(len(X)):
                xi = np.asarray(X[i]).flatten()  # Input (flattened, matching forward())
                yi = y[i]  # True label
                
                # ===== STEP 1: FORWARD PASS =====
                y_pred = self.forward(xi)
                
                # ===== STEP 2: COMPUTE LOSS =====
                loss = self.compute_loss(yi, y_pred)
                total_loss += loss
                
                # Count correct predictions
                if (y_pred >= 0.5 and yi == 1) or (y_pred < 0.5 and yi == 0):
                    correct += 1
                
                # ===== STEP 3: COMPUTE GRADIENTS =====
                # The beautiful simplification: gradient = (prediction - actual) * input
                error = y_pred - yi
                gradient_weights = error * xi
                gradient_bias = error
                
                # ===== STEP 4: UPDATE WEIGHTS =====
                self.weights = self.weights - learning_rate * gradient_weights
                self.bias = self.bias - learning_rate * gradient_bias
            
            # Record progress
            avg_loss = total_loss / len(X)
            accuracy = correct / len(X)
            self.loss_history.append(avg_loss)
            self.accuracy_history.append(accuracy)
            
            # Print progress every 10 epochs
            if verbose and (epoch + 1) % 10 == 0:
                print(f"  Epoch {epoch+1:3d}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy*100:.1f}%")
        
        if verbose:
            print()
            print("="*70)
            print("TRAINING COMPLETE!")
            print("="*70)
            print(f"  Final Loss: {self.loss_history[-1]:.4f}")
            print(f"  Final Accuracy: {self.accuracy_history[-1]*100:.1f}%")
        
        return self.loss_history

print("TrainablePerceptron class created!")
print("Now let's train it and watch it learn...")


---

## 5.9 Watching It Learn!

This is the moment we've been building toward. Let's train our Perceptron and watch it transform from a confused guesser into an expert line detector!

### What to Watch For

During training, you'll see:

1. **Loss decreasing** - The model is making fewer/smaller mistakes
2. **Accuracy increasing** - More predictions are correct
3. **Eventually plateauing** - When the model has learned all it can

### Convergence: When Has the Model Learned Enough?

**Convergence** means the model has stopped improving significantly. Signs of convergence:

| Sign | What It Looks Like | What It Means |
|------|-------------------|---------------|
| Loss plateaus | Loss curve flattens out | Little further improvement at this learning rate |
| Loss oscillates | Jumps up and down slightly | Steps are bouncing around the minimum; a smaller learning rate may settle it |
| Accuracy stable | Stays at same level | Model has learned the pattern |

**When to Stop Training:**
- When loss stops decreasing for several epochs
- When accuracy reaches acceptable level (e.g., 95%+)
- When you've run out of patience!
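
The first rule ("loss stops decreasing for several epochs") can be turned into a small check on a list of per-epoch losses, such as the `loss_history` a model records during training. This is a sketch under assumptions: `should_stop`, `patience`, and `min_delta` are hypothetical names, and the loss curves are made up for illustration.

```python
def should_stop(loss_history, patience=5, min_delta=1e-4):
    """Return True if the loss has not improved by at least `min_delta`
    over the last `patience` epochs."""
    if len(loss_history) <= patience:
        return False  # not enough history yet
    best_before = min(loss_history[:-patience])
    best_recent = min(loss_history[-patience:])
    return best_before - best_recent < min_delta


# Hypothetical loss curves for illustration
still_improving = [0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
plateaued = [0.70, 0.30, 0.10, 0.10, 0.10, 0.10, 0.10, 0.10]

print(should_stop(still_improving))  # False: recent epochs are still better
print(should_stop(plateaued))        # True: no improvement in the last 5 epochs
```

The choice of `patience` and `min_delta` is itself a hyperparameter decision: larger patience tolerates noisy loss curves at the cost of extra epochs.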


In [None]:
# =============================================================================
# TRAINING THE PERCEPTRON: Watch It Learn!
# =============================================================================

# Create a fresh Perceptron
np.random.seed(42)  # For reproducibility
model = TrainablePerceptron(n_inputs=9)

# Check initial performance (before training)
print("BEFORE TRAINING:")
print("-"*40)
correct_before = sum(model.predict(X_train[i]) == y_train[i] for i in range(len(X_train)))
print(f"Accuracy: {correct_before}/{len(y_train)} = {correct_before/len(y_train)*100:.1f}%")
print("(Untrained weights: expect roughly chance-level accuracy)")
print()

# Train the model!
losses = model.train(X_train, y_train, learning_rate=0.5, epochs=50)

# Check final performance
print("\nAFTER TRAINING:")
print("-"*40)
correct_after = sum(model.predict(X_train[i]) == y_train[i] for i in range(len(X_train)))
print(f"Accuracy: {correct_after}/{len(y_train)} = {correct_after/len(y_train)*100:.1f}%")


In [None]:
# =============================================================================
# VISUALIZE THE LEARNING PROGRESS
# =============================================================================

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Plot 1: Loss over time
ax1 = axes[0]
ax1.plot(model.loss_history, 'b-', linewidth=2)
ax1.set_xlabel('Epoch', fontsize=12)
ax1.set_ylabel('Loss (BCE)', fontsize=12)
ax1.set_title('Loss Decreasing Over Time', fontsize=12, fontweight='bold')
ax1.grid(True, alpha=0.3)
ax1.annotate(f'Start: {model.loss_history[0]:.2f}', xy=(0, model.loss_history[0]), 
            xytext=(5, model.loss_history[0]+0.1), fontsize=10)
ax1.annotate(f'End: {model.loss_history[-1]:.2f}', xy=(len(model.loss_history)-1, model.loss_history[-1]), 
            xytext=(len(model.loss_history)-15, model.loss_history[-1]+0.1), fontsize=10)

# Plot 2: Accuracy over time
ax2 = axes[1]
ax2.plot([a*100 for a in model.accuracy_history], 'g-', linewidth=2)
ax2.set_xlabel('Epoch', fontsize=12)
ax2.set_ylabel('Accuracy (%)', fontsize=12)
ax2.set_title('Accuracy Increasing Over Time', fontsize=12, fontweight='bold')
ax2.grid(True, alpha=0.3)
ax2.set_ylim(0, 105)
ax2.axhline(y=50, color='red', linestyle='--', alpha=0.5, label='Random guessing')
ax2.legend()

# Plot 3: Learned weights (as 3x3 grid)
ax3 = axes[2]
weights_grid = model.weights.reshape(3, 3)
im = ax3.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax3.set_title('Learned Weights\n(What the model looks for)', fontsize=12, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax3.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center', fontsize=11, fontweight='bold')
plt.colorbar(im, ax=ax3)
ax3.set_xticks([])
ax3.set_yticks([])

plt.tight_layout()
plt.show()

print("\nKEY OBSERVATIONS:")
print("="*60)
print(f"1. Loss decreased from {model.loss_history[0]:.4f} to {model.loss_history[-1]:.4f}")
print(f"2. Accuracy improved from ~50% to {model.accuracy_history[-1]*100:.1f}%")
print(f"3. The learned weights show HIGH values in the middle column!")
print(f"   This is exactly what we'd expect for a vertical line detector!")


In [None]:
# =============================================================================
# TEST ON CANONICAL EXAMPLES: Before vs After
# =============================================================================

print("="*70)
print("TESTING ON CANONICAL EXAMPLES")
print("="*70)

# Test on vertical line
v_pred = model.forward(vertical_flat)
print(f"\nVertical Line:")
print(f"  Prediction: {v_pred:.4f} ({v_pred*100:.1f}% confident it's vertical)")
print(f"  Actual: 1 (IS vertical)")
print(f"  Result: {'CORRECT!' if v_pred >= 0.5 else 'Wrong'}")

# Test on horizontal line
h_pred = model.forward(horizontal_flat)
print(f"\nHorizontal Line:")
print(f"  Prediction: {h_pred:.4f} ({h_pred*100:.1f}% confident it's vertical)")
print(f"  Actual: 0 (NOT vertical)")
print(f"  Result: {'CORRECT!' if h_pred < 0.5 else 'Wrong'}")

print("\n" + "="*70)
print("THE PERCEPTRON HAS LEARNED!")
print("="*70)
print("""
From random weights giving ~50% accuracy,
our Perceptron now confidently classifies lines!

It learned that:
  - The MIDDLE COLUMN matters most for vertical lines
  - Other pixels should have low/negative weights
  
This happened automatically through gradient descent -
we never told it what a vertical line looks like!
""")


---

## Part 5 Summary: What We've Learned

This was one of the most important notebooks in the series! You've learned the core of how neural networks learn.

### Key Concepts Mastered

| Concept | Formula | Why It Matters |
|---------|---------|----------------|
| **Error** | ŷ - y | Measures how wrong we are (and in which direction) |
| **MSE Loss** | (1/n)Σ(y-ŷ)² | Penalizes errors, larger errors more |
| **BCE Loss** | -[y·log(ŷ) + (1-y)·log(1-ŷ)] | Better for classification, harsh on confident mistakes |
| **Gradient** | (ŷ - y) · x | Direction and steepness of loss increase (we step the opposite way) |
| **Gradient Descent** | w = w - α·∇L | Algorithm to find better weights |
| **Learning Rate** | α | Controls step size (too big = overshoot, too small = slow) |
| **Backpropagation** | Chain rule backward | Calculates gradients for all weights |

### The Training Loop (Memorize This!)

```python
for epoch in range(epochs):
    for x, y in training_data:
        y_pred = forward(x)           # 1. Predict
        loss = bce(y, y_pred)         # 2. Measure error
        error = y_pred - y            # 3. Calculate gradients
        weights -= lr * error * x     # 4. Update weights...
        bias -= lr * error            #    ...and the bias
```

### Committee Analogy Progress

| Part | Committee Story |
|------|-----------------|
| Part 1-3 | Member learned procedures (math, weights, voting) |
| Part 4 | First case - confused, random guessing |
| **Part 5** | **Member receives feedback and LEARNS!** |
| Part 6 | (Next) Evaluating the trained expert |

### The Big Picture

**Before Training:** Random weights → Random predictions → ~50% accuracy

**After Training:** Learned weights → Meaningful predictions → 95%+ accuracy

The Perceptron discovered ON ITS OWN that vertical lines have pixels in the middle column!

---

## Knowledge Check


In [1]:
# =============================================================================
# KNOWLEDGE CHECK - Part 5
# =============================================================================

print("KNOWLEDGE CHECK - Part 5: Training")
print("="*60)
print("\nAnswer these questions to test your understanding:\n")

questions = [
    {
        "q": "1. Why do we square errors in MSE?",
        "options": [
            "A) To make the math easier",
            "B) To prevent positive and negative errors from canceling out",
            "C) To make errors smaller",
            "D) Because computers prefer square numbers"
        ],
        "answer": "B",
        "explanation": "Squaring makes all errors positive, so they add up rather than cancel. It also penalizes larger errors more."
    },
    {
        "q": "2. Why is BCE preferred over MSE for classification?",
        "options": [
            "A) BCE is faster to compute",
            "B) BCE uses less memory",
            "C) BCE severely punishes confident wrong predictions",
            "D) BCE always gives lower values"
        ],
        "answer": "C",
        "explanation": "BCE uses logarithms which give very large penalties when the model is confident but wrong (e.g., predicting 0.01 when answer is 1)."
    },
    {
        "q": "3. What happens if the learning rate is too high?",
        "options": [
            "A) Training is faster and better",
            "B) The model overshoots the minimum and may never converge",
            "C) The model learns more features",
            "D) Nothing bad, higher is always better"
        ],
        "answer": "B",
        "explanation": "A high learning rate causes big jumps that overshoot the minimum, causing the loss to bounce around or even increase."
    },
    {
        "q": "4. The gradient formula for our Perceptron is (ŷ - y) × x. What does the 'x' part mean?",
        "options": [
            "A) Larger inputs get larger weight updates",
            "B) The input is added to the gradient",
            "C) X marks the spot",
            "D) Nothing, it's just mathematical convention"
        ],
        "answer": "A",
        "explanation": "The input 'x' determines which weights contributed to the output. Weights connected to larger inputs get larger updates because they had more influence."
    },
    {
        "q": "5. What is an 'epoch' in training?",
        "options": [
            "A) One weight update",
            "B) One forward pass",
            "C) One complete pass through all training data",
            "D) When the model reaches 100% accuracy"
        ],
        "answer": "C",
        "explanation": "An epoch is one complete pass through the entire training dataset. We typically train for many epochs until the model converges."
    }
]

for q in questions:
    print(q["q"])
    for opt in q["options"]:
        print(f"   {opt}")
    print()

print("\n" + "="*60)
print("Scroll down for answers...")
print("="*60)


In [None]:
# =============================================================================
# ANSWERS - Knowledge Check Part 5
# =============================================================================

print("ANSWERS - Part 5 Knowledge Check")
print("="*60)

for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f"   {q['explanation']}")

print("\n" + "="*60)
print("How did you do?")
print("  5/5: Training Master!")
print("  4/5: Solid understanding!")
print("  3/5: Review the sections you missed")
print("  <3:  Re-read Part 5 - these concepts are crucial!")
print("="*60)


---

## What's Next?

**Congratulations!** You've completed one of the most important notebooks in this series!

You now understand how neural networks **learn** - loss functions, gradient descent, and backpropagation are the foundation of ALL deep learning.

### Coming Up in Part 6: Evaluation - The Trained Expert

- **Training vs Inference** - Learning mode vs using mode
- **Accuracy Metrics** - Precision, recall, F1 score
- **Confusion Matrix** - Detailed prediction breakdown
- **Interpretability** - What did the model actually learn?

---

**Continue to Part 6:** `part_6_evaluation.ipynb`

---

*"The Perceptron has learned. Now it's time to see what it REALLY knows."*

**The Brain's Decision Committee** - From Confusion to Competence
