# Neural Network Fundamentals

## Part 6: Evaluation - The Trained Expert

### The Brain's Decision Committee - Chapter 6

---

## The Story So Far...

In Part 5, something remarkable happened: our committee member **learned**. Starting with random weights and ~50% accuracy, they adjusted their priorities through gradient descent until they became an expert vertical line detector with 95%+ accuracy.

**But how do we know they're actually good?** Getting 95% on training data is one thing, but:
- What kinds of mistakes do they still make?
- Are some errors worse than others?
- Can we understand *why* they make the decisions they do?

This is **evaluation** - properly assessing our trained model and understanding what it has learned.

---

## What You'll Learn in Part 6

By the end of this notebook, you will understand:

1. **Training vs Inference** - The difference between learning mode and using mode
2. **Accuracy** - The simplest metric (and its limitations)
3. **Confusion Matrix** - A detailed breakdown of all prediction types
4. **Precision & Recall** - Measuring different kinds of correctness
5. **F1 Score** - Balancing precision and recall
6. **Saliency/Interpretability** - What did the model actually learn?
7. **Test Sets** - Why we need data the model has never seen

---

## Prerequisites

Make sure you've completed:
- **Parts 0-1:** Matrices (`neural_network_fundamentals.ipynb`)
- **Part 2:** Single Neuron (`part_2_single_neuron.ipynb`)
- **Part 3:** Activation Functions (`part_3_activation_functions.ipynb`)
- **Part 4:** The Perceptron (`part_4_perceptron.ipynb`)
- **Part 5:** Training (`part_5_training.ipynb`)


---

## Setup: Import Dependencies and Recreate Our Trained Model

Let's bring in everything we need and train a model to evaluate.


In [None]:
# =============================================================================
# PART 6: EVALUATION - SETUP AND IMPORTS
# =============================================================================

import numpy as np
import matplotlib.pyplot as plt
from IPython.display import display, clear_output

# Try to import ipywidgets for interactive features
try:
    import ipywidgets as widgets
    WIDGETS_AVAILABLE = True
except ImportError:
    WIDGETS_AVAILABLE = False
    print("Note: ipywidgets not installed. Interactive features will be limited.")

# Set up matplotlib style
style_options = ['seaborn-v0_8-whitegrid', 'seaborn-whitegrid', 'ggplot', 'default']
for style in style_options:
    try:
        plt.style.use(style)
        break
    except OSError:
        continue

plt.rcParams['figure.figsize'] = [10, 6]
plt.rcParams['font.size'] = 12
np.random.seed(42)

print("Setup complete!")
print("="*60)


In [None]:
# =============================================================================
# RECREATE OUR TOOLS FROM PREVIOUS NOTEBOOKS
# =============================================================================

# -----------------------------------------------------------------------------
# Our canonical line images (from Part 1)
# -----------------------------------------------------------------------------
vertical_line = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
horizontal_line = np.array([[0, 0, 0], [1, 1, 1], [0, 0, 0]])
vertical_flat = vertical_line.flatten()
horizontal_flat = horizontal_line.flatten()

# -----------------------------------------------------------------------------
# Dataset generator (from Part 4)
# -----------------------------------------------------------------------------
def generate_line_dataset(n_samples=100, noise_level=0.0, seed=None):
    """Generate vertical (label=1) and horizontal (label=0) line images."""
    if seed is not None:
        np.random.seed(seed)
    
    X, y = [], []
    
    for i in range(n_samples):
        image = np.zeros((3, 3))
        
        if i < n_samples // 2:  # Vertical lines
            col = np.random.randint(0, 3)
            image[:, col] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(1)
        else:  # Horizontal lines
            row = np.random.randint(0, 3)
            image[row, :] = 1
            if noise_level > 0:
                image = np.clip(image + np.random.randn(3, 3) * noise_level, 0, 1)
            X.append(image.flatten())
            y.append(0)
    
    X, y = np.array(X), np.array(y)
    shuffle_idx = np.random.permutation(n_samples)
    return X[shuffle_idx], y[shuffle_idx]

# -----------------------------------------------------------------------------
# Sigmoid activation function (from Part 3)
# -----------------------------------------------------------------------------
def sigmoid(z):
    """Sigmoid activation: maps any value to range (0, 1)."""
    return 1 / (1 + np.exp(-np.clip(z, -500, 500)))

# -----------------------------------------------------------------------------
# TrainablePerceptron class (from Part 5)
# -----------------------------------------------------------------------------
class TrainablePerceptron:
    """A Perceptron that can learn from examples."""
    
    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.1
        self.bias = 0.0
        self.n_inputs = n_inputs
        self.loss_history = []
        self.accuracy_history = []
        self.is_trained = False  # Track if model has been trained
    
    def forward(self, x):
        x = np.array(x).flatten()
        z = np.dot(self.weights, x) + self.bias
        return sigmoid(z)
    
    def predict(self, x):
        return 1 if self.forward(x) >= 0.5 else 0
    
    def compute_loss(self, y_true, y_pred):
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    
    def train(self, X, y, learning_rate=0.1, epochs=100, verbose=True):
        self.loss_history = []
        self.accuracy_history = []
        
        for epoch in range(epochs):
            total_loss = 0
            correct = 0
            
            for i in range(len(X)):
                xi, yi = X[i], y[i]
                y_pred = self.forward(xi)
                loss = self.compute_loss(yi, y_pred)
                total_loss += loss
                
                if (y_pred >= 0.5 and yi == 1) or (y_pred < 0.5 and yi == 0):
                    correct += 1
                
                error = y_pred - yi
                self.weights = self.weights - learning_rate * error * xi
                self.bias = self.bias - learning_rate * error
            
            avg_loss = total_loss / len(X)
            accuracy = correct / len(X)
            self.loss_history.append(avg_loss)
            self.accuracy_history.append(accuracy)
            
            if verbose and (epoch + 1) % 10 == 0:
                print(f"  Epoch {epoch+1:3d}/{epochs}: Loss = {avg_loss:.4f}, Accuracy = {accuracy*100:.1f}%")
        
        self.is_trained = True
        
        if verbose:
            print(f"\nTraining complete! Final accuracy: {self.accuracy_history[-1]*100:.1f}%")
        
        return self.loss_history

print("Tools recreated from previous notebooks!")
print("  - Line image templates")
print("  - Dataset generator")
print("  - Sigmoid activation")
print("  - TrainablePerceptron class")


In [None]:
# =============================================================================
# TRAIN OUR MODEL (Quick recap from Part 5)
# =============================================================================

print("="*70)
print("TRAINING OUR MODEL (to have something to evaluate)")
print("="*70)

# Generate training data
np.random.seed(42)
X_train, y_train = generate_line_dataset(n_samples=100, noise_level=0.0, seed=42)

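# Generate TEST data (NEW! - a separate held-out set, never used for training)
# Note: with noise_level=0.0 only 6 distinct images exist (3 vertical, 3 horizontal),
# so test images repeat patterns from training; use noise_level > 0 for truly new images.
X_test, y_test = generate_line_dataset(n_samples=50, noise_level=0.0, seed=999)

print(f"\nTraining set: {len(X_train)} samples")
print(f"Test set: {len(X_test)} samples (held out - NEVER used for training!)")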

# Create and train model
model = TrainablePerceptron(n_inputs=9)
print("\nTraining...")
model.train(X_train, y_train, learning_rate=0.5, epochs=50, verbose=True)

print("\n" + "="*70)
print("Model is trained and ready for evaluation!")
print("="*70)


---

## 6.1 Training vs Inference: The Committee's Memory

Before we evaluate, let's understand an important distinction: **training mode** vs **inference mode**.

### What IS Inference?

The word **"inference"** comes from Latin *inferre* meaning "to bring in" or "to conclude." In machine learning:

**Inference = Using a trained model to make predictions on new data**

Think of it like this:
- **Training** = Teaching someone how to do a job
- **Inference** = That person actually doing the job

### Why Two Different Modes?

| Aspect | Training Mode | Inference Mode |
|--------|--------------|----------------|
| **Purpose** | Learn from examples | Make predictions |
| **Weights** | Being updated constantly | Frozen (fixed) |
| **Data** | Training set (with labels) | New, unseen data |
| **Speed** | Slower (computing gradients) | Fast (forward pass only) |
| **Goal** | Minimize loss | Predict accurately |

### Committee Analogy

*"During training, the committee is in a meeting room, debating cases, learning from mistakes, and updating their rulebook. Once trained, they compile their final rulebook and hand it to the front desk. The front desk uses this rulebook to make quick decisions without calling the committee for every case."*

- **Training:** The committee meeting (slow, learning, updating)
- **Inference:** The front desk using the final rulebook (fast, fixed, no learning)

### Why Does This Distinction Matter?

| Scenario | Why It Matters |
|----------|----------------|
| **Deployment** | In production, you use inference mode for speed |
| **Evaluation** | We evaluate in inference mode (weights must be fixed!) |
| **Consistency** | Same weights give same predictions every time |
| **Resources** | Inference uses less memory (no gradients stored) |

### The Key Insight

During inference, the model does NOT learn anything new. The weights are "frozen" - they don't change. This is essential because:

1. **Reproducibility**: Same input always gives same output
2. **Speed**: No gradient computation needed
3. **Fairness**: Test data doesn't influence the model

### Why "Frozen" Weights Matter Mathematically

During training, after each prediction, we do:
```
weights = weights - learning_rate × gradient
```

During inference, we SKIP this step entirely. The weights stay exactly as they were after training finished.

**Why does this matter?**

| If we kept updating during inference... | Consequence |
|----------------------------------------|-------------|
| Weights would change with each new input | Same input could give different outputs! |
| Model would "drift" over time | Yesterday's predictions wouldn't match today's |
| Hard to reproduce results | "But it worked yesterday!" |
| Unfair for test evaluation | Test data would influence the model |

**The mathematical guarantee:** With frozen weights, $f(x) = \sigma(w \cdot x + b)$ is a **deterministic function** - same input ALWAYS gives same output.
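A minimal sketch of this guarantee, using made-up values rather than the trained model from this notebook (`w`, `b`, `x`, `y`, and `lr` are illustrative assumptions). It runs the forward pass twice with frozen weights, then applies one training-style update to show how the output would shift:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical frozen parameters (NOT the notebook's trained weights)
w = np.array([0.5, -1.0, 2.0])
b = 0.1
x = np.array([1.0, 0.0, 1.0])

# Inference: forward pass only, repeated twice with the same weights
first = sigmoid(np.dot(w, x) + b)
second = sigmoid(np.dot(w, x) + b)
same_output = bool(first == second)

# Training-style step: one update toward an assumed true label y = 0
y, lr = 0, 0.5
error = first - y
w_updated = w - lr * error * x
b_updated = b - lr * error
after_update = sigmoid(np.dot(w_updated, x) + b_updated)

print(f"Frozen weights, same output twice: {same_output}")
print(f"Output before update: {first:.4f}, after one update: {after_update:.4f}")
```

With frozen weights the two forward passes agree exactly; after a single update the same input produces a different output, which is the "drift" described in the table above.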

### In Code


In [None]:
# =============================================================================
# TRAINING VS INFERENCE: Demonstration
# =============================================================================

print("="*70)
print("TRAINING vs INFERENCE MODE")
print("="*70)

# Show the model's state
print(f"\nModel state: {'TRAINED' if model.is_trained else 'UNTRAINED'}")

# In training mode, weights change after each sample
print("\n" + "-"*70)
print("DURING TRAINING (weights change):")
print("-"*70)
print("  For each sample:")
print("    1. Forward pass → get prediction")
print("    2. Compute loss → how wrong?")
print("    3. Compute gradients → which direction?")
print("    4. Update weights → improve! (weights CHANGE)")

# In inference mode, weights are frozen
print("\n" + "-"*70)
print("DURING INFERENCE (weights frozen):")
print("-"*70)
print("  For each sample:")
print("    1. Forward pass → get prediction")
print("    2. Done! (NO weight updates)")

# Demonstrate inference
print("\n" + "-"*70)
print("INFERENCE EXAMPLE:")
print("-"*70)

# Save weights before
weights_before = model.weights.copy()

# Make predictions (inference)
pred_v = model.forward(vertical_flat)
pred_h = model.forward(horizontal_flat)

# Check weights after
weights_after = model.weights.copy()

print(f"\n  Vertical line:   {pred_v:.4f} ({pred_v*100:.1f}% confident it's vertical)")
print(f"  Horizontal line: {pred_h:.4f} ({pred_h*100:.1f}% confident it's vertical)")
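# array_equal checks for exact equality, so even a tiny change would be reported
weights_unchanged = np.array_equal(weights_before, weights_after)
print(f"\n  Weights changed? {not weights_unchanged}")
print("  (In inference mode, weights stay fixed!)")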


---

## 6.2 Accuracy: The Simplest Metric

We've been using accuracy throughout our notebooks, but let's formally define it and understand its limitations.

### What IS Accuracy?

**Accuracy** answers the question: "Of all the predictions I made, what fraction was correct?"

$$\text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$$

### Breaking Down the Formula

Let's understand each part:

| Component | What It Means | Our Example |
|-----------|---------------|-------------|
| **Correct Predictions** | Cases where prediction matches truth | Said "vertical" for vertical, "horizontal" for horizontal |
| **Total Predictions** | All cases we predicted on | All 50 test images |
| **Accuracy** | The ratio (0 to 1, or 0% to 100%) | 48/50 = 0.96 = 96% |

### Computing Accuracy Step by Step

```
Step 1: Make predictions on all samples
Step 2: Compare each prediction to the true label
Step 3: Count how many match (correct)
Step 4: Divide by total number of predictions
```
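The four steps can be sketched in a few lines of NumPy. The labels and predictions below are hypothetical values chosen for illustration, not output from our model:

```python
import numpy as np

# Hypothetical true labels and predictions (illustration only)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0, 1, 1, 1, 0])   # Step 1: predictions on all samples

matches = (y_pred == y_true)           # Step 2: compare each prediction to the truth
n_correct = int(matches.sum())         # Step 3: count the matches
accuracy = n_correct / len(y_true)     # Step 4: divide by the total

print(f"Correct: {n_correct}/{len(y_true)} -> accuracy = {accuracy:.2f}")
```

Here two of the ten predictions are wrong, so the accuracy is 8/10 = 0.80.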

### Why Accuracy Can Be Misleading

Accuracy has a hidden flaw: **it treats all mistakes equally** and **ignores class imbalance**.

**Example - Fraud Detection:**

Suppose 99% of transactions are legitimate, 1% are fraud.

| Model Strategy | Accuracy | Is It Good? |
|----------------|----------|-------------|
| Say "legitimate" for EVERYTHING | 99% | NO! Catches 0% of fraud! |
| Actually detect fraud | 97% | YES! Even though lower accuracy |

The "dumb" model gets 99% accuracy by ignoring the problem entirely!

**Example - Medical Diagnosis:**

| Scenario | Type of Error | Consequence |
|----------|---------------|-------------|
| Say "healthy" when patient is sick | Miss a disease | Patient doesn't get treatment! (VERY bad) |
| Say "sick" when patient is healthy | False alarm | Unnecessary tests (annoying but not dangerous) |

Both are "wrong" but one is much worse! Accuracy treats them the same.

### When Accuracy Works Well

Accuracy is a good metric when:
1. **Classes are balanced** (roughly 50/50 split)
2. **All mistakes have equal cost**
3. **You want a quick overall view**

Our V/H classifier is a good case for accuracy: balanced classes, equal mistake costs.

### Understanding Why Class Imbalance Breaks Accuracy

Let's do the math to see WHY accuracy is misleading with imbalanced data:

**Scenario: Fraud Detection (1% fraud, 99% legitimate)**

| Strategy | Fraud Caught | Accuracy Calculation |
|----------|-------------|---------------------|
| **"Always say legitimate"** | 0 of 100 frauds | (0 + 9900) / 10000 = **99%** |
| **Good detector** | 90 of 100 frauds | (90 + 9800) / 10000 = **98.9%** |

The "dumb" strategy has HIGHER accuracy but catches ZERO fraud!

**Why this happens mathematically:**

$$\text{Accuracy} = \frac{TP + TN}{\text{Total}}$$

When 99% of data is class 0, you can get 99% accuracy by predicting 0 for everything (TN = 9900, TP = 0, FP = 0, FN = 100). The 100 missed frauds barely dent the score.

**The lesson:** When classes are imbalanced, accuracy is dominated by the majority class. We need metrics that focus on the minority class (precision, recall).
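To check the arithmetic, here is a small sketch that rebuilds the scenario from the table: 10,000 transactions, 100 of them fraud. The "good detector" is assumed to catch 90 frauds and raise 100 false alarms, which is what the table's 9,800 true negatives imply. The arrays are synthetic and the helper names are illustrative:

```python
import numpy as np

n_total, n_fraud = 10_000, 100
y_true = np.zeros(n_total, dtype=int)
y_true[:n_fraud] = 1                      # first 100 transactions are fraud

# Strategy 1: always predict "legitimate" (class 0)
always_legit = np.zeros(n_total, dtype=int)

# Strategy 2: hypothetical detector - catches 90 frauds, 100 false alarms
detector = np.zeros(n_total, dtype=int)
detector[:90] = 1                         # 90 of the 100 frauds flagged
detector[n_fraud:n_fraud + 100] = 1       # 100 legitimate transactions flagged

def accuracy(y_true, y_pred):
    return float(np.mean(y_true == y_pred))

def frauds_caught(y_true, y_pred):
    return int(np.sum((y_pred == 1) & (y_true == 1)))

acc_always = accuracy(y_true, always_legit)
acc_detector = accuracy(y_true, detector)
caught_always = frauds_caught(y_true, always_legit)
caught_detector = frauds_caught(y_true, detector)

print(f"Always legitimate: accuracy = {acc_always:.3f}, frauds caught = {caught_always}")
print(f"Good detector:     accuracy = {acc_detector:.3f}, frauds caught = {caught_detector}")
```

The detector scores slightly lower on accuracy (0.989 vs 0.990) while catching 90 frauds instead of none.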

### Let's Calculate Accuracy Properly


In [None]:
# =============================================================================
# ACCURACY: Step-by-Step Calculation
# =============================================================================

print("="*70)
print("CALCULATING ACCURACY: Step by Step")
print("="*70)

def calculate_accuracy(model, X, y, verbose=True):
    """
    Calculate accuracy of model on given data.
    
    Parameters:
        model: Trained model with predict() method
        X: Input data (n_samples, n_features)
        y: True labels (n_samples,)
        verbose: Whether to print details
    
    Returns:
        accuracy: Float between 0 and 1
        predictions: Array of predicted labels
    """
    predictions = []
    correct = 0
    
    for i in range(len(X)):
        pred = model.predict(X[i])
        predictions.append(pred)
        if pred == y[i]:
            correct += 1
    
    accuracy = correct / len(y)
    
    if verbose:
        print(f"\n  Total samples: {len(y)}")
        print(f"  Correct: {correct}")
        print(f"  Wrong: {len(y) - correct}")
        print(f"  Accuracy: {correct}/{len(y)} = {accuracy:.4f} = {accuracy*100:.1f}%")
    
    return accuracy, np.array(predictions)

# Calculate accuracy on TRAINING data
print("\n" + "-"*70)
print("TRAINING SET ACCURACY:")
print("-"*70)
train_accuracy, train_preds = calculate_accuracy(model, X_train, y_train)

# Calculate accuracy on TEST data (NEW!)
print("\n" + "-"*70)
print("TEST SET ACCURACY:")
print("-"*70)
test_accuracy, test_preds = calculate_accuracy(model, X_test, y_test)

print("\n" + "="*70)
print("KEY INSIGHT: Training vs Test Accuracy")
print("="*70)
print(f"""
Training accuracy: {train_accuracy*100:.1f}%
Test accuracy:     {test_accuracy*100:.1f}%

The TEST accuracy is what really matters!
Training accuracy can be misleadingly high if the model "memorizes" the data.
Test accuracy shows how well the model generalizes to NEW data.
""")


---

## 6.3 The Confusion Matrix: A Detailed Report Card

Accuracy gives us one number. But what if we want to understand **WHICH** mistakes the model makes?

### What IS a Confusion Matrix?

A **confusion matrix** is a table that breaks down all predictions into four categories based on two questions:
1. What did we **predict**?
2. What was the **actual** truth?

```
                      PREDICTED
                    0        1
              ┌─────────┬─────────┐
        0     │   TN    │   FP    │
   ACTUAL     ├─────────┼─────────┤
        1     │   FN    │   TP    │
              └─────────┴─────────┘
```

### Why "Confusion" Matrix?

The name comes from the fact that it shows how the model gets "confused" - where it mixes up one class for another.

### Understanding the Four Categories

| Abbrev | Full Name | Meaning | Our Example |
|--------|-----------|---------|-------------|
| **TP** | True Positive | Predicted 1, was actually 1 | Said "vertical", WAS vertical ✓ |
| **TN** | True Negative | Predicted 0, was actually 0 | Said "horizontal", WAS horizontal ✓ |
| **FP** | False Positive | Predicted 1, was actually 0 | Said "vertical", was horizontal ✗ |
| **FN** | False Negative | Predicted 0, was actually 1 | Said "horizontal", was vertical ✗ |

### Memory Trick for TP/TN/FP/FN

Think of it as TWO questions:

1. **True/False:** Was the prediction **correct**?
   - True = correct
   - False = wrong

2. **Positive/Negative:** What did we **predict**?
   - Positive = predicted class 1 (vertical)
   - Negative = predicted class 0 (horizontal)

So:
- **True Positive** = We were True (correct) when we predicted Positive (vertical)
- **False Positive** = We were False (wrong) when we predicted Positive (vertical)
- **True Negative** = We were True (correct) when we predicted Negative (horizontal)  
- **False Negative** = We were False (wrong) when we predicted Negative (horizontal)
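The two-question trick translates directly into code. The helper below, `name_outcome`, is a hypothetical illustration (it is not part of the notebook's tools) that answers both questions for a single pair of 0/1 labels:

```python
def name_outcome(actual, predicted):
    """Name the confusion-matrix category for one (actual, predicted) pair of 0/1 labels."""
    correctness = "True" if predicted == actual else "False"   # Question 1: were we correct?
    polarity = "Positive" if predicted == 1 else "Negative"    # Question 2: what did we predict?
    return f"{correctness} {polarity}"

for actual, predicted in [(1, 1), (0, 0), (0, 1), (1, 0)]:
    print(f"actual={actual}, predicted={predicted} -> {name_outcome(actual, predicted)}")
```

Note that the second word depends only on the prediction, never on the actual label.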

### Committee Analogy

*"The confusion matrix is like a detailed performance review for our committee member:*
- *TP: Cases they correctly identified as vertical*
- *TN: Cases they correctly identified as NOT vertical*
- *FP: Cases they wrongly called vertical (a false alarm!)*
- *FN: Cases they missed (should have said vertical but didn't)"*

### Alternative Names You'll See

| Our Term | Also Called | When Used |
|----------|-------------|-----------|
| False Positive | Type I Error | Statistics |
| False Negative | Type II Error | Statistics |
| True Positive Rate | Sensitivity, Recall | Medical |
| True Negative Rate | Specificity | Medical |
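As a rough sketch of how these names map onto the four counts, the snippet below uses hypothetical numbers (not results from this notebook) and the standard definitions TPR = TP / (TP + FN) and TNR = TN / (TN + FP):

```python
# Hypothetical confusion-matrix counts (illustration only)
TP, FN = 40, 10   # 50 actual positives
TN, FP = 45, 5    # 50 actual negatives

sensitivity = TP / (TP + FN)            # True Positive Rate (same as recall)
specificity = TN / (TN + FP)            # True Negative Rate
false_positive_rate = FP / (FP + TN)    # share of actual negatives hit by a Type I error
false_negative_rate = FN / (FN + TP)    # share of actual positives hit by a Type II error

print(f"Sensitivity (TPR): {sensitivity:.2f}")
print(f"Specificity (TNR): {specificity:.2f}")
print(f"Type I error rate (FPR):  {false_positive_rate:.2f}")
print(f"Type II error rate (FNR): {false_negative_rate:.2f}")
```

Each rate is computed within one actual class, so sensitivity and the Type II error rate sum to 1, as do specificity and the Type I error rate.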

### Real-World Examples of Each Error Type

Understanding these errors is easier with concrete examples:

| Error Type | Medical Example | Email Example | Self-Driving Car |
|------------|-----------------|---------------|------------------|
| **TP** | Correctly diagnose sick patient | Correctly mark spam | Correctly detect pedestrian |
| **TN** | Correctly clear healthy patient | Correctly allow good email | Correctly see that the road is clear |
| **FP** | Diagnose healthy as sick | Mark good email as spam | Brake for nothing (annoying) |
| **FN** | Miss a sick patient | Allow spam through | Miss a pedestrian (FATAL!) |

**Notice:** The consequences of FP vs FN are very different depending on the application!
- **Medical:** FN is worse (missed diagnosis can be fatal)
- **Spam filter:** FP is worse (losing important emails)
- **Self-driving:** FN is MUCH worse (hitting someone)

This is why we have precision and recall - to measure these separately.

### Let's Build a Confusion Matrix


In [None]:
# =============================================================================
# CONFUSION MATRIX: Implementation and Explanation
# =============================================================================

def confusion_matrix(y_true, y_pred):
    """
    Compute the confusion matrix.
    
    The logic behind each calculation:
    - TP: prediction=1 AND truth=1 (both conditions true)
    - TN: prediction=0 AND truth=0 (both conditions true)
    - FP: prediction=1 AND truth=0 (predicted positive, was negative)
    - FN: prediction=0 AND truth=1 (predicted negative, was positive)
    
    Parameters:
        y_true: Array of true labels (0 or 1)
        y_pred: Array of predicted labels (0 or 1)
    
    Returns:
        dict with TP, TN, FP, FN counts
    """
    # True Positive: We said 1, it was 1
    TP = np.sum((y_pred == 1) & (y_true == 1))
    
    # True Negative: We said 0, it was 0
    TN = np.sum((y_pred == 0) & (y_true == 0))
    
    # False Positive: We said 1, but it was 0 (false alarm!)
    FP = np.sum((y_pred == 1) & (y_true == 0))
    
    # False Negative: We said 0, but it was 1 (missed it!)
    FN = np.sum((y_pred == 0) & (y_true == 1))
    
    return {'TP': TP, 'TN': TN, 'FP': FP, 'FN': FN}

print("="*70)
print("CONFUSION MATRIX: Step by Step")
print("="*70)

# Calculate confusion matrix for test set
cm = confusion_matrix(y_test, test_preds)

print("\nFor our TEST set:")
print(f"  Total samples: {len(y_test)}")
print(f"  Vertical lines (label=1): {np.sum(y_test == 1)}")
print(f"  Horizontal lines (label=0): {np.sum(y_test == 0)}")

print("\n" + "-"*70)
print("CONFUSION MATRIX BREAKDOWN:")
print("-"*70)

print(f"""
                       PREDICTED
                   Horizontal(0)  Vertical(1)
              ┌─────────────────┬─────────────────┐
   Horiz.(0)  │  TN = {cm['TN']:3d}       │  FP = {cm['FP']:3d}       │
   ACTUAL     ├─────────────────┼─────────────────┤
   Vert.(1)   │  FN = {cm['FN']:3d}       │  TP = {cm['TP']:3d}       │
              └─────────────────┴─────────────────┘
""")

print("Interpretation (reading the matrix):")
print(f"  ✓ True Positives (TP = {cm['TP']}): Correctly identified as VERTICAL")
print(f"  ✓ True Negatives (TN = {cm['TN']}): Correctly identified as HORIZONTAL")
print(f"  ✗ False Positives (FP = {cm['FP']}): Wrongly called VERTICAL (was horizontal)")
print(f"  ✗ False Negatives (FN = {cm['FN']}): Wrongly called HORIZONTAL (was vertical)")

# Verify: TP + TN + FP + FN should equal total samples
total = cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']
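# Only show the check mark if the counts really add up
check_mark = "✓" if total == len(y_test) else "✗ MISMATCH"
print(f"\n  Verification: TP + TN + FP + FN = {total} (should equal {len(y_test)}) {check_mark}")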

# Show how accuracy relates to confusion matrix
print("\n" + "-"*70)
print("ACCURACY FROM CONFUSION MATRIX:")
print("-"*70)
print(f"""
  Accuracy = (TP + TN) / (TP + TN + FP + FN)
           = ({cm['TP']} + {cm['TN']}) / ({cm['TP']} + {cm['TN']} + {cm['FP']} + {cm['FN']})
           = {cm['TP'] + cm['TN']} / {total}
           = {(cm['TP'] + cm['TN']) / total:.4f}
           = {(cm['TP'] + cm['TN']) / total * 100:.1f}%
""")


In [None]:
# =============================================================================
# VISUALIZE THE CONFUSION MATRIX
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Confusion Matrix as heatmap
ax1 = axes[0]
cm_matrix = np.array([[cm['TN'], cm['FP']], 
                       [cm['FN'], cm['TP']]])

im = ax1.imshow(cm_matrix, cmap='Blues')
ax1.set_xticks([0, 1])
ax1.set_yticks([0, 1])
ax1.set_xticklabels(['Horizontal (0)', 'Vertical (1)'])
ax1.set_yticklabels(['Horizontal (0)', 'Vertical (1)'])
ax1.set_xlabel('Predicted Label', fontsize=12)
ax1.set_ylabel('Actual Label', fontsize=12)
ax1.set_title('Confusion Matrix', fontsize=14, fontweight='bold')

# Add text annotations
labels = [['TN', 'FP'], ['FN', 'TP']]
for i in range(2):
    for j in range(2):
        text_color = 'white' if cm_matrix[i, j] > cm_matrix.max()/2 else 'black'
        ax1.text(j, i, f'{labels[i][j]}\n{cm_matrix[i, j]}', 
                ha='center', va='center', fontsize=14, fontweight='bold', color=text_color)

plt.colorbar(im, ax=ax1)

# Plot 2: Visual explanation
ax2 = axes[1]
ax2.axis('off')

explanation_text = f"""
READING THE CONFUSION MATRIX
{'='*45}

The DIAGONAL (top-left to bottom-right) shows 
CORRECT predictions:
  • TN ({cm['TN']}): Horizontal predicted as Horizontal ✓
  • TP ({cm['TP']}): Vertical predicted as Vertical ✓

The OFF-DIAGONAL shows ERRORS:
  • FP ({cm['FP']}): Horizontal wrongly called Vertical ✗
  • FN ({cm['FN']}): Vertical wrongly called Horizontal ✗

A PERFECT model has:
  • All values on the diagonal
  • Zeros everywhere else
"""

ax2.text(0.1, 0.5, explanation_text, fontsize=11, family='monospace',
        verticalalignment='center', transform=ax2.transAxes,
        bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()


---

## 6.4 Precision, Recall, and F1 Score

The confusion matrix gives us four numbers. From these, we can calculate more specific metrics that answer different questions.

### Precision: "When I Say Positive, Am I Right?"

**Precision** answers: "Of all the times I predicted 'positive' (vertical), how many were actually positive?"

$$\text{Precision} = \frac{TP}{TP + FP}$$

**Breaking it down:**
- **Numerator (TP):** Cases we correctly called positive
- **Denominator (TP + FP):** ALL cases we called positive (right or wrong)

**High precision means:** When we say "vertical", we're usually right. Few false alarms.

**When to prioritize precision:**
- Spam filters (don't delete legitimate emails!)
- Recommender systems (don't recommend things users hate!)
- Any case where false alarms are costly

### Recall: "Did I Catch All the Positives?"

**Recall** (also called **Sensitivity**) answers: "Of all the actual positives, how many did I catch?"

$$\text{Recall} = \frac{TP}{TP + FN}$$

**Breaking it down:**
- **Numerator (TP):** Cases we correctly caught
- **Denominator (TP + FN):** ALL actual positives (caught or missed)

**High recall means:** We catch most of the actual vertical lines. Few misses.

**When to prioritize recall:**
- Disease detection (don't miss sick patients!)
- Fraud detection (don't miss fraudulent transactions!)
- Any case where missing positives is costly
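A quick worked example, with hypothetical counts chosen so the two metrics differ, shows that precision and recall share a numerator but divide by different totals:

```python
# Hypothetical counts (illustration only, not from our test set)
TP = 18   # vertical lines correctly called vertical
FP = 2    # horizontal lines wrongly called vertical
FN = 6    # vertical lines we missed

precision = TP / (TP + FP)   # out of everything we CALLED vertical
recall = TP / (TP + FN)      # out of everything that WAS vertical

print(f"Precision = {TP}/{TP + FP} = {precision:.2f}")
print(f"Recall    = {TP}/{TP + FN} = {recall:.2f}")
```

This hypothetical model is right 90% of the time when it says "vertical", yet it only finds 75% of the vertical lines.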

### The Precision-Recall Trade-off

Here's the fundamental tension:

| Strategy | Precision | Recall | Problem |
|----------|-----------|--------|---------|
| "Only say vertical when 100% sure" | HIGH (few false alarms) | LOW (miss many) | Miss too many positives |
| "Say vertical for anything remotely vertical" | LOW (many false alarms) | HIGH (catch most) | Too many false alarms |

**You often can't maximize both!** This is called the **precision-recall trade-off**.

### Concrete Example: Airport Security

Imagine a security scanner detecting threats:

| Setting | Precision | Recall | Outcome |
|---------|-----------|--------|---------|
| **Super sensitive** | 10% | 99% | Catches nearly all threats but 90% of "threats" are false alarms. Massive delays! |
| **Super strict** | 95% | 20% | Few false alarms but misses 80% of real threats. Dangerous! |
| **Balanced** | 70% | 70% | Some false alarms, catches most threats. Practical! |

**Why the trade-off exists:**

When we lower the threshold for saying "positive":
- We catch MORE true positives (recall goes UP ↑)
- But we also let in MORE false positives (precision usually goes DOWN ↓)

When we raise the threshold:
- We have FEWER false positives (precision usually goes UP ↑)
- But we miss MORE true positives (recall goes DOWN ↓)

**There's no free lunch!** The art is finding the right balance for your specific application.
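The effect of moving the threshold can be sketched on a toy example. The scores and labels below are invented for illustration (they are not our model's outputs), and `precision_recall_at` is a hypothetical helper:

```python
import numpy as np

# Hypothetical model confidences and true labels (illustration only)
scores = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1,    1,    1,    0,    1,    0,    1,    0,    0,    0])

def precision_recall_at(threshold):
    """Predict positive when score >= threshold, then compute precision and recall."""
    predicted = (scores >= threshold).astype(int)
    tp = int(np.sum((predicted == 1) & (labels == 1)))
    fp = int(np.sum((predicted == 1) & (labels == 0)))
    fn = int(np.sum((predicted == 0) & (labels == 1)))
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

results = {t: precision_recall_at(t) for t in (0.25, 0.50, 0.75)}
for t, (p, r) in results.items():
    print(f"threshold = {t:.2f}: precision = {p:.3f}, recall = {r:.3f}")
```

On this toy data, raising the threshold from 0.25 to 0.75 lifts precision from 0.625 to 1.0 while recall falls from 1.0 to 0.6.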

### F1 Score: Finding the Balance

The **F1 Score** is the **harmonic mean** of precision and recall - a single number that balances both:

$$F1 = 2 \cdot \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

### What IS a Harmonic Mean and Why Use It?

You might wonder: "Why not just use a regular average (arithmetic mean)?"

**Three Types of Means:**

| Mean Type | Formula | Example: (99%, 10%) |
|-----------|---------|---------------------|
| **Arithmetic** | (a + b) / 2 | (99 + 10) / 2 = **54.5%** |
| **Geometric** | √(a × b) | √(99 × 10) = **31.5%** |
| **Harmonic** | 2ab / (a + b) | 2×99×10 / (99+10) = **18.2%** |

**Why harmonic mean is better for F1:**

The harmonic mean is **punishing when values are imbalanced**. If you have 99% precision but only 10% recall:
- Arithmetic mean says "54.5% - not bad!"
- Harmonic mean says "18.2% - this is terrible!"

**The harmonic mean forces BOTH values to be reasonably high to get a good score.**

**Intuition:** Think about speed. If you drive half the distance of a trip at 60 mph and the other half at 20 mph, your average speed isn't 40 mph - it's 30 mph (the harmonic mean). The slow part dominates because you spend more time in it.
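The speed intuition can be checked with simple arithmetic, assuming each speed covers half of the *distance* (the 60-mile trip length is an arbitrary choice for illustration):

```python
# Hypothetical trip: 60 miles total, split into two 30-mile halves
half_distance = 30.0
fast_speed, slow_speed = 60.0, 20.0

time_fast = half_distance / fast_speed    # hours spent on the fast half
time_slow = half_distance / slow_speed    # hours spent on the slow half
average_speed = (2 * half_distance) / (time_fast + time_slow)

harmonic_mean = 2 * fast_speed * slow_speed / (fast_speed + slow_speed)
arithmetic_mean = (fast_speed + slow_speed) / 2

print(f"Time at 60 mph: {time_fast} h, time at 20 mph: {time_slow} h")
print(f"True average speed: {average_speed} mph")
print(f"Harmonic mean: {harmonic_mean} mph, arithmetic mean: {arithmetic_mean} mph")
```

The slow half takes three times as long as the fast half, which is why the true average matches the harmonic mean rather than the arithmetic mean.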

**Why this matters for ML:**
A model that predicts "positive" for everything gets 100% recall, but its precision is only as high as the fraction of positives in the data (about 1% in the fraud example). The harmonic mean correctly flags this as a terrible model (F1 ≈ 2%).

| Precision | Recall | F1 Score | Verdict |
|-----------|--------|----------|---------|
| 90% | 90% | 90% | Great! Both balanced |
| 99% | 10% | 18% | Terrible! Very unbalanced |
| 50% | 50% | 50% | Mediocre |

**F1 is high only when BOTH precision AND recall are reasonably high.**
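The rows of the table can be reproduced with a few lines of Python. The function name `f1_score` here is an illustrative helper defined on the spot, not an import from any library:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs taken from the table above
rows = [(0.90, 0.90), (0.99, 0.10), (0.50, 0.50)]
scores = [f1_score(p, r) for p, r in rows]

for (p, r), f1 in zip(rows, scores):
    arithmetic = (p + r) / 2
    print(f"P = {p:.2f}, R = {r:.2f} -> F1 = {f1:.3f} (arithmetic mean would be {arithmetic:.3f})")
```

Only the unbalanced row shows a large gap between F1 and the arithmetic mean; when precision and recall are equal, the two means coincide.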


In [None]:
# =============================================================================
# PRECISION, RECALL, F1: Calculation
# =============================================================================

def calculate_metrics(cm):
    """
    Calculate precision, recall, F1 from confusion matrix.
    
    Parameters:
        cm: dict with TP, TN, FP, FN
    
    Returns:
        dict with precision, recall, f1, accuracy
    """
    TP, TN, FP, FN = cm['TP'], cm['TN'], cm['FP'], cm['FN']
    
    # Precision: When we say positive, are we right?
    # Note: We add a check to avoid division by zero
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    
    # Recall: Did we catch all the positives?
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    
    # F1: Harmonic mean of precision and recall
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    # Accuracy (for comparison)
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    
    return {'precision': precision, 'recall': recall, 'f1': f1, 'accuracy': accuracy}

print("="*70)
print("PRECISION, RECALL, AND F1 SCORE")
print("="*70)

metrics = calculate_metrics(cm)

print("\n" + "-"*70)
print("STEP-BY-STEP CALCULATION:")
print("-"*70)

print(f"""
From our confusion matrix:
  TP = {cm['TP']} (correctly identified vertical lines)
  TN = {cm['TN']} (correctly identified horizontal lines)
  FP = {cm['FP']} (horizontal lines wrongly called vertical)
  FN = {cm['FN']} (vertical lines wrongly called horizontal)

PRECISION: "When I say vertical, am I right?"
  Formula: Precision = TP / (TP + FP)
  
  Precision = {cm['TP']} / ({cm['TP']} + {cm['FP']})
            = {cm['TP']} / {cm['TP'] + cm['FP']}
            = {metrics['precision']:.4f}
            = {metrics['precision']*100:.1f}%

RECALL: "Did I catch all the vertical lines?"
  Formula: Recall = TP / (TP + FN)
  
  Recall = {cm['TP']} / ({cm['TP']} + {cm['FN']})
         = {cm['TP']} / {cm['TP'] + cm['FN']}
         = {metrics['recall']:.4f}
         = {metrics['recall']*100:.1f}%

F1 SCORE: "Balance of precision and recall"
  Formula: F1 = 2 × (Precision × Recall) / (Precision + Recall)
  
  F1 = 2 × ({metrics['precision']:.4f} × {metrics['recall']:.4f}) / ({metrics['precision']:.4f} + {metrics['recall']:.4f})
     = 2 × {metrics['precision'] * metrics['recall']:.4f} / {metrics['precision'] + metrics['recall']:.4f}
     = {metrics['f1']:.4f}
     = {metrics['f1']*100:.1f}%

ACCURACY (for comparison):
  Formula: Accuracy = (TP + TN) / Total
  
  Accuracy = ({cm['TP']} + {cm['TN']}) / {cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']}
           = {cm['TP'] + cm['TN']} / {cm['TP'] + cm['TN'] + cm['FP'] + cm['FN']}
           = {metrics['accuracy']:.4f}
           = {metrics['accuracy']*100:.1f}%
""")


In [None]:
# =============================================================================
# DEMONSTRATING THE PRECISION-RECALL TRADE-OFF
# =============================================================================

print("="*70)
print("THE PRECISION-RECALL TRADE-OFF: A Visual Demonstration")
print("="*70)

print("""
To understand the trade-off, let's see what happens when we change
our THRESHOLD for saying "vertical" (positive).

Currently we use: threshold = 0.5
  - If output >= 0.5 → predict "vertical"
  - If output < 0.5 → predict "horizontal"

But what if we change this threshold?
""")

# Try different thresholds
thresholds = [0.1, 0.3, 0.5, 0.7, 0.9]
results = []

for threshold in thresholds:
    # Make predictions at this threshold
    preds = np.array([1 if model.forward(x) >= threshold else 0 for x in X_test])
    
    # Calculate confusion matrix
    TP = np.sum((preds == 1) & (y_test == 1))
    TN = np.sum((preds == 0) & (y_test == 0))
    FP = np.sum((preds == 1) & (y_test == 0))
    FN = np.sum((preds == 0) & (y_test == 1))
    
    # Calculate metrics
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0
    
    results.append({
        'threshold': threshold,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'TP': TP, 'FP': FP, 'FN': FN
    })
    
    print(f"Threshold = {threshold}:")
    print(f"  TP={TP:2d}, FP={FP:2d}, FN={FN:2d}")
    print(f"  Precision={precision:.1%}, Recall={recall:.1%}, F1={f1:.1%}")
    print()

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Precision vs Recall at different thresholds
ax = axes[0]
precisions = [r['precision'] for r in results]
recalls = [r['recall'] for r in results]

ax.plot(recalls, precisions, 'b-o', linewidth=2, markersize=10)
for r in results:
    ax.annotate(f"  t={r['threshold']}", 
               (r['recall'], r['precision']), fontsize=9)

ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Trade-off\n(Each point is a different threshold)', 
            fontsize=12, fontweight='bold')
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 1.05)
ax.grid(True, alpha=0.3)

# Add ideal point
ax.scatter([1], [1], color='gold', s=200, marker='*', zorder=5, label='Ideal (1,1)')
ax.legend()

# Plot 2: Bar chart showing trade-off
ax = axes[1]
x = np.arange(len(thresholds))
width = 0.25

bars1 = ax.bar(x - width, precisions, width, label='Precision', color='#e74c3c')
bars2 = ax.bar(x, recalls, width, label='Recall', color='#27ae60')
bars3 = ax.bar(x + width, [r['f1'] for r in results], width, label='F1', color='#9b59b6')

ax.set_xlabel('Threshold', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Metrics at Different Thresholds', fontsize=12, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([f'{t}' for t in thresholds])
ax.legend()
ax.set_ylim(0, 1.1)

plt.tight_layout()
plt.show()

print("""
KEY INSIGHT:
════════════════════════════════════════════════════════════════════════

• LOW threshold (0.1): "Say vertical for almost everything!"
  → High recall (catch most verticals) but low precision (many false alarms)
  
• HIGH threshold (0.9): "Only say vertical when VERY confident!"
  → High precision (rarely wrong when we say vertical) but low recall (miss many)
  
• MIDDLE threshold (0.5): Balanced trade-off

Notice how the precision-recall curve shows the trade-off: as one goes up, 
the other tends to go down. F1 score helps us find a good balance!
""")


In [None]:
# =============================================================================
# VISUALIZE ALL METRICS
# =============================================================================

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Bar chart of all metrics
ax1 = axes[0]
metric_names = ['Accuracy', 'Precision', 'Recall', 'F1 Score']
metric_values = [metrics['accuracy'], metrics['precision'], metrics['recall'], metrics['f1']]
colors = ['#3498db', '#e74c3c', '#27ae60', '#9b59b6']

bars = ax1.bar(metric_names, metric_values, color=colors, edgecolor='white', linewidth=2)
ax1.set_ylim(0, 1.1)
ax1.set_ylabel('Score', fontsize=12)
ax1.set_title('Model Performance Metrics', fontsize=14, fontweight='bold')
ax1.axhline(y=1.0, color='gray', linestyle='--', alpha=0.5, label='Perfect score')

# Add value labels
for bar, val in zip(bars, metric_values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.02, 
            f'{val:.1%}', ha='center', va='bottom', fontsize=12, fontweight='bold')

# Plot 2: Which metric to use guide
ax2 = axes[1]
ax2.axis('off')

metrics_explanation = """
WHICH METRIC SHOULD YOU USE?
═══════════════════════════════════════════════════

ACCURACY
  • Best when: Classes are balanced (50/50)
  • Misleading when: Rare events (e.g., 1% fraud)
  
PRECISION
  • Best when: False alarms are COSTLY
  • Examples: 
    - Spam filter (don't delete real email!)
    - Criminal conviction (don't jail innocent!)
  
RECALL
  • Best when: Missing positives is COSTLY
  • Examples: 
    - Disease detection (don't miss sick patients!)
    - Fraud detection (don't miss fraud!)
  
F1 SCORE
  • Best when: You need balance between P & R
  • A common default when classes are imbalanced

═══════════════════════════════════════════════════
For our V/H classifier, all metrics are similar 
because our dataset is balanced and the model works well!
"""

ax2.text(0.05, 0.5, metrics_explanation, fontsize=10, family='monospace',
        verticalalignment='center', transform=ax2.transAxes,
        bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8))

plt.tight_layout()
plt.show()


---

## 6.5 The Committee Report: Saliency and Interpretability

We know our model works well, but **WHY** does it work? What has it actually learned?

### What IS Interpretability?

**Interpretability** (also called **Explainability**) means understanding:
1. What patterns did the model learn?
2. Why does it make specific predictions?
3. Is it using the "right" features?

| Question | How to Answer |
|----------|---------------|
| What patterns did it learn? | Look at the weights |
| Why did it predict "vertical"? | Look at which inputs contributed most |
| Is it using the right features? | Visualize the saliency map |

### What IS Saliency?

The word **"saliency"** comes from Latin *salire* meaning "to leap." In machine learning:

**Saliency = Which parts of the input "leap out" as important to the model**

For our Perceptron, saliency is beautifully simple:

$$\text{Saliency}_i = |w_i \times x_i|$$

Where:
- $w_i$ = weight for input $i$
- $x_i$ = value of input $i$
- $|...|$ = absolute value (we care about magnitude, not sign)

### Why Absolute Value?

| Weight × Input | Meaning | Contribution |
|----------------|---------|--------------|
| +2.0 × 1.0 = +2.0 | Strongly SUPPORTS vertical | HIGH |
| -2.0 × 1.0 = -2.0 | Strongly OPPOSES vertical | HIGH |
| +0.1 × 1.0 = +0.1 | Weakly supports vertical | LOW |

Both +2.0 and -2.0 are **strong contributions** - just in opposite directions. The absolute value captures the **strength of influence**.

### Committee Analogy

*"We ask the committee: 'Show us your reasoning. Highlight the evidence that most influenced your decision.' They produce a report where the most influential pieces of evidence glow brightly. This is the saliency map - a visual explanation of the committee's thought process."*

### Why Interpretability Matters

| Reason | Example |
|--------|---------|
| **Trust** | Can we trust this medical diagnosis? |
| **Debugging** | Why is the model getting this wrong? |
| **Discovery** | What features actually matter? |
| **Fairness** | Is it unfairly using race or gender? |
| **Legal** | GDPR is often read as granting a "right to explanation" |

### The Math Behind Saliency

For our Perceptron, let's trace WHY $|w_i \times x_i|$ measures importance:

**Step 1: The Neuron's Decision**
$$z = w_1 x_1 + w_2 x_2 + ... + w_9 x_9 + b$$

Each term $w_i x_i$ is that pixel's **contribution** to the final sum $z$.

**Step 2: How Much Did Each Pixel Contribute?**

| Pixel | Weight ($w_i$) | Input ($x_i$) | Contribution ($w_i \times x_i$) |
|-------|----------------|---------------|--------------------------------|
| 0 | 0.5 | 0 | 0.5 × 0 = 0 (no contribution) |
| 1 | 1.2 | 1 | 1.2 × 1 = 1.2 (strong positive) |
| 4 | -0.8 | 1 | -0.8 × 1 = -0.8 (strong negative) |

**Step 3: Why Absolute Value?**

Both +1.2 and -0.8 are **strong influences** on the decision - they just push in opposite directions. The absolute value captures **strength of influence regardless of direction**.

$$\text{Saliency}_i = |w_i \times x_i|$$

**Interpretation:**
- High saliency = This pixel strongly influenced the decision (positively OR negatively)
- Low saliency = This pixel didn't matter much for this prediction
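
As a quick check of the arithmetic, the sketch below recomputes the three example pixels from the table in Step 2. Only those three pixels are included; the array names are introduced here for illustration and are not part of the trained model.

```python
import numpy as np

# Weights and inputs for example pixels 0, 1, and 4 from the table above.
example_pixels = [0, 1, 4]
example_weights = np.array([0.5, 1.2, -0.8])
example_inputs = np.array([0.0, 1.0, 1.0])

contributions = example_weights * example_inputs   # signed: direction of the push
example_saliency = np.abs(contributions)           # unsigned: strength of the push

for pixel, c, s in zip(example_pixels, contributions, example_saliency):
    print(f"pixel {pixel}: contribution={c:+.1f}, saliency={s:.1f}")
```

Pixel 0 has a non-zero weight but zero saliency because its input is 0: saliency describes one specific prediction, not the model as a whole.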

### Looking at What Our Model Learned


In [None]:
# =============================================================================
# SALIENCY: What Did the Model Learn?
# =============================================================================

print("="*70)
print("THE COMMITTEE REPORT: What Did the Model Learn?")
print("="*70)

# First, let's look at the learned weights
print("\n" + "-"*70)
print("STEP 1: Examine the Learned Weights")
print("-"*70)

weights_grid = model.weights.reshape(3, 3)
print("""
Remember our pixel positions:

    Position Index:     Image Layout:
    [0] [1] [2]         [row 0]
    [3] [4] [5]   →     [row 1]
    [6] [7] [8]         [row 2]

Our model's learned weights (as 3x3 grid):
""")
for i, row in enumerate(weights_grid):
    print(f"  Row {i}: [{row[0]:6.3f}, {row[1]:6.3f}, {row[2]:6.3f}]")

print(f"\n  Bias: {model.bias:.4f}")

# Interpret the weights
print("\n" + "-"*70)
print("STEP 2: Interpret What the Weights Mean")
print("-"*70)

print("""
HOW TO READ WEIGHTS:
  • Positive weight → This pixel being bright INCREASES "vertical" confidence
  • Negative weight → This pixel being bright DECREASES "vertical" confidence
  • Near-zero weight → This pixel doesn't matter much
""")

# Find which positions have highest/lowest weights
flat_weights = model.weights
max_idx = np.argmax(flat_weights)
min_idx = np.argmin(flat_weights)

print(f"""
KEY OBSERVATIONS:

  Maximum weight: position {max_idx} (row {max_idx//3}, col {max_idx%3}) = {flat_weights[max_idx]:.3f}
    → If this pixel is bright, model is MORE confident it's vertical
    
  Minimum weight: position {min_idx} (row {min_idx//3}, col {min_idx%3}) = {flat_weights[min_idx]:.3f}
    → If this pixel is bright, model is LESS confident it's vertical
    
  Positions with HIGH positive weights: {np.where(flat_weights > 0.3)[0].tolist()}
    → These pixels SUPPORT "vertical" classification
    
  Positions with HIGH negative weights: {np.where(flat_weights < -0.3)[0].tolist()}
    → These pixels OPPOSE "vertical" classification
""")


In [None]:
# =============================================================================
# VISUALIZE: Weights and Saliency Maps - THE "AHA!" MOMENT
# =============================================================================

def compute_saliency(model, x):
    """
    Compute saliency map for an input.
    
    Saliency = |weight × input|
    
    This tells us: "How much did each input pixel 
    contribute to the final decision?"
    
    Parameters:
        model: Trained model with weights
        x: Input image (flattened)
    
    Returns:
        saliency: Array of contribution magnitudes
    """
    x = np.array(x).flatten()
    # Multiply each input by its weight, take absolute value
    return np.abs(model.weights * x)

fig, axes = plt.subplots(2, 4, figsize=(16, 8))

# =================
# Top row: Vertical line analysis
# =================

# 1. Input image
ax = axes[0, 0]
ax.imshow(vertical_line, cmap='Blues', vmin=0, vmax=1)
ax.set_title('INPUT:\nVertical Line', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax.text(j, i, f'{vertical_line[i,j]:.0f}', ha='center', va='center', fontsize=12)
ax.axis('off')

# 2. Model weights
ax = axes[0, 1]
im = ax.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax.set_title('WEIGHTS:\nLearned by Model', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if abs(weights_grid[i,j]) > 1 else 'black'
        ax.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 3. Saliency map
ax = axes[0, 2]
saliency_v = compute_saliency(model, vertical_flat).reshape(3, 3)
im = ax.imshow(saliency_v, cmap='hot', vmin=0)
ax.set_title('SALIENCY MAP:\n|Weight × Input|', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        # The 'hot' colormap is bright at high values, so use dark text there
        color = 'black' if saliency_v[i,j] > saliency_v.max()/2 else 'white'
        ax.text(j, i, f'{saliency_v[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 4. Prediction result
ax = axes[0, 3]
ax.axis('off')
pred_v = model.forward(vertical_flat)
result_text = f"""PREDICTION

Raw output: {pred_v:.4f}
Confidence: {pred_v*100:.1f}%

Decision: {"VERTICAL" if pred_v >= 0.5 else "HORIZONTAL"}

{'Correct! ✓' if pred_v >= 0.5 else 'Incorrect ✗'}"""
ax.text(0.5, 0.5, result_text, fontsize=11, fontweight='bold',
       ha='center', va='center', transform=ax.transAxes,
       bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

# =================
# Bottom row: Horizontal line analysis
# =================

# 1. Input image
ax = axes[1, 0]
ax.imshow(horizontal_line, cmap='Blues', vmin=0, vmax=1)
ax.set_title('INPUT:\nHorizontal Line', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        ax.text(j, i, f'{horizontal_line[i,j]:.0f}', ha='center', va='center', fontsize=12)
ax.axis('off')

# 2. Model weights (same)
ax = axes[1, 1]
im = ax.imshow(weights_grid, cmap='RdBu', vmin=-2, vmax=2)
ax.set_title('WEIGHTS:\n(Same model)', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        color = 'white' if abs(weights_grid[i,j]) > 1 else 'black'
        ax.text(j, i, f'{weights_grid[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 3. Saliency map
ax = axes[1, 2]
saliency_h = compute_saliency(model, horizontal_flat).reshape(3, 3)
im = ax.imshow(saliency_h, cmap='hot', vmin=0)
ax.set_title('SALIENCY MAP:\n|Weight × Input|', fontsize=11, fontweight='bold')
for i in range(3):
    for j in range(3):
        # The 'hot' colormap is bright at high values, so use dark text there
        color = 'black' if saliency_h[i,j] > saliency_h.max()/2 else 'white'
        ax.text(j, i, f'{saliency_h[i,j]:.2f}', ha='center', va='center', fontsize=9, color=color)
ax.axis('off')

# 4. Prediction result
ax = axes[1, 3]
ax.axis('off')
pred_h = model.forward(horizontal_flat)
result_text = f"""PREDICTION

Raw output: {pred_h:.4f}
Confidence: {(1-pred_h)*100:.1f}% horizontal

Decision: {"VERTICAL" if pred_h >= 0.5 else "HORIZONTAL"}

{'Correct! ✓' if pred_h < 0.5 else 'Incorrect ✗'}"""
ax.text(0.5, 0.5, result_text, fontsize=11, fontweight='bold',
       ha='center', va='center', transform=ax.transAxes,
       bbox=dict(boxstyle='round', facecolor='lightgreen', alpha=0.8))

plt.suptitle('THE COMMITTEE REPORT: How the Model Makes Decisions', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()


In [None]:
# =============================================================================
# THE "AHA!" MOMENT: Understanding What the Model Learned
# =============================================================================

print("="*70)
print("THE KEY INSIGHT: What Did the Model ACTUALLY Learn?")
print("="*70)

print("""
Looking at the visualizations above, we can see something beautiful:

FOR VERTICAL LINES:
  • The middle column (positions 1, 4, 7) has POSITIVE weights
  • When bright pixels appear in the middle column, the model says "VERTICAL!"
  • The saliency map lights up exactly where the vertical line is
  
FOR HORIZONTAL LINES:
  • The side pixels of the middle row (positions 3 and 5) have NEGATIVE or low weights
  • When bright pixels appear across a row, they don't activate the "vertical" detector
  • The output is LOW, meaning "not vertical" = "horizontal"

THE MODEL LEARNED THE RIGHT PATTERN!
═══════════════════════════════════════════════════════════════════════

Our model didn't just memorize examples. It learned a GENERAL RULE:

  "Vertical lines have bright pixels stacked in a column.
   Horizontal lines have bright pixels spread across a row."

This is exactly what we hoped it would learn!

═══════════════════════════════════════════════════════════════════════
""")

# Show the pattern it learned
print("\nVisualized Pattern Recognition:")
print("-"*50)
print("""
  VERTICAL LINE:          MODEL LOOKS AT:
  [ ] [●] [ ]             [ ] [HIGH] [ ]
  [ ] [●] [ ]     →       [ ] [HIGH] [ ]
  [ ] [●] [ ]             [ ] [HIGH] [ ]
                          (Middle column weights are positive)
  
  HORIZONTAL LINE:        MODEL LOOKS AT:
  [ ] [ ] [ ]             [ ] [ ] [ ]
  [●] [●] [●]     →       [LOW] [LOW] [LOW]
  [ ] [ ] [ ]             [ ] [ ] [ ]
                          (Row weights don't support "vertical")
""")


---

## 6.6 Train/Test Split: Why We Need Separate Data

Throughout this notebook, we've used separate **training** and **test** data. This is crucial for honest evaluation.

### The Problem: Memorization vs Learning

A model could achieve 100% accuracy on training data by simply **memorizing** every example - like a student who memorizes test answers instead of understanding concepts.

But memorization isn't useful - we need the model to **generalize** to NEW data it has never seen.

| Approach | Training Accuracy | Test Accuracy | What Happened? |
|----------|------------------|---------------|----------------|
| True learning | 95% | 93% | Learned the general pattern |
| Memorization | 100% | 50% | Memorized training, fails on new |

### What IS a Train/Test Split?

We divide our data into two groups:

```
ALL DATA (150 samples)
    │
    ├── TRAINING SET (100 samples) ──→ Used to TRAIN the model
    │                                  Model sees these during learning
    │
    └── TEST SET (50 samples) ───────→ Used to EVALUATE the model
                                       Model NEVER sees these during training
```

### Why This Works

| Data Set | Model Sees During Training? | Purpose |
|----------|---------------------------|---------|
| **Training** | YES | Learn patterns |
| **Test** | NO | Evaluate generalization |

The test set acts as a "final exam" - questions the model has never seen.
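
One way such a split could be made with NumPy is sketched below. The function name `train_test_split_sketch` and the random stand-in data are assumptions for illustration; this is not necessarily how the notebook's own `X_train` and `X_test` were created.

```python
import numpy as np

def train_test_split_sketch(X, y, test_fraction=1/3, seed=0):
    """Shuffle the samples, then hold out `test_fraction` of them as a test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))          # shuffled sample indices
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Hypothetical stand-in data: 150 flattened 3x3 images with 0/1 labels.
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.random((150, 9))
y_demo = demo_rng.integers(0, 2, size=150)

X_tr, X_te, y_tr, y_te = train_test_split_sketch(X_demo, y_demo)
print(f"Training samples: {len(X_tr)}, test samples: {len(X_te)}")
```

Shuffling before splitting matters: if the data were sorted by class, a plain slice could put all of one class in the test set.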

### Committee Analogy

*"It's like preparing for an exam:*
- *Training data = study materials (examples you practice with)*
- *Test data = the actual exam (new questions you've never seen)*

*If you just memorize your notes without understanding, you'll ace the practice problems but fail the exam. If you truly learned the concepts, you'll do well on both."*

### The Golden Rule

**NEVER use test data for training!**

If the model sees test data during training, it can memorize those examples too, and our evaluation becomes meaningless.

### Common Split Ratios

| Split | Training | Test | When to Use |
|-------|----------|------|-------------|
| 80/20 | 80% | 20% | Large datasets (>10,000 samples) |
| 70/30 | 70% | 30% | Medium datasets (1,000-10,000) |
| 60/40 | 60% | 40% | Small datasets (<1,000) |

These ratios are rough rules of thumb, not strict requirements. More test data = more reliable evaluation, but less training data.

### Understanding Overfitting Mathematically

**What IS Overfitting?**

Overfitting is when a model learns the **noise** in the training data, not just the **signal**.

**Analogy:** Imagine studying for an exam by memorizing the exact wording of practice questions instead of understanding the concepts. You'd ace those exact questions but fail on new ones.

**How Train/Test Split Reveals Overfitting:**

| Scenario | Training Accuracy | Test Accuracy | What's Happening |
|----------|------------------|---------------|------------------|
| **Good learning** | 95% | 93% | Learned the pattern! |
| **Mild overfitting** | 99% | 85% | Some memorization |
| **Severe overfitting** | 100% | 50% | Memorized everything, learned nothing |

**The Math:**
- If a model memorizes all 100 training examples, it can get 100% training accuracy
- But those memorized patterns don't apply to new data
- Test accuracy reveals true generalization

**The Gap:**
$$\text{Overfitting Gap} = \text{Training Accuracy} - \text{Test Accuracy}$$

- Gap < 5%: Great! Model generalizes well
- Gap 5-15%: Some overfitting, might need more data or simpler model
- Gap > 15%: Serious overfitting, model is memorizing

### Why These Specific Ratios?

| More Training Data | More Test Data |
|-------------------|----------------|
| Model can learn more | More reliable evaluation |
| Better final accuracy | Smaller margin of error |
| Downside: less reliable evaluation | Downside: model might underfit |

**The sweet spot:** Enough training data to learn well, enough test data to evaluate reliably. With 100 samples, 80/20 gives 80 for training (decent) and 20 for testing (acceptable). With 10,000 samples, even 90/10 gives 1,000 test samples (very reliable).
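
To see why test-set size affects reliability, we can estimate the margin of error on a measured accuracy. The sketch below uses the normal approximation to the binomial distribution, a standard but rough estimate, especially for small test sets. The function name and the 90% accuracy figure are illustrative assumptions.

```python
import math

def accuracy_margin_of_error(accuracy, n_test, z=1.96):
    """Approximate 95% margin of error for an accuracy measured on n_test samples.

    Uses the normal approximation to the binomial distribution, which is
    rough for small n_test or for accuracies very close to 0 or 1.
    """
    return z * math.sqrt(accuracy * (1 - accuracy) / n_test)

for n_test in [20, 50, 1000]:
    moe = accuracy_margin_of_error(0.90, n_test)
    print(f"n_test={n_test:5d}: 90% accuracy is roughly +/- {moe:.1%}")
```

The margin shrinks with the square root of the test-set size, so cutting it in half takes about four times as many test samples.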


In [None]:
# =============================================================================
# TRAIN/TEST SPLIT: Our Results
# =============================================================================

print("="*70)
print("TRAIN/TEST SPLIT: Checking for Generalization")
print("="*70)

print(f"""
OUR DATA SPLIT:
  • Training set: {len(X_train)} samples (used for learning)
  • Test set: {len(X_test)} samples (used for evaluation only)
  • Split ratio: {len(X_train)}/{len(X_train)+len(X_test)} = {len(X_train)/(len(X_train)+len(X_test))*100:.0f}% training
  
RESULTS:
  • Training accuracy: {train_accuracy:.1%}
  • Test accuracy: {test_accuracy:.1%}
  • Difference: {abs(train_accuracy - test_accuracy):.1%}
""")

# Interpret the gap
diff = train_accuracy - test_accuracy

print("-"*70)
print("INTERPRETATION:")
print("-"*70)

if diff < 0.05:
    print("""
  ✓ EXCELLENT! Training and test accuracy are very similar.
  
  This suggests the model has LEARNED the general pattern,
  not just memorized the training data.
  
  Our model generalizes well to new data!
""")
elif diff < 0.15:
    print(f"""
  ⚠ CAUTION: Training accuracy is {diff:.1%} higher than test accuracy.
  
  Some memorization may have occurred.
  The model might be slightly "overfitting" to training data.
""")
else:
    print(f"""
  ⚠ WARNING: Training accuracy is {diff:.1%} higher than test accuracy!
  
  This suggests OVERFITTING - the model memorized training data
  but doesn't generalize well to new data.
  
  Possible solutions:
    - Get more training data
    - Use regularization
    - Simplify the model
""")


---

## Part 6 Summary: What We've Learned

### Key Concepts Mastered

| Concept | Definition/Formula | Why It Matters |
|---------|-------------------|----------------|
| **Training vs Inference** | Learning mode vs using mode | Different behaviors, same weights |
| **Accuracy** | (TP + TN) / Total | Simple overall view (but can mislead) |
| **Confusion Matrix** | TP, TN, FP, FN breakdown | Shows WHAT mistakes are made |
| **Precision** | TP / (TP + FP) | "When I say yes, am I right?" |
| **Recall** | TP / (TP + FN) | "Did I catch all the positives?" |
| **F1 Score** | 2 × (P × R) / (P + R) | Balance precision and recall |
| **Saliency** | \|weight × input\| | What did the model look at? |
| **Train/Test Split** | Separate data for evaluation | Detect memorization vs learning |

### The Four Categories Explained

| Category | Model Said | Truth Was | Meaning |
|----------|-----------|-----------|---------|
| **TP** (True Positive) | Vertical | Vertical | Correct detection |
| **TN** (True Negative) | Horizontal | Horizontal | Correct rejection |
| **FP** (False Positive) | Vertical | Horizontal | False alarm |
| **FN** (False Negative) | Horizontal | Vertical | Missed detection |

### Committee Analogy Progress

| Part | What Happened |
|------|--------------|
| Parts 1-3 | Committee member learned procedures |
| Part 4 | First case - confused, random guessing |
| Part 5 | Learned from feedback, became expert |
| **Part 6** | **Performance review: verified expertise and understood reasoning** |
| Part 7 | (Next) One expert isn't enough - building the full committee |

### The Big Picture

We now have a **complete, evaluated model** that:
- Achieves high accuracy on both training and test data
- Makes few mistakes (low FP and FN)
- Has interpretable learned weights
- Uses the RIGHT features (column patterns for vertical detection)
- Generalizes well to new data

---

## Knowledge Check


In [None]:
# =============================================================================
# KNOWLEDGE CHECK - Part 6
# =============================================================================

print("KNOWLEDGE CHECK - Part 6: Evaluation")
print("="*60)
print("\nAnswer these questions to test your understanding:\n")

questions = [
    {
        "q": "1. What's the difference between training and inference mode?",
        "options": [
            "A) Training is faster than inference",
            "B) In training, weights update; in inference, weights are frozen",
            "C) Inference uses more data than training",
            "D) They're the same thing with different names"
        ],
        "answer": "B",
        "explanation": "During training, the model learns and weights change after each example. During inference, weights are frozen and we just make predictions - no learning happens."
    },
    {
        "q": "2. A model predicts 'sick' for a healthy patient. What type of error is this?",
        "options": [
            "A) True Positive (TP)",
            "B) True Negative (TN)",
            "C) False Positive (FP)",
            "D) False Negative (FN)"
        ],
        "answer": "C",
        "explanation": "False Positive: We predicted Positive (sick), but we were False (wrong) - the patient was actually healthy. This is a 'false alarm'."
    },
    {
        "q": "3. You're building a disease detection system. Missing a sick patient is VERY bad.\n   Which metric should you prioritize?",
        "options": [
            "A) Accuracy",
            "B) Precision",
            "C) Recall",
            "D) F1 Score"
        ],
        "answer": "C",
        "explanation": "Recall measures 'did we catch all the positives?' High recall means we catch most sick patients, even if we have some false alarms. When missing positives is costly, prioritize recall."
    },
    {
        "q": "4. Why do we use a separate test set?",
        "options": [
            "A) To have more data for training",
            "B) To make training faster",
            "C) To check if the model memorized vs truly learned",
            "D) It's optional and not really needed"
        ],
        "answer": "C",
        "explanation": "A model could memorize training data and fail on new data. The test set (unseen data) reveals if it truly learned the general pattern or just memorized examples."
    },
    {
        "q": "5. What does a saliency map show?",
        "options": [
            "A) The accuracy of the model over time",
            "B) Which inputs the model focused on for its decision",
            "C) The training loss curve",
            "D) How fast the model runs"
        ],
        "answer": "B",
        "explanation": "Saliency maps highlight which parts of the input were most important for the model's decision. It's a form of interpretability - understanding WHY the model made its prediction."
    }
]

for q in questions:
    print(q["q"])
    for opt in q["options"]:
        print(f"   {opt}")
    print()

print("\n" + "="*60)
print("Scroll down for answers...")
print("="*60)


In [None]:
# =============================================================================
# ANSWERS - Knowledge Check Part 6
# =============================================================================

print("ANSWERS - Part 6 Knowledge Check")
print("="*60)

for i, q in enumerate(questions, 1):
    print(f"\n{i}. Answer: {q['answer']}")
    print(f"   {q['explanation']}")

print("\n" + "="*60)
print("How did you do?")
print("  5/5: Evaluation Master! Ready for Part 7!")
print("  4/5: Solid understanding - great job!")
print("  3/5: Review the sections you missed")
print("  <3:  Re-read Part 6 before continuing")
print("="*60)


---

## What's Next?

**Congratulations!** You've completed Part 6!

Our single neuron is now a **verified expert** - we've evaluated its performance, understood its decision-making process, and confirmed it learned the RIGHT patterns.

### But Here's the Thing...

A single neuron (Perceptron) can only learn **linearly separable patterns** - classes that can be split by a straight line (or, with more than two inputs, by a flat plane). For more complex problems, one expert isn't enough.

### The Limitation of Single Neurons

Some problems are **not linearly separable**. The classic example is the **XOR problem**:

| Input A | Input B | Output (XOR) |
|---------|---------|--------------|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |

No single neuron can learn this pattern! We need **multiple neurons working together**.
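
A small brute-force search illustrates this. It is a sketch, not a proof: it tries every weight and bias combination on a coarse grid (the range and step size are arbitrary choices made here) and reports the best accuracy one neuron reaches. The neuron uses a step decision `z >= 0`, which is equivalent to thresholding a sigmoid output at 0.5. AND is included for contrast because it is linearly separable.

```python
import itertools
import numpy as np

X_logic = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])
y_and = np.array([0, 0, 0, 1])  # AND is linearly separable, for contrast

def best_single_neuron_accuracy(X, y, grid):
    """Try every (w1, w2, b) on the grid with a step decision; return the best accuracy."""
    best = 0.0
    for w1, w2, b in itertools.product(grid, repeat=3):
        preds = (X @ np.array([w1, w2]) + b >= 0).astype(int)
        best = max(best, float(np.mean(preds == y)))
    return best

search_grid = np.linspace(-2, 2, 21)  # hypothetical search range, step 0.2

best_and = best_single_neuron_accuracy(X_logic, y_and, search_grid)
best_xor = best_single_neuron_accuracy(X_logic, y_xor, search_grid)
print(f"Best accuracy on AND: {best_and:.0%}")
print(f"Best accuracy on XOR: {best_xor:.0%}")
```

The search finds a perfect neuron for AND, but on XOR it never gets more than three of the four cases right, no matter which weights it tries.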

### Coming Up in Part 7: Hidden Layers - The Full Committee

In the next notebook, we'll explore:

- **Why one neuron isn't enough** - The XOR problem demonstration
- **Hidden layers** - Adding more neurons between input and output
- **The full committee** - Multiple experts with different perspectives
- **Universal approximation** - Why networks with hidden layers can approximate (almost) any function

---

**Continue to Part 7:** `part_7_hidden_layers.ipynb`

---

*"One expert is good. A committee of experts is powerful."*

**The Brain's Decision Committee** - From Expert to Team
