# Lesson 3: Baseline Model Training

## 🎓 LEARNING OBJECTIVES

In this notebook, you will:
1. Learn how to split data for time-series prediction
2. Understand feature scaling and why it matters
3. Establish baseline models to beat
4. Train your first Random Forest classifier
5. **Deeply understand evaluation metrics** (confusion matrix, precision, recall, F1)
6. Test whether tier features improve predictions
7. Discover which features matter most

## 📊 DATASET
- **Source**: `ml_project/data/match_features_historical.csv` (from Lesson 2A)
- **Size**: 1,900 matches from 2020-2025
- **Features**: 111 performance metrics + 3 tier features
- **Target**: Match outcome (Home Win / Draw / Away Win)

## 🎯 HYPOTHESIS TO TEST
**From Lesson 1D**: Different position tiers (Top 4, Mid-Table, Relegation) perform differently.

**Question**: Do tier features improve predictions?

We'll train TWO models:
1. **Model A**: WITHOUT tier features
2. **Model B**: WITH tier features

Then compare: Does Model B perform better? If yes, that supports the hypothesis!

---

Let's begin!

## Section 1: Setup & Data Loading

In [None]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import (
    accuracy_score, 
    classification_report, 
    confusion_matrix,
    ConfusionMatrixDisplay,
    precision_score,
    recall_score,
    f1_score
)
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import joblib
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set up paths
OUTPUT_DIR = Path('../../outputs/07_model_training')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
MODEL_DIR = Path('../../models')
MODEL_DIR.mkdir(parents=True, exist_ok=True)

# Set visualization style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("="*80)
print("LESSON 3: BASELINE MODEL TRAINING")
print("="*80)
print("\n📚 You're about to build your FIRST machine learning model!")
print("This notebook will teach you HOW and WHY at every step.\n")

In [None]:
# Load data
data_path = '../../data/match_features_historical.csv'
data = pd.read_csv(data_path)

print(f"✓ Dataset loaded: {len(data):,} matches")
print(f"✓ Features: {data.shape[1]} columns")
print(f"✓ Date range: {data['date'].min()} to {data['date'].max()}")
print(f"\nFirst few rows:")
data.head(3)

## Section 2: Understanding the Target Variable

Before building any model, we need to understand **what we're predicting**.

Our target variable is `match_outcome` with 3 possible values:
- **Home Win**: Home team won
- **Draw**: Match ended in a tie
- **Away Win**: Away team won

This is a **multiclass classification** problem (3 classes).

In [None]:
print("="*80)
print("UNDERSTANDING OUR TARGET: What are we predicting?")
print("="*80)

# Show outcome distribution
outcome_counts = data['match_outcome'].value_counts()
outcome_pcts = data['match_outcome'].value_counts(normalize=True)

print("\nMatch Outcomes Distribution:")
for outcome in ['Home Win', 'Draw', 'Away Win']:
    count = outcome_counts.get(outcome, 0)
    pct = outcome_pcts.get(outcome, 0)
    print(f"  {outcome:<12}: {count:4} matches ({pct:.1%})")

# Visualize (fixed order so each colour always maps to the same outcome)
plot_order = ['Home Win', 'Draw', 'Away Win']
plot_pcts = outcome_pcts.reindex(plot_order).fillna(0)
fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#2ecc71', '#f39c12', '#e74c3c']
bars = ax.bar(plot_pcts.index, plot_pcts.values, color=colors, alpha=0.7)
ax.set_ylabel('Proportion', fontsize=12, fontweight='bold')
ax.set_title(f'Match Outcome Distribution ({len(data):,} matches)', fontsize=14, fontweight='bold')
ax.axhline(1/3, color='black', linestyle='--', alpha=0.3, label='Random guess (33.3%)')

# Add percentage labels
for bar, pct in zip(bars, plot_pcts.values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{pct:.1%}', ha='center', va='bottom', fontweight='bold')

plt.legend()
plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'outcome_distribution.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n💡 TEACHING POINT:")
print("This is an IMBALANCED dataset:")
print(f"  - Home Win: {outcome_pcts.get('Home Win', 0):.1%} (most common)")
print(f"  - Draw: {outcome_pcts.get('Draw', 0):.1%} (least common)")
print(f"  - Away Win: {outcome_pcts.get('Away Win', 0):.1%}")
print("\nThis is NATURAL - home teams really do win more often!")
print("Our model needs to learn this pattern, not just guess 'Home Win' always.")

## Section 3: Prepare Features & Target

In supervised learning, we separate our data into:
- **X (features)**: The INPUT data the model sees (statistics, metrics, etc.)
- **y (target)**: The OUTPUT we want to predict (match outcome)

The model learns patterns: "When X looks like this, y is usually that."

In [None]:
print("="*80)
print("PREPARING DATA: Separating Features (X) from Target (y)")
print("="*80)

# Identify feature columns (exclude identifiers and targets)
exclude_cols = [
    'match_id', 'season', 'date', 'gameweek',
    'home_team', 'away_team',
    'match_outcome', 'home_goals', 'away_goals'
]

feature_cols = [col for col in data.columns if col not in exclude_cols]

X = data[feature_cols].copy()
y = data['match_outcome'].copy()

print(f"\n✓ Feature matrix X: {X.shape}")
print(f"  → {X.shape[0]:,} matches × {X.shape[1]} features")
print(f"\n✓ Target vector y: {y.shape}")
print(f"  → {y.shape[0]:,} outcomes to predict")

print("\n💡 TEACHING POINT:")
print("In supervised learning:")
print("  X = INPUT (what the model sees)")
print("  y = OUTPUT (what the model predicts)")
print(f"\nOur X has {len(feature_cols)} features like:")
for feat in feature_cols[:5]:
    print(f"  - {feat}")
print("  ...")
print(f"\nOur y has 3 classes: {sorted(y.unique().tolist())}")

print(f"\n📊 Feature types:")
tier_cols = [col for col in feature_cols if 'tier' in col.lower()]
stat_cols = [col for col in feature_cols if 'tier' not in col.lower()]
print(f"  - Performance statistics: {len(stat_cols)} features")
print(f"  - Tier features: {len(tier_cols)} features")
print(f"    {tier_cols}")

## Section 4: Train/Validation Split (Time-Based)

### 🚨 CRITICAL CONCEPT: Time-Based Splitting

**Wrong way**: Randomly shuffle and split  
❌ Problem: Mixes 2020 data with 2024 data. Model sees "future" during training!

**Right way**: Chronological split  
✅ Train on past (2020-2024), validate on recent (2024-2025)

This simulates real-world usage: predict future matches using historical patterns.
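
To see the rule in miniature, here is a small sketch on a made-up match log. The seasons and dates below are invented for illustration and are not taken from the dataset. It holds out the most recent season and checks that every training match was played before every validation match.

```python
from datetime import date

# Toy match log: (season, match date). Values are invented for illustration.
toy_matches = [
    ("2022-2023", date(2022, 8, 6)),
    ("2022-2023", date(2023, 5, 28)),
    ("2023-2024", date(2023, 8, 12)),
    ("2023-2024", date(2024, 5, 19)),
    ("2024-2025", date(2024, 8, 17)),
    ("2024-2025", date(2025, 5, 25)),
]

# Hold out the most recent season. Labels like "2024-2025" sort chronologically as strings.
latest_season = max(season for season, _ in toy_matches)
toy_train = [m for m in toy_matches if m[0] < latest_season]
toy_val = [m for m in toy_matches if m[0] == latest_season]

# Sanity check: the last training match is earlier than the first validation match.
no_leakage = max(d for _, d in toy_train) < min(d for _, d in toy_val)
print(len(toy_train), len(toy_val), no_leakage)
```

The same check (latest training date < earliest validation date) is a cheap guard to run on the real split.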

In [None]:
print("="*80)
print("SPLITTING DATA: Training vs Validation (CHRONOLOGICAL)")
print("="*80)

# Time-based split (not random!)
train_mask = data['season'].isin(['2020-2021', '2021-2022', '2022-2023', '2023-2024'])
val_mask = data['season'] == '2024-2025'

X_train = X[train_mask].copy()
y_train = y[train_mask].copy()
X_val = X[val_mask].copy()
y_val = y[val_mask].copy()

print(f"\n✓ Training set: {len(X_train):,} matches")
print(f"  Seasons: 2020-2021, 2021-2022, 2022-2023, 2023-2024")
print(f"\n✓ Validation set: {len(X_val):,} matches")
print(f"  Season: 2024-2025")

print(f"\nSplit ratio: {len(X_train)/len(X):.0%} train / {len(X_val)/len(X):.0%} validation")

# Check distribution in both sets
print("\n📊 Outcome distribution in each set:")
print("\nTraining:")
print(y_train.value_counts(normalize=True).sort_index())
print("\nValidation:")
print(y_val.value_counts(normalize=True).sort_index())

print("\n💡 TEACHING POINT: Why CHRONOLOGICAL split?")
print("❌ BAD: Random shuffle")
print("   → Mixes 2020 data with 2024 data")
print("   → Model sees 'future' during training")
print("   → Unrealistic!")
print("\n✓ GOOD: Time-based split")
print("   → Train on 2020-2024")
print("   → Test on 2024-2025")
print("   → Realistic: predict future from past")
print("   → This is how it works in production!")

## Section 5: Feature Scaling (Standardization)

### ⚡ CRITICAL: Why Scale Features?

Different features have VERY different scales:
- `total_pass_distance`: 40,000 (huge!)
- `shot_accuracy`: 0.40 (tiny!)

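Problem: Scale-sensitive models (k-NN, SVMs, regularized logistic regression, neural networks) would treat distance as far more important just because its numbers are about 100,000× bigger!

Solution: **StandardScaler** transforms all features to:  
- Mean ≈ 0  
- Standard Deviation ≈ 1

Now all features are comparable!

**Note**: Random Forest splits on thresholds, so it is largely insensitive to feature scale. We scale anyway so the same preprocessing can be reused with scale-sensitive models.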
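
Standardization is just "subtract the mean, divide by the standard deviation". The sketch below does it by hand for one feature, using invented numbers, and learns the mean and standard deviation from the training values only. The helper name `standardize` is made up for this illustration.

```python
from statistics import mean, pstdev

# Toy values for one feature (invented for illustration).
toy_train_values = [38000.0, 40000.0, 42000.0, 44000.0]
toy_val_values = [39000.0, 47000.0]

# Learn the scaling parameters from the TRAINING data only.
mu = mean(toy_train_values)
sigma = pstdev(toy_train_values)  # population std (ddof=0), as StandardScaler uses

def standardize(values):
    """Apply the training mean/std to any list of values."""
    return [(v - mu) / sigma for v in values]

toy_train_scaled = standardize(toy_train_values)
toy_val_scaled = standardize(toy_val_values)

print([round(z, 2) for z in toy_train_scaled])
print([round(z, 2) for z in toy_val_scaled])
```

The scaled training values have mean 0 and standard deviation 1 by construction. The validation values usually will not, and that is expected: they are measured against the training distribution.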

In [None]:
print("="*80)
print("FEATURE SCALING: Fixing the scale problem")
print("="*80)

# Show the problem BEFORE scaling
sample_features = ['home_total_pass_distance', 'home_shot_accuracy', 'home_shots_on_target_per_90']
print("\n📊 BEFORE SCALING - Features have very different scales:")
print(X_train[sample_features].describe().loc[['mean', 'std', 'min', 'max']].round(1))

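print("\n❌ Problem:")
print("  total_pass_distance: mean~40,000 (HUGE!)")
print("  shot_accuracy: mean~0.4 (tiny!)")
print("  → Scale-sensitive models (k-NN, SVM, logistic regression) would over-weight distance!")
print("  → Random Forest is largely scale-insensitive; we scale so the preprocessing is reusable.")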

# Initialize and fit scaler
scaler = StandardScaler()
scaler.fit(X_train)  # ONLY fit on training data!

# Transform both sets
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

# Convert back to DataFrame
X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_cols, index=X_train.index)
X_val_scaled = pd.DataFrame(X_val_scaled, columns=feature_cols, index=X_val.index)

print("\n📊 AFTER SCALING - All features on same scale:")
print(X_train_scaled[sample_features].describe().loc[['mean', 'std', 'min', 'max']].round(3))

print("\n✓ Solution: StandardScaler transforms every feature to:")
print("  → Mean ≈ 0")
print("  → Standard Deviation ≈ 1")
print("  → Now all features are comparable!")

print("\n💡 TEACHING POINT: Why fit ONLY on training?")
print("❌ BAD: scaler.fit(pd.concat([X_train, X_val]))")
print("   → Uses validation data!")
print("   → Information leaks from validation into training")
print("\n✓ GOOD: scaler.fit(X_train)")
print("   → Only uses training data")
print("   → Validation data stays unseen")
print("   → Then apply SAME scaling to validation")


## Section 6: Baseline Models (Benchmarks)

### 🎯 KEY CONCEPT: Establish Baselines

Before training complex models, establish "dumb" baselines:

**Baseline 1**: Random guessing (33/33/33)  
**Baseline 2**: Always predict "Home Win" (most common)  
**Baseline 3**: Proportional guessing (43% Home, 23% Draw, 34% Away)

### Why baselines matter:
- If your ML model gets 40% accuracy → ❌ WORSE than always guessing "Home Win" (43%)!
- If your ML model gets 58% accuracy → ✅ Beats baseline by 15 points. Model learned something!

**Your model MUST beat these baselines to be valuable.**
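
Here is the "always predict the most common class" baseline written out by hand on invented labels, as a rough sketch of what the `most_frequent` strategy does: learn the commonest class from the training labels, then predict it for every validation match.

```python
from collections import Counter

# Toy labels (invented for illustration).
toy_train_labels = ["Home Win"] * 5 + ["Away Win"] * 3 + ["Draw"] * 2
toy_val_labels = ["Home Win", "Away Win", "Home Win", "Draw", "Home Win"]

# Learn the most common class from the training labels...
majority_class = Counter(toy_train_labels).most_common(1)[0][0]

# ...and score "always predict that class" on the validation labels.
baseline_acc = sum(label == majority_class for label in toy_val_labels) / len(toy_val_labels)

print(majority_class, baseline_acc)
```

Note that the baseline's accuracy equals the share of the majority class in the validation set, so it shifts whenever the outcome mix changes between seasons.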

In [None]:
print("="*80)
print("BASELINE MODELS: What do we need to beat?")
print("="*80)

# Baseline 1: Random guessing
baseline_random = DummyClassifier(strategy='uniform', random_state=42)
baseline_random.fit(X_train_scaled, y_train)
y_pred_random = baseline_random.predict(X_val_scaled)
acc_random = accuracy_score(y_val, y_pred_random)

# Baseline 2: Always predict most frequent class
baseline_frequent = DummyClassifier(strategy='most_frequent')
baseline_frequent.fit(X_train_scaled, y_train)
y_pred_frequent = baseline_frequent.predict(X_val_scaled)
acc_frequent = accuracy_score(y_val, y_pred_frequent)

# Baseline 3: Proportional guessing
baseline_stratified = DummyClassifier(strategy='stratified', random_state=42)
baseline_stratified.fit(X_train_scaled, y_train)
y_pred_stratified = baseline_stratified.predict(X_val_scaled)
acc_stratified = accuracy_score(y_val, y_pred_stratified)

print("\n📊 BASELINE RESULTS:")
print(f"  Random guess (33/33/33): {acc_random:.1%}")
print(f"  Always 'Home Win': {acc_frequent:.1%}")
print(f"  Proportional guessing: {acc_stratified:.1%}")

baseline_threshold = max(acc_random, acc_frequent, acc_stratified)
print(f"\n🎯 TARGET TO BEAT: {baseline_threshold:.1%}")

print("\n💡 TEACHING POINT: Why establish baselines?")
print("Baselines = 'Dumb' strategies that require no ML")
print("\nIf your ML model gets 40% accuracy:")
print("  ❌ WORSE than always guessing 'Home Win' (43%)")
print("  ❌ Model is USELESS!")
print("\nIf your ML model gets 58% accuracy:")
print("  ✓ Beats baseline by 15 percentage points")
print("  ✓ Model learned something useful!")
print("\nYour model MUST beat these baselines to be valuable.")

## Section 7: Train Random Forest (WITHOUT Tiers)

### 🌳 What is Random Forest?

Random Forest = **Ensemble** of 100 decision trees voting

Each tree makes a prediction:  
- Tree 1: "Home Win"  
- Tree 2: "Away Win"  
- Tree 3: "Home Win"  
- ...  
- Tree 100: "Home Win"

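**Final prediction**: Majority vote  
→ 65 vote "Home Win" → Predict "Home Win" (65% confidence)

(This is a simplified picture. scikit-learn actually averages each tree's class probabilities and predicts the class with the highest average, which usually agrees with the majority vote.)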
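
A minimal sketch of combining trees by averaging their class probabilities, which is how scikit-learn's `RandomForestClassifier` forms its prediction. The three "trees" and their probabilities are invented for illustration.

```python
# Class order used throughout this sketch.
classes = ["Away Win", "Draw", "Home Win"]

# Per-tree class probabilities for ONE match (three toy trees, invented numbers).
tree_probas = [
    [0.20, 0.20, 0.60],  # tree 1 leans Home Win
    [0.50, 0.30, 0.20],  # tree 2 leans Away Win
    [0.10, 0.30, 0.60],  # tree 3 leans Home Win
]

# Average the probabilities column by column across trees.
n_trees = len(tree_probas)
avg_proba = [sum(col) / n_trees for col in zip(*tree_probas)]

# The predicted class is the one with the highest average probability.
prediction = classes[avg_proba.index(max(avg_proba))]

print([round(p, 3) for p in avg_proba], prediction)
```

The averaged probabilities are also what `predict_proba()` reports, so the "confidence" you see later is this average rather than a literal vote count.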

### Why Random Forest?
- Handles many features well
- More resistant to overfitting than a single decision tree
- Provides feature importance
- Works out-of-the-box with minimal tuning

In [None]:
print("="*80)
print("TRAINING MODEL #1: Random Forest WITHOUT Tier Features")
print("="*80)

# Remove tier features
tier_cols = [col for col in feature_cols if 'tier' in col.lower()]
feature_cols_no_tiers = [col for col in feature_cols if col not in tier_cols]

X_train_no_tiers = X_train_scaled[feature_cols_no_tiers].copy()
X_val_no_tiers = X_val_scaled[feature_cols_no_tiers].copy()

print(f"\nFeatures used: {len(feature_cols_no_tiers)} (excluded {len(tier_cols)} tier features)")
print(f"Tier features excluded: {tier_cols}")

# Initialize model
rf_no_tiers = RandomForestClassifier(
    n_estimators=100,        # 100 decision trees
    max_depth=15,            # Max tree depth
    min_samples_split=20,    # Min samples to split
    min_samples_leaf=10,     # Min samples in leaf
    random_state=42,         # Reproducibility
    n_jobs=-1,               # Use all CPU cores
    class_weight='balanced', # Handle class imbalance
    verbose=0                # No progress output
)

print("\n🌳 Training Random Forest...")
print("(This may take 30-60 seconds)")
import time
start = time.time()
rf_no_tiers.fit(X_train_no_tiers, y_train)
elapsed = time.time() - start

# Predict
y_val_pred_no_tiers = rf_no_tiers.predict(X_val_no_tiers)
acc_no_tiers = accuracy_score(y_val, y_val_pred_no_tiers)

print(f"\n✓ Model trained in {elapsed:.1f} seconds!")
print(f"📊 Validation Accuracy: {acc_no_tiers:.1%}")
print(f"📈 Improvement vs baseline: {(acc_no_tiers - baseline_threshold)*100:+.1f} percentage points")

if acc_no_tiers > baseline_threshold:
    print("\n✅ SUCCESS! Model beats baseline!")
else:
    print("\n❌ WARNING: Model doesn't beat baseline. Something's wrong.")

print("\n💡 TEACHING POINT: What is Random Forest?")
print("Random Forest = 100 decision trees voting")
print("\nEach tree makes a prediction:")
print("  Tree 1: 'Home Win'")
print("  Tree 2: 'Away Win'")
print("  Tree 3: 'Home Win'")
print("  ...")
print("  Tree 100: 'Home Win'")
print("\nFinal prediction: Majority vote")
print("  65 vote 'Home Win' → Predict 'Home Win' (65% confidence)")
print("  (Simplified: scikit-learn averages the trees' class probabilities)")

## Section 8: Train Random Forest (WITH Tiers) ⭐

### 🧪 HYPOTHESIS TEST

Now we'll train a SECOND model that INCLUDES tier features.

**Hypothesis**: Different tiers (Top 4, Mid-Table, Relegation) perform differently.  
**Test**: Does including tier features improve predictions?

If Model WITH tiers beats Model WITHOUT tiers by more than 2 percentage points → ✅ Hypothesis supported!
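
An accuracy gap is easier to judge when you know how many matches it rests on. The sketch below is a complementary check that this notebook does not run itself. On invented predictions it counts the matches where exactly one of the two models is right. All names and values are made up for illustration.

```python
# Toy validation outcomes and predictions (invented for illustration).
toy_actual = ["Home Win", "Draw", "Away Win", "Home Win", "Draw", "Away Win"]
toy_pred_a = ["Home Win", "Home Win", "Away Win", "Away Win", "Home Win", "Away Win"]  # without tiers
toy_pred_b = ["Home Win", "Draw", "Away Win", "Home Win", "Home Win", "Home Win"]      # with tiers

rows = list(zip(toy_actual, toy_pred_a, toy_pred_b))

# Matches where exactly one model is correct drive the whole accuracy gap.
only_a_right = sum(a == y and b != y for y, a, b in rows)
only_b_right = sum(b == y and a != y for y, a, b in rows)

acc_a = sum(a == y for y, a, _ in rows) / len(rows)
acc_b = sum(b == y for y, _, b in rows) / len(rows)

print(f"A only: {only_a_right}, B only: {only_b_right}, gap: {acc_b - acc_a:+.1%}")
```

The gap always equals `(only_b_right - only_a_right) / n`, so a small gap on a single season may come down to a handful of matches.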

In [None]:
print("="*80)
print("TRAINING MODEL #2: Random Forest WITH Tier Features")
print("="*80)

print(f"\nFeatures used: {len(feature_cols)} (includes {len(tier_cols)} tier features)")
print(f"Tier features: {tier_cols}")

# Initialize model (same hyperparameters)
rf_with_tiers = RandomForestClassifier(
    n_estimators=100,
    max_depth=15,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42,
    n_jobs=-1,
    class_weight='balanced',
    verbose=0
)

print("\n🌳 Training Random Forest...")
start = time.time()
rf_with_tiers.fit(X_train_scaled, y_train)
elapsed = time.time() - start

# Predict
y_val_pred_with_tiers = rf_with_tiers.predict(X_val_scaled)
acc_with_tiers = accuracy_score(y_val, y_val_pred_with_tiers)

print(f"\n✓ Model trained in {elapsed:.1f} seconds!")
print(f"📊 Validation Accuracy: {acc_with_tiers:.1%}")
print(f"📈 Improvement vs baseline: {(acc_with_tiers - baseline_threshold)*100:+.1f} percentage points")

print("\n🔬 HYPOTHESIS TEST: Do tier features help?")
print(f"Model WITHOUT tiers: {acc_no_tiers:.1%}")
print(f"Model WITH tiers: {acc_with_tiers:.1%}")
print(f"Difference: {(acc_with_tiers - acc_no_tiers)*100:+.1f} percentage points")

if acc_with_tiers > acc_no_tiers + 0.02:
    print("\n✅ VALIDATED: Tier features improve predictions!")
    print("Your Lesson 1D insight (different tiers succeed differently) is supported on this validation season!")
elif acc_with_tiers > acc_no_tiers:
    print("\n⚠️ WEAK SUPPORT: Tier features help slightly")
    print(f"Improvement is small ({(acc_with_tiers - acc_no_tiers)*100:.1f} pp)")
else:
    print("\n❌ NOT VALIDATED: Tier features don't help")
    print("Model performs same or worse with tier features.")

## Section 9: Understanding the Confusion Matrix

### 🎓 DEEP DIVE: What is a Confusion Matrix?

A confusion matrix shows **WHERE your model makes mistakes**.

```
              Predicted
              Away  Draw  Home
Actual  Away   XX    YY    ZZ   ← When actual=Away Win, model predicted:
        Draw   AA    BB    CC   ← When actual=Draw, model predicted:
        Home   DD    EE    FF   ← When actual=Home Win, model predicted:
```

**Diagonal** (XX, BB, FF) = CORRECT predictions  
**Off-diagonal** = ERRORS (confusions)

### Example:
If `ZZ = 20` (row 1, column 3):  
→ 20 times, actual outcome was "Away Win" but model predicted "Home Win"  
→ Model confused away wins for home wins!
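
To make the layout concrete, here is a sketch that builds a 3×3 confusion matrix by hand from six invented matches, with rows as actual outcomes and columns as predictions.

```python
labels = ["Away Win", "Draw", "Home Win"]

# Toy outcomes and predictions (invented for illustration).
toy_actual = ["Away Win", "Away Win", "Draw", "Home Win", "Home Win", "Home Win"]
toy_predicted = ["Away Win", "Home Win", "Home Win", "Home Win", "Home Win", "Draw"]

# Rows = actual class, columns = predicted class.
index = {label: i for i, label in enumerate(labels)}
toy_cm = [[0] * len(labels) for _ in labels]
for a, p in zip(toy_actual, toy_predicted):
    toy_cm[index[a]][index[p]] += 1

for label, row in zip(labels, toy_cm):
    print(f"{label:<9}", row)

# Diagonal entries are the correct predictions.
toy_correct = sum(toy_cm[i][i] for i in range(len(labels)))
print("correct:", toy_correct, "of", len(toy_actual))
```

Each row sums to the number of actual matches in that class, and each column sums to the number of times that class was predicted.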

Let's look at YOUR confusion matrix:

In [None]:
print("="*80)
print("EVALUATION: Understanding the Confusion Matrix")
print("="*80)

# Generate confusion matrix
cm = confusion_matrix(y_val, y_val_pred_with_tiers, labels=['Away Win', 'Draw', 'Home Win'])

# Calculate totals for each class
totals_actual = cm.sum(axis=1)
totals_predicted = cm.sum(axis=0)

print("\n📊 CONFUSION MATRIX (Model WITH tiers):")
print("\n              Predicted")
print("              Away   Draw   Home   | Total Actual")
print(f"Actual  Away  {cm[0,0]:4}   {cm[0,1]:4}   {cm[0,2]:4}  |  {totals_actual[0]:4}")
print(f"        Draw  {cm[1,0]:4}   {cm[1,1]:4}   {cm[1,2]:4}  |  {totals_actual[1]:4}")
print(f"        Home  {cm[2,0]:4}   {cm[2,1]:4}   {cm[2,2]:4}  |  {totals_actual[2]:4}")
print("        " + "-"*40)
print(f"Total Pred    {totals_predicted[0]:4}   {totals_predicted[1]:4}   {totals_predicted[2]:4}")

# Calculate per-class recall (correct predictions / actual matches of that class)
print("\n📈 Per-Class Recall (diagonal / row total):")
class_labels = ['Away Win', 'Draw', 'Home Win']
for i, label in enumerate(class_labels):
    class_recall = cm[i,i] / totals_actual[i] if totals_actual[i] > 0 else 0
    print(f"  {label:<12}: {cm[i,i]:3}/{totals_actual[i]:3} = {class_recall:.1%}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=['Away Win', 'Draw', 'Home Win']
)
disp.plot(ax=ax, cmap='Blues', values_format='d', colorbar=False)
plt.title('Confusion Matrix - Random Forest WITH Tiers\n(Validation Set)', 
          fontsize=14, fontweight='bold')

# Add per-class recall annotations (guard against empty classes)
for i in range(3):
    class_acc = cm[i,i] / totals_actual[i] if totals_actual[i] > 0 else 0
    ax.text(i, i-0.3, f'{class_acc:.0%}', ha='center', va='center', 
            color='white', fontweight='bold', fontsize=10)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'confusion_matrix_with_tiers.png', dpi=300, bbox_inches='tight')
plt.show()

print("\n" + "="*80)
print("💡 TEACHING POINT: How to Read This Matrix")
print("="*80)
print("\n1. DIAGONAL = Correct predictions")
print(f"   - Top-left ({cm[0,0]}): Correctly predicted Away Win")
print(f"   - Middle ({cm[1,1]}): Correctly predicted Draw")
print(f"   - Bottom-right ({cm[2,2]}): Correctly predicted Home Win")
print(f"   - Total correct: {cm[0,0] + cm[1,1] + cm[2,2]} / {totals_actual.sum()} = {acc_with_tiers:.1%}")

print("\n2. OFF-DIAGONAL = Errors (confusions)")
print(f"   - Row 1, Col 3 ({cm[0,2]}): Away wins predicted as Home wins")
print(f"   - Row 3, Col 1 ({cm[2,0]}): Home wins predicted as Away wins")
print(f"   - Draws misclassified: {cm[1,0] + cm[1,2]} times")

print("\n3. INSIGHTS FROM YOUR MATRIX:")
# Find the class with the lowest recall (diagonal / row total)
worst_class_idx = np.argmin([cm[i,i]/totals_actual[i] for i in range(3)])
worst_class = class_labels[worst_class_idx]
worst_acc = cm[worst_class_idx, worst_class_idx] / totals_actual[worst_class_idx]
print(f"   - Hardest to predict: {worst_class} ({worst_acc:.1%} recall)")

# Find the class with the highest recall
best_class_idx = np.argmax([cm[i,i]/totals_actual[i] for i in range(3)])
best_class = class_labels[best_class_idx]
best_acc = cm[best_class_idx, best_class_idx] / totals_actual[best_class_idx]
print(f"   - Easiest to predict: {best_class} ({best_acc:.1%} recall)")

## Section 10: Precision, Recall, and F1-Score Explained

### 🎓 DEEP DIVE: Beyond Accuracy

**Accuracy** = Overall correctness  
But sometimes we need to know MORE:

---

### 1. PRECISION: "When I predict X, how often am I right?"

**Formula**: Precision = True Positives / (True Positives + False Positives)

**Example**: Predicting "Home Win"  
- Model predicted "Home Win" 100 times  
- 80 were actually Home Wins (✓ correct)  
- 20 were actually Draws or Away Wins (✗ wrong)

**Precision** = 80 / 100 = **80%**

**Meaning**: When model says "Home Win", it's right 80% of the time.

---

### 2. RECALL: "Of all actual X, how many did I catch?"

**Formula**: Recall = True Positives / (True Positives + False Negatives)

**Example**: Predicting "Home Win"  
- There were 120 actual Home Wins  
- Model correctly predicted 80 of them (✓ found)  
- Missed 40 (✗ predicted as Draw or Away Win)

**Recall** = 80 / 120 = **67%**

**Meaning**: Model catches 67% of all Home Wins.

---

### 3. F1-SCORE: "Harmonic mean of precision and recall"

**Formula**: F1 = 2 × (Precision × Recall) / (Precision + Recall)

**Why?** Sometimes precision and recall trade off:
- High precision, low recall → Model is careful but misses many  
- Low precision, high recall → Model catches many but makes many mistakes

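**F1-score** balances both. Because it is a harmonic mean, it is pulled toward the lower of the two, so a model cannot score well by excelling at only one.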
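
Plugging the "Home Win" numbers from the two examples above into the formulas gives a quick worked check. The variable names are just for this sketch.

```python
# Counts from the "Home Win" examples above.
tp = 80  # predicted Home Win, and it was a Home Win
fp = 20  # predicted Home Win, but it was a Draw or Away Win
fn = 40  # was a Home Win, but predicted as something else

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# → precision=0.800 recall=0.667 f1=0.727
```

The F1-score (0.727) sits slightly below the plain average of the two (0.733), which is the harmonic mean leaning toward the weaker metric.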

---

### When to use which metric?

**Accuracy**: Overall performance (good for balanced datasets)  
**Precision**: When false positives are costly (e.g., spam detection)  
**Recall**: When false negatives are costly (e.g., disease diagnosis)  
**F1-Score**: When you want balance (most ML tasks)

In [None]:
print("="*80)
print("EVALUATION: Precision, Recall, F1-Score")
print("="*80)

# Use sklearn's classification report
report = classification_report(
    y_val, 
    y_val_pred_with_tiers,
    labels=['Away Win', 'Draw', 'Home Win'],
    target_names=['Away Win', 'Draw', 'Home Win'],
    output_dict=True
)

print("\n📊 CLASSIFICATION REPORT:")
print(classification_report(
    y_val, 
    y_val_pred_with_tiers,
    labels=['Away Win', 'Draw', 'Home Win'],
    target_names=['Away Win', 'Draw', 'Home Win']
))

print("\n" + "="*80)
print("💡 TEACHING: Let's Calculate These BY HAND Using YOUR Data")
print("="*80)

# Let's manually calculate for "Home Win" as an example
print("\n🏠 Example: 'Home Win' Metrics")
print("-" * 40)

# True Positives: Actual=Home Win, Predicted=Home Win
tp_home = cm[2, 2]
# False Positives: Actual≠Home Win, Predicted=Home Win
fp_home = cm[0, 2] + cm[1, 2]
# False Negatives: Actual=Home Win, Predicted≠Home Win
fn_home = cm[2, 0] + cm[2, 1]

precision_home = tp_home / (tp_home + fp_home) if (tp_home + fp_home) > 0 else 0
recall_home = tp_home / (tp_home + fn_home) if (tp_home + fn_home) > 0 else 0
f1_home = 2 * (precision_home * recall_home) / (precision_home + recall_home) if (precision_home + recall_home) > 0 else 0

print(f"\nTrue Positives (TP):  {tp_home}  ← Correctly predicted Home Win")
print(f"False Positives (FP): {fp_home}  ← Predicted Home Win, but wasn't")
print(f"False Negatives (FN): {fn_home}  ← Was Home Win, but missed it")

print(f"\nPRECISION = TP / (TP + FP)")
print(f"          = {tp_home} / ({tp_home} + {fp_home})")
print(f"          = {tp_home} / {tp_home + fp_home}")
print(f"          = {precision_home:.2f} ({precision_home:.1%})")
print(f"\n→ When model predicts 'Home Win', it's right {precision_home:.1%} of the time.")

print(f"\nRECALL = TP / (TP + FN)")
print(f"       = {tp_home} / ({tp_home} + {fn_home})")
print(f"       = {tp_home} / {tp_home + fn_home}")
print(f"       = {recall_home:.2f} ({recall_home:.1%})")
print(f"\n→ Model catches {recall_home:.1%} of all actual Home Wins.")

print(f"\nF1-SCORE = 2 × (Precision × Recall) / (Precision + Recall)")
print(f"         = 2 × ({precision_home:.2f} × {recall_home:.2f}) / ({precision_home:.2f} + {recall_home:.2f})")
print(f"         = {f1_home:.2f} ({f1_home:.1%})")
print(f"\n→ Balanced score combining precision and recall.")

print("\n" + "="*80)
print("💡 WHEN TO USE WHICH METRIC?")
print("="*80)
print("\n1. ACCURACY: Overall performance")
print(f"   Your model: {acc_with_tiers:.1%}")
print("   Use when: All classes matter equally")

print("\n2. PRECISION: Avoiding false alarms")
print(f"   Best class: {class_labels[np.argmax([report[c]['precision'] for c in class_labels])]}")
print("   Use when: False positives are costly")
print("   Example: Spam detection (don't want good emails in spam)")

print("\n3. RECALL: Catching everything important")
print(f"   Best class: {class_labels[np.argmax([report[c]['recall'] for c in class_labels])]}")
print("   Use when: False negatives are costly")
print("   Example: Disease screening (don't want to miss sick patients)")

print("\n4. F1-SCORE: Balanced performance")
print(f"   Overall: {report['weighted avg']['f1-score']:.1%}")
print("   Use when: You want balance between precision and recall")
print("   Example: Most ML classification tasks (including this one!)")

## Section 11: Feature Importance Analysis

### 🔍 What Matters Most?

Random Forest tells us **how much each feature contributed** to its splits.

**Feature Importance** = The share of the forest's total impurity reduction (how much its splits "clean up" the classes) that comes from each feature. The values sum to 1.

Example:  
- `shots_on_target_per_90 = 0.15` (15%) → This feature accounts for 15% of the total impurity reduction  
- `some_random_stat = 0.001` (0.1%) → Model barely uses it

This helps us:
1. Understand what drives predictions
2. Remove useless features
3. Validate domain knowledge
4. Test if tier features matter (our hypothesis!)
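
Importances are shares of a total, so they always sum to 1. The sketch below shows that normalization on invented numbers. The raw values and the `home_tier` feature name are hypothetical and only there to illustrate the arithmetic.

```python
# Hypothetical total impurity decrease credited to each feature (invented numbers).
raw_decrease = {
    "home_shots_on_target_per_90": 30.0,
    "away_shots_on_target_per_90": 24.0,
    "home_tier": 4.0,          # hypothetical tier feature name
    "some_random_stat": 2.0,
}

# Normalize so each importance is a share of the total and they sum to 1.
total = sum(raw_decrease.values())
importance = {name: value / total for name, value in raw_decrease.items()}

# Rank features from most to least important.
ranked = sorted(importance.items(), key=lambda item: item[1], reverse=True)
for name, share in ranked:
    print(f"{name:<30} {share:.1%}")

# Combined share of features whose name contains "tier".
tier_share = sum(share for name, share in importance.items() if "tier" in name)
print(f"tier share: {tier_share:.1%}")
```

Because the shares are relative, a feature's importance can drop simply because other informative features were added, not because it became less useful.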

In [None]:
print("="*80)
print("FEATURE IMPORTANCE: What matters most?")
print("="*80)

# Get feature importances
feature_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_with_tiers.feature_importances_
}).sort_values('importance', ascending=False)

# Save full list
feature_importance.to_csv(OUTPUT_DIR / 'feature_importance_full.csv', index=False)

print("\n📊 TOP 20 MOST IMPORTANT FEATURES:")
for i, (idx, row) in enumerate(feature_importance.head(20).iterrows(), 1):
    bar = '█' * int(row['importance'] * 200)  # Scale for visibility
    tier_marker = " 🎯" if 'tier' in row['feature'].lower() else ""
    print(f"{i:2}. {row['feature']:<50} {row['importance']:.4f} {bar}{tier_marker}")

# Visualize
fig, ax = plt.subplots(figsize=(12, 10))
top_20 = feature_importance.head(20)
colors = ['darkgreen' if 'tier' in feat.lower() else 'steelblue' for feat in top_20['feature']]
ax.barh(range(len(top_20)), top_20['importance'], color=colors, alpha=0.8)
ax.set_yticks(range(len(top_20)))
ax.set_yticklabels(top_20['feature'])
ax.set_xlabel('Importance', fontsize=12, fontweight='bold')
ax.set_title('Top 20 Most Important Features\n(Green = Tier features)', 
             fontsize=14, fontweight='bold')
ax.invert_yaxis()
plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'feature_importance_top20.png', dpi=300, bbox_inches='tight')
plt.show()

# Check tier feature importance
tier_importance = feature_importance[feature_importance['feature'].str.contains('tier', case=False)]
print("\n🎯 TIER FEATURE IMPORTANCE:")
print(tier_importance.to_string(index=False))
print(f"\nTotal tier importance: {tier_importance['importance'].sum():.1%}")
if len(tier_importance) > 0:
    first_tier_idx = feature_importance.index[feature_importance['feature'].isin(tier_cols)][0]
    first_tier_rank = list(feature_importance.index).index(first_tier_idx) + 1
    print(f"Highest ranking tier feature: #{first_tier_rank} ({feature_importance.iloc[first_tier_rank-1]['feature']})")

print("\n💡 TEACHING POINT:")
print("Feature importance = How much the model relies on each feature")
print("\nIf shots_on_target_per_90 = 0.15 (15%):")
print("  → It accounts for 15% of the total impurity reduction across all trees")
print("  → One of the most important features")
print("\nIf some_random_stat = 0.001 (0.1%):")
print("  → Model barely uses it")
print("  → Could probably be removed without hurting performance (retrain to confirm)")

# Check if gold standard features are important
gold_features = ['home_shots_on_target_per_90', 'away_shots_on_target_per_90',
                'home_touches_att_penalty', 'away_touches_att_penalty']
gold_in_top20 = [f for f in gold_features if f in top_20['feature'].values]
if gold_in_top20:
    print(f"\n⭐ VALIDATION: Gold standard features in top 20: {gold_in_top20}")
    print("Your Lesson 1D correlations were RIGHT!")

## Section 12: Model Comparison

Let's compare ALL our models side-by-side:

In [None]:
print("="*80)
print("MODEL COMPARISON: Which performs best?")
print("="*80)

# Compile results
results = {
    'Model': [
        'Random Guess',
        'Always "Home Win"',
        'Proportional Guess',
        'Random Forest (No Tiers)',
        'Random Forest (With Tiers)'
    ],
    'Accuracy': [
        acc_random,
        acc_frequent,
        acc_stratified,
        acc_no_tiers,
        acc_with_tiers
    ],
    'Type': [
        'Baseline',
        'Baseline',
        'Baseline',
        'ML Model',
        'ML Model'
    ]
}

results_df = pd.DataFrame(results).sort_values('Accuracy', ascending=False)

print("\n📊 RESULTS SUMMARY:")
print(results_df.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
colors = ['gray' if t=='Baseline' else 'darkgreen' if 'With Tiers' in m else 'steelblue' 
          for m, t in zip(results_df['Model'], results_df['Type'])]
bars = ax.barh(results_df['Model'], results_df['Accuracy'], color=colors, alpha=0.8)
ax.set_xlabel('Accuracy', fontsize=12, fontweight='bold')
ax.set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
ax.axvline(baseline_threshold, color='red', linestyle='--', linewidth=2, label='Baseline Threshold')

# Add labels
for bar, acc in zip(bars, results_df['Accuracy']):
    ax.text(acc + 0.005, bar.get_y() + bar.get_height()/2, 
            f'{acc:.1%}', va='center', fontweight='bold')

plt.legend()
plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()

print(f"\n🏆 BEST MODEL: {results_df.iloc[0]['Model']}")
print(f"📈 Accuracy: {results_df.iloc[0]['Accuracy']:.1%}")
print(f"🎯 Beat baseline by: {(results_df.iloc[0]['Accuracy'] - baseline_threshold)*100:.1f} percentage points")

## Section 13: Example Predictions

Let's see the model in action! We'll look at 10 random validation matches and see:
- What the model predicted
- What actually happened
- How confident the model was

In [None]:
print("="*80)
print("EXAMPLE PREDICTIONS: See the model in action")
print("="*80)

# Get sample matches
val_data = data[val_mask].copy()
sample_indices = val_data.sample(10, random_state=42).index

print("\n🎯 10 Random Validation Matches:")
print("="*80)

for i, idx in enumerate(sample_indices, 1):
    match = data.loc[idx]
    
    # Get features (keep a one-row DataFrame so feature names match training)
    match_features = X_val_scaled.loc[[idx]]
    
    # Predict
    prediction = rf_with_tiers.predict(match_features)[0]
    probabilities = rf_with_tiers.predict_proba(match_features)[0]
    
    # Map to classes
    classes = rf_with_tiers.classes_
    prob_dict = {cls: prob for cls, prob in zip(classes, probabilities)}
    
    actual = match['match_outcome']
    correct = "✅" if prediction == actual else "❌"
    
    print(f"\n{i}. {match['home_team']} vs {match['away_team']}")
    print(f"   Date: {match['date']} | Gameweek: {int(match['gameweek'])}")
    print(f"   Actual: {actual} | Predicted: {prediction} {correct}")
    print(f"   Confidence: ", end="")
    for cls in ['Home Win', 'Draw', 'Away Win']:
        if cls in prob_dict:
            marker = "←" if cls == prediction else ""
            print(f"{cls}: {prob_dict[cls]:.0%}{marker}  ", end="")
    print()

print("\n💡 TEACHING POINT:")
print("predict_proba() returns one estimated probability per class (they sum to 100%)")
print("\nIllustrative example (made-up numbers): Arsenal vs Chelsea")
print("  Home Win: 62%  ← The forest puts a 62% probability on a home win")
print("  Draw: 23%")
print("  Away Win: 15%")
print("\nHigher top probability = Model is more sure")
print("Lower top probability = Uncertain match")
print("These scores are not guaranteed to be calibrated: 62% need not mean 'right 62% of the time'")
print("\nEven when wrong, looking at confidence helps:")
print("  - High confidence + wrong = Model may have learned a misleading pattern")
print("  - Low confidence + wrong = Match was hard to call from these features")
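
The loop above scores one match at a time so that each prediction can be printed next to the match details. As a minimal sketch of the same idea in batch form, the example below calls `predict_proba()` once for a whole validation set and compares accuracy on confident and uncertain predictions. The data is synthetic (`make_classification`), and the 60% cut-off is an arbitrary choice for illustration, not a value from this lesson.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data standing in for the match features (illustrative only)
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    n_classes=3, random_state=42
)
X_train, X_val = X[:450], X[450:]
y_train, y_val = y[:450], y[450:]

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# One call scores every validation row; each row of `proba` sums to 1
proba = model.predict_proba(X_val)
pred = model.predict(X_val)

# predict() returns the class with the highest predicted probability
assert np.array_equal(model.classes_[proba.argmax(axis=1)], pred)

# Treat the top probability as the model's confidence in its prediction
confidence = proba.max(axis=1)
correct = pred == y_val


def bucket_accuracy(mask):
    """Accuracy within a subset of rows, or NaN if the subset is empty."""
    return correct[mask].mean() if mask.any() else float("nan")


high = confidence >= 0.6  # arbitrary cut-off, chosen only for this demo
print(f"High confidence (>= 60%): {high.sum():3d} rows, accuracy {bucket_accuracy(high):.1%}")
print(f"Low confidence  (<  60%): {(~high).sum():3d} rows, accuracy {bucket_accuracy(~high):.1%}")
```

If the high-confidence bucket is not noticeably more accurate than the low-confidence one, the probabilities are probably not telling you much, and calibration is worth a look.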

## Section 14: Save Models

Let's save our trained models so we can use them later without retraining!

In [None]:
print("="*80)
print("SAVING MODELS")
print("="*80)

# Save both models and the fitted scaler
joblib.dump(rf_no_tiers, MODEL_DIR / 'rf_no_tiers.pkl')
joblib.dump(rf_with_tiers, MODEL_DIR / 'rf_with_tiers.pkl')
joblib.dump(scaler, MODEL_DIR / 'feature_scaler.pkl')

print("\n✓ Models saved:")
print(f"  - {MODEL_DIR / 'rf_no_tiers.pkl'}")
print(f"  - {MODEL_DIR / 'rf_with_tiers.pkl'}")
print(f"  - {MODEL_DIR / 'feature_scaler.pkl'}")

# Save metadata
model_metadata = {
    'training_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'training_samples': len(X_train),
    'validation_samples': len(X_val),
    'features_total': len(feature_cols),
    'features_no_tiers': len(feature_cols_no_tiers),
    'tier_features': tier_cols,
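    # Read the values from the fitted model so the metadata cannot drift from the training code
    'hyperparameters': {
        name: rf_with_tiers.get_params()[name]
        for name in ('n_estimators', 'max_depth', 'min_samples_split',
                     'min_samples_leaf', 'class_weight')
    },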
    'results': {
        'baseline_random': float(acc_random),
        'baseline_frequent': float(acc_frequent),
        'rf_no_tiers': float(acc_no_tiers),
        'rf_with_tiers': float(acc_with_tiers),
        'tier_improvement': float(acc_with_tiers - acc_no_tiers),
        'baseline_improvement': float(acc_with_tiers - baseline_threshold)
    },
    'top_5_features': feature_importance.head(5)['feature'].tolist(),
    'tier_feature_total_importance': float(tier_importance['importance'].sum())
}

with open(OUTPUT_DIR / 'model_metadata.json', 'w') as f:
    json.dump(model_metadata, f, indent=2)

print(f"  - {OUTPUT_DIR / 'model_metadata.json'}")

print("\n💡 To load these models later:")
print("  import joblib")
print("  model = joblib.load('path/to/rf_with_tiers.pkl')")
print("  scaler = joblib.load('path/to/feature_scaler.pkl')")
print("  # new_data needs the same feature columns, in the same order, as in training")
print("  predictions = model.predict(scaler.transform(new_data))")
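
Before relying on saved files, it is worth checking that a save/load round trip really reproduces the same predictions. The sketch below does that on synthetic data in a temporary directory. The file names `rf_demo.pkl` and `scaler_demo.pkl` are hypothetical, and the model is a small stand-in, not the one trained in this lesson.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (illustrative only)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4,
    n_classes=3, random_state=0
)

scaler = StandardScaler().fit(X)
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(scaler.transform(X), y)
expected = model.predict(scaler.transform(X))

with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "rf_demo.pkl"        # hypothetical file name
    scaler_path = Path(tmp) / "scaler_demo.pkl"   # hypothetical file name

    # Persist both objects: the model is useless without the matching scaler
    joblib.dump(model, model_path)
    joblib.dump(scaler, scaler_path)

    loaded_model = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# The restored pair should behave exactly like the originals
restored = loaded_model.predict(loaded_scaler.transform(X))
round_trip_ok = np.array_equal(expected, restored)
print("Round trip preserved predictions:", round_trip_ok)
```

Pickled scikit-learn objects are only expected to load reliably with the same library version that saved them, so recording the version next to the training date in the metadata is a reasonable habit.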

## Section 15: Generate Comprehensive Report

In [None]:
print("="*80)
print("GENERATING FINAL REPORT")
print("="*80)

report_path = OUTPUT_DIR / 'training_report.txt'

with open(report_path, 'w') as f:
    f.write("="*80 + "\n")
    f.write("LESSON 3: BASELINE MODEL TRAINING REPORT\n")
    f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write("="*80 + "\n\n")
    
    f.write("1. DATASET OVERVIEW\n")
    f.write("-"*80 + "\n")
    f.write(f"Training samples: {len(X_train):,} matches (2020-2024)\n")
    f.write(f"Validation samples: {len(X_val):,} matches (2024-2025)\n")
    f.write(f"Total features: {len(feature_cols)}\n")
    f.write("Target classes: Away Win / Draw / Home Win\n\n")
    
    f.write("2. BASELINE PERFORMANCE\n")
    f.write("-"*80 + "\n")
    f.write(f"Random guess: {acc_random:.1%}\n")
    f.write(f"Always 'Home Win': {acc_frequent:.1%}\n")
    f.write(f"Proportional guessing: {acc_stratified:.1%}\n")
    f.write(f"Threshold to beat: {baseline_threshold:.1%}\n\n")
    
    f.write("3. MODEL PERFORMANCE\n")
    f.write("-"*80 + "\n")
    f.write(f"Random Forest (no tiers): {acc_no_tiers:.1%}\n")
    f.write(f"Random Forest (with tiers): {acc_with_tiers:.1%}\n\n")
    f.write(f"Improvement from tiers: {(acc_with_tiers - acc_no_tiers)*100:+.1f} pp\n")
    f.write(f"Improvement vs baseline: {(acc_with_tiers - baseline_threshold)*100:+.1f} pp\n\n")
    
    f.write("4. HYPOTHESIS TEST: Do Tier Features Help?\n")
    f.write("-"*80 + "\n")
    # Rule of thumb: treat a gain of more than 2 pp on the validation set as meaningful
    if acc_with_tiers > acc_no_tiers + 0.02:
        f.write("✅ VALIDATED: Tier features improve validation accuracy by more than 2 pp!\n")
        f.write(f"   Model WITH tiers beats model WITHOUT tiers by {(acc_with_tiers - acc_no_tiers)*100:.1f} pp\n")
        f.write("   This supports your Lesson 1D insight:\n")
        f.write("   different tiers DO succeed differently, and the model learned this.\n")
    elif acc_with_tiers > acc_no_tiers:
        f.write("⚠️ WEAK SUPPORT: Tier features help slightly\n")
        f.write(f"   Improvement is small ({(acc_with_tiers - acc_no_tiers)*100:.1f} pp) and could be noise on a single validation split\n")
        f.write("   Hypothesis has some validity but effect is weak.\n")
    else:
        f.write("❌ NOT VALIDATED: Tier features don't help\n")
        f.write("   Model performs same or worse with tier features.\n")
        f.write("   Hypothesis needs revision.\n")
    f.write("\n")
    
    f.write("5. CONFUSION MATRIX ANALYSIS\n")
    f.write("-"*80 + "\n")
    f.write("Confusion Matrix (Model WITH tiers):\n\n")
    f.write("              Predicted\n")
    f.write("              Away   Draw   Home   | Total Actual\n")
    f.write(f"Actual  Away   {cm[0,0]:4}   {cm[0,1]:4}   {cm[0,2]:4}  |  {totals_actual[0]:4}\n")
    f.write(f"        Draw   {cm[1,0]:4}   {cm[1,1]:4}   {cm[1,2]:4}  |  {totals_actual[1]:4}\n")
    f.write(f"        Home   {cm[2,0]:4}   {cm[2,1]:4}   {cm[2,2]:4}  |  {totals_actual[2]:4}\n")
    f.write("        " + "-"*40 + "\n")
    f.write(f"Total Pred     {totals_predicted[0]:4}   {totals_predicted[1]:4}   {totals_predicted[2]:4}\n\n")
    
    # Per-class recall: correct predictions divided by actual matches of that class
    f.write("Per-Class Recall (correct / actual):\n")
    for i, label in enumerate(class_labels):
        class_recall = cm[i,i] / totals_actual[i] if totals_actual[i] > 0 else 0
        f.write(f"  {label:<12}: {cm[i,i]:3}/{totals_actual[i]:3} = {class_recall:.1%}\n")
    f.write("\n")
    
    f.write("6. CLASSIFICATION METRICS\n")
    f.write("-"*80 + "\n")
    for label in class_labels:
        f.write(f"\n{label}:\n")
        f.write(f"  Precision: {report[label]['precision']:.2f}\n")
        f.write(f"  Recall:    {report[label]['recall']:.2f}\n")
        f.write(f"  F1-Score:  {report[label]['f1-score']:.2f}\n")
    f.write("\n")
    
    f.write("7. TOP 10 MOST IMPORTANT FEATURES\n")
    f.write("-"*80 + "\n")
    for i, (idx, row) in enumerate(feature_importance.head(10).iterrows(), 1):
        tier_marker = " 🎯" if 'tier' in row['feature'].lower() else ""
        f.write(f"{i:2}. {row['feature']:<50} {row['importance']:.4f}{tier_marker}\n")
    f.write("\n")
    
    f.write("8. TIER FEATURE ANALYSIS\n")
    f.write("-"*80 + "\n")
    f.write(f"Total tier importance: {tier_importance['importance'].sum():.1%}\n")
    f.write(f"Number of tier features: {len(tier_cols)}\n")
    f.write("Individual tier feature importance:\n")
    for idx, row in tier_importance.iterrows():
        rank = list(feature_importance.index).index(idx) + 1
        f.write(f"  {row['feature']:<35} Rank #{rank:3}  Importance: {row['importance']:.4f}\n")
    f.write("\n")
    
    f.write("9. KEY LEARNINGS\n")
    f.write("-"*80 + "\n")
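    # Only claim success if the model actually beats the baseline
    if acc_with_tiers > baseline_threshold:
        f.write(f"✓ Model beats baseline ({(acc_with_tiers - baseline_threshold)*100:+.1f} pp)\n")
        f.write("✓ ML is working - the model learned useful patterns!\n")
    else:
        f.write(f"✗ Model does not beat baseline ({(acc_with_tiers - baseline_threshold)*100:+.1f} pp) - revisit features and settings\n")
    f.write("✓ Feature scaling put all features on a comparable scale\n")
    f.write(f"✓ Random Forest trained on {len(feature_cols)} features\n")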
    
    # Check if gold standard features are important
    if gold_in_top20:
        f.write("✓ Gold standard features (Lesson 1D) rank highly - correlation validated!\n")
    
    tier_in_top20 = any('tier' in feat.lower() for feat in feature_importance.head(20)['feature'].values)
    if tier_in_top20:
        f.write("✓ Tier features in top 20 - domain knowledge adds value!\n")
    f.write("\n")
    
    f.write("10. NEXT STEPS\n")
    f.write("-"*80 + "\n")
    f.write("→ Lesson 3B: Hyperparameter tuning (optimize performance)\n")
    f.write("→ Lesson 4: Add current season features (fine-tuning on 2025-2026)\n")
    f.write("→ Phase 2: Deploy to Streamlit dashboard (real predictions)\n")
    f.write("\n")
    
    f.write("="*80 + "\n")
    f.write("END OF REPORT\n")
    f.write("="*80 + "\n")

print(f"\n✓ Report saved: {report_path}")
print("\nYou can read this report anytime to remember your results!")
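
The report's hypothesis test uses a fixed 2 pp threshold on a single validation split. One way to judge whether a gap like that is more than noise is a paired bootstrap: resample the validation matches with replacement and see how much the accuracy difference moves. The sketch below is one possible approach, not part of the lesson's pipeline. The helper `bootstrap_accuracy_diff`, the toy labels, and the sample size of 380 are all made up for illustration.

```python
import numpy as np


def bootstrap_accuracy_diff(y_true, pred_a, pred_b, n_boot=2000, seed=0):
    """Return (observed, low, high) for accuracy(B) - accuracy(A).

    low/high form a 95% percentile interval from a paired bootstrap:
    the same resampled rows are used for both models.
    """
    y_true, pred_a, pred_b = map(np.asarray, (y_true, pred_a, pred_b))
    rng = np.random.default_rng(seed)
    n = len(y_true)

    # Per-row difference in correctness:
    # +1 if only B is right, -1 if only A is right, 0 if they agree
    delta = (pred_b == y_true).astype(int) - (pred_a == y_true).astype(int)

    # Each bootstrap replicate draws n row indices with replacement
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = delta[idx].mean(axis=1)

    low, high = np.percentile(diffs, [2.5, 97.5])
    return delta.mean(), low, high


# Toy labels and predictions (illustrative only, not real match data)
rng = np.random.default_rng(1)
y_true = rng.integers(0, 3, size=380)
pred_a = np.where(rng.random(380) < 0.50, y_true, (y_true + 1) % 3)  # "model A"
pred_b = np.where(rng.random(380) < 0.53, y_true, (y_true + 1) % 3)  # "model B"

observed, low, high = bootstrap_accuracy_diff(y_true, pred_a, pred_b)
print(f"Observed difference: {observed * 100:+.1f} pp")
print(f"95% interval: [{low * 100:+.1f}, {high * 100:+.1f}] pp")
if low > 0:
    print("Interval excludes zero: B's advantage is unlikely to be noise")
else:
    print("Interval includes zero: the gap could be noise on this split")
```

To try this on your own results, pass `y_val` and the validation predictions of the two Random Forest models in place of the toy arrays.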

## Section 16: Learning Summary

# 🎉 LESSON 3 COMPLETE!

## 📚 WHAT YOU LEARNED:

### 1. Data Preparation
- ✅ **Train/validation splitting** (chronological, not random)
- ✅ **Feature scaling** (StandardScaler - why and how)
- ✅ **Separating X and y** (features vs target)
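
As a compact recap of these three steps, here is a minimal sketch on made-up data. The column names, the cutoff date, and the values are hypothetical and are not the lesson's dataset. The key points it shows: split by date so that training only sees the past, and fit the scaler on the training rows only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up weekly "matches" with two numeric features and a target
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "date": pd.date_range("2020-01-01", periods=200, freq="7D"),
    "feature_a": rng.normal(50, 10, 200),
    "feature_b": rng.normal(0.5, 0.1, 200),
    "match_outcome": rng.choice(["Home Win", "Draw", "Away Win"], 200),
})

# Chronological split: everything before the cutoff trains, the rest validates
cutoff = pd.Timestamp("2023-01-01")  # hypothetical cutoff date
train = df[df["date"] < cutoff]
val = df[df["date"] >= cutoff]

# Separate X (features) from y (target)
feature_cols = ["feature_a", "feature_b"]
y_train, y_val = train["match_outcome"], val["match_outcome"]

# Fit the scaler on training data only, then apply it to both sets,
# so no information from the validation period leaks into training
scaler = StandardScaler().fit(train[feature_cols])
X_train_scaled = pd.DataFrame(
    scaler.transform(train[feature_cols]), columns=feature_cols, index=train.index
)
X_val_scaled = pd.DataFrame(
    scaler.transform(val[feature_cols]), columns=feature_cols, index=val.index
)

print(f"Train: {len(train)} rows, up to {train['date'].max().date()}")
print(f"Val:   {len(val)} rows, from {val['date'].min().date()}")
```

Wrapping the scaled arrays back into DataFrames with the original index is what makes row lookups such as `X_val_scaled.loc[idx]` in Section 13 possible.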

### 2. Model Training
- ✅ **Baseline models** (establishing benchmarks)
- ✅ **Random Forest classifier** (how ensemble learning works)
- ✅ **Hyperparameters** (n_estimators, max_depth, etc.)
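
To make the benchmark idea concrete, the sketch below computes three naive baselines with `DummyClassifier` on made-up labels. The class proportions are invented for illustration, and taking the best baseline as the threshold to beat is an assumption about how `baseline_threshold` could be defined, not necessarily how this notebook defines it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up outcome labels with an invented home advantage (illustrative only)
rng = np.random.default_rng(0)
classes = ["Home Win", "Draw", "Away Win"]
y_train = rng.choice(classes, size=500, p=[0.45, 0.25, 0.30])
y_val = rng.choice(classes, size=200, p=[0.45, 0.25, 0.30])

# Dummy models ignore the features, so placeholder inputs are enough
X_train = np.zeros((len(y_train), 1))
X_val = np.zeros((len(y_val), 1))

baselines = {}
for strategy in ("most_frequent", "stratified", "uniform"):
    dummy = DummyClassifier(strategy=strategy, random_state=42)
    dummy.fit(X_train, y_train)
    baselines[strategy] = accuracy_score(y_val, dummy.predict(X_val))
    print(f"{strategy:<14} accuracy: {baselines[strategy]:.1%}")

# Assumption: a real model should beat the strongest naive baseline
baseline_threshold = max(baselines.values())
print(f"Threshold to beat: {baseline_threshold:.1%}")
```

A model that cannot clear this bar has not learned anything a simple rule could not provide.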

### 3. Model Evaluation (DEEP UNDERSTANDING)
- ✅ **Confusion Matrix** - where the model makes mistakes
- ✅ **Accuracy** - overall correctness
- ✅ **Precision** - "When I predict X, how often am I right?"
- ✅ **Recall** - "Of all actual X, how many did I catch?"
- ✅ **F1-Score** - harmonic mean of precision and recall
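
All of these metrics can be read straight off the confusion matrix. The sketch below works through a tiny hand-made example (twelve invented matches, not real results) and checks the manual calculations against scikit-learn's own functions.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score
)

labels = ["Away Win", "Draw", "Home Win"]

# Twelve invented matches (illustrative only)
y_true = ["Home Win"] * 5 + ["Draw"] * 3 + ["Away Win"] * 4
y_pred = [
    "Home Win", "Home Win", "Home Win", "Home Win", "Draw",   # actual Home Win
    "Home Win", "Draw", "Away Win",                            # actual Draw
    "Away Win", "Away Win", "Home Win", "Away Win",            # actual Away Win
]

# Rows = actual class, columns = predicted class
cm = confusion_matrix(y_true, y_pred, labels=labels)
diag = np.diag(cm)  # correct predictions per class

accuracy = diag.sum() / cm.sum()       # overall correctness
precision = diag / cm.sum(axis=0)      # column sums: "when I predict X, how often am I right?"
recall = diag / cm.sum(axis=1)         # row sums: "of all actual X, how many did I catch?"
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

# The manual numbers should match scikit-learn exactly
assert np.isclose(accuracy, accuracy_score(y_true, y_pred))
assert np.allclose(precision, precision_score(y_true, y_pred, labels=labels, average=None))
assert np.allclose(recall, recall_score(y_true, y_pred, labels=labels, average=None))
assert np.allclose(f1, f1_score(y_true, y_pred, labels=labels, average=None))

print(cm)
for name, p, r, f in zip(labels, precision, recall, f1):
    print(f"{name:<9} precision={p:.2f}  recall={r:.2f}  f1={f:.2f}")
```

Note that the manual division only works here because every class is predicted at least once. With a class the model never predicts, the column sum is zero and precision is undefined.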

### 4. Feature Analysis
- ✅ **Feature importance** (what drives predictions)
- ✅ **Domain validation** (tier features, gold standard metrics)
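
As a small illustration of how feature importance and the tier check fit together, the sketch below trains a forest on synthetic data, ranks the features, and adds up the importance of every feature whose name contains "tier". The six feature names are hypothetical placeholders, not columns from the lesson's dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature names attached to synthetic data (illustrative only)
names = ["home_xg", "away_xg", "home_form", "away_form", "home_tier", "away_tier"]
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=3, random_state=0
)
X = pd.DataFrame(X, columns=names)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances: one value per feature, summing to 1
importance = (
    pd.DataFrame({"feature": names, "importance": model.feature_importances_})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
importance["rank"] = importance.index + 1

# Group check: how much of the total importance do tier features carry?
tier_rows = importance[importance["feature"].str.contains("tier", case=False)]
tier_share = tier_rows["importance"].sum()

print(importance.to_string(index=False))
print(f"\nTier features carry {tier_share:.1%} of total importance")
```

Because the importances always sum to 1, a feature's score is relative to the other features in the same model. It says nothing about whether the model as a whole predicts well.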

### 5. Hypothesis Testing
- ✅ **Scientific approach** (WITH vs WITHOUT tier features)
- ✅ **Quantitative validation** (measure improvement)

---

## 🏆 YOUR ACHIEVEMENTS:

You built your FIRST machine learning model and learned:
- WHY each step matters
- HOW to interpret results
- WHEN to use which metric
- WHAT makes a good model

You didn't just run code - you UNDERSTOOD the concepts!

---

## 🎯 NEXT STEPS:

**Lesson 3B**: Hyperparameter tuning  
→ Optimize model performance through systematic parameter search

**Lesson 4**: Current season features  
→ Add 2025-2026 data and fine-tune for latest patterns

**Phase 2**: Production deployment  
→ Build Streamlit dashboard with real predictions

---

## 📝 HOMEWORK (Optional):

1. Look at YOUR confusion matrix. Which class is hardest to predict? Why?
2. Look at YOUR feature importance. Do the top features make sense?
3. Look at YOUR example predictions. When is the model confident? When uncertain?

**The more you engage with YOUR specific results, the deeper your understanding!**

In [None]:
print("\n" + "="*80)
print("🎓 LESSON 3 COMPLETE!")
print("="*80)

print("\n📚 WHAT YOU LEARNED:")
print("✓ Train/validation splitting (chronological, not random)")
print("✓ Feature scaling (StandardScaler)")
print("✓ Baseline models (benchmarks)")
print("✓ Random Forest classifier (how it works)")
print("✓ Model evaluation (accuracy, confusion matrix)")
print("✓ Precision, Recall, F1 (DEEP understanding)")
print("✓ Feature importance (what matters most)")
print("✓ Hypothesis testing (do tier features help?)")

print("\n📊 YOUR RESULTS:")
print(f"Baseline: {baseline_threshold:.1%}")
print(f"Your Model: {acc_with_tiers:.1%}")
print(f"Improvement: {(acc_with_tiers - baseline_threshold)*100:+.1f} percentage points")

print("\n🎯 HYPOTHESIS TEST:")
if acc_with_tiers > acc_no_tiers + 0.02:
    print("✅ Tier features clearly help (gain above the 2 pp threshold)!")
    print(f"   +{(acc_with_tiers - acc_no_tiers)*100:.1f} pp improvement")
elif acc_with_tiers > acc_no_tiers:
    print("⚠️ Tier features help slightly")
    print(f"   +{(acc_with_tiers - acc_no_tiers)*100:.1f} pp improvement")
else:
    print("❌ Tier features don't help")

print("\n📁 FILES CREATED:")
print(f"  Models:")
print(f"    - {MODEL_DIR / 'rf_no_tiers.pkl'}")
print(f"    - {MODEL_DIR / 'rf_with_tiers.pkl'}")
print(f"    - {MODEL_DIR / 'feature_scaler.pkl'}")
print(f"  Outputs:")
print(f"    - {OUTPUT_DIR / 'confusion_matrix_with_tiers.png'}")
print(f"    - {OUTPUT_DIR / 'feature_importance_top20.png'}")
print(f"    - {OUTPUT_DIR / 'model_comparison.png'}")
print(f"    - {OUTPUT_DIR / 'training_report.txt'}")
print(f"    - {OUTPUT_DIR / 'model_metadata.json'}")

print("\n" + "="*80)
print("🎉 YOU BUILT YOUR FIRST ML MODEL!")
print("="*80)
print("\nYou didn't just run code - you UNDERSTOOD:")
print("  - WHY each step matters")
print("  - HOW to interpret results")
print("  - WHEN to use which metric")
print("  - WHAT makes a good model")
print("\n🚀 Ready for the next lesson!")