# **LESSON 3B: GOAL PREDICTION MODEL**

---

## **What We're Building**

In Lesson 3, we built a **classifier** to predict match outcomes (Win/Draw/Loss) with **51.3% accuracy**.

Now we're taking a different approach:
- Train **2 regressors** to predict `home_goals` and `away_goals`
- **Derive** match outcomes from goal predictions (2.1 - 1.3 → Home Win)
- Compare accuracy to direct classification
- Bonus: Get **score predictions** for betting markets!

---

## **Key Concepts**

### **Regression vs Classification**
- **Classification**: Predict categories (Win/Draw/Loss)
- **Regression**: Predict numbers (0, 1, 2, 3... goals)

### **Evaluation Metrics**
- **MAE** (Mean Absolute Error): Average absolute gap between predicted and actual goals
- **RMSE** (Root Mean Squared Error): Like MAE, but penalizes large errors more heavily
- **R²** (R-squared): Share of variance explained (0 = no better than predicting the mean, 1 = perfect)
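
To make the three metrics concrete, here is a minimal sketch that computes them by hand. The `actual` and `predicted` lists are made-up illustrative values, not rows from the lesson's dataset; the notebook itself gets these numbers from scikit-learn's `mean_absolute_error`, `mean_squared_error`, and `r2_score`.

```python
import math

# Illustrative values only (not from the lesson's dataset)
actual = [2, 0, 1, 3]
predicted = [1.5, 0.5, 1.0, 1.0]

n = len(actual)
errors = [p - a for p, a in zip(predicted, actual)]

# MAE: average size of the miss, ignoring direction
mae = sum(abs(e) for e in errors) / n

# RMSE: squaring first lets the single 2-goal miss dominate
rmse = math.sqrt(sum(e ** 2 for e in errors) / n)

# R²: 1 - (model squared error / squared error of always predicting the mean)
mean_actual = sum(actual) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)
r2 = 1 - ss_res / ss_tot

print(f"MAE:  {mae:.3f}")
print(f"RMSE: {rmse:.3f}")
print(f"R²:   {r2:.3f}")
```

On these toy numbers MAE is 0.750 while RMSE is 1.061: the one 2-goal miss pulls RMSE up much more than it pulls MAE, which is what "penalizes large errors" means in practice.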

---

## **Dataset**
- 1,900 matches from 2020-2025
- 105 engineered features
- Targets: `home_goals`, `away_goals`, `match_outcome`

---

Let's begin!

---
## **SECTION 1: SETUP & DATA LOADING**

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import (
    mean_absolute_error,
    mean_squared_error,
    r2_score,
    accuracy_score,
    confusion_matrix,
    ConfusionMatrixDisplay
)
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import json
import joblib
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

# Set up paths
OUTPUT_DIR = Path('ml_project/outputs/08_goal_prediction')
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
MODEL_DIR = Path('ml_project/models')

# Load data
data = pd.read_csv('ml_project/data/match_features_historical.csv')
scaler = joblib.load('ml_project/models/feature_scaler.pkl')

print("="*80)
print("LESSON 3B: GOAL PREDICTION MODEL")
print("="*80)
print(f"Dataset loaded: {len(data)} matches")
print(f"\nTarget variables:")
print(f"  - home_goals: mean={data['home_goals'].mean():.2f}, std={data['home_goals'].std():.2f}")
print(f"  - away_goals: mean={data['away_goals'].mean():.2f}, std={data['away_goals'].std():.2f}")

---
## **SECTION 2: UNDERSTAND GOAL DISTRIBUTIONS**

Before building models, let's understand what we're predicting.

In [None]:
print("\n" + "="*80)
print("UNDERSTANDING GOAL DISTRIBUTIONS")
print("="*80)

# Visualize goal distributions
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Home goals
axes[0].hist(data['home_goals'], bins=range(0, 8), alpha=0.7, color='green', edgecolor='black')
axes[0].set_xlabel('Home Goals', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[0].set_title('Distribution of Home Goals', fontsize=14, fontweight='bold')
axes[0].axvline(data['home_goals'].mean(), color='red', linestyle='--', linewidth=2, 
                label=f'Mean: {data["home_goals"].mean():.2f}')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Away goals
axes[1].hist(data['away_goals'], bins=range(0, 8), alpha=0.7, color='orange', edgecolor='black')
axes[1].set_xlabel('Away Goals', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[1].set_title('Distribution of Away Goals', fontsize=14, fontweight='bold')
axes[1].axvline(data['away_goals'].mean(), color='red', linestyle='--', linewidth=2, 
                label=f'Mean: {data["away_goals"].mean():.2f}')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'goal_distributions.png', dpi=300)
plt.show()

print(f"\nHome Goals Stats:")
print(f"  Mean: {data['home_goals'].mean():.2f}")
print(f"  Median: {data['home_goals'].median():.1f}")
print(f"  Std: {data['home_goals'].std():.2f}")
print(f"  Range: {data['home_goals'].min():.0f} - {data['home_goals'].max():.0f}")

print(f"\nAway Goals Stats:")
print(f"  Mean: {data['away_goals'].mean():.2f}")
print(f"  Median: {data['away_goals'].median():.1f}")
print(f"  Std: {data['away_goals'].std():.2f}")
print(f"  Range: {data['away_goals'].min():.0f} - {data['away_goals'].max():.0f}")

print("\n💡 TEACHING POINT:")
print("Home teams score MORE on average (home advantage)")
print(f"Home avg: {data['home_goals'].mean():.2f} vs Away avg: {data['away_goals'].mean():.2f}")
print(f"Difference: {(data['home_goals'].mean() - data['away_goals'].mean()):.2f} goals per match")

---
## **SECTION 3: PREPARE DATA**

Same train/validation split as Lesson 3 for fair comparison.

In [None]:
print("\n" + "="*80)
print("PREPARING DATA")
print("="*80)

# Separate features from targets
exclude_cols = [
    'match_id', 'season', 'date', 'gameweek',
    'home_team', 'away_team',
    'match_outcome', 'home_goals', 'away_goals'
]

feature_cols = [col for col in data.columns if col not in exclude_cols]
X = data[feature_cols]
y_home_goals = data['home_goals']
y_away_goals = data['away_goals']
y_outcome = data['match_outcome']

print(f"✓ Features: {X.shape[1]}")
print(f"✓ Home goals target: {len(y_home_goals)}")
print(f"✓ Away goals target: {len(y_away_goals)}")

# Train/validation split (same as Lesson 3)
train_mask = data['season'].isin(['2020-2021', '2021-2022', '2022-2023', '2023-2024'])
val_mask = data['season'] == '2024-2025'

X_train = X[train_mask]
X_val = X[val_mask]
y_train_home = y_home_goals[train_mask]
y_val_home = y_home_goals[val_mask]
y_train_away = y_away_goals[train_mask]
y_val_away = y_away_goals[val_mask]
y_val_outcome = y_outcome[val_mask]

print(f"\n✓ Training set: {len(X_train)} matches")
print(f"✓ Validation set: {len(X_val)} matches")

# Scale features (use SAME scaler from Lesson 3)
X_train_scaled = scaler.transform(X_train)
X_val_scaled = scaler.transform(X_val)

X_train_scaled = pd.DataFrame(X_train_scaled, columns=feature_cols, index=X_train.index)
X_val_scaled = pd.DataFrame(X_val_scaled, columns=feature_cols, index=X_val.index)

print(f"✓ Features scaled")
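
Reusing a previously fitted scaler matters because scaling parameters should come from training rows only and then be applied unchanged to validation rows. The sketch below shows that idea for a single feature in plain Python. The values are made up, and treating the Lesson 3 scaler as a z-score (StandardScaler-style) transform is an assumption, since the notebook only loads it from disk.

```python
import statistics

# Illustrative values for one feature (not from the lesson's dataset)
train_values = [1.0, 2.0, 3.0, 4.0, 5.0]
val_values = [3.0, 6.0]

# Learn the scaling parameters from the training rows only
train_mean = statistics.fmean(train_values)
train_std = statistics.pstdev(train_values)  # population std (ddof=0)

def standardize(values, mean, std):
    """Apply z-score scaling with previously learned parameters."""
    return [(v - mean) / std for v in values]

train_scaled = standardize(train_values, train_mean, train_std)
val_scaled = standardize(val_values, train_mean, train_std)

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in val_scaled])
```

The validation value 6.0 lands outside the range seen in training, which is expected: validation rows are described in terms of the training distribution, never the other way round.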

---
## **SECTION 4: ESTABLISH BASELINE**

**Baseline Strategy**: Always predict the training mean.

This tells us how much better our model needs to be.

In [None]:
print("\n" + "="*80)
print("BASELINE: Always Predict Mean Goals")
print("="*80)

# Baseline: Always predict training mean
baseline_home = y_train_home.mean()
baseline_away = y_train_away.mean()

# Calculate baseline errors
baseline_home_predictions = np.full(len(y_val_home), baseline_home)
baseline_away_predictions = np.full(len(y_val_away), baseline_away)

baseline_mae_home = mean_absolute_error(y_val_home, baseline_home_predictions)
baseline_mae_away = mean_absolute_error(y_val_away, baseline_away_predictions)

print(f"\nBaseline Strategy: Always predict mean")
print(f"  Home goals: {baseline_home:.2f}")
print(f"  Away goals: {baseline_away:.2f}")

print(f"\nBaseline MAE:")
print(f"  Home: {baseline_mae_home:.2f} goals")
print(f"  Away: {baseline_mae_away:.2f} goals")

print("\n💡 TEACHING POINT:")
print("MAE = Mean Absolute Error")
print("  Average difference between prediction and actual goals")
print("  Lower is better (0 = perfect predictions)")
print(f"  Baseline MAE ≈ 1.0 means we're off by ~1 goal per match")

---
## **SECTION 5: TRAIN HOME GOALS REGRESSOR**

First model: Predict how many goals the **home team** will score.

In [None]:
print("\n" + "="*80)
print("TRAINING MODEL #1: Home Goals Regressor")
print("="*80)

# Initialize regressor
rf_home_goals = RandomForestRegressor(
    n_estimators=100,
    max_depth=15,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42,
    n_jobs=-1,
    verbose=1
)

print("\n🌳 Training Random Forest Regressor for Home Goals...")
print("(This may take 30-60 seconds)")
rf_home_goals.fit(X_train_scaled, y_train_home)

# Predict
y_val_pred_home = rf_home_goals.predict(X_val_scaled)

# Evaluate
mae_home = mean_absolute_error(y_val_home, y_val_pred_home)
rmse_home = np.sqrt(mean_squared_error(y_val_home, y_val_pred_home))
r2_home = r2_score(y_val_home, y_val_pred_home)

print(f"\n✓ Model trained!")
print(f"\n📊 Home Goals Performance:")
print(f"  MAE:  {mae_home:.3f} goals (avg error)")
print(f"  RMSE: {rmse_home:.3f} goals (penalizes large errors)")
print(f"  R²:   {r2_home:.3f} (variance explained: {r2_home*100:.1f}%)")
print(f"\n📈 Improvement vs Baseline:")
print(f"  Baseline MAE: {baseline_mae_home:.3f}")
print(f"  Model MAE:    {mae_home:.3f}")
print(f"  Improvement:  {(baseline_mae_home - mae_home):.3f} goals ({((baseline_mae_home - mae_home)/baseline_mae_home)*100:.1f}%)")

print("\n💡 TEACHING POINT: What is R²?")
print("R² = How much variance in goals does the model explain?")
print(f"  R² = {r2_home:.3f} means model explains {r2_home*100:.1f}% of goal variation")
print("  R² = 1.0 → Perfect predictions")
print("  R² = 0.0 → No better than predicting mean")
print("  R² = negative → Worse than predicting the mean (possible on unseen data)")
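
The meaning of R² = 0 and negative R² can be checked with a few lines of plain Python. This is a sketch on made-up goal counts, and the helper `r2` is written here for illustration (the notebook uses scikit-learn's `r2_score`). It shows that predicting the validation set's own mean scores exactly 0, while predicting the training mean, as the Section 4 baseline does, scores below 0 whenever the two means differ.

```python
def r2(actual, predicted):
    """R² = 1 - SS_res / SS_tot, with SS_tot measured around the mean of `actual`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Illustrative goal counts (not from the lesson's dataset)
train_goals = [1, 2, 2, 3, 2]   # training mean = 2.0
val_goals = [0, 1, 2, 1]        # validation mean = 1.0

# Predicting the validation set's own mean gives exactly R² = 0
own_mean = sum(val_goals) / len(val_goals)
r2_own_mean = r2(val_goals, [own_mean] * len(val_goals))

# Predicting the training mean is off-centre for the validation data,
# so its R² on validation falls below 0
train_mean = sum(train_goals) / len(train_goals)
r2_train_mean = r2(val_goals, [train_mean] * len(val_goals))

print(f"R² predicting validation mean: {r2_own_mean:.3f}")
print(f"R² predicting training mean:   {r2_train_mean:.3f}")
```

The gap between the two means is exaggerated here to make the effect visible; with real seasons the training and validation means are usually close, so the baseline's R² sits just under 0.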

---
## **SECTION 6: TRAIN AWAY GOALS REGRESSOR**

Second model: Predict how many goals the **away team** will score.

In [None]:
print("\n" + "="*80)
print("TRAINING MODEL #2: Away Goals Regressor")
print("="*80)

# Initialize regressor
rf_away_goals = RandomForestRegressor(
    n_estimators=100,
    max_depth=15,
    min_samples_split=20,
    min_samples_leaf=10,
    random_state=42,
    n_jobs=-1,
    verbose=1
)

print("\n🌳 Training Random Forest Regressor for Away Goals...")
rf_away_goals.fit(X_train_scaled, y_train_away)

# Predict
y_val_pred_away = rf_away_goals.predict(X_val_scaled)

# Evaluate
mae_away = mean_absolute_error(y_val_away, y_val_pred_away)
rmse_away = np.sqrt(mean_squared_error(y_val_away, y_val_pred_away))
r2_away = r2_score(y_val_away, y_val_pred_away)

print(f"\n✓ Model trained!")
print(f"\n📊 Away Goals Performance:")
print(f"  MAE:  {mae_away:.3f} goals")
print(f"  RMSE: {rmse_away:.3f} goals")
print(f"  R²:   {r2_away:.3f} ({r2_away*100:.1f}% variance explained)")
print(f"\n📈 Improvement vs Baseline:")
print(f"  Baseline MAE: {baseline_mae_away:.3f}")
print(f"  Model MAE:    {mae_away:.3f}")
print(f"  Improvement:  {(baseline_mae_away - mae_away):.3f} goals ({((baseline_mae_away - mae_away)/baseline_mae_away)*100:.1f}%)")

---
## **SECTION 7: VISUALIZE PREDICTIONS VS ACTUAL**

How well do our predictions match reality?

In [None]:
print("\n" + "="*80)
print("VISUALIZING PREDICTIONS")
print("="*80)

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Home goals: Predicted vs Actual
axes[0].scatter(y_val_home, y_val_pred_home, alpha=0.5, s=50)
axes[0].plot([0, 6], [0, 6], 'r--', linewidth=2, label='Perfect prediction')
axes[0].set_xlabel('Actual Home Goals', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Predicted Home Goals', fontsize=12, fontweight='bold')
axes[0].set_title(f'Home Goals: Predicted vs Actual\nMAE: {mae_home:.3f}, R²: {r2_home:.3f}',
                  fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Away goals: Predicted vs Actual
axes[1].scatter(y_val_away, y_val_pred_away, alpha=0.5, s=50, color='orange')
axes[1].plot([0, 6], [0, 6], 'r--', linewidth=2, label='Perfect prediction')
axes[1].set_xlabel('Actual Away Goals', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Predicted Away Goals', fontsize=12, fontweight='bold')
axes[1].set_title(f'Away Goals: Predicted vs Actual\nMAE: {mae_away:.3f}, R²: {r2_away:.3f}',
                  fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'predictions_vs_actual.png', dpi=300)
plt.show()

print("\n💡 READING THE PLOT:")
print("Points on red line = Perfect predictions")
print("Points above line = Model over-predicted goals")
print("Points below line = Model under-predicted goals")
print("Spread = How much error there is")

---
## **SECTION 8: PREDICTION ERROR ANALYSIS**

Understanding where the model makes mistakes.

In [None]:
print("\n" + "="*80)
print("PREDICTION ERROR ANALYSIS")
print("="*80)

# Calculate errors
home_errors = y_val_pred_home - y_val_home
away_errors = y_val_pred_away - y_val_away

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Home goals error distribution
axes[0].hist(home_errors, bins=30, alpha=0.7, color='green', edgecolor='black')
axes[0].axvline(0, color='red', linestyle='--', linewidth=2, label='Zero error')
axes[0].axvline(home_errors.mean(), color='blue', linestyle='--', linewidth=2, 
                label=f'Mean error: {home_errors.mean():.3f}')
axes[0].set_xlabel('Prediction Error (Predicted - Actual)', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[0].set_title('Home Goals: Prediction Error Distribution', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Away goals error distribution
axes[1].hist(away_errors, bins=30, alpha=0.7, color='orange', edgecolor='black')
axes[1].axvline(0, color='red', linestyle='--', linewidth=2, label='Zero error')
axes[1].axvline(away_errors.mean(), color='blue', linestyle='--', linewidth=2, 
                label=f'Mean error: {away_errors.mean():.3f}')
axes[1].set_xlabel('Prediction Error (Predicted - Actual)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Frequency', fontsize=12, fontweight='bold')
axes[1].set_title('Away Goals: Prediction Error Distribution', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'error_distributions.png', dpi=300)
plt.show()

print(f"\nError Statistics:")
print(f"\nHome Goals:")
print(f"  Mean error: {home_errors.mean():.3f} (bias)")
print(f"  Std error:  {home_errors.std():.3f} (consistency)")
print(f"  Max over-prediction: {home_errors.max():.2f} goals")
print(f"  Max under-prediction: {home_errors.min():.2f} goals")

print(f"\nAway Goals:")
print(f"  Mean error: {away_errors.mean():.3f} (bias)")
print(f"  Std error:  {away_errors.std():.3f} (consistency)")
print(f"  Max over-prediction: {away_errors.max():.2f} goals")
print(f"  Max under-prediction: {away_errors.min():.2f} goals")

print("\n💡 TEACHING POINT: Bias vs Variance")
print("Mean error close to 0 = UNBIASED (not systematically over/under-predicting)")
print("Small std error = LOW VARIANCE (consistent predictions)")
print("Good model has both low bias AND low variance")
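
The difference between bias (mean signed error) and spread (standard deviation of the errors) is easier to see on two tiny error sets. This is a sketch with made-up errors, not output from the lesson's models; `statistics.stdev` is used because it is the sample standard deviation, the same default as the pandas `.std()` calls in the cell above.

```python
import statistics

# Illustrative signed errors, predicted minus actual (not from the lesson's dataset)
error_sets = {
    "unbiased": [-1.0, 1.0, -0.5, 0.5],   # misses cancel out
    "biased": [0.4, 0.6, 0.5, 0.5],       # always predicts too many goals
}

results = {}
for name, errs in error_sets.items():
    bias = statistics.fmean(errs)                     # mean signed error
    spread = statistics.stdev(errs)                   # sample std of the errors
    mae = statistics.fmean([abs(e) for e in errs])    # average size of the miss
    results[name] = (bias, spread, mae)
    print(f"{name:>8}: bias={bias:+.2f}, spread={spread:.2f}, MAE={mae:.2f}")
```

The second set has the smaller MAE and the smaller spread, yet it over-predicts every time. That is why the mean error is reported separately from MAE: a low MAE alone does not show whether the model leans in one direction.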

---
## **SECTION 9: DERIVE MATCH OUTCOMES FROM GOALS**

Now the key question: Can we derive Win/Draw/Loss from goal predictions?

In [None]:
print("\n" + "="*80)
print("DERIVING MATCH OUTCOMES FROM GOAL PREDICTIONS")
print("="*80)

def derive_outcome(home_goals, away_goals, threshold=0.5):
    """
    Derive match outcome from goal predictions
    
    Args:
        home_goals: Predicted home goals
        away_goals: Predicted away goals
        threshold: Goal difference threshold for declaring winner (default 0.5)
    
    Returns:
        'Home Win', 'Away Win', or 'Draw'
    """
    goal_diff = home_goals - away_goals
    
    if goal_diff > threshold:
        return 'Home Win'
    elif goal_diff < -threshold:
        return 'Away Win'
    else:
        return 'Draw'

# Derive outcomes
derived_outcomes = []
for home, away in zip(y_val_pred_home, y_val_pred_away):
    derived_outcomes.append(derive_outcome(home, away))

# Calculate accuracy
outcome_accuracy = accuracy_score(y_val_outcome, derived_outcomes)

print(f"\n📊 OUTCOME ACCURACY (derived from goals):")
print(f"  Accuracy: {outcome_accuracy:.1%}")

# Compare to Lesson 3 classifier
with open('ml_project/outputs/07_model_training/model_metadata.json') as f:
    lesson3_metadata = json.load(f)
classifier_accuracy = lesson3_metadata['results']['rf_with_tiers']

print(f"\n🔬 COMPARISON:")
print(f"  Lesson 3 Classifier (direct):  {classifier_accuracy:.1%}")
print(f"  Lesson 3B Goal-based (derived): {outcome_accuracy:.1%}")
print(f"  Difference: {(outcome_accuracy - classifier_accuracy)*100:+.1f} percentage points")

if outcome_accuracy > classifier_accuracy:
    print("\n✅ Goal-based approach is BETTER!")
elif outcome_accuracy > classifier_accuracy - 0.02:
    print("\n⚖️  Goal-based approach is COMPARABLE (within 2 percentage points)")
else:
    print("\n⚠️  Goal-based approach is WORSE (but gives more info!)")

print("\n💡 TEACHING POINT:")
print("Even if accuracy is slightly lower, goal-based gives you:")
print("  ✓ Score predictions (not just outcome)")
print("  ✓ Goal difference (margin of victory)")
print("  ✓ Over/Under betting opportunities")
print("  ✓ More interpretable (humans think in goals)")
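
A few hand-picked scorelines show how the threshold rule behaves, including the edge case where the margin equals the threshold. The function below repeats the lesson's `derive_outcome` so the block runs on its own; the example scorelines are hypothetical.

```python
def derive_outcome(home_goals, away_goals, threshold=0.5):
    """Same rule as the lesson's helper, repeated so this block is self-contained."""
    goal_diff = home_goals - away_goals
    if goal_diff > threshold:
        return 'Home Win'
    elif goal_diff < -threshold:
        return 'Away Win'
    return 'Draw'

# Hypothetical predicted scorelines (home, away)
examples = [(2.1, 1.3), (1.0, 1.75), (1.5, 1.25), (2.0, 1.5)]

outcomes = [derive_outcome(home, away) for home, away in examples]
for (home, away), outcome in zip(examples, outcomes):
    print(f"{home:.2f} - {away:.2f} -> {outcome}")
```

The last example has a margin of exactly 0.5 and is labelled a draw, because both comparisons are strict. Regression forests tend to predict values close to the average, so many real predictions fall inside this band; that is the motivation for tuning the threshold later in the lesson.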

---
## **SECTION 10: CONFUSION MATRIX FOR DERIVED OUTCOMES**

In [None]:
print("\n" + "="*80)
print("CONFUSION MATRIX: Derived Outcomes")
print("="*80)

cm = confusion_matrix(y_val_outcome, derived_outcomes, labels=['Away Win', 'Draw', 'Home Win'])

fig, ax = plt.subplots(figsize=(10, 8))
disp = ConfusionMatrixDisplay(
    confusion_matrix=cm,
    display_labels=['Away Win', 'Draw', 'Home Win']
)
disp.plot(ax=ax, cmap='Blues', values_format='d', colorbar=False)
plt.title('Confusion Matrix - Goal-Based Predictions\n(Validation Set)', 
          fontsize=14, fontweight='bold')

# Add per-class accuracy (correct / actual total for each class, i.e. recall)
totals_actual = cm.sum(axis=1)
for i in range(3):
    class_acc = cm[i,i] / totals_actual[i]
    ax.text(i, i-0.3, f'{class_acc:.0%}', ha='center', va='center', 
            color='white', fontweight='bold', fontsize=10)

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'confusion_matrix_goal_based.png', dpi=300)
plt.show()

# Print confusion matrix
totals_predicted = cm.sum(axis=0)
print("\n              Predicted")
print("              Away   Draw   Home   | Total Actual")
print(f"Actual  Away   {cm[0,0]:4}   {cm[0,1]:4}   {cm[0,2]:4}  |  {totals_actual[0]:4}")
print(f"        Draw   {cm[1,0]:4}   {cm[1,1]:4}   {cm[1,2]:4}  |  {totals_actual[1]:4}")
print(f"        Home   {cm[2,0]:4}   {cm[2,1]:4}   {cm[2,2]:4}  |  {totals_actual[2]:4}")
print("        " + "-"*40)
print(f"Total Pred     {totals_predicted[0]:4}   {totals_predicted[1]:4}   {totals_predicted[2]:4}")

print("\nPer-Class Accuracy:")
for i, label in enumerate(['Away Win', 'Draw', 'Home Win']):
    class_acc = cm[i,i] / totals_actual[i] if totals_actual[i] > 0 else 0
    print(f"  {label:<12}: {cm[i,i]:3}/{totals_actual[i]:3} = {class_acc:.1%}")
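
The per-class accuracy printed above divides each diagonal cell by its row total, which is what is usually called recall. Dividing by the column total instead gives precision: of the matches predicted as a class, how many really were that class. The sketch below computes both from a hypothetical confusion matrix; the counts are invented for illustration.

```python
labels = ['Away Win', 'Draw', 'Home Win']

# Hypothetical confusion matrix: rows = actual, columns = predicted
cm = [
    [30,  5, 25],   # actual Away Win
    [20,  4, 26],   # actual Draw
    [15,  6, 59],   # actual Home Win
]

recall = {}
precision = {}
for i, label in enumerate(labels):
    actual_total = sum(cm[i])                     # row sum
    predicted_total = sum(row[i] for row in cm)   # column sum
    recall[label] = cm[i][i] / actual_total if actual_total else 0.0
    precision[label] = cm[i][i] / predicted_total if predicted_total else 0.0
    print(f"{label:<9} recall={recall[label]:.1%}  precision={precision[label]:.1%}")
```

Looking at both numbers helps with the Draw class in particular: a model that rarely predicts draws can have very low recall for draws while the few draws it does predict are still sometimes right.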

---
## **SECTION 11: THRESHOLD TUNING**

Can we improve accuracy by adjusting the draw threshold?

In [None]:
print("\n" + "="*80)
print("THRESHOLD TUNING: Finding Optimal Draw Threshold")
print("="*80)

# Try different thresholds
thresholds = np.arange(0.1, 1.0, 0.1)
accuracies = []

for threshold in thresholds:
    outcomes = [derive_outcome(h, a, threshold) for h, a in zip(y_val_pred_home, y_val_pred_away)]
    acc = accuracy_score(y_val_outcome, outcomes)
    accuracies.append(acc)
    print(f"Threshold {threshold:.1f}: Accuracy = {acc:.1%}")

# Find best threshold
best_idx = np.argmax(accuracies)
best_threshold = thresholds[best_idx]
best_accuracy = accuracies[best_idx]

print(f"\n🎯 BEST THRESHOLD: {best_threshold:.1f} goals")
print(f"   Accuracy: {best_accuracy:.1%}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(thresholds, accuracies, marker='o', linewidth=2, markersize=8)
ax.axvline(best_threshold, color='red', linestyle='--', linewidth=2, 
           label=f'Best: {best_threshold:.1f} ({best_accuracy:.1%})')
ax.set_xlabel('Goal Difference Threshold', fontsize=12, fontweight='bold')
ax.set_ylabel('Accuracy', fontsize=12, fontweight='bold')
ax.set_title('Threshold Tuning: Impact on Accuracy', fontsize=14, fontweight='bold')
ax.grid(alpha=0.3)
ax.legend()
plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'threshold_tuning.png', dpi=300)
plt.show()

print("\n💡 TEACHING POINT: What is threshold?")
print("Threshold = Goal difference needed to declare a winner")
print("  Threshold 0.5: 2.0 vs 1.4 → Home Win (diff = 0.6 > 0.5)")
print("  Threshold 0.5: 1.8 vs 1.4 → Draw (diff = 0.4 < 0.5)")
print("  Threshold 0.8: 2.0 vs 1.4 → Draw (diff = 0.6 < 0.8)")
print("\nHigher threshold → More draws predicted")
print("Lower threshold → More wins predicted")
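
One caveat with the search above: the threshold is chosen on the same validation matches that are then used to report accuracy, so the best accuracy is likely to be somewhat optimistic. A common remedy is to pick the threshold on one subset and report accuracy on another. The sketch below illustrates that on hypothetical predictions; the `accuracy` helper and the tuning/held-out split are assumptions for illustration and are not part of the lesson's pipeline.

```python
def derive_outcome(home_goals, away_goals, threshold=0.5):
    """Same rule as the lesson's helper, repeated so this block is self-contained."""
    goal_diff = home_goals - away_goals
    if goal_diff > threshold:
        return 'Home Win'
    elif goal_diff < -threshold:
        return 'Away Win'
    return 'Draw'

def accuracy(preds, actual_outcomes, threshold):
    """Share of matches whose derived outcome equals the actual outcome."""
    hits = sum(
        derive_outcome(h, a, threshold) == outcome
        for (h, a), outcome in zip(preds, actual_outcomes)
    )
    return hits / len(actual_outcomes)

# Hypothetical predictions (home, away) and actual outcomes, split in two
tune_preds = [(1.95, 1.10), (1.45, 1.20), (1.00, 1.65),
              (1.55, 1.40), (1.75, 1.30), (1.20, 1.55)]
tune_actual = ['Home Win', 'Draw', 'Away Win', 'Draw', 'Home Win', 'Draw']
test_preds = [(2.00, 1.00), (1.30, 1.30), (1.10, 1.85), (1.60, 1.35), (1.25, 1.50)]
test_actual = ['Home Win', 'Draw', 'Away Win', 'Home Win', 'Draw']

# Pick the threshold on the tuning subset only
thresholds = [t / 10 for t in range(1, 10)]
best_threshold = max(thresholds, key=lambda t: accuracy(tune_preds, tune_actual, t))

# Report accuracy on matches that played no part in choosing the threshold
tune_acc = accuracy(tune_preds, tune_actual, best_threshold)
test_acc = accuracy(test_preds, test_actual, best_threshold)

print(f"Chosen threshold:  {best_threshold:.1f}")
print(f"Tuning accuracy:   {tune_acc:.1%}")
print(f"Held-out accuracy: {test_acc:.1%}")
```

In this toy data the chosen threshold fits the tuning matches perfectly but misses one held-out match, which is the kind of gap to expect whenever a parameter is tuned and scored on the same data.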

---
## **SECTION 12: FEATURE IMPORTANCE COMPARISON**

What features matter for predicting goals?

In [None]:
print("\n" + "="*80)
print("FEATURE IMPORTANCE: What Matters for Goals?")
print("="*80)

# Get feature importances
home_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_home_goals.feature_importances_
}).sort_values('importance', ascending=False)

away_importance = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_away_goals.feature_importances_
}).sort_values('importance', ascending=False)

# Visualize top 15 for each
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Home goals importance
top_home = home_importance.head(15)
axes[0].barh(range(len(top_home)), top_home['importance'], color='green', alpha=0.8)
axes[0].set_yticks(range(len(top_home)))
axes[0].set_yticklabels(top_home['feature'])
axes[0].set_xlabel('Importance', fontsize=12, fontweight='bold')
axes[0].set_title('Top 15 Features: Home Goals', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()

# Away goals importance
top_away = away_importance.head(15)
axes[1].barh(range(len(top_away)), top_away['importance'], color='orange', alpha=0.8)
axes[1].set_yticks(range(len(top_away)))
axes[1].set_yticklabels(top_away['feature'])
axes[1].set_xlabel('Importance', fontsize=12, fontweight='bold')
axes[1].set_title('Top 15 Features: Away Goals', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()

plt.tight_layout()
plt.savefig(OUTPUT_DIR / 'feature_importance_goals.png', dpi=300)
plt.show()

print("\n📊 TOP 10 FEATURES FOR HOME GOALS:")
for i, (idx, row) in enumerate(home_importance.head(10).iterrows(), 1):
    print(f"{i:2}. {row['feature']:<50} {row['importance']:.4f}")

print("\n📊 TOP 10 FEATURES FOR AWAY GOALS:")
for i, (idx, row) in enumerate(away_importance.head(10).iterrows(), 1):
    print(f"{i:2}. {row['feature']:<50} {row['importance']:.4f}")

# Compare to Lesson 3 outcome classifier
lesson3_importance = pd.read_csv('ml_project/outputs/07_model_training/feature_importance_full.csv')
print("\n🔬 COMPARISON WITH LESSON 3 (OUTCOME CLASSIFIER):")
print("\nTop 5 for outcome prediction:")
for i, (idx, row) in enumerate(lesson3_importance.head(5).iterrows(), 1):
    print(f"  {i}. {row['feature']}")

print("\nTop 5 for home goals:")
for i, (idx, row) in enumerate(home_importance.head(5).iterrows(), 1):
    print(f"  {i}. {row['feature']}")

print("\n💡 INSIGHT:")
print("Compare the lists: different features can matter for outcomes vs goals!")
print("Outcome models tend to favor differentials (e.g., attack_advantage)")
print("Goal models tend to favor absolute attacking strength")

---
## **SECTION 13: EXAMPLE PREDICTIONS WITH SCORES**

Let's see actual score predictions!

In [None]:
print("\n" + "="*80)
print("EXAMPLE PREDICTIONS: See Scores!")
print("="*80)

# Get sample matches
sample_indices = data[val_mask].sample(10, random_state=42).index

print("\n🎯 10 Random Validation Matches:")
print("="*80)

correct_outcomes = 0
total_goal_error = 0

for i, idx in enumerate(sample_indices, 1):
    match = data.loc[idx]
    
    # Get features
    match_features = X_val_scaled.loc[[idx]]  # one-row DataFrame, keeps feature names
    
    # Predict goals
    pred_home = rf_home_goals.predict(match_features)[0]
    pred_away = rf_away_goals.predict(match_features)[0]
    
    # Derive outcome
    pred_outcome = derive_outcome(pred_home, pred_away, best_threshold)
    
    # Actual values
    actual_home = match['home_goals']
    actual_away = match['away_goals']
    actual_outcome = match['match_outcome']
    
    # Check correctness
    outcome_correct = "✅" if pred_outcome == actual_outcome else "❌"
    if pred_outcome == actual_outcome:
        correct_outcomes += 1
    
    # Goal errors
    home_error = abs(pred_home - actual_home)
    away_error = abs(pred_away - actual_away)
    total_goal_error += (home_error + away_error)
    
    print(f"\n{i}. {match['home_team']} vs {match['away_team']}")
    print(f"   Predicted Score: {pred_home:.1f} - {pred_away:.1f} ({pred_outcome})")
    print(f"   Actual Score:    {actual_home:.0f} - {actual_away:.0f} ({actual_outcome}) {outcome_correct}")
    print(f"   Goal Error:      {home_error:.1f} (home), {away_error:.1f} (away)")

print(f"\n📊 SAMPLE STATISTICS:")
print(f"  Outcome accuracy: {correct_outcomes}/10 = {correct_outcomes/10:.0%}")
print(f"  Avg goal error: {total_goal_error/20:.2f} goals per team")

print("\n💡 WHAT YOU GET WITH GOAL PREDICTIONS:")
print("  ✓ Expected score prediction (2.1 - 1.3)")
print("  ✓ Goal difference (margin of victory)")
print("  ✓ Outcome derived from scores")
print("  ✓ More realistic (how football actually works)")

---
## **SECTION 14: SAVE MODELS**

In [None]:
print("\n" + "="*80)
print("SAVING MODELS")
print("="*80)

# Save both goal regressors
joblib.dump(rf_home_goals, MODEL_DIR / 'rf_home_goals.pkl')
joblib.dump(rf_away_goals, MODEL_DIR / 'rf_away_goals.pkl')

print(f"✓ Models saved:")
print(f"  - {MODEL_DIR / 'rf_home_goals.pkl'}")
print(f"  - {MODEL_DIR / 'rf_away_goals.pkl'}")

# Save metadata
model_metadata = {
    'training_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
    'training_samples': len(X_train),
    'validation_samples': len(X_val),
    'features': len(feature_cols),
    'home_goals': {
        'mae': float(mae_home),
        'rmse': float(rmse_home),
        'r2': float(r2_home),
        'baseline_mae': float(baseline_mae_home),
        'improvement': float(baseline_mae_home - mae_home)
    },
    'away_goals': {
        'mae': float(mae_away),
        'rmse': float(rmse_away),
        'r2': float(r2_away),
        'baseline_mae': float(baseline_mae_away),
        'improvement': float(baseline_mae_away - mae_away)
    },
    'derived_outcome_accuracy': float(outcome_accuracy),
    'best_threshold': float(best_threshold),
    'comparison': {
        'lesson3_classifier': float(classifier_accuracy),
        'lesson3b_goal_based': float(outcome_accuracy),
        'difference': float(outcome_accuracy - classifier_accuracy)
    },
    'top_5_features_home': home_importance.head(5)['feature'].tolist(),
    'top_5_features_away': away_importance.head(5)['feature'].tolist()
}

with open(OUTPUT_DIR / 'goal_model_metadata.json', 'w') as f:
    json.dump(model_metadata, f, indent=2)

print(f"  - {OUTPUT_DIR / 'goal_model_metadata.json'}")

---
## **SECTION 15: GENERATE COMPREHENSIVE REPORT**

In [None]:
print("\n" + "="*80)
print("GENERATING FINAL REPORT")
print("="*80)

report_path = OUTPUT_DIR / 'goal_prediction_report.txt'

with open(report_path, 'w', encoding='utf-8') as f:
    f.write("="*80 + "\n")
    f.write("LESSON 3B: GOAL PREDICTION MODEL REPORT\n")
    f.write(f"Generated: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n")
    f.write("="*80 + "\n\n")
    
    f.write("1. DATASET OVERVIEW\n")
    f.write("-"*80 + "\n")
    f.write(f"Training samples: {len(X_train)} matches (2020-2024)\n")
    f.write(f"Validation samples: {len(X_val)} matches (2024-2025)\n")
    f.write(f"Features: {len(feature_cols)}\n")
    f.write(f"Targets: home_goals, away_goals\n\n")
    
    f.write("2. GOAL STATISTICS\n")
    f.write("-"*80 + "\n")
    f.write(f"Home goals - Mean: {data['home_goals'].mean():.2f}, Std: {data['home_goals'].std():.2f}\n")
    f.write(f"Away goals - Mean: {data['away_goals'].mean():.2f}, Std: {data['away_goals'].std():.2f}\n")
    f.write(f"Home advantage: +{(data['home_goals'].mean() - data['away_goals'].mean()):.2f} goals per match\n\n")
    
    f.write("3. MODEL PERFORMANCE\n")
    f.write("-"*80 + "\n")
    f.write(f"Home Goals Regressor:\n")
    f.write(f"  MAE:  {mae_home:.3f} goals\n")
    f.write(f"  RMSE: {rmse_home:.3f} goals\n")
    f.write(f"  R²:   {r2_home:.3f} ({r2_home*100:.1f}% variance explained)\n")
    f.write(f"  Improvement vs baseline: {(baseline_mae_home - mae_home):.3f} goals ({((baseline_mae_home - mae_home)/baseline_mae_home)*100:.1f}%)\n\n")
    
    f.write(f"Away Goals Regressor:\n")
    f.write(f"  MAE:  {mae_away:.3f} goals\n")
    f.write(f"  RMSE: {rmse_away:.3f} goals\n")
    f.write(f"  R²:   {r2_away:.3f} ({r2_away*100:.1f}% variance explained)\n")
    f.write(f"  Improvement vs baseline: {(baseline_mae_away - mae_away):.3f} goals ({((baseline_mae_away - mae_away)/baseline_mae_away)*100:.1f}%)\n\n")
    
    f.write("4. DERIVED OUTCOME ACCURACY\n")
    f.write("-"*80 + "\n")
    f.write(f"Outcome accuracy (derived from goals, threshold 0.5): {outcome_accuracy:.1%}\n")
    f.write(f"Best threshold: {best_threshold:.1f} goals (accuracy {best_accuracy:.1%}, tuned on the validation set)\n\n")
    
    f.write("5. COMPARISON WITH LESSON 3 (DIRECT CLASSIFIER)\n")
    f.write("-"*80 + "\n")
    f.write(f"Lesson 3 (Direct classifier):  {classifier_accuracy:.1%}\n")
    f.write(f"Lesson 3B (Goal-based):         {outcome_accuracy:.1%}\n")
    f.write(f"Difference: {(outcome_accuracy - classifier_accuracy)*100:+.1f} percentage points\n\n")
    
    if outcome_accuracy > classifier_accuracy:
        f.write("✅ VERDICT: Goal-based approach is BETTER!\n")
        f.write("   Plus: You get score predictions as a bonus!\n")
    elif outcome_accuracy > classifier_accuracy - 0.02:
        f.write("⚖️  VERDICT: Goal-based approach is COMPARABLE\n")
        f.write("   Even if slightly lower, you gain score prediction capability!\n")
    else:
        f.write("⚠️  VERDICT: Goal-based approach has lower outcome accuracy\n")
        f.write("   But: You still get valuable score predictions for betting markets\n")
    f.write("\n")
    
    f.write("6. TOP 10 FEATURES FOR HOME GOALS\n")
    f.write("-"*80 + "\n")
    for i, (idx, row) in enumerate(home_importance.head(10).iterrows(), 1):
        f.write(f"{i:2}. {row['feature']:<50} {row['importance']:.4f}\n")
    f.write("\n")
    
    f.write("7. TOP 10 FEATURES FOR AWAY GOALS\n")
    f.write("-"*80 + "\n")
    for i, (idx, row) in enumerate(away_importance.head(10).iterrows(), 1):
        f.write(f"{i:2}. {row['feature']:<50} {row['importance']:.4f}\n")
    f.write("\n")
    
    f.write("8. KEY LEARNINGS\n")
    f.write("-"*80 + "\n")
    f.write("✓ Goal prediction provides richer information than outcome classification\n")
    f.write("✓ MAE should be judged against the baseline MAE, not a fixed cutoff like 1.0 goal\n")
    f.write(f"✓ R² = {r2_home:.2f} (home) and {r2_away:.2f} (away) is the share of goal variance explained\n")
    f.write("✓ Threshold tuning can optimize derived outcome accuracy\n")
    f.write("✓ Different features matter for goals vs outcomes\n\n")
    
    f.write("9. ADVANTAGES OF GOAL PREDICTION APPROACH\n")
    f.write("-"*80 + "\n")
    f.write("✓ Score predictions (e.g., Arsenal 2.1 - 1.3 Chelsea)\n")
    f.write("✓ Goal difference (margin of victory)\n")
    f.write("✓ Over/Under betting opportunities (total goals)\n")
    f.write("✓ Both Teams To Score predictions\n")
    f.write("✓ More interpretable (humans think in goals)\n")
    f.write("✓ Can derive outcomes with tunable threshold\n\n")
    
    f.write("10. NEXT STEPS\n")
    f.write("-"*80 + "\n")
    f.write("→ Update prediction script to show score predictions\n")
    f.write("→ Add Poisson distribution modeling for more realistic scores\n")
    f.write("→ Combine both approaches: ensemble classifier + regressor\n")
    f.write("→ Deploy goal prediction to Streamlit dashboard\n")
    f.write("\n")
    
    f.write("="*80 + "\n")
    f.write("END OF REPORT\n")
    f.write("="*80 + "\n")

print(f"✓ Report saved: {report_path}")

---
## **SECTION 16: LEARNING SUMMARY**

# 🎓 **LESSON 3B COMPLETE!**

In [None]:
print("\n" + "="*80)
print("🎓 LESSON 3B COMPLETE!")
print("="*80)

print("\n📚 WHAT YOU LEARNED:")
print("✓ Regression vs Classification (predicting numbers vs categories)")
print("✓ Training goal prediction models (RandomForestRegressor)")
print("✓ Evaluation metrics for regression (MAE, RMSE, R²)")
print("✓ Deriving outcomes from goal predictions")
print("✓ Threshold tuning for optimal performance")
print("✓ Feature importance for goal prediction")

print("\n📊 YOUR RESULTS:")
print(f"Home Goals MAE: {mae_home:.3f} (avg error)")
print(f"Away Goals MAE: {mae_away:.3f} (avg error)")
print(f"Derived Outcome Accuracy: {outcome_accuracy:.1%}")
print(f"vs Lesson 3 Classifier: {classifier_accuracy:.1%} ({(outcome_accuracy - classifier_accuracy)*100:+.1f} pp)")

print("\n🎯 WHICH IS BETTER?")
if outcome_accuracy > classifier_accuracy:
    print("✅ Goal-based approach WINS!")
    print(f"   Better accuracy AND you get score predictions!")
elif outcome_accuracy > classifier_accuracy - 0.02:
    print("⚖️  Both approaches are COMPARABLE")
    print(f"   Goal-based gives you score predictions as bonus!")
else:
    print("⚠️  Classifier has better outcome accuracy")
    print(f"   But goal-based gives you MORE information!")

print("\n💡 RECOMMENDATION:")
print("Use BOTH models for different purposes:")
print("  1. Goal prediction → Score betting, Over/Under markets")
print("  2. Outcome classifier → Win/Draw/Loss betting, simpler predictions")
print("Or create an ENSEMBLE that combines both!")

print("\n📁 FILES CREATED:")
print(f"  Models:")
print(f"    - {MODEL_DIR / 'rf_home_goals.pkl'}")
print(f"    - {MODEL_DIR / 'rf_away_goals.pkl'}")
print(f"  Outputs:")
print(f"    - {OUTPUT_DIR / 'goal_distributions.png'}")
print(f"    - {OUTPUT_DIR / 'predictions_vs_actual.png'}")
print(f"    - {OUTPUT_DIR / 'error_distributions.png'}")
print(f"    - {OUTPUT_DIR / 'confusion_matrix_goal_based.png'}")
print(f"    - {OUTPUT_DIR / 'threshold_tuning.png'}")
print(f"    - {OUTPUT_DIR / 'feature_importance_goals.png'}")
print(f"    - {OUTPUT_DIR / 'goal_prediction_report.txt'}")
print(f"    - {OUTPUT_DIR / 'goal_model_metadata.json'}")

print("\n" + "="*80)
print("🎉 YOU BUILT A GOAL PREDICTION MODEL!")
print("="*80)
print("\nNext: Update prediction script to show score predictions!")
print("      Run: python ml_project/predict_match.py 'West Ham' 'Brentford'")
print("      Output: West Ham 1.8 - 1.2 Brentford (Home Win 62%)")
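
The lesson's report lists Poisson modeling as a next step, and the sample output above attaches a percentage to the predicted outcome. One possible way to get such a percentage from two expected-goal values is sketched below. It assumes home and away goals are independent Poisson variables, which is a simplifying assumption that is not validated anywhere in this lesson, and the function names are hypothetical. It is not guaranteed to reproduce the figure quoted in the sample output.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected number is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def outcome_probabilities(exp_home, exp_away, max_goals=10):
    """Sum independent-Poisson scoreline probabilities into home win / draw / away win."""
    home_win = draw = away_win = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, exp_home) * poisson_pmf(a, exp_away)
            if h > a:
                home_win += p
            elif h == a:
                draw += p
            else:
                away_win += p
    return home_win, draw, away_win

# Hypothetical expected goals, e.g. the two regressors' outputs for one match
p_home, p_draw, p_away = outcome_probabilities(1.8, 1.2)
print(f"Home Win {p_home:.1%}, Draw {p_draw:.1%}, Away Win {p_away:.1%}")
```

Scorelines above `max_goals` are ignored, so the three probabilities sum to very slightly less than 1. The same grid of scoreline probabilities could also feed the Over/Under and Both Teams To Score markets mentioned in the report.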