# 03 - Model Training and Evaluation

**Goal**: Train models to predict NFL game outcomes

**My Approach**:
- Start simple with XGBoost (usually works well out of the box)
- Focus on proper train/test split (no data leakage!)
- Evaluate on accuracy first, then dive into what the model learned


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

from src.models.base import XGBoostModel
from src.utils.metrics import evaluate_classification, print_evaluation_report

# Load the feature-engineered data from the previous notebook
print("TODO: Load data with engineered features")

## Data Preparation

**Critical Decision**: How to split train/test?

**My Strategy**: Time-based split
- Train on 2020-2022 seasons
- Test on 2023 season
- This mimics real-world usage (predicting future games)
- Avoids the leakage a random split causes (training on games played after the ones being predicted)
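
A minimal, self-contained sketch of the time-based split described above, run on a tiny made-up DataFrame. The `temporal_split` helper and the toy rows are illustrative assumptions, not project code; only the `season` and `home_win` column names come from this notebook.

```python
import pandas as pd


def temporal_split(df, train_seasons, test_seasons, season_col="season"):
    """Split by season so every test game is played after every training game."""
    if max(train_seasons) >= min(test_seasons):
        raise ValueError("All training seasons must precede all test seasons")
    train = df[df[season_col].isin(train_seasons)].copy()
    test = df[df[season_col].isin(test_seasons)].copy()
    return train, test


# Toy data (hypothetical): two games per season
toy_games = pd.DataFrame({
    "season": [2020, 2020, 2021, 2021, 2022, 2022, 2023, 2023],
    "home_win": [1, 0, 1, 1, 0, 1, 1, 0],
})

toy_train, toy_test = temporal_split(toy_games, [2020, 2021, 2022], [2023])
print(f"Training games: {len(toy_train)}, test games: {len(toy_test)}")
```

The guard at the top of the helper is the point of the exercise: it makes an accidental overlap or reversed ordering fail loudly instead of silently leaking future games into training.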

In [None]:
# TODO: Temporal train/test split
# train_data = df[df['season'].isin([2020, 2021, 2022])]
# test_data = df[df['season'] == 2023]
# 
# print(f"Training games: {len(train_data)}")
# print(f"Test games: {len(test_data)}")
# print(f"Training seasons: {sorted(train_data['season'].unique())}")
# print(f"Test seasons: {sorted(test_data['season'].unique())}")

print("TODO: Create temporal train/test split")

## Feature Selection

**What Goes Into the Model?**
- The rolling averages and other features engineered in the previous notebook
- Matchup differentials (offense vs defense)
- Game context (week, home/away)
- NOT: team names (we want the model to generalize), actual scores (they determine the label, so including them would be target leakage)
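
One way to keep identifiers and result columns out of the feature set is to filter them by name and then verify the result. The sketch below assumes the column names used elsewhere in this notebook; `select_feature_columns`, `check_no_leakage`, the `rolling_point_diff` feature, and the toy rows are hypothetical.

```python
import pandas as pd

# Columns that identify a game or reveal its result (names as used in this notebook)
LEAKY_OR_ID_COLS = {
    "game_id", "season", "home_team", "away_team",
    "home_score", "away_score", "point_spread", "home_win",
}


def select_feature_columns(df, exclude=LEAKY_OR_ID_COLS):
    """Return every column that is not an identifier, a score, or the label."""
    return [col for col in df.columns if col not in exclude]


def check_no_leakage(feature_cols, forbidden=LEAKY_OR_ID_COLS):
    """Raise if any forbidden column slipped into the feature list."""
    leaked = sorted(set(feature_cols) & set(forbidden))
    if leaked:
        raise ValueError(f"Leaky columns in feature set: {leaked}")


# Toy data (made up for illustration)
toy = pd.DataFrame({
    "home_team": ["KC", "BUF", "DAL"],
    "away_team": ["DEN", "NYJ", "PHI"],
    "home_score": [27, 17, 20],
    "away_score": [24, 20, 13],
    "week": [1, 1, 2],
    "rolling_point_diff": [3.5, -1.0, 6.2],  # hypothetical engineered feature
})
# The label is a deterministic function of the scores, which is why scores leak it
toy["home_win"] = (toy["home_score"] > toy["away_score"]).astype(int)

toy_features = select_feature_columns(toy)
check_no_leakage(toy_features)
print(toy_features)
```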

In [None]:
# TODO: Select features for modeling
# 'week' is deliberately NOT excluded: game context (week, home/away) is a feature
# exclude_cols = ['game_id', 'season', 'home_team', 'away_team', 
#                'home_score', 'away_score', 'point_spread', 'home_win']
# 
# feature_cols = [col for col in train_data.columns if col not in exclude_cols]
# print(f"Using {len(feature_cols)} features:")
# print(feature_cols)

print("TODO: Select model features")

In [None]:
# TODO: Prepare X and y
# X_train = train_data[feature_cols].fillna(0)  # Simple placeholder; XGBoost can also handle NaN natively
# y_train = train_data['home_win'].astype(int)
# 
# X_test = test_data[feature_cols].fillna(0)
# y_test = test_data['home_win'].astype(int)
# 
# print(f"Training features shape: {X_train.shape}")
# print(f"Test features shape: {X_test.shape}")

print("TODO: Prepare X and y arrays")

## Model Training

**Starting Simple**: XGBoost with default parameters
- Usually works well out of the box
- Handles missing values automatically
- Gives feature importance for free
- Can tune hyperparameters later if needed

In [None]:
# TODO: Train XGBoost model
# model = XGBoostModel()
# model.fit(X_train, y_train)
# 
# print("Model training complete!")

print("TODO: Implement XGBoost training")

## Model Evaluation

**Key Questions**:
1. What's the overall accuracy? (Baseline: always picking the home team is right ~52-53% of the time)
2. Does the model perform consistently across different types of games?
3. What features is the model using most?
4. Are there any obvious failure modes?
5. Did we run a correlation matrix of the features against the spread / win label to check whether the features actually matter, or are we just guessing?
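
Questions 1 and 5 can be checked with a few lines of pandas before any model exists. The sketch below uses synthetic data as a stand-in for the real test set; the feature names `rolling_point_diff` and `noise_feature`, and the "pick home if the differential is positive" rule, are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_games = 400

# Synthetic stand-in: one informative feature and one pure-noise feature
point_diff = rng.normal(0, 1, n_games)
noise = rng.normal(0, 1, n_games)
latent = 0.8 * point_diff + 0.2 + rng.normal(0, 1, n_games)

toy_eval = pd.DataFrame({
    "rolling_point_diff": point_diff,
    "noise_feature": noise,
    "home_win": (latent > 0).astype(int),
})

# Question 1: "always predict home" is correct exactly when the home team wins
baseline_acc = toy_eval["home_win"].mean()

# A trivial hypothetical rule to compare against the baseline
rule_pred = (toy_eval["rolling_point_diff"] > 0).astype(int)
rule_acc = (rule_pred == toy_eval["home_win"]).mean()

# Question 5: correlation of each feature with the win label
feature_corr = toy_eval.drop(columns="home_win").corrwith(toy_eval["home_win"])

print(f"Always-home baseline: {baseline_acc:.3f}")
print(f"Simple rule accuracy: {rule_acc:.3f}")
print(feature_corr.sort_values(key=abs, ascending=False))
```

Linear correlation is only a first-pass screen: a feature with near-zero correlation can still matter through interactions that a tree model picks up, so a low value is a prompt to look closer rather than a reason to drop the feature.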

In [None]:
# TODO: Basic evaluation
# y_pred = model.predict(X_test)
# y_prob = model.predict_proba(X_test)[:, 1]  # Probability of home win
# 
# accuracy = accuracy_score(y_test, y_pred)
# print(f"Test Accuracy: {accuracy:.3f}")
# print(f"Baseline (always predict home): {y_test.mean():.3f}")
# 
# print("\nClassification Report:")
# print(classification_report(y_test, y_pred))

print("TODO: Evaluate model performance")

## Feature Importance Analysis

**What I Want to Learn**:
- Which features actually matter for prediction?
- Do the important features make intuitive sense?
- Are we missing any obvious features?

In [None]:
# TODO: Feature importance
# feature_importance = model.get_feature_importance()
# importance_df = pd.DataFrame({
#     'feature': feature_cols,
#     'importance': feature_importance
# }).sort_values('importance', ascending=False)
# 
# plt.figure(figsize=(10, 8))
# sns.barplot(data=importance_df.head(15), x='importance', y='feature')
# plt.title('Top 15 Most Important Features')
# plt.tight_layout()

print("TODO: Analyze feature importance")

## Model Interpretation

**Sanity Checks**:
- Do strong teams beat weak teams more often? (Should see recent win rate as important)
- Does home field advantage show up? (Should see some home-specific features)
- Are offensive/defensive matchups captured? (Should see differential features)

**Red Flags to Watch For**:
- Model relying too heavily on one feature (possible leakage or overfitting)
- Important features that don't make football sense
- Performance much worse on certain types of games
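
The last red flag can be checked by grouping accuracy by game type. A small sketch on made-up results follows; the early/late split at week 4 is an arbitrary assumption, and any other segment (divisional games, large spreads) would work the same way.

```python
import numpy as np
import pandas as pd

# Toy results (made up): true outcome and model prediction per game
toy_results = pd.DataFrame({
    "week": [1, 1, 2, 2, 10, 10, 11, 11],
    "home_win": [1, 0, 1, 1, 0, 1, 0, 0],
    "predicted_home_win": [1, 1, 1, 0, 0, 1, 0, 0],
})

toy_results["correct"] = toy_results["home_win"] == toy_results["predicted_home_win"]

# Hypothetical segmentation: early-season games vs the rest
toy_results["season_phase"] = np.where(toy_results["week"] <= 4, "early", "late")

# Accuracy and sample size per segment; small counts make the accuracy noisy
accuracy_by_phase = toy_results.groupby("season_phase")["correct"].agg(["mean", "count"])
print(accuracy_by_phase)
```

Reporting the count next to the mean matters here: with only a few games in a segment, a large accuracy gap may be noise rather than a real failure mode.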

In [None]:
# TODO: Prediction analysis
# test_results = test_data.copy()
# test_results['predicted_home_win'] = y_pred
# test_results['home_win_prob'] = y_prob
# 
# # Look at most confident predictions
# confident_predictions = test_results[
#     (test_results['home_win_prob'] > 0.8) | (test_results['home_win_prob'] < 0.2)
# ]
# print(f"High confidence predictions: {len(confident_predictions)}")
# print(f"Accuracy on high confidence: {(confident_predictions['home_win'] == confident_predictions['predicted_home_win']).mean():.3f}")

print("TODO: Analyze prediction confidence")

## Error Analysis

**Learning from Mistakes**: Look at games where the model was most wrong
- Big upsets the model didn't see coming
- Games where model was overconfident
- Patterns in the mistakes (certain teams, certain situations)
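
Overconfidence can be made visible by comparing predicted probabilities with observed win rates per probability bin, plus a single summary number such as the Brier score. This is a sketch on made-up probabilities; the bin edges are an arbitrary choice, and the column names follow the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy predictions (made up): predicted home-win probability and actual outcome
toy_probs = pd.DataFrame({
    "home_win_prob": [0.05, 0.15, 0.30, 0.45, 0.55, 0.58, 0.85, 0.90, 0.95, 0.92],
    "home_win":      [0,    0,    1,    0,    1,    0,    1,    0,    1,    1],
})

# Bucket games by predicted probability
bin_edges = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
toy_probs["prob_bin"] = pd.cut(toy_probs["home_win_prob"], bins=bin_edges, include_lowest=True)

# Per bin: what the model predicted on average vs what actually happened
calibration = toy_probs.groupby("prob_bin", observed=True).agg(
    mean_predicted=("home_win_prob", "mean"),
    actual_home_win_rate=("home_win", "mean"),
    games=("home_win", "size"),
)
# Positive gap = model predicted home wins more often than they occurred
calibration["gap"] = calibration["mean_predicted"] - calibration["actual_home_win_rate"]

# Brier score: mean squared error of the probabilities (lower is better)
brier = np.mean((toy_probs["home_win_prob"] - toy_probs["home_win"]) ** 2)

print(calibration)
print(f"Brier score: {brier:.3f}")
```

A large positive gap in the top bin or a large negative gap in the bottom bin is the overconfidence pattern described above.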

In [None]:
# TODO: Error analysis
# test_results['prediction_error'] = np.abs(test_results['home_win_prob'] - test_results['home_win'])
# worst_predictions = test_results.nlargest(10, 'prediction_error')
# 
# print("Worst predictions (biggest errors):")
# print(worst_predictions[['home_team', 'away_team', 'home_win', 'home_win_prob', 'prediction_error']])

print("TODO: Analyze worst predictions")

## Model Training Insights

**What I Expect to Learn**:
1. **Accuracy**: Probably 55-65% if features are good
2. **Key Features**: Recent win rate, point differentials, home field
3. **Model Behavior**: Should be more confident on mismatches, less on even games
4. **Limitations**: Probably struggles with injuries, weather, motivation

**Success Criteria**:
- Beat baseline (random guessing ~50%, always home ~53%)
- Features make football sense
- Performance consistent across different game types
- High-confidence predictions are more accurate

**Next Steps**:
- If model works: Try ensemble methods, hyperparameter tuning
- If model struggles: Better features, more data, different approach
- Either way: Deploy for 2024 season predictions!