# 05. Random Forest Ensemble Model

This notebook trains and evaluates a Random Forest classifier (an ensemble of decision trees) to predict whether the home team wins.

## Objectives
- Build a Random Forest classifier
- Tune number of trees and other hyperparameters
- Analyze feature importance across the ensemble
- Compare to a single decision tree
- Measure out-of-bag (OOB) error

## Why Random Forests?
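- **Reduces Overfitting**: Averaging many decorrelated trees lowers variance
- **More Robust**: Less sensitive to noise in individual samples or features
- **Better Generalization**: Usually outperforms a single tree on held-out data
- **Feature Importance**: Estimates averaged over many trees are more stable
- **Out-of-Bag**: Each tree can be checked on the rows left out of its bootstrap sample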

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.data_processing.cleaning import DataCleaner
from src.data_processing.game_features import GameFeatureEngineer
from src.data_processing.dataset_builder import DatasetBuilder
from src.models.random_forest_model import GameRandomForest
from src.models.decision_tree_model import GameDecisionTree
from src.evaluation.model_comparison import ModelComparison
from src.utils.data_loader import load_games_as_dataframe

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

print("✓ Imports successful")

## 1. Prepare Data

In [None]:
# Load real games; fall back to generated sample data if loading fails
try:
    games_df = load_games_as_dataframe(season=2023)
except Exception:
    from scripts.generate_sample_data import generate_sample_games
    games_df = pd.DataFrame(generate_sample_games(200))

cleaner = DataCleaner()
games_df = cleaner.clean_game_data(games_df)

engineer = GameFeatureEngineer()
features_df = engineer.create_game_features(games_df)

builder = DatasetBuilder()
dataset = builder.create_dataset(
    df=features_df,
    target_column='home_win',
    date_column='date',
    split_method='time',
    scale_features=False,
    exclude_columns=['game_id', 'home_team_id', 'away_team_id', 'home_score', 'away_score']
)

print(f"Dataset ready: {len(dataset['X_train'])} training samples")
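
The `split_method='time'` option presumably orders games chronologically before splitting, so that validation and test games come later than every training game. The internals of `DatasetBuilder` are not shown in this notebook, so the following is only a minimal sketch of the idea. The `time_split` helper and the 70/15/15 fractions are assumptions made for illustration.

```python
import random
from datetime import date

def time_split(rows, date_key="date", train_frac=0.70, val_frac=0.15):
    """Chronological split: oldest rows go to train, newest to test.

    Hypothetical helper for illustration; not the DatasetBuilder API.
    """
    ordered = sorted(rows, key=lambda r: r[date_key])
    n = len(ordered)
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]

# Twenty toy games on distinct days, deliberately shuffled in time.
days = list(range(1, 21))
random.Random(0).shuffle(days)
games = [{"date": date(2023, 1, d), "home_win": d % 2} for d in days]

train, val, test = time_split(games)
print(len(train), len(val), len(test))
```

Splitting by time rather than at random keeps information from future games out of the training set, which matters when features are built from past results.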

## 2. Train Random Forest

In [None]:
# Train Random Forest
rf_model = GameRandomForest()

print("Training Random Forest...")
print("This trains an ensemble of decision trees with bootstrap sampling")

train_metrics = rf_model.train(
    dataset['X_train'],
    dataset['y_train'],
    dataset['X_val'],
    dataset['y_val'],
    tune_hyperparameters=True
)

print("\n" + "="*60)
print("BEST HYPERPARAMETERS")
print("="*60)
print(f"Number of Trees: {rf_model.model.n_estimators}")
print(f"Max Depth: {rf_model.model.max_depth}")
print(f"Min Samples Split: {rf_model.model.min_samples_split}")
print(f"Max Features: {rf_model.model.max_features}")

print("\n" + "="*60)
print("VALIDATION PERFORMANCE")
print("="*60)
for metric, value in train_metrics.items():
    print(f"{metric:20s}: {value:.4f}")

## 3. Out-of-Bag (OOB) Score

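Each tree is trained on a bootstrap sample, so roughly a third of the training rows are left out of any given tree. Scoring each row using only the trees that never saw it gives the out-of-bag (OOB) score, a built-in validation estimate.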

In [None]:
# Get OOB score if available
if hasattr(rf_model.model, 'oob_score_'):
    print(f"Out-of-Bag Score: {rf_model.model.oob_score_:.4f}")
    print("\nOOB gives a validation-style estimate without needing a separate validation set")
else:
    print("OOB scoring not enabled (set oob_score=True when training)")
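
To see where out-of-bag rows come from, the sketch below draws bootstrap samples (sampling row indices with replacement) and measures how many rows each sample misses. It is a standalone illustration using only the standard library and does not touch `rf_model`. The helper name `bootstrap_oob_fraction` is made up for this example.

```python
import random

def bootstrap_oob_fraction(n_samples, rng):
    """Draw one bootstrap sample and return the fraction of rows left out of it."""
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples

rng = random.Random(42)
fractions = [bootstrap_oob_fraction(1000, rng) for _ in range(200)]
mean_oob = sum(fractions) / len(fractions)
print(f"Mean OOB fraction over 200 bootstrap samples: {mean_oob:.3f}")
```

The mean should land close to 1/e ≈ 0.368, because each row is missed by a bootstrap sample of size n with probability (1 - 1/n)^n.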

## 4. Test Set Evaluation

In [None]:
test_metrics = rf_model.evaluate(dataset['X_test'], dataset['y_test'])

print("="*60)
print("TEST SET PERFORMANCE")
print("="*60)
for metric, value in test_metrics.items():
    print(f"{metric:20s}: {value:.4f}")

## 5. Feature Importance (Ensemble Average)

In [None]:
# Get aggregated feature importance
importance_df = rf_model.get_feature_importance(dataset['feature_names'])

# Plot
top_features = importance_df.head(15)
plt.figure(figsize=(10, 8))
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Mean Decrease in Impurity')
plt.title('Top 15 Features - Random Forest (Averaged over all trees)')
plt.tight_layout()
plt.show()

print("Top 10 Features:")
print(top_features.head(10))
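
The x-axis label "Mean Decrease in Impurity" refers to how much each split lowers node impurity. As a rough sketch of the quantity being accumulated (assuming Gini impurity, which is scikit-learn's default criterion; `get_feature_importance` itself is not shown here), the impurity decrease of a single split can be computed as follows. The helper names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - children

parent = [1, 1, 1, 1, 0, 0, 0, 0]                       # 4 home wins, 4 losses
perfect = impurity_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0])
useless = impurity_decrease(parent, [1, 1, 0, 0], [1, 1, 0, 0])
print(perfect, useless)
```

A feature's importance sums these decreases over every split that uses the feature, weighted by the share of samples reaching each node, and then averages across trees. Impurity-based importance tends to favor features with many distinct values, so treat the ranking as a guide rather than a definitive ordering.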

## 6. Compare to Single Decision Tree

In [None]:
# Train single decision tree
dt_model = GameDecisionTree()
dt_model.train(
    dataset['X_train'],
    dataset['y_train'],
    dataset['X_val'],
    dataset['y_val'],
    tune_hyperparameters=True
)

# Compare
comparison = ModelComparison(task_type='classification')
comparison.add_model('Decision Tree', dt_model, dataset['X_test'], dataset['y_test'])
comparison.add_model('Random Forest', rf_model, dataset['X_test'], dataset['y_test'])

results = comparison.compare_all()
print("\n" + "="*60)
print("RANDOM FOREST vs DECISION TREE")
print("="*60)
print(results)

best_model, _ = comparison.get_best_model()
print(f"\n✓ Best Model: {best_model}")

## 7. Understanding Ensemble Predictions

In [None]:
# Analyze prediction confidence
test_proba = rf_model.predict_proba(dataset['X_test'])
test_pred = rf_model.predict(dataset['X_test'])

# Confidence distribution
confidence = np.max(test_proba, axis=1)

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(confidence, bins=20, edgecolor='black')
plt.xlabel('Prediction Confidence')
plt.ylabel('Count')
plt.title('Distribution of Prediction Confidence')

plt.subplot(1, 2, 2)
# Convert labels to a NumPy array so the boolean mask indexes by position
correct = test_pred == np.asarray(dataset['y_test'])
plt.hist([confidence[correct], confidence[~correct]],
         bins=20, label=['Correct', 'Incorrect'], alpha=0.7, edgecolor='black')
plt.xlabel('Confidence')
plt.ylabel('Count')
plt.title('Confidence: Correct vs Incorrect Predictions')
plt.legend()

plt.tight_layout()
plt.show()

print(f"Mean confidence: {confidence.mean():.3f}")
print(f"High confidence (>0.7): {(confidence > 0.7).sum()} predictions")
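
The confidence values plotted above come from `predict_proba`. Assuming `GameRandomForest` wraps scikit-learn's `RandomForestClassifier`, the forest's probability for a sample is the average of the per-tree class probabilities, and the predicted class is the one with the highest average. The sketch below reproduces that aggregation for one game. The five per-tree outputs are made up, and the class order `[away win, home win]` is an assumption.

```python
def average_probabilities(tree_probas):
    """Average per-tree class probabilities for a single sample."""
    n_trees = len(tree_probas)
    n_classes = len(tree_probas[0])
    return [sum(p[k] for p in tree_probas) / n_trees for k in range(n_classes)]

# Illustrative outputs of five trees for one game: [P(away win), P(home win)]
tree_probas = [[0.2, 0.8], [0.4, 0.6], [0.7, 0.3], [0.1, 0.9], [0.3, 0.7]]

mean_proba = average_probabilities(tree_probas)
predicted_class = max(range(len(mean_proba)), key=mean_proba.__getitem__)
confidence = max(mean_proba)
print(predicted_class, round(confidence, 2))
```

Here four of the five trees favor a home win, and the averaged probability is 0.66. Confidence near 0.5 therefore means the trees disagree, which is why low-confidence predictions are wrong more often.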

## 8. Conclusion

### Random Forest Advantages
- ✓ Usually more accurate than a single decision tree
- ✓ Reduces overfitting through averaging
- ✓ More stable and robust predictions
- ✓ More reliable feature importance
- ✓ Out-of-bag validation

### Performance Improvement
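Random Forests often improve accuracy by a few percentage points over a single decision tree (2-5 points is a rough rule of thumb, not a guarantee). The comparison in Section 6 gives the measured difference on this dataset.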

### Key Insights
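- Random Forests combine many deep, high-variance trees into a lower-variance ensemble (bagging); combining weak learners into a strong one describes boosting
- Bootstrap sampling and feature randomness reduce correlation between trees
- Aggregating predictions across trees (by majority vote or by averaging class probabilities) is more reliable than relying on any single tree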
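
The point about reducing correlation between trees can be made precise with a standard result. It is stated here under the simplifying assumption that the predictions $T_b(x)$ of the $B$ trees are identically distributed with variance $\sigma^2$ and pairwise correlation $\rho$:

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Adding trees shrinks only the second term. Lowering $\rho$, which bootstrap sampling and random feature subsets aim to do, shrinks the first, so decorrelation matters as much as the number of trees.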

### Next Steps
→ Notebook 06: Compare all models (Logistic Regression, Decision Tree, Random Forest)

🌲🌲🌲 A forest is better than a single tree!