# 04. Decision Tree Model

This notebook implements a decision tree classifier for NBA game prediction.

## Objectives
- Build a decision tree classifier
- Tune hyperparameters (depth, min_samples_split, etc.)
- Visualize the decision tree
- Compare performance to logistic regression baseline

## Why Decision Trees?
- Interpretable: Easy to understand decision rules
- Non-linear: Can capture complex patterns
- No feature scaling required
- Handles interactions naturally
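
The "no feature scaling required" point follows from how a tree chooses splits: it compares a feature against a threshold and keeps the threshold that gives the purest child nodes. Below is a minimal stdlib sketch of that idea using Gini impurity. The helper names (`gini`, `best_split`) and the toy data are illustrative assumptions, not part of this project's `src` package.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k ** 2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(values, labels):
    """Return (threshold, weighted_child_impurity) for the best split on one feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, gini(labels))  # baseline: no split at all
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        if weighted < best[1]:
            best = (threshold, weighted)
    return best


# Hypothetical toy data: a point-differential feature and home-win labels.
point_diff = [-8.0, -3.0, -1.0, 2.0, 5.0, 9.0]
home_win = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(point_diff, home_win)

# Rescaling the feature moves the threshold but leaves the partition unchanged.
scaled = [value * 100 for value in point_diff]
scaled_threshold, scaled_impurity = best_split(scaled, home_win)

print(threshold, impurity)
print(scaled_threshold, scaled_impurity)
```

Because only the ordering of feature values matters, multiplying the feature by a positive constant changes the threshold but not which games fall on each side of it.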

In [None]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from src.data_processing.cleaning import DataCleaner
from src.data_processing.game_features import GameFeatureEngineer
from src.data_processing.dataset_builder import DatasetBuilder
from src.models.decision_tree_model import GameDecisionTree
from src.models.logistic_regression_model import GameLogisticRegression
from src.evaluation.model_comparison import ModelComparison
from src.utils.data_loader import load_games_as_dataframe

sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)

print("✓ Imports successful")

## 1. Prepare Data (Same as Notebook 03)

In [None]:
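# Load and prepare data; fall back to generated sample games if loading fails
try:
    games_df = load_games_as_dataframe(season=2023)
except Exception as exc:
    print(f"Could not load 2023 season data ({exc}); using generated sample data")
    from scripts.generate_sample_data import generate_sample_games
    games_df = pd.DataFrame(generate_sample_games(200))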

cleaner = DataCleaner()
games_df = cleaner.clean_game_data(games_df)

engineer = GameFeatureEngineer()
features_df = engineer.create_game_features(games_df)

builder = DatasetBuilder()
dataset = builder.create_dataset(
    df=features_df,
    target_column='home_win',
    date_column='date',
    split_method='time',
    scale_features=False,  # Tree splits are threshold-based, so feature scaling is unnecessary
    exclude_columns=['game_id', 'home_team_id', 'away_team_id', 'home_score', 'away_score']
)

print(f"Training samples: {len(dataset['X_train'])}")
print(f"Features: {dataset['X_train'].shape[1]}")
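
The `split_method='time'` option matters because games are ordered in time: training on later games and testing on earlier ones would leak future information. How `DatasetBuilder` implements this is not shown here, so the following is only a sketch of the general idea. The function `time_split`, its percentage parameters, and the toy rows are assumptions made for illustration.

```python
from datetime import date


def time_split(rows, date_key="date", train_pct=70, val_pct=15):
    """Split rows chronologically into train, validation, and test lists."""
    ordered = sorted(rows, key=lambda row: row[date_key])
    n = len(ordered)
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Hypothetical games on 20 consecutive days, deliberately listed newest first.
games = [{"date": date(2023, 11, day), "home_win": day % 2} for day in range(20, 0, -1)]

train, val, test = time_split(games)

print(len(train), len(val), len(test))
print(train[-1]["date"], val[0]["date"], test[0]["date"])
```

Every training game precedes every validation game, and every validation game precedes every test game, which mirrors how the model would be used on future games.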

## 2. Train Decision Tree with Hyperparameter Tuning

In [None]:
# Train decision tree
dt_model = GameDecisionTree()

print("Training Decision Tree with GridSearchCV...")
print("Testing combinations of: max_depth, min_samples_split, min_samples_leaf")

train_metrics = dt_model.train(
    dataset['X_train'],
    dataset['y_train'],
    dataset['X_val'],
    dataset['y_val'],
    tune_hyperparameters=True
)

print("\n" + "="*60)
print("BEST HYPERPARAMETERS")
print("="*60)
print(f"Max Depth: {dt_model.model.max_depth}")
print(f"Min Samples Split: {dt_model.model.min_samples_split}")
print(f"Min Samples Leaf: {dt_model.model.min_samples_leaf}")

print("\n" + "="*60)
print("VALIDATION PERFORMANCE")
print("="*60)
for metric, value in train_metrics.items():
    print(f"{metric:20s}: {value:.4f}")
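
For readers unfamiliar with grid search: it scores every combination of the candidate hyperparameter values and keeps the best one. The sketch below shows that loop in plain Python. The grid values and `toy_score` are made up for illustration; the actual grid inside `GameDecisionTree` is not shown in this notebook, and a real search would use cross-validated model scores instead of a stand-in function.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Score every parameter combination and return (best_params, best_score, n_tried)."""
    names = list(param_grid)
    best_params, best_score, n_tried = None, float("-inf"), 0
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        n_tried += 1
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score, n_tried


# Hypothetical grid; the values searched by GameDecisionTree may differ.
param_grid = {
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}


def toy_score(params):
    """Stand-in for a validation score; NOT a real model evaluation."""
    return -(
        abs(params["max_depth"] - 5) * 100
        + params["min_samples_leaf"] * 10
        + params["min_samples_split"]
    )


best_params, best_score, n_tried = grid_search(param_grid, toy_score)
print(n_tried, best_params, best_score)
```

The cost grows multiplicatively with each added hyperparameter (3 x 3 x 3 = 27 combinations here), which is why grids are usually kept small.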

## 3. Evaluate on Test Set

In [None]:
test_metrics = dt_model.evaluate(dataset['X_test'], dataset['y_test'])

print("="*60)
print("TEST SET PERFORMANCE")
print("="*60)
for metric, value in test_metrics.items():
    print(f"{metric:20s}: {value:.4f}")

## 4. Visualize Decision Tree

In [None]:
# Visualize tree (top levels only)
dt_model.visualize_tree(max_depth=3, figsize=(20, 10))
plt.show()

In [None]:
# Print decision rules
print("Decision Rules (Top 5 levels):")
print("="*60)
rules = dt_model.get_tree_rules(max_depth=5)
print(rules)
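
The exact text format returned by `get_tree_rules` depends on the project's implementation. As a rough illustration of how a fitted tree maps to readable rules, here is a sketch that walks a tree stored as nested dictionaries and truncates it at a maximum depth. The dictionary layout, feature names, and thresholds are all hypothetical.

```python
def render_rules(node, max_depth, depth=0):
    """Return indented rule lines for a nested-dict tree, truncated at max_depth."""
    indent = "    " * depth
    if "label" in node:  # leaf node
        return [f"{indent}predict: {node['label']}"]
    if depth >= max_depth:
        return [f"{indent}... (truncated)"]
    lines = [f"{indent}if {node['feature']} <= {node['threshold']}:"]
    lines += render_rules(node["left"], max_depth, depth + 1)
    lines.append(f"{indent}else:")
    lines += render_rules(node["right"], max_depth, depth + 1)
    return lines


# Hypothetical tree; feature names and thresholds are invented for illustration.
toy_tree = {
    "feature": "home_win_pct",
    "threshold": 0.55,
    "left": {
        "feature": "away_rest_days",
        "threshold": 1,
        "left": {"label": "home_win"},
        "right": {"label": "away_win"},
    },
    "right": {"label": "home_win"},
}

full_rules = render_rules(toy_tree, max_depth=5)
short_rules = render_rules(toy_tree, max_depth=1)
print("\n".join(full_rules))
```

Limiting the depth keeps the output readable for deep trees, at the cost of hiding the finer splits.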

## 5. Feature Importance

In [None]:
# Get feature importance
importance_df = dt_model.get_feature_importance(dataset['feature_names'])

# Plot
top_features = importance_df.head(15)
plt.figure(figsize=(10, 8))
plt.barh(range(len(top_features)), top_features['importance'])
plt.yticks(range(len(top_features)), top_features['feature'])
plt.xlabel('Feature Importance (Gini)')
plt.title('Top 15 Features - Decision Tree')
plt.tight_layout()
plt.show()

print("Top 10 Features:")
print(top_features.head(10))
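
Gini-based importance, as labelled on the plot axis, is commonly computed by summing each feature's weighted impurity decrease over all the splits that use it and then normalizing so the importances sum to 1. The sketch below shows that arithmetic on two hand-written split records. The node statistics and feature names are hypothetical, and whether `get_feature_importance` uses exactly this calculation is an assumption.

```python
from collections import defaultdict


def gini_importance(split_nodes, n_total):
    """Normalized sum of weighted impurity decreases per feature."""
    raw = defaultdict(float)
    for node in split_nodes:
        n_left, imp_left = node["left"]
        n_right, imp_right = node["right"]
        decrease = (
            node["n"] * node["impurity"] - n_left * imp_left - n_right * imp_right
        ) / n_total
        raw[node["feature"]] += decrease
    total = sum(raw.values())
    return {feature: value / total for feature, value in raw.items()}


# Hypothetical split nodes: samples and impurity at the node, then (samples, impurity) per child.
split_nodes = [
    {"feature": "home_win_pct", "n": 100, "impurity": 0.5, "left": (60, 0.4), "right": (40, 0.3)},
    {"feature": "away_rest_days", "n": 60, "impurity": 0.4, "left": (30, 0.2), "right": (30, 0.4)},
]

importances = gini_importance(split_nodes, n_total=100)
for feature, value in sorted(importances.items(), key=lambda item: -item[1]):
    print(f"{feature:20s}: {value:.2f}")
```

Splits near the root affect more samples and therefore tend to contribute more, so a feature used once at the top can outrank one used several times lower down.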

## 6. Compare to Baseline (Logistic Regression)

In [None]:
# Train baseline for comparison
lr_model = GameLogisticRegression()
lr_model.train(
    dataset['X_train'],
    dataset['y_train'],
    dataset['X_val'],
    dataset['y_val'],
    tune_hyperparameters=False
)

# Compare
comparison = ModelComparison(task_type='classification')
comparison.add_model('Logistic Regression', lr_model, dataset['X_test'], dataset['y_test'])
comparison.add_model('Decision Tree', dt_model, dataset['X_test'], dataset['y_test'])

results = comparison.compare_all()
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)
print(results)

best_model, _ = comparison.get_best_model()
print(f"\n✓ Best Model: {best_model}")
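
When reading the comparison table, it helps to recall how the headline classification metrics are defined. The sketch below computes accuracy, precision, and recall from a small set of made-up labels. The function name and the data are illustrative only, and the metrics that `ModelComparison` actually reports may differ.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, and recall for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return {
        "accuracy": correct / len(y_true),
        # Guard against division by zero when nothing is predicted or labelled positive.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: 1 = home win, 0 = away win.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name:20s}: {value:.4f}")
```

Precision answers "when the model predicts a home win, how often is it right?", while recall answers "of the actual home wins, how many did the model catch?".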

## 7. Conclusion

### Decision Tree Advantages
- ✓ Interpretable decision rules
- ✓ Captures non-linear patterns
- ✓ No feature scaling needed
- ✓ Shows which features drive decisions

### Decision Tree Disadvantages
- ✗ Can overfit the training data, especially when the tree is deep
- ✗ Unstable: small changes in the data can produce a very different tree
- ✗ May not generalize as well as simpler or ensemble models

### Performance vs Baseline
Use the comparison table from Section 6 to compare accuracy, precision, and recall against the logistic regression baseline.

### Next Steps
→ Notebook 05: Try a Random Forest (an ensemble of decision trees) to reduce overfitting!

🌲 Decision trees provide interpretable insights into game prediction!