# Modeling: MLB Pitch Strike Prediction

This notebook builds and evaluates models to predict whether a pitch will be a strike.

Steps:
1. Load processed data and build features
2. Split into train/test sets
3. Train models (or load pre-trained models)
4. Evaluate performance
5. Visualize results (ROC curves, feature importances)



In [None]:
import sys
from pathlib import Path

# Add project root to path (assumes the notebook runs from a direct subdirectory of the project root)
project_root = Path().resolve().parent
sys.path.insert(0, str(project_root))

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import joblib

from src import config
from src.features import build_features_and_target
from src.models import (
    split_data,
    train_all_models,
    evaluate_all_models,
    get_feature_importances
)
from src.plots import plot_roc_curves_multiple, plot_feature_importances

# Set style
plt.style.use('default')
%matplotlib inline

print(f"Project root: {project_root}")



## 1. Load Data and Build Features


In [None]:
# Load processed data
df = pd.read_csv(config.PROCESSED_DATA_FILE)
print(f"Loaded {len(df)} pitches")

# Build features and target
print("\nBuilding features...")
X, y, feature_names = build_features_and_target(df)

print(f"Feature matrix shape: {X.shape}")
print(f"Number of features: {len(feature_names)}")
print("\nTarget distribution:")
print(y.value_counts())
print(f"Strike rate: {y.mean():.2%}")



## 2. Split Data


In [None]:
# Split into train and test sets
X_train, X_test, y_train, y_test = split_data(X, y)

print(f"Training set: {len(X_train)} pitches ({y_train.mean():.2%} strikes)")
print(f"Test set: {len(X_test)} pitches ({y_test.mean():.2%} strikes)")



## 3. Train Models (or Load Pre-trained)

The cell below loads pre-trained models from the `models/` directory (`config.MODELS_DIR`) and trains new ones only if no saved models are found.


In [None]:
# Set RETRAIN = True to train from scratch instead of loading saved models
RETRAIN = False

# Sort for a deterministic load order
model_files = [] if RETRAIN else sorted(config.MODELS_DIR.glob("*.joblib"))

if model_files:
    print("Loading pre-trained models...")
    models = {}
    for model_path in model_files:
        model_name = model_path.stem
        models[model_name] = joblib.load(model_path)
        print(f"  Loaded: {model_name}")
else:
    if not RETRAIN:
        print("No pre-trained models found.")
    print("Training models now...")
    models = train_all_models(X_train, y_train)

print(f"\nModels available: {list(models.keys())}")



## 4. Evaluate Models


In [None]:
# Evaluate on test set
print("Evaluating models on test set...")
test_metrics = evaluate_all_models(models, X_test, y_test)

print("\n" + "="*60)
print("Test Set Performance Metrics")
print("="*60)
print(test_metrics.round(4))



## 5. ROC Curves

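ROC curves show the tradeoff between the true positive rate and the false positive rate as the classification threshold varies. AUC (Area Under the Curve) summarizes the curve in a single number: 0.5 corresponds to random guessing and 1.0 to a perfect ranking, so a higher AUC means the model separates strikes from non-strikes more reliably.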
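
AUC can also be read as the probability that a randomly chosen strike receives a higher score than a randomly chosen non-strike. The sketch below computes it that way in plain Python. The function name `roc_auc` and the labels and scores are illustrative assumptions, not part of this project or its data.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the chance that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("Need at least one positive and one negative example")
    # Count positive/negative pairs where the positive scores higher; ties count half
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative labels (1 = strike) and predicted probabilities, not real data
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.7]
auc = roc_auc(y_true, scores)
print(f"AUC: {auc:.3f}")
```

The pairwise loop is quadratic in the number of pitches, so it is only meant to show the definition. For real evaluation, a library routine such as scikit-learn's `roc_auc_score` is the practical choice.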


In [None]:
# Plot ROC curves for all models
fig = plot_roc_curves_multiple(models, X_test, y_test)
plt.show()



## 6. Feature Importances

Tree-based models (Random Forest, XGBoost) expose per-feature importance scores, which show which features the fitted model relies on most.
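
The implementation of `get_feature_importances` in `src.models` is not shown in this notebook. As a rough sketch of how a helper like it might work, the hypothetical `rank_features` function below reads the `feature_importances_` attribute that scikit-learn style tree ensembles provide, and otherwise falls back to the absolute values of `coef_`. The stand-in model objects and their numbers are invented for illustration.

```python
from types import SimpleNamespace


def rank_features(model, feature_names):
    """Return (name, importance) pairs sorted from most to least important."""
    if hasattr(model, "feature_importances_"):
        # Tree ensembles expose one importance score per feature
        values = list(model.feature_importances_)
    elif hasattr(model, "coef_"):
        # Linear models: use |coefficient| from the single row of a binary classifier
        values = [abs(c) for c in model.coef_[0]]
    else:
        raise TypeError("Model exposes neither feature_importances_ nor coef_")
    if len(values) != len(feature_names):
        raise ValueError("feature_names does not match the number of importances")
    return sorted(zip(feature_names, values), key=lambda kv: kv[1], reverse=True)


# Stand-in objects with made-up numbers, used only to exercise the function
fake_forest = SimpleNamespace(feature_importances_=[0.1, 0.6, 0.3])
fake_logreg = SimpleNamespace(coef_=[[-2.0, 0.5, 1.0]])
names = ["release_speed", "plate_x", "plate_z"]

forest_ranking = rank_features(fake_forest, names)
logreg_ranking = rank_features(fake_logreg, names)
print(forest_ranking)
print(logreg_ranking)
```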


In [None]:
# Show feature importances for each tree-based model that is available
for key, label in [('random_forest', 'Random Forest'), ('xgboost', 'XGBoost')]:
    if key not in models:
        continue
    importances = get_feature_importances(models[key], feature_names)

    print(f"\nTop 20 Most Important Features ({label}):")
    print(importances.head(20))

    # Plot
    fig = plot_feature_importances(importances, top_n=20)
    plt.show()



## 7. Model Interpretation

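The Logistic Regression coefficients show the direction and relative strength of each feature's effect. Magnitudes are only comparable across features if the features were on a common scale (for example, standardized) when the model was fit.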
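
One common way to read a logistic regression coefficient is as an odds ratio: exponentiating it gives the factor by which the odds of a strike change for a one-unit increase in that feature, holding the other features fixed. The sketch below uses made-up coefficient values, not results from this notebook, and the helper name `odds_ratio` is an assumption.

```python
import math

# Hypothetical coefficients, chosen only for illustration
example_coefs = {"strikes": 0.40, "balls": -0.25}


def odds_ratio(coef):
    """Multiplicative change in the odds per one-unit increase in the feature."""
    return math.exp(coef)


odds_ratios = {name: odds_ratio(c) for name, c in example_coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio {ratio:.2f}")
```

A ratio above 1 means the feature raises the odds of a strike and a ratio below 1 means it lowers them. If the features were standardized before fitting, "one unit" means one standard deviation.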


In [None]:
# Logistic Regression coefficients
if 'logistic_regression' in models:
    lr_model = models['logistic_regression']
    
    # Get coefficients (use absolute value for importance)
    coef_importances = get_feature_importances(lr_model, feature_names)
    
    print("Top 20 Features by Absolute Coefficient (Logistic Regression):")
    print(coef_importances.head(20))
    
    # Plot
    fig = plot_feature_importances(coef_importances, top_n=20)
    plt.show()



## Summary

- **Best Model**: Based on ROC-AUC, the {best_model} model performs best
- **Key Features**: The most important features are {top_features}
- **Model Performance**: Judge each model's test metrics against a simple baseline, such as always predicting the majority class, before claiming an improvement

Key takeaways:
1. Plate location (plate_x, plate_z) is highly predictive
2. Count (balls, strikes) is important
3. Pitch characteristics (release_speed, pitch type) matter
4. Tree-based models (Random Forest, XGBoost) perform well on this task
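
The summary refers to a baseline, but this notebook does not define one. A minimal sketch of one possible choice, a majority-class baseline, is shown below. The function name `majority_baseline_accuracy` and the toy labels are assumptions for illustration, not results from this project.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for y in y_test if y == majority_label)
    return majority_label, hits / len(y_test)


# Made-up labels (1 = strike, 0 = not a strike), not results from this notebook
toy_train = [0, 0, 1, 0, 1, 0, 0, 1]
toy_test = [0, 1, 0, 0, 1]
label, baseline_acc = majority_baseline_accuracy(toy_train, toy_test)
print(f"Always predicting {label}: accuracy {baseline_acc:.2f}")
```

A baseline like this has an ROC-AUC of 0.5 by construction, since it assigns every pitch the same score, so accuracy is the more informative metric when comparing against it.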

