# Predicting Hall of Fame Inductees (Data Science Portfolio)

## Objective
Build an XGBoost classifier that predicts Baseball Hall of Fame induction from locally stored Lahman Database CSV files. The notebook covers Exploratory Data Analysis (EDA), model training, and model interpretation with SHAP.

## Data Source
We use three CSV files located in `../../data/baseball/`:
1.  **Batting.csv**: Season-level batting statistics, aggregated into career totals below.
2.  **HallOfFame.csv**: Ground truth labels for induction.
3.  **People.csv**: Player names and biographical details.


In [None]:
# Install dependencies (if needed)
!pip install pandas numpy xgboost scikit-learn matplotlib seaborn shap

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import os

plt.style.use('fivethirtyeight')
# Initialize JS for SHAP plots
shap.initjs()

## 1. Data Loading
Load the CSVs into pandas DataFrames. The default path assumes the notebook runs from `assets/code/`, with the data two levels up in `data/baseball/`.

In [None]:
# Define paths (Update this to './' if running in Colab with direct upload)
DATA_DIR = '../../data/baseball/'

try:
    df_batting = pd.read_csv(os.path.join(DATA_DIR, 'Batting.csv'))
    df_hof = pd.read_csv(os.path.join(DATA_DIR, 'HallOfFame.csv'))
    df_people = pd.read_csv(os.path.join(DATA_DIR, 'People.csv'))
    
    print("Data Loaded Successfully!")
    print(f"Batting: {df_batting.shape}")
    print(f"HOF: {df_hof.shape}")
    print(f"People: {df_people.shape}")
    
except FileNotFoundError:
    print("Error: CSV files not found. Please check the DATA_DIR path.")
    raise  # Stop here: every later cell depends on these DataFrames
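
Before loading, it can help to check which of the expected files are actually present, so that a wrong `DATA_DIR` is reported by file name. A minimal sketch, assuming the three file names listed under Data Source; `missing_files` is a hypothetical helper, not part of the original notebook:

```python
from pathlib import Path

# File names taken from the Data Source section
EXPECTED_FILES = ["Batting.csv", "HallOfFame.csv", "People.csv"]


def missing_files(data_dir, names=EXPECTED_FILES):
    """Return the expected file names that do not exist under data_dir."""
    base = Path(data_dir)
    return [name for name in names if not (base / name).is_file()]


# Hypothetical usage with the notebook's default path
missing = missing_files("../../data/baseball/")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files found.")
```

Running this ahead of the `pd.read_csv` calls narrows a path problem down to the specific file that is absent.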

## 2. Preprocessing & Feature Engineering
We transform season-level stats into career totals, and then derive advanced sabermetric indicators.

In [None]:
# 1. Aggregate Batting Stats by PlayerID
# HBP (Hit by Pitch) and SF (Sacrifice Fly) are included because the OBP denominator needs them
cols_to_sum = ['G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'BB', 'SO', 'HBP', 'SF']

# Ensure columns are numeric
for col in cols_to_sum:
    if col in df_batting.columns:
        df_batting[col] = pd.to_numeric(df_batting[col], errors='coerce').fillna(0)

career_stats = df_batting.groupby('playerID')[cols_to_sum].sum().reset_index()

# 2. Calculate Basic Rates
career_stats['AVG'] = np.where(career_stats['AB'] > 0, career_stats['H'] / career_stats['AB'], 0)

# 3. ADVANCED FEATURE ENGINEERING
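# Plate Appearances (PA), approximated by the OBP denominator: AB + BB + HBP + SF
# True plate appearances would also count sacrifice hits (SH), which are not summed here
career_stats['PA'] = career_stats['AB'] + career_stats['BB'] + career_stats['HBP'] + career_stats['SF']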

# On-Base Percentage (OBP)
career_stats['OBP'] = np.where(career_stats['PA'] > 0, 
                               (career_stats['H'] + career_stats['BB'] + career_stats['HBP']) / career_stats['PA'], 0)

# Slugging Percentage (SLG)
# TB = H + 2B + 2*3B + 3*HR
total_bases = career_stats['H'] + career_stats['2B'] + (2 * career_stats['3B']) + (3 * career_stats['HR'])
career_stats['SLG'] = np.where(career_stats['AB'] > 0, total_bases / career_stats['AB'], 0)

# On-Base Plus Slugging (OPS)
career_stats['OPS'] = career_stats['OBP'] + career_stats['SLG']

# Isolated Power (ISO)
career_stats['ISO'] = career_stats['SLG'] - career_stats['AVG']

# Walk Rate and Strikeout Rate
career_stats['BB_Rate'] = np.where(career_stats['PA'] > 0, career_stats['BB'] / career_stats['PA'], 0)
career_stats['K_Rate']  = np.where(career_stats['PA'] > 0, career_stats['SO'] / career_stats['PA'], 0)

# 4. Merge with People to get Names
career_stats = career_stats.merge(df_people[['playerID', 'nameFirst', 'nameLast']], on='playerID', how='left')
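# fillna('') keeps the name from becoming NaN when a first or last name is missing
career_stats['Name'] = (career_stats['nameFirst'].fillna('') + ' ' + career_stats['nameLast'].fillna('')).str.strip()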

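# Filter for significant careers (> 2000 ABs)
# .copy() avoids a SettingWithCopyWarning when columns are added to the filtered frame later
career_stats = career_stats[career_stats['AB'] > 2000].copy()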

print(f"Filtered Career Players: {len(career_stats)}")
career_stats[['Name', 'OPS', 'ISO', 'OBP']].head()
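
To make the formulas above concrete, here is a small worked example on a single made-up stat line. It is a sketch only: `rate_stats` and `toy_line` are hypothetical names, and the numbers are invented to keep the arithmetic easy to check by hand.

```python
def rate_stats(line):
    """Compute AVG, OBP, SLG, OPS and ISO from a dict of counting stats."""
    ab, h = line["AB"], line["H"]
    # Same denominator the notebook stores as 'PA'
    obp_denominator = ab + line["BB"] + line["HBP"] + line["SF"]
    # TB = H + 2B + 2*3B + 3*HR (each hit already counts one base via H)
    total_bases = h + line["2B"] + 2 * line["3B"] + 3 * line["HR"]

    avg = h / ab if ab > 0 else 0.0
    obp = (h + line["BB"] + line["HBP"]) / obp_denominator if obp_denominator > 0 else 0.0
    slg = total_bases / ab if ab > 0 else 0.0
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg, "ISO": slg - avg}


# Made-up stat line, chosen only to keep the arithmetic easy to follow
toy_line = {"AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 20, "BB": 50, "HBP": 5, "SF": 5}
toy_rates = rate_stats(toy_line)
for name, value in toy_rates.items():
    print(f"{name}: {value:.3f}")
```

For this line, total bases are 150 + 30 + 10 + 60 = 250, so SLG is 250 / 500 = 0.500 and ISO is 0.500 - 0.300 = 0.200. OBP is 205 / 560, roughly 0.366.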

## 3. Labeling (HOF Status)
Identify players who were inducted into the Hall of Fame using the `HallOfFame` table.

In [None]:
# Get list of inducted IDs (inducted = 'Y')
hof_inductees = df_hof[df_hof['inducted'] == 'Y']['playerID'].unique()

# Create Target Column
career_stats['is_hof'] = career_stats['playerID'].isin(hof_inductees).astype(int)

print(f"Total Inductees in Dataset: {career_stats['is_hof'].sum()}")
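
One caveat: filtering on `inducted == 'Y'` alone also picks up people inducted as managers, executives, or umpires, so a former player honored in another role would be labeled 1. Below is a hedged sketch of restricting the label to player inductions. It assumes the table has a `category` column with the value `'Player'`, as in recent Lahman releases, so verify this against your copy. The rows and IDs are invented, and `player_inductee_ids` is a hypothetical helper.

```python
import pandas as pd

# Invented rows that mimic the relevant HallOfFame.csv columns
toy_hof = pd.DataFrame({
    "playerID": ["hitter01", "skipper01", "hitter02", "hitter01"],
    "inducted": ["Y", "Y", "N", "N"],
    "category": ["Player", "Manager", "Player", "Player"],
})


def player_inductee_ids(hof):
    """Return unique playerIDs inducted specifically in the 'Player' category."""
    mask = (hof["inducted"] == "Y") & (hof["category"] == "Player")
    return hof.loc[mask, "playerID"].unique()


toy_ids = player_inductee_ids(toy_hof)
print(list(toy_ids))
```

Whether to apply this filter is a modeling choice: it keeps the target aligned with batting performance, at the cost of excluding players honored for their post-playing careers.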

## 4. Exploratory Data Analysis (EDA)
Let's visualize the data to understand the distinct statistical profile of Hall of Famers.

In [None]:
# 1. Correlation Heatmap
# Include the rate stats (OBP, SLG, OPS) alongside counting stats to compare their correlation with HOF status
plt.figure(figsize=(12, 10))
features_to_plot = ['H', 'HR', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS', 'is_hof']
sns.heatmap(career_stats[features_to_plot].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Including Advanced Stats')
plt.show()

In [None]:
# 2. Scatter Plot: OPS vs. Hits
# Does high OPS compensate for fewer hits?
plt.figure(figsize=(10, 6))
sns.scatterplot(data=career_stats, x='H', y='OPS', hue='is_hof', alpha=0.6, palette={0: 'gray', 1: 'gold'})
plt.title('Career Hits vs. OPS (Gold = HOF)')
plt.axvline(x=3000, color='red', linestyle='--', linewidth=1, label='3000 Hits')
plt.axhline(y=0.900, color='blue', linestyle='--', linewidth=1, label='0.900 OPS')
plt.legend()
plt.show()

In [None]:
# 3. Box Plots: Statistical Distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.boxplot(x='is_hof', y='OPS', data=career_stats, ax=axes[0], palette='Set2')
axes[0].set_title('Distribution of Career OPS')

sns.boxplot(x='is_hof', y='ISO', data=career_stats, ax=axes[1], palette='Set2')
axes[1].set_title('Distribution of Isolated Power')

sns.boxplot(x='is_hof', y='BB_Rate', data=career_stats, ax=axes[2], palette='Set2')
axes[2].set_title('Distribution of Walk Rate')

plt.show()

## 5. Model Training (XGBoost)
Train the classifier on the expanded feature set, which adds rate metrics such as OPS and ISO to the career counting stats.

In [None]:
features = [
    'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'BB', 'SO', 
    'AVG', 'OBP', 'SLG', 'OPS', 'ISO', 'BB_Rate', 'K_Rate'
]
X = career_stats[features]
y = career_stats['is_hof']

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train XGBoost
model = xgb.XGBClassifier(eval_metric='logloss', n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("Model Trained Successfully.")
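
Inductees are likely a small minority of the filtered players, so class imbalance may matter here. XGBoost exposes a `scale_pos_weight` parameter for this, and a common starting value is the ratio of negative to positive training labels. A minimal sketch of that ratio, with invented labels; `pos_weight` is a hypothetical helper and the model call is shown only as a comment:

```python
import numpy as np


def pos_weight(y):
    """Ratio of negative to positive labels, a common starting value for scale_pos_weight."""
    y = np.asarray(y)
    n_pos = int((y == 1).sum())
    n_neg = int((y == 0).sum())
    if n_pos == 0:
        raise ValueError("No positive labels; the ratio is undefined.")
    return n_neg / n_pos


# Invented labels: 9 non-inductees for every inductee
toy_y = [0] * 9 + [1]
toy_weight = pos_weight(toy_y)
print(toy_weight)

# In the notebook this could be passed as (not run here):
# model = xgb.XGBClassifier(scale_pos_weight=pos_weight(y_train), eval_metric='logloss',
#                           n_estimators=200, random_state=42)
```

Treat the ratio as a starting point to tune against validation metrics, not as a fixed rule.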

## 6. Evaluation

In [None]:
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
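
With few positives, accuracy alone can look strong even for a model that never predicts induction. A hedged sketch of two complementary checks: the accuracy of always predicting the majority class, and average precision computed from predicted scores. The labels and scores are invented, and `majority_baseline_accuracy` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import average_precision_score


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the more frequent class."""
    positive_rate = float(np.mean(np.asarray(y_true) == 1))
    return max(positive_rate, 1.0 - positive_rate)


# Invented labels and scores; in the notebook, scores would come from
# model.predict_proba(X_test)[:, 1]
toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

baseline = majority_baseline_accuracy(toy_true)
toy_ap = average_precision_score(toy_true, toy_scores)
print(f"Majority baseline accuracy: {baseline:.2f}")
print(f"Average precision: {toy_ap:.3f}")
```

Comparing the model's accuracy with `majority_baseline_accuracy(y_test)` shows how much of the headline number comes from the class split itself.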

## 7. Misses & Predictions

In [None]:
test_df = X_test.copy()
test_df['Name'] = career_stats.loc[test_df.index, 'Name']
test_df['Actual'] = y_test
test_df['Predicted'] = y_pred

# Snubs (False Negatives)
snubs = test_df[(test_df['Actual'] == 1) & (test_df['Predicted'] == 0)]
print("Model 'Snubs' (Actual HOFers predicted as NO):")
display(snubs[['Name', 'H', 'HR', 'OPS']].head(5))

# Controversial (False Positives)
candidates = test_df[(test_df['Actual'] == 0) & (test_df['Predicted'] == 1)]
print("\nModel Candidates (Not HOF but predicted YES):")
display(candidates[['Name', 'H', 'HR', 'OPS']].head(5))

## 8. Model Interpretation with SHAP
Do the new stats like OPS dominate feature importance?

In [None]:
# Create the explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

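# 1. Summary Plot
# TreeExplainer's default output for this model is in log-odds units, not probability
# show=False defers rendering so the title can be attached to the SHAP figure
shap.summary_plot(shap_values, X_test, plot_type="dot", show=False)
plt.title("Feature Impact on HOF Prediction (log-odds)")
plt.show()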
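
To answer the question numerically as well as visually, features can be ranked by their mean absolute SHAP value. A minimal sketch on an invented attribution matrix; `rank_by_mean_abs` is a hypothetical helper, and the toy numbers are not model output:

```python
import numpy as np


def rank_by_mean_abs(values, feature_names):
    """Sort features by mean absolute attribution, largest first."""
    mean_abs = np.abs(np.asarray(values, dtype=float)).mean(axis=0)
    order = np.argsort(mean_abs)[::-1]
    return [(feature_names[i], float(mean_abs[i])) for i in order]


# Invented attribution matrix: 3 players x 3 features
toy_values = np.array([
    [0.50, -0.10, 0.00],
    [-0.30, 0.20, 0.10],
    [0.40, -0.30, 0.00],
])
toy_ranking = rank_by_mean_abs(toy_values, ["H", "OPS", "SB"])
for name, score in toy_ranking:
    print(f"{name}: {score:.3f}")
```

In the notebook, the equivalent call would be `rank_by_mean_abs(shap_values, list(X_test.columns))`, assuming `shap_values` is the 2-D array returned for a binary XGBoost model.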

### Explaining Single Predictions
Let's see which features drive the model's "yes" for one specific false-positive candidate.

In [None]:
if not candidates.empty:
    # Pick the first 'False Positive' from our list
    player_idx = candidates.index[0]
    player_name = candidates.loc[player_idx, 'Name']
    
    print(f"Explaining prediction for: {player_name}")
    
    pos_idx = X_test.index.get_loc(player_idx)
    
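    # expected_value can be a scalar or a length-1 array; waterfall needs a scalar base value
    base_value = float(np.ravel(explainer.expected_value)[0])

    shap.plots.waterfall(shap.Explanation(values=shap_values[pos_idx],
                                          base_values=base_value,
                                          data=X_test.iloc[pos_idx].values,
                                          feature_names=list(X_test.columns)))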
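
The waterfall values add up in log-odds space, assuming TreeExplainer's default raw output for a binary XGBoost classifier. To read the result as a probability, the base value plus all contributions can be passed through the logistic function. A minimal sketch with invented numbers; `attribution_to_probability` is a hypothetical helper:

```python
import numpy as np


def attribution_to_probability(base_value, contributions):
    """Map base value plus per-feature contributions (log-odds) to a probability."""
    log_odds = float(base_value) + float(np.sum(contributions))
    return 1.0 / (1.0 + np.exp(-log_odds))


# Invented numbers: a base value of -2.0 and contributions that sum to +2.0
toy_prob = attribution_to_probability(-2.0, [1.5, 0.5])
print(f"{toy_prob:.2f}")
```

Here the contributions exactly cancel the negative base value, so the log-odds are 0 and the probability is 0.50.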