# Predicting Hall of Fame Inductees (Data Science Portfolio)

## Objective
Build an XGBoost classifier to predict Baseball Hall of Fame induction using locally stored Lahman Database CSV files, followed by a deeper interpretation using **SHAP** values.

## Data Source
We use three CSV files located in `../../data/baseball/`:
1.  **Batting.csv**: Season-level batting statistics (aggregated into career totals in Section 2).
2.  **HallOfFame.csv**: Ground truth labels for induction.
3.  **People.csv**: Player names and biographical details.


In [None]:
# Install dependencies (if needed)
!pip install pandas numpy xgboost scikit-learn matplotlib seaborn shap

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import os

plt.style.use('fivethirtyeight')
# Initialize JS for SHAP plots
shap.initjs()

## 1. Data Loading
Load the CSVs into Pandas DataFrames. This assumes the notebook runs from `assets/code/`, so the data directory `data/baseball/` sits two levels up.

In [None]:
# Define paths (Update this to './' if running in Colab with direct upload)
DATA_DIR = '../../data/baseball/'

try:
    df_batting = pd.read_csv(os.path.join(DATA_DIR, 'Batting.csv'))
    df_hof = pd.read_csv(os.path.join(DATA_DIR, 'HallOfFame.csv'))
    df_people = pd.read_csv(os.path.join(DATA_DIR, 'People.csv'))
    
    print("Data Loaded Successfully!")
    print(f"Batting: {df_batting.shape}")
    print(f"HOF: {df_hof.shape}")
    print(f"People: {df_people.shape}")
    
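except FileNotFoundError as err:
    # Report the failing path, then re-raise so later cells do not fail with a confusing NameError
    print(f"Error: CSV file not found ({err}). Please check the DATA_DIR path.")
    raise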
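
A missing file is easier to diagnose if the paths are checked before any CSV is read. A minimal sketch of such a pre-flight check, assuming the same three file names; the helper `missing_files` is a hypothetical name, not part of the original notebook, and the demo runs against a temporary directory rather than the real data folder:

```python
import tempfile
from pathlib import Path

# File names expected by this notebook.
REQUIRED_FILES = ("Batting.csv", "HallOfFame.csv", "People.csv")

def missing_files(data_dir, required=REQUIRED_FILES):
    """Hypothetical helper: return the names in `required` not found under `data_dir`."""
    base = Path(data_dir)
    return [name for name in required if not (base / name).is_file()]

# Demo: a temporary directory that contains only one of the three files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "Batting.csv").touch()
    demo_missing = missing_files(tmp)

print(demo_missing)
```

In the notebook this would be called as `missing_files(DATA_DIR)`; an empty list means all three files are in place.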

## 2. Preprocessing & Aggregation
We need to transform season-level batting stats into **Player Career Totals**.

In [None]:
# 1. Aggregate Batting Stats by PlayerID
cols_to_sum = ['G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'BB', 'SO']

# Ensure columns are numeric
for col in cols_to_sum:
    df_batting[col] = pd.to_numeric(df_batting[col], errors='coerce').fillna(0)

career_stats = df_batting.groupby('playerID')[cols_to_sum].sum().reset_index()

# 2. Calculate Career Batting Average
# Avoid division by zero
career_stats['AVG'] = np.where(career_stats['AB'] > 0, career_stats['H'] / career_stats['AB'], 0)

# 3. Merge with People to get Names
career_stats = career_stats.merge(df_people[['playerID', 'nameFirst', 'nameLast']], on='playerID', how='left')
career_stats['Name'] = career_stats['nameFirst'] + ' ' + career_stats['nameLast']

# Filter for significant careers (> 2000 ABs)
career_stats = career_stats[career_stats['AB'] > 2000]

print(f"Filtered Career Players: {len(career_stats)}")
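
Career AVG is computed from summed hits and at-bats rather than by averaging season averages, because a season with a handful of at-bats would otherwise count as much as a full season. A toy illustration with made-up numbers (the player ID and stats are hypothetical, not taken from the Lahman files):

```python
import numpy as np
import pandas as pd

# Hypothetical season rows for one player: a full season and a very short one.
seasons = pd.DataFrame({
    "playerID": ["toy01", "toy01"],
    "AB": [500, 10],
    "H": [150, 1],
})

# Career totals, mirroring the groupby/sum step used in this section.
totals = seasons.groupby("playerID")[["AB", "H"]].sum().reset_index()
totals["AVG"] = np.where(totals["AB"] > 0, totals["H"] / totals["AB"], 0)

# Naive alternative: the unweighted mean of per-season averages.
season_avg = seasons["H"] / seasons["AB"]
naive_avg = season_avg.groupby(seasons["playerID"]).mean()

career_avg = float(totals.loc[0, "AVG"])   # 151 hits / 510 at-bats
naive = float(naive_avg.loc["toy01"])      # mean of 0.300 and 0.100
print(round(career_avg, 3), round(naive, 3))
```

The two numbers differ noticeably in this toy case, which is why the totals-based calculation is the safer choice.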

## 3. Labeling (HOF Status)
Identify players who were inducted into the Hall of Fame using the `HallOfFame` table.

In [None]:
# Get list of inducted IDs (inducted = 'Y')
hof_inductees = df_hof[df_hof['inducted'] == 'Y']['playerID'].unique()

# Create Target Column
career_stats['is_hof'] = career_stats['playerID'].isin(hof_inductees).astype(int)

print(f"Total Inductees in Dataset: {career_stats['is_hof'].sum()}")
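
One caveat worth checking: the Lahman `HallOfFame` table also records inductions of managers, executives and umpires, distinguished by a `category` column, so `inducted == 'Y'` alone can label someone who was inducted for a non-playing role. A sketch of a stricter filter on toy rows; the IDs below are made up, and the `category` values should be verified against your copy of the CSV:

```python
import pandas as pd

# Toy stand-in for HallOfFame.csv (hypothetical IDs; column names follow the Lahman schema).
toy_hof = pd.DataFrame({
    "playerID": ["p1", "p1", "p2", "p3"],
    "inducted": ["N", "Y", "Y", "N"],
    "category": ["Player", "Player", "Manager", "Player"],
})
toy_careers = pd.DataFrame({"playerID": ["p1", "p2", "p3", "p4"]})

# Keep only inductions made in the 'Player' category.
as_player = (toy_hof["inducted"] == "Y") & (toy_hof["category"] == "Player")
player_inductees = toy_hof.loc[as_player, "playerID"].unique()

# Same labeling step as in the notebook, applied to the stricter list.
toy_careers["is_hof"] = toy_careers["playerID"].isin(player_inductees).astype(int)
print(toy_careers)
```

Here only `p1` is labeled 1: `p2` was inducted as a manager, `p3` was on a ballot but not inducted, and `p4` never appears in the table.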

## 4. Model Training (XGBoost)
Train an XGBoost classifier on the aggregated career stats, holding out a stratified 20% of players as a test set.

In [None]:
features = ['G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'BB', 'SO', 'AVG']
X = career_stats[features]
y = career_stats['is_hof']

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train XGBoost
# use_label_encoder is deprecated in recent XGBoost releases and is omitted here
model = xgb.XGBClassifier(eval_metric='logloss', n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("Model Trained Successfully.")
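
Hall of Famers are a small minority of the filtered players, so the classes are likely imbalanced. XGBoost's `scale_pos_weight` parameter is one common way to compensate, and its documentation suggests the ratio of negative to positive training examples as a starting value. A sketch of that calculation on a made-up label vector (the helper name is hypothetical):

```python
def suggested_scale_pos_weight(labels):
    """Ratio of negative (0) to positive (1) labels; 1.0 if there are no positives."""
    positives = sum(1 for v in labels if v == 1)
    negatives = sum(1 for v in labels if v == 0)
    return negatives / positives if positives else 1.0

# Hypothetical labels: 2 inductees among 20 players.
toy_labels = [1, 1] + [0] * 18
weight = suggested_scale_pos_weight(toy_labels)
print(weight)
```

With the notebook's variables this would be `suggested_scale_pos_weight(y_train)`, passed as `scale_pos_weight=...` to `xgb.XGBClassifier`. Whether it improves recall on this dataset has not been tested here.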

## 5. Evaluation

In [None]:
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Confusion Matrix
plt.figure(figsize=(6, 4))
sns.heatmap(confusion_matrix(y_test, y_pred), annot=True, fmt='d', cmap='Blues')
plt.title('Confusion Matrix')
plt.show()
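
Accuracy alone can look strong on this task simply because most players are not inducted, so precision and recall for the positive class are more informative. The arithmetic is sketched below using hypothetical confusion-matrix counts, not results from this model:

```python
def precision_recall(tp, fp, fn):
    """Precision and recall for the positive class from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts for 200 test players, 20 of them Hall of Famers.
tn, fp, fn, tp = 175, 5, 10, 10
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision, recall = precision_recall(tp, fp, fn)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

For a binary problem, `confusion_matrix(y_test, y_pred).ravel()` returns the counts in the same `tn, fp, fn, tp` order, so the real values can be dropped in directly.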

## 6. Misses & Predictions

In [None]:
test_df = X_test.copy()
test_df['Name'] = career_stats.loc[test_df.index, 'Name']
test_df['Actual'] = y_test
test_df['Predicted'] = y_pred

# Snubs (False Negatives)
snubs = test_df[(test_df['Actual'] == 1) & (test_df['Predicted'] == 0)]
print("Model 'Snubs' (Actual HOFers predicted as NO):")
display(snubs[['Name', 'H', 'HR', 'AVG']].head(5))

# Controversial (False Positives)
candidates = test_df[(test_df['Actual'] == 0) & (test_df['Predicted'] == 1)]
print("\nModel Candidates (Not HOF but predicted YES):")
display(candidates[['Name', 'H', 'HR', 'AVG']].head(5))

## 7. Model Interpretation with SHAP
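While XGBoost gives us "Feature Importance" as a single score per feature, **SHAP (SHapley Additive exPlanations)** shows the direction and size of each feature's effect on every individual prediction.

*   **Red dots** = High value of the feature (e.g., lots of Hits); **blue dots** = low value.
*   **Right side** (positive SHAP value) = Pushes the prediction toward HOF; the left side pushes it away.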

In [None]:
# Create the explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# 1. Summary Plot (The single most useful view)
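# show=False defers rendering so the title can be set on the SHAP figure
shap.summary_plot(shap_values, X_test, plot_type="dot", show=False)
plt.title("Feature Impact on HOF Prediction (SHAP value, log-odds)")
plt.show()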
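
For an XGBoost classifier, `TreeExplainer` by default explains the model's raw margin, so the SHAP values and `expected_value` are in log-odds units rather than probabilities. The base value plus a player's SHAP values gives that player's log-odds, and the logistic function converts it to a probability. A sketch with made-up numbers (the function name and values are hypothetical):

```python
import math

def margin_to_probability(base_value, shap_row):
    """Add the base value and per-feature SHAP values (log-odds), then apply the sigmoid."""
    margin = base_value + sum(shap_row)
    return 1.0 / (1.0 + math.exp(-margin))

# Hypothetical values: a baseline well below zero and three feature contributions.
toy_base = -2.0
toy_shap = [1.5, 0.75, -0.25]
toy_prob = margin_to_probability(toy_base, toy_shap)
print(round(toy_prob, 3))
```

With the notebook's objects, `margin_to_probability(explainer.expected_value, shap_values[i])` should closely match `model.predict_proba(X_test)[i, 1]`, which is a useful sanity check on the explanation.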

### Explaining Single Predictions
Let's zoom in on one of our "False Positives" (a player predicted to be in the HOF who isn't). Why did the model think they should be inducted?

In [None]:
if not candidates.empty:
    # Pick the first 'False Positive' from our list
    player_idx = candidates.index[0]
    player_name = candidates.loc[player_idx, 'Name']
    
    print(f"Explaining prediction for: {player_name}")
    
    # Get the location in X_test (integer index)
    # Note: candidates.index contains original DF indices. We need the positional index in X_test.
    pos_idx = X_test.index.get_loc(player_idx)
    
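    # Waterfall plot shows how each stat pushed the model output (in log-odds) up or down from the baseline
    # expected_value may be a scalar or a one-element array depending on the SHAP version
    base_value = float(np.ravel(explainer.expected_value)[0])
    shap.plots.waterfall(shap.Explanation(values=shap_values[pos_idx],
                                          base_values=base_value,
                                          data=X_test.iloc[pos_idx].values,
                                          feature_names=list(X_test.columns)))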