# Predicting Hall of Fame Inductees (Data Science Portfolio)

## Objective
Build an XGBoost classifier to predict Baseball Hall of Fame induction using locally stored Lahman Database CSV files. 

**Key enhancements:**
*   **Feature Engineering**: OPS, ISO, walk and strikeout rates, plus era context (career decade and years played).
*   **Bias Correction**: Filtering out active/recent players from training to avoid recency bias.
*   **Advanced Evaluation**: Using ROC-AUC and F1-Score for imbalanced class assessment.
*   **Interpretability**: SHAP analysis.

## Data Source
We use three CSV files located in `../../data/baseball/`:
1.  **Batting.csv**: Season-level batting statistics (aggregated into career totals in Section 2).
2.  **HallOfFame.csv**: Ground truth labels for induction.
3.  **People.csv**: Player names and biographical details.


In [None]:
# Install dependencies (if needed)
!pip install pandas numpy xgboost scikit-learn matplotlib seaborn shap

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve, f1_score, recall_score, precision_score
import matplotlib.pyplot as plt
import seaborn as sns
import os

plt.style.use('fivethirtyeight')
# Initialize JS for SHAP plots
shap.initjs()

## 1. Data Loading
Load the CSVs into Pandas DataFrames. We assume the notebook is running in `assets/code/` and data is in `data/baseball/` (2 levels up).

In [None]:
# Define paths (Update this to './' if running in Colab with direct upload)
DATA_DIR = '../../data/baseball/'

try:
    df_batting = pd.read_csv(os.path.join(DATA_DIR, 'Batting.csv'))
    df_hof = pd.read_csv(os.path.join(DATA_DIR, 'HallOfFame.csv'))
    df_people = pd.read_csv(os.path.join(DATA_DIR, 'People.csv'))
    
    print("Data Loaded Successfully!")
    print(f"Batting: {df_batting.shape}")
    print(f"HOF: {df_hof.shape}")
    print(f"People: {df_people.shape}")
    
except FileNotFoundError:
    print("Error: CSV files not found. Please check the DATA_DIR path.")
    raise  # Stop here: every later cell depends on these DataFrames

## 2. Preprocessing & Feature Engineering
We transform season-level stats into career totals, and then derive advanced sabermetric indicators.
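
As a worked illustration of the aggregation and rate formulas used in the next cell, the sketch below sums two made-up seasons into career totals and derives AVG, OBP, SLG, OPS and ISO in plain Python. The season numbers and the helper names `career_totals` and `rate_stats` are hypothetical and exist only for this example; the notebook itself does the same work with pandas.

```python
from collections import Counter

# Two hypothetical seasons for one player (made-up numbers, for illustration only)
seasons = [
    {"AB": 500, "H": 150, "2B": 30, "3B": 5, "HR": 20, "BB": 50, "HBP": 5, "SF": 5},
    {"AB": 500, "H": 140, "2B": 20, "3B": 5, "HR": 30, "BB": 60, "HBP": 5, "SF": 5},
]

def career_totals(rows):
    """Sum every counting stat across seasons (the groupby-sum step)."""
    totals = Counter()
    for row in rows:
        totals.update(row)  # Counter.update adds values key by key
    return dict(totals)

def rate_stats(t):
    """Derive rate stats from career totals, guarding against zero denominators."""
    obp_den = t["AB"] + t["BB"] + t["HBP"] + t["SF"]
    # Total bases: a hit is worth 1, with 1, 2 and 3 extra bases for 2B, 3B and HR
    total_bases = t["H"] + t["2B"] + 2 * t["3B"] + 3 * t["HR"]
    avg = t["H"] / t["AB"] if t["AB"] > 0 else 0.0
    obp = (t["H"] + t["BB"] + t["HBP"]) / obp_den if obp_den > 0 else 0.0
    slg = total_bases / t["AB"] if t["AB"] > 0 else 0.0
    return {"AVG": avg, "OBP": obp, "SLG": slg, "OPS": obp + slg, "ISO": slg - avg}

totals = career_totals(seasons)
stats = rate_stats(totals)
print({name: round(value, 3) for name, value in stats.items()})
```

Because the rates are computed from summed totals rather than averaged across seasons, long seasons weigh more than short ones, which matches how career rate stats are normally quoted.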

In [None]:
# 1. Aggregate Batting Stats by PlayerID
# Added HBP (Hit by Pitch) and SF (Sacrifice Fly) for accurate OBP calculation
cols_to_sum = ['G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'BB', 'SO', 'HBP', 'SF']

# Ensure columns are numeric
for col in cols_to_sum:
    if col in df_batting.columns:
        df_batting[col] = pd.to_numeric(df_batting[col], errors='coerce').fillna(0)

# We also need yearID to calculate Career Span and Decade
career_stats = df_batting.groupby('playerID').agg(
    {**{col: 'sum' for col in cols_to_sum}, 
     'yearID': ['min', 'max']}  # Get First and Last Year
).reset_index()

# Fix Column Names after Aggregation (Flatten MultiIndex)
new_cols = []
for col in career_stats.columns.values:
    if col[1] == 'sum':
        new_cols.append(col[0]) # Keep original name for sums (e.g. 'AB' not 'AB_sum')
    elif col[1]:
        new_cols.append(f"{col[0]}_{col[1]}") # Keep suffix for others (e.g. 'yearID_min')
    else:
        new_cols.append(col[0]) # Keep index name 'playerID'
        
career_stats.columns = new_cols

# Rename year columns for clarity
career_stats.rename(columns={'yearID_min': 'FirstYear', 'yearID_max': 'LastYear'}, inplace=True)

# 2. Calculate Basic Rates
career_stats['AVG'] = np.where(career_stats['AB'] > 0, career_stats['H'] / career_stats['AB'], 0)

# 3. ADVANCED FEATURE ENGINEERING

# -- ERA / DECADE FEATURES --
career_stats['YearsPlayed'] = career_stats['LastYear'] - career_stats['FirstYear'] + 1
# Calculate 'Primary Decade' (career midpoint rounded down to the start of its decade)
career_stats['Decade'] = (((career_stats['FirstYear'] + career_stats['LastYear']) / 2) // 10 * 10).astype(int)

# -- SABERMETRICS --
# Plate Appearances (PA), approximated by the OBP denominator (sacrifice hits are not counted)
career_stats['PA'] = career_stats['AB'] + career_stats['BB'] + career_stats['HBP'] + career_stats['SF']

# On-Base Percentage (OBP)
career_stats['OBP'] = np.where(career_stats['PA'] > 0, 
                               (career_stats['H'] + career_stats['BB'] + career_stats['HBP']) / career_stats['PA'], 0)

# Slugging Percentage (SLG)
# TB = H + 2B + 2*3B + 3*HR
total_bases = career_stats['H'] + career_stats['2B'] + (2 * career_stats['3B']) + (3 * career_stats['HR'])
career_stats['SLG'] = np.where(career_stats['AB'] > 0, total_bases / career_stats['AB'], 0)

# On-Base Plus Slugging (OPS)
career_stats['OPS'] = career_stats['OBP'] + career_stats['SLG']

# Isolated Power (ISO)
career_stats['ISO'] = career_stats['SLG'] - career_stats['AVG']

# Walk Rate and Strikeout Rate
career_stats['BB_Rate'] = np.where(career_stats['PA'] > 0, career_stats['BB'] / career_stats['PA'], 0)
career_stats['K_Rate']  = np.where(career_stats['PA'] > 0, career_stats['SO'] / career_stats['PA'], 0)

# 4. Merge with People to get Names
career_stats = career_stats.merge(df_people[['playerID', 'nameFirst', 'nameLast']], on='playerID', how='left')
career_stats['Name'] = career_stats['nameFirst'] + ' ' + career_stats['nameLast']

# Filter for significant careers (> 2000 ABs)
career_stats = career_stats[career_stats['AB'] > 2000]

print(f"All Players (Significant Careers): {len(career_stats)}")
career_stats[['Name', 'Decade', 'YearsPlayed', 'OPS']].head()

## 3. Labeling & Splitting
**CRITICAL STEP**: We must separate players who have completed their careers from those who are still active or recently retired.
*   **Training Data**: Players whose last season was before 2019 (treated as eligible for the HOF).
*   **Prediction Data**: Players who played in 2019 or later (Future candidates).
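
The split rule above can be sketched on a handful of rows. The players below are made up, and `split_by_cutoff` is a hypothetical helper used only here; the notebook applies the same comparison to the `LastYear` column with pandas.

```python
ELIGIBILITY_CUTOFF = 2019  # same cutoff year the notebook uses

# Hypothetical players (names and years are made up for illustration)
players = [
    {"name": "Player A", "LastYear": 1998},
    {"name": "Player B", "LastYear": 2018},
    {"name": "Player C", "LastYear": 2019},
    {"name": "Player D", "LastYear": 2023},
]

def split_by_cutoff(rows, cutoff):
    """Return (eligible, future): last season before the cutoff vs. at or after it."""
    eligible = [r for r in rows if r["LastYear"] < cutoff]
    future = [r for r in rows if r["LastYear"] >= cutoff]
    return eligible, future

eligible, future = split_by_cutoff(players, ELIGIBILITY_CUTOFF)
print("Training:  ", [r["name"] for r in eligible])
print("Prediction:", [r["name"] for r in future])
```

A player whose last season is exactly the cutoff year falls into the prediction set, and every player lands in exactly one of the two groups.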

In [None]:
# Get list of inducted IDs (inducted = 'Y')
hof_inductees = df_hof[df_hof['inducted'] == 'Y']['playerID'].unique()

# Create Target Column
career_stats['is_hof'] = career_stats['playerID'].isin(hof_inductees).astype(int)

# --- FILTERING STEP ---
ELIGIBILITY_CUTOFF = 2019

df_eligible = career_stats[career_stats['LastYear'] < ELIGIBILITY_CUTOFF].copy()
df_future = career_stats[career_stats['LastYear'] >= ELIGIBILITY_CUTOFF].copy()

print(f"Training Set (retired before {ELIGIBILITY_CUTOFF}): {len(df_eligible)} players")
print(f"Future Prediction Set (active/recent): {len(df_future)} players")

## 4. Exploratory Data Analysis (EDA)
Visualizing the statistical profile of Hall of Famers (using only the eligible training data).

In [None]:
# 1. Correlation Heatmap
plt.figure(figsize=(12, 10))
features_to_plot = ['H', 'HR', 'RBI', 'AVG', 'OBP', 'SLG', 'OPS', 'Decade', 'is_hof']
sns.heatmap(df_eligible[features_to_plot].corr(), annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation (Eligible Players Only)')
plt.show()

In [None]:
# 2. Scatter Plot: OPS vs. Hits
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_eligible, x='H', y='OPS', hue='is_hof', alpha=0.6, palette={0: 'gray', 1: 'gold'})
plt.title('Career Hits vs. OPS (Gold = HOF)')
plt.axvline(x=3000, color='red', linestyle='--', linewidth=1, label='3000 Hits')
plt.axhline(y=0.900, color='blue', linestyle='--', linewidth=1, label='0.900 OPS')
plt.legend()
plt.show()

In [None]:
# 3. Box Plots: Statistical Distribution
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.boxplot(x='is_hof', y='OPS', data=df_eligible, ax=axes[0], palette='Set2')
axes[0].set_title('Distribution of Career OPS')

sns.boxplot(x='is_hof', y='YearsPlayed', data=df_eligible, ax=axes[1], palette='Set2')
axes[1].set_title('Distribution of Years Played')

sns.boxplot(x='is_hof', y='Decade', data=df_eligible, ax=axes[2], palette='Set2')
axes[2].set_title('Distribution of Decade Played')

plt.show()

## 5. Model Training (XGBoost)
Train using only the `df_eligible` dataset.
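
The training cell passes `stratify=y` to `train_test_split`, which matters when inductees are rare. The sketch below is a simplified, hypothetical re-implementation of the idea in plain Python (it is not scikit-learn's algorithm): each class is shuffled and split separately so that the positive share is about the same in both parts. The label counts are made up.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so each class keeps roughly the same share in train and test."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)  # per-class test quota
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical labels: 10 inductees among 200 players (5% positive)
labels = [1] * 10 + [0] * 190
train_idx, test_idx = stratified_split(labels)

train_pos = sum(labels[i] for i in train_idx)
test_pos = sum(labels[i] for i in test_idx)
print(f"train: {train_pos}/{len(train_idx)} positive")
print(f"test:  {test_pos}/{len(test_idx)} positive")
```

Without stratification, a random 20% sample of a set this imbalanced could easily contain zero or one inductee, which would make recall and F1 on the test set very noisy.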

In [None]:
features = [
    'G', 'AB', 'R', 'H', '2B', '3B', 'HR', 'RBI', 'SB', 'BB', 'SO', 
    'AVG', 'OBP', 'SLG', 'OPS', 'ISO', 'BB_Rate', 'K_Rate',
    'YearsPlayed', 'Decade'
]
X = df_eligible[features]
y = df_eligible['is_hof']

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Train XGBoost
# Note: use_label_encoder is omitted because recent XGBoost releases no longer use it
model = xgb.XGBClassifier(eval_metric='logloss', n_estimators=200, random_state=42)
model.fit(X_train, y_train)

print("Model Trained Successfully on Eligible Players.")

## 6. Evaluation
Because very few players are inducted, **Accuracy** alone is misleading: a model that never predicts induction would still score high. 
We therefore also report **Precision**, **Recall** (Sensitivity), **F1-Score**, and **ROC-AUC** to measure how well we identify the rare HOF talent.
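
To make the point about accuracy concrete, here is a small sketch with made-up labels. `binary_metrics` is a hypothetical helper that computes the standard confusion-matrix formulas by hand (the notebook uses scikit-learn for this), and returning 0.0 for an undefined ratio is a choice made for this sketch.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for 0/1 labels, computed by hand."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # 0.0 when nothing is predicted positive
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical test set: 5 inductees among 100 players
y_true = [1] * 5 + [0] * 95

# Baseline that never predicts induction
always_no = binary_metrics(y_true, [0] * 100)

# Hypothetical model: finds 4 of the 5 inductees and raises 2 false alarms
model_like = binary_metrics(y_true, [1, 1, 1, 1, 0] + [1, 1] + [0] * 93)

print("always 'not HOF':", always_no)
print("model-like:      ", model_like)
```

The two accuracies differ by only two points, while recall and F1 separate the useless baseline from the useful predictor clearly.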

In [None]:
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

# Calculate Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_prob)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f} (Correctly identified HOFers)")
print(f"F1 Score:  {f1:.4f}")
print(f"ROC-AUC:   {roc_auc:.4f}")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# 1. Confusion Matrix (Normalized)
cm = confusion_matrix(y_test, y_pred, normalize='true')
sns.heatmap(cm, annot=True, fmt='.2%', cmap='Blues', ax=axes[0])
axes[0].set_title('Normalized Confusion Matrix')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_xticklabels(['Not HOF', 'HOF'])
axes[0].set_yticklabels(['Not HOF', 'HOF'])

# 2. ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
axes[1].plot(fpr, tpr, label=f'AUC = {roc_auc:.2f}')
axes[1].plot([0, 1], [0, 1], 'r--')
axes[1].set_title('ROC Curve')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].legend(loc='lower right')

plt.show()

## 7. Model Interpretation with SHAP
Do we still see a Decade bias now that we've removed recent players?

In [None]:
# Create the explainer
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# 1. Summary Plot
plt.title("Feature Impact on HOF Probability")
shap.summary_plot(shap_values, X_test, plot_type="dot")

## 8. PREDICTING THE FUTURE
Now for the fun part! We apply our trained model to the `df_future` dataset (Active/Recent players) to see who has the best shot.

In [None]:
X_future = df_future[features]

# Predict Probabilities
future_probs = model.predict_proba(X_future)[:, 1]

df_future['HOF_Probability'] = future_probs

# Display Top 20 Candidates
top_candidates = df_future.sort_values(by='HOF_Probability', ascending=False).head(20)

display(top_candidates[['Name', 'HOF_Probability', 'H', 'HR', 'OPS', 'YearsPlayed', 'Decade']])

### Case Study: Defensive Specialists (Hedges & Mathis)
This case study looks at **Austin Hedges** and **Jeff Mathis**. Both are known as elite defensive catchers with historically low batting metrics.

Let's see if the model (which only sees batting stats) gives them *any* chance, and use SHAP to explain why.
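
When reading the waterfall plots, it helps to know that, with its default settings, `shap.TreeExplainer` explains an XGBoost classifier's raw margin (log-odds) rather than the probability itself. The base value plus the sum of a player's SHAP values gives that player's log-odds, and the logistic function turns it into a probability. The sketch below uses made-up numbers, not output from the notebook's model.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical values for one player (not produced by the notebook's model)
base_value = -3.0  # average log-odds, the role played by explainer.expected_value
shap_contribs = {"H": -1.2, "HR": -0.8, "OPS": -1.5, "YearsPlayed": 0.5}

# SHAP values are additive in log-odds space
log_odds = base_value + sum(shap_contribs.values())
probability = sigmoid(log_odds)

print(f"log-odds = {log_odds:.2f}, probability = {probability:.4f}")
```

A feature with a SHAP value of -1.5 therefore does not subtract 1.5 percentage points; its effect on the probability depends on where the player already sits on the logistic curve.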

In [None]:
target_players = ['Austin Hedges', 'Jeff Mathis']

# Compute SHAP values for the FUTURE set once, outside the loop
# (earlier we only computed them for X_test)
shap_values_future = explainer.shap_values(X_future)

for player_name in target_players:
    # Find the player in the FUTURE dataset (since they are recent/active)
    player_row = df_future[df_future['Name'] == player_name]
    
    if not player_row.empty:
        # Loop in case several players share the name (unlikely for these two, but good practice)
        for idx in player_row.index:
            # Convert the DataFrame index label into a positional index in X_future
            pos_idx = X_future.index.get_loc(idx)

            print(f"\nSHAP Analysis for {player_name}:")
            print(f"HOF Probability: {player_row.loc[idx, 'HOF_Probability']:.4f}")
            
            shap.plots.waterfall(shap.Explanation(values=shap_values_future[pos_idx], 
                                                  base_values=explainer.expected_value, 
                                                  data=X_future.iloc[pos_idx], 
                                                  feature_names=features))
    else:
        print(f"Player {player_name} not found in the filtered dataset (Check spelling or AB threshold).")