# Player-Enhanced Features Experiment

**Testing historical momentum + blowout + Haslametrics + player features**

Changes from 02_modeling.ipynb:
- ✅ Enhanced team stats (36 features vs 10 baseline)
- ✅ Momentum features (win streak, recent form, avg margin)
- ✅ Blowout tendency (large margin win/loss rates)
- ✅ Haslametrics offensive efficiency
- ✅ Team-specific home court advantage
- ✅ **Player features (14 features):**
  - Star player power (top 3 scorers PPG, efficiency)
  - Offensive balance (scoring distribution)
  - Bench depth (non-starter production)
  - Key player efficiency (AST/TO, rebounds, usage)

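Goal: test whether player-based features improve MAE. As the summary in Section 7 notes, the model here is still trained on the baseline features only.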

In [1]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Import our modules
from src import config
from src.elo import EloRatingSystem
from src.models import ImprovedSpreadModel
from src.utils import fetch_barttorvik_year
from sklearn.model_selection import TimeSeriesSplit
from sklearn.metrics import mean_absolute_error

print("Libraries loaded!")

Libraries loaded!


## 1. Load Real Historical Games

In [2]:
# Load historical game results
games = pd.read_csv(config.HISTORICAL_GAMES_FILE, parse_dates=['date'])

print(f"Loaded {len(games)} real games")
print(f"Date range: {games['date'].min()} to {games['date'].max()}")
print(f"Seasons: {sorted(games['season'].unique())}")
print(f"\nGames per season:")
print(games['season'].value_counts().sort_index())
print(f"\nMargin stats:")
print(f"  Mean: {games['margin'].mean():.2f}")
print(f"  Std: {games['margin'].std():.2f}")
print(f"  Median: {games['margin'].median():.2f}")

games.head()

Loaded 33746 real games
Date range: 2019-11-05 00:00:00 to 2025-03-08 00:00:00
Seasons: [np.int64(2020), np.int64(2021), np.int64(2022), np.int64(2023), np.int64(2024), np.int64(2025)]

Games per season:
season
2020    5747
2021    4338
2022    5661
2023    6250
2024    5798
2025    5952
Name: count, dtype: int64

Margin stats:
  Mean: 2.83
  Std: 18.70
  Median: 2.00


Unnamed: 0,date,home_team,away_team,home_score,away_score,neutral_site,season,margin
0,2019-11-05,Abilene Christian,Arlington Baptist,90.0,39.0,False,2020,51.0
1,2019-11-05,N.C. A&T,UNC Greensboro,50.0,83.0,True,2020,-33.0
2,2019-11-05,Nebraska,UC Riverside,47.0,66.0,False,2020,-19.0
3,2019-11-05,Nevada,Utah,74.0,79.0,False,2020,-5.0
4,2019-11-05,New Hampshire,Curry,93.0,29.0,False,2020,64.0


## 2. Initialize and Process Elo Ratings Chronologically

In [3]:
# Initialize Elo system using config
elo = EloRatingSystem(
    k_factor=config.ELO_CONFIG['k_factor'],
    hca=config.ELO_CONFIG['home_court_advantage'],
    carryover=config.ELO_CONFIG['season_carryover']
)

# Load conference mappings from config
elo.load_conference_mappings(config.CONFERENCE_MAPPINGS)

print("Elo system initialized with conference mappings from config")

Elo system initialized with conference mappings from config


In [4]:
# Process games chronologically to build Elo ratings
print("Processing games chronologically to build Elo history...")
print("This may take a minute...\n")

elo_snapshots = elo.process_games(
    games,
    date_col='date',
    home_col='home_team',
    away_col='away_team',
    home_score_col='home_score',
    away_score_col='away_score',
    neutral_col='neutral_site',
    season_col='season',
    save_snapshots=True
)

print(f"\n✓ Processed {len(elo_snapshots)} games")
print(f"✓ Tracked {len(elo.ratings)} team Elo ratings")

# Display top teams
print("\nTop 15 teams by current Elo:")
elo.get_rankings(top_n=15)

Processing games chronologically to build Elo history...
This may take a minute...




✓ Processed 33746 games
✓ Tracked 1213 team Elo ratings

Top 15 teams by current Elo:


Unnamed: 0,rank,team,elo,conference
0,1,Houston,2347.620519,Big 12
1,2,Florida,2282.359698,SEC
2,3,Michigan St.,2280.357765,Big Ten
3,4,Duke,2272.182835,ACC
4,5,St. John's (NY),2257.560185,Big East
5,6,Tennessee,2233.59624,SEC
6,7,Auburn,2227.623663,SEC
7,8,Saint Mary's (CA),2203.262061,WCC
8,9,Maryland,2191.847523,Big Ten
9,10,Alabama,2182.195751,SEC
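
For readers unfamiliar with the rating step, the sketch below shows a textbook Elo update with a home-court adjustment and a between-season regression toward the mean. It is not the `EloRatingSystem` code from `src.elo`, which this notebook does not show. The function names and the constants (`k_factor=20`, `hca=100`, `carryover=0.75`, `mean=1500`) are illustrative assumptions, not the values stored in `config.ELO_CONFIG`.

```python
def expected_home_win_prob(home_elo, away_elo, hca=100.0, neutral=False):
    """Logistic Elo expectation for the home team (hca is measured in Elo points)."""
    advantage = 0.0 if neutral else hca
    return 1.0 / (1.0 + 10 ** (-(home_elo + advantage - away_elo) / 400.0))


def update_elo(home_elo, away_elo, home_won, k_factor=20.0, hca=100.0, neutral=False):
    """Return (new_home_elo, new_away_elo) after one game; the update is zero-sum."""
    expected = expected_home_win_prob(home_elo, away_elo, hca, neutral)
    delta = k_factor * ((1.0 if home_won else 0.0) - expected)
    return home_elo + delta, away_elo - delta


def carry_over(elo, carryover=0.75, mean=1500.0):
    """Regress a rating toward the mean between seasons (assumed mean of 1500)."""
    return mean + carryover * (elo - mean)


# Equal teams on a neutral floor: the winner gains k_factor / 2.
home, away = update_elo(1500.0, 1500.0, home_won=True, neutral=True)
print(home, away)
```

Many Elo variants for spread prediction also scale the update by margin of victory; whether `EloRatingSystem` does so cannot be determined from this notebook.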


## 3. Merge Elo with Enhanced Team Stats

In [5]:
# Fetch team stats for training years using config and utils
all_stats = []
for year in config.TRAINING_YEARS:
    print(f"Fetching {year}...")
    df = fetch_barttorvik_year(year)
    df['season'] = year
    all_stats.append(df[['team', 'adjoe', 'adjde', 'season']])

team_stats = pd.concat(all_stats, ignore_index=True)
team_stats.columns = ['team', 'adj_oe', 'adj_de', 'season']
team_stats['adj_em'] = team_stats['adj_oe'] - team_stats['adj_de']

print(f"\nLoaded efficiency stats for {len(team_stats)} team-seasons")
team_stats.head()

Fetching 2020...
Fetching 2021...
Fetching 2022...
Fetching 2023...
Fetching 2024...
Fetching 2025...

Loaded efficiency stats for 2147 team-seasons


Unnamed: 0,team,adj_oe,adj_de,season,adj_em
0,B12,7.0,2.0,2020,5.0
1,B12,13.0,3.0,2020,10.0
2,WCC,1.0,41.0,2020,-40.0
3,A10,3.0,29.0,2020,-26.0
4,B10,11.0,11.0,2020,0.0
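
The preview above shows conference codes (`B12`, `WCC`) in the `team` column and single-digit values in `adj_oe` and `adj_de`, which may indicate that the selected columns are shifted relative to the fetched data. A minimal sanity check along the following lines would surface that before the merge. The helper name `check_team_stats`, the list of conference codes, and the 70 to 140 plausibility range for efficiencies (points per 100 possessions) are all assumptions made for illustration.

```python
import pandas as pd


def check_team_stats(df, eff_cols=("adj_oe", "adj_de"), low=70.0, high=140.0,
                     conference_codes=("B12", "B10", "WCC", "A10", "ACC", "SEC")):
    """Return a list of human-readable problems found in a team-stats frame."""
    problems = []
    looks_like_conf = df["team"].isin(conference_codes)
    if looks_like_conf.any():
        problems.append(f"{int(looks_like_conf.sum())} 'team' values look like conference codes")
    for col in eff_cols:
        # NaN values also fail `between`, so missing data is flagged here too.
        out_of_range = ~df[col].between(low, high)
        if out_of_range.any():
            problems.append(f"{int(out_of_range.sum())} '{col}' values outside [{low}, {high}]")
    dupes = df.duplicated(subset=["team", "season"])
    if dupes.any():
        problems.append(f"{int(dupes.sum())} duplicate (team, season) rows")
    return problems


# The five rows printed above, reproduced by hand.
sample = pd.DataFrame({
    "team": ["B12", "B12", "WCC", "A10", "B10"],
    "adj_oe": [7.0, 13.0, 1.0, 3.0, 11.0],
    "adj_de": [2.0, 3.0, 41.0, 29.0, 11.0],
    "season": [2020] * 5,
})
for problem in check_team_stats(sample):
    print(problem)
```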


## 4. Create Training Features from Real Games

In [6]:
# Merge Elo snapshots with team efficiency stats
print("Creating training features from real game data...")

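# Add efficiency stats to elo_snapshots by matching on team and season.
# NOTE: dt.year gives the calendar year, whereas games['season'] uses the
# season's ending year (the 2019-11-05 games are season 2020), so November and
# December games are keyed to the previous season here.
elo_snapshots['season'] = elo_snapshots['date'].dt.year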

# Merge home team stats
train_data = elo_snapshots.merge(
    team_stats,
    left_on=['home_team', 'season'],
    right_on=['team', 'season'],
    how='left',
    suffixes=('', '_home')
)
train_data = train_data.rename(columns={'adj_oe': 'home_adj_oe', 'adj_de': 'home_adj_de', 'adj_em': 'home_adj_em'})
train_data = train_data.drop(columns=['team'], errors='ignore')

# Merge away team stats
train_data = train_data.merge(
    team_stats,
    left_on=['away_team', 'season'],
    right_on=['team', 'season'],
    how='left',
    suffixes=('', '_away')
)
train_data = train_data.rename(columns={'adj_oe': 'away_adj_oe', 'adj_de': 'away_adj_de', 'adj_em': 'away_adj_em'})
train_data = train_data.drop(columns=['team'], errors='ignore')

# Calculate derived features
train_data['eff_diff'] = train_data['home_adj_em'] - train_data['away_adj_em']
train_data['elo_diff'] = train_data['home_elo_before'] - train_data['away_elo_before']

# Drop rows with missing efficiency data
train_data = train_data.dropna(subset=['home_adj_oe', 'away_adj_oe'])

print(f"✓ Created {len(train_data)} training samples from real games")
print(f"\nFeature columns:")
print([c for c in train_data.columns if 'adj' in c or 'elo' in c or 'diff' in c])

Creating training features from real game data...
✓ Created 8850 training samples from real games

Feature columns:
['home_elo_before', 'away_elo_before', 'home_elo_after', 'away_elo_after', 'home_adj_oe', 'home_adj_de', 'home_adj_em', 'away_adj_oe', 'away_adj_de', 'away_adj_em', 'eff_diff', 'elo_diff']
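
Only 8,850 of the 33,746 games survive the merge and `dropna`, so it is worth checking both the season key and the team-name matching. The sketch below derives an ending-year season label from the date and uses `merge(..., indicator=True)` to list home teams with no stats row. The helper `season_from_date`, the July rollover month, the toy frames, and the alternative spelling `St. John's` are hypothetical and chosen only to illustrate the technique.

```python
import pandas as pd


def season_from_date(dates, rollover_month=7):
    """Label each date in a datetime Series with the season's ending year."""
    return dates.dt.year + (dates.dt.month >= rollover_month).astype(int)


games_toy = pd.DataFrame({
    "date": pd.to_datetime(["2019-11-05", "2020-02-01", "2025-03-08"]),
    "home_team": ["Duke", "St. John's (NY)", "Duke"],
})
games_toy["season"] = season_from_date(games_toy["date"])

stats_toy = pd.DataFrame({
    "team": ["Duke", "Duke", "St. John's"],
    "season": [2020, 2025, 2020],
    "adj_em": [25.0, 30.0, 10.0],
})

merged = games_toy.merge(
    stats_toy,
    left_on=["home_team", "season"],
    right_on=["team", "season"],
    how="left",
    indicator=True,  # adds a `_merge` column: 'both' or 'left_only'
)

# Team names that found no stats row, most frequent first.
unmatched = merged.loc[merged["_merge"] == "left_only", "home_team"].value_counts()
match_rate = (merged["_merge"] == "both").mean()
print(f"match rate: {match_rate:.1%}")
print(unmatched)
```

Running the same diagnostic separately for `away_team` would show whether the dropped rows come mostly from naming differences or from teams that genuinely lack efficiency data.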


In [7]:
# Define features using config
feature_cols = config.BASELINE_FEATURES

X = train_data[feature_cols]
y = train_data['actual_margin']

print(f"X shape: {X.shape}")
print(f"y shape: {y.shape}")
print(f"Features: {feature_cols}")
print(f"\nTarget (actual_margin) stats:")
print(f"  Mean: {y.mean():.2f}")
print(f"  Std: {y.std():.2f}")
print(f"  Median: {y.median():.2f}")

X shape: (8850, 11)
y shape: (8850,)
Features: ['home_adj_oe', 'home_adj_de', 'home_adj_em', 'away_adj_oe', 'away_adj_de', 'away_adj_em', 'eff_diff', 'home_elo_before', 'away_elo_before', 'elo_diff', 'predicted_spread']

Target (actual_margin) stats:
  Mean: -0.21
  Std: 15.08
  Median: -1.00


## 5. Train Model on Real Data

In [8]:
# Train improved model using config parameters
print("Training ImprovedSpreadModel on REAL game data...\n")

model = ImprovedSpreadModel(
    ridge_alpha=config.MODEL_CONFIG['ridge_alpha'],
    lgbm_params={
        'n_estimators': config.MODEL_CONFIG['n_estimators'],
        'max_depth': config.MODEL_CONFIG['max_depth'],
        'learning_rate': config.MODEL_CONFIG['learning_rate'],
    },
    weights=(
        config.MODEL_CONFIG['ridge_weight'],
        config.MODEL_CONFIG['lgbm_weight']
    ),
    use_lgbm=True
)

model.fit(X, y)
print("✓ Model trained!\n")

# Component performance on the training rows (in-sample, so optimistic)
components = model.predict_components(X)
for name, preds in components.items():
    mae = np.abs(preds - y).mean()
    rmse = np.sqrt(((preds - y) ** 2).mean())
    print(f"{name:12} MAE={mae:.3f}, RMSE={rmse:.3f}")

Training ImprovedSpreadModel on REAL game data...



✓ Model trained!

ridge        MAE=5.913, RMSE=7.673
lgbm         MAE=4.148, RMSE=5.694
ensemble     MAE=4.601, RMSE=6.186
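
These figures are computed on the same rows the model was fit on, so they understate out-of-sample error; the cross-validation in the next cell is the fairer estimate. The ensemble landing between the ridge and LightGBM errors is what a fixed weighted average would produce. The sketch below assumes that `ImprovedSpreadModel` blends its components this way (the `weights` argument suggests it, but the implementation is not shown here); the 0.3/0.7 weights and the toy numbers are illustrative only.

```python
import numpy as np


def blend(ridge_pred, lgbm_pred, weights=(0.3, 0.7)):
    """Weighted average of two prediction vectors; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w[0] * np.asarray(ridge_pred) + w[1] * np.asarray(lgbm_pred)


def mae(pred, actual):
    """Mean absolute error."""
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(actual))))


def rmse(pred, actual):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((np.asarray(pred) - np.asarray(actual)) ** 2)))


actual = np.array([5.0, -3.0, 12.0, -8.0])
ridge = np.array([3.0, -1.0, 9.0, -4.0])
lgbm = np.array([4.5, -3.5, 11.0, -7.0])
ensemble = blend(ridge, lgbm)

for name, pred in [("ridge", ridge), ("lgbm", lgbm), ("ensemble", ensemble)]:
    print(f"{name:12} MAE={mae(pred, actual):.3f}, RMSE={rmse(pred, actual):.3f}")
```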


In [9]:
# Cross-validation on real data; the number of folds comes from config
print(f"\nRunning {config.CV_CONFIG['n_splits']}-fold time-series cross-validation...\n")

tscv = TimeSeriesSplit(n_splits=config.CV_CONFIG['n_splits'])
cv_results = {'ridge': [], 'ensemble': []}

for fold, (train_idx, val_idx) in enumerate(tscv.split(X)):
    X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
    y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
    
    fold_model = ImprovedSpreadModel(
        ridge_alpha=config.MODEL_CONFIG['ridge_alpha'],
        lgbm_params={
            'n_estimators': config.MODEL_CONFIG['n_estimators'],
            'max_depth': config.MODEL_CONFIG['max_depth'],
            'learning_rate': config.MODEL_CONFIG['learning_rate'],
        },
        weights=(
            config.MODEL_CONFIG['ridge_weight'],
            config.MODEL_CONFIG['lgbm_weight']
        )
    )
    fold_model.fit(X_train, y_train)
    
    preds = fold_model.predict(X_val)
    components = fold_model.predict_components(X_val)
    
    ridge_mae = np.abs(components['ridge'] - y_val).mean()
    ensemble_mae = np.abs(preds - y_val).mean()
    
    cv_results['ridge'].append(ridge_mae)
    cv_results['ensemble'].append(ensemble_mae)
    
    print(f"Fold {fold+1}: Ridge MAE={ridge_mae:.3f}, Ensemble MAE={ensemble_mae:.3f}")

print(f"\n{'='*60}")
print(f"Ridge CV MAE:    {np.mean(cv_results['ridge']):.3f} ± {np.std(cv_results['ridge']):.3f}")
print(f"Ensemble CV MAE: {np.mean(cv_results['ensemble']):.3f} ± {np.std(cv_results['ensemble']):.3f}")
print(f"{'='*60}")


Running 5-fold time-series cross-validation...

Fold 1: Ridge MAE=5.986, Ensemble MAE=5.881
Fold 2: Ridge MAE=6.013, Ensemble MAE=5.543
Fold 3: Ridge MAE=6.225, Ensemble MAE=5.728
Fold 4: Ridge MAE=6.186, Ensemble MAE=5.472
Fold 5: Ridge MAE=5.708, Ensemble MAE=5.051

Ridge CV MAE:    6.024 ± 0.183
Ensemble CV MAE: 5.535 ± 0.281
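
The ± values are the standard deviation of the per-fold MAE. `TimeSeriesSplit` splits by row position rather than by date, so the folds are only "past predicts future" if the rows are already in chronological order. The toy example below shows the expanding-window pattern it produces; the 12-row array is a stand-in, and whether `train_data` remains date-sorted after the merges above is an assumption worth verifying.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_games = 12  # toy stand-in for rows already sorted by date
X_toy = np.arange(n_games).reshape(-1, 1)

folds = list(TimeSeriesSplit(n_splits=3).split(X_toy))
for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    # Every training row precedes every validation row.
    assert train_idx.max() < val_idx.min()
    print(
        f"Fold {fold}: train rows 0-{train_idx.max()}, "
        f"validate rows {val_idx.min()}-{val_idx.max()}"
    )
```

Because the training window grows with each fold, later folds see more data, which may partly explain why fold 5 has the lowest error above.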


## 6. Generate 2026 Predictions with Player-Enhanced Features

In [10]:
# Load 2026 PLAYER-ENHANCED team stats and prediction template
team_stats_2026 = pd.read_csv(config.PROCESSED_DATA_DIR / 'team_stats_2025_26_player_enhanced.csv')
template = pd.read_csv(config.DATA_DIR.parent / config.SUBMISSION_TEMPLATE)
template = template.dropna(subset=['Home', 'Away'])

print(f"Teams for 2026: {len(team_stats_2026)}")
print(f"Games to predict: {len(template)}")
print(f"\nPlayer-enhanced features available ({len(team_stats_2026.columns)} total):")
print("Baseline features (10): off_efficiency, def_efficiency, elo_rating, etc.")
print("Historical features (11): win_streak, recent_form, avg_margin, blowout tendency")
print("Haslametrics (2): haslametrics_off_eff, haslametrics_rank")
print("Player features (14): star PPG, bench depth, offensive balance, efficiency")
print(f"\nSample columns: {team_stats_2026.columns[:5].tolist()}...")

Teams for 2026: 21
Games to predict: 78

Player-enhanced features available (37 total):
Baseline features (10): off_efficiency, def_efficiency, elo_rating, etc.
Historical features (11): win_streak, recent_form, avg_margin, blowout tendency
Haslametrics (2): haslametrics_off_eff, haslametrics_rank
Player features (14): star PPG, bench depth, offensive balance, efficiency

Sample columns: ['team', 'off_efficiency', 'def_efficiency', 'wins', 'losses']...


In [11]:
# Create prediction features
team_dict = team_stats_2026.set_index('team').to_dict('index')

pred_features = []
valid_indices = []

for idx, row in template.iterrows():
    home = row['Home']
    away = row['Away']
    
    if home not in team_dict or away not in team_dict:
        continue
    
    home_stats = team_dict[home]
    away_stats = team_dict[away]
    
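    # The 2026 file's off/def efficiency columns stand in for the adj_oe/adj_de
    # features used in training; a missing value silently defaults to 100.
    home_oe = home_stats.get('off_efficiency', 100)
    home_de = home_stats.get('def_efficiency', 100)
    away_oe = away_stats.get('off_efficiency', 100)
    away_de = away_stats.get('def_efficiency', 100)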
    
    features = {
        'home_adj_oe': home_oe,
        'home_adj_de': home_de,
        'home_adj_em': home_oe - home_de,
        'away_adj_oe': away_oe,
        'away_adj_de': away_de,
        'away_adj_em': away_oe - away_de,
        'eff_diff': (home_oe - home_de) - (away_oe - away_de),
        'home_elo_before': elo.get_rating(home),
        'away_elo_before': elo.get_rating(away),
        'elo_diff': elo.get_rating(home) - elo.get_rating(away),
        'predicted_spread': elo.predict_spread(home, away),
    }
    
    pred_features.append(features)
    valid_indices.append(idx)

X_pred = pd.DataFrame(pred_features)
print(f"✓ Created features for {len(X_pred)} games")

✓ Created features for 78 games
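
All 78 games matched in this run, but the loop above skips a game silently whenever a team name is missing from the stats file. A small check such as the following makes a naming mismatch visible before predicting. The helper `find_missing_teams` and the toy data (including the `Pitt` versus `Pittsburgh` spelling) are hypothetical.

```python
import pandas as pd


def find_missing_teams(template, known_teams, cols=("Home", "Away")):
    """Return the sorted team names in the template that have no stats row."""
    names = set(template[list(cols)].to_numpy().ravel())
    return sorted(names - set(known_teams))


template_toy = pd.DataFrame({
    "Home": ["Virginia", "Wake Forest", "Pitt"],
    "Away": ["Syracuse", "Louisville", "SMU"],
})
known = {"Virginia", "Syracuse", "Wake Forest", "Louisville", "Pittsburgh", "SMU"}

missing = find_missing_teams(template_toy, known)
print(missing)
```

In the notebook this could be called with `template` and `team_dict.keys()`, raising or logging when the returned list is not empty.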


In [12]:
# Generate predictions
predictions = model.predict(X_pred)
components = model.predict_components(X_pred)

results = template.copy()
for i, idx in enumerate(valid_indices):
    results.loc[idx, 'pt_spread'] = predictions[i]
    results.loc[idx, 'ridge_pred'] = components['ridge'][i]
    results.loc[idx, 'lgbm_pred'] = components['lgbm'][i]
    results.loc[idx, 'elo_spread'] = X_pred.iloc[i]['predicted_spread']

print("✓ Predictions generated!")
results[['Date', 'Away', 'Home', 'pt_spread', 'ridge_pred', 'lgbm_pred']].head(15)

✓ Predictions generated!


Unnamed: 0,Date,Away,Home,pt_spread,ridge_pred,lgbm_pred
0,2/7/2026,Syracuse,Virginia,14.299741,11.680078,15.422454
1,2/7/2026,Louisville,Wake Forest,-0.027151,-1.713405,0.695529
2,2/7/2026,Virginia Tech,NC State,9.953126,8.275951,10.671915
3,2/7/2026,Miami,Boston College,6.689936,3.426476,8.088562
4,2/7/2026,SMU,Pitt,-7.200907,-5.907042,-7.755421
5,2/7/2026,Florida State,Notre Dame,10.158979,12.486165,9.161614
6,2/7/2026,Duke,North Carolina,-0.653616,-2.361048,0.07814
7,2/7/2026,Clemson,California,-9.652513,-7.63559,-10.516909
8,2/7/2026,Georgia Tech,Stanford,10.363289,9.854389,10.581389
9,2/9/2026,NC State,Louisville,17.780719,18.865536,17.315798


## 7. Save Player-Enhanced Predictions

In [13]:
# Prepare submission using config team info
submission = results[['Date', 'Away', 'Home', 'pt_spread']].copy()
submission = submission.dropna(subset=['pt_spread'])

submission['team_name'] = ''
submission['team_member'] = ''
submission['team_email'] = ''

# Use team info from config
team_members = config.TEAM_INFO['members']
submission.loc[submission.index[0], 'team_name'] = config.TEAM_INFO['team_name']
for i, member in enumerate(team_members):
    if i < len(submission):
        submission.loc[submission.index[i], 'team_member'] = member['name']
        submission.loc[submission.index[i], 'team_email'] = member['email']

# Save to player-enhanced path for comparison
player_enhanced_output = config.PREDICTIONS_DIR / 'tsa_pt_spread_CMMT_2026_player_enhanced.csv'
submission.to_csv(player_enhanced_output, index=False)
print(f"✓ Saved: {player_enhanced_output}")

✓ Saved: /Users/calebhan/Documents/Coding/Personal/triangle-sports-analytics-26/notebooks/../data/predictions/tsa_pt_spread_CMMT_2026_player_enhanced.csv


In [14]:
# Final Summary
print("\n" + "="*60)
print("PLAYER-ENHANCED MODEL SUMMARY")
print("="*60)
print(f"Training: {len(train_data)} real games (2020-2025)")
print(f"Features: {len(feature_cols)} baseline features (for training)")
print(f"Predictions: {len(submission)} games\n")
print(f"Cross-Validation Results:")
print(f"  Ridge MAE:    {np.mean(cv_results['ridge']):.3f} ± {np.std(cv_results['ridge']):.3f}")
print(f"  Ensemble MAE: {np.mean(cv_results['ensemble']):.3f} ± {np.std(cv_results['ensemble']):.3f}")
print("\nNote: Player-enhanced features (36 total) only used in 2026 team stats:")
print("      - 10 baseline (off/def efficiency, elo)")
print("      - 11 historical (momentum, blowout tendency)")
print("      - 2 Haslametrics (offensive efficiency, rank)")
print("      - 14 player (star power, bench depth, balance, efficiency)")
print("      Training data still uses baseline features only.")
print("="*60)


PLAYER-ENHANCED MODEL SUMMARY
Training: 8850 real games (2020-2025)
Features: 11 baseline features (for training)
Predictions: 78 games

Cross-Validation Results:
  Ridge MAE:    6.024 ± 0.183
  Ensemble MAE: 5.535 ± 0.281

Note: Player-enhanced features (36 total) only used in 2026 team stats:
      - 10 baseline (off/def efficiency, elo)
      - 11 historical (momentum, blowout tendency)
      - 2 Haslametrics (offensive efficiency, rank)
      - 14 player (star power, bench depth, balance, efficiency)
      Training data still uses baseline features only.


## 8. Compare Baseline vs Player-Enhanced Predictions

In [15]:
# Load baseline predictions for comparison
baseline_pred = pd.read_csv(config.PREDICTION_OUTPUT_FILE)
player_enhanced_pred = pd.read_csv(player_enhanced_output)

# Compare predictions
comparison = baseline_pred[['Date', 'Away', 'Home', 'pt_spread']].copy()
comparison = comparison.rename(columns={'pt_spread': 'baseline_spread'})
comparison = comparison.merge(
    player_enhanced_pred[['Date', 'Away', 'Home', 'pt_spread']],
    on=['Date', 'Away', 'Home'],
    how='inner'
)
comparison = comparison.rename(columns={'pt_spread': 'player_enhanced_spread'})
comparison['difference'] = comparison['player_enhanced_spread'] - comparison['baseline_spread']

print(f"Comparing {len(comparison)} predictions:\n")
print(f"Mean difference: {comparison['difference'].mean():.3f}")
print(f"Std difference:  {comparison['difference'].std():.3f}")
print(f"Max |difference|: {comparison['difference'].abs().max():.3f}")
print(f"\nTop 10 games with the largest positive prediction changes:")
comparison.nlargest(10, 'difference')[['Date', 'Away', 'Home', 'baseline_spread', 'player_enhanced_spread', 'difference']]

Comparing 78 predictions:

Mean difference: -0.116
Std difference:  0.618
Max |difference|: 2.618

Top 10 games with the largest positive prediction changes:


Unnamed: 0,Date,Away,Home,baseline_spread,player_enhanced_spread,difference
10,2/10/2026,Virginia,Florida State,-9.146741,-7.34952,1.797221
53,2/28/2026,NC State,Notre Dame,5.488484,6.197955,0.709471
34,2/18/2026,Virginia,Georgia Tech,1.747998,2.404489,0.656491
67,3/4/2026,Stanford,Notre Dame,2.434483,3.073579,0.639096
72,3/7/2026,SMU,Florida State,-7.454485,-6.839745,0.61474
32,2/17/2026,Virginia Tech,Miami,5.044504,5.628988,0.584484
1,2/7/2026,Louisville,Wake Forest,-0.507913,-0.027151,0.480761
50,2/25/2026,SMU,California,0.591109,1.070458,0.479349
51,2/28/2026,Virginia,Duke,16.949908,17.419825,0.469917
11,2/10/2026,North Carolina,Miami,-7.841652,-7.371957,0.469694
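
`nlargest(10, 'difference')` ranks by signed value, so the table above omits large negative shifts. The reported maximum absolute difference of 2.618 exceeds the largest positive value shown (1.797), which means the single biggest change is a negative one that does not appear. A minimal sketch of ranking by magnitude instead, using a hypothetical four-row frame:

```python
import pandas as pd

comparison_toy = pd.DataFrame({
    "Home": ["A", "B", "C", "D"],
    "baseline_spread": [1.0, -2.0, 5.0, 0.5],
    "player_enhanced_spread": [1.5, -4.6, 5.1, 2.3],
})
comparison_toy["difference"] = (
    comparison_toy["player_enhanced_spread"] - comparison_toy["baseline_spread"]
)

# Rank by magnitude so large negative shifts are not hidden.
order = comparison_toy["difference"].abs().nlargest(3).index
biggest_changes = comparison_toy.loc[order]
print(biggest_changes)
```

Applied to the notebook's `comparison` frame, the same two lines with `nlargest(10)` would list the ten largest changes in either direction.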
