# Rush Yards Regression Model — Train / Test

This notebook builds and evaluates an **XGBoost regression model** that predicts NFL player rushing yards per game.

### Pipeline Overview
1. **Data Loading** – read engineered feature CSVs (base stats, defense, play-by-play, YBC/YAC, spread/point-diff)
2. **Data Cleaning** – normalize player names to ensure consistent join keys
3. **Data Merging** – join all feature tables on player and team keys
4. **Exploratory Data Checks** – inspect the spread/point-diff table and its date coverage
5. **Feature Selection** – define training columns; examine correlations with the target
6. **Walk-Forward Training** – train on all prior seasons, test on each held-out season (2019–2025)
7. **Permutation Importance** – rank features by predictive contribution and prune weak ones
8. **Refined Model** – retrain with filtered features and evaluate final performance
9. **Model Export** – serialize the trained model to JSON

In [1]:
# Standard data manipulation libraries
import pandas as pd
import numpy as np

## 1. Data Loading

Read each engineered feature CSV produced by the feature-engineering notebooks.

In [2]:
# Add the repository root to sys.path so project-level utilities can be imported
import sys
from pathlib import Path

# parents[1] is two levels above the current working directory (its grandparent), which is the repo root here
repo_root = Path.cwd().resolve().parents[1]
print(f"Adding {repo_root} to sys.path")
sys.path.append(str(repo_root))

# Import shared utility functions defined at the repo level
import utils

Adding /home/mrmath/Documents/modeling2 to sys.path
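
As a quick illustration of how `parents` indexing works, the sketch below uses a hypothetical directory (not the actual layout of this repository). `parents[0]` is the immediate parent and `parents[1]` is one level above that.

```python
from pathlib import PurePosixPath

# Hypothetical working directory, used only to illustrate the indexing
cwd = PurePosixPath("/home/user/project/notebooks/rush_yards")

one_up = cwd.parents[0]  # immediate parent directory
two_up = cwd.parents[1]  # grandparent, which is what parents[1] selects

print(one_up)
print(two_up)
```

If the notebook is moved deeper or shallower in the tree, the index has to change with it.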


In [3]:
# Base player statistics (carries, rush yards, touchdowns, etc.) with rolling averages
base_stats_df = pd.read_csv('/home/mrmath/sports_betting_empire/sports_betting_empire/americanfootball_nfl/player_rush_yards/modeling/feature_engineering/base_stats_feature_engineering.csv')

# Opposing defense statistics (yards allowed, tackles, etc.) with rolling averages
defense_stats_df = pd.read_csv('/home/mrmath/sports_betting_empire/sports_betting_empire/americanfootball_nfl/player_rush_yards/modeling/feature_engineering/defense_stats_feature_engineering.csv')

# Play-by-play derived features (split carries, down/distance tendencies, etc.)
pbp_df = pd.read_csv('/home/mrmath/sports_betting_empire/sports_betting_empire/americanfootball_nfl/player_rush_yards/modeling/feature_engineering/play_by_play_feature_engineering.csv')

# Yards before contact / yards after contact features
ybc_yac_df = pd.read_csv('/home/mrmath/sports_betting_empire/sports_betting_empire/americanfootball_nfl/player_rush_yards/modeling/feature_engineering/ybc_yac_feature_engineering.csv')

In [4]:
# Game-level spread and point-differential features (team context for each matchup)
spread_point_diff_df = pd.read_csv('/home/mrmath/sports_betting_empire/sports_betting_empire/americanfootball_nfl/player_rush_yards/modeling/feature_engineering/point_diff_spread_train.csv')

# Build a team + date composite key used to join this table onto the player-level data
spread_point_diff_df['team_key'] = spread_point_diff_df['team'] + "_" + spread_point_diff_df['date'].astype(str)
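
The loading cells above repeat one long absolute directory five times. One possible tidy-up, sketched under the assumption that all five CSVs sit in a single folder, is to build each path from a base directory. `FEATURE_DIR`, `FEATURE_FILES`, and `feature_paths` are hypothetical names, and the directory value is a placeholder rather than the real location.

```python
from pathlib import Path

# Placeholder: point this at the folder that holds the engineered CSVs
FEATURE_DIR = Path("feature_engineering")

# File names as they appear in the loading cells above
FEATURE_FILES = {
    "base_stats": "base_stats_feature_engineering.csv",
    "defense_stats": "defense_stats_feature_engineering.csv",
    "pbp": "play_by_play_feature_engineering.csv",
    "ybc_yac": "ybc_yac_feature_engineering.csv",
    "spread_point_diff": "point_diff_spread_train.csv",
}

# Map each short name to its full path
feature_paths = {name: FEATURE_DIR / fname for name, fname in FEATURE_FILES.items()}

for name, path in feature_paths.items():
    # Each path could then be passed to pd.read_csv once FEATURE_DIR is real
    print(f"{name}: {path}")
```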

## 2. Data Cleaning

Normalize player names across datasets to ensure consistent join keys.  
Some data sources include generational suffixes (Jr., III, etc.) while others omit them.

In [5]:
import re

def clean_player_name(name: str) -> str:
    """
    Remove generational suffixes from player names.

    Why this matters:
    - Player name keys must be consistent across datasets
    - Some sources include suffixes (e.g., "Jr.", "III")
    - Others omit them
    - Removing them prevents join mismatches and duplicate identities

    Handles:
    - Jr, Jr.
    - Sr, Sr.
    - II, III, IV, V, VI
    - Case-insensitive
    - Extra whitespace
    """

    if not isinstance(name, str):
        return name

    # Normalize whitespace
    name = name.strip()

    # Regex to remove suffix at end of string
    # \b ensures we only match whole suffix tokens
    suffix_pattern = r"\b(JR|SR|II|III|IV|V|VI)\.?$"

    # Remove suffix (case-insensitive)
    cleaned = re.sub(suffix_pattern, "", name, flags=re.IGNORECASE)

    # Remove any leftover trailing spaces
    return cleaned.strip()
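
To see what the suffix pattern does and does not handle, here is a small self-contained sketch that applies the same regular expression to a few sample names. `strip_suffix` is a hypothetical stand-in for `clean_player_name`, and the sample names are illustrative only.

```python
import re

# Same pattern as in clean_player_name, compiled once
SUFFIX_PATTERN = re.compile(r"\b(JR|SR|II|III|IV|V|VI)\.?$", flags=re.IGNORECASE)

def strip_suffix(name: str) -> str:
    """Remove a trailing generational suffix and surrounding whitespace."""
    return SUFFIX_PATTERN.sub("", name.strip()).strip()

samples = [
    "Odell Beckham Jr.",
    "Melvin Gordon III",
    "  Mark Ingram II  ",
    "Derrick Henry",      # no suffix: returned unchanged
    "Smith, Jr.",         # comma before the suffix is NOT removed
]
cleaned = [strip_suffix(n) for n in samples]

for before, after in zip(samples, cleaned):
    print(f"{before!r} -> {after!r}")
```

The last sample shows a limitation: a comma preceding the suffix survives, so sources that write names that way would need an extra cleanup step.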


In [6]:
# Apply name normalization to each dataset so player names are comparable across sources
base_stats_df['clean_player_name'] = base_stats_df['Player'].apply(clean_player_name)
pbp_df['clean_player_name'] = pbp_df['player'].apply(clean_player_name)
ybc_yac_df['clean_player_name'] = ybc_yac_df['Player'].apply(clean_player_name)

In [7]:
# Create a player + game-date composite key for joining across datasets
# Format: "<CleanPlayerName>_<YYYY-MM-DD>" — uniquely identifies a player's performance in a given game
base_stats_df['player_key'] = base_stats_df['clean_player_name'] + "_" + base_stats_df['Date'].astype(str)
pbp_df['player_key'] = pbp_df['clean_player_name'] + "_" + pbp_df['Date'].astype(str)
ybc_yac_df['player_key'] = ybc_yac_df['clean_player_name'] + "_" + ybc_yac_df['Date'].astype(str)
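
String keys built this way only match if every source renders the date identically. As a precaution, and assuming the sources might store `Date` in different types, the sketch below normalizes dates to `YYYY-MM-DD` before concatenating. `build_player_key` and both toy frames are hypothetical.

```python
import pandas as pd

# Two toy sources: one stores dates as strings, the other as timestamps
source_a = pd.DataFrame({"clean_player_name": ["Example Player"],
                         "Date": ["2023-09-10"]})
source_b = pd.DataFrame({"clean_player_name": ["Example Player"],
                         "Date": [pd.Timestamp("2023-09-10 00:00:00")]})

def build_player_key(df: pd.DataFrame) -> pd.Series:
    """Build '<name>_<YYYY-MM-DD>' with an explicit date format."""
    date_str = pd.to_datetime(df["Date"]).dt.strftime("%Y-%m-%d")
    return df["clean_player_name"] + "_" + date_str

key_a = build_player_key(source_a)
key_b = build_player_key(source_b)
print(key_a.iloc[0], key_b.iloc[0])
```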

## 3. Data Merging

Join all feature tables into a single modeling DataFrame using player keys and team keys.  
Left joins preserve all rows from base_stats and fill missing features with NaN.

In [8]:
# Merge base stats with play-by-play features on the player key
# Left join keeps all base_stats rows; unmatched pbp rows are dropped
merged_df = pd.merge(base_stats_df, pbp_df, on='player_key', how='left', suffixes=('_base', '_pbp'))

# Merge in yards-before/after-contact features
merged_df = pd.merge(merged_df, ybc_yac_df, on='player_key', how='left', suffixes=('', '_ybc_yac'))

In [9]:
# Build a team + game-date composite key to join game-level context (spread, point-diff, defense)
# Format: "<TeamAbbreviation>_<YYYY-MM-DD>"
merged_df['team_key'] = merged_df['Team'] + "_" + merged_df['Date'].astype(str)

In [10]:
# Merge spread and point-differential features for each team/game combination
merged_df = pd.merge(merged_df, spread_point_diff_df, on='team_key', how='left', suffixes=('', '_spread_point_diff'))

In [11]:
# Build the defense team key and merge opposing defense stats into the final DataFrame
defense_stats_df['team_key'] = defense_stats_df['Team'] + "_" + defense_stats_df['Date'].astype(str)

# final_df is the complete, fully-joined dataset used for modeling
final_df = pd.merge(merged_df, defense_stats_df, on='team_key', how='left', suffixes=('', '_defense'))
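
Left joins on string keys fail silently when keys do not match, and they duplicate rows when the right-hand table repeats a key. A minimal sketch of how both could be checked, using toy frames with made-up values rather than the real tables:

```python
import pandas as pd

# Toy stand-ins for a base table and a feature table (values are made up)
left = pd.DataFrame({
    "player_key": ["A_2023-09-10", "B_2023-09-10", "C_2023-09-10"],
    "Rush_yards": [80, 45, 12],
})
right = pd.DataFrame({
    "player_key": ["A_2023-09-10", "B_2023-09-10"],
    "ybc_per_att": [2.1, 1.4],
})

# indicator=True adds a `_merge` column; validate raises if right keys repeat
checked = pd.merge(left, right, on="player_key", how="left",
                   indicator=True, validate="many_to_one")

n_unmatched = int((checked["_merge"] == "left_only").sum())
print(checked["_merge"].value_counts())
print(f"Rows without a match in the right table: {n_unmatched}")
```

A high unmatched count on the real data would point to a name or date formatting mismatch between sources.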

## 4. Exploratory Data Checks

Quick sanity checks on the spread / point-differential table before modeling.

In [12]:
# Inspect the spread / point-differential table to verify contents and column names
spread_point_diff_df

Unnamed: 0,spread,team,date,point_diff_3_ma,point_diff_5_ma,point_diff_3_sum,point_diff_5_sum,point_scored_3_ma,point_scored_5_ma,points_allowed_3_ma,points_allowed_5_ma,opp_point_diff_3_ma,opp_point_diff_5_ma,opp_point_diff_3_sum,opp_point_diff_5_sum,opp_point_scored_3_ma,opp_point_scored_5_ma,opp_points_allowed_3_ma,opp_points_allowed_5_ma,team_key
0,-1.0,PHI,2018-09-06,,,,,,,,,,,,,,,,,PHI_2018-09-06
1,1.0,ATL,2018-09-06,,,,,,,,,,,,,,,,,ATL_2018-09-06
2,-1.0,CIN,2018-09-09,,,,,,,,,,,,,,,,,CIN_2018-09-09
3,-1.0,TEN,2018-09-09,,,,,,,,,,,,,,,,,TEN_2018-09-09
4,-3.5,LAC,2018-09-09,,,,,,,,,,,,,,,,,LAC_2018-09-09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4249,8.0,NYJ,2026-01-04,-27.666667,-20.8,-83.0,-104.0,12.000000,14.6,39.666667,35.4,2.000000,6.0,6.0,30.0,23.333333,27.0,21.333333,21.0,NYJ_2026-01-04
4250,10.5,GNB,2026-01-04,-10.333333,-3.4,-31.0,-17.0,22.000000,25.0,32.333333,28.4,8.000000,5.8,24.0,29.0,24.333333,20.8,16.333333,15.0,GNB_2026-01-04
4251,8.0,CLE,2026-01-04,-8.000000,-8.8,-24.0,-44.0,12.000000,14.6,20.000000,23.4,7.666667,7.2,23.0,36.0,27.333333,29.6,19.666667,22.4,CLE_2026-01-04
4252,14.5,LAC,2026-01-04,5.333333,7.2,16.0,36.0,22.000000,23.8,16.666667,16.6,0.333333,1.8,1.0,9.0,24.666667,25.0,24.333333,23.2,LAC_2026-01-04


In [13]:
# Check the earliest date in the spread/point-diff dataset to understand historical coverage
spread_point_diff_df['date'].min()

'2018-09-06'

## 5. Feature Selection

Define the initial set of training columns using rolling-average and delta features,  
then examine their linear correlation with the target variable (`Rush_yards`).

In [14]:
# Collect all columns whose names contain rolling-average or delta suffixes:
#   - delta_3_5: difference between 3-game and 5-game rolling averages (momentum signal)
#   - *3ma / *5ma: 3-game and 5-game moving averages
#   - *g_ma: general game-level moving averages
train_cols = [col for col in final_df.columns if 'delta_3_5' in col or col.endswith('3ma') or col.endswith('5ma') or col.endswith('g_ma')] + ['Starter']

# Injury-context features
train_cols += ['others_been_injured_1ma']   # whether teammates were recently injured
train_cols += ['carries_before_injury_1ma'] # player's recent carry load before an injury

# Game-context features
train_cols += ['spread']  # Vegas spread for the game (proxy for expected game flow)

# Team and opponent point-differential / scoring momentum features
train_cols += ['point_diff_3_ma', 'point_diff_5_ma',
       'point_diff_3_sum', 'point_diff_5_sum', 'point_scored_3_ma',
       'point_scored_5_ma', 'points_allowed_3_ma', 'points_allowed_5_ma',
       'opp_point_diff_3_ma', 'opp_point_diff_5_ma', 'opp_point_diff_3_sum',
       'opp_point_diff_5_sum', 'opp_point_scored_3_ma',
       'opp_point_scored_5_ma', 'opp_points_allowed_3_ma',
       'opp_points_allowed_5_ma']
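
The name-based filter above can be checked on a handful of column names. The sketch below uses a small hypothetical list (`columns`) to show which names the suffix rules pick up, and why the `point_diff_*` columns have to be added explicitly.

```python
# Hypothetical sample of column names, for illustration only
columns = [
    "rush_attempts_3ma",
    "rush_attempts_5ma",
    "ypc_delta_3_5",
    "team_ybc_per_att_5_g_ma",
    "point_diff_3_ma",   # ends in "3_ma", so none of the suffix rules match
    "Rush_yards",
    "Date",
]

# str.endswith accepts a tuple, which is equivalent to the chained `or` checks
suffixes = ("3ma", "5ma", "g_ma")
selected = [c for c in columns if "delta_3_5" in c or c.endswith(suffixes)]

print(selected)
```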

In [15]:
# Parse the Date column as datetime for reliable season assignment
final_df['Date'] = pd.to_datetime(final_df['Date'])

# Assign an NFL season year to each game:
#   - Games in March or later belong to the current calendar year's season
#   - Games in January/February (post-season) belong to the prior year's season
final_df['season'] = final_df['Date'].apply(lambda x: x.year if x.month >= 3 else x.year - 1)
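
The season rule can be verified on a few dates. This sketch applies the same logic in vectorized form with `Series.where`; the dates are arbitrary examples.

```python
import pandas as pd

# Arbitrary example dates: regular season, January, February, next September
dates = pd.to_datetime(pd.Series(
    ["2023-09-10", "2024-01-07", "2024-02-11", "2024-09-08"]
))

# Keep the calendar year for March onward, otherwise use the previous year
season = dates.dt.year.where(dates.dt.month >= 3, dates.dt.year - 1)

print(pd.DataFrame({"date": dates, "season": season}))
```

January and February games map to the prior year, so a playoff run stays in the same season as the regular-season games that preceded it.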

In [16]:
# Compute Pearson correlation of each training feature with the target (Rush_yards)
# Sorted descending so the most positively correlated features appear first
final_df[train_cols + ['Rush_yards']].corr()['Rush_yards'].sort_values(ascending=False)

Rush_yards                  1.000000
pct_of_carries_3ma          0.649969
pct_of_carries_5ma          0.647841
rush_attempts_3ma           0.637186
rush_attempts_5ma           0.637117
                              ...   
opp_point_diff_5_ma        -0.051169
rushes_one_to_two_5ma      -0.054466
spread                     -0.068975
others_rush_attempts_5ma   -0.501672
others_rush_attempts_3ma   -0.503975
Name: Rush_yards, Length: 166, dtype: float64

## 6. Walk-Forward Model Training (Initial Feature Set)

Train an XGBoost regressor using a **walk-forward validation** strategy:
- For each season from 2019–2025, the model trains on **all prior seasons** and tests on the **current season**.
- Only rows where the player averaged ≥1 carry over the last three games (`rush_attempts_3ma >= 1`) or is listed as a starter are included.
- Evaluation metrics reported per season: MAE, RMSE, and R².
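
The splitting scheme described above can be sketched without any model. `walk_forward_splits` is a hypothetical helper and the toy frame contains made-up values; the point is only that each test season sees a training set made of strictly earlier seasons.

```python
import pandas as pd

def walk_forward_splits(df, first_test_season, last_test_season, season_col="season"):
    """Yield (season, train, test) with train restricted to earlier seasons."""
    for season in range(first_test_season, last_test_season + 1):
        train = df[df[season_col] < season]
        test = df[df[season_col] == season]
        yield season, train, test

# Toy data: two rows in 2018, three in 2019, one in 2020
toy = pd.DataFrame({
    "season": [2018, 2018, 2019, 2019, 2019, 2020],
    "Rush_yards": [50, 10, 70, 30, 5, 90],
})

split_sizes = {}
for season, train, test in walk_forward_splits(toy, 2019, 2020):
    split_sizes[season] = (len(train), len(test))
    print(f"Season {season}: train rows = {len(train)}, test rows = {len(test)}")
```

The training set grows every season, which is one reason metrics from early and late seasons are not directly comparable.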

In [17]:
import xgboost as xgb
from sklearn.metrics import r2_score

# Initialize XGBoost regressor with fixed hyperparameters
# n_estimators=100: number of boosting rounds
# max_depth=5: limits tree depth to reduce overfitting
# learning_rate=0.1: shrinkage step size
# random_state=42: ensures reproducibility
# n_jobs=1: restricts to a single thread for deterministic behavior
reg_model = xgb.XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42, n_jobs=1)

all_preds = []  # accumulates per-season prediction DataFrames for later analysis
final_df = final_df.sort_values('Date')  # Ensure data is sorted chronologically for walk-forward validation
for season in range(2019, 2026):  # Loop over seasons from 2019 to 2025 inclusive
    # Walk-forward split: train on every season before the current one
    train_data = final_df[final_df['season'] < season]
    test_data = final_df[final_df['season'] == season]

    # Filter to meaningful rush-volume players: starters or those with recent carry history
    train_data = train_data[(train_data['rush_attempts_3ma'] >= 1) | (train_data['Starter'] == 1)]
    test_data = test_data[(test_data['rush_attempts_3ma'] >= 1) | (test_data['Starter'] == 1)]

    X_train = train_data[train_cols]
    y_train = train_data['Rush_yards']
    X_test = test_data[train_cols]
    y_test = test_data['Rush_yards']

    # Fit the model on historical data and generate predictions for the held-out season
    reg_model.fit(X_train, y_train)
    predictions = reg_model.predict(X_test)

    # Attach predictions and actuals to the test set for downstream analysis
    test_data = test_data.copy()
    test_data.loc[:, 'predicted_rush_yards'] = predictions
    test_data.loc[:, 'actual_rush_yards'] = y_test
    all_preds.append(test_data)

    # Report per-season error metrics
    print(
        f"Season {season} - "
        f"MAE: {np.mean(np.abs(predictions - y_test)):.2f}, "
        f"RMSE: {np.sqrt(np.mean((predictions - y_test) ** 2)):.2f}, "
        f"R^2: {r2_score(y_test, predictions):.2f}"
    )

Season 2019 - MAE: 22.64, RMSE: 31.32, R^2: 0.32
Season 2020 - MAE: 21.27, RMSE: 30.08, R^2: 0.35
Season 2021 - MAE: 19.81, RMSE: 27.98, R^2: 0.40
Season 2022 - MAE: 20.28, RMSE: 28.80, R^2: 0.41
Season 2023 - MAE: 18.76, RMSE: 26.63, R^2: 0.41
Season 2024 - MAE: 20.01, RMSE: 28.13, R^2: 0.45
Season 2025 - MAE: 19.68, RMSE: 28.69, R^2: 0.42
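
For reference, the three reported metrics can be written out directly in NumPy. The sketch below uses four made-up values; `regression_metrics` is a hypothetical helper, and its R² follows the usual definition of one minus the ratio of residual to total sum of squares, which is what `r2_score` computes.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    ss_res = np.sum(err ** 2)                          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Made-up actuals and predictions
mae, rmse, r2 = regression_metrics([10, 20, 30, 40], [12, 18, 33, 37])
print(f"MAE: {mae:.2f}, RMSE: {rmse:.2f}, R^2: {r2:.3f}")
```

RMSE penalizes large misses more heavily than MAE, so a gap between the two, as in the output above, suggests a tail of games with large errors.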


In [18]:
# Concatenate per-season prediction DataFrames into one unified results table
concat_df = pd.concat(all_preds)

In [19]:
# Check the Pearson correlation between predicted and actual rush yards across all test seasons
# A higher correlation (closer to 1.0) indicates the model captures the right directional signals
concat_df[['predicted_rush_yards', 'actual_rush_yards']].corr()

Unnamed: 0,predicted_rush_yards,actual_rush_yards
predicted_rush_yards,1.0,0.630351
actual_rush_yards,0.630351,1.0


## 7. Permutation Feature Importance

Rank features by how much model performance degrades when each feature's values are randomly shuffled.  
Features with near-zero or negative importance are candidates for removal in the refined training pass.
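
The shuffling procedure can be sketched by hand to make the mechanics concrete. This is a simplified NumPy illustration on synthetic data with a least-squares fit, not the scikit-learn implementation; `manual_permutation_importance` and `r2` are hypothetical helpers.

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def manual_permutation_importance(predict, X, y, n_repeats=5, seed=42):
    """Mean drop in R^2 when each column is shuffled, one column at a time."""
    rng = np.random.default_rng(seed)
    baseline = r2(y, predict(X))
    importances = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_shuffled = X.copy()
            # Break the link between column j and the target
            X_shuffled[:, j] = rng.permutation(X_shuffled[:, j])
            drops.append(baseline - r2(y, predict(X_shuffled)))
        importances[j] = np.mean(drops)
    return importances

# Synthetic data: column 0 drives the target, column 1 is pure noise
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + data_rng.normal(scale=0.1, size=500)

# Simple least-squares fit standing in for the trained model
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
importances = manual_permutation_importance(lambda data: data @ coef, X, y)
print(importances)
```

The informative column shows a large drop while the noise column stays near zero, which is the pattern the pruning step relies on.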

In [20]:
from sklearn.inspection import permutation_importance

# Compute permutation importance on the most recent test season (last iteration of the training loop)
# n_repeats=5: shuffle each feature 5 times and average the performance drop for stability
perm = permutation_importance(
    reg_model,
    X_test[train_cols],
    y_test,
    n_repeats=5,
    random_state=42,
)

# Sort features from most to least important
sorted_idx = perm.importances_mean.argsort()[::-1]

new_train_cols = []  # will hold features that pass the importance threshold
for i in sorted_idx:
    print(f"{train_cols[i]}: {perm.importances_mean[i]:.4f}")
    # Keep only features with strictly positive mean importance (drops zero and negative ones)
    if perm.importances_mean[i] > 0:
        new_train_cols.append(train_cols[i])

Starter: 0.0634
pct_of_carries_3ma: 0.0405
pct_of_carries_5ma: 0.0241
rush_yards_5ma: 0.0130
team_up the middle_diff_3ma: 0.0077
min_rush_yards_3ma: 0.0065
team_total_diff_3ma: 0.0047
opponent_rushes_six_plus_3ma: 0.0045
team_rushes_six_plus_3ma: 0.0042
others_rush_attempts_5ma: 0.0036
others_rush_attempts_3ma: 0.0032
opponent_right guard_diff_5ma: 0.0030
opponent_left tackle_diff_3ma: 0.0029
team_rushes_less_than_eq_zero_3ma: 0.0029
min_rush_yards_5ma: 0.0027
spread: 0.0026
team_rushes_forty_plus_5ma: 0.0026
opponent_rushes_three_to_five_3ma: 0.0025
team_right tackle_diff_3ma: 0.0023
team_brk_tkl_per_att_5_g_ma: 0.0021
opponent_rushes_one_to_two_3ma: 0.0020
rushes_twenty_plus_5ma: 0.0019
rushes_one_to_two_5ma: 0.0018
ypc_5ma: 0.0017
right tackle_diff_3ma: 0.0017
opponent_rushes_twenty_plus_5ma: 0.0015
opp_point_diff_5_ma: 0.0015
left tackle_diff_3ma: 0.0014
opponent_rushes_ten_plus_5ma: 0.0014
team_ybc_per_att_5_g_ma: 0.0014
player_ybc_per_att_5_g_ma: 0.0013
rush_attempts_5ma: 0.0013


In [21]:
# Display the pruned feature list that will be used for the refined model
new_train_cols

['Starter',
 'pct_of_carries_3ma',
 'pct_of_carries_5ma',
 'rush_yards_5ma',
 'team_up the middle_diff_3ma',
 'min_rush_yards_3ma',
 'team_total_diff_3ma',
 'opponent_rushes_six_plus_3ma',
 'team_rushes_six_plus_3ma',
 'others_rush_attempts_5ma',
 'others_rush_attempts_3ma',
 'opponent_right guard_diff_5ma',
 'opponent_left tackle_diff_3ma',
 'team_rushes_less_than_eq_zero_3ma',
 'min_rush_yards_5ma',
 'spread',
 'team_rushes_forty_plus_5ma',
 'opponent_rushes_three_to_five_3ma',
 'team_right tackle_diff_3ma',
 'team_brk_tkl_per_att_5_g_ma',
 'opponent_rushes_one_to_two_3ma',
 'rushes_twenty_plus_5ma',
 'rushes_one_to_two_5ma',
 'ypc_5ma',
 'right tackle_diff_3ma',
 'opponent_rushes_twenty_plus_5ma',
 'opp_point_diff_5_ma',
 'left tackle_diff_3ma',
 'opponent_rushes_ten_plus_5ma',
 'team_ybc_per_att_5_g_ma',
 'player_ybc_per_att_5_g_ma',
 'rush_attempts_5ma',
 'opponent_right guard_diff_3ma',
 'opp_brk_tkl_per_att_3_g_ma',
 'ypc_delta_3_5',
 'carries_before_injury_3ma',
 'rushes_six_pl

## 8. Refined Walk-Forward Training (Pruned Feature Set)

Retrain the same XGBoost regressor using only the features that passed the permutation importance filter.  
The data is also pre-sorted by date to ensure chronological ordering before splitting.

In [22]:
import xgboost as xgb
from sklearn.metrics import r2_score

# Sort by date for chronological ordering; the season filter below is what keeps future data out of training
final_df = final_df.sort_values('Date')

# Re-initialize the model with the same hyperparameters for a clean training run
reg_model = xgb.XGBRegressor(n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42, n_jobs=1)

all_preds = []  # accumulates per-season prediction DataFrames

for season in range(2019, 2026):
    # Walk-forward split: train on everything before the target season
    train_data = final_df[final_df['season'] < season]
    test_data = final_df[final_df['season'] == season]

    # Keep only players with meaningful rush history or starter status
    train_data = train_data[(train_data['rush_attempts_3ma'] >= 1) | (train_data['Starter'] == 1)]
    test_data = test_data[(test_data['rush_attempts_3ma'] >= 1) | (test_data['Starter'] == 1)]

    # Use the pruned feature set from permutation importance
    X_train = train_data[new_train_cols]
    y_train = train_data['Rush_yards']
    X_test = test_data[new_train_cols]
    y_test = test_data['Rush_yards']

    # Train and predict
    reg_model.fit(X_train, y_train)
    predictions = reg_model.predict(X_test)

    # Attach predictions and actuals for downstream analysis
    test_data = test_data.copy()
    test_data.loc[:, 'predicted_rush_yards'] = predictions
    test_data.loc[:, 'actual_rush_yards'] = y_test
    all_preds.append(test_data)

    # Report per-season performance metrics
    print(
        f"Season {season} - "
        f"MAE: {np.mean(np.abs(predictions - y_test)):.2f}, "
        f"RMSE: {np.sqrt(np.mean((predictions - y_test) ** 2)):.2f}, "
        f"R^2: {r2_score(y_test, predictions):.2f}"
    )

Season 2019 - MAE: 22.65, RMSE: 31.50, R^2: 0.31
Season 2020 - MAE: 21.24, RMSE: 30.06, R^2: 0.35
Season 2021 - MAE: 19.91, RMSE: 28.30, R^2: 0.39
Season 2022 - MAE: 20.38, RMSE: 29.21, R^2: 0.39
Season 2023 - MAE: 18.66, RMSE: 26.78, R^2: 0.40
Season 2024 - MAE: 19.80, RMSE: 27.82, R^2: 0.46
Season 2025 - MAE: 19.60, RMSE: 28.40, R^2: 0.43


In [23]:
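# Combine the per-season test predictions from the refined model into one table
all_pred_df = pd.concat(all_preds)

# Write the out-of-sample predictions to disk for downstream analysis
all_pred_df.to_csv('predictions_from_regression_model.csv', index=False)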
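
Once the predictions are in one table, per-season error can be recomputed with a `groupby` instead of reading it off the training log. A minimal sketch on made-up numbers, using the same column names as the predictions table:

```python
import pandas as pd

# Made-up predictions table with the same column names as above
preds = pd.DataFrame({
    "season": [2024, 2024, 2025, 2025],
    "predicted_rush_yards": [50.0, 20.0, 80.0, 10.0],
    "actual_rush_yards": [60.0, 15.0, 70.0, 30.0],
})

# Absolute error per row, then averaged within each season
abs_err = (preds["predicted_rush_yards"] - preds["actual_rush_yards"]).abs()
mae_by_season = abs_err.groupby(preds["season"]).mean()

print(mae_by_season)
```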

## 9. Save Model

Export the XGBoost regressor from the final walk-forward iteration (fit on all seasons before 2025 with the pruned feature set) to JSON for deployment or future inference.

In [24]:
# Serialize the most recently fitted model (last loop iteration) to a JSON file
# The file can be reloaded later with XGBoost's load_model
reg_model.save_model('rush_yard_regressor.json')