# Base Stat Feature Engineering

Builds the core per-player rolling feature set from raw game-by-game rushing stats.  
Output: `base_stats_feature_engineering.csv` — consumed by the `train_test/` notebooks.

## Pipeline Overview

| Stage | Description |
|-------|-------------|
| **Data Ingestion** | Load per-game rushing stats for all players from S3 (seasons 2018–2025) |
| **Offensive Rolling Lookup** | Pre-compute leak-free rolling features per player |
| **Data Cleaning** | Parse dates, filter to RBs only |
| **Teammate Lookups** | Index starter / rusher presence by `(player, week, season)` |
| **Training Assembly** | Row-by-row join of player features + competition + injury-impact metrics |
| **Export** | Write final DataFrame to CSV |

## Feature Summary

| Feature Group | Windows | Description |
|---------------|---------|-------------|
| `rush_yards_{w}ma` | 1, 3, 5, 10 | Rolling mean rushing yards |
| `rush_attempts_{w}ma` | 1, 3, 5, 10 | Rolling mean carries |
| `ypc_{w}ma` | 1, 3, 5, 10 | Rolling mean yards per carry |
| `success_rate_{w}ma` | 1, 3, 5, 10 | Rolling mean success rate |
| `*_delta_3_5` / `*_delta_5_10` | 3 vs 5, 5 vs 10 | Momentum: short-window mean minus long-window mean (`success_rate` has `delta_3_5` only) |
| `rush_yards_vol_5`, `ypc_vol_5` | 5 | Rolling std — game-to-game consistency |
| `min/max_rush_yards_{3,5}ma` | 3, 5 | Rolling floor / ceiling for yards |
| `pct_of_carries_{w}ma` | 1, 3, 5 | Player's share of combined team carries |
| `others_rush_attempts_{w}ma` | 1, 3, 5 | Sum of active teammates' rolling carries |
| `others_been_injured_{w}ma` | 1, 3, 5 | Absent teammate workload (bucketed by weeks since last game) |
| `carries_before_injury_{w}ma` | 1, 3, 5 | Player's carry share vs injured teammate before absence |

> **Leak prevention:** all rolling windows apply `shift(1)` so the feature for game N reflects only games N−1, N−2, …
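
To make the leak-prevention rule concrete, here is a minimal sketch on made-up yardage for a single player. The numbers are hypothetical; only the `shift(1).rolling(...)` pattern comes from this notebook.

```python
import pandas as pd

# Hypothetical rushing yards for one player across five consecutive games.
yards = pd.Series([50, 80, 20, 100, 60], name="Yds")

# Leaky: the window that ends at game N includes game N itself.
leaky_3ma = yards.rolling(3, min_periods=1).mean()

# Leak-free: shift(1) pushes every value down one game, so the window
# that ends at game N only sees games N-1, N-2 and N-3.
safe_3ma = yards.shift(1).rolling(3, min_periods=1).mean()

demo = pd.DataFrame({"Yds": yards, "leaky_3ma": leaky_3ma, "safe_3ma": safe_3ma})
print(demo)
```

The first row of `safe_3ma` is `NaN` because a player's first game has no prior history, so downstream code has to tolerate missing values for debut games.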

In [None]:
# Third-party imports
import pandas as pd


## 1. Environment Setup & Data Ingestion

Resolve the repo root so the private `utils` module can be found, then load raw per-game rushing stats from S3 for seasons **2018–2025**.

In [None]:
# Resolve the repo root dynamically so the shared `utils` module can be imported
# regardless of where the notebook is run from within the project tree.
#
# NOTE: `utils` is a private module NOT included in this repository.
# It contains custom web-scraping functions and an S3 client wrapper used
# to fetch pre-scraped NFL data stored in a private S3 bucket. To run this
# notebook you will need to supply your own data source and adapt
# `utils.rush_yard_stats_from_s3()` accordingly.
import sys
from pathlib import Path
repo_root = Path.cwd().resolve().parents[3]
print(f"Adding {repo_root} to sys.path")
sys.path.append(str(repo_root))
import utils


Adding /home/mrmath/sports_betting_empire/sports_betting_empire to sys.path


In [None]:
# Load raw base rushing stats from S3 for seasons 2018-2025.
# Returns a DataFrame with per-game rushing stats for all players.
base_stats = utils.rush_yard_stats_from_s3("base_stats", 2018, 2025)


## 2. Offensive Rolling Feature Lookup

Pre-compute all rolling features per player and store them in `offense_rush_stats_LOOKUP` — a dict keyed by player name where each value is a DataFrame indexed by game date.

Building the lookup once here (rather than recalculating inside `generate_train_df`) keeps assembly fast: each feature row is a date filter on the player's frame followed by `iloc[-1]`.  
All windows use `shift(1)` so no same-game data bleeds into the feature row.
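
As an illustration of how such a lookup is read, the sketch below builds a tiny stand-in dict and pulls the latest feature row on or before a game date. The player name, dates, values, and the helper `features_as_of` are all hypothetical; the helper also assumes each frame is already sorted by `Date`.

```python
import pandas as pd

# Hypothetical miniature lookup: player name -> per-game feature DataFrame.
lookup = {
    "Player A": pd.DataFrame({
        "Date": pd.to_datetime(["2023-09-10", "2023-09-17", "2023-09-24"]),
        "rush_attempts_3ma": [float("nan"), 12.0, 15.0],
    }),
}

def features_as_of(lookup, player, game_date):
    """Return the player's latest feature row dated on or before game_date.

    Returns None when the player is unknown or has no eligible row.
    Assumes each frame is sorted by Date in ascending order.
    """
    history = lookup.get(player)
    if history is None:
        return None
    eligible = history[history["Date"] <= pd.Timestamp(game_date)]
    return None if eligible.empty else eligible.iloc[-1]

row = features_as_of(lookup, "Player A", "2023-09-17")
print(row["rush_attempts_3ma"])
```

Reading the row dated on the game itself is safe only because the stored values were shifted by one game when the frame was built.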

In [None]:
# Build a per-player lookup dictionary of rolling engineered features.
#
# For each player we compute, using a 1-game look-back (shift(1)) to prevent
# data leakage:
#   - Rolling means over 1, 3, 5, and 10 game windows for:
#       * Rush yards        (rush_yards_Xma)
#       * Rush attempts     (rush_attempts_Xma)
#       * Yards per carry   (ypc_Xma)
#       * Success rate      (success_rate_Xma)
#   - Short-vs-long window deltas to capture momentum / trend:
#       * rush_yards_delta_3_5, rush_yards_delta_5_10
#       * rush_attempts_delta_3_5, rush_attempts_delta_5_10
#       * ypc_delta_3_5, ypc_delta_5_10
#       * success_rate_delta_3_5
#   - Volatility (rolling std over 5 games) for yards and YPC
#   - Rolling min/max over 3 and 5 games for yards
#
# Result: offense_rush_stats_LOOKUP[player_name] -> DataFrame indexed by Date

offense_rush_stats_LOOKUP = {}
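# The outer loop walks teams; the inner loop builds each player's features from
# their full history across all teams, so a player who changed teams is simply
# recomputed with identical results.
for k, v in base_stats.sort_values(['Date']).groupby(['Team']):
    for i in v['Player'].unique():
        player_data = base_stats[base_stats['Player'] == i].sort_values(['Date'])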

        # --- Rolling means (shift(1) ensures no same-game leakage) ---
        rush_yards_1ma = player_data['Yds'].shift(1).rolling(1, min_periods=1).mean()
        rush_yards_3ma = player_data['Yds'].shift(1).rolling(3, min_periods=1).mean()
        rush_yards_5ma = player_data['Yds'].shift(1).rolling(5, min_periods=1).mean()
        rush_yards_10ma = player_data['Yds'].shift(1).rolling(10, min_periods=1).mean()

        rush_attempts_1ma = player_data['Att'].shift(1).rolling(1, min_periods=1).mean()
        rush_attempts_3ma = player_data['Att'].shift(1).rolling(3, min_periods=1).mean()
        rush_attempts_5ma = player_data['Att'].shift(1).rolling(5, min_periods=1).mean()
        rush_attempts_10ma = player_data['Att'].shift(1).rolling(10, min_periods=1).mean()

        ypc_1ma = (player_data['Yds'] / player_data['Att']).shift(1).rolling(1, min_periods=1).mean()
        ypc_3ma = (player_data['Yds'] / player_data['Att']).shift(1).rolling(3, min_periods=1).mean()
        ypc_5ma = (player_data['Yds'] / player_data['Att']).shift(1).rolling(5, min_periods=1).mean()
        ypc_10ma = (player_data['Yds'] / player_data['Att']).shift(1).rolling(10, min_periods=1).mean()

        success_rate_1ma = player_data['Succ%'].shift(1).rolling(1, min_periods=1).mean()
        success_rate_3ma = player_data['Succ%'].shift(1).rolling(3, min_periods=1).mean()
        success_rate_5ma = player_data['Succ%'].shift(1).rolling(5, min_periods=1).mean()
        success_rate_10ma = player_data['Succ%'].shift(1).rolling(10, min_periods=1).mean()

        # --- Rolling min/max for range context ---
        max_rush_yards_3ma = player_data['Yds'].shift(1).rolling(3, min_periods=1).max()
        max_rush_yards_5ma = player_data['Yds'].shift(1).rolling(5, min_periods=1).max()
        min_rush_yards_3ma = player_data['Yds'].shift(1).rolling(3, min_periods=1).min()
        min_rush_yards_5ma = player_data['Yds'].shift(1).rolling(5, min_periods=1).min()

        # --- Momentum / trend deltas (short window minus long window) ---
        rush_yards_delta_3_5 = rush_yards_3ma - rush_yards_5ma
        rush_yards_delta_5_10 = rush_yards_5ma - rush_yards_10ma

        rush_attempts_delta_3_5 = rush_attempts_3ma - rush_attempts_5ma
        rush_attempts_delta_5_10 = rush_attempts_5ma - rush_attempts_10ma

        ypc_delta_3_5 = ypc_3ma - ypc_5ma
        ypc_delta_5_10 = ypc_5ma - ypc_10ma

        success_rate_delta_3_5 = success_rate_3ma - success_rate_5ma

        # --- Volatility features: rolling std captures game-to-game consistency ---
        rush_yards_vol_5 = player_data['Yds'].shift(1).rolling(5, min_periods=2).std()
        ypc_vol_5 = (player_data['Yds'] / player_data['Att']).shift(1).rolling(5, min_periods=2).std()

        base_player_stats_ma = {
            'Date': pd.to_datetime(player_data['Date']),
            'rush_yards_1ma': rush_yards_1ma,
            'rush_yards_3ma': rush_yards_3ma,
            'rush_yards_5ma': rush_yards_5ma,
            'rush_yards_10ma': rush_yards_10ma,
            'rush_yards_delta_3_5': rush_yards_delta_3_5,
            'rush_yards_delta_5_10': rush_yards_delta_5_10,

            'rush_attempts_1ma': rush_attempts_1ma,
            'rush_attempts_3ma': rush_attempts_3ma,
            'rush_attempts_5ma': rush_attempts_5ma,
            'rush_attempts_10ma': rush_attempts_10ma,
            'rush_attempts_delta_3_5': rush_attempts_delta_3_5,
            'rush_attempts_delta_5_10': rush_attempts_delta_5_10,

            'ypc_1ma': ypc_1ma,
            'ypc_3ma': ypc_3ma,
            'ypc_5ma': ypc_5ma,
            'ypc_10ma': ypc_10ma,
            'ypc_delta_3_5': ypc_delta_3_5,
            'ypc_delta_5_10': ypc_delta_5_10,

            'success_rate_1ma': success_rate_1ma,
            'success_rate_3ma': success_rate_3ma,
            'success_rate_5ma': success_rate_5ma,
            'success_rate_10ma': success_rate_10ma,
            'success_rate_delta_3_5': success_rate_delta_3_5,

            'rush_yards_vol_5': rush_yards_vol_5,
            'ypc_vol_5': ypc_vol_5,

            'min_rush_yards_3ma': min_rush_yards_3ma,
            'min_rush_yards_5ma': min_rush_yards_5ma,
            'max_rush_yards_3ma': max_rush_yards_3ma,
            'max_rush_yards_5ma': max_rush_yards_5ma,
            'Pos.': player_data['Pos.'].iloc[0],
        }

        offense_rush_stats_LOOKUP[i] = pd.DataFrame(base_player_stats_ma)


## 3. Data Cleaning & Filtering

Parse the `Date` column to `datetime` and restrict the dataset to **running backs only** (`Pos. == 'RB'`).  
QBs and other positions that occasionally rush are excluded — the model targets RB workload specifically.

In [None]:
# Ensure the Date column is parsed as datetime for consistent sorting and filtering
base_stats['Date'] = pd.to_datetime(base_stats['Date'])


In [None]:
# Restrict the dataset to running backs only.
# QBs and other positions that occasionally rush are excluded to keep
# the model focused on RB workload prediction.
base_stats = base_stats[base_stats['Pos.'] == 'RB']


## 4. Competition & Injury Lookup Tables

Build two dicts with O(1) lookups, keyed by `(player, week, season)`:

| Dict | Contents |
|------|----------|
| `starter_lookup_by_week_season` | `True` for players marked as starters |
| `rusher_lookup_by_week_season` | `1` for every player who recorded a carry |

These are queried inside `generate_train_df` to efficiently determine whether a previously seen teammate is **active** in the current week or likely **injured** — without re-scanning the full DataFrame on every iteration.
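
A minimal sketch of the membership test these dicts enable is shown below. The entries and the helper `is_active` are hypothetical; real keys come from the `Player`, `Week`, and `season` columns.

```python
# Hypothetical entries standing in for rusher_lookup_by_week_season.
rusher_lookup = {
    ("Player A", 3, 2023): 1,
    ("Player B", 3, 2023): 1,
    ("Player A", 4, 2023): 1,
}

def is_active(player, week, season, lookup=rusher_lookup):
    """True if the player has a rushing row for that week and season."""
    return (player, week, season) in lookup

print(is_active("Player B", 3, 2023))  # Player B has a week 3 row
print(is_active("Player B", 4, 2023))  # no week 4 row, treated as absent
```

Absence from the dict is only a proxy for injury: a player with no carries in a given week for any other reason would look the same.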

In [None]:
# Build two O(1) lookup dictionaries keyed by (player, week, season):
#   starter_lookup_by_week_season  - maps key -> True if the player was a starter
#   rusher_lookup_by_week_season   - maps key -> 1 for every player who carried the ball
#
# These are used later in generate_train_df() to quickly determine whether a
# previously seen teammate is currently active (healthy) or likely injured.
starter_lookup_by_week_season = {}
rusher_lookup_by_week_season = {}
base_stats = base_stats.sort_values(['Date'])
# itertuples avoids repeated positional indexing of the same row.
for rusher in base_stats.itertuples(index=False):
    key = (rusher.Player, rusher.Week, rusher.season)
    if rusher.is_starter:
        starter_lookup_by_week_season[key] = rusher.is_starter
    rusher_lookup_by_week_season[key] = 1


## 5. Training Data Assembly — `generate_train_df()`

Iterates over every RB game row and assembles a feature-rich training record by combining four signal sources:

1. **Player rolling features** — look up the player's pre-game rolling stats from `offense_rush_stats_LOOKUP`
2. **Active teammate competition** — sum rolling carries of all other RBs who played the same game
3. **Injury-impact features** — identify teammates absent from the current week; bucket their recent workload by how many weeks they've been missing (1 week → `_1ma` bucket, 2–3 weeks → `_3ma`, 4–5 weeks → `_5ma`)
4. **Carry-share percentages** — player carries as a fraction of combined team carries across 1/3/5 windows

> Every feature is built from games played before the current one. Lookups that read the row dated on the game itself are still safe, because that row's rolling values were already shifted by one game.
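
The bucketing rule in item 3 and the share formula in item 4 can be sketched as two small helpers. The function names `bucket_for_absence` and `carry_share` are hypothetical and do not appear in the notebook; the thresholds mirror the ones described above.

```python
def bucket_for_absence(week_diff):
    """Map weeks since a teammate's last game to a bucket suffix, or None."""
    if week_diff > 5:
        return None   # absent too long to count
    if week_diff < 2:
        return "1ma"  # missing for one week
    if week_diff <= 3:
        return "3ma"  # missing for 2 or 3 weeks
    return "5ma"      # missing for 4 or 5 weeks

def carry_share(player_carries, other_carries):
    """Player's fraction of combined carries; 0 when nobody has carries."""
    denom = player_carries + other_carries
    return player_carries / denom if denom > 0 else 0

print(bucket_for_absence(3))
print(carry_share(15.0, 5.0))
```

Returning 0 for an empty denominator keeps the feature numeric, at the cost of making "no data" indistinguishable from "no share".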

In [None]:
# Assemble the final training DataFrame row-by-row.
#
# Each row represents a single player-game observation and includes:
#   1. Player's own rolling features (from offense_rush_stats_LOOKUP)
#   2. Active-teammate competition metrics (same-game carry share)
#   3. Injury-impact features (carry replacement load from recently absent teammates)
#   4. Derived carry-share percentages across 1/3/5 game windows
#
# All features are computed strictly from data prior to the current game
# to guarantee no data leakage into the training set.

def generate_train_df(rush_df):
    """
    Build training dataset for RB workload prediction.

    Key modeling ideas:
    - Capture teammate competition within same game
    - Model recency-weighted injury impact of other RBs
    - Estimate how carry share changes when injured RBs return
    - Use rolling moving averages (1/3/5 windows) to capture workload trends
    """

    rush_df = rush_df.sort_values("Date").copy()
    rush_df["game_date"] = rush_df["Date"].dt.date

    rows = []

    for row in rush_df.itertuples(index=False):

        player_key = row.Player
        team = row.Team
        game_date = row.game_date
        week = row.Week
        season = row.season

        # -------------------------------------------------
        # PLAYER ROLLING FEATURES
        # Look up pre-computed rolling stats for this player up to (and
        # including) the current game date, then take the most recent row.
        # -------------------------------------------------

        if player_key not in offense_rush_stats_LOOKUP:
            continue

        player_full_history = offense_rush_stats_LOOKUP[player_key]
        ps = player_full_history[player_full_history["Date"].dt.date <= game_date]

        if ps.empty:
            continue

        player_stats_on_date = ps.iloc[-1]

        # -------------------------------------------------
        # SAME-GAME TEAMMATES
        # Identify all other RBs who carried the ball in this game —
        # these directly compete for carries with the target player.
        # -------------------------------------------------

        same_game_teammates = rush_df[
            (rush_df["Team"] == team) &
            (rush_df["game_date"] == game_date) &
            (rush_df["Player"] != player_key)
        ]

        # -------------------------------------------------
        # PREVIOUS TEAMMATES THIS SEASON
        # Used to detect recently injured players: any RB who appeared
        # earlier in the season but is absent from this game's active roster.
        # De-duplicate to keep only each teammate's most recent appearance.
        # -------------------------------------------------

        prev_teammates = rush_df[
            (rush_df["Team"] == team) &
            (rush_df["game_date"] < game_date) &
            (rush_df["season"] == season) &
            (rush_df["Player"] != player_key)
        ].drop_duplicates(["Player", "season"], keep="last")

        # Accumulators for injury-impact features
        others_been_injured_1ma = 0
        others_been_injured_3ma = 0
        others_been_injured_5ma = 0

        # Accumulators for carry-share-before-injury features
        carries_before_injury_1ma = 0
        carries_before_injury_3ma = 0
        carries_before_injury_5ma = 0

        for teammate in prev_teammates.itertuples(index=False):

            key = teammate.Player
            last_active_week = teammate.Week

            # Skip teammate if they are active in the current game week
            if (key, week, season) in rusher_lookup_by_week_season:
                continue

            # Only consider teammates who went missing within the last 5 weeks
            # (beyond that, the injury is unlikely to still affect carry share).
            week_diff = week - last_active_week

            if week_diff > 5:
                continue

            stats_df = offense_rush_stats_LOOKUP.get(key)
            if stats_df is None:
                continue

            stats_df = stats_df[stats_df["Date"].dt.date < game_date]
            if stats_df.empty:
                continue

            last_val = stats_df.iloc[-1]["rush_attempts_5ma"]
            if pd.isna(last_val):
                continue

            # Bucket the absent teammate's average workload by how recently
            # they disappeared — closer absence = stronger carry-share impact.
            if week_diff < 2:
                others_been_injured_1ma += last_val
            elif 2 <= week_diff <= 3:
                others_been_injured_3ma += last_val
            elif 4 <= week_diff <= 5:
                others_been_injured_5ma += last_val

            # -------------------------------------------------
            # CARRY SHARE BEFORE INJURY
            # Compare the target player's rolling carries vs the now-absent
            # teammate's rolling carries at the teammate's second-to-last game
            # before the absence (iloc[-2]). The target player must have a row
            # on that same date, otherwise the teammate is skipped.
            # A high share indicates the target player was dominant even before
            # the teammate went down, reducing expected carry-share "bonus."
            # -------------------------------------------------

            if len(stats_df) < 2:
                continue  # Need at least two rows to access the pre-injury row

            pre_injury_row = stats_df.iloc[-2]
            pre_injury_date = pre_injury_row["Date"]

            player_hist = player_full_history[
                player_full_history["Date"] == pre_injury_date
            ]

            if player_hist.empty:
                continue

            player_pre = player_hist.iloc[-1]

            for window in [1, 3, 5]:

                col = f"rush_attempts_{window}ma"

                player_val = player_pre.get(col, 0)
                teammate_val = pre_injury_row.get(col, 0)

                denom = player_val + teammate_val
                if denom <= 0:
                    continue

                # Fraction of shared carries belonging to the target player
                share = player_val / denom

                if window == 1:
                    carries_before_injury_1ma += share
                elif window == 3:
                    carries_before_injury_3ma += share
                elif window == 5:
                    carries_before_injury_5ma += share

        # -------------------------------------------------
        # ACTIVE TEAMMATE ROLLING AVERAGES
        # Sum up each active same-game teammate's rolling carry averages to
        # quantify how much competition the target player faces in this game.
        # -------------------------------------------------

        other_stats = {
            "others_rush_attempts_1ma": 0,
            "others_rush_attempts_3ma": 0,
            "others_rush_attempts_5ma": 0,
        }

        for teammate in same_game_teammates.itertuples(index=False):

            key = teammate.Player
            stats_df = offense_rush_stats_LOOKUP.get(key)
            if stats_df is None:
                continue

            stats_df = stats_df[stats_df["Date"].dt.date <= game_date]
            if stats_df.empty:
                continue

            latest = stats_df.iloc[-1]

            for window in [1, 3, 5]:
                col = f"rush_attempts_{window}ma"
                val = latest[col]
                if pd.notna(val):
                    other_stats[f"others_rush_attempts_{window}ma"] += val

        # -------------------------------------------------
        # CARRY SHARE PERCENTAGES
        # Express the target player's rolling carries as a fraction of the
        # combined team rolling carries (player + active teammates).
        # -------------------------------------------------

        pct = {}

        for window in [1, 3, 5]:
            player_val = player_stats_on_date[f"rush_attempts_{window}ma"]
            other_val = other_stats[f"others_rush_attempts_{window}ma"]

            denom = player_val + other_val
            pct[f"pct_of_carries_{window}ma"] = (
                player_val / denom if denom > 0 else 0
            )

        # -------------------------------------------------
        # ASSEMBLE TRAINING ROW
        # Flatten all feature groups into a single dict and append to rows list.
        # -------------------------------------------------

        data_row = {
            "Player": player_key,
            "Team": team,
            "Date": game_date,
            "Att": row.Att,
            "Rush_yards": row.Yds,
            "Starter": row.is_starter,
            **player_stats_on_date.to_dict(),
            **other_stats,
            **pct,
            "others_been_injured_1ma": others_been_injured_1ma,
            "others_been_injured_3ma": others_been_injured_3ma,
            "others_been_injured_5ma": others_been_injured_5ma,
            "carries_before_injury_1ma": carries_before_injury_1ma,
            "carries_before_injury_3ma": carries_before_injury_3ma,
            "carries_before_injury_5ma": carries_before_injury_5ma,
        }

        rows.append(data_row)

    return pd.DataFrame(rows)


## 6. Generate Dataset & Export

Run `generate_train_df` over the full RB dataset, then write the result to `base_stats_feature_engineering.csv`.  
This file is the primary input for the training notebooks in `train_test/`.
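
When a downstream notebook reads the file back, dates arrive as text unless parsing is requested. The sketch below round-trips a tiny stand-in frame through a temporary file; the frame contents and the file name are hypothetical, and how the `train_test/` notebooks actually load the CSV is not shown in this notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame standing in for train_df.
demo_df = pd.DataFrame({
    "Player": ["Player A", "Player B"],
    "Date": pd.to_datetime(["2023-09-10", "2023-09-17"]),
    "rush_attempts_3ma": [float("nan"), 12.0],
})

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "demo_features.csv"  # demo name, not the real output
    demo_df.to_csv(out_path, index=False)

    # CSV stores dates as text, so ask pandas to parse them on the way in.
    reloaded = pd.read_csv(out_path, parse_dates=["Date"])

print(reloaded.dtypes)
print(reloaded["rush_attempts_3ma"].isna().sum())
```

Missing rolling values for debut games are written as empty cells and come back as `NaN`, so any imputation has to happen on the consumer side.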

In [None]:
# Generate the full training DataFrame by iterating over all RB game rows
# and attaching the engineered features defined above.
train_df = generate_train_df(base_stats)


In [None]:
# Persist the engineered feature set to disk.
# This CSV is consumed by the train/test notebooks in train_test/.
train_df.to_csv('base_stats_feature_engineering.csv', index=False)
