# Defense Feature Engineering

Engineers **opponent-adjusted defensive rushing features** for every NFL team, then joins them onto each RB game row to provide matchup context for the downstream regression and classification models.

### Input
| Source | Description |
|--------|-------------|
| S3 — `base_stats` (2018–2025) | Game-level RB stats: yards, attempts, YPC, success rate, opposing team |

### Output
`defense_stats_feature_engineering.csv` — one row per team per game with the following defensive features (all leak-free via `shift(1)`):

| Feature Group | Columns |
|---------------|---------|
| Yards allowed | `RB_rush_yards_allowed_{1,3,5}ma`, `min/max_rush_yards_allowed_{3,5}ma` |
| YPC allowed | `RB_ypc_allowed_{1,3,5}ma` |
| Offense quality adjustment | `defense_performance_relative_{1,3,5}ma` (built from `strength_of_offense_{1,3,5}ma`, which is computed as an intermediate but not written to the output) |
| Momentum deltas | `RB_rush_yards_allowed_delta_{3_5,1_3}`, `ypc_allowed_delta_{3_5,1_3}`, `defense_relative_delta_{3_5,1_3}` |
| Volatility | `RB_rush_yards_allowed_vol_5`, `ypc_allowed_vol_5` |

### Pipeline Overview
1. Load raw RB game-log from S3
2. Build per-player **offensive** rolling lookup (used for opponent-strength adjustment)
3. Aggregate RB production conceded by each team per game
4. Build per-team **defensive** rolling feature lookup (opponent-adjusted)
5. Join defensive features onto each game row and export to CSV

In [None]:
# Third-party imports
import pandas as pd


## 1. Environment Setup & Data Ingestion

Add the repo root to `sys.path`, import shared utilities, then load raw RB game-log data from S3.
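
Because `utils` is private (the next cell's comments say so), running the notebook elsewhere needs a stand-in loader. The sketch below is one hypothetical way to build `base_stats` from local CSV files. The function name `load_base_stats_from_csv` and the `base_stats_<year>.csv` naming scheme are assumptions, not part of the original project. The required columns are the ones the later cells read from `base_stats`.

```python
from pathlib import Path

import pandas as pd

# Columns the later cells read from `base_stats`.
REQUIRED_COLUMNS = ["Player", "Team", "Opp", "Date", "Pos.", "Yds", "Att", "Succ%"]


def load_base_stats_from_csv(data_dir, start_year, end_year):
    """Hypothetical replacement for utils.rush_yard_stats_from_s3().

    Assumes one file per season named base_stats_<year>.csv in `data_dir`.
    """
    frames = []
    for year in range(start_year, end_year + 1):
        path = Path(data_dir) / f"base_stats_{year}.csv"
        if path.exists():  # skip seasons that are not available locally
            frames.append(pd.read_csv(path, parse_dates=["Date"]))
    if not frames:
        raise FileNotFoundError(f"no base_stats_<year>.csv files in {data_dir}")

    stats = pd.concat(frames, ignore_index=True)
    missing = [c for c in REQUIRED_COLUMNS if c not in stats.columns]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    return stats
```

With such a loader in place, the call `utils.rush_yard_stats_from_s3("base_stats", 2018, 2025)` could be swapped for `load_base_stats_from_csv("data", 2018, 2025)`, where `"data"` is again a placeholder path.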

In [None]:
# Resolve the repo root dynamically so the shared `utils` module can be imported
# regardless of where the notebook is run from within the project tree.
#
# NOTE: `utils` is a private module NOT included in this repository.
# It contains custom web-scraping functions and an S3 client wrapper used
# to fetch pre-scraped NFL data stored in a private S3 bucket. To run this
# notebook you will need to supply your own data source and adapt
# `utils.rush_yard_stats_from_s3()` accordingly.
import sys
from pathlib import Path
repo_root = Path.cwd().resolve().parents[3]
print(f"Adding {repo_root} to sys.path")
sys.path.append(str(repo_root))
import utils


Adding /home/mrmath/sports_betting_empire/sports_betting_empire to sys.path


In [None]:
# Load raw base rushing stats from S3 for seasons 2018-2025.
# This data drives both the offense lookup and the defense allowed calculations below.
base_stats = utils.rush_yard_stats_from_s3("base_stats", 2018, 2025)


## 2. Offensive Rolling Feature Lookup

Build a per-player rolling feature lookup (`offense_rush_stats_LOOKUP`) from `base_stats`.  
This is used downstream to compute **`strength_of_offense`** — the aggregate rolling rushing output of the offense a defense faces — which context-adjusts raw yards-allowed figures.

For every player, the following 1/3/5/10-game rolling means are computed with `shift(1)` to prevent leakage:

| Metric | Columns |
|--------|---------|
| Rush yards | `rush_yards_{1,3,5,10}ma` |
| Attempts | `rush_attempts_{1,3,5,10}ma` |
| Yards per carry | `ypc_{1,3,5,10}ma` |
| Success rate | `success_rate_{1,3,5,10}ma` |
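
To see what the `shift(1)` buys, here is a minimal sketch on a made-up four-game series. It compares the shifted rolling mean used in this notebook with an unshifted one that would leak the current game into its own feature.

```python
import pandas as pd

# Toy per-game rushing yards for one player, oldest game first (made-up values).
yards = pd.Series([50, 70, 90, 110], name="Yds")

# shift(1) drops the current game from its own window; min_periods=1 lets
# the average start as soon as one prior game exists.
leak_free_3ma = yards.shift(1).rolling(3, min_periods=1).mean()

# Without the shift, each window includes the game being predicted.
leaky_3ma = yards.rolling(3, min_periods=1).mean()

comparison = pd.DataFrame(
    {"Yds": yards, "leak_free_3ma": leak_free_3ma, "leaky_3ma": leaky_3ma}
)
print(comparison)
```

For the fourth game the leak-free value is 70 (the mean of the three earlier games), while the unshifted version gives 90 because it includes the fourth game itself. The first game has no history, so its leak-free value is NaN.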

In [None]:
# Build a per-player lookup of rolling offensive rushing features.
#
# This is needed here to calculate `strength_of_offense` — the aggregate
# rolling output of the offense a defense faced — which is used to
# context-adjust raw yards-allowed figures (see defense lookup below).
#
# All windows use shift(1) to prevent same-game data leakage.
# Computes 1/3/5/10-game rolling means for rush yards, attempts, YPC,
# and success rate for every player in base_stats.

offense_rush_stats_LOOKUP = {}
for k, v in base_stats.sort_values(['Date']).groupby(['Team']):
    for i in v['Player'].unique():
        player_data = base_stats[base_stats['Player'] == i].sort_values(['Date'])

        # --- Rolling means (shift(1) ensures no same-game leakage) ---
        rush_yards_1ma = player_data['Yds'].shift(1).rolling(1, min_periods=1).mean()
        rush_yards_3ma = player_data['Yds'].shift(1).rolling(3, min_periods=1).mean()
        rush_yards_5ma = player_data['Yds'].shift(1).rolling(5, min_periods=1).mean()
        rush_yards_10ma = player_data['Yds'].shift(1).rolling(10, min_periods=1).mean()

        rush_attempts_1ma = player_data['Att'].shift(1).rolling(1, min_periods=1).mean()
        rush_attempts_3ma = player_data['Att'].shift(1).rolling(3, min_periods=1).mean()
        rush_attempts_5ma = player_data['Att'].shift(1).rolling(5, min_periods=1).mean()
        rush_attempts_10ma = player_data['Att'].shift(1).rolling(10, min_periods=1).mean()

        # Per-game yards per carry; zero-attempt games become NaN instead of inf
        ypc = player_data['Yds'] / player_data['Att'].replace(0, float('nan'))
        ypc_1ma = ypc.shift(1).rolling(1, min_periods=1).mean()
        ypc_3ma = ypc.shift(1).rolling(3, min_periods=1).mean()
        ypc_5ma = ypc.shift(1).rolling(5, min_periods=1).mean()
        ypc_10ma = ypc.shift(1).rolling(10, min_periods=1).mean()

        success_rate_1ma = player_data['Succ%'].shift(1).rolling(1, min_periods=1).mean()
        success_rate_3ma = player_data['Succ%'].shift(1).rolling(3, min_periods=1).mean()
        success_rate_5ma = player_data['Succ%'].shift(1).rolling(5, min_periods=1).mean()
        success_rate_10ma = player_data['Succ%'].shift(1).rolling(10, min_periods=1).mean()

        base_player_stats_ma = {
            'Date': pd.to_datetime(player_data['Date']),
            'rush_yards_1ma': rush_yards_1ma,
            'rush_yards_3ma': rush_yards_3ma,
            'rush_yards_5ma': rush_yards_5ma,
            'rush_yards_10ma': rush_yards_10ma,

            'rush_attempts_1ma': rush_attempts_1ma,
            'rush_attempts_3ma': rush_attempts_3ma,
            'rush_attempts_5ma': rush_attempts_5ma,
            'rush_attempts_10ma': rush_attempts_10ma,

            'ypc_1ma': ypc_1ma,
            'ypc_3ma': ypc_3ma,
            'ypc_5ma': ypc_5ma,
            'ypc_10ma': ypc_10ma,

            'success_rate_1ma': success_rate_1ma,
            'success_rate_3ma': success_rate_3ma,
            'success_rate_5ma': success_rate_5ma,
            'success_rate_10ma': success_rate_10ma,

            'Pos.': player_data['Pos.'].iloc[0],
        }

        offense_rush_stats_LOOKUP[i] = pd.DataFrame(base_player_stats_ma)


## 3. Per-Game Defensive Summary

Derive what each team's defense allowed in every game by aggregating **offensive** RB production from the opposing team's perspective.

For each `(Team, Date)` pair in `base_stats`, where `Team` is the offense:
- Sum the RB rush yards and attempts, which are what `Opp` conceded → `RB_rush_yards_allowed`, `RB_rush_attempts_allowed`
- Compute yards per carry allowed → `RB_ypc_allowed`
- Sum the 5-game rolling rush-yard average of every player on that offense (the loop runs over all of `base_stats`, not only RBs) → **`strength_of_offense`** (quality-of-opponent adjustment)

> Each row's `Opp` column names the defending team, so the per-`(Team, Date)` totals are regrouped by `Opp` in the next step to get the defense's view.
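
The aggregation can be illustrated on a tiny made-up game log. This is a sketch of the same `groupby`/`agg` pattern, not real data: two offenses play each other once, and the non-RB row is filtered out before summing.

```python
import numpy as np
import pandas as pd

# Toy game log: two offenses, one game (all values are made up).
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Team":   ["ARI", "ARI", "WAS", "WAS"],
    "Opp":    ["WAS", "WAS", "ARI", "ARI"],
    "Date":   pd.to_datetime(["2018-09-09"] * 4),
    "Pos.":   ["RB", "RB", "RB", "QB"],
    "Yds":    [40, 20, 90, 15],
    "Att":    [8, 4, 20, 3],
})

rb_only = toy[toy["Pos."] == "RB"]
summary = (
    rb_only.groupby(["Team", "Date"], as_index=False)
    .agg(
        RB_rush_yards_allowed=("Yds", "sum"),
        RB_rush_attempts_allowed=("Att", "sum"),
        Opp=("Opp", "first"),
    )
)
# Replace zero attempts with NaN so the division never produces inf.
summary["RB_ypc_allowed"] = (
    summary["RB_rush_yards_allowed"]
    / summary["RB_rush_attempts_allowed"].replace(0, np.nan)
)

# Re-key by the defending team to read the totals as "allowed by".
allowed_by_defense = summary.set_index("Opp")["RB_rush_yards_allowed"]
print(allowed_by_defense)
```

Here ARI's RBs gain 60 yards, so the 60 is attributed to WAS as the defense, and the QB's 15 yards are excluded from ARI's total allowed of 90.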

In [None]:
# Compute per-game defensive rushing summaries from the offensive RB data.
#
# The core idea: a team's "defense stats" are derived by looking at what RBs
# gained against them. For each (defending team, game date) we aggregate:
#   - RB_rush_yards_allowed   : total RB yards gained against this defense
#   - RB_rush_attempts_allowed: total RB carries against this defense
#   - RB_ypc_allowed          : yards per carry allowed (yards / attempts)
#   - strength_of_offense     : sum of the 5-game rolling rush-yard averages of
#                               the offense's players (all positions in
#                               base_stats, not only RBs), used to adjust raw
#                               yards-allowed for the quality of the offense faced.
#
# Note: the `Opp` column on the offensive side identifies the defending team,
# so we group by Team/Date from the offense's perspective to build the defense view.

import numpy as np

base_stats['Date'] = pd.to_datetime(base_stats['Date'])

# Filter to RBs only for the yards-allowed calculation
rb_stats = base_stats[base_stats['Pos.'] == 'RB'].copy()

# Aggregate total RB production conceded by each team on each game date
team_rb_summary = (
    rb_stats
    .groupby(['Team', 'Date'], as_index=False)
    .agg(
        RB_rush_yards_allowed=('Yds', 'sum'),
        RB_rush_attempts_allowed=('Att', 'sum'),
        Opp=('Opp', 'first')           # Identify the defending team
    )
)

# Yards per carry allowed — guard against divide-by-zero with NaN replacement
team_rb_summary['RB_ypc_allowed'] = (
    team_rb_summary['RB_rush_yards_allowed'] /
    team_rb_summary['RB_rush_attempts_allowed'].replace(0, np.nan)
)

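# Strength of offense: for each (team, date) pair, sum the 5-game rolling
# rush-yard average across all of that team's players present in the offense lookup.
# This acts as a quality-of-opponent adjustment for the defense features.
# Caveat: a player's first game in the data has a NaN rolling average, and
# adding NaN makes the whole sum NaN. This likely explains the missing
# strength_of_offense values in early-season rows.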
strength_map = {}
for k, v in base_stats.groupby(['Team', 'Date']):
    offense_strength = 0
    for player in v['Player'].unique():
        if player in offense_rush_stats_LOOKUP:
            player_offense_stats = offense_rush_stats_LOOKUP[player]
            player_stats_on_date = player_offense_stats[player_offense_stats['Date'] == k[1]]
            if not player_stats_on_date.empty:
                offense_strength += player_stats_on_date['rush_yards_5ma'].values[0]
    strength_map[k] = offense_strength

team_rb_summary['strength_of_offense'] = team_rb_summary.apply(
    lambda row: strength_map.get((row['Team'], row['Date']), 0), axis=1
)


## 4. Exploratory Check

Inspect the per-game defensive summary table to verify column names and data coverage before building rolling features.
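
Beyond eyeballing the table, a few programmatic checks can make gaps explicit. The helper below is a hypothetical sketch (the name `coverage_report` is not part of the notebook) that counts missing values and duplicate `(Team, Date)` keys in a frame shaped like `team_rb_summary`. It is shown on made-up data.

```python
import numpy as np
import pandas as pd


def coverage_report(summary):
    """Return simple data-quality counts for a per-game summary frame."""
    return {
        "rows": len(summary),
        "duplicate_keys": int(summary.duplicated(["Team", "Date"]).sum()),
        "missing_strength": int(summary["strength_of_offense"].isna().sum()),
        "missing_ypc": int(summary["RB_ypc_allowed"].isna().sum()),
    }


# Toy frame with made-up values, shaped like team_rb_summary.
toy_summary = pd.DataFrame({
    "Team": ["ARI", "ARI", "WAS"],
    "Date": pd.to_datetime(["2018-09-09", "2018-09-16", "2018-09-09"]),
    "RB_ypc_allowed": [5.0, np.nan, 4.5],
    "strength_of_offense": [np.nan, 64.0, np.nan],
})
report = coverage_report(toy_summary)
print(report)
```

Running `coverage_report(team_rb_summary)` on the real table would show how many games lack a usable `strength_of_offense` before those values feed the rolling features.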

In [None]:
# Quick sanity check — inspect the team defensive summary table
team_rb_summary


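| | Team | Date | RB_rush_yards_allowed | RB_rush_attempts_allowed | Opp | RB_ypc_allowed | strength_of_offense |
|---|------|------|-----------------------|--------------------------|-----|----------------|---------------------|
| 0 | ARI | 2018-09-09 | 61 | 13 | WAS | 4.692308 | NaN |
| 1 | ARI | 2018-09-16 | 54 | 15 | LAR | 3.600000 | NaN |
| 2 | ARI | 2018-09-23 | 41 | 17 | CHI | 2.411765 | NaN |
| 3 | ARI | 2018-09-30 | 72 | 25 | SEA | 2.880000 | 64.0 |
| 4 | ARI | 2018-10-07 | 54 | 19 | SFO | 2.842105 | 69.5 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 4249 | WAS | 2025-12-07 | 84 | 17 | MIN | 4.941176 | 152.8 |
| 4250 | WAS | 2025-12-14 | 102 | 27 | NYG | 3.777778 | 86.0 |
| 4251 | WAS | 2025-12-20 | 91 | 24 | PHI | 3.791667 | 141.6 |
| 4252 | WAS | 2025-12-25 | 103 | 12 | DAL | 8.583333 | 64.2 |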


## 5. Defensive Rolling Feature Lookup

Build `defense_rush_stats_LOOKUP`, a dict keyed by team abbreviation, where each entry is a DataFrame of that team's rolling defensive stats with one row per game date (`Date` is a regular column, not the index).

Grouped by **`Opp`** (the defending team as seen from the offense's perspective).  
All windows use `shift(1)` to prevent same-game leakage.

**Features computed per team per game:**

| Feature Group | Description |
|---------------|-------------|
| `RB_rush_yards_allowed_{1,3,5}ma` | Rolling mean yards allowed to RBs |
| `RB_ypc_allowed_{1,3,5}ma` | Rolling mean yards per carry allowed |
| `min/max_rush_yards_allowed_{3,5}ma` | Rolling floor/ceiling — range context for the defense |
| `strength_of_offense_{1,3,5}ma` | Rolling mean of opposing offense quality (intermediate; used for the relative feature but not stored in the lookup) |
| `defense_performance_relative_{1,3,5}ma` | Yards allowed minus avg opponent strength (< 0 = outperformed expectations) |
| `*_delta_3_5` / `*_delta_1_3` | Momentum signals: short-window minus long-window difference |
| `*_vol_5` | Rolling 5-game std of yards allowed and YPC (consistency signals) |
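
The derived features in the table are simple arithmetic on the rolling means. The sketch below works through them for one defense on made-up numbers; the helper `prior_mean` is introduced here only for illustration.

```python
import pandas as pd

# Toy history for one defense, oldest game first (values are made up).
games = pd.DataFrame({
    "RB_rush_yards_allowed": [100, 80, 120, 60],
    "strength_of_offense":   [90, 90, 100, 80],
})


def prior_mean(series, window):
    """Mean over up to `window` earlier games, excluding the current one."""
    return series.shift(1).rolling(window, min_periods=1).mean()


allowed_1ma = prior_mean(games["RB_rush_yards_allowed"], 1)
allowed_3ma = prior_mean(games["RB_rush_yards_allowed"], 3)
strength_3ma = prior_mean(games["strength_of_offense"], 3)

# Negative values mean the defense allowed fewer yards than the strength
# of the offenses it faced would suggest.
relative_3ma = allowed_3ma - strength_3ma
# Short window minus long window: positive means the most recent game
# allowed more yards than the 3-game average.
delta_1_3 = allowed_1ma - allowed_3ma
# Rolling sample standard deviation of prior games as a consistency signal.
vol_5 = games["RB_rush_yards_allowed"].shift(1).rolling(5, min_periods=1).std()

features = pd.DataFrame({
    "relative_3ma": relative_3ma,
    "delta_1_3": delta_1_3,
    "vol_5": vol_5,
})
print(features)
```

Note that `vol_5` is NaN for the first two rows: the sample standard deviation needs at least two prior games, even though `min_periods=1` allows the window to start earlier.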

In [None]:
# Build a per-team defensive rolling feature lookup.
#
# Grouped by the OPPOSING team (Opp) so that each entry represents how a
# defense has performed over its recent games — not the team that ran on them.
# All windows use shift(1) to prevent same-game leakage.
#
# Features computed per team per game:
#   - RB_rush_yards_allowed_Xma    : rolling mean yards allowed to RBs (1/3/5)
#   - RB_ypc_allowed_Xma           : rolling mean YPC allowed (1/3/5)
#   - min/max_rush_yards_allowed    : rolling floor/ceiling over 3 and 5 games
#   - strength_of_offense_Xma      : rolling mean of opposing offense quality
#                                    (intermediate only; not stored in the lookup)
#   - defense_performance_relative_Xma : yards allowed minus avg offense strength
#                                        (< 0 means defense outperformed expectations)
#   - delta features               : 3v5 and 1v3 window differences to capture
#                                    defensive momentum / regression
#   - vol features                 : rolling std over 5 games for yards allowed
#                                    and YPC allowed (consistency signals)
#
# Result: defense_rush_stats_LOOKUP[team_abbrev] -> DataFrame with one row per game Date

defense_rush_stats_LOOKUP = {}
for k, v in team_rb_summary.sort_values(['Date']).groupby('Opp'):
    v = v.sort_values(['Date']).reset_index(drop=True)

    # --- Rolling means for yards allowed and YPC allowed ---
    RB_rush_yards_allowed_1ma = v['RB_rush_yards_allowed'].shift(1).rolling(1, min_periods=1).mean()
    RB_rush_yards_allowed_3ma = v['RB_rush_yards_allowed'].shift(1).rolling(3, min_periods=1).mean()
    RB_rush_yards_allowed_5ma = v['RB_rush_yards_allowed'].shift(1).rolling(5, min_periods=1).mean()

    RB_ypc_allowed_1ma = v['RB_ypc_allowed'].shift(1).rolling(1, min_periods=1).mean()
    RB_ypc_allowed_3ma = v['RB_ypc_allowed'].shift(1).rolling(3, min_periods=1).mean()
    RB_ypc_allowed_5ma = v['RB_ypc_allowed'].shift(1).rolling(5, min_periods=1).mean()

    # --- Rolling min/max for yards-allowed range context ---
    min_rush_yards_allowed_3ma = v['RB_rush_yards_allowed'].shift(1).rolling(3, min_periods=1).min()
    min_rush_yards_allowed_5ma = v['RB_rush_yards_allowed'].shift(1).rolling(5, min_periods=1).min()
    max_rush_yards_allowed_3ma = v['RB_rush_yards_allowed'].shift(1).rolling(3, min_periods=1).max()
    max_rush_yards_allowed_5ma = v['RB_rush_yards_allowed'].shift(1).rolling(5, min_periods=1).max()

    # --- Opponent offensive strength rolling means ---
    strength_of_offense_1ma = v['strength_of_offense'].shift(1).rolling(1, min_periods=1).mean()
    strength_of_offense_3ma = v['strength_of_offense'].shift(1).rolling(3, min_periods=1).mean()
    strength_of_offense_5ma = v['strength_of_offense'].shift(1).rolling(5, min_periods=1).mean()

    # --- Opponent-adjusted defensive performance ---
    # Positive = gave up more yards than expected; Negative = locked down relative to opponent quality
    defense_performance_relative_1ma = RB_rush_yards_allowed_1ma - strength_of_offense_1ma
    defense_performance_relative_3ma = RB_rush_yards_allowed_3ma - strength_of_offense_3ma
    defense_performance_relative_5ma = RB_rush_yards_allowed_5ma - strength_of_offense_5ma

    # --- Momentum deltas (short window minus long window) ---
    RB_rush_yards_allowed_delta_3_5 = RB_rush_yards_allowed_3ma - RB_rush_yards_allowed_5ma
    RB_rush_yards_allowed_delta_1_3 = RB_rush_yards_allowed_1ma - RB_rush_yards_allowed_3ma

    ypc_allowed_delta_3_5 = RB_ypc_allowed_3ma - RB_ypc_allowed_5ma
    ypc_allowed_delta_1_3 = RB_ypc_allowed_1ma - RB_ypc_allowed_3ma

    defense_relative_delta_3_5 = defense_performance_relative_3ma - defense_performance_relative_5ma
    defense_relative_delta_1_3 = defense_performance_relative_1ma - defense_performance_relative_3ma

    # --- Volatility features: consistency of the defense over the last 5 games ---
    RB_rush_yards_allowed_vol_5 = v['RB_rush_yards_allowed'].shift(1).rolling(5, min_periods=1).std()
    ypc_allowed_vol_5 = v['RB_ypc_allowed'].shift(1).rolling(5, min_periods=1).std()

    defense_rush_stats_LOOKUP[k] = pd.DataFrame({
        'Date': pd.to_datetime(v['Date']),
        'RB_rush_yards_allowed_1ma': RB_rush_yards_allowed_1ma,
        'RB_rush_yards_allowed_3ma': RB_rush_yards_allowed_3ma,
        'RB_rush_yards_allowed_5ma': RB_rush_yards_allowed_5ma,
        'RB_ypc_allowed_1ma': RB_ypc_allowed_1ma,
        'RB_ypc_allowed_3ma': RB_ypc_allowed_3ma,
        'RB_ypc_allowed_5ma': RB_ypc_allowed_5ma,
        'min_rush_yards_allowed_3ma': min_rush_yards_allowed_3ma,
        'min_rush_yards_allowed_5ma': min_rush_yards_allowed_5ma,
        'max_rush_yards_allowed_3ma': max_rush_yards_allowed_3ma,
        'max_rush_yards_allowed_5ma': max_rush_yards_allowed_5ma,
        'defense_performance_relative_1ma': defense_performance_relative_1ma,
        'defense_performance_relative_3ma': defense_performance_relative_3ma,
        'defense_performance_relative_5ma': defense_performance_relative_5ma,
        'RB_rush_yards_allowed_delta_3_5': RB_rush_yards_allowed_delta_3_5,
        'RB_rush_yards_allowed_delta_1_3': RB_rush_yards_allowed_delta_1_3,
        'ypc_allowed_delta_3_5': ypc_allowed_delta_3_5,
        'ypc_allowed_delta_1_3': ypc_allowed_delta_1_3,
        'defense_relative_delta_3_5': defense_relative_delta_3_5,
        'defense_relative_delta_1_3': defense_relative_delta_1_3,
        'RB_rush_yards_allowed_vol_5': RB_rush_yards_allowed_vol_5,
        'ypc_allowed_vol_5': ypc_allowed_vol_5,
    })


## 6. Join, Finalise & Export

For each `(Team, Date)` row in `base_stats`, look up the opponent team's defensive rolling snapshot for that date from `defense_rush_stats_LOOKUP`.  
All matched rows are assembled into a single DataFrame and exported to CSV.

**Output file:** `defense_stats_feature_engineering.csv`

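> Games with no matching row in the defensive lookup (for example, when no RB rows were recorded for that `(Team, Date)`) are skipped. A defense's earliest game in the data is not dropped: it has a lookup row, but its rolling features are NaN because `shift(1)` leaves no prior history.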
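
The same lookup-and-filter step can also be expressed as a merge. The sketch below is an alternative formulation on toy data, not the notebook's implementation. It assumes the per-team lookup frames are first stacked into one long frame and labelled with the defending team under the column name `Opp`.

```python
import pandas as pd

# Toy stand-in for defense_rush_stats_LOOKUP (values are made up).
toy_lookup = {
    "WAS": pd.DataFrame({
        "Date": pd.to_datetime(["2018-09-09", "2018-09-16"]),
        "RB_rush_yards_allowed_3ma": [float("nan"), 60.0],
    }),
}

# One row per offense per game, as produced by grouping base_stats.
games = pd.DataFrame({
    "Team": ["ARI", "IND", "DAL"],
    "Date": pd.to_datetime(["2018-09-09", "2018-09-16", "2018-09-16"]),
    "Opp":  ["WAS", "WAS", "CAR"],  # CAR has no lookup entry
})

# Stack the per-team frames and label each row with its defense.
stacked = pd.concat(
    [df.assign(Opp=team) for team, df in toy_lookup.items()],
    ignore_index=True,
)

# An inner merge keeps only games with a matching (defense, date) row,
# mirroring the `if not defense_row.empty` filter in the loop.
joined = games.merge(stacked, on=["Opp", "Date"], how="inner")
print(joined)
```

In the toy output the ARI row is kept with a NaN feature (the defense has no prior game), while the DAL row is dropped because CAR has no lookup entry.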

In [None]:
# Join defensive rolling features back to each (Team, Date) game observation.
#
# For each game, look up the opponent's defensive rolling stats and attach them
# along with identifiers (Team, Date, Opp). The result is one row per game
# representing the defense the team faced — ready to be merged with the
# offensive RB features in the train/test notebooks.

defense_train = []
for k, v in base_stats.groupby(['Team', 'Date']):
    team = k[0]
    date = k[1]
    opp = v['Opp'].iloc[0]

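    # Look up the opponent's pre-computed defensive rolling stats for this date.
    # Skip the game if the opponent has no lookup entry; indexing an empty
    # default DataFrame by 'Date' would raise a KeyError.
    defense_features = defense_rush_stats_LOOKUP.get(opp)
    if defense_features is None:
        continue
    defense_row = defense_features[defense_features['Date'] == date]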

    if not defense_row.empty:
        combined_row = defense_row.iloc[0].to_dict()
        combined_row['Team'] = team
        combined_row['Date'] = date
        combined_row['Opp'] = opp
        defense_train.append(combined_row)


In [None]:
# Convert the list of defense feature dicts into a tidy DataFrame
defense_train_df = pd.DataFrame(defense_train)


In [None]:
# Persist the defensive feature set to disk.
# This CSV is merged with base_stats_feature_engineering.csv in the
# train/test notebooks to provide opponent-context features for each RB game.
defense_train_df.to_csv('defense_stats_feature_engineering.csv', index=False)
