# Phase 5: Feature Engineering

**Goal:** Build ML-ready features from rolling team performance metrics with zero future leakage.

| Step | Cell | Description |
|------|------|-------------|
| Setup | Cell 1 | Load continuous timeline (Master + banked + future) |
| Unstack | Cell 2 | Convert matches to team-level records |
| Rolling | Cell 3 | Calculate rolling features with shift(1) |
| Matchup | Cell 4 | Map features back to match rows |
| Split | Cell 5 | Re-split into 3 engineered CSVs |
| Validate | Cell 6 | Leakage checks and feature summary |

### Features Engineered
| Feature | Window | Formula |
|---------|--------|---------|
| `form` | L5 / L10 | Rolling sum of points (W=3, D=1, L=0) |
| `finishing_efficiency` | L5 / L10 | Rolling mean of (Goals - xG) |
| `attacking_xg` | L5 / L10 | Rolling mean of xG created |
| `defensive_xg` | L5 / L10 | Rolling mean of xG conceded |
| `rest_days` | - | Days since previous match |

### Anti-Leakage Design
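Every rolling feature is computed on a series shifted by one match (`shift(1)`), so the features for match N draw only on matches played before it (at most the previous 5 or 10), never on match N itself.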
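
The effect of the shift can be seen on a toy series. The sketch below uses made-up points for a single team (not data from this project) and compares a rolling sum with and without `shift(1)`.

```python
import pandas as pd

# Hypothetical points for one team's first six matches, in date order.
points = pd.Series([3, 1, 0, 3, 3, 1], name="points")

# Leaky version: the window ending at match N includes match N itself.
leaky_form = points.rolling(window=5, min_periods=1).sum()

# Leak-free version: shift(1) moves every value down one row, so the
# window ending at match N covers only matches before N.
safe_form = points.shift(1).rolling(window=5, min_periods=1).sum()

print(pd.DataFrame({"points": points, "leaky": leaky_form, "safe": safe_form}))
```

The first row of `safe` is NaN because a team's first match has no history, which is why a small share of NaN values in the engineered features is expected.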

In [1]:
# Make the project root the working directory, whether the notebook is
# launched from the root itself or from the notebooks/ folder
import os
import sys
try:
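    # Compare the folder name only; a substring test would also match any
    # parent directory whose name happens to contain 'notebooks'
    if os.path.basename(os.getcwd()) == 'notebooks':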
        project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
    else:
        project_root = os.getcwd()
    os.chdir(project_root)
    if project_root not in sys.path:
        sys.path.append(project_root)
except Exception:
    pass

# =============================================================================
# Cell 1: Setup & Load Continuous Timeline
# =============================================================================

import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')

processed_dir = os.path.join('data', 'processed')

# --- Load all 3 CSVs ---
print('Loading continuous timeline...')
master = pd.read_csv(os.path.join(processed_dir, 'Master_Training_Set.csv'), low_memory=False)
banked = pd.read_csv(os.path.join(processed_dir, 'current_season_banked.csv'), low_memory=False)
future = pd.read_csv(os.path.join(processed_dir, 'future_schedule_features.csv'), low_memory=False)

print(f'  Master (1415-2425):  {len(master):>7,} rows')
print(f'  Banked (2526 played):{len(banked):>7,} rows')
print(f'  Future (2526 sched): {len(future):>7,} rows')

# --- Concat into one continuous timeline ---
# Tag each row with its source for later re-splitting
master['_source'] = 'history'
banked['_source'] = 'banked'
future['_source'] = 'future'

timeline = pd.concat([master, banked, future], ignore_index=True)

# --- Parse and sort by date ---
timeline['date'] = pd.to_datetime(timeline['date_norm'], format='mixed', errors='coerce')
timeline = timeline.sort_values(['date', 'league']).reset_index(drop=True)

# --- Add a unique match ID for later joining ---
timeline['_match_id'] = range(len(timeline))

# --- Subset key columns ---
key_cols = ['_match_id', '_source', 'league', 'season', 'date',
            'home_team', 'away_team', 'FTHG', 'FTAG', 'FTR',
            'home_xg', 'away_xg', 'home_elo', 'away_elo', 'elo_diff']
slim = timeline[key_cols].copy()

print(f'\nCombined timeline: {len(slim):,} rows')
print(f'  Date range: {slim["date"].min().date()} to {slim["date"].max().date()}')
print(f'  Seasons: {sorted(slim["season"].unique())}')
print(f'  Played: {slim["FTR"].notna().sum():,}  |  Future: {slim["FTR"].isna().sum():,}')

Loading continuous timeline...
  Master (1415-2425):   19,837 rows
  Banked (2526 played):  1,104 rows
  Future (2526 sched):     648 rows

Combined timeline: 21,589 rows
  Date range: 2014-08-08 to 2026-12-04
  Seasons: [1415, 1516, 1617, 1718, 1819, 1920, 2021, 2223, 2324, 2425, 2526]
  Played: 20,941  |  Future: 648
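
The tag, concatenate, sort pattern used in Cell 1 can be sketched on toy frames. The three small DataFrames below are hypothetical stand-ins for the real CSVs. The sketch also counts dates that failed to parse, since `errors='coerce'` turns them into `NaT` silently.

```python
import pandas as pd

# Hypothetical stand-ins for the three input files.
history = pd.DataFrame({"date_norm": ["2024-05-19", "2023-08-12"], "home_team": ["A", "B"]})
played = pd.DataFrame({"date_norm": ["2025-08-16"], "home_team": ["A"]})
upcoming = pd.DataFrame({"date_norm": ["2026-05-24"], "home_team": ["B"]})

# Tag each frame with its origin so it can be recovered after sorting.
frames = {"history": history, "banked": played, "future": upcoming}
tagged = [df.assign(_source=name) for name, df in frames.items()]

timeline = pd.concat(tagged, ignore_index=True)
timeline["date"] = pd.to_datetime(timeline["date_norm"], errors="coerce")
timeline = timeline.sort_values("date").reset_index(drop=True)

# Unparsed dates become NaT and sort last, which would distort any
# per-team chronology, so it is worth counting them early.
unparsed = int(timeline["date"].isna().sum())
sizes = timeline["_source"].value_counts().to_dict()
print(unparsed, sizes)
```

If `unparsed` is non-zero on the real data, those rows are probably worth inspecting before any rolling features are computed.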


In [2]:
# =============================================================================
# Cell 2: Unstack Matches to Team-Level Records
# =============================================================================
# Each match produces TWO rows: one from home perspective, one from away.
# This gives each team a chronological match log for rolling calculations.

print('Unstacking matches to team-level records...')
print('=' * 60)

def make_team_records(df):
    """Convert match rows into team-perspective records."""
    # Home perspective
    home = pd.DataFrame({
        '_match_id': df['_match_id'],
        'team': df['home_team'],
        'opponent': df['away_team'],
        'date': df['date'],
        'venue': 'H',
        'gf': df['FTHG'],
        'ga': df['FTAG'],
        'xg_for': df['home_xg'],
        'xg_against': df['away_xg'],
        'ftr': df['FTR'],
        'league': df['league'],
        'season': df['season'],
        '_source': df['_source'],
    })
    
    # Away perspective
    away = pd.DataFrame({
        '_match_id': df['_match_id'],
        'team': df['away_team'],
        'opponent': df['home_team'],
        'date': df['date'],
        'venue': 'A',
        'gf': df['FTAG'],
        'ga': df['FTHG'],
        'xg_for': df['away_xg'],
        'xg_against': df['home_xg'],
        'ftr': df['FTR'],
        'league': df['league'],
        'season': df['season'],
        '_source': df['_source'],
    })
    
    records = pd.concat([home, away], ignore_index=True)
    
    # Calculate points from team's perspective
    conditions = [
        (records['venue'] == 'H') & (records['ftr'] == 'H'),  # Home win
        (records['venue'] == 'A') & (records['ftr'] == 'A'),  # Away win
        records['ftr'] == 'D',                                 # Draw
    ]
    records['points'] = np.select(conditions, [3, 3, 1], default=0)
    
    # For future games (no result), set points to NaN so they don't pollute rolling calcs
    records.loc[records['ftr'].isna(), 'points'] = np.nan
    
    # Calculate excess goals (finishing efficiency raw)
    records['excess_goals'] = records['gf'] - records['xg_for']
    
    # Sort chronologically per team
    records = records.sort_values(['team', 'date']).reset_index(drop=True)
    
    return records

team_records = make_team_records(slim)

print(f'Team records: {len(team_records):,} rows ({len(team_records)//2:,} matches x 2 perspectives)')
print(f'Unique teams: {team_records["team"].nunique()}')

# Show sample for one team
sample_team = 'Arsenal'
sample = team_records[team_records['team'] == sample_team].tail(6)
print(f'\nSample: {sample_team} (last 6 matches)')
print(sample[['date', 'opponent', 'venue', 'gf', 'ga', 'xg_for', 'xg_against', 'points', 'excess_goals']].to_string(index=False))

Unstacking matches to team-level records...
Team records: 43,178 rows (21,589 matches x 2 perspectives)
Unique teams: 166

Sample: Arsenal (last 6 matches)
      date       opponent venue  gf  ga  xg_for  xg_against  points  excess_goals
2026-04-18       Man City     A NaN NaN     NaN         NaN     NaN           NaN
2026-04-25      Newcastle     H NaN NaN     NaN         NaN     NaN           NaN
2026-05-17        Burnley     H NaN NaN     NaN         NaN     NaN           NaN
2026-05-24 Crystal Palace     A NaN NaN     NaN         NaN     NaN           NaN
2026-09-05       West Ham     A NaN NaN     NaN         NaN     NaN           NaN
2026-11-04    Bournemouth     H NaN NaN     NaN         NaN     NaN           NaN
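
As a compact illustration of the unstacking step, the sketch below turns two hypothetical matches (one played, one unplayed) into four team-perspective rows. It relies on the result code (`H`, `A`, `D`) matching the `venue` code when the team won, which is a shortcut for the two win conditions in `make_team_records`.

```python
import numpy as np
import pandas as pd

# One hypothetical played match and one unplayed fixture.
matches = pd.DataFrame({
    "_match_id": [0, 1],
    "home_team": ["A", "B"],
    "away_team": ["B", "A"],
    "FTR": ["A", None],  # match 0 was an away win; match 1 has no result yet
})

# Build the two perspectives by renaming the team columns.
home = matches.rename(columns={"home_team": "team", "away_team": "opponent"}).assign(venue="H")
away = matches.rename(columns={"away_team": "team", "home_team": "opponent"}).assign(venue="A")
records = pd.concat([home, away], ignore_index=True)

won = records["venue"] == records["FTR"]  # result code equals the team's venue
drew = records["FTR"] == "D"
records["points"] = np.select([won, drew], [3.0, 1.0], default=0.0)

# Unplayed fixtures get NaN rather than 0 so they do not count as losses.
records.loc[records["FTR"].isna(), "points"] = np.nan

print(records.sort_values(["team", "_match_id"]))
```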


In [3]:
# =============================================================================
# Cell 3: Calculate Rolling Features (Per Team)
# =============================================================================
# CRITICAL: shift(1) ensures no future leakage.
# Match N's features use only data from matches 1..N-1.

print('Computing rolling features...')
print('=' * 60)

def calc_rolling_features(group):
    """Calculate rolling features for a single team's chronological match log."""
    g = group.copy()
    
    for window in [5, 10]:
        suffix = f'_l{window}'
        
        # Form: rolling SUM of points (shifted to prevent leakage)
        g[f'form{suffix}'] = (
            g['points']
            .shift(1)
            .rolling(window=window, min_periods=1)
            .sum()
        )
        
        # Finishing Efficiency: rolling MEAN of excess_goals (Goals - xG)
        g[f'finishing_efficiency{suffix}'] = (
            g['excess_goals']
            .shift(1)
            .rolling(window=window, min_periods=1)
            .mean()
        )
        
        # Attacking Strength: rolling MEAN of xG created
        g[f'attacking_xg{suffix}'] = (
            g['xg_for']
            .shift(1)
            .rolling(window=window, min_periods=1)
            .mean()
        )
        
        # Defensive Strength: rolling MEAN of xG conceded
        g[f'defensive_xg{suffix}'] = (
            g['xg_against']
            .shift(1)
            .rolling(window=window, min_periods=1)
            .mean()
        )
    
    # Rest Days: days since previous match
    g['rest_days'] = g['date'].diff().dt.days
    
    return g

team_features = team_records.groupby('team', group_keys=False).apply(calc_rolling_features)
team_features = team_features.reset_index(drop=True)

# --- Feature columns ---
feature_cols = [
    'form_l5', 'form_l10',
    'finishing_efficiency_l5', 'finishing_efficiency_l10',
    'attacking_xg_l5', 'attacking_xg_l10',
    'defensive_xg_l5', 'defensive_xg_l10',
    'rest_days',
]

# --- Forward-fill: fill rolling stats that are still NaN ---
# With min_periods=1 the rolling window skips NaN rows, so a future fixture
# only gets NaN once its entire window consists of unplayed matches; ffill
# replaces those NaNs with the team's last computed value.
# Caveat: earlier future fixtures are not frozen. Their windows hold fewer
# played matches, so sums such as form shrink (see form_l10 in the sample).
before_null = team_features[feature_cols].isna().sum().sum()
team_features[feature_cols] = (
    team_features
    .groupby('team')[feature_cols]
    .ffill()
)
after_null = team_features[feature_cols].isna().sum().sum()
print(f'Forward-fill: {before_null - after_null:,} NaN values filled with last known stats')

print(f'Rolling features calculated for {team_features["team"].nunique()} teams.')
print(f'\nFeature columns: {feature_cols}')

# --- Show sample ---
sample = team_features[team_features['team'] == 'Arsenal'].tail(6)
print(f'\nSample: Arsenal (last 6 matches -- future games now have features)')
print(sample[['date', 'opponent', 'points'] + feature_cols].to_string(index=False))

# --- Check for NaN in features ---
played_features = team_features[team_features['_source'] != 'future']
future_features = team_features[team_features['_source'] == 'future']
print(f'\nNaN rates (played):')
for col in feature_cols:
    null_pct = played_features[col].isna().mean() * 100
    print(f'  {col:30s} NaN: {null_pct:.1f}%')
print(f'\nNaN rates (future):')
for col in feature_cols:
    null_pct = future_features[col].isna().mean() * 100
    print(f'  {col:30s} NaN: {null_pct:.1f}%')

Computing rolling features...
Forward-fill: 3,686 NaN values filled with last known stats
Rolling features calculated for 166 teams.

Feature columns: ['form_l5', 'form_l10', 'finishing_efficiency_l5', 'finishing_efficiency_l10', 'attacking_xg_l5', 'attacking_xg_l10', 'defensive_xg_l5', 'defensive_xg_l10', 'rest_days']

Sample: Arsenal (last 6 matches -- future games now have features)
      date       opponent  points  form_l5  form_l10  finishing_efficiency_l5  finishing_efficiency_l10  attacking_xg_l5  attacking_xg_l10  defensive_xg_l5  defensive_xg_l10  rest_days
2026-04-18       Man City     NaN      1.0       8.0                  0.21582                  0.038040          0.78418          1.961960          1.81154          0.675372       15.0
2026-04-25      Newcastle     NaN      1.0       7.0                  0.21582                  0.763357          0.78418          1.736642          1.81154          0.764594        7.0
2026-05-17        Burnley     NaN      1.0       7.0    
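
In the Arsenal sample above, `form_l10` drops from 8.0 to 7.0 between two consecutive unplayed fixtures. That is consistent with the rolling window skipping NaN rows instead of stopping at the last played match. The sketch below reproduces the effect on made-up points and shows one possible alternative that freezes the value at the last played match. It assumes a team's unplayed fixtures all come after its played matches in date order. It is an illustration, not the notebook's method.

```python
import numpy as np
import pandas as pd

# Hypothetical log for one team: six played matches, then three unplayed fixtures.
points = pd.Series([3, 3, 1, 0, 3, 1, np.nan, np.nan, np.nan])

# Same pattern as Cell 3: NaN rows are skipped, not propagated, so the
# sum shrinks as unplayed fixtures enter the window.
decaying = points.shift(1).rolling(window=5, min_periods=1).sum()

# Alternative: compute form over played matches only ...
played = points.dropna()
form_before = played.shift(1).rolling(window=5, min_periods=1).sum()

# ... then give every unplayed fixture the form after the last played match.
form_after_last = played.tail(5).sum()
frozen = form_before.reindex(points.index)
frozen[points.isna()] = form_after_last

print(pd.DataFrame({"points": points, "decaying": decaying, "frozen": frozen}))
```

Whether decaying or frozen values suit the model better is a design choice. The frozen variant matches the stated intent that every future fixture uses the team's most recent performance.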

In [7]:
# =============================================================================
# Cell 4: Map Features Back to Match Rows (The "Matchup" Dataset)
# =============================================================================
# Join home team's rolling stats as home_*, away team's as away_*.

print('Mapping rolling features to match rows...')
print('=' * 60)

# Separate home and away features from team_features
home_feats = team_features[team_features['venue'] == 'H'][['_match_id'] + feature_cols].copy()
away_feats = team_features[team_features['venue'] == 'A'][['_match_id'] + feature_cols].copy()

# Rename with home_ / away_ prefix
home_feats = home_feats.rename(columns={c: f'home_{c}' for c in feature_cols})
away_feats = away_feats.rename(columns={c: f'away_{c}' for c in feature_cols})

# Join back to the original timeline
matchup = slim.merge(home_feats, on='_match_id', how='left')
matchup = matchup.merge(away_feats, on='_match_id', how='left')

# --- Derived matchup features ---
matchup['form_diff_l5'] = matchup['home_form_l5'] - matchup['away_form_l5']
matchup['form_diff_l10'] = matchup['home_form_l10'] - matchup['away_form_l10']
matchup['attack_vs_defense_l5'] = matchup['home_attacking_xg_l5'] - matchup['away_defensive_xg_l5']
matchup['defense_vs_attack_l5'] = matchup['home_defensive_xg_l5'] - matchup['away_attacking_xg_l5']

# --- Summary ---
all_feature_cols = (
    [f'home_{c}' for c in feature_cols] +
    [f'away_{c}' for c in feature_cols] +
    ['form_diff_l5', 'form_diff_l10', 'attack_vs_defense_l5', 'defense_vs_attack_l5', 'elo_diff']
)

print(f'Matchup dataset: {len(matchup):,} rows x {len(matchup.columns)} columns')
print(f'New feature columns: {len(all_feature_cols)}')
print(f'\nFeature list:')
for i, col in enumerate(all_feature_cols, 1):
    null_pct = matchup[col].isna().mean() * 100
    print(f'  {i:2d}. {col:35s} NaN: {null_pct:5.1f}%')

# --- Quick sample ---
print(f'\nSample matchup (last 3 history rows):')
sample = matchup[matchup['_source'] == 'history'].tail(3)
print(sample[['home_team', 'away_team', 'FTR', 'elo_diff', 'home_form_l5', 'away_form_l5', 'home_finishing_efficiency_l5', 'away_finishing_efficiency_l5']].to_string(index=False))

Mapping rolling features to match rows...
Matchup dataset: 21,589 rows x 37 columns
New feature columns: 23

Feature list:
   1. home_form_l5                        NaN:   0.4%
   2. home_form_l10                       NaN:   0.4%
   3. home_finishing_efficiency_l5        NaN:   0.4%
   4. home_finishing_efficiency_l10       NaN:   0.4%
   5. home_attacking_xg_l5                NaN:   0.4%
   6. home_attacking_xg_l10               NaN:   0.4%
   7. home_defensive_xg_l5                NaN:   0.4%
   8. home_defensive_xg_l10               NaN:   0.4%
   9. home_rest_days                      NaN:   0.4%
  10. away_form_l5                        NaN:   0.4%
  11. away_form_l10                       NaN:   0.4%
  12. away_finishing_efficiency_l5        NaN:   0.5%
  13. away_finishing_efficiency_l10       NaN:   0.5%
  14. away_attacking_xg_l5                NaN:   0.5%
  15. away_attacking_xg_l10               NaN:   0.5%
  16. away_defensive_xg_l5                NaN:   0.5%
  17. away_de
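
The join in Cell 4 depends on each `_match_id` appearing exactly once per venue. A minimal sketch on hypothetical data shows the same two-step merge with pandas' `validate` argument, which raises an error if that assumption is violated. The helper `side` is introduced here only for illustration.

```python
import pandas as pd

# Hypothetical match table and team-level features (two rows per match).
matches = pd.DataFrame({"_match_id": [0, 1], "home_team": ["A", "B"], "away_team": ["B", "A"]})
team_feats = pd.DataFrame({
    "_match_id": [0, 0, 1, 1],
    "venue": ["H", "A", "H", "A"],
    "form_l5": [7.0, 4.0, 5.0, 9.0],
})

def side(df, venue, prefix):
    """Select one venue's rows and prefix the feature column."""
    out = df.loc[df["venue"] == venue, ["_match_id", "form_l5"]]
    return out.rename(columns={"form_l5": f"{prefix}_form_l5"})

# validate="one_to_one" makes the merge fail loudly on duplicated match IDs
# instead of silently multiplying rows.
matchup = (
    matches
    .merge(side(team_feats, "H", "home"), on="_match_id", how="left", validate="one_to_one")
    .merge(side(team_feats, "A", "away"), on="_match_id", how="left", validate="one_to_one")
)
matchup["form_diff_l5"] = matchup["home_form_l5"] - matchup["away_form_l5"]
print(matchup)
```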

In [8]:
# =============================================================================
# Cell 5: Re-Split into 3 Engineered CSVs
# =============================================================================

print('Re-splitting into 3 engineered datasets...')
print('=' * 60)

# Drop internal columns before saving
save_cols = [c for c in matchup.columns if not c.startswith('_')]
clean = matchup[save_cols].copy()

# --- File A: model_training_engineered.csv (all history) ---
train_eng = clean[matchup['_source'] == 'history'].copy()
train_path = os.path.join(processed_dir, 'model_training_engineered.csv')
train_eng.to_csv(train_path, index=False)
print(f'  [A] model_training_engineered.csv:  {len(train_eng):>7,} rows -> {train_path}')

# --- File B: current_banked_engineered.csv (25/26 played) ---
banked_eng = clean[matchup['_source'] == 'banked'].copy()
banked_path = os.path.join(processed_dir, 'current_banked_engineered.csv')
banked_eng.to_csv(banked_path, index=False)
print(f'  [B] current_banked_engineered.csv:  {len(banked_eng):>7,} rows -> {banked_path}')

# --- File C: future_predict_engineered.csv (25/26 upcoming) ---
future_eng = clean[matchup['_source'] == 'future'].copy()
future_path = os.path.join(processed_dir, 'future_predict_engineered.csv')
future_eng.to_csv(future_path, index=False)
print(f'  [C] future_predict_engineered.csv:  {len(future_eng):>7,} rows -> {future_path}')

# --- Row count validation ---
total = len(train_eng) + len(banked_eng) + len(future_eng)
print(f'\n  Total: {total:,} rows (should match timeline: {len(matchup):,})')

# --- File sizes ---
for name, path in [('training', train_path), ('banked', banked_path), ('future', future_path)]:
    size_kb = os.path.getsize(path) / 1024
    print(f'  {name:12s} {size_kb:>8.0f} KB')

Re-splitting into 3 engineered datasets...
  [A] model_training_engineered.csv:   19,837 rows -> data\processed\model_training_engineered.csv
  [B] current_banked_engineered.csv:    1,104 rows -> data\processed\current_banked_engineered.csv
  [C] future_predict_engineered.csv:      648 rows -> data\processed\future_predict_engineered.csv

  Total: 21,589 rows (should match timeline: 21,589)
  training         7252 KB
  banked            413 KB
  future            210 KB


In [9]:
# =============================================================================
# Cell 6: Validation & Sanity Checks
# =============================================================================

print('PHASE 5 VALIDATION')
print('=' * 60)

# --- Check 1: finishing_efficiency_l5 exists ---
fe_exists = 'home_finishing_efficiency_l5' in train_eng.columns and 'away_finishing_efficiency_l5' in train_eng.columns
print(f'\n  1. finishing_efficiency_l5 exists:  {"[PASS]" if fe_exists else "[FAIL]"}')

# --- Check 2: No future leakage test ---
# For a played match, verify the rolling features were computed BEFORE that match.
# We do this by checking that team_features shift(1) is correctly applied:
# Pick a team, look at match N, and verify form_l5 equals sum of points from matches N-5..N-1
test_team = 'Liverpool'
test_records = team_features[
    (team_features['team'] == test_team) & 
    (team_features['_source'] != 'future')
].copy()

# Check the match at position 10 (0-indexed); iloc[10] needs at least 11 rows
test_idx = 10
if len(test_records) > test_idx:
    test_row = test_records.iloc[test_idx]
    prev_5_pts = test_records.iloc[test_idx-5:test_idx]['points'].sum()
    computed_form = test_row['form_l5']
    leakage_pass = abs(computed_form - prev_5_pts) < 0.01
    print(f'  2. No future leakage ({test_team}):   {"[PASS]" if leakage_pass else "[FAIL]"}')
    if not leakage_pass:
        print(f'     Expected form_l5={prev_5_pts}, got {computed_form}')
else:
    print(f'  2. No future leakage:              [SKIP] (not enough data for {test_team})')

# --- Check 3: Future fixtures have valid rolling stats ---
future_check = future_eng[['home_form_l5', 'away_form_l5', 'home_attacking_xg_l5', 'away_attacking_xg_l5']]
future_coverage = future_check.notna().all(axis=1).mean() * 100
future_ok = future_coverage > 90
print(f'  3. Future features coverage:       {"[PASS]" if future_ok else "[WARN]"} ({future_coverage:.1f}%)')

# --- Check 4: All 3 files have feature columns ---
required_features = ['home_form_l5', 'away_form_l5', 'home_finishing_efficiency_l5',
                     'away_finishing_efficiency_l5', 'elo_diff', 'home_rest_days', 'away_rest_days']
all_present = all(
    c in df.columns
    for df in (train_eng, banked_eng, future_eng)
    for c in required_features
)
print(f'  4. All feature cols in outputs:    {"[PASS]" if all_present else "[FAIL]"}')

# --- Check 5: Row counts match original ---
rows_ok = len(train_eng) == len(master) and len(banked_eng) == len(banked) and len(future_eng) == len(future)
print(f'  5. Row counts match originals:     {"[PASS]" if rows_ok else "[WARN]"}')
print(f'     Training: {len(train_eng):,} (was {len(master):,})')
print(f'     Banked:   {len(banked_eng):,} (was {len(banked):,})')
print(f'     Future:   {len(future_eng):,} (was {len(future):,})')

# --- Feature summary statistics ---
print(f'\n  Feature Statistics (Training Set):')
print(f'  {"Feature":35s} {"Mean":>8s} {"Std":>8s} {"Min":>8s} {"Max":>8s}')
print(f'  {"-"*69}')
for col in required_features:
    if col in train_eng.columns:
        s = train_eng[col].dropna()
        print(f'  {col:35s} {s.mean():8.2f} {s.std():8.2f} {s.min():8.2f} {s.max():8.2f}')

print(f'\n{"=" * 60}')
print('Phase 5 complete. Engineered features are ready for modeling.')

PHASE 5 VALIDATION

  1. finishing_efficiency_l5 exists:  [PASS]
  2. No future leakage (Liverpool):   [PASS]
  3. Future features coverage:       [PASS] (97.5%)
  4. All feature cols in outputs:    [PASS]
  5. Row counts match originals:     [PASS]
     Training: 19,837 (was 19,837)
     Banked:   1,104 (was 1,104)
     Future:   648 (was 648)

  Feature Statistics (Training Set):
  Feature                                 Mean      Std      Min      Max
  ---------------------------------------------------------------------
  home_form_l5                            6.73     3.60     0.00    15.00
  away_form_l5                            6.93     3.61     0.00    15.00
  home_finishing_efficiency_l5            0.00     0.48    -2.35     2.79
  away_finishing_efficiency_l5            0.00     0.49    -2.19     2.73
  elo_diff                                0.21   160.18  -530.86   518.48
  home_rest_days                         10.57    44.81     0.00  2640.00
  away_rest_days         
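
Check 2 inspects a single match for a single team. A broader test could recompute form with an explicit loop for every team and compare it with the vectorised result. The sketch below does this on hypothetical logs containing played matches only. `reference_form` is an illustrative helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def reference_form(points, window=5):
    """Sum of the previous `window` results, computed with an explicit loop."""
    out = []
    for i in range(len(points)):
        prev = points[max(0, i - window):i]
        out.append(float(prev.sum()) if len(prev) else np.nan)
    return np.array(out)

# Hypothetical team logs, already sorted by date within each team.
logs = pd.DataFrame({
    "team": ["A"] * 7 + ["B"] * 4,
    "points": [3, 0, 1, 3, 3, 0, 1, 1, 1, 3, 0],
})
logs["form_l5"] = logs.groupby("team")["points"].transform(
    lambda s: s.shift(1).rolling(window=5, min_periods=1).sum()
)

# Count rows where the vectorised feature disagrees with the loop.
mismatches = 0
for _, g in logs.groupby("team"):
    expected = reference_form(g["points"].to_numpy())
    actual = g["form_l5"].to_numpy()
    mismatches += int((~np.isclose(actual, expected, equal_nan=True)).sum())

print(f"Rows where form_l5 disagrees with the reference: {mismatches}")
```

Applied to the real `team_features`, a non-zero count would point to a window that includes the current match or is misaligned in some other way.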