# ⚙️ NFL Player Movement - Feature Engineering Deep Dive

**Comprehensive Feature Engineering for Player Trajectory Prediction**

This notebook explores advanced feature engineering techniques for NFL player movement prediction.

---

## 📋 Table of Contents

1. [Setup & Configuration](#1-setup)
2. [Data Loading](#2-data-loading)
3. [Physics Features](#3-physics)
4. [Spatial Features](#4-spatial)
5. [Temporal Features](#5-temporal)
6. [NFL Domain Features](#6-nfl)
7. [Feature Importance](#7-importance)
8. [Feature Correlation](#8-correlation)

---

## 1. Setup & Configuration 🔧

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# ML libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler
from scipy.stats import pearsonr

# Set plotting style
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (14, 6)

print("✅ Libraries imported successfully")

In [None]:
# Configuration
class Config:
    """Feature engineering configuration"""
    
    # Paths
    DATA_DIR = Path('../data')
    OUTPUT_DIR = Path('../outputs/feature_engineering')
    
    # Data settings
    USE_SAMPLE = True
    SAMPLE_SIZE = 50000
    MAX_FILES = 2
    RANDOM_STATE = 42

config = Config()
config.OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("✅ Configuration loaded")
print(f"   Data directory: {config.DATA_DIR}")
print(f"   Output directory: {config.OUTPUT_DIR}")
print(f"   Sample size: {config.SAMPLE_SIZE if config.USE_SAMPLE else 'All data'}")

## 2. Data Loading 📂

In [None]:
def load_data(data_dir, max_files=None, sample_size=None):
    """
    Load and merge input/output CSV files
    
    Args:
        data_dir: Directory containing train data
        max_files: Maximum number of files to load
        sample_size: Sample size for faster processing
    
    Returns:
        merged_df: Merged dataframe with features and targets
    """
    # Get file lists
    train_dir = data_dir / 'raw' / 'train'
    input_files = sorted(train_dir.glob('input_*.csv'))
    output_files = sorted(train_dir.glob('output_*.csv'))
    
    if max_files:
        input_files = input_files[:max_files]
        output_files = output_files[:max_files]
    
    print(f"📂 Loading {len(input_files)} files...")
    
    # Load and concatenate
    input_df = pd.concat([pd.read_csv(f) for f in input_files], ignore_index=True)
    output_df = pd.concat([pd.read_csv(f) for f in output_files], ignore_index=True)
    
    # Sample if requested
    if sample_size and len(input_df) > sample_size:
        print(f"🎲 Sampling {sample_size:,} rows...")
        input_df = input_df.sample(n=sample_size, random_state=42)
        sampled_keys = input_df[['game_id', 'play_id', 'nfl_id', 'frame_id']]
        output_df = output_df.merge(sampled_keys, on=['game_id', 'play_id', 'nfl_id', 'frame_id'])
    
    # Merge input and output
    merged_df = input_df.merge(
        output_df[['game_id', 'play_id', 'nfl_id', 'frame_id', 'x', 'y']],
        on=['game_id', 'play_id', 'nfl_id', 'frame_id'],
        suffixes=('', '_target')
    )
    merged_df = merged_df.rename(columns={'x_target': 'target_x', 'y_target': 'target_y'})
    
    print(f"✅ Data loaded: {merged_df.shape}")
    return merged_df

In [None]:
# Load data
df = load_data(
    config.DATA_DIR,
    max_files=config.MAX_FILES,
    sample_size=config.SAMPLE_SIZE if config.USE_SAMPLE else None
)

print(f"\n📋 Columns: {len(df.columns)}")
print(f"📊 Shape: {df.shape}")
display(df.head(3))

## 3. Physics Features ⚡

Create physics-based features from player tracking data:
- Velocity components (decomposition)
- Acceleration components
- Momentum (mass × velocity)
- Kinetic energy (½ × mass × velocity²)
- Direction differences
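
The direction-difference feature relies on wrapping angle differences into a signed range, since 350° and 10° are only 20° apart. A minimal sketch of that wraparound, using a hypothetical helper `wrap_angle_deg` (not part of this notebook) and made-up angles:

```python
import numpy as np

def wrap_angle_deg(diff):
    """Map angle differences in degrees into the half-open range [-180, 180)."""
    return (np.asarray(diff, dtype=float) + 180.0) % 360.0 - 180.0

# Illustrative orientation/motion pairs in degrees (not taken from the dataset)
orientation = np.array([350.0, 10.0, 90.0, 0.0])
motion_dir = np.array([10.0, 350.0, 270.0, 0.0])

signed_diff = wrap_angle_deg(orientation - motion_dir)  # signed difference
abs_diff = np.abs(signed_diff)                          # magnitude, 0 to 180

print(signed_diff)
print(abs_diff)
```

Taking the absolute value afterwards gives a magnitude between 0° and 180°, which is what a 90° threshold is compared against.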

In [None]:
def create_physics_features(df):
    """
    Create physics-based features
    
    Features:
    - velocity_x, velocity_y: Velocity vector components
    - acceleration_x, acceleration_y: Acceleration components
    - momentum: Player momentum (mass × speed)
    - kinetic_energy: Player kinetic energy (½mv²)
    - dir_diff: Difference between orientation and motion direction
    - speed_squared: s² for energy calculations
    """
    print("⚙️  Creating physics features...\n")
    df_copy = df.copy()
    
    # NFL tracking angles ('dir', 'o') are measured in degrees clockwise from the
    # +y axis, so the x component uses sin and the y component uses cos.
    # 1. Velocity components
    if 's' in df_copy.columns and 'dir' in df_copy.columns:
        df_copy['velocity_x'] = df_copy['s'] * np.sin(np.radians(df_copy['dir']))
        df_copy['velocity_y'] = df_copy['s'] * np.cos(np.radians(df_copy['dir']))
        df_copy['velocity_magnitude'] = df_copy['s']  # Alias for clarity
        print("   ✓ Velocity components (velocity_x, velocity_y)")
    
    # 2. Acceleration components
    # 'a' is a magnitude only; projecting it along 'dir' is an approximation
    if 'a' in df_copy.columns and 'dir' in df_copy.columns:
        df_copy['acceleration_x'] = df_copy['a'] * np.sin(np.radians(df_copy['dir']))
        df_copy['acceleration_y'] = df_copy['a'] * np.cos(np.radians(df_copy['dir']))
        print("   ✓ Acceleration components (acceleration_x, acceleration_y)")
    
    # 3. Momentum (mass × velocity)
    if 'player_weight' in df_copy.columns and 's' in df_copy.columns:
        df_copy['momentum'] = df_copy['player_weight'] * df_copy['s']
        # Component momentum needs the velocity components, which exist only when 'dir' is present
        if 'velocity_x' in df_copy.columns and 'velocity_y' in df_copy.columns:
            df_copy['momentum_x'] = df_copy['player_weight'] * df_copy['velocity_x']
            df_copy['momentum_y'] = df_copy['player_weight'] * df_copy['velocity_y']
        print("   ✓ Momentum features (momentum, momentum_x, momentum_y)")
    
    # 4. Kinetic energy (0.5 × mass × velocity²)
    if 'player_weight' in df_copy.columns and 's' in df_copy.columns:
        df_copy['speed_squared'] = df_copy['s'] ** 2
        df_copy['kinetic_energy'] = 0.5 * df_copy['player_weight'] * df_copy['speed_squared']
        print("   ✓ Kinetic energy (kinetic_energy, speed_squared)")
    
    # 5. Direction difference (orientation vs motion)
    if 'o' in df_copy.columns and 'dir' in df_copy.columns:
        dir_diff = df_copy['o'] - df_copy['dir']
        # Handle wraparound (keep in -180 to 180 range)
        dir_diff = (dir_diff + 180) % 360 - 180
        df_copy['dir_diff'] = np.abs(dir_diff)
        df_copy['is_backpedaling'] = (np.abs(dir_diff) > 90).astype(int)
        print("   ✓ Direction features (dir_diff, is_backpedaling)")
    
    # 6. Speed categories
    if 's' in df_copy.columns:
        df_copy['is_sprinting'] = (df_copy['s'] > 6).astype(int)
        df_copy['is_jogging'] = ((df_copy['s'] > 2) & (df_copy['s'] <= 6)).astype(int)
        df_copy['is_stationary'] = (df_copy['s'] <= 2).astype(int)
        print("   ✓ Speed categories (is_sprinting, is_jogging, is_stationary)")
    
    print(f"\n✅ Physics features created: {len([c for c in df_copy.columns if c not in df.columns])} new features")
    return df_copy

In [None]:
# Apply physics features
df = create_physics_features(df)

In [None]:
# Visualize physics features
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Velocity components scatter
sample = df.sample(min(5000, len(df)), random_state=config.RANDOM_STATE)  # seeded for a reproducible plot
scatter = axes[0, 0].scatter(sample['velocity_x'], sample['velocity_y'], 
                             c=sample['s'], cmap='viridis', alpha=0.5, s=10)
axes[0, 0].set_xlabel('Velocity X (yards/sec)', fontsize=12)
axes[0, 0].set_ylabel('Velocity Y (yards/sec)', fontsize=12)
axes[0, 0].set_title('Velocity Components (colored by speed)', fontsize=14, fontweight='bold')
axes[0, 0].axhline(0, color='red', linestyle='--', alpha=0.3)
axes[0, 0].axvline(0, color='red', linestyle='--', alpha=0.3)
plt.colorbar(scatter, ax=axes[0, 0], label='Speed (yards/sec)')

# 2. Momentum distribution
if 'momentum' in df.columns:
    axes[0, 1].hist(df['momentum'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='orange')
    axes[0, 1].axvline(df['momentum'].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
    axes[0, 1].axvline(df['momentum'].median(), color='green', linestyle='--', linewidth=2, label='Median')
    axes[0, 1].set_xlabel('Momentum (lbs·yards/sec)', fontsize=12)
    axes[0, 1].set_ylabel('Frequency', fontsize=12)
    axes[0, 1].set_title('Player Momentum Distribution', fontsize=14, fontweight='bold')
    axes[0, 1].legend()

# 3. Kinetic energy distribution
if 'kinetic_energy' in df.columns:
    axes[1, 0].hist(df['kinetic_energy'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='purple')
    axes[1, 0].axvline(df['kinetic_energy'].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
    axes[1, 0].axvline(df['kinetic_energy'].median(), color='green', linestyle='--', linewidth=2, label='Median')
    axes[1, 0].set_xlabel('Kinetic Energy (lbs·yards²/sec²)', fontsize=12)
    axes[1, 0].set_ylabel('Frequency', fontsize=12)
    axes[1, 0].set_title('Player Kinetic Energy Distribution', fontsize=14, fontweight='bold')
    axes[1, 0].legend()

# 4. Direction difference
if 'dir_diff' in df.columns:
    axes[1, 1].hist(df['dir_diff'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='cyan')
    axes[1, 1].axvline(90, color='red', linestyle='--', linewidth=2, label='90° (Backpedaling threshold)')
    axes[1, 1].set_xlabel('Direction Difference (degrees)', fontsize=12)
    axes[1, 1].set_ylabel('Frequency', fontsize=12)
    axes[1, 1].set_title('Orientation vs Motion Direction Difference', fontsize=14, fontweight='bold')
    axes[1, 1].legend()

plt.tight_layout()
plt.savefig(config.OUTPUT_DIR / 'physics_features.png', dpi=150, bbox_inches='tight')
plt.show()

print("✅ Physics features visualized")

## 4. Spatial Features 🗺️

Create spatial relationship features:
- Distance to ball landing spot
- Field position features (zones, normalized coordinates)
- Sideline proximity
- Angle to ball
- Red zone indicators
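
As a rough illustration of the ball-relative features, the sketch below computes the offset, distance, and angle from a player to the ball landing spot on a two-row toy frame. The column names mirror the ones used in this notebook; the coordinates are invented, and the angle follows the standard `arctan2` convention (counter-clockwise from the +x axis).

```python
import numpy as np
import pandas as pd

# Toy positions in yards (illustrative values only)
toy = pd.DataFrame({
    'x': [30.0, 50.0],
    'y': [20.0, 26.65],
    'ball_land_x': [33.0, 50.0],
    'ball_land_y': [24.0, 16.65],
})

# Offset from the player to the landing spot
toy['dx_to_ball'] = toy['ball_land_x'] - toy['x']
toy['dy_to_ball'] = toy['ball_land_y'] - toy['y']

# Euclidean distance and bearing of that offset
toy['dist_to_ball'] = np.hypot(toy['dx_to_ball'], toy['dy_to_ball'])
toy['angle_to_ball'] = np.degrees(np.arctan2(toy['dy_to_ball'], toy['dx_to_ball']))

print(toy[['dist_to_ball', 'angle_to_ball']])
```

The first row is a 3-4-5 triangle, which makes the distance easy to check by hand.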

In [None]:
def create_spatial_features(df):
    """
    Create spatial relationship features
    
    Features:
    - dist_to_ball: Euclidean distance to ball landing
    - dx_to_ball, dy_to_ball: Directional distances
    - angle_to_ball: Angle from player to ball
    - field_position_norm: Normalized field position
    - in_red_zone, in_midfield: Field zone indicators
    - dist_to_sideline, sideline_norm: Sideline proximity
    - dist_to_endzone: Distance to nearest endzone
    """
    print("🗺️  Creating spatial features...\n")
    df_copy = df.copy()
    
    # 1. Distance to ball landing
    if all(col in df_copy.columns for col in ['x', 'y', 'ball_land_x', 'ball_land_y']):
        df_copy['dx_to_ball'] = df_copy['ball_land_x'] - df_copy['x']
        df_copy['dy_to_ball'] = df_copy['ball_land_y'] - df_copy['y']
        df_copy['dist_to_ball'] = np.sqrt(df_copy['dx_to_ball']**2 + df_copy['dy_to_ball']**2)
        print("   ✓ Ball distance features (dist_to_ball, dx_to_ball, dy_to_ball)")
        
        # Angle to ball
        df_copy['angle_to_ball'] = np.degrees(np.arctan2(df_copy['dy_to_ball'], df_copy['dx_to_ball']))
        print("   ✓ Angle to ball (angle_to_ball)")
    
    # 2. Field position features
    if 'x' in df_copy.columns:
        df_copy['field_position_norm'] = df_copy['x'] / 120.0
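        # x spans 0-120 yards including both 10-yard end zones, so the goal lines sit
        # at x=10 and x=110. Flag positions within 20 yards of either goal line
        # (direction-agnostic, because the offense's direction is not used here).
        df_copy['in_red_zone'] = (df_copy['x'].between(10, 30) | df_copy['x'].between(90, 110)).astype(int)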
        df_copy['in_midfield'] = ((df_copy['x'] > 40) & (df_copy['x'] < 80)).astype(int)
        df_copy['dist_to_endzone'] = np.minimum(df_copy['x'], 120 - df_copy['x'])
        print("   ✓ Field position (field_position_norm, in_red_zone, in_midfield, dist_to_endzone)")
    
    # 3. Sideline proximity
    if 'y' in df_copy.columns:
        df_copy['dist_to_sideline'] = np.minimum(df_copy['y'], 53.3 - df_copy['y'])
        df_copy['sideline_norm'] = df_copy['dist_to_sideline'] / 26.65
        df_copy['y_normalized'] = df_copy['y'] / 53.3
        df_copy['near_sideline'] = (df_copy['dist_to_sideline'] < 5).astype(int)
        print("   ✓ Sideline features (dist_to_sideline, sideline_norm, near_sideline)")
    
    # 4. Field zones (dividing field into regions)
    if 'x' in df_copy.columns and 'y' in df_copy.columns:
        df_copy['field_zone_x'] = pd.cut(df_copy['x'], bins=5, labels=False)
        df_copy['field_zone_y'] = pd.cut(df_copy['y'], bins=3, labels=False)
        print("   ✓ Field zones (field_zone_x, field_zone_y)")
    
    # 5. Absolute yardline position
    if 'absolute_yardline_number' in df_copy.columns:
        df_copy['yardline_norm'] = df_copy['absolute_yardline_number'] / 100.0
        print("   ✓ Yardline normalized (yardline_norm)")
    
    print(f"\n✅ Spatial features created: {len([c for c in df_copy.columns if c not in df.columns])} new features")
    return df_copy

In [None]:
# Apply spatial features
df = create_spatial_features(df)

In [None]:
# Visualize spatial features
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Distance to ball distribution
if 'dist_to_ball' in df.columns:
    axes[0, 0].hist(df['dist_to_ball'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='blue')
    axes[0, 0].axvline(df['dist_to_ball'].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
    axes[0, 0].axvline(df['dist_to_ball'].median(), color='green', linestyle='--', linewidth=2, label='Median')
    axes[0, 0].set_xlabel('Distance to Ball Landing (yards)', fontsize=12)
    axes[0, 0].set_ylabel('Frequency', fontsize=12)
    axes[0, 0].set_title('Distance to Ball Distribution', fontsize=14, fontweight='bold')
    axes[0, 0].legend()

# 2. Field zones heatmap
if 'x' in df.columns and 'y' in df.columns:
    sample = df.sample(min(10000, len(df)), random_state=config.RANDOM_STATE)  # seeded for a reproducible plot
    heatmap, xedges, yedges = np.histogram2d(sample['x'], sample['y'], bins=[30, 15])
    extent = [xedges[0], xedges[-1], yedges[0], yedges[-1]]
    im = axes[0, 1].imshow(heatmap.T, extent=extent, origin='lower', cmap='YlOrRd', aspect='auto')
    axes[0, 1].set_xlabel('X Position (yards)', fontsize=12)
    axes[0, 1].set_ylabel('Y Position (yards)', fontsize=12)
    axes[0, 1].set_title('Player Position Density Heatmap', fontsize=14, fontweight='bold')
    plt.colorbar(im, ax=axes[0, 1], label='Density')

# 3. Sideline proximity
if 'dist_to_sideline' in df.columns:
    axes[1, 0].hist(df['dist_to_sideline'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='green')
    axes[1, 0].axvline(5, color='red', linestyle='--', linewidth=2, label='5 yards (near sideline)')
    axes[1, 0].set_xlabel('Distance to Nearest Sideline (yards)', fontsize=12)
    axes[1, 0].set_ylabel('Frequency', fontsize=12)
    axes[1, 0].set_title('Sideline Proximity Distribution', fontsize=14, fontweight='bold')
    axes[1, 0].legend()

# 4. Angle to ball
if 'angle_to_ball' in df.columns:
    axes[1, 1].hist(df['angle_to_ball'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='purple')
    axes[1, 1].set_xlabel('Angle to Ball (degrees)', fontsize=12)
    axes[1, 1].set_ylabel('Frequency', fontsize=12)
    axes[1, 1].set_title('Angle to Ball Landing Distribution', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig(config.OUTPUT_DIR / 'spatial_features.png', dpi=150, bbox_inches='tight')
plt.show()

print("✅ Spatial features visualized")

## 5. Temporal Features ⏱️

### ⚠️ **DATA LEAKAGE WARNING** ⚠️

<div style="background-color: #ffcccc; padding: 15px; border-left: 5px solid red; margin: 10px 0;">
<strong>IMPORTANT: Temporal features combine information from several frames of the same play!</strong>

**These features MUST be created AFTER the train/validation split to avoid data leakage.**

Why?
- Changes (`x_change`, `y_change`, and so on) are differences between a frame and the previous frame of the same player
- Any rolling statistic would draw on a window of surrounding frames
- If frames from one play land on both sides of the split, validation data "leaks" into training data

**Best Practice:**
1. Split data first (train/validation), keeping whole plays or games together
2. Create temporal features separately for each split
3. Never compute temporal features across split boundaries
</div>

This section demonstrates temporal feature creation (for educational purposes).
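
One way to follow the split-first practice is to assign whole games to one side of the split and only then compute frame-to-frame differences within each part. The sketch below does this on a small synthetic frame; the helpers `split_by_game` and `add_x_change`, the 50/50 ratio, and the toy values are illustrative assumptions rather than this notebook's pipeline.

```python
import numpy as np
import pandas as pd

def split_by_game(df, valid_fraction=0.5, seed=42):
    """Split rows so that every game_id lands entirely in train or validation."""
    games = df['game_id'].unique()
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(games)
    n_valid = max(1, int(round(len(shuffled) * valid_fraction)))
    is_valid = df['game_id'].isin(shuffled[:n_valid])
    return df[~is_valid].copy(), df[is_valid].copy()

def add_x_change(part):
    """Per-player, per-play difference in x from the previous frame."""
    part = part.sort_values(['game_id', 'play_id', 'nfl_id', 'frame_id'])
    part['x_change'] = (
        part.groupby(['game_id', 'play_id', 'nfl_id'])['x'].diff().fillna(0)
    )
    return part

# Synthetic tracking rows: four games, one play and one player each
toy = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2, 2, 3, 3, 4, 4],
    'play_id': [10] * 10,
    'nfl_id': [7] * 10,
    'frame_id': [1, 2, 3, 1, 2, 3, 1, 2, 1, 2],
    'x': [20.0, 20.5, 21.2, 60.0, 59.4, 58.9, 45.0, 45.3, 80.0, 80.8],
})

# Split first, then build temporal features separately on each part
train_df, valid_df = split_by_game(toy)
train_df = add_x_change(train_df)
valid_df = add_x_change(valid_df)

print(sorted(train_df['game_id'].unique()), sorted(valid_df['game_id'].unique()))
```

Splitting on `game_id` is the coarsest option; splitting on `(game_id, play_id)` pairs would also keep each player's frame sequence on one side.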

In [None]:
def create_temporal_features(df):
    """
    Create temporal features from time-series tracking data
    
    ⚠️ WARNING: Use ONLY after train/test split to avoid data leakage!
    
    Features:
    - x_change, y_change: Position changes from previous frame
    - s_change, a_change: Speed/acceleration changes
    - dir_change: Direction change (with wraparound handling)
    - acceleration_rate: Rate of acceleration change
    """
    print("⏱️  Creating temporal features...\n")
    print("⚠️  WARNING: These features should only be used AFTER train/test split!\n")
    
    df_copy = df.copy()
    
    # Sort by game, play, player, and frame
    if all(col in df_copy.columns for col in ['game_id', 'play_id', 'nfl_id', 'frame_id']):
        df_copy = df_copy.sort_values(['game_id', 'play_id', 'nfl_id', 'frame_id']).reset_index(drop=True)
        
        # Group by player within play
        group_cols = ['game_id', 'play_id', 'nfl_id']
        
        # 1. Position changes
        for col in ['x', 'y']:
            if col in df_copy.columns:
                df_copy[f'{col}_change'] = df_copy.groupby(group_cols)[col].diff()
        print("   ✓ Position changes (x_change, y_change)")
        
        # 2. Speed and acceleration changes
        for col in ['s', 'a']:
            if col in df_copy.columns:
                df_copy[f'{col}_change'] = df_copy.groupby(group_cols)[col].diff()
        print("   ✓ Speed/acceleration changes (s_change, a_change)")
        
        # 3. Direction change (with wraparound)
        if 'dir' in df_copy.columns:
            dir_diff = df_copy.groupby(group_cols)['dir'].diff()
            # Handle wraparound (-180 to 180)
            dir_diff = (dir_diff + 180) % 360 - 180
            df_copy['dir_change'] = dir_diff
            df_copy['dir_change_abs'] = np.abs(dir_diff)
        print("   ✓ Direction changes (dir_change, dir_change_abs)")
        
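        # 4. Acceleration rate (jerk: time derivative of acceleration)
        # a_change is already the per-frame change in acceleration, so dividing by the
        # frame interval gives yards/sec³. Assumes 10 Hz tracking (0.1 s between frames).
        if 'a_change' in df_copy.columns:
            df_copy['acceleration_rate'] = df_copy['a_change'] / 0.1
        print("   ✓ Acceleration rate (acceleration_rate)")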
        
        # Fill NaN from diff operations
        change_cols = [c for c in df_copy.columns if '_change' in c or 'acceleration_rate' in c]
        df_copy[change_cols] = df_copy[change_cols].fillna(0)
    
    print(f"\n✅ Temporal features created: {len([c for c in df_copy.columns if c not in df.columns])} new features")
    print("⚠️  Remember: Only use these AFTER train/test split!")
    return df_copy

In [None]:
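# Demonstrate temporal features (for visualization only)
# In practice, create these AFTER splitting data.
# Note: with config.USE_SAMPLE = True the rows are randomly sampled, so some frames of a
# player can be missing and a single diff may span more than one frame.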
df_with_temporal = create_temporal_features(df)

In [None]:
# Visualize temporal features
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Position changes
if 'x_change' in df_with_temporal.columns:
    axes[0, 0].hist(df_with_temporal['x_change'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='blue')
    axes[0, 0].axvline(0, color='red', linestyle='--', linewidth=2)
    axes[0, 0].set_xlabel('X Position Change (yards)', fontsize=12)
    axes[0, 0].set_ylabel('Frequency', fontsize=12)
    axes[0, 0].set_title('Frame-to-Frame X Position Changes', fontsize=14, fontweight='bold')

# 2. Speed changes
if 's_change' in df_with_temporal.columns:
    axes[0, 1].hist(df_with_temporal['s_change'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='orange')
    axes[0, 1].axvline(0, color='red', linestyle='--', linewidth=2)
    axes[0, 1].set_xlabel('Speed Change (yards/sec)', fontsize=12)
    axes[0, 1].set_ylabel('Frequency', fontsize=12)
    axes[0, 1].set_title('Frame-to-Frame Speed Changes', fontsize=14, fontweight='bold')

# 3. Direction changes
if 'dir_change_abs' in df_with_temporal.columns:
    axes[1, 0].hist(df_with_temporal['dir_change_abs'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='green')
    axes[1, 0].set_xlabel('Absolute Direction Change (degrees)', fontsize=12)
    axes[1, 0].set_ylabel('Frequency', fontsize=12)
    axes[1, 0].set_title('Frame-to-Frame Direction Changes', fontsize=14, fontweight='bold')

# 4. Acceleration rate
if 'acceleration_rate' in df_with_temporal.columns:
    axes[1, 1].hist(df_with_temporal['acceleration_rate'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='purple')
    axes[1, 1].axvline(0, color='red', linestyle='--', linewidth=2)
    axes[1, 1].set_xlabel('Acceleration Rate (yards/sec³)', fontsize=12)
    axes[1, 1].set_ylabel('Frequency', fontsize=12)
    axes[1, 1].set_title('Acceleration Rate (Jerk)', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig(config.OUTPUT_DIR / 'temporal_features.png', dpi=150, bbox_inches='tight')
plt.show()

print("✅ Temporal features visualized")
print("⚠️  Note: In production, create these AFTER splitting data!")

## 6. NFL Domain Features 🏈

Create NFL-specific features based on domain knowledge:
- Player role indicators (receiver, passer, defender)
- Route detection (depth, lateral movement)
- Formation indicators
- Player physical attributes
- Side of ball (offense/defense)
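
The BMI feature needs height as a number of inches. If `player_height` arrives as a feet-inches string such as `'6-2'` (an assumption about the raw data, not something this notebook confirms), it has to be converted first. A minimal sketch with a hypothetical helper `height_to_inches`:

```python
import pandas as pd

def height_to_inches(height):
    """Convert 'feet-inches' strings (e.g. '6-2') to inches; pass numeric columns through."""
    if pd.api.types.is_numeric_dtype(height):
        return height.astype(float)
    # Capture the feet and inches parts; rows that do not match become NaN
    parts = height.astype(str).str.extract(r'^\s*(\d+)-(\d+)\s*$')
    feet = pd.to_numeric(parts[0], errors='coerce')
    inches = pd.to_numeric(parts[1], errors='coerce')
    return feet * 12 + inches

# Illustrative roster rows (made-up values)
players = pd.DataFrame({
    'player_height': ['6-2', '5-11', 'unknown'],
    'player_weight': [220, 195, 250],
})
players['height_in'] = height_to_inches(players['player_height'])

# Imperial BMI formula: 703 × weight (lbs) / height (inches)²
players['player_bmi'] = players['player_weight'] / players['height_in'] ** 2 * 703

print(players)
```

Unparseable heights become `NaN` instead of raising, so they can be handled together with other missing values.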

In [None]:
def create_nfl_features(df):
    """
    Create NFL-specific domain features
    
    Features:
    - Player role indicators (targeted receiver, passer, defender)
    - Route features (depth, lateral movement)
    - Physical attributes (BMI, age)
    - Side indicators (offense/defense)
    - Position-based features
    """
    print("🏈 Creating NFL domain features...\n")
    df_copy = df.copy()
    
    # 1. Player role encoding
    if 'player_role' in df_copy.columns:
        df_copy['is_targeted_receiver'] = (df_copy['player_role'] == 'Targeted Receiver').astype(int)
        df_copy['is_passer'] = (df_copy['player_role'] == 'Passer').astype(int)
        df_copy['is_defensive'] = (df_copy['player_role'] == 'Defensive Coverage').astype(int)
        print("   ✓ Player role indicators (is_targeted_receiver, is_passer, is_defensive)")
    
    # 2. Route features
    if 'x' in df_copy.columns:
        df_copy['route_depth'] = df_copy['x'].abs()
    
    if 'y' in df_copy.columns:
        field_center = 26.65
        df_copy['lateral_movement'] = df_copy['y'] - field_center
        df_copy['lateral_movement_abs'] = np.abs(df_copy['lateral_movement'])
        print("   ✓ Route features (route_depth, lateral_movement, lateral_movement_abs)")
    
    # 3. Physical attributes
    if 'player_height' in df_copy.columns and 'player_weight' in df_copy.columns:
        # BMI calculation (weight in lbs, height in inches)
        df_copy['player_bmi'] = (df_copy['player_weight'] / (df_copy['player_height'] ** 2)) * 703
        print("   ✓ Physical attributes (player_bmi)")
    
    # 4. Player age (if birth date available)
    if 'player_birth_date' in df_copy.columns:
        try:
            df_copy['player_birth_date'] = pd.to_datetime(df_copy['player_birth_date'], errors='coerce')
            reference_date = pd.Timestamp('2023-01-01')  # Approximate season date
            df_copy['player_age'] = (reference_date - df_copy['player_birth_date']).dt.days / 365.25
            print("   ✓ Player age (player_age)")
        except Exception:  # avoid a bare except, which would also swallow KeyboardInterrupt
            print("   ⚠️  Could not calculate player age")
    
    # 5. Side of ball
    if 'player_side' in df_copy.columns:
        # Compare case-insensitively so 'Offense' and 'offense' are both matched
        side = df_copy['player_side'].astype(str).str.lower()
        df_copy['is_offense'] = (side == 'offense').astype(int)
        df_copy['is_defense'] = (side == 'defense').astype(int)
        print("   ✓ Side indicators (is_offense, is_defense)")
    
    # 6. Position group encoding
    if 'player_position' in df_copy.columns:
        # Group positions
        receivers = ['WR', 'TE']
        backs = ['RB', 'FB']
        linemen = ['OL', 'DL', 'OT', 'OG', 'C', 'DE', 'DT', 'NT']
        linebackers = ['LB', 'ILB', 'MLB', 'OLB']
        defensive_backs = ['CB', 'S', 'SS', 'FS', 'DB']
        
        df_copy['is_receiver'] = df_copy['player_position'].isin(receivers).astype(int)
        df_copy['is_back'] = df_copy['player_position'].isin(backs).astype(int)
        df_copy['is_lineman'] = df_copy['player_position'].isin(linemen).astype(int)
        df_copy['is_linebacker'] = df_copy['player_position'].isin(linebackers).astype(int)
        df_copy['is_defensive_back'] = df_copy['player_position'].isin(defensive_backs).astype(int)
        df_copy['is_qb'] = (df_copy['player_position'] == 'QB').astype(int)
        print("   ✓ Position groups (is_receiver, is_back, is_lineman, is_linebacker, is_defensive_back, is_qb)")
    
    # 7. Play direction adjustment
    if 'play_direction' in df_copy.columns:
        df_copy['play_direction_left'] = (df_copy['play_direction'] == 'left').astype(int)
        df_copy['play_direction_right'] = (df_copy['play_direction'] == 'right').astype(int)
        print("   ✓ Play direction (play_direction_left, play_direction_right)")
    
    print(f"\n✅ NFL features created: {len([c for c in df_copy.columns if c not in df.columns])} new features")
    return df_copy

In [None]:
# Apply NFL features (continue from df, not df_with_temporal to avoid temporal leakage)
df = create_nfl_features(df)

In [None]:
# Visualize NFL features
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Player role distribution
if all(col in df.columns for col in ['is_targeted_receiver', 'is_passer', 'is_defensive']):
    role_counts = [
        df['is_targeted_receiver'].sum(),
        df['is_passer'].sum(),
        df['is_defensive'].sum(),
        len(df) - df['is_targeted_receiver'].sum() - df['is_passer'].sum() - df['is_defensive'].sum()
    ]
    role_labels = ['Targeted Receiver', 'Passer', 'Defensive Coverage', 'Other']
    axes[0, 0].bar(role_labels, role_counts, color=['green', 'blue', 'red', 'gray'], alpha=0.7)
    axes[0, 0].set_ylabel('Count', fontsize=12)
    axes[0, 0].set_title('Player Role Distribution', fontsize=14, fontweight='bold')
    axes[0, 0].tick_params(axis='x', rotation=45)

# 2. Route depth by role
if 'route_depth' in df.columns and 'is_targeted_receiver' in df.columns:
    receivers = df[df['is_targeted_receiver'] == 1]['route_depth'].dropna()
    others = df[df['is_targeted_receiver'] == 0]['route_depth'].dropna()
    axes[0, 1].hist([receivers, others], bins=30, label=['Targeted Receiver', 'Others'], alpha=0.7)
    axes[0, 1].set_xlabel('Route Depth (yards)', fontsize=12)
    axes[0, 1].set_ylabel('Frequency', fontsize=12)
    axes[0, 1].set_title('Route Depth by Player Role', fontsize=14, fontweight='bold')
    axes[0, 1].legend()

# 3. BMI distribution
if 'player_bmi' in df.columns:
    axes[1, 0].hist(df['player_bmi'].dropna(), bins=50, edgecolor='black', alpha=0.7, color='purple')
    axes[1, 0].axvline(df['player_bmi'].mean(), color='red', linestyle='--', linewidth=2, label='Mean')
    axes[1, 0].set_xlabel('Player BMI', fontsize=12)
    axes[1, 0].set_ylabel('Frequency', fontsize=12)
    axes[1, 0].set_title('Player BMI Distribution', fontsize=14, fontweight='bold')
    axes[1, 0].legend()

# 4. Speed by position group
if 's' in df.columns and any(col in df.columns for col in ['is_receiver', 'is_back', 'is_lineman']):
    position_speeds = []
    position_labels = []
    
    if 'is_receiver' in df.columns:
        position_speeds.append(df[df['is_receiver'] == 1]['s'].dropna())
        position_labels.append('Receiver')
    if 'is_back' in df.columns:
        position_speeds.append(df[df['is_back'] == 1]['s'].dropna())
        position_labels.append('Back')
    if 'is_lineman' in df.columns:
        position_speeds.append(df[df['is_lineman'] == 1]['s'].dropna())
        position_labels.append('Lineman')
    
    if position_speeds:
        axes[1, 1].boxplot(position_speeds, labels=position_labels)
        axes[1, 1].set_ylabel('Speed (yards/sec)', fontsize=12)
        axes[1, 1].set_title('Speed Distribution by Position Group', fontsize=14, fontweight='bold')
        axes[1, 1].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.savefig(config.OUTPUT_DIR / 'nfl_features.png', dpi=150, bbox_inches='tight')
plt.show()

print("✅ NFL features visualized")

## 7. Feature Importance Analysis 📊

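Analyze which features matter most for predicting player movement, using the impurity-based importances of a Random Forest model. The model in this section is fit on `target_x` only, so the ranking reflects the x-coordinate target.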
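
Once importances are available, a common follow-up is to ask how many of the top-ranked features are needed to reach a given share of the total importance. A minimal sketch with made-up importance values and a hypothetical helper `n_features_for_share`:

```python
import numpy as np

def n_features_for_share(importances, share):
    """Count the top-ranked features needed to reach `share` (0 < share <= 1) of total importance."""
    ranked = np.sort(np.asarray(importances, dtype=float))[::-1]  # largest first
    cumulative = np.cumsum(ranked) / ranked.sum()
    cumulative[-1] = 1.0  # guard against floating-point round-off in the last entry
    # Position of the first cumulative value that reaches the requested share
    return int(np.argmax(cumulative >= share)) + 1

# Illustrative importances (not output from the model in this notebook)
toy_importances = [0.05, 0.60, 0.10, 0.25]

print(n_features_for_share(toy_importances, 0.50))
print(n_features_for_share(toy_importances, 0.90))
```

Sorting inside the helper means the input does not need to be ranked beforehand.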

In [None]:
# Prepare data for feature importance analysis
print("📊 Preparing data for feature importance analysis...\n")

# Exclude ID columns and non-numeric features
exclude_cols = ['game_id', 'play_id', 'nfl_id', 'frame_id', 'target_x', 'target_y',
                'player_name', 'player_position', 'player_role', 'player_side',
                'play_direction', 'player_birth_date', 'player_to_predict']

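# select_dtypes catches every numeric width (e.g. int32 as well as int64) and skips booleans
numeric_cols = df.select_dtypes(include='number').columns
feature_cols = [col for col in numeric_cols if col not in exclude_cols]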

# Prepare features and targets
X = df[feature_cols].fillna(0)
y_x = df['target_x'].fillna(df['target_x'].median())
y_y = df['target_y'].fillna(df['target_y'].median())

print(f"✅ Data prepared")
print(f"   Features: {X.shape[1]}")
print(f"   Samples: {X.shape[0]:,}")

In [None]:
# Train Random Forest for feature importance
print("\n🌳 Training Random Forest for feature importance...\n")

# Use smaller sample for faster training
sample_size = min(20000, len(X))
rng = np.random.default_rng(config.RANDOM_STATE)  # seeded so the subsample is reproducible
sample_idx = rng.choice(len(X), size=sample_size, replace=False)

X_sample = X.iloc[sample_idx]
y_sample = y_x.iloc[sample_idx]

# Train model
rf_model = RandomForestRegressor(
    n_estimators=50,
    max_depth=10,
    min_samples_split=20,
    random_state=42,
    n_jobs=-1
)

rf_model.fit(X_sample, y_sample)

print("✅ Random Forest trained")

In [None]:
# Get feature importances
feature_importances = pd.DataFrame({
    'feature': feature_cols,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

# Display top 20 features
print("\n🏆 Top 20 Most Important Features:\n")
display(feature_importances.head(20))

# Save to CSV
feature_importances.to_csv(config.OUTPUT_DIR / 'feature_importances.csv', index=False)
print(f"\n✅ Feature importances saved to: {config.OUTPUT_DIR / 'feature_importances.csv'}")

In [None]:
# Visualize feature importances
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Top 20 features bar chart
top_n = 20
top_features = feature_importances.head(top_n)

axes[0].barh(range(len(top_features)), top_features['importance'], color='steelblue', alpha=0.8)
axes[0].set_yticks(range(len(top_features)))
axes[0].set_yticklabels(top_features['feature'])
axes[0].invert_yaxis()
axes[0].set_xlabel('Importance Score', fontsize=12)
axes[0].set_title(f'Top {top_n} Most Important Features', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# Cumulative importance
cumsum = feature_importances['importance'].cumsum()
axes[1].plot(range(1, len(cumsum) + 1), cumsum, linewidth=2, color='darkgreen')
axes[1].axhline(0.9, color='red', linestyle='--', linewidth=2, label='90% threshold')
axes[1].axhline(0.95, color='orange', linestyle='--', linewidth=2, label='95% threshold')
axes[1].set_xlabel('Number of Features', fontsize=12)
axes[1].set_ylabel('Cumulative Importance', fontsize=12)
axes[1].set_title('Cumulative Feature Importance', fontsize=14, fontweight='bold')
axes[1].legend()
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.savefig(config.OUTPUT_DIR / 'feature_importances.png', dpi=150, bbox_inches='tight')
plt.show()

# Calculate features needed for 90% and 95% importance
n_90 = (cumsum >= 0.90).argmax() + 1
n_95 = (cumsum >= 0.95).argmax() + 1

print(f"\n📈 Feature Selection Insights:")
print(f"   Features for 90% importance: {n_90}")
print(f"   Features for 95% importance: {n_95}")
print(f"   Total features: {len(feature_cols)}")

## 8. Feature Correlation Analysis 🔗

Analyze correlations between features and targets, and identify multicollinearity.
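
For the multicollinearity part, one simple check is to list feature pairs whose absolute Pearson correlation exceeds a threshold. The sketch below runs on a tiny synthetic frame; the helper `find_collinear_pairs` and the 0.9 cut-off are illustrative assumptions, not values taken from this notebook.

```python
import numpy as np
import pandas as pd

def find_collinear_pairs(features, threshold=0.9):
    """Return (feature_a, feature_b, |r|) for pairs with absolute correlation above threshold."""
    corr = features.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    pairs = pairs[pairs > threshold]
    return [(a, b, float(r)) for (a, b), r in pairs.items()]

# Synthetic features: 'b' is a linear function of 'a', 'c' is unrelated
toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})
toy['b'] = 2 * toy['a'] + 1
toy['c'] = [2.0, -1.0, 3.0, 0.0, 1.0, -2.0]

collinear = find_collinear_pairs(toy)
print(collinear)
```

In a feature set like this one, derived columns that are rescalings of each other (for example a distance and its normalized version) would be expected to show up in such a list.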

In [None]:
# Calculate correlations with targets
print("🔗 Calculating feature correlations with targets...\n")

correlations_x = []
correlations_y = []

for col in feature_cols:
    # Correlation with target_x
    valid_idx_x = (~X[col].isna()) & (~y_x.isna())
    if valid_idx_x.sum() > 100:
        corr_x, _ = pearsonr(X.loc[valid_idx_x, col], y_x[valid_idx_x])
        correlations_x.append({'feature': col, 'correlation': corr_x})
    
    # Correlation with target_y
    valid_idx_y = (~X[col].isna()) & (~y_y.isna())
    if valid_idx_y.sum() > 100:
        corr_y, _ = pearsonr(X.loc[valid_idx_y, col], y_y[valid_idx_y])
        correlations_y.append({'feature': col, 'correlation': corr_y})

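# Explicit columns keep the sort from failing if no feature passed the 100-row check
correlations_x_df = pd.DataFrame(correlations_x, columns=['feature', 'correlation']).sort_values('correlation', key=abs, ascending=False)
correlations_y_df = pd.DataFrame(correlations_y, columns=['feature', 'correlation']).sort_values('correlation', key=abs, ascending=False)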

print("✅ Correlations calculated")

In [None]:
# Display top correlations
print("\n🎯 Top 15 Features Correlated with Target X:\n")
display(correlations_x_df.head(15))

print("\n🎯 Top 15 Features Correlated with Target Y:\n")
display(correlations_y_df.head(15))

# Save correlations
correlations_x_df.to_csv(config.OUTPUT_DIR / 'correlations_target_x.csv', index=False)
correlations_y_df.to_csv(config.OUTPUT_DIR / 'correlations_target_y.csv', index=False)

print(f"\n✅ Correlations saved to: {config.OUTPUT_DIR}")

In [None]:
# Visualize top correlations
fig, axes = plt.subplots(1, 2, figsize=(18, 8))

# Top 15 correlations with X
top_15_x = correlations_x_df.head(15)
colors_x = ['green' if c > 0 else 'red' for c in top_15_x['correlation']]

axes[0].barh(range(len(top_15_x)), top_15_x['correlation'], color=colors_x, alpha=0.7)
axes[0].set_yticks(range(len(top_15_x)))
axes[0].set_yticklabels(top_15_x['feature'])
axes[0].invert_yaxis()
axes[0].axvline(0, color='black', linewidth=1)
axes[0].set_xlabel('Correlation with Target X', fontsize=12)
axes[0].set_title('Top 15 Features Correlated with Target X', fontsize=14, fontweight='bold')
axes[0].grid(axis='x', alpha=0.3)

# Top 15 correlations with Y
top_15_y = correlations_y_df.head(15)
colors_y = ['green' if c > 0 else 'red' for c in top_15_y['correlation']]

axes[1].barh(range(len(top_15_y)), top_15_y['correlation'], color=colors_y, alpha=0.7)
axes[1].set_yticks(range(len(top_15_y)))
axes[1].set_yticklabels(top_15_y['feature'])
axes[1].invert_yaxis()
axes[1].axvline(0, color='black', linewidth=1)
axes[1].set_xlabel('Correlation with Target Y', fontsize=12)
axes[1].set_title('Top 15 Features Correlated with Target Y', fontsize=14, fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig(config.OUTPUT_DIR / 'feature_correlations.png', dpi=150, bbox_inches='tight')
plt.show()

print("✅ Correlation visualizations saved")

In [None]:
# Feature-to-feature correlation (multicollinearity detection)
print("\n🔗 Analyzing feature-to-feature correlations...\n")

# Use top 30 most important features for correlation matrix
top_30_features = feature_importances.head(30)['feature'].tolist()
X_top = X[top_30_features]

# Calculate correlation matrix
corr_matrix = X_top.corr()

# Visualize correlation heatmap
fig, ax = plt.subplots(figsize=(16, 14))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5, cbar_kws={"shrink": 0.8})
ax.set_title('Feature Correlation Heatmap (Top 30 Features)', fontsize=16, fontweight='bold', pad=20)
plt.tight_layout()
plt.savefig(config.OUTPUT_DIR / 'feature_correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

# Find highly correlated feature pairs
high_corr_pairs = []
for i in range(len(corr_matrix.columns)):
    for j in range(i+1, len(corr_matrix.columns)):
        if abs(corr_matrix.iloc[i, j]) > 0.8:
            high_corr_pairs.append({
                'feature_1': corr_matrix.columns[i],
                'feature_2': corr_matrix.columns[j],
                'correlation': corr_matrix.iloc[i, j]
            })

if high_corr_pairs:
    high_corr_df = pd.DataFrame(high_corr_pairs).sort_values('correlation', key=abs, ascending=False)
    print("\n⚠️  Highly Correlated Feature Pairs (|r| > 0.8):")
    print("    Consider removing one from each pair to reduce multicollinearity\n")
    display(high_corr_df)
else:
    print("\n✅ No highly correlated feature pairs found (|r| > 0.8)")

print("\n✅ Multicollinearity analysis complete")
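
One way to act on the "remove one from each pair" advice is a greedy pass that walks the features in importance order and keeps a feature only if it is not highly correlated with anything already kept. This is a sketch under assumptions, not the notebook's method: `select_uncorrelated` is a hypothetical helper, and the toy matrix uses made-up names and values.

```python
import pandas as pd

def select_uncorrelated(corr_matrix, ranked_features, threshold=0.8):
    """Greedily keep features in ranked order, skipping any feature whose
    absolute correlation with an already-kept feature exceeds `threshold`."""
    kept = []
    for feature in ranked_features:
        too_similar = any(
            abs(corr_matrix.loc[feature, other]) > threshold for other in kept
        )
        if not too_similar:
            kept.append(feature)
    return kept

# Toy correlation matrix (illustrative values only)
toy_names = ['speed', 'momentum', 'dist_to_ball']
toy_corr_matrix = pd.DataFrame(
    [[1.00, 0.95, 0.10],
     [0.95, 1.00, 0.05],
     [0.10, 0.05, 1.00]],
    index=toy_names, columns=toy_names,
)

# Features are assumed to be listed from most to least important
toy_kept = select_uncorrelated(toy_corr_matrix, toy_names)
print(toy_kept)
```

In this notebook the equivalent call would presumably be `select_uncorrelated(corr_matrix, top_30_features)`, which relies on `feature_importances` being sorted by importance.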

---

## 🎉 Feature Engineering Complete!

### Summary:

✅ **Physics Features**: Velocity, acceleration, momentum, kinetic energy, direction differences  
✅ **Spatial Features**: Distance to ball, field position, sideline proximity, angles  
✅ **Temporal Features**: Position/speed/direction changes (⚠️ create these after the train/test split)  
✅ **NFL Features**: Player roles, routes, formations, physical attributes  
✅ **Feature Importance**: Identified most predictive features  
✅ **Correlation Analysis**: Analyzed relationships with targets and multicollinearity  

### Key Insights:

1. **Most Important Features**: Current position (x, y), speed, distance to ball, player role
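2. **Feature Selection**: The cumulative importance curve shows how many top-ranked features are needed to reach 90% (`n_90`) and 95% (`n_95`) of total importance
3. **Multicollinearity**: Feature pairs with |r| > 0.8 among the top 30 features are flagged; consider dropping one feature from each flagged pair
4. **Temporal Features**: Powerful, but they must be created after the train/test split to avoid leaking information across it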
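
The last insight can be sketched as follows: split by play first, then compute frame-to-frame changes separately on each side. This is a minimal illustration on synthetic rows, assuming hypothetical column names (`play_id`, `frame_id`, `x`) and a hypothetical helper (`add_temporal_features`); the actual dataset schema may differ.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Synthetic tracking rows; `play_id` and `frame_id` are assumed column names
toy = pd.DataFrame({
    'play_id': np.repeat([1, 2, 3, 4], 5),
    'frame_id': np.tile(np.arange(5), 4),
    'x': np.arange(20, dtype=float),
})

# Split by play so every frame of a play lands on the same side
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=42)
train_idx, test_idx = next(splitter.split(toy, groups=toy['play_id']))
train, test = toy.iloc[train_idx].copy(), toy.iloc[test_idx].copy()

def add_temporal_features(df):
    """Add the per-play frame-to-frame change in x."""
    df = df.sort_values(['play_id', 'frame_id'])
    # diff() within each play never looks across a play boundary;
    # the first frame of each play has no predecessor, so it is set to 0
    df['x_change'] = df.groupby('play_id')['x'].diff().fillna(0.0)
    return df

# Features are computed independently on each side of the split
train = add_temporal_features(train)
test = add_temporal_features(test)
```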

### Next Steps:

1. Use these features in `04_model_comparison.ipynb`
2. Experiment with feature selection (drop low-importance features)
3. Create interaction features
4. Try polynomial features for non-linear relationships

---