# NFL Defensive Formation Preprocessing

This notebook demonstrates the data preprocessing pipeline for NFL tracking data.

## Objectives
1. Load NFL Big Data Bowl 2021 tracking data
2. Standardize coordinate systems
3. Extract ball release frames
4. Create defensive formation point clouds
5. Validate preprocessing with visualizations

In [None]:
import sys
sys.path.append('../src')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from preprocessing import NFLDataPreprocessor, save_point_clouds, load_point_clouds
from visualization import NFLFieldVisualizer, validate_preprocessing

%matplotlib inline
plt.style.use('default')

## Step 1: Initialize Preprocessor

Make sure you have downloaded the NFL Big Data Bowl 2021 dataset and placed it in `../data/raw/`.
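
Before constructing the preprocessor, it can help to confirm that the download is complete. The sketch below is not part of `NFLDataPreprocessor`; `missing_raw_files` and `EXPECTED_FILES` are hypothetical names, and the file list assumes the standard Big Data Bowl 2021 layout (three metadata tables plus one tracking file per week). Adjust it if your copy of the data is organized differently.

```python
from pathlib import Path

# Assumed layout of the Big Data Bowl 2021 release:
# three metadata tables plus one tracking file per regular-season week.
EXPECTED_FILES = ["games.csv", "players.csv", "plays.csv"] + [
    f"week{week}.csv" for week in range(1, 18)
]


def missing_raw_files(raw_dir):
    """Return the expected file names that are not present in raw_dir."""
    raw_path = Path(raw_dir)
    return [name for name in EXPECTED_FILES if not (raw_path / name).is_file()]


missing = missing_raw_files("../data/raw")
if missing:
    print(f"Missing {len(missing)} file(s): {missing}")
else:
    print("All expected raw files found.")
```

An empty list means every expected file was found; a directory that does not exist simply reports all files as missing.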

In [None]:
# Initialize preprocessor with path to raw data
preprocessor = NFLDataPreprocessor("../data/raw")

## Step 2: Run Preprocessing Pipeline

This will:
- Load all data files
- Merge tracking data with metadata
- Standardize coordinates (flip plays so offense always goes left-to-right)
- Extract ball release frames
- Filter to defensive players only
- Create point cloud representations
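
The steps above can be sketched on a toy table as follows. This is an illustration of the idea, not the code inside `preprocess_pipeline`: the function names are hypothetical, and the column names (`playDirection`, `event`, `team`, `possessionTeam`, `homeTeamAbbr`), the `pass_forward` event label, and the 120 x 53.3 yard field are assumptions based on the public dataset description.

```python
import numpy as np
import pandas as pd

FIELD_LENGTH = 120.0     # yards, including both end zones (assumed)
FIELD_WIDTH = 160.0 / 3  # yards, about 53.3 (assumed)


def standardize_direction(df):
    """Mirror plays moving left so every offense moves left-to-right."""
    out = df.copy()
    left = out["playDirection"] == "left"
    out.loc[left, "x"] = FIELD_LENGTH - out.loc[left, "x"]
    out.loc[left, "y"] = FIELD_WIDTH - out.loc[left, "y"]
    return out


def defensive_point_clouds(df, release_event="pass_forward"):
    """Build {(gameId, playId): (n_defenders, 2) array} at the release frame."""
    at_release = df[df["event"] == release_event]
    # Assumption: tracking rows were already merged with play and game
    # metadata, so the side on offense follows from the possession team.
    offense_side = pd.Series(
        np.where(at_release["possessionTeam"] == at_release["homeTeamAbbr"],
                 "home", "away"),
        index=at_release.index,
    )
    is_defender = (at_release["team"] != "football") & (at_release["team"] != offense_side)
    defenders = at_release[is_defender]
    return {
        key: group[["x", "y"]].to_numpy()
        for key, group in defenders.groupby(["gameId", "playId"])
    }


# Toy data: one play, two frames, one player per side plus the ball.
toy = pd.DataFrame({
    "gameId": [1] * 6,
    "playId": [10] * 6,
    "frameId": [1, 1, 1, 2, 2, 2],
    "event": ["None"] * 3 + ["pass_forward"] * 3,
    "team": ["home", "away", "football"] * 2,
    "x": [80.0, 75.0, 79.0, 81.0, 70.0, 78.0],
    "y": [20.0, 30.0, 25.0, 21.0, 31.0, 26.0],
    "playDirection": ["left"] * 6,
    "possessionTeam": ["KC"] * 6,
    "homeTeamAbbr": ["KC"] * 6,
})
clouds = defensive_point_clouds(standardize_direction(toy))
print(clouds)
```

Mirroring both `x` and `y` amounts to a 180-degree rotation of the field, which keeps each defender's left/right relationship to the offense intact; flipping `x` alone would turn every leftward play into a mirror image of itself.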

In [None]:
# Run full pipeline (this may take a few minutes)
defensive_df, point_clouds = preprocessor.preprocess_pipeline(filter_pass_plays=True)

## Step 3: Inspect Results

In [None]:
# Check dataframe
print(f"Defensive DataFrame shape: {defensive_df.shape}")
print(f"\nColumns: {list(defensive_df.columns)}")
print("\nFirst few rows:")
defensive_df.head()

In [None]:
# Check point clouds
print(f"Number of plays: {len(point_clouds)}")
print(f"\nSample point cloud shape: {next(iter(point_clouds.values())).shape}")
print(f"\nAverage defenders per play: {np.mean([pc.shape[0] for pc in point_clouds.values()]):.1f}")
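
Beyond counts and averages, a quick structural check can flag plays worth a closer look. The sketch below assumes each point cloud is an `(n_defenders, 2)` array of `(x, y)` positions in yards on a 120 x 53.3 field; `find_suspect_clouds` is a hypothetical helper, not part of the `preprocessing` module.

```python
import numpy as np

FIELD_LENGTH = 120.0     # yards, including end zones (assumed)
FIELD_WIDTH = 160.0 / 3  # yards (assumed)
MAX_DEFENDERS = 11


def find_suspect_clouds(point_clouds):
    """Return {play_key: reason} for clouds that fail basic sanity checks."""
    problems = {}
    for key, cloud in point_clouds.items():
        cloud = np.asarray(cloud)
        if cloud.ndim != 2 or cloud.shape[1] != 2:
            problems[key] = f"unexpected shape {cloud.shape}"
        elif not 1 <= cloud.shape[0] <= MAX_DEFENDERS:
            problems[key] = f"{cloud.shape[0]} defenders"
        elif not np.isfinite(cloud).all():
            problems[key] = "non-finite coordinates"
        elif ((cloud < 0) | (cloud > [FIELD_LENGTH, FIELD_WIDTH])).any():
            problems[key] = "coordinates outside the field"
    return problems
```

Calling `find_suspect_clouds(point_clouds)` returns an empty dict when every play passes. A flagged play is not necessarily an error (a defender can legitimately be tracked just out of bounds), so treat the result as a list of plays to inspect rather than plays to drop.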

## Step 4: Visualize Sample Formations

In [None]:
# Create visualizer
visualizer = NFLFieldVisualizer()

# Plot sample formations
fig = visualizer.plot_multiple_formations(point_clouds, n_samples=6)
plt.show()

## Step 5: Check Formation Statistics

In [None]:
# Plot statistical distributions
fig = visualizer.plot_formation_statistics(point_clouds)
plt.show()
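
The exact quantities drawn by `plot_formation_statistics` are defined in the `visualization` module and are not shown here. As a rough illustration of the kind of per-formation descriptors such a plot could summarize, the sketch below computes a few simple geometric statistics; `formation_summary` and its field names are hypothetical.

```python
import numpy as np


def formation_summary(cloud):
    """Simple geometric descriptors of one (n_defenders, 2) formation."""
    cloud = np.asarray(cloud, dtype=float)
    centroid = cloud.mean(axis=0)
    return {
        "n_defenders": cloud.shape[0],
        "centroid_x": centroid[0],
        "centroid_y": centroid[1],
        "depth": np.ptp(cloud[:, 0]),  # extent along the field (x)
        "width": np.ptp(cloud[:, 1]),  # extent across the field (y)
        "mean_dist_to_centroid": np.linalg.norm(cloud - centroid, axis=1).mean(),
    }
```

Applied across all plays, for example with `pd.DataFrame([formation_summary(pc) for pc in point_clouds.values()])`, this yields one row per formation that can be inspected with `describe()` or plotted as histograms.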

## Step 6: Detailed Look at Individual Formations

In [None]:
# Pick an arbitrary play to examine (index 42 in the dict's insertion order, not a random draw)
sample_play_id = list(point_clouds.keys())[42]
sample_formation = point_clouds[sample_play_id]

print(f"Play: Game {sample_play_id[0]}, Play {sample_play_id[1]}")
print(f"Number of defenders: {sample_formation.shape[0]}")
print("\nDefender positions (x, y):")
print(sample_formation)

In [None]:
# Visualize this formation in detail
fig, ax = plt.subplots(figsize=(14, 7))
visualizer.plot_formation(
    sample_formation,
    ax=ax,
    title=f"Defensive Formation - Game {sample_play_id[0]}, Play {sample_play_id[1]}",
    annotate_players=True,
    player_size=300
)
plt.show()

## Step 7: Save Processed Data

Save the processed data for use in subsequent analysis (TDA computation).

In [None]:
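# Save defensive formations dataframe (create the output folder first if it is missing)
from pathlib import Path
Path("../data/processed").mkdir(parents=True, exist_ok=True)
defensive_df.to_csv("../data/processed/defensive_formations.csv", index=False)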
print("Saved defensive formations CSV")

# Save point clouds
save_point_clouds(point_clouds, "../data/processed/point_clouds.npy")
print("Saved point clouds")
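
Because plays can have different numbers of defenders, the point clouds do not stack into one regular array. One way to store such a dict in a single `.npy` file is to let NumPy pickle it as an object; the sketch below shows that approach. It is an assumption about how `save_point_clouds` and `load_point_clouds` might work, not their actual implementation, and the `*_sketch` names are hypothetical.

```python
import numpy as np


def save_clouds_sketch(point_clouds, path):
    """Store the whole dict as one pickled object inside a .npy file."""
    np.save(path, point_clouds, allow_pickle=True)


def load_clouds_sketch(path):
    """Inverse of save_clouds_sketch: unwrap the 0-d object array."""
    return np.load(path, allow_pickle=True).item()
```

Loading with `allow_pickle=True` can execute arbitrary code embedded in the file, so only load files you created or trust.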

## Step 8: Run Complete Validation

Generate all validation plots and save them to the results folder.

In [None]:
# Run validation and save plots
validate_preprocessing(
    defensive_df,
    point_clouds,
    output_dir="../results/figures/preprocessing"
)

## Summary

We have successfully:
- ✅ Loaded NFL Big Data Bowl 2021 tracking data
- ✅ Standardized field coordinates
- ✅ Extracted ball release frames
- ✅ Created defensive formation point clouds
- ✅ Validated preprocessing with visualizations

**Next steps:** 
- Compute persistent homology on these point clouds
- Extract topological features
- Analyze correlation with play outcomes