# Super Bowl Analytics with Polars

## Module 0 & 1: Environment Setup & Data Ingestion

Welcome to the **DataCamp Code-Along: Super Bowl Analytics with Polars**! 

In this session, we'll set out to answer the ultimate question:

> **"Can we mathematically predict the Super Bowl winner?"**

### Why Polars?

We're using **Polars** instead of Pandas for several compelling reasons:

- **Often 10-30x faster** for common data operations, depending on the workload
- **Memory efficient** - works with massive NFL datasets on laptops
- **Modern API** - functional, expression-based syntax
- **Lazy evaluation** - query optimization like a SQL engine

Let's get started!


## 1. Verify Dependencies

First, let's ensure all required packages are installed:


In [None]:
# Check package versions
import sys

print(f"Python version: {sys.version}")

# Core packages
import polars as pl
print(f"Polars version: {pl.__version__}")

import pyarrow
print(f"PyArrow version: {pyarrow.__version__}")

# NFL data
import nflreadpy as nfl
print("nflreadpy loaded successfully")

# Visualization
import plotly
print(f"Plotly version: {plotly.__version__}")

# ML
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")

print("\n✅ All dependencies loaded successfully!")


## 2. Project Setup

Our project follows data engineering best practices:

```
NFL Analytics/
├── data/
│   ├── raw/           # Immutable source data
│   └── processed/     # Transformed checkpoints
├── src/               # Reusable Python modules
├── assets/            # Static files (Madden ratings)
└── notebooks/         # Analysis notebooks
```


In [None]:
from pathlib import Path
import os
import sys  # needed below for sys.path; re-imported so this cell runs on its own
import time

# Set up project root
current_path = Path.cwd()
if not (current_path / "src").exists() and (current_path.parent / "src").exists():
    os.chdir(current_path.parent)

PROJECT_ROOT = Path.cwd()

# Add src to path for imports
sys.path.insert(0, str(PROJECT_ROOT / "src"))

print(f"Project root: {PROJECT_ROOT}")
print("\nDirectory structure:")
for item in sorted(PROJECT_ROOT.iterdir()):
    if not item.name.startswith("."):
        icon = "📁" if item.is_dir() else "📄"
        print(f"  {icon} {item.name}")


In [None]:
# Import our custom modules
from ingestion import load_pbp_cached, load_schedules_cached, load_players_cached, load_super_bowl_games
from cleaning import normalize_player_name, standardize_team_abbr
from features import epa_above_avg, rolling_epa, qb_season_stats
from viz import plot_epa_evolution, apply_dark_theme

print("✅ Custom modules loaded successfully!")
print("\nAvailable functions:")
print("  - ingestion: load_pbp_cached, load_schedules_cached, load_players_cached, load_super_bowl_games")
print("  - cleaning: normalize_player_name, standardize_team_abbr")
print("  - features: epa_above_avg, rolling_epa, qb_season_stats")
print("  - viz: plot_epa_evolution, apply_dark_theme")


---

## 3. Entering the NFLverse

We're using data from the **NFLverse**, a massive community effort led by Ben Baldwin, Sebastian Carl, Lee Sharpe, Tan Ho, and John Edwards to make NFL analytics accessible to everyone.

### Key Concept: LazyFrames

In Polars, we use `pl.scan_parquet()` instead of eagerly reading the whole file into memory.

**This does NOT load the data.** It creates a `LazyFrame`, which you can think of as a query plan.

Polars looks at your file, notes the schema, and **waits**. It won't actually touch the data until we ask for a result.


In [None]:
# Load PBP data - this creates a LazyFrame, NOT loaded into memory yet!
start_time = time.time()

pbp = load_pbp_cached(seasons=range(2000, 2026))

elapsed = time.time() - start_time
print(f"⏱️ Reference created in: {elapsed:.2f} seconds")
print(f"\nThis is a: {type(pbp).__name__}")


## 4. The Schema: Type Safety

**This is a key difference from Pandas.**

In Pandas, you might get `object` columns that mix strings and integers, which causes bugs.

Polars forces you to be disciplined: every column has a single, strict type. This can save you hours of debugging later.


In [None]:
# Look at the schema - Polars is STRICT about types
schema = pbp.collect_schema()

print(f"Total columns: {len(schema)}")
print("\nKey columns with their types:")
print("-" * 40)

key_cols = ["game_id", "season", "week", "posteam", "defteam", "epa", "wpa", "passer", "play_type"]
for col in key_cols:
    if col in schema:
        print(f"  {col:20s} {str(schema[col])}")


## 5. Polars vs Pandas: Speed Test 🏎️

Here's the magic: Polars optimizes the whole query plan before touching the data.

Let's compare **Polars** vs **Pandas** on the same operation:
- Filter to passing plays only
- Select 5 columns

**Watch the difference!**


In [None]:
# 🐼 PANDAS: Eager loading - loads EVERYTHING first
import pyarrow.parquet as pq
import pyarrow as pa

print("🐼 PANDAS: Loading and filtering...")
start_pandas = time.time()

# Load with PyArrow and cast dictionary columns
arrow_table = pq.read_table(PROJECT_ROOT / "data" / "raw" / "pbp_slim.parquet")

new_schema = []
for field in arrow_table.schema:
    if pa.types.is_dictionary(field.type):
        new_schema.append(pa.field(field.name, field.type.value_type))
    else:
        new_schema.append(field)

arrow_table = arrow_table.cast(pa.schema(new_schema))
pandas_df = arrow_table.to_pandas()

# Then filter to passing plays
pandas_passing = pandas_df[
    (pandas_df["play_type"] == "pass") & 
    (pandas_df["passer_id"].notna())
][["game_id", "season", "passer", "epa", "cpoe"]]

pandas_time = time.time() - start_pandas
print(f"⏱️ Pandas time: {pandas_time:.2f} seconds")
print(f"   Rows: {len(pandas_passing):,}")


In [None]:
# 🐻‍❄️ POLARS: Lazy evaluation - builds a plan, then executes optimally
print("🐻‍❄️ POLARS: Building query plan and executing...")
start_polars = time.time()

# Build the lazy query (this is instant - just a plan!)
passing_query = (
    pbp
    .filter(
        pl.col("play_type") == "pass",
        pl.col("passer_id").is_not_null()
    )
    .select(["game_id", "season", "passer", "epa", "cpoe"])
)

# Execute with .collect()
passing_plays = passing_query.collect()

polars_time = time.time() - start_polars
print(f"⏱️ Polars time: {polars_time:.4f} seconds")
print(f"   Rows: {len(passing_plays):,}")


In [None]:
# 🏆 THE VERDICT
speedup = pandas_time / polars_time if polars_time > 0 else float('inf')

print("\n" + "=" * 50)
print("  🏆 SPEED COMPARISON RESULTS")
print("=" * 50)
print(f"\n  🐼 Pandas:  {pandas_time:.3f} seconds")
print(f"  🐻‍❄️ Polars:  {polars_time:.3f} seconds")
print(f"\n  🚀 Polars is {speedup:.1f}x FASTER!")
print("\n" + "=" * 50)


In [None]:
# WHY is Polars faster? Let's look at the query plan
print("Why is Polars faster? Look at the optimized query plan:")
print("=" * 50)
print(passing_query.explain())
print("=" * 50)
print("\n👆 Polars only reads the columns the query needs:")
print("   the 5 selected columns plus the 2 used in the filter.")
print(f"   Pandas loaded ALL {len(pandas_df.columns)} columns, then threw most away.")


## 6. Loading Schedule & Super Bowl Data

The schedule data contains game-level metadata:
- Final scores
- Betting lines (spread, total)
- Game type (regular season, playoffs, **Super Bowl**)


In [None]:
# Load schedule data
schedules = load_schedules_cached(seasons=range(2000, 2026))

# Filter to Super Bowl games only
super_bowls = load_super_bowl_games(schedules)
sb_df = super_bowls.collect()

print(f"Super Bowls in dataset: {len(sb_df)}")
print("\nRecent Super Bowls:")
sb_df.select([
    "season", "game_id", "home_team", "away_team", 
    "home_score", "away_score", "spread_line"
]).tail(10)


## 7. Understanding EPA (Expected Points Added)

**EPA** is *the* foundational metric for modern NFL analytics.

Imagine it's 1st & 10 at your own 20. Historical data says you're expected to score ~0.5 points on this drive.

- **Play A:** You run for 30 yards to midfield. Expected points jumps to ~2.5. **EPA = +2.0**
- **Play B:** You throw an interception. Expected points drops to -3. **EPA = -3.5**

It assigns a concrete value to every single play based on context (down, distance, field position).
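
The arithmetic behind the two plays above can be sketched in a few lines. The function name is hypothetical, and the expected-points values are the approximate figures from the example, not output from a real expected-points model.

```python
def expected_points_added(ep_before: float, ep_after: float) -> float:
    """EPA is the change in expected points caused by a single play."""
    return ep_after - ep_before


# 1st & 10 at your own 20: roughly 0.5 expected points.
start_ep = 0.5

# Play A: a 30-yard run to midfield lifts expected points to about 2.5.
play_a = expected_points_added(start_ep, 2.5)

# Play B: an interception drops expected points to about -3.
play_b = expected_points_added(start_ep, -3.0)

print(f"Play A EPA: {play_a:+.1f}")
print(f"Play B EPA: {play_b:+.1f}")
```

In the real dataset, the `epa` column already holds this difference for every play, so there is no need to compute it by hand.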


In [None]:
# Quick data quality check
print("📊 DATA QUALITY SUMMARY")
print("=" * 50)

seasons = pbp.select("season").unique().collect()["season"].to_list()
print(f"\n✅ Seasons covered: {min(seasons)} - {max(seasons)}")

total_plays = pbp.select(pl.len()).collect().item()
print(f"✅ Total plays: {total_plays:,}")

print(f"✅ Super Bowls: {len(sb_df)}")

print("\n" + "=" * 50)
print("Data is ready! Next: Era of the Quarterback →")


---

## Next Steps

We've successfully set up and loaded our core datasets! In the next notebook, we'll:

1. **Analyze the evolution of passing** in the NFL
2. **Visualize the "Era of the Quarterback"**
3. **TYPE ALONG**: Build a Polars query together!

Continue to **[02_era_of_the_quarterback.ipynb](live_02_era_of_the_quarterback.ipynb)** →
