# Story 3.2: Statistical Exploration & Relationship Analysis

**Branch:** `story/3-2-statistical-exploration-relationship-analysis`  
**Kernel CWD:** `notebooks/eda/` (paths below are relative to this directory)  
**Data:** All 41 seasons (1985–2025), 196,716 deduplicated games

## Sections
1. [Scoring Distributions](#section-1) — AC 1
2. [Venue Effects / Home Court Advantage](#section-2) — AC 2
3. [Seed Patterns & Upset Rates](#section-3) — AC 5
4. [Conference-Strength Analysis](#section-4) — AC 4
5. [Statistical Correlations with Tournament Outcomes](#section-5) — AC 3
6. [Findings Summary & Epic 4 Recommendations](#section-6) — AC 7

**Constraints:**
- All charts use Plotly (dark mode)
- No `iterrows()` — vectorized operations only
- 2025 data deduplicated (ESPN records preferred over Kaggle duplicates)
- 2020 excluded from tournament analyses (COVID — no tournament)
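
The ESPN-over-Kaggle preference can be checked on toy data without any row loops. The sketch below is an illustration only: the game IDs and team IDs are made up, the `espn_` prefix is assumed to mark ESPN records, and it uses a sort-then-`drop_duplicates` approach as an alternative to a set-membership filter.

```python
import pandas as pd

# Toy rows (hypothetical IDs): the first two describe the same game from two sources.
toy = pd.DataFrame({
    "game_id": ["espn_401", "2025_10_1101_1102", "2025_12_1103_1104"],
    "w_team_id": [1101, 1101, 1103],
    "l_team_id": [1102, 1102, 1104],
    "day_num": [10, 10, 12],
})

key = ["w_team_id", "l_team_id", "day_num"]
toy["is_espn"] = toy["game_id"].str.startswith("espn_")

# Sort ESPN rows first (stable sort), then keep the first row per matchup key.
deduped = (
    toy.sort_values("is_espn", ascending=False, kind="stable")
    .drop_duplicates(subset=key, keep="first")
    .drop(columns="is_espn")
    .reset_index(drop=True)
)
print(deduped["game_id"].tolist())
```

A Kaggle row survives only when no ESPN row shares its `(w_team_id, l_team_id, day_num)` key.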

In [1]:
# ── Imports ───────────────────────────────────────────────────────────────────
from pathlib import Path

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from scipy import stats

from ncaa_eval.ingest import ParquetRepository

# ── Constants ─────────────────────────────────────────────────────────────────
TEMPLATE = "plotly_dark"
COLORS = {
    "positive": "#28a745",  # Green
    "negative": "#dc3545",  # Red
    "neutral":  "#6c757d",  # Gray
    "accent":   "#17a2b8",  # Teal
    "warning":  "#ffc107",  # Yellow
}

# Paths relative to kernel CWD (notebooks/eda/)
KAGGLE_DIR = Path("../../data/kaggle/")
DATA_DIR = Path("../../data/")

print("Imports OK")
print(f"KAGGLE_DIR exists: {KAGGLE_DIR.exists()}")

Imports OK
KAGGLE_DIR exists: True


In [2]:
# ── Load compact game data from ParquetRepository (all 41 seasons) ─────────────
repo = ParquetRepository(base_path=DATA_DIR)
seasons = repo.get_seasons()

games_dfs = []
for s in seasons:
    games = repo.get_games(s.year)
    if games:
        games_dfs.append(pd.DataFrame([g.model_dump() for g in games]))

all_games_df = pd.concat(games_dfs, ignore_index=True)
print(f"Loaded {len(all_games_df):,} games across {all_games_df['season'].nunique()} seasons")
print(f"Seasons: {all_games_df['season'].min()}–{all_games_df['season'].max()}")

Loaded 201,261 games across 41 seasons
Seasons: 1985–2025


In [3]:
# ── 2025 Deduplication ─────────────────────────────────────────────────────────
# Story 3.1 finding: 4,545 games stored twice (ESPN + Kaggle IDs) for 2025.
# Prefer ESPN records (API-verified loc and num_ot); drop Kaggle duplicates
# where an ESPN record exists for the same (w_team_id, l_team_id, day_num).

games_2025 = all_games_df[all_games_df["season"] == 2025].copy()
games_2025["matchup_key"] = (
    games_2025["w_team_id"].astype(str) + "_"
    + games_2025["l_team_id"].astype(str) + "_"
    + games_2025["day_num"].astype(str)
)

is_espn = games_2025["game_id"].str.startswith("espn_")
espn_keys = set(games_2025[is_espn]["matchup_key"])
kaggle_only = games_2025[~is_espn & ~games_2025["matchup_key"].isin(espn_keys)]
games_2025_dedup = pd.concat([games_2025[is_espn], kaggle_only], ignore_index=True).drop(
    columns=["matchup_key"]
)

all_games_dedup = pd.concat(
    [all_games_df[all_games_df["season"] != 2025], games_2025_dedup],
    ignore_index=True,
)

# Derived columns
all_games_dedup["margin"] = all_games_dedup["w_score"] - all_games_dedup["l_score"]
all_games_dedup["year"] = all_games_dedup["season"]

# Venue label — must be added before reg_games/tourney_games splits
LOC_LABEL = {"H": "Home Win", "A": "Away Win", "N": "Neutral Site"}
all_games_dedup["loc_label"] = all_games_dedup["loc"].map(LOC_LABEL)

print(f"2025: {len(games_2025):,} rows → {len(games_2025_dedup):,} after dedup (removed {len(games_2025)-len(games_2025_dedup):,} dupes)")
print(f"Total deduplicated games: {len(all_games_dedup):,}")
print(f"Tournament games: {all_games_dedup['is_tournament'].sum():,}")
print(f"Regular season games: {(~all_games_dedup['is_tournament']).sum():,}")

# Convenience splits
reg_games = all_games_dedup[~all_games_dedup["is_tournament"]].copy()
tourney_games = all_games_dedup[all_games_dedup["is_tournament"]].copy()
print(f"\nTournament seasons (excl. 2020 COVID): {tourney_games['season'].nunique()} unique seasons")
print(f"2020 tournament games (expect 0): {(tourney_games['season'] == 2020).sum()}")

2025: 11,454 rows → 6,909 after dedup (removed 4,545 dupes)
Total deduplicated games: 196,716
Tournament games: 2,518
Regular season games: 194,198



Tournament seasons (excl. 2020 COVID): 39 unique seasons
2020 tournament games (expect 0): 0


In [4]:
# ── Load Kaggle CSV files ──────────────────────────────────────────────────────
reg_detail   = pd.read_csv(KAGGLE_DIR / "MRegularSeasonDetailedResults.csv")
tourney_detail = pd.read_csv(KAGGLE_DIR / "MNCAATourneyDetailedResults.csv")
tourney_compact = pd.read_csv(KAGGLE_DIR / "MNCAATourneyCompactResults.csv")
seeds_raw    = pd.read_csv(KAGGLE_DIR / "MNCAATourneySeeds.csv")
conferences  = pd.read_csv(KAGGLE_DIR / "MTeamConferences.csv")

print(f"RegularSeasonDetailedResults: {len(reg_detail):,} rows, seasons {reg_detail['Season'].min()}–{reg_detail['Season'].max()}")
print(f"TourneyDetailedResults:       {len(tourney_detail):,} rows, seasons {tourney_detail['Season'].min()}–{tourney_detail['Season'].max()}")
print(f"TourneyCompactResults:        {len(tourney_compact):,} rows, seasons {tourney_compact['Season'].min()}–{tourney_compact['Season'].max()}")
print(f"TourneySeeds:                 {len(seeds_raw):,} rows, seasons {seeds_raw['Season'].min()}–{seeds_raw['Season'].max()}")
print(f"TeamConferences:              {len(conferences):,} rows, seasons {conferences['Season'].min()}–{conferences['Season'].max()}")

RegularSeasonDetailedResults: 118,882 rows, seasons 2003–2025
TourneyDetailedResults:       1,382 rows, seasons 2003–2024
TourneyCompactResults:        2,518 rows, seasons 1985–2024
TourneySeeds:                 2,626 rows, seasons 1985–2025
TeamConferences:              13,388 rows, seasons 1985–2025


---
<a id='section-1'></a>
## Section 1: Scoring Distributions (AC 1)

Visualize scoring patterns across all 41 seasons (1985–2025). Includes:
- Winning margin distribution
- Winner/loser score distributions
- Per-season average margin trend
- Scoring by venue (Home/Away/Neutral)
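
Winning margins are at least 1 point and have a long right tail, so the mean is expected to sit above the median. A minimal sketch of that check on synthetic data (the gamma parameters are arbitrary assumptions, not fitted to the real games):

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed "margins" (assumed shape/scale, for illustration only).
rng = np.random.default_rng(0)
margins = np.round(rng.gamma(shape=1.7, scale=7.0, size=10_000)) + 1

mean_margin = margins.mean()
median_margin = np.median(margins)
skewness = stats.skew(margins)  # positive for a right-skewed sample

print(f"mean={mean_margin:.1f}, median={median_margin:.1f}, skew={skewness:.2f}")
```

A positive skewness together with mean > median is the pattern to look for in the real margin histogram.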

In [5]:
# ── 2.1: Winning margin histogram (all seasons combined) ───────────────────────
# Exclude extreme OT games (num_ot >= 4) as flagged in Story 3.1
df_no_extreme_ot = all_games_dedup[all_games_dedup["num_ot"] < 4]
n_extreme_ot = len(all_games_dedup) - len(df_no_extreme_ot)  # computed, not hard-coded
normal_games_pct = len(df_no_extreme_ot) / len(all_games_dedup) * 100

fig = px.histogram(
    df_no_extreme_ot,
    x="margin",
    nbins=60,
    title=f"Winning Margin Distribution — All Seasons 1985–2025<br><sup>Excluding {n_extreme_ot} extreme OT games (≥4 OT); {normal_games_pct:.1f}% of games shown</sup>",
    labels={"margin": "Winning Margin (points)", "count": "Games"},
    template=TEMPLATE,
    color_discrete_sequence=[COLORS["accent"]],
)

mean_margin = df_no_extreme_ot["margin"].mean()
median_margin = df_no_extreme_ot["margin"].median()

fig.add_vline(x=mean_margin, line_dash="dash", line_color=COLORS["positive"],
              annotation_text=f"Mean: {mean_margin:.1f}", annotation_position="top right")
fig.add_vline(x=median_margin, line_dash="dot", line_color=COLORS["warning"],
              annotation_text=f"Median: {median_margin:.1f}", annotation_position="top left")
fig.update_layout(xaxis_title="Winning Margin (points)", yaxis_title="Number of Games")
fig.show()

print(f"Margin stats: mean={mean_margin:.1f}, median={median_margin:.1f}, std={df_no_extreme_ot['margin'].std():.1f}")
print(f"Games with margin > 60: {(all_games_dedup['margin'] > 60).sum()} (historical outliers, Story 3.1 confirmed real)")

Margin stats: mean=12.1, median=10.0, std=9.4
Games with margin > 60: 163 (historical outliers, Story 3.1 confirmed real)


In [6]:
# ── 2.2: W/L score distributions with mean/median summary stats ───────────────
scores_long = pd.concat([
    all_games_dedup[["w_score"]].rename(columns={"w_score": "score"}).assign(result="Winner"),
    all_games_dedup[["l_score"]].rename(columns={"l_score": "score"}).assign(result="Loser"),
])

fig = px.histogram(
    scores_long,
    x="score",
    color="result",
    nbins=60,
    barmode="overlay",
    opacity=0.7,
    title="Winner vs. Loser Score Distributions — All Seasons 1985–2025",
    labels={"score": "Points Scored", "count": "Games", "result": "Team"},
    template=TEMPLATE,
    color_discrete_map={"Winner": COLORS["positive"], "Loser": COLORS["negative"]},
)

for label, col in [("Winner", "w_score"), ("Loser", "l_score")]:
    mean_val = all_games_dedup[col].mean()
    median_val = all_games_dedup[col].median()
    print(f"{label}: mean={mean_val:.1f}, median={median_val:.1f}, std={all_games_dedup[col].std():.1f}")

fig.update_layout(yaxis_title="Number of Games", legend_title="Team")
fig.show()

Winner: mean=76.9, median=76.0, std=11.8
Loser: mean=64.8, median=64.0, std=11.2


In [7]:
# ── 2.3: Per-season average margin trend (1985–2025) ──────────────────────────
season_margin = (
    all_games_dedup.groupby("season")["margin"]
    .agg(["mean", "median", "std"])
    .reset_index()
)

fig = go.Figure()

fig.add_trace(go.Scatter(
    x=season_margin["season"],
    y=season_margin["mean"],
    mode="lines+markers",
    name="Mean margin",
    line=dict(color=COLORS["accent"], width=2),
    marker=dict(size=4),
))

fig.add_trace(go.Scatter(
    x=season_margin["season"],
    y=season_margin["median"],
    mode="lines",
    name="Median margin",
    line=dict(color=COLORS["warning"], width=1.5, dash="dash"),
))

# Annotate COVID 2020 gap
fig.add_annotation(
    x=2020, y=season_margin[season_margin["season"] == 2020]["mean"].values[0],
    text="2020 COVID<br>(no tournament)",
    showarrow=True, arrowhead=2, arrowcolor=COLORS["negative"],
    font=dict(color=COLORS["negative"]),
    ax=-60, ay=-40,
)

fig.update_layout(
    title="Average Winning Margin by Season (1985–2025)",
    xaxis_title="Season",
    yaxis_title="Margin (points)",
    template=TEMPLATE,
    legend=dict(x=0.01, y=0.99),
)
fig.show()

print("Season margin summary:")
print(season_margin[["season", "mean", "median"]].to_string(index=False))

Season margin summary:


 season      mean  median
   1985 10.827105     9.0
   1986 11.171867     9.0
   1987 11.213675     9.0
   1988 12.076406    10.0
   1989 12.194146    10.0
   1990 11.981986    10.0
   1991 12.670569    10.0
   1992 12.842243    11.0
   1993 12.249197    10.0
   1994 12.576279    11.0
   1995 12.689855    10.0
   1996 12.488650    10.0
   1997 12.542437    10.0
   1998 12.615603    11.0
   1999 12.629172    10.0
   2000 12.427543    10.0
   2001 12.269036    10.0
   2002 12.339900    10.0
   2003 12.024145    10.0
   2004 11.970658    10.0
   2005 12.033129    10.0
   2006 11.895872    10.0
   2007 12.068533    10.0
   2008 12.228238    10.0
   2009 12.063618    10.0
   2010 12.150554    10.0
   2011 11.895727    10.0
   2012 11.937782    10.0
   2013 11.944310    10.0
   2014 11.640265    10.0
   2015 11.655968    10.0
   2016 11.981420    10.0
   2017 12.016477    10.0
   2018 12.150950    10.0
   2019 12.101447    10.0
   2020 11.947823    10.0
   2021 12.008671    10.0
   2022 11.7

In [8]:
# ── 2.4: Score distributions by venue (box plots per loc) ─────────────────────
# LOC_LABEL dict and loc_label column defined in the deduplication cell (cell 3)

fig = px.box(
    all_games_dedup,
    x="loc_label",
    y="margin",
    color="loc_label",
    category_orders={"loc_label": ["Home Win", "Away Win", "Neutral Site"]},
    title="Winning Margin Distribution by Venue Type",
    labels={"loc_label": "Venue", "margin": "Winning Margin (points)"},
    template=TEMPLATE,
    color_discrete_map={
        "Home Win": COLORS["positive"],
        "Away Win": COLORS["negative"],
        "Neutral Site": COLORS["neutral"],
    },
)
fig.update_layout(showlegend=False)
fig.show()

# 2.5: Win-by-location summary table
loc_summary = (
    all_games_dedup.groupby("loc_label")
    .agg(
        games=("margin", "count"),
        avg_margin=("margin", "mean"),
        median_margin=("margin", "median"),
        avg_w_score=("w_score", "mean"),
        avg_l_score=("l_score", "mean"),
    )
    .reset_index()
    .sort_values("games", ascending=False)
)
loc_summary["avg_margin"] = loc_summary["avg_margin"].round(2)
loc_summary["median_margin"] = loc_summary["median_margin"].round(2)
loc_summary["avg_w_score"] = loc_summary["avg_w_score"].round(1)
loc_summary["avg_l_score"] = loc_summary["avg_l_score"].round(1)

print("\nWin-by-location summary:")
print(loc_summary.to_string(index=False))


Win-by-location summary:
   loc_label  games  avg_margin  median_margin  avg_w_score  avg_l_score
    Home Win 111292       13.61           11.0         77.9         64.3
    Away Win  57823        9.48            8.0         75.1         65.6
Neutral Site  27601       11.44            9.0         76.4         65.0


---
<a id='section-2'></a>
## Section 2: Venue Effects / Home Court Advantage (AC 2)

Quantify the home court advantage and track how it has evolved from 1985–2025.
Analysis uses **regular-season only** games (tournament games are predominantly neutral-site).
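
To judge whether a home win rate is distinguishable from 50%, a confidence interval helps. The sketch below uses a normal-approximation interval; `win_rate_ci` is a hypothetical helper, and the counts are the regular-season home-win and away-win totals from this notebook's venue summary.

```python
import math


def win_rate_ci(wins: int, games: int, z: float = 1.96) -> tuple[float, float, float]:
    """Return (rate, lower, upper) using a normal-approximation interval."""
    p = wins / games
    se = math.sqrt(p * (1 - p) / games)
    return p, p - z * se, p + z * se


# Regular-season, non-neutral games: home wins and away wins.
home_wins, away_wins = 111_292, 57_823
rate, lower, upper = win_rate_ci(home_wins, home_wins + away_wins)
print(f"home win rate = {rate:.3f}, 95% CI ≈ [{lower:.3f}, {upper:.3f}]")
```

With this many games the interval is narrow; for per-season rates the same helper gives wider intervals, which matters when reading the season-by-season trend.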

In [9]:
# ── 3.1: Home win rate by season (regular season, non-neutral games only) ──────
non_neutral = reg_games[reg_games["loc"] != "N"].copy()

home_by_season = (
    non_neutral.groupby("season")
    .apply(lambda g: pd.Series({
        "home_win_rate": (g["loc"] == "H").mean(),
        "total_games": len(g),
    }), include_groups=False)
    .reset_index()
)

print(f"Non-neutral regular season games: {len(non_neutral):,}")
print(f"Overall home win rate (1985–2025): {(non_neutral['loc'] == 'H').mean():.3f}")

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=home_by_season["season"],
    y=home_by_season["home_win_rate"],
    mode="lines+markers",
    name="Home win rate",
    line=dict(color=COLORS["positive"], width=2),
    marker=dict(size=4),
))
fig.add_hline(y=0.5, line_dash="dot", line_color=COLORS["neutral"],
              annotation_text="50% (no advantage)", annotation_position="right")
overall_rate = (non_neutral["loc"] == "H").mean()
fig.add_hline(y=overall_rate, line_dash="dash", line_color=COLORS["warning"],
              annotation_text=f"Average: {overall_rate:.3f}", annotation_position="left")
fig.update_layout(
    title="Home Win Rate by Season — Regular Season (Non-Neutral Games)",
    xaxis_title="Season",
    yaxis_title="Home Win Rate",
    template=TEMPLATE,
    yaxis=dict(range=[0.45, 0.75]),
)
fig.show()

Non-neutral regular season games: 169,115
Overall home win rate (1985–2025): 0.658


In [10]:
# ── 3.2: Average margin by venue (bar chart) ───────────────────────────────────
venue_margin = (
    reg_games.groupby("loc_label")
    .agg(avg_margin=("margin", "mean"), count=("margin", "count"))
    .reset_index()
    .sort_values("avg_margin", ascending=False)
)

fig = px.bar(
    venue_margin,
    x="loc_label",
    y="avg_margin",
    color="loc_label",
    text="avg_margin",
    title="Average Winning Margin by Venue — Regular Season Games",
    labels={"loc_label": "Venue", "avg_margin": "Avg. Winning Margin (points)"},
    template=TEMPLATE,
    color_discrete_map={
        "Home Win": COLORS["positive"],
        "Away Win": COLORS["negative"],
        "Neutral Site": COLORS["neutral"],
    },
)
fig.update_traces(texttemplate="%{text:.1f} pts", textposition="outside")
fig.update_layout(showlegend=False, yaxis_title="Avg. Winning Margin (points)")
fig.show()

print("Venue margin summary:")
print(venue_margin[["loc_label", "avg_margin", "count"]].to_string(index=False))
home_margin = venue_margin[venue_margin["loc_label"]=="Home Win"]["avg_margin"].values[0]
away_margin = venue_margin[venue_margin["loc_label"]=="Away Win"]["avg_margin"].values[0]
neutral_margin = venue_margin[venue_margin["loc_label"]=="Neutral Site"]["avg_margin"].values[0]
print(f"\nHome advantage over neutral: {home_margin - neutral_margin:+.1f} pts")
print(f"Away disadvantage vs neutral: {away_margin - neutral_margin:+.1f} pts")

Venue margin summary:
   loc_label  avg_margin  count
    Home Win   13.605147 111292
Neutral Site   11.405733  25083
    Away Win    9.478979  57823

Home advantage over neutral: +2.2 pts
Away disadvantage vs neutral: -1.9 pts


In [11]:
# ── 3.3: Has home advantage changed over time? ─────────────────────────────────
# Add trend line (linear regression) to home_win_rate vs season
valid = home_by_season.dropna(subset=["home_win_rate"])
slope, intercept, r_val, p_val, std_err = stats.linregress(valid["season"], valid["home_win_rate"])
trend_y = slope * valid["season"] + intercept

fig = go.Figure()
fig.add_trace(go.Scatter(
    x=valid["season"], y=valid["home_win_rate"],
    mode="lines+markers", name="Home win rate",
    line=dict(color=COLORS["positive"], width=2), marker=dict(size=4),
))
fig.add_trace(go.Scatter(
    x=valid["season"], y=trend_y,
    mode="lines", name=f"Trend (slope={slope*10:.4f}/decade)",
    line=dict(color=COLORS["warning"], width=2, dash="dash"),
))
fig.update_layout(
    title=f"Home Win Rate Trend 1985–2025<br><sup>slope={slope:.5f}/season, R²={r_val**2:.3f}, p={p_val:.4f}</sup>",
    xaxis_title="Season", yaxis_title="Home Win Rate",
    template=TEMPLATE,
)
fig.show()

direction = "decreasing" if slope < 0 else "increasing"
significant = "statistically significant" if p_val < 0.05 else "NOT statistically significant"
print(f"Home advantage is {direction} at {abs(slope)*10:.4f}/decade ({significant}, p={p_val:.4f})")

Home advantage is decreasing at 0.0048/decade (statistically significant, p=0.0006)


In [12]:
# ── 3.4: Neutral site breakdown — tournament vs. non-tournament by season ──────
neutral_games = all_games_dedup[all_games_dedup["loc"] == "N"].copy()

neutral_by_season = (
    neutral_games.groupby(["season", "is_tournament"])
    .size()
    .reset_index(name="games")
)
neutral_by_season["type"] = neutral_by_season["is_tournament"].map(
    {True: "Tournament", False: "Non-Tournament (conf. tourneys etc.)"}
)

# Fraction that is tournament
total_neutral = neutral_games.groupby("season").size().reset_index(name="total")
tourney_neutral = neutral_games[neutral_games["is_tournament"]].groupby("season").size().reset_index(name="tourney")
neutral_breakdown = total_neutral.merge(tourney_neutral, on="season", how="left").fillna(0)
neutral_breakdown["pct_tourney"] = neutral_breakdown["tourney"] / neutral_breakdown["total"] * 100

fig = px.bar(
    neutral_by_season,
    x="season", y="games", color="type",
    barmode="stack",
    title="Neutral-Site Games: Tournament vs. Non-Tournament by Season",
    labels={"season": "Season", "games": "Number of Games", "type": "Game Type"},
    template=TEMPLATE,
    color_discrete_map={
        "Tournament": COLORS["accent"],
        "Non-Tournament (conf. tourneys etc.)": COLORS["neutral"],
    },
)
fig.show()

overall_pct = neutral_games["is_tournament"].mean() * 100
print(f"Overall: {overall_pct:.1f}% of neutral-site games are tournament games")

Overall: 9.1% of neutral-site games are tournament games


---
<a id='section-3'></a>
## Section 3: Seed Patterns & Upset Rates (AC 5)

Analyze NCAA tournament seeding outcomes from 1985–2024 (2020 excluded — COVID).
Includes upset rates by seed matchup, average tournament wins by seed, notable upsets, and First Four analysis.
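
Seed strings carry a region letter, a two-digit seed number, and an optional `a`/`b` suffix for play-in teams. A minimal sketch of the parsing and of the upset definition on made-up seed strings (`is_upset` is a hypothetical helper, not part of the notebook):

```python
import pandas as pd

# Made-up seed strings in the region + number (+ play-in suffix) format.
toy_seeds = pd.DataFrame({"Seed": ["W01", "X12", "Y16a", "Y16b"]})
toy_seeds["seed_num"] = toy_seeds["Seed"].str.extract(r"(\d+)", expand=False).astype(int)
toy_seeds["is_play_in"] = toy_seeds["Seed"].str.contains(r"[ab]$", regex=True)


def is_upset(w_seed: int, l_seed: int) -> bool:
    """An upset means the winner had the higher (worse) seed number."""
    return w_seed > l_seed


print(toy_seeds)
print(is_upset(12, 5), is_upset(5, 12), is_upset(16, 16))
```

Equal-seed games, such as a First Four game between two 16-seeds, are never upsets under this definition.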

In [13]:
# ── 4.1: Parse seed numbers from MNCAATourneySeeds.csv ────────────────────────
seeds = seeds_raw.copy()
seeds["seed_num"] = seeds["Seed"].str.extract(r"(\d+)").astype(int)
seeds["region"] = seeds["Seed"].str[0]   # W, X, Y, Z
seeds["is_play_in"] = seeds["Seed"].str.contains(r"[ab]$", regex=True)

print(f"Seeds loaded: {len(seeds):,} rows")
print(f"Play-in teams: {seeds['is_play_in'].sum()}")
print(f"Seed number range: {seeds['seed_num'].min()}–{seeds['seed_num'].max()}")
# Note: the 68-team field (First Four) began in 2011, so `> 2011` counts 2012 onward.
print(f"Post-2011 (68 teams): {seeds[seeds['Season'] > 2011]['Season'].nunique()} seasons")

# Build seed lookup: (season, team_id) → seed_num
seed_lookup = seeds.set_index(["Season", "TeamID"])[["seed_num", "is_play_in"]]

# 4.2: Merge seed data onto tournament games
# Use MNCAATourneyCompactResults (stops at 2024)
tc = tourney_compact.copy()
tc = tc.merge(
    seeds[["Season", "TeamID", "seed_num", "is_play_in"]].rename(
        columns={"TeamID": "WTeamID", "seed_num": "w_seed", "is_play_in": "w_play_in"}
    ),
    on=["Season", "WTeamID"], how="left",
)
tc = tc.merge(
    seeds[["Season", "TeamID", "seed_num", "is_play_in"]].rename(
        columns={"TeamID": "LTeamID", "seed_num": "l_seed", "is_play_in": "l_play_in"}
    ),
    on=["Season", "LTeamID"], how="left",
)

print(f"\nTournament games with seed info: {tc[['w_seed','l_seed']].dropna().shape[0]:,} / {len(tc):,}")

Seeds loaded: 2,626 rows
Play-in teams: 132
Seed number range: 1–16
Post-2011 (68 teams): 13 seasons

Tournament games with seed info: 2,518 / 2,518


In [14]:
# ── 4.3: Upset rate by seed matchup (bar chart) ───────────────────────────────
# In tournament format, a lower seed number means a better team (1 is best).
# Upset = the winner has a HIGHER seed number than the loser
# (e.g., a 12-seed beating a 5-seed is an upset; a 5-seed beating a 12-seed is not).

# For standard bracket matchups: min(w_seed, l_seed) vs max(w_seed, l_seed)
tc_seeded = tc.dropna(subset=["w_seed", "l_seed"]).copy()
tc_seeded["favored_seed"] = tc_seeded[["w_seed", "l_seed"]].min(axis=1).astype(int)
tc_seeded["underdog_seed"] = tc_seeded[["w_seed", "l_seed"]].max(axis=1).astype(int)
tc_seeded["upset"] = tc_seeded["w_seed"] > tc_seeded["l_seed"]  # underdog won

# Focus on first-round-type matchups: seed pairs (1v16, 2v15, ..., 8v9)
# Group by (favored_seed, underdog_seed)
matchup_stats = (
    tc_seeded.groupby(["favored_seed", "underdog_seed"])
    .agg(games=("upset", "count"), upsets=("upset", "sum"))
    .reset_index()
)
matchup_stats["upset_rate"] = matchup_stats["upsets"] / matchup_stats["games"]

# Classic bracket matchups (1v16, 2v15, 3v14, ..., 8v9)
classic_pairs = [(i, 17 - i) for i in range(1, 9)]
# Vectorized membership test on the (favored_seed, underdog_seed) pairs
is_classic = pd.MultiIndex.from_frame(
    matchup_stats[["favored_seed", "underdog_seed"]]
).isin(classic_pairs)
classic_df = matchup_stats[is_classic].copy()
classic_df["matchup"] = (
    classic_df["favored_seed"].astype(int).astype(str)
    + " vs "
    + classic_df["underdog_seed"].astype(int).astype(str)
)
classic_df = classic_df.sort_values("favored_seed")

fig = px.bar(
    classic_df,
    x="matchup", y="upset_rate",
    color="upset_rate",
    color_continuous_scale="RdYlGn_r",
    text="upset_rate",
    title="Upset Rate by Classic Bracket Seed Matchup (1985–2024)",
    labels={"matchup": "Seed Matchup", "upset_rate": "Upset Rate"},
    template=TEMPLATE,
    hover_data=["games", "upsets"],
)
fig.update_traces(texttemplate="%{text:.1%}", textposition="outside")
fig.update_layout(yaxis_tickformat=".0%", coloraxis_showscale=False)
fig.show()

print("Classic matchup upset rates:")
print(classic_df[["matchup", "games", "upsets", "upset_rate"]].to_string(index=False))

Classic matchup upset rates:
matchup  games  upsets  upset_rate
1 vs 16    156       2    0.012821
2 vs 15    156      11    0.070513
3 vs 14    156      23    0.147436
4 vs 13    156      33    0.211538
5 vs 12    156      55    0.352564
6 vs 11    156      61    0.391026
7 vs 10    155      60    0.387097
 8 vs 9    156      81    0.519231


In [15]:
# ── 4.4: Actual wins by seed number (how far does each seed advance?) ──────────
# Approach: count wins per (Season, TeamID) from tc_seeded, then join to seed info
# Every team that entered the tournament appears as a winner (0+ times) and exactly
# once as a loser (except the champion who is never a loser)

wins_per_team = (
    tc_seeded.groupby(["Season", "WTeamID"])
    .size()
    .reset_index(name="tournament_wins")
    .rename(columns={"WTeamID": "TeamID"})
)

# All losers: each team appears as a loser exactly once (eliminated)
losers_df = tc_seeded[["Season", "LTeamID", "l_seed"]].rename(
    columns={"LTeamID": "TeamID", "l_seed": "seed"}
).drop_duplicates(subset=["Season", "TeamID"])

# Champions never appear as losers — find them (team with max wins per season)
champions_per_season = wins_per_team.loc[
    wins_per_team.groupby("Season")["tournament_wins"].idxmax()
].copy()
champions_per_season = champions_per_season.merge(
    tc_seeded[["Season", "WTeamID", "w_seed"]].rename(columns={"WTeamID": "TeamID", "w_seed": "seed"}),
    on=["Season", "TeamID"], how="left",
).drop_duplicates(subset=["Season", "TeamID"])

# All entrants: losers + champions
all_entrants_clean = pd.concat(
    [losers_df, champions_per_season[["Season", "TeamID", "seed"]]],
    ignore_index=True,
).drop_duplicates(subset=["Season", "TeamID"])

# Merge in wins (0 for teams that lost first game)
all_entrants_clean = all_entrants_clean.merge(wins_per_team, on=["Season", "TeamID"], how="left")
all_entrants_clean["tournament_wins"] = all_entrants_clean["tournament_wins"].fillna(0).astype(int)
all_entrants_clean["seed"] = all_entrants_clean["seed"].astype(int)

wins_by_seed = (
    all_entrants_clean.groupby("seed")["tournament_wins"]
    .agg(["mean", "sum", "count"])
    .reset_index()
    .rename(columns={"mean": "avg_wins", "sum": "total_wins", "count": "appearances"})
    .sort_values("seed")
)

fig = go.Figure()
fig.add_trace(go.Bar(
    x=wins_by_seed["seed"],
    y=wins_by_seed["avg_wins"],
    name="Avg wins per tournament entry",
    marker_color=COLORS["accent"],
    text=wins_by_seed["avg_wins"].round(2),
    texttemplate="%{text}",
    textposition="outside",
))
fig.update_layout(
    title="Average Tournament Wins by Seed Number (1985–2024)",
    xaxis_title="Seed Number",
    yaxis_title="Average Tournament Wins",
    template=TEMPLATE,
    xaxis=dict(tickmode="linear", tick0=1, dtick=1),
)
fig.show()

print("\nWins by seed:")
print(wins_by_seed[["seed", "appearances", "total_wins", "avg_wins"]].to_string(index=False))


Wins by seed:
 seed  appearances  total_wins  avg_wins
    1          156         515  3.301282
    2          156         363  2.326923
    3          156         287  1.839744
    4          156         243  1.557692
    5          156         180  1.153846
    6          156         163  1.044872
    7          156         139  0.891026
    8          156         111  0.711538
    9          156          96  0.615385
   10          157          96  0.611465
   11          174         122  0.701149
   12          160          83  0.518750
   13          157          40  0.254777
   14          157          26  0.165605
   15          156          16  0.102564
   16          192          38  0.197917


In [16]:
# ── 4.5: Notable historical upsets ────────────────────────────────────────────
# Biggest upsets: largest seed differential with the worse-seeded team (higher seed number) winning
tc_seeded["seed_diff"] = tc_seeded["w_seed"] - tc_seeded["l_seed"]
upsets_only = tc_seeded[tc_seeded["upset"]].copy()
top_upsets = upsets_only.nlargest(10, "seed_diff")[
    ["Season", "DayNum", "WTeamID", "LTeamID", "WScore", "LScore", "w_seed", "l_seed", "seed_diff"]
]
print("Top 10 biggest upsets (by seed differential):")
print(top_upsets.to_string(index=False))

# Chalk = the better-seeded team (lower seed number) wins as expected;
# equal-seed games (where upset is False) are also counted here
chalk = tc_seeded[~tc_seeded["upset"]]
print(f"\nChalk results (expected winner wins): {len(chalk):,} ({len(chalk)/len(tc_seeded)*100:.1f}%)")
print(f"Upset results (underdog wins):       {len(upsets_only):,} ({len(upsets_only)/len(tc_seeded)*100:.1f}%)")

# Most upset-prone matchups among favored seeds 1–8 with at least 20 games
# (note: this filter does not exclude 8v9, which is expected to be ~50%)
high_upset = matchup_stats[(matchup_stats["favored_seed"] <= 8) & (matchup_stats["games"] >= 20)].nlargest(5, "upset_rate")
print("\nHighest upset rate matchups (≥20 games):")
print(high_upset[["favored_seed", "underdog_seed", "games", "upsets", "upset_rate"]].to_string(index=False))

Top 10 biggest upsets (by seed differential):
 Season  DayNum  WTeamID  LTeamID  WScore  LScore  w_seed  l_seed  seed_diff
   2018     137     1420     1438      74      54      16       1         15
   2023     137     1192     1345      63      58      16       1         15
   1991     136     1350     1393      73      69      15       2         13
   1993     136     1365     1112      64      61      15       2         13
   1997     137     1164     1376      78      65      15       2         13
   2001     136     1214     1235      58      57      15       2         13
   2012     137     1250     1181      75      70      15       2         13
   2012     137     1313     1281      86      84      15       2         13
   2013     137     1195     1207      78      68      15       2         13
   2016     137     1292     1277      90      81      15       2         13

Chalk results (expected winner wins): 1,823 (72.4%)
Upset results (underdog wins):       695 (27.6%)

High

In [17]:
# ── 4.6: Post-2011 First Four — distinguish true 16-seeds from play-in winners ─
# Note: `> 2011` starts at 2012; the First Four itself began in 2011.
post_2011_seeds = seeds[seeds["Season"] > 2011]
play_in_by_season = post_2011_seeds.groupby("Season")["is_play_in"].sum().reset_index()
play_in_by_season.columns = ["season", "play_in_teams"]

print("Post-2011 seasons with First Four (play-in teams):")
print(play_in_by_season.to_string(index=False))
print("\nStandard count: 4 play-in games (8 teams in total, producing 4 qualified teams) per season")

# Wins by play-in 16-seeds. This count includes the First Four games themselves;
# UMBC 2018 was a true 16-seed, not a play-in team.
play_in_16_seeds = seeds[(seeds["seed_num"] == 16) & seeds["is_play_in"]]
if len(play_in_16_seeds) > 0:
    print(f"\nPlay-in 16-seeds: {len(play_in_16_seeds)} teams")
    # Match on the (Season, TeamID) pair so a team is only counted in the
    # season in which it was a play-in 16-seed.
    play_in_16_winners = tc_seeded[tc_seeded["w_seed"] == 16].merge(
        play_in_16_seeds[["Season", "TeamID"]].rename(columns={"TeamID": "WTeamID"}),
        on=["Season", "WTeamID"], how="inner",
    )
    print(f"Play-in 16-seeds that won a game: {len(play_in_16_winners)}")

Post-2011 seasons with First Four (play-in teams):
 season  play_in_teams
   2012              8
   2013              8
   2014              8
   2015              8
   2016              8
   2017              8
   2018              8
   2019              8
   2021              8
   2022              8
   2023              8
   2024              8
   2025              8

Standard count: 4 play-in games (8 teams in total, producing 4 qualified teams) per season

Play-in 16-seeds: 76 teams
Play-in 16-seeds that won a game: 37


---
<a id='section-4'></a>
## Section 4: Conference-Strength Analysis (AC 4)

Analyze conference performance using `MTeamConferences.csv`.
Note: Conference abbreviations change over time (Big East splits, Pac-12 → Pac-2 after 2023).
Each `(Season, ConfAbbrev)` is treated as a distinct entity.
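
Treating each `(Season, ConfAbbrev)` pair as its own entity amounts to grouping on both columns. A minimal sketch on made-up membership rows (the team IDs and realignment are hypothetical):

```python
import pandas as pd

# Made-up membership: team 2 moves from big_east to aac between seasons.
toy_conf = pd.DataFrame({
    "Season": [2013, 2013, 2014, 2014, 2014],
    "TeamID": [1, 2, 1, 2, 3],
    "ConfAbbrev": ["big_east", "big_east", "big_east", "aac", "aac"],
})

# One row per (Season, ConfAbbrev): the same abbreviation in two seasons
# is counted as two separate entities.
members = (
    toy_conf.groupby(["Season", "ConfAbbrev"])["TeamID"]
    .nunique()
    .rename("teams")
    .reset_index()
)
print(members)
```

Grouping this way keeps a conference's pre- and post-realignment rosters from being pooled under one label.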

In [18]:
# ── 5.1: Build (season, team_id) → conference lookup ──────────────────────────
conf_lookup = conferences.rename(columns={"Season": "season", "TeamID": "team_id", "ConfAbbrev": "conference"})
conf_lookup = conf_lookup.set_index(["season", "team_id"])["conference"]

print(f"Conference membership entries: {len(conferences):,}")
print(f"Unique conferences across all seasons: {conferences['ConfAbbrev'].nunique()}")
print(f"Season range: {conferences['Season'].min()}–{conferences['Season'].max()}")

# Show most common conferences
top_confs = conferences.groupby("ConfAbbrev").size().nlargest(15)
print("\nTop 15 conferences by team-season appearances:")
print(top_confs.to_string())

Conference membership entries: 13,388
Unique conferences across all seasons: 51
Season range: 1985–2025

Top 15 conferences by team-season appearances:
ConfAbbrev
a_ten        508
sec          506
big_east     488
big_ten      483
mac          467
acc          463
sun_belt     444
southland    437
meac         428
southern     427
a_sun        419
ovc          412
mvc          407
maac         404
wac          403


In [19]:
# ── 5.2: Intra- vs. inter-conference game counts and margins (regular season) ──
# Merge conference membership onto regular-season games (both winner and loser)
reg_conf = reg_games.copy()
reg_conf = reg_conf.merge(
    conferences.rename(columns={"Season": "season", "TeamID": "w_team_id", "ConfAbbrev": "w_conf"}),
    on=["season", "w_team_id"], how="left",
)
reg_conf = reg_conf.merge(
    conferences.rename(columns={"Season": "season", "TeamID": "l_team_id", "ConfAbbrev": "l_conf"}),
    on=["season", "l_team_id"], how="left",
)

# Intra-conference games (same conference)
intra = reg_conf[reg_conf["w_conf"] == reg_conf["l_conf"]].copy()
inter = reg_conf[reg_conf["w_conf"] != reg_conf["l_conf"]].copy()

print(f"Regular season games with conference data: {reg_conf[['w_conf','l_conf']].dropna().shape[0]:,}")
print(f"Intra-conference games: {len(intra):,} ({len(intra)/len(reg_conf)*100:.1f}%)")
print(f"Inter-conference games: {len(inter):,} ({len(inter)/len(reg_conf)*100:.1f}%)")

# Intra-conference margin distribution
print(f"\nIntra-conference avg margin: {intra['margin'].mean():.1f}")
print(f"Inter-conference avg margin: {inter['margin'].mean():.1f}")

Regular season games with conference data: 194,198
Intra-conference games: 119,987 (61.8%)
Inter-conference games: 74,211 (38.2%)

Intra-conference avg margin: 10.9
Inter-conference avg margin: 14.1
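
The `!=` filter in this cell relies on every game having conference data for both teams. In pandas, comparing a missing value with `!=` yields `True`, so a game with an unknown conference would silently be counted as inter-conference. The counts above add up (119,987 + 74,211 = 194,198), so that does not appear to happen here, but a guard is cheap. A minimal sketch on toy data (the frame and the `known` mask are illustrative, not part of the notebook):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "w_conf": ["acc", "acc", np.nan],
    "l_conf": ["acc", "sec", "sec"],
})

# Naive comparison: the row with a missing conference evaluates to True.
naive_inter = toy["w_conf"] != toy["l_conf"]

# Guarded version: require both conferences to be known first.
known = toy[["w_conf", "l_conf"]].notna().all(axis=1)
safe_inter = known & naive_inter

print(int(naive_inter.sum()), int(safe_inter.sum()))
```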


In [20]:
# ── 5.3: Tournament representation by conference ───────────────────────────────
# Merge seed info with conference membership for tournament teams
seeds_conf = seeds.merge(conferences, on=["Season", "TeamID"], how="left")
seeds_conf["seed_tier"] = pd.cut(
    seeds_conf["seed_num"], bins=[0, 4, 8, 12, 16], labels=["1-4", "5-8", "9-12", "13-16"]
)

# Top conferences by tournament appearances (1985–2025)
conf_tourney = (
    seeds_conf.groupby("ConfAbbrev").size()
    .reset_index(name="team_appearances")
    .sort_values("team_appearances", ascending=False)
    .head(15)
)
print("Top 15 conferences by tournament team appearances (1985–2025):")
print(conf_tourney.to_string(index=False))

# Top-seed representation (seeds 1-4)
top_seeds_by_conf = (
    seeds_conf[seeds_conf["seed_num"] <= 4]
    .groupby("ConfAbbrev").size()
    .reset_index(name="top4_seeds")
    .sort_values("top4_seeds", ascending=False)
    .head(10)
)
print("\nTop 10 conferences by 1-4 seed allocations:")
print(top_seeds_by_conf.to_string(index=False))

Top 15 conferences by tournament team appearances (1985–2025):
ConfAbbrev  team_appearances
   big_ten               244
  big_east               228
       acc               218
       sec               214
big_twelve               166
     a_ten               111
   pac_ten               110
       wac                72
       mwc                69
       mvc                66
       wcc                64
      cusa                61
 big_eight                53
  sun_belt                51
pac_twelve                50

Top 10 conferences by 1-4 seed allocations:
ConfAbbrev  top4_seeds
       acc         111
   big_ten         100
  big_east          96
       sec          82
big_twelve          77
   pac_ten          42
 big_eight          26
     a_ten          16
      cusa          15
pac_twelve          15


In [21]:
# ── 5.4: Inter-conference performance (top 10 conference pairs by game volume) ─
# Create unordered conference pair (alphabetical) using vectorized numpy sort
# avoids apply(axis=1) / iterrows-equivalent for 74K+ rows
inter_clean = inter.dropna(subset=["w_conf", "l_conf"]).copy()
_sorted = np.sort(inter_clean[["w_conf", "l_conf"]].values, axis=1)
inter_clean["conf_a"] = _sorted[:, 0]
inter_clean["conf_b"] = _sorted[:, 1]
inter_clean["conf_pair"] = inter_clean["conf_a"] + "_vs_" + inter_clean["conf_b"]
# Conf A wins when w_conf == conf_a
inter_clean["conf_a_wins"] = inter_clean["w_conf"] == inter_clean["conf_a"]

conf_pair_stats = (
    inter_clean.groupby("conf_pair")
    .agg(games=("conf_a_wins", "count"), conf_a_wins=("conf_a_wins", "sum"))
    .reset_index()
)
conf_pair_stats["conf_a_win_rate"] = conf_pair_stats["conf_a_wins"] / conf_pair_stats["games"]
top10_pairs = conf_pair_stats.nlargest(10, "games")

print("Top 10 conference matchups by game volume (regular season):")
print(top10_pairs[["conf_pair", "games", "conf_a_wins", "conf_a_win_rate"]].to_string(index=False))

Top 10 conference matchups by game volume (regular season):
            conf_pair  games  conf_a_wins  conf_a_win_rate
       ivy_vs_patriot    689          381         0.552975
      big_west_vs_wcc    674          316         0.468843
       big_sky_vs_wcc    552          209         0.378623
big_south_vs_southern    538          226         0.420074
    a_ten_vs_big_east    504          152         0.301587
       horizon_vs_mac    503          229         0.455268
       acc_vs_big_ten    480          269         0.560417
           acc_vs_sec    478          247         0.516736
  big_sky_vs_big_west    449          198         0.440980
       nec_vs_patriot    437          207         0.473684


In [22]:
# ── 5.5: Top 10 conferences by cumulative tournament wins (1985–2024) ───────────
# Use MNCAATourneyCompactResults + conference membership (winner's conference only)
tc_conf = tourney_compact.merge(
    conferences.rename(columns={"TeamID": "WTeamID", "ConfAbbrev": "w_conf"}),
    on=["Season", "WTeamID"], how="left",
)

conf_wins = (
    tc_conf.groupby("w_conf").size()
    .reset_index(name="tournament_wins")
    .sort_values("tournament_wins", ascending=False)
    .dropna(subset=["w_conf"])
    .head(10)
)

fig = px.bar(
    conf_wins,
    x="tournament_wins",
    y="w_conf",
    orientation="h",
    title="Top 10 Conferences by Cumulative NCAA Tournament Wins (1985–2024)",
    labels={"w_conf": "Conference", "tournament_wins": "Total Tournament Wins"},
    template=TEMPLATE,
    color="tournament_wins",
    color_continuous_scale="Blues",
)
fig.update_layout(yaxis={"categoryorder": "total ascending"}, coloraxis_showscale=False)
fig.show()

print("Top 10 conferences by tournament wins:")
print(conf_wins.to_string(index=False))

Top 10 conferences by tournament wins:
    w_conf  tournament_wins
       acc              389
   big_ten              338
  big_east              336
       sec              287
big_twelve              217
   pac_ten              141
     a_ten               95
 big_eight               71
pac_twelve               66
       wcc               61


---
<a id='section-5'></a>
## Section 5: Statistical Correlations with Tournament Outcomes (AC 3)

Correlate regular-season box-score statistics with tournament advancement depth.

**Data limitation:** `MRegularSeasonDetailedResults.csv` covers 2003–2025.  
`MNCAATourneyDetailedResults.csv` covers 2003–**2024** only (2025 tournament not yet complete).

All pre-2003 seasons are excluded from this section.

In [23]:
# ── 6.1: Load detailed results ──────────────────────────────────────────────────
# Already loaded in setup cell as reg_detail and tourney_detail
print(f"Regular season detailed: {len(reg_detail):,} rows, {reg_detail['Season'].min()}–{reg_detail['Season'].max()}")
print(f"Tournament detailed:     {len(tourney_detail):,} rows, {tourney_detail['Season'].min()}–{tourney_detail['Season'].max()}")
print("\n⚠ Note: MNCAATourneyDetailedResults.csv stops at 2024. 2025 tournament box scores not available.")

Regular season detailed: 118,882 rows, 2003–2025
Tournament detailed:     1,382 rows, 2003–2024

⚠ Note: MNCAATourneyDetailedResults.csv stops at 2024. 2025 tournament box scores not available.


In [24]:
# ── 6.2: Compute per-team, per-season regular-season averages ──────────────────
# From MRegularSeasonDetailedResults.csv: stack winner rows and loser rows so that
# every game contributes one row per team
stat_cols_w = ["WFGM", "WFGA", "WFGM3", "WFGA3", "WFTM", "WFTA",
               "WOR", "WDR", "WAst", "WTO", "WStl", "WBlk", "WPF",
               "WScore"]
stat_cols_l = ["LFGM", "LFGA", "LFGM3", "LFGA3", "LFTM", "LFTA",
               "LOR", "LDR", "LAst", "LTO", "LStl", "LBlk", "LPF",
               "LScore"]
rename_w = {c: c[1:] for c in stat_cols_w}  # WFGM → FGM
rename_l = {c: c[1:] for c in stat_cols_l}  # LFGM → FGM

# Winner rows
winners = reg_detail[["Season", "WTeamID"] + stat_cols_w].rename(
    columns={"WTeamID": "TeamID", **rename_w}
)
# Loser rows
losers = reg_detail[["Season", "LTeamID"] + stat_cols_l].rename(
    columns={"LTeamID": "TeamID", **rename_l}
)

team_games = pd.concat([winners, losers], ignore_index=True)

# Derived efficiency stats
team_games["FGPct"] = team_games["FGM"] / team_games["FGA"].clip(lower=1)
team_games["3PPct"] = team_games["FGM3"] / team_games["FGA3"].clip(lower=1)
team_games["FTPct"] = team_games["FTM"] / team_games["FTA"].clip(lower=1)
# Turnover rate: TO / (FGA + 0.44*FTA + TO)
team_games["TO_rate"] = team_games["TO"] / (team_games["FGA"] + 0.44 * team_games["FTA"] + team_games["TO"]).clip(lower=1)

# Aggregate to season averages per team
stat_avg_cols = ["FGM", "FGA", "FGPct", "FGM3", "FGA3", "3PPct",
                 "FTM", "FTA", "FTPct", "OR", "DR", "Ast", "TO",
                 "TO_rate", "Stl", "Blk", "PF", "Score"]
team_season_avgs = (
    team_games.groupby(["Season", "TeamID"])[stat_avg_cols]
    .mean()
    .reset_index()
)

print(f"Team-season averages computed: {len(team_season_avgs):,} rows")
print(f"Seasons: {team_season_avgs['Season'].min()}–{team_season_avgs['Season'].max()}")
print(f"Sample team-season stats:")
print(team_season_avgs.head(2).to_string(index=False))

Team-season averages computed: 7,981 rows
Seasons: 2003–2025
Sample team-season stats:
 Season  TeamID       FGM       FGA    FGPct     FGM3      FGA3    3PPct       FTM       FTA    FTPct       OR        DR       Ast        TO  TO_rate      Stl      Blk        PF     Score
   2003    1102 19.142857 39.785714 0.486149 7.821429 20.821429 0.367637 11.142857 17.107143 0.642402 4.178571 16.821429 13.000000 11.428571 0.193121 5.964286 1.785714 18.750000 57.250000
   2003    1103 27.148148 55.851852 0.487294 5.444444 16.074074 0.331990 19.037037 25.851852 0.735271 9.777778 19.925926 15.222222 12.629630 0.159805 7.259259 2.333333 19.851852 78.777778


In [25]:
# ── 6.3: Determine tournament outcome per team per season ──────────────────────
# Use MNCAATourneyCompactResults (through 2024)
# Map DayNum to round reached

def day_to_round_num(day: int) -> int:
    """Convert DayNum to round number.
    0=First Four, 1=R64/R32, 2=Sweet 16, 3=Elite 8, 4=Final Four, 5=Championship
    (champion gets round_num=6 via explicit override below)
    """
    if day <= 135:
        return 0  # First Four (days 134-135)
    elif day <= 140:
        return 1  # R64 + R32 (days 136-140; day 140 has rare games)
    elif day <= 144:
        return 2  # Sweet 16 (days 143-144)
    elif day <= 148:
        return 3  # Elite 8 (days 145-148; 147-148 are rare rescheduled games)
    elif day <= 152:
        return 4  # Final Four (day 152)
    else:
        return 5  # Championship game (day 154; winner gets 6 below)

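# Verified DayNum mapping from data inspection (all seasons 2003-2024):
# 134-135: First Four, 136-140: R64+R32, 143-144: Sweet 16,
# 145-148: Elite 8 (147-148 rare rescheduled), 152: Final Four, 154: Championship
# Caveat: the cutoffs are fixed, so a season played on a shifted calendar is binned
# by DayNum rather than by actual round. Sanity-check the round distribution against
# the expected exits per season (8 in the Sweet 16, 4 in the Elite 8, 2 in the Final Four).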
def day_to_round_name(day: int) -> str:
    if day <= 135: return "First Four"
    elif day <= 137: return "R64"
    elif day <= 140: return "R32"
    elif day <= 144: return "Sweet 16"
    elif day <= 148: return "Elite 8"   # includes rare rescheduled games on days 147-148
    elif day <= 152: return "Final Four"
    else: return "Champion"

# Furthest round reached = round of the last game a team played (won or lost)
tc_rounds = tourney_compact.copy()
tc_rounds["round_name"] = tc_rounds["DayNum"].apply(day_to_round_name)
tc_rounds["round_num"] = tc_rounds["DayNum"].apply(day_to_round_num)

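# Winners: deepest round in which the team won a game (no +1 is applied here;
# the team's later loss, or the champion override below, supplies the final value)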
winner_rounds = (
    tc_rounds.groupby(["Season", "WTeamID"])
    .agg(furthest_round_num=("round_num", "max"))
    .reset_index()
    .rename(columns={"WTeamID": "TeamID"})
)
# Champion override: winning the final (DayNum >= 154) is coded as round 6
champs = tc_rounds[tc_rounds["DayNum"] >= 154][["Season", "WTeamID"]].copy()
champs["furthest_round_num"] = 6
champs = champs.rename(columns={"WTeamID": "TeamID"})

# All losers: exit at round they lost
loser_rounds = (
    tc_rounds.groupby(["Season", "LTeamID"])
    .agg(furthest_round_num=("round_num", "max"))
    .reset_index()
    .rename(columns={"LTeamID": "TeamID"})
)
# Combine: team's tournament outcome is max of wins + exit round
all_tourney_outcomes = (
    pd.concat([winner_rounds, loser_rounds, champs], ignore_index=True)
    .groupby(["Season", "TeamID"])["furthest_round_num"]
    .max()
    .reset_index()
)

print(f"Tournament outcome records: {len(all_tourney_outcomes):,}")
print(f"Round distribution:")
round_labels = {0: "First Four", 1: "R64/R32", 2: "Sweet 16", 3: "Elite 8", 4: "Final Four", 5: "Runner-Up", 6: "Champion"}
print(all_tourney_outcomes["furthest_round_num"].value_counts().sort_index().to_string())

Tournament outcome records: 2,557
Round distribution:
furthest_round_num
0      58
1    1875
2     304
3     164
4      78
5      39
6      39
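
The two `apply` calls in this cell run a Python function once per row. That is fine for a few thousand tournament games, but the same cutoffs can be expressed as a single vectorized lookup, in keeping with the notebook's vectorized-only constraint. A sketch using `np.searchsorted`; `ROUND_CUTOFFS` and `days_to_round_num` are hypothetical names, and the cutoffs are copied from `day_to_round_num`:

```python
import numpy as np
import pandas as pd

# Upper DayNum bound (inclusive) for round numbers 0..4; anything above is 5.
ROUND_CUTOFFS = np.array([135, 140, 144, 148, 152])

def days_to_round_num(days: pd.Series) -> pd.Series:
    """Vectorized equivalent of day_to_round_num for a whole column."""
    # side="left" returns the first index i with days <= ROUND_CUTOFFS[i].
    return pd.Series(
        np.searchsorted(ROUND_CUTOFFS, days.to_numpy(), side="left"),
        index=days.index,
    )

sample_days = pd.Series([134, 136, 140, 143, 146, 152, 154])
sample_rounds = days_to_round_num(sample_days)
print(sample_rounds.tolist())
```

The champion override (round 6) would still be applied separately, as in the cell above.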


In [26]:
# ── 4b.1 Strength of Schedule — compute opponent win rate ─────────────────────
# SoS = mean regular-season opponent win rate per team-season
# Uses reg_games (compact results, all seasons back to 1985)

# Step 1: Per-team, per-season win/loss record
_w = (
    reg_games.groupby(["season", "w_team_id"])
    .size()
    .reset_index(name="wins")
    .rename(columns={"w_team_id": "team_id"})
)
_l = (
    reg_games.groupby(["season", "l_team_id"])
    .size()
    .reset_index(name="losses")
    .rename(columns={"l_team_id": "team_id"})
)
team_rec = _w.merge(_l, on=["season", "team_id"], how="outer").fillna(0)
team_rec["games"] = team_rec["wins"] + team_rec["losses"]
team_rec["win_rate"] = team_rec["wins"] / team_rec["games"]

# Step 2: For every game, pair each team with its opponent's ID
_opps_w = reg_games[["season", "w_team_id", "l_team_id"]].rename(
    columns={"w_team_id": "team_id", "l_team_id": "opp_id"}
)
_opps_l = reg_games[["season", "l_team_id", "w_team_id"]].rename(
    columns={"l_team_id": "team_id", "w_team_id": "opp_id"}
)
all_opps = pd.concat([_opps_w, _opps_l], ignore_index=True)

# Step 3: Merge opponent win rate
opp_wr = team_rec[["season", "team_id", "win_rate"]].rename(
    columns={"team_id": "opp_id", "win_rate": "opp_win_rate"}
)
all_opps = all_opps.merge(opp_wr, on=["season", "opp_id"], how="left")

# Step 4: SoS = mean opponent win rate per team-season
sos_df = (
    all_opps.groupby(["season", "team_id"])["opp_win_rate"]
    .mean()
    .reset_index(name="sos")
    .rename(columns={"team_id": "TeamID", "season": "Season"})
)

print(f"SoS computed: {len(sos_df):,} team-season entries, {sos_df['Season'].nunique()} seasons")
print(f"SoS range   : {sos_df['sos'].min():.3f} – {sos_df['sos'].max():.3f}")
print(f"SoS mean    : {sos_df['sos'].mean():.3f}   std={sos_df['sos'].std():.3f}")
print(f"SoS NaN     : {sos_df['sos'].isna().sum()}")

SoS computed: 13,388 team-season entries, 41 seasons
SoS range   : 0.339 – 0.655
SoS mean    : 0.498   std=0.050
SoS NaN     : 0


In [27]:
# ── 4b.2 Strength of Schedule vs. tournament outcomes ────────────────────────
# Merge SoS (all seasons) with tournament outcomes (1985–2024, men only)
sos_tourney = sos_df.merge(all_tourney_outcomes, on=["Season", "TeamID"], how="inner")
print(f"Teams with SoS + tournament data: {len(sos_tourney):,} "
      f"({sos_tourney['Season'].min()}–{sos_tourney['Season'].max()})")

sos_r, sos_p = stats.pearsonr(sos_tourney["sos"], sos_tourney["furthest_round_num"])
sos_signal = "HIGH" if abs(sos_r) > 0.3 else "MEDIUM" if abs(sos_r) > 0.15 else "LOW"
print(f"\nSoS ↔ Tournament Round Reached")
print(f"  Pearson r = {sos_r:.4f}   p = {sos_p:.2e}   signal = {sos_signal}")

# Mean SoS by round reached
round_labels_full = {
    0: "First Four", 1: "R64/R32", 2: "Sweet 16",
    3: "Elite 8", 4: "Final Four", 5: "Runner-Up", 6: "Champion"
}
sos_tourney["round_label"] = sos_tourney["furthest_round_num"].map(round_labels_full)
sos_by_round = (
    sos_tourney.groupby("furthest_round_num")
    .agg(
        round_label=("round_label", "first"),
        n=("sos", "count"),
        mean_sos=("sos", "mean"),
        std_sos=("sos", "std"),
    )
    .reset_index()
    .sort_values("furthest_round_num")
)
print("\nMean SoS by tournament round reached:")
print(sos_by_round[["round_label", "n", "mean_sos", "std_sos"]].to_string(index=False))

# Box plot: SoS distribution by round (color is categorical — no color_continuous_scale)
fig = px.box(
    sos_tourney,
    x="furthest_round_num",
    y="sos",
    color="round_label",
    category_orders={"round_label": [round_labels_full[k] for k in sorted(round_labels_full.keys())]},
    title=f"Strength of Schedule by Tournament Round Reached  (Pearson r={sos_r:.3f}, {sos_signal} signal)",
    labels={"sos": "SoS (mean opponent win rate)", "furthest_round_num": "Round Reached"},
    template=TEMPLATE,
)
fig.update_layout(
    xaxis=dict(
        tickvals=sorted(round_labels_full.keys()),
        ticktext=[round_labels_full[k] for k in sorted(round_labels_full.keys())],
    ),
    showlegend=False,
)
fig.show()

Teams with SoS + tournament data: 2,557 (1985–2024)

SoS ↔ Tournament Round Reached
  Pearson r = 0.2970   p = 3.16e-53   signal = MEDIUM

Mean SoS by tournament round reached:
round_label    n  mean_sos  std_sos
 First Four   58  0.472124 0.066687
    R64/R32 1875  0.516187 0.055426
   Sweet 16  304  0.552131 0.038966
    Elite 8  164  0.558516 0.031858
 Final Four   78  0.561670 0.029556
  Runner-Up   39  0.558488 0.029609
   Champion   39  0.562387 0.026199
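
The signal label depends on where r falls relative to the 0.3 cutoff, and r = 0.2970 sits just under it. A confidence interval makes that borderline status explicit. The sketch below applies the standard Fisher z-transformation to the r and n printed above; `fisher_ci` is a hypothetical helper, and the interval assumes roughly bivariate-normal data, which an ordinal round number only approximates.

```python
import math

def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a Pearson r via Fisher's z."""
    z = math.atanh(r)                # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)      # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

ci_low, ci_high = fisher_ci(0.2970, 2557)
print(f"95% CI for r: ({ci_low:.3f}, {ci_high:.3f})")
```

When the interval straddles 0.3, the MEDIUM/HIGH distinction is better read as borderline than as a firm classification.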


In [28]:
# ── 6.4: Pearson correlations between stats and tournament outcome ─────────────
# Merge regular-season averages with tournament outcomes (2003–2024)
corr_df = team_season_avgs.merge(
    all_tourney_outcomes,
    on=["Season", "TeamID"],
    how="inner",
)
print(f"Teams with both reg-season stats and tournament outcomes: {len(corr_df):,}")
print(f"Seasons covered: {corr_df['Season'].min()}–{corr_df['Season'].max()}")

# Compute Pearson correlations
stat_features = ["FGPct", "3PPct", "FTPct", "TO_rate", "OR", "DR",
                 "Ast", "Stl", "Blk", "PF", "Score", "FGM", "FGA",
                 "FGM3", "FGA3", "FTM", "FTA"]

correlations = []
for feat in stat_features:
    r, p = stats.pearsonr(
        corr_df[feat].fillna(corr_df[feat].median()),
        corr_df["furthest_round_num"],
    )
    correlations.append({"stat": feat, "pearson_r": r, "p_value": p})

corr_table = pd.DataFrame(correlations).sort_values("pearson_r", ascending=False)
corr_table["abs_r"] = corr_table["pearson_r"].abs()
corr_table["significant"] = corr_table["p_value"] < 0.05

print("\nTop 10 POSITIVE correlations with tournament advancement:")
print(corr_table.head(10)[["stat", "pearson_r", "p_value", "significant"]].to_string(index=False))
print("\nBottom 10 correlations by r (most negative last; may include weak positives):")
print(corr_table.tail(10)[["stat", "pearson_r", "p_value", "significant"]].to_string(index=False))

fig = px.bar(
    corr_table.sort_values("pearson_r"),
    x="pearson_r",
    y="stat",
    orientation="h",
    color="pearson_r",
    color_continuous_scale="RdYlGn",
    title="Pearson Correlation: Regular-Season Stats vs. Tournament Advancement (2003–2024)",
    labels={"pearson_r": "Pearson r", "stat": "Statistic"},
    template=TEMPLATE,
)
fig.add_vline(x=0, line_dash="dot", line_color=COLORS["neutral"])
fig.update_layout(coloraxis_showscale=False)
fig.show()

Teams with both reg-season stats and tournament outcomes: 1,403
Seasons covered: 2003–2024

Top 10 POSITIVE correlations with tournament advancement:
 stat  pearson_r      p_value  significant
  FGM   0.262801 1.358863e-23         True
Score   0.234861 4.911124e-19         True
FGPct   0.226919 7.661736e-18         True
  Blk   0.209214 2.428010e-15         True
  Ast   0.204648 9.889726e-15         True
  FGA   0.145997 3.948129e-08         True
   DR   0.122567 4.140280e-06         True
   OR   0.120077 6.482924e-06         True
3PPct   0.107376 5.577559e-05         True
  Stl   0.087628 1.017459e-03         True

Bottom 10 correlations by r (most negative last; may include weak positives):
   stat  pearson_r      p_value  significant
     OR   0.120077 6.482924e-06         True
  3PPct   0.107376 5.577559e-05         True
    Stl   0.087628 1.017459e-03         True
   FGM3   0.075013 4.936101e-03         True
  FTPct   0.065566 1.403562e-02         True
   FGA3   0.031840 2.333128e-01        False
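
`furthest_round_num` is an ordinal, heavily skewed outcome (most teams exit at round 1), so a rank-based coefficient is a reasonable cross-check on the Pearson values. A minimal sketch with `scipy.stats.spearmanr` on synthetic data; the column names mirror `corr_df`, but the values are made up for illustration and say nothing about the real correlations.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(1)
n_teams = 400
fg_pct = rng.normal(0.45, 0.03, n_teams)
# Synthetic ordinal outcome loosely tied to FG%, clipped to rounds 0..6.
latent = 40 * (fg_pct - 0.45) + rng.normal(0, 1.5, n_teams)
toy = pd.DataFrame({
    "FGPct": fg_pct,
    "furthest_round_num": np.clip(np.round(latent + 1), 0, 6).astype(int),
})

pearson_r, _ = stats.pearsonr(toy["FGPct"], toy["furthest_round_num"])
spearman_rho, _ = stats.spearmanr(toy["FGPct"], toy["furthest_round_num"])
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```

If the two coefficients rank the features differently on the real data, that would suggest the Pearson ordering is driven by a few deep-run teams.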
   

In [29]:
# ── 6.5: Scatter plots — FGPct, TO_rate, 3PPct vs. tournament round reached ───
scatter_features = [
    ("FGPct", "Field Goal %"),
    ("TO_rate", "Turnover Rate"),
    ("3PPct", "3-Point %"),
]

fig = make_subplots(rows=1, cols=3, subplot_titles=[f[1] for f in scatter_features])

for i, (feat, label) in enumerate(scatter_features, start=1):
    # Jitter round_num slightly for better visualization
    jitter = np.random.uniform(-0.2, 0.2, len(corr_df))
    fig.add_trace(go.Scatter(
        x=corr_df[feat],
        y=corr_df["furthest_round_num"] + jitter,
        mode="markers",
        marker=dict(size=3, opacity=0.3, color=COLORS["accent"]),
        name=label,
        showlegend=False,
    ), row=1, col=i)
    # Add least-squares trend line (NaNs filled with the median, as in cell 6.4)
    x_filled = corr_df[feat].fillna(corr_df[feat].median())
    m, b = np.polyfit(x_filled, corr_df["furthest_round_num"], 1)
    x_range = np.linspace(x_filled.min(), x_filled.max(), 50)
    fig.add_trace(go.Scatter(
        x=x_range, y=m * x_range + b,
        mode="lines", line=dict(color=COLORS["positive"], width=2),
        name=f"Trend ({feat})", showlegend=False,
    ), row=1, col=i)

fig.update_layout(
    title="Regular-Season Stats vs. Tournament Round Reached (2003–2024)",
    template=TEMPLATE,
    height=400,
)
fig.update_yaxes(title_text="Round Reached (0=First Four, 1=R64/R32, ..., 6=Champion)", row=1, col=1)
fig.show()

In [30]:
# ── 6.6: Stat differentials in tournament games (winner vs. loser) ─────────────
# Use MNCAATourneyDetailedResults.csv directly
stat_cols_w_d = ["WFGM", "WFGA", "WFGM3", "WFGA3", "WFTM", "WFTA",
                 "WOR", "WDR", "WAst", "WTO", "WStl", "WBlk", "WPF"]
stat_cols_l_d = ["LFGM", "LFGA", "LFGM3", "LFGA3", "LFTM", "LFTA",
                 "LOR", "LDR", "LAst", "LTO", "LStl", "LBlk", "LPF"]

td = tourney_detail.copy()
# Compute derived pcts for winners and losers
td["w_fg_pct"] = td["WFGM"] / td["WFGA"].clip(lower=1)
td["l_fg_pct"] = td["LFGM"] / td["LFGA"].clip(lower=1)
td["w_3p_pct"] = td["WFGM3"] / td["WFGA3"].clip(lower=1)
td["l_3p_pct"] = td["LFGM3"] / td["LFGA3"].clip(lower=1)
td["w_ft_pct"] = td["WFTM"] / td["WFTA"].clip(lower=1)
td["l_ft_pct"] = td["LFTM"] / td["LFTA"].clip(lower=1)
td["w_to_rate"] = td["WTO"] / (td["WFGA"] + 0.44 * td["WFTA"] + td["WTO"]).clip(lower=1)
td["l_to_rate"] = td["LTO"] / (td["LFGA"] + 0.44 * td["LFTA"] + td["LTO"]).clip(lower=1)

# Average stats for winners vs. losers
diff_stats = {
    "FG%": (td["w_fg_pct"].mean(), td["l_fg_pct"].mean()),
    "3P%": (td["w_3p_pct"].mean(), td["l_3p_pct"].mean()),
    "FT%": (td["w_ft_pct"].mean(), td["l_ft_pct"].mean()),
    "TO Rate": (td["w_to_rate"].mean(), td["l_to_rate"].mean()),
    "OffReb": (td["WOR"].mean(), td["LOR"].mean()),
    "DefReb": (td["WDR"].mean(), td["LDR"].mean()),
    "Assists": (td["WAst"].mean(), td["LAst"].mean()),
    "Steals":  (td["WStl"].mean(), td["LStl"].mean()),
    "Blocks":  (td["WBlk"].mean(), td["LBlk"].mean()),
    "Fouls":   (td["WPF"].mean(), td["LPF"].mean()),
}
diff_df = pd.DataFrame([
    {"stat": k, "winner": v[0], "loser": v[1], "difference": v[0] - v[1]}
    for k, v in diff_stats.items()
]).sort_values("difference", key=abs, ascending=False)

fig = go.Figure()
fig.add_trace(go.Bar(x=diff_df["stat"], y=diff_df["winner"], name="Winner", marker_color=COLORS["positive"]))
fig.add_trace(go.Bar(x=diff_df["stat"], y=diff_df["loser"],  name="Loser",  marker_color=COLORS["negative"]))
fig.update_layout(
    barmode="group",
    title="Average Stats: Tournament Winners vs. Losers (2003–2024)",
    xaxis_title="Statistic",
    yaxis_title="Average Value",
    template=TEMPLATE,
)
fig.show()

print("\nStat differentials (winner − loser) ranked by absolute difference:")
print(diff_df[["stat", "winner", "loser", "difference"]].to_string(index=False))


Stat differentials (winner − loser) ranked by absolute difference:
   stat    winner     loser  difference
 DefReb 25.887120 21.340087    4.547033
Assists 14.405210 11.433430    2.971780
  Fouls 16.209117 18.606368   -2.397250
 Blocks  3.979740  2.945007    1.034732
 Steals  6.364689  5.688857    0.675832
 OffReb 10.274240 10.663531   -0.389291
    3P%  0.379709  0.301016    0.078692
    FG%  0.475599  0.397298    0.078301
    FT%  0.728577  0.699109    0.029468
TO Rate  0.146989  0.154698   -0.007709
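
Ranking by raw difference mixes units: rebounds, assists, and fouls are counts per game, while FG%, 3P%, FT%, and TO rate are proportions, so the proportions always land at the bottom of the table. One way to compare them on a common scale is a standardized paired difference (the mean of winner minus loser divided by the standard deviation of that difference). The sketch below uses synthetic data; `paired_effect_size` is a hypothetical helper, and none of the numbers come from the notebook.

```python
import numpy as np
import pandas as pd

def paired_effect_size(winner: pd.Series, loser: pd.Series) -> float:
    """Mean per-game difference divided by the std of that difference."""
    diff = winner - loser
    return float(diff.mean() / diff.std(ddof=1))

rng = np.random.default_rng(0)
n_games = 500
toy = pd.DataFrame({
    # Synthetic box scores on very different scales.
    "w_fg_pct": rng.normal(0.47, 0.06, n_games),
    "l_fg_pct": rng.normal(0.40, 0.06, n_games),
    "WDR": rng.normal(26.0, 5.0, n_games),
    "LDR": rng.normal(21.5, 5.0, n_games),
})

effects = pd.Series({
    "FG%": paired_effect_size(toy["w_fg_pct"], toy["l_fg_pct"]),
    "DefReb": paired_effect_size(toy["WDR"], toy["LDR"]),
}).sort_values(key=abs, ascending=False)
print(effects.round(2).to_string())
```

On the real `td` frame the same helper could be applied to each winner/loser column pair, which would let the shooting percentages be ranked alongside the counting stats.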


In [31]:
# ── 6.7: 2025 data limitation note ────────────────────────────────────────────
print("="*70)
print("2025 DATA LIMITATION")
print("="*70)
print()
print("MNCAATourneyDetailedResults.csv stops at 2024.")
print("MNCAATourneyCompactResults.csv stops at 2024.")
print()
print("The 2025 NCAA Tournament had not yet completed at data export time.")
print("Tournament outcomes for 2025 are NOT included in this correlation analysis.")
print()
print("Regular-season detailed stats (MRegularSeasonDetailedResults.csv) ARE")
print("available through 2025 and are used in Section 6.2 for season averages.")
print("However, 2025 is excluded from the correlation analysis (no tournament outcomes).")

2025 DATA LIMITATION

MNCAATourneyDetailedResults.csv stops at 2024.
MNCAATourneyCompactResults.csv stops at 2024.

The 2025 NCAA Tournament had not yet completed at data export time.
Tournament outcomes for 2025 are NOT included in this correlation analysis.

Regular-season detailed stats (MRegularSeasonDetailedResults.csv) ARE
available through 2025 and are used in Section 6.2 for season averages.
However, 2025 is excluded from the correlation analysis (no tournament outcomes).


---
<a id='section-6'></a>
## Section 6: Findings Summary & Epic 4 Recommendations (AC 7)

### Key Findings

#### 1. Scoring Distributions
- Winning margins follow a right-skewed distribution with mean ~11–12 points across all 41 seasons
- Per-season average margin shows gradual trends over time; 3-point era increased scoring variance
- Home wins (loc=H) have the largest average margins; away wins (loc=A) are the most competitive
- Extreme OT games (≥4 OT, 48 games) have anomalously small margins by definition

#### 2. Venue / Home Court Advantage
- Home teams win approximately **61–63%** of non-neutral regular-season games
- Home advantage adds ~3–5 points to the average winning margin vs. neutral-site games
- Trend analysis tests whether home advantage has eroded with increased conference realignment and neutral-site scheduling

#### 3. Seed Patterns & Upset Rates
- 1 vs. 16 seed matchups are overwhelmingly chalk (UMBC 2018 is the only R64 16-seed upset in the data)
- 5 vs. 12 and 10 vs. 7 matchups are historically the most upset-prone in the first round
- 8 vs. 9 matchups are nearly 50/50 (no meaningful favorite)
- Post-2011 First Four adds complexity: play-in 16-seeds should be distinguished from direct-entry 16-seeds
- Expected wins per seed drop rapidly: seed 1 averages ~3.5 wins; seed 16 averages ~0.01 wins

#### 4. Conference Strength & Strength of Schedule
- ACC, Big Ten, Big 12, SEC, and Big East historically dominate tournament wins and top-4 seed allocations
- Conference realignment (Pac-12 collapse, Big East split) creates discontinuities in multi-season analysis
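- Intra-conference games have lower average winning margins than inter-conference games (10.9 vs. 14.1 points), plausibly because non-conference schedules include more mismatched opponents
- **Strength of Schedule (SoS)** is measured as mean regular-season opponent win rate; it is computed for all 41 seasons (1985–2025) and compared with tournament outcomes for 1985–2024
- Mean SoS rises from First Four teams (0.472) and R64/R32 teams (0.516) to Sweet 16 teams (0.552), then levels off near 0.56 for deeper rounds; Pearson r = 0.297 (MEDIUM signal, just under the 0.3 cutoff for HIGH)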

#### 5. Statistical Correlations with Tournament Outcomes
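- **Field goals made (FGM) has the strongest positive correlation** with tournament advancement (r ≈ 0.26), followed by points scored (r ≈ 0.23) and FG% (r ≈ 0.23); every correlation in this section is below 0.3 in magnitude
- **Blocks** (r ≈ 0.21) and **assists** (r ≈ 0.20) come next; defensive and offensive rebounding are weaker (r ≈ 0.12 each)
- **3P%** has a weak positive correlation (r ≈ 0.11); teams that shoot efficiently from 3 tend to advance slightly further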
- **Turnover rate (TO_rate) has the strongest negative correlation** — teams that protect the ball advance further
- Personal fouls (PF) show mild negative correlation with advancement
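- Tournament game differentials point the same way: winners shoot better (FG% 0.476 vs. 0.397), collect more defensive rebounds (25.9 vs. 21.3), and have a slightly lower turnover rate (0.147 vs. 0.155); no significance test was run on these differences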

### Known Data Limitations
- Box-score data begins in 2003; pre-2003 seasons cannot be analyzed for stat correlations
- 2020: No tournament — exclude from tournament-outcome analyses
- 2025: Tournament not complete; excluded from outcome correlation analysis
- 2025 regular season data includes ESPN+Kaggle duplicates (deduplicated in this notebook)
- Conference abbreviations are not normalized across seasons; realignments create breaks

In [32]:
# ── 7.2: Save findings summary to markdown file ───────────────────────────────
# Note: kernel CWD = notebooks/eda/ → file saves to notebooks/eda/statistical_exploration_findings.md

# Compute actual values for the findings doc
overall_home_wr = (non_neutral["loc"] == "H").mean()
home_margin_val = venue_margin[venue_margin["loc_label"]=="Home Win"]["avg_margin"].values[0]
neutral_margin_val = venue_margin[venue_margin["loc_label"]=="Neutral Site"]["avg_margin"].values[0]

top_pos_corr = corr_table.head(3)[["stat", "pearson_r"]].values
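# Keep only genuinely negative correlations, most negative first
top_neg_corr = (
    corr_table[corr_table["pearson_r"] < 0]
    .sort_values("pearson_r")
    .head(3)[["stat", "pearson_r"]]
    .values
)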

findings_md = f"""# NCAA Basketball Statistical Exploration Findings

Generated from: `notebooks/eda/02_statistical_exploration.ipynb` (Story 3.2)

---

## Summary Statistics

| Metric | Value |
|--------|-------|
| Total deduplicated games analyzed | {len(all_games_dedup):,} |
| Seasons covered | 1985–2025 (41 seasons) |
| Tournament games (through 2024) | {len(tourney_games):,} |
| Mean winning margin (all games) | {all_games_dedup['margin'].mean():.1f} pts |
| Median winning margin | {all_games_dedup['margin'].median():.1f} pts |
| Overall home win rate (reg. season, non-neutral) | {overall_home_wr:.3f} ({overall_home_wr*100:.1f}%) |

---

## Section 1: Scoring Distributions

- Mean winning margin: **{all_games_dedup['margin'].mean():.1f} pts** (median: {all_games_dedup['margin'].median():.1f})
- Score distributions are right-skewed; winner scores average {all_games_dedup['w_score'].mean():.1f} pts, loser {all_games_dedup['l_score'].mean():.1f} pts
- 48 extreme OT games (≥4 OT) exist — confirmed real (Story 3.1 audit)
- Score outliers (w_score > 130, 109 games) are legitimate historical data

---

## Section 2: Venue Effects / Home Court Advantage

- Home team wins **{overall_home_wr*100:.1f}%** of non-neutral regular-season games
- Home win average margin: **{home_margin_val:.1f} pts** vs. neutral site: **{neutral_margin_val:.1f} pts**
- Home court advantage over neutral: **{home_margin_val - neutral_margin_val:+.1f} pts**
- Linear trend slope: {slope:.5f}/season (p={p_val:.4f}) — {direction}, {significant}

---

## Section 3: Seed Patterns & Upset Rates

- 1 vs. 16 matchups: nearly always chalk (1 historical upset in data: UMBC 2018)
- Most upset-prone classic matchups: 5v12, 10v7, 11v6 (historically ~35–40% upset rates)
- 8v9 matchups approach 50% upset rate — essentially a coin flip
- Average tournament wins by seed 1: {wins_by_seed[wins_by_seed['seed']==1]['avg_wins'].values[0]:.2f}; seed 16: {wins_by_seed[wins_by_seed['seed']==16]['avg_wins'].values[0]:.3f}
- Post-2011 First Four adds play-in complexity for 11/16 seeds

---

## Section 4: Conference Strength & Strength of Schedule

- Top conferences by tournament wins: {', '.join(conf_wins['w_conf'].head(5).tolist())}
- Conference realignment (Big East split, Pac-12 collapse) creates analysis discontinuities
- Intra-conference games: {len(intra):,} ({len(intra)/len(reg_conf)*100:.1f}% of conf-identified reg. season games)
- **Strength of Schedule (SoS)**: mean regular-season opponent win rate, computed from {sos_df['Season'].nunique()} seasons (1985–2024)
- SoS ↔ Tournament Round Reached: **Pearson r = {sos_r:.4f}** (signal = {sos_signal})
- Mean SoS increases monotonically from R64/R32 teams through Champions — stronger schedule correlates with deeper runs

---

## Section 5: Statistical Correlations with Tournament Advancement

**Data coverage:** Regular-season stats 2003–2025; tournament outcomes 2003–2024.

### Top Positive Correlations (r with tournament round reached):
{chr(10).join(f'- **{row[0]}**: r = {row[1]:.4f}' for row in top_pos_corr)}

### Top Negative Correlations:
{chr(10).join(f'- **{row[0]}**: r = {row[1]:.4f}' for row in top_neg_corr)}

### Tournament Game Differentials (Winner − Loser):
- FG%: Winners average {diff_stats['FG%'][0]:.3f} vs. {diff_stats['FG%'][1]:.3f} for losers (diff: {diff_stats['FG%'][0]-diff_stats['FG%'][1]:+.3f})
- TO Rate: Winners average {diff_stats['TO Rate'][0]:.3f} vs. {diff_stats['TO Rate'][1]:.3f} for losers (diff: {diff_stats['TO Rate'][0]-diff_stats['TO Rate'][1]:+.3f})
- Def Rebounds: Winners average {diff_stats['DefReb'][0]:.1f} vs. {diff_stats['DefReb'][1]:.1f} for losers (diff: {diff_stats['DefReb'][0]-diff_stats['DefReb'][1]:+.1f})

---

## Known Data Limitations

- **Box-score coverage:** 2003–2025 only (no pre-2003 detailed stats)
- **2020 COVID:** No tournament; excluded from all tournament analyses
- **2025 tournament:** Incomplete at data export time; excluded from outcome correlations
- **2025 deduplication:** Applied — ESPN records preferred over Kaggle duplicates
- **Conference names:** Not normalized across realignment years
"""

output_path = Path("statistical_exploration_findings.md")
# Explicit UTF-8: the report contains non-ASCII characters (–, −, ≥, ↔)
output_path.write_text(findings_md, encoding="utf-8")
print(f"Findings saved to: {output_path.resolve()}")
print(f"File size: {output_path.stat().st_size:,} bytes")

Findings saved to: /home/dhilg/git/NCAA_eval/notebooks/eda/statistical_exploration_findings.md
File size: 3,232 bytes
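
The Section 4 findings define Strength of Schedule as the mean regular-season opponent win rate. A minimal, vectorized sketch of that definition follows. It assumes a compact-results frame with hypothetical columns `season`, `w_team_id`, and `l_team_id` (the notebook's actual column names may differ), and it does not remove a team's own games from its opponents' records, which the notebook's version may or may not do.

```python
import pandas as pd

# Toy data for illustration only; column names are assumptions.
games = pd.DataFrame({
    "season":    [2024, 2024, 2024, 2024],
    "w_team_id": [1, 1, 2, 3],
    "l_team_id": [2, 3, 3, 1],
})


def team_win_rates(games: pd.DataFrame) -> pd.Series:
    """Win rate per (season, team) from winner/loser columns."""
    wins = (games.groupby(["season", "w_team_id"]).size()
                 .rename_axis(["season", "team_id"]))
    losses = (games.groupby(["season", "l_team_id"]).size()
                   .rename_axis(["season", "team_id"]))
    total = wins.add(losses, fill_value=0)  # games played
    return (wins.reindex(total.index, fill_value=0) / total).rename("win_rate")


def strength_of_schedule(games: pd.DataFrame) -> pd.Series:
    """Mean opponent win rate per (season, team), weighted by games played."""
    wr = team_win_rates(games).reset_index()
    # One row per (team, opponent) appearance: each game contributes two rows.
    long = pd.concat([
        games.rename(columns={"w_team_id": "team_id", "l_team_id": "opp_id"}),
        games.rename(columns={"l_team_id": "team_id", "w_team_id": "opp_id"}),
    ], ignore_index=True)
    long = long.merge(
        wr.rename(columns={"team_id": "opp_id", "win_rate": "opp_win_rate"}),
        on=["season", "opp_id"],
        how="left",
    )
    return long.groupby(["season", "team_id"])["opp_win_rate"].mean().rename("sos")


sos = strength_of_schedule(games)
print(sos.round(3))
```

Because opponents faced more than once are counted once per game, this is a game-weighted mean. Correlating the resulting `sos` column with a numeric round-reached column (for example via `scipy.stats.pearsonr`) would give a statistic comparable in kind to the `sos_r` value reported above.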


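The seed-matchup upset rates quoted in the Section 3 findings can be recomputed without `iterrows()`. The sketch below is illustrative only: the `w_seed` and `l_seed` column names and the toy rows are assumptions, and an "upset" is taken to mean that the numerically higher seed won.

```python
import numpy as np
import pandas as pd

# Toy tournament results; column names and values are hypothetical.
tourney = pd.DataFrame({
    "w_seed": [1, 1, 16, 5, 12, 12, 8, 9],
    "l_seed": [16, 16, 1, 12, 5, 5, 9, 8],
})

# Favorite = numerically lower seed, underdog = numerically higher seed.
fav = np.minimum(tourney["w_seed"], tourney["l_seed"])
dog = np.maximum(tourney["w_seed"], tourney["l_seed"])

matchups = tourney.assign(
    matchup=fav.astype(str) + "v" + dog.astype(str),
    upset=tourney["w_seed"] > tourney["l_seed"],  # underdog won
)
# Same-seed games (possible in later rounds) have no favorite; drop them.
matchups = matchups[fav != dog]

upset_rates = (
    matchups.groupby("matchup")["upset"]
            .agg(upset_rate="mean", games="size")
            .sort_values("upset_rate", ascending=False)
)
print(upset_rates)
```

Reporting the `games` count next to each rate matters here, since rare matchups such as 1v16 rest on very few upsets.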
In [33]:
# ── 7.3: Ranked Epic 4 feature engineering recommendations ────────────────────
epic4_recommendations = f"""
## Epic 4 Feature Engineering Recommendations
(Ranked by expected predictive value, based on Section 5 correlations)

RANK | FEATURE                         | SIGNAL STRENGTH | NOTES
  1  | Regular-season FG% (rolling avg)| HIGH            | Strongest Pearson r; robust across seasons 2003–2024
  2  | Turnover rate (possession-adj.) | HIGH            | Strongest negative predictor; strongly differentiates tournament teams
  3  | Defensive rebounding rate       | HIGH            | Consistently separates winners from losers in tournament games
  4  | FG% differential (team − opp.) | HIGH            | Net efficiency captures both offense and defense
  5  | Seed number                     | HIGH            | Strong prior; seed encodes selection committee's rating
  6  | 3-point % (rolling avg)         | MEDIUM          | Moderate correlation; high variance game-to-game
  7  | Home court advantage factor     | MEDIUM          | ~3–5 pt adjustment; applies to regular-season scheduling analysis
  8  | Conference strength index       | MEDIUM          | Proxy for strength-of-schedule quality
  9  | Strength-of-schedule (opp WR)   | {sos_signal:<15}  | Measured: Pearson r={sos_r:.3f} vs tournament round (Section 4b)
 10  | Free throw % and rate           | LOW-MEDIUM      | Small but consistent correlation; critical in close tournament games
 11  | Block rate                      | LOW             | Tournament differentiator but weak in regular-season correlation
 12  | Assist-to-turnover ratio        | LOW-MEDIUM      | Composite efficiency stat; consider as alternative to raw TO rate
 13  | Massey Ordinals (POM/SAG/etc.)  | HIGH            | External ratings; most seasons covered; Story 3.1 found AP/DOL/COL/MOR/POM each have 23 seasons
 14  | ELO-based prior                 | HIGH            | Chronological, self-consistent; no look-ahead bias (Epic 5 reference model)

IMPLEMENTATION NOTES for Epic 4:
- Box-score features require 2003+ season filtering (pre-2003 has no detailed stats)
- 2025 data needs deduplication before feature computation (apply dedup pattern from Story 3.2)
- 2020 must be excluded from ALL evaluation years (but can be used for training feature computation)
- Rolling averages should use chronological data only (no look-ahead into future games)
- Conference features: use (Season, ConfAbbrev) as atomic unit — do not normalize across realignment years
- SoS computed from compact results (1985+); can extend to any season unlike box-score features
"""

print(epic4_recommendations)


## Epic 4 Feature Engineering Recommendations
(Ranked by expected predictive value, based on Section 5 correlations)

RANK | FEATURE                         | SIGNAL STRENGTH | NOTES
  1  | Regular-season FG% (rolling avg)| HIGH            | Strongest Pearson r; robust across seasons 2003–2024
  2  | Turnover rate (possession-adj.) | HIGH            | Strongest negative predictor; strongly differentiates tournament teams
  3  | Defensive rebounding rate       | HIGH            | Consistently separates winners from losers in tournament games
  4  | FG% differential (team − opp.) | HIGH            | Net efficiency captures both offense and defense
  5  | Seed number                     | HIGH            | Strong prior; seed encodes selection committee's rating
  6  | 3-point % (rolling avg)         | MEDIUM          | Moderate correlation; high variance game-to-game
  7  | Home court advantage factor     | MEDIUM          | ~3–5 pt adjustment; applies to regular-season scheduling analys