NCAA Bracket Model

A machine learning model for predicting NCAA tournament outcomes, trained on 23 seasons of data (2002–2025, excluding 2020). It achieves 71.4% accuracy under leave-year-out cross-validation across 1,444 historical games, compared with a seed-alone baseline of ~58% and an AdjEM-alone baseline of ~65%.

How It Works

The model is a HistGradientBoostingClassifier trained on per-matchup feature differentials. For each game, both teams' pre-tournament stats are differenced (Team A minus Team B), producing a single feature vector that the model uses to predict the win probability.
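As a rough illustration of the differencing step, the sketch below subtracts Team B's stats from Team A's for every stat both teams share. The function name, the `_diff` suffix, and the sample numbers are assumptions for illustration, not the code in src/features.py.

```python
def matchup_differential(team_a: dict[str, float], team_b: dict[str, float]) -> dict[str, float]:
    """Return Team A minus Team B for every stat present on both teams."""
    shared = sorted(team_a.keys() & team_b.keys())
    # The "_diff" suffix is a hypothetical naming choice.
    return {f"{name}_diff": team_a[name] - team_b[name] for name in shared}


# Made-up pre-tournament stats, for illustration only.
team_a = {"AdjEM": 25.0, "AdjOE": 118.5, "AdjDE": 93.5}
team_b = {"AdjEM": 10.0, "AdjOE": 110.0, "AdjDE": 100.0}

features = matchup_differential(team_a, team_b)
print(features)
```

A positive AdjEM differential here means Team A entered the game with the better efficiency margin.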

Training strategy: Leave-Year-Out cross-validation — for each held-out year, the model trains on all other years. This mirrors real prediction conditions where you never have data from the year you're predicting.
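The split itself can be sketched in a few lines of plain Python. This is a minimal illustration that assumes each game row carries its season year; the helper name is hypothetical and not taken from src/model.py. scikit-learn's `LeaveOneGroupOut` produces the same partition when the years are passed as groups.

```python
from collections.abc import Iterator


def leave_year_out_splits(years: list[int]) -> Iterator[tuple[int, list[int], list[int]]]:
    """Yield (held_out_year, train_indices, test_indices) once per distinct year."""
    for held_out in sorted(set(years)):
        train_idx = [i for i, y in enumerate(years) if y != held_out]
        test_idx = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train_idx, test_idx


# One entry per game: the season that game belongs to (toy data).
game_years = [2018, 2018, 2019, 2021, 2021, 2021]
splits = list(leave_year_out_splits(game_years))

for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```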

Data augmentation: Every game is duplicated with team positions swapped and all differential features negated. This removes positional bias and balances classes to exactly 50/50.
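A minimal sketch of the swap augmentation, assuming each game is stored as a dict of differential features plus a label that is 1 when Team A won. The data layout and function name are illustrative assumptions.

```python
def augment_with_swaps(games: list[tuple[dict[str, float], int]]) -> list[tuple[dict[str, float], int]]:
    """Return every game plus its mirror image (features negated, label flipped)."""
    augmented = []
    for diffs, team_a_won in games:
        augmented.append((diffs, team_a_won))
        # Swapping the teams negates every A-minus-B differential
        # and turns a Team A win into a Team A loss.
        mirrored = {name: -value for name, value in diffs.items()}
        augmented.append((mirrored, 1 - team_a_won))
    return augmented


# Toy input in which Team A won both games (fully imbalanced labels).
games = [({"AdjEM_diff": 15.0}, 1), ({"AdjEM_diff": -3.0}, 1)]
augmented = augment_with_swaps(games)

wins = sum(label for _, label in augmented)
print(len(augmented), "rows,", wins, "labelled as Team A wins")
```

Each mirrored row belongs to the same season as its original, so a split by year keeps both copies on the same side of the train/test boundary.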

Validated Performance

Metric	Value
Accuracy	71.4%
Brier Score	0.193
Log Loss	0.581
Games evaluated	1,444 (23 seasons)
Seed-alone baseline	~58%
AdjEM-alone baseline	~65%
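For reference, the three metrics in the table can be computed from predicted win probabilities and 0/1 outcomes as sketched below. These are the standard definitions written in plain Python; the function names and the 0.5 decision threshold are assumptions, not the project's evaluation code.

```python
import math


def accuracy(probs: list[float], outcomes: list[int], threshold: float = 0.5) -> float:
    """Fraction of games where the thresholded prediction matches the outcome."""
    correct = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return correct / len(outcomes)


def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def log_loss(probs: list[float], outcomes: list[int], eps: float = 1e-15) -> float:
    """Mean negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(outcomes)


# A forecaster that always says 50% scores Brier 0.25 and log loss ln(2).
coin_flip = [0.5, 0.5]
results = [1, 0]
print(brier_score(coin_flip, results), log_loss(coin_flip, results))
```

A constant 50% forecast gives a Brier score of 0.25 and a log loss of about 0.693, so the reported 0.193 and 0.581 are both better than an uninformed guess.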

Top Predictive Features

Rank	Feature	Correlation	Description
1	AdjEM	+0.392	KenPom efficiency margin
2	AdjOE	+0.325	Adjusted offensive efficiency
3	two_way_depth	+0.301	Roster two-way player depth
4	AdjDE	-0.288	Adjusted defensive efficiency
5	program_tourney_rate_l5	+0.256	5-year tournament appearance rate

Pipeline

Raw data (data/{year}/)
    ↓
Feature Engineering
  src/kenpom.py            KenPom pre-tournament ratings
  src/scouting.py          Four Factors + advanced team stats
  src/player_features.py   Roster composition features
  src/gameplan_features.py Last-10-game rolling momentum
  src/program_features.py  Program pedigree (tourney/F4 rates)
  src/features.py          Assembles matchup differential matrix
    ↓
Training  (src/model.py)
  HistGradientBoostingClassifier, Leave-Year-Out CV
    ↓
Evaluation  (scripts/)
  precompute_brackets.py   Simulated bracket per year
  precompute_feature_importance.py  Permutation importance
  profile_outcomes.py      Round-by-round accuracy
    ↓
2026 Inference  (scripts/predict_2026.py)
  Win probabilities for all possible 2026 matchups

Project Structure

src/                    Core feature engineering + model
scripts/                Pipeline scripts (scraping, training, evaluation, prediction)
app/
  backend/              FastAPI server
  frontend/             Vite + vanilla JS similarity UI
config/                 Reference JSON files (name map, coaches, tournament dates)
analysis/               Feature importance findings, outcome profiles
data/                   Raw data — NOT included (see data/README.md for sources)

Quick Start

Prerequisites

Python 3.10+
Node.js 18+
KenPom premium subscription (required for scraping — data is not included in this repo)

1. Install dependencies

# Python
pip install -r app/requirements.txt

# Frontend
cd app/frontend && npm install && cd ../..

2. Export your KenPom cookies

The scraping scripts authenticate with KenPom using your browser session cookies. No username/password is stored.

1. Log into kenpom.com in Chrome
2. Install a cookie export extension (Cookie-Editor works well)
3. Navigate to any KenPom page, open Cookie-Editor, and click Export → Export as JSON
4. Save the file as cookies.json in the project root (same level as README.md)

cookies.json is gitignored. Never commit it.
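If you want to check the export before starting a long scrape, a loader along these lines may help. It assumes the file is a JSON array of objects with name, value, and domain keys, which is the layout Cookie-Editor's JSON export is expected to produce. The function name and the domain filter are illustrative and not taken from the scraping scripts.

```python
import json
from pathlib import Path


def load_cookie_dict(path: str, domain_suffix: str = "kenpom.com") -> dict[str, str]:
    """Read an exported cookie file and return {cookie name: value} for one domain."""
    entries = json.loads(Path(path).read_text(encoding="utf-8"))
    return {
        entry["name"]: entry["value"]
        for entry in entries
        # Exported domains often carry a leading dot, e.g. ".kenpom.com".
        if entry.get("domain", "").lstrip(".").endswith(domain_suffix)
    }


# Usage, once cookies.json exists in the project root:
# cookies = load_cookie_dict("cookies.json")
# print(sorted(cookies))  # prints cookie names only, never the values
```

An empty result usually means the export was taken while logged out or from a different site.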

3. Download KenPom pre-tournament summary CSVs

These must be downloaded manually — they are the pre-tournament snapshots that serve as model features.

For each season you want (2002–2026, skip 2020):

1. Go to kenpom.com/summary.php?y=YEAR while logged in
2. Click Export (top of the table) to download the CSV
3. Save it as data/{year}/summary{YY}_pt.csv
   - Example: data/2025/summary25_pt.csv

The _pt suffix is important — it signals a pre-tournament snapshot. Post-tournament CSVs have updated ratings that would introduce data leakage.
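A short helper can confirm that every season's file is in place before you move on. The sketch below follows the naming pattern given above. Zero-padding the two-digit year for 2002–2009 (for example summary02_pt.csv) is an assumption, since the only example shown is 2025, and the helper names are hypothetical.

```python
from pathlib import Path

# 2002 through 2026, skipping 2020 (no tournament was held).
SEASONS = [year for year in range(2002, 2027) if year != 2020]


def summary_csv_path(year: int, data_root: str = "data") -> Path:
    """Expected location of the pre-tournament summary CSV for one season."""
    # Assumption: the {YY} part is zero-padded, e.g. 2002 -> "02".
    return Path(data_root) / str(year) / f"summary{year % 100:02d}_pt.csv"


def missing_summaries(data_root: str = "data") -> list[int]:
    """Seasons whose pre-tournament CSV has not been downloaded yet."""
    return [year for year in SEASONS if not summary_csv_path(year, data_root).exists()]


if __name__ == "__main__":
    for year in missing_summaries():
        print("missing:", summary_csv_path(year))
```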

4. Scrape remaining data

Run these from the project root in order. Each script is resumable — it skips files that already exist.

# Game-by-game logs (rolling momentum features) — ~2–3 hrs
python scripts/scrape_all_years.py

# Scouting reports + Four Factors — ~2–3 hrs
python scripts/scrape_scouting.py

# Player roster stats — ~2–3 hrs
python scripts/scrape_players_all_years.py

# Conference standings (Sports Reference, no login required)
python scripts/scrape_conferences.py

# Build coach map from raw coach data
python scripts/build_coach_map.py

Each scraper prints progress and writes errors to a .log file in the project root if anything fails. Re-running will resume where it left off.
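The skip-if-exists behaviour can be pictured with a sketch like the one below. It is a generic pattern under stated assumptions, not the scrapers' actual code: the function name is hypothetical, and `fetch` stands in for whatever downloads one page. Writing to a temporary name first means an interrupted run does not leave a half-written file that a later run would mistake for a finished one.

```python
from pathlib import Path
from typing import Callable


def fetch_if_missing(out_path: str, fetch: Callable[[], str]) -> bool:
    """Write fetch() to out_path unless the file already exists.

    Returns True if the file was fetched, False if it was skipped.
    """
    target = Path(out_path)
    if target.exists():
        return False  # already scraped on a previous run
    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a sibling ".part" file, then rename, so a crash mid-write
    # never leaves a partial file under the final name.
    partial = target.with_name(target.name + ".part")
    partial.write_text(fetch(), encoding="utf-8")
    partial.replace(target)
    return True
```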

5. Precompute artifacts

# Delete stale cache (required after any data changes)
rm -f data/datacache.pkl

# Leave-year-out bracket simulations (used by the similarity UI)
python scripts/precompute_brackets.py

# Feature importance by round (~5–10 min, optional)
python scripts/precompute_feature_importance.py

6. Run the app

# Backend (from project root)
uvicorn app.backend.main:app --reload --app-dir .
# → http://localhost:8000

# Frontend (separate terminal)
cd app/frontend && npm run dev
# → http://localhost:5173

Open http://localhost:5173 — the backend warms up the data cache on first start (~30s).

Pipeline Scripts

# Generate 2026 win probability predictions
python scripts/predict_2026.py

# Audit team name normalization across years
python scripts/audit_names.py

To force a full data cache rebuild at any time, delete data/datacache.pkl and restart the backend.

Team Similarity UI

The app includes a team similarity explorer — a dual-space Euclidean search across team stat vectors and player roster vectors. Given any historical tournament team, it finds the most similar teams across all other years. This is a scouting tool separate from the bracket predictor; similarity scores are for exploration, not win probabilities.
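One way to picture a dual-space search is sketched below: compute a Euclidean distance in the team-stat space and another in the roster space, add them, and rank teams from other years by the total. How the app actually combines the two spaces is not described here, so the equal-weight sum, the data layout, and the function name are all assumptions.

```python
import math

TeamKey = tuple[int, str]                      # (year, team name)
Vectors = tuple[list[float], list[float]]      # (team stat vector, roster vector)


def most_similar(query: TeamKey, teams: dict[TeamKey, Vectors], k: int = 3) -> list[tuple[TeamKey, float]]:
    """Rank teams from other years by summed Euclidean distance in both spaces."""
    query_stats, query_roster = teams[query]
    scored = []
    for key, (stats, roster) in teams.items():
        if key[0] == query[0]:
            continue  # skip the query's own year, including the query itself
        # Assumption: the two distances are simply added with equal weight.
        distance = math.dist(query_stats, stats) + math.dist(query_roster, roster)
        scored.append((distance, key))
    scored.sort()
    return [(key, distance) for distance, key in scored[:k]]


# Toy vectors, for illustration only.
teams = {
    (2019, "A"): ([0.0, 0.0], [0.0]),
    (2019, "D"): ([0.0, 0.0], [0.0]),
    (2021, "B"): ([3.0, 4.0], [0.0]),
    (2022, "C"): ([1.0, 0.0], [1.0]),
}
print(most_similar((2019, "A"), teams, k=2))
```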

Key Design Decisions

No data leakage: All features use only information available before Selection Sunday. KenPom ratings are pre-tournament snapshots; rolling stats use only regular season and conference tournament games.
Within-year normalization: Z-scores are computed per year using only that year's tournament teams. 2026 inference normalizes using only 2026 teams.
2020 excluded everywhere: No tournament was held; excluded from all year ranges, loops, and aggregations.
Name normalization is critical: Every team name passes through config/name_map.json before any join. Silent mismatches drop teams.
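The within-year normalization rule can be sketched as follows: group tournament teams by season, then standardize each value against its own season's mean and standard deviation. The row layout and function name are illustrative, and using the population standard deviation rather than the sample one is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_within_year(rows: list[dict], feature: str) -> list[float]:
    """Z-score one feature using only the teams from the same year."""
    values_by_year = defaultdict(list)
    for row in rows:
        values_by_year[row["year"]].append(row[feature])

    # Assumption: population standard deviation (pstdev), one pair per year.
    year_stats = {year: (mean(vals), pstdev(vals)) for year, vals in values_by_year.items()}

    scores = []
    for row in rows:
        mu, sigma = year_stats[row["year"]]
        # A year where every team has the same value carries no signal.
        scores.append(0.0 if sigma == 0 else (row[feature] - mu) / sigma)
    return scores


# Toy rows: the raw scale differs between years, the z-scores do not.
rows = [
    {"year": 2024, "AdjEM": 10.0},
    {"year": 2024, "AdjEM": 20.0},
    {"year": 2025, "AdjEM": 30.0},
    {"year": 2025, "AdjEM": 50.0},
]
print(zscore_within_year(rows, "AdjEM"))
```

Because each year is standardized on its own, a new season can be scored from its own field alone without touching earlier years.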
