A machine learning model for predicting NCAA tournament outcomes, trained on 23 seasons of data (2002–2025, excluding 2020). Achieves 71.4% accuracy on leave-year-out cross-validation across 1,444 historical games — compared to a seed-alone baseline of ~58% and AdjEM-alone of ~65%.
The model is a HistGradientBoostingClassifier trained on per-matchup feature differentials. For each game, both teams' pre-tournament stats are differenced (Team A minus Team B), producing a single feature vector from which the model predicts the probability that Team A wins.
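As a rough illustration of the differencing step, here is a minimal sketch in plain Python. The function name `matchup_differential`, the dict layout, and the numbers are assumptions made for this example; the real assembly lives in src/features.py and may look quite different.

```python
def matchup_differential(team_a, team_b, feature_names):
    """Return Team A minus Team B for each named feature."""
    return [team_a[name] - team_b[name] for name in feature_names]


# Made-up ratings, for illustration only.
FEATURES = ["AdjEM", "AdjOE", "AdjDE"]
team_a = {"AdjEM": 25.0, "AdjOE": 118.0, "AdjDE": 93.0}
team_b = {"AdjEM": 10.0, "AdjOE": 110.0, "AdjDE": 100.0}

diff = matchup_differential(team_a, team_b, FEATURES)
print(diff)  # [15.0, 8.0, -7.0]
```

A negative AdjDE differential favors Team A here, since lower defensive efficiency is better; that matches the negative AdjDE correlation in the feature table.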
Training strategy: Leave-Year-Out cross-validation — for each held-out year, the model trains on all other years. This mirrors real prediction conditions where you never have data from the year you're predicting.
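The splitting logic can be sketched without any ML library. The helper `leave_year_out_splits` below is a hypothetical name, not a function from this repo; scikit-learn's `LeaveOneGroupOut` with the season as the group label produces equivalent splits, though whether src/model.py uses it is not stated here.

```python
def leave_year_out_splits(years):
    """Yield (held_out_year, train_idx, test_idx) for each distinct year."""
    for held_out in sorted(set(years)):
        train_idx = [i for i, y in enumerate(years) if y != held_out]
        test_idx = [i for i, y in enumerate(years) if y == held_out]
        yield held_out, train_idx, test_idx


# One entry per game: the season that game belongs to (toy data).
game_years = [2018, 2018, 2019, 2021, 2021, 2021]

splits = list(leave_year_out_splits(game_years))
for year, train_idx, test_idx in splits:
    print(year, train_idx, test_idx)
```

Each game appears in exactly one test fold, so per-fold predictions can be concatenated to compute metrics over every historical game.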
Data augmentation: Every game is duplicated with team positions swapped and all differential features negated. This removes positional bias and balances classes to exactly 50/50.
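A minimal sketch of the swap-and-negate idea follows. The function name `augment_with_swaps` and the `(diff_vector, label)` row layout are assumptions for illustration, with label 1 meaning Team A won.

```python
def augment_with_swaps(rows):
    """Add a mirrored copy of every game: negated features, flipped label."""
    augmented = []
    for diff, label in rows:
        augmented.append((list(diff), label))
        augmented.append(([-x for x in diff], 1 - label))
    return augmented


# Toy games: (Team A minus Team B features, 1 if Team A won).
games = [
    ([15.0, 8.0, -7.0], 1),
    ([-3.0, 1.5, 2.0], 1),
    ([4.0, -2.0, 0.5], 0),
]

augmented = augment_with_swaps(games)
positive_rate = sum(label for _, label in augmented) / len(augmented)
print(len(augmented), positive_rate)  # 6 0.5
```

Because a game and its mirror share a season, a year-based split keeps both copies in the same fold, so the mirror of a held-out game never appears in training.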
| Metric | Value |
|---|---|
| Accuracy | 71.4% |
| Brier Score | 0.193 |
| Log Loss | 0.581 |
| Games evaluated | 1,444 (23 seasons) |
| Seed-alone baseline | ~58% |
| AdjEM-alone baseline | ~65% |
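For readers unfamiliar with the metrics above, here is a sketch of how they are conventionally computed from predicted win probabilities. These are the standard textbook definitions, not code taken from this repo, and the probabilities below are invented.

```python
import math


def accuracy(probs, outcomes, threshold=0.5):
    """Share of games where the favored side (p >= threshold) matches the result."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, outcomes))
    return hits / len(outcomes)


def brier_score(probs, outcomes):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(outcomes)


def log_loss(probs, outcomes, eps=1e-15):
    """Mean negative log-likelihood; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(outcomes)


probs = [0.9, 0.6, 0.3, 0.5]   # predicted P(Team A wins), toy values
outcomes = [1, 0, 0, 1]        # 1 if Team A actually won

print(accuracy(probs, outcomes))     # 0.75
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25, the coin-flip reference
```

As a reference point, always predicting 0.5 scores a Brier of 0.25 and a log loss of ln 2 ≈ 0.693; lower is better for both.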

| Rank | Feature | Correlation | Description |
|---|---|---|---|
| 1 | AdjEM | +0.392 | KenPom efficiency margin |
| 2 | AdjOE | +0.325 | Adjusted offensive efficiency |
| 3 | two_way_depth | +0.301 | Roster two-way player depth |
| 4 | AdjDE | -0.288 | Adjusted defensive efficiency |
| 5 | program_tourney_rate_l5 | +0.256 | 5-year tournament appearance rate |
Raw data (data/{year}/)
  ↓
Feature Engineering
  src/kenpom.py             KenPom pre-tournament ratings
  src/scouting.py           Four Factors + advanced team stats
  src/player_features.py    Roster composition features
  src/gameplan_features.py  Last-10-game rolling momentum
  src/program_features.py   Program pedigree (tourney/F4 rates)
  src/features.py           Assembles matchup differential matrix
  ↓
Training (src/model.py)
  HistGradientBoostingClassifier, Leave-Year-Out CV
  ↓
Evaluation (scripts/)
  precompute_brackets.py            Simulated bracket per year
  precompute_feature_importance.py  Permutation importance
  profile_outcomes.py               Round-by-round accuracy
  ↓
2026 Inference (scripts/predict_2026.py)
  Win probabilities for all possible 2026 matchups
src/         Core feature engineering + model
scripts/     Pipeline scripts (scraping, training, evaluation, prediction)
app/
  backend/   FastAPI server
  frontend/  Vite + vanilla JS similarity UI
config/      Reference JSON files (name map, coaches, tournament dates)
analysis/    Feature importance findings, outcome profiles
data/        Raw data, NOT included (see data/README.md for sources)
- Python 3.10+
- Node.js 18+
- KenPom premium subscription (required for scraping — data is not included in this repo)
# Python
pip install -r app/requirements.txt
# Frontend
cd app/frontend && npm install && cd ../..

The scraping scripts authenticate with KenPom using your browser session cookies. No username/password is stored.
- Log into kenpom.com in Chrome
- Install a cookie export extension — Cookie-Editor works well
- Navigate to any KenPom page, open Cookie-Editor, and click Export → Export as JSON
- Save the file as cookies.json in the project root (same level as README.md)

cookies.json is gitignored. Never commit it.
These must be downloaded manually — they are the pre-tournament snapshots that serve as model features.
For each season you want (2002–2026, skip 2020):
- Go to kenpom.com/summary.php?y=YEAR while logged in
- Click Export (top of the table) to download the CSV
- Save it as data/{year}/summary{YY}_pt.csv
  - Example: data/2025/summary25_pt.csv

The _pt suffix is important: it signals a pre-tournament snapshot. Post-tournament CSVs have updated ratings that would introduce data leakage.
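Before scraping, it can help to check which pre-tournament CSVs are still missing. The sketch below builds the expected paths from the naming pattern above. Both helper names are hypothetical, and the zero-padded two-digit year for early seasons (for example summary02_pt.csv) is an assumption, since only the 2025 example is given.

```python
from pathlib import Path


def summary_csv_path(year, data_root="data"):
    """Expected location of the pre-tournament summary CSV for one season."""
    if year == 2020:
        raise ValueError("2020 has no tournament and is excluded")
    # Assumption: {YY} is the zero-padded two-digit year.
    return Path(data_root) / str(year) / f"summary{year % 100:02d}_pt.csv"


def missing_summaries(first=2002, last=2026, data_root="data"):
    """List expected CSV paths that do not exist yet, skipping 2020."""
    years = [y for y in range(first, last + 1) if y != 2020]
    paths = [summary_csv_path(y, data_root) for y in years]
    return [p for p in paths if not p.exists()]


print(summary_csv_path(2025).as_posix())  # data/2025/summary25_pt.csv
for path in missing_summaries():
    print("missing:", path.as_posix())
```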
Run these from the project root in order. Each script is resumable — it skips files that already exist.
# Game-by-game logs (rolling momentum features) — ~2–3 hrs
python scripts/scrape_all_years.py
# Scouting reports + Four Factors — ~2–3 hrs
python scripts/scrape_scouting.py
# Player roster stats — ~2–3 hrs
python scripts/scrape_players_all_years.py
# Conference standings (Sports Reference, no login required)
python scripts/scrape_conferences.py
# Build coach map from raw coach data
python scripts/build_coach_map.py

Each scraper prints progress and writes errors to a .log file in the project root if anything fails. Re-running will resume where it left off.
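The skip-if-exists behavior described above can be sketched as follows. This is an illustration of the general pattern, not the scrapers' actual code: `run_resumable`, the `.html` output naming, and the `fetch` callback are all assumptions.

```python
import tempfile
from pathlib import Path


def run_resumable(targets, fetch, out_dir):
    """Fetch each target unless its output file already exists."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written, skipped = [], []
    for name in targets:
        out_path = out_dir / f"{name}.html"
        if out_path.exists():
            skipped.append(name)  # finished on an earlier run
            continue
        # Write to a temp file first so an interrupted run never
        # leaves a partial file that a later run would skip.
        tmp_path = out_path.with_name(out_path.name + ".tmp")
        tmp_path.write_text(fetch(name), encoding="utf-8")
        tmp_path.replace(out_path)
        written.append(name)
    return written, skipped


def fake_fetch(name):
    """Stand-in for a network request."""
    return f"<html>{name}</html>"


with tempfile.TemporaryDirectory() as tmp:
    first = run_resumable(["duke", "kansas"], fake_fetch, tmp)
    second = run_resumable(["duke", "kansas", "gonzaga"], fake_fetch, tmp)

print(first)   # (['duke', 'kansas'], [])
print(second)  # (['gonzaga'], ['duke', 'kansas'])
```

One consequence of this pattern is that a stale or corrupt output file has to be deleted by hand before the script will fetch it again.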
# Delete stale cache (required after any data changes)
rm data/datacache.pkl
# Leave-year-out bracket simulations (used by the similarity UI)
python scripts/precompute_brackets.py
# Feature importance by round (~5–10 min, optional)
python scripts/precompute_feature_importance.py

# Backend (from project root)
uvicorn app.backend.main:app --reload --app-dir .
# → http://localhost:8000
# Frontend (separate terminal)
cd app/frontend && npm run dev
# → http://localhost:5173

Open http://localhost:5173. The backend warms up the data cache on first start (~30s).
# Generate 2026 win probability predictions
python scripts/predict_2026.py
# Audit team name normalization across years
python scripts/audit_names.py

To force a full data cache rebuild at any time, delete data/datacache.pkl and restart the backend.
The app includes a team similarity explorer — a dual-space Euclidean search across team stat vectors and player roster vectors. Given any historical tournament team, it finds the most similar teams across all other years. This is a scouting tool separate from the bracket predictor; similarity scores are for exploration, not win probabilities.
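One plausible reading of "dual-space" is that a Euclidean distance is computed separately in the team-stat space and the roster space, then the two are combined. The sketch below takes that reading. The weighted sum, the equal default weights, and the record layout are assumptions for illustration and may not match what the app does.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def most_similar(query, candidates, k=3, team_weight=0.5):
    """Rank candidates from other years by a weighted sum of both distances."""
    scored = []
    for cand in candidates:
        if cand["year"] == query["year"]:
            continue  # only compare against other seasons
        d_team = euclidean(query["team_vec"], cand["team_vec"])
        d_roster = euclidean(query["roster_vec"], cand["roster_vec"])
        combined = team_weight * d_team + (1 - team_weight) * d_roster
        scored.append((combined, cand["name"], cand["year"]))
    return sorted(scored)[:k]


# Toy vectors; real ones would be normalized stat and roster features.
query = {"name": "Team Q", "year": 2024, "team_vec": [1.0, 0.0], "roster_vec": [0.0, 0.0]}
candidates = [
    {"name": "Team A", "year": 2019, "team_vec": [1.0, 0.0], "roster_vec": [3.0, 4.0]},
    {"name": "Team B", "year": 2015, "team_vec": [1.0, 2.0], "roster_vec": [0.0, 0.0]},
    {"name": "Team C", "year": 2024, "team_vec": [1.0, 0.0], "roster_vec": [0.0, 0.0]},
]

matches = most_similar(query, candidates)
print(matches)  # [(1.0, 'Team B', 2015), (2.5, 'Team A', 2019)]
```

Team C is identical to the query but is excluded because it comes from the same year.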
- No data leakage: All features use only information available before Selection Sunday. KenPom ratings are pre-tournament snapshots; rolling stats use only regular season and conference tournament games.
- Within-year normalization: Z-scores are computed per year using only that year's tournament teams. 2026 inference normalizes using only 2026 teams.
- 2020 excluded everywhere: No tournament was held; excluded from all year ranges, loops, and aggregations.
- Name normalization is critical: Every team name passes through config/name_map.json before any join. Silent mismatches drop teams.
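The within-year normalization in the list above can be sketched like this. The function name `zscore_within_year`, the row layout, and the use of the population standard deviation are assumptions; the source does not say which variant the pipeline uses.

```python
from collections import defaultdict
from statistics import mean, pstdev


def zscore_within_year(rows, feature):
    """Add a z-score column computed only from teams in the same year."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row[feature])
    # Assumption: population standard deviation (pstdev), not sample.
    stats = {year: (mean(vals), pstdev(vals)) for year, vals in by_year.items()}

    out = []
    for row in rows:
        mu, sigma = stats[row["year"]]
        z = 0.0 if sigma == 0 else (row[feature] - mu) / sigma
        out.append({**row, f"{feature}_z": z})
    return out


# Toy tournament fields for two seasons.
teams = [
    {"team": "A", "year": 2024, "AdjEM": 10.0},
    {"team": "B", "year": 2024, "AdjEM": 20.0},
    {"team": "C", "year": 2024, "AdjEM": 30.0},
    {"team": "D", "year": 2025, "AdjEM": 0.0},
    {"team": "E", "year": 2025, "AdjEM": 4.0},
]

normalized = zscore_within_year(teams, "AdjEM")
for row in normalized:
    print(row["team"], row["year"], round(row["AdjEM_z"], 3))
```

Because each year's statistics come only from that year's teams, the 2026 field can be normalized on its own at inference time without touching historical data.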

