# Data Collection - NCAA Basketball 2025-26

This notebook collects team statistics from various sources for the 2025-26 season.

**Primary Data Source:** Barttorvik (free efficiency ratings)  
**Optional Sources:** ESPN, Haslametrics, CBBpy (for future features)

**Data we collect:**
- Adjusted Offensive Efficiency (AdjO)
- Adjusted Defensive Efficiency (AdjD)
- Win/Loss records and win percentage
- Power rating (net efficiency: AdjO minus AdjD)

In [1]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
import ssl
import urllib.request
from urllib.error import URLError, HTTPError
import certifi
import time
import warnings
warnings.filterwarnings('ignore')

from src import config

print("Libraries loaded!")

Libraries loaded!


---

## Part 1: Barttorvik Data (Primary Source)

Barttorvik provides free, reliable efficiency ratings used as our main data source.

### 1.1 Target Teams

In [2]:
# Teams from our prediction template
TARGET_TEAMS = [
    'Baylor', 'Boston College', 'California', 'Clemson', 'Duke',
    'Florida State', 'Georgia Tech', 'Louisville', 'Miami', 'Michigan',
    'NC State', 'North Carolina', 'Notre Dame', 'Ohio State', 'Pitt',
    'SMU', 'Stanford', 'Syracuse', 'Virginia', 'Virginia Tech', 'Wake Forest'
]

# Mapping from our schedule names to Barttorvik names
SCHEDULE_TO_BARTTORVIK = {
    'Florida State': 'Florida St.',
    'Miami': 'Miami FL',
    'NC State': 'N.C. State',
    'Ohio State': 'Ohio St.',
    'Pitt': 'Pittsburgh',
}

print(f"Need data for {len(TARGET_TEAMS)} teams")

Need data for 21 teams
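
A minimal sketch of how this mapping might be applied when looking teams up on Barttorvik: names without an entry pass through unchanged. The helper `to_barttorvik_name` is hypothetical and not part of the notebook; the mapping is repeated here only so the snippet runs on its own.

```python
SCHEDULE_TO_BARTTORVIK = {
    'Florida State': 'Florida St.',
    'Miami': 'Miami FL',
    'NC State': 'N.C. State',
    'Ohio State': 'Ohio St.',
    'Pitt': 'Pittsburgh',
}

def to_barttorvik_name(schedule_name):
    """Translate a schedule name to its Barttorvik spelling (hypothetical helper)."""
    # Teams spelled the same in both sources have no entry, so fall back to the input
    return SCHEDULE_TO_BARTTORVIK.get(schedule_name, schedule_name)

for name in ['Pitt', 'Miami', 'Duke']:
    print(f"{name} -> {to_barttorvik_name(name)}")
# prints:
# Pitt -> Pittsburgh
# Miami -> Miami FL
# Duke -> Duke
```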


### 1.2 Scrape Barttorvik CSV

In [3]:
def scrape_barttorvik_csv(year=2026, max_retries=3, retry_delay=1.0):
    """Fetch team ratings from the Barttorvik CSV endpoint.

    Tries requests first, then urllib with certifi's CA bundle (with retries
    and exponential backoff), and finally an unverified SSL context.
    Returns a DataFrame, or None if every attempt fails.
    """
    url = f"https://barttorvik.com/{year}_team_results.csv"
    
    try:
        print(f"Fetching Barttorvik CSV for {year}...")
        
        # Try with requests first
        try:
            response = requests.get(url, timeout=15)
            response.raise_for_status()
            df = pd.read_csv(StringIO(response.text))
            print(f"✓ Found {len(df)} teams")
            return df
        except Exception as e:
            # Report the failure instead of hiding it, then fall through to urllib
            print(f"   requests failed ({e}); trying urllib...")
        
        # Try with secure SSL using certifi
        ssl_context = ssl.create_default_context(cafile=certifi.where())
        
        for attempt in range(max_retries):
            try:
                with urllib.request.urlopen(url, context=ssl_context, timeout=30) as response:
                    data = response.read().decode('utf-8')
                    df = pd.read_csv(StringIO(data))
                    print(f"✓ Found {len(df)} teams")
                    return df
            except (URLError, HTTPError, ssl.SSLError):
                if attempt < max_retries - 1:
                    # Exponential backoff: retry_delay, then 2x, 4x, ...
                    wait_time = retry_delay * (2 ** attempt)
                    print(f"   Attempt {attempt + 1}/{max_retries} failed, retrying...")
                    time.sleep(wait_time)
                else:
                    # Last resort: fallback without SSL verification.
                    # Insecure (no certificate check); acceptable only for public data.
                    ssl_context_unverified = ssl.create_default_context()
                    ssl_context_unverified.check_hostname = False
                    ssl_context_unverified.verify_mode = ssl.CERT_NONE
                    with urllib.request.urlopen(url, context=ssl_context_unverified, timeout=30) as response:
                        data = response.read().decode('utf-8')
                        df = pd.read_csv(StringIO(data))
                        print(f"✓ Found {len(df)} teams (fallback)")
                        return df
    except Exception as e:
        # Anything unrecoverable (including a failed fallback) ends up here
        print(f"Error: {e}")
        return None

# Fetch the data
barttorvik_df = scrape_barttorvik_csv(2026)

if barttorvik_df is not None:
    print(f"\nColumns: {len(barttorvik_df.columns)}")
    display(barttorvik_df.head(10))

Fetching Barttorvik CSV for 2026...
✓ Found 365 teams

Columns: 45


|   | rank | team | conf | record | adjoe | oe Rank | adjde | de Rank | barthag | rank.1 | ... | ConPA | ConPoss | ConOE | ConDE | ConSOSRemain | Conf Win% | WAB | WAB Rk | Fun Rk | adjt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Michigan | B10 | 19-1 | 126.89027 | 5 | 89.998099 | 1 | 0.981121 | 1 | ... | 721.0 | 714.925 | 1.200126 | 1.008497 | 0.859326 | 0.9 | 7.433483 | 4 | 97 | 72.659255 |
| 1 | 2 | Arizona | B12 | 21-0 | 126.349485 | 7 | 91.408287 | 3 | 0.976402 | 2 | ... | 573.0 | 585.825 | 1.208552 | 0.978108 | 0.870899 | 1.0 | 8.015676 | 1 | 12 | 71.118552 |
| 2 | 3 | Houston | B12 | 18-2 | 126.403685 | 6 | 93.325694 | 6 | 0.970371 | 3 | ... | 461.0 | 459.3875 | 1.214661 | 1.00351 | 0.816846 | 0.857143 | 5.41608 | 8 | 118 | 63.792142 |
| 3 | 4 | Duke | ACC | 19-1 | 124.995094 | 14 | 93.869013 | 9 | 0.964198 | 4 | ... | 541.0 | 542.9375 | 1.226661 | 0.996431 | 0.810477 | 1.0 | 7.551541 | 2 | 9 | 67.216864 |
| 4 | 5 | Florida | SEC | 15-6 | 124.374856 | 18 | 93.867069 | 8 | 0.962179 | 5 | ... | 578.0 | 565.925 | 1.222777 | 1.021337 | 0.833664 | 0.75 | 3.135693 | 21 | 331 | 69.530998 |
| 5 | 6 | Illinois | B10 | 17-3 | 131.214936 | 1 | 99.151965 | 33 | 0.961658 | 6 | ... | 639.0 | 583.0875 | 1.251956 | 1.09589 | 0.844619 | 0.888889 | 5.244019 | 9 | 81 | 66.155771 |
| 6 | 7 | Vanderbilt | SEC | 18-3 | 125.671 | 12 | 95.019005 | 14 | 0.961406 | 7 | ... | 616.0 | 569.975 | 1.152682 | 1.080749 | 0.837592 | 0.625 | 4.99209 | 11 | 159 | 70.255994 |
| 7 | 8 | Connecticut | BE | 20-1 | 120.480165 | 40 | 91.6055 | 4 | 0.958945 | 8 | ... | 670.0 | 686.775 | 1.141567 | 0.975574 | 0.794829 | 1.0 | 7.527951 | 3 | 8 | 65.015185 |
| 8 | 9 | Purdue | B10 | 17-4 | 128.998475 | 3 | 98.649378 | 28 | 0.956254 | 9 | ... | 718.0 | 655.675 | 1.209441 | 1.095055 | 0.866859 | 0.7 | 5.212849 | 10 | 136 | 64.966775 |
| 9 | 10 | Nebraska | B10 | 20-1 | 122.338073 | 26 | 93.796755 | 7 | 0.955001 | 10 | ... | 653.0 | 657.675 | 1.181435 | 0.992892 | 0.788028 | 0.9 | 7.119445 | 5 | 5 | 66.813662 |
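
The retry loop in `scrape_barttorvik_csv` waits `retry_delay * 2 ** attempt` seconds between failed verified attempts. The sketch below only computes that wait schedule and makes no network calls. `backoff_delays` is a hypothetical helper, and it assumes, as in the function above, that no wait follows the final attempt.

```python
def backoff_delays(max_retries=3, retry_delay=1.0):
    """Waits (in seconds) between attempts; no wait follows the last attempt."""
    return [retry_delay * (2 ** attempt) for attempt in range(max_retries - 1)]

print(backoff_delays())        # [1.0, 2.0]
print(backoff_delays(5, 0.5))  # [0.5, 1.0, 2.0, 4.0]
```

With the defaults, a fully failing run therefore sleeps about three seconds in total before reaching the unverified fallback.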


### 1.3 Process and Filter Data

In [4]:
# Name mappings
BARTTORVIK_TO_SCHEDULE = {v: k for k, v in SCHEDULE_TO_BARTTORVIK.items()}

# Same teams as TARGET_TEAMS (section 1.1); reuse that list so the two stay in sync
SCHEDULE_TEAMS = list(TARGET_TEAMS)

def process_barttorvik_csv(df):
    """Process and filter to target teams"""
    if df is None:
        return None
    
    # Find team column
    team_col = next((col for col in df.columns if col.lower() == 'team'), None)
    if not team_col:
        print("Error: Could not find team column")
        return None
    
    result = df.copy()
    result['team_clean'] = result[team_col].apply(
        lambda x: BARTTORVIK_TO_SCHEDULE.get(str(x).strip(), str(x).strip())
    )
    
    filtered = result[result['team_clean'].isin(SCHEDULE_TEAMS)].copy()
    print(f"Found {len(filtered)}/{len(SCHEDULE_TEAMS)} target teams")
    
    missing = set(SCHEDULE_TEAMS) - set(filtered['team_clean'].tolist())
    if missing:
        print(f"Missing teams: {missing}")
    
    return filtered

our_teams = process_barttorvik_csv(barttorvik_df)
if our_teams is not None:
    display(our_teams[['team_clean', 'adjoe', 'adjde', 'record', 'barthag']].sort_values('barthag', ascending=False))

Found 21/21 target teams


|   | team_clean | adjoe | adjde | record | barthag |
|---|---|---|---|---|---|
| 0 | Michigan | 126.89027 | 89.998099 | 19-1 | 0.981121 |
| 3 | Duke | 124.995094 | 93.869013 | 19-1 | 0.964198 |
| 14 | Virginia | 125.744189 | 98.160431 | 17-3 | 0.945212 |
| 17 | Louisville | 125.245995 | 98.830686 | 14-6 | 0.938429 |
| 24 | Clemson | 116.936727 | 94.625707 | 17-4 | 0.919427 |
| 26 | NC State | 121.487262 | 99.262932 | 15-6 | 0.910799 |
| 30 | North Carolina | 122.832399 | 101.623051 | 16-4 | 0.898424 |
| 32 | SMU | 124.611305 | 104.194921 | 15-5 | 0.88673 |
| 39 | Ohio State | 122.195268 | 103.485037 | 14-6 | 0.871156 |
| 40 | Miami | 118.006155 | 100.792829 | 17-4 | 0.859748 |
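
Inverting `SCHEDULE_TO_BARTTORVIK` with a dict comprehension, as the cell above does, silently keeps only one key if two schedule names ever map to the same Barttorvik name. A small guard like the following would surface that mistake early. This is a sketch: `invert_unique` is a hypothetical helper, not part of the notebook, and the mapping shown is a two-entry subset used for illustration.

```python
def invert_unique(mapping):
    """Invert a dict, raising if two keys share a value (hypothetical helper)."""
    inverted = {}
    for key, value in mapping.items():
        if value in inverted:
            raise ValueError(
                f"{value!r} is mapped from both {inverted[value]!r} and {key!r}"
            )
        inverted[value] = key
    return inverted

# Subset of SCHEDULE_TO_BARTTORVIK, for illustration only
subset = {'Miami': 'Miami FL', 'Pitt': 'Pittsburgh'}
BARTTORVIK_TO_SCHEDULE = invert_unique(subset)
print(BARTTORVIK_TO_SCHEDULE['Miami FL'])  # Miami
```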


### 1.4 Convert to Model Format

In [5]:
def find_col(df, options):
    """Find column by name options"""
    for opt in options:
        matches = [c for c in df.columns if opt.lower() in c.lower()]
        if matches:
            return matches[0]
    return None

if our_teams is not None and len(our_teams) >= 18:
    print("✓ Using scraped Barttorvik data")
    
    adj_o_col = find_col(our_teams, ['adjoe', 'adj_o'])
    adj_d_col = find_col(our_teams, ['adjde', 'adj_d'])
    
    # Parse wins/losses from record column (format: 'W-L', e.g. '16-2')
    our_teams['wins'] = our_teams['record'].str.split('-').str[0].astype(float)
    our_teams['losses'] = our_teams['record'].str.split('-').str[1].astype(float)
    
    model_df = pd.DataFrame({
        'team': our_teams['team_clean'].values,
        'off_efficiency': our_teams[adj_o_col].values,
        'def_efficiency': our_teams[adj_d_col].values,
        'wins': our_teams['wins'].values,
        'losses': our_teams['losses'].values
    })
    
    # Efficiencies are points per 100 possessions; multiplying by 0.70
    # approximates points per game at a fixed pace of 70 possessions
    model_df['ppg'] = model_df['off_efficiency'] * 0.70
    model_df['opp_ppg'] = model_df['def_efficiency'] * 0.70
    model_df['pace'] = 70.0
    # Net efficiency: higher is better
    model_df['power_rating'] = model_df['off_efficiency'] - model_df['def_efficiency']
    model_df['win_pct'] = model_df['wins'] / (model_df['wins'] + model_df['losses'])
    
    print(f"\n✓ Created model data for {len(model_df)} teams")
    display(model_df.sort_values('power_rating', ascending=False))
else:
    print("⚠️  Barttorvik data missing or incomplete (need at least 18 target teams)")

✓ Using scraped Barttorvik data

✓ Created model data for 21 teams


|   | team | off_efficiency | def_efficiency | wins | losses | ppg | opp_ppg | pace | power_rating | win_pct |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Michigan | 126.89027 | 89.998099 | 19.0 | 1.0 | 88.823189 | 62.99867 | 70.0 | 36.89217 | 0.95 |
| 1 | Duke | 124.995094 | 93.869013 | 19.0 | 1.0 | 87.496566 | 65.708309 | 70.0 | 31.126081 | 0.95 |
| 2 | Virginia | 125.744189 | 98.160431 | 17.0 | 3.0 | 88.020933 | 68.712302 | 70.0 | 27.583758 | 0.85 |
| 3 | Louisville | 125.245995 | 98.830686 | 14.0 | 6.0 | 87.672196 | 69.18148 | 70.0 | 26.415309 | 0.7 |
| 4 | Clemson | 116.936727 | 94.625707 | 17.0 | 4.0 | 81.855709 | 66.237995 | 70.0 | 22.31102 | 0.809524 |
| 5 | NC State | 121.487262 | 99.262932 | 15.0 | 6.0 | 85.041083 | 69.484052 | 70.0 | 22.22433 | 0.714286 |
| 6 | North Carolina | 122.832399 | 101.623051 | 16.0 | 4.0 | 85.982679 | 71.136136 | 70.0 | 21.209347 | 0.8 |
| 7 | SMU | 124.611305 | 104.194921 | 15.0 | 5.0 | 87.227913 | 72.936445 | 70.0 | 20.416383 | 0.75 |
| 8 | Ohio State | 122.195268 | 103.485037 | 14.0 | 6.0 | 85.536688 | 72.439526 | 70.0 | 18.710232 | 0.7 |
| 9 | Miami | 118.006155 | 100.792829 | 17.0 | 4.0 | 82.604308 | 70.55498 | 70.0 | 17.213326 | 0.809524 |
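
The `ppg` and `opp_ppg` columns assume every team plays 70 possessions per game. The Barttorvik CSV also carries an `adjt` column (visible in the section 1.2 preview), which appears to be adjusted tempo. Assuming it is measured in possessions per game, a per-team estimate could be sketched as follows. The function name `estimate_ppg` is hypothetical, and real game pace depends on both opponents, so treat this as a rough approximation rather than the notebook's method.

```python
def estimate_ppg(adj_efficiency, tempo=70.0):
    """Points per game from points-per-100-possessions and possessions per game."""
    return adj_efficiency * tempo / 100.0

# Michigan's adjoe and adjt from the section 1.2 preview
adjoe, adjt = 126.89027, 72.659255

fixed = estimate_ppg(adjoe)              # fixed 70-possession assumption
tempo_based = estimate_ppg(adjoe, adjt)  # assumes adjt is possessions per game
print(f"fixed pace: {fixed:.1f}, tempo-based: {tempo_based:.1f}")
# prints "fixed pace: 88.8, tempo-based: 92.2"
```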


### 1.5 Save to File

In [6]:
output_path = config.PROCESSED_DATA_DIR / 'team_stats_2025_26.csv'
model_df.to_csv(output_path, index=False)
print(f"✓ Saved to {output_path}")
print(f"  {len(model_df)} teams")
print(f"  Power rating range: {model_df['power_rating'].min():.1f} to {model_df['power_rating'].max():.1f}")

✓ Saved to /Users/calebhan/Documents/Coding/Personal/triangle-sports-analytics-26/notebooks/../data/processed/team_stats_2025_26.csv
  21 teams
  Power rating range: 1.6 to 36.9


### 1.6 Sanity Check

In [7]:
HOME_COURT_ADVANTAGE = 3.5

def predict_spread(home_team, away_team, team_stats):
    """Simple spread prediction using net efficiency"""
    stats = team_stats.set_index('team')
    home_net = stats.loc[home_team, 'power_rating']
    away_net = stats.loc[away_team, 'power_rating']
    spread = (home_net - away_net) / 2 + HOME_COURT_ADVANTAGE
    return spread

test_matchups = [
    ('North Carolina', 'Duke'),
    ('Duke', 'North Carolina'),
    ('Virginia', 'Duke'),
]

print("Expected Spreads (sanity check):")
print("=" * 50)
for home, away in test_matchups:
    spread = predict_spread(home, away, model_df)
    if spread > 0:
        print(f"{away:15} @ {home:15} → {home} by {spread:.1f}")
    else:
        print(f"{away:15} @ {home:15} → {away} by {-spread:.1f}")

Expected Spreads (sanity check):
Duke            @ North Carolina  → Duke by 1.5
North Carolina  @ Duke            → Duke by 8.5
Duke            @ Virginia        → Virginia by 1.7
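
One quick way to check the formula: swapping home and away should move the spread by exactly twice the home-court advantage. The sketch below reproduces the first two matchups using the power ratings from the section 1.4 table. The `ratings` dict and `spread` function are stand-ins for `model_df` and `predict_spread`, not the notebook's own code.

```python
HOME_COURT_ADVANTAGE = 3.5

# Power ratings copied from the section 1.4 output
ratings = {'Duke': 31.126081, 'North Carolina': 21.209347}

def spread(home, away):
    """Positive means the home team is favored."""
    return (ratings[home] - ratings[away]) / 2 + HOME_COURT_ADVANTAGE

unc_home = spread('North Carolina', 'Duke')
duke_home = spread('Duke', 'North Carolina')
print(f"{unc_home:.1f} {duke_home:.1f}")
# prints "-1.5 8.5"
```

Duke is favored by about 1.5 on the road and 8.5 at home, a gap of 7.0 points, which is 2 × 3.5 as expected.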


---

## Part 2: Optional Data Sources (Future Features)

These sources can provide additional signals for future model improvements.

### 2.1 ESPN BPI (Optional)

ESPN's Basketball Power Index provides alternative team ratings.

**Status:** Not currently used (Barttorvik is sufficient)  
**Potential use:** Test as additional feature if model plateaus

In [8]:
# Uncomment to explore ESPN data
# from src.data_sources import espn
# 
# standings = espn.fetch_standings(year=2026)
# print(f"Found {len(standings)} teams with standings")
# display(standings.head(10))

### 2.2 Haslametrics (Optional)

Haslametrics provides momentum metrics and consistency scores.

**Status:** Not currently used  
**Potential use:** Momentum features for blowout prediction (expected to lower MAE by roughly 2-5%)

In [9]:
# Uncomment to explore Haslametrics data
# from src.data_sources import haslametrics
# 
# ratings = haslametrics.fetch_team_ratings(2026)
# momentum = haslametrics.fetch_momentum_metrics(2026)
# 
# print(f"Ratings: {len(ratings)} teams")
# print(f"Momentum: {len(momentum)} teams")
# 
# if len(momentum) > 0:
#     print("\nTop teams by momentum:")
#     display(momentum.head(10))

### 2.3 CBBpy (Optional)

CBBpy provides play-by-play, box scores, and player-level stats scraped from ESPN.

**Status:** Not currently used (high effort, moderate value)  
**Potential use:** Pace variance, bench depth, run analysis

**Note:** Requires `pip install cbbpy` and a one-time patch:
```bash
python scripts/patch_cbbpy_venv.py
```
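
The commented cell that follows checks a `CBBPY_AVAILABLE` flag before calling CBBpy. How `src.data_sources.cbbpy_enhanced` sets that flag is not shown in this notebook; one common pattern, sketched here as an assumption, is to probe for the package without importing it.

```python
import importlib.util

# True when the optional 'cbbpy' package can be located, False otherwise
CBBPY_AVAILABLE = importlib.util.find_spec("cbbpy") is not None

if CBBPY_AVAILABLE:
    print("CBBpy is installed")
else:
    print("CBBpy not installed; install with: pip install cbbpy")
```

Probing this way keeps the rest of the notebook runnable when the optional dependency is absent.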

In [10]:
# Uncomment to explore CBBpy data
# from src.data_sources import cbbpy_enhanced
# 
# if cbbpy_enhanced.CBBPY_AVAILABLE:
#     games = cbbpy_enhanced.fetch_games_team('Duke', season=2026)
#     print(f"Found {len(games)} Duke games")
#     display(games.head(10))
# else:
#     print("⚠️  CBBpy not installed")
#     print("Install with: pip install cbbpy")

---

## Summary

✅ **Barttorvik data collected** - 21 teams with efficiency ratings  
✅ **Saved to:** `data/processed/team_stats_2025_26.csv`  
📊 **Next step:** Run `02_modeling.ipynb` to train the prediction model

**Optional data sources** (ESPN, Haslametrics, CBBpy) are available for future feature engineering but not currently needed.