# Data Collection - NCAA Basketball 2025-26

This notebook collects team statistics from various sources for the 2025-26 season.

**Primary Data Source:** Barttorvik (free efficiency ratings)  
**Optional Sources:** ESPN, Haslametrics, CBBpy (for future features)

**Data we collect:**
- Adjusted Offensive Efficiency (AdjO)
- Adjusted Defensive Efficiency (AdjD)
- Win/Loss records and win percentage
- Power ratings (net efficiency)
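
As a quick illustration of the last two items, the sketch below computes net efficiency (AdjO minus AdjD, both in points per 100 possessions) and win percentage for one team. The inputs are Duke's values from the Barttorvik pull in Part 1; the helper names `net_efficiency` and `win_pct` are made up for this example and are not part of `src`.

```python
def net_efficiency(adj_o: float, adj_d: float) -> float:
    """Points scored minus points allowed per 100 possessions (higher is better)."""
    return adj_o - adj_d


def win_pct(wins: int, losses: int) -> float:
    """Fraction of games won; returns 0.0 when no games have been played."""
    games = wins + losses
    return wins / games if games else 0.0


# Duke: adjoe 125.040921, adjde 93.060233, record 21-1
duke_net = net_efficiency(125.040921, 93.060233)
duke_win_pct = win_pct(21, 1)
print(f"net efficiency: {duke_net:.2f}, win pct: {duke_win_pct:.3f}")
```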

In [1]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
from io import StringIO
import ssl
import urllib.request
from urllib.error import URLError, HTTPError
import certifi
import time
import warnings
warnings.filterwarnings('ignore')

from src import config

print("Libraries loaded!")

Libraries loaded!


---

## Part 1: Barttorvik Data (Primary Source)

Barttorvik provides free, reliable efficiency ratings used as our main data source.

### 1.1 Target Teams

In [2]:
# Teams from our prediction template
TARGET_TEAMS = [
    'Baylor', 'Boston College', 'California', 'Clemson', 'Duke',
    'Florida State', 'Georgia Tech', 'Louisville', 'Miami', 'Michigan',
    'NC State', 'North Carolina', 'Notre Dame', 'Ohio State', 'Pitt',
    'SMU', 'Stanford', 'Syracuse', 'Virginia', 'Virginia Tech', 'Wake Forest'
]

# Mapping from our schedule names to Barttorvik names
SCHEDULE_TO_BARTTORVIK = {
    'Florida State': 'Florida St.',
    'Miami': 'Miami FL',
    'NC State': 'N.C. State',
    'Ohio State': 'Ohio St.',
    'Pitt': 'Pittsburgh',
}

print(f"Need data for {len(TARGET_TEAMS)} teams")

Need data for 21 teams
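
One way to apply the mapping is a pair of lookup helpers that fall back to the input name when no override exists. This is a sketch, not code from `src`: `to_barttorvik` and `to_schedule` are hypothetical names, and the dictionary is repeated here only so the block runs on its own.

```python
SCHEDULE_TO_BARTTORVIK = {
    'Florida State': 'Florida St.',
    'Miami': 'Miami FL',
    'NC State': 'N.C. State',
    'Ohio State': 'Ohio St.',
    'Pitt': 'Pittsburgh',
}
# Inverse lookup: Barttorvik name -> schedule name
BARTTORVIK_TO_SCHEDULE = {v: k for k, v in SCHEDULE_TO_BARTTORVIK.items()}


def to_barttorvik(name: str) -> str:
    """Schedule name -> Barttorvik name (unchanged if no override exists)."""
    name = name.strip()
    return SCHEDULE_TO_BARTTORVIK.get(name, name)


def to_schedule(name: str) -> str:
    """Barttorvik name -> schedule name (unchanged if no override exists)."""
    name = name.strip()
    return BARTTORVIK_TO_SCHEDULE.get(name, name)


# Every name should survive a round trip through both mappings
for team in ['Duke', 'Pitt', 'NC State']:
    assert to_schedule(to_barttorvik(team)) == team

print(to_barttorvik('Pitt'), '|', to_barttorvik('Duke'))
```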


### 1.2 Scrape Barttorvik CSV

In [3]:
def scrape_barttorvik_csv(year=2026, max_retries=3, retry_delay=1.0):
    """Fetch team ratings from Barttorvik CSV endpoint with retry logic"""
    url = f"https://barttorvik.com/{year}_team_results.csv"
    
    try:
        print(f"Fetching Barttorvik CSV for {year}...")
        
        # Try with requests first
        try:
            response = requests.get(url, timeout=15)
            response.raise_for_status()
            df = pd.read_csv(StringIO(response.text), index_col=False)  # Fix: prevent column shift
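            print(f"✓ Found {len(df)} teams")
            return df
        except (requests.RequestException, ValueError) as e:
            # Network or CSV parsing problem: report it and fall through to urllib
            print(f"   requests attempt failed ({e}); trying urllib...")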
        
        # Try with secure SSL using certifi
        ssl_context = ssl.create_default_context(cafile=certifi.where())
        
        for attempt in range(max_retries):
            try:
                with urllib.request.urlopen(url, context=ssl_context, timeout=30) as response:
                    data = response.read().decode('utf-8')
                    df = pd.read_csv(StringIO(data), index_col=False)  # Fix: prevent column shift
                    print(f"✓ Found {len(df)} teams")
                    return df
            except (URLError, HTTPError, ssl.SSLError) as e:
                if attempt < max_retries - 1:
                    wait_time = retry_delay * (2 ** attempt)
                    print(f"   Attempt {attempt + 1}/{max_retries} failed, retrying...")
                    time.sleep(wait_time)
                else:
                    # ⚠️ WARNING: Fallback without SSL verification (INSECURE!)
                    print(f"   ⚠️ All {max_retries} attempts failed with SSL verification")
                    print("   ⚠️ WARNING: Attempting without SSL verification (INSECURE!)")
                    ssl_context_unverified = ssl.create_default_context()
                    ssl_context_unverified.check_hostname = False
                    ssl_context_unverified.verify_mode = ssl.CERT_NONE
                    with urllib.request.urlopen(url, context=ssl_context_unverified, timeout=30) as response:
                        data = response.read().decode('utf-8')
                        df = pd.read_csv(StringIO(data), index_col=False)
                        print(f"✓ Found {len(df)} teams (fallback)")
                        return df
    except Exception as e:
        print(f"Error: {e}")
        return None

# Fetch the data
barttorvik_df = scrape_barttorvik_csv(2026)

if barttorvik_df is not None:
    print(f"\nColumns: {len(barttorvik_df.columns)}")
    display(barttorvik_df.head(10))

Fetching Barttorvik CSV for 2026...
✓ Found 365 teams

Columns: 45


| | rank | team | conf | record | adjoe | oe Rank | adjde | de Rank | barthag | rank.1 | ... | ConPA | ConPoss | ConOE | ConDE | ConSOSRemain | Conf Win% | WAB | WAB Rk | Fun Rk | adjt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Michigan | B10 | 21-1 | 128.086617 | 4 | 90.3507 | 1 | 0.982252 | 1 | ... | 861.0 | 858.725 | 1.223908 | 1.002649 | 0.896612 | 0.916667 | 7.937796 | 2 | 76 | 72.742236 |
| 1 | 2 | Arizona | B12 | 22-0 | 126.076863 | 8 | 91.717523 | 3 | 0.97489 | 2 | ... | 647.0 | 659.75 | 1.205002 | 0.980674 | 0.888263 | 1.0 | 8.268154 | 1 | 7 | 71.299909 |
| 2 | 3 | Houston | B12 | 20-2 | 127.752644 | 5 | 93.104652 | 6 | 0.974376 | 3 | ... | 570.0 | 580.4 | 1.228463 | 0.982081 | 0.825089 | 0.888889 | 5.80175 | 7 | 132 | 63.486795 |
| 3 | 4 | Illinois | B10 | 20-3 | 131.556116 | 1 | 96.996914 | 20 | 0.97082 | 4 | ... | 818.0 | 770.925 | 1.254337 | 1.061063 | 0.843262 | 0.916667 | 6.52175 | 5 | 74 | 65.824999 |
| 4 | 5 | Iowa St. | B12 | 20-2 | 126.522078 | 6 | 93.930431 | 11 | 0.96849 | 5 | ... | 609.0 | 608.75 | 1.197536 | 1.000411 | 0.865158 | 0.777778 | 5.526687 | 8 | 78 | 67.974591 |
| 5 | 6 | Duke | ACC | 21-1 | 125.040921 | 11 | 93.060233 | 5 | 0.967612 | 6 | ... | 648.0 | 661.9125 | 1.216173 | 0.978981 | 0.851943 | 1.0 | 7.927135 | 3 | 10 | 66.572023 |
| 6 | 7 | Connecticut | BE | 22-1 | 122.642331 | 27 | 91.554484 | 2 | 0.966492 | 7 | ... | 788.0 | 814.5625 | 1.179774 | 0.96739 | 0.805991 | 1.0 | 7.880878 | 4 | 16 | 65.00317 |
| 7 | 8 | Florida | SEC | 16-6 | 124.764873 | 14 | 93.264736 | 8 | 0.965986 | 8 | ... | 655.0 | 643.1875 | 1.231367 | 1.018366 | 0.828738 | 0.777778 | 3.35013 | 21 | 338 | 69.780226 |
| 8 | 9 | Purdue | B10 | 18-4 | 130.015583 | 2 | 98.166936 | 26 | 0.961996 | 9 | ... | 781.0 | 720.05 | 1.23047 | 1.084647 | 0.881786 | 0.727273 | 5.458264 | 9 | 137 | 65.117976 |
| 9 | 10 | Vanderbilt | SEC | 19-3 | 124.983116 | 13 | 95.193158 | 15 | 0.958158 | 10 | ... | 684.0 | 633.9 | 1.148446 | 1.079035 | 0.862438 | 0.666667 | 4.896889 | 12 | 128 | 70.14546 |
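
The retry loop in `scrape_barttorvik_csv` waits `retry_delay * (2 ** attempt)` seconds between attempts. The sketch below isolates that exponential backoff pattern in a generic helper so it can be exercised without network access. `with_backoff` and `flaky_fetch` are hypothetical names, the failure is simulated, and the `sleep` parameter is injected only so the example runs instantly.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(fn: Callable[[], T], max_retries: int = 3,
                 retry_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on OSError with waits of retry_delay * 2**attempt."""
    for attempt in range(max_retries):
        try:
            return fn()
        except OSError:  # URLError and ConnectionError are both OSError subclasses
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            sleep(retry_delay * (2 ** attempt))
    raise ValueError("max_retries must be at least 1")


# Simulated fetch that fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated network failure")
    return "team,adjoe\nDuke,125.0\n"


waits = []  # record the requested waits instead of actually sleeping
payload = with_backoff(flaky_fetch, sleep=waits.append)
print(f"succeeded after {calls['n']} calls, waits requested: {waits}")
```

Keeping the wait function injectable makes the backoff schedule easy to check in a test; in the notebook itself the default `time.sleep` would be used.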


In [4]:
# Data Validation - ensure scraped data is correct
if barttorvik_df is not None:
    print("Validating scraped data...")
    
    # Check required columns exist
    required_cols = ['team', 'adjoe', 'adjde', 'record']
    missing_cols = [col for col in required_cols if col not in barttorvik_df.columns]
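    if missing_cols:
        print(f"❌ ERROR: Missing columns: {missing_cols}")
        # Stop here: the range checks below would otherwise fail with a KeyError
        raise ValueError(f"Missing required columns: {missing_cols}")
    else:
        print("✓ All required columns present")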
    
    # Check data types and ranges
    assert barttorvik_df['team'].dtype == object, "team should be string type"
    assert len(barttorvik_df) >= 350, f"Expected ~365 teams, got {len(barttorvik_df)}"
    
    # Validate efficiency ranges. Bounds asserted below:
    #   adjoe: min in (80, 100), max in (130, 150); adjde: min in (80, 100), max in (110, 130)
    adjoe_range = (barttorvik_df['adjoe'].min(), barttorvik_df['adjoe'].max())
    adjde_range = (barttorvik_df['adjde'].min(), barttorvik_df['adjde'].max())
    
    assert 80 < adjoe_range[0] < 100, f"adjoe min seems wrong: {adjoe_range[0]}"
    assert 130 < adjoe_range[1] < 150, f"adjoe max seems wrong: {adjoe_range[1]}"
    assert 80 < adjde_range[0] < 100, f"adjde min seems wrong: {adjde_range[0]}"
    assert 110 < adjde_range[1] < 130, f"adjde max seems wrong: {adjde_range[1]}"
    
    print(f"✓ Efficiency ranges valid: adjoe {adjoe_range[0]:.1f}-{adjoe_range[1]:.1f}, adjde {adjde_range[0]:.1f}-{adjde_range[1]:.1f}")
    
    # Check for data corruption (conference codes in team column)
    first_teams = barttorvik_df['team'].head(5).tolist()
    conf_codes = ['B12', 'B10', 'ACC', 'SEC', 'BE', 'WCC', 'A10', 'MWC', 'P12']
    corrupted_teams = [t for t in first_teams if t in conf_codes]
    if corrupted_teams:
        print(f"❌ ERROR: Team column contains conference codes: {corrupted_teams}")
        print("   This indicates data corruption - check index_col parameter!")
        raise ValueError("Data corruption detected!")
    else:
        print("✓ Team names valid (no conference codes detected)")
    
    # Check index starts at 0
    assert barttorvik_df.index[0] == 0, f"Index should start at 0, got {barttorvik_df.index[0]}"
    print("✓ Index starts at 0 (no row skipping)")
    
    print("\n✅ All validation checks passed!")

Validating scraped data...
✓ All required columns present
✓ Efficiency ranges valid: adjoe 88.0-131.6, adjde 90.4-127.0
✓ Team names valid (no conference codes detected)
✓ Index starts at 0 (no row skipping)

✅ All validation checks passed!
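
The assertions above stop at the first failure. If a full report is preferred, a variant that collects every problem before deciding what to do might look like the following sketch. `validate_ratings` and `BOUNDS` are hypothetical names, the bounds are copied from the assertions in this cell, and the two-row frames are toy data built only to exercise the checks.

```python
import pandas as pd

# Bounds copied from the assertions above:
# column -> (allowed range for the column min, allowed range for the column max)
BOUNDS = {
    'adjoe': ((80, 100), (130, 150)),
    'adjde': ((80, 100), (110, 130)),
}


def validate_ratings(df: pd.DataFrame) -> list[str]:
    """Return human-readable problems; an empty list means every check passed."""
    missing = [c for c in ('team', 'adjoe', 'adjde', 'record') if c not in df.columns]
    if missing:
        return [f"missing columns: {missing}"]

    problems = []
    for col, (min_range, max_range) in BOUNDS.items():
        col_min, col_max = df[col].min(), df[col].max()
        if not min_range[0] < col_min < min_range[1]:
            problems.append(f"{col} min out of range: {col_min}")
        if not max_range[0] < col_max < max_range[1]:
            problems.append(f"{col} max out of range: {col_max}")
    return problems


# Toy frames (made-up rows) reusing the min/max values printed above
good = pd.DataFrame({'team': ['A', 'B'], 'adjoe': [88.0, 131.6],
                     'adjde': [90.4, 127.0], 'record': ['1-0', '0-1']})
bad = good.assign(adjoe=[88.0, 160.0])

print(validate_ratings(good))
print(validate_ratings(bad))
```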


### 1.3 Process and Filter Data

In [5]:
# Name mappings
BARTTORVIK_TO_SCHEDULE = {v: k for k, v in SCHEDULE_TO_BARTTORVIK.items()}

# Same 21 teams as TARGET_TEAMS (defined in section 1.1); reuse it so the lists cannot drift
SCHEDULE_TEAMS = list(TARGET_TEAMS)

def process_barttorvik_csv(df):
    """Process and filter to target teams"""
    if df is None:
        return None
    
    # Find team column
    team_col = next((col for col in df.columns if col.lower() == 'team'), None)
    if not team_col:
        print("Error: Could not find team column")
        return None
    
    result = df.copy()
    result['team_clean'] = result[team_col].apply(
        lambda x: BARTTORVIK_TO_SCHEDULE.get(str(x).strip(), str(x).strip())
    )
    
    filtered = result[result['team_clean'].isin(SCHEDULE_TEAMS)].copy()
    print(f"Found {len(filtered)}/{len(SCHEDULE_TEAMS)} target teams")
    
    missing = set(SCHEDULE_TEAMS) - set(filtered['team_clean'].tolist())
    if missing:
        print(f"Missing teams: {missing}")
    
    return filtered

our_teams = process_barttorvik_csv(barttorvik_df)
if our_teams is not None:
    display(our_teams[['team_clean', 'adjoe', 'adjde', 'record', 'barthag']].sort_values('barthag', ascending=False))

Found 21/21 target teams


| | team_clean | adjoe | adjde | record | barthag |
|---|---|---|---|---|---|
| 0 | Michigan | 128.086617 | 90.3507 | 21-1 | 0.982252 |
| 5 | Duke | 125.040921 | 93.060233 | 21-1 | 0.967612 |
| 14 | Virginia | 123.920623 | 97.408913 | 19-3 | 0.94094 |
| 18 | Louisville | 123.88108 | 97.936004 | 16-6 | 0.937179 |
| 23 | NC State | 123.064464 | 99.625647 | 17-6 | 0.919073 |
| 24 | Clemson | 116.203353 | 94.121228 | 19-4 | 0.918618 |
| 27 | North Carolina | 123.858752 | 101.248743 | 18-4 | 0.910353 |
| 34 | SMU | 124.548807 | 104.017675 | 15-7 | 0.888109 |
| 37 | Ohio State | 123.437375 | 103.667667 | 15-7 | 0.881556 |
| 44 | Miami | 119.044947 | 102.014052 | 17-5 | 0.855139 |


In [6]:
def find_col(df, options):
    """Find column by name options"""
    for opt in options:
        matches = [c for c in df.columns if opt.lower() in c.lower()]
        if matches:
            return matches[0]
    return None

if our_teams is not None and len(our_teams) >= 18:
    print("✓ Using scraped Barttorvik data")
    
    adj_o_col = find_col(our_teams, ['adjoe', 'adj_o'])
    adj_d_col = find_col(our_teams, ['adjde', 'adj_d'])
    
    # Parse wins/losses from record column (format: '16-2')
    our_teams['wins'] = our_teams['record'].str.split('-').str[0].astype(float)
    our_teams['losses'] = our_teams['record'].str.split('-').str[1].astype(float)
    
    model_df = pd.DataFrame({
        'team': our_teams['team_clean'].values,
        'off_efficiency': our_teams[adj_o_col].values,
        'def_efficiency': our_teams[adj_d_col].values,
        'wins': our_teams['wins'].values,
        'losses': our_teams['losses'].values
    })
    
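    # Derived features. ppg, opp_ppg, and pace are NOT used by the model
    # (see config.BASELINE_FEATURES); they are calculated for reference only.
    # ppg/opp_ppg assume a fixed pace of 70 possessions: efficiency * 70 / 100.
    model_df['ppg'] = model_df['off_efficiency'] * 0.70
    model_df['opp_ppg'] = model_df['def_efficiency'] * 0.70
    model_df['pace'] = 70.0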
    model_df['power_rating'] = model_df['off_efficiency'] - model_df['def_efficiency']
    model_df['win_pct'] = model_df['wins'] / (model_df['wins'] + model_df['losses'])
    
    print(f"\n✓ Created model data for {len(model_df)} teams")
    display(model_df.sort_values('power_rating', ascending=False))
else:
    print("⚠️  No Barttorvik data available")

✓ Using scraped Barttorvik data

✓ Created model data for 21 teams


| | team | off_efficiency | def_efficiency | wins | losses | ppg | opp_ppg | pace | power_rating | win_pct |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Michigan | 128.086617 | 90.3507 | 21.0 | 1.0 | 89.660632 | 63.24549 | 70.0 | 37.735917 | 0.954545 |
| 1 | Duke | 125.040921 | 93.060233 | 21.0 | 1.0 | 87.528645 | 65.142163 | 70.0 | 31.980689 | 0.954545 |
| 2 | Virginia | 123.920623 | 97.408913 | 19.0 | 3.0 | 86.744436 | 68.186239 | 70.0 | 26.511711 | 0.863636 |
| 3 | Louisville | 123.88108 | 97.936004 | 16.0 | 6.0 | 86.716756 | 68.555203 | 70.0 | 25.945076 | 0.727273 |
| 4 | NC State | 123.064464 | 99.625647 | 17.0 | 6.0 | 86.145125 | 69.737953 | 70.0 | 23.438817 | 0.73913 |
| 6 | North Carolina | 123.858752 | 101.248743 | 18.0 | 4.0 | 86.701127 | 70.87412 | 70.0 | 22.610009 | 0.818182 |
| 5 | Clemson | 116.203353 | 94.121228 | 19.0 | 4.0 | 81.342347 | 65.88486 | 70.0 | 22.082125 | 0.826087 |
| 7 | SMU | 124.548807 | 104.017675 | 15.0 | 7.0 | 87.184165 | 72.812372 | 70.0 | 20.531133 | 0.681818 |
| 8 | Ohio State | 123.437375 | 103.667667 | 15.0 | 7.0 | 86.406163 | 72.567367 | 70.0 | 19.769709 | 0.681818 |
| 9 | Miami | 119.044947 | 102.014052 | 17.0 | 5.0 | 83.331463 | 71.409836 | 70.0 | 17.030895 | 0.772727 |
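
The cell above splits `record` on `-` to get wins and losses, and approximates `ppg` by scaling efficiency (points per 100 possessions) to the fixed 70-possession pace. The sketch below repeats both steps on two rows taken from the output. `parse_record` is a hypothetical helper, and using `str.extract` with a regex is a swapped-in alternative to the `str.split` call in the cell, chosen here because it yields NaN instead of raising on a malformed record.

```python
import pandas as pd

PACE = 70.0  # fixed possessions per game, same assumption as the cell above


def parse_record(record: pd.Series) -> pd.DataFrame:
    """Split 'W-L' strings into float wins/losses columns (NaN if malformed)."""
    parts = record.str.extract(r'^\s*(?P<wins>\d+)-(?P<losses>\d+)\s*$')
    return parts.astype(float)


teams = pd.DataFrame({
    'team': ['Michigan', 'NC State'],
    'off_efficiency': [128.086617, 123.064464],
    'record': ['21-1', '17-6'],
})

# Attach wins/losses by index, then derive the same features as the notebook
teams = teams.join(parse_record(teams['record']))
teams['win_pct'] = teams['wins'] / (teams['wins'] + teams['losses'])
teams['ppg'] = teams['off_efficiency'] * PACE / 100

print(teams[['team', 'wins', 'losses', 'win_pct', 'ppg']].round(3))
```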


### 1.5 Save to File

In [7]:
output_path = config.PROCESSED_DATA_DIR / 'team_stats_2025_26.csv'
model_df.to_csv(output_path, index=False)
print(f"✓ Saved to {output_path}")
print(f"  {len(model_df)} teams")
print(f"  Power rating range: {model_df['power_rating'].min():.1f} to {model_df['power_rating'].max():.1f}")

✓ Saved to /Users/calebhan/Documents/Coding/Personal/triangle-sports-analytics-26/notebooks/../data/processed/team_stats_2025_26.csv
  21 teams
  Power rating range: 2.5 to 37.7
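
A quick way to confirm the saved file is readable is to load it back and compare it with the in-memory frame. The sketch below writes to a temporary directory rather than `config.PROCESSED_DATA_DIR`, and the two-row `sample` frame is a stand-in for `model_df`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for model_df: two rows copied from the table above
sample = pd.DataFrame({
    'team': ['Michigan', 'Duke'],
    'power_rating': [37.735917, 31.980689],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'team_stats_sample.csv'
    sample.to_csv(path, index=False)  # index=False, as in the notebook cell
    reloaded = pd.read_csv(path)

# Raises AssertionError if columns, dtypes, or values differ
pd.testing.assert_frame_equal(sample, reloaded)
print(f"round trip OK: {len(reloaded)} rows, columns {list(reloaded.columns)}")
```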


In [8]:
HOME_COURT_ADVANTAGE = 3.5

def predict_spread(home_team, away_team, team_stats):
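    """
    Simple sanity-check spread estimate (NOT the actual model).

    Uses a simplified power-rating formula for quick validation. The actual
    model in 02_modeling.ipynb uses a Ridge + LightGBM ensemble with Elo
    ratings and multiple features. Positive values favor the home team.
    """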
    stats = team_stats.set_index('team')
    home_net = stats.loc[home_team, 'power_rating']
    away_net = stats.loc[away_team, 'power_rating']
    spread = (home_net - away_net) / 2 + HOME_COURT_ADVANTAGE
    return spread

test_matchups = [
    ('North Carolina', 'Duke'),
    ('Duke', 'North Carolina'),
    ('Virginia', 'Duke'),
]

print("Expected Spreads (sanity check):")
print("=" * 50)
for home, away in test_matchups:
    spread = predict_spread(home, away, model_df)
    if spread > 0:
        print(f"{away:15} @ {home:15} → {home} by {spread:.1f}")
    else:
        print(f"{away:15} @ {home:15} → {away} by {-spread:.1f}")

Expected Spreads (sanity check):
Duke            @ North Carolina  → Duke by 1.2
North Carolina  @ Duke            → Duke by 8.2
Duke            @ Virginia        → Virginia by 0.8
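
To make the formula concrete, the arithmetic behind the first two lines of output can be reproduced directly from the power ratings. The ratings below are the values for North Carolina and Duke from the model table above; `spread_from_ratings` is a hypothetical stand-alone version of `predict_spread` that takes the two ratings as arguments.

```python
HOME_COURT_ADVANTAGE = 3.5


def spread_from_ratings(home_net: float, away_net: float) -> float:
    """Half the net-efficiency gap plus home court; positive favors the home team."""
    return (home_net - away_net) / 2 + HOME_COURT_ADVANTAGE


unc_net, duke_net = 22.610009, 31.980689

at_unc = spread_from_ratings(unc_net, duke_net)    # Duke @ North Carolina
at_duke = spread_from_ratings(duke_net, unc_net)   # North Carolina @ Duke

print(f"Duke @ North Carolina: {at_unc:+.1f}")
print(f"North Carolina @ Duke: {at_duke:+.1f}")
# Swapping venues flips the rating gap, so the two spreads always sum to 2 * HOME_COURT_ADVANTAGE
print(f"sum of the two spreads: {at_unc + at_duke:.1f}")
```

A negative value means the away team is favored, which is why the first matchup prints as "Duke by 1.2" even though North Carolina is at home.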


---

## Part 2: Optional Data Sources (Future Features)

These sources can provide additional signals for future model improvements.

### 2.1 ESPN BPI (Optional)

ESPN's Basketball Power Index provides alternative team ratings.

**Status:** Not currently used (Barttorvik is sufficient)  
**Potential use:** Test as additional feature if model plateaus

In [9]:
# Uncomment to explore ESPN data
# from src.data_sources import espn
# 
# standings = espn.fetch_standings(year=2026)
# print(f"Found {len(standings)} teams with standings")
# display(standings.head(10))

### 2.2 Haslametrics (Optional)

Haslametrics provides momentum metrics and consistency scores.

**Status:** Not currently used  
**Potential use:** Momentum features for blowout prediction (expected to reduce MAE by 2-5%)

In [10]:
# Uncomment to explore Haslametrics data
# from src.data_sources import haslametrics
# 
# ratings = haslametrics.fetch_team_ratings(2026)
# momentum = haslametrics.fetch_momentum_metrics(2026)
# 
# print(f"Ratings: {len(ratings)} teams")
# print(f"Momentum: {len(momentum)} teams")
# 
# if len(momentum) > 0:
#     print("\nTop teams by momentum:")
#     display(momentum.head(10))

### 2.3 CBBpy (Optional)

CBBpy provides play-by-play, box scores, and player-level stats scraped from ESPN.

**Status:** Not currently used (high effort, moderate value)  
**Potential use:** Pace variance, bench depth, run analysis

**Note:** Requires `pip install cbbpy` and one-time patch:
```bash
python scripts/patch_cbbpy_venv.py
```

In [11]:
# Uncomment to explore CBBpy data
# from src.data_sources import cbbpy_enhanced
# 
# if cbbpy_enhanced.CBBPY_AVAILABLE:
#     games = cbbpy_enhanced.fetch_games_team('Duke', season=2026)
#     print(f"Found {len(games)} Duke games")
#     display(games.head(10))
# else:
#     print("⚠️  CBBpy not installed")
#     print("Install with: pip install cbbpy")

---

## Summary

✅ **Barttorvik data collected** - 21 teams with efficiency ratings  
✅ **Saved to:** `data/processed/team_stats_2025_26.csv`  
📊 **Next step:** Run `02_modeling.ipynb` to train the prediction model

**Optional data sources** (ESPN, Haslametrics, CBBpy) are available for future feature engineering but not currently needed.