# Scrape Real Team Ratings

This notebook collects actual team statistics from Barttorvik (free) and Sports Reference.

**Data we need for each team:**
- Adjusted Offensive Efficiency (AdjO)
- Adjusted Defensive Efficiency (AdjD)
- Tempo/Pace
- Overall record and win %

In [1]:
import sys
sys.path.append('..')

import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time
import warnings
warnings.filterwarnings('ignore')

# Import config
from src import config

print("Libraries loaded!")

Libraries loaded!


## 1. Teams We Need Data For

In [2]:
# Teams from our prediction template (authoritative source)
TARGET_TEAMS = [
    'Baylor', 'Boston College', 'California', 'Clemson', 'Duke',
    'Florida State', 'Georgia Tech', 'Louisville', 'Miami', 'Michigan',
    'NC State', 'North Carolina', 'Notre Dame', 'Ohio State', 'Pitt',
    'SMU', 'Stanford', 'Syracuse', 'Virginia', 'Virginia Tech', 'Wake Forest'
]

# Mapping from our schedule names to Barttorvik names
SCHEDULE_TO_BARTTORVIK = {
    'Florida State': 'Florida St.',
    'Miami': 'Miami FL',
    'NC State': 'N.C. State',
    'Ohio State': 'Ohio St.',
    'Pitt': 'Pittsburgh',
}

print(f"Need data for {len(TARGET_TEAMS)} teams from schedule")

Need data for 21 teams from schedule


## 2. Scrape from Barttorvik

Barttorvik provides free efficiency ratings. We'll scrape the main rankings table.

In [3]:
import ssl
import urllib.request
from io import StringIO
from urllib.error import URLError, HTTPError
import certifi
import time

def scrape_barttorvik_csv(year=2026, max_retries=3, retry_delay=1.0):
    """
    Fetch team ratings directly from Barttorvik CSV endpoint
    With secure SSL certificate validation and retry logic
    """
    url = f"https://barttorvik.com/{year}_team_results.csv"
    
    try:
        print(f"Fetching Barttorvik CSV for {year}...")
        
        # Try with requests first (often handles SSL better)
        try:
            response = requests.get(url, timeout=15)
            response.raise_for_status()
            df = pd.read_csv(StringIO(response.text))
            print(f"✓ Found {len(df)} teams")
            return df
        except:
            pass
        
        # Try with secure SSL using certifi
        ssl_context = ssl.create_default_context(cafile=certifi.where())
        
        last_error = None
        for attempt in range(max_retries):
            try:
                with urllib.request.urlopen(url, context=ssl_context, timeout=30) as response:
                    data = response.read().decode('utf-8')
                    df = pd.read_csv(StringIO(data))
                    print(f"✓ Found {len(df)} teams")
                    return df
            except (URLError, HTTPError, ssl.SSLError) as e:
                last_error = e
                if attempt < max_retries - 1:
                    wait_time = retry_delay * (2 ** attempt)
                    print(f"   Attempt {attempt + 1}/{max_retries} failed: {e}")
                    print(f"   Retrying in {wait_time}s...")
                    time.sleep(wait_time)
                else:
                    # Fallback without SSL verification only as last resort
                    print(f"   ⚠ All attempts failed, trying without SSL verification...")
                    try:
                        ssl_context_unverified = ssl.create_default_context()
                        ssl_context_unverified.check_hostname = False
                        ssl_context_unverified.verify_mode = ssl.CERT_NONE
                        with urllib.request.urlopen(url, context=ssl_context_unverified, timeout=30) as response:
                            data = response.read().decode('utf-8')
                            df = pd.read_csv(StringIO(data))
                            print(f"✓ Found {len(df)} teams (using fallback)")
                            return df
                    except Exception as fallback_error:
                        print(f"   ✗ Fallback failed: {fallback_error}")
                        raise
            
    except Exception as e:
        print(f"Error: {e}")
        return None

# Fetch the data
barttorvik_df = scrape_barttorvik_csv(2026)

if barttorvik_df is not None:
    print("\nColumns available:")
    print(barttorvik_df.columns.tolist())
    print("\nFirst few rows:")
    display(barttorvik_df.head(10))

Fetching Barttorvik CSV for 2026...
✓ Found 365 teams

Columns available:
['rank', 'team', 'conf', 'record', 'adjoe', 'oe Rank', 'adjde', 'de Rank', 'barthag', 'rank.1', 'proj. W', 'Proj. L', 'Pro Con W', 'Pro Con L', 'Con Rec.', 'sos', 'ncsos', 'consos', 'Proj. SOS', 'Proj. Noncon SOS', 'Proj. Con SOS', 'elite SOS', 'elite noncon SOS', 'Opp OE', 'Opp DE', 'Opp Proj. OE', 'Opp Proj DE', 'Con Adj OE', 'Con Adj DE', 'Qual O', 'Qual D', 'Qual Barthag', 'Qual Games', 'FUN', 'ConPF', 'ConPA', 'ConPoss', 'ConOE', 'ConDE', 'ConSOSRemain', 'Conf Win%', 'WAB', 'WAB Rk', 'Fun Rk', 'adjt']

First few rows:


Unnamed: 0,rank,team,conf,record,adjoe,oe Rank,adjde,de Rank,barthag,rank.1,...,ConPA,ConPoss,ConOE,ConDE,ConSOSRemain,Conf Win%,WAB,WAB Rk,Fun Rk,adjt
0,1,Michigan,B10,15-1,128.283321,4,89.600828,1,0.984126,1,...,444.0,440.9,1.229304,1.007031,0.859928,0.833333,5.013475,5,162,74.244402
1,2,Arizona,B12,17-0,125.411012,9,92.849114,5,0.969443,2,...,309.0,303.1625,1.230363,1.019255,0.864292,1.0,5.599799,2,11,71.983672
2,3,Connecticut,BE,17-1,122.845116,22,91.588652,3,0.966965,3,...,460.0,490.975,1.136514,0.936911,0.783837,1.0,5.955329,1,22,65.750543
3,4,Houston,B12,16-1,122.834185,23,91.595969,4,0.966903,4,...,228.0,253.5875,1.143589,0.899098,0.823471,1.0,4.440385,11,88,63.552388
4,5,Purdue,B10,16-1,130.612087,2,98.182504,27,0.963811,5,...,425.0,403.3375,1.259491,1.053708,0.872546,1.0,4.969355,6,44,65.970781
5,6,Illinois,B10,14-3,129.878282,3,97.676214,23,0.963624,6,...,420.0,390.075,1.220278,1.076716,0.815101,0.833333,3.66634,12,176,66.532935
6,7,Vanderbilt,SEC,16-1,126.226805,6,95.026332,10,0.963214,7,...,314.0,285.6625,1.144707,1.099199,0.84099,0.75,4.587928,7,66,71.021593
7,8,Iowa St.,B12,16-1,123.308101,20,94.756676,8,0.953858,8,...,274.0,267.3125,1.107318,1.025018,0.833706,0.75,4.537453,9,17,68.815123
8,9,Virginia,ACC,15-2,125.705708,8,96.643305,17,0.953626,9,...,341.0,358.225,1.099867,0.951916,0.77412,0.8,2.88284,14,130,66.106372
9,10,Gonzaga,WCC,18-1,124.839368,11,96.029998,16,0.953346,10,...,410.0,436.325,1.219275,0.939667,0.609712,1.0,4.582267,8,39,70.663863


## 3. Alternative: Sports Reference Scraping

In [4]:
def scrape_sports_reference_rankings(year=2026):
    """
    Scrape team ratings from Sports Reference school ratings page
    """
    url = f"https://www.sports-reference.com/cbb/seasons/men/{year}-ratings.html"
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    }
    
    try:
        print(f"Fetching Sports Reference ratings for {year}...")
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        
        tables = pd.read_html(response.text)
        
        if tables:
            df = tables[0]
            # Flatten multi-level columns if present
            if isinstance(df.columns, pd.MultiIndex):
                df.columns = ['_'.join(col).strip() for col in df.columns.values]
            print(f"Found {len(df)} teams")
            return df
        
        return None
        
    except Exception as e:
        print(f"Error: {e}")
        return None

# Try Sports Reference
sr_df = scrape_sports_reference_rankings(2026)

if sr_df is not None:
    print("\nColumns:")
    print(sr_df.columns.tolist()[:15])  # First 15 columns
    print("\nFirst few rows:")
    display(sr_df.head())

Fetching Sports Reference ratings for 2026...
Found 401 teams

Columns:
['Unnamed: 0_level_0_Rk', 'Unnamed: 1_level_0_School', 'Unnamed: 2_level_0_Conf', 'Unnamed: 3_level_0_Unnamed: 3_level_1', 'Unnamed: 4_level_0_AP Rank', 'Unnamed: 5_level_0_W', 'Unnamed: 6_level_0_L', 'Unnamed: 7_level_0_Pts', 'Unnamed: 8_level_0_Opp', 'Unnamed: 9_level_0_MOV', 'Unnamed: 10_level_0_Unnamed: 10_level_1', 'Unnamed: 11_level_0_SOS', 'Unnamed: 12_level_0_Unnamed: 12_level_1', 'SRS_OSRS', 'SRS_DSRS']

First few rows:


Unnamed: 0,Unnamed: 0_level_0_Rk,Unnamed: 1_level_0_School,Unnamed: 2_level_0_Conf,Unnamed: 3_level_0_Unnamed: 3_level_1,Unnamed: 4_level_0_AP Rank,Unnamed: 5_level_0_W,Unnamed: 6_level_0_L,Unnamed: 7_level_0_Pts,Unnamed: 8_level_0_Opp,Unnamed: 9_level_0_MOV,Unnamed: 10_level_0_Unnamed: 10_level_1,Unnamed: 11_level_0_SOS,Unnamed: 12_level_0_Unnamed: 12_level_1,SRS_OSRS,SRS_DSRS,SRS_SRS,Adjusted_ORtg,Adjusted_DRtg,Adjusted_NRtg
0,1,Michigan,Big Ten,,4,15,1,93.8,68.7,25.13,,11.02,,23.22,12.93,36.14,128.22,82.8,45.42
1,2,Gonzaga,WCC,,9,17,1,91.4,67.9,23.56,,5.68,,17.82,11.41,29.23,125.29,87.92,37.38
2,3,Arizona,Big 12,,1,17,0,91.0,68.8,22.18,,6.62,,17.68,11.11,28.79,124.75,88.25,36.5
3,4,Purdue,Big Ten,,5,16,1,86.0,68.3,17.71,,9.93,,14.15,13.48,27.63,131.28,93.6,37.68
4,5,Duke,ACC,,6,16,1,85.8,65.9,19.88,,7.65,,13.35,14.18,27.53,124.41,88.67,35.74


## 4. Process and Filter to Our Teams

Extract just the teams we need and standardize the column names.

In [5]:
# Name mappings from Barttorvik names back to our schedule names
BARTTORVIK_TO_SCHEDULE = {
    'Florida St.': 'Florida State',
    'Miami FL': 'Miami',
    'N.C. State': 'NC State',
    'Ohio St.': 'Ohio State',
    'Pittsburgh': 'Pitt',
}

# Our target teams (from schedule - authoritative)
SCHEDULE_TEAMS = [
    'Baylor', 'Boston College', 'California', 'Clemson', 'Duke',
    'Florida State', 'Georgia Tech', 'Louisville', 'Miami', 'Michigan',
    'NC State', 'North Carolina', 'Notre Dame', 'Ohio State', 'Pitt',
    'SMU', 'Stanford', 'Syracuse', 'Virginia', 'Virginia Tech', 'Wake Forest'
]

def process_barttorvik_csv(df):
    """
    Process Barttorvik CSV data and filter to our target teams
    """
    if df is None:
        return None
    
    # Find the team column (usually 'team' or 'Team')
    team_col = None
    for col in df.columns:
        if col.lower() == 'team':
            team_col = col
            break
    
    if team_col is None:
        print(f"Could not find team column. Columns: {df.columns.tolist()}")
        return None
    
    print(f"Using team column: '{team_col}'")
    
    # Create cleaned copy
    result = df.copy()
    
    # Standardize team names from Barttorvik to our schedule format
    result['team_clean'] = result[team_col].apply(
        lambda x: BARTTORVIK_TO_SCHEDULE.get(str(x).strip(), str(x).strip())
    )
    
    # Filter to our teams
    filtered = result[result['team_clean'].isin(SCHEDULE_TEAMS)].copy()
    
    print(f"Found {len(filtered)}/{len(SCHEDULE_TEAMS)} target teams")
    
    missing = set(SCHEDULE_TEAMS) - set(filtered['team_clean'].tolist())
    if missing:
        print(f"Missing teams: {missing}")
    
    return filtered

# Process the data
if barttorvik_df is not None:
    our_teams = process_barttorvik_csv(barttorvik_df)
    if our_teams is not None:
        print("\nOur teams data:")
        display(our_teams)
else:
    our_teams = None
    print("No Barttorvik data - will use manual entry")

Using team column: 'team'
Found 21/21 target teams

Our teams data:


Unnamed: 0,rank,team,conf,record,adjoe,oe Rank,adjde,de Rank,barthag,rank.1,...,ConPoss,ConOE,ConDE,ConSOSRemain,Conf Win%,WAB,WAB Rk,Fun Rk,adjt,team_clean
0,1,Michigan,B10,15-1,128.283321,4,89.600828,1,0.984126,1,...,440.9,1.229304,1.007031,0.859928,0.833333,5.013475,5,162,74.244402,Michigan
8,9,Virginia,ACC,15-2,125.705708,8,96.643305,17,0.953626,9,...,358.225,1.099867,0.951916,0.77412,0.8,2.88284,14,130,66.106372,Virginia
14,15,Duke,ACC,16-1,122.341741,28,96.017459,15,0.941932,15,...,350.2625,1.179116,1.056351,0.819805,1.0,5.519244,3,5,68.490626,Duke
20,21,Louisville,ACC,12-5,124.937656,10,100.395387,39,0.925189,21,...,345.4875,1.111473,1.085423,0.819707,0.4,0.915674,39,285,70.478812,Louisville
23,24,Clemson,ACC,15-3,117.586288,61,95.608177,14,0.915252,24,...,328.9,1.097598,0.942536,0.771551,1.0,2.801531,15,70,65.506775,Clemson
25,26,N.C. State,ACC,12-5,122.132113,30,99.617986,35,0.912396,26,...,274.05,1.178617,0.996169,0.824124,0.75,0.274608,48,323,69.376036,NC State
28,29,North Carolina,ACC,14-3,121.422034,38,100.898766,43,0.893717,29,...,283.1875,1.197087,1.20768,0.829499,0.5,1.873323,26,74,67.769279,North Carolina
31,32,Miami FL,ACC,15-2,120.095744,47,100.904098,44,0.881045,32,...,276.9375,1.187994,1.068833,0.766557,1.0,1.995058,25,40,70.072057,Miami
33,34,SMU,ACC,13-4,124.148102,13,104.780248,87,0.875503,34,...,273.6875,1.165563,1.150948,0.785038,0.5,1.648823,31,61,69.08747,SMU
46,47,Ohio St.,B10,11-5,120.378481,44,103.736389,73,0.846974,47,...,398.475,1.156911,1.149382,0.837317,0.5,0.149176,52,146,67.820319,Ohio State


## 5. Manual Entry Option (if scraping fails)

If scraping doesn't work, you can manually enter ratings from Barttorvik or KenPom.
Go to https://barttorvik.com and copy the values for each team.

In [6]:
# ================================================================
# AUTOMATIC FETCH: This cell fetches real-time data from Barttorvik
# If you want to use this as a fallback when scraping fails
# ================================================================

def fetch_manual_ratings(year=2026):
    """
    Fetch team ratings from Barttorvik for manual fallback
    Returns dictionary of team ratings
    """
    try:
        url = f"https://barttorvik.com/{year}_team_results.csv"
        print(f"Fetching Barttorvik data for fallback...")
        
        response = requests.get(url, timeout=15)
        response.raise_for_status()
        df = pd.read_csv(StringIO(response.text))
        
        # Map our schedule names to Barttorvik names
        schedule_to_barttorvik = {
            'Florida State': 'Florida St.',
            'Miami': 'Miami FL',
            'NC State': 'N.C. State',
            'Ohio State': 'Ohio St.',
            'Pitt': 'Pittsburgh',
        }
        
        # Teams we need (Barttorvik format)
        barttorvik_teams = [
            'Baylor', 'Boston College', 'California', 'Clemson', 'Duke',
            'Florida St.', 'Georgia Tech', 'Louisville', 'Miami FL', 'Michigan',
            'N.C. State', 'North Carolina', 'Notre Dame', 'Ohio St.', 'Pittsburgh',
            'SMU', 'Stanford', 'Syracuse', 'Virginia', 'Virginia Tech', 'Wake Forest'
        ]
        
        # Filter to our teams
        our_teams = df[df['team'].isin(barttorvik_teams)].copy()
        
        # Parse wins/losses from record column (format: '16-2')
        our_teams['wins'] = our_teams['record'].str.split('-').str[0].astype(int)
        our_teams['losses'] = our_teams['record'].str.split('-').str[1].astype(int)
        
        # Map Barttorvik names back to our schedule format
        barttorvik_to_schedule = {
            'Florida St.': 'Florida State',
            'Miami FL': 'Miami',
            'N.C. State': 'NC State',
            'Ohio St.': 'Ohio State',
            'Pittsburgh': 'Pitt',
        }
        
        # Build ratings dictionary
        ratings = {}
        for _, row in our_teams.iterrows():
            team_name = barttorvik_to_schedule.get(row['team'], row['team'])
            ratings[team_name] = {
                'adj_o': row['adjoe'],
                'adj_d': row['adjde'],
                'barthag': row['barthag'],
                'wins': row['wins'],
                'losses': row['losses']
            }
        
        print(f"✓ Fetched data for {len(ratings)} teams from Barttorvik")
        
        # Check for missing teams
        schedule_teams = [
            'Baylor', 'Boston College', 'California', 'Clemson', 'Duke',
            'Florida State', 'Georgia Tech', 'Louisville', 'Miami', 'Michigan',
            'NC State', 'North Carolina', 'Notre Dame', 'Ohio State', 'Pitt',
            'SMU', 'Stanford', 'Syracuse', 'Virginia', 'Virginia Tech', 'Wake Forest'
        ]
        missing = set(schedule_teams) - set(ratings.keys())
        if missing:
            print(f"⚠️  Missing teams (using placeholders): {missing}")
            # Add placeholders for missing teams
            for team in missing:
                ratings[team] = {
                    'adj_o': 100.0, 'adj_d': 100.0, 'barthag': 0.50,
                    'wins': 10, 'losses': 10
                }
        
        return ratings
        
    except Exception as e:
        print(f"⚠️ Error fetching from Barttorvik: {e}")
        print("Using placeholder values - UPDATE THESE MANUALLY!")
        
        # Fallback placeholder values (only used if fetch fails)
        return {
            'Baylor':         {'adj_o': 116.0, 'adj_d': 96.0, 'barthag': 0.88, 'wins': 14, 'losses': 4},
            'Boston College': {'adj_o': 98.0, 'adj_d': 108.0, 'barthag': 0.42, 'wins': 6, 'losses': 11},
            'California':     {'adj_o': 102.0, 'adj_d': 108.0, 'barthag': 0.45, 'wins': 7, 'losses': 11},
            'Clemson':        {'adj_o': 112.0, 'adj_d': 98.0, 'barthag': 0.82, 'wins': 13, 'losses': 5},
            'Duke':           {'adj_o': 120.0, 'adj_d': 95.0, 'barthag': 0.94, 'wins': 16, 'losses': 2},
            'Florida State':  {'adj_o': 102.0, 'adj_d': 104.0, 'barthag': 0.52, 'wins': 8, 'losses': 10},
            'Georgia Tech':   {'adj_o': 100.0, 'adj_d': 106.0, 'barthag': 0.48, 'wins': 7, 'losses': 11},
            'Louisville':     {'adj_o': 108.0, 'adj_d': 104.0, 'barthag': 0.68, 'wins': 10, 'losses': 8},
            'Miami':          {'adj_o': 108.0, 'adj_d': 102.0, 'barthag': 0.72, 'wins': 10, 'losses': 7},
            'Michigan':       {'adj_o': 110.0, 'adj_d': 102.0, 'barthag': 0.74, 'wins': 11, 'losses': 7},
            'NC State':       {'adj_o': 110.0, 'adj_d': 100.0, 'barthag': 0.78, 'wins': 12, 'losses': 6},
            'North Carolina': {'adj_o': 118.0, 'adj_d': 98.0, 'barthag': 0.90, 'wins': 14, 'losses': 4},
            'Notre Dame':     {'adj_o': 108.0, 'adj_d': 106.0, 'barthag': 0.58, 'wins': 8, 'losses': 9},
            'Ohio State':     {'adj_o': 112.0, 'adj_d': 100.0, 'barthag': 0.80, 'wins': 12, 'losses': 6},
            'Pitt':           {'adj_o': 106.0, 'adj_d': 100.0, 'barthag': 0.70, 'wins': 10, 'losses': 8},
            'SMU':            {'adj_o': 114.0, 'adj_d': 102.0, 'barthag': 0.80, 'wins': 13, 'losses': 5},
            'Stanford':       {'adj_o': 100.0, 'adj_d': 106.0, 'barthag': 0.44, 'wins': 6, 'losses': 11},
            'Syracuse':       {'adj_o': 110.0, 'adj_d': 106.0, 'barthag': 0.66, 'wins': 9, 'losses': 8},
            'Virginia':       {'adj_o': 108.0, 'adj_d': 90.0, 'barthag': 0.88, 'wins': 13, 'losses': 4},
            'Virginia Tech':  {'adj_o': 104.0, 'adj_d': 102.0, 'barthag': 0.62, 'wins': 9, 'losses': 9},
            'Wake Forest':    {'adj_o': 106.0, 'adj_d': 104.0, 'barthag': 0.60, 'wins': 9, 'losses': 9},
        }

# Fetch the ratings
MANUAL_TEAM_RATINGS = fetch_manual_ratings(2026)

# Convert to DataFrame
manual_df = pd.DataFrame(MANUAL_TEAM_RATINGS).T.reset_index()
manual_df.columns = ['team', 'adj_o', 'adj_d', 'barthag', 'wins', 'losses']

# Calculate derived metrics
manual_df['net_rating'] = manual_df['adj_o'] - manual_df['adj_d']
manual_df['win_pct'] = manual_df['wins'] / (manual_df['wins'] + manual_df['losses'])
manual_df['power_rating'] = manual_df['net_rating']  # Simple power rating = net efficiency

print("\nManual team ratings loaded:")
manual_df.sort_values('net_rating', ascending=False)

Fetching Barttorvik data for fallback...
✓ Fetched data for 21 teams from Barttorvik

Manual team ratings loaded:


Unnamed: 0,team,adj_o,adj_d,barthag,wins,losses,net_rating,win_pct,power_rating
0,Michigan,128.283321,89.600828,0.984126,15.0,1.0,38.682493,0.9375,38.682493
1,Virginia,125.705708,96.643305,0.953626,15.0,2.0,29.062403,0.882353,29.062403
2,Duke,122.341741,96.017459,0.941932,16.0,1.0,26.324282,0.941176,26.324282
3,Louisville,124.937656,100.395387,0.925189,12.0,5.0,24.542269,0.705882,24.542269
5,NC State,122.132113,99.617986,0.912396,12.0,5.0,22.514128,0.705882,22.514128
4,Clemson,117.586288,95.608177,0.915252,15.0,3.0,21.978111,0.833333,21.978111
6,North Carolina,121.422034,100.898766,0.893717,14.0,3.0,20.523268,0.823529,20.523268
8,SMU,124.148102,104.780248,0.875503,13.0,4.0,19.367853,0.764706,19.367853
7,Miami,120.095744,100.904098,0.881045,15.0,2.0,19.191646,0.882353,19.191646
9,Ohio State,120.378481,103.736389,0.846974,11.0,5.0,16.642092,0.6875,16.642092


## 6. Convert to Model Format and Save

Convert the ratings to the format expected by our prediction model.

In [7]:
# Use scraped Barttorvik data if available
if our_teams is not None and len(our_teams) >= 18:
    print("✓ Using scraped Barttorvik data!")
    
    # Map Barttorvik column names to our format
    # Common Barttorvik columns: team, conf, rec, adjoe, adjde, barthag, etc.
    print(f"\nAvailable columns: {our_teams.columns.tolist()}")
    
    # Try to find the right columns (Barttorvik uses various naming conventions)
    def find_col(df, options):
        for opt in options:
            matches = [c for c in df.columns if opt.lower() in c.lower()]
            if matches:
                return matches[0]
        return None
    
    adj_o_col = find_col(our_teams, ['adjoe', 'adj_o', 'adjO', 'oe'])
    adj_d_col = find_col(our_teams, ['adjde', 'adj_d', 'adjD', 'de'])
    barthag_col = find_col(our_teams, ['barthag', 'barth'])
    wins_col = find_col(our_teams, ['wins', 'w'])
    losses_col = find_col(our_teams, ['losses', 'l'])
    
    print(f"Found columns - AdjO: {adj_o_col}, AdjD: {adj_d_col}, Barthag: {barthag_col}")
    
    model_df = pd.DataFrame({
        'team': our_teams['team_clean'],
        'off_efficiency': our_teams[adj_o_col] if adj_o_col else 100.0,
        'def_efficiency': our_teams[adj_d_col] if adj_d_col else 100.0,
    })
    
    model_df['ppg'] = model_df['off_efficiency'] * 0.70
    model_df['opp_ppg'] = model_df['def_efficiency'] * 0.70
    model_df['pace'] = 70.0
    model_df['power_rating'] = model_df['off_efficiency'] - model_df['def_efficiency']
    model_df['win_pct'] = 0.5  # Default, can update if wins/losses columns found
    
    if wins_col and losses_col:
        model_df['win_pct'] = our_teams[wins_col] / (our_teams[wins_col] + our_teams[losses_col])

else:
    print("⚠️ Using manual entry data (update values in cell 6 for accuracy)")
    
    # Fallback to manual data
    MANUAL_TEAM_RATINGS = {
        'Duke':           {'adj_o': 120.0, 'adj_d': 95.0, 'wins': 16, 'losses': 2},
        'North Carolina': {'adj_o': 118.0, 'adj_d': 98.0, 'wins': 14, 'losses': 4},
        'Virginia':       {'adj_o': 108.0, 'adj_d': 90.0, 'wins': 13, 'losses': 4},
        'Clemson':        {'adj_o': 112.0, 'adj_d': 98.0, 'wins': 13, 'losses': 5},
        'NC State':       {'adj_o': 110.0, 'adj_d': 100.0, 'wins': 12, 'losses': 6},
        'SMU':            {'adj_o': 114.0, 'adj_d': 102.0, 'wins': 13, 'losses': 5},
        'Miami':          {'adj_o': 108.0, 'adj_d': 102.0, 'wins': 10, 'losses': 7},
        'Pitt':           {'adj_o': 106.0, 'adj_d': 100.0, 'wins': 10, 'losses': 8},
        'Louisville':     {'adj_o': 108.0, 'adj_d': 104.0, 'wins': 10, 'losses': 8},
        'Syracuse':       {'adj_o': 110.0, 'adj_d': 106.0, 'wins': 9, 'losses': 8},
        'Virginia Tech':  {'adj_o': 104.0, 'adj_d': 102.0, 'wins': 9, 'losses': 9},
        'Wake Forest':    {'adj_o': 106.0, 'adj_d': 104.0, 'wins': 9, 'losses': 9},
        'Notre Dame':     {'adj_o': 108.0, 'adj_d': 106.0, 'wins': 8, 'losses': 9},
        'Florida State':  {'adj_o': 102.0, 'adj_d': 104.0, 'wins': 8, 'losses': 10},
        'Georgia Tech':   {'adj_o': 100.0, 'adj_d': 106.0, 'wins': 7, 'losses': 11},
        'Boston College': {'adj_o': 98.0, 'adj_d': 108.0, 'wins': 6, 'losses': 11},
        'California':     {'adj_o': 102.0, 'adj_d': 108.0, 'wins': 7, 'losses': 11},
        'Stanford':       {'adj_o': 100.0, 'adj_d': 106.0, 'wins': 6, 'losses': 11},
        'Baylor':         {'adj_o': 116.0, 'adj_d': 96.0, 'wins': 14, 'losses': 4},
        'Ohio State':     {'adj_o': 112.0, 'adj_d': 100.0, 'wins': 12, 'losses': 6},
        'Michigan':       {'adj_o': 110.0, 'adj_d': 102.0, 'wins': 11, 'losses': 7},
    }
    
    teams = []
    for team_name, stats in MANUAL_TEAM_RATINGS.items():
        row = {'team': team_name}
        row.update(stats)
        teams.append(row)
    
    manual_df = pd.DataFrame(teams)
    
    model_df = pd.DataFrame({
        'team': manual_df['team'],
        'ppg': manual_df['adj_o'] * 0.70,
        'opp_ppg': manual_df['adj_d'] * 0.70,
        'off_efficiency': manual_df['adj_o'],
        'def_efficiency': manual_df['adj_d'],
        'pace': 70.0,
        'win_pct': manual_df['wins'] / (manual_df['wins'] + manual_df['losses']),
        'power_rating': manual_df['adj_o'] - manual_df['adj_d'],
    })

print(f"\n✓ Created model data for {len(model_df)} teams")
print("\nModel-ready data (sorted by power rating):")
model_df.sort_values('power_rating', ascending=False)

✓ Using scraped Barttorvik data!

Available columns: ['rank', 'team', 'conf', 'record', 'adjoe', 'oe Rank', 'adjde', 'de Rank', 'barthag', 'rank.1', 'proj. W', 'Proj. L', 'Pro Con W', 'Pro Con L', 'Con Rec.', 'sos', 'ncsos', 'consos', 'Proj. SOS', 'Proj. Noncon SOS', 'Proj. Con SOS', 'elite SOS', 'elite noncon SOS', 'Opp OE', 'Opp DE', 'Opp Proj. OE', 'Opp Proj DE', 'Con Adj OE', 'Con Adj DE', 'Qual O', 'Qual D', 'Qual Barthag', 'Qual Games', 'FUN', 'ConPF', 'ConPA', 'ConPoss', 'ConOE', 'ConDE', 'ConSOSRemain', 'Conf Win%', 'WAB', 'WAB Rk', 'Fun Rk', 'adjt', 'team_clean']
Found columns - AdjO: adjoe, AdjD: adjde, Barthag: barthag

✓ Created model data for 21 teams

Model-ready data (sorted by power rating):


Unnamed: 0,team,off_efficiency,def_efficiency,ppg,opp_ppg,pace,power_rating,win_pct
0,Michigan,128.283321,89.600828,89.798325,62.72058,70.0,38.682493,0.886766
8,Virginia,125.705708,96.643305,87.993996,67.650313,70.0,29.062403,0.850527
14,Duke,122.341741,96.017459,85.639218,67.212221,70.0,26.324282,0.825778
20,Louisville,124.937656,100.395387,87.456359,70.276771,70.0,24.542269,0.68053
25,NC State,122.132113,99.617986,85.492479,69.73259,70.0,22.514128,0.664219
23,Clemson,117.586288,95.608177,82.310402,66.925724,70.0,21.978111,0.772803
28,North Carolina,121.422034,100.898766,84.995424,70.629136,70.0,20.523268,0.706796
33,SMU,124.148102,104.780248,86.903671,73.346174,70.0,19.367853,0.685354
31,Miami,120.095744,100.904098,84.06702,70.632869,70.0,19.191646,0.756074
46,Ohio State,120.378481,103.736389,84.264937,72.615472,70.0,16.642092,0.567455


In [8]:
# Save to processed data folder using config path
output_path = config.PROCESSED_DATA_DIR / 'team_stats_2025_26.csv'
model_df.to_csv(output_path, index=False)
print(f"✓ Saved updated team stats to {output_path}")

# Verify
print(f"\nSaved {len(model_df)} teams")
print(f"Power rating range: {model_df['power_rating'].min():.1f} to {model_df['power_rating'].max():.1f}")

✓ Saved updated team stats to /Users/calebhan/Documents/Coding/Personal/triangle-sports-analytics-26/notebooks/../data/processed/team_stats_2025_26.csv

Saved 21 teams
Power rating range: -0.3 to 38.7


## 7. Quick Sanity Check: Expected Spreads

Let's verify the ratings make sense by looking at expected spreads for key matchups.

In [9]:
HOME_COURT_ADVANTAGE = 3.5

def predict_spread(home_team, away_team, team_stats):
    """Simple spread prediction using net efficiency"""
    stats = team_stats.set_index('team')
    
    home_net = stats.loc[home_team, 'power_rating']
    away_net = stats.loc[away_team, 'power_rating']
    
    # Spread = (home_efficiency - away_efficiency) / 2 + HCA
    # Division by 2 because efficiency is per 100 possessions but we want per-game
    spread = (home_net - away_net) / 2 + HOME_COURT_ADVANTAGE
    
    return spread

# Test some key matchups
test_matchups = [
    ('North Carolina', 'Duke'),   # Classic rivalry at UNC
    ('Duke', 'North Carolina'),   # At Duke
    ('Virginia', 'Duke'),         # UVA hosting Duke
    ('Boston College', 'Duke'),   # BC hosting Duke
    ('Duke', 'NC State'),         # Duke at NC State
]

print("Expected Spreads (sanity check):")
print("=" * 55)
for home, away in test_matchups:
    spread = predict_spread(home, away, model_df)
    if spread > 0:
        print(f"{away:15} @ {home:15} → {home} by {spread:.1f}")
    else:
        print(f"{away:15} @ {home:15} → {away} by {-spread:.1f}")

print("\n✓ Do these spreads look reasonable? If not, adjust the ratings above.")

Expected Spreads (sanity check):
Duke            @ North Carolina  → North Carolina by 0.6
North Carolina  @ Duke            → Duke by 6.4
Duke            @ Virginia        → Virginia by 4.9
Duke            @ Boston College  → Duke by 9.8
NC State        @ Duke            → Duke by 5.4

✓ Do these spreads look reasonable? If not, adjust the ratings above.


## Next Steps

Run [02_modeling.ipynb](02_modeling.ipynb) to use these ratings in our predictions!