# Triangle Sports Analytics - Data Collection

This notebook handles data collection for the 2026 ACC Basketball point spread prediction competition.

**Objective:** Collect team statistics for all teams in the prediction schedule to build features for our models.

**Data Sources:**
- Sports Reference (team stats, game results)
- Template CSV (78 games to predict)

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import requests
from bs4 import BeautifulSoup
import time
import os
from datetime import datetime

# Add the parent directory to the import path so project modules (e.g. src) can be imported
import sys
sys.path.insert(0, '..')

print("Libraries imported successfully!")

Libraries imported successfully!


## 1. Load Prediction Template

Let's first load the template to see which teams we need data for.

In [2]:
# Load the prediction template
template_path = '../tsa_pt_spread_template_2026 - Sheet1.csv'
template_df = pd.read_csv(template_path)

# Clean the template - remove empty rows
template_df = template_df.dropna(subset=['Date', 'Away', 'Home'])

# Parse dates
template_df['Date'] = pd.to_datetime(template_df['Date'], format='%m/%d/%Y')

print(f"Total games to predict: {len(template_df)}")
print(f"Date range: {template_df['Date'].min().date()} to {template_df['Date'].max().date()}")
print("\nFirst 10 games:")
template_df.head(10)

Total games to predict: 78
Date range: 2026-02-07 to 2026-03-07

First 10 games:


|  | Date | Away | Home | pt_spread | team_name | team_members | email |
|---|---|---|---|---|---|---|---|
| 0 | 2026-02-07 | Syracuse | Virginia |  |  |  |  |
| 1 | 2026-02-07 | Louisville | Wake Forest |  |  |  |  |
| 2 | 2026-02-07 | Virginia Tech | NC State |  |  |  |  |
| 3 | 2026-02-07 | Miami | Boston College |  |  |  |  |
| 4 | 2026-02-07 | SMU | Pitt |  |  |  |  |
| 5 | 2026-02-07 | Florida State | Notre Dame |  |  |  |  |
| 6 | 2026-02-07 | Duke | North Carolina |  |  |  |  |
| 7 | 2026-02-07 | Clemson | California |  |  |  |  |
| 8 | 2026-02-07 | Georgia Tech | Stanford |  |  |  |  |
| 9 | 2026-02-09 | NC State | Louisville |  |  |  |  |
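
Before collecting data, it can help to sanity-check the cleaned schedule. The sketch below uses only the standard library and a few rows taken from the preview above. The `find_schedule_problems` helper is hypothetical and not part of the notebook, and the raw date strings are assumed to follow the `%m/%d/%Y` format that the loading cell parses.

```python
from collections import Counter
from datetime import datetime

# A few rows from the preview above as (date, away, home).
# The raw date strings are an assumption based on the '%m/%d/%Y' format.
SAMPLE_GAMES = [
    ("02/07/2026", "Syracuse", "Virginia"),
    ("02/07/2026", "Duke", "North Carolina"),
    ("02/09/2026", "NC State", "Louisville"),
]


def find_schedule_problems(games):
    """Return human-readable problems found in (date, away, home) rows."""
    problems = []
    seen = Counter()
    for date_str, away, home in games:
        # Dates must parse with the same format the notebook uses.
        try:
            datetime.strptime(date_str, "%m/%d/%Y")
        except ValueError:
            problems.append(f"unparseable date: {date_str!r}")
        # A team cannot be both the away and the home side.
        if away == home:
            problems.append(f"team plays itself: {away} on {date_str}")
        seen[(date_str, away, home)] += 1
    # Flag rows that appear more than once.
    for key, count in seen.items():
        if count > 1:
            problems.append(f"duplicate game listed {count} times: {key}")
    return problems


print(find_schedule_problems(SAMPLE_GAMES))
```

An empty list means none of these three checks fired; it does not guarantee the schedule is otherwise correct.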


In [3]:
# Get all unique teams from the schedule
away_teams = set(template_df['Away'].unique())
home_teams = set(template_df['Home'].unique())
all_teams = sorted(away_teams.union(home_teams))

print(f"Total unique teams: {len(all_teams)}")
print("\nTeams in schedule:")
for i, team in enumerate(all_teams, 1):
    print(f"  {i}. {team}")

Total unique teams: 21

Teams in schedule:
  1. Baylor
  2. Boston College
  3. California
  4. Clemson
  5. Duke
  6. Florida State
  7. Georgia Tech
  8. Louisville
  9. Miami
  10. Michigan
  11. NC State
  12. North Carolina
  13. Notre Dame
  14. Ohio State
  15. Pitt
  16. SMU
  17. Stanford
  18. Syracuse
  19. Virginia
  20. Virginia Tech
  21. Wake Forest
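
The team list can also be broken down by home and away appearances, which shows how often each team's data will be used. This is a minimal sketch with a hypothetical `count_appearances` helper, run on a few matchups taken from the preview rather than the full 78-game schedule.

```python
from collections import Counter

# A few matchups from the preview above as (away, home).
SAMPLE_MATCHUPS = [
    ("Syracuse", "Virginia"),
    ("Louisville", "Wake Forest"),
    ("Virginia Tech", "NC State"),
    ("NC State", "Louisville"),
]


def count_appearances(matchups):
    """Count away and home appearances per team, sorted by team name."""
    away_counts = Counter(away for away, _ in matchups)
    home_counts = Counter(home for _, home in matchups)
    teams = sorted(set(away_counts) | set(home_counts))
    # Counter returns 0 for teams that never appear on one side.
    return {
        team: {"away": away_counts[team], "home": home_counts[team]}
        for team in teams
    }


for team, counts in count_appearances(SAMPLE_MATCHUPS).items():
    print(f"{team}: {counts['away']} away, {counts['home']} home")
```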


## 2. Define Team Mappings

Map team names to Sports Reference URL slugs for scraping.

In [4]:
# Sports Reference URL slugs for each team
TEAM_SLUGS = {
    'Duke': 'duke',
    'North Carolina': 'north-carolina',
    'NC State': 'north-carolina-state',
    'Virginia': 'virginia',
    'Virginia Tech': 'virginia-tech',
    'Clemson': 'clemson',
    'Florida State': 'florida-state',
    'Miami': 'miami-fl',
    'Pitt': 'pittsburgh',
    'Syracuse': 'syracuse',
    'Louisville': 'louisville',
    'Wake Forest': 'wake-forest',
    'Georgia Tech': 'georgia-tech',
    'Boston College': 'boston-college',
    'Notre Dame': 'notre-dame',
    'California': 'california',
    'Stanford': 'stanford',
    'SMU': 'southern-methodist',
    # Non-ACC teams in schedule
    'Baylor': 'baylor',
    'Ohio State': 'ohio-state',
    'Michigan': 'michigan',
}

# Verify all teams have mappings
missing_teams = [t for t in all_teams if t not in TEAM_SLUGS]
if missing_teams:
    print(f"WARNING: Missing mappings for: {missing_teams}")
else:
    print("✓ All teams have URL mappings")

✓ All teams have URL mappings
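
To show how a slug turns into a page address, here is a small sketch that builds the season URL using the same pattern as the scraper in the next section. The `build_team_url` helper and the slug check (lowercase words joined by hyphens) are assumptions for illustration; the check matches every slug in `TEAM_SLUGS` but is not a rule published by Sports Reference.

```python
import re

# URL pattern used by the scraper in this notebook.
BASE_URL = "https://www.sports-reference.com/cbb/schools/{slug}/men/{year}.html"

# Assumed slug shape: lowercase words separated by single hyphens.
SLUG_PATTERN = re.compile(r"^[a-z]+(-[a-z]+)*$")


def build_team_url(slug: str, season_year: int = 2026) -> str:
    """Return the season page URL for a slug, rejecting malformed slugs."""
    if not SLUG_PATTERN.match(slug):
        raise ValueError(f"Unexpected slug format: {slug!r}")
    return BASE_URL.format(slug=slug, year=season_year)


print(build_team_url("north-carolina-state"))
```

Catching a display name such as `'North Carolina'` passed in place of its slug avoids a wasted request that would most likely fail anyway.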


## 3. Scrape Team Statistics from Sports Reference

We'll collect team statistics for the 2025-26 season. Key stats include:
- Points per game (PPG) and opponent PPG
- Field goal percentage
- Offensive and defensive efficiency (points scored and allowed per 100 possessions)
- Rebounds, assists, turnovers
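
For reference, efficiency is conventionally expressed as points per 100 possessions, with possessions estimated from box-score totals. The sketch below illustrates that convention. The function names are hypothetical, the 0.475 free-throw weight is one commonly used value for college basketball rather than something this notebook specifies, and the input numbers are made up for illustration.

```python
def estimate_possessions(fga, orb, tov, fta, ft_weight=0.475):
    """Estimate possessions from field goal attempts, offensive rebounds,
    turnovers and free throw attempts (ft_weight is an assumed constant)."""
    return fga - orb + tov + ft_weight * fta


def efficiency(points, possessions):
    """Points per 100 possessions."""
    return 100.0 * points / possessions


# Illustrative single-game totals (not real data).
possessions = estimate_possessions(fga=60, orb=10, tov=12, fta=20)
print(f"Estimated possessions: {possessions:.1f}")
print(f"Efficiency: {efficiency(80, possessions):.1f}")
```

Offensive efficiency uses a team's own points, and defensive efficiency applies the same formula to the points it allows.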

In [5]:
def scrape_team_stats(team_name: str, team_slug: str, season_year: int = 2026) -> dict:
    """
    Scrape team statistics from Sports Reference.
    
    Args:
        team_name: Display name of team
        team_slug: URL slug for Sports Reference
        season_year: End year of season (2026 for 2025-26)
        
    Returns:
        Dictionary of team statistics
    """
    url = f"https://www.sports-reference.com/cbb/schools/{team_slug}/men/{season_year}.html"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36'
    }
    
    try:
        response = requests.get(url, headers=headers, timeout=15)
        response.raise_for_status()
        
        soup = BeautifulSoup(response.content, 'lxml')
        stats = {'team': team_name}
        
        # TODO: stat extraction is not implemented yet, so only the team
        # name is returned for now.
        #
        # Sports Reference typically exposes per-game stats in a table
        # with id="per_game", but the page structure may vary and should
        # be verified before relying on it.
        return stats
        
    except requests.RequestException as e:
        print(f"  Error fetching {team_name}: {e}")
        return {'team': team_name, 'error': str(e)}

# Test with one team
print("Testing scraper with Duke...")
test_stats = scrape_team_stats('Duke', 'duke', 2026)
print(f"Result: {test_stats}")

Testing scraper with Duke...
Result: {'team': 'Duke'}
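
The test result contains only the team name because the function does not extract any stats yet. As a rough idea of what that step could look like, the sketch below reads cell values from a table by its `id`. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without a network or extra packages. The HTML fragment, the `data-stat` names (`pts_per_g`, `fg_pct`) and the helper names are all hypothetical; the real page layout should be checked before adapting this.

```python
from html.parser import HTMLParser

# Hypothetical fragment; the real Sports Reference markup may differ.
SAMPLE_HTML = """
<table id="per_game">
  <tbody>
    <tr>
      <th data-stat="entity">Team</th>
      <td data-stat="pts_per_g">82.0</td>
      <td data-stat="fg_pct">.480</td>
    </tr>
  </tbody>
</table>
<table id="other"><tr><td data-stat="pts_per_g">1.0</td></tr></table>
"""


class StatTableParser(HTMLParser):
    """Collect {data-stat: text} from <td> cells inside one table."""

    def __init__(self, table_id):
        super().__init__()
        self.table_id = table_id
        self.in_table = False
        self.current_stat = None
        self.stats = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table" and attrs.get("id") == self.table_id:
            self.in_table = True
        elif self.in_table and tag == "td" and "data-stat" in attrs:
            self.current_stat = attrs["data-stat"]

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_table = False
        elif tag == "td":
            self.current_stat = None

    def handle_data(self, data):
        if self.in_table and self.current_stat and data.strip():
            # Keep only the first value seen for each stat name.
            self.stats.setdefault(self.current_stat, data.strip())


def parse_stat_table(html, table_id="per_game"):
    """Return stats from the named table, converting numeric text to float."""
    parser = StatTableParser(table_id)
    parser.feed(html)
    stats = {}
    for name, text in parser.stats.items():
        try:
            stats[name] = float(text)
        except ValueError:
            stats[name] = text
    return stats


print(parse_stat_table(SAMPLE_HTML))
```

If the table is missing, the function returns an empty dictionary, which a caller could treat as a signal to fall back to other data.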


## 4. Alternative: Use Manual/Pre-collected Team Ratings

Since web scraping can be unreliable, we can also create a baseline with estimated team ratings.
These can be updated with actual scraped data or from sources like KenPom/Barttorvik.
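
One possible way to combine the two sources is to start from the estimates and overwrite them only where a scrape succeeded. The sketch below assumes scraped results have the shape returned by `scrape_team_stats` (a `'team'` key, plus an `'error'` key on failure). The `merge_team_stats` helper and the scraped value shown are hypothetical.

```python
def merge_team_stats(placeholder, scraped):
    """Overlay scraped values onto placeholder values.

    Teams whose scrape failed (an 'error' key is present) keep their
    placeholder values. The inputs are not modified.
    """
    merged = {team: dict(stats) for team, stats in placeholder.items()}
    for team, stats in scraped.items():
        if "error" in stats:
            continue
        # Drop the 'team' label and any missing values before merging.
        clean = {k: v for k, v in stats.items() if k != "team" and v is not None}
        merged.setdefault(team, {}).update(clean)
    return merged


# Estimates in the same style as the placeholder table.
placeholder = {
    "Duke": {"ppg": 82.0, "opp_ppg": 68.0},
    "Pitt": {"ppg": 72.0, "opp_ppg": 70.0},
}
# Hypothetical scrape results: one success, one failure.
scraped = {
    "Duke": {"team": "Duke", "ppg": 84.1},
    "Pitt": {"team": "Pitt", "error": "timeout"},
}

merged = merge_team_stats(placeholder, scraped)
print(merged)
```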

In [6]:
# Placeholder team ratings - UPDATE THESE with actual data!
# These are rough estimates based on typical ACC team performance
# Format: team -> {ppg, opp_ppg, off_efficiency, def_efficiency, pace, win_pct, power_rating}

TEAM_STATS_PLACEHOLDER = {
    # ACC Powerhouses
    'Duke': {'ppg': 82.0, 'opp_ppg': 68.0, 'off_efficiency': 118.0, 'def_efficiency': 95.0, 'pace': 72.0, 'win_pct': 0.85, 'power_rating': 25.0},
    'North Carolina': {'ppg': 80.0, 'opp_ppg': 70.0, 'off_efficiency': 115.0, 'def_efficiency': 98.0, 'pace': 74.0, 'win_pct': 0.78, 'power_rating': 22.0},
    
    # Strong Teams
    'Virginia': {'ppg': 68.0, 'opp_ppg': 58.0, 'off_efficiency': 108.0, 'def_efficiency': 92.0, 'pace': 62.0, 'win_pct': 0.72, 'power_rating': 18.0},
    'NC State': {'ppg': 76.0, 'opp_ppg': 70.0, 'off_efficiency': 110.0, 'def_efficiency': 100.0, 'pace': 70.0, 'win_pct': 0.68, 'power_rating': 14.0},
    'Clemson': {'ppg': 74.0, 'opp_ppg': 68.0, 'off_efficiency': 108.0, 'def_efficiency': 98.0, 'pace': 68.0, 'win_pct': 0.65, 'power_rating': 12.0},
    'SMU': {'ppg': 78.0, 'opp_ppg': 72.0, 'off_efficiency': 112.0, 'def_efficiency': 102.0, 'pace': 70.0, 'win_pct': 0.70, 'power_rating': 15.0},
    
    # Mid-tier Teams
    'Miami': {'ppg': 74.0, 'opp_ppg': 72.0, 'off_efficiency': 106.0, 'def_efficiency': 102.0, 'pace': 69.0, 'win_pct': 0.58, 'power_rating': 8.0},
    'Pitt': {'ppg': 72.0, 'opp_ppg': 70.0, 'off_efficiency': 104.0, 'def_efficiency': 100.0, 'pace': 68.0, 'win_pct': 0.55, 'power_rating': 6.0},
    'Louisville': {'ppg': 75.0, 'opp_ppg': 74.0, 'off_efficiency': 105.0, 'def_efficiency': 104.0, 'pace': 71.0, 'win_pct': 0.52, 'power_rating': 4.0},
    'Syracuse': {'ppg': 76.0, 'opp_ppg': 75.0, 'off_efficiency': 106.0, 'def_efficiency': 105.0, 'pace': 72.0, 'win_pct': 0.50, 'power_rating': 3.0},
    'Virginia Tech': {'ppg': 72.0, 'opp_ppg': 71.0, 'off_efficiency': 103.0, 'def_efficiency': 102.0, 'pace': 68.0, 'win_pct': 0.52, 'power_rating': 4.0},
    'Wake Forest': {'ppg': 73.0, 'opp_ppg': 72.0, 'off_efficiency': 104.0, 'def_efficiency': 103.0, 'pace': 69.0, 'win_pct': 0.50, 'power_rating': 2.0},
    'Notre Dame': {'ppg': 74.0, 'opp_ppg': 73.0, 'off_efficiency': 105.0, 'def_efficiency': 104.0, 'pace': 68.0, 'win_pct': 0.48, 'power_rating': 1.0},
    
    # Lower-tier Teams  
    'Florida State': {'ppg': 70.0, 'opp_ppg': 72.0, 'off_efficiency': 100.0, 'def_efficiency': 103.0, 'pace': 67.0, 'win_pct': 0.45, 'power_rating': -2.0},
    'Georgia Tech': {'ppg': 68.0, 'opp_ppg': 72.0, 'off_efficiency': 98.0, 'def_efficiency': 104.0, 'pace': 66.0, 'win_pct': 0.42, 'power_rating': -4.0},
    'Boston College': {'ppg': 66.0, 'opp_ppg': 74.0, 'off_efficiency': 96.0, 'def_efficiency': 106.0, 'pace': 65.0, 'win_pct': 0.38, 'power_rating': -6.0},
    'California': {'ppg': 70.0, 'opp_ppg': 74.0, 'off_efficiency': 100.0, 'def_efficiency': 106.0, 'pace': 68.0, 'win_pct': 0.40, 'power_rating': -5.0},
    'Stanford': {'ppg': 68.0, 'opp_ppg': 73.0, 'off_efficiency': 98.0, 'def_efficiency': 105.0, 'pace': 66.0, 'win_pct': 0.38, 'power_rating': -6.0},
    
    # Non-ACC Teams (from schedule)
    'Baylor': {'ppg': 78.0, 'opp_ppg': 68.0, 'off_efficiency': 114.0, 'def_efficiency': 96.0, 'pace': 70.0, 'win_pct': 0.75, 'power_rating': 20.0},
    'Ohio State': {'ppg': 76.0, 'opp_ppg': 70.0, 'off_efficiency': 110.0, 'def_efficiency': 100.0, 'pace': 69.0, 'win_pct': 0.68, 'power_rating': 14.0},
    'Michigan': {'ppg': 74.0, 'opp_ppg': 70.0, 'off_efficiency': 108.0, 'def_efficiency': 100.0, 'pace': 68.0, 'win_pct': 0.65, 'power_rating': 10.0},
}

# Convert to DataFrame
team_stats_df = (
    pd.DataFrame.from_dict(TEAM_STATS_PLACEHOLDER, orient='index')
    .rename_axis('team')
    .reset_index()
)

print(f"Created stats for {len(team_stats_df)} teams")
print("\nTeam stats preview:")
team_stats_df.head(10)

Created stats for 21 teams

Team stats preview:


|  | team | ppg | opp_ppg | off_efficiency | def_efficiency | pace | win_pct | power_rating |
|---|---|---|---|---|---|---|---|---|
| 0 | Duke | 82.0 | 68.0 | 118.0 | 95.0 | 72.0 | 0.85 | 25.0 |
| 1 | North Carolina | 80.0 | 70.0 | 115.0 | 98.0 | 74.0 | 0.78 | 22.0 |
| 2 | Virginia | 68.0 | 58.0 | 108.0 | 92.0 | 62.0 | 0.72 | 18.0 |
| 3 | NC State | 76.0 | 70.0 | 110.0 | 100.0 | 70.0 | 0.68 | 14.0 |
| 4 | Clemson | 74.0 | 68.0 | 108.0 | 98.0 | 68.0 | 0.65 | 12.0 |
| 5 | SMU | 78.0 | 72.0 | 112.0 | 102.0 | 70.0 | 0.7 | 15.0 |
| 6 | Miami | 74.0 | 72.0 | 106.0 | 102.0 | 69.0 | 0.58 | 8.0 |
| 7 | Pitt | 72.0 | 70.0 | 104.0 | 100.0 | 68.0 | 0.55 | 6.0 |
| 8 | Louisville | 75.0 | 74.0 | 105.0 | 104.0 | 71.0 | 0.52 | 4.0 |
| 9 | Syracuse | 76.0 | 75.0 | 106.0 | 105.0 | 72.0 | 0.5 | 3.0 |
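
One way these placeholder ratings could feed a first-pass prediction is a rating difference plus a home-court adjustment. This is only a sketch under stated assumptions: `power_rating` is treated as points relative to an average team, the 3.0-point home advantage is a rule-of-thumb value, and a positive result means the home team is favoured. The notebook does not define the sign convention of `pt_spread`, so check the competition rules before using anything like this.

```python
HOME_COURT_ADVANTAGE = 3.0  # assumed rule-of-thumb value, in points


def baseline_spread(home_rating, away_rating, home_advantage=HOME_COURT_ADVANTAGE):
    """Predicted home margin (positive = home team favoured, by assumption)."""
    return home_rating - away_rating + home_advantage


# Placeholder power ratings taken from the table above.
ratings = {"Duke": 25.0, "North Carolina": 22.0, "Syracuse": 3.0, "Virginia": 18.0}

# Two matchups from the schedule preview as (away, home).
games = [("Syracuse", "Virginia"), ("Duke", "North Carolina")]

for away, home in games:
    spread = baseline_spread(ratings[home], ratings[away])
    print(f"{away} at {home}: home margin {spread:+.1f}")
```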


## 5. Save Data to Files

In [7]:
# Save team stats to processed data
os.makedirs('../data/processed', exist_ok=True)

team_stats_df.to_csv('../data/processed/team_stats_2025_26.csv', index=False)
print("✓ Saved team stats to data/processed/team_stats_2025_26.csv")

# Save cleaned template
template_df.to_csv('../data/processed/games_to_predict.csv', index=False)
print("✓ Saved games template to data/processed/games_to_predict.csv")

print(f"\nReady for modeling with {len(team_stats_df)} teams and {len(template_df)} games!")

✓ Saved team stats to data/processed/team_stats_2025_26.csv
✓ Saved games template to data/processed/games_to_predict.csv

Ready for modeling with 21 teams and 78 games!
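
A quick cross-check between the two saved files can confirm that every scheduled team has a stats row before modeling starts. The sketch below works on in-memory samples rather than the CSV files, and the `teams_missing_stats` helper is hypothetical.

```python
def teams_missing_stats(games, stats_by_team):
    """Return the sorted names of scheduled teams that have no stats entry."""
    needed = {team for away, home in games for team in (away, home)}
    return sorted(needed - set(stats_by_team))


# Sample matchups as (away, home) and a deliberately incomplete stats table.
sample_games = [("Syracuse", "Virginia"), ("Duke", "North Carolina")]
sample_stats = {"Duke": {"ppg": 82.0}, "Virginia": {"ppg": 68.0}}

print(teams_missing_stats(sample_games, sample_stats))
```

An empty list means every team in the schedule can be joined to a stats row.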


## Next Steps

Run `02_scrape_team_ratings` to gather the latest team statistics for modeling!