# Data Collection and Processing Notebook

This notebook contains all code for collecting, cleaning, and merging data for the Next Up project.

## Overview

This notebook is organized into the following sections:
1. **Setup & Helper Functions** - Import libraries and define utility functions
2. **10-Day Callup Data** - Web scraping from HoopsRumors
3. **Cleaning 10-Day Callup Data** - Processing and tidying the scraped data
4. **Two-Way Contracts Data** - Loading manually collected two-way contract data
5. **Two-Way Conversions Data** - Loading manually collected conversion data
6. **G-League Player Stats (API)** - Collecting player statistics from the Sportradar API
7. **G-League Player Stats (CSV)** - Loading player stats from CSV files by season
8. **Creating Prediction Dataset** - Merging all data sources with `called_up` target variable

---

## Section 1: Setup & Helper Functions

Import necessary libraries and define helper functions used throughout the notebook.

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import os
import re
import time
import requests
import json
import glob
from pathlib import Path
from bs4 import BeautifulSoup
from dateutil import parser
from unidecode import unidecode
from dotenv import load_dotenv

# Load environment variables for API keys
load_dotenv()

print("✅ Libraries imported successfully")


In [None]:
# Helper Functions

def clean_name(name):
    """Clean player names to ensure consistent matching across datasets.
    Handles 'Last, First' format and normalizes spacing."""
    if pd.isna(name):
        return None
    name = str(name).strip()
    # Handle "Last, First" format
    if "," in name:
        last, first = name.split(",", 1)
        return f"{first.strip()} {last.strip()}"
    return name

def to_date(x):
    """Convert date string to datetime object. Handles various date formats."""
    if pd.isna(x):
        return pd.NaT
    try:
        # HoopsRumors uses mm/dd/yy
        return parser.parse(str(x), dayfirst=False, yearfirst=False)
    except Exception:
        return pd.NaT

def nba_season(dt):
    """Map a calendar date to NBA season label like '2024-25'.
    Season 'YYYY-YY' starts Aug 1 of YYYY and ends Jul 31 of YYYY+1."""
    if pd.isna(dt):
        return None
    y = dt.year
    if dt.month >= 8:  # Aug..Dec -> season starts this year
        return f"{y}-{(y+1)%100:02d}"
    else:  # Jan..Jul -> season started previous year
        return f"{y-1}-{y%100:02d}"

def canon_team(t):
    """Canonicalize team names to standard format."""
    if pd.isna(t):
        return None
    t = t.strip()
    TEAM_CANON = {
        "LA Clippers": "Los Angeles Clippers",
        "L.A. Clippers": "Los Angeles Clippers",
        "LA Lakers": "Los Angeles Lakers",
        "L.A. Lakers": "Los Angeles Lakers",
    }
    return TEAM_CANON.get(t, t)

def extract_season_year(season_str):
    """Extract season year from season string like '2024-25' -> 2024."""
    if pd.isna(season_str):
        return None
    season_str = str(season_str)
    # Handle formats like "2024-25", "2024", "2024.0"
    if "-" in season_str:
        return int(season_str.split("-")[0])
    try:
        return int(float(season_str))
    except (ValueError, OverflowError):
        # Not a numeric season string (or not a finite number)
        return None

print("✅ Helper functions defined")
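
The Aug 1 cutoff in `nba_season` is easiest to check at its boundary. The sketch below re-implements the same rule on `datetime.date` objects (the name `season_label` is illustrative and not part of the notebook) and evaluates it on dates on either side of the cutoff. Under this rule a July 1 date belongs to the season that started the previous year, whereas the manually collected two-way rows shown in this notebook's outputs label 2025-07-01 as 2025-26, so the two labeling conventions may disagree for offseason signings.

```python
from datetime import date

def season_label(d):
    """Same rule as nba_season: a season runs Aug 1 of YYYY through Jul 31 of YYYY+1."""
    start = d.year if d.month >= 8 else d.year - 1
    return f"{start}-{(start + 1) % 100:02d}"

# Dates around the cutoff, plus the first 10-day date shown in the data
for d in (date(2025, 7, 1), date(2025, 7, 31), date(2025, 8, 1), date(2007, 1, 6)):
    print(d.isoformat(), "->", season_label(d))
```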

---

## Section 2: 10-Day Callup Data (Web Scraping)

This section scrapes 10-day contract data from the HoopsRumors website. The data includes all 10-day contracts signed since January 2007.

**Source**: https://www.hoopsrumors.com/hoops-apps/10_day_contract_tracker.php

**Output Files**:
- `callups_10day_raw.csv` - Raw scraped data
- `callups_10day_tidy.csv` - Cleaned and deduplicated data
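
The tracker URL used in the configuration cell below carries its filters as query parameters. As a minimal sketch, assuming the parameter names visible in that URL (`name`, `team`, `type`, `d1`, `d2`) mean what they appear to mean, the same string can be assembled with the standard library so the date range is easy to change. The helper name `build_tracker_url` is hypothetical.

```python
from urllib.parse import urlencode

BASE = "https://www.hoopsrumors.com/hoops-apps/10_day_contract_tracker.php"

def build_tracker_url(start="2007-01-01", end="2025-12-31", name="", team="", contract_type=""):
    """Assemble the tracker URL; empty filters are sent as empty values."""
    params = {"name": name, "team": team, "type": contract_type, "d1": start, "d2": end}
    return f"{BASE}?{urlencode(params)}"

print(build_tracker_url())
```

With the defaults this reproduces the `URL` constant defined in the configuration cell.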


In [None]:
# Configuration for 10-day callup scraping
URL = "https://www.hoopsrumors.com/hoops-apps/10_day_contract_tracker.php?name=&team=&type=&d1=2007-01-01&d2=2025-12-31"
OUT_DIR = "."
RAW_CSV = os.path.join(OUT_DIR, "callups_10day_raw.csv")
TIDY_CSV = os.path.join(OUT_DIR, "callups_10day_tidy.csv")

os.makedirs(OUT_DIR, exist_ok=True)

# Headers for web scraping
headers = {
    "User-Agent": "Mozilla/5.0 (DS3-NextUp class project; contact: example@ucsd.edu)"
}

print("✅ Configuration set up for 10-day callup scraping")


In [None]:
# Scraping functions for 10-day callup data

def scrape_with_pandas(url):
    """Try to scrape using pandas read_html (most reliable for tables)."""
    tables = pd.read_html(url)
    # Find the table that has the expected columns; usually the first
    for df in tables:
        cols = [c.lower() for c in df.columns.astype(str)]
        if {"date", "player", "team", "type"}.issubset(set(cols)):
            return df
    return None

def scrape_with_bs4(url):
    """Fallback scraping method using BeautifulSoup."""
    r = requests.get(url, headers=headers, timeout=30)
    r.raise_for_status()
    soup = BeautifulSoup(r.text, "lxml")
    table = soup.find("table")
    rows = []
    if table:
        # Header cells define the column names (expected: Date, Player, Active, Team, Type)
        th = [th.get_text(strip=True) for th in table.find_all("th")]
        n_cols = len(th)
        for tr in table.find_all("tr"):
            tds = tr.find_all("td")
            # Keep only data rows that have a cell for every header column
            if n_cols and len(tds) >= n_cols:
                rows.append([td.get_text(" ", strip=True) for td in tds[:n_cols]])
        if rows:
            df = pd.DataFrame(rows, columns=th)
            return df
    return None

# Try pandas first, fallback to bs4
# NOTE: This code is commented out since we already have the CSV files
# Uncomment to re-scrape if needed
"""
df_raw = scrape_with_pandas(URL)
if df_raw is None:
    df_raw = scrape_with_bs4(URL)

if df_raw is None or df_raw.empty:
    raise RuntimeError("Failed to parse HoopsRumors table. Check the URL or site layout.")
"""

print("Scraping functions defined (commented out - using existing CSV files)")


---

## Section 3: Cleaning 10-Day Callup Data

This section processes the raw scraped 10-day callup data:
1. Standardizes column names
2. Parses dates and extracts NBA seasons
3. Cleans player names
4. Canonicalizes team names
5. Deduplicates records (keeps first call-up per player/team/season)
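
Step 5 can be sketched on toy data. The rows below are invented for illustration; the column names follow the tidy file. Sorting by date before `drop_duplicates(..., keep="first")` is what makes the earliest signing per (player, team, season) survive.

```python
import pandas as pd

# Toy rows (invented): one player with two 10-day contracts for the same team
raw = pd.DataFrame({
    "player_name": ["A Player", "A Player", "B Player"],
    "nba_team": ["Utah Jazz", "Utah Jazz", "Utah Jazz"],
    "season": ["2023-24", "2023-24", "2023-24"],
    "date": pd.to_datetime(["2024-02-10", "2024-01-31", "2024-03-01"]),
})

tidy = (
    raw.sort_values("date")  # oldest first, so keep="first" retains the earliest signing
       .drop_duplicates(subset=["player_name", "nba_team", "season"], keep="first")
       .reset_index(drop=True)
)
print(tidy)
```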


In [None]:
# Load raw 10-day callup data (if re-scraping, uncomment scraping code above)
# For now, we load the existing tidy CSV since we already have processed data
# If you need to re-process, load callups_10day_raw.csv instead

# NOTE: This code is commented out since we already have the tidy CSV
# Uncomment to re-process raw data if needed
"""
# Load raw data
df_raw = pd.read_csv(RAW_CSV)

# Standardize column names
df_raw.columns = [c.strip().title() for c in df_raw.columns]
# Keep only relevant cols if extra present
keep = [c for c in ["Date","Player","Active","Team","Type"] if c in df_raw.columns]
df_raw = df_raw[keep].copy()

# Parse and enrich
df_raw["date"] = df_raw["Date"].apply(to_date)
df_raw["season"] = df_raw["date"].apply(nba_season)
df_raw["player_name"] = df_raw["Player"].apply(clean_name)
df_raw["nba_team"] = df_raw["Team"].apply(canon_team)
df_raw["contract_type"] = df_raw["Type"].astype(str).str.strip()

# Sort newest→oldest
df_raw = df_raw.sort_values("date", ascending=False).reset_index(drop=True)

# Save the enriched table back to RAW_CSV (this overwrites the file loaded above)
df_raw.to_csv(RAW_CSV, index=False)
print(f"Saved raw 10-day table: {RAW_CSV}  ({len(df_raw):,} rows)")

# Tidy / Deduplicate
# Many players sign multiple 10-days; for labeling a 'call-up event' we usually keep the FIRST date per (player, team, season)
df_tidy = (
    df_raw
    .dropna(subset=["player_name","nba_team","season","date"])
    .sort_values("date")  # oldest first, so first call-up is kept
    .drop_duplicates(subset=["player_name","nba_team","season"], keep="first")
    .sort_values(["season","nba_team","date"])
    .reset_index(drop=True)
)

# Select neat columns
df_tidy = df_tidy[["date","season","player_name","nba_team","contract_type"]]

df_tidy.to_csv(TIDY_CSV, index=False)
print(f"Saved tidy call-up events: {TIDY_CSV}  ({len(df_tidy):,} rows)")
"""

# Load the existing tidy CSV
df_10day_tidy = pd.read_csv("callups_10day_tidy.csv")
print(f"Loaded 10-day callup data: {len(df_10day_tidy):,} records")
print(f"Date range: {df_10day_tidy['date'].min()} to {df_10day_tidy['date'].max()}")
df_10day_tidy.head()


---

## Section 4: Two-Way Contracts Data

This section loads two-way contract data that was collected by hand from HoopsRumors and stored as a CSV file.

**Source**: Manually collected from https://www.hoopsrumors.com/hoops-apps/two_way_contract_tracker.php

**Note**: This data is manually maintained and updated as new two-way contracts are signed.
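
The outputs in this notebook show two-way rows with lowercase team names (for example `denver nuggets`) next to 10-day rows in title case (`Atlanta Hawks`). If the two sources are ever compared or grouped by team, the casing needs to be reconciled first. The following is a sketch of one way to do that, not something the notebook currently does; `normalize_team` and `TEAM_ALIASES` are hypothetical names, and the alias table is copied from `canon_team` in Section 1.

```python
from string import capwords

# Aliases from canon_team, keyed in lowercase so the lookup is case-insensitive
TEAM_ALIASES = {
    "la clippers": "Los Angeles Clippers",
    "l.a. clippers": "Los Angeles Clippers",
    "la lakers": "Los Angeles Lakers",
    "l.a. lakers": "Los Angeles Lakers",
}

def normalize_team(team):
    """Collapse whitespace, resolve known aliases, then capitalize each word."""
    key = " ".join(str(team).split()).lower()
    if key in TEAM_ALIASES:
        return TEAM_ALIASES[key]
    # capwords is used instead of str.title(), which would turn "76ers" into "76Ers"
    return capwords(key)

for t in ("denver nuggets", "LA Clippers", "philadelphia 76ers"):
    print(t, "->", normalize_team(t))
```

One caveat: names with internal capitals, such as `Seattle SuperSonics`, come out as `Seattle Supersonics`, so this is best applied only to the lowercase source or extended with further aliases.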


In [None]:
# Load two-way contracts data
df_two_way = pd.read_csv("two_way_contracts.csv")

# Clean player names for consistency
df_two_way["player_name"] = df_two_way["player_name"].apply(clean_name)

print(f"Loaded two-way contracts data: {len(df_two_way):,} records")
print(f"Seasons: {sorted(df_two_way['season'].unique())}")
df_two_way.head()


---

## Section 5: Two-Way Conversions Data

This section loads manually collected two-way contract conversion data. Conversions occur when a player's two-way contract is converted to a standard NBA contract.

**Source**: Manually collected from HoopsRumors transaction pages

**Note**: This data is manually maintained and updated as conversions occur.


In [None]:
# Load two-way conversions data
df_conversions = pd.read_csv("two_way_conversions.csv")

# Clean player names for consistency
df_conversions["player_name"] = df_conversions["player_name"].apply(clean_name)

print(f"Loaded two-way conversions data: {len(df_conversions):,} records")
print(f"Seasons: {sorted(df_conversions['season'].unique())}")
df_conversions.head()
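
Because the call-up files are later matched to player stats by name, small spelling differences (accents, casing, extra spaces) can silently drop a match. Section 1 imports `unidecode`, but none of the code shown calls it. As a sketch of the idea using only the standard library, a matching key could be derived as below; `name_key` is a hypothetical helper, and the accented spelling in the example is used purely for illustration.

```python
import unicodedata

def name_key(name):
    """Lowercase, accent-free, single-spaced version of a name, for matching only."""
    decomposed = unicodedata.normalize("NFKD", str(name))
    # Drop the combining marks left behind by the decomposition
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return " ".join(ascii_only.lower().split())

print(name_key("Tristan Vukčević"), "|", name_key("  Tristan   Vukcevic "))
```

Characters that have no ASCII decomposition (for example `ø` or `đ`) are removed rather than transliterated by this approach, which is a case where `unidecode` would behave differently.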


---

## Section 6: G-League Player Stats (API Collection)

This section collects G-League player statistics from the Sportradar API. The API provides comprehensive player data including:
- Player profiles (name, position, height, weight, etc.)
- Season statistics (points, rebounds, assists, shooting percentages, etc.)
- Team rosters

**Source**: Sportradar NBDL API  
**API Endpoint**: `https://api.sportradar.us/nbdl/trial/v8/en`

**Note**: This code requires an API key stored in a `.env` file. The code includes caching to avoid redundant API calls.
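
The caching idea can be shown in isolation from the network. In this sketch the HTTP request is replaced by a stand-in function (`fake_fetch`, invented for the example) so that the cache behavior is observable: the second call for the same file is served from disk and the fetch function runs only once.

```python
import json
import os
import tempfile

def cached_fetch(fetch, cache_file):
    """Return JSON from cache_file if present; otherwise call fetch() and store the result."""
    if os.path.exists(cache_file):
        with open(cache_file, "r") as f:
            return json.load(f)
    data = fetch()
    with open(cache_file, "w") as f:
        json.dump(data, f)
    return data

calls = []

def fake_fetch():
    # Stand-in for a real HTTP request; records that it was invoked
    calls.append(1)
    return {"teams": 3}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hierarchy.json")
    first = cached_fetch(fake_fetch, path)   # cache miss: calls fake_fetch
    second = cached_fetch(fake_fetch, path)  # cache hit: reads the file

print(first == second, len(calls))
```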


In [None]:
# API Configuration
API_KEY = os.getenv("API_KEY")
if not API_KEY:
    print("Warning: API_KEY not found in .env file")
    print("   Set up API key following instructions in API_SETUP_INSTRUCTIONS.md")
else:
    print("API key loaded")

BASE_URL = "https://api.sportradar.us/nbdl/trial/v8/en"

# Create output directories
os.makedirs("../raw", exist_ok=True)
os.makedirs("../raw_json", exist_ok=True)
os.makedirs("external", exist_ok=True)


In [None]:
# API call function with caching and retry logic
def api_call(url, cache_file=None, max_retries=5):
    """Make API call with caching and exponential backoff for rate limiting."""
    if cache_file and os.path.exists(cache_file):
        print("  [cached]")
        with open(cache_file, 'r') as f:
            return json.load(f)
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            data = response.json()
            
            if cache_file:
                with open(cache_file, 'w') as f:
                    json.dump(data, f)
            
            time.sleep(2)  # Rate limiting delay
            return data
            
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:
                wait_time = (2 ** attempt) * 5  # Exponential backoff: 5, 10, 20, 40, 80 seconds
                print(f"  Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
        except Exception as e:
            if attempt < max_retries - 1:
                print(f"  Error: {e}. Retrying...")
                time.sleep(2)
            else:
                raise
    
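    # Every attempt was rate-limited; fail loudly instead of returning None to the caller
    raise RuntimeError(f"Still rate-limited after {max_retries} attempts")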

print("API call function defined")
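
For a sense of scale, the backoff schedule used on HTTP 429 responses can be computed directly. The constants below mirror the values in `api_call` (a base of 5 seconds and the default `max_retries=5`); the variable names are illustrative.

```python
BASE_DELAY = 5    # seconds, the multiplier used in api_call
MAX_RETRIES = 5   # default max_retries in api_call

# Wait before each retry when every attempt is rate-limited
waits = [BASE_DELAY * 2 ** attempt for attempt in range(MAX_RETRIES)]
print(waits, sum(waits))
```

In the worst case a single URL therefore spends 155 seconds sleeping before the retries are exhausted.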


In [None]:
# NOTE: This code is commented out since we already have the CSV files
# Uncomment to re-collect data from API if needed

"""
# Fetch league hierarchy to get all teams
print("Fetching league hierarchy...")
url = f"{BASE_URL}/league/hierarchy.json?api_key={API_KEY}"
hierarchy = api_call(url, "../raw_json/hierarchy.json")

teams_data = []
for conf in hierarchy['conferences']:
    for div in conf['divisions']:
        for team in div['teams']:
            teams_data.append({
                'team_id': team['id'],
                'team_name': team.get('market', '') + ' ' + team['name'],
                'alias': team['alias']
            })

df_teams = pd.DataFrame(teams_data)
print(f"\nFound {len(df_teams)} teams\n")

all_players = []
all_rosters = []
all_stats = []

for idx, row in df_teams.iterrows():
    team_id = row['team_id']
    team_name = row['team_name']
    
    print(f"[{idx+1}/{len(df_teams)}] {team_name}")
    
    url = f"{BASE_URL}/teams/{team_id}/profile.json?api_key={API_KEY}"
    team_data = api_call(url, f"../raw_json/team_{team_id}.json")
    
    if 'players' not in team_data:
        continue
    
    for player in team_data['players']:
        player_id = player.get('id')
        player_name = player.get('full_name', '')
        
        all_players.append({
            'player_id': player_id,
            'full_name': player_name,
            'position': player.get('position', ''),
            'height': player.get('height', None),
            'weight': player.get('weight', None),
            'birthdate': player.get('birthdate', ''),
            'college': player.get('college', '')
        })
        
        all_rosters.append({
            'team_id': team_id,
            'team_name': team_name,
            'player_id': player_id,
            'player_name': player_name,
            'position': player.get('position', '')
        })
        
        if 'seasons' in player:
            for season in player['seasons']:
                if 'teams' in season:
                    for team in season['teams']:
                        if 'total' in team:
                            stats = team['total']
                            all_stats.append({
                                'player_id': player_id,
                                'player_name': player_name,
                                'team_id': team_id,
                                'team_name': team_name,
                                'position': player.get('position', ''),
                                'games_played': stats.get('games_played', 0),
                                'minutes': stats.get('minutes', 0),
                                'points': stats.get('points', 0),
                                'rebounds': stats.get('rebounds', 0),
                                'assists': stats.get('assists', 0),
                                'steals': stats.get('steals', 0),
                                'blocks': stats.get('blocks', 0),
                                'turnovers': stats.get('turnovers', 0),
                                'field_goals_made': stats.get('field_goals_made', 0),
                                'field_goals_att': stats.get('field_goals_att', 0),
                                'field_goals_pct': stats.get('field_goals_pct', 0),
                                'three_points_made': stats.get('three_points_made', 0),
                                'three_points_att': stats.get('three_points_att', 0),
                                'three_points_pct': stats.get('three_points_pct', 0),
                                'free_throws_made': stats.get('free_throws_made', 0),
                                'free_throws_att': stats.get('free_throws_att', 0),
                                'free_throws_pct': stats.get('free_throws_pct', 0)
                            })

print(f"\nCollected {len(all_stats)} stat records")
print(f"Collected {len(all_players)} players")
print(f"Collected {len(all_rosters)} roster entries\n")

df_stats = pd.DataFrame(all_stats)
df_players = pd.DataFrame(all_players).drop_duplicates('player_id')
df_rosters = pd.DataFrame(all_rosters)

if len(df_stats) > 0:
    df_stats['points_per_game'] = (df_stats['points'] / df_stats['games_played'].replace(0, 1)).round(2)
    df_stats['rebounds_per_game'] = (df_stats['rebounds'] / df_stats['games_played'].replace(0, 1)).round(2)
    df_stats['assists_per_game'] = (df_stats['assists'] / df_stats['games_played'].replace(0, 1)).round(2)

df_stats.to_csv('../raw/gleague_player_stats.csv', index=False)
df_players.to_csv('../raw/gleague_players.csv', index=False)
df_rosters.to_csv('../raw/gleague_rosters.csv', index=False)
df_teams.to_csv('../raw/gleague_teams.csv', index=False)

print("Saved:")
print(f"  - gleague_player_stats.csv: {len(df_stats)} records")
print(f"  - gleague_players.csv: {len(df_players)} records")
print(f"  - gleague_rosters.csv: {len(df_rosters)} records")
print(f"  - gleague_teams.csv: {len(df_teams)} records")
"""

print("API collection code defined (commented out - using existing CSV files)")


---

## Section 7: G-League Player Stats (Loading from CSV Files)

This section loads G-League player statistics from CSV files organized by season. These files contain comprehensive player statistics for each season.

**Source Files**: 
- `../raw/gleague_player_season_stats_2019_REG.csv`
- `../raw/gleague_player_season_stats_2021_REG.csv`
- `../raw/gleague_player_season_stats_2022_REG.csv`
- `../raw/gleague_player_season_stats_2023_REG.csv`
- `../raw/gleague_player_season_stats_2024_REG.csv`

**Note**: These files are created from the API collection process above, or may be manually downloaded from Sportradar.
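
The file names listed above follow a regular pattern, so the season year and season type can be read from the name itself. This is a sketch under the assumption that every file matches `gleague_player_season_stats_<year>_<TYPE>.csv`; `parse_stats_filename` is a hypothetical helper.

```python
import re
from pathlib import Path

SEASON_FILE = re.compile(r"gleague_player_season_stats_(\d{4})_([A-Z]+)\.csv$")

def parse_stats_filename(path):
    """Return (season_year, season_type) from a stats file name, or None if it does not match."""
    m = SEASON_FILE.search(Path(path).name)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)

print(parse_stats_filename("../raw/gleague_player_season_stats_2019_REG.csv"))
print(parse_stats_filename("../raw/gleague_players.csv"))
```

Returning `None` for non-matching names lets the caller skip unrelated files instead of guessing a season.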


In [None]:
# Load all player stats from all season files
def load_all_player_stats():
    """Load all player stats from all season files and combine them."""
    # Sort so the load order (and therefore which duplicate is kept below) is deterministic
    stats_files = sorted(glob.glob("../raw/gleague_player_season_stats_*.csv"))
    
    if not stats_files:
        raise FileNotFoundError("No player stats files found in ../raw/")
    
    print(f"Found {len(stats_files)} stats files")
    
    all_stats = []
    for file in stats_files:
        print(f"Loading {Path(file).name}...")
        df = pd.read_csv(file)
        
        # Rename full_name to player_name if needed
        if 'full_name' in df.columns and 'player_name' not in df.columns:
            df = df.rename(columns={'full_name': 'player_name'})
        
        # Clean player names
        df['player_name'] = df['player_name'].apply(clean_name)
        
        # Extract season year from season_id
        if 'season_id' in df.columns:
            df['season_year'] = df['season_id'].apply(extract_season_year)
        elif 'season' in df.columns:
            df['season_year'] = df['season'].apply(extract_season_year)
        else:
            # Fall back to the four-digit year embedded in the file name
            match = re.search(r"(?<!\d)(20\d{2})(?!\d)", Path(file).name)
            if match is None:
                print(f"Warning: Could not determine season for {file}")
                continue
            df['season_year'] = int(match.group(1))
        
        all_stats.append(df)
    
    # Combine all stats
    combined_stats = pd.concat(all_stats, ignore_index=True)
    
    # Remove duplicates (same player, same season)
    combined_stats = combined_stats.drop_duplicates(
        subset=['player_name', 'season_year'], 
        keep='first'
    )
    
    print(f"\nTotal player-season records: {len(combined_stats)}")
    print(f"Unique players: {combined_stats['player_name'].nunique()}")
    print(f"Seasons: {sorted(combined_stats['season_year'].dropna().unique())}")
    
    return combined_stats

# Load the data
df_player_stats = load_all_player_stats()
df_player_stats.head()


---

## Section 8: Creating Prediction Dataset

This section merges all data sources to create the final prediction dataset with a `called_up` binary target variable.

**Process**:
1. Load all player stats (from Section 7)
2. Load all callup data (10-day, two-way, conversions from Sections 2-5)
3. Merge callup data with player stats based on player name and season
4. Create `called_up` column: 1 if player was called up in that season, 0 otherwise
5. Add callup details (date, NBA team, contract type) for called-up players

**Output**: `prediction_dataset.csv` - Ready for machine learning modeling
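
Steps 3 to 5 can also be expressed as a single left merge. The notebook's own implementation builds a set of (player, season) keys and then merges the details separately; the sketch below is an alternative that uses pandas' `indicator` and `validate` options instead. All rows are invented for illustration.

```python
import pandas as pd

# Toy data, invented for illustration
stats = pd.DataFrame({
    "player_name": ["A Player", "A Player", "B Player"],
    "season_year": [2023, 2024, 2024],
    "points": [310, 420, 150],
})
callups = pd.DataFrame({
    "player_name": ["A Player"],
    "season_year": [2024],
    "date": ["2025-01-15"],
    "nba_team": ["Utah Jazz"],
    "contract_type": ["10-day contract"],
})

merged = stats.merge(
    callups.rename(columns={"date": "callup_date", "nba_team": "callup_nba_team",
                            "contract_type": "callup_contract_type"}),
    on=["player_name", "season_year"],
    how="left",
    validate="many_to_one",  # raises if callups has duplicate (player, season) keys
    indicator=True,          # adds a _merge column: "both" or "left_only"
)
merged["called_up"] = (merged["_merge"] == "both").astype(int)
merged = merged.drop(columns="_merge")
print(merged[["player_name", "season_year", "called_up", "callup_nba_team"]])
```

The `validate` argument guards against a left merge silently multiplying stat rows when the call-up table contains repeated keys.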


In [None]:
# Load and combine all callup data
def load_all_callups():
    """Load and combine all callup data sources."""
    callup_files = {
        '10day': 'callups_10day_tidy.csv',
        'two_way': 'two_way_contracts.csv',
        'conversions': 'two_way_conversions.csv'
    }
    
    all_callups = []
    
    for callup_type, filename in callup_files.items():
        filepath = Path(filename)
        if filepath.exists():
            print(f"Loading {filename}...")
            df = pd.read_csv(filepath)
            
            # Clean player names
            if 'player_name' in df.columns:
                df['player_name'] = df['player_name'].apply(clean_name)
            
            # Extract season year
            if 'season' in df.columns:
                df['season_year'] = df['season'].apply(extract_season_year)
            else:
                print(f"Warning: No season column in {filename}")
                continue
            
            # Add callup type if not present
            if 'contract_type' not in df.columns:
                df['contract_type'] = callup_type
            
            all_callups.append(df)
        else:
            print(f"Warning: {filename} not found")
    
    if not all_callups:
        print("No callup data found!")
        return pd.DataFrame()
    
    # Combine all callups
    combined_callups = pd.concat(all_callups, ignore_index=True)
    
    # Remove duplicates (same player, same season)
    combined_callups = combined_callups.drop_duplicates(
        subset=['player_name', 'season_year'],
        keep='first'
    )
    
    print(f"\nTotal callup records: {len(combined_callups)}")
    print(f"Unique players called up: {combined_callups['player_name'].nunique()}")
    
    return combined_callups

# Load all callup data
df_all_callups = load_all_callups()
df_all_callups.head()


In [None]:
# Create the prediction dataset
def create_prediction_dataset():
    """Create the final prediction dataset with called_up column."""
    
    # Load all player stats (already loaded above)
    player_stats = df_player_stats.copy()
    
    # Load all callups (already loaded above)
    callups = df_all_callups.copy()
    
    # Create a callup indicator
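    if len(callups) > 0:
        # Drop call-ups without a usable key; astype(int) below would fail on NaN
        callups = callups.dropna(subset=['player_name', 'season_year'])
        
        # Create a set of (player_name, season_year) tuples for called up players
        callup_keys = set(
            zip(
                callups['player_name'].astype(str),
                callups['season_year'].astype(int)
            )
        )
        
        # Add called_up column to player_stats (rows with a missing season_year get 0)
        player_stats['called_up'] = player_stats.apply(
            lambda row: int(
                pd.notna(row['season_year'])
                and (str(row['player_name']), int(row['season_year'])) in callup_keys
            ),
            axis=1
        )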
        
        # Merge callup details (date, nba_team, contract_type) for called up players
        callup_details = callups[['player_name', 'season_year', 'date', 'nba_team', 'contract_type']].copy()
        callup_details = callup_details.rename(columns={
            'date': 'callup_date',
            'nba_team': 'callup_nba_team',
            'contract_type': 'callup_contract_type'
        })
        
        # Left merge to add callup details
        player_stats = player_stats.merge(
            callup_details,
            on=['player_name', 'season_year'],
            how='left'
        )
    else:
        # No callup data available
        player_stats['called_up'] = 0
        player_stats['callup_date'] = None
        player_stats['callup_nba_team'] = None
        player_stats['callup_contract_type'] = None
    
    # Summary statistics
    print("=" * 60)
    print("Dataset Summary")
    print("=" * 60)
    print(f"Total records: {len(player_stats)}")
    print(f"Players called up (called_up=1): {player_stats['called_up'].sum()}")
    print(f"Players not called up (called_up=0): {(player_stats['called_up'] == 0).sum()}")
    print(f"Call-up rate: {player_stats['called_up'].mean():.2%}")
    
    # Check by season
    if 'season_year' in player_stats.columns:
        print("\nCall-up rate by season:")
        season_summary = player_stats.groupby('season_year').agg({
            'called_up': ['sum', 'count', 'mean']
        }).round(3)
        season_summary.columns = ['Called_Up', 'Total_Players', 'Call_Up_Rate']
        print(season_summary)
    
    return player_stats

# Create the prediction dataset
# NOTE: This code is commented out to avoid overwriting existing files
# Uncomment to regenerate the prediction dataset
"""
df_prediction = create_prediction_dataset()

# Save the dataset
output_file = 'prediction_dataset.csv'
df_prediction.to_csv(output_file, index=False)
print(f"Saved prediction dataset to: {output_file}")
print(f"   Shape: {df_prediction.shape}")
print(f"   Columns: {len(df_prediction.columns)}")
"""

# Load existing prediction dataset
df_prediction = pd.read_csv("prediction_dataset.csv")
print(f"Loaded prediction dataset: {len(df_prediction):,} records")
print(f"   Called up: {df_prediction['called_up'].sum():,} ({df_prediction['called_up'].mean():.2%})")
print(f"   Not called up: {(df_prediction['called_up'] == 0).sum():,}")
df_prediction.head()


---

## Summary

This notebook contains all data collection and processing code for the Next Up project:

1. ✅ **10-Day Callup Data** - Web scraping from HoopsRumors
2. ✅ **Cleaning 10-Day Callup Data** - Processing and deduplication
3. ✅ **Two-Way Contracts** - Loading manually collected data
4. ✅ **Two-Way Conversions** - Loading manually collected data
5. ✅ **G-League Player Stats (API)** - Collecting from Sportradar API
6. ✅ **G-League Player Stats (CSV)** - Loading from season CSV files
7. ✅ **Prediction Dataset** - Merging all sources with `called_up` target variable

**Final Output**: `prediction_dataset.csv` with 2,437 player-season records ready for modeling.

**Next Steps**:
- Explore the data in `eda.ipynb`
- Build prediction models in `analysis.ipynb`


In [5]:
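# NOTE: ten_day_df, two_way_df, and conversions_df are not defined earlier in this notebook;
# they presumably hold the 10-day, two-way, and conversion tables loaded in Sections 3-5.
contracts_df = pd.concat([ten_day_df, two_way_df, conversions_df], ignore_index=True)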
contracts_df.head()


Unnamed: 0,date,season,player_name,nba_team,contract_type
0,2007-01-06,2006-07,Dijon Thompson,Atlanta Hawks,Two 10-day contracts (only 1st date is shown)
1,2007-01-27,2006-07,Jeremy Richardson,Atlanta Hawks,Two 10-day contracts (only 1st date is shown)
2,2007-04-04,2006-07,Kevinn Pinkney,Boston Celtics,10-day contract
3,2007-04-02,2006-07,Kevin Willis,Dallas Mavericks,10-day contract followed by signing for the re...
4,2007-01-17,2006-07,Renaldo Major,Golden State Warriors,10-day contract


In [6]:
# Joins on player_name only, so a player's stats are attached regardless of season
final_df = contracts_df.merge(sportsradar_df, on="player_name", how="left")

In [7]:
final_df.to_csv("combined_gleague_contracts.csv", index=False)

In [8]:
df = pd.read_csv("combined_gleague_contracts.csv")
sorted_df = df.sort_values("season", ascending=False)
sorted_df.to_csv("combined_gleague_contracts.csv", index=False)
sorted_df

Unnamed: 0,date,season,player_name,nba_team,contract_type,season_id,season_type,team_id,player_id,position,...,rebounds,steals,blocks,turnovers,fgm,fga,tpm,tpa,ftm,fta
865,2025-07-01,2025-26,Kobe Sanders,los angeles clippers,two-way signing,,,,,,...,,,,,,,,,,
848,2025-07-01,2025-26,Moussa Cisse,dallas mavericks,two-way signing,,,,,,...,,,,,,,,,,
850,2025-07-01,2025-26,Spencer Jones,denver nuggets,two-way signing,2024.0,REG,bbeebcb5-0d9b-4992-bcba-cbde3ec60628,27403427-4b94-4336-8b1a-6c6035b7be1c,F,...,87.0,18.0,17.0,15.0,72.0,131.0,36.0,72.0,6.0,8.0
851,2025-07-01,2025-26,Curtis Jones,denver nuggets,two-way signing,,,,,,...,,,,,,,,,,
852,2025-07-01,2025-26,Tolu Smith,detroit pistons,two-way signing,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16,2007-01-05,2006-07,Andre Brown,Seattle SuperSonics,Two 10-day contracts followed by signing for t...,,,,,,...,,,,,,,,,,
17,2007-03-25,2006-07,Luke Jackson,Toronto Raptors,Two 10-day contracts followed by signing for t...,,,,,,...,,,,,,,,,,
18,2007-02-05,2006-07,Louis Amundson,Utah Jazz,Two 10-day contracts (only 1st date is shown),,,,,,...,,,,,,,,,,
19,2007-02-28,2006-07,Mike Hall,Washington Wizards,Two 10-day contracts followed by signing for t...,,,,,,...,,,,,,,,,,


In [12]:
test_df = pd.read_csv("combined_gleague_contracts.csv")
# test_df[test_df['player_name']  == "Cooper Flagg"]
test_df.columns

Index(['date', 'season', 'player_name', 'nba_team', 'contract_type',
       'season_id', 'season_type', 'team_id', 'player_id', 'position',
       'total_games_played', 'total_games_started', 'total_minutes',
       'total_field_goals_made', 'total_field_goals_att',
       'total_field_goals_pct', 'total_two_points_made',
       'total_two_points_att', 'total_two_points_pct',
       'total_three_points_made', 'total_three_points_att',
       'total_three_points_pct', 'total_blocked_att', 'total_free_throws_made',
       'total_free_throws_att', 'total_free_throws_pct',
       'total_offensive_rebounds', 'total_defensive_rebounds',
       'total_rebounds', 'total_assists', 'total_turnovers',
       'total_assists_turnover_ratio', 'total_steals', 'total_blocks',
       'total_personal_fouls', 'total_tech_fouls', 'total_points',
       'total_flagrant_fouls', 'total_ejections', 'total_foulouts',
       'total_tech_fouls_non_unsportsmanlike', 'total_true_shooting_att',
       'total_true_s

In [18]:
final_df

Unnamed: 0,date,season,player_name,nba_team,contract_type,season_id,season_type,team_id,player_id,position,...,rebounds,steals,blocks,turnovers,fgm,fga,tpm,tpa,ftm,fta
0,2007-01-06,2006-07,Dijon Thompson,Atlanta Hawks,Two 10-day contracts (only 1st date is shown),,,,,,...,,,,,,,,,,
1,2007-01-27,2006-07,Jeremy Richardson,Atlanta Hawks,Two 10-day contracts (only 1st date is shown),,,,,,...,,,,,,,,,,
2,2007-04-04,2006-07,Kevinn Pinkney,Boston Celtics,10-day contract,,,,,,...,,,,,,,,,,
3,2007-04-02,2006-07,Kevin Willis,Dallas Mavericks,10-day contract followed by signing for the re...,,,,,,...,,,,,,,,,,
4,2007-01-17,2006-07,Renaldo Major,Golden State Warriors,10-day contract,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
925,2025-07-01,2025-26,John Tonje,utah jazz,two-way signing,,,,,,...,,,,,,,,,,
926,2025-07-01,2025-26,Jamir Watkins,washington wizards,two-way signing,,,,,,...,,,,,,,,,,
927,2025-07-01,2025-26,Tristan Vukcevic,washington wizards,two-way signing,2024.0,REG,cefdc308-b8ea-441e-a698-6ef3891cec35,ee12f7c9-a000-4b89-8a29-e2821b115d20,F,...,49.0,4.0,14.0,20.0,45.0,94.0,18.0,43.0,9.0,13.0
928,2025-07-01,2025-26,Sharife Cooper,washington wizards,two-way signing,,,,,,...,,,,,,,,,,


In [17]:
print(f'SportsRadar num entries: {sportsradar_df.shape[0]}')
print(f'Combined data num entries: {test_df.shape[0]}')

SportsRadar num entries: 632
Combined data num entries: 930


In [25]:
final_df['season_type'].value_counts()

season_type
REG    201
Name: count, dtype: int64

In [None]:
final_df.isnull().sum().sort_values(ascending=False)

total_efficiency        729
avg_rebounds            729
avg_field_goals_made    729
avg_blocked_att         729
avg_flagrant_fouls      729
                       ... 
nba_team                  0
player_name               0
season                    0
contract_type             0
date                      0
Length: 87, dtype: int64
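
The counts printed above appear to fit together: 201 rows carry a `season_type` and 729 rows are null in the stats columns, which sums to the 930 combined rows. Assuming those three numbers describe the same table (the cells were run in different orders, so this is not guaranteed), the share of contract rows that found a stats match is:

```python
total_rows = 930     # "Combined data num entries" printed above
matched_rows = 201   # rows with season_type == "REG"
null_rows = 729      # null count in the stats columns

# Sanity check that matched and unmatched rows account for every row
assert matched_rows + null_rows == total_rows
match_rate = matched_rows / total_rows
print(f"Match rate: {match_rate:.1%}")
```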