# Data Collection and Processing Notebook

This notebook contains all code for collecting, cleaning, and merging data for the Next Up project.

## Overview

This notebook is organized into the following sections:
1. **Setup & Helper Functions** - Import libraries and define utility functions
2. **NBA.com Call-Up Data** - Web scraping official G League call-up tables
3. **Call-Up Aggregation** - Season CSVs + player-level call-up summaries
4. **Two-Way Contracts Data** - Loading manually collected two-way contract data
5. **Two-Way Conversions Data** - Loading manually collected conversion data
6. **G-League Player Stats (API)** - Collecting player statistics from the Sportradar API
7. **G-League Player Stats (CSV)** - Loading player stats from CSV files by season
8. **Creating Prediction Dataset** - Merging all data sources with `called_up` target variable

---

## Section 1: Setup & Helper Functions

Import necessary libraries and define helper functions used throughout the notebook.

In [None]:
# Import all necessary libraries
import pandas as pd
import numpy as np
import os
import re
import time
import requests
import json
import glob
from pathlib import Path
from bs4 import BeautifulSoup
from dateutil import parser
from unidecode import unidecode
from dotenv import load_dotenv

# Load environment variables for API keys
load_dotenv()

print("✅ Libraries imported successfully")

✅ Libraries imported successfully


In [3]:
# Helper Functions

def clean_name(name):
    """Clean player names to ensure consistent matching across datasets.
    Handles 'Last, First' format and normalizes spacing."""
    if pd.isna(name):
        return None
    name = str(name).strip()
    # Handle "Last, First" format
    if "," in name:
        last, first = name.split(",", 1)
        return f"{first.strip()} {last.strip()}"
    return name

def to_date(x):
    """Convert date string to datetime object. Handles various date formats."""
    if pd.isna(x):
        return pd.NaT
    try:
        # Handles sources that use mm/dd/yy style date strings
        return parser.parse(str(x), dayfirst=False, yearfirst=False)
    except Exception:
        return pd.NaT

def nba_season(dt):
    """Map a calendar date to NBA season label like '2024-25'.
    Season 'YYYY-YY' starts Aug 1 of YYYY and ends Jul 31 of YYYY+1."""
    if pd.isna(dt):
        return None
    y = dt.year
    if dt.month >= 8:  # Aug..Dec -> season starts this year
        return f"{y}-{(y+1)%100:02d}"
    else:  # Jan..Jul -> season started previous year
        return f"{y-1}-{y%100:02d}"

def canon_team(t):
    """Canonicalize team names to standard format."""
    if pd.isna(t):
        return None
    t = str(t).strip()  # tolerate non-string inputs
    TEAM_CANON = {
        "LA Clippers": "Los Angeles Clippers",
        "L.A. Clippers": "Los Angeles Clippers",
        "LA Lakers": "Los Angeles Lakers",
        "L.A. Lakers": "Los Angeles Lakers",
    }
    return TEAM_CANON.get(t, t)

def extract_season_year(season_str):
    """Extract season year from season string like '2024-25' -> 2024."""
    if pd.isna(season_str):
        return None
    season_str = str(season_str).strip()
    # Handle formats like "2024-25", "2024", "2024.0"
    try:
        if "-" in season_str:
            return int(season_str.split("-")[0])
        return int(float(season_str))
    except (ValueError, OverflowError):
        # Unparseable labels map to None instead of raising
        return None

print("✅ Helper functions defined")

✅ Helper functions defined
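
As a quick sanity check of the season-labelling rule in `nba_season` (Aug 1 cutoff), the sketch below restates the rule with only the standard library so it runs on its own. The name `season_label` is hypothetical and is not used elsewhere in the notebook.

```python
from datetime import date


def season_label(d: date) -> str:
    """Label a date with its season, e.g. 2024-08-01 -> '2024-25' (Aug 1 cutoff)."""
    # Aug..Dec belong to the season starting this year; Jan..Jul to the previous one.
    start = d.year if d.month >= 8 else d.year - 1
    return f"{start}-{(start + 1) % 100:02d}"


# Boundary dates around the cutoff, plus a century rollover.
for d in (date(2024, 7, 31), date(2024, 8, 1), date(1999, 12, 1)):
    print(d.isoformat(), "->", season_label(d))
# 2024-07-31 -> 2023-24
# 2024-08-01 -> 2024-25
# 1999-12-01 -> 1999-00
```

The boundary cases matter because call-up dates in July would otherwise be attributed to the wrong season.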


---

## Section 2: NBA.com Call-Up Data (Web Scraping)

This section scrapes official NBA G League call-up tables from NBA.com for multiple seasons.

**Sources**:
- 2019-20: `https://gleague.nba.com/nba-call-ups-for-the-2019-20-season`
- 2020-21: `https://gleague.nba.com/nba-call-ups-for-the-2020-21-season`
- 2022-23: `https://gleague.nba.com/nba-call-ups-from-the-2022-23-season`
- 2023-24: `https://gleague.nba.com/nba-call-ups-2023-24`
- 2024-25: `https://gleague.nba.com/nba-call-ups-2024-25`

**Outputs**:
- One CSV per season (e.g. `callups_nba_2019_20.csv`)
- A combined, player-level dataset with:
  - `callup_dates` as a list of all call-up dates per player-season
  - `times_called_up` as the number of call-ups per player-season
  - `contract_type` aggregated across call-ups for that season


In [4]:
# Configuration for NBA.com call-up scraping

NBA_CALLUP_SOURCES = [
    {
        "season_label": "2019-20",
        "season_year": 2019,
        "url": "https://gleague.nba.com/nba-call-ups-for-the-2019-20-season",
        "csv_path": "callups_nba_2019_20.csv",
    },
    {
        "season_label": "2020-21",
        "season_year": 2021,
        "url": "https://gleague.nba.com/nba-call-ups-for-the-2020-21-season",
        "csv_path": "callups_nba_2020_21.csv",
    },
    {
        "season_label": "2022-23",
        "season_year": 2022,
        "url": "https://gleague.nba.com/nba-call-ups-from-the-2022-23-season",
        "csv_path": "callups_nba_2022_23.csv",
    },
    {
        "season_label": "2023-24",
        "season_year": 2023,
        "url": "https://gleague.nba.com/nba-call-ups-2023-24",
        "csv_path": "callups_nba_2023_24.csv",
    },
    {
        "season_label": "2024-25",
        "season_year": 2024,
        "url": "https://gleague.nba.com/nba-call-ups-2024-25",
        "csv_path": "callups_nba_2024_25.csv",
    },
]

print("✅ Configuration set up for NBA.com call-up scraping")


✅ Configuration set up for NBA.com call-up scraping


In [5]:
# Scraping function for NBA.com call-up tables

def scrape_nba_callups(url: str) -> pd.DataFrame:
    """Scrape a single NBA.com G League call-ups table into a tidy DataFrame."""

    def canon(text: str) -> str:
        if text is None:
            return ""
        text = unidecode(str(text)).replace("\xa0", " ")
        text = re.sub(r"\s+", " ", text).strip().upper()
        text = re.sub(r"[^A-Z0-9]+", "_", text)
        return text.strip("_")

    def extract_rows(table):
        rows = []
        for tr in table.find_all("tr"):
            cells = tr.find_all(["td", "th"])
            if not cells:
                continue
            row = [unidecode(td.get_text(separator=" ", strip=True)).strip() for td in cells]
            if any(row):
                rows.append(row)
        return rows

    col_aliases = {
        "NAME": "player_name",
        "PLAYER": "player_name",
        "NBA_G_LEAGUE_TEAM": "gleague_team",
        "G_LEAGUE_TEAM": "gleague_team",
        "NBA_TEAM": "nba_team",
        "DATE": "callup_date",
        "CALL_UP_DATE": "callup_date",
        "TYPE": "contract_type",
        "CONTRACT_TYPE": "contract_type",
    }
    required = {"player_name", "gleague_team", "nba_team", "callup_date", "contract_type"}
    expected_header = {"NAME", "NBA_G_LEAGUE_TEAM", "NBA_TEAM", "DATE", "TYPE"}

    response = requests.get(
        url,
        timeout=30,
        headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
            "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36",
        },
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    tables = soup.find_all("table")
    if not tables:
        raise RuntimeError(f"No tables found at {url}")

    for table in tables:
        rows = extract_rows(table)
        if not rows:
            continue

        header_idx = 0
        for idx, row in enumerate(rows[:3]):
            if expected_header.issubset({canon(val) for val in row}):
                header_idx = idx
                break

        header = rows[header_idx]
        width = len(header)
        data_rows = rows[header_idx + 1 :]
        if not data_rows:
            continue

        normalized_rows = []
        for row in data_rows:
            trimmed = row[:width]
            if len(trimmed) < width:
                trimmed = trimmed + [""] * (width - len(trimmed))
            normalized_rows.append(trimmed[:width])

        df = pd.DataFrame(normalized_rows, columns=header)

        rename_dict = {}
        for col in df.columns:
            key = canon(col)
            if key in col_aliases:
                rename_dict[col] = col_aliases[key]
        df = df.rename(columns=rename_dict)

        missing = required - set(df.columns)
        if missing:
            continue

        df = df.copy()
        df["player_name"] = df["player_name"].apply(clean_name)
        for col in ["gleague_team", "nba_team", "contract_type"]:
            df[col] = (
                df[col]
                .astype(str)
                .str.replace("\xa0", " ", regex=False)
                .str.strip()
                .replace({"": None, "nan": None, "None": None})
            )
        df["callup_date"] = pd.to_datetime(df["callup_date"], errors="coerce")
        df = df.dropna(subset=["player_name", "callup_date"])

        return df.reset_index(drop=True)

    raise RuntimeError(f"Missing expected columns at {url}")

print("✅ NBA.com call-up scraping function defined")


✅ NBA.com call-up scraping function defined
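
The header matching inside `scrape_nba_callups` can be illustrated in isolation. This is a minimal sketch, assuming a standard-library stand-in for `unidecode` (`unicodedata` NFKD folding); `canon_header`, `map_headers`, and the sample header row (including the unmatched `Notes` column) are hypothetical.

```python
import re
import unicodedata

# Same alias table as in scrape_nba_callups.
COL_ALIASES = {
    "NAME": "player_name",
    "PLAYER": "player_name",
    "NBA_G_LEAGUE_TEAM": "gleague_team",
    "G_LEAGUE_TEAM": "gleague_team",
    "NBA_TEAM": "nba_team",
    "DATE": "callup_date",
    "CALL_UP_DATE": "callup_date",
    "TYPE": "contract_type",
    "CONTRACT_TYPE": "contract_type",
}


def canon_header(text):
    """Fold a header cell to an upper-case, underscore-separated ASCII key."""
    if text is None:
        return ""
    # NFKD turns non-breaking spaces into plain spaces and splits accents off letters.
    text = unicodedata.normalize("NFKD", str(text))
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"\s+", " ", text).strip().upper()
    text = re.sub(r"[^A-Z0-9]+", "_", text)
    return text.strip("_")


def map_headers(raw_headers):
    """Return {raw header: canonical column} for headers with a known alias."""
    mapping = {}
    for header in raw_headers:
        key = canon_header(header)
        if key in COL_ALIASES:
            mapping[header] = COL_ALIASES[key]
    return mapping


raw = ["Name", "NBA G League Team", "NBA\xa0Team", "Call-Up Date", "Type", "Notes"]
print(map_headers(raw))
```

Headers without an alias (here `Notes`) are left out of the mapping, which is why the scraper checks the `required` set after renaming.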


---

## Section 3: Build Season-Level and Combined Call-Up Datasets

This section:
1. Scrapes each NBA.com call-up page
2. Saves one CSV per season
3. Builds a combined player-season dataset with aggregated call-up information.
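
The aggregation step can be sketched on toy rows without pandas. The rows, player names, and the `aggregate_callups` helper below are illustrative assumptions, not the notebook's data or implementation; the notebook itself uses a pandas `groupby`.

```python
from collections import defaultdict

# Toy rows shaped like the scraped table; values are illustrative, not real data.
rows = [
    {"player_name": "Player A", "season_label": "2022-23", "nba_team": "Team Y",
     "callup_date": "2022-12-26", "contract_type": "Two-Way"},
    {"player_name": "Player A", "season_label": "2022-23", "nba_team": "Team X",
     "callup_date": "2022-11-15", "contract_type": "Two-Way"},
    {"player_name": "Player B", "season_label": "2022-23", "nba_team": "Team X",
     "callup_date": "2022-11-29", "contract_type": "Standard"},
]


def aggregate_callups(rows):
    """Roll row-level call-ups up to one entry per (player, season)."""
    grouped = defaultdict(
        lambda: {"nba_teams": [], "callup_dates": [], "contract_type": []}
    )
    # ISO dates sort chronologically as strings, so events stay in date order.
    for row in sorted(rows, key=lambda r: r["callup_date"]):
        entry = grouped[(row["player_name"], row["season_label"])]
        if row["nba_team"] not in entry["nba_teams"]:  # deduplicate, keep first-seen order
            entry["nba_teams"].append(row["nba_team"])
        entry["callup_dates"].append(row["callup_date"])
        entry["contract_type"].append(row["contract_type"])
    for entry in grouped.values():
        entry["times_called_up"] = len(entry["callup_dates"])
    return dict(grouped)


summary = aggregate_callups(rows)
print(summary[("Player A", "2022-23")])
```

Teams are deduplicated while dates and contract types keep one element per call-up, so `times_called_up` always equals the length of `callup_dates`.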


In [6]:
all_callups = []

for src in NBA_CALLUP_SOURCES:
    print("=" * 80)
    print(f"Scraping call-ups for season {src['season_label']} from {src['url']}")
    df_season = scrape_nba_callups(src['url'])

    df_season['season_label'] = src['season_label']
    df_season['season_year'] = src['season_year']

    # Save per-season CSV
    csv_path = src['csv_path']
    df_season.to_csv(csv_path, index=False)
    print(f"Saved {len(df_season)} rows to {csv_path}")

    all_callups.append(df_season)

if not all_callups:
    raise RuntimeError("No call-up data scraped")


def ordered_unique(values):
    """Keep the first occurrence order while dropping nulls/duplicates."""
    seen = []
    for val in values:
        if pd.isna(val) or val in seen:
            continue
        seen.append(val)
    return seen


def format_dates(series):
    dates = []
    for d in series:
        if pd.notna(d):
            dates.append(d.strftime('%m-%d-%Y'))
    return dates


# Combine all seasons into a single row-level dataset
callups_all = pd.concat(all_callups, ignore_index=True)
callups_all = callups_all.sort_values(
    ['player_name', 'season_year', 'callup_date'],
    na_position='last'
).reset_index(drop=True)
print(f"\nTotal combined call-up records: {len(callups_all):,}")

raw_csv = 'callups_nba_2019_2025_all.csv'
callups_all.to_csv(raw_csv, index=False)
print(f"Saved merged raw dataset to {raw_csv}")

# Aggregate to player-season level with deduplicated teams and ordered events
agg = (
    callups_all
    .groupby(['player_name', 'season_label', 'season_year'], as_index=False)
    .agg({
        'gleague_team': ordered_unique,
        'nba_team': ordered_unique,
        'callup_date': format_dates,
        'contract_type': lambda s: [ct for ct in s if pd.notna(ct)],
    })
)

agg = agg.rename(columns={
    'gleague_team': 'gleague_teams',
    'nba_team': 'nba_teams',
    'callup_date': 'callup_dates',
})

agg['times_called_up'] = agg['callup_dates'].apply(len)

# Save combined aggregated dataset
combined_csv = 'callups_nba_2019_2025_aggregated.csv'
agg.to_csv(combined_csv, index=False)

print(f"\nSaved aggregated call-up dataset to {combined_csv}  ({len(agg):,} player-seasons)")
agg.head()


Scraping call-ups for season 2019-20 from https://gleague.nba.com/nba-call-ups-for-the-2019-20-season
Saved 34 rows to callups_nba_2019_20.csv
Scraping call-ups for season 2020-21 from https://gleague.nba.com/nba-call-ups-for-the-2020-21-season
Saved 12 rows to callups_nba_2020_21.csv
Scraping call-ups for season 2022-23 from https://gleague.nba.com/nba-call-ups-from-the-2022-23-season
Saved 35 rows to callups_nba_2022_23.csv
Scraping call-ups for season 2023-24 from https://gleague.nba.com/nba-call-ups-2023-24
Saved 54 rows to callups_nba_2023_24.csv
Scraping call-ups for season 2024-25 from https://gleague.nba.com/nba-call-ups-2024-25
Saved 51 rows to callups_nba_2024_25.csv

Total combined call-up records: 186
Saved merged raw dataset to callups_nba_2019_2025_all.csv

Saved aggregated call-up dataset to callups_nba_2019_2025_aggregated.csv  (166 player-seasons)


| | player_name | season_label | season_year | gleague_teams | nba_teams | callup_dates | contract_type | times_called_up |
|---|---|---|---|---|---|---|---|---|
| 0 | A.J. Lawson | 2022-23 | 2022 | [College Park Skyhawks] | [Minnesota Timberwolves, Dallas Mavericks] | [11-15-2022, 12-26-2022] | [Two-Way, Two-Way] | 2 |
| 1 | AJ Lawson | 2024-25 | 2024 | [Long Island Nets] | [Raptors 905] | [12-11-2024] | [Two-Way] | 1 |
| 2 | Adam Flagler | 2023-24 | 2023 | [Oklahoma City Blue] | [Oklahoma City Thunder] | [02-11-2024] | [Two-Way] | 1 |
| 3 | Alex Reese | 2024-25 | 2024 | [Rip City Remix] | [Philadelphia 76ers] | [02-21-2025] | [Two-Way] | 1 |
| 4 | Alize Johnson | 2022-23 | 2022 | [Austin Spurs] | [San Antonio Spurs] | [11-29-2022] | [Standard] | 1 |


### Call-Up Outputs

- `callups_nba_2019_2025_all.csv`: Row-level call-up log with every event from the five NBA.com pages.
- `callups_nba_2019_2025_aggregated.csv`: Player-season summary with one row per player and season. Repeat call-ups are collected into ordered `callup_dates` and `contract_type` lists, alongside a `times_called_up` count and deduplicated team lists.
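
The preview of the aggregated data lists both `A.J. Lawson` and `AJ Lawson`, which look like spelling variants of one player. Since `clean_name` only reorders and trims names, variants like these would not match in a join on `player_name`. One possible remedy is a punctuation-insensitive join key. This is a sketch under that assumption; `name_key` is a hypothetical helper, not part of the notebook, and `José Example` is a made-up name.

```python
import re
import unicodedata


def name_key(name):
    """Build a case-, accent-, and punctuation-insensitive key for name joins."""
    if name is None:
        return None
    # Split accents off letters, then drop anything that is not ASCII.
    text = unicodedata.normalize("NFKD", str(name))
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[.'`]", "", text)           # "A.J." -> "AJ", "O'Neal" -> "ONeal"
    text = re.sub(r"[^A-Za-z0-9]+", " ", text)  # remaining punctuation becomes a space
    return " ".join(text.lower().split())       # collapse whitespace


print(name_key("A.J. Lawson"))    # aj lawson
print(name_key("AJ Lawson"))      # aj lawson
print(name_key(" José  Example"))  # jose example
```

A key like this would be stored in a separate column so the original display names stay intact. It can also merge genuinely different players who share a normalized name, so collisions should be checked before relying on it.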



---

## Section 6: G-League Player Stats (API Collection)

This section collects G-League player statistics from the Sportradar API. The API provides comprehensive player data including:
- Player profiles (name, position, height, weight, etc.)
- Season statistics (points, rebounds, assists, shooting percentages, etc.)
- Team rosters

**Source**: Sportradar NBDL API  
**API Endpoint**: `https://api.sportradar.us/nbdl/trial/v8/en`

**Note**: This code requires an API key stored in a `.env` file. The code includes caching to avoid redundant API calls.


In [None]:
# API Configuration
API_KEY = os.getenv("API_KEY")
if not API_KEY:
    print("Warning: API_KEY not found in .env file")
    print("   Set up API key following instructions in API_SETUP_INSTRUCTIONS.md")
else:
    print("API key loaded")

BASE_URL = "https://api.sportradar.us/nbdl/trial/v8/en"

# Create output directories
os.makedirs("../raw", exist_ok=True)
os.makedirs("../raw_json", exist_ok=True)
os.makedirs("external", exist_ok=True)


In [7]:
# API call function with caching and retry logic
def api_call(url, cache_file=None, max_retries=5):
    """Make API call with caching and exponential backoff for rate limiting."""
    if cache_file and os.path.exists(cache_file):
        print("  [cached]")
        with open(cache_file, 'r') as f:
            return json.load(f)
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            data = response.json()
            
            if cache_file:
                with open(cache_file, 'w') as f:
                    json.dump(data, f)
            
            time.sleep(2)  # Rate limiting delay
            return data
            
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:
                wait_time = (2 ** attempt) * 5  # Exponential backoff: 5, 10, 20, 40, 80 seconds
                print(f"  Rate limited. Waiting {wait_time}s...")
                time.sleep(wait_time)
            else:
                raise
        except Exception as e:
            if attempt < max_retries - 1:
                print(f"  Error: {e}. Retrying...")
                time.sleep(2)
            else:
                raise
    
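    # Only reached when every attempt was rate limited (HTTP 429)
    raise RuntimeError(f"Request still rate limited after {max_retries} attempts")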

print("API call function defined")


API call function defined
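
The retry schedule used by `api_call` on HTTP 429 responses can be tabulated directly. This is a small sketch; `backoff_delays` is a hypothetical helper that mirrors the `(2 ** attempt) * 5` expression in the function.

```python
def backoff_delays(max_retries=5, base_seconds=5):
    """Wait applied after each rate-limited attempt: base_seconds * 2**attempt."""
    return [base_seconds * (2 ** attempt) for attempt in range(max_retries)]


delays = backoff_delays()
print(delays)       # [5, 10, 20, 40, 80]
print(sum(delays))  # 155
```

With the defaults, a request that is rate limited on every attempt spends 155 seconds waiting before the function gives up, which is worth keeping in mind when looping over every team.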


In [None]:
# NOTE: This code is commented out since we already have the CSV files
# Uncomment to re-collect data from API if needed

"""
# Fetch league hierarchy to get all teams
print("Fetching league hierarchy...")
url = f"{BASE_URL}/league/hierarchy.json?api_key={API_KEY}"
hierarchy = api_call(url, "../raw_json/hierarchy.json")

teams_data = []
for conf in hierarchy['conferences']:
    for div in conf['divisions']:
        for team in div['teams']:
            teams_data.append({
                'team_id': team['id'],
                'team_name': team.get('market', '') + ' ' + team['name'],
                'alias': team['alias']
            })

df_teams = pd.DataFrame(teams_data)
print(f"\\nFound {len(df_teams)} teams\\n")

all_players = []
all_rosters = []
all_stats = []

for idx, row in df_teams.iterrows():
    team_id = row['team_id']
    team_name = row['team_name']
    
    print(f"[{idx+1}/{len(df_teams)}] {team_name}")
    
    url = f"{BASE_URL}/teams/{team_id}/profile.json?api_key={API_KEY}"
    team_data = api_call(url, f"../raw_json/team_{team_id}.json")
    
    if 'players' not in team_data:
        continue
    
    for player in team_data['players']:
        player_id = player.get('id')
        player_name = player.get('full_name', '')
        
        all_players.append({
            'player_id': player_id,
            'full_name': player_name,
            'position': player.get('position', ''),
            'height': player.get('height', None),
            'weight': player.get('weight', None),
            'birthdate': player.get('birthdate', ''),
            'college': player.get('college', '')
        })
        
        all_rosters.append({
            'team_id': team_id,
            'team_name': team_name,
            'player_id': player_id,
            'player_name': player_name,
            'position': player.get('position', '')
        })
        
        if 'seasons' in player:
            for season in player['seasons']:
                if 'teams' in season:
                    for team in season['teams']:
                        if 'total' in team:
                            stats = team['total']
                            all_stats.append({
                                'player_id': player_id,
                                'player_name': player_name,
                                'team_id': team_id,
                                'team_name': team_name,
                                'position': player.get('position', ''),
                                'games_played': stats.get('games_played', 0),
                                'minutes': stats.get('minutes', 0),
                                'points': stats.get('points', 0),
                                'rebounds': stats.get('rebounds', 0),
                                'assists': stats.get('assists', 0),
                                'steals': stats.get('steals', 0),
                                'blocks': stats.get('blocks', 0),
                                'turnovers': stats.get('turnovers', 0),
                                'field_goals_made': stats.get('field_goals_made', 0),
                                'field_goals_att': stats.get('field_goals_att', 0),
                                'field_goals_pct': stats.get('field_goals_pct', 0),
                                'three_points_made': stats.get('three_points_made', 0),
                                'three_points_att': stats.get('three_points_att', 0),
                                'three_points_pct': stats.get('three_points_pct', 0),
                                'free_throws_made': stats.get('free_throws_made', 0),
                                'free_throws_att': stats.get('free_throws_att', 0),
                                'free_throws_pct': stats.get('free_throws_pct', 0)
                            })

print(f"\\nCollected {len(all_stats)} stat records")
print(f"Collected {len(all_players)} players")
print(f"Collected {len(all_rosters)} roster entries\\n")

df_stats = pd.DataFrame(all_stats)
df_players = pd.DataFrame(all_players).drop_duplicates('player_id')
df_rosters = pd.DataFrame(all_rosters)

if len(df_stats) > 0:
    df_stats['points_per_game'] = (df_stats['points'] / df_stats['games_played'].replace(0, 1)).round(2)
    df_stats['rebounds_per_game'] = (df_stats['rebounds'] / df_stats['games_played'].replace(0, 1)).round(2)
    df_stats['assists_per_game'] = (df_stats['assists'] / df_stats['games_played'].replace(0, 1)).round(2)

df_stats.to_csv('../raw/gleague_player_stats.csv', index=False)
df_players.to_csv('../raw/gleague_players.csv', index=False)
df_rosters.to_csv('../raw/gleague_rosters.csv', index=False)
df_teams.to_csv('../raw/gleague_teams.csv', index=False)

print("Saved:")
print(f"  - gleague_player_stats.csv: {len(df_stats)} records")
print(f"  - gleague_players.csv: {len(df_players)} records")
print(f"  - gleague_rosters.csv: {len(df_rosters)} records")
print(f"  - gleague_teams.csv: {len(df_teams)} records")
"""

print("API collection code defined (commented out - using existing CSV files)")
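
The nested walk in the disabled collection code (players, then seasons, then teams, then `total`) can be tried on a toy payload. This is a sketch with a hypothetical payload that models only the keys the loop reads; it is not a real API response, and `flatten_stats` is an illustrative name.

```python
# Hypothetical payload: only the keys read by the collection loop are modelled.
team_profile = {
    "id": "team-1",
    "players": [
        {
            "id": "p1",
            "full_name": "Player A",
            "seasons": [
                {"teams": [{"total": {"games_played": 10, "points": 150}}]},
                {"teams": [{"total": {"games_played": 0, "points": 0}}]},
            ],
        },
        {"id": "p2", "full_name": "Player B"},  # no 'seasons' key: contributes no rows
    ],
}


def flatten_stats(profile):
    """Flatten players -> seasons -> teams -> total into one row per stat line."""
    rows = []
    for player in profile.get("players", []):
        for season in player.get("seasons", []):
            for team in season.get("teams", []):
                total = team.get("total")
                if total is None:
                    continue
                games = total.get("games_played", 0)
                points = total.get("points", 0)
                rows.append({
                    "player_id": player.get("id"),
                    "player_name": player.get("full_name", ""),
                    "games_played": games,
                    "points": points,
                    # Divide by 1 when no games were played, as the notebook does.
                    "points_per_game": round(points / (games or 1), 2),
                })
    return rows


rows = flatten_stats(team_profile)
for row in rows:
    print(row)
```

Using `.get(..., [])` at each level keeps the walk from failing on players without season data, which is the same effect as the `if 'seasons' in player` guards.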


---

## Section 7: G-League Player Stats (Loading from CSV Files)

This section loads G-League player statistics from CSV files organized by season. These files contain comprehensive player statistics for each season.

**Source Files**: 
- `../raw/gleague_player_season_stats_2019_REG.csv`
- `../raw/gleague_player_season_stats_2021_REG.csv`
- `../raw/gleague_player_season_stats_2022_REG.csv`
- `../raw/gleague_player_season_stats_2023_REG.csv`
- `../raw/gleague_player_season_stats_2024_REG.csv`

**Note**: These files are created from the API collection process above, or may be manually downloaded from Sportradar.
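
When a file has no season column, the year can be read from a file name that follows the pattern listed above. This is a minimal sketch, assuming that `..._stats_<YYYY>_REG.csv` naming; `season_from_filename` is a hypothetical helper.

```python
import re
from pathlib import Path


def season_from_filename(path):
    """Return the 4-digit year embedded in '..._stats_<YYYY>_...', else None."""
    match = re.search(r"_stats_(\d{4})_", Path(path).name)
    return int(match.group(1)) if match else None


print(season_from_filename("../raw/gleague_player_season_stats_2021_REG.csv"))  # 2021
print(season_from_filename("../raw/gleague_player_stats.csv"))                  # None
```

A regex avoids a hard-coded chain of year checks, so files for new seasons are handled without code changes.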


In [None]:
# Load all player stats from all season files
def load_all_player_stats():
    """Load all player stats from all season files and combine them."""
    stats_files = glob.glob("../raw/gleague_player_season_stats_*.csv")
    
    if not stats_files:
        raise FileNotFoundError("No player stats files found in ../raw/")
    
    print(f"Found {len(stats_files)} stats files")
    
    all_stats = []
    for file in stats_files:
        print(f"Loading {Path(file).name}...")
        df = pd.read_csv(file)
        
        # Rename full_name to player_name if needed
        if 'full_name' in df.columns and 'player_name' not in df.columns:
            df = df.rename(columns={'full_name': 'player_name'})
        
        # Clean player names
        df['player_name'] = df['player_name'].apply(clean_name)
        
        # Extract season year from season_id
        if 'season_id' in df.columns:
            df['season_year'] = df['season_id'].apply(extract_season_year)
        elif 'season' in df.columns:
            df['season_year'] = df['season'].apply(extract_season_year)
        else:
            # Try to infer from filename
            filename = Path(file).name
            if '2024' in filename:
                df['season_year'] = 2024
            elif '2023' in filename:
                df['season_year'] = 2023
            elif '2022' in filename:
                df['season_year'] = 2022
            elif '2021' in filename:
                df['season_year'] = 2021
            elif '2019' in filename:
                df['season_year'] = 2019
            else:
                print(f"Warning: Could not determine season for {file}")
                continue
        
        all_stats.append(df)
    
    # Combine all stats
    combined_stats = pd.concat(all_stats, ignore_index=True)
    
    # Remove duplicates (same player, same season)
    combined_stats = combined_stats.drop_duplicates(
        subset=['player_name', 'season_year'], 
        keep='first'
    )
    
    print(f"\nTotal player-season records: {len(combined_stats)}")
    print(f"Unique players: {combined_stats['player_name'].nunique()}")
    print(f"Seasons: {sorted(combined_stats['season_year'].dropna().unique())}")
    
    return combined_stats

# Load the data
df_player_stats = load_all_player_stats()
df_player_stats.head()


---

## Section 8: Creating Prediction Dataset

This section merges all data sources to create the final prediction dataset with a `called_up` binary target variable.

**Process**:
1. Load all player stats (from Section 7)
2. Load the aggregated NBA.com call-up data (from Section 3)
3. Merge callup data with player stats based on player name and season
4. Create `called_up` column: 1 if player was called up in that season, 0 otherwise
5. Add callup details (date, NBA team, contract type) for called-up players

**Output**: `prediction_dataset_callups_nba.csv` - Ready for machine learning modeling
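
The labelling logic in steps 3 and 4 amounts to a left join followed by a flag. This is a plain-Python sketch on made-up rows; `label_called_up` and the toy players are hypothetical, and the notebook performs the same step with `DataFrame.merge(..., how='left')`.

```python
# Toy inputs: three stat rows, one of which has a matching call-up record.
stats = [
    {"player_name": "Player A", "season_year": 2022, "points": 150},
    {"player_name": "Player A", "season_year": 2023, "points": 90},
    {"player_name": "Player B", "season_year": 2022, "points": 40},
]
callups = {
    ("Player A", 2022): {"times_called_up": 2, "nba_teams": ["Team X", "Team Y"]},
}


def label_called_up(stats, callups):
    """Left-join call-ups onto stats by (player_name, season_year) and add called_up."""
    labelled = []
    for row in stats:
        match = callups.get((row["player_name"], row["season_year"]), {})
        times = match.get("times_called_up", 0)  # no match means zero call-ups
        labelled.append({
            **row,
            "times_called_up": times,
            "nba_teams": match.get("nba_teams", []),
            "called_up": int(times > 0),
        })
    return labelled


dataset = label_called_up(stats, callups)
print([(r["player_name"], r["season_year"], r["called_up"]) for r in dataset])
# [('Player A', 2022, 1), ('Player A', 2023, 0), ('Player B', 2022, 0)]
```

Every stat row survives the join, so players who were never called up remain in the dataset as negative examples.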


In [None]:
# Load aggregated NBA.com callup data
import ast  # used by parse_list_cell; not imported in the setup cell


def parse_list_cell(value):
    """Convert stringified list cells back into Python lists."""
    if isinstance(value, list):
        return value
    if pd.isna(value):
        return []
    text = str(value).strip()
    if not text:
        return []
    try:
        parsed = ast.literal_eval(text)
        if isinstance(parsed, list):
            return parsed
    except (ValueError, SyntaxError):
        pass
    return [text]


def load_nba_callups():
    """Load aggregated NBA.com call-ups with parsed list columns."""
    agg_path = Path('callups_nba_2019_2025_aggregated.csv')
    if not agg_path.exists():
        raise FileNotFoundError(f"Missing aggregated call-up file: {agg_path}")

    df = pd.read_csv(agg_path)
    df['player_name'] = df['player_name'].apply(clean_name)
    df['season_year'] = df['season_year'].astype(int)

    list_cols = ['gleague_teams', 'nba_teams', 'callup_dates', 'contract_type']
    for col in list_cols:
        df[col] = df[col].apply(parse_list_cell)

    df['times_called_up'] = df['times_called_up'].fillna(0).astype(int)
    print(f"Loaded {len(df)} player-season call-up rows from {agg_path.name}")
    return df

# Load aggregated callup data
df_callups_nba = load_nba_callups()
df_callups_nba.head()
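
List columns need parsing because a CSV file stores each list as its string representation. The round trip below is a standard-library sketch of that behaviour; the row is made up for illustration.

```python
import ast
import csv
import io

# Write one row whose second field is a Python list rendered as text.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["player_name", "callup_dates"])
writer.writerow(["Player A", str(["11-15-2022", "12-26-2022"])])

# Read it back: the field comes back as a plain string, not a list.
buffer.seek(0)
row = next(csv.DictReader(buffer))
raw = row["callup_dates"]
print(type(raw).__name__, raw)  # str ['11-15-2022', '12-26-2022']

# ast.literal_eval safely turns the text back into a list.
parsed = ast.literal_eval(raw)
print(type(parsed).__name__, len(parsed))  # list 2
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for data read from disk.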

In [None]:
# Create the prediction dataset
def create_prediction_dataset():
    """Create the final prediction dataset with called_up column."""

    player_stats = df_player_stats.copy()
    player_stats['player_name'] = player_stats['player_name'].apply(clean_name)
    player_stats['season_year'] = player_stats['season_year'].astype(int)

    callups = df_callups_nba.copy()

    merged = player_stats.merge(
        callups,
        on=['player_name', 'season_year'],
        how='left'
    )

    merged['times_called_up'] = merged['times_called_up'].fillna(0).astype(int)
    merged['called_up'] = (merged['times_called_up'] > 0).astype(int)

    output_file = 'prediction_dataset_callups_nba.csv'
    merged.to_csv(output_file, index=False)

    print("=" * 60)
    print("Prediction Dataset Summary")
    print("=" * 60)
    print(f"Total records: {len(merged):,}")
    print(f"Players called up: {merged['called_up'].sum():,}")
    print(f"Players not called up: {(merged['called_up'] == 0).sum():,}")
    print(f"Call-up rate: {merged['called_up'].mean():.2%}")

    season_summary = merged.groupby('season_year')['called_up'].agg(['sum', 'count'])
    season_summary['call_up_rate'] = (season_summary['sum'] / season_summary['count']).round(3)
    print("\nCall-up rate by season:")
    print(season_summary)

    print(f"\nSaved prediction dataset to {output_file}")
    return merged

# Build dataset and preview
df_prediction = create_prediction_dataset()
df_prediction.head()

---

## Summary

This notebook contains all data collection and processing code for the Next Up project:

1. ✅ **NBA.com Call-Up Data** - Web scraping official G League call-up tables
2. ✅ **Call-Up Aggregation** - Per-season CSVs plus player-level summaries
3. ✅ **Two-Way Contracts** - Loading manually collected data
4. ✅ **Two-Way Conversions** - Loading manually collected data
5. ✅ **G-League Player Stats (API)** - Collecting from the Sportradar API
6. ✅ **G-League Player Stats (CSV)** - Loading from season CSV files
7. ✅ **Prediction Dataset** - Merging all sources with `called_up` target variable

**Final Output**: `prediction_dataset_callups_nba.csv` with 2,437 player-season records ready for modeling.

**Next Steps**:
- Explore the data in `eda.ipynb`
- Build prediction models in `analysis.ipynb`
