nba-game-scraping-V5 used the boxscoretraditionalv3 endpoint. This notebook tries out the boxscoreadvancedv3 and boxscoreplayertrackv3 endpoints.

# Prep work

## Extracting unique game_ids

Cobbled together from Trevor's nba-api-connection-2.ipynb

Uses the nba-api package to get a complete list of game_ids from 2004 to 2024 (so run this in the capstone2 environment)


This version won't have season ids, but since the plan is to merge on `GAME_ID`, that shouldn't matter

Import libraries and set up logging

In [1]:
from nba_api.stats.endpoints import LeagueGameFinder, BoxScoreTraditionalV2
import pandas as pd
import time
from datetime import datetime
from tqdm.auto import tqdm
import logging
import random

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.FileHandler('nba_data_collection.log'),
             logging.StreamHandler()]
)
logger = logging.getLogger(__name__)

def log_message(message):
    """Log message to both file and tqdm"""
    tqdm.write(message)
    logger.info(message)

Function to get all games for a single season 

In [2]:
def get_season_games(season):
    """Get games for a single season with error handling"""
    game_finder = LeagueGameFinder(
        season_nullable=season,
        league_id_nullable='00',
        timeout=60
    )
    
    # Get the raw response first
    response_frames = game_finder.get_data_frames()
    
    # Debug logging
    log_message(f"Response for season {season}: got {len(response_frames)} DataFrames")
    if not response_frames:
        raise ValueError(f"Empty response for season {season}")
    
    games = response_frames[0]
    if len(games) == 0:
        raise ValueError(f"No games found for season {season}")
        
    log_message(f"Retrieved {len(games)} game entries for season {season}")
    return games

Function that takes a `start_year` and `end_year` and compiles all the game_ids for games that happened between them

In [3]:
def compile_game_ids(start_year, end_year):
    game_ids = []
    seasons = [f"{year}-{str(year + 1)[-2:]}" for year in range(start_year, end_year)]

    for season in seasons:
        game_ids.extend(get_season_games(season)['GAME_ID'].unique())
        time.sleep(1)
    return game_ids
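
The season strings built inside `compile_game_ids` are easy to get off by one, so here is a minimal sketch of just that step. `season_labels` is a hypothetical helper name (not part of the notebook or of nba-api); it only repeats the list comprehension above so the range can be checked without hitting the API.

```python
from typing import List


def season_labels(start_year: int, end_year: int) -> List[str]:
    """Build season strings such as '2004-05' for start_year <= year < end_year.

    Hypothetical helper: mirrors the comprehension in compile_game_ids.
    """
    return [f"{year}-{str(year + 1)[-2:]}" for year in range(start_year, end_year)]


labels = season_labels(2004, 2024)
print(labels[0], labels[-1], len(labels))  # 2004-05 2023-24 20
```

Note that `end_year` is exclusive: `compile_game_ids(2004, 2024)` stops at the 2023-24 season, which matches the log output below.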

Establish `game_ids` list

In [23]:
game_ids = compile_game_ids(2004, 2024)
f"Game ids found from 2004 - 2024: {len(game_ids)}"

2025-02-16 14:44:30,996 - INFO - Response for season 2004-05: got 1 DataFrames
2025-02-16 14:44:30,997 - INFO - Retrieved 2728 game entries for season 2004-05


Response for season 2004-05: got 1 DataFrames
Retrieved 2728 game entries for season 2004-05


2025-02-16 14:44:32,340 - INFO - Response for season 2005-06: got 1 DataFrames
2025-02-16 14:44:32,343 - INFO - Retrieved 2871 game entries for season 2005-06


Response for season 2005-06: got 1 DataFrames
Retrieved 2871 game entries for season 2005-06


2025-02-16 14:44:33,674 - INFO - Response for season 2006-07: got 1 DataFrames
2025-02-16 14:44:33,674 - INFO - Retrieved 2867 game entries for season 2006-07


Response for season 2006-07: got 1 DataFrames
Retrieved 2867 game entries for season 2006-07


2025-02-16 14:44:34,932 - INFO - Response for season 2007-08: got 1 DataFrames
2025-02-16 14:44:34,932 - INFO - Retrieved 2852 game entries for season 2007-08


Response for season 2007-08: got 1 DataFrames
Retrieved 2852 game entries for season 2007-08


2025-02-16 14:44:36,171 - INFO - Response for season 2008-09: got 1 DataFrames
2025-02-16 14:44:36,171 - INFO - Retrieved 2866 game entries for season 2008-09


Response for season 2008-09: got 1 DataFrames
Retrieved 2866 game entries for season 2008-09


2025-02-16 14:44:37,327 - INFO - Response for season 2009-10: got 1 DataFrames
2025-02-16 14:44:37,328 - INFO - Retrieved 2871 game entries for season 2009-10


Response for season 2009-10: got 1 DataFrames
Retrieved 2871 game entries for season 2009-10


2025-02-16 14:44:38,486 - INFO - Response for season 2010-11: got 1 DataFrames
2025-02-16 14:44:38,488 - INFO - Retrieved 2866 game entries for season 2010-11


Response for season 2010-11: got 1 DataFrames
Retrieved 2866 game entries for season 2010-11


2025-02-16 14:44:39,682 - INFO - Response for season 2011-12: got 1 DataFrames
2025-02-16 14:44:39,683 - INFO - Retrieved 2214 game entries for season 2011-12


Response for season 2011-12: got 1 DataFrames
Retrieved 2214 game entries for season 2011-12


2025-02-16 14:44:40,951 - INFO - Response for season 2012-13: got 1 DataFrames
2025-02-16 14:44:40,953 - INFO - Retrieved 2866 game entries for season 2012-13


Response for season 2012-13: got 1 DataFrames
Retrieved 2866 game entries for season 2012-13


2025-02-16 14:44:42,245 - INFO - Response for season 2013-14: got 1 DataFrames
2025-02-16 14:44:42,246 - INFO - Retrieved 2874 game entries for season 2013-14


Response for season 2013-14: got 1 DataFrames
Retrieved 2874 game entries for season 2013-14


2025-02-16 14:44:43,529 - INFO - Response for season 2014-15: got 1 DataFrames
2025-02-16 14:44:43,530 - INFO - Retrieved 2864 game entries for season 2014-15


Response for season 2014-15: got 1 DataFrames
Retrieved 2864 game entries for season 2014-15


2025-02-16 14:44:44,678 - INFO - Response for season 2015-16: got 1 DataFrames
2025-02-16 14:44:44,682 - INFO - Retrieved 2856 game entries for season 2015-16


Response for season 2015-16: got 1 DataFrames
Retrieved 2856 game entries for season 2015-16


2025-02-16 14:44:45,875 - INFO - Response for season 2016-17: got 1 DataFrames
2025-02-16 14:44:45,876 - INFO - Retrieved 2829 game entries for season 2016-17


Response for season 2016-17: got 1 DataFrames
Retrieved 2829 game entries for season 2016-17


2025-02-16 14:44:47,108 - INFO - Response for season 2017-18: got 1 DataFrames
2025-02-16 14:44:47,110 - INFO - Retrieved 2785 game entries for season 2017-18


Response for season 2017-18: got 1 DataFrames
Retrieved 2785 game entries for season 2017-18


2025-02-16 14:44:48,264 - INFO - Response for season 2018-19: got 1 DataFrames
2025-02-16 14:44:48,265 - INFO - Retrieved 2788 game entries for season 2018-19


Response for season 2018-19: got 1 DataFrames
Retrieved 2788 game entries for season 2018-19


2025-02-16 14:44:49,449 - INFO - Response for season 2019-20: got 1 DataFrames
2025-02-16 14:44:49,450 - INFO - Retrieved 2516 game entries for season 2019-20


Response for season 2019-20: got 1 DataFrames
Retrieved 2516 game entries for season 2019-20


2025-02-16 14:44:50,626 - INFO - Response for season 2020-21: got 1 DataFrames
2025-02-16 14:44:50,627 - INFO - Retrieved 2442 game entries for season 2020-21


Response for season 2020-21: got 1 DataFrames
Retrieved 2442 game entries for season 2020-21


2025-02-16 14:44:51,781 - INFO - Response for season 2021-22: got 1 DataFrames
2025-02-16 14:44:51,781 - INFO - Retrieved 2788 game entries for season 2021-22


Response for season 2021-22: got 1 DataFrames
Retrieved 2788 game entries for season 2021-22


2025-02-16 14:44:52,933 - INFO - Response for season 2022-23: got 1 DataFrames
2025-02-16 14:44:52,934 - INFO - Retrieved 2790 game entries for season 2022-23


Response for season 2022-23: got 1 DataFrames
Retrieved 2790 game entries for season 2022-23


2025-02-16 14:44:54,086 - INFO - Response for season 2023-24: got 1 DataFrames
2025-02-16 14:44:54,087 - INFO - Retrieved 2795 game entries for season 2023-24


Response for season 2023-24: got 1 DataFrames
Retrieved 2795 game entries for season 2023-24


'Game ids found from 2004 - 2024: 27660'
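
The per-season "game entries" counts in the log are much larger than the final number of unique ids. As far as I can tell, this is because `LeagueGameFinder` returns one row per team per game, so each game shows up twice and `.unique()` collapses the pair. A toy sketch of that effect (the ids and team codes below are made up for illustration, not pulled from the API):

```python
import pandas as pd

# Toy stand-in for one season's LeagueGameFinder frame:
# each game appears once for each of the two teams.
toy = pd.DataFrame({
    "GAME_ID": ["0020400001", "0020400001", "0020400002", "0020400002"],
    "TEAM_ABBREVIATION": ["AAA", "BBB", "CCC", "DDD"],
})

unique_ids = toy["GAME_ID"].unique()
print(len(toy), len(unique_ids))  # 4 2
```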

## Imports and Helper functions for continual scraping

Import the necessary libraries

In [4]:
import requests
import pickle
import time
from pathlib import Path
from typing import Dict, Optional, List
import logging
from tqdm import tqdm
from requests.exceptions import Timeout, RequestException
import random

Functions to load the pkl cache from disk and save it back

In [5]:
def load_or_create_cache(cache_path: str) -> Dict[str, dict]:
    """Load existing cache or create new one if it doesn't exist."""
    if Path(cache_path).exists():
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    return {}

In [6]:
def save_cache(cache: Dict[str, dict], cache_path: str) -> None:
    """Save the cache to disk."""
    with open(cache_path, 'wb') as f:
        pickle.dump(cache, f)
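
One risk with `save_cache` as written: if the kernel is interrupted while `pickle.dump` is running, the cache file can be left half-written. A possible safeguard, sketched here under the assumption that the temp file and the cache live on the same filesystem, is to write to a temporary file and then swap it into place. `save_cache_atomic` is a hypothetical name, not something the notebook currently uses.

```python
import os
import pickle
import tempfile
from typing import Dict


def save_cache_atomic(cache: Dict[str, dict], cache_path: str) -> None:
    """Write the pickle to a temp file, then swap it into place in one step."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(cache, f)
        # os.replace overwrites the destination; the swap is atomic when
        # source and destination are on the same filesystem.
        os.replace(tmp_path, cache_path)
    except BaseException:
        # Remove the partial temp file, then re-raise (covers Ctrl-C too).
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this variant the previous cache file stays intact until the new one is fully written.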

# Scraping BoxScoreAdvancedV3

Function that uses BoxScoreAdvancedV3 to get info about a *single* game 

In [9]:
def create_row_boxscoreadvanced(game_id: str, timeout: int = 30, max_retries: int = 3) -> Optional[dict]:
    """
    Fetch game stats from NBA API with timeout handling and automatic retries.
    
    Args:
        game_id: NBA game identifier
        timeout: Request timeout in seconds
        max_retries: Maximum number of retry attempts for timeout errors
    """
    headers = {
        "Host": "stats.nba.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "x-nba-stats-origin": "stats",
        "x-nba-stats-token": "true",
        "Connection": "keep-alive",
        "Referer": "https://stats.nba.com/",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
    }
    
    #url = f'https://stats.nba.com/stats/boxscoretraditionalv3?EndPeriod=0&EndRange=0&GameID={'00'+str(game_id)}&RangeType=0&StartPeriod=0&StartRange=0'
    url = f'https://stats.nba.com/stats/boxscoreadvancedv3?EndPeriod=0&EndRange=0&GameID={game_id}&RangeType=0&StartPeriod=0&StartRange=0'
    #url = f'https://stats.nba.com/stats/boxscoretraditionalv3?EndPeriod=0&EndRange=0&GameID={game_id}&RangeType=0&StartPeriod=0&StartRange=0'
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.json()
            
        except Timeout:
            wait_time = (attempt + 1) * 5 + random.uniform(1, 3)  # Linear backoff with jitter
            logging.warning(f"Timeout for game {game_id} (attempt {attempt + 1}/{max_retries}). "
                          f"Waiting {wait_time:.1f} seconds before retry...")
            time.sleep(wait_time)
            
        except RequestException as e:
            logging.error(f"Error fetching game {game_id}: {str(e)}")
            return None
            
    logging.error(f"Max retries ({max_retries}) reached for game {game_id}")
    return None
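
The wait before each retry is `(attempt + 1) * 5` seconds plus 1 to 3 seconds of random jitter, so the deterministic part grows linearly (5s, 10s, 15s). For comparison, here is a small sketch of that schedule next to an exponential one. Both helper names are hypothetical, and the jitter is left out so the numbers are reproducible.

```python
def linear_wait(attempt: int, base: float = 5.0) -> float:
    """Deterministic part of the wait used in the retry loop: 5s, 10s, 15s, ..."""
    return (attempt + 1) * base


def exponential_wait(attempt: int, base: float = 5.0) -> float:
    """Exponential alternative: 5s, 10s, 20s, ..."""
    return base * (2 ** attempt)


for attempt in range(3):
    print(attempt, linear_wait(attempt), exponential_wait(attempt))
# 0 5.0 5.0
# 1 10.0 10.0
# 2 15.0 20.0
```

With only 3 retries the two schedules barely differ, so the linear version is probably fine here; the difference would matter more if `max_retries` were raised.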

Function that uses the one above to scrape info for all the game ids. It also incorporates logging, a tqdm progress bar, and a pkl cache (in case of a crash)

In [10]:
def scrape_game_stats_boxscoreadvanced(game_ids: List[str], cache_path: str = 'nba_games_cache.pkl', 
                     delay: float = 1.0, save_frequency: int = 10,
                     timeout: int = 30, max_retries: int = 3) -> Dict[str, dict]:
    """
    Scrape game stats with progress bar, timeout handling, and automatic retries.
    
    Args:
        game_ids: List of NBA game IDs to scrape
        cache_path: Path to save/load the pickle cache file
        delay: Time to wait between requests in seconds
        save_frequency: How often to save the cache (every N successful requests)
        timeout: Request timeout in seconds
        max_retries: Maximum number of retry attempts for timeout errors
    
    Returns:
        Dictionary mapping game IDs to their stats data
    """
    # Setup logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )
    
    # Load existing cache
    cache = load_or_create_cache(cache_path)
    logging.info(f"Loaded cache with {len(cache)} existing games")
    
    # Filter out already cached games
    games_to_scrape = [gid for gid in game_ids if gid not in cache]
    logging.info(f"Found {len(games_to_scrape)} new games to scrape")
    
    successful_requests = 0
    
    # Create progress bar
    pbar = tqdm(games_to_scrape, desc="Scraping games", unit="game")
    
    for game_id in pbar:
        try:
            # Update progress bar description with current game
            pbar.set_description(f"Scraping game {game_id}")
            
            # Fetch game data with timeout handling and retries
            game_data = create_row_boxscoreadvanced(game_id, timeout=timeout, max_retries=max_retries)
            
            if game_data is not None:
                cache[game_id] = game_data
                successful_requests += 1
                
                # Update progress bar postfix with success count
                pbar.set_postfix(
                    successful=successful_requests,
                    cached_total=len(cache)
                )
                
                # Save periodically
                if successful_requests % save_frequency == 0:
                    save_cache(cache, cache_path)
                    logging.info(f"Saved cache with {len(cache)} games")
            
            # Wait between requests
            time.sleep(delay)
            
        except Exception as e:
            logging.error(f"Unexpected error processing game {game_id}: {str(e)}")
            # Save cache on error to preserve progress
            save_cache(cache, cache_path)
            logging.info("Saved cache due to error")
            raise
    
    # Close progress bar
    pbar.close()
    
    # Final save
    save_cache(cache, cache_path)
    logging.info(f"Scraping completed. Final cache contains {len(cache)} games")
    
    return cache

Run the scraping with 0.5s request delay (use 1sec if running into excessive timeout errors/notable skips)

In [None]:
# Since I plan on running for ~20 min as a trial run, I want to randomize the game ids as a way to maximize the chance of running into bugs 
#random.shuffle(game_ids)

In [16]:
try:
    game_stats_boxscoreadvanced = scrape_game_stats_boxscoreadvanced(
        game_ids=game_ids,
        cache_path='nba_games_cache_boxscoreadvanced.pkl',
        delay=0.5,  # 0.5 second delay between requests
        save_frequency=10,  # Save every 10 successful requests
        timeout=30,  # 30 second timeout
        max_retries=3  # Retry up to 3 times on timeout
    )
except Exception as e:
    logging.error(f"Scraping stopped due to error: {str(e)}")

2025-02-16 14:14:37,446 - INFO - Loaded cache with 0 existing games
2025-02-16 14:14:37,450 - INFO - Found 27660 new games to scrape
Scraping game 0021800631:   0%|          | 9/27660 [00:16<12:13:44,  1.59s/game, cached_total=10, successful=10]2025-02-16 14:14:54,035 - INFO - Saved cache with 10 games
Scraping game 0022101044:   0%|          | 19/27660 [00:31<11:48:15,  1.54s/game, cached_total=20, successful=20]2025-02-16 14:15:09,314 - INFO - Saved cache with 20 games
Scraping game 0021300243:   0%|          | 29/27660 [00:45<10:43:29,  1.40s/game, cached_total=30, successful=30]2025-02-16 14:15:22,586 - INFO - Saved cache with 30 games
Scraping game 0021701214:   0%|          | 39/27660 [01:05<17:55:56,  2.34s/game, cached_total=40, successful=40]2025-02-16 14:15:42,568 - INFO - Saved cache with 40 games
Scraping game 0020500455:   0%|          | 49/27660 [01:23<14:20:19,  1.87s/game, cached_total=50, successful=50]2025-02-16 14:16:00,927 - INFO - Saved cache with 50 games
Scraping

KeyboardInterrupt: 

# Basic eval of trial run

In [None]:
import pandas as pd
import pickle
with open('nba_games_cache_boxscoreadvanced.pkl', 'rb') as f:
    cache = pickle.load(f)


all_games = []
num_skipped = 0
for game_id, game_data in cache.items():
    try:
        assert game_data is not None
        df = pd.json_normalize(game_data)
        all_games.append(df)
    except Exception:
        num_skipped = num_skipped + 1
        continue
print(f'Num skipped: {num_skipped}')
game_level_dataset = pd.concat(all_games, ignore_index=True)
print(f'shape of dataset: {game_level_dataset.shape}')
game_level_dataset.head()

Num skipped: 0
shape of dataset: (810, 68)


Unnamed: 0,meta.version,meta.request,meta.time,boxScoreAdvanced.gameId,boxScoreAdvanced.awayTeamId,boxScoreAdvanced.homeTeamId,boxScoreAdvanced.homeTeam.teamId,boxScoreAdvanced.homeTeam.teamCity,boxScoreAdvanced.homeTeam.teamName,boxScoreAdvanced.homeTeam.teamTricode,...,boxScoreAdvanced.awayTeam.statistics.trueShootingPercentage,boxScoreAdvanced.awayTeam.statistics.usagePercentage,boxScoreAdvanced.awayTeam.statistics.estimatedUsagePercentage,boxScoreAdvanced.awayTeam.statistics.estimatedPace,boxScoreAdvanced.awayTeam.statistics.pace,boxScoreAdvanced.awayTeam.statistics.pacePer40,boxScoreAdvanced.awayTeam.statistics.possessions,boxScoreAdvanced.awayTeam.statistics.PIE,boxScoreAdvanced.homeTeam.statistics,boxScoreAdvanced.awayTeam.statistics
0,1,http://nba.cloud/games/0021800263/boxscoreadva...,2025-02-16T15:14:38.1438Z,21800263,1610612758,1610612762,1610612762,Utah,Jazz,UTA,...,0.611,1.0,0.197,100.0,98.0,81.67,97.0,0.544,,
1,1,http://nba.cloud/games/0021800862/boxscoreadva...,2025-02-16T15:14:41.1441Z,21800862,1610612744,1610612757,1610612757,Portland,Trail Blazers,POR,...,0.51,1.0,0.198,96.94,94.5,78.75,94.0,0.406,,
2,1,http://nba.cloud/games/0022300751/boxscoreadva...,2025-02-16T15:14:43.1443Z,22300751,1610612763,1610612766,1610612766,Charlotte,Hornets,CHA,...,0.555,1.0,0.199,98.8,98.0,81.67,97.0,0.441,,
3,1,http://nba.cloud/games/0020900350/boxscoreadva...,2025-02-16T15:14:44.1444Z,20900350,1610612764,1610612746,1610612746,Los Angeles,Clippers,LAC,...,0.488,1.0,0.198,100.48,98.5,82.08,99.0,0.467,,
4,1,http://nba.cloud/games/0020900869/boxscoreadva...,2025-02-16T15:14:46.1446Z,20900869,1610612755,1610612747,1610612747,Los Angeles,Lakers,LAL,...,0.47,1.0,0.2,92.5,90.0,75.0,90.0,0.472,,


The 810 rows match the 810 successful requests made to boxscoreadvancedv3
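
To make the dotted column names above less mysterious, here is a minimal sketch of what `pd.json_normalize` does to one nested response. The payload is a made-up, heavily trimmed stand-in for a real boxscoreadvancedv3 response (the real one has many more keys), so treat the values as placeholders.

```python
import pandas as pd

# Made-up, trimmed stand-in for one cached API response.
toy_response = {
    "meta": {"version": 1},
    "boxScoreAdvanced": {
        "gameId": "0029900001",
        "homeTeam": {"teamTricode": "AAA", "statistics": {"pace": 100.0}},
    },
}

# Nested dicts are flattened into a single row; key paths are joined with "."
row = pd.json_normalize(toy_response)
print(row.shape)
print(sorted(row.columns))
```

Each cached game therefore becomes one wide row, which is why the row count of the concatenated frame equals the number of cached games.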

In [19]:
game_level_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 810 entries, 0 to 809
Data columns (total 68 columns):
 #   Column                                                                Non-Null Count  Dtype  
---  ------                                                                --------------  -----  
 0   meta.version                                                          810 non-null    int64  
 1   meta.request                                                          810 non-null    object 
 2   meta.time                                                             810 non-null    object 
 3   boxScoreAdvanced.gameId                                               810 non-null    object 
 4   boxScoreAdvanced.awayTeamId                                           810 non-null    int64  
 5   boxScoreAdvanced.homeTeamId                                           810 non-null    int64  
 6   boxScoreAdvanced.homeTeam.teamId                                      810 non-null    int64  
 7  

# Scraping BoxScorePlayerTrackV3

Copied code from Scraping BoxScoreAdvanced and modified accordingly

Function that uses BoxScorePlayerTrackV3 to get info about a *single* game 

In [21]:
def create_row_playertrack(game_id: str, timeout: int = 30, max_retries: int = 3) -> Optional[dict]:
    """
    Fetch game stats from NBA API with timeout handling and automatic retries.
    
    Args:
        game_id: NBA game identifier
        timeout: Request timeout in seconds
        max_retries: Maximum number of retry attempts for timeout errors
    """
    headers = {
        "Host": "stats.nba.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "x-nba-stats-origin": "stats",
        "x-nba-stats-token": "true",
        "Connection": "keep-alive",
        "Referer": "https://stats.nba.com/",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
    }
    
    #url = f'https://stats.nba.com/stats/boxscoretraditionalv3?EndPeriod=0&EndRange=0&GameID={'00'+str(game_id)}&RangeType=0&StartPeriod=0&StartRange=0'
    url = f'https://stats.nba.com/stats/boxscoreplayertrackv3?GameID={game_id}'
    #url = f'https://stats.nba.com/stats/boxscoreadvancedv3?EndPeriod=0&EndRange=0&GameID={game_id}&RangeType=0&StartPeriod=0&StartRange=0'
    #url = f'https://stats.nba.com/stats/boxscoretraditionalv3?EndPeriod=0&EndRange=0&GameID={game_id}&RangeType=0&StartPeriod=0&StartRange=0'
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.json()
            
        except Timeout:
            wait_time = (attempt + 1) * 5 + random.uniform(1, 3)  # Linear backoff with jitter
            logging.warning(f"Timeout for game {game_id} (attempt {attempt + 1}/{max_retries}). "
                          f"Waiting {wait_time:.1f} seconds before retry...")
            time.sleep(wait_time)
            
        except RequestException as e:
            logging.error(f"Error fetching game {game_id}: {str(e)}")
            return None
            
    logging.error(f"Max retries ({max_retries}) reached for game {game_id}")
    return None
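
Since this function differs from `create_row_boxscoreadvanced` only in its URL, one way to avoid the copy-paste would be a small URL builder shared by both. This is just a sketch: `build_boxscore_url` is a hypothetical helper, and it assumes the server does not care about query parameter order (the hand-written URLs above list `GameID` in a different position).

```python
from urllib.parse import urlencode

BASE_URL = "https://stats.nba.com/stats"


def build_boxscore_url(endpoint: str, game_id: str, **params) -> str:
    """Build a stats.nba.com URL for the given endpoint and game id.

    Extra keyword arguments are appended as additional query parameters.
    """
    query = {"GameID": game_id, **params}
    return f"{BASE_URL}/{endpoint}?{urlencode(query)}"


print(build_boxscore_url("boxscoreplayertrackv3", "0040400407"))
print(build_boxscore_url("boxscoreadvancedv3", "0040400407",
                         EndPeriod=0, EndRange=0, RangeType=0,
                         StartPeriod=0, StartRange=0))
```

A single fetch function could then take `endpoint` as an argument instead of having one copy per endpoint.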

Function that uses the one above to scrape info for all the game ids. It also incorporates logging, a tqdm progress bar, and a pkl cache (in case of a crash)

In [22]:
def scrape_game_stats_playertrack(game_ids: List[str], cache_path: str = 'nba_games_cache.pkl', 
                     delay: float = 1.0, save_frequency: int = 10,
                     timeout: int = 30, max_retries: int = 3) -> Dict[str, dict]:
    """
    Scrape game stats with progress bar, timeout handling, and automatic retries.
    
    Args:
        game_ids: List of NBA game IDs to scrape
        cache_path: Path to save/load the pickle cache file
        delay: Time to wait between requests in seconds
        save_frequency: How often to save the cache (every N successful requests)
        timeout: Request timeout in seconds
        max_retries: Maximum number of retry attempts for timeout errors
    
    Returns:
        Dictionary mapping game IDs to their stats data
    """
    # Setup logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )
    
    # Load existing cache
    cache = load_or_create_cache(cache_path)
    logging.info(f"Loaded cache with {len(cache)} existing games")
    
    # Filter out already cached games
    games_to_scrape = [gid for gid in game_ids if gid not in cache]
    logging.info(f"Found {len(games_to_scrape)} new games to scrape")
    
    successful_requests = 0
    
    # Create progress bar
    pbar = tqdm(games_to_scrape, desc="Scraping games", unit="game")
    
    for game_id in pbar:
        try:
            # Update progress bar description with current game
            pbar.set_description(f"Scraping game {game_id}")
            
            # Fetch game data with timeout handling and retries
            game_data = create_row_playertrack(game_id, timeout=timeout, max_retries=max_retries)
            
            if game_data is not None:
                cache[game_id] = game_data
                successful_requests += 1
                
                # Update progress bar postfix with success count
                pbar.set_postfix(
                    successful=successful_requests,
                    cached_total=len(cache)
                )
                
                # Save periodically
                if successful_requests % save_frequency == 0:
                    save_cache(cache, cache_path)
                    logging.info(f"Saved cache with {len(cache)} games")
            
            # Wait between requests
            time.sleep(delay)
            
        except Exception as e:
            logging.error(f"Unexpected error processing game {game_id}: {str(e)}")
            # Save cache on error to preserve progress
            save_cache(cache, cache_path)
            logging.info("Saved cache due to error")
            raise
    
    # Close progress bar
    pbar.close()
    
    # Final save
    save_cache(cache, cache_path)
    logging.info(f"Scraping completed. Final cache contains {len(cache)} games")
    
    return cache

Run the scraping with 0.5s request delay (use 1sec if running into excessive timeout errors/notable skips)

In [None]:
# Since I plan on running for ~20 min as a trial run, I want to randomize the game ids as a way to maximize the chance of running into bugs 
#random.shuffle(game_ids)

In [24]:
try:
    game_stats_playertrack = scrape_game_stats_playertrack(
        game_ids=game_ids,
        cache_path='nba_games_cache_playertrack.pkl',
        delay=0.5,  # 0.5 second delay between requests
        save_frequency=10,  # Save every 10 successful requests
        timeout=30,  # 30 second timeout
        max_retries=3  # Retry up to 3 times on timeout
    )
except Exception as e:
    logging.error(f"Scraping stopped due to error: {str(e)}")

2025-02-16 14:45:21,655 - INFO - Loaded cache with 0 existing games
2025-02-16 14:45:21,657 - INFO - Found 27660 new games to scrape
Scraping game 0040400305:   0%|          | 9/27660 [00:12<10:17:24,  1.34s/game, cached_total=10, successful=10]2025-02-16 14:45:34,505 - INFO - Saved cache with 10 games
Scraping game 0040400226:   0%|          | 19/27660 [00:26<10:52:08,  1.42s/game, cached_total=20, successful=20]2025-02-16 14:45:48,456 - INFO - Saved cache with 20 games
Scraping game 0040400213:   0%|          | 29/27660 [00:41<11:04:52,  1.44s/game, cached_total=30, successful=30]2025-02-16 14:46:03,075 - INFO - Saved cache with 30 games
Scraping game 0040400231:   0%|          | 39/27660 [00:55<10:43:55,  1.40s/game, cached_total=40, successful=40]2025-02-16 14:46:16,814 - INFO - Saved cache with 40 games
Scraping game 0040400165:   0%|          | 49/27660 [01:08<10:32:50,  1.38s/game, cached_total=50, successful=50]2025-02-16 14:46:30,424 - INFO - Saved cache with 50 games
Scraping

# New Notebook: Merging endpoint pkls into dataset

I created a new subfolder, `endpoint_pkls_for_merge`, containing all the pkls (slightly renamed)

In [29]:
import pandas as pd
import pickle

#pkl_folder_path = 'Data Analytics Capstone//Capstone//Scraping Game Level Dataset//endpoint_pkls_for_merge//'
pkl_folder_path = 'endpoint_pkls_for_merge//'
pkl1 = 'nba_games_cache_boxscore_traditional.pkl'
pkl2 = 'nba_games_cache_boxscore_advanced.pkl'
pkl3 = 'nba_games_cache_boxscore_playertrack.pkl'

## boxscore pkl -> uncleaned df


The goal is to have three dataframes: `bs_trad`, `bs_adv`, and `bs_player`

### traditional v3

load the pkl from disk

In [32]:
with open(pkl_folder_path + pkl1, 'rb') as f:
    cache_trad = pickle.load(f)

print(f'Number of scraped games in cache: {len(cache_trad)}')

Number of scraped games in cache: 27657


Iterate through the cache, using `pd.json_normalize` to create one row per game id

In [67]:
all_games = []
num_skipped = 0
for game_id, game_data in cache_trad.items():
    try:
        assert game_data is not None
        df = pd.json_normalize(game_data)
        all_games.append(df)
    except Exception:
        num_skipped = num_skipped + 1
        continue
print(f'Num skipped: {num_skipped}')
bs_trad = pd.concat(all_games, ignore_index=True)
print(f'shape of dataset: {bs_trad.shape}')
bs_trad.head()

Num skipped: 0
shape of dataset: (27657, 140)


Unnamed: 0,meta.version,meta.request,meta.time,boxScoreTraditional.gameId,boxScoreTraditional.awayTeamId,boxScoreTraditional.homeTeamId,boxScoreTraditional.homeTeam.teamId,boxScoreTraditional.homeTeam.teamCity,boxScoreTraditional.homeTeam.teamName,boxScoreTraditional.homeTeam.teamTricode,...,boxScoreTraditional.awayTeam.bench.blocks,boxScoreTraditional.awayTeam.bench.turnovers,boxScoreTraditional.awayTeam.bench.foulsPersonal,boxScoreTraditional.awayTeam.bench.points,boxScoreTraditional.homeTeam.bench,boxScoreTraditional.homeTeam.statistics,boxScoreTraditional.homeTeam.starters,boxScoreTraditional.awayTeam.statistics,boxScoreTraditional.awayTeam.starters,boxScoreTraditional.awayTeam.bench
0,1,http://nba.cloud/games/0040400407/boxscoretrad...,2023-08-10T15:58:36.5836Z,40400407,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,2.0,2.0,6.0,14.0,,,,,,
1,1,http://nba.cloud/games/0040400406/boxscoretrad...,2023-08-10T11:24:49.2449Z,40400406,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,1.0,2.0,7.0,14.0,,,,,,
2,1,http://nba.cloud/games/0040400405/boxscoretrad...,2023-08-10T15:58:11.5811Z,40400405,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.0,3.0,5.0,22.0,,,,,,
3,1,http://nba.cloud/games/0040400404/boxscoretrad...,2023-08-10T15:57:54.5754Z,40400404,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,2.0,4.0,6.0,18.0,,,,,,
4,1,http://nba.cloud/games/0040400403/boxscoretrad...,2023-08-10T15:57:41.5741Z,40400403,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,1.0,4.0,2.0,18.0,,,,,,


In [34]:
bs_trad['boxScoreTraditional.gameId'].nunique()

27657

### advanced v3

copied and pasted from traditional v3 section

load the pkl from disk

In [35]:
with open(pkl_folder_path + pkl2, 'rb') as f:
    cache_adv = pickle.load(f)

print(f'Number of scraped games in cache: {len(cache_adv)}')

Number of scraped games in cache: 27658


Iterate through the cache, using `pd.json_normalize` to create one row per game id

In [36]:
all_games = []
num_skipped = 0
for game_id, game_data in cache_adv.items():
    try:
        assert game_data is not None
        df = pd.json_normalize(game_data)
        all_games.append(df)
    except Exception:
        num_skipped = num_skipped + 1
        continue
print(f'Num skipped: {num_skipped}')
bs_adv = pd.concat(all_games, ignore_index=True)
print(f'shape of dataset: {bs_adv.shape}')
bs_adv.head()

Num skipped: 0
shape of dataset: (27658, 68)


Unnamed: 0,meta.version,meta.request,meta.time,boxScoreAdvanced.gameId,boxScoreAdvanced.awayTeamId,boxScoreAdvanced.homeTeamId,boxScoreAdvanced.homeTeam.teamId,boxScoreAdvanced.homeTeam.teamCity,boxScoreAdvanced.homeTeam.teamName,boxScoreAdvanced.homeTeam.teamTricode,...,boxScoreAdvanced.awayTeam.statistics.trueShootingPercentage,boxScoreAdvanced.awayTeam.statistics.usagePercentage,boxScoreAdvanced.awayTeam.statistics.estimatedUsagePercentage,boxScoreAdvanced.awayTeam.statistics.estimatedPace,boxScoreAdvanced.awayTeam.statistics.pace,boxScoreAdvanced.awayTeam.statistics.pacePer40,boxScoreAdvanced.awayTeam.statistics.possessions,boxScoreAdvanced.awayTeam.statistics.PIE,boxScoreAdvanced.homeTeam.statistics,boxScoreAdvanced.awayTeam.statistics
0,1,http://nba.cloud/games/0040400407/boxscoreadva...,2025-02-16T16:01:44.144Z,40400407,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,0.462,1.0,0.193,81.26,77.5,64.58,78.0,0.458,,
1,1,http://nba.cloud/games/0040400406/boxscoreadva...,2025-02-16T16:01:46.146Z,40400406,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,0.544,1.0,0.199,82.4,80.5,67.08,81.0,0.592,,
2,1,http://nba.cloud/games/0040400405/boxscoreadva...,2025-02-16T16:01:47.147Z,40400405,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.526,1.0,0.196,81.22,79.7,66.42,88.0,0.443,,
3,1,http://nba.cloud/games/0040400404/boxscoreadva...,2025-02-16T16:01:48.148Z,40400404,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.441,1.0,0.199,88.84,87.5,72.92,88.0,0.264,,
4,1,http://nba.cloud/games/0040400403/boxscoreadva...,2025-02-16T16:01:50.15Z,40400403,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.521,1.0,0.2,85.64,83.0,69.17,83.0,0.36,,


In [38]:
bs_adv['boxScoreAdvanced.gameId'].nunique()

27658
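
Because each endpoint prefixes its columns differently, the eventual merge has to name the game id column on each side. A rough sketch of how that could look for two of the frames, using tiny toy frames rather than the real `bs_trad` and `bs_adv` (the values are placeholders). The `validate` and `indicator` arguments are optional extras I would consider, not something the notebook already does.

```python
import pandas as pd

# Toy stand-ins for bs_trad and bs_adv with just two columns each.
bs_trad_toy = pd.DataFrame({
    "boxScoreTraditional.gameId": ["0040400407", "0040400406", "0040400405"],
    "boxScoreTraditional.homeTeam.teamTricode": ["SAS", "SAS", "DET"],
})
bs_adv_toy = pd.DataFrame({
    "boxScoreAdvanced.gameId": ["0040400407", "0040400406", "0040400404"],
    "boxScoreAdvanced.awayTeam.statistics.pace": [77.5, 80.5, 87.5],
})

merged = bs_trad_toy.merge(
    bs_adv_toy,
    left_on="boxScoreTraditional.gameId",
    right_on="boxScoreAdvanced.gameId",
    how="outer",             # keep games that only one endpoint returned
    validate="one_to_one",   # raise if a game id is duplicated on either side
    indicator=True,          # adds a _merge column: both / left_only / right_only
)
print(merged["_merge"].value_counts())
```

An outer merge with `indicator=True` makes it easy to see which games are missing from one endpoint before deciding whether to drop them.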

### playertrack v3

Copied and pasted from the traditional v3 section

load the pkl from disk

In [39]:
with open(pkl_folder_path + pkl3, 'rb') as f:
    cache_player = pickle.load(f)

print(f'Number of scraped games in cache: {len(cache_player)}')

Number of scraped games in cache: 27659


iterate through cache using `pd.json_normalize` to create a row for each game id's value
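As a rough illustration of what `pd.json_normalize` does to one cached payload, the sketch below flattens a small nested dict into a single row with dot-separated column names. The payload is a made-up, heavily trimmed stand-in for a real response (the `touches` value is invented), so treat the field set as an assumption.

```python
import pandas as pd

# Hypothetical, trimmed stand-in for one cached game payload (values are made up)
sample_game = {
    "meta": {"version": 1},
    "boxScorePlayerTrack": {
        "gameId": "0040400407",
        "homeTeam": {
            "teamId": 1610612759,
            "statistics": {"touches": 400},
        },
    },
}

# Nested keys are joined with "." to form the flat column names
flat = pd.json_normalize(sample_game)
print(list(flat.columns))
print(flat.shape)
```

Each game becomes one row this way, which is why concatenating the per-game frames gives one row per game id.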

In [40]:
all_games = []
num_skipped = 0
for game_id, game_data in cache_player.items():
    # Skip games whose scrape returned nothing
    if game_data is None:
        num_skipped += 1
        continue
    try:
        df = pd.json_normalize(game_data)
    except Exception:
        # Count payloads that cannot be flattened instead of swallowing every error
        num_skipped += 1
        continue
    all_games.append(df)
print(f'Num skipped: {num_skipped}')
bs_player = pd.concat(all_games, ignore_index=True)
print(f'shape of dataset: {bs_player.shape}')
bs_player.head()

Num skipped: 0
shape of dataset: (27659, 60)


Unnamed: 0,meta.version,meta.request,meta.time,boxScorePlayerTrack.gameId,boxScorePlayerTrack.awayTeamId,boxScorePlayerTrack.homeTeamId,boxScorePlayerTrack.homeTeam.teamId,boxScorePlayerTrack.homeTeam.teamCity,boxScorePlayerTrack.homeTeam.teamName,boxScorePlayerTrack.homeTeam.teamTricode,...,boxScorePlayerTrack.awayTeam.statistics.contestedFieldGoalPercentage,boxScorePlayerTrack.awayTeam.statistics.uncontestedFieldGoalsMade,boxScorePlayerTrack.awayTeam.statistics.uncontestedFieldGoalsAttempted,boxScorePlayerTrack.awayTeam.statistics.uncontestedFieldGoalsPercentage,boxScorePlayerTrack.awayTeam.statistics.fieldGoalPercentage,boxScorePlayerTrack.awayTeam.statistics.defendedAtRimFieldGoalsMade,boxScorePlayerTrack.awayTeam.statistics.defendedAtRimFieldGoalsAttempted,boxScorePlayerTrack.awayTeam.statistics.defendedAtRimFieldGoalPercentage,boxScorePlayerTrack.homeTeam.statistics,boxScorePlayerTrack.awayTeam.statistics
0,1,http://nba.cloud/games/0040400407/boxscoreplay...,2025-02-16T15:45:22.4522Z,40400407,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,0.0,0.0,0.0,0.0,0.419,0.0,0.0,0.0,,
1,1,http://nba.cloud/games/0040400406/boxscoreplay...,2025-02-16T15:45:23.4523Z,40400406,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,0.0,0.0,0.0,0.0,0.468,0.0,0.0,0.0,,
2,1,http://nba.cloud/games/0040400405/boxscoreplay...,2025-02-16T15:45:24.4524Z,40400405,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.0,0.0,0.0,0.0,0.463,0.0,0.0,0.0,,
3,1,http://nba.cloud/games/0040400404/boxscoreplay...,2025-02-16T15:45:26.4526Z,40400404,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.0,0.0,0.0,0.0,0.371,0.0,0.0,0.0,,
4,1,http://nba.cloud/games/0040400403/boxscoreplay...,2025-02-16T15:45:27.4527Z,40400403,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.0,0.0,0.0,0.0,0.433,0.0,0.0,0.0,,


In [41]:
bs_player['boxScorePlayerTrack.gameId'].nunique()

27659

## filtering down columns for each uncleaned df

The question of the day is which columns to keep and which to drop. Filtering each DataFrame individually before merging is less of a headache than filtering the merged result.

### traditional v3

Pick the columns to keep for the home team first; the away list is the same list copy-pasted with `homeTeam` replaced by `awayTeam`.
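Instead of copy-pasting, the away list could also be derived from the home list. A minimal sketch, assuming every home column name contains the segment `.homeTeam.` (the two-item list and the `_demo` names are only for illustration):

```python
# Shortened home list for illustration; the real list is much longer
home_cols_demo = [
    'boxScoreTraditional.homeTeam.teamId',
    'boxScoreTraditional.homeTeam.statistics.points',
]

# Swap the team segment of each dotted path to get the matching away column
away_cols_demo = [
    col.replace('.homeTeam.', '.awayTeam.') for col in home_cols_demo
]
print(away_cols_demo)
```

Deriving one list from the other keeps the two in sync if a column is added or removed later.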

In [68]:
keep_cols_trad_home = [
    'boxScoreTraditional.homeTeam.teamId',
    'boxScoreTraditional.homeTeam.teamName',
    'boxScoreTraditional.homeTeam.teamTricode',
    'boxScoreTraditional.homeTeam.teamSlug',
    'boxScoreTraditional.homeTeam.statistics.minutes',
    'boxScoreTraditional.homeTeam.statistics.fieldGoalsMade',
    'boxScoreTraditional.homeTeam.statistics.fieldGoalsAttempted',
    'boxScoreTraditional.homeTeam.statistics.fieldGoalsPercentage',
    'boxScoreTraditional.homeTeam.statistics.threePointersMade',
    'boxScoreTraditional.homeTeam.statistics.threePointersAttempted',
    'boxScoreTraditional.homeTeam.statistics.threePointersPercentage',
    'boxScoreTraditional.homeTeam.statistics.freeThrowsMade',
    'boxScoreTraditional.homeTeam.statistics.freeThrowsAttempted',
    'boxScoreTraditional.homeTeam.statistics.freeThrowsPercentage',
    'boxScoreTraditional.homeTeam.statistics.reboundsOffensive',
    'boxScoreTraditional.homeTeam.statistics.reboundsDefensive',
    'boxScoreTraditional.homeTeam.statistics.reboundsTotal',
    'boxScoreTraditional.homeTeam.statistics.assists',
    'boxScoreTraditional.homeTeam.statistics.steals',
    'boxScoreTraditional.homeTeam.statistics.blocks',
    'boxScoreTraditional.homeTeam.statistics.turnovers',
    'boxScoreTraditional.homeTeam.statistics.foulsPersonal',
    'boxScoreTraditional.homeTeam.statistics.points',
    'boxScoreTraditional.homeTeam.statistics.plusMinusPoints',
]

keep_cols_trad_away = [
    'boxScoreTraditional.awayTeam.teamId',
    'boxScoreTraditional.awayTeam.teamName',
    'boxScoreTraditional.awayTeam.teamTricode',
    'boxScoreTraditional.awayTeam.teamSlug',
    'boxScoreTraditional.awayTeam.statistics.minutes',
    'boxScoreTraditional.awayTeam.statistics.fieldGoalsMade',
    'boxScoreTraditional.awayTeam.statistics.fieldGoalsAttempted',
    'boxScoreTraditional.awayTeam.statistics.fieldGoalsPercentage',
    'boxScoreTraditional.awayTeam.statistics.threePointersMade',
    'boxScoreTraditional.awayTeam.statistics.threePointersAttempted',
    'boxScoreTraditional.awayTeam.statistics.threePointersPercentage',
    'boxScoreTraditional.awayTeam.statistics.freeThrowsMade',
    'boxScoreTraditional.awayTeam.statistics.freeThrowsAttempted',
    'boxScoreTraditional.awayTeam.statistics.freeThrowsPercentage',
    'boxScoreTraditional.awayTeam.statistics.reboundsOffensive',
    'boxScoreTraditional.awayTeam.statistics.reboundsDefensive',
    'boxScoreTraditional.awayTeam.statistics.reboundsTotal',
    'boxScoreTraditional.awayTeam.statistics.assists',
    'boxScoreTraditional.awayTeam.statistics.steals',
    'boxScoreTraditional.awayTeam.statistics.blocks',
    'boxScoreTraditional.awayTeam.statistics.turnovers',
    'boxScoreTraditional.awayTeam.statistics.foulsPersonal',
    'boxScoreTraditional.awayTeam.statistics.points',
    'boxScoreTraditional.awayTeam.statistics.plusMinusPoints',
]


keep_cols_trad = [
    'boxScoreTraditional.gameId',
    'boxScoreTraditional.awayTeamId',
    'boxScoreTraditional.homeTeamId'
] + keep_cols_trad_home + keep_cols_trad_away

In [69]:
bs_trad = bs_trad[keep_cols_trad]

### advanced v3

In [44]:
bs_adv.head()

Unnamed: 0,meta.version,meta.request,meta.time,boxScoreAdvanced.gameId,boxScoreAdvanced.awayTeamId,boxScoreAdvanced.homeTeamId,boxScoreAdvanced.homeTeam.teamId,boxScoreAdvanced.homeTeam.teamCity,boxScoreAdvanced.homeTeam.teamName,boxScoreAdvanced.homeTeam.teamTricode,...,boxScoreAdvanced.awayTeam.statistics.trueShootingPercentage,boxScoreAdvanced.awayTeam.statistics.usagePercentage,boxScoreAdvanced.awayTeam.statistics.estimatedUsagePercentage,boxScoreAdvanced.awayTeam.statistics.estimatedPace,boxScoreAdvanced.awayTeam.statistics.pace,boxScoreAdvanced.awayTeam.statistics.pacePer40,boxScoreAdvanced.awayTeam.statistics.possessions,boxScoreAdvanced.awayTeam.statistics.PIE,boxScoreAdvanced.homeTeam.statistics,boxScoreAdvanced.awayTeam.statistics
0,1,http://nba.cloud/games/0040400407/boxscoreadva...,2025-02-16T16:01:44.144Z,40400407,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,0.462,1.0,0.193,81.26,77.5,64.58,78.0,0.458,,
1,1,http://nba.cloud/games/0040400406/boxscoreadva...,2025-02-16T16:01:46.146Z,40400406,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,0.544,1.0,0.199,82.4,80.5,67.08,81.0,0.592,,
2,1,http://nba.cloud/games/0040400405/boxscoreadva...,2025-02-16T16:01:47.147Z,40400405,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.526,1.0,0.196,81.22,79.7,66.42,88.0,0.443,,
3,1,http://nba.cloud/games/0040400404/boxscoreadva...,2025-02-16T16:01:48.148Z,40400404,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.441,1.0,0.199,88.84,87.5,72.92,88.0,0.264,,
4,1,http://nba.cloud/games/0040400403/boxscoreadva...,2025-02-16T16:01:50.15Z,40400403,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.521,1.0,0.2,85.64,83.0,69.17,83.0,0.36,,


In [47]:
keep_cols_adv_home = [
    'boxScoreAdvanced.homeTeam.statistics.estimatedOffensiveRating',
    'boxScoreAdvanced.homeTeam.statistics.offensiveRating',
    'boxScoreAdvanced.homeTeam.statistics.estimatedDefensiveRating',
    'boxScoreAdvanced.homeTeam.statistics.defensiveRating',
    'boxScoreAdvanced.homeTeam.statistics.estimatedNetRating',
    'boxScoreAdvanced.homeTeam.statistics.netRating',
    'boxScoreAdvanced.homeTeam.statistics.assistPercentage',
    'boxScoreAdvanced.homeTeam.statistics.assistToTurnover',
    'boxScoreAdvanced.homeTeam.statistics.assistRatio',
    'boxScoreAdvanced.homeTeam.statistics.offensiveReboundPercentage',
    'boxScoreAdvanced.homeTeam.statistics.defensiveReboundPercentage',
    'boxScoreAdvanced.homeTeam.statistics.reboundPercentage',
    'boxScoreAdvanced.homeTeam.statistics.estimatedTeamTurnoverPercentage',
    'boxScoreAdvanced.homeTeam.statistics.turnoverRatio',
    'boxScoreAdvanced.homeTeam.statistics.effectiveFieldGoalPercentage',
    'boxScoreAdvanced.homeTeam.statistics.trueShootingPercentage',
    'boxScoreAdvanced.homeTeam.statistics.usagePercentage',
    'boxScoreAdvanced.homeTeam.statistics.estimatedUsagePercentage',
    'boxScoreAdvanced.homeTeam.statistics.estimatedPace',
    'boxScoreAdvanced.homeTeam.statistics.pace',
    'boxScoreAdvanced.homeTeam.statistics.pacePer40',
    'boxScoreAdvanced.homeTeam.statistics.possessions',
    'boxScoreAdvanced.homeTeam.statistics.PIE',
]

keep_cols_adv_away = [
    'boxScoreAdvanced.awayTeam.statistics.estimatedOffensiveRating',
    'boxScoreAdvanced.awayTeam.statistics.offensiveRating',
    'boxScoreAdvanced.awayTeam.statistics.estimatedDefensiveRating',
    'boxScoreAdvanced.awayTeam.statistics.defensiveRating',
    'boxScoreAdvanced.awayTeam.statistics.estimatedNetRating',
    'boxScoreAdvanced.awayTeam.statistics.netRating',
    'boxScoreAdvanced.awayTeam.statistics.assistPercentage',
    'boxScoreAdvanced.awayTeam.statistics.assistToTurnover',
    'boxScoreAdvanced.awayTeam.statistics.assistRatio',
    'boxScoreAdvanced.awayTeam.statistics.offensiveReboundPercentage',
    'boxScoreAdvanced.awayTeam.statistics.defensiveReboundPercentage',
    'boxScoreAdvanced.awayTeam.statistics.reboundPercentage',
    'boxScoreAdvanced.awayTeam.statistics.estimatedTeamTurnoverPercentage',
    'boxScoreAdvanced.awayTeam.statistics.turnoverRatio',
    'boxScoreAdvanced.awayTeam.statistics.effectiveFieldGoalPercentage',
    'boxScoreAdvanced.awayTeam.statistics.trueShootingPercentage',
    'boxScoreAdvanced.awayTeam.statistics.usagePercentage',
    'boxScoreAdvanced.awayTeam.statistics.estimatedUsagePercentage',
    'boxScoreAdvanced.awayTeam.statistics.estimatedPace',
    'boxScoreAdvanced.awayTeam.statistics.pace',
    'boxScoreAdvanced.awayTeam.statistics.pacePer40',
    'boxScoreAdvanced.awayTeam.statistics.possessions',
    'boxScoreAdvanced.awayTeam.statistics.PIE',
]

keep_cols_adv = [
    'boxScoreAdvanced.gameId',
    'boxScoreAdvanced.awayTeamId',
    'boxScoreAdvanced.homeTeamId'
] + keep_cols_adv_home + keep_cols_adv_away

In [51]:
bs_adv = bs_adv[keep_cols_adv]

### playerstats v3

In [48]:
bs_player.head()

Unnamed: 0,meta.version,meta.request,meta.time,boxScorePlayerTrack.gameId,boxScorePlayerTrack.awayTeamId,boxScorePlayerTrack.homeTeamId,boxScorePlayerTrack.homeTeam.teamId,boxScorePlayerTrack.homeTeam.teamCity,boxScorePlayerTrack.homeTeam.teamName,boxScorePlayerTrack.homeTeam.teamTricode,...,boxScorePlayerTrack.awayTeam.statistics.contestedFieldGoalPercentage,boxScorePlayerTrack.awayTeam.statistics.uncontestedFieldGoalsMade,boxScorePlayerTrack.awayTeam.statistics.uncontestedFieldGoalsAttempted,boxScorePlayerTrack.awayTeam.statistics.uncontestedFieldGoalsPercentage,boxScorePlayerTrack.awayTeam.statistics.fieldGoalPercentage,boxScorePlayerTrack.awayTeam.statistics.defendedAtRimFieldGoalsMade,boxScorePlayerTrack.awayTeam.statistics.defendedAtRimFieldGoalsAttempted,boxScorePlayerTrack.awayTeam.statistics.defendedAtRimFieldGoalPercentage,boxScorePlayerTrack.homeTeam.statistics,boxScorePlayerTrack.awayTeam.statistics
0,1,http://nba.cloud/games/0040400407/boxscoreplay...,2025-02-16T15:45:22.4522Z,40400407,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,0.0,0.0,0.0,0.0,0.419,0.0,0.0,0.0,,
1,1,http://nba.cloud/games/0040400406/boxscoreplay...,2025-02-16T15:45:23.4523Z,40400406,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,0.0,0.0,0.0,0.0,0.468,0.0,0.0,0.0,,
2,1,http://nba.cloud/games/0040400405/boxscoreplay...,2025-02-16T15:45:24.4524Z,40400405,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.0,0.0,0.0,0.0,0.463,0.0,0.0,0.0,,
3,1,http://nba.cloud/games/0040400404/boxscoreplay...,2025-02-16T15:45:26.4526Z,40400404,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.0,0.0,0.0,0.0,0.371,0.0,0.0,0.0,,
4,1,http://nba.cloud/games/0040400403/boxscoreplay...,2025-02-16T15:45:27.4527Z,40400403,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.0,0.0,0.0,0.0,0.433,0.0,0.0,0.0,,


In [53]:
keep_cols_player_home = [
    'boxScorePlayerTrack.homeTeam.statistics.distance',
    'boxScorePlayerTrack.homeTeam.statistics.reboundChancesOffensive',
    'boxScorePlayerTrack.homeTeam.statistics.reboundChancesDefensive',
    'boxScorePlayerTrack.homeTeam.statistics.reboundChancesTotal',
    'boxScorePlayerTrack.homeTeam.statistics.touches',
    'boxScorePlayerTrack.homeTeam.statistics.secondaryAssists',
    'boxScorePlayerTrack.homeTeam.statistics.freeThrowAssists',
    'boxScorePlayerTrack.homeTeam.statistics.passes',
    'boxScorePlayerTrack.homeTeam.statistics.assists',
    'boxScorePlayerTrack.homeTeam.statistics.contestedFieldGoalsMade',
    'boxScorePlayerTrack.homeTeam.statistics.contestedFieldGoalsAttempted',
    'boxScorePlayerTrack.homeTeam.statistics.contestedFieldGoalPercentage',
    'boxScorePlayerTrack.homeTeam.statistics.uncontestedFieldGoalsMade',
    'boxScorePlayerTrack.homeTeam.statistics.uncontestedFieldGoalsAttempted',
    'boxScorePlayerTrack.homeTeam.statistics.uncontestedFieldGoalsPercentage',
    'boxScorePlayerTrack.homeTeam.statistics.fieldGoalPercentage',
    'boxScorePlayerTrack.homeTeam.statistics.defendedAtRimFieldGoalsMade',
    'boxScorePlayerTrack.homeTeam.statistics.defendedAtRimFieldGoalsAttempted',
    'boxScorePlayerTrack.homeTeam.statistics.defendedAtRimFieldGoalPercentage',
]

keep_cols_player_away = [
    'boxScorePlayerTrack.awayTeam.statistics.distance',
    'boxScorePlayerTrack.awayTeam.statistics.reboundChancesOffensive',
    'boxScorePlayerTrack.awayTeam.statistics.reboundChancesDefensive',
    'boxScorePlayerTrack.awayTeam.statistics.reboundChancesTotal',
    'boxScorePlayerTrack.awayTeam.statistics.touches',
    'boxScorePlayerTrack.awayTeam.statistics.secondaryAssists',
    'boxScorePlayerTrack.awayTeam.statistics.freeThrowAssists',
    'boxScorePlayerTrack.awayTeam.statistics.passes',
    'boxScorePlayerTrack.awayTeam.statistics.assists',
    'boxScorePlayerTrack.awayTeam.statistics.contestedFieldGoalsMade',
    'boxScorePlayerTrack.awayTeam.statistics.contestedFieldGoalsAttempted',
    'boxScorePlayerTrack.awayTeam.statistics.contestedFieldGoalPercentage',
    'boxScorePlayerTrack.awayTeam.statistics.uncontestedFieldGoalsMade',
    'boxScorePlayerTrack.awayTeam.statistics.uncontestedFieldGoalsAttempted',
    'boxScorePlayerTrack.awayTeam.statistics.uncontestedFieldGoalsPercentage',
    'boxScorePlayerTrack.awayTeam.statistics.fieldGoalPercentage',
    'boxScorePlayerTrack.awayTeam.statistics.defendedAtRimFieldGoalsMade',
    'boxScorePlayerTrack.awayTeam.statistics.defendedAtRimFieldGoalsAttempted',
    'boxScorePlayerTrack.awayTeam.statistics.defendedAtRimFieldGoalPercentage',
]


keep_cols_player = [
    'boxScorePlayerTrack.gameId',
    'boxScorePlayerTrack.awayTeamId',
    'boxScorePlayerTrack.homeTeamId'
] + keep_cols_player_home + keep_cols_player_away

In [54]:
bs_player = bs_player[keep_cols_player]

## performing merges

Since bs_trad has the fewest unique game ids (which makes sense because it was the first one scraped), the other two will be merged onto it.

merging bs_trad and bs_adv

In [70]:
merge1 = pd.merge(bs_trad, bs_adv, how='left', left_on='boxScoreTraditional.gameId', right_on='boxScoreAdvanced.gameId')
merge1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27657 entries, 0 to 27656
Data columns (total 100 columns):
 #   Column                                                                Non-Null Count  Dtype  
---  ------                                                                --------------  -----  
 0   boxScoreTraditional.gameId                                            27657 non-null  object 
 1   boxScoreTraditional.awayTeamId                                        27657 non-null  int64  
 2   boxScoreTraditional.homeTeamId                                        27657 non-null  int64  
 3   boxScoreTraditional.homeTeam.teamId                                   27657 non-null  int64  
 4   boxScoreTraditional.homeTeam.teamName                                 27643 non-null  object 
 5   boxScoreTraditional.homeTeam.teamTricode                              27643 non-null  object 
 6   boxScoreTraditional.homeTeam.teamSlug                                 27643 non-null  object 

merging result and bs_player

In [116]:
merge2 = pd.merge(merge1, bs_player, how='left', left_on='boxScoreTraditional.gameId', right_on='boxScorePlayerTrack.gameId')
merge2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27657 entries, 0 to 27656
Columns: 141 entries, boxScoreTraditional.gameId to boxScorePlayerTrack.awayTeam.statistics.defendedAtRimFieldGoalPercentage
dtypes: float64(126), int64(4), object(11)
memory usage: 29.8+ MB


In [72]:
list(merge2.columns)

['boxScoreTraditional.gameId',
 'boxScoreTraditional.awayTeamId',
 'boxScoreTraditional.homeTeamId',
 'boxScoreTraditional.homeTeam.teamId',
 'boxScoreTraditional.homeTeam.teamName',
 'boxScoreTraditional.homeTeam.teamTricode',
 'boxScoreTraditional.homeTeam.teamSlug',
 'boxScoreTraditional.homeTeam.statistics.minutes',
 'boxScoreTraditional.homeTeam.statistics.fieldGoalsMade',
 'boxScoreTraditional.homeTeam.statistics.fieldGoalsAttempted',
 'boxScoreTraditional.homeTeam.statistics.fieldGoalsPercentage',
 'boxScoreTraditional.homeTeam.statistics.threePointersMade',
 'boxScoreTraditional.homeTeam.statistics.threePointersAttempted',
 'boxScoreTraditional.homeTeam.statistics.threePointersPercentage',
 'boxScoreTraditional.homeTeam.statistics.freeThrowsMade',
 'boxScoreTraditional.homeTeam.statistics.freeThrowsAttempted',
 'boxScoreTraditional.homeTeam.statistics.freeThrowsPercentage',
 'boxScoreTraditional.homeTeam.statistics.reboundsOffensive',
 'boxScoreTraditional.homeTeam.statistics.r

Export to CSV to save progress.

In [73]:
merge2.to_csv('merged_nba_endpoints_uncleaned.csv', index=False)

#### validation merge using composite key

In [74]:
m1 = pd.merge(bs_trad, bs_adv, how='left', left_on=['boxScoreTraditional.gameId','boxScoreTraditional.homeTeamId','boxScoreTraditional.awayTeamId'], right_on=['boxScoreAdvanced.gameId', 'boxScoreAdvanced.homeTeamId', 'boxScoreAdvanced.awayTeamId'])

m2 = pd.merge(m1, bs_player, how='left', left_on=['boxScoreTraditional.gameId','boxScoreTraditional.homeTeamId','boxScoreTraditional.awayTeamId'], right_on=['boxScorePlayerTrack.gameId', 'boxScorePlayerTrack.homeTeamId', 'boxScorePlayerTrack.awayTeamId'])

In [75]:
m2.shape

(27657, 141)

In [76]:
merge2.shape

(27657, 141)

Sweet, both merges have the same shape.
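Matching shapes show that neither merge added or dropped rows or columns, but not that the values agree. A minimal sketch on toy frames (all column names and numbers here are invented) of a stricter check, using the `validate` argument of `pd.merge` to reject duplicate keys and `DataFrame.equals` to compare the two results:

```python
import pandas as pd

# Toy stand-ins for bs_trad and bs_adv; values are made up
left = pd.DataFrame({
    'game_id': ['001', '002', '003'],
    'home_team_id': [1, 2, 3],
    'pts': [100, 98, 110],
})
right = pd.DataFrame({
    'adv_game_id': ['001', '002'],
    'adv_home_team_id': [1, 2],
    'pace': [95.1, 99.3],
})

# validate='one_to_one' raises MergeError if either side repeats a key
single = pd.merge(left, right, how='left',
                  left_on='game_id', right_on='adv_game_id',
                  validate='one_to_one')
composite = pd.merge(left, right, how='left',
                     left_on=['game_id', 'home_team_id'],
                     right_on=['adv_game_id', 'adv_home_team_id'],
                     validate='one_to_one')

# Compare values as well as shape, and count left rows with no match
print(single.equals(composite))
print(single['pace'].isna().sum())
```

On the real data the same idea would be `merge2.equals(m2)`, assuming both frames keep the same column order.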

## renaming columns (and creating df)

These columns are essentially duplicates, so they get dropped. The traditional `assists` columns go too, because the player-track `assists` columns would otherwise map to the same snake_case names (`home_assists`, `away_assists`) in the rename below.

In [117]:
drop_cols = [
    'boxScoreTraditional.homeTeam.teamId',
    'boxScoreTraditional.awayTeam.teamId',
    'boxScoreAdvanced.gameId',
    'boxScoreAdvanced.awayTeamId',
    'boxScoreAdvanced.homeTeamId',
    'boxScorePlayerTrack.gameId',
    'boxScorePlayerTrack.awayTeamId',
    'boxScorePlayerTrack.homeTeamId',
    'boxScoreTraditional.homeTeam.statistics.assists',
    'boxScoreTraditional.awayTeam.statistics.assists'
]
merge2 = merge2.drop(columns=drop_cols)

function to speed up the column renaming into snake case

In [118]:
def standardize_column_names(columns):
    """
    Standardizes column names from NBA box score data to snake_case format with home_/away_ prefixes.
    Removes boxScore prefixes and 'statistics' parts while preserving team stats and IDs.
    
    Args:
        columns (list): List of original column names
        
    Returns:
        dict: Mapping of original column names to standardized names
    """
    name_mapping = {}
    
    for col in columns:
        # Skip if not an ID or team statistic
        if not ('Id' in col or 'Team.statistics' in col):
            continue
            
        # Remove boxScore prefix and get the relevant part
        parts = col.split('.')
        
        # Handle IDs
        if 'Id' in col:
            # Convert camelCase ID to snake_case
            id_name = parts[-1]
            # Handle camelCase to snake_case conversion
            snake_name = ''.join(['_' + c.lower() if c.isupper() else c.lower() for c in id_name]).lstrip('_')
            name_mapping[col] = snake_name
            continue
            
        # Handle team statistics
        if len(parts) >= 3 and 'Team.statistics' in col:
            # Get team type (home/away) and stat name
            team_type = 'home_' if 'homeTeam' in col else 'away_'
            stat_name = parts[-1]
            
            # Convert camelCase stat name to snake_case
            snake_stat = ''.join(['_' + c.lower() if c.isupper() else c.lower() for c in stat_name]).lstrip('_')
            
            # Combine with prefix
            new_name = f"{team_type}{snake_stat}"
            name_mapping[col] = new_name
    
    return name_mapping
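One side effect of the character-by-character conversion above is that every capital letter gets its own underscore, so an all-caps stat name such as `PIE` comes out as `p_i_e`. A hedged alternative, not what this notebook uses, is a regex-based helper (the name `camel_to_snake` is hypothetical) that keeps runs of capitals together:

```python
import re

def camel_to_snake(name):
    """Convert camelCase to snake_case, keeping acronyms such as 'PIE' in one piece."""
    # Underscore between a lowercase letter or digit and the capital that follows
    s = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    # Underscore between an acronym and a following capitalized word
    s = re.sub(r'(?<=[A-Z])(?=[A-Z][a-z])', '_', s)
    return s.lower()

for stat in ['fieldGoalsMade', 'PIE', 'gameId']:
    print(stat, '->', camel_to_snake(stat))
```

Swapping this in would change the resulting column names (for example the `PIE` columns), so anything downstream that refers to the old names would need updating.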

use the function

In [119]:
rename_mapping = standardize_column_names(list(merge2.columns))

df = merge2.rename(columns=rename_mapping, errors='raise')
list(df.columns)

['game_id',
 'away_team_id',
 'home_team_id',
 'boxScoreTraditional.homeTeam.teamName',
 'boxScoreTraditional.homeTeam.teamTricode',
 'boxScoreTraditional.homeTeam.teamSlug',
 'home_minutes',
 'home_field_goals_made',
 'home_field_goals_attempted',
 'home_field_goals_percentage',
 'home_three_pointers_made',
 'home_three_pointers_attempted',
 'home_three_pointers_percentage',
 'home_free_throws_made',
 'home_free_throws_attempted',
 'home_free_throws_percentage',
 'home_rebounds_offensive',
 'home_rebounds_defensive',
 'home_rebounds_total',
 'home_steals',
 'home_blocks',
 'home_turnovers',
 'home_fouls_personal',
 'home_points',
 'home_plus_minus_points',
 'boxScoreTraditional.awayTeam.teamName',
 'boxScoreTraditional.awayTeam.teamTricode',
 'boxScoreTraditional.awayTeam.teamSlug',
 'away_minutes',
 'away_field_goals_made',
 'away_field_goals_attempted',
 'away_field_goals_percentage',
 'away_three_pointers_made',
 'away_three_pointers_attempted',
 'away_three_pointers_percentage',
 

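These six columns got through because `standardize_column_names` only renames columns containing `Id` or `Team.statistics`; the team name, tricode, and slug columns match neither, so they are renamed by hand below.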

In [120]:
rename_mapping2 = {
    'boxScoreTraditional.homeTeam.teamName':'home_team_name',
    'boxScoreTraditional.homeTeam.teamTricode':'home_tri_code',
    'boxScoreTraditional.homeTeam.teamSlug':'home_team_slug',
    'boxScoreTraditional.awayTeam.teamName':'away_team_name',
    'boxScoreTraditional.awayTeam.teamTricode':'away_tri_code',
    'boxScoreTraditional.awayTeam.teamSlug':'away_team_slug',
}
df = df.rename(columns=rename_mapping2, errors='raise')
list(df.columns)

['game_id',
 'away_team_id',
 'home_team_id',
 'home_team_name',
 'home_tri_code',
 'home_team_slug',
 'home_minutes',
 'home_field_goals_made',
 'home_field_goals_attempted',
 'home_field_goals_percentage',
 'home_three_pointers_made',
 'home_three_pointers_attempted',
 'home_three_pointers_percentage',
 'home_free_throws_made',
 'home_free_throws_attempted',
 'home_free_throws_percentage',
 'home_rebounds_offensive',
 'home_rebounds_defensive',
 'home_rebounds_total',
 'home_steals',
 'home_blocks',
 'home_turnovers',
 'home_fouls_personal',
 'home_points',
 'home_plus_minus_points',
 'away_team_name',
 'away_tri_code',
 'away_team_slug',
 'away_minutes',
 'away_field_goals_made',
 'away_field_goals_attempted',
 'away_field_goals_percentage',
 'away_three_pointers_made',
 'away_three_pointers_attempted',
 'away_three_pointers_percentage',
 'away_free_throws_made',
 'away_free_throws_attempted',
 'away_free_throws_percentage',
 'away_rebounds_offensive',
 'away_rebounds_defensive',
 '

## appending season column

this code is grabbed from my monkey patch fix from last time

In [104]:
from nba_api.stats.endpoints import LeagueGameFinder, BoxScoreTraditionalV2
import pandas as pd
import time
from datetime import datetime
from tqdm.auto import tqdm
import logging
import random

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.FileHandler('nba_data_collection.log'),
             logging.StreamHandler()]
)
logger = logging.getLogger(__name__)

def log_message(message):
    """Log message to both file and tqdm"""
    tqdm.write(message)
    logger.info(message)

In [105]:
def get_season_games(season):
    """Return the unique game ids for one season plus a matching list of season labels"""
    game_finder = LeagueGameFinder(
        season_nullable=season,
        league_id_nullable='00',
        timeout=60
    )
    
    # Get the raw response first
    response_frames = game_finder.get_data_frames()
    
    # Debug logging
    log_message(f"Response for season {season}: got {len(response_frames)} DataFrames")
    if not response_frames:
        raise ValueError(f"Empty response for season {season}")
    
    games = response_frames[0]
    if len(games) == 0:
        raise ValueError(f"No games found for season {season}")

    # LeagueGameFinder returns one row per team per game, so deduplicate the ids
    game_ids = games['GAME_ID'].unique()
    season_ids = [season] * len(game_ids)
    log_message(f"Retrieved {len(games)} game entries for season {season}")
    return game_ids, season_ids

In [106]:
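def compile_game_season_df(start_year, end_year):
    """Build a game_id -> season lookup for seasons starting in [start_year, end_year)."""
    # e.g. 2004 -> '2004-05'; end_year itself is excluded by range()
    seasons = [f"{year}-{str(year + 1)[-2:]}" for year in range(start_year, end_year)]
    game_ids = []
    total_seasons = []
    for season in seasons:
        games, season_ids = get_season_games(season)
        game_ids.extend(games)
        total_seasons.extend(season_ids)
        time.sleep(1)  # pause between requests to go easy on the API
    df = pd.DataFrame({'GAME_ID':game_ids, 'SEASON_ID':total_seasons})
    return df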
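The season labels built inside `compile_game_season_df` can be checked on their own. A small sketch (the helper name `season_labels` is hypothetical) using the same f-string, which shows that `end_year` is exclusive:

```python
def season_labels(start_year, end_year):
    """Return labels like '2004-05' for seasons starting in [start_year, end_year)."""
    return [f"{year}-{str(year + 1)[-2:]}" for year in range(start_year, end_year)]

labels = season_labels(2004, 2024)
# First label, last label, and how many seasons are covered
print(labels[0], labels[-1], len(labels))
```

So a call with `(2004, 2024)` covers the seasons starting in 2004 through 2023.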

In [107]:
game_season_df = compile_game_season_df(2004, 2024)
game_season_df = game_season_df.rename(columns={
    'GAME_ID':'game_id',
    'SEASON_ID':'season'
})
game_season_df.info()

2025-02-17 19:56:53,501 - INFO - Response for season 2004-05: got 1 DataFrames
2025-02-17 19:56:53,501 - INFO - Retrieved 2728 game entries for season 2004-05
2025-02-17 19:56:55,137 - INFO - Response for season 2005-06: got 1 DataFrames
2025-02-17 19:56:55,137 - INFO - Retrieved 2871 game entries for season 2005-06
2025-02-17 19:56:57,342 - INFO - Response for season 2006-07: got 1 DataFrames
2025-02-17 19:56:57,351 - INFO - Retrieved 2867 game entries for season 2006-07
2025-02-17 19:56:58,851 - INFO - Response for season 2007-08: got 1 DataFrames
2025-02-17 19:56:58,860 - INFO - Retrieved 2852 game entries for season 2007-08
2025-02-17 19:57:00,412 - INFO - Response for season 2008-09: got 1 DataFrames
2025-02-17 19:57:00,412 - INFO - Retrieved 2866 game entries for season 2008-09
2025-02-17 19:57:01,876 - INFO - Response for season 2009-10: got 1 DataFrames
2025-02-17 19:57:01,876 - INFO - Retrieved 2871 game entries for season 2009-10
2025-02-17 19:57:03,525 - INFO - Response for season 2010-11: got 1 DataFrames
2025-02-17 19:57:03,525 - INFO - Retrieved 2866 game entries for season 2010-11
2025-02-17 19:57:04,993 - INFO - Response for season 2011-12: got 1 DataFrames
2025-02-17 19:57:04,993 - INFO - Retrieved 2214 game entries for season 2011-12
2025-02-17 19:57:06,481 - INFO - Response for season 2012-13: got 1 DataFrames
2025-02-17 19:57:06,481 - INFO - Retrieved 2866 game entries for season 2012-13
2025-02-17 19:57:08,454 - INFO - Response for season 2013-14: got 1 DataFrames
2025-02-17 19:57:08,454 - INFO - Retrieved 2874 game entries for season 2013-14
2025-02-17 19:57:09,962 - INFO - Response for season 2014-15: got 1 DataFrames
2025-02-17 19:57:09,965 - INFO - Retrieved 2864 game entries for season 2014-15
2025-02-17 19:57:11,411 - INFO - Response for season 2015-16: got 1 DataFrames
2025-02-17 19:57:11,412 - INFO - Retrieved 2856 game entries for season 2015-16
2025-02-17 19:57:13,121 - INFO - Response for season 2016-17: got 1 DataFrames
2025-02-17 19:57:13,123 - INFO - Retrieved 2829 game entries for season 2016-17
2025-02-17 19:57:14,328 - INFO - Response for season 2017-18: got 1 DataFrames
2025-02-17 19:57:14,331 - INFO - Retrieved 2785 game entries for season 2017-18
2025-02-17 19:57:15,550 - INFO - Response for season 2018-19: got 1 DataFrames
2025-02-17 19:57:15,551 - INFO - Retrieved 2788 game entries for season 2018-19
2025-02-17 19:57:17,341 - INFO - Response for season 2019-20: got 1 DataFrames
2025-02-17 19:57:17,341 - INFO - Retrieved 2516 game entries for season 2019-20
2025-02-17 19:57:19,013 - INFO - Response for season 2020-21: got 1 DataFrames
2025-02-17 19:57:19,013 - INFO - Retrieved 2442 game entries for season 2020-21
2025-02-17 19:57:21,105 - INFO - Response for season 2021-22: got 1 DataFrames
2025-02-17 19:57:21,105 - INFO - Retrieved 2788 game entries for season 2021-22
2025-02-17 19:57:23,167 - INFO - Response for season 2022-23: got 1 DataFrames
2025-02-17 19:57:23,167 - INFO - Retrieved 2790 game entries for season 2022-23
2025-02-17 19:57:24,464 - INFO - Response for season 2023-24: got 1 DataFrames
2025-02-17 19:57:24,464 - INFO - Retrieved 2795 game entries for season 2023-24
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27660 entries, 0 to 27659
Data columns (total 2 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   game_id  27660 non-null  object
 1   season   27660 non-null  object
dtypes: object(2)
memory usage: 432.3+ KB


now do left join to append the season

In [121]:
df_season = pd.merge(df, game_season_df, how='left', on='game_id')
df_season.columns

Index(['game_id', 'away_team_id', 'home_team_id', 'home_team_name',
       'home_tri_code', 'home_team_slug', 'home_minutes',
       'home_field_goals_made', 'home_field_goals_attempted',
       'home_field_goals_percentage',
       ...
       'away_contested_field_goals_attempted',
       'away_contested_field_goal_percentage',
       'away_uncontested_field_goals_made',
       'away_uncontested_field_goals_attempted',
       'away_uncontested_field_goals_percentage', 'away_field_goal_percentage',
       'away_defended_at_rim_field_goals_made',
       'away_defended_at_rim_field_goals_attempted',
       'away_defended_at_rim_field_goal_percentage', 'season'],
      dtype='object', length=132)
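
A left join like the one above can silently add rows if `game_id` repeats on the right-hand side, and it leaves `season` empty for any game that has no match. The sketch below shows one way to guard against both, using pandas' `validate` and `indicator` merge options. The frames `toy_df` and `toy_game_season_df` and their values are made-up stand-ins, not data from this notebook.

```python
import pandas as pd

# Toy stand-ins for the notebook's `df` and `game_season_df` (values are made up).
toy_df = pd.DataFrame({
    "game_id": ["0040400407", "0040400406", "0029999999"],
    "home_points": [81, 74, 100],
})
toy_game_season_df = pd.DataFrame({
    "game_id": ["0040400407", "0040400406"],
    "season": ["2004-05", "2004-05"],
})

# validate="many_to_one" raises MergeError if game_id repeats on the right side,
# which would otherwise duplicate rows in the result.
# indicator=True adds a `_merge` column showing which rows found a match.
checked = pd.merge(
    toy_df,
    toy_game_season_df,
    how="left",
    on="game_id",
    validate="many_to_one",
    indicator=True,
)

# With a unique right-hand key, a left join keeps the row count unchanged.
assert len(checked) == len(toy_df)

# Rows tagged "left_only" have no season after the join.
unmatched = checked.loc[checked["_merge"] == "left_only", "game_id"].tolist()
print(unmatched)
```

Dropping the `_merge` column afterwards (`checked.drop(columns="_merge")`) gives the same shape as the plain merge.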

Overwrite `df` with `df_season`

In [122]:
df = df_season

## Split up home vs away

The markdown blocks below are copied from another notebook, but the logic is still the same

The only columns without a `home_` or `away_` prefix are `game_id` and `season`, so they will show up in both `home_df` and `away_df`.


The plan is for `home_df` to house all the `home_`-prefixed columns, plus the `away_points` column and the two unprefixed columns mentioned above. The same principle applies to `away_df`.


In [123]:
home_df_columns = ['game_id', 'season'] + [col for col in df.columns if col.startswith('home_')] + ['away_points']
away_df_columns = ['game_id', 'season'] + [col for col in df.columns if col.startswith('away_')] + ['home_points']

assert len(home_df_columns)  == len(away_df_columns) 

Now going to create `home_df` and `away_df` adding an extra binary column `is_home_team` (this var will differentiate the two groups once in `games_df`)

In [124]:
# .copy() makes each frame independent of df, so adding a column
# does not trigger pandas' SettingWithCopyWarning
home_df = df[home_df_columns].copy()
home_df['is_home_team'] = 1


away_df = df[away_df_columns].copy()
away_df['is_home_team'] = 0


In [125]:
list(home_df.columns)

['game_id',
 'season',
 'home_team_id',
 'home_team_name',
 'home_tri_code',
 'home_team_slug',
 'home_minutes',
 'home_field_goals_made',
 'home_field_goals_attempted',
 'home_field_goals_percentage',
 'home_three_pointers_made',
 'home_three_pointers_attempted',
 'home_three_pointers_percentage',
 'home_free_throws_made',
 'home_free_throws_attempted',
 'home_free_throws_percentage',
 'home_rebounds_offensive',
 'home_rebounds_defensive',
 'home_rebounds_total',
 'home_steals',
 'home_blocks',
 'home_turnovers',
 'home_fouls_personal',
 'home_points',
 'home_plus_minus_points',
 'home_estimated_offensive_rating',
 'home_offensive_rating',
 'home_estimated_defensive_rating',
 'home_defensive_rating',
 'home_estimated_net_rating',
 'home_net_rating',
 'home_assist_percentage',
 'home_assist_to_turnover',
 'home_assist_ratio',
 'home_offensive_rebound_percentage',
 'home_defensive_rebound_percentage',
 'home_rebound_percentage',
 'home_estimated_team_turnover_percentage',
 'home_turnove

In [126]:
new_column_names = [
    'game_id',
    'season',
    'team_id',
    'team_name',
    'tri_code',
    'team_slug',
    'minutes',
    'field_goals_made',
    'field_goals_attempted',
    'field_goals_percentage',
    'three_pointers_made',
    'three_pointers_attempted',
    'three_pointers_percentage',
    'free_throws_made',
    'free_throws_attempted',
    'free_throws_percentage',
    'rebounds_offensive',
    'rebounds_defensive',
    'rebounds_total',
    'steals',
    'blocks',
    'turnovers',
    'fouls_personal',
    'points',
    'plus_minus_points',
    'estimated_offensive_rating',
    'offensive_rating',
    'estimated_defensive_rating',
    'defensive_rating',
    'estimated_net_rating',
    'net_rating',
    'assist_percentage',
    'assist_to_turnover',
    'assist_ratio',
    'offensive_rebound_percentage',
    'defensive_rebound_percentage',
    'rebound_percentage',
    'estimated_team_turnover_percentage',
    'turnover_ratio',
    'effective_field_goal_percentage',
    'true_shooting_percentage',
    'usage_percentage',
    'estimated_usage_percentage',
    'estimated_pace',
    'pace',
    'pace_per40',
    'possessions',
    'p_i_e',
    'distance',
    'rebound_chances_offensive',
    'rebound_chances_defensive',
    'rebound_chances_total',
    'touches',
    'secondary_assists',
    'free_throw_assists',
    'passes',
    'assists',
    'contested_field_goals_made',
    'contested_field_goals_attempted',
    'contested_field_goal_percentage',
    'uncontested_field_goals_made',
    'uncontested_field_goals_attempted',
    'uncontested_field_goals_percentage',
    'field_goal_percentage',
    'defended_at_rim_field_goals_made',
    'defended_at_rim_field_goals_attempted',
    'defended_at_rim_field_goal_percentage',
    'opponent_points',
    'is_home_team'
]

assert len(new_column_names) == len(away_df.columns)
assert len(new_column_names) == len(home_df.columns)

Performing the renaming. Both `home_df` and `away_df` should have the same columns now 

In [132]:
home_df = home_df.rename(
    columns=dict(zip(list(home_df.columns), new_column_names))
)

away_df = away_df.rename(
    columns=dict(zip(list(away_df.columns), new_column_names))
)


assert list(home_df.columns) == list(away_df.columns)
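
The hand-written `new_column_names` list works, but it depends on the home and away columns arriving in exactly the same order. As an alternative, the rename mapping could be derived from the column names themselves. The sketch below assumes the only special case is the other side's points column, which becomes `opponent_points`; `strip_side_prefix` is a hypothetical helper, not something used elsewhere in this notebook.

```python
def strip_side_prefix(columns, side):
    """Build a rename mapping from side-prefixed names to neutral ones.

    `side` is "home" or "away". The other side's points column becomes
    `opponent_points`; unprefixed columns pass through unchanged.
    """
    other = "away" if side == "home" else "home"
    prefix = f"{side}_"
    mapping = {}
    for col in columns:
        if col == f"{other}_points":
            mapping[col] = "opponent_points"
        elif col.startswith(prefix):
            mapping[col] = col[len(prefix):]
        else:
            mapping[col] = col
    return mapping


# Shortened column lists for illustration only.
home_cols = ["game_id", "season", "home_team_id", "home_points", "away_points", "is_home_team"]
away_cols = ["game_id", "season", "away_team_id", "away_points", "home_points", "is_home_team"]

home_map = strip_side_prefix(home_cols, "home")
away_map = strip_side_prefix(away_cols, "away")

# Both sides end up with the same neutral names in the same order.
assert list(home_map.values()) == list(away_map.values())
print(list(home_map.values()))
```

With the real frames this would be used as `home_df.rename(columns=strip_side_prefix(home_df.columns, "home"))`, and likewise for `away_df` with `"away"`.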

Now ready to create `games_df` by simply concatenating `home_df` and `away_df`

In [133]:
games_df = pd.concat([home_df, away_df], ignore_index=True)

assert len(games_df) == len(df) *2

games_df.head()

Unnamed: 0,game_id,season,team_id,team_name,tri_code,team_slug,minutes,field_goals_made,field_goals_attempted,field_goals_percentage,...,contested_field_goal_percentage,uncontested_field_goals_made,uncontested_field_goals_attempted,uncontested_field_goals_percentage,field_goal_percentage,defended_at_rim_field_goals_made,defended_at_rim_field_goals_attempted,defended_at_rim_field_goal_percentage,opponent_points,is_home_team
0,40400407,2004-05,1610612759,Spurs,SAS,spurs,240:00,29.0,68.0,0.426,...,0.0,0.0,0.0,0.0,0.426,0.0,0.0,0.0,74.0,1
1,40400406,2004-05,1610612759,Spurs,SAS,spurs,240:00,31.0,75.0,0.413,...,0.0,0.0,0.0,0.0,0.413,0.0,0.0,0.0,95.0,1
2,40400405,2004-05,1610612765,Pistons,DET,pistons,265:00,37.0,84.0,0.44,...,0.0,0.0,0.0,0.0,0.44,0.0,0.0,0.0,96.0,1
3,40400404,2004-05,1610612765,Pistons,DET,pistons,240:00,41.0,90.0,0.456,...,0.0,0.0,0.0,0.0,0.456,0.0,0.0,0.0,71.0,1
4,40400403,2004-05,1610612765,Pistons,DET,pistons,240:00,40.0,85.0,0.471,...,0.0,0.0,0.0,0.0,0.471,0.0,0.0,0.0,79.0,1


`games_df` is twice the length of `df`, since each row of `df` was split into a home row and an away row, both housed in `games_df`. Because `game_id` now appears twice, you need a composite key of `game_id` plus a team column (for example `team_id`) to locate one team's info for a specific game
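
A composite-key lookup of this kind can be sketched with a two-level index. The frame `toy_games_df` and its values are made up for illustration, and the choice of `team_id` as the second key is an assumption (`is_home_team` would work as well).

```python
import pandas as pd

# Toy stand-in for games_df: two rows per game, one per team (values are made up).
toy_games_df = pd.DataFrame({
    "game_id": ["0040400406", "0040400406", "0040400407", "0040400407"],
    "team_id": [1610612759, 1610612765, 1610612759, 1610612765],
    "points": [86, 95, 81, 74],
    "is_home_team": [1, 0, 1, 0],
})

# verify_integrity=True raises if any (game_id, team_id) pair repeats,
# which confirms the pair really is a unique key.
keyed = toy_games_df.set_index(["game_id", "team_id"], verify_integrity=True).sort_index()

# One team's row for one game.
team_row = keyed.loc[("0040400407", 1610612759)]
print(team_row["points"])

# Both rows (home and away) for one game.
both_rows = keyed.loc["0040400407"]
print(len(both_rows))
```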

Next, add a binary `won_game` column indicating whether the team won the game, i.e. whether `points` is greater than `opponent_points`

In [134]:
games_df['won_game'] = (games_df['points'] > games_df['opponent_points']).astype(int)

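The following functions help create the `is_playoff_game` and `is_regular_game` columns. They rely on the season-type prefix at the start of each game ID, as described here:

https://github.com/gmf05/nba/blob/master/README.md

The first three characters of a game ID are the season prefix, so the functions below only need to check the third character. Season prefixes are...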
```
001 : Pre Season
002 : Regular Season
003 : All-Star
004 : Post Season
```

In [142]:
def is_regular_game(row):
    if str(row['game_id'])[2] == '2':
        return 1
    else:
        return 0

In [140]:
def is_playoff_game(row):
    if str(row['game_id'])[2] == '4':
        return 1
    else:
        return 0

In [143]:
games_df['is_playoff_game'] = games_df.apply(is_playoff_game, axis=1)
games_df['is_regular_game'] = games_df.apply(is_regular_game, axis=1)
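
The same two flags can also be computed without a row-wise `apply`, using pandas string methods on the whole column. The sketch below additionally zero-pads each ID to 10 characters first. That padding is an assumption (game IDs being 10 digits long), added because an ID that has lost its leading zeros would put a different digit at index 2. The frame `toy` and its IDs are made up.

```python
import pandas as pd

# Made-up IDs: regular season, playoffs, preseason, and a playoff ID
# that has lost its leading zeros by being stored as an integer.
toy = pd.DataFrame({"game_id": ["0020400001", "0040400407", "0010400010", 40400406]})

# Normalize to 10-character strings, then take the third character (season type).
game_type = toy["game_id"].astype(str).str.zfill(10).str[2]

toy["is_regular_game"] = game_type.eq("2").astype(int)
toy["is_playoff_game"] = game_type.eq("4").astype(int)

print(toy)
```

Without the padding, a regular-season ID stored as `20400001` would have `'4'` at index 2 and be flagged as a playoff game.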

In [146]:
games_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55314 entries, 0 to 55313
Data columns (total 72 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   game_id                                55314 non-null  object 
 1   season                                 55314 non-null  object 
 2   team_id                                55314 non-null  int64  
 3   team_name                              55287 non-null  object 
 4   tri_code                               55287 non-null  object 
 5   team_slug                              55287 non-null  object 
 6   minutes                                55287 non-null  object 
 7   field_goals_made                       55287 non-null  float64
 8   field_goals_attempted                  55287 non-null  float64
 9   field_goals_percentage                 55287 non-null  float64
 10  three_pointers_made                    55287 non-null  float64
 11  th

And finally, the export

In [147]:
games_df.to_csv('nba_games_merged_endpoints.csv', index=False)
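
One thing to watch when reading the exported file back: by default `pd.read_csv` parses an all-digit `game_id` as an integer, which drops the leading zeros. The sketch below round-trips a made-up one-row frame through an in-memory buffer (instead of the real CSV file) to show the difference that `dtype={"game_id": str}` makes.

```python
import io

import pandas as pd

# Made-up one-row frame standing in for games_df.
toy = pd.DataFrame({"game_id": ["0040400407"], "points": [81]})

# Write to an in-memory buffer rather than a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default read: game_id is inferred as an integer, so the leading zeros are lost.
buffer.seek(0)
default_read = pd.read_csv(buffer)

# Forcing game_id to str keeps the ID exactly as it was written.
buffer.seek(0)
string_read = pd.read_csv(buffer, dtype={"game_id": str})

print(default_read["game_id"].iloc[0])
print(string_read["game_id"].iloc[0])
```

Keeping `game_id` as a string matters for any later merge on `game_id` and for the season-prefix check above.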