nba-game-scraping-V5 used the boxscoretraditionalv3 endpoint. This notebook tries the boxscoreadvancedv3 and boxscoreplayertrackv3 endpoints instead.

# Prep work

## Extracting unique game_ids

Cobbled together from Trevor's nba-api-connection-2.ipynb

Uses the nba-api package to get a complete list of game_ids for the 2004-05 through 2023-24 seasons (so run this in the capstone2 environment)


This version won't have season ids, but since the plan is to merge on `GAME_ID`, that shouldn't matter
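
As a rough sketch of why the missing season ids should not matter: if another table carries `GAME_ID` together with a season column, a pandas merge on `GAME_ID` brings the season back. The two frames, their values, and the `SEASON_ID` column name below are made up for illustration and are not taken from the scraped data.

```python
import pandas as pd

# Hypothetical scraped rows keyed by GAME_ID (illustrative values only)
advanced = pd.DataFrame({
    "GAME_ID": ["0021800263", "0022300751"],
    "pace": [98.0, 98.0],
})

# Hypothetical lookup table that still carries season information
seasons = pd.DataFrame({
    "GAME_ID": ["0021800263", "0022300751", "0020900350"],
    "SEASON_ID": ["2018-19", "2023-24", "2009-10"],
})

# Keep GAME_ID as a string on both sides so leading zeros survive the join;
# validate= raises if either side has duplicate keys
merged = advanced.merge(seasons, on="GAME_ID", how="left", validate="one_to_one")
print(merged)
```

A left join keeps every scraped game even if the lookup table is missing a row for it.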

Import libraries and set up logging

In [1]:
from nba_api.stats.endpoints import LeagueGameFinder
import pandas as pd
import time
from datetime import datetime
from tqdm.auto import tqdm
import logging
import random

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.FileHandler('nba_data_collection.log'),
             logging.StreamHandler()]
)
logger = logging.getLogger(__name__)

def log_message(message):
    """Log message to both file and tqdm"""
    tqdm.write(message)
    logger.info(message)

Function to get all games for a single season 

In [2]:
def get_season_games(season):
    """Get games for a single season with error handling"""
    game_finder = LeagueGameFinder(
        season_nullable=season,
        league_id_nullable='00',
        timeout=60
    )
    
    # Get the raw response first
    response_frames = game_finder.get_data_frames()
    
    # Debug logging
    log_message(f"Response for season {season}: got {len(response_frames)} DataFrames")
    if not response_frames:
        raise ValueError(f"Empty response for season {season}")
    
    games = response_frames[0]
    if len(games) == 0:
        raise ValueError(f"No games found for season {season}")
        
    log_message(f"Retrieved {len(games)} game entries for season {season}")
    return games

Function that takes a `start_year` and an `end_year` and compiles the game_ids for every season starting in `start_year` through `end_year - 1` (so `2004, 2024` covers 2004-05 to 2023-24)

In [3]:
def compile_game_ids(start_year, end_year):
    game_ids = []
    seasons = [f"{year}-{str(year + 1)[-2:]}" for year in range(start_year, end_year)]

    for season in seasons:
        game_ids.extend(get_season_games(season)['GAME_ID'].unique())
        time.sleep(1)
    return game_ids
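
The season labels built inside `compile_game_ids` can be checked on their own. This small sketch repeats the same list comprehension in a helper called `season_labels` (a name introduced here only for illustration) and prints the first label, the last label, and the count.

```python
from typing import List


def season_labels(start_year: int, end_year: int) -> List[str]:
    """Build 'YYYY-YY' labels for seasons starting in start_year .. end_year - 1."""
    return [f"{year}-{str(year + 1)[-2:]}" for year in range(start_year, end_year)]


labels = season_labels(2004, 2024)
print(labels[0], labels[-1], len(labels))
```

Because `range` excludes `end_year`, the last season requested is the one that starts in `end_year - 1`.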

Establish `game_ids` list

In [14]:
game_ids = compile_game_ids(2004, 2024)
f"Game ids found from 2004 - 2024: {len(game_ids)}"

2025-02-16 14:13:49,695 - INFO - Response for season 2004-05: got 1 DataFrames
2025-02-16 14:13:49,729 - INFO - Retrieved 2728 game entries for season 2004-05


Response for season 2004-05: got 1 DataFrames
Retrieved 2728 game entries for season 2004-05


2025-02-16 14:13:51,303 - INFO - Response for season 2005-06: got 1 DataFrames
2025-02-16 14:13:51,304 - INFO - Retrieved 2871 game entries for season 2005-06


Response for season 2005-06: got 1 DataFrames
Retrieved 2871 game entries for season 2005-06


2025-02-16 14:13:53,148 - INFO - Response for season 2006-07: got 1 DataFrames
2025-02-16 14:13:53,151 - INFO - Retrieved 2867 game entries for season 2006-07


Response for season 2006-07: got 1 DataFrames
Retrieved 2867 game entries for season 2006-07


2025-02-16 14:13:55,163 - INFO - Response for season 2007-08: got 1 DataFrames
2025-02-16 14:13:55,164 - INFO - Retrieved 2852 game entries for season 2007-08


Response for season 2007-08: got 1 DataFrames
Retrieved 2852 game entries for season 2007-08


2025-02-16 14:13:57,139 - INFO - Response for season 2008-09: got 1 DataFrames
2025-02-16 14:13:57,148 - INFO - Retrieved 2866 game entries for season 2008-09


Response for season 2008-09: got 1 DataFrames
Retrieved 2866 game entries for season 2008-09


2025-02-16 14:13:58,597 - INFO - Response for season 2009-10: got 1 DataFrames
2025-02-16 14:13:58,599 - INFO - Retrieved 2871 game entries for season 2009-10


Response for season 2009-10: got 1 DataFrames
Retrieved 2871 game entries for season 2009-10


2025-02-16 14:14:00,004 - INFO - Response for season 2010-11: got 1 DataFrames
2025-02-16 14:14:00,005 - INFO - Retrieved 2866 game entries for season 2010-11


Response for season 2010-11: got 1 DataFrames
Retrieved 2866 game entries for season 2010-11


2025-02-16 14:14:01,460 - INFO - Response for season 2011-12: got 1 DataFrames
2025-02-16 14:14:01,462 - INFO - Retrieved 2214 game entries for season 2011-12


Response for season 2011-12: got 1 DataFrames
Retrieved 2214 game entries for season 2011-12


2025-02-16 14:14:04,947 - INFO - Response for season 2012-13: got 1 DataFrames
2025-02-16 14:14:04,950 - INFO - Retrieved 2866 game entries for season 2012-13


Response for season 2012-13: got 1 DataFrames
Retrieved 2866 game entries for season 2012-13


2025-02-16 14:14:06,879 - INFO - Response for season 2013-14: got 1 DataFrames
2025-02-16 14:14:06,880 - INFO - Retrieved 2874 game entries for season 2013-14


Response for season 2013-14: got 1 DataFrames
Retrieved 2874 game entries for season 2013-14


2025-02-16 14:14:08,329 - INFO - Response for season 2014-15: got 1 DataFrames
2025-02-16 14:14:08,330 - INFO - Retrieved 2864 game entries for season 2014-15


Response for season 2014-15: got 1 DataFrames
Retrieved 2864 game entries for season 2014-15


2025-02-16 14:14:10,242 - INFO - Response for season 2015-16: got 1 DataFrames
2025-02-16 14:14:10,242 - INFO - Retrieved 2856 game entries for season 2015-16


Response for season 2015-16: got 1 DataFrames
Retrieved 2856 game entries for season 2015-16


2025-02-16 14:14:11,638 - INFO - Response for season 2016-17: got 1 DataFrames
2025-02-16 14:14:11,639 - INFO - Retrieved 2829 game entries for season 2016-17


Response for season 2016-17: got 1 DataFrames
Retrieved 2829 game entries for season 2016-17


2025-02-16 14:14:13,064 - INFO - Response for season 2017-18: got 1 DataFrames
2025-02-16 14:14:13,065 - INFO - Retrieved 2785 game entries for season 2017-18


Response for season 2017-18: got 1 DataFrames
Retrieved 2785 game entries for season 2017-18


2025-02-16 14:14:14,931 - INFO - Response for season 2018-19: got 1 DataFrames
2025-02-16 14:14:14,931 - INFO - Retrieved 2788 game entries for season 2018-19


Response for season 2018-19: got 1 DataFrames
Retrieved 2788 game entries for season 2018-19


2025-02-16 14:14:16,829 - INFO - Response for season 2019-20: got 1 DataFrames
2025-02-16 14:14:16,829 - INFO - Retrieved 2516 game entries for season 2019-20


Response for season 2019-20: got 1 DataFrames
Retrieved 2516 game entries for season 2019-20


2025-02-16 14:14:18,198 - INFO - Response for season 2020-21: got 1 DataFrames
2025-02-16 14:14:18,199 - INFO - Retrieved 2442 game entries for season 2020-21


Response for season 2020-21: got 1 DataFrames
Retrieved 2442 game entries for season 2020-21


2025-02-16 14:14:19,618 - INFO - Response for season 2021-22: got 1 DataFrames
2025-02-16 14:14:19,619 - INFO - Retrieved 2788 game entries for season 2021-22


Response for season 2021-22: got 1 DataFrames
Retrieved 2788 game entries for season 2021-22


2025-02-16 14:14:21,088 - INFO - Response for season 2022-23: got 1 DataFrames
2025-02-16 14:14:21,089 - INFO - Retrieved 2790 game entries for season 2022-23


Response for season 2022-23: got 1 DataFrames
Retrieved 2790 game entries for season 2022-23


2025-02-16 14:14:22,245 - INFO - Response for season 2023-24: got 1 DataFrames
2025-02-16 14:14:22,247 - INFO - Retrieved 2795 game entries for season 2023-24


Response for season 2023-24: got 1 DataFrames
Retrieved 2795 game entries for season 2023-24


'Game ids found from 2004 - 2024: 27660'

## Imports and Helper functions for continual scraping

Import necessary libraries

In [4]:
import requests
import pickle
import time
from pathlib import Path
from typing import Dict, Optional, List
import logging
from tqdm import tqdm
from requests.exceptions import Timeout, RequestException
import random

Functions to load the pickle cache from disk and save it back to disk

In [5]:
def load_or_create_cache(cache_path: str) -> Dict[str, dict]:
    """Load existing cache or create new one if it doesn't exist."""
    if Path(cache_path).exists():
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    return {}

In [6]:
def save_cache(cache: Dict[str, dict], cache_path: str) -> None:
    """Save the cache to disk."""
    with open(cache_path, 'wb') as f:
        pickle.dump(cache, f)
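
One possible hardening, sketched under the assumption that a run may be interrupted while the cache is being written: write the pickle to a temporary file first and then swap it into place, so a half-written file never replaces a good one. `save_cache_atomic` and `load_cache` are hypothetical names used only in this sketch; `load_cache` mirrors `load_or_create_cache` above.

```python
import os
import pickle
import tempfile
from pathlib import Path
from typing import Dict


def save_cache_atomic(cache: Dict[str, dict], cache_path: str) -> None:
    """Hypothetical variant of save_cache: write to a temp file, then swap it in."""
    target = Path(cache_path)
    # Create the temp file next to the target so the rename stays on one filesystem
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(cache, f)
        os.replace(tmp_name, target)  # overwrites the old cache in a single step
    except BaseException:
        # Remove the partial temp file, then re-raise (covers KeyboardInterrupt too)
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def load_cache(cache_path: str) -> Dict[str, dict]:
    """Same behaviour as load_or_create_cache: empty dict if no file exists."""
    if Path(cache_path).exists():
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    return {}


# Round trip in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_cache.pkl")
    before = load_cache(path)
    save_cache_atomic({"0021800263": {"ok": True}}, path)
    after = load_cache(path)
print(before, after)
```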

# Scraping BoxScoreAdvancedV3

Function that uses BoxScoreAdvancedV3 to get info about a *single* game 

In [9]:
def create_row_boxscoreadvanced(game_id: str, timeout: int = 30, max_retries: int = 3) -> Optional[dict]:
    """
    Fetch game stats from NBA API with timeout handling and automatic retries.
    
    Args:
        game_id: NBA game identifier
        timeout: Request timeout in seconds
        max_retries: Maximum number of retry attempts for timeout errors
    """
    headers = {
        "Host": "stats.nba.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "x-nba-stats-origin": "stats",
        "x-nba-stats-token": "true",
        "Connection": "keep-alive",
        "Referer": "https://stats.nba.com/",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
    }
    
    #url = f'https://stats.nba.com/stats/boxscoretraditionalv3?EndPeriod=0&EndRange=0&GameID={'00'+str(game_id)}&RangeType=0&StartPeriod=0&StartRange=0'
    url = f'https://stats.nba.com/stats/boxscoreadvancedv3?EndPeriod=0&EndRange=0&GameID={game_id}&RangeType=0&StartPeriod=0&StartRange=0'
    #url = f'https://stats.nba.com/stats/boxscoretraditionalv3?EndPeriod=0&EndRange=0&GameID={game_id}&RangeType=0&StartPeriod=0&StartRange=0'
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.json()
            
        except Timeout:
            wait_time = (attempt + 1) * 5 + random.uniform(1, 3)  # Linear backoff (5s, 10s, 15s) plus 1-3s of jitter
            logging.warning(f"Timeout for game {game_id} (attempt {attempt + 1}/{max_retries}). "
                          f"Waiting {wait_time:.1f} seconds before retry...")
            time.sleep(wait_time)
            
        except RequestException as e:
            logging.error(f"Error fetching game {game_id}: {str(e)}")
            return None
            
    logging.error(f"Max retries ({max_retries}) reached for game {game_id}")
    return None
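
The retry wait in the function above grows by a fixed 5 seconds per attempt, which is a linear schedule. The sketch below compares it, without the random jitter, to a doubling schedule. `exponential_wait` is a hypothetical alternative shown only for comparison and is not used anywhere in this notebook.

```python
def linear_wait(attempt: int, base: float = 5.0) -> float:
    """Wait used in the function above, minus jitter: 5, 10, 15, ..."""
    return (attempt + 1) * base


def exponential_wait(attempt: int, base: float = 5.0) -> float:
    """Hypothetical alternative that doubles each time: 5, 10, 20, ..."""
    return base * 2 ** attempt


linear = [linear_wait(a) for a in range(3)]
exponential = [exponential_wait(a) for a in range(3)]
print(linear, exponential)
```

With only three attempts the two schedules differ by a few seconds, so the choice mostly matters if `max_retries` is raised.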

Function that uses the above to scrape info for all the game ids. Also incorporates logging, a tqdm progress bar, and a pickle cache (in case of a crash)

In [10]:
def scrape_game_stats_boxscoreadvanced(game_ids: List[str], cache_path: str = 'nba_games_cache.pkl', 
                     delay: float = 1.0, save_frequency: int = 10,
                     timeout: int = 30, max_retries: int = 3) -> Dict[str, dict]:
    """
    Scrape game stats with progress bar, timeout handling, and automatic retries.
    
    Args:
        game_ids: List of NBA game IDs to scrape
        cache_path: Path to save/load the pickle cache file
        delay: Time to wait between requests in seconds
        save_frequency: How often to save the cache (every N successful requests)
        timeout: Request timeout in seconds
        max_retries: Maximum number of retry attempts for timeout errors
    
    Returns:
        Dictionary mapping game IDs to their stats data
    """
    # Setup logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )
    
    # Load existing cache
    cache = load_or_create_cache(cache_path)
    logging.info(f"Loaded cache with {len(cache)} existing games")
    
    # Filter out already cached games
    games_to_scrape = [gid for gid in game_ids if gid not in cache]
    logging.info(f"Found {len(games_to_scrape)} new games to scrape")
    
    successful_requests = 0
    
    # Create progress bar
    pbar = tqdm(games_to_scrape, desc="Scraping games", unit="game")
    
    for game_id in pbar:
        try:
            # Update progress bar description with current game
            pbar.set_description(f"Scraping game {game_id}")
            
            # Fetch game data with timeout handling and retries
            game_data = create_row_boxscoreadvanced(game_id, timeout=timeout, max_retries=max_retries)
            
            if game_data is not None:
                cache[game_id] = game_data
                successful_requests += 1
                
                # Update progress bar postfix with success count
                pbar.set_postfix(
                    successful=successful_requests,
                    cached_total=len(cache)
                )
                
                # Save periodically
                if successful_requests % save_frequency == 0:
                    save_cache(cache, cache_path)
                    logging.info(f"Saved cache with {len(cache)} games")
            
            # Wait between requests
            time.sleep(delay)
            
        except Exception as e:
            logging.error(f"Unexpected error processing game {game_id}: {str(e)}")
            # Save cache on error to preserve progress
            save_cache(cache, cache_path)
            logging.info("Saved cache due to error")
            raise
    
    # Close progress bar
    pbar.close()
    
    # Final save
    save_cache(cache, cache_path)
    logging.info(f"Scraping completed. Final cache contains {len(cache)} games")
    
    return cache
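
Resuming after a crash relies on the membership test against the cache keys. The sketch below isolates that filter in a helper called `pending_games` (a hypothetical name) and shows one thing to watch for: the keys are strings, so an id passed as an integer would never match and would be scraped again.

```python
from typing import Dict, List


def pending_games(game_ids: List[str], cache: Dict[str, dict]) -> List[str]:
    """Return ids not yet cached, preserving order (same filter as above)."""
    return [gid for gid in game_ids if gid not in cache]


cache = {"0021800263": {"meta": {}}, "0021800862": {"meta": {}}}
ids = ["0021800263", "0021800862", "0022300751"]

# Only the uncached id remains
remaining = pending_games(ids, cache)
print(remaining)

# Pitfall: an int id never equals a str key, so it is treated as new
mismatched = pending_games([21800263], cache)
print(mismatched)
```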

Run the scraping with a 0.5 s request delay (use 1 s if running into excessive timeout errors or notable skips)

In [None]:
# Since I plan on running for ~20 min as a trial run, I want to randomize the game ids as a way to maximize the chance of running into bugs 
#random.shuffle(game_ids)

In [16]:
try:
    game_stats_boxscoreadvanced = scrape_game_stats_boxscoreadvanced(
        game_ids=game_ids,
        cache_path='nba_games_cache_boxscoreadvanced.pkl',
        delay=0.5,  # 0.5 second delay between requests
        save_frequency=10,  # Save every 10 successful requests
        timeout=30,  # 30 second timeout
        max_retries=3  # Retry up to 3 times on timeout
    )
except Exception as e:
    logging.error(f"Scraping stopped due to error: {str(e)}")

2025-02-16 14:14:37,446 - INFO - Loaded cache with 0 existing games
2025-02-16 14:14:37,450 - INFO - Found 27660 new games to scrape
Scraping game 0021800631:   0%|          | 9/27660 [00:16<12:13:44,  1.59s/game, cached_total=10, successful=10]2025-02-16 14:14:54,035 - INFO - Saved cache with 10 games
Scraping game 0022101044:   0%|          | 19/27660 [00:31<11:48:15,  1.54s/game, cached_total=20, successful=20]2025-02-16 14:15:09,314 - INFO - Saved cache with 20 games
Scraping game 0021300243:   0%|          | 29/27660 [00:45<10:43:29,  1.40s/game, cached_total=30, successful=30]2025-02-16 14:15:22,586 - INFO - Saved cache with 30 games
Scraping game 0021701214:   0%|          | 39/27660 [01:05<17:55:56,  2.34s/game, cached_total=40, successful=40]2025-02-16 14:15:42,568 - INFO - Saved cache with 40 games
Scraping game 0020500455:   0%|          | 49/27660 [01:23<14:20:19,  1.87s/game, cached_total=50, successful=50]2025-02-16 14:16:00,927 - INFO - Saved cache with 50 games
Scraping

KeyboardInterrupt: 
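
The progress bar above reports between about 1.4 and 2.3 seconds per game, so a full pass over all 27660 games takes many hours. A back-of-the-envelope estimate, using rates read off that output and a helper name (`estimate_hours`) introduced only for this sketch:

```python
def estimate_hours(n_games: int, seconds_per_game: float) -> float:
    """Total wall-clock hours if every game takes seconds_per_game on average."""
    return n_games * seconds_per_game / 3600


# Rates taken from the tqdm output of the trial run
for rate in (1.40, 1.59, 2.34):
    print(f"{rate:.2f} s/game -> {estimate_hours(27660, rate):.1f} h")
```

The estimate ignores retries and timeouts, so it is best read as a lower bound.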

# Basic eval of trial run

In [None]:
import pandas as pd
import pickle

# Load the cache written by the trial run
with open('nba_games_cache_boxscoreadvanced.pkl', 'rb') as f:
    cache = pickle.load(f)

all_games = []
num_skipped = 0
for game_id, game_data in cache.items():
    try:
        if game_data is None:
            raise ValueError(f"No data cached for game {game_id}")
        # Flatten the nested JSON response into a one-row DataFrame
        all_games.append(pd.json_normalize(game_data))
    except Exception:
        # Count entries that are missing or cannot be flattened, then move on
        num_skipped += 1
print(f'Num skipped: {num_skipped}')
game_level_dataset = pd.concat(all_games, ignore_index=True)
print(f'shape of dataset: {game_level_dataset.shape}')
game_level_dataset.head()

Num skipped: 0
shape of dataset: (810, 68)


| | meta.version | meta.request | meta.time | boxScoreAdvanced.gameId | boxScoreAdvanced.awayTeamId | boxScoreAdvanced.homeTeamId | boxScoreAdvanced.homeTeam.teamId | boxScoreAdvanced.homeTeam.teamCity | boxScoreAdvanced.homeTeam.teamName | boxScoreAdvanced.homeTeam.teamTricode | ... | boxScoreAdvanced.awayTeam.statistics.trueShootingPercentage | boxScoreAdvanced.awayTeam.statistics.usagePercentage | boxScoreAdvanced.awayTeam.statistics.estimatedUsagePercentage | boxScoreAdvanced.awayTeam.statistics.estimatedPace | boxScoreAdvanced.awayTeam.statistics.pace | boxScoreAdvanced.awayTeam.statistics.pacePer40 | boxScoreAdvanced.awayTeam.statistics.possessions | boxScoreAdvanced.awayTeam.statistics.PIE | boxScoreAdvanced.homeTeam.statistics | boxScoreAdvanced.awayTeam.statistics |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | http://nba.cloud/games/0021800263/boxscoreadva... | 2025-02-16T15:14:38.1438Z | 21800263 | 1610612758 | 1610612762 | 1610612762 | Utah | Jazz | UTA | ... | 0.611 | 1.0 | 0.197 | 100.0 | 98.0 | 81.67 | 97.0 | 0.544 | | |
| 1 | 1 | http://nba.cloud/games/0021800862/boxscoreadva... | 2025-02-16T15:14:41.1441Z | 21800862 | 1610612744 | 1610612757 | 1610612757 | Portland | Trail Blazers | POR | ... | 0.51 | 1.0 | 0.198 | 96.94 | 94.5 | 78.75 | 94.0 | 0.406 | | |
| 2 | 1 | http://nba.cloud/games/0022300751/boxscoreadva... | 2025-02-16T15:14:43.1443Z | 22300751 | 1610612763 | 1610612766 | 1610612766 | Charlotte | Hornets | CHA | ... | 0.555 | 1.0 | 0.199 | 98.8 | 98.0 | 81.67 | 97.0 | 0.441 | | |
| 3 | 1 | http://nba.cloud/games/0020900350/boxscoreadva... | 2025-02-16T15:14:44.1444Z | 20900350 | 1610612764 | 1610612746 | 1610612746 | Los Angeles | Clippers | LAC | ... | 0.488 | 1.0 | 0.198 | 100.48 | 98.5 | 82.08 | 99.0 | 0.467 | | |
| 4 | 1 | http://nba.cloud/games/0020900869/boxscoreadva... | 2025-02-16T15:14:46.1446Z | 20900869 | 1610612755 | 1610612747 | 1610612747 | Los Angeles | Lakers | LAL | ... | 0.47 | 1.0 | 0.2 | 92.5 | 90.0 | 75.0 | 90.0 | 0.472 | | |


The 810 rows match the 810 successful requests made to boxscoreadvancedv3
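
To make the flattening step concrete, here is a small sketch of how `pd.json_normalize` treats nested data. The payload is a hypothetical, heavily trimmed stand-in for a real response: the keys under `homeTeam` (including the `players` list) are assumptions chosen to show the nesting pattern, not a description of the actual API schema.

```python
import pandas as pd

# Hypothetical, heavily trimmed payload; only the nesting pattern matters here
payload = {
    "meta": {"version": 1},
    "boxScoreAdvanced": {
        "gameId": "0021800263",
        "homeTeam": {
            "teamTricode": "UTA",
            "statistics": {"pace": 98.0},
            "players": [
                {"personId": 1, "pace": 97.5},
                {"personId": 2, "pace": 99.0},
            ],
        },
    },
}

# Nested dicts become dotted column names; the players list stays in one cell
game_row = pd.json_normalize(payload)
print(sorted(game_row.columns))

# record_path expands the list into one row per player, tagged with the game id
players = pd.json_normalize(
    payload,
    record_path=["boxScoreAdvanced", "homeTeam", "players"],
    meta=[["boxScoreAdvanced", "gameId"]],
)
print(players)
```

The first call yields one row per game, which is the shape used in the trial evaluation above; the second shows one way to get player-level rows if they are needed later.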

In [19]:
game_level_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 810 entries, 0 to 809
Data columns (total 68 columns):
 #   Column                                                                Non-Null Count  Dtype  
---  ------                                                                --------------  -----  
 0   meta.version                                                          810 non-null    int64  
 1   meta.request                                                          810 non-null    object 
 2   meta.time                                                             810 non-null    object 
 3   boxScoreAdvanced.gameId                                               810 non-null    object 
 4   boxScoreAdvanced.awayTeamId                                           810 non-null    int64  
 5   boxScoreAdvanced.homeTeamId                                           810 non-null    int64  
 6   boxScoreAdvanced.homeTeam.teamId                                      810 non-null    int64  
 7  

# Scraping BoxScorePlayerTrackV3

Copied the code from Scraping BoxScoreAdvancedV3 and modified it accordingly

Function that uses BoxScorePlayerTrackV3 to get info about a *single* game 

In [11]:
def create_row_playertrack(game_id: str, timeout: int = 30, max_retries: int = 3) -> Optional[dict]:
    """
    Fetch game stats from NBA API with timeout handling and automatic retries.
    
    Args:
        game_id: NBA game identifier
        timeout: Request timeout in seconds
        max_retries: Maximum number of retry attempts for timeout errors
    """
    headers = {
        "Host": "stats.nba.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "x-nba-stats-origin": "stats",
        "x-nba-stats-token": "true",
        "Connection": "keep-alive",
        "Referer": "https://stats.nba.com/",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
    }
    
    #url = f'https://stats.nba.com/stats/boxscoretraditionalv3?EndPeriod=0&EndRange=0&GameID={'00'+str(game_id)}&RangeType=0&StartPeriod=0&StartRange=0'
    url = f'https://stats.nba.com/stats/boxscoreplayertrackv3?GameID={game_id}'
    #url = f'https://stats.nba.com/stats/boxscoreadvancedv3?EndPeriod=0&EndRange=0&GameID={game_id}&RangeType=0&StartPeriod=0&StartRange=0'
    #url = f'https://stats.nba.com/stats/boxscoretraditionalv3?EndPeriod=0&EndRange=0&GameID={game_id}&RangeType=0&StartPeriod=0&StartRange=0'
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.json()
            
        except Timeout:
            wait_time = (attempt + 1) * 5 + random.uniform(1, 3)  # Linear backoff (5s, 10s, 15s) plus 1-3s of jitter
            logging.warning(f"Timeout for game {game_id} (attempt {attempt + 1}/{max_retries}). "
                          f"Waiting {wait_time:.1f} seconds before retry...")
            time.sleep(wait_time)
            
        except RequestException as e:
            logging.error(f"Error fetching game {game_id}: {str(e)}")
            return None
            
    logging.error(f"Max retries ({max_retries}) reached for game {game_id}")
    return None
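
The two single-game functions in this notebook differ only in the URL they request. One way to avoid the copy-paste, sketched here with hypothetical names (`ENDPOINT_PARAMS`, `build_boxscore_url`), is to keep the query parameters per endpoint in a table and build the URL with the standard library. The parameter sets are copied from the two URLs already used in this notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://stats.nba.com/stats"

# Query parameters copied from the two URLs used in this notebook
ENDPOINT_PARAMS = {
    "boxscoreadvancedv3": {
        "EndPeriod": 0, "EndRange": 0, "GameID": None,
        "RangeType": 0, "StartPeriod": 0, "StartRange": 0,
    },
    "boxscoreplayertrackv3": {"GameID": None},
}


def build_boxscore_url(endpoint: str, game_id: str) -> str:
    """Hypothetical helper: fill GameID into the endpoint's parameter template."""
    # Overwriting an existing key keeps its position, so parameter order is stable
    params = dict(ENDPOINT_PARAMS[endpoint], GameID=game_id)
    return f"{BASE_URL}/{endpoint}?{urlencode(params)}"


advanced_url = build_boxscore_url("boxscoreadvancedv3", "0021800263")
playertrack_url = build_boxscore_url("boxscoreplayertrackv3", "0021800263")
print(advanced_url)
print(playertrack_url)
```

A single fetch function could then take the endpoint name as an argument instead of hard-coding the URL.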

Function that uses the above to scrape info for all the game ids. Also incorporates logging, a tqdm progress bar, and a pickle cache (in case of a crash)

In [12]:
def scrape_game_stats_playertrack(game_ids: List[str], cache_path: str = 'nba_games_cache.pkl', 
                     delay: float = 1.0, save_frequency: int = 10,
                     timeout: int = 30, max_retries: int = 3) -> Dict[str, dict]:
    """
    Scrape game stats with progress bar, timeout handling, and automatic retries.
    
    Args:
        game_ids: List of NBA game IDs to scrape
        cache_path: Path to save/load the pickle cache file
        delay: Time to wait between requests in seconds
        save_frequency: How often to save the cache (every N successful requests)
        timeout: Request timeout in seconds
        max_retries: Maximum number of retry attempts for timeout errors
    
    Returns:
        Dictionary mapping game IDs to their stats data
    """
    # Setup logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )
    
    # Load existing cache
    cache = load_or_create_cache(cache_path)
    logging.info(f"Loaded cache with {len(cache)} existing games")
    
    # Filter out already cached games
    games_to_scrape = [gid for gid in game_ids if gid not in cache]
    logging.info(f"Found {len(games_to_scrape)} new games to scrape")
    
    successful_requests = 0
    
    # Create progress bar
    pbar = tqdm(games_to_scrape, desc="Scraping games", unit="game")
    
    for game_id in pbar:
        try:
            # Update progress bar description with current game
            pbar.set_description(f"Scraping game {game_id}")
            
            # Fetch game data with timeout handling and retries
            game_data = create_row_playertrack(game_id, timeout=timeout, max_retries=max_retries)
            
            if game_data is not None:
                cache[game_id] = game_data
                successful_requests += 1
                
                # Update progress bar postfix with success count
                pbar.set_postfix(
                    successful=successful_requests,
                    cached_total=len(cache)
                )
                
                # Save periodically
                if successful_requests % save_frequency == 0:
                    save_cache(cache, cache_path)
                    logging.info(f"Saved cache with {len(cache)} games")
            
            # Wait between requests
            time.sleep(delay)
            
        except Exception as e:
            logging.error(f"Unexpected error processing game {game_id}: {str(e)}")
            # Save cache on error to preserve progress
            save_cache(cache, cache_path)
            logging.info("Saved cache due to error")
            raise
    
    # Close progress bar
    pbar.close()
    
    # Final save
    save_cache(cache, cache_path)
    logging.info(f"Scraping completed. Final cache contains {len(cache)} games")
    
    return cache

Run the scraping with a 0.5 s request delay (use 1 s if running into excessive timeout errors or notable skips)

In [17]:
# Since I plan on running for ~20 min as a trial run, I want to randomize the game ids as a way to maximize the chance of running into bugs 
random.shuffle(game_ids)
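
If a trial run needs to be repeated on the same random sample of games, a seeded shuffle of a copy is one option. This is only a sketch: `shuffled_copy` and the seed value are illustrative choices, and the notebook itself shuffles `game_ids` in place without a seed.

```python
import random
from typing import List


def shuffled_copy(ids: List[str], seed: int) -> List[str]:
    """Return a shuffled copy of ids; the same seed gives the same order."""
    rng = random.Random(seed)  # private generator, leaves the global state alone
    out = list(ids)
    rng.shuffle(out)
    return out


original = ["0021800263", "0021800862", "0022300751", "0020900350", "0020900869"]
first = shuffled_copy(original, seed=42)
second = shuffled_copy(original, seed=42)
print(first)
```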

In [None]:
try:
    game_stats_playertrack = scrape_game_stats_playertrack(
        game_ids=game_ids,
        cache_path='nba_games_cache_playertrack.pkl',
        delay=0.5,  # 0.5 second delay between requests
        save_frequency=10,  # Save every 10 successful requests
        timeout=30,  # 30 second timeout
        max_retries=3  # Retry up to 3 times on timeout
    )
except Exception as e:
    logging.error(f"Scraping stopped due to error: {str(e)}")