v4 works, so v5 is a cleaned-up version


To get around the timeout errors we hit in prior scraping attempts, this approach continually writes results to a pkl file as the scraping progresses. That way an error does not invalidate the progress already made, and the saved file gives the next attempt a starting point.

# Scraping games to a pkl

## Extracting unique game_ids 

~~~from the shots dataset~~~

In [None]:
# import pandas as pd
# df = pd.read_pickle('all-shots.pkl') 
# df = df.to_pandas(use_pyarrow_extension_array=True)
# print(df['GAME_ID'].nunique())
# game_ids = df['GAME_ID'].unique()

Cobbled together from Trevor's nba-api-connection-2.ipynb

Uses the nba-api package to get a complete list of game_ids for the 2004-05 through 2023-24 seasons (so run this in the capstone2 environment)

In [2]:
from nba_api.stats.endpoints import LeagueGameFinder, BoxScoreTraditionalV2
import pandas as pd
import time
from datetime import datetime
from tqdm.auto import tqdm
import logging
import random

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[logging.FileHandler('nba_data_collection.log'),
             logging.StreamHandler()]
)
logger = logging.getLogger(__name__)

def log_message(message):
    """Log message to both file and tqdm"""
    tqdm.write(message)
    logger.info(message)

In [3]:
def get_season_games(season):
    """Get games for a single season with error handling"""
    game_finder = LeagueGameFinder(
        season_nullable=season,
        league_id_nullable='00',
        timeout=60
    )
    
    # Get the raw response first
    response_frames = game_finder.get_data_frames()
    
    # Debug logging
    log_message(f"Response for season {season}: got {len(response_frames)} DataFrames")
    if not response_frames:
        raise ValueError(f"Empty response for season {season}")
    
    games = response_frames[0]
    if len(games) == 0:
        raise ValueError(f"No games found for season {season}")
        
    log_message(f"Retrieved {len(games)} game entries for season {season}")
    return games

In [4]:
def compile_game_ids(start_year, end_year):
    game_ids = []
    seasons = [f"{year}-{str(year + 1)[-2:]}" for year in range(start_year, end_year)]

    for season in seasons:
        game_ids.extend(get_season_games(season)['GAME_ID'].unique())
        time.sleep(1)
    return game_ids
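
As a quick illustration of the season labels that `compile_game_ids` builds, here is a minimal standalone sketch. The helper name `season_labels` is hypothetical (it is not part of the notebook or of nba-api); it only isolates the list comprehension used above.

```python
from typing import List

def season_labels(start_year: int, end_year: int) -> List[str]:
    """Build 'YYYY-YY' labels for start_year up to, but not including, end_year."""
    return [f"{year}-{str(year + 1)[-2:]}" for year in range(start_year, end_year)]

labels = season_labels(2004, 2024)
print(labels[0], labels[-1], len(labels))  # 2004-05 2023-24 20
```

Since `range` excludes `end_year`, calling it with 2024 stops at the 2023-24 season, which matches the last season in the log output.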

In [5]:
game_ids = compile_game_ids(2004, 2024)
f"Game ids found from 2004 - 2024: {len(game_ids)}"

2025-02-07 11:18:01,104 - INFO - Response for season 2004-05: got 1 DataFrames
2025-02-07 11:18:01,106 - INFO - Retrieved 2728 game entries for season 2004-05


Response for season 2004-05: got 1 DataFrames
Retrieved 2728 game entries for season 2004-05


2025-02-07 11:18:02,185 - INFO - Response for season 2005-06: got 1 DataFrames
2025-02-07 11:18:02,187 - INFO - Retrieved 2871 game entries for season 2005-06


Response for season 2005-06: got 1 DataFrames
Retrieved 2871 game entries for season 2005-06


2025-02-07 11:18:03,264 - INFO - Response for season 2006-07: got 1 DataFrames
2025-02-07 11:18:03,270 - INFO - Retrieved 2867 game entries for season 2006-07


Response for season 2006-07: got 1 DataFrames
Retrieved 2867 game entries for season 2006-07


2025-02-07 11:18:04,340 - INFO - Response for season 2007-08: got 1 DataFrames
2025-02-07 11:18:04,341 - INFO - Retrieved 2852 game entries for season 2007-08


Response for season 2007-08: got 1 DataFrames
Retrieved 2852 game entries for season 2007-08


2025-02-07 11:18:05,422 - INFO - Response for season 2008-09: got 1 DataFrames
2025-02-07 11:18:05,423 - INFO - Retrieved 2866 game entries for season 2008-09


Response for season 2008-09: got 1 DataFrames
Retrieved 2866 game entries for season 2008-09


2025-02-07 11:18:06,513 - INFO - Response for season 2009-10: got 1 DataFrames
2025-02-07 11:18:06,517 - INFO - Retrieved 2871 game entries for season 2009-10


Response for season 2009-10: got 1 DataFrames
Retrieved 2871 game entries for season 2009-10


2025-02-07 11:18:07,608 - INFO - Response for season 2010-11: got 1 DataFrames
2025-02-07 11:18:07,610 - INFO - Retrieved 2866 game entries for season 2010-11


Response for season 2010-11: got 1 DataFrames
Retrieved 2866 game entries for season 2010-11


2025-02-07 11:18:08,681 - INFO - Response for season 2011-12: got 1 DataFrames
2025-02-07 11:18:08,682 - INFO - Retrieved 2214 game entries for season 2011-12


Response for season 2011-12: got 1 DataFrames
Retrieved 2214 game entries for season 2011-12


2025-02-07 11:18:09,825 - INFO - Response for season 2012-13: got 1 DataFrames
2025-02-07 11:18:09,825 - INFO - Retrieved 2866 game entries for season 2012-13


Response for season 2012-13: got 1 DataFrames
Retrieved 2866 game entries for season 2012-13


2025-02-07 11:18:10,921 - INFO - Response for season 2013-14: got 1 DataFrames
2025-02-07 11:18:10,922 - INFO - Retrieved 2874 game entries for season 2013-14


Response for season 2013-14: got 1 DataFrames
Retrieved 2874 game entries for season 2013-14


2025-02-07 11:18:12,009 - INFO - Response for season 2014-15: got 1 DataFrames
2025-02-07 11:18:12,010 - INFO - Retrieved 2864 game entries for season 2014-15


Response for season 2014-15: got 1 DataFrames
Retrieved 2864 game entries for season 2014-15


2025-02-07 11:18:13,081 - INFO - Response for season 2015-16: got 1 DataFrames
2025-02-07 11:18:13,082 - INFO - Retrieved 2856 game entries for season 2015-16


Response for season 2015-16: got 1 DataFrames
Retrieved 2856 game entries for season 2015-16


2025-02-07 11:18:14,213 - INFO - Response for season 2016-17: got 1 DataFrames
2025-02-07 11:18:14,214 - INFO - Retrieved 2829 game entries for season 2016-17


Response for season 2016-17: got 1 DataFrames
Retrieved 2829 game entries for season 2016-17


2025-02-07 11:18:15,284 - INFO - Response for season 2017-18: got 1 DataFrames
2025-02-07 11:18:15,285 - INFO - Retrieved 2785 game entries for season 2017-18


Response for season 2017-18: got 1 DataFrames
Retrieved 2785 game entries for season 2017-18


2025-02-07 11:18:16,357 - INFO - Response for season 2018-19: got 1 DataFrames
2025-02-07 11:18:16,359 - INFO - Retrieved 2788 game entries for season 2018-19


Response for season 2018-19: got 1 DataFrames
Retrieved 2788 game entries for season 2018-19


2025-02-07 11:18:17,433 - INFO - Response for season 2019-20: got 1 DataFrames
2025-02-07 11:18:17,434 - INFO - Retrieved 2516 game entries for season 2019-20


Response for season 2019-20: got 1 DataFrames
Retrieved 2516 game entries for season 2019-20


2025-02-07 11:18:18,521 - INFO - Response for season 2020-21: got 1 DataFrames
2025-02-07 11:18:18,522 - INFO - Retrieved 2442 game entries for season 2020-21


Response for season 2020-21: got 1 DataFrames
Retrieved 2442 game entries for season 2020-21


2025-02-07 11:18:19,591 - INFO - Response for season 2021-22: got 1 DataFrames
2025-02-07 11:18:19,592 - INFO - Retrieved 2788 game entries for season 2021-22


Response for season 2021-22: got 1 DataFrames
Retrieved 2788 game entries for season 2021-22


2025-02-07 11:18:20,683 - INFO - Response for season 2022-23: got 1 DataFrames
2025-02-07 11:18:20,684 - INFO - Retrieved 2790 game entries for season 2022-23


Response for season 2022-23: got 1 DataFrames
Retrieved 2790 game entries for season 2022-23


2025-02-07 11:18:21,760 - INFO - Response for season 2023-24: got 1 DataFrames
2025-02-07 11:18:21,760 - INFO - Retrieved 2795 game entries for season 2023-24


Response for season 2023-24: got 1 DataFrames
Retrieved 2795 game entries for season 2023-24


'Game ids found from 2004 - 2024: 27660'

In [10]:
game_ids[0:5]

['0040400407', '0040400406', '0040400405', '0040400404', '0040400403']

## Continual scraping 
featuring
- pkl cache
- tqdm + logging
- retries on timeout, with increasing waits plus jitter
- manual timeout waits
- ability to start where the last attempt stopped (it hasn't crashed yet, so this is untested)


Note that the 'cache' is basically a dictionary with game_ids as keys and endpoint response jsons as values  
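
Resuming comes down to skipping any game_id that is already a key in the cache. A minimal sketch with toy IDs and a placeholder payload (nothing here calls the NBA API, and `remaining_ids` is a hypothetical helper name):

```python
from typing import Dict, List

def remaining_ids(game_ids: List[str], cache: Dict[str, dict]) -> List[str]:
    """Return the game_ids that are not yet in the cache, preserving order."""
    return [gid for gid in game_ids if gid not in cache]

# Toy data: pretend a previous run stored two of four games before stopping.
all_ids = ['0040400407', '0040400406', '0040400405', '0040400404']
toy_cache = {
    '0040400407': {'placeholder': 'response json'},
    '0040400406': {'placeholder': 'response json'},
}
todo = remaining_ids(all_ids, toy_cache)
print(todo)  # ['0040400405', '0040400404']
```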

In [6]:
import requests
import pickle
import time
from pathlib import Path
from typing import Dict, Optional, List
import logging
from tqdm import tqdm
from requests.exceptions import Timeout, RequestException
import random

### helper methods

#### load/save the pkl cache

In [7]:
def load_or_create_cache(cache_path: str) -> Dict[str, dict]:
    """Load existing cache or create new one if it doesn't exist."""
    if Path(cache_path).exists():
        with open(cache_path, 'rb') as f:
            return pickle.load(f)
    return {}

In [8]:
def save_cache(cache: Dict[str, dict], cache_path: str) -> None:
    """Save the cache to disk."""
    with open(cache_path, 'wb') as f:
        pickle.dump(cache, f)
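
One possible hardening, not used in this notebook: if the process dies while `pickle.dump` is writing, the pkl file itself can end up truncated. A sketch of a write-to-temp-then-rename variant, assuming the temp file and the cache live in the same directory (`save_cache_atomic` is a hypothetical name):

```python
import os
import pickle
import tempfile
from typing import Dict

def save_cache_atomic(cache: Dict[str, dict], cache_path: str) -> None:
    """Write the cache to a temp file first, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(cache, f)
        # os.replace overwrites the destination in a single step on POSIX
        os.replace(tmp_path, cache_path)
    except BaseException:
        # Remove the partial temp file, then re-raise the original error
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Round-trip check in a throwaway directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo_cache.pkl')
    save_cache_atomic({'0040400407': {'placeholder': 'response json'}}, demo_path)
    with open(demo_path, 'rb') as f:
        restored = pickle.load(f)
    leftovers = [name for name in os.listdir(tmp_dir) if name.endswith('.tmp')]
print(restored)
```

With this approach the previous cache file stays intact until the new one has been fully written.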

#### scrape for a single game row

In [11]:
def create_game_level_row(game_id: str, timeout: int = 30, max_retries: int = 3) -> Optional[dict]:
    """
    Fetch game stats from NBA API with timeout handling and automatic retries.
    
    Args:
        game_id: NBA game identifier
        timeout: Request timeout in seconds
        max_retries: Maximum number of retry attempts for timeout errors
    """
    headers = {
        "Host": "stats.nba.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:72.0) Gecko/20100101 Firefox/72.0",
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate, br",
        "x-nba-stats-origin": "stats",
        "x-nba-stats-token": "true",
        "Connection": "keep-alive",
        "Referer": "https://stats.nba.com/",
        "Pragma": "no-cache",
        "Cache-Control": "no-cache",
    }
    
    #url = f'https://stats.nba.com/stats/boxscoretraditionalv3?EndPeriod=0&EndRange=0&GameID={'00'+str(game_id)}&RangeType=0&StartPeriod=0&StartRange=0'
    url = f'https://stats.nba.com/stats/boxscoretraditionalv3?EndPeriod=0&EndRange=0&GameID={game_id}&RangeType=0&StartPeriod=0&StartRange=0'
    
    for attempt in range(max_retries):
        try:
            response = requests.get(url, headers=headers, timeout=timeout)
            response.raise_for_status()
            return response.json()
            
        except Timeout:
            wait_time = (attempt + 1) * 5 + random.uniform(1, 3)  # Linearly increasing backoff (5s, 10s, 15s) with jitter
            logging.warning(f"Timeout for game {game_id} (attempt {attempt + 1}/{max_retries}). "
                          f"Waiting {wait_time:.1f} seconds before retry...")
            time.sleep(wait_time)
            
        except RequestException as e:
            logging.error(f"Error fetching game {game_id}: {str(e)}")
            return None
            
    logging.error(f"Max retries ({max_retries}) reached for game {game_id}")
    return None
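
The wait in `create_game_level_row` grows linearly with the attempt number (5s, 10s, 15s, plus 1 to 3s of jitter). If a truly exponential schedule is wanted, one possible variant is sketched below. `backoff_wait`, `base`, and `cap` are hypothetical names and values, not something the notebook uses.

```python
import random

def backoff_wait(attempt: int, base: float = 5.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2**attempt seconds, capped, plus 1-3s of jitter."""
    return min(cap, base * (2 ** attempt)) + random.uniform(1, 3)

random.seed(0)  # only to make the demo repeatable
waits = [backoff_wait(attempt) for attempt in range(4)]
print([round(w, 1) for w in waits])  # roughly 5, 10, 20, 40 seconds plus jitter
```

The cap keeps a long run of failures from producing multi-minute sleeps; whether 60s is a sensible ceiling for stats.nba.com is an open question.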

#### scrape for all games

In [12]:
def scrape_game_stats(game_ids: List[str], cache_path: str = 'nba_games_cache.pkl', 
                     delay: float = 1.0, save_frequency: int = 10,
                     timeout: int = 30, max_retries: int = 3) -> Dict[str, dict]:
    """
    Scrape game stats with progress bar, timeout handling, and automatic retries.
    
    Args:
        game_ids: List of NBA game IDs to scrape
        cache_path: Path to save/load the pickle cache file
        delay: Time to wait between requests in seconds
        save_frequency: How often to save the cache (every N successful requests)
        timeout: Request timeout in seconds
        max_retries: Maximum number of retry attempts for timeout errors
    
    Returns:
        Dictionary mapping game IDs to their stats data
    """
    # Setup logging
    logging.basicConfig(
        level=logging.INFO,
        format='%(asctime)s - %(levelname)s - %(message)s'
    )
    
    # Load existing cache
    cache = load_or_create_cache(cache_path)
    logging.info(f"Loaded cache with {len(cache)} existing games")
    
    # Filter out already cached games
    games_to_scrape = [gid for gid in game_ids if gid not in cache]
    logging.info(f"Found {len(games_to_scrape)} new games to scrape")
    
    successful_requests = 0
    
    # Create progress bar
    pbar = tqdm(games_to_scrape, desc="Scraping games", unit="game")
    
    for game_id in pbar:
        try:
            # Update progress bar description with current game
            pbar.set_description(f"Scraping game {game_id}")
            
            # Fetch game data with timeout handling and retries
            game_data = create_game_level_row(game_id, timeout=timeout, max_retries=max_retries)
            
            if game_data is not None:
                cache[game_id] = game_data
                successful_requests += 1
                
                # Update progress bar postfix with success count
                pbar.set_postfix(
                    successful=successful_requests,
                    cached_total=len(cache)
                )
                
                # Save periodically
                if successful_requests % save_frequency == 0:
                    save_cache(cache, cache_path)
                    logging.info(f"Saved cache with {len(cache)} games")
            
            # Wait between requests
            time.sleep(delay)
            
        except Exception as e:
            logging.error(f"Unexpected error processing game {game_id}: {str(e)}")
            # Save cache on error to preserve progress
            save_cache(cache, cache_path)
            logging.info("Saved cache due to error")
            raise
    
    # Close progress bar
    pbar.close()
    
    # Final save
    save_cache(cache, cache_path)
    logging.info(f"Scraping completed. Final cache contains {len(cache)} games")
    
    return cache

### main()

Note: the default 1s delay may be overkill (this run passes 0.5s), but since my attempts did not crash I did not tune it further

In [13]:
try:
    game_stats_2024 = scrape_game_stats(
        game_ids=game_ids,
        cache_path='nba_games_cache.pkl',
        delay=0.5,  # 0.5 second delay between requests
        save_frequency=10,  # Save every 10 successful requests
        timeout=30,  # 30 second timeout
        max_retries=3  # Retry up to 3 times on timeout
    )
except Exception as e:
    logging.error(f"Scraping stopped due to error: {str(e)}")

2025-02-07 11:21:08,835 - INFO - Loaded cache with 0 existing games
2025-02-07 11:21:08,839 - INFO - Found 27660 new games to scrape
Scraping game 0040400305:   0%|          | 9/27660 [00:07<5:56:44,  1.29game/s, cached_total=10, successful=10]2025-02-07 11:21:16,580 - INFO - Saved cache with 10 games
Scraping game 0040400226:   0%|          | 19/27660 [00:17<7:52:24,  1.03s/game, cached_total=20, successful=20]2025-02-07 11:21:25,939 - INFO - Saved cache with 20 games
Scraping game 0040400213:   0%|          | 29/27660 [00:25<8:02:03,  1.05s/game, cached_total=30, successful=30]2025-02-07 11:21:34,547 - INFO - Saved cache with 30 games
Scraping game 0040400231:   0%|          | 39/27660 [00:34<6:27:04,  1.19game/s, cached_total=40, successful=40]2025-02-07 11:21:43,100 - INFO - Saved cache with 40 games
Scraping game 0040400125:   0%|          | 49/27660 [00:43<6:12:56,  1.23game/s, cached_total=50, successful=50]2025-02-07 11:21:52,202 - INFO - Saved cache with 50 games
Scraping game

# Processing pkl cache into dataset df

In [None]:
import pandas as pd
import pickle

load pkl into cache dictionary

In [14]:
with open('nba_games_cache.pkl', 'rb') as f:
    cache = pickle.load(f)
print(f'number of game ids: {len(game_ids)}\nnumber of scraped games: {len(cache)}')

number of game ids: 27660
number of scraped games: 27657
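
The two counts differ by 3. That gap could come from requests that never succeeded or from ids that appear more than once in `game_ids`. A small sketch for telling the two apart, shown on toy data (`diagnose_gap` is a hypothetical helper):

```python
from collections import Counter
from typing import Dict, List, Tuple

def diagnose_gap(game_ids: List[str], cache: Dict[str, dict]) -> Tuple[List[str], List[str]]:
    """Return (ids missing from the cache, ids listed more than once)."""
    missing = sorted(set(game_ids) - set(cache))
    duplicated = sorted(gid for gid, n in Counter(game_ids).items() if n > 1)
    return missing, duplicated

toy_ids = ['001', '002', '002', '003']
toy_cache = {'001': {}, '002': {}}
missing, duplicated = diagnose_gap(toy_ids, toy_cache)
print(missing, duplicated)  # ['003'] ['002']
```

Any ids reported as missing can be passed back to `scrape_game_stats`, which already skips everything found in the cache.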


iterate through cache using `pd.json_normalize()` to create a row for each game id's value

Note: I had a more complicated and verbose version of this wrapped in a function, but while it worked for the 2024 sample cache, it didn't work for the entire game cache

In [15]:
all_games = []
num_skipped = 0
for game_id, game_data in cache.items():
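    try:
        # Guard against empty entries; a failed assert is counted as a skip below
        assert game_data is not None
        df = pd.json_normalize(game_data)
        all_games.append(df)
    except Exception:  # not a bare except, so KeyboardInterrupt still stops the loop
        num_skipped += 1
        continue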
print(f'Num skipped: {num_skipped}')
game_level_dataset = pd.concat(all_games, ignore_index=True)
print(f'shape of dataset: {game_level_dataset.shape}')
game_level_dataset.head()

Num skipped: 0
shape of dataset: (27657, 140)


Unnamed: 0,meta.version,meta.request,meta.time,boxScoreTraditional.gameId,boxScoreTraditional.awayTeamId,boxScoreTraditional.homeTeamId,boxScoreTraditional.homeTeam.teamId,boxScoreTraditional.homeTeam.teamCity,boxScoreTraditional.homeTeam.teamName,boxScoreTraditional.homeTeam.teamTricode,...,boxScoreTraditional.awayTeam.bench.blocks,boxScoreTraditional.awayTeam.bench.turnovers,boxScoreTraditional.awayTeam.bench.foulsPersonal,boxScoreTraditional.awayTeam.bench.points,boxScoreTraditional.homeTeam.bench,boxScoreTraditional.homeTeam.statistics,boxScoreTraditional.homeTeam.starters,boxScoreTraditional.awayTeam.statistics,boxScoreTraditional.awayTeam.starters,boxScoreTraditional.awayTeam.bench
0,1,http://nba.cloud/games/0040400407/boxscoretrad...,2023-08-10T15:58:36.5836Z,40400407,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,2.0,2.0,6.0,14.0,,,,,,
1,1,http://nba.cloud/games/0040400406/boxscoretrad...,2023-08-10T11:24:49.2449Z,40400406,1610612765,1610612759,1610612759,San Antonio,Spurs,SAS,...,1.0,2.0,7.0,14.0,,,,,,
2,1,http://nba.cloud/games/0040400405/boxscoretrad...,2023-08-10T15:58:11.5811Z,40400405,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,0.0,3.0,5.0,22.0,,,,,,
3,1,http://nba.cloud/games/0040400404/boxscoretrad...,2023-08-10T15:57:54.5754Z,40400404,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,2.0,4.0,6.0,18.0,,,,,,
4,1,http://nba.cloud/games/0040400403/boxscoretrad...,2023-08-10T15:57:41.5741Z,40400403,1610612759,1610612765,1610612765,Detroit,Pistons,DET,...,1.0,4.0,2.0,18.0,,,,,,
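
For intuition on where the dotted column names come from: `pd.json_normalize()` flattens nested dictionaries into one row, joining keys with `.` and leaving lists as single cell values. Below is a rough pure-Python sketch of that flattening. The payload is made up and only mimics the shape of the response, and the `flatten` helper is hypothetical (it is not how pandas implements it).

```python
from typing import Any, Dict

def flatten(nested: Dict[str, Any], prefix: str = '') -> Dict[str, Any]:
    """Flatten nested dicts into one dict with dot-joined keys; lists are kept as-is."""
    flat: Dict[str, Any] = {}
    for key, value in nested.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))
        else:
            flat[name] = value
    return flat

# Toy payload with invented values
toy_response = {
    'meta': {'version': 1},
    'boxScoreTraditional': {
        'gameId': '0000000001',
        'homeTeam': {
            'teamTricode': 'XYZ',
            'statistics': {'points': 100},
            'players': [{'personId': 1}, {'personId': 2}],
        },
    },
}
row = flatten(toy_response)
for column in sorted(row):
    print(column)
```

The `players` list stays in a single cell here, which is one plausible reason the real dataset carries object-valued columns alongside the flattened statistics.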


## clean up/rename columns + export out to csv

In [16]:
#drop_columns = ['meta.version', 'meta.request', 'meta.time']
keep_columns = [
    'boxScoreTraditional.gameId', 
    'boxScoreTraditional.homeTeam.teamId',
    'boxScoreTraditional.homeTeam.teamName',
    'boxScoreTraditional.homeTeam.teamTricode',
    'boxScoreTraditional.homeTeam.statistics.minutes',
    'boxScoreTraditional.homeTeam.statistics.fieldGoalsMade',
    'boxScoreTraditional.homeTeam.statistics.fieldGoalsAttempted',
    'boxScoreTraditional.homeTeam.statistics.fieldGoalsPercentage',
    'boxScoreTraditional.homeTeam.statistics.threePointersMade',
    'boxScoreTraditional.homeTeam.statistics.threePointersAttempted',
    'boxScoreTraditional.homeTeam.statistics.threePointersPercentage',
    'boxScoreTraditional.homeTeam.statistics.freeThrowsMade',
    'boxScoreTraditional.homeTeam.statistics.freeThrowsAttempted',
    'boxScoreTraditional.homeTeam.statistics.freeThrowsPercentage',
    'boxScoreTraditional.homeTeam.statistics.reboundsOffensive',
    'boxScoreTraditional.homeTeam.statistics.reboundsDefensive',
    'boxScoreTraditional.homeTeam.statistics.reboundsTotal',
    'boxScoreTraditional.homeTeam.statistics.assists',
    'boxScoreTraditional.homeTeam.statistics.steals',
    'boxScoreTraditional.homeTeam.statistics.blocks',
    'boxScoreTraditional.homeTeam.statistics.turnovers',
    'boxScoreTraditional.homeTeam.statistics.foulsPersonal',
    'boxScoreTraditional.homeTeam.statistics.points',
    'boxScoreTraditional.homeTeam.statistics.plusMinusPoints',
    'boxScoreTraditional.awayTeam.teamId',
    'boxScoreTraditional.awayTeam.teamName',
    'boxScoreTraditional.awayTeam.teamTricode',
    'boxScoreTraditional.awayTeam.statistics.minutes',
    'boxScoreTraditional.awayTeam.statistics.fieldGoalsMade',
    'boxScoreTraditional.awayTeam.statistics.fieldGoalsAttempted',
    'boxScoreTraditional.awayTeam.statistics.fieldGoalsPercentage',
    'boxScoreTraditional.awayTeam.statistics.threePointersMade',
    'boxScoreTraditional.awayTeam.statistics.threePointersAttempted',
    'boxScoreTraditional.awayTeam.statistics.threePointersPercentage',
    'boxScoreTraditional.awayTeam.statistics.freeThrowsMade',
    'boxScoreTraditional.awayTeam.statistics.freeThrowsAttempted',
    'boxScoreTraditional.awayTeam.statistics.freeThrowsPercentage',
    'boxScoreTraditional.awayTeam.statistics.reboundsOffensive',
    'boxScoreTraditional.awayTeam.statistics.reboundsDefensive',
    'boxScoreTraditional.awayTeam.statistics.reboundsTotal',
    'boxScoreTraditional.awayTeam.statistics.assists',
    'boxScoreTraditional.awayTeam.statistics.steals',
    'boxScoreTraditional.awayTeam.statistics.blocks',
    'boxScoreTraditional.awayTeam.statistics.turnovers',
    'boxScoreTraditional.awayTeam.statistics.foulsPersonal',
    'boxScoreTraditional.awayTeam.statistics.points',
    'boxScoreTraditional.awayTeam.statistics.plusMinusPoints'
]

rename_cols = {
    'boxScoreTraditional.gameId':'GAME_ID',
    'boxScoreTraditional.homeTeam.teamId':'HOME_ID',
    'boxScoreTraditional.homeTeam.teamName':'HOME_NAME',
    'boxScoreTraditional.homeTeam.teamTricode':'HOME_TRICODE',
    'boxScoreTraditional.homeTeam.statistics.minutes':'HOME_MINUTES',
    'boxScoreTraditional.homeTeam.statistics.fieldGoalsMade':'HOME_FIELD_GOALS_MADE',
    'boxScoreTraditional.homeTeam.statistics.fieldGoalsAttempted':'HOME_FIELD_GOALS_ATTEMPTED',
    'boxScoreTraditional.homeTeam.statistics.fieldGoalsPercentage':'HOME_FIELD_GOALS_PERCENTAGE',
    'boxScoreTraditional.homeTeam.statistics.threePointersMade':'HOME_THREE_POINTERS_MADE',
    'boxScoreTraditional.homeTeam.statistics.threePointersAttempted':'HOME_THREE_POINTERS_ATTEMPTED',
    'boxScoreTraditional.homeTeam.statistics.threePointersPercentage':'HOME_THREE_POINTERS_PERCENTAGE',
    'boxScoreTraditional.homeTeam.statistics.freeThrowsMade':'HOME_FREE_THROWS_MADE',
    'boxScoreTraditional.homeTeam.statistics.freeThrowsAttempted':'HOME_FREE_THROWS_ATTEMPTED',
    'boxScoreTraditional.homeTeam.statistics.freeThrowsPercentage':'HOME_FREE_THROWS_PERCENTAGE',
    'boxScoreTraditional.homeTeam.statistics.reboundsOffensive':'HOME_REBOUNDS_OFFENSIVE',
    'boxScoreTraditional.homeTeam.statistics.reboundsDefensive':'HOME_REBOUNDS_DEFENSIVE',
    'boxScoreTraditional.homeTeam.statistics.reboundsTotal':'HOME_REBOUNDS_TOTAL',
    'boxScoreTraditional.homeTeam.statistics.assists':'HOME_ASSISTS',
    'boxScoreTraditional.homeTeam.statistics.steals':'HOME_STEALS',
    'boxScoreTraditional.homeTeam.statistics.blocks':'HOME_BLOCKS',
    'boxScoreTraditional.homeTeam.statistics.turnovers':'HOME_TURNOVERS',
    'boxScoreTraditional.homeTeam.statistics.foulsPersonal':'HOME_FOULS_PERSONAL',
    'boxScoreTraditional.homeTeam.statistics.points':'HOME_POINTS',
    'boxScoreTraditional.homeTeam.statistics.plusMinusPoints':'HOME_PLUS_MINUS_POINTS',
    'boxScoreTraditional.awayTeam.teamId':'AWAY_ID',
    'boxScoreTraditional.awayTeam.teamName':'AWAY_NAME',
    'boxScoreTraditional.awayTeam.teamTricode':'AWAY_TRICODE',
    'boxScoreTraditional.awayTeam.statistics.minutes':'AWAY_MINUTES',
    'boxScoreTraditional.awayTeam.statistics.fieldGoalsMade':'AWAY_FIELD_GOALS_MADE',
    'boxScoreTraditional.awayTeam.statistics.fieldGoalsAttempted':'AWAY_FIELD_GOALS_ATTEMPTED',
    'boxScoreTraditional.awayTeam.statistics.fieldGoalsPercentage':'AWAY_FIELD_GOALS_PERCENTAGE',
    'boxScoreTraditional.awayTeam.statistics.threePointersMade':'AWAY_THREE_POINTERS_MADE',
    'boxScoreTraditional.awayTeam.statistics.threePointersAttempted':'AWAY_THREE_POINTERS_ATTEMPTED',
    'boxScoreTraditional.awayTeam.statistics.threePointersPercentage':'AWAY_THREE_POINTERS_PERCENTAGE',
    'boxScoreTraditional.awayTeam.statistics.freeThrowsMade':'AWAY_FREE_THROWS_MADE',
    'boxScoreTraditional.awayTeam.statistics.freeThrowsAttempted':'AWAY_FREE_THROWS_ATTEMPTED',
    'boxScoreTraditional.awayTeam.statistics.freeThrowsPercentage':'AWAY_FREE_THROWS_PERCENTAGE',
    'boxScoreTraditional.awayTeam.statistics.reboundsOffensive':'AWAY_REBOUNDS_OFFENSIVE',
    'boxScoreTraditional.awayTeam.statistics.reboundsDefensive':'AWAY_REBOUNDS_DEFENSIVE',
    'boxScoreTraditional.awayTeam.statistics.reboundsTotal':'AWAY_REBOUNDS_TOTAL',
    'boxScoreTraditional.awayTeam.statistics.assists':'AWAY_ASSISTS',
    'boxScoreTraditional.awayTeam.statistics.steals':'AWAY_STEALS',
    'boxScoreTraditional.awayTeam.statistics.blocks':'AWAY_BLOCKS',
    'boxScoreTraditional.awayTeam.statistics.turnovers':'AWAY_TURNOVERS',
    'boxScoreTraditional.awayTeam.statistics.foulsPersonal':'AWAY_FOULS_PERSONAL',
    'boxScoreTraditional.awayTeam.statistics.points':'AWAY_POINTS',
    'boxScoreTraditional.awayTeam.statistics.plusMinusPoints':'AWAY_PLUS_MINUS_POINTS'
}
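
Because the names in `rename_cols` follow a regular pattern, the mapping could also be derived from the column names rather than typed out. A sketch of one way to do that; `to_upper_snake` and `short_name` are hypothetical helpers, and the rule is inferred only from the entries listed above.

```python
import re

def to_upper_snake(name: str) -> str:
    """Convert camelCase to UPPER_SNAKE_CASE, e.g. fieldGoalsMade -> FIELD_GOALS_MADE."""
    return re.sub(r'(?<!^)(?=[A-Z])', '_', name).upper()

def short_name(column: str) -> str:
    """Map a flattened boxScoreTraditional column to its short dataset name."""
    parts = column.split('.')[1:]  # drop the leading 'boxScoreTraditional'
    field = parts[-1]
    if parts[0] in ('homeTeam', 'awayTeam'):
        side = 'HOME' if parts[0] == 'homeTeam' else 'AWAY'
        if field.startswith('team'):
            field = field[len('team'):]  # teamId -> Id, teamName -> Name
        return f"{side}_{to_upper_snake(field)}"
    return to_upper_snake(field)

sample_columns = [
    'boxScoreTraditional.gameId',
    'boxScoreTraditional.homeTeam.teamId',
    'boxScoreTraditional.awayTeam.statistics.plusMinusPoints',
]
generated = {col: short_name(col) for col in sample_columns}
print(generated)
```

Running the same comprehension over `keep_columns` and comparing the result with the hand-written `rename_cols` would be a cheap way to catch typos in either one.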


In [17]:
cleaner_dataset = game_level_dataset[keep_columns].rename(columns=rename_cols)
cleaner_dataset.to_csv('game-level-dataset-cleaned.csv', index=False)

In [18]:
game_level_dataset.to_csv('game-level-dataset.csv', index=False)