# NBA Data Analysis :: API Connection Part 2

## Trevor Rowland :: 2-5-2025

This notebook collects every team that made the playoffs in the 2004-2024 seasons and builds a Play-by-Play data grabber for use in later projects.

## 1. Collecting Playoff Teams

This function uses the PlayoffPicture endpoint from `nba_api` to get a snapshot of the playoffs in each season. The snapshot lists every team that made the playoffs, which we will use in our 4th hypothesis test, a MANOVA comparing playoff teams to non-playoff teams.

**Edit:** We are no longer using the `nba_api` endpoints for this because it is needless overengineering. We can just write down every playoff team from Wikipedia instead of wrestling with the API.

**Edit 2:** Even better! Every `game_id` value encodes whether the game is a playoff game or not.

Each `game_id` starts with `00`, and the third digit gives the game type, so playoff games start with `004`:

- 1: Pre-Season Games
- 2: Regular Season Games
- 3: All-Star Games
- 4: Playoff Games

**[Source](<https://github.com/swar/nba_api/issues/220>)**

This means we are still waiting on the game data, but we at least have an elegant solution once we get those `game_id` datasets.
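
As a minimal sketch of how this prefix could be used, the helper below maps the third character of a `game_id` to a label. The function name `classify_game_id` and the label strings are assumptions made for illustration and are not part of `nba_api`; the digit meanings come from the list above. It also assumes IDs are kept as strings so the leading zeros survive. The first ID is taken from the log output in this notebook; the second is a made-up playoff-style ID.

```python
# Hypothetical helper: labels follow the digit list above.
GAME_TYPES = {
    "1": "Pre-Season",
    "2": "Regular Season",
    "3": "All-Star",
    "4": "Playoff",
}


def classify_game_id(game_id: str) -> str:
    """Return the game type encoded in the third character of a game_id."""
    game_id = str(game_id)
    if len(game_id) < 3 or not game_id.startswith("00"):
        raise ValueError(f"Unexpected game_id format: {game_id!r}")
    # Any digit outside 1-4 is reported as "Unknown" rather than guessed.
    return GAME_TYPES.get(game_id[2], "Unknown")


print(classify_game_id("0010600023"))  # ID seen in the log output of this notebook
print(classify_game_id("0040600001"))  # made-up ID with a playoff prefix
```

Reading IDs from a CSV with default settings would turn them into integers and drop the `00` prefix, so a string dtype is likely needed when loading saved data.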

### 1.a. Playoff Teams

The following is a dictionary of every team that has made the NBA playoffs since 2004.

In [None]:
import pandas as pd
import numpy as np
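
The dictionary itself is not reproduced here. As a hedged alternative to typing it out by hand, it could be derived from game records once they are available. The sketch below assumes each record is a plain dict with `GAME_ID`, `SEASON`, `HOME_TEAM_NAME`, and `AWAY_TEAM_NAME` keys (the columns built by the collection code in this notebook, obtainable with something like `games_df.to_dict('records')`). The function name `playoff_teams_by_season` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def playoff_teams_by_season(game_records):
    """Map each season to the sorted list of teams that appear in a playoff game."""
    teams = defaultdict(set)
    for record in game_records:
        # Playoff game IDs start with "004" (third digit 4 = playoff).
        if str(record["GAME_ID"]).startswith("004"):
            teams[record["SEASON"]].add(record["HOME_TEAM_NAME"])
            teams[record["SEASON"]].add(record["AWAY_TEAM_NAME"])
    return {season: sorted(names) for season, names in teams.items()}


# Made-up sample rows, only to show the expected shape of the input.
sample_records = [
    {"GAME_ID": "0020600001", "SEASON": "2006-07",
     "HOME_TEAM_NAME": "Team A", "AWAY_TEAM_NAME": "Team B"},
    {"GAME_ID": "0040600001", "SEASON": "2006-07",
     "HOME_TEAM_NAME": "Team C", "AWAY_TEAM_NAME": "Team A"},
]
playoff_teams = playoff_teams_by_season(sample_records)
print(playoff_teams)
```

Team B only appears in a regular-season row, so it is left out of the result for that season.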

### 1.b. Importing Packages

In [4]:
from nba_api.stats.endpoints import LeagueGameFinder, BoxScoreTraditionalV2
import pandas as pd
import time
from datetime import datetime
from tqdm.auto import tqdm
import logging
import random

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
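    # File handler only: log_message() already prints to the console through
    # tqdm.write, so a StreamHandler would show every message twice.
    handlers=[logging.FileHandler('nba_data_collection.log')]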
)
logger = logging.getLogger(__name__)

def log_message(message):
    """Log message to both file and tqdm"""
    tqdm.write(message)
    logger.info(message)

def exponential_backoff(attempt, base_delay=2, max_delay=60):
    """Calculate exponential backoff time with jitter"""
    delay = min(base_delay * (2 ** attempt) + random.uniform(0, 1), max_delay)
    return delay

def get_season_games(season):
    """Get games for a single season with error handling"""
    game_finder = LeagueGameFinder(
        season_nullable=season,
        league_id_nullable='00',
        timeout=60
    )
    
    # Get the raw response first
    response_frames = game_finder.get_data_frames()
    
    # Debug logging
    log_message(f"Response for season {season}: got {len(response_frames)} DataFrames")
    if not response_frames:
        raise ValueError(f"Empty response for season {season}")
    
    games = response_frames[0]
    if len(games) == 0:
        raise ValueError(f"No games found for season {season}")
        
    log_message(f"Retrieved {len(games)} game entries for season {season}")
    return games

def parse_matchup(matchup):
    """
    Parse the matchup string to determine home and away teams.
    Example formats:
    - "GSW vs. LAL" -> GSW is home
    - "GSW @ LAL" -> GSW is away
    """
    if ' vs.' in matchup:
        return 'home'
    elif ' @' in matchup:
        return 'away'
    else:
        return None

def get_all_game_ids(start_year, end_year):
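    """
    First phase: Collect all game IDs for the specified year range.
    Seasons run from start_year up to, but not including, end_year
    (e.g. 2006, 2008 -> '2006-07' and '2007-08').
    """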
    all_games = []
    seasons = [f"{year}-{str(year + 1)[-2:]}" for year in range(start_year, end_year)]
    
    # Season progress bar
    with tqdm(seasons, desc="Collecting game IDs") as season_pbar:
        for season in season_pbar:
            season_pbar.set_description(f"Getting games for {season}")
            
            try:
                games = get_season_games(season)
                
                # Debug log the structure of the data
                log_message(f"Processing {len(games)} game entries for {season}")
                
                # Create a dictionary to store games temporarily
                season_games_dict = {}
                
                # First pass: organize games by GAME_ID
                for _, game in games.iterrows():
                    game_id = game['GAME_ID']
                    game_location = parse_matchup(game['MATCHUP'])
                    
                    if game_id not in season_games_dict:
                        season_games_dict[game_id] = {'home': None, 'away': None}
                    
                    if game_location == 'home':
                        season_games_dict[game_id]['home'] = game
                    elif game_location == 'away':
                        season_games_dict[game_id]['away'] = game
                
                # Second pass: create game records
                games_processed = 0
                games_skipped = 0
                
                for game_id, game_data in season_games_dict.items():
                    if game_data['home'] is not None and game_data['away'] is not None:
                        home_game = game_data['home']
                        away_game = game_data['away']
                        
                        game_info = {
                            'GAME_ID': game_id,
                            'GAME_DATE': home_game['GAME_DATE'],
                            'SEASON': season,
                            'HOME_TEAM_ID': home_game['TEAM_ID'],
                            'HOME_TEAM_NAME': home_game['TEAM_NAME'],
                            'AWAY_TEAM_ID': away_game['TEAM_ID'],
                            'AWAY_TEAM_NAME': away_game['TEAM_NAME'],
                            'HOME_TEAM_SCORE': home_game['PTS'],
                            'AWAY_TEAM_SCORE': away_game['PTS'],
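                            # Third digit of GAME_ID encodes the game type (1-4, see the list above)
                            'GAME_TYPE': {'1': 'Pre-Season', '2': 'Regular', '3': 'All-Star',
                                          '4': 'Playoff'}.get(game_id[2], 'Unknown')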
                        }
                        all_games.append(game_info)
                        games_processed += 1
                    else:
                        games_skipped += 1
                        log_message(f"Skipping game {game_id} - Missing {'home' if game_data['home'] is None else 'away'} team data")
                
                log_message(f"Season {season} summary:"
                          f"\n - Total games found: {len(season_games_dict)}"
                          f"\n - Successfully processed: {games_processed}"
                          f"\n - Skipped: {games_skipped}")
                
            except Exception as e:
                log_message(f"Error processing season {season}: {str(e)}")
                continue
            
            time.sleep(1)  # Brief pause between seasons
    
    if not all_games:
        raise ValueError("No games collected for any season")
    
    games_df = pd.DataFrame(all_games)
    log_message(f"Total games collected across all seasons: {len(games_df)}")
    return games_df

def get_box_score(game_id, retries=5):
    """Get box score for a single game with retry logic"""
    for attempt in range(retries):
        try:
            box_score = BoxScoreTraditionalV2(game_id=game_id, timeout=60)
            response_frames = box_score.get_data_frames()
            
            if not response_frames:
                raise ValueError("Empty response")
                
            return response_frames[0]
            
        except Exception as e:
            if attempt == retries - 1:  # Last attempt
                log_message(f"Failed to get box score for game {game_id}: {str(e)}")
                return None
            delay = exponential_backoff(attempt)
            log_message(f"Attempt {attempt + 1} failed for game {game_id}, retrying in {delay:.1f}s")
            time.sleep(delay)
    return None

def collect_nba_game_data(start_year=2004, end_year=2024):
    """
    Collect game-level data and box scores for NBA games between specified years.
    """
    log_message(f"Starting data collection for seasons {start_year}-{end_year}")
    
    try:
        # Phase 1: Get all game IDs and basic game info
        games_df = get_all_game_ids(start_year, end_year)
        log_message(f"Found {len(games_df)} total games")
        
        # Phase 2: Get box scores for each game
        all_box_scores = []
        successful_games = 0
        failed_games = 0
        
        # Process each game
        with tqdm(total=len(games_df), desc="Collecting box scores") as pbar:
            for _, game in games_df.iterrows():
                game_id = game['GAME_ID']
                
                # Get box score
                box_score_df = get_box_score(game_id)
                
                if box_score_df is not None:
                    # Add game information to box score
                    for col, value in game.items():
                        box_score_df[col] = value
                    
                    all_box_scores.append(box_score_df)
                    successful_games += 1
                else:
                    failed_games += 1
                
                # Update progress bar with success rate
                success_rate = (successful_games / (successful_games + failed_games)) * 100
                pbar.set_description(
                    f"Box scores (Success: {successful_games}, Failed: {failed_games}, "
                    f"Rate: {success_rate:.1f}%)"
                )
                pbar.update(1)
                
                time.sleep(1)  # Rate limiting
        
        # Combine box scores
        if all_box_scores:
            box_scores_df = pd.concat(all_box_scores, ignore_index=True)
        else:
            box_scores_df = pd.DataFrame()
        
        # Final summary
        log_message(f"\nData collection completed:"
                    f"\nTotal games found: {len(games_df)}"
                    f"\nSuccessful box scores: {successful_games}"
                    f"\nFailed box scores: {failed_games}"
                    f"\nSuccess rate: {(successful_games / (successful_games + failed_games)) * 100:.1f}%")
        
        return games_df, box_scores_df
        
    except Exception as e:
        log_message(f"Critical error in data collection: {str(e)}")
        raise

def save_data(games_df, box_scores_df, base_filename='nba_data'):
    """Save the collected data to CSV files with timestamps."""
    timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
    
    games_df.to_csv(f'{base_filename}_games_{timestamp}.csv', index=False)
    box_scores_df.to_csv(f'{base_filename}_box_scores_{timestamp}.csv', index=False)
    log_message(f"Data saved to {base_filename}_games_{timestamp}.csv and {base_filename}_box_scores_{timestamp}.csv")

# Example usage
if __name__ == "__main__":
    games_df, box_scores_df = collect_nba_game_data(2006, 2008)
    save_data(games_df, box_scores_df)

Starting data collection for seasons 2006-2008


Response for season 2006-07: got 1 DataFrames
Retrieved 2867 game entries for season 2006-07
Processing 2867 game entries for 2006-07
Skipping game 0010600023 - Missing away team data
Skipping game 0010600015 - Missing home team data
Skipping game 0010600001 - Missing home team data
Season 2006-07 summary:
 - Total games found: 1433
 - Successfully processed: 1430
 - Skipped: 3


Response for season 2007-08: got 1 DataFrames
Retrieved 2852 game entries for season 2007-08
Processing 2852 game entries for 2007-08
Season 2007-08 summary:
 - Total games found: 1426
 - Successfully processed: 1426
 - Skipped: 0


Getting games for 2007-08: 100%|██████████| 2/2 [00:02<00:00,  1.17s/it]
Total games collected across all seasons: 2856
Found 2856 total games


Box scores (Success: 185, Failed: 0, Rate: 100.0%):   6%|▋         | 185/2856 [06:53<1:39:25,  2.23s/it]


KeyboardInterrupt: 
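
The run above was interrupted after 185 of 2856 box scores, and because results are only saved at the very end, nothing collected up to that point was written to disk. One possible way to make the loop resumable is sketched below: finished game IDs are appended to a checkpoint file and skipped on the next run. The names `load_done_ids`, `collect_with_checkpoint`, `fetch`, and the checkpoint file name are hypothetical; `fetch` stands in for a function such as `get_box_score`.

```python
import os
import tempfile


def load_done_ids(checkpoint_path):
    """Read the set of game IDs already processed in earlier runs."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path) as f:
        return {line.strip() for line in f if line.strip()}


def collect_with_checkpoint(game_ids, fetch, checkpoint_path):
    """Call fetch(game_id) for each ID not yet recorded in the checkpoint file."""
    done = load_done_ids(checkpoint_path)
    results = {}
    with open(checkpoint_path, "a") as f:
        for game_id in game_ids:
            if game_id in done:
                continue  # already fetched in a previous run
            result = fetch(game_id)
            if result is None:
                continue  # failed; leave it out so a later run retries it
            results[game_id] = result
            f.write(game_id + "\n")
            f.flush()  # push the ID to the file right away in case of an interrupt
    return results


# Demo with a stand-in fetch function; no network access is needed.
checkpoint = os.path.join(tempfile.mkdtemp(), "done_game_ids.txt")
first = collect_with_checkpoint(["A", "B"], lambda gid: f"box score {gid}", checkpoint)
second = collect_with_checkpoint(["A", "B", "C"], lambda gid: f"box score {gid}", checkpoint)
print(sorted(first), sorted(second))
```

This sketch only checkpoints the IDs. In a real run the fetched box scores would also need to be written out as they arrive (for example, one CSV per game), otherwise the skipped games would have no saved data to fall back on.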