# NBA Shot Data 4 :: Data Aggregation

## Trevor Rowland :: 2/2/2025

This notebook takes the cleaned NBA shot data and builds a data source of team and player statistic aggregations. There is also an API package on PyPI, `nba_api`, that we will try to use for season-level team and player statistics. Note that it queries stats.nba.com, not [basketball-reference.com](https://www.basketball-reference.com).

## 1. Importing Packages and Data

In [1]:
import pandas as pd
import polars as pl
import numpy as np

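# pd.read_pickle returns whatever object was pickled; this file may hold a Polars DataFrame
df = pd.read_pickle('/Users/dB/Documents/repos/github/bint-capstone/data-sources/nba/all-shots.pkl')
# to_pandas is a Polars method, so convert only when the loaded object is a Polars DataFrame
if isinstance(df, pl.DataFrame):
    df = df.to_pandas(use_pyarrow_extension_array=True)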

In [2]:
df.head()

| Unnamed: 0 | SEASON_1 | SEASON_2 | TEAM_ID | TEAM_NAME | PLAYER_ID | PLAYER_NAME | POSITION_GROUP | POSITION | GAME_DATE | GAME_ID | ... | BASIC_ZONE | ZONE_NAME | ZONE_ABB | ZONE_RANGE | LOC_X | LOC_Y | SHOT_DISTANCE | QUARTER | MINS_LEFT | SECS_LEFT |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2009 | 2008-09 | 1610612744 | Golden State Warriors | 201627 | Anthony Morrow | G | SG | 04-15-2009 | 20801229 | ... | Restricted Area | Center | C | Less Than 8 ft. | -0.0 | 5.25 | 0 | 4 | 0 | 1 |
| 1 | 2009 | 2008-09 | 1610612744 | Golden State Warriors | 101235 | Kelenna Azubuike | F | SF | 04-15-2009 | 20801229 | ... | Restricted Area | Center | C | Less Than 8 ft. | -0.0 | 5.25 | 0 | 4 | 0 | 9 |
| 2 | 2009 | 2008-09 | 1610612756 | Phoenix Suns | 255 | Grant Hill | F | SF | 04-15-2009 | 20801229 | ... | Restricted Area | Center | C | Less Than 8 ft. | -0.0 | 5.25 | 0 | 4 | 0 | 25 |
| 3 | 2009 | 2008-09 | 1610612739 | Cleveland Cavaliers | 200789 | Daniel Gibson | G | PG | 04-15-2009 | 20801219 | ... | Restricted Area | Center | C | Less Than 8 ft. | -0.2 | 5.25 | 0 | 5 | 0 | 4 |
| 4 | 2009 | 2008-09 | 1610612756 | Phoenix Suns | 255 | Grant Hill | F | SF | 04-15-2009 | 20801229 | ... | Mid-Range | Left Side | L | 8-16 ft. | 8.7 | 7.55 | 8 | 4 | 1 | 3 |
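
Since the goal of this notebook is team and player aggregations, here is a minimal sketch of what that could look like with `groupby`. It uses only the five rows and a few of the columns visible in the `df.head()` output, so the numbers are illustrative only. The output column names `shot_attempts` and `avg_shot_distance` are assumptions, not names used elsewhere in the project.

```python
import pandas as pd

# The five rows visible in df.head(), restricted to four of the displayed columns
sample = pd.DataFrame({
    "SEASON_2": ["2008-09"] * 5,
    "TEAM_NAME": [
        "Golden State Warriors", "Golden State Warriors", "Phoenix Suns",
        "Cleveland Cavaliers", "Phoenix Suns",
    ],
    "PLAYER_NAME": [
        "Anthony Morrow", "Kelenna Azubuike", "Grant Hill",
        "Daniel Gibson", "Grant Hill",
    ],
    "SHOT_DISTANCE": [0, 0, 0, 0, 8],
})

# Player-level aggregation: one row per season, team, and player
player_agg = (
    sample.groupby(["SEASON_2", "TEAM_NAME", "PLAYER_NAME"], as_index=False)
    .agg(
        shot_attempts=("SHOT_DISTANCE", "size"),      # number of shot rows
        avg_shot_distance=("SHOT_DISTANCE", "mean"),  # mean distance of those shots
    )
)

# Team-level aggregation: one row per season and team
team_agg = (
    sample.groupby(["SEASON_2", "TEAM_NAME"], as_index=False)
    .agg(
        shot_attempts=("SHOT_DISTANCE", "size"),
        avg_shot_distance=("SHOT_DISTANCE", "mean"),
    )
)

print(player_agg)
print(team_agg)
```

The same pattern should apply to the full `df`. Any made/missed column hidden behind the `...` in the output could be added to the `agg` call to derive shooting percentages.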


## 2. Attempting to Connect to the NBA Stats API (`nba_api`)

### 2.a. Importing Packages

In [59]:
from nba_api.stats.static import teams, players
from nba_api.stats.endpoints import (
    teamyearbyyearstats,
    leaguedashteamstats,
    leaguedashplayerstats,
    commonteamroster
)
from tqdm import tqdm
import time
import random

### 2.b. Team Data

`get_all_teams()`

Retrieve all NBA teams as a pandas DataFrame.
    
_Returns_:

pd.DataFrame: DataFrame of NBA teams with their details

In [53]:
def get_all_teams()->pd.DataFrame:
    teams_list = teams.get_teams()
    return pd.DataFrame(teams_list)

`get_team_ids()`

Extract active team IDs as a pandas Series.
    
_Returns_:

pd.Series: Series of active team IDs

In [54]:
def get_team_ids() -> pd.Series:
    return pd.Series([team['id'] for team in teams.get_teams()])

`collect_team_stats(start_year:int, end_year:int)`

Collect yearly team statistics.
    
Args:

start_year (int): Starting year for data collection

end_year (int): Ending year for data collection
    
_Returns_:

pd.DataFrame: Comprehensive team statistics across seasons


In [62]:
def collect_team_stats(start_year=2004, end_year=2024):
    """
    Collect yearly team statistics with robust error handling.
    
    Args:
    start_year (int): Starting year for data collection
    end_year (int): Ending year for data collection
    
    Returns:
    pd.DataFrame: Comprehensive team statistics across seasons
    """
    team_stats_list = []
    team_ids = get_team_ids()
    
    for team_id in tqdm(team_ids, desc="Collecting Team Stats"):
        for season in range(start_year, end_year + 1):
            try:
                # Convert year to NBA season format (e.g., 2020-21)
                season_str = f"{season}-{str(season+1)[-2:]}"
                
                # Collect team stats
                team_stats = leaguedashteamstats.LeagueDashTeamStats(
                    season=season_str
                )
                
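                # Directly convert to DataFrame
                df = team_stats.get_data_frames()[0]
                
                # The league-wide endpoint returns one row per team, so keep only
                # this team's row instead of overwriting TEAM_ID on every row
                df = df[df['TEAM_ID'] == team_id].copy()
                df['SEASON'] = season_str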
                
                team_stats_list.append(df)
                
                # Randomized rate limiting to avoid predictable patterns
                time.sleep(random.uniform(1.5, 3.5))
            
            except Exception as e:
                print(f"Error collecting stats for team {team_id} in season {season_str}: {e}")
                # Wait longer on failure with some randomness
                time.sleep(random.uniform(4, 7))
                continue
    
    # Combine all team stats into a single DataFrame
    if team_stats_list:
        final_df = pd.concat(team_stats_list, ignore_index=True)
        
        # Clean column names
        final_df.columns = [col.lower().replace(' ', '_') for col in final_df.columns]
        
        return final_df
    else:
        print("No team stats collected. Check network or API issues.")
        return pd.DataFrame()
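
The season-label conversion inside the loop is easy to check in isolation. The sketch below pulls it into a helper. `season_label` and `season_labels` are hypothetical names introduced here for illustration and are not part of `nba_api`.

```python
def season_label(start_year: int) -> str:
    """Format a season's starting year as 'YYYY-YY', e.g. 2020 -> '2020-21'."""
    return f"{start_year}-{str(start_year + 1)[-2:]}"


def season_labels(start_year: int, end_year: int) -> list[str]:
    """Labels for every season from start_year through end_year, inclusive."""
    return [season_label(year) for year in range(start_year, end_year + 1)]


labels = season_labels(2004, 2024)
print(labels[0], labels[-1], len(labels))  # 2004-05 2024-25 21
```

With the default arguments this gives 21 seasons, so the nested loop over 30 teams issues 30 × 21 = 630 requests. Because `LeagueDashTeamStats` appears to return every team for the requested season, looping over seasons alone (21 requests) would likely be enough.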

`get_team_roster(team_id:int, season:str)`

Retrieve team roster for a specific season.
    
Args:

team_id (int): NBA team ID

season (str): NBA season in format 'YYYY-YY'
    
_Returns_:

pd.DataFrame: Team roster details

In [56]:
def get_team_roster(team_id, season):
    try:
        roster = commonteamroster.CommonTeamRoster(team_id=team_id, season=season)
        
        # Get DataFrame directly and clean column names
        df = roster.get_data_frames()[0]
        df.columns = [col.lower().replace(' ', '_') for col in df.columns]
        
        # Add team_id and season columns
        df['team_id'] = team_id
        df['season'] = season
        
        return df
    except Exception as e:
        print(f"Error collecting roster for team {team_id} in season {season}: {e}")
        return pd.DataFrame()

### 2.c. Player Data

`collect_player_stats(start_year:int, end_year:int)`
    
Collect comprehensive player statistics.
    
Args:
    
start_year (int): Starting year for data collection
    
end_year (int): Ending year for data collection
    
    
_Returns_:
    
pd.DataFrame: Comprehensive player statistics across seasons


In [61]:
def collect_player_stats(start_year=2004, end_year=2024):
    """
    Collect comprehensive player statistics with robust error handling.
    
    Args:
    start_year (int): Starting year for data collection
    end_year (int): Ending year for data collection
    
    Returns:
    pd.DataFrame: Comprehensive player statistics across seasons
    """
    player_stats_list = []
    
    for season in tqdm(range(start_year, end_year + 1), desc="Collecting Player Stats"):
        try:
            # Convert year to NBA season format (e.g., 2020-21)
            season_str = f"{season}-{str(season+1)[-2:]}"
            
            # Collect player stats for the season
            player_stats = leaguedashplayerstats.LeagueDashPlayerStats(
                season=season_str
            )
            
            # Get DataFrame directly
            df = player_stats.get_data_frames()[0]
            
            # Add season column and clean column names
            df['SEASON'] = season_str
            df.columns = [col.lower().replace(' ', '_') for col in df.columns]
            
            player_stats_list.append(df)
            
            # Randomized rate limiting
            time.sleep(random.uniform(1.5, 3.5))
        
        except Exception as e:
            print(f"Error collecting player stats for season {season_str}: {e}")
            # Wait longer on failure with some randomness
            time.sleep(random.uniform(4, 7))
            continue
    
    # Combine all player stats into a single DataFrame
    if player_stats_list:
        final_df = pd.concat(player_stats_list, ignore_index=True)
        return final_df
    else:
        print("No player stats collected. Check network or API issues.")
        return pd.DataFrame()
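
Long collection runs can be interrupted partway, and the functions above only return data once every season has been attempted. One possible way to make a run resumable is to save each season to disk as soon as it arrives and skip seasons that are already cached. This is a sketch under assumptions: `collect_with_checkpoints`, `fetch_season`, and the cache file naming are hypothetical, and the demo uses a stand-in fetcher instead of the real API.

```python
import tempfile
from pathlib import Path

import pandas as pd


def collect_with_checkpoints(seasons, fetch_season, cache_dir):
    """Fetch one DataFrame per season, reusing pickles already saved in cache_dir.

    fetch_season is any callable that takes a season label such as '2004-05'
    and returns a DataFrame (hypothetical interface, not part of nba_api).
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    frames = []
    for season in seasons:
        cache_file = cache_dir / f"player_stats_{season}.pkl"
        if cache_file.exists():
            # Already collected in an earlier run: load from disk, no request made
            frames.append(pd.read_pickle(cache_file))
            continue
        season_df = fetch_season(season)
        season_df.to_pickle(cache_file)  # checkpoint before moving to the next season
        frames.append(season_df)
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()


# Demo with a stand-in fetcher that records which seasons were requested
requested = []

def fake_fetch(season):
    requested.append(season)
    return pd.DataFrame({"season": [season]})

with tempfile.TemporaryDirectory() as tmp:
    first = collect_with_checkpoints(["2004-05", "2005-06"], fake_fetch, tmp)
    # The second run only needs to fetch the season that is not cached yet
    second = collect_with_checkpoints(["2004-05", "2005-06", "2006-07"], fake_fetch, tmp)

print(requested)  # ['2004-05', '2005-06', '2006-07']
```

In the notebook, `fetch_season` could wrap the `LeagueDashPlayerStats(season=...)` call from `collect_player_stats`.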

### 2.d. Testing

In [64]:
# Collect team stats
data_dir = '/Users/dB/Documents/repos/github/bint-capstone/data-sources/nba'
print("Talking to the API...")
print("Collecting Team Statistics...")
team_stats = collect_team_stats()
    
# Collect player stats
print("Collecting Player Statistics...")
player_stats = collect_player_stats()

print("Data pulled from API.")

Talking to the API...
Collecting Team Statistics...


Collecting Team Stats:  23%|██▎       | 7/30 [06:29<21:13, 55.35s/it]

Error collecting stats for team 1610612744 in season 2007-08: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))
Error collecting stats for team 1610612744 in season 2011-12: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
Error collecting stats for team 1610612744 in season 2012-13: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
Error collecting stats for team 1610612744 in season 2015-16: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
Error collecting stats for team 1610612744 in season 2016-17: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
Error collecting stats for team 1610612744 in season 2017-18: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
Error collecting stats for team 1610612744 in season 2018-19: HTTPSConnectionPool(host='stats.nba.co

Collecting Team Stats:  27%|██▋       | 8/30 [14:33<1:10:19, 191.80s/it]

Error collecting stats for team 1610612745 in season 2004-05: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
Error collecting stats for team 1610612745 in season 2005-06: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
Error collecting stats for team 1610612745 in season 2006-07: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)
Error collecting stats for team 1610612745 in season 2007-08: HTTPSConnectionPool(host='stats.nba.com', port=443): Read timed out. (read timeout=30)


Collecting Team Stats:  27%|██▋       | 8/30 [17:21<47:44, 130.20s/it]  


KeyboardInterrupt: 
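
Most of the failures in the output above are read timeouts, and each failed request is skipped rather than retried. A common alternative is to retry the same request a few times with exponentially growing, jittered delays. The sketch below is one way to do that. `call_with_retries` and its parameters are hypothetical, and the demo uses a stand-in function instead of a real request.

```python
import random
import time


def call_with_retries(func, max_attempts=4, base_delay=2.0, max_delay=60.0, sleep=time.sleep):
    """Call func(); on failure wait with exponential backoff plus jitter, then retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as exc:  # KeyboardInterrupt is not an Exception, so Ctrl-C still stops the run
            if attempt == max_attempts:
                raise  # out of attempts: let the caller see the last error
            # Delay doubles each attempt (base, 2*base, 4*base, ...), capped at max_delay
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Add up to 50% random jitter so retries are not evenly spaced
            delay += random.uniform(0, delay / 2)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            sleep(delay)


# Demo with a stand-in function that times out twice before succeeding
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("Read timed out.")
    return "ok"

recorded_delays = []  # capture the delays instead of actually sleeping
result = call_with_retries(flaky_request, sleep=recorded_delays.append)
print(result, attempts["count"], len(recorded_delays))  # ok 3 2
```

In the collection functions, `func` could be a small lambda wrapping the endpoint call and its `get_data_frames()[0]` lookup, so that a timed-out season is retried instead of being dropped from the result.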

In [None]:
print('Writing Team Stats to Folder')
team_stats.to_csv('nba_team_stats_2004_2024.csv')
team_stats.to_pickle('nba_team_stats_2004_2024.pkl')

print('Writing Player Stats to Folder')
player_stats.to_csv('nba_player_stats_2004_2024.csv')
player_stats.to_pickle('nba_player_stats_2004_2024.pkl')