# Data Ingestion Notebook

The goal of this project is to build a model that predicts how many points a player will score in an NBA game. In this notebook, I start by creating functions to collect player and team data from NBA.com. I begin with player box score records from the 2010-11 season through the 2022-23 season. I use the 2010-11 season as a cutoff because, while the league has changed significantly since then, the 2010-11 Finals would at least look somewhat recognizable to modern fans, whereas in the prior year's Finals, the Lakers and the Celtics rarely reached 100 points.

I gather all of the data needed to train and test my models by scraping nba.com. I start with box score data because that data set contains the target variable. After gathering the box score data, I collect additional player and team data to use as features in my model.

## Data Collection

The goal of this part of the notebook is to build functions to pull statistics from all of the seasons that are relevant to my project. The first function takes a list of URLs from nba.com, iterates through each season, collects the data as dataframes, and then concatenates the dataframes for each season into a final dataframe. The second function merges related dataframes into a single dataframe which it exports to a CSV.

In [1]:
import pandas as pd
import requests
import re
from collections import defaultdict

These headers come from a [Stack Overflow answer](https://stackoverflow.com/questions/59886998/what-headers-am-i-missing-to-scrape-the-nba-stats-data).

In [2]:
headers= {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
          'Referer': 'https://www.nba.com/'}

In [3]:
# These are the seasons that I use for my project; they could be adjusted to look at other periods of NBA history.
season_list = [
    '2010-11',
    '2011-12',
    '2012-13',
    '2013-14',
    '2014-15',
    '2015-16',
    '2016-17',
    '2017-18',
    '2018-19',
    '2019-20',
    '2020-21',
    '2021-22',
    '2022-23'
    ]
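If the range of seasons changes, the list could also be generated rather than typed out. The helper below is a hypothetical sketch (the name `make_seasons` is not part of the original notebook), and it assumes the `YYYY-YY` label format used above.

```python
def make_seasons(first_start_year, last_start_year):
    """Return season labels such as '2010-11' for each starting year, inclusive."""
    seasons = []
    for year in range(first_start_year, last_start_year + 1):
        # The suffix is the last two digits of the following year (e.g. 1999 -> '00').
        suffix = (year + 1) % 100
        seasons.append(f"{year}-{suffix:02d}")
    return seasons


generated_seasons = make_seasons(2010, 2022)
print(generated_seasons[0], generated_seasons[-1], len(generated_seasons))
```

Calling `make_seasons(2010, 2022)` should reproduce the thirteen labels listed in `season_list`.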

### Player Box Scores

The data I use comes from nba.com. I begin with a variety of player box score statistics, and I format the URLs as a list of dictionaries to stay organized.

In [None]:
# These are the URLs for the box score data I need. The function I create will adjust the season using the above list.
player_box_score_urls = [
    {'player_box_scores_traditional':'https://stats.nba.com/stats/leaguegamelog?Counter=1000&DateFrom=&DateTo=&Direction=DESC&ISTRound=&LeagueID=00&PlayerOrTeam=P&Season=2022-23&SeasonType=Regular%20Season&Sorter=DATE'},
    {'player_box_scores_advanced':'https://stats.nba.com/stats/playergamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'player_box_scores_misc':'https://stats.nba.com/stats/playergamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Misc&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'player_box_scores_scoring':'https://stats.nba.com/stats/playergamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Scoring&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'player_box_scores_usage':'https://stats.nba.com/stats/playergamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Usage&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='}
]
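As an alternative to storing full URL strings, the query string could be assembled from a dictionary of parameters. This is only a sketch: `build_url` and `BASE_URL` are hypothetical names, and the parameter set is shortened for illustration, so the real endpoints may still expect the full list of parameters shown in the URLs above.

```python
from urllib.parse import urlencode, quote

BASE_URL = "https://stats.nba.com/stats/"


def build_url(endpoint, season, **params):
    """Build a stats URL for one endpoint and season from keyword parameters."""
    query = {"LeagueID": "00", "Season": season, "SeasonType": "Regular Season"}
    query.update(params)
    # quote_via=quote encodes spaces as %20, matching the URLs above.
    return BASE_URL + endpoint + "?" + urlencode(query, quote_via=quote)


example_url = build_url("playergamelogs", "2010-11", MeasureType="Advanced")
print(example_url)
```

Building the URL this way makes the season an explicit argument instead of something patched into a string afterwards.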

This function iterates through a list of URLs, requests each URL once per season in the season list, concatenates the results, and then returns a dictionary with the names and dataframes as key-value pairs.

In [None]:
def collect_player_box_scores(url_list):
    
    default_dict = defaultdict(pd.DataFrame)
    #This pattern allows the for loop to modify the URLs by season.
    pattern = r'(Season=\d{4}-\d{2})'
    
    for dictionary in url_list:
        for name, url in dictionary.items():
            dfs = []
            for season_id in season_list:
                modified_url = re.sub(pattern, f'Season={season_id}', url)
                response = requests.get(modified_url, headers=headers).json()
                rows = response['resultSets'][0]['rowSet']
                columns = response['resultSets'][0]['headers']
                df = pd.DataFrame(rows, columns=columns)
                df['PLAYER_GAME_ID'] = df['PLAYER_ID'].astype(str)+df['GAME_ID'].astype(str)
                dfs.append(df)
        
            # Combine the per-season dataframes once every season has been collected for this URL.
            final_df = pd.concat(dfs, sort=False)
            default_dict[name] = final_df
            print(name + ' complete')
    return default_dict

In [None]:
player_box_scores_dict = collect_player_box_scores(player_box_score_urls)
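To make the parsing step concrete, here is a toy payload shaped like the responses that `collect_player_box_scores` reads. The values are made up and only three columns are shown. The sketch uses plain Python records rather than a dataframe so that it runs without any network access.

```python
# Toy payload with made-up values, shaped like the responses parsed above.
sample_response = {
    "resultSets": [
        {
            "headers": ["PLAYER_ID", "GAME_ID", "PTS"],
            "rowSet": [[101, "0021000001", 24], [102, "0021000001", 11]],
        }
    ]
}

result_set = sample_response["resultSets"][0]
# Pair every row with the header names to get one record per player-game.
records = [dict(zip(result_set["headers"], row)) for row in result_set["rowSet"]]

for record in records:
    # Same composite key as in the function: player ID followed by game ID.
    record["PLAYER_GAME_ID"] = str(record["PLAYER_ID"]) + str(record["GAME_ID"])

print(records[0])
```

The composite `PLAYER_GAME_ID` is what later lets the different box score tables be joined row for row.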

This second function merges the related dataframes into a single dataframe and then exports it to a CSV. Because the data for different statistics is stored in different formats on nba.com, I need to alter the function as I collect more data. Having this function available allows me to make minor tweaks as needed rather than rewriting the code from scratch each time.

In [4]:
def merge_dfs(df_dict,identifier,csv_name):
    
    merged_df = pd.DataFrame()
    
    for name, df in df_dict.items():
        if merged_df.empty:
            # The first dataframe is the base that the others are merged onto.
            merged_df = df
        else:
            # Only bring over columns that are not already present, plus the join key.
            cols_to_use = df.columns.difference(merged_df.columns).tolist()
            cols_to_use.append(identifier)
            merged_df = pd.merge(merged_df,df[cols_to_use],on=identifier)
        print(name + ' merge complete')
    
    # Write the CSV once, after every dataframe has been merged.
    merged_df.to_csv('./Data/'+csv_name+'.csv',index=False)
    return

In [None]:
merge_dfs(player_box_scores_dict,'PLAYER_GAME_ID','player_box_scores')

The CSV looks good. Moving forward, I continue to use this basic process with minor adjustments as needed. I no longer need the dictionary storing all of that data, so I delete it.

In [None]:
del player_box_scores_dict
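`merge_dfs` joins on a single identifier. That works cleanly when the identifier is unique per row, as `PLAYER_GAME_ID` is intended to be. For an identifier that repeats (for example, an ID that appears once per season), a merge pairs every matching row on the left with every matching row on the right. The sketch below uses made-up numbers to show the effect, with the `validate` argument of `pd.merge` as one possible guard.

```python
import pandas as pd

# Made-up season-level rows: the same PLAYER_ID appears once per season.
left = pd.DataFrame({"PLAYER_ID": [1, 1], "PTS": [20.1, 22.4]})
right = pd.DataFrame({"PLAYER_ID": [1, 1], "USG_PCT": [0.25, 0.27]})

# A plain merge on a repeated key pairs every left row with every right row.
merged = pd.merge(left, right, on="PLAYER_ID")
print(len(merged))

# validate="one_to_one" makes pandas raise instead of silently duplicating rows.
try:
    pd.merge(left, right, on="PLAYER_ID", validate="one_to_one")
    key_is_unique = True
except pd.errors.MergeError:
    key_is_unique = False
print(key_is_unique)
```

If the check fails for a given data set, adding a season column to the key before merging is one way to restore uniqueness.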

### Team Box Scores

In [None]:
team_box_score_urls = [
    {'team_box_scores_traditional':'https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'team_box_scores_advanced':'https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'team_box_scores_four_factors':'https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Four%20Factors&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'team_box_scores_misc':'https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Misc&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'team_box_scores_scoring':'https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Scoring&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='}
]

In [None]:
def collect_team_box_scores(url_list):
    
    default_dict = defaultdict(pd.DataFrame)
    #This pattern allows the for loop to modify the URLs by season.
    pattern = r'(Season=\d{4}-\d{2})'
    
    for dictionary in url_list:
        for name, url in dictionary.items():
            dfs = []
            for season_id in season_list:
                modified_url = re.sub(pattern, f'Season={season_id}', url)
                response = requests.get(modified_url, headers=headers).json()
                rows = response['resultSets'][0]['rowSet']
                columns = response['resultSets'][0]['headers']
                df = pd.DataFrame(rows, columns=columns)
                #I adjust the code slightly to reference team id instead of player id
                df['TEAM_GAME_ID'] = df['TEAM_ID'].astype(str)+df['GAME_ID'].astype(str)
                dfs.append(df)
        
            # Combine the per-season dataframes once every season has been collected for this URL.
            final_df = pd.concat(dfs, sort=False)
            default_dict[name] = final_df
            print(name + ' complete')
    return default_dict

In [None]:
team_box_scores_dict = collect_team_box_scores(team_box_score_urls)

In [None]:
merge_dfs(team_box_scores_dict,'TEAM_GAME_ID','team_box_scores')

In [None]:
del team_box_scores_dict

### Draft Data

In [None]:
draft_urls = [
    {'draft_basics':'https://stats.nba.com/stats/drafthistory?College=&LeagueID=00&OverallPick=&RoundNum=&RoundPick=&Season=&TeamID=0&TopX='},
    {'combine_stats':'https://stats.nba.com/stats/draftcombinedrillresults?LeagueID=00&SeasonYear=2022-23&default=2022-23&initial=2022-23&seasonRange=1947%2C2023'},
    {'draft_measurements':'https://stats.nba.com/stats/draftcombineplayeranthro?LeagueID=00&SeasonYear=2022-23'},
    {'draft_strength_agility':'https://stats.nba.com/stats/draftcombinedrillresults?LeagueID=00&SeasonYear=2022-23'}
]

In [5]:
def collect_data(url_list):
    
    default_dict = defaultdict(pd.DataFrame)
    #This pattern allows the for loop to modify the URLs by season.
    pattern = r'(Season=\d{4}-\d{2})'
    
    for dictionary in url_list:
        for name, url in dictionary.items():
            dfs = []
            for season_id in season_list:
                modified_url = re.sub(pattern, f'Season={season_id}', url)
                response = requests.get(modified_url, headers=headers).json()
                rows = response['resultSets'][0]['rowSet']
                columns = response['resultSets'][0]['headers']
                df = pd.DataFrame(rows, columns=columns)
                dfs.append(df)
        
            # Combine the per-season dataframes once every season has been collected for this URL.
            final_df = pd.concat(dfs, sort=False)
            default_dict[name] = final_df
            print(name + ' complete')
    return default_dict

In [None]:
draft_dict = collect_data(draft_urls)
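One detail worth checking for the draft URLs is that the season pattern looks for `Season=` followed by a season label, while the combine URLs use `SeasonYear=` and the draft history URL has an empty `Season=`. The sketch below uses shortened, illustrative URLs to show that the substitution leaves a `SeasonYear` URL unchanged, which would mean the loop requests the same URL for every season. The `flexible_pattern` variant is an assumption of mine, not part of the original notebook, and whether each draft endpoint should be looped over seasons at all is a separate question.

```python
import re

pattern = r'(Season=\d{4}-\d{2})'

# Shortened, illustrative URLs; only the season parameter matters here.
box_score_url = 'https://stats.nba.com/stats/playergamelogs?Season=2023-24&SeasonType=Regular%20Season'
combine_url = 'https://stats.nba.com/stats/draftcombineplayeranthro?LeagueID=00&SeasonYear=2022-23'

box_score_result = re.sub(pattern, 'Season=2010-11', box_score_url)
print(box_score_result)

# In 'SeasonYear=2022-23' the text after 'Season' is 'Year', not '=', so nothing matches.
combine_unchanged = re.sub(pattern, 'Season=2010-11', combine_url) == combine_url
print(combine_unchanged)

# One possible variant that also accepts the 'SeasonYear' spelling and keeps the parameter name.
flexible_pattern = r'(Season(?:Year)?)=\d{4}-\d{2}'
combine_result = re.sub(flexible_pattern, r'\g<1>=2010-11', combine_url)
print(combine_result)
```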

In [None]:
draft_dict['draft_basics']['PLAYER_ID'] = draft_dict['draft_basics']['PERSON_ID']

In [None]:
merge_dfs(draft_dict,'PLAYER_ID','draft_data')

In [None]:
del draft_dict

### Player Season Stats

In [21]:
player_season_urls = [
    {'traditional_player_stats':'https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Scoring&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2022-23&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight='},
    {'advanced_player_stats':'https://stats.nba.com/stats/leaguedashplayerstats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2022-23&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight='},
]

In [22]:
player_season_dict = collect_data(player_season_urls)

traditional_player_stats complete
advanced_player_stats complete


In [23]:
merge_dfs(player_season_dict,'PLAYER_ID','player_season_data')

traditional_player_stats merge complete
advanced_player_stats merge complete


In [24]:
del player_season_dict

### Team Offense Season Stats

In [25]:
team_offense_urls = [
    {'traditional_team_stats':'https://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2022-23&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision='},
    {'advanced_team_stats':'https://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2022-23&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision='},
    {'team_scoring_stats':'https://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Scoring&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2022-23&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision='},
]

In [26]:
team_offense_dict = collect_data(team_offense_urls)

traditional_team_stats complete
advanced_team_stats complete
team_scoring_stats complete


In [27]:
merge_dfs(team_offense_dict,'TEAM_ID','team_season_data')

traditional_team_stats merge complete
advanced_team_stats merge complete
team_scoring_stats merge complete


In [29]:
del team_offense_dict

### Team Defense Season Stats

In [30]:
team_defense_urls = [   
    {'opponent_scoring_stats':'https://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Opponent&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2022-23&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision='},
    {'team_defense_stats':'https://stats.nba.com/stats/leaguedashteamstats?Conference=&DateFrom=&DateTo=&Division=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Defense&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2022-23&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&TwoWay=0&VsConference=&VsDivision='}
]

In [31]:
team_defense_dict = collect_data(team_defense_urls)

opponent_scoring_stats complete
team_defense_stats complete


In [32]:
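# A distinct file name keeps this export from overwriting the team offense CSV written earlier.
merge_dfs(team_defense_dict,'TEAM_ID','team_defense_season_data')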

opponent_scoring_stats merge complete
team_defense_stats merge complete


In [33]:
del team_defense_dict

### Team Shooting Distributions

The following statistical sets don't work with the collect_data function because their headers are dictionaries instead of lists. Also, there are two header rows instead of one. I adjust the function for these special cases to get the data I want.
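The two-row header can be pictured as a row of shot-distance groups sitting above a row of repeated stat names. The sketch below flattens made-up header rows of that shape into one row. The helper name `flatten_headers` and its `id_cols` and `span` parameters are hypothetical, and the sample column names are illustrative rather than copied from a real response.

```python
def flatten_headers(group_names, column_names, id_cols, span=3):
    """Combine a row of group names with the repeated stat names beneath it."""
    # The stat names repeat for every group, so the first set is enough.
    stat_names = column_names[id_cols:id_cols + span]
    flattened = [group + stat for group in group_names for stat in stat_names]
    # Identifier columns come first and keep their original names.
    return column_names[:id_cols] + flattened


# Made-up header rows shaped like the two rows described above.
sample_headers = [
    {"columnNames": ["Less Than 5 ft.", "5-9 ft."]},
    {"columnNames": ["TEAM_ID", "TEAM_NAME", "FGM", "FGA", "FG_PCT", "FGM", "FGA", "FG_PCT"]},
]

columns = flatten_headers(
    sample_headers[0]["columnNames"], sample_headers[1]["columnNames"], id_cols=2
)
print(columns)
```

Passing a different `id_cols` value would cover responses that carry more identifying columns before the stats begin.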

In [16]:
team_shooting_urls = [
    {'team_opponent_shot_dist':'https://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFrom=&DateTo=&DistanceRange=5ft%20Range&Division=&GameScope=&GameSegment=&ISTRound=&LastNGames=0&Location=&MeasureType=Opponent&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='},
    {'team_shot_locations':'https://stats.nba.com/stats/leaguedashteamshotlocations?Conference=&DateFrom=&DateTo=&DistanceRange=5ft%20Range&Division=&GameScope=&GameSegment=&ISTRound=&LastNGames=0&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision='}
]

In [17]:
def collect_team_shooting_data(url_list):
    
    default_dict = defaultdict(pd.DataFrame)
    #This pattern allows the for loop to modify the URLs by season.
    pattern = r'(Season=\d{4}-\d{2})'
    
    for dictionary in url_list:
        for name, url in dictionary.items():
            dfs = []
            for season_id in season_list:
                modified_url = re.sub(pattern, f'Season={season_id}', url)
                response = requests.get(modified_url, headers=headers).json()
                
                #This part of the function concatenates the two header rows into a single row.
                headers1 = response['resultSets']['headers'][0]['columnNames']
                headers2 = response['resultSets']['headers'][1]['columnNames']
                concatenated_headers = []
                
                for i in headers1:
                    for z in range(2, 5):
                        new_header = i + headers2[z]
                        concatenated_headers.append(new_header)
            
                final_headers = headers2[:2] + concatenated_headers
                
                #This adjustment correctly names the rows and columns
                rows = response['resultSets']['rowSet']
                columns = final_headers
                df = pd.DataFrame(rows, columns=columns)
                dfs.append(df)
        
            # Combine the per-season dataframes once every season has been collected for this URL.
            final_df = pd.concat(dfs, sort=False)
            default_dict[name] = final_df
            print(name + ' complete')
    return default_dict

In [18]:
team_shooting_dict = collect_team_shooting_data(team_shooting_urls)

team_opponent_shot_dist complete
team_shot_locations complete


In [19]:
merge_dfs(team_shooting_dict,'TEAM_ID','team_shooting_data')

team_opponent_shot_dist merge complete
team_shot_locations merge complete


In [20]:
del team_shooting_dict

### Player Shooting Distributions

This is the final data set I want to use. Unfortunately, the player shooting distributions are formatted differently from the team shooting distributions, so I adjust the function once again.

In [10]:
player_shooting_urls = [
    {'player_shooting_distribution':'https://stats.nba.com/stats/leaguedashplayershotlocations?College=&Conference=&Country=&DateFrom=&DateTo=&DistanceRange=5ft%20Range&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2022-23&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight='},
    {'player_opponent_shooting':'https://stats.nba.com/stats/leaguedashplayershotlocations?College=&Conference=&Country=&DateFrom=&DateTo=&DistanceRange=5ft%20Range&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&Location=&MeasureType=Opponent&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight='}
]

In [11]:
def collect_player_shooting_data(url_list):
    
    default_dict = defaultdict(pd.DataFrame)
    #This pattern allows the for loop to modify the URLs by season.
    pattern = r'(Season=\d{4}-\d{2})'
    
    for dictionary in url_list:
        for name, url in dictionary.items():
            dfs = []
            for season_id in season_list:
                modified_url = re.sub(pattern, f'Season={season_id}', url)
                response = requests.get(modified_url, headers=headers).json()
                
                #This part of the function concatenates the two header rows into a single row.
                headers1 = response['resultSets']['headers'][0]['columnNames']
                headers2 = response['resultSets']['headers'][1]['columnNames']
                concatenated_headers = []
          
                for i in headers1:
                    for z in range(6, 9):
                        new_header = i + headers2[z]
                        concatenated_headers.append(new_header)
            
                final_headers = headers2[:6] + concatenated_headers
                
                #This adjustment correctly names the rows and columns
                rows = response['resultSets']['rowSet']
                columns = final_headers
                df = pd.DataFrame(rows, columns=columns)
                dfs.append(df)
        
            # Combine the per-season dataframes once every season has been collected for this URL.
            final_df = pd.concat(dfs, sort=False)
            default_dict[name] = final_df
            print(name + ' complete')
    return default_dict

In [12]:
player_shooting_dict = collect_player_shooting_data(player_shooting_urls)

player_shooting_distribution complete
player_opponent_shooting complete


In [14]:
merge_dfs(player_shooting_dict,'PLAYER_ID','player_shooting_data')

player_shooting_distribution merge complete
player_opponent_shooting merge complete


In [15]:
del player_shooting_dict

With all the data collected, it's time to move on to preprocessing.