# Data Collection Notebook

The goal of this project is to build a model that predicts how many points a player will score in an NBA game. In this notebook, I start by creating a function to collect player and team data from NBA.com. I begin with player box score records from the 2015-16 season through the 2022-23 season. I use the 2015-16 season as a cutoff because, while the league has changed significantly since then, teams had already adopted core features of modern basketball by that point: many power forwards were expected and able to shoot three-point shots, for instance.

In this notebook, I gather all of the data to train and test my models by web scraping nba.com. I learned how to do this from [this video](https://www.youtube.com/watch?v=o6Ih934hADU) from Dataquest and [this video](https://www.youtube.com/watch?v=IELK56jIsEo) from Learn with Jabe; while I wrote my own code for my specific purpose, I adapted my approach from those videos. I start with box score data because that is the data set that contains the target variable. After gathering the box score data, I collect additional player and team data to use as features in my model.

## 1. Imports and Setup

First, I import the packages I need for data collection, define the request headers that let NBA.com accept my scraping requests, and define the seasons relevant to my study.

In [1]:
import pandas as pd
import requests
import re
from collections import defaultdict

These headers come from [this Stack Overflow question](https://stackoverflow.com/questions/59886998/what-headers-am-i-missing-to-scrape-the-nba-stats-data).

In [2]:
headers= {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36',
          'Referer': 'https://www.nba.com/'}
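
The functions below call `requests.get` directly. For a long scraping run, a small wrapper with a timeout and a few retries may make failures easier to handle. This is only a sketch under assumptions: the name `fetch_json`, its parameters, and the default values are hypothetical and not part of the original notebook.

```python
import time
import requests

def fetch_json(url, headers, retries=3, pause=2.0, timeout=30, get=requests.get):
    """Request a URL and return the parsed JSON, retrying on request errors."""
    last_error = None
    for attempt in range(retries):
        try:
            response = get(url, headers=headers, timeout=timeout)
            response.raise_for_status()  # Turn HTTP error codes into exceptions.
            return response.json()
        except requests.RequestException as error:
            last_error = error
            if attempt < retries - 1:
                time.sleep(pause)  # Wait before trying again.
    # Every attempt failed, so surface the last error to the caller.
    raise last_error
```

The `get` parameter exists only so the function can be exercised with a stand-in instead of a live request; in normal use it can be left at its default.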

In [3]:
season_list = [
    '2015-16',
    '2016-17',
    '2017-18',
    '2018-19',
    '2019-20',
    '2020-21',
    '2021-22',
    '2022-23'
]
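
The same list could also be generated rather than typed out, which would make it easier to extend the study window later. This is a minimal sketch; the helper name `make_season_list` is hypothetical and not part of the original notebook.

```python
def make_season_list(first_start_year, last_start_year):
    """Build NBA season labels such as '2015-16' from starting years (inclusive)."""
    seasons = []
    for year in range(first_start_year, last_start_year + 1):
        # The label pairs the starting year with the last two digits of the next year.
        seasons.append(f"{year}-{(year + 1) % 100:02d}")
    return seasons

generated_seasons = make_season_list(2015, 2022)
print(generated_seasons)
```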

## 2. Data Collection

In this part of the notebook, I build functions to pull statistics from all of the seasons that are relevant to my project. The first function takes a list of URLs from nba.com, iterates through each season, collects the data as dataframes, and then concatenates the dataframes for each season into a final dataframe. The second function merges related dataframes into a single dataframe which it exports to a CSV.

### 2a. Collecting Player Box Scores

I start with player box scores, which contain metrics related to each player's performance in each game that was played, because this information will serve as the backbone of my project.

In [4]:
# These are the URLs for the box score data I need. The function I create will adjust the season using the above list.
player_box_score_urls = [
    {'player_box_scores_traditional':'https://stats.nba.com/stats/leaguegamelog?Counter=1000&DateFrom=&DateTo=&Direction=DESC&ISTRound=&LeagueID=00&PlayerOrTeam=P&Season=2022-23&SeasonType=Regular%20Season&Sorter=DATE'},
    {'player_box_scores_advanced':'https://stats.nba.com/stats/playergamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'player_box_scores_misc':'https://stats.nba.com/stats/playergamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Misc&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'player_box_scores_scoring':'https://stats.nba.com/stats/playergamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Scoring&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'player_box_scores_usage':'https://stats.nba.com/stats/playergamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Usage&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='}
]
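
The season embedded in each URL is swapped with a regular-expression substitution on `Season=YYYY-YY`. The snippet below checks that substitution in isolation; the URL is shortened for illustration and is not a complete NBA.com query string.

```python
import re

# Same pattern the collection function uses to find the season parameter.
pattern = r'(Season=\d{4}-\d{2})'

# A shortened, illustrative URL (not a full NBA.com query string).
example_url = ('https://stats.nba.com/stats/playergamelogs?MeasureType=Usage'
               '&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season')

swapped = re.sub(pattern, 'Season=2015-16', example_url)
print(swapped)

# 'SeasonSegment=' and 'SeasonType=' are left alone because the pattern
# requires four digits, a hyphen, and two digits right after 'Season='.
```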

This function iterates through the list of URLs and, for each URL, through every season in the season list, collecting that season's data. It then concatenates the results and returns a dictionary with the names and dataframes as key-value pairs.

In [5]:
def collect_player_box_scores(url_list):
    
    default_dict = defaultdict(pd.DataFrame)
    #This pattern allows the for loop to modify the URLs by season.
    pattern = r'(Season=\d{4}-\d{2})'
    
    for dictionary in url_list:
        for name, url in dictionary.items():
            dfs = []
            for season_id in season_list:
                modified_url = re.sub(pattern, f'Season={season_id}', url)
                response = requests.get(modified_url, headers=headers).json()
                rows = response['resultSets'][0]['rowSet']
                columns = response['resultSets'][0]['headers']
                df = pd.DataFrame(rows, columns=columns)
                df['PLAYER_GAME_ID'] = df['PLAYER_ID'].astype(str)+df['GAME_ID'].astype(str)
                dfs.append(df)
        
            # Concatenate inside the name loop so that every name gets its own dataframe.
            final_df = pd.concat(dfs, sort=False)
            default_dict[name] = final_df
            print(name + ' complete')
    return default_dict
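
To make the parsing step concrete, the sketch below runs the same dataframe construction on a small hand-written payload. The payload mirrors only the keys the function reads (`resultSets`, `headers`, `rowSet`); the IDs and point totals are made up for illustration.

```python
import pandas as pd

# Hypothetical payload in the shape the function reads. Values are invented.
sample_response = {
    'resultSets': [
        {
            'headers': ['PLAYER_ID', 'GAME_ID', 'PTS'],
            'rowSet': [
                [101, '0022200001', 25],
                [102, '0022200001', 12],
            ],
        }
    ]
}

result_set = sample_response['resultSets'][0]
sample_df = pd.DataFrame(result_set['rowSet'], columns=result_set['headers'])

# Same key construction as in the function: player ID followed by game ID.
sample_df['PLAYER_GAME_ID'] = (
    sample_df['PLAYER_ID'].astype(str) + sample_df['GAME_ID'].astype(str)
)

print(sample_df)
```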

In [6]:
player_box_scores_dict = collect_player_box_scores(player_box_score_urls)

player_box_scores_traditional complete
player_box_scores_advanced complete
player_box_scores_misc complete
player_box_scores_scoring complete
player_box_scores_usage complete


This second function merges the related dataframes into a single dataframe and then exports it to a CSV.

In [7]:
def merge_dfs(df_dict,identifier,csv_name):
    
    merged_df = pd.DataFrame()
    
    for name, df in df_dict.items():
        if merged_df.empty:
            merged_df = df
        else:
            # Bring in only the columns not already present, plus the key to join on.
            cols_to_use = df.columns.difference(merged_df.columns).tolist()
            cols_to_use.append(identifier)
            merged_df = pd.merge(merged_df,df[cols_to_use],on=identifier)
        print(name + ' merge complete')
    
    # Write the CSV once, after every dataframe has been merged.
    merged_df.to_csv('./Data/'+csv_name+'.csv',index=False)
    return
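
The column-selection step is easier to follow on a toy example. The sketch below uses two tiny made-up dataframes that share a key and one overlapping column. The `validate='one_to_one'` argument is an optional addition of mine, not something the notebook uses; it makes pandas raise an error if the key is duplicated on either side.

```python
import pandas as pd

# Made-up values for illustration only.
traditional = pd.DataFrame({
    'PLAYER_GAME_ID': ['a1', 'b1'],
    'PLAYER_NAME': ['Player A', 'Player B'],
    'PTS': [25, 12],
})
advanced = pd.DataFrame({
    'PLAYER_GAME_ID': ['a1', 'b1'],
    'PLAYER_NAME': ['Player A', 'Player B'],
    'USG_PCT': [0.31, 0.18],
})

# Columns in `advanced` that `traditional` does not already have.
new_cols = advanced.columns.difference(traditional.columns).tolist()
new_cols.append('PLAYER_GAME_ID')

# Merge on the shared key; overlapping columns are not duplicated.
merged = pd.merge(traditional, advanced[new_cols],
                  on='PLAYER_GAME_ID', validate='one_to_one')
print(merged.columns.tolist())
```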

In [8]:
merge_dfs(player_box_scores_dict,'PLAYER_GAME_ID','player_box_scores')

player_box_scores_traditional merge complete
player_box_scores_advanced merge complete
player_box_scores_misc merge complete
player_box_scores_scoring merge complete
player_box_scores_usage merge complete


The CSV looks good. Moving forward, I continue to use this basic process with minor adjustments as needed. I no longer need the dictionary storing all of that data, so I delete it.

In [9]:
del player_box_scores_dict

### 2b. Collecting Team Box Scores

Next, I collect data on each team's performance in each game. This is important because team tendencies affect player scoring. For instance, the pace a team plays at, the frequency of its fast breaks, and its shot profile could all be relevant to individual player scoring.

In [10]:
team_box_score_urls = [
    {'team_box_scores_traditional':'https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Base&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'team_box_scores_advanced':'https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Advanced&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'team_box_scores_four_factors':'https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Four%20Factors&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'team_box_scores_misc':'https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Misc&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='},
    {'team_box_scores_scoring':'https://stats.nba.com/stats/teamgamelogs?DateFrom=&DateTo=&GameSegment=&ISTRound=&LastNGames=0&LeagueID=00&Location=&MeasureType=Scoring&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PaceAdjust=N&PerMode=Totals&Period=0&PlusMinus=N&Rank=N&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&VsConference=&VsDivision='}
]

I have to adjust the function here because I want to create a team game ID column rather than a player game ID column.

In [11]:
def collect_team_box_scores(url_list):
    
    default_dict = defaultdict(pd.DataFrame)
    pattern = r'(Season=\d{4}-\d{2})'
    
    for dictionary in url_list:
        for name, url in dictionary.items():
            dfs = []
            for season_id in season_list:
                modified_url = re.sub(pattern, f'Season={season_id}', url)
                response = requests.get(modified_url, headers=headers).json()
                rows = response['resultSets'][0]['rowSet']
                columns = response['resultSets'][0]['headers']
                df = pd.DataFrame(rows, columns=columns)
                #I adjust the code slightly to reference team id instead of player id
                df['TEAM_GAME_ID'] = df['TEAM_ID'].astype(str)+df['GAME_ID'].astype(str)
                dfs.append(df)
        
            # Concatenate inside the name loop so that every name gets its own dataframe.
            final_df = pd.concat(dfs, sort=False)
            default_dict[name] = final_df
            print(name + ' complete')
    return default_dict
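
Because the player and team collectors differ only in the ID column they combine with `GAME_ID`, the shared logic could be folded into one parameterized function. The sketch below is one possible refactor, not the notebook's actual code: `collect_box_scores`, `fetch_json`, `id_column`, and `key_name` are hypothetical names, and the fetch step is passed in as a function so the example can run on fake data without contacting NBA.com.

```python
import re
import pandas as pd

SEASON_PATTERN = r'(Season=\d{4}-\d{2})'

def collect_box_scores(url_list, seasons, id_column, key_name, fetch_json):
    """Collect one dataframe per named URL, with a combined ID column added."""
    results = {}
    for dictionary in url_list:
        for name, url in dictionary.items():
            dfs = []
            for season_id in seasons:
                season_url = re.sub(SEASON_PATTERN, f'Season={season_id}', url)
                result_set = fetch_json(season_url)['resultSets'][0]
                df = pd.DataFrame(result_set['rowSet'], columns=result_set['headers'])
                # Combine the chosen ID column with the game ID into one key.
                df[key_name] = df[id_column].astype(str) + df['GAME_ID'].astype(str)
                dfs.append(df)
            results[name] = pd.concat(dfs, sort=False, ignore_index=True)
    return results

def fake_fetch(url):
    """Stand-in for a real request: returns one made-up row per season URL."""
    season = re.search(r'Season=(\d{4}-\d{2})', url).group(1)
    return {'resultSets': [{'headers': ['TEAM_ID', 'GAME_ID', 'SEASON'],
                            'rowSet': [[1, 'G' + season[:4], season]]}]}

demo = collect_box_scores(
    [{'team_demo': 'https://example.com/stats?Season=2023-24&SeasonType=Regular%20Season'}],
    ['2015-16', '2016-17'],
    id_column='TEAM_ID',
    key_name='TEAM_GAME_ID',
    fetch_json=fake_fetch,
)
print(demo['team_demo'])
```

With real data, the fetch argument could be something like `lambda url: requests.get(url, headers=headers).json()`, and the player version would pass `id_column='PLAYER_ID'` and `key_name='PLAYER_GAME_ID'`.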

In [12]:
team_box_scores_dict = collect_team_box_scores(team_box_score_urls)

team_box_scores_traditional complete
team_box_scores_advanced complete
team_box_scores_four_factors complete
team_box_scores_misc complete
team_box_scores_scoring complete


In [13]:
merge_dfs(team_box_scores_dict,'TEAM_GAME_ID','team_box_scores')

team_box_scores_traditional merge complete
team_box_scores_advanced merge complete
team_box_scores_four_factors merge complete
team_box_scores_misc merge complete
team_box_scores_scoring merge complete


In [14]:
del team_box_scores_dict

With the box score data collected, one more data set remains before preprocessing: player bios.

### 2c. Collecting Player Bios

The final data set I collect for my analysis shows general information for each player, including their age, height, weight, and draft pick.

In [26]:
player_bios_urls = [
    {'player_bios':'https://stats.nba.com/stats/leaguedashplayerbiostats?College=&Conference=&Country=&DateFrom=&DateTo=&Division=&DraftPick=&DraftYear=&GameScope=&GameSegment=&Height=&ISTRound=&LastNGames=0&LeagueID=00&Location=&Month=0&OpponentTeamID=0&Outcome=&PORound=0&PerMode=PerGame&Period=0&PlayerExperience=&PlayerPosition=&Season=2023-24&SeasonSegment=&SeasonType=Regular%20Season&ShotClockRange=&StarterBench=&TeamID=0&VsConference=&VsDivision=&Weight='}
]

I adjust the function slightly again: bios are collected per season rather than per game, so there is no game ID column to build, and I add a season ID instead.

In [30]:
def collect(url_list):
    
    #There is no need to merge multiple dataframes so I do not include a default dictionary.
    pattern = r'(Season=\d{4}-\d{2})'
    
    for dictionary in url_list:
        for name, url in dictionary.items():
            dfs = []
            for season_id in season_list:
                modified_url = re.sub(pattern, f'Season={season_id}', url)
                response = requests.get(modified_url, headers=headers).json()
                rows = response['resultSets'][0]['rowSet']
                columns = response['resultSets'][0]['headers']
                df = pd.DataFrame(rows, columns=columns)
                # No game IDs are needed here, but each row does need a season ID.
                df['SEASON_ID'] = season_id
                dfs.append(df)
        
            # Concatenate once all seasons for this name have been collected.
            final_df = pd.concat(dfs, sort=False)
            print(name + ' complete')
    return final_df
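
The season tag matters because a player appears once in every season's bio table. The toy example below, with made-up values and assumed column names (`PLAYER_ID`, `AGE`), shows that the player ID alone stops being unique after concatenation, while the player and season pair stays unique.

```python
import pandas as pd

# Made-up bios for two seasons; the column names here are assumptions.
bios_by_season = {
    '2021-22': pd.DataFrame({'PLAYER_ID': [101, 102], 'AGE': [24, 30]}),
    '2022-23': pd.DataFrame({'PLAYER_ID': [101, 102], 'AGE': [25, 31]}),
}

tagged = []
for season_id, df in bios_by_season.items():
    df = df.copy()
    df['SEASON_ID'] = season_id  # Same tagging step as in the function.
    tagged.append(df)

all_bios = pd.concat(tagged, sort=False, ignore_index=True)

# Each player appears once per season, so PLAYER_ID alone is not unique.
player_id_unique = all_bios['PLAYER_ID'].is_unique
# The (PLAYER_ID, SEASON_ID) pair has no duplicates.
pair_has_duplicates = all_bios.duplicated(subset=['PLAYER_ID', 'SEASON_ID']).any()

print(player_id_unique, pair_has_duplicates)
```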

In [31]:
player_bios_df = collect(player_bios_urls)

player_bios complete


In [32]:
player_bios_df.to_csv('./Data/player_bios.csv',index=False)

Now that I have all of the data I need, I merge these data sets and clean the results in the next notebook.