# **NBA Play-By-Play Data - 2023-2024 Season**
Brendan Keane
August 8, 2024

## **Summary**
This notebook scrapes the [NBA website](https://nba.com) for play-by-play data from the 2023-2024 season using *Python* and *Beautiful Soup*. This process is broken down into the following steps:
1. Import packages and define constants
1. Retrieve play-by-play JSON data from [nba.com](http://nba.com/game)
1. Save play-by-play JSON for all 1,230 NBA games
1. Combine all 1,230 JSONs into one CSV

## **1. Import packages and define constants**

In [None]:
# Imports
import os
import pandas as pd

# Constants
fp = '../data/raw/'
files = os.listdir(fp)
print(f"{len(files)} files found in {fp}")

## **2. Retrieve play-by-play JSON data from [nba.com](http://nba.com/game)**

## **3. Helper functions**

In [None]:
def create_teams(df):
    """
    Function adds columns 'home' and 'away' containing team tri-codes (ex. 'BOS' for the Boston Celtics) to the uncleaned dataframe.

    Args:
      df (dataframe): Dataframe to add the columns to

    Returns:
      df (dataframe): Dataframe with the 'home' and 'away' columns added

    """

    # Get home team names
    home_df = df[df['location'] == 'h']
    home = home_df['teamTricode'].unique()[0]

    # Get away team names
    away_df = df[df['location'] == 'v']
    away = away_df['teamTricode'].unique()[0]

    # Add home and away team names to the dataframe
    df['home'] = home
    df['away'] = away

    return df


def create_shotval(df):
    """
    This function adds a column called 'shotVal' to the dataframe. This column represents the point value of a field goal or free throw attempt, whether or not it was made. For a field goal attempt, the value is 3 if the description contains '3PT' and 2 otherwise. For a free throw, the value is 1. For any other row, the value is '' (blank).

    Args:
      df (dataframe): Dataframe to add the column to

    Returns:
      df (dataframe): Dataframe with the 'shotVal' column added

    """

    # Sets all non-field goal rows to ''
    df.loc[df['isFieldGoal'] == 0, 'shotVal'] = ''

    # Sets all field goal rows to 2 or 3 depending on if it is a 2 or 3 pointer
    # na=False treats a missing description as "does not contain '3PT'"
    df.loc[(df['isFieldGoal'] == 1) & (df['description'].str.contains('3PT', na=False)), 'shotVal'] = 3
    df.loc[(df['isFieldGoal'] == 1) & (~df['description'].str.contains('3PT', na=False)), 'shotVal'] = 2

    # Sets all free throw rows to 1
    df.loc[df['actionType'] == 'Free Throw', 'shotVal'] = 1

    return df


def create_scoreval(df):
    """
    The function adds a column called 'scoreVal' to the dataframe. This column represents the points scored from a field goal or free throw attempt. If it is a non-field goal, the value is '' (blank). If it is a made field goal, the value is the shot value (2 for a 2-pointer, 3 for a 3-pointer). If it is a made free throw, the value is 1. If it is a missed field goal or free throw, the value is 0.

    Args:
      df (dataframe): Dataframe to add the column to

    Returns:
      df (dataframe): Dataframe with the 'scoreVal' column added

    """

    # Sets all non-field goal rows to '' (free throw rows are overwritten below)
    df.loc[df['isFieldGoal'] == 0, 'scoreVal'] = ''

    # Sets all field goal rows to their shot value if they are made shots and 0 if they are missed shots
    df.loc[df['actionType'] == 'Made Shot', 'scoreVal'] = df['shotVal']
    df.loc[df['actionType'] == 'Missed Shot', 'scoreVal'] = 0

    # Sets all free throw rows to 1 if they are made shots and 0 if they are missed shots
    # na=False treats a missing description as "does not contain 'MISS'"
    df.loc[(df['actionType'] == 'Free Throw') & (df['description'].str.contains('MISS', na=False)), 'scoreVal'] = 0
    df.loc[(df['actionType'] == 'Free Throw') & (~df['description'].str.contains('MISS', na=False)), 'scoreVal'] = 1

    return df


def convert_clock(df):
    """
    This function converts the 'clock' column from the raw duration string to a minutes:seconds string. The value is parsed as a time and the leading hours are then sliced off.

    Args:
      df (dataframe): Dataframe to convert the 'clock' column in

    Returns:
      df (dataframe): Dataframe with the 'clock' column converted to a minutes:seconds string

    """

    # Cleaning the string to create a time format value
    df['clock'] = df['clock'].str.strip('PT')
    df['clock'] = df['clock'].str.replace('M', ':')
    df['clock'] = df['clock'].str.replace('S', '')
    df['clock'] = pd.to_datetime(df['clock'], format='%M:%S.%f').dt.time

    # Extracting the minutes and seconds from the time format
    df['clock'] = df['clock'].astype(str).str.slice(start=3)

    return df
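
As a complement to `convert_clock`, the sketch below turns the same raw clock strings into seconds remaining in the period, which can be easier to sort and subtract than a string. It assumes raw values shaped like `'PT11M38.00S'` (the shape implied by the string replacements above); the sample values and the name `seconds_remaining` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Hypothetical raw clock values in the 'PT<minutes>M<seconds>S' shape
clock = pd.Series(["PT12M00.00S", "PT11M38.00S", "PT00M04.50S"])

# Pull the minutes and seconds out with named regex groups
parts = clock.str.extract(r"^PT(?P<minutes>\d+)M(?P<seconds>\d+(?:\.\d+)?)S$")

# Seconds left in the period as a float
seconds_remaining = parts["minutes"].astype(int) * 60 + parts["seconds"].astype(float)

print(seconds_remaining.tolist())  # [720.0, 698.0, 4.5]
```

A numeric column like this could sit alongside the formatted `clock` column rather than replace it.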

## **4. Clean dataframe**

In [None]:
def clean_df(df):
    """
    Function cleans the dataframe by keeping only the columns needed for export. It first adds columns 'home' and 'away' containing team tri-codes (ex. 'BOS' for the Boston Celtics), then adds columns 'shotVal' and 'scoreVal', which represent the point value of and the points scored from a field goal or free throw attempt. Finally, it converts the 'clock' column from the raw string to a minutes:seconds string.

    Args:
      df (dataframe): Dataframe to be cleaned

    Returns:
      df (dataframe): Cleaned dataframe

    """

    # List of columns to keep
    export_columns = ['game_id', 'period', 'clock', 'home', 'scoreHome', 'away', 'scoreAway',
        'playerNameI', 'teamTricode', 'description', 'actionType', 'subType',
        'xLegacy', 'yLegacy', 'shotDistance', 'isFieldGoal', 'shotVal', 'scoreVal',
        'location']

    # Adds 'home' and 'away' columns with the team tri-code
    df = create_teams(df)

    # Creates 'shotVal' and 'scoreVal' columns. These columns represent the
    # point value and points scored from a field goal or free throw
    df = create_shotval(df)
    df = create_scoreval(df)

    # Converting the 'clock' from a string to a time format
    df = convert_clock(df)

    return df[export_columns].reset_index(drop=True)

# Assumes `df` already holds the raw play-by-play data for a single game
df = clean_df(df)
print(f"Successfully cleaned dataframe containing {df.shape[0]} rows and {df.shape[1]} columns. \n")
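
The 'shotVal' and 'scoreVal' rules that `clean_df` applies can be checked on a small toy frame. The sketch below is a standalone variant, not the notebook's own implementation: it uses `np.select` and leaves non-shot rows as `NaN` instead of `''` so both columns stay numeric. The rows, player labels, and descriptions are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical play-by-play rows covering each case
toy = pd.DataFrame({
    "actionType": ["Made Shot", "Missed Shot", "Free Throw", "Free Throw", "Rebound"],
    "isFieldGoal": [1, 1, 0, 0, 0],
    "description": [
        "Player A 3PT Jump Shot",
        "MISS Player B Layup",
        "Player A Free Throw 1 of 2",
        "MISS Player A Free Throw 2 of 2",
        None,
    ],
})

desc = toy["description"].fillna("")
is_fg = toy["isFieldGoal"] == 1
is_ft = toy["actionType"] == "Free Throw"
is_three = desc.str.contains("3PT")

# Point value of the attempt: 3, 2, or 1; NaN for rows that are not shots
toy["shotVal"] = np.select([is_fg & is_three, is_fg, is_ft], [3, 2, 1], default=np.nan)

# Points actually scored: the shot value if made, 0 if missed, NaN otherwise
made = (toy["actionType"] == "Made Shot") | (is_ft & ~desc.str.contains("MISS"))
toy["scoreVal"] = np.where(toy["shotVal"].isna(), np.nan, np.where(made, toy["shotVal"], 0))

print(toy[["actionType", "shotVal", "scoreVal"]])
```

Keeping the columns numeric means sums such as `toy["scoreVal"].sum()` work without first filtering out blank strings.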

In [None]:
df.sample(5)

In [None]:
def load_game(files):
    """
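    Function loops over every file in the list, converts it into a dataframe,
    adds the game id as a column titled 'game_id', and cleans it. The cleaned
    dataframes are then combined into one dataframe, which is also saved as a
    csv file in the 'processed' folder.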

    Args:
      files (list): List of files to be processed

    Returns:
      df (dataframe): Dataframe containing all the data from the files
    """

    games = []

    # Looping over every file in the list
    for file in files:
        print("Processing file: " + file)

        # Reading the file as a dataframe
        temp_df = pd.read_json(os.path.join(fp, file))

        # Adding the game id as a column
        temp_df['game_id'] = file.split('.')[0]

        # Cleaning the values within the dataframe
        temp_df = clean_df(temp_df)

        # Appending the cleaned dataframe to the list of games
        games.append(temp_df)

    # Combining all the dataframes into one dataframe with a fresh index
    df = pd.concat(games, ignore_index=True)

    df.to_csv('../data/processed/S2324_all_games.csv', index=False)

    return df

df = load_game(files)
print(f"Successfully loaded dataframe containing {df.shape[0]} rows and {df.shape[1]} columns from {len(files)} games. \n")
df.columns
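
Because `os.listdir` returns every entry in the folder, a stray non-JSON file in the raw folder would be passed to `pd.read_json`. The sketch below shows one way to guard against that by listing only `*.json` files and taking the game id from the file stem. The helper name `list_game_files` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def list_game_files(raw_dir):
    """Return the .json files in raw_dir, sorted by name."""
    return sorted(Path(raw_dir).glob("*.json"))


# Demonstration on a throwaway folder; the file names are made up
with tempfile.TemporaryDirectory() as tmp:
    for name in ["0000000002.json", "0000000001.json", "notes.txt"]:
        (Path(tmp) / name).write_text("{}")

    game_files = list_game_files(tmp)
    # The stem is the file name without its extension
    game_ids = [path.stem for path in game_files]

print(game_ids)  # ['0000000001', '0000000002']
```

For file names containing a single dot, `path.stem` gives the same game id as `file.split('.')[0]`.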

In [51]:
def save_by_team(fp):
    """
    Function reads the combined csv file and saves one csv file per team containing every game that team played in. The csv files are saved in the 'processed/team' folder.

    Args:
      fp (str): File path of the combined csv file to read

    Returns:
      None

    """
    try:
        df = pd.read_csv(fp)
        teams = df['home'].unique()

        for team in teams:
            # Keeping every row from games where the team was home or away
            team_df = df[(df['home'] == team) | (df['away'] == team)]
            team_df.to_csv(f'../data/processed/team/{team}_games.csv', index=False)

        print("Successfully saved csv files for each team in the dataframe. \n")

    except Exception as e:
        # Reporting the underlying error instead of silently hiding it
        print(f"Error: Could not save csv files for each team in the dataframe: {e} \n")


save_by_team('../data/processed/S2324_all_games.csv')

Successfully saved csv files for each team in the dataframe. 



In [52]:
def save_by_player(fp):
    """
    Function reads the combined csv file and saves one csv file per player containing every row attributed to that player. The csv files are saved in the 'processed/player' folder.

    Args:
      fp (str): File path of the combined csv file to read

    Returns:
      None

    """
    try:
        df = pd.read_csv(fp)
        # Dropping missing names so that no 'nan_games.csv' file is written
        players = df['playerNameI'].dropna().unique()

        for player in players:
            player_df = df[df['playerNameI'] == player]
            player_df.to_csv(f'../data/processed/player/{player}_games.csv', index=False)

        print("Successfully saved csv files for each player in the dataframe. \n")

    except Exception as e:
        # Reporting the underlying error instead of silently hiding it
        print(f"Error: Could not save csv files for each player in the dataframe: {e} \n")

save_by_player('../data/processed/S2324_all_games.csv')

Successfully saved csv files for each player in the dataframe. 
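
The per-team and per-player exports follow the same pattern: split the combined dataframe on a column and write one csv per value. A minimal sketch of that pattern using `groupby` is below. The helper name `split_by_column` is hypothetical, the toy data is made up, and replacing characters that may be awkward in file names is an added assumption rather than something the notebook does.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def split_by_column(df, column, out_dir):
    """Write one '<value>_games.csv' per distinct value of `column`."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # groupby drops missing keys by default, so rows without a value are skipped
    for key, group in df.groupby(column):
        # Replace runs of anything other than word characters, '.', or '-' with '_'
        safe_key = re.sub(r"[^\w.-]+", "_", str(key))
        path = out_dir / f"{safe_key}_games.csv"
        group.to_csv(path, index=False)
        written.append(path.name)
    return written


# Hypothetical rows; the third row has no player attached
toy = pd.DataFrame({
    "playerNameI": ["A. Example", "B. Sample", None, "A. Example"],
    "description": ["shot", "rebound", "timeout", "foul"],
})

with tempfile.TemporaryDirectory() as tmp:
    written = split_by_column(toy, "playerNameI", tmp)

print(written)  # ['A._Example_games.csv', 'B._Sample_games.csv']
```

Note that this groups on a single column, so it matches `save_by_player` directly. The team export selects rows where the team is either 'home' or 'away', which still needs the explicit filter used in `save_by_team`.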

