# **NBA Play-By-Play Data - 2023-2024 Season**
Brendan Keane
August 8, 2024

## **Summary**
This notebook scrapes the [NBA website](https://nba.com) for play-by-play data from the 2023-2024 season using *Python* and *Beautiful Soup*. This process is broken down into the following steps:
1. Import packages and define constants
1. Retrieve play-by-play JSON data from [nba.com](http://nba.com/game)
1. Save play-by-play JSON for all 1,230 NBA games
1. Combine all 1,230 JSONs into one CSV

## **1. Import packages and define constants**

In [1]:
# Imports
from bs4 import BeautifulSoup
import os
import requests
import json
import time
import random


# Constants

# URL prefix and suffix for each game's game-charts page (the page that embeds the play-by-play JSON)
URL_START = "https://www.nba.com/game/00"
URL_END = "/game-charts"

# Base number for game URLs (2023-2024 season) and total games for the season
# Note: Each game URL increments by 1 from `22300001` to `22301230`
GAME_NUM_BASE = 22300001
TOTAL_GAMES = 1230

# All game URLs for 2023-2024 season (index 0 holds game 1)
URL_LIST = [
  URL_START + str(GAME_NUM_BASE + game) + URL_END
  for game in range(TOTAL_GAMES)
]

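# Request headers
# Note: the Access-Control-* entries are CORS response headers, so they likely
# have no effect on these requests; User-Agent is probably the entry that matters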
HEADERS = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

# Export directory path
EXPORT_PATH = "../data/raw/"
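
As a quick check of the URL scheme defined above, the sketch below builds the URL for a single game number. The helper name `build_game_url` is hypothetical and not part of the original notebook; it only re-expresses the `URL_START + str(GAME_NUM_BASE + game) + URL_END` concatenation with a 1-based game number and an explicit range check.

```python
URL_START = "https://www.nba.com/game/00"
URL_END = "/game-charts"
GAME_NUM_BASE = 22300001
TOTAL_GAMES = 1230


def build_game_url(game_num):
    """Return the game-charts URL for a 1-based game number (hypothetical helper)."""
    if not 1 <= game_num <= TOTAL_GAMES:
        raise ValueError(f"game_num must be between 1 and {TOTAL_GAMES}")
    # Game 1 maps to ID 22300001 and game 1230 maps to ID 22301230
    return f"{URL_START}{GAME_NUM_BASE + game_num - 1}{URL_END}"


print(build_game_url(1))     # https://www.nba.com/game/0022300001/game-charts
print(build_game_url(1230))  # https://www.nba.com/game/0022301230/game-charts
```

The range check matters because a plain list lookup such as `URL_LIST[game_num - 1]` would silently return the last game when `game_num` is 0.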


## **2. Retrieve play-by-play JSON data from [nba.com](http://nba.com/game)**

In [2]:
def get_soup(url):
  """
  Function to get the soup object from a given URL

  Args:
    url (str): URL to scrape

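  Returns:
    soup (BeautifulSoup): Soup object of the URL, or None if the request fails

  """
  try:
    # Time out instead of hanging, and treat HTTP error statuses as failures
    response = requests.get(url, headers=HEADERS, timeout=30)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup
  except requests.RequestException as err:
    print(f"Error: Could not get soup object from {url}: {err}")
    return None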

def get_json(soup):
  """
  Function to get the JSON object from a given soup object.
  Note: The JSON object is stored in a script tag with the id "__NEXT_DATA__"

  Args:
    soup (BeautifulSoup): Soup object of the URL

  Returns:
    actions (list): Play-by-play actions for the game, or None on failure

  """
  try:
    script = soup.find(id="__NEXT_DATA__")
    json_obj = json.loads(script.text)
    return json_obj['props']['pageProps']['playByPlay']['actions']
  except (AttributeError, KeyError, TypeError, json.JSONDecodeError):
    # AttributeError covers a None soup object or a missing script tag
    print("Error: Could not get play-by-play JSON object from soup object")
    return None
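
The key path used in `get_json` can be illustrated without any network access. The sketch below walks the same `props`, `pageProps`, `playByPlay`, `actions` path on a hand-made dictionary and returns `None` when a key is missing. The helper name `extract_actions` and the sample payload are illustrative assumptions, not real nba.com data.

```python
import json

ACTIONS_PATH = ("props", "pageProps", "playByPlay", "actions")


def extract_actions(next_data):
    """Follow ACTIONS_PATH through a parsed dict; return None if any key is missing."""
    node = next_data
    for key in ACTIONS_PATH:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


# Made-up payload in the shape the notebook expects (the fields inside each action are invented)
sample_text = json.dumps(
    {"props": {"pageProps": {"playByPlay": {"actions": [{"actionNumber": 1}, {"actionNumber": 2}]}}}}
)
sample = json.loads(sample_text)

print(extract_actions(sample))         # [{'actionNumber': 1}, {'actionNumber': 2}]
print(extract_actions({"props": {}}))  # None
```

Checking each level explicitly makes it easier to tell a page-layout change (a missing key) apart from a failed request.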


def save_json(json_obj, game_num):
  """
  Function to save the JSON object to a file

  Args:
    json_obj (list): Play-by-play actions to save
    game_num (int): Game number (1-1230) used in the file name

  """
  try:
    # Zero-pad the game number to four digits, e.g. S2324-G0001.json
    file_path = os.path.join(EXPORT_PATH, f"S2324-G{game_num:04d}.json")
    with open(file_path, 'w') as f:
      json.dump(json_obj, f, indent=4)
  except (OSError, TypeError) as err:
    print(f"Error: Could not save JSON object to file for game {game_num}: {err}")
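
The file name built in `save_json` zero-pads the game number to four digits. A minimal sketch of that naming scheme follows; the helper name `game_file_name` is hypothetical, and the f-string form is offered as an equivalent of the `"{0:0=4d}".format(...)` call rather than as the notebook's own code.

```python
def game_file_name(game_num):
    """Return the export file name for a game number (hypothetical helper)."""
    return f"S2324-G{game_num:04d}.json"


# The f-string padding matches the str.format call used in save_json
assert game_file_name(7) == "S2324-G" + "{0:0=4d}".format(7) + ".json"

print(game_file_name(7))     # S2324-G0007.json
print(game_file_name(1073))  # S2324-G1073.json

# Zero-padding keeps alphabetical order identical to numeric order
names = sorted(game_file_name(n) for n in (1073, 7, 120))
print(names)  # ['S2324-G0007.json', 'S2324-G0120.json', 'S2324-G1073.json']
```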

## **3. Save play-by-play JSON for all 1,230 NBA games**

In [3]:
def export_game(game_num):
  """
  Function to export the JSON object of a game to a file

  Args:
    game_num (int): Game number to export

  """
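  try:
    url = URL_LIST[game_num - 1]
    soup = get_soup(url)
    json_obj = get_json(soup)
    # get_json returns None on failure; do not save or report success in that case
    if json_obj is None:
      print(f"Error: Could not export game {game_num}")
      return
    save_json(json_obj, game_num)
    print(f"Exported game {game_num}")
  except Exception as err:
    print(f"Error: Could not export game {game_num}: {err}")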


def export_all_games():
  """
  Function to export all games from the 2023-2024 season

  """
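  try:
    for game_num in range(1, TOTAL_GAMES + 1):
      export_game(game_num)
      # Pause 1-3 seconds between requests to avoid hammering the site
      time.sleep(random.randint(1, 3))
  except Exception as err:
    # Catch Exception rather than everything so Ctrl+C can still stop the run
    print(f"Error: Could not export all games: {err}")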

export_game(1073)

Exported game 1073


### *Caution*
Running the `export_all_games()` function will scrape and export all 1,230 NBA games. When I ran this function, it took **65m 27.6s**.
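
Because `random.randint(1, 3)` averages 2 seconds, the pauses alone account for roughly 1,230 × 2 s ≈ 41 minutes of a full run, so an interrupted run is costly to repeat from the start. One possible way to resume is to export only the games whose files are missing. The sketch below assumes the `S2324-G####.json` naming used by `save_json`; the helper name `pending_games` is hypothetical and not part of the original notebook.

```python
import os
import tempfile


def pending_games(export_dir, total_games=1230):
    """Return game numbers whose JSON file is not yet present in export_dir."""
    return [
        game_num
        for game_num in range(1, total_games + 1)
        if not os.path.exists(os.path.join(export_dir, f"S2324-G{game_num:04d}.json"))
    ]


# Demonstration in a temporary directory where games 1 and 3 are already "exported"
with tempfile.TemporaryDirectory() as tmp_dir:
    for done in (1, 3):
        with open(os.path.join(tmp_dir, f"S2324-G{done:04d}.json"), "w") as f:
            f.write("[]")
    remaining = pending_games(tmp_dir, total_games=5)

print(remaining)  # [2, 4, 5]
```

In the notebook, the returned list could replace `range(1, TOTAL_GAMES + 1)` in the export loop.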

In [4]:
# export_all_games()

## **4. Combine all 1,230 JSONs into one CSV**

In [5]:
export_game(1)

Exported game 1
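
The combining step named in this section's heading is not shown in the cell above. A minimal sketch of one way to do it with the standard library follows, assuming each exported file holds a list of action dictionaries and follows the `S2324-G####.json` naming. The function name `combine_json_to_csv`, the added `game_num` column, and the sample fields are assumptions for illustration, not the author's implementation.

```python
import csv
import glob
import json
import os
import tempfile


def combine_json_to_csv(json_dir, csv_path):
    """Write one CSV row per action from every S2324-G*.json file in json_dir."""
    rows = []
    for file_path in sorted(glob.glob(os.path.join(json_dir, "S2324-G*.json"))):
        with open(file_path) as f:
            actions = json.load(f)
        if not actions:  # skip games saved as null or with no actions
            continue
        # Recover the game number from the file name, e.g. S2324-G0007.json -> 7
        game_num = int(os.path.basename(file_path)[len("S2324-G"):-len(".json")])
        for action in actions:
            rows.append({"game_num": game_num, **action})

    # Use the union of all keys as columns, in first-seen order
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)  # keys missing from a row are written as empty strings
    return len(rows)


# Demonstration with two made-up games in a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_games = {
        1: [{"actionNumber": 1, "clock": "PT12M00.00S"}],
        2: [{"actionNumber": 1, "description": "Jump Ball"}],
    }
    for game_num, actions in sample_games.items():
        with open(os.path.join(tmp_dir, f"S2324-G{game_num:04d}.json"), "w") as f:
            json.dump(actions, f)
    csv_path = os.path.join(tmp_dir, "combined.csv")
    row_count = combine_json_to_csv(tmp_dir, csv_path)
    with open(csv_path, newline="") as f:
        csv_rows = list(csv.DictReader(f))

print(row_count)  # 2
```

Nested values (lists or dictionaries inside an action) would be written as their string representation, so they may need flattening before analysis.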
