# **NBA Game Metadata - 2023-2024 Season**
Brendan Keane
October 7, 2024

## **Summary**
This notebook scrapes the [NBA website](https://nba.com) for play-by-play data from the 2023-2024 season using *Python* and *Beautiful Soup*. This process is broken down into the following steps:
1. Import packages and define constants
1. Retrieve play-by-play JSON data from [nba.com](http://nba.com/game)
1. Save play-by-play JSON for all 1,230 NBA games
1. Combine all 1,230 JSONs into one CSV

## **1. Import packages and define constants**

In [9]:
# Imports
from bs4 import BeautifulSoup
import os
import requests
import json
import time
import random


# Constants

# URL beginning and end to access pages with shot charts
URL_START = "https://www.nba.com/game/00"
URL_END = "/game-charts"

# Base number for game URLs (2023-2024 season) and total games for the season
# Note: Each game URL increments by 1 from `22300001` to `22301230`
GAME_NUM_BASE = 22300001
TOTAL_GAMES = 1230

# All game URLs for 2023-2024 season
URL_LIST = []
for game in range(TOTAL_GAMES):
  req_url = URL_START + str(GAME_NUM_BASE + game) + URL_END
  URL_LIST.append(req_url)

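# Request headers
# Note: the Access-Control-* entries are CORS response headers, so they are not
# meaningful on an outgoing request; they are kept here as originally run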
HEADERS = {
    'Access-Control-Allow-Origin': '*',
    'Access-Control-Allow-Methods': 'GET',
    'Access-Control-Allow-Headers': 'Content-Type',
    'Access-Control-Max-Age': '3600',
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }

# Export directory path
EXPORT_PATH = "../data/raw/"
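
The URL scheme above can be sketched as a small helper, which makes the mapping from game number to URL easy to check without building the full list. The function name `build_game_url` is hypothetical and not part of the notebook; the constants are copied from the cell above.

```python
# Constants copied from the notebook cell above
URL_START = "https://www.nba.com/game/00"
URL_END = "/game-charts"
GAME_NUM_BASE = 22300001
TOTAL_GAMES = 1230


def build_game_url(game_num):
  """Hypothetical helper: map a 1-based game number to its game page URL."""
  if not 1 <= game_num <= TOTAL_GAMES:
    raise ValueError(f"game_num must be between 1 and {TOTAL_GAMES}")
  # Game 1 maps to GAME_NUM_BASE, game 2 to GAME_NUM_BASE + 1, and so on
  return URL_START + str(GAME_NUM_BASE + game_num - 1) + URL_END


first_url = build_game_url(1)
last_url = build_game_url(TOTAL_GAMES)
print(first_url)
print(last_url)
```

This mirrors `URL_LIST[game_num - 1]`, so the first and last URLs end in `0022300001/game-charts` and `0022301230/game-charts`.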


## **2. Retrieve play-by-play JSON data from [nba.com](http://nba.com/game)**

In [10]:
def get_soup(url):
  """
  Function to get the soup object from a given URL

  Args:
    url (str): URL to scrape

  Returns:
    soup (BeautifulSoup): Soup object of the URL

  """
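  try:
    # Time out instead of hanging on a stalled connection
    response = requests.get(url, headers=HEADERS, timeout=30)
    # Treat HTTP error statuses (4xx/5xx) as failures
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup
  except requests.RequestException as err:
    print(f"Error: Could not get soup object from URL ({err})")
    return None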

def get_json(soup):
  """
  Function to get the JSON object from a given soup object.
  Note: The JSON object is stored in a script tag with the id "__NEXT_DATA__"

  Args:
    soup (BeautifulSoup): Soup object of the URL

  Returns:
    json_obj (dict): JSON object of the URL's play-by-play data

  """
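  try:
    script = soup.find(id="__NEXT_DATA__")
    json_obj = json.loads(script.text)
    return json_obj['props']['pageProps']['game']
  except (AttributeError, KeyError, TypeError, json.JSONDecodeError):
    # Covers a missing soup or script tag, malformed JSON, or an unexpected layout
    print("Error: Could not get play-by-play JSON object from soup object")
    return None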


def save_json(json_obj, game_num):
  """
  Function to save the JSON object to a file

  Args:
    json_obj (dict): JSON object to save
    game_num (int): Game number to save the JSON object as

  """
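  try:
    # Create the export directory on first use so open() does not fail
    os.makedirs(EXPORT_PATH, exist_ok=True)
    with open(EXPORT_PATH + "S2324-M" + "{0:0=4d}".format(game_num) + ".json", 'w') as f:
      json.dump(json_obj, f, indent=4)
  except (OSError, TypeError) as err:
    print(f"Error: Could not save JSON object to file with code {game_num} ({err})")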
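
The extraction step can be tried offline on a small HTML string. The sketch below is not the notebook's code: it swaps Beautiful Soup for the standard library's `html.parser` so it runs without network access or third-party installs. The class `NextDataParser`, the function `extract_game`, and the sample page with its values are all hypothetical; only the `__NEXT_DATA__` id and the `props` > `pageProps` > `game` path come from `get_json` above.

```python
import json
from html.parser import HTMLParser


class NextDataParser(HTMLParser):
  """Collects the text inside <script id="__NEXT_DATA__"> (hypothetical helper)."""

  def __init__(self):
    super().__init__()
    self._in_target = False
    self.payload = None

  def handle_starttag(self, tag, attrs):
    if tag == "script" and dict(attrs).get("id") == "__NEXT_DATA__":
      self._in_target = True
      self.payload = ""

  def handle_data(self, data):
    # Script content may arrive in several chunks, so accumulate it
    if self._in_target:
      self.payload += data

  def handle_endtag(self, tag):
    if tag == "script":
      self._in_target = False


def extract_game(html):
  """Return the nested 'game' object, or None if it is not present."""
  parser = NextDataParser()
  parser.feed(html)
  if parser.payload is None:
    return None
  data = json.loads(parser.payload)
  return data.get("props", {}).get("pageProps", {}).get("game")


# Made-up page that imitates the layout get_json() expects
SAMPLE_HTML = """
<html><body>
<script id="__NEXT_DATA__" type="application/json">
{"props": {"pageProps": {"game": {"gameId": "0022300001", "attendance": 0}}}}
</script>
</body></html>
"""

game = extract_game(SAMPLE_HTML)
print(game)
```

Returning `None` for a page without the script tag matches how `get_json` signals failure, which is the case the export loop has to handle.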

## **3. Save play-by-play JSON for all 1,230 NBA games**

In [11]:
def export_game(game_num):
  """
  Function to export the JSON object of a game to a file

  Args:
    game_num (int): Game number to export

  """
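  try:
    url = URL_LIST[game_num - 1]
    soup = get_soup(url)
    json_obj = get_json(soup)
    # get_json returns None on failure; skip saving so the game is not reported as exported
    if json_obj is None:
      print(f"Error: Could not export game {game_num}")
      return
    save_json(json_obj, game_num)
    print(f"Exported game {game_num}")
  except Exception:
    print(f"Error: Could not export game {game_num}")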


def export_all_games():
  """
  Function to export all games from the 2023-2024 season

  """
  try:
    for game_num in range(1, TOTAL_GAMES + 1):
      export_game(game_num)
      time.sleep(random.randint(1, 3))
  except:
    print("Error: Could not export all games")

export_game(1073)

Exported game 1073


### *Caution*
Running the `export_all_games()` function will scrape and export all 1,230 NBA games. When I ran this function, it took **65m 27.6s**.
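
As a rough sanity check on that runtime, the arithmetic below separates the time spent in `time.sleep` from everything else. It is an estimate under one assumption: the sleep duration averages 2 seconds, since `random.randint(1, 3)` picks 1, 2, or 3 with equal probability. The variable names are illustrative.

```python
# Rough breakdown of the 65m 27.6s runtime reported above
TOTAL_GAMES = 1230
observed_s = 65 * 60 + 27.6

# random.randint(1, 3) returns 1, 2, or 3 with equal probability
mean_sleep_s = (1 + 2 + 3) / 3

# Expected total time spent sleeping between requests
expected_sleep_s = TOTAL_GAMES * mean_sleep_s

# Whatever remains is request, parse, and save time, averaged per game
other_s_per_game = (observed_s - expected_sleep_s) / TOTAL_GAMES

print(f"Expected sleep: {expected_sleep_s / 60:.1f} min")
print(f"Other time per game: {other_s_per_game:.2f} s")
```

Under this assumption, about 41 of the 65 minutes are deliberate delay, and each game takes roughly 1.2 seconds to fetch, parse, and save.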

In [14]:
export_all_games()

Exported game 1
Exported game 2
Exported game 3
Exported game 4
Exported game 5
Exported game 6
Exported game 7
Exported game 8
Exported game 9
Exported game 10
Exported game 11
Error: Could not get play-by-play JSON object from soup object
Exported game 12
Exported game 13
Exported game 14
Exported game 15
Exported game 16
Exported game 17
Exported game 18
Exported game 19
Exported game 20
Exported game 21
Exported game 22
Exported game 23
Exported game 24
Exported game 25
Exported game 26
Exported game 27
Exported game 28
Exported game 29
Exported game 30
Exported game 31
Exported game 32
Exported game 33
Exported game 34
Exported game 35
Exported game 36
Exported game 37
Exported game 38
Exported game 39
Exported game 40
Exported game 41
Exported game 42
Exported game 43
Exported game 44
Exported game 45
Exported game 46
Exported game 47
Exported game 48
Exported game 49
Exported game 50
Exported game 51
Exported game 52
Exported game 53
Exported game 54
Exported game 55
Error: Coul

## **4. Combine all 1,230 JSONs into one CSV**

In [15]:
# Games that failed during the full run; retry each one individually
missed_games = [12, 56, 110, 111, 130, 181, 184, 275, 299, 304, 329, 462, 471, 472, 545, 580, 643, 742, 1053, 1054, 1081, 1107, 1147, 1155, 1157, 1211]

for game in missed_games:
  export_game(game)

Exported game 12
Exported game 56
Exported game 110
Exported game 111
Exported game 130
Exported game 181
Exported game 184
Exported game 275
Exported game 299
Exported game 304
Exported game 329
Exported game 462
Exported game 471
Exported game 472
Exported game 545
Exported game 580
Exported game 643
Exported game 742
Exported game 1053
Exported game 1054
Exported game 1081
Exported game 1107
Exported game 1147
Exported game 1155
Exported game 1157
Exported game 1211
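
Rather than reading the missed games off the log, the list could be rebuilt from the export directory. The sketch below treats a game as missed if its file is absent, unreadable, or holds `null` (what `json.dump` writes for `None`). The function name `find_failed_games` is hypothetical; the `S2324-M0001.json` file naming follows `save_json`.

```python
import json
import os
import tempfile


def find_failed_games(export_dir, total_games=1230):
  """Hypothetical helper: list game numbers with no usable JSON file."""
  failed = []
  for game_num in range(1, total_games + 1):
    path = os.path.join(export_dir, f"S2324-M{game_num:04d}.json")
    try:
      with open(path) as f:
        # A file containing null means the scrape returned nothing
        if json.load(f) is None:
          failed.append(game_num)
    except (OSError, json.JSONDecodeError):
      # Missing or corrupt file
      failed.append(game_num)
  return failed


# Demo on a temporary directory: game 1 is valid, game 2 holds null, game 3 is missing
with tempfile.TemporaryDirectory() as tmp:
  with open(os.path.join(tmp, "S2324-M0001.json"), "w") as f:
    json.dump({"gameId": "0022300001"}, f)
  with open(os.path.join(tmp, "S2324-M0002.json"), "w") as f:
    json.dump(None, f)
  demo_result = find_failed_games(tmp, total_games=3)

print(demo_result)
```

In the notebook, the equivalent call would be `find_failed_games(EXPORT_PATH)`, and its result could be passed to the retry loop in place of a hand-written list.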


### *Data for processing*
1. gameId
1. gameEt
1. gameTimeUTC
1. duration
1. attendance
1. sellout
1. arena
    1. arenaName
    1. arenaCity
    1. arenaState
    1. arenaCountry
    1. arenaTimezone
1. homeTeam
    1. teamId
    1. teamTricode
    1. teamName
    1. teamCity
    1. teamWins
    1. teamLosses
    1. score
    1. statistics
        1. fieldGoalsMade
        1. fieldGoalsAttempted
        1. fieldGoalsPercentage
        1. threePointersMade
        1. threePointersAttempted
        1. threePointersPercentage
        1. freeThrowsMade
        1. freeThrowsAttempted
        1. freeThrowsPercentage
        1. reboundsOffensive
        1. reboundsDefensive
        1. reboundsTotal
        1. assists
        1. steals
        1. blocks
        1. turnovers
        1. foulsPersonal
        1. points
        1. plusMinusPoints
1. awayTeam
    1. teamId
    1. teamTricode
    1. teamName
    1. teamCity
    1. teamWins
    1. teamLosses
    1. score
    1. statistics
        1. fieldGoalsMade
        1. fieldGoalsAttempted
        1. fieldGoalsPercentage
        1. threePointersMade
        1. threePointersAttempted
        1. threePointersPercentage
        1. freeThrowsMade
        1. freeThrowsAttempted
        1. freeThrowsPercentage
        1. reboundsOffensive
        1. reboundsDefensive
        1. reboundsTotal
        1. assists
        1. steals
        1. blocks
        1. turnovers
        1. foulsPersonal
        1. points
        1. plusMinusPoints
1. homeTeamPlayers
    1. personId
    1. name
    1. nameI
    1. firstName
    1. familyName
    1. jerseyNum
1. awayTeamPlayers
    1. personId
    1. name
    1. nameI
    1. firstName
    1. familyName
    1. jerseyNum
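
One way to turn the nested fields listed above into CSV columns is to flatten each game into a single row, joining nested keys with underscores. This is a minimal sketch, not the notebook's implementation: the `flatten` helper is hypothetical, the sample values are made up, and player lists are skipped because they would need a table of their own.

```python
import csv
import io


def flatten(obj, prefix=""):
  """Hypothetical helper: flatten nested dicts into one level of columns."""
  flat = {}
  for key, value in obj.items():
    name = prefix + key
    if isinstance(value, dict):
      # Recurse, e.g. arena -> arena_arenaName
      flat.update(flatten(value, name + "_"))
    elif isinstance(value, list):
      # Player lists do not fit a one-row-per-game layout; skipped here
      continue
    else:
      flat[name] = value
  return flat


# Made-up sample using a few of the field names listed above
sample_games = [
  {
    "gameId": "0022300001",
    "attendance": 18000,
    "arena": {"arenaName": "Example Arena", "arenaCity": "Example City"},
    "homeTeam": {"teamTricode": "AAA", "score": 100, "statistics": {"points": 100}},
    "awayTeam": {"teamTricode": "BBB", "score": 98, "statistics": {"points": 98}},
    "homeTeamPlayers": [{"personId": 1, "name": "Example Player"}],
  },
]

rows = [flatten(game) for game in sample_games]

# Keep columns in first-seen order across all rows
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

For the real data, `sample_games` would be replaced by the objects loaded from each saved JSON file, and `buffer` by a file opened with `newline=""`.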