# Extracting NBA Data
I have a few project ideas I want to try out for research purposes, and in order to do them, I need data. I'm going to be extracting data from a number of different online sources. The data I'm interested in are the following:
- Box Score Data
- Play-by-Play Data
- General Game Data
  - Game Date
  - Regular Season/Playoffs
  - Home and Away Team
  - Final Score
  - Stadium Attendance
  - National Broadcast Data

The main goal here is to get this data while not putting any strain on the data host's servers.
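
One way to keep the load light is to space requests out and never fetch the same URL twice. Below is a minimal sketch of that idea. The `PoliteFetcher` class, its one-second default delay, and the in-memory cache are illustrative choices of mine, not anything the data hosts prescribe.

```python
import time
from typing import Callable, Dict, Optional


class PoliteFetcher:
    """Space out calls to a fetch function and serve repeat URLs from memory.

    Hypothetical helper for illustration only.
    """

    def __init__(self, fetch: Callable[[str], str], min_delay_seconds: float = 1.0) -> None:
        self._fetch = fetch
        self._min_delay = min_delay_seconds
        self._last_call: Optional[float] = None
        self._cache: Dict[str, str] = {}

    def get(self, url: str) -> str:
        # Serve repeat requests from the cache so the server is hit once per URL.
        if url in self._cache:
            return self._cache[url]

        # Sleep just long enough to keep min_delay between real requests.
        if self._last_call is not None:
            elapsed = time.monotonic() - self._last_call
            if elapsed < self._min_delay:
                time.sleep(self._min_delay - elapsed)

        body = self._fetch(url)
        self._last_call = time.monotonic()
        self._cache[url] = body
        return body
```

The `fetch` argument can be any function that takes a URL and returns text, for example a small wrapper around `requests.get(url, timeout=10).text`.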

## Data Sources
This data is available in a lot of different places. Here are a few:
- Basketball Reference (Box Score/Play-By-Play Data)
- NBA.com (Box Score/Play-By-Play Data)

## Extracting Data from NBA.com
NBA.com exposes the above data in a few different places. I'm going to pull from two of them: a Content Delivery Network (CDN) used by NBA.com, which hosts play-by-play and box-score data, and a JSON object that loads dynamically when you navigate to specific pages on NBA.com.

### NBA.com's Content Delivery Network (CDN)
As of March 5, 2024, you can still pull box-score and play-by-play data directly from NBA.com's CDN if you know the game's "game ID". The URL format is the following:

```python
game_id = '0022000181'

# For Play-By-Play Data:
format_play_by_play = 'https://cdn.nba.com/static/json/liveData/playbyplay/playbyplay_{}.json'

# For Box-Score Data:
format_box_score = 'https://cdn.nba.com/static/json/liveData/boxscore/boxscore_{}.json'

print(format_play_by_play.format(game_id))
print(format_box_score.format(game_id))
```

```
https://cdn.nba.com/static/json/liveData/playbyplay/playbyplay_0022000181.json
https://cdn.nba.com/static/json/liveData/boxscore/boxscore_0022000181.json
```


Navigating to either of these links takes you to a raw JSON file containing the play-by-play or box-score data, respectively, for the NBA game with the game ID '0022000181'.
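
To load one of these files in Python rather than in the browser, something like the following sketch should work. It uses only the standard library; the function names and the browser-like `User-Agent` header are assumptions of mine, and I haven't confirmed whether the CDN actually requires that header.

```python
import json
import urllib.request

CDN_TEMPLATES = {
    "playbyplay": "https://cdn.nba.com/static/json/liveData/playbyplay/playbyplay_{}.json",
    "boxscore": "https://cdn.nba.com/static/json/liveData/boxscore/boxscore_{}.json",
}


def build_cdn_url(kind: str, game_id: str) -> str:
    """Return the CDN URL for one game; kind is 'playbyplay' or 'boxscore'."""
    return CDN_TEMPLATES[kind].format(game_id)


def fetch_cdn_json(kind: str, game_id: str, timeout: float = 10.0) -> dict:
    """Download one CDN file and parse it into a Python dict."""
    # Assumption: a browser-like User-Agent; the CDN may or may not need one.
    request = urllib.request.Request(
        build_cdn_url(kind, game_id),
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `fetch_cdn_json("boxscore", "0022000181")` would then make a single request and return the parsed box score.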

From trying the CDN with games from different years, it appears that the CDN links don't work for games prior to the 2019-2020 season. For games before that season, I'll instead have to go directly to NBA.com's page for each specific game.
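
Deciding which source to use can be automated if the season is read off the game ID itself. The sketch below assumes the commonly described ID layout: a three-digit game-type prefix, then the last two digits of the season's starting year, then a five-digit game number. NBA.com doesn't document this layout as far as I know, so treat the prefix table (which is partial), the 1946 century pivot, and the 2019 cutoff as assumptions that match the two IDs used in this post.

```python
# Assumed, partial mapping of game-ID prefixes to game types.
GAME_TYPES = {
    "001": "Preseason",
    "002": "Regular Season",
    "003": "All-Star",
    "004": "Playoffs",
}


def season_start_year(game_id: str) -> int:
    """Return the year the season started, e.g. 2020 for the 2020-21 season."""
    yy = int(game_id[3:5])
    # Assumption: the league's first season began in 1946, so 46-99 means 19xx.
    return 1900 + yy if yy >= 46 else 2000 + yy


def game_type(game_id: str) -> str:
    """Return the assumed game type, or 'Unknown' for prefixes not in the table."""
    return GAME_TYPES.get(game_id[:3], "Unknown")


def cdn_probably_available(game_id: str) -> bool:
    """Guess whether the CDN has this game, based on the observed 2019-2020 cutoff."""
    return season_start_year(game_id) >= 2019
```

With this, `'0022000181'` reads as a 2020-21 regular-season game that should be on the CDN, while `'0029500281'` reads as a 1995-96 game that needs the game-page route.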

### NBA.com's Dynamic JSON Object
When you navigate to any NBA.com page for a past game, you can move between the 'Summary', 'Box Score', 'Game Charts', and 'Play-By-Play' tabs. Each of these provides information about the game in an easily digestible format.

You could navigate to each of these pages one by one and extract information directly from the raw HTML. However, all the data behind the divs and tables on this page is populated from a dynamic JSON object that you can find embedded in the first <b>script</b> tag at the top of the HTML document tree. By pulling down this JSON object directly, you can retrieve all the relevant game information in a structured format without needing to parse through the raw HTML.

The highlighted script tag is collapsed, but contains the JSON object we want to extract:

![alt text](../imgs/json_in_script.png)

The script tag has the id `__NEXT_DATA__`, so we'll specify that when we extract the HTML for the page.

For this, we'll use the <b>requests</b> and <b>BeautifulSoup</b> Python packages to fetch and parse the HTML page, and the <b>json</b> package to parse the script tag's contents into a Python dictionary, which we'll save for further analysis.

The code below is an example of what that looks like: 

```python
import json

import requests
from bs4 import BeautifulSoup

# URL of the game page you want to scrape
url = 'https://www.nba.com/game/lac-vs-mil-0029500281'

# Fetch the page; the timeout keeps a stalled request from hanging forever
response = requests.get(url, timeout=10)
response.raise_for_status()

# Use BeautifulSoup to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find the script tag with the ID we want
script_tag = soup.find('script', id='__NEXT_DATA__')
if script_tag is None:
    raise ValueError('No __NEXT_DATA__ script tag found on the page')

# Parse the script tag's contents into a Python dict
json_object = json.loads(script_tag.string)
```
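
Saving the parsed object to disk means each game page only has to be requested once. Here is a minimal sketch of that step. The `save_game_json` and `load_game_json` helpers and the `data/raw` folder are hypothetical names of mine, not part of any library.

```python
import json
from pathlib import Path
from typing import Optional


def save_game_json(game_data: dict, game_id: str, out_dir: str = "data/raw") -> Path:
    """Write one game's parsed JSON to <out_dir>/<game_id>.json and return the path."""
    path = Path(out_dir) / f"{game_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        json.dump(game_data, f)
    return path


def load_game_json(game_id: str, out_dir: str = "data/raw") -> Optional[dict]:
    """Return the saved JSON for a game, or None if it hasn't been saved yet."""
    path = Path(out_dir) / f"{game_id}.json"
    if not path.exists():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

Checking `load_game_json('0029500281')` before making a request lets a rerun of the notebook skip any game that's already on disk.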

There is a lot of information for each game contained in these JSON objects, including the play-by-play data, the box-score data, and general game data. In the next post, I'll go through how to transform this data from JSON to a pandas DataFrame that we can then analyze as a flat file.
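
In the meantime, a quick way to see how that information is laid out is to print the object's keys a couple of levels deep. The `summarize_structure` helper below is a hypothetical sketch of mine; it makes no assumptions about which keys NBA.com actually uses.

```python
def summarize_structure(obj, max_depth: int = 2, _depth: int = 0):
    """Return a nested summary of dict keys, list lengths, and value types."""
    if isinstance(obj, dict):
        # Past max_depth, report only how many keys the dict has.
        if _depth >= max_depth:
            return f"dict({len(obj)} keys)"
        return {
            key: summarize_structure(value, max_depth, _depth + 1)
            for key, value in obj.items()
        }
    if isinstance(obj, list):
        return f"list({len(obj)} items)"
    # Scalars are reported by type name only, so the summary stays small.
    return type(obj).__name__
```

Running `summarize_structure(json_object)` on a parsed game page should give a compact map of where the box-score and play-by-play sections sit.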