# Web Scraping NBA Data
I have a few project ideas I want to try out for research purposes, and in order to do them, I need data. I'm going to be extracting data directly from NBA.com in order to do this. I'm interested in gathering:
- Box Score Data
- Play-by-Play Data
- Game Data
  - Game Date
  - Regular Season/Playoffs
  - Home and Away Team
  - Final Score
  - Stadium Attendance
  - National Broadcast Data

The main goal here is to get this data while not putting any needless strain on NBA.com's servers.

## Data Sources
NBA.com has a few different places from which you can extract the above data. I'm going to extract data from a Content Delivery Network (CDN) used by NBA.com, which hosts play-by-play and box-score data, as well as retrieve data from a JSON object that loads dynamically when navigating to specific pages on NBA.com. 

### NBA.com's Content Delivery Network (CDN)
As of March 5, 2024, you can still pull box-score and play-by-play directly from NBA.com's CDN if you know the specific game's "game ID". The format you want to use is the following:

In [5]:
game_id = '0022000181'

# For Play-By-Play Data:
format_play_by_play = 'https://cdn.nba.com/static/json/liveData/playbyplay/playbyplay_{}.json'

# For Box-Score Data:
format_box_score = 'https://cdn.nba.com/static/json/liveData/boxscore/boxscore_{}.json'

print(format_play_by_play.format(game_id))
print(format_box_score.format(game_id))

https://cdn.nba.com/static/json/liveData/playbyplay/playbyplay_0022000181.json
https://cdn.nba.com/static/json/liveData/boxscore/boxscore_0022000181.json


If you navigate to either of these links, it'll take you to a raw JSON file, which contains all the box-score and play-by-play data for the NBA game with the game id '0022000181'.

Just from navigating to the CDN for different games from different years, it appears that the CDN links don't work for games prior to the 2019-2020 season, so for games that precede that date, I'll instead have to navigate directly to NBA.com's page for those specific games. 

### NBA.com's Dynamic JSON Object
When you navigate to any NBA page that provides information on a past game, you're greeted with the ability to navigate to the 'Summary,' 'Box Score', 'Game Charts', and 'Play-By-Play' tabs. Each of these provide information about the game in an easily digestable format. 

You could navigate to each of these pages one by one and extract information directly from the raw html, however, all the data that makes up the content in the divs and tables on this page are populated via a dynamic JSON object that you can find embedded in the first <b>script</b> tag at the top of the html document tree. By pulling down this JSON object directly, you can retrieve all the relevant game information in a structured format without needing to parse through all the raw html.

The highlighted script tag is collapsed, but contains the JSON object we want to extract:

![alt text](imgs/json_in_script.png)

The script tag has the id "__NEXT_DATA__", so we'll specify that when we extract the html for the page.

For this, we'll use the <b>requests</b> and <b>BeautifulSoup</b> Python packages to extract the html page and use the <b>json</b> package to convert the resulting string to a JSON object which we'll save to further analyze. 

The code below is an example of what that looks like: 

In [98]:
import requests
from bs4 import BeautifulSoup
import json

# URL of the webpage you want to scrape
url = 'https://www.nba.com/game/lac-vs-mil-0022300880' # regular season game
url = 'https://www.nba.com/game/ind-vs-lal-0062300001' # in season tournamenl
url = 'https://www.nba.com/game/cle-vs-atl-0041400301' # playoff game

# Fetch the content of the webpage
response = requests.get(url)

# Use BeautifulSoup to parse the HTML content
soup = BeautifulSoup(response.text, 'html.parser')

# Find the specific script tag with the ID we want
script_tag = soup.find('script', id='__NEXT_DATA__')

# Load the script tag into a json object
json_object = json.loads(script_tag.string)


The JSON object is fairly heavily nested, and for now we only want the box score data, the play by play data, and some general game data. 

#### Box Score Data
The box-score data can be found in the 'game' key, within the 'homeTeam' and 'awayTeam' keys:

In [75]:
box_score_home = json_object['props']['pageProps']['game']['homeTeam']
box_score_away = json_object['props']['pageProps']['game']['awayTeam']

We can now take this json data, and turn it into a flat file using the pandas package in python:

In [1]:
import pandas as pd

def generate_box_score(box_score_dict):
    rows = []
    for player in box_score_dict['players']:
        row = {'player_id' : player['personId']}
        player_name = '{} {}'.format(player['firstName'], player['familyName'])
        row.update({'player_name' : player_name})
        row.update(player['statistics'])
        rows.append(row)
    return pd.DataFrame(rows)

The resulting dataframe looks as follows:

In [149]:
generate_box_score(box_score_away)

Unnamed: 0,player_id,player_name,minutes,fieldGoalsMade,fieldGoalsAttempted,fieldGoalsPercentage,threePointersMade,threePointersAttempted,threePointersPercentage,freeThrowsMade,...,reboundsOffensive,reboundsDefensive,reboundsTotal,assists,steals,blocks,turnovers,foulsPersonal,points,plusMinusPoints
0,1627741,Buddy Hield,32:22,3,11,0.273,2,9,0.222,0,...,1,4,5,4,1,0,2,3,8,-19
1,1630167,Obi Toppin,29:09,5,11,0.455,3,7,0.429,0,...,1,1,2,0,0,0,0,5,13,-12
2,1626167,Myles Turner,25:13,3,11,0.273,1,5,0.2,3,...,1,6,7,1,1,0,2,6,10,-17
3,1628971,Bruce Brown,19:26,2,9,0.222,0,2,0.0,0,...,0,2,2,1,2,1,0,2,4,0
4,1630169,Tyrese Haliburton,35:15,8,14,0.571,2,8,0.25,2,...,0,1,1,11,1,1,3,1,20,-19
5,1631097,Bennedict Mathurin,30:08,5,11,0.455,1,5,0.2,9,...,0,2,2,0,1,1,0,4,20,-5
6,1630543,Isaiah Jackson,18:38,2,6,0.333,0,0,0.0,6,...,1,4,5,0,0,4,1,3,10,5
7,1630174,Aaron Nesmith,26:28,4,10,0.4,1,5,0.2,6,...,1,2,3,1,1,0,1,5,15,-13
8,204456,T.J. McConnell,17:30,3,11,0.273,0,0,0.0,2,...,3,1,4,9,3,0,0,4,8,5
9,1641716,Jarace Walker,1:10,0,0,0.0,0,0,0.0,0,...,0,0,0,0,0,0,0,0,0,1


#### The Play-By-Play Data
The play-by-play data can be found inside the 'play-by-play' key:

In [68]:
play_by_play_data = json_object['props']['pageProps']['playByPlay']

#### General Game Data
The game data I'm looking to extract is found mostly within the same 'game' key that the box score data was embedded inside. However, some of it is also found in the parent key 'analyticsObject'.

The key-value pairs I want to pull from this again are:
- Game Date
- Regular Season/Playoffs
- Home and Away Team
- Final Score
- Stadium Attendance
- National TV Broadcast

In [113]:
# Date in US Eastern Timezone:
game_date = json_object['props']['pageProps']['game']['gameEt']

# Season Year
season_year = json_object['props']['pageProps']['analyticsObject']['season']

# Regular Season or Playoffs
season_type = json_object['props']['pageProps']['analyticsObject']['seasonType']

# Game Attendance
attendance = json_object['props']['pageProps']['game']['attendance']

# Game Sold Out
sellout = json_object['props']['pageProps']['game']['sellout']

# Game Duration
duration = json_object['props']['pageProps']['game']['duration']

# Game Label
game_label = json_object['props']['pageProps']['game']['gameLabel'] 

# Game Sub-Label (What game of the series was it, game 1, 2... etc)
game_sublabel = json_object['props']['pageProps']['game']['gameSubLabel'] 

# Series Game Number--not sure how this is different from sublabel
series_game_number = json_object['props']['pageProps']['game']['seriesGameNumber']

# Series Text (Information about the series if it's a playoff game)
series_text = json_object['props']['pageProps']['game']['seriesText']

# The home/away team name, and corresponding NBA ID for the team:
home_team_id = json_object['props']['pageProps']['game']['homeTeam']['teamId']
home_team_city = json_object['props']['pageProps']['game']['homeTeam']['teamCity']
home_team_name = json_object['props']['pageProps']['game']['homeTeam']['teamName']

away_team_id = json_object['props']['pageProps']['game']['awayTeam']['teamId']
away_team_city = json_object['props']['pageProps']['game']['awayTeam']['teamCity']
away_team_name = json_object['props']['pageProps']['game']['awayTeam']['teamName']

# Final Score 
home_team_score = json_object['props']['pageProps']['game']['homeTeam']['score']
away_team_score = json_object['props']['pageProps']['game']['awayTeam']['score']

# National TV Broadcast
national_tv_broadcast = json_object['props']['pageProps']['game']['broadcasters']['nationalBroadcasters'][0].get('broadcasterDisplay', None)


One note on the above, the broadcast data has a lot of information--it contains TV and radio broadcasts both locally and nationally that played the game--I'm only interested in whether the game was played nationally, which appears to always appear in the 'nationalBroadcasters' key. If it wasn't nationally broadcast, the key still exists, but the list value is just empty. 

The next step is to turn this into a pandas dataframe, since I'm going to store this data in a relational database:

In [117]:
import pandas as pd

data = {
    'game_date' : game_date,
    'season_year' : season_year,
    'season_type' : season_type,
    'attendance' : attendance,
    'sellout' : sellout,
    'duration' : duration,
    'game_label' : game_label,
    'game_sublabel' : game_sublabel,
    'series_text' : series_text,
    'series_game_number' : series_game_number,
    'series_text' : series_text,
    'home_team_id' : home_team_id,
    'home_team_city' : home_team_city,
    'home_team_name' : home_team_name,
    'away_team_id' : home_team_id,
    'away_team_city' : away_team_city,
    'away_team_name' : away_team_name,
    'home_team_score' : home_team_score,
    'away_team_score' : away_team_score,
    'national_tv_broadcast' : national_tv_broadcast
}

df_game_data = pd.DataFrame([data])
df_game_data

Unnamed: 0,game_date,season_year,season_type,attendance,sellout,duration,game_label,game_sublabel,series_text,series_game_number,home_team_id,home_team_city,home_team_name,away_team_id,away_team_city,away_team_name,home_team_score,away_team_score,national_tv_broadcast
0,2015-05-20T20:30:00Z,2014,Playoffs,18489,1,2:37,East - Conf. Finals,Game 1,CLE leads 1-0,Game 1,1610612737,Atlanta,Hawks,1610612737,Cleveland,Cavaliers,89,97,TNT
