# **November 8, 2022**
Working on using [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) to extract *JSON data* from [NBA.com](https://www.nba.com/schedule?cal=0&pd=false&region=1). These data will be scraped for the *Shot Mesh* project.

In [2]:
# Importing packages
from bs4 import BeautifulSoup
import requests
import json

In [3]:
# Requesting webpage
gamepage = requests.get('https://www.nba.com/game/den-vs-gsw-0022200026/game-charts')

In [4]:
# Turning it into soup
gamesoup = BeautifulSoup(gamepage.text, 'lxml')

### Content tag for shot data
`<script id="__NEXT_DATA__" type="application/json">`

In [5]:
gamescript = gamesoup.find(id='__NEXT_DATA__')

In [19]:
# New variable containing the script content
gamejson = json.loads(gamescript.text)

# FOR LATER... DESIRED FORMAT FOR EXPORTING JSON FILES
# json_obj = json.dumps(gamejson, indent=4)
# with open("gamedata.json", "w") as outfile:
#   outfile.write(json_obj)


dict
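The extraction steps above can be wrapped in one small function that fails loudly when the tag is missing. This is a minimal sketch, not the notebook's own code: `extract_next_data` and `sample_html` are hypothetical names, the sample page is made up so the sketch runs without a network request, and it uses the built-in `html.parser` where the notebook uses `lxml`.

```python
import json

from bs4 import BeautifulSoup


def extract_next_data(html: str) -> dict:
    """Return the parsed JSON held in the page's __NEXT_DATA__ script tag."""
    # 'html.parser' ships with Python; the notebook itself uses 'lxml'.
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("script", id="__NEXT_DATA__")
    if tag is None or tag.string is None:
        raise ValueError("no usable <script id='__NEXT_DATA__'> tag found")
    # tag.string is the raw JSON text inside the script tag.
    return json.loads(tag.string)


# Made-up stand-in page; the real page holds far more data.
sample_html = (
    '<html><body><script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"playByPlay": {"actions": []}}}}'
    "</script></body></html>"
)

data = extract_next_data(sample_html)
print(type(data))
```

With the real page, `extract_next_data(gamepage.text)` would stand in for the `find` and `json.loads` cells, assuming the tag is present in the response.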

# **November 9, 2022**
Converting the *text* inside `id='__NEXT_DATA__'` into a more usable data format. 

### **Arguments for JSON**
The original data is in this format. This structure also provides a high degree of flexibility, which could be advantageous when converting from `JSON` to a *Python-specific class*.

### **Arguments for CSV**
This structure is more accessible for other parties. Based on the complexity of my drafted data structures, I believe that a `CSV` can capture all relevant information without substantial redundancies.

In [25]:
# New variable for ONLY play-by-play data
pbp = gamejson['props']['pageProps']['playByPlay']['actions']

# Saving data to JSON
json_obj = json.dumps(pbp, indent=4)
with open("gamedata.json", "w") as outfile:
  outfile.write(json_obj)

In [26]:
# Starting to work with pandas
import pandas as pd
pbp_df = pd.read_json("gamedata.json")
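Since `actions` is a list of flat dictionaries, the JSON file round trip is optional, and the same rows can be written to CSV for the comparison above. The sketch below uses a made-up two-row `actions` list with only three of the real keys. It also shows one pandas detail worth knowing: writing a CSV with the default index and reading it back produces an extra `Unnamed: 0` column, which `index=False` avoids.

```python
import io

import pandas as pd

# Made-up stand-in for the `actions` list; only a few of the real keys are shown.
actions = [
    {"actionNumber": 2, "clock": "PT12M00.00S", "actionType": "period"},
    {"actionNumber": 4, "clock": "PT11M43.00S", "actionType": "Made Shot"},
]

# A list of flat dicts becomes one row per dict and one column per key.
df = pd.DataFrame(actions)

# Default to_csv() writes the row index as an unnamed first column.
with_index = pd.read_csv(io.StringIO(df.to_csv()))
# index=False leaves the index out, so the columns survive unchanged.
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```

In the notebook, `pd.DataFrame(pbp)` would be the equivalent of reading `gamedata.json` back in, assuming every action shares the same flat keys.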

In [34]:
action_type = pbp_df['actionType']
action_type.unique()

array(['period', 'Jump Ball', 'Made Shot', 'Missed Shot', 'Rebound',
       'Turnover', 'Foul', '', 'Violation', 'Free Throw', 'Substitution',
       'Timeout', 'Instant Replay'], dtype=object)

### Notes about *actionType*
While I believe only rows with *Made Shot*, *Missed Shot*, and *Free Throw* action types will be important, I need to verify that no other action types contain positional data (seen as non-zero values in `xLegacy` and `yLegacy`).

In [50]:
# Showing all action types for non-zero court positions
pbp_df[(pbp_df['xLegacy'] != 0) & (pbp_df['yLegacy'] != 0)]['actionType'].unique()

array(['Made Shot', 'Missed Shot'], dtype=object)
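One caveat about the filter above: combining the two conditions with `&` only keeps rows where both coordinates are non-zero, so a row with a position exactly on one axis would be skipped. Combining them with `|` is the stricter check for "any positional data". The rows below are made up to show the difference; only the column names come from the data.

```python
import pandas as pd

# Made-up rows; the second one sits exactly on the x = 0 line.
toy = pd.DataFrame(
    {
        "actionType": ["Made Shot", "Missed Shot", "Rebound", "Foul"],
        "xLegacy": [-34, 0, 0, 0],
        "yLegacy": [16, 250, 0, 0],
    }
)

# Both coordinates non-zero: misses the on-axis shot.
both_nonzero = toy[(toy["xLegacy"] != 0) & (toy["yLegacy"] != 0)]
# At least one coordinate non-zero: catches it.
either_nonzero = toy[(toy["xLegacy"] != 0) | (toy["yLegacy"] != 0)]

print(both_nonzero["actionType"].tolist())    # ['Made Shot']
print(either_nonzero["actionType"].tolist())  # ['Made Shot', 'Missed Shot']
```

Running the `|` version against `pbp_df` would confirm whether the result for this game is still limited to *Made Shot* and *Missed Shot*.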

In [62]:
# Identifying the range of actionId and actionNumber
print('MAX actionId: \t' + str(pbp_df.actionId.max()))
print('MIN actionId: \t' + str(pbp_df.actionId.min()))
print('MAX actionNumber: \t' + str(pbp_df.actionNumber.max()))
print('MIN actionNumber: \t' + str(pbp_df.actionNumber.min()))

Index(['actionNumber', 'clock', 'period', 'teamId', 'teamTricode', 'personId',
       'playerName', 'playerNameI', 'xLegacy', 'yLegacy', 'shotDistance',
       'shotResult', 'isFieldGoal', 'scoreHome', 'scoreAway', 'pointsTotal',
       'location', 'description', 'actionType', 'subType', 'videoAvailable',
       'actionId'],
      dtype='object')
MAX actionId: 	503
MIN actionId: 	1
MAX actionNumber: 	696
MIN actionNumber: 	2


### Notes about *FT* vs. *FG*
There appears to be no numeric link between a *FG* and a *FT* besides the time in the `clock` column. 

In [100]:
# Helper functions for searching through column values
def searchColRange(df, col, lo, hi=None):
  # Inclusive range filter; hi=None means no upper bound
  if hi is None:
    return df[df[col] >= lo]

  return df[(df[col] >= lo) & (df[col] <= hi)]

def searchActionId(df, lo=1, hi=None):
  return searchColRange(df, 'actionId', lo, hi)

def searchActionNumber(df, lo=1, hi=None):
  return searchColRange(df, 'actionNumber', lo, hi)

searchActionNumber(pbp_df, 93, 95)


Unnamed: 0,actionNumber,clock,period,teamId,teamTricode,personId,playerName,playerNameI,xLegacy,yLegacy,shotDistance,shotResult,isFieldGoal,scoreHome,scoreAway,pointsTotal,location,description,actionType,subType,videoAvailable,actionId
63,93,PT05M03.00S,1,1610612744,GSW,201939,Curry,S. Curry,0,0,0,,0,21,22,43,h,Curry Free Throw 1 of 3 (7 PTS),Free Throw,Free Throw 1 of 3,1,64
64,94,PT05M03.00S,1,1610612744,GSW,201939,Curry,S. Curry,0,0,0,,0,22,22,44,h,Curry Free Throw 2 of 3 (8 PTS),Free Throw,Free Throw 2 of 3,1,65
65,95,PT05M03.00S,1,1610612744,GSW,201939,Curry,S. Curry,0,0,0,,0,23,22,45,h,Curry Free Throw 3 of 3 (9 PTS),Free Throw,Free Throw 3 of 3,0,66


In [112]:
# Seeing how 'Free Throw' is documented
pbp_df[pbp_df['subType'] == 'Free Throw 1 of 1']

Unnamed: 0,actionNumber,clock,period,teamId,teamTricode,personId,playerName,playerNameI,xLegacy,yLegacy,shotDistance,shotResult,isFieldGoal,scoreHome,scoreAway,pointsTotal,location,description,actionType,subType,videoAvailable,actionId
20,32,PT09M14.00S,1,1610612744,GSW,201939,Curry,S. Curry,0,0,0,,0,9.0,10.0,19,h,Curry Free Throw 1 of 1 (4 PTS),Free Throw,Free Throw 1 of 1,1,21
382,530,PT10M19.00S,4,1610612744,GSW,203952,Wiggins,A. Wiggins,0,0,0,,0,,,0,h,MISS Wiggins Free Throw 1 of 1,Free Throw,Free Throw 1 of 1,1,383
398,551,PT09M18.00S,4,1610612744,GSW,1630164,Wiseman,J. Wiseman,0,0,0,,0,93.0,103.0,196,h,Wiseman Free Throw 1 of 1 (11 PTS),Free Throw,Free Throw 1 of 1,1,399
446,616,PT04M21.00S,4,1610612743,DEN,203999,Jokic,N. Jokic,0,0,0,,0,105.0,114.0,219,v,Jokic Free Throw 1 of 1 (22 PTS),Free Throw,Free Throw 1 of 1,1,447


### Notes about **actionType == 'Free Throw'**
Free throw **makes** include `scoreHome`, `scoreAway`, and `pointsTotal` values. Free throw **misses** have *NULL* values in `scoreHome` and `scoreAway`, and `pointsTotal` is 0. Although it would be ideal to include **FTM** and **FTA** in this dataset, isolated free throws (meaning no and-1 opportunity) do not include `xLegacy` and `yLegacy` values.

### Key insights
1. In these data, `subType == 'Free Throw 1 of 1'` indicates that **the free throw is an and-1 opportunity**
2. Important data is stored in...
    1. `actionType == 'Missed Shot'`
    2. `actionType == 'Made Shot'`
    3. `subType == 'Free Throw 1 of 1'`.
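The first insight can be checked rather than assumed. Since `clock` is the only link between a *FG* and a *FT*, one option is to pair each `Free Throw 1 of 1` row with a *Made Shot* by the same player at the same `period` and `clock`. The sketch below is an assumption about how that check could look, run on made-up rows with hypothetical `personId` values; `has_made_shot` is a name introduced here.

```python
import pandas as pd

# Made-up rows: player 1 has a made shot at the same time as the free throw,
# player 2 does not.
toy = pd.DataFrame(
    {
        "period": [4, 4, 4],
        "clock": ["PT09M18.00S", "PT09M18.00S", "PT04M21.00S"],
        "personId": [1, 1, 2],
        "actionType": ["Made Shot", "Free Throw", "Free Throw"],
        "subType": ["Layup Shot", "Free Throw 1 of 1", "Free Throw 1 of 1"],
    }
)

keys = ["period", "clock", "personId"]
made = toy.loc[toy["actionType"] == "Made Shot", keys].drop_duplicates()
ft = toy[toy["subType"] == "Free Throw 1 of 1"]

# indicator=True adds a `_merge` column; 'both' means a matching made shot exists.
paired = ft.merge(made, on=keys, how="left", indicator=True)
paired["has_made_shot"] = paired["_merge"] == "both"

print(paired[["personId", "has_made_shot"]])
```

Any row where `has_made_shot` is `False` in the real data would be a `Free Throw 1 of 1` that is not an and-1, which would be worth inspecting by hand.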

In [114]:
# Function that selects the rows necessary for Shotmesh data
def getScoreData(df):
  is_field_goal = df['actionType'].isin(['Made Shot', 'Missed Shot'])
  is_and_one_ft = df['subType'] == 'Free Throw 1 of 1'
  return df[is_field_goal | is_and_one_ft]

score = getScoreData(pbp_df)

In [118]:
score[score['actionType'] == 'Free Throw']

Unnamed: 0,actionNumber,clock,period,teamId,teamTricode,personId,playerName,playerNameI,xLegacy,yLegacy,shotDistance,shotResult,isFieldGoal,scoreHome,scoreAway,pointsTotal,location,description,actionType,subType,videoAvailable,actionId
20,32,PT09M14.00S,1,1610612744,GSW,201939,Curry,S. Curry,0,0,0,,0,9.0,10.0,19,h,Curry Free Throw 1 of 1 (4 PTS),Free Throw,Free Throw 1 of 1,1,21
382,530,PT10M19.00S,4,1610612744,GSW,203952,Wiggins,A. Wiggins,0,0,0,,0,,,0,h,MISS Wiggins Free Throw 1 of 1,Free Throw,Free Throw 1 of 1,1,383
398,551,PT09M18.00S,4,1610612744,GSW,1630164,Wiseman,J. Wiseman,0,0,0,,0,93.0,103.0,196,h,Wiseman Free Throw 1 of 1 (11 PTS),Free Throw,Free Throw 1 of 1,1,399
446,616,PT04M21.00S,4,1610612743,DEN,203999,Jokic,N. Jokic,0,0,0,,0,105.0,114.0,219,v,Jokic Free Throw 1 of 1 (22 PTS),Free Throw,Free Throw 1 of 1,1,447


In [120]:
score[score['clock'] == 'PT09M18.00S']

Unnamed: 0,actionNumber,clock,period,teamId,teamTricode,personId,playerName,playerNameI,xLegacy,yLegacy,shotDistance,shotResult,isFieldGoal,scoreHome,scoreAway,pointsTotal,location,description,actionType,subType,videoAvailable,actionId
395,545,PT09M18.00S,4,1610612744,GSW,1630164,Wiseman,J. Wiseman,-34,16,4,Made,1,92,103,195,h,Wiseman 4' Layup (10 PTS) (Wiggins 3 AST),Made Shot,Layup Shot,1,396
398,551,PT09M18.00S,4,1610612744,GSW,1630164,Wiseman,J. Wiseman,0,0,0,,0,93,103,196,h,Wiseman Free Throw 1 of 1 (11 PTS),Free Throw,Free Throw 1 of 1,1,399


### Notes about **and-1** in these data
If the **and-1** opportunity is a miss, then `pointsTotal = 0` in the respective FT row.
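Because field goals carry `shotResult` but free throws do not, a make/miss flag for the FT rows has to come from another column. A minimal sketch, using two free-throw rows copied from the output above and two hypothetical column names (`made`, `made_alt`), derives the flag from either the missing score or the `MISS` prefix in `description`:

```python
import pandas as pd

# Two free-throw rows taken from the output above (subset of columns).
ft = pd.DataFrame(
    {
        "description": [
            "Curry Free Throw 1 of 1 (4 PTS)",
            "MISS Wiggins Free Throw 1 of 1",
        ],
        "scoreHome": [9.0, None],
        "pointsTotal": [19, 0],
    }
)

# A made free throw has a score recorded; a miss leaves it null.
ft["made"] = ft["scoreHome"].notna()
# Cross-check: missed free throws start with 'MISS' in the description.
ft["made_alt"] = ~ft["description"].str.startswith("MISS")

print(ft[["description", "made", "made_alt"]])
```

If the two flags ever disagree on the full dataset, that would point to a row that does not follow the pattern described in these notes.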

In [144]:
# Checking to see how to best work through time format
import re
score[['clock', 'period']].tail(5)
# Each two-digit run in e.g. 'PT11M43.00S' is minutes, seconds, hundredths
dig = re.compile(r'[0-9]{2}')

for item in score['clock']:
  time = dig.findall(item)
  print(time[0], ' minutes \t', time[1], '.', time[2], ' seconds')
  break  # only inspect the first row

11  minutes 	 43 . 00  seconds
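For later use, the `clock` strings may be easier to handle as a single number. The sketch below assumes every value follows the `PT<minutes>M<seconds>S` pattern seen in this game and converts it to seconds remaining in the period; `CLOCK_RE` and `clock_to_seconds` are hypothetical names, not part of the notebook.

```python
import re

# Matches strings such as 'PT11M43.00S': minutes, then seconds with optional decimals.
CLOCK_RE = re.compile(r"PT(\d+)M(\d+(?:\.\d+)?)S")


def clock_to_seconds(clock: str) -> float:
    """Convert a 'PT11M43.00S' style clock into seconds left in the period."""
    match = CLOCK_RE.fullmatch(clock)
    if match is None:
        raise ValueError(f"unexpected clock format: {clock!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


print(clock_to_seconds("PT11M43.00S"))  # 703.0
print(clock_to_seconds("PT09M18.00S"))  # 558.0
```

Applied to the DataFrame, this would look like `score['clock'].map(clock_to_seconds)`. Raising on an unexpected format is a deliberate choice here, so that a differently shaped value is noticed instead of silently mis-parsed.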


PT12M00.00S
PT12M00.00S
PT11M43.00S
PT11M29.00S
PT11M26.00S
PT11M19.00S
PT10M59.00S
PT10M57.00S
PT10M52.00S
PT10M40.00S
PT10M22.00S
PT10M12.00S
PT10M05.00S
PT09M50.00S
PT09M34.00S
PT09M34.00S
PT09M30.00S
PT09M20.00S
PT09M14.00S
PT09M14.00S
PT09M14.00S
PT09M04.00S
PT09M04.00S
PT08M53.00S
PT08M41.00S
PT08M33.00S
PT08M30.00S
PT08M28.00S
PT08M20.00S
PT07M54.00S
PT07M54.00S
PT07M52.00S
PT07M43.00S
PT07M27.00S
PT07M27.00S
PT07M27.00S
PT07M26.00S
PT07M22.00S
PT07M22.00S
PT07M22.00S
PT07M22.00S
PT07M22.00S
PT07M04.00S
PT06M55.00S
PT06M49.00S
PT06M39.00S
PT06M39.00S
PT06M28.00S
PT06M19.00S
PT06M19.00S
PT06M12.00S
PT05M59.00S
PT05M55.00S
PT05M44.00S
PT05M41.00S
PT05M26.00S
PT05M26.00S
PT05M25.00S
PT05M25.00S
PT05M16.00S
PT05M03.00S
PT05M03.00S
PT05M03.00S
PT05M03.00S
PT05M03.00S
PT05M03.00S
PT04M51.00S
PT04M31.00S
PT04M29.00S
PT04M24.00S
PT04M23.00S
PT04M23.00S
PT04M23.00S
PT04M23.00S
PT04M23.00S
PT04M12.00S
PT04M09.00S
PT03M58.00S
PT03M56.00S
PT03M41.00S
PT03M28.00S
PT03M20.00S
PT03M18.00S
PT03