# **November 8, 2022**
Working on using [Beautiful Soup](https://beautiful-soup-4.readthedocs.io/en/latest/) to extract *JSON data* from [NBA.com](https://www.nba.com/schedule?cal=0&pd=false&region=1). These data will be scraped for the *Shot Mesh* project.

In [2]:
# Importing packages
from bs4 import BeautifulSoup
import requests
import json

In [3]:
# Requesting webpage
gamepage = requests.get('https://www.nba.com/game/den-vs-gsw-0022200026/game-charts')

In [4]:
# Turning it into soup
gamesoup = BeautifulSoup(gamepage.text, 'lxml')

### Content tag for shot data
`<script id="__NEXT_DATA__" type="application/json">`

In [5]:
gamescript = gamesoup.find(id='__NEXT_DATA__')

In [6]:
# New variable containing the script content
gamejson = json.loads(gamescript.text)

# FOR LATER... DESIRED FORMAT FOR EXPORTING JSON FILES
# json_obj = json.dumps(gamejson, indent=4)
# with open("gamedata.json", "w") as outfile:
#   outfile.write(json_obj)
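
The lookup above assumes the `__NEXT_DATA__` script tag is always present. Below is a minimal sketch of a more defensive version, run against an inline HTML string rather than the live page. The function name `extract_next_data` and the sample payload are hypothetical, and `html.parser` is used only so the sketch does not depend on `lxml`.

```python
import json
from bs4 import BeautifulSoup

def extract_next_data(html):
    """Return the parsed __NEXT_DATA__ payload, or None if the tag is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    script = soup.find('script', id='__NEXT_DATA__')
    if script is None or not script.string:
        return None
    return json.loads(script.string)

# Hypothetical stand-in for gamepage.text
sample_html = (
    '<html><body>'
    '<script id="__NEXT_DATA__" type="application/json">'
    '{"props": {"pageProps": {"playByPlay": {"actions": []}}}}'
    '</script></body></html>'
)

data = extract_next_data(sample_html)
actions = data['props']['pageProps']['playByPlay']['actions']
print(actions)
```

Returning `None` instead of raising lets a batch scrape skip pages that have no embedded data.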


# **November 9, 2022**
Converting the *text* inside `id='__NEXT_DATA__'` into a more usable data format. 

### **Arguments for JSON**
The original data is in this format. This structure also provides a high degree of flexibility, which could be advantageous when converting from `JSON` to a *Python-specific class*.

### **Arguments for CSV**
This structure is more accessible for other parties. Based on the complexity of my drafted data structures, I believe that a `CSV` can capture all relevant information without substantial redundancies.
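
To make the CSV argument concrete, here is a minimal sketch of writing flat records to CSV text with the standard library. The two records are hypothetical stand-ins modelled on fields that appear in the play-by-play data. Flat dictionaries map one-to-one onto rows, whereas nested JSON would need flattening first.

```python
import csv
import io

# Hypothetical flat play-by-play records
records = [
    {'actionNumber': 545, 'playerName': 'Wiseman', 'actionType': 'Made Shot',
     'xLegacy': -34, 'yLegacy': 16},
    {'actionNumber': 551, 'playerName': 'Wiseman', 'actionType': 'Free Throw',
     'xLegacy': 0, 'yLegacy': 0},
]

# Write to an in-memory buffer so the sketch leaves no file behind
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()),
                        lineterminator='\n')
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```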

In [7]:
# New variable for ONLY play-by-play data
pbp = gamejson['props']['pageProps']['playByPlay']['actions']

# Saving data to JSON
json_obj = json.dumps(pbp, indent=4)
with open("gamedata.json", "w") as outfile:
  outfile.write(json_obj)

In [8]:
# Starting to work with pandas
import pandas as pd
pbp_df = pd.read_json("gamedata.json")

In [9]:
action_type = pbp_df['actionType']
action_type.unique()

array(['period', 'Jump Ball', 'Made Shot', 'Missed Shot', 'Rebound',
       'Turnover', 'Foul', '', 'Violation', 'Free Throw', 'Substitution',
       'Timeout', 'Instant Replay'], dtype=object)

### Notes about *actionType*
While I believe only rows with *Made Shot*, *Missed Shot*, and *Free Throw* action types will be important, I need to verify that no other action types contain positional data (seen as non-zero values in `xLegacy` and `yLegacy`).

In [10]:
# Showing all action types for non-zero court positions
pbp_df[(pbp_df['xLegacy'] != 0) & (pbp_df['yLegacy'] != 0)]['actionType'].unique()

array(['Made Shot', 'Missed Shot'], dtype=object)
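
One caveat about the filter above: `&` keeps a row only when both coordinates are non-zero, so a row whose `xLegacy` or `yLegacy` happens to be exactly 0 would be dropped. The sketch below shows the difference between `&` and `|` on toy rows with hypothetical values. It is an illustration, not a re-run on the real data.

```python
import pandas as pd

# Toy rows (hypothetical values): the second shot lies exactly on x = 0
toy = pd.DataFrame({
    'actionType': ['Made Shot', 'Missed Shot', 'Rebound'],
    'xLegacy': [-34, 0, 0],
    'yLegacy': [16, 120, 0],
})

# Both coordinates non-zero (the filter used in this notebook)
both_nonzero = toy[(toy['xLegacy'] != 0) & (toy['yLegacy'] != 0)]

# At least one coordinate non-zero (keeps the on-axis shot)
either_nonzero = toy[(toy['xLegacy'] != 0) | (toy['yLegacy'] != 0)]

print(len(both_nonzero), len(either_nonzero))
```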

In [11]:
# Identifying the range of actionId and actionNumber
print('MAX actionId: \t' + str(pbp_df.actionId.max()))
print('MIN actionId: \t' + str(pbp_df.actionId.min()))
print('MAX actionNumber: \t' + str(pbp_df.actionNumber.max()))
print('MIN actionNumber: \t' + str(pbp_df.actionNumber.min()))

MAX actionId: 	503
MIN actionId: 	1
MAX actionNumber: 	696
MIN actionNumber: 	2


### Notes about *FT* vs. *FG*
There appears to be no numeric link between a *FG* and a *FT* besides the time in the `clock` column. 

In [12]:
# Helper functions for searching through column values
# (lo/hi are used instead of min/max so the built-ins are not shadowed)
def searchColRange(df, col, lo, hi):
  # hi == -1 means "no upper bound"; both bounds are inclusive
  if hi == -1:
    return df[df[col] >= lo]

  return df[(df[col] >= lo) & (df[col] <= hi)]

def searchActionId(df, lo = 1, hi = -1):
  return searchColRange(df, 'actionId', lo, hi)

def searchActionNumber(df, lo = 1, hi = -1):
  return searchColRange(df, 'actionNumber', lo, hi)

searchActionNumber(pbp_df, 93, 95)


Unnamed: 0,actionNumber,clock,period,teamId,teamTricode,personId,playerName,playerNameI,xLegacy,yLegacy,...,isFieldGoal,scoreHome,scoreAway,pointsTotal,location,description,actionType,subType,videoAvailable,actionId
63,93,PT05M03.00S,1,1610612744,GSW,201939,Curry,S. Curry,0,0,...,0,21,22,43,h,Curry Free Throw 1 of 3 (7 PTS),Free Throw,Free Throw 1 of 3,1,64
64,94,PT05M03.00S,1,1610612744,GSW,201939,Curry,S. Curry,0,0,...,0,22,22,44,h,Curry Free Throw 2 of 3 (8 PTS),Free Throw,Free Throw 2 of 3,1,65
65,95,PT05M03.00S,1,1610612744,GSW,201939,Curry,S. Curry,0,0,...,0,23,22,45,h,Curry Free Throw 3 of 3 (9 PTS),Free Throw,Free Throw 3 of 3,0,66


In [13]:
# Seeing how 'Free Throw' is documented
pbp_df[pbp_df['subType'] == 'Free Throw 1 of 1']

Unnamed: 0,actionNumber,clock,period,teamId,teamTricode,personId,playerName,playerNameI,xLegacy,yLegacy,...,isFieldGoal,scoreHome,scoreAway,pointsTotal,location,description,actionType,subType,videoAvailable,actionId
20,32,PT09M14.00S,1,1610612744,GSW,201939,Curry,S. Curry,0,0,...,0,9.0,10.0,19,h,Curry Free Throw 1 of 1 (4 PTS),Free Throw,Free Throw 1 of 1,1,21
382,530,PT10M19.00S,4,1610612744,GSW,203952,Wiggins,A. Wiggins,0,0,...,0,,,0,h,MISS Wiggins Free Throw 1 of 1,Free Throw,Free Throw 1 of 1,1,383
398,551,PT09M18.00S,4,1610612744,GSW,1630164,Wiseman,J. Wiseman,0,0,...,0,93.0,103.0,196,h,Wiseman Free Throw 1 of 1 (11 PTS),Free Throw,Free Throw 1 of 1,1,399
446,616,PT04M21.00S,4,1610612743,DEN,203999,Jokic,N. Jokic,0,0,...,0,105.0,114.0,219,v,Jokic Free Throw 1 of 1 (22 PTS),Free Throw,Free Throw 1 of 1,1,447


### Notes about **actionType == 'Free Throw'**
Free throw **makes** include `scoreHome`, `scoreAway`, and `pointsTotal` values. Free throw **misses** have *NULL* values in the aforementioned columns. Although it would be ideal to include **FTM** and **FTA** in this dataset, isolated free throws (meaning no and-1 opportunity) do not include `xLegacy` and `yLegacy` values.
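
The make/miss observation can be turned into a boolean column. Here is a minimal sketch on two rows modelled on the output above. The column names `made` and `made_by_text` are hypothetical, and the `MISS` prefix check assumes every missed free throw is described the way the Wiggins row is.

```python
import pandas as pd

# Two rows modelled on the 'Free Throw 1 of 1' output above
ft = pd.DataFrame({
    'playerName': ['Curry', 'Wiggins'],
    'scoreHome': [9.0, None],
    'scoreAway': [10.0, None],
    'description': ['Curry Free Throw 1 of 1 (4 PTS)',
                    'MISS Wiggins Free Throw 1 of 1'],
})

# A make carries a score; a miss has NULL score columns
ft['made'] = ft['scoreHome'].notna()

# Cross-check against the description text (assumed 'MISS' prefix)
ft['made_by_text'] = ~ft['description'].str.startswith('MISS')

print(ft[['playerName', 'made', 'made_by_text']])
```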

### Key insights
1. In these data, `subType == 'Free Throw 1 of 1'` indicates that **the free throw is an and-1 opportunity**
2. The relevant data are stored in rows where...
    1. `actionType == 'Missed Shot'`
    2. `actionType == 'Made Shot'`
    3. `subType == 'Free Throw 1 of 1'`.

In [14]:
# Function that selects the rows necessary for Shot Mesh data
def getScoreData(df):
  return df[(df['actionType'] == 'Made Shot') | (df['actionType'] == 'Missed Shot') | (df['subType'] == 'Free Throw 1 of 1')]

score = getScoreData(pbp_df)

In [15]:
score[score['actionType'] == 'Free Throw']

Unnamed: 0,actionNumber,clock,period,teamId,teamTricode,personId,playerName,playerNameI,xLegacy,yLegacy,...,isFieldGoal,scoreHome,scoreAway,pointsTotal,location,description,actionType,subType,videoAvailable,actionId
20,32,PT09M14.00S,1,1610612744,GSW,201939,Curry,S. Curry,0,0,...,0,9.0,10.0,19,h,Curry Free Throw 1 of 1 (4 PTS),Free Throw,Free Throw 1 of 1,1,21
382,530,PT10M19.00S,4,1610612744,GSW,203952,Wiggins,A. Wiggins,0,0,...,0,,,0,h,MISS Wiggins Free Throw 1 of 1,Free Throw,Free Throw 1 of 1,1,383
398,551,PT09M18.00S,4,1610612744,GSW,1630164,Wiseman,J. Wiseman,0,0,...,0,93.0,103.0,196,h,Wiseman Free Throw 1 of 1 (11 PTS),Free Throw,Free Throw 1 of 1,1,399
446,616,PT04M21.00S,4,1610612743,DEN,203999,Jokic,N. Jokic,0,0,...,0,105.0,114.0,219,v,Jokic Free Throw 1 of 1 (22 PTS),Free Throw,Free Throw 1 of 1,1,447


In [16]:
score[score['clock'] == 'PT09M18.00S']

Unnamed: 0,actionNumber,clock,period,teamId,teamTricode,personId,playerName,playerNameI,xLegacy,yLegacy,...,isFieldGoal,scoreHome,scoreAway,pointsTotal,location,description,actionType,subType,videoAvailable,actionId
395,545,PT09M18.00S,4,1610612744,GSW,1630164,Wiseman,J. Wiseman,-34,16,...,1,92,103,195,h,Wiseman 4' Layup (10 PTS) (Wiggins 3 AST),Made Shot,Layup Shot,1,396
398,551,PT09M18.00S,4,1610612744,GSW,1630164,Wiseman,J. Wiseman,0,0,...,0,93,103,196,h,Wiseman Free Throw 1 of 1 (11 PTS),Free Throw,Free Throw 1 of 1,1,399


### Notes about **and-1** in these data
If the **and-1** opportunity is a miss, then `pointsTotal = 0` in the respective FT row.
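
Since the clock is the only link between a field goal and its free throw, the and-1 free throw can inherit the shot location through a join. This is a minimal sketch on the two Wiseman rows shown above. Joining on `period`, `clock`, and `personId` is an assumption, and it could mis-pair rows if one player had two made shots at the same clock time.

```python
import pandas as pd

# The two rows from the PT09M18.00S output above (subset of columns)
shots = pd.DataFrame({
    'actionNumber': [545, 551],
    'clock': ['PT09M18.00S', 'PT09M18.00S'],
    'period': [4, 4],
    'personId': [1630164, 1630164],
    'actionType': ['Made Shot', 'Free Throw'],
    'subType': ['Layup Shot', 'Free Throw 1 of 1'],
    'xLegacy': [-34, 0],
    'yLegacy': [16, 0],
})

made = shots[shots['actionType'] == 'Made Shot']
and1 = shots[shots['subType'] == 'Free Throw 1 of 1']

# Attach the field goal's location to its and-1 free throw
paired = and1.merge(
    made[['period', 'clock', 'personId', 'actionNumber', 'xLegacy', 'yLegacy']],
    on=['period', 'clock', 'personId'],
    suffixes=('_ft', '_fg'),
)

print(paired[['actionNumber_ft', 'actionNumber_fg', 'xLegacy_fg', 'yLegacy_fg']])
```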

In [17]:
# Checking to see how to best work through time format
import re
score[['clock', 'period']].tail(5)
dig = re.compile('[0-9]{2}')

# Getting clock/period value at row 100
clock = dig.findall(score.iloc[100].clock)
period = score.iloc[100].period

# Conversion into overall seconds
period_seconds = (period - 1) * 3600
clock_seconds = int(clock[0]) * 60 + int(clock[1]) + int(clock[2]) / 100
game_seconds = period_seconds + clock_seconds

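# Encoding period and clock as one number lets me recover both through a
# simple computation (game_seconds // 3600 + 1 and game_seconds % 3600).
# NOTE: 3600 is only a stride between periods; an NBA quarter lasts 720 s and
# `clock` counts down, so this value is a key, not elapsed game time.
# Overtime periods may need separate handling (e.g. via a dictionary).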
print("Period: ", period, "\tSeconds: ", period_seconds)
print("Clock: ", clock, "\tSeconds: ", clock_seconds)
print("Game seconds: ", game_seconds)


Period:  3 	Seconds:  7200
Clock:  ['07', '11', '00'] 	Seconds:  431.0
Game seconds:  7631.0
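
If true elapsed game time is needed rather than the period/clock encoding above, one option is sketched below. It assumes 12-minute regulation periods, 5-minute overtime periods, and that `clock` is the time remaining in the period. The function names are hypothetical.

```python
import re

CLOCK_RE = re.compile(r'PT(\d+)M(\d+(?:\.\d+)?)S')
QUARTER_SECONDS = 12 * 60   # regulation period length
OVERTIME_SECONDS = 5 * 60   # overtime period length

def clock_to_seconds(clock):
    """Convert a string such as 'PT07M11.00S' into seconds remaining."""
    match = CLOCK_RE.fullmatch(clock)
    if match is None:
        raise ValueError(f'unexpected clock format: {clock!r}')
    return int(match.group(1)) * 60 + float(match.group(2))

def elapsed_game_seconds(period, clock):
    """Seconds elapsed since tip-off; periods 5 and up are treated as overtime."""
    remaining = clock_to_seconds(clock)
    if period <= 4:
        return (period - 1) * QUARTER_SECONDS + (QUARTER_SECONDS - remaining)
    overtime_index = period - 5
    return (4 * QUARTER_SECONDS
            + overtime_index * OVERTIME_SECONDS
            + (OVERTIME_SECONDS - remaining))

print(elapsed_game_seconds(3, 'PT07M11.00S'))
```

Matching the whole string with named unit markers avoids relying on the position of two-digit groups.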


In [18]:
schedule_page = requests.get('https://www.nba.com/schedule?cal=0&pd=false&region=1')
schedule_soup = BeautifulSoup(schedule_page.text, 'lxml')
# The keyword is attrs (not attr); a raw string avoids the invalid "\/" escape
links = schedule_soup.find_all("a", attrs={'href': re.compile(r"/game/[a-z]{3}-vs-[a-z]{3}-[0-9]{10}")})


In [19]:
import time
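page = requests.get('https://www.nba.com/schedule')
# NOTE: sleeping after the request does not let client-side JavaScript run;
# requests only sees the HTML that the server returned
time.sleep(5)
soup = BeautifulSoup(page.text, 'html.parser')
# Search the freshly parsed soup rather than schedule_soup from In [18]
for url in soup.find_all(href=re.compile(".*game.*")):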
  print(url['href'])

/games
/games
/stats/tools/media-central-game-stats
https://nbagameworn.nba.com/iSynApp/showHomePage.action?sid=1101561
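
The loose pattern `.*game.*` matches any link containing "game", which is why unrelated links appear above. The stricter game-link pattern can be sketched on a hypothetical list of `href` values. The second entry mirrors the game page URL used earlier in this notebook, and the capture groups are an addition for pulling out the game ID.

```python
import re

# away tricode, home tricode, 10-digit game ID
GAME_LINK_RE = re.compile(r'/game/([a-z]{3})-vs-([a-z]{3})-([0-9]{10})')

# Hypothetical href values
hrefs = [
    '/games',
    '/game/den-vs-gsw-0022200026/game-charts',
    '/stats/tools/media-central-game-stats',
]

# Keep only the links that look like game pages and pull out the game ID
game_ids = [m.group(3) for m in map(GAME_LINK_RE.search, hrefs) if m]
print(game_ids)
```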


In [26]:
# Only request headers belong here; the Access-Control-* entries are CORS
# response headers set by servers, so they were dropped
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'
    }
  
url = "https://www.nba.com/schedule?bc=LP&cal=4&pd=false&region=1"
# headers must be passed by keyword; the second positional argument is params
req = requests.get(url, headers=headers)
soup = BeautifulSoup(req.content, "html.parser")
page = soup.prettify()

# Write the prettified HTML (soup.text would strip every tag)
with open("schedulepage.html", "w") as outfile:
  outfile.write(page)

# Game ID
After a few hours of trying to figure out whether I could effectively scrape URLs from the NBA schedule page, I learned the following:

Any NBA game page can be accessed through `nba.com/game/[10-digit gameID]`.

This year's numbers run from `0022200001` to `0022201230` for the 1,230 regular season NBA games played each year.

In [30]:
url_start = 'https://www.nba.com/game/00'
url_end = '/game-charts'
game_num = 22200001
total_games = 1230
url_list = []

for game in range(total_games):
  req_url = url_start + str(game_num + game) + url_end
  url_list.append(req_url)

url_list[120]

'https://www.nba.com/game/0022200121/game-charts'

In [38]:
import datetime
datetime.date.today() - datetime.timedelta(days=1)

datetime.date(2022, 11, 17)

In [43]:
"{0:0=4d}".format(5)

'0005'
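
The zero-padding above can be combined with the game ID range into a small URL builder. This is a sketch under the assumption that every 2022-23 regular season ID is the prefix `00222` followed by a five-digit game number, which is inferred only from the observed range `0022200001` to `0022201230`. The names `SEASON_PREFIX` and `game_url` are hypothetical.

```python
URL_TEMPLATE = 'https://www.nba.com/game/{game_id}/game-charts'
SEASON_PREFIX = '00222'   # assumed from the observed ID range
TOTAL_GAMES = 1230

def game_url(game_number):
    """Build the game-charts URL for regular season game 1..1230."""
    if not 1 <= game_number <= TOTAL_GAMES:
        raise ValueError(f'game_number must be between 1 and {TOTAL_GAMES}')
    game_id = f'{SEASON_PREFIX}{game_number:05d}'
    return URL_TEMPLATE.format(game_id=game_id)

print(game_url(121))
```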

In [49]:
import requests
 
# use to parse html text
from lxml.html import fromstring 
from itertools import cycle
import traceback
 
 
def to_get_proxies():
    # website to get free proxies
    url = 'https://free-proxy-list.net/' 
 
    response = requests.get(url)
 
    parser = fromstring(response.text)
    # using a set to avoid duplicate IP entries.
    proxies = set() 
 
    for i in parser.xpath('//tbody/tr')[:10]:
 
        # to check if the corresponding IP is of type HTTPS
        if i.xpath('.//td[7][contains(text(),"yes")]'):
 
            # Grabbing IP and corresponding PORT
            proxy = ":".join([i.xpath('.//td[1]/text()')[0],
                              i.xpath('.//td[2]/text()')[0]])
 
            proxies.add(proxy)
    # Return only after every row has been checked (not inside the loop)
    return proxies

proxy_list = to_get_proxies()
proxy_list

{'20.234.198.245:8080'}
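
The `cycle` import suggests rotating through proxies. Below is a minimal sketch of that idea, which makes no network request. The helper name `next_proxy_config` is hypothetical, and the proxy value is copied from the output above purely as sample data.

```python
from itertools import cycle

# Sample value taken from the output above
proxy_list = {'20.234.198.245:8080'}

# sorted() gives a stable order before cycling endlessly through the pool
proxy_pool = cycle(sorted(proxy_list))

def next_proxy_config(pool):
    """Return a proxies mapping in the form that requests accepts."""
    proxy = next(pool)
    return {'http': f'http://{proxy}', 'https': f'http://{proxy}'}

config = next_proxy_config(proxy_pool)
print(config)

# A request would then look like:
# requests.get(url, proxies=config, timeout=10)
```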

In [5]:
import json
tj = json.loads("[]")
len(tj)

0