## Web scraping from www.nba.com/stats

- I want to scrape each quarter of the 2021/2022 NBA season into four separate data frames, one per quarter.
 - e.g., dataframe2 contains the 2nd quarter of every game played in the season.
- The stats tables on nba.com are rendered with JavaScript, so I use Selenium to load each page in a real browser before parsing it.

In [1]:
import html5lib
from pandas.io.html import read_html
import numpy as np
import pandas as pd
from selenium import webdriver

In [2]:
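# Set up the WebDriver
# (Selenium 4 takes the chromedriver path through a Service object)

from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service("/usr/local/bin/chromedriver"))
driver.maximize_window()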

In [3]:
# Loop over the four quarters (Period=1..4 in the URL)
# The implicit wait lets every element lookup poll for up to 20 seconds until the JavaScript has rendered the element
# The driver then selects "All" in the page dropdown via its XPath, which displays every row of the table on one page
# The table is located through its class name
# Its rendered outerHTML is read from the browser
# pandas parses that HTML into a DataFrame
# Finally, the data frame is stored in the dictionary under its quarter number

from io import StringIO
from selenium.webdriver.common.by import By

df = {}
for i in range(1,5):
    driver.get("https://www.nba.com/stats/teams/boxscores-traditional?Season=2021-22&Period="+str(i))
    driver.implicitly_wait(20)
    # find_element_by_* was removed in Selenium 4.3; find_element(By..., ...) replaces it
    all_pages = driver.find_element(By.XPATH, "/html/body/div[1]/div[2]/div[2]/div[3]/section[2]/div/div[2]/div[2]/div[1]/div[3]/div/label/div/select/option[1]")
    all_pages.click()
    table = driver.find_elements(By.CLASS_NAME, "Crom_table__p1iZz")
    table_html = table[0].get_attribute('outerHTML')
    # Recent pandas versions expect a file-like object rather than a raw HTML string
    df[i] = read_html(StringIO(table_html))[0]

driver.quit()
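
The loop above could also be factored into two small helpers. This is only a sketch under assumptions: `boxscore_url` and `fetch_table_html` are hypothetical names, the dropdown is assumed to be a standard `<select>` element reachable at the XPath used above without its trailing `/option[1]`, and its entry is assumed to be labelled exactly `All`. It swaps the implicit wait for Selenium's explicit waits (`WebDriverWait`), which block until one specific element is present.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.nba.com/stats/teams/boxscores-traditional"
# Assumption: the XPath from the loop above, minus the trailing /option[1]
SELECT_XPATH = ("/html/body/div[1]/div[2]/div[2]/div[3]/section[2]/div/div[2]"
                "/div[2]/div[1]/div[3]/div/label/div/select")
TABLE_CLASS = "Crom_table__p1iZz"


def boxscore_url(period, season="2021-22"):
    """Build the box-score URL for one quarter (period 1-4)."""
    return BASE_URL + "?" + urlencode({"Season": season, "Period": period})


def fetch_table_html(driver, period, timeout=20):
    """Load one quarter's page and return the rendered table's outer HTML."""
    # Imported here so boxscore_url can be used without Selenium installed
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import Select, WebDriverWait

    driver.get(boxscore_url(period))
    wait = WebDriverWait(driver, timeout)
    # Block until the page-size dropdown exists, then pick its "All" entry
    dropdown = wait.until(EC.presence_of_element_located((By.XPATH, SELECT_XPATH)))
    Select(dropdown).select_by_visible_text("All")
    # Block until the table is present, then read its rendered HTML
    table = wait.until(EC.presence_of_element_located((By.CLASS_NAME, TABLE_CLASS)))
    return table.get_attribute("outerHTML")


print(boxscore_url(2))
# https://www.nba.com/stats/teams/boxscores-traditional?Season=2021-22&Period=2
```

With a live `driver`, the string returned by `fetch_table_html(driver, i)` can be handed to `read_html` in place of `table_html`. Whether the table has finished re-rendering after the dropdown change would still need checking against the real page.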

In [4]:
# Check the data:
# The table looks organized and mostly clean
# Only minor data cleaning is needed

df[1].head()

Unnamed: 0,TEAM,MATCH UP,GAME DATE,W/L,MIN,PTS,FGM,FGA,FG%,3PM,...,FT%,OREB,DREB,REB,AST,TOV,STL,BLK,PF,+/-
0,PHX,PHX vs. SAC,04/10/2022,L,12,24,11,26,42.3,2,...,0.0,5,8,13,5,2,1,1,3,-9
1,PHI,PHI vs. DET,04/10/2022,W,12,36,13,21,61.9,1,...,100.0,3,6,9,6,2,5,3,5,6
2,MEM,MEM vs. BOS,04/10/2022,L,12,25,9,28,32.1,4,...,60.0,6,6,12,8,1,5,0,5,-7
3,TOR,TOR @ NYK,04/10/2022,L,12,22,9,24,37.5,4,...,0.0,3,8,11,6,2,2,1,3,-9
4,BOS,BOS @ MEM,04/10/2022,W,12,32,13,22,59.1,4,...,100.0,3,13,16,7,5,0,1,4,7
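
As one example of the cleaning this preview suggests, the `MATCH UP` strings encode both the opponent and the venue: `vs.` marks a home game and `@` an away game. A minimal sketch, assuming every value follows one of these two patterns (`parse_matchup` is a hypothetical helper, not part of the notebook):

```python
def parse_matchup(matchup):
    """Split 'PHX vs. SAC' or 'TOR @ NYK' into (team, opponent, is_home)."""
    for separator, is_home in ((" vs. ", True), (" @ ", False)):
        if separator in matchup:
            team, opponent = matchup.split(separator, 1)
            return team.strip(), opponent.strip(), is_home
    # Anything else is unexpected, so fail loudly instead of guessing
    raise ValueError(f"Unrecognized match-up format: {matchup!r}")


# Values taken from the preview above
print(parse_matchup("PHX vs. SAC"))  # ('PHX', 'SAC', True)
print(parse_matchup("TOR @ NYK"))    # ('TOR', 'NYK', False)
```

On the scraped frame this could be applied with `df[1]["MATCH UP"].map(parse_matchup)`.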


In [5]:
NBA_1st = df[1]
NBA_2nd = df[2]
NBA_3rd = df[3]
NBA_4th = df[4]

In [6]:
# Save each data frame to its own CSV file (the data/ directory must already exist)

NBA_1st.to_csv("data/NBA_1st.csv", index = False)
NBA_2nd.to_csv("data/NBA_2nd.csv", index = False)
NBA_3rd.to_csv("data/NBA_3rd.csv", index = False)
NBA_4th.to_csv("data/NBA_4th.csv", index = False)
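
The four `to_csv` calls can also be written as a loop that creates the output directory first. A sketch, assuming the dictionary `df` from the scraping step (keys 1 to 4) and using a hypothetical helper name `save_quarters`; it relies only on each value having a pandas-style `to_csv` method:

```python
from pathlib import Path

# File-name suffix for each quarter, matching the names used above
SUFFIXES = {1: "1st", 2: "2nd", 3: "3rd", 4: "4th"}


def save_quarters(frames, out_dir="data"):
    """Write frames[1..4] to <out_dir>/NBA_<suffix>.csv and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the directory if it is missing
    written = []
    for period, suffix in SUFFIXES.items():
        target = out_path / f"NBA_{suffix}.csv"
        frames[period].to_csv(target, index=False)
        written.append(target)
    return written
```

Calling `save_quarters(df)` would write the same four files as the cell above.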