# Web scraping from www.nba.com/stats

- I want to extract the per-quarter team box-score data of the 2023/2024 NBA regular season and store it in four separate DataFrames.
  - For example, `df[2]` will contain the second-quarter data of every game of the season.
- Since nba.com renders its stats tables with JavaScript, I will use Selenium to drive a real browser and perform the extraction.

In [1]:
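import html5lib  # HTML parser that pandas' read_html can use as a backend
from pandas import read_html  # public entry point for pandas.io.html.read_html
import numpy as np
import pandas as pd
from selenium import webdriver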

Set up the WebDriver

In [2]:
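from selenium.webdriver.chrome.service import Service

options = webdriver.ChromeOptions()
options.binary_location = "C:/Users/fre_d/OneDrive/Documentos/chrome-win64/chrome.exe"
chrome_driver_binary = "C:/Users/fre_d/OneDrive/Documentos/chromedriver-win64/chromedriver"
# Selenium 4 takes the driver path through a Service object and the options through `options=`
driver = webdriver.Chrome(service=Service(chrome_driver_binary), options=options)
driver.maximize_window()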


- The loop requests the box-score page once per quarter, using the `Period` parameter in the URL as the filter.
- The WebDriver is given an implicit wait of 20 seconds, so element lookups keep retrying for up to that long while the page finishes rendering; this helps to avoid errors.
- Then, the driver selects the “All” option from the drop-down menu on the page using XPath, which displays all the rows of the table on a single page.
- The driver finds this table through its CSS class name.
- Next, the rendered table's `outerHTML` is read out as a string.
- Now, pandas can convert it into a DataFrame.
- Finally, the DataFrame is added to the initially empty dictionary, keyed by quarter number.
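
The quarter filter in the first step is carried entirely by the `Period` query parameter. As a minimal sketch, the four URLs could be built with the standard library instead of string concatenation; the helper name `boxscore_url` and the `BASE_URL` constant are hypothetical and not part of the notebook.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.nba.com/stats/teams/boxscores-traditional"


def boxscore_url(period, season="2023-24", season_type="Regular Season"):
    """Return the box-score URL for one quarter (hypothetical helper)."""
    # urlencode escapes the space in "Regular Season" as "+"
    query = urlencode({"Season": season, "Period": period, "SeasonType": season_type})
    return f"{BASE_URL}?{query}"


# One URL per quarter, keyed by quarter number
urls = {period: boxscore_url(period) for period in range(1, 5)}
```

Each `urls[i]` can then be passed to `driver.get` in place of the concatenated string.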

In [3]:
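from io import StringIO
from selenium.webdriver.common.by import By

df = {}
# Implicit wait: element lookups retry for up to 20 s before raising an error
driver.implicitly_wait(20)
for i in range(1, 5):
    # Period=i selects the i-th quarter of every regular-season game
    driver.get("https://www.nba.com/stats/teams/boxscores-traditional?Season=2023-24&Period="+str(i)+"&SeasonType=Regular+Season")
    # The first <option> of the page-size drop-down is "All"
    all_pages = driver.find_element(By.XPATH, "/html/body/div[1]/div[2]/div[2]/div[3]/section[2]/div/div[2]/div[2]/div[1]/div[3]/div/label/div/select/option[1]")
    all_pages.click()
    table = driver.find_elements(By.CLASS_NAME, "Crom_table__p1iZz")
    table_html = table[0].get_attribute('outerHTML')
    # Recent pandas versions expect a file-like object rather than a raw HTML string
    df[i] = read_html(StringIO(table_html))[0]

driver.quit()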

#### Checking the data:
- The data looks well organized
- Only minor data cleaning is necessary

In [4]:
df[1].head()

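| Unnamed: 0 | TEAM | MATCH UP | GAME DATE | W/L | MIN | PTS | FGM | FGA | FG% | 3PM | ... | FT% | OREB | DREB | REB | AST | TOV | STL | BLK | PF | +/- |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | POR | POR @ SAC | 04/14/2024 | L | 12 | 18 | 6 | 28 | 21.4 | 1 | ... | 100.0 | 9 | 8 | 17 | 4 | 6 | 5 | 0 | 5 | -12 |
| 1 | SAC | SAC vs. POR | 04/14/2024 | W | 12 | 30 | 11 | 18 | 61.1 | 3 | ... | 71.4 | 0 | 10 | 10 | 8 | 5 | 5 | 1 | 5 | 12 |
| 2 | CHI | CHI @ NYK | 04/14/2024 | L | 12 | 28 | 12 | 18 | 66.7 | 3 | ... | 50.0 | 1 | 5 | 6 | 6 | 5 | 4 | 2 | 3 | -1 |
| 3 | DAL | DAL @ OKC | 04/14/2024 | L | 12 | 22 | 7 | 19 | 36.8 | 3 | ... | 71.4 | 0 | 7 | 7 | 5 | 7 | 1 | 0 | 3 | -17 |
| 4 | MEM | MEM vs. DEN | 04/14/2024 | L | 12 | 26 | 12 | 31 | 38.7 | 1 | ... | 50.0 | 8 | 9 | 17 | 5 | 3 | 2 | 1 | 4 | -3 |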
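
One quick way to back up the visual check is to recompute a derived column from its inputs. The sketch below assumes the usual definition FG% = FGM / FGA × 100, rounded to one decimal, and uses only the five rows shown in the output above; the name `fg_pct` is a hypothetical helper.

```python
# (team, FGM, FGA, reported FG%) copied from the df[1].head() output
rows = [
    ("POR", 6, 28, 21.4),
    ("SAC", 11, 18, 61.1),
    ("CHI", 12, 18, 66.7),
    ("DAL", 7, 19, 36.8),
    ("MEM", 12, 31, 38.7),
]


def fg_pct(fgm, fga):
    """Field-goal percentage rounded to one decimal, as displayed in the table."""
    return round(100 * fgm / fga, 1)


# Teams whose recomputed FG% differs from the scraped value
mismatches = [team for team, fgm, fga, reported in rows if fg_pct(fgm, fga) != reported]
print(mismatches)
```

For these five rows the list comes back empty, so the scraped percentages agree with the scraped makes and attempts.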


In [5]:
NBA_1st = df[1]
NBA_2nd = df[2]
NBA_3rd = df[3]
NBA_4th = df[4]

Save each DataFrame to its own CSV file (the `data/` directory must already exist)

In [6]:
NBA_1st.to_csv("data/NBA_1st.csv", index = False)
NBA_2nd.to_csv("data/NBA_2nd.csv", index = False)
NBA_3rd.to_csv("data/NBA_3rd.csv", index = False)
NBA_4th.to_csv("data/NBA_4th.csv", index = False)
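
The four calls above could also be written as a loop over the dictionary. This is a minimal sketch, not the notebook's own method: `save_quarters` and `ORDINALS` are hypothetical names, and the helper additionally creates the output directory if it is missing.

```python
from pathlib import Path

# Quarter number -> file-name suffix used in the notebook
ORDINALS = {1: "1st", 2: "2nd", 3: "3rd", 4: "4th"}


def save_quarters(frames, out_dir="data"):
    """Write each quarter's frame to <out_dir>/NBA_<ordinal>.csv and return the paths."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    written = []
    for quarter, frame in sorted(frames.items()):
        target = out_path / f"NBA_{ORDINALS[quarter]}.csv"
        frame.to_csv(target, index=False)  # same call as in the cell above
        written.append(target)
    return written
```

Calling `save_quarters(df)` with the dictionary filled by the scraping loop would produce the same four files under `data/`.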