## NBA Stats Web Scraping - with Selenium

Selenium is used here on top of Beautiful Soup because some webpages require interaction (clicking buttons, choosing from dropdowns) before they show their data, and BS4 cannot do that by itself.

The example here uses NBA.com's Advanced Stats page, where you have to select 'All' in the Page dropdown before every player's advanced stats are shown. (https://www.nba.com/stats/players/advanced/?sort=GP&dir=-1)

<b>Data Source:</b> https://www.nba.com/stats/players/advanced/?sort=GP&dir=-1

<b>Selenium Documentation:</b> https://www.selenium.dev/selenium/docs/api/py/index.html

<b>Tutorial Credits:</b> Nick from the 'Nick's Niche' YouTube channel (full tutorial: https://www.youtube.com/watch?v=LLOJOPXA9PY&list=PLTUcfu017zJCI7ENgSEK2NjwKp_MJocVL&index=4). Thank you for the comprehensive video series!

#### Installation instructions:
Before you begin, run `pip install selenium`, then install the web driver for your browser from the Drivers section of the Selenium docs: https://www.selenium.dev/selenium/docs/api/py/index.html#drivers

Download the driver version that matches your browser and browser version, and place it in the folder your code points to (in this notebook, the current working directory).
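One way to catch a misplaced driver early is to check the path before handing it to Selenium. The sketch below is only an illustration: `find_driver` is a hypothetical helper (not part of Selenium), and it assumes the driver file is named `chromedriver` (or `chromedriver.exe` on Windows) and sits directly in the folder you pass in.

```python
import os


def find_driver(folder, name="chromedriver"):
    """Return the full path to the driver file in `folder`, or raise if it is missing.

    Hypothetical helper for illustration; the file name is an assumption.
    """
    # Try the plain name first, then the Windows-style name with .exe
    for candidate in (name, name + ".exe"):
        path = os.path.join(folder, candidate)
        if os.path.isfile(path):
            return path
    raise FileNotFoundError(f"No {name} found in {folder}")
```

Calling `find_driver(os.getcwd())` returns the driver path if the file is in the working directory, and otherwise fails with a clear message before any browser is launched.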

In [74]:
#Import relevant libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from bs4 import BeautifulSoup
import pandas as pd
import os

#Get the current working directory from the os library (optional)
cwd = os.getcwd()

#Create the webdriver object, pointing it to where the driver executable is located.
#(Selenium 4 takes the path through a Service object; passing it positionally is deprecated.)
driver = webdriver.Chrome(service=Service(os.path.join(cwd, 'chromedriver')))


In [None]:
#This block of code is just to test out that we can in fact navigate to Yahoo.com and search 'seleniumhq' in the search box.
#(This was given as an example in the Selenium docs.)

"""
driver.get('http://www.yahoo.com')
assert 'Yahoo' in driver.title
elem = driver.find_element(By.NAME, 'p')  # Find the search box
elem.send_keys('seleniumhq' + Keys.RETURN)
driver.quit()
"""

------------------------

In [79]:
#Specify the URL of interest (NBA.com's advanced stats page) and navigate to it with the driver object.
url = r"https://www.nba.com/stats/players/advanced/?sort=GP&dir=-1"
driver.get(url)

In [81]:
#Locate the 'Page' dropdown by its XPath so Selenium can select Page -> All.
#(Note that we cannot perform this action with Beautiful Soup alone.)

page_button = Select(driver.find_element(By.XPATH, r'/html/body/main/div/div/div[2]/div/div/nba-stat-table/div[1]/div/div/select'))
#Select the 1st item in the Page dropdown menu, which is 'All'.
page_button.select_by_index(0)


In [82]:
#Now we can use our normal Beautiful Soup techniques to extract the data.
src = driver.page_source
parser = BeautifulSoup(src, 'lxml')
#After inspecting the page, the div with class "nba-stat-table__overflow" is the one that holds the table.
table = parser.find("div", attrs={"class": "nba-stat-table__overflow"})
#The headers are located inside the 'th' tags in the table
headers = table.find_all('th')
#Create headerlist by iterating through the headers (skipping the first one)
headerlist = [h.text.strip() for h in headers[1:]]

In [83]:
#Strip unwanted fields (headers containing 'RANK')
headerlist1 = [a for a in headerlist if 'RANK' not in a]
#All data is located in the 'tr' tags (skipping the first row)
rows = table.find_all('tr')[1:]
#Create the list of lists containing the data for each player (skipping the first cell of each row)
players_stats = [[td.get_text().strip() for td in row.find_all('td')[1:]] for row in rows]

In [84]:
#Drop the last 5 headers so the header list matches the data columns exactly
headerlist1 = headerlist1[:-5]
#Create the pandas DataFrame from players_stats, using headerlist1 as the columns.
stats = pd.DataFrame(players_stats, columns=headerlist1)
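The hard-coded `[:-5]` above depends on the page layout at the time of scraping. A slightly more defensive alternative is to trim the header list to the width of the data rows, sketched below. `align_headers` is a hypothetical helper name, and it assumes any surplus headers sit at the end of the list, as they do in this example.

```python
def align_headers(headers, rows):
    """Trim `headers` so its length matches the width of the data rows.

    Assumes every row has the same number of cells and that any extra
    headers are at the end of the list.
    """
    widths = {len(row) for row in rows}
    if len(widths) != 1:
        raise ValueError(f"Rows have inconsistent widths: {sorted(widths)}")
    width = widths.pop()
    if width > len(headers):
        raise ValueError("More data columns than headers")
    return headers[:width]
```

With this, `align_headers(headerlist1, players_stats)` could stand in for the fixed slice, and it raises an error instead of silently mislabelling columns if the row widths ever disagree.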

In [85]:
#Result (first 5 rows)
stats.head()

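| | PLAYER | TEAM | AGE | GP | W | L | MIN | OFFRTG | DEFRTG | NETRTG | ... | OREB% | DREB% | REB% | TO Ratio | eFG% | TS% | USG% | PACE | PIE | POSS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Domantas Sabonis | IND | 25 | 29 | 12 | 17 | 34.2 | 108.8 | 104.2 | 4.6 | ... | 9.5 | 24.5 | 17.0 | 13.9 | 62.2 | 65.6 | 21.3 | 98.12 | 16.6 | 2031 |
| 1 | Gordon Hayward | CHA | 31 | 29 | 15 | 14 | 34.5 | 111.5 | 112.1 | -0.6 | ... | 2.4 | 11.2 | 6.7 | 8.9 | 51.9 | 57.2 | 20.9 | 103.15 | 10.2 | 2144 |
| 2 | Jonas Valanciunas | NOP | 29 | 29 | 8 | 21 | 31.2 | 106.0 | 111.3 | -5.3 | ... | 10.4 | 27.8 | 18.6 | 10.5 | 55.4 | 59.5 | 24.4 | 97.43 | 15.8 | 1836 |
| 3 | Kelly Oubre Jr. | CHA | 26 | 29 | 15 | 14 | 29.5 | 112.9 | 113.9 | -1.0 | ... | 3.9 | 9.5 | 6.7 | 6.3 | 57.3 | 58.2 | 22.2 | 100.8 | 9.1 | 1791 |
| 4 | Miles Bridges | CHA | 23 | 29 | 15 | 14 | 36.7 | 113.0 | 111.2 | 1.7 | ... | 3.3 | 15.7 | 9.4 | 8.4 | 52.9 | 56.4 | 22.2 | 101.63 | 11.0 | 2253 |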
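Every value scraped from the HTML is text, so columns such as AGE or NETRTG in `stats` hold strings rather than numbers. A minimal sketch of converting them with `pd.to_numeric` is shown below. `convert_numeric_columns` is a hypothetical helper name, and it assumes the column names are unique.

```python
import pandas as pd


def convert_numeric_columns(df):
    """Return a copy of `df` with fully numeric columns converted from text.

    A column is converted only if every value in it parses as a number,
    so text columns such as player or team names are left untouched.
    """
    out = df.copy()
    for col in out.columns:
        # Values that cannot be parsed become NaN instead of raising
        converted = pd.to_numeric(out[col], errors="coerce")
        if converted.notna().all():
            out[col] = converted
    return out
```

Running `stats = convert_numeric_columns(stats)` before any sorting or arithmetic means that values like `'9.5'` and `'24.5'` compare as numbers rather than alphabetically.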


In [48]:
#Export to Excel
stats.to_excel("202122_advanced_stats.xlsx", index=False)