### NBA Stats Web Scraping
This notebook shows how to extract stats from sites such as basketball-reference.com using Python's bs4 (Beautiful Soup) module, load them into a pandas DataFrame, and export the result to Excel.

<b>Source:</b> https://www.basketball-reference.com/leagues/NBA_2022_per_game.html

<b>Beautiful Soup documentation:</b> https://www.crummy.com/software/BeautifulSoup/bs4/doc/

<b>Credits:</b> Nick from the 'Nick's Niche' YouTube channel 
(Full tutorial: https://www.youtube.com/watch?v=LLOJOPXA9PY&list=PLTUcfu017zJCI7ENgSEK2NjwKp_MJocVL&index=4)
Thank you for the comprehensive video series!

In [1]:
#Import all relevant libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import openpyxl   # engine pandas uses to write .xlsx files; must be installed

In [3]:
url = r"https://www.basketball-reference.com/leagues/NBA_2022_per_game.html"
source = requests.get(url)     # gets Page Source
soup = BeautifulSoup(source.content, 'lxml') 
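The request above assumes the download succeeds. As a minimal sketch, the fetch can be wrapped so that a failed request raises an error instead of silently parsing an error page. The helper name `fetch_soup` and the 10-second timeout are illustrative choices, not part of the original tutorial.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it, failing loudly on HTTP errors.

    Hypothetical helper: the name and default timeout are assumptions.
    """
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.content, "lxml")
```

With this in place, `soup = fetch_soup(url)` would stand in for the two lines in the cell above.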

Next, we need to inspect the page source manually to identify which HTML elements to scrape. To do this, right-click the webpage and choose 'Inspect'.

In this case, we want the data in the table, where each row sits inside a 'tr' (table row) tag.
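To make the row structure concrete, here is a small sketch on a made-up two-row table rather than the real page. It assumes the rank cell of each data row is a 'th' rather than a 'td', which is consistent with the rank being absent from the scraped player rows later in this notebook. The HTML, player name, and variable names are hypothetical. It uses `find_all`, the current name for `findAll`, and the built-in `html.parser` so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table standing in for the real page
sample_html = """
<table>
  <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
  <tr><th>1</th><td>Player A</td><td>8.0</td></tr>
</table>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
sample_rows = sample_soup.find_all("tr")

# The first 'tr' holds the column names in 'th' cells
sample_header = [th.get_text() for th in sample_rows[0].find_all("th")]
# Later rows hold the values in 'td' cells; the rank sits in a 'th' and is skipped
sample_values = [td.get_text() for td in sample_rows[1].find_all("td")]

print(sample_header)  # ['Rk', 'Player', 'PTS']
print(sample_values)  # ['Player A', '8.0']
```

Because the 'td' cells leave out the rank, the header list has one more entry than each data row, which is why the Rank column is dropped from the header in the cells that follow.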

In [22]:
#Use the findAll method to find the first 2 'tr' elements. Only for sanity-checking
# soup.findAll('tr', limit=2)

In [7]:
#create the data header list by extracting the values from the 1st 'tr' element of the table.
header = [th.getText() for th in soup.findAll('tr', limit=1)[0].findAll('th')]
#remove the 1st column (Rank), as we don't need it
header = header[1:]

In [11]:
#After creating the header, create a 'rows' variable to store all the other rows, which contain the data for each player.
rows = soup.findAll('tr')[1:]

In [12]:
#Convert the rows into a list of lists, ready to be passed to pd.DataFrame().
player_stats = [[td.getText() for td in row.findAll('td')] for row in rows]
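One thing to watch for: any 'tr' that contains only 'th' cells produces an empty list here. That happens, for example, when a long table repeats its header row partway down. The sketch below uses a hypothetical table to show how such rows could be filtered out. Whether the real page needs this is an assumption to check against your own `player_stats`.

```python
from bs4 import BeautifulSoup

# Hypothetical table with the header row repeated mid-table
sample_html = """
<table>
  <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
  <tr><th>1</th><td>Player A</td><td>8.0</td></tr>
  <tr><th>Rk</th><th>Player</th><th>PTS</th></tr>
  <tr><th>2</th><td>Player B</td><td>6.9</td></tr>
</table>
"""

# Skip the first header row, as in the notebook
sample_rows = BeautifulSoup(sample_html, "html.parser").find_all("tr")[1:]

all_rows = [[td.get_text() for td in row.find_all("td")] for row in sample_rows]
# Keep only rows that produced at least one 'td' value
data_rows = [row for row in all_rows if row]

print(all_rows)   # [['Player A', '8.0'], [], ['Player B', '6.9']]
print(data_rows)  # [['Player A', '8.0'], ['Player B', '6.9']]
```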

In [16]:
#Check the 1st row of player_stats to ensure data for the 1st player displays correctly (sanity-check)
player_stats[0]

['Precious Achiuwa',
 'C',
 '22',
 'TOR',
 '21',
 '17',
 '26.5',
 '3.4',
 '8.8',
 '.386',
 '0.3',
 '1.2',
 '.269',
 '3.0',
 '7.5',
 '.405',
 '.405',
 '0.9',
 '1.7',
 '.543',
 '2.3',
 '5.9',
 '8.2',
 '1.6',
 '0.5',
 '0.6',
 '1.1',
 '2.5',
 '8.0']

In [19]:
#Create the Pandas DF using the above 'player_stats' and 'header' objects, and check first 5 rows.
stats = pd.DataFrame(player_stats, columns = header)
stats.head()

Unnamed: 0,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,FG%,...,FT%,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS
0,Precious Achiuwa,C,22,TOR,21,17,26.5,3.4,8.8,0.386,...,0.543,2.3,5.9,8.2,1.6,0.5,0.6,1.1,2.5,8.0
1,Steven Adams,C,28,MEM,28,28,24.8,2.6,4.9,0.529,...,0.61,3.8,4.8,8.6,2.6,0.8,0.6,1.7,1.6,6.9
2,Bam Adebayo,C,24,MIA,18,18,32.9,7.0,13.5,0.519,...,0.759,2.7,7.4,10.2,3.2,1.1,0.3,2.9,3.3,18.7
3,Santi Aldama,PF,21,MEM,16,0,9.8,1.5,4.1,0.364,...,0.583,1.0,1.6,2.6,0.8,0.1,0.2,0.3,1.1,3.6
4,LaMarcus Aldridge,C,36,BRK,25,8,23.6,6.0,10.4,0.573,...,0.833,1.4,4.3,5.7,0.9,0.4,1.2,0.8,1.7,14.0


Now that we have this data in a Pandas DF format, there are many methods and techniques we can apply to analyze the data thoroughly.
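Note that every scraped value is a string (see the quoted values in the `player_stats[0]` output above), so numeric columns need converting before any arithmetic. A minimal sketch on a hypothetical miniature of the frame follows. The sample rows and the choice of columns are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the scraped frame: every value is a string
sample = pd.DataFrame(
    [["Player A", "C", "22", "8.0"], ["Player B", "PF", "28", "6.9"]],
    columns=["Player", "Pos", "Age", "PTS"],
)

numeric_cols = ["Age", "PTS"]
# errors="coerce" turns anything unparseable (e.g. empty strings) into NaN
sample[numeric_cols] = sample[numeric_cols].apply(pd.to_numeric, errors="coerce")

# Arithmetic now works on the converted columns
print(sample["PTS"].mean())
```

On the real `stats` frame, the same pattern would apply to every column except the text ones such as Player, Pos, and Tm.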

Alternatively, we can export the data to Excel for basic ETL and analysis.

In [21]:
#Export to Excel. The first argument sets the output path and file name.
stats.to_excel("NBA_Stats_2021_2022.xlsx", index=False)

Again, credits go to <b>Nick from 'Nick's Niche'</b> YouTube channel. Please check out his full tutorial: https://www.youtube.com/watch?v=LLOJOPXA9PY&list=PLTUcfu017zJCI7ENgSEK2NjwKp_MJocVL&index=4