Web Scraping NBA Stats With Python: Data Project [Part 1 of 3]

* 3 part Machine Learning project, to try to predict who will be the MVP in a given MBA season and for that we need a lot of data about the players and their statistics.
So in this first part of the project, Web Sraping will be done to collect the various types of data and load them into Pandas to facilitate analysis.

Downloading MVP votes with requests

In [1]:
years = list(range(1991,2022))

In [2]:
url_start = "https://www.basketball-reference.com/awards/awards_{}.html"

In [3]:
import requests

for year in years:
    url = url_start.format(year)
    data = requests.get(url)
    
    with open("mvp/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(data.text)


Parsing the votes table with BeautifulSoup

In [4]:
from bs4 import BeautifulSoup

In [5]:
with open("mvp/1991.html", encoding="utf-8") as f:
    page = f.read()

In [6]:
soup = BeautifulSoup(page, "html.parser")

BeautifulSoup found and remove header

In [7]:
soup.find('tr', class_="over_header").decompose()

MVP table

In [8]:
mvp_table = soup.find(id="mvp")

In [9]:
import pandas as pd

In [10]:
mvp_1991 = pd.read_html(str(mvp_table))[0]

In [11]:
mvp_1991

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,PTS,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48
0,1,Michael Jordan,27,CHI,77.0,891.0,960,0.928,82,37.0,31.5,6.0,5.5,2.7,1.0,0.539,0.312,0.851,20.3,0.321
1,2,Magic Johnson,31,LAL,10.0,497.0,960,0.518,79,37.1,19.4,7.0,12.5,1.3,0.2,0.477,0.32,0.906,15.4,0.251
2,3,David Robinson,25,SAS,6.0,476.0,960,0.496,82,37.7,25.6,13.0,2.5,1.5,3.9,0.552,0.143,0.762,17.0,0.264
3,4,Charles Barkley,27,PHI,2.0,222.0,960,0.231,67,37.3,27.6,10.1,4.2,1.6,0.5,0.57,0.284,0.722,13.4,0.258
4,5,Karl Malone,27,UTA,0.0,142.0,960,0.148,82,40.3,29.0,11.8,3.3,1.1,1.0,0.527,0.286,0.77,15.5,0.225
5,6,Clyde Drexler,28,POR,1.0,75.0,960,0.078,82,34.8,21.5,6.7,6.0,1.8,0.7,0.482,0.319,0.794,12.4,0.209
6,7,Kevin Johnson,24,PHO,0.0,32.0,960,0.033,77,36.0,22.2,3.5,10.1,2.1,0.1,0.516,0.205,0.843,12.7,0.22
7,8,Dominique Wilkins,31,ATL,0.0,29.0,960,0.03,81,38.0,25.9,9.0,3.3,1.5,0.8,0.47,0.341,0.829,11.4,0.177
8,9T,Larry Bird,34,BOS,0.0,25.0,960,0.026,60,38.0,19.4,8.5,7.2,1.8,1.0,0.454,0.389,0.891,6.6,0.14
9,9T,Terry Porter,27,POR,0.0,25.0,960,0.026,81,32.9,17.0,3.5,8.0,2.0,0.1,0.515,0.415,0.823,13.0,0.235


A look at the pages we'll scrape

In [12]:
dfs = []
for year in years:
    with open("mvp/1991.html", encoding="utf-8".format(year)) as f:
        page = f.read()
    
    soup = BeautifulSoup(page, "html.parser")
    soup.find('tr', class_="over_header").decompose()
    mvp_table = soup.find(id="mvp")
    mvp = pd.read_html(str(mvp_table))[0]
    mvp["Year"] = year
    
    dfs.append(mvp)
   
    
   

Combine MVP votes with pandas

In [13]:
mvps = pd.concat(dfs)

Combining data from multiple web pages, we now have a data frame with all MVP votes from every year from 1991 to 2021

In [14]:
mvps.head()

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,...,TRB,AST,STL,BLK,FG%,3P%,FT%,WS,WS/48,Year
0,1,Michael Jordan,27,CHI,77.0,891.0,960,0.928,82,37.0,...,6.0,5.5,2.7,1.0,0.539,0.312,0.851,20.3,0.321,1991
1,2,Magic Johnson,31,LAL,10.0,497.0,960,0.518,79,37.1,...,7.0,12.5,1.3,0.2,0.477,0.32,0.906,15.4,0.251,1991
2,3,David Robinson,25,SAS,6.0,476.0,960,0.496,82,37.7,...,13.0,2.5,1.5,3.9,0.552,0.143,0.762,17.0,0.264,1991
3,4,Charles Barkley,27,PHI,2.0,222.0,960,0.231,67,37.3,...,10.1,4.2,1.6,0.5,0.57,0.284,0.722,13.4,0.258,1991
4,5,Karl Malone,27,UTA,0.0,142.0,960,0.148,82,40.3,...,11.8,3.3,1.1,1.0,0.527,0.286,0.77,15.5,0.225,1991


In [15]:
mvps.to_csv("mvps.csv")

To predict who will be the next MVP of the season, you will need data from all players from 1991 to 2021, not just the ones who won the MVP, so map the votes of and train a machine learning model.

Downloading player stats

In [16]:
player_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_per_game.html"

url = player_stats_url.format(1991)
data = requests.get(url)
with open("player/1991.html", "w+", encoding="utf-8") as f:
    f.write(data.text)

The REQUESTS brings in just the web page as it assumes the web brownser renders and runs JavaScript on the page to download all the lines. To get all the data on the page and work around this problem, a brownser using SELENIUM will be needed.

Using selenium to scrape a Javascript page

Install Selenium Chrome Driver 98.0.4758.48 will allow Python to automate the browser https://chromedriver.chromium.org/downloads

In [17]:
from selenium import webdriver

In [18]:
driver = webdriver.Chrome(executable_path="/users/super/chromedriver")

  driver = webdriver.Chrome(executable_path="/users/super/chromedriver")


A new chrome window was created that is being controlled by selenium so that it is possible to render and get all the necessary lines and the HTML of that page 

In [19]:
import time

year = 1991
url = player_stats_url.format(year)

driver.get(url)
driver.execute_script("window.scrollTo(1,10000)")
time.sleep(2)

html = driver.page_source

In [20]:
with open("player/{}.html".format(year), "w+", encoding="utf-8") as f:
    f.write(html)

And now the file has all the lines and statistics from 1991, because the JavaScript was executed, you must do the same code for the rest of the seasons.

In [21]:
for year in years:
    url = player_stats_url.format(year)

    driver.get(url)
    driver.execute_script("window.scrollTo(1,10000)")
    time.sleep(2)

    html = driver.page_source
    
    with open("player/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(html)

Parsing the stats with BeautifulSoup

In [22]:
df = []
for year in years:
    with open("player/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()

    soup = BeautifulSoup(page, "html.parser")
    soup.find('tr', class_="thead").decompose()
    player_table = soup.find(id="per_game_stats")
    player = pd.read_html(str(player_table))[0]
    player["Year"] = year
    
    dfs.append(player)

Combining player stats with pandas

In [23]:
players = pd.concat(dfs)

We have all players from 1991 to 2021 in the same data table.

In [24]:
players

Unnamed: 0,Rank,Player,Age,Tm,First,Pts Won,Pts Max,Share,G,MP,...,2P,2PA,2P%,eFG%,FT,FTA,ORB,DRB,TOV,PF
0,1,Michael Jordan,27,CHI,77.0,891.0,960.0,0.928,82,37.0,...,,,,,,,,,,
1,2,Magic Johnson,31,LAL,10.0,497.0,960.0,0.518,79,37.1,...,,,,,,,,,,
2,3,David Robinson,25,SAS,6.0,476.0,960.0,0.496,82,37.7,...,,,,,,,,,,
3,4,Charles Barkley,27,PHI,2.0,222.0,960.0,0.231,67,37.3,...,,,,,,,,,,
4,5,Karl Malone,27,UTA,0.0,142.0,960.0,0.148,82,40.3,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
725,,Delon Wright,28,SAC,,,,,27,25.8,...,2.6,5.3,.500,.536,1.1,1.3,1.0,2.9,1.3,1.1
726,,Thaddeus Young,32,CHI,,,,,68,24.3,...,5.3,9.1,.580,.568,1.0,1.7,2.5,3.8,2.0,2.2
727,,Trae Young,22,ATL,,,,,63,33.7,...,5.6,11.3,.491,.499,7.7,8.7,0.6,3.3,4.1,1.8
728,,Cody Zeller,28,CHO,,,,,48,20.9,...,3.7,6.2,.598,.565,1.8,2.5,2.5,4.4,1.1,2.5


In [25]:
players.to_csv("players.csv")

Downloading team data

The time records by year are very important so the machine learning model can use it as a predictor 

In [26]:
team_stats_url = "https://www.basketball-reference.com/leagues/NBA_{}_standings.html"

Obtemos os registros por ano usando request.get é muito mais rápido

In [27]:

for year in years:
    url = team_stats_url.format(year)

    data = requests.get(url)

    with open("team/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(data.text)

Parsing the team data with BeautifulSoup

In [28]:
dfs = []
for year in years:
    with open("team/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()
    
    soup = BeautifulSoup(page, "html.parser")
    soup.find("tr", class_="thead").decompose()
    team_table = soup.find_all(id="divs_standings_E")
    team = pd.read_html(str(team_table))[0]
    team["Year"] = year
    team["Team"] = team["Eastern Conference"]
    del team["Eastern Conference"]
    dfs.append(team)

    team_table = soup.find_all(id="divs_standings_W")
    team = pd.read_html(str(team_table))[0]
    team["Year"] = year
    team["Team"] = team["Western Conference"]    
    del team["Western Conference"]
    dfs.append(team)


Combining team stats with pandas

In [29]:
teams = pd.concat(dfs)

Analysis of all teams HTML files from 1991 to 2021 and a single table

In [30]:
teams.tail()

Unnamed: 0,W,L,W/L%,GB,PS/G,PA/G,SRS,Year,Team
13,42,30,0.583,—,112.4,110.2,2.26,2021,Dallas Mavericks*
14,38,34,0.528,4.0,113.3,112.3,1.07,2021,Memphis Grizzlies*
15,33,39,0.458,9.0,111.1,112.8,-1.58,2021,San Antonio Spurs
16,31,41,0.431,11.0,114.6,114.9,-0.2,2021,New Orleans Pelicans
17,17,55,0.236,25.0,108.8,116.7,-7.5,2021,Houston Rockets


In [31]:
teams.head()

Unnamed: 0,W,L,W/L%,GB,PS/G,PA/G,SRS,Year,Team
0,56,26,0.683,—,111.5,105.7,5.22,1991,Boston Celtics*
1,44,38,0.537,12.0,105.4,105.6,-0.39,1991,Philadelphia 76ers*
2,39,43,0.476,17.0,103.1,103.3,-0.43,1991,New York Knicks*
3,30,52,0.366,26.0,101.4,106.4,-4.84,1991,Washington Bullets
4,26,56,0.317,30.0,102.9,107.5,-4.53,1991,New Jersey Nets


In [32]:
teams.to_csv("teams.csv")

* Web Scraping with REQUESTS library;
* Analysis with BeautifulSoup and Pandas;
* Use of SELENIUM for advanced Web Scraping, initializing the Web Driver, executing the Java Script and getting the rendered HTML;
* Combining all data into a single Pandas data frame and writing it to CSV for players, teams and mvp's;
* All HTML files archived and saved (in case we want to go back and process them in a different way)

Next steps: Prepare the data for Machine Learning. So getting the predictors right and making sure you clean the data properly.