# Predicting All-NBA Team and Player Salaries - Data Acquisition
---

This notebook launches our project by webscraping National Basketball Association (NBA) player and team data on salaries and statistics (total, per-game, and advanced) from several basketball sites. Most of this data comes from the popular statistics website ```Basketball-Reference```, while the rest is sourced from ```HoopsHype``` and ```Wikipedia```. Additional data from Kaggle datasets may be used later as needed.

Using the webscraping tools ```BeautifulSoup``` and ```Selenium```, we start by saving a snapshot of each HTML page housing our desired statistics. We then inspect those pages and pull the tables of interest into CSVs for further data cleaning and feature engineering. Throughout this notebook we add sleep times (rate limits) between data-pulling requests. This helps prevent server overload and potential IP blocking, and keeps the webscraping respectful and responsible.
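
The snapshot-then-parse workflow with rate limiting can be sketched as a small helper. This is a minimal sketch, not the code used later in this notebook: `save_snapshots`, its parameters, and the injected `fetch` callable are hypothetical names introduced for illustration. `fetch` is assumed to return a `(status_code, html_text)` pair, so a thin wrapper around `requests.get` that returns `(res.status_code, res.text)` would fit.

```python
import random
import time
from pathlib import Path


def save_snapshots(pages, out_dir, fetch, min_wait=4.0, max_wait=6.0, sleep=time.sleep):
    """Save one HTML snapshot per page, pausing between requests.

    pages: dict mapping a file stem (e.g. a year) to a URL.
    fetch: hypothetical callable returning (status_code, html_text).
    Returns the list of file paths that were written.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, url in pages.items():
        status, html = fetch(url)
        if 200 <= status < 300:  # keep only successful responses
            path = out_dir / f"{name}.html"
            path.write_text(html, encoding="utf-8")
            written.append(path)
        else:
            print(f"Skipping {name}: HTTP {status}")
        # Random pause after every request, whether or not it succeeded
        sleep(random.uniform(min_wait, max_wait))
    return written
```

Passing `sleep` as a parameter is only a convenience for trying the helper without real delays; in normal use the default `time.sleep` applies.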

See the end of this notebook for a list of sources and tutorials used in developing our webscraping script. 

Further detailed notebooks on the various segments of this project can be found at the following: 
- [02_Data_Cleaning_and_EDA](./02_Data_Cleaning_and_EDA)
- [03_Data_Modeling_I](./03_Data_Modeling_I)
- [04_Data_Modeling_II](./04_Data_Modeling_II)

For more information on the background, a summary of methods, and findings, please see the associated [README](../README.md) for this analysis. 

### Contents
- [1. Salary Data](#1.-Salary-Data)
    - [I. Player Salary](#I.-Player-Salary)
    - [II. Team Payroll](#II.-Team-Payroll)
    - [III. Salary Caps](#III.-Salary-Caps)
- [2. Statistics Data](#2.-Statistics-Data)
    - [IV. Player Statistics](#IV.-Player-Statistics)
    - [V. All-NBA Team Winners and Nominees](#V.-All-NBA-Team-Winners-and-Nominees)
- [3. Additional Performance-Related Data](#3.-Additional-Performance-Related-Data)
    - [VI. Team Rankings](#VI.-Team-Rankings)
    - [VII. All-Star Appearance](#VII.-All-Star-Appearance)

In [39]:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
import time

In [2]:
# pip install selenium

### Define Years of Interest
##### To be periodically updated to include latest available seasons. This will enable the model to remain robust and reflective of the most recent trends and patterns in the NBA.

Note: The year variable may need to be adjusted by adding 1 when called, to match the season a given site refers to. We treat each year as the season _start_ date (e.g., 1991 refers to the 1991-1992 season). Some sites, such as the one showing team rankings, label seasons by the _end_ date (e.g., 1991 refers to the 1990-1991 season). We adjust the year variable accordingly in our loops.
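
The start-year convention can be made explicit with two small helpers. This is an illustrative sketch only; `season_label` and `season_end_year` are hypothetical names that do not appear elsewhere in this notebook.

```python
def season_label(start_year: int) -> str:
    """Return the full season label, e.g. 1991 -> '1991-1992'."""
    return f"{start_year}-{start_year + 1}"


def season_end_year(start_year: int) -> int:
    """Return the year in which the season ends (used by sites that label seasons by end date)."""
    return start_year + 1


years = list(range(1990, 2023))
print(season_label(years[0]), season_end_year(years[0]))    # 1990-1991 1991
print(season_label(years[-1]), season_end_year(years[-1]))  # 2022-2023 2023
```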

In [28]:
years = list(range(1990,2023)) # stopping at 2022 which will be the 2022-23 season

### Selenium Setup

##### Several of the following sections ([Player Statistics](#IV.-Player-Statistics), [Salary Caps](#III.-Salary-Caps), [All-NBA Team Winners](#V.-All-NBA-Team-Winners-and-Nominees)) use the [Selenium](https://selenium-python.readthedocs.io/index.html) Python package to webscrape dynamic tables, which render their rows using JavaScript after the page has loaded. To ensure smooth execution, follow these steps if you wish to run the provided code:

1. Install Selenium with ```pip install selenium```
<br></br>
2. Choose the web browser you intend to run this code on (different browsers require different drivers) and ensure that you have the latest version of the browser installed.
<br></br>
3. Download, extract, and save [Selenium webdriver](https://selenium-python.readthedocs.io/installation.html#drivers) for your chosen browser to the current ```code``` folder housing this notebook.
    1. Perform the download, extract, save
    2. Operationalize webdriver
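        1. <u>Method 1</u>: Specify the location of the driver in the cell below, for example: ```driver = webdriver.Edge(executable_path=r'c:\User-Path\msedgedriver.exe')```. The ```r``` prefix keeps the backslashes in the path from being read as escape sequences. Note that recent Selenium 4 releases expect the path to be passed through a ```Service``` object rather than ```executable_path```. Replace the following: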
            - ```Edge``` with the browser you installed the driver for, 
            - ```User-Path``` with the path where driver is saved, and 
            - ```msedgedriver.exe```  with the name of the extracted driver
        2. <u>Method 2</u>: To **bypass** having to perform Method 1 with each use of the webdriver, consider adding the extracted webdriver to the Path variable on your machine (accessible via Advanced System Settings on your computer). In this case, do not update the ```driver = webdriver.Edge()``` code below, except for changing ```Edge``` to your preferred browser.
<br></br>
4. As new versions of browsers/webdrivers become available, **re-perform** Steps 2 & 3 above.

By following these steps, you'll ensure that the provided code continues to work smoothly as you access the latest player statistics using Selenium.
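
The two methods above can be sketched in one helper written against the Selenium 4 `Service` interface. This is a hedged sketch under stated assumptions: `make_driver` and `service_module_name` are hypothetical names, and the module path pattern `selenium.webdriver.<browser>.service` is assumed to hold for the browser you choose (it does for Edge, Chrome, and Firefox).

```python
import importlib


def service_module_name(browser):
    """Return the module that holds the Service class for a browser name such as 'Edge'."""
    return f"selenium.webdriver.{browser.lower()}.service"


def make_driver(browser="Edge", driver_path=None):
    """Create a WebDriver, optionally pointing at a manually downloaded driver file."""
    from selenium import webdriver  # imported here so the helper above works without Selenium

    driver_cls = getattr(webdriver, browser)  # e.g. webdriver.Edge
    if driver_path is None:
        # Method 2: the driver is discoverable, for example via the Path variable
        return driver_cls()
    # Method 1: pass the explicit driver location through a Service object
    service_cls = importlib.import_module(service_module_name(browser)).Service
    return driver_cls(service=service_cls(executable_path=driver_path))
```

For example, `make_driver("Edge", r"c:\User-Path\msedgedriver.exe")` would correspond to Method 1, and `make_driver("Edge")` to Method 2.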

In [29]:
driver = webdriver.Edge()

---

## **1. Salary Data**

## I. Player Salary

In [5]:
for year in years:
    # HoopsHype labels each season as "<start year>-<end year>"
    url = f'https://hoopshype.com/salaries/players/{year}-{year+1}/'
    res = requests.get(url)
    if res.status_code == 200:  # save the page only when the request succeeded
        with open("webscraping/salary/{}.html".format(year), "w+", encoding="utf-8") as file:
            file.write(res.text)
            lag = np.random.uniform(4,6)
            print(f'Finished writing {year}, waiting ... {round(lag,2)}')
            time.sleep(lag)

Finished writing 1990, waiting ... 4.41
Finished writing 1991, waiting ... 4.51
Finished writing 1992, waiting ... 4.7
Finished writing 1993, waiting ... 4.89
Finished writing 1994, waiting ... 4.42
Finished writing 1995, waiting ... 5.23
Finished writing 1996, waiting ... 5.55
Finished writing 1997, waiting ... 5.84
Finished writing 1998, waiting ... 4.52
Finished writing 1999, waiting ... 5.51
Finished writing 2000, waiting ... 4.33
Finished writing 2001, waiting ... 4.13
Finished writing 2002, waiting ... 5.88
Finished writing 2003, waiting ... 5.03
Finished writing 2004, waiting ... 4.98
Finished writing 2005, waiting ... 4.95
Finished writing 2006, waiting ... 5.56
Finished writing 2007, waiting ... 5.52
Finished writing 2008, waiting ... 5.02
Finished writing 2009, waiting ... 4.79
Finished writing 2010, waiting ... 5.49
Finished writing 2011, waiting ... 5.24
Finished writing 2012, waiting ... 4.46
Finished writing 2013, waiting ... 5.25
Finished writing 2014, waiting ... 4.78
F

In [6]:
sal_all = []

for year in years:
    with open('webscraping/salary/{}.html'.format(year), encoding="utf-8") as file:
        page = file.read()
    soup = BeautifulSoup(page, 'html.parser')
    salary_table = soup.find(class_="hh-salaries-ranking-table")
    salaries1 = pd.read_html(str(salary_table))[0].drop(columns="Unnamed: 0")    
    salaries1['Year'] = year
    salaries1.rename(columns={salaries1.columns.tolist()[1]: 'Salary',
                              salaries1.columns.tolist()[2]: 'Salary_Adj'}, inplace=True)
    
    print(f'Finished scraping {year}')
    sal_all.append(salaries1)    

Finished scraping 1990
Finished scraping 1991
Finished scraping 1992
Finished scraping 1993
Finished scraping 1994
Finished scraping 1995
Finished scraping 1996
Finished scraping 1997
Finished scraping 1998
Finished scraping 1999
Finished scraping 2000
Finished scraping 2001
Finished scraping 2002
Finished scraping 2003
Finished scraping 2004
Finished scraping 2005
Finished scraping 2006
Finished scraping 2007
Finished scraping 2008
Finished scraping 2009
Finished scraping 2010
Finished scraping 2011
Finished scraping 2012
Finished scraping 2013
Finished scraping 2014
Finished scraping 2015
Finished scraping 2016
Finished scraping 2017
Finished scraping 2018
Finished scraping 2019
Finished scraping 2020
Finished scraping 2021
Finished scraping 2022


In [7]:
salaries = pd.concat(sal_all)
salaries.to_csv('../data/salaries.csv', index=False)
print(salaries.shape)
print(f'{salaries.Year.min()}-{salaries.Year.max()}')

(15778, 4)
1990-2022


## II. Team Payroll

In [8]:
for year in years:
    # HoopsHype labels each season as "<start year>-<end year>"
    url = f'https://hoopshype.com/salaries/{year}-{year+1}/'
    res = requests.get(url)
    if res.status_code == 200:  # save the page only when the request succeeded
        with open("webscraping/team/payroll/{}.html".format(year), "w+", encoding="utf-8") as file:
            file.write(res.text)
            lag = np.random.uniform(4,6)
            print(f'Finished writing {year}, waiting ... {round(lag,2)}')
            time.sleep(lag)

Finished writing 1990, waiting ... 4.38
Finished writing 1991, waiting ... 4.23
Finished writing 1992, waiting ... 4.18
Finished writing 1993, waiting ... 4.55
Finished writing 1994, waiting ... 5.15
Finished writing 1995, waiting ... 5.51
Finished writing 1996, waiting ... 4.64
Finished writing 1997, waiting ... 4.82
Finished writing 1998, waiting ... 5.55
Finished writing 1999, waiting ... 5.94
Finished writing 2000, waiting ... 5.53
Finished writing 2001, waiting ... 4.9
Finished writing 2002, waiting ... 5.1
Finished writing 2003, waiting ... 5.94
Finished writing 2004, waiting ... 4.99
Finished writing 2005, waiting ... 5.06
Finished writing 2006, waiting ... 4.94
Finished writing 2007, waiting ... 4.46
Finished writing 2008, waiting ... 4.46
Finished writing 2009, waiting ... 4.84
Finished writing 2010, waiting ... 5.64
Finished writing 2011, waiting ... 5.63
Finished writing 2012, waiting ... 4.66
Finished writing 2013, waiting ... 4.58
Finished writing 2014, waiting ... 4.39
Fi

In [9]:
teamsal_all = []

for year in years:
    with open('webscraping/team/payroll/{}.html'.format(year), encoding="utf-8") as file:
        page = file.read()
    soup = BeautifulSoup(page, 'html.parser')
    teamsalary_table = soup.find(class_="hh-salaries-ranking-table")
    teamsalaries1 = pd.read_html(str(teamsalary_table))[0].drop(columns="Unnamed: 0")    
    teamsalaries1['Year'] = year
    teamsalaries1.rename(columns={teamsalaries1.columns.tolist()[1]: 'Payroll',
                              teamsalaries1.columns.tolist()[2]: 'Payroll_Adj'}, inplace=True) # Adjusted for 2022-2023 dollars
    
    print(f'Finished scraping {year}')
    teamsal_all.append(teamsalaries1)    

Finished scraping 1990
Finished scraping 1991
Finished scraping 1992
Finished scraping 1993
Finished scraping 1994
Finished scraping 1995
Finished scraping 1996
Finished scraping 1997
Finished scraping 1998
Finished scraping 1999
Finished scraping 2000
Finished scraping 2001
Finished scraping 2002
Finished scraping 2003
Finished scraping 2004
Finished scraping 2005
Finished scraping 2006
Finished scraping 2007
Finished scraping 2008
Finished scraping 2009
Finished scraping 2010
Finished scraping 2011
Finished scraping 2012
Finished scraping 2013
Finished scraping 2014
Finished scraping 2015
Finished scraping 2016
Finished scraping 2017
Finished scraping 2018
Finished scraping 2019
Finished scraping 2020
Finished scraping 2021
Finished scraping 2022


In [10]:
teamsalaries = pd.concat(teamsal_all)
teamsalaries.to_csv('../data/team_payroll.csv', index=False)
print(teamsalaries.shape)
print(f'{teamsalaries.Year.min()}-{teamsalaries.Year.max()}')

(966, 4)
1990-2022


## III. Salary Caps

In [11]:
url = 'https://www.basketball-reference.com/contracts/salary-cap-history.html'
driver.get(url)
driver.execute_script('window.scrollTo(1,10000)')
html = driver.page_source

with open('webscraping/salary/salarycap/salarycap.html', "w+", encoding="utf-8") as file:
    file.write(html)

In [12]:
with open('webscraping/salary/salarycap/salarycap.html', encoding="utf-8") as file:
    page = file.read()
soup = BeautifulSoup(page, 'html.parser')
table = soup.find('table', {'id' : 'salary_cap_history'})
salarycap = pd.read_html(str(table))[0] 

In [13]:
salarycap.to_csv('../data/salarycap.csv', index=False)
salarycap  # placed last so the notebook displays the table

---

## **2. Statistics Data**

## IV. Player Statistics
##### We will be scraping 3 types of player statistics: per-game, totals, and advanced stats.
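
Because three stat types are scraped for every season, the full set of requests is the Cartesian product of the two lists. The sketch below enumerates each (stat, year) pair with its URL and snapshot path, following the URL pattern and folder layout used in this section; `build_jobs` is a hypothetical helper name.

```python
from itertools import product

BASE = "https://www.basketball-reference.com/leagues"


def build_jobs(stats, years):
    """Return (stat, year, url, path) tuples for every stat type and season start year."""
    jobs = []
    for stat, year in product(stats, years):
        # The site labels seasons by end year, so add 1 to the start year
        url = f"{BASE}/NBA_{year + 1}_{stat}.html"
        path = f"webscraping/players/{stat}/{year}.html"
        jobs.append((stat, year, url, path))
    return jobs


jobs = build_jobs(["per_game", "totals", "advanced"], range(1990, 2023))
print(len(jobs))  # 99 = 3 stat types x 33 seasons
print(jobs[0][2])
```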

In [14]:
stats = ['per_game', 'totals', 'advanced']

In [15]:
for stat in stats:
    for year in years:
        url = f'https://www.basketball-reference.com/leagues/NBA_{year+1}_{stat}.html' # +1 here because site is determining year by the season end date and our list is defining it by season start
        driver.get(url)
        driver.execute_script('window.scrollTo(1,10000)')
        html = driver.page_source

        with open(f'webscraping/players/{stat}/{year}.html', "w+", encoding="utf-8") as file:
            file.write(html)
        lag = np.random.uniform(4,6)
        print(f'Finished writing {year} {stat} stats, waiting ... {round(lag,2)}')
        time.sleep(lag)

Finished writing 1990 per_game stats, waiting ... 4.97
Finished writing 1991 per_game stats, waiting ... 5.1
Finished writing 1992 per_game stats, waiting ... 5.47
Finished writing 1993 per_game stats, waiting ... 5.89
Finished writing 1994 per_game stats, waiting ... 5.83
Finished writing 1995 per_game stats, waiting ... 4.48
Finished writing 1996 per_game stats, waiting ... 5.8
Finished writing 1997 per_game stats, waiting ... 5.63
Finished writing 1998 per_game stats, waiting ... 5.02
Finished writing 1999 per_game stats, waiting ... 5.92
Finished writing 2000 per_game stats, waiting ... 4.75
Finished writing 2001 per_game stats, waiting ... 5.17
Finished writing 2002 per_game stats, waiting ... 4.07
Finished writing 2003 per_game stats, waiting ... 4.89
Finished writing 2004 per_game stats, waiting ... 4.09
Finished writing 2005 per_game stats, waiting ... 5.17
Finished writing 2006 per_game stats, waiting ... 5.75
Finished writing 2007 per_game stats, waiting ... 4.25
Finished wri

In [16]:
pg_all = []
tot_all = []
adv_all = []

for stat in stats:
    for year in years:  
        with open(f'webscraping/players/{stat}/{year}.html', encoding="utf-8") as file:
            page = file.read()
        soup = BeautifulSoup(page, 'html.parser')
        table = soup.find('table', {'id' : f'{stat}_stats'})
        # Drop the first repeated in-table header row (class 'thead'); any later ones remain in the parsed table
        soup.find('tr', class_ = 'thead').decompose()
        df = pd.read_html(str(table))[0] 
        df['Year'] = year
        df['Stat'] = stat
        
        print(f'Finished scraping {year} {stat} stats')
        
        # Route each table to the list for its stat type
        if stat == 'per_game':
            pg_all.append(df)
        elif stat == 'totals':
            tot_all.append(df)
        elif stat == 'advanced':
            adv_all.append(df)

Finished scraping 1990 per_game stats
Finished scraping 1991 per_game stats
Finished scraping 1992 per_game stats
Finished scraping 1993 per_game stats
Finished scraping 1994 per_game stats
Finished scraping 1995 per_game stats
Finished scraping 1996 per_game stats
Finished scraping 1997 per_game stats
Finished scraping 1998 per_game stats
Finished scraping 1999 per_game stats
Finished scraping 2000 per_game stats
Finished scraping 2001 per_game stats
Finished scraping 2002 per_game stats
Finished scraping 2003 per_game stats
Finished scraping 2004 per_game stats
Finished scraping 2005 per_game stats
Finished scraping 2006 per_game stats
Finished scraping 2007 per_game stats
Finished scraping 2008 per_game stats
Finished scraping 2009 per_game stats
Finished scraping 2010 per_game stats
Finished scraping 2011 per_game stats
Finished scraping 2012 per_game stats
Finished scraping 2013 per_game stats
Finished scraping 2014 per_game stats
Finished scraping 2015 per_game stats
Finished scr
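
As an alternative to three separate lists and an `if` chain, the parsed tables could be collected in a dictionary keyed by stat type. The sketch below uses placeholder strings in place of DataFrames so that it runs on its own; `collect_by_stat` and `parse` are hypothetical names.

```python
from collections import defaultdict


def collect_by_stat(stats, years, parse):
    """Group parsed results by stat type.

    parse: hypothetical callable taking (stat, year) and returning the parsed table.
    """
    collected = defaultdict(list)
    for stat in stats:
        for year in years:
            collected[stat].append(parse(stat, year))
    return dict(collected)


# Placeholder parser standing in for the BeautifulSoup / pandas step
demo = collect_by_stat(["per_game", "totals"], [1990, 1991], lambda s, y: f"{s}-{y}")
print(demo["totals"])  # ['totals-1990', 'totals-1991']
```

With real DataFrames, each list in the dictionary could then be passed to `pd.concat` and written to its own CSV.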

In [17]:
b0 ='\033[1m'
b1 = '\033[0;0m'

In [18]:
pg = pd.concat(pg_all)
pg.to_csv('../data/per_game_data.csv', index=False)
print(f' --- {b0}Per Game{b1} ---')
print(pg.shape)
print(f'{pg.Year.min()}-{pg.Year.max()}')

tot = pd.concat(tot_all)
tot.to_csv('../data/totals_data.csv', index=False)
print(f'\n --- {b0}Totals{b1} ---')
print(tot.shape)
print(f'{tot.Year.min()}-{tot.Year.max()}')

adv = pd.concat(adv_all)
adv.to_csv('../data/advanced_data.csv', index=False)
print(f'\n --- {b0}Advanced{b1} ---')
print(adv.shape)
print(f'{adv.Year.min()}-{adv.Year.max()}')

 --- Per Game ---
(19589, 32)
1990-2022

 --- Totals ---
(19589, 32)
1990-2022

 --- Advanced ---
(19589, 31)
1990-2022


In [103]:
# PREVIEW TABLES
pg

Unnamed: 0,Rk,Player,Pos,Age,Tm,G,GS,MP,FG,FGA,...,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,Year,Stat
0,1,Mark Acres,C,27,ORL,80,50,21.1,1.7,3.6,...,3.5,5.4,0.8,0.5,0.3,0.9,3.1,4.5,1990,per_game
1,2,Michael Adams,PG,27,DEN,79,74,34.1,5.0,12.5,...,2.2,2.8,6.3,1.5,0.0,1.8,1.7,15.5,1990,per_game
2,3,Mark Aguirre,SF,30,DET,78,40,25.7,5.6,11.5,...,2.4,3.9,1.9,0.4,0.2,1.6,2.6,14.1,1990,per_game
3,4,Danny Ainge,PG,30,SAC,75,68,36.4,6.7,15.4,...,3.4,4.3,6.0,1.5,0.2,2.5,3.2,17.9,1990,per_game
4,5,Mark Alarie,PF,26,WSB,82,10,23.1,4.5,9.6,...,2.7,4.6,1.7,0.7,0.5,1.2,2.7,10.5,1990,per_game
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
836,601,Thaddeus Young,PF,33,TOR,26,0,18.3,2.6,5.5,...,2.9,4.4,1.7,1.2,0.4,0.8,1.7,6.3,2022,per_game
837,602,Trae Young,PG,23,ATL,76,76,34.9,9.4,20.3,...,3.1,3.7,9.7,0.9,0.1,4.0,1.7,28.4,2022,per_game
838,603,Omer Yurtseven,C,23,MIA,56,12,12.6,2.3,4.4,...,3.7,5.3,0.9,0.3,0.4,0.7,1.5,5.3,2022,per_game
839,604,Cody Zeller,C,29,POR,27,0,13.1,1.9,3.3,...,2.8,4.6,0.8,0.3,0.2,0.7,2.1,5.2,2022,per_game


In [None]:
tot

In [None]:
adv

## V. All-NBA Team Winners and Nominees

In [19]:
for year in years:
    url = f'https://www.basketball-reference.com/awards/awards_{year+1}.html'
    driver.get(url)
    driver.execute_script('window.scrollTo(1,10000)') # Assures entire page is saved for scraping
    html = driver.page_source

    with open(f'webscraping/players/all_team/{year}.html', "w+", encoding="utf-8") as file:
        file.write(html)
    lag = np.random.uniform(4,6)
    print(f'Finished writing {year}, waiting ... {round(lag,2)}')
    time.sleep(lag)

Finished writing 1990, waiting ... 4.35
Finished writing 1991, waiting ... 4.18
Finished writing 1992, waiting ... 4.79
Finished writing 1993, waiting ... 4.55
Finished writing 1994, waiting ... 5.32
Finished writing 1995, waiting ... 4.03
Finished writing 1996, waiting ... 4.48
Finished writing 1997, waiting ... 5.78
Finished writing 1998, waiting ... 4.9
Finished writing 1999, waiting ... 5.96
Finished writing 2000, waiting ... 5.24
Finished writing 2001, waiting ... 5.64
Finished writing 2002, waiting ... 5.36
Finished writing 2003, waiting ... 5.01
Finished writing 2004, waiting ... 5.39
Finished writing 2005, waiting ... 4.33
Finished writing 2006, waiting ... 5.82
Finished writing 2007, waiting ... 5.68
Finished writing 2008, waiting ... 4.42
Finished writing 2009, waiting ... 5.07
Finished writing 2010, waiting ... 4.95
Finished writing 2011, waiting ... 4.66
Finished writing 2012, waiting ... 5.79
Finished writing 2013, waiting ... 5.45
Finished writing 2014, waiting ... 5.0
Fi

In [20]:
all = []

for year in years:  
    with open(f'webscraping/players/all_team/{year}.html', encoding="utf-8") as file:
        page = file.read()
    soup = BeautifulSoup(page, 'html.parser')
    table = soup.find('table', {'id' : 'leading_all_nba'})
    # Separator row ids differ between seasons: try one set first, then fall back to the other.
    # find() returns None for a missing row, so calling .decompose() on it raises AttributeError.
    try:
        soup.find('tr', class_ = 'over_header').decompose()
        soup.find('tr', {'id' : 'start_2nd'}).decompose()
        soup.find('tr', {'id' : 'start_3rd'}).decompose()
        soup.find('tr', {'id' : 'start_ORV'}).decompose()
        soup.find('div', class_ = 'topscroll_div').decompose()
    except AttributeError:
        soup.find('tr', class_ = 'over_header').decompose()
        soup.find('tr', {'id' : 'start_1T'}).decompose()
        soup.find('tr', {'id' : 'start_2T'}).decompose()
        soup.find('tr', {'id' : 'start_3T'}).decompose()
        soup.find('tr', {'id' : 'start_ORV'}).decompose()
        soup.find('div', class_ = 'topscroll_div').decompose()
               
    df = pd.read_html(str(table), header=1)[0] 
    df['Year'] = year   
    
    print(f'Finished scraping {year}')
    all.append(df)    

Finished scraping 1990
Finished scraping 1991
Finished scraping 1992
Finished scraping 1993
Finished scraping 1994
Finished scraping 1995
Finished scraping 1996
Finished scraping 1997
Finished scraping 1998
Finished scraping 1999
Finished scraping 2000
Finished scraping 2001
Finished scraping 2002
Finished scraping 2003
Finished scraping 2004
Finished scraping 2005
Finished scraping 2006
Finished scraping 2007
Finished scraping 2008
Finished scraping 2009
Finished scraping 2010
Finished scraping 2011
Finished scraping 2012
Finished scraping 2013
Finished scraping 2014
Finished scraping 2015
Finished scraping 2016
Finished scraping 2017
Finished scraping 2018
Finished scraping 2019
Finished scraping 2020
Finished scraping 2021
Finished scraping 2022
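
The row-removal step in the cell above needs a `try`/`except` because `find` returns `None` when a row id is absent from a page. A hedged alternative is a helper that removes an element only when it exists. The sketch below is written against any object exposing `find` and `decompose` the way BeautifulSoup does; `remove_if_present` is a hypothetical name, and the small stand-in classes exist only so the example runs without a saved page.

```python
def remove_if_present(node, name, **attrs):
    """Remove the first matching element under `node`; return True if one was removed."""
    match = node.find(name, **attrs)
    if match is None:
        return False
    match.decompose()
    return True


# Minimal stand-ins that mimic the two BeautifulSoup methods used above
class FakeRow:
    def __init__(self, parent, key):
        self.parent, self.key = parent, key

    def decompose(self):
        self.parent.rows.discard(self.key)


class FakeTable:
    def __init__(self, row_ids):
        self.rows = set(row_ids)

    def find(self, name, **attrs):
        key = attrs.get("id")
        return FakeRow(self, key) if key in self.rows else None


table = FakeTable({"start_2nd", "start_3rd"})
for row_id in ["start_1T", "start_2nd", "start_3rd", "start_ORV"]:
    print(row_id, remove_if_present(table, "tr", id=row_id))
```

With a real page, a call such as `remove_if_present(table, 'tr', id='start_2nd')` would also scope the search to the All-NBA table rather than the whole document.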


In [21]:
all_nba = pd.concat(all)
all_nba.to_csv('../data/all_nba_teams.csv', index=False)
print(all_nba.shape)
print(f'{all_nba.Year.min()}-{all_nba.Year.max()}')

(1354, 24)
1990-2022


---

## **3. Additional Performance-Related Data**

## VI. Team Rankings

In [22]:
for year in years:
    url = f'https://www.basketball-reference.com/leagues/NBA_{year+1}_ratings.html' # +1 here because site is determining year by the season end date and our list is defining it by season start
    res = requests.get(url)
    if res.status_code == 200:  # save the page only when the request succeeded
        with open("webscraping/team/rank/{}.html".format(year), "w+", encoding="utf-8") as file:
            file.write(res.text)
            lag = np.random.uniform(4,6)
            print(f'Finished writing {year}, waiting ... {round(lag,2)}')
            time.sleep(lag)

Finished writing 1990, waiting ... 4.78
Finished writing 1991, waiting ... 5.6
Finished writing 1992, waiting ... 5.82
Finished writing 1993, waiting ... 5.16
Finished writing 1994, waiting ... 5.99
Finished writing 1995, waiting ... 4.39
Finished writing 1996, waiting ... 4.64
Finished writing 1997, waiting ... 5.85
Finished writing 1998, waiting ... 5.4
Finished writing 1999, waiting ... 4.9
Finished writing 2000, waiting ... 4.39
Finished writing 2001, waiting ... 5.45
Finished writing 2002, waiting ... 5.52
Finished writing 2003, waiting ... 4.18
Finished writing 2004, waiting ... 4.02
Finished writing 2005, waiting ... 4.5
Finished writing 2006, waiting ... 4.26
Finished writing 2007, waiting ... 5.86
Finished writing 2008, waiting ... 4.35
Finished writing 2009, waiting ... 4.11
Finished writing 2010, waiting ... 5.33
Finished writing 2011, waiting ... 5.24
Finished writing 2012, waiting ... 5.15
Finished writing 2013, waiting ... 4.09
Finished writing 2014, waiting ... 4.92
Fini

In [23]:
all = []

for year in years:
    with open('webscraping/team/rank/{}.html'.format(year), encoding="utf-8") as file:
        page = file.read()
    soup = BeautifulSoup(page, 'html.parser')
    table = soup.find('table',  {'id' : 'ratings'})
    soup.find('tr', class_ = 'over_header').decompose()
    df = pd.read_html(str(table))[0]  
    df['Year'] = year
    
    print(f'Finished scraping {year}')
    all.append(df)    

Finished scraping 1990
Finished scraping 1991
Finished scraping 1992
Finished scraping 1993
Finished scraping 1994
Finished scraping 1995
Finished scraping 1996
Finished scraping 1997
Finished scraping 1998
Finished scraping 1999
Finished scraping 2000
Finished scraping 2001
Finished scraping 2002
Finished scraping 2003
Finished scraping 2004
Finished scraping 2005
Finished scraping 2006
Finished scraping 2007
Finished scraping 2008
Finished scraping 2009
Finished scraping 2010
Finished scraping 2011
Finished scraping 2012
Finished scraping 2013
Finished scraping 2014
Finished scraping 2015
Finished scraping 2016
Finished scraping 2017
Finished scraping 2018
Finished scraping 2019
Finished scraping 2020
Finished scraping 2021
Finished scraping 2022


In [24]:
team_rank = pd.concat(all)
team_rank.to_csv('../data/team_rank.csv', index=False)
print(team_rank.shape)
print(f'{team_rank.Year.min()}-{team_rank.Year.max()}')

(966, 16)
1990-2022


## VII. All-Star Appearance

In [25]:
url = 'https://en.wikipedia.org/wiki/List_of_NBA_All-Stars'
res = requests.get(url)
if res.status_code == 200:  # proceed only when the request succeeded
    soup = BeautifulSoup(res.content, 'lxml')
    table = soup.find('table', class_="wikitable sortable")
    
    as_appearance = pd.read_html(str(table))[0]
    # Saved inside the success branch so a failed request does not raise a NameError
    as_appearance.to_csv('../data/all_star_appearances.csv', index=False)

## **4. Misc Data**

## VIII. NBA Draft Year

In [43]:
url = 'https://www.nba.com/stats/draft/history' #if there were escape characters, could have specified r'https://www.nba.com/stats/draft/history'
driver.get(url)

In [46]:
element = driver.find_element(By.XPATH, "/html/body/div[1]/div[2]/div[2]/div[3]/section[2]/div/div[2]/div[2]/div[1]/div[3]/div/label/div/select")
select = Select(element)

In [47]:
# 'All' is first element, so we will need to index 0
select.select_by_index(0)

In [69]:
src = driver.page_source
parser = BeautifulSoup(src, 'lxml')
table = parser.find('table', class_= 'Crom_table__p1iZz')
draft_year = pd.read_html(str(table))[0]

In [70]:
draft_year

Unnamed: 0,Player,Team,Affiliation,Year,Round Number,Round Pick,Overall Pick
0,Victor Wembanyama,San Antonio Spurs,Metropolitans 92 (France),2023,1,1,1
1,Brandon Miller,Charlotte Hornets,Alabama,2023,1,2,2
2,Scoot Henderson,Portland Trail Blazers,Ignite (G League),2023,1,3,3
3,Amen Thompson,Houston Rockets,Overtime Elite,2023,1,4,4
4,Ausar Thompson,Detroit Pistons,Overtime Elite,2023,1,5,5
...,...,...,...,...,...,...,...
8252,Jack Walton,Pittsburgh Ironmen,DePaul,1947,0,0,0
8253,Herman Knoche,Pittsburgh Ironmen,Washington & Jefferson,1947,0,0,0
8254,Gene Vance,Chicago Stags,Illinois,1947,0,0,0
8255,Dick Ives,Pittsburgh Ironmen,Iowa,1947,0,0,0


In [71]:
draft_year.to_csv('../data/draft_year.csv', index=False)

In [None]:
# Webscraping Tutorials Used To Complete

# https://towardsdatascience.com/how-to-use-selenium-to-web-scrape-with-example-80f9b23a843a
# https://towardsdatascience.com/intro-to-web-scrapping-how-can-we-easily-seek-the-data-we-want-d9f3d9359246
# https://alexsl.medium.com/installing-selenium-webdriver-with-python-and-pycharm-from-scratch-on-windows-e4c713043882
# https://betterprogramming.pub/nba-web-scraping-with-python-22c76cfd1d4f
# https://towardsdatascience.com/how-scraping-nba-stats-is-cooler-than-michael-jordan-49d7562ce3ef

# https://www.youtube.com/watch?v=JGQGd-oa0l4&t=735s
# https://www.youtube.com/watch?v=fAW6AxMHego
# https://www.youtube.com/watch?v=lRT6YK0QcLw
# https://www.youtube.com/watch?v=nHtlRlWmTV4&t=342s