# Capstone Project: Predicting NHL Player Salary

## Part I - Project Intro and Data Collection

Author: Charles Ramey

Date: 05/15/2023

---

## Problem Statement

In the National Hockey League (NHL), team executives lack a robust, data-driven method for forecasting player salaries, which hinders effective roster building and financial planning. This stems from the complexity of the factors that drive player salaries, including player performance, the quality of the teams they have played for, how long they have played, and the value of contracts signed by similar players. This project seeks to design a data-driven approach that leverages historical data and advanced modeling techniques to help NHL executives balance their budgets, invest in their rosters, and remain competitive within the league.

#### Notebook Links

Part II - Exploratory Data Analysis (EDA)
- [`Part-2_eda.ipynb`](../code/Part-2_eda.ipynb)

Part III - Modeling
- [`Part-3.1_modeling-forwards.ipynb`](../code/Part-3.1_modeling-forwards.ipynb)
- [`Part-3.2_modeling-defense.ipynb`](../code/Part-3.2_modeling-defense.ipynb)
- [`Part-3.3_modeling-goalies.ipynb`](../code/Part-3.3_modeling-goalies.ipynb)

Part IV - Conclusion, Recommendations, and Sources
- [`Part-4_conclusion-and-recommendations.ipynb`](../code/Part-4_conclusion-and-recommendations.ipynb)

### Contents

- [Background](#Background)
- [Scraping & Cleaning: Signings](#Scraping-&-Cleaning:-Signings)
- [Scraping & Cleaning: Salary Cap](#Scraping-&-Cleaning:-Salary-Cap)
- [Scraping & Cleaning: Player Stats](#Scraping-&-Cleaning:-Player-Stats)
- [Scraping & Cleaning: Team Standings](#Scraping-&-Cleaning:-Team-Standings)
- [Merging Data](#Merging-Data)

## Background


### Library Imports

In [1]:
import pandas as pd
import numpy as np
import time

from bs4 import BeautifulSoup
import requests

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import ElementClickInterceptedException
from selenium import __version__


This notebook was originally run with Selenium v4.8.2.

In [2]:
print(f"Selenium version: {__version__}")

Selenium version: 4.8.2


In [3]:
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [4]:
teams_dict = {
    'Anaheim Ducks'        : 'ANA',
    'Arizona Coyotes'      : 'ARI',
    'Atlanta Thrashers'    : 'ATL',    # Winnipeg Jets after 2010-11 season
    'Boston Bruins'        : 'BOS',
    'Buffalo Sabres'       : 'BUF',
    'Calgary Flames'       : 'CGY',
    'Carolina Hurricanes'  : 'CAR',
    'Chicago Blackhawks'   : 'CHI',
    'Colorado Avalanche'   : 'COL',
    'Columbus Blue Jackets': 'CBJ',
    'Dallas Stars'         : 'DAL',
    'Detroit Red Wings'    : 'DET',
    'Edmonton Oilers'      : 'EDM',
    'Florida Panthers'     : 'FLA',
    'Los Angeles Kings'    : 'LAK',
    'Minnesota Wild'       : 'MIN',
    'Montreal Canadiens'   : 'MTL',
    'Nashville Predators'  : 'NSH',
    'New Jersey Devils'    : 'NJD',
    'New York Islanders'   : 'NYI',
    'New York Rangers'     : 'NYR',
    'Ottawa Senators'      : 'OTT',
    'Philadelphia Flyers'  : 'PHI',
    'Phoenix Coyotes'      : 'ARI',
    'Pittsburgh Penguins'  : 'PIT',
    'San Jose Sharks'      : 'SJS',
    'Seattle Kraken'       : 'SEA',    # Added in 2021-22 season
    'St. Louis Blues'      : 'STL',
    'Tampa Bay Lightning'  : 'TBL',
    'Toronto Maple Leafs'  : 'TOR',
    'Vancouver Canucks'    : 'VAN',
    'Vegas Golden Knights' : 'VGK',    # Added in 2017-18 season
    'Washington Capitals'  : 'WSH',
    'Winnipeg Jets'        : 'WPG'
}

---
## Scraping & Cleaning: Signings

In [5]:
positions = {
    'forwards': '2',
    'defense': '6',
    'goaltender': '7'
}

to_years = {
    '2012': '2',
    '2013': '2',
    '2014': '2',
    '2015': '2',
    '2016': '2',
    '2017': '2',
    '2018': '2',
    '2019': '2',
    '2020': '2',
    '2021': '2',
    '2022': '2'
}

from_years = {
    '2012': '13',
    '2013': '14',
    '2014': '15',
    '2015': '16',
    '2016': '17',
    '2017': '18',
    '2018': '19',
    '2019': '20',
    '2020': '21',
    '2021': '22',
    '2022': '23'
}

to_months = {
    'Feb': '2',
    'Mar': '2',
    'Apr': '2',
    'May': '2',
    'Jun': '2',
    'Jul': '2',
    'Aug': '2',
    'Sep': '2',
    'Oct': '2',
    'Nov': '2',
    'Dec': '2'
}

from_months = {
    'Feb': '2',
    'Mar': '3',
    'Apr': '4',
    'May': '5',
    'Jun': '6',
    'Jul': '7',
    'Aug': '8',
    'Sep': '9',
    'Oct': '10',
    'Nov': '11',
    'Dec': '12'
}
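The lookup tables above follow a regular pattern, so they could also be generated rather than typed out. The following is a minimal sketch of that idea. The `*_alt` names are hypothetical, and the option indices simply mirror the hard-coded dictionaries above rather than being checked against the live date picker.

```python
# Sketch: rebuild the date-picker option lookups from their patterns.
# The *_alt names are illustrative; values mirror the dictionaries above.
month_names = ['Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul',
               'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
years = range(2012, 2023)

# FROM year options: 2012 -> '13', 2013 -> '14', ..., 2022 -> '23'
from_years_alt = {str(year): str(year - 1999) for year in years}

# TO year options: every year uses option '2' in the original table
to_years_alt = {str(year): '2' for year in years}

# FROM month options: Feb -> '2', Mar -> '3', ..., Dec -> '12'
from_months_alt = {month: str(i) for i, month in enumerate(month_names, start=2)}

# TO month options: every month uses option '2' in the original table
to_months_alt = {month: '2' for month in month_names}
```

Generating the tables this way keeps the year range in one place, which may help if the scrape is later extended to additional seasons.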

### Player Signings Scraping Pseudocode

<img src="../assets/first_selection_looped.gif" alt="first_selection_looped" style="width:800px;height:250px;">

Description

<img src="../assets/next_year_looped.gif" alt="next_year_looped" style="width:800px;height:250px;">

Description

<img src="../assets/next_position_looped.gif" alt="next_position_looped" style="width:800px;height:250px;">




In [6]:
def select_element(xpath):
    time.sleep(0.1)
    driver.find_element(By.XPATH, xpath).click()

In [7]:
driver.get("https://www.capfriendly.com/signings")
time.sleep(2)

columns = ['PLAYER','AGE','POS','TEAM','DATE','TYPE','EXTENSION',
           'STRUCTURE','LENGTH','VALUE','CAP HIT']

# Initiate final dataframe to store all signings
signings = pd.DataFrame(columns=columns)

for pos in positions.values():
    # Select Position Box (Forwards/Defense/Goaltender)
    select_element(f'//*[@id="pos"]/option[{pos}]')
    
    # Select Date Range Box (FROM)
    select_element('//*[@id="from"]')
    # Select First Year, 2011 (FROM)
    select_element('//*[@id="ui-datepicker-div"]/div/div/select[2]/option[12]')
    # Select First Month, Jan (FROM)
    select_element('//*[@id="ui-datepicker-div"]/div/div/select[1]/option[1]')
    # Select Day (first of the month)
    time.sleep(0.1)
    first_row = driver.find_elements(By.XPATH, '//*[@id="ui-datepicker-div"]/table/tbody/tr[1]/td')
    for td_element in first_row:
        try:
            td_element.click()
            break
        except ElementClickInterceptedException:
            continue    
    
    # Select Date Range Box (TO)
    select_element('//*[@id="to"]')
    # Select First Year, 2011 (TO)
    select_element('//*[@id="ui-datepicker-div"]/div/div/select[2]/option[1]')
    # Select First Month, Jan (TO)
    select_element('//*[@id="ui-datepicker-div"]/div/div/select[1]/option[1]')
    # Select Day (last of the month)
    time.sleep(0.1)
    last_row = driver.find_elements(By.XPATH, '//*[@id="ui-datepicker-div"]/table/tbody/tr[last()]/td')
    for td_element in last_row[::-1]:
        try:
            td_element.click()
            break
        except ElementClickInterceptedException:
            continue
    
    # Save Result of First Month
    time.sleep(0.1)
    table_element = driver.find_element(By.XPATH, '//*[@id="na"]')
    table_data = pd.read_html(table_element.get_attribute('outerHTML'))
    jan_2011 = pd.DataFrame(table_data[0])
    signings = pd.concat([signings, jan_2011], join='inner')    
    
    
    # Iterate through all remaining months for the first year
    for (to_month_str, to_month), (from_month_str, from_month) in zip(to_months.items(), from_months.items()):
        # Select Date Range Box (TO)
        select_element('//*[@id="to"]')
        # Select Month (TO)
        select_element(f'//*[@id="ui-datepicker-div"]/div/div/select[1]/option[{to_month}]')
        # Select Day (last of the month)
        time.sleep(0.1)
        last_row = driver.find_elements(By.XPATH, '//*[@id="ui-datepicker-div"]/table/tbody/tr[last()]/td')
        for td_element in last_row[::-1]:
            try:
                td_element.click()
                break
            except ElementClickInterceptedException:
                continue

        # Select Date Range Box (FROM)
        select_element('//*[@id="from"]')        
        # Select Month (FROM)
        select_element(f'//*[@id="ui-datepicker-div"]/div/div/select[1]/option[{from_month}]')
        # Select Day (first of the month)
        time.sleep(0.1)
        first_row = driver.find_elements(By.XPATH, '//*[@id="ui-datepicker-div"]/table/tbody/tr[1]/td')
        for td_element in first_row:
            try:
                td_element.click()
                break
            except ElementClickInterceptedException:
                continue        
    
        # Save Result of Each Month for First Year
        time.sleep(0.1)
        table_element = driver.find_element(By.XPATH, '//*[@id="na"]')
        table_data = pd.read_html(table_element.get_attribute('outerHTML'))
        month_2011 = pd.DataFrame(table_data[0])
        signings = pd.concat([signings, month_2011], join='inner')
    
    
    # Iterate through remaining years to get all signings for position
    for (to_year_str, to_year), (from_year_str, from_year) in zip(to_years.items(), from_years.items()):
        # Select Date Range Box (TO)
        select_element('//*[@id="to"]')
        # Select Year (TO)
        select_element(f'//*[@id="ui-datepicker-div"]/div/div/select[2]/option[{to_year}]')
        # Select Month (TO)
        select_element('//*[@id="ui-datepicker-div"]/div/div/select[1]/option[1]')
        # Select Day (last of the month)
        time.sleep(0.1)
        last_row = driver.find_elements(By.XPATH, '//*[@id="ui-datepicker-div"]/table/tbody/tr[last()]/td')
        for td_element in last_row[::-1]:
            try:
                td_element.click()
                break
            except ElementClickInterceptedException:
                continue
                  
        # Select Date Range Box (FROM)
        select_element('//*[@id="from"]')
        # Select Year (FROM)
        select_element(f'//*[@id="ui-datepicker-div"]/div/div/select[2]/option[{from_year}]')
        # Select Month (FROM)
        select_element('//*[@id="ui-datepicker-div"]/div/div/select[1]/option[1]')
        # Select Day (first of the month)
        time.sleep(0.1)
        first_row = driver.find_elements(By.XPATH, '//*[@id="ui-datepicker-div"]/table/tbody/tr[1]/td')
        for td_element in first_row:
            try:
                td_element.click()
                break
            except ElementClickInterceptedException:
                continue

        time.sleep(0.1)
        table_element = driver.find_element(By.XPATH, '//*[@id="na"]')
        table_data = pd.read_html(table_element.get_attribute('outerHTML'))
        jan_year = pd.DataFrame(table_data[0])
        signings = pd.concat([signings, jan_year], join='inner')
        

        for (to_month_str, to_month), (from_month_str, from_month) in zip(to_months.items(), from_months.items()):
            # Select Date Range Box (TO)
            select_element('//*[@id="to"]')
            # Select Month (TO)
            select_element(f'//*[@id="ui-datepicker-div"]/div/div/select[1]/option[{to_month}]')
            # Select Day (last of the month)
            time.sleep(0.1)
            last_row = driver.find_elements(By.XPATH, '//*[@id="ui-datepicker-div"]/table/tbody/tr[last()]/td')
            for td_element in last_row[::-1]:
                try:
                    td_element.click()
                    break
                except ElementClickInterceptedException:
                    continue

            # Select Date Range Box (FROM)
            select_element('//*[@id="from"]')        
            # Select Month (FROM)
            select_element(f'//*[@id="ui-datepicker-div"]/div/div/select[1]/option[{from_month}]')
            # Select Day (first of the month)
            time.sleep(0.1)
            first_row = driver.find_elements(By.XPATH, '//*[@id="ui-datepicker-div"]/table/tbody/tr[1]/td')
            for td_element in first_row:
                try:
                    td_element.click()
                    break
                except ElementClickInterceptedException:
                    continue

            time.sleep(0.1)
            table_element = driver.find_element(By.XPATH, '//*[@id="na"]')
            table_data = pd.read_html(table_element.get_attribute('outerHTML'))
            month_year = pd.DataFrame(table_data[0])
            signings = pd.concat([signings, month_year], join='inner')
            
print(f'Shape of signings: {signings.shape}')

Shape of signings: (3730, 11)
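The scraping cell above repeats the same "click the first day cell that accepts a click" loop many times. One way to reduce that repetition is a small helper, sketched below. The helper name `click_first_clickable` is hypothetical, and the `FakeIntercepted` and `FakeCell` classes are stand-ins so the example runs without a browser; in the notebook the exception would be `ElementClickInterceptedException` and the elements would come from `driver.find_elements`.

```python
def click_first_clickable(elements, skip_exceptions):
    """Click the first element that accepts a click; return it, or None."""
    for element in elements:
        try:
            element.click()
            return element
        except skip_exceptions:
            # This cell is covered or disabled, so try the next one
            continue
    return None


class FakeIntercepted(Exception):
    """Stand-in for selenium's ElementClickInterceptedException."""


class FakeCell:
    """Stand-in for a date-picker <td> element."""

    def __init__(self, label, clickable):
        self.label = label
        self.clickable = clickable

    def click(self):
        if not self.clickable:
            raise FakeIntercepted(self.label)


# A calendar row whose first two cells are padding and cannot be clicked
row = [FakeCell('', False), FakeCell('', False),
       FakeCell('1', True), FakeCell('2', True)]

first_day = click_first_clickable(row, (FakeIntercepted,))       # scan left to right
last_day = click_first_clickable(row[::-1], (FakeIntercepted,))  # scan right to left
```

Scanning the row forward picks the first clickable day and scanning it in reverse picks the last, which matches the two loop variants used in the cell above.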


In [8]:
signings.head()

Unnamed: 0,PLAYER,AGE,POS,TEAM,DATE,TYPE,EXTENSION,STRUCTURE,LENGTH,VALUE,CAP HIT
0,Nate Thompson,26,C,TBL,"Jan. 31, 2011",Standard,✔,1-way,2,"$1,800,000","$900,000"
1,Matt Moulson,27,LW,NYI,"Jan. 27, 2011",Standard,✔,1-way,3,"$9,400,000","$3,133,333"
2,Alexander Semin,26,"RW, LW",WSH,"Jan. 27, 2011",Standard,✔,1-way,1,"$6,700,000","$6,700,000"
3,Mark Letestu,25,"C, RW",PIT,"Jan. 18, 2011",Standard,✔,1-way,2,"$1,250,000","$625,000"
4,Kyle Wellwood,27,C,ARI,"Jan. 17, 2011",Standard,,2-way,1,"$650,000","$650,000"


In [9]:
signings.columns = signings.columns.str.lower().str.replace(' ', '_')

In [10]:
signings['pos'] = np.where(signings['pos'].str.contains('D'), 'D',
                           np.where(signings['pos'].str.contains('G'), 'G', 'F'))

In [11]:
signings = signings[signings['structure'] == '1-way']

In [12]:
for col in ['value', 'cap_hit']:
    signings.loc[:, col] = signings[col].replace(r'[\$,]', '', regex=True).astype(int)

In [13]:
signings.loc[:, 'contract_aav'] = signings['value'] / signings['length']
signings.loc[:, 'contract_aav'] = signings['contract_aav'].astype(int)
signings = signings.drop(columns=['age', 'team', 'extension', 'type', 'length', 'value', 'cap_hit', 'structure'])
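As a worked example of the two cleaning steps above, the sketch below parses a dollar string and computes the average annual value for a single contract, using the Matt Moulson row from the table (a total value of $9,400,000 over 3 years). The function names are hypothetical and the code uses plain Python rather than pandas.

```python
import re


def parse_dollars(text):
    """Convert a string such as '$9,400,000' to the integer 9400000."""
    return int(re.sub(r'[\$,]', '', text))


def average_annual_value(total_value, length_years):
    """Total contract value divided by length, truncated to whole dollars."""
    return int(parse_dollars(total_value) / length_years)


aav = average_annual_value('$9,400,000', 3)
```

The truncated result agrees with the `$3,133,333` cap hit listed for that row, which suggests that `contract_aav` and the scraped `CAP HIT` column carry the same information for this contract.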

In [14]:
signings[signings['date'].isna()]

Unnamed: 0,player,pos,date,contract_aav


In [15]:
from datetime import datetime

def convert_to_datetime(date_string):
    try:
        # Try converting with the first format 'Mar. 3, 2011'
        date_obj = datetime.strptime(date_string, '%b. %d, %Y')
    except ValueError:
        # If the first format fails, try converting with the second format 'Mar 3, 2011'
        date_obj = datetime.strptime(date_string, '%b %d, %Y')
    return date_obj

# Apply the conversion function to the entire column
signings['date'] = signings['date'].apply(convert_to_datetime)
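If more date formats turn up in the scraped data, the nested `try`/`except` could be generalized to a loop over candidate formats. This is a minimal sketch of that variation; `DATE_FORMATS` and `parse_signing_date` are hypothetical names, and only the two formats named in the cell above are assumed.

```python
from datetime import datetime

# Candidate formats, tried in order: 'Mar. 3, 2011' and 'Mar 3, 2011'
DATE_FORMATS = ('%b. %d, %Y', '%b %d, %Y')


def parse_signing_date(date_string, formats=DATE_FORMATS):
    """Return a datetime using the first format that matches."""
    for fmt in formats:
        try:
            return datetime.strptime(date_string, fmt)
        except ValueError:
            continue
    # No format matched, so surface the offending value
    raise ValueError(f'Unrecognized date format: {date_string!r}')


with_period = parse_signing_date('Jan. 31, 2011')
without_period = parse_signing_date('Mar 3, 2011')
```

Raising on an unknown format makes unexpected values visible at conversion time rather than later in the pipeline.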

In [16]:
signings = signings[signings['date'] > '2011-06-15']

In [17]:
signings[signings['date'].isna()]

Unnamed: 0,player,pos,date,contract_aav


In [18]:
signings = signings.reset_index(drop=True)

In [19]:
# https://en.wikipedia.org/wiki/List_of_NHL_seasons
season_ranges = [
    {'season': '2010-11', 'start': pd.to_datetime('2011-06-15'), 'end': pd.to_datetime('2012-06-11')},
    {'season': '2011-12', 'start': pd.to_datetime('2012-06-11'), 'end': pd.to_datetime('2013-06-24')},
    {'season': '2012-13', 'start': pd.to_datetime('2013-06-24'), 'end': pd.to_datetime('2014-06-13')},
    {'season': '2013-14', 'start': pd.to_datetime('2014-06-13'), 'end': pd.to_datetime('2015-06-15')},
    {'season': '2014-15', 'start': pd.to_datetime('2015-06-15'), 'end': pd.to_datetime('2016-06-12')},
    {'season': '2015-16', 'start': pd.to_datetime('2016-06-12'), 'end': pd.to_datetime('2017-06-11')},
    {'season': '2016-17', 'start': pd.to_datetime('2017-06-11'), 'end': pd.to_datetime('2018-06-07')},
    {'season': '2017-18', 'start': pd.to_datetime('2018-06-07'), 'end': pd.to_datetime('2019-06-12')},
    {'season': '2018-19', 'start': pd.to_datetime('2019-06-12'), 'end': pd.to_datetime('2020-09-28')},
    {'season': '2019-20', 'start': pd.to_datetime('2020-09-28'), 'end': pd.to_datetime('2021-07-07')},
    {'season': '2020-21', 'start': pd.to_datetime('2021-07-07'), 'end': pd.to_datetime('2022-06-26')},
    {'season': '2021-22', 'start': pd.to_datetime('2022-06-26'), 'end': pd.to_datetime('2022-12-31')}
]

signings['season'] = ''

# Iterate through the date column and assign the corresponding label for the date range to the new 'season' column
for i in range(len(signings)):
    for season_range in season_ranges:
        if season_range['start'] < signings.loc[i, 'date'] <= season_range['end']:
            signings.loc[i, 'season'] = season_range['season']
            break   # Exit the inner loop once a matching date range is found
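Because the season ranges are contiguous, the same lookup can be expressed as a binary search over the boundary dates. The sketch below illustrates this with the standard library on the first two ranges only; `label_for` is a hypothetical helper, and it keeps the same `start < date <= end` convention as the loop above.

```python
from bisect import bisect_left
from datetime import datetime

# Boundary dates taken from the first two season ranges above
boundaries = [datetime(2011, 6, 15), datetime(2012, 6, 11), datetime(2013, 6, 24)]
labels = ['2010-11', '2011-12']


def label_for(date, boundaries, labels):
    """Return the label whose (start, end] interval contains date, else ''."""
    # bisect_left finds the first boundary >= date, which is the interval's end
    idx = bisect_left(boundaries, date) - 1
    if 0 <= idx < len(labels):
        return labels[idx]
    return ''


first = label_for(datetime(2011, 9, 30), boundaries, labels)
on_end = label_for(datetime(2012, 6, 11), boundaries, labels)
before_all = label_for(datetime(2011, 6, 15), boundaries, labels)
```

A date equal to a boundary falls in the earlier season, and a date on or before the first boundary gets an empty label, as in the original loop.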

In [20]:
signings.head()

Unnamed: 0,player,pos,date,contract_aav,season
0,Trent Hunter,F,2011-09-30,600000,2010-11
1,Jason Chimera,F,2011-09-29,1750000,2010-11
2,Mike Modano,F,2011-09-21,999999,2010-11
3,R.J. Umberger,F,2011-09-20,4600000,2010-11
4,Kevin Westgarth,F,2011-09-19,725000,2010-11


In [21]:
signings = signings.drop(columns='date')

In [22]:
signings

Unnamed: 0,player,pos,contract_aav,season
0,Trent Hunter,F,600000,2010-11
1,Jason Chimera,F,1750000,2010-11
2,Mike Modano,F,999999,2010-11
3,R.J. Umberger,F,4600000,2010-11
4,Kevin Westgarth,F,725000,2010-11
...,...,...,...,...
1372,Jake Oettinger,G,4000000,2021-22
1373,Daniel Vladar,G,2200000,2021-22
1374,Jake Allen,G,3850000,2021-22
1375,Pyotr Kochetkov,G,2000000,2021-22


In [23]:
signings.to_csv('../data/signings_cleaned.csv', index=False)

In [24]:
signings_skaters = signings[signings['pos'] != 'G']

In [25]:
signings_goalies = signings[signings['pos'] == 'G']

---
## Scraping & Cleaning: Salary Cap

In [26]:
driver.quit()
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [27]:
driver.get("https://www.capfriendly.com/salary-cap")
time.sleep(2)

table_element = driver.find_element(By.XPATH, '//*[@id="salaryCapHistoryInnerContainer"]/table')
table_data = pd.read_html(table_element.get_attribute('outerHTML'))
salary_cap = pd.DataFrame(table_data[0])

In [28]:
salary_cap

Unnamed: 0,SEASON,CONFIRMED,% CHANGE,UPPER LIMIT,LOWER LIMIT,MIN. SALARY
0,2025-26,NHL Estimate,5.14%,"$92,000,000","$68,000,000","$775,000"
1,2024-25,NHL Estimate,4.79%,"$87,500,000","$64,700,000","$775,000"
2,2023-24,NHL Estimate,1.21%,"$83,500,000","$61,700,000","$775,000"
3,2022-23,"Mar. 29, 2022",1.23%,"$82,500,000","$61,000,000","$750,000"
4,2021-22,"Jul. 1, 2021",0.00%,"$81,500,000","$60,200,000","$750,000"
5,2020-21,"Jul. 10, 2020",0.00%,"$81,500,000","$60,200,000","$700,000"
6,2019-20,"Jun. 22, 2019",2.52%,"$81,500,000","$60,200,000","$700,000"
7,2018-19,"Jun. 21, 2018",6.00%,"$79,500,000","$58,800,000","$650,000"
8,2017-18,"Jun. 18, 2017",2.74%,"$75,000,000","$55,400,000","$650,000"
9,2016-17,"Jun. 21, 2016",2.24%,"$73,000,000","$54,000,000","$575,000"


In [29]:
salary_cap.columns = salary_cap.columns.str.lower().str.replace(' ', '_', regex=False).str.replace('%','pct', regex=False).str.replace('.', '', regex=False)

In [30]:
salary_cap['pct_change'] = salary_cap['pct_change'].str.strip('%').astype(float) / 100
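For reference, the percentage conversion above amounts to the following per-value operation. This is a plain-Python sketch with a hypothetical function name, using sample strings that appear in the scraped tables.

```python
def pct_to_fraction(text):
    """Convert a percentage string such as '5.14%' to a fraction."""
    return float(text.strip('%')) / 100


# Sample values taken from the tables in this notebook
fractions = [pct_to_fraction(s) for s in ('5.14%', '0.00%', '-17%')]
```

Results are floating-point values, so comparisons against them are best made with a tolerance rather than exact equality.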

In [31]:
for col in ['upper_limit', 'lower_limit', 'min_salary']:
    salary_cap[col] = salary_cap[col].replace(r'[\$,]', '', regex=True).astype(int)

In [32]:
seasons_to_exclude = [
    '2005-06',
    '2006-07',
    '2007-08',
    '2008-09',
    '2009-10',
    '2010-11',
    '2024-25',
    '2025-26'
]

In [33]:
salary_cap = salary_cap[~salary_cap['season'].isin(seasons_to_exclude)]

In [34]:
salary_cap = salary_cap.drop(columns='confirmed')

In [35]:
cap_to_stats = {
    '2011-12': '2010-11',
    '2012-13': '2011-12',
    '2013-14': '2012-13',
    '2014-15': '2013-14',
    '2015-16': '2014-15',
    '2016-17': '2015-16',
    '2017-18': '2016-17',
    '2018-19': '2017-18',
    '2019-20': '2018-19',
    '2020-21': '2019-20',
    '2021-22': '2020-21',
    '2022-23': '2021-22',
    '2023-24': '2022-23'
}

In [36]:
salary_cap['season'] = salary_cap['season'].map(cap_to_stats)
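The `cap_to_stats` dictionary shifts every cap season back by one year, so it could also be derived from the season label itself. The sketch below shows one way to do that; `previous_season`, `season_label`, and `cap_to_stats_alt` are hypothetical names.

```python
def season_label(start_year):
    """Format a season label, e.g. 2011 -> '2011-12'."""
    return f'{start_year}-{(start_year + 1) % 100:02d}'


def previous_season(season):
    """Shift a season label back one year, e.g. '2011-12' -> '2010-11'."""
    return season_label(int(season[:4]) - 1)


# Same keys as the hard-coded mapping above: '2011-12' through '2023-24'
cap_to_stats_alt = {
    season_label(year): previous_season(season_label(year))
    for year in range(2011, 2024)
}
```

The modulo in `season_label` keeps the two-digit suffix correct across a century boundary, for example `'1999-00'`.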

In [37]:
salary_cap

Unnamed: 0,season,pct_change,upper_limit,lower_limit,min_salary
2,2022-23,0.0121,83500000,61700000,775000
3,2021-22,0.0123,82500000,61000000,750000
4,2020-21,0.0,81500000,60200000,750000
5,2019-20,0.0,81500000,60200000,700000
6,2018-19,0.0252,81500000,60200000,700000
7,2017-18,0.06,79500000,58800000,650000
8,2016-17,0.0274,75000000,55400000,650000
9,2015-16,0.0224,73000000,54000000,575000
10,2014-15,0.0348,71400000,52800000,575000
11,2013-14,0.0731,69000000,51000000,525000


In [38]:
salary_cap.to_csv('../data/salary_cap_cleaned.csv', index=False)

---
## Scraping & Cleaning: Player Stats

In [39]:
seasons = {
    '2010-11': '13',
    '2011-12': '12',
    '2012-13': '11',
    '2013-14': '10',
    '2014-15': '9',
    '2015-16': '8',
    '2016-17': '7',
    '2017-18': '6',
    '2018-19': '5',
    '2019-20': '4',
    '2020-21': '3',
    '2021-22': '2',
    '2022-23': '1'
}

### Skater Stats (Forwards & Defense)

In [40]:
driver.quit()
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [41]:
driver.get("https://moneypuck.com/stats.htm")
time.sleep(3)

# Initiate final dataframe to store all player stats
skater_stats = pd.DataFrame()

# Select only Regular Season Stats
select_element('//*[@id="table_playoff_type"]/option[1]')

# Iterate through all seasons from 2010-11 to 2022-23
for season_str, season in seasons.items():
    # Select Season
    select_element(f'//*[@id="season_type"]/option[{season}]')
    
    time.sleep(8)
    table_element = driver.find_element(By.XPATH, '//*[@id="includedContent"]/table')
    table_data = pd.read_html(table_element.get_attribute('outerHTML'))
    season_stats = pd.DataFrame(table_data[0])
    
    tr_elements = driver.find_elements(By.XPATH, '//*[@id="includedContent"]/table/tbody/tr')
    alt_values = [tr_element.find_element(By.XPATH, './th/table/tbody/tr/td[2]/img').get_attribute('alt') for tr_element in tr_elements]
    
    season_stats.columns = season_stats.columns.str.lower().str.replace(' ', '_').str.replace('%','pct')
    season_stats.dropna(subset=['pos'], inplace=True)

    season_stats['season'] = season_str
    season_stats['team'] = alt_values
    season_stats['team'] = season_stats['team'].map(teams_dict)
    
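    # Move 'season' and 'team' next to the player columns
    # (separate names avoid shadowing the loop variable `season`)
    season_col = season_stats.pop('season')
    team_col = season_stats.pop('team')
    season_stats.insert(2, 'season', season_col)
    season_stats.insert(3, 'team', team_col)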
    
    
    print(f'Shape of {season_str} stats: {season_stats.shape}')
    skater_stats = pd.concat([skater_stats, season_stats])

Shape of 2010-11 stats: (887, 80)
Shape of 2011-12 stats: (890, 80)
Shape of 2012-13 stats: (837, 80)
Shape of 2013-14 stats: (882, 80)
Shape of 2014-15 stats: (879, 80)
Shape of 2015-16 stats: (897, 80)
Shape of 2016-17 stats: (884, 80)
Shape of 2017-18 stats: (886, 80)
Shape of 2018-19 stats: (903, 80)
Shape of 2019-20 stats: (880, 80)
Shape of 2020-21 stats: (910, 80)
Shape of 2021-22 stats: (998, 80)
Shape of 2022-23 stats: (949, 80)
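The loop above pauses for a fixed 8 seconds after each season selection. An alternative is to poll for a readiness condition and continue as soon as it holds. The sketch below shows a generic polling helper using only the standard library; `wait_until` and `table_ready` are hypothetical, and the readiness check is simulated with a counter rather than a real page. Selenium also ships `WebDriverWait` for this purpose, which may be the better fit in the notebook itself.

```python
import time


def wait_until(predicate, timeout=10.0, interval=0.25):
    """Poll predicate until it returns a truthy value or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f'Condition not met within {timeout} seconds')
        time.sleep(interval)


# Simulated readiness check: becomes true on the third poll
attempts = {'count': 0}


def table_ready():
    attempts['count'] += 1
    return attempts['count'] >= 3


ready = wait_until(table_ready, timeout=2.0, interval=0.01)
```

In a real scrape the predicate would need to check something that changes when the new season loads, such as the table's row count or first row.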


In [42]:
skater_stats

Unnamed: 0,name,pos,season,team,games_played,icetime_(minutes),expected_goals,goals,assists,points,...,share_of_xgoals_from_rebounds_shots,xgoals_from_non_rebounds,xgoals_of_expected_rebounds,created_xgoals,created_xgoals_minus_actual_xgoals,goals.1,expected_goals.1,shooting_talent_above_average,shooting_talent_adjusted_expected_goals,goals_above_shooting_talent
0,1Frans Nielsen,C,2010-11,NYI,71,1261.0,14.7,13.0,31.0,44.0,...,16.2%,12.4,2.1,14.4,-0.3,13.0,14.7,-17%,12.2,0.8
2,2Jaroslav Spacek,D,2010-11,MTL,59,1135.0,2.7,1.0,15.0,16.0,...,2.2%,2.6,0.9,3.5,0.8,1.0,2.7,0%,2.7,-1.7
4,3Antti Miettinen,R,2010-11,MIN,73,1242.0,15.7,16.0,19.0,35.0,...,7.6%,14.5,2.6,17.1,1.4,16.0,15.7,0.6%,15.8,0.2
6,4Kyle Quincey,D,2010-11,COL,21,411.0,1.4,0.0,1.0,1.0,...,3.6%,1.3,0.5,1.8,0.4,0.0,1.4,-22.6%,1.1,-1.1
8,5Sergei Samsonov,L,2010-11,FLA,78,1207.0,15.6,13.0,27.0,40.0,...,28.2%,11.2,2.1,13.3,-2.3,13.0,15.6,4.4%,16.2,-3.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1888,945Rasmus Asplund,C,2022-23,NSH,46,504.0,5.9,2.0,6.0,8.0,...,41.4%,3.4,0.7,4.2,-1.7,2.0,5.9,-21.6%,4.6,-2.6
1890,946Ross Johnston,L,2022-23,NYI,16,124.0,0.3,0.0,2.0,2.0,...,0%,0.3,0.1,0.4,0.1,0.0,0.3,0%,0.3,-0.3
1892,947Neal Pionk,D,2022-23,WPG,82,1799.0,7.0,10.0,23.0,33.0,...,9.7%,6.3,1.8,8.1,1.1,10.0,7.0,-12.5%,6.1,3.9
1894,948Tyler Pitlick,C,2022-23,STL,61,614.0,5.6,7.0,9.0,16.0,...,30.3%,3.9,0.8,4.7,-0.9,7.0,5.6,3.8%,5.9,1.1


In [43]:
pct_cols = [col for col in skater_stats.columns if 'pct' in col]
for col in pct_cols:
    skater_stats[col] = skater_stats[col].str.strip('%').astype(float) / 100

In [44]:
skater_stats.select_dtypes(include=['object']).columns.tolist()

['name',
 'pos',
 'season',
 'team',
 'games_played',
 'share_of_possible_icetime',
 'share_of_xgoals_from_rebounds_shots',
 'shooting_talent_above_average']

In [45]:
skater_stats['games_played'] = skater_stats['games_played'].astype(int)
skater_stats['share_of_possible_icetime'] = skater_stats['share_of_possible_icetime'].str.strip('%').astype(float) / 100
skater_stats['share_of_xgoals_from_rebounds_shots'] = skater_stats['share_of_xgoals_from_rebounds_shots'].str.strip('%').astype(float) / 100
skater_stats['shooting_talent_above_average'] = skater_stats['shooting_talent_above_average'].str.strip('%').astype(float) / 100

In [46]:
skater_stats.select_dtypes(include=['object']).columns.tolist()

['name', 'pos', 'season', 'team']

In [47]:
skater_stats['name'] = skater_stats['name'].str.replace(r'^\d+\s*([^\d]+)$', r'\1', regex=True)
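The regular expression above removes the rank number that the scraped table prepends to each name (for example `1Frans Nielsen`). The sketch below applies the same pattern to individual strings with the standard `re` module; `strip_rank` is a hypothetical helper name.

```python
import re

# Leading digits, optional whitespace, then a remainder containing no digits
RANK_PREFIX = re.compile(r'^\d+\s*([^\d]+)$')


def strip_rank(name):
    """Remove a leading rank number; leave non-matching names unchanged."""
    return RANK_PREFIX.sub(r'\1', name)


cleaned = [strip_rank(n) for n in ('1Frans Nielsen', '945Rasmus Asplund', 'Neal Pionk')]
```

Because the pattern requires the remainder to contain no digits, a name that itself contains a digit would not match and would keep its rank prefix.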

In [48]:
skater_stats['pos'] = skater_stats['pos'].replace(['C','R','L'], 'F')

In [49]:
int_cols = ['games_played','icetime_(minutes)','goals','assists',
           'points','primary_assists','secondary_assists','shifts',
           'hits','pim','pim_drawn','pim_differential','shots_blocked_by_player',
           'takeaways','giveaways', 'defensive_zone_giveaways','shot_attempts',
           'shots_on_goal','shots_that_missed_net','shots_that_were_blocked',
           'high_danger_unblocked_shot_attempts','medium_danger_unblocked_shot_attempts',
           'low_danger_unblocked_shot_attempts','on-ice_goal_differential',
           'rebounds_created']

for col in int_cols:
    skater_stats[col] = skater_stats[col].astype(int)

In [50]:
skater_stats = skater_stats.rename(columns={'icetime_(minutes)': 'icetime', 'name': 'player'})

In [51]:
skater_stats = skater_stats.drop(columns=['goals.1', 'expected_goals.1'])

In [52]:
skater_stats.reset_index(drop=True, inplace=True)

In [53]:
skater_stats

Unnamed: 0,player,pos,season,team,games_played,icetime,expected_goals,goals,assists,points,...,rebounds_created_above_expected,xgoals_on_rebounds_shots,share_of_xgoals_from_rebounds_shots,xgoals_from_non_rebounds,xgoals_of_expected_rebounds,created_xgoals,created_xgoals_minus_actual_xgoals,shooting_talent_above_average,shooting_talent_adjusted_expected_goals,goals_above_shooting_talent
0,Frans Nielsen,F,2010-11,NYI,71,1261,14.7,13,31,44,...,1.4,2.4,0.162,12.4,2.1,14.4,-0.3,-0.170,12.2,0.8
1,Jaroslav Spacek,D,2010-11,MTL,59,1135,2.7,1,15,16,...,-2.5,0.1,0.022,2.6,0.9,3.5,0.8,0.000,2.7,-1.7
2,Antti Miettinen,F,2010-11,MIN,73,1242,15.7,16,19,35,...,-8.4,1.2,0.076,14.5,2.6,17.1,1.4,0.006,15.8,0.2
3,Kyle Quincey,D,2010-11,COL,21,411,1.4,0,1,1,...,2.9,0.1,0.036,1.3,0.5,1.8,0.4,-0.226,1.1,-1.1
4,Sergei Samsonov,F,2010-11,FLA,78,1207,15.6,13,27,40,...,-3.3,4.4,0.282,11.2,2.1,13.3,-2.3,0.044,16.2,-3.2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11677,Rasmus Asplund,F,2022-23,NSH,46,504,5.9,2,6,8,...,1.6,2.4,0.414,3.4,0.7,4.2,-1.7,-0.216,4.6,-2.6
11678,Ross Johnston,F,2022-23,NYI,16,124,0.3,0,2,2,...,0.8,0.0,0.000,0.3,0.1,0.4,0.1,0.000,0.3,-0.3
11679,Neal Pionk,D,2022-23,WPG,82,1799,7.0,10,23,33,...,-1.4,0.7,0.097,6.3,1.8,8.1,1.1,-0.125,6.1,3.9
11680,Tyler Pitlick,F,2022-23,STL,61,614,5.6,7,9,16,...,6.3,1.7,0.303,3.9,0.8,4.7,-0.9,0.038,5.9,1.1


In [54]:
skater_stats.to_csv('../data/skater_stats_cleaned.csv', index=False)

### Goalie Stats

In [55]:
driver.quit()
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [56]:
driver.get("https://moneypuck.com/goalies.htm")
time.sleep(3)

# Initiate final dataframe to store all player stats
goalie_stats = pd.DataFrame()

# Select only Regular Season Stats
select_element('//*[@id="table_playoff_type"]/option[1]')

# Iterate through all seasons from 2010-11 to 2022-23
for season_str, season in seasons.items():
    # Select Season
    select_element(f'//*[@id="season_type"]/option[{season}]')

    time.sleep(8)
    table_element = driver.find_element(By.XPATH, '//*[@id="goaliesTable"]')
    table_data = pd.read_html(table_element.get_attribute('outerHTML'))
    season_stats = pd.DataFrame(table_data[0])
    
    tr_elements = driver.find_elements(By.XPATH, '//*[@id="goaliesTable"]/tbody/tr')
    alt_values = [tr_element.find_element(By.XPATH, './th/table/tbody/tr/td[2]/img').get_attribute('alt') for tr_element in tr_elements]

    season_stats.columns = season_stats.columns.str.lower().str.replace(' ', '_').str.replace('%','pct')
    season_stats.dropna(subset=['games_played'], inplace=True)

    season_stats['pos'] = 'G'
    season_stats['season'] = season_str
    season_stats['team'] = alt_values
    season_stats['team'] = season_stats['team'].map(teams_dict)
    
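    # Move 'pos', 'season', and 'team' next to the player name
    # (separate names avoid shadowing the loop variable `season`)
    pos_col = season_stats.pop('pos')
    season_col = season_stats.pop('season')
    team_col = season_stats.pop('team')
    season_stats.insert(1, 'pos', pos_col)
    season_stats.insert(2, 'season', season_col)
    season_stats.insert(3, 'team', team_col)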
    
    
    print(f'Shape of {season_str} stats: {season_stats.shape}')
    goalie_stats = pd.concat([goalie_stats, season_stats])

Shape of 2010-11 stats: (87, 41)
Shape of 2011-12 stats: (88, 41)
Shape of 2012-13 stats: (82, 41)
Shape of 2013-14 stats: (97, 41)
Shape of 2014-15 stats: (92, 41)
Shape of 2015-16 stats: (92, 41)
Shape of 2016-17 stats: (94, 41)
Shape of 2017-18 stats: (95, 41)
Shape of 2018-19 stats: (93, 41)
Shape of 2019-20 stats: (85, 41)
Shape of 2020-21 stats: (98, 41)
Shape of 2021-22 stats: (119, 41)
Shape of 2022-23 stats: (107, 41)


In [57]:
goalie_stats

Unnamed: 0,name,pos,season,team,games_played,goals_against,expected_goals_against,goals_saved_above_expected,goals_saved_above_expected_per_60,save_pct_on_unblocked_shots,...,on_goal_pct_above_expected,low_danger_unblocked_shot_attempt_save_pct,xlow_danger_unblocked_shot_attempt_save_pct,low_danger_unblocked_shot_attempt_savepct_above_expected,medium_danger_unblocked_shot_attempt_save_pct,xmedium_danger_unblocked_shot_attempt_save_pct,medium_danger_unblocked_shot_attemptsave_pct_above_expected,high_danger_unblocked_shot_attempt_save_pct,xhigh_danger_unblocked_shot_attempt_save_pct,high_danger_unblocked_shot_attempt_save_pct_above_expected
0,1Tim Thomas,G,2010-11,BOS,57.0,112,151.56,39.6,0.706,0.966,...,1.77%,0.982,0.971,0.011,0.914,0.882,0.032,0.691,0.651,0.040
2,2Cam Ward,G,2010-11,CAR,74.0,184,209.88,25.9,0.365,0.959,...,-1.88%,0.978,0.971,0.007,0.894,0.880,0.015,0.659,0.672,-0.013
4,3Jonas Hiller,G,2010-11,ANA,49.0,114,134.72,20.7,0.465,0.959,...,-0.58%,0.980,0.970,0.010,0.895,0.879,0.016,0.648,0.664,-0.015
6,4Carey Price,G,2010-11,MTL,72.0,165,185.54,20.5,0.292,0.959,...,1.09%,0.978,0.972,0.006,0.907,0.880,0.027,0.626,0.674,-0.048
8,5Roberto Luongo,G,2010-11,VAN,60.0,126,144.60,18.6,0.312,0.960,...,1.88%,0.978,0.971,0.007,0.916,0.879,0.036,0.593,0.678,-0.085
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
204,103Jonathan Quick,G,2022-23,VGK,41.0,127,109.83,-17.2,-0.462,0.939,...,-1.25%,0.958,0.971,-0.012,0.846,0.876,-0.031,0.730,0.671,0.059
206,104Jack Campbell,G,2022-23,EDM,36.0,115,96.72,-18.3,-0.542,0.935,...,1.73%,0.960,0.969,-0.010,0.866,0.879,-0.013,0.604,0.655,-0.051
208,105Spencer Martin,G,2022-23,VAN,29.0,107,83.47,-23.5,-0.876,0.929,...,-0.29%,0.941,0.968,-0.027,0.884,0.876,0.008,0.651,0.690,-0.038
210,106Kaapo Kahkonen,G,2022-23,SJS,37.0,135,110.42,-24.6,-0.701,0.938,...,-0.03%,0.962,0.970,-0.008,0.825,0.877,-0.052,0.706,0.690,0.017


In [58]:
goalie_stats['team'].unique()

array(['BOS', 'CAR', 'ANA', 'MTL', 'VAN', 'NSH', 'ARI', 'NYI', 'CHI',
       'NYR', 'TOR', 'WSH', 'PHI', 'FLA', 'PIT', 'EDM', 'MIN', 'LAK',
       'BUF', 'STL', nan, 'OTT', 'TBL', 'SJS', 'DAL', 'DET', 'NJD', 'CBJ',
       'CGY', 'COL', 'WPG', 'VGK'], dtype=object)

In [59]:
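# Fill missing team codes by season: ATL for 2010-11 rows, SEA for 2021-22 rows.
# Assign the result back instead of calling fillna(inplace=True) on a column selection.
goalie_stats['team'] = goalie_stats['team'].fillna(
    goalie_stats['season'].map({'2010-11': 'ATL', '2021-22': 'SEA'})
)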

In [60]:
pct_cols = [col for col in goalie_stats.columns if 'pct' in col]
for col in pct_cols:
    if goalie_stats[col].dtype == 'object':
        goalie_stats[col] = goalie_stats[col].str.strip('%').astype(float) / 100
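
The percent-string conversion in the cell above can be checked on a small frame. The following is a minimal sketch, not the notebook's own code: the `toy` frame and its values are made up for illustration, and `pd.api.types.is_string_dtype` is swapped in for the `dtype == 'object'` test so the check also holds on pandas versions that use a dedicated string dtype.

```python
import pandas as pd

# Toy frame (hypothetical values): one percent-formatted text column
# and one column that is already a numeric fraction.
toy = pd.DataFrame({
    "on_goal_pct_above_expected": ["1.77%", "-1.88%"],
    "save_pct_on_shots_on_goal": [0.938, 0.923],
})

for col in [c for c in toy.columns if "pct" in c]:
    # Only text columns need conversion; numeric columns are left untouched.
    if pd.api.types.is_string_dtype(toy[col]):
        toy[col] = toy[col].str.rstrip("%").astype(float) / 100

print(toy.dtypes)
```

After the loop both columns should be floating point, with the text column rescaled to fractions (for example `"1.77%"` becomes roughly `0.0177`).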

In [61]:
goalie_stats.select_dtypes(include=['object']).columns.tolist()

['name', 'pos', 'season', 'team', 'goals_against']

In [62]:
goalie_stats['goals_against'] = goalie_stats['goals_against'].astype(int)

In [63]:
goalie_stats['name'] = goalie_stats['name'].str.replace(r'^\d+\s*([^\d]+)$', r'\1', regex=True)

In [64]:
pd.set_option('display.max_columns', None)
goalie_stats

Unnamed: 0,name,pos,season,team,games_played,goals_against,expected_goals_against,goals_saved_above_expected,goals_saved_above_expected_per_60,save_pct_on_unblocked_shots,xsave_pct_on_unblocked_shots,save_pct_above_expected,save_pct_on_shots_on_goal,gaa,xgaa,gaa_better_than_expected,wins_above_replacement,icetime_(minutes),rebounds_per_save,xrebounds_per_save,rebounds_above_expected,puck_freezes,expected_puck_freeze,puck_freezes_above_expected,puck_freezes_above_expected_per_shot_on_goal,goals_against.1,saves_on_shots_on_goal,saves_on_unblocked_shot_attempts,pct_of_shot_attempts_blocked_by_teammates,pct_of_unblocked_shot_attempts_against_on_goal,expected_pct_of_unblocked_shot_attempts_against_on_goal,on_goal_pct_above_expected,low_danger_unblocked_shot_attempt_save_pct,xlow_danger_unblocked_shot_attempt_save_pct,low_danger_unblocked_shot_attempt_savepct_above_expected,medium_danger_unblocked_shot_attempt_save_pct,xmedium_danger_unblocked_shot_attempt_save_pct,medium_danger_unblocked_shot_attemptsave_pct_above_expected,high_danger_unblocked_shot_attempt_save_pct,xhigh_danger_unblocked_shot_attempt_save_pct,high_danger_unblocked_shot_attempt_save_pct_above_expected
0,Tim Thomas,G,2010-11,BOS,57.0,112,151.56,39.6,0.706,0.966,0.953,0.012,0.938,2.00,2.70,0.71,6.59,3363.0,0.033,0.038,-0.004,481.0,395.37,85.63,0.05,112.0,1699.0,3146.0,0.2019,0.5559,0.5382,0.0177,0.982,0.971,0.011,0.914,0.882,0.032,0.691,0.651,0.040
2,Cam Ward,G,2010-11,CAR,74.0,184,209.88,25.9,0.365,0.959,0.953,0.006,0.923,2.59,2.96,0.36,4.31,4257.0,0.045,0.040,0.005,505.0,555.09,-50.09,-0.02,184.0,2191.0,4266.0,0.1922,0.5337,0.5526,-0.0188,0.978,0.971,0.007,0.894,0.880,0.015,0.659,0.672,-0.013
4,Jonas Hiller,G,2010-11,ANA,49.0,114,134.72,20.7,0.465,0.959,0.951,0.008,0.924,2.56,3.03,0.47,3.45,2671.0,0.048,0.041,0.008,298.0,332.28,-34.28,-0.02,114.0,1379.0,2640.0,0.1900,0.5421,0.5479,-0.0058,0.980,0.970,0.010,0.895,0.879,0.016,0.648,0.664,-0.015
6,Carey Price,G,2010-11,MTL,72.0,165,185.54,20.5,0.292,0.959,0.953,0.005,0.923,2.35,2.65,0.29,3.42,4206.0,0.039,0.039,0.000,528.0,478.50,49.50,0.02,165.0,1982.0,3813.0,0.2126,0.5397,0.5288,0.0109,0.978,0.972,0.006,0.907,0.880,0.027,0.626,0.674,-0.048
8,Roberto Luongo,G,2010-11,VAN,60.0,126,144.60,18.6,0.312,0.960,0.954,0.006,0.928,2.11,2.42,0.31,3.10,3579.0,0.027,0.040,-0.013,429.0,380.53,48.47,0.03,126.0,1627.0,3026.0,0.2004,0.5562,0.5374,0.0188,0.978,0.971,0.007,0.916,0.879,0.036,0.593,0.678,-0.085
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
204,Jonathan Quick,G,2022-23,VGK,41.0,127,109.83,-17.2,-0.462,0.939,0.947,-0.008,0.882,3.41,2.95,-0.46,-2.86,2234.0,0.055,0.040,0.015,231.0,250.00,-19.00,-0.02,127.0,953.0,1961.0,0.2130,0.5172,0.5298,-0.0125,0.958,0.971,-0.012,0.846,0.876,-0.031,0.730,0.671,0.059
206,Jack Campbell,G,2022-23,EDM,36.0,115,96.72,-18.3,-0.542,0.935,0.946,-0.010,0.888,3.41,2.86,-0.54,-3.05,2026.0,0.049,0.040,0.009,237.0,233.74,3.26,0.00,115.0,912.0,1664.0,0.1854,0.5773,0.5600,0.0173,0.960,0.969,-0.010,0.866,0.879,-0.013,0.604,0.655,-0.051
208,Spencer Martin,G,2022-23,VAN,29.0,107,83.47,-23.5,-0.876,0.929,0.945,-0.016,0.871,3.99,3.11,-0.88,-3.92,1610.0,0.059,0.042,0.018,159.0,188.42,-29.42,-0.04,107.0,723.0,1402.0,0.1922,0.5500,0.5530,-0.0029,0.941,0.968,-0.027,0.884,0.876,0.008,0.651,0.690,-0.038
210,Kaapo Kahkonen,G,2022-23,SJS,37.0,135,110.42,-24.6,-0.701,0.938,0.949,-0.011,0.883,3.85,3.15,-0.70,-4.10,2106.0,0.052,0.040,0.012,287.0,264.42,22.58,0.02,135.0,1014.0,2049.0,0.2113,0.5261,0.5264,-0.0003,0.962,0.970,-0.008,0.825,0.877,-0.052,0.706,0.690,0.017


In [65]:
int_cols = ['games_played','icetime_(minutes)','puck_freezes',
            'saves_on_shots_on_goal','saves_on_unblocked_shot_attempts']

for col in int_cols:
    goalie_stats[col] = goalie_stats[col].astype(int)

In [66]:
goalie_stats = goalie_stats.rename(columns={'icetime_(minutes)': 'icetime', 'name': 'player'})

In [67]:
goalie_stats = goalie_stats.drop(columns='goals_against.1')

In [68]:
goalie_stats.reset_index(drop=True, inplace=True)

In [69]:
goalie_stats

Unnamed: 0,player,pos,season,team,games_played,goals_against,expected_goals_against,goals_saved_above_expected,goals_saved_above_expected_per_60,save_pct_on_unblocked_shots,xsave_pct_on_unblocked_shots,save_pct_above_expected,save_pct_on_shots_on_goal,gaa,xgaa,gaa_better_than_expected,wins_above_replacement,icetime,rebounds_per_save,xrebounds_per_save,rebounds_above_expected,puck_freezes,expected_puck_freeze,puck_freezes_above_expected,puck_freezes_above_expected_per_shot_on_goal,saves_on_shots_on_goal,saves_on_unblocked_shot_attempts,pct_of_shot_attempts_blocked_by_teammates,pct_of_unblocked_shot_attempts_against_on_goal,expected_pct_of_unblocked_shot_attempts_against_on_goal,on_goal_pct_above_expected,low_danger_unblocked_shot_attempt_save_pct,xlow_danger_unblocked_shot_attempt_save_pct,low_danger_unblocked_shot_attempt_savepct_above_expected,medium_danger_unblocked_shot_attempt_save_pct,xmedium_danger_unblocked_shot_attempt_save_pct,medium_danger_unblocked_shot_attemptsave_pct_above_expected,high_danger_unblocked_shot_attempt_save_pct,xhigh_danger_unblocked_shot_attempt_save_pct,high_danger_unblocked_shot_attempt_save_pct_above_expected
0,Tim Thomas,G,2010-11,BOS,57,112,151.56,39.6,0.706,0.966,0.953,0.012,0.938,2.00,2.70,0.71,6.59,3363,0.033,0.038,-0.004,481,395.37,85.63,0.05,1699,3146,0.2019,0.5559,0.5382,0.0177,0.982,0.971,0.011,0.914,0.882,0.032,0.691,0.651,0.040
1,Cam Ward,G,2010-11,CAR,74,184,209.88,25.9,0.365,0.959,0.953,0.006,0.923,2.59,2.96,0.36,4.31,4257,0.045,0.040,0.005,505,555.09,-50.09,-0.02,2191,4266,0.1922,0.5337,0.5526,-0.0188,0.978,0.971,0.007,0.894,0.880,0.015,0.659,0.672,-0.013
2,Jonas Hiller,G,2010-11,ANA,49,114,134.72,20.7,0.465,0.959,0.951,0.008,0.924,2.56,3.03,0.47,3.45,2671,0.048,0.041,0.008,298,332.28,-34.28,-0.02,1379,2640,0.1900,0.5421,0.5479,-0.0058,0.980,0.970,0.010,0.895,0.879,0.016,0.648,0.664,-0.015
3,Carey Price,G,2010-11,MTL,72,165,185.54,20.5,0.292,0.959,0.953,0.005,0.923,2.35,2.65,0.29,3.42,4206,0.039,0.039,0.000,528,478.50,49.50,0.02,1982,3813,0.2126,0.5397,0.5288,0.0109,0.978,0.972,0.006,0.907,0.880,0.027,0.626,0.674,-0.048
4,Roberto Luongo,G,2010-11,VAN,60,126,144.60,18.6,0.312,0.960,0.954,0.006,0.928,2.11,2.42,0.31,3.10,3579,0.027,0.040,-0.013,429,380.53,48.47,0.03,1627,3026,0.2004,0.5562,0.5374,0.0188,0.978,0.971,0.007,0.916,0.879,0.036,0.593,0.678,-0.085
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1224,Jonathan Quick,G,2022-23,VGK,41,127,109.83,-17.2,-0.462,0.939,0.947,-0.008,0.882,3.41,2.95,-0.46,-2.86,2234,0.055,0.040,0.015,231,250.00,-19.00,-0.02,953,1961,0.2130,0.5172,0.5298,-0.0125,0.958,0.971,-0.012,0.846,0.876,-0.031,0.730,0.671,0.059
1225,Jack Campbell,G,2022-23,EDM,36,115,96.72,-18.3,-0.542,0.935,0.946,-0.010,0.888,3.41,2.86,-0.54,-3.05,2026,0.049,0.040,0.009,237,233.74,3.26,0.00,912,1664,0.1854,0.5773,0.5600,0.0173,0.960,0.969,-0.010,0.866,0.879,-0.013,0.604,0.655,-0.051
1226,Spencer Martin,G,2022-23,VAN,29,107,83.47,-23.5,-0.876,0.929,0.945,-0.016,0.871,3.99,3.11,-0.88,-3.92,1610,0.059,0.042,0.018,159,188.42,-29.42,-0.04,723,1402,0.1922,0.5500,0.5530,-0.0029,0.941,0.968,-0.027,0.884,0.876,0.008,0.651,0.690,-0.038
1227,Kaapo Kahkonen,G,2022-23,SJS,37,135,110.42,-24.6,-0.701,0.938,0.949,-0.011,0.883,3.85,3.15,-0.70,-4.10,2106,0.052,0.040,0.012,287,264.42,22.58,0.02,1014,2049,0.2113,0.5261,0.5264,-0.0003,0.962,0.970,-0.008,0.825,0.877,-0.052,0.706,0.690,0.017


In [70]:
goalie_stats.to_csv('../data/goalie_stats_cleaned.csv', index=False)

---
## Scraping & Cleaning: Team Standings

In [71]:
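from io import StringIO

driver.get('https://www.hockey-reference.com/leagues/NHL_2011_standings.html')
time.sleep(2)

# Initiate final dataframe to store team standings for each year
team_standings = pd.DataFrame()


for season_str, season in seasons.items():
    
    # Parse the expanded standings table for the season currently displayed
    table_element = driver.find_element(By.XPATH, '//*[@id="expanded_standings"]')
    # Wrap the HTML in StringIO; newer pandas versions no longer accept a raw HTML string
    table_data = pd.read_html(StringIO(table_element.get_attribute('outerHTML')))
    season_standings = pd.DataFrame(table_data[0])
    season_standings['season'] = season_str
    
    # Navigate to the next season's page; if a modal intercepts the click,
    # close it and retry so the next iteration does not re-read the same page
    try:
        select_element('//*[@id="meta"]/div[2]/div/a[2]')
    except ElementClickInterceptedException:
        select_element('//*[@id="modal-close"]')
        select_element('//*[@id="meta"]/div[2]/div/a[2]')
    time.sleep(5)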
    
    season_standings.columns = season_standings.columns.str.lower()
    season_standings = season_standings.loc[:, ['rk', 'unnamed: 1', 'season']]
    season_standings = season_standings.rename(columns={'rk': 'final_standing','unnamed: 1': 'team'})
    season_standings['team'] = season_standings['team'].map(teams_dict)
    season_standings = season_standings[['team','season','final_standing']]
    
    team_standings = pd.concat([team_standings, season_standings])

In [72]:
team_standings.reset_index(drop=True, inplace=True)

In [73]:
team_standings

Unnamed: 0,team,season,final_standing
0,VAN,2010-11,1
1,WSH,2010-11,2
2,PIT,2010-11,3
3,PHI,2010-11,4
4,SJS,2010-11,5
...,...,...,...
391,NJD,2022-23,28
392,PHI,2022-23,29
393,SEA,2022-23,30
394,ARI,2022-23,31


In [74]:
team_standings.to_csv('../data/team_standings_cleaned.csv', index=False)

---

## Merging Data

In [75]:
signings.head(2)

Unnamed: 0,player,pos,contract_aav,season
0,Trent Hunter,F,600000,2010-11
1,Jason Chimera,F,1750000,2010-11


In [76]:
salary_cap.head(2)

Unnamed: 0,season,pct_change,upper_limit,lower_limit,min_salary
2,2022-23,0.0121,83500000,61700000,775000
3,2021-22,0.0123,82500000,61000000,750000


In [77]:
skater_stats.head(2)

Unnamed: 0,player,pos,season,team,games_played,icetime,expected_goals,goals,assists,points,primary_assists,secondary_assists,shifts,share_of_possible_icetime,pct_of_shift_starts_in_offensive_zone,pct_of_shift_starts_in_neutral_zone,pct_of_shift_starts_in_defensive_zone,pct_of_shift_starts_on_fly,hits,pim,pim_drawn,pim_differential,shots_blocked_by_player,shots_blocked_by_player_per_60,takeaways,giveaways,defensive_zone_giveaways,faceoff_win_pct,goals_above_expected,expected_goals_per_60_minutes,goals_per_60_minutes,assists_per_60_minutes,points_per_60_minutes,shots_on_goal_per_60_minutes,shot_attempts_per_60_minutes,shooting_pct,shooting_pct_on_unblocked_shots,expected_shooting_pct_on_unblocked_shots,shooting_pct_on_unblocked_shots_above_expected,shot_attempts,shots_on_goal,shots_that_missed_net,shots_that_were_blocked,pct_of_unblocked_shots_that_missed_net,expected_pct_of_unblocked_shots_that_missed_net,net_miss_pct_above_expected,high_danger_unblocked_shot_attempts,medium_danger_unblocked_shot_attempts,low_danger_unblocked_shot_attempts,high_danger_xgoals,medium_danger_xgoals,low_danger_xgoals,on-ice_shot_attempt_pct_(corsi),on-ice_unblocked_shot_attempt_pct_(fenwick),on-ice_goals_pct,on-ice_expected_goals_pct,off-ice_expected_goals_pct,relative_expected_goals_pct,on-ice_score_adjusted_expected_goals_pct,on-ice_score/flurry_adjusted_expected_goals_pct,on-ice_expected_goals_against_per_60_minutes,on-ice_shot_attempts_against_per_60_minutes,on-ice_high_danger_shot_attempts_against_per_60_minutes,flurry_adjusted_xgoals,on-ice_goal_differential,on-ice_expected_goals_differential,rebounds_created,xrebounds_created,rebounds_created_above_expected,xgoals_on_rebounds_shots,share_of_xgoals_from_rebounds_shots,xgoals_from_non_rebounds,xgoals_of_expected_rebounds,created_xgoals,created_xgoals_minus_actual_xgoals,shooting_talent_above_average,shooting_talent_adjusted_expected_goals,goals_above_shooting_talent
0,Frans Nielsen,F,2010-11,NYI,71,1261,14.7,13,31,44,20,11,1692,0.291,0.138,0.186,0.199,0.477,23,18,49,-31,63,3.0,66,33,13,0.462,-1.7,0.7,0.62,1.47,2.09,7.42,12.13,0.083,0.064,0.069,-0.005,255,156,46,53,0.228,0.272,-0.045,13,47,142,4.64,5.63,4.45,0.49,0.5,0.531,0.516,0.454,0.062,0.516,0.519,2.9,58.47,2.71,14.3,8,4.1,11,9.6,1.4,2.4,0.162,12.4,2.1,14.4,-0.3,-0.17,12.2,0.8
1,Jaroslav Spacek,D,2010-11,MTL,59,1135,2.7,1,15,16,5,10,1526,0.317,0.098,0.145,0.13,0.626,61,45,28,17,90,4.76,30,55,53,0.0,-1.7,0.14,0.05,0.79,0.85,3.44,9.04,0.015,0.01,0.02,-0.01,171,65,37,69,0.363,0.343,0.02,0,5,97,0.0,0.72,1.96,0.5,0.5,0.5,0.434,0.548,-0.114,0.432,0.438,3.0,58.66,3.12,2.6,0,-13.2,2,4.5,-2.5,0.1,0.022,2.6,0.9,3.5,0.8,0.0,2.7,-1.7


In [78]:
goalie_stats.head(2)

Unnamed: 0,player,pos,season,team,games_played,goals_against,expected_goals_against,goals_saved_above_expected,goals_saved_above_expected_per_60,save_pct_on_unblocked_shots,xsave_pct_on_unblocked_shots,save_pct_above_expected,save_pct_on_shots_on_goal,gaa,xgaa,gaa_better_than_expected,wins_above_replacement,icetime,rebounds_per_save,xrebounds_per_save,rebounds_above_expected,puck_freezes,expected_puck_freeze,puck_freezes_above_expected,puck_freezes_above_expected_per_shot_on_goal,saves_on_shots_on_goal,saves_on_unblocked_shot_attempts,pct_of_shot_attempts_blocked_by_teammates,pct_of_unblocked_shot_attempts_against_on_goal,expected_pct_of_unblocked_shot_attempts_against_on_goal,on_goal_pct_above_expected,low_danger_unblocked_shot_attempt_save_pct,xlow_danger_unblocked_shot_attempt_save_pct,low_danger_unblocked_shot_attempt_savepct_above_expected,medium_danger_unblocked_shot_attempt_save_pct,xmedium_danger_unblocked_shot_attempt_save_pct,medium_danger_unblocked_shot_attemptsave_pct_above_expected,high_danger_unblocked_shot_attempt_save_pct,xhigh_danger_unblocked_shot_attempt_save_pct,high_danger_unblocked_shot_attempt_save_pct_above_expected
0,Tim Thomas,G,2010-11,BOS,57,112,151.56,39.6,0.706,0.966,0.953,0.012,0.938,2.0,2.7,0.71,6.59,3363,0.033,0.038,-0.004,481,395.37,85.63,0.05,1699,3146,0.2019,0.5559,0.5382,0.0177,0.982,0.971,0.011,0.914,0.882,0.032,0.691,0.651,0.04
1,Cam Ward,G,2010-11,CAR,74,184,209.88,25.9,0.365,0.959,0.953,0.006,0.923,2.59,2.96,0.36,4.31,4257,0.045,0.04,0.005,505,555.09,-50.09,-0.02,2191,4266,0.1922,0.5337,0.5526,-0.0188,0.978,0.971,0.007,0.894,0.88,0.015,0.659,0.672,-0.013


In [79]:
team_standings.head(2)

Unnamed: 0,team,season,final_standing
0,VAN,2010-11,1
1,WSH,2010-11,2


In [80]:
skaters = signings_skaters.merge(salary_cap, on='season', how='left')

In [81]:
skaters = skaters.merge(skater_stats, on=['player', 'pos', 'season'], how='left')

In [82]:
skaters = skaters.merge(team_standings, on=['team', 'season'], how='left')

In [83]:
skaters

Unnamed: 0,player,pos,contract_aav,season,pct_change,upper_limit,lower_limit,min_salary,team,games_played,icetime,expected_goals,goals,assists,points,primary_assists,secondary_assists,shifts,share_of_possible_icetime,pct_of_shift_starts_in_offensive_zone,pct_of_shift_starts_in_neutral_zone,pct_of_shift_starts_in_defensive_zone,pct_of_shift_starts_on_fly,hits,pim,pim_drawn,pim_differential,shots_blocked_by_player,shots_blocked_by_player_per_60,takeaways,giveaways,defensive_zone_giveaways,faceoff_win_pct,goals_above_expected,expected_goals_per_60_minutes,goals_per_60_minutes,assists_per_60_minutes,points_per_60_minutes,shots_on_goal_per_60_minutes,shot_attempts_per_60_minutes,shooting_pct,shooting_pct_on_unblocked_shots,expected_shooting_pct_on_unblocked_shots,shooting_pct_on_unblocked_shots_above_expected,shot_attempts,shots_on_goal,shots_that_missed_net,shots_that_were_blocked,pct_of_unblocked_shots_that_missed_net,expected_pct_of_unblocked_shots_that_missed_net,net_miss_pct_above_expected,high_danger_unblocked_shot_attempts,medium_danger_unblocked_shot_attempts,low_danger_unblocked_shot_attempts,high_danger_xgoals,medium_danger_xgoals,low_danger_xgoals,on-ice_shot_attempt_pct_(corsi),on-ice_unblocked_shot_attempt_pct_(fenwick),on-ice_goals_pct,on-ice_expected_goals_pct,off-ice_expected_goals_pct,relative_expected_goals_pct,on-ice_score_adjusted_expected_goals_pct,on-ice_score/flurry_adjusted_expected_goals_pct,on-ice_expected_goals_against_per_60_minutes,on-ice_shot_attempts_against_per_60_minutes,on-ice_high_danger_shot_attempts_against_per_60_minutes,flurry_adjusted_xgoals,on-ice_goal_differential,on-ice_expected_goals_differential,rebounds_created,xrebounds_created,rebounds_created_above_expected,xgoals_on_rebounds_shots,share_of_xgoals_from_rebounds_shots,xgoals_from_non_rebounds,xgoals_of_expected_rebounds,created_xgoals,created_xgoals_minus_actual_xgoals,shooting_talent_above_average,shooting_talent_adjusted_expected_goals,goals_above_shooting_talent,final_standing
0,Trent Hunter,F,600000,2010-11,0.0825,64300000,48300000,525000,NYI,17.0,215.0,2.1,1.0,3.0,4.0,1.0,2.0,310.0,0.207,0.158,0.232,0.123,0.487,36.0,23.0,0.0,23.0,5.0,1.39,3.0,2.0,0.0,0.000,-1.0,0.57,0.28,0.84,1.12,8.36,18.96,0.033,0.020,0.040,-0.020,68.0,30.0,20.0,18.0,0.400,0.300,0.100,0.0,10.0,40.0,0.00,0.99,1.07,0.52,0.51,0.467,0.516,0.488,0.028,0.509,0.511,2.45,56.87,1.95,2.0,-1.0,0.6,2.0,2.2,-0.1,0.1,0.068,1.9,0.5,2.5,0.4,-0.206,1.6,-0.6,27.0
1,Jason Chimera,F,1750000,2010-11,0.0825,64300000,48300000,525000,WSH,81.0,1072.0,13.9,10.0,16.0,26.0,9.0,7.0,1407.0,0.217,0.126,0.165,0.087,0.622,98.0,54.0,48.0,6.0,16.0,0.89,26.0,27.0,12.0,0.513,-3.9,0.78,0.56,0.89,1.45,9.06,14.32,0.062,0.047,0.060,-0.013,256.0,162.0,53.0,41.0,0.247,0.265,-0.019,13.0,42.0,160.0,4.32,5.25,4.38,0.49,0.49,0.437,0.474,0.545,-0.071,0.473,0.473,2.61,56.43,2.35,13.7,-11.0,-4.6,9.0,10.3,-1.2,1.6,0.114,12.4,2.2,14.5,0.6,0.013,14.1,-4.1,2.0
2,Mike Modano,F,999999,2010-11,0.0825,64300000,48300000,525000,DET,40.0,497.0,6.1,4.0,11.0,15.0,5.0,6.0,696.0,0.204,0.237,0.149,0.115,0.499,7.0,8.0,8.0,0.0,10.0,1.20,14.0,13.0,3.0,0.485,-2.1,0.74,0.48,1.33,1.81,9.52,16.75,0.051,0.037,0.055,-0.018,139.0,79.0,30.0,30.0,0.275,0.266,0.009,5.0,19.0,85.0,1.55,2.17,2.38,0.61,0.61,0.609,0.615,0.515,0.100,0.609,0.607,2.10,46.15,1.33,5.7,10.0,10.3,5.0,5.3,-0.3,0.6,0.092,5.6,1.2,6.7,0.6,0.104,6.7,-2.7,6.0
3,R.J. Umberger,F,4600000,2010-11,0.0825,64300000,48300000,525000,CBJ,82.0,1575.0,26.1,25.0,32.0,57.0,19.0,13.0,2038.0,0.314,0.196,0.160,0.149,0.495,114.0,28.0,44.0,-16.0,70.0,2.67,44.0,20.0,5.0,0.505,-1.1,1.00,0.95,1.22,2.17,8.38,13.82,0.114,0.084,0.087,-0.003,363.0,220.0,79.0,64.0,0.264,0.278,-0.013,33.0,64.0,202.0,11.66,7.86,6.61,0.54,0.54,0.497,0.545,0.477,0.068,0.544,0.543,2.58,52.02,2.32,25.0,-1.0,13.5,16.0,16.5,-0.5,6.1,0.234,20.0,3.9,23.9,-2.2,0.036,27.1,-2.1,24.0
4,Kevin Westgarth,F,725000,2010-11,0.0825,64300000,48300000,525000,LAK,56.0,304.0,1.7,0.0,3.0,3.0,1.0,2.0,466.0,0.090,0.144,0.189,0.082,0.586,59.0,95.0,77.0,18.0,5.0,0.99,5.0,11.0,5.0,0.500,-1.7,0.33,0.00,0.59,0.59,3.94,6.70,0.000,0.000,0.032,-0.032,34.0,20.0,11.0,3.0,0.355,0.290,0.065,0.0,8.0,23.0,0.00,0.98,0.68,0.48,0.49,0.364,0.433,0.534,-0.101,0.436,0.435,2.35,47.89,2.56,1.6,-6.0,-2.8,3.0,1.6,1.5,0.4,0.229,1.3,0.3,1.6,-0.1,0.000,1.7,-1.7,12.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1127,Mattias Samuelsson,D,4285714,2021-22,0.0123,82500000,61000000,750000,BUF,42.0,839.0,2.3,0.0,10.0,10.0,5.0,5.0,1132.0,0.331,0.059,0.155,0.151,0.634,100.0,16.0,20.0,-4.0,60.0,4.29,8.0,28.0,23.0,0.000,-2.3,0.16,0.00,0.71,0.71,3.14,5.93,0.000,0.000,0.035,-0.035,83.0,44.0,13.0,26.0,0.228,0.316,-0.088,2.0,5.0,50.0,0.58,0.60,1.09,0.43,0.42,0.381,0.412,0.487,-0.075,0.415,0.415,3.50,60.03,4.00,2.2,-23.0,-14.6,3.0,2.5,0.5,0.6,0.256,1.7,0.5,2.2,-0.1,-0.100,2.0,-2.0,31.0
1128,Nicolas Hague,D,2294150,2021-22,0.0123,82500000,61000000,750000,VGK,52.0,969.0,4.8,4.0,10.0,14.0,4.0,6.0,1233.0,0.308,0.101,0.157,0.117,0.624,57.0,38.0,18.0,20.0,80.0,4.95,25.0,23.0,19.0,0.000,-0.8,0.30,0.25,0.62,0.87,7.18,14.79,0.034,0.025,0.025,0.000,239.0,116.0,46.0,77.0,0.284,0.309,-0.025,1.0,12.0,149.0,0.25,1.35,3.19,0.50,0.52,0.486,0.486,0.490,-0.004,0.489,0.489,3.08,58.71,3.46,4.6,-3.0,-2.6,12.0,6.7,5.3,0.4,0.088,4.4,1.4,5.8,1.0,-0.019,4.7,-0.7,1.0
1129,MacKenzie Weegar,D,6250000,2021-22,0.0123,82500000,61000000,750000,FLA,80.0,1869.0,11.2,8.0,36.0,44.0,17.0,19.0,2372.0,0.384,0.158,0.175,0.164,0.504,179.0,51.0,21.0,30.0,156.0,5.01,74.0,85.0,53.0,0.000,-3.2,0.36,0.26,1.16,1.41,6.51,12.55,0.039,0.030,0.041,-0.011,391.0,203.0,68.0,120.0,0.251,0.306,-0.055,4.0,26.0,241.0,1.49,3.13,6.59,0.54,0.54,0.541,0.540,0.561,-0.021,0.542,0.540,3.29,54.97,3.63,10.6,18.0,17.9,11.0,11.9,-0.9,0.7,0.065,10.5,2.6,13.1,1.9,0.000,11.2,-3.2,4.0
1130,Calvin De Haan,D,850000,2021-22,0.0123,82500000,61000000,750000,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,


In [84]:
skaters.isna().sum().sort_values(ascending = False).loc[lambda x: x > 0]

shooting_pct                           137
final_standing                         136
on-ice_goals_pct                       136
share_of_xgoals_from_rebounds_shots    135
net_miss_pct_above_expected            135
                                      ... 
assists_per_60_minutes                 131
giveaways                              131
shots_on_goal_per_60_minutes           131
shot_attempts_per_60_minutes           131
games_played                           131
Length: 76, dtype: int64

In [85]:
skaters.dropna(inplace=True)
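
The missing values counted above appear to arise where a left merge finds no matching row on the join keys. One way to inspect such rows before dropping them is pandas' `indicator=True` option on `merge`. The sketch below is an illustration only and not part of the original pipeline: `signings_toy`, `stats_toy`, and the player names are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "Player B" is spelled differently in the stats table,
# so the left merge cannot match it.
signings_toy = pd.DataFrame({
    "player": ["Player A", "Player B"],
    "pos": ["F", "D"],
    "season": ["2010-11", "2021-22"],
})
stats_toy = pd.DataFrame({
    "player": ["Player A", "Player b"],
    "pos": ["F", "D"],
    "season": ["2010-11", "2021-22"],
    "games_played": [17, 42],
})

# indicator=True adds a "_merge" column marking each row's origin.
merged = signings_toy.merge(
    stats_toy, on=["player", "pos", "season"], how="left", indicator=True
)

# Rows present only in the left frame found no stats match.
unmatched = merged.loc[merged["_merge"] == "left_only", ["player", "pos", "season"]]
print(unmatched)
```

Listing the unmatched keys this way makes it easier to tell whether rows are lost to name-formatting differences or to players who genuinely have no stats for that season.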

In [86]:
skaters.to_csv('../data/skaters_cleaned.csv', index=False)

In [87]:
forwards = skaters[skaters['pos'] == 'F']

In [88]:
forwards.shape

(523, 84)

In [89]:
forwards.isna().sum().sort_values(ascending = False).loc[lambda x: x > 0]

Series([], dtype: int64)

In [90]:
forwards.to_csv('../data/forwards_cleaned.csv', index=False)

In [91]:
defense = skaters[skaters['pos'] == 'D']

In [92]:
defense.shape

(462, 84)

In [93]:
defense.to_csv('../data/defense_cleaned.csv', index=False)

In [94]:
goalies = signings_goalies.merge(salary_cap, on='season', how='left')

In [95]:
goalies = goalies.merge(goalie_stats, on=['player', 'pos', 'season'], how='left')

In [96]:
goalies = goalies.merge(team_standings, on=['team', 'season'], how='left')

In [97]:
goalies.shape

(245, 46)

In [98]:
goalies.isna().sum().sort_values(ascending = False).loc[lambda x: x > 0]

rebounds_per_save                                              46
xrebounds_per_save                                             46
puck_freezes                                                   46
expected_puck_freeze                                           46
puck_freezes_above_expected                                    46
puck_freezes_above_expected_per_shot_on_goal                   46
saves_on_shots_on_goal                                         46
saves_on_unblocked_shot_attempts                               46
pct_of_shot_attempts_blocked_by_teammates                      46
pct_of_unblocked_shot_attempts_against_on_goal                 46
expected_pct_of_unblocked_shot_attempts_against_on_goal        46
on_goal_pct_above_expected                                     46
low_danger_unblocked_shot_attempt_save_pct                     46
xlow_danger_unblocked_shot_attempt_save_pct                    46
low_danger_unblocked_shot_attempt_savepct_above_expected       46
medium_dan

In [99]:
goalies.dropna(inplace=True)

In [100]:
goalies

Unnamed: 0,player,pos,contract_aav,season,pct_change,upper_limit,lower_limit,min_salary,team,games_played,goals_against,expected_goals_against,goals_saved_above_expected,goals_saved_above_expected_per_60,save_pct_on_unblocked_shots,xsave_pct_on_unblocked_shots,save_pct_above_expected,save_pct_on_shots_on_goal,gaa,xgaa,gaa_better_than_expected,wins_above_replacement,icetime,rebounds_per_save,xrebounds_per_save,rebounds_above_expected,puck_freezes,expected_puck_freeze,puck_freezes_above_expected,puck_freezes_above_expected_per_shot_on_goal,saves_on_shots_on_goal,saves_on_unblocked_shot_attempts,pct_of_shot_attempts_blocked_by_teammates,pct_of_unblocked_shot_attempts_against_on_goal,expected_pct_of_unblocked_shot_attempts_against_on_goal,on_goal_pct_above_expected,low_danger_unblocked_shot_attempt_save_pct,xlow_danger_unblocked_shot_attempt_save_pct,low_danger_unblocked_shot_attempt_savepct_above_expected,medium_danger_unblocked_shot_attempt_save_pct,xmedium_danger_unblocked_shot_attempt_save_pct,medium_danger_unblocked_shot_attemptsave_pct_above_expected,high_danger_unblocked_shot_attempt_save_pct,xhigh_danger_unblocked_shot_attempt_save_pct,high_danger_unblocked_shot_attempt_save_pct_above_expected,final_standing
0,Ilya Bryzgalov,G,5666666,2010-11,0.0825,64300000,48300000,525000,ARI,68.0,168.0,182.33,14.3,0.212,0.957,0.953,0.004,0.921,2.49,2.70,0.21,2.39,4054.0,0.037,0.041,-0.004,508.0,485.74,22.26,0.01,1957.0,3728.0,0.1820,0.5454,0.5566,-0.0111,0.976,0.972,0.005,0.895,0.878,0.018,0.624,0.667,-0.043,11.0
1,Henrik Karlsson,G,862500,2010-11,0.0825,64300000,48300000,525000,CGY,17.0,36.0,33.13,-2.9,-0.208,0.950,0.954,-0.004,0.908,2.59,2.38,-0.21,-0.48,834.0,0.031,0.037,-0.006,79.0,87.20,-8.20,-0.02,355.0,684.0,0.2018,0.5431,0.5330,0.0101,0.965,0.971,-0.007,0.901,0.875,0.026,0.567,0.639,-0.072,17.0
2,Jhonas Enroth,G,675000,2010-11,0.0825,64300000,48300000,525000,BUF,14.0,35.0,32.70,-2.3,-0.179,0.951,0.955,-0.003,0.907,2.73,2.55,-0.18,-0.38,769.0,0.026,0.038,-0.012,84.0,88.73,-4.73,-0.01,342.0,684.0,0.1912,0.5243,0.5450,-0.0207,0.975,0.973,0.002,0.875,0.882,-0.007,0.630,0.709,-0.079,15.0
3,Jason LaBarbera,G,1250000,2010-11,0.0825,64300000,48300000,525000,ARI,17.0,48.0,46.58,-1.4,-0.095,0.946,0.948,-0.002,0.909,3.26,3.17,-0.10,-0.24,882.0,0.050,0.043,0.006,131.0,112.22,18.78,0.04,481.0,841.0,0.1676,0.5951,0.5694,0.0256,0.970,0.967,0.003,0.880,0.882,-0.003,0.649,0.722,-0.073,11.0
4,Tomas Vokoun,G,1500000,2010-11,0.0825,64300000,48300000,525000,FLA,57.0,137.0,145.11,8.1,0.151,0.956,0.953,0.003,0.922,2.55,2.70,0.15,1.35,3223.0,0.045,0.040,0.005,422.0,376.70,45.30,0.03,1616.0,2948.0,0.1905,0.5682,0.5466,0.0216,0.975,0.971,0.004,0.886,0.881,0.006,0.644,0.655,-0.011,28.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
239,Cayden Primeau,G,890000,2021-22,0.0123,82500000,61000000,750000,MTL,12.0,40.0,26.55,-13.4,-1.548,0.924,0.949,-0.026,0.868,4.62,3.07,-1.55,-2.24,519.0,0.056,0.042,0.014,64.0,67.44,-3.44,-0.01,262.0,484.0,0.1800,0.5763,0.5578,0.0186,0.954,0.971,-0.017,0.808,0.880,-0.072,0.593,0.694,-0.102,18.0
240,Jake Oettinger,G,4000000,2021-22,0.0123,82500000,61000000,750000,DAL,48.0,114.0,115.43,1.4,0.031,0.954,0.954,0.001,0.914,2.53,2.56,0.03,0.24,2707.0,0.037,0.038,-0.001,293.0,303.67,-10.67,-0.01,1217.0,2386.0,0.2023,0.5324,0.5322,0.0002,0.962,0.971,-0.009,0.901,0.880,0.021,0.731,0.665,0.066,17.0
242,Jake Allen,G,3850000,2021-22,0.0123,82500000,61000000,750000,MTL,35.0,107.0,104.85,-2.2,-0.068,0.947,0.948,-0.001,0.905,3.30,3.23,-0.07,-0.36,1947.0,0.049,0.041,0.008,251.0,254.85,-3.85,0.00,1016.0,1898.0,0.2006,0.5601,0.5466,0.0135,0.969,0.970,-0.002,0.871,0.876,-0.005,0.705,0.694,0.011,18.0
243,Pyotr Kochetkov,G,2000000,2021-22,0.0123,82500000,61000000,750000,CAR,3.0,6.0,5.04,-1.0,-0.404,0.941,0.951,-0.009,0.902,2.42,2.03,-0.39,-0.16,148.0,0.021,0.039,-0.018,19.0,12.01,6.99,0.11,55.0,96.0,0.1969,0.5980,0.5453,0.0527,0.943,0.971,-0.028,0.857,0.879,-0.021,1.000,0.683,0.317,3.0


In [101]:
goalies.to_csv('../data/goalies_cleaned.csv', index=False)

---
## Active Player Contracts

In [102]:
driver.quit()
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

In [103]:
# Query-string suffix for each of the 32 result pages; page 1 needs no suffix
pages = {str(n): '' if n == 1 else f'&pg={n}' for n in range(1, 33)}

In [104]:
from io import StringIO

base_url = 'https://www.capfriendly.com/browse/active?stats-season=2023&hide=clauses,age,handed,expiry-status,caphit,skater-stats,goalie-stats'

active_contracts = pd.DataFrame(columns=['player', 'pos', 'team', 'salary'])

for page, element in pages.items():

    driver.get(base_url + element)
    time.sleep(2)
    
    # Scroll down the page before locating the contracts table
    driver.execute_script("window.scrollBy(0, 800);")
    
    time.sleep(2)
    table_element = driver.find_element(By.XPATH, '//*[@id="brwt"]')
    time.sleep(2)
    # Wrap the HTML in StringIO; newer pandas versions no longer accept a raw HTML string
    table_data = pd.read_html(StringIO(table_element.get_attribute('outerHTML')))
    page_data = pd.DataFrame(table_data[0])
    page_data.columns = page_data.columns.str.lower()
    # join='inner' keeps only the four columns defined on active_contracts
    active_contracts = pd.concat([active_contracts, page_data], join='inner')


In [105]:
active_contracts

Unnamed: 0,player,pos,team,salary
0,1. Connor McDavid,C,EDM,"$12,000,000"
1,2. Artemi Panarin,LW,NYR,"$12,500,000"
2,3. Auston Matthews,C,TOR,"$7,950,000"
3,4. Erik Karlsson,RD,SJS,"$12,000,000"
4,5. Drew Doughty,RD,LAK,"$11,000,000"
...,...,...,...,...
37,1588. Christopher Gibson,G,SEA,"$750,000"
38,1589. Evan Cormier,G,WPG,"$750,000"
39,1590. Jérémy Groleau,LD,NJD,"$750,000"
40,1591. Connor Ingram,G,ARI,"$750,000"


In [106]:
test = active_contracts.copy()

In [107]:
active_contracts = test.copy()

In [108]:
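# Strip the leading rank prefix (e.g. "1. "); the dot is escaped so it matches a literal period only
active_contracts['player'] = active_contracts['player'].str.replace(r'^\d+\.\s*([^\d]+)$', r'\1', regex=True)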
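
Whether the dot in the rank-stripping pattern is escaped changes what the pattern can match. The following sketch compares the two variants on made-up strings; the `names` series is hypothetical, and the third entry (digits followed directly by a letter) is included only to expose the difference.

```python
import pandas as pd

# Hypothetical inputs: two rank-prefixed names and one string without a period.
names = pd.Series(["1. Connor McDavid", "1590. Jérémy Groleau", "12Tim Thomas"])

# Escaped dot: only a literal period after the digits is accepted.
strict = names.str.replace(r"^\d+\.\s*([^\d]+)$", r"\1", regex=True)

# Unescaped dot: any single character after the digits is consumed.
loose = names.str.replace(r"^\d+.\s*([^\d]+)$", r"\1", regex=True)

print(strict.tolist())
print(loose.tolist())
```

With the escaped dot, strings lacking the `N. ` prefix are left unchanged, whereas the unescaped form can silently swallow the first character of a name.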

In [109]:
active_contracts['pos'] = np.where(active_contracts['pos'].str.contains('D'), 'D',
                           np.where(active_contracts['pos'].str.contains('G'), 'G', 'F'))

In [110]:
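# Take an explicit copy so later column assignments do not act on a view of active_contracts
active_contracts_f = active_contracts[active_contracts['pos'] == 'F'].copy()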

In [111]:
active_contracts_d = active_contracts[active_contracts['pos'] == 'D'].copy()

In [112]:
active_contracts_g = active_contracts[active_contracts['pos'] == 'G'].copy()

In [113]:
active_contracts_f.loc[:, "player"] = active_contracts_f["player"] + ", " + active_contracts_f["pos"] + ", " + active_contracts_f["team"]

active_contracts_f = active_contracts_f.drop(columns=["pos", "team"])

In [114]:
active_contracts_d.loc[:, "player"] = active_contracts_d["player"] + ", " + active_contracts_d["pos"] + ", " + active_contracts_d["team"]

active_contracts_d = active_contracts_d.drop(columns=["pos", "team"])

In [115]:
active_contracts_g.loc[:, "player"] = active_contracts_g["player"] + ", " + active_contracts_g["pos"] + ", " + active_contracts_g["team"]

active_contracts_g = active_contracts_g.drop(columns=["pos", "team"])

In [116]:
active_contracts_f.head()

Unnamed: 0,player,salary
0,"Connor McDavid, F, EDM","$12,000,000"
1,"Artemi Panarin, F, NYR","$12,500,000"
2,"Auston Matthews, F, TOR","$7,950,000"
5,"John Tavares, F, TOR","$7,950,000"
6,"Mitchell Marner, F, TOR","$8,000,000"


In [117]:
skaters_2023 = skater_stats.merge(salary_cap, on='season', how='left')

skaters_2023 = skaters_2023.merge(team_standings, on=['team', 'season'], how='left')

skaters_2023 = skaters_2023[skaters_2023['season'] == '2022-23']

In [118]:
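# Copy the subsets to avoid SettingWithCopyWarning when the player key is rewritten below
forwards_2023 = skaters_2023[skaters_2023['pos'] == 'F'].copy()

In [119]:
defense_2023 = skaters_2023[skaters_2023['pos'] == 'D'].copy()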

In [142]:
goalies_2023 = goalie_stats.merge(salary_cap, on='season', how='left')

goalies_2023 = goalies_2023.merge(team_standings, on=['team', 'season'], how='left')

goalies_2023 = goalies_2023[goalies_2023['season'] == '2022-23'].copy()

In [121]:
forwards_2023.loc[:, "player"] = forwards_2023["player"] + ", " + forwards_2023["pos"] + ", " + forwards_2023["team"]

# Drop the columns now encoded in the composite player key (season is constant after filtering)
forwards_2023 = forwards_2023.drop(columns=["pos", "team", "season"])

In [122]:
defense_2023.loc[:, "player"] = defense_2023["player"] + ", " + defense_2023["pos"] + ", " + defense_2023["team"]

# Drop the columns now encoded in the composite player key (season is constant after filtering)
defense_2023 = defense_2023.drop(columns=["pos", "team", "season"])

In [144]:
goalies_2023.loc[:, "player"] = goalies_2023["player"] + ", " + goalies_2023["pos"] + ", " + goalies_2023["team"]

# Drop the columns now encoded in the composite player key (season is constant after filtering)
goalies_2023 = goalies_2023.drop(columns=["pos", "team", "season"])

In [124]:
forwards_2023.reset_index(drop=True, inplace=True)

defense_2023.reset_index(drop=True, inplace=True)

goalies_2023.reset_index(drop=True, inplace=True)

In [125]:
active_contracts_f = active_contracts_f[active_contracts_f['player'].isin(forwards_2023['player'])]
active_contracts_f.shape

(531, 2)

In [126]:
active_contracts_d = active_contracts_d[active_contracts_d['player'].isin(defense_2023['player'])]
active_contracts_d.shape

(290, 2)

In [127]:
active_contracts_g = active_contracts_g[active_contracts_g['player'].isin(goalies_2023['player'])]
active_contracts_g.shape

(92, 2)

In [128]:
forwards_2023 = forwards_2023[forwards_2023['player'].isin(active_contracts_f['player'])]

forwards_2023.shape, active_contracts_f.shape

((526, 80), (531, 2))

In [129]:
# Remove exact duplicate contract rows so both frames hold one row per player
active_contracts_f = active_contracts_f.drop_duplicates()

forwards_2023.shape, active_contracts_f.shape

((526, 80), (526, 2))
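The shapes above show five more contract rows than stat rows before `drop_duplicates()` is applied. Before dropping anything, it can help to list which keys repeat. The sketch below uses made-up rows (placeholder names, teams, and salaries) and also illustrates that `drop_duplicates()` with no arguments only removes rows that are identical in every column, so a player listed twice with different salaries would survive unless `subset` is given.

```python
import pandas as pd

# Hypothetical contract rows: A is an exact duplicate, B repeats with a different salary
contracts = pd.DataFrame({
    "player": ["A, F, XXX", "A, F, XXX", "B, F, YYY", "B, F, YYY", "C, F, ZZZ"],
    "salary": ["$1,000,000", "$1,000,000", "$2,000,000", "$2,500,000", "$750,000"],
})

# Every row whose player key appears more than once
repeated = contracts[contracts["player"].duplicated(keep=False)]
print(repeated)

exact = contracts.drop_duplicates()                                # drops only the identical A row
by_key = contracts.drop_duplicates(subset="player", keep="first")  # one row per player key
print(len(exact), len(by_key))
```

Which row to keep when salaries disagree is a judgment call; `keep="first"` here is only a placeholder choice.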

In [130]:
defense_2023 = defense_2023[defense_2023['player'].isin(active_contracts_d['player'])]

defense_2023.shape, active_contracts_d.shape

((286, 80), (290, 2))

In [131]:
# Remove exact duplicate contract rows so both frames hold one row per player
active_contracts_d = active_contracts_d.drop_duplicates()

defense_2023.shape, active_contracts_d.shape

((286, 80), (286, 2))

In [146]:
goalies_2023 = goalies_2023[goalies_2023['player'].isin(active_contracts_g['player'])]

goalies_2023.shape, active_contracts_g.shape

((88, 42), (92, 2))
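Unlike the forwards and defense, the goalie frames still differ in length at this point (88 stat rows against 92 contract rows), and no de-duplication cell follows. Whether duplicates explain the gap is not shown in the notebook. One way to check, sketched here on made-up rows, is to drop exact duplicates and then merge with `validate="one_to_one"` and `indicator=True`, which raises an error if either side still repeats a key and flags any rows without a partner.

```python
import pandas as pd

# Hypothetical goalie rows; names, teams, and values are placeholders
stats = pd.DataFrame({"player": ["G1, G, AAA", "G2, G, BBB"], "games_played": [50, 30]})
contracts = pd.DataFrame({
    "player": ["G1, G, AAA", "G1, G, AAA", "G2, G, BBB"],
    "salary": ["$5,000,000", "$5,000,000", "$750,000"],
})

contracts = contracts.drop_duplicates()

# validate="one_to_one" raises pandas.errors.MergeError if a key repeats on either side
merged = stats.merge(contracts, on="player", how="outer",
                     validate="one_to_one", indicator=True)

# Rows present in only one of the two frames
unmatched = merged[merged["_merge"] != "both"]
print(merged.shape, len(unmatched))
```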

In [133]:
forwards_2023.head(3)

Unnamed: 0,player,games_played,icetime,expected_goals,goals,assists,points,primary_assists,secondary_assists,shifts,share_of_possible_icetime,pct_of_shift_starts_in_offensive_zone,pct_of_shift_starts_in_neutral_zone,pct_of_shift_starts_in_defensive_zone,pct_of_shift_starts_on_fly,hits,pim,pim_drawn,pim_differential,shots_blocked_by_player,shots_blocked_by_player_per_60,takeaways,giveaways,defensive_zone_giveaways,faceoff_win_pct,goals_above_expected,expected_goals_per_60_minutes,goals_per_60_minutes,assists_per_60_minutes,points_per_60_minutes,shots_on_goal_per_60_minutes,shot_attempts_per_60_minutes,shooting_pct,shooting_pct_on_unblocked_shots,expected_shooting_pct_on_unblocked_shots,shooting_pct_on_unblocked_shots_above_expected,shot_attempts,shots_on_goal,shots_that_missed_net,shots_that_were_blocked,pct_of_unblocked_shots_that_missed_net,expected_pct_of_unblocked_shots_that_missed_net,net_miss_pct_above_expected,high_danger_unblocked_shot_attempts,medium_danger_unblocked_shot_attempts,low_danger_unblocked_shot_attempts,high_danger_xgoals,medium_danger_xgoals,low_danger_xgoals,on-ice_shot_attempt_pct_(corsi),on-ice_unblocked_shot_attempt_pct_(fenwick),on-ice_goals_pct,on-ice_expected_goals_pct,off-ice_expected_goals_pct,relative_expected_goals_pct,on-ice_score_adjusted_expected_goals_pct,on-ice_score/flurry_adjusted_expected_goals_pct,on-ice_expected_goals_against_per_60_minutes,on-ice_shot_attempts_against_per_60_minutes,on-ice_high_danger_shot_attempts_against_per_60_minutes,flurry_adjusted_xgoals,on-ice_goal_differential,on-ice_expected_goals_differential,rebounds_created,xrebounds_created,rebounds_created_above_expected,xgoals_on_rebounds_shots,share_of_xgoals_from_rebounds_shots,xgoals_from_non_rebounds,xgoals_of_expected_rebounds,created_xgoals,created_xgoals_minus_actual_xgoals,shooting_talent_above_average,shooting_talent_adjusted_expected_goals,goals_above_shooting_talent,pct_change,upper_limit,lower_limit,min_salary,final_standing
1,"Robby Fabbri, F, DET",28,447,6.8,7,9,16,5,4,504,0.264,0.188,0.175,0.107,0.53,49,12,12,0,15,2.01,6,12,3,0.385,0.2,0.91,0.94,1.21,2.14,4.69,8.71,0.2,0.132,0.113,0.019,65,35,18,12,0.34,0.264,0.075,10,14,29,4.02,1.73,1.07,0.46,0.46,0.622,0.518,0.448,0.07,0.517,0.516,2.69,57.19,1.88,6.6,9,1.5,5,2.9,2.2,1.9,0.272,5.0,0.7,5.6,-1.2,0.148,7.8,-0.8,0.0121,83500000,61700000,775000,25.0
2,"Michael Bunting, F, TOR",82,1295,28.0,23,26,49,19,7,1586,0.261,0.178,0.158,0.061,0.602,85,83,87,-4,18,0.83,49,38,14,0.5,-5.0,1.3,1.07,1.2,2.27,8.06,13.8,0.132,0.091,0.111,-0.02,298,174,79,45,0.312,0.249,0.063,39,78,136,13.3,9.81,4.93,0.55,0.56,0.642,0.584,0.525,0.059,0.588,0.581,2.81,54.66,2.87,26.4,39,24.6,31,15.4,15.6,8.2,0.292,19.9,3.5,23.3,-4.7,0.043,29.2,-6.2,0.0121,83500000,61700000,775000,4.0
3,"Alex Iafallo, F, LAK",59,961,14.7,14,22,36,13,9,1294,0.268,0.131,0.168,0.15,0.551,27,20,24,-4,33,2.06,28,9,4,0.538,-0.7,0.92,0.87,1.37,2.25,8.3,13.6,0.105,0.082,0.082,0.0,218,133,37,48,0.218,0.276,-0.059,15,45,110,5.8,5.58,3.34,0.51,0.53,0.559,0.558,0.523,0.035,0.56,0.554,2.75,55.78,2.87,14.2,12,11.5,12,8.9,3.1,2.9,0.197,11.8,2.0,13.8,-0.9,-0.044,14.1,-0.1,0.0121,83500000,61700000,775000,14.0


In [151]:
ordered_columns = sorted(forwards_2023.columns)
forwards_2023 = forwards_2023.reindex(columns=ordered_columns)
defense_2023 = defense_2023.reindex(columns=ordered_columns)

ordered_columns = sorted(goalies_2023.columns)
goalies_2023 = goalies_2023.reindex(columns=ordered_columns)
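Reusing the forwards' column order for the defense frame assumes both frames have exactly the same columns. `reindex` does not raise when a label is missing; it adds that column filled with `NaN` and silently drops any column not in the list. A small guard, sketched below on toy frames, makes that assumption explicit.

```python
import pandas as pd

# Toy frames with the same columns in different orders
forwards = pd.DataFrame({"player": ["A"], "goals": [10], "assists": [12]})
defense = pd.DataFrame({"player": ["B"], "assists": [20], "goals": [3]})

ordered_columns = sorted(forwards.columns)

# Fail loudly if the column sets differ, instead of letting reindex insert NaN columns
assert set(defense.columns) == set(ordered_columns), "column sets differ"

forwards = forwards.reindex(columns=ordered_columns)
defense = defense.reindex(columns=ordered_columns)
print(list(defense.columns))
```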

In [135]:
forwards_2023.head(3)

Unnamed: 0,assists,assists_per_60_minutes,created_xgoals,created_xgoals_minus_actual_xgoals,defensive_zone_giveaways,expected_goals,expected_goals_per_60_minutes,expected_pct_of_unblocked_shots_that_missed_net,expected_shooting_pct_on_unblocked_shots,faceoff_win_pct,final_standing,flurry_adjusted_xgoals,games_played,giveaways,goals,goals_above_expected,goals_above_shooting_talent,goals_per_60_minutes,high_danger_unblocked_shot_attempts,high_danger_xgoals,hits,icetime,low_danger_unblocked_shot_attempts,low_danger_xgoals,lower_limit,medium_danger_unblocked_shot_attempts,medium_danger_xgoals,min_salary,net_miss_pct_above_expected,off-ice_expected_goals_pct,on-ice_expected_goals_against_per_60_minutes,on-ice_expected_goals_differential,on-ice_expected_goals_pct,on-ice_goal_differential,on-ice_goals_pct,on-ice_high_danger_shot_attempts_against_per_60_minutes,on-ice_score/flurry_adjusted_expected_goals_pct,on-ice_score_adjusted_expected_goals_pct,on-ice_shot_attempt_pct_(corsi),on-ice_shot_attempts_against_per_60_minutes,on-ice_unblocked_shot_attempt_pct_(fenwick),pct_change,pct_of_shift_starts_in_defensive_zone,pct_of_shift_starts_in_neutral_zone,pct_of_shift_starts_in_offensive_zone,pct_of_shift_starts_on_fly,pct_of_unblocked_shots_that_missed_net,pim,pim_differential,pim_drawn,player,points,points_per_60_minutes,primary_assists,rebounds_created,rebounds_created_above_expected,relative_expected_goals_pct,secondary_assists,share_of_possible_icetime,share_of_xgoals_from_rebounds_shots,shifts,shooting_pct,shooting_pct_on_unblocked_shots,shooting_pct_on_unblocked_shots_above_expected,shooting_talent_above_average,shooting_talent_adjusted_expected_goals,shot_attempts,shot_attempts_per_60_minutes,shots_blocked_by_player,shots_blocked_by_player_per_60,shots_on_goal,shots_on_goal_per_60_minutes,shots_that_missed_net,shots_that_were_blocked,takeaways,upper_limit,xgoals_from_non_rebounds,xgoals_of_expected_rebounds,xgoals_on_rebounds_shots,xrebounds_created
1,9,1.21,5.6,-1.2,3,6.8,0.91,0.264,0.113,0.385,25.0,6.6,28,12,7,0.2,-0.8,0.94,10,4.02,49,447,29,1.07,61700000,14,1.73,775000,0.075,0.448,2.69,1.5,0.518,9,0.622,1.88,0.516,0.517,0.46,57.19,0.46,0.0121,0.107,0.175,0.188,0.53,0.34,12,0,12,"Robby Fabbri, F, DET",16,2.14,5,5,2.2,0.07,4,0.264,0.272,504,0.2,0.132,0.019,0.148,7.8,65,8.71,15,2.01,35,4.69,18,12,6,83500000,5.0,0.7,1.9,2.9
2,26,1.2,23.3,-4.7,14,28.0,1.3,0.249,0.111,0.5,4.0,26.4,82,38,23,-5.0,-6.2,1.07,39,13.3,85,1295,136,4.93,61700000,78,9.81,775000,0.063,0.525,2.81,24.6,0.584,39,0.642,2.87,0.581,0.588,0.55,54.66,0.56,0.0121,0.061,0.158,0.178,0.602,0.312,83,-4,87,"Michael Bunting, F, TOR",49,2.27,19,31,15.6,0.059,7,0.261,0.292,1586,0.132,0.091,-0.02,0.043,29.2,298,13.8,18,0.83,174,8.06,79,45,49,83500000,19.9,3.5,8.2,15.4
3,22,1.37,13.8,-0.9,4,14.7,0.92,0.276,0.082,0.538,14.0,14.2,59,9,14,-0.7,-0.1,0.87,15,5.8,27,961,110,3.34,61700000,45,5.58,775000,-0.059,0.523,2.75,11.5,0.558,12,0.559,2.87,0.554,0.56,0.51,55.78,0.53,0.0121,0.15,0.168,0.131,0.551,0.218,20,-4,24,"Alex Iafallo, F, LAK",36,2.25,13,12,3.1,0.035,9,0.268,0.197,1294,0.105,0.082,0.0,-0.044,14.1,218,13.6,33,2.06,133,8.3,37,48,28,83500000,11.8,2.0,2.9,8.9


In [153]:
active_contracts_f.to_csv('../data/active_contracts_f.csv', index=False)
active_contracts_d.to_csv('../data/active_contracts_d.csv', index=False)
active_contracts_g.to_csv('../data/active_contracts_g.csv', index=False)

forwards_2023.to_csv('../data/forwards_2023.csv', index=False)
defense_2023.to_csv('../data/defense_2023.csv', index=False)
goalies_2023.to_csv('../data/goalies_2023.csv', index=False)