<a id="#top"></a><h1>NHL Vezina Trophy Machine Learning Project</h1>

<h2>Objective : To collect the necessary stats and information from hockey-reference.com to predict each year's Vezina Trophy winner using machine learning</h2>
    
<h3>Data set location : All data for the project was scrapped using python from the hockey-reference.com website. All pages will be saved locally as HTML files to reduce requests to the website.</h3>

<p><strong>About the project :</strong> The vezina trophy, named for NHL Goalie George Vezina, is awarded each year to the best goalie in the NHL. Before the 1982 season, the award was given to the goalie with the lowest goals against average (GAA). GAA is calculated by the number of goals allowed by a goaltender over 60 minutes of play. Starting from 1982 the award has been voted on by a panel consisting of all the general managers in the NHL. A voting system of a 3 person ballot where votes are scored as 5pts for first place, 3pts for second place, and 1pt for 3rd place has been used.<br><br>

Invariably the voting process introduces an element of bias and ambiguity to the award. General Managers keep or lose their job based on the performance of their team. The goalie, as the only player to be on the ice for the full game, has a unique and important impact on the result of a game and the overall result of the team for a season. As NHL commentator Jeff Marek says "if you have the goalie it's 70 percent of your team, if you don't it's 100 percent.". In short, A good and reliable goalie over a long time is incredibly hard to find. Ergo the General Manager's view of a particular goalie may be tainted by their own feelings about job security as it's directly related to season results. The question we would like to answer is "How correctly can we predict the GMs voting results using the statistical elements at our disposal?".<br><br>

There are some historical anecdotal evidence to support the idea that sometimes GMs are biased in their voting for various reasons. It's extremely rare for a goalie to win the Vezina in their rookie season. Also perception of the team can impact voting. Martin Brodeur, who has the most wins of any NHL Goalie, didn't win his first Vezina until 2003, 10 years after starting his career. All this despite having the best GAA in 1997 and 1998 (Jennings trophy winner). Perhaps this is because he played during this time against Dominik Hasek, whom many consider to be the most talented goaltender to ever play the game, but there is some speculation is that he was unfairly penalized for playing on the New Jersey Devils. The Devils at this time were a team noted for their strong defensive team play from 1994-2002 and the story goes it was only after winning gold for Canada at the Olympics in 2002 that GMs changed their opinion about Brodeur's worthiness for the Vezina.</p>

<p><strong>Process :</strong> The project will be split into 2 parts. Web-scraping and cleaning the data, and finally the Machine Learning model.

Web-Scraping : There are 3 primary information sources we will scrape. The award results, individual goalie stats, and team stats. All of which will be done for each season from 1982 to present.</p>

***

Data prep and cleaning:
- [Scraping/Cleaning Vezina Trophy Results](#award_results)
- [Scraping/Cleaning Goalie Stats](#goalie_stats)
- [Scraping/Cleaning Team Stats](#team_stats)

In [53]:
import time

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

In [54]:
# Looking at all years from 1982 to 2021
years = list(range(1982, 2021))
# Removing 2005 because this season was canceled due to the lockout
years.remove(2005)

In [3]:
# Getting awards results for each year.
url_awards = "https://www.hockey-reference.com/awards/voting-{}.html"

In [4]:
# A first attempt at scraping the data was made using just get requests, but the page wouldn't fully load must be scrolled. Automating the browser with selenium for this.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

ser = Service("C:\\chromedriver.exe")
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=ser, options=op)

In [5]:
# looping through our seasons to scrap/save award results
for year in years:
    url = url_awards.format(year)
    driver.get(url)
    driver.execute_script("window.scrollTo(1,10000)")
    time.sleep(5)
    html = driver.page_source
    with open("awards/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(html)
    time.sleep(5)

In [55]:
# initiate beautiful soup class
soup = BeautifulSoup(page, "html.parser")

In [74]:
# # finding and removing, decomposing, the over_header div that is above our stats table
data = []
for year in years:
    with open("awards/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")
    year_results = soup.find_all(id="div_vezina_stats")
    temp_df = pd.read_html(str(year_results))[0]
    temp_df["Year"] = year
    data.append(temp_df)
vezina_results = pd.concat(data)

In [75]:
# checking results
vezina_results

Unnamed: 0,Place,Player,Age,Tm,Pos,Votes,Vote%,1st,2nd,3rd,4th,5th,W,L,T/O,GAA,SV%,OPS,DPS,GPS,PS,Year
0,1,Billy Smith,31,NYI,G,51,48.57,8.0,3.0,2.0,,,32,9,4,2.97,0.898,0.0,0.0,10.5,10.5,1982
1,2,Grant Fuhr,19,EDM,G,31,29.52,2.0,5.0,6.0,,,28,5,14,3.31,0.898,0.0,0.0,9.9,9.9,1982
2,3,Michel Dion,27,PIT,G,30,28.57,3.0,3.0,6.0,,,25,24,12,3.80,0.878,0.0,0.0,10.4,10.4,1982
3,4,Dan Bouchard,31,QUE,G,19,18.10,2.0,3.0,0.0,,,27,22,11,3.87,0.867,0.0,0.0,10.1,10.1,1982
4,5,Rick Wamsley,22,MTL,G,18,17.14,3.0,1.0,0.0,,,23,7,7,2.75,0.893,0.0,0.0,9.2,9.2,1982
5,6,Richard Brodeur,29,VAN,G,16,15.24,1.0,3.0,2.0,,,20,18,12,3.36,0.893,0.0,0.0,10.3,10.3,1982
6,7,Don Edwards,26,BUF,G,10,9.52,1.0,1.0,2.0,,,26,23,9,3.52,0.882,0.0,0.0,11.3,11.3,1982
7,8,Glenn Resch,33,CLR,G,5,4.76,1.0,0.0,0.0,,,16,31,11,4.04,0.879,0.0,0.0,9.0,9.0,1982
8,8,Gilles Meloche,31,MNS,G,5,4.76,0.0,1.0,2.0,,,26,15,9,3.48,0.894,0.0,0.0,10.0,10.0,1982
0,1,Pete Peeters,25,BOS,G,105,100.00,21.0,0.0,0.0,,,40,12,9,2.37,0.903,0.0,0.0,16.5,16.5,1983


In [76]:
# resetting our index
vezina_results = vezina_results.reset_index(drop=True)

In [77]:
# checking a few columns that maybe can be dropped
pd.set_option("display.max_columns", None)
pd.set_option("display.min_rows", 25)
vezina_results.loc[vezina_results["5th"] == 0]

Unnamed: 0,Place,Player,Age,Tm,Pos,Votes,Vote%,1st,2nd,3rd,4th,5th,W,L,T/O,GAA,SV%,OPS,DPS,GPS,PS,Year
346,1,Sergei Bobrovsky,28,CBJ,G,138,92.0,25.0,4.0,1.0,0.0,0.0,41,17,5,2.06,0.931,0.0,0.0,14.9,14.9,2017
347,2,Braden Holtby,27,WSH,G,87,58.0,4.0,21.0,4.0,0.0,0.0,42,13,6,2.07,0.925,0.0,0.0,12.3,12.3,2017
348,3,Carey Price,29,MTL,G,19,12.67,0.0,2.0,13.0,0.0,0.0,37,20,5,2.23,0.923,0.0,0.0,12.6,12.6,2017
349,4,Cam Talbot,29,EDM,G,17,11.33,1.0,2.0,6.0,0.0,0.0,42,22,8,2.39,0.919,0.0,0.0,14.0,14.0,2017
350,5,Devan Dubnyk,30,MIN,G,8,5.33,0.0,1.0,5.0,0.0,0.0,40,19,5,2.25,0.923,0.0,0.0,13.1,13.1,2017
351,6,Martin Jones,27,SJS,G,1,0.67,0.0,0.0,1.0,0.0,0.0,35,23,6,2.4,0.912,0.0,0.0,9.9,9.9,2017


In [78]:
# 4th and 5th are not needed as there are no 4th/5th place votes for the Vezina, OPS/DPS are Offensive/Defensive point share. Not applicaple as all Goaltenders are 0. GPS, Goalie point share will be kept.
cols_to_drop = ["4th", "5th", "OPS", "DPS"]
vezina_results = vezina_results.drop(cols_to_drop, axis=1)
vezina_results.head()

Unnamed: 0,Place,Player,Age,Tm,Pos,Votes,Vote%,1st,2nd,3rd,W,L,T/O,GAA,SV%,GPS,PS,Year
0,1,Billy Smith,31,NYI,G,51,48.57,8.0,3.0,2.0,32,9,4,2.97,0.898,10.5,10.5,1982
1,2,Grant Fuhr,19,EDM,G,31,29.52,2.0,5.0,6.0,28,5,14,3.31,0.898,9.9,9.9,1982
2,3,Michel Dion,27,PIT,G,30,28.57,3.0,3.0,6.0,25,24,12,3.8,0.878,10.4,10.4,1982
3,4,Dan Bouchard,31,QUE,G,19,18.1,2.0,3.0,0.0,27,22,11,3.87,0.867,10.1,10.1,1982
4,5,Rick Wamsley,22,MTL,G,18,17.14,3.0,1.0,0.0,23,7,7,2.75,0.893,9.2,9.2,1982


In [79]:
# for later we will need both the team abbreviation and the long team name, adding the long team name to the award results.
team_names = {}
with open("team_names.txt") as f:
    lines = f.readlines()
    for line in lines[1:]:
        abbrev, name = line.replace("\n", "").split(",")
        team_names[abbrev] = name

In [80]:
vezina_results["Team"] = vezina_results["Tm"].map(team_names)
vezina_results

Unnamed: 0,Place,Player,Age,Tm,Pos,Votes,Vote%,1st,2nd,3rd,W,L,T/O,GAA,SV%,GPS,PS,Year,Team
0,1,Billy Smith,31,NYI,G,51,48.57,8.0,3.0,2.0,32,9,4,2.97,0.898,10.5,10.5,1982,New York Islanders
1,2,Grant Fuhr,19,EDM,G,31,29.52,2.0,5.0,6.0,28,5,14,3.31,0.898,9.9,9.9,1982,Edmonton Oilers
2,3,Michel Dion,27,PIT,G,30,28.57,3.0,3.0,6.0,25,24,12,3.80,0.878,10.4,10.4,1982,Pittsburgh Penguins
3,4,Dan Bouchard,31,QUE,G,19,18.10,2.0,3.0,0.0,27,22,11,3.87,0.867,10.1,10.1,1982,Quebec Nordiques
4,5,Rick Wamsley,22,MTL,G,18,17.14,3.0,1.0,0.0,23,7,7,2.75,0.893,9.2,9.2,1982,Montreal Canadiens
5,6,Richard Brodeur,29,VAN,G,16,15.24,1.0,3.0,2.0,20,18,12,3.36,0.893,10.3,10.3,1982,Vancouver Canucks
6,7,Don Edwards,26,BUF,G,10,9.52,1.0,1.0,2.0,26,23,9,3.52,0.882,11.3,11.3,1982,Buffalo Sabres
7,8,Glenn Resch,33,CLR,G,5,4.76,1.0,0.0,0.0,16,31,11,4.04,0.879,9.0,9.0,1982,Colorado Rockies
8,8,Gilles Meloche,31,MNS,G,5,4.76,0.0,1.0,2.0,26,15,9,3.48,0.894,10.0,10.0,1982,Minnesota North Stars
9,1,Pete Peeters,25,BOS,G,105,100.00,21.0,0.0,0.0,40,12,9,2.37,0.903,16.5,16.5,1983,Boston Bruins


In [81]:
vezina_results.loc[vezina_results["Team"].isnull()]

Unnamed: 0,Place,Player,Age,Tm,Pos,Votes,Vote%,1st,2nd,3rd,W,L,T/O,GAA,SV%,GPS,PS,Year,Team
55,5,Bob Froese,28,TOT,G,8,7.62,1.0,1.0,0.0,17,11,0,3.64,0.885,5.1,5.1,1987,
83,7,Tom Barrasso,23,TOT,G,3,2.86,0.0,1.0,0.0,20,22,7,4.21,0.88,9.4,9.4,1989,
87,11,Kelly Hrudey,28,TOT,G,1,0.95,0.0,0.0,1.0,28,28,5,3.66,0.882,10.8,10.8,1989,
93,5,Mike Liut,34,TOT,G,4,3.81,0.0,1.0,1.0,19,16,1,2.53,0.905,7.0,7.0,1990,
156,9,Patrick Roy,30,TOT,G,5,3.85,1.0,0.0,0.0,34,24,2,2.78,0.908,11.3,11.3,1996,
158,10,Bill Ranford,29,TOT,G,1,0.77,0.0,0.0,1.0,34,30,9,3.29,0.885,8.5,8.5,1996,
266,8,Cristobal Huet,32,TOT,G,4,2.67,0.0,1.0,1.0,32,14,6,2.32,0.92,11.1,11.1,2008,
330,3,Devan Dubnyk,28,TOT,G,28,18.67,1.0,4.0,11.0,36,14,4,2.07,0.929,12.6,12.6,2015,
378,6,Robin Lehner,28,TOT,G,3,2.0,0.0,1.0,0.0,19,10,5,2.89,0.92,8.3,8.3,2020,


In [82]:
# saving as CSV for later
vezina_results.to_csv("vezina_results.csv", encoding="utf-8")

***

<a id="#award_results"></a><h2>Vezina Trophy Voting Results</h2>

<p><strong>Notes :</strong>All voting results were saved. Certain goalies who played for more than 1 team during a year where they received votes are listed as playing for "TOT" or Total. The team for whom they played the most is not in the data at this level, but can be recovered from player stats in the next step.</p>

<p><a href="#top">[Return to top]</a></p>

In [33]:
# Repeating the process for individual stats
stats_url = "https://www.hockey-reference.com/leagues/NHL_{}_goalies.html"
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

ser = Service("C:\\chromedriver.exe")
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=ser, options=op)

In [34]:
# looping through our seasons.
for year in years:
    url = stats_url.format(year)
    driver.get(url)
    driver.execute_script("window.scrollTo(1,10000)")
    time.sleep(5)
    html = driver.page_source
    with open("stats/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(html)
    time.sleep(5)

In [83]:
# initiate beautiful soup class, scrape out yearly goalie stats and concat them into one long big dataframe
soup = BeautifulSoup(page, "html.parser")
data = []
for year in years:
    with open("stats/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()
    soup.find("tr", class_="thead").decompose()
    soup.find("tr", class_="thead").decompose()
    soup.find("tr", class_="thead").decompose()
    try:
        soup.find("tr", class_="thead").decompose()
    except:
        pass
    soup.find("tr", class_="over_header").decompose()
    soup = BeautifulSoup(page, "html.parser")
    stats = soup.find(id="stats")
    temp_df = pd.read_html(str(stats))[0]
    temp_df["Year"] = year
    data.append(temp_df)
goalie_stats = pd.concat(data)
goalie_stats = goalie_stats.reset_index(drop=True).droplevel(level=0, axis=1)
goalie_stats.rename(columns={"": "Year"}, inplace=True)

In [84]:
# checking for goalies that played for more than one team.
goalie_stats.loc[goalie_stats["Player"] == "Michael Hutchinson"]

Unnamed: 0,Rk,Player,Age,Tm,GP,GS,W,L,T/O,GA,SA,SV,SV%,GAA,SO,GPS,MIN,QS,QS%,RBS,GA%-,GSAA,G,A,PTS,PIM,Year
2905,42,Michael Hutchinson,23,WPG,3,3,2,1,0,5,88,83,0.943,1.64,0,0.8,183,3,1.0,0,,,0,0,0,0,2014
3018,39,Michael Hutchinson,24,WPG,38,36,21,10,5,85,986,901,0.914,2.38,2,5.8,2138,19,0.528,7,101.0,-0.84,0,0,0,0,2015
3118,39,Michael Hutchinson,25,WPG,30,25,9,15,3,75,805,730,0.907,2.84,0,4.0,1586,9,0.36,3,109.0,-6.46,0,0,0,0,2016
3227,41,Michael Hutchinson,26,WPG,28,20,9,12,3,67,690,623,0.903,2.92,1,3.2,1378,8,0.4,1,112.0,-7.29,0,0,0,0,2017
3333,44,Michael Hutchinson,27,WPG,3,3,2,1,0,7,75,68,0.907,3.26,0,0.4,129,2,0.667,1,,,0,1,1,0,2018
3443,44,Michael Hutchinson,28,TOT,9,8,3,4,2,27,239,212,0.887,3.27,1,0.8,496,2,0.25,2,,,0,0,0,0,2019
3444,44,Michael Hutchinson,28,TOR,5,5,2,3,0,13,152,139,0.914,2.64,1,0.9,295,2,0.4,0,,,0,0,0,0,2019
3445,44,Michael Hutchinson,28,FLA,4,3,1,1,2,14,87,73,0.839,4.17,0,-0.2,201,0,0.0,2,,,0,0,0,0,2019
3550,37,Michael Hutchinson,29,TOT,16,12,5,9,1,49,439,390,0.888,3.47,1,1.5,847,4,0.333,0,123.0,-9.3,0,0,0,0,2020
3551,37,Michael Hutchinson,29,TOR,15,11,4,9,1,48,421,373,0.886,3.66,1,1.3,787,3,0.273,0,126.0,-9.93,0,0,0,0,2020


In [85]:
# some players played for multiple teams during a season. Their records are recorded as a line TOT for the yearly total and then each team underneath. We will group by year and player, then replace the TOT team name with the team for whom they played the most that season.
def single_row(df):
    if df.shape[0] == 1:
        return df
    else:
        row = df[df["Tm"] == "TOT"]
        row["Tm"] = df.iloc[1, :]["Tm"]
        return row


goalie_stats = goalie_stats.groupby(["Player", "Year"]).apply(single_row)
goalie_stats.index = goalie_stats.index.droplevel()
goalie_stats.index = goalie_stats.index.droplevel()
goalie_stats = goalie_stats.reset_index(drop=True)
goalie_stats.loc[goalie_stats["Player"] == "Michael Hutchinson"]

Unnamed: 0,Rk,Player,Age,Tm,GP,GS,W,L,T/O,GA,SA,SV,SV%,GAA,SO,GPS,MIN,QS,QS%,RBS,GA%-,GSAA,G,A,PTS,PIM,Year
1995,42,Michael Hutchinson,23,WPG,3,3,2,1,0,5,88,83,0.943,1.64,0,0.8,183,3,1.0,0,,,0,0,0,0,2014
1996,39,Michael Hutchinson,24,WPG,38,36,21,10,5,85,986,901,0.914,2.38,2,5.8,2138,19,0.528,7,101.0,-0.84,0,0,0,0,2015
1997,39,Michael Hutchinson,25,WPG,30,25,9,15,3,75,805,730,0.907,2.84,0,4.0,1586,9,0.36,3,109.0,-6.46,0,0,0,0,2016
1998,41,Michael Hutchinson,26,WPG,28,20,9,12,3,67,690,623,0.903,2.92,1,3.2,1378,8,0.4,1,112.0,-7.29,0,0,0,0,2017
1999,44,Michael Hutchinson,27,WPG,3,3,2,1,0,7,75,68,0.907,3.26,0,0.4,129,2,0.667,1,,,0,1,1,0,2018
2000,44,Michael Hutchinson,28,TOR,9,8,3,4,2,27,239,212,0.887,3.27,1,0.8,496,2,0.25,2,,,0,0,0,0,2019
2001,37,Michael Hutchinson,29,TOR,16,12,5,9,1,49,439,390,0.888,3.47,1,1.5,847,4,0.333,0,123.0,-9.3,0,0,0,0,2020


In [86]:
# checking our work to make sure no TOT teams remain
goalie_stats.loc[goalie_stats["Tm"] == "TOT"]

Unnamed: 0,Rk,Player,Age,Tm,GP,GS,W,L,T/O,GA,SA,SV,SV%,GAA,SO,GPS,MIN,QS,QS%,RBS,GA%-,GSAA,G,A,PTS,PIM,Year


In [87]:
# for later we will need both the team abbreviation and the long team name, adding the long team name to the award results.
team_names = {}
with open("team_names.txt") as f:
    lines = f.readlines()
    for line in lines[1:]:
        abbrev, name = line.replace("\n", "").split(",")
        team_names[abbrev] = name

In [88]:
goalie_stats["Team"] = goalie_stats["Tm"].map(team_names)
goalie_stats

Unnamed: 0,Rk,Player,Age,Tm,GP,GS,W,L,T/O,GA,SA,SV,SV%,GAA,SO,GPS,MIN,QS,QS%,RBS,GA%-,GSAA,G,A,PTS,PIM,Year,Team
0,21,Aaron Dell,27,SJS,20,17,11,6,1,37,533,496,.931,2.00,1,4.2,1111,12,.706,1,80,9.13,0,0,0,0,2017,San Jose Sharks
1,20,Aaron Dell,28,SJS,29,22,15,5,4,67,775,708,.914,2.64,2,4.4,1522,11,.500,4,98,1.02,0,1,1,0,2018,San Jose Sharks
2,22,Aaron Dell,29,SJS,25,20,10,8,4,70,613,543,.886,3.17,2,1.9,1323,6,.300,6,127,-14.75,0,0,0,0,2019,San Jose Sharks
3,16,Aaron Dell,30,SJS,33,30,12,15,3,92,986,894,.907,3.01,0,5.3,1834,15,.500,5,103,-2.84,0,0,0,0,2020,San Jose Sharks
4,7,Adam Berkhoel,24,ATL,9,,2,4,1,30,255,225,.882,3.80,0,1.0,473,,,,,,0,0,0,0,2006,Atlanta Thrashers
5,40,Adam Hauser,25,LAK,1,,0,0,0,6,24,18,.750,7.08,0,-0.2,51,,,,,,0,0,0,0,2006,Los Angeles Kings
6,62,Adam Munro,21,CHI,7,,1,5,1,26,217,191,.880,3.66,0,0.5,426,,,,,,0,0,0,2,2004,Chicago Blackhawks
7,65,Adam Munro,23,CHI,10,,3,5,2,25,234,209,.893,3.00,1,1.2,501,,,,,,0,1,1,0,2006,Chicago Blackhawks
8,87,Adam Werner,22,COL,2,1,1,1,0,5,58,53,.914,3.42,0,0.4,88,0,.000,1,,,0,0,0,0,2020,Colorado Avalanche
9,95,Adam Wilcox,25,BUF,1,0,0,1,0,0,14,14,1.000,0.00,0,0.2,39,0,,0,,,0,0,0,0,2018,Buffalo Sabres


In [89]:
# Checking Work
goalie_stats.loc[goalie_stats["Team"].isnull()]

Unnamed: 0,Rk,Player,Age,Tm,GP,GS,W,L,T/O,GA,SA,SV,SV%,GAA,SO,GPS,MIN,QS,QS%,RBS,GA%-,GSAA,G,A,PTS,PIM,Year,Team


In [90]:
# quick look overall
goalie_stats.sample(25).sort_values("Year", ascending=True)

Unnamed: 0,Rk,Player,Age,Tm,GP,GS,W,L,T/O,GA,SA,SV,SV%,GAA,SO,GPS,MIN,QS,QS%,RBS,GA%-,GSAA,G,A,PTS,PIM,Year,Team
662,6,Dan Bouchard,31,QUE,60,,27,22,11,230,1723,1493,0.867,3.87,1,10.1,3568,,,,105.0,-11.02,0,3,3,36,1982,Quebec Nordiques
1871,24,Mark Holden,25,MTL,2,,0,1,1,6,42,36,0.857,4.12,0,0.2,87,,,,,,0,0,0,0,1983,Montreal Canadiens
866,57,Doug Soetaert,29,MTL,28,,14,9,4,91,621,530,0.853,3.41,0,2.4,1600,,,,117.0,-13.16,0,0,0,4,1985,Montreal Canadiens
219,49,Bill Ranford,19,BOS,4,,3,1,0,10,116,106,0.914,2.51,0,0.9,239,,,,,,0,0,0,0,1986,Boston Bruins
2548,67,Robbie Tallas,22,BOS,1,,1,0,0,3,29,26,0.897,3.0,0,0.2,60,,,,,,0,0,0,0,1996,Boston Bruins
2286,35,Pat Jablonski,28,MTL,24,,5,9,6,63,681,618,0.907,2.97,0,4.2,1272,,,,91.0,6.18,0,1,1,2,1996,Montreal Canadiens
1767,19,Manny Fernandez,21,DAL,5,,0,1,1,19,121,102,0.843,4.58,0,0.0,249,,,,,,0,0,0,0,1996,Dallas Stars
667,10,Dan Cloutier,21,NYR,12,,4,5,1,23,248,225,0.907,2.5,0,1.5,551,,,,,,0,0,0,19,1998,New York Rangers
2744,17,Sebastien Centomo,20,TOR,1,,0,0,0,3,12,9,0.75,4.5,0,-0.1,40,,,,,,0,0,0,0,2002,Toronto Maple Leafs
572,2,Craig Anderson,21,CHI,6,,0,3,2,18,125,107,0.856,4.0,0,0.0,270,,,,,,0,0,0,0,2003,Chicago Blackhawks


In [153]:
# saving our goalie stats for after
goalie_stats.to_csv("goalie_stats.csv", encoding="utf-8")

***

<a id="#goalie_stats"></a><h2>Goalie Stats</h2>

<p><strong>Notes :</strong>All Goalie stats were saved for all years. Certain stats were added as the years progressed such as quality stats, quality start %. All of this will be analyzed and used in the second part of the project when we do feature selection.</p>

<p><a href="#top">[Return to top]</a></p>

In [156]:
# repeating the process for team stats
team_url = "https://www.hockey-reference.com/leagues/NHL_{}.html"
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

ser = Service("C:\\chromedriver.exe")
op = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=ser, options=op)

In [157]:
# scraping data, saving locally
for year in years:
    url = team_url.format(year)
    driver.get(url)
    driver.execute_script("window.scrollTo(1,10000)")
    time.sleep(5)
    html = driver.page_source
    with open("team_stats/{}.html".format(year), "w+", encoding="utf-8") as f:
        f.write(html)
    time.sleep(5)

In [101]:
# initiate beautiful soup class, scrape out yearly goalie stats and concat them into one long big dataframe, add playoffs col, remove asterisk
soup = BeautifulSoup(page, "html.parser")
data = []
for year in years:
    with open("team_stats/{}.html".format(year), encoding="utf-8") as f:
        page = f.read()
    soup = BeautifulSoup(page, "html.parser")
    stats = soup.find(id="stats")
    stats.find("tr", class_="over_header").decompose()
    temp_df = pd.read_html(str(stats))[0]
    to_drop = temp_df.loc[temp_df["Unnamed: 1"] == "League Average"].index[0]
    temp_df = temp_df.drop([to_drop])
    temp_df["Year"] = year
    data.append(temp_df)
team_stats = pd.concat(data)
team_stats.rename(columns={"Unnamed: 1": "Team"}, inplace=True)
team_stats = team_stats.reset_index(drop=True)
team_stats["Playoffs"] = team_stats.apply(
    lambda x: 1 if "*" in x["Team"] else 0, axis=1
)
team_stats["Team"] = team_stats["Team"].str.replace("*", "", regex=False)
team_stats.sample(15)

Unnamed: 0,Rk,Team,GP,W,L,T,PTS,PTS%,GF,GA,SRS,SOS,GF/G,GA/G,PP,PPO,PP%,PPA,PPOA,PK%,SH,SHA,PIM/G,oPIM/G,S,S%,SA,SV%,SO,Year,AvAge,OL,SOW,SOL,Playoffs
339,6.0,Buffalo Sabres,82,40,30,12.0,92,0.561,237,208,0.31,-0.04,2.89,2.54,43,326,13.19,59,364,83.79,16,4,22.3,22.1,2090,11.3,2728,0.924,5,1997,,,,,1
827,27.0,Calgary Flames,82,35,40,,77,0.47,202,238,-0.35,0.04,2.46,2.9,39,249,15.66,43,235,81.7,12,7,10.5,10.6,2199,9.2,2345,0.899,2,2014,26.7,7.0,7.0,3.0,0
188,21.0,Quebec Nordiques,80,12,61,7.0,31,0.194,240,407,-1.91,0.18,3.0,5.09,70,371,18.87,98,382,74.35,8,15,26.3,23.7,2301,10.4,2754,0.852,0,1990,,,,,0
531,1.0,Detroit Red Wings,82,48,21,11.0,109,0.665,255,189,0.72,-0.09,3.11,2.3,63,314,20.06,42,317,86.75,15,5,11.5,11.2,2484,10.3,2151,0.912,7,2004,31.9,2.0,,,1
113,9.0,New York Islanders,80,35,33,12.0,82,0.513,279,281,-0.03,-0.01,3.49,3.51,84,337,24.93,83,381,78.22,8,9,22.9,22.4,2481,11.2,2226,0.874,1,1987,,,,,1
947,27.0,Detroit Red Wings,82,30,39,,73,0.445,212,254,-0.48,-0.01,2.59,3.1,41,234,17.52,58,259,77.61,9,5,8.9,8.3,2490,8.5,2613,0.903,4,2018,30.2,13.0,5.0,1.0,0
305,24.0,Tampa Bay Lightning,48,17,28,3.0,37,0.385,120,144,-0.49,0.01,2.5,3.0,25,177,14.12,32,205,84.39,6,5,21.5,17.5,1280,9.4,1325,0.891,2,1995,,,,,0
571,11.0,San Jose Sharks,82,44,27,,99,0.604,265,235,0.36,0.06,3.23,2.87,91,500,18.2,77,399,80.7,10,11,12.7,14.9,2483,10.7,2180,0.892,5,2006,26.5,11.0,1.0,7.0,1
179,12.0,Toronto Maple Leafs,80,38,38,4.0,80,0.5,337,358,-0.26,0.0,4.21,4.48,81,348,23.28,89,408,78.19,16,17,30.2,29.8,2435,13.8,2798,0.872,0,1990,,,,,1
924,4.0,Boston Bruins,82,50,20,,112,0.683,267,211,0.62,-0.07,3.26,2.57,61,258,23.64,40,245,83.67,9,10,9.5,9.6,2703,9.9,2399,0.912,4,2018,28.6,12.0,3.0,3.0,1


In [102]:
# Checking spelling of teams
team_stats["Team"].value_counts()

Pittsburgh Penguins        38
Toronto Maple Leafs        38
Vancouver Canucks          38
Edmonton Oilers            38
Calgary Flames             38
Detroit Red Wings          38
Buffalo Sabres             38
St. Louis Blues            38
Los Angeles Kings          38
Montreal Canadiens         38
Philadelphia Flyers        38
New York Islanders         38
Washington Capitals        38
Boston Bruins              38
New York Rangers           38
New Jersey Devils          37
Chicago Blackhawks         33
San Jose Sharks            28
Tampa Bay Lightning        27
Ottawa Senators            27
Florida Panthers           26
Dallas Stars               26
Colorado Avalanche         24
Winnipeg Jets              24
Carolina Hurricanes        22
Nashville Predators        21
Minnesota Wild             19
Columbus Blue Jackets      19
Phoenix Coyotes            17
Hartford Whalers           16
Anaheim Ducks              14
Quebec Nordiques           14
Minnesota North Stars      12
Mighty Duc

In [103]:
# Chicago blackhawks removed the space in name
team_stats.loc[team_stats["Team"] == "Chicago Black Hawks"] = "Chicago Blackhawks"
team_stats["Team"].value_counts()

Pittsburgh Penguins        38
Toronto Maple Leafs        38
Vancouver Canucks          38
Edmonton Oilers            38
Calgary Flames             38
Detroit Red Wings          38
Buffalo Sabres             38
St. Louis Blues            38
Los Angeles Kings          38
Montreal Canadiens         38
Chicago Blackhawks         38
Philadelphia Flyers        38
New York Islanders         38
Washington Capitals        38
Boston Bruins              38
New York Rangers           38
New Jersey Devils          37
San Jose Sharks            28
Tampa Bay Lightning        27
Ottawa Senators            27
Florida Panthers           26
Dallas Stars               26
Colorado Avalanche         24
Winnipeg Jets              24
Carolina Hurricanes        22
Nashville Predators        21
Minnesota Wild             19
Columbus Blue Jackets      19
Phoenix Coyotes            17
Hartford Whalers           16
Anaheim Ducks              14
Quebec Nordiques           14
Minnesota North Stars      12
Mighty Duc

In [107]:
# for later we will need both the team abbreviation and the long team name, adding the abbreviation to the team stats.
team_abbrev = {}
with open("team_names.txt") as f:
    lines = f.readlines()
    for line in lines[1:]:
        abbrev, name = line.replace("\n", "").split(",")
        team_abbrev[name] = abbrev

In [108]:
# adding team abbreviations column to team stats
team_stats["Tm"] = team_stats["Team"].map(team_abbrev)

Unnamed: 0,Rk,Team,GP,W,L,T,PTS,PTS%,GF,GA,SRS,SOS,GF/G,GA/G,PP,PPO,PP%,PPA,PPOA,PK%,SH,SHA,PIM/G,oPIM/G,S,S%,SA,SV%,SO,Year,AvAge,OL,SOW,SOL,Playoffs,Tm
0,1.0,New York Islanders,80,54,16,10.0,118,0.738,385,250,1.63,-0.05,4.81,3.13,80,284,28.17,65,332,80.42,16,3,16.6,15.5,2469,15.6,2406,0.896,0,1982,,,,,1,NYI
1,2.0,Edmonton Oilers,80,48,17,15.0,111,0.694,417,295,1.33,-0.19,5.21,3.69,86,341,25.22,67,371,81.94,12,8,18.2,17.2,2690,15.5,2538,0.884,0,1982,,,,,1,EDM
2,3.0,Montreal Canadiens,80,46,17,17.0,109,0.681,360,223,1.68,-0.04,4.5,2.79,72,297,24.24,57,286,80.07,9,5,18.2,18.7,2702,13.3,2198,0.899,6,1982,,,,,1,MTL
3,4.0,Boston Bruins,80,43,27,10.0,96,0.6,323,285,0.55,0.08,4.04,3.56,65,289,22.49,54,291,81.44,11,7,15.8,15.3,2417,13.4,2056,0.861,2,1982,,,,,1,BOS
4,5.0,Minnesota North Stars,80,37,23,20.0,94,0.588,346,288,0.53,-0.19,4.33,3.6,89,362,24.59,49,271,81.92,11,9,16.8,19.8,2641,13.1,2633,0.891,1,1982,,,,,1,MNS
5,6.0,Buffalo Sabres,80,39,26,15.0,93,0.581,307,273,0.51,0.08,3.84,3.41,63,301,20.93,57,281,79.72,4,10,17.6,18.0,2342,13.1,2320,0.882,0,1982,,,,,1,BUF
6,7.0,New York Rangers,80,39,27,14.0,92,0.575,316,306,0.21,0.09,3.95,3.83,68,306,22.22,75,319,76.49,7,12,17.3,16.1,2285,13.8,2356,0.87,1,1982,,,,,1,NYR
7,8.0,Philadelphia Flyers,80,38,31,11.0,87,0.544,325,313,0.24,0.09,4.06,3.91,79,296,26.69,102,397,74.31,11,5,30.9,26.3,2721,11.9,2399,0.87,0,1982,,,,,1,PHI
8,9.0,Quebec Nordiques,80,33,31,16.0,82,0.513,356,345,0.25,0.11,4.45,4.31,83,314,26.43,71,331,78.55,16,9,21.8,20.3,2305,15.4,2368,0.854,1,1982,,,,,1,QUE
9,10.0,Winnipeg Jets,80,33,33,14.0,80,0.5,319,332,-0.28,-0.12,3.99,4.15,74,328,22.56,61,249,75.5,4,11,16.3,18.5,2634,12.1,2596,0.872,3,1982,,,,,1,WPG


In [109]:
# Checking Results
team_stats.loc[team_stats["Tm"].isnull()]

Unnamed: 0,Rk,Team,GP,W,L,T,PTS,PTS%,GF,GA,SRS,SOS,GF/G,GA/G,PP,PPO,PP%,PPA,PPOA,PK%,SH,SHA,PIM/G,oPIM/G,S,S%,SA,SV%,SO,Year,AvAge,OL,SOW,SOL,Playoffs,Tm


In [110]:
team_stats.to_csv("team_stats.csv", encoding="utf-8")

***

<a id="#team_stats"></a><h2>Team Stats</h2>

<p><strong>Notes :</strong>All team stats were saved for all years. A new column was added to mark whether or not the team made the playoffs that year, this could be important in our model later on.</p>

<p><a href="#top">[Return to top]</a></p>