This is my Fantasy Hockey Analyzer. The purpose of this project is to predict the fantasy hockey output of individual skaters based on stats from previous years.


Section 1: Modules Used

The following is a list of modules that I used and the reason why they were used:
-os: to allow the program to read data in the repository
-numpy: basic math operations
-pandas: all dataframe operations/data storage/data cleaning
-sklearn (various submodules): all machine learning operations/analysis

In addition to these modules, I also use a custom module, "my_module.py", which contains helper functions for data cleaning and accuracy evaluation. If you are interested in taking a look at these functions, the file is available in the repository at https://github.com/chrisberry888/FantasyHockeyAnalyzer.

In [1]:
#Import block
import os
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

import my_module as mx

Section 2: Data Gathering and Cleaning

The ultimate goal of this project is to predict the number of fantasy points that a given player will have in the 2022-2023 season. Different fantasy leagues have different points breakdowns, but my current league has a points breakdown as described in the "points_dict" variable in the following code block. For example, each player gets 5 fantasy points for a goal, 3 for an assist, and so on. Of course, this can be changed if a different league has a different points breakdown.
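
As a quick worked example of how that breakdown turns a stat line into a single number, here is a minimal sketch. The scoring values are the same ones used in "points_dict" in the following code block; the stat line itself is made up for illustration and the "fantasy_points" helper is hypothetical (it is not part of my_module.py).

```python
# League scoring (same values as points_dict in the notebook).
points_dict = {"Goals": 5, "Assists": 3, "+/-": 1.5, "PIM": -0.25,
               "PP_Goals": 4, "PP_Assists": 2, "SH_Goals": 6, "SH_Assists": 4,
               "Faceoffs_Won": 0.25, "Faceoffs_Lost": -0.15,
               "Hits": 0.5, "Blocked_Shots": 0.75}

# Hypothetical season stat line (made-up numbers, not a real player).
stat_line = {"Goals": 30, "Assists": 40, "+/-": 10, "PIM": 20,
             "PP_Goals": 8, "PP_Assists": 12, "SH_Goals": 1, "SH_Assists": 0,
             "Faceoffs_Won": 100, "Faceoffs_Lost": 100,
             "Hits": 50, "Blocked_Shots": 20}


def fantasy_points(stats, scoring):
    """Sum stat * multiplier over every category the league scores."""
    return sum(mult * stats.get(cat, 0) for cat, mult in scoring.items())


total = fantasy_points(stat_line, points_dict)
print(total)  # 150 + 120 + 15 - 5 + 32 + 24 + 6 + 0 + 25 - 15 + 25 + 15
```

Swapping in a different league's breakdown only requires changing the values in the scoring dictionary.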

To accomplish this, I have gathered data from rotowire.com and moneypuck.com. The Moneypuck data contains just about every advanced stat you could think of, broken out by game situation (5-on-5, 5-on-4, etc.). However, for the sake of this project we will only use the rows that describe a player's output across all situations. The only stat we need that can't be calculated using the Moneypuck data is +/-; Rotowire has +/- available, so I'm using that dataset as well.

The data in these sets runs from the 2010-2011 season through the 2021-2022 season. All of the major data gathering and cleaning occurs in the next three code blocks.


The following block does all of the prep work before we start to read the data. It first establishes where the data is stored in the repository so that it can be read by the program. It then creates a list of column labels for the Rotowire data (the Rotowire dataset is formatted differently from the Moneypuck dataset, so we need this extra step before proceeding). Finally, it establishes the points breakdown for each relevant stat; this is used later on to calculate fantasy points for each player.

In [27]:
#The current working directory is the main repository directory; these lines set the path to where the data is
#os.path.join builds the path with the correct separator on any operating system
path = os.getcwd()
data_path = os.path.join(path, 'data')

#This array makes it easier to format the rotowire data
rw_labels = ["name", "Team", "Pos", "Games", "Goals", "Assists", "Pts", "+/-", "PIM", "SOG", "GWG", "PP_Goals", "PP_Assists", "SH_Goals", "SH_Assists", "Hits", "Blocked_Shots"]

#This is the breakdown of how many fantasy points a player gets for each category
points_dict = {"Goals":5, "Assists":3, "+/-":1.5, "PIM":-0.25, "PP_Goals":4, "PP_Assists":2, "SH_Goals":6, "SH_Assists":4, "Faceoffs_Won":0.25, "Faceoffs_Lost":-0.15, "Hits":0.5, "Blocked_Shots":0.75 }


D:\Git_Repositories\FantasyHockeyAnalyzer


The following block takes all of the data in the repository and turns it into year-by-year player data. For each season from 2010-2011 through 2021-2022, the for-loop reads the Rotowire and Moneypuck data from the CSV files in the repository, merges the datasets together, calculates each player's fantasy points for that season, does some formatting, then adds the result to the "yearly_player_data" list. This list is later turned into ML-readable data.

In [3]:
#I have data from the 2010-2011 season through the 2021-2022 season.
#By the end of this block, there will be 12 seasons' worth of data in the "yearly_player_data" list
yearly_player_data = []

for i in range(2010, 2022):
    new_data = []
    
    #Imports the rotowire and moneypuck datasets from the selected year into rdf and mdf
    rdf = pd.read_csv(os.path.join(data_path, 'rotowire_data', 'rotowire{}.csv'.format(i)))
    mdf = pd.read_csv(os.path.join(data_path, 'moneypuck_data', 'moneypuck{}.csv'.format(i)))
    
    #Formats the rotowire data: applies the column labels, then drops the first row
    #(set_axis no longer accepts inplace=True in pandas 2.0+, so the result is reassigned)
    rdf = rdf.set_axis(rw_labels, axis=1)
    rdf = rdf.drop(index=rdf.index[0])
    
    #The Moneypuck data has information about 5-on-5, 5-on-4, 4-on-5, other, and all.
    #For this project I'm just focused on "all" since I suspect it'll give me the best results.
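    #.copy() makes the filtered rows an independent dataframe so the "name" column can be modified safely
    mdf = mdf[mdf["situation"] == "all"].copy()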
    
    #Combines the "Name" and "Team" columns (There are some players with the same name on different teams)
    rdf["name"] = rdf["name"] + "-" + rdf["Team"]
    mdf["name"] = mdf["name"] + "-" + mdf["team"]
    
    
    
    #Merges the rotowire and moneypuck dataframes
    new_data = pd.merge(rdf, mdf, on="name")
    
    #Changes the name of a few columns in the new dataframe
    new_data = new_data.rename(columns={"name":"Name","faceoffsWon":"Faceoffs_Won","faceoffsLost":"Faceoffs_Lost"})
    
    #This section calculates each player's total fantasy output for that year
    #("row" and "col" are used so the inner loops don't reuse the outer year variable "i")
    cols = new_data.columns
    fant_points = [0 for _ in range(len(new_data))]
    for row in range(len(new_data)):
        for col in range(len(cols)):
            mult = points_dict.get(cols[col], 0)
            if mult != 0:
                fant_points[row] += mult*int(new_data.iloc[row, col])
    
    #Adds the players' fantasy points to the new_data dataframe
    new_data["Fantasy_Points"] = fant_points
    
    new_data = new_data.drop_duplicates()
    
    #Adds new_data to the "yearly_player_data" list
    yearly_player_data.append(new_data)
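
To illustrate why the name and team are combined before merging, and how the points calculation can also be written column-by-column, here is a minimal sketch on toy data. The players, stats, and the shortened "scoring" dictionary are made up for illustration; only the column names ("name", "Team", "team", "faceoffsWon") follow the notebook.

```python
import pandas as pd

# Toy stand-ins for one season of Rotowire and Moneypuck data (made-up values).
rdf = pd.DataFrame({
    "name": ["Alex Smith", "Alex Smith"],
    "Team": ["CAR", "NYI"],
    "Goals": [30, 2],
    "Assists": [40, 5],
})
mdf = pd.DataFrame({
    "name": ["Alex Smith", "Alex Smith"],
    "team": ["CAR", "NYI"],
    "faceoffsWon": [400, 0],
})

# Merging on the bare name pairs every "Alex Smith" with every other one.
naive = pd.merge(rdf, mdf, on="name")
print(len(naive))  # one row per name pairing, not one row per player

# Appending the team makes the key unique, so each player matches only himself.
rdf["name"] = rdf["name"] + "-" + rdf["Team"]
mdf["name"] = mdf["name"] + "-" + mdf["team"]
merged = pd.merge(rdf, mdf, on="name")

# Column-wise alternative to the nested loops: multiply each scored column
# by its multiplier and add the resulting columns together.
scoring = {"Goals": 5, "Assists": 3, "faceoffsWon": 0.25}  # shortened, hypothetical
scored_cols = [c for c in scoring if c in merged.columns]
merged["Fantasy_Points"] = sum(merged[c] * scoring[c] for c in scored_cols)
print(merged[["name", "Fantasy_Points"]])
```

This sketch assumes the scored columns are already numeric; the loop in the notebook converts each value with int() instead.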
                
    



In [4]:
#pd.options.display.max_columns = None
#pd.options.display.max_rows = None
#yearly_player_data[11]

The following block takes the yearly data and turns it into ML-readable data. For this project, I am creating three different models that use data from the past one year, the past two years, and the past three years respectively, to see how much they differ in accuracy. In each case, the stats from the earlier season(s) are paired with the fantasy points from the season that follows them.

In [5]:
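#yearly_player_data[0] is the 2010-2011 season, so the season starting in year i is at index i-2010.
#In each loop, "arr" holds the earlier season(s) used as input and "points_df" is the
#following season, whose fantasy points are the value to predict.
ml_data_one_year = pd.DataFrame()
ml_data_two_year = pd.DataFrame()
ml_data_three_year = pd.DataFrame()
for i in range(2011, 2022):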
    arr = [yearly_player_data[i-2011]]
    points_df = yearly_player_data[i-2010]
    temp = mx.merge_dataframes(arr, points_df)
    ml_data_one_year = pd.concat([ml_data_one_year, temp], ignore_index=True)
    
for i in range(2012, 2022):
    arr = [yearly_player_data[i-2012], yearly_player_data[i-2011]]
    points_df = yearly_player_data[i-2010]
    temp = mx.merge_dataframes(arr, points_df)
    ml_data_two_year = pd.concat([ml_data_two_year, temp], ignore_index=True)
    
for i in range(2013, 2022):
    arr = [yearly_player_data[i-2013], yearly_player_data[i-2012], yearly_player_data[i-2011]]
    points_df = yearly_player_data[i-2010]
    temp = mx.merge_dataframes(arr, points_df)
    ml_data_three_year = pd.concat([ml_data_three_year, temp], ignore_index=True)
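
The helper mx.merge_dataframes lives in my_module.py and is not shown in this notebook. As a rough illustration of the general idea only, the sketch below uses a hypothetical stand-in, "build_window_rows", on made-up toy seasons. It assumes the seasons are joined on "Name" and that the target column is called "Predicted_Fantasy_Points" (the column name dropped in the prediction cells later on); the "_y0"/"_y1" suffixes are my own invention, and the real helper may work differently.

```python
import pandas as pd


def build_window_rows(feature_seasons, target_season):
    """Hypothetical stand-in: join earlier seasons' stats to a later season's points.

    Only players present in every season of the window are kept (inner joins).
    """
    merged = None
    for k, season in enumerate(feature_seasons):
        # Suffix every stat column with its position in the window, e.g. Goals_y0
        renamed = season.drop(columns=["Fantasy_Points"]).rename(
            columns=lambda c: c if c == "Name" else f"{c}_y{k}")
        merged = renamed if merged is None else merged.merge(renamed, on="Name")
    # The following season only contributes the value to predict
    target = target_season[["Name", "Fantasy_Points"]].rename(
        columns={"Fantasy_Points": "Predicted_Fantasy_Points"})
    return merged.merge(target, on="Name")


# Three toy seasons (made-up numbers)
seasons = [
    pd.DataFrame({"Name": ["A-TOR", "B-EDM"], "Goals": [10, 20],
                  "Fantasy_Points": [100.0, 200.0]}),
    pd.DataFrame({"Name": ["A-TOR", "B-EDM"], "Goals": [12, 25],
                  "Fantasy_Points": [120.0, 250.0]}),
    pd.DataFrame({"Name": ["A-TOR", "C-COL"], "Goals": [15, 5],
                  "Fantasy_Points": [150.0, 50.0]}),
]

# Two-year window: seasons 0 and 1 are the inputs, season 2 supplies the target
two_year_rows = build_window_rows(seasons[0:2], seasons[2])
print(two_year_rows)
```

In this toy example only "A-TOR" appears in all three seasons, so the two-year window yields a single training row.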



ONE YEAR:

In [20]:
arr = mx.separate_fantasy_points(ml_data_one_year)
X = mx.reformat_df(arr[0])
y = arr[1]

X_train_one, X_test_one, y_train_one, y_test_one = train_test_split(X, y, random_state=1)

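#MLPRegressor is given no random_state, so the fitted model (and the scores below) will vary between runs
one_year_regr = MLPRegressor().fit(X_train_one, y_train_one)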


TWO YEAR:

In [21]:
arr = mx.separate_fantasy_points(ml_data_two_year)
X = mx.reformat_df(arr[0])
y = arr[1]

X_train_two, X_test_two, y_train_two, y_test_two = train_test_split(X, y, random_state=1)

two_year_regr = MLPRegressor(max_iter=500).fit(X_train_two, y_train_two)



THREE YEAR:

In [22]:
arr = mx.separate_fantasy_points(ml_data_three_year)
X = mx.reformat_df(arr[0])
y = arr[1]

X_train_three, X_test_three, y_train_three, y_test_three = train_test_split(X, y, random_state=1)

three_year_regr = MLPRegressor(max_iter=500).fit(X_train_three, y_train_three)



ANALYSIS:

In [23]:
y_pred = one_year_regr.predict(X_test_one)

print(mean_absolute_error(y_test_one, y_pred))

y_pred = two_year_regr.predict(X_test_two)

print(mean_absolute_error(y_test_two, y_pred))

y_pred = three_year_regr.predict(X_test_three)

print(mean_absolute_error(y_test_three, y_pred))

81.39743988050087
79.91442467133213
88.29785892655995
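
The three numbers above are the mean absolute errors (in fantasy points) of the one-year, two-year, and three-year models, in that order. Each model is scored on its own test split, so the comparison between them is only approximate. As a minimal sketch of what the metric measures, the example below computes it by hand with numpy on made-up numbers; this is the same quantity that sklearn's mean_absolute_error reports. The constant-prediction baseline is my own addition for context and is not part of the analysis above.

```python
import numpy as np

# Made-up fantasy point totals for four players (not real results)
y_true = np.array([300.0, 150.0, 80.0, 20.0])
y_pred = np.array([250.0, 170.0, 100.0, 60.0])

# Mean absolute error: the average of |actual - predicted|
mae_model = np.mean(np.abs(y_true - y_pred))

# Naive baseline (an assumption for illustration): always predict the mean total
baseline_pred = np.full_like(y_true, y_true.mean())
mae_baseline = np.mean(np.abs(y_true - baseline_pred))

print(mae_model, mae_baseline)
```

A model is only useful if its error is clearly below that of a naive baseline like this one.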


PREDICTIONS:

In [24]:
one_year_pred = yearly_player_data[11].copy()
one_year_pred.drop(columns=["Fantasy_Points"], inplace=True)
one_year_df = mx.get_name_predictions(one_year_regr, one_year_pred)
one_year_df.sort_values(by="Prediction", ascending=False, inplace=True)
#pd.options.display.max_rows = 978
display(one_year_df)


Unnamed: 0,Name,Prediction
48,Auston Matthews-TOR,551.293859
37,Nathan MacKinnon-COL,526.463437
62,Leon Draisaitl-EDM,505.839201
8,Mikko Rantanen-COL,484.299331
18,Connor McDavid-EDM,475.067723
...,...,...
238,Walker Duehr-CGY,22.240577
285,Alex Steeves-TOR,21.823438
220,Kyle Criscuolo-DET,18.806423
243,Ben McCartney-ARI,7.469042


In [25]:
two_year_pred = [yearly_player_data[i] for i in [10,11]]
two_year_pred = mx.merge_dataframes(two_year_pred, yearly_player_data[11])
two_year_pred.drop(columns=["Predicted_Fantasy_Points"], inplace=True)
two_year_df = mx.get_name_predictions(two_year_regr, two_year_pred)
two_year_df.sort_values(by="Prediction", ascending=False, inplace=True)
#pd.options.display.max_rows = 978
display(two_year_df)

Unnamed: 0,Name,Prediction
0,Auston Matthews-TOR,376.388747
7,Aleksander Barkov-FLA,355.721895
3,Mikko Rantanen-COL,346.688344
21,Nathan MacKinnon-COL,340.625996
1,Connor McDavid-EDM,339.737278
...,...,...
213,Chris Wagner-BOS,80.802291
163,Anders Bjork-BUF,74.913007
248,Slater Koekkoek-EDM,69.799831
262,Joseph Cramarossa-MIN,64.194616


In [26]:
three_year_pred = [yearly_player_data[i] for i in [9,10,11]]
three_year_pred = mx.merge_dataframes(three_year_pred, yearly_player_data[11])
three_year_pred.drop(columns=["Predicted_Fantasy_Points"], inplace=True)
three_year_df = mx.get_name_predictions(three_year_regr, three_year_pred)
three_year_df.sort_values(by="Prediction", ascending=False, inplace=True)
#pd.options.display.max_rows = 978
display(three_year_df)

Unnamed: 0,Name,Prediction
1,Auston Matthews-TOR,334.881157
7,Connor McDavid-EDM,292.140882
6,Nathan MacKinnon-COL,290.934507
2,Leon Draisaitl-EDM,286.964859
41,Aleksander Barkov-FLA,286.447841
...,...,...
190,Niko Mikkola-STL,87.971259
161,Michael Stone-CGY,87.148655
187,Dylan DeMelo-WPG,83.265318
199,Tim Gettinger-NYR,74.938228


In [13]:
mx.check_duplicate_names(yearly_player_data[11])

False

ACKNOWLEDGEMENTS:

Thank you to Rotowire and Moneypuck for making their NHL data easy for someone like me to use in a project like this, and thank you to Peter Tanner of Moneypuck, not just for creating such a valuable resource, but for being responsive to the questions I had about the dataset.