This is my Fantasy Hockey Analyzer. The purpose of this project is to predict the fantasy hockey output of individual skaters based on stats from previous years.


Section 1: Modules Used

The following is a list of modules that I used and the reason why they were used:

-os: to allow the program to read data in the repository

-numpy: basic math operations

-pandas: all dataframe operations/data storage/data cleaning

-various sklearn: all machine learning operations/analysis

In addition to these modules, I also have a custom module with helper functions for data cleaning and accuracy evaluation. These functions live in the "my_module.py" file in the repository, available at https://github.com/chrisberry888/FantasyHockeyAnalyzer.

In [1]:
#Import block
import os
import numpy as np
import pandas as pd
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.base import clone

import my_module as mx

Section 2: Data Gathering and Cleaning

The ultimate goal of this project is to predict the number of fantasy points that a given player will have in the 2022-2023 season. Different fantasy leagues have different points breakdowns, but my current league has a points breakdown as described in the "points_dict" variable in the following code block. For example, each player gets 5 fantasy points for a goal, 3 for an assist, and so on. Of course, this can be changed if a different league has a different points breakdown.
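To make the scoring concrete, here is a small worked example that totals one stat line under this breakdown. The weights are copied from points_dict; the stat line is made up for illustration and does not correspond to any real player, and the helper name fantasy_points is hypothetical.

```python
# Weights copied from the points_dict defined in the next code block
points_dict = {"Goals": 5, "Assists": 3, "+/-": 1.5, "PIM": -0.25,
               "PP_Goals": 4, "PP_Assists": 2, "SH_Goals": 6, "SH_Assists": 4,
               "Faceoffs_Won": 0.25, "Faceoffs_Lost": -0.15,
               "Hits": 0.5, "Blocked_Shots": 0.75}

# Hypothetical season stat line (illustration only, not a real player)
stat_line = {"Goals": 30, "Assists": 40, "+/-": 10, "PIM": 20,
             "PP_Goals": 8, "PP_Assists": 12, "SH_Goals": 1, "SH_Assists": 0,
             "Faceoffs_Won": 400, "Faceoffs_Lost": 350,
             "Hits": 50, "Blocked_Shots": 30}

def fantasy_points(stats, weights):
    """Sum weight * value over every stat that has a weight."""
    return sum(weights[k] * stats.get(k, 0) for k in weights)

total = fantasy_points(stat_line, points_dict)
print(total)
```

Because "Goals" and "PP_Goals" are both weighted, a power-play goal is effectively worth 5 + 4 = 9 points under this scheme, assuming the Goals column counts power-play goals as part of the total.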

To accomplish this all, I have gathered data from rotowire.com and moneypuck.com. The Moneypuck data contains just about every advanced stat you could think of in a variety of different situations (5-on-5, 5-on-4, etc). However, for the sake of this project we will only use their data that describes a player's output in all of their situations. The only stat we need that can't be calculated using the Moneypuck data is +/-; Rotowire has +/- available, so I'm using that dataset.

Data from these sets starts with the 2010-2011 season and, as of the latest update, runs through the 2024-2025 season. All of the major data gathering and cleaning occurs in the next three code blocks.


The following block does all of the prep work before we start to read the data. It first establishes where the data is stored in the repository so that the program can read it. It then creates a list of column labels for the Rotowire data (the Rotowire dataset is formatted differently from the Moneypuck dataset, so we need this extra step before proceeding). Finally, it establishes a points breakdown for each relevant stat; this is used later on to calculate fantasy points for each player.

In [2]:
#The current working directory is the main repository directory; these lines set the path to where the data is
path = os.getcwd()
data_path = path + '/data'

#This array makes it easier to format the rotowire data
rw_labels = ["name", "Team", "Pos", "Games", "Goals", "Assists", "Pts", "+/-", "PIM", "SOG", "GWG", "PP_Goals", "PP_Assists", "SH_Goals", "SH_Assists", "Hits", "Blocked_Shots"]

#This is the breakdown of how many fantasy points a player gets for each category
points_dict = {"Goals":5, "Assists":3, "+/-":1.5, "PIM":-0.25, "PP_Goals":4, "PP_Assists":2, "SH_Goals":6, "SH_Assists":4, "Faceoffs_Won":0.25, "Faceoffs_Lost":-0.15, "Hits":0.5, "Blocked_Shots":0.75 }
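The rw_labels list is needed because the Rotowire files are read without using their own header. As a rough sketch of how that works, the snippet below reads a tiny in-memory CSV in a similar way. The two-line header and the sample rows are invented stand-ins (this assumes the real files begin with two header rows, since the loading code drops the first two rows), and the label list is shortened.

```python
import io
import pandas as pd

# Shortened, hypothetical label list (the real rw_labels has 17 entries)
labels = ["name", "Team", "Pos", "Games", "Goals", "Assists"]

# Invented stand-in for a Rotowire CSV: two header rows, then player rows
raw = io.StringIO(
    "Player,,,Stats,,\n"
    "Name,Team,Pos,G,Goals,Assists\n"
    "Player A,AAA,C,82,30,40\n"
    "Player B,BBB,D,70,5,25\n"
)

# skiprows=2 drops the header rows at read time, so numeric columns are
# parsed as numbers instead of strings
df = pd.read_csv(raw, names=labels, header=None, skiprows=2)
print(df.dtypes)
print(df)
```

Dropping the rows after reading with iloc[2:] works as well, but the header text then likely leaves every column typed as strings, which would explain why a cast such as int() is needed before doing arithmetic on the values.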


The following block takes all of the data in the repository and turns it into year-by-year player data. For each year from 2010 to 2024, the for-loop reads the Rotowire and Moneypuck data from the CSV files in the repository, merges the datasets together, calculates each player's fantasy points for that season, does some formatting, then adds the result to the "yearly_player_data" list. This list is later turned into ML-readable data.

In [3]:
#I have data from the 2010-2011 season through the 2024-2025 season.
#(Leftover filter, not applied in this notebook: df_filtered = df[~df['name'].str.contains('Elias Pettersson')])
#By the end of this block, there will be 15 seasons' worth of data in the "yearly_player_data" list
yearly_player_data = []

for i in range(2010, 2025):
    
    new_data = []
    
    #Imports the rotowire and moneypuck datasets from the selected year into rotowire_df and moneypuck_df
    rotowire_df = pd.read_csv(data_path + '/rotowire_data/rotowire{}.csv'.format(str(i)), names=rw_labels, header=None)
    moneypuck_df = pd.read_csv(data_path + '/moneypuck_data/moneypuck{}.csv'.format(str(i)))

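    #Formats the rotowire data: drops the first two rows and copies the result
    #so the column edits below act on an independent frame (avoids SettingWithCopyWarning)
    rotowire_df = rotowire_df.iloc[2:].copy()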

    #Replace arizona with utah
    rotowire_df['Team'] = rotowire_df['Team'].replace('ARI', 'UTA')
    moneypuck_df['team'] = moneypuck_df['team'].replace('ARI', 'UTA')
    
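    #The Moneypuck data has information about 5-on-5, 5-on-4, 4-on-5, other, and all.
    #For this project I'm just focused on "all" since I suspect it'll give me the best results.
    #(.copy() makes the filtered frame independent, so the "name" edit below doesn't trigger a SettingWithCopyWarning)
    moneypuck_df = moneypuck_df[moneypuck_df["situation"] == "all"].copy()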
    
    #Combines the "Name" and "Team" columns (There are some players with the same name on different teams)
    rotowire_df["name"] = rotowire_df["name"] + "-" + rotowire_df["Team"]
    moneypuck_df["name"] = moneypuck_df["name"] + "-" + moneypuck_df["team"]
    
    #Merges the rotowire and moneypuck dataframes
    new_data = pd.merge(rotowire_df, moneypuck_df, on="name")
    
    #Changes the name of a few columns in the new dataframe
    new_data = new_data.rename(columns={"name":"Name","faceoffsWon":"Faceoffs_Won","faceoffsLost":"Faceoffs_Lost"})
    
    #This section calculates each player's total fantasy output for that year
    #(row/col are used as loop variables so the outer loop's year counter "i" isn't shadowed)
    cols = new_data.columns
    fant_points = [0 for _ in range(len(new_data))]
    for row in range(len(new_data)):
        for col in range(len(cols)):
            mult = points_dict.get(cols[col], 0)
            if mult != 0:
                fant_points[row] += mult*int(new_data.iloc[row, col])
    
    #Adds the players' fantasy points to the new_data dataframe
    new_data["Fantasy_Points"] = fant_points
    
    new_data = new_data.drop_duplicates()
    
    #Adds new_data to the "yearly_player_data" list
    yearly_player_data.append(new_data)
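Since the merge keys on a combined name-team string, it can be worth checking that the keys are unique and seeing which players fail to match. The sketch below shows one way to do that with pandas; the two small frames are invented and merely stand in for rotowire_df and moneypuck_df.

```python
import pandas as pd

# Invented stand-ins for the two sources (not real data)
left = pd.DataFrame({"name": ["A-AAA", "B-BBB", "B-BBB", "C-CCC"],
                     "Goals": [30, 5, 7, 12]})
right = pd.DataFrame({"name": ["A-AAA", "B-BBB", "D-DDD"],
                      "faceoffsWon": [400.0, 10.0, 55.0]})

# Keys that occur more than once on one side produce extra rows in the merge
dup_keys = left.loc[left["name"].duplicated(keep=False), "name"].unique()
print("duplicate keys:", list(dup_keys))

# An outer merge with indicator=True labels each row by where it was found
check = pd.merge(left, right, on="name", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["name", "_merge"]]
print(unmatched)
```

An inner merge silently drops the unmatched rows, and drop_duplicates() only removes rows that are identical in every column, so two different players who share a name and team would both survive it.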


The following block takes the yearly data and turns it into ML-readable data. For this project, I am creating different models that use data from the past one year, the past two years, and the past three years, and seeing how much they differ in terms of efficacy.

In [4]:
ml_data_one_year = pd.DataFrame()
ml_data_two_year = pd.DataFrame()
ml_data_three_year = pd.DataFrame()
for i in range(2011, 2025):
    arr = [yearly_player_data[i-2011]]
    points_df = yearly_player_data[i-2010]
    temp = mx.merge_dataframes(arr, points_df)
    ml_data_one_year = pd.concat([ml_data_one_year, temp], ignore_index=True)
    
for i in range(2012, 2025):
    arr = [yearly_player_data[i-2012], yearly_player_data[i-2011]]
    points_df = yearly_player_data[i-2010]
    temp = mx.merge_dataframes(arr, points_df)
    ml_data_two_year = pd.concat([ml_data_two_year, temp], ignore_index=True)
    
for i in range(2013, 2025):
    arr = [yearly_player_data[i-2013], yearly_player_data[i-2012], yearly_player_data[i-2011]]
    points_df = yearly_player_data[i-2010]
    temp = mx.merge_dataframes(arr, points_df)
    ml_data_three_year = pd.concat([ml_data_three_year, temp], ignore_index=True)
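The three loops above follow the same sliding-window pattern: the k seasons before a target season supply the features, and the target season supplies the label. The index arithmetic can be sketched on its own, without mx.merge_dataframes (whose internals aren't shown in this notebook). The helper name window_indices is made up for this illustration.

```python
def window_indices(n_seasons, k):
    """Return (feature_indices, target_index) pairs for a k-season window."""
    pairs = []
    for target in range(k, n_seasons):
        features = list(range(target - k, target))
        pairs.append((features, target))
    return pairs

# 15 season files (2010 through 2024), matching the loading loop
n_seasons = 15
for k in (1, 2, 3):
    pairs = window_indices(n_seasons, k)
    print(k, "year window:", len(pairs), "target seasons; first", pairs[0], "last", pairs[-1])
```

Each extra year of history costs one target season, so the three-year dataset has fewer training windows than the one-year dataset.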



In [5]:
display(ml_data_three_year.head())

Unnamed: 0,Name,Team_x,Pos_x,Games_x,Goals_x,Assists_x,Pts_x,+/-_x,PIM_x,SOG_x,...,OffIce_A_xGoals,OffIce_F_shotAttempts,OffIce_A_shotAttempts,xGoalsForAfterShifts,xGoalsAgainstAfterShifts,corsiForAfterShifts,corsiAgainstAfterShifts,fenwickForAfterShifts,fenwickAgainstAfterShifts,Predicted_Fantasy_Points
0,Jonathan Toews-CHI,CHI,C,80,32,44,76,25,26,233,...,79.45,1683.0,1657.0,0.0,0.0,0.0,0.0,0.0,0.0,508.0
1,Sidney Crosby-PIT,PIT,C,41,32,34,66,20,31,161,...,66.97,1085.0,1378.0,0.0,0.0,0.0,0.0,0.0,0.0,655.35
2,Patrick Kane-CHI,CHI,RW,73,27,46,73,7,28,216,...,73.1,1757.0,1572.0,0.0,0.0,0.0,0.0,0.0,0.0,359.65
3,Jamie Benn-DAL,DAL,C,69,22,34,56,-5,52,177,...,91.49,1328.0,1695.0,0.0,0.0,0.0,0.0,0.0,0.0,534.2
4,Patrice Bergeron-BOS,BOS,C,80,22,35,57,20,26,211,...,75.85,1673.0,1556.0,0.0,0.0,0.0,0.0,0.0,0.0,561.2


In [6]:
# display(ml_data_three_year.head())

Section 3: ML Model Training

Now that we have data that ML models can read, we can train the models. For this project, I'm using multi-layer perceptrons (MLPRegressor) and Random Forests (RandomForestRegressor). I'm making six models in total: an MLP each for the one-, two-, and three-year data, and a Random Forest each for the one-, two-, and three-year data.

ONE YEAR:

In [7]:
#number of models per model type (mpt = models per type)
mpt = 100

arr = mx.separate_fantasy_points(ml_data_one_year)
X = mx.reformat_df(arr[0])
y = arr[1]


one_year_regr = MLPRegressor(max_iter=1000)


one_year_RF = RandomForestRegressor()


one_year_regr_models = mx.sim(one_year_regr, X, y, mpt) 

In [8]:
# one_year_RF_models = mx.sim(one_year_RF, X, y, mpt)
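The mx.sim helper is defined in my_module.py and isn't reproduced in this notebook. One plausible shape for such a helper, shown purely as an assumption, is to fit several independent copies of one estimator, each on a different random train/test split. The function name sim_sketch, the split size, and the synthetic data are all hypothetical and are not the module's actual behavior.

```python
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def sim_sketch(model, X, y, n_models, test_size=0.2):
    """Hypothetical stand-in for mx.sim: fit n_models clones on different splits."""
    fitted, maes = [], []
    for seed in range(n_models):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        m = clone(model)  # fresh, unfitted copy with the same settings
        m.fit(X_tr, y_tr)
        fitted.append(m)
        maes.append(mean_absolute_error(y_te, m.predict(X_te)))
    return fitted, maes

# Synthetic data so the sketch runs on its own
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
models, maes = sim_sketch(RandomForestRegressor(n_estimators=20, random_state=0), X, y, 5)
print(len(models), "models; mean hold-out MAE:", sum(maes) / len(maes))
```

Training many copies this way gives a spread of predictions per player, which can then be averaged to smooth out the randomness of any single fit.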

TWO YEAR:

In [9]:
arr = mx.separate_fantasy_points(ml_data_two_year)
X = mx.reformat_df(arr[0])
y = arr[1]


two_year_regr = MLPRegressor(max_iter=1000)

two_year_RF = RandomForestRegressor()

two_year_regr_models = mx.sim(two_year_regr, X, y, mpt) 

In [10]:
# two_year_RF_models = mx.sim(two_year_RF, X, y, mpt)

THREE YEAR:

In [11]:
arr = mx.separate_fantasy_points(ml_data_three_year)
X = mx.reformat_df(arr[0])
y = arr[1]

three_year_regr = MLPRegressor(max_iter=1000)

three_year_RF = RandomForestRegressor()

three_year_regr_models= mx.sim(three_year_regr, X, y, mpt)

In [12]:
# three_year_RF_models= mx.sim(three_year_RF, X, y, mpt)

Section 4: Analysis

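Now that we have the models trained, we can analyze them. We'll analyze them in two ways. First, we'll measure how accurate the point predictions are using the mean absolute error (mean_absolute_error). Second, we'll compare how the models rank players against how the players actually ranked.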
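As a minimal sketch of the MAE check, the snippet below computes a hold-out MAE and compares it against a trivial baseline that always predicts the training mean, which helps put the number in context. The data is synthetic and the variable names are placeholders; the notebook's real feature matrix and fantasy-point targets would take their place.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and fantasy-point targets
X, y = make_regression(n_samples=300, n_features=5, noise=15.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)

# Baseline: always predict the mean of the training targets
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))

# Model under test, scored on the same held-out rows
model = RandomForestRegressor(n_estimators=50, random_state=1).fit(X_train, y_train)
model_mae = mean_absolute_error(y_test, model.predict(X_test))

print("baseline MAE:", round(baseline_mae, 2))
print("model MAE:   ", round(model_mae, 2))
```

If a model's MAE is not clearly below the baseline's, the features are adding little beyond the league-wide average.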

In [13]:
'''y_pred = one_year_regr.predict(X_test_one)
print(mean_absolute_error(y_test_one, y_pred))

y_pred = two_year_regr.predict(X_test_two)
print(mean_absolute_error(y_test_two, y_pred))

y_pred = three_year_regr.predict(X_test_three)
print(mean_absolute_error(y_test_three, y_pred))

y_pred = one_year_RF.predict(X_test_one)
print(mean_absolute_error(y_test_one, y_pred))

y_pred = two_year_RF.predict(X_test_two)
print(mean_absolute_error(y_test_two, y_pred))

y_pred = three_year_RF.predict(X_test_three)
print(mean_absolute_error(y_test_three, y_pred))'''


The MAE for all of the models ranges from roughly 70 to roughly 95. Given that most players in the league finish with a fantasy point total in the hundreds, the predicted point values aren't very close to the real-life values. However, we are less concerned with the exact point total a player will have and more concerned with their rank relative to the rest of the league. To look at this, we will rank the players both by predicted fantasy points and by actual fantasy points for a season. (I'll incorporate this at a later time.)
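The ranking comparison isn't implemented in this notebook yet. A minimal sketch of one way to do it is to rank players by predicted and by actual points, then correlate the two rankings (Spearman's rank correlation, computed here as the Pearson correlation of the ranks). The names and numbers below are invented.

```python
import pandas as pd

# Invented predictions and outcomes for five hypothetical players
df = pd.DataFrame({
    "Name": ["P1", "P2", "P3", "P4", "P5"],
    "Predicted": [340.0, 300.0, 280.0, 250.0, 200.0],
    "Actual": [310.0, 330.0, 240.0, 260.0, 150.0],
})

# Rank 1 = most points; ties would share the average rank
df["Pred_Rank"] = df["Predicted"].rank(ascending=False)
df["Actual_Rank"] = df["Actual"].rank(ascending=False)

# Pearson correlation of the two rank columns equals Spearman's rho
rho = df["Pred_Rank"].corr(df["Actual_Rank"])
print(df)
print("rank correlation:", round(rho, 3))
```

A value near 1 means the predicted draft order closely matches the real finishing order, even when the raw point totals are off.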

Section 5: Predictions for Next Year

Now that we've taken a look at the accuracy of the models, we'll see what they think will happen in the upcoming season (originally 2022-2023; the data now runs through 2024-2025, so the latest run targets 2025-2026).

In [15]:
big_preds = []

one_year_pred = yearly_player_data[14].copy()
one_year_pred.drop(columns=["Fantasy_Points"], inplace=True)
dfs1 = []
for i in range(mpt):
    dfs1.append(mx.get_name_predictions(one_year_regr_models[i], one_year_pred))
big_preds.extend(dfs1)

# one_year_pred = yearly_player_data[14].copy()
# one_year_pred.drop(columns=["Fantasy_Points"], inplace=True)
# dfs2 = []
# for i in range(mpt):
#     dfs2.append(mx.get_name_predictions(one_year_RF_models[i], one_year_pred))
# big_preds.extend(dfs2)

two_year_pred = [yearly_player_data[i] for i in [13,14]]
two_year_pred = mx.merge_dataframes(two_year_pred, yearly_player_data[12])
two_year_pred.drop(columns=["Predicted_Fantasy_Points"], inplace=True)
dfs3 = []
for i in range(mpt):
    dfs3.append(mx.get_name_predictions(two_year_regr_models[i], two_year_pred))
big_preds.extend(dfs3)

# two_year_pred = [yearly_player_data[i] for i in [13,14]]
# two_year_pred = mx.merge_dataframes(two_year_pred, yearly_player_data[12])
# two_year_pred.drop(columns=["Predicted_Fantasy_Points"], inplace=True)
# dfs4 = []
# for i in range(mpt):
#     dfs4.append(mx.get_name_predictions(two_year_RF_models[i], two_year_pred))
# big_preds.extend(dfs4)

three_year_pred = [yearly_player_data[i] for i in [12,13,14]]
three_year_pred = mx.merge_dataframes(three_year_pred, yearly_player_data[12])
three_year_pred.drop(columns=["Predicted_Fantasy_Points"], inplace=True)
dfs5 = []
for i in range(mpt):
    dfs5.append(mx.get_name_predictions(three_year_regr_models[i], three_year_pred))
big_preds.extend(dfs5)

# three_year_pred = [yearly_player_data[i] for i in [12,13,14]]
# three_year_pred = mx.merge_dataframes(three_year_pred, yearly_player_data[12])
# three_year_pred.drop(columns=["Predicted_Fantasy_Points"], inplace=True)
# dfs6 = []
# for i in range(mpt):
#     dfs6.append(mx.get_name_predictions(three_year_RF_models[i], three_year_pred))
# big_preds.extend(dfs6)

In [16]:
pd.options.display.max_rows = None

In [17]:

# one_year_df = mx.sum_predictions(dfs1)
# one_year_df.sort_values(by="Prediction", ascending=False, inplace=True)
# pd.options.display.max_rows = None
# display(one_year_df)

In [18]:

# one_year_df = mx.sum_predictions(dfs2)
# one_year_df.sort_values(by="Prediction", ascending=False, inplace=True)
# pd.options.display.max_rows = None
# display(one_year_df)

In [19]:
# two_year_df = mx.sum_predictions(dfs3)
# two_year_df.sort_values(by="Prediction", ascending=False, inplace=True)
# pd.options.display.max_rows = None
# display(two_year_df)

In [20]:
# two_year_df = mx.sum_predictions(dfs4)
# two_year_df.sort_values(by="Prediction", ascending=False, inplace=True)
# pd.options.display.max_rows = None
# display(two_year_df)

In [21]:
# three_year_df = mx.sum_predictions(dfs5)
# three_year_df.sort_values(by="Prediction", ascending=False, inplace=True)
# pd.options.display.max_rows = None
# display(three_year_df)

In [22]:
# three_year_df = mx.sum_predictions(dfs6)
# three_year_df.sort_values(by="Prediction", ascending=False, inplace=True)
# pd.options.display.max_rows = None
# display(three_year_df)

In [23]:
extended_preds_df = mx.sum_predictions(big_preds)
extended_preds_df['Prediction'] = extended_preds_df['Prediction'] / (3 * mpt)
extended_preds_df.sort_values(by="Prediction", ascending=False, inplace=True)
extended_preds_df.reset_index(drop=True, inplace=True)
pd.options.display.max_rows = None
display(extended_preds_df)

Unnamed: 0,Name,Prediction
0,Elias Pettersson-VAN,840.05483
1,Auston Matthews-TOR,343.237289
2,Leon Draisaitl-EDM,343.17274
3,Nathan MacKinnon-COL,342.852044
4,Connor McDavid-EDM,328.868415
5,Sebastian Aho-CAR,304.440242
6,Sidney Crosby-PIT,302.721845
7,Jack Eichel-VGK,300.088327
8,Cale Makar-COL,299.566093
9,Vincent Trocheck-NYR,299.509473
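The averaging in the cell above divides the summed predictions by a fixed 3 * mpt. That is only a true average if every player appears exactly once in each of the 3 * mpt prediction frames. As an alternative sketch, stacking the frames and taking a per-name mean divides by however many predictions each player actually has. This assumes each frame has Name and Prediction columns, as the output above suggests; the sample frames are invented.

```python
import pandas as pd

# Invented prediction frames standing in for the entries of big_preds
frames = [
    pd.DataFrame({"Name": ["P1", "P2", "P3"], "Prediction": [300.0, 200.0, 100.0]}),
    pd.DataFrame({"Name": ["P1", "P2"], "Prediction": [320.0, 220.0]}),  # P3 missing
    pd.DataFrame({"Name": ["P1", "P2", "P2"], "Prediction": [310.0, 210.0, 190.0]}),  # P2 duplicated
]

stacked = pd.concat(frames, ignore_index=True)

# Fixed-divisor average: sum per name, divided by the number of frames
fixed = stacked.groupby("Name")["Prediction"].sum() / len(frames)

# Per-name mean: divides by the number of predictions each name really has
mean = stacked.groupby("Name")["Prediction"].mean()

print(pd.DataFrame({"fixed_divisor": fixed, "groupby_mean": mean}))
```

With a fixed divisor, a name that shows up twice in a frame is inflated and a name missing from some frames is deflated. Whether either case applies to the table above would need to be checked against the real frames.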


Section 6: Conclusion and Next Steps

The predictions made by all of these models make sense; all of the predicted top players are still some of the top players in the league this year, and many of the predictions match up with predictions made by ESPN. One step that could be taken is aggregating the six models together to get an average points prediction and listing the players that way. Another way to improve the models is to incorporate injury data; some elite players were injured at some point in the past three years, and their predictions are more pessimistic than those of other players. Another improvement could be to scale for the COVID-shortened 2019-2020 and 2020-2021 seasons. There were many logistical issues that contributed to fewer games and lower scoring in those years, and scaling the goal/assist values could benefit the models.
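As a rough sketch of the season-length scaling idea, counting stats from a shortened season can be prorated to an 82-game pace. The function name and the sample numbers are made up; 56 is the length of the 2020-2021 regular season, while 2019-2020 ended with teams having played different numbers of games, so it would need a per-team value.

```python
def prorate(stat_total, games_in_season, full_season=82):
    """Scale a counting stat from a shortened season to a full-season pace."""
    if games_in_season <= 0:
        raise ValueError("games_in_season must be positive")
    return stat_total * full_season / games_in_season

# Hypothetical player: 20 goals and 28 assists in a 56-game season
goals_82 = prorate(20, 56)
assists_82 = prorate(28, 56)
print(round(goals_82, 1), round(assists_82, 1))
```

Scaling by season length rather than by each player's own games played keeps the adjustment separate from injuries, which are a different effect.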

Overall, I'm happy with how the models performed. After the 2022-2023 regular season, I will see how well they were able to predict some of the outliers, and I'll use that new data to make a prediction for the 2023-2024 season.

UPDATE 9/8/23

I updated the program to make 100 models each for the 6 different combinations of model type and years-scope, for a total of 600 models. I then took the predictions from those 600 models and averaged them out to get a final list of NHL players.

I used this list to draft my current fantasy NHL team. While most of the predictions aligned with ESPN's predictions, there were some players whom my model thought were underrated. The most notable example is Shane Pinto, currently of the Ottawa Senators; while ESPN believes he should be ranked near 300th among skaters (low enough that he doesn't have a position rank), my model predicts that he will be a top-80 skater this year. Other notable skaters that my model believes in are Brayden Schenn of the Blues, Ryan Hartman of the Wild, Ty Dellandrea of the Stars, and Mikael Backlund of the Flames. Time will tell if these predictions pan out.

UPDATE 10/3/25

I have no clue why this didn't update last year. I updated and ran this program last year, I used it to draft my team, and my team made the playoffs but lost in the semifinals. I'm happy with how it performed.

I'm updating the program now to draft my team for the 2025-26 season.

ACKNOWLEDGEMENTS:

Thank you to Rotowire and Moneypuck for making your NHL data easy for someone like me to utilize in a project like this, and thank you to Peter Tanner of Moneypuck not just for creating such a valuable resource, but for being responsive to questions I was having about your dataset.