# NHL Game Prediction Modeling
by Gary Schwaeber

## Overview

With sport betting becoming increasingly popular and mainstream I believe that data science can be used to make superior decisions over gut intuitions. In this notebook I will attempt to train logistic regression, ada boost, and gradient boosting models in an attempt to make the best possible game prediction model. I will train my models and tune model hyperparemetres using game results from seasons '2017-2018', '2018-2019', '2019-2020'. Then I will predict on held out games from the current 2021 season and evaluate my model. There are currently a handful of public models whose log loss on the current season's games is being [tracked](https://hockey-statistics.com/2021/05/03/game-projections-january-13th-2021/) on which I can compare the quality of my model to. The score I will look to optimize is log loss, however, I will also review accuracy scores due to their interpretability.

Log-loss is indicative of how close the prediction probability is to the corresponding actual/true value (0 or 1 in case of binary classification). The more the predicted probability diverges from the actual value, the higher is the log-loss value. [Source](https://towardsdatascience.com/intuition-behind-log-loss-score-4e0c9979680a)


In [158]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import numpy as np
import statsmodels.api as sm
import hockey_scraper
import pickle
import time
import random
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.preprocessing import normalize, FunctionTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, accuracy_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import roc_auc_score, roc_curve, auc

from sklearn.metrics import confusion_matrix, plot_confusion_matrix,\
    precision_score, recall_score, accuracy_score, f1_score, log_loss,\
    roc_curve, roc_auc_score, classification_report
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, AdaBoostClassifier, AdaBoostRegressor, GradientBoostingClassifier
from collections import Counter
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.compose import make_column_selector as selector
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.feature_selection import RFECV
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns', None)

In [2]:
df = pd.read_csv('data/all_games_multirolling_SVA.csv')

In [3]:
conditions = [((df['date'] >= '2017-10-04') & (df['date'] <= '2018-04-08')),
              ((df['date'] >= '2018-10-03') & (df['date'] <= '2019-04-06')),
              ((df['date'] >= '2019-10-02') & (df['date'] <= '2020-03-12')),
              ((df['date'] >= '2021-01-13') & (df['date'] <= '2021-04-29'))
             ]
, 
choices = ['2017-2018',
           '2018-2019',
           '2019-2020',
           '2020-2021']
           
    

df['Season'] = np.select(conditions, choices)

In [245]:
# define feature columns for different rolling intervals
r3 = ['home_B2B', 'away_B2B', 'home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%', 'home_last_3_FF%_5v5',
 'home_last_3_GF%_5v5',
 'home_last_3_xGF%_5v5',
 'home_last_3_SH%',
 'home_last3_pp_TOI_per_game',
 'home_last3_xGF_per_min_pp',
 'home_last3_pk_TOI_per_game',
 'home_last3_xGA_per_min_pk', 'away_last_3_FF%_5v5',
 'away_last_3_GF%_5v5',
 'away_last_3_xGF%_5v5',
 'away_last_3_SH%',
 'away_last3_pp_TOI_per_game',
 'away_last3_xGF_per_min_pp',
 'away_last3_pk_TOI_per_game',
 'away_last3_xGA_per_min_pk']
r5 =['home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%', 'home_B2B', 'away_B2B', 'home_last_5_FF%_5v5',
 'home_last_5_GF%_5v5',
 'home_last_5_xGF%_5v5',
 'home_last_5_SH%',
 'home_last5_pp_TOI_per_game',
 'home_last5_xGF_per_min_pp',
 'home_last5_pk_TOI_per_game',
 'home_last5_xGA_per_min_pk', 'away_last_5_FF%_5v5',
 'away_last_5_GF%_5v5',
 'away_last_5_xGF%_5v5',
 'away_last_5_SH%',
 'away_last5_pp_TOI_per_game',
 'away_last5_xGF_per_min_pp',
 'away_last5_pk_TOI_per_game',
 'away_last5_xGA_per_min_pk']
r10 =['home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%', 'home_B2B', 'away_B2B', 'home_last_10_FF%_5v5',
 'home_last_10_GF%_5v5',
 'home_last_10_xGF%_5v5',
 'home_last_10_SH%',
 'home_last10_pp_TOI_per_game',
 'home_last10_xGF_per_min_pp',
 'home_last10_pk_TOI_per_game',
 'home_last10_xGA_per_min_pk', 'away_last_10_FF%_5v5',
 'away_last_10_GF%_5v5',
 'away_last_10_xGF%_5v5',
 'away_last_10_SH%',
 'away_last10_pp_TOI_per_game',
 'away_last10_xGF_per_min_pp',
 'away_last10_pk_TOI_per_game',
 'away_last10_xGA_per_min_pk']
r20 = ['home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%', 'home_B2B', 'away_B2B',  'home_last_20_FF%_5v5',
 'home_last_20_GF%_5v5',
 'home_last_20_xGF%_5v5',
 'home_last_20_SH%',
 'home_last20_pp_TOI_per_game',
 'home_last20_xGF_per_min_pp',
 'home_last20_pk_TOI_per_game',
 'home_last20_xGA_per_min_pk', 'away_last_20_FF%_5v5',
 'away_last_20_GF%_5v5',
 'away_last_20_xGF%_5v5',
 'away_last_20_SH%',
 'away_last20_pp_TOI_per_game',
 'away_last20_xGF_per_min_pp',
 'away_last20_pk_TOI_per_game',
 'away_last20_xGA_per_min_pk']
r30 = ['home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%', 'home_B2B', 'away_B2B',  'home_last_30_FF%_5v5',
 'home_last_30_GF%_5v5',
 'home_last_30_xGF%_5v5',
 'home_last_30_SH%',
 'home_last30_pp_TOI_per_game',
 'home_last30_xGF_per_min_pp',
 'home_last30_pk_TOI_per_game',
 'home_last30_xGA_per_min_pk', 'away_last_30_FF%_5v5',
 'away_last_30_GF%_5v5',
 'away_last_30_xGF%_5v5',
 'away_last_30_SH%',
 'away_last30_pp_TOI_per_game',
 'away_last30_xGF_per_min_pp',
 'away_last30_pk_TOI_per_game',
 'away_last30_xGA_per_min_pk']
r40 = ['home_Goalie_FenwickSV%',
 'home_Goalie_GSAx/60',
 'home_Goalie_HDCSV%',
 'away_Goalie_FenwickSV%',
 'away_Goalie_GSAx/60',
 'away_Goalie_HDCSV%', 'home_B2B', 'away_B2B',
'home_last_40_FF%_5v5',
 'home_last_40_GF%_5v5',
 'home_last_40_xGF%_5v5',
 'home_last_40_SH%',
 'home_last40_pp_TOI_per_game',
 'home_last40_xGF_per_min_pp',
 'home_last40_pk_TOI_per_game',
 'home_last40_xGA_per_min_pk',
'away_last_40_FF%_5v5',
 'away_last_40_GF%_5v5',
 'away_last_40_xGF%_5v5',
 'away_last_40_SH%',
 'away_last40_pp_TOI_per_game',
 'away_last40_xGF_per_min_pp',
 'away_last40_pk_TOI_per_game',
 'away_last40_xGA_per_min_pk']
all_r = list(set(r3+r5+r10+r20+r30+r40))

r3_30 =list(set(r3+r30))
r5_30 = list(set(r5+r30))
r10_30 = list(set(r10+r30))
r_3_5_30 = list(set(r3+r5+r30))
r_5_20 = list(set(r5+r20))
r_5_40 = list(set(r5+r40))

## Baseline Model

The baseline model will predict that every home team wins their game and that the probability of that is the ratio of games the home team has won.

In [374]:
df['Home_Team_Won'].value_counts(normalize=True)

1    0.541458
0    0.458542
Name: Home_Team_Won, dtype: float64

In [371]:
baseline_preds = np.ones(df.shape[0])
accuracy_score(df['Home_Team_Won'],baseline_preds)

0.5414581066376496

In [381]:
baseline_probs = np.repeat(df['Home_Team_Won'].value_counts(normalize=True)[1], df.shape[0])

log_loss(df['Home_Team_Won'], baseline_probs)

0.689705681560888

The models will need to beat an accuracy score of 54.15% and a log loss of .6897, otherwise they are no better than just predicting the home team will win. 

## Rolling 5 and 40 game features

For my first set of models I will attempt using 5 and 40 game rolling features. These seemed like a good set based on the feature selection notebook. 40 games is currently the longest rolling runway I have currently for the team statistics. The 40 games stats intuitively provide the most smoothing of team data over the course of the season, while the 5 game stats may provide some insight on any streakiness or may cover recent developments that would affect short term team performances such as player injuries, trades coaching changes etc.

In [21]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [23]:
X_train.columns

Index(['home_last_5_FF%_5v5', 'home_last5_xGF_per_min_pp',
       'home_last40_pp_TOI_per_game', 'away_last40_pk_TOI_per_game',
       'home_last5_pk_TOI_per_game', 'away_B2B', 'away_last40_xGF_per_min_pp',
       'home_Goalie_GSAx/60', 'home_last_5_SH%', 'away_last5_pk_TOI_per_game',
       'away_last_5_GF%_5v5', 'away_Goalie_GSAx/60', 'home_last_40_GF%_5v5',
       'away_last_5_xGF%_5v5', 'home_B2B', 'away_last5_xGF_per_min_pp',
       'home_last40_pk_TOI_per_game', 'away_last_40_SH%',
       'away_last_40_GF%_5v5', 'home_last_40_xGF%_5v5',
       'home_Goalie_FenwickSV%', 'home_last5_xGA_per_min_pk',
       'home_last_5_GF%_5v5', 'away_Goalie_HDCSV%', 'home_last_40_SH%',
       'away_last_40_xGF%_5v5', 'away_last40_pp_TOI_per_game',
       'home_last40_xGA_per_min_pk', 'home_last5_pp_TOI_per_game',
       'away_last_5_FF%_5v5', 'away_last5_xGA_per_min_pk',
       'home_last_40_FF%_5v5', 'away_last5_pp_TOI_per_game',
       'home_last40_xGF_per_min_pp', 'home_last_5_xGF%_5v5', 'away_

In [31]:
numeric_features = ['home_last_5_FF%_5v5', 'home_last5_xGF_per_min_pp',
       'home_last40_pp_TOI_per_game', 'away_last40_pk_TOI_per_game',
       'home_last5_pk_TOI_per_game', 'away_last40_xGF_per_min_pp',
       'home_Goalie_GSAx/60', 'home_last_5_SH%', 'away_last5_pk_TOI_per_game',
       'away_last_5_GF%_5v5', 'away_Goalie_GSAx/60', 'home_last_40_GF%_5v5',
       'away_last_5_xGF%_5v5', 'away_last5_xGF_per_min_pp',
       'home_last40_pk_TOI_per_game', 'away_last_40_SH%',
       'away_last_40_GF%_5v5', 'home_last_40_xGF%_5v5',
       'home_Goalie_FenwickSV%', 'home_last5_xGA_per_min_pk',
       'home_last_5_GF%_5v5', 'away_Goalie_HDCSV%', 'home_last_40_SH%',
       'away_last_40_xGF%_5v5', 'away_last40_pp_TOI_per_game',
       'home_last40_xGA_per_min_pk', 'home_last5_pp_TOI_per_game',
       'away_last_5_FF%_5v5', 'away_last5_xGA_per_min_pk',
       'home_last_40_FF%_5v5', 'away_last5_pp_TOI_per_game',
       'home_last40_xGF_per_min_pp', 'home_last_5_xGF%_5v5', 'away_last_5_SH%',
       'away_last40_xGA_per_min_pk', 'away_Goalie_FenwickSV%',
       'away_last_40_FF%_5v5', 'home_Goalie_HDCSV%']

In [26]:
X_train.loc[:, numerical_columns]

Unnamed: 0,home_last_5_FF%_5v5,home_last5_xGF_per_min_pp,home_last40_pp_TOI_per_game,away_last40_pk_TOI_per_game,home_last5_pk_TOI_per_game,away_last40_xGF_per_min_pp,home_Goalie_GSAx/60,home_last_5_SH%,away_last5_pk_TOI_per_game,away_last_5_GF%_5v5,away_Goalie_GSAx/60,home_last_40_GF%_5v5,away_last_5_xGF%_5v5,away_last5_xGF_per_min_pp,home_last40_pk_TOI_per_game,away_last_40_SH%,away_last_40_GF%_5v5,home_last_40_xGF%_5v5,home_Goalie_FenwickSV%,home_last5_xGA_per_min_pk,home_last_5_GF%_5v5,away_Goalie_HDCSV%,home_last_40_SH%,away_last_40_xGF%_5v5,away_last40_pp_TOI_per_game,home_last40_xGA_per_min_pk,home_last5_pp_TOI_per_game,away_last_5_FF%_5v5,away_last5_xGA_per_min_pk,home_last_40_FF%_5v5,away_last5_pp_TOI_per_game,home_last40_xGF_per_min_pp,home_last_5_xGF%_5v5,away_last_5_SH%,away_last40_xGA_per_min_pk,away_Goalie_FenwickSV%,away_last_40_FF%_5v5,home_Goalie_HDCSV%
0,52.399869,0.079714,5.328333,4.540000,3.693333,0.122400,-0.334940,9.426112,3.070000,45.937500,0.027934,50.127801,48.770492,0.069910,4.923333,8.124451,51.399425,48.992719,0.932657,0.098556,57.080799,0.872792,9.025236,49.339386,4.646667,0.104858,4.190000,52.562502,0.074267,48.803377,5.893333,0.112699,51.663405,6.967375,0.133976,0.942629,49.991679,0.866667
1,42.564205,0.143856,4.705417,4.928750,3.546667,0.102018,0.205712,12.093988,4.966667,49.927641,-0.138771,56.868932,51.204482,0.096000,4.774167,8.420932,58.184556,51.954595,0.941176,0.153383,59.064609,0.882353,9.060588,52.486645,4.315417,0.129028,3.336667,46.882217,0.109128,50.828439,6.000000,0.124909,46.860987,11.358025,0.097844,0.945897,50.633643,0.869942
2,60.511924,0.113316,4.682500,5.185417,4.540000,0.120843,0.312441,8.478124,5.853333,45.427286,0.041876,56.575634,40.305523,0.153218,4.233750,7.879167,50.499508,49.851785,0.942539,0.131278,58.385392,0.891688,9.025460,49.136336,4.921667,0.116445,6.283333,43.520998,0.112415,50.407241,4.816667,0.132248,60.180542,9.286882,0.107127,0.940136,50.595552,0.896450
3,60.511924,0.113316,4.682500,5.185417,4.540000,0.120843,0.312441,8.478124,5.853333,45.427286,0.041876,56.575634,40.305523,0.153218,4.233750,7.879167,50.499508,49.851785,0.942539,0.131278,58.385392,0.891688,9.025460,49.136336,4.921667,0.116445,6.283333,43.520998,0.112415,50.407241,4.816667,0.132248,60.180542,9.286882,0.107127,0.940136,50.595552,0.896450
4,54.316401,0.118615,4.778333,5.305000,4.763333,0.143998,-0.232180,9.804628,5.963333,56.272661,0.009622,53.260259,49.941995,0.137242,4.379167,5.932286,45.246898,52.809227,0.932564,0.137299,57.771883,0.852632,7.970138,50.855171,5.571250,0.120913,4.620000,51.909534,0.086864,52.890654,5.173333,0.105738,52.571429,6.524847,0.093779,0.940035,51.197815,0.852201
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3790,52.626239,0.059924,5.620833,5.130833,4.210000,0.112455,-0.004131,9.856137,4.503333,50.523013,-0.160907,50.702434,54.505170,0.168144,4.886667,6.774670,45.320701,47.185204,0.937551,0.112114,56.342957,0.855114,9.172687,51.722795,4.195000,0.103138,5.273333,52.471344,0.095041,48.478268,4.436667,0.107724,51.508227,7.141273,0.109875,0.932952,50.676962,0.891117
3791,43.058811,0.096857,4.900833,4.127083,5.073333,0.136510,-0.646591,12.419735,3.396667,70.421512,0.072030,47.486397,46.681034,0.091262,4.458333,7.861206,45.570087,47.911779,0.926621,0.165572,66.463680,0.860963,8.299737,42.816482,4.415417,0.116636,2.333333,51.846709,0.095976,48.281854,4.120000,0.115644,39.061033,6.831641,0.127148,0.939850,46.900863,0.831615
3792,48.469552,0.102022,4.770833,4.755833,5.786667,0.122391,-0.435356,11.374701,3.583333,53.981623,0.135063,43.586998,45.388350,0.107551,4.810417,9.108824,58.458552,45.316488,0.928974,0.119585,51.856336,0.901024,7.231170,53.445722,4.307917,0.098068,4.450000,47.912088,0.061953,48.711262,4.370000,0.098096,43.516270,6.481567,0.122481,0.944050,53.247076,0.872340
3793,55.838089,0.086931,5.192083,5.041667,5.200000,0.121545,-0.029116,9.257587,3.623333,49.006951,-0.054492,56.092965,51.102088,0.090909,4.855417,8.685066,51.541976,54.701218,0.942539,0.086154,44.565217,0.868195,8.310378,50.256575,5.016667,0.116725,4.463333,53.800952,0.247286,54.328835,6.380000,0.108531,59.167117,9.050064,0.139438,0.935115,50.202661,0.863636


In [113]:
scoring = ['neg_log_loss', 'accuracy']

In [155]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['home_B2B', 'away_B2B']


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', 'passthrough', categorical_features)])

### Logistic Regression

In [118]:
log_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [0.1, 10, 20, 100],
                'logisticregression__class_weight': [None] }

log_cv = GridSearchCV(log_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [119]:
log_cv.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_5_FF%_5v5',
                                                                          'home_last5_xGF_per_min_pp',
                                                                          'home_last40_pp_TOI_per_game',
                                                                          'away_last40_pk_TOI_per_game',
                                                                          'home_last5_pk_TOI_per_game',
                                                                          'away_last40_xGF_per_min_pp',
                

In [120]:
log_cv.best_score_

-0.6777347180223572

In [194]:
log_results = pd.DataFrame(log_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
0,0.027825,0.003296,0.009434,0.000387,0.1,,l1,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.677067,-0.675773,-0.682986,-0.677729,-0.675119,-0.677735,0.002782,1,0.569333,0.590667,0.588,0.568,0.589333,0.581067,0.010168,13
3,0.02371,0.001415,0.008098,0.000967,0.1,,l2,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.675531,-0.677123,-0.683068,-0.676456,-0.677169,-0.677869,0.002666,2,0.569333,0.608,0.573333,0.584,0.566667,0.580267,0.015071,14
5,0.024544,0.00102,0.007175,7e-05,0.1,,l2,newton-cg,"{'logisticregression__C': 0.1, 'logisticregres...",-0.675577,-0.677124,-0.683118,-0.676415,-0.677126,-0.677872,0.002684,3,0.569333,0.608,0.573333,0.581333,0.565333,0.579467,0.015216,15
4,0.017814,0.000625,0.007292,0.000235,0.1,,l2,lbfgs,"{'logisticregression__C': 0.1, 'logisticregres...",-0.675577,-0.677125,-0.683118,-0.676414,-0.677127,-0.677872,0.002684,4,0.569333,0.608,0.573333,0.581333,0.565333,0.579467,0.015216,15
6,0.042029,0.003558,0.007798,0.000512,10.0,,l1,liblinear,"{'logisticregression__C': 10, 'logisticregress...",-0.675282,-0.677541,-0.683798,-0.676713,-0.67806,-0.678279,0.002915,5,0.578667,0.609333,0.576,0.584,0.565333,0.582667,0.014655,11
12,0.037968,0.00437,0.007319,0.000199,20.0,,l1,liblinear,"{'logisticregression__C': 20, 'logisticregress...",-0.675278,-0.677565,-0.683819,-0.676712,-0.678098,-0.678294,0.002922,6,0.578667,0.610667,0.574667,0.584,0.565333,0.582667,0.015272,11
9,0.024154,0.00134,0.007429,0.000419,10.0,,l2,liblinear,"{'logisticregression__C': 10, 'logisticregress...",-0.675276,-0.677583,-0.68383,-0.676706,-0.678129,-0.678305,0.002926,7,0.578667,0.609333,0.574667,0.584,0.568,0.582933,0.014196,3
11,0.024628,0.000918,0.007113,5.4e-05,10.0,,l2,newton-cg,"{'logisticregression__C': 10, 'logisticregress...",-0.675276,-0.677584,-0.68383,-0.676707,-0.678129,-0.678305,0.002926,8,0.578667,0.609333,0.574667,0.584,0.568,0.582933,0.014196,3
10,0.018765,0.000731,0.007361,0.000541,10.0,,l2,lbfgs,"{'logisticregression__C': 10, 'logisticregress...",-0.675279,-0.677585,-0.683832,-0.676704,-0.678134,-0.678307,0.002926,9,0.578667,0.610667,0.574667,0.584,0.568,0.5832,0.014693,1
18,0.03995,0.002819,0.007582,9.2e-05,100.0,,l1,liblinear,"{'logisticregression__C': 100, 'logisticregres...",-0.675276,-0.677585,-0.683832,-0.676712,-0.678132,-0.678307,0.002926,10,0.578667,0.610667,0.574667,0.584,0.568,0.5832,0.014693,1


### Ada Boost

In [151]:
ada_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25, 50],
         'ada__learning_rate': [.1, 1, 10, 20],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression()],}

ada_cv = GridSearchCV(ada_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [152]:
ada_cv.fit(X_train, y_train)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_5_FF%_5v5',
                                                                          'home_last5_xGF_per_min_pp',
                                                                          'home_last40_pp_TOI_per_game',
                                                                          'away_last40_pk_TOI_per_game',
                                                                          'home_last5_pk_TOI_per_game',
                                                                          'away_last40_xGF_per_min_pp',
                

In [153]:
ada_cv.best_score_

-0.6812883430589551

In [156]:
ada_results = pd.DataFrame(ada_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
4,48.373599,0.345728,2.937553,0.031958,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.68246,-0.678853,-0.681992,-0.682383,-0.680754,-0.681288,0.001363,1,0.576,0.597333,0.56,0.573333,0.570667,0.575467,0.012209,3
0,51.813747,0.627942,2.980223,0.038913,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.682541,-0.680892,-0.682944,-0.680844,-0.680954,-0.681635,0.000914,2,0.562667,0.578667,0.552,0.557333,0.562667,0.562667,0.008924,6
8,0.144892,0.005547,0.017487,0.001032,LogisticRegression(),0.1,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.684177,-0.68308,-0.685022,-0.682982,-0.683228,-0.683698,0.000787,3,0.557333,0.588,0.557333,0.570667,0.576,0.569867,0.011673,5
6,44.035846,2.534538,2.475855,0.240989,"SVC(kernel='linear', probability=True)",20.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.68587,-0.680804,-0.68496,-0.684752,-0.684487,-0.684175,0.001748,4,0.542667,0.582667,0.533333,0.549333,0.545333,0.550667,0.016844,9
5,88.420303,5.276782,5.004536,0.535126,"SVC(kernel='linear', probability=True)",10.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.686099,-0.680916,-0.685079,-0.684745,-0.684607,-0.684289,0.001765,5,0.541333,0.597333,0.537333,0.548,0.546667,0.554133,0.021935,7
1,95.772775,2.769245,5.614874,0.184574,"SVC(kernel='linear', probability=True)",0.1,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.685556,-0.684639,-0.685505,-0.684347,-0.684885,-0.684986,0.000476,6,0.542667,0.542667,0.538667,0.544,0.544,0.5424,0.00196,13
7,81.078607,7.450631,4.446727,0.699147,"SVC(kernel='linear', probability=True)",20.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.68772,-0.681945,-0.686726,-0.686418,-0.686285,-0.685819,0.002001,7,0.542667,0.586667,0.542667,0.544,0.544,0.552,0.017344,8
9,0.292306,0.043241,0.027016,0.002916,LogisticRegression(),0.1,50,"{'ada__base_estimator': LogisticRegression(), ...",-0.686973,-0.686153,-0.687641,-0.686227,-0.68664,-0.686727,0.000545,8,0.554667,0.584,0.573333,0.561333,0.578667,0.5704,0.010878,4
2,39.229604,0.798806,2.387321,0.059241,"SVC(kernel='linear', probability=True)",1.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.688021,-0.688112,-0.688802,-0.688062,-0.688124,-0.688224,0.000291,9,0.542667,0.542667,0.542667,0.544,0.544,0.5432,0.000653,10
3,75.819306,1.990608,4.576631,0.098662,"SVC(kernel='linear', probability=True)",1.0,50,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.688886,-0.688712,-0.689001,-0.68858,-0.688875,-0.688811,0.000148,10,0.542667,0.542667,0.542667,0.544,0.544,0.5432,0.000653,10


### Gradient Boosting

In [143]:
gb_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('gb', GradientBoostingClassifier())])

gb_params = {'gb__n_estimators': [200, 300, 400],
         'gb__learning_rate': [.001,.01, .1],
         'gb__max_depth' : [3,5]}

gb_cv = GridSearchCV(gb_pipeline, param_grid=gb_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [144]:
gb_cv.fit(X_train, y_train)

Fitting 5 folds for each of 18 candidates, totalling 90 fits
[CV] END gb__learning_rate=0.001, gb__max_depth=3, gb__n_estimators=200; total time=   4.9s
[CV] END gb__learning_rate=0.001, gb__max_depth=3, gb__n_estimators=200; total time=   5.0s
[CV] END gb__learning_rate=0.001, gb__max_depth=3, gb__n_estimators=200; total time=   5.1s
[CV] END gb__learning_rate=0.001, gb__max_depth=3, gb__n_estimators=200; total time=   5.1s
[CV] END gb__learning_rate=0.001, gb__max_depth=3, gb__n_estimators=200; total time=   5.1s
[CV] END gb__learning_rate=0.001, gb__max_depth=3, gb__n_estimators=300; total time=   7.6s
[CV] END gb__learning_rate=0.001, gb__max_depth=3, gb__n_estimators=300; total time=   7.5s
[CV] END gb__learning_rate=0.001, gb__max_depth=3, gb__n_estimators=300; total time=   7.4s
[CV] END gb__learning_rate=0.001, gb__max_depth=3, gb__n_estimators=300; total time=   7.4s
[CV] END gb__learning_rate=0.001, gb__max_depth=3, gb__n_estimators=300; total time=   7.5s
[CV] END gb__learni

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last_5_FF%_5v5',
                                                                          'home_last5_xGF_per_min_pp',
                                                                          'home_last40_pp_TOI_per_game',
                                                                          'away_last40_pk_TOI_per_game',
                                                                          'home_last5_pk_TOI_per_game',
                                                                          'away_last40_xGF_per_min_pp',
                

In [145]:
gb_cv.best_score_

-0.6813076251374858

In [146]:
gb_results = pd.DataFrame(gb_cv.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
gb_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_gb__learning_rate,param_gb__max_depth,param_gb__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
6,5.271923,0.141695,0.019683,0.009091,0.01,3,200,"{'gb__learning_rate': 0.01, 'gb__max_depth': 3...",-0.68219,-0.680045,-0.683195,-0.679583,-0.681526,-0.681308,0.001338,1,0.556,0.573333,0.569333,0.562667,0.542667,0.5608,0.010819,5
7,7.873014,0.096896,0.016901,0.003443,0.01,3,300,"{'gb__learning_rate': 0.01, 'gb__max_depth': 3...",-0.68301,-0.680717,-0.683615,-0.679676,-0.680791,-0.681561,0.001495,2,0.564,0.573333,0.569333,0.568,0.554667,0.565867,0.006344,2
8,10.30141,0.380511,0.015873,0.000352,0.01,3,400,"{'gb__learning_rate': 0.01, 'gb__max_depth': 3...",-0.684514,-0.682343,-0.684595,-0.679947,-0.681175,-0.682515,0.00183,3,0.562667,0.568,0.577333,0.568,0.56,0.5672,0.005939,1
5,16.534748,0.199185,0.027762,0.001951,0.001,5,400,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.684924,-0.684652,-0.683665,-0.685081,-0.686546,-0.684974,0.000928,4,0.564,0.565333,0.568,0.552,0.546667,0.5592,0.008331,7
2,10.285022,0.335853,0.019065,0.001787,0.001,3,400,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.685048,-0.685482,-0.683626,-0.685462,-0.685492,-0.685022,0.000718,5,0.549333,0.549333,0.554667,0.554667,0.541333,0.549867,0.004888,8
9,7.998002,0.027494,0.016288,0.000805,0.01,5,200,"{'gb__learning_rate': 0.01, 'gb__max_depth': 5...",-0.683699,-0.684479,-0.685518,-0.683762,-0.688825,-0.685257,0.001901,6,0.564,0.578667,0.565333,0.556,0.541333,0.561067,0.012261,4
1,7.484383,0.079754,0.015311,0.000887,0.001,3,300,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.685524,-0.686056,-0.68462,-0.686407,-0.685681,-0.685658,0.000603,7,0.544,0.542667,0.548,0.546667,0.552,0.546667,0.003266,11
4,12.39178,0.195259,0.024282,0.005705,0.001,5,300,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.685632,-0.685264,-0.685363,-0.685987,-0.686624,-0.685774,0.000493,8,0.548,0.552,0.557333,0.549333,0.538667,0.549067,0.006104,9
0,5.027624,0.073685,0.013116,0.001382,0.001,3,200,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.686338,-0.686972,-0.685883,-0.687305,-0.686566,-0.686613,0.000494,9,0.542667,0.542667,0.542667,0.542667,0.550667,0.544267,0.0032,12
3,8.214938,0.091112,0.019273,0.000652,0.001,5,200,"{'gb__learning_rate': 0.001, 'gb__max_depth': ...",-0.686655,-0.686523,-0.685808,-0.687157,-0.687206,-0.68667,0.000508,10,0.545333,0.544,0.545333,0.538667,0.545333,0.543733,0.002585,13


It does not seem that gradient boosting is producing good results for this dataset

### Feature Importance Evaluation

Reviewing the logistic regression, coefficients, I can see which feature the algorithm deemed most impactful. I am
very surprised that away_last_40_xGF%_5v5 was cut by the l1 regularization, that seemed like it would be one of the more important features.

In [191]:
log_coef = pd.DataFrame(list(zip(X_train.columns, log_cv.best_estimator_[1].coef_[0])), columns = ['Feature', 'Coef'] )
log_coef['Coef_abs'] = abs(log_coef['Coef'])
log_coef.sort_values('Coef_abs', ascending = False)

Unnamed: 0,Feature,Coef,Coef_abs
18,away_last_40_GF%_5v5,0.190469,0.190469
36,away_last40_xGA_per_min_pk,-0.159196,0.159196
38,away_last_40_FF%_5v5,-0.152225,0.152225
29,away_last_5_FF%_5v5,0.151841,0.151841
10,away_last_5_GF%_5v5,-0.067696,0.067696
39,home_Goalie_HDCSV%,0.060833,0.060833
15,away_last5_xGF_per_min_pp,-0.060642,0.060642
14,home_B2B,0.060171,0.060171
17,away_last_40_SH%,0.057695,0.057695
3,away_last40_pk_TOI_per_game,-0.054198,0.054198


## 40 Game Rolling

I will run some models using only the rolling 40 game team stats

In [351]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,r40]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [352]:
numeric_features = [
       'home_last40_pp_TOI_per_game', 'away_last40_pk_TOI_per_game',
 'away_last40_xGF_per_min_pp',
       'home_Goalie_GSAx/60',  'away_Goalie_GSAx/60', 'home_last_40_GF%_5v5',
       'home_last40_pk_TOI_per_game', 'away_last_40_SH%',
       'away_last_40_GF%_5v5', 'home_last_40_xGF%_5v5',
       'home_Goalie_FenwickSV%',
       'away_Goalie_HDCSV%', 'home_last_40_SH%',
       'away_last_40_xGF%_5v5', 'away_last40_pp_TOI_per_game',
       'home_last40_xGA_per_min_pk', 
       'home_last_40_FF%_5v5', 
       'home_last40_xGF_per_min_pp',
       'away_last40_xGA_per_min_pk', 'away_Goalie_FenwickSV%',
       'away_last_40_FF%_5v5', 'home_Goalie_HDCSV%']

### Logistic Regression

In [209]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['home_B2B', 'away_B2B']


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', 'passthrough', categorical_features)])

log_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

In [210]:
log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.01, 0.1, 1, 10],
                'logisticregression__class_weight': [None] }

log_cv_40 = GridSearchCV(log_40_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [211]:
log_cv_40.fit(X_train, y_train)

Fitting 5 folds for each of 24 candidates, totalling 120 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_pp_TOI_per_game',
                                                                          'away_last40_pk_TOI_per_game',
                                                                          'away_last40_xGF_per_min_pp',
                                                                          'home_Goalie_GSAx/60',
                                                                          'away_Goalie_GSAx/60',
                                                                          'home_last_40_GF%_5v5',
                            

In [212]:
log_40_results = pd.DataFrame(log_cv_40.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_40_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
4,0.014693,0.00113,0.007716,0.000719,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.675677,-0.672521,-0.678186,-0.674674,-0.669348,-0.674081,0.002986,1,0.570667,0.593333,0.592,0.569333,0.581333,0.581333,0.010154,3
5,0.017283,0.000292,0.006951,0.000118,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.675677,-0.67252,-0.678187,-0.674673,-0.669355,-0.674082,0.002984,2,0.570667,0.593333,0.592,0.569333,0.581333,0.581333,0.010154,3
3,0.014061,0.000746,0.008523,0.001029,0.01,,l2,liblinear,"{'logisticregression__C': 0.01, 'logisticregre...",-0.675375,-0.672548,-0.678078,-0.674946,-0.669716,-0.674133,0.002821,3,0.568,0.596,0.592,0.573333,0.584,0.582667,0.010667,2
9,0.014426,0.000387,0.00695,0.000158,0.1,,l2,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.674011,-0.672891,-0.678995,-0.6746,-0.671304,-0.67436,0.002575,4,0.570667,0.584,0.593333,0.578667,0.576,0.580533,0.00771,5
11,0.020913,0.000814,0.007221,0.000453,0.1,,l2,newton-cg,"{'logisticregression__C': 0.1, 'logisticregres...",-0.674061,-0.672896,-0.679035,-0.674563,-0.671252,-0.674362,0.0026,5,0.568,0.582667,0.593333,0.581333,0.577333,0.580533,0.008202,5
10,0.015567,0.000349,0.0071,0.00019,0.1,,l2,lbfgs,"{'logisticregression__C': 0.1, 'logisticregres...",-0.674062,-0.6729,-0.679034,-0.674562,-0.67125,-0.674362,0.002599,6,0.568,0.582667,0.593333,0.581333,0.577333,0.580533,0.008202,5
12,0.022668,0.001238,0.007185,0.000171,1.0,,l1,liblinear,"{'logisticregression__C': 1, 'logisticregressi...",-0.673682,-0.672968,-0.679327,-0.674495,-0.671486,-0.674391,0.002659,7,0.572,0.58,0.590667,0.577333,0.576,0.5792,0.006288,9
15,0.015133,0.000416,0.007112,0.000265,1.0,,l2,liblinear,"{'logisticregression__C': 1, 'logisticregressi...",-0.673574,-0.673067,-0.679649,-0.6746,-0.67206,-0.67459,0.002659,8,0.573333,0.582667,0.585333,0.574667,0.572,0.5776,0.00536,10
17,0.021284,0.000212,0.007193,0.000235,1.0,,l2,newton-cg,"{'logisticregression__C': 1, 'logisticregressi...",-0.673579,-0.673068,-0.679654,-0.674596,-0.672055,-0.67459,0.002661,9,0.573333,0.582667,0.585333,0.574667,0.572,0.5776,0.00536,10
16,0.016655,0.000304,0.006913,0.00019,1.0,,l2,lbfgs,"{'logisticregression__C': 1, 'logisticregressi...",-0.673578,-0.673068,-0.679656,-0.674598,-0.672055,-0.674591,0.002662,10,0.573333,0.582667,0.585333,0.574667,0.572,0.5776,0.00536,10


#### Feature Importance Evaluation

In [214]:
log_40_coef = pd.DataFrame(list(zip(X_train.columns, log_cv_40.best_estimator_[1].coef_[0])), columns = ['Feature', 'Coef'] )
log_40_coef['Coef_abs'] = abs(log_40_coef['Coef'])
log_40_coef.sort_values('Coef_abs', ascending = False)

Unnamed: 0,Feature,Coef,Coef_abs
10,home_last_40_xGF%_5v5,0.167769,0.167769
16,away_last_40_FF%_5v5,0.128198,0.128198
22,away_last40_pk_TOI_per_game,-0.127853,0.127853
20,away_last40_pp_TOI_per_game,-0.120332,0.120332
9,home_last_40_GF%_5v5,0.092343,0.092343
5,away_Goalie_HDCSV%,-0.069468,0.069468
7,away_B2B,-0.069465,0.069465
23,away_last40_xGA_per_min_pk,0.068317,0.068317
4,away_Goalie_GSAx/60,-0.058731,0.058731
6,home_B2B,0.057827,0.057827


### Ada Boost

In [354]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['home_B2B', 'away_B2B']


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', 'passthrough', categorical_features)])

log_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

ada_40_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25],
         'ada__learning_rate': [.01, .1, 1, 10],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression()],}

ada_cv_40 = GridSearchCV(ada_40_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [355]:
ada_cv_40.fit(X_train, y_train)

Fitting 5 folds for each of 8 candidates, totalling 40 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('preprocessor',
                                        ColumnTransformer(transformers=[('num',
                                                                         Pipeline(steps=[('scaler',
                                                                                          StandardScaler())]),
                                                                         ['home_last40_pp_TOI_per_game',
                                                                          'away_last40_pk_TOI_per_game',
                                                                          'away_last40_xGF_per_min_pp',
                                                                          'home_Goalie_GSAx/60',
                                                                          'away_Goalie_GSAx/60',
                                                                          'home_last_40_GF%_5v5',
                            

In [357]:
ada_40_results = pd.DataFrame(ada_cv_40.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_40_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
0,40.928906,1.102974,2.342911,0.05223,"SVC(kernel='linear', probability=True)",0.01,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.676974,-0.673666,-0.677154,-0.676621,-0.672943,-0.675472,0.001793,1,0.56,0.590667,0.597333,0.569333,0.58,0.579467,0.013613,2
4,0.123535,0.004822,0.0169,0.001988,LogisticRegression(),0.01,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.681029,-0.678332,-0.681242,-0.678748,-0.678069,-0.679484,0.001367,2,0.566667,0.577333,0.562667,0.573333,0.569333,0.569867,0.005102,4
3,42.345423,0.33629,2.573474,0.028111,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.683231,-0.672194,-0.682686,-0.680888,-0.680843,-0.679968,0.004002,3,0.573333,0.597333,0.546667,0.562667,0.564,0.5688,0.016645,5
1,42.433099,0.796355,2.499568,0.082708,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.682249,-0.680174,-0.68182,-0.680632,-0.680885,-0.681152,0.000768,4,0.573333,0.58,0.558667,0.56,0.548,0.564,0.011345,6
5,0.132891,0.010832,0.016125,0.000969,LogisticRegression(),0.1,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.684313,-0.682629,-0.684777,-0.68314,-0.682848,-0.683541,0.000848,5,0.565333,0.585333,0.574667,0.590667,0.578667,0.578933,0.008739,3
2,35.091805,0.160695,2.052148,0.016464,"SVC(kernel='linear', probability=True)",1.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.688593,-0.688348,-0.688968,-0.688236,-0.687525,-0.688334,0.000476,6,0.542667,0.542667,0.542667,0.544,0.544,0.5432,0.000653,7
6,0.101714,0.006457,0.0169,0.000865,LogisticRegression(),1.0,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.691506,-0.691278,-0.691629,-0.691432,-0.691436,-0.691456,0.000114,7,0.568,0.586667,0.592,0.572,0.586667,0.581067,0.00933,1
7,0.271557,0.016521,0.017777,0.002543,LogisticRegression(),10.0,25,"{'ada__base_estimator': LogisticRegression(), ...",-0.690214,-0.693413,-0.700866,-0.689553,-0.694382,-0.693686,0.004031,8,0.542667,0.542667,0.542667,0.544,0.450667,0.524533,0.036937,8


## All Rolling Game Features With Recursive Feature Elimination

In [382]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,all_r]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,all_r]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [383]:
X_train.shape

(3750, 104)

### Recursive Feature Elimination

In [275]:
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

categorical_features = ['home_B2B', 'away_B2B']


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', 'passthrough', categorical_features)])

rfecv = RFECV(estimator= LogisticRegression(max_iter =10000, penalty = 'l2', solver='liblinear', C=.1), step=1, scoring='accuracy')
rfecv_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                      ('rfecv', rfecv)])

In [276]:
rfecv_pipeline.fit(X_train, y_train)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('num',
                                                  Pipeline(steps=[('scaler',
                                                                   StandardScaler())]),
                                                  ['home_Goalie_FenwickSV%',
                                                   'home_Goalie_GSAx/60',
                                                   'home_Goalie_HDCSV%',
                                                   'away_Goalie_FenwickSV%',
                                                   'away_Goalie_GSAx/60',
                                                   'away_Goalie_HDCSV%',
                                                   'home_last_3_FF%_5v5',
                                                   'home_last_3_GF%_5v5',
                                                   'home_last_3_xGF%_5v5',
                                                   'home_last_3_SH%',
    

In [278]:
rfecv_pipeline[1].n_features_

23

In [279]:
rfecv_pipeline[1].ranking_

array([ 1,  2, 11, 75,  1, 27,  1, 48,  1, 24, 47, 81,  9, 65, 72, 16, 70,
       14, 37, 34, 42, 77,  1, 67,  1, 23, 46, 62, 10, 12, 45, 15, 36, 13,
       71, 33, 43, 63, 64, 79, 41, 82, 76, 26, 40, 73, 44, 54, 35, 74, 38,
       32, 61, 21, 28, 78, 30, 80, 57, 25,  5,  1,  1, 53,  1,  1, 49, 31,
        8, 69,  1, 66, 29, 52, 51, 55, 59,  1, 19, 18,  6,  1, 50,  1,  7,
       68,  1,  1,  1,  1, 60, 56,  4,  3, 20, 17,  1,  1, 39,  1, 58, 22,
        1,  1])

In [281]:
rfecv_results = pd.DataFrame(list(zip(X_train.columns, rfecv_pipeline[1].ranking_)), columns = ['Feature', 'Ranking']).sort_values('Ranking')
rfecv_results.head(rfecv_pipeline[1].n_features_)

Unnamed: 0,Feature,Ranking
0,home_last_10_FF%_5v5,1
102,home_last_30_FF%_5v5,1
61,away_last3_xGF_per_min_pp,1
62,away_Goalie_GSAx/60,1
64,away_last_5_xGF%_5v5,1
65,away_last_40_SH%,1
70,home_last_3_GF%_5v5,1
77,away_last_40_FF%_5v5,1
24,away_Goalie_FenwickSV%,1
81,away_last40_pk_TOI_per_game,1


In [282]:
rfecv_columns = list(rfecv_results.iloc[:rfecv_pipeline[1].n_features_,0])
rfecv_columns 

['home_last_10_FF%_5v5',
 'home_last_30_FF%_5v5',
 'away_last3_xGF_per_min_pp',
 'away_Goalie_GSAx/60',
 'away_last_5_xGF%_5v5',
 'away_last_40_SH%',
 'home_last_3_GF%_5v5',
 'away_last_40_FF%_5v5',
 'away_Goalie_FenwickSV%',
 'away_last40_pk_TOI_per_game',
 'home_Goalie_FenwickSV%',
 'away_last_20_SH%',
 'home_last5_xGA_per_min_pk',
 'away_last_10_SH%',
 'away_last_10_xGF%_5v5',
 'away_last5_pp_TOI_per_game',
 'away_last_5_SH%',
 'away_last30_pk_TOI_per_game',
 'home_last_5_xGF%_5v5',
 'home_Goalie_HDCSV%',
 'away_last_3_FF%_5v5',
 'home_last_3_SH%',
 'away_last10_pk_TOI_per_game']

### Logistic Regression

In [283]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [284]:
log_rfecv_pipeline = Pipeline(steps=[('ss', StandardScaler()),
                      ('logisticregression', LogisticRegression(max_iter=10000))])

log_params = {'logisticregression__solver' : ['liblinear', 'lbfgs', 'newton-cg'],
                'logisticregression__penalty': ['l1', 'l2'],
                'logisticregression__C': [.01, 0.1, 10, 20, 100],
                'logisticregression__class_weight': [None]}

log_cv_all = GridSearchCV(log_rfecv_pipeline, param_grid=log_params, cv=5, scoring=scoring, refit = 'neg_log_loss',  verbose=1)

In [285]:
log_cv_all.fit(X_train[rfecv_columns], y_train)

Fitting 5 folds for each of 30 candidates, totalling 150 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('logisticregression',
                                        LogisticRegression(max_iter=10000))]),
             param_grid={'logisticregression__C': [0.01, 0.1, 10, 20, 100],
                         'logisticregression__class_weight': [None],
                         'logisticregression__penalty': ['l1', 'l2'],
                         'logisticregression__solver': ['liblinear', 'lbfgs',
                                                        'newton-cg']},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'],
             verbose=1)

In [286]:
log_all_results = pd.DataFrame(log_cv_all.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
log_all_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__class_weight,param_logisticregression__penalty,param_logisticregression__solver,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
6,0.011894,0.001292,0.003923,6e-06,0.1,,l1,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.676998,-0.674722,-0.67814,-0.677774,-0.671281,-0.675783,0.002545,1,0.557333,0.592,0.570667,0.566667,0.597333,0.5768,0.01531,16
5,0.015744,0.000855,0.004152,0.00045,0.01,,l2,newton-cg,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677915,-0.675348,-0.678426,-0.678199,-0.670199,-0.676017,0.003113,2,0.554667,0.586667,0.56,0.561333,0.597333,0.572,0.016823,18
4,0.013048,0.001829,0.005018,0.000662,0.01,,l2,lbfgs,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677915,-0.675348,-0.678427,-0.6782,-0.670201,-0.676018,0.003113,3,0.554667,0.586667,0.56,0.561333,0.597333,0.572,0.016823,18
3,0.010138,0.000692,0.004599,0.000393,0.01,,l2,liblinear,"{'logisticregression__C': 0.01, 'logisticregre...",-0.677768,-0.67523,-0.678595,-0.678217,-0.670434,-0.676049,0.003044,4,0.56,0.586667,0.573333,0.565333,0.594667,0.576,0.012955,17
9,0.010136,0.000326,0.003898,0.000107,0.1,,l2,liblinear,"{'logisticregression__C': 0.1, 'logisticregres...",-0.678761,-0.675954,-0.678914,-0.681263,-0.670352,-0.677049,0.003748,5,0.56,0.594667,0.578667,0.568,0.592,0.578667,0.013387,3
11,0.016889,0.000644,0.003922,5e-05,0.1,,l2,newton-cg,"{'logisticregression__C': 0.1, 'logisticregres...",-0.678788,-0.675981,-0.678899,-0.681268,-0.670329,-0.677053,0.003756,6,0.562667,0.594667,0.576,0.568,0.592,0.578667,0.012733,3
10,0.01227,0.000551,0.003819,3.2e-05,0.1,,l2,lbfgs,"{'logisticregression__C': 0.1, 'logisticregres...",-0.678788,-0.67598,-0.678899,-0.68127,-0.670329,-0.677053,0.003757,7,0.562667,0.594667,0.576,0.568,0.592,0.578667,0.012733,3
12,0.014359,0.000719,0.004021,0.000203,10.0,,l1,liblinear,"{'logisticregression__C': 10, 'logisticregress...",-0.678976,-0.676168,-0.678976,-0.681805,-0.670594,-0.677304,0.003799,8,0.561333,0.594667,0.576,0.568,0.593333,0.578667,0.01336,3
18,0.014439,0.000322,0.004291,0.000485,20.0,,l1,liblinear,"{'logisticregression__C': 20, 'logisticregress...",-0.678992,-0.676181,-0.678983,-0.681833,-0.670604,-0.677318,0.003803,9,0.56,0.594667,0.576,0.568,0.593333,0.5784,0.013712,15
24,0.017296,0.002053,0.004062,7.7e-05,100.0,,l1,liblinear,"{'logisticregression__C': 100, 'logisticregres...",-0.679006,-0.676194,-0.678992,-0.681852,-0.670607,-0.67733,0.003808,10,0.561333,0.596,0.574667,0.568,0.593333,0.578667,0.013753,3


### Ada Boost

In [398]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021']['Home_Team_Won']

In [399]:
ada_rfecv_pipeline = Pipeline(steps=[('ss', StandardScaler()),
                      ('ada', AdaBoostClassifier())])

ada_params = {'ada__n_estimators': [25],
         'ada__learning_rate': [ .1, 10],
         'ada__base_estimator': [svm.SVC(probability=True , kernel='linear'), LogisticRegression(max_iter =10000, C=.01, penalty = 'l1', solver = 'liblinear')],}

ada_cv_all = GridSearchCV(ada_rfecv_pipeline, param_grid=ada_params, cv=5, scoring=scoring, refit='neg_log_loss', verbose=1)

In [400]:
ada_cv_all.fit(X_train, y_train)

Fitting 5 folds for each of 4 candidates, totalling 20 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('ss', StandardScaler()),
                                       ('ada', AdaBoostClassifier())]),
             param_grid={'ada__base_estimator': [SVC(kernel='linear',
                                                     probability=True),
                                                 LogisticRegression(C=0.01,
                                                                    max_iter=10000,
                                                                    penalty='l1',
                                                                    solver='liblinear')],
                         'ada__learning_rate': [0.1, 10],
                         'ada__n_estimators': [25]},
             refit='neg_log_loss', scoring=['neg_log_loss', 'accuracy'],
             verbose=1)

In [402]:
ada_all_results = pd.DataFrame(ada_cv_all.cv_results_).sort_values('mean_test_neg_log_loss', ascending=False)
ada_all_results.head(10)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ada__base_estimator,param_ada__learning_rate,param_ada__n_estimators,params,split0_test_neg_log_loss,split1_test_neg_log_loss,split2_test_neg_log_loss,split3_test_neg_log_loss,split4_test_neg_log_loss,mean_test_neg_log_loss,std_test_neg_log_loss,rank_test_neg_log_loss,split0_test_accuracy,split1_test_accuracy,split2_test_accuracy,split3_test_accuracy,split4_test_accuracy,mean_test_accuracy,std_test_accuracy,rank_test_accuracy
0,45.457442,0.866537,2.623058,0.066013,"SVC(kernel='linear', probability=True)",0.1,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.683233,-0.681924,-0.683197,-0.68211,-0.68163,-0.682418,0.000668,1,0.561333,0.565333,0.541333,0.568,0.554667,0.558133,0.009526,2
1,43.603495,0.634419,2.648859,0.125019,"SVC(kernel='linear', probability=True)",10.0,25,"{'ada__base_estimator': SVC(kernel='linear', p...",-0.684232,-0.681059,-0.684624,-0.682336,-0.683243,-0.683099,0.001294,2,0.569333,0.598667,0.553333,0.568,0.557333,0.569333,0.015889,1
2,0.07714,0.000653,0.011736,0.000212,"LogisticRegression(C=0.01, max_iter=10000, pen...",0.1,25,{'ada__base_estimator': LogisticRegression(C=0...,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,0.0,3,0.457333,0.457333,0.457333,0.456,0.456,0.4568,0.000653,3
3,0.075963,0.001387,0.01214,8.6e-05,"LogisticRegression(C=0.01, max_iter=10000, pen...",10.0,25,{'ada__base_estimator': LogisticRegression(C=0...,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,-0.693147,0.0,3,0.457333,0.457333,0.457333,0.456,0.456,0.4568,0.000653,3


## Apply Best Model To Test

I will evaluate the best model iterations on the held out 2021 season data

In [358]:
results_dict = {'cv accuracy': {}, 'cv log loss': {}, 'test accuracy': {}, 'test log_loss':{}}
accuracy_list = []
log_loss_list = []

In [359]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



test_preds_5_40 = log_cv.predict(X_test)

test_probs_5_40 = log_cv.predict_proba(X_test)


accuracy_list.append(accuracy_score(y_test, test_preds_5_40))
log_loss_list.append(log_loss(y_test, test_probs_5_40))


In [360]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



test_preds_40 = log_cv_40.predict(X_test)

test_probs_40 = log_cv_40.predict_proba(X_test)

accuracy_list.append(accuracy_score(y_test, test_preds_40))
log_loss_list.append(log_loss(y_test, test_probs_40))

In [361]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,rfecv_columns]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,rfecv_columns]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']

test_preds_rfecv = log_cv_all.predict(X_test)

test_probs_rfecv = log_cv_all.predict_proba(X_test)


accuracy_list.append(accuracy_score(y_test, test_preds_rfecv))
log_loss_list.append(log_loss(y_test, test_probs_rfecv))



In [364]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r_5_40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r_5_40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, ada_cv.predict(X_test)))
log_loss_list.append(log_loss(y_test,ada_cv.predict_proba(X_test)))

In [365]:
X_train = df[df['Season'] != '2020-2021'].dropna().loc[:,r40]
y_train = df[df['Season'] != '2020-2021'].dropna()['Home_Team_Won']
X_test = df[df['Season'] == '2020-2021'].dropna().loc[:,r40]
y_test = df[df['Season'] == '2020-2021'].dropna()['Home_Team_Won']



accuracy_list.append(accuracy_score(y_test, ada_cv_40.predict(X_test)))
log_loss_list.append(log_loss(y_test, ada_cv_40.predict_proba(X_test)))

In [403]:
results_dict['test accuracy'] = accuracy_list
results_dict['test log_loss'] = log_loss_list
models = ['5 and 40 log', '40 log', 'rfecv log', '5 and 40 ada', '40 ada']
results_dict['cv accuracy'] = [log_results['mean_test_accuracy'][0], log_40_results['mean_test_accuracy'][0], log_all_results['mean_test_accuracy'][0], ada_results['mean_test_accuracy'][0], ada_40_results['mean_test_accuracy'][0]]
results_dict['cv log loss'] = [log_cv.best_score_*-1, log_cv_40.best_score_*-1, log_cv_all.best_score_*-1, ada_cv.best_score_*-1, ada_cv_40.best_score_*-1]

In [404]:
results_df = pd.DataFrame(results_dict, index = models)

## Conclusion

Best model was logistic regression with the rolling 5 and 40 features on the test data. Interestingly, this was the 4th best model on the CV training data set though it did have the best CV accuracy.

In [405]:
results_df.sort_values('test log_loss')

Unnamed: 0,cv accuracy,cv log loss,test accuracy,test log_loss
5 and 40 log,0.581067,0.677735,0.597205,0.657201
40 log,0.580267,0.674081,0.593393,0.65724
40 ada,0.579467,0.675472,0.606099,0.660695
rfecv log,0.571733,0.675783,0.590851,0.665077
5 and 40 ada,0.562667,0.681288,0.560356,0.678121


## Next Steps
To further improve the models I would like to take the following next steps

- Train a neural network model
- Categorize B2B better
- Include team ELO feature
- Try linear weightings in rolling features
- Increase goalie games
- Add prior year goalie GAR feature
- Add Team HDSC % feature
- Add more seasons to training set
- Compare against historical implied odds from a bookmaker
- Adjust ineperienced goalie imputed stats and exclude 2021 season to avoid data leakage on test set