# Capstone Project: Predicting NHL Player Salary

## Part III - Modeling

Author: Charles Ramey

Date: 05/15/2023

---

#### Notebook Links

Part I - Project Intro & Data Cleaning
- [`Part-1_setup-and-cleaning.ipynb`](../code/Part-1_setup-and-cleaning.ipynb)

Part II - Exploratory Data Analysis (EDA)
- [`Part-2_eda.ipynb`](../code/Part-2_eda.ipynb)

Part III - Modeling
- [`Part-3.1_modeling-forwards.ipynb`](../code/Part-3.1_modeling-forwards.ipynb)
- [`Part-3.2_modeling-defense.ipynb`](../code/Part-3.2_modeling-defense.ipynb)

Part IV - Conclusion, Recommendations, and Sources
- [`Part-4_conclusion-and-recommendations.ipynb`](../code/Part-4_conclusion-and-recommendations.ipynb)

### Contents

- [Data Import and Preprocessing](#Data-Import-and-Preprocessing)
- [Modeling](#Modeling)

### Library Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
import xgboost as xg
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_predict, GridSearchCV

from sklearn.metrics import mean_squared_error
import pickle

## Data Import and Preprocessing

### Data Import

In [2]:
goalies = pd.read_csv('../data/goalies_cleaned.csv')

In [3]:
goalies.head(2)

Unnamed: 0,player,pos,contract_aav,season,pct_change,upper_limit,lower_limit,min_salary,team,games_played,...,low_danger_unblocked_shot_attempt_save_pct,xlow_danger_unblocked_shot_attempt_save_pct,low_danger_unblocked_shot_attempt_savepct_above_expected,medium_danger_unblocked_shot_attempt_save_pct,xmedium_danger_unblocked_shot_attempt_save_pct,medium_danger_unblocked_shot_attemptsave_pct_above_expected,high_danger_unblocked_shot_attempt_save_pct,xhigh_danger_unblocked_shot_attempt_save_pct,high_danger_unblocked_shot_attempt_save_pct_above_expected,final_standing
0,Ilya Bryzgalov,G,5666666,2010-11,0.0825,64300000,48300000,525000,ARI,68.0,...,0.976,0.972,0.005,0.895,0.878,0.018,0.624,0.667,-0.043,11.0
1,Henrik Karlsson,G,862500,2010-11,0.0825,64300000,48300000,525000,CGY,17.0,...,0.965,0.971,-0.007,0.901,0.875,0.026,0.567,0.639,-0.072,17.0


### Preprocessing

In [4]:
ordered_columns = sorted(goalies.columns)
goalies = goalies.reindex(columns=ordered_columns)

In [5]:
goalies.head(2)

Unnamed: 0,contract_aav,expected_goals_against,expected_pct_of_unblocked_shot_attempts_against_on_goal,expected_puck_freeze,final_standing,gaa,gaa_better_than_expected,games_played,goals_against,goals_saved_above_expected,...,season,team,upper_limit,wins_above_replacement,xgaa,xhigh_danger_unblocked_shot_attempt_save_pct,xlow_danger_unblocked_shot_attempt_save_pct,xmedium_danger_unblocked_shot_attempt_save_pct,xrebounds_per_save,xsave_pct_on_unblocked_shots
0,5666666,182.33,0.5566,485.74,11.0,2.49,0.21,68.0,168.0,14.3,...,2010-11,ARI,64300000,2.39,2.7,0.667,0.972,0.878,0.041,0.953
1,862500,33.13,0.533,87.2,17.0,2.59,-0.21,17.0,36.0,-2.9,...,2010-11,CGY,64300000,-0.48,2.38,0.639,0.971,0.875,0.037,0.954


In [6]:
goalies_train = goalies[~goalies['season'].isin(['2020-21', '2021-22'])]
goalies_test = goalies[goalies['season'].isin(['2020-21', '2021-22'])]
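
The split above holds out the two most recent seasons instead of sampling rows at random. The following is a minimal sketch of the same idea on a made-up frame; the `toy` data and the `holdout_seasons` and `is_holdout` names are illustrative and not part of the project. It checks that the two sets partition the rows and share no season.

```python
import pandas as pd

# Hypothetical toy data standing in for goalies_cleaned.csv
toy = pd.DataFrame({
    'player': ['A', 'B', 'C', 'D', 'E'],
    'season': ['2018-19', '2019-20', '2020-21', '2021-22', '2019-20'],
    'contract_aav': [1_000_000, 2_500_000, 900_000, 4_000_000, 750_000],
})

# Seasons reserved for testing, mirroring the cell above
holdout_seasons = ['2020-21', '2021-22']
is_holdout = toy['season'].isin(holdout_seasons)

toy_train = toy[~is_holdout]
toy_test = toy[is_holdout]

# The two sets should cover every row and have no season in common
shared_seasons = set(toy_train['season']) & set(toy_test['season'])
print(len(toy_train), len(toy_test), shared_seasons)
```

Building the mask once and negating it for the training set keeps the two filters from drifting apart if the list of held-out seasons changes.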

---
## Modeling

In [7]:
X_train = goalies_train.drop(columns=['player', 'pos', 'team', 'season', 'contract_aav'])
y_train = goalies_train['contract_aav']

X_test = goalies_test.drop(columns=['player', 'pos', 'team', 'season', 'contract_aav'])
y_test = goalies_test['contract_aav']

In [8]:
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

In [9]:
# Instantiate the model
lr = LinearRegression()

# Fit the model to the training data
lr.fit(Xs_train, y_train)

# Score the model's performance on the train and test data
lr.score(Xs_train, y_train), lr.score(Xs_test, y_test)

(0.705828450564584, 0.08840994935667668)

In [10]:
# Define the pipeline
pipe = Pipeline([
    ('ss', StandardScaler()),
    ('pc', PCA()),
    ('lr', LinearRegression())
])

# Define a parameter grid to search over
pgrid = {'pc__n_components' : np.arange(1, 40)}

# Instantiate the gridsearch
gs = GridSearchCV(pipe, pgrid, cv = 5, n_jobs=-1)

# Fit the grid search on the training data
gs.fit(X_train, y_train)

In [11]:
# Score the model's performance on the train and test data
gs.score(X_train, y_train), gs.score(X_test, y_test)

(0.4627815738981269, -0.16263417142296865)
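
The `score` method of a scikit-learn regressor returns R², which compares the model's squared error with the squared error of always predicting the mean of the targets being scored. A negative test score therefore means the model did worse on the held-out seasons than that constant prediction would have. The sketch below reproduces the calculation in NumPy on made-up numbers; the `r_squared` helper and the sample values are illustrative only.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

# Made-up targets and two sets of predictions
y_true = [1.0, 2.0, 3.0]
mean_preds = [2.0, 2.0, 2.0]  # always predict the mean of y_true
bad_preds = [3.0, 2.0, 1.0]   # predictions that move in the wrong direction

print(r_squared(y_true, mean_preds))  # 0.0
print(r_squared(y_true, bad_preds))   # -3.0
```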

In [12]:
# Instantiation of models to test

# Linear Regression
lr = LinearRegression()

# Decision Tree Regressor
dt = DecisionTreeRegressor()

# Random Forest Regressor
rf = RandomForestRegressor()

# Extra Trees Regressor
et = ExtraTreesRegressor()

# Adaptive Boost Regressor
ada = AdaBoostRegressor()

# Extreme Gradient Boost Regressor
xgb = xg.XGBRegressor()

In [13]:
lr_params = {
    'pc__n_components' : np.arange(1, 40)
}

dt_params = {
    'pc__n_components' : np.arange(1, 40),
    'dt__max_depth': [3,5],
    'dt__min_samples_split': [2,5,10],
    'dt__random_state': [42]}

rf_params = {
    'pc__n_components' : np.arange(1, 40),
    'rf__max_depth': [3,5],
    'rf__min_samples_split': [2,5,10],
    'rf__random_state': [42]}
    
et_params = {
    'pc__n_components' : np.arange(1, 40),
    'et__n_estimators': [50,100],
    'et__max_depth': [3,5],
    'et__min_samples_split': [2,5,10],
    'et__random_state': [42]}

ada_params = {
    'pc__n_components' : np.arange(1, 40),
    'ada__n_estimators': [50,100],
    'ada__learning_rate': [0.3,0.6],
    'ada__random_state': [42]}

xgb_params = {
    'pc__n_components' : np.arange(1, 40),
    'xgb__n_estimators': [50,100],
    'xgb__learning_rate': [0.3,0.6],
    'xgb__random_state': [42]}

In [14]:
'''
Code adapted from Devin Faye, General Assembly
'''
def evaluation(X_train, y_train, X_test, y_test, model, model_name:str,
               model_params, scores_df=None):
    # If this is the first call of this function and a previous dataframe of scores
    # is not passed in the parameters, define a new dataframe to store model metrics
    if scores_df is None:
        scores_df = pd.DataFrame(columns = ['train_score', 'test_score', 'difference', 'rmse ($)'])

    # Create a pipeline that will scale the data, perform a PCA, and feed it into the selected model
    pipe = Pipeline([
        ('ss', StandardScaler()),
        ('pc', PCA(random_state=42)),
        (model_name, model)])
    
    # Instantiate a grid search to pass the pipeline into with hyperparameters to search over
    gs = GridSearchCV(
        pipe,
        model_params,
        cv = 5, n_jobs=-1)
    
    # Fit the model to the training data
    gs.fit(X_train, y_train)
   
    # Print out the best hyperparameters for this model
    print(f"The best parameters for the {model_name} model are: {gs.best_params_}")
    
    # Store the model's predicted values
    preds = gs.predict(X_test)
    
    # Score the model's performance on the training and testing data
    train_score = round(gs.score(X_train, y_train), 2)
    test_score = round(gs.score(X_test, y_test), 2)
    difference = round(train_score - test_score, 2)
    # Take the square root of the MSE; `squared=False` is deprecated in newer scikit-learn releases
    rmse = int(np.sqrt(mean_squared_error(y_test, preds)))
    
    # Store the scores in the dataframe
    scores_df.loc[model_name,:] = [train_score, test_score, difference, rmse]
    # Make sure the scores are being stored as a dataframe
    scores_df = pd.DataFrame(scores_df)
    
    print('')
    print(scores_df)
    
    return scores_df

In [15]:
scores = evaluation(X_train, y_train, X_test, y_test, lr, 'lr',
                    lr_params)

The best parameters for the lr model are: {'pc__n_components': 4}

   train_score test_score difference rmse ($)
lr        0.46      -0.16       0.62  1682295


In [16]:
scores = evaluation(X_train, y_train, X_test, y_test, dt, 'dt',
                    dt_params, scores_df=scores)

The best parameters for the dt model are: {'dt__max_depth': 3, 'dt__min_samples_split': 2, 'dt__random_state': 42, 'pc__n_components': 1}

   train_score test_score difference rmse ($)
lr        0.46      -0.16       0.62  1682295
dt         0.5      -0.27       0.77  1758036


In [17]:
scores = evaluation(X_train, y_train, X_test, y_test, rf, 'rf',
                    rf_params, scores_df=scores)

The best parameters for the rf model are: {'pc__n_components': 12, 'rf__max_depth': 5, 'rf__min_samples_split': 10, 'rf__random_state': 42}

   train_score test_score difference rmse ($)
lr        0.46      -0.16       0.62  1682295
dt         0.5      -0.27       0.77  1758036
rf        0.75      -0.26       1.01  1751746


In [18]:
scores = evaluation(X_train, y_train, X_test, y_test, et, 'et',
                    et_params, scores_df=scores)

The best parameters for the et model are: {'et__max_depth': 5, 'et__min_samples_split': 10, 'et__n_estimators': 50, 'et__random_state': 42, 'pc__n_components': 14}

   train_score test_score difference rmse ($)
lr        0.46      -0.16       0.62  1682295
dt         0.5      -0.27       0.77  1758036
rf        0.75      -0.26       1.01  1751746
et        0.67      -0.13        0.8  1660682


In [19]:
scores = evaluation(X_train, y_train, X_test, y_test, ada, 'ada',
                    ada_params, scores_df=scores)

The best parameters for the ada model are: {'ada__learning_rate': 0.3, 'ada__n_estimators': 50, 'ada__random_state': 42, 'pc__n_components': 22}

    train_score test_score difference rmse ($)
lr         0.46      -0.16       0.62  1682295
dt          0.5      -0.27       0.77  1758036
rf         0.75      -0.26       1.01  1751746
et         0.67      -0.13        0.8  1660682
ada        0.78      -0.03       0.81  1583427


In [20]:
scores = evaluation(X_train, y_train, X_test, y_test, xgb, 'xgb',
                    xgb_params, scores_df=scores)

The best parameters for the xgb model are: {'pc__n_components': 16, 'xgb__learning_rate': 0.3, 'xgb__n_estimators': 50, 'xgb__random_state': 42}

    train_score test_score difference rmse ($)
lr         0.46      -0.16       0.62  1682295
dt          0.5      -0.27       0.77  1758036
rf         0.75      -0.26       1.01  1751746
et         0.67      -0.13        0.8  1660682
ada        0.78      -0.03       0.81  1583427
xgb         1.0      -0.68       1.68  2020580


As expected, the models perform worse on the goalie dataset than on the forwards and defense datasets: every grid-searched model has a negative test score. This is likely because the goalie dataset is severely limited in size.
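
One way to put these test scores and RMSE values in context is to compare them with a naive baseline that always predicts the mean training salary. The sketch below does this with scikit-learn's `DummyRegressor` on made-up salaries; the `*_toy` arrays and the `baseline` name are assumptions for illustration, and this step is not part of the original analysis.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up salaries for illustration only
y_train_toy = np.array([1_000_000, 2_000_000, 3_000_000])
y_test_toy = np.array([1_500_000, 4_500_000])

# DummyRegressor ignores the features, so placeholder columns are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

# Always predict the mean of the training targets
baseline = DummyRegressor(strategy='mean')
baseline.fit(X_train_toy, y_train_toy)

baseline_preds = baseline.predict(X_test_toy)
baseline_rmse = np.sqrt(mean_squared_error(y_test_toy, baseline_preds))
baseline_r2 = baseline.score(X_test_toy, y_test_toy)

print(f"Baseline RMSE: ${int(baseline_rmse)}")
print(f"Baseline test score: {round(baseline_r2, 2)}")
```

In this toy case the baseline's test score is itself negative because the mean salary differs between the training and test targets. If the same were true of the real goalie data, a shift in salary levels between seasons could account for part of the negative scores, which would be worth checking before attributing them entirely to the models.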

In [21]:
lr_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('pc', PCA(
        n_components=19,
        random_state=42)),
    ('lr', LinearRegression())])

In [22]:
lr_pipe.fit(X_train, y_train)

In [23]:
preds = lr_pipe.predict(X_test)

In [24]:
print(f"Training score: {round(lr_pipe.score(X_train, y_train), 2)}")
print(f"Test score: {round(lr_pipe.score(X_test, y_test), 2)}")
print(f"Difference: {round(lr_pipe.score(X_train, y_train)-(lr_pipe.score(X_test, y_test)), 2)}")
print(f"RMSE: ${int(np.sqrt(mean_squared_error(y_test, preds)))}")

Training score: 0.56
Test score: 0.05
Difference: 0.51
RMSE: $1516992


In [26]:
with open('../models/goalies_model.pkl', 'wb') as file:
    pickle.dump(lr_pipe, file)
