# Capstone Project: Predicting NHL Player Salary

## Part III - Modeling

Author: Charles Ramey

Date: 05/15/2023

---

#### Notebook Links

Part I - Project Intro & Data Cleaning
- [`Part-1_setup-and-cleaning.ipynb`](../code/Part-1_setup-and-cleaning.ipynb)

Part II - Exploratory Data Analysis (EDA)
- [`Part-2_eda.ipynb`](../code/Part-2_eda.ipynb)

Part III - Modeling
- [`Part-3.1_modeling-forwards.ipynb`](../code/Part-3.1_modeling-forwards.ipynb)
- [`Part-3.3_modeling-goalies.ipynb`](../code/Part-3.3_modeling-goalies.ipynb)

Part IV - Conclusion, Recommendations, and Sources
- [`Part-4_conclusion-and-recommendations.ipynb`](../code/Part-4_conclusion-and-recommendations.ipynb)

### Contents

- [Data Import and Preprocessing](#Data-Import-and-Preprocessing)
- [Modeling](#Modeling)

### Library Imports

In [1]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, AdaBoostRegressor
import xgboost as xg
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_predict, GridSearchCV

from sklearn.metrics import mean_squared_error
import pickle

## Data Import and Preprocessing

### Data Import

In [2]:
defense = pd.read_csv('../data/defense_cleaned.csv')

In [3]:
defense.head(2)

Unnamed: 0,player,pos,contract_aav,season,pct_change,upper_limit,lower_limit,min_salary,team,games_played,...,xgoals_on_rebounds_shots,share_of_xgoals_from_rebounds_shots,xgoals_from_non_rebounds,xgoals_of_expected_rebounds,created_xgoals,created_xgoals_minus_actual_xgoals,shooting_talent_above_average,shooting_talent_adjusted_expected_goals,goals_above_shooting_talent,final_standing
0,Drew Doughty,D,7000000,2010-11,0.0825,64300000,48300000,525000,LAK,76.0,...,0.1,0.005,9.4,2.3,11.6,2.2,0.136,10.7,0.3,12.0
1,Chris Campoli,D,1750000,2010-11,0.0825,64300000,48300000,525000,CHI,77.0,...,0.4,0.097,3.5,1.0,4.4,0.6,0.118,4.3,-0.3,13.0


### Preprocessing

As with the forwards data, we sort the columns alphabetically so that the training data has the same column order as the 2022-23 data used in the web app.

In [4]:
ordered_columns = sorted(defense.columns)
defense = defense.reindex(columns=ordered_columns)
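
The same idea can be applied in the other direction: once the training columns are sorted, a new season's data can be reindexed to that exact order before it is passed to a fitted model. The sketch below is only an illustration; `train` and `current` are hypothetical toy frames, not the project's actual data.

```python
import pandas as pd

# Hypothetical stand-ins for the training data and the 2022-23 data
train = pd.DataFrame({"goals": [5, 2], "assists": [29, 17], "takeaways": [22, 19]})
current = pd.DataFrame({"takeaways": [30], "goals": [7], "assists": [40]})

# Sort the training columns alphabetically, as in the cell above
train = train.reindex(columns=sorted(train.columns))

# Reindex the new data to the training column order
current = current.reindex(columns=train.columns)

print(list(train.columns))
print(list(current.columns))
```

Note that `reindex` fills any column missing from the new data with `NaN`, so it is worth checking for missing values after this step.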

In [5]:
defense.head(2)

Unnamed: 0,assists,assists_per_60_minutes,contract_aav,created_xgoals,created_xgoals_minus_actual_xgoals,defensive_zone_giveaways,expected_goals,expected_goals_per_60_minutes,expected_pct_of_unblocked_shots_that_missed_net,expected_shooting_pct_on_unblocked_shots,...,shots_on_goal_per_60_minutes,shots_that_missed_net,shots_that_were_blocked,takeaways,team,upper_limit,xgoals_from_non_rebounds,xgoals_of_expected_rebounds,xgoals_on_rebounds_shots,xrebounds_created
0,29.0,0.89,7000000,11.6,2.2,52.0,9.4,0.29,0.318,0.042,...,4.28,75.0,140.0,22.0,LAK,64300000,9.4,2.3,0.1,10.3
1,17.0,0.69,1750000,4.4,0.6,26.0,3.8,0.16,0.325,0.026,...,3.42,30.0,70.0,19.0,CHI,64300000,3.5,1.0,0.4,4.8


Here we split the train and test data, training on the 2010-11 to 2019-20 seasons and testing on the 2020-21 and 2021-22 seasons.

In [6]:
defense_train = defense[~defense['season'].isin(['2020-21', '2021-22'])]
defense_test = defense[defense['season'].isin(['2020-21', '2021-22'])]

---
## Modeling

In [7]:
X_train = defense_train.drop(columns=['player', 'pos', 'team', 'season', 'contract_aav'])
y_train = defense_train['contract_aav']

X_test = defense_test.drop(columns=['player', 'pos', 'team', 'season', 'contract_aav'])
y_test = defense_test['contract_aav']

In [8]:
ss = StandardScaler()

Xs_train = ss.fit_transform(X_train)
Xs_test = ss.transform(X_test)

In [9]:
# Instantiate the model
lr = LinearRegression()

# Fit the model to the training data
lr.fit(Xs_train, y_train)

# Score the model's performance on the train and test data
lr.score(Xs_train, y_train), lr.score(Xs_test, y_test)

(0.7450609102059134, 0.6019537531910315)
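
For regressors in scikit-learn, `.score()` returns the coefficient of determination (R²), so the two values above are the train and test R². A minimal check on synthetic data (the arrays below are made up for illustration) shows that `.score()` matches `r2_score` applied to the model's predictions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data for illustration only
rng = np.random.default_rng(42)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

# .score() on a regressor is R^2 of the predictions against the targets
score = model.score(X, y)
manual_r2 = r2_score(y, model.predict(X))

print(np.isclose(score, manual_r2))
```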

In [10]:
# Define the pipeline
pipe = Pipeline([
    ('ss', StandardScaler()),
    ('pc', PCA()),
    ('lr', LinearRegression())
])

# Define a parameter grid to search over
pgrid = {'pc__n_components' : np.arange(1, 40)}

# Instantiate the gridsearch
gs = GridSearchCV(pipe, pgrid, cv = 5, n_jobs=-1)

# Fit the grid search on the training data
gs.fit(X_train, y_train)

In [11]:
# Score the model's performance on the train and test data
gs.score(X_train, y_train), gs.score(X_test, y_test)

(0.7214790154639533, 0.5630143787267721)
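
After a grid search over `pc__n_components`, it can help to look at which value was selected, how the cross-validated scores compare, and how much variance the retained components explain. The sketch below uses synthetic data from `make_regression` and hypothetical names (`demo_pipe`, `demo_gs`); it mirrors the pipeline above but is not the project's fitted search.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data for illustration only
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

demo_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('pc', PCA()),
    ('lr', LinearRegression())
])

demo_gs = GridSearchCV(demo_pipe, {'pc__n_components': np.arange(1, 11)}, cv=5)
demo_gs.fit(X_demo, y_demo)

# Number of components chosen by cross-validation
best_k = demo_gs.best_params_['pc__n_components']

# Mean cross-validated R^2 for each candidate, best first
cv_table = (pd.DataFrame(demo_gs.cv_results_)[['param_pc__n_components', 'mean_test_score']]
            .sort_values('mean_test_score', ascending=False))

# Share of variance retained by the selected components
explained = demo_gs.best_estimator_.named_steps['pc'].explained_variance_ratio_.sum()

print(best_k, round(explained, 3))
print(cv_table.head())
```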

In [12]:
# Instantiation of models to test

# Linear Regression
lr = LinearRegression()

# Decision Tree Regressor
dt = DecisionTreeRegressor()

# Random Forest Regressor
rf = RandomForestRegressor()

# Extra Trees Regressor
et = ExtraTreesRegressor()

# Adaptive Boost Regressor
ada = AdaBoostRegressor()

# Extreme Gradient Boost Regressor
xgb = xg.XGBRegressor()

In [13]:
lr_params = {
    'pc__n_components' : np.arange(1, 40)
}

dt_params = {
    'pc__n_components' : np.arange(1, 40),
    'dt__max_depth': [3,5],
    'dt__min_samples_split': [2,5,10],
    'dt__random_state': [42]}

rf_params = {
    'pc__n_components' : np.arange(1, 40),
    'rf__max_depth': [3,5],
    'rf__min_samples_split': [2,5,10],
    'rf__random_state': [42]}
    
et_params = {
    'pc__n_components' : np.arange(1, 40),
    'et__n_estimators': [50,100],
    'et__max_depth': [3,5],
    'et__min_samples_split': [2,5,10],
    'et__random_state': [42]}

ada_params = {
    'pc__n_components' : np.arange(1, 40),
    'ada__n_estimators': [50,100],
    'ada__learning_rate': [0.3,0.6],
    'ada__random_state': [42]}

xgb_params = {
    'pc__n_components' : np.arange(1, 40),
    'xgb__n_estimators': [50,100],
    'xgb__learning_rate': [0.3,0.6],
    'xgb__random_state': [42]}

In [14]:
'''
Code adapted from Devin Faye, General Assembly
'''
def evaluation(X_train, y_train, X_test, y_test, model, model_name:str,
               model_params, scores_df=None):
    # If this is the first call of this function and a previous dataframe of scores
    # is not passed in the parameters, define a new dataframe to store model metrics
    if scores_df is None:
        scores_df = pd.DataFrame(columns = ['train_score', 'test_score', 'difference', 'rmse ($)'])

    # Create a pipeline that will scale the data, perform a PCA, and feed it into the selected model
    pipe = Pipeline([
        ('ss', StandardScaler()),
        ('pc', PCA(random_state=42)),
        (model_name, model)])
    
    # Instantiate a grid search to pass the pipeline into with hyperparameters to search over
    gs = GridSearchCV(
        pipe,
        model_params,
        cv = 5, n_jobs=-1)
    
    # Fit the model to the training data
    gs.fit(X_train, y_train)
   
    # Print out the best hyperparameters for this model
    print(f"The best parameters for the {model_name} model are: {gs.best_params_}")
    
    # Store the model's predicted values
    preds = gs.predict(X_test)
    
    # Score the model's performance on the training and testing data
    train_score = round(gs.score(X_train, y_train), 2)
    test_score = round(gs.score(X_test, y_test), 2)
    difference = round(train_score - test_score, 2)
    # RMSE is the square root of the mean squared error
    rmse = int(np.sqrt(mean_squared_error(y_test, preds)))
    
    # Store the scores in the dataframe
    scores_df.loc[model_name,:] = [train_score, test_score, difference, rmse]
    # Make sure the scores are being stored as a dataframe
    scores_df = pd.DataFrame(scores_df)
    
    print('')
    print(scores_df)
    
    return scores_df

In [15]:
scores = evaluation(X_train, y_train, X_test, y_test, lr, 'lr',
                    lr_params)

The best parameters for the lr model are: {'pc__n_components': 35}

   train_score test_score difference rmse ($)
lr        0.72       0.56       0.16  1628068


In [16]:
scores = evaluation(X_train, y_train, X_test, y_test, dt, 'dt',
                    dt_params, scores_df=scores)

The best parameters for the dt model are: {'dt__max_depth': 3, 'dt__min_samples_split': 2, 'dt__random_state': 42, 'pc__n_components': 2}

   train_score test_score difference rmse ($)
lr        0.72       0.56       0.16  1628068
dt        0.66        0.3       0.36  2060087


In [17]:
scores = evaluation(X_train, y_train, X_test, y_test, rf, 'rf',
                    rf_params, scores_df=scores)

The best parameters for the rf model are: {'pc__n_components': 9, 'rf__max_depth': 5, 'rf__min_samples_split': 2, 'rf__random_state': 42}

   train_score test_score difference rmse ($)
lr        0.72       0.56       0.16  1628068
dt        0.66        0.3       0.36  2060087
rf        0.84       0.33       0.51  2011468


In [18]:
scores = evaluation(X_train, y_train, X_test, y_test, et, 'et',
                    et_params, scores_df=scores)

The best parameters for the et model are: {'et__max_depth': 5, 'et__min_samples_split': 2, 'et__n_estimators': 100, 'et__random_state': 42, 'pc__n_components': 20}

   train_score test_score difference rmse ($)
lr        0.72       0.56       0.16  1628068
dt        0.66        0.3       0.36  2060087
rf        0.84       0.33       0.51  2011468
et        0.73       0.39       0.34  1924557


In [19]:
scores = evaluation(X_train, y_train, X_test, y_test, ada, 'ada',
                    ada_params, scores_df=scores)

The best parameters for the ada model are: {'ada__learning_rate': 0.3, 'ada__n_estimators': 50, 'ada__random_state': 42, 'pc__n_components': 9}

    train_score test_score difference rmse ($)
lr         0.72       0.56       0.16  1628068
dt         0.66        0.3       0.36  2060087
rf         0.84       0.33       0.51  2011468
et         0.73       0.39       0.34  1924557
ada        0.72       0.36       0.36  1969138


In [20]:
scores = evaluation(X_train, y_train, X_test, y_test, xgb, 'xgb',
                    xgb_params, scores_df=scores)

The best parameters for the xgb model are: {'pc__n_components': 15, 'xgb__learning_rate': 0.3, 'xgb__n_estimators': 50, 'xgb__random_state': 42}

    train_score test_score difference rmse ($)
lr         0.72       0.56       0.16  1628068
dt         0.66        0.3       0.36  2060087
rf         0.84       0.33       0.51  2011468
et         0.73       0.39       0.34  1924557
ada        0.72       0.36       0.36  1969138
xgb         1.0        0.3        0.7  2061759


As with the forwards, linear regression performed best on the defense data, with the highest test score (0.56), the lowest RMSE (about $1.63 million), and the smallest train-test gap (0.16). However, overall performance is substantially lower than it was for the forwards, and there is considerably more variance: every other model shows a train-test gap of 0.34 or more. One possible cause is simply that there are fewer defensemen, so the models are trained on fewer data points. Unfortunately, this effect will likely be exacerbated for the goalie data, which is an even smaller dataset than either the forwards or the defense.
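
One way to probe the sample-size explanation is a learning curve, which refits a model on progressively larger subsets of the training data and records the cross-validated score at each size. The sketch below uses scikit-learn's `learning_curve` on synthetic data (`X_demo`, `y_demo` are made up); it only illustrates the technique and says nothing about the actual defense data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data for illustration only
X_demo, y_demo = make_regression(n_samples=300, n_features=20, noise=25.0, random_state=42)

pipe = Pipeline([
    ('ss', StandardScaler()),
    ('lr', LinearRegression())
])

# Fit on 20%, 40%, ..., 100% of each training fold and score with 5-fold CV
sizes, train_scores, val_scores = learning_curve(
    pipe, X_demo, y_demo, train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{n:>4} rows  train R^2 = {tr:.2f}  validation R^2 = {va:.2f}")
```

If the validation score is still rising at the largest training size, more data would likely help; if it has flattened, the limit is more likely the features or the model.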

In [21]:
lr_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('pc', PCA(
        n_components=36,  # note: the grid search above selected 35
        random_state=42)),
    ('lr', LinearRegression())])

In [22]:
lr_pipe.fit(X_train, y_train)

In [23]:
preds = lr_pipe.predict(X_test)

In [24]:
print(f"Training score: {round(lr_pipe.score(X_train, y_train), 2)}")
print(f"Test score: {round(lr_pipe.score(X_test, y_test), 2)}")
print(f"Difference: {round(lr_pipe.score(X_train, y_train)-(lr_pipe.score(X_test, y_test)), 2)}")
print(f"RMSE: ${int(np.sqrt(mean_squared_error(y_test, preds)))}")

Training score: 0.72
Test score: 0.56
Difference: 0.16
RMSE: $1631424
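
RMSE is reported in the same units as the target, so here it can be read as a typical prediction error in dollars. As a small worked example with made-up salaries (not values from the dataset), the square root of the mean squared residual agrees with the scikit-learn calculation:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up salaries and predictions, in dollars
y_true = np.array([1_000_000, 2_000_000, 4_000_000])
y_pred = np.array([1_500_000, 2_000_000, 3_000_000])

# RMSE by hand: square the residuals, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# Same quantity via scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))

print(f"RMSE: ${int(rmse_manual)}")
```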


In [26]:
with open('../models/defense_model.pkl', 'wb') as file:
    pickle.dump(lr_pipe, file)
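
A pickled pipeline can be restored later and used for prediction without refitting. The sketch below round-trips a small hypothetical pipeline (`demo_pipe`, fitted on synthetic data) through `pickle.dumps` and `pickle.loads`; reading a saved file with `pickle.load` works the same way.

```python
import pickle

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(40, 5))
y_demo = X_demo.sum(axis=1)

demo_pipe = Pipeline([
    ('ss', StandardScaler()),
    ('pc', PCA(n_components=3, random_state=42)),
    ('lr', LinearRegression())])
demo_pipe.fit(X_demo, y_demo)

# Serialize to bytes and restore, as pickle.dump / pickle.load do with a file
restored = pickle.loads(pickle.dumps(demo_pipe))

# The restored pipeline should reproduce the original predictions
same = np.allclose(demo_pipe.predict(X_demo), restored.predict(X_demo))
print(same)
```

Pickled models should be loaded with a scikit-learn version compatible with the one used to save them, and only from trusted sources.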
