## Regression to see the relationship between the features and the player's "end_cost"
"end_cost" is the cost of a player at the end of each fantasy season. We can use it to see how the fantasy algorithm values a player's impact over the course of the season.

In [1]:
import pandas as pd
import numpy as np
import requests
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from matplotlib import pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import StandardScaler


In [2]:
#Import goalie-specific, model ready csv

goalie_hist_model_ready = pd.read_csv("C:/Users/Daniel Quinn/Desktop/Bootcamp/Project_2/data/processed/goalie_hist_model_ready.csv")
#Ensure we can read the whole dataframe, without "..."
pd.set_option("display.max_columns", None)
goalie_hist_model_ready = goalie_hist_model_ready.drop(columns = ['Unnamed: 0'])
goalie_hist_model_ready.shape


(269, 33)

In [3]:
# Create saves percentage stat; goalies with no saves and no goals conceded produce NaN (0/0),
# which dropna() below uses to drop irrelevant goalies

goalie_hist_model_ready["Saves_Percentage"] = (
    (goalie_hist_model_ready['saves'] + goalie_hist_model_ready['penalties_saved']) /
    (goalie_hist_model_ready['saves'] + goalie_hist_model_ready['penalties_saved'] + goalie_hist_model_ready['goals_conceded'])
) * 100
goalie_hist_model_1_ready = goalie_hist_model_ready.dropna()
goalie_hist_model_1_ready.shape

(181, 34)
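
To make the NaN-based filter concrete, here is a minimal sketch on a made-up three-goalie frame. The values are hypothetical and not taken from the dataset; only the column names mirror the notebook. A goalie with no saves, no penalties saved, and no goals conceded gives 0/0, which pandas evaluates to NaN, so `dropna()` removes that row.

```python
import pandas as pd

# Hypothetical toy data; column names mirror the notebook's.
toy = pd.DataFrame({
    "saves": [80, 0, 45],
    "penalties_saved": [1, 0, 0],
    "goals_conceded": [30, 0, 15],
})

# Shots stopped divided by shots faced, as a percentage.
stops = toy["saves"] + toy["penalties_saved"]
toy["Saves_Percentage"] = stops / (stops + toy["goals_conceded"]) * 100

# The second goalie faced nothing, so the ratio is 0/0 -> NaN and the row is dropped.
kept = toy.dropna()
print(kept)
```

Note that `dropna()` with no arguments drops a row if any column is NaN, not just the new one. If only the new stat should drive the filter, `dropna(subset=["Saves_Percentage"])` is the narrower option.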

# P-value test on statsmodels regression

I ran a number of different models here. Starting with a maximal feature list, I used each iteration to pare back the statistically insignificant features.

For example, "start_cost" overwhelmed the model initially, so I removed it and kept paring back features until I got a model that returned p-values that made sense.

What "made sense"? Intuitively, it made no sense for a model dominated by "start_cost" to essentially ignore saves and clean sheets when assessing the value of a goalie at the end of the season.
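
The paring-back described above was done by hand, using judgment about which features made sense. As a rough illustration only, the mechanical part of that loop could be sketched as backward elimination: refit, drop the feature with the largest p-value, repeat. The function name `backward_eliminate`, the 0.05 threshold, and the toy data are all assumptions for this sketch, not part of the original analysis.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def backward_eliminate(X, y, threshold=0.05):
    """Hypothetical helper: drop the least significant feature until all p-values <= threshold."""
    features = list(X.columns)
    while features:
        # sm.OLS fits no intercept unless one is added explicitly.
        fit = sm.OLS(y, sm.add_constant(X[features])).fit()
        pvalues = fit.pvalues.drop("const")
        worst = pvalues.idxmax()
        if pvalues[worst] <= threshold:
            break
        features.remove(worst)
    return features

# Toy data (assumption): one real signal and two pure-noise columns.
rng = np.random.default_rng(0)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=["signal", "noise_a", "noise_b"])
y_toy = 3.0 * X_toy["signal"] + rng.normal(size=200)

kept = backward_eliminate(X_toy, y_toy)
print(kept)
```

The sketch adds an intercept with `sm.add_constant` because `sm.OLS` does not include one by default; a fit without it forces the regression line through the origin, which can change the p-values.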



In [4]:
#Check the p-value to determine the statistical significance of each feature

goalie_hist_model_2_ready = goalie_hist_model_1_ready[['minutes', 'total_points', 'clean_sheets', 'goals_conceded',
                                                       'saves', 'ict_index', 'bonus', 'bps',
                                                       'expected_goals_conceded', 'starts', 'penalties_saved', 'expected_goal_involvements',
                                                       'assists', 'own_goals', 'end_cost']]

import statsmodels.api as sm

#create X & y variables
X = goalie_hist_model_2_ready.drop(columns = ['end_cost'])
y = goalie_hist_model_2_ready['end_cost']

#test-training split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Use the statsmodels package to create and fit a linear regression
lr = sm.OLS(y_train, X_train).fit()
lr.pvalues.sort_values(ascending=False)

minutes                       0.952519
expected_goal_involvements    0.784980
ict_index                     0.726721
bps                           0.713543
starts                        0.637350
expected_goals_conceded       0.519998
goals_conceded                0.474923
penalties_saved               0.470761
clean_sheets                  0.431686
saves                         0.352007
own_goals                     0.321933
bonus                         0.282440
assists                       0.281112
total_points                  0.219594
dtype: float64

# Linear Regression:

In [5]:
#create X & y variables
X = goalie_hist_model_2_ready.drop(columns = ['end_cost'])
y = goalie_hist_model_2_ready['end_cost']

#test-training split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#create model
model = LinearRegression()

model.fit(X_train, y_train)
print("Train model score: ", model.score(X_train, y_train))
print("Test model score: ", model.score(X_test, y_test))

Train model score:  0.7208526117402403
Test model score:  0.6588366831630601


In [6]:
#Make predictions

prediction1 = model.predict(X_test)

#Evaluate models with mse and r2

mse = mean_squared_error(y_test, prediction1)  # mean of the squared differences between actual and predicted values; 0 is a perfect model, larger is worse
r2 = r2_score(y_test, prediction1)  # how much of the variation in the dependent variable the features explain; 1 is a perfect model, lower is worse

print(f"All Features (end_cost = y):")
print(f"mean squared error (MSE): {mse}")
print(f"R-squared (R2): {r2}")

All Features (end_cost = y):
mean squared error (MSE): 9.911793981979073
R-squared (R2): 0.6588366831630601
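
As a quick check on how to read these two numbers, here is a small worked example on made-up values (not from the dataset) that computes MSE and R2 by hand and compares them with the scikit-learn functions used above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values for illustration.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.0])

# MSE: average squared error; 0 means every prediction is exact.
mse_manual = np.mean((y_true - y_pred) ** 2)

# R2: 1 minus (residual sum of squares / total sum of squares); 1 is a perfect fit.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, r2_manual)  # 0.3125 0.9375
print(mean_squared_error(y_true, y_pred), r2_score(y_true, y_pred))
```

Taking the square root of the MSE puts the error back in the units of the target. For the model above, an MSE of about 9.91 corresponds to a typical error of roughly 3.1 in the units of end_cost.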


# LR Model for new data:

In [7]:
#Run the model to get the predicted end costs for each goalie

predicted_end_cost = goalie_hist_model_2_ready.drop(columns = ['end_cost'])
y_new_pred = model.predict(predicted_end_cost)
y_new_pred
exp_end_cost_1 = pd.DataFrame(y_new_pred)
exp_end_cost_1

Unnamed: 0,0
0,48.655486
1,49.521576
2,53.788648
3,46.411417
4,44.863684
...,...
176,40.960683
177,42.660748
178,53.417522
179,49.318396


# Random Forest:

In [8]:
#create X & y variables
X = goalie_hist_model_2_ready.drop(columns=['end_cost'])
y = goalie_hist_model_2_ready['end_cost']

#test-training split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#create & train model
random_forest = RandomForestRegressor(n_estimators=500, random_state=42).fit(X_train, y_train)

# Evaluate the model
print(f"Training Score: {random_forest.score(X_train, y_train)}")
print(f"Testing Score: {random_forest.score(X_test, y_test)}")

Training Score: 0.9490096019629226
Testing Score: 0.6905235082308542
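
The training score (0.95) is well above the testing score (0.69), and with only 181 rows a single train/test split can be noisy. One possible way to get a steadier estimate is k-fold cross-validation. The sketch below uses synthetic data from `make_regression` as a stand-in for the goalie features; the sample size, feature count, and `cv=5` are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the goalie data (assumption: similar row and feature counts).
X_toy, y_toy = make_regression(
    n_samples=180, n_features=14, n_informative=5, noise=10.0, random_state=42
)

forest = RandomForestRegressor(n_estimators=100, random_state=42)

# Five R2 scores, one per held-out fold.
scores = cross_val_score(forest, X_toy, y_toy, cv=5, scoring="r2")
print(scores.mean(), scores.std())
```

The spread of the fold scores gives a sense of how much the 0.69 figure might move with a different split.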


In [9]:
predicted_end_cost_1 = goalie_hist_model_2_ready.drop(columns = ['end_cost'])
y_new_pred_1 = random_forest.predict(predicted_end_cost_1)
y_new_pred_1
exp_end_cost_2 = pd.DataFrame(y_new_pred_1)
exp_end_cost_2

Unnamed: 0,0
0,45.376
1,49.154
2,54.892
3,44.196
4,46.408
...,...
176,40.284
177,40.190
178,51.408
179,50.256


In [10]:
# Feature Importance
feature_importances = random_forest.feature_importances_

feature_importances_df = pd.DataFrame(feature_importances, X.columns)


print(feature_importances_df.sort_values(by=0, ascending=False))

                                   0
total_points                0.285248
clean_sheets                0.247887
minutes                     0.143369
saves                       0.094906
goals_conceded              0.084884
bps                         0.039040
bonus                       0.031459
ict_index                   0.023360
expected_goals_conceded     0.015568
starts                      0.012127
assists                     0.006981
expected_goal_involvements  0.006688
penalties_saved             0.006231
own_goals                   0.002252
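
The importance table above has an unnamed column `0` because the DataFrame is built from a bare array. A minimal alternative, sketched here on synthetic data (the column names `strong`, `weak`, and `noise` are made up), is to build a named `pd.Series` indexed by feature name. Impurity-based importances from a random forest sum to 1, so each value reads as a share of the total.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (assumption): one strong driver, one weak driver, one noise column.
rng = np.random.default_rng(42)
X_toy = pd.DataFrame(rng.normal(size=(300, 3)), columns=["strong", "weak", "noise"])
y_toy = 5.0 * X_toy["strong"] + 0.5 * X_toy["weak"] + rng.normal(scale=0.5, size=300)

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_toy, y_toy)

# Named Series keyed by feature, sorted from most to least important.
importances = pd.Series(
    forest.feature_importances_, index=X_toy.columns, name="importance"
).sort_values(ascending=False)
print(importances)
```

When features overlap in what they measure, the forest tends to split the importance among them, so individual shares should be read with some caution.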


# Running models on a different combination of features

In [11]:
# These are using an edited features list - everything here is goalie-centric
#  & not necessarily applicable to other positions - what we used for the p-value test above

In [12]:
#New feature set

edit2_goalies_list = goalie_hist_model_2_ready
# [['expected_assists', 'starts', 'ict_index',
#                                               'red_cards', 'expected_goals_conceded', 'expected_goals',
#                                               'bps', 'expected_goal_involvements', 'yellow_cards',
#                                               'minutes', 'own_goals', 'goals_scored', 'clean_sheets', 'assists',
#                                               'penalties_saved', 'bonus', 'total_points', 'saves', 'goals_conceded', 'end_cost']]

edit2_goalies_list.columns


Index(['minutes', 'total_points', 'clean_sheets', 'goals_conceded', 'saves',
       'ict_index', 'bonus', 'bps', 'expected_goals_conceded', 'starts',
       'penalties_saved', 'expected_goal_involvements', 'assists', 'own_goals',
       'end_cost'],
      dtype='object')

# P-values below are identical to those above:

In [13]:
#Check the p-value to determine the statistical significance of each feature

goalie_hist_model_3_ready = edit2_goalies_list[['minutes', 'total_points', 'clean_sheets', 'goals_conceded',
                                                       'saves', 'ict_index', 'bonus', 'bps',
                                                       'expected_goals_conceded', 'starts', 'penalties_saved', 'expected_goal_involvements',
                                                       'assists', 'own_goals', 'end_cost']]

import statsmodels.api as sm

#create X & y variables
X = goalie_hist_model_3_ready.drop(columns = ['end_cost'])
y = goalie_hist_model_3_ready['end_cost']

#test-training split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Use the statsmodels package to create and fit a linear regression
lr = sm.OLS(y_train, X_train).fit()
lr.pvalues.sort_values(ascending=False)

minutes                       0.952519
expected_goal_involvements    0.784980
ict_index                     0.726721
bps                           0.713543
starts                        0.637350
expected_goals_conceded       0.519998
goals_conceded                0.474923
penalties_saved               0.470761
clean_sheets                  0.431686
saves                         0.352007
own_goals                     0.321933
bonus                         0.282440
assists                       0.281112
total_points                  0.219594
dtype: float64

# Linear Regression:

In [14]:
#create X & y variables
X = edit2_goalies_list.drop(columns=['end_cost'])
y = edit2_goalies_list['end_cost']

#test-training split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#create model
model = LinearRegression()

model.fit(X_train, y_train)
print("Train model score: ", model.score(X_train, y_train))
print("Test model score: ", model.score(X_test, y_test))

#These match the scores above (same features, same split)

Train model score:  0.7208526117402403
Test model score:  0.6588366831630601


In [15]:
#Make predictions

prediction1 = model.predict(X_test)

#Evaluate models with mse and r2

mse = mean_squared_error(y_test, prediction1)
r2 = r2_score(y_test, prediction1)

#Evaluate & print the model scores
print(f"All Features (end_cost=y):")
print(f"mean squared error (MSE): {mse}")
print(f"Test R-squared (R2): {r2}")

# without start_cost, the MSE is 10.75; r2 = 0.63

#Full-feature: MSE 1.986 R2 .944

All Features (end_cost=y):
mean squared error (MSE): 9.911793981979073
Test R-squared (R2): 0.6588366831630601


In [16]:
#Random Forest
#create X & y variables
X = edit2_goalies_list.drop(columns=['end_cost'])
y = edit2_goalies_list['end_cost']

#test-training split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

#create & train model
random_forest_2 = RandomForestRegressor(n_estimators=500, random_state=42).fit(X_train, y_train)

# Evaluate the model
print(f"Training Score: {random_forest_2.score(X_train, y_train)}")
print(f"Testing Score: {random_forest_2.score(X_test, y_test)}")


Training Score: 0.9490096019629226
Testing Score: 0.6905235082308542


In [17]:
# Feature Importance
feature_importances = random_forest_2.feature_importances_

feature_importances_df = pd.DataFrame(feature_importances, X.columns)


print(feature_importances_df.sort_values(by=0, ascending=False))

                                   0
total_points                0.285248
clean_sheets                0.247887
minutes                     0.143369
saves                       0.094906
goals_conceded              0.084884
bps                         0.039040
bonus                       0.031459
ict_index                   0.023360
expected_goals_conceded     0.015568
starts                      0.012127
assists                     0.006981
expected_goal_involvements  0.006688
penalties_saved             0.006231
own_goals                   0.002252


In [18]:


#The total_points metric is a fantasy-generated metric that includes minutes, clean_sheets, saves,
# penalties_saved, own_goals, goals_conceded. 

# New dataset predictions:

In [19]:
#Run the model to get the predicted end costs for each goalie

predicted_end_cost = edit2_goalies_list.drop(columns = ['end_cost'])
y_new_pred2 = random_forest_2.predict(predicted_end_cost)
print(y_new_pred2)

[45.376 49.154 54.892 44.196 46.408 43.992 39.856 44.116 52.146 49.326
 50.384 50.784 43.114 40.572 39.85  39.25  41.87  41.674 40.984 39.246
 53.72  51.046 44.906 44.518 49.    47.252 39.584 42.116 41.134 44.728
 43.506 44.266 46.636 46.594 44.178 44.152 52.082 49.276 44.738 44.424
 40.062 41.812 41.18  39.418 42.994 50.222 50.03  50.472 48.226 47.576
 45.524 50.334 41.38  41.27  45.772 44.98  50.824 52.486 48.082 45.934
 39.94  46.002 43.256 40.942 38.554 46.2   46.428 51.676 43.512 48.554
 48.616 43.172 39.558 39.956 40.006 40.34  58.798 58.106 54.602 60.18
 52.942 53.048 43.392 42.432 39.334 38.222 48.602 51.59  44.886 41.834
 39.36  57.164 58.232 56.882 59.602 60.316 54.01  54.652 38.646 37.614
 50.336 50.638 45.21  46.622 43.924 49.242 45.65  49.896 50.426 45.336
 44.908 40.872 43.088 40.512 50.952 51.586 53.506 49.382 51.766 49.888
 44.396 44.624 50.568 42.09  39.938 39.872 39.668 43.18  45.02  45.896
 50.904 49.54  40.478 44.572 41.182 39.988 40.642 44.222 43.158 45.68
 44.988 

In [20]:
# Check the shape of y_new_pred2 before adding it to a dataframe

y_new_pred2.shape

(181,)

In [21]:
# Create new dataframe with expected end_costs

# Work on an explicit copy so the source frame is not modified and pandas does not raise SettingWithCopyWarning
end_cost_predictions_df_master = goalie_hist_model_1_ready.copy()
end_cost_predictions_df_master['Predicted_End_Cost'] = y_new_pred2
end_cost_predictions_df_master = end_cost_predictions_df_master[['id', 'first_name', 'second_name', 'team', 'element_type', 'code',
       'element_code', 'season', 'total_points',
       'minutes', 'goals_scored', 'assists', 'clean_sheets', 'goals_conceded',
       'own_goals', 'penalties_saved', 'penalties_missed', 'yellow_cards',
       'red_cards', 'saves', 'bonus', 'bps', 'influence', 'creativity',
       'threat', 'ict_index', 'starts', 'expected_goals', 'expected_assists',
       'expected_goal_involvements', 'expected_goals_conceded',
       'Saves_Percentage', 'start_cost', 'end_cost', 'Predicted_End_Cost']]

end_cost_predictions_df_master
end_cost_predictions_df_master.to_csv("C:/Users/Daniel Quinn/Desktop/Bootcamp/Project_2/data/processed/end_cost_predictions_df_master.csv")
end_cost_predictions_df_master.shape


(181, 35)
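
When a prediction column is added to a frame that was itself derived from another one (for example via `dropna()`), building a new frame keeps the source untouched and sidesteps pandas' SettingWithCopyWarning. A minimal sketch using `DataFrame.assign`, with made-up data and a hard-coded array standing in for the model's predictions:

```python
import numpy as np
import pandas as pd

# Made-up frame; one row has a missing value and is dropped.
raw = pd.DataFrame({"id": [1, 2, 3, 4], "saves": [80.0, np.nan, 45.0, 60.0]})
clean = raw.dropna()

# Stand-in for model.predict(...): one value per remaining row, in row order.
preds = np.array([48.7, 44.9, 51.2])

# assign returns a new DataFrame; `clean` itself is left unchanged.
result = clean.assign(Predicted_End_Cost=preds)
print(result)
```

A NumPy array is attached by position, so it must be in the same row order as the frame. The original index labels (here 0, 2, 3) are preserved in the result.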