# Prior Model Selection

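This notebook selects the model used to build our prior distributions. Hyperparameters are tuned by cross-validation on the training set, and the candidate models are then compared on a held-out validation set. The models considered are: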
* Random Forest Regression (covariates: team rating and contract value)
* Random Forest Regression (covariates: contract value only)
* Gradient Boosting Regressor (covariates: team rating and contract value)
* Gradient Boosting Regressor (covariates: contract value only)

In [126]:
import pandas as pd
import numpy as np

# read in all our training data

# MAIN training set for after we've validated
main_train_rookies = pd.read_csv("../data/pre_2015_16/main_train_rookies.csv")
main_train_rookies.drop(main_train_rookies.columns[0], axis = 1, inplace = True)

main_train_vets = pd.read_csv("../data/pre_2015_16/main_train_vets.csv")
main_train_vets.drop(main_train_vets.columns[0], axis = 1, inplace = True)

# training set before validation
train_rookies = pd.read_csv("../data/pre_2015_16/train_rookies.csv")
train_rookies.drop(train_rookies.columns[0], axis = 1, inplace = True)

train_vets = pd.read_csv("../data/pre_2015_16/train_vets.csv")
train_vets.drop(train_vets.columns[0], axis = 1, inplace = True)

# validation dataset
validate_rookies = pd.read_csv("../data/pre_2015_16/validate_rookies.csv")
validate_rookies.drop(validate_rookies.columns[0], axis = 1, inplace = True)

validate_vets = pd.read_csv("../data/pre_2015_16/validate_vets.csv")
validate_vets.drop(validate_vets.columns[0], axis = 1, inplace = True)
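
The six loads above all repeat the same read-then-drop pattern. As a minimal sketch (the helper name `load_split` and the in-memory CSV are hypothetical and not part of the original notebook), the pattern could be wrapped in one function:

```python
import io

import pandas as pd


def load_split(path_or_buffer):
    """Read a CSV and drop its first column (the row index saved with the file)."""
    df = pd.read_csv(path_or_buffer)
    return df.drop(columns=df.columns[0])


# Tiny in-memory stand-in for one of the CSV files (made-up values).
demo_csv = io.StringIO(",rating,mu,coefs\n0,1.5,0.7,-0.4\n1,-2.0,2.3,3.5\n")
demo = load_split(demo_csv)
print(list(demo.columns))
```

With the real files this would be called as, for example, `load_split("../data/pre_2015_16/train_rookies.csv")`.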

Define x and y variables for model fitting. 

**NOTE** - the 1 in the variable name indicates that team rating is included as a covariate. When team rating is not included as a covariate, the variable names will have a 2 at the end.

In [277]:
# FIRST - with team rating included as a covariate

# x and y for training
x_rookies1 = np.array(train_rookies[['rating', 'mu']])
y_rookies = np.array(train_rookies['coefs'])

x_vets1 = np.array(train_vets[['rating', 'mu']])
y_vets = np.array(train_vets['coefs'])

# x and y for validation
x_rookies_validate1 = np.array(validate_rookies[['rating', 'mu']])
y_rookies_validate = np.array(validate_rookies['coefs'])

x_vets_validate1 = np.array(validate_vets[['rating', 'mu']])
y_vets_validate = np.array(validate_vets['coefs'])

# SECOND - without team rating as a covariate
# Note that we don't need to change the y variables since they stay the same regardless of the covariates
x_rookies2 = np.array(train_rookies['mu']).reshape(-1, 1)
x_vets2 = np.array(train_vets['mu']).reshape(-1, 1)
x_rookies_validate2 = np.array(validate_rookies['mu']).reshape(-1, 1)
x_vets_validate2 = np.array(validate_vets['mu']).reshape(-1, 1)

# Now create dataset for main training sets
x_main_rookies = np.array(main_train_rookies['mu']).reshape(-1, 1)
y_main_rookies = np.array(main_train_rookies['coefs'])
x_main_vets = np.array(main_train_vets['mu']).reshape(-1, 1)
y_main_vets = np.array(main_train_vets['coefs'])


## Now Model Training

We will train and validate four models each for rookies and for veterans (eight models in total): random forest with and without team rating as a covariate, and gradient boosting regressor with and without team rating as a covariate. For each group we select the model that performs best on our validation data, then retrain that model on ALL the training data to get priors for the 2015/16 NBA season.
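
The selection procedure described above can be sketched as a single loop. This is only an illustration on synthetic data: the `make_split` helper, the candidate names, and the fixed `random_state` values are assumptions added here, and the hyperparameter search is left out for brevity.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rng = np.random.default_rng(0)


def make_split(n):
    """Synthetic stand-in data: column 0 mimics 'rating', column 1 mimics 'mu'."""
    x = rng.normal(size=(n, 2))
    y = 2.0 * x[:, 1] + rng.normal(scale=0.5, size=n)
    return x, y


x_train, y_train = make_split(200)
x_valid, y_valid = make_split(100)

# Each candidate is (model, columns used): [0, 1] keeps team rating, [1] drops it.
candidates = {
    "rf_with_rating": (RandomForestRegressor(max_depth=2, n_estimators=50, random_state=0), [0, 1]),
    "rf_no_rating": (RandomForestRegressor(max_depth=2, n_estimators=50, random_state=0), [1]),
    "gbr_with_rating": (GradientBoostingRegressor(random_state=0), [0, 1]),
    "gbr_no_rating": (GradientBoostingRegressor(random_state=0), [1]),
}

validation_mse = {}
for name, (model, cols) in candidates.items():
    model.fit(x_train[:, cols], y_train)
    preds = model.predict(x_valid[:, cols])
    validation_mse[name] = float(np.mean((y_valid - preds) ** 2))

# The candidate with the lowest validation MSE is the one to retrain.
best = min(validation_mse, key=validation_mse.get)
print(best, validation_mse[best])
```

In the notebook the same comparison is done cell by cell, which also allows checking whether the top-ranked players look sensible and not only the MSE.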

### First Random Forest Models

A note on whether team rating improves model performance: based on the random forest models alone, the models appear to perform about the same, or very slightly better, on validation data WITHOUT team rating as a covariate. We will investigate this with gradient boosting as well; if we see similar results there, we will drop team rating as a covariate, since it does not seem to help and needlessly increases model complexity.


### Best RF Model for Rookies: random forest with optimized hyperparameters without team rating (MSE 15.21)

This seems to give the most intuitively reasonable results with Kyrie Irving as the top rookie.

### For Veterans: both models look good - we chose optimized params without team rating (MSE 13.6)

Since we prefer the model without team rating for rookies, we choose the model without team rating for veterans as well, for consistency. The two veteran models perform almost identically (validation MSE 13.598 with team rating vs. 13.604 without).

In [206]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# first for rookies with team rating
rf_rookie1 = RandomForestRegressor()
params = {'max_depth': [2,5,10], 'n_estimators': [50, 100, 200]} # optimize over max_depth and number of estimators

rf_rookie1 = GridSearchCV(rf_rookie1, params)

rf_rookie1 = rf_rookie1.fit(x_rookies1, y_rookies)

print(rf_rookie1.best_params_) # print the best parameters so we know what we're working with

rf_rookie1 = rf_rookie1.best_estimator_ # set the model to be the best estimator

# Now get predictions on validation set and record MSE
preds_rookie_rf1 = rf_rookie1.predict(x_rookies_validate1)
mse_rf_rookie1 = np.mean((y_rookies_validate - preds_rookie_rf1)**2)
print("MSE Random Forest Rookies Team Rating: ", mse_rf_rookie1)


# Quickly check if random forest with default hyperparameters gives more reasonable predictions - answer - not really

# tmp_rf = RandomForestRegressor(max_depth = 2).fit(x_rookies1, y_rookies)
# tmp_preds_rf = tmp_rf.predict(x_rookies_validate1)
# idx = (-tmp_preds_rf).argsort()[:10]
# tmp_preds_rf[idx]


array([1.51133654, 1.51133654, 1.38079844, 1.25516856, 1.24537634,
       1.22266694, 1.2224639 , 1.20892296, 1.16704222, 1.12812231])

In [171]:
idx = (-preds_rookie_rf1).argsort()[:10]
preds_rookie_rf1[idx]

array([1.28946149, 1.21971329, 1.21555909, 1.16758456, 1.08219275,
       1.06770593, 0.95351087, 0.95351087, 0.93633871, 0.80106413])

In [172]:
validate_rookies.iloc[idx]

Unnamed: 0,rating,Team,Type,mu,sd,name,player_id,index,player_name,coefs
9,6.984504,Oklahoma City Thunder,Rookie,0.734,5,Jeremy Lamb,203087,186,Jeremy Lamb,-0.460762
7,6.984504,Oklahoma City Thunder,Rookie,0.72832,5,Steven Adams,203500,152,Steven Adams,-0.639499
13,6.984504,Oklahoma City Thunder,Rookie,1.354,5,Dion Waiters,203079,65,Dion Waiters,-4.674635
8,6.984504,Oklahoma City Thunder,Rookie,1.898225,5,Enes Kanter,202683,279,Enes Kanter,0.640074
6,7.952643,Los Angeles Clippers,Rookie,0.81328,5,Austin Rivers,203085,174,Austin Rivers,-7.834556
3,9.804995,San Antonio Spurs,Rookie,0.964686,5,Kawhi Leonard,202695,144,Kawhi Leonard,4.714348
27,5.67812,Golden State Warriors,Rookie,1.025293,5,Klay Thompson,202691,115,Klay Thompson,3.562053
21,5.67812,Golden State Warriors,Rookie,1.01664,5,Harrison Barnes,203084,178,Harrison Barnes,1.313442
19,5.685673,Houston Rockets,Rookie,1.6,5,Kostas Papanikolaou,203123,249,Kostas Papanikolaou,-3.362832
144,-3.220032,Cleveland Cavaliers,Rookie,2.35691,5,Kyrie Irving,202681,367,Kyrie Irving,3.501904


In [189]:
# Now rookies without team rating

rf_rookie2 = RandomForestRegressor()
params = {'max_depth': [2,5,10], 'n_estimators': [50, 100, 200]} # optimize over max_depth and number of estimators

rf_rookie2 = GridSearchCV(rf_rookie2, params)

rf_rookie2 = rf_rookie2.fit(x_rookies2, y_rookies)

print(rf_rookie2.best_params_) # print the best parameters so we know what we're working with

rf_rookie2 = rf_rookie2.best_estimator_ # set the model to be the best estimator

# Now get predictions on validation set and record MSE
preds_rookie_rf2 = rf_rookie2.predict(x_rookies_validate2)
mse_rf_rookie2 = np.mean((y_rookies_validate - preds_rookie_rf2)**2)
print("MSE Random Forest Rookies NO Team Rating: ", mse_rf_rookie2)

{'max_depth': 2, 'n_estimators': 200}
MSE Random Forest Rookies NO Team Rating:  15.213135983382914


In [273]:
idx = (-preds_rookie_rf2).argsort()[:10]
print(preds_rookie_rf2[idx])
print(min(preds_rookie_rf2))
print(max(preds_rookie_rf2))

[1.7361497  0.57008504 0.45770079 0.45770079 0.45770079 0.33339012
 0.33339012 0.31222487 0.27037224 0.26797101]
-1.2760203496382783
1.7361497042574094


In [191]:
validate_rookies.iloc[idx]

Unnamed: 0,rating,Team,Type,mu,sd,name,player_id,index,player_name,coefs
144,-3.220032,Cleveland Cavaliers,Rookie,2.35691,5,Kyrie Irving,202681,367,Kyrie Irving,3.501904
1,9.804995,San Antonio Spurs,Rookie,0.692333,5,Aron Baynes,203382,391,Aron Baynes,4.288611
9,6.984504,Oklahoma City Thunder,Rookie,0.734,5,Jeremy Lamb,203087,186,Jeremy Lamb,-0.460762
150,-3.488153,Detroit Pistons,Rookie,0.73479,5,Reggie Jackson,202704,188,Reggie Jackson,-1.362558
7,6.984504,Oklahoma City Thunder,Rookie,0.72832,5,Steven Adams,203500,152,Steven Adams,-0.639499
36,4.697486,Portland Trailblazers,Rookie,0.807,5,CJ McCollum,203468,247,CJ McCollum,-0.638371
177,-5.228922,Orlando Magic,Rookie,0.79928,5,Elfrid Payton,203901,380,Elfrid Payton,-1.44275
172,-5.228922,Orlando Magic,Rookie,0.793531,5,Tobias Harris,202699,59,Tobias Harris,1.912663
114,-0.738064,Denver Nuggets,Rookie,0.749923,5,Kenneth Faried,202702,325,Kenneth Faried,1.791871
91,0.0,Atlanta Hawks,Rookie,0.811111,5,Shelvin Mack,202714,173,Shelvin Mack,2.658083


In [192]:
# Now vets with team rating

rf_vet1 = RandomForestRegressor()
params = {'max_depth': [2,5,10], 'n_estimators': [50, 100, 200]} # optimize over max_depth and number of estimators

rf_vet1 = GridSearchCV(rf_vet1, params)

rf_vet1 = rf_vet1.fit(x_vets1, y_vets)

print(rf_vet1.best_params_) # print the best parameters so we know what we're working with

rf_vet1 = rf_vet1.best_estimator_ # set the model to be the best estimator

# Now get predictions on validation set and record MSE
preds_vet_rf1 = rf_vet1.predict(x_vets_validate1)
mse_rf_vet1 = np.mean((y_vets_validate - preds_vet_rf1)**2)
print("MSE Random Forest Veterans Team Rating: ", mse_rf_vet1)

{'max_depth': 2, 'n_estimators': 50}
MSE Random Forest Veterans Team Rating:  13.598488981066104


In [210]:
idx = (-preds_vet_rf1).argsort()[:20]
preds_vet_rf1[idx]

array([3.457183  , 3.41072155, 3.34612116, 3.31499142, 3.31499142,
       3.31317354, 3.31317354, 3.31317354, 3.29430166, 3.26040505,
       3.11850324, 3.0928423 , 3.05034568, 2.79658703, 2.79272861,
       2.65164843, 2.61761138, 2.61761138, 2.61761138, 2.59595066])

In [211]:
validate_vets.iloc[idx] # this seems reasonable

Unnamed: 0,rating,Team,Type,mu,sd,name,player_id,index,player_name,coefs
20,7.952643,Los Angeles Clippers,Non-rookie,6.689521,5,Chris Paul,101108,285,Chris Paul,4.353985
25,6.984504,Oklahoma City Thunder,Non-rookie,6.331875,5,Kevin Durant,201142,284,Kevin Durant,7.042239
15,7.952643,Los Angeles Clippers,Non-rookie,5.891537,5,Blake Griffin,201933,76,Blake Griffin,1.336778
49,4.843455,Miami Heat,Non-rookie,6.881467,5,Chris Bosh,2547,24,Chris Bosh,1.85873
37,5.685673,Houston Rockets,Non-rookie,7.145424,5,Dwight Howard,2730,105,Dwight Howard,6.055062
176,-0.755792,New York Knicks,Non-rookie,7.803663,5,Amar'e Stoudemire,2405,32,Amar'e Stoudemire,4.285844
171,-0.755792,New York Knicks,Non-rookie,7.486,5,Carmelo Anthony,2546,394,Carmelo Anthony,8.285201
158,-0.489994,Brooklyn Nets,Non-rookie,7.72693,5,Joe Johnson,2207,478,Joe Johnson,4.187927
192,-1.422322,Sacramento Kings,Non-rookie,6.439108,5,Rudy Gay,200752,349,Rudy Gay,1.973265
160,-0.489994,Brooklyn Nets,Non-rookie,6.584822,5,Deron Williams,101114,316,Deron Williams,1.616904


In [195]:
# Now vets without team rating
rf_vet2 = RandomForestRegressor()
params = {'max_depth': [2,5,10], 'n_estimators': [50, 100, 200]} # optimize over max_depth and number of estimators

rf_vet2 = GridSearchCV(rf_vet2, params)

rf_vet2 = rf_vet2.fit(x_vets2, y_vets)

print(rf_vet2.best_params_) # print the best parameters so we know what we're working with

rf_vet2 = rf_vet2.best_estimator_ # set the model to be the best estimator

# Now get predictions on validation set and record MSE
preds_vet_rf2 = rf_vet2.predict(x_vets_validate2)
mse_rf_vet2 = np.mean((y_vets_validate - preds_vet_rf2)**2)
print("MSE Random Forest Veterans NO Team Rating: ", mse_rf_vet2)

{'max_depth': 2, 'n_estimators': 50}
MSE Random Forest Veterans NO Team Rating:  13.604154551267278


In [212]:
idx = (-preds_vet_rf2).argsort()[:20]
preds_vet_rf2[idx]

array([3.91736504, 3.91736504, 3.91736504, 3.91736504, 3.77602607,
       3.77602607, 3.77602607, 3.77602607, 3.77602607, 3.77602607,
       3.75199967, 3.75199967, 3.59984226, 3.52905624, 2.28116092,
       2.28116092, 2.28116092, 2.28116092, 2.28116092, 2.28116092])

In [213]:
validate_vets.iloc[idx] # also reasonable

Unnamed: 0,rating,Team,Type,mu,sd,name,player_id,index,player_name,coefs
171,-0.755792,New York Knicks,Non-rookie,7.486,5,Carmelo Anthony,2546,394,Carmelo Anthony,8.285201
176,-0.755792,New York Knicks,Non-rookie,7.803663,5,Amar'e Stoudemire,2405,32,Amar'e Stoudemire,4.285844
222,-4.658751,Los Angeles Lakers,Non-rookie,7.833333,5,Kobe Bryant,977,6,Kobe Bryant,2.311105
158,-0.489994,Brooklyn Nets,Non-rookie,7.72693,5,Joe Johnson,2207,478,Joe Johnson,4.187927
192,-1.422322,Sacramento Kings,Non-rookie,6.439108,5,Rudy Gay,200752,349,Rudy Gay,1.973265
37,5.685673,Houston Rockets,Non-rookie,7.145424,5,Dwight Howard,2730,105,Dwight Howard,6.055062
197,-3.220032,Cleveland Cavaliers,Non-rookie,6.881467,5,LeBron James,2544,165,LeBron James,3.792951
49,4.843455,Miami Heat,Non-rookie,6.881467,5,Chris Bosh,2547,24,Chris Bosh,1.85873
160,-0.489994,Brooklyn Nets,Non-rookie,6.584822,5,Deron Williams,101114,316,Deron Williams,1.616904
20,7.952643,Los Angeles Clippers,Non-rookie,6.689521,5,Chris Paul,101108,285,Chris Paul,4.353985


## Now Gradient Boosting Regressor

**NOTE** - it appears that the model gives MUCH more reasonable estimates when we do not optimize over some of the hyperparameters but rather stick with the defaults.

* When we optimize hyperparameters for rookies WITH team rating, the relative ordering of rookies seems somewhat OK, but the magnitudes of the estimates are far too small.
* When we optimize hyperparameters for rookies WITHOUT team rating, the relative ordering is actually better than with the defaults; however, the magnitudes of the estimates are again very small, which is a problem.
* For VETERANS, the best model was the one with optimized hyperparameters and no team rating. It gave the most intuitively reasonable ordering of the top 20 players, but the magnitudes were again a bit small. We could either settle for this and scale the predictions up to match the magnitudes of coefs, or leave them as is and let the Bayesian model do the rest.
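
The rescaling idea in the last bullet is not implemented in this notebook. One possible reading of it, sketched here with a hypothetical helper `rescale_to_target_spread` and made-up numbers, is to stretch the predictions around their mean until their standard deviation matches that of the training coefs:

```python
import numpy as np


def rescale_to_target_spread(preds, target):
    """Stretch preds around their mean so their std matches the std of target."""
    preds = np.asarray(preds, dtype=float)
    target = np.asarray(target, dtype=float)
    pred_sd = preds.std()
    if pred_sd == 0:
        # Constant predictions cannot be stretched; return them unchanged.
        return preds.copy()
    factor = target.std() / pred_sd
    return preds.mean() + factor * (preds - preds.mean())


shrunk_preds = np.array([-1.0, 0.0, 0.5, 3.4])       # made-up model output
train_coefs = np.array([-8.0, -2.0, 1.0, 4.0, 8.0])  # made-up training targets
rescaled = rescale_to_target_spread(shrunk_preds, train_coefs)
print(rescaled)
```

The transformation is linear with a positive factor, so the ordering of players is unchanged. Matching the full spread of coefs would likely overshoot, because coefs include noise that the covariates cannot explain, so a smaller factor may be more sensible.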


### For Veterans best GBR model - optimized without team ratings (MSE 13.75)

### For rookies best GBR model - simple model without team ratings (MSE 15.86)

In [265]:
from sklearn.ensemble import GradientBoostingRegressor

# first for rookies with team rating

# magnitudes seem good (fairly large), but ordering seems a bit suspect

gbr_rookie1 = GradientBoostingRegressor().fit(x_rookies1, y_rookies)
preds_rookie_gbr1 = gbr_rookie1.predict(x_rookies_validate1)

mse_gbr_rookie1 = np.mean((y_rookies_validate - preds_rookie_gbr1)**2)
print("MSE Gradient Boosting Rookies Team Rating: ", mse_gbr_rookie1)

idx = (-preds_rookie_gbr1).argsort()[:20]
preds_rookie_gbr1[idx]






# attempting to optimize hyperparameters:

# magnitudes are now very small. Only one positive player, the rest negative. This seems bad. 
# ordering seems somewhat acceptable but the magnitudes are just way too problematic.

# gbr_rookie1 = GradientBoostingRegressor()

# params = {'learning_rate': [0.001, 0.01, 0.1], 
#          'subsample': [1, 0.9],
#          'max_depth': [2,5,10],
#          'n_estimators': [50, 100, 200]}

# gbr_rookie1 = GridSearchCV(gbr_rookie1, params)

# gbr_rookie1 = gbr_rookie1.fit(x_rookies1, y_rookies)

# print(gbr_rookie1.best_params_) # print the best parameters so we know what we're working with

# gbr_rookie1 = gbr_rookie1.best_estimator_ # set the model to be the best estimator

# # Now get predictions on validation set and record MSE
# preds_rookie_gbr1 = gbr_rookie1.predict(x_rookies_validate1)
# mse_gbr_rookie1 = np.mean((y_rookies_validate - preds_rookie_gbr1)**2)
# print("MSE Gradient Boosting Rookies Team Rating: ", mse_gbr_rookie1)

MSE Gradient Boosting Rookies Team Rating:  16.699630349352667


array([5.47454242, 4.809936  , 4.809936  , 3.8415315 , 3.59333701,
       3.59333701, 3.38100247, 3.17798228, 3.14560827, 3.14560827,
       2.44505093, 2.44505093, 2.07707618, 2.01697625, 1.76841536,
       1.72551666, 1.53484181, 1.45777023, 1.25750733, 1.14763982])

In [266]:
idx = (-preds_rookie_gbr1).argsort()[:20]
print(preds_rookie_gbr1[idx])
print(min(preds_rookie_gbr1))
print(max(preds_rookie_gbr1)) # these are reasonable

[5.47454242 4.809936   4.809936   3.8415315  3.59333701 3.59333701
 3.38100247 3.17798228 3.14560827 3.14560827 2.44505093 2.44505093
 2.07707618 2.01697625 1.76841536 1.72551666 1.53484181 1.45777023
 1.25750733 1.14763982]
-4.501009855233326
5.474542416647444


In [267]:
validate_rookies.iloc[idx]

Unnamed: 0,rating,Team,Type,mu,sd,name,player_id,index,player_name,coefs
8,6.984504,Oklahoma City Thunder,Rookie,1.898225,5,Enes Kanter,202683,279,Enes Kanter,0.640074
21,5.67812,Golden State Warriors,Rookie,1.01664,5,Harrison Barnes,203084,178,Harrison Barnes,1.313442
27,5.67812,Golden State Warriors,Rookie,1.025293,5,Klay Thompson,202691,115,Klay Thompson,3.562053
19,5.685673,Houston Rockets,Rookie,1.6,5,Kostas Papanikolaou,203123,249,Kostas Papanikolaou,-3.362832
50,3.756517,Minnesota Timberwolves,Rookie,1.83688,5,Andrew Wiggins,203952,175,Andrew Wiggins,1.459406
42,3.756517,Minnesota Timberwolves,Rookie,1.85464,5,Anthony Bennett,203461,63,Anthony Bennett,-5.261511
6,7.952643,Los Angeles Clippers,Rookie,0.81328,5,Austin Rivers,203085,174,Austin Rivers,-7.834556
13,6.984504,Oklahoma City Thunder,Rookie,1.354,5,Dion Waiters,203079,65,Dion Waiters,-4.674635
60,3.69279,Phoenix Suns,Rookie,0.981074,5,Marcus Morris,202694,154,Marcus Morris,-2.672349
59,3.69279,Phoenix Suns,Rookie,0.996413,5,Markieff Morris,202693,153,Markieff Morris,5.578282


In [271]:
# Now rookies no team rating

# magnitudes seem far more reasonable. Ordering seems decent. This seems like the best option

gbr_rookie2 = GradientBoostingRegressor().fit(x_rookies2, y_rookies)
preds_rookie_gbr2 = gbr_rookie2.predict(x_rookies_validate2)

mse_gbr_rookie2 = np.mean((y_rookies_validate - preds_rookie_gbr2)**2)
print("MSE Gradient Boosting Rookies NO Team Rating: ", mse_gbr_rookie2)

idx = (-preds_rookie_gbr2).argsort()[:20]
preds_rookie_gbr2[idx]


# Attempting to optimize hyperparameters:

# Note - when we optimize hyperparameters, the ordering seems good actually but the magnitudes are way too small.
# also - only two players get positive coefficients and the rest have negative. This is clearly not ideal.

# gbr_rookie2 = GradientBoostingRegressor()

# params = {'learning_rate': [0.001, 0.01, 0.1], 
#          'subsample': [1, 0.9],
#          'max_depth': [2,5,10],
#          'n_estimators': [50, 100, 200]}

# gbr_rookie2 = GridSearchCV(gbr_rookie2, params)

# gbr_rookie2 = gbr_rookie2.fit(x_rookies2, y_rookies)

# print(gbr_rookie2.best_params_) # print the best parameters so we know what we're working with

# gbr_rookie2 = gbr_rookie2.best_estimator_ # set the model to be the best estimator

# # Now get predictions on validation set and record MSE
# preds_rookie_gbr2 = gbr_rookie2.predict(x_rookies_validate2)
# mse_gbr_rookie2 = np.mean((y_rookies_validate - preds_rookie_gbr2)**2)
# print("MSE Gradient Boosting Rookies NO Team Rating: ", mse_gbr_rookie2)

MSE Gradient Boosting Rookies NO Team Rating:  15.859333950289404


array([2.61780468, 2.44981099, 2.39852198, 2.27997957, 2.27997957,
       2.24406341, 1.98706712, 1.90017838, 1.90017838, 1.84087184,
       1.61235094, 1.49840397, 1.34810973, 1.34257192, 1.28329678,
       0.98080736, 0.98080736, 0.98080736, 0.88525148, 0.88525148])

In [269]:
idx = (-preds_rookie_gbr2).argsort()[:20]
print(preds_rookie_gbr2[idx])
print(min(preds_rookie_gbr2))
print(max(preds_rookie_gbr2))

[2.61780468 2.44981099 2.39852198 2.27997957 2.27997957 2.24406341
 1.98706712 1.90017838 1.90017838 1.84087184 1.61235094 1.49840397
 1.34810973 1.34257192 1.28329678 0.98080736 0.98080736 0.98080736
 0.88525148 0.88525148]
-3.2529270338487772
2.6178046752610786


In [270]:
validate_rookies.iloc[idx] # this is fairly reasonable, not ideal but not bad

Unnamed: 0,rating,Team,Type,mu,sd,name,player_id,index,player_name,coefs
1,9.804995,San Antonio Spurs,Rookie,0.692333,5,Aron Baynes,203382,391,Aron Baynes,4.288611
116,-0.738064,Denver Nuggets,Rookie,0.5064,5,Gary Harris,203914,211,Gary Harris,-6.113417
82,1.522379,Chicago Bulls,Rookie,0.669583,5,Jimmy Butler,202710,159,Jimmy Butler,4.519813
36,4.697486,Portland Trailblazers,Rookie,0.807,5,CJ McCollum,203468,247,CJ McCollum,-0.638371
177,-5.228922,Orlando Magic,Rookie,0.79928,5,Elfrid Payton,203901,380,Elfrid Payton,-1.44275
144,-3.220032,Cleveland Cavaliers,Rookie,2.35691,5,Kyrie Irving,202681,367,Kyrie Irving,3.501904
56,3.69279,Phoenix Suns,Rookie,1.21664,5,Alex Len,203458,44,Alex Len,-5.986509
208,-10.015718,Philadelphia 76ers,Rookie,1.22612,5,Thomas Robinson,203080,123,Thomas Robinson,-1.546704
80,2.771242,Toronto Raptors,Rookie,1.22612,5,Jonas Valanciunas,202685,313,Jonas Valanciunas,-2.353275
8,6.984504,Oklahoma City Thunder,Rookie,1.898225,5,Enes Kanter,202683,279,Enes Kanter,0.640074


In [248]:
# Now veterans team ratings

# here we get good magnitudes for estimates - relative ordering not great.

gbr_vet1 = GradientBoostingRegressor().fit(x_vets1, y_vets)
preds_vet_gbr1 = gbr_vet1.predict(x_vets_validate1)

mse_gbr_vet1 = np.mean((y_vets_validate - preds_vet_gbr1)**2)
print("MSE Gradient Boosting Veterans Team Rating: ", mse_gbr_vet1)


# attempting to optimize hyperparameters:

# when we optimize parameters here the relative ordering seems pretty good again, magnitude is ok but still a bit too small

# gbr_vet1 = GradientBoostingRegressor()

# params = {'learning_rate': [0.001, 0.01, 0.1], 
#          'subsample': [1, 0.9],
#          'max_depth': [2,5,10],
#          'n_estimators': [50, 100, 200]}

# gbr_vet1 = GridSearchCV(gbr_vet1, params)

# gbr_vet1 = gbr_vet1.fit(x_vets1, y_vets)

# print(gbr_vet1.best_params_) # print the best parameters so we know what we're working with

# gbr_vet1 = gbr_vet1.best_estimator_ # set the model to be the best estimator

# # Now get predictions on validation set and record MSE
# preds_vet_gbr1 = gbr_vet1.predict(x_vets_validate1)
# mse_gbr_vet1 = np.mean((y_vets_validate - preds_vet_gbr1)**2)
# print("MSE Gradient Boosting Veterans Team Rating: ", mse_gbr_vet1)

MSE Gradient Boosting Veterans Team Rating:  15.097709396214167


In [249]:
idx = (-preds_vet_gbr1).argsort()[:20]
print(preds_vet_gbr1[idx])
print(max(preds_vet_gbr1))
print(min(preds_vet_gbr1))

[8.42984127 8.16005703 7.66943092 7.08430715 7.06589504 6.15097835
 5.46029541 5.19218033 5.07944451 4.86884289 4.76152705 4.31828591
 4.2671245  4.09505441 3.97813703 3.55643397 3.54040484 3.27569912
 3.24408631 3.09649349]
8.429841267709886
-4.140874964168131


In [250]:
validate_vets.iloc[idx] # seems OK (Amar'e Stoudemire had a huge contract, so he is an outlier), except LeBron isn't in the top 20, so probably not totally correct


Unnamed: 0,rating,Team,Type,mu,sd,name,player_id,index,player_name,coefs
176,-0.755792,New York Knicks,Non-rookie,7.803663,5,Amar'e Stoudemire,2405,32,Amar'e Stoudemire,4.285844
171,-0.755792,New York Knicks,Non-rookie,7.486,5,Carmelo Anthony,2546,394,Carmelo Anthony,8.285201
158,-0.489994,Brooklyn Nets,Non-rookie,7.72693,5,Joe Johnson,2207,478,Joe Johnson,4.187927
222,-4.658751,Los Angeles Lakers,Non-rookie,7.833333,5,Kobe Bryant,977,6,Kobe Bryant,2.311105
17,7.952643,Los Angeles Clippers,Non-rookie,0.018591,5,Lester Hudson,201991,320,Lester Hudson,3.14804
18,7.952643,Los Angeles Clippers,Non-rookie,0.129211,5,Dahntay Jones,2563,263,Dahntay Jones,-8.197401
25,6.984504,Oklahoma City Thunder,Non-rookie,6.331875,5,Kevin Durant,201142,284,Kevin Durant,7.042239
11,9.804995,San Antonio Spurs,Non-rookie,0.041701,5,Reggie Williams,202130,273,Reggie Williams,-2.220307
107,2.782586,Memphis Grizzlies,Non-rookie,5.5,5,Zach Randolph,2216,131,Zach Randolph,6.285065
192,-1.422322,Sacramento Kings,Non-rookie,6.439108,5,Rudy Gay,200752,349,Rudy Gay,1.973265


In [255]:
# Now veterans no team rating

# magnitudes seem good, ordering seems ok but missing lebron in top 20 seems bad

# gbr_vet2 = GradientBoostingRegressor().fit(x_vets2, y_vets)
# preds_vet_gbr2 = gbr_vet2.predict(x_vets_validate2)

# mse_gbr_vet2 = np.mean((y_vets_validate - preds_vet_gbr2)**2)
# print("MSE Gradient Boosting Veterans NO Team Rating: ", mse_gbr_vet2)




# attempting to optimize hyperparameters:

# magnitudes a bit small again, relative ordering seems solid.

gbr_vet2 = GradientBoostingRegressor()

params = {'learning_rate': [0.001, 0.01, 0.1], 
         'subsample': [1, 0.9],
         'max_depth': [2,5,10],
         'n_estimators': [50, 100, 200]}

gbr_vet2 = GridSearchCV(gbr_vet2, params)

gbr_vet2 = gbr_vet2.fit(x_vets2, y_vets)

print(gbr_vet2.best_params_) # print the best parameters so we know what we're working with

gbr_vet2 = gbr_vet2.best_estimator_ # set the model to be the best estimator

# Now get predictions on validation set and record MSE
preds_vet_gbr2 = gbr_vet2.predict(x_vets_validate2)
mse_gbr_vet2 = np.mean((y_vets_validate - preds_vet_gbr2)**2)
print("MSE Gradient Boosting Veterans NO Team Rating: ", mse_gbr_vet2)

{'learning_rate': 0.01, 'max_depth': 2, 'n_estimators': 200, 'subsample': 0.9}
MSE Gradient Boosting Veterans NO Team Rating:  13.74526264802979


In [256]:
idx = (-preds_vet_gbr2).argsort()[:20]
print(preds_vet_gbr2[idx])
print(min(preds_vet_gbr2))
print(max(preds_vet_gbr2))

[3.3983553  3.02276823 3.02276823 3.02276823 3.02276823 2.93608538
 2.93608538 2.93608538 2.93608538 2.92441884 2.91714446 2.91714446
 2.91714446 2.91714446 2.07440845 2.07440845 2.07440845 2.07440845
 2.07440845 2.07440845]
-1.0444379307401035
3.398355296125337


In [257]:
validate_vets.iloc[idx]

Unnamed: 0,rating,Team,Type,mu,sd,name,player_id,index,player_name,coefs
107,2.782586,Memphis Grizzlies,Non-rookie,5.5,5,Zach Randolph,2216,131,Zach Randolph,6.285065
222,-4.658751,Los Angeles Lakers,Non-rookie,7.833333,5,Kobe Bryant,977,6,Kobe Bryant,2.311105
158,-0.489994,Brooklyn Nets,Non-rookie,7.72693,5,Joe Johnson,2207,478,Joe Johnson,4.187927
176,-0.755792,New York Knicks,Non-rookie,7.803663,5,Amar'e Stoudemire,2405,32,Amar'e Stoudemire,4.285844
171,-0.755792,New York Knicks,Non-rookie,7.486,5,Carmelo Anthony,2546,394,Carmelo Anthony,8.285201
160,-0.489994,Brooklyn Nets,Non-rookie,6.584822,5,Deron Williams,101114,316,Deron Williams,1.616904
192,-1.422322,Sacramento Kings,Non-rookie,6.439108,5,Rudy Gay,200752,349,Rudy Gay,1.973265
127,1.522379,Chicago Bulls,Non-rookie,6.287625,5,Derrick Rose,201565,79,Derrick Rose,3.908301
25,6.984504,Oklahoma City Thunder,Non-rookie,6.331875,5,Kevin Durant,201142,284,Kevin Durant,7.042239
15,7.952643,Los Angeles Clippers,Non-rookie,5.891537,5,Blake Griffin,201933,76,Blake Griffin,1.336778


# Summary

Overall, our best random forest models outperform our best gradient boosting models on validation MSE for both rookies and veterans.

## Final Model Selection:

* **Rookies** - Random Forest Regression with optimized hyperparameters without team rating as a covariate
* **Veterans** - Random Forest Regression with optimized hyperparameters without team rating as a covariate
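
As a quick check of this summary, the validation MSEs printed in the cells above can be collected in one place. The dictionary layout below is just an illustration; the numbers are copied from the saved outputs.

```python
# Validation MSEs copied from the cell outputs above. The random forest for
# rookies WITH team rating is omitted because its MSE does not appear in the
# saved output.
recorded_mse = {
    ("rookies", "RF", "no rating"): 15.213135983382914,
    ("rookies", "GBR", "rating"): 16.699630349352667,
    ("rookies", "GBR", "no rating"): 15.859333950289404,
    ("vets", "RF", "rating"): 13.598488981066104,
    ("vets", "RF", "no rating"): 13.604154551267278,
    ("vets", "GBR", "rating"): 15.097709396214167,
    ("vets", "GBR", "no rating"): 13.74526264802979,
}

# Keep the lowest-MSE entry for each (group, model family) pair.
best_by_family = {}
for (group, family, covariates), mse in recorded_mse.items():
    key = (group, family)
    if key not in best_by_family or mse < best_by_family[key][1]:
        best_by_family[key] = (covariates, mse)

for (group, family), (covariates, mse) in sorted(best_by_family.items()):
    print(f"{group:8s} {family:4s} best: {covariates:10s} MSE = {mse:.3f}")
```

For veterans, the random forest with team rating is lower by only about 0.006, which the notebook treats as a tie and resolves in favor of the simpler model.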


# Now actually calculate priors and store them

In [287]:
# read in contract data for 2015/16 season which will be used as the new data in our model to get priors

newdata_vets = pd.read_csv("../data/Contract+team2015_NonRookie.csv")
newdata_rookies = pd.read_csv("../data/Contract+team2015_Rookie.csv")

newdata_vets.drop(newdata_vets.columns[0], axis = 1, inplace = True)
newdata_rookies.drop(newdata_rookies.columns[0], axis = 1, inplace = True)

In [288]:
x_final_rookies = np.array(newdata_rookies['mu']).reshape(-1, 1)
x_final_vets = np.array(newdata_vets['mu']).reshape(-1, 1)

In [289]:
# train rookie model and veteran model on all of our main data

rf_rookie2 = RandomForestRegressor(max_depth = 2, n_estimators = 200).fit(x_main_rookies, y_main_rookies)

rf_vet2 = RandomForestRegressor(max_depth = 2, n_estimators = 50).fit(x_main_vets, y_main_vets)

# NOTE - keep the MSEs from the validation set; their square roots are used as the standard errors of the priors
mse_vets = mse_rf_vet2
mse_rookies = mse_rf_rookie2

priors_rookies_means = rf_rookie2.predict(x_final_rookies)
priors_vets_means = rf_vet2.predict(x_final_vets)

sigma_rookies = np.sqrt(mse_rookies)
sigma_vets = np.sqrt(mse_vets)

newdata_vets['finalpriors'] = priors_vets_means
newdata_rookies['finalpriors'] = priors_rookies_means

newdata_vets['finalse'] = sigma_vets
newdata_rookies['finalse'] = sigma_rookies


In [290]:
# Now add player id and index columns by merging with the player index map for 2015/16

player_index_map_2015 = pd.read_csv("../data/player_index_map_2015-16.csv")
player_index_map_2015.drop(player_index_map_2015.columns[0], axis = 1, inplace = True)

player_index_map_2015.head()

Unnamed: 0,player_id,index,player_name
0,201952,0,Jeff Teague
1,203471,1,Dennis Schroder
2,203488,2,Mike Muscala
3,203145,3,Kent Bazemore
4,203503,4,Tony Snell


In [291]:
newdata_vets = newdata_vets.merge(player_index_map_2015, how = "inner", left_on = "name", right_on = "player_name")
newdata_rookies = newdata_rookies.merge(player_index_map_2015, how = "inner", left_on = "name", right_on = "player_name")

newdata_vets

Unnamed: 0,rating,Team,Type,mu,sd,name,finalpriors,finalse,player_id,index,player_name
0,6.239155,Golden State Warriors,Non-rookie,0.833333,5,Leandro Barbosa,-0.782111,3.688381,2571,12,Leandro Barbosa
1,6.239155,Golden State Warriors,Non-rookie,4.000000,5,Andrew Bogut,1.213735,3.688381,101106,365,Andrew Bogut
2,6.239155,Golden State Warriors,Non-rookie,3.790262,5,Stephen Curry,1.178937,3.688381,201939,405,Stephen Curry
3,6.239155,Golden State Warriors,Non-rookie,4.766667,5,Draymond Green,2.835642,3.688381,203110,9,Draymond Green
4,6.239155,Golden State Warriors,Non-rookie,3.903485,5,Andre Iguodala,1.178937,3.688381,2738,339,Andre Iguodala
...,...,...,...,...,...,...,...,...,...,...,...
245,-13.598845,New York Knicks,Non-rookie,4.333333,5,Robin Lopez,1.386358,3.688381,201577,127,Robin Lopez
246,-13.598845,New York Knicks,Non-rookie,0.933333,5,Kevin Seraphin,0.134907,3.688381,202338,128,Kevin Seraphin
247,-13.598845,New York Knicks,Non-rookie,0.550000,5,Lance Thomas,-0.863502,3.688381,202498,77,Lance Thomas
248,-13.598845,New York Knicks,Non-rookie,0.452049,5,Sasha Vujacic,-0.863502,3.688381,2756,75,Sasha Vujacic


In [296]:
newdata_vets.to_csv("../data/final_priors_vets_2015_16.csv")
newdata_rookies.to_csv("../data/final_priors_rookies_2015_16.csv")
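
The constant `finalse` value written to these files is the square root of the validation MSE. The short check below reproduces it for veterans from the MSE printed earlier. The interval at the end assumes the prior is used as a Normal(mean, sd) distribution, which is an assumption made here for illustration and is not stated in this notebook.

```python
import math

# Validation MSE for the veteran random forest without team rating,
# copied from the output of the corresponding cell above.
mse_vets = 13.604154551267278
sigma_vets = math.sqrt(mse_vets)
print(round(sigma_vets, 6))

# Illustrative only: a central 95% interval under a Normal(mean, sd) prior,
# using the finalpriors value shown for Andrew Bogut in the table above.
prior_mean = 1.213735
lower = prior_mean - 1.96 * sigma_vets
upper = prior_mean + 1.96 * sigma_vets
print(f"{lower:.2f} to {upper:.2f}")
```

A prior sd of about 3.7 is wide relative to the prior means, so under this reading the priors are only weakly informative.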

# Notes - 

To replicate this process for another year (2016/17, for example) using the final models selected here, we would do the following:
* Run the first two code cells as in this file, switching the years of the data that we read in.
* Fit the two random forest models (rookies and vets) on the small training data for that year, then compute the MSE on the validation data for that year and save it, since its square root is used as the prior standard error.
* Rerun the six code cells above this one, making sure to use the correct year.
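
The steps above can be sketched as one function. Everything here is illustrative: `fit_priors` and `fake` are hypothetical names, the data are synthetic, the fixed `random_state` is an addition (the notebook does not set seeds), and treating the main training set as train plus validation is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def fit_priors(x_train, y_train, x_valid, y_valid, x_main, y_main, x_new,
               max_depth=2, n_estimators=50, seed=0):
    """Return (prior_means, prior_sd) for the players described by x_new."""
    # Fit on the small training split and score on the validation split.
    small = RandomForestRegressor(
        max_depth=max_depth, n_estimators=n_estimators, random_state=seed
    ).fit(x_train, y_train)
    mse = np.mean((y_valid - small.predict(x_valid)) ** 2)

    # Refit the same configuration on the main training set.
    final = RandomForestRegressor(
        max_depth=max_depth, n_estimators=n_estimators, random_state=seed
    ).fit(x_main, y_main)

    # Prior means come from the refit model; the prior sd is sqrt(validation MSE).
    return final.predict(x_new), float(np.sqrt(mse))


rng = np.random.default_rng(1)


def fake(n):
    """Synthetic stand-in for one split: a single 'mu' column and noisy coefs."""
    x = rng.uniform(0, 8, size=(n, 1))
    y = 0.5 * x[:, 0] - 1.0 + rng.normal(scale=3.0, size=n)
    return x, y


x_tr, y_tr = fake(150)
x_va, y_va = fake(60)
# Assumption: the main training set is the union of train and validation.
x_main = np.vstack([x_tr, x_va])
y_main = np.concatenate([y_tr, y_va])
x_new, _ = fake(10)

prior_means, prior_sd = fit_priors(x_tr, y_tr, x_va, y_va, x_main, y_main, x_new)
print(prior_means.shape, prior_sd)
```

For rookies the notebook's selected settings were `max_depth=2, n_estimators=200`, and for veterans `max_depth=2, n_estimators=50`.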