# Predicting All-NBA Team and Player Salaries - Modeling Part II of II: Player Salaries
---

In our modeling notebooks, we build on the groundwork laid by our web scraping, data cleaning, and exploratory data analysis. Our cleaned data now contains over 80 features, including player statistics (advanced, totals, and per-game), salary cap information, and team payroll data. With this, our objective is twofold:

1. <u>**All-NBA Team**</u>: In our previous notebook ([Part I](./04_Data_Modeling_I.ipynb)), we ...

    1. Constructed multiple <u>**regression models**</u> to predict voter share, which ultimately let us identify the top 15 players who made the All-NBA Teams (5 each on the 1st, 2nd, and 3rd Teams). 
    2. Briefly employed <u>**classification models**</u> to see how well they distinguished between 1st, 2nd, and 3rd Team players. 
    <br></br>
2. <span style = 'color:orange'><u>**Salary**</u>:</span> In this notebook, we use <u>**regression modeling**</u> to predict player salaries, learning the relationship between player performance, individual statistics, and contract value. 

This process will involve trial and error, along with grid search (`GridSearchCV`) to tune model hyperparameters.
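
As a rough illustration of the tuning step, the sketch below runs a small grid search over a scaled ridge regression. It is not the model used later in this notebook: the synthetic data, the `Ridge` estimator, and the `alpha` grid are all assumptions chosen to keep the example short.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 200 rows, 5 features, linear target plus noise.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Scale the features, then fit a regularized linear model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ridge", Ridge()),
])

# Hypothetical grid: "<step name>__<parameter>" addresses a parameter inside the pipeline.
param_grid = {"ridge__alpha": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation; scikit-learn maximizes scores, so RMSE is negated.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(-search.best_score_)  # cross-validated RMSE of the best setting
```

Putting the scaler inside the pipeline means it is refit on each training fold, so the validation folds do not leak into the scaling statistics.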

At the end of our analysis, we hope to identify the patterns and associations that drive All-NBA Team selection and player compensation. These insights can inform decision-making and help evaluate player performance and pay in professional basketball.

Detailed notebooks for the other segments of this project can be found here: 
- [01_Data_Acquisition](./01_Data_Acquisition.ipynb)
- [02_Data_Cleaning](./02_Data_Cleaning.ipynb)
- [03_Preliminary_EDA](./03_Preliminary_EDA.ipynb)
- [04_Data_Modeling_I](./04_Data_Modeling_I.ipynb)

For more information on the background, a summary of methods, and findings, please see the associated [README](../README.md) for this analysis.

## Contents

---

In [21]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import shap
import streamlit as st


from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression, RidgeCV, LassoCV, ElasticNet
from sklearn.svm import LinearSVR, SVR
from sklearn.ensemble import (
    RandomForestRegressor, ExtraTreesRegressor, BaggingRegressor,
    VotingRegressor, AdaBoostRegressor, GradientBoostingRegressor,
)
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

# from sklearn.compose import ColumnTransformer
# from sklearn.neighbors import KNeighborsClassifier

import datetime

import warnings
warnings.filterwarnings('ignore') 

pd.options.display.max_rows = 400
pd.options.display.max_columns = 400

In [5]:
df = pd.read_csv('../data/clean/stats_main.csv')
df.head()

Unnamed: 0,player,pos,age,tm,g,pg_gs,pg_mp,pg_fg,pg_fga,pg_fg%,pg_3p,pg_3pa,pg_3p%,pg_2p,pg_2pa,pg_2p%,pg_efg%,pg_ft,pg_fta,pg_ft%,pg_orb,pg_drb,pg_trb,pg_ast,pg_stl,pg_blk,pg_tov,pg_pf,pg_pts,year,tot_gs,tot_mp,tot_fg,tot_fga,tot_fg%,tot_3p,tot_3pa,tot_3p%,tot_2p,tot_2pa,tot_2p%,tot_efg%,tot_ft,tot_fta,tot_ft%,tot_orb,tot_drb,tot_trb,tot_ast,tot_stl,tot_blk,tot_tov,tot_pf,tot_pts,adv_per,adv_ts%,adv_3par,adv_ftr,adv_orb%,adv_drb%,adv_trb%,adv_ast%,adv_stl%,adv_blk%,adv_tov%,adv_usg%,adv_ows,adv_dws,adv_ws,adv_ws/48,adv_obpm,adv_dbpm,adv_bpm,adv_vorp,gt1_pos,pos_5,pos_3,f,gu,midseason_trade,all_nba_team,pts_won,pts_max,share,all_nba_winner,n_allstar,all_star,salary,salary_adj,team,conf,div,w,l,w/l%,seed,champs,won_championship,salary_cap,salary_cap_adj,payroll,payroll_adj
0,Nick Anderson,SG,23,ORL,70,42.0,28,5.7,12.2,0.467,0.2,0.8,0.293,5.5,11.4,0.479,0.477,2.5,3.7,0.668,1.3,4.2,5.5,1.5,1.1,0.6,1.6,2.1,14.1,1990,42.0,1971,400.0,857.0,0.467,17.0,58.0,0.293,383.0,799.0,0.479,0.477,173.0,259.0,0.668,92.0,294.0,386.0,106.0,74.0,44.0,113.0,145.0,990.0,15.1,0.51,0.068,0.302,4.9,16.7,10.7,8.5,1.8,1.3,10.4,22.4,1.2,1.9,3.1,0.075,0.0,0.3,0.3,1.1,0,SG,Gu,0,1,0,0,0.0,0.0,0.0,0.0,0.0,0,725000.0,1653775.0,Orlando Magic,W,M,31,51,0.378,19,Chicago Bulls,0,11871000.0,25499592.0,7532000.0,17181014.0
1,Ron Anderson,SF,32,PHI,82,13.0,28,6.2,12.9,0.485,0.1,0.5,0.209,6.1,12.3,0.497,0.49,2.0,2.4,0.833,1.3,3.2,4.5,1.4,0.8,0.2,1.2,2.0,14.6,1990,13.0,2340,512.0,1055.0,0.485,9.0,43.0,0.209,503.0,1012.0,0.497,0.49,165.0,198.0,0.833,103.0,264.0,367.0,115.0,65.0,13.0,100.0,163.0,1198.0,15.5,0.524,0.041,0.188,5.0,12.4,8.8,8.2,1.4,0.3,8.1,23.2,2.3,1.8,4.1,0.085,-0.2,-1.4,-1.6,0.2,0,SF,F,1,0,0,0,0.0,0.0,0.0,0.0,0.0,0,425000.0,969454.0,Philadelphia 76ers,E,A,44,38,0.537,12,Chicago Bulls,0,11871000.0,25499592.0,11640000.0,26551652.0
2,Willie Anderson,SG,24,SAS,75,75.0,34,6.0,13.2,0.457,0.1,0.5,0.2,5.9,12.7,0.467,0.461,2.3,2.8,0.798,0.9,3.8,4.7,4.8,1.1,0.6,2.2,3.0,14.4,1990,75.0,2592,453.0,991.0,0.457,7.0,35.0,0.2,446.0,956.0,0.467,0.461,170.0,213.0,0.798,68.0,283.0,351.0,358.0,79.0,46.0,167.0,226.0,1083.0,13.0,0.499,0.035,0.215,3.1,11.5,7.5,20.2,1.5,1.1,13.3,20.1,1.3,3.5,4.8,0.089,-0.9,1.0,0.1,1.4,0,SG,Gu,0,1,0,0,0.0,0.0,0.0,0.0,0.0,0,725000.0,1653775.0,San Antonio Spurs,W,M,55,27,0.671,6,Chicago Bulls,0,11871000.0,25499592.0,11057000.0,25221786.0
3,Thurl Bailey,PF,29,UTA,82,22.0,30,4.9,10.6,0.458,0.0,0.0,0.0,4.9,10.6,0.459,0.458,2.7,3.3,0.808,1.2,3.7,5.0,1.5,0.6,1.1,1.6,2.0,12.4,1990,22.0,2486,399.0,872.0,0.458,0.0,3.0,0.0,399.0,869.0,0.459,0.458,219.0,271.0,0.808,101.0,306.0,407.0,124.0,53.0,91.0,130.0,160.0,1017.0,12.5,0.513,0.003,0.311,5.1,13.6,9.6,7.7,1.1,2.3,11.6,20.0,0.6,3.1,3.7,0.072,-1.4,0.0,-1.4,0.4,0,PF,F,1,0,0,0,0.0,0.0,0.0,0.0,0.0,0,1000000.0,2281070.0,Utah Jazz,W,M,54,28,0.659,7,Chicago Bulls,0,11871000.0,25499592.0,10695000.0,24396040.0
4,Benoit Benjamin,C,26,LAC,70,65.0,31,5.5,11.1,0.496,0.0,0.0,0.0,5.5,11.1,0.496,0.496,3.0,4.2,0.712,2.2,8.1,10.3,1.7,0.8,2.1,3.4,2.6,14.0,1990,65.0,2236,386.0,778.0,0.496,0.0,0.0,0.0,386.0,778.0,0.496,0.496,210.0,295.0,0.712,157.0,566.0,723.0,119.0,54.0,145.0,235.0,184.0,982.0,15.1,0.541,0.0,0.379,7.8,28.7,18.1,7.7,1.2,4.0,20.6,21.0,-0.7,3.7,3.0,0.064,-1.9,0.8,-1.1,0.5,0,C,C,0,0,1,0,0.0,0.0,0.0,0.0,0.0,0,1750000.0,3991872.0,Los Angeles Clippers,W,P,31,51,0.378,18,Chicago Bulls,0,11871000.0,25499592.0,10245000.0,23369557.0


In [48]:
df.shape

(4377, 102)

In [6]:
train_yrs = list(range(1990, 2017))  # 1990-2016
test_yrs = list(range(2017, 2021))   # 2017-2020
hold_yrs = [2021, 2022]
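
Because the split is by season rather than random, it is worth confirming that the three year lists are disjoint and strictly ordered in time. A minimal sanity check, assuming the same year ranges as the cell above:

```python
train_yrs = list(range(1990, 2017))  # 1990-2016
test_yrs = list(range(2017, 2021))   # 2017-2020
hold_yrs = [2021, 2022]

all_yrs = train_yrs + test_yrs + hold_yrs

# No season appears in more than one split.
assert len(all_yrs) == len(set(all_yrs))

# Every training season precedes every test season, which precedes every holdout season.
assert max(train_yrs) < min(test_yrs) <= max(test_yrs) < min(hold_yrs)

print(len(train_yrs), len(test_yrs), len(hold_yrs))  # 27 4 2
```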

In [234]:
feats = ['age', 'g', 'pg_gs', 'pg_mp', 'pg_fg', 'pg_fga', 'pg_fg%', 'pg_3p', 'pg_3pa', 'pg_3p%', 'pg_2p', 'pg_2pa', 'pg_2p%', 'pg_efg%', 'pg_ft', 'pg_fta', 'pg_ft%', 'pg_orb', 'pg_drb', 'pg_trb', 'pg_ast', 'pg_stl', 'pg_blk', 'pg_tov', 'pg_pf', 'pg_pts', 'tot_mp', 'tot_fg%', 'tot_3p', 'tot_3p%', 'tot_2p%', 'tot_efg%', 'tot_ft%', 'tot_pf', 'tot_pts', 'adv_per', 'adv_ts%', 'adv_3par', 'adv_ftr', 'adv_orb%', 'adv_drb%', 'adv_trb%', 'adv_ast%', 'adv_stl%', 'adv_blk%', 'adv_tov%', 'adv_usg%', 'adv_ows', 'adv_dws', 'adv_ws', 'adv_ws/48', 'adv_obpm', 'adv_dbpm', 'adv_bpm', 'adv_vorp', 'f', 'gu', 'w/l%', 'seed', 'all_star']

In [235]:
# Reset the index on both X and y so features and targets stay row-aligned
X_train = df[feats].loc[df.year.isin(train_yrs)].reset_index(drop=True)
X_test = df[feats].loc[df.year.isin(test_yrs)].reset_index(drop=True)

y_train = df['share'].loc[df.year.isin(train_yrs)].reset_index(drop=True)
y_test = df['share'].loc[df.year.isin(test_yrs)].reset_index(drop=True)

X_hold = df[feats].loc[df.year.isin(hold_yrs)].reset_index(drop=True)
y_hold = df['share'].loc[df.year.isin(hold_yrs)].reset_index(drop=True)

print(f'Train: X: {X_train.shape}, y: {y_train.shape}')
print(f'Test: X: {X_test.shape}, y: {y_test.shape}')
print(f'Hold: X: {X_hold.shape}, y: {y_hold.shape}')

Train: X: (3554, 60), y: (3554,)
Test: X: (520, 60), y: (520,)
Hold: X: (303, 60), y: (303,)
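
The split above still uses `share` as the target. Since this notebook is about salaries, a salary target could be built with the same season-based split. The sketch below is one possible arrangement, not the notebook's actual pipeline: `split_by_year` is a hypothetical helper, the toy frame holds made-up values, and choosing `salary_adj` as the target is our assumption (the column exists in `stats_main.csv`).

```python
import pandas as pd

# Toy frame (assumption): made-up values, not rows from stats_main.csv.
toy = pd.DataFrame({
    "year": [1990, 2005, 2018, 2019, 2021, 2022],
    "pg_pts": [14.1, 20.3, 8.7, 25.1, 11.0, 17.6],
    "salary_adj": [1.6e6, 9.0e6, 2.5e6, 3.0e7, 5.0e6, 1.2e7],
})

train_yrs = list(range(1990, 2017))
test_yrs = list(range(2017, 2021))
hold_yrs = [2021, 2022]


def split_by_year(frame, years, feats, target):
    """Return (X, y) for the given seasons, sharing one reset index."""
    subset = frame.loc[frame["year"].isin(years)].reset_index(drop=True)
    return subset[feats], subset[target]


X_tr, y_tr = split_by_year(toy, train_yrs, ["pg_pts"], "salary_adj")
X_te, y_te = split_by_year(toy, test_yrs, ["pg_pts"], "salary_adj")
X_ho, y_ho = split_by_year(toy, hold_yrs, ["pg_pts"], "salary_adj")

print(f"Train: X: {X_tr.shape}, y: {y_tr.shape}")
print(f"Test: X: {X_te.shape}, y: {y_te.shape}")
print(f"Hold: X: {X_ho.shape}, y: {y_ho.shape}")
```

Filtering once and then selecting features and target from the same subset keeps `X` and `y` on an identical index.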


In [None]:
hs = pd.read_csv('../datasets/Clean/train.csv', na_values=['NaN', '', 'Missing'], keep_default_na=False)
hs_test = pd.read_csv('../datasets/Clean/test.csv', na_values=['NaN', '', 'Missing'], keep_default_na=False)

In [None]:
# UPDATE FEATURES FOR TESTING HERE
feats_updated = ['overall_qual', 'year_built', 'year_remod', 'total_bsmt_sf', 'gr_liv_area', 'full_bath', 'fireplaces', 'age', 'garage_area', 'kitchen_qual_Fa',
 'kitchen_qual_Gd', 'kitchen_qual_TA', 'was_remod', 'bsmt_cat_finished','bsmt_cat_unfinished', 'grg_qual_num', 'garage_cat_finished', 'garage_cat_unfinished', 'garage_cat_rough_finished', 'cond12_feeder_st',
 'cond12_near_park', 'cond12_near_rr', 'cond12_norm', 'lotconfig_culdsac', 'lotconfig_inside', 'hi_bsmt_exposure', 'nbr_rank']

In [None]:
def mod_iteration(feats):
    
    # Fit regression to X_train and y_train (75% of training.csv)
    X = hs[feats]
    y = hs['SalePrice']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 531)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    
    # Predict SalePrice for 25% testing data within train.csv and compare to truth to get residuals
    y_preds = lr.predict(X_test)
    MSE = mean_squared_error(y_test, y_preds)  # imported directly from sklearn.metrics
    RMSE = np.sqrt(MSE)  # square root of MSE; works across scikit-learn versions
        
    for i, coef in zip(X.columns, lr.coef_):
        print(f"{i}: {coef}")
    print(f"intercept: {lr.intercept_}")
    
    return f"Training R2: {lr.score(X_train, y_train)}, Testing R2: {lr.score(X_test, y_test)}, MSE: {MSE}, RMSE: {RMSE}"
    
mod_iteration(feats_updated)

In [None]:
def mod_runon_all(feats):
    
    # Fit regression to entire data
    X = hs[feats]
    y = hs['SalePrice']
    lr_all = LinearRegression()
    lr_all.fit(X, y)
    
    # Predict SalePrice for entire data and compare to truth to get residuals
    y_preds_all = lr_all.predict(hs[feats])
    y_true = hs['SalePrice'] #Can use var from entire dataset
    MSE = mean_squared_error(y_true, y_preds_all)  # imported directly from sklearn.metrics
    RMSE = np.sqrt(MSE)  # square root of MSE; works across scikit-learn versions
    
    # Use regression to predict SalePrice on Test.csv (unseen) data
    y_preds_all_test = lr_all.predict(hs_test[feats])
    hs_test['SalePrice'] = y_preds_all_test
    
    # Null model for comparison
    hs['null_pred'] = np.mean(y)
    null_pred = hs['null_pred']
    null_MSE = mean_squared_error(y_true, null_pred)
    null_RMSE = np.sqrt(null_MSE)
    
    # Submit Predictions to Kaggle
    #submit = hs_test[['Id', 'SalePrice']]
    #submit.set_index('Id', inplace=True)
    #dt = datetime.datetime.now().strftime("%m%d%Y%H")
    #submit.to_csv(f'../datasets/Submissions/Features_Submission-{dt}.csv')
        
    for i, coef in zip(X.columns, lr_all.coef_):
        print(f"{i}: {coef}")
    print(f"intercept: {lr_all.intercept_}")
    print(f"null_MSE: {null_MSE}, null_RMSE: {null_RMSE}")
    
    return f"Full Data R2: {lr_all.score(X, y)}, MSE = {MSE}, RMSE = {RMSE}"

mod_runon_all(feats_updated)
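
The null model in `mod_runon_all` predicts the mean of the target for every row, which gives a floor that any useful regression should beat. The sketch below shows the comparison on toy numbers (all values are made up, and `rmse` is a hypothetical helper). It also shows that, when the mean is taken from the same data being scored, R2 equals one minus the ratio of model MSE to null MSE.

```python
import numpy as np

# Toy values (assumption): not taken from any dataset in this project.
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])


def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


# Null model: predict the mean of the target everywhere.
null_pred = np.full_like(y_true, y_true.mean())

null_rmse = rmse(y_true, null_pred)
model_rmse = rmse(y_true, y_pred)

# R2 expressed through the two errors (squaring RMSE recovers MSE).
r2 = 1 - (model_rmse / null_rmse) ** 2

print(f"null RMSE: {null_rmse:.2f}, model RMSE: {model_rmse:.2f}, R2: {r2:.3f}")
```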

In [None]:
def log_mod_iteration(feats):
    
    # Fit regression to X_train and y_train (75% of training.csv)
    X = hs[feats]
    y = hs['log_price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 531)
    lr = LinearRegression()
    lr.fit(X_train, y_train)
        
    # Predict SalePrice for 25% testing data within train.csv and compare to truth to get residuals
    y_preds = np.exp(lr.predict(X_test)) # Undoing the logged price
    MSE = mean_squared_error(np.exp(y_test), y_preds)  # compared on the original price scale
    RMSE = np.sqrt(MSE)  # square root of MSE; works across scikit-learn versions
        
    for i, coef in zip(X.columns, np.exp(lr.coef_)):
        print(f"{i}: {coef}")
    print(f"intercept: {np.exp(lr.intercept_)}")
    
    return f"Training R2: {lr.score(X_train, y_train)}, Testing R2: {lr.score(X_test, y_test)}, MSE: {MSE}, RMSE: {RMSE}"
    
log_mod_iteration(feats_updated)

In [None]:
def log_mod_runon_all(feats):
    
    # Fit regression to entire data
    X = hs[feats]
    y = hs['log_price']
    lr_all = LinearRegression()
    lr_all.fit(X, y)
    
    # Predict SalePrice for entire data and compare to truth to get residuals
    y_preds_all = np.exp(lr_all.predict(hs[feats]))
    y_true = hs['SalePrice'] #Can use var from entire dataset
    MSE = mean_squared_error(y_true, y_preds_all)  # imported directly from sklearn.metrics
    RMSE = np.sqrt(MSE)  # square root of MSE; works across scikit-learn versions
    
    # Use regression to predict SalePrice on Test.csv (unseen) data
    y_preds_all_test = np.exp(lr_all.predict(hs_test[feats]))
    hs_test['SalePrice'] = y_preds_all_test
    
    # Null model for comparison
    hs['null_pred'] = np.exp(np.mean(y))
    null_pred = hs['null_pred']
    null_MSE = mean_squared_error(y_true, null_pred)
    null_RMSE = np.sqrt(null_MSE)
    
    # Submit Predictions to Kaggle
    #submit = hs_test[['Id', 'SalePrice']]
    #submit.set_index('Id', inplace=True)
    #dt = datetime.datetime.now().strftime("%m%d%Y%H")
    #submit.to_csv(f'../datasets/Submissions/Features_Submission_logy-{dt}.csv')
        
    for i, coef in zip(X.columns, np.exp(lr_all.coef_)):
        print(f"{i}: {coef}")
    print(f"intercept: {np.exp(lr_all.intercept_)}")
    print(f"null_MSE: {null_MSE}, null_RMSE: {null_RMSE}")
    
    return f"Full Data R2: {lr_all.score(X, y)}, MSE = {MSE}, RMSE = {RMSE}"

log_mod_runon_all(feats_updated)
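
Two details of the log-target functions are easy to misread, so here is a small numeric sketch using made-up data and `np.polyfit` in place of `LinearRegression`. First, an exponentiated coefficient is a multiplicative factor per unit of the feature, not an additive amount. Second, `np.exp(np.mean(y))` on a logged target is the geometric mean of the original values, which is never larger than the arithmetic mean used by the non-log null model.

```python
import numpy as np

# Toy data (assumption): price doubles with each unit of x, no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
price = 100.0 * 2.0 ** x  # 100, 200, 400, 800

# Fit a straight line to log(price); np.polyfit returns [slope, intercept] for deg=1.
slope, intercept = np.polyfit(x, np.log(price), deg=1)

factor = np.exp(slope)    # multiplicative change in price per unit of x
base = np.exp(intercept)  # predicted price at x = 0

# Undo the log to get predictions on the original scale.
pred = np.exp(intercept + slope * x)

# Null predictions: log-scale mean (geometric) versus original-scale mean (arithmetic).
geo_mean = np.exp(np.mean(np.log(price)))
arith_mean = price.mean()

print(f"factor per unit: {factor:.3f}, base: {base:.1f}")
print(f"geometric mean: {geo_mean:.2f}, arithmetic mean: {arith_mean:.2f}")
```

Because the two null baselines differ, null RMSE values from the log and non-log functions are not directly interchangeable.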