# Predicting All-NBA Team and Player Salaries - Predicting All-NBA Team and Salary
---

In this notebook, we will build upon the groundwork laid by our web scraping, data cleaning, and exploratory data analysis. Our cleaned data now contains over 80 features, including player statistics (advanced, totals, and per-game), salary cap information, and team payroll data. With this, our objective is twofold:

1. <u>**All-NBA Team**</u>: First, we will construct multiple regression models to predict vote share, which will ultimately let us identify the All-NBA Teams. Comparing several regression techniques also shows which factors influence the voters' decisions and what distinguishes an All-NBA player from the rest.
2. <u>**Salary**</u>: We will then also use regression modeling to predict player salaries, training on the intricate relationship between player performance, individual statistics, and their contracts. 
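
To make the first objective concrete, here is a minimal sketch of how predicted vote share could be turned into team selections: rank players within each season and split the top ones into teams. The helper `assign_all_nba_teams`, the column name `pred_share`, and the toy values are illustrative assumptions rather than part of this project's code, and the sketch deliberately ignores positional rules in All-NBA voting.

```python
import pandas as pd

def assign_all_nba_teams(preds: pd.DataFrame, team_size: int = 5, n_teams: int = 3) -> pd.DataFrame:
    """Rank players by predicted vote share within each season and label the top ones.

    Hypothetical helper: assumes columns 'year', 'player', and 'pred_share'.
    Positional constraints are deliberately ignored.
    """
    out = preds.copy()
    # Rank 1 = highest predicted share within the season
    out['rank'] = (
        out.groupby('year')['pred_share']
        .rank(method='first', ascending=False)
        .astype(int)
    )
    # Keep only as many players as there are team slots
    out = out[out['rank'] <= team_size * n_teams].copy()
    # With the defaults: ranks 1-5 -> team 1, 6-10 -> team 2, 11-15 -> team 3
    out['all_nba_team'] = (out['rank'] - 1) // team_size + 1
    return out.sort_values(['year', 'rank']).reset_index(drop=True)

# Toy predictions for a single season (made-up values, not model output)
toy_preds = pd.DataFrame({
    'year': [2020] * 6,
    'player': list('ABCDEF'),
    'pred_share': [0.10, 0.95, 0.40, 0.70, 0.05, 0.55],
})

# Small team sizes so the toy example stays readable
teams = assign_all_nba_teams(toy_preds, team_size=2, n_teams=2)
print(teams[['player', 'pred_share', 'all_nba_team']])
```

With the defaults (`team_size=5`, `n_teams=3`) the same helper would label a First, Second, and Third Team per season.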

This process will involve trial and error, along with `GridSearchCV` and `RandomizedSearchCV` to fine-tune our models.
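
As a rough illustration of how the two search strategies differ, the sketch below tunes a single hyperparameter on synthetic data. The `Ridge` estimator, the `alpha` range, and the data are assumptions chosen only to keep the example small; they are not the models or settings used later in this project.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the real feature matrix
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=0)

pipe = Pipeline([('scale', StandardScaler()), ('ridge', Ridge())])
param_grid = {'ridge__alpha': np.logspace(-3, 2, 20)}  # 20 candidate values

# Grid search fits every candidate: 20 settings x 5 folds
gs = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_mean_squared_error')
gs.fit(X_demo, y_demo)

# Randomized search samples only n_iter candidates from the same space
rs = RandomizedSearchCV(pipe, param_grid, n_iter=5, cv=5,
                        scoring='neg_mean_squared_error', random_state=0)
rs.fit(X_demo, y_demo)

print('Grid candidates tried:  ', len(gs.cv_results_['params']))
print('Random candidates tried:', len(rs.cv_results_['params']))
print('Grid best alpha:  ', gs.best_params_['ridge__alpha'])
print('Random best alpha:', rs.best_params_['ridge__alpha'])
```

The randomized search fits a quarter of the candidates here, which is the trade-off: less compute in exchange for possibly missing the best setting in the grid.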

By the end of this analysis, we aim to identify the patterns and associations that drive both All-NBA selection and player pay. These insights can inform how player performance and compensation are evaluated in professional basketball.

More detailed notebooks covering the other segments of this project can be found here:
- [01_Data_Acquisition](./01_Data_Acquisition.ipynb)
- [02_Data_Cleaning](./02_Data_Cleaning.ipynb)
- [03_Preliminary_EDA](./03_Preliminary_EDA.ipynb)
- [05_Data_Modeling_II](./05_Data_Modeling_II.ipynb)

For more information on the background, a summary of methods, and findings, please see the associated [README](../README.md) for this analysis.


In [11]:
# pip install xgboost



In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
#import shap
import streamlit as st


from sklearn.preprocessing import StandardScaler, OneHotEncoder, PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline

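from sklearn.linear_model import LinearRegression, LogisticRegression, RidgeCV, LassoCV, ElasticNet
from sklearn.svm import LinearSVR, SVR, SVC
# Text vectorizers used by the search grids further down
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer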
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor, BaggingRegressor, VotingRegressor, AdaBoostRegressor, GradientBoostingRegressor
# LGBMRegressor
from xgboost import XGBRegressor

# from sklearn.compose import ColumnTransformer
# from sklearn.neighbors import KNeighborsClassifier

import datetime

import warnings
warnings.filterwarnings('ignore') 

In [2]:
df = pd.read_csv('../data/clean/stats_main.csv')
df.head()

| Unnamed: 0 | player | pos | age | tm | g | pg_gs | pg_mp | pg_fg | pg_fga | pg_fg% | ... | w | l | w/l% | seed | champs | won_championship | salary_cap | salary_cap_adj | payroll | payroll_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Nick Anderson | SG | 23 | ORL | 70 | 42.0 | 28 | 5.7 | 12.2 | 0.467 | ... | 31 | 51 | 0.378 | 19 | Chicago Bulls | 0 | 11871000.0 | 25499592.0 | 7532000.0 | 17181014.0 |
| 1 | Ron Anderson | SF | 32 | PHI | 82 | 13.0 | 28 | 6.2 | 12.9 | 0.485 | ... | 44 | 38 | 0.537 | 12 | Chicago Bulls | 0 | 11871000.0 | 25499592.0 | 11640000.0 | 26551652.0 |
| 2 | Willie Anderson | SG | 24 | SAS | 75 | 75.0 | 34 | 6.0 | 13.2 | 0.457 | ... | 55 | 27 | 0.671 | 6 | Chicago Bulls | 0 | 11871000.0 | 25499592.0 | 11057000.0 | 25221786.0 |
| 3 | Thurl Bailey | PF | 29 | UTA | 82 | 22.0 | 30 | 4.9 | 10.6 | 0.458 | ... | 54 | 28 | 0.659 | 7 | Chicago Bulls | 0 | 11871000.0 | 25499592.0 | 10695000.0 | 24396040.0 |
| 4 | Benoit Benjamin | C | 26 | LAC | 70 | 65.0 | 31 | 5.5 | 11.1 | 0.496 | ... | 31 | 51 | 0.378 | 18 | Chicago Bulls | 0 | 11871000.0 | 25499592.0 | 10245000.0 | 23369557.0 |


In [4]:
df.shape

(4353, 101)

In [12]:
y = df.share
y_mean = df.share.mean() # Null Model
y_mean

0.0682198483804273

In [9]:
null_mse = np.mean((y - y_mean)**2)
null_mse

0.040885437308775543
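
The null model predicts the mean vote share for every player, so its MSE is the population variance of the target, and a model's R² can be read as one minus the ratio of its MSE to this null MSE. The sketch below checks both relationships on made-up numbers; `y_toy` and `model_pred` are hypothetical values, not taken from the dataset or from any fitted model.

```python
import numpy as np

# Toy target values (made up for illustration, not taken from the dataset)
y_toy = np.array([0.0, 0.0, 0.1, 0.3, 0.6])

# Null model: predict the mean for every observation
null_pred = np.full_like(y_toy, y_toy.mean())
null_mse_toy = np.mean((y_toy - null_pred) ** 2)

# The null MSE is the population variance of the target (ddof=0)
assert np.isclose(null_mse_toy, y_toy.var())

# Hypothetical model predictions, to show how R^2 relates to the null MSE
model_pred = np.array([0.05, 0.0, 0.15, 0.25, 0.5])
model_mse = np.mean((y_toy - model_pred) ** 2)
r2 = 1 - model_mse / null_mse_toy

print(f'null MSE: {null_mse_toy:.4f}, model MSE: {model_mse:.4f}, R^2: {r2:.4f}')
```

Any model worth keeping should beat the null MSE on held-out data, which is equivalent to a positive R² there.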

In [None]:
feats = ['year', 'age', 'g', 'pg_gs', 'pg_mp', 'pg_fg', 'pg_fga', 'pg_fg%', 'pg_3p', 'pg_3pa', 'pg_3p%', 'pg_2p', 'pg_2pa', 'pg_2p%', 'pg_efg%', 'pg_ft', 'pg_fta', 'pg_ft%', 'pg_orb', 'pg_drb', 'pg_trb', 'pg_ast', 'pg_stl', 'pg_blk', 'pg_tov', 'pg_pf', 'pg_pts', 'tot_mp', 'tot_fg%', 'tot_3p', 'tot_3p%', 'tot_2p%', 'tot_efg%', 'tot_ft%', 'tot_pf', 'tot_pts', 'adv_per', 'adv_ts%', 'adv_3par', 'adv_ftr', 'adv_orb%', 'adv_drb%', 'adv_trb%', 'adv_ast%', 'adv_stl%', 'adv_blk%', 'adv_tov%', 'adv_usg%', 'adv_ows', 'adv_dws', 'adv_ws', 'adv_ws/48', 'adv_obpm', 'adv_dbpm', 'adv_bpm', 'adv_vorp', 'f', 'gu', 'w/l%', 'seed', 'all_star']
X = df[feats]
y = df['share']
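# Unfinished season-based split: the cutoff year was never filled in, so this line is commented out
# X_train = df.year <=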

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 531)
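
An alternative to the random split is to split by season, which the unfinished `df.year <=` expression hints at: train on earlier seasons and test on later ones so that no future information leaks into training. The sketch below shows the idea on a toy frame. The cutoff year, the toy values, and the `_t` variable names are assumptions for illustration only; the source does not say which cutoff was intended.

```python
import pandas as pd

# Toy frame standing in for df; all values are made up
toy = pd.DataFrame({
    'year':   [2018, 2018, 2019, 2019, 2020, 2020, 2021, 2021],
    'pg_pts': [10.2, 25.1, 8.4, 27.3, 12.0, 30.1, 15.5, 22.8],
    'share':  [0.0, 0.45, 0.0, 0.62, 0.0, 0.88, 0.01, 0.30],
})

CUTOFF_YEAR = 2019  # hypothetical cutoff, chosen only for this example

# Boolean mask selecting the training seasons (a mask, not the training data itself)
train_mask = toy['year'] <= CUTOFF_YEAR
feature_cols = ['year', 'pg_pts']

X_train_t, y_train_t = toy.loc[train_mask, feature_cols], toy.loc[train_mask, 'share']
X_test_t, y_test_t = toy.loc[~train_mask, feature_cols], toy.loc[~train_mask, 'share']

print(X_train_t.shape, X_test_t.shape)
```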

In [None]:
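# Define baseline: class proportions of the target
# NOTE: `snowski` is not defined in this notebook; it must be loaded before this cell can run
snowski.subreddit.value_counts(normalize=True)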

In [None]:
X = snowski['text']
y = snowski['subreddit']

In [None]:
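# Pipeline with a placeholder vectorizer step ('vec'), to be filled in by a search grid
# with either CountVectorizer (CVEC) or TfidfVectorizer (TVEC), followed by the estimator
pipe_log = Pipeline([
    ('vec', None),
    ('logr', LogisticRegression(solver='liblinear', max_iter=1000))
])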

In [None]:
pipe_svc = Pipeline([
    ('vec', None),
    ('svc', SVC())
])

In [None]:
pgrid_svc =[
    {
     'vec': [CountVectorizer()],
     'vec__stop_words': [None], #tested english stopwords as well
     'vec__max_features': [7000], #also tested 8000 
     'vec__min_df': [3, 5],
     'vec__max_df': [0.80], #tested 0.90 as well
     'svc__C': np.linspace(0.0001, 2, 10),
     #'svc__kernel': ['rbf','poly'],
     'svc__degree' : [2]  # only used by the 'poly' kernel; ignored by SVC's default 'rbf'
    },
    {
     'vec': [TfidfVectorizer()],
     'vec__stop_words': [None],
     'vec__max_features': [7000], 
     'vec__min_df': [3, 5],
     'vec__max_df': [0.80],
     'svc__C': np.linspace(0.0001, 2, 10),
     #'svc__kernel': ['rbf','poly'],
     'svc__degree' : [2]
    }
]

In [None]:
%%time

gs_svc = GridSearchCV(pipe_svc, pgrid_svc, n_jobs=25)
gs_svc.fit(X_train, y_train)

In [None]:
# Make predictions for Accuracy Report
preds_svc = gs_svc.predict(X_test)

In [None]:
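b1, b0 = '\033[1m', '\033[0m'  # assumed ANSI bold on/off codes; b1 and b0 are not defined elsewhere in this notebook
print(f'----------------- {b1}SVM w/ GridSearch{b0} ----------------')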
print(f'------------------ Train: {round(gs_svc.score(X_train, y_train),4)} -------------------')
print(f'------------------- Test: {round(gs_svc.score(X_test, y_test),4)} -------------------')
print('Best Params:', gs_svc.best_params_)

In [None]:
%%time

# Test RandomizedSearch for comparison in timing and outcome
rs_svc = RandomizedSearchCV(pipe_svc, pgrid_svc, cv=5, n_iter=10, n_jobs=15)
rs_svc.fit(X_train, y_train)

In [None]:
# Make predictions for Accuracy Report
preds_svc_r = rs_svc.predict(X_test)

In [None]:
print(f'-------------- {b1}SVM w/ RandomizedSearch{b0} --------------')
print(f'------------------- Train: {round(rs_svc.score(X_train, y_train),4)} -------------------')
print(f'------------------- Test: {round(rs_svc.score(X_test, y_test),4)} --------------------')
print('Best Params:', rs_svc.best_params_)

# Randomized search ran faster and its test score was slightly better, so we will favor RandomizedSearchCV going forward