# Modeling

Now that all the features are present in the data, it is time to start modeling. First, I take a baseline by always predicting that a player will scores his expanding average for the upcoming game. Then, I build a first simple model that feeds all of the features to a linear regression model. From there, I use feature selection to improve model performance and also try other models including Random Forest, Extra Trees, ADA Boost, and XGBoost. I tune each model to try to maximize performance.

## 1. Imports and Preparing Data for Modeling

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.model_selection import TimeSeriesSplit
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor, VotingClassifier

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.regularizers import l2
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.wrappers import scikit_learn
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.metrics import Precision

In [2]:
df = pd.read_csv('./Data/no_upcoming_opponent.csv')

In [3]:
df.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,SEASON_ID,PLAYER_ID_x,PLAYER_NAME_x,TEAM_ID_x_x,TEAM_NAME_x,GAME_ID_x,GAME_DATE_x,MATCHUP_x,...,OPPONENT_TEAM_PCT_UAST_2PM_ROLLING,OPPONENT_TEAM_PCT_UAST_3PM_ROLLING,OPPONENT_TEAM_PCT_UAST_FGM_ROLLING,OPPONENT_days_of_rest_ROLLING,target,UPCOMING_game_date,UPCOMING_days_of_rest,UPCOMING_opponent,UPCOMING_opponent_id,UPCOMING_home
0,0,0,22015,203084,Harrison Barnes,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,...,0.414,0.333,0.4,10.0,12.0,2015-10-30,3,HOU,1610613000.0,0.0
1,1,1,22015,2733,Shaun Livingston,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,...,0.414,0.333,0.4,10.0,4.0,2015-10-30,3,HOU,1610613000.0,0.0
2,2,2,22015,2571,Leandro Barbosa,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,...,0.414,0.333,0.4,10.0,5.0,2015-10-30,3,HOU,1610613000.0,0.0
3,3,3,22015,201575,Brandon Rush,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,...,0.414,0.333,0.4,10.0,2.0,2015-10-30,3,HOU,1610613000.0,0.0
4,4,4,22015,203105,Festus Ezeli,1610612744,Golden State Warriors,21500003,2015-10-27,GSW vs. NOP,...,0.414,0.333,0.4,10.0,9.0,2015-10-30,3,HOU,1610613000.0,0.0


There are some nulls in the data because I have to go back and finish a few steps in my previous notebook. When I fully finish that, this won't be necessary.

In [4]:
df.dropna(inplace=True)

In [5]:
df.isna().sum().any()

False

In [6]:
df.columns.tolist()

['Unnamed: 0',
 'Unnamed: 0.1',
 'SEASON_ID',
 'PLAYER_ID_x',
 'PLAYER_NAME_x',
 'TEAM_ID_x_x',
 'TEAM_NAME_x',
 'GAME_ID_x',
 'GAME_DATE_x',
 'MATCHUP_x',
 'MIN',
 'FGM',
 'FGA',
 'FG_PCT',
 'FG3M',
 'FG3A',
 'FG3_PCT',
 'FTM',
 'FTA',
 'FT_PCT',
 'OREB',
 'DREB',
 'REB',
 'AST',
 'STL',
 'BLK',
 'TOV',
 'PF',
 'PTS',
 'PLUS_MINUS',
 'PLAYER_GAME_ID',
 'AST_PCT',
 'AST_RATIO',
 'AST_TO',
 'DEF_RATING',
 'DREB_PCT',
 'EFG_PCT',
 'E_DEF_RATING',
 'E_NET_RATING',
 'E_OFF_RATING',
 'E_PACE',
 'E_TOV_PCT',
 'E_USG_PCT',
 'FGA_PG',
 'FGM_PG',
 'NET_RATING',
 'OFF_RATING',
 'OREB_PCT',
 'PACE',
 'PACE_PER40',
 'PIE',
 'POSS',
 'REB_PCT',
 'SEASON_YEAR_x',
 'TM_TOV_PCT',
 'TS_PCT',
 'USG_PCT',
 'sp_work_DEF_RATING',
 'sp_work_NET_RATING',
 'sp_work_OFF_RATING',
 'sp_work_PACE',
 'BLKA',
 'OPP_PTS_2ND_CHANCE',
 'OPP_PTS_FB',
 'OPP_PTS_OFF_TOV',
 'OPP_PTS_PAINT',
 'PFD',
 'PTS_2ND_CHANCE',
 'PTS_FB',
 'PTS_OFF_TOV',
 'PTS_PAINT',
 'PCT_AST_2PM',
 'PCT_AST_3PM',
 'PCT_AST_FGM',
 'PCT_FGA_2PT',
 

In [7]:
numeric_columns = df.select_dtypes(['float', 'int'])

In [8]:
cols_to_remove = ['SEASON_ID_EXPANDING','PLAYER_ID_x_EXPANDING','SEASON_ID_ROLLING','Unnamed: 0','Unnamed: 0.1',
                'PLAYER_ID_x_ROLLING','UPCOMING_opponent_id','UPCOMING_home','target']

In [9]:
selected_cols = [item for item in numeric_columns if item not in cols_to_remove]

In [10]:
df[selected_cols].columns.tolist()

['FG_PCT',
 'FG3_PCT',
 'FT_PCT',
 'AST_PCT',
 'AST_RATIO',
 'AST_TO',
 'DEF_RATING',
 'DREB_PCT',
 'EFG_PCT',
 'E_DEF_RATING',
 'E_NET_RATING',
 'E_OFF_RATING',
 'E_PACE',
 'E_TOV_PCT',
 'E_USG_PCT',
 'FGA_PG',
 'FGM_PG',
 'NET_RATING',
 'OFF_RATING',
 'OREB_PCT',
 'PACE',
 'PACE_PER40',
 'PIE',
 'REB_PCT',
 'TM_TOV_PCT',
 'TS_PCT',
 'USG_PCT',
 'sp_work_DEF_RATING',
 'sp_work_NET_RATING',
 'sp_work_OFF_RATING',
 'sp_work_PACE',
 'PCT_AST_2PM',
 'PCT_AST_3PM',
 'PCT_AST_FGM',
 'PCT_FGA_2PT',
 'PCT_FGA_3PT',
 'PCT_PTS_2PT',
 'PCT_PTS_2PT_MR',
 'PCT_PTS_3PT',
 'PCT_PTS_FB',
 'PCT_PTS_FT',
 'PCT_PTS_OFF_TOV',
 'PCT_PTS_PAINT',
 'PCT_UAST_2PM',
 'PCT_UAST_3PM',
 'PCT_UAST_FGM',
 'PCT_AST',
 'PCT_BLK',
 'PCT_BLKA',
 'PCT_DREB',
 'PCT_FG3A',
 'PCT_FG3M',
 'PCT_FGA',
 'PCT_FGM',
 'PCT_FTA',
 'PCT_FTM',
 'PCT_OREB',
 'PCT_PF',
 'PCT_PFD',
 'PCT_PTS',
 'PCT_REB',
 'PCT_STL',
 'PCT_TOV',
 'TEAM_MIN',
 'TEAM_FG_PCT',
 'TEAM_FG3_PCT',
 'TEAM_FT_PCT',
 'TEAM_TOV',
 'TEAM_PLUS_MINUS',
 'TEAM_AST_PC

## 2. Baseline

For my baseline prediction, I predict that a player will score his median points per game for the following game.

In [11]:
print(mean_absolute_error(df['PTS_EXPANDING'],df['target']))
print(r2_score(df['PTS_EXPANDING'],df['target']))

4.58565060655468
0.15312922493560688


Using this method, the predictions aren't too far off-- only 4.85 points. However, the goodness of fit is quite low at .15. I will work to improve this in future iterations of the model.

## 3. First Simple Model

In [12]:
X = df[selected_cols]
y = df['target']

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
X_test, X_val, y_test, y_val = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

In [21]:
type(X_train)

pandas.core.frame.DataFrame

In [13]:
ss = StandardScaler()

In [14]:
X_train_scaled = ss.fit_transform(X_train)
X_val_scaled = ss.fit_transform(X_val)
X_test_scaled = ss.fit_transform(X_test)

In [22]:
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_val_scaled_df = pd.DataFrame(X_val_scaled, columns=X_val.columns)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_test.columns)

In [23]:
mlr = LinearRegression()
split = TimeSeriesSplit(n_splits=3)
sfs = SequentialFeatureSelector(mlr, n_features_to_select=30,direction='forward',cv=split)

This took a very long time to run, so I commented it out and have the results in the list below.

In [16]:
#sfs.fit(X_train,y_train)

In [17]:
#predictors = list(selected_cols[sfs.get_support()])
#predictors

In [26]:
predictors = ['MIN_EXPANDING', 'FGA_EXPANDING', 'FTM_EXPANDING', 'FT_PCT', 'OREB_EXPANDING', 'DREB_ROLLING', 'AST_EXPANDING', 'STL_EXPANDING', 'TOV_EXPANDING', 'PLUS_MINUS_ROLLING', 'DEF_RATING', 'POSS_ROLLING', 'OPP_PTS_PAINT_EXPANDING', 'PCT_AST_2PM', 'PCT_AST_3PM', 'PCT_FGA_3PT', 'PCT_PTS_2PT_MR', 'PCT_UAST_3PM', 'PCT_AST', 'PCT_DREB', 'PCT_OREB', 'PCT_PTS', 'TEAM_MIN', 'TEAM_REB_ROLLING', 'TEAM_E_NET_RATING', 'TEAM_OPP_FTA_RATE', 'TEAM_OPP_PTS_PAINT', 'TEAM_PCT_FGA_3PT', 'TEAM_PCT_PTS_PAINT']

In [27]:
mlr.fit(X_train_scaled_df[predictors],y_train)

In [28]:
train_predictions = mlr.predict(X_train_scaled_df[predictors])

In [30]:
train_predictions

array([ 7.6530134 , 16.5259547 ,  6.23382139, ..., 12.37784174,
       10.27569115, 13.78400151])

In [32]:
print(mean_absolute_error(train_predictions,y_train))
print(r2_score(train_predictions,y_train))

4.547102553695566
0.024416476514748986


In [33]:
test_predictions = mlr.predict(X_test_scaled_df[predictors])

In [35]:
print(mean_absolute_error(test_predictions,y_test))
print(r2_score(test_predictions,y_test))

4.530514273601588
0.03336367662069151


The error has improved slightly from the baseline, but the goodness of fit has gotten worse. Onwards!