# NBA Projections Preprocessing and Modeling

In this notebook, I build and tune three models (a Gradient Boosting Classifier, a Logistic Regression, and a Neural Network), then combine them in soft and hard voting classifiers. Much of the tuning that produced the selected hyperparameters was done on virtual machines or in other notebooks; this notebook records the results.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import roc_curve, auc, accuracy_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from keras.callbacks import EarlyStopping
from keras import Sequential
from keras.layers import Input, Dense
from keras.utils.np_utils import to_categorical
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
import warnings
import pickle

In [2]:
df = pd.read_csv("compare.csv")

In [3]:
X = df.drop(["home_win"], axis=1)
y = df.home_win.ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y)
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## Model One: Gradient Boosting Classifier
Because tuning a GradientBoostingClassifier is computationally intensive, I use a randomized search (RandomizedSearchCV) rather than an exhaustive grid search.

In [4]:
# set parameters for a grid or randomized search
params = {
    'loss':['exponential','deviance'],
    'learning_rate':[0.05,0.1,0.2,0.3,0.4],
    'n_estimators':[10,250,500],
    'criterion':['friedman_mse', 'mse', 'mae'],
    'min_samples_split':[2,5,10],  # an integer min_samples_split must be at least 2
    'min_samples_leaf':[1,4,8],
    'min_weight_fraction_leaf':[0,0.05,0.1],
    'max_depth':[2,3,4,7,8,9,None],
    'min_impurity_decrease':[0,0.01,0.05],
    'max_features':['sqrt','log2',8],
    'warm_start':[True,False],
    'n_iter_no_change':[25],
    'ccp_alpha':[0,1000,2000]   
}
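
To see why an exhaustive search is impractical for the grid above, it helps to count its combinations. The snippet below is a small sketch that multiplies the number of candidate values per hyperparameter; the fit count assumes 5-fold cross-validation, which is an assumption on my part rather than something set in this notebook.

```python
from math import prod

# number of candidate values per hyperparameter in the grid above
grid_sizes = {
    'loss': 2, 'learning_rate': 5, 'n_estimators': 3, 'criterion': 3,
    'min_samples_split': 3, 'min_samples_leaf': 3,
    'min_weight_fraction_leaf': 3, 'max_depth': 7,
    'min_impurity_decrease': 3, 'max_features': 3,
    'warm_start': 2, 'n_iter_no_change': 1, 'ccp_alpha': 3,
}

n_combinations = prod(grid_sizes.values())
n_fits = n_combinations * 5  # assumes 5-fold cross-validation

print(n_combinations)  # 918540
print(n_fits)          # 4592700
```

A randomized search with n_iter=1000 therefore samples only about 0.1% of the full grid.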

In [5]:
# I ran this code block separately on a VM with n_iter=1000 (it is set to 25 here)
# Those results are included below
'''
import warnings
warnings.simplefilter("ignore")
gbc = GradientBoostingClassifier()
gb_cv = RandomizedSearchCV(gbc, params, n_jobs=4, n_iter=25)
gb_cv.fit(X_train, y_train)
print(gb_cv.best_params_)
'''



The results of the above random search with n_iter=1000 showed the following parameters to be the best performing:

GradientBoostingClassifier(loss='exponential', n_estimators=250,
                                 min_samples_split=5, min_samples_leaf=4,
                                 max_features=8).fit(X_train,y_train)

In [6]:
# I ran this code block separately to use HyperOpt to tune
# Those results are included below
'''
from hyperopt import tpe,hp,fmin,STATUS_OK,Trials
from hyperopt.pyll.base import scope

space = {
    'loss': 'exponential',
    'criterion': 'mae',
    'learning_rate': hp.quniform("learning_rate",0.01,0.5,0.01),
    'n_estimators': hp.choice('n_estimators',np.arange(10, 1000, 50, dtype=int)),
    'min_samples_split':hp.choice("min_samples_split",np.arange(2, 10, 1, dtype=int)),
    'min_samples_leaf': hp.choice("min_samples_leaf",np.arange(2, 10, 1, dtype=int)),
    'min_weight_fraction_leaf':hp.choice("min_weight_fraction_leaf",[0,0.05,0.1]),
    'max_depth': scope.int(hp.quniform("max_depth",1,12,1)),  # max_depth must be an integer
    'min_impurity_decrease':hp.choice("min_impurity_decrease",[0,0.01,0.05]),
    'max_features':hp.choice("max_features",['sqrt','log2',8]),
    'warm_start':hp.choice("warm_start", [True,False]),
    'ccp_alpha':hp.uniform("ccp_alpha",0,2500)
}

# define objective function
def objective(space):
    clf = GradientBoostingClassifier(loss= space['loss'],
                                     learning_rate=space['learning_rate'],
                                     n_estimators=space['n_estimators'],
                                     criterion = space['criterion'],
                                     min_samples_split=space['min_samples_split'],
                                     min_samples_leaf=space['min_samples_leaf'],
                                     min_weight_fraction_leaf=space['min_weight_fraction_leaf'],
                                     max_depth=space['max_depth'],
                                     min_impurity_decrease=space['min_impurity_decrease'],
                                     max_features=space['max_features'],
                                     warm_start=space['warm_start'],
                                     ccp_alpha=space['ccp_alpha'],
                                    )

    clf.fit(X_train,y_train)
    acc = cross_val_score(clf, X_train, y_train, cv=5).mean()
    return{'loss':-acc, 'status': STATUS_OK }

# initialize trials object
trials = Trials()

best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=25,
    trials=trials
)

print("Best: {}".format(best))
'''


These are the results from the HyperOpt search:

gbc = GradientBoostingClassifier(learning_rate=0.43,
                                     n_estimators=10,
                                     criterion='mse',
                                     min_samples_split=6,
                                     min_samples_leaf=6,
                                     min_weight_fraction_leaf=0.1,
                                     max_depth=8,
                                     max_features=None,
                                     warm_start=False,
                                     ccp_alpha=0)

In [7]:
# a narrowed GridSearch to see if we can eke out any more improvement
params = {
    'loss':['exponential'],
    'learning_rate':[0.05,0.1,0.2],
    'n_estimators':[250],
    'criterion':['friedman_mse', 'mse', 'mae'],
    'min_samples_split':[4,5,6],
    'min_samples_leaf':[3,4,5],
    'max_depth':[2,3,None],
    'max_features':[8],
    'warm_start':[True,False],
    'n_iter_no_change':[25], 
}

In [8]:
'''
# execute narrowed GridSearch
warnings.simplefilter("ignore")
gbc = GradientBoostingClassifier()
gb_cv = GridSearchCV(gbc, params, n_jobs=7)
gb_cv.fit(X_train, y_train)
print(gb_cv.best_params_)
'''



These are the results from the narrowed GridSearch:
{'criterion': 'mse', 'learning_rate': 0.1,
 'loss': 'exponential', 'max_depth': 2,
 'max_features': 8, 'min_samples_leaf': 4,
 'min_samples_split': 6, 'n_estimators': 250, 
 'n_iter_no_change': 25, 'warm_start': True}

In [9]:
# check AUROCs and accuracies for recommended models
random_search = GradientBoostingClassifier(loss='exponential', n_estimators=250,
                                 min_samples_split=5, min_samples_leaf=4,
                                 max_features=8).fit(X_train,y_train)
y_pred = random_search.predict_proba(X_test)
y_pred = pd.DataFrame(y_pred,columns=['Loss','Win']).drop(columns=['Loss'])
print("Random Search's Tuned AUROC= " + str(round(roc_auc_score(y_test,y_pred),3)))
y_pred = random_search.predict(X_test)
predictions = pd.DataFrame(y_pred, columns=['home_win_prob'])
predictions['binary'] = predictions['home_win_prob'] > 0.5
y_pred = np.array(predictions['binary'])
print("Random Search's Tuned Accuracy = " + str(round(accuracy_score(y_test,y_pred),3)))

HyperOpt = GradientBoostingClassifier(learning_rate=0.43, n_estimators=10,
                                     criterion='mse', min_samples_split=6,
                                     min_samples_leaf=6, min_weight_fraction_leaf=0.1,
                                     max_depth=8, max_features=None,
                                     warm_start=False, ccp_alpha=0).fit(X_train,y_train)

y_pred = HyperOpt.predict_proba(X_test)
y_pred = pd.DataFrame(y_pred,columns=['Loss','Win']).drop(columns=['Loss'])
print("HyperOpt's Tuned AUROC= " + str(round(roc_auc_score(y_test,y_pred),3)))
y_pred = HyperOpt.predict(X_test)
predictions = pd.DataFrame(y_pred, columns=['home_win_prob'])
predictions['binary'] = predictions['home_win_prob'] > 0.5
y_pred = np.array(predictions['binary'])
print("HyperOpt's Tuned Accuracy = " + str(round(accuracy_score(y_test,y_pred),3)))

partial_grid = GradientBoostingClassifier(criterion='mse',loss='exponential',
                                          max_depth=2,max_features=8,
                                         min_samples_leaf=4, min_samples_split=6,
                                         n_estimators=250, warm_start=True).fit(X_train,y_train)
y_pred = partial_grid.predict_proba(X_test)
y_pred = pd.DataFrame(y_pred,columns=['Loss','Win']).drop(columns=['Loss'])
print("GridSearch's Tuned AUROC= " + str(round(roc_auc_score(y_test,y_pred),3)))
y_pred = partial_grid.predict(X_test)
predictions = pd.DataFrame(y_pred, columns=['home_win_prob'])
predictions['binary'] = predictions['home_win_prob'] > 0.5
y_pred = np.array(predictions['binary'])
print("GridSearch's Tuned Accuracy = " + str(round(accuracy_score(y_test,y_pred),3)))

Random Search's Tuned AUROC= 0.695
Random Search's Tuned Accuracy = 0.658
HyperOpt's Tuned AUROC= 0.696
HyperOpt's Tuned Accuracy = 0.663
GridSearch's Tuned AUROC= 0.698
GridSearch's Tuned Accuracy = 0.661


So after the above tuning, the best parameters seem *barely* to be those arrived at by our narrowed GridSearch.

partial_grid = GradientBoostingClassifier(criterion='mse',loss='exponential',
                                          max_depth=2,max_features=8,
                                         min_samples_leaf=4, min_samples_split=6,
                                         n_estimators=250, warm_start=True).fit(X_train,y_train)
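
The AUROC and accuracy checks above repeat the same few steps for each candidate. As a sketch only, they could be wrapped in a small helper. The name `evaluate` and the demo variables are hypothetical, and the data here is a synthetic stand-in rather than the NBA data set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(clf, X_eval, y_eval):
    """Return (AUROC, accuracy) for a fitted binary classifier with predict_proba."""
    win_prob = clf.predict_proba(X_eval)[:, 1]  # probability of the positive class
    labels = clf.predict(X_eval)                # hard 0/1 predictions
    return roc_auc_score(y_eval, win_prob), accuracy_score(y_eval, labels)


# synthetic stand-in data, NOT the NBA data set
X_demo, y_demo = make_classification(n_samples=500, n_features=14, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)
demo_clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

auroc, acc = evaluate(demo_clf, X_te, y_te)
print("AUROC = " + str(round(auroc, 3)))
print("Accuracy = " + str(round(acc, 3)))
```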

In [10]:
clf1 = GradientBoostingClassifier(criterion='mse',loss='exponential',
                                          max_depth=2,max_features=8,
                                         min_samples_leaf=4, min_samples_split=6,
                                         n_estimators=250, warm_start=True).fit(X_train,y_train)

## Model Two: Logistic Regression
Because Logit is not computationally intensive, I use a GridSearch for tuning.

In [11]:
# set param grid for Logit GridSearch
params = {
    'penalty':['l1','l2','elasticnet','none'],
    'fit_intercept':[True,False],
    'solver':['newton-cg','lbfgs','liblinear','sag','saga'],
    'warm_start':[True,False],
    'n_jobs':[6]
}

In [12]:
# run a GridSearch here
warnings.simplefilter("ignore")
logit = LogisticRegression()
logit_cv = GridSearchCV(logit, param_grid=params, scoring="roc_auc", n_jobs=6)
logit_cv.fit(X_train, y_train)
print(logit_cv.best_params_)

{'fit_intercept': False, 'n_jobs': 6, 'penalty': 'l1', 'solver': 'liblinear', 'warm_start': False}
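
Several solver/penalty pairs in the grid above are not supported by scikit-learn (for example, `lbfgs`, `newton-cg`, and `sag` do not accept an `l1` penalty), so those candidates fail instead of contributing a score. One possible alternative, sketched here on synthetic stand-in data, is to pass GridSearchCV a list of dictionaries so that only valid pairs are tried. The names `valid_grid`, `X_demo`, and `y_demo` are hypothetical, and the grid is deliberately limited to the `l1` and `l2` penalties.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, ParameterGrid

# each dictionary lists only solver/penalty pairs that scikit-learn supports
valid_grid = [
    {'solver': ['liblinear', 'saga'], 'penalty': ['l1', 'l2'],
     'fit_intercept': [True, False]},
    {'solver': ['lbfgs', 'newton-cg', 'sag'], 'penalty': ['l2'],
     'fit_intercept': [True, False]},
]
n_candidates = len(ParameterGrid(valid_grid))  # 2*2*2 + 3*1*2 candidates

# synthetic stand-in data, NOT the NBA data set
X_demo, y_demo = make_classification(n_samples=300, n_features=14, random_state=0)

search = GridSearchCV(LogisticRegression(max_iter=2000), valid_grid,
                      scoring='roc_auc', cv=3)
search.fit(X_demo, y_demo)

print(n_candidates)
print(search.best_params_)
```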


In [13]:
# check AUROC for model using best params from the GridSearch
clf2 = LogisticRegression(penalty='l1',solver='liblinear',warm_start=False,fit_intercept=False).fit(X_train,y_train)
y_pred = clf2.predict_proba(X_test)
y_pred = pd.DataFrame(y_pred,columns=['Loss','Win']).drop(columns=['Loss'])
print("Tuned Logit's AUROC= " + str(round(roc_auc_score(y_test,y_pred),3)))

# check accuracy
y_pred = clf2.predict(X_test)
predictions = pd.DataFrame(y_pred, columns=['home_win_prob'])
predictions['binary'] = predictions['home_win_prob'] > 0.5
y_pred = np.array(predictions['binary'])
print("Tuned Logit's Accuracy = " + str(round(accuracy_score(y_test,y_pred),3)))

Tuned Logit's AUROC= 0.696
Tuned Logit's Accuracy = 0.643


## Model Three: Neural Network

In [14]:
# adjust format of inputs for neural network classifier
predictors = np.matrix(X_train)
target = to_categorical(y_train)
X_val = np.matrix(X_test)
y_val = to_categorical(y_test)

In [15]:
# early stopping callback to expedite tuning search
callbacks = [EarlyStopping(monitor="val_loss", min_delta=1e-3,patience=3,verbose=0)]

In [16]:
# function to create + fit model, required for KerasClassifier

def create_model(layers,nodes,activation,batch_size,optimizer):
    model = Sequential()
    model.add(Input(shape=(14,)))
    for x in range(layers):
        model.add(Dense(nodes, activation=activation))
    model.add(Dense(2, activation='sigmoid'))
    model.compile(optimizer=optimizer,loss='categorical_crossentropy',metrics=['accuracy'])
    model.fit(predictors,target,batch_size=batch_size,epochs=50,validation_data=(X_val, y_val),callbacks=callbacks,verbose=0)
    # tag the Keras model so the voting functions treat it as a non-sklearn predictor
    model._estimator_type = 'regressor'
    return model

In [17]:
# set param grid for NN GridSearch

layers= [2,4,8,16]
nodes= [32,64,128,256]
activation = ['relu','sigmoid','softmax','softplus','softsign','tanh','selu','elu','exponential']
batch_size = [10,25,50,75,100]
param_grid = dict(layers=layers,nodes=nodes,activation=activation,batch_size=batch_size)

In [18]:
# run the GridSearch (done in another notebook)
'''
grid = GridSearchCV(estimator=model,param_grid=param_grid,cv=5)
grid_result=grid.fit(X_train,y_train)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))
'''


__First GridSearch's Optimal Parameters for NN__
Layers: 16
Nodes: 128
Activation: 'relu'
Batch Size: 64

I ran an additional GridSearch including different optimizers after that one out of curiosity and ended up with this set of parameters that might be very good as well:

__Second GridSearch's Optimal Parameters for NN__
Layers: 8
Nodes: 70
Activation: 'relu'
Batch Size: 100
Optimizer: 'RMSprop'

In [19]:
# check AUROCs and Accuracies of Two Tuned NNs

NN1 = create_model(16,128,'relu',64,'adam')
y_pred = NN1.predict(X_val)
y_probs = pd.DataFrame(y_pred,columns=['Loss','Win']).drop(columns=['Loss'])
print("First Tuned Neural Net's AUROC= " + str(round(roc_auc_score(y_test,y_probs),3)))
# check accuracy
predictions = pd.DataFrame(y_pred, columns=['home_loss_prob','home_win_prob'])
predictions['binary'] = predictions['home_win_prob'] > 0.5
y_pred = np.array(predictions['binary'])
print("First Tuned Neural Net's Accuracy = " + str(round(accuracy_score(y_test,y_pred),3)))

# second NN

NN2 = create_model(8,70,'relu',100,'RMSprop')

y_pred = NN2.predict(X_val)
y_probs = pd.DataFrame(y_pred,columns=['Loss','Win']).drop(columns=['Loss'])
print("Second Tuned Neural Net's AUROC= " + str(round(roc_auc_score(y_test,y_probs),3)))
# check accuracy
predictions = pd.DataFrame(y_pred, columns=['home_loss_prob','home_win_prob'])
predictions['binary'] = predictions['home_win_prob'] > 0.5
y_pred = np.array(predictions['binary'])
print("Second Tuned Neural Net's Accuracy = " + str(round(accuracy_score(y_test,y_pred),3)))

First Tuned Neural Net's AUROC= 0.691
First Tuned Neural Net's Accuracy = 0.656
Second Tuned Neural Net's AUROC= 0.689
Second Tuned Neural Net's Accuracy = 0.659


The first neural net is slightly better on AUROC, my primary performance metric, so I'll opt for that one despite its slightly lower accuracy.

In [20]:
NN = create_model(16,128,'relu',64,'adam')

## Models Four + Five: Soft and Hard Voting Classifiers

Because I had trouble instantiating a formal sklearn VotingClassifier with the Neural Network sklearn wrapper KerasClassifier, I wrote some user-defined functions to do it manually.

In [21]:
def soft_voting(X, *models):
    '''takes an input array and fitted predictors, and returns an (n_samples, 2) array of their averaged [Loss, Win] predictions'''
    probs = []
    for model in models:
        if getattr(model, '_estimator_type', None) == 'classifier':
            # sklearn classifiers expose class probabilities via predict_proba
            probs.append(np.asarray(model.predict_proba(X)))
        else:
            # the Keras model's predict already returns one column per class
            probs.append(np.asarray(model.predict(X)))
    # averaging full (n_samples, 2) arrays avoids mixing 2-column and scalar predictions
    return np.mean(probs, axis=0)

In [22]:
def hard_voting(input,*args):
    '''takes an input array and fitted predictors, and returns the share of predictors voting for a home win on each row'''
    preds = {}
    clfs=[]
    vote_preds = []
    for arg in args:
        if arg._estimator_type == 'classifier':
            pred = arg.predict(input)
            preds[arg] = pred
        else:
            new_preds = []
            pred = (arg.predict(input) > 0.5).astype("int32")
            for entry in pred:
                new_preds.append(entry[1])
            preds[arg] = new_preds
    for key in preds.keys():
        clfs.append(key)
    for x in range(len(input)):
        votes = []
        for clf in clfs:
            votes.append(preds[clf][x])
        vote_preds.append(np.mean(votes))
    return vote_preds

In [23]:
# hard voting's AUROC and Accuracy
y_pred = hard_voting(X_val,clf1,clf2,NN)
print("Hard Voting's AUROC*= " + str(round(roc_auc_score(y_test,y_pred),3)))
predictions = pd.DataFrame(y_pred, columns=['home_win_prob'])
predictions['binary'] = predictions['home_win_prob'] > 0.5
y_pred = np.array(predictions['binary'])
print("Hard Voting's Accuracy = " + str(round(accuracy_score(y_test,y_pred),3)))

Hard Voting's AUROC*= 0.67
Hard Voting's Accuracy = 0.657


*Because hard voting in this custom-made format doesn't really have predicted probabilities as an output, using AUROC as a metric here is a bit uninformative. I've included it only for consistency.
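
To make the difference concrete, here is a small worked example with made-up probabilities (not output from the models above). With three voters, the hard-voting score can only take the values 0, 1/3, 2/3, or 1, which is why an AUROC computed from it carries little information, and the two schemes can disagree on the same game.

```python
import numpy as np

# hypothetical home-win probabilities: rows are three models, columns are three games
win_probs = np.array([
    [0.95, 0.55, 0.70],
    [0.45, 0.55, 0.60],
    [0.45, 0.10, 0.80],
])

soft = win_probs.mean(axis=0)          # average of the predicted probabilities
hard = (win_probs > 0.5).mean(axis=0)  # share of models voting "win"

soft_pick = soft > 0.5
hard_pick = hard > 0.5

print(np.round(soft, 3))  # [0.617 0.4   0.7  ]
print(np.round(hard, 3))  # [0.333 0.667 1.   ]
```

In the first game one very confident model outweighs two lukewarm ones under soft voting, while hard voting sides with the majority; the second game shows the reverse.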

In [24]:
# soft voting's AUROC
y_pred = soft_voting(X_val,clf1,clf2,NN)
y_pred = pd.DataFrame(y_pred,columns=['Loss','Win']).drop(columns=['Loss'])
print("Soft Voting's AUROC= " + str(round(roc_auc_score(y_test,y_pred),3)))
# soft voting's Accuracy
y_pred = soft_voting(X_val,clf1,clf2,NN)
predictions = pd.DataFrame(y_pred, columns=['Loss','Win'])
predictions['binary'] = predictions['Win'] > 0.5
y_pred = np.array(predictions['binary'])
print("Soft Voting's Accuracy = " + str(round(accuracy_score(y_test,y_pred),3)))

Soft Voting's AUROC= 0.697
Soft Voting's Accuracy = 0.662


# Conclusion
The best performing of the three individual models by AUROC is __GradientBoostingClassifier__, with an AUROC of __0.698__.
The best performing of the three individual models by classification accuracy is also __GradientBoostingClassifier__, with an accuracy of __0.661__.

The voting classifiers, by comparison, had:
Soft: an AUROC of __0.697__ and an accuracy of __0.662__.
Hard: an AUROC* (see note above) of __0.67__ and an accuracy of __0.657__.

For the purpose of most accurately predicting game outcomes, I would probably opt to use the __GradientBoostingClassifier__. The soft voting classifier might also be worth considering. All of these differences in model performance metrics are very small and are also influenced by random chance in the selection of our sample. Even more hyperparameter tuning could perhaps find an incrementally better model. I'm curious to see how the models perform on fresh data, like the upcoming 2020-2021 NBA season.
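
One rough way to gauge how much of a difference like this could be down to sampling is to bootstrap the test set and look at the spread of AUROC. The sketch below uses synthetic stand-in labels and scores, not this notebook's predictions, and the helper name `bootstrap_auroc` is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def bootstrap_auroc(y_true, y_score, n_boot=1000, seed=0):
    """Return the (2.5th, 97.5th) percentile AUROC over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))  # sample rows with replacement
        if len(np.unique(y_true[idx])) < 2:
            continue  # AUROC is undefined when a resample has only one class
        scores.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.percentile(scores, [2.5, 97.5])


# synthetic stand-in labels and scores, NOT the notebook's predictions
rng = np.random.default_rng(1)
y_demo = rng.integers(0, 2, 600)
score_demo = 0.3 * y_demo + rng.normal(0, 0.5, 600)

low, high = bootstrap_auroc(y_demo, score_demo)
print(round(low, 3), round(high, 3))
```

If the intervals for two models overlap heavily, that is a rough indication that differences in the third decimal place should not be read into too much.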

In [25]:
model = GradientBoostingClassifier(criterion='mse',loss='exponential',
                                          max_depth=2,max_features=8,
                                         min_samples_leaf=4, min_samples_split=6,
                                         n_estimators=250, warm_start=True).fit(X_train,y_train)

In [26]:
pickle.dump(model, open('model.pkl','wb'))
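
Because the model was fit on standardized features, anything that later loads it also needs the same fitted scaler. A minimal sketch of one way to handle that is to pickle both objects together; it uses synthetic stand-in data, and the file name `model_bundle.pkl` and the dictionary keys are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler

# synthetic stand-in data, NOT the NBA data set
X_demo, y_demo = make_classification(n_samples=200, n_features=14, random_state=0)
demo_scaler = StandardScaler().fit(X_demo)
demo_model = GradientBoostingClassifier(n_estimators=20).fit(
    demo_scaler.transform(X_demo), y_demo)

# hypothetical file name; store the scaler and the model side by side
path = os.path.join(tempfile.mkdtemp(), 'model_bundle.pkl')
with open(path, 'wb') as f:
    pickle.dump({'scaler': demo_scaler, 'model': demo_model}, f)

# later: load both and apply the same scaling before predicting
with open(path, 'rb') as f:
    bundle = pickle.load(f)
win_prob = bundle['model'].predict_proba(
    bundle['scaler'].transform(X_demo[:5]))[:, 1]
print(win_prob)
```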