## Feature Analysis

To identify the most important indicators of whether a team wins or loses, we will build a simple binary classifier and study which variables contribute most to predicting wins and losses.

Since we are only analyzing the features, we will use variables from the same game as our x values; the aim is not to build predictions.

To analyze the importance of the features, we will use the following models:

* Logistic Regression
* Random Forest Classifier
* Gradient Boosting Classifier
* XGBoost Classifier

In [39]:
import pandas as pd
import numpy as np
import copy

from pathlib import Path

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, RandomizedSearchCV
import joblib

from modules.model import Model, ML_Model

In [40]:
team_data = pd.read_csv('./data/team_data.csv')

For this model, we will include the following features:

* fg3a
* fg2a
* fta
* fg3_pct
* fg2_pct
* ft_pct
* ast_ratio
* team_tov_pct
* team_orb_pct
* team_efg_pct

Our target variable will be the `win` variable.

In [41]:
x_vars = ['fg3a', 'fg2a', 'fta', 'fg3_pct', 'fg2_pct', 'ft_pct', 'ast_ratio', 'team_tov_pct', 'team_orb_pct', 'team_efg_pct']
y_var = ['win']

In [42]:
model_data = team_data.copy(deep=True)
model_data.columns = [x.lower().strip() for x in model_data.columns]
model_data = model_data[x_vars + y_var]
model_data.head()

| | fg3a | fg2a | fta | fg3_pct | fg2_pct | ft_pct | ast_ratio | team_tov_pct | team_orb_pct | team_efg_pct | win |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | 52 | 21 | 0.393939 | 0.461538 | 0.714286 | 15.595758 | 12.9 | 20.9 | 0.512 | 0 |
| 1 | 31 | 50 | 24 | 0.354839 | 0.62 | 0.708333 | 18.524236 | 13.3 | 25.6 | 0.586 | 1 |
| 2 | 45 | 57 | 20 | 0.422222 | 0.421053 | 0.85 | 18.773467 | 14.6 | 28.1 | 0.515 | 0 |
| 3 | 40 | 63 | 38 | 0.35 | 0.444444 | 0.842105 | 14.490927 | 11.8 | 30.2 | 0.476 | 1 |
| 4 | 43 | 65 | 28 | 0.302326 | 0.507692 | 0.785714 | 14.713408 | 9.8 | 24.0 | 0.486 | 1 |


Since the variables have very different ranges, we will min-max scale all of the features above.
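Min-max scaling maps each column onto [0, 1] via x' = (x - min) / (max - min). As a minimal sketch, the snippet below applies that formula by hand to the five `fta` values shown in the preview above. It uses only NumPy, and on the real data the minimum and maximum would come from the whole column rather than from five rows.

```python
import numpy as np

# The five `fta` values from the preview rows above.
fta = np.array([21, 24, 20, 38, 28], dtype=float)

# Min-max scaling: subtract the column minimum, divide by the column range.
fta_scaled = (fta - fta.min()) / (fta.max() - fta.min())

# The smallest value (20) maps to 0 and the largest (38) maps to 1.
print(fta_scaled.round(3))
```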

In [43]:
# define vars for train test split

TEST_SIZE = 0.2
SEED = 4

In [44]:
X, y = model_data[x_vars].values, model_data[y_var].values
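Because `y_var` is a list, `model_data[y_var]` returns a one-column DataFrame, so `y` has shape `(n, 1)` rather than `(n,)`. That is the likely source of the `column_or_1d` warnings in the training output further down. A minimal sketch of the difference, using a small made-up frame rather than the real `team_data`:

```python
import pandas as pd

# Toy stand-in for model_data; the values are illustrative only.
toy = pd.DataFrame({"fta": [21, 24, 20], "win": [0, 1, 0]})

y_2d = toy[["win"]].values            # list selector -> 2-D column vector
y_1d = toy["win"].values              # string selector -> 1-D array
y_flat = toy[["win"]].values.ravel()  # flattening the 2-D version also works

print(y_2d.shape, y_1d.shape, y_flat.shape)  # (3, 1) (3,) (3,)
```

Passing the flattened array to the estimators should silence the warning without changing the fitted models.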

In [45]:
# min max scale variables
min_max_scaler = MinMaxScaler(feature_range=(0,1))

X_scaled = min_max_scaler.fit_transform(X)

### Feature Analysis with tree-based models

In [46]:
# define models and the parameter space to search for each
# NOTE: 'max_features' below is a top-level key of `rf`, not part of
# 'param_grid', so the grid search never tunes it (the logged best
# parameters contain only max_depth and n_estimators). Move it inside
# 'param_grid' to include it in the search.
rf = {
    "name": "rf",
    "classifier": RandomForestClassifier(),
    "searcher": GridSearchCV,
    "param_grid": {"max_depth": [6, 7, 8, 9, 10], "n_estimators": [150, 200, 250, 300, 400]},
    "max_features": [4, 5, 6, 7],
}

gb = {
    "name": "gb",
    "classifier": GradientBoostingClassifier(),
    "searcher": RandomizedSearchCV,
    "param_distributions": {
        "max_depth": [6, 7, 8, 9, 10],
        "n_estimators": [150, 200, 250, 300, 400],
        "learning_rate": [0.1, 0.05, 0.01, 0.001],
        "max_features": [4, 5, 6, 7],
        "subsample": [0.5, 0.6, 0.7, 0.8, 0.9],
    },
}

xgb = {
    "name": "xgb",
    "classifier": XGBClassifier(verbosity=0),
    "searcher": RandomizedSearchCV,
    "param_distributions": {
        "max_depth": [6, 7, 8, 9],
        "n_estimators": [150, 200, 250, 300],
        "eta": [0.1, 0.05, 0.01, 0.001],
        "subsample": [0.5, 0.6, 0.7, 0.8, 0.9],
    },
}

algorithms_params = [gb, xgb, rf]
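The two searchers differ in how much of the space they visit: `GridSearchCV` fits every combination in `param_grid`, while `RandomizedSearchCV` samples a fixed number of candidates (`n_iter`, set to 15 in the loop below). A rough sketch of the sizes involved, counting combinations with the standard library only. The `n_combinations` helper is ours, not part of scikit-learn, and `rf_grid` mirrors `param_grid` as written, where `max_features` is not part of the grid.

```python
from math import prod

def n_combinations(space: dict) -> int:
    """Number of distinct parameter combinations in a dict of value lists."""
    return prod(len(values) for values in space.values())

rf_grid = {"max_depth": [6, 7, 8, 9, 10], "n_estimators": [150, 200, 250, 300, 400]}
gb_space = {
    "max_depth": [6, 7, 8, 9, 10],
    "n_estimators": [150, 200, 250, 300, 400],
    "learning_rate": [0.1, 0.05, 0.01, 0.001],
    "max_features": [4, 5, 6, 7],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9],
}
xgb_space = {
    "max_depth": [6, 7, 8, 9],
    "n_estimators": [150, 200, 250, 300],
    "eta": [0.1, 0.05, 0.01, 0.001],
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9],
}

print(n_combinations(rf_grid))    # 25
print(n_combinations(gb_space))   # 2000
print(n_combinations(xgb_space))  # 320
```

With 15 sampled candidates, the randomized searches cover under 1% of the gradient boosting space and under 5% of the XGBoost space, so the reported best parameters are best among those sampled rather than best overall.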

In [47]:
# define cross validation searchers
gridsearch, randomsearch = GridSearchCV, RandomizedSearchCV

In [48]:
def run_model_commands(model, searcher: object, searcher_params: dict,
                       metrics: list=["accuracy"], test_size: float=0.2):

    """
    Run the full workflow on a model object: search for the best
    hyperparameters, apply them, split the data, then train and test.
    """
    model.get_best_params(searcher=searcher, searcher_params=searcher_params,
                          metrics=metrics)
    model.set_params(params_to_set=model.best_params)
    model.train_test_split(test_size=test_size)
    model.train_model()
    model.test_model()
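`ML_Model` lives in `modules.model` and is not shown here, so its internals are unknown. Assuming its methods wrap the usual scikit-learn calls, the sequence in `run_model_commands` would correspond roughly to the sketch below. The synthetic data, the small grid, and the function name `search_fit_score` are all illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

def search_fit_score(estimator, param_grid, X, y, test_size=0.2, seed=4):
    # 1. search for the best hyperparameters with cross-validation
    search = GridSearchCV(estimator=estimator, param_grid=param_grid,
                          scoring="accuracy", cv=3)
    search.fit(X, y)
    # 2. copy the winning parameters onto the base estimator
    estimator.set_params(**search.best_params_)
    # 3. hold out a test set, then 4. fit and 5. score on both splits
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed)
    estimator.fit(X_train, y_train)
    return (search.best_params_,
            estimator.score(X_train, y_train),
            estimator.score(X_test, y_test))

# Synthetic stand-in for the scaled game features (assumption, not real data).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
best, train_acc, test_acc = search_fit_score(
    RandomForestClassifier(random_state=0),
    {"max_depth": [3, 5], "n_estimators": [20, 40]},
    X_demo, y_demo)
print(best, round(train_acc, 3), round(test_acc, 3))
```

In this sketch the search in step 1 sees every row, including those later held out, so the test score is somewhat optimistic. If `ML_Model` follows the same order, splitting before searching would give a cleaner estimate.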

In [49]:
estimators = {}

# run through the different algorithms chosen
for algorithm in algorithms_params:

    for metric in ["accuracy"]:

        # create a deep copy of the object
        algorithm_copy = copy.deepcopy(algorithm)

        # create the object for the model
        model = ML_Model(X=X_scaled, y=y, base_model=algorithm_copy["classifier"], \
            seed=SEED)
    
        # define searcher and searcher params
        searcher=algorithm_copy['searcher']
        params={'estimator': model.base_model, 'scoring':metric, 'n_jobs':-1}

        # add the param distribution based on the searcher
        if searcher==GridSearchCV:
            params['param_grid'] = algorithm_copy['param_grid']
        else:
            params['param_distributions'] = algorithm_copy['param_distributions']
            params['n_iter'] = 15

        # run the commands from the class necessary to create the model
        print(f"running commands {algorithm_copy['classifier']} & {metric}...")
        run_model_commands(model=model, searcher=searcher, searcher_params=params, \
            metrics=[metric])

        print("storing model ...")

        # store the model on disk, closing the file handle when done
        name = algorithm_copy['name']
        with Path(f"models/{name}_model.pkl").open("wb") as file_object:
            joblib.dump(model.base_model, file_object)

        # saving it in the dictionary
        estimators[f"{algorithm_copy['name']}_{metric}"] = model

running commands GradientBoostingClassifier() & accuracy...
searching for best parameters


  y = column_or_1d(y, warn=True)


Best score: 0.7351120535626494
With the following parameters: {'subsample': 0.5, 'n_estimators': 150, 'max_features': 6, 'max_depth': 7, 'learning_rate': 0.01}
{'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.01, 'loss': 'log_loss', 'max_depth': 7, 'max_features': 6, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 150, 'n_iter_no_change': None, 'random_state': None, 'subsample': 0.5, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}


  y = column_or_1d(y, warn=True)


Model train score: 0.8268518518518518
Model test score: 0.7362962962962963
storing model ...
running commands XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              n_estimators=100, n_jobs=None, num_parallel_tree=None,
              predictor=None, random_state=None, ...) & accuracy...
searching for best parameters
Best score: 0.7358528382641724
With the following parameters: {'subsample': 0.7, 'n_estimators

  self.best_estimator_.fit(X, y, **fit_params)


Best score: 0.7355573320653899
With the following parameters: {'max_depth': 7, 'n_estimators': 400}
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 7, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


  self.base_model.fit(self.X_train, self.y_train)


Model train score: 0.7927777777777778
Model test score: 0.7348148148148148
storing model ...
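Since the aim is to study the features, a stored tree-based estimator can be reloaded and its `feature_importances_` ranked. The sketch below shows the round trip on synthetic data written to a temporary directory. It does not read the `models/` files produced above, and the random forest it fits is a stand-in, so the printed ranking says nothing about the real game data.

```python
import tempfile
from pathlib import Path

import joblib
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

x_vars = ['fg3a', 'fg2a', 'fta', 'fg3_pct', 'fg2_pct', 'ft_pct',
          'ast_ratio', 'team_tov_pct', 'team_orb_pct', 'team_efg_pct']

# Synthetic stand-in data with one column per feature name (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=len(x_vars), random_state=0)
fitted = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "rf_model.pkl"
    joblib.dump(fitted, path)       # store the fitted estimator
    reloaded = joblib.load(path)    # load it back from disk

# Impurity-based importances are normalized to sum to 1 across features.
importances = pd.Series(reloaded.feature_importances_, index=x_vars)
print(importances.sort_values(ascending=False))
```

Impurity-based importances tend to favor features with many distinct values, so it is worth cross-checking any ranking against a second measure such as permutation importance.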
