## Feature Analysis

To identify the most important indicators of whether a team wins or loses, we will build a simple binary classifier and study which variables contribute most to predicting wins and losses.

Since we are only analyzing the features, we will use statistics from the same game as our x values; the aim is not to build a forecasting model.

To analyze the importance of the features, we will use the following models:

* Logistic Regression
* Random Forest Classifier
* Gradient Boosting Classifier
* XGBoost Classifier

In [1]:
%pip install statsmodels

Note: you may need to restart the kernel to use updated packages.


In [2]:
import pandas as pd
import numpy as np
import copy

from pathlib import Path

from dataclasses import dataclass

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, KFold, RandomizedSearchCV
import joblib

import seaborn as sns

from modules.model import Model, ML_Model

import plotly.express as px

from statsmodels.stats.outliers_influence import variance_inflation_factor

In [3]:
team_data = pd.read_csv('./data/team_data.csv')

This model will include the following features:

* fg3a
* fg2a
* fta
* fg3_pct
* fg2_pct
* ft_pct
* ast_ratio
* team_tov_pct
* team_orb_pct
* team_efg_pct

Our target variable will be the win column.

In [4]:
x_vars = ['fg3a', 'fg2a', 'fta', 'fg3_pct', 'fg2_pct', 'ft_pct', 'ast_ratio', 'team_tov_pct', 'team_orb_pct', 'team_efg_pct']
y_var = ['win']

In [5]:
model_data = team_data.copy(deep=True)
model_data.columns = [x.lower().strip() for x in model_data.columns]
model_data = model_data[x_vars + y_var]
model_data.head()

| | fg3a | fg2a | fta | fg3_pct | fg2_pct | ft_pct | ast_ratio | team_tov_pct | team_orb_pct | team_efg_pct | win |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 33 | 52 | 21 | 0.393939 | 0.461538 | 0.714286 | 15.595758 | 12.9 | 20.9 | 0.512 | 0 |
| 1 | 31 | 50 | 24 | 0.354839 | 0.62 | 0.708333 | 18.524236 | 13.3 | 25.6 | 0.586 | 1 |
| 2 | 45 | 57 | 20 | 0.422222 | 0.421053 | 0.85 | 18.773467 | 14.6 | 28.1 | 0.515 | 0 |
| 3 | 40 | 63 | 38 | 0.35 | 0.444444 | 0.842105 | 14.490927 | 11.8 | 30.2 | 0.476 | 1 |
| 4 | 43 | 65 | 28 | 0.302326 | 0.507692 | 0.785714 | 14.713408 | 9.8 | 24.0 | 0.486 | 1 |


### Min Max Scale data

In [6]:
# min max scale variables
min_max_scaler = MinMaxScaler(feature_range=(0,1))

# scale on the whole dataframe, win variable already binary
scaled_data = min_max_scaler.fit_transform(model_data)

# model data scaled
model_data_scaled = pd.DataFrame(scaled_data, columns=model_data.columns)
model_data_scaled.head()

| | fg3a | fg2a | fta | fg3_pct | fg2_pct | ft_pct | ast_ratio | team_tov_pct | team_orb_pct | team_efg_pct | win |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.433962 | 0.367647 | 0.369565 | 0.525665 | 0.413484 | 0.587302 | 0.413687 | 0.44403 | 0.427403 | 0.427126 | 0.0 |
| 1 | 0.396226 | 0.338235 | 0.434783 | 0.460829 | 0.691432 | 0.578704 | 0.547219 | 0.458955 | 0.523517 | 0.576923 | 1.0 |
| 2 | 0.660377 | 0.441176 | 0.347826 | 0.572562 | 0.34247 | 0.783333 | 0.558583 | 0.507463 | 0.574642 | 0.433198 | 0.0 |
| 3 | 0.566038 | 0.529412 | 0.73913 | 0.452806 | 0.3835 | 0.77193 | 0.36331 | 0.402985 | 0.617587 | 0.354251 | 1.0 |
| 4 | 0.622642 | 0.558824 | 0.521739 | 0.373754 | 0.49444 | 0.690476 | 0.373455 | 0.328358 | 0.490798 | 0.374494 | 1.0 |
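
For reference, min-max scaling maps each column to the range [0, 1] with x' = (x - min) / (max - min). The following is a minimal NumPy sketch of that formula on made-up values (the `toy` array and the `min_max_scale` helper are illustrative, not rows or code from this notebook):

```python
import numpy as np

def min_max_scale(values: np.ndarray) -> np.ndarray:
    """Scale each column of a 2-D array to the [0, 1] range."""
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    span = col_max - col_min
    # Guard against constant columns (zero range) to avoid dividing by zero
    span = np.where(span == 0, 1.0, span)
    return (values - col_min) / span

# Toy data: one column of attempt counts, one column of percentages
toy = np.array([[20.0, 0.30], [30.0, 0.40], [45.0, 0.50]])
scaled = min_max_scale(toy)
print(scaled)
```

A column that only contains 0 and 1, such as win, has a minimum of 0 and a maximum of 1, so the formula leaves it unchanged. That is why scaling the whole dataframe does not alter the target.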


### Heatmap for feature correlation

In [7]:
import plotly.graph_objects as go

def heatmap(vars_heat: list):
    fig = go.Figure(data=go.Heatmap(
                    z=model_data[vars_heat].corr(),
                    x=model_data[vars_heat].columns,
                    y=model_data[vars_heat].columns,
                    hoverongaps=False))
    fig.show()

heatmap(x_vars)

From the heatmap above, there are a few high correlations we need to take into account:

* team_efg_pct is highly correlated with ast_ratio, fg2_pct and fg3_pct
* ast_ratio is highly correlated with fg3_pct and fg2_pct
* fg3a is negatively correlated with fg2a

There do not seem to be any other worrisome correlations among the remaining variables. We still need to define a list of base variables with very low mutual correlations, which can serve as the baseline for the tuning process in the classifiers.
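Reading pairs off a heatmap by eye can miss entries, so the same check can be done programmatically. The sketch below lists every pair of columns whose absolute Pearson correlation reaches a cutoff. The `correlated_pairs` helper, the 0.5 threshold and the toy data are assumptions made for illustration; the notebook itself does not define a cutoff.

```python
import numpy as np

def correlated_pairs(data, names, threshold=0.5):
    """Return (name_a, name_b, r) for every column pair with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are treated as variables
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

# Toy data: b is almost a copy of a, c is independent of both
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)
c = rng.normal(size=500)
toy = np.column_stack([a, b, c])

pairs = correlated_pairs(toy, ["a", "b", "c"], threshold=0.5)
print(pairs)
```

On the notebook's data, the call would look like `correlated_pairs(model_data[x_vars].to_numpy(), x_vars)`.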

In [8]:
base_x = ['fg3a', 'fta', 'ft_pct', 'team_tov_pct', 'team_orb_pct']
heatmap(base_x)

The 5 variables above show little correlation with one another, as the heatmap confirms. As a baseline, we can therefore use 4-5 as the value of the max_features parameter.

In [9]:
x_1 = base_x + ['team_efg_pct']
x_2 = base_x + ['ast_ratio']
x_3 = base_x + ['fg3_pct', 'fg2_pct']

heatmap(x_1)
heatmap(x_2)
heatmap(x_3)

The 3 heatmaps above show very low correlations within each set of features, so we can go ahead and train different models on these feature sets and assess the impact of fg3a on winning.
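Pairwise correlations do not reveal a feature that is explained by a combination of several others, which is what the variance inflation factor (VIF) measures. The notebook imports `variance_inflation_factor` from statsmodels but does not call it in this section, so the following is only a sketch of the idea on toy data. It uses the identity that the VIF of feature j equals the j-th diagonal entry of the inverse correlation matrix, i.e. 1 / (1 - R_j^2) when feature j is regressed on the others with an intercept. The helper name and the data are hypothetical.

```python
import numpy as np

def vif_from_correlation(data: np.ndarray) -> np.ndarray:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    return np.diag(np.linalg.inv(corr))

# Toy data: x3 is almost exactly x1 + x2, x4 is unrelated to the rest
rng = np.random.default_rng(1)
x1 = rng.normal(size=1000)
x2 = rng.normal(size=1000)
x3 = x1 + x2 + 0.1 * rng.normal(size=1000)
x4 = rng.normal(size=1000)

vifs = vif_from_correlation(np.column_stack([x1, x2, x3, x4]))
print(np.round(vifs, 2))
```

A VIF close to 1 means the feature is not explained by the others. As far as I know, the statsmodels function does not add an intercept on its own, so its values may differ from this sketch unless a constant column is included in the design matrix.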

In [10]:
feature_sequences = [base_x, x_1, x_2, x_3]

### Variance Selection

Since the variables all have different ranges, we use the min-max scaled versions of all the variables above.

In [11]:
# define vars for train test split

TEST_SIZE = 0.2
SEED = 4

In [12]:
X, y = model_data_scaled[x_vars].values, model_data_scaled[y_var].values

### Feature Analysis with tree based models

In [13]:
# define models and the parameter grid for which to search
# NOTE: in rf, 'max_features' sits outside "param_grid", so the grid search never
# tunes it and the forest keeps its default (the log below reports 'sqrt')
rf = {"name": "rf", "classifier": RandomForestClassifier(), 'searcher': GridSearchCV,
      "param_grid": {"max_depth": [4,5,6,7,8], "n_estimators": [150,200,250,300,400]},
      'max_features': [4,5,6,7]}

gb = {"name": "gb", "classifier": GradientBoostingClassifier(), 'searcher': RandomizedSearchCV,
      "param_distributions": {"max_depth": [4,5,6,7,8], "n_estimators": [150,200,250,300,400],
                              "learning_rate": [0.1, 0.05, 0.01, 0.001], 'max_features': [4,5,6,7],
                              'subsample': [0.5, 0.6, 0.7, 0.8, 0.9]}}

# NOTE: 'max_features' is a scikit-learn parameter name; XGBoost controls column
# subsampling through its colsample_* parameters, so this entry likely has no effect
xgb = {"name": "xgb", "classifier": XGBClassifier(verbosity = 0), 'searcher': RandomizedSearchCV,
       "param_distributions": {"max_depth": [4,5,6,7,8], "n_estimators": [150,200,250,300],
                               "eta": [0.1, 0.05, 0.01, 0.001], 'max_features': [4,5,6,7],
                               'subsample': [0.5, 0.6, 0.7, 0.8, 0.9]}}

algorithms_params = [xgb, gb, rf]

In [14]:
# define cross validation searchers
gridsearch, randomsearch = GridSearchCV, RandomizedSearchCV

In [15]:
def run_model_commands(model, searcher: object, searcher_params: dict,
    metrics: list=["accuracy"], test_size: float=0.2):

    """
    Run the full sequence of steps on a model object: search for the best
    hyperparameters, set them, split the data, then train and test the model.
    """
    model.get_best_params(searcher=searcher, searcher_params=searcher_params,
        metrics=metrics)
    model.set_params(params_to_set=model.best_params)
    model.train_test_split(test_size=test_size)
    model.train_model()
    model.test_model()
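
The `ML_Model` class comes from `modules.model` and is not shown here, so its internals are unknown. As a rough illustration of what a search, fit and score sequence can look like in plain scikit-learn, here is a sketch on synthetic data. The dataset, the small grid and the variable names are assumptions chosen to keep the example quick, and the sketch is not a reproduction of `ML_Model`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

SEED = 4

# Synthetic stand-in for the scaled feature matrix and the binary target
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=SEED)

# Hold out the test set first so the hyperparameter search never sees it
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=SEED, stratify=y
)

# Deliberately small grid; the notebook's grids are larger
search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=SEED),
    param_grid={"max_depth": [3, 5], "n_estimators": [25, 50]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

# best_estimator_ is already refit on X_train with the best parameters
best_model = search.best_estimator_
train_score = best_model.score(X_train, y_train)
test_score = best_model.score(X_test, y_test)
print(search.best_params_, train_score, test_score)
```

One deliberate difference: the sketch splits the data before searching, whereas `run_model_commands` calls `get_best_params` before `train_test_split`. Whether `ML_Model` searches on the full data depends on its implementation. If it does, the reported test scores may be slightly optimistic.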

In [17]:
estimators = {}

y = model_data_scaled[y_var].values

# run through the different algorithms chosen
for algorithm in algorithms_params:

    # enumerate to keep the index of each feature set
    for keys, values in enumerate(feature_sequences):

        X = model_data_scaled[values].values

        # create a deep copy of the object
        algorithm_copy = copy.deepcopy(algorithm)

        # create the object for the model
        model = ML_Model(X=X, y=y, base_model=algorithm_copy["classifier"], \
            seed=SEED)

        # define searcher and searcher params
        searcher=algorithm_copy['searcher']
        params={'estimator': model.base_model, 'scoring':'accuracy', 'n_jobs':-1}

        # add the param distribution based on the searcher
        if searcher==GridSearchCV:
            params['param_grid'] = algorithm_copy['param_grid']
        else:
            params['param_distributions'] = algorithm_copy['param_distributions']
            params['n_iter'] = 15

        name = algorithm_copy['name']

        # run the commands from the class necessary to create the model
        print(f"running commands {name}_{keys} & {'accuracy'}...")
        print(f'var_list: {values}')
        run_model_commands(model=model, searcher=searcher, searcher_params=params, \
            metrics=['accuracy'])

        print("storing model ...")

        # store the fitted model; the context manager closes the file handle
        with Path(f"models/{name}_model{keys}.pkl").open("wb") as file_object:
            joblib.dump(model.base_model, file_object)

        # create feature importances
        model.feature_importances(x_vars=values, name=name, instance=f"{name}_{keys}")
        # saving it in the dictionary
        estimators[f"{algorithm_copy['name']}_accuracy_{keys}"] = model

running commands xgb_0 & accuracy...
var_list: ['fg3a', 'fta', 'ft_pct', 'team_tov_pct', 'team_orb_pct']
searching for best parameters
Best score: 0.5891815935632675
{'objective': 'binary:logistic', 'use_label_encoder': None, 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': None, 'gpu_id': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': 7, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 200, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': None, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'sub


A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().



Best score: 0.5905149709094075
{'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.01, 'loss': 'log_loss', 'max_depth': 5, 'max_features': 7, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 400, 'n_iter_no_change': None, 'random_state': None, 'subsample': 0.8, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}



Model train score: 0.7016666666666667
Model test score: 0.5933333333333334
storing model ...
running commands gb_1 & accuracy...
var_list: ['fg3a', 'fta', 'ft_pct', 'team_tov_pct', 'team_orb_pct', 'team_efg_pct']
searching for best parameters



Best score: 0.7325927885952517
{'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.001, 'loss': 'log_loss', 'max_depth': 5, 'max_features': 4, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 300, 'n_iter_no_change': None, 'random_state': None, 'subsample': 0.7, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}



Model train score: 0.7566666666666667
Model test score: 0.7437037037037038
storing model ...
running commands gb_2 & accuracy...
var_list: ['fg3a', 'fta', 'ft_pct', 'team_tov_pct', 'team_orb_pct', 'ast_ratio']
searching for best parameters



Best score: 0.6660756250052675
{'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.05, 'loss': 'log_loss', 'max_depth': 4, 'max_features': 4, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 150, 'n_iter_no_change': None, 'random_state': None, 'subsample': 0.6, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}



Model train score: 0.7311111111111112
Model test score: 0.6607407407407407
storing model ...
running commands gb_3 & accuracy...
var_list: ['fg3a', 'fta', 'ft_pct', 'team_tov_pct', 'team_orb_pct', 'fg3_pct', 'fg2_pct']
searching for best parameters



Best score: 0.7300751038749063
{'ccp_alpha': 0.0, 'criterion': 'friedman_mse', 'init': None, 'learning_rate': 0.01, 'loss': 'log_loss', 'max_depth': 8, 'max_features': 4, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 150, 'n_iter_no_change': None, 'random_state': None, 'subsample': 0.5, 'tol': 0.0001, 'validation_fraction': 0.1, 'verbose': 0, 'warm_start': False}



Model train score: 0.8564814814814815
Model test score: 0.7251851851851852
storing model ...
running commands rf_0 & accuracy...
var_list: ['fg3a', 'fta', 'ft_pct', 'team_tov_pct', 'team_orb_pct']
searching for best parameters



Best score: 0.5905144441603901
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 7, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 250, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}



Model train score: 0.6818518518518518
Model test score: 0.5866666666666667
storing model ...
running commands rf_1 & accuracy...
var_list: ['fg3a', 'fta', 'ft_pct', 'team_tov_pct', 'team_orb_pct', 'team_efg_pct']
searching for best parameters



Best score: 0.7352611235345843
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 7, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 150, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}



Model train score: 0.7827777777777778
Model test score: 0.7422222222222222
storing model ...
running commands rf_2 & accuracy...
var_list: ['fg3a', 'fta', 'ft_pct', 'team_tov_pct', 'team_orb_pct', 'ast_ratio']
searching for best parameters



Best score: 0.6657797676404733
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 8, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 200, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}



Model train score: 0.7696296296296297
Model test score: 0.654074074074074
storing model ...
running commands rf_3 & accuracy...
var_list: ['fg3a', 'fta', 'ft_pct', 'team_tov_pct', 'team_orb_pct', 'fg3_pct', 'fg2_pct']
searching for best parameters



Best score: 0.7318518283107229
{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': 7, 'max_features': 'sqrt', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 200, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}



Model train score: 0.7937037037037037
Model test score: 0.7355555555555555
storing model ...
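
The "column-vector y" warning in the output above arises because `y_var` is a list, so `model_data_scaled[y_var].values` has shape (n_samples, 1) rather than (n_samples,). A small sketch of the shape difference and the `ravel()` fix suggested by the warning, using a made-up array:

```python
import numpy as np

# Selecting a DataFrame with a list of one column yields a 2-D array like this
y_column = np.array([[0], [1], [0], [1]])
print(y_column.shape)

# ravel() flattens it to the 1-D shape that scikit-learn estimators expect
y_flat = y_column.ravel()
print(y_flat.shape)
```

In the notebook, this would correspond to `y = model_data_scaled[y_var].values.ravel()`.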


In [46]:
@dataclass
class BoxChart():

    estimators: dict

    def consolidate_dfs(self):
        """
        Returns:
            pd.DataFrame: the importances_df of every estimator in self.estimators,
            concatenated into a single DataFrame.
        """

        # use self.estimators rather than the notebook-level global variable
        return pd.concat([model.importances_df for model in self.estimators.values()])


    def feature_importance_chart(self):
        """
        Show a box plot of importance scores per feature, coloured by model.

        Returns:
            None: the figure is displayed with fig.show().
        """

        df = self.consolidate_dfs()
        self.fig = px.box(data_frame=df, x='feature', y='score', color='model')
        self.button_layout(buttons=['xgb', 'gb', 'rf'], multiplier=4)
        self.fig.show()


    def feature_importance_chart_bar(self):
        """
        Show a grouped bar chart of importance scores per feature,
        coloured by model instance.

        Returns:
            None: the figure is displayed with fig.show().
        """

        df = self.consolidate_dfs()
        self.fig = px.bar(data_frame=df, x='feature', y='score', color='model_instance', barmode='group')
        self.button_layout(buttons=['xgb', 'gb', 'rf'], multiplier=4)
        self.fig.show()


    def button_layout(self, buttons:list, multiplier:int):

        def button_organizer(buttons:list=buttons):
            
            buttons_output=[]
            buttons_length=len(buttons)

            none_button=dict(label="None", method="update", args=[{"visible":[True]}, {"title":"ALL"}])
            buttons_output.append(none_button)

            base_array=[False]*buttons_length*multiplier

            for key, button in enumerate(buttons):

                button_array = copy.deepcopy(base_array)
                button_array[key*multiplier:(key*multiplier)+multiplier] = [True]*multiplier
                button_model=dict(label=f"{button}", method="update", args=[{"visible":button_array}, {"title":f"{button}"}])
                buttons_output.append(button_model)

            return buttons_output

        self.fig.update_layout(updatemenus=[dict(active=0,buttons=list(button_organizer(buttons)))])
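
The button logic in `button_layout` builds one boolean visibility mask per model. The standalone sketch below reproduces only that mask construction so it can be inspected without Plotly; the `visibility_masks` helper is hypothetical. It assumes, as the class appears to, that the figure's traces come in contiguous groups of `multiplier` per model (four feature sets each) and in the same order as the button labels.

```python
def visibility_masks(buttons, multiplier):
    """Map each button label to a boolean mask over all traces."""
    total = len(buttons) * multiplier
    masks = {}
    for index, label in enumerate(buttons):
        mask = [False] * total
        start = index * multiplier
        # Switch on only the block of traces that belongs to this label
        mask[start:start + multiplier] = [True] * multiplier
        masks[label] = mask
    return masks

masks = visibility_masks(["xgb", "gb", "rf"], multiplier=4)
for label, mask in masks.items():
    print(label, mask)
```

If a figure has a different number of traces than `len(buttons) * multiplier`, the masks will not line up with the traces, so the trace count is worth checking for each chart type.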

In [47]:
box = BoxChart(estimators=estimators)
box.feature_importance_chart()

As we can see from the graph above, fg3a has a relatively low importance in the models. Of the variables above, fg3a had the lowest median score across models. For more detail, we can also look at how fg3a scored in each model instance.

In [48]:
box.feature_importance_chart_bar()

As we can see from the chart above, fg3a scored high only in the first instance of each model, where we used only the base variables and did not add the highly correlated variables to the list of features. We should also look at the accuracies to analyze further, but we do have strong evidence that fg3a is not a good predictor of whether a team will win or lose.
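The accuracies mentioned above are already available in the training log. As a small sketch, the test scores can be grouped by feature set to see how much each addition to the base variables helps. The numbers below are copied from the log for the runs whose header line is present; the xgb runs and the one block with a missing header are left out because the captured output is cut off there. The dictionary layout and labels are my own.

```python
from collections import defaultdict

# Test accuracies copied from the training log, keyed by (model, feature set index)
test_scores = {
    ("gb", 1): 0.7437037037037038,
    ("gb", 2): 0.6607407407407407,
    ("gb", 3): 0.7251851851851852,
    ("rf", 0): 0.5866666666666667,
    ("rf", 1): 0.7422222222222222,
    ("rf", 2): 0.654074074074074,
    ("rf", 3): 0.7355555555555555,
}

feature_sets = {
    0: "base_x",
    1: "base_x + team_efg_pct",
    2: "base_x + ast_ratio",
    3: "base_x + fg3_pct, fg2_pct",
}

# Collect the scores of every model that used the same feature set
by_set = defaultdict(list)
for (_, set_index), score in test_scores.items():
    by_set[set_index].append(score)

mean_by_set = {idx: sum(scores) / len(scores) for idx, scores in by_set.items()}
best_set = max(mean_by_set, key=mean_by_set.get)

for idx in sorted(mean_by_set):
    print(f"{feature_sets[idx]:<28} {mean_by_set[idx]:.3f}")
```

In these runs, the base variables alone (which include fg3a) reach a test accuracy of about 0.59, while adding the shooting-efficiency variables lifts it to roughly 0.73-0.74. This is consistent with the low importance of fg3a seen in the charts.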