## Feature Analysis

To identify which indicators matter most for whether a team wins or loses, we will build a simple binary classifier and study which variables contribute most to predicting wins and losses.

Because the goal is to analyze the features rather than to forecast results, we use statistics from the same game as our x values.

To analyze the importance of the features, we will use the following models:

* Logistic Regression
* Random Forest Classifier
* Gradient Boosting Classifier
* XGBoost Classifier
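
As a rough illustration of how importance can be read from the linear model in this list, the sketch below fits a logistic regression on synthetic data (not the team data) and compares coefficient magnitudes. The data, the feature names, and the variable names are assumptions made up for the example; `coef_` is the actual scikit-learn attribute. Coefficient sizes are only comparable when the features share a scale. The tree-based models expose `feature_importances_` instead.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic data: 500 rows, 3 features on the same [0, 1] scale.
# Only the first feature drives the label.
X_demo = rng.uniform(size=(500, 3))
y_demo = (X_demo[:, 0] > 0.5).astype(int)
demo_names = ["signal", "noise_a", "noise_b"]  # hypothetical feature names

logit = LogisticRegression().fit(X_demo, y_demo)

# One coefficient per feature; the sign gives the direction of the effect,
# the absolute value gives a rough measure of its strength.
coef_by_feature = dict(zip(demo_names, logit.coef_[0]))
strongest = max(coef_by_feature, key=lambda name: abs(coef_by_feature[name]))

print(coef_by_feature)
print(strongest)
```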

In [None]:
import pandas as pd
import numpy as np
import copy

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from sklearn.preprocessing import MinMaxScaler

from modules.model import Model, ML_Model

In [None]:
team_data = pd.read_csv('./data/team_data.csv')

For this model, we will include the following features:

* fg3a
* fg2a
* fta
* fg3_pct
* fg2_pct
* ft_pct
* ast_ratio
* team_tov_pct
* team_orb_pct
* team_efg_pct

Our target variable is `win`.

In [None]:
x_vars = ['fg3a', 'fg2a', 'fta', 'fg3_pct', 'fg2_pct', 'ft_pct', 'ast_ratio', 'team_tov_pct', 'team_orb_pct', 'team_efg_pct']
y_var = ['win']

In [None]:
model_data = team_data.copy(deep=True)
model_data.columns = [x.lower().strip() for x in model_data.columns]
model_data = model_data[x_vars + y_var]
model_data.head()

Since the variables all have different ranges, we will min-max scale all of the features listed above.

In [None]:
# define vars for train test split

TEST_SIZE = 0.2
SEED = 4

In [None]:
X, y = model_data[x_vars].values, model_data[y_var].values

In [None]:
# min max scale variables
min_max_scaler = MinMaxScaler(feature_range=(0,1))

X_scaled = min_max_scaler.fit_transform(X)
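
For reference, min-max scaling to the range (0, 1) maps each column with x' = (x - min) / (max - min). The sketch below applies that formula by hand to a small made-up matrix; the values and variable names are illustrative only and are not taken from the team data.

```python
import numpy as np

# Toy matrix: two columns on very different scales (values are made up).
X_toy = np.array([[10.0, 0.30],
                  [20.0, 0.45],
                  [40.0, 0.60]])

col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)

# Each column is mapped linearly so its minimum becomes 0 and its maximum 1.
X_toy_scaled = (X_toy - col_min) / (col_max - col_min)
print(X_toy_scaled)
```

After scaling, both columns span the same range, so no feature dominates purely because of its units.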

### Feature Analysis with tree-based models

In [None]:
rf = {"Name": "RF", "Classifier": RandomForestClassifier(), "Parameter Grid": {"max_depth":[6,7,8,9,10], "n_estimators":[150,200,250,300,400], 'max_features': [4,5,6,7]}}
gb = {"Name": "GB", "Classifier": GradientBoostingClassifier(), "Parameter Grid": {"max_depth":[6,7,8,9,10], "n_estimators":[150,200,250,300,400], "learning_rate": [0.1, 0.05, 0.01, 0.001], 'max_features': [4,5,6,7], 'subsample': [0.5, 0.6, 0.7, 0.8, 0.9]}}
xgb = {"Name": "XGB", "Classifier": XGBClassifier(verbosity = 0), "Parameter Grid": {"max_depth":[6,7,8,9,10], "n_estimators":[150,200,250,300,400], "eta": [0.1, 0.05, 0.01, 0.001], 'subsample': [0.5, 0.6, 0.7, 0.8, 0.9]}}

algorithms_params = [rf, gb, xgb]
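
To get a feel for the cost of these grids, the sketch below counts the parameter combinations in each one. It assumes the search is exhaustive (the internals of `ML_Model` are not shown here) and that `max_features` is meant to be part of the random forest grid. The lists are copied from the dictionaries above; the helper names are illustrative.

```python
from math import prod

depths = [6, 7, 8, 9, 10]
n_trees = [150, 200, 250, 300, 400]
rates = [0.1, 0.05, 0.01, 0.001]
feats = [4, 5, 6, 7]
subsamples = [0.5, 0.6, 0.7, 0.8, 0.9]

grids = {
    "RF": {"max_depth": depths, "n_estimators": n_trees, "max_features": feats},
    "GB": {"max_depth": depths, "n_estimators": n_trees, "learning_rate": rates,
           "max_features": feats, "subsample": subsamples},
    "XGB": {"max_depth": depths, "n_estimators": n_trees, "eta": rates,
            "subsample": subsamples},
}

# An exhaustive search tries every combination, so the count is the
# product of the number of candidate values per parameter.
n_combinations = {
    name: prod(len(values) for values in grid.values())
    for name, grid in grids.items()
}
print(n_combinations)  # {'RF': 100, 'GB': 2000, 'XGB': 500}
```

With k-fold cross-validation each combination is fitted k times, so the gradient boosting grid in particular can take a long time to run.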


In [None]:
def run_model_commands(model, metrics = ["accuracy"], test_size: float=0.2):
    """
    function to run the commands of a model object
    """
    model.get_best_params(metrics)
    model.set_params(metrics[0])
    model.get_best_score_gs(metrics[0])
    model.train_test_split(test_size = test_size)
    model.train_model()
    model.test_model()
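
The `ML_Model` class lives in `modules.model` and its internals are not shown here. As one plausible reading of the steps above, the sketch below runs a grid search, a train/test split, training, and testing in plain scikit-learn on synthetic data. The data, the small grid, and the order of operations (split first, then search on the training portion) are assumptions of this sketch and may differ from what `ML_Model` actually does.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(4)
X_demo = rng.uniform(size=(200, 4))                       # synthetic features
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 1).astype(int)    # synthetic labels

# Hold out a test set before tuning.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=4
)

# Cross-validated grid search scored on accuracy; by default the best
# configuration is refitted on all of X_train once the search finishes.
search = GridSearchCV(
    RandomForestClassifier(random_state=4),
    param_grid={"max_depth": [2, 4], "n_estimators": [20, 40]},
    scoring="accuracy",
    cv=3,
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_score = search.best_score_
test_accuracy = search.score(X_test, y_test)
print(best_params, round(cv_score, 3), round(test_accuracy, 3))
```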

In [None]:
estimators = {}

# run through the different algorithms chosen
for algorithm in algorithms_params:

    for metric in ["accuracy"]:

        # create a deep copy of the object
        algorithm_copy = copy.deepcopy(algorithm)

        # create the object for the model
        model = ML_Model(X=X_scaled, y=y, base_model=algorithm_copy["Classifier"], param_grid=algorithm_copy["Parameter Grid"], seed=SEED)
    
        # run the commands from the class necessary to create the model
        print(f"running commands {algorithm_copy['Classifier']} & {metric}...")
        run_model_commands(model, [metric], test_size=TEST_SIZE)

        print("storing model ...")
        estimators[f"{algorithm_copy['Name']}_{metric}"] = model
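
Once the models are fitted, the importances can be ranked by feature name. How `ML_Model` exposes its fitted estimator is not shown here, so the sketch below uses a hypothetical helper, `rank_importances`, that accepts any fitted scikit-learn style estimator with a `feature_importances_` attribute. It is demonstrated on synthetic data with placeholder feature names.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier


def rank_importances(fitted_estimator, feature_names):
    """Return the estimator's feature importances as a Series, largest first."""
    importances = pd.Series(fitted_estimator.feature_importances_, index=feature_names)
    return importances.sort_values(ascending=False)


rng = np.random.default_rng(4)
demo_names = ["feat_a", "feat_b", "feat_c"]      # placeholder names
X_demo = rng.uniform(size=(300, 3))              # synthetic features
y_demo = (X_demo[:, 1] > 0.5).astype(int)        # label depends on feat_b only

gb_demo = GradientBoostingClassifier(n_estimators=30, random_state=4).fit(X_demo, y_demo)
ranking = rank_importances(gb_demo, demo_names)
print(ranking)
```

With the real models, `feature_names` would be `x_vars`, so each importance lines up with the column it was computed from.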