# Modeling

This notebook builds a function that creates and tunes a Random Forest classifier using grid search with cross-validation. Once that function is complete and working, we can begin adding other algorithms.

In [1]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, GridSearchCV, train_test_split, cross_validate
from sklearn.ensemble import RandomForestClassifier

# Get Some Data

In [27]:
df = pd.read_csv('../final.csv')
df

Unnamed: 0,assistsplayer_1,assistsplayer_10,assistsplayer_2,assistsplayer_3,assistsplayer_4,assistsplayer_5,assistsplayer_6,assistsplayer_7,assistsplayer_8,assistsplayer_9,...,team_totalGold_100,team_totalGold_200,team_trueDamageDoneToChampions_100,team_trueDamageDoneToChampions_200,team_ward_player_100,team_ward_player_200,team_assistsplayer_100,team_assistsplayer_200,team_xp_100,team_xp_200
0,2.0,3.0,3.0,3.0,2.0,3.0,3.0,3.0,1.0,2.0,...,36356,35237,2951,2594,74,391,13,12,42198,41697
1,1.0,15.0,3.0,1.0,4.0,5.0,0.0,5.0,11.0,7.0,...,33239,47104,1757,1697,38,95,14,38,37906,47483
2,0.0,5.0,4.0,2.0,1.0,3.0,1.0,5.0,4.0,7.0,...,33257,37239,3897,4351,158,90,10,22,37746,41185
3,3.0,4.0,0.0,7.0,4.0,14.0,5.0,4.0,5.0,2.0,...,40216,35871,4308,1738,82,108,28,20,41354,36424
4,3.0,4.0,6.0,6.0,6.0,5.0,3.0,1.0,2.0,0.0,...,37900,31360,873,1885,50,41,26,10,40723,37217
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8768,5.0,7.0,6.0,10.0,5.0,12.0,1.0,8.0,4.0,7.0,...,38176,36746,4398,4646,102,68,38,27,42435,39616
8769,2.0,5.0,9.0,2.0,1.0,6.0,1.0,2.0,1.0,2.0,...,38015,37013,2933,2496,68,79,20,11,42133,41796
8770,4.0,11.0,10.0,3.0,6.0,7.0,1.0,2.0,3.0,6.0,...,43423,43224,2726,4244,130,67,30,23,48350,46779
8771,5.0,17.0,3.0,3.0,0.0,3.0,0.0,9.0,3.0,5.0,...,33444,40786,3882,1100,36,74,14,34,38668,41425


Although this data has already been prepared, I still need to drop the column called 'killsplayer_0'. It represents kills made by game objects rather than players, and it contains several null values. After that, all I need to do is split the data into X and y groups and then into train and test sets. Please keep in mind that this data set is only a fraction of our expected data set and is used here only to check the functionality of my model.

__Drop 'killsplayer_0' Column__

In [None]:
#killsplayer_0 can be dropped because it's not an actual player.
df.drop(columns = ['killsplayer_0'], inplace = True)

__Split into X and y Groups__

In [45]:
X, y = df.drop(columns = ['winningTeam']), df.winningTeam

__Create Train and Test Sets__

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 123)

In [47]:
X_train.shape, y_train.shape

((7018, 206), (7018,))

__Create Dummy Variables__

In [48]:
X_train = pd.get_dummies(X_train, drop_first = True)
X_train

Unnamed: 0,assistsplayer_1,assistsplayer_10,assistsplayer_2,assistsplayer_3,assistsplayer_4,assistsplayer_5,assistsplayer_6,assistsplayer_7,assistsplayer_8,assistsplayer_9,...,gameVersion_11.16.390.1945,gameVersion_11.17.393.607,gameVersion_11.17.394.4489,gameVersion_11.18.395.7538,gameVersion_11.19.398.2521,gameVersion_11.19.398.9466,gameVersion_11.20.400.7328,gameVersion_11.21.403.3002,gameVersion_11.22.406.3587,gameVersion_11.23.409.111
8449,0.0,10.0,6.0,4.0,2.0,4.0,1.0,4.0,4.0,7.0,...,0,0,0,0,0,0,0,1,0,0
1364,2.0,12.0,3.0,7.0,3.0,7.0,2.0,4.0,9.0,10.0,...,0,0,0,0,0,0,0,0,1,0
1822,2.0,6.0,10.0,5.0,3.0,14.0,0.0,1.0,1.0,2.0,...,0,0,0,0,0,0,0,0,1,0
6069,10.0,12.0,7.0,7.0,7.0,17.0,4.0,9.0,3.0,2.0,...,0,0,0,0,0,0,0,1,0,0
390,0.0,9.0,5.0,4.0,4.0,4.0,0.0,6.0,0.0,3.0,...,0,0,0,0,0,0,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7382,1.0,11.0,1.0,0.0,3.0,7.0,4.0,6.0,0.0,1.0,...,0,0,1,0,0,0,0,0,0,0
7763,0.0,5.0,1.0,5.0,5.0,6.0,5.0,3.0,2.0,1.0,...,0,0,0,0,0,0,0,0,1,0
5218,2.0,5.0,0.0,1.0,2.0,2.0,5.0,3.0,2.0,4.0,...,0,0,0,0,0,0,0,1,0,0
1346,1.0,6.0,5.0,6.0,3.0,11.0,0.0,4.0,2.0,4.0,...,0,0,0,0,0,0,1,0,0,0
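The cell above encodes only X_train. A minimal sketch of one way to give the test set the same dummy columns, assuming a small hypothetical frame with a single gameVersion column standing in for the real data:

```python
import pandas as pd

# Hypothetical miniature train/test frames; the real X_train and X_test have many more columns.
train = pd.DataFrame({"team_xp_100": [42198, 37906, 37746],
                      "gameVersion": ["11.21", "11.22", "11.23"]})
test = pd.DataFrame({"team_xp_100": [41354, 40723],
                     "gameVersion": ["11.22", "11.24"]})  # "11.24" never appears in train

train_enc = pd.get_dummies(train, drop_first=True)
test_enc = pd.get_dummies(test)  # keep every level here; reindex decides what survives

# Match the training columns exactly: levels missing from the test rows become
# all-zero columns, and levels never seen in training are dropped.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

Without a step like this, a model fit on the encoded X_train could not score X_test, because the two frames would not share the same columns.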


__Create a Baseline__

Since this is a classification problem, the baseline always predicts whichever team has the most wins in the training set.

In [32]:
#Set team 100.0 to be blue_team and team 200.0 to be red_team
def get_team_color(value):
    if value == 100.0:
        return 'blue_team'
    else:
        return 'red_team'

In [51]:
y_train = y_train.apply(get_team_color)

In [52]:
y_train.value_counts()

red_team     3649
blue_team    3369
Name: winningTeam, dtype: int64
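The relabeling above can also be written as a dictionary lookup. This is only a sketch on a hypothetical handful of winningTeam values, not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical sample of winningTeam values, using the same 100.0 / 200.0 codes.
wins = pd.Series([100.0, 200.0, 200.0, 100.0, 200.0], name="winningTeam")

team_names = {100.0: "blue_team", 200.0: "red_team"}
labels = wins.map(team_names)

# Unlike the if/else version, an unexpected code becomes NaN instead of 'red_team'.
unexpected = pd.Series([100.0, 300.0]).map(team_names)

print(labels.value_counts())
```

Whichever version is used, the same relabeling has to be applied to y_test before scoring on the test set, since the models are fit on the string labels.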

In [53]:
#Use the dummy classifier to set the baseline
#red_team has the most wins
from sklearn.dummy import DummyClassifier

baseline = DummyClassifier(strategy = 'constant', constant = 'red_team')
baseline.fit(X_train, y_train)

#Now get the baseline accuracy
baseline.score(X_train, y_train)

0.5199487033342832
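The baseline accuracy is simply the share of the majority class, which can be checked by hand from the counts above:

```python
# Class counts copied from the value_counts() output above.
counts = {"red_team": 3649, "blue_team": 3369}

majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / sum(counts.values())

print(majority_class, round(baseline_accuracy, 4))
```

This matches the DummyClassifier score. As far as I know, DummyClassifier(strategy = 'most_frequent') would pick the majority class on its own, so the constant would not need to be hard-coded.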

__Train a Single Model__

Train a single model to find out roughly how long training takes with so many features. From there, I will be able to estimate how long the grid search might take to complete.

In [11]:
#Create the model (just use default hyperparameters for now, except random_state)
model = RandomForestClassifier(random_state = 123)

#Fit the model
model.fit(X_train, y_train)

#Score the model
model.score(X_train, y_train)

1.0

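The model above finished training extremely quickly, so run time should not be a concern. The training accuracy of 1.0 only shows that the default forest fits the training data perfectly; it is not an estimate of performance on unseen data. Just be mindful of how many models the grid search will actually fit for the given hyperparameter ranges.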
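A rough way to count the fits before launching a search is to multiply the number of hyperparameter combinations by the number of folds. The helper name count_fits is my own, and the two grids are the ones used in this notebook:

```python
from math import prod

def count_fits(param_grid, cv):
    """Number of model fits a full grid search performs, excluding the final refit."""
    n_candidates = prod(len(values) for values in param_grid.values())
    return n_candidates * cv

small_grid = {"max_depth": range(5, 11), "min_samples_leaf": range(5, 11)}
large_grid = {"max_depth": range(1, 16), "min_samples_leaf": range(1, 16)}

print(count_fits(small_grid, cv=5))  # 6 * 6 * 5 = 180
print(count_fits(large_grid, cv=5))  # 15 * 15 * 5 = 1125
```

Multiplying these counts by the single-model training time gives a ballpark for the total search time, ignoring any parallelism.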

__Implement GridSearchCV__

In [12]:
clf = RandomForestClassifier(random_state = 123)

grid = GridSearchCV(clf, {'max_depth': range(5, 11), 'min_samples_leaf': range(5, 11)}, cv = 5)
grid.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(random_state=123),
             param_grid={'max_depth': range(5, 11),
                         'min_samples_leaf': range(5, 11)})

In [13]:
#What was the best score and best parameters
grid.best_score_, grid.best_params_

(0.9532983508245877, {'max_depth': 6, 'min_samples_leaf': 5})

__Write RandomForestClassifier Function__

In [63]:
rf_dict = {
    'max_depth': range(1, 16),
    'min_samples_leaf': range(1, 16)
}

In [64]:
def get_random_forest_models(X_train, y_train, param_dict, cv = 5):
    """
    This function creates and returns an optimized random forest classification model. It also
    prints out the best model's mean cross-validated accuracy score and parameters.
    
    This function takes in the X and y training sets to fit the models.
    
    This function takes in a dictionary that contains the parameters to be iterated through.
    
    This function also takes in a value for the number of cross validation folds to do.
    The cv value defaults to 5.
    """
    #Create the classifier model
    clf = RandomForestClassifier(random_state = 123)
    
    #Create the GridSearchCV object
    grid = GridSearchCV(clf, param_dict, cv = cv)
    
    #Fit the GridSearchCV object
    grid.fit(X_train, y_train)
    
    #Print the best model's score and parameters
    print('Mean Cross-Validated Accuracy: ', round(grid.best_score_, 4))
    print('Max Depth: ', grid.best_params_['max_depth'])
    print('Min Samples Per Leaf: ', grid.best_params_['min_samples_leaf'])
    
    #Return the best model
    return grid.best_estimator_

In [65]:
best_model = get_random_forest_models(X_train, y_train, rf_dict)

Mean Cross-Validated Accuracy:  0.9674
Max Depth:  14
Min Samples Per Leaf:  3


In [57]:
#Check to see if the function returned the model correctly
#Scoring it on the train data is expected to come out somewhat higher than the mean
#cross-validated score, because the refit model has seen every training row
best_model.score(X_train, y_train)

0.9933029353092049
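To make the difference between the three kinds of score concrete, here is a small sketch on synthetic data. The make_classification data and the tiny grid are stand-ins I chose for speed, not the notebook's data or settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data; the real notebook would use the frames built from final.csv.
X, y = make_classification(n_samples=400, n_features=20, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=123)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=25, random_state=123),
    {"max_depth": [3, 6], "min_samples_leaf": [1, 5]},
    cv=5,
)
grid.fit(X_tr, y_tr)

cv_score = grid.best_score_                           # mean accuracy on held-out folds
train_score = grid.best_estimator_.score(X_tr, y_tr)  # accuracy on rows the model was fit on
test_score = grid.best_estimator_.score(X_te, y_te)   # accuracy on rows never used for tuning

print(round(cv_score, 4), round(train_score, 4), round(test_score, 4))
```

The cross-validated and test scores are the ones that estimate performance on new games; the train score mainly confirms that the returned model is the refit best estimator.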

__What were the Most Important Features?__

In [67]:
best_features = pd.DataFrame(best_model.feature_importances_, X_train.columns)
best_features.sort_values(by = 0, ascending = False).head(10)

Unnamed: 0,0
towers_lost_team200,0.153878
inhibs_lost_team100,0.10776
towers_lost_team100,0.10712
inhibs_lost_team200,0.088003
baron_team200,0.044381
team_totalGold_200,0.035326
baron_team100,0.035008
dragon_team200,0.033469
dragon_team100,0.031174
team_totalGold_100,0.026489
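As a quick read of the table above, the sketch below sums the top-10 importances and groups the two per-team columns of each objective. The values are copied from the output above; grouping on the text before the last underscore is my own convention:

```python
import pandas as pd

# Top-10 importances copied from the table above.
top10 = pd.Series({
    "towers_lost_team200": 0.153878,
    "inhibs_lost_team100": 0.10776,
    "towers_lost_team100": 0.10712,
    "inhibs_lost_team200": 0.088003,
    "baron_team200": 0.044381,
    "team_totalGold_200": 0.035326,
    "baron_team100": 0.035008,
    "dragon_team200": 0.033469,
    "dragon_team100": 0.031174,
    "team_totalGold_100": 0.026489,
}, name="importance")

# Share of the total importance held by these ten features
# (a forest's importances are normalized to sum to 1).
top10_share = top10.sum()

# Combine the two per-team columns, e.g. towers_lost_team100/200 -> towers_lost.
by_objective = (
    top10.groupby(lambda name: name.rsplit("_", 1)[0]).sum().sort_values(ascending=False)
)

print(round(top10_share, 4))
print(by_objective)
```

On these numbers, the ten features hold roughly two thirds of the total importance, with towers and inhibitors lost accounting for most of it.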


### AdaBoostClassifier

I will use the AdaBoostClassifier with a RandomForestClassifier as the base estimator (the base_estimator argument, which scikit-learn 1.2 renamed to estimator).

In [58]:
from sklearn.ensemble import AdaBoostClassifier

In [19]:
#Create the RandomForestClassifier object
rf = RandomForestClassifier(random_state = 123)

#Create the AdaBoostClassifier object
adaBoost = AdaBoostClassifier(rf, random_state = 123)

#Create GridSearchCV object
grid = GridSearchCV(adaBoost, {'n_estimators': range(50, 101, 10)}, cv = 5)

#Fit the grid object
grid.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=AdaBoostClassifier(base_estimator=RandomForestClassifier(random_state=123),
                                          random_state=123),
             param_grid={'n_estimators': range(50, 101, 10)})

In [20]:
#What was the best score and best parameters
grid.best_score_, grid.best_params_

(0.9567316341829086, {'n_estimators': 50})

__Let's see if it can improve the performance of our best RandomForest model from earlier__

In [21]:
#Create the AdaBoostClassifier object
adaBoost = AdaBoostClassifier(best_model, random_state = 123)

#Create GridSearchCV object
grid = GridSearchCV(adaBoost, {'n_estimators': range(50, 101, 5)}, cv = 5)

#Fit the grid object
grid.fit(X_train, y_train)

#What was the best score and best parameters
grid.best_score_, grid.best_params_

(0.9584557721139431, {'n_estimators': 50})

This is slightly better than boosting the default random forest (0.9585 vs. 0.9567 mean cross-validated accuracy).

In [59]:
#Create a function for AdaBoost
def get_adaBoosted_model(X_train, y_train, model_to_boost, param_dict, cv = 5):
    """
    This function creates and returns an optimized AdaBoost classification model built on top of
    the given base model. It also prints out the best model's mean cross-validated accuracy score
    and parameters.
    
    This function takes in the X and y training sets to fit the models.
    
    This function takes in the model to boost (model_to_boost), which is used as the base estimator.
    
    This function takes in a dictionary that contains the parameters to be iterated through.
    It must include 'n_estimators' and 'learning_rate', since both are printed below.
    
    This function also takes in a value for the number of cross validation folds to do.
    The cv value defaults to 5.
    """
    #Create the AdaBoost Classifier
    adaBoost_clf = AdaBoostClassifier(model_to_boost, random_state = 123)
    
    #Create the GridSearchCV object
    grid = GridSearchCV(adaBoost_clf, param_dict, cv = cv)
    
    #Fit the GridSearchCV object
    grid.fit(X_train, y_train)
    
    #Print the best model's score and parameters
    print('Mean Cross-Validated Accuracy: ', round(grid.best_score_, 4))
    print('Num Estimators: ', grid.best_params_['n_estimators'])
    print('Learning Rate: ', grid.best_params_['learning_rate'])
    
    #Return the best model
    return grid.best_estimator_

In [60]:
adaBoost_params = {
    'n_estimators': range(50, 61),
    'learning_rate': range(1, 6)
}

In [61]:
#Test the above function
ada_boosted_clf = get_adaBoosted_model(X_train, y_train, best_model, adaBoost_params)

Mean Cross-Validated Accuracy:  0.9675
Num Estimators:  50
Learning Rate:  1


In [62]:
#This performed about the same as the random forest alone (0.9675 vs. 0.9674).
#What were the most important features?
best_features = pd.DataFrame(ada_boosted_clf.feature_importances_, X_train.columns)
best_features.sort_values(by = 0, ascending = False).head(10)

Unnamed: 0,0
towers_lost_team200,0.116428
towers_lost_team100,0.104307
inhibs_lost_team200,0.098691
inhibs_lost_team100,0.098336
baron_team200,0.040657
baron_team100,0.034809
dragon_team100,0.0274
dragon_team200,0.025379
team_totalGold_100,0.014561
team_xp_100,0.014471
