## Random Forest

Notebook implementing a Random Forest classifier to predict victory in Dota 2 matches.

-------------------------------------------------------------------------------------------------------------------------------

Useful functions for exploring the data before preprocessing it and feeding it into the algorithm:

* df.columns : names of the columns (i.e., features)
* df.dtypes : data type of each column
* df.head() : first rows of the data
* df.info() : number of entries, columns, and memory usage
* df.describe() : summary statistics of the numeric columns
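
As a minimal sketch, the calls listed above can be tried on a small toy DataFrame. The column names and values here are made up for illustration and are not taken from the Dota 2 dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): two numeric features, one with a missing value, and a label
df = pd.DataFrame({
    "feature_a": [1, 0, 1, 0],
    "feature_b": [195.8, np.nan, 324.9, 312.0],
    "label": [1, 1, 0, 1],
})

print(df.columns)      # column (feature) names
print(df.dtypes)       # data type of each column
print(df.head())       # first rows
df.info()              # entry count, non-null counts, memory usage
print(df.describe())   # summary statistics of the numeric columns

# Number of missing values per column, useful before deciding how to impute
missing_per_column = df.isna().sum()
print(missing_per_column)
```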

Preprocessing steps:

* Handle missing data: impute each column's median (this notebook uses fillna with the column medians)

* Do we need to check for correlation between features? NO (for RF)

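* Do we need to perform feature scaling? NO (for RF): trees split on thresholds, so only the ordering of a feature's values matters. If scaling were needed: scaler = MinMaxScaler(feature_range=(0, 1)); X = scaler.fit_transform(X)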
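
The claim that Random Forest does not need feature scaling can be checked with a small experiment. The sketch below uses synthetic data from `make_classification` (an assumption, not the Dota 2 features) and trains two forests with the same seed, one on raw features and one on min-max scaled features. Because scaling preserves the ordering of each feature's values, the two forests should agree on essentially all held-out predictions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)
X_fit, X_eval = X_toy[:200], X_toy[200:]
y_fit = y_toy[:200]

# Fit the scaler on the training part only, then apply it to both parts
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_fit)
X_fit_scaled, X_eval_scaled = scaler.transform(X_fit), scaler.transform(X_eval)

# Same hyperparameters and seed, different feature scales
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit_scaled, y_fit)

# Fraction of held-out samples on which the two forests predict the same class
agreement = np.mean(rf_raw.predict(X_eval) == rf_scaled.predict(X_eval_scaled))
print(f"Prediction agreement: {agreement:.3f}")
```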

## Score blowout matches

In [3]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score, roc_curve
import statistics as st

In [4]:
# Directory for the score blowout group
cwd = os.getcwd()
root_directory = os.path.dirname(cwd)
print(root_directory)
score_blowout_data_dir = root_directory + "/model_features_pre-match_newdata/score_blowout/"
print(score_blowout_data_dir)

C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models
C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models/model_features_pre-match_newdata/score_blowout/
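
The second printed path mixes `\` and `/` separators because the directory is built by string concatenation. A possible alternative, sketched here with `pathlib`, joins the same folder and file names used in this notebook with the separator of the current operating system. It assumes the same layout (data folder next to the notebook's parent directory).

```python
from pathlib import Path

# Parent of the current working directory, as in the cell above
root_directory = Path.cwd().parent

# Same folder names as in the notebook, joined with the OS-specific separator
score_blowout_data_dir = root_directory / "model_features_pre-match_newdata" / "score_blowout"
csv_path = score_blowout_data_dir / "dota2_score_blowout_features.csv"

print(csv_path)
# pd.read_csv(csv_path) accepts a Path object directly
```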


### Exploration and preprocessing of the data

In [6]:
feature_score_blowout_df = pd.read_csv(score_blowout_data_dir + "dota2_score_blowout_features.csv")

In [7]:
# Print feature names
feature_score_blowout_df.columns

Index(['match_id', 'rad_hero_1', 'rad_hero_2', 'rad_hero_3', 'rad_hero_4',
       'rad_hero_5', 'rad_hero_6', 'rad_hero_7', 'rad_hero_8', 'rad_hero_9',
       ...
       'hero_damagem_hp_hero3_d', 'hero_damagem_hp_hero4_d',
       'hero_damagem_hp_hero5_d', 'healingm_hp_hero1_d', 'healingm_hp_hero2_d',
       'healingm_hp_hero3_d', 'healingm_hp_hero4_d', 'healingm_hp_hero5_d',
       'rad_first_pick', 'win_label'],
      dtype='object', length=459)

In [8]:
# Drop the first column (match id), which is an identifier and not a feature
feature_score_blowout_df = feature_score_blowout_df.drop(['match_id'], axis=1)

In [9]:
# Print feature names
feature_score_blowout_df.columns

Index(['rad_hero_1', 'rad_hero_2', 'rad_hero_3', 'rad_hero_4', 'rad_hero_5',
       'rad_hero_6', 'rad_hero_7', 'rad_hero_8', 'rad_hero_9', 'rad_hero_10',
       ...
       'hero_damagem_hp_hero3_d', 'hero_damagem_hp_hero4_d',
       'hero_damagem_hp_hero5_d', 'healingm_hp_hero1_d', 'healingm_hp_hero2_d',
       'healingm_hp_hero3_d', 'healingm_hp_hero4_d', 'healingm_hp_hero5_d',
       'rad_first_pick', 'win_label'],
      dtype='object', length=458)

In [10]:
# Existing types
feature_score_blowout_df.dtypes

rad_hero_1                   int64
rad_hero_2                   int64
rad_hero_3                   int64
rad_hero_4                   int64
rad_hero_5                   int64
rad_hero_6                   int64
rad_hero_7                   int64
rad_hero_8                   int64
rad_hero_9                   int64
rad_hero_10                  int64
rad_hero_11                  int64
rad_hero_12                  int64
rad_hero_13                  int64
rad_hero_14                  int64
rad_hero_15                  int64
rad_hero_16                  int64
rad_hero_17                  int64
rad_hero_18                  int64
rad_hero_19                  int64
rad_hero_20                  int64
rad_hero_21                  int64
rad_hero_22                  int64
rad_hero_23                  int64
rad_hero_25                  int64
rad_hero_26                  int64
rad_hero_27                  int64
rad_hero_28                  int64
rad_hero_29                  int64
rad_hero_30         

In [11]:
feature_score_blowout_df.head()

Unnamed: 0,rad_hero_1,rad_hero_2,rad_hero_3,rad_hero_4,rad_hero_5,rad_hero_6,rad_hero_7,rad_hero_8,rad_hero_9,rad_hero_10,...,hero_damagem_hp_hero3_d,hero_damagem_hp_hero4_d,hero_damagem_hp_hero5_d,healingm_hp_hero1_d,healingm_hp_hero2_d,healingm_hp_hero3_d,healingm_hp_hero4_d,healingm_hp_hero5_d,rad_first_pick,win_label
0,0,0,0,0,0,0,0,0,0,0,...,195.814105,,,0.0,0.0,108.144296,,,1.0,1
1,0,0,0,0,0,0,0,0,0,0,...,,,,,,,,,0.0,1
2,0,0,0,0,0,0,0,1,0,0,...,,,,,,,,,0.0,0
3,0,0,0,0,0,0,0,0,0,0,...,,,,,,,,,0.0,1
4,0,0,0,0,0,1,0,0,0,0,...,,,,,,,,,0.0,1


In [12]:
feature_score_blowout_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5528 entries, 0 to 5527
Columns: 458 entries, rad_hero_1 to win_label
dtypes: float64(161), int64(297)
memory usage: 19.3 MB


In [13]:
feature_score_blowout_df = feature_score_blowout_df.fillna(feature_score_blowout_df.median())
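
The fill above computes the medians on the full dataset, so rows that later end up in a test fold also contribute to the imputed values. One possible way to keep imputation inside each training fold is a scikit-learn `Pipeline` with `SimpleImputer(strategy="median")`. The sketch below is not the notebook's method. It runs on synthetic data with hypothetical column names `f0` to `f4`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Synthetic features and a label that depends on the first two columns
X_toy = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"f{i}" for i in range(5)])
y_toy = (X_toy["f0"] + X_toy["f1"] > 0).astype(int)

# Blank out roughly 20% of the values to mimic missing data
X_toy = X_toy.mask(rng.random(X_toy.shape) < 0.2)

# The imputer is re-fitted on the training part of every fold
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="roc_auc")
print(f"Mean AUC over {len(scores)} folds: {scores.mean():.4f}")
```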

In [14]:
feature_score_blowout_df['rad_first_pick'] = feature_score_blowout_df['rad_first_pick'].astype(int)

In [15]:
# feature_score_blowout_df = feature_score_blowout_df.round(3)

In [16]:
feature_score_blowout_df.head()

Unnamed: 0,rad_hero_1,rad_hero_2,rad_hero_3,rad_hero_4,rad_hero_5,rad_hero_6,rad_hero_7,rad_hero_8,rad_hero_9,rad_hero_10,...,hero_damagem_hp_hero3_d,hero_damagem_hp_hero4_d,hero_damagem_hp_hero5_d,healingm_hp_hero1_d,healingm_hp_hero2_d,healingm_hp_hero3_d,healingm_hp_hero4_d,healingm_hp_hero5_d,rad_first_pick,win_label
0,0,0,0,0,0,0,0,0,0,0,...,195.814105,311.976428,339.120672,0.0,0.0,108.144296,0.0,0.0,1,1
1,0,0,0,0,0,0,0,0,0,0,...,324.867968,311.976428,339.120672,0.0,0.0,0.0,0.0,0.0,0,1
2,0,0,0,0,0,0,0,1,0,0,...,324.867968,311.976428,339.120672,0.0,0.0,0.0,0.0,0.0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,324.867968,311.976428,339.120672,0.0,0.0,0.0,0.0,0.0,0,1
4,0,0,0,0,0,1,0,0,0,0,...,324.867968,311.976428,339.120672,0.0,0.0,0.0,0.0,0.0,0,1


In [17]:
feature_score_blowout_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5528 entries, 0 to 5527
Columns: 458 entries, rad_hero_1 to win_label
dtypes: float64(160), int32(1), int64(297)
memory usage: 19.3 MB


In [None]:
# check_nan_in_df = feature_score_blowout_df.isna()
# check_nan_in_df

### Model building, training and evaluation

In [18]:
# roc_curve, roc_auc_score and statistics (st) are already imported in the first cell
from sklearn.ensemble import RandomForestClassifier

In [19]:
X, y = feature_score_blowout_df.iloc[:,:-1],feature_score_blowout_df.iloc[:,-1]

In [None]:
X.head()

In [None]:
X.shape

In [None]:
y.head()

In [None]:
y.shape

### Grid search

In [24]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Create the parameter grid based on the results of random search 
param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 10, 15, 50],
    # Note: for classifiers 'auto' behaves like 'sqrt'; it was removed in scikit-learn 1.3
    'max_features': ['auto', 'sqrt', 'log2'],
    'n_estimators': [50, 100, 200, 300]
}

# Create a base model
rf = RandomForestClassifier()

# Instantiate the grid search model
grid_search = GridSearchCV(estimator = rf, param_grid = param_grid, 
                          cv = 3, n_jobs = 2, verbose = 2)

grid_search.fit(X_train, y_train)
grid_search.best_params_

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:   15.0s
[Parallel(n_jobs=2)]: Done 144 out of 144 | elapsed:  1.9min finished


{'bootstrap': True,
 'max_depth': 10,
 'max_features': 'log2',
 'n_estimators': 200}

Best parameters:
{'bootstrap': True,
 'max_depth': 10,
 'max_features': 'log2',
 'n_estimators': 200}
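
`best_params_` reports only the winning combination. To see how close the other candidates were, the full results can be read from `cv_results_`. The sketch below uses a smaller grid on synthetic data so that it runs quickly. The choice of `scoring="roc_auc"` is an assumption made here to match the metric reported later, whereas the search above uses the classifier's default accuracy score.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_toy, y_toy = make_classification(n_samples=300, n_features=20, random_state=0)

# Reduced grid: 2 * 2 * 2 = 8 candidates
small_grid = {
    "max_depth": [5, 10],
    "max_features": ["sqrt", "log2"],
    "n_estimators": [25, 50],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
    scoring="roc_auc",
)
search.fit(X_toy, y_toy)

# One row per candidate, best-ranked first
results = (
    pd.DataFrame(search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results.head())
print(search.best_params_)
```

Comparing `mean_test_score` against `std_test_score` shows whether the top candidates differ by more than the fold-to-fold noise.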

In [26]:
# Create the model with the best parameters from the grid search (200 trees)
model = RandomForestClassifier(bootstrap = True,
                               n_estimators=200, 
                               max_depth=10,
                               max_features = 'log2')

In [27]:
features = [c for c in feature_score_blowout_df.columns if c != 'win_label']
target = 'win_label'

In [28]:
kfolds = KFold(n_splits=10, shuffle=True)

In [29]:
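cnf = list()         # per-fold confusion matrices
auc_scores = list()  # per-fold AUC values
thres = 0.5

for train_idx, test_idx in kfolds.split(X):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

    model.fit(X_train, y_train)
    # predict() returns hard 0/1 labels, so the threshold below leaves them unchanged
    # and the AUC is computed from labels rather than from class probabilities
    preds = model.predict(X_test)

    cnf.append(confusion_matrix(y_test, (preds > thres).astype(int)))
    auc_scores.append(roc_auc_score(y_test, preds))

# Sum the per-fold confusion matrices into one overall matrix
cnf = sum(cnf)

'Median AUC: {:.04f}'.format(st.median(auc_scores))

# auc_mean = sum(auc_scores) / len(auc_scores)
# 'Average AUC: {:.04f}'.format(auc_mean)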

'Median AUC: 0.6953'
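
The value above comes from hard class labels returned by `predict`. AUC is usually computed from scores, so a possible variant passes the positive-class probabilities from `predict_proba` to `roc_auc_score` and applies the 0.5 threshold only for the confusion matrix. The sketch below is an illustration on synthetic data, not a rerun of the notebook, so its numbers are not comparable to 0.6953. `StratifiedKFold` and the fixed seeds are assumptions added for reproducibility.

```python
import statistics as st

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_toy, y_toy = make_classification(n_samples=400, n_features=20, random_state=0)

model = RandomForestClassifier(
    n_estimators=100, max_depth=10, max_features="log2", random_state=0
)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

threshold = 0.5
auc_scores = []
cnf_total = np.zeros((2, 2), dtype=int)

for train_idx, test_idx in folds.split(X_toy, y_toy):
    model.fit(X_toy[train_idx], y_toy[train_idx])

    # Probability of the positive class (column 1) for each test sample
    proba = model.predict_proba(X_toy[test_idx])[:, 1]

    # AUC uses the probabilities; the confusion matrix uses thresholded labels
    auc_scores.append(roc_auc_score(y_toy[test_idx], proba))
    cnf_total += confusion_matrix(
        y_toy[test_idx], (proba > threshold).astype(int), labels=[0, 1]
    )

print("Median AUC: {:.04f}".format(st.median(auc_scores)))
print(cnf_total)
```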