## Random Forest

A notebook that implements a Random Forest classifier to predict victory in Dota 2 matches.

-------------------------------------------------------------------------------------------------------------------------------

Useful pandas attributes and methods for exploring the data before preprocessing it and feeding it into the algorithm:

* df.columns : names of the columns (i.e., features)
* df.dtypes : data type of each column
* df.head() : first rows of the data
* df.info() : column types and non-null counts
* df.describe() : summary statistics of the numeric columns
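
As a quick illustration of the calls listed above, here is a minimal sketch on a toy frame. The frame and the `rad_avg_winrate` column are hypothetical stand-ins, not the real feature table; only `match_id`, `rad_first_pick` and `win_label` appear in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real feature table
toy_df = pd.DataFrame({
    "match_id": [1, 2, 3, 4],
    "rad_first_pick": [True, False, True, False],
    "rad_avg_winrate": [0.52, np.nan, 0.47, 0.55],  # hypothetical feature
    "win_label": [1, 0, 0, 1],
})

print(toy_df.columns.tolist())  # feature names
print(toy_df.dtypes)            # one dtype per column
print(toy_df.head(2))           # first two rows
toy_df.info()                   # column types and non-null counts
print(toy_df.describe())        # summary statistics of numeric columns

# Count missing values per column to decide which ones need imputing
missing_per_column = toy_df.isna().sum()
print(missing_per_column)
```

Counting missing values per column is a compact alternative to scanning the `info()` output by eye.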

Preprocessing steps:

* Handle missing data: 'imputing' the median of the column

* Do we need to check for correlation between features? NO (for RF)

* Do we need to perform feature scaling? NO (for RF): tree splits depend only on the ordering of feature values. If scaling were needed: scaler = MinMaxScaler(feature_range=(0, 1)); X = scaler.fit_transform(X)
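
The imputation step can be sketched on a toy frame as follows. This is an illustrative example rather than the notebook's exact code: `feature_a` is a hypothetical column, and the deliberately extreme value shows why a median fill may be preferable to a mean fill when a column contains outliers.

```python
import numpy as np
import pandas as pd

toy_df = pd.DataFrame({
    "rad_first_pick": [True, False, True, False],
    "feature_a": [1.0, np.nan, 3.0, 100.0],  # hypothetical column with an outlier
    "win_label": [1, 0, 0, 1],
})

# Restrict imputation to float columns so flags and labels are left alone
float_cols = toy_df.select_dtypes(include="float").columns

column_means = toy_df[float_cols].mean()      # pulled up by the outlier
column_medians = toy_df[float_cols].median()  # robust to the outlier
print(column_means["feature_a"], column_medians["feature_a"])

# Fill each missing value with the median of its own column
imputed_df = toy_df.copy()
imputed_df[float_cols] = imputed_df[float_cols].fillna(column_medians)

# Convert the boolean flag to 0/1 integers
imputed_df["rad_first_pick"] = imputed_df["rad_first_pick"].astype(int)
print(imputed_df)
```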

## Regular matches

In [28]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score, roc_curve
import statistics as st

In [None]:
# Directory for the regular matches group
cwd = os.getcwd()
root_directory = os.path.dirname(cwd)
print(root_directory)
regularnewdata_data_dir = root_directory + "/model_features_pre-match_newdata/regular-new/"
print(regularnewdata_data_dir)

#### Exploration and preprocessing of the data

In [None]:
feature_regular_df = pd.read_csv(regularnewdata_data_dir + "dota2_regular-new_features.csv")

In [None]:
# Print feature names
feature_regular_df.columns

In [None]:
# Drop the first column (match id)
feature_regular_df = feature_regular_df.drop(['match_id'], axis=1)

In [None]:
# Print feature names
feature_regular_df.columns

In [None]:
# Existing types
feature_regular_df.dtypes

In [None]:
feature_regular_df.head()

In [None]:
feature_regular_df.info()

In [None]:
# Impute missing values with the median of each column
feature_regular_df = feature_regular_df.fillna(feature_regular_df.median())

In [None]:
feature_regular_df['rad_first_pick'] = feature_regular_df['rad_first_pick'].astype(int)

In [None]:
#feature_regular_df = feature_regular_df.round(3)

In [None]:
feature_regular_df.head()

In [None]:
feature_regular_df.info()

In [None]:
# check_nan_in_df = feature_regular_df.isna()
# check_nan_in_df

### Model building, training and evaluation

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc, roc_auc_score, roc_curve
import statistics as st

In [23]:
# Features: every column except the last; target: the last column
X, y = feature_regular_df.iloc[:, :-1], feature_regular_df.iloc[:, -1]

In [None]:
X.head()

In [None]:
X.shape

In [None]:
y.head()

In [None]:
y.shape

### Grid search

In [30]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [33]:
# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 10, 15, 50],
    # Note: for classifiers 'auto' means the same as 'sqrt'; 'auto' was removed in scikit-learn 1.3
    'max_features': ['auto', 'sqrt', 'log2'],
    'n_estimators': [50, 100, 200, 300]
}
# Create a base model
rf = RandomForestClassifier()
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=2, verbose=2)

In [34]:
grid_search.fit(X_train, y_train)
grid_search.best_params_

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:  1.6min
[Parallel(n_jobs=2)]: Done 144 out of 144 | elapsed: 16.4min finished


{'bootstrap': True,
 'max_depth': 10,
 'max_features': 'auto',
 'n_estimators': 200}

Best parameters:
{'bootstrap': True,
 'max_depth': 10,
 'max_features': 'auto',
 'n_estimators': 200}
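
Beyond `best_params_`, the fitted search object also exposes the score of every candidate, which helps judge how much the best setting actually beats the others. The sketch below uses synthetic data from `make_classification` and a deliberately small grid (both assumptions made so that it runs quickly), and passes `scoring="roc_auc"` so that candidates are ranked by AUC instead of the default accuracy.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real features (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Small hypothetical grid: 2 * 2 * 2 = 8 candidates
small_grid = {
    "max_depth": [2, 5],
    "max_features": ["sqrt", "log2"],
    "n_estimators": [10, 20],
}

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=small_grid,
    cv=3,
    scoring="roc_auc",  # rank candidates by AUC instead of accuracy
)
search.fit(X_demo, y_demo)

# One row per candidate, best first
results = (
    pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results.head())
print(search.best_params_, round(search.best_score_, 4))
```

If the top few rows have nearly identical mean scores, the cheaper configuration (fewer trees, shallower depth) is usually a reasonable choice.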

In [36]:
# Create the model with the best parameters from the grid search (200 trees)
model = RandomForestClassifier(bootstrap=True,
                               n_estimators=200,
                               max_depth=10,
                               max_features='auto')

In [37]:
features = [c for c in feature_regular_df.columns if c != 'win_label']
target = 'win_label'

In [38]:
kfolds = KFold(n_splits=10, shuffle=True)

In [39]:
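cnf = list()         # per-fold confusion matrices
auc_scores = list()  # per-fold AUC values (renamed so sklearn.metrics.auc is not shadowed)
thres = 0.5

for train_idx, test_idx in kfolds.split(X):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

    model.fit(X_train, y_train)
    # predict() returns hard class labels, so the AUC below is computed from labels, not probabilities
    preds = model.predict(X_test)

    cnf.append(confusion_matrix(y_test, (preds > thres).astype(int)))
    auc_scores.append(roc_auc_score(y_test, preds))

# Sum the per-fold confusion matrices into one overall matrix
cnf = sum(cnf)

'Median AUC: {:.04f}'.format(st.median(auc_scores))

# mean_auc = sum(auc_scores) / len(auc_scores)
# 'Average AUC: {:.04f}'.format(mean_auc)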

'Median AUC: 0.5950'
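
The loop above passes the output of `model.predict`, i.e. class labels, to `roc_auc_score`. AUC is normally computed from a continuous score such as the predicted probability of the positive class, so the figure reported here may understate how well the model ranks matches. A minimal sketch of the difference, on synthetic data with hypothetical names (`X_demo`, `demo_model`) and smaller settings than the notebook's model:

```python
import statistics as st

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the real features (assumption for illustration)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

demo_model = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
# random_state makes the shuffled folds reproducible
folds = KFold(n_splits=5, shuffle=True, random_state=0)

label_aucs, proba_aucs = [], []
for train_idx, test_idx in folds.split(X_demo):
    demo_model.fit(X_demo[train_idx], y_demo[train_idx])

    # Hard 0/1 labels: the ROC curve has a single operating point
    hard_labels = demo_model.predict(X_demo[test_idx])
    # Probability of class 1: the ROC curve uses the full ranking
    win_proba = demo_model.predict_proba(X_demo[test_idx])[:, 1]

    label_aucs.append(roc_auc_score(y_demo[test_idx], hard_labels))
    proba_aucs.append(roc_auc_score(y_demo[test_idx], win_proba))

print('Median AUC from labels:        {:.04f}'.format(st.median(label_aucs)))
print('Median AUC from probabilities: {:.04f}'.format(st.median(proba_aucs)))
```

Applying the same change to the notebook would mean replacing `model.predict(X_test)` with `model.predict_proba(X_test)[:, 1]` for the AUC and keeping the thresholded labels for the confusion matrix; the resulting AUC would differ from the value shown above and would need to be re-run.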