## Random Forest

Notebook implementing the Random Forest algorithm to predict victory in Dota 2 matches.

-------------------------------------------------------------------------------------------------------------------------------

#### If you are running this code on Google Colab, first upload the feature file *dota2_score_blowout_features.csv*

## Score blowout matches

Useful attributes and methods for exploring the data before preprocessing it and feeding it into the algorithm:

* df.columns : names of the columns (i.e., features)
* df.dtypes : data type of each column
* df.head() : first rows of the data
* df.info() : column types and non-null counts
* df.describe() : summary statistics of the numeric columns
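
The calls listed above can be tried on a small stand-in table before loading the real feature file. This is only a sketch: the three column names are taken from later cells of this notebook, and the values are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the feature file; the values are invented for illustration
df = pd.DataFrame({
    "match_id": [1, 2, 3],
    "rad_first_pick": [True, False, True],
    "win_label": [1, 0, 1],
})

print(df.columns.tolist())  # feature names
print(df.dtypes)            # data type of each column
print(df.head(2))           # first two rows
df.info()                   # column types and non-null counts
print(df.describe())        # summary statistics of the numeric columns
```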

In [None]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score, roc_curve, auc
import statistics as st

In [None]:
# NOTE: uncomment this cell if you are running this code on a local machine, and adjust the variables below so they point to the feature file on your machine

# # Set directory for the score blowout match group
# cwd = os.getcwd()
# root_directory = os.path.dirname(cwd)

# score_blowout_data_dir = os.path.join(root_directory, "model_features_pre-match", "score_blowout")
# path_to_features = os.path.join(score_blowout_data_dir, "dota2_score_blowout_features.csv")

In [None]:
# NOTE: use this cell if you are running this code on Google Colab

# Set directory for the score blowout match group. Make sure the feature file is uploaded to this Colab session
path_to_features = "/content/dota2_score_blowout_features.csv"

In [None]:
# Read the data (model feature file)
feature_score_blowout_df = pd.read_csv(path_to_features)

### Exploration and preprocessing of the data

In [None]:
# Print feature names
feature_score_blowout_df.columns

In [None]:
# Drop first column (match id)
feature_score_blowout_df = feature_score_blowout_df.drop(['match_id'], axis=1)

In [None]:
# Existing types
feature_score_blowout_df.dtypes

In [None]:
feature_score_blowout_df.head()

In [None]:
feature_score_blowout_df.info()

In [None]:
# Fill in missing values with the median value of the feature
feature_score_blowout_df = feature_score_blowout_df.fillna(feature_score_blowout_df.median())
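
To make the median imputation concrete, here is a minimal sketch on a toy frame. The column names `a` and `b` are hypothetical, and `numeric_only=True` is an assumption added so the median is taken over numeric columns only.

```python
import numpy as np
import pandas as pd

# Hypothetical two-column frame with one missing value per column
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0],
})

# median() skips NaN by default; fillna() then fills each column with its own median
medians = toy.median(numeric_only=True)
filled = toy.fillna(medians)

print(medians)
print(filled)
```

Each column is filled with its own median, so a missing value in one feature is never replaced by a statistic computed from another.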

In [None]:
# Cast the rad_first_pick feature to integer
feature_score_blowout_df['rad_first_pick'] = feature_score_blowout_df['rad_first_pick'].astype(int)

### Model building, training and evaluation

In [None]:
# Import random forest library
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Split into features (X) and label (y); this assumes the label is the last column
X, y = feature_score_blowout_df.iloc[:, :-1], feature_score_blowout_df.iloc[:, -1]
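
The positional split above depends on the column order. A name-based split is one alternative, sketched here on a toy frame; `win_label` is the target name used later in this notebook, while `feature_a` and the values are hypothetical.

```python
import pandas as pd

# Toy frame where the label is deliberately NOT the last column
toy = pd.DataFrame({
    "win_label": [1, 0, 1, 0],
    "rad_first_pick": [1, 0, 0, 1],
    "feature_a": [0.2, 0.5, 0.1, 0.9],  # hypothetical feature
})

target = "win_label"
X_toy = toy.drop(columns=[target])  # every column except the label
y_toy = toy[target]                 # the label column, selected by name

print(X_toy.columns.tolist())
print(y_toy.tolist())
```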

### Grid search

In [None]:
# Hold out 20% of the data; the grid search is fitted on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [None]:
# Create the grid of parameters
param_grid = {
    'bootstrap': [True],
    'max_depth': [5, 10, 15, 50],
    # NOTE: 'auto' is equivalent to 'sqrt' for classifiers and was removed in scikit-learn 1.3
    'max_features': ['auto', 'sqrt', 'log2'],
    'n_estimators': [50, 100, 200, 300]
}

# Create a base model
rf = RandomForestClassifier()

# Instantiate the grid search model
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid,
                           cv=3, n_jobs=2, verbose=2)

# Perform search and print the best parameters
grid_search.fit(X_train, y_train)
grid_search.best_params_

Fitting 3 folds for each of 48 candidates, totalling 144 fits


[Parallel(n_jobs=2)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=2)]: Done  37 tasks      | elapsed:   15.0s
[Parallel(n_jobs=2)]: Done 144 out of 144 | elapsed:  1.9min finished


{'bootstrap': True,
 'max_depth': 10,
 'max_features': 'log2',
 'n_estimators': 200}

Best parameters:
{'bootstrap': True,
 'max_depth': 10,
 'max_features': 'log2',
 'n_estimators': 200}
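
Besides `best_params_`, the fitted search object exposes `cv_results_`, which shows how close the other candidates were. The sketch below uses synthetic data and a deliberately tiny grid so it runs quickly. `scoring="roc_auc"` is an assumption chosen to match the AUC evaluation used later in this notebook, whereas the search above relies on the default scorer.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the Dota 2 features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Tiny illustrative grid: 2 x 2 = 4 candidates
small_grid = {"max_depth": [3, 5], "n_estimators": [10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0), small_grid,
                      cv=3, scoring="roc_auc")
search.fit(X_demo, y_demo)

# One row per candidate, ranked by mean cross-validated score
results = pd.DataFrame(search.cv_results_)
cols = ["param_max_depth", "param_n_estimators",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results[cols].sort_values("rank_test_score"))
print(search.best_params_)
```

Comparing `mean_test_score` against `std_test_score` helps judge whether the top-ranked candidate is meaningfully better than the runners-up.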

In [None]:
# Create the model using the best parameters found by the grid search
model = RandomForestClassifier(bootstrap=True,
                               n_estimators=200,
                               max_depth=10,
                               max_features='log2')

In [None]:
features = [c for c in feature_score_blowout_df.columns if c != 'win_label']
target = 'win_label'

In [None]:
# Define the number of folds for the k-fold cross-validation
kfolds = KFold(n_splits=10, shuffle=True)

In [None]:
# NOTE: the training process might take a while to execute

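# Use a distinct name so the list does not shadow sklearn.metrics.auc imported above
auc_scores = list()

for train_idx, test_idx in kfolds.split(X):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

    model.fit(X_train, y_train)
    # NOTE: predict() returns hard class labels, so the AUC is computed from 0/1 predictions
    preds = model.predict(X_test)

    auc_scores.append(roc_auc_score(y_test, preds))

'Median AUC: {:.04f}'.format(st.median(auc_scores))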

'Median AUC: 0.6953'
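
The loop above scores hard class predictions. ROC AUC is more commonly computed from predicted probabilities, which lets the metric rank matches by confidence. The following is a minimal sketch of that variant on synthetic data, not a re-run of the experiment above. The names `X_demo`, `y_demo`, `demo_model`, and the fold and tree counts are illustrative assumptions, and the score it prints is not comparable to the figure reported above.

```python
import statistics as st

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

# Synthetic stand-in for the Dota 2 features
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=0)
folds = KFold(n_splits=5, shuffle=True, random_state=0)

auc_scores = []
for train_idx, test_idx in folds.split(X_demo):
    demo_model.fit(X_demo[train_idx], y_demo[train_idx])
    # Probability of the positive class (column 1) instead of hard 0/1 labels
    proba = demo_model.predict_proba(X_demo[test_idx])[:, 1]
    auc_scores.append(roc_auc_score(y_demo[test_idx], proba))

print("Median AUC: {:.4f}".format(st.median(auc_scores)))
```

Fixing `random_state` on both the model and the folds makes the median reproducible between runs.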