# Random Forest

Notebook that applies the Random Forest algorithm to predict victory in Dota 2 matches.

Useful functions for exploring the data before preprocessing it and feeding it into the algorithm:

* df.columns : names of the columns (i.e., features)
* df.dtypes : data type of each column
* df.head() : first rows of the data
* df.info() : column types and non-null counts
* df.describe() : summary statistics of the numeric columns

Preprocessing steps:

* Handle missing data: impute each missing value with the mean of its column

* Do we need to check for correlation between features? No (for RF)

* Do we need to perform feature scaling? No (for RF), since tree splits depend only on the ordering of the feature values. For models that do need it: scaler = MinMaxScaler(feature_range=(0, 1)); X = scaler.fit_transform(X)
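
The scaling point can be checked empirically. The following is a minimal sketch on synthetic data (not the Dota 2 features), assuming scikit-learn is installed; all names ending in `_demo` or starting with `Xd_`/`yd_` are made up for illustration. It trains two forests with the same seed, one on raw features and one on min-max scaled features, and compares their test predictions, which should agree on all or nearly all rows.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption: NOT the Dota 2 feature table)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
# Put the features on very different scales
X_demo = X_demo * np.array([1, 10, 100, 1000, 1, 10, 100, 1000])

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Fit the scaler on the training split and reuse it on the test split
scaler = MinMaxScaler(feature_range=(0, 1))
Xd_train_scaled = scaler.fit_transform(Xd_train)
Xd_test_scaled = scaler.transform(Xd_test)

# Same hyperparameters and seed, different feature scaling
rf_raw = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xd_train, yd_train)
rf_scaled = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xd_train_scaled, yd_train)

# Fraction of test rows on which both forests predict the same class
agreement = np.mean(rf_raw.predict(Xd_test) == rf_scaled.predict(Xd_test_scaled))
print(f"Fraction of identical predictions: {agreement:.3f}")
```

Min-max scaling is a monotonic transformation, so it preserves the ordering of values within each feature, which is all a threshold split depends on. Small floating-point effects can still cause rare disagreements.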

## Time blowout matches

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve

In [2]:
# Directory for the time blowout group
cwd = os.getcwd()
root_directory = os.path.dirname(cwd)
print(root_directory)
time_blowout_data_dir = root_directory + "/model_features/time_blowout/"
print(time_blowout_data_dir)

C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models
C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models/model_features/time_blowout/


#### Exploration and preprocessing of the data

In [3]:
feature_time_blowout_df = pd.read_csv(time_blowout_data_dir + "dota2_time_blowout_features.csv")

In [None]:
# Print feature names
feature_time_blowout_df.columns

In [None]:
# Existing types
feature_time_blowout_df.dtypes

In [None]:
feature_time_blowout_df.head()

In [None]:
# Drop the first column (match_id), which is an identifier rather than a feature
feature_time_blowout_df = feature_time_blowout_df.drop(['match_id'], axis=1)

In [None]:
feature_time_blowout_df.head()

In [None]:
feature_time_blowout_df.describe()

In [None]:
feature_time_blowout_df.info()

Features with NA values that we need to handle:

- radiant_first_pick
- ability_uses
- item_uses
- different_item_uses
- obs_placed
- sen_placed
- actions_per_min
- tower_damage

In [19]:
# Impute missing values with the mean of each column
feature_time_blowout_df = feature_time_blowout_df.fillna(feature_time_blowout_df.mean())

In [None]:
feature_time_blowout_df.info()

In [None]:
# Confirm that the imputed columns no longer contain missing values
na_columns = ["radiant_first_pick", "ability_uses", "item_uses", "different_item_uses",
              "obs_placed", "sen_placed", "actions_per_min", "tower_damage"]
print(feature_time_blowout_df[na_columns].isnull().sum())
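
The imputation above computes the column means over the whole table, before the train/test split. A common variant computes the means on the training rows only and reuses them for the test rows, so that no test-set statistics enter training. The following is a minimal sketch of that variant on a toy frame with made-up values; the target name `radiant_win` is hypothetical, and only the two feature names are borrowed from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with made-up values; "radiant_win" is a hypothetical target name
toy_df = pd.DataFrame({
    "obs_placed": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0, 7.0, 8.0],
    "tower_damage": [100.0, 200.0, np.nan, 400.0, 500.0, 600.0, np.nan, 800.0],
    "radiant_win": [0, 1, 0, 1, 0, 1, 0, 1],
})

X_toy = toy_df.drop(columns=["radiant_win"])
y_toy = toy_df["radiant_win"]
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=123)

# Column means computed from the training rows only
train_means = X_tr.mean()

# Fill both splits with the training means
X_tr = X_tr.fillna(train_means)
X_te = X_te.fillna(train_means)

# Count of remaining missing values in each split
print(X_tr.isnull().sum().sum(), X_te.isnull().sum().sum())
```

Whether this changes the results for the notebook's data is not something the sketch shows; it only illustrates the mechanics.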

#### Model building, training and evaluation

In [29]:
# roc_curve and roc_auc_score are already imported in the first cell
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import auc
import statistics as st

In [25]:
# Features are all columns except the last; the target is the last column
X, y = feature_time_blowout_df.iloc[:, :-1], feature_time_blowout_df.iloc[:, -1]

In [None]:
X.head()

In [None]:
X.shape

In [None]:
y.head()

In [None]:
y.shape

In [27]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [28]:
# Create the model with 100 trees
model = RandomForestClassifier(n_estimators=100,
                               bootstrap=True,
                               max_features='sqrt')

In [30]:
# Fit on training data
model.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='sqrt', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=None,
            oob_score=False, random_state=None, verbose=0,
            warm_start=False)

In [31]:
# Predicted probability of the positive class (second column of predict_proba)
rf_probs = model.predict_proba(X_test)[:, 1]

In [34]:
# Compute the ROC AUC on the test set
roc_value = roc_auc_score(y_test, rf_probs)

In [35]:
roc_value

0.9999880923219291
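
The notebook imports `roc_curve` and `auc` but reports only the scalar AUC. The following is a minimal sketch of how the two relate, using made-up labels and scores in place of `y_test` and `rf_probs`: `roc_curve` returns the points of the curve, and integrating them with `auc` should give the same value as `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and scores standing in for y_test and rf_probs
y_true_demo = np.array([0, 0, 1, 1, 0, 1, 1, 0])
scores_demo = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.5])

# False positive rate and true positive rate at each score threshold
fpr, tpr, thresholds = roc_curve(y_true_demo, scores_demo)

# Area under the curve, computed two ways
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_true_demo, scores_demo)
print(area_from_curve, area_direct)
```

With the notebook's variables, the same calls would be `roc_curve(y_test, rf_probs)` followed by `auc(fpr, tpr)`; the `fpr` and `tpr` arrays can then be plotted to inspect the curve.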

In [None]:
# Change type of 'radiant_first_pick' from float to int
# feature_time_blowout_df["radiant_first_pick"] = feature_time_blowout_df["radiant_first_pick"].astype("int")

## Score difference blowout matches

## Regular matches