## Gradient Boosting Machines (XGBoost)

This notebook uses the XGBoost algorithm to predict victory in Dota 2 matches.

-------------------------------------------------------------------------------------------------------------------------------

## Regular matches

Useful pandas calls for exploring the data before preprocessing it and feeding it into the algorithm (`df` is the loaded DataFrame):

* df.columns : names of the columns (i.e., features)
* df.dtypes : data type of each column
* df.head() : first rows of the data
* df.info() : non-null counts and types per column
* df.describe() : summary statistics of the numeric columns
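
As a minimal sketch, the calls listed above can be tried on a small made-up frame. The column names mirror the dataset used later, but the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: column names mirror the notebook, values are made up.
df = pd.DataFrame({
    "role_carry_r": [1, 1, 0],
    "winR_hp_md_d": [0.57, np.nan, 0.73],
    "rad_first_pick": [True, False, False],
    "win_label": [0, 0, 1],
})

print(df.columns.tolist())   # feature names
print(df.dtypes)             # data type of each column (note the plural: dtypes)
print(df.head())             # first rows
df.info()                    # non-null counts and types per column
print(df.describe())         # summary statistics of the numeric columns

# Count missing values per column to see whether imputation matters.
missing_per_column = df.isna().sum()
print(missing_per_column)
```

Counting missing values with `isna().sum()` is an extra step not in the list, but it is a quick way to decide whether the missing-data handling below is worth comparing.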

Preprocessing steps:

* Try two methods for handling missing data: XGBoost's built-in handling of missing values, and imputation.

* Do we need to check for correlation between features? No, not for XGBoost.

* Do we need to perform feature scaling? No, not for XGBoost. If another model required it, it could be done with `scaler = MinMaxScaler(feature_range=(0, 1)); X = scaler.fit_transform(X)`.
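
One way to see why scaling is unnecessary for tree-based models: a threshold split depends only on the ordering of the feature values, and min-max scaling preserves that ordering. The sketch below uses a hand-written helper, `best_split_partition` (hypothetical, not part of XGBoost), that picks the best single threshold by Gini impurity on synthetic data. XGBoost scores splits with a gradient-based gain rather than Gini, but the same ordering argument should apply.

```python
import numpy as np

def best_split_partition(x, y):
    """Return the boolean 'left' mask of the threshold split with the lowest weighted Gini impurity."""
    def gini(labels):
        if labels.size == 0:
            return 0.0
        p = labels.mean()
        return 2.0 * p * (1.0 - p)

    best_mask, best_score = None, np.inf
    values = np.unique(x)
    # Candidate thresholds: midpoints between consecutive distinct values.
    for thr in (values[:-1] + values[1:]) / 2.0:
        left = x <= thr
        score = (left.sum() * gini(y[left]) + (~left).sum() * gini(y[~left])) / y.size
        if score < best_score:
            best_mask, best_score = left, score
    return best_mask

# Synthetic feature and binary label (not the Dota 2 data).
rng = np.random.default_rng(0)
x = rng.normal(loc=400.0, scale=50.0, size=200)
y = (x + rng.normal(scale=30.0, size=200) > 400).astype(int)

# Min-max scaling to [0, 1], written out by hand.
x_scaled = (x - x.min()) / (x.max() - x.min())

# The best split separates the same samples before and after scaling.
same_partition = np.array_equal(
    best_split_partition(x, y), best_split_partition(x_scaled, y)
)
print(same_partition)
```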

In [2]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score, roc_curve
import statistics as st

In [3]:
# Directory for the regular matches group
cwd = os.getcwd()
root_directory = os.path.dirname(cwd)
print(root_directory)
regular_data_dir = root_directory + "/model_features_pre-match/regular/"
print(regular_data_dir)

C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models
C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models/model_features_pre-match/regular/


### Exploration and preprocessing of the data

In [4]:
feature_regular_df = pd.read_csv(regular_data_dir + "dota2_regular_features-used_features.csv")

In [None]:
# Print feature names
feature_regular_df.columns

In [None]:
# Drop the first column (match id)
feature_regular_df = feature_regular_df.drop(['match_id'], axis=1)
feature_regular_df.columns

In [None]:
# Existing types
feature_regular_df.dtypes

In [10]:
# Fill NA values with the column medians (XGBoost can also handle missing values natively)
feature_regular_df = feature_regular_df.fillna(feature_regular_df.median())
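
The cell above computes the medians on the full dataset, so rows that later end up in a test fold influence the imputed values. A stricter variant, sketched here on made-up data, learns the medians from the training fold only and applies them to both folds. The frame `toy_df`, its column names, and the seed are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Made-up frame with a few missing values; not the Dota 2 data.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "xpm": rng.normal(430.0, 20.0, size=20),
    "goldm": rng.normal(400.0, 30.0, size=20),
})
toy_df.loc[[2, 7, 11], "xpm"] = np.nan
toy_df.loc[[5, 14], "goldm"] = np.nan

remaining_nans = []
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(toy_df):
    train_part, test_part = toy_df.iloc[train_idx], toy_df.iloc[test_idx]

    # Medians come from the training fold only (NaNs are skipped by default).
    train_medians = train_part.median()

    train_filled = train_part.fillna(train_medians)
    test_filled = test_part.fillna(train_medians)

    # Record how many NaNs are left in either part after imputation.
    remaining_nans.append(
        int(train_filled.isna().sum().sum() + test_filled.isna().sum().sum())
    )

print(remaining_nans)
```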

In [12]:
feature_regular_df['rad_first_pick'] = feature_regular_df['rad_first_pick'].astype(int)

In [13]:
feature_regular_df.head()

| | role_carry_r | role_support_r | role_nuker_r | role_disabler_r | role_jungler_r | role_durable_r | role_escape_r | role_pusher_r | role_initiator_r | role_carry_d | ... | winR_hp_md_d | xpm_hp_md_d | goldm_hp_md_d | deathsm_hp_md_d | killsm_hp_md_d | assistsm_hp_md_d | damagem_hp_md_d | healingm_hp_md_d | rad_first_pick | win_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 0.571429 | 434.546131 | 401.444444 | 0.126172 | 0.134533 | 0.285102 | 236.374833 | 0.0 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 0.777778 | 414.0 | 313.888889 | 0.135955 | 0.124604 | 0.205029 | 0.0 | 0.0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | ... | 0.733333 | 441.966667 | 415.7 | 0.122128 | 0.159331 | 0.283245 | 3.304261 | 0.332754 | 0 | 1 |
| 3 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 0.506667 | 425.96 | 412.433333 | 0.136051 | 0.135013 | 0.254816 | 0.0 | 0.0 | 0 | 1 |
| 4 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | ... | 0.571429 | 434.546131 | 401.444444 | 0.126172 | 0.134533 | 0.285102 | 236.374833 | 0.0 | 1 | 0 |


In [None]:
feature_regular_df.describe()

In [14]:
feature_regular_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45130 entries, 0 to 45129
Data columns (total 57 columns):
role_carry_r        45130 non-null int64
role_support_r      45130 non-null int64
role_nuker_r        45130 non-null int64
role_disabler_r     45130 non-null int64
role_jungler_r      45130 non-null int64
role_durable_r      45130 non-null int64
role_escape_r       45130 non-null int64
role_pusher_r       45130 non-null int64
role_initiator_r    45130 non-null int64
role_carry_d        45130 non-null int64
role_support_d      45130 non-null int64
role_nuker_d        45130 non-null int64
role_disabler_d     45130 non-null int64
role_jungler_d      45130 non-null int64
role_durable_d      45130 non-null int64
role_escape_d       45130 non-null int64
role_pusher_d       45130 non-null int64
role_initiator_d    45130 non-null int64
bstr_md_r           45130 non-null int64
bagi_md_r           45130 non-null int64
bint_md_r           45130 non-null int64
strg_md_r           45130 non-

### Model building, training and evaluation

In [15]:
import xgboost as xgb
from xgboost import XGBClassifier
# roc_curve and statistics (as st) are already imported above;
# sklearn.metrics.auc is not needed because roc_auc_score is used instead.

In [16]:
# Features are all columns except the last; the target is the last column (win_label)
X, y = feature_regular_df.iloc[:, :-1], feature_regular_df.iloc[:, -1]

In [None]:
X.head()

In [None]:
X.shape

In [None]:
y.head()

In [None]:
y.shape

In [19]:
# xg_cla = xgb.XGBClassifier(objective ='binary:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
#                 max_depth = 5, alpha = 10, n_estimators = 100, early_stopping_rounds=10, eval_metric='auc')

# # xg_reg_probs = xg_reg.predict_proba(X_test)[:, 1]

In [17]:
features = [c for c in feature_regular_df.columns if c != 'win_label']
target = 'win_label'

In [18]:
kfolds = KFold(n_splits=10, shuffle=True)
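
Because `shuffle=True` is used without a `random_state`, each run of the notebook draws different folds, so the reported AUC can vary slightly between runs. The sketch below shows how a fixed seed makes the splits repeatable. The seed value 0, the helper `make_test_folds`, and the 20-row stand-in array are arbitrary choices for illustration, not taken from the notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

rows = np.arange(20).reshape(-1, 1)  # stand-in for 20 rows of features

def make_test_folds(seed):
    """Return the test-index arrays produced by a seeded, shuffled 10-fold split."""
    kf = KFold(n_splits=10, shuffle=True, random_state=seed)
    return [test_idx for _, test_idx in kf.split(rows)]

folds_a = make_test_folds(0)
folds_b = make_test_folds(0)

# Same seed, same folds: every test fold matches element-wise.
reproducible = all(np.array_equal(a, b) for a, b in zip(folds_a, folds_b))
print(reproducible)

# 20 rows split into 10 folds gives 2 rows per test fold.
fold_sizes = [len(f) for f in folds_a]
print(fold_sizes)
```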

In [21]:
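param = {
    'objective': 'binary:logistic',
    'eval_metric': 'auc',
    # NOTE: 'eta' and 'learning_rate' are aliases for the same parameter,
    # so setting both is ambiguous; keep only one of them.
    'eta': 0.2,
    'colsample_bytree': 0.3,
    'learning_rate': 0.1,
    'max_depth': 10,
    'alpha': 10  # L1 regularization (alias: reg_alpha)
}

num_round = 100  # number of boosting rounds
thres = 0.5      # probability threshold for predicting the positive class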
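
The cross-validation loop in this notebook sets `scale_pos_weight` per fold to the ratio of negative to positive labels, a common heuristic for balancing the two classes. A worked example with invented labels shows the arithmetic:

```python
import numpy as np

# Invented labels: 6 negatives (0) and 4 positives (1).
y_train = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])

n_pos = y_train.sum()            # 4 positives
n_neg = y_train.size - n_pos     # 6 negatives
scale_pos_weight = n_neg / n_pos
print(scale_pos_weight)          # 6 / 4 = 1.5
```

When the classes are close to balanced, this ratio is close to 1 and the parameter has little effect.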

In [22]:
cnf = list()
auc_scores = list()  # named auc_scores so it cannot shadow sklearn.metrics.auc

for train_idx, test_idx in kfolds.split(X):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

    # Balance the classes: ratio of negative to positive labels in this training fold
    param['scale_pos_weight'] = (y_train.size - y_train.sum()) / y_train.sum()

    xg_train = xgb.DMatrix(
        X_train.values, feature_names=features, label=y_train.values
    )
    xg_test = xgb.DMatrix(
        X_test.values, feature_names=features, label=y_test.values
    )

    # The watchlist is only used for monitoring; no early stopping is applied
    watchlist = [(xg_train, 'train'), (xg_test, 'test')]
    bst = xgb.train(param, xg_train, num_round, evals=watchlist, verbose_eval=False)
    preds = bst.predict(xg_test)  # predicted probabilities of the positive class

    cnf.append(confusion_matrix(y_test, (preds > thres).astype(int)))
    auc_scores.append(roc_auc_score(y_test, preds))

# Element-wise sum of the per-fold confusion matrices
cnf = sum(cnf)

'Median AUC: {:.04f}'.format(st.median(auc_scores))

# mean_auc = sum(auc_scores) / len(auc_scores)
# 'Average AUC: {:.04f}'.format(mean_auc)

'Median AUC: 0.6294'
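
The summed confusion matrix `cnf` is computed above but never inspected. As a sketch, accuracy, precision and recall can be read off a 2x2 matrix in scikit-learn's layout (`[[TN, FP], [FN, TP]]`). The counts in `cnf_example` are invented for illustration and are not the notebook's results.

```python
import numpy as np

# Invented counts in scikit-learn's layout: rows = true class, columns = predicted class.
cnf_example = np.array([[50, 30],
                        [20, 100]])

tn, fp, fn, tp = cnf_example.ravel()

accuracy = (tp + tn) / cnf_example.sum()   # share of correct predictions
precision = tp / (tp + fp)                 # predicted positives that are correct
recall = tp / (tp + fn)                    # true positives that are found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Applying the same lines to `cnf` would summarize the predictions pooled over all ten folds at the chosen threshold.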