## Gradient Boosting Machines (XGBoost)

Notebook that implements the XGBoost algorithm to predict victory in Dota 2 matches.

-------------------------------------------------------------------------------------------------------------------------------

#### Note that this is my first version of the XGBoost implementation.

#### The version that I actually used can be found in the *prediction-explanation-SHAP* directory, together with the SHAP technique implementation!

#### If you are running this code on Google Colab, you need to first upload the following feature file: *dota2_time_blowout_features.csv*

## Time blowout matches

Useful functions for exploring the data and for preprocessing it before it is fed into the algorithm:

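* df.columns : to see the names of the columns (i.e., features)
* df.dtypes : to see the data type of each column
* df.head() : to preview the first rows
* df.info() : to see column types, non-null counts and memory usage
* df.describe() : to see summary statistics of the numeric columns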
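
As a minimal sketch of these calls, the snippet below builds a tiny stand-in DataFrame and runs each inspection on it. The column names and values are made up for illustration and are not taken from the feature file.

```python
import pandas as pd

# Hypothetical toy data; the real feature file has different columns.
toy_df = pd.DataFrame({
    "match_id": [101, 102, 103, 104],
    "hero_winrate_diff": [0.05, -0.02, 0.10, -0.07],
    "win_label": [1, 0, 1, 0],
})

print(toy_df.columns)     # column (feature) names
print(toy_df.dtypes)      # data type of each column
print(toy_df.head(2))     # first two rows
toy_df.info()             # column types and non-null counts (prints directly)
print(toy_df.describe())  # summary statistics of the numeric columns
```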

In [None]:
# Import necessary libraries
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score, roc_curve, auc
import statistics as st

In [None]:
# NOTE: uncomment this cell if you are running this code on a local machine,
# and adjust the variables below so that they point to the feature file on your machine

# # Set directory for the time blowout match group
# cwd = os.getcwd()
# root_directory = os.path.dirname(cwd)

# # os.path.join keeps the path valid on Windows, Linux and macOS
# time_blowout_data_dir = os.path.join(root_directory, "model_features_pre-match", "time_blowout")
# path_to_features = os.path.join(time_blowout_data_dir, "dota2_time_blowout_features.csv")

In [None]:
# NOTE: use this cell if you are running this code on Google Colab

# Set directory for the time blowout match group. Make sure the feature file is uploaded to this Colab session
path_to_features = "/content/dota2_time_blowout_features.csv"

In [None]:
# Read the data (model feature file)
feature_time_blowout_df = pd.read_csv(path_to_features)

### Exploration and preprocessing of the data

In [None]:
# Print feature names
feature_time_blowout_df.columns

In [None]:
# Drop first column (match id)
feature_time_blowout_df = feature_time_blowout_df.drop(['match_id'], axis=1)

In [None]:
# Check the types of the features
feature_time_blowout_df.dtypes

In [None]:
feature_time_blowout_df.head()


In [None]:
feature_time_blowout_df.describe()

In [None]:
feature_time_blowout_df.info()


### Model building, training and evaluation

In [None]:
# Import xgboost libraries
import xgboost as xgb
from xgboost import XGBClassifier

In [None]:
# Split data into features (X) and label (y).
# This assumes the label column (win_label) is the last column of the DataFrame.
X, y = feature_time_blowout_df.iloc[:, :-1], feature_time_blowout_df.iloc[:, -1]

In [None]:
features = [c for c in feature_time_blowout_df.columns if c != 'win_label']
target = 'win_label'

In [None]:
# Define the number of folds for the K-fold cross-validation
kfolds = KFold(n_splits=10, shuffle=True)
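
To make the splitting concrete, here is a small sketch of what `KFold.split` yields. It uses ten dummy samples and five folds instead of the notebook's data, and it fixes `random_state`, which is an assumption added here so that the shuffled folds are repeatable; the cell above does not set it.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_X = np.arange(10).reshape(-1, 1)  # ten dummy samples, one feature each
toy_kfolds = KFold(n_splits=5, shuffle=True, random_state=0)

held_out = []
for fold, (train_idx, test_idx) in enumerate(toy_kfolds.split(toy_X)):
    # Each iteration yields integer positions: 8 for training, 2 held out
    print(f"fold {fold}: train={train_idx}, test={test_idx}")
    held_out.extend(test_idx.tolist())

# Over all folds, every sample is held out exactly once
print(sorted(held_out))
```

Because the indices are positions rather than labels, they are used with `.iloc` when the data is a pandas object.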

In [None]:
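# Define the parameters for the training process
param = {
    'objective': 'binary:logistic',  # binary classification, outputs probabilities
    'eval_metric': 'auc',            # area under the ROC curve
    'colsample_bytree': 0.3,         # fraction of features sampled for each tree
    # Step size shrinkage ('eta' is an alias of this parameter, so set only one of them)
    'learning_rate': 0.1,
    'max_depth': 10,                 # maximum depth of each tree
    'alpha': 10                      # L1 regularization on the leaf weights
}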

num_round = 100
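
The `'auc'` metric named in `eval_metric` is the area under the ROC curve. One way to read it is as the probability that a randomly chosen positive sample (a win) receives a higher predicted score than a randomly chosen negative one, with ties counting as half. The sketch below computes that quantity by brute force over all positive/negative pairs. It is meant as an illustration of the definition, not as how XGBoost or scikit-learn compute it internally, and the labels and scores are made up.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(toy_labels, toy_scores))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly here, hence 0.75. A value of 0.5 corresponds to random ranking and 1.0 to a perfect one.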

In [None]:
# NOTE: the training process might take a while to execute

# Per-fold AUC scores (named auc_scores so that it does not shadow sklearn's auc function)
auc_scores = list()

for train_idx, test_idx in kfolds.split(X):
    X_train, y_train = X.iloc[train_idx], y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx], y.iloc[test_idx]

    # Balance the classes: ratio of negative to positive training samples
    param['scale_pos_weight'] = (y_train.size - y_train.sum()) / y_train.sum()

    xg_train = xgb.DMatrix(
        X_train.values, feature_names=features, label=y_train.values
    )
    xg_test = xgb.DMatrix(
        X_test.values, feature_names=features, label=y_test.values
    )

    # Evaluate on both sets during training (output is silenced)
    watchlist = [(xg_train, 'train'), (xg_test, 'test')]
    bst = xgb.train(param, xg_train, num_round, evals=watchlist, verbose_eval=False)

    # Predicted probability of the positive class for the held-out fold
    preds = bst.predict(xg_test)

    auc_scores.append(roc_auc_score(y_test, preds))


'Median AUC: {:.04f}'.format(st.median(auc_scores))

'Median AUC: 0.8343'
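
Two small pieces of arithmetic in the training loop are worth spelling out: `scale_pos_weight` is the ratio of negative to positive training labels, and the reported score is the median of the per-fold AUC values. A minimal sketch with made-up labels and fold scores (none of these numbers come from the notebook's run):

```python
import statistics as st

toy_y_train = [1, 0, 0, 0, 1, 0, 0, 0]  # hypothetical labels: 2 wins, 6 losses

n_pos = sum(toy_y_train)
n_neg = len(toy_y_train) - n_pos
toy_scale_pos_weight = n_neg / n_pos  # same formula as (size - sum) / sum
print(toy_scale_pos_weight)  # 3.0

toy_fold_aucs = [0.81, 0.84, 0.83, 0.86, 0.79]  # hypothetical per-fold AUCs
print('Median AUC: {:.04f}'.format(st.median(toy_fold_aucs)))  # Median AUC: 0.8300
```

A value above 1 gives the positive class more weight; the negative-to-positive ratio is the typical value suggested in the XGBoost parameter documentation for unbalanced classes. The median is less sensitive than the mean to a single unusually good or bad fold.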