## Gradient Boosting Machines (XGBoost)

Notebook that applies the XGBoost algorithm to predict victory in Dota 2 matches, with one model per match group (time blowout, score difference blowout, and regular matches).

-------------------------------------------------------------------------------------------------------------------------------

## Time blowout matches

Useful pandas calls for exploring the data before preprocessing and model fitting:

* df.columns : names of the columns (i.e., features)
* df.dtypes : data type of each column
* df.head() : first five rows
* df.info() : non-null counts and types per column
* df.describe() : summary statistics of the numeric columns

Preprocessing steps:

* Try two methods for handling missing data: automatic XGBoost handling and imputing

* Do we need to check for correlation between features? No (for XGBoost)

* Do we need to perform feature scaling? No (for XGBoost). For reference, min-max scaling would be: scaler = MinMaxScaler(feature_range=(0, 1)) followed by X = scaler.fit_transform(X)
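The two missing-data options from the first bullet can be sketched as follows. This is a minimal illustration on a tiny hypothetical frame: the column names are borrowed from the dataset, but the values are made up for the example, and median imputation is an assumption, since the notebook does not say which imputing strategy is intended.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: values are illustrative, not taken from the dataset
toy = pd.DataFrame({
    "obs_placed": [3.0, np.nan, 5.0, 1.0, np.nan, 4.0],
    "tower_damage": [817.0, 2367.0, np.nan, 898.0, 1643.0, 10.0],
})
train, test = toy.iloc[:4], toy.iloc[4:]

# Option 1: leave NaN in place; XGBoost treats NaN as missing by default
X_train_native, X_test_native = train, test

# Option 2: impute with per-column medians learned from the training rows only
medians = train.median()
X_train_imputed = train.fillna(medians)
X_test_imputed = test.fillna(medians)

print(medians.to_dict())
print(X_test_imputed)
```

Learning the medians from the training rows only keeps information from the test rows out of the preprocessing step.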

In [4]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve

In [7]:
# Directory for the time blowout group
cwd = os.getcwd()
root_directory = os.path.dirname(cwd)
print(root_directory)
time_blowout_data_dir = root_directory + "/model_features/time_blowout/"
print(time_blowout_data_dir)

C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models
C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models/model_features/time_blowout/
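The printed path above mixes backslashes and forward slashes because the directory is built by string concatenation. A possible alternative, shown here only as a sketch, is to build the same path with `pathlib`, which joins components with the platform's separator. The variable `csv_path` is introduced for illustration.

```python
from pathlib import Path

# Parent of the current working directory, as in the cell above
root_directory = Path.cwd().parent

# The / operator joins components with the platform's separator
time_blowout_data_dir = root_directory / "model_features" / "time_blowout"
csv_path = time_blowout_data_dir / "dota2_time_blowout_features.csv"

print(csv_path)
```

`pd.read_csv` accepts a `Path` object directly, so `csv_path` could be passed to it without converting to a string.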


### Exploration and preprocessing of the data

In [11]:
feature_time_blowout_df = pd.read_csv(time_blowout_data_dir + "dota2_time_blowout_features.csv")

In [None]:
# Print feature names
feature_time_blowout_df.columns

In [None]:
# Existing types
feature_time_blowout_df.dtypes

In [25]:
feature_time_blowout_df.head()


Unnamed: 0,match_id,barracks_status_radiant,tower_status_radiant,first_blood_time,radiant_first_pick,base_agility,base_strength,base_intelligence,agility_gain,strength_gain,...,assists,deaths,gold_per_min,experience_per_min,kills_per_min,last_hits_per_min,hero_damage_per_min,hero_healing_per_min,tower_damage,win_label
0,1000290559,6,11,195,1.0,21,15,16,2.3,1.5,...,5,0,358,420,0.194595,3.891892,0.0,0.0,,1
1,1000352316,6,4,119,1.0,23,15,18,3.4,1.3,...,5,6,216,232,0.04902,1.715686,0.0,0.0,,0
2,1003326282,6,9,168,0.0,23,15,22,3.0,1.6,...,10,3,368,428,0.141066,2.962382,0.0,0.0,,1
3,1005038978,6,10,193,1.0,23,14,18,3.4,1.5,...,9,1,320,395,0.201511,2.367758,0.0,0.0,,1
4,1005085064,6,11,271,0.0,25,15,19,3.0,2.5,...,2,4,203,211,0.0,1.488251,0.0,0.0,,0


In [26]:
# Drop the first column (match id)
feature_time_blowout_df = feature_time_blowout_df.drop(['match_id'], axis=1)

In [27]:
feature_time_blowout_df.head()

Unnamed: 0,barracks_status_radiant,tower_status_radiant,first_blood_time,radiant_first_pick,base_agility,base_strength,base_intelligence,agility_gain,strength_gain,intelligence_gain,...,assists,deaths,gold_per_min,experience_per_min,kills_per_min,last_hits_per_min,hero_damage_per_min,hero_healing_per_min,tower_damage,win_label
0,6,11,195,1.0,21,15,16,2.3,1.5,2.5,...,5,0,358,420,0.194595,3.891892,0.0,0.0,,1
1,6,4,119,1.0,23,15,18,3.4,1.3,2.5,...,5,6,216,232,0.04902,1.715686,0.0,0.0,,0
2,6,9,168,0.0,23,15,22,3.0,1.6,2.2,...,10,3,368,428,0.141066,2.962382,0.0,0.0,,1
3,6,10,193,1.0,23,14,18,3.4,1.5,2.5,...,9,1,320,395,0.201511,2.367758,0.0,0.0,,1
4,6,11,271,0.0,25,15,19,3.0,2.5,3.1,...,2,4,203,211,0.0,1.488251,0.0,0.0,,0


In [None]:
feature_time_blowout_df.describe()

In [194]:
feature_time_blowout_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5528 entries, 0 to 5527
Data columns (total 35 columns):
barracks_status_radiant    5528 non-null int64
tower_status_radiant       5528 non-null int64
first_blood_time           5528 non-null int64
radiant_first_pick         5465 non-null float64
base_agility               5528 non-null int64
base_strength              5528 non-null int64
base_intelligence          5528 non-null int64
agility_gain               5528 non-null float64
strength_gain              5528 non-null float64
intelligence_gain          5528 non-null float64
role_carry                 5528 non-null int64
role_support               5528 non-null int64
role_nuker                 5528 non-null int64
role_disabler              5528 non-null int64
role_jungler               5528 non-null int64
role_durable               5528 non-null int64
role_escape                5528 non-null int64
role_pusher                5528 non-null int64
role_initiator             5528 non-nul

Features with NA values that we need to handle:

- radiant_first_pick
- ability_uses               
- item_uses                 
- different_item_uses        
- obs_placed                 
- sen_placed                 
- actions_per_min  
- tower_damage

In [21]:
print(feature_time_blowout_df["radiant_first_pick"].isnull().sum())
print(feature_time_blowout_df["ability_uses"].isnull().sum())
print(feature_time_blowout_df["item_uses"].isnull().sum())
print(feature_time_blowout_df["different_item_uses"].isnull().sum())
print(feature_time_blowout_df["obs_placed"].isnull().sum())
print(feature_time_blowout_df["sen_placed"].isnull().sum())
print(feature_time_blowout_df["actions_per_min"].isnull().sum())
print(feature_time_blowout_df["tower_damage"].isnull().sum())

63
201
201
201
952
952
206
1816
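The per-column checks above can also be written as one expression that lists every column containing missing values, which avoids maintaining the column list by hand. The sketch below uses a small hypothetical frame in place of `feature_time_blowout_df`; its values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for feature_time_blowout_df
df = pd.DataFrame({
    "radiant_first_pick": [1.0, np.nan, 0.0, 1.0],
    "obs_placed": [np.nan, np.nan, 2.0, 4.0],
    "gold_per_min": [358, 216, 368, 320],
})

# Count NaN per column, then keep only the columns that have at least one
na_counts = df.isnull().sum()
na_counts = na_counts[na_counts > 0].sort_values(ascending=False)
print(na_counts)
```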


In [None]:
# Change type of 'radiant_first_pick' from float to int
# feature_time_blowout_df["radiant_first_pick"] = feature_time_blowout_df["radiant_first_pick"].astype("int")
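One likely reason the conversion above is left commented out is that `radiant_first_pick` contains missing values, and a plain integer column cannot hold NaN. The sketch below, on an illustrative series, shows the failure and pandas' nullable `Int64` dtype as one possible workaround. Whether downstream libraries accept the nullable dtype depends on their version, so keeping the column as float, as the notebook does, is the simpler choice.

```python
import numpy as np
import pandas as pd

# Illustrative values; the real column has 63 missing entries
first_pick = pd.Series([1.0, 0.0, np.nan, 1.0], name="radiant_first_pick")

# astype("int") raises because NaN has no integer representation
try:
    first_pick.astype("int")
    plain_int_failed = False
except (ValueError, TypeError):
    plain_int_failed = True

# The nullable extension dtype keeps the missing entry as <NA>
as_nullable = first_pick.astype("Int64")
print(plain_int_failed, as_nullable.dtype, as_nullable.isna().sum())
```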

### Model building, training and evaluation

In [207]:
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
import statistics as st

In [208]:
X, y = feature_time_blowout_df.iloc[:,:-1],feature_time_blowout_df.iloc[:,-1]

In [209]:
X.head()

Unnamed: 0,barracks_status_radiant,tower_status_radiant,first_blood_time,radiant_first_pick,base_agility,base_strength,base_intelligence,agility_gain,strength_gain,intelligence_gain,...,actions_per_min,assists,deaths,gold_per_min,experience_per_min,kills_per_min,last_hits_per_min,hero_damage_per_min,hero_healing_per_min,tower_damage
0,6,11,195,1.0,21,15,16,2.3,1.5,2.5,...,0.0,5,0,358,420,0.194595,3.891892,0.0,0.0,
1,6,4,119,1.0,23,15,18,3.4,1.3,2.5,...,0.0,5,6,216,232,0.04902,1.715686,0.0,0.0,
2,6,9,168,0.0,23,15,22,3.0,1.6,2.2,...,,10,3,368,428,0.141066,2.962382,0.0,0.0,
3,6,10,193,1.0,23,14,18,3.4,1.5,2.5,...,0.0,9,1,320,395,0.201511,2.367758,0.0,0.0,
4,6,11,271,0.0,25,15,19,3.0,2.5,3.1,...,,2,4,203,211,0.0,1.488251,0.0,0.0,


In [210]:
X.shape

(5528, 34)

In [211]:
y.head()

0    1
1    0
2    1
3    1
4    0
Name: win_label, dtype: int64

In [212]:
y.shape

(5528,)

In [213]:
data_dmatrix = xgb.DMatrix(data=X,label=y)

In [214]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [215]:
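# Note: the warning printed by fit() below indicates that early_stopping_rounds
# is not applied when passed this way in this XGBoost version
xg_reg = xgb.XGBClassifier(objective='binary:logistic', colsample_bytree=0.3, learning_rate=0.1,
                max_depth=5, alpha=10, n_estimators=100, early_stopping_rounds=10, eval_metric='auc')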

#### test and train - auc (ok)

In [201]:
xg_reg.fit(X_train,y_train)

Parameters: { early_stopping_rounds } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.3, early_stopping_rounds=10,
       eval_metric='auc', gamma=0, gpu_id=-1, importance_type='gain',
       interaction_constraints='', learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=nan,
       monotone_constraints='()', n_estimators=100, n_jobs=0,
       num_parallel_tree=1, objective='binary:logistic', random_state=0,
       reg_alpha=10, reg_lambda=1, scale_pos_weight=1, subsample=1,
       tree_method='exact', validate_parameters=1, verbosity=None)

In [202]:
preds = xg_reg.predict(X_test)

In [203]:
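# preds holds hard 0/1 labels, so this ROC curve has a single operating point;
# predict_proba(X_test)[:, 1] would give a threshold-free AUC
fpr, tpr, _ = roc_curve(y_test, preds, pos_label=1)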
roc_auc = auc(fpr, tpr)
roc_auc

0.9565543093780801
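The AUC above is computed from the hard class labels returned by `predict`, whereas the `roc_auc` scorer used later in the notebook works on continuous scores. The sketch below illustrates why the two are not directly comparable. It uses a hypothetical helper, `pairwise_auc`, that implements the pairwise-ranking definition of AUC in NumPy, and made-up labels and probabilities.

```python
import numpy as np

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Illustrative labels and probabilities, not taken from the notebook
y_true = [0, 0, 1, 1, 0, 1]
probs = [0.10, 0.40, 0.35, 0.80, 0.45, 0.90]
hard = [int(p >= 0.5) for p in probs]  # what predict() would return at threshold 0.5

auc_probs = pairwise_auc(y_true, probs)
auc_hard = pairwise_auc(y_true, hard)
print(auc_probs, auc_hard)
```

Thresholding collapses many scores into ties, so the label-based value reflects a single operating point and not the ranking quality of the model.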

In [204]:
from sklearn.metrics import precision_recall_fscore_support

stats = precision_recall_fscore_support(y_test, preds, average='weighted')
precision = stats[0]
recall = stats[1]
f_measure = stats[2]
print(precision)
print(recall)
print(f_measure)

0.9567745201751295
0.9564591181032572
0.956456054014763


### k-fold cv - 1

In [98]:
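from sklearn.model_selection import KFold, cross_val_predict

kf = KFold(n_splits=10, shuffle=True, random_state=123)
# Out-of-fold predictions for every row of the training split
cv_preds = cross_val_predict(xg_reg, X_train, y_train, cv=kf)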

In [99]:
len(cv_preds)

4422

In [None]:
# cv_preds has one entry per training row (4422), so compare against y_train, not y_test
fpr, tpr, _ = roc_curve(y_train, cv_preds, pos_label=1)
roc_auc = auc(fpr, tpr)
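`cross_val_predict` returns one out-of-fold prediction per row of the data it was given, so the result has to be scored against the labels of those same rows. The sketch below shows one way to obtain out-of-fold probabilities instead of hard labels. It assumes synthetic data and a logistic regression as a stand-in estimator, so that it runs without the notebook's data; neither is part of the original analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic stand-in data; the notebook would use X_train and y_train
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=123)

kf = KFold(n_splits=10, shuffle=True, random_state=123)
# Stand-in estimator; any classifier exposing predict_proba is used the same way
clf = LogisticRegression(max_iter=1000)

# One out-of-fold probability per row, aligned with y_demo
oof_probs = cross_val_predict(clf, X_demo, y_demo, cv=kf, method="predict_proba")[:, 1]
oof_auc = roc_auc_score(y_demo, oof_probs)
print(len(oof_probs), oof_auc)
```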

### k-fold cv - 2 (ok)

In [119]:
params = {"objective":"binary:logistic",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}

cv_results = xgb.cv(dtrain=data_dmatrix, params=params, nfold=10,
                    num_boost_round=50,early_stopping_rounds=10,metrics="auc", as_pandas=True, seed=123)

In [120]:
cv_results.head()

Unnamed: 0,train-auc-mean,train-auc-std,test-auc-mean,test-auc-std
0,0.965217,0.000583,0.964546,0.005191
1,0.994421,0.00505,0.994593,0.005806
2,0.997397,0.001623,0.997449,0.001936
3,0.998349,0.001102,0.997995,0.001998
4,0.998542,0.001161,0.998173,0.002183


In [121]:
print((cv_results["test-auc-mean"]).tail(1))

30    0.999751
Name: test-auc-mean, dtype: float64
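The frame returned by `xgb.cv` has one row per boosting round, so it can be inspected with ordinary pandas operations. The sketch below rebuilds only the five rows shown by `cv_results.head()` above and looks up the round with the highest mean test AUC and the train/test gap per round. The names `cv_head`, `best_round`, and `gap` are introduced for illustration.

```python
import pandas as pd

# First five rows of the cv_results table shown above
cv_head = pd.DataFrame({
    "train-auc-mean": [0.965217, 0.994421, 0.997397, 0.998349, 0.998542],
    "test-auc-mean": [0.964546, 0.994593, 0.997449, 0.997995, 0.998173],
    "test-auc-std": [0.005191, 0.005806, 0.001936, 0.001998, 0.002183],
})

# Round with the highest mean test AUC among the rows available here
best_round = cv_head["test-auc-mean"].idxmax()

# A widening train/test gap across rounds would hint at overfitting
gap = cv_head["train-auc-mean"] - cv_head["test-auc-mean"]
print(best_round, cv_head.loc[best_round, "test-auc-mean"])
print(gap.round(6).tolist())
```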


### k-fold cv - 3 (best)

In [216]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

# define evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(xg_reg, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print(len(scores))
print('Median ROC AUC: %.5f' % st.median(scores))

30
Median ROC AUC: 0.99997
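The 30 scores printed above come from 10 folds repeated 3 times. The sketch below, on a small placeholder array, checks that count and shows that the test folds of each repeat together cover every row exactly once.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
n_scores = cv.get_n_splits()  # n_splits * n_repeats

# Placeholder data: 20 rows, 2 columns; only the row count matters for splitting
X_demo = np.arange(40).reshape(20, 2)
test_folds = [tuple(test_idx) for _, test_idx in cv.split(X_demo)]

# Rows held out during the first repeat (first 10 folds)
first_repeat_rows = sorted(int(i) for fold in test_folds[:10] for i in fold)
print(n_scores, len(test_folds), first_repeat_rows == list(range(20)))
```

Because every repeat reshuffles the rows, the median over all 30 scores is less sensitive to one particular partition than a single 10-fold run.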


In [131]:
countclass = 0
for each in y:
    if each==0:
        countclass+=1
print(countclass)

2258
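The counting loop above has a vectorized pandas equivalent, which also makes it easy to see both classes at once. The sketch below uses only the five labels shown by `y.head()` earlier in this section. In the notebook's output, 2258 of the 5528 rows have `win_label` 0, which is about 41%.

```python
import pandas as pd

# First five labels shown by y.head() above
y_head = pd.Series([1, 0, 1, 1, 0], name="win_label")

n_losses = int((y_head == 0).sum())          # vectorized equivalent of the loop
counts = y_head.value_counts()               # both classes at once
share = y_head.value_counts(normalize=True)  # class proportions
print(n_losses, counts.to_dict(), share.to_dict())
```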


## Score difference blowout matches

Useful pandas calls for exploring the data before preprocessing and model fitting:

* df.columns : names of the columns (i.e., features)
* df.dtypes : data type of each column
* df.head() : first five rows
* df.info() : non-null counts and types per column
* df.describe() : summary statistics of the numeric columns

Preprocessing steps:

* Try two methods for handling missing data: automatic XGBoost handling and imputing

* Do we need to check for correlation between features? No (for XGBoost)

* Do we need to perform feature scaling? No (for XGBoost). For reference, min-max scaling would be: scaler = MinMaxScaler(feature_range=(0, 1)) followed by X = scaler.fit_transform(X)
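The reason scaling is unnecessary for tree-based models is that a split only depends on the ordering of a feature's values, which min-max scaling preserves. The sketch below illustrates this with a single scikit-learn decision tree as a stand-in for XGBoost's boosted trees, on synthetic data; it is an illustration of the idea and not part of the original analysis.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic integer-valued features and a label that depends on two of them
X_demo = rng.integers(0, 50, size=(300, 4)).astype(float)
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 50).astype(int)

# Min-max scaling to [0, 1], written out by hand
col_min, col_max = X_demo.min(axis=0), X_demo.max(axis=0)
X_scaled = (X_demo - col_min) / (col_max - col_min)

# Same tree settings and seed, fitted once on raw and once on scaled features
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y_demo)

same = np.array_equal(tree_raw.predict(X_demo), tree_scaled.predict(X_scaled))
print(same)
```

Both trees partition the rows identically; only the numeric thresholds stored in the nodes differ.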

In [217]:
# Directory for the score blowout group
cwd = os.getcwd()
root_directory = os.path.dirname(cwd)
print(root_directory)
score_blowout_data_dir = root_directory + "/model_features/score_blowout/"
print(score_blowout_data_dir)

C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models
C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models/model_features/score_blowout/


### Exploration and preprocessing of the data

In [218]:
feature_score_blowout_df = pd.read_csv(score_blowout_data_dir + "dota2_score_blowout_features.csv")

In [None]:
# Print feature names
feature_score_blowout_df.columns

In [None]:
# Existing types
feature_score_blowout_df.dtypes

In [147]:
feature_score_blowout_df.head()

Unnamed: 0,match_id,barracks_status_radiant,tower_status_radiant,first_blood_time,radiant_first_pick,base_agility,base_strength,base_intelligence,agility_gain,strength_gain,...,assists,deaths,gold_per_min,experience_per_min,kills_per_min,last_hits_per_min,hero_damage_per_min,hero_healing_per_min,tower_damage,win_label
0,2249183244,6,7,175,1.0,23,22,18,2.6,2.1,...,21,5,420,476,0.117111,4.098894,169.538061,0.0,817,1
1,2249648072,6,7,263,0.0,19,19,19,2.3,3.0,...,16,3,490,521,0.230137,3.353425,337.841096,14.991781,2367,1
2,2249657217,0,0,0,0.0,20,23,18,2.3,1.8,...,2,7,238,251,0.04158,1.829522,194.636175,0.0,10,0
3,2249657602,6,11,107,0.0,19,15,22,2.4,1.7,...,11,2,533,553,0.28968,5.141823,308.038624,1.629451,898,1
4,2249734660,6,11,77,0.0,18,20,17,2.4,2.5,...,10,1,784,396,0.531646,3.493671,680.050633,19.518987,1643,1


In [223]:
# Drop the first column (match id)
feature_score_blowout_df = feature_score_blowout_df.drop(['match_id'], axis=1)

In [224]:
feature_score_blowout_df.head()

Unnamed: 0,barracks_status_radiant,tower_status_radiant,first_blood_time,radiant_first_pick,base_agility,base_strength,base_intelligence,agility_gain,strength_gain,intelligence_gain,...,assists,deaths,gold_per_min,experience_per_min,kills_per_min,last_hits_per_min,hero_damage_per_min,hero_healing_per_min,tower_damage,win_label
0,6,7,175,1.0,23,22,18,2.6,2.1,1.9,...,21,5,420,476,0.117111,4.098894,169.538061,0.0,817,1
1,6,7,263,0.0,19,19,19,2.3,3.0,2.1,...,16,3,490,521,0.230137,3.353425,337.841096,14.991781,2367,1
2,0,0,0,0.0,20,23,18,2.3,1.8,3.3,...,2,7,238,251,0.04158,1.829522,194.636175,0.0,10,0
3,6,11,107,0.0,19,15,22,2.4,1.7,3.3,...,11,2,533,553,0.28968,5.141823,308.038624,1.629451,898,1
4,6,11,77,0.0,18,20,17,2.4,2.5,1.8,...,10,1,784,396,0.531646,3.493671,680.050633,19.518987,1643,1


In [None]:
feature_score_blowout_df.describe()

In [151]:
feature_score_blowout_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5528 entries, 0 to 5527
Data columns (total 35 columns):
barracks_status_radiant    5528 non-null int64
tower_status_radiant       5528 non-null int64
first_blood_time           5528 non-null int64
radiant_first_pick         5517 non-null float64
base_agility               5528 non-null int64
base_strength              5528 non-null int64
base_intelligence          5528 non-null int64
agility_gain               5528 non-null float64
strength_gain              5528 non-null float64
intelligence_gain          5528 non-null float64
role_carry                 5528 non-null int64
role_support               5528 non-null int64
role_nuker                 5528 non-null int64
role_disabler              5528 non-null int64
role_jungler               5528 non-null int64
role_durable               5528 non-null int64
role_escape                5528 non-null int64
role_pusher                5528 non-null int64
role_initiator             5528 non-nul

Features with NA values that we need to handle:

- radiant_first_pick
- ability_uses               
- item_uses                 
- different_item_uses        
- obs_placed                 
- sen_placed                 
- actions_per_min  

In [153]:
print(feature_score_blowout_df["radiant_first_pick"].isnull().sum())
print(feature_score_blowout_df["ability_uses"].isnull().sum())
print(feature_score_blowout_df["item_uses"].isnull().sum())
print(feature_score_blowout_df["different_item_uses"].isnull().sum())
print(feature_score_blowout_df["obs_placed"].isnull().sum())
print(feature_score_blowout_df["sen_placed"].isnull().sum())
print(feature_score_blowout_df["actions_per_min"].isnull().sum())

11
156
156
156
561
561
156


### Model building, training and evaluation

In [225]:
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
import statistics as st

In [226]:
X, y = feature_score_blowout_df.iloc[:,:-1],feature_score_blowout_df.iloc[:,-1]

In [227]:
X.head()

Unnamed: 0,barracks_status_radiant,tower_status_radiant,first_blood_time,radiant_first_pick,base_agility,base_strength,base_intelligence,agility_gain,strength_gain,intelligence_gain,...,actions_per_min,assists,deaths,gold_per_min,experience_per_min,kills_per_min,last_hits_per_min,hero_damage_per_min,hero_healing_per_min,tower_damage
0,6,7,175,1.0,23,22,18,2.6,2.1,1.9,...,158.0,21,5,420,476,0.117111,4.098894,169.538061,0.0,817
1,6,7,263,0.0,19,19,19,2.3,3.0,2.1,...,161.0,16,3,490,521,0.230137,3.353425,337.841096,14.991781,2367
2,0,0,0,0.0,20,23,18,2.3,1.8,3.3,...,124.0,2,7,238,251,0.04158,1.829522,194.636175,0.0,10
3,6,11,107,0.0,19,15,22,2.4,1.7,3.3,...,158.0,11,2,533,553,0.28968,5.141823,308.038624,1.629451,898
4,6,11,77,0.0,18,20,17,2.4,2.5,1.8,...,148.0,10,1,784,396,0.531646,3.493671,680.050633,19.518987,1643


In [228]:
X.shape

(5528, 34)

In [229]:
y.head()

0    1
1    1
2    0
3    1
4    1
Name: win_label, dtype: int64

In [230]:
y.shape

(5528,)

In [231]:
data_dmatrix = xgb.DMatrix(data=X,label=y)

In [232]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [233]:
xg_reg = xgb.XGBClassifier(objective ='binary:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 100, early_stopping_rounds=10, eval_metric='auc')

In [234]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

# define evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(xg_reg, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print(len(scores))
print('Median ROC AUC: %.5f' % st.median(scores))

30
Median ROC AUC: 0.99990


In [166]:
countclass = 0
for each in y:
    if each==1:
        countclass+=1
print(countclass)

2720


## Regular matches

Useful pandas calls for exploring the data before preprocessing and model fitting:

* df.columns : names of the columns (i.e., features)
* df.dtypes : data type of each column
* df.head() : first five rows
* df.info() : non-null counts and types per column
* df.describe() : summary statistics of the numeric columns

Preprocessing steps:

* Try two methods for handling missing data: automatic XGBoost handling and imputing

* Do we need to check for correlation between features? No (for XGBoost)

* Do we need to perform feature scaling? No (for XGBoost). For reference, min-max scaling would be: scaler = MinMaxScaler(feature_range=(0, 1)) followed by X = scaler.fit_transform(X)

In [235]:
# Directory for the regular group
cwd = os.getcwd()
root_directory = os.path.dirname(cwd)
print(root_directory)
regular_data_dir = root_directory + "/model_features/regular/"
print(regular_data_dir)

C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models
C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models/model_features/regular/


### Exploration and preprocessing of the data

In [236]:
feature_regular_df = pd.read_csv(regular_data_dir + "dota2_regular_features.csv")

In [None]:
# Print feature names
feature_regular_df.columns

In [None]:
# Existing types
feature_regular_df.dtypes

In [None]:
feature_regular_df.head()

In [237]:
# Drop the first column (match id)
feature_regular_df = feature_regular_df.drop(['match_id'], axis=1)

In [238]:
feature_regular_df.head()

Unnamed: 0,barracks_status_radiant,tower_status_radiant,first_blood_time,radiant_first_pick,base_agility,base_strength,base_intelligence,agility_gain,strength_gain,intelligence_gain,...,assists,deaths,gold_per_min,experience_per_min,kills_per_min,last_hits_per_min,hero_damage_per_min,hero_healing_per_min,tower_damage,win_label
0,0,0,74,1.0,21,15,21,2.4,2.0,1.8,...,16,11,438,637,0.32,1.408,0.0,0.0,,0
1,4,3,165,0.0,21,20,23,2.0,1.9,2.7,...,6,12,354,386,0.04904,2.304863,0.0,0.0,,0
2,6,8,64,0.0,19,18,17,2.4,2.2,2.2,...,10,3,604,718,0.177449,6.632163,0.0,0.0,,1
3,6,8,222,0.0,21,21,15,2.5,2.5,1.5,...,11,2,404,396,0.188383,3.045526,0.0,0.0,,1
4,2,2,134,1.0,23,16,17,2.6,1.9,2.5,...,6,6,267,307,0.065862,2.206367,0.0,0.0,,0


In [None]:
feature_regular_df.describe()

In [None]:
feature_regular_df.info()


Features with NA values that we need to handle:

- radiant_first_pick
- ability_uses               
- item_uses                 
- different_item_uses        
- obs_placed                 
- sen_placed                 
- actions_per_min
- tower_damage

In [None]:
print(feature_regular_df["radiant_first_pick"].isnull().sum())
print(feature_regular_df["ability_uses"].isnull().sum())
print(feature_regular_df["item_uses"].isnull().sum())
print(feature_regular_df["different_item_uses"].isnull().sum())
print(feature_regular_df["obs_placed"].isnull().sum())
print(feature_regular_df["sen_placed"].isnull().sum())
print(feature_regular_df["actions_per_min"].isnull().sum())
print(feature_regular_df["tower_damage"].isnull().sum())

### Model building, training and evaluation

In [239]:
import xgboost as xgb
from xgboost import XGBClassifier
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
import statistics as st

In [240]:
X, y = feature_regular_df.iloc[:,:-1],feature_regular_df.iloc[:,-1]

In [241]:
X.head()

Unnamed: 0,barracks_status_radiant,tower_status_radiant,first_blood_time,radiant_first_pick,base_agility,base_strength,base_intelligence,agility_gain,strength_gain,intelligence_gain,...,actions_per_min,assists,deaths,gold_per_min,experience_per_min,kills_per_min,last_hits_per_min,hero_damage_per_min,hero_healing_per_min,tower_damage
0,0,0,74,1.0,21,15,21,2.4,2.0,1.8,...,0.0,16,11,438,637,0.32,1.408,0.0,0.0,
1,4,3,165,0.0,21,20,23,2.0,1.9,2.7,...,0.0,6,12,354,386,0.04904,2.304863,0.0,0.0,
2,6,8,64,0.0,19,18,17,2.4,2.2,2.2,...,0.0,10,3,604,718,0.177449,6.632163,0.0,0.0,
3,6,8,222,0.0,21,21,15,2.5,2.5,1.5,...,0.0,11,2,404,396,0.188383,3.045526,0.0,0.0,
4,2,2,134,1.0,23,16,17,2.6,1.9,2.5,...,0.0,6,6,267,307,0.065862,2.206367,0.0,0.0,


In [242]:
X.shape

(45130, 34)

In [243]:
y.head()

0    0
1    0
2    1
3    1
4    0
Name: win_label, dtype: int64

In [244]:
y.shape

(45130,)

In [192]:
countclass = 0
for each in y:
    if each==0:
        countclass+=1
print(countclass)

22430


In [245]:
data_dmatrix = xgb.DMatrix(data=X,label=y)

In [246]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

In [247]:
xg_reg = xgb.XGBClassifier(objective ='binary:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
                max_depth = 5, alpha = 10, n_estimators = 100, early_stopping_rounds=10, eval_metric='auc')

In [248]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold

# define evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(xg_reg, X, y, scoring='roc_auc', cv=cv, n_jobs=-1)
# summarize performance
print(len(scores))
print('Median ROC AUC: %.5f' % st.median(scores))

30
Median ROC AUC: 0.99364


## TESTS

In [251]:
xg_reg.fit(X_train, y_train)

Parameters: { early_stopping_rounds } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




XGBClassifier(alpha=10, base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bynode=1, colsample_bytree=0.3, early_stopping_rounds=10,
       eval_metric='auc', gamma=0, gpu_id=-1, importance_type='gain',
       interaction_constraints='', learning_rate=0.1, max_delta_step=0,
       max_depth=5, min_child_weight=1, missing=nan,
       monotone_constraints='()', n_estimators=100, n_jobs=0,
       num_parallel_tree=1, objective='binary:logistic', random_state=0,
       reg_alpha=10, reg_lambda=1, scale_pos_weight=1, subsample=1,
       tree_method='exact', validate_parameters=1, verbosity=None)

In [254]:
xg_reg_probs = xg_reg.predict_proba(X_test)[:, 1]

In [255]:
from sklearn.metrics import roc_auc_score

# Calculate roc auc
roc_value = roc_auc_score(y_test, xg_reg_probs)


In [256]:
roc_value

0.9918359572882528