## Logistic Regression

This notebook applies logistic regression to predict the winning team in Dota 2 matches.

-------------------------------------------------------------------------------------------------------------------------------

Useful pandas calls for exploring the data before preprocessing it and feeding it to the algorithm:

* df.columns : the names of the columns (i.e., features)
* df.dtypes : the data type of each column
* df.head() : the first rows of the data
* df.info() : non-null counts and types per column
* df.describe() : summary statistics of the numeric columns
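
The calls listed above can be tried on a small frame before running them on the real data. The sketch below is only an illustration: the frame `toy_df` and its values are made up, only the column names echo this notebook, and `isna().sum()` is an extra call not in the list. Note that the attribute is spelled `dtypes`.

```python
import pandas as pd

# Hypothetical toy frame: column names mimic the notebook, values are invented.
toy_df = pd.DataFrame({
    "role_carry_r": [1, 0, 1, 1],
    "winR_md_r": [0.52, 0.48, None, 0.61],
    "win_label": [1, 0, 1, 0],
})

column_names = list(toy_df.columns)        # feature names
column_types = toy_df.dtypes               # one dtype per column
first_rows = toy_df.head(2)                # first two rows
missing_per_column = toy_df.isna().sum()   # number of missing values per column
summary = toy_df.describe()                # count, mean, std, min, quartiles, max

print(column_names)
print(column_types)
print(missing_per_column)
toy_df.info()                              # prints non-null counts and dtypes
```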

Preprocessing steps:

* Handle missing data: 'imputing' a per-column statistic such as the mean (the code below fills with the median)

* Do we need to check for correlation between features? YES

* Do we need to perform feature scaling? YES, for example with scaler = MinMaxScaler(feature_range=(0, 1)) followed by X = scaler.fit_transform(X)
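
The correlation check and the scaling step from the list above could look roughly like the following. This is a sketch on made-up data, not the procedure used on the Dota 2 features: the frame `toy`, its columns, and the 0.95 cut-off are all assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Made-up data: "b" is almost a multiple of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.01, size=200),
    "c": rng.normal(size=200),
})

# Absolute Pearson correlations; keep the upper triangle so each pair appears once.
corr = toy.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
cutoff = 0.95  # arbitrary illustrative cut-off
highly_correlated = [col for col in upper.columns if (upper[col] > cutoff).any()]
print(highly_correlated)

# Rescale every column to the [0, 1] range.
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(toy)
```

Columns flagged this way are candidates for removal, since logistic regression coefficients become hard to interpret when two features carry nearly the same information.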

In [2]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score, roc_curve
from sklearn import metrics
import statistics as st
from sklearn import preprocessing

In [3]:
# Directory for the time blowout group
cwd = os.getcwd()
root_directory = os.path.dirname(cwd)
print(root_directory)
time_blowout_data_dir = root_directory + "/model_features_pre-match/score_blowout/"
print(time_blowout_data_dir)

C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models
C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models/model_features_pre-match/score_blowout/


### Exploration and preprocessing of the data

In [4]:
feature_time_blowout_df = pd.read_csv(time_blowout_data_dir + "dota2_score_blowout_features-used_features.csv")

In [5]:
# Drop the first column (match id)
feature_time_blowout_df = feature_time_blowout_df.drop(['match_id'], axis=1)

# Print feature names
feature_time_blowout_df.columns

Index(['role_carry_r', 'role_support_r', 'role_nuker_r', 'role_disabler_r',
       'role_jungler_r', 'role_durable_r', 'role_escape_r', 'role_pusher_r',
       'role_initiator_r', 'role_carry_d', 'role_support_d', 'role_nuker_d',
       'role_disabler_d', 'role_jungler_d', 'role_durable_d', 'role_escape_d',
       'role_pusher_d', 'role_initiator_d', 'bstr_md_r', 'bagi_md_r',
       'bint_md_r', 'strg_md_r', 'agig_md_r', 'intg_md_r', 'bhealth_md_r',
       'bhealth_reg_md_r', 'mspeed_md_r', 'bstr_md_d', 'bagi_md_d',
       'bint_md_d', 'strg_md_d', 'agig_md_d', 'intg_md_d', 'bhealth_md_d',
       'bhealth_reg_md_d', 'mspeed_md_d', 'winR_md_r', 'winR_md_d',
       'winR_plr_md_r', 'winR_plr_md_d', 'winR_hp_md_r', 'xpm_hp_md_r',
       'goldm_hp_md_r', 'deathsm_hp_md_r', 'killsm_hp_md_r',
       'assistsm_hp_md_r', 'damagem_hp_md_r', 'healingm_hp_md_r',
       'winR_hp_md_d', 'xpm_hp_md_d', 'goldm_hp_md_d', 'deathsm_hp_md_d',
       'killsm_hp_md_d', 'assistsm_hp_md_d', 'damagem_hp_md_d'

In [6]:
feature_time_blowout_df.head()

| | role_carry_r | role_support_r | role_nuker_r | role_disabler_r | role_jungler_r | role_durable_r | role_escape_r | role_pusher_r | role_initiator_r | role_carry_d | ... | winR_hp_md_d | xpm_hp_md_d | goldm_hp_md_d | deathsm_hp_md_d | killsm_hp_md_d | assistsm_hp_md_d | damagem_hp_md_d | healingm_hp_md_d | rad_first_pick | win_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | ... | 0.0 | 462.0 | 393.0 | 0.12605 | 0.084034 | 0.105042 | 0.0 | 0.0 | 1.0 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0.333333 | 281.666667 | 264.833333 | 0.138787 | 0.046137 | 0.230105 | 0.0 | 0.0 | 1.0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 0.666667 | 452.666667 | 408.666667 | 0.115375 | 0.228942 | 0.260469 | 0.0 | 0.0 | 0.0 | 1 |
| 3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1.0 | 365.666667 | 276.833333 | 0.131327 | 0.094068 | 0.211092 | 0.0 | 0.0 | 1.0 | 1 |
| 4 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ... | 0.746667 | 433.853333 | 388.16 | 0.091623 | 0.193855 | 0.282416 | 0.0 | 0.0 | 0.0 | 0 |


In [7]:
feature_time_blowout_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5528 entries, 0 to 5527
Data columns (total 58 columns):
role_carry_r        5528 non-null int64
role_support_r      5528 non-null int64
role_nuker_r        5528 non-null int64
role_disabler_r     5528 non-null int64
role_jungler_r      5528 non-null int64
role_durable_r      5528 non-null int64
role_escape_r       5528 non-null int64
role_pusher_r       5528 non-null int64
role_initiator_r    5528 non-null int64
role_carry_d        5528 non-null int64
role_support_d      5528 non-null int64
role_nuker_d        5528 non-null int64
role_disabler_d     5528 non-null int64
role_jungler_d      5528 non-null int64
role_durable_d      5528 non-null int64
role_escape_d       5528 non-null int64
role_pusher_d       5528 non-null int64
role_initiator_d    5528 non-null int64
bstr_md_r           5528 non-null int64
bagi_md_r           5528 non-null int64
bint_md_r           5528 non-null int64
strg_md_r           5528 non-null float64
agig_md_r  

In [8]:
# Drop the deathsm_hp_md_r column
feature_time_blowout_df = feature_time_blowout_df.drop(['deathsm_hp_md_r'], axis=1)
feature_time_blowout_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5528 entries, 0 to 5527
Data columns (total 57 columns):
role_carry_r        5528 non-null int64
role_support_r      5528 non-null int64
role_nuker_r        5528 non-null int64
role_disabler_r     5528 non-null int64
role_jungler_r      5528 non-null int64
role_durable_r      5528 non-null int64
role_escape_r       5528 non-null int64
role_pusher_r       5528 non-null int64
role_initiator_r    5528 non-null int64
role_carry_d        5528 non-null int64
role_support_d      5528 non-null int64
role_nuker_d        5528 non-null int64
role_disabler_d     5528 non-null int64
role_jungler_d      5528 non-null int64
role_durable_d      5528 non-null int64
role_escape_d       5528 non-null int64
role_pusher_d       5528 non-null int64
role_initiator_d    5528 non-null int64
bstr_md_r           5528 non-null int64
bagi_md_r           5528 non-null int64
bint_md_r           5528 non-null int64
strg_md_r           5528 non-null float64
agig_md_r  

In [None]:
feature_time_blowout_df = feature_time_blowout_df.fillna(feature_time_blowout_df.median())
feature_time_blowout_df['rad_first_pick'] = feature_time_blowout_df['rad_first_pick'].astype(int)
feature_time_blowout_df.info()
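
To make the effect of the median fill concrete, here is a small sketch on made-up values. The frame `toy` is hypothetical and only borrows two column names from this notebook. It shows that `fillna` with `median()` replaces each NaN by the median of its own column, after which the 0/1 flag can be cast to an integer.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names echo the notebook.
toy = pd.DataFrame({
    "winR_md_r": [0.2, np.nan, 0.6, 1.0],
    "rad_first_pick": [1.0, 0.0, np.nan, 1.0],
})

# Each NaN is replaced by the median of its own column.
filled = toy.fillna(toy.median())

# With no NaN left, the 0/1 flag can be stored as an integer.
filled["rad_first_pick"] = filled["rad_first_pick"].astype(int)
print(filled)
```

The cast has to come after the fill, because `astype(int)` raises an error on a column that still contains NaN.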

### Model building, training and evaluation

In [61]:
from sklearn.linear_model import LogisticRegression

In [62]:
# Features are every column except the last one; the target (win_label) is the last column
X, y = feature_time_blowout_df.iloc[:, :-1], feature_time_blowout_df.iloc[:, -1]

In [63]:
features = [c for c in feature_time_blowout_df.columns if c != 'win_label']
target = 'win_label'

In [64]:
kfolds = KFold(n_splits=10, shuffle=True)
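
A KFold splitter like this one can also be handed to scikit-learn's `cross_val_score` together with a `Pipeline`, which re-fits the scaler on the training part of every fold. The sketch below is an alternative formulation, not the code this notebook runs: it uses synthetic data from `make_classification`, scales all columns rather than a chosen subset, and the names `X_demo`, `y_demo`, `demo_folds`, `demo_model` as well as `random_state=0` and `max_iter=1000` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the match features (not the Dota 2 data).
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# random_state makes the shuffled folds reproducible; the value 0 is arbitrary.
demo_folds = KFold(n_splits=10, shuffle=True, random_state=0)

# The scaler is re-fit on the training part of each fold,
# so no statistics from the test fold leak into the model.
demo_model = make_pipeline(MinMaxScaler(), LogisticRegression(max_iter=1000))

fold_auc = cross_val_score(demo_model, X_demo, y_demo, cv=demo_folds, scoring="roc_auc")
print(len(fold_auc), fold_auc.mean())
```

Without a fixed `random_state`, a shuffled KFold produces different folds on every run, so the reported AUC will vary slightly between runs.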

In [65]:
# instantiate the model (using the default parameters)
logreg = LogisticRegression()

columns_to_scale = list(range(18,55))

In [66]:
cnf = list()
auc = list()
thres = 0.5  # probability cut-off for predicting a win

mm_scaler = preprocessing.MinMaxScaler()

for train_idx, test_idx in kfolds.split(X):
    # Copy the fold slices so the in-place scaling below does not raise SettingWithCopyWarning
    X_train, y_train = X.iloc[train_idx].copy(), y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx].copy(), y.iloc[test_idx]

    # Fit the scaler on the training fold only, then reuse it on the test fold
    X_train.iloc[:, columns_to_scale] = mm_scaler.fit_transform(X_train.iloc[:, columns_to_scale])
    logreg.fit(X_train, y_train)

    X_test.iloc[:, columns_to_scale] = mm_scaler.transform(X_test.iloc[:, columns_to_scale])
    y_pred_proba = logreg.predict_proba(X_test)[:, 1]  # probability of class 1
    auc.append(metrics.roc_auc_score(y_test, y_pred_proba))

    cnf.append(confusion_matrix(y_test, (y_pred_proba > thres).astype(int)))

# Sum the per-fold confusion matrices into a single matrix
cnf = sum(cnf)

'Median AUC: {:.04f}'.format(st.median(auc))

# auc = sum(auc) / len(auc)
# 'Average AUC: {:.04f}'.format(auc)


'Median AUC: 0.7343'
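
The loop also accumulates a summed confusion matrix in cnf, from which threshold-dependent metrics can be derived. The sketch below shows the arithmetic on made-up counts: `demo_cnf` is hypothetical and does not contain this notebook's results. It assumes scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
import numpy as np

# Made-up counts in scikit-learn's layout: rows = true label, columns = predicted label.
demo_cnf = np.array([
    [40, 10],   # true 0: TN, FP
    [15, 35],   # true 1: FN, TP
])

tn, fp, fn, tp = demo_cnf.ravel()

accuracy = (tp + tn) / demo_cnf.sum()   # share of all matches classified correctly
precision = tp / (tp + fp)              # share of predicted wins that were real wins
recall = tp / (tp + fn)                 # share of real wins that were predicted

print(accuracy, precision, recall)
```

Unlike the AUC, these three values depend on the chosen threshold (thres = 0.5 in the loop above).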