## Logistic Regression

A notebook that implements logistic regression to predict victory in Dota 2 matches.

-------------------------------------------------------------------------------------------------------------------------------

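Useful pandas calls for exploring the data and planning the preprocessing steps before feeding the data into the algorithm:

* df.columns : to see the names of the columns (i.e., features)
* df.dtypes : to see the data type of each column
* df.head() : to preview the first rows
* df.info() : to see the index range, column count, dtypes, and memory usage
* df.describe() : to see summary statistics of the numeric columns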
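
As a minimal sketch of these calls, the snippet below runs them on a tiny made-up DataFrame. The column names and values are illustrative assumptions, not the notebook's data, and the missing-value count at the end is an extra check that is not in the list above.

```python
import io

import numpy as np
import pandas as pd

# Toy frame; names and values are hypothetical, chosen only for illustration
df = pd.DataFrame({
    "rad_hero_1": [0, 1, 0, 0],
    "gold_diff": [0.5, np.nan, 1.5, 2.0],
    "win_label": [1, 0, 1, 1],
})

print(df.columns.tolist())   # feature names
print(df.dtypes)             # data type of each column
print(df.head(2))            # first two rows

# info() prints its report and returns None, so capture it in a buffer
buf = io.StringIO()
df.info(buf=buf)
info_text = buf.getvalue()
print(info_text)

summary = df.describe()      # count, mean, std, min, quartiles, max
print(summary)

# Extra check (not in the list above): missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)
```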

Preprocessing steps:

* Handle missing data: 'imputing' the mean or the median of the column (this notebook uses the median)

* Do we need to check for correlation between features? YES

* Do we need to perform feature scaling? YES (scaler = MinMaxScaler(feature_range=(0, 1)); X = scaler.fit_transform(X))
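
The three preprocessing steps can be sketched on a small made-up feature table as follows. The column names, the values, and the 0.9 correlation cutoff are assumptions chosen for illustration. Mean imputation is shown here to match the note above.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy features; "b" is almost a multiple of "a"
X = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [2.0, 4.1, 6.0, 8.2],
    "c": [10.0, 0.0, 5.0, 7.0],
})

# 1) Impute missing values with the column mean
X = X.fillna(X.mean())

# 2) Flag feature pairs whose absolute Pearson correlation exceeds a cutoff
corr = X.corr().abs()
threshold = 0.9  # assumed cutoff, tune for the data at hand
high_pairs = [
    (corr.index[i], corr.columns[j])
    for i in range(len(corr))
    for j in range(i + 1, len(corr))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)

# 3) Rescale every feature to the [0, 1] range
scaler = MinMaxScaler(feature_range=(0, 1))
X_scaled = scaler.fit_transform(X)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

When the model is evaluated on held-out data, the scaler should be fitted on the training portion only and then applied to the test portion, which is what the cross-validation loop later in this notebook does.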

In [1]:
import os
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import confusion_matrix, precision_score, recall_score, roc_auc_score, roc_curve
from sklearn import metrics
import statistics as st
from sklearn import preprocessing

In [2]:
# Directory for the time blowout group
cwd = os.getcwd()
root_directory = os.path.dirname(cwd)
print(root_directory)
time_blowout_data_dir = root_directory + "/model_features_pre-match_newdata/time_blowout/"
print(time_blowout_data_dir)

C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models
C:\Users\markos-ece\Desktop\Viggiato\PhD - UofA\Research\2-Dota2\git-repo-code\data-analysis\prediction-models/model_features_pre-match_newdata/time_blowout/


### Exploration and preprocessing of the data

In [3]:
feature_time_blowout_df = pd.read_csv(time_blowout_data_dir + "dota2_time_blowout_features.csv")

In [4]:
# Drop the first column (match id)
feature_time_blowout_df = feature_time_blowout_df.drop(['match_id'], axis=1)

# Print feature names
feature_time_blowout_df.columns

Index(['rad_hero_1', 'rad_hero_2', 'rad_hero_3', 'rad_hero_4', 'rad_hero_5',
       'rad_hero_6', 'rad_hero_7', 'rad_hero_8', 'rad_hero_9', 'rad_hero_10',
       ...
       'hero_damagem_hp_hero3_d', 'hero_damagem_hp_hero4_d',
       'hero_damagem_hp_hero5_d', 'healingm_hp_hero1_d', 'healingm_hp_hero2_d',
       'healingm_hp_hero3_d', 'healingm_hp_hero4_d', 'healingm_hp_hero5_d',
       'rad_first_pick', 'win_label'],
      dtype='object', length=458)

In [5]:
feature_time_blowout_df.head()

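|   | rad_hero_1 | rad_hero_2 | rad_hero_3 | rad_hero_4 | rad_hero_5 | rad_hero_6 | rad_hero_7 | rad_hero_8 | rad_hero_9 | rad_hero_10 | ... | hero_damagem_hp_hero3_d | hero_damagem_hp_hero4_d | hero_damagem_hp_hero5_d | healingm_hp_hero1_d | healingm_hp_hero2_d | healingm_hp_hero3_d | healingm_hp_hero4_d | healingm_hp_hero5_d | rad_first_pick | win_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | 0.0 | NaN | 0.0 | NaN | NaN | 0.0 | NaN | 1.0 | 1 |
| 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | NaN | 0.0 | 0.0 | NaN | 0.0 | NaN | 0.0 | 1.0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | NaN | 0.0 | 0.0 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | NaN | 0.0 | NaN | 0.0 | 0.0 | NaN | 1.0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0 |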


In [6]:
feature_time_blowout_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5528 entries, 0 to 5527
Columns: 458 entries, rad_hero_1 to win_label
dtypes: float64(161), int64(297)
memory usage: 19.3 MB


In [7]:
feature_time_blowout_df = feature_time_blowout_df.fillna(feature_time_blowout_df.median())
feature_time_blowout_df['rad_first_pick'] = feature_time_blowout_df['rad_first_pick'].astype(int)
feature_time_blowout_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5528 entries, 0 to 5527
Columns: 458 entries, rad_hero_1 to win_label
dtypes: float64(160), int32(1), int64(297)
memory usage: 19.3 MB


### Model building, training and evaluation

In [8]:
from sklearn.linear_model import LogisticRegression

In [9]:
X, y = feature_time_blowout_df.iloc[:,:-1],feature_time_blowout_df.iloc[:,-1]

In [10]:
features = [c for c in feature_time_blowout_df.columns if c != 'win_label']
target = 'win_label'

In [11]:
kfolds = KFold(n_splits=10, shuffle=True)

In [12]:
# instantiate the model (using the default parameters)
logreg = LogisticRegression()

columns_to_scale = list(range(238,456))

In [13]:
cnf = list()
auc = list()
thres = 0.5

mm_scaler = preprocessing.MinMaxScaler()

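for train_idx, test_idx in kfolds.split(X):
    # Copy each fold so the scaled values are written to independent frames
    # (this avoids pandas' SettingWithCopyWarning)
    X_train, y_train = X.iloc[train_idx].copy(), y.iloc[train_idx]
    X_test, y_test = X.iloc[test_idx].copy(), y.iloc[test_idx]

    # Fit the scaler on the training fold only, then reuse it on the test fold
    X_train.iloc[:, columns_to_scale] = mm_scaler.fit_transform(X_train.iloc[:, columns_to_scale])
    logreg.fit(X_train, y_train)

    X_test.iloc[:, columns_to_scale] = mm_scaler.transform(X_test.iloc[:, columns_to_scale])
    # Predicted probability of the positive class (win_label == 1)
    y_pred_proba = logreg.predict_proba(X_test)[:, 1]
    auc.append(metrics.roc_auc_score(y_test, y_pred_proba))

    # Confusion matrix at the chosen decision threshold
    cnf.append(confusion_matrix(y_test, (y_pred_proba > thres).astype(int)))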
    
cnf = sum(cnf)

'Median AUC: {:.04f}'.format(st.median(auc))

# auc = sum(auc) / len(auc)
# 'Average AUC: {:.04f}'.format(auc)

  return self.partial_fit(X, y)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s

'Median AUC: 0.7902'
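
As an alternative to scaling by hand inside the loop, the per-fold scaling and the classifier can be bundled into a scikit-learn pipeline so that the scaler is refitted on each training fold automatically. The sketch below uses synthetic data from make_classification as a stand-in for the Dota 2 features. The column positions, max_iter value, and random_state values are assumptions added for illustration and reproducibility, and the resulting AUC is not comparable to the result above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the feature matrix (assumption, not the Dota 2 data)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# Hypothetical positions of the numeric columns that need scaling
cols_to_scale = [4, 5, 6, 7]

# Scale only the selected columns; pass the others through unchanged
preprocess = ColumnTransformer(
    [("minmax", MinMaxScaler(), cols_to_scale)],
    remainder="passthrough",
)

# The pipeline refits the scaler on every training fold during cross-validation
model = Pipeline([
    ("scale", preprocess),
    ("logreg", LogisticRegression(max_iter=1000)),
])

cv = KFold(n_splits=10, shuffle=True, random_state=0)
auc_scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("Median AUC: {:.4f}".format(np.median(auc_scores)))
```

Because the pipeline works on its own internal copies of each fold, no values are assigned into slices of the original DataFrame, so the SettingWithCopyWarning shown above does not arise with this approach.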