# Dota 2 results prediction

Predicting the results of Dota 2 games. The data comes from the UCI Machine Learning Repository: https://archive.ics.uci.edu/ml/datasets/Dota2+Games+Results

Information about the data from UCI Machine Learning repository page:
Attribute Information:
Each row of the dataset is a single game with the following features (in the order in the vector):
1. Team that won the game (1 or -1)
2. Cluster ID (related to location)
3. Game mode (e.g. All Pick)
4. Game type (e.g. Ranked)
5 - end: Each element is an indicator for a hero. A value of 1 indicates that a player from team '1' played as that hero, and -1 that a player from the other team did. A hero can be selected by only one player each game, so each row has five '1' and five '-1' values.
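
The "five 1s and five -1s" property described above can be checked per row. The following is a minimal sketch on made-up toy rows, not on the real files; the helper name `check_hero_row` and the 12-slot rows are assumptions for illustration only (the real data has 113 hero columns).

```python
import numpy as np

def check_hero_row(hero_values):
    """Return True if the hero indicators contain exactly five 1s and five -1s."""
    hero_values = np.asarray(hero_values)
    n_team_pos = int((hero_values == 1).sum())
    n_team_neg = int((hero_values == -1).sum())
    return n_team_pos == 5 and n_team_neg == 5

# Toy rows with 12 hypothetical hero slots each.
valid_row = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 0, 0]
invalid_row = [1, 1, 1, 1, 0, -1, -1, -1, -1, -1, 0, 0]  # only four 1s

print(check_hero_row(valid_row))
print(check_hero_row(invalid_row))
```

On the real data, the same check could be applied to the hero columns only, for example to flag malformed rows before training.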

## EDA

In [185]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

train = pd.read_csv('dota2Train.csv', header = None)
test = pd.read_csv('dota2Test.csv', header = None)

train = train.add_prefix('col')
test = test.add_prefix('col')
train = train.rename(columns={'col0': 'label'})
test = test.rename(columns={'col0': 'label'})
train.info()
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 92650 entries, 0 to 92649
Columns: 117 entries, label to col116
dtypes: int64(117)
memory usage: 82.7 MB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10294 entries, 0 to 10293
Columns: 117 entries, label to col116
dtypes: int64(117)
memory usage: 9.2 MB


In [186]:
train.head()
test.head()

   label  col1  col2  col3  col4  col5  col6  col7  col8  col9  ...  col107  col108  col109  col110  col111  col112  col113  col114  col115  col116
0     -1   223     8     2     0    -1     0     0     0     0  ...      -1       0       0       0       0       0       0       0       0       0
1      1   227     8     2     0     0     0     0     0     0  ...      -1       0       0       0       0       0       0       0       0       0
2     -1   136     2     2     1     0     0     0    -1     0  ...       0       0       0       0       0       0       0       0       0       0
3      1   227     2     2    -1     0     0     0     0     0  ...       0       0       0       0       0       0       0       0       0       0
4      1   184     2     3     0     0     0    -1     0     0  ...       0       0       0       0       0       0       0       0       0       0


In [187]:
print(train.col1.unique())
print(train.col2.unique())
print(train.col3.unique())

[223 152 131 154 171 122 224 227 111 151 145 231 188 156 144 153 225 155
 186 181 183 121 187 232 185 192 136 123 132 182 161 191 138 137 134 184
 112 133 212 204 124 261 213 135 211 241 251]
[2 8 6 9 1 3 4 7 5]
[2 3 1]


Columns col1, col2 and col3 represent cluster ID, game mode and game type respectively, so they should be treated as categorical.

Columns col4 to col116 each represent one hero; the value indicates whether the hero was picked and by which team (0 for not picked, 1 for team 1 and -1 for team -1). If necessary, each column can be split into separate indicators for team 1 and team -1, but I'll leave them as they are for now.
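
For reference, splitting the signed hero columns into one binary block per team could look like the sketch below. It runs on a toy frame with three hypothetical hero columns, and the suffixes `_team1` and `_team_minus1` are names made up for this illustration.

```python
import pandas as pd

# Toy frame with three hypothetical hero columns using the same -1/0/1 coding.
heroes = pd.DataFrame({
    'col4': [1, 0, -1],
    'col5': [-1, 1, 0],
    'col6': [0, -1, 1],
})

# One binary block per team: 1 where that team picked the hero, 0 otherwise.
team_pos = (heroes == 1).astype(int).add_suffix('_team1')
team_neg = (heroes == -1).astype(int).add_suffix('_team_minus1')

split = pd.concat([team_pos, team_neg], axis=1)
print(split)
```

This doubles the number of hero columns, so whether it helps is something to test rather than assume.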

Next I'll use the pandas get_dummies function to encode col1, col2 and col3, and separate the labels from the features.

In [188]:
cat_col = ['col1', 'col2', 'col3']
data = pd.concat([train, test])
for col in cat_col:
    ohe = pd.get_dummies(data[col], prefix=col)
    data = pd.concat([data, ohe], axis=1)
data.drop(cat_col, axis=1, inplace=True)
label_dummie = pd.get_dummies(data['label'], prefix='label')
# y will be 1 if the winner is team 1 and 0 if it's -1
y_train = label_dummie.iloc[:train.shape[0],1]
y_test = label_dummie.iloc[train.shape[0]:,1]
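
One likely benefit of concatenating train and test before calling get_dummies is that both splits end up with the same dummy columns, even when a category appears in only one of them. The sketch below illustrates this on toy frames; the names `toy_train` and `toy_test` and their values are assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy train/test frames; category 3 appears only in the test split.
toy_train = pd.DataFrame({'col2': [1, 2, 2]})
toy_test = pd.DataFrame({'col2': [2, 3]})

# Encoding each split separately yields different column sets.
sep_train = pd.get_dummies(toy_train['col2'], prefix='col2')
sep_test = pd.get_dummies(toy_test['col2'], prefix='col2')

# Encoding the concatenation gives both splits the same columns.
both = pd.get_dummies(pd.concat([toy_train, toy_test])['col2'], prefix='col2')
enc_train = both.iloc[:len(toy_train)]
enc_test = both.iloc[len(toy_train):]

print(list(sep_train.columns), list(sep_test.columns))
print(list(enc_train.columns) == list(enc_test.columns))
```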

In [189]:
ttrain = data.iloc[:train.shape[0],:]
ttest = data.iloc[train.shape[0]:,:]
X_train = ttrain.drop(['label'], axis=1)
X_test = ttest.drop(['label'], axis=1)

In [190]:
print(ttrain.head())
print(ttest.head())
print(y_train.head())

   label  col4  col5  col6  col7  col8  col9  col10  col11  col12  ...  \
0     -1     0     0     0     0     0     0      0      0      0  ...   
1      1     0     0     0     1     0    -1      0      0      0  ...   
2      1     0     0     0     1     0    -1      0      0      0  ...   
3      1     0     0     0     0     0     0     -1      0      0  ...   
4     -1     0     0     0     0     0    -1      0      0     -1  ...   

   col2_3  col2_4  col2_5  col2_6  col2_7  col2_8  col2_9  col3_1  col3_2  \
0       0       0       0       0       0       0       0       0       1   
1       0       0       0       0       0       0       0       0       1   
2       0       0       0       0       0       0       0       0       1   
3       0       0       0       0       0       0       0       0       1   
4       0       0       0       0       0       0       0       0       0   

   col3_3  
0       0  
1       0  
2       0  
3       0  
4       1  

[5 rows x 173 columns]

Now that the categorical data has been encoded, I'll compare a few simple classification models.

## Model selection

In [72]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.ensemble import RandomForestClassifier, VotingClassifier,BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.preprocessing import OneHotEncoder
import xgboost


In [200]:
models = []

lr = LogisticRegression(solver='saga')
models.append(lr)
#knn = KNeighborsClassifier()
#models.append(knn)
#dt = DecisionTreeClassifier()
#models.append(dt)

for m in models:
    cv_results = cross_val_score(m, X_train, y_train, cv=8)
    result = np.mean(cv_results)
    std = np.std(cv_results)
    print('%s : %s  +-  %s'%(type(m).__name__,result, std))   

LogisticRegression : 0.5996224215944337  +-  0.003115060178632311


In [10]:
# lbfgs only supports the l2 penalty, so l1 is left out of the grid
lr2 = LogisticRegression(solver='lbfgs')
grid = {"C": np.logspace(-3, 3, 7), "penalty": ["l2"]}
lr_cv = GridSearchCV(lr2, grid, cv=10)
lr_cv.fit(X_train, y_train)
print("tuned hyperparameters (best parameters):", lr_cv.best_params_)
print("accuracy:", lr_cv.best_score_)

tuned hyperparameters (best parameters): {'C': 1.0, 'penalty': 'l2'}
accuracy: 0.6001295196977874


For now, the best logistic regression parameters are C = 1.0 with an l2 penalty, giving a cross-validated accuracy of about 0.60.

Next, I'll try a neural network.

In [103]:
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping



y_train_cat = keras.utils.to_categorical(y_train, 2)
n_cols = X_train.shape[1]
esm = EarlyStopping(patience=4)

mod_nn = Sequential()
mod_nn.add(Dense(n_cols, activation='relu', input_shape = (n_cols,)))
mod_nn.add(Dense(8*n_cols, activation='relu'))
mod_nn.add(Dropout(0.2))
mod_nn.add(Dense(2, activation='softmax'))

mod_nn.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

mod_nn.fit(X_train,y_train_cat, epochs=20, validation_split=0.3, callbacks = [esm])

Train on 64854 samples, validate on 27796 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20


<tensorflow.python.keras.callbacks.History at 0x7fc65423ad50>

The accuracy obtained was 0.7022, but now we need to evaluate this model on the test data.


In [104]:
y_test_cat = keras.utils.to_categorical(y_test, 2)
mod_nn.evaluate(X_test,y_test_cat)



[0.7573288412424855, 0.56178355]

The neural network's accuracy didn't hold up on the test data (about 0.56), probably due to overfitting.
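
To put a test accuracy like this in context, it can help to compare it with the accuracy of always predicting the most frequent class. The following is a minimal sketch on made-up labels; `majority_baseline_accuracy` is a hypothetical helper, and the real baseline would have to be computed from y_test.

```python
import numpy as np

def majority_baseline_accuracy(y):
    """Accuracy obtained by always predicting the most frequent class in y."""
    y = np.asarray(y)
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / y.size

# Toy labels: five 1s and three 0s.
toy_labels = np.array([1, 1, 1, 0, 0, 1, 0, 1])
print(majority_baseline_accuracy(toy_labels))
```

A model whose test accuracy is close to this baseline has learned little beyond the class balance.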

In [105]:
mod_nn.save_weights('mod_nn.h5')
y_nn_pred = mod_nn.predict(X_test)

Now we should go back and try some feature engineering, parameter tuning and maybe try a few ensemble models.

## Feature engineering
Taking another look at the features, their values and their meaning.

In [197]:
data = pd.concat([train, test])
zeroes = []
for col in data.columns:
    # a column is constant if its most frequent value fills every row
    if data[col].value_counts().iloc[0] == len(data):
        zeroes.append(col)
print(zeroes)  # hero columns that were never picked
print(data.col111.value_counts())

['col27', 'col111']
0    102944
Name: col111, dtype: int64


At least 2 columns can be dropped, since they represent heroes that were never chosen.

In [198]:
data.drop(zeroes, axis=1, inplace=True)
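
An alternative way to find such columns, without relying on the row count, is to look for columns with a single distinct value. This is a sketch on a toy frame; the column names mirror the notebook's naming, but the values are made up.

```python
import pandas as pd

toy = pd.DataFrame({
    'col4': [0, 1, -1, 0],
    'col27': [0, 0, 0, 0],   # hypothetical hero that was never picked
    'col5': [1, 0, 0, -1],
})

# A column with exactly one distinct value carries no information.
constant_cols = [c for c in toy.columns if toy[c].nunique() == 1]
reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(list(reduced.columns))
```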

Since all columns are categorical, I'll use OneHotEncoder on all of the data.

In [223]:
# map the labels from {-1, 1} to {0, 1}
y = data.label
y = y.replace(-1,0)
y_train_ohe = y[:train.shape[0]]
y_test_ohe = y[train.shape[0]:]
X = data.drop(['label'], axis=1)
# one-hot encode every remaining column; the result is a sparse matrix
onehot = OneHotEncoder(categories='auto')
X_ohe = onehot.fit_transform(X)
X_ohe.shape
X_train_ohe = X_ohe[:train.shape[0],:]
X_test_ohe = X_ohe[train.shape[0]:,:]
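
To make the effect of this encoding concrete, here is a small sketch on a toy matrix (the values are made up). Each input column is expanded into one indicator column per distinct value, so a hero column that takes the values -1, 0 and 1 becomes three columns.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy matrix: first column is a game-mode-like ID, second is a hero indicator.
toy_X = np.array([
    [2, -1],
    [8, 0],
    [2, 1],
])

enc = OneHotEncoder(categories='auto')
toy_ohe = enc.fit_transform(toy_X)  # sparse matrix by default

print(enc.categories_)    # distinct values found in each input column
print(toy_ohe.toarray())  # dense view: 2 + 3 = 5 indicator columns
```

The printed sparse output of X_train_ohe lists (row, column) positions of the 1.0 entries in the same kind of matrix.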

In [211]:
print(X_train_ohe)
#plt.figure(figsize=(5,5))
#sns.heatmap(train[['label','col1', 'col2', 'col3']].corr())
#plt.show()

  (0, 38)	1.0
  (0, 48)	1.0
  (0, 57)	1.0
  (0, 60)	1.0
  (0, 63)	1.0
  (0, 66)	1.0
  (0, 69)	1.0
  (0, 72)	1.0
  (0, 75)	1.0
  (0, 78)	1.0
  (0, 81)	1.0
  (0, 84)	1.0
  (0, 88)	1.0
  (0, 90)	1.0
  (0, 93)	1.0
  (0, 96)	1.0
  (0, 100)	1.0
  (0, 102)	1.0
  (0, 105)	1.0
  (0, 108)	1.0
  (0, 110)	1.0
  (0, 114)	1.0
  (0, 117)	1.0
  (0, 120)	1.0
  (0, 122)	1.0
  :	:
  (92649, 318)	1.0
  (92649, 321)	1.0
  (92649, 324)	1.0
  (92649, 327)	1.0
  (92649, 330)	1.0
  (92649, 333)	1.0
  (92649, 336)	1.0
  (92649, 339)	1.0
  (92649, 342)	1.0
  (92649, 345)	1.0
  (92649, 348)	1.0
  (92649, 351)	1.0
  (92649, 354)	1.0
  (92649, 357)	1.0
  (92649, 360)	1.0
  (92649, 363)	1.0
  (92649, 366)	1.0
  (92649, 369)	1.0
  (92649, 372)	1.0
  (92649, 375)	1.0
  (92649, 378)	1.0
  (92649, 381)	1.0
  (92649, 384)	1.0
  (92649, 387)	1.0
  (92649, 390)	1.0


In [215]:
esm = EarlyStopping(patience=4)
n_cols2 = X_train_ohe.shape[1]
mod_nn_ohe = Sequential()
mod_nn_ohe.add(Dense(n_cols2, activation='relu', input_shape = (n_cols2,)))
mod_nn_ohe.add(Dense(8*n_cols2, activation='relu'))
mod_nn_ohe.add(Dropout(0.2))
mod_nn_ohe.add(Dense(2, activation='softmax'))

mod_nn_ohe.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

mod_nn_ohe.fit(X_train_ohe,y_train_cat, epochs=20, validation_split=0.3, callbacks = [esm])

Train on 64854 samples, validate on 27796 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20


<tensorflow.python.keras.callbacks.History at 0x7fc63e75c4d0>

In [217]:
mod_nn_ohe.evaluate(X_test_ohe,y_test_cat)



[0.6625449380898721, 0.5885953]

In [224]:
models = []

lr = LogisticRegression(solver='saga')
models.append(lr)
knn = KNeighborsClassifier()
models.append(knn)
dt = DecisionTreeClassifier()
models.append(dt)

for m in models:
    cv_results = cross_val_score(m, X_train_ohe, y_train_ohe, cv=8)
    result = np.mean(cv_results)
    std = np.std(cv_results)
    print('%s : %s  +-  %s'%(type(m).__name__,result, std))   



LogisticRegression : 0.5991798220987938  +-  0.0029563386142659113
KNeighborsClassifier : 0.520906566465684  +-  0.0034122224690387505
DecisionTreeClassifier : 0.5196116425549561  +-  0.004272437180805086
