### Assignment 02: Deep FF Networks (Predicting Winners in DOTA 2)

The assignment consists of fitting DL model to predict which of two teams will win a DOTA2 game.This is a
simple binary classification problem: aka output -1 if team1 wins and output +1 if team2 wins (no draws)

In [14]:
# Import Library
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import tensorflow as tf

from sklearn.utils import shuffle
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint

from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization

#### Data Loading

In [2]:
# Reading Dota2 Games Results
train_data = pd.read_csv('dota2_results\dota2Train.csv', header = None)
test_data = pd.read_csv('dota2_results\dota2Test.csv', header = None)

In [3]:
# creating header for the datasets
tn_col_list = ['x'+ str(col) for col in range(len(train_data.columns))]
ts_col_list = ['x'+ str(col) for col in range(len(test_data.columns))]

# Add header to datasets
train_data.columns = tn_col_list
test_data.columns = ts_col_list

In [4]:
# Slicing Target columns from feature columns
X_train = train_data.iloc[:,1:]
y_train = train_data.iloc[:,0]

X_test = test_data.iloc[:,1:]
y_test = test_data.iloc[:,0]

In [5]:
X_train.shape

(92650, 116)

#### Data Processing

1. We shuffle the train data row-wise to remove any possible stratification imposed on the data.

In [6]:
# Shuffle train data
y_shfl_train, X_shfl_train = shuffle(y_train, X_train)

2. We'll use scikit-learn's MinMaxScaler class to scale x_shfl_train data down to be on a scale from 0 to 1.

In [7]:
# implement MinMaxScaler
scaler = MinMaxScaler(feature_range=(0,1))
X_scl_train = scaler.fit_transform(X_shfl_train)
X_scl_test = scaler.fit_transform(X_test)

3. Given that we can't calculate cross entropy loss with labels -1 and 1. we convert target values to 0 and 1. Using the LabelEncoder class from scikit-learn, the fit_transform () function is applied on the output variable to create a new values.

In [8]:
# encode class values as integers
encoder = LabelEncoder()
y_encd_train = encoder.fit_transform(y_shfl_train)
y_encd_test = encoder.fit_transform(y_test)

4. Reshaping the output variables into 2D.

In [9]:
# Reshape the target variables
y_rshp_train = y_encd_train.reshape(-1,1)
y_rshp_test = y_encd_test.reshape(-1,1)

##### Fitting and Evaluating Deep Feed Forward Neural Network Model
The model is developed using the following steps
1. Define the structure of model. Specifying an input_shape of 116 for the input layer becuase our dataset has 116  input variables.

2. Compile the model with binary_crossentropy cost function, adam optimizer, and performance metric (accuracy).

3. Fit the compiled model on the train dataset. Additionally, a validation set of 30% of the train data is held back for validation, using the validation_split parameter.

4. Evaluate the model using the test data. we will run for a small number of iterations of 50 epochs and use a relatively small batch size of 1000.

In [13]:
# Specifying the structure of the DFF Neural Network model
model = Sequential()
model.add(Dense(64, input_shape=(116,), activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(BatchNormalization(axis=1))
model.add(Dense(1, activation='sigmoid'))

In [14]:
# Compiling the model 
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

In [68]:
# Fit the model
model.fit(X_scl_train, y_rshp_train, validation_split=0.10, epochs=50, batch_size=100)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.src.callbacks.History at 0x15d81c77790>

In each iteration, the output of the loss function shows higher value. This means that the model is performing very poorly. A low value for the loss would mean our model is performing very well.

In [69]:
# Evaluate the model

test_loss, test_acc = model.evaluate(X_scl_test, y_rshp_test)
print('Accuracy in the testing data:', test_acc)

Accuracy in the testing data: 0.5941324830055237


The testing accuracy is 0.5941, while my training is 0.6037. Indeed, the model isn't performing so well.

#### Fitting and Evaluating RandomForest Model

We create an instance of the Random Forest model, using Scikit-Learn’s RandomizedSearchCV, which will randomly search parameters within a range of hyperparameters defined. We then fit this to our training data. We pass both the features and the target variable, so the model can learn. RandomizedSearchCV will train many models till it gets the best model fitting the train data. This function also uses cross validation, which means it splits the data into five equal-sized groups and uses 4 to train and 1 to test the result. It will loop through each group and give an accuracy score, which is averaged to find the best model.

In [15]:
param_dist = {'n_estimators': randint(50,500),
              'max_depth': randint(1,20)}

# Create a random forest classifier
rf = RandomForestClassifier()

# Use random search to find the best hyperparameters
rand_search = RandomizedSearchCV(rf, 
                                 param_distributions = param_dist, 
                                 n_iter=5, 
                                 cv=5)

# Fit the random search object to the data
rand_search.fit(X_scl_train, y_rshp_train.ravel())

In [16]:
# Create a variable for the best model
best_rf = rand_search.best_estimator_

# Print the best hyperparameters
print('Best hyperparameters:',  rand_search.best_params_)

Best hyperparameters: {'max_depth': 19, 'n_estimators': 318}


In [17]:
# Generate predictions with the best model
y_pred = best_rf.predict(X_scl_test)

In [18]:
accuracy = accuracy_score(y_rshp_test.ravel(), y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.5820866524188848


#### Fitting and Evaluating Logistic Regression classifier 

We create a Logistic Regression classifier object using the LogisticRegression() function with random_state for reproducibility. Then, fit your model on the train set using fit() and perform prediction on the test set using predict(). Accuracy is computed by comparing actual test set values and predicted values.

In [10]:
# instantiate the model (using the default parameters)
logreg = LogisticRegression(random_state=16)

# fit the model with data
logreg.fit(X_scl_train, y_rshp_train.ravel())

y_pred = logreg.predict(X_scl_test)

In [12]:
# Model Accuracy: how often is the classifier correct?
print("Accuracy:",accuracy_score(y_rshp_test.ravel(), y_pred))

Accuracy: 0.5980182630658636


Comparing the accuracy of the DFF neural network model with that of other ML models like the Random Forest Classifier and Logistic Regression Classifier, all of them performed relatively the same. 

Accuracy of DFF neural Network = 59.41% 

Accuracy of DFF neural Network = 58.20%

Accuracy of DFF neural Network = 59.80%

A possible reason could be because the dataset is not variate enough. They are lot of similar values in your dataset. That might be a reason of the low accuracy.