## Dota 2 Team Prediction Model Using Machine Learning Techniques

Dota 2, developed by Valve, is a popular computer game with a player base of 10 million unique users. Each Dota 2 match consists of two teams of five players pitted against each other. Before a match begins, each player selects a character to play as, known as a “hero,” from a pool of 113 different heroes. Once a player chooses a hero, no other player can select that hero for the same match. Heroes have a wide range of characteristics and abilities that, combined with the massive hero pool, make each match unique.

An interesting aspect of the game is that in choosing a hero, players must keep in mind not only the individual strengths and weaknesses of each hero, but also how those strengths and weaknesses interact with the heroes already chosen by other players. An effective hero pick is one that synergizes with the heroes chosen by teammates, and both exploits the weaknesses and minimizes the strengths of the heroes chosen by the opposing team. Assuming equally skilled teams, hero selection matters so much that well-devised hero choices can give a team a large advantage before the match even begins. The goal of our project is to recommend heroes that will perform well against an opposing team of heroes.

With 113 heroes to choose from and five heroes per team, we are attempting to find the best five heroes for any given matchup, which results in over eight quadrillion possible team combinations. On a deeper level, recommending heroes using machine learning is challenging because it tries to capture via raw data what professional players have developed a gut instinct for through hundreds of thousands of hours of play time.
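
As a rough check on the size of that search space, the count can be reproduced with binomial coefficients. This is only a sketch: it assumes the two sides are distinguished (Radiant versus Dire) and that a hero cannot appear on both teams. The variable names are illustrative.

```python
import math

NUM_HEROES = 113  # hero pool size quoted above
TEAM_SIZE = 5

# Ways to choose the first team, then the second team from the remaining heroes
first_team = math.comb(NUM_HEROES, TEAM_SIZE)
second_team = math.comb(NUM_HEROES - TEAM_SIZE, TEAM_SIZE)
matchups = first_team * second_team

print(f"{first_team:,} possible teams, {matchups:,} ordered matchups")
```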

Our data satisfies the following requirement: the game mode is either all pick, single draft, all random, random draft, captain’s draft, captain’s mode, or least played. These game modes are the closest to the true vision of Dota 2.

We pulled 8000 records from the API, of which only the unique records were stored in a MongoDB database. The data for each match is structured as JSON and includes which heroes were chosen for each team, how those heroes performed over the course of the game, and which team ultimately won the game. We used 90% of the matches in our database as a training set and the remaining 10% as a test set.
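
The game-mode requirement can be sketched as a simple filter over raw match records. The numeric IDs below are assumptions based on commonly documented Dota 2 Web API `game_mode` values and should be checked against the API documentation before use. `ALLOWED_GAME_MODES`, `is_allowed`, and the sample records are hypothetical and not part of the notebook.

```python
# Assumed game_mode IDs; verify against the Dota 2 Web API documentation
ALLOWED_GAME_MODES = {
    1: "all pick",
    2: "captain's mode",
    3: "random draft",
    4: "single draft",
    5: "all random",
    12: "least played",
    16: "captain's draft",
}

def is_allowed(match):
    """Return True if the match was played in one of the accepted modes."""
    return match.get("game_mode") in ALLOWED_GAME_MODES

# Hypothetical records holding only the fields needed here
sample_matches = [
    {"match_id": 1, "game_mode": 1},
    {"match_id": 2, "game_mode": 15},
    {"match_id": 3, "game_mode": 16},
]
kept = [m["match_id"] for m in sample_matches if is_allowed(m)]
print(kept)
```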

#### Getting the Data from Steam Client

We import match data in JSON format through Dota2API and then build a MongoDB database from the results. Because there is a limit on the number of match details that can be pulled from the API, we wrap each request in a try/except block so that hitting the API limit does not interrupt the download. Each request pulls data for 100 matches, and a while loop repeats the request until 8000 matches have been collected.

Dota2API is an open-source Python library that wraps the Dota 2 Web API provided by Steam (Valve Corporation). To install it, use: pip install dota2api. The documentation is available at: https://dota2api.readthedocs.io/en/latest/index.html

In [None]:
import pandas as pd
import dota2api

In [None]:
# Initialise the API using a Steam account token key
api = dota2api.Initialise("C4478A705AA9040E7660A53B0DD61092")
count = 0
docs = []
i = 2639900000  # Arbitrary match sequence number to begin pulling the data from
while count <= 8000:
    try:
        matchData = api.get_match_history_by_seq_num(start_at_match_seq_num=i)
        docs.extend(matchData["matches"])
        count += len(matchData["matches"])
    except Exception:
        # API limit reached or request failed; move on to the next batch
        pass
    finally:
        i += 100
print("Data downloaded")

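We then add the pulled data to a MongoDB database. Each document's `_id` is set to its `match_id`, so MongoDB rejects any match that is already stored. To keep those duplicate-key errors from stopping the notebook, we again use a try/except block.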

In [None]:
import pymongo
from pymongo import MongoClient
con=MongoClient()
DotaDB=con.Dota
matches=DotaDB.NewMatches

In [None]:
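from pymongo.errors import BulkWriteError

coll = []
for j in docs:
    j['_id'] = j["match_id"]  # Use the match ID as the primary key so duplicates are rejected
    coll.append(j)
try:
    # ordered=False keeps inserting the remaining documents after a duplicate-key error
    matches.insert_many(coll, ordered=False)
except BulkWriteError:
    print("Records inserted, duplicate match details ignored")
finally:
    print("Total records inserted:", matches.count_documents({}))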
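
The duplicate handling above relies on MongoDB rejecting repeated `_id` values. The same effect can be sketched without a database by keying the records on `match_id`. The helper `deduplicate` and the sample records are hypothetical.

```python
def deduplicate(records):
    """Keep the first record seen for each match_id, preserving order."""
    unique = {}
    for record in records:
        unique.setdefault(record["match_id"], record)
    return list(unique.values())

# Hypothetical records with one repeated match
sample = [
    {"match_id": 10, "radiant_win": True},
    {"match_id": 11, "radiant_win": False},
    {"match_id": 10, "radiant_win": True},
]
unique_records = deduplicate(sample)
print(len(unique_records))
```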

In [None]:
print('Sample match data:', coll[0])

#### Applying Logistic Regression


In [None]:
import pylab
import pymongo
from sklearn.linear_model import LogisticRegression
import numpy as np
from pymongo import MongoClient
connection = pymongo.MongoClient("mongodb://localhost")

DotaDB = connection.Dota
matches = DotaDB.NewMatches

In [None]:
import sys

np.set_printoptions(threshold=sys.maxsize)  # Print arrays in full; recent NumPy rejects np.nan here
NUM_HEROES = 114  # Columns per team; hero IDs are 1-based and shifted to 0-based indices below
NUM_FEATURES = NUM_HEROES * 2
NUM_MATCHES = matches.count_documents({})

# Initialize training matrix
X = np.zeros((NUM_MATCHES, NUM_FEATURES), dtype=np.int8)

# Initialize training label vector
Y = np.zeros(NUM_MATCHES, dtype=np.int8)

for i, record in enumerate(matches.find()):
    Y[i] = 1 if record['radiant_win'] else 0
    for player in record['players']:
        hero_id = player['hero_id'] - 1  # Shift 1-based hero ID to a 0-based column index
        # Player slots 128 and above belong to Dire; their heroes go in the second half
        if player['player_slot'] >= 128:
            hero_id += NUM_HEROES
        X[i, hero_id] = 1
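
To make the encoding concrete, here is a small standalone sketch of how one match becomes a feature vector. It uses plain lists instead of NumPy, and the sample record contains only two hypothetical players. `encode_match` is an illustrative helper, not part of the notebook.

```python
NUM_HEROES = 114
NUM_FEATURES = NUM_HEROES * 2

def encode_match(record):
    """Return (features, label) for one match record."""
    features = [0] * NUM_FEATURES
    for player in record["players"]:
        column = player["hero_id"] - 1      # 1-based hero ID to 0-based column
        if player["player_slot"] >= 128:    # slots 128 and above are Dire
            column += NUM_HEROES
        features[column] = 1
    label = 1 if record["radiant_win"] else 0
    return features, label

# Hypothetical record with one Radiant and one Dire player
record = {
    "radiant_win": True,
    "players": [
        {"hero_id": 5, "player_slot": 0},
        {"hero_id": 7, "player_slot": 128},
    ],
}
features, label = encode_match(record)
print([i for i, v in enumerate(features) if v == 1], label)
```

The first half of the vector marks Radiant picks and the second half marks Dire picks, so the same hero occupies a different column depending on its side.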

In [None]:
indices = np.random.permutation(NUM_MATCHES)
test_indices = indices[0:NUM_MATCHES // 10]
train_indices = indices[NUM_MATCHES // 10:NUM_MATCHES]

X_test = X[test_indices]
Y_test = Y[test_indices]
X_train = X[train_indices]
Y_train = Y[train_indices]

In [None]:
print('Sample Train data:', X_train[0])

In [10]:
num_samples = len(X_train)
model = LogisticRegression().fit(X_train[0:num_samples], Y_train[0:num_samples])

In [11]:
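def generateMatrix(my_team, their_team):
    # my_team holds Radiant column indices (0 to NUM_HEROES - 1); their_team holds
    # Dire column indices already offset by NUM_HEROES, as returned by getHeroList
    X = np.zeros(NUM_FEATURES, dtype=np.int8)
    for hero_id in my_team:
        X[hero_id] = 1
    for hero_id in their_team:
        X[hero_id] = 1
    return X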

In [None]:
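def predict(query, model):
    # scikit-learn expects a 2-D array, so each query is reshaped to a single row
    radiant_query = query.reshape(1, -1)
    # Swap the two halves so the Dire team is presented in the Radiant columns
    dire_query = np.concatenate((query[NUM_HEROES:NUM_FEATURES], query[0:NUM_HEROES])).reshape(1, -1)
    rad_prob = model.predict_proba(radiant_query)[0][1]  # P(Radiant wins)
    dire_prob = model.predict_proba(dire_query)[0][0]    # P(swapped "Radiant" loses)
    return (rad_prob + dire_prob) / 2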
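
The averaging in `predict` can be illustrated without a trained model. The sketch below swaps the two halves of a feature vector and averages the two estimates. `toy_radiant_win_prob` is a made-up stand-in for `model.predict_proba`, and its weights are arbitrary.

```python
import math

NUM_HEROES = 3  # tiny pool so the vectors stay readable
WEIGHTS = [0.4, -0.2, 0.1, -0.3, 0.5, 0.0]  # arbitrary, for illustration only

def toy_radiant_win_prob(features):
    """Hypothetical logistic model: P(the team in the first half wins)."""
    score = sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-score))

def swap_sides(features):
    """Move the Dire half to the front and the Radiant half to the back."""
    return features[NUM_HEROES:] + features[:NUM_HEROES]

def symmetric_prob(features):
    rad_prob = toy_radiant_win_prob(features)
    # With the sides swapped, a first-half loss means the original Radiant side wins
    dire_prob = 1.0 - toy_radiant_win_prob(swap_sides(features))
    return (rad_prob + dire_prob) / 2

query = [1, 0, 0, 0, 1, 0]
p = symmetric_prob(query)
print(round(p + symmetric_prob(swap_sides(query)), 6))
```

By construction, the estimate for a matchup and the estimate for its mirror image add up to 1, which removes any bias toward one side's position in the feature vector.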

In [None]:
#Applying model on test dataset 
prediction = []
for i in range(len(X_test)):
    list1 = predict(X_test[i],model)   
    prediction.append((list1,Y_test[i])) 

In [None]:
print('Prediction for each match:', prediction)

#### Generating a graph to depict the test data accuracy

In [None]:
import matplotlib.pyplot as plt

In [None]:
def evaluateModel(model, X, Y,positive_class, negative_class):    
    correct_predictions = 0.0
    for i, radiant_query in enumerate(X):
        overall_prob = predict(radiant_query,model)
        prediction = positive_class if (overall_prob > 0.5) else negative_class
        result = 1 if prediction == Y[i] else 0
        correct_predictions += result
    return correct_predictions/len(Y)
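
The accuracy computed by `evaluateModel` is the share of matches where the thresholded probability agrees with the actual outcome. A minimal sketch with made-up probabilities and labels (the `accuracy` helper is hypothetical):

```python
def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of cases where the thresholded probability matches the label."""
    correct = 0
    for prob, label in zip(probabilities, labels):
        predicted = 1 if prob > threshold else 0
        if predicted == label:
            correct += 1
    return correct / len(labels)

# Hypothetical Radiant win probabilities and true outcomes (1 = Radiant win)
probs = [0.7, 0.4, 0.55, 0.2]
labels = [1, 0, 0, 0]
acc = accuracy(probs, labels)
print(acc)
```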

In [None]:
def plot_model_accuracy(X_test, Y_test):
    # evaluateModel returns the fraction of correct predictions (accuracy), not an error rate
    test_accuracy = evaluateModel(model, X_test, Y_test, 1, 0)
    plt.bar(1, test_accuracy, width=0.2, align='center')
    plt.ylabel('Accuracy')
    plt.xlabel('Model')
    plt.title('Logistic Regression Model Accuracy')
    frame = plt.gca()
    frame.axes.get_xaxis().set_visible(False)
    plt.show()

In [None]:
plot_model_accuracy(X_test,Y_test)

In [29]:
def getHeroList(data):
    radiant_list = []
    dire_list = []
    for i in range(NUM_FEATURES):
        if data[i] == 1 and i < NUM_HEROES:
            radiant_list.append(i)
        elif data[i] == 1 and i >= NUM_HEROES:
            dire_list.append(i)          
    return radiant_list,dire_list            

In [None]:
print('Sample Radiant and Dire Team List:', getHeroList(X_test[0]))

In [None]:
def recommend(my_team, their_team, hero_candidates):
    # Drop the last hero of my_team and try every candidate in the open slot
    my_team = my_team[:-1]
    prob_candidate_pairs = []
    for candidate in hero_candidates:
        # Skip heroes already picked on either side, since a hero can only be chosen once
        if candidate in my_team or candidate + NUM_HEROES in their_team:
            continue
        query = generateMatrix(my_team + [candidate], their_team)
        prob = predict(query, model)
        prob_candidate_pairs.append((prob, candidate))
    return prob_candidate_pairs

In [None]:
def get_recommendation(data):
    rad_team, dire_team = getHeroList(data)
    # Candidates are 0-based hero indices, matching the Radiant columns of X
    hero_candidates = np.arange(0, 113)
    return recommend(rad_team, dire_team, hero_candidates)

In [None]:
print('Sample probability:', get_recommendation(X_test[0]))
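
The list returned by `recommend` is not sorted. A minimal sketch of turning it into a ranked shortlist, using hypothetical (probability, hero index) pairs and an illustrative helper `top_candidates`:

```python
import heapq

# Hypothetical (probability, hero index) pairs in the shape returned by recommend
prob_candidate_pairs = [(0.52, 3), (0.61, 17), (0.48, 40), (0.58, 8)]

def top_candidates(pairs, k):
    """Return the k pairs with the highest win probability, best first."""
    return heapq.nlargest(k, pairs, key=lambda pair: pair[0])

best = top_candidates(prob_candidate_pairs, 2)
print(best)
```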

In [None]:
#building recommendation matrix 
recommendations = []    
for i in range(len(X_test)):
    pr = get_recommendation(X_test[i])
    recommendations.append(pr)

In [None]:
print('Sample Recommendation:', recommendations[0])

#### Recommendation graph

In [None]:
import matplotlib.pyplot as plt

In [None]:
recommend_list = get_recommendation(X_test[25])
# Sort by win probability, highest first, and keep the top fifteen candidates
recommend_list.sort(key=lambda pair: pair[0], reverse=True)
topFifteen = recommend_list[:15]
probabilities = [prob for prob, candidate in topFifteen]
hero_ids = [candidate for prob, candidate in topFifteen]

fig, ax = plt.subplots()
plt.bar(range(len(topFifteen)), probabilities, width=0.2, color='r', align='center')
plt.xticks(range(len(topFifteen)), hero_ids)

ax.set_ylabel('Probability')
ax.set_xlabel('Hero Id')
plt.title('Recommendation Probability')
plt.show()

In [None]:
import matplotlib.pyplot as plt
from operator import itemgetter 

In [None]:
#Hero wise Data Analysis

herocount = [0] * 115
kills = [0] * 115
deaths = [0] * 115
assists = [0] * 115
for i in matches.find():
    for j in i["players"]:    
        herocount[j["hero_id"]]+=1
        kills[j["hero_id"]]+=j["kills"]
        deaths[j["hero_id"]]+=j["deaths"]
        assists[j["hero_id"]]+=j["assists"]
maxpick = max(herocount)
mostPickedHeroID = herocount.index(maxpick)

In [None]:
print("The most picked hero is ", mostPickedHeroID)

In [None]:
kda = {}
for i in range(len(kills)):
    num = kills[i] + assists[i]
    den = deaths[i]
    # Avoid division by zero for heroes with no recorded deaths
    kda[i] = num / den if den != 0 else num

sorted_kda = sorted(kda.items(), key=itemgetter(1), reverse=True)
# Keep the ten heroes with the highest KDA
values = dict(sorted_kda[:10])

highestImpact = sorted_kda[0][1]
HighestImpactHeroID = sorted_kda[0][0]
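
The impact ratio used above is (kills + assists) / deaths, falling back to kills + assists when a hero has no deaths. A small sketch with made-up totals (the `impact_ratio` helper and the numbers are hypothetical):

```python
def impact_ratio(kills, assists, deaths):
    """(kills + assists) / deaths, or kills + assists when deaths is zero."""
    total = kills + assists
    return total / deaths if deaths != 0 else total

# Hypothetical per-hero totals: hero index -> (kills, assists, deaths)
totals = {1: (30, 60, 45), 2: (12, 8, 0), 3: (50, 25, 25)}
ratios = {hero: impact_ratio(k, a, d) for hero, (k, a, d) in totals.items()}
best_hero = max(ratios, key=ratios.get)
print(ratios, best_hero)
```

Note that a hero with zero recorded deaths is scored by its raw kills plus assists, so rarely played heroes can rank high on very little data.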

In [None]:
print("The hero with best impact ratio: ", HighestImpactHeroID)

In [None]:
import matplotlib.pyplot as plt

In [None]:
# Top 10 heroes with the highest impact:
fig, ax = plt.subplots()
plt.bar(range(len(values)), list(values.values()), width=0.2, color='g', align='center')
plt.xticks(range(len(values)), list(values.keys()))

ax.set_ylabel('KDA')
ax.set_xlabel('Hero Id')
plt.title('Top Ten Impact Ratio')
plt.show()

#### Conclusions and Further Expansions

We use match data to estimate how substituting any hero on a team changes that team's probability of winning the game. We also report facts such as the most played hero and the hero with the highest impact in terms of kills and assists.

A further expansion could include analysis of other factors (i.e., hero properties such as experience per minute, gold per minute, denies, barracks status, and tower status) and of how the players synergize with each other. Restricting the analysis to top players would also likely result in a better model, since top players are able to make full use of each hero's capabilities.
We are using only 6000 records to form the model, whereas the dataset we were able to pull contains around 150,000 (1.5 lakh) records; training on all of it should improve the model.