# Dota 2 statistics

Guzman-Diaz S., 2023

## Introduction
Multiplayer online battle arena (MOBA) is a subgenre of strategy video games in which two teams of players compete against each other on a predefined battlefield<sup>[1](https://en.wikipedia.org/wiki/Multiplayer_online_battle_arena)</sup>. Dota 2 is one of the most popular games of this kind, with about 700,000 daily players<sup>[2](https://store.steampowered.com/charts/mostplayed)</sup>. Dota 2 is played in matches between two teams of five players each, named Radiant and Dire. Each team occupies and defends its own separate base on the map. Each of the ten players independently controls a powerful character known as a "hero", and every hero has unique abilities and a distinct style of play<sup>[3](https://en.wikipedia.org/wiki/Dota_2)</sup>.

To be successful in a Dota 2 match, a team needs effective communication, good player skills (quantified by matchmaking rating, mmr) and a balanced combination of heroes in its draft. Among the 124 heroes available, it is well known that some have a higher win rate than others, especially at certain skill levels<sup>[4](https://www.dotabuff.com/heroes/meta)</sup>. However, individual hero win rates alone cannot determine the outcome of a match.

In this exploratory analysis I try to identify the effect that team composition and player behavior have on the outcome of a Dota 2 match. I collected data from more than 40,000 matches and performed statistical analysis on them. I split the dataset by skill bracket and identified factors that can predict the outcome of a match.

## Methods
For this analysis, match data was collected through May 2023 using the OpenDota API<sup>[5](https://www.opendota.com/api-keys)</sup> and the script *data_collector.py*, available in this repo. Match data was collected iteratively from the API and stored in a SQLite database for later use. All collected data corresponds to game version 7.33c. Additionally, hero data and individual match data were collected from the same API and combined with the match data. Data was analyzed per team (Dire or Radiant) and in a combined manner.
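The storage step described above could look roughly like the following. This is a minimal sketch, not the contents of *data_collector.py*: the table name `matches`, the helper `store_matches` and the sample rows are hypothetical, and the column names are borrowed from the match table used later in this notebook.

```python
import sqlite3

def store_matches(conn, matches):
    """Insert match dicts into a `matches` table and return how many rows were new.

    Hypothetical helper: rows whose match_id is already stored are skipped,
    so repeated (iterative) collection does not create duplicates.
    """
    conn.execute(
        """CREATE TABLE IF NOT EXISTS matches (
               match_id    INTEGER PRIMARY KEY,
               radiant_win INTEGER,
               duration    INTEGER,
               avg_mmr     INTEGER,
               game_mode   INTEGER
           )"""
    )
    before = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
    conn.executemany(
        """INSERT OR IGNORE INTO matches
           VALUES (:match_id, :radiant_win, :duration, :avg_mmr, :game_mode)""",
        matches,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
    return after - before

# Made-up rows standing in for one batch returned by the API.
batch = [
    {"match_id": 1, "radiant_win": 1, "duration": 1700, "avg_mmr": 3100, "game_mode": 22},
    {"match_id": 2, "radiant_win": 0, "duration": 1500, "avg_mmr": 2500, "game_mode": 22},
]
extra = {"match_id": 3, "radiant_win": 1, "duration": 1900, "avg_mmr": 4000, "game_mode": 1}

conn = sqlite3.connect(":memory:")  # a file path would be used for a persistent database
first_insert = store_matches(conn, batch)
second_insert = store_matches(conn, batch + [extra])  # only match 3 is new
total_rows = conn.execute("SELECT COUNT(*) FROM matches").fetchone()[0]
```

Using the match id as primary key together with `INSERT OR IGNORE` is one simple way to make repeated collection runs safe to re-execute.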

Eight ranks were defined based on the average player mmr of each match, and data was analyzed at each of these skill brackets. The relationship between the heroes selected by a team and that team's probability of winning was examined. Data from duos (pairs of heroes present in a team) and trios (groups of three heroes) was analyzed to identify combined effects.

Finally, match data was used to train a model to predict the outcome of a match based on the combination of heroes selected for a team.



In [1]:
### import modules and data
import json
import pandas as pd
from plotly.subplots import make_subplots
import plotly.graph_objects as go
import plotly.express as px 


from utils.dota_data_utils import DotaData 
import utils.dota_plots_utils as dota_plots_utils 
from utils.dota_analytics_utils import DotaPredictionModels

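# Load the configuration, closing the file handle once it has been read
with open('dota_data_config.json') as config_file:
    args = json.load(config_file)
data = DotaData(**args['dota_data'])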

## Results and Discussion
### General results
Data from a total of XX,XXX matches was collected. 
From these we kept the matches with game_mode = 1 ('All pick') or game_mode = 22 ('Ranked all pick'). Additionally, we removed matches with NA values, leaving a total of XX,XXX matches.

In [15]:
### Table 1. Match data collected
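# Keep only 'All pick' (1) and 'Ranked all pick' (22) matches, then drop rows with
# missing values (chained so that dropna does not run in place on a slice)
data.match_data = data.match_data.loc[data.match_data['game_mode'].isin([1, 22])].dropna()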
# print(data.match_data['avg_mmr'].median())
print('Table 1. Match data collected')
print(data.match_data[['radiant_win', 'duration', 'avg_mmr', 'game_mode', 'dota rank']].describe())

Table 1. Match data collected
        radiant_win      duration       avg_mmr     game_mode
count  30721.000000  30721.000000  30721.000000  30721.000000
mean       0.595912   1714.628463   3127.390742     21.998633
std        0.490723    272.337344    977.207525      0.169438
min        0.000000    366.000000      1.000000      1.000000
25%        0.000000   1569.000000   2539.000000     22.000000
50%        1.000000   1736.000000   3172.000000     22.000000
75%        1.000000   1900.000000   3755.000000     22.000000
max        1.000000   2509.000000   8576.000000     22.000000


### MMR ranks
The mmr of the matches varied from 1 to 8576, with a mean of 3127 and a median of 3172. 
We defined eight mmr ranks based on these results. 
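As an illustration of the binning idea, the sketch below splits the mmr range into eight equal-width intervals with `pandas.cut`. It is written under assumptions: the rank labels are taken to be the eight standard Dota 2 medal names (the actual keys of `data.dota_ranks` are not shown in this notebook), and `assign_ranks` is a hypothetical helper rather than the method used by `DotaData`.

```python
import pandas as pd

# Assumed labels: the eight standard Dota 2 medals, lowest to highest.
RANK_NAMES = ["Herald", "Guardian", "Crusader", "Archon",
              "Legend", "Ancient", "Divine", "Immortal"]

def assign_ranks(avg_mmr, max_mmr=None):
    """Map each average mmr to one of eight equal-width bins between 0 and max_mmr."""
    max_mmr = avg_mmr.max() if max_mmr is None else max_mmr
    edges = [max_mmr * i / 8 for i in range(9)]  # nine edges delimit eight bins
    return pd.cut(avg_mmr, bins=edges, labels=RANK_NAMES, include_lowest=True)

# Minimum, quartiles and maximum of avg_mmr as reported in Table 1.
mmr = pd.Series([1, 2539, 3172, 3755, 8576])
ranks = assign_ranks(mmr)
print(ranks.tolist())
```

If the bins are equal-width as in this sketch, the median match (3172) falls in the third bin and the 75th percentile (3755) in the fourth, so the upper ranks would contain comparatively few matches.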


In [8]:
### Figure 1. mmr ranks defined for our dataset
ranks_size = round(data.match_data['avg_mmr'].dropna().max()/8)

# for n, rank_n in enumerate(data.dota_ranks.keys()):  
#     if n == 0:
#         low_range = round(n*ranks_size)
#     else:
#         low_range = high_range + 1
#     high_range = low_range + ranks_size
#     # print(f'{rank_n} range: {low_range} - {high_range}')

fig = make_subplots(rows=1, cols=2, subplot_titles=("Match average mmr distribution",
                                                    "Number of matches by rank"))
fig.add_trace(go.Histogram(x=data.match_data['avg_mmr'].dropna()), row=1, col=1)
fig.add_trace(go.Histogram(x=data.match_data['dota rank'].dropna().sort_values()), row=1, col=2)
fig.update_layout(showlegend=False, title_text="Figure 1. mmr ranks defined for our dataset")

Analyzing the data by rank, we can see that the Radiant team has a higher win rate in all skill brackets. Although the Dota 2 map is considered to be balanced, it is often argued that the position of the Radiant base is more comfortable to play for most players, which may be related to this disparity<sup>[6](https://www.reddit.com/r/DotA2/comments/133ub5r/radiant_wins_more_than_dire_in_berlin_major/)</sup>.
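The same comparison can be expressed numerically instead of graphically. The following is a small sketch that assumes a table with the `dota rank` and `radiant_win` columns used in this notebook; the helper name `radiant_win_rate_by_rank` and the toy rows are made up for illustration.

```python
import pandas as pd

def radiant_win_rate_by_rank(matches):
    """Return the Radiant win rate (%) and the number of matches for each rank."""
    grouped = matches.groupby("dota rank")["radiant_win"]
    return pd.DataFrame({
        "radiant_win_pct": grouped.mean() * 100,  # mean of a 0/1 column is the win share
        "matches": grouped.size(),
    })

# Toy data, not taken from the collected dataset.
toy = pd.DataFrame({
    "dota rank": ["Archon", "Archon", "Archon", "Archon", "Legend", "Legend"],
    "radiant_win": [1, 1, 1, 0, 1, 0],
})
summary = radiant_win_rate_by_rank(toy)
print(summary)
```

Reporting the match count next to each percentage helps to judge how much weight a bracket with few matches deserves.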

In [10]:
### Fig 2. Win rates of each team by player's skill rank
fig_1 = make_subplots(rows=2, cols=4, subplot_titles=list(data.dota_ranks.keys()))
pos = {0:[1,1], 1:[1,2], 2:[1,3], 3:[1,4], 
       4:[2,1], 5:[2,2], 6:[2,3], 7:[2,4]}

for n, rank in enumerate(data.dota_ranks.keys()):
    fig_1.add_trace(go.Histogram(x=data.match_data.loc[data.match_data['dota rank'] == rank,
                                                     'radiant_win'] \
                               .sort_values() \
                               .replace(1, 'Radiant win') \
                               .replace(0, 'Dire win'), 
                               histnorm='percent'),
                  row=pos[n][0], col=pos[n][1])
fig_1.update_layout(showlegend=False,
                  title_text="Figure 2. Win rates of each team by player's skill rank")
fig_1.show()
    

### Heroes individual pick and win rates

When looking at hero data, some heroes have a higher pick rate across all skill ranks, while others are not even used in the highest brackets. Legion Commander was the most selected hero, present in 30% of all drafts and in 75% of the matches in the Immortal rank. On the other hand, Leshrac is used in only 1.2% of games and, in this dataset, is never selected in the 7th and 8th ranks.
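For reference, a pick rate of this kind can be computed as the share of drafts that contain a given hero. The sketch below assumes that definition (the exact one used by `data.heroes_pick_df` is not shown here); `pick_rates` is a hypothetical helper and the drafts are shortened toy examples.

```python
from collections import Counter

def pick_rates(drafts):
    """Return, for each hero, the percentage of drafts in which it appears."""
    # set(draft) guards against counting a hero twice within the same draft
    counts = Counter(hero for draft in drafts for hero in set(draft))
    return {hero: 100 * n / len(drafts) for hero, n in counts.items()}

# Toy drafts (three heroes each for brevity), not taken from the dataset.
drafts = [
    ["Legion Commander", "Pudge", "Slark"],
    ["Legion Commander", "Clinkz", "Undying"],
    ["Pudge", "Leshrac", "Medusa"],
    ["Legion Commander", "Slark", "Viper"],
]
rates = pick_rates(drafts)
print(rates["Legion Commander"], rates["Leshrac"])
```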

In [26]:
###Figure 3. Most and least picked heroes by rank
heroes_pick_df = data.heroes_pick_df('both')
heroes_pick_df.set_index('localized_name', inplace=True)
heroes_pick_df['Mean'] = heroes_pick_df.mean(axis=1)
heroes_pick_df.sort_values(by='Mean', inplace=True)
pick_rank = heroes_pick_df.copy().sort_values(by='Mean', ascending=False)
heroes_pick_df = pd.concat([heroes_pick_df.head(10), heroes_pick_df.tail(10)])

fig_2 = go.Figure(data=go.Heatmap(x=heroes_pick_df.columns,
                                y=list(heroes_pick_df.index),
                                z=heroes_pick_df))
fig_2.update_traces(showscale=False)
fig_2.layout.height = 500
fig_2.update_xaxes(side='top', tickangle=-15)
fig_2.update_layout(title_text="Figure 3. Most and least picked heroes by rank")
fig_2.show()
    

Hero win rate also varied by rank. Huskar, Clinkz, Bloodseeker and Naga Siren had win rates higher than 62%, while Enigma, Phantom Assassin, Faceless Void and Leshrac fell below 40%.


In [27]:
### Figure 4. Highest and lowest win rate heroes by rank
heroes_wlr_df = data.heroes_wlr_by_rank()
heroes_wlr_df.set_index('localized_name', inplace=True)
heroes_wlr_df['Mean'] = heroes_wlr_df.mean(axis=1)
heroes_wlr_df.sort_values(by='Mean', inplace=True)
wlr_rank = heroes_wlr_df.copy().sort_values(by='Mean', ascending=False)
heroes_wlr_df = pd.concat([heroes_wlr_df.head(10), heroes_wlr_df.tail(10)])

fig = go.Figure(data=go.Heatmap(x=heroes_wlr_df.columns,
                                y=list(heroes_wlr_df.index),
                                z=heroes_wlr_df))
fig.update_traces(showscale=False)
fig.layout.height = 500
fig.update_xaxes(side='top', tickangle=-15)
fig.update_layout(title_text="Figure 4. Highest and lowest win rate heroes by rank")
fig.show()

Looking at the results, it is easy to see why Clinkz and Techies are some of the most picked heroes: they also have some of the highest win rates in the game. Conversely, Leshrac is currently the most ignored hero, which is consistent with its low win rate.

However, the most picked heroes are not necessarily the ones with the highest win rates. An example is Juggernaut, which is the eighth most selected hero but has one of the lowest average win rates. The opposite is true for heroes such as Lone Druid, Naga Siren, Shadow Demon or Viper, which are not among the most selected heroes but currently have some of the highest win rates.
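One way to make this mismatch explicit is to rank heroes by pick rate and by win rate and look at the difference between the two ranks. The sketch below is only an illustration: `rank_gap`, the column names `pick_rate` and `win_rate`, and the numbers for the placeholder heroes are all made up.

```python
import pandas as pd

def rank_gap(stats):
    """Add pick rank, win rank and their difference (1 = highest rate)."""
    out = stats.copy()
    out["Pick rank"] = out["pick_rate"].rank(ascending=False, method="first").astype(int)
    out["Win rank"] = out["win_rate"].rank(ascending=False, method="first").astype(int)
    # Positive gap: the hero wins more often than its popularity suggests.
    # Negative gap: the hero is picked more often than its win rate suggests.
    out["gap"] = out["Pick rank"] - out["Win rank"]
    return out

# Placeholder heroes with invented rates.
toy = pd.DataFrame(
    {"pick_rate": [30.0, 12.0, 5.0, 1.2], "win_rate": [51.0, 44.0, 63.0, 38.0]},
    index=["Hero A", "Hero B", "Hero C", "Hero D"],
)
gaps = rank_gap(toy)
print(gaps[["Pick rank", "Win rank", "gap"]])
```

In this toy table Hero C is third by pick rate but first by win rate, the same pattern the text describes for under-picked heroes.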



In [28]:
###Figure 5. Comparison of hero pick rank and win rate rank
pick_rank.drop(list(data.dota_ranks.keys()), axis=1, inplace=True)
pick_rank.drop('Mean', axis=1, inplace=True)
for n, hero in enumerate(pick_rank.index):
    pick_rank.at[hero, 'Pick rank'] = n + 1

wlr_rank.drop(list(data.dota_ranks.keys()), axis=1, inplace=True)
wlr_rank.drop('Mean', axis=1, inplace=True)
for n, hero in enumerate(wlr_rank.index):
    wlr_rank.at[hero, 'Win rank'] = n + 1
    
combined_rank = pick_rank.merge(wlr_rank, how='left', on='localized_name')
combined_rank.sort_values(by='localized_name', inplace=True)
combined_rank = combined_rank.transpose()

fig = go.Figure()
for hero in combined_rank.columns:
    fig.add_trace(go.Scatter(x=combined_rank.index, y=combined_rank[hero],
                             mode='lines+markers+text', # 'lines' or 'markers'
                             name=hero))
# Apply the dotted line style once, after all traces have been added
fig.update_traces(line_dash='dot')
fig.update_layout(margin_pad=50)
fig.update_layout(title_text="Figure 5. Comparison of hero pick rank and win rate rank")
fig.layout.height = 1000
fig.update_yaxes(autorange="reversed")
fig.show()


### Team analysis

Because many duos are very rarely included in a draft, those that appear in less than 1% of the sample were removed. This leaves only 647 duos in the analysis.

When looking at hero draft data, it is clear that the heroes most used individually appear in the most common duos. That is the case for Pudge, Legion Commander and Slark. However, when looking at the win rate of the duos, things change. The duos formed by Clinkz + Undying, Undying + Medusa, and Venomancer + Clinkz have win rates above 70%. Although this data is limited by the relatively small number of matches, it is interesting that heroes with high win rates appear to achieve even higher rates when they play together.

For groups of three heroes the data is not clear, due to the small number of matches sampled. Groups of four and five heroes were not analyzed due to computational and time constraints.
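The duo counting and the frequency filter can be sketched with `itertools.combinations`. This is an illustrative version under assumptions, not the implementation of `get_heroes_combinations`: `duo_win_rates` is a hypothetical helper, the threshold is applied to the number of team drafts passed in, and the toy drafts are invented. The last lines also show why trios are much sparser than duos for a pool of 124 heroes.

```python
from collections import defaultdict
from itertools import combinations
from math import comb

def duo_win_rates(team_drafts, min_share=0.01):
    """Count every hero pair per team draft and return (count, win rate %) per duo.

    team_drafts: iterable of (heroes, won) tuples, one per team.
    Duos seen in min_share of the drafts or fewer are dropped.
    """
    stats = defaultdict(lambda: [0, 0])  # duo -> [appearances, wins]
    n_drafts = 0
    for heroes, won in team_drafts:
        n_drafts += 1
        for duo in combinations(sorted(heroes), 2):  # sorted: (A, B) and (B, A) are one duo
            stats[duo][0] += 1
            stats[duo][1] += int(won)
    min_count = n_drafts * min_share
    return {duo: (count, 100 * wins / count)
            for duo, (count, wins) in stats.items() if count > min_count}

# Toy drafts (three heroes each for brevity) with invented outcomes.
team_drafts = [
    (["Clinkz", "Undying", "Pudge"], True),
    (["Clinkz", "Undying", "Slark"], True),
    (["Clinkz", "Undying", "Medusa"], False),
    (["Pudge", "Slark", "Medusa"], False),
]
frequent = duo_win_rates(team_drafts, min_share=0.5)
print(frequent)

# Size of the search space for a pool of 124 heroes.
possible_duos = comb(124, 2)
possible_trios = comb(124, 3)
print(possible_duos, possible_trios)
```

With 124 heroes there are 7,626 possible duos but 310,124 possible trios, so the same number of matches is spread over roughly forty times as many groups, which is consistent with the trio data being too thin to interpret.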

In [33]:
### Figure 6. Hero duos with highest and lowest win rate
try:
    combinations_df = pd.read_csv('combinations_df.csv', index_col=0)
except FileNotFoundError:  # no cached file yet, so compute the combinations
    combinations_df = data.get_heroes_combinations(data, max_heroes=2,
                                                   store=True, file='combinations_df.csv')

### keep only duos and above
combinations_df = combinations_df.iloc[124:,]
### select combinations that appear in more than 1% of matches
min_count = data.match_data.shape[0]/100*1
combinations_df = combinations_df.loc[combinations_df['counts'] > min_count]
## Show only highest and lowest win rates
# (reassign instead of sorting in place, since combinations_df is a filtered slice)
combinations_df = combinations_df.sort_values(by='win_rate', ascending=False)
combinations_df_hl = pd.concat([combinations_df.head(10), combinations_df.tail(10)])
# print('Most common duos:')
# print(combinations_df.sort_values(by='counts', ascending=False) \
#         .head(10)[['combined_ids',  'counts', 'win_rate']])

# create subplots
fig = make_subplots(rows=1, cols=2, shared_xaxes=True,
                    shared_yaxes=True, horizontal_spacing=0.1,
                    subplot_titles=['Win rates (%)', 'Matches count'])

# add_trace with row/col replaces the deprecated append_trace
fig.add_trace(go.Bar(y=combinations_df_hl['combined_ids'],
                     x=combinations_df_hl['win_rate'],
                     orientation='h', width=0.4, showlegend=False), row=1, col=1)
fig.add_trace(go.Bar(y=combinations_df_hl['combined_ids'],
                     x=combinations_df_hl['counts'],
                     orientation='h',
                     width=0.4, showlegend=False), row=1, col=2)
fig.layout.height = 500
fig.update_traces(width=0.9)
fig['layout']['xaxis']['autorange'] = 'reversed'
fig['layout']['yaxis']['autorange'] = 'reversed'
fig['layout']['yaxis2']['autorange'] = 'reversed'
fig['layout']['yaxis']['showticklabels'] = False
fig['layout']['yaxis2']['showticklabels'] = True
fig.update_layout(showlegend=False,
                  title_text="Figure 6. Hero duos with highest and lowest win rate")

fig.show()

### Win prediction based on team draft
The collected data was used to train and test a model that predicts the outcome of a match based on the hero draft selected by the players. Two different algorithms were tested for this: Logistic Regression (LR) and Random Forest (RF). 
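To make the modelling steps concrete, here is a minimal sketch using scikit-learn directly. It is an assumption-laden stand-in: `DotaPredictionModels` is the author's wrapper and its internals are not shown in this notebook, the data below is synthetic (20 placeholder heroes instead of 124), and the feature selection is fitted on the training split only, which may differ from the order of steps used in the wrapper.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_teams, n_heroes = 400, 20  # synthetic sizes; the real dataset has 124 heroes

# One-hot draft matrix: each row marks the five heroes picked by one team.
X = np.zeros((n_teams, n_heroes), dtype=int)
for row in X:
    row[rng.choice(n_heroes, size=5, replace=False)] = 1

# Synthetic outcome: hero 0 raises and hero 1 lowers the chance of winning.
logit = 1.5 * X[:, 0] - 1.5 * X[:, 1]
y = (rng.random(n_teams) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Recursive feature elimination keeps the hero columns the model finds most useful.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5)
selector.fit(X_train, y_train)

clf = LogisticRegression(max_iter=1000)
clf.fit(selector.transform(X_train), y_train)
score = clf.score(selector.transform(X_test), y_test)  # accuracy on held-out teams

print(f"Selected hero columns: {np.flatnonzero(selector.support_).tolist()}")
print(f"Held-out accuracy: {score:.3f}")
```

Swapping `LogisticRegression` for `sklearn.ensemble.RandomForestClassifier` in the final fit would give the Random Forest counterpart on the same selected features.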



In [2]:
model_dataset = data.combine_teams(enemy=False)
model_dataset.drop(columns=['match_seq_num', 'start_time', 'avg_mmr', 'num_mmr', 
                   'lobby_type', 'game_mode', 'avg_rank_tier', 'num_rank_tier',
                   'cluster', 'radiant_team', 'dire_team',], inplace=True)
models_results = pd.DataFrame()
models_results['model'] = ['logistic regression', 'random forest']

### for the general dataset
all_ranks = model_dataset.copy()
all_ranks.drop(columns=['duration', 'dota rank', 'Team name', 'radiant_win'], inplace=True)
all_ranks.set_index('match_id', inplace=True) 
### split xy
data_y = list(all_ranks['Team win'])
data_x = all_ranks.copy()
data_x.drop(columns='Team win', inplace=True)
### train model
model = DotaPredictionModels(data_x=data_x, data_y=data_y, method='logistic')
model.rec_feat_elim(feat2keep=30)
model.hero_significanceself()
selected_features = model.select_features
model.split_dataset()
model.train_model()
model.test_model()
print(f'Logistic regression general score is: {model.score}')

# model = DotaPredictionModels(data_x=data_x, data_y=data_y, method='forest')
# # model.rec_feat_elim(feat2keep=30)
# # model.hero_significanceself()
# model.select_features = selected_features
# model.split_dataset()
# model.train_model()
# model.test_model()
# print(f'Random forest general score is: {model.score}')



Reducing number of features
calculating features significance
fitting logistic regression
Logistic regression general score is: 0.5843244783135104
Reducing number of features
calculating features significance
fitting random forest


In [None]:
model.select_features

In [None]:

### model by rank
for rank in list(data.dota_ranks.keys()):
    # rank = list(data.dota_ranks.keys())[0]
    rank_data = model_dataset.loc[model_dataset['dota rank'] == rank]
    if rank_data.shape[0] > 10:
        rank_data = rank_data.copy()
        rank_data.drop(columns=['duration', 'dota rank', 'Team name', 'radiant_win'], inplace=True)
        rank_data.set_index('match_id', inplace=True) 
        ### split xy
        data_y = list(rank_data['Team win'])
        data_x = rank_data.copy()
        data_x.drop(columns='Team win', inplace=True)

        model = DotaPredictionModels(data_x=data_x, data_y=data_y, method='logistic')
        model.rec_feat_elim(feat2keep=30)
        model.hero_significanceself()
        model.split_dataset()
        model.train_model()
        model.test_model()
        print(f'Logistic regression score for rank {rank} is: {model.score}')
