In [371]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
import data_cleaning as dc
import analytics as ana
import matplotlib.pyplot as plt
from itertools import combinations

In [2]:
pd.set_option('display.max_rows', 999)
pd.set_option('display.max_columns', 999)
d1 = {a: a for a in xrange(24)}
d2 = {a: a-1 for a in xrange(25, 113)}
correction_mapping = dict(d1.items()+d2.items())

In [30]:
# Get players dataframe, this gives me all hero picks
# Filling NaNs with 0 makes perfect sense here
# All NaNs basically say that a player did not commit a certain action
players_df = pd.read_csv('dota-2-matches/players.csv').fillna(0)
# Get hero names, which allows me to parse nice names for my hero selection dataframe
heros_chart = pd.read_csv('dota-2-matches/heros_chart_corrected.csv')
# Get match info, mainly used for getting target labels for now
games_df = pd.read_csv('dota-2-matches/match.csv')

In [447]:
players_df = dc.assign_player_team(players_df)
hero_selection_df = dc.construct_hero_selection_df(players_df, heros_chart)

In [448]:
X = hero_selection_df.values
y = games_df.radiant_win.astype(int).values
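
The helper `dc.construct_hero_selection_df` is not shown in this notebook. As a rough sketch of the kind of matrix it could produce, the block below one-hot encodes hero picks per side on a toy table. The column names (`match_id`, `hero_id`, `team`) and the one-column-per-side-and-hero layout are assumptions for illustration, not the author's implementation.

```python
import pandas as pd

# Toy picks: 2 matches, 2 players per side (real matches have 5 per side).
picks = pd.DataFrame({
    'match_id': [0, 0, 0, 0, 1, 1, 1, 1],
    'hero_id':  [1, 2, 3, 4, 2, 3, 1, 4],
    'team':     ['radiant', 'radiant', 'dire', 'dire',
                 'radiant', 'radiant', 'dire', 'dire'],
})

# Label each pick with its side, e.g. 'radiant_1' or 'dire_3'.
picks['feature'] = picks['team'] + '_' + picks['hero_id'].astype(str)

# One row per match, one 0/1 column per (side, hero) pair seen in the data.
hero_selection = pd.crosstab(picks['match_id'], picks['feature'])
print(hero_selection)
```

Keeping Radiant and Dire picks in separate columns lets a linear model learn a different weight for a hero depending on which side picked it.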

# Base Models

In [449]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Logistic Regression
The Logistic Regression model performed better than I expected. The original study performed about 10% better, but then again it only used ranked matches from the top 8% of players. This is good news for me. I believe the performance gap is most likely caused by interactions between hero selections and other features within the game.

In [450]:
# Logistic Regression
lr = LogisticRegression()
print 'CV score for Logit: ', cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for Logit:  0.600525296242
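
To put an accuracy of about 0.60 in context, it helps to know what always guessing the more frequent outcome would score. A minimal sketch with made-up labels follows; in the notebook the real labels are `y`, built from `games_df.radiant_win`, and `y_toy` is a hypothetical stand-in.

```python
import numpy as np

# Hypothetical labels: 1 = Radiant win, 0 = Dire win.
y_toy = np.array([1, 1, 0, 1, 0, 1, 0, 1, 1, 0])

# Accuracy obtained by always predicting the more frequent class.
radiant_rate = y_toy.mean()
majority_baseline = max(radiant_rate, 1 - radiant_rate)
print('Majority-class baseline: %.3f' % majority_baseline)
```

Running the same two lines on the real `y` gives the number the cross-validation scores should be compared against.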


### Random Forest
I am surprised that the Random Forest Classifier did not perform as well as Logistic Regression, given its excellent track record in my case studies. However, given my feature construction and the way DotA 2 is played, Random Forest may be a sub-optimal classifier when hero selection is my only feature matrix.

In [36]:
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=150, criterion='entropy', n_jobs=-1)
print 'CV score for RF: ', cross_val_score(rf, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for RF:  0.581874919769


### K-Nearest Neighbors
This, following the results from Logistic Regression, is somewhat expected. The original study had KNN performing marginally better than Logistic Regression. Here, however, KNN is barely better than chance. That, again, supports the idea that hero selections interact with user/game features.

In [37]:
# K-Nearest Neighbors
knn = KNeighborsClassifier(n_jobs=-1)
print 'CV score for KNN: ', cross_val_score(knn, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for KNN:  0.528375556609


# Advanced Models
Allow me to preface this section by dropping KNN from future testing. As I add more features beyond what I have so far, KNN will get worse in both prediction time and accuracy. Every added feature makes the distance calculation more expensive. Moreover, any distance-based method is prone to the curse of dimensionality. KNN therefore seems like a less reasonable choice, and you will not see it beyond this section.
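
The curse of dimensionality can be illustrated with random points: as the number of dimensions grows, the nearest and farthest neighbours of a query end up almost equally far away, so "nearest" carries less information. This is a sketch on uniform random data, not on the DotA 2 feature matrix, and `relative_contrast` is a helper name made up for this example.

```python
import numpy as np

rng = np.random.RandomState(0)

def relative_contrast(n_dims, n_points=500):
    """(farthest - nearest) / nearest distance from one query to random points."""
    points = rng.rand(n_points, n_dims)
    query = rng.rand(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# The contrast shrinks as dimensions are added.
for n_dims in (2, 20, 200, 2000):
    print('%5d dims: contrast = %.3f' % (n_dims, relative_contrast(n_dims)))
```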

### 1st Engineered Feature(s): 10 minute benchmark for every player
I believe that, by cutting off my observations at the 10-minute mark, I am preventing leaks. My job is essentially to predict the outcome of the game by looking only at its first 10 minutes, so this does not violate the condition I imposed upon my study. The benchmark consists of each player's max running gold, number of last hits and experience level before the 10-minute mark. Luckily, these can all be calculated with a split-apply-max algorithm. The choice of the running MAX for gold is intentional: the gold amount recorded in player_time.csv is the gold owned at the recorded time, and the early game involves lots of purchases. A higher max running gold amount implies that a player could potentially purchase better items within that time period. Using the gold exactly AT the 10-minute mark will not necessarily capture that information. Experience and number of last hits are non-decreasing, so their running max is equivalent to the record at the 10-minute mark.
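
The split-apply-max idea can be sketched with a pandas `groupby`. The long-format layout and the column names below (`times`, `gold`, `lh`, `xp`) are assumptions for illustration, as is treating the 600-second snapshot as inclusive; the actual reshaping happens inside `dc.construct_x_seconds_df` and `dc.construct_x_seconds_max_wealth`, which are not shown here.

```python
import pandas as pd

# Hypothetical long-format snapshots: one row per (match, player, time in seconds).
snapshots = pd.DataFrame({
    'match_id':    [0, 0, 0, 0, 0, 0],
    'player_slot': [0, 0, 0, 1, 1, 1],
    'times':       [540, 600, 660, 540, 600, 660],
    'gold':        [900, 400, 1500, 700, 800, 950],  # gold in hand drops after a purchase
    'lh':          [30, 34, 40, 10, 12, 15],
    'xp':          [2000, 2300, 2700, 1500, 1700, 1900],
})

# Split: keep snapshots up to the 10-minute mark and group them by player.
# Apply: take the maximum of each column within every group.
early = snapshots[snapshots['times'] <= 600]
ten_min_max = early.groupby(['match_id', 'player_slot'])[['gold', 'lh', 'xp']].max()
print(ten_min_max)
```

Player 0 holds 400 gold at exactly 600 seconds but held 900 a minute earlier; the running max keeps the 900, which is the purchase effect described above.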

In [451]:
# This dataframe gives gold, last hits and experience at 60-second intervals
players_time_df = pd.read_csv('dota-2-matches/player_time.csv')

In [453]:
# Construct the time bracket data, which cuts off everything after the time threshold
ten_min_player_df = dc.construct_x_seconds_df(players_time_df, threshold=600)
# Save the dataframe for future use, also saves memory usage
ten_min_player_df.to_csv('dota-2-matches/ten_min_player_time.csv', index=False)
# Construct each player's running max of gold, last hits and experience within the first 10 minutes
ten_min_max_wealth = dc.construct_x_seconds_max_wealth(ten_min_player_df)
# Combine that info with my hero_selection_df
first_adv_feature_mat = hero_selection_df.join(ten_min_max_wealth)
X = first_adv_feature_mat.values

In [454]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Logistic Regression
Logistic regression benefited quite a bit from the addition of 10-minute benchmarks for every player. It goes to show that even the first ten minutes can be quite telling about the outcome of a game. While I fit the same feature matrix with a random forest classifier, I will look for other information that can be extracted from the first 10 minutes.

In [455]:
# Logistic Regression
lr = LogisticRegression()
print 'CV score for Logit: ', cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for Logit:  0.686724754547


### Random Forest
Again, random forest performed worse than logistic regression. The addition of time info does not capture all interactions between player skills and hero selection. Since the added benchmarks consist of information on gold, last hits and experience for all 10 players within a match, splitting on a single player's information probably would not accomplish much.

In [42]:
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=150, criterion='entropy', n_jobs=-1)
print 'CV score for RF: ', cross_val_score(rf, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for RF:  0.651875546992


### 2nd Engineered Feature(s): 10 minute gold growth benchmarks
My motivation for investigating gold growth is two-fold. First, the first 10 minutes of the game turned out to be richer in information than I thought, and I want to investigate this timeframe further. Second, the max is a point estimate, so it does not necessarily capture how a player's gold changes over those minutes. Hence I am calculating the mean and standard deviation of gold growth (calculated on a minute-to-minute basis) for each player within the first 10 minutes. Each minute-to-minute growth value is adjusted by subtracting 100 gold/min, since this is the gold growth even if a player stays idle. The mean and standard deviation are calculated after this adjustment.
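
A minimal sketch of the gold growth benchmark on toy data. The column names are hypothetical, and the real computation lives in `dc.construct_x_seconds_gold_growth_benchmark`, which may differ in detail (for example in how it handles the first snapshot).

```python
import pandas as pd

# Hypothetical gold readings at 60-second intervals for two players.
snapshots = pd.DataFrame({
    'player_slot': [0] * 5 + [1] * 5,
    'times':       [0, 60, 120, 180, 240] * 2,
    'gold':        [0, 150, 320, 420, 700,   # active player
                    0, 100, 200, 300, 400],  # idle player, passive income only
})

# Minute-to-minute growth per player, minus the 100 gold/min earned while idle.
snapshots['growth'] = snapshots.groupby('player_slot')['gold'].diff() - 100

# Mean and standard deviation of the adjusted growth (NaN first rows are skipped).
benchmark = snapshots.groupby('player_slot')['growth'].agg(['mean', 'std'])
print(benchmark)
```

With this adjustment an idle player scores a mean of 0, so positive values reflect gold that was actually earned through play. Note that pandas computes the sample standard deviation (`ddof=1`) by default.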

In [456]:
# Construct the gold growth benchmark within the first 10 minutes of the game for every player
# The gold_growth_benchmark contains both mean and standard deviation of gold growths
gold_growth_benchmark = dc.construct_x_seconds_gold_growth_benchmark(ten_min_player_df)
# Combine the info with the first advanced feature matrix from above
second_adv_feature_mat = first_adv_feature_mat.join(gold_growth_benchmark)
X = second_adv_feature_mat.values

In [457]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)

### Logistic Regression
I guess it should not come as a surprise that logistic regression barely improved. The previous max_wealth dataframe, which contains the max gold, number of last hits and experience, has already told most of the story. It seems that there is not much to be gained from the new metrics, which deal mainly with gold growth for each player.

In [458]:
# Logistic Regression
lr = LogisticRegression()
print 'CV score for Logit: ', cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for Logit:  0.689149394655
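
The score moved from about 0.687 to 0.689 between feature sets, and every experiment in this notebook draws a fresh random `train_test_split`. A change this small may therefore be within split-to-split noise. One way to make such comparisons easier to read is to fix the folds and report the spread of the fold scores along with the mean. The sketch below uses synthetic data from `make_classification` as a stand-in for the real feature matrix; the seed and sample sizes are arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the real feature matrix and labels.
X_toy, y_toy = make_classification(n_samples=500, n_features=20, random_state=0)

# Fixed folds, so that different feature sets are scored on identical splits.
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy,
                         scoring='accuracy', cv=folds)
print('mean = %.3f, std = %.3f' % (scores.mean(), scores.std()))
```

If the difference between two feature sets is smaller than the standard deviation across folds, it is hard to claim that either one is better.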


### Random Forest
Random Forest performed even worse than before. Again, this is probably because the features are too spread out across individual players for decision trees to make any meaningful splits. In that sense, the myriad of features actually hurts the accuracy of the random forest classifier.

In [46]:
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=150, criterion='entropy', n_jobs=-1)
print 'CV score for RF: ', cross_val_score(rf, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for RF:  0.647150091138


### 3rd Engineered Feature(s): net death counts from team fights before the 10-minute mark
The motivation for this feature is to tap into information other than the amount of gold owned by each player within the first 10 minutes. For that I turn to teamfight information from the first 10 minutes. I want to get the death counts for both Radiant and Dire before the 10-minute mark. Although the death of an enemy player rewards the killer and nearby teammates with gold, death itself carries information about the interaction between teams, whose outcome does not rely solely on the gold difference in the early game.

In [459]:
# Deleted to save memory, also because second advanced feature matrix did not do well at all
del second_adv_feature_mat

In [55]:
team_fights_df = pd.read_csv('dota-2-matches/teamfights.csv')
num_team_fights_before_ten_min_df = dc.construct_num_team_fights(team_fights_df)
# Get the number of team fights before 10 minute mark for every match from above
# Get information on those team fights down here
teamfight_players_df = pd.read_csv('dota-2-matches/teamfights_players.csv')

In [57]:
# Extremely heavy computation, DAMN YOU FOR LOOPS
net_death_count_before_ten_min_df = dc.construct_net_death_count_from_teamfights(teamfight_players_df, 
                                                                                 num_team_fights_before_ten_min_df)

In [62]:
# Saved for future use, also because the current implementation of the function is very inefficient
# Will improve later
net_death_count_before_ten_min_df.to_csv('dota-2-matches/net_death_count_before_ten_min.csv')
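
One possible way to avoid the explicit loops is a grouped sum. The sketch below assumes the teamfight rows have already been filtered to fights before the 10-minute mark and that each row carries a `team` label; both the column names and the sign convention of `net_deaths` are assumptions for illustration, not the behaviour of `dc.construct_net_death_count_from_teamfights`.

```python
import pandas as pd

# Hypothetical per-player teamfight rows, already restricted to fights before 600 s.
fight_rows = pd.DataFrame({
    'match_id': [0, 0, 0, 0, 1, 1],
    'team':     ['radiant', 'radiant', 'dire', 'dire', 'radiant', 'dire'],
    'deaths':   [1, 0, 2, 1, 0, 1],
})

# Sum deaths per match and side in one grouped pass, then put sides in columns.
deaths = (fight_rows.groupby(['match_id', 'team'])['deaths']
          .sum()
          .unstack(fill_value=0))

# Net deaths from Radiant's point of view: positive means Dire died more often.
deaths['net_deaths'] = deaths['dire'] - deaths['radiant']
print(deaths)
```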

In [58]:
third_adv_feature_mat = first_adv_feature_mat.join(net_death_count_before_ten_min_df)
X = third_adv_feature_mat.values

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Logistic Regression
At this point, I believe my model's performance is being bottlenecked. Logistic regression performed more or less on the same level as before. The information space of the first 10 minutes of every game is being exhausted, which is why the cross-validation accuracy barely moves. If random forest also fails to yield any significant improvement, that should corroborate this claim.

In [60]:
# Logistic Regression
lr = LogisticRegression()
print 'CV score for Logit: ', cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for Logit:  0.685724434303


### Random Forest
Here ends my search for features within the first 10 minutes of all matches. Random forest did not see a significant improvement to its performance either. From this point on, I will attempt to go back to game-based features. My next feature will be the team compositions. Since each hero takes on different roles, a team's composition can be potentially useful for predicting match outcomes.

In [61]:
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=150, criterion='entropy', n_jobs=-1)
print 'CV score for RF: ', cross_val_score(rf, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for RF:  0.649798619947


### 4th Engineered Feature(s): Role Compositions
The motivation for this feature is simple. Every hero takes on different roles. If hero selection matters, then it should not just be about counters, but also how each hero operates throughout the game. Following this line of logic, I am going to first scrape the official DotA 2 website for hero roles and construct a hero attribute dataframe, then get the hero composition for each team in every game and see if my model gets higher accuracies.

### Part 1

In [63]:
# The hero index mapping is a dictionary that has hero name as keys and index as values
hero_index_mapping = dc.get_hero_index_mapping(heros_chart)
# Scrape the website for hero role information, and return a dictionary with hero names as keys and their list of roles as values
hero_roles = dc.construct_hero_roles(hero_index_mapping)
# Compile all possible roles as features, give each hero 1 for taking on a role, and 0 otherwise
hero_attributes = dc.construct_hero_attribute_df(hero_roles)
# For each team in every game, sum hero attributes for selected heros (hence max value in any team role is 5)
# Roles under consideration are: Carry, Support, Disabler, Initiator. The rest are dropped during construction.
hero_composition_df = dc.construct_hero_composition_df(players_df, hero_attributes)
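
The team composition step can be sketched as a join followed by a grouped sum. The toy role table, the `team` column and the way the column names are built are assumptions for illustration; the notebook's own version is `dc.construct_hero_composition_df`, which is not shown here.

```python
import pandas as pd

# Hypothetical 0/1 role table indexed by hero_id.
hero_roles = pd.DataFrame(
    {'Carry': [1, 0, 1, 0], 'Support': [0, 1, 1, 1]},
    index=pd.Index([1, 2, 3, 4], name='hero_id'),
)

# Hypothetical picks for one match (2 heroes per side to keep it short).
picks = pd.DataFrame({
    'match_id': [0, 0, 0, 0],
    'team':     ['radiant', 'radiant', 'dire', 'dire'],
    'hero_id':  [1, 3, 2, 4],
})

# Attach each hero's roles, then sum the role flags per match and side.
with_roles = picks.join(hero_roles, on='hero_id')
team_roles = with_roles.groupby(['match_id', 'team'])[['Carry', 'Support']].sum()

# Move the side into the column names, e.g. 'Carry_radiant', 'Support_dire'.
composition = team_roles.unstack('team')
composition.columns = ['%s_%s' % (role, team) for role, team in composition.columns]
print(composition)
```

With five heroes per side and 0/1 role flags, no entry in such a table can exceed 5, which matches the comment in the cell above.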

In [64]:
# It makes sense to have interaction between carries and supports, so this cell handles that
# Gives two new columns, each representing the quantity of support per unit of carry
hero_composition_df['Carry_Support_radiant'] = hero_composition_df['Support_radiant'] / (hero_composition_df['Carry_radiant']+0.01)
hero_composition_df['Carry_Support_dire'] = hero_composition_df['Support_dire'] / (hero_composition_df['Carry_dire']+0.01)

In [65]:
fourth_adv_feature_mat = first_adv_feature_mat.join(hero_composition_df)
X = fourth_adv_feature_mat.values

In [66]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Logistic Regression
This is actually pretty disappointing. My Logistic Regression's performance is still bottlenecked. At this point, I believe that I am not using my domain knowledge of the game enough. DotA 2's official role assignment is not necessarily accurate. Almost every hero takes on more than 3 roles, so the separation of duties is not as clear-cut as the model would like it to be. Moreover, I assigned 1 for simply taking on a role, while in reality each hero takes on its roles to differing degrees. My current assignment fails to capture that. I will hard code a version of hero role information based on domain knowledge and a bit of research to address this issue in the second part.

In [67]:
# Logistic Regression
lr = LogisticRegression(penalty='l1', C=10)
print 'CV score for Logit: ', cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for Logit:  0.689724967356


### Random Forest
The Random Forest continues to perform worse than the Logistic Regression. I've mentioned 2 potential problems above. Let's try to address them next.

In [68]:
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=150, criterion='entropy', n_jobs=-1)
print 'CV score for RF: ', cross_val_score(rf, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for RF:  0.650675572452


### Part 2

In [460]:
# This is the hardcoded version of hero_attributes based on domain knowledge and a bit of research into each hero
# Every hero has its role values sum up to 1 (still, max value in a role for each team is 5)
# However, we now have a better handle on how deep into a role a hero is
hero_attribute_df = dc.construct_hard_coded_hero_attribute_df()
hero_composition_df = dc.construct_hero_composition_df(players_df, hero_attribute_df)
# Again, interaction terms are worth looking into
hero_composition_df['Carry_Support_radiant'] = hero_composition_df['Carry_radiant'] * hero_composition_df['Hard_Support_radiant']
hero_composition_df['Carry_Support_dire'] = hero_composition_df['Carry_dire'] * hero_composition_df['Hard_Support_dire']
fourth_adv_feature_mat = first_adv_feature_mat.join(hero_composition_df)
X = fourth_adv_feature_mat.values

In [461]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

### Logistic Regression

In [462]:
# Logistic Regression
lr = LogisticRegression(penalty='l1', C=10)
print 'CV score for Logit: ', cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for Logit:  0.688225560961


### Random Forest

In [97]:
# Random Forest Classifier
rf = RandomForestClassifier(n_estimators=150, criterion='entropy', n_jobs=-1)
print 'CV score for RF: ', cross_val_score(rf, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for RF:  0.651375


### Extra: Firstblood status
I am only looking at first kills that take place within the first 10 minutes. In the pro scene, first bloods usually do not carry many implications. In lower brackets, however, a first kill can affect players' mood and possibly the outcome of the game. This is my motivation for looking into first bloods.

In [463]:
objectives_df = pd.read_csv('dota-2-matches/objectives.csv').drop('key', axis=1)
ten_min_firstblood = dc.construct_ten_min_firstblood_df(objectives_df)

In [464]:
fifth_adv_feature_mat = ten_min_firstblood.join(fourth_adv_feature_mat, on='match_id', how='right')\
                        .sort_values(by='match_id').set_index('match_id').fillna(-1)
dummies = pd.get_dummies(fifth_adv_feature_mat.radiant_firstblood, prefix='firstblood')\
          .drop('firstblood_-1.0', axis=1).rename(columns={'firstblood_1.0':'Radiant_Firstblood',
                                                           'firstblood_0.0':'Dire_Firstblood'})
fifth_adv_feature_mat = fifth_adv_feature_mat.join(dummies).drop('radiant_firstblood', axis=1)
X = fifth_adv_feature_mat.values
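
The encoding above turns one three-state flag (Radiant first blood, Dire first blood, or none within 10 minutes) into two 0/1 columns. A minimal sketch of the same idea using plain comparisons instead of `get_dummies`, on a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical flag per match: 1.0 = Radiant drew first blood, 0.0 = Dire did,
# NaN = no first blood recorded within the first 10 minutes.
radiant_firstblood = pd.Series([1.0, 0.0, np.nan, 1.0], name='radiant_firstblood')

# NaN compares unequal to everything, so matches without an early first blood
# get 0 in both columns.
encoded = pd.DataFrame({
    'Radiant_Firstblood': (radiant_firstblood == 1).astype(int),
    'Dire_Firstblood':    (radiant_firstblood == 0).astype(int),
})
print(encoded)
```

The "no first blood" case is the implicit reference level, which plays the same role as dropping the `firstblood_-1.0` column in the cell above.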

In [465]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [466]:
# Logistic Regression
lr = LogisticRegression(penalty='l1', C=10, fit_intercept=False)
print 'CV score for Logit: ', cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for Logit:  0.685700346703


### Extra 2: Role Checks
Check whether a team has too many carries (three or more). Assign 1 if it does and 0 otherwise.
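
As a sketch of this check, assuming a table with one primary role per hero and a `team` label (hypothetical column names; the notebook's version is `dc.construct_role_check_df`):

```python
import pandas as pd

# Hypothetical table with one primary role per hero in a single match.
game_heroes = pd.DataFrame({
    'match_id': [0] * 10,
    'team':     ['radiant'] * 5 + ['dire'] * 5,
    'role':     ['Carry', 'Carry', 'Carry', 'Mid', 'Hard_Support',
                 'Carry', 'Mid', 'Offlane', 'Support', 'Hard_Support'],
})

# Count carries per match and side, then flag teams with three or more.
game_heroes['is_carry'] = game_heroes['role'] == 'Carry'
carry_counts = game_heroes.groupby(['match_id', 'team'])['is_carry'].sum()
too_many_carries = (carry_counts >= 3).astype(int)
print(too_many_carries)
```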

In [468]:
game_hero_info = dc.get_game_hero_info(players_df, hero_attribute_df)
role_checks = dc.construct_role_check_df(game_hero_info)
sixth_adv_feature_mat = fourth_adv_feature_mat.join(role_checks)
X = sixth_adv_feature_mat.values

In [469]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [470]:
# Logistic Regression
lr = LogisticRegression(penalty='l1', C=10)
print 'CV score for Logit: ', cross_val_score(lr, X_train, y_train, scoring='accuracy', cv=10, n_jobs=-1).mean()

CV score for Logit:  0.688374796191


### 5th Engineered Feature(s): Interaction between Role and Gold
It is often the case that when a team's support does extremely well in the early game, the team goes on to win. This offers a potential angle of investigation that is neither purely game-based nor purely player-based. Rather, it is based on interactions between player skills and their heroes' roles. The initial check looks at each hero's max gold difference from the mean max gold of all 10 players within the first ten minutes of a match.
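
The difference from the match mean can also be sketched with `groupby(...).transform`, which returns a result aligned with the original rows. The toy table below is made up, with three players per match for brevity.

```python
import pandas as pd

toy = pd.DataFrame({
    'match_id': [0, 0, 0, 1, 1, 1],
    'role':     ['Carry', 'Mid', 'Hard_Support', 'Carry', 'Mid', 'Hard_Support'],
    'max_gold': [3000, 3300, 1800, 2500, 2000, 1500],
})

# transform('mean') repeats each match's mean on every row of that match,
# so the subtraction lines up row by row without a join.
match_mean = toy.groupby('match_id')['max_gold'].transform('mean')
toy['max_gold_diff_from_mean'] = toy['max_gold'] - match_mean
print(toy)
```

A positive value means the hero out-earned the average player in that match; a support with a positive value is the situation described above.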

In [471]:
long_player_gold = dc.construct_long_player_gold_df(ten_min_max_wealth)
hero_gold_info = game_hero_info.merge(long_player_gold, on=['match_id', 'player_slot'])
hero_gold_avg_comp = hero_gold_info.groupby('match_id')['max_gold'].apply(lambda x: x-x.mean())
role_gold_comp = hero_gold_info.join(hero_gold_avg_comp, rsuffix='_diff_from_mean')

In [472]:
role_gold_comp.head()

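|   | match_id | hero_id | player_slot | radiant_player | role | max_gold | max_gold_diff_from_mean |
|---|----------|---------|-------------|----------------|------|----------|-------------------------|
| 0 | 0 | 85 | 0 | True | Hard_Support | 2211 | -600.5 |
| 1 | 0 | 50 | 1 | True | Mid | 3379 | 567.5 |
| 2 | 0 | 82 | 2 | True | Hard_Support | 1650 | -1161.5 |
| 3 | 0 | 11 | 3 | True | Carry | 2859 | 47.5 |
| 4 | 0 | 66 | 4 | True | Hard_Support | 3745 | 933.5 |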
