# Feature Engineering

The goal here is to generate features for each match that capture its unique characteristics. In particular, we want to capture the meta-game (*meta*), which here means the optimal strategies for a particular period of `DotA 2`. This means capturing features such as drafts, items, timings, and outcomes.

I could have used the hidden embeddings from project 1 below, but neither of those projects fully considers the meta.

A lot of the following will re-use code/ideas from my previous two DotA projects. They are undocumented, but please take a look if you are interested:
1. https://github.com/beelze-b/Open-Dota-Exploration
2. https://github.com/beelze-b/Dota-ItemSequence

In [1]:
import numpy as np
import pandas as pd

### Remembering what is inside a match

I haven't used the Open-Dota API in a while, so this is a refresher.

In [2]:
ti8 = np.load('data/ti8_data.npy').item()
match = ti8[ti8.keys()[0]]

In [3]:
ti8.keys()[0]

4074748928

In [4]:
match.keys()

[u'replay_url',
 u'dire_team_id',
 u'barracks_status_dire',
 u'match_id',
 u'radiant_win',
 u'barracks_status_radiant',
 u'dire_team',
 u'cluster',
 u'replay_salt',
 u'first_blood_time',
 u'chat',
 u'dire_score',
 u'duration',
 u'game_mode',
 u'skill',
 u'human_players',
 u'tower_status_dire',
 u'teamfights',
 u'series_type',
 u'objectives',
 u'all_word_counts',
 u'draft_timings',
 u'version',
 u'cosmetics',
 u'leagueid',
 u'start_time',
 u'engine',
 u'radiant_score',
 u'lobby_type',
 u'picks_bans',
 u'stomp',
 u'match_seq_num',
 u'series_id',
 u'radiant_team',
 u'tower_status_radiant',
 u'negative_votes',
 u'comeback',
 u'league',
 u'radiant_team_id',
 u'positive_votes',
 u'my_word_counts',
 u'radiant_xp_adv',
 u'region',
 u'patch',
 u'radiant_gold_adv',
 u'players']

In [5]:
match['players'][0].keys()

[u'purchase_time',
 u'lane_efficiency',
 u'gold',
 u'sentry_kills',
 u'lane_efficiency_pct',
 u'firstblood_claimed',
 u'lane_pos',
 u'damage_inflictor_received',
 u'lh_t',
 u'repicked',
 u'is_contributor',
 u'damage_taken',
 u'kill_streaks',
 u'cosmetics',
 u'rank_tier',
 u'hero_id',
 u'necronomicon_kills',
 u'kills_log',
 u'sentry_uses',
 u'account_id',
 u'abandons',
 u'kills_per_min',
 u'start_time',
 u'isRadiant',
 u'backpack_1',
 u'backpack_0',
 u'leaver_status',
 u'actions',
 u'killed',
 u'stuns',
 u'gold_per_min',
 u'name',
 u'level',
 u'last_hits',
 u'actions_per_min',
 u'damage_inflictor',
 u'purchase_tpscroll',
 u'patch',
 u'item_4',
 u'item_5',
 u'item_2',
 u'item_3',
 u'item_0',
 u'lose',
 u'sen_left_log',
 u'buyback_count',
 u'obs_placed',
 u'match_id',
 u'kda',
 u'pings',
 u'gold_reasons',
 u'ability_targets',
 u'total_gold',
 u'item_uses',
 u'duration',
 u'sen',
 u'lobby_type',
 u'denies',
 u'ancient_kills',
 u'killed_by',
 u'neutral_kills',
 u'permanent_buffs',
 u'last_l

#### Some Immediate Features of Interest

`radiant_win`, because it records who wins the match.

`draft_timings` looks good too because hero selection is very intimately tied to the meta. `picks_bans` seems to contain less information.

The statuses of the various barracks and towers look good too: does the meta prefer ratting or other methods of sieging bases in response to teamfights (or the lack of them)? Ratting, or backdooring, refers to attacking the enemy base while ignoring the enemy players or while they are away from their base. Tower status uses a 16-bit encoding. Barracks status uses an 8-bit encoding.
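
A minimal sketch of how the status values could be turned into features. It assumes that each building is one bit in the status integer and that a set bit means the building is still standing; which bit belongs to which building is not covered here. The function names and the example values are made up for illustration.

```python
def count_standing(status):
    """Count the set bits in a building status bitmask."""
    return bin(status).count("1")


def decode_status(status, n_bits):
    """Return one 0/1 flag per bit, least significant bit first."""
    return [(status >> bit) & 1 for bit in range(n_bits)]


# Made-up example values, not taken from a real match.
tower_status = 0b0000011110000110
barracks_status = 0b00110011

print(count_standing(tower_status))        # 6
print(decode_status(barracks_status, 8))   # [1, 1, 0, 0, 1, 1, 0, 0]
```

The fixed-width list of flags can be used directly as binary features, so no extra one hot encoding is needed for these two fields.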

The last values `radiant_gold_adv` and `radiant_xp_adv` are good too: are matches pretty close?

`duration` will be useful to see if the meta prefers short or longer games.

`radiant_score` and `dire_score` for similar reasons.

`objectives` will be useful for the Roshan kill information. I am going to ignore the building information; it is most certainly useful, but I am pressed for time.
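
A rough sketch of counting Roshan kills from `objectives`. It assumes that `objectives` is a list of dicts and that Roshan kills are marked with a `type` of `CHAT_MESSAGE_ROSHAN_KILL`; that should be checked against the actual data before relying on it. The example list is made up.

```python
ROSHAN_KILL_TYPE = "CHAT_MESSAGE_ROSHAN_KILL"  # assumed event type


def count_roshan_kills(objectives):
    """Count the objective entries that look like Roshan kills."""
    return sum(1 for obj in (objectives or [])
               if obj.get("type") == ROSHAN_KILL_TYPE)


# Made-up example, shaped like the assumed structure.
example_objectives = [
    {"time": 95, "type": "CHAT_MESSAGE_FIRSTBLOOD"},
    {"time": 1210, "type": "CHAT_MESSAGE_ROSHAN_KILL"},
    {"time": 2105, "type": "CHAT_MESSAGE_ROSHAN_KILL"},
]

print(count_roshan_kills(example_objectives))  # 2
```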

`first_blood_time` will be included. I don't think it will carry much information, but it is low hanging fruit.

###### On a deeper level, let's look at the information inside `players`.

I need a function to recover each player's approximate position (1, 2, 3, 4, 5). I am going to ignore lane swaps and the possibility of a dual mid.

`hero_id` will be used.

`lane` is useful. Note that lanes are numbered 1, 2, 3 going up the map: Dire offlane/Radiant safelane is 1, mid lane is 2 for both teams, and Radiant offlane/Dire safelane is 3.
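
The lane numbering above can be written as a small helper that turns the raw `lane` value into a side-relative label. The function name and the label strings are made up for illustration, and any value other than 1, 2 or 3 is lumped into `"other"`.

```python
def lane_name(lane, is_radiant):
    """Translate the lane number into a label relative to the player's team."""
    if lane == 2:
        return "mid"
    if lane == 1:
        # Bottom lane: Radiant safelane, Dire offlane.
        return "safe" if is_radiant else "off"
    if lane == 3:
        # Top lane: Radiant offlane, Dire safelane.
        return "off" if is_radiant else "safe"
    return "other"


print(lane_name(1, True))    # safe
print(lane_name(1, False))   # off
print(lane_name(3, False))   # safe
```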

`lane_role` will be useful to differentiate between two players in a lane. 

`total_gold` and `total_xp` will be useful.

`purchases` will be useful to notice meta items. Will need to differentiate big purchases based on some minimum cost. Will ignore upgrades to items like repeated dagon purchases. Will ignore `divine rapier`. 
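
A sketch of the purchase filter described above. It assumes that the purchase log is a list of `{'time': ..., 'key': ...}` dicts and that an item cost table is available; `item_costs`, `min_cost` and the cost values below are placeholders for illustration, not real data. Repeated purchases of the same key are dropped, which only handles upgrades that show up under the same item name.

```python
def big_purchases(purchase_log, item_costs, min_cost=2000, ignored=("rapier",)):
    """Return (item, time) for the first purchase of each expensive item."""
    seen = set()
    result = []
    for entry in purchase_log or []:
        item = entry["key"]
        if item in ignored or item in seen:
            continue
        if item_costs.get(item, 0) >= min_cost:
            seen.add(item)
            result.append((item, entry["time"]))
    return result


# Placeholder costs and a made-up purchase log.
item_costs = {"tango": 90, "blink": 2250, "dagon": 2700, "rapier": 6000}
purchase_log = [
    {"time": -60, "key": "tango"},
    {"time": 600, "key": "blink"},
    {"time": 1200, "key": "dagon"},
    {"time": 1500, "key": "dagon"},
    {"time": 2400, "key": "rapier"},
]

print(big_purchases(purchase_log, item_costs))
# [('blink', 600), ('dagon', 1200)]
```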

Going to ignore abilities since heroes get reworked, and there is too much variety there to form a consistent data mining scheme. To explain: What is the difference between Invoker's and Ember's W? They both might get orchid for similar reasons though, which is why I look at purchases even though there is a lot of variety there.

The following `numerical` data points will also be used:
`ancient_kills`, `neutral_kills`, `lane_kills`, `last_hits`, `denies`, `tower_damage`, `hero_damage`, `hero_healing`, `damage_taken`, `sentry_kills`, `observer_kills`, `kills`, `deaths`, `assists`, `stuns`, 
`win`, `sen_placed`, `obs_placed`, `lane_efficiency_pct`, `camps_stacked`, `buyback_count`, and `teamfight_participation` 
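
A minimal sketch of pulling these values out of a player dict in a fixed order. Filling missing or non-scalar values with a default of `0.0` is an assumption for illustration, and `numeric_features` is a made-up name.

```python
import numbers

NUM_FEATURES = [
    'ancient_kills', 'neutral_kills', 'lane_kills', 'last_hits', 'denies',
    'tower_damage', 'hero_damage', 'hero_healing', 'damage_taken',
    'sentry_kills', 'observer_kills', 'kills', 'deaths', 'assists', 'stuns',
    'win', 'sen_placed', 'obs_placed', 'lane_efficiency_pct',
    'camps_stacked', 'buyback_count', 'teamfight_participation',
]


def numeric_features(player, names=NUM_FEATURES, default=0.0):
    """Return the player's numeric features as a list of floats."""
    values = []
    for name in names:
        value = player.get(name)
        # Anything that is missing or not a plain number gets the default.
        if isinstance(value, numbers.Number):
            values.append(float(value))
        else:
            values.append(default)
    return values


# Partial, made-up player dict.
example_player = {'ancient_kills': 7, 'assists': 19, 'stuns': 12.5, 'win': 1}
vector = numeric_features(example_player)
print(len(vector))   # 22
print(vector[0])     # 7.0
```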

###### Categorical and numerical features
Some of the features above are clearly categorical and some are numerical. I will try to use one hot encoding for the categorical features if the algorithm requires it. Numerical features will be max-min scaled on partitioned `TI` sets.
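
A small sketch of both transformations in plain Python. Scaling each `TI` partition on its own minimum and maximum is how I read the sentence above; the helper names and the numbers are made up for illustration.

```python
def min_max_scale(values):
    """Scale a list of numbers to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / float(hi - lo) for v in values]


def one_hot(value, categories):
    """Return a 0/1 list with a single 1 at the position of value."""
    return [1 if value == c else 0 for c in categories]


# Made-up last-hit counts, partitioned by tournament.
last_hits_by_ti = {"TI7": [100, 300, 200], "TI8": [50, 250]}
scaled = {ti: min_max_scale(vals) for ti, vals in last_hits_by_ti.items()}

print(scaled["TI7"])          # [0.0, 1.0, 0.5]
print(one_hot(2, [1, 2, 3]))  # [0, 1, 0]
```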

`patch` is NOT a good feature: it would be data leakage. Intuitively, there should be a nearly one-to-one correspondence between the meta and the patch (some patches might be similar in small aspects). We want to model characteristics inherent to the matches that tie to the overall game. The patch version, however, is a characteristic embedded in the overall game of **DotA 2**.

## Formalizing the Above

The strategy is to identify pos 5, pos 2, pos 1, pos 3, and pos 4 in that order, for Radiant and then for Dire.
For each of these players, collect the player features listed above. In addition, collect the match features.

One hot encode all the categorical features except for tower and barracks status. Max-Min scale all the numerical features. 

Draft timings need to be encoded. For Roshan, simply count the number of kills.
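
One possible way to carry out the identification order above for a single team, written as a heuristic sketch rather than a settled method. The rules are assumptions: pos 5 is the poorest player, pos 2 is the richest player with `lane_role == 2`, pos 1 is the richest of the rest, and pos 3 is taken to be richer than pos 4. It expects exactly five players, and the example team is made up.

```python
def assign_positions(team_players):
    """Map positions 1-5 to the five players of one team (heuristic)."""
    remaining = list(team_players)
    positions = {}

    def gold(player):
        return player["total_gold"]

    def take(player, pos):
        # Record the player and drop it from the pool by identity.
        positions[pos] = player
        remaining[:] = [p for p in remaining if p is not player]

    # pos 5: assumed to be the poorest player on the team.
    take(min(remaining, key=gold), 5)
    # pos 2: richest player with lane_role == 2, else richest remaining.
    mids = [p for p in remaining if p.get("lane_role") == 2]
    take(max(mids or remaining, key=gold), 2)
    # pos 1: richest player that is left.
    take(max(remaining, key=gold), 1)
    # pos 3 is assumed to be richer than pos 4.
    take(max(remaining, key=gold), 3)
    take(remaining[0], 4)
    return positions


# Made-up team.
team = [
    {"name": "a", "total_gold": 30000, "lane_role": 1},
    {"name": "b", "total_gold": 26000, "lane_role": 2},
    {"name": "c", "total_gold": 18000, "lane_role": 3},
    {"name": "d", "total_gold": 12000, "lane_role": 3},
    {"name": "e", "total_gold": 9000, "lane_role": 1},
]
positions = assign_positions(team)
ordered = [positions[pos]["name"] for pos in [5, 2, 1, 3, 4]]
print(ordered)  # ['e', 'b', 'a', 'c', 'd']
```

Running this once for the Radiant players and once for the Dire players gives a fixed player order for building the feature vector.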

In [6]:
## Draft Timings

def draftTimings(match_obj):
    pass

def parsePlayer(player_obj):
    pass

def identify_players(match_obj):
    radiant_players_idx = range(5)
    dire_players_idx = range(5, 10)
    positions_left_radiant = range(5)
    positions_left_dire = range(5)
    # identify 
    
# find which player had the highest net worth among those with lane_role 2
def identify_mid(players_obj):
    # filter the players passed in rather than the global match_players
    possible_mids = [p for p in players_obj if p['lane_role'] == 2]
    sorted_mids = sorted(possible_mids, key = lambda x: x['total_gold'], reverse = True)
    mid_player = sorted_mids[0]
    return mid_player

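# carry can be either in safe lane or offlane
# take the one with the highest gold that isn't mid
def identify_carry(players_obj):
    # exclude mid players so the code matches the comment above
    non_mids = [p for p in players_obj if p['lane_role'] != 2]
    return sorted(non_mids, key = lambda x: x['total_gold'], reverse = True)[0]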

In [7]:
player = match['players'][0]

In [8]:
player

{u'abandons': 0,
 u'ability_targets': {u'silencer_last_word': {u'npc_dota_hero_drow_ranger': 6,
   u'npc_dota_hero_enchantress': 13,
   u'npc_dota_hero_gyrocopter': 9,
   u'npc_dota_hero_phoenix': 6,
   u'npc_dota_hero_spirit_breaker': 5}},
 u'ability_upgrades_arr': [5377,
  5378,
  5377,
  5379,
  5377,
  5380,
  5377,
  5379,
  5379,
  6016,
  5379,
  5380,
  5378,
  5378,
  6008,
  5378,
  5380],
 u'ability_uses': {u'silencer_curse_of_the_silent': 61,
  u'silencer_global_silence': 11,
  u'silencer_last_word': 39},
 u'account_id': 19672354,
 u'actions': {u'1': 5132,
  u'10': 274,
  u'11': 15,
  u'13': 4,
  u'15': 26,
  u'16': 109,
  u'17': 2,
  u'19': 73,
  u'2': 109,
  u'20': 1,
  u'23': 1,
  u'24': 2,
  u'27': 1,
  u'3': 15,
  u'31': 1,
  u'33': 274,
  u'36': 3,
  u'4': 312,
  u'5': 199,
  u'6': 637,
  u'7': 6,
  u'8': 82,
  u'9': 15},
 u'actions_per_min': 167,
 u'additional_units': None,
 u'ancient_kills': 7,
 u'assists': 19,
 u'backpack_0': 46,
 u'backpack_1': 0,
 u'backpack_2': 

In [9]:
hero_id = player['hero_id']
is_radiant  = player['isRadiant']
lane = player['lane']
net_worth = player['total_gold']
num_feat = ['ancient_kills', 'neutral_kills', 'lane_kills', 'last_hits', 'denies', 'tower_damage', 'hero_damage', 'hero_healing', 'damage_taken', 'sentry_kills', 'observer_kills', 'kills', 'deaths', 'assists', 'stuns', 
'win', 'sen_placed', 'obs_placed', 'lane_efficiency_pct', 'camps_stacked', 'buyback_count', 'teamfight_participation']

SyntaxError: invalid syntax (<ipython-input-9-649d1eed4a4a>, line 6)

In [None]:
player.keys()

In [None]:
player['purchase_log']

In [11]:
range(5, 10)

[5, 6, 7, 8, 9]

In [12]:
match_players = match['players']

In [16]:
mids = filter(lambda x: x['lane_role'] == 2, match_players)

In [34]:
sorted_mids = sorted(mids, key = lambda x: x['total_gold'], reverse = True)

In [27]:
len(sorted_mids)

2

In [35]:
sorted_mids[0]['total_gold']

26555

In [32]:
m = sorted_mids[1]

In [33]:
match_players.remove(m)