# Dataset Visualization & Feature Extraction

After sanitizing the inputs (discarding ineligible replays) and standardizing the format of replay filenames, we have a shortlist of **1979** unique games to explore. Depending on the initial results, the same procedure can be run on a larger set of non-pro player replays across a wider range of ranked leagues.

In [1]:
import os
import sys
import spawningtool.parser
from collections import defaultdict

In [2]:
# Where replays are located
replay_dir = "replays"
# Where we should copy over the sanitized replays
output_dir = "sanitized_replays"

In [3]:
matchup_dict = defaultdict(list)
race_dict = defaultdict(list)

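# Only the top level of replay_dir is scanned (see the break below)
for (dirpath, dirnames, filenames) in os.walk(replay_dir):
    for filename in filenames:
        try:
            # Expected format: <matchup>_<timestamp>_<uid1>-<uid2>.SC2Replay
            matchup, _, _ = filename.split('_')
            # Build up classification by matchup; sorting the characters makes
            # the key order-independent (e.g. "PvT" and "TvP" both become "PTv")
            matchup_dict[''.join(sorted(matchup))].append(filename)
            # Build up classification by race, remembering the player slot
            one, two = matchup.split('v')
            race_dict[one].append((filename, 1))
            race_dict[two].append((filename, 2))
        except ValueError as e:
            # Filename does not follow the expected format
            print(e)
            print(filename)
    break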

## Dataset-level population metrics

Below are some metrics on the population of replays we are classifying. There is a bias towards the Protoss race, likely due to the relative proportion of active professional players using this race in the recent time period. Note that the race counts below tally player slots, so a mirror matchup such as PvP contributes two entries. Protoss fills 1549 of the 3958 player slots (about 39%), compared with about 29% for Terran and 32% for Zerg. Counting each game once, about 63% of games involve at least one Protoss player, against roughly 50% for Terran and 54% for Zerg. This should not affect our classification results, since we will be classifying within each race; we may simply have less confidence in the non-Protoss results.

In [4]:
for key in matchup_dict.keys():
    print(f"Matchup {key}: {len(matchup_dict[key])}")

for key in race_dict.keys():
    print(f"Race {key}: {len(race_dict[key])}")

Matchup PPv: 303
Matchup PTv: 459
Matchup PZv: 484
Matchup TTv: 144
Matchup TZv: 388
Matchup ZZv: 201
Race P: 1549
Race T: 1135
Race Z: 1274
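
As a cross-check on the totals above, the sketch below recomputes them from the matchup counts alone. It distinguishes two ways of counting: once per game that involves a race, and once per player slot (which is what `race_dict` tallies, so mirror matchups count twice). The names `matchup_counts`, `games_with_race` and `player_slots` are illustrative and not part of the notebook.

```python
from collections import Counter

# Matchup counts copied from the output above (keys as printed, e.g. "PTv")
matchup_counts = {
    "PPv": 303, "PTv": 459, "PZv": 484,
    "TTv": 144, "TZv": 388, "ZZv": 201,
}

total_games = sum(matchup_counts.values())

games_with_race = Counter()  # each game counted once per distinct race
player_slots = Counter()     # each game counted once per player

for key, count in matchup_counts.items():
    races = key.replace("v", "")  # "PTv" -> "PT", "PPv" -> "PP"
    for race in set(races):
        games_with_race[race] += count
    for race in races:
        player_slots[race] += count

for race in "PTZ":
    share = games_with_race[race] / total_games
    print(f"Race {race}: {games_with_race[race]} games ({share:.0%}), "
          f"{player_slots[race]} player slots")
```

The player-slot totals should reproduce the `Race P/T/Z` lines printed above, while the per-game totals give the share of the dataset in which each race appears at least once.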


## Dissecting a game

Many things happen simultaneously in a game of Starcraft. Broadly speaking, players balance three goals: Economy, Offense, and Defense. More specifically, there are three resources in the game that players need to balance: Minerals, Vespene Gas, and Supply. Minerals and Gas are used to produce Buildings, Upgrades, and Units. Each unit currently on the playing field consumes a set amount of Supply.

There are a few classes of buildings and units, which we will break down as follows (P/T/Z):
- Economy
    - Nexus/Command Center/Hatchery: Main collection point for resources, and the only building players start with
- Supply
    - Pylon/Supply Depot/\[Overlord\]: Structures which increase available Supply (capped at 200). Note that for Zerg, this is a unit instead.
- Production (basic)
    - Gateway,Stargate,Robotics Facility/Barracks,Factory,Starport/\[Larvae\]: Basic offensive unit production buildings. Note that for Zerg, ALL other units are produced from the basic *Larvae* unit, which spawns at a constant rate from Hatcheries.
- Tech (basic)
    - Cybernetics Core,Twilight Council/Ghost Academy,Techlab add-on/Spawning Pool,Baneling Nest,Roach Warren,Lair
- Tech (advanced)
    - Templar Archives,Dark Shrine,Fleet Beacon,Robotics Bay/Fusion Core/Hydralisk Den,Lurker Den,Spire,Hive,Ultralisk Cavern,Greater Spire
- Upgrades
    - Forge/Armory,Engineering Bay/Evolution Chamber
    
This is just a personal breakdown of the classes of structures and their roles in the game. Each building also has dependencies that must be fulfilled before it can be built. Furthermore, in some cases a structure fulfils both a Tech role and an Upgrade role (it unlocks new units to create as well as new upgrades to research). The full structure tree for each race is available at:
- [Protoss](https://liquipedia.net/starcraft2/Protoss_Tech_Tree_(Legacy_of_the_Void))
- [Terran](https://liquipedia.net/starcraft2/Terran_Tech_Tree_(Legacy_of_the_Void))
- [Zerg](https://liquipedia.net/starcraft2/Zerg_Tech_Tree_(Legacy_of_the_Void))
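
The category breakdown above can be sketched as a simple lookup table. This is a minimal illustration only: `STRUCTURE_CLASSES` and `classify` are hypothetical names, and the structure names are the display names used in the list above, which may differ from the identifiers a replay parser emits.

```python
# Hypothetical lookup table mirroring the category breakdown above.
# Keys are categories; values map a race letter to its structures (or units).
STRUCTURE_CLASSES = {
    "Economy": {"P": ["Nexus"], "T": ["Command Center"], "Z": ["Hatchery"]},
    "Supply": {"P": ["Pylon"], "T": ["Supply Depot"], "Z": ["Overlord"]},
    "Production (basic)": {
        "P": ["Gateway", "Stargate", "Robotics Facility"],
        "T": ["Barracks", "Factory", "Starport"],
        "Z": ["Larvae"],
    },
    "Tech (basic)": {
        "P": ["Cybernetics Core", "Twilight Council"],
        "T": ["Ghost Academy", "Techlab"],
        "Z": ["Spawning Pool", "Baneling Nest", "Roach Warren", "Lair"],
    },
    "Tech (advanced)": {
        "P": ["Templar Archives", "Dark Shrine", "Fleet Beacon", "Robotics Bay"],
        "T": ["Fusion Core"],
        "Z": ["Hydralisk Den", "Lurker Den", "Spire", "Hive",
              "Ultralisk Cavern", "Greater Spire"],
    },
    "Upgrades": {
        "P": ["Forge"],
        "T": ["Armory", "Engineering Bay"],
        "Z": ["Evolution Chamber"],
    },
}


def classify(race, name):
    """Return the category of a structure for the given race, or None."""
    for category, by_race in STRUCTURE_CLASSES.items():
        if name in by_race[race]:
            return category
    return None


print(classify("Z", "Spire"))
print(classify("T", "Barracks"))
```

Because each structure is listed under a single category, a structure that plays both a Tech and an Upgrade role would need a richer representation than this table provides.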

The purpose of creating buildings outside the Production class is to unlock a greater variety of units that one can create. Units that require more investment are supposed to be more efficient (in terms of resources spent against potential damage caused to the opponent).

## Visualizing a game

For this exercise, we will dig into a particular replay (`PvT_1614448222_7313968-7314011.SC2Replay`) to visualize the features we will be extracting for classification.

We will attempt to visualize a recent game in the 2020 world championships between two pro players, PartinG (P) and TY (T). *I chose this game because I have watched the replay and am familiar with what happened in the match.*

In [15]:
def fingerprint(game):
    """Build the standardized replay name: <matchup>_<timestamp>_<uid1>-<uid2>."""
    players = game['players']
    # First letter of each player's race, e.g. "PvT"
    matchup = f"{players[1]['race'][0]}v{players[2]['race'][0]}"
    timestamp = game['unix_timestamp']
    uid_pair = f"{players[1]['uid']}-{players[2]['uid']}"
    print(f"PlayerOne: {players[1]['name']}")
    print(f"PlayerTwo: {players[2]['name']}")
    return "_".join([matchup, str(timestamp), uid_pair])

sample_replay = "PvT_1614448222_7313968-7314011.SC2Replay"
sample_game = spawningtool.parser.parse_replay(sample_replay)
print("Game unique fingerprint:", fingerprint(sample_game))

PlayerOne: PartinG
PlayerTwo: TYTY
Game unique fingerprint:  PvT_1614448222_7313968-7314011
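
Going the other way, a fingerprint (or a sanitized filename) can be split back into its parts. The sketch below is a minimal illustration: `ReplayKey` and `parse_fingerprint` are hypothetical helpers, not part of spawningtool, and they assume that neither the matchup nor the player uids contain `_` or `-`.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Tuple


class ReplayKey(NamedTuple):
    matchup: str            # e.g. "PvT"
    timestamp: int          # Unix timestamp of the game
    uids: Tuple[str, str]   # (player one uid, player two uid)


def parse_fingerprint(name):
    """Split '<matchup>_<timestamp>_<uid1>-<uid2>[.SC2Replay]' into its parts."""
    suffix = ".SC2Replay"
    if name.endswith(suffix):
        name = name[:-len(suffix)]
    matchup, timestamp, uid_pair = name.split("_")
    uid_one, uid_two = uid_pair.split("-")
    return ReplayKey(matchup, int(timestamp), (uid_one, uid_two))


key = parse_fingerprint("PvT_1614448222_7313968-7314011.SC2Replay")
# Interpret the timestamp as UTC to get a human-readable date
played_at = datetime.fromtimestamp(key.timestamp, tz=timezone.utc)
print(key)
print(played_at.isoformat())
```

Keeping the uids as strings avoids assuming anything about their numeric range; only the timestamp is converted, since it is needed as a number to derive the date.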


## Outline of Approach

We treat each action as a multivariate vector of as many variables as there are unique structures,