# Cluster the Spire
Will Wright

### Purpose and Context

[todo]

In [34]:
# Load packages
import shutil
from os import listdir
import json
import glob
import os
import numpy as np
import pandas as pd

All the data currently lives in several .tar.gz archives within the 'zipped' folder. These need to be extracted into an 'unzipped' folder.

**PROTIP:** If you have the files already extracted (as they are in the repo), skip this step to avoid the lengthy unpacking process.

In [2]:
def extract_all(archives, extract_path, zip_format = "gztar"):
    '''
    input: path to the folder of zipped archives, path to extract to, and archive format
    output: unzipped contents of each archive within the extract path
    '''
    for filename in listdir(archives):
        # os.path.join works whether or not `archives` ends with a path separator
        shutil.unpack_archive(os.path.join(archives, filename), extract_path, zip_format)

In [32]:
extract_all("../data_raw/zipped/","../data_raw/unzipped/", "gztar")
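Following the tip above, one way to avoid repeating the lengthy extraction is to skip it whenever the target folder already holds files. This is only a minimal sketch: `extract_if_needed` is a hypothetical helper that is not part of this notebook, and the demo runs on a throwaway archive rather than the real data.

```python
import os
import shutil
import tempfile

def extract_if_needed(archives, extract_path, zip_format="gztar"):
    """Unpack every archive in `archives` unless `extract_path` already has content.

    Returns True if extraction ran and False if it was skipped.
    (Hypothetical helper, not part of the original notebook.)
    """
    if os.path.isdir(extract_path) and os.listdir(extract_path):
        return False
    os.makedirs(extract_path, exist_ok=True)
    for filename in sorted(os.listdir(archives)):
        shutil.unpack_archive(os.path.join(archives, filename), extract_path, zip_format)
    return True

# Demo on a throwaway archive so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src")
    os.makedirs(os.path.join(src, "run_folder"))
    with open(os.path.join(src, "run_folder", "game.json"), "w") as fh:
        fh.write("{}")

    zipped = os.path.join(tmp, "zipped")
    os.makedirs(zipped)
    shutil.make_archive(os.path.join(zipped, "batch"), "gztar", root_dir=src)

    unzipped = os.path.join(tmp, "unzipped")
    first = extract_if_needed(zipped, unzipped)   # extracts the archive
    second = extract_if_needed(zipped, unzipped)  # skipped: folder already populated
```

The check is deliberately coarse: a partially extracted folder would also be treated as done, so it suits a one-off notebook more than a pipeline.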

In [3]:
# Start here if the files are already unzipped
# (recursive=True only has an effect with a '**' pattern, so it is omitted here)
read_files = glob.glob("../data_raw/unzipped/*/*.json")

To give more context about the data we're working with, let's see exactly how many raw game runs we have:

In [4]:
len(read_files)

279848

Almost 280K games! We'll need to subset down to games for The Defect on Ascension 20 that resulted in wins before we can determine the relevant sample size though. In order to do that, we'll want to read these files together and use relevant JSON keys to narrow our focus.

In [5]:
# this approach creates a list of JSON strings from all the read_files
output_list = []

for f in read_files:
    # skip empty files and the single file whose name contains 'undefined' (its contents are "File doesn't exists)")
    if os.path.getsize(f) == 0 or 'undefined' in f:
        continue
    try:
        with open(f, "r") as infile:
            output_list.append(json.load(infile))
    except UnicodeDecodeError: # some unicode can't be read so just don't load those games (I think it's a particular monster name)
        pass

In [6]:
len(output_list)

279693

In [7]:
len(read_files)-len(output_list)

155

We've excluded 155 games that were either empty, had unreadable unicode, or were 'undefined'.  It's possible that this may introduce some bias (e.g. removing relevant games with particular qualities), but given that this represents such a small volume of games relative to all 280K and I haven't seen any apparent bias in looking through a sample of the files, I don't think this should be a major concern.
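To make the bias check above a little more concrete, the exclusions could be tallied by reason instead of only in total. The sketch below is an illustration, not what the notebook actually ran: `classify_run` is a hypothetical helper, the sample inputs are made up, and UTF-8 decoding is an assumption (the notebook opens files with the platform default encoding).

```python
import json
from collections import Counter

def classify_run(name, raw):
    """Return (reason, run) for one file, given its name and raw bytes.

    `reason` is 'ok', 'empty', 'undefined', or 'unicode_error';
    `run` is the parsed JSON for 'ok' and None otherwise.
    (Hypothetical helper; UTF-8 is an assumption.)
    """
    if len(raw) == 0:
        return "empty", None
    if "undefined" in name:
        return "undefined", None
    try:
        return "ok", json.loads(raw.decode("utf-8"))
    except UnicodeDecodeError:
        return "unicode_error", None

# Made-up inputs covering each branch.
samples = {
    "a.json": b'{"victory": true}',
    "b.json": b"",
    "undefined.json": b"File doesn't exists",
    "c.json": b"\xff\xfe{}",  # not valid UTF-8
}

reasons = Counter(classify_run(name, raw)[0] for name, raw in samples.items())
```

With a per-reason count in hand, it would be easier to judge whether any one failure mode is large enough to matter.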

After more attempts to get the data into the right format, it looks like there is a single case where the JSON is wrapped in '[ ]'.  Since this game is for Ironclad, I'll simply remove it from the dataset.

In [8]:
len(output_list)

279693

In [9]:
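# drop the one run whose JSON was wrapped in '[ ]' (it loads as a list rather than a dict)
output_list[:] = [s for s in output_list if not isinstance(s, list)]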

In [10]:
len(output_list)

279692

With that single exception removed, we can now subset to a list of Defect games, which pass the conditions of being the Defect character, a victory, and Ascension 20.  Since it's possible that I'll want to expand this investigation to the other two characters later, I'll also set aside their games in their own lists.

In [25]:
# Winning Ascension 20 games by character
defect_asc20_win_games = []
ironclad_asc20_win_games = []
silent_asc20_win_games = []

# Losing Ascension 20 games by character
defect_asc20_lose_games = []
ironclad_asc20_lose_games = []
silent_asc20_lose_games = []

for i in range(len(output_list)):
    if output_list[i] is not None:
        # test to ensure the game data has all the required elements (character, ascension level, and victory status)
        if ('character_chosen' in dict(output_list[i])) and \
        ('ascension_level' in dict(output_list[i])) and \
        ('victory' in dict(output_list[i])):
            
            # DEFECT WINNING
            if (output_list[i]['character_chosen']=='DEFECT') & \
            (output_list[i]['victory']==True) & \
            (output_list[i]['ascension_level']==20):
                defect_asc20_win_games.append(output_list[i])
            
            # DEFECT LOSING
            if (output_list[i]['character_chosen']=='DEFECT') & \
            (output_list[i]['victory']==False) & \
            (output_list[i]['ascension_level']==20):
                defect_asc20_lose_games.append(output_list[i])
            
            # IRONCLAD WINNING  
            if (output_list[i]['character_chosen']=='IRONCLAD') & \
            (output_list[i]['victory']==True) & \
            (output_list[i]['ascension_level']==20):
                ironclad_asc20_win_games.append(output_list[i])
                
            # IRONCLAD LOSING  
            if (output_list[i]['character_chosen']=='IRONCLAD') & \
            (output_list[i]['victory']==False) & \
            (output_list[i]['ascension_level']==20):
                ironclad_asc20_lose_games.append(output_list[i])
                
            # SILENT WINNING  
            if (output_list[i]['character_chosen']=='THE_SILENT') & \
            (output_list[i]['victory']==True) & \
            (output_list[i]['ascension_level']==20):
                silent_asc20_win_games.append(output_list[i])
                
            # SILENT LOSING
            if (output_list[i]['character_chosen']=='THE_SILENT') & \
            (output_list[i]['victory']==False) & \
            (output_list[i]['ascension_level']==20):
                silent_asc20_lose_games.append(output_list[i])
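The six near-identical branches above could also be expressed as a single grouping step keyed by character and outcome. This is a sketch of an alternative, not the code that produced the results in this notebook; `group_asc20_games` is a hypothetical helper, the toy runs are made up, and it assumes `victory` is always a boolean.

```python
from collections import defaultdict

REQUIRED_KEYS = ("character_chosen", "ascension_level", "victory")

def group_asc20_games(runs, level=20):
    """Group runs at the given ascension level by (character, victory).

    Runs that are not dicts or that lack a required key are skipped,
    mirroring the checks in the loop above. (Hypothetical helper.)
    """
    groups = defaultdict(list)
    for run in runs:
        if not isinstance(run, dict):
            continue
        if not all(key in run for key in REQUIRED_KEYS):
            continue
        if run["ascension_level"] == level:
            groups[(run["character_chosen"], run["victory"])].append(run)
    return groups

# Made-up runs for illustration.
toy_runs = [
    {"character_chosen": "DEFECT", "ascension_level": 20, "victory": True},
    {"character_chosen": "DEFECT", "ascension_level": 20, "victory": False},
    {"character_chosen": "DEFECT", "ascension_level": 5, "victory": True},
    {"character_chosen": "IRONCLAD", "ascension_level": 20, "victory": False},
    {"character_chosen": "THE_SILENT", "victory": True},  # missing ascension_level
    None,
]

groups = group_asc20_games(toy_runs)
defect_wins = groups[("DEFECT", True)]
```

A side effect of this shape is that any additional character present in the data gets its own group without further code.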

I'm curious about character winrates.  Let's compare wins to the total games per character.

In [45]:
# Defect summary statistics
defect_winning = len(defect_asc20_win_games)
defect_losing = len(defect_asc20_lose_games)
defect_total = len(defect_asc20_win_games)+len(defect_asc20_lose_games)
defect_winrate = defect_winning/defect_total

# Ironclad summary statistics
ironclad_winning = len(ironclad_asc20_win_games)
ironclad_losing = len(ironclad_asc20_lose_games)
ironclad_total = len(ironclad_asc20_win_games)+len(ironclad_asc20_lose_games)
ironclad_winrate = ironclad_winning/ironclad_total

# Silent summary statistics
silent_winning = len(silent_asc20_win_games)
silent_losing = len(silent_asc20_lose_games)
silent_total = len(silent_asc20_win_games)+len(silent_asc20_lose_games)
silent_winrate = silent_winning/silent_total


asc20_games_summary = pd.DataFrame({'Character':['Defect','Ironclad','Silent'],
                                    'Winning Games':[defect_winning,
                                                     ironclad_winning,
                                                     silent_winning],
                                     'Total Games':[defect_total,
                                                    ironclad_total,
                                                    silent_total],
                                     'Winrate':[defect_winrate,
                                               ironclad_winrate,
                                               silent_winrate]})



In [46]:
asc20_games_summary

|   | Character | Winning Games | Total Games | Winrate  |
|---|-----------|---------------|-------------|----------|
| 0 | Defect    | 1669          | 16863       | 0.098974 |
| 1 | Ironclad  | 1716          | 14798       | 0.115962 |
| 2 | Silent    | 1811          | 14278       | 0.126838 |


It would seem that although Defect is the most-played character, it has the lowest winrate.  This is consistent with the claim that Defect is the hardest character (at least on Ascension 20).  In any case, we have 1669 victorious Defect Ascension 20 games, which should be an adequate sample for clustering.
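As a quick arithmetic check on the comparison above, the winrates can be recomputed directly from the win and total counts in the summary table. The counts come from that table; the variable names are just for illustration.

```python
# (wins, total games) per character, copied from the summary table
counts = {
    "Defect": (1669, 16863),
    "Ironclad": (1716, 14798),
    "Silent": (1811, 14278),
}

winrates = {name: wins / total for name, (wins, total) in counts.items()}
most_played = max(counts, key=lambda name: counts[name][1])
lowest_winrate = min(winrates, key=winrates.get)

for name, rate in sorted(winrates.items(), key=lambda kv: kv[1]):
    print(f"{name}: {rate:.1%}")
# prints:
# Defect: 9.9%
# Ironclad: 11.6%
# Silent: 12.7%
```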

Next, we need to convert this list of JSON objects to a dataframe we can cluster.  Ideally, the shape of the data is one-row-per-game with columns for all the cards and relics. In order to do that, we'll want to create a vector of all unique cards and relics.  

#### Getting Unique Cards and Relics  

In order to get all unique cards and relics, we can simply pull all cards and relics from all games, then apply NumPy's `np.unique()` function.

In [57]:
all_game_decks = []
all_game_relics = []

for i in range(len(output_list)):
    if output_list[i] is not None:
        # ensure the run data has the deck and relics to avoid errors in rare cases
        if ('master_deck' in dict(output_list[i])) and ('relics' in dict(output_list[i])):
            all_game_decks.append(output_list[i]['master_deck'])
            all_game_relics.append(output_list[i]['relics'])

In [66]:
# Within each game, each card and relic needs to be pulled out into a flat list.
all_cards = []

for i in range(len(all_game_decks)):
    for j in range(len(all_game_decks[i])):
        all_cards.append(all_game_decks[i][j])
        
all_relics = []

for i in range(len(all_game_relics)):
    for j in range(len(all_game_relics[i])):
        all_relics.append(all_game_relics[i][j])


In [67]:
# create unique lists
unique_cards = list(np.unique(all_cards))
unique_relics = list(np.unique(all_relics))
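The flatten-then-deduplicate steps above can also be written with the standard library alone. This is a minimal sketch of the same idea rather than the code used here: `unique_items` is a hypothetical helper and the toy deck contents are illustrative.

```python
from itertools import chain

def unique_items(per_game_lists):
    """Flatten a list of per-game lists and return the sorted unique entries."""
    return sorted(set(chain.from_iterable(per_game_lists)))

# Illustrative decks, not real run data.
toy_decks = [
    ["Strike_B", "Defend_B", "Zap"],
    ["Zap", "Zap+1", "Dualcast"],
]

toy_unique_cards = unique_items(toy_decks)
print(toy_unique_cards)
# prints: ['Defend_B', 'Dualcast', 'Strike_B', 'Zap', 'Zap+1']
```

Using a set avoids materializing the full flat list of every card from every game before deduplicating.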

In [68]:
len(unique_cards)

3164

In [69]:
len(unique_relics)

876

Looks like we have 3164 unique cards and 876 unique relics.  This is a fair bit more than expected, so let's take a look at the head and tail of cards:

In [78]:
unique_cards[0:10]

['6A',
 '6A+1',
 'A Thousand Cuts',
 'A Thousand Cuts+1',
 'Abandon',
 'Abandon+1',
 'AbeCurse',
 'AbsoluteMagnitude+1',
 'Absolvement',
 'Absolvement+1']

In [79]:
unique_cards[-10:-1]  # note: this slice stops before the final element; [-10:] would include it

['vexMod:StarBlast',
 'vexMod:StrikeStorm',
 'vexMod:StrikeStorm+1',
 'vexMod:Taunt+1',
 'vexMod:TrainingStrike',
 'vexMod:TrainingStrike+1',
 'vexMod:UltimateCard',
 'vexMod:VenomSigh',
 'vexMod:VolumeVengeance']

This reveals two issues. First, cards appear in both standard and "+1" versions (players can generally upgrade a card once). Second, some cards come from game mods (essentially, player-made extensions of the game).  Thankfully, my domain expertise makes it fairly easy to know which cards aren't in the base game, and most of the modded cards seem to have a ':' in their name, so they should be fairly easy to exclude.

After testing, it looks like there are a few other exceptions for specific mods that don't follow the usual 'mod:card' structure.  I'll go ahead and simply remove those cases as well.

In [88]:
# drop upgraded ("+") versions and names carrying known mod markers
unique_cards[:] = [s for s in unique_cards
                   if '+' not in s
                   and ':' not in s
                   and 'animator_' not in s
                   and 'Haku_' not in s]

In [90]:
len(unique_cards)

716
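The filter above can be read as two separate ideas: spotting mod markers and recognizing upgrade suffixes. The sketch below separates them, using names from the head and tail listings as sample input. Both helpers are hypothetical, and stripping the suffix with a regular expression is an assumption about how upgraded cards might later be matched to their base names, not something this notebook does.

```python
import re

# Substrings treated as signs of a modded card, as in the filter above.
MOD_MARKERS = (":", "animator_", "Haku_")

def base_card_name(name):
    """Strip a trailing upgrade suffix such as '+1' (hypothetical helper)."""
    return re.sub(r"\+\d+$", "", name)

def looks_like_base_card(name):
    """Mirror the filter's heuristic: no '+' and no known mod marker."""
    return "+" not in name and not any(marker in name for marker in MOD_MARKERS)

sample = [
    "A Thousand Cuts",
    "A Thousand Cuts+1",
    "vexMod:StarBlast",
    "vexMod:Taunt+1",
]

kept = [name for name in sample if looks_like_base_card(name)]
normalized = base_card_name("A Thousand Cuts+1")
```

Here `kept` holds only the unupgraded, unmarked name, and `normalized` maps the upgraded card back to it.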

On reviewing the new card list, I can still see some non-base cards, but I'm not too concerned about them affecting the final results, since those cards should appear rarely (and not at all when they belong to a character outside the base game).

Next, the same cleansing will be applied to the relics. Generally speaking, relics have the same issue with mods as the cards, but they can't be upgraded.

In [92]:
unique_relics[:] = [s for s in unique_relics if '_' not in s \
                   and ':' not in s]

In [93]:
len(unique_relics)

391

Again, this isn't a perfect methodology, but since the game data has no flag marking modded content, it is difficult to find a single signal that works as a criterion for subsetting to the base game.

___

At this point, we can build a table of all unique cards and relics and then fill in Trues and Falses for whether the card was present per completed game.
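As a rough illustration of that target shape, the sketch below builds a one-row-per-game table of booleans from toy decks. `presence_table` is a hypothetical helper and the deck contents are made up; the real table would use the cleaned card and relic lists as its columns.

```python
import pandas as pd

def presence_table(games, columns):
    """Return a DataFrame with one row per game and one boolean column per item.

    A cell is True when the item appears at least once in that game's list.
    (Hypothetical helper, not part of the original notebook.)
    """
    rows = []
    for items in games:
        present = set(items)  # duplicates within a game collapse to one flag
        rows.append({column: column in present for column in columns})
    return pd.DataFrame(rows, columns=columns)

# Made-up decks for illustration.
toy_decks = [
    ["Zap", "Dualcast"],
    ["Zap", "Zap"],
]

table = presence_table(toy_decks, ["Dualcast", "Zap"])
```

Note that a presence flag discards how many copies of a card a deck holds, which matches the True/False framing above.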