# Cluster the Spire
Will Wright

### Purpose and Context

[todo]

In [2]:
# Load packages
import shutil
from os import listdir
import json
import glob
import os

All the data currently lives in several zipped tar.gz files within the 'zipped' folder.  These need to be extracted into an unzipped folder.

**PROTIP:** If the files are already extracted (as they are in the repo), skip this step to avoid the lengthy unpacking process.

In [31]:
def extract_all(archives, extract_path, zip_format="gztar"):
    '''
    input: path to the folder of zipped archives, path to extract into, and archive format
    output: unzipped contents of each archive within the extract path
    '''
    for filename in listdir(archives):
        # os.path.join works whether or not `archives` ends with a slash
        shutil.unpack_archive(os.path.join(archives, filename), extract_path, zip_format)

In [32]:
extract_all("../data_raw/zipped/","../data_raw/unzipped/", "gztar")
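
The skip-if-already-extracted advice above could also be built into the extraction step itself. The following is a minimal sketch, not the notebook's own method: `extract_if_needed` is a hypothetical helper, and it assumes that a non-empty target folder means the archives were unpacked previously.

```python
import os
import shutil


def extract_if_needed(archives, extract_path, zip_format="gztar"):
    """Unpack every archive in `archives` unless `extract_path` already has content.

    Returns the number of archives unpacked (0 when the step was skipped).
    Assumption: a non-empty target folder means extraction already happened.
    """
    if os.path.isdir(extract_path) and os.listdir(extract_path):
        return 0

    os.makedirs(extract_path, exist_ok=True)
    unpacked = 0
    for filename in sorted(os.listdir(archives)):
        shutil.unpack_archive(os.path.join(archives, filename), extract_path, zip_format)
        unpacked += 1
    return unpacked
```

Calling `extract_if_needed("../data_raw/zipped/", "../data_raw/unzipped/")` would then be safe to re-run, since a second call returns immediately.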

In [13]:
# Start here if the files are already unzipped
# match JSON files one directory level below the unzipped folder
read_files = glob.glob("../data_raw/unzipped/*/*.json")

To give more context about the data we're working with, let's see exactly how many raw game runs we have:

In [14]:
len(read_files)

279848

Almost 280K games! Before we can determine the relevant sample size, though, we'll need to subset down to games for The Defect on Ascension 20 that resulted in wins. To do that, we'll read these files together and use the relevant JSON keys to narrow our focus.

In [112]:
# this approach builds a list of parsed JSON objects (one per game) from all the read_files
output_list = []

for f in read_files:
    try:
        with open(f, "r") as infile:
            # skip empty files and any file whose name contains 'undefined'
            # (1 file, contents are "File doesn't exists")
            if os.path.getsize(f) > 0 and 'undefined' not in f:
                output_list.append(json.load(infile))
    except UnicodeDecodeError:
        # some unicode can't be read, so just don't load those games
        # (I think it's a particular monster name)
        pass

In [107]:
len(output_list)

279693

In [108]:
len(read_files)-len(output_list)

155

We've excluded 155 games that were either empty, had unreadable unicode, or were 'undefined'. This may introduce some bias (e.g., if the excluded games share particular qualities), but they are a tiny fraction of all 280K games (about 0.06%) and I haven't seen any apparent pattern in a sample of the files, so I don't think this should be a major concern.
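
To see how the excluded files split across the three reasons, the loading loop could keep a tally. This is a hedged sketch rather than the code that produced the counts above: `load_runs` is a hypothetical helper, and it assumes the run files are UTF-8 encoded (the original loop relies on the platform default encoding).

```python
import json
import os
from collections import Counter


def load_runs(paths):
    """Load run files and return (games, reasons).

    `reasons` counts skipped files by cause: 'empty', 'undefined', or 'unicode'.
    """
    games, reasons = [], Counter()
    for path in paths:
        if os.path.getsize(path) == 0:
            reasons["empty"] += 1
        elif "undefined" in path:
            reasons["undefined"] += 1
        else:
            try:
                # assumption: files are UTF-8 encoded
                with open(path, "r", encoding="utf-8") as infile:
                    games.append(json.load(infile))
            except UnicodeDecodeError:
                reasons["unicode"] += 1
    return games, reasons
```

With `games, reasons = load_runs(read_files)`, the sum of `reasons.values()` would be the number of excluded files, which makes it easier to judge whether any one cause dominates.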

After more attempts to get the data into the right format, it looks like there is a single case where the JSON is wrapped in '[ ]'. Since this game is for Ironclad, I'll simply remove it from the dataset.

In [121]:
len(output_list)

279693

In [122]:
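# a file wrapped in '[ ]' parses to a Python list rather than a dict, so drop any list entries
output_list[:] = [s for s in output_list if not isinstance(s, list)]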

In [123]:
len(output_list)

279692
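
If a list-wrapped game had been relevant (it is not here, since the one case is an Ironclad run), an alternative to dropping it would be to unwrap it. A minimal sketch, assuming the wrapper is a one-element list containing the run; `unwrap_run` is a hypothetical helper, not part of this notebook:

```python
def unwrap_run(parsed):
    """Return the run dict from a parsed JSON value.

    A one-element list holding a dict is unwrapped; anything else is returned unchanged.
    """
    if isinstance(parsed, list) and len(parsed) == 1 and isinstance(parsed[0], dict):
        return parsed[0]
    return parsed
```

Applying it as `[unwrap_run(s) for s in output_list]` would keep such a game in the dataset instead of removing it.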

With that single exception removed, we can now subset to a list of included_games, which pass the conditions of being the Defect character, a victory, and Ascension 20.

In [138]:
# test each key separately; `'a' and 'b' in d` would only test 'b'
if all(key in output_list[0] for key in ('ascension_level', 'victory')):
    print("yes")

yes
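
A note on checking several keys at once: in Python, an expression of the form `'a' and 'b' in d` only tests the last key, because a non-empty string is always truthy. The toy example below illustrates the difference; the `game` dict is made up for illustration and is not taken from the dataset.

```python
game = {"victory": True}  # hypothetical run that lacks 'ascension_level'

# A non-empty string is truthy, so this reduces to `'victory' in game`.
loose_check = bool("ascension_level" and "victory" in game)

# Testing each key separately gives the intended result.
strict_check = all(key in game for key in ("ascension_level", "victory"))

print(loose_check, strict_check)
```

Here the loose check passes even though one key is missing, while the strict check does not.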


In [148]:
included_games = []
required_keys = ('character_chosen', 'ascension_level', 'victory')

for game in output_list:
    # skip null entries and games missing any of the keys we filter on
    if game is None or not all(key in game for key in required_keys):
        continue
    # keep only victorious Defect runs on Ascension 20
    if (game['character_chosen'] == 'DEFECT' and
            game['victory'] == True and
            game['ascension_level'] == 20):
        included_games.append(game)

In [149]:
len(included_games)

1669

With 1,669 victorious Ascension 20 Defect games, we should have a large enough sample to do a proper clustering.

In [150]:
included_games[0]

{'gold_per_floor': [118,
  135,
  155,
  155,
  172,
  203,
  203,
  233,
  233,
  233,
  262,
  274,
  292,
  327,
  327,
  401,
  401,
  421,
  696,
  715,
  715,
  725,
  725,
  755,
  161,
  239,
  164,
  189,
  207,
  242,
  253,
  253,
  324,
  324,
  344,
  363,
  374,
  474,
  493,
  493,
  521,
  552,
  614,
  626,
  215,
  215,
  234,
  265,
  265,
  265,
  265,
  265,
  265,
  23,
  58,
  58],
 'floor_reached': 57,
 'playtime': 6764,
 'items_purged': ['Regret', 'Strike_B'],
 'score': 3083,
 'play_id': '01e4ed40-7a67-4d84-bf22-2a22cb0adad3',
 'local_time': '20190830164308',
 'is_ascension_mode': True,
 'campfire_choices': [{'data': 'Redo', 'floor': 7.0, 'key': 'SMITH'},
  {'data': 'Chaos', 'floor': 10.0, 'key': 'SMITH'},
  {'data': 'Glacier', 'floor': 15.0, 'key': 'SMITH'},
  {'data': 'Meteor Strike', 'floor': 23.0, 'key': 'SMITH'},
  {'data': 'Capacitor', 'floor': 32.0, 'key': 'SMITH'},
  {'data': 'Heatsinks', 'floor': 40.0, 'key': 'SMITH'},
  {'floor': 44.0, 'key': 'RECALL'

Next, we need to convert this list of JSON objects to a dataframe we can cluster. Ideally, the data has one row per game, with a column for each card and relic.
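
As a rough illustration of that target shape, here is a sketch using only the standard library. It rests on assumptions that the truncated output above does not confirm: that each game stores its final deck as a list under a key such as `master_deck` and its relics as a list under `relics`. Both key names, the helper names, and the toy games are illustrative, not taken from the data shown.

```python
from collections import Counter


def game_to_row(game, card_key="master_deck", relic_key="relics"):
    """Flatten one game into a dict of card counts and relic flags.

    `card_key` and `relic_key` are assumed key names, not confirmed by the source.
    """
    row = {"play_id": game.get("play_id")}
    for card, count in Counter(game.get(card_key, [])).items():
        row["card_" + card] = count
    for relic in game.get(relic_key, []):
        row["relic_" + relic] = 1
    return row


def games_to_table(games):
    """Return one row per game, all rows sharing the same columns (missing values are 0)."""
    rows = [game_to_row(game) for game in games]
    columns = sorted({column for row in rows for column in row})
    return [{column: row.get(column, 0) for column in columns} for row in rows]


# toy games for illustration only
toy_games = [
    {"play_id": "a", "master_deck": ["Strike_B", "Strike_B", "Glacier"], "relics": ["Cracked Core"]},
    {"play_id": "b", "master_deck": ["Glacier"], "relics": []},
]
table = games_to_table(toy_games)
```

A list of uniform dicts like `table` can be passed directly to a dataframe constructor, giving one row per game and one column per card or relic.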