# A5: Final Project Plan
Will Wright  
2019-11-14

# I. Introduction
### Motivation/Problem Statement:
**Context:** Slay the Spire is currently the 64th most played game on Steam (out of more than 30,000 games).  It was released in late January 2019 and has received a 10/10 rating with “overwhelmingly positive” feedback.  It’s essentially a turn-based deck-building strategy and tactics game with rogue-like, procedurally-generated properties such that no two games play out the same. Winrates are generally low and few resources exist to help players improve their play.

**Gameplay:** After choosing one of three characters (Ironclad, The Silent, and Defect), players start with the same initial deck of basic cards and, through events in the game, are able to synergize toward more powerful decks as they ascend the floors of the Spire. The event types are combat (normal, elite, and boss), campfires (a heal/upgrade decision), merchants (to purchase cards/relics or remove bad cards), and "?" events, of which there are more than 100 unique decisions with random elements to help or hinder your journey. Choosing a path through these events challenges the player to balance their competing priorities to stay alive and form a solid chance of winning through building their resources. Per character, there are about 100 cards that vary in type, rarity, and upgrades. Additionally, there are over 100 relics, which are passive items that provide bonuses to gameplay. The number of permutations for deck + relic combinations alone provides for innumerable play styles, but not all are winning strategies and actual winrates are generally less than 20%.  

**Winning**: In total, there are 4 acts, which take about 1.5 hours to complete and conclude with beating the final boss: the Heart of the Spire.  Only by making the correct decisions about which path, cards, and relics to pick can players form a strong enough deck to stand a chance of winning. However, even with very strong decks, the randomness of how each combat plays out means there are no guarantees of victory. Per character, once the Heart is beaten, this opens up what are known as "Ascension levels": end-game challenges that require beating the game again, with the difficulty ramping up at each level until the highest, Ascension 20.  At Ascension 20, winning is very rare and some characters are more likely to win than others.  Winrates are 6%, 5%, and 2% for Ironclad, The Silent, and Defect, respectively.

**Problem:** Anyone who gets into the game and looks for guidance on which cards and relics are better soon finds https://spirelogs.com/ as the top resource.  The idea here is that you upload your game data and can see how you compare to others while providing spirelogs with data useful for determining ratings for cards and other powerups. It is worth noting that the site's data is biased toward more-hardcore players, since casual players are less likely to want to upload their game data or explore optimization of their play style. Still, while the findings on this site are extremely useful to players, they lack decision context or a sense of which final clusters of cards/relics are best.  For instance, a certain card or relic might have a high rating (i.e., it is associated with a higher winrate and/or is picked more often) but be damaging for the resource-build a player currently has. My goal is to remedy this by clustering the winning deck/relic combinations into distinct archetypes and providing that context.

**Scope of Analysis:** Because there are 3 characters, 20 Ascension levels, and not every deck wins, it wouldn't be fair to bucket all games into one group for the analysis.  Instead, I intend to limit the scope to only the winning resource combinations (cards + relics) at the highest difficulty (Ascension 20) for the Defect character (extending this to all characters if time allows).  Because less-optimized resource combinations can win at lower Ascension levels, limiting to the highest tier should point more directly at the purest forms of winning archetypes.  I chose Defect simply because I'm currently playing through this character and have been stuck on Ascension 19 for about 15 runs! With a 2% winrate at the next level, my hope is that I can use the results of this analysis to beat the final level and provide other players with insights about the winning archetypes that will help guide decisions and boost winrates.

### Research Question
**RQ1.** How many distinct deck archetypes beat the game at Ascension 20 as Defect?

My hypothesis is that there are only about 8 ways to beat the game with these parameters, but this will not formally be tested since the methods for producing clusters (discussed below) are not statistical.


### Background and Related Work
Very little exists in the way of related work since the game was only recently released on January 23rd, 2019 and data, while public-facing, isn't easy to find without investigation.  As far as I know, Spirelogs is the only resource which exposes game data to interested players.

Still, this lack of data hasn't stopped dedicated fans from attempting to improve their gameplay with purely local data from only their personal runs.  [Learn the Spire](https://github.com/kajchang/LearnTheSpire), for example, uses local game data to train a neural net and recommend decisions in-game via an overlaying mod. The best this can do, however, is surface recommendations based on the player's past decisions, which may not be optimal.  



# II. Data Sources and Preparation

### Data Used
I wrote to the creator of spirelogs, Alleji, and asked for permission to use his data.  He responded with:

_Hey!_
 
_I have an archive of runs since 1.0 right here: https://spirelogs.com/archive/_
 
_You're right in that there isn't any documentation on it, but I do get requests like yours occasionally and I'm certainly happy to share the data. Let me know if you publish your results anywhere online, I'd be interested in seeing them!_
 
_Alleji_

In these zipped files, there are over **280K completed games** in JSON format.

While no link on his page directs to the archive of data, the link provided by the creator is publicly available with his blessing (as per the email). The only ethical consideration could be that individuals might be identifiable if you know their playstyle well enough (e.g. they always choose a certain set of cards or character to play), but because identifying an individual could only give you insights about their playstyle and when they were playing, I don’t think this is a major concern. 

As for the data itself, there is one tar.gz file per month since release. Within each of these archives is a folder for each day of the month, and within each of those are hundreds to thousands of game .run files.  These are all in JSON format and detail which decisions the users make throughout their playthroughs and their results along the way.
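A minimal sketch of how one monthly archive could be read without unpacking it to disk. It assumes that every `.run` member holds a single JSON object and that unreadable files can simply be skipped; the function name `iter_runs` and the archive file name in the usage note are hypothetical.

```python
import json
import tarfile
from typing import Iterator


def iter_runs(archive_path: str) -> Iterator[dict]:
    """Yield one parsed run per .run file inside a monthly tar.gz archive."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        for member in archive:
            # Skip the per-day folders and anything that is not a run file.
            if not member.isfile() or not member.name.endswith(".run"):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            try:
                yield json.load(handle)
            except (json.JSONDecodeError, UnicodeDecodeError):
                # Assumption: a corrupt file is skipped instead of stopping the pass.
                continue
```

Something like `for run in iter_runs("2019-10.tar.gz"): ...` would then stream the games for one month, which keeps memory use flat even with hundreds of thousands of runs.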

I’d like to point out that there isn’t a license for the data and I’m interpreting the creator’s expressed happiness to share the data as evidence that I can use it. 

### Data Prep
As per the scope of the analysis, after extracting all JSON files, the data will be subset to only winning decks + relics for Defect in Ascension 20.  If time allows, similar analyses will be performed for Ironclad and The Silent.

Within this subset, the final card and relic combinations will be converted to a table with one column per card or relic and one row per run.  A simple binary of 0 and 1 will indicate whether that card or relic was present in the run's final set of resources.
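The conversion to a 0/1 table might look like the sketch below. It relies only on the `master_deck` and `relics` fields of each run. The `card:` and `relic:` prefixes, the `encode_runs` name, and the optional merging of upgraded cards (for example treating `Sunder+1` as `Sunder`) are my own assumptions, not decisions already made for this analysis.

```python
def encode_runs(runs, merge_upgrades=False):
    """Return (columns, rows); rows[i][j] is 1 if run i ended with columns[j]."""

    def card_name(card):
        # "Sunder+1" -> "Sunder" when upgrades are merged (off by default).
        return card.split("+")[0] if merge_upgrades else card

    item_sets = []
    for run in runs:
        cards = {"card:" + card_name(c) for c in run["master_deck"]}
        relics = {"relic:" + r for r in run["relics"]}
        item_sets.append(cards | relics)

    # One column per distinct card or relic seen in any run, in sorted order.
    columns = sorted(set().union(*item_sets))
    rows = [[1 if col in items else 0 for col in columns] for items in item_sets]
    return columns, rows
```

Note that a pure presence/absence encoding discards how many copies of a card the deck holds (the example run ends with three `Defend_B`), which may or may not matter for the archetypes.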

As for the data itself, below is an example of the run data for a Defect Ascension 20 game:

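```python
import json

# Load a single run file that has been saved locally as example_run.json.
with open('example_run.json') as json_file:
    example_run = json.load(json_file)

example_run  # in a notebook, a bare expression displays the parsed run
```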

```
{'gold_per_floor': [111, 123, 204, 222, 94, 94, 120, 120, 190, 211,
                    211, 223, 223, 258, 258, 333, 333, 311, 311, 311],
 'floor_reached': 20,
 'playtime': 600,
 'items_purged': ['Strike_B'],
 'score': 360,
 'play_id': 'c30323f1-129e-4280-b47d-7a9d0448a355',
 'local_time': '20181212222942',
 'is_ascension_mode': True,
 'campfire_choices': [{'data': 'Sunder', 'floor': 6.0, 'key': 'SMITH'},
  {'data': 'Loop', 'floor': 8.0, 'key': 'SMITH'},
  {'floor': 13, 'key': 'REST'},
  {'data': 'Go for the Eyes', 'floor': 15, 'key': 'SMITH'}],
 'neow_cost': 'NONE',
 'seed_source_timestamp': 35655965030300,
 'circlet_count': 0,
 'master_deck': ['AscendersBane',
  'Defend_B',
  'Defend_B',
  'Defend_B',
  'Zap',
  'Dualcast',
  'Echo Form',
  'Sunder+1',
  'Loop+1',
  'Ball Lightning',
  'Cold Snap',
  'Compile Driver',
  'Rip and Tear+1',
  'Lockon',
  'Go for the Eyes+1',
  'Electrodynamics',
  'Darkness+1',
  'Stack+1',
  'Tempest+1'],
 'special_seed': 0,
 'relics
```

Within this data, we're interested solely in five fields: `master_deck` (all final cards), `relics` (all final relics), `victory` (the win state, for subsetting to wins), `character_chosen` (the character, for subsetting to Defect), and `ascension_level` (the Ascension level, for subsetting to 20).
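The subsetting step could be expressed as a small predicate over those fields. This is a sketch under one notable assumption: that Defect is stored in `character_chosen` as the string `"DEFECT"`. That value should be checked against the actual files before relying on it, and `is_target_run` is a hypothetical name.

```python
def is_target_run(run: dict, character: str = "DEFECT", ascension: int = 20) -> bool:
    """True for a winning run with the given character and Ascension level."""
    return (
        run.get("victory") is True
        # Assumption: the character is stored as an upper-case string.
        and run.get("character_chosen") == character
        and run.get("ascension_level") == ascension
    )
```

Using `.get` means runs that are missing one of the fields are dropped rather than raising an error, and changing the `character` argument would cover Ironclad and The Silent if time allows.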

### Unknowns and Dependencies  
Spirelogs may stop hosting the data or be turned off completely in the future. As of mid-November 2019, several tables on the site are broken and Alleji makes comments such as, "Unfortunately, this page got broken by Act 4 :(".  Since it appears to be a pet project, I imagine it gets occasional maintenance but may go down on occasion. This being the case, I’ll build my scripts to pick up the data directly from the archive (for as long as it remains online), and if it does go down, references to the already-downloaded files in the repo will ensure reproducibility.

# III. Methodology
Clustering binary data is relatively straightforward, and I'd like to use this project as a test bed to see the differences in output of more than one algorithm. Depending on the cluster shapes, some algorithms perform better than others. Generally speaking, we'll want to choose from the unsupervised clustering algorithms, which include k-means (optionally with k-means++ initialization), hierarchical clustering, density-based spatial clustering of applications with noise (DBSCAN), Gaussian mixture models (GMM), and potentially others.

Historically, I've worked extensively with the [partitioning around medoids (PAM) algorithm](https://en.wikipedia.org/wiki/K-medoids), but only on quantitative data, so I'm not sure how well it will suit binary factor data.  Because PAM operates on pairwise dissimilarities rather than raw coordinates, pairing it with a distance suited to binary data (Jaccard, for example) may make it workable.  In working with the data, it should become apparent if this method will be useful as an alternative to the above.
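To make the medoid idea concrete for binary data, here is a rough sketch that pairs Jaccard distance with a simple alternating k-medoids loop. This is not the full PAM swap procedure, just the assign-then-update variant, and the function names and the `initial_medoids` parameter are hypothetical choices of mine.

```python
import random


def jaccard_distance(a, b):
    """Jaccard distance between two equal-length 0/1 rows."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two all-zero rows have nothing to compare; treat them as identical.
    return 0.0 if either == 0 else 1.0 - both / either


def k_medoids(rows, k, initial_medoids=None, seed=0, max_iter=100):
    """Alternating k-medoids on Jaccard distances.

    Returns (medoids, labels): medoids are row indices, and labels[i] is the
    position in `medoids` of the medoid closest to row i.
    """
    n = len(rows)
    dist = [[jaccard_distance(rows[i], rows[j]) for j in range(n)] for i in range(n)]
    if initial_medoids is None:
        medoids = random.Random(seed).sample(range(n), k)
    else:
        medoids = list(initial_medoids)

    def assign(current):
        # Each run joins its nearest medoid (ties go to the earlier medoid).
        return [min(range(k), key=lambda c: dist[i][current[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                updated.append(medoids[c])  # keep the medoid of an empty cluster
                continue
            # New medoid: the member with the smallest total distance to the rest.
            updated.append(min(members, key=lambda m: sum(dist[m][i] for i in members)))
        if updated == medoids:
            break
        medoids = updated
    return medoids, assign(medoids)
```

A practical appeal of medoids here is that each cluster centre is an actual winning run, so an archetype can be illustrated with a real deck instead of an averaged one. The result depends on the starting medoids, so several seeds are worth trying.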

In order to choose the number of clusters, I'll implement either the Elbow method or the Silhouette width method to see what a reasonable number looks like.
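For reference, the silhouette width of a run is (b - a) / max(a, b), where a is its mean distance to the other runs in its own cluster and b is the lowest mean distance to the runs of any other cluster. The sketch below averages that over all runs. It assumes a precomputed square distance matrix and one cluster label per run; the function name `mean_silhouette` is hypothetical.

```python
def mean_silhouette(dist, labels):
    """Average silhouette width from a square distance matrix and cluster labels."""
    clusters = {}
    for i, label in enumerate(labels):
        clusters.setdefault(label, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette width needs at least two clusters")

    widths = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            widths.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the rest of the run's own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(dist[i][j] for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        widths.append(0.0 if max(a, b) == 0 else (b - a) / max(a, b))
    return sum(widths) / len(widths)
```

Running the chosen clustering for a range of cluster counts and keeping the count with the highest average width would be one way to arrive at a reasonable number.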

Ultimately, the end result will be a list of the cards and relics in each cluster along with their relative frequency within that cluster.  I expect some cards or relics to be the central points around which powerful resource-synergies are formed. Much like the work I do in market research, I intend to label each cluster to indicate its primary source of synergy (e.g. "High-Focus Ice Build" or "Infinite-Draw Powers").
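The per-cluster summary could be produced along these lines. The sketch assumes the binary table is held as a list of column names plus a list of 0/1 rows, with one cluster label per row; that layout, the `cluster_profiles` name, and the `top_n` cutoff are all hypothetical.

```python
from collections import defaultdict


def cluster_profiles(columns, rows, labels, top_n=10):
    """Map each cluster label to its top_n (item, share of runs) pairs."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)

    profiles = {}
    for label, members in grouped.items():
        # Share of the cluster's runs that ended with each card or relic.
        shares = [
            (col, sum(row[j] for row in members) / len(members))
            for j, col in enumerate(columns)
        ]
        # Most common items first; ties broken alphabetically for stable output.
        shares.sort(key=lambda pair: (-pair[1], pair[0]))
        profiles[label] = shares[:top_n]
    return profiles
```

Items that appear in nearly every run of one cluster but rarely elsewhere would be the natural candidates for naming that archetype.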

I intend to post this online in Slay the Spire forums and connect back with Alleji to let him know the results of the analysis.  If there is a positive reception (and my interest holds), I'd like to continue this investigation and perhaps create a deep neural net which 'solves' the game using this rich dataset.