This notebook walks through an analysis of the platformer games in the VGLC, along with a custom dataset made for another game. Before getting into the details, we read everything in. Each platformer's levels will then be built into an [n-gram](https://en.wikipedia.org/wiki/N-gram).

In [1]:
%matplotlib inline
from scipy.stats import mode
from pathlib import Path
from enum import Enum

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import json
import os

We have two cases. The first is the custom format, which is a JSON array. The second is the VGLC format, which is a text file laid out much like the JSON array, with one row of tiles per line. For both, we read them in and return the contents of the files.

In [2]:
def read_json(file_path):
    # The custom format is a JSON array of rows; parse it straight from the file.
    with open(file_path) as f:
        level_matrix = json.load(f)
    columns = [[] for i in range(len(level_matrix[0]))]

    for row in level_matrix:
        for i in range(len(row)):
            columns[i].append(row[i])

    return columns

In [3]:
def read_txt(file_path, vertical):
    with open(file_path) as f:
        lines = f.readlines()
    
    if vertical:
        columns = []
        for line in reversed(lines):
            columns.append(list(line.strip()))
    else:
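        # Size the column list from the stripped first line so a missing trailing newline does not drop a column.
        columns = [[] for _ in range(len(lines[0].strip()))]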
        for line in lines:
            line = line.strip()
            for columnIndex in range(len(line)):
                columns[columnIndex].insert(0, line[columnIndex])
            
    return columns
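
To make the column convention concrete, here is a minimal sketch, assuming a tiny made-up level (the rows in `toy_rows` and the helper name `rows_to_columns` are hypothetical, not taken from the VGLC or from this notebook). It mirrors what the non-vertical branch of `read_txt` does: each column is read bottom-to-top, so index 0 of a column is the bottom row of the level.

```python
# Toy level, top row first, as it would appear in a VGLC-style text file.
# These rows are invented for illustration only.
toy_rows = [
    "--?-",
    "----",
    "XX-X",
]

def rows_to_columns(rows):
    """Turn top-to-bottom rows into left-to-right columns read bottom-to-top."""
    # reversed() puts the bottom row first; zip(*...) transposes rows into columns.
    return [list(col) for col in zip(*reversed(rows))]

toy_columns = rows_to_columns(toy_rows)
print(toy_columns)
```

The vertical branch is different: it keeps each line whole and only reverses the line order, so for Kid Icarus the "columns" are really rows, starting from the bottom of the level.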

In [4]:
def get_file_contents(path, vertical=False, json=False):
    file_contents = []
    for file_name in os.listdir(path):
        # meta is a special case for custom game which is in Unity.
        if 'meta' in file_name:
            continue
        
        file_path = os.path.join(path, file_name)
        if os.path.isdir(file_path):
            continue
        
        if json:
            file_contents.append(read_json(file_path))
        else:
            file_contents.append(read_txt(file_path, vertical))
    
    return file_contents

I have the [VGLC](https://github.com/TheVGLC/TheVGLC) stored in `~/data/TheVGLC`. If you have it somewhere else, you will need to modify `get_vglc_path` below. Also note that there is a vertical flag, since the [Kid Icarus levels are in a vertical format](https://github.com/TheVGLC/TheVGLC/blob/master/Kid%20Icarus/Processed/kidicarus_1.txt).

In [5]:
def get_vglc_path(game_name):
    return os.path.join(Path.home(), 'data', 'TheVGLC', game_name, 'Processed')

In [6]:
class Game(Enum):
    SuperMarioBros = 0
    SuperMarioBros2 = 1
    SuperMarioBros2Japan = 2
    SuperMarioLand = 3
    KidIcarus = 4
    Custom = 5

In [7]:
data = {}
data[Game.SuperMarioBros] = get_file_contents(get_vglc_path("Super Mario Bros"))
data[Game.SuperMarioBros2Japan] = get_file_contents(get_vglc_path("Super Mario Bros 2 (Japan)"))
data[Game.SuperMarioLand] = get_file_contents(get_vglc_path("Super Mario Land"))
data[Game.KidIcarus] = get_file_contents(get_vglc_path("Kid Icarus"), vertical=True)
data[Game.Custom] = get_file_contents(os.path.join('..', 'Assets', 'Resources', 'Levels'), json=True)

path = os.path.join(get_vglc_path("Super Mario Bros 2"), 'WithEnemies')
data[Game.SuperMarioBros2] = get_file_contents(path)

Now that all the games are loaded, we build n-grams from them, for n from 2 to 5.

In [8]:
min_n = 2
max_n = 5

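This function receives an n and the levels we loaded above and builds an n-gram: a mapping from each prior (the previous n - 1 columns) to the number of times each output column follows it.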

In [9]:
def build_n_gram(n, levels):
    ngram = {}
    
    for columns in levels:
        prior = []
        for col in columns:
            json_col = json.dumps(col)
            if len(prior) == n - 1:
                json_prior = json.dumps(prior)
                
                if json_prior not in ngram:
                    ngram[json_prior] = {}
                    ngram[json_prior][json_col] = 1
                elif json_col not in ngram[json_prior]:
                    ngram[json_prior][json_col] = 1
                else:
                    ngram[json_prior][json_col] += 1
                
                prior.pop(0)

            prior.append(col)

    return ngram
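
As a quick check on what this table looks like, here is a minimal sketch on a toy level. It is an equivalent sliding-window formulation, not the function above: `count_transitions` and `toy_level` are hypothetical names, and the single-tile columns are invented for illustration.

```python
import json
from collections import Counter, defaultdict

def count_transitions(n, levels):
    """Map each JSON-encoded prior of n - 1 columns to a Counter of next columns."""
    table = defaultdict(Counter)
    for columns in levels:
        # Slide a window of n columns over the level; the last column is the output.
        for i in range(len(columns) - n + 1):
            prior = json.dumps(columns[i:i + n - 1])
            output = json.dumps(columns[i + n - 1])
            table[prior][output] += 1
    return table

# One hypothetical level with four single-tile columns: A, B, A, B.
toy_level = [["A"], ["B"], ["A"], ["B"]]
bigram = count_transitions(2, [toy_level])
for prior, outputs in bigram.items():
    print(prior, dict(outputs))
```

With n = 2, the prior `[["A"]]` is followed by `["B"]` twice and the prior `[["B"]]` is followed by `["A"]` once. A level with four columns produces only three prior-output pairs.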

In [10]:
n_grams = {}
for n in range(min_n, max_n + 1):
    n_gram_data = {}
    
    for game in Game:
        n_gram_data[game] = build_n_gram(n, data[game])
        
    n_grams[n] = n_gram_data

The first exploratory analysis we'll do is look at the number of priors versus the number of outputs.

In [11]:
def build_prior_and_num_outputs_df(n):
    game_to_n_gram = n_grams[n]
    rows = []
    for game in game_to_n_gram:
        n_gram = game_to_n_gram[game]
        
        priors = len(n_gram)
        output_count = 0
        for prior in n_gram:
            for output in n_gram[prior]:
                output_count += n_gram[prior][output]
                
        rows.append({
            'Game': str(game), 
            'Prior Count': priors, 
            'Number of Outputs': output_count,
            'Number of Outputs / Prior': output_count / priors})
    
    return pd.DataFrame(rows)

In [12]:
prior_and_num_output_dfs = {}
for n in n_grams:
    print(f'n = {n}')
    prior_and_num_output_dfs[n] = build_prior_and_num_outputs_df(n)
    print(prior_and_num_output_dfs[n].to_string(index=False), '\n\n\n')

n = 2
                      Game  Prior Count  Number of Outputs  Number of Outputs / Prior
       Game.SuperMarioBros          271               2908                  10.730627
      Game.SuperMarioBros2          217               3289                  15.156682
 Game.SuperMarioBros2Japan          604               4497                   7.445364
       Game.SuperMarioLand          631               2787                   4.416799
            Game.KidIcarus          348               1249                   3.589080
               Game.Custom          226                852                   3.769912 



n = 3
                      Game  Prior Count  Number of Outputs  Number of Outputs / Prior
       Game.SuperMarioBros          681               2893                   4.248164
      Game.SuperMarioBros2          575               3273                   5.692174
 Game.SuperMarioBros2Japan         1320               4475                   3.390152
       Game.SuperMarioLand         122

These tables show that as n increases, so does the number of priors. This fits our expectations: a longer prior is less likely to repeat, so we see fewer duplicates. This, in part, contributes to the problem of memorization. The number of outputs also decreases as n increases because the effective dataset size shrinks: a level with m columns yields only m - (n - 1) prior-output pairs. The final column shows how many outputs there are per prior. This value is always greater than or equal to 1, and, as expected, it falls as n increases. The problem with this ratio is that it counts every occurrence, so it says little about how many different choices a prior actually offers. We look at that next: instead of counting output occurrences per prior, we count how many distinct outputs occur per prior.
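
The difference between the two measures is easy to see on a toy example. The sketch below assumes a hand-written n-gram in the same shape that `build_n_gram` returns (prior to output to count); the keys `p1`, `p2`, `a`, and `b` are placeholders, not real columns.

```python
# Hypothetical n-gram: prior -> {output: occurrence count}.
toy_n_gram = {
    "p1": {"a": 8},          # one output, seen eight times
    "p2": {"a": 1, "b": 1},  # two outputs, seen once each
}

# Measure 1: total output occurrences divided by the number of priors.
total_occurrences = sum(sum(outputs.values()) for outputs in toy_n_gram.values())
occurrences_per_prior = total_occurrences / len(toy_n_gram)

# Measure 2: distinct outputs per prior, averaged over priors.
distinct_outputs = [len(outputs) for outputs in toy_n_gram.values()]
mean_distinct_per_prior = sum(distinct_outputs) / len(toy_n_gram)

print(occurrences_per_prior, mean_distinct_per_prior)
```

Here the first measure is 5.0 even though half of the priors offer no choice at all, while the second is 1.5.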

In [13]:
def n_gram_prior_stats(n):
    game_to_n_gram = n_grams[n]
    rows = []
    
    for game in game_to_n_gram:
        n_gram = game_to_n_gram[game]
        outputs = []
        
        for prior in game_to_n_gram[game]:
            outputs.append(len(n_gram[prior]))
            
        outputs = np.array(outputs)

        data = {}
        data['Game'] = str(game)
        data['Mean'] = np.mean(outputs)
        data['Median'] = np.median(outputs)
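        # atleast_1d handles both SciPy behaviors: older versions return an array, newer ones a scalar.
        data['Mode'] = np.atleast_1d(mode(outputs)[0])[0]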
        data['Min'] = np.min(outputs)
        data['Max'] = np.max(outputs)
        data['STD'] = np.std(outputs)
        data['Priors with 1 Output / Num Priors'] = np.count_nonzero(outputs == 1) / len(outputs)
        rows.append(data)
        
    return pd.DataFrame(rows)

In [14]:
for n in n_grams:
    print(f'n = {n}')
    print(n_gram_prior_stats(n).to_string(index=False), '\n\n\n')

n = 2
                      Game      Mean  Median  Mode  Min  Max       STD  Priors with 1 Output / Num Priors
       Game.SuperMarioBros  2.512915     1.0     1    1   58  4.838234                           0.535055
      Game.SuperMarioBros2  2.649770     1.0     1    1   47  4.752580                           0.585253
 Game.SuperMarioBros2Japan  2.192053     1.0     1    1   60  4.353145                           0.607616
       Game.SuperMarioLand  1.941363     1.0     1    1   55  3.124676                           0.670365
            Game.KidIcarus  1.882184     1.0     1    1   99  5.429318                           0.729885
               Game.Custom  1.831858     1.0     1    1   24  2.300068                           0.663717 



n = 3
                      Game      Mean  Median  Mode  Min  Max       STD  Priors with 1 Output / Num Priors
       Game.SuperMarioBros  1.547724     1.0     1    1   50  2.398206                           0.794420
      Game.SuperMarioBros2  1.638261   

First, and unsurprisingly, the max number of outputs, standard deviation, and mean decrease as n increases. The next thing to note is that the median and mode for every game, for every n, are 1. This alone shows that there is a problem, since at least half of the priors will always return the same output. We can learn even more by looking at the last column, which shows that, even at n=2, *Super Mario Bros.* has about 53.5% of its priors with only a single output. These tables point to a serious problem in game level datasets. If we want ML algorithms to generate a diverse set of levels, then we need datasets that allow it.
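
To illustrate why single-output priors limit diversity, here is a minimal sketch of count-weighted sampling from one prior's outputs. This notebook does not include a generator, so `sample_next` is a hypothetical helper and the weighting scheme is an assumption about how an n-gram generator would typically pick the next column.

```python
import random

def sample_next(outputs, rng):
    """Pick the next column with probability proportional to its count."""
    choices = list(outputs)
    weights = [outputs[choice] for choice in choices]
    return rng.choices(choices, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# A hypothetical prior with a single output, however often it was seen.
single = {"a": 8}
samples = {sample_next(single, rng) for _ in range(100)}
print(samples)
```

Every draw returns the same column, so whenever generation reaches such a prior it can only replay what was in the training levels.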