This example notebook shows how to load the [Undergraduate Games Corpus](https://github.com/barrettrees/undergraduate_games_corpus) and inspect the contents of one of the games that it references.

# Loading the corpus metadata

We provide all metadata in a single [JSON](https://www.json.org/) file, so loading it is straightforward:

In [None]:
import json

with open("corpus.json") as f:
    corpus = json.load(f)
    
print(len(corpus), 'games total')
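
As a quick sanity check after loading, it can help to tally how many games use each engine. The sketch below assumes every metadata object carries an `engine` key (the Twine lookup later in this notebook relies on the same key). The `sample_corpus` dictionary, its ARK-like keys, and the helper name `count_engines` are made up for illustration so that the snippet runs without `corpus.json`.

```python
from collections import Counter

def count_engines(corpus):
    """Tally games per engine; entries without an 'engine' key count as 'unknown'."""
    return Counter(game.get('engine', 'unknown') for game in corpus.values())

# Hypothetical stand-in for the real corpus (keys and values are illustrative only).
sample_corpus = {
    'ark:/00000/example1': {'engine': 'Twine'},
    'ark:/00000/example2': {'engine': 'Bitsy'},
    'ark:/00000/example3': {'engine': 'Twine'},
}

engine_counts = count_engines(sample_corpus)
for engine, count in engine_counts.most_common():
    print(engine, count)
```

Passing the real `corpus` object to `count_engines` should give the same kind of summary for the full dataset.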

# Inspecting a single game

The `corpus` object maps [archival resource keys (ARKs)](https://en.wikipedia.org/wiki/Archival_Resource_Key) to game metadata objects. Let's look at the entry for the first game that happens to be built on [Twine](https://twinery.org/).

In [None]:
for ark, game in corpus.items():
    if game['engine'] == 'Twine':
        break
        
ark, game
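
Note that if the corpus contained no Twine game, the loop above would finish without hitting `break` and leave `ark, game` pointing at the last entry. A slightly more defensive variant, sketched here with a hypothetical helper `first_game_with_engine` and made-up sample data, returns `None` when nothing matches:

```python
def first_game_with_engine(corpus, engine):
    """Return the first (ark, game) pair for the given engine, or None if absent."""
    return next(
        ((ark, game) for ark, game in corpus.items() if game.get('engine') == engine),
        None,
    )

# Hypothetical stand-in for the real corpus (keys and values are illustrative only).
sample_corpus = {
    'ark:/00000/example1': {'engine': 'Bitsy'},
    'ark:/00000/example2': {'engine': 'Twine'},
    'ark:/00000/example3': {'engine': 'Twine'},
}

match = first_game_with_engine(sample_corpus, 'Twine')
missing = first_game_with_engine(sample_corpus, 'NoSuchEngine')
print(match)
print(missing)
```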

# Downloading and interpreting game project source files

Twine games usually consist of a single HTML file. Let's download the first file listed in the entry above.

In [None]:
import urllib.request
url = game['files'][0]['downloadLink']
with urllib.request.urlopen(url) as u:
    source = u.read()
    
print(source[:1024])
len(source), type(source)
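
The download arrives as `bytes`, which is why the printed preview starts with `b'...'`. BeautifulSoup accepts bytes directly, but for reading or searching the file as text it may be handy to decode it first. The sketch below assumes the file is UTF-8 encoded, which is an assumption rather than something the corpus metadata guarantees; `to_text` and `sample_source` are illustrative names.

```python
def to_text(raw, encoding='utf-8'):
    """Decode downloaded bytes, replacing any undecodable bytes instead of failing."""
    return raw.decode(encoding, errors='replace')

# Hypothetical stand-in for the downloaded bytes.
sample_source = '<tw-storydata name="Café"></tw-storydata>'.encode('utf-8')

text = to_text(sample_source)
# The byte count exceeds the character count because "é" takes two bytes in UTF-8.
print(type(text), len(sample_source), len(text))
```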

## Parsing a Twine game to recover passage text and other data

We'll use the [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/) library to parse this HTML file and recover the Twine-specific tags.

In [None]:
!pip install --quiet beautifulsoup4

In [None]:
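import bs4

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = bs4.BeautifulSoup(source, 'html.parser')

{
    # Story-level attributes from the <tw-storydata> element.
    'meta': soup.find('tw-storydata').attrs,
    # One entry per passage: its attributes and its text.
    'contents': [
        {
            'meta': passagedata.attrs,
            'contents': passagedata.get_text(),
        }
        for passagedata in soup.find_all('tw-passagedata')
    ],
}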
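
If installing BeautifulSoup is not an option, roughly the same passage data can be recovered with Python's standard-library `html.parser` module. This is a swapped-in technique, not the one used above, and it is only a minimal sketch: `PassageCollector` and the tiny `sample_html` document are made up for illustration, and real Twine files contain additional elements (styles, scripts) that this collector simply ignores.

```python
from html.parser import HTMLParser

class PassageCollector(HTMLParser):
    """Collect the attributes and text of each <tw-passagedata> element."""

    def __init__(self):
        # The default convert_charrefs=True decodes entities such as &lt; in text.
        super().__init__()
        self.passages = []
        self._current = None  # passage being read, or None when outside one

    def handle_starttag(self, tag, attrs):
        if tag == 'tw-passagedata':
            self._current = {'meta': dict(attrs), 'contents': ''}

    def handle_data(self, data):
        if self._current is not None:
            self._current['contents'] += data

    def handle_endtag(self, tag):
        if tag == 'tw-passagedata' and self._current is not None:
            self.passages.append(self._current)
            self._current = None

# Hypothetical miniature Twine document (illustrative only).
sample_html = (
    '<tw-storydata name="Demo" startnode="1">'
    '<tw-passagedata pid="1" name="Start">Hello &lt;b&gt;world&lt;/b&gt;</tw-passagedata>'
    '<tw-passagedata pid="2" name="End">Goodbye</tw-passagedata>'
    '</tw-storydata>'
)

collector = PassageCollector()
collector.feed(sample_html)
collector.close()
passages = collector.passages
print(passages)
```

To try it on the downloaded game, decode `source` to a string first and pass that to `feed`.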