# ETCSRI JSON Scraper

In [1]:
from urllib.request import urlopen
import json
import csv
import os

This JSON scraper will prepare the JSON data from http://oracc.museum.upenn.edu/etcsri as training data in .csv files.

## Get list of texts from etcsri corpus

In [2]:
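texts = dict()  # text ID -> relative path of that text's JSON file
corpus_url = "http://oracc.museum.upenn.edu/etcsri/corpus.json"
try:
    corpus_data = json.loads(urlopen(corpus_url).read().decode('UTF-8'))
    texts = corpus_data['members']
except (OSError, ValueError) as err:
    # Network errors are OSError subclasses; malformed JSON raises ValueError.
    print("Could not load the corpus index:", err)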

`json.loads()` deserializes the JSON string fetched from `corpus_url` into a Python dictionary, `corpus_data`.
We access the texts in the corpus through the value stored at key 'members' of `corpus_data`.
We store that value in the dict variable `texts` for easier access later on.
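
As a minimal offline sketch of this step, the snippet below parses a small hand-written JSON string in place of the real `corpus.json`. The text IDs, the paths, and the "type" entry in `sample_corpus` are invented for illustration; only the 'members' key is taken from the cell above.

```python
import json

# Hypothetical stand-in for the body of corpus.json; IDs and paths are invented.
sample_corpus = (
    '{"type": "corpus", "members": {'
    '"Q000001": "corpusjson/Q000001.json", '
    '"Q000002": "corpusjson/Q000002.json"}}'
)

corpus_data = json.loads(sample_corpus)   # JSON string -> Python dict
sample_texts = corpus_data['members']     # dict: text ID -> relative path

for text_id, path in sample_texts.items():
    print(text_id, path)
```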

In [3]:
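lines = list()      # (form, gw or cf, pos) tuples of the line being read
all_lines = list()  # finished lines of the current text
fragmented = True   # True until the current line yields a usable word
def recurse_keys(df, indent = '  '):
    # `indent` is not used by the body; it is only passed along.
    global fragmented
    for key in df.keys():
        if key == 'form':
            # Keep only words that have both a guide word and a part of speech.
            if 'gw' in df.keys() and 'pos' in df.keys():
                # For a purely numeric guide word, store 'cf' instead if present.
                if df['gw'].isdigit() and 'cf' in df.keys():
                    lines.append((df[key], df['cf'], df['pos']))
                else:
                    lines.append((df[key], df['gw'], df['pos']))
                fragmented = False
        if key == 'type':
            # A labelled 'line-start' node marks the beginning of a new line.
            if 'label' in df.keys():
                if df['type'] == 'line-start':
                    if not fragmented:
                        all_lines.append(lines[:])
                        lines.clear()
                    fragmented = True
        if key == 'f':
            recurse_keys(df[key], indent + '   ')
        if key == 'cdl':
            # Descend into every child node that is a dict.
            for child in df[key]:
                if isinstance(child, dict):
                    recurse_keys(child, indent + '   ')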

`recurse_keys()` is a function that recursively traverses the nested tree of nodes inside the JSON file of an individual text, following the keys of each dict.

The tree has three primary node types:
1. c - a chunk of text
2. d - a discontinuity
3. l - a lemmatization of the text

The array of children of any chunk node is named "cdl" after these three node types. All text is stored inside the "cdl" elements, so we recursively call `recurse_keys` on every element of a "cdl" array that is a dict to get to the bottom of the nested tree.
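
The traversal pattern can be sketched on a hand-built tree. The `toy` dict, its "node" key, and the `count_nodes` helper are assumptions made for illustration; only the "cdl", "f", "type", and "label" keys come from the description above.

```python
from collections import Counter

# Hand-built toy tree; the "node" key and all values are illustrative assumptions.
toy = {
    "node": "c",
    "cdl": [
        {"node": "d", "type": "line-start", "label": "1"},
        {"node": "c", "cdl": [
            {"node": "l", "f": {"form": "lugal"}},
            {"node": "l", "f": {"form": "e2"}},
        ]},
    ],
}

def count_nodes(node, counts=None):
    """Recursively count node types by descending through every "cdl" array."""
    if counts is None:
        counts = Counter()
    counts[node.get("node")] += 1
    for child in node.get("cdl", []):
        if isinstance(child, dict):   # skip anything that is not a node
            count_nodes(child, counts)
    return counts

counts = count_nodes(toy)
print(dict(counts))
```

Only chunk nodes carry a "cdl" array in this toy tree, so the recursion stops at the "d" and "l" leaves.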

The text is broken down into smaller chunks that together make up a bigger chunk (e.g. sentence $\rightarrow$ discourse $\rightarrow$ text). Each sentence may contain discontinuities; a "d" node that has "type" = "line-start" and a "label" element marks the beginning of a new line of text.

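We collect the words of one line in `lines`. When we reach the start of a new line and the current line is not fragmented, we append a copy of `lines` (a shallow copy made with `lines[:]`, which is sufficient because its elements are tuples of strings) to `all_lines`, and then we clear `lines` to start the next line.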

To collect words, we look for the key "f", which belongs to a lemmatized "l" node and holds another dict with the word we want. We call `recurse_keys` on the "f" dict and look inside it for the key "form", whose value is a single word. If the dict also has "gw" and "pos", we append the word to `lines`, the list of words on the current line, as a (form, gw, pos) tuple. When "gw" consists only of digits and a "cf" entry exists, "cf" is stored in place of "gw".
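
The rule for building one word tuple can be isolated in a small helper. `word_tuple` is a hypothetical function, not part of the notebook, and the sample "f" dicts are invented; the selection logic mirrors the "form" branch of `recurse_keys`.

```python
def word_tuple(f):
    """Return (form, label, pos) for one "f" dict, or None if it lacks gw/pos."""
    if 'form' not in f or 'gw' not in f or 'pos' not in f:
        return None
    # Same rule as recurse_keys: a purely numeric "gw" is replaced by "cf" if present.
    label = f['cf'] if f['gw'].isdigit() and 'cf' in f else f['gw']
    return (f['form'], label, f['pos'])

# Hypothetical "f" dicts; the values are invented for illustration.
plain = word_tuple({'form': 'lugal', 'gw': 'king', 'pos': 'N'})
numeric = word_tuple({'form': 'ur-{d}namma', 'gw': '1', 'cf': 'Ur-Namma', 'pos': 'RN'})
unlemmatized = word_tuple({'form': 'x'})

print(plain)         # ('lugal', 'king', 'N')
print(numeric)       # ('ur-{d}namma', 'Ur-Namma', 'RN')
print(unlemmatized)  # None
```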

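The `fragmented` boolean ensures that text too fragmented for translation is not included in our training data. It is set to True at every "line-start" node and to False once a word with "gw" and "pos" is found, so `recurse_keys` does not add a line without any such word to `all_lines`.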
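
The effect of the flag is easier to see on a flat sequence of events than on the nested tree. The sketch below is a simplified variant, not the notebook's code: `group_lines` and the `events` list are hypothetical, and the last line is flushed inside the function instead of by the caller.

```python
def group_lines(events):
    """Group 'word' events into lines split at 'line-start' events.

    Lines that contain no word stay "fragmented" and are dropped.
    """
    lines, all_lines = [], []
    fragmented = True
    for kind, payload in events:
        if kind == 'line-start':
            if not fragmented:
                all_lines.append(lines[:])   # store a copy of the finished line
                lines.clear()
            fragmented = True
        elif kind == 'word':
            lines.append(payload)
            fragmented = False
    if not fragmented:                       # flush the last line, if it has words
        all_lines.append(lines[:])
    return all_lines

# Hypothetical event stream: the second line has no words at all.
events = [
    ('line-start', '1'), ('word', 'a'), ('word', 'b'),
    ('line-start', '2'),
    ('line-start', '3'), ('word', 'c'),
]
grouped = group_lines(events)
print(grouped)  # [['a', 'b'], ['c']]
```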

In [4]:
keys = list(texts.keys())    # text IDs, used for the csv file names
vals = list(texts.values())  # relative paths of the per-text JSON files
for i in range(len(vals)):
    text_url = "http://oracc.museum.upenn.edu/etcsri/" + vals[i]
    try:
        text_data = json.loads(urlopen(text_url).read().decode('UTF-8'))
        recurse_keys(text_data)
        # The last line has no following "line-start" node, so add it here.
        all_lines.append(lines[:])
        # Drop the first element only if it really is an empty list.
        if not all_lines[0]:
            del all_lines[0]
        forms = [[tup[0] for tup in line] for line in all_lines]
        guide_words = [[tup[1] for tup in line] for line in all_lines]
        part_of_speeches = [[tup[2] for tup in line] for line in all_lines]
        sentences = [' '.join(line) for line in forms]
        sentences_gw = [' '.join(line) for line in guide_words]
        sentences_pos = [' '.join(line) for line in part_of_speeches]
        rows = zip(sentences, sentences_gw, sentences_pos)
        filename = 'train/etcsri/' + keys[i] + '.csv'
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, 'w',newline="\n", encoding="utf-8") as f:
            writer = csv.writer(f)
            writer.writerow(('Form', 'Gw', 'Pos'))
            for r in rows:
                writer.writerow(r)
    except (OSError, ValueError) as err:
        # Skip texts that cannot be downloaded or parsed, but report them.
        print("Skipping", keys[i], ":", err)
    lines.clear()
    all_lines.clear()

We store the keys and values of the `texts` dict as separate lists `keys` and `vals`, respectively. We then iterate through the values (one per text in the etcsri corpus), and deserialize the JSON file of each individual text into a dict, `text_data`. We call `recurse_keys` on `text_data` to build up our `all_lines` list from the nested tree.

We add the last line of text manually, because `recurse_keys` only stores a line when it reaches the next "line-start" node, and the last line has no such node after it. We also delete the first element of `all_lines`, which is expected to be an empty list: at the first "line-start" node of a text, `recurse_keys` appends the still-empty `lines` if `fragmented` is False. Because `fragmented` is a global that is not reset between texts, this expectation can fail (for example on the very first text, where `fragmented` starts as True), and the first element is then a real line. No node marks the end of a line, so line boundaries have to be identified by "line-start" nodes, which is why the missing last line and the extraneous first element need special handling.

Each line in `all_lines` is a list of (form, gw, pos) tuples, one per word. We split these into three parallel lists of lists, `forms`, `guide_words`, and `part_of_speeches`, and then join the entries of each line with ' ' so that every line becomes a single string. The resulting lists of strings are stored in the variables `sentences`, `sentences_gw`, and `sentences_pos`.
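
A small worked example of this step, using invented data in a hypothetical `demo_lines` list that has the same shape as `all_lines`:

```python
# Two invented lines of (form, gw, pos) tuples, shaped like all_lines.
demo_lines = [
    [('lugal', 'king', 'N'), ('kalag-ga', 'strong', 'V/i')],
    [('e2', 'house', 'N')],
]

# Pick one tuple position per list, then join each line into one string.
demo_sentences = [' '.join(tup[0] for tup in line) for line in demo_lines]
demo_sentences_gw = [' '.join(tup[1] for tup in line) for line in demo_lines]
demo_sentences_pos = [' '.join(tup[2] for tup in line) for line in demo_lines]

print(demo_sentences)      # ['lugal kalag-ga', 'e2']
print(demo_sentences_gw)   # ['king strong', 'house']
print(demo_sentences_pos)  # ['N V/i', 'N']
```

The three lists stay aligned by index, so entry `k` of each list describes the same line.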

To prepare the training data as a csv file, we zip the three lists into rows and write them as the columns Form, Gw, and Pos of a csv file named after the current text, inside the `train/etcsri/` directory.
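
The csv layout can be checked without touching the disk by writing to an in-memory buffer. This is a sketch with invented rows; `io.StringIO` stands in for the file that the notebook opens, and the `demo_*` names are hypothetical.

```python
import csv
import io

# Invented column data, one entry per line of text.
demo_sentences = ['lugal kalag-ga', 'e2']
demo_sentences_gw = ['king strong', 'house']
demo_sentences_pos = ['N V/i', 'N']

buf = io.StringIO(newline='')        # in-memory stand-in for the csv file
writer = csv.writer(buf)
writer.writerow(('Form', 'Gw', 'Pos'))
# zip pairs up entry k of each list, giving one csv row per line of text.
writer.writerows(zip(demo_sentences, demo_sentences_gw, demo_sentences_pos))

csv_lines = buf.getvalue().splitlines()
for line in csv_lines:
    print(line)
```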

After we are done writing the training data of one text to the csv file of that text, we clear `lines` and `all_lines` and repeat the procedure for the rest of the texts inside the for loop. 