# Extract Lemmatization from JSON: Extended Parser
The code in this notebook will parse [ORACC](http://oracc.org) `JSON` files to extract lemmatization data for one or more projects. The code shows how the word-by-word data structure can be reformatted to a line-by-line or document-by-document structure and discusses various other options.

The output of the Extended Parser contains text IDs, line IDs, lemmas, and (potentially) other data. The first few code blocks are identical to those of the Basic Parser.

In [1]:
import pandas as pd
import zipfile
import json
import tqdm
import requests
import errno
import os

## 0 Create Directories, if Necessary
The two directories needed for this script are `jsonzip` and `output`. They are created if they do not yet exist; existing directories are left untouched.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).

In [2]:
directories = ['jsonzip', 'output']
for d in directories:
    try:
        os.mkdir(d)
    except OSError as exc:
        if exc.errno != errno.EEXIST:   # re-raise anything other than "directory already exists"
            raise
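
On Python 3.2 and later the same effect can be obtained with `os.makedirs()` and its `exist_ok` flag. The following is a minimal sketch; the helper name `ensure_directories` and its `base` parameter are illustrative and not part of the notebook.

```python
import os

def ensure_directories(names, base="."):
    """Create each directory under `base` if it does not exist yet."""
    created = []
    for name in names:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # no error if the directory is already there
        created.append(path)
    return created
```

Calling `ensure_directories(['jsonzip', 'output'])` twice in a row is harmless, because the second call finds both directories in place.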

## 1.1 Input Project Names
Provide a list of one or more project names, separated by commas. Note that subprojects must be listed separately; they are not included in the main project. For instance:

`saao/saa01,saao/saa02,blms`

In [3]:
projects = input('Project(s): ').lower()

Project(s): saao/saa17, dcclt, blms, bla


## 1.2 Split the List of Projects
Split the list of projects and create a list of project names.

In [4]:
p = projects.split(',')               # split at each comma and make a list called `p`
p = [x.strip() for x in p]        # strip spaces left and right of each entry in `p`
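
The two steps can also be wrapped in a small function that additionally drops empty entries (for instance after a trailing comma). This is a sketch only; the function name `split_projects` is hypothetical.

```python
def split_projects(projects):
    """Split a comma-separated string into a clean list of project names."""
    names = [name.strip() for name in projects.lower().split(',')]
    return [name for name in names if name]  # drop empty entries

print(split_projects('saao/saa17, dcclt, blms, bla,'))
# ['saao/saa17', 'dcclt', 'blms', 'bla']
```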

## 1.3 Download the ZIP files
For each project in the list download all the `json` files from `http://build-oracc.museum.upenn.edu/json/`. The file is called `PROJECT.zip` (for instance: `dcclt.zip`). For subprojects the file is called `PROJECT-SUBPROJECT.zip` (for instance `cams-gkab.zip`). 

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time and it may be necessary to chunk the downloading process. The `iter_content()` function in the `requests` library takes care of that.

If you have downloaded the files by hand (and put them in the `jsonzip` directory) you may skip this cell and jump directly to section [2.1 The `parsejson()` function](#head21).

In [None]:
CHUNK = 16 * 1024
for project in p:
    project = project.replace('/', '-')
    url = "http://build-oracc.museum.upenn.edu/json/" + project + ".zip"
    file = 'jsonzip/' + project + '.zip'
    r = requests.get(url, stream=True)   # stream the response so iter_content() can fetch it in chunks
    if r.status_code == 200:
        print("Downloading " + url + " saving as " + file)
        with open(file, 'wb') as f:
            for c in tqdm.tqdm(r.iter_content(chunk_size=CHUNK)):
                f.write(c)
    else:
        print(url + " does not exist.")
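
The naming rule for projects and subprojects described above can be isolated in a helper. This is a sketch under the assumptions stated in the text (the base URL and the `PROJECT-SUBPROJECT.zip` pattern); the function name `zip_locations` is made up for illustration.

```python
BASE_URL = "http://build-oracc.museum.upenn.edu/json/"  # as given in the text above

def zip_locations(project, directory="jsonzip"):
    """Return (download URL, local file name) for a project or subproject."""
    name = project.replace('/', '-') + '.zip'   # cams/gkab becomes cams-gkab.zip
    return BASE_URL + name, directory + '/' + name

print(zip_locations('cams/gkab'))
```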

## <a name="head21"></a>2.1 The `parsejson()` function
The `parsejson()` function is essentially identical to the one in `First_JSON_parser.ipynb`, but it fetches more data. The field `id_word` (taken from the `JSON` key `ref`) consists of three parts, namely a text ID, line ID, and word ID, in the format `Q000039.76.2` meaning: the second word in line 76 of text object `Q000039`. Note that `76` is not a line number strictly speaking but an object reference within the text object. Things like horizontal rulings, columns, and breaks also get object references. The `id_word` field allows us to put lines together in the proper order.

The field `label` is a human-legible label that refers to a line or another part of the text; it may look like `o i 23` (obverse column 1 line 23) or `r v 23'` (reverse column 5 line 23 prime). The `label` field is used in online [ORACC](http://oracc.org) editions to indicate line numbers.

The fields `extent`, `scope`, and `state` give metatextual data about the condition of the object; they capture the number of broken lines or columns and similar information. 
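
To make the three-part format of the word reference concrete, the sketch below takes the example `Q000039.76.2` apart. The names `WordRef` and `split_word_id` are hypothetical helpers, not part of the notebook.

```python
from typing import NamedTuple

class WordRef(NamedTuple):
    id_text: str   # text object, e.g. Q000039
    line: int      # object reference within the text (not a traditional line number)
    word: int      # position of the word within that line

def split_word_id(ref):
    """Split a reference such as 'Q000039.76.2' into its three parts."""
    id_text, line, word = ref.split('.')
    return WordRef(id_text, int(line), int(word))

print(split_word_id('Q000039.76.2'))
```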


In [17]:
def parsejson(text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject)
        if "label" in JSONobject:
            meta_d["label"] = JSONobject['label']
        if "type" in JSONobject and JSONobject["type"] == "field-start": # this is for sign lists, identifying fields such as
            meta_d["fieldtype"] = JSONobject["subtype"]                    # sign, pronunciation, translation.
        if "type" in JSONobject and JSONobject["type"] == "field-end":
            meta_d.pop("fieldtype", None)                       # remove the key "fieldtype" to prevent it from being copied 
                                                              # to all subsequent lemmas (which may not have fields)
        if "f" in JSONobject:
            lemma = JSONobject["f"]
            lemma["id_word"] = JSONobject["ref"]
            lemma['label'] = meta_d["label"]
            lemma["id_text"] = meta_d["id_text"]
            if "fieldtype" in meta_d:
                lemma["field"] = meta_d["fieldtype"]
            lemm_l.append(lemma)
        if "strict" in JSONobject and JSONobject["strict"] == "1":
            lemma = {key: JSONobject[key] for key in meta_d["dollar_keys"]}
            lemma["id_word"] = JSONobject["ref"] + '.0'   # make compatible with other id_word
            lemma["id_text"] = meta_d["id_text"]
            lemm_l.append(lemma)
    return
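
The recursion may be easier to follow on a toy example. The sketch below walks a drastically simplified, made-up structure (real ORACC `JSON` has many more keys) and passes the shared state explicitly rather than through global variables; the names `collect_lemmas`, `toy_text`, and `state` are assumptions for illustration.

```python
def collect_lemmas(node, state, lemmas):
    """Walk nested "cdl" lists and collect every "f" dictionary with its label."""
    for obj in node.get("cdl", []):
        if "cdl" in obj:
            collect_lemmas(obj, state, lemmas)   # descend into nested chunks
        if "label" in obj:
            state["label"] = obj["label"]        # remember the most recent line label
        if "f" in obj:
            lemma = dict(obj["f"])               # copy, so the input is not modified
            lemma["id_word"] = obj["ref"]
            lemma["label"] = state["label"]
            lemma["id_text"] = state["id_text"]
            lemmas.append(lemma)
    return lemmas

# Made-up miniature text: one nested chunk holding two lines of one word each.
toy_text = {"cdl": [
    {"cdl": [
        {"label": "o 1"},
        {"ref": "X000001.2.1", "f": {"form": "lugal", "cf": "lugal", "gw": "king", "pos": "N"}},
        {"label": "o 2"},
        {"ref": "X000001.3.1", "f": {"form": "e₂", "cf": "e", "gw": "house", "pos": "N"}},
    ]}
]}
state = {"id_text": "test/X000001", "label": None}
lemmas = collect_lemmas(toy_text, state, [])
```

Because the label is stored in shared state, every word receives the label of the most recent line start encountered before it.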

## 2.2 Call the `parsejson()` function for every `JSON` file
The code in this cell will iterate through the list of projects entered above (1.1). For each project the `JSON` zip file is located in the directory `jsonzip`, named PROJECT.zip. The `zip` file contains a file called `corpus.json`, which holds a full list of all the text IDs available in that corpus (P, Q, and X numbers) under the key `members`. The `zip` file also contains a directory `corpusjson` that holds the text files; each one is called `P######.json` (or `Q######.json` or `X######.json`). The code below identifies the files to be parsed directly from the file listing of the `zip`: every file in `corpusjson` that ends in `.json`.

Each of these files is extracted from the `zip` file and read with the command `json.loads()`, which reads the json data and transforms it into a Python dictionary (a sequence of keys and values).

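This dictionary, which is called `data_json`, is now sent to the `parsejson()` function. The text ID is not passed as an argument; it is stored in the dictionary `meta_d` under the key `id_text`, where the function picks it up. The function adds lemmata to the `lemm_l` list.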

In [18]:
lemm_l = []
meta_d = {"label": None, "id_text": None, "dollar_keys" : ["extent", "scope", "state"]}
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a Zipfile object
    except (OSError, zipfile.BadZipFile):   # file is missing, unreadable, or not a ZIP archive
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    files = [name for name in files if "corpusjson" in name and name[-5:] == '.json']  # keep only the text files (P, Q, and X numbers)
    for filename in files:                            #iterate over the file names
        id_text = project + filename[-13:-5] # id_text is, for instance, blms/P414332
        meta_d["id_text"] = id_text
        try:
            st = z.read(filename).decode('utf-8')         #read and decode the json file of one particular text
            data_json = json.loads(st)                # make it into a json object (essentially a dictionary)
            parsejson(data_json)               # and send to the parsejson() function
        except:
            print(id_text + ' is not available or not complete')
    z.close()

jsonzip/bla.zip does not exist or is not a proper ZIP file
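
Reading `JSON` directly out of a `zip` archive, without unpacking it to disk, can be tried out on a small archive built in memory. The sketch below is self-contained; the archive contents and the project name `test` are made up.

```python
import io
import json
import zipfile

# Build a tiny stand-in archive in memory (made-up contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as z:
    z.writestr('test/corpus.json', json.dumps({"members": {}}))
    z.writestr('test/corpusjson/X000001.json', json.dumps({"cdl": []}))

# Read it back, keeping only the text files in the corpusjson directory.
texts = {}
with zipfile.ZipFile(buffer) as z:
    for name in z.namelist():
        if "corpusjson" in name and name.endswith('.json'):
            text_id = name[-12:-5]   # the seven characters before ".json", e.g. X000001
            texts[text_id] = json.loads(z.read(name).decode('utf-8'))
```

Note that `corpus.json` is skipped because its path does not contain the directory name `corpusjson`.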


## 3 Data Structuring
## 3.1 Transform the Data into a DataFrame
The `lemm_l` list is transformed into a Pandas DataFrame for further manipulation.

For various reasons not all JSON files will have all data types that potentially exist in an [ORACC](http://oracc.org) signature. Only Sumerian words have a `base`, so if your data set has no Sumerian, this column will not exist in the DataFrame.  If a text has no breakage information in the form of `$ 1 line broken` (etc.) the fields `extent`, `scope`, and `state` do not exist. Where such fields are referenced in the code below (sections 2-4), the code may fail and you may need to take out some lines.

In [19]:
words = pd.DataFrame(lemm_l)
words = words.fillna('')   # replace NaN (Not a Number) with empty string
words

Unnamed: 0,base,cf,cont,delim,epos,extent,field,form,gdl,gw,...,label,lang,morph,norm,norm0,pos,scope,sense,state,stem
0,,ūmussu,,,AV,,,UD-mu-us-su,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",daily,...,o 1',akk-x-neobab,,ūmussu,,AV,,daily,,
1,,ana,,,PRP,,,a-na,"[{'v': 'a', 'gdl_utf8': '𒀀', 'id': 'P238656.3....",to,...,o 1',akk-x-neobab,,ana,,PRP,,to,,
2,,ṭūbu,,,N,,,ṭu-ub,"[{'v': 'ṭu', 'gdl_utf8': '𒂅', 'id': 'P238656.3...",goodness,...,o 1',akk-x-neobab,,ṭūb,,N,,goodness,,
3,,libbu,,,N,,,ŠA₃-bi,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",interior,...,o 1',akk-x-neobab,,libbi,,N,,heart,,
4,,ṭūbu,,,N,,,ṭu-ub,"[{'v': 'ṭu', 'gdl_utf8': '𒂅', 'id': 'P238656.3...",goodness,...,o 1',akk-x-neobab,,ṭūb,,N,,goodness,,
5,,šīru,,,N,,,UZU,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",flesh,...,o 1',akk-x-neobab,,šīri,,N,,flesh,,
6,,ša,,,DET,,,ša₂,"[{'v': 'ša₂', 'gdl_utf8': '𒃻', 'id': 'P238656....",of,...,o 2',akk-x-neobab,,ša,,DET,,of,,
7,,šarru,,,N,,,LUGAL,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",king,...,o 2',akk-x-neobab,,šarri,,N,,king,,
8,,bēlu,,,N,,,be-li₂-ia,"[{'v': 'be', 'gdl_utf8': '𒁁', 'id': 'P238656.4...",lord,...,o 2',akk-x-neobab,,bēlīya,,N,,lord,,
9,,Bel,,,DN,,,d.EN,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",1,...,o 2',akk-x-neobab,,Bel,,DN,,1,,
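
One way to guard against missing columns is to add them, filled with empty strings, before they are referenced. This is a sketch on made-up data; the list of column names is an assumption and should be adapted to the fields your code actually uses.

```python
import pandas as pd

lemm_toy = [{"form": "a-na", "cf": "ana", "gw": "to", "pos": "PRP"}]  # made-up, Akkadian only
words_toy = pd.DataFrame(lemm_toy).fillna('')

for column in ["base", "extent", "scope", "state"]:
    if column not in words_toy.columns:
        words_toy[column] = ''      # add the missing column, filled with empty strings
```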


## 3.2 Remove Spaces and Commas from Guide Word and Sense
Spaces and commas in Guide Word and Sense may cause trouble in tokenization, or when the data are saved in Comma Separated Values format. All spaces are replaced by hyphens and all commas are removed.

In [20]:
findreplace = {' ' : '-', ',' : ''}
words = words.replace({'gw' : findreplace, 'sense' : findreplace}, regex=True)
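
The effect of the two substitutions on a single value can be checked with plain string methods. The helper name `clean` and the sample guide word are made up for illustration.

```python
def clean(value):
    """Replace spaces by hyphens and remove commas."""
    return value.replace(' ', '-').replace(',', '')

print(clean('to be, exist'))
# to-be-exist
```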

The columns in the resulting DataFrame correspond to the elements of a full [ORACC](http://oracc.org) signature, plus information about text, line, and word ids:
* base (Sumerian only)
* cf (Citation Form)
* cont (continuation of the base; Sumerian only)
* epos (Effective Part of Speech)
* form (transliteration, omitting all flags such as indication of breakage)
* frag (transliteration; including flags)
* gdl_utf8 (cuneiform)
* gw (Guide Word: main or first translation in standard dictionary)
* id_line (integer line reference, taken from the middle part of id_word; see Create Line ID below)
* id_text (project name plus six-digit P, Q, or X number, for instance blms/P414332)
* id_word (word ID that begins with the ID number of the line)
* label (traditional line number in the form o ii 2' (obverse column 2 line 2'), etc.)
* lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc)
* morph (Morphology; Sumerian only)
* norm (Normalization: Akkadian)
* norm0 (Normalization: Sumerian)
* pos (Part of Speech)
* sense (contextual meaning)
* sig (full ORACC signature)

Not all data elements (columns) are available for all words. Sumerian words never have a `norm`; Akkadian words do not have `norm0`, `base`, `cont`, or `morph`. Most data elements are only present when the word is lemmatized; only `lang`, `form`, `pos`, `id_word`, `id_line`, and `id_text` should always be there. An unlemmatized word has `pos` 'X' (for unknown). Broken words have `pos` 'u' (for 'unlemmatizable').

## Create Line ID
The DataFrame currently has a word-by-word data representation. We will add to each word a field `id_line` that will make it possible to reconstruct lines. This newly created field `id_line` is different from a traditional line number in two ways. First, `id_line` is an integer, so that lines are sorted correctly. Traditional line numbers are stored in the field `label`, which is a string and has the format `o ii 7'` (obverse column 2 line 7 prime). Second, `id_line` is assigned to words, but also to gaps and horizontal drawings on the tablet. The field `id_line` will allow us to keep all these elements in their proper order.

The field `id_line` is created by splitting the field `id_word` into three elements. The format of `id_word` is `IDtext.line.word`. The middle part, the line reference, is made into an integer.

In [21]:
words['id_line'] = [int(wordid.split('.')[1]) for wordid in words['id_word']]
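
The reason for converting the line reference to an integer shows up as soon as lines are sorted. A small sketch with made-up word IDs:

```python
ids = ['P000001.9.1', 'P000001.10.1', 'P000001.2.3']   # made-up word IDs

# Sorting on the middle part as a string puts "10" before "2" and "9".
as_strings = sorted(ids, key=lambda i: i.split('.')[1])

# Sorting on the middle part as an integer gives the order on the tablet.
as_integers = sorted(ids, key=lambda i: int(i.split('.')[1]))

print(as_strings)    # ['P000001.10.1', 'P000001.2.3', 'P000001.9.1']
print(as_integers)   # ['P000001.2.3', 'P000001.9.1', 'P000001.10.1']
```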

## 4 Save Results in CSV file or in Pickle
The output file is called `parsed.csv` and is placed in the directory `output`. On most computers, `csv` files open automatically in Excel. Excel does not deal well with `utf-8` encoding. If you intend to use the file in Excel, change `encoding="utf-8"` to `encoding="utf-16"`. For usage in computational text analysis applications `utf-8` is usually preferred.

(Alternatively, use the instructions [here](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0) to import a `utf-8` file into Excel).

The Pandas function `to_pickle()` writes a binary file that can be opened in a later phase of the project with the `read_pickle()` command and will reproduce exactly the same DataFrame. The resulting file cannot be used in other programs.

In [26]:
savefile =  'parsed.csv'
with open('output/' + savefile, 'w', encoding="utf-8") as w:
    words.to_csv(w, index=False)
pickled = "parsed.p"
with open('output/' + pickled, 'wb') as w:
    words.to_pickle(w)
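
In a later session both files can be read back with `pd.read_csv()` and `pd.read_pickle()`. The round trip is sketched below on a made-up DataFrame in a temporary directory. The option `keep_default_na=False` is a suggestion, not something the notebook prescribes: it keeps empty cells as empty strings instead of turning them back into NaN.

```python
import os
import tempfile
import pandas as pd

toy = pd.DataFrame({"id_text": ["test/X000001"], "form": ["ša₂"], "base": [""]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "parsed.csv")
    pickle_path = os.path.join(tmp, "parsed.p")
    toy.to_csv(csv_path, index=False, encoding="utf-8")
    toy.to_pickle(pickle_path)

    from_csv = pd.read_csv(csv_path, encoding="utf-8", keep_default_na=False)
    from_pickle = pd.read_pickle(pickle_path)
```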

## 5 Post Processing
## 5.1 Manipulate for Analysis on Line level
For analyses that use a line as unit of analysis (e.g. lines in lexical texts as used in Chapter 3) one may need to create lemmas and combine these into lines by using the `id_line` variable.

### 5.1.1 Create Lemma Column
A lemma, [ORACC](http://oracc.org) style, combines Citation Form, GuideWord and POS into a unique reference to one particular lemma in a standard dictionary, as in `lugal[king]N` (Sumerian) or `šarru[king]N`. Usually, not all words in a text are lemmatized, because a word may be (partly) broken and/or unknown. Unlemmatized and unlemmatizable words will receive a place-holder lemmatization that consists of the transliteration of the word (instead of the Citation Form), with `NA` as GuideWord and `NA` as POS, as in `i-bu-x[NA]NA`. Note that `NA` is a string.

In [27]:
words["lemma"] = words.apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) 
                            if r["cf"] != '' else r['form'] + '[NA]NA', axis=1)
words['lemma'] = [lemma if not lemma == '[NA]NA' else '' for lemma in words['lemma'] ] # kick out empty forms
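
The rule can be spelled out for a single word as a plain function. This is a sketch for checking the logic; the name `make_lemma` is hypothetical and the sample values are taken from the examples in the text.

```python
def make_lemma(cf, gw, pos, form):
    """Build an ORACC-style lemma, or a place-holder for unlemmatized words."""
    if cf != '':
        return cf + '[' + gw + ']' + pos     # lemmatized word
    if form != '':
        return form + '[NA]NA'               # unlemmatized: fall back on the transliteration
    return ''                                # no form at all (e.g. a row for a break or ruling)

print(make_lemma('lugal', 'king', 'N', 'lugal'))   # lugal[king]N
print(make_lemma('', '', 'u', 'i-bu-x'))           # i-bu-x[NA]NA
```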

### 5.1.2 Group by Line
In the `words` DataFrame each word has a separate row. In order to change this to a line-by-line representation we use the Pandas `.groupby()` function, with the fields `id_text`, `id_line`, and `label` as grouping keys.

The fields that are aggregated are `lemma`, `extent`, `scope`, and `state`. The fields `extent`, `scope`, and `state` represent data on the number of broken lines. For instance, the notation `4 lines missing` in the [ORACC](http://oracc.org) edition will result in `extent = "4"`, `scope = "line"`, `state = "missing"` (note that the value of `extent` is a string and will be `"n"` if the number of missing lines or columns is unknown).

In [28]:
lines = words.groupby([words['id_text'], words['id_line'], words['label']]).agg({
        'lemma': ' '.join,
        'extent': ''.join, 
        'scope': ''.join,
        'state': ''.join
    }).reset_index()
lines

Unnamed: 0,id_text,id_line,label,lemma,extent,scope,state
0,blms/P223392,5,,,1,line,ruling
1,blms/P223392,6,r 1',x-u₂[NA]NA ša[of]DET x-x[NA]NA x[NA]NA x[NA]NA...,,,
2,blms/P223392,7,,,1,line,ruling
3,blms/P223392,8,r 2',x-ṣe-et[NA]NA ul[not]MOD ta-šem-x[NA]NA,,,
4,blms/P223392,9,,,1,line,ruling
5,blms/P223392,10,r 3',x[NA]NA patnu[tough]AJ ul[not]MOD ta-rab-ab-an...,,,
6,blms/P223392,11,,,1,line,ruling
7,blms/P223392,12,r 4',x[NA]NA annû[this]DP masnaqtu[inspection]N ul[...,,,
8,blms/P223392,13,,,1,line,ruling
9,blms/P223392,14,r 5',x[NA]NA lā[not]MOD ta-ta-na-ad-DUN[NA]NA lā[no...,,,
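
The grouping step can be tried on a few made-up rows to see how words collapse into lines. The DataFrame `toy` below is invented for illustration and contains only the `lemma` field.

```python
import pandas as pd

toy = pd.DataFrame({
    "id_text": ["test/X000001"] * 3,
    "id_line": [2, 2, 3],
    "label": ["o 1", "o 1", "o 2"],
    "lemma": ["lugal[king]N", "gal[big]V/i", "e[house]N"],
})

# One row per line: the lemmas of each line are joined with a space.
toy_lines = toy.groupby(["id_text", "id_line", "label"]).agg({"lemma": " ".join}).reset_index()
print(toy_lines)
```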


## 5.2 Alternative: Texts in Normalized Transcription
This code (which is only useful for Akkadian texts) will produce a text in normalized transcription, essentially following the pattern of the preceding section. Before grouping words into documents, we need to take care of words that have not been normalized (for instance because of breakage); for those we fall back on the field `form`. The new field `norm1` has the normalized form of the word if it is available; if not, it has the raw transliteration.

In [29]:
words["norm1"] = words.apply(lambda r: (r["norm"]) if r["norm"] != '' else r['form'], axis=1)
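
The same fallback can be written without a row-by-row `apply`, using the Pandas method `Series.where()`. A sketch on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"norm": ["šarru", ""], "form": ["LUGAL", "x-x"]})

# Keep `norm` where it is not empty; otherwise take the value of `form`.
toy["norm1"] = toy["norm"].where(toy["norm"] != '', toy["form"])
print(list(toy["norm1"]))
# ['šarru', 'x-x']
```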

In [30]:
texts_norm = words.groupby([words['id_text']]).agg({
        'norm1': ' '.join,
    }).reset_index()
texts_norm

Unnamed: 0,id_text,norm1
0,blms/P223392,x-u₂ ša x-x x x x x-ṣe-et ul ta-šem-x x pet...
1,blms/P223478,x x x x x x am₂-gig-ga-ni-ta x x am₂-gig-ga-ni...
2,blms/P237717,x ba-ra-nu-tuku-a x ipallahu x sa₂ he₂-en-dug₄...
3,blms/P237742,x ana būrti lā alê liddûšu gi-sal-ta x ša ina ...
4,blms/P238386,ur-saŋ diŋir-re-e-ne ni₂-tuku e₃-a kalag-ga du...
5,blms/P238404,x x x x umun dug₄-ga zid-da šag₄-zu x sipad sa...
6,blms/P238467,me-e {d}utu-ra a-ra-zu ga-an-na-ab-dug₄ ur-sa...
7,blms/P238534,x x x x x x x kug-ga-bi x x x īnšu el-x x-zu m...
8,blms/P238563,en be-lu {d}utu di-kud-mah dingir-e-ne {d}ša₂-...
9,blms/P238678,x aŋ₂-gig-ga-ŋu₁₀-ta ki-za an-kiŋ₂-kiŋ₂-e x ma...


### 5.2.1 Save Normalized Transcriptions
The `texts_norm` DataFrame has one complete document in normalized transcription in each row. The code below saves each row as a separate `.txt` file, named after the document's ID.

In [31]:
for idx, Q in enumerate(texts_norm["id_text"]):
    savefile =  Q[-7:] + '.txt'
    with open('output/' + savefile, 'w', encoding="utf-8") as w:
        texts_norm.iloc[idx].to_csv(w, index = False)
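
If files are wanted that contain only the running text, without the CSV layout that `to_csv()` produces, the text can be written directly. The sketch below uses made-up texts and a temporary directory; in the notebook the IDs and texts would come from the `id_text` and `norm1` columns of `texts_norm`.

```python
import os
import tempfile

# Made-up stand-in for the id_text and norm1 columns.
toy_texts = {"test/X000001": "šarru dannu", "test/X000002": "x x bēlu"}

with tempfile.TemporaryDirectory() as tmp:
    for id_text, text in toy_texts.items():
        path = os.path.join(tmp, id_text[-7:] + '.txt')   # file named after the P, Q, or X number
        with open(path, 'w', encoding='utf-8') as w:
            w.write(text)
    written = sorted(os.listdir(tmp))
    with open(os.path.join(tmp, 'X000001.txt'), encoding='utf-8') as r:
        first = r.read()
```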