# Extract Lemmatization from JSON: Extended Parser
The code in this notebook will parse [ORACC](http://oracc.org) `JSON` files to extract lemmatization data for one or more projects. The code shows how the word-by-word data structure can be reformatted to a line-by-line or document-by-document structure and discusses various other options.

The output of the Extended Parser contains text IDs, line IDs, lemmas, and (potentially) other data. The first few code blocks are identical to those of the Basic Parser.

The code in this notebook (sections 0 to 3: downloading, parsing, and formatting in DataFrame) is also available in the module `utils` in the directory `utils` and can be called as follows: 
```python
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *
projects = "dcclt, saao/saa01" # (or any other sequence of ORACC projects, separated by commas)
words_df = get_data(projects)  
```

In [1]:
import pandas as pd
import zipfile
import json
import tqdm
import os
import sys
util_dir = os.path.abspath('../utils')
sys.path.append(util_dir)
from utils import *

## 0 Create Directories, if Necessary
The two directories needed for this script are `jsonzip` and `output`. The directories are created with the function `make_dirs()` from the `utils` module. 

In [2]:
directories = ['jsonzip', 'output']
make_dirs(directories)

## 1.1 Input Project Names
Provide a list of one or more project names, separated by commas. Note that subprojects must be listed separately; they are not included in the main project. For instance:

`saao/saa01,saao/saa02,blms`

The input is split into a proper list that Python can iterate over, using the `format_project_list()` function in the `utils` module. The code of this function is discussed in more detail in 2.1.0. Download ORACC JSON Files.
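The actual code of `format_project_list()` lives in the `utils` module and is not reproduced here. As a rough illustration of the kind of splitting involved, the sketch below uses a hypothetical stand-in, `split_projects()`, which assumes that project names are separated by commas and that surrounding whitespace and empty entries can be discarded.

```python
def split_projects(projects):
    """Hypothetical stand-in for format_project_list().

    Splits a comma-separated string into a list of project names,
    trimming whitespace, lowercasing, and dropping empty entries.
    """
    return [name.strip().lower() for name in projects.split(",") if name.strip()]


project_list = split_projects("saao/saa01, saao/saa02,blms")
print(project_list)
```

The function in `utils` may treat edge cases (such as other separators) differently.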

In [3]:
projects = input('Project(s): ').lower().strip()
p = format_project_list(projects)

Project(s): saao/saa01, saao/saa02, saao/saa03, saao/saa04, saao/saa05, saao/saa06, saao/saa07, saao/saa08, saao/saa09, saao/saa10, saao/saa11, saao/saa12, saao/saa13, saao/saa14, saao/saa15, saao/saa16, saao/saa17, saao/saa18, saao/saa19, saao/saa20, saao/saa21, adsd/adart1, adsd/adart2, adsd/adart3, aemw/idrimi, akklove, blms, cams/anzu, cams/barutu, cams/gkab, cams/selbi, ccpo, cmawro/cmawro1, cmawr/cmawro2, cmawro/maqlu, dccmt, glass, hbtin, riao, ribo/babylon2, ribo/babylon3, ribo/babylon4, ribo/babylon5, ribo/babylon6, ribo/babylon7, ribo/babylon8, ribo/babylon10, rimanum, rinap/rinap1, rinap/rinap3, rinap/rinap4, rinap/rinap5, suhu


## 1.2 Download the ZIP Files
Download the zipped JSON files using the `oracc_download()` function in the `utils` module. The code of this function is discussed in more detail in 2.1.0. Download ORACC JSON Files.
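The messages printed by `oracc_download()` suggest a simple naming scheme: the slash in a subproject name is replaced by a hyphen, `.zip` is appended, and the file is fetched from the ORACC build server and saved in `jsonzip`. The sketch below only reconstructs those two strings; `zip_locations()` is a hypothetical helper, not part of `utils`, and it performs no download.

```python
def zip_locations(project,
                  base_url="http://build-oracc.museum.upenn.edu/json/",
                  directory="jsonzip"):
    """Hypothetical helper: return (url, local_path) for a project's ZIP file.

    The naming follows the pattern visible in the download messages;
    no network request is made.
    """
    name = project.replace("/", "-") + ".zip"
    return base_url + name, directory + "/" + name


url, path = zip_locations("saao/saa01")
print(url)
print(path)
```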

In [4]:
oracc_download(p)

Downloading http://build-oracc.museum.upenn.edu/json/saao-saa01.zip saving as jsonzip/saao-saa01.zip


293it [00:02, 114.41it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa02.zip saving as jsonzip/saao-saa02.zip


146it [00:01, 96.88it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa03.zip saving as jsonzip/saao-saa03.zip


82it [00:01, 78.47it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa04.zip saving as jsonzip/saao-saa04.zip


223it [00:04, 46.50it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa05.zip saving as jsonzip/saao-saa05.zip


280it [00:03, 87.77it/s] 


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa06.zip saving as jsonzip/saao-saa06.zip


361it [00:03, 106.40it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa07.zip saving as jsonzip/saao-saa07.zip


211it [00:02, 88.92it/s] 


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa08.zip saving as jsonzip/saao-saa08.zip


395it [00:04, 89.33it/s] 


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa09.zip saving as jsonzip/saao-saa09.zip


25it [00:00, 30.64it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa10.zip saving as jsonzip/saao-saa10.zip


481it [00:03, 141.18it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa11.zip saving as jsonzip/saao-saa11.zip


173it [00:01, 119.89it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa12.zip saving as jsonzip/saao-saa12.zip


81it [00:01, 77.36it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa13.zip saving as jsonzip/saao-saa13.zip


206it [00:02, 86.19it/s] 


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa14.zip saving as jsonzip/saao-saa14.zip


360it [00:03, 100.81it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa15.zip saving as jsonzip/saao-saa15.zip


322it [00:02, 128.24it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa16.zip saving as jsonzip/saao-saa16.zip


233it [00:02, 114.44it/s]


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa17.zip saving as jsonzip/saao-saa17.zip


243it [00:03, 73.37it/s] 


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa18.zip saving as jsonzip/saao-saa18.zip


256it [00:02, 91.10it/s] 


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa19.zip saving as jsonzip/saao-saa19.zip


293it [00:03, 89.88it/s] 


Downloading http://build-oracc.museum.upenn.edu/json/saao-saa20.zip saving as jsonzip/saao-saa20.zip


170it [00:01, 100.29it/s]


http://build-oracc.museum.upenn.edu/json/saao-saa21.zip does not exist.
http://build-oracc.museum.upenn.edu/json/adsd-adart1.zip does not exist.
http://build-oracc.museum.upenn.edu/json/adsd-adart2.zip does not exist.
http://build-oracc.museum.upenn.edu/json/adsd-adart3.zip does not exist.
http://build-oracc.museum.upenn.edu/json/aemw-idrimi.zip does not exist.
Downloading http://build-oracc.museum.upenn.edu/json/akklove.zip saving as jsonzip/akklove.zip


127it [00:01, 85.18it/s]


Downloading http://build-oracc.museum.upenn.edu/json/blms.zip saving as jsonzip/blms.zip


774it [00:03, 210.73it/s]


Downloading http://build-oracc.museum.upenn.edu/json/cams-anzu.zip saving as jsonzip/cams-anzu.zip


38it [00:00, 65.29it/s]


Downloading http://build-oracc.museum.upenn.edu/json/cams-barutu.zip saving as jsonzip/cams-barutu.zip


12it [00:00, 25.70it/s]


Downloading http://build-oracc.museum.upenn.edu/json/cams-gkab.zip saving as jsonzip/cams-gkab.zip


1662it [00:11, 143.88it/s]


Downloading http://build-oracc.museum.upenn.edu/json/cams-selbi.zip saving as jsonzip/cams-selbi.zip


11it [00:00, 27.36it/s]


Downloading http://build-oracc.museum.upenn.edu/json/ccpo.zip saving as jsonzip/ccpo.zip


431it [00:03, 134.48it/s]


http://build-oracc.museum.upenn.edu/json/cmawro-cmawro1.zip does not exist.
http://build-oracc.museum.upenn.edu/json/cmawr-cmawro2.zip does not exist.
Downloading http://build-oracc.museum.upenn.edu/json/cmawro-maqlu.zip saving as jsonzip/cmawro-maqlu.zip


150it [00:01, 110.62it/s]


Downloading http://build-oracc.museum.upenn.edu/json/dccmt.zip saving as jsonzip/dccmt.zip


248it [00:01, 128.83it/s]


Downloading http://build-oracc.museum.upenn.edu/json/glass.zip saving as jsonzip/glass.zip


58it [00:00, 74.26it/s]


Downloading http://build-oracc.museum.upenn.edu/json/hbtin.zip saving as jsonzip/hbtin.zip


1448it [00:09, 150.68it/s]


Downloading http://build-oracc.museum.upenn.edu/json/riao.zip saving as jsonzip/riao.zip


936it [00:04, 192.67it/s]


Downloading http://build-oracc.museum.upenn.edu/json/ribo-babylon2.zip saving as jsonzip/ribo-babylon2.zip


71it [00:00, 75.69it/s]


Downloading http://build-oracc.museum.upenn.edu/json/ribo-babylon3.zip saving as jsonzip/ribo-babylon3.zip


9it [00:00, 24.86it/s]


Downloading http://build-oracc.museum.upenn.edu/json/ribo-babylon4.zip saving as jsonzip/ribo-babylon4.zip


5it [00:00, 23.58it/s]


Downloading http://build-oracc.museum.upenn.edu/json/ribo-babylon5.zip saving as jsonzip/ribo-babylon5.zip


2it [00:00, 25.64it/s]


Downloading http://build-oracc.museum.upenn.edu/json/ribo-babylon6.zip saving as jsonzip/ribo-babylon6.zip


237it [00:03, 72.17it/s] 


Downloading http://build-oracc.museum.upenn.edu/json/ribo-babylon7.zip saving as jsonzip/ribo-babylon7.zip


59it [00:01, 39.44it/s]


Downloading http://build-oracc.museum.upenn.edu/json/ribo-babylon8.zip saving as jsonzip/ribo-babylon8.zip


18it [00:00, 30.51it/s]


Downloading http://build-oracc.museum.upenn.edu/json/ribo-babylon10.zip saving as jsonzip/ribo-babylon10.zip


7it [00:00, 23.18it/s]


Downloading http://build-oracc.museum.upenn.edu/json/rimanum.zip saving as jsonzip/rimanum.zip


180it [00:04, 43.75it/s]


Downloading http://build-oracc.museum.upenn.edu/json/rinap-rinap1.zip saving as jsonzip/rinap-rinap1.zip


150it [00:03, 46.22it/s]


Downloading http://build-oracc.museum.upenn.edu/json/rinap-rinap3.zip saving as jsonzip/rinap-rinap3.zip


617it [00:05, 120.48it/s]


Downloading http://build-oracc.museum.upenn.edu/json/rinap-rinap4.zip saving as jsonzip/rinap-rinap4.zip


397it [00:03, 113.17it/s]


Downloading http://build-oracc.museum.upenn.edu/json/rinap-rinap5.zip saving as jsonzip/rinap-rinap5.zip


618it [00:03, 190.51it/s]


Downloading http://build-oracc.museum.upenn.edu/json/suhu.zip saving as jsonzip/suhu.zip


70it [00:00, 80.55it/s]


## <a name="head21"></a>2.1 The `parsejson()` function
The `parsejson()` function is identical in structure to the function of the same name in `First_JSON_parser.ipynb`, but it fetches more data. The field `id_word` consists of three parts, namely a text ID, a line ID, and a word ID, in the format `Q000039.76.2`, meaning: the second word in line 76 of text object `Q000039`. Strictly speaking, `76` is not a line number but an object reference within the text object. Things like horizontal rulings, columns, and breaks also get object references. The `id_word` field allows us to put lines, breaks, and horizontal rulings together in the proper order.
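To make the three-part structure concrete, the following sketch takes an identifier such as `Q000039.76.2` apart. The helper name `split_word_id()` is hypothetical, and the sketch assumes that the identifier always contains exactly two periods and that the last two parts are integers.

```python
def split_word_id(id_word):
    """Hypothetical helper: split 'Q000039.76.2' into its three parts.

    Assumes the form TEXT.LINE.WORD, with integer line and word references.
    """
    text_id, line_ref, word_no = id_word.split(".")
    return text_id, int(line_ref), int(word_no)


print(split_word_id("Q000039.76.2"))  # ('Q000039', 76, 2)
```

Converting the line reference to an integer makes it possible to sort numerically, so that reference 9 precedes reference 10.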

The field `label` is a human-legible label that refers to a line or another part of the text; it may look like `o i 23` (obverse column 1 line 23) or `r v 23'` (reverse column 5 line 23 prime). The `label` field is used in online [ORACC](http://oracc.org) editions to indicate line numbers.

The fields `extent`, `scope`, and `state` give metatextual data about the condition of the object; they capture the number of broken lines or columns and similar information. 

The field `field` is used primarily in lexical texts. For the field abbreviations and their meanings, see the [documentation](http://oracc.museum.upenn.edu/doc/help/editinginatf/lexicaltexts/index.html). The field label looks like `wp` (word or phrase) or `sg` (sign) and is found under the JSON key `subtype` after a `field-start` entry. The field label is copied to the `meta_d` dictionary (under the key `field`), but this key is removed from `meta_d` (with the `pop()` method) as soon as the parser encounters a `field-end` value. The great majority of lemmas have no field attribute; the key is "popped" so that it does not get copied inadvertently to all subsequent lemmas.
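The effect of setting and popping the `field` key can be seen in a small, flattened sketch. The node list below is invented toy data (real ORACC JSON is nested and much richer), and `collect_fields()` is a hypothetical, simplified stand-in for the relevant part of `parsejson()`.

```python
def collect_fields(nodes):
    """Simplified illustration of the field-start / field-end mechanism."""
    meta = {}   # temporary state, comparable to meta_d
    rows = []
    for node in nodes:
        if node.get("type") == "field-start":
            meta["field"] = node["subtype"]     # remember the current field
        elif node.get("type") == "field-end":
            meta.pop("field", None)             # forget it again
        elif "f" in node:
            row = dict(node["f"])
            if "field" in meta:                 # only words inside a field get the label
                row["field"] = meta["field"]
            rows.append(row)
    return rows


# Toy data: one word inside an 'sg' field, one word outside any field.
nodes = [
    {"type": "field-start", "subtype": "sg"},
    {"f": {"form": "A"}},
    {"type": "field-end"},
    {"f": {"form": "B"}},
]
rows = collect_fields(nodes)
print(rows)  # [{'form': 'A', 'field': 'sg'}, {'form': 'B'}]
```

Without the `pop()` call, the second word would also be labeled `sg`.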

This version of the `parsejson()` function is also available in the module `utils`.


In [5]:
def parsejson(text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject)
        if "label" in JSONobject:
            meta_d["label"] = JSONobject['label']
        if "type" in JSONobject and JSONobject["type"] == "field-start":
            # used in sign lists to identify fields such as sign, pronunciation, translation
            meta_d["field"] = JSONobject["subtype"]
        if "type" in JSONobject and JSONobject["type"] == "field-end":
            # remove the key "field" to prevent it from being copied
            # to all subsequent lemmas (which may not have fields)
            meta_d.pop("field", None)
        if "f" in JSONobject:
            lemma = JSONobject["f"]
            lemma["id_word"] = JSONobject["ref"]
            lemma['label'] = meta_d["label"]
            lemma["id_text"] = meta_d["id_text"]
            if "field" in meta_d:
                lemma["field"] = meta_d["field"]
            lemm_l.append(lemma)
        if "strict" in JSONobject and JSONobject["strict"] == "1":
            lemma = {key: JSONobject[key] for key in dollar_keys}
            lemma["id_word"] = JSONobject["ref"]
            lemma["id_text"] = meta_d["id_text"]
            lemm_l.append(lemma)
    return

## 2.2 Call the `parsejson()` function for every `JSON` file
The code in this cell will iterate through the list of projects entered above (1.1). For each project, the `JSON` zip file is located in the directory `jsonzip` and is named `PROJECT.zip` (with any `/` in the project name replaced by `-`). The `zip` file contains a directory called `corpusjson`, which holds a JSON file for every text that is available in that corpus. The files are named after their text IDs in the pattern `P######.json` (or `Q######.json` or `X######.json`).

The function `namelist()` of the `zipfile` package is used to create a list of the names of all the files in the ZIP. From this list we select all the file names in the `corpusjson` directory with extension `.json` (this way we exclude the name of the directory itself). 
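The selection step can be tried out without downloading anything by building a small ZIP archive in memory. The file names below are invented for the illustration, and `corpus_json_names()` is a hypothetical helper that applies the same filter as the cell further down.

```python
import io
import zipfile


def corpus_json_names(z):
    """Return the names of the .json files inside a corpusjson directory."""
    return [name for name in z.namelist()
            if "corpusjson" in name and name.endswith(".json")]


# Build a toy archive in memory (invented contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as z:
    z.writestr("blms/corpusjson/", "")                # the directory entry itself
    z.writestr("blms/corpusjson/P414332.json", "{}")  # a text file
    z.writestr("blms/metadata.json", "{}")            # a file outside corpusjson

with zipfile.ZipFile(buffer) as z:
    texts = corpus_json_names(z)

print(texts)  # ['blms/corpusjson/P414332.json']
```

The directory entry is dropped because it does not end in `.json`; the metadata file is dropped because its path does not contain `corpusjson`.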

Each of these files is read from the `zip` file and loaded with the command `json.loads()`, which transforms the string into a proper JSON object. 

This JSON object (essentially a Python dictionary), called `data_json`, is then sent to the `parsejson()` function. The function adds lemmata to the `lemm_l` list. In the end, `lemm_l` will contain as many list elements as there are words in all the texts in the projects requested.
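The sketch below illustrates, on an invented miniature document, how a string becomes a dictionary and how a recursive walk over nested `cdl` lists accumulates one entry per word. The function `collect_lemmas()` is a hypothetical, stripped-down relative of `parsejson()` that passes the result list as an argument instead of relying on a global list.

```python
import json


def collect_lemmas(node, out):
    """Recursively gather the 'f' dictionaries found under nested 'cdl' lists."""
    for child in node.get("cdl", []):
        if "cdl" in child:            # a nested chunk: descend into it
            collect_lemmas(child, out)
        if "f" in child:              # a word: keep its lemmatization data
            out.append(child["f"])
    return out


# Invented toy document: two words in a nested chunk, one at the top level.
raw = '{"cdl": [{"cdl": [{"f": {"form": "a"}}, {"f": {"form": "b"}}]}, {"f": {"form": "c"}}]}'
data_json = json.loads(raw)           # string -> Python dictionary
lemmas = collect_lemmas(data_json, [])
print(len(lemmas))  # 3
```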

The dictionary `meta_d` is created to hold temporary information. The value of the key `id_text` is updated in the main process every time a new JSON file is opened and sent to the `parsejson()` function. The `parsejson()` function itself will change values or add new keys, depending on the information found while iterating through the JSON file. When a new lemma row is created, `parsejson()` will supply data such as `id_text`, `label`, and (potentially) other information from `meta_d`.
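The value of `id_text` is derived from the project name and the file name by slicing. The sketch below isolates that step in a hypothetical helper, `make_id_text()`; it assumes, as the cell below does, that the file name ends in a slash, a seven-character text ID, and `.json`.

```python
def make_id_text(project, filename):
    """Hypothetical helper: combine project name and text ID.

    filename[-13:-5] keeps the eight characters before '.json',
    that is, the slash plus the seven-character P, Q, or X number.
    """
    return project + filename[-13:-5]


print(make_id_text("blms", "blms/corpusjson/P414332.json"))  # blms/P414332
```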

In [6]:
lemm_l = []
meta_d = {"label": None, "id_text": None}
dollar_keys = ["extent", "scope", "state"]
for project in p:
    print("Parsing " + project)
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a Zipfile object
    except (OSError, zipfile.BadZipFile):   # file is missing, unreadable, or not a valid ZIP
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    # keep only the JSON files in corpusjson, which are named after the P, Q, and X numbers
    files = [name for name in files if "corpusjson" in name and name[-5:] == '.json']
    for filename in tqdm.tqdm(files):                            #iterate over the file names
        id_text = project + filename[-13:-5] # id_text is, for instance, blms/P414332
        meta_d["id_text"] = id_text
        try:
            st = z.read(filename).decode('utf-8')         #read and decode the json file of one particular text
            data_json = json.loads(st)                # make it into a json object (essentially a dictionary)
            parsejson(data_json)               # and send to the parsejson() function
        except Exception:                  # skip texts that cannot be read or parsed
            print(id_text + ' is not available or not complete')
    z.close()

Parsing saao/saa01


100%|███████████████████████████████████████| 264/264 [00:01<00:00, 143.24it/s]


Parsing saao/saa02


100%|██████████████████████████████████████████| 15/15 [00:00<00:00, 15.87it/s]


Parsing saao/saa03


100%|██████████████████████████████████████████| 52/52 [00:00<00:00, 55.32it/s]


Parsing saao/saa04


 33%|████████████▊                          | 116/354 [00:01<00:02, 104.77it/s]

saao/saa04/P336097 is not available or not complete
saao/saa04/P336343 is not available or not complete


 66%|█████████████████████████▌             | 232/354 [00:01<00:00, 147.79it/s]

saao/saa04/P237370 is not available or not complete


 79%|███████████████████████████████▌        | 279/354 [00:02<00:00, 94.87it/s]

saao/saa04/P336332 is not available or not complete


100%|███████████████████████████████████████| 354/354 [00:02<00:00, 124.56it/s]


Parsing saao/saa05


100%|███████████████████████████████████████| 300/300 [00:01<00:00, 151.52it/s]


Parsing saao/saa06


  3%|█▎                                      | 11/350 [00:00<00:03, 106.80it/s]

saao/saa06/P335176 is not available or not complete


  7%|██▊                                     | 25/350 [00:00<00:02, 113.30it/s]

saao/saa06/P335202 is not available or not complete
saao/saa06/P335322 is not available or not complete


 12%|████▉                                   | 43/350 [00:00<00:02, 126.13it/s]

saao/saa06/P335204 is not available or not complete


 24%|█████████▋                              | 85/350 [00:00<00:02, 127.95it/s]

saao/saa06/P335372 is not available or not complete


 67%|██████████████████████████▋             | 233/350 [00:02<00:01, 90.46it/s]

saao/saa06/P335226 is not available or not complete


100%|███████████████████████████████████████| 350/350 [00:02<00:00, 125.81it/s]


Parsing saao/saa07


 34%|█████████████▋                          | 75/219 [00:00<00:01, 138.05it/s]

saao/saa07/P335792 is not available or not complete


100%|███████████████████████████████████████| 219/219 [00:01<00:00, 156.76it/s]


Parsing saao/saa08


100%|███████████████████████████████████████| 568/568 [00:02<00:00, 197.70it/s]


Parsing saao/saa09


100%|██████████████████████████████████████████| 11/11 [00:00<00:00, 64.33it/s]


Parsing saao/saa10


100%|███████████████████████████████████████| 389/389 [00:03<00:00, 123.73it/s]


Parsing saao/saa11


100%|███████████████████████████████████████| 234/234 [00:01<00:00, 208.00it/s]


Parsing saao/saa12


 10%|████▎                                     | 10/98 [00:00<00:00, 99.01it/s]

saao/saa12/P235242 is not available or not complete


 21%|█████████                                 | 21/98 [00:00<00:00, 95.18it/s]

saao/saa12/P285576 is not available or not complete


100%|█████████████████████████████████████████| 98/98 [00:00<00:00, 108.17it/s]


Parsing saao/saa13


100%|███████████████████████████████████████| 210/210 [00:01<00:00, 114.63it/s]


Parsing saao/saa14


  0%|                                                  | 0/479 [00:00<?, ?it/s]

saao/saa14/P335530 is not available or not complete


  9%|███▋                                    | 44/479 [00:00<00:02, 215.05it/s]

saao/saa14/P335415 is not available or not complete
saao/saa14/P335587 is not available or not complete


 14%|█████▌                                  | 66/479 [00:00<00:01, 213.36it/s]

saao/saa14/P335263 is not available or not complete


 18%|███████▎                                | 87/479 [00:00<00:01, 211.06it/s]

saao/saa14/P335079 is not available or not complete
saao/saa14/P335107 is not available or not complete


 31%|████████████▏                          | 150/479 [00:00<00:01, 252.41it/s]

saao/saa14/P334977 is not available or not complete
saao/saa14/P334991 is not available or not complete
saao/saa14/P335305 is not available or not complete


 44%|█████████████████                      | 209/479 [00:00<00:00, 272.32it/s]

saao/saa14/P336196 is not available or not complete
saao/saa14/P335038 is not available or not complete
saao/saa14/P335943 is not available or not complete
saao/saa14/P335574 is not available or not complete


 50%|███████████████████▌                   | 240/479 [00:00<00:00, 278.06it/s]

saao/saa14/P336029 is not available or not complete
saao/saa14/P335257 is not available or not complete


 56%|█████████████████████▊                 | 268/479 [00:01<00:00, 269.78it/s]

saao/saa14/P335525 is not available or not complete


 63%|████████████████████████▍              | 300/479 [00:01<00:00, 280.87it/s]

saao/saa14/P335197 is not available or not complete
saao/saa14/P335081 is not available or not complete
saao/saa14/P224949 is not available or not complete


 69%|██████████████████████████▊            | 329/479 [00:01<00:00, 282.72it/s]

saao/saa14/P335459 is not available or not complete


 76%|█████████████████████████████▌         | 363/479 [00:01<00:00, 294.67it/s]

saao/saa14/P335080 is not available or not complete


 82%|███████████████████████████████▉       | 393/479 [00:01<00:00, 280.47it/s]

saao/saa14/P335489 is not available or not complete
saao/saa14/P335539 is not available or not complete
saao/saa14/P335180 is not available or not complete
saao/saa14/P336194 is not available or not complete


 90%|██████████████████████████████████▉    | 429/479 [00:01<00:00, 297.40it/s]

saao/saa14/P335537 is not available or not complete


 96%|█████████████████████████████████████▍ | 460/479 [00:01<00:00, 286.86it/s]

saao/saa14/P335196 is not available or not complete
saao/saa14/P335154 is not available or not complete


100%|███████████████████████████████████████| 479/479 [00:01<00:00, 272.47it/s]


Parsing saao/saa15


100%|███████████████████████████████████████| 389/389 [00:01<00:00, 224.86it/s]


Parsing saao/saa16


100%|███████████████████████████████████████| 246/246 [00:02<00:00, 103.23it/s]


Parsing saao/saa17


100%|███████████████████████████████████████| 207/207 [00:01<00:00, 132.30it/s]


Parsing saao/saa18


100%|███████████████████████████████████████| 204/204 [00:01<00:00, 116.37it/s]


Parsing saao/saa19


100%|███████████████████████████████████████| 229/229 [00:01<00:00, 137.13it/s]


Parsing saao/saa20


100%|██████████████████████████████████████████| 55/55 [00:02<00:00, 26.31it/s]


Parsing saao/saa21
jsonzip/saao-saa21.zip does not exist or is not a proper ZIP file
Parsing adsd/adart1
jsonzip/adsd-adart1.zip does not exist or is not a proper ZIP file
Parsing adsd/adart2
jsonzip/adsd-adart2.zip does not exist or is not a proper ZIP file
Parsing adsd/adart3
jsonzip/adsd-adart3.zip does not exist or is not a proper ZIP file
Parsing aemw/idrimi
jsonzip/aemw-idrimi.zip does not exist or is not a proper ZIP file
Parsing akklove


100%|█████████████████████████████████████████| 32/32 [00:00<00:00, 100.31it/s]


Parsing blms


100%|███████████████████████████████████████| 395/395 [00:03<00:00, 110.40it/s]


Parsing cams/anzu


100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 22.22it/s]


Parsing cams/barutu


100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 52.63it/s]


Parsing cams/gkab


 45%|█████████████████▊                      | 261/585 [00:05<00:07, 40.67it/s]

cams/gkab/P363695 is not available or not complete


100%|████████████████████████████████████████| 585/585 [00:10<00:00, 55.56it/s]


Parsing cams/selbi


100%|███████████████████████████████████████████| 3/3 [00:00<00:00, 136.36it/s]


Parsing ccpo


100%|████████████████████████████████████████| 205/205 [00:03<00:00, 56.91it/s]


Parsing cmawro/cmawro1
jsonzip/cmawro-cmawro1.zip does not exist or is not a proper ZIP file
Parsing cmawr/cmawro2
jsonzip/cmawr-cmawro2.zip does not exist or is not a proper ZIP file
Parsing cmawro/maqlu


100%|████████████████████████████████████████████| 9/9 [00:01<00:00,  7.50it/s]


Parsing dccmt


100%|███████████████████████████████████████| 222/222 [00:02<00:00, 107.35it/s]


Parsing glass


100%|██████████████████████████████████████████| 20/20 [00:00<00:00, 58.82it/s]


Parsing hbtin


  0%|                                                  | 0/485 [00:00<?, ?it/s]

hbtin/P342246 is not available or not complete
hbtin/P303987 is not available or not complete


  1%|▌                                         | 6/485 [00:00<00:08, 58.25it/s]

hbtin/P296731 is not available or not complete


  2%|▊                                        | 10/485 [00:00<00:09, 49.16it/s]

hbtin/P342485 is not available or not complete
hbtin/P296703 is not available or not complete
hbtin/P296757 is not available or not complete


  3%|█▏                                       | 14/485 [00:00<00:10, 45.22it/s]

hbtin/P296732 is not available or not complete
hbtin/P296688 is not available or not complete
hbtin/P296685 is not available or not complete
hbtin/P342486 is not available or not complete


  4%|█▌                                       | 18/485 [00:00<00:11, 41.61it/s]

hbtin/P296739 is not available or not complete
hbtin/P342461 is not available or not complete


  5%|█▉                                       | 23/485 [00:00<00:10, 42.81it/s]

hbtin/P296697 is not available or not complete


  6%|██▎                                      | 28/485 [00:00<00:10, 43.57it/s]

hbtin/P296750 is not available or not complete
hbtin/P342216 is not available or not complete
hbtin/P342462 is not available or not complete


  7%|███                                      | 36/485 [00:00<00:08, 49.90it/s]

hbtin/P342249 is not available or not complete
hbtin/P342255 is not available or not complete
hbtin/P342504 is not available or not complete
hbtin/P296742 is not available or not complete
hbtin/P296746 is not available or not complete


  8%|███▍                                     | 41/485 [00:00<00:10, 42.00it/s]

hbtin/P304009 is not available or not complete
hbtin/P296754 is not available or not complete


  9%|███▉                                     | 46/485 [00:01<00:10, 41.18it/s]

hbtin/P303983 is not available or not complete
hbtin/P303981 is not available or not complete
hbtin/P296726 is not available or not complete


 11%|████▎                                    | 51/485 [00:01<00:11, 37.76it/s]

hbtin/P342456 is not available or not complete


 11%|████▋                                    | 55/485 [00:01<00:11, 36.72it/s]

hbtin/P296716 is not available or not complete
hbtin/P296721 is not available or not complete


 12%|████▉                                    | 59/485 [00:01<00:11, 36.81it/s]

hbtin/P296713 is not available or not complete
hbtin/P296730 is not available or not complete


 13%|█████▎                                   | 63/485 [00:01<00:11, 36.38it/s]

hbtin/P342446 is not available or not complete
hbtin/P342505 is not available or not complete
hbtin/P303986 is not available or not complete


 14%|█████▋                                   | 67/485 [00:01<00:11, 35.50it/s]

hbtin/P342450 is not available or not complete
hbtin/P342254 is not available or not complete


 15%|██████▏                                  | 73/485 [00:01<00:10, 39.19it/s]

hbtin/P342452 is not available or not complete
hbtin/P303975 is not available or not complete


 16%|██████▌                                  | 78/485 [00:01<00:10, 39.43it/s]

hbtin/P342453 is not available or not complete
hbtin/P342465 is not available or not complete
hbtin/P342270 is not available or not complete


 17%|███████                                  | 83/485 [00:02<00:10, 38.41it/s]

hbtin/P342238 is not available or not complete
hbtin/P296737 is not available or not complete


 18%|███████▍                                 | 88/485 [00:02<00:10, 39.61it/s]

hbtin/P296758 is not available or not complete


 19%|███████▊                                 | 93/485 [00:02<00:10, 38.89it/s]

hbtin/P304010 is not available or not complete
hbtin/P303993 is not available or not complete
hbtin/P342222 is not available or not complete


 20%|████████▏                                | 97/485 [00:02<00:10, 35.94it/s]

hbtin/P296680 is not available or not complete
hbtin/P303989 is not available or not complete
hbtin/P342458 is not available or not complete


 21%|████████▍                               | 103/485 [00:02<00:09, 40.77it/s]

hbtin/P296759 is not available or not complete
hbtin/P296743 is not available or not complete


 22%|████████▉                               | 109/485 [00:02<00:08, 44.02it/s]

hbtin/P304336 is not available or not complete
hbtin/P342503 is not available or not complete
hbtin/P342262 is not available or not complete


 24%|█████████▍                              | 114/485 [00:02<00:08, 42.40it/s]

hbtin/P296722 is not available or not complete
hbtin/P296765 is not available or not complete
hbtin/P304003 is not available or not complete


 25%|█████████▉                              | 120/485 [00:02<00:08, 45.54it/s]

hbtin/P342457 is not available or not complete
hbtin/P296736 is not available or not complete
hbtin/P297042 is not available or not complete
hbtin/P296741 is not available or not complete


 26%|██████████▍                             | 126/485 [00:02<00:07, 48.26it/s]

hbtin/P296699 is not available or not complete
hbtin/P304008 is not available or not complete
hbtin/P297043 is not available or not complete
hbtin/P304000 is not available or not complete
hbtin/P304012 is not available or not complete
hbtin/P296769 is not available or not complete


 27%|██████████▉                             | 132/485 [00:03<00:07, 46.83it/s]

hbtin/P342215 is not available or not complete
hbtin/P342210 is not available or not complete
hbtin/P312907 is not available or not complete
hbtin/P342490 is not available or not complete
hbtin/P296747 is not available or not complete
hbtin/P342471 is not available or not complete


 29%|███████████▌                            | 140/485 [00:03<00:06, 52.74it/s]

hbtin/P342451 is not available or not complete
hbtin/P296753 is not available or not complete
hbtin/P303997 is not available or not complete


 30%|████████████                            | 146/485 [00:03<00:06, 52.99it/s]

hbtin/P303973 is not available or not complete
hbtin/P303995 is not available or not complete
hbtin/P304011 is not available or not complete
hbtin/P342233 is not available or not complete


 31%|████████████▌                           | 152/485 [00:03<00:06, 52.33it/s]

hbtin/P342498 is not available or not complete
hbtin/P296695 is not available or not complete


 33%|█████████████▏                          | 160/485 [00:03<00:05, 58.39it/s]

hbtin/P296707 is not available or not complete
hbtin/P297040 is not available or not complete
hbtin/P342281 is not available or not complete
hbtin/P305851 is not available or not complete
hbtin/P342212 is not available or not complete


 34%|█████████████▊                          | 167/485 [00:03<00:05, 57.94it/s]

hbtin/P304004 is not available or not complete
hbtin/P303990 is not available or not complete
hbtin/P296738 is not available or not complete


 36%|██████████████▎                         | 174/485 [00:03<00:05, 57.77it/s]

hbtin/P342475 is not available or not complete
hbtin/P342468 is not available or not complete


 38%|███████████████                         | 182/485 [00:03<00:04, 62.58it/s]

hbtin/P296756 is not available or not complete
hbtin/P342420 is not available or not complete
hbtin/P311794 is not available or not complete


 39%|███████████████▌                        | 189/485 [00:04<00:04, 62.06it/s]

hbtin/P342264 is not available or not complete
hbtin/P342223 is not available or not complete
hbtin/P342208 is not available or not complete
hbtin/P342493 is not available or not complete


 40%|████████████████▏                       | 196/485 [00:04<00:04, 59.64it/s]

hbtin/P296692 is not available or not complete
hbtin/P342473 is not available or not complete
hbtin/P342280 is not available or not complete
hbtin/P342497 is not available or not complete
hbtin/P296774 is not available or not complete


 42%|████████████████▋                       | 203/485 [00:04<00:05, 55.71it/s]

hbtin/P303979 is not available or not complete
hbtin/P296691 is not available or not complete


 43%|█████████████████▏                      | 209/485 [00:04<00:05, 55.05it/s]

hbtin/P342476 is not available or not complete
hbtin/P296698 is not available or not complete
hbtin/P296678 is not available or not complete
hbtin/P296744 is not available or not complete
hbtin/P303982 is not available or not complete
hbtin/P342207 is not available or not complete


 44%|█████████████████▋                      | 215/485 [00:04<00:05, 53.29it/s]

hbtin/P296704 is not available or not complete
hbtin/P296687 is not available or not complete


 46%|██████████████████▏                     | 221/485 [00:04<00:04, 54.99it/s]

hbtin/P342247 is not available or not complete
hbtin/P342248 is not available or not complete
hbtin/P296684 is not available or not complete
hbtin/P342229 is not available or not complete


 47%|██████████████████▊                     | 228/485 [00:04<00:04, 58.62it/s]

hbtin/P342265 is not available or not complete
hbtin/P296686 is not available or not complete
hbtin/P342449 is not available or not complete


 48%|███████████████████▍                    | 235/485 [00:04<00:04, 58.83it/s]

hbtin/P303984 is not available or not complete
hbtin/P342472 is not available or not complete
hbtin/P296708 is not available or not complete
hbtin/P296702 is not available or not complete


 50%|███████████████████▉                    | 241/485 [00:04<00:04, 57.15it/s]

hbtin/P304001 is not available or not complete
hbtin/P342447 is not available or not complete
hbtin/P296773 is not available or not complete


 51%|████████████████████▍                   | 248/485 [00:05<00:03, 59.40it/s]

hbtin/P342466 is not available or not complete
hbtin/P303994 is not available or not complete
hbtin/P296696 is not available or not complete
hbtin/P296677 is not available or not complete


 53%|█████████████████████                   | 255/485 [00:05<00:03, 61.90it/s]

hbtin/P304337 is not available or not complete
hbtin/P296694 is not available or not complete
hbtin/P296711 is not available or not complete
hbtin/P296724 is not available or not complete


 54%|█████████████████████▌                  | 262/485 [00:07<00:24,  8.98it/s]

hbtin/P342231 is not available or not complete
hbtin/P296728 is not available or not complete
hbtin/P296760 is not available or not complete


 55%|██████████████████████                  | 267/485 [00:07<00:18, 11.53it/s]

hbtin/P303978 is not available or not complete


 56%|██████████████████████▌                 | 273/485 [00:07<00:14, 15.05it/s]

hbtin/P342469 is not available or not complete
hbtin/P342227 is not available or not complete
hbtin/P342239 is not available or not complete


 57%|██████████████████████▉                 | 278/485 [00:07<00:11, 18.33it/s]

hbtin/P342419 is not available or not complete
hbtin/P296690 is not available or not complete
hbtin/P303999 is not available or not complete


 58%|███████████████████████▎                | 283/485 [00:08<00:09, 21.78it/s]

hbtin/P342261 is not available or not complete
hbtin/P296681 is not available or not complete
hbtin/P342470 is not available or not complete
hbtin/P296714 is not available or not complete


 59%|███████████████████████▊                | 288/485 [00:08<00:07, 24.89it/s]

hbtin/P342230 is not available or not complete
hbtin/P296679 is not available or not complete


 61%|████████████████████████▏               | 294/485 [00:08<00:06, 30.05it/s]

hbtin/P304013 is not available or not complete


 62%|████████████████████████▋               | 300/485 [00:08<00:05, 33.90it/s]

hbtin/P342455 is not available or not complete
hbtin/P296749 is not available or not complete
hbtin/P296693 is not available or not complete


 63%|█████████████████████████▏              | 305/485 [00:08<00:04, 37.11it/s]

hbtin/P296735 is not available or not complete
hbtin/P304338 is not available or not complete
hbtin/P342240 is not available or not complete


 64%|█████████████████████████▌              | 310/485 [00:08<00:04, 36.12it/s]

hbtin/P296712 is not available or not complete
hbtin/P296710 is not available or not complete
hbtin/P303992 is not available or not complete
hbtin/P296763 is not available or not complete


 65%|█████████████████████████▉              | 315/485 [00:08<00:04, 34.80it/s]

hbtin/P342269 is not available or not complete
hbtin/P342448 is not available or not complete
hbtin/P296700 is not available or not complete
hbtin/P296727 is not available or not complete


 66%|██████████████████████████▎             | 319/485 [00:08<00:05, 33.15it/s]

hbtin/P303980 is not available or not complete
hbtin/P296745 is not available or not complete


 67%|██████████████████████████▉             | 326/485 [00:09<00:04, 39.10it/s]

hbtin/P296748 is not available or not complete
hbtin/P296761 is not available or not complete
hbtin/P296723 is not available or not complete


 68%|███████████████████████████▎            | 331/485 [00:09<00:04, 37.68it/s]

hbtin/P303988 is not available or not complete
hbtin/P342418 is not available or not complete


 71%|████████████████████████████▏           | 342/485 [00:09<00:03, 45.76it/s]

hbtin/P296734 is not available or not complete
hbtin/P296720 is not available or not complete
hbtin/P342496 is not available or not complete
hbtin/P342467 is not available or not complete


 72%|████████████████████████████▋           | 348/485 [00:09<00:03, 44.45it/s]

hbtin/P342242 is not available or not complete
hbtin/P296767 is not available or not complete
hbtin/P342499 is not available or not complete
hbtin/P304002 is not available or not complete


 73%|█████████████████████████████▏          | 354/485 [00:09<00:02, 43.67it/s]

hbtin/P296755 is not available or not complete
hbtin/P342237 is not available or not complete


 74%|█████████████████████████████▋          | 360/485 [00:09<00:02, 46.78it/s]

hbtin/P342492 is not available or not complete
hbtin/P342491 is not available or not complete
hbtin/P296683 is not available or not complete
hbtin/P342228 is not available or not complete


 75%|██████████████████████████████▏         | 366/485 [00:09<00:02, 39.90it/s]

hbtin/P296764 is not available or not complete


 76%|██████████████████████████████▌         | 371/485 [00:10<00:02, 40.31it/s]

hbtin/P296701 is not available or not complete
hbtin/P342214 is not available or not complete
hbtin/P296715 is not available or not complete
hbtin/P303991 is not available or not complete


 78%|███████████████████████████████         | 376/485 [00:10<00:02, 38.63it/s]

hbtin/P342495 is not available or not complete


 79%|███████████████████████████████▍        | 381/485 [00:10<00:02, 38.58it/s]

hbtin/P296709 is not available or not complete
hbtin/P342217 is not available or not complete
hbtin/P296725 is not available or not complete


 80%|███████████████████████████████▉        | 387/485 [00:10<00:02, 42.48it/s]

hbtin/P342487 is not available or not complete
hbtin/P342421 is not available or not complete
hbtin/P342481 is not available or not complete


 81%|████████████████████████████████▎       | 392/485 [00:10<00:02, 42.66it/s]

hbtin/P303977 is not available or not complete
hbtin/P296718 is not available or not complete
hbtin/P342241 is not available or not complete


 82%|████████████████████████████████▋       | 397/485 [00:10<00:02, 43.58it/s]

hbtin/P342218 is not available or not complete
hbtin/P342245 is not available or not complete
hbtin/P342480 is not available or not complete


 83%|█████████████████████████████████▏      | 402/485 [00:10<00:01, 42.12it/s]

hbtin/P342234 is not available or not complete
hbtin/P296733 is not available or not complete
hbtin/P342483 is not available or not complete


 84%|█████████████████████████████████▌      | 407/485 [00:10<00:01, 42.19it/s]

hbtin/P342484 is not available or not complete
hbtin/P304007 is not available or not complete
hbtin/P303985 is not available or not complete
hbtin/P342232 is not available or not complete
hbtin/P296706 is not available or not complete


 85%|█████████████████████████████████▉      | 412/485 [00:11<00:01, 36.91it/s]

hbtin/P304005 is not available or not complete
hbtin/P342251 is not available or not complete


 86%|██████████████████████████████████▍     | 417/485 [00:11<00:01, 39.96it/s]

hbtin/P342266 is not available or not complete
hbtin/P342211 is not available or not complete
hbtin/P296682 is not available or not complete
hbtin/P296740 is not available or not complete


 87%|██████████████████████████████████▊     | 422/485 [00:11<00:01, 38.68it/s]

hbtin/P342425 is not available or not complete
hbtin/P303996 is not available or not complete
hbtin/P303974 is not available or not complete
hbtin/P342423 is not available or not complete


 88%|███████████████████████████████████▏    | 427/485 [00:11<00:01, 39.16it/s]

hbtin/P342235 is not available or not complete
hbtin/P342500 is not available or not complete
hbtin/P309742 is not available or not complete


 89%|███████████████████████████████████▋    | 432/485 [00:11<00:01, 40.07it/s]

hbtin/P342220 is not available or not complete
hbtin/P297902 is not available or not complete
hbtin/P342267 is not available or not complete
hbtin/P296689 is not available or not complete


 91%|████████████████████████████████████▏   | 439/485 [00:11<00:01, 45.52it/s]

hbtin/P296719 is not available or not complete
hbtin/P296705 is not available or not complete


 92%|████████████████████████████████████▌   | 444/485 [00:11<00:00, 45.25it/s]

hbtin/P342263 is not available or not complete
hbtin/P342226 is not available or not complete
hbtin/P342225 is not available or not complete


 93%|█████████████████████████████████████   | 450/485 [00:11<00:00, 47.46it/s]

hbtin/P296772 is not available or not complete
hbtin/P342494 is not available or not complete
hbtin/P342464 is not available or not complete
hbtin/P303976 is not available or not complete


 94%|█████████████████████████████████████▋  | 457/485 [00:11<00:00, 51.49it/s]

hbtin/P342252 is not available or not complete
hbtin/P296771 is not available or not complete


 95%|██████████████████████████████████████▏ | 463/485 [00:12<00:00, 51.96it/s]

hbtin/P342474 is not available or not complete
hbtin/P303998 is not available or not complete
hbtin/P342268 is not available or not complete


 97%|██████████████████████████████████████▋ | 469/485 [00:12<00:00, 53.42it/s]

hbtin/P342250 is not available or not complete
hbtin/P342219 is not available or not complete
hbtin/P342488 is not available or not complete


 98%|███████████████████████████████████████▏| 475/485 [00:12<00:00, 53.04it/s]

hbtin/P296717 is not available or not complete
hbtin/P342463 is not available or not complete
hbtin/P311834 is not available or not complete
hbtin/P296770 is not available or not complete


 99%|███████████████████████████████████████▋| 481/485 [00:12<00:00, 54.65it/s]

hbtin/P296751 is not available or not complete


100%|████████████████████████████████████████| 485/485 [00:12<00:00, 38.99it/s]


Parsing riao


100%|███████████████████████████████████████| 883/883 [00:03<00:00, 222.75it/s]


Parsing ribo/babylon2


  0%|                                                   | 0/38 [00:00<?, ?it/s]

ribo/babylon2/Q006275 is not available or not complete


100%|█████████████████████████████████████████| 38/38 [00:00<00:00, 258.50it/s]


Parsing ribo/babylon3


100%|███████████████████████████████████████████| 4/4 [00:00<00:00, 285.71it/s]


Parsing ribo/babylon4


100%|███████████████████████████████████████████| 6/6 [00:00<00:00, 999.99it/s]


Parsing ribo/babylon5


100%|████████████████████████████████████████████████████| 1/1 [00:00<?, ?it/s]


Parsing ribo/babylon6


100%|███████████████████████████████████████| 126/126 [00:00<00:00, 180.26it/s]


Parsing ribo/babylon7


100%|█████████████████████████████████████████| 30/30 [00:00<00:00, 222.22it/s]


Parsing ribo/babylon8


100%|████████████████████████████████████████████| 3/3 [00:00<00:00, 96.77it/s]


Parsing ribo/babylon10


100%|███████████████████████████████████████████| 1/1 [00:00<00:00, 111.11it/s]


Parsing rimanum


100%|███████████████████████████████████████| 378/378 [00:00<00:00, 588.78it/s]


Parsing rinap/rinap1


100%|█████████████████████████████████████████| 92/92 [00:00<00:00, 223.30it/s]


Parsing rinap/rinap3


100%|████████████████████████████████████████| 261/261 [00:05<00:00, 48.11it/s]


Parsing rinap/rinap4


100%|███████████████████████████████████████| 183/183 [00:01<00:00, 126.21it/s]


Parsing rinap/rinap5


100%|████████████████████████████████████████| 140/140 [00:02<00:00, 59.35it/s]


Parsing suhu


100%|██████████████████████████████████████████| 33/33 [00:00<00:00, 99.40it/s]


## 3 Data Structuring
### 3.1 Transform the Data into a DataFrame
The list `lemm_l` is transformed into a `pandas` dataframe for further manipulation.

Not every JSON file contains all the data types that can occur in an [ORACC](http://oracc.org) signature. Only Sumerian words have a `base`, so if your data set has no Sumerian, this column will not exist in the DataFrame. If a text has no breakage information in the form of `1 line broken` (etc.), the fields `extent`, `scope`, and `state` do not exist. Code below that references such fields may then fail, and you may need to adjust those lines.

In [7]:
words_df = pd.DataFrame(lemm_l)
words_df = words_df.fillna('')   # replace NaN (Not a Number) with empty string
words_df

Unnamed: 0,base,cf,cont,contrefs,delim,epos,extent,field,form,gdl,...,lang,morph,norm,norm0,pos,scope,sense,state,stem,syntax_ub-after
0,,awātu,,,,N,,,a-bat,"[{'v': 'a', 'gdl_utf8': '𒀀', 'id': 'P334190.2....",...,akk-x-neoass,,abat,,N,,word,,,
1,,šarru,,,,N,,,LUGAL,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",...,akk-x-neoass,,šarri,,N,,king,,,
2,,ana,,,,PRP,,,a-na,"[{'v': 'a', 'gdl_utf8': '𒀀', 'id': 'P334190.2....",...,akk-x-neoass,,ana,,PRP,,to,,,
3,,Ašipa,,,,PN,,,{1}a-ši-pa-a,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'n...",...,akk-x-neoass,,Ašipa,,PN,,1,,,
4,,šulmu,,,,N,,,DI-mu,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",...,akk-x-neoass,,šulmu,,N,,health,,,
5,,yâšim,,,,IP,,,ia-a-ši,"[{'v': 'ia', 'gdl_utf8': '𒅀', 'id': 'P334190.3...",...,akk-x-neoass,,ayāši,,IP,,me,,,
6,,libbu,,,,N,,,ŠA₃-ba-ka,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",...,akk-x-neoass,,libbaka,,N,,mood,,,
7,,lū,,,,MOD,,,lu,"[{'v': 'lu', 'gdl_utf8': '𒇻', 'id': 'P334190.4...",...,akk-x-neoass,,lū,,MOD,,may,,,
8,,ṭābu,,,,AJ,,,DUG₃.GA-ka,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",...,akk-x-neoass,,ṭābka,,AJ,,good,,,
9,,ūmu,,,,N,,,UD-mu,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",...,akk-x-neoass,,ūmu,,N,,day,,,
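
Because optional fields such as `base`, `extent`, `scope`, and `state` may be absent, one defensive option is to add any missing column as an empty string before later cells reference it. The following is a minimal sketch on a toy DataFrame; the helper name `ensure_columns` and the list `optional` are assumptions made for illustration and are not part of the `utils` module.

```python
import pandas as pd

def ensure_columns(df, columns, fill=''):
    """Return a copy of df in which every name in `columns` exists as a column.

    Hypothetical helper: columns that are missing are added and filled with `fill`.
    """
    df = df.copy()
    for col in columns:
        if col not in df.columns:
            df[col] = fill
    return df

# Toy data standing in for words_df; this row has no breakage fields and no base.
toy = pd.DataFrame([{'id_word': 'P000001.2.1', 'cf': 'šarru', 'form': 'LUGAL'}])
optional = ['base', 'extent', 'scope', 'state']
toy = ensure_columns(toy, optional)
print(toy.columns.tolist())
```

Applied to `words_df`, the same call would let later cells reference these fields even when a project does not contain them.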


### 3.2 Remove Spaces and Commas from Guide Word and Sense
Spaces and commas in Guide Word and Sense may cause trouble in computational methods that rely on tokenization, or when the data are saved in Comma Separated Values format. Spaces are replaced by hyphens and commas are removed (replaced by the empty string). By default the `replace()` function in `pandas` will match the entire string (that is, "lugal" matches "lugal" but there is no match between "l" and "lugal"). In order to match partial strings the parameter `regex` must be set to `True`.

The `replace()` function takes a nested dictionary as argument. The top-level keys identify the columns on which the `replace()` function should operate (in this case 'gw' and 'sense'). The value of each key is another dictionary with the search string as key and the replace string as value.

In [8]:
findreplace = {' ' : '-', ',' : ''}
words_df = words_df.replace({'gw' : findreplace, 'sense' : findreplace}, regex=True)
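
The difference between whole-string matching and `regex=True` can be seen on a small example. This is a sketch on invented toy data (the values in `toy` are assumptions, not taken from any ORACC project); it uses the same nested-dictionary form as the cell above.

```python
import pandas as pd

# Toy Guide Word and Sense values containing spaces and a comma.
toy = pd.DataFrame({'gw': ['to go', 'king'],
                    'sense': ['to go, to walk', 'king']})
findreplace = {' ': '-', ',': ''}

# Default behaviour: a cell is replaced only if the whole cell equals ' ' or ','.
whole = toy.replace({'gw': findreplace, 'sense': findreplace})

# With regex=True the search strings are matched anywhere inside the cell.
partial = toy.replace({'gw': findreplace, 'sense': findreplace}, regex=True)

print(whole)
print(partial)
```

In `whole` nothing changes, because no cell consists of a single space or comma; in `partial` the spaces become hyphens and the comma disappears.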

The columns in the resulting DataFrame correspond to the elements of a full [ORACC](http://oracc.org) signature, plus information about text, line, and word ids:
* base (Sumerian only)
* cf (Citation Form)
* cont (continuation of the base; Sumerian only)
* epos (Effective Part of Speech)
* form (transliteration, omitting all flags such as indication of breakage)
* frag (transliteration; including flags)
* gdl_utf8 (cuneiform)
* gw (Guide Word: main or first translation in standard dictionary)
* id_text (six-digit P, Q, or X number)
* id_word (word ID in the format Text_ID.Line_ID.Word_ID)
* label (traditional line number in the form o ii 2' (obverse column 2 line 2'), etc.)
* lang (language code, including sux, sux-x-emegir, sux-x-emesal, akk, akk-x-stdbab, etc)
* morph (Morphology; Sumerian only)
* norm (Normalization: Akkadian)
* norm0 (Normalization: Sumerian)
* pos (Part of Speech)
* sense (contextual meaning)
* sig (full ORACC signature)

Not all data elements (columns) are available for all words. Sumerian words never have a `norm`; Akkadian words do not have `norm0`, `base`, `cont`, or `morph`. Most data elements are only present when the word is lemmatized; only `lang`, `form`, `id_word`, and `id_text` should always be there.

### 3.3 Create Line ID
The DataFrame currently has a word-by-word data representation. We will add to each word a field `id_line` that will make it possible to reconstruct lines. This newly created field `id_line` is different from a traditional line number (found in the field `label`) in two ways. First, `id_line` is an integer, so that lines are sorted correctly. Second, `id_line` is assigned to words, but also to gaps and horizontal drawings on the tablet. The field `id_line` will allow us to keep all these elements in their proper order.

The field `id_line` is created by splitting the field `id_word` into (two or) three elements. The format of `id_word` is `IDtext.line.word`. The middle part, the line ID, is selected and its data type is changed from string to integer. Rows that represent gaps in the text or horizontal drawings have an `id_word` in the format `IDtext.line` (consisting of only two elements), but are treated in exactly the same way.

In [9]:
words_df['id_line'] = [int(wordid.split('.')[1]) for wordid in words_df['id_word']]
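
The reason for converting the line ID to an integer becomes visible on a handful of IDs. The sketch below uses invented `id_word` values (the two-element ID `P334190.11` is a hypothetical gap row) and compares the integer result with what string sorting would give.

```python
# Three-element IDs (words) and one two-element ID (a gap or ruling).
id_words = ['P334190.2.1', 'P334190.2.2', 'P334190.10.1', 'P334190.11']

# Same expression as in the cell above: take the middle element and cast to int.
id_lines = [int(w.split('.')[1]) for w in id_words]

# For comparison: sorting the middle element as a string puts '10' before '2'.
as_strings = sorted(w.split('.')[1] for w in id_words)

print(id_lines)
print(as_strings)
```

Both ID formats yield a usable line number, and only the integer version keeps line 2 before line 10.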

## 4 Save Results in CSV File or in Pickle
The output file is called `parsed.csv` and is placed in the directory `output`. On most computers, `csv` files open automatically in Excel. Excel does not deal well with `utf-8` encoding (files in `utf-8` need to be imported; see the instructions [here](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0)). If you intend to use the file in Excel, change `encoding="utf-8"` to `encoding="utf-16"`. For usage in computational text analysis applications `utf-8` is usually preferred.

The Pandas function `to_pickle()` writes a binary file that can be opened in a later phase of the project with the `read_pickle()` command and will reproduce exactly the same DataFrame with the same data structure. The resulting file cannot be used in other programs.

In [None]:
savefile =  'parsed.csv'
with open('output/' + savefile, 'w', encoding="utf-8") as w:
    words_df.to_csv(w, index=False)
pickled = "parsed.p"
with open('output/' + pickled, 'wb') as w:
    words_df.to_pickle(w)
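
To read the saved files back in a later session, `read_pickle()` and `read_csv()` can be used. The sketch below is an illustration only: it writes a toy DataFrame to a temporary directory (the file names `toy.csv` and `toy.p` are assumptions) rather than to `output`. Note that `read_csv()` by default turns empty fields into `NaN`; passing `keep_default_na=False` keeps them as empty strings.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for words_df, including an empty string as produced by fillna('').
toy = pd.DataFrame({'id_text': ['P000001'], 'id_line': [2],
                    'form': ['a-bat'], 'cf': ['']})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'toy.csv')
    pickle_path = os.path.join(tmp, 'toy.p')

    toy.to_csv(csv_path, index=False, encoding='utf-8')
    toy.to_pickle(pickle_path)

    # The pickle reproduces the DataFrame exactly as it was saved.
    from_pickle = pd.read_pickle(pickle_path)
    # The CSV is re-parsed; keep_default_na=False keeps '' instead of NaN.
    from_csv = pd.read_csv(csv_path, encoding='utf-8', keep_default_na=False)

print(from_pickle.equals(toy))
print(from_csv)
```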

## 5 Post Processing
### 5.1 Manipulate for Analysis on Line Level
For analyses that use a line as unit of analysis (e.g. lines in lexical texts as used in Chapter 3) one may need to create lemmas and combine these into lines by using the `id_line` variable.

#### 5.1.1 Create Lemma Column
A lemma, [ORACC](http://oracc.org) style, combines Citation Form, GuideWord and POS into a unique reference to one particular lemma in a standard dictionary, as in `lugal[king]N` (Sumerian) or `šarru[king]N`. Usually, not all words in a text are lemmatized, because a word may be (partly) broken and/or unknown. Unlemmatized and unlemmatizable words will receive a place-holder lemmatization that consists of the transliteration of the word (instead of the Citation Form), with `NA` as GuideWord and `NA` as POS, as in `i-bu-x[NA]NA`. Note that `NA` is a string. Finally, rows representing horizontal rulings or broken lines have the empty string in both Citation Form and Form. In those cases Lemma should have the empty string, too.

In [11]:
words_df['lemma'] = words_df["cf"] + '[' + words_df["gw"] + ']' + words_df["pos"]
words_df.loc[words_df["cf"] == "" , 'lemma'] = words_df['form'] + '[NA]NA'
words_df.loc[words_df["form"] == "", 'lemma'] = ""
words_df

Unnamed: 0,base,cf,cont,contrefs,delim,epos,extent,field,form,gdl,...,norm,norm0,pos,scope,sense,state,stem,syntax_ub-after,id_line,lemma
0,,awātu,,,,N,,,a-bat,"[{'v': 'a', 'gdl_utf8': '𒀀', 'id': 'P334190.2....",...,abat,,N,,word,,,,2,awātu[word]N
1,,šarru,,,,N,,,LUGAL,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",...,šarri,,N,,king,,,,2,šarru[king]N
2,,ana,,,,PRP,,,a-na,"[{'v': 'a', 'gdl_utf8': '𒀀', 'id': 'P334190.2....",...,ana,,PRP,,to,,,,2,ana[to]PRP
3,,Ašipa,,,,PN,,,{1}a-ši-pa-a,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'n...",...,Ašipa,,PN,,1,,,,2,Ašipa[1]PN
4,,šulmu,,,,N,,,DI-mu,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",...,šulmu,,N,,health,,,,3,šulmu[completeness]N
5,,yâšim,,,,IP,,,ia-a-ši,"[{'v': 'ia', 'gdl_utf8': '𒅀', 'id': 'P334190.3...",...,ayāši,,IP,,me,,,,3,yâšim[to-me]IP
6,,libbu,,,,N,,,ŠA₃-ba-ka,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",...,libbaka,,N,,mood,,,,4,libbu[interior]N
7,,lū,,,,MOD,,,lu,"[{'v': 'lu', 'gdl_utf8': '𒇻', 'id': 'P334190.4...",...,lū,,MOD,,may,,,,4,lū[may]MOD
8,,ṭābu,,,,AJ,,,DUG₃.GA-ka,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",...,ṭābka,,AJ,,good,,,,4,ṭābu[good]AJ
9,,ūmu,,,,N,,,UD-mu,"[{'gg': 'logo', 'gdl_type': 'logo', 'group': [...",...,ūmu,,N,,day,,,,5,ūmu[day]N
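
The three cases described above (lemmatized word, unlemmatized word, and empty row) can be checked on a three-row toy DataFrame. The data are invented for illustration; the three assignment statements are the same as in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({
    'cf':   ['šarru', '', ''],        # Citation Form; empty if not lemmatized
    'gw':   ['king', '', ''],         # Guide Word
    'pos':  ['N', '', ''],            # Part of Speech
    'form': ['LUGAL', 'i-bu-x', ''],  # transliteration; empty for rulings and gaps
})

# 1. Regular lemma: cf[gw]pos
toy['lemma'] = toy['cf'] + '[' + toy['gw'] + ']' + toy['pos']
# 2. Unlemmatized word: fall back on the transliteration with NA placeholders
toy.loc[toy['cf'] == '', 'lemma'] = toy['form'] + '[NA]NA'
# 3. Ruling or gap: no form, so the lemma is the empty string
toy.loc[toy['form'] == '', 'lemma'] = ''

print(toy['lemma'].tolist())
```

The order of the last two statements matters: a row without `form` first receives `[NA]NA` in step 2 and is then reset to the empty string in step 3.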


#### 5.1.2 Group by Line
In the `words_df` dataframe each word has a separate row. In order to change this into a line-by-line representation we use the `pandas` `groupby()` function, using the fields `id_text`, `id_line`, and `label` as the grouping keys.

The fields that are aggregated are `lemma`, `extent`, `scope`, and `state`. The fields `extent`, `scope`, and `state` represent data on the number of broken lines. For instance, the notation `4 lines missing` in the [ORACC](http://oracc.org) edition will result in `extent = "4"`, `scope = "line"`, `state = "missing"` (note that the value of `extent` is a string and will be `"n"` if the number of missing lines or columns is unknown).

If your data does not have the fields `extent`, `scope`, and `state`, the code below will fail. In that case, simply delete the lines that reference those fields.

In [None]:
lines = words_df.groupby(['id_text', 'id_line', 'label']).agg({
        'lemma': ' '.join,
        'extent': ''.join, 
        'scope': ''.join,
        'state': ''.join
    }).reset_index()
lines
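
As an alternative to deleting lines by hand, the aggregation dictionary can be restricted to the columns that actually exist. This is a sketch on toy data, not the notebook's own approach; the names `wanted` and `agg` are assumptions introduced for the example.

```python
import pandas as pd

# Toy word-level data without the fields extent, scope, and state.
toy = pd.DataFrame({
    'id_text': ['P000001'] * 3,
    'id_line': [2, 2, 3],
    'label':   ['o 1', 'o 1', 'o 2'],
    'lemma':   ['awātu[word]N', 'šarru[king]N', 'ana[to]PRP'],
})

# All aggregations we would like to perform ...
wanted = {'lemma': ' '.join, 'extent': ''.join, 'scope': ''.join, 'state': ''.join}
# ... reduced to those whose column is present in the data.
agg = {col: func for col, func in wanted.items() if col in toy.columns}

lines = toy.groupby(['id_text', 'id_line', 'label']).agg(agg).reset_index()
print(lines)
```

With this toy data only `lemma` is aggregated; with data that include breakage information, all four fields would be.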

### 5.2 Alternative: Texts in Normalized Transcription
This code (which is useful mostly for Akkadian texts) will produce a text in normalized transcription, essentially following the pattern of the preceding section. Before grouping words into documents, we need to take care of words that have not been normalized (for instance because of breakage); for those words we fall back on the field `form`. The new field `norm1` then holds the normalized form of the word if it is available; if not, it holds the raw transliteration (without flags or breakage information).

In [None]:
words_df["norm1"] = words_df["norm"]
words_df.loc[words_df["norm1"] == "" , 'norm1'] = words_df['form']

In [None]:
texts_norm = words_df.groupby('id_text').agg({
        'norm1': ' '.join,
    }).reset_index()
texts_norm
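
The fallback from `norm` to `form` and the grouping by document can be illustrated on a few invented words. The sketch uses `Series.where()` as an equivalent, one-step way to build `norm1`; this is a substitution for the two-step assignment above, not the notebook's own code.

```python
import pandas as pd

toy = pd.DataFrame({
    'id_text': ['P000001', 'P000001', 'P000002'],
    'norm':    ['abat', '', 'šarri'],           # '' means: not normalized
    'form':    ['a-bat', 'i-bu-x', 'LUGAL'],
})

# Keep norm where it is non-empty; otherwise take the transliteration from form.
toy['norm1'] = toy['norm'].where(toy['norm'] != '', toy['form'])

# One row per document, words joined by spaces.
texts = toy.groupby('id_text').agg({'norm1': ' '.join}).reset_index()
print(texts)
```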

#### 5.2.1 Save Normalized Transcriptions
The `texts_norm` DataFrame has one complete document in normalized transcription in each row. The code below saves each row as a separate `.txt` file, named after the document's ID.

In [None]:
for idx, Q in enumerate(texts_norm["id_text"]):
    savefile =  Q[-7:] + '.txt'
    with open('output/' + savefile, 'w', encoding="utf-8") as w:
        texts_norm.iloc[idx].to_csv(w, index = False)
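
If the `.txt` files should contain only the transcription, without any CSV formatting, the text can be written directly with `write()`. The following is a sketch of that variation under stated assumptions: the toy `id_text` value `saao/saa01/P334190` (project path followed by the text number) is inferred from the `Q[-7:]` slice above, and the files are written to a temporary directory instead of `output`.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for texts_norm with one document.
texts = pd.DataFrame({'id_text': ['saao/saa01/P334190'],
                      'norm1': ['abat šarri ana Ašipa']})

with tempfile.TemporaryDirectory() as tmp:
    for id_text, text in zip(texts['id_text'], texts['norm1']):
        # The last seven characters are the P, Q, or X number.
        path = os.path.join(tmp, id_text[-7:] + '.txt')
        with open(path, 'w', encoding='utf-8') as w:
            w.write(text)  # plain text only

    written = sorted(os.listdir(tmp))
    with open(os.path.join(tmp, written[0]), encoding='utf-8') as r:
        content = r.read()

print(written)
print(content)
```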