# Extract Lemmatization from ORACC JSON: Basic Parser
The code in this notebook will parse [ORACC](http://oracc.org) `JSON` files to extract lemmatization data for one or more projects. The resulting `csv` (Comma Separated Values) file is named `parsed.csv` and has two fields: a Text ID (e.g. `dcclt/Q000039`) and a string of lemmas in the format `lugal[king]N` (or `šarru[king]N` for Akkadian texts).

The output of the Basic Parser contains *only* text IDs and lemmas. This format is useful for so-called [bag of words](https://en.wikipedia.org/wiki/Bag-of-words_model) techniques such as word clouds or [topic modeling](https://en.wikipedia.org/wiki/Topic_model). The `JSON` files, however, contain a wealth of other data, including language (Sumerian, Akkadian, Emesal, etc.), orthographic form, morphology (currently only for Sumerian and Emesal), line numbers, breakage, and metadata. The extended JSON parser notebook (####), which builds upon the techniques demonstrated here, will show how to extract any type of data.
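To illustrate what "bag of words" means for this output format, here is a minimal sketch that counts lemmas in one lemma string. The string is an invented example in the `cf[gw]pos` format, not real ORACC data.

```python
from collections import Counter

# Invented lemma string in the format produced by this notebook (not real data).
lemmas = "lugal[king]N e[house]N lugal[king]N"

# A bag of words ignores word order and keeps only the counts per lemma.
bag = Counter(lemmas.split())
print(bag.most_common())
```

Because spaces separate the lemmas, a plain `split()` is enough to tokenize the string.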

In [1]:
import pandas as pd
import zipfile
import json
import tqdm
import requests
import errno
import os

## 0 Create Directories, if Necessary
The script needs two directories: `jsonzip` and `output`. Each one is created if it does not yet exist; if it already exists, nothing happens.

For the code, see [Stack Overflow](http://stackoverflow.com/questions/18973418/os-mkdirpath-returns-oserror-when-directory-does-not-exist).
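As an alternative to the `try`/`except` pattern used in this section, Python 3 offers `os.makedirs()` with `exist_ok=True`. The sketch below demonstrates it inside a temporary directory (an assumption made only to keep the example free of side effects); in the notebook itself one would pass `'jsonzip'` and `'output'` directly.

```python
import os
import tempfile

# Work inside a throwaway directory so the sketch leaves no traces.
base = tempfile.mkdtemp()

for d in ['jsonzip', 'output']:
    path = os.path.join(base, d)
    os.makedirs(path, exist_ok=True)   # creates the directory
    os.makedirs(path, exist_ok=True)   # a second call is harmless: no error is raised

created = sorted(os.listdir(base))
print(created)
```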

In [2]:
directories = ['jsonzip', 'output']
for d in directories:
    try:
        os.mkdir(d)
    except OSError as exc:
        if exc.errno != errno.EEXIST:   # ignore "directory already exists"; re-raise anything else
            raise

## 1.1 Input Project Names
Provide a list of one or more project names, separated by commas. Note that subprojects must be listed separately, because they are not included in the main project. For instance:

`saao/saa01,saao/saa02,blms`

In [3]:
projects = input('Project(s): ').lower()

Project(s): obmc, saao/saa01


## 1.2 Split the List of Projects
Split the list of projects and create a list of project names.

In [4]:
p = projects.split(',')               # split at each comma and make a list called `p`
p = [x.strip() for x in p]        # strip spaces left and right of each entry in `p`
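A slightly more defensive variant of this step also drops empty entries, which appear when the input ends in a comma or contains two commas in a row. This is a sketch with a hard-coded example string standing in for the `input()` call.

```python
# Example input with stray spaces and a trailing comma (stand-in for input()).
projects = 'obmc, saao/saa01, '.lower()

# Split at each comma, strip whitespace, and skip entries that end up empty.
p = [x.strip() for x in projects.split(',') if x.strip()]
print(p)
```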

## 1.3 Download the ZIP files
For each project in the list, download the `zip` file with all of its `json` files from `http://build-oracc.museum.upenn.edu/json/`. The file is called `PROJECT.zip` (for instance: `dcclt.zip`). For subprojects the file is called `PROJECT-SUBPROJECT.zip` (for instance `cams-gkab.zip`).
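The naming rule can be sketched as two small helpers. The function names `zip_name()` and `zip_url()` are hypothetical and introduced here only for illustration; the base URL is the one given above.

```python
BASE_URL = "http://build-oracc.museum.upenn.edu/json/"   # base URL as given in the text

def zip_name(project):
    """Hypothetical helper: 'cams/gkab' -> 'cams-gkab.zip'."""
    return project.strip().lower().replace('/', '-') + '.zip'

def zip_url(project):
    """Hypothetical helper: full download URL for a project or subproject."""
    return BASE_URL + zip_name(project)

for proj in ['dcclt', 'cams/gkab']:
    print(proj, '->', zip_url(proj))
```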

For larger projects (such as [DCCLT](http://oracc.org/dcclt)) the `zip` file may be 25 MB or more. Downloading may take some time and it may be necessary to chunk the downloading process. The `iter_content()` method in the `requests` library takes care of that.

If you have downloaded the files by hand (and put them in the `jsonzip` directory) you may skip this cell and jump directly to section [2.1 The `parsejson()` function](#head21).

In [5]:
CHUNK = 16 * 1024
for project in tqdm.tqdm(p):
    project = project.replace('/', '-')
    url = "http://build-oracc.museum.upenn.edu/json/" + project + ".zip"
    file = 'jsonzip/' + project + '.zip'
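    # stream=True defers the body download, so iter_content() really reads it chunk by chunk
    with requests.get(url, stream=True) as r:
        if r.status_code == 200:
            print("Downloading " + url + " saving as " + file)
            with open(file, 'wb') as f:
                for c in r.iter_content(chunk_size=CHUNK):
                    f.write(c)
        else:
            print(url + " does not exist.")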

  0%|                                                    | 0/2 [00:00<?, ?it/s]

Downloading http://build-oracc.museum.upenn.edu/json/obmc.zip saving as jsonzip/obmc.zip


 50%|██████████████████████                      | 1/2 [00:06<00:06,  6.82s/it]

Downloading http://build-oracc.museum.upenn.edu/json/saao-saa01.zip saving as jsonzip/saao-saa01.zip


100%|████████████████████████████████████████████| 2/2 [00:22<00:00,  9.35s/it]


## <a name="head21"></a>2.1 The `parsejson()` function
The `parsejson()` function will "dig into" the `json` file (transformed into a dictionary) until it finds the relevant data. The `json` file consists of a hierarchy of `cdl` nodes; only the lowest nodes contain lemmatization data. The function goes down this hierarchy by calling itself when another `cdl` node is encountered. For more information about the data hierarchy in the [ORACC](http://oracc.org) `json` files, see [ORACC Open Data](http://oracc.museum.upenn.edu/doc/opendata/index.html).

The argument of the `parsejson()` function is a `JSON` object, essentially a Python dictionary that initially contains the entire contents of the original JSON file. The code takes the key `cdl`, which itself contains an array (a list) of `JSON` objects. The function iterates through these objects; if an object contains another `cdl` node, the function calls itself with this lower-level object as argument. This way the function digs deeper and deeper into the `JSON` tree, until it no longer encounters a `cdl` key. At that point we are at the level of individual words. The code checks for a key `f`; if it exists, the value of that key (a dictionary) is appended to the list `lemm_l`. The list `lemm_l`, which is initialized outside of the function proper, will become a list of dictionaries, where each dictionary represents a single word.

The variable `id_text` consists of a project abbreviation, such as `blms` or `cams/gkab` plus a text ID, in the format `cams/gkab/P338616` or `dcclt/Q000039`. The `id_text` is a global variable that is defined in the main process. Therefore, it can be accessed from within the function and is added to the lemmatization data of every word.

In [6]:
def parsejson(text):
    for JSONobject in text["cdl"]:
        if "cdl" in JSONobject: 
            parsejson(JSONobject)
        if "f" in JSONobject:
            lemm = JSONobject["f"]
            lemm["id_text"] = id_text
            lemm_l.append(lemm)
    return
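To see the recursion at work without downloading anything, the sketch below runs the same logic on a small hand-made dictionary. The function `collect_lemmas()` is a hypothetical variant that takes `id_text` as an argument and returns the list instead of relying on global variables; the toy data is a heavily simplified imitation of the real `cdl` hierarchy, not an actual ORACC file.

```python
def collect_lemmas(node, id_text, out=None):
    """Hypothetical variant of parsejson() that returns its results."""
    if out is None:
        out = []
    for obj in node.get("cdl", []):
        if "cdl" in obj:                    # another level: go deeper
            collect_lemmas(obj, id_text, out)
        if "f" in obj:                      # word level: keep the lemmatization data
            lemm = dict(obj["f"])           # copy, so the input is not modified
            lemm["id_text"] = id_text
            out.append(lemm)
    return out

# Invented toy structure: two words at different depths of the hierarchy.
toy = {"cdl": [
    {"cdl": [
        {"f": {"form": "lugal", "cf": "lugal", "gw": "king", "pos": "N"}},
        {"cdl": [
            {"f": {"form": "e₂", "cf": "e", "gw": "house", "pos": "N"}},
        ]},
    ]},
]}

words = collect_lemmas(toy, "test/P000001")
print([w["cf"] for w in words])
```

Passing the list and the text ID explicitly makes the function easier to test in isolation, at the cost of a slightly longer signature.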

## 2.2 Call the `parsejson()` function for every `JSON` file
The code in this cell will iterate through the list of projects entered above (1.1). For each project the `JSON` zip file is located in the directory `jsonzip`, named PROJECT.zip. The `zip` file contains a directory called `corpusjson`, which holds a JSON file for every text that is available in that corpus. The files are named after their text IDs in the pattern `P######.json` (or `Q######.json` or `X######.json`).

The function `namelist()` of the `zipfile` package is used to create a list of all the files in the ZIP. From this list we select all the files in the `corpusjson` directory.

Each of these files is read from the `zip` file and loaded with the command `json.loads()`, which transforms the string into a proper JSON object.

This JSON object (essentially a Python dictionary), which is called `data_json`, is now sent to the `parsejson()` function. The function adds lemmata to the `lemm_l` list.

In [24]:
lemm_l = [] # initialize the list that will hold all the lemmatization data
for project in p:
    file = "jsonzip/" + project.replace("/", "-") + ".zip"
    try:
        z = zipfile.ZipFile(file)       # create a ZipFile object
    except (OSError, zipfile.BadZipFile):
        print(file + " does not exist or is not a proper ZIP file")
        continue
    files = z.namelist()     # list of all the files in the ZIP
    # keep only the JSON files in corpusjson, which are named after the P, Q, and X numbers
    files = [name for name in files if "corpusjson" in name and name[-5:] == '.json']
    for filename in files:                            # iterate over the file names
        id_text = project + filename[-13:-5] # id_text is, for instance, blms/P414332
        try:
            text = z.read(filename).decode('utf-8')         # read and decode the json file of one particular text
            data_json = json.loads(text)                # make it into a json object (essentially a dictionary)
            parsejson(data_json)               # and send to the parsejson() function
        except Exception:
            print(id_text + ' is not available or not complete')
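The file selection and the slicing that produces `id_text` can be tried out on a small ZIP archive built in memory. The archive layout below (a project directory containing `corpusjson` plus one unrelated file) is an assumption made for the sake of the example, and `test` is an invented project name.

```python
import io
import json
import zipfile

# Build a tiny ZIP archive in memory (invented content, not real ORACC data).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as zf:
    zf.writestr('test/catalogue.json', json.dumps({}))
    zf.writestr('test/corpusjson/P000001.json', json.dumps({"cdl": []}))
    zf.writestr('test/corpusjson/Q000002.json', json.dumps({"cdl": []}))

project = 'test'
with zipfile.ZipFile(buffer) as z:
    # keep only the .json files inside corpusjson
    names = [n for n in z.namelist() if 'corpusjson' in n and n.endswith('.json')]
    # the last 13 characters are '/P000001.json'; dropping '.json' leaves '/P000001'
    ids = [project + n[-13:-5] for n in names]
    first = json.loads(z.read(names[0]).decode('utf-8'))

print(ids)
```

Note that the slice `[-13:-5]` relies on text IDs being exactly one letter followed by six digits.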

## 3 Data Structuring
### 3.1 Transform the Data into a DataFrame
The `lemm_l` list is transformed into a Pandas DataFrame for further manipulation.

For various reasons not all JSON files will have all data types that potentially exist in an [ORACC](http://oracc.org) signature. Only Sumerian words have a `base`, so if your data set has no Sumerian, this column will not exist in the DataFrame. If a text has no breakage information in the form of `$ 1 line broken` (etc.) the fields `extent`, `scope`, and `state` do not exist. Where code refers to a field that is absent from your data, it may fail, and you may need to take out some lines.

In [25]:
word_df = pd.DataFrame(lemm_l)
word_df = word_df.fillna('')      # replace NaN (Not a Number) with empty string
word_df

Unnamed: 0,base,cf,cont,delim,epos,form,gdl,gw,id_text,lang,morph,norm,norm0,pos,sense
0,{iti}ab-e₃,Abbaʾe,,,MN,{iti}ab-e₃-še₃,"[{'det': 'semantic', 'pos': 'pre', 'seq': [{'v...",1,obmc/P411563,sux,"~,eše",,"Abbaʾe,eše",MN,1
1,še,še,,,N,še,"[{'v': 'še', 'id': 'P411563.6.1.0'}]",barley,obmc/P411563,sux,~,,še,N,barley
2,aŋ₂,aŋ,,,V/t,i₃-aŋ₂-e,"[{'v': 'i₃', 'id': 'P411563.6.2.0', 'delim': '...",measure,obmc/P411563,sux,V:~;e,,V:aŋ;e,V/t,to measure
3,tukum-bi,tukumbi,,,CNJ,tukum-bi,"[{'v': 'tukum', 'id': 'P411563.7.1.0', 'delim'...",if,obmc/P411563,sux,~,,tukumbi,CNJ,if
4,šum₂,šum,,,V/t,la-ba-an-šum₂,"[{'v': 'la', 'id': 'P411563.8.1.0', 'delim': '...",give,obmc/P411563,sux,nu.ba.n:~,,nu.ba.n:šum,V/t,to give
5,maš₂,maš,,,N,maš₂,"[{'v': 'maš₂', 'id': 'P411563.9.1.0'}]",interest,obmc/P411563,sux,~,,maš,N,interest (on a loan)
6,,,,,,1(aš),"[{'n': 'n', 'form': '1(aš)', 'id': 'P411563.9....",,obmc/P411563,sux,,,,n,
7,gur,gur,,,N,gur,"[{'v': 'gur', 'id': 'P411563.9.3.0'}]",unit,obmc/P411563,sux,~,,gur,N,unit of capacity
8,,,,,,1(barig),"[{'n': 'n', 'form': '1(barig)', 'id': 'P411563...",,obmc/P411563,sux,,,,n,
9,,,,,,4,"[{'n': 'n', 'sexified': '4(diš)', 'form': '4',...",,obmc/P411563,sux,,,,n,
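One way to guard against missing columns, sketched here on invented data, is to add any expected column that is absent and fill it with empty strings. The list `expected` is an assumption; adjust it to the fields your own code refers to.

```python
import pandas as pd

# Invented example: a single Akkadian word, so there is no `base` column.
toy_words = [{"form": "šarru", "cf": "šarru", "gw": "king", "pos": "N", "lang": "akk"}]
df = pd.DataFrame(toy_words)

# Columns that later code expects to find (assumed list for this sketch).
expected = ['base', 'cf', 'gw', 'pos', 'form', 'sense']
for col in expected:
    if col not in df.columns:
        df[col] = ''          # add the missing column, filled with empty strings

print(sorted(df.columns))
```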


## 3.1 Remove Spaces and Commas from Guide Word and Sense
Spaces and commas in Guide Word and Sense may cause trouble during tokenization, or when the data are saved in Comma Separated Values format. Spaces are therefore replaced by hyphens, and commas are removed (replaced by an empty string).

In [26]:
findreplace = {' ' : '-', ',' : ''}
word_df = word_df.replace({'gw' : findreplace, 'sense' : findreplace}, regex = True)
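The effect of this nested-dictionary replacement can be checked on a few invented rows. Only the columns named in the outer dictionary are touched; other columns keep their spaces and commas.

```python
import pandas as pd

# Invented rows; `note` is a made-up column that should remain unchanged.
toy = pd.DataFrame({
    'gw': ['unit', 'go'],
    'sense': ['unit of capacity', 'to go, to walk'],
    'note': ['a b', 'c, d'],
})

findreplace = {' ': '-', ',': ''}
toy = toy.replace({'gw': findreplace, 'sense': findreplace}, regex=True)
print(toy)
```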

## 3.2 Create a `lemma` column
The following code combines the `cf` (Citation Form), `gw` (Guide Word), and `pos` (Part of Speech) columns to create a new `lemma` column with the format `cf[gw]pos`, for instance `šarru[king]N` or `lugal[king]N`. Unlemmatized words do not have `cf`, `gw`, or `pos`; they only have `form` (the transliteration). The code therefore has a condition: if `cf` is empty, the format should be `form[NA]NA`. Alternatively, one may leave out non-lemmatized words altogether and create the `lemma` column by simply concatenating `cf`, `gw`, and `pos`, as follows:
```python
word_df = word_df[word_df['cf'] != ''].copy()   # throw out rows with empty CF; copy() avoids a SettingWithCopyWarning
word_df['lemma'] = word_df['cf'] + '[' + word_df['gw'] + ']' + word_df['pos']
```

In [None]:
word_df["lemma"] = word_df.apply(lambda r: (r["cf"] + '[' + r["gw"] + ']' + r["pos"]) if r["cf"] != '' 
                                 else r['form'] + '[NA]NA', axis=1)
word_df[['id_text', 'lemma']]
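On large data sets a row-by-row `apply()` can be slow. A vectorized alternative, sketched here on invented data, builds both variants of the lemma for the whole column and then selects between them with `Series.where()`.

```python
import pandas as pd

# Invented rows: one lemmatized word and one unlemmatized word.
toy = pd.DataFrame({
    'cf':   ['lugal', ''],
    'gw':   ['king', ''],
    'pos':  ['N', ''],
    'form': ['lugal', 'x-x'],
})

lemmatized = toy['cf'] + '[' + toy['gw'] + ']' + toy['pos']
unlemmatized = toy['form'] + '[NA]NA'

# where() keeps `lemmatized` if the condition holds, otherwise takes `unlemmatized`.
toy['lemma'] = lemmatized.where(toy['cf'] != '', unlemmatized)
print(toy['lemma'].tolist())
```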

## 3.3 Group by Text ID
Get all the lemmas that belong to a single text in one row (one row = one document). The `agg()` (aggregate) function, which works on the result of a `groupby()` process, aggregates columns of the original DataFrame. The function takes a dictionary in which the keys are column names and the values are functions to be used in the aggregation process. The example below has only one such function (`' '.join` will join all entries in the column `lemma` with a space in between); one may specify (the same or different) functions for different columns, for instance:
> word_df = word_df.groupby("id_text").agg({"lemma": ' '.join, "base": ' '.join})

In [None]:
word_df = word_df.groupby("id_text").agg({"lemma": ' '.join})
word_df.reset_index()
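The grouping step can be followed on a small invented DataFrame. Note that `reset_index()` returns a new DataFrame; the result must be assigned to a variable if `id_text` is to become a regular column again.

```python
import pandas as pd

# Invented data: three words from two texts.
toy = pd.DataFrame({
    'id_text': ['test/P000001', 'test/P000001', 'test/Q000002'],
    'lemma':   ['lugal[king]N', 'e[house]N', 'šarru[king]N'],
})

# One row per text; the lemmas are joined with a space, in their original order.
docs = toy.groupby('id_text').agg({'lemma': ' '.join})
docs = docs.reset_index()      # turn the index `id_text` back into a column
print(docs)
```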

## 4 Save Results in CSV file
The output file is called `parsed.csv` and is placed in the directory `output`. In most cases, `csv` files open automatically in Excel. Excel does not deal well with `utf-8` encoding. If you intend to use the file in Excel, change `encoding="utf-8"` to `encoding="utf-16"`. For usage in computational text analysis applications `utf-8` is usually preferred.

(Alternatively, use the instructions [here](https://www.itg.ias.edu/content/how-import-csv-file-uses-utf-8-character-encoding-0) to import a `utf-8` file into Excel).

In [None]:
savefile = 'parsed.csv'
# newline='' lets the csv writer control line endings (avoids blank lines on Windows)
with open('output/' + savefile, 'w', encoding="utf-8", newline='') as w:
    word_df.to_csv(w)
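To check that the file can be read back correctly, one can reload it with `pandas.read_csv()`. The sketch below writes an invented one-row DataFrame to a temporary directory (an assumption made to avoid touching the real `output` directory) and reads it back.

```python
import os
import tempfile
import pandas as pd

# Invented one-row result with `id_text` as index, as after the groupby step.
docs = pd.DataFrame(
    {'lemma': ['lugal[king]N e[house]N']},
    index=pd.Index(['test/P000001'], name='id_text'),
)

path = os.path.join(tempfile.mkdtemp(), 'parsed.csv')
docs.to_csv(path, encoding='utf-8')

# Read the file back, using the first field (id_text) as index again.
back = pd.read_csv(path, encoding='utf-8', index_col='id_text')
print(back)
```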
