
# TODO
- Revise all introductory text
- words in additional/secondary not in main equiv word list (peš-bi)
- Sum glosses
- text (including half brackets, question marks, etc.)

# Parsing ETCSL: XML and Xpath
## 1. Introduction

The Electronic Text Corpus of Sumerian Literature ([ETCSL]; 1998-2006) provides editions and translations of 394 Sumerian literary compositions. The goal of this Notebook is to format the [ETCSL] data in such a way that the (lemmatized) texts are made available for computational text analysis. The [ETCSL] lemmatizations are made compatible with [ORACC] standards (see [ePSD2](http://oracc.org/epsd2/sux)), so that [ETCSL] and [ORACC] data can be mixed and matched for text analysis purposes.

For most purposes you do not need to run this scraper, because the final output is made available to you. However, if you need output in a different format or if you wish to know how the output was produced, you may read, adapt, and run this Notebook.

The original [ETCSL] files in `TEI XML` may be downloaded from the [Oxford Text Archive](http://ota.ox.ac.uk/desc/2518) under a Creative Commons Attribution non-Commercial Share-Alike ([BY-NC-SA 3.0](http://creativecommons.org/licenses/by-nc-sa/3.0/)) license.

The editors and copyright holders of [ETCSL] are: Jeremy Black, Graham Cunningham, Jarle Ebeling, Esther Flückiger-Hawker, Eleanor Robson, Jon Taylor, and Gábor Zólyomi.

The [manual](http://etcsl.orinst.ox.ac.uk/edition2/etcslmanual.php) of the [ETCSL] project explains in full detail the editorial principles and technical details. 
[ETCSL]: http://etcsl.orinst.ox.ac.uk
[ORACC]: http://oracc.org

### 1.1 XML
The [ETCSL] files as distributed by the [Oxford Text Archive](http://ota.ox.ac.uk/desc/2518) are encoded in a dialect of `XML` (Extensible Markup Language) that is referred to as `TEI` (Text Encoding Initiative). In this encoding each word (in transliteration) is an *element* that is surrounded by `<w>` and `</w>` tags. Inside the start-tag the word may receive several attributes, encoded as name/value pairs, as in the following random examples:

```xml
<w form="ti-a" lemma="te" pos="V" label="to approach">ti-a</w>
<w form="e2-jar8-bi" lemma="e2-jar8" pos="N" label="wall">e2-jar8-bi</w>
<w form="ickila-bi" lemma="ickila" pos="N" label="shell"><term id="c1813.t1">ickila</term><gloss lang="sux" target="c1813.t1">la</gloss>-bi</w>
```

The `form` attribute is the full form of the word, omitting flags (such as question marks), indication of breakage, or glosses. The `lemma` attribute is the form minus morphology. Some lemmas may be spelled in more than one way in Sumerian; the `lemma` attribute will use a standard spelling (note that the `lemma` of "ti-a" is "te"). The `lemma` in [ETCSL] (unlike `Citation Form` in [ORACC]) uses actual transliteration with hyphens and sign index numbers (as in `lemma = e2-jar8`).

The `label` attribute gives a general indication of the meaning of the Sumerian word but is not context-sensitive. That is, the `label` of "lugal" is always "king", even if in context the word means "owner". The `pos` attribute gives the Part of Speech, but again the attribute is not context-sensitive. Where a verb (such as sag₉, to be good) is used as an adjective the `pos` is still "V" (for verb). Together `lemma`, `label`, and `pos` define a Sumerian lemma (dictionary entry).

In parsing the [ETCSL] files we will be looking for the `<w>` and `</w>` tags to isolate words and their attributes. Higher level tags identify lines (`<l>` and `</l>`), versions, secondary text (found only in a minority of sources), etcetera.
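
As a minimal sketch of this step, the three sample words shown above can be wrapped in a line element and read with `lxml`. The wrapper `<l n="1">` and the variable names are assumptions made for illustration; the real [ETCSL] files carry considerably more structure.

```python
from lxml import etree

# The three sample <w> elements from above, wrapped in a hypothetical line.
sample = """<l n="1">
<w form="ti-a" lemma="te" pos="V" label="to approach">ti-a</w>
<w form="e2-jar8-bi" lemma="e2-jar8" pos="N" label="wall">e2-jar8-bi</w>
<w form="ickila-bi" lemma="ickila" pos="N" label="shell"><term id="c1813.t1">ickila</term><gloss lang="sux" target="c1813.t1">la</gloss>-bi</w>
</l>"""

line = etree.fromstring(sample)

# Collect the attributes of every <w> element as a plain dictionary.
words = [dict(w.attrib) for w in line.iter("w")]
for word in words:
    print(word["form"], word["lemma"], word["pos"], word["label"])
```

Each dictionary in `words` holds the name/value pairs of one start-tag, which is the raw material that the parser below reshapes.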

The [ETCSL] file set includes the [etcslmanual.html](http://etcsl.orinst.ox.ac.uk/edition2/etcslmanual.php) with explanations of the tags, their attributes, and their proper usage.

The goal of the parsing process is to get as much information as possible out of the `XML` tree in a format that is useful for computational text analysis. What "useful" means depends, of course, on the particular project. The output of the parser is a word-by-word (or rather lemma-by-lemma) representation of the entire [ETCSL] corpus in a format that is as close as possible to the output of the [ORACC] parser. For most projects it will be necessary to group words into lines or compositions, or to separate out a particular group of compositions. The data is formatted in such a way that this can be achieved with a standard set of Python commands.
[ETCSL]: http://etcsl.orinst.ox.ac.uk
[ORACC]: http://oracc.org

### 1.2 `lxml` and `Xpath`

There are several Python libraries specifically for parsing `XML`, among them the popular `ElementTree` and its C-accelerated twin `cElementTree` (the latter was removed in Python 3.9; `ElementTree` now uses the C implementation automatically). The library `lxml` is largely compatible with `ElementTree` and `cElementTree` but differs from those in its full support of `Xpath`. `Xpath` is a language for navigating `XML` trees. `Xpath` is not an independent language, but rather a set of conventions that can be implemented in multiple programming languages.

`Xpath` defines a **path** through an `XML` tree in the following format:
```xpath
'l/w'
```
This will select all `w` nodes that are direct children of `l` nodes. In [ETCSL] `XML` files that means: select all "word" nodes that belong to "line" nodes. An attribute is indicated by @; thus
```xpath
'l/w/@lemma'
```
selects all `lemma` attributes of all `w` nodes that are direct children of `l` nodes.

`Xpath` **predicates** allow further filtering. Predicates are put between square brackets and may be inserted at any place in the path, for instance:
```xpath
'l/w[@lemma = "e2-jar8"]'
```
will select all `w` nodes that are direct children of `l` nodes and that have a `lemma` attribute that equals "e2-jar8". 

`Xpath` axes (plural of **axis**) define a relationship between the current node and the target node. Axes include `child`, `descendant`, `parent`, `ancestor`, `following`, `preceding`, and several more. Axes are always followed by a double colon, as in:
```xpath
'l/w[preceding::addSpan]'
```
This expression selects all the `w` nodes that are direct children of `l` nodes and are preceded by an `addSpan` node. We will see that this can be used for selecting "secondary" and "additional" text (text appearing in a minority of sources) in the [ETCSL] `XML` files.

Finally, `Xpath` defines a number of **functions** and **operators**. The function that we will use most is `string()`. The `string()` function retrieves the string value of a node or attribute. Important operators are `|` (union: the nodes selected by either of two paths), `and`, and `=`. We will see examples of those below.

The above provides only the barest bones of what `Xpath` can do; we will see other examples (and how to implement those examples) in the code below. Libraries such as `lxml`, `ElementTree` and `cElementTree` offer other functions (not part of `Xpath`) that are occasionally simpler and more efficient than the `Xpath` methods (see http://lxml.de/performance.html#xpath). Thus, instead of `word["cf"] = node.xpath('string(@lemma)')` one may also write `word["cf"] = node.get("lemma")`, with the same result when the attribute is present. The code below consistently uses `Xpath` solutions because `Xpath` is implemented in a variety of languages (such as `Java`, `R`, `C`, etc.), which increases the portability of the code.
[ETCSL]: http://etcsl.orinst.ox.ac.uk
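
The expressions discussed in this section can be tried out on a small fragment. The following is a sketch only: the two-word line is built from the sample words in section 1.1, and the variable names are assumptions. Note that the paths are evaluated relative to the `<l>` element, so the step `l/` is dropped.

```python
from lxml import etree

line = etree.fromstring(
    '<l n="1">'
    '<w form="ti-a" lemma="te" pos="V" label="to approach">ti-a</w>'
    '<w form="e2-jar8-bi" lemma="e2-jar8" pos="N" label="wall">e2-jar8-bi</w>'
    '</l>'
)

# Path: all <w> children of the context node (here the <l> element itself).
w_nodes = line.xpath("w")

# Attribute step: returns a list of attribute values.
lemmas = line.xpath("w/@lemma")

# Predicate: only <w> nodes whose lemma equals "e2-jar8".
walls = line.xpath('w[@lemma = "e2-jar8"]')

# string() returns a single string; .get() is the non-XPath equivalent.
first = w_nodes[0]
via_xpath = first.xpath("string(@lemma)")
via_get = first.get("lemma")

# The two differ only when the attribute is absent.
missing_xpath = first.xpath("string(@emesal)")   # empty string
missing_get = first.get("emesal")                # None
```

The last two lines show the one practical difference between the two styles: `string()` yields an empty string for a missing attribute, whereas `.get()` yields `None`.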

### 1.3 Input and Output

This scraper expects the following files and directories:

1. Directory `Input`  
   `etcsl.txt`:  a list of [ETCSL] text numbers.  
2. Directory `etcsl/transliterations/`  
   This directory should contain the [ETCSL] `TEI XML` transliteration files.  
3. Directory `Equivalencies`  
   `equivalencies.json`: a set of equivalency dictionaries used at various places in the parser.  

The output is saved in the `Output` directory as a single `.csv` file. Each record in this file represents a single word.
[ETCSL]: http://etcsl.orinst.ox.ac.uk

## 2. Setting Up
### 2.1 Load Libraries
First import the proper packages: 

- re: Regular Expressions
- os: enable Python to perform basic Operating System functions (such as making a directory)
- etree (from lxml): read and analyze an XML file as an ordered tree
- json: read file in `JSON` format (the equivalencies file)
- pandas: transform data into a Dataframe (a table)
- tqdm: creates a progress bar

If you installed Python 3 and Jupyter by installing the [Anaconda Navigator](https://www.continuum.io/downloads), then most of these packages should already be installed, with the exception of `tqdm`. The first line in the cell below installs `tqdm`. It needs to be run just once: remove the leading `#` for the first run, then put it back to comment the line out again.

In [1]:
#! pip install tqdm
import re
from lxml import etree
import os
import json
import pandas as pd
import tqdm

### 2.2 Load Equivalencies 
The file `equivalencies.json` contains a number of dictionaries that will be used to search and replace at various places in this notebook. The dictionaries are:
- `ascii_unicode`: ASCII representations of special characters and their Unicode equivalents
- `index_no`: regular digits and their (Unicode) subscript equivalents (including `"x" : "ₓ"`)
- `suxwords`: Sumerian words (Citation Form, GuideWord, and Part of Speech) in [ETCSL] format and their [ORACC] counterparts.
- `emesalwords`: idem for Emesal words
- `propernouns`: idem for proper nouns
- `ampersands`: HTML entities (such as `&aacute;`) and their Unicode counterparts (`á`; see section 3).
- `versions`: [ETCSL] version names and (abbreviated) equivalents

The `equivalencies.json` file is loaded with the `json` library. The entries of `suxwords`, `emesalwords` and `propernouns` (which, together, contain the entire [ETCSL] vocabulary) are concatenated into a single list, `equiv`.
[ETCSL]: http://etcsl.orinst.ox.ac.uk
[ORACC]: http://oracc.org

In [2]:
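# The directory name follows section 1.3 (`Equivalencies`); adjust it if your copy differs.
with open("Equivalencies/equivalencies.json", encoding="utf-8") as f:
    eq = json.load(f)
equiv = eq["suxwords"]                # list of ETCSL/ORACC equivalence entries
equiv.extend(eq["emesalwords"])
equiv.extend(eq["propernouns"])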

## 3. Preprocessing: HTML-entities
The [ETCSL] `TEI XML` files are written in ASCII and represent special characters (such as š or ī) by a sequence of characters that begins with & and ends with ; (e.g. `&c;` represents `š`). These so-called HTML entities are used in translation, Akkadian glosses, bibliography, and headers, but not in the transliteration of the Sumerian text itself (see below). The entities are, for the most part, project-specific and are declared in the file `etcsl-sux.ent`, which is part of the file package and is used by the [ETCSL] project in the process of validating and parsing the `XML` for online publication. The `lxml` library cannot deal with these entities and thus we have to replace them with the actual (Unicode) character that they represent before feeding the data to `etree` (the part of `lxml` that we will use).

All the entities are listed with their corresponding Unicode character (or expression) in the dictionary `ampersands`, which was loaded above (section 2.2):
```python
    {"&aacute;" : "á",
    "&aleph;" : "ʾ",
    "&amacr;" : "ā",
    "&ance;" : "{anše}",
    etc.
    }
```
The function `ampersands()` uses this dictionary for a search-replace action.

The function `ampersands()` is called in `parsetext()` (see section 11) before the `etree` is built. Note that the `.xml` files themselves are not changed by this process (nor by any other process in this Notebook).
[ETCSL]: http://etcsl.orinst.ox.ac.uk

In [3]:
def ampersands(x):
    for amp in eq["ampersands"]:
        x = x.replace(amp, eq["ampersands"][amp])
    return x
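
The effect of this replacement can be seen on a single string. This is a sketch under stated assumptions: `sample_entities` is a three-entry stand-in for `eq["ampersands"]`, the function name `replace_entities()` is invented for the example, and the input string is made up.

```python
from lxml import etree

# A hypothetical three-entry stand-in for eq["ampersands"].
sample_entities = {"&aacute;": "á", "&amacr;": "ā", "&c;": "š"}

def replace_entities(text, table):
    """Replace each entity in `table` by its Unicode counterpart."""
    for entity, char in table.items():
        text = text.replace(entity, char)
    return text

raw = "<p>G&aacute;bor; &amacr;; &c;</p>"
clean = replace_entities(raw, sample_entities)

# Once the project-specific entities are gone, etree can build the tree.
tree = etree.fromstring(clean)
text = tree.text
```

The replacement works on the raw text, before any `XML` parsing takes place, which is why it has to be the first step in `parsetext()`.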

## 4. Marking 'Secondary Text' and/or 'Additional Text'

The [ETCSL] web pages include variants, indicated as '(1 ms. has instead: )', with the variant text enclosed in curly brackets. Two types of variants are distinguished: 'additional text' and 'secondary text'. 'Additional text' refers to a line that appears in a minority of sources (often in only one). 'Secondary text' refers to variant words or variant lines that are found in a minority of sources. The function `mark_extra()` marks the words of 'secondary text' and/or 'additional text' by adding the attribute `status` with the value "additional" or "secondary". 

In [ETCSL] `TEI XML` secondary/additional text is introduced by a tag of the type:
```xml
<addSpan to="c141.v11" type="secondary"/>
```
or
```xml
<addSpan to="c141.v11" type="additional"/>
```

The `to` attribute "c141.v11" represents the text number in [ETCSL] (in this case Inana's Descent, text c.1.4.1) and an identifier for the passage in question ("v11"). The return to the primary text is indicated by a tag of the type:
```xml
<anchor id="c141.v11"/>
```
Note that the `id` attribute in the `anchor` tag is identical to the `to` attribute in the `addSpan` tag.

`Xpath` can identify the word (`<w>`) tags between these `<addSpan>` and `<anchor>` tags with the following expression:
```python
extra = tree.xpath('//w[preceding::addSpan[@type="secondary"]/@to = following::anchor/@id]')
```
meaning: select `w` tags anywhere in the document that are preceded by an `addSpan` tag whose `type` attribute is "secondary" and whose `to` attribute equals the `id` attribute of a following `anchor` tag.

After identifying the "secondary" (or "additional") text we can then add a `status = 'secondary'` (or `status = 'additional'`) attribute to each `w` tag in the selection:

```python
for word in extra:
    word.attrib["status"] = "secondary"
```

The function `mark_extra()` is called twice by the function `parsetext()` (see below, section 11), once for "additional" and once for "secondary" text, indicated by the `which` argument. 

[ETCSL]: http://etcsl.orinst.ox.ac.uk

In [4]:
def mark_extra(tree, which):
    extra = tree.xpath('//w[preceding::addSpan[@type="' + which + '"]/@to = following::anchor/@id]')
    for word in extra:
        word.attrib["status"] = which
    return tree
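
To see which words the expression picks out, the function can be run on a small hand-made line. This is only a sketch: the fragment and the identifier `c000.v1` are invented, and `mark_extra()` is repeated here so that the example runs on its own.

```python
from lxml import etree

def mark_extra(tree, which):
    # Same logic as the cell above, copied so this example is self-contained.
    extra = tree.xpath('//w[preceding::addSpan[@type="' + which + '"]/@to = following::anchor/@id]')
    for word in extra:
        word.attrib["status"] = which
    return tree

# Hypothetical line: only the word "b" stands between <addSpan> and <anchor>.
sample = etree.fromstring(
    '<body><l n="1">'
    '<w form="a">a</w>'
    '<addSpan to="c000.v1" type="secondary"/>'
    '<w form="b">b</w>'
    '<anchor id="c000.v1"/>'
    '<w form="c">c</w>'
    '</l></body>'
)

sample = mark_extra(sample, "secondary")
statuses = [w.get("status") for w in sample.iter("w")]
print(statuses)
```

The word before the `addSpan` has no preceding `addSpan`, and the word after the `anchor` has no following `anchor`, so neither satisfies the predicate; only the word in between receives the `status` attribute.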

## 5. Transliteration Conventions

Transliteration of Sumerian text in [ETCSL] `TEI XML` files uses **c** for **š**, **j** for **ŋ** and regular numbers for index numbers. The function `tounicode()` replaces each of those. For example **cag4** is replaced by **šag₄**. This function is called in the function `getword()` to format `Citation Forms` and `Forms` (transliteration). The function `tounicode()` uses the dictionaries `ascii_unicode` and `index_no` which are stored in the `equivalencies.json` file.

The replacement of sign index numbers is complicated by the fact that `Citation Forms` and `Forms` may include real numbers, as in **7-ta-am3** where the **7** should remain unchanged, while **am3** should become **am₃**. The replacement routine for numbers, therefore, uses a "look-behind" [regular expression](http://www.regular-expressions.info/) to check what character is found before the digit to be replaced. If this is a letter (a-z or A-Z) or a Unicode index number (₀-₉) the digit is replaced by its Unicode subscript counterpart. Otherwise it is left unchanged.
[ETCSL]: http://etcsl.orinst.ox.ac.uk

In [5]:
def tounicode(x):
    for no in eq["index_no"]:
        x = re.sub(r'(?<=[a-zA-Z₀-₉])'+no, eq["index_no"][no], x)
    for char in eq["ascii_unicode"]:
        x = x.replace(char, eq["ascii_unicode"][char])
    return x
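
The look-behind logic can be checked on the examples mentioned above. The following is a sketch, assuming small stand-ins for `eq["index_no"]` and `eq["ascii_unicode"]` (digits 0-9 and the two letters **c** and **j** only); the contents of the real dictionaries may differ.

```python
import re

# Hypothetical stand-ins for eq["index_no"] and eq["ascii_unicode"].
sample_index_no = {str(i): chr(0x2080 + i) for i in range(10)}   # "0" -> "₀" ... "9" -> "₉"
sample_ascii_unicode = {"c": "š", "j": "ŋ"}

def to_unicode(text):
    # A digit is replaced only if it follows a letter or a subscript digit.
    for digit, subscript in sample_index_no.items():
        text = re.sub(r"(?<=[a-zA-Z₀-₉])" + digit, subscript, text)
    for char, replacement in sample_ascii_unicode.items():
        text = text.replace(char, replacement)
    return text

for form in ["cag4", "7-ta-am3", "e2-jar8-bi"]:
    print(form, "->", to_unicode(form))
```

One detail worth checking against the real data: `re.sub` evaluates the look-behind against the original string, so in a doubled digit such as `du11` the second `1` still sees an ASCII `1` before it and is left unconverted by this sketch.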

## 6. Replace [ETCSL] by [ORACC] Lemmatization
For every word, once `cf` (Citation Form), `gw` (Guide Word), and `pos` (Part of Speech) have been pulled out of the [ETCSL] `XML` file, it is run through the etcsl/oracc equivalence lists to match it with the [ORACC]/[ePSD2](http://oracc.org/epsd2/sux) standards.

# TODO
Discuss cases where [ETCSL] lemma is replaced by two [ORACC] lemmas.
[ETCSL]: http://etcsl.orinst.ox.ac.uk
[ORACC]: http://oracc.org

In [6]:
def etcsl_to_oracc(word):
    lemma = {key:word[key] for key in ['cf', 'gw', 'pos']}
    for entry in equiv:
        if lemma == entry["etcsl"]:
            word['cf'] = entry["oracc"]["cf"]
            word["gw"] = entry["oracc"]["gw"]
            word["pos"] = entry["oracc"]["pos"]
            if "oracc2" in entry:
                word["cf2"] = entry["oracc2"]["cf"]
                word["gw2"] = entry["oracc2"]["gw"]
                word["pos2"] = entry["oracc2"]["pos"]
    return word
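
The shape of the equivalence entries can be inferred from the function above: each entry has an `etcsl` dictionary, an `oracc` dictionary and, optionally, an `oracc2` dictionary. The sketch below uses that shape with invented placeholder values (they are not taken from `equivalencies.json`) and, as an alternative to the linear scan, indexes the entries by their `(cf, gw, pos)` triple. The names `sample_equiv`, `index` and `to_oracc()` are assumptions for this example.

```python
# Hypothetical entries in the shape that etcsl_to_oracc() expects; the values
# are invented placeholders, not taken from equivalencies.json.
sample_equiv = [
    {"etcsl": {"cf": "aaa", "gw": "one", "pos": "N"},
     "oracc": {"cf": "aa", "gw": "one", "pos": "N"}},
    {"etcsl": {"cf": "bbb-ccc", "gw": "two-words", "pos": "N"},
     "oracc": {"cf": "bbb", "gw": "first", "pos": "N"},
     "oracc2": {"cf": "ccc", "gw": "second", "pos": "V/i"}},
]

# Index by the (cf, gw, pos) triple so each lookup is a single dict access.
index = {(e["etcsl"]["cf"], e["etcsl"]["gw"], e["etcsl"]["pos"]): e
         for e in sample_equiv}

def to_oracc(word, index):
    """Return a copy of `word` with ORACC values where an equivalence exists."""
    entry = index.get((word["cf"], word["gw"], word["pos"]))
    if entry is None:
        return dict(word)                      # no equivalence: leave unchanged
    new = dict(word, **entry["oracc"])
    if "oracc2" in entry:                      # one ETCSL lemma, two ORACC lemmas
        new.update({key + "2": value for key, value in entry["oracc2"].items()})
    return new

double = to_oracc({"cf": "bbb-ccc", "gw": "two-words", "pos": "N", "form": "x"}, index)
unknown = to_oracc({"cf": "zzz", "gw": "none", "pos": "N", "form": "y"}, index)
```

With a vocabulary of many thousands of entries, a dictionary lookup of this kind may be noticeably faster than comparing every word against every entry, at the cost of building the index once.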

## 7. Formatting Words

A word in the [ETCSL] files is represented by a `<w>` node in the `XML` tree with a number of attributes that identify the `form` (transliteration), `citation form`, `guide word`, `part of speech`, etc. The function `getword()` formats the word as closely as possible to the [ORACC] conventions. Three different types of words are treated in three different ways: Proper Nouns, Sumerian words and Emesal words.

In [ETCSL] **proper nouns** are nouns (`pos` = "N"), which are qualified by an additional attribute `type` (Divine Name, Personal Name, Geographical Name, etc.; abbreviated as DN, PN, GN, etc.). In [ORACC] a word has a single `pos`; for proper nouns this is DN, PN, GN, etc. - so what is `type` in [ETCSL] becomes `pos` in [ORACC]. [ORACC] proper nouns usually do not have a guide word (only a number to enable disambiguation of namesakes). The [ETCSL] guide words (`label`) for names come pretty close to ORACC citation forms. Proper nouns are therefore formatted differently from other nouns.

**Sumerian words** are treated in basically the same way in [ETCSL] and [ORACC], but the `citation forms` and `guide words` are often different. Transformation of citation forms and guide words to [ORACC]/[epsd2] standards takes place in the function `etcsl_to_oracc()` (see above, section 6).

**Emesal words** in [ETCSL] use their Sumerian equivalents as `citation form` (attribute `lemma`), adding a separate attribute (`emesal`) for the Emesal form proper. This Emesal form is the one that is used as `citation form` in the output.

Guide words need removal of commas and spaces. Removal of commas will allow the output files to be read as Comma Separated Value (`csv`) files, which is an efficient input format for processes in Python and R. In the output file commas separate different fields from each other (`id_text`, `text_name`, `line_no`, etc.). Spaces need to be removed because standard tokenizers in computational text analysis will understand spaces as word dividers. 
[ETCSL]: http://etcsl.orinst.ox.ac.uk
[ORACC]: http://oracc.org
[epsd2]: http://oracc.org/epsd2/sux

In [7]:
def getword(node):
    word = {key:meta_d[key] for key in meta_d} # store all meta data in meta_d in the word dictionary
    if node.tag == 'gloss': # these are Akkadian glosses which are not lemmatized
        form = node.xpath('string(.)')
        form = form.replace('\n', ' ').strip() # occasionally an Akkadian gloss may consist of multiple lines
        word["form"] = tounicode(form) # check - is this needed?
        word["lang"] = node.xpath("string(@lang)")
        return word
    
    word["cf"] = node.xpath('string(@lemma)') # xpath('@lemma') returns a list; the string()
    word["cf"] = word["cf"].replace('Xbr', '(X)')  # function turns it into a single string
    word["cf"] = tounicode(word["cf"])
    word["gw"] = node.xpath('string(@label)')

    if len(node.xpath('@pos')) > 0:
        word["pos"] = node.xpath('string(@pos)')
    else:
        word["pos"] = 'NA'
        word["gw"] = 'NA'

    form = node.xpath('string(@form)')
    word["form"] = form.replace('Xbr', '(X)')
    word["form"] = tounicode(word["form"])
    
    if len(node.xpath('@emesal')) > 0:
        word["cf"] = node.xpath('string(@emesal)')
        word["lang"] = "sux-x-emesal"
    else:
        word["lang"] = "sux"
        
    if len(node.xpath('@type')) > 0 and word["pos"] == 'N':
        if node.xpath('string(@type)') != 'ideophone':
            word["pos"] = node.xpath('string(@type)')
            word["cf"] = node.xpath('string(@label)')
            word["gw"] = '1'
    if len(node.xpath('@status')) > 0:
        word['status'] = node.xpath('string(@status)')

    word["gw"] = word["gw"].replace(",", ";") #remove commas from guide words (replace by semicolon) to prevent
                                            #problems with processing of the csv format
    word["gw"] = word["gw"].replace(" ", "-") #remove spaces from guide words (replace by hyphen). Spaces
                                            #create problems with tokenizers in computational text analysis.
    word = etcsl_to_oracc(word)   
    return word
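
The three cases described above (Sumerian word, Emesal word, proper noun) can be illustrated with a simplified stand-alone version of the function. This is a sketch only: `format_word()` is a hypothetical name, it leaves out the meta data, the Unicode conversion and the equivalence lookup, and the attribute values of the second and third sample elements are invented for illustration.

```python
from lxml import etree

def format_word(node):
    """Simplified sketch of the three cases: Sumerian, Emesal, proper noun."""
    word = {
        "cf": node.get("lemma", ""),
        "gw": node.get("label", ""),
        "pos": node.get("pos", "NA"),
        "form": node.get("form", ""),
        "lang": "sux",
    }
    if node.get("pos") is None:                # unlemmatized word
        word["gw"] = "NA"
    if node.get("emesal") is not None:         # Emesal: use the Emesal form as cf
        word["cf"] = node.get("emesal")
        word["lang"] = "sux-x-emesal"
    if word["pos"] == "N" and node.get("type") not in (None, "ideophone"):
        word["pos"] = node.get("type")         # e.g. DN, PN, GN
        word["cf"] = node.get("label", "")     # the label serves as citation form
        word["gw"] = "1"
    # Commas would break the CSV; spaces would split the guide word in tokenizers.
    word["gw"] = word["gw"].replace(",", ";").replace(" ", "-")
    return word

verb = format_word(etree.fromstring(
    '<w form="ti-a" lemma="te" pos="V" label="to approach">ti-a</w>'))
name = format_word(etree.fromstring(
    '<w form="en-ki" lemma="en-ki" pos="N" type="DN" label="Enki">en-ki</w>'))
emesal = format_word(etree.fromstring(
    '<w form="mu-lu" lemma="lu2" pos="N" label="person" emesal="mu-lu">mu-lu</w>'))
```

In this sketch the verb keeps its `pos` and gets the guide word `to-approach`, the divine name takes its `type` as `pos` and its `label` as citation form, and the Emesal word is entered under its Emesal form.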

## 8. Formatting Lines

Each line consists of a series of words. The function `getline()` iterates over a line, taking one word at a time. The words and their various attributes (language, citation form, guideword, part of speech and form) are retrieved by calling the function `getword()`, which returns a dictionary. This dictionary is added to the list `wordsinline`, which `getline()` returns.

The function `getword()` will supply the value 'NA' for both Part of Speech and guide word to each word that has no POS tag already.

# TODO
rewrite the commentary; comment on `<gap>` and `<l>` tags (in input) and `<w>` and `<gloss>` tags. Comment on cases where [ORACC] lemmatization has two words for one in [ETCSL].
[ETCSL]: http://etcsl.orinst.ox.ac.uk
[ORACC]: http://oracc.org

In [8]:
def getline(lnode):
    meta_d["line_ref"] += 1
    if lnode.tag == 'gap':
        line = {key:meta_d[key] for key in ["id_text", "text_name", "version", "line_ref"]}
        line["extent"] = lnode.xpath("string(@extent)")
        wordsinline = [line]
        return wordsinline
    wordsinline = [] #initialize list for the words in this line
    for node in lnode.xpath('.//w|.//gloss[@lang="akk"]'):
                        # get <w> nodes and <gloss> nodes, but only Akkadian glosses
        word = getword(node)
        if "cf2" in word:
            word2 = {key:word[key] for key in ["id_text", "text_name","version", "line_ref", "line_no",
                                               "form", "lang"]}
            word2["cf"] = word["cf2"]
            word2["gw"] = word["gw2"]
            word2["pos"] = word["pos2"]            
            word1 = {key:word[key] for key in ["id_text", "text_name","version", "line_ref", "line_no",
                                               "form", "lang", "cf", "gw", "pos"]}
            wordsinline.extend([word1, word2])
        else:
            wordsinline.append(word)
    return wordsinline
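
The handling of a word that corresponds to two [ORACC] lemmas can be isolated in a small helper. The following is a sketch with a hypothetical function name and invented placeholder values; unlike the code above, which copies a fixed list of keys, it carries over every key that is not part of the lemma itself.

```python
def split_double(word):
    """Turn one word dictionary with cf2/gw2/pos2 into two row dictionaries."""
    if "cf2" not in word:
        return [word]
    lemma_keys = ("cf", "gw", "pos", "cf2", "gw2", "pos2")
    shared = {key: value for key, value in word.items() if key not in lemma_keys}
    first = dict(shared, cf=word["cf"], gw=word["gw"], pos=word["pos"])
    second = dict(shared, cf=word["cf2"], gw=word["gw2"], pos=word["pos2"])
    return [first, second]

# Placeholder values, invented for illustration.
rows = split_double({"form": "x-y", "lang": "sux", "line_no": "1",
                     "cf": "bbb", "gw": "first", "pos": "N",
                     "cf2": "ccc", "gw2": "second", "pos2": "V/i"})
```

Both resulting rows share the same `form`, so the transliteration appears twice in the output, once for each lemma.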

## 9. Sections

Some [ETCSL] compositions are divided into **sections**. That is the case, in particular, when a composition has gaps of unknown length. 

The function `getsection()` is called by `getversion()` and receives one argument: `tree` (the `etree` object, or the part of it that represents one version). The meta data are kept in the global dictionary `meta_d`, which the function updates with the current line number. The function `getsection()` checks to see whether a sub-division into sections is present. If so, it iterates over these sections. Each section (or, if there are no sections, the composition/version as a whole) consists of a series of lines. The function `getline()` is called to request the contents of each line. The function `getsection()` returns the variable `linesinsection`, which contains the formatted data.
[ETCSL]: http://etcsl.orinst.ox.ac.uk

In [9]:
def getsection(tree):
    linesinsection = []
    sections = tree.xpath('.//div1')
    if len(sections) > 0: # if the text is not divided into sections - skip to else:
        for snode in sections:
            section = snode.xpath('string(@n)')
            for lnode in snode.xpath('.//l|.//gap'):
                if lnode.tag == 'l':
                    line = section + lnode.xpath('string(@n)')
                    meta_d["line_no"] = line
                line = getline(lnode)
                linesinsection.extend(line)
    else:
        for lnode in tree.xpath('.//l|.//gap'):
            if lnode.tag == 'l':
                line_no = lnode.xpath('string(@n)')
                meta_d["line_no"] = line_no
            line = getline(lnode)
            linesinsection.extend(line)
    return linesinsection
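
How section labels and line numbers combine can be shown on a small fragment. This is a sketch: the two sections, the gap and its `extent` value are invented, and the loop only collects labels instead of calling `getline()`.

```python
from lxml import etree

# Hypothetical fragment: two sections (div1), the second starting with a gap.
sample = etree.fromstring(
    '<body>'
    '<div1 n="A"><l n="1"><w form="a">a</w></l><l n="2"><w form="b">b</w></l></div1>'
    '<div1 n="B"><gap extent="3 lines missing"/><l n="1"><w form="c">c</w></l></div1>'
    '</body>'
)

labels = []
for snode in sample.xpath(".//div1"):
    section = snode.xpath("string(@n)")
    for lnode in snode.xpath(".//l|.//gap"):      # union, returned in document order
        if lnode.tag == "l":
            labels.append(section + lnode.xpath("string(@n)"))
        else:
            labels.append("gap: " + lnode.xpath("string(@extent)"))
print(labels)
```

The union `.//l|.//gap` returns lines and gaps in the order in which they occur in the file, so a gap ends up at its proper place between the lines.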

## 10. Versions

In some cases an [ETCSL] file contains different versions of the same composition. The versions may be distinguished as 'Version A' vs. 'Version B' or may indicate the provenance of the version ('A version from Urim' vs. 'A version from Nibru'). In the edition of the proverbs the same mechanism is used to distinguish between numerous tablets (often lentils) that contain just one proverb, or a few, and are collected in the files "Proverbs from Susa," "Proverbs from Nibru," etc. ([ETCSL] c.6.2.1 - c.6.2.5).

The function `getversion()` is called by the function `parsetext()` and receives one argument: `tree` (the `etree` object). The meta data are kept in the global dictionary `meta_d`. The function checks to see if versions are available in the file that is being parsed. If so, the function iterates over these versions while adding the version name to the `meta_d` dictionary. If there are no versions, the version name is left empty. The parsing process is continued by calling `getsection()` to see if the composition/version is further divided into sections.
[ETCSL]: http://etcsl.orinst.ox.ac.uk

In [10]:
def getversion(tree):
    sectionsinversion = []
    versions = tree.xpath('.//body[child::head]')
    if len(versions) > 0: # if the text is not divided into versions - skip to else:
        for vnode in versions:
            version = vnode.xpath('string(head)')
            version = eq["versions"][version]
            meta_d["version"] = version
            section = getsection(vnode)
            sectionsinversion.extend(section)
    else:
        meta_d["version"] = ''
        section = getsection(tree)
        sectionsinversion.extend(section)
    return sectionsinversion
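
The `Xpath` expression that detects versions can be tried on a small fragment. This is a sketch under assumptions: the nesting of `<body>` elements is a guess at the file layout made for illustration, `sample_versions` stands in for `eq["versions"]`, and the lookup falls back to the full name, whereas the code above raises a `KeyError` for an unknown version name.

```python
from lxml import etree

# Hypothetical layout: an outer <body> holding one <body> with <head> per version.
sample = etree.fromstring(
    '<text><body>'
    '<body><head>Version A</head><l n="1"><w form="a">a</w></l></body>'
    '<body><head>Version B</head><l n="1"><w form="b">b</w></l></body>'
    '</body></text>'
)
sample_versions = {"Version A": "A", "Version B": "B"}   # stand-in for eq["versions"]

found = []
for vnode in sample.xpath(".//body[child::head]"):       # only <body> nodes with a <head>
    name = vnode.xpath("string(head)")
    short = sample_versions.get(name, name)
    n_lines = len(vnode.xpath(".//l"))
    found.append((short, n_lines))
print(found)
```

The outer `<body>` has no `<head>` child and is therefore not selected; each inner `<body>` is processed as one version.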

## 11. Parse a Text

The function `parsetext()` takes one xml file (a composition in [ETCSL]) and parses it, calling a variety of functions defined above. The function returns the list `parsed`. It contains a lemma-by-lemma representation of the text with version label (where applicable), line numbers (including section labels, where applicable) and all the lemmatized words.

The parsing is done by the `etree` package in the `lxml` library. Before the file can be parsed properly the so-called HTML entities need to be replaced by their Unicode equivalents. This is done by calling the `ampersands()` function (see above, section 3: Preprocessing).

[ETCSL]: http://etcsl.orinst.ox.ac.uk

In [14]:
def parsetext(textid):
    with open('etcsl/transliterations/' + textid + '.xml') as f:
        xmltext = f.read()
    xmltext = ampersands(xmltext)          #replace HTML entities by Unicode equivalents
    
    tree = etree.fromstring(xmltext)
    
    tree = mark_extra(tree, "additional") # mark additional words with attribute status = 'additional'
    tree = mark_extra(tree, "secondary")  # mark secondary words with attribute status = 'secondary'
    name = tree.xpath('string(//title)')
    name = name.replace(' -- a composite transliteration', '')
    name = name.replace(',', '')
    meta_d["id_text"] =  textid
    meta_d["text_name"] = name
    meta_d["line_ref"] = 0
    parsed = getversion(tree)

    return parsed

## 12. Main Process

The code below opens a file `etcsl.txt` (in the directory `Input`) which contains all the numbers of [ETCSL] compositions (such as c.1.1.4). For each such number the corresponding `XML` file is opened and the content of the file is sent to the function `parsetext()`. `parsetext()` returns the variable `parsed`, which is a list of dictionaries, each dictionary representing a single word. The list of dictionaries `parsed` is added to the list `alltexts`. In the end, `alltexts` will be a list of dictionaries that represent all the words in [ETCSL]. The list is transformed into a Pandas DataFrame. All missing values (`NaN`) are replaced by empty strings. The DataFrame is saved as a `CSV` file named `alltexts.csv` in the directory `Output`.

# TODO
discuss creation of the `meta_d` dictionary, what it contains and what it is good for.
[ETCSL]: http://etcsl.orinst.ox.ac.uk

In [None]:
with open("Input/etcsl.txt", "r") as f:
    textlist = f.read().splitlines()
if not os.path.exists('Output'):
    os.mkdir('Output')

alltexts = []
for eachtextid in tqdm.tqdm(textlist):
    meta_d = {}
    parsed = parsetext(eachtextid)
    alltexts.extend(parsed)

df = pd.DataFrame(alltexts)
df = df.fillna('')
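# Write to the same 'Output' directory created above (the name is case-sensitive on
# many systems); UTF-8 keeps the special characters intact.
with open('Output/alltexts.csv', 'w', encoding='utf-8', newline='') as w:
    df.to_csv(w)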


In [13]:
df

Unnamed: 0,cf,extent,form,gw,id_text,lang,line_no,line_ref,pos,status,text_name,version
0,dubsaŋ,,dub-saŋ-ta,first,c.0.1.1,sux,1,1,AJ,,Ur III catalogue from Nibru (N1),
1,Enki,,{d}en-ki,1,c.0.1.1,sux,2,2,DN,,Ur III catalogue from Nibru (N1),
2,unu,,unu₂,dwelling,c.0.1.1,sux,2,2,N,,Ur III catalogue from Nibru (N1),
3,gal,,gal,big,c.0.1.1,sux,2,2,V/i,,Ur III catalogue from Nibru (N1),
4,ed,,im-ed₃,ascend,c.0.1.1,sux,2,2,V/i,,Ur III catalogue from Nibru (N1),
5,anzag,,an-zag-še₃,horizon,c.0.1.1,sux,3,3,N,,Ur III catalogue from Nibru (N1),
6,anŋi,,an-ŋi₆,eclipse,c.0.1.1,sux,4,4,N,,Ur III catalogue from Nibru (N1),
7,zu,,zu,know,c.0.1.1,sux,4,4,V/t,,Ur III catalogue from Nibru (N1),
8,ama,,ama,mother,c.0.1.1,sux,4,4,N,,Ur III catalogue from Nibru (N1),
9,tu,,tu₆,incantation,c.0.1.1,sux,4,4,N,,Ur III catalogue from Nibru (N1),
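
Grouping the word-by-word output into lines, as mentioned in section 1.1, takes only a few `pandas` commands. The following is a sketch: the six rows are copied from the output above (only the columns used here), and the `cf[gw]pos` string is a convention assumed for this example.

```python
import pandas as pd

# Six rows copied from the output above (only the columns used here).
rows = [
    {"id_text": "c.0.1.1", "line_ref": 1, "cf": "dubsaŋ", "gw": "first", "pos": "AJ"},
    {"id_text": "c.0.1.1", "line_ref": 2, "cf": "Enki", "gw": "1", "pos": "DN"},
    {"id_text": "c.0.1.1", "line_ref": 2, "cf": "unu", "gw": "dwelling", "pos": "N"},
    {"id_text": "c.0.1.1", "line_ref": 2, "cf": "gal", "gw": "big", "pos": "V/i"},
    {"id_text": "c.0.1.1", "line_ref": 2, "cf": "ed", "gw": "ascend", "pos": "V/i"},
    {"id_text": "c.0.1.1", "line_ref": 3, "cf": "anzag", "gw": "horizon", "pos": "N"},
]
words = pd.DataFrame(rows)

# Build one lemma string per word, then join the lemmas of each line.
words["lemma"] = words["cf"] + "[" + words["gw"] + "]" + words["pos"]
lines = (words.groupby(["id_text", "line_ref"], sort=False)["lemma"]
              .agg(" ".join)
              .reset_index())
print(lines)
```

Grouping on `id_text` alone instead of on `id_text` and `line_ref` would yield one string per composition in the same way.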
