# ETCSL - Oracc Harmonization

This script will harmonize the output of the ETCSL TEI XML scraper (Notebook scrape-etcsl-XML) with Oracc lemmatization standards (epsd2). The output of this script is put in the `Cleaned` directory. The files in the `Cleaned` directory are compatible with files scraped from Oracc with the Scrape Oracc Notebook, using the same lemmatization standards and the same POS tags.

The script needs the following files:

1. Directory Input:
  * etcsl.txt holding all the ETCSL text numbers
2. Directory Equivalencies (the directory from which the code below reads these files):
  * Three vocabulary files with ETCSL - ORACC equivalencies (by Niek Veldhuis and Terri Tanaka):
     * etcsl_epsd2_sux2.txt
     * etcsl_epsd2_emesal2.txt
     * etcsl_epsd2_propernouns2.txt
3. Directory Output:
  * a set of scraped ETCSL files; extension .txt.

The script assumes that the scraped etcsl files have the lemmatization format `sux:lugal[king]N`. If another format has been produced (by modifying the function `outputformat()` in the Notebook scrape-etcsl-XML) the script needs to be modified accordingly.
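For illustration, the assumed format can be taken apart with a regular expression. This is a minimal sketch and not part of the harmonization script: the pattern and the name `parse_lemma` are hypothetical, and the sketch assumes that a token consists of a language code, a colon, a citation form, a guide word in square brackets, and a POS tag.

```python
import re

# Hypothetical pattern for tokens such as 'sux:lugal[king]N':
# language code, colon, citation form, [guide word], POS tag.
LEMMA_PATTERN = re.compile(
    r'^(?P<lang>[^:]+):(?P<cf>[^\[]+)\[(?P<gw>[^\]]*)\](?P<pos>\S*)$'
)

def parse_lemma(token):
    """Return the parts of a lemma as a dict, or None if the token does not fit the assumed format."""
    match = LEMMA_PATTERN.match(token)
    return match.groupdict() if match else None

print(parse_lemma('sux:lugal[king]N'))
# {'lang': 'sux', 'cf': 'lugal', 'gw': 'king', 'pos': 'N'}
```

A check of this kind could be run on a few scraped files before harmonizing, to confirm that the output of `outputformat()` has not been changed.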

In [1]:
import os
import re
from tqdm import tqdm

Create a list of names for the ETCSL - EPSD2 equivalency files. Create an empty dictionary that will hold these equivalencies.

In [2]:
vocab_equiv_files = ['etcsl_epsd2_sux2.txt', 'etcsl_epsd2_emesal2.txt', 'etcsl_epsd2_propernouns2.txt']
equiv_dict = {}

The function `add_dict()` takes the list of lines read from one of the equivalency files. It splits each line at ` = ` and stores the first part as the key and the second part as the value of a new item in the dictionary `equiv_dict`. The function `add_dict()` is called by the function `readfile()`.

In [3]:
def add_dict(equivalencies):
    for equiv in equivalencies:
        # Each line has the form 'ETCSL form = EPSD2 form'.
        parts = equiv.split(' = ')
        equiv_dict[parts[0]] = parts[1]
    # A dict comprehension assigned to equiv_dict here would create a new local
    # dictionary on each call, so the entries from files read earlier would be lost.
    return equiv_dict
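To make the splitting step concrete, the sketch below applies the same logic to two made-up lines. The sample entries and the name `add_to_dict` are hypothetical; they only illustrate the `key = value` layout implied by the code, and the real files may differ in detail. Unlike `add_dict()`, this variant receives the target dictionary as an argument instead of relying on a global variable.

```python
def add_to_dict(equivalencies, target):
    """Add each 'key = value' line to target and return it."""
    for equiv in equivalencies:
        parts = equiv.split(' = ')
        target[parts[0]] = parts[1]
    return target

# Hypothetical sample lines; the real equivalency files are not reproduced here.
sample_lines = [
    'aaa[x]N = bbb[y]N',
    'ccc[z]V = ddd[z]V/t',
]

demo_dict = add_to_dict(sample_lines, {})
print(demo_dict)
# {'aaa[x]N': 'bbb[y]N', 'ccc[z]V': 'ddd[z]V/t'}
```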

The function `readfile()` reads one equivalency file from the `Equivalencies` directory and calls the function `add_dict()` to add its lines to the dictionary `equiv_dict`.

In [4]:
def readfile(file):
    with open('Equivalencies/' + file, mode = 'r', encoding='utf8') as f:
        equivalencies = f.read().splitlines()
    equiv_dict = add_dict(equivalencies)
    return equiv_dict

The following cell iterates over the list `vocab_equiv_files` and passes each of these file names to the function `readfile()`.

In [5]:
for file in vocab_equiv_files:
    equiv_dict = readfile(file)

The `len()` function is used here as a basic check that the equivalencies have been entered in `equiv_dict`.

In [6]:
len(equiv_dict)

4141
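The count alone does not reveal lines that lack the ` = ` separator or keys that occur more than once. A repeated key overwrites the earlier entry, so the length of the dictionary can be lower than the total number of lines. The following sketch shows one way to look for both problems; the function name `check_equivalencies` and the sample lines are hypothetical.

```python
def check_equivalencies(lines):
    """Return (malformed, duplicates) for a list of 'key = value' lines."""
    seen = set()
    malformed, duplicates = [], []
    for line in lines:
        parts = line.split(' = ')
        if len(parts) != 2:
            # No separator, or more than one: the line does not fit the expected layout.
            malformed.append(line)
            continue
        if parts[0] in seen:
            duplicates.append(parts[0])
        seen.add(parts[0])
    return malformed, duplicates

# Hypothetical lines, for illustration only.
lines = ['aaa[x]N = bbb[y]N', 'aaa[x]N = ccc[y]N', 'no separator here']
malformed, duplicates = check_equivalencies(lines)
print(malformed)   # ['no separator here']
print(duplicates)  # ['aaa[x]N']
```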

If necessary, the directory `Cleaned` is made.

In [7]:
# Create the directory if it does not exist yet.
os.makedirs('Cleaned', exist_ok=True)

The file etcsl.txt in the directory `Input` is opened to retrieve a list of all ETCSL texts.

In [8]:
with open('Input/etcsl.txt', mode = 'r') as f:
    textlist = f.read().splitlines()
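Because the main process below opens one file per text number, a missing file in `Output` would stop the run with an error. As a precaution, the list can be compared with the directory contents first. This is a sketch only; the function name `missing_texts` and the sample names are hypothetical.

```python
import os
import tempfile

def missing_texts(textlist, directory='Output'):
    """Return the text numbers that have no corresponding .txt file in directory."""
    return [t for t in textlist
            if not os.path.isfile(os.path.join(directory, t + '.txt'))]

# Demonstration in a temporary directory with made-up names.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, 'text_a.txt'), 'w').close()
    print(missing_texts(['text_a', 'text_b'], directory=tmp))  # ['text_b']
```

In the notebook this could be called as `missing_texts(textlist)` before starting the main loop.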

The main process iterates over the list of ETCSL texts, opens the corresponding file in the `Output` folder (this is output from the scrape-etcsl-XML Notebook) and uses `equiv_dict` to replace ETCSL forms (the keys) with EPSD2 forms (the values). The regular expression in the search/replace function uses a lookbehind that only accepts entries preceded by a colon, space, or comma, so that a match cannot begin in the middle of a longer form. Each key is passed through `re.escape()` so that characters such as the square brackets are matched literally. Finally, the new version of the text is saved in the `Cleaned` directory.

In [9]:
for textno in tqdm(textlist):
    # Read the scraped text.
    with open('Output/' + textno + '.txt', mode = 'r', encoding='utf8') as g:
        text = g.read()
    # Replace each ETCSL form that follows a space, colon, or comma with its EPSD2 form.
    for entry in equiv_dict:
        text = re.sub(r'(?<=[ :,])'+re.escape(entry), equiv_dict[entry], text)
    # Save the harmonized text.
    with open('Cleaned/' + textno + '.txt', mode = 'w', encoding='utf8') as writeFile:
        writeFile.write(text)

100%|██████████| 394/394 [24:55<00:00,  5.27s/it]
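The loop above calls `re.sub()` once per dictionary entry for every text, which likely accounts for much of the running time. One possible alternative, sketched here and not taken from the original script, compiles all keys into a single alternation and substitutes in one pass. The function name `build_replacer` and the sample entries are hypothetical. The result can differ from the loop above in two ways: a replacement is never replaced again by a later entry, and backslashes in the replacement values are kept literally.

```python
import re

def build_replacer(equiv_dict):
    """Return a function that applies all replacements in a single pass."""
    # Longest keys first, so that a longer key wins over a shorter key that is its prefix.
    keys = sorted(equiv_dict, key=len, reverse=True)
    pattern = re.compile(
        r'(?<=[ :,])(?:' + '|'.join(re.escape(k) for k in keys) + ')'
    )
    # A function is used as replacement so that the value is inserted as-is.
    return lambda text: pattern.sub(lambda m: equiv_dict[m.group(0)], text)

# Hypothetical entries, for illustration only.
demo = {'aaa[x]N': 'bbb[y]N', 'ccc[z]V': 'ddd[z]V/t'}
replace_all = build_replacer(demo)
print(replace_all('sux:aaa[x]N sux:ccc[z]V'))
# sux:bbb[y]N sux:ddd[z]V/t
```

Before adopting such a change, the output for a few texts should be compared with the output of the original loop.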
