# Dataset Generation for Egyptian-English Neural Machine Translation

# Generate the aligned corpora

We will use 3 types of files to build our dataset for semi-supervised machine translation.
- .txt files containing aligned examples from Egyptian grammars
- A series of files compiled by Mark-Jan Nederhof for his hieroglyphic display and input program PhilologEg
- A monolingual corpus taken from [a site dedicated to Egyptian transliterations](http://www.egyptomaniak.gr/Egyptian%20Texts.htm)

The first two file types are separately preprocessed to account for formatting differences, and then merged to form a small aligned corpus of around 8,000 sentences. The monolingual corpus adds an additional 50,000 sentences to improve translation accuracy.

## 1. Parse .txt Files from Grammars

In [1]:
try:
    import pandas as pd
except:
    !pip install pandas
    import pandas as pd
import string
pd.set_option('display.width', None)
pd.set_option('display.max_colwidth', None)

In [2]:
import re
txt_files = ['https://mjn.host.cs.st-andrews.ac.uk/egyptian/grammars/Allen.txt',
             'https://mjn.host.cs.st-andrews.ac.uk/egyptian/grammars/Callender.txt',
             'https://mjn.host.cs.st-andrews.ac.uk/egyptian/grammars/Gardiner.txt',
             'https://mjn.host.cs.st-andrews.ac.uk/egyptian/grammars/Loprieno.txt',
             'https://mjn.host.cs.st-andrews.ac.uk/egyptian/grammars/Ockinga.txt']

### Imports each .txt file and converts it to a series of AlignedSentences
- Split imported files into paragraphs
- Filters paragraphs by whether they contain both "transliteration" and "translation" keys
- Constructs dictionaries with only the necessary keys
- Constructs AlignedSentence objects from `{ '%tr' : "...", '%al' : "..." }` dictionaries

In [3]:
class AlignedSentence(object):
    def __init__(self, translit, translate, signs = None, from_file = None):
        self.translit = translit
        self.translate = translate 
        self.signs = signs
        self.from_file = from_file
        
    def __hash__(self):
        return hash((self.translit, self.translate))

    def __eq__(self, other):
        if not isinstance(other, type(self)): return NotImplemented
        return self.translit == other.translit and self.translate == other.translate
    
    def __str__(self):
        return "Translit: {}, Translate: {}".format(self.translit, self.translate)
    
    def str(self):
        return "Translit: {}, Translate: {}".format(self.translit, self.translate)

In [4]:
import requests
import os
from os.path import exists

def parse_text_file(file):
    file_name = "./texts/grammar-texts/" + file.split('/')[-1]
    if not exists(file_name):
        res = requests.get(file)
        with open(file_name, 'wb') as to_write:
            to_write.write(res.content)
            
    with open(file_name, 'r') as opened_file:
        all_text = opened_file.read()
        split_para = all_text.split('\n\n')
        with_tr_al = [item for item in split_para if '%al' in item and '%tr' in item]
        return [parse_paragraph(item, file_name) for item in with_tr_al]
            
def parse_paragraph(paragraph, file_name):
    split_by_header = paragraph.split('\n')
    merged = []
    for i in range(len(split_by_header)):
        if not split_by_header[i].startswith('%'):
            merged[-1] += " " + split_by_header[i]
        else:
            merged.append(split_by_header[i])
    as_dict = {item[:3].strip() : item[3:].strip() for item in merged if item.startswith('%al') or item.startswith('%tr')}
    return AlignedSentence(as_dict['%al'], as_dict['%tr'], from_file = file_name)

In [5]:
parsed = []
for tf in txt_files:
    parsed_f = parse_text_file(tf)
    parsed.extend(parsed_f)

In [6]:
for item in parsed[:10]:
    print(item.translate)
print(len(parsed))

I am one who says what is good and repeats what is loved.
His Majesty appointed me scribe of the cadaster, and His Majesty showed me great favor.
When it was read to me, I placed myself on my stomach...
[You are the rudder of the entire land,] and how Egypt sails depends on what you command.
[Eloquence is more hidden than the emerald,] although it is found with slave-girls at the mill-stones.
behold, I am before thee.
I placed myself on my belly.
overseer of the house, i.e. steward.
possessor of veneration, venerable.
the king of Egypt.
137


### Preprocess Translation Fields

- Keep the contents of <em\> and <i\> but remove other html tags' contents. Remove all tags themselves.
- Remove square brackets & their contents
- Remove parentheses (and contents where they explain an existing noun)
- Remove all "lit." and "i.e." explanation clauses
- Standardizes "favour" with "favor"
- Standardize ellipsis length to 3
- Standardize spacing by removing multiple contiguous spaces
- Add spaces before unremoved punctuation ("the cat's toy" becomes "the cat 's toy")
- Convert translation to lowercase

In [7]:
def preprocess_translation(parsed):
    with_brackets = [item for item in parsed if '<' in item.translate or '>' in item.translate]

    for item in with_brackets:
        item.translate = re.sub(r"\<em\>(.+)\<\/em\>", "\\1", item.translate)
        item.translate = re.sub(r"\<i\>(.+)\<\/i\>", "\\1", item.translate)
        item.translate = re.sub(r"\<.+\>(.+)\<\/.+\>", "", item.translate)
        
    with_lit = [item for item in parsed if 'lit' in item.translate.lower()]
    for item in with_lit:
        item2 = re.sub(r"\(\s*\)", "", item.translate)
        item3 = re.sub(r"\(lit\.(.+?)\)", "", item2)
        item4 = re.sub(r"(,|\.|\!|\?)\s*(l|L)it\..+(,|\.|\!|\?)$", "\\3", item3)
        item4 = re.sub(r"(,|\.|\!|\?)\s*(l|L)it\..+$", "", item3)
        item.translate = item4
    
    with_paren = [item for item in parsed if '(' in item.translate or ')' in item.translate]
    for item in with_paren:
        item2 = re.sub(r"(it) \(.+?\)", "\\1", item.translate)
        item3 = re.sub(r"\(=.+?\)", "", item2)
        item4 = item3.replace(')', "").replace('(', '')
        item.translate = item4

    with_square = [item for item in parsed if '[' in item.translate or ']' in item.translate]
    for item in with_square:
        item.translate = re.sub(r"\[.+?\]\s*", "", item.translate)
        
    with_ie = [item for item in parsed if 'i.e' in item.translate.lower()]
    for item in with_ie:
        item.translate = re.sub("(,|\.) (i|I)\.e\..+$", ".", item.translate)

    for item in parsed:
        item.translate = item.translate.replace('favor', 'favour')
        item.translate = ' '.join(item.translate.split())
        item.translate = re.sub(r"\.{4,}", r"...", item.translate)
        item.translate = item.translate.replace('[...]', '...').replace('.', ' .').replace(' . . .', ' [...]').replace(':', '').replace(';', '')
        item.translate = item.translate.replace(',', '').replace('???', '...').replace('?', '').replace("'", "").replace('"', "")
        item.translate = re.sub(r'([^\.\.\.\]])\]', "\\1", item.translate)
        item.translate = re.sub(r'\[([^\.\.\.\]])', "\\1", item.translate)
        item.translate = ' '.join([x for x in item.translate.split() if x != '.' and len(x)])
        item.translate = item.translate.lower()

In [8]:
preprocess_translation(parsed)

### Preprocess Transliteration Fields:
- For `<no>Replace __ with __</no>` the two words are exchanged in the transliteration and the tag is removed
- `i` is standardized to `j` to match the longer corpora
- Remove `^`, which only serves as artificial capitalization
- Remove `(?)`
- Standardize ellipse length
- Remove `*` as it represents a character unwritten elsewhere
- Standardize `:` and `=` to `.`
- Remove parentheses
- Add a space before any verb-endings

In [9]:
for item in parsed[:20]:
    print(item.translit)

jnk Dd nfrt wHm mrrt
rd(y): wi Hm.f r sXA n(y) tmA Hs(y): wi Hm.f r aAt wrt
Sd(y):n.t(w).f n.i rd(y):n.i wi Hr Xt.i
sqdd* tA xft wjD.*.k
iw gm(y)=tw.s m-a Hjmwt Hr bnwt
mk wi m-bAH.k
rdi.n.(i) wi Hr Xt.i
imy-r pr
nb imAx
nsw n ^kmt
wrw nw AbDw
mk wi r nHm aA.k, sxty, Hr wnm.f Sma.i
HD.n.i, wn hrw
pA pw ^wsir
m-xt iAw n.k-imy
nn is n sbi Hr Hm.f
n(y)-sw mH 30
ntk nbw
mk tw m niwt, nn HqA-Hwt.s
iw nA m sbAyt


In [10]:
def preprocess_translit(parsed):
    with_no_tag = [item for item in parsed if "<no>" in item.translit]
    for item in with_no_tag:
        matches = list(re.finditer(r'\<\/no\>(.+?)\<no\>', item.translit))
        replaced_miswritten = item.translit.replace(matches[0].groups()[0], matches[1].groups()[0])
        without_tags = re.sub(r'\<no\>(.+)\<\/no\>', '', replaced_miswritten)
        item.translit = without_tags.strip()
        
    with_j = [item for item in parsed if 'i' in item.translit or 'y' in item.translit]
    for item in with_j:
        item.translit = item.translit.replace('i', 'j').replace('y', 'j')
        
    with_caret = [item for item in parsed if '^' in item.translit]
    for item in with_caret:
        item.translit = item.translit.replace('^', '')
        
    with_qm = [item for item in parsed if '(?)' in item.translit]
    for item in with_qm:
        item.translit = item.translit.replace('(?)', "")
        
    with_ask = [item for item in parsed if "*" in item.translit]
    for item in with_ask:
        item.translit = item.translit.replace('.*', '').replace('*', "")
        
    with_colon = [item for item in parsed if ':' in item.translit]
    for item in with_colon:
        item.translit = item.translit.replace(': ', ' ').replace(':', '.')
        
    with_eq = [item for item in parsed if '=' in item.translit]
    for item in with_eq:
        item.translit = item.translit.replace('=', '.')
        
    with_hyphen = [item for item in parsed if '-' in item.translit]
    for item in with_hyphen:
        item.translit = item.translit.replace('-', ' ')

    for item in parsed:
        item.translit = re.sub(r"\.{4,}", r"...", item.translit)
        
    for item in parsed:
        item.translit = item.translit.replace('.', ' .').replace(' . . .', '[...]')
        item.translit = item.translit.replace('(', '').replace(')', '').replace(',', ' ,')
        item.translit = re.sub(r'([^\.\.\.\]])\]', '\\1', item.translit)
        item.translit = re.sub(r'\[([^\.\.\.\]])', '\\1', item.translit)

In [11]:
preprocess_translit(parsed)

## 2. Import Nederhof files

- Eliminates prepending header
- Splits files into paragraph chunks, each of which represents a sentence, in the following format:

    [Sign Data]<br/> 
    ,<br/> 
    [Translit Data]<br/> 
    ;<br/> 
    [Translation data]<br/> 

- Parses the given data into AlignedSentence objects

In [12]:
import re
def parse_nederhof_file(file_name, file_path):
    # Open file
    with open(file_path, 'r') as current_file:
        file_text = current_file.read()
        
    # Remove file header
    section_break = "###"
    matches = list(re.finditer(section_break, file_text))
    assert len(matches) == 2
    without_header = file_text[matches[1].span()[1] + 1:].strip()
    
    # Split paragraphs and return AlignedSententence list
    asList = []
    paras = without_header.split('\n\n')
    for para in paras:
        comma_idx = para.index('\n,\n') if '\n,\n' in para else -1
        semicolon_idx = para.index('\n;') if '\n;' in para else -1
        sign = para[:comma_idx] if comma_idx != -1 else None
        translit = para[comma_idx + 3:semicolon_idx] if comma_idx != -1 else para[:semicolon_idx]
        translate = para[semicolon_idx + 3:]
        if 'jw jr.tw n=j' in para:
            translate = 'they made for me'
        asList.append(AlignedSentence(translit, translate, sign, file_name))
    return asList

In [13]:
import os

nederhof_path = './texts/nederhof-texts'
all_parsed = []
for item in os.listdir(nederhof_path):
    parsed_file = parse_nederhof_file(item, '{}/{}'.format(nederhof_path, item))
    all_parsed.extend(parsed_file)

### Preprocess Nederhof Translation

- Remove mid-sentence line breaks
- Removes brackets containing english notes and their contents
- Removes numbers marking chapters and line numbers
- Removes `<al>` tags while leaving contents intact
- Removes improperly parsed quotes `&quot;`
- Adds a space before kept punctuation
- Converts to lowercase

In [14]:
for item in all_parsed[:10]:
    print(item.translate)

Existing:
the son of Ra,
his beloved,
Amenhotep, god and ruler of Thebes
<"c"> Horus of Edfu, great god, lord of heaven, may he give life!
<r1> Words to be spoken:
'I have given you all life and dominion,
all health, and all valour and strength.'
<r2> Month, lord of Thebes.
<r3> The good god, lord of rituals, Menkheperre,


In [15]:
with_brackets = [item for item in all_parsed if '<' in item.translate or '>' in item.translate]
for item in with_brackets:
    item.translate = item.translate.replace('\n', ' ')
    item.translate = re.sub('(<note>.+?<\/note>)', '', item.translate)
    item.translate = re.sub('^.+\<\/note\>', '', item.translate)
    item.translate = re.sub('\<\@?\d+[a-z]?\>?\s?', "", item.translate)
    item.translate = re.sub('\<[A-Za-z]\d+\>\s?', '', item.translate)
    item.translate = re.sub('\<\d+(:|-)\d+\>', '', item.translate)
    item.translate = re.sub('\<"[A-Za-z]+"\>', '', item.translate)
    item.translate = re.sub('(:|-|,|.)\d+\>\s?', '', item.translate)
    item.translate = item.translate.replace('<al>', '').replace('</al>', '')
    item.translate = re.sub(',?\d*\^pre>(.+?),?\d*\^post>', '\\1', item.translate)
    item.translate = re.sub("<I{1,2}", "", item.translate)
    item.translate = item.translate.replace('b>', '')

In [16]:
with_amph = [item for item in all_parsed if '&' in item.translate]
for item in with_amph:
    item.translate = item.translate.replace('&quot;', '')
    
for item in all_parsed:
    item.translate = item.translate.replace('\r\n', ' ').replace('\n', ' ')
    item.translate = item.translate.replace('.', ' .').replace('[ . . .]', '[...]').replace('l .p .h .', 'l.p.h.').replace('!', '')
    item.translate = item.translate.replace(',', '').replace('???','...').replace('?', '').replace("'", "").replace('(', '').replace(')', '').replace('{', '').replace('}', '').replace('-', ' ').replace(':', "").replace(';', '')
    item.translate = re.sub(r'([^\.\.\.\]])\]', "\\1", item.translate)
    item.translate = re.sub(r'\[([^\.\.\.\]])', "\\1", item.translate)
    item.translate = ' '.join([x for x in item.translate.split() if x != '.' and len(x)])
    item.translate = item.translate.lower()

### Preprocess Nederhof Transliteration

- Removes mid-sentence line breaks
- Removes HTML tags
- Removes artificial capitalization via caret
- Standardizes `=` to `.`
- Removes parentheses
- Adds a space before kept punctuation
- Removes hyphenation

In [17]:
for item in all_parsed[:10]:
    print(item.translate)

existing
the son of ra
his beloved
amenhotep god and ruler of thebes
horus of edfu great god lord of heaven may he give life
words to be spoken
i have given you all life and dominion
all health and all valour and strength
month lord of thebes
the good god lord of rituals menkheperre


In [18]:
with_brackets = [item for item in all_parsed if '<' in item.translit or '>' in item.translit]

for item in with_brackets:
    item.translit = item.translit.replace('\n', ' ')
    item.translit = re.sub('(\<note\>.+?\<\/note\>)', '', item.translit)
    item.translit = re.sub('^.+\<\/note\>', '', item.translit)
    item.translit = item.translit.replace('<no>""</no>', '').replace('<no> </no>', '')
    item.translit = re.sub('\<\@?\d+[a-z]?\>?\s?', "", item.translit)
    item.translit = re.sub('\<[A-Za-z]\d+\>\s?', '', item.translit)
    item.translit = re.sub('\<\d+(:|-)\d+\>', '', item.translit)
    item.translit = re.sub('\<"([A-Za-z]+|\d+\s)"\>\s?', '', item.translit)
    item.translit = re.sub('(:|-|,|.)\d+\>\s?', '', item.translit)
    item.translit = item.translit.replace('<al>', '').replace('</al>', '')
    item.translit = re.sub(',?\d*\^pre>(.+?),?\d*\^post>', '\\1', item.translit)
    item.translit = re.sub("<I{1,2}", "", item.translit)
    item.translit = re.sub('[a-z]>\s?', '', item.translit)
    print(item.translit)

^Hr-^bHdtj nTr aA nb pt Dj=f anx
Dd-mdw
^mnTw nb ^wAst
nTr-nfr nb jrt-jxt ^mn-xpr-^ra
dwA-nTr sp 4
Dd-mdw
Dd-mdw
sA-^ra mrj=f ^DHwtj-msjw HqA-mAat
anx
rnpt-sp 22 Abd 2 prt sw 10
saA nxtw=f r rDjt sDd.tw qnn=f
stt=f r Dbt Hmt
Dj=f pr Ssp 3 Hr-sA=f
jr jry=f At sDA-Hr=f
jn.n=f Xnm n smAw 12
DA.n=f jtrw pXr-wr
jn.n=f Sqb m stt Hr xAst rst ^tA-stj
m wDAw r tA n ^DAhj
jw jw Hm=f r Tnw sp
[...] m ^jnbw-HD r smA xAswt ^rTnw Xst
ntj wA r Hns wrt
xrw bdS.w
[...]n=sn Hr sSA [...] jxt jrw Hr psDw=sn
[...] jw m sp wa Xr jnw [...]
[...] rnpt-sp 29 Abd 4 prt sw [...]
bHdtj nTr aA nb pt
^xnsw-m-^wAst ^nfr-Htp
nsw-bjtj nb tAwj
sA-^ra n Xt=f
Dj anx mj ^ra Dt
sA n anx HA=f nb{t}
jr snTr n jt=f ^xnsw-m-^wAst
rn n Hm-nTr wab
n ^xnsw-pA-jr-sxr-m-^wAst
^xnsw-pA-jr-sxr-m-^wAst
nTr aA sHr SmAy
mry Dj anx mj ^ra
^Hr ^kA-nxt twt-xaw

jty jT pDt-9
wr pHtj{t} mj sA ^nwt
mfkAt xAw nb nw tA-nTr Hr psD=sn
wn st nfr(.tj) r aA wr Hr jb n Hm=f r jxt nb
Hr jrj Hsw n jt=f ^jmn-^ra
aHa.n ms=f m-bAH Hm=f Hna jnw=f
jAw n=k ^

In [19]:
with_caret = [item for item in all_parsed if '^' in item.translit]
for item in with_caret:
    item.translit = item.translit.replace('^', '')

In [20]:
with_eq = [item for item in all_parsed if '=' in item.translit]
for item in with_eq:
    item.translit = item.translit.replace('=', '.')

for item in all_parsed:
    item.translit = item.translit.replace('\n', ' ')
    item.translit = item.translit.replace('.', ' .').replace('a .w .s .', 'a.w.s.').replace('[ . . .]', '[...]')
    item.translit = item.translit.replace('(', '').replace(')', '').replace('{', '').replace('}', '')
    item.translit = item.translit.replace('-', ' ')
    item.translit = re.sub(r'([^\.\.\.\]])\]', "\\1", item.translit)
    item.translit = re.sub(r'\[([^\.\.\.\]])', "\\1", item.translit)

## Combine Aligned Corpora and Remove Empty Entries

Combines the .txt and Nederhof aligned corpora and removes entries containing only spaces / numbers / ellipses.

In [21]:
empty = [item for item in all_parsed if item.translit.isspace() or item.translit == '' or item.translate.isspace() or item.translate == '']

all_parsed = [item for item in all_parsed if item not in empty]
print(len(all_parsed))

8300


In [22]:
try:
    all_parsed.extend(parsed)
except:
    pass

In [26]:
aligned_lst = [[item.translit, item.translate] for item in all_parsed if not all([ch.isspace() or ch.isnumeric() or ch in string.punctuation for ch in item.translate]) and not all([ch.isspace() or ch.isnumeric() or ch in string.punctuation for ch in item.translit]) and len(item.translate) and len(item.translit) and '[...]' not in item.translit and '[...]' not in item.translate]
aligned_df = pd.DataFrame(data=aligned_lst, columns=['Transliteration', 'Translation'])
aligned_df.shape

(8176, 2)

In [None]:
pt_df = pd.read_csv("./pyramidtext-parsing/preprocessed.csv", index_col=0)
aligned_df = pd.concat([aligned_df,pt_df])

In [124]:

aligned_df['Transliteration'].to_csv('./compiled_corpora/aligned.egy.csv', index=False)
aligned_df['Translation'].to_csv('./compiled_corpora/aligned.eng.csv', index=False)

In [125]:
import numpy as np
def train_validate_test_split(df, train_percent=.6, validate_percent=.2, seed=None):
    np.random.seed(seed)
    perm = np.random.permutation(df.index)
    m = len(df.index)
    train_end = int(train_percent * m)
    validate_end = int(validate_percent * m) + train_end
    train = df.iloc[perm[:train_end]]
    validate = df.iloc[perm[train_end:validate_end]]
    test = df.iloc[perm[validate_end:]]
    return train, validate, test

In [126]:
train, val, test = train_validate_test_split(aligned_df, train_percent=0.8, validate_percent=0.1, seed=42)
for k, v in {'train': train, 'val': val, 'test': test}.items():
    v['Transliteration'].to_csv('./compiled_corpora/aligned_{}.egy.csv'.format(k), index=False, header=False)
    v['Translation'].to_csv('./compiled_corpora/aligned_{}.eng.csv'.format(k), index=False, header=False)

## Web-scrape single-language corpus

In [47]:
try:
    from bs4 import BeautifulSoup
except:
    !pip install beautifulsoup4
    from bs4 import BeautifulSoup

In [48]:
import requests

links = ['http://www.egyptomaniak.gr/Pyramid%20Texts%202-3-4_N.html',
         'http://www.egyptomaniak.gr/Old%20Kingdom%20Texts_N.html',
         'http://www.egyptomaniak.gr/New%20Kingdom%20Texts%20I_N.html',
         'http://www.egyptomaniak.gr/New%20Kingdom%20Texts%20II_N.html',
         'http://www.egyptomaniak.gr/Book%20of%20the%20Dead_N.html',
         'http://www.egyptomaniak.gr/Various%20Texts_N/Pyramid%20Texts%201_N.html',
         'http://www.egyptomaniak.gr/Various%20Texts_N/Coffin%20Texts%20vol%20I_N.html',
         'http://www.egyptomaniak.gr/Various%20Texts_N/Coffin%20Texts%20vol%20II_N.html',
         'http://www.egyptomaniak.gr/Various%20Texts_N/Coffin%20Texts%20vol%20III_N.html',
         'http://www.egyptomaniak.gr/Various%20Texts_N/Libyan%20anarchy_N.html',
         'http://www.egyptomaniak.gr/Various%20Texts_N/Late%20Period%20Texts_N.html',
         'http://www.egyptomaniak.gr/Various%20Texts_N/Middle%20Kingdom%20Texts/Middle%20Kingdom%20Texts_N.html',
         'http://www.egyptomaniak.gr/Various%20Texts_N/Various%20Texts%202_N.html']


In [49]:
import unicodedata
def get_link_lines(link):
    file_name = 'texts/monolingual-texts/' + link.split('/')[-1]
    if not exists(file_name):
        page = requests.get(link)
        with open(file_name, 'wb') as to_write:
                to_write.write(page.content)
        content = page.content
    else:
        with open(file_name, 'r') as to_read:
            content = to_read.read()
    soup = BeautifulSoup(content, 'html.parser')
    paras = soup.select('p')
    tag_txt = [unicodedata.normalize('NFKD', para.text).replace('\r\n', ' ').replace('\n', ' ').strip() for para in paras if len(para.select('span')) > 0 and all(['style' in item.attrs and 'Times' in item.attrs['style'] and not item.text.isspace() for item in para.select('span')])]
    ans = [item for item in tag_txt if item != "" and not item.isspace() and not item.isnumeric() and not item.startswith('Warning')]
    print('Link: {}, Length: {}'.format(link, len(ans)))
    print(ans[:10])
    print('\n')
    return ans

In [50]:
nonempty = []
sm = 0
for link in links:
    print("Working on " + link)
    lines = get_link_lines(link)
    sm += len(lines)
    nonempty.extend(lines)
print('Total unaligned corpus length before processing: {}'.format(sm))

Working on http://www.egyptomaniak.gr/Pyramid%20Texts%202-3-4_N.html


UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 25331: character maps to <undefined>

#### Standardize alphabetization

In [39]:
to_swap = {'ḫ': 'x', 'ḥ': 'H', 'ȝ': 'A', 'ḳ': 'q', 'ỉ': 'i', 'ḏ': 'D', 'š': 'S', "ʿ": 'a', 'ẖ':'X', 'ṯ':'T'}
times_swapped = {'ḫ': 0, 'ḥ': 0, 'ȝ': 0, 'ḳ': 0, 'ỉ': 0, 'ḏ': 0, 'š': 0, "ʿ": 0, 'ẖ': 0, 'ṯ': 0}
def standardize_translit(inp):
    current = []
    alphabetized = []
    skip = []
    for item in inp:
        temp = item
        for k in to_swap.keys():
            temp = re.sub(k, to_swap[k], temp)
        alphabetized.append(temp)
    return alphabetized

In [40]:
alphabetized = standardize_translit(nonempty)

### Preprocess transliteration

In [41]:
alphabetized = [item for item in alphabetized if not all([ch in string.punctuation or ch.isspace() for ch in item])]

In [42]:
for i in range(len(alphabetized)):
    if re.match(r'(\d+|1A4)\)\s?',alphabetized[i]):
        alphabetized[i] = re.sub(r'(\d+|1A4)\)\s?', '', alphabetized[i])

In [43]:
alphabetized = [item for item in alphabetized if not 'Chapter' in item and not 'Sethe' in item]
len(alphabetized)

50617

In [44]:
df = pd.DataFrame(alphabetized, columns=['Sentences'])
df['Sentences'] = df['Sentences'].astype('string')
df.head()

Unnamed: 0,Sentences
0,wn aA.wy (sAt) n Hr sn aA.wy SAbwt n stS
1,pna.k n.f m xnti-inb-f swA.n N Hr.Tn m itm
2,N pw xaii-tAw Hri-ib ngAw
3,Dd-mdw wab.n N Hna ra m S-iArw
4,Hr sin.f iwf.k N DHwti sin.f rd.wy.k N


In [45]:
for item in df['Sentences'][:10]:
    print(item)

wn aA.wy (sAt) n Hr sn aA.wy SAbwt n stS
pna.k n.f m xnti-inb-f swA.n N Hr.Tn m itm
N pw xaii-tAw Hri-ib ngAw
Dd-mdw wab.n N Hna ra m S-iArw
Hr sin.f iwf.k N DHwti sin.f rd.wy.k N
Sw fA N ir Hrt nwt imi a.T n N
Dd-mdw inD-Hr.k iri-aA n Hr ... arrwt nt wsir
i.Dd mii rn n N ... n Hr
i.n.f Xr psg smA(?) r smA.f pw
mr ir.f tp-Abdw nqm ir(.f) tp-smdwt


In [46]:
df['Sentences'] = df['Sentences'].str.replace('\{', '', regex=True)
df['Sentences'] = df['Sentences'].str.replace('\}', '', regex=True)
df['Sentences'] = df['Sentences'].str.replace('\.\.\.', '[...]', regex=True)
df['Sentences'] = df['Sentences'].str.replace('\(\.\.\.\)', '[...]', regex=True)
df['Sentences'] = df['Sentences'].str.replace('\(\s?\?\)', '', regex=True)
df['Sentences'] = df['Sentences'].str.replace('\.', ' .', regex=True)
df['Sentences'] = df['Sentences'].str.replace('a .w .s .', 'a.w.s.', regex=True)
df['Sentences'] = df['Sentences'].str.replace('\[\s\.\s\.\s\.\]', '[...]', regex=True)
df['Sentences'] = df['Sentences'].str.replace('\(', '', regex=True)
df['Sentences'] = df['Sentences'].str.replace('\)', '', regex=True)
df['Sentences'] = df['Sentences'].str.replace('\-', ' ', regex=True)

### Remove any monolingual entry comprised solely of punctuation, spaces and/or numbers

In [47]:
broken = [item for item in df['Sentences'] if all([ch.isspace() or ch.isnumeric() or ch in string.punctuation for ch in item]) or not len(item)]
cond = df['Sentences'].isin(broken)
df.drop(df[cond].index, inplace = True)

### Save monolingual corpus

In [48]:
print(df.shape)
df.head()

(50456, 1)


Unnamed: 0,Sentences
0,wn aA .wy sAt n Hr sn aA .wy SAbwt n stS
1,pna .k n .f m xnti inb f swA .n N Hr .Tn m itm
2,N pw xaii tAw Hri ib ngAw
3,Dd mdw wab .n N Hna ra m S iArw
4,Hr sin .f iwf .k N DHwti sin .f rd .wy .k N


In [49]:
df.to_csv('./compiled_corpora/egyptian_monolingual.csv', index=False)