# DSIR Capstone

### by: Chris 'Mack' McGowen



Ancient Greek has always been the lesser-taught of the two classical languages (ancient Greek and Latin). Most classical-language tools, apps, and programs are built specifically for Latin, with ancient Greek as a kind of 'stretch goal' at best. In the past decade NLP has made great strides (digital assistants like Alexa and Siri, bots, Google Translate, etc.), and there are a ton of tools out there for processing and modeling modern languages. Once again, though, ancient Greek gets none of the NLP love. So for this capstone project I have decided to attempt a custom NLP pipeline for ancient Greek. The end goal is a web app that takes input strings of ancient Greek, runs them through the pipeline, and returns the output.

## Goals:

1. Custom Part of Speech (POS) tagger
2. 'rootitizer': a lemmatizer that finds morphophonological roots instead of dictionary lemmas
3. Word2Vec or other vectorization built from custom corpus
4. Model still undecided: LogReg may work, with neural networks as a possible stretch

## Data Sources:

1. Most of the data is taken from an unreleased textbook (corpus, roots, segmentation rules). This was given to me as a .docx file.
2. A few custom libraries exist for handling ancient Greek text.

## Risks and Assumptions:

1. The data is small, specific, and hard to work with (500+ page .docx file with tons of stuff I don't want). This means the output of the project will be of minimal use on a grand scale. The code may be useful though.

## Preliminary Data Cleaning and Extraction

In [5]:
import docx
import pandas as pd
import numpy as np
import regex as re

In [6]:
# all of the definitions I used

def filter_greek_roots(doc):
    doc = doc.expandtabs(1)    # expand tabs so I can split on spaces
    doc = doc.split(' ')
    # pull out runs of characters from the two unicode blocks for greek that end in '/' (the root marker)
    # raw string so \p reaches the regex module intact; no '|' inside the class, where it would match a literal pipe
    doc = [re.findall(r'[\p{InGreekExtended}\p{InGreekAndCoptic}]+/', token) for token in doc]
    doc = list(filter(None, doc))    # remove empty lists
    flat_list = [item for sublist in doc for item in sublist]    # flatten the list of lists into a single list
    return flat_list

def filter_greek(doc):
    doc = doc.expandtabs(1)
    doc = doc.split(' ')
    # same character class as above, without the trailing '/', so whole greek words are kept
    doc = [re.findall(r'[\p{InGreekExtended}\p{InGreekAndCoptic}]+', token) for token in doc]
    doc = list(filter(None, doc))    # remove empty lists
    doc = [''.join(pieces) for pieces in doc]    # glue the pieces of each token back together
    return doc

def remove_articles(text):    # remove greek articles because they're useless for our purposes
    articles = ['ἡ', 'ὁ', 'τό']
    for article in articles:
        try:    # list.remove raises ValueError if the item isn't in the list
            text.remove(article)
        except ValueError:
            pass
    return text

def remove_prefix(text):    # many of the roots have prefixes separated from the root. remove them to simplify roots
    try:
        del text[-2]    # the prefix, if there is one, sits right before the root (the last item); a 1-element list raises IndexError and is left alone
    except IndexError:
        pass
    return text
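
One assumption baked into the two patterns above is that the text is precomposed (NFC). Combining breathing marks and accents (U+0300 to U+036F) sit outside both Greek blocks, so decomposed input would be cut at every diacritic. Below is a minimal sketch of a guard. It uses the standard library `re` with explicit code-point ranges standing in for the `\p{...}` block names, and the helper name `greek_tokens` is hypothetical, not part of the notebook:

```python
import re
import unicodedata

# Explicit ranges standing in for the two Unicode blocks:
# Greek and Coptic (U+0370-U+03FF) and Greek Extended (U+1F00-U+1FFF).
GREEK = r'[\u0370-\u03FF\u1F00-\u1FFF]+'
GREEK_WORD = re.compile(GREEK)
GREEK_ROOT = re.compile(GREEK + '/')   # a run of Greek followed by the root marker

def greek_tokens(line, roots_only=False):
    """Hypothetical helper: NFC-normalize the line, then pull out the Greek runs."""
    line = unicodedata.normalize('NFC', line)
    pattern = GREEK_ROOT if roots_only else GREEK_WORD
    return pattern.findall(line)

# alpha + combining smooth breathing + gamma + '/', i.e. a decomposed spelling of the root
decomposed = '\u03b1\u0313\u03b3/ (1)\t\tlead, act, do'
roots = greek_tokens(decomposed, roots_only=True)
print(roots)                               # one root, with the breathing mark intact
print(GREEK_ROOT.findall(decomposed))      # without normalization the first letter is lost
```

Whether the corpus actually contains decomposed characters is not something the notebook shows, so treat this as a cheap safety check rather than a required step.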

In [7]:
# import the .docx file I'll be working on

doc = docx.Document('./data/lexis corpus.docx')

In [8]:
# extract text from each paragraph object

docs = [paragraph.text for paragraph in doc.paragraphs]

In [26]:
# create dataframe from text

df = pd.DataFrame(docs, columns=['root'])
start_size = df.shape[0]
df[:10]

Unnamed: 0,root
0,Α
1,
2,"Α, ΑΝ\t\t\t\tMAY BE A PREFIX OF A COMPOUND VERB"
3,"ἀ-, ἀν- (Ṇ-)\t\t\ta-, an-, un-, in-, non- (neg..."
4,"ἀβουλία/, ἡ \t\t\tsee βουλευ/\t\t\t\t"
5,"ἀγ/ (1) \t\t\t\tlead, act, do (Ital. agent, ag..."
6,"\tἀν/αγ/\t\t\tlead up, celebrate\n\tἀπ/αγ/\t\t..."
7,"εἰσ/αγ/ (Attic)\t\tlead in, introduce\n\tἐσ/αγ..."
8,"ἀγ/ (2)\t\t\t\tshatter, smash"
9,"\t\t\t\tἄγνυμι, ἄξω, ἔαξα, ἔαγα/ἔηγα, ἔαγμαι, ..."


In [27]:
# split each line on the first double tab: root on the left, everything else (definition etc.) on the right. A little messy

df[['root', 'definition']] = df['root'].str.split(r'\t\t', n=1, expand=True)
df.head()

Unnamed: 0,root,definition
0,Α,
1,,
2,"Α, ΑΝ",\t\tMAY BE A PREFIX OF A COMPOUND VERB
3,"ἀ-, ἀν- (Ṇ-)","\ta-, an-, un-, in-, non- (negative prefix)"
4,"ἀβουλία/, ἡ",\tsee βουλευ/\t\t\t\t
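
The split above leaves stray tabs at the front of most definitions, because the number of tabs between root and definition varies from row to row. A minimal sketch of a more forgiving split follows. It is an alternative rather than what the notebook does, and `split_entry` is a hypothetical helper that treats any run of two or more tabs as the separator:

```python
import re

def split_entry(line):
    """Split a lexicon line into (root, definition) on the first run of 2+ tabs."""
    parts = re.split(r'\t{2,}', line, maxsplit=1)
    root = parts[0].strip()
    # Lines without a separator have no definition part at all
    definition = parts[1].strip() if len(parts) > 1 else None
    return root, definition

print(split_entry('Α, ΑΝ\t\t\t\tMAY BE A PREFIX OF A COMPOUND VERB'))
print(split_entry('Α'))
```

Stripping here would change the intermediate outputs shown in this notebook, so it is only worth adopting if the whole pipeline is rerun.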


In [28]:
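# rows without a double tab ended up entirely in the root column with no definition; this moves that text over
# (.loc avoids chained assignment, which pandas may apply to a temporary copy)

for i in range(1, len(df)):
    if pd.isna(df.loc[i, 'definition']):
        df.loc[i, 'definition'] = df.loc[i, 'root']
        df.loc[i, 'root'] = ''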
        
df.head()

Unnamed: 0,root,definition
0,Α,
1,,
2,"Α, ΑΝ",\t\tMAY BE A PREFIX OF A COMPOUND VERB
3,"ἀ-, ἀν- (Ṇ-)","\ta-, an-, un-, in-, non- (negative prefix)"
4,"ἀβουλία/, ἡ",\tsee βουλευ/\t\t\t\t
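
The row-by-row fix above can also be written with a boolean mask. This is a minimal sketch on a toy frame (the two rows are illustrative). Unlike the loop, which starts at row 1, the mask does not skip row 0:

```python
import pandas as pd

# Toy frame: the second line had no double tab, so the split left its text in 'root'
toy = pd.DataFrame({
    'root': ['ἀγ/ (2)', 'lead in, introduce'],
    'definition': ['shatter, smash', None],
})

misplaced = toy['definition'].isna()    # rows where the split found no separator
toy.loc[misplaced, 'definition'] = toy.loc[misplaced, 'root']
toy.loc[misplaced, 'root'] = ''
print(toy)
```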


In [29]:
# pull the greek roots (tokens ending in '/') out of the messy root column

df['root'] = df['root'].apply(filter_greek_roots)
df.head()

Unnamed: 0,root,definition
0,[],
1,[],
2,[],\t\tMAY BE A PREFIX OF A COMPOUND VERB
3,[],"\ta-, an-, un-, in-, non- (negative prefix)"
4,[ἀβουλία/],\tsee βουλευ/\t\t\t\t


In [30]:
# replace empty lists in the root column with empty strings

df.root = np.where(df.root.str.len() == 0, '', df.root)
df.head()

Unnamed: 0,root,definition
0,,
1,,
2,,\t\tMAY BE A PREFIX OF A COMPOUND VERB
3,,"\ta-, an-, un-, in-, non- (negative prefix)"
4,[ἀβουλία/],\tsee βουλευ/\t\t\t\t


In [31]:
'''
the format of the original document had some issues which led to definitions
being off by a row from their root, so copy each orphaned definition up one row.
'''
for i in range(1, len(df)):
    if df.loc[i, 'root'] == '':
        df.loc[i - 1, 'definition'] = df.loc[i, 'definition']
        
df.drop_duplicates('definition', inplace=True, keep='first')
df.reset_index(inplace=True, drop=True)

df.head()

Unnamed: 0,root,definition
0,,
1,,\t\tMAY BE A PREFIX OF A COMPOUND VERB
2,,"\ta-, an-, un-, in-, non- (negative prefix)"
3,[ἀβουλία/],\tsee βουλευ/\t\t\t\t
4,[ἀγ/],"\t\tlead, act, do (Ital. agent, agenda, actor)..."
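
To make the realignment step concrete: whenever a row has an empty root, its definition really belongs to the row above. It is copied up one row, and the lower row, now a duplicate, disappears in `drop_duplicates`. Here is a minimal pure-Python sketch of the same idea on toy rows (the data and the `realign` helper are illustrative, not taken from the corpus):

```python
def realign(rows):
    """rows: list of (root, definition) tuples. Returns the realigned rows."""
    rows = [list(r) for r in rows]
    # Copy each orphaned definition (empty root) up to the previous row
    for i in range(1, len(rows)):
        if rows[i][0] == '':
            rows[i - 1][1] = rows[i][1]
    # Keep only the first row for each definition, like drop_duplicates(keep='first')
    seen, kept = set(), []
    for root, definition in rows:
        if definition not in seen:
            seen.add(definition)
            kept.append((root, definition))
    return kept

toy = [('root1/', ''), ('', 'form a, form b'), ('root2/', 'form c')]
print(realign(toy))
```

One side effect to keep in mind is that deduplicating on the definition also drops any entries that legitimately share the same text.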


In [32]:
# drop rows that contain references to other lexis entries

df.drop(index=df.loc[df['definition'].str.contains('see')].index, inplace=True)
df.head()

Unnamed: 0,root,definition
0,,
1,,\t\tMAY BE A PREFIX OF A COMPOUND VERB
2,,"\ta-, an-, un-, in-, non- (negative prefix)"
4,[ἀγ/],"\t\tlead, act, do (Ital. agent, agenda, actor)..."
5,"[ἀν/, αγ/]","\tlead up, celebrate\n\tἀπ/αγ/\t\t\tlead away"
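
Note that `str.contains('see')` matches the substring anywhere, so a gloss containing a word such as 'seed' or 'oversee' would be dropped as well. A hedged sketch of a stricter test using word boundaries is shown here with the standard library `re` on made-up glosses (the notebook itself uses the plain substring match):

```python
import re

CROSS_REF = re.compile(r'\bsee\b')   # 'see' as a whole word only

# Made-up glosses for illustration
glosses = ['see βουλευ/', 'seed, offspring', 'oversee, watch']
flags = [bool(CROSS_REF.search(g)) for g in glosses]
print(flags)   # [True, False, False]
```

Since `str.contains` treats its pattern as a regular expression by default, the same `r'\bsee\b'` pattern could be passed to it directly. That may change the final row count reported at the end of this section.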


In [33]:
# strip the english out of the definitions, keeping only the greek forms

df['definition'] = df['definition'].apply(filter_greek)
df.head()

Unnamed: 0,root,definition
0,,[]
1,,[]
2,,[]
4,[ἀγ/],"[ἄγω, ἄξω, ἤγαγον, ἦχα, ἦγμαι, ἤχθην]"
5,"[ἀν/, αγ/]",[ἀπαγ]


In [34]:
# remove the articles from the definitions

df['definition'] = df['definition'].apply(remove_articles)
df.head()

Unnamed: 0,root,definition
0,,[]
1,,[]
2,,[]
4,[ἀγ/],"[ἄγω, ἄξω, ἤγαγον, ἦχα, ἦγμαι, ἤχθην]"
5,"[ἀν/, αγ/]",[ἀπαγ]
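
`remove_articles` edits each list in place, and `list.remove` only deletes the first occurrence of each article. Below is a minimal non-mutating sketch that drops every occurrence. The `ARTICLES` set mirrors the three forms used above, and `drop_articles` is a hypothetical name:

```python
ARTICLES = {'ἡ', 'ὁ', 'τό'}   # same three forms as in remove_articles

def drop_articles(words):
    """Return a new list without any article tokens; the input list is left untouched."""
    return [w for w in words if w not in ARTICLES]

forms = ['ἀβουλία', 'ἡ']
print(drop_articles(forms))
print(forms)   # unchanged
```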


In [35]:
# drop rows where either the root or the definition is empty

df.drop(df.loc[(df.root.str.len() == 0) | (df.definition.str.len() == 0)].index, inplace=True)
df.reset_index(inplace=True, drop=True)
df.head()

Unnamed: 0,root,definition
0,[ἀγ/],"[ἄγω, ἄξω, ἤγαγον, ἦχα, ἦγμαι, ἤχθην]"
1,"[ἀν/, αγ/]",[ἀπαγ]
2,"[εἰσ/, αγ/]","[ἐσαγ, περιαγ, προαγ]"
3,[ἀγ/],"[ἄγνυμι, ἄξω, ἔαξα, ἔαγαἔηγα, ἔαγμαι, ἐάγην]"
4,[ἀγγελ/],"[ἄγγελλω, ἀγγελέω, ἤγγειλα, ἤγγελκα, ἤγγελμαι,..."


In [36]:
# df = pd.read_pickle('./data/vocab.pkl')

In [37]:
# remove the prefixes from roots

df['root'] = df['root'].apply(remove_prefix)
df.head()

Unnamed: 0,root,definition
0,[ἀγ/],"[ἄγω, ἄξω, ἤγαγον, ἦχα, ἦγμαι, ἤχθην]"
1,[αγ/],[ἀπαγ]
2,[αγ/],"[ἐσαγ, περιαγ, προαγ]"
3,[ἀγ/],"[ἄγνυμι, ἄξω, ἔαξα, ἔαγαἔηγα, ἔαγμαι, ἐάγην]"
4,[ἀγγελ/],"[ἄγγελλω, ἀγγελέω, ἤγγειλα, ἤγγελκα, ἤγγελμαι,..."
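
`del text[-2]` removes exactly one element, so a root with two stacked prefixes would keep its first prefix. If the intent is to keep only the final segment, slicing is a simpler way to sketch that. This is an assumption about the intent, not the notebook's behavior, and `strip_prefixes` is hypothetical:

```python
def strip_prefixes(segments):
    """Keep only the last '/'-terminated segment; empty lists stay empty."""
    return segments[-1:]

print(strip_prefixes(['ἀν/', 'αγ/']))           # one prefix
print(strip_prefixes(['a/', 'b/', 'root/']))    # made-up entry with two prefixes
print(strip_prefixes(['ἀγ/']))                  # no prefix

# For comparison, the del-based approach on the two-prefix entry
two_prefixes = ['a/', 'b/', 'root/']
del two_prefixes[-2]
print(two_prefixes)                             # the first prefix survives
```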


In [38]:
# turn root back into string

df.root = df.root.apply(''.join)
df.head()

Unnamed: 0,root,definition
0,ἀγ/,"[ἄγω, ἄξω, ἤγαγον, ἦχα, ἦγμαι, ἤχθην]"
1,αγ/,[ἀπαγ]
2,αγ/,"[ἐσαγ, περιαγ, προαγ]"
3,ἀγ/,"[ἄγνυμι, ἄξω, ἔαξα, ἔαγαἔηγα, ἔαγμαι, ἐάγην]"
4,ἀγγελ/,"[ἄγγελλω, ἀγγελέω, ἤγγειλα, ἤγγελκα, ἤγγελμαι,..."


In [39]:
# # save to pkl 

# df.to_pickle('./data/vocab.pkl')

In [41]:
end_size = df.shape[0]
print(f'Starting size: {start_size}\nFinal size: {end_size}')

Starting size: 2630
Final size: 852


As you can see, about two-thirds of the rows were lost here (2630 down to 852), because I used brute-force methods to minimize the time spent cleaning poorly formatted text. This is sufficient to start with; once predictions are made I can validate them and rebuild the models with the new data. The next step is to start building some models to predict the roots of unseen words.