# Introduction

The `perseus_nlp_toolkit` includes some NLTK/CLTK (or NLTK-like) functionality for working with citable, Capitains-compliant texts. So far I have mostly tested it with Greek texts from the Perseus DL and the First 1k Years of Greek and Latin.

In this notebook I will show how to use the modules in this package to:
* read a corpus of Greek texts
* perform a full morphological analysis
* lemmatize the tagged words

In [1]:
import sys
sys.path.append("../../")

In [2]:
from perseus_nlp_toolkit import lemmatize
from perseus_nlp_toolkit import tagger
from perseus_nlp_toolkit import CapitainCorpusReader

In [3]:
import os
from importlib import reload

# Read the corpus

The module includes a `CapitainCorpusReader` to load and tokenize texts while preserving their citation scheme. This reader extends the regular NLTK corpus reader with some special methods to access the canonical citations extracted from TEI- or EpiDoc-compliant texts. The class has methods to tokenize words and sentences and to store the citations along with the tokens.

As with any corpus reader, the `CapitainCorpusReader` is constructed with the root directory of the data, a regexp or a simple path (relative to the root) matching the file(s) to read and, optionally, a sentence tokenizer and a word tokenizer. In what follows we use the defaults: a sentence tokenizer for Greek and the regular word-punct tokenizer.

Another parameter that can be passed to the reader is the list of TEI elements to bypass. By default, the reader ignores the content of the `tei:note` element. When you deal with dialogic or dramatic texts, you might also want to add the elements holding the speaker's label (e.g. `tei:label`) to that list.
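To make the effect of bypassing elements concrete, here is a minimal stand-alone sketch that uses only the standard library. It is not the reader's actual implementation: the `text_without` helper and the sample XML are hypothetical, and they only illustrate how skipping `tei:note` or `tei:label` changes the extracted text.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample, loosely modelled on a TEI dialogue passage.
SAMPLE = (
    '<div xmlns="http://www.tei-c.org/ns/1.0">'
    "<label>ΑΠΟΛ.</label> δοκῶ μοι"
    "<note>an editorial note</note> περὶ ὧν πυνθάνεσθε"
    "</div>"
)


def text_without(element, exclude_tags):
    """Collect the text of `element`, skipping the content of excluded tags.

    `exclude_tags` uses prefixed names such as "tei:note". The text that
    follows an excluded element (its tail) is still kept.
    """
    excluded = {"{%s}%s" % (TEI_NS, t.split(":")[-1]) for t in exclude_tags}

    def walk(node):
        parts = []
        if node.tag not in excluded:
            parts.append(node.text or "")
            for child in node:
                parts.extend(walk(child))
        parts.append(node.tail or "")
        return parts

    # Collapse runs of whitespace left behind by the removed elements.
    return " ".join("".join(walk(element)).split())


root_el = ET.fromstring(SAMPLE)
with_labels = text_without(root_el, ["tei:note"])
without_labels = text_without(root_el, ["tei:note", "tei:label"])
```

With only `tei:note` excluded the speaker label stays in the text; adding `tei:label` removes it, which is usually what you want before tokenizing a dialogue.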

In [4]:
root = os.path.expanduser("~/cltk_data/greek/text/canonical-greekLit-master/data/")

We will work with Plato's *Symposium*.

In [28]:
# As we are reading a dialogic text, we drop the labels with the speakers' names
symp = CapitainCorpusReader(root, r"tlg0059/tlg011/tlg.+grc2\.xml",
                            exclude_tags=["tei:note", "tei:label"])

The `cite_sents(fileid)` method gives access to the sentences of a specific file (passed as the `fileid` argument); each token in each sentence also keeps the canonical citation extracted from the XML file.

In [6]:
f = symp.fileids()[0]
cite_sents = symp.cite_sents(f)

In [7]:
cite_sents[1]

[('172', 'καὶ'),
 ('172', 'γὰρ'),
 ('172', 'ἐτύγχανον'),
 ('172', 'πρῴην'),
 ('172', 'εἰς'),
 ('172', 'ἄστυ'),
 ('172', 'οἴκοθεν'),
 ('172', 'ἀνιὼν'),
 ('172', 'Φαληρόθεν'),
 ('172', '·')]
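Because each token is a plain `(citation, token)` pair, ordinary Python is enough to regroup or strip the citations. The sketch below copies the first few pairs from the output above; the variable names are only illustrative.

```python
from itertools import groupby
from operator import itemgetter

# The first tokens of the sentence shown above, as (citation, token) pairs.
sample_sent = [
    ("172", "καὶ"),
    ("172", "γὰρ"),
    ("172", "ἐτύγχανον"),
    ("172", "πρῴην"),
]

# Drop the citations to recover a plain token list.
tokens = [token for _, token in sample_sent]

# Group consecutive tokens that share the same citation.
by_citation = {
    cite: [token for _, token in group]
    for cite, group in groupby(sample_sent, key=itemgetter(0))
}
```

Note that `groupby` only merges consecutive items, so this works for a single sentence but would overwrite entries if the same citation reappeared later in a longer sequence.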

In total, the *Symposium* contains:

In [24]:
print("- {}\t sentences".format(len(cite_sents)))
print("- {}\t tokens".format(len(symp.words())))

- 1041	 sentences
- 20311	 tokens


# Morphological tagging

The `tagger` module includes a Python wrapper around the [mate-tools](https://code.google.com/archive/p/mate-tools/) morphological tagger. This wrapper works very well with the pre-trained models that scored the highest accuracy in the experiment by [Celano, Crane, and Majidi](https://doi.org/10.1515/opli-2016-0020).

The models were tested here courtesy of Giuseppe Celano!

The `MateMorphTagger` is instantiated by passing the path to the folder where the mate jar file is stored and the path (relative to the root folder where the jar is) to the trained model.

In [8]:
mate_tagger = tagger.MateMorphTagger(os.path.expanduser("~/Downloads/MateGreek"), 
                                     "LastmateMorph.model")

In [9]:
%%time
cited_tagged_sents = mate_tagger.tag_cite_sents(cite_sents)

CPU times: user 300 ms, sys: 52 ms, total: 352 ms
Wall time: 1min 39s


As can be seen, the tagger is pretty fast! It took only about 1 minute 40 seconds of wall time (on my machine) to tag more than 20k words.

Let's inspect the first tagged sentence:

In [25]:
cited_tagged_sents[0]

[('172', 'δοκῶ', 'v|1|s|p|i|a|-|-|-'),
 ('172', 'μοι', 'p|-|s|-|-|-|m|d|-'),
 ('172', 'περὶ', 'r|_'),
 ('172', 'ὧν', 'p|-|p|-|-|-|n|g|-'),
 ('172', 'πυνθάνεσθε', 'v|2|p|p|m|e|-|-|-'),
 ('172', 'οὐκ', 'd|_'),
 ('172', 'ἀμελέτητος', 'n|-|s|-|-|-|f|g|-'),
 ('172', 'εἶναι', 'v|-|-|p|n|a|-|-|-'),
 ('172', '.', 'u|_')]
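The tagger returns pipe-separated tags (`v|1|s|p|i|a|-|-|-`, or `r|_` for indeclinable words), while the lemmatizer output further down uses compact nine-character strings (`v1spia---`, `r--------`). The following sketch shows one way to convert between the two and to name the slots. Both helpers are hypothetical, and the position names are an assumption based on the nine-slot, Perseus-style morphological tagset.

```python
# Assumed position names for a nine-slot, Perseus-style morphological tag.
POSITIONS = (
    "pos", "person", "number", "tense", "mood",
    "voice", "gender", "case", "degree",
)


def normalize_tag(tag):
    """Turn 'v|1|s|p|i|a|-|-|-' or 'r|_' into a nine-character string."""
    fields = [f for f in tag.split("|") if f != "_"]
    compact = "".join(fields)
    # Pad short tags such as 'r' with '-' so every tag has nine slots.
    return compact.ljust(len(POSITIONS), "-")


def decode_tag(tag):
    """Map each filled slot of a tag to its (assumed) position name."""
    compact = normalize_tag(tag)
    return {name: value for name, value in zip(POSITIONS, compact) if value != "-"}


verb = decode_tag("v|1|s|p|i|a|-|-|-")
prep = normalize_tag("r|_")
```

Here `verb` holds the six filled slots of the tag for δοκῶ, and `prep` is the padded form of the preposition tag.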

That seems encouraging indeed!

# Lemmatization

A POS tag + word couplet is generally more than enough to disambiguate between lemmata. One of the classes available in the `lemmatize` module adopts the same approach as Giuseppe Celano's work and uses the couplet to perform a lookup in the Morpheus database. (Again, we use Giuseppe's Unicode conversion of the Morpheus tabs.)

The `MorpheusLookupLemmatizer` performs the lookup using a bz2-compressed CSV table with all the form, tag, and lemma combinations in Morpheus. The table has a little under 1M lines, but the compressed version takes up only 3.3MB of disk space, so it's not that bad. A database would probably be a more efficient way to look up the forms, but it would require users to install, populate and maintain the db just to lemmatize a text. A `pandas` dataframe offers a fast and effective way to look up the forms in the table.
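The idea of a lookup keyed on the (form, tag) couplet can be sketched with the standard library alone. This is not the class's implementation: it swaps the `pandas` dataframe for a plain dictionary, and the column names `form`, `tag`, `lemma` as well as the three-row table are hypothetical (the layout of the real file may differ).

```python
import bz2
import csv
import io

# Hypothetical three-row table; the real file has close to a million rows.
rows = [
    ("δοκῶ", "v1spia---", "δοκέω"),
    ("λέγεις", "v2spia---", "λέγω"),
    ("πόσεως", "n-s---fg-", "πόσις"),
]

# Write the table as bz2-compressed CSV, in memory to keep the sketch self-contained.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["form", "tag", "lemma"])
writer.writerows(rows)
compressed = bz2.compress(buffer.getvalue().encode("utf-8"))

# Read it back and index it on the (form, tag) couplet.
text = bz2.decompress(compressed).decode("utf-8")
reader = csv.DictReader(io.StringIO(text))
index = {(r["form"], r["tag"]): r["lemma"] for r in reader}


def lookup(form, tag):
    """Return the lemma for a (form, tag) couplet, or '' when it is missing."""
    return index.get((form, tag), "")


found = lookup("δοκῶ", "v1spia---")
not_found = lookup("μοι", "p-s---md-")
```

Returning an empty string for a missing couplet mirrors the empty lemmata that appear in the lemmatized output.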

In [13]:
%%time
lemmatizer = lemmatize.MorpheusLookupLemmatizer("../lib/morpheus/morpheus_dataframe.csv.bz2")

CPU times: user 2.37 s, sys: 175 ms, total: 2.54 s
Wall time: 3 s


Here's an example of a lemmatized sentence

In [65]:
%%time
lemmatizer.lemmatize_sentence(cited_tagged_sents[0], include_cite=True)

CPU times: user 1.51 s, sys: 17.2 ms, total: 1.52 s
Wall time: 1.53 s


[('172', 'δοκῶ', 'δοκέω', 'v1spia---'),
 ('172', 'μοι', '', 'p-s---md-'),
 ('172', 'περὶ', '', 'r--------'),
 ('172', 'ὧν', 'ὅς', 'p-p---ng-'),
 ('172', 'πυνθάνεσθε', 'πυνθάνομαι', 'v2ppme---'),
 ('172', 'οὐκ', 'οὐ', 'd--------'),
 ('172', 'ἀμελέτητος', '', 'n-s---fg-'),
 ('172', 'εἶναι', 'εἰμί', 'v--pna---'),
 ('172', '.', 'punct', 'u--------')]

The class has a `lemmatize_sentences` method that can be used to lemmatize an entire sequence of sentences. But, as we shall see, it is an extremely slow process. To get a better sense of how long the task takes, we'll rewrite the iteration ourselves in the following cells and monitor the loop with `tqdm`.

In [11]:
from tqdm import tqdm

In [14]:
lemm_sents = []
for s in tqdm(cited_tagged_sents):
    ls = lemmatizer.lemmatize_sentence(s, include_cite=True)
    lemm_sents.append(ls)

100%|██████████| 1041/1041 [27:53<00:00,  1.61s/it]


In total, it took a little under half an hour to lemmatize 1041 sentences. That's not very good! Note also that I had already sped up the process considerably by memoizing the lookup with [functools.lru_cache](https://docs.python.org/3/library/functools.html#functools.lru_cache). That brought a noticeable gain: on the previous attempt, without caching, I had to stop the process after about 50 minutes with 50% of the sentences still to go.
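For readers unfamiliar with `functools.lru_cache`, here is a minimal sketch of the memoization idea. The `TABLE` dictionary and the `cached_lookup` function are hypothetical stand-ins, not the package's code; in the real lemmatizer the cached call would be the expensive table lookup.

```python
from functools import lru_cache

# Hypothetical stand-in for the table; the real lookup is far more expensive.
TABLE = {
    ("τὸ", "l-s---na-"): "ὁ",
    ("τῆς", "l-s---fg-"): "ὁ",
}


@lru_cache(maxsize=None)
def cached_lookup(form, tag):
    """Look up a (form, tag) couplet; repeated calls are served from the cache."""
    return TABLE.get((form, tag), "")


# Frequent words such as the article recur constantly in a real text.
sequence = [("τὸ", "l-s---na-"), ("τῆς", "l-s---fg-"), ("τὸ", "l-s---na-")]
lemmas = [cached_lookup(form, tag) for form, tag in sequence]

# cache_info() reports how many calls were answered from the cache.
info = cached_lookup.cache_info()
```

The arguments must be hashable for the cache to work, which is why the couplet is passed as two strings. The more often a form and tag pair repeats in the text, the more lookups the cache saves.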

Let us inspect the results

In [22]:
lemm_sents[102]

[('176', 'τὸν', '', 'p-s---ma-'),
 ('176', 'οὖν', 'οὖν', 'g--------'),
 ('176', 'Ἀριστοφάνη', '', 'n-s---fn-'),
 ('176', 'εἰπεῖν', 'εἶπον', 'v--ana---'),
 ('176', ',', 'punct', 'u--------'),
 ('176', 'τοῦτο', '', 'p-s---na-'),
 ('176', 'μέντοι', '', 'c--------'),
 ('176', 'εὖ', 'εὖ', 'd--------'),
 ('176', 'λέγεις', 'λέγω', 'v2spia---'),
 ('176', ',', 'punct', 'u--------'),
 ('176', 'ὦ', '', 'i--------'),
 ('176', 'Παυσανία', '', 'n-s---mv-'),
 ('176', ',', 'punct', 'u--------'),
 ('176', 'τὸ', 'ὁ', 'l-s---na-'),
 ('176', 'παντὶ', '', 'a-s---md-'),
 ('176', 'τρόπῳ', 'τρόπος', 'n-s---md-'),
 ('176', 'παρασκευάσασθαι', 'παρασκευάζω', 'v--anm---'),
 ('176', 'ῥᾳστώνην', 'ῥᾳστώνη', 'n-s---fa-'),
 ('176', 'τινὰ', '', 'a-s---fa-'),
 ('176', 'τῆς', 'ὁ', 'l-s---fg-'),
 ('176', 'πόσεως', 'πόσις', 'n-s---fg-'),
 ('176', '·', 'punct', 'u--------')]

Not very impressive... Though some mistakes can be corrected in postprocessing.
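Before any postprocessing it helps to know which forms came back without a lemma. The sketch below does this for the first lemmatized sentence shown earlier; the `unlemmatized` helper is hypothetical and simply relies on the `(citation, form, lemma, tag)` layout of the output.

```python
# First lemmatized sentence from above, as (citation, form, lemma, tag) tuples.
sample = [
    ("172", "δοκῶ", "δοκέω", "v1spia---"),
    ("172", "μοι", "", "p-s---md-"),
    ("172", "περὶ", "", "r--------"),
    ("172", "ὧν", "ὅς", "p-p---ng-"),
    ("172", "πυνθάνεσθε", "πυνθάνομαι", "v2ppme---"),
    ("172", "οὐκ", "οὐ", "d--------"),
    ("172", "ἀμελέτητος", "", "n-s---fg-"),
    ("172", "εἶναι", "εἰμί", "v--pna---"),
    ("172", ".", "punct", "u--------"),
]


def unlemmatized(sentence):
    """Return the forms whose lemma slot is empty."""
    return [form for _, form, lemma, _ in sentence if not lemma]


missing_forms = unlemmatized(sample)

# Share of tokens (punctuation included) that received a lemma.
coverage = 1 - len(missing_forms) / len(sample)
```

Running the same helper over all of `lemm_sents` would give a list of the forms to target first when correcting the results.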

To sum up, the simple `MorpheusLookupLemmatizer` can be OK for short texts and a very superficial round of lemmatization.