# Latin Lemmatization with LatMor

!!! WRITE NEW POST !!!! This is the first installment in series of posts/notebooks on Latin lemmatization that will cover: 1. introductions to existing options for Latin lemmatization (esp. those available for Python); 2. comparisons/benchmarks for existing lemmatizers; and 3. development notes on the [CLTK Latin Backoff Lemmatizer](https://disiectamembra.wordpress.com/2016/08/23/wrapping-up-google-summer-of-code/) and related projects

[TreeTagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) is a probabilistic, decision tree-based part-of-speech tagger written by Helmut Schmid in 1994. It is described in this [paper](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger2.pdf). Though originally written for German tagging, parameter files have since been written for a number of languages including Latin. This notebook uses G. Brandolini's parameter file which is based on a [number of sources for Latin lexical and morphological data](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/Latin-parameter-file-readme): PROIEL data, Perseus data, Index Thomisticus data and Whitaker's Words.

Lemmatization is a by-product of TreeTagger's pos-tagging, but a useful one. It runs quickly, performs well, and has two Python wrappers (shown below). This notebook introduces the two Python wrappers—*treetaggerwrapper* and *treetagger-python*—and gives example workflows and some execution time information. The last section of this post offers assistance with installation and configuration of TreeTagger for OSX. \[PJB 5.4.18]

In [1]:
# # Install LatMor

# # Installation information at XXX. See last cell for more information.

In [12]:
# Imports

import re
import subprocess
import shlex

from collections import Counter

from cltk.tokenize.word import WordTokenizer

from pprint import pprint

In [3]:
# Create instances of CLTK tools

tokenizer = WordTokenizer('latin')

## Working with *LatMor* and jupyter bash

In [48]:
test = 'laudat'
cmd = f'echo {test} | fst-infl /usr/local/bin/latmor/latmor.a'
result = !$cmd

print(result)

['reading transducer from file "/usr/local/bin/latmor/latmor.a"...', 'finished.', '> laudat', 'laudare<V><pres><ind><active><sg><3>']


In [5]:
print(result[-1])

laudare<V><pres><ind><active><sg><3>


In [6]:
m = re.search('(.+?)<', result[-1])
print(m.group(1))

laudare


In [83]:
test = 'atque'
cmd = f'echo {test} | fst-infl /usr/local/bin/latmor/latmor.a'
result = !$cmd

pprint(result)

['reading transducer from file "/usr/local/bin/latmor/latmor.a"...',
 'finished.',
 '> atque',
 'atque<CONJ>',
 'at<PREP> que<PART>',
 'at<CONJ> que<PART>']


In [8]:
results = [r.split('<')[0] for r in result if r.endswith('>')]

In [9]:
pprint(results)

['amare', 'amor', 'amor']


## Working with *LatMor* and ```subprocess```

In [13]:
token = 'amor'
cmd1 = f'echo {token}'
cmd2 = '/usr/local/bin/fst-infl /usr/local/bin/latmor/latmor.a'

In [14]:
p1 = subprocess.Popen(shlex.split(cmd1), stdout=subprocess.PIPE)
p2 = subprocess.Popen(shlex.split(cmd2), stdin=p1.stdout, stdout=subprocess.PIPE)
result = p2.communicate()[0].decode()
print(result)

> amor
amare<V><pres><ind><passive><sg><1>
amor<N><masc><sg><nom>
amor<N><masc><sg><voc>



In [15]:
_result = result.split('\n')
results = [r.split('<')[0] for r in _result if r.endswith('>')]
print(results)

['amare', 'amor', 'amor']


In [16]:
def _choose_weighted_lemmas(lemmas):
    """"""
    
    c = Counter(lemmas)
    weights = [(i, c[i] / len(lemmas) * 100.0) for i in c]
    if weights:
        return max(weights,key=lambda item:item[1])[0]
    else:
        return None

def lm_lemmatize(token):
    cmd1 = f'echo {token}'
    cmd2 = '/usr/local/bin/fst-infl /usr/local/bin/latmor/latmor.a'
    p1 = subprocess.Popen(shlex.split(cmd1), stdout=subprocess.PIPE)
    p2 = subprocess.Popen(shlex.split(cmd2), stdin=p1.stdout, stdout=subprocess.PIPE)
    result = p2.communicate()[0].decode()
    _result = result.split('\n')
    results = [r.split('<')[0] for r in _result if r.endswith('>')]
    return _choose_weighted_lemmas(results)

In [17]:
%%time

print(lm_lemmatize('amor'))

amor
CPU times: user 4.18 ms, sys: 8.25 ms, total: 12.4 ms
Wall time: 76.5 ms


## Set up batch fst-infl

In [18]:
text = 'arma virumque cano'

In [199]:
def lm_lemmatize_batch(tokens):
    text = '\n'.join([token.lower() for token in tokens])
    cmd1 = ['echo', text]
    cmd2 = '/usr/local/bin/fst-infl /usr/local/bin/latmor/latmor.a'
    p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(shlex.split(cmd2), stdin=p1.stdout, stdout=subprocess.PIPE)
    result = p2.communicate()[0].decode()
    results = result.split('> ')[1:]
    lemmas = []
    for result in results[25:26]:
        if result:
            print(result)
            form = result.split('\n')[0]
            _lemmas = [r.split('<')[0] for r in result.split('\n') if r.endswith('>')]
            print(_lemmas)
            #if form == 'atque':
                #print(result)
            if _lemmas:
                lemmas.append(_lemmas)
            else:
                lemmas.append([form])
    return [_choose_weighted_lemmas(lemma) for lemma in lemmas]

In [200]:
%%time

results = lm_lemmatize_batch(tokenizer.tokenize(text))

atque
atque<CONJ>
at<PREP
['atque']
CPU times: user 5.79 ms, sys: 7.28 ms, total: 13.1 ms
Wall time: 517 ms


In [201]:
pprint(results)

['atque']


In [178]:
# Set up test text

# Sall. Bell. Cat. 1
text = """Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit. Sed nostra omnis vis in animo et corpore sita est: animi imperio, corporis servitio magis utimur; alterum nobis cum dis, alterum cum beluis commune est. Quo mihi rectius videtur ingeni quam virium opibus gloriam quaerere et, quoniam vita ipsa, qua fruimur, brevis est, memoriam nostri quam maxume longam efficere. Nam divitiarum et formae gloria fluxa atque fragilis est, virtus clara aeternaque habetur. Sed diu magnum inter mortalis certamen fuit, vine corporis an virtute animi res militaris magis procederet. Nam et, prius quam incipias, consulto et, ubi consulueris, mature facto opus est. Ita utrumque per se indigens alterum alterius auxilio eget.
"""

In [179]:
%%time

tokens = tokenizer.tokenize(text)
lemmas = lm_lemmatize_batch(tokens)

que<PART>
> ventri
venter<N><masc><sg><dat>
> oboedientia
oboedire<V><part><pres><active><neut><pl><nom>
oboedire<V><part><pres><active><neut><pl><acc>
oboedire<V><part><pres><active><neut><pl><voc>
oboediens<ADJ><positive><neut><pl><nom>
oboediens<ADJ><positive><neut><pl><acc>
oboediens<ADJ><positive><neut><pl><voc>
oboedientia<N><fem><sg><nom>
oboedientia<N><fem><sg><abl>
oboedientia<N><fem><sg><voc>
> finxit
fingere<V><perf><ind><active><sg><3>
> .
no result for .
> sed
sed<CONJ>
> nostra
noster<PRO><poss><fem><sg><nom>
noster<PRO><poss><fem><sg><abl>
noster<PRO><poss><fem><sg><voc>
noster<PRO><poss><neut><pl><nom>
noster<PRO><poss><neut><pl><acc>
noster<PRO><poss><neut><pl><voc>
> omnis
omnis<ADJ><positive><masc><sg><nom>
omnis<ADJ><positive><masc><sg><gen>
omnis<ADJ><positive><masc><sg><voc>
omnis<ADJ><positive><masc><pl><acc><alt>
omnis<ADJ><positive><fem><sg><nom>
omnis<ADJ><positive><fem><sg><gen>
omnis<ADJ><positive><fem><sg><voc>
omnis<ADJ><positive><fem><pl><acc><alt>
omnis<

In [174]:
print(len(tokens))
print(len(lemmas))
print(lemmas)

151
9
['homo', ',', 'qui', 'sese', 'studere', 'praestare', 'ceterus', 'animalis', ',']


## Performance

In [66]:
# Tagging performance on the entirety of Sallust's *Bellum Catilinum*

from cltk.corpus.latin import latinlibrary
bc = latinlibrary.raw('sall.1.txt')
bc = bc[bc.find('[1]'):bc.find('Sallust The Latin Library The Classics Page')]

In [261]:
# Script for preprocessing texts

import html
import re
import string
from cltk.stem.latin.j_v import JVReplacer

def preprocess(text):
    
    replacer = JVReplacer()
    
    text = html.unescape(text) # Handle html entities
    text = re.sub(r'&nbsp;?', ' ',text) #&nbsp; stripped incorrectly in corpus?
    text = re.sub(r'\x00',' ',text) #Another space problem?
        
    text = text.lower()
    text = replacer.replace(text) #Normalize u/v & i/j    
    
    punctuation ="\"#$%&\'()*+,-/:;<=>@[\]^_`{|}~.?!«»—"
    translator = str.maketrans({key: " " for key in punctuation})
    text = text.translate(translator)
    
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)
    
    text = re.sub('[ ]+',' ', text) # Remove double spaces
    text = re.sub('\s+\n+\s+','\n', text) # Remove double lines and trim spaces around new lines
    
    return text.strip()

In [262]:
# Preprocess text

bc = preprocess(bc)

In [263]:
%%time

results = lm_lemmatize_batch(tokenizer.tokenize(bc))

CPU times: user 369 ms, sys: 77 ms, total: 446 ms
Wall time: 8.81 s


In [264]:
print(lemmas)

['omnis', 'homo', 'qui', 'sese', 'studere', 'praestare', 'ceterus', 'animalis', 'summus', 'ops', 'niti', 'decere', 'ne', None, 'silentium', 'transire', None, 'pecus', 'qui', 'nare', 'pronus', 'at', None, 'oboedire', 'fingere', 'sed', 'noster', 'omnis', None, 'in', 'animus', 'et', 'corpus', 'sinere', 'esse', 'animus', 'imperium', 'corpus', None, 'magus', 'utere', 'alter', 'ego', 'cum', 'dis', 'alter', 'cum', 'beluus', 'commune', 'esse', 'quis', 'ego', 'rectus', None, 'ingenium', 'quis', None, 'ops', 'gloria', 'quaerere', 'et', 'quoniam', None, 'ipse', 'qui', 'frui', None, 'esse', 'memoria', 'noster', 'quis', 'maxumus', 'longa', 'efficere', 'nam', None, 'et', 'forma', 'glorius', 'fluxus', 'at', 'fragilis', 'esse', None, 'clarus', 'aeternus', None, 'habere', 'sed', 'diu', 'magnus', 'inter', 'mortalis', 'certamen', 'esse', None, 'corpus', 'an', None, 'animus', 'res', 'militaris', 'magus', 'procedere', 'nam', 'et', 'prae', 'quis', 'incipere', 'consultus', 'et', 'ubi', 'consulere', 'maturus'

## Help with installing LatMor

The installation instructions for Treetagger (at least on OSX) are reasonably clear. What I offer here is primarily documentation of how I prefer to install Treetagger with specific attention to working with Latin.

1. Download all of the Treetagger files, i.e. (again for OSX)
    - tree-tagger-MacOSX-3.2.tar.gz
    - tagger-scripts.tar.gz
    - install-tagger.sh
2. Download the Latin parameters file. NB: There are two Latin files—for this notebook I am using G. Brandolini's file (latin-par-linux-3.2.bin.gz)
3. Unzip tree-tagger-MacOSX-3.2.tar.gz
4. Rename this folder ```treetagger``` and put the other three (3) files inside. You should not unzip the other files.
5. Move this folder to ```/usr/local/bin```; a command like ```mv ./treetagger /usr/local/bin``` should work.
6. Change directory to ```/usr/local/bin/treetagger``` and run the install script, i.e. ```sh install-tagger.sh```
7. You should be all set now—try it out with the following:
    - ```echo 'Salve munde!' | cmd/tree-tagger-latin```
    - Output
        - ```Salve	V:IMP	salveo```
        - ```munde	N:voc	mundus```
        - ```!	SENT	!```
8. It is probably a good idea to add treetagger's location to PATH.
    - Open ~/.bash_profile (or the appropriate file for whatever you shell you are using) and add:
        - ```export PATH=/usr/local/bin/treetagger/cmd:/usr/local/bin/treetagger/bin:$PATH```
    - treetagger-python also requires that you add the following line to ~/.bash_profile:
        - ```export TREETAGGER_HOME='/usr/local/bin/treetagger/cmd/'``` 
    
Treetagger should now work as expected in the Notebooks above. If you notice any problems with the installation instructions, please open an issue in this repo.—PJB