# Latin Lemmatization with LatMor

[LatMor](http://www.cis.uni-muenchen.de/~schmid/tools/LatMor/) is a morphological analyzer for Latin written by Uwe Springmann, Helmut Schmid, and Dietmar Najock in 2014. It is implemented as a finite-state transducer built with the [SFST toolkit](http://www.cis.uni-muenchen.de/~schmid/tools/SFST/) and is described in this [paper](https://www.degruyter.com/view/j/opli.2016.2.issue-1/opli-2016-0019/opli-2016-0019.xml). As with TreeTagger, lemmatization is a by-product of the morphological analysis. But LatMor is comprehensive and *very* fast, and so it is useful for getting lemma information.

There is no Python wrapper for LatMor. This notebook moves in that direction, using ```subprocess``` to generate command line results and then parsing them. In a future post, I will build a wrapper for LatMor that can be used with the CLTK Backoff Lemmatizer.

As in the [Collatinus post](https://github.com/diyclassics/lemmatizer-experiments/blob/master/notebooks/latin-lemmatization-with-collatinus.ipynb), I use a form-frequency measure (i.e., choosing the lemma that appears in the largest number of candidate analyses for a form) in order to extract a single lemma. This is still a problematic measure, and one to be improved in a future post. LatMor results present additional challenges. As with Collatinus, punctuation is ignored. More noticeably problematic, at least from the perspective of comparing lemmatizer results, is that Latin verbs are lemmatized to their infinitive (i.e., their second principal part). Accordingly, the lemmatizer results need to be lemmatized themselves to the first principal part if they are to be compared to, say, the TreeTagger or Collatinus results. Another task for another day.

The last section of this post offers assistance with installation and configuration of LatMor for OSX. \[PJB 5.11.18\]

In [1]:
# # Install LatMor & SFST

# # See last cell for more information.

In [2]:
# Imports

import re
import subprocess
import shlex

from collections import Counter

from cltk.tokenize.word import WordTokenizer

from pprint import pprint

In [3]:
# Create instances of CLTK tools

tokenizer = WordTokenizer('latin')

## Working with *LatMor* and jupyter bash

In [4]:
# Use IPython's `!` shell escape to run bash commands and capture the output

test = 'laudat'
cmd = f'echo {test} | fst-infl /usr/local/bin/latmor/latmor.a'
result = !$cmd

print(result)

['reading transducer from file "/usr/local/bin/latmor/latmor.a"...', 'finished.', '> laudat', 'laudare<V><pres><ind><active><sg><3>']


In [5]:
# Isolate the result string

print(result[-1])

laudare<V><pres><ind><active><sg><3>


In [6]:
# Use regular expressions to parse result string for lemma information

m = re.search('(.+?)<', result[-1])
print(m.group(1))

laudare
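
The result string also carries the morphological tags, which can be pulled out in the same pass. What follows is a minimal sketch, assuming every analysis has the `lemma<tag><tag>...` shape seen in the output above; `parse_analysis` is a hypothetical helper name, not part of LatMor or CLTK.

```python
import re

def parse_analysis(analysis):
    """Split one LatMor analysis line into (lemma, [tags]).

    Assumes the 'lemma<tag><tag>...' shape shown in the output above.
    """
    lemma = analysis.split('<')[0]               # everything before the first tag
    tags = re.findall(r'<([^<>]+)>', analysis)   # contents of each <...> group
    return lemma, tags

print(parse_analysis('laudare<V><pres><ind><active><sg><3>'))
```

For the *laudat* analysis this prints `('laudare', ['V', 'pres', 'ind', 'active', 'sg', '3'])`.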


In [7]:
# Same as above, but for a token that has more than one possible lemma

test = 'amor'
cmd = f'echo {test} | fst-infl /usr/local/bin/latmor/latmor.a'
result = !$cmd

pprint(result)

['reading transducer from file "/usr/local/bin/latmor/latmor.a"...',
 'finished.',
 '> amor',
 'amare<V><pres><ind><passive><sg><1>',
 'amor<N><masc><sg><nom>',
 'amor<N><masc><sg><voc>']


In [8]:
# Isolate the result strings

results = [r.split('<')[0] for r in result if r.endswith('>')]
pprint(results)

['amare', 'amor', 'amor']


## Working with *LatMor* and ```subprocess```

In [9]:
# Set up subprocess commands

token = 'amor'
cmd1 = f'echo {token}'
cmd2 = '/usr/local/bin/fst-infl /usr/local/bin/latmor/latmor.a'

In [10]:
%%time

# Build subprocess pipe
# NB: I'll write up shlex and subprocess.PIPE at some point

p1 = subprocess.Popen(shlex.split(cmd1), stdout=subprocess.PIPE)
p2 = subprocess.Popen(shlex.split(cmd2), stdin=p1.stdout, stdout=subprocess.PIPE)
result = p2.communicate()[0].decode()
print(result)

> amor
amare<V><pres><ind><passive><sg><1>
amor<N><masc><sg><nom>
amor<N><masc><sg><voc>

CPU times: user 2.79 ms, sys: 8.38 ms, total: 11.2 ms
Wall time: 90 ms
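
In brief, `shlex.split` turns a command string into the argument list that `Popen` expects, and `subprocess.PIPE` connects one process's output to the next one's input. A possible alternative is to drop the `echo` process and hand the text to `subprocess.run` via `input=`. This is only a sketch, assuming that `fst-infl` reads tokens from standard input exactly as it does in the pipe above. `run_with_stdin` is a hypothetical helper, and the demonstration uses a Python one-liner as a stand-in filter so that it runs without LatMor installed.

```python
import shlex
import subprocess
import sys

# shlex.split breaks a command string into an argument list, honoring shell-style quoting
print(shlex.split('/usr/local/bin/fst-infl /usr/local/bin/latmor/latmor.a'))

def run_with_stdin(cmd, text):
    """Run cmd (a list of arguments), send text to its stdin, return its stdout as str."""
    completed = subprocess.run(cmd, input=text, capture_output=True, text=True, check=True)
    return completed.stdout

# Stand-in filter that copies stdin to stdout. With LatMor installed, cmd would
# instead be the argument list printed above (an assumption, not tested here).
stand_in = [sys.executable, '-c', 'import sys; sys.stdout.write(sys.stdin.read())']
print(run_with_stdin(stand_in, 'amor\n'))
```

Passing the text through `input=` also sidesteps any shell quoting of the token itself, since nothing is interpolated into a command string.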


In [11]:
# Parse results

_result = result.split('\n')
results = [r.split('<')[0] for r in _result if r.endswith('>')]
print(results)

['amare', 'amor', 'amor']


In [12]:
## Build a lemmatize function; use form frequency to return a single lemma

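def _choose_weighted_lemmas(lemmas):
    """Return the most frequent item in `lemmas` (ties go to the first one seen), or None if there are no items."""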
    
    c = Counter(lemmas)
    weights = [(i, c[i] / len(lemmas) * 100.0) for i in c]
    if weights:
        return max(weights,key=lambda item:item[1])[0]
    else:
        return None

def lm_lemmatize(token):
    cmd1 = f'echo {token}'
    cmd2 = '/usr/local/bin/fst-infl /usr/local/bin/latmor/latmor.a'
    p1 = subprocess.Popen(shlex.split(cmd1), stdout=subprocess.PIPE)
    p2 = subprocess.Popen(shlex.split(cmd2), stdin=p1.stdout, stdout=subprocess.PIPE)
    result = p2.communicate()[0].decode()
    _result = result.split('\n')
    results = [r.split('<')[0] for r in _result if r.endswith('>')]
    return _choose_weighted_lemmas(results)

In [13]:
%%time

print(lm_lemmatize('amor'))

amor
CPU times: user 5.1 ms, sys: 9.29 ms, total: 14.4 ms
Wall time: 123 ms
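
To make the form-frequency choice concrete: the lemma returned is the one that appears most often among the analyses, and a tie goes to whichever lemma was listed first. Here is a minimal sketch of the same selection with `Counter.most_common`. `most_frequent_lemma` is a hypothetical name, and the tie behavior assumes Python 3.7 or later, where equal counts keep first-seen order.

```python
from collections import Counter

def most_frequent_lemma(lemmas):
    """Return the most frequent lemma, or None for an empty list."""
    if not lemmas:
        return None
    return Counter(lemmas).most_common(1)[0][0]

print(most_frequent_lemma(['amare', 'amor', 'amor']))  # amor
print(most_frequent_lemma(['amare', 'amor']))          # amare (tie: first listed wins)
print(most_frequent_lemma([]))                         # None
```

The tie case shows why the measure is problematic: with one verbal and one nominal analysis, the answer depends only on output order.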


## Set up batch fst-infl

*LatMor* is fast. But it would still be costly to make a separate call to *LatMor* via ```subprocess``` for every token. Luckily, *LatMor* can work with batches of tokens. The tokens need to be passed as a single string with one token per line, i.e., separated by newline ('\n') characters. We can accomplish this by joining a list of tokens with '\n' and passing the result to the ```echo``` command in ```subprocess```. A few things worth noting:
- tokens are lowercased; LatMor appears to be case-sensitive, though I have not fully tested this
- punctuation is ignored, so we check for punctuation marks and restore them to the lemma results
- LatMor returns a list of possible lemmas; this function returns the one with the highest frequency
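
The points above can be sketched on canned output, without calling LatMor. The sample string below is assembled from the `fst-infl` output shown earlier for *amor* and *laudat*, plus a comma block. The `no result for ,` line is an assumption about how unanalyzed input is reported; the parser relies only on such lines not ending in `>`. `parse_batch_output` is a hypothetical helper and a line-based variant of the idea, not the function used in this notebook.

```python
from collections import Counter
from string import punctuation

SAMPLE_OUTPUT = (
    '> amor\n'
    'amare<V><pres><ind><passive><sg><1>\n'
    'amor<N><masc><sg><nom>\n'
    'amor<N><masc><sg><voc>\n'
    '> ,\n'
    'no result for ,\n'  # assumed wording for an unanalyzed form
    '> laudat\n'
    'laudare<V><pres><ind><active><sg><3>\n'
)

def parse_batch_output(output):
    """Return one entry per '> form' block: the most frequent lemma,
    the punctuation mark itself, or None when nothing was analyzed."""
    blocks = []  # each item is [form, candidate_lemmas]
    for line in output.splitlines():
        if line.startswith('> '):
            blocks.append([line[2:], []])        # a new input form begins
        elif line.endswith('>') and blocks:
            blocks[-1][1].append(line.split('<')[0])  # analysis line: keep the lemma
    lemmas = []
    for form, candidates in blocks:
        if candidates:
            lemmas.append(Counter(candidates).most_common(1)[0][0])
        elif form and form in punctuation:
            lemmas.append(form)                  # restore the punctuation mark
        else:
            lemmas.append(None)
    return lemmas

print(parse_batch_output(SAMPLE_OUTPUT))
```

On the sample this prints `['amor', ',', 'laudare']`, one entry per input token and in input order.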

In [14]:
def lm_lemmatize_batch(tokens):
    from string import punctuation
    text = '\n'.join([token.lower() for token in tokens])
    cmd1 = ['echo', text]
    cmd2 = '/usr/local/bin/fst-infl /usr/local/bin/latmor/latmor.a'
    p1 = subprocess.Popen(cmd1, stdout=subprocess.PIPE)
    p2 = subprocess.Popen(shlex.split(cmd2), stdin=p1.stdout, stdout=subprocess.PIPE)
    result = p2.communicate()[0].decode()
    results = re.split(r'\B> ', result)[1:]
    lemmas = []
    for result in results:
        if result:
            form = result.split('\n')[0]
            _lemmas = [r.split('<')[0] for r in result.split('\n') if r.endswith('>')]
            if _lemmas:
                lemmas.append(_lemmas)
            elif form in punctuation:
                lemmas.append(form)
            else:
                lemmas.append(None)
    return [_choose_weighted_lemmas(lemma) for lemma in lemmas]

In [15]:
# Set up sample text

# Sall. Bell. Cat. 1
text = """Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit. Sed nostra omnis vis in animo et corpore sita est: animi imperio, corporis servitio magis utimur; alterum nobis cum dis, alterum cum beluis commune est. Quo mihi rectius videtur ingeni quam virium opibus gloriam quaerere et, quoniam vita ipsa, qua fruimur, brevis est, memoriam nostri quam maxume longam efficere. Nam divitiarum et formae gloria fluxa atque fragilis est, virtus clara aeternaque habetur. Sed diu magnum inter mortalis certamen fuit, vine corporis an virtute animi res militaris magis procederet. Nam et, prius quam incipias, consulto et, ubi consulueris, mature facto opus est. Ita utrumque per se indigens alterum alterius auxilio eget.
"""

In [16]:
%%time

# Tokenize and lemmatize sample text

tokens = tokenizer.tokenize(text)
lemmas = lm_lemmatize_batch(tokens)

CPU times: user 22.3 ms, sys: 13.2 ms, total: 35.5 ms
Wall time: 271 ms


In [17]:
pprint(list(zip(tokens, lemmas))[:10])

[('Omnis', 'omnis'),
 ('homines', 'homo'),
 (',', ','),
 ('qui', 'qui'),
 ('sese', 'sese'),
 ('student', 'studere'),
 ('praestare', 'praestare'),
 ('ceteris', 'ceterus'),
 ('animalibus', 'animalis'),
 (',', ',')]


## LatMor performance on larger text

In [18]:
# Tagging performance on the entirety of Sallust's *Bellum Catilinae*

from cltk.corpus.latin import latinlibrary
bc = latinlibrary.raw('sall.1.txt')
bc = bc[bc.find('[1]'):bc.find('Sallust The Latin Library The Classics Page')]

In [19]:
# Script for preprocessing texts

import html
import re
import string
from cltk.stem.latin.j_v import JVReplacer

def preprocess(text):
    
    replacer = JVReplacer()
    
    text = html.unescape(text) # Handle html entities
    text = re.sub(r'&nbsp;?', ' ',text) #&nbsp; stripped incorrectly in corpus?
    text = re.sub(r'\x00',' ',text) #Another space problem?
        
    text = text.lower()
    text = replacer.replace(text) #Normalize u/v & i/j    
    
    punctuation = "\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~.?!«»—"
    translator = str.maketrans({key: " " for key in punctuation})
    text = text.translate(translator)
    
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)
    
    text = re.sub('[ ]+',' ', text) # Remove double spaces
    text = re.sub(r'\s+\n+\s+', '\n', text) # Remove double lines and trim spaces around new lines
    
    return text.strip()

In [20]:
# Preprocess text

bc = preprocess(bc)
bc_tokens = tokenizer.tokenize(bc)
print(f'There are {len(bc_tokens)} tokens in Sallust\'s *Bellum catilinae*')

There are 10802 tokens in Sallust's *Bellum catilinae*


In [21]:
%%time

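# NB: this cell times only the first 25 tokens of the sample paragraph (`tokens`);
# pass `bc_tokens` instead to time the full text
results = lm_lemmatize_batch(tokens[:25])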

CPU times: user 3.44 ms, sys: 7.38 ms, total: 10.8 ms
Wall time: 109 ms
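
To put a number on throughput, the token count can be divided by the wall-clock time of one batch call. Here is a minimal sketch with `time.perf_counter`. `tokens_per_second` is a hypothetical helper, and the demonstration uses a trivial lowercasing function as a stand-in so that it runs without LatMor; in this notebook the callable would be `lm_lemmatize_batch` and the tokens `bc_tokens`.

```python
import time

def tokens_per_second(lemmatize, tokens):
    """Time one call of lemmatize(tokens); return (results, tokens per second)."""
    start = time.perf_counter()
    results = lemmatize(tokens)
    elapsed = time.perf_counter() - start
    rate = len(tokens) / elapsed if elapsed > 0 else float('inf')
    return results, rate

def stand_in(tokens):
    # Placeholder for a real lemmatizer: just lowercases each token
    return [t.lower() for t in tokens]

results, rate = tokens_per_second(stand_in, ['Omnis', 'homines', ','])
print(results, f'{rate:.0f} tokens/s')
```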


## Help with installing LatMor

The installation instructions for LatMor are included in the README inside the downloaded archive. What I offer here is primarily documentation of how I prefer to install LatMor, with specific attention to working with Latin.

1. Download and unzip LatMor
    - [LatMor.tar.gz](http://www.cis.uni-muenchen.de/~schmid/tools/LatMor/LatMor.tar.gz)
2. Move the unzipped folder to ```/usr/local/bin```; a command like ```mv ./LatMor /usr/local/bin``` should work.
3. Download and unzip [SFST](http://www.cis.uni-muenchen.de/~schmid/tools/SFST/)
    - NB: I used v. 1.4.7d for this Notebook
4. In the unzipped folder, change directory to ```src```
5. Run the following commands:
    - ```make```
    - ```make install```
    - ```make maninstall```
6. You should be all set now. Try it out with the following:
    - ```echo laudat | fst-infl /usr/local/bin/latmor/latmor.a```
    - Output
        - ```reading transducer from file "/usr/local/bin/latmor/latmor.a"...```
        - ```finished.```
        - ```> laudat```
        - ```laudare<V><pres><ind><active><sg><3>```
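
The setup can also be checked from Python before running the cells above. This is a sketch: `check_latmor_install` is a hypothetical helper, and the default transducer path is simply the one used throughout this notebook, so adjust it if you installed elsewhere.

```python
import shutil
from pathlib import Path

def check_latmor_install(binary='fst-infl',
                         transducer='/usr/local/bin/latmor/latmor.a'):
    """Report whether the fst-infl binary and the LatMor transducer can be found."""
    return {
        'fst-infl': shutil.which(binary),                 # full path, or None if not on PATH
        'transducer_found': Path(transducer).is_file(),   # True if the .a file exists
    }

print(check_latmor_install())
```

A `None` for `fst-infl` suggests that the SFST install step did not put the binary on your `PATH`; a `False` for `transducer_found` suggests that the LatMor folder is not where the notebook expects it.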
    
LatMor should now work as expected in the Notebooks above. If you notice any problems with the installation instructions, please open an issue in this repo.—PJB