# LatMor-based Backoff Lemmatizer

Here is what a [LatMor](#)-based sublemmatizer for the CLTK Latin Backoff Lemmatizer might look like. Just as in the earlier post on TreeTaggerBackoff, this creates a subclass of NLTK's `SequentialBackoffTagger` that gets lemma information by running *LatMor* via `subprocess`. To return a single result from *LatMor*, I have reused the weighting strategy from the earlier post, taking the lemma that appears most often among the analyses; again, this is not a sound, long-term strategy, only a temporary one.

The examples below are the same as in recent notebooks: the opening of Sallust's *Bellum Catilinae* and the opening stanzas of the nonsense Latin of a [*Jabberwocky* translation](http://www.thelatinlibrary.com/iabervocius.html). The first shows general performance; the second shows the flexibility of the Backoff Lemmatizer.

This backoff class requires *LatMor* and *SFST* to be installed. \[PJB 5.6.18]
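
Since the class depends on external tools, a quick check before use can save a confusing failure later. The following is a minimal sketch, assuming the `fst-infl` binary and the `latmor.a` transducer live at the paths hard-coded in the class below; the function name `latmor_available` is hypothetical and not part of CLTK or LatMor.

```python
import os
import shutil

def latmor_available(fst_infl='/usr/local/bin/fst-infl',
                     transducer='/usr/local/bin/latmor/latmor.a'):
    """Return True if both the SFST binary and the LatMor transducer are found.

    Both default paths are assumptions; adjust them to your installation.
    """
    has_binary = shutil.which(fst_infl) is not None  # executable found and runnable
    has_transducer = os.path.isfile(transducer)      # compiled transducer file exists
    return has_binary and has_transducer

print(latmor_available())
```

A check like this could be called from `__init__` to raise a clear error when the tools are missing.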

In [1]:
# Imports

import re
import subprocess
import shlex
from collections import Counter

from nltk.tag.sequential import SequentialBackoffTagger

from cltk.tokenize.word import WordTokenizer

from pprint import pprint

In [2]:
# Create instances of CLTK tools

tokenizer = WordTokenizer('latin')

In [3]:
# Create LatMorLemmatizer as subclass of NLTK's Sequential Backoff Tagger

from string import punctuation     

class LatMorLemmatizer(SequentialBackoffTagger):
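    """Sublemmatizer that gets lemmas by running LatMor through SFST's fst-infl"""
    def __init__(self, backoff=None):
        """Set up LatMorLemmatizer
        :param backoff: Next lemmatizer in the backoff chain, if any
        """
        # TODO: Add checks that LatMor and SFST are installed
        SequentialBackoffTagger.__init__(self, backoff)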
        

    def choose_tag(self, tokens, index, history):
        """Returns a lemma for the token at a given index using LatMor
        :param tokens: List of tokens to be lemmatized
        :param index: Int with current token
        :param history: List with tokens that have already been lemmatized
        :return: String, spec. the lemma found at the current index.
        """             
        if tokens[index] in punctuation: # LatMor does not handle punctuation; return punc immediately
            return tokens[index]
        
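        # Set up commands for subprocess.PIPE and build pipe
        # shlex.quote keeps tokens containing quotes or spaces from breaking shlex.split
        cmd1 = f'echo {shlex.quote(tokens[index].lower())}'
        cmd2 = '/usr/local/bin/fst-infl /usr/local/bin/latmor/latmor.a'
        p1 = subprocess.Popen(shlex.split(cmd1), stdout=subprocess.PIPE)
        p2 = subprocess.Popen(shlex.split(cmd2), stdin=p1.stdout, stdout=subprocess.PIPE)
        p1.stdout.close()  # Allow p1 to receive SIGPIPE if p2 exits first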
        
        # Get the stdout with communicate
        result = p2.communicate()[0].decode()
        
        # Parse the LatMor results
        result_lines = result.split('\n')
        results = [r.split('<')[0] for r in result_lines if r.endswith('>')]
        
        return max(results, key=results.count) if results else None  # Temporary: take the most frequent lemma; needs a better strategy
    
    
    def _batch_lemmatize(self, tokens):
        pass

    
    def lemmatize(self, tokens):
        # Setup global variable to build lemma list more quickly/efficiently?
        # ^^^ use _batch_lemmatize
        return self.tag(tokens) 
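
The parsing step in `choose_tag` can be tried on its own, without LatMor installed. The sketch below applies the same logic (keep lines ending in `>`, take the text before the first `<`, pick the most frequent lemma) to a made-up output string. The sample output and its tags are illustrative assumptions, not real LatMor output, and `parse_latmor` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical fst-infl output for one token; the tags are illustrative only
sample_output = """> homines
homo<N><pl><nom>
homo<N><pl><acc>
homo<N><pl><voc>
"""

def parse_latmor(output):
    """Return the most frequent lemma among the analyses, or None if there are none."""
    lines = output.split('\n')
    # Analysis lines end in '>'; the lemma is everything before the first '<'
    lemmas = [line.split('<')[0] for line in lines if line.endswith('>')]
    if not lemmas:
        return None
    return Counter(lemmas).most_common(1)[0][0]

print(parse_latmor(sample_output))
print(parse_latmor("> brilgum\nno result for brilgum\n"))
```

`Counter.most_common` and `max(results, key=results.count)` both pick the first-seen lemma on a tie, so this is the same temporary strategy as in the class, not a fix for it.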
 

In [4]:
# Set up lemmatizer

lemmatizer = LatMorLemmatizer()

In [5]:
# Sample text

text = """Omnis homines qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit. Sed nostra omnis vis in animo et corpore sita est: animi imperio, corporis servitio magis utimur; alterum nobis cum dis, alterum cum beluis commune est. Quo mihi rectius videtur ingeni quam virium opibus gloriam quaerere et, quoniam vita ipsa, qua fruimur, brevis est, memoriam nostri quam maxume longam efficere. Nam divitiarum et formae gloria fluxa atque fragilis est, virtus clara aeternaque habetur. Sed diu magnum inter mortalis certamen fuit, vine corporis an virtute animi res militaris magis procederet. Nam et, prius quam incipias, consulto et, ubi consulueris, mature facto opus est. Ita utrumque per se indigens alterum alterius auxilio eget.
"""

In [6]:
%%time

# Get lemmas

lemma_pairs = lemmatizer.lemmatize(tokenizer.tokenize(text))

CPU times: user 196 ms, sys: 487 ms, total: 683 ms
Wall time: 6.99 s
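
The gap between CPU time and wall time is probably due to `choose_tag` starting new processes for every token. One possible direction for the `_batch_lemmatize` stub is to send all tokens to `fst-infl` in a single call and split the output afterwards. The sketch below assumes that `fst-infl` prints a `> token` header line before each token's analyses. The helper names are hypothetical, and `batch_lookup` has not been tested against a real LatMor installation.

```python
import subprocess

# Assumed install locations, copied from the class above; adjust to your system
FST_INFL = '/usr/local/bin/fst-infl'
LATMOR = '/usr/local/bin/latmor/latmor.a'

def parse_batch_output(output):
    """Map each token to the list of lemmas found under its '> token' header."""
    lemmas = {}
    current = None
    for line in output.splitlines():
        if line.startswith('> '):
            current = line[2:]        # new token block begins
            lemmas[current] = []
        elif current is not None and line.endswith('>'):
            lemmas[current].append(line.split('<')[0])
    return lemmas

def batch_lookup(tokens):
    """Run fst-infl once for all tokens (hypothetical helper, untested against LatMor)."""
    text = '\n'.join(token.lower() for token in tokens) + '\n'
    done = subprocess.run([FST_INFL, LATMOR], input=text,
                          stdout=subprocess.PIPE, universal_newlines=True)
    return parse_batch_output(done.stdout)

# Made-up output for two tokens, used only to exercise the parser
sample = "> sunt\nesse<V>\n> brilgum\nno result for brilgum\n"
print(parse_batch_output(sample))
```

The resulting dictionary could then serve as a cache that `choose_tag` consults instead of launching a process per token.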


In [7]:
# Print sample

pprint(lemma_pairs[:10])

[('Omnis', 'omnis'),
 ('homines', 'homo'),
 ('qui', 'qui'),
 ('sese', 'sese'),
 ('student', 'studere'),
 ('praestare', 'praestare'),
 ('ceteris', 'ceterus'),
 ('animalibus', 'animalis'),
 (',', ','),
 ('summa', 'summus')]


In [8]:
# Another sample text

text = """ Est brilgum: tovi slimici
In vabo tererotitant
Brogovi sunt macresculi
Momi rasti strugitant.

"Fuge Gabrobocchia, fili mi,
Qui fero lacerat morsu:
Diffide Iubiubae avi
Es procul ab Unguimanu."""

In [9]:
%%time

# Get lemmas

lemma_pairs = lemmatizer.lemmatize(tokenizer.tokenize(text))

CPU times: user 40.3 ms, sys: 107 ms, total: 147 ms
Wall time: 1.46 s


In [10]:
# Print sample

pprint(lemma_pairs[:10])

[('Est', 'esse'),
 ('brilgum', None),
 (':', ':'),
 ('tovi', None),
 ('slimici', None),
 ('In', 'in'),
 ('vabo', None),
 ('tererotitant', None),
 ('Brogovi', None),
 ('sunt', 'esse')]


Note that on this first pass only real Latin words such as *sunt* and *in* (along with punctuation) are lemmatized; every nonsense word comes back as `None`. (Even *Est* is correctly lemmatized as *esse*!) These are again the function words needed to keep the nonsense Latin recognizable as Latin at all.
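
The `None` results matter because of how a backoff chain works: in NLTK's `SequentialBackoffTagger`, a tagger whose `choose_tag` returns `None` hands the token to its backoff tagger. The following is a plain-Python toy model of that idea, not NLTK's implementation. The `known` dictionary is a stand-in for LatMor lookups, the two ending rules are illustrative, and all helper names are hypothetical.

```python
import re

def make_regex_lemmatizer(patterns):
    """Return a function that applies the first matching (pattern, replacement) pair."""
    compiled = [(re.compile(pattern), repl) for pattern, repl in patterns]
    def lemmatize(token):
        for regex, repl in compiled:
            if regex.search(token):
                return regex.sub(repl, token)
        return None  # no rule applies; let the next lemmatizer try
    return lemmatize

def backoff_lemmatize(token, chain):
    """Try each lemmatizer in order; the first non-None answer wins."""
    for lemmatizer in chain:
        lemma = lemmatizer(token)
        if lemma is not None:
            return lemma
    return None

known = {'est': 'esse', 'sunt': 'esse', 'in': 'in'}  # stand-in for LatMor lookups

def lookup(token):
    return known.get(token.lower())

endings = make_regex_lemmatizer([(r'(.)(ant)$', r'\1o'), (r'(.)(um)$', r'\1us')])

chain = [lookup, endings]
print([(t, backoff_lemmatize(t, chain)) for t in ['Est', 'brilgum', 'strugitant', 'ab']])
```

The order of the chain is the design choice here: the most reliable source goes first, and looser guesses such as ending rules only see the tokens that everything before them declined.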

In [11]:
# Import additional lemmatizers/resources

from cltk.lemmatize.backoff import UnigramLemmatizer, RegexpLemmatizer
from cltk.lemmatize.latin.latin import latin_sub_patterns

In [12]:
# Same sample text with most punctuation removed

text = """ Est brilgum tovi slimici
In vabo tererotitant
Brogovi sunt macresculi
Momi rasti strugitant

"Fuge Gabrobocchia fili mi
Qui fero lacerat morsu
Diffide Iubiubae avi
Es procul ab Unguimanu"""

In [13]:
%%capture --no-display 
# ^^^ Ignore cell-specific warnings ^^^

# Set up a more expansive backoff chain:
# LatMor -> Latin regex patterns -> unigram model -> custom regex patterns

r = RegexpLemmatizer([
    ('(.)(ant)$', '\\1o'), 
    ('(.)(um)$', '\\1us'),
    ('(.)(i)$', '\\1us')
])

u = UnigramLemmatizer(model={'Momi': 'momus'}, backoff=r)
r = RegexpLemmatizer(latin_sub_patterns, backoff=u)
l = LatMorLemmatizer(backoff=r)

In [14]:
%%time

# Get lemmas
tokens = tokenizer.tokenize(text)
lemma_pairs = l.lemmatize(tokens)

CPU times: user 41.8 ms, sys: 102 ms, total: 143 ms
Wall time: 1.43 s


In [15]:
# Print sample

pprint(lemma_pairs[:10])

[('Est', 'esse'),
 ('brilgum', 'brilgus'),
 ('tovi', 'tovus'),
 ('slimici', 'slimicus'),
 ('In', 'in'),
 ('vabo', 'vo'),
 ('tererotitant', 'tererotito'),
 ('Brogovi', 'Brogovus'),
 ('sunt', 'esse'),
 ('macresculi', 'macresculis')]
