# Collatinus-based Backoff Lemmatizer

Here is what a [Collatinus](#)-based sublemmatizer for the CLTK Latin Backoff Lemmatizer might look like. As in the earlier post on TreeTaggerBackoff, this creates a subclass of the sequential backoff tagger that gets its lemma information from the *PyCollatinus* package. The `lemmatize` method takes a list of tokens (like all of the CLTK backoff lemmatizers), joins them into a single string, and runs that string through *PyCollatinus*. There is an additional step of reintroducing punctuation, which the *PyCollatinus* `Lemmatiseur` class ignores. What to do with tokens that come back as `None` is still an open question. To return a single result from *PyCollatinus*, I have used the weighting strategy from the earlier post; again, this is only a temporary strategy, not a sound long-term one.
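
To make the weighting strategy concrete, here is a minimal sketch that picks, for each token, the lemma proposed most often among its analyses. The `analyses` data and the `most_frequent_lemma` helper are invented for illustration; the only thing taken from the class below is the assumption that each analysis is a dict with a `'lemma'` key.

```python
from collections import Counter

# Hypothetical analyses for three tokens (invented data). Each analysis is
# assumed to be a dict with a 'lemma' key, as in _choose_weighted_lemmas.
analyses = [
    [{"lemma": "sum"}, {"lemma": "edo"}, {"lemma": "sum"}],  # two votes for 'sum'
    [{"lemma": "in"}],                                       # a single analysis
    [],                                                      # no analysis found
]


def most_frequent_lemma(token_analyses):
    """Return the lemma proposed most often, or None if there are no analyses."""
    counts = Counter(analysis["lemma"] for analysis in token_analyses)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


chosen = [most_frequent_lemma(token_analyses) for token_analyses in analyses]
print(chosen)  # ['sum', 'in', None]
```

Counting votes this way says nothing about which analysis is right in context, which is why it can only be a stopgap.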

<!---
When used on its own, the results are what you would expect from [TreeTagger](https://github.com/diyclassics/lemmatizer-experiments/blob/master/notebooks/latin-lemmatization-with-treetagger.ipynb). This is shown in the first example below using the first paragraph of Sallust's *Bellum Catilinae*.

The next two examples—using the nonsense Latin text of a [*Jabberwocky* translation](http://www.thelatinlibrary.com/iabervocius.html)—shows the flexibility of the Backoff Lemmatizer and how the TreeTaggerLemmatizer results can improved upon.

In the first example, we create a chain of TreeTaggerLemmatizer and RegexpLemmatizer and see that a nonsense word like *vabo* is reasonably(?) lemmatized to *vo*, because the Latin regex substitution patterns include *-abo > -o*. *Vabo* is the hapaxest of hapax legomena—it has never appeared in another Latin text and so will not be found in any lemma dictionary. (It is also admittedly wrong—*vabo* here is clearly the ablative of a noun *vabus* or *vabum* in the phrase *in vabo*. We will fix this in a future post by combining lemmatization with POS-tagging.)
--->

The next example expands the backoff chain to include additional regex patterns and a custom lemma dictionary (using the UnigramLemmatizer) to lemmatize fully the first ten tokens in the *Jabberwocky*.

This backoff class requires the *PyCollatinus* package. \[PJB 5.6.18]

In [9]:
# Imports

from collections import Counter

from nltk.tag.sequential import SequentialBackoffTagger

from cltk.tokenize.word import WordTokenizer

from pycollatinus import Lemmatiseur

from pprint import pprint

In [7]:
# Create instances of CLTK tools

tokenizer = WordTokenizer('latin')

In [124]:
# Create CollatinusLemmatizer as subclass of NLTK's Sequential Backoff Tagger

class CollatinusLemmatizer(SequentialBackoffTagger):
    """Backoff lemmatizer that gets its lemmas from PyCollatinus."""
    def __init__(self, backoff=None):
        """Setup for CollatinusLemmatizer()."""
        SequentialBackoffTagger.__init__(self, backoff)
        self.tagger = Lemmatiseur()
        self._lemmas = []


    def choose_tag(self, tokens, index, history):
        """Returns the lemma at the index in the _lemmas list created
        by PyCollatinus in lemmatize.
        :param tokens: List of tokens to be lemmatized
        :param index: Int with current token
        :param history: List with tokens that have already been lemmatized
        :return: String, spec. the lemma found at the current index.
        """
        print('Coll')  # Debug trace: shows each call to this tagger
        return self._lemmas[index]
    
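    def _choose_weighted_lemmas(self, results):
        """For each token, return the lemma that occurs most often among
        its analyses, or None if there are no analyses.
        :param results: One list of analyses (dicts with a 'lemma' key) per token
        :return: List with one lemma (or None) per token
        """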
        # Extract lemmas
        lemmas = []

        for result in results:
            _lemmas = []
            for _result in result:
                _lemmas.append(_result['lemma'])
            lemmas.append(_lemmas)
        
        # Get weighted lemmas
        weighted_lemmas = []

        for lemma in lemmas:
            c = Counter(lemma)
            weights = [(i, c[i] / len(lemma) * 100.0) for i in c]
            weighted_lemmas.append(weights)
            
        # Get max lemma
        
        lemma_max = []

        for weighted_lemma in weighted_lemmas:
            if weighted_lemma:
                weight_max = max(weighted_lemma,key=lambda item:item[1])[0]
            else:
                weight_max = None
            lemma_max.append(weight_max)
            
        return lemma_max

    
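    def _align_tokens_lemmas(self, tokens, lemmas):
        """Pair each token with its lemma; punctuation tokens, which
        PyCollatinus skips, are paired with themselves.
        :param tokens: List of tokens, including punctuation
        :param lemmas: List of lemmas for the non-punctuation tokens
        :return: List of (token, lemma) tuples
        """
        from string import punctuation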

        lemma_pairs = []

        pos = 0
        for token in tokens:
            if token in punctuation:
                lemma_pairs.append((token, token))
            else:
                lemma_pairs.append((token, lemmas[pos]))
                pos += 1
        
        return lemma_pairs
        
    
    def lemmatize(self, tokens):
        """Lemmatize a list of tokens with PyCollatinus, then hand the
        results to the backoff chain via tag()."""
        text = " ".join([token.lower() for token in tokens])
        results = self.tagger.lemmatise_multiple(text)
        lemmas = self._choose_weighted_lemmas(results)
        # Reinsert punctuation, which Lemmatiseur skips
        lemma_pairs = self._align_tokens_lemmas(tokens, lemmas)
        self.tokens = [token for token, _ in lemma_pairs]
        self._lemmas = [lemma for _, lemma in lemma_pairs]
        return self.tag(self.tokens)
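
The punctuation step can be tried out on its own. The sketch below is a standalone variation on `_align_tokens_lemmas`, not the method itself: the function name and sample data are hypothetical, and unlike the method it treats multi-character punctuation as punctuation and pads with `None` if the lemmas run out.

```python
from string import punctuation


def align_tokens_lemmas(tokens, lemmas):
    """Pair each token with its lemma; punctuation tokens map to themselves."""
    lemma_iter = iter(lemmas)
    pairs = []
    for token in tokens:
        if token and all(char in punctuation for char in token):
            # Punctuation never reached the lemmatizer, so reuse the token
            pairs.append((token, token))
        else:
            # Take the next lemma in order; None if the lemmas are exhausted
            pairs.append((token, next(lemma_iter, None)))
    return pairs


tokens = ["In", "vabo", ",", "sunt", "."]
lemmas = ["in", None, "sum"]  # hypothetical output with punctuation skipped
pairs = align_tokens_lemmas(tokens, lemmas)
print(pairs)
# [('In', 'in'), ('vabo', None), (',', ','), ('sunt', 'sum'), ('.', '.')]
```

The alignment only works if the lemmatizer returns exactly one result per non-punctuation token, in order; a mismatch there shifts every later lemma.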
 

In [59]:
%%capture --no-display 
# ^^^ Ignore cell-specific warnings ^^^

# Set up lemmatizer

lemmatizer = CollatinusLemmatizer()

In [60]:
# Sample text

text = """Omnis homines qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit. Sed nostra omnis vis in animo et corpore sita est: animi imperio, corporis servitio magis utimur; alterum nobis cum dis, alterum cum beluis commune est. Quo mihi rectius videtur ingeni quam virium opibus gloriam quaerere et, quoniam vita ipsa, qua fruimur, brevis est, memoriam nostri quam maxume longam efficere. Nam divitiarum et formae gloria fluxa atque fragilis est, virtus clara aeternaque habetur. Sed diu magnum inter mortalis certamen fuit, vine corporis an virtute animi res militaris magis procederet. Nam et, prius quam incipias, consulto et, ubi consulueris, mature facto opus est. Ita utrumque per se indigens alterum alterius auxilio eget.
"""

In [21]:
%%time

# Get lemmas

lemma_pairs = lemmatizer.lemmatize(tokenizer.tokenize(text))

CPU times: user 70.1 ms, sys: 2.43 ms, total: 72.5 ms
Wall time: 70.8 ms


In [22]:
# Print sample

pprint(lemma_pairs[:10])

[('Omnis', 'omnis'),
 ('homines', 'homo'),
 ('qui', 'qui'),
 ('sese', 'se'),
 ('student', 'studeo'),
 ('praestare', 'praesto'),
 ('ceteris', 'ceteri'),
 ('animalibus', 'animalis'),
 (',', ','),
 ('summa', 'summus')]


In [76]:
# Another sample text

text = """ Est brilgum: tovi slimici
In vabo tererotitant
Brogovi sunt macresculi
Momi rasti strugitant.

"Fuge Gabrobocchia, fili mi,
Qui fero lacerat morsu:
Diffide Iubiubae avi
Es procul ab Unguimanu."""

In [24]:
%%time

# Get lemmas

lemma_pairs = lemmatizer.lemmatize(tokenizer.tokenize(text))

CPU times: user 14.1 ms, sys: 276 µs, total: 14.3 ms
Wall time: 14.2 ms


In [25]:
# Print sample

pprint(lemma_pairs[:10])

[('Est', 'edo'),
 ('brilgum', None),
 (':', ':'),
 ('tovi', None),
 ('slimici', None),
 ('In', 'in'),
 ('vabo', None),
 ('tererotitant', None),
 ('Brogovi', None),
 ('sunt', 'sum')]


Note that only words like *sunt* and *in* (and punctuation) are lemmatized on the first pass. (*Est* is lemmatized too, though incorrectly: as *edo* rather than *sum*.) These words are again the functional vocabulary that keeps the nonsense Latin recognizable as Latin at all.

In [61]:
# Import additional lemmatizers/resources

from cltk.lemmatize.backoff import UnigramLemmatizer, RegexpLemmatizer
from cltk.lemmatize.latin.latin import latin_sub_patterns

In [125]:
%%capture --no-display 
# ^^^ Ignore cell-specific warnings ^^^

# Set up a more expansive backoff chain

r = RegexpLemmatizer([
    ('(.)(ant)$', '\\1o'), 
    ('(.)(um)$', '\\1us'),
    ('(.)(i)$', '\\1us')
])
u = UnigramLemmatizer(model={'Momi': 'momus'}, backoff=r)
backoff = RegexpLemmatizer(latin_sub_patterns, backoff=u)
lemmatizer = CollatinusLemmatizer(backoff=backoff)

In [126]:
%%time

# Get lemmas

lemma_pairs = lemmatizer.lemmatize(tokenizer.tokenize(text))

Coll
[... 'Coll' printed 35 times in total ...]
CPU times: user 13 ms, sys: 1.06 ms, total: 14.1 ms
Wall time: 13.2 ms


In [127]:
# Print sample

pprint(lemma_pairs[:10])

[('Est', 'sum'),
 ('brilgum', 'brilgus'),
 (':', ':'),
 ('tovi', 'tovus'),
 ('slimici', 'slimicus'),
 ('In', 'in'),
 ('vabo', 'vo'),
 ('tererotitant', 'tererotito'),
 ('Brogovi', 'Brogovus'),
 ('sunt', 'sum')]


Note that all ten of the first ten tokens of the *Jabberwocky* translation have been lemmatized plausibly, if not correctly (whatever "correctly" means when lemmatizing nonsense poetry).
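
To see what the three custom patterns do on their own, here is a small sketch using Python's `re` module directly. It assumes that the first matching pattern wins, which may not be exactly how `RegexpLemmatizer` applies them; `regex_lemma` is a hypothetical helper.

```python
import re

# The three custom patterns passed to RegexpLemmatizer above
patterns = [
    (r"(.)(ant)$", r"\1o"),
    (r"(.)(um)$", r"\1us"),
    (r"(.)(i)$", r"\1us"),
]


def regex_lemma(token, patterns):
    """Apply the first matching pattern; return None if nothing matches."""
    for pattern, replacement in patterns:
        if re.search(pattern, token):
            return re.sub(pattern, replacement, token)
    return None


for token in ["tererotitant", "brilgum", "tovi", "slimici", "vabo"]:
    print(token, "->", regex_lemma(token, patterns))
# tererotitant -> tererotito
# brilgum -> brilgus
# tovi -> tovus
# slimici -> slimicus
# vabo -> None
```

*Vabo* matches none of these three patterns; in the chain above it is presumably caught earlier by `latin_sub_patterns`.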

In [132]:
# Create TreeTaggerLemmatizer as subclass of NLTK's Sequential Backoff Tagger

from treetagger import TreeTagger

class TreeTaggerLemmatizer(SequentialBackoffTagger):
    """Backoff lemmatizer that gets its lemmas from TreeTagger."""
    def __init__(self, backoff=None):
        """Setup for TreeTaggerLemmatizer()."""
        SequentialBackoffTagger.__init__(self, backoff)
        self.tagger = TreeTagger(language='latin') # Error trap to see if module is installed!
        self._lemmas = []
        
        
    def choose_tag(self, tokens, index, history):
        """Returns the lemma at the index in the _lemmas list created
        by TreeTagger in lemmatize.
        :param tokens: List of tokens to be lemmatized
        :param index: Int with current token
        :param history: List with tokens that have already been lemmatized
        :return: String, spec. the lemma found at the current index.
        """
        print('TT')
        return self._lemmas[index]    
    
    def lemmatize(self, tokens):
        print(tokens)
        text = " ".join([token.lower() for token in tokens])
        lemmas = []
        for _, _, lemma in self.tagger.tag(text):
            if lemma == '<unknown>':
                lemmas.append(None)
            else:
                lemmas.append(lemma.split('|')[0])
        self._lemmas = lemmas
        return self.tag(tokens)
    
    
# Note: lemmatize() is only called once, on the first tagger in the backoff chain.
# Backoff taggers only ever have choose_tag() called on them, so any other
# processing code needs to be in choose_tag() and possibly tag()!

In [129]:
# Another sample text

text = """ Est brilgum tovi slimici
In vabo tererotitant
Brogovi sunt macresculi
Momi rasti strugitant

"Fuge Gabrobocchia fili mi
Qui fero lacerat morsu
Diffide Iubiubae avi
Es procul ab Unguimanu"""

In [133]:
%%capture --no-display 
# ^^^ Ignore cell-specific warnings ^^^

# Set up a backoff chain: CollatinusLemmatizer backed off to TreeTaggerLemmatizer

# NB: the regex lemmatizers below are built but not attached to the chain
r = RegexpLemmatizer([
    ('(.)(ant)$', '\\1o'), 
    ('(.)(um)$', '\\1us'),
    ('(.)(i)$', '\\1us')
])
#u = UnigramLemmatizer(model={'Momi': 'momus'}, backoff=r)
r = RegexpLemmatizer(latin_sub_patterns, backoff=u)  # reuses u from the earlier cell
t = TreeTaggerLemmatizer(backoff=None)
lemmatizer = CollatinusLemmatizer(backoff=t)


In [134]:
%%time

# Get lemmas
tokens = tokenizer.tokenize(text)
lemma_pairs = lemmatizer.lemmatize(tokens)

Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
Coll
TT


IndexError: list index out of range
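
The `IndexError` above is what one would expect if the backoff tagger's `_lemmas` list is still empty when its `choose_tag` is called. The sketch below is a simplified, pure-Python stand-in for the sequential backoff logic (not NLTK's actual code; `MiniBackoff` and its toy lemma rule are invented) that reproduces the problem.

```python
class MiniBackoff:
    """Simplified stand-in for a sequential backoff tagger (illustration only)."""

    def __init__(self, name, backoff=None):
        self.name = name
        self._lemmas = []
        # The chain is this tagger followed by its backoff's chain
        self._taggers = [self] + (backoff._taggers if backoff else [])

    def choose_tag(self, tokens, index, history):
        # Assumes lemmatize() has already filled _lemmas for this tagger
        return self._lemmas[index]

    def tag(self, tokens):
        tags = []
        for index in range(len(tokens)):
            tag = None
            # Try each tagger in turn until one returns a non-None tag
            for tagger in self._taggers:
                tag = tagger.choose_tag(tokens, index, tags)
                if tag is not None:
                    break
            tags.append(tag)
        return list(zip(tokens, tags))

    def lemmatize(self, tokens):
        # Toy rule: lowercase tokens are "known", everything else is None.
        # Only the tagger on which lemmatize() is called fills its _lemmas.
        self._lemmas = [t if t.islower() else None for t in tokens]
        return self.tag(tokens)


second = MiniBackoff("second")
first = MiniBackoff("first", backoff=second)

try:
    first.lemmatize(["est", "Brilgum"])
    error = None
except IndexError as exc:
    error = exc

print(type(error).__name__)  # IndexError
```

In this model the first `None` from the front tagger sends the token to the backoff tagger, whose `_lemmas` was never populated because its own `lemmatize` was never called.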

