# Latin Lemmatization with Collatinus

TreeTagger is a probabilistic, decision-tree-based part-of-speech tagger written by Helmut Schmid in 1994. It is described in this paper. Though originally written for tagging German, parameter files have since been written for a number of languages, including Latin. This notebook uses G. Brandolini's parameter file, which is based on a number of sources for Latin lexical and morphological data: PROIEL data, Perseus data, Index Thomisticus data, and Whitaker's Words.

Lemmatization is a by-product of TreeTagger's POS tagging, but a useful one. It runs quickly, performs well, and has two Python wrappers (shown below). This notebook introduces the two wrappers, treetaggerwrapper and treetagger-python, and gives example workflows and some execution-time information. The last section of this post offers help with installing and configuring TreeTagger on OSX. [PJB 5.7.18]

In [1]:
from pycollatinus import Lemmatiseur

from cltk.tokenize.word import WordTokenizer

from pprint import pprint

In [2]:
%%capture --no-display

analyzer = Lemmatiseur()

In [3]:
# Set up test text

# Sall. Bell. Cat. 1
text = """Omnis homines, qui sese student praestare ceteris animalibus, summa ope niti decet, ne vitam silentio transeant veluti pecora, quae natura prona atque ventri oboedientia finxit. Sed nostra omnis vis in animo et corpore sita est: animi imperio, corporis servitio magis utimur; alterum nobis cum dis, alterum cum beluis commune est. Quo mihi rectius videtur ingeni quam virium opibus gloriam quaerere et, quoniam vita ipsa, qua fruimur, brevis est, memoriam nostri quam maxume longam efficere. Nam divitiarum et formae gloria fluxa atque fragilis est, virtus clara aeternaque habetur. Sed diu magnum inter mortalis certamen fuit, vine corporis an virtute animi res militaris magis procederet. Nam et, prius quam incipias, consulto et, ubi consulueris, mature facto opus est. Ita utrumque per se indigens alterum alterius auxilio eget.
"""

In [4]:
# Create instances of CLTK tools

tokenizer = WordTokenizer('latin')
tokens = tokenizer.tokenize(text)
text_string = " ".join(tokens)

In [5]:
len(tokens)

151

In [6]:
%%time

results = analyzer.lemmatise_multiple(text_string)

CPU times: user 78.8 ms, sys: 1.21 ms, total: 80 ms
Wall time: 90.6 ms


In [7]:
results[1]

[{'form': 'homines',
  'lemma': 'homo',
  'morph': 'nominatif pluriel',
  'radical': 'homin',
  'desinence': 'es'},
 {'form': 'homines',
  'lemma': 'homo',
  'morph': 'vocatif pluriel',
  'radical': 'homin',
  'desinence': 'es'},
 {'form': 'homines',
  'lemma': 'homo',
  'morph': 'accusatif pluriel',
  'radical': 'homin',
  'desinence': 'es'}]
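
Each entry in `results` is a list of candidate analyses for one token, so the same lemma can appear several times with different morphology. Here is a minimal sketch of how those candidates could be grouped by lemma. It uses a sample copied by hand from the output above rather than a live call to pycollatinus, and the helper name `morph_by_lemma` is hypothetical.

```python
from collections import defaultdict

# Sample copied from the output above; a live run would pass results[1] instead.
sample = [
    {'form': 'homines', 'lemma': 'homo', 'morph': 'nominatif pluriel',
     'radical': 'homin', 'desinence': 'es'},
    {'form': 'homines', 'lemma': 'homo', 'morph': 'vocatif pluriel',
     'radical': 'homin', 'desinence': 'es'},
    {'form': 'homines', 'lemma': 'homo', 'morph': 'accusatif pluriel',
     'radical': 'homin', 'desinence': 'es'},
]


def morph_by_lemma(analyses):
    """Map each lemma to the list of morphological tags proposed for it."""
    grouped = defaultdict(list)
    for analysis in analyses:
        grouped[analysis['lemma']].append(analysis['morph'])
    return dict(grouped)


grouped_sample = morph_by_lemma(sample)
print(grouped_sample)
```

Seen this way, `homines` has a single lemma with three competing case readings, so it is unambiguous for lemmatization even though it is ambiguous morphologically.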

In [8]:
# Collect the candidate lemmas for each token, one inner list per token
lemmas = [[analysis['lemma'] for analysis in result] for result in results]

In [9]:
pprint(lemmas[:3])

[['omne',
  'omnis',
  'omnis',
  'omnis',
  'omnis',
  'omnis',
  'omnis',
  'omnis',
  'omnis',
  'omnis'],
 ['homo', 'homo', 'homo'],
 ['qui', 'qui', 'quis', 'queo', 'queo', 'qui']]


In [10]:
from collections import Counter

c = Counter(lemmas[0])
print([(i, c[i] / len(lemmas[0]) * 100.0) for i in c])

[('omne', 10.0), ('omnis', 90.0)]


In [11]:
weighted_lemmas = []

for lemma in lemmas:
    c = Counter(lemma)
    weights = [(i, c[i] / len(lemma) * 100.0) for i in c]
    weighted_lemmas.append(weights)

In [12]:
pprint(weighted_lemmas[:10])

[[('omne', 10.0), ('omnis', 90.0)],
 [('homo', 100.0)],
 [('qui', 50.0), ('quis', 16.666666666666664), ('queo', 33.33333333333333)],
 [('se', 100.0)],
 [('studeo', 100.0)],
 [('praesto', 100.0)],
 [('ceteri', 42.857142857142854),
  ('ceterum', 14.285714285714285),
  ('ceterus', 42.857142857142854)],
 [('animal', 25.0), ('animalis', 75.0)],
 [('summa', 23.076923076923077),
  ('summum', 23.076923076923077),
  ('summus', 46.15384615384615),
  ('summo', 7.6923076923076925)],
 [('Opis', 25.0), ('Ops', 25.0), ('ops', 25.0), ('opos', 25.0)]]


In [13]:
lemma_max = []

for weighted_lemma in weighted_lemmas:
    # Keep the lemma with the highest weight; on ties, max() returns the first candidate
    best_lemma = max(weighted_lemma, key=lambda item: item[1])[0]
    lemma_max.append(best_lemma)

In [14]:
pprint(lemma_max[:10])

['omnis',
 'homo',
 'qui',
 'se',
 'studeo',
 'praesto',
 'ceteri',
 'animalis',
 'summus',
 'Opis']
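
Because `max` returns the first item among equally weighted candidates, `ope` resolves to `Opis` only because that candidate happens to come first in its list. Below is a sketch of one possible tie-break. It assumes, as an illustration that is not part of the original workflow, that lowercase lemmas should be preferred over capitalized ones when weights are equal. The function name `pick_lemma` is hypothetical, and the sample lists are copied from the weighted output above.

```python
def pick_lemma(weighted):
    """Return the highest-weighted lemma, preferring lowercase lemmas on ties.

    `weighted` is a list of (lemma, weight) pairs. If candidates tie on both
    weight and case, the earliest one in the list wins.
    """
    return max(weighted, key=lambda item: (item[1], item[0].islower()))[0]


# Copied from the weighted_lemmas output above
ope = [('Opis', 25.0), ('Ops', 25.0), ('ops', 25.0), ('opos', 25.0)]
summa = [('summa', 23.076923076923077), ('summum', 23.076923076923077),
         ('summus', 46.15384615384615), ('summo', 7.6923076923076925)]

print(pick_lemma(ope))    # ops
print(pick_lemma(summa))  # summus
```

The tie-break only matters when the weights are equal. Where one candidate clearly dominates, as with `summus`, the result is the same as with the plain `max` call.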


In [15]:
len(tokens), len(lemma_max)

(151, 126)
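
The token list has 151 items and `lemma_max` has 126. The gap of 25 matches the 25 punctuation marks in the passage, which have no entry in the lemma list. Before aligning the two lists it can help to confirm that the non-punctuation tokens and the lemmas are equal in number. The sketch below shows such a check on a small hand-made sample; the helper names and the sample data are hypothetical.

```python
from string import punctuation


def is_punctuation(token):
    """True if the token is non-empty and made up only of punctuation characters."""
    return bool(token) and all(char in punctuation for char in token)


def count_words(tokens):
    """Count the tokens that are expected to receive a lemma."""
    return sum(1 for token in tokens if not is_punctuation(token))


# Hypothetical miniature of the data above: 7 tokens, 2 of them punctuation
sample_tokens = ['Omnis', 'homines', ',', 'qui', 'sese', 'student', '.']
sample_lemmas = ['omnis', 'homo', 'qui', 'se', 'studeo']

print(count_words(sample_tokens), len(sample_lemmas))  # 5 5
```

If the two counts differ, a position-based alignment would silently pair tokens with the wrong lemmas from the first mismatch onward, so a check like this is a cheap safeguard.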

In [16]:
# Align tokens and lemmas: lemma_max has no entries for punctuation,
# so punctuation tokens are paired with themselves and skipped in the lemma list

from string import punctuation

pos = 0
for token in tokens:
    if token in punctuation:
        print((token, token))
    else:
        print((token, lemma_max[pos]))
        pos += 1

('Omnis', 'omnis')
('homines', 'homo')
(',', ',')
('qui', 'qui')
('sese', 'se')
('student', 'studeo')
('praestare', 'praesto')
('ceteris', 'ceteri')
('animalibus', 'animalis')
(',', ',')
('summa', 'summus')
('ope', 'Opis')
('niti', 'nitor')
('decet', 'decet')
(',', ',')
('ne', 'ne')
('vitam', 'uita')
('silentio', 'silentium')
('transeant', 'transeo')
('veluti', 'ueluti')
('pecora', 'pecus')
(',', ',')
('quae', 'qui')
('natura', 'nascor')
('prona', 'pronus')
('atque', 'atque')
('ventri', 'venter')
('oboedientia', 'oboedio')
('finxit', 'fingo')
('.', '.')
('Sed', 'sed')
('nostra', 'noster')
('omnis', 'omnis')
('vis', 'uia')
('in', 'in')
('animo', 'animus')
('et', 'et')
('corpore', 'corpus')
('sita', 'sino')
('est', 'edo')
(':', ':')
('animi', 'animus')
('imperio', 'imperium')
(',', ',')
('corporis', 'corpus')
('servitio', 'servitium')
('magis', 'magus')
('utimur', 'utor')
(';', ';')
('alterum', 'alter')
('nobis', 'nos')
('cum', 'cum')
('dis', 'dis')
(',', ',')
('alterum', 'alter')
('cum'