The earlier examples of tokenization and lexical counting share an obvious shortcoming: they treat every inflected form as a separate word, so forms of the same lexeme are never counted together. For example, we may want to count "est" and "sunt" as instances of "esse".

Lemmatization is the non-trivial process of mapping inflected forms to their dictionary headword. The CLTK offers several methods. Here we show one of the less sophisticated approaches. (Documentation for a new statistical method is in the works.)

Note: You may have heard of stemming, which is similar in purpose. However, it does not convert a word to its dictionary form; it only reduces commonly related forms to a new, shared string that need not be a real word (e.g., 'amicitia' --> 'amiciti'). This is not what we need for Greek and Latin.
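
To make the contrast concrete, here is a toy sketch. It is not how the CLTK stems or lemmatizes: the suffix list, the lookup table, and the names `toy_stem` and `toy_lemmatize` are all invented for illustration.

```python
# Toy illustration only: neither function is the CLTK implementation.

# Hypothetical suffixes, listed longest first so the longest match wins.
TOY_SUFFIXES = ("arum", "ae", "am", "as", "is", "a")

# Hypothetical lookup table from inflected form to dictionary headword.
TOY_LEMMA_TABLE = {
    "amicitiam": "amicitia",
    "laudabant": "laudo",
    "laudari": "laudo",
}


def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    for suffix in TOY_SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the form up in the table; return it unchanged if it is missing."""
    return TOY_LEMMA_TABLE.get(word, word)


for form in ["amicitia", "amicitiam", "laudabant", "laudari"]:
    print(form, toy_stem(form), toy_lemmatize(form))
```

The stemmer cuts 'amicitia' and 'amicitiam' down to the same non-word, and with this suffix list it leaves 'laudabant' and 'laudari' apart. The lookup returns a real headword, but only for forms that are in its table.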

In [1]:
cato_agri_praef = "Est interdum praestare mercaturis rem quaerere, nisi tam periculosum sit, et item foenerari, si tam honestum. Maiores nostri sic habuerunt et ita in legibus posiverunt: furem dupli condemnari, foeneratorem quadrupli. Quanto peiorem civem existimarint foeneratorem quam furem, hinc licet existimare. Et virum bonum quom laudabant, ita laudabant: bonum agricolam bonumque colonum; amplissime laudari existimabatur qui ita laudabatur. Mercatorem autem strenuum studiosumque rei quaerendae existimo, verum, ut supra dixi, periculosum et calamitosum. At ex agricolis et viri fortissimi et milites strenuissimi gignuntur, maximeque pius quaestus stabilissimusque consequitur minimeque invidiosus, minimeque male cogitantes sunt qui in eo studio occupati sunt. Nunc, ut ad rem redeam, quod promisi institutum principium hoc erit."

In [2]:
# First import a repository: the CLTK data models for Latin

from cltk.corpus.utils.importer import CorpusImporter
corpus_importer = CorpusImporter('latin')
corpus_importer.import_corpus('latin_models_cltk')

In [5]:
# Replace j/v with i/u and tokenize

from cltk.stem.latin.j_v import JVReplacer
from cltk.tokenize.word import WordTokenizer

# Lowercase once, before j/v replacement
jv_replacer = JVReplacer()
cato_agri_praef = jv_replacer.replace(cato_agri_praef.lower())

# The text is already lowercase, so tokenize it directly
word_tokenizer = WordTokenizer('latin')
cato_word_tokens = word_tokenizer.tokenize(cato_agri_praef)

# Drop punctuation tokens so they are not counted as words
cato_word_tokens = [token for token in cato_word_tokens if token not in ['.', ',', ':', ';']]

In [6]:
from cltk.stem.lemma import LemmaReplacer

lemmatizer = LemmaReplacer('latin')
lemmata = lemmatizer.lemmatize(cato_word_tokens)
print(lemmata)

['edo1', 'interdum', 'praesto2', 'mercor', 'res', 'quaero', 'nitor1', 'tam', 'periculosus', 'sum1', 'et', 'ito', 'foenerari', 'si', 'tam', 'honestus', 'magnus', 'noster', 'sic', 'habeo', 'et', 'ito', 'in', 'lex', 'posiuerunt', 'fur', 'duplum', 'condemno', 'foeneratorem', 'quadruplus', 'quantus', 'malus', 'civis', 'existimo', 'foeneratorem', 'qui1', 'fur', 'hinc', 'liceo1', 'existimo', 'et', 'vir', 'bonus', 'cum', 'laudo', 'ito', 'laudo', 'bonus', 'agricola1', 'bonus', '-que', 'colonus', 'amplus', 'laudo', 'existimo', 'qui1', 'ito', 'laudo', 'mercator', 'autem', 'strenuus', 'studiosus', '-que', 'redeo', 'quaero', 'existimo', 'verus', 'ut', 'supra', 'dico2', 'periculosus', 'et', 'calamitosus', 'at', 'ex', 'agricola1', 'et', 'vir', 'fortis', 'et', 'milito', 'strenuus', 'gigno', 'magnus', '-que', 'pius', 'quaestus', 'stabilissimus', '-que', 'consequor', 'minimus', '-que', 'invidiosus', 'minimus', '-que', 'malus', 'cogito', 'sum1', 'qui1', 'in', 'eo1', 'studium', 'occupo', 'sum1', 'nunc', '

In [7]:
# Now we do the same but also return the original form
# This is useful for checking accuracy

lemmata_orig = lemmatizer.lemmatize(cato_word_tokens, return_raw=True)
print(lemmata_orig)

['est/edo1', 'interdum/interdum', 'praestare/praesto2', 'mercaturis/mercor', 'rem/res', 'quaerere/quaero', 'nisi/nitor1', 'tam/tam', 'periculosum/periculosus', 'sit/sum1', 'et/et', 'item/ito', 'foenerari/foenerari', 'si/si', 'tam/tam', 'honestum/honestus', 'maiores/magnus', 'nostri/noster', 'sic/sic', 'habuerunt/habeo', 'et/et', 'ita/ito', 'in/in', 'legibus/lex', 'posiuerunt/posiuerunt', 'furem/fur', 'dupli/duplum', 'condemnari/condemno', 'foeneratorem/foeneratorem', 'quadrupli/quadruplus', 'quanto/quantus', 'peiorem/malus', 'ciuem/civis', 'existimarint/existimo', 'foeneratorem/foeneratorem', 'quam/qui1', 'furem/fur', 'hinc/hinc', 'licet/liceo1', 'existimare/existimo', 'et/et', 'uirum/vir', 'bonum/bonus', 'quom/cum', 'laudabant/laudo', 'ita/ito', 'laudabant/laudo', 'bonum/bonus', 'agricolam/agricola1', 'bonum/bonus', '-que/-que', 'colonum/colonus', 'amplissime/amplus', 'laudari/laudo', 'existimabatur/existimo', 'qui/qui1', 'ita/ito', 'laudabatur/laudo', 'mercatorem/mercator', 'autem/au
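
The raw output is easier to review once each pair is split apart. The sketch below assumes that every item has the shape 'form/lemma' with a single slash, which is inferred from the printed output rather than from the CLTK documentation. `split_raw` is a hypothetical helper, and `raw_pairs` is a short sample copied from the output above.

```python
# A short sample copied from the return_raw=True output above.
raw_pairs = [
    'est/edo1', 'interdum/interdum', 'praestare/praesto2', 'nisi/nitor1',
    'foenerari/foenerari', 'posiuerunt/posiuerunt', '-que/-que', 'et/et',
]


def split_raw(pair):
    """Split 'form/lemma' at the last slash (assumed format)."""
    form, _, lemma = pair.rpartition('/')
    return form, lemma


pairs = [split_raw(pair) for pair in raw_pairs]

# Forms returned unchanged: either genuinely uninflected words or lookup misses.
unchanged = [form for form, lemma in pairs if form == lemma]

# Forms that were mapped to a different headword.
changed = [(form, lemma) for form, lemma in pairs if form != lemma]

print(unchanged)
print(changed)
```

An unchanged form is not necessarily an error ('et' is simply uninflected), but 'foenerari' and 'posiuerunt' look like forms the lemmatizer did not recognize. The changed list deserves a look as well: here 'est' is mapped to 'edo1' and 'nisi' to 'nitor1'.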

In [8]:
# Let's count again, this time using the lemmata

# Count all lemma tokens

print(len(lemmata))

115


In [10]:
# Count unique lemmata

print(len(set(lemmata)))

73


In [11]:
# Finally, measure lexical diversity, using lemmata

print(len(set(lemmata)) / len(lemmata))

0.6347826086956522
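
The same lemma list can also give per-lemma frequencies. Here is a minimal sketch using only the standard library; `sample_lemmata` holds the first sixteen items of the output printed earlier, and `lexical_diversity` is a hypothetical helper name, not a CLTK function.

```python
from collections import Counter

# The first sixteen lemmata from the output printed earlier.
sample_lemmata = [
    'edo1', 'interdum', 'praesto2', 'mercor', 'res', 'quaero', 'nitor1', 'tam',
    'periculosus', 'sum1', 'et', 'ito', 'foenerari', 'si', 'tam', 'honestus',
]


def lexical_diversity(tokens):
    """Return unique tokens divided by total tokens (0.0 for an empty list)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Count how often each lemma occurs.
lemma_counts = Counter(sample_lemmata)

print(lemma_counts.most_common(1))
print(lexical_diversity(sample_lemmata))
```

On the full list, `Counter(lemmata).most_common(10)` would show the ten most frequent headwords, and `lexical_diversity(lemmata)` should reproduce the ratio computed above.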
