# Session 2: Getting Familiar with the CLTK, Lemmatization, and Tokenization

In this session, we will explore the basic functionality and syntax of the Classical Language Toolkit (CLTK). The CLTK extends the Natural Language Toolkit (NLTK), a widely used Python library for Natural Language Processing (NLP), and adapts it to the specific needs of ancient languages. The package has many facets, so we will familiarize ourselves with only a few of its most common functions: essentially the bread and butter of computational text analysis with the CLTK.

We'll cover the following:
1. Importing Greek & Latin Texts
2. Tokenization
3. Lemmatization
4. Stopwords & Stoplists

So let's get started!

In [3]:
# We need to import the CLTK package before anything else
import cltk

# Specifically from CLTK, we need the Corpus Importer which will allow us to download texts
from cltk.corpus.utils.importer import CorpusImporter

# The way the Corpus Importer has been created, we must create a python *object* that will
# serve as the downloader, as it were, of our texts
latinDL = CorpusImporter('latin')

# The downloader object has an *attribute* that lets us see which corpora are available
latinDL.list_corpora

['latin_text_perseus',
 'latin_treebank_perseus',
 'latin_text_latin_library',
 'phi5',
 'phi7',
 'latin_proper_names_cltk',
 'latin_models_cltk',
 'latin_pos_lemmata_cltk',
 'latin_treebank_index_thomisticus',
 'latin_lexica_perseus',
 'latin_training_set_sentence_cltk',
 'latin_word2vec_cltk',
 'latin_text_antique_digiliblt',
 'latin_text_corpus_grammaticorum_latinorum',
 'latin_text_poeti_ditalia']

In [4]:
# Now that we know what corpora are available, we can select one to download
# For this session, we need 'latin_text_latin_library' & 'latin_models_cltk'

# This line will automatically download the entirety of the Latin Library to your
# computer in the form of text files that are easy for the machine to read.
latinDL.import_corpus('latin_text_latin_library')

# This line downloads a helper corpus of linguistic models which allow us to 
# lemmatize the text and do a whole bunch of other neat things.
latinDL.import_corpus('latin_models_cltk')

## How to Read-In Texts

In this module, we will read in the text files for Vergil's *Aeneid*.

In [5]:
# Now that we have our texts and models downloaded, we need to import them.
# We need another package called 'os' that allows Python to make use of the operating system.
import os

# You will have to navigate to the location on your machine where the texts are downloaded
# to find the filenames. All of the data that CLTK uses on your machine will be in a folder
# entitled 'cltk_data' in your home directory. This is the same directory that holds folders
# such as 'Documents' and 'Desktop'. We need to summon this textual data.

# Some of the texts are split-up by book, like Vergil's Aeneid, so we need a good way of
# gathering all that information into the system. Below, I have created a list of the filenames
# without the '.txt' appended.
filenames = ['aen1','aen2','aen3','aen4','aen5','aen6','aen7','aen8','aen9','aen10','aen11','aen12']

# Let's create a list called 'aeneid' that will hold all the text. It will have 12 entries, one
# for each book/file. We can leave it blank for now because we're about to read in all the files.
aeneid=[]

# Now we will use a for loop to iterate over all the filenames we just created.
# This basically says: for each file in filenames, open the file and read its contents into
# the next place in the list 'aeneid'. os.path.expanduser('~') resolves to your home
# directory, so the path does not need to be edited for each machine.
vergil_dir = os.path.expanduser('~/cltk_data/latin/text/latin_text_latin_library/vergil')
for filename in filenames:
    with open(os.path.join(vergil_dir, f'{filename}.txt')) as f:
        aeneid.append(f.read())
        
# Let's make sure we did that right. The line below should print out the first book of the
# Aeneid. Remember that counting in python starts at 0.
print(aeneid[0])

Vergil: Aeneid I
		 

		 
		
		 
		
		 
		 
	 
	
 


 P. VERGILI MARONIS AENEIDOS LIBER PRIMVS 


 

Arma virumque cano, Troiae qui primus ab oris 
Italiam, fato profugus, Laviniaque venit 
litora, multum ille et terris iactatus et alto 
vi superum saevae memorem Iunonis ob iram; 
multa quoque et bello passus, dum conderet urbem,    5 
inferretque deos Latio, genus unde Latinum, 
Albanique patres, atque altae moenia Romae.
 

 
Musa, mihi causas memora, quo numine laeso, 
quidve dolens, regina deum tot volvere casus 
insignem pietate virum, tot adire labores    10 
impulerit. Tantaene animis caelestibus irae?
 

 
Urbs antiqua fuit, Tyrii tenuere coloni, 
Karthago, Italiam contra Tiberinaque longe 
ostia, dives opum studiisque asperrima belli; 
quam Iuno fertur terris magis omnibus unam    15 
posthabita coluisse Samo; hic illius arma, 
hic currus fuit; hoc regnum dea gentibus esse, 
si qua fata sinant, iam tum tenditque fovetque. 
Progeniem sed enim Troiano a sanguine duci 
audierat, 

In [6]:
# Before we move any further, we need to account for small inconsistencies in
# transcription, such as j/i and u/v, especially since the Latin Library is fond of j's
# and uses both u's and v's. We also need to make everything lowercase.

# We need to import a special package from CLTK to take care of this.
from cltk.stem.latin.j_v import JVReplacer
replacer = JVReplacer()
for i in range(len(aeneid)):
    # Lowercase each book first, then normalize j -> i and v -> u
    aeneid[i] = replacer.replace(aeneid[i].lower())
print(aeneid[0])

uergil: aeneid i
		 

		 
		
		 
		
		 
		 
	 
	
 


 p. uergili maronis aeneidos liber primus 


 

arma uirumque cano, troiae qui primus ab oris 
italiam, fato profugus, lauiniaque uenit 
litora, multum ille et terris iactatus et alto 
ui superum saeuae memorem iunonis ob iram; 
multa quoque et bello passus, dum conderet urbem,    5 
inferretque deos latio, genus unde latinum, 
albanique patres, atque altae moenia romae.
 

 
musa, mihi causas memora, quo numine laeso, 
quidue dolens, regina deum tot uoluere casus 
insignem pietate uirum, tot adire labores    10 
impulerit. tantaene animis caelestibus irae?
 

 
urbs antiqua fuit, tyrii tenuere coloni, 
karthago, italiam contra tiberinaque longe 
ostia, diues opum studiisque asperrima belli; 
quam iuno fertur terris magis omnibus unam    15 
posthabita coluisse samo; hic illius arma, 
hic currus fuit; hoc regnum dea gentibus esse, 
si qua fata sinant, iam tum tenditque fouetque. 
progeniem sed enim troiano a sanguine duci 
audierat, 
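
To make the normalization step concrete, here is a minimal sketch of the same idea in plain Python. It is not the CLTK implementation; `normalize_jv` is a hypothetical stand-in that lowercases the text and maps j to i and v to u.

```python
# Translation table: j -> i and v -> u (lowercase only, because we lowercase first).
JV_TABLE = str.maketrans({'j': 'i', 'v': 'u'})

def normalize_jv(text):
    """Lowercase `text` and collapse the j/i and v/u spellings."""
    return text.lower().translate(JV_TABLE)

print(normalize_jv('Arma virumque cano'))            # arma uirumque cano
print(normalize_jv('Iuno'), normalize_jv('Juno'))    # iuno iuno
```

After this step, two editions that spell the same word differently produce identical strings, which matters as soon as we start counting words.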

## How to Tokenize Texts

Having successfully read in the text, we now need to tokenize it. Tokenizing a text means splitting continuous text into chunks, usually either sentences or individual words. For the purposes of this exercise, we will be chunking the text into word-tokens.
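
As a rough illustration of the difference between word and sentence tokens, here is a naive sketch using only regular expressions. The helper names are hypothetical, and this is not how the CLTK tokenizer works: among other things, the CLTK also splits off Latin enclitics such as *-que*.

```python
import re

def naive_word_tokens(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def naive_sentences(text):
    """Split text at whitespace that follows ., ? or !."""
    return [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]

line = "impulerit. tantaene animis caelestibus irae?"
print(naive_word_tokens(line))
# ['impulerit', '.', 'tantaene', 'animis', 'caelestibus', 'irae', '?']
print(naive_sentences(line))
# ['impulerit.', 'tantaene animis caelestibus irae?']
```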

In [7]:
# To tokenize a text is to split up all the words into individual entities. Luckily, the
# CLTK has a built-in function for doing this. First, we need to import the tokenizer package:
from cltk.tokenize.word import WordTokenizer

# Then, just like the Corpus Importer, we need to create an object called 'tokenizer' that
# will do the job. We also need to tell the tokenizer what language it will be tokenizing.
tokenizer = WordTokenizer('latin')

# Let's create a list to hold all these tokens
aeneid_tokens = []

# A for-loop to actually implement the tokenizer
for i in range(len(aeneid)):
    aeneid_tokens.append(tokenizer.tokenize(aeneid[i]))

# Let's see if it worked. You should see the first 50 tokens of Book 1 of the Aeneid.
# There will be numbers and punctuation intermingled because we have not yet removed them.
# In the line below, there are two sets of brackets because we have created a 2-dimensional
# array: a list of lists. The first set of brackets indicates the book and the second indicates
# which token or word we are calling. In this case, the line calls the first 50 tokens.
print(aeneid_tokens[0][:50])

['uergil', ':', 'aeneid', 'i', 'p.', 'uergili', 'maronis', 'aeneidos', 'liber', 'primus', 'arma', 'uirum', '-que', 'cano', ',', 'troiae', 'qui', 'primus', 'ab', 'oris', 'italiam', ',', 'fato', 'profugus', ',', 'lauinia', '-que', 'uenit', 'litora', ',', 'multum', 'ille', 'et', 'terris', 'iactatus', 'et', 'alto', 'ui', 'superum', 'saeuae', 'memorem', 'iunonis', 'ob', 'iram', ';', 'multa', 'quoque', 'et', 'bello', 'passus']


In [8]:
# Let's clean out those line numbers and punctuation with a for-loop and a list comprehension
# that keeps only the tokens containing at least one letter:
for i in range(len(aeneid_tokens)):
    aeneid_tokens[i] = [token for token in aeneid_tokens[i] if any(c.isalpha() for c in token)]

print(aeneid_tokens[0][:50])

['uergil', 'aeneid', 'i', 'p.', 'uergili', 'maronis', 'aeneidos', 'liber', 'primus', 'arma', 'uirum', '-que', 'cano', 'troiae', 'qui', 'primus', 'ab', 'oris', 'italiam', 'fato', 'profugus', 'lauinia', '-que', 'uenit', 'litora', 'multum', 'ille', 'et', 'terris', 'iactatus', 'et', 'alto', 'ui', 'superum', 'saeuae', 'memorem', 'iunonis', 'ob', 'iram', 'multa', 'quoque', 'et', 'bello', 'passus', 'dum', 'conderet', 'urbem', 'inferret', '-que', 'deos']
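
The filter above keeps any token that contains at least one letter, which is why `'p.'` and `'-que'` survive while bare punctuation and line numbers do not. A small sketch on a hand-made sample (the `sample` list and `has_letter` helper are illustrative only):

```python
def has_letter(token):
    """True if the token contains at least one alphabetic character."""
    return any(c.isalpha() for c in token)

sample = ['uirum', '-que', ',', '5', 'p.', ';', '10']
print([t for t in sample if has_letter(t)])   # ['uirum', '-que', 'p.']
```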


In [9]:
# We can wrap this up a little bit neater into a single function though:
import string
import re
from collections import Counter
from pprint import pprint

def preprocess(text):    

    text = text.lower()
    text = replacer.replace(text) #Normalize u/v & i/j
    
    punctuation = "\"#$%&'()*+,-/:;<=>@[\\]^_`{|}~.?!«»"  # backslash escaped explicitly
    translator = str.maketrans({key: " " for key in punctuation})
    text = text.translate(translator)
    
    translator = str.maketrans({key: " " for key in '0123456789'})
    text = text.translate(translator)

    remove_list = [r'\bthe latin library\b',
                   r'\bthe classics page\b',
                   r'\bneo latin\b',  # the hyphen has already been replaced by a space above
                   r'\bmedieval latin\b',
                   r'\bchristian latin\b',
                   r'\bchristina latin\b',
                   r'\bthe miscellany\b',
                  ]

    for pattern in remove_list:
        text = re.sub(pattern, '', text)
    
    text = re.sub(r'[ ]+', ' ', text) # Remove double spaces
    text = re.sub(r'\s+\n+\s+', '\n', text) # Remove double lines and trim spaces around new lines
    
    return text
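
To see what the punctuation and digit steps do to a single line, here is a reduced, self-contained sketch. `strip_marks` is a hypothetical helper that mirrors only two steps of the function above (it skips the j/v normalization and the boilerplate removal) and uses a shorter punctuation set.

```python
import re

def strip_marks(text, punctuation=',.;:?!'):
    """Replace punctuation and digits with spaces, then collapse the spaces."""
    # Map every punctuation mark and digit to a space...
    table = str.maketrans({ch: ' ' for ch in punctuation + '0123456789'})
    text = text.translate(table)
    # ...then collapse the runs of spaces left behind.
    return re.sub(r' +', ' ', text).strip()

print(strip_marks('multa quoque et bello passus, dum conderet urbem,    5'))
# multa quoque et bello passus dum conderet urbem
```

Replacing the marks with spaces rather than deleting them avoids gluing two words together when no space followed the punctuation.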

## How to Lemmatize Texts

Now that we have the text split into word-tokens, we can lemmatize it. Lemmatizing means reducing each inflected form to its lemma, the headword under which it appears in a dictionary; for example, `urbem` becomes `urbs`. Lemmatization is not appropriate for every analysis, but it is important for many methods of computational text analysis.
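
Conceptually, the simplest lemmatizer is a lookup table from inflected forms to headwords. The sketch below uses a tiny, hand-made table purely for illustration; `LEMMA_TABLE` and `toy_lemmatize` are hypothetical names, and the CLTK relies on far larger models.

```python
# Hypothetical, hand-made lookup table for illustration only.
LEMMA_TABLE = {
    'uirum': 'uir',
    'uenit': 'uenio',
    'terris': 'terra',
    'urbem': 'urbs',
}

def toy_lemmatize(tokens, table=LEMMA_TABLE):
    """Replace each known form with its lemma; leave unknown forms unchanged."""
    return [table.get(token, token) for token in tokens]

print(toy_lemmatize(['uirum', 'uenit', 'ad', 'urbem']))
# ['uir', 'uenio', 'ad', 'urbs']
```

A pure lookup cannot decide between homographs, so any lemmatizer's output is worth spot-checking against the text.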

In [10]:
# First we need to import the Lemmatizer package & create a Lemmatizer object.
# The syntax here should be starting to look familiar.
from cltk.stem.lemma import LemmaReplacer
lemmatizer = LemmaReplacer('latin')

aeneid_lemmata = []
for i in range(len(aeneid)):
    aeneid_lemmata.append(lemmatizer.lemmatize(aeneid_tokens[i]))
print(aeneid_lemmata[0][:50])

['uergil', 'aeneid', 'eo1', 'p.', 'uergili', 'maronis', 'aeneidos', 'libo1', 'primus', 'armo', 'vir', '-que', 'cano', 'troiae', 'qui1', 'primus', 'ab', 'aurum', 'italiam', 'for', 'profugus', 'lauinia', '-que', 'venio', 'litus3', 'multus', 'ille', 'et', 'terra', 'jacto', 'et', 'alo', 'vis', 'superus', 'saevus', 'memoro', 'iunonis', 'ob', 'ira', 'multo2', 'quoque', 'et', 'bellus', 'patior', 'dum', 'condo', 'urbs', 'infero', '-que', 'deus']


In [20]:
# This is immediately handy for crafting vocabulary frequency lists
from nltk.probability import FreqDist
fdist = FreqDist(word for word in aeneid_lemmata[0])
fdist.most_common(50)

[('-que', 253),
 ('et', 166),
 ('qui1', 90),
 ('hic', 60),
 ('in', 49),
 ('tu', 40),
 ('atque', 36),
 ('do', 32),
 ('venio', 29),
 ('per', 29),
 ('ille', 28),
 ('aeneus', 28),
 ('sum1', 27),
 ('fero', 27),
 ('cum', 27),
 ('ab', 25),
 ('aurum', 24),
 ('neque', 24),
 ('aut', 24),
 ('ego', 23),
 ('jam', 23),
 ('magnus', 23),
 ('for', 22),
 ('alo', 21),
 ('ad', 21),
 ('sui', 21),
 ('urbs', 20),
 ('terra', 19),
 ('primus', 18),
 ('edo1', 18),
 ('tum', 18),
 ('talis', 17),
 ('dico2', 17),
 ('non', 17),
 ('quis1', 16),
 ('omne', 16),
 ('si', 16),
 ('sic', 16),
 ('ipse', 16),
 ('regina', 15),
 ('-ne', 15),
 ('gens', 15),
 ('fortis', 15),
 ('video', 15),
 ('bellus', 14),
 ('teneo', 14),
 ('longus', 14),
 ('pectus', 14),
 ('medius', 14),
 ('nunc', 14)]
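
The same kind of frequency list can be sketched with the standard library's `collections.Counter`, which also offers a `most_common` method. The `lemmata` list below is a made-up sample, not real output.

```python
from collections import Counter

# Made-up sample of lemmata for illustration.
lemmata = ['-que', 'et', 'arma', '-que', 'et', '-que', 'uir']

counts = Counter(lemmata)
print(counts.most_common(2))   # [('-que', 3), ('et', 2)]

# Relative frequency of one lemma within the sample
total = sum(counts.values())
print(round(counts['-que'] / total, 3))   # 0.429
```

Relative frequencies are usually more useful than raw counts when comparing books of different lengths.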

## Stopwords & Crafting a Stoplist 

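We have everything tokenized and looking really nice inside of our data structures. Now we might want to consider weeding out some of the words. Stopwords are very common words, such as *et* or *in*, that carry little content on their own and can swamp frequency counts; a stoplist is simply the list of words we choose to filter out.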
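
Once we have a stoplist, filtering is a one-line list comprehension. The sketch below uses a hypothetical mini stoplist and token sample; storing the stopwords in a `set` keeps each membership test fast.

```python
# Hypothetical mini stoplist for illustration only.
stops = {'et', 'in', 'ab', 'qui', '-que'}

tokens = ['arma', 'uirum', '-que', 'cano', 'troiae', 'qui', 'primus', 'ab', 'oris']
content_tokens = [t for t in tokens if t not in stops]
print(content_tokens)
# ['arma', 'uirum', 'cano', 'troiae', 'primus', 'oris']
```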

In [16]:
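# Import the stoplist builder and create a stoplist-maker object
from cltk.stop.latin import CorpusStoplist

stop_list_maker = CorpusStoplist()

# Build a stoplist from our own corpus: 'aeneid' is the list of 12 normalized books,
# and 'size' sets how many words the stoplist should contain.
aeneid_stop_list = stop_list_maker.build_stoplist(aeneid, size=50)
print(aeneid_stop_list)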

['ab', 'ac', 'ad', 'aeneas', 'ante', 'arma', 'armis', 'at', 'atque', 'aut', 'cum', 'de', 'est', 'et', 'ex', 'haec', 'haud', 'hic', 'hinc', 'hoc', 'iam', 'ille', 'in', 'inter', 'ipse', 'me', 'mihi', 'nec', 'non', 'nunc', 'omnis', 'pater', 'per', 'qua', 'quae', 'quam', 'quem', 'qui', 'quid', 'quo', 'se', 'sed', 'si', 'sic', 'sub', 'te', 'tibi', 'tum', 'ubi', 'ut']
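
One simple way to think about a corpus-derived stoplist is "the N most frequent words across all documents." The sketch below implements only that raw-frequency idea; it is an assumption for illustration, and the CLTK's `CorpusStoplist` may score words differently. `frequency_stoplist` and `docs` are hypothetical.

```python
from collections import Counter

def frequency_stoplist(documents, size=3):
    """Return the `size` most frequent whitespace-separated words, sorted."""
    counts = Counter(word for doc in documents for word in doc.split())
    return sorted(word for word, _ in counts.most_common(size))

# Made-up mini corpus for illustration.
docs = ['et arma et uir', 'in urbem et in siluas', 'arma in urbe']
print(frequency_stoplist(docs, size=2))   # ['et', 'in']
```

Note that a stoplist built this way reflects the corpus: in the real output above, *aeneas* and *arma* appear because they are frequent in the *Aeneid*, not because they are function words.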


We can also just use a pre-built stoplist from the Perseus Project.

In [22]:
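# A corpus reader gives us access to the downloaded texts by file id
from cltk.corpus.readers import get_corpus_reader
ll = get_corpus_reader(language='latin', corpus_name='latin_text_latin_library')
files = ll.fileids()
# Keep only the files whose id mentions Vergil
vergil_files = [file for file in files if "vergil" in file]
# Concatenate the first 12 Vergil files into one string. This assumes the Aeneid
# books are listed before Vergil's other works in the file ids.
aeneid_whole = ""
for i in range(len(aeneid)):
    aeneid_whole += ll.raw(vergil_files[i])
#print(aeneid_whole)
# Clean the whole text with the preprocess function defined earlier
aeneid_clean = preprocess(aeneid_whole)
#print(aeneid_clean)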

In [25]:
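from cltk.stop.latin import STOPS_LIST
# Note: this overwrites the earlier list of lists with one flat list of tokens
aeneid_tokens = tokenizer.tokenize(aeneid_clean)
print(aeneid_tokens)
# Despite its name, this holds the tokens that remain after the stopwords are removed
aeneid_stop_list_2 = [token for token in aeneid_tokens if token not in STOPS_LIST]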

['uergil', 'aeneid', 'i', 'p', 'uergili', 'maronis', 'aeneidos', 'liber', 'primus', 'arma', 'uirum', '-que', 'cano', 'troiae', 'qui', 'primus', 'ab', 'oris', 'italiam', 'fato', 'profugus', 'lauinia', '-que', 'uenit', 'litora', 'multum', 'ille', 'et', 'terris', 'iactatus', 'et', 'alto', 'ui', 'superum', 'saeuae', 'memorem', 'iunonis', 'ob', 'iram', 'multa', 'quoque', 'et', 'bello', 'passus', 'dum', 'conderet', 'urbem', 'inferret', '-que', 'deos', 'latio', 'genus', 'unde', 'latinum', 'albani', '-que', 'patres', 'atque', 'altae', 'moenia', 'romae', 'musa', 'mihi', 'causas', 'memora', 'quo', 'numine', 'laeso', 'quid', '-ue', 'dolens', 'regina', 'deum', 'tot', 'uoluere', 'casus', 'insignem', 'pietate', 'uirum', 'tot', 'adire', 'labores', 'impulerit', 'tantaene', 'animis', 'caelestibus', 'irae', 'urbs', 'antiqua', 'fuit', 'tyrii', 'tenuere', 'coloni', 'karthago', 'italiam', 'contra', 'tiberina', '-que', 'longe', 'ostia', 'diues', 'opum', 'studiis', '-que', 'asperrima', 'belli', 'quam', 'iuno