# Session 2: Getting Familiar with the CLTK, Lemmatization, and Tokenization

In this session, we will explore the basic functionality and syntax of the Classical Language Toolkit (CLTK). The CLTK extends the Natural Language Toolkit (NLTK), a widely used Python library for Natural Language Processing (NLP), and adapts it to the specific needs of ancient languages. The package has many facets, so we will focus on a few of its most common functions: the bread and butter of computational text analysis with the CLTK.

We'll cover the following:
1. Downloading & Importing the CLTK
2. Importing Greek & Latin Texts
3. Tokenization
4. Lemmatization

So let's get started!

In [None]:
# We need to import the CLTK package before anything else
import cltk

# Specifically from CLTK, we need the Corpus Importer which will allow us to download texts
from cltk.corpus.utils.importer import CorpusImporter

# The way the Corpus Importer has been created, we must create a python *object* that will
# serve as the downloader, as it were, of our texts
latinDL = CorpusImporter('latin')

# The downloader object has an *attribute* that allows us to see which corpora are available
latinDL.list_corpora

In [None]:
# Now that we know what corpora are available, we can select one to download
# For this session, we need 'latin_text_latin_library' & 'latin_models_cltk'

# This line will automatically download the entirety of the Latin Library to your
# computer in the form of text files that are easy for the machine to read.
latinDL.import_corpus('latin_text_latin_library')

# This line downloads a helper corpus of linguistic models which allow us to 
# lemmatize the text and do a whole bunch of other neat things.
latinDL.import_corpus('latin_models_cltk')

## How to Read In Texts

In this module, we will read in the text files for Vergil's *Aeneid*.
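
Before working with the real corpus, it may help to see the reading-in pattern in isolation. The block below is a minimal sketch, not part of the CLTK: the helper `read_books` and the two throwaway demo files are hypothetical and exist only for illustration. It shows how a list of filenames without the extension can be turned into a list of texts, one entry per file.

```python
import tempfile
from pathlib import Path

def read_books(directory, stems):
    """Read '<stem>.txt' for each stem, in order, and return the texts as a list."""
    texts = []
    for stem in stems:
        path = Path(directory) / f"{stem}.txt"
        texts.append(path.read_text(encoding="utf-8"))
    return texts

# Demonstration on two throwaway files standing in for the real corpus
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "aen1.txt").write_text("arma virumque cano", encoding="utf-8")
    (Path(tmp) / "aen2.txt").write_text("conticuere omnes", encoding="utf-8")
    demo_books = read_books(tmp, ["aen1", "aen2"])

print(len(demo_books))  # number of files read
print(demo_books[0])    # contents of the first file
```

For the real texts, the directory would be the `vergil` folder inside `cltk_data` in your home directory, assuming the corpus was downloaded to its default location.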

In [None]:
# Now that we have our texts and models downloaded, we need to import them.
# We need another package called 'os' that allows Python to make use of the operating system.
import os

# You will have to navigate to the location on your machine where the texts are downloaded
# to find the filenames. All of the data that CLTK uses on your machine will be in a folder
# entitled 'cltk_data' in your home directory. This is the same directory that holds folders
# such as 'Documents' and 'Desktop'. We need to summon this textual data.

# Some of the texts are split up by book, like Vergil's Aeneid, so we need a good way of
# gathering all that information into the system. Below, I have created a list of the filenames
# without the '.txt' extension.
filenames = ['aen1','aen2','aen3','aen4','aen5','aen6','aen7','aen8','aen9','aen10','aen11','aen12']

# Let's create a list called 'aeneid' that will hold all the text. It will have 12 entries, one
# for each book/file. We can leave it blank for now because we're about to read in all the files.
aeneid=[]

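# Now we will use a for loop to iterate over all the filenames we just created.
# This basically says: for each file in filenames, open the file and append its contents to
# the list 'aeneid'. Rather than hard-coding one user's home directory, we use the 'os'
# package to build the path to the 'cltk_data' folder in your own home directory.
text_dir = os.path.join(os.path.expanduser('~'), 'cltk_data', 'latin', 'text',
                        'latin_text_latin_library', 'vergil')
for filename in filenames:
    with open(os.path.join(text_dir, f'{filename}.txt')) as f:
        aeneid.append(f.read())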
        
# Let's make sure we did that right. The line below should print out the first book of the
# Aeneid. Remember that counting in python starts at 0.
print(aeneid[0])

In [None]:
# Before we move any further, we need to account for small inconsistencies in
# spelling, such as j/i and u/v transcriptions, especially since the Latin Library
# is fond of j's and uses both u's and v's. We also need to make everything lowercase.

# We need to import a special package from CLTK to take care of the j/i and u/v replacement.
from cltk.stem.latin.j_v import JVReplacer
j = JVReplacer()

# Lowercase each book, then normalize its spelling
for i in range(12):
    aeneid[i] = j.replace(aeneid[i].lower())
print(aeneid[0])


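The normalization step above can be pictured with plain Python string methods. This is a rough sketch under the assumption that the replacement amounts to mapping j to i and v to u; the helper `normalize_jv` is hypothetical and is not the CLTK's own implementation.

```python
def normalize_jv(text):
    """Lowercase the text, then replace j with i and v with u."""
    return text.lower().replace("j", "i").replace("v", "u")

print(normalize_jv("Arma virumque cano, Troiae qui primus ab oris"))
# → arma uirumque cano, troiae qui primus ab oris

# Two different spellings of the same name collapse into one form
print(normalize_jv("Juno") == normalize_jv("IVNO"))
# → True
```

The point of the exercise is the second result: once spelling is normalized, variant transcriptions of the same word are counted as the same word.
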
## How to Tokenize Texts

Having successfully read in the text, we now need to tokenize it. Tokenizing a text means splitting continuous text into chunks, usually either sentences or individual words. For the purposes of this exercise, we will be chunking the text into word-tokens.
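
To make the idea concrete, here is a minimal sketch of word tokenization using only Python's built-in `re` module. The function `simple_word_tokenize` is a hypothetical stand-in for illustration; the CLTK's Latin tokenizer is more sophisticated (it may, for instance, treat Latin-specific cases such as enclitics differently), so its output should not be expected to match this one exactly.

```python
import re

def simple_word_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "Arma virumque cano, Troiae qui primus ab oris"
tokens = simple_word_tokenize(sample)
print(tokens)
# → ['Arma', 'virumque', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris']
```

Note that the comma becomes a token of its own, which is why punctuation has to be filtered out in a later step.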

In [None]:
# To tokenize a text is to split up all the words into individual entities. Luckily, the
# CLTK has a built-in function for doing this. First, we need to import the tokenizer package:
from cltk.tokenize.word import WordTokenizer

# Then, just like the Corpus Importer, we need to create an object called 'tokenizer' that
# will do the job. We also need to tell the tokenizer what language it will be tokenizing.
tokenizer = WordTokenizer('latin')

# Let's create a list to hold all these tokens
aeneid_tokens = []

# A for-loop to actually implement the tokenizer
for i in range(12):
    aeneid_tokens.append(tokenizer.tokenize(aeneid[i]))

# Let's see if it worked. You should see the first 50 tokens of Book 1 of the Aeneid.
# There will be numbers and punctuation intermingled because we have not yet removed them.
# In the line below, there are two sets of brackets because we have created a 2-dimensional
# array: a list of lists. The first set of brackets indicates the book and the second indicates
# which tokens we are calling. In this case, the line calls the first 50 tokens.
print(aeneid_tokens[0][:50])

In [None]:
# Let's clean out those line numbers and punctuation with a for-loop and a list comprehension
# that keeps only the tokens containing at least one letter:
for i in range(12):
    aeneid_tokens[i] = [token for token in aeneid_tokens[i] if any(c.isalpha() for c in token)]

print(aeneid_tokens[0][:50])

## How to Lemmatize Texts

Now that we have the text split into word-tokens, we can lemmatize it. Lemmatizing means reducing each inflected form (a declined noun or a conjugated verb, for example) to its lemma, the form under which the word is listed in a dictionary. Lemmatization is not appropriate in every case, but it is important for many methods of computational text analysis.
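
As a toy illustration of the idea, lemmatization can be thought of as looking each token up in a table that maps inflected forms to dictionary forms. The table `TOY_LEMMATA` and the function `toy_lemmatize` below are hypothetical and hand-written for this example; a real lemmatizer such as the CLTK's draws on the far larger models downloaded earlier.

```python
# Hypothetical, hand-written lookup table for illustration only
TOY_LEMMATA = {
    "arma": "arma",
    "uirum": "uir",
    "cano": "cano",
    "troiae": "troia",
    "oris": "ora",
}

def toy_lemmatize(tokens, table=TOY_LEMMATA):
    """Replace each token with its lemma; keep tokens missing from the table as they are."""
    return [table.get(token, token) for token in tokens]

print(toy_lemmatize(["arma", "uirum", "cano", "troiae", "qui", "oris"]))
# → ['arma', 'uir', 'cano', 'troia', 'qui', 'ora']
```

The token `qui` is not in the toy table, so it passes through unchanged. This fallback is a design choice for the sketch, not a claim about how the CLTK treats unknown words.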

In [None]:
# First we need to import the Lemmatizer package & create a Lemmatizer object.
# The syntax here should be starting to look familiar.
from cltk.stem.lemma import LemmaReplacer
lemmatizer = LemmaReplacer('latin')

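# Let's create a list to hold the lemmata, one list per book
aeneid_lemmata = []

# A for-loop to lemmatize the tokens of each book
for i in range(12):
    aeneid_lemmata.append(lemmatizer.lemmatize(aeneid_tokens[i]))

# Print the first 50 lemmata of Book 1
print(aeneid_lemmata[0][:50])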